# `advertools` v0.12.0 New Features Overview

* A new function `logs_to_df` to parse and compress any log file
* Crawling has a new option to `skip_url_params`, which only follows links that don't contain any URL parameters
* Crawling extracts all `<img>` tag attributes ('alt', 'crossorigin', 'height', 'ismap', 'loading', 'longdesc', 'referrerpolicy', 'sizes', 'src', 'srcset', 'usemap', and 'width')
* The `url_to_df` function extracts a new column `last_dir` showing the last directory of the path part of each URL, which is typically the page title or product name


In [1]:
import advertools as adv
import pandas as pd
pd.options.display.max_columns = None

for p in [adv, pd]:
    print(f'{p.__name__:-<15} v{p.__version__}')

advertools----- v0.12.2
pandas--------- v1.2.1


# `logs_to_df`

This function parses a log file into a DataFrame and saves it in the `parquet` format. As a side effect, the resulting file is also much smaller than the original log.

We start with a regular log file in the "common" format, and display its first ten lines:

In [2]:
!head data/common.log

207.46.13.35 - - [01/Jan/2017:01:30:58 +0000] "GET / HTTP/1.1" 200 5267
74.125.145.96 - - [01/Jan/2017:02:53:11 +0000] "GET /used_to_work HTTP/1.1" 200 4583
64.233.160.5 - - [01/Jan/2017:04:06:38 +0000] "GET /dowload/press.pdf HTTP/1.1" 200 11048346
157.55.112.243 - - [01/Jan/2017:05:23:00 +0000] "GET / HTTP/1.1" 200 5713
74.125.145.96 - - [01/Jan/2017:06:06:20 +0000] "GET /used_to_work.php HTTP/1.1" 404 6158
207.46.13.35 - - [01/Jan/2017:07:33:51 +0000] "GET /inconsistent.html HTTP/1.1" 100 1297
207.46.13.35 - - [01/Jan/2017:08:54:53 +0000] "GET / HTTP/1.1" 200 5713
66.102.15.135 - - [01/Jan/2017:10:28:59 +0000] "GET / HTTP/1.1" 200 5973
157.55.108.202 - - [01/Jan/2017:12:01:02 +0000] "GET /dowload/press.pdf HTTP/1.1" 200 11048346
207.46.13.35 - - [01/Jan/2017:12:51:44 +0000] "GET /alwaysredirects.html HTTP/1.1" 301 249


Run the `logs_to_df` function as follows:

In [3]:
adv.logs_to_df(log_file='data/common.log',
               output_file='common.parquet',
               errors_file='common_errors.csv',
               log_format='common')

Parsed             500 lines.

In [4]:
common_df = pd.read_parquet('common.parquet')
common_df

Unnamed: 0,client,userid,datetime,method,request,status,size
0,207.46.13.35,-,01/Jan/2017:01:30:58 +0000,GET,/,200,5267
1,74.125.145.96,-,01/Jan/2017:02:53:11 +0000,GET,/used_to_work,200,4583
2,64.233.160.5,-,01/Jan/2017:04:06:38 +0000,GET,/dowload/press.pdf,200,11048346
3,157.55.112.243,-,01/Jan/2017:05:23:00 +0000,GET,/,200,5713
4,74.125.145.96,-,01/Jan/2017:06:06:20 +0000,GET,/used_to_work.php,404,6158
...,...,...,...,...,...,...,...
495,66.102.15.135,-,30/Jan/2017:13:07:30 +0000,GET,/js/bigfoot.js,200,3425
496,157.55.108.202,-,30/Jan/2017:13:33:57 +0000,GET,/sitemap.xml,200,9562
497,157.55.112.243,-,30/Jan/2017:14:40:32 +0000,GET,/used_to_work,200,4583
498,72.14.192.15,-,30/Jan/2017:15:28:26 +0000,GET,/used_to_work.php,404,4583
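
The `datetime` values above are displayed in the raw log format. As a minimal sketch, assuming the column is stored as strings (as the display suggests), it can be converted to a proper datetime type with `pd.to_datetime`. The `sample` DataFrame below is a hypothetical stand-in built from a few of the rows shown above; in practice the same call would be applied to `common_df`.

```python
import pandas as pd

# A few values copied from the output above; a stand-in for `common_df`.
sample = pd.DataFrame({
    'datetime': ['01/Jan/2017:01:30:58 +0000',
                 '01/Jan/2017:02:53:11 +0000',
                 '30/Jan/2017:15:28:26 +0000'],
    'request': ['/', '/used_to_work', '/used_to_work.php'],
})

# Assumption: the column holds strings in the usual access-log layout,
# day/month-abbreviation/year:hour:minute:second followed by a UTC offset.
sample['datetime'] = pd.to_datetime(sample['datetime'],
                                    format='%d/%b/%Y:%H:%M:%S %z')

# With a real datetime column we can count requests per day, for example.
requests_per_day = sample.groupby(sample['datetime'].dt.date).size()
print(requests_per_day)
```

Once converted, the column supports the `.dt` accessor, resampling, and time-based filtering.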


Another example in the "combined" (or extended) format, containing additional `referer` and `user_agent` fields:

In [5]:
!head data/combined.log

40.77.167.24 - - [27/Nov/2021:00:17:27 +0000] "GET / HTTP/1.1" 200 13270 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
52.36.30.61 - - [27/Nov/2021:00:22:32 +0000] "GET /wp-admin/css/ HTTP/1.1" 200 13270 "binance.com" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.85 Safari/537.36"
207.46.13.91 - - [27/Nov/2021:00:27:33 +0000] "GET /Mali HTTP/1.1" 200 13270 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
192.241.213.134 - - [27/Nov/2021:00:27:57 +0000] "GET /ecp/Current/exporttool/microsoft.exchange.ediscovery.exporttool.application HTTP/1.1" 200 13270 "-" "Mozilla/5.0 zgrab/0.x"
194.48.199.78 - - [27/Nov/2021:00:30:03 +0000] "GET /favicon.ico HTTP/1.1" 200 13270 "-" "curl/7.64.1"
45.146.164.110 - - [27/Nov/2021:00:32:44 +0000] "GET /index.php?s=/Index/\x5Cthink\x5Capp/invokefunction&function=call_user_func_array&vars[0]=md5&vars[1][]=HelloThinkPHP21 HTTP/1.1" 200 13270 "-"

In [6]:
adv.logs_to_df(log_file='data/combined.log', 
               output_file='combined.parquet',
               errors_file='combined_errors.csv',
               log_format='combined')

Parsed           9,479 lines.

In [None]:
combined_df = pd.read_parquet('combined.parquet')
combined_df

Unnamed: 0,client,userid,datetime,method,request,status,size,referer,user_agent
0,40.77.167.24,-,27/Nov/2021:00:17:27 +0000,GET,/,200,13270,-,Mozilla/5.0 (compatible; bingbot/2.0; +http://...
1,52.36.30.61,-,27/Nov/2021:00:22:32 +0000,GET,/wp-admin/css/,200,13270,binance.com,Mozilla/5.0 (Windows NT 10.0; Win64; x64) Appl...
2,207.46.13.91,-,27/Nov/2021:00:27:33 +0000,GET,/Mali,200,13270,-,Mozilla/5.0 (compatible; bingbot/2.0; +http://...
3,192.241.213.134,-,27/Nov/2021:00:27:57 +0000,GET,/ecp/Current/exporttool/microsoft.exchange.edi...,200,13270,-,Mozilla/5.0 zgrab/0.x
4,194.48.199.78,-,27/Nov/2021:00:30:03 +0000,GET,/favicon.ico,200,13270,-,curl/7.64.1
...,...,...,...,...,...,...,...,...,...
9272,66.249.64.251,-,18/Nov/2021:23:40:10 +0000,GET,/Equatorial%20Guinea,200,13270,-,Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Bu...
9273,66.249.64.169,-,18/Nov/2021:23:55:10 +0000,GET,/Italy,200,13270,-,Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Bu...
9274,66.249.64.167,-,18/Nov/2021:23:59:00 +0000,GET,/_dash-component-suites/dash_renderer/dash_ren...,200,59217,http://172.104.245.213/Italy,Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Bu...
9275,66.249.64.163,-,18/Nov/2021:23:59:00 +0000,GET,/_dash-dependencies,200,430,http://172.104.245.213/Italy,Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Bu...


Log files often contain lines that do not conform to the requested format and so cannot be parsed. These lines are saved in a separate `errors_file`, so we can see what they are and decide whether anything needs fixing.

If we open that file as a CSV with "@@" as the separator, we get three columns:

* `lineno`: the line number of the error line (in the original log file)
* `line`: the actual line
* `error_msg`: the error message raised while parsing the line

In [8]:
combined_errors = pd.read_csv('combined_errors.csv', sep='@@', names=['lineno', 'line', 'error_msg'], engine='python')
combined_errors

Unnamed: 0,lineno,line,error_msg
0,7,94.232.43.63 - - [27/Nov/2021:00:36:31 +0000] ...,list index out of range
1,8,94.232.43.63 - - [27/Nov/2021:00:36:31 +0000] ...,list index out of range
2,126,5.188.210.227 - - [27/Nov/2021:06:35:27 +0000]...,list index out of range
3,128,5.188.210.227 - - [27/Nov/2021:06:36:50 +0000]...,list index out of range
4,138,178.128.194.144 - - [27/Nov/2021:07:00:46 +000...,list index out of range
...,...,...,...
197,9475,2021/11/26 05:32:35 [crit] 60954#60954: *19712...,list index out of range
198,9476,2021/11/26 18:46:42 [crit] 60954#60954: *19768...,list index out of range
199,9477,2021/11/26 20:36:56 [crit] 60954#60954: *19775...,list index out of range
200,9478,2021/11/26 21:43:17 [crit] 60954#60954: *19780...,list index out of range


In [71]:
combined_errors['line'].sample(10).values

array(['142.93.34.237 - - [14/Nov/2021:18:35:34 +0000] "" 400 0 "-" "-"',
       '66.240.205.34 - - [23/Nov/2021:12:47:34 +0000] "H\\x00\\x00\\x00tj\\xA8\\x9E#D\\x98+\\xCA\\xF0\\xA7\\xBBl\\xC5\\x19\\xD7\\x8D\\xB6\\x18\\xEDJ\\x1En\\xC1\\xF9xu[l\\xF0E\\x1D-j\\xEC\\xD4xL\\xC9r\\xC9\\x15\\x10u\\xE0%\\x86Rtg\\x05fv\\x86]%\\xCC\\x80\\x0C\\xE8\\xCF\\xAE\\x00\\xB5\\xC0f\\xC8\\x8DD\\xC5\\x09\\xF4" 400 166 "-" "-"',
       '94.232.43.63 - - [24/Nov/2021:08:20:57 +0000] "\\x03\\x00\\x00,\'\\xE0\\x00\\x00\\x00\\x00\\x00Cookie: mstshash=Domain" 400 166 "-" "-"',
       '178.128.194.144 - - [26/Mar/2021:07:53:05 +0000] "238\\x00ll|\'|\'|SGFjS2VkX0Q3NUU2QUFB|\'|\'|WIN-QZN7FJ7D1O|\'|\'|Administrator|\'|\'|18-11-28|\'|\'||\'|\'|Win 7 Ultimate SP1 x64|\'|\'|No|\'|\'|S17|\'|\'|..|\'|\'|SW5ib3ggLSBPdXRsb29rIERhdGEgRmlsZSAtIE1pY3Jvc29mdCBPdXRsb29rAA==|\'|\'|" 400 166 "-" "-"',
       '185.254.31.134 - - [19/Nov/2021:21:43:48 +0000] "\\x16\\x03\\x01\\x01C\\x01\\x00\\x01?\\x03\\x03\\x14O\\xC1\\xD0\\xB6=\\x8D

It looks like most of these lines contain raw bytes (or an empty request) that could not be parsed, so they ended up in the `errors_file`.
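
To illustrate why such lines fail, here is a minimal sketch of splitting log lines into parsed rows and error rows. The `COMBINED` pattern and the `split_parsed_and_errors` helper are hypothetical and simplified for this example; they are not the pattern or code that `logs_to_df` actually uses, and the error message is made up as well.

```python
import re

# Hypothetical, simplified pattern for the "combined" format.
# advertools' own pattern may differ.
COMBINED = re.compile(
    r'^(?P<client>\S+) \S+ (?P<userid>\S+) \[(?P<datetime>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<request>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"$'
)


def split_parsed_and_errors(lines):
    """Return (parsed_rows, error_rows) for an iterable of log lines."""
    parsed, errors = [], []
    for position, line in enumerate(lines):
        match = COMBINED.match(line.rstrip('\n'))
        if match:
            parsed.append(match.groupdict())
        else:
            # Keep the position and the raw line so it can be inspected later.
            errors.append((position, line, 'line does not match the format'))
    return parsed, errors


# One well-formed line and one line with an empty request,
# both taken from the outputs above.
lines = [
    '194.48.199.78 - - [27/Nov/2021:00:30:03 +0000] '
    '"GET /favicon.ico HTTP/1.1" 200 13270 "-" "curl/7.64.1"',
    '142.93.34.237 - - [14/Nov/2021:18:35:34 +0000] "" 400 0 "-" "-"',
]
parsed, errors = split_parsed_and_errors(lines)
print(len(parsed), 'parsed,', len(errors), 'errors')
```

The second line has nothing between the quotes where the method, path, and protocol are expected, so the pattern does not match and the line is set aside.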

In [13]:
%ls -lh data/combined.log combined.parquet 

-rw-r--r--  1 me  staff   142K Nov 27 19:54 combined.parquet
-rw-r--r--  1 me  staff   2.0M Nov 27 14:48 data/combined.log


The new file (in the `parquet` format) is 142K, versus 2.0M for the original log, which is about 7% of the original size.
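
The ratio can be checked with a quick calculation. This sketch uses the rounded sizes printed by `ls -lh` and assumes that K and M are powers of 1024; for exact figures, `os.path.getsize` could be called on the two files instead.

```python
KIB = 1024  # assumption: `ls -lh` reports sizes in powers of 1024

compressed_bytes = 142 * KIB        # combined.parquet, shown as 142K
original_bytes = 2.0 * KIB * KIB    # data/combined.log, shown as 2.0M

ratio = compressed_bytes / original_bytes
print(f'{ratio:.1%}')
```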

# `skip_url_params`

By default, when you set `follow_links=True` the crawler follows all links on the page, provided they are in the `allowed_domains` list and have not already been crawled.

Sometimes URLs with parameters only display variants of a product (typically on ecommerce websites), and you don't want to waste time and bandwidth crawling them. In that case, set `skip_url_params=True`.
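
As a rough illustration of the idea, the sketch below filters out links that carry a query string using the standard library. The `has_url_params` helper and the example URLs are hypothetical; the crawler's own check may be implemented differently.

```python
from urllib.parse import urlsplit


def has_url_params(url):
    """Return True if the URL has a non-empty query string."""
    return bool(urlsplit(url).query)


# Hypothetical links found on a product page.
urls = [
    'https://example.com/shoes/running',
    'https://example.com/shoes/running?color=red&size=42',
]

# Keep only the links without parameters.
to_follow = [url for url in urls if not has_url_params(url)]
print(to_follow)
```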

We crawl twice, once with each option, and compare:

In [15]:
adv.crawl(url_list='https://nytimes.com',
          output_file='nyt_crawl_with_params.jl',
          follow_links=True,
          skip_url_params=False,  # <- default
          custom_settings={'CLOSESPIDER_PAGECOUNT': 300,
                           'LOG_FILE': 'nyt_crawl_with_params.log'})

In [16]:
nyt_params = pd.read_json('nyt_crawl_with_params.jl', lines=True)
nyt_params.head()

Unnamed: 0,url,title,meta_desc,viewport,charset,h2,h3,canonical,alt_href,alt_hreflang,og:url,og:type,og:title,og:description,og:image,twitter:site,jsonld_@context,jsonld_@type,jsonld_image,jsonld_name,jsonld_mainEntity.@context,jsonld_mainEntity.@type,jsonld_mainEntity.itemListElement,jsonld_mainEntity.numberOfItems,jsonld_publisher.@id,jsonld_1_@context,jsonld_1_@type,jsonld_1_name,jsonld_1_url,jsonld_1_@id,jsonld_1_diversityPolicy,jsonld_1_ethicsPolicy,jsonld_1_masthead,jsonld_1_foundingDate,jsonld_1_sameAs,jsonld_1_alternateName,jsonld_1_subOrganization,jsonld_1_logo.@context,jsonld_1_logo.@type,jsonld_1_logo.url,jsonld_1_logo.height,jsonld_1_logo.width,body_text,size,download_timeout,download_slot,download_latency,redirect_times,redirect_ttl,redirect_urls,redirect_reasons,depth,status,links_url,links_text,links_nofollow,nav_links_url,nav_links_text,nav_links_nofollow,header_links_url,header_links_text,header_links_nofollow,footer_links_url,footer_links_text,footer_links_nofollow,img_loading,img_src,img_alt,ip_address,crawl_time,resp_headers_server,resp_headers_content-type,resp_headers_x-nyt-data-last-modified,resp_headers_last-modified,resp_headers_x-pagetype,resp_headers_x-xss-protection,resp_headers_x-content-type-options,resp_headers_cache-control,resp_headers_x-nyt-route,resp_headers_x-origin-time,resp_headers_accept-ranges,resp_headers_date,resp_headers_age,resp_headers_x-served-by,resp_headers_x-cache,resp_headers_x-cache-hits,resp_headers_x-timer,resp_headers_vary,resp_headers_set-cookie,resp_headers_x-nyt-app-webview,resp_headers_x-gdpr,resp_headers_x-frame-options,resp_headers_onion-location,resp_headers_x-api-version,resp_headers_content-security-policy,resp_headers_strict-transport-security,resp_headers_x-nyt-edge-cache,request_headers_accept,request_headers_accept-language,request_headers_user-agent,request_headers_accept-encoding,request_headers_cookie,h1,og:image:alt,jsonld_description,jsonld_mainEntityOfPage,jsonld_url,jsonld_inLanguage,jsonld_au
thor,jsonld_dateModified,jsonld_datePublished,jsonld_headline,jsonld_copyrightYear,jsonld_isAccessibleForFree,jsonld_copyrightHolder.@id,jsonld_sourceOrganization.@id,jsonld_hasPart.@type,jsonld_hasPart.isAccessibleForFree,jsonld_hasPart.cssSelector,jsonld_isPartOf.@type,jsonld_isPartOf.name,jsonld_isPartOf.productID,resp_headers_x-scoop-last-modified,request_headers_referer,h4,jsonld_alternativeHeadline,jsonld_author.@id,img_height,img_width,twitter:app:name:googleplay,twitter:app:id:googleplay,twitter:app:url:googleplay,jsonld_2_@context,jsonld_2_@type,jsonld_2_itemListElement,img_srcset,img_sizes,resp_headers_x-datadome-timer,resp_headers_fastly-restarts,resp_headers_cf-chl-bypass,resp_headers_permissions-policy,resp_headers_expires,resp_headers_expect-ct,resp_headers_report-to,resp_headers_nel,resp_headers_cf-ray,jsonld_author.@context,jsonld_author.@type,jsonld_author.description,jsonld_author.url,jsonld_author.name,jsonld_author.sameAs,jsonld_2_headline,jsonld_2_description,jsonld_2_image,jsonld_2_coverageStartTime,jsonld_2_coverageEndTime,jsonld_2_datePublished,jsonld_2_dateModified,jsonld_2_articleBody,jsonld_2_copyrightYear,jsonld_2_liveBlogUpdate,jsonld_2_url,jsonld_2_mainEntityOfPage,jsonld_2_author.@id,jsonld_2_publisher.@id,jsonld_2_copyrightHolder.@id,jsonld_2_sourceOrganization.@id,jsonld_author.image,jsonld_author.image.@context,jsonld_author.image.@type,jsonld_author.image.url,jsonld_2_@id,jsonld_2_name,jsonld_2_thumbnailUrl,jsonld_2_embedUrl,jsonld_2_uploadDate,jsonld_2_transcript,jsonld_2_duration,jsonld_3_@context,jsonld_3_@type,jsonld_3_@id,jsonld_3_description,jsonld_3_url,jsonld_3_name,jsonld_3_thumbnailUrl,jsonld_3_embedUrl,jsonld_3_uploadDate,jsonld_3_transcript,jsonld_3_duration,jsonld_2_inLanguage,jsonld_2_author,jsonld_2_isAccessibleForFree,jsonld_2_hasPart.@type,jsonld_2_hasPart.isAccessibleForFree,jsonld_2_hasPart.cssSelector,jsonld_2_isPartOf.@type,jsonld_2_isPartOf.name,jsonld_2_isPartOf.productID,jsonld_3_diversityPolicy,jsonld_3_eth
icsPolicy,jsonld_3_masthead,jsonld_3_foundingDate,jsonld_3_sameAs,jsonld_3_logo.@context,jsonld_3_logo.@type,jsonld_3_logo.url,jsonld_3_logo.height,jsonld_3_logo.width,jsonld_4_@context,jsonld_4_@type,jsonld_4_itemListElement,jsonld_5_@context,jsonld_5_@type,jsonld_5_@id,jsonld_5_description,jsonld_5_url,jsonld_5_name,jsonld_5_thumbnailUrl,jsonld_5_embedUrl,jsonld_5_uploadDate,jsonld_5_duration,jsonld_6_@context,jsonld_6_@type,jsonld_6_@id,jsonld_6_description,jsonld_6_url,jsonld_6_name,jsonld_6_thumbnailUrl,jsonld_6_embedUrl,jsonld_6_uploadDate,jsonld_6_transcript,jsonld_6_duration,jsonld_4_@id,jsonld_4_description,jsonld_4_url,jsonld_4_name,jsonld_4_thumbnailUrl,jsonld_4_embedUrl,jsonld_4_uploadDate,jsonld_4_transcript,jsonld_4_duration,jsonld_5_transcript,jsonld_7_@context,jsonld_7_@type,jsonld_7_@id,jsonld_7_description,jsonld_7_url,jsonld_7_name,jsonld_7_thumbnailUrl,jsonld_7_embedUrl,jsonld_7_uploadDate,jsonld_7_transcript,jsonld_7_duration,jsonld_video,jsonld_8_@context,jsonld_8_@type,jsonld_8_@id,jsonld_8_description,jsonld_8_url,jsonld_8_name,jsonld_8_thumbnailUrl,jsonld_8_embedUrl,jsonld_8_uploadDate,jsonld_8_transcript,jsonld_8_duration,jsonld_image.@context,jsonld_image.@type,jsonld_image.url,jsonld_image.height,jsonld_image.width,h5,og:site_name,resp_headers_via,resp_headers_x-amz-cf-pop,resp_headers_x-amz-cf-id,h6,resp_headers_x-powered-by,resp_headers_pragma,twitter:card,jsonld_audio,jsonld_3_contentUrl,jsonld_image.caption,resp_headers_x-datadome,twitter:title,twitter:description,twitter:image,twitter:creator,twitter:url,jsonld_2_alternativeHeadline,resp_headers_x-guploader-uploadid,resp_headers_etag,resp_headers_x-goog-generation,resp_headers_x-goog-metageneration,resp_headers_x-goog-stored-content-encoding,resp_headers_x-goog-stored-content-length,resp_headers_x-goog-hash,resp_headers_x-goog-storage-class,resp_headers_access-control-allow-headers,resp_headers_access-control-allow-methods,resp_headers_access-control-allow-origin,resp_headers_access-
control-expose-headers,resp_headers_x-cloud-trace-context
0,https://www.nytimes.com/,"The New York Times - Breaking News, US News, W...","Live news, investigations, opinion, photos and...","width=device-width, initial-scale=1",utf-8,Tracking the Coronavirus ›@@Site Information N...,"As New Variant Circles the Globe, African Nati...",https://www.nytimes.com,https://www.nytimes.com@@https://www.nytimes.c...,x-default@@en-US@@en@@en-GB@@en-AU@@en-CA@@es@...,https://www.nytimes.com,website,"The New York Times - Breaking News, US News, W...","Live news, investigations, opinion, photos and...",https://static01.nyt.com/newsgraphics/images/i...,@nytimes,http://schema.org,WebPage,"[{'@context': 'http://schema.org', '@type': 'I...",The New York Times,http://schema.org,ItemList,[],0.0,https://www.nytimes.com/#publisher,http://schema.org,NewsMediaOrganization,The New York Times,https://www.nytimes.com/,https://www.nytimes.com/#publisher,https://www.nytco.com/diversity-and-inclusion-...,https://www.nytco.com/who-we-are/culture/stand...,https://www.nytimes.com/interactive/2020/09/08...,1851-09-18,"[https://www.facebook.com/nytimes/, https://tw...","[NYT, new york times, nytimes, ny times]","[{'@type': 'Organization', 'name': 'NYT Cookin...",http://schema.org,ImageObject,https://static01.nyt.com/images/misc/NYT_logo_...,40.0,250.0,Sections SEARCH U.S. 
International Canada Espa...,1518082,180,nytimes.com,2.376272,1.0,19.0,https://nytimes.com,301.0,0,200,https://www.nytimes.com/#after-dfp-ad-top@@htt...,Continue reading the main story@@Skip to conte...,False@@False@@False@@False@@False@@False@@Fals...,https://help.nytimes.com/hc/en-us/articles/115...,© 2021 The New York Times Company@@NYTCo@@Cont...,False@@False@@False@@False@@False@@False@@Fals...,https://www.nytimes.com/#site-content@@https:/...,Skip to content@@Skip to site index@@@@U.S.@@I...,False@@False@@False@@False@@False@@False@@Fals...,https://help.nytimes.com/hc/en-us/articles/115...,© 2021 The New York Times Company@@NYTCo@@Cont...,False@@False@@False@@False@@False@@False@@Fals...,lazy@@@@@@lazy@@lazy@@lazy@@lazy@@@@@@@@@@@@@@...,https://static01.nyt.com/images/2021/11/27/wor...,Travelers checking a flight information board ...,151.101.1.164,2021-11-27 18:02:34,nginx,text/html; charset=utf-8,"Sat, 27 Nov 2021 17:59:45 GMT","Sat, 27 Nov 2021 17:59:45 GMT",vi-homepage,1; mode=block,nosniff,"s-maxage=30,no-cache",homepage,2021-11-27 18:02:34 UTC,bytes,"Sat, 27 Nov 2021 18:02:34 GMT",0.0,"cache-lga13625-LGA, cache-fjr7922-FJR","MISS, MISS","0, 0","S1638036152.076859,VS0,VE1984","Accept-Encoding, Fastly-SSL","nyt-a=wM76KxArTHJUL4YFlSpE_O; Expires=Sun, 27 ...",0.0,0.0,DENY,https://www.nytimesn7cgmftshazwhfgzm37qxb44r64...,F-F-VI,upgrade-insecure-requests; default-src data: '...,max-age=63072000; preload,MISS-MISS,"text/html,application/xhtml+xml,application/xm...",en,advertools/0.12.2,"gzip, deflate, br",nyt-gdpr=0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,https://www.nytimes.com/interactive/2020/us/co...,How Full Are Hospital I.C.U.s Near You? - The ...,See how many Covid-19 patients are being treat...,"width=device-width, initial-scale=1",utf-8,Comments @@\n\tTracking the Coronavirus\n@@Sit...,United States@@Latest Maps and Data@@Vaccinati...,https://www.nytimes.com/interactive/2020/us/co...,,,https://www.nytimes.com/interactive/2020/us/co...,article,How Full Are Hospital I.C.U.s Near You?,See how many Covid-19 patients are being treat...,https://static01.nyt.com/images/2020/12/15/us/...,@nytimes,http://schema.org,NewsArticle,"[{'@context': 'http://schema.org', '@type': 'I...",,,,,,https://www.nytimes.com/#publisher,http://schema.org,NewsMediaOrganization,The New York Times,https://www.nytimes.com/,https://www.nytimes.com/#publisher,https://www.nytco.com/diversity-and-inclusion-...,https://www.nytco.com/who-we-are/culture/stand...,https://www.nytimes.com/interactive/2020/09/08...,1851-09-18,https://en.wikipedia.org/wiki/The_New_York_Times,,,http://schema.org,ImageObject,https://static01.nyt.com/images/misc/NYT_logo_...,40.0,250.0,Sections SEARCH U.S. 
| How Full Are Hospital I...,283694,180,www.nytimes.com,0.393619,,,,,1,200,https://www.nytimes.com/interactive/2020/us/co...,Skip to content@@Skip to site index@@U.S.@@@@L...,False@@False@@False@@False@@False@@False@@Fals...,https://www.nytimes.com/interactive/2021/us/co...,Latest Maps and Data@@Vaccinations by State@@C...,False@@False@@False@@False@@False@@False@@Fals...,https://www.nytimes.com/interactive/2020/us/co...,Skip to content@@Skip to site index@@U.S.@@@@L...,False@@False@@False@@False@@False@@False@@Fals...,https://carlsonschool.umn.edu/mili-misrc-covid...,Covid-19 Hospitalization Tracking Project@@© 2...,False@@False@@False@@False@@False@@False@@Fals...,,,,151.101.1.164,2021-11-27 18:02:35,nginx,text/html; charset=utf-8,"Sat, 27 Nov 2021 13:15:18 GMT","Sat, 27 Nov 2021 13:15:18 GMT",vi-interactive-standard,1; mode=block,nosniff,"s-maxage=300,no-cache",vi-interactive,2021-11-27 13:23:55 UTC,bytes,"Sat, 27 Nov 2021 18:02:35 GMT",16977.0,"cache-lga21947-LGA, cache-fjr7922-FJR","HIT, HIT","1, 1","S1638036155.079572,VS0,VE2","Accept-Encoding, Fastly-SSL","nyt-a=wM76KxArTHJUL4YFlSpE_O; Expires=Sun, 27 ...",0.0,0.0,,https://www.nytimesn7cgmftshazwhfgzm37qxb44r64...,F-F-VI,upgrade-insecure-requests; default-src data: '...,max-age=63072000; preload,HIT-HIT,"text/html,application/xhtml+xml,application/xm...",en,advertools/0.12.2,"gzip, deflate, br",nyt-gdpr=0; nyt-a=wM76KxArTHJUL4YFlSpE_O; nyt-...,How Full Are Hospital I.C.U.s Near You?,,See how many Covid-19 patients are being treat...,https://www.nytimes.com/interactive/2020/us/co...,https://www.nytimes.com/interactive/2020/us/co...,en-US,"[{'@context': 'http://schema.org', '@type': 'P...",2021-11-15T16:37:40.400Z,2020-12-15T16:33:01.000Z,How Full Are Hospital I.C.U.s Near You?,2021.0,0.0,https://www.nytimes.com/#publisher,https://www.nytimes.com/#publisher,WebPageElement,0.0,.meteredContent,"[CreativeWork, Product]",The New York 
Times,nytimes.com:basic,2021-11-15T16:37:40.400Z,https://www.nytimes.com/,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,https://www.nytimes.com/interactive/2021/world...,Covid World Vaccination Tracker - The New York...,More than 4.24 billion people worldwide have r...,"width=device-width, initial-scale=1",utf-8,\n\tTracking the Coronavirus\n@@Site Index@@Si...,The Coronavirus Pandemic@@Vaccinations by coun...,https://www.nytimes.com/interactive/2021/world...,,,https://www.nytimes.com/interactive/2021/world...,article,Tracking Coronavirus Vaccinations Around the W...,More than 4.24 billion people worldwide have r...,https://static01.nyt.com/images/2021/01/28/us/...,@nytimes,http://schema.org,NewsArticle,"[{'@context': 'http://schema.org', '@type': 'I...",,,,,,https://www.nytimes.com/#publisher,http://schema.org,NewsMediaOrganization,The New York Times,https://www.nytimes.com/,https://www.nytimes.com/#publisher,https://www.nytco.com/diversity-and-inclusion-...,https://www.nytco.com/who-we-are/culture/stand...,https://www.nytimes.com/interactive/2020/09/08...,1851-09-18,https://en.wikipedia.org/wiki/The_New_York_Times,,,http://schema.org,ImageObject,https://static01.nyt.com/images/misc/NYT_logo_...,40.0,250.0,Sections SEARCH World | Tracking Coronavirus V...,366265,180,www.nytimes.com,0.449128,,,,,1,200,https://www.nytimes.com/interactive/2021/world...,Skip to content@@Skip to site index@@World@@@@...,False@@False@@False@@False@@False@@False@@Fals...,https://www.nytimes.com/news-event/coronavirus...,The Coronavirus Pandemic@@Covid-19 Updates@@Co...,False@@False@@False@@False@@False@@False@@Fals...,https://www.nytimes.com/interactive/2021/world...,Skip to content@@Skip to site index@@World@@@@...,False@@False@@False@@False@@False@@False@@Fals...,https://ourworldindata.org/coronavirus@@https:...,Our World in Data@@this page@@gogov.ru@@annexe...,False@@False@@False@@False@@False@@False@@Fals...,,https://static01.nyt.com/images/2020/12/16/us/...,thumbnail,151.101.1.164,2021-11-27 18:02:35,nginx,text/html; charset=utf-8,"Sat, 27 Nov 2021 17:51:48 GMT","Sat, 27 Nov 2021 17:51:48 
GMT",vi-interactive-standard,1; mode=block,nosniff,"s-maxage=5,no-cache",vi-interactive,2021-11-27 17:51:51 UTC,bytes,"Sat, 27 Nov 2021 18:02:35 GMT",646.0,"cache-lga21966-LGA, cache-fjr7920-FJR","HIT, HIT","1, 1","S1638036155.086335,VS0,VE1","Accept-Encoding, Fastly-SSL","nyt-a=wM76KxArTHJUL4YFlSpE_O; Expires=Sun, 27 ...",0.0,0.0,,https://www.nytimesn7cgmftshazwhfgzm37qxb44r64...,F-F-VI,upgrade-insecure-requests; default-src data: '...,max-age=63072000; preload,HIT-HIT,"text/html,application/xhtml+xml,application/xm...",en,advertools/0.12.2,"gzip, deflate, br",nyt-gdpr=0; nyt-a=wM76KxArTHJUL4YFlSpE_O; nyt-...,Tracking Coronavirus Vaccinations Around the W...,,More than 4.24 billion people worldwide have r...,https://www.nytimes.com/interactive/2021/world...,https://www.nytimes.com/interactive/2021/world...,en-US,"[{'@context': 'http://schema.org', '@type': 'P...",2021-11-26T12:00:34.890Z,2021-01-29T09:32:19.000Z,Covid World Vaccination Tracker,2021.0,0.0,https://www.nytimes.com/#publisher,https://www.nytimes.com/#publisher,WebPageElement,0.0,.meteredContent,"[CreativeWork, Product]",The New York Times,nytimes.com:basic,2021-11-26T12:00:34.890Z,https://www.nytimes.com/,"Charts show countries with at least 100,000 pe...",Tracking Coronavirus Vaccinations Around the W...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,https://www.nytimes.com/interactive/2021/us/co...,Covid in the U.S.: Track Daily Cases Near You ...,Build your own dashboard to track the coronavi...,"width=device-width, initial-scale=1",utf-8,Tracking the Coronavirus@@Site Index@@Site Inf...,The Coronavirus Pandemic@@United States@@Lates...,https://www.nytimes.com/interactive/2021/us/co...,,,https://www.nytimes.com/interactive/2021/us/co...,article,Track Coronavirus Cases in Places Important to...,Build your own dashboard to track the coronavi...,https://static01.nyt.com/images/2020/10/25/us/...,@nytimes,http://schema.org,NewsArticle,"[{'@context': 'http://schema.org', '@type': 'I...",,,,,,https://www.nytimes.com/#publisher,http://schema.org,NewsMediaOrganization,The New York Times,https://www.nytimes.com/,https://www.nytimes.com/#publisher,https://www.nytco.com/diversity-and-inclusion-...,https://www.nytco.com/who-we-are/culture/stand...,https://www.nytimes.com/interactive/2020/09/08...,1851-09-18,https://en.wikipedia.org/wiki/The_New_York_Times,,,http://schema.org,ImageObject,https://static01.nyt.com/images/misc/NYT_logo_...,40.0,250.0,Sections SEARCH U.S. 
| Track Coronavirus Cases...,595221,180,www.nytimes.com,0.38993,,,,,1,200,https://www.nytimes.com/interactive/2021/us/co...,Skip to content@@Skip to site index@@U.S.@@@@L...,False@@False@@False@@False@@False@@False@@Fals...,https://www.nytimes.com/news-event/coronavirus...,The Coronavirus Pandemic@@Covid-19 Updates@@Co...,False@@False@@False@@False@@False@@False@@Fals...,https://www.nytimes.com/interactive/2021/us/co...,Skip to content@@Skip to site index@@U.S.@@@@L...,False@@False@@False@@False@@False@@False@@Fals...,https://help.nytimes.com/hc/en-us/articles/115...,© 2021 The New York Times Company@@NYTCo@@Cont...,False@@False@@False@@False@@False@@False@@Fals...,,,,151.101.1.164,2021-11-27 18:02:35,nginx,text/html; charset=utf-8,"Sat, 27 Nov 2021 17:07:02 GMT","Sat, 27 Nov 2021 17:07:02 GMT",vi-interactive-standard,1; mode=block,nosniff,"s-maxage=60,no-cache",vi-interactive,2021-11-27 17:08:44 UTC,bytes,"Sat, 27 Nov 2021 18:02:35 GMT",3285.0,"cache-lga21945-LGA, cache-fjr7923-FJR","HIT, HIT","1, 1","S1638036155.079775,VS0,VE1","Accept-Encoding, Fastly-SSL","nyt-a=wM76KxArTHJUL4YFlSpE_O; Expires=Sun, 27 ...",0.0,0.0,,https://www.nytimesn7cgmftshazwhfgzm37qxb44r64...,F-F-VI,upgrade-insecure-requests; default-src data: '...,max-age=63072000; preload,HIT-HIT,"text/html,application/xhtml+xml,application/xm...",en,advertools/0.12.2,"gzip, deflate, br",nyt-gdpr=0; nyt-a=wM76KxArTHJUL4YFlSpE_O; nyt-...,Track Coronavirus Cases in Places Important to...,,Build your own dashboard to track the coronavi...,https://www.nytimes.com/interactive/2021/us/co...,https://www.nytimes.com/interactive/2021/us/co...,en,,2021-11-27T10:35:25.663Z,2020-11-24T21:02:00.000Z,Covid in the U.S.: Track Daily Cases Near You,2021.0,0.0,https://www.nytimes.com/#publisher,https://www.nytimes.com/#publisher,WebPageElement,0.0,.meteredContent,"[CreativeWork, Product]",The New York Times,nytimes.com:basic,2021-11-27T10:35:25.663Z,https://www.nytimes.com/,,Track Coronavirus Cases in Places Important 
to...,https://www.nytimes.com/#publisher,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4,https://www.nytimes.com/interactive/2021/us/ne...,New Mexico Coronavirus Map and Case Count - Th...,See the latest charts and maps of coronavirus ...,"width=device-width, initial-scale=1",utf-8,New reported cases@@Daily new hospital admissi...,The Coronavirus Pandemic@@Tests@@Hospitalized@...,https://www.nytimes.com/interactive/2021/us/ne...,,,https://www.nytimes.com/interactive/2021/us/ne...,article,New Mexico Coronavirus Map and Case Count,See the latest charts and maps of coronavirus ...,https://static01.nyt.com/images/2020/03/29/us/...,@nytimes,http://schema.org,NewsArticle,"[{'@context': 'http://schema.org', '@type': 'I...",,,,,,https://www.nytimes.com/#publisher,http://schema.org,NewsMediaOrganization,The New York Times,https://www.nytimes.com/,https://www.nytimes.com/#publisher,https://www.nytco.com/diversity-and-inclusion-...,https://www.nytco.com/who-we-are/culture/stand...,https://www.nytimes.com/interactive/2020/09/08...,1851-09-18,https://en.wikipedia.org/wiki/The_New_York_Times,,,http://schema.org,ImageObject,https://static01.nyt.com/images/misc/NYT_logo_...,40.0,250.0,Sections SEARCH U.S. 
| New Mexico Coronavirus ...,1330295,180,www.nytimes.com,0.404914,,,,,1,200,https://www.nytimes.com/interactive/2021/us/ne...,Skip to content@@Skip to site index@@U.S.@@@@L...,False@@False@@False@@False@@False@@False@@Fals...,https://www.nytimes.com/news-event/coronavirus...,The Coronavirus Pandemic@@Covid-19 Updates@@Co...,False@@False@@False@@False@@False@@False@@Fals...,https://www.nytimes.com/interactive/2021/us/ne...,Skip to content@@Skip to site index@@U.S.@@@@L...,False@@False@@False@@False@@False@@False@@Fals...,https://help.nytimes.com/hc/en-us/articles/115...,© 2021 The New York Times Company@@NYTCo@@Cont...,False@@False@@False@@False@@False@@False@@Fals...,lazy@@lazy@@lazy@@lazy@@lazy@@lazy@@@@@@@@@@@@...,https://static01.nyt.com/newsgraphics/2021/cor...,Hot spots thumbnail@@Vaccinations thumbnail@@R...,151.101.1.164,2021-11-27 18:02:35,nginx,text/html; charset=utf-8,"Sat, 27 Nov 2021 17:52:48 GMT","Sat, 27 Nov 2021 17:52:48 GMT",vi-interactive-standard,1; mode=block,nosniff,"s-maxage=60,no-cache",vi-interactive,2021-11-27 17:52:48 UTC,bytes,"Sat, 27 Nov 2021 18:02:35 GMT",586.0,"cache-lga21934-LGA, cache-fjr7922-FJR","MISS, HIT","0, 1","S1638036155.086356,VS0,VE1","Accept-Encoding, Fastly-SSL","nyt-a=wM76KxArTHJUL4YFlSpE_O; Expires=Sun, 27 ...",0.0,0.0,,https://www.nytimesn7cgmftshazwhfgzm37qxb44r64...,F-F-VI,upgrade-insecure-requests; default-src data: '...,max-age=63072000; preload,MISS-HIT,"text/html,application/xhtml+xml,application/xm...",en,advertools/0.12.2,"gzip, deflate, br",nyt-gdpr=0; nyt-a=wM76KxArTHJUL4YFlSpE_O; nyt-...,New Mexico Coronavirus Map and Case Count@@Tra...,,See the latest charts and maps of coronavirus ...,https://www.nytimes.com/interactive/2021/us/ne...,https://www.nytimes.com/interactive/2021/us/ne...,en-US,,2021-11-27T10:35:22.047Z,2020-04-01T15:47:00.000Z,New Mexico Coronavirus Map and Case 
Count,2021.0,0.0,https://www.nytimes.com/#publisher,https://www.nytimes.com/#publisher,WebPageElement,0.0,.meteredContent,"[CreativeWork, Product]",The New York Times,nytimes.com:basic,2021-11-27T10:35:22.047Z,https://www.nytimes.com/,,New Mexico Coronavirus Map and Case Count,https://www.nytimes.com/#publisher,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


Let's check how many URLs contain a question mark: 

In [17]:
nyt_params['url'].str.contains(r'\?', regex=True).sum()

38

We now crawl with `skip_url_params=True`:

In [18]:
adv.crawl(url_list='https://nytimes.com',
          output_file='nyt_crawl_no_params.jl',
          follow_links=True,
          skip_url_params=True,
          custom_settings={'CLOSESPIDER_PAGECOUNT': 300,
                           'LOG_FILE': 'nyt_crawl_no_params.log'})

In [19]:
nyt_no_params = pd.read_json('nyt_crawl_no_params.jl', lines=True)
nyt_no_params.head()

Unnamed: 0,url,title,meta_desc,viewport,charset,h2,h3,canonical,alt_href,alt_hreflang,og:url,og:type,og:title,og:description,og:image,twitter:site,jsonld_@context,jsonld_@type,jsonld_image,jsonld_name,jsonld_mainEntity.@context,jsonld_mainEntity.@type,jsonld_mainEntity.itemListElement,jsonld_mainEntity.numberOfItems,jsonld_publisher.@id,jsonld_1_@context,jsonld_1_@type,jsonld_1_name,jsonld_1_url,jsonld_1_@id,jsonld_1_diversityPolicy,jsonld_1_ethicsPolicy,jsonld_1_masthead,jsonld_1_foundingDate,jsonld_1_sameAs,jsonld_1_alternateName,jsonld_1_subOrganization,jsonld_1_logo.@context,jsonld_1_logo.@type,jsonld_1_logo.url,jsonld_1_logo.height,jsonld_1_logo.width,body_text,size,download_timeout,download_slot,download_latency,redirect_times,redirect_ttl,redirect_urls,redirect_reasons,depth,status,links_url,links_text,links_nofollow,nav_links_url,nav_links_text,nav_links_nofollow,header_links_url,header_links_text,header_links_nofollow,footer_links_url,footer_links_text,footer_links_nofollow,img_src,img_loading,img_alt,ip_address,crawl_time,resp_headers_server,resp_headers_content-type,resp_headers_x-nyt-data-last-modified,resp_headers_last-modified,resp_headers_x-pagetype,resp_headers_x-xss-protection,resp_headers_x-content-type-options,resp_headers_cache-control,resp_headers_x-nyt-route,resp_headers_x-origin-time,resp_headers_accept-ranges,resp_headers_date,resp_headers_age,resp_headers_x-served-by,resp_headers_x-cache,resp_headers_x-cache-hits,resp_headers_x-timer,resp_headers_vary,resp_headers_set-cookie,resp_headers_x-nyt-app-webview,resp_headers_x-gdpr,resp_headers_x-frame-options,resp_headers_onion-location,resp_headers_x-api-version,resp_headers_content-security-policy,resp_headers_strict-transport-security,resp_headers_x-nyt-edge-cache,request_headers_accept,request_headers_accept-language,request_headers_user-agent,request_headers_accept-encoding,request_headers_cookie,h1,og:image:alt,jsonld_description,jsonld_mainEntityOfPage,jsonld_url,jsonld_inLanguage,jsonld_au
thor,jsonld_dateModified,jsonld_datePublished,jsonld_headline,jsonld_copyrightYear,jsonld_isAccessibleForFree,jsonld_copyrightHolder.@id,jsonld_sourceOrganization.@id,jsonld_hasPart.@type,jsonld_hasPart.isAccessibleForFree,jsonld_hasPart.cssSelector,jsonld_isPartOf.@type,jsonld_isPartOf.name,jsonld_isPartOf.productID,resp_headers_x-scoop-last-modified,request_headers_referer,h4,jsonld_alternativeHeadline,jsonld_author.@id,resp_headers_cf-chl-bypass,resp_headers_permissions-policy,resp_headers_expires,resp_headers_expect-ct,resp_headers_report-to,resp_headers_nel,resp_headers_cf-ray,img_width,img_height,twitter:app:name:googleplay,twitter:app:id:googleplay,twitter:app:url:googleplay,img_srcset,img_sizes,resp_headers_x-datadome-timer,resp_headers_fastly-restarts,jsonld_2_@context,jsonld_2_@type,jsonld_2_itemListElement,jsonld_video,jsonld_image.@context,jsonld_image.@type,jsonld_image.url,jsonld_image.height,jsonld_image.width,jsonld_image.caption,jsonld_3_@context,jsonld_3_@type,jsonld_3_@id,jsonld_3_description,jsonld_3_url,jsonld_3_name,jsonld_3_thumbnailUrl,jsonld_3_embedUrl,jsonld_3_uploadDate,jsonld_3_duration,jsonld_2_headline,jsonld_2_description,jsonld_2_image,jsonld_2_coverageStartTime,jsonld_2_coverageEndTime,jsonld_2_datePublished,jsonld_2_dateModified,jsonld_2_articleBody,jsonld_2_copyrightYear,jsonld_2_liveBlogUpdate,jsonld_2_url,jsonld_2_mainEntityOfPage,jsonld_2_author.@id,jsonld_2_publisher.@id,jsonld_2_copyrightHolder.@id,jsonld_2_sourceOrganization.@id,jsonld_2_@id,jsonld_2_name,jsonld_2_thumbnailUrl,jsonld_2_embedUrl,jsonld_2_uploadDate,jsonld_2_transcript,jsonld_2_duration,jsonld_author.@context,jsonld_author.@type,jsonld_author.description,jsonld_author.url,jsonld_author.name,jsonld_author.sameAs,jsonld_author.image.@context,jsonld_author.image.@type,jsonld_author.image.url,jsonld_3_image,jsonld_3_mainEntityOfPage,jsonld_3_inLanguage,jsonld_3_author,jsonld_3_dateModified,jsonld_3_datePublished,jsonld_3_headline,jsonld_3_copyrightYear,jsonld_3_isA
ccessibleForFree,jsonld_3_publisher.@id,jsonld_3_copyrightHolder.@id,jsonld_3_sourceOrganization.@id,jsonld_3_hasPart.@type,jsonld_3_hasPart.isAccessibleForFree,jsonld_3_hasPart.cssSelector,jsonld_3_isPartOf.@type,jsonld_3_isPartOf.name,jsonld_3_isPartOf.productID,jsonld_4_@context,jsonld_4_@type,jsonld_4_name,jsonld_4_url,jsonld_4_@id,jsonld_4_diversityPolicy,jsonld_4_ethicsPolicy,jsonld_4_masthead,jsonld_4_foundingDate,jsonld_4_sameAs,jsonld_4_logo.@context,jsonld_4_logo.@type,jsonld_4_logo.url,jsonld_4_logo.height,jsonld_4_logo.width,twitter:card,h6,jsonld_2_inLanguage,jsonld_2_author,jsonld_2_isAccessibleForFree,jsonld_2_hasPart.@type,jsonld_2_hasPart.isAccessibleForFree,jsonld_2_hasPart.cssSelector,jsonld_2_isPartOf.@type,jsonld_2_isPartOf.name,jsonld_2_isPartOf.productID,jsonld_3_diversityPolicy,jsonld_3_ethicsPolicy,jsonld_3_masthead,jsonld_3_foundingDate,jsonld_3_sameAs,jsonld_3_logo.@context,jsonld_3_logo.@type,jsonld_3_logo.url,jsonld_3_logo.height,jsonld_3_logo.width,jsonld_4_itemListElement,jsonld_3_transcript,jsonld_5_@context,jsonld_5_@type,jsonld_5_@id,jsonld_5_description,jsonld_5_url,jsonld_5_name,jsonld_5_thumbnailUrl,jsonld_5_embedUrl,jsonld_5_uploadDate,jsonld_5_duration,jsonld_6_@context,jsonld_6_@type,jsonld_6_@id,jsonld_6_description,jsonld_6_url,jsonld_6_name,jsonld_6_thumbnailUrl,jsonld_6_embedUrl,jsonld_6_uploadDate,jsonld_6_duration,jsonld_4_description,jsonld_4_thumbnailUrl,jsonld_4_embedUrl,jsonld_4_uploadDate,jsonld_4_transcript,jsonld_4_duration,jsonld_audio,jsonld_2_contentUrl,h5,jsonld_5_transcript,og:site_name,twitter:title,twitter:url,twitter:description,resp_headers_via,resp_headers_x-guploader-uploadid,resp_headers_etag,resp_headers_x-goog-generation,resp_headers_x-goog-metageneration,resp_headers_x-goog-stored-content-encoding,resp_headers_x-goog-stored-content-length,resp_headers_x-goog-hash,resp_headers_x-goog-storage-class,twitter:image,twitter:creator,img_usemap,img_ismap,jsonld_3_contentUrl,jsonld_2_alternativeHeadline,json
ld_6_transcript,jsonld_7_@context,jsonld_7_@type,jsonld_7_@id,jsonld_7_description,jsonld_7_url,jsonld_7_name,jsonld_7_thumbnailUrl,jsonld_7_embedUrl,jsonld_7_uploadDate,jsonld_7_transcript,jsonld_7_duration
0,https://www.nytimes.com/,"The New York Times - Breaking News, US News, W...","Live news, investigations, opinion, photos and...","width=device-width, initial-scale=1",utf-8,Tracking the Coronavirus ›@@Site Information N...,"As New Variant Circles the Globe, African Nati...",https://www.nytimes.com,https://www.nytimes.com@@https://www.nytimes.c...,x-default@@en-US@@en@@en-GB@@en-AU@@en-CA@@es@...,https://www.nytimes.com,website,"The New York Times - Breaking News, US News, W...","Live news, investigations, opinion, photos and...",https://static01.nyt.com/newsgraphics/images/i...,@nytimes,http://schema.org,WebPage,"[{'@context': 'http://schema.org', '@type': 'I...",The New York Times,http://schema.org,ItemList,[],0.0,https://www.nytimes.com/#publisher,http://schema.org,NewsMediaOrganization,The New York Times,https://www.nytimes.com/,https://www.nytimes.com/#publisher,https://www.nytco.com/diversity-and-inclusion-...,https://www.nytco.com/who-we-are/culture/stand...,https://www.nytimes.com/interactive/2020/09/08...,1851-09-18,"[https://www.facebook.com/nytimes/, https://tw...","[NYT, new york times, nytimes, ny times]","[{'@type': 'Organization', 'name': 'NYT Cookin...",http://schema.org,ImageObject,https://static01.nyt.com/images/misc/NYT_logo_...,40.0,250.0,Sections SEARCH U.S. 
International Canada Espa...,1515591,180,nytimes.com,0.395742,1.0,19.0,https://nytimes.com,301.0,0,200,https://www.nytimes.com/#after-dfp-ad-top@@htt...,Continue reading the main story@@Skip to conte...,False@@False@@False@@False@@False@@False@@Fals...,https://help.nytimes.com/hc/en-us/articles/115...,© 2021 The New York Times Company@@NYTCo@@Cont...,False@@False@@False@@False@@False@@False@@Fals...,https://www.nytimes.com/#site-content@@https:/...,Skip to content@@Skip to site index@@@@U.S.@@I...,False@@False@@False@@False@@False@@False@@Fals...,https://help.nytimes.com/hc/en-us/articles/115...,© 2021 The New York Times Company@@NYTCo@@Cont...,False@@False@@False@@False@@False@@False@@Fals...,https://static01.nyt.com/images/2021/11/27/wor...,lazy@@@@@@lazy@@lazy@@lazy@@lazy@@@@@@@@@@@@@@...,Travelers checking a flight information board ...,151.101.1.164,2021-11-27 18:04:44,nginx,text/html; charset=utf-8,"Sat, 27 Nov 2021 17:21:36 GMT","Sat, 27 Nov 2021 17:21:36 GMT",vi-homepage,1; mode=block,nosniff,"s-maxage=30,no-cache",homepage,2021-11-27 17:26:15 UTC,bytes,"Sat, 27 Nov 2021 18:04:44 GMT",2309.0,"cache-lga21973-LGA, cache-fjr7923-FJR","MISS, HIT","0, 1","S1638036284.454049,VS0,VE5","Accept-Encoding, Fastly-SSL","nyt-a=4CQQ4aSOYS7PXDiJyNu4ye; Expires=Sun, 27 ...",0.0,0.0,DENY,https://www.nytimesn7cgmftshazwhfgzm37qxb44r64...,F-F-VI,upgrade-insecure-requests; default-src data: '...,max-age=63072000; preload,MISS-HIT,"text/html,application/xhtml+xml,application/xm...",en,advertools/0.12.2,"gzip, deflate, br",nyt-gdpr=0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,https://www.nytimes.com/interactive/2020/us/co...,How Full Are Hospital I.C.U.s Near You? - The ...,See how many Covid-19 patients are being treat...,"width=device-width, initial-scale=1",utf-8,Comments @@\n\tTracking the Coronavirus\n@@Sit...,United States@@Latest Maps and Data@@Vaccinati...,https://www.nytimes.com/interactive/2020/us/co...,,,https://www.nytimes.com/interactive/2020/us/co...,article,How Full Are Hospital I.C.U.s Near You?,See how many Covid-19 patients are being treat...,https://static01.nyt.com/images/2020/12/15/us/...,@nytimes,http://schema.org,NewsArticle,"[{'@context': 'http://schema.org', '@type': 'I...",,,,,,https://www.nytimes.com/#publisher,http://schema.org,NewsMediaOrganization,The New York Times,https://www.nytimes.com/,https://www.nytimes.com/#publisher,https://www.nytco.com/diversity-and-inclusion-...,https://www.nytco.com/who-we-are/culture/stand...,https://www.nytimes.com/interactive/2020/09/08...,1851-09-18,https://en.wikipedia.org/wiki/The_New_York_Times,,,http://schema.org,ImageObject,https://static01.nyt.com/images/misc/NYT_logo_...,40.0,250.0,Sections SEARCH U.S. 
| How Full Are Hospital I...,283694,180,www.nytimes.com,0.412304,,,,,1,200,https://www.nytimes.com/interactive/2020/us/co...,Skip to content@@Skip to site index@@U.S.@@@@L...,False@@False@@False@@False@@False@@False@@Fals...,https://www.nytimes.com/interactive/2021/us/co...,Latest Maps and Data@@Vaccinations by State@@C...,False@@False@@False@@False@@False@@False@@Fals...,https://www.nytimes.com/interactive/2020/us/co...,Skip to content@@Skip to site index@@U.S.@@@@L...,False@@False@@False@@False@@False@@False@@Fals...,https://carlsonschool.umn.edu/mili-misrc-covid...,Covid-19 Hospitalization Tracking Project@@© 2...,False@@False@@False@@False@@False@@False@@Fals...,,,,151.101.1.164,2021-11-27 18:04:45,nginx,text/html; charset=utf-8,"Sat, 27 Nov 2021 13:15:18 GMT","Sat, 27 Nov 2021 13:15:18 GMT",vi-interactive-standard,1; mode=block,nosniff,"s-maxage=300,no-cache",vi-interactive,2021-11-27 13:23:55 UTC,bytes,"Sat, 27 Nov 2021 18:04:45 GMT",17107.0,"cache-lga21947-LGA, cache-fjr7921-FJR","HIT, HIT","1, 1","S1638036285.408643,VS0,VE1","Accept-Encoding, Fastly-SSL","nyt-a=4CQQ4aSOYS7PXDiJyNu4ye; Expires=Sun, 27 ...",0.0,0.0,,https://www.nytimesn7cgmftshazwhfgzm37qxb44r64...,F-F-VI,upgrade-insecure-requests; default-src data: '...,max-age=63072000; preload,HIT-HIT,"text/html,application/xhtml+xml,application/xm...",en,advertools/0.12.2,"gzip, deflate, br",nyt-gdpr=0; nyt-a=4CQQ4aSOYS7PXDiJyNu4ye; nyt-...,How Full Are Hospital I.C.U.s Near You?,,See how many Covid-19 patients are being treat...,https://www.nytimes.com/interactive/2020/us/co...,https://www.nytimes.com/interactive/2020/us/co...,en-US,"[{'@context': 'http://schema.org', '@type': 'P...",2021-11-15T16:37:40.400Z,2020-12-15T16:33:01.000Z,How Full Are Hospital I.C.U.s Near You?,2021.0,0.0,https://www.nytimes.com/#publisher,https://www.nytimes.com/#publisher,WebPageElement,0.0,.meteredContent,"[CreativeWork, Product]",The New York 
Times,nytimes.com:basic,2021-11-15T16:37:40.400Z,https://www.nytimes.com/,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,https://www.nytimes.com/interactive/2021/world...,Covid World Vaccination Tracker - The New York...,More than 4.24 billion people worldwide have r...,"width=device-width, initial-scale=1",utf-8,\n\tTracking the Coronavirus\n@@Site Index@@Si...,The Coronavirus Pandemic@@Vaccinations by coun...,https://www.nytimes.com/interactive/2021/world...,,,https://www.nytimes.com/interactive/2021/world...,article,Tracking Coronavirus Vaccinations Around the W...,More than 4.24 billion people worldwide have r...,https://static01.nyt.com/images/2021/01/28/us/...,@nytimes,http://schema.org,NewsArticle,"[{'@context': 'http://schema.org', '@type': 'I...",,,,,,https://www.nytimes.com/#publisher,http://schema.org,NewsMediaOrganization,The New York Times,https://www.nytimes.com/,https://www.nytimes.com/#publisher,https://www.nytco.com/diversity-and-inclusion-...,https://www.nytco.com/who-we-are/culture/stand...,https://www.nytimes.com/interactive/2020/09/08...,1851-09-18,https://en.wikipedia.org/wiki/The_New_York_Times,,,http://schema.org,ImageObject,https://static01.nyt.com/images/misc/NYT_logo_...,40.0,250.0,Sections SEARCH World | Tracking Coronavirus V...,366265,180,www.nytimes.com,0.423752,,,,,1,200,https://www.nytimes.com/interactive/2021/world...,Skip to content@@Skip to site index@@World@@@@...,False@@False@@False@@False@@False@@False@@Fals...,https://www.nytimes.com/news-event/coronavirus...,The Coronavirus Pandemic@@Covid-19 Updates@@Co...,False@@False@@False@@False@@False@@False@@Fals...,https://www.nytimes.com/interactive/2021/world...,Skip to content@@Skip to site index@@World@@@@...,False@@False@@False@@False@@False@@False@@Fals...,https://ourworldindata.org/coronavirus@@https:...,Our World in Data@@this page@@gogov.ru@@annexe...,False@@False@@False@@False@@False@@False@@Fals...,https://static01.nyt.com/images/2020/12/16/us/...,,thumbnail,151.101.1.164,2021-11-27 18:04:45,nginx,text/html; charset=utf-8,"Sat, 27 Nov 2021 18:01:31 GMT","Sat, 27 Nov 2021 18:01:31 
GMT",vi-interactive-standard,1; mode=block,nosniff,"s-maxage=5,no-cache",vi-interactive,2021-11-27 18:01:31 UTC,bytes,"Sat, 27 Nov 2021 18:04:45 GMT",194.0,"cache-lga21966-LGA, cache-fjr7922-FJR","MISS, HIT","0, 1","S1638036285.417494,VS0,VE1","Accept-Encoding, Fastly-SSL","nyt-a=4CQQ4aSOYS7PXDiJyNu4ye; Expires=Sun, 27 ...",0.0,0.0,,https://www.nytimesn7cgmftshazwhfgzm37qxb44r64...,F-F-VI,upgrade-insecure-requests; default-src data: '...,max-age=63072000; preload,MISS-HIT,"text/html,application/xhtml+xml,application/xm...",en,advertools/0.12.2,"gzip, deflate, br",nyt-gdpr=0; nyt-a=4CQQ4aSOYS7PXDiJyNu4ye; nyt-...,Tracking Coronavirus Vaccinations Around the W...,,More than 4.24 billion people worldwide have r...,https://www.nytimes.com/interactive/2021/world...,https://www.nytimes.com/interactive/2021/world...,en-US,"[{'@context': 'http://schema.org', '@type': 'P...",2021-11-26T12:00:34.890Z,2021-01-29T09:32:19.000Z,Covid World Vaccination Tracker,2021.0,0.0,https://www.nytimes.com/#publisher,https://www.nytimes.com/#publisher,WebPageElement,0.0,.meteredContent,"[CreativeWork, Product]",The New York Times,nytimes.com:basic,2021-11-26T12:00:34.890Z,https://www.nytimes.com/,"Charts show countries with at least 100,000 pe...",Tracking Coronavirus Vaccinations Around the W...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,https://www.nytimes.com/interactive/2021/us/co...,Covid in the U.S.: Track Daily Cases Near You ...,Build your own dashboard to track the coronavi...,"width=device-width, initial-scale=1",utf-8,Tracking the Coronavirus@@Site Index@@Site Inf...,The Coronavirus Pandemic@@United States@@Lates...,https://www.nytimes.com/interactive/2021/us/co...,,,https://www.nytimes.com/interactive/2021/us/co...,article,Track Coronavirus Cases in Places Important to...,Build your own dashboard to track the coronavi...,https://static01.nyt.com/images/2020/10/25/us/...,@nytimes,http://schema.org,NewsArticle,"[{'@context': 'http://schema.org', '@type': 'I...",,,,,,https://www.nytimes.com/#publisher,http://schema.org,NewsMediaOrganization,The New York Times,https://www.nytimes.com/,https://www.nytimes.com/#publisher,https://www.nytco.com/diversity-and-inclusion-...,https://www.nytco.com/who-we-are/culture/stand...,https://www.nytimes.com/interactive/2020/09/08...,1851-09-18,https://en.wikipedia.org/wiki/The_New_York_Times,,,http://schema.org,ImageObject,https://static01.nyt.com/images/misc/NYT_logo_...,40.0,250.0,Sections SEARCH U.S. 
| Track Coronavirus Cases...,595203,180,www.nytimes.com,0.588237,,,,,1,200,https://www.nytimes.com/interactive/2021/us/co...,Skip to content@@Skip to site index@@U.S.@@@@L...,False@@False@@False@@False@@False@@False@@Fals...,https://www.nytimes.com/news-event/coronavirus...,The Coronavirus Pandemic@@Covid-19 Updates@@Co...,False@@False@@False@@False@@False@@False@@Fals...,https://www.nytimes.com/interactive/2021/us/co...,Skip to content@@Skip to site index@@U.S.@@@@L...,False@@False@@False@@False@@False@@False@@Fals...,https://help.nytimes.com/hc/en-us/articles/115...,© 2021 The New York Times Company@@NYTCo@@Cont...,False@@False@@False@@False@@False@@False@@Fals...,,,,151.101.1.164,2021-11-27 18:04:45,nginx,text/html; charset=utf-8,"Sat, 27 Nov 2021 18:02:11 GMT","Sat, 27 Nov 2021 18:02:11 GMT",vi-interactive-standard,1; mode=block,nosniff,"s-maxage=60,no-cache",vi-interactive,2021-11-27 18:04:45 UTC,bytes,"Sat, 27 Nov 2021 18:04:45 GMT",27.0,"cache-lga21943-LGA, cache-fjr7921-FJR","HIT, MISS","1, 0","S1638036285.396643,VS0,VE191","Accept-Encoding, Fastly-SSL","nyt-a=4CQQ4aSOYS7PXDiJyNu4ye; Expires=Sun, 27 ...",0.0,0.0,,https://www.nytimesn7cgmftshazwhfgzm37qxb44r64...,F-F-VI,upgrade-insecure-requests; default-src data: '...,max-age=63072000; preload,HIT-MISS,"text/html,application/xhtml+xml,application/xm...",en,advertools/0.12.2,"gzip, deflate, br",nyt-gdpr=0; nyt-a=4CQQ4aSOYS7PXDiJyNu4ye; nyt-...,Track Coronavirus Cases in Places Important to...,,Build your own dashboard to track the coronavi...,https://www.nytimes.com/interactive/2021/us/co...,https://www.nytimes.com/interactive/2021/us/co...,en,,2021-11-27T10:35:25.663Z,2020-11-24T21:02:00.000Z,Covid in the U.S.: Track Daily Cases Near You,2021.0,0.0,https://www.nytimes.com/#publisher,https://www.nytimes.com/#publisher,WebPageElement,0.0,.meteredContent,"[CreativeWork, Product]",The New York Times,nytimes.com:basic,2021-11-27T10:35:25.663Z,https://www.nytimes.com/,,Track Coronavirus Cases in Places Important 
to...,https://www.nytimes.com/#publisher,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4,https://www.nytimes.com/interactive/2021/us/ne...,New Mexico Coronavirus Map and Case Count - Th...,See the latest charts and maps of coronavirus ...,"width=device-width, initial-scale=1",utf-8,New reported cases@@Daily new hospital admissi...,The Coronavirus Pandemic@@Tests@@Hospitalized@...,https://www.nytimes.com/interactive/2021/us/ne...,,,https://www.nytimes.com/interactive/2021/us/ne...,article,New Mexico Coronavirus Map and Case Count,See the latest charts and maps of coronavirus ...,https://static01.nyt.com/images/2020/03/29/us/...,@nytimes,http://schema.org,NewsArticle,"[{'@context': 'http://schema.org', '@type': 'I...",,,,,,https://www.nytimes.com/#publisher,http://schema.org,NewsMediaOrganization,The New York Times,https://www.nytimes.com/,https://www.nytimes.com/#publisher,https://www.nytco.com/diversity-and-inclusion-...,https://www.nytco.com/who-we-are/culture/stand...,https://www.nytimes.com/interactive/2020/09/08...,1851-09-18,https://en.wikipedia.org/wiki/The_New_York_Times,,,http://schema.org,ImageObject,https://static01.nyt.com/images/misc/NYT_logo_...,40.0,250.0,Sections SEARCH U.S. 
| New Mexico Coronavirus ...,1330313,180,www.nytimes.com,0.439286,,,,,1,200,https://www.nytimes.com/interactive/2021/us/ne...,Skip to content@@Skip to site index@@U.S.@@@@L...,False@@False@@False@@False@@False@@False@@Fals...,https://www.nytimes.com/news-event/coronavirus...,The Coronavirus Pandemic@@Covid-19 Updates@@Co...,False@@False@@False@@False@@False@@False@@Fals...,https://www.nytimes.com/interactive/2021/us/ne...,Skip to content@@Skip to site index@@U.S.@@@@L...,False@@False@@False@@False@@False@@False@@Fals...,https://help.nytimes.com/hc/en-us/articles/115...,© 2021 The New York Times Company@@NYTCo@@Cont...,False@@False@@False@@False@@False@@False@@Fals...,https://static01.nyt.com/newsgraphics/2021/cor...,lazy@@lazy@@lazy@@lazy@@lazy@@lazy@@@@@@@@@@@@...,Hot spots thumbnail@@Vaccinations thumbnail@@R...,151.101.1.164,2021-11-27 18:04:46,nginx,text/html; charset=utf-8,"Sat, 27 Nov 2021 18:02:22 GMT","Sat, 27 Nov 2021 18:02:22 GMT",vi-interactive-standard,1; mode=block,nosniff,"s-maxage=60,no-cache",vi-interactive,2021-11-27 18:02:36 UTC,bytes,"Sat, 27 Nov 2021 18:04:45 GMT",129.0,"cache-lga21960-LGA, cache-fjr7922-FJR","MISS, HIT","0, 1","S1638036285.419445,VS0,VE2","Accept-Encoding, Fastly-SSL","nyt-a=4CQQ4aSOYS7PXDiJyNu4ye; Expires=Sun, 27 ...",0.0,0.0,,https://www.nytimesn7cgmftshazwhfgzm37qxb44r64...,F-F-VI,upgrade-insecure-requests; default-src data: '...,max-age=63072000; preload,MISS-HIT,"text/html,application/xhtml+xml,application/xm...",en,advertools/0.12.2,"gzip, deflate, br",nyt-gdpr=0; nyt-a=4CQQ4aSOYS7PXDiJyNu4ye; nyt-...,New Mexico Coronavirus Map and Case Count@@Tra...,,See the latest charts and maps of coronavirus ...,https://www.nytimes.com/interactive/2021/us/ne...,https://www.nytimes.com/interactive/2021/us/ne...,en-US,,2021-11-27T10:35:22.047Z,2020-04-01T15:47:00.000Z,New Mexico Coronavirus Map and Case 
Count,2021.0,0.0,https://www.nytimes.com/#publisher,https://www.nytimes.com/#publisher,WebPageElement,0.0,.meteredContent,"[CreativeWork, Product]",The New York Times,nytimes.com:basic,2021-11-27T10:35:22.047Z,https://www.nytimes.com/,,New Mexico Coronavirus Map and Case Count,https://www.nytimes.com/#publisher,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [20]:
nyt_no_params['url'].str.contains(r'\?', regex=True).sum()

0

We have zero URLs with question marks, which means that any such links were skipped and not crawled.

> Note: In rare cases you might set `skip_url_params=True` and still get URLs with parameters. This can happen if the website redirects a parameter-free URL to one with parameters, for example from `example.com/somepage` to `example.com/somepage_2?key=value`.
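
One way to spot such cases is to compare each requested URL with the URL that was finally crawled. The sketch below is a minimal illustration using only the standard library. The `pages` list and the `has_params` and `gained_params` helpers are hypothetical; in a crawl DataFrame the two values would roughly correspond to the `redirect_urls` and `url` columns.

```python
from urllib.parse import urlsplit


def has_params(url):
    """Return True if the URL has a non-empty query string."""
    return bool(urlsplit(url).query)


def gained_params(requested_url, final_url):
    """True when a parameter-free request ended on a URL with parameters."""
    return not has_params(requested_url) and has_params(final_url)


# Hypothetical (requested, final) pairs, mirroring the example in the note.
pages = [
    ("https://example.com/somepage", "https://example.com/somepage_2?key=value"),
    ("https://example.com/about", "https://example.com/about"),
]

flagged = [final for requested, final in pages if gained_params(requested, final)]
print(flagged)
```

Checking the parsed query component, rather than searching for a literal `?`, avoids counting URLs that end in a bare question mark with no parameters after it.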


# Extracting all `<img>` attributes

Previously, only the `src` and `alt` attributes of images were scraped. Now every supported `<img>` attribute that is present on a page is extracted as well.

We can check for the columns in our crawl dataset that contain "img_":

In [15]:
nyt_params.filter(regex='img_')

Unnamed: 0,img_src,img_loading,img_alt,img_height,img_srcset,img_sizes,img_width
0,https://static01.nyt.com/images/2021/11/26/mul...,lazy@@@@@@lazy@@lazy@@lazy@@lazy@@lazy@@lazy@@...,A lab at the Nelson Mandela School of Medicine...,,,,
1,@@@@https://static01.nyt.com/images/2021/11/26...,,The New York Times Style Magazine@@The New Yor...,@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@...,@@@@https://static01.nyt.com/images/2021/11/26...,@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@...,@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@...
2,http://static01.nyt.com/newsgraphics/2021/01/1...,,thumbnail,,,,
3,https://static01.nyt.com/images/2021/11/28/rea...,,One of Patrica Buzo’s terrarium designs repeat...,@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@...,https://static01.nyt.com/images/2021/11/28/rea...,@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@...,@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@...
4,https://static01.nyt.com/images/2021/11/22/us/...,,Cinemagraph@@Cinemagraph@@Cinemagraph@@Cinemag...,,,,
...,...,...,...,...,...,...,...
103,https://static01.nyt.com/images/2021/10/12/us/...,,"Carl K. Dunn, the chief of police in Baker, La...",399@@,https://static01.nyt.com/images/2021/10/12/us/...,((min-width: 600px) and (max-width: 1004px)) 8...,600@@
104,https://static01.nyt.com/images/2011/06/20/us/...,,"Wess Young, 94, fled with his mother and siste...",365,https://static01.nyt.com/images/2011/06/20/us/...,((min-width: 600px) and (max-width: 1004px)) 8...,600
105,https://static01.nyt.com/images/2021/11/09/mul...,,Blanca Quintero setting up a sign asking custo...,400@@@@,https://static01.nyt.com/images/2021/11/09/mul...,((min-width: 600px) and (max-width: 1004px)) 8...,600@@@@
106,https://static01.nyt.com/images/2021/06/29/wel...,,"Kate, 12, has been in therapy for years to cop...",900@@@@400@@@@@@@@@@@@@@@@@@,https://static01.nyt.com/images/2021/06/29/wel...,100vw@@@@((min-width: 600px) and (max-width: 1...,600@@@@600@@@@@@@@@@@@@@@@@@
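
In the output above, a page with several images has their values joined by `@@` within each cell, and an image lacking an attribute appears to leave an empty slot. The sketch below splits such cells back into one value per image. The `split_multi` helper is hypothetical, the sample strings are shortened from row 106, and the pairing assumes that positions line up across the `img_` columns, as they appear to in that row.

```python
def split_multi(cell, sep="@@"):
    """Split one crawl cell into a list with one item per element on the page."""
    if not isinstance(cell, str):  # missing values (None/NaN) give no items
        return []
    return cell.split(sep)


# Shortened from row 106 of the output above.
img_width = "600@@@@600"
img_height = "900@@@@400"

# Pair width and height by position; empty strings mark missing attributes.
dimensions = list(zip(split_multi(img_width), split_multi(img_height)))
print(dimensions)
# [('600', '900'), ('', ''), ('600', '400')]
```

With a real crawl DataFrame, the same splitting could be applied to a whole column before further analysis.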


In [76]:
# the percentage of crawled pages on which each img attribute was found:

(nyt_params
 .filter(regex='img_')
 .notna()
 .mean()
 .sort_values(ascending=False)
 .to_frame()
 .style.format('{:.1%}'))

Unnamed: 0,0
img_src,88.5%
img_alt,86.8%
img_height,71.7%
img_width,71.7%
img_srcset,61.8%
img_sizes,60.9%
img_loading,5.9%


# New column `last_dir` in the `url_to_df` function

There are no set rules for how to structure URLs, and every website is different. However, in most cases we find the following structure:

`example.com/category/sub-category/sub-subcategory/page-title`

`example.com/category/sub-category/sub-subcategory/product-name`

All directories before the last one give some metadata about the page (which categories it belongs to), but it is usually the last directory that uniquely identifies the page.

We can have many products under `/fashion/shoes/` for example, but we need the last directory to know which specific shoe it is. 
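
As a rough illustration of the idea (not the library's implementation), the last directory can be derived with the standard library by taking the final non-empty segment of the URL path. The `last_dir_of` helper and the `red-sneakers` product name are hypothetical.

```python
from urllib.parse import urlsplit


def last_dir_of(url):
    """Return the last non-empty path segment of a URL, or None if there is none."""
    path = urlsplit(url).path  # the query string and fragment are not part of the path
    segments = [segment for segment in path.split("/") if segment]
    return segments[-1] if segments else None


print(last_dir_of("https://example.com/fashion/shoes/red-sneakers"))
print(last_dir_of("https://example.com/fashion/shoes/?utm_source=nav-footer"))
print(last_dir_of("https://example.com/"))
```

Dropping empty segments means a trailing slash does not hide the last directory, and a URL with no path yields `None` rather than an empty string.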

We split the URLs of the crawl DataFrame and see the result:

In [73]:
nyt_url_df = adv.url_to_df(nyt_params['url'])
nyt_url_df.sample(5)

Unnamed: 0,url,scheme,netloc,path,query,fragment,dir_1,dir_2,dir_3,dir_4,dir_5,dir_6,dir_7,last_dir,query_utm_source,query_action,query_pgtype,query_region,query_module,query_state,query__r,query_variant,query_block,query_WT.nav,query_clickSource,query_utm_medium,query_utm_campaign,query_inline,query_context,query_redir,query_contentCollection,query_abt,query_abg,query_ref,query_searchResultPosition
161,https://www.nytimes.com/2017/06/09/us/politics...,https,www.nytimes.com,/2017/06/09/us/politics/dan-scavino-hatch-act-...,,,2017,06,9.0,us,politics,dan-scavino-hatch-act-amash.html,,dan-scavino-hatch-act-amash.html,,,,,,,,,,,,,,,,,,,,,
296,https://www.nytimes.com/2021/10/21/us/politics...,https,www.nytimes.com,/2021/10/21/us/politics/biden-filibuster-votin...,,,2021,10,21.0,us,politics,biden-filibuster-voting-rights.html,,biden-filibuster-voting-rights.html,,,,,,,,,,,,,,,,,,,,,
138,https://www.nytimes.com/2013/12/06/world/afric...,https,www.nytimes.com,/2013/12/06/world/africa/central-african-repub...,,,2013,12,6.0,world,africa,central-african-republic-fighting.html,,central-african-republic-fighting.html,,,,,,,,,,,,,,,,,,,,,
59,https://www.nytimes.com/2021/01/08/us/who-was-...,https,www.nytimes.com,/2021/01/08/us/who-was-ashli-babbitt.html,,,2021,01,8.0,us,who-was-ashli-babbitt.html,,,who-was-ashli-babbitt.html,,,,,,,,,,,,,,,,,,,,,
69,https://www.nytimes.com/by/roni-caryn-rabin,https,www.nytimes.com,/by/roni-caryn-rabin,,,by,roni-caryn-rabin,,,,,,roni-caryn-rabin,,,,,,,,,,,,,,,,,,,,,


In [77]:
(nyt_url_df
 .sample(10)
 .drop(['query', 'fragment'], axis=1)
 .iloc[:, :12]
 .style.highlight_null(null_color='#FFEFD3'))

Unnamed: 0,url,scheme,netloc,path,dir_1,dir_2,dir_3,dir_4,dir_5,dir_6,dir_7,last_dir
172,https://www.nytimes.com/2018/01/21/us/politics/bipartisan-senators-government-shutdown.html?hp&action=click&pgtype=Homepage&clickSource=story-heading&module=first-column-region&region=top-news&WT.nav=top-news,https,www.nytimes.com,/2018/01/21/us/politics/bipartisan-senators-government-shutdown.html,2018,1.0,21,us,politics,bipartisan-senators-government-shutdown.html,,bipartisan-senators-government-shutdown.html
23,https://www.nytimes.com/interactive/2021/us/crawford-michigan-covid-cases.html,https,www.nytimes.com,/interactive/2021/us/crawford-michigan-covid-cases.html,interactive,2021.0,us,crawford-michigan-covid-cases.html,,,,crawford-michigan-covid-cases.html
220,https://www.nytimes.com/2019/11/11/world/australia/fires-sydney-new-south-wales.html,https,www.nytimes.com,/2019/11/11/world/australia/fires-sydney-new-south-wales.html,2019,11.0,11,world,australia,fires-sydney-new-south-wales.html,,fires-sydney-new-south-wales.html
21,https://www.nytimes.com/interactive/2021/us/texas-covid-cases.html,https,www.nytimes.com,/interactive/2021/us/texas-covid-cases.html,interactive,2021.0,us,texas-covid-cases.html,,,,texas-covid-cases.html
32,https://www.nytimes.com/interactive/2021/us/pope-minnesota-covid-cases.html,https,www.nytimes.com,/interactive/2021/us/pope-minnesota-covid-cases.html,interactive,2021.0,us,pope-minnesota-covid-cases.html,,,,pope-minnesota-covid-cases.html
260,https://www.nytimes.com/2009/11/04/us/04vote.html,https,www.nytimes.com,/2009/11/04/us/04vote.html,2009,11.0,04,us,04vote.html,,,04vote.html
48,https://www.nytimes.com/2021/10/09/world/europe/boris-johnson-britain-brexit.html,https,www.nytimes.com,/2021/10/09/world/europe/boris-johnson-britain-brexit.html,2021,10.0,09,world,europe,boris-johnson-britain-brexit.html,,boris-johnson-britain-brexit.html
150,https://cn.nytimes.com/asia-pacific/20211028/china-hypersonic-missile/zh-hant/,https,cn.nytimes.com,/asia-pacific/20211028/china-hypersonic-missile/zh-hant/,asia-pacific,20211028.0,china-hypersonic-missile,zh-hant,,,,zh-hant
94,https://cn.nytimes.com/real-estate/?utm_source=nav-footer,https,cn.nytimes.com,/real-estate/,real-estate,,,,,,,real-estate
210,https://www.nytimes.com/2017/05/17/us/politics/robert-mueller-special-counsel-russia-investigation.html,https,www.nytimes.com,/2017/05/17/us/politics/robert-mueller-special-counsel-russia-investigation.html,2017,5.0,17,us,politics,robert-mueller-special-counsel-russia-investigation.html,,robert-mueller-special-counsel-russia-investigation.html


Let's see which values appear most often in the first directory, and then analyze further:

In [79]:
nyt_url_df['dir_1'].value_counts()[:20]

2021                   55
interactive            47
2017                   25
2019                   24
by                     22
2018                   18
live                   18
2020                   15
2016                    7
world                   6
usa                     6
china                   5
2015                    4
es                      4
2013                    4
asia-pacific            3
topic                   3
readers-translation     3
2012                    2
2011                    2
Name: dir_1, dtype: int64

In [81]:
(nyt_url_df
 [nyt_url_df['dir_1'].eq('2021')]        # filter where dir_1 is "2021"
 ['last_dir']                            # select the last_dir column
 .str.replace(r'\.html', '', regex=True) # remove ".html"
 .str.split("-")                         # split each value by "-" into a list of words
 .explode()                              # put each element of the resulting lists
                                         # into its own row
 .value_counts()[:20])                   # count the words, display the top 20

trump          8
ethiopia       7
covid          6
vaccine        6
biden          4
coronavirus    4
delta          4
georgia        4
election       4
johnson        3
un             3
variant        3
coup           3
governor       3
who            3
omicron        3
boris          3
tigray         3
executive      2
lilly          2
Name: last_dir, dtype: int64

Do the same again, but for URLs where `dir_1` is "es":

In [82]:
(nyt_url_df
 [nyt_url_df['dir_1'].eq('es')]
 ['last_dir'].str.replace(r'\.html', '', regex=True)
 .str.split("-")
 .explode()
 .value_counts())

vacuna         3
coronavirus    2
covid          1
rusia          1
memes          1
sputnik        1
polio          1
sanders        1
Name: last_dir, dtype: int64
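
Because the same steps were applied twice, they could be wrapped in a small helper. The sketch below reproduces the logic with the standard library only, so it runs without the crawl file. `count_slug_words` is a hypothetical name, and the sample slugs are taken from the `last_dir` values shown earlier. Unlike the cells above, it anchors the `.html` pattern to the end of the string, which is a slight tightening rather than the original behavior.

```python
import re
from collections import Counter


def count_slug_words(slugs, top_n=None):
    """Count hyphen-separated words across slugs, ignoring a trailing .html."""
    words = []
    for slug in slugs:
        slug = re.sub(r"\.html$", "", slug)  # strip the extension only at the end
        words.extend(slug.split("-"))
    return Counter(words).most_common(top_n)


# Sample values copied from the last_dir column shown earlier.
slugs = [
    "texas-covid-cases.html",
    "pope-minnesota-covid-cases.html",
    "crawford-michigan-covid-cases.html",
    "boris-johnson-britain-brexit.html",
]

print(count_slug_words(slugs, top_n=5))
```

The helper can then be called once per segment of interest, for example once for the slugs under "2021" and once for those under "es".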