## Installing Libraries

In [None]:
!pip install advertools

Collecting advertools
  Downloading advertools-0.13.5-py2.py3-none-any.whl (312 kB)
Collecting scrapy>=2.5.0 (from advertools)
  Downloading Scrapy-2.11.0-py2.py3-none-any.whl (286 kB)
Collecting twython>=3.8.0 (from advertools)
  Downloading twython-3.9.1-py3-none-any.whl (33 kB)
Collecting Twisted<23.8.0,>=18.9.0 (from scrapy>=2.5.0->advertools)
  Downloading T

In [None]:
!pip install adviz

Collecting adviz
  Downloading adviz-0.0.15-py3-none-any.whl (21 kB)
Installing collected packages: adviz
Successfully installed adviz-0.0.15


In [None]:
import advertools as adv # specialized functions for getting robots.txt files, XML sitemaps, and splitting URLs
import adviz

# general data manipulation tasks
import pandas as pd
pd.options.display.max_columns = None

# data visualization
import plotly.express as px
from IPython.display import display_html, display_markdown
from ipywidgets import interact, IntRangeSlider, Text, Dropdown, IntSlider
import ipywidgets as widgets
import plotly

In [None]:
for pkg in [adv, adviz, pd, plotly, widgets]:
    print(f'{pkg.__name__:-<30}v{pkg.__version__}')
def md(text):
    return display_markdown(text, raw=True)

advertools--------------------v0.13.5
adviz-------------------------v0.0.15
pandas------------------------v1.5.3
plotly------------------------v5.15.0
ipywidgets--------------------v7.7.1


## Getting the robots.txt file with the robotstxt_to_df function

The advertools library provides the robotstxt_to_df function, which retrieves one or more robots.txt files in a single call. It returns the file as a DataFrame, which makes it easier to analyze. All we have to provide is the URL of the robots.txt file:

In [None]:
robots_df = adv.robotstxt_to_df('https://www.penshoppe.com/robots.txt')
robots_df

INFO:root:Getting: https://www.penshoppe.com/robots.txt


Unnamed: 0,directive,content,etag,robotstxt_url,download_date
0,comment,we use Shopify as our ecommerce platform,"W/""cacheable:6ac14f9c15eecf384f63c4b5e074b907""",https://www.penshoppe.com/robots.txt,2023-12-30 06:15:30.742215+00:00
1,User-agent,*,"W/""cacheable:6ac14f9c15eecf384f63c4b5e074b907""",https://www.penshoppe.com/robots.txt,2023-12-30 06:15:30.742215+00:00
2,Disallow,/admin,"W/""cacheable:6ac14f9c15eecf384f63c4b5e074b907""",https://www.penshoppe.com/robots.txt,2023-12-30 06:15:30.742215+00:00
3,Disallow,/cart,"W/""cacheable:6ac14f9c15eecf384f63c4b5e074b907""",https://www.penshoppe.com/robots.txt,2023-12-30 06:15:30.742215+00:00
4,Disallow,/orders,"W/""cacheable:6ac14f9c15eecf384f63c4b5e074b907""",https://www.penshoppe.com/robots.txt,2023-12-30 06:15:30.742215+00:00
...,...,...,...,...,...
127,Sitemap,https://www.penshoppe.com/sitemap.xml,"W/""cacheable:6ac14f9c15eecf384f63c4b5e074b907""",https://www.penshoppe.com/robots.txt,2023-12-30 06:15:30.742215+00:00
128,User-agent,MJ12bot,"W/""cacheable:6ac14f9c15eecf384f63c4b5e074b907""",https://www.penshoppe.com/robots.txt,2023-12-30 06:15:30.742215+00:00
129,Crawl-Delay,10,"W/""cacheable:6ac14f9c15eecf384f63c4b5e074b907""",https://www.penshoppe.com/robots.txt,2023-12-30 06:15:30.742215+00:00
130,User-agent,Pinterest,"W/""cacheable:6ac14f9c15eecf384f63c4b5e074b907""",https://www.penshoppe.com/robots.txt,2023-12-30 06:15:30.742215+00:00
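
To make the directive and content columns more concrete, here is a minimal standard-library sketch of how a robots.txt file can be broken into (directive, content) pairs. It is not how advertools implements robotstxt_to_df. The `ROBOTS_TXT` text is a hypothetical excerpt modeled on the rows above, and `robots_rows` is an illustrative helper name.

```python
# Hypothetical robots.txt excerpt, modeled on the rows shown above.
ROBOTS_TXT = """\
# we use Shopify as our ecommerce platform
User-agent: *
Disallow: /admin
Disallow: /cart

Sitemap: https://www.penshoppe.com/sitemap.xml

User-agent: MJ12bot
Crawl-Delay: 10
"""


def robots_rows(text):
    """Return (directive, content) pairs, one per non-empty line."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Comment lines get the pseudo-directive "comment".
            rows.append(("comment", line.lstrip("#").strip()))
            continue
        # Split on the first colon only, so URLs keep their "https:" part.
        directive, _, content = line.partition(":")
        rows.append((directive.strip(), content.strip()))
    return rows


rows = robots_rows(ROBOTS_TXT)
for directive, content in rows:
    print(f"{directive:<12}{content}")
```

Splitting on the first colon only is the important detail: a naive `split(":")` would cut the sitemap URL in two.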


## Extracting sitemaps from the robots.txt file

If the file lists any sitemaps, we can find them by filtering the rows whose directive column contains "Sitemap" (ignoring case) and taking the corresponding values from the content column:

In [None]:
sitemap_urls = robots_df[robots_df['directive'].str.contains('Sitemap', case=False)]['content'].tolist()
sitemap_urls[0]

'https://www.penshoppe.com/sitemap.xml'
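
As a cross-check that does not depend on advertools, Python's standard library can extract the same information. The sketch below feeds a few hypothetical robots.txt lines (modeled on the file above) to `urllib.robotparser.RobotFileParser`; its `site_maps()` method requires Python 3.8 or later and returns `None` when no sitemap is listed, hence the `or []` fallback.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical lines modeled on the robots.txt shown earlier.
ROBOTS_LINES = [
    "User-agent: *",
    "Disallow: /admin",
    "Disallow: /cart",
    "",
    "Sitemap: https://www.penshoppe.com/sitemap.xml",
]

parser = RobotFileParser()
parser.parse(ROBOTS_LINES)  # parse in-memory lines, no network request

# site_maps() returns a list of URLs, or None if the file lists none.
sitemaps = parser.site_maps() or []
print(sitemaps)

# The same parser can also answer "may this user agent fetch this URL?"
admin_allowed = parser.can_fetch("*", "https://www.penshoppe.com/admin")
print(admin_allowed)
```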

## Crawling and parsing all sitemaps recursively

The list contains a single sitemap URL. Passing it to sitemap_to_df downloads that sitemap and, because it is a sitemap index, every sitemap it references, all in one function call:

sitemap_df = adv.sitemap_to_df('https://www.penshoppe.com/sitemap.xml')

In [None]:
penshoppe = adv.sitemap_to_df(sitemap_urls[0])
caption = f'<h2>Penshoppe.com XML Sitemaps</h2><h4>Rows: {penshoppe.shape[0]:,} – Columns: {penshoppe.shape[1]:,}</h4><br>'
penshoppe.sample(10).style.set_caption(caption)

INFO:root:Getting https://www.penshoppe.com/sitemap_pages_1.xml
INFO:root:Getting https://www.penshoppe.com/sitemap_blogs_1.xml
INFO:root:Getting https://www.penshoppe.com/sitemap_products_2.xml?from=7845143445694&to=7860073398462
INFO:root:Getting https://www.penshoppe.com/sitemap_collections_1.xml
INFO:root:Getting https://www.penshoppe.com/sitemap_products_1.xml?from=347827961890&to=7845050908862


Unnamed: 0,loc,lastmod,changefreq,sitemap,etag,sitemap_size_mb,download_date,image,image_loc,image_title,image_caption
3398,https://www.penshoppe.com/products/relaxed-fit-textured-resort-shirt-with-patch-pockets-975833-off-white,2023-12-30 06:22:19+00:00,daily,https://www.penshoppe.com/sitemap_products_1.xml?from=347827961890&to=7845050908862,"W/""cacheable:b2f5b9ddfafbdc89de91e337ae71b183""",1.130939,2023-12-30 06:22:20.160503+00:00,,https://cdn.shopify.com/s/files/1/2282/7539/products/975833-Off_White_1.jpg?v=1700450652,Relaxed Fit Textured Resort Shirt with Patch Pockets,975833-Off White (1).jpg
2316,https://www.penshoppe.com/products/coin-purse-973514-blue-stone,2023-12-30 06:22:19+00:00,daily,https://www.penshoppe.com/sitemap_products_1.xml?from=347827961890&to=7845050908862,"W/""cacheable:b2f5b9ddfafbdc89de91e337ae71b183""",1.130939,2023-12-30 06:22:20.160503+00:00,,https://cdn.shopify.com/s/files/1/2282/7539/products/973514-Blue_Stone_2.jpg?v=1694434350,Neoprene Coin Purse,973514-Blue Stone (2).jpg
1962,https://www.penshoppe.com/products/5-pocket-skinny-jeans-974187-light-blue,2023-12-30 06:22:19+00:00,daily,https://www.penshoppe.com/sitemap_products_1.xml?from=347827961890&to=7845050908862,"W/""cacheable:b2f5b9ddfafbdc89de91e337ae71b183""",1.130939,2023-12-30 06:22:20.160503+00:00,,https://cdn.shopify.com/s/files/1/2282/7539/products/974187-Light_Blue_6.jpg?v=1690952930,5-Pocket Skinny Jeans,974187-Light Blue (6).jpg
1171,https://www.penshoppe.com/products/linen-side-tie-skort,2023-12-30 06:22:19+00:00,daily,https://www.penshoppe.com/sitemap_products_1.xml?from=347827961890&to=7845050908862,"W/""cacheable:b2f5b9ddfafbdc89de91e337ae71b183""",1.130939,2023-12-30 06:22:20.160503+00:00,,https://cdn.shopify.com/s/files/1/2282/7539/products/971700-BROWN_3.jpg?v=1678958177,Linen Side Tie Skort,
901,https://www.penshoppe.com/collections/mens-summer,2023-12-29 10:40:26+00:00,daily,https://www.penshoppe.com/sitemap_collections_1.xml,"W/""cacheable:d40a2c2f45142d3d897c090398d483c0""",0.136486,2023-12-30 06:22:19.830552+00:00,,,,
2675,https://www.penshoppe.com/products/semi-fit-cropped-polo-with-hi-density-print-975021-blue,2023-12-30 06:22:19+00:00,daily,https://www.penshoppe.com/sitemap_products_1.xml?from=347827961890&to=7845050908862,"W/""cacheable:b2f5b9ddfafbdc89de91e337ae71b183""",1.130939,2023-12-30 06:22:20.160503+00:00,,https://cdn.shopify.com/s/files/1/2282/7539/products/975021-Blue_6.jpg?v=1695866057,Semi Fit Cropped Polo with Hi Density Print,975021-Blue (6).jpg
2860,https://www.penshoppe.com/products/mini-corduroy-sling-bag-977685-chocolate-brown,2023-12-30 06:22:19+00:00,daily,https://www.penshoppe.com/sitemap_products_1.xml?from=347827961890&to=7845050908862,"W/""cacheable:b2f5b9ddfafbdc89de91e337ae71b183""",1.130939,2023-12-30 06:22:20.160503+00:00,,https://cdn.shopify.com/s/files/1/2282/7539/products/977685-Chocolate_Brown_3.jpg?v=1696574681,Mini Corduroy Sling Bag,977685-Chocolate Brown (3).jpg
3544,https://www.penshoppe.com/products/teddy-bear-slim-fit-graphic-t-shirt-976352-off-white,2023-12-30 06:22:19+00:00,daily,https://www.penshoppe.com/sitemap_products_1.xml?from=347827961890&to=7845050908862,"W/""cacheable:b2f5b9ddfafbdc89de91e337ae71b183""",1.130939,2023-12-30 06:22:20.160503+00:00,,https://cdn.shopify.com/s/files/1/2282/7539/products/976352-Off_White_2.jpg?v=1701220061,Teddy Bear Slim Fit Graphic T-Shirt,976352-Off White (2).jpg
2165,https://www.penshoppe.com/products/crew-classic-cap-972783-dark-green,2023-12-30 06:22:19+00:00,daily,https://www.penshoppe.com/sitemap_products_1.xml?from=347827961890&to=7845050908862,"W/""cacheable:b2f5b9ddfafbdc89de91e337ae71b183""",1.130939,2023-12-30 06:22:20.160503+00:00,,https://cdn.shopify.com/s/files/1/2282/7539/products/972783-Dark_Green_3.jpg?v=1693287674,Crew Classic Cap,972783-Dark Green (3).jpg
2337,https://www.penshoppe.com/products/womens-platform-lace-up-canvas-sneakers-974364-beige,2023-12-30 06:22:19+00:00,daily,https://www.penshoppe.com/sitemap_products_1.xml?from=347827961890&to=7845050908862,"W/""cacheable:b2f5b9ddfafbdc89de91e337ae71b183""",1.130939,2023-12-30 06:22:20.160503+00:00,,https://cdn.shopify.com/s/files/1/2282/7539/products/974364-Beige_2.jpg?v=1694485236,Women's Platform Lace-Up Canvas Sneakers,974364-Beige (2).jpg
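
The log lines above show several child sitemaps being fetched from a single index file. A rough standard-library sketch of that recursion follows; it illustrates the idea and is not advertools' implementation. To stay runnable offline it reads from `FAKE_WEB`, a hypothetical in-memory stand-in for HTTP responses with made-up example.com URLs, and `sitemap_rows` is an illustrative helper name.

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical in-memory "web": URL -> XML body (no network access).
FAKE_WEB = {
    "https://example.com/sitemap.xml": """
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemap_pages_1.xml</loc></sitemap>
  <sitemap><loc>https://example.com/sitemap_products_1.xml</loc></sitemap>
</sitemapindex>""",
    "https://example.com/sitemap_pages_1.xml": """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/pages/about</loc></url>
</urlset>""",
    "https://example.com/sitemap_products_1.xml": """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/products/coin-purse</loc></url>
  <url><loc>https://example.com/products/crew-classic-cap</loc></url>
</urlset>""",
}


def sitemap_rows(url, fetch):
    """Yield (loc, sitemap) pairs, following sitemap index files recursively."""
    root = ET.fromstring(fetch(url).strip())
    if root.tag.endswith("sitemapindex"):
        # An index lists other sitemaps: recurse into each of them.
        for loc in root.findall("sm:sitemap/sm:loc", NS):
            yield from sitemap_rows(loc.text.strip(), fetch)
    else:
        # A regular sitemap lists page URLs: emit one row per URL.
        for loc in root.findall("sm:url/sm:loc", NS):
            yield loc.text.strip(), url


rows = list(sitemap_rows("https://example.com/sitemap.xml", FAKE_WEB.__getitem__))
for loc, sitemap in rows:
    print(loc, "<-", sitemap)
```

Recording which sitemap each URL came from mirrors the sitemap column in the DataFrame above, which is what later makes it possible to compare sitemaps with each other.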


## Splitting URLs

The url_to_df function decomposes a list of URLs into their components. We can apply it to the URLs in the loc column:

In [None]:
url_df = adv.url_to_df(penshoppe['loc'])
caption = f'<h2>Penshoppe.com URLs split</h2><h4>Rows: {url_df.shape[0]:,} – Columns: {url_df.shape[1]:,}</h4><br>'
url_df.sample(10).style.set_caption(caption)

Unnamed: 0,url,scheme,netloc,path,query,fragment,dir_1,dir_2,dir_3,last_dir
670,https://www.penshoppe.com/collections/pinks,https,www.penshoppe.com,/collections/pinks,,,collections,pinks,,pinks
3532,https://www.penshoppe.com/products/short-sleeves-henley-top-975027-cobalt-blue,https,www.penshoppe.com,/products/short-sleeves-henley-top-975027-cobalt-blue,,,products,short-sleeves-henley-top-975027-cobalt-blue,,short-sleeves-henley-top-975027-cobalt-blue
3081,https://www.penshoppe.com/products/straight-fit-cargo-trousers-in-ripstop-fabric-975606-dark-gray,https,www.penshoppe.com,/products/straight-fit-cargo-trousers-in-ripstop-fabric-975606-dark-gray,,,products,straight-fit-cargo-trousers-in-ripstop-fabric-975606-dark-gray,,straight-fit-cargo-trousers-in-ripstop-fabric-975606-dark-gray
2131,https://www.penshoppe.com/products/penshoppe-marvel-polo-with-captain-america-print-974888-navy-blue,https,www.penshoppe.com,/products/penshoppe-marvel-polo-with-captain-america-print-974888-navy-blue,,,products,penshoppe-marvel-polo-with-captain-america-print-974888-navy-blue,,penshoppe-marvel-polo-with-captain-america-print-974888-navy-blue
2619,https://www.penshoppe.com/products/floral-relaxed-fit-all-over-print-t-shirt-976421-mustard,https,www.penshoppe.com,/products/floral-relaxed-fit-all-over-print-t-shirt-976421-mustard,,,products,floral-relaxed-fit-all-over-print-t-shirt-976421-mustard,,floral-relaxed-fit-all-over-print-t-shirt-976421-mustard
732,https://www.penshoppe.com/collections/for-the-cute-and-charming-mens,https,www.penshoppe.com,/collections/for-the-cute-and-charming-mens,,,collections,for-the-cute-and-charming-mens,,for-the-cute-and-charming-mens
1494,https://www.penshoppe.com/products/basic-modern-fit-shorts-964573-pastel-green,https,www.penshoppe.com,/products/basic-modern-fit-shorts-964573-pastel-green,,,products,basic-modern-fit-shorts-964573-pastel-green,,basic-modern-fit-shorts-964573-pastel-green
405,https://www.penshoppe.com/collections/mens-fragrances,https,www.penshoppe.com,/collections/mens-fragrances,,,collections,mens-fragrances,,mens-fragrances
3268,https://www.penshoppe.com/products/womens-striped-flip-flops-977660-beige,https,www.penshoppe.com,/products/womens-striped-flip-flops-977660-beige,,,products,womens-striped-flip-flops-977660-beige,,womens-striped-flip-flops-977660-beige
1055,https://www.penshoppe.com/collections/choose-any-3-get-the-3rd-item-for-only-p10,https,www.penshoppe.com,/collections/choose-any-3-get-the-3rd-item-for-only-p10,,,collections,choose-any-3-get-the-3rd-item-for-only-p10,,choose-any-3-get-the-3rd-item-for-only-p10
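
For intuition about where these columns come from, a single URL can be split with the standard library in much the same way. This is a simplified sketch, not the advertools implementation: `split_url` is an illustrative helper name, and it leaves out details such as URL decoding and parsed query parameters.

```python
from urllib.parse import urlsplit


def split_url(url):
    """Split one URL into components similar to the columns shown above."""
    parts = urlsplit(url)
    # Path segments without the empty strings produced by leading/trailing "/".
    dirs = [segment for segment in parts.path.split("/") if segment]
    row = {
        "url": url,
        "scheme": parts.scheme,
        "netloc": parts.netloc,
        "path": parts.path,
        "query": parts.query,
        "fragment": parts.fragment,
    }
    for i, segment in enumerate(dirs, start=1):
        row[f"dir_{i}"] = segment
    row["last_dir"] = dirs[-1] if dirs else None
    return row


row = split_url("https://www.penshoppe.com/collections/pinks")
for key, value in row.items():
    print(f"{key:<10}{value!r}")
```

In this sketch a URL only gets as many dir_N keys as it has path segments. In the table above, dir_3 appears as a column that is empty in every displayed row.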


In [None]:
url_df

Unnamed: 0,url,scheme,netloc,path,query,fragment,dir_1,dir_2,dir_3,last_dir
0,https://www.penshoppe.com/pages/12-days-of-christmas,https,www.penshoppe.com,/pages/12-days-of-christmas,,,pages,12-days-of-christmas,,12-days-of-christmas
1,https://www.penshoppe.com/pages/about,https,www.penshoppe.com,/pages/about,,,pages,about,,about
2,https://www.penshoppe.com/pages/activation,https,www.penshoppe.com,/pages/activation,,,pages,activation,,activation
3,https://www.penshoppe.com/pages/christmas-deals,https,www.penshoppe.com,/pages/christmas-deals,,,pages,christmas-deals,,christmas-deals
4,https://www.penshoppe.com/pages/contact-us,https,www.penshoppe.com,/pages/contact-us,,,pages,contact-us,,contact-us
...,...,...,...,...,...,...,...,...,...,...
3653,https://www.penshoppe.com/products/dress-code-flat-knit-mock-neck-dress,https,www.penshoppe.com,/products/dress-code-flat-knit-mock-neck-dress,,,products,dress-code-flat-knit-mock-neck-dress,,dress-code-flat-knit-mock-neck-dress
3654,https://www.penshoppe.com/products/basic-relaxed-fit-t-shirt-8,https,www.penshoppe.com,/products/basic-relaxed-fit-t-shirt-8,,,products,basic-relaxed-fit-t-shirt-8,,basic-relaxed-fit-t-shirt-8
3655,https://www.penshoppe.com/products/ankle-length-drawstring-pants,https,www.penshoppe.com,/products/ankle-length-drawstring-pants,,,products,ankle-length-drawstring-pants,,ankle-length-drawstring-pants
3656,https://www.penshoppe.com/products/power-stretch-r-mid-waist-jeans-5,https,www.penshoppe.com,/products/power-stretch-r-mid-waist-jeans-5,,,products,power-stretch-r-mid-waist-jeans-5,,power-stretch-r-mid-waist-jeans-5


In [None]:
url_df.dir_1.value_counts()

products       2839
collections     773
pages            33
blogs            12
Name: dir_1, dtype: int64
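
Raw counts are easier to compare as shares of the whole. The short sketch below reuses the four counts printed above (copied by hand into a dictionary) and converts them to percentages. The total also serves as a sanity check: it should equal the number of rows in url_df, whose index runs from 0 to 3656.

```python
# Counts copied from the value_counts() output above.
counts = {"products": 2839, "collections": 773, "pages": 33, "blogs": 12}

total = sum(counts.values())
shares = {section: n / total for section, n in counts.items()}

print(f"{'total':<12}{total:>6,}")
for section, n in counts.items():
    print(f"{section:<12}{n:>6,}{shares[section]:>8.1%}")
```

With the DataFrame at hand, `url_df['dir_1'].value_counts(normalize=True)` gives the same proportions directly.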

In [None]:
url_df[url_df['dir_1'] == 'products']['dir_2'].value_counts().sort_values(ascending=False)

semi-fit-ribbed-button-down-polo-974876-papaya           1
linen-side-tie-skort-1                                   1
parachute-cargo-pants-977125-choco-brown                 1
rhythm-oversized-fit-graphic-t-shirt-977321-off-white    1
modern-fit-graphic-t-shirt-977324-sand                   1
                                                        ..
75-alcohol-hand-sanitizer-spray-sweet-floral-50ml        1
dress-code-flat-knit-mock-neck-dress                     1
basic-relaxed-fit-t-shirt-8                              1
ankle-length-drawstring-pants                            1
linen-chic-trousers                                      1
Name: dir_2, Length: 2839, dtype: int64
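
Every count above is 1 and the Series length (2839) matches the number of product URLs, so each product slug appears exactly once. A minimal sketch of the same uniqueness check with the standard library follows; the `slugs` list is a small hand-copied sample from the output above, not the full column.

```python
from collections import Counter

# A few slugs copied from the output above (sample only, not the full column).
slugs = [
    "linen-side-tie-skort-1",
    "dress-code-flat-knit-mock-neck-dress",
    "basic-relaxed-fit-t-shirt-8",
    "ankle-length-drawstring-pants",
    "linen-chic-trousers",
]

counts = Counter(slugs)
# Keep only slugs that occur more than once.
duplicates = {slug: n for slug, n in counts.items() if n > 1}
all_unique = not duplicates

print("all unique:", all_unique)
print("duplicates:", duplicates)
```

On the full DataFrame, the equivalent one-liner would be `url_df.loc[url_df['dir_1'] == 'products', 'dir_2'].is_unique`.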