# CLIR Sitemap Analysis

This analyis uses [advertools](https://advertools.readthedocs.io/en/master/advertools.sitemaps.html) to analyze the sitemap for <https://www.clir.org>

In [1]:
!pip install advertools



Import library and read the CLIR sitemap as a data frame.

In [2]:
import advertools as adv

clir_sitemap = adv.sitemap_to_df('https://www.clir.org/sitemap.xml')

2023-05-31 15:17:36,511 | INFO | sitemaps.py:536 | sitemap_to_df | Getting https://www.clir.org/sitemap-pt-post-p1-2023-04.xml
2023-05-31 15:17:36,533 | INFO | sitemaps.py:536 | sitemap_to_df | Getting https://www.clir.org/sitemap-pt-post-p1-2023-01.xml
2023-05-31 15:17:36,534 | INFO | sitemaps.py:536 | sitemap_to_df | Getting https://www.clir.org/sitemap-pt-post-p1-2023-02.xml
2023-05-31 15:17:36,535 | INFO | sitemaps.py:536 | sitemap_to_df | Getting https://www.clir.org/sitemap-pt-post-p1-2023-03.xml
2023-05-31 15:17:36,536 | INFO | sitemaps.py:536 | sitemap_to_df | Getting https://www.clir.org/sitemap-pt-post-p1-2023-05.xml
2023-05-31 15:17:36,559 | INFO | sitemaps.py:536 | sitemap_to_df | Getting https://www.clir.org/sitemap-pt-post-p1-2022-11.xml
2023-05-31 15:17:36,561 | INFO | sitemaps.py:536 | sitemap_to_df | Getting https://www.clir.org/sitemap-pt-post-p1-2022-12.xml
2023-05-31 15:17:36,563 | INFO | sitemaps.py:536 | sitemap_to_df | Getting https://www.clir.org/sitemap-misc.xm

Preview the first 10 records

In [3]:
dlf_sitemap.head(10)

Unnamed: 0,loc,lastmod,changefreq,priority,sitemap,sitemap_size_mb,download_date
0,https://www.diglib.org/five-tips-for-rapid-met...,2023-05-15 21:06:16+00:00,monthly,0.2,https://www.diglib.org/sitemap-pt-post-p1-2023...,0.001037,2023-05-16 20:17:47.523840+00:00
1,https://www.diglib.org/dlf-digest-may-2023/,2023-04-27 18:45:01+00:00,monthly,0.2,https://www.diglib.org/sitemap-pt-post-p1-2023...,0.001037,2023-05-16 20:17:47.523840+00:00
2,https://www.diglib.org/,2023-05-09 18:02:05+00:00,daily,1.0,https://www.diglib.org/sitemap-misc.xml,0.000988,2023-05-16 20:17:47.541711+00:00
3,https://www.diglib.org/sitemap.html,2023-05-15 21:06:16+00:00,monthly,0.5,https://www.diglib.org/sitemap-misc.xml,0.000988,2023-05-16 20:17:47.541711+00:00
4,https://www.diglib.org/legal-and-ethical-consi...,2023-02-28 16:59:14+00:00,monthly,0.2,https://www.diglib.org/sitemap-pt-post-p1-2023...,0.002025,2023-05-16 20:17:47.623845+00:00
5,https://www.diglib.org/april-6-7-join-the-virt...,2023-02-23 14:02:44+00:00,monthly,0.2,https://www.diglib.org/sitemap-pt-post-p1-2023...,0.002025,2023-05-16 20:17:47.623845+00:00
6,https://www.diglib.org/join-us-for-ndsa-member...,2023-02-21 17:26:36+00:00,monthly,0.2,https://www.diglib.org/sitemap-pt-post-p1-2023...,0.002025,2023-05-16 20:17:47.623845+00:00
7,https://www.diglib.org/call-for-new-members-an...,2023-02-13 15:45:39+00:00,monthly,0.2,https://www.diglib.org/sitemap-pt-post-p1-2023...,0.002025,2023-05-16 20:17:47.623845+00:00
8,https://www.diglib.org/legal-and-ethical-consi...,2023-02-28 16:58:51+00:00,monthly,0.2,https://www.diglib.org/sitemap-pt-post-p1-2023...,0.002025,2023-05-16 20:17:47.623845+00:00
9,https://www.diglib.org/dlf-digest-february-2023/,2023-01-31 22:33:51+00:00,monthly,0.2,https://www.diglib.org/sitemap-pt-post-p1-2023...,0.002025,2023-05-16 20:17:47.623845+00:00


Show basical statistics about the sitemap (rows and number of columns)

In [4]:
print(dlf_sitemap.shape)

(1789, 7)


Break apart the location into path components

In [6]:
url_df = adv.url_to_df(dlf_sitemap['loc'])
url_df

Unnamed: 0,url,scheme,netloc,path,query,fragment,dir_1,dir_2,dir_3,dir_4,dir_5,last_dir
0,https://www.diglib.org/five-tips-for-rapid-met...,https,www.diglib.org,/five-tips-for-rapid-metadata-assessment/,,,five-tips-for-rapid-metadata-assessment,,,,,five-tips-for-rapid-metadata-assessment
1,https://www.diglib.org/dlf-digest-may-2023/,https,www.diglib.org,/dlf-digest-may-2023/,,,dlf-digest-may-2023,,,,,dlf-digest-may-2023
2,https://www.diglib.org/,https,www.diglib.org,/,,,,,,,,
3,https://www.diglib.org/sitemap.html,https,www.diglib.org,/sitemap.html,,,sitemap.html,,,,,sitemap.html
4,https://www.diglib.org/legal-and-ethical-consi...,https,www.diglib.org,/legal-and-ethical-considerations-for-providin...,,,legal-and-ethical-considerations-for-providing...,,,,,legal-and-ethical-considerations-for-providing...
...,...,...,...,...,...,...,...,...,...,...,...,...
1784,https://www.diglib.org/groups/past/,https,www.diglib.org,/groups/past/,,,groups,past,,,,past
1785,https://www.diglib.org/events/,https,www.diglib.org,/events/,,,events,,,,,events
1786,https://www.diglib.org/about/staff/,https,www.diglib.org,/about/staff/,,,about,staff,,,,staff
1787,https://www.diglib.org/news/,https,www.diglib.org,/news/,,,news,,,,,news


Show counts by top level directory

In [7]:
url_df['dir_1'].value_counts()

dlf-events                                                                             283
groups                                                                                  66
opportunities                                                                           26
about                                                                                    9
community                                                                                4
                                                                                      ... 
archives-have-never-been-neutral-an-ndsa-interview-with-jarrett-drake                    1
proposal-for-a-dlf-working-group-on-labor-in-digital-libraries-archives-and-museums      1
karina-wratschko                                                                         1
ndsa-call-for-volunteers-digital-preservation-2017                                       1
announcing-the-2015-dlf-dhsi-cross-pollinators                                           1

Break down `lastmod` for largest number of pages (`dlf-events`)

In [8]:
(dlf_sitemap[dlf_sitemap['loc'].str.contains('/dlf-events/')].set_index('lastmod').resample('M')['loc'].count())

lastmod
2011-01-31 00:00:00+00:00    3
2011-02-28 00:00:00+00:00    0
2011-03-31 00:00:00+00:00    0
2011-04-30 00:00:00+00:00    0
2011-05-31 00:00:00+00:00    1
                            ..
2023-01-31 00:00:00+00:00    0
2023-02-28 00:00:00+00:00    0
2023-03-31 00:00:00+00:00    0
2023-04-30 00:00:00+00:00    1
2023-05-31 00:00:00+00:00    2
Freq: M, Name: loc, Length: 149, dtype: int64

Show top twenty directories and number of pages

In [9]:
url_df['dir_2'].value_counts()[:20]

2014forum                   60
2012forum                   52
2013forum                   48
2011forum                   45
2015forum                   29
past                        22
2016forum                   22
fellowships                 16
2017forum                   10
fall2010                     7
clir-dlf-affiliates          6
authenticity-project         6
linkeddata                   3
members                      2
jobs                         2
membership-cohorts           2
code-of-conduct              2
museums-cohort-2             1
liberal-arts-colleges        1
community-capacity-award     1
Name: dir_2, dtype: int64