# CLIR Sitemap Analysis

This analysis uses [advertools](https://advertools.readthedocs.io/en/master/advertools.sitemaps.html) to analyze the sitemap for <https://www.clir.org>.

In [1]:
!pip install advertools



Import the library and read the CLIR sitemap into a data frame.

In [3]:
import advertools as adv

clir_sitemap = adv.sitemap_to_df('https://www.clir.org/sitemap.xml')

2023-05-31 15:18:11,849 | INFO | sitemaps.py:536 | sitemap_to_df | Getting https://www.clir.org/sitemap-misc.xml
2023-05-31 15:18:11,865 | INFO | sitemaps.py:536 | sitemap_to_df | Getting https://www.clir.org/sitemap-pt-post-p1-2023-03.xml
2023-05-31 15:18:11,880 | INFO | sitemaps.py:536 | sitemap_to_df | Getting https://www.clir.org/sitemap-pt-post-p1-2023-04.xml
2023-05-31 15:18:11,882 | INFO | sitemaps.py:536 | sitemap_to_df | Getting https://www.clir.org/sitemap-pt-post-p1-2023-02.xml
2023-05-31 15:18:11,895 | INFO | sitemaps.py:536 | sitemap_to_df | Getting https://www.clir.org/sitemap-pt-post-p1-2023-05.xml
2023-05-31 15:18:11,896 | INFO | sitemaps.py:536 | sitemap_to_df | Getting https://www.clir.org/sitemap-pt-post-p1-2023-01.xml
2023-05-31 15:18:11,910 | INFO | sitemaps.py:536 | sitemap_to_df | Getting https://www.clir.org/sitemap-pt-post-p1-2022-12.xml
2023-05-31 15:18:11,926 | INFO | sitemaps.py:536 | sitemap_to_df | Getting https://www.clir.org/sitemap-pt-post-p1-2022-11.xm
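The log lines above suggest that `sitemap.xml` is a sitemap index that points to several child sitemaps, each fetched in turn. As a rough illustration of that structure (not of how advertools works internally), the sketch below lists the child sitemaps in an index using only the standard library. The sample XML and its `example.org` URLs are made up for the example.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap index; the URLs are placeholders, not CLIR's.
SAMPLE_INDEX = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.org/sitemap-misc.xml</loc>
    <lastmod>2023-05-30T18:10:21+00:00</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.org/sitemap-posts-2023-05.xml</loc>
  </sitemap>
</sitemapindex>
"""


def child_sitemaps(index_xml: str) -> list[str]:
    """Return the <loc> of every <sitemap> entry in a sitemap index."""
    root = ET.fromstring(index_xml)
    return [loc.text.strip() for loc in root.findall("sm:sitemap/sm:loc", SITEMAP_NS)]


for url in child_sitemaps(SAMPLE_INDEX):
    print(url)
```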

Preview the first 10 records

In [9]:
clir_sitemap.head(10)

Unnamed: 0,loc,lastmod,changefreq,priority,sitemap,sitemap_size_mb,download_date
0,https://www.clir.org/,2023-05-03 13:31:03+00:00,daily,1.0,https://www.clir.org/sitemap-misc.xml,0.000982,2023-05-31 19:18:11.852863+00:00
1,https://www.clir.org/sitemap.html,2023-05-30 18:10:21+00:00,monthly,0.5,https://www.clir.org/sitemap-misc.xml,0.000982,2023-05-31 19:18:11.852863+00:00
2,https://www.clir.org/2023/03/calls-for-proposa...,2023-05-02 13:52:46+00:00,monthly,0.2,https://www.clir.org/sitemap-pt-post-p1-2023-0...,0.00088,2023-05-31 19:18:11.869152+00:00
3,https://www.clir.org/2023/04/clir-news-152/,2023-05-02 17:33:08+00:00,monthly,0.2,https://www.clir.org/sitemap-pt-post-p1-2023-0...,0.001882,2023-05-31 19:18:11.884584+00:00
4,https://www.clir.org/2023/04/updates-from-our-...,2023-04-20 02:28:44+00:00,monthly,0.2,https://www.clir.org/sitemap-pt-post-p1-2023-0...,0.001882,2023-05-31 19:18:11.884584+00:00
5,https://www.clir.org/2023/04/accessibility-by-...,2023-04-20 02:29:11+00:00,monthly,0.2,https://www.clir.org/sitemap-pt-post-p1-2023-0...,0.001882,2023-05-31 19:18:11.884584+00:00
6,https://www.clir.org/2023/04/where-are-they-no...,2023-05-02 17:33:45+00:00,monthly,0.2,https://www.clir.org/sitemap-pt-post-p1-2023-0...,0.001882,2023-05-31 19:18:11.884584+00:00
7,https://www.clir.org/2023/04/four-questions-wi...,2023-05-02 13:56:16+00:00,monthly,0.2,https://www.clir.org/sitemap-pt-post-p1-2023-0...,0.001882,2023-05-31 19:18:11.884584+00:00
8,https://www.clir.org/2023/04/floridas-bright-w...,2023-05-02 13:54:58+00:00,monthly,0.2,https://www.clir.org/sitemap-pt-post-p1-2023-0...,0.001882,2023-05-31 19:18:11.884584+00:00
9,https://www.clir.org/2023/02/toward-radical-im...,2023-05-02 13:54:26+00:00,monthly,0.2,https://www.clir.org/sitemap-pt-post-p1-2023-0...,0.001318,2023-05-31 19:18:11.899314+00:00


Show basic statistics about the sitemap (number of rows and columns)

In [6]:
print(clir_sitemap.shape)

(2783, 7)


Break apart each URL in `loc` into its components (scheme, host, path, and directory levels)

In [10]:
url_df = adv.url_to_df(clir_sitemap['loc'])
url_df

Unnamed: 0,url,scheme,netloc,path,query,fragment,dir_1,dir_2,dir_3,dir_4,dir_5,last_dir
0,https://www.clir.org/,https,www.clir.org,/,,,,,,,,
1,https://www.clir.org/sitemap.html,https,www.clir.org,/sitemap.html,,,sitemap.html,,,,,sitemap.html
2,https://www.clir.org/2023/03/calls-for-proposa...,https,www.clir.org,/2023/03/calls-for-proposals-for-2023-clir-eve...,,,2023,03,calls-for-proposals-for-2023-clir-events-are-n...,,,calls-for-proposals-for-2023-clir-events-are-n...
3,https://www.clir.org/2023/04/clir-news-152/,https,www.clir.org,/2023/04/clir-news-152/,,,2023,04,clir-news-152,,,clir-news-152
4,https://www.clir.org/2023/04/updates-from-our-...,https,www.clir.org,/2023/04/updates-from-our-affiliates/,,,2023,04,updates-from-our-affiliates,,,updates-from-our-affiliates
...,...,...,...,...,...,...,...,...,...,...,...,...
2778,https://www.clir.org/1988/10/,https,www.clir.org,/1988/10/,,,1988,10,,,,10
2779,https://www.clir.org/1988/09/,https,www.clir.org,/1988/09/,,,1988,09,,,,09
2780,https://www.clir.org/1988/08/,https,www.clir.org,/1988/08/,,,1988,08,,,,08
2781,https://www.clir.org/1988/07/,https,www.clir.org,/1988/07/,,,1988,07,,,,07
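To make the columns above easier to read, here is a minimal sketch of the same kind of decomposition for a single URL, using the standard library's `urllib.parse.urlsplit`. The helper `split_url` is a hypothetical name for this example and is not part of advertools; it only approximates the column layout shown in the output.

```python
from urllib.parse import urlsplit


def split_url(url: str) -> dict:
    """Split one URL into components and numbered directory levels."""
    parts = urlsplit(url)
    # Non-empty path segments become dir_1, dir_2, ...
    dirs = [segment for segment in parts.path.split("/") if segment]
    row = {
        "url": url,
        "scheme": parts.scheme,
        "netloc": parts.netloc,
        "path": parts.path,
        "query": parts.query,
        "fragment": parts.fragment,
        "last_dir": dirs[-1] if dirs else None,
    }
    for i, segment in enumerate(dirs, start=1):
        row[f"dir_{i}"] = segment
    return row


row = split_url("https://www.clir.org/2023/04/clir-news-152/")
print(row["dir_1"], row["dir_2"], row["dir_3"], row["last_dir"])
```

For date-based post URLs, `dir_1` is the year and `dir_2` is the month, which matters when reading the directory counts.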


Show counts by top-level directory

In [11]:
url_df['dir_1'].value_counts()

pubs                                                   1397
2013                                                     77
2014                                                     75
2015                                                     65
2012                                                     61
                                                       ... 
clir-camp-summer-2021                                     1
curated-futures-project-a-third-library-is-possible       1
interview-with-fenella-france                             1
interview-with-christa-williford                          1
rovelstad                                                 1
Name: dir_1, Length: 62, dtype: int64
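One detail worth keeping in mind when reading these counts: `value_counts()` skips missing values by default, so URLs with no first directory (such as the site root) are not counted. The sketch below shows the difference on a small made-up series, assuming missing directories are stored as `None`/`NaN`.

```python
import pandas as pd

# Made-up sample; None stands in for a URL with no first-level directory.
dir_1 = pd.Series(["pubs", "pubs", "2013", None, "pubs", "2013", "about"], name="dir_1")

# Default behaviour: missing values are excluded from the counts.
counts = dir_1.value_counts()

# dropna=False adds a row for the missing values.
counts_with_missing = dir_1.value_counts(dropna=False)

print(counts)
print(counts_with_missing)
```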

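Break down `lastmod` by month for the largest group of pages (`pubs`). Note that the filter in this cell matches `/pub/` rather than `/pubs/`, so it selects no rows and the result is empty.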

In [12]:
(clir_sitemap[clir_sitemap['loc'].str.contains('/pub/')].set_index('lastmod').resample('M')['loc'].count())

Series([], Freq: M, Name: loc, dtype: int64)
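The empty result above is consistent with the filter string: the directory in the counts is `pubs`, and a URL containing `/pubs/` does not contain `/pub/`. Below is a minimal sketch of the intended breakdown on made-up data, with the filter changed to `/pubs/`. It uses the `'MS'` (month start) frequency, which labels each bin by the first day of the month, whereas the notebook's `'M'` labels bins by month end. The `example.org` URLs and dates are placeholders.

```python
import pandas as pd

# Made-up rows with the same two columns the notebook uses.
sample = pd.DataFrame({
    "loc": [
        "https://example.org/pubs/reports/a/",
        "https://example.org/pubs/reports/b/",
        "https://example.org/pubs/annual/c/",
        "https://example.org/2023/04/news/",
    ],
    "lastmod": pd.to_datetime([
        "2023-03-05 10:00:00+00:00",
        "2023-03-20 12:30:00+00:00",
        "2023-05-02 09:15:00+00:00",
        "2023-04-20 02:28:44+00:00",
    ], utc=True),
})

# Plain substring match on the directory name, not a regular expression.
is_pubs = sample["loc"].str.contains("/pubs/", regex=False)

# Count matching URLs per calendar month of their last modification.
pubs_by_month = (
    sample[is_pubs]
    .set_index("lastmod")
    .resample("MS")["loc"]
    .count()
)
print(pubs_by_month)
```

Months with no modified `pubs` pages inside the covered range appear with a count of 0.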

Show the top twenty second-level directories (`dir_2`) and the number of pages in each

In [13]:
url_df['dir_2'].value_counts()[:20]

reports                               1251
05                                     118
04                                     116
03                                     114
01                                     113
07                                     109
09                                     109
11                                     106
06                                      97
08                                      91
10                                      84
annual                                  84
02                                      84
12                                      80
resources                               42
postdoc                                 25
archives                                12
leadership-through-new-communities      10
events                                  10
2015-symposium-unconference              9
Name: dir_2, dtype: int64
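The two-digit entries in this list (`05`, `04`, and so on) look like the month part of date-based URLs such as `/2023/04/...`, so they are mixed in with named directories like `reports`. One possible way to separate the two groups is sketched below on made-up data. It treats any two-digit `dir_2` as a month, which is an assumption; a stricter check could also require `dir_1` to be a four-digit year.

```python
import pandas as pd

# Made-up sample of second-level directories; None means no second level.
dir_2 = pd.Series(
    ["reports", "05", "04", "reports", "annual", "05", None, "resources"],
    name="dir_2",
)

# True where the whole value is exactly two digits; missing values count as False.
is_month = dir_2.str.fullmatch(r"\d{2}", na=False)

# Named directories only, most frequent first.
named_counts = dir_2[~is_month].value_counts()

# Month-like values only, in calendar order.
month_counts = dir_2[is_month].value_counts().sort_index()

print(named_counts)
print(month_counts)
```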