# URL normalization (all)

This notebook normalizes the external URLs linked from English Wikipedia pages.

In [1]:
import re
import urllib.parse  # urllib.parse.unquote is used below
from url_normalize import url_normalize
from scrapy.utils.url import canonicalize_url
import pandas as pd

# 0. Import the dataset

The full dataset of references is imported.

In [2]:
df = pd.read_csv('url_normalization/enwiki-20210701-externallinks.csv')
df

Unnamed: 0,el_from,el_to
0,3850540,http://www.housing.berkeley.edu/housing/
1,3850540,http://www.freebornhall.com/History/ResidenceH...
2,3802986,http://www.usdoj.gov/usao/iln/osc/documents/ag...
3,3834458,http://worlddmc.ohiolink.edu/OMP/NewDetails?oi...
4,840171,http://www.kloster-einsiedeln.ch
...,...,...
161719035,67938919,https://ui.adsabs.harvard.edu/abs/2016JHEP...0...
161719036,67938919,https://api.semanticscholar.org/CorpusID:11920...
161719037,67666683,https://cluebotng.toolforge.org/?page=View&id=...
161719038,32439342,https://en.wikisource.org/wiki/Eminent_Chinese...


These are all the external links included in the Wikipedia page of The Strokes.

In [3]:
df[df['el_from'] == 148546]

Unnamed: 0,el_from,el_to
11662364,148546,http://thestrokes.com
17246254,148546,http://www.clashmusic.com/artists/the-strokes
17796240,148546,http://www.mtve.com/article.php?ArticleId=4690
17796241,148546,http://www.clashmusic.com/news/the-strokes-dis...
17796242,148546,http://www.hitquarters.com/index.php3?page=int...
...,...,...
136754505,148546,https://variety.com/2020/tv/news/saturday-nigh...
137090067,148546,https://www.nme.com/news/music/watch-the-strok...
147425416,148546,https://www.nme.com/news/music/the-strokes-tal...
151968272,148546,https://pitchfork.com/news/the-strokes-win-bes...


Because the dataset is very large, the external links are restricted to those that come from Wikipedia article pages (namespace 0). To do this, the Wikipedia page dataset is imported.

In [10]:
pages = pd.read_csv('page/page.tsv', sep='\t')
pages = pages[pages['page_namespace']==0]
pages

Unnamed: 0,page_id,page_namespace,page_title,page_restrictions,page_is_redirect,page_is_new,page_touched,page_links_updated,page_latest,page_len,page_content_model
0,10,0,AccessibleComputing,,1,0,20210607122734,2.021061e+13,1002250816,111,wikitext
1,12,0,Anarchism,,0,0,20210701093040,2.021070e+13,1030472204,96584,wikitext
2,13,0,AfghanistanHistory,,1,0,20210629133822,2.021061e+13,783865149,90,wikitext
3,14,0,AfghanistanGeography,,1,0,20210607122734,2.021061e+13,783865160,92,wikitext
4,15,0,AfghanistanPeople,,1,0,20210629123442,2.021061e+13,783865293,95,wikitext
...,...,...,...,...,...,...,...,...,...,...,...
53710509,68103359,0,Carrie_Flemmer,,0,1,20210701094546,2.021070e+13,1031387168,1300,wikitext
53710510,68103360,0,US_des_Forces_Armees,,1,1,20210701094549,2.021070e+13,1031387177,35,wikitext
53710512,68103362,0,Carrie_Flemmer-Marshall,,1,1,20210701094611,2.021070e+13,1031387212,27,wikitext
53710515,68103365,0,Dapp_Browsers,,0,1,20210701094630,2.021070e+13,1031387241,2682,wikitext


In [11]:
pages_id = pages['page_id'].tolist()

In [12]:
df = df[df['el_from'].isin(pages_id)]
df

Unnamed: 0,el_from,el_to
0,3850540,http://www.housing.berkeley.edu/housing/
1,3850540,http://www.freebornhall.com/History/ResidenceH...
4,840171,http://www.kloster-einsiedeln.ch
5,1290279,http://www.erbzine.com/mag1/0117.html
6,3856533,http://freepages.genealogy.rootsweb.com/~vanrc...
...,...,...
161719030,3889279,https://www.rnz.co.nz/news/political/419769/pr...
161719031,67209044,https://www.thewi.org.uk/about-us/history-of-t...
161719032,851024,https://www.imdb.com/name/nm0004133/
161719033,851024,https://www.imdb.com/name/nm6662944/


# 1. Preprocessing

The dataset is filtered to keep only URL mentions. The URLs are also checked to remove erroneous strings that were recorded as URLs.

In [13]:
df = df[df[['el_from', 'el_to']].notnull().all(axis=1)]
df = df[['el_from', 'el_to']]
df = df[df.el_to.str.contains('^http|^www[0-9]{0,2}')]
df.head()

Unnamed: 0,el_from,el_to
0,3850540,http://www.housing.berkeley.edu/housing/
1,3850540,http://www.freebornhall.com/History/ResidenceH...
4,840171,http://www.kloster-einsiedeln.ch
5,1290279,http://www.erbzine.com/mag1/0117.html
6,3856533,http://freepages.genealogy.rootsweb.com/~vanrc...


Before extracting the domains or modifying the URLs, links to the top [web archives](https://en.wikipedia.org/wiki/Wikipedia:List_of_web_archives_on_Wikipedia) are transformed so that the original (archived) URL is taken into account. Then, all URLs are filtered to remove erroneous ones.

## 1.1 Web archive

In [14]:
df['el_to'] = [re.sub('http[s]{0,1}://(web\.archive|waybackmachine)\.org/.*http', 'http', x) for x in df['el_to']]
df['el_to'] = [re.sub('http[s]{0,1}://(web\.archive|waybackmachine)\.org/.*www', 'www', x) for x in df['el_to']]
df['el_to'] = [re.sub('http[s]{0,1}://(web\.archive|waybackmachine)\.org/web/([0-9a-z]*|[0-9a-z]*\*|\*)/|http[s]{0,1}://(web\.archive|waybackmachine)\.org/\*/', '', x) for x in df['el_to']]
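As an illustration of what the three substitutions above do, here is a minimal sketch applied to single strings. The sample URLs and the helper name `unwrap_wayback` are hypothetical; the patterns are the ones from the cell, written as raw strings.

```python
import re

# Same three patterns as in the cell above, applied in the same order.
WAYBACK_RULES = [
    (r'http[s]{0,1}://(web\.archive|waybackmachine)\.org/.*http', 'http'),
    (r'http[s]{0,1}://(web\.archive|waybackmachine)\.org/.*www', 'www'),
    (r'http[s]{0,1}://(web\.archive|waybackmachine)\.org/web/([0-9a-z]*|[0-9a-z]*\*|\*)/'
     r'|http[s]{0,1}://(web\.archive|waybackmachine)\.org/\*/', ''),
]

def unwrap_wayback(url):
    """Replace a Wayback Machine URL with the archived URL it wraps (hypothetical helper)."""
    for pattern, repl in WAYBACK_RULES:
        url = re.sub(pattern, repl, url)
    return url

# Hypothetical sample URLs
print(unwrap_wayback('https://web.archive.org/web/20150101000000/http://example.com/page'))
print(unwrap_wayback('http://web.archive.org/web/2015*/www.example.org/a'))
print(unwrap_wayback('https://web.archive.org/web/*/example.net/b'))
```

In the third case the result has neither a scheme nor a `www` prefix, so a later `^http|^www[0-9]{0,2}` filter would drop it.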

## 1.2 archive.today

Note that not all archive.today URLs are built in the same way: some include the archived URL and others do not. The service also uses multiple domains.

In [15]:
df['el_to'] = [re.sub('http[s]{0,1}://archive\.(today|is|fo|li|vn|md)/.*http', 'http', x) for x in df['el_to']]
df['el_to'] = [re.sub('http[s]{0,1}://archive\.(today|is|fo|li|vn|md)/.*www', 'www', x) for x in df['el_to']]

## 1.3 Webcitation

In this case, most of them do not include the archived URL.

In [16]:
df['el_to'] = [re.sub('http[s]{0,1}://((www\.)?)webcitation\.org/.*http', 'http', x) for x in df['el_to']]
df['el_to'] = [re.sub('http[s]{0,1}://((www\.)?)webcitation\.org/.*www', 'www', x) for x in df['el_to']]

Finally, the result is filtered again to keep only strings that look like URLs.

In [17]:
df = df[df.el_to.str.contains('^http|^www[0-9]{0,2}')]

At this point, before normalization, there are 53,520,188 unique URLs.

In [18]:
len(df.groupby('el_to').count().index)

53520188

A new column, URL_n, is created to hold the normalized URLs. The lowercase conversion is left commented out, so the original case is preserved.

In [19]:
df['URL_n'] = [re.sub('^http://www[0-9]{0,2}\\.|^http://|^https://www[0-9]{0,2}\\.|^https://|^//www[0-9]{0,2}\\.', '', x) for x in df['el_to']]
# Note: the next line reads from el_to again, so it replaces the result of the line above.
# The scheme and www prefix are stripped again after url_normalize (cell In [24]).
df['URL_n'] = [re.sub('^http[s]{0,1}%3A%2F%2F((www\.)?)', '', x) for x in df['el_to']]
#df.loc[:,'URL_n']  = df.loc[:,'URL_n'].str.lower()
df.head()

Unnamed: 0,el_from,el_to,URL_n
0,3850540,http://www.housing.berkeley.edu/housing/,http://www.housing.berkeley.edu/housing/
1,3850540,http://www.freebornhall.com/History/ResidenceH...,http://www.freebornhall.com/History/ResidenceH...
4,840171,http://www.kloster-einsiedeln.ch,http://www.kloster-einsiedeln.ch
5,1290279,http://www.erbzine.com/mag1/0117.html,http://www.erbzine.com/mag1/0117.html
6,3856533,http://freepages.genealogy.rootsweb.com/~vanrc...,http://freepages.genealogy.rootsweb.com/~vanrc...


Some characters often appear at the end of a URL by mistake (escaped newline, tab and carriage-return sequences, backslashes, and trailing slashes), so they are removed.

In [20]:
#df['URL_n'] = [re.sub('#.*$', '', x) for x in df['URL_n']]
df['URL_n'] = [re.sub('((\\\\[ntr])+)$|\\\\+$', '', x) for x in df['URL_n']]
df['URL_n'] = [re.sub('/+$', '', x) for x in df['URL_n']]
#df['URL_n'] = [re.sub('\\.[s]{0,1}(htm|html|xml)$|\\.[s]{0,1}(htm|html|xml)$|\\.[s]{0,1}(htm|html|xml)/+$|\\.[s]{0,1}(htm|html|xml)/+$', '', x) for x in df['URL_n']]
df.head()

Unnamed: 0,el_from,el_to,URL_n
0,3850540,http://www.housing.berkeley.edu/housing/,http://www.housing.berkeley.edu/housing
1,3850540,http://www.freebornhall.com/History/ResidenceH...,http://www.freebornhall.com/History/ResidenceH...
4,840171,http://www.kloster-einsiedeln.ch,http://www.kloster-einsiedeln.ch
5,1290279,http://www.erbzine.com/mag1/0117.html,http://www.erbzine.com/mag1/0117.html
6,3856533,http://freepages.genealogy.rootsweb.com/~vanrc...,http://freepages.genealogy.rootsweb.com/~vanrc...
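The two active substitutions in the cell above can be sketched on single strings as follows. The helper name `strip_trailing` and the sample URLs are hypothetical; the patterns are the same ones, written as raw strings.

```python
import re

def strip_trailing(url):
    """Remove trailing escape residue and slashes (hypothetical helper)."""
    # Literal backslash followed by n, t or r (one or more times), or bare backslashes, at the end
    url = re.sub(r'((\\[ntr])+)$|\\+$', '', url)
    # One or more trailing slashes
    url = re.sub(r'/+$', '', url)
    return url

# '\\n' below is a literal backslash followed by the letter n, not a newline character
print(strip_trailing('http://example.com/a/\\n'))
print(strip_trailing('http://example.com//'))
```

The order matters: the escape residue is removed first, so a slash that preceded it is then stripped by the second substitution.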


Some URLs appear percent-encoded whereas others do not, so all of them are decoded.

In [21]:
df['URL_n'] = [urllib.parse.unquote(x) for x in df['URL_n']]
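A small sketch of what `urllib.parse.unquote` does to individual strings; the sample URLs are hypothetical.

```python
from urllib.parse import unquote

# Percent-encoded UTF-8 bytes are decoded to the characters they represent
print(unquote('example.org/wiki/Caf%C3%A9'))   # example.org/wiki/Café
print(unquote('example.org/a%20b'))            # example.org/a b

# Caveat: encoded reserved characters are decoded too, which can change the URL structure
print(unquote('example.org/a%2Fb'))            # example.org/a/b
```

After decoding, two spellings of the same address that differ only in their encoding become identical strings, which is the purpose of this step.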

Then all URLs are normalized with url_normalize (scrapy's canonicalize_url is imported as a similar alternative, but only url_normalize is applied in the cell below). If normalization raises an error, the URL is left unchanged. As the output shows, this happens in only a few cases (420). These URLs sometimes still work even though their structure is unusual, so they are not removed.

In [22]:
urls = df['URL_n'].tolist()
err_urls = []

for x in range(len(urls)):
    try:
        urls[x] = url_normalize(urls[x])
    except Exception:  # keep the URL unchanged and record the failure
        err_urls.append(urls[x])

print(len(err_urls))

420
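The "normalize, or keep the original on error" pattern used above can be sketched with the standard library alone. This is not url_normalize: `lowercase_host` is a hypothetical stand-in normalizer built on `urllib.parse`, and the sample URLs are made up.

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_or_keep(urls, normalizer):
    """Apply normalizer to each URL; on error keep the original and record it."""
    out, failed = [], []
    for url in urls:
        try:
            out.append(normalizer(url))
        except Exception:
            out.append(url)
            failed.append(url)
    return out, failed

def lowercase_host(url):
    """Stand-in normalizer (assumption): lowercases the host only."""
    parts = urlsplit(url)  # raises ValueError for an unbalanced '[' in the host
    return urlunsplit(parts._replace(netloc=parts.netloc.lower()))

result, errors = normalize_or_keep(
    ['http://EXAMPLE.com/Path', 'http://[broken/x'], lowercase_host)
print(result)
print(len(errors))
```

Keeping failed URLs in the output, as the notebook does, preserves the row count so the list can be assigned back to the column.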


In [23]:
df['URL_n'] = urls

In [24]:
df['URL_n'] = [re.sub('^http://www[0-9]{0,2}\\.|^http://|^https://www[0-9]{0,2}\\.|^https://|^//www[0-9]{0,2}\\.', '', x) for x in df['URL_n']]

# 2. Domains

Domains are extracted from the normalized URLs, and the data frame is filtered to remove rows with erroneous domains (empty strings or strings without a dot).

In [25]:
df['domain'] = [re.sub('/.*', '', x) for x in df['URL_n']]
df = df[df['domain'] != '']
df = df[df['domain'].str.contains('.', regex=False)]
df.head()

Unnamed: 0,el_from,el_to,URL_n,domain
0,3850540,http://www.housing.berkeley.edu/housing/,housing.berkeley.edu/housing,housing.berkeley.edu
1,3850540,http://www.freebornhall.com/History/ResidenceH...,freebornhall.com/History/ResidenceHalls,freebornhall.com
4,840171,http://www.kloster-einsiedeln.ch,kloster-einsiedeln.ch/,kloster-einsiedeln.ch
5,1290279,http://www.erbzine.com/mag1/0117.html,erbzine.com/mag1/0117.html,erbzine.com
6,3856533,http://freepages.genealogy.rootsweb.com/~vanrc...,freepages.genealogy.rootsweb.com/~vanrcwisner/...,freepages.genealogy.rootsweb.com
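The domain extraction and the two validity checks can be sketched on plain strings as follows. The helper names are hypothetical; the logic mirrors the cell above.

```python
import re

def extract_domain(url_n):
    """Everything before the first slash of a scheme-less URL (hypothetical helper)."""
    return re.sub('/.*', '', url_n)

def is_valid_domain(domain):
    """Same checks as above: not empty and containing at least one dot."""
    return domain != '' and '.' in domain

for url_n in ['housing.berkeley.edu/housing', 'kloster-einsiedeln.ch/', 'localhost/x', '/path']:
    domain = extract_domain(url_n)
    print(repr(domain), is_valid_domain(domain))
```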


# 3. Top domains modifications

The following modifications were made to the most mentioned domains.

In [26]:
domains_freq = pd.value_counts(df.domain).to_frame().reset_index()
domains_freq[0:50]

Unnamed: 0,index,domain
0,wikidata.org,1568914
1,books.google.com,1561114
2,worldcat.org,1055887
3,viaf.org,926592
4,archive.org,879754
5,jstor.org,731514
6,id.loc.gov,675470
7,ssd.jpl.nasa.gov,665722
8,minorplanetcenter.net,636808
9,isni.org,521135


The 50 most mentioned domains account for about 28% of all mentions.

In [27]:
100 * domains_freq['domain'][0:50].sum() / domains_freq['domain'].sum()

27.696590018936167
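The share computed above is the sum of the counts of the k most frequent domains divided by the total number of mentions. A minimal sketch with `collections.Counter` on toy data (the function name and the data are hypothetical):

```python
from collections import Counter

def top_share(domains, k):
    """Percentage of all mentions that belong to the k most frequent domains."""
    counts = Counter(domains)
    total = sum(counts.values())
    top = sum(count for _, count in counts.most_common(k))
    return 100 * top / total

toy = ['a.org'] * 6 + ['b.org'] * 3 + ['c.org']
print(top_share(toy, 1))  # 60.0
print(top_share(toy, 2))  # 90.0
```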

There are a total of 3,069,254 different domains.

In [28]:
len(pd.value_counts(df.domain).to_frame().reset_index())

3069254

This is not an exhaustive process: not every URL is examined to detect all failures and normalize them thoroughly. An algorithm to clean all URLs automatically was considered, but it was not feasible.

It could lead to some relevant mistakes, for instance in the case of http://www.jstor.org.

## Temporary files

In [29]:
df.to_csv('url_normalization/temp_ext.csv', index=False)

In [None]:
df = pd.read_csv('url_normalization/temp_ext.csv')
df

In [None]:
df['el_to'] = df['el_to'].astype(str)
df['URL_n'] = df['URL_n'].astype(str)

In [None]:
df[df['el_from'] == 37852327]

### 1. wikidata.org

Minor changes: the interface-language parameter (uselang) is removed, keeping any fragment.

In [30]:
list(df.loc[df['domain'] == 'wikidata.org', 'URL_n'])[0:5]

['wikidata.org/wiki/Special:WhatLinksHere/Property:P496',
 'wikidata.org/wiki/Property:P827',
 'wikidata.org/wiki/Q5749041',
 'wikidata.org/wiki/Q12198',
 'wikidata.org/wiki/Q19828893']

In [31]:
df.loc[df['domain'] == 'wikidata.org', 'URL_n'] = [re.sub('\?uselang=.*#', '#', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'wikidata.org', 'URL_n']]
df.loc[df['domain'] == 'wikidata.org', 'URL_n'] = [re.sub('\?uselang=.*', '', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'wikidata.org', 'URL_n']]

In [32]:
list(df.loc[df['domain'] == 'wikidata.org', 'URL_n'])[0:5]

['wikidata.org/wiki/Special:WhatLinksHere/Property:P496',
 'wikidata.org/wiki/Property:P827',
 'wikidata.org/wiki/Q5749041',
 'wikidata.org/wiki/Q12198',
 'wikidata.org/wiki/Q19828893']
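The first five URLs are unchanged because none of them carries the parameter. A minimal sketch on hypothetical URLs that do (the helper name `strip_uselang` is also hypothetical):

```python
import re

def strip_uselang(url_n):
    """Drop ?uselang=... while keeping a fragment if one follows."""
    url_n = re.sub(r'\?uselang=.*#', '#', url_n, flags=re.IGNORECASE)
    url_n = re.sub(r'\?uselang=.*', '', url_n, flags=re.IGNORECASE)
    return url_n

print(strip_uselang('wikidata.org/wiki/Q42?uselang=es'))
print(strip_uselang('wikidata.org/wiki/Q42?uselang=fr#P31'))
```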

### 2. books.google.com

These URLs require a lot of transformations to be reduced.

In [33]:
list(df.loc[df['domain'] == 'books.google.com', 'URL_n'])[0:5]

['books.google.com/books?oi=spell&q=Demetrius+Hondros%20&spell=1',
 'books.google.com/books?vid=OCLC02621211',
 'books.google.com/books?hl=fr&id=nuoBAAAAYAAJ',
 'books.google.com/books?id=k561uXI-uPgC&printsec',
 'books.google.com/books?id=kFpd86J8PLsC&printsec']

In [35]:
df.loc[df['domain'] == 'books.google.com', 'URL_n'] = [re.sub('\?.*&id=', '?id=', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'books.google.com', 'URL_n']]
df.loc[df['domain'] == 'books.google.com', 'URL_n'] = [re.sub('(\?id=.*?&)(.*)', r'\1', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'books.google.com', 'URL_n']]
df.loc[df['domain'] == 'books.google.com', 'URL_n'] = [re.sub('\?.*&q=', '?q=', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'books.google.com', 'URL_n']]
df.loc[df['domain'] == 'books.google.com', 'URL_n'] = [re.sub('ngrams/graph\?.*&content=', 'ngrams/graph?content=', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'books.google.com', 'URL_n']]
df.loc[df['domain'] == 'books.google.com', 'URL_n'] = [re.sub('&.*', '', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'books.google.com', 'URL_n']]
df.loc[df['domain'] == 'books.google.com', 'URL_n'] = [re.sub('/books.*\?id=', '/?id=', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'books.google.com', 'URL_n']]
df.loc[df['domain'] == 'books.google.com', 'URL_n'] = [re.sub('/books.*\?vid=', '/?vid=', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'books.google.com', 'URL_n']]
df.loc[df['domain'] == 'books.google.com', 'URL_n'] = [re.sub('/books.*\?q=', '/?q=', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'books.google.com', 'URL_n']]
df.loc[df['domain'] == 'books.google.com', 'URL_n'] = [re.sub('#(v|search).*', '', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'books.google.com', 'URL_n']]

In [37]:
list(df.loc[df['domain'] == 'books.google.com', 'URL_n'])[0:5]

['books.google.com/?q=Demetrius+Hondros%20',
 'books.google.com/?vid=OCLC02621211',
 'books.google.com/?id=nuoBAAAAYAAJ',
 'books.google.com/?id=k561uXI-uPgC',
 'books.google.com/?id=kFpd86J8PLsC']
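Because the order of the substitutions matters here, it can help to view them as an ordered rule table applied to one string at a time. The sketch below uses the same nine patterns (as raw strings); the names `BOOKS_RULES` and `normalize_books` are hypothetical.

```python
import re

# (pattern, replacement) pairs, applied in this order
BOOKS_RULES = [
    (r'\?.*&id=', '?id='),                 # move id to the front of the query
    (r'(\?id=.*?&)(.*)', r'\1'),           # drop everything after the id parameter
    (r'\?.*&q=', '?q='),                   # move q to the front of the query
    (r'ngrams/graph\?.*&content=', 'ngrams/graph?content='),
    (r'&.*', ''),                          # drop any remaining parameters
    (r'/books.*\?id=', '/?id='),           # collapse the path
    (r'/books.*\?vid=', '/?vid='),
    (r'/books.*\?q=', '/?q='),
    (r'#(v|search).*', ''),                # drop viewer and search fragments
]

def normalize_books(url_n):
    for pattern, repl in BOOKS_RULES:
        url_n = re.sub(pattern, repl, url_n, flags=re.IGNORECASE)
    return url_n

print(normalize_books('books.google.com/books?hl=fr&id=nuoBAAAAYAAJ'))
```

Applied to the five sample URLs listed before the transformation, this reproduces the five results shown above.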

### 3. worldcat.org

The referer, tab, ht, qt and submit parameters are removed, along with any leftover trailing ? or &.

In [38]:
list(df[df['domain'] == 'worldcat.org']['URL_n'])[0:5]

['worldcat.org/',
 'worldcat.org/search?fq=yr:1906&q=Butmi&qt=faceted',
 'worldcat.org/search?fq=yr:1907&q=Butmi&qt=faceted',
 'worldcat.org/oclc/33311040&tab=editions',
 'worldcat.org/search?q=su:Seventh-Day+Adventists+Periodicals.&qt=hot_subject']

In [39]:
df.loc[df['domain'] == 'worldcat.org', 'URL_n'] = [re.sub('(referer=|tab=|ht=|qt=|submit=).*?(&|$)', '', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'worldcat.org', 'URL_n']]
df.loc[df['domain'] == 'worldcat.org', 'URL_n'] = [re.sub('(\?$|\?&+$|&+$)', '', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'worldcat.org', 'URL_n']]

In [40]:
list(df[df['domain'] == 'worldcat.org']['URL_n'])[0:5]

['worldcat.org/',
 'worldcat.org/search?fq=yr:1906&q=Butmi',
 'worldcat.org/search?fq=yr:1907&q=Butmi',
 'worldcat.org/oclc/33311040',
 'worldcat.org/search?q=su:Seventh-Day+Adventists+Periodicals.']
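A minimal sketch of the two steps on single strings: the first removes the listed parameters, the second cleans up the separators left behind. The helper name `normalize_worldcat` is hypothetical.

```python
import re

def normalize_worldcat(url_n):
    # Remove each listed parameter together with the separator that follows it (or up to the end)
    url_n = re.sub(r'(referer=|tab=|ht=|qt=|submit=).*?(&|$)', '', url_n, flags=re.IGNORECASE)
    # Remove a dangling '?' or '&' at the end
    url_n = re.sub(r'(\?$|\?&+$|&+$)', '', url_n, flags=re.IGNORECASE)
    return url_n

print(normalize_worldcat('worldcat.org/search?fq=yr:1906&q=Butmi&qt=faceted'))
print(normalize_worldcat('worldcat.org/oclc/33311040&tab=editions'))
```

One caveat worth keeping in mind: the parameter names are not anchored to a `?` or `&`, so a name such as `ht=` would also match the tail of a longer parameter name ending in those letters.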

### 4. viaf.org

The sortkeys parameter and trailing /# fragments are removed.

In [41]:
list(df[df['domain'] == 'viaf.org']['URL_n'])[0:5]

['viaf.org/viaf/54944866',
 'viaf.org/processed/NUKAT%7Cn%202010124544',
 'viaf.org/viaf/21283575/#Glover%2C_Stephen%2C_1813-1870',
 'viaf.org/viaf/105950421',
 'viaf.org/viaf/83649420/#Aguera_y_Arcas%2C_Blaise']

In [42]:
df.loc[df['domain'] == 'viaf.org', 'URL_n'] = [re.sub('&sortkeys=.*', '', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'viaf.org', 'URL_n']]
df.loc[df['domain'] == 'viaf.org', 'URL_n'] = [re.sub('/#.*', '', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'viaf.org', 'URL_n']]

In [43]:
list(df[df['domain'] == 'viaf.org']['URL_n'])[0:5]

['viaf.org/viaf/54944866',
 'viaf.org/processed/NUKAT%7Cn%202010124544',
 'viaf.org/viaf/21283575',
 'viaf.org/viaf/105950421',
 'viaf.org/viaf/83649420']

### 5. archive.org

Page paths (/page/...) and the #page, #mode and #start fragments are removed.

In [44]:
list(df.loc[df['domain'] == 'archive.org', 'URL_n'])[0:5]

['archive.org/details/historyofbowling008027mbp',
 'archive.org/details/lightshipslighth00talbuoft',
 'archive.org/details/jimconnolly00schl',
 'archive.org/details/aramaicpapyrioff00ahikuoft',
 'archive.org/stream/flyingforfrancewmcco#page%2F42%2Fmode%2F2up']

In [45]:
df.loc[df['domain'] == 'archive.org', 'URL_n'] = [re.sub('/page/.*', '', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'archive.org', 'URL_n']]
df.loc[df['domain'] == 'archive.org', 'URL_n'] = [re.sub('#(page|mode|start).*', '', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'archive.org', 'URL_n']]

In [46]:
list(df.loc[df['domain'] == 'archive.org', 'URL_n'])[0:5]

['archive.org/details/historyofbowling008027mbp',
 'archive.org/details/lightshipslighth00talbuoft',
 'archive.org/details/jimconnolly00schl',
 'archive.org/details/aramaicpapyrioff00ahikuoft',
 'archive.org/stream/flyingforfrancewmcco']

### 6. jstor.org

Queries that start with seq, sid, cookieSet, origin or ref, and the #page, #meta and #fndtn fragments, are removed.

In [47]:
list(df.loc[df['domain'] == 'jstor.org', 'URL_n'])[0:5]

['jstor.org/',
 'jstor.org/pss/189392',
 'jstor.org/pss/607190',
 'jstor.org/pss/772442',
 'jstor.org/stable/1144537?seq=1']

In [48]:
df.loc[df['domain'] == 'jstor.org', 'URL_n'] = [re.sub('([0-9])(\?(seq=|sid=|cookieSet=|origin=|ref=).*)', r'\1', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'jstor.org', 'URL_n']]
df.loc[df['domain'] == 'jstor.org', 'URL_n'] = [re.sub('#(page|meta|fndtn).*', '', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'jstor.org', 'URL_n']]

In [49]:
list(df.loc[df['domain'] == 'jstor.org', 'URL_n'])[0:5]

['jstor.org/',
 'jstor.org/pss/189392',
 'jstor.org/pss/607190',
 'jstor.org/pss/772442',
 'jstor.org/stable/1144537']
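A short sketch of the two jstor.org substitutions on single strings. The helper name `normalize_jstor` is hypothetical, and all sample URLs except the first are made up for illustration.

```python
import re

def normalize_jstor(url_n):
    # Drop the query when it follows a digit and starts with one of the listed parameters
    url_n = re.sub(r'([0-9])(\?(seq=|sid=|cookieSet=|origin=|ref=).*)', r'\1',
                   url_n, flags=re.IGNORECASE)
    # Drop viewer-related fragments
    url_n = re.sub(r'#(page|meta|fndtn).*', '', url_n, flags=re.IGNORECASE)
    return url_n

print(normalize_jstor('jstor.org/stable/1144537?seq=1'))
print(normalize_jstor('jstor.org/stable/1144537#metadata_info_tab_contents'))
print(normalize_jstor('jstor.org/action/doBasicSearch?Query=x'))
```

The digit requirement means that queries on non-numeric paths, such as the search URL in the last line, are left untouched.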

### 7. id.loc.gov

No changes required.

In [50]:
list(df.loc[df['domain'] == 'id.loc.gov', 'URL_n'])[0:5]

['id.loc.gov/authorities/sh2006000275',
 'id.loc.gov/authorities/about.html',
 'id.loc.gov/authorities/sh2006000391#concept',
 'id.loc.gov/authorities/sh2006000395#concept',
 'id.loc.gov/authorities/sh91003975#concept']

### 8. ssd.jpl.nasa.gov

No changes required.

In [51]:
list(df.loc[df['domain'] == 'ssd.jpl.nasa.gov', 'URL_n'])[0:5]

['ssd.jpl.nasa.gov/horizons.html',
 'ssd.jpl.nasa.gov/?sat_phys_par',
 'ssd.jpl.nasa.gov/?great_comets',
 'ssd.jpl.nasa.gov/great_comets.html',
 'ssd.jpl.nasa.gov/horizons.cgi']

### 9. minorplanetcenter.net

One minor parameter (commit=show) is removed.

In [52]:
list(df.loc[df['domain'] == 'minorplanetcenter.net', 'URL_n'])[0:5]

['minorplanetcenter.net/mpec/K06/K06N38.html',
 'minorplanetcenter.net/iau/Ephemerides/Comets',
 'minorplanetcenter.net/iau/Ephemerides/Comets',
 'minorplanetcenter.net/iau/Ephemerides/Comets',
 'minorplanetcenter.net/iau/Ephemerides/Comets']

In [53]:
df.loc[df['domain'] == 'minorplanetcenter.net', 'URL_n'] = [re.sub('commit=show&', '', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'minorplanetcenter.net', 'URL_n']]

In [54]:
list(df.loc[df['domain'] == 'minorplanetcenter.net', 'URL_n'])[0:5]

['minorplanetcenter.net/mpec/K06/K06N38.html',
 'minorplanetcenter.net/iau/Ephemerides/Comets',
 'minorplanetcenter.net/iau/Ephemerides/Comets',
 'minorplanetcenter.net/iau/Ephemerides/Comets',
 'minorplanetcenter.net/iau/Ephemerides/Comets']

### 10. isni.org

No changes required.

In [55]:
list(df.loc[df['domain'] == 'isni.org', 'URL_n'])[0:5]

['isni.org/content/orcid-begin-using-ringgold-registration-agency',
 'isni.org/0000000115679704',
 'isni.org/isni_and_orcid',
 'isni.org/',
 'isni.org/about']

### 11. nytimes.com

Everything after .html (query string or fragment) is removed.

In [56]:
list(df.loc[df['domain'] == 'nytimes.com', 'URL_n'])[0:5]

['nytimes.com/2006/02/17/business/17richOBT.html',
 'nytimes.com/2002/06/02/international/asia/02HEPA.html?ei=5070&en=3d41c410bad15410&ex=1157169600',
 'nytimes.com/aponline/us/AP-Ancient-Bee.html',
 'nytimes.com/2007/01/27/obituaries/27christensen.html',
 'nytimes.com/2007/05/15/world/middleeast/15embed.html']

In [57]:
df.loc[df['domain'] == 'nytimes.com', 'URL_n'] = [re.sub('\.html\?.*', '.html', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'nytimes.com', 'URL_n']]
df.loc[df['domain'] == 'nytimes.com', 'URL_n'] = [re.sub('\.html#.*', '.html', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'nytimes.com', 'URL_n']]

In [58]:
list(df.loc[df['domain'] == 'nytimes.com', 'URL_n'])[0:5]

['nytimes.com/2006/02/17/business/17richOBT.html',
 'nytimes.com/2002/06/02/international/asia/02HEPA.html',
 'nytimes.com/aponline/us/AP-Ancient-Bee.html',
 'nytimes.com/2007/01/27/obituaries/27christensen.html',
 'nytimes.com/2007/05/15/world/middleeast/15embed.html']

### 12. imdb.com

Here, the same resource is sometimes linked with different sections as the target, for example a specific year of awards or the cast of a movie. Only the ref/ref_ and mode parameters are removed.

In [59]:
list(df.loc[df['domain'] == 'imdb.com', 'URL_n'])[0:5]

['imdb.com/title/tt0353168',
 'imdb.com/title/tt0224837',
 'imdb.com/title/tt0079878',
 'imdb.com/title/tt0107414',
 'imdb.com/wga']

In [60]:
df.loc[df['domain'] == 'imdb.com', 'URL_n'] = [re.sub('\?(ref_|ref)=[a-z0-9_]+&', '?', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'imdb.com', 'URL_n']]
df.loc[df['domain'] == 'imdb.com', 'URL_n'] = [re.sub('&(ref_|ref)=[a-z0-9_]+&', '&', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'imdb.com', 'URL_n']]
df.loc[df['domain'] == 'imdb.com', 'URL_n'] = [re.sub('(\?|&)(ref_|ref)=[a-z0-9_]+#', '#', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'imdb.com', 'URL_n']]
df.loc[df['domain'] == 'imdb.com', 'URL_n'] = [re.sub('(\?|&)(ref_|ref)=[a-z0-9_]+$', '', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'imdb.com', 'URL_n']]
df.loc[df['domain'] == 'imdb.com', 'URL_n'] = [re.sub('\?mode=[a-z]+&', '?', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'imdb.com', 'URL_n']]
df.loc[df['domain'] == 'imdb.com', 'URL_n'] = [re.sub('&mode=[a-z]+&', '&', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'imdb.com', 'URL_n']]
df.loc[df['domain'] == 'imdb.com', 'URL_n'] = [re.sub('(\?|&)mode=[a-z]+#', '#', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'imdb.com', 'URL_n']]
df.loc[df['domain'] == 'imdb.com', 'URL_n'] = [re.sub('(\?|&)mode=[a-z]+$', '', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'imdb.com', 'URL_n']]
df.loc[df['domain'] == 'imdb.com', 'URL_n'] = [re.sub('\?(&+)', '?', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'imdb.com', 'URL_n']]

In [61]:
list(df.loc[df['domain'] == 'imdb.com', 'URL_n'])[0:5]

['imdb.com/title/tt0353168',
 'imdb.com/title/tt0224837',
 'imdb.com/title/tt0079878',
 'imdb.com/title/tt0107414',
 'imdb.com/wga']
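As an alternative to position-specific regular expressions, the same parameters could be dropped by parsing the query string. This is a different technique from the one used in the cell above, shown only as a sketch with `urllib.parse`; the helper name `drop_params` and the sample URLs are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def drop_params(url_n, names):
    """Remove the named query parameters from a scheme-less URL."""
    # URL_n has no scheme, so '//' is prepended to make urlsplit recognise the host
    parts = urlsplit('//' + url_n)
    kept = [(key, value)
            for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if key.lower() not in names]
    # Note: urlencode re-encodes the kept values (for example, spaces become '+')
    rebuilt = urlunsplit(parts._replace(query=urlencode(kept)))
    return rebuilt[2:]  # strip the '//' added above

IMDB_DROP = {'ref_', 'ref', 'mode'}
print(drop_params('imdb.com/title/tt0353168/awards?ref_=tt_awd&year=2004#top', IMDB_DROP))
print(drop_params('imdb.com/find?q=alien&mode=simple&s=tt', IMDB_DROP))
```

This handles a parameter in any position with one rule, at the cost of re-encoding the remaining query values.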

### 13. ncbi.nlm.nih.gov

One minor parameter (?page) is removed.

In [62]:
list(df.loc[df['domain'] == 'ncbi.nlm.nih.gov', 'URL_n'])[0:5]

['ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=protein&val=90101392',
 'ncbi.nlm.nih.gov/About/model',
 'ncbi.nlm.nih.gov/pmc/articles/PMC1242907/?page=1',
 'ncbi.nlm.nih.gov/entrez/dispomim.cgi?id=112600',
 'ncbi.nlm.nih.gov/mapview/maps.cgi?chr=Y&taxid=9606']

In [63]:
df.loc[df['domain'] == 'ncbi.nlm.nih.gov', 'URL_n'] = [re.sub('/\?page.*', '', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'ncbi.nlm.nih.gov', 'URL_n']]

In [64]:
list(df.loc[df['domain'] == 'ncbi.nlm.nih.gov', 'URL_n'])[0:5]

['ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=protein&val=90101392',
 'ncbi.nlm.nih.gov/About/model',
 'ncbi.nlm.nih.gov/pmc/articles/PMC1242907',
 'ncbi.nlm.nih.gov/entrez/dispomim.cgi?id=112600',
 'ncbi.nlm.nih.gov/mapview/maps.cgi?chr=Y&taxid=9606']

### 14. youtube.com

The time parameter (t) and other unnecessary parameters are removed, keeping only the video identifier (v) or the search query (search_query).

In [65]:
list(df.loc[df['domain'] == 'youtube.com', 'URL_n'])[0:5]

['youtube.com/watch?mode=user&search=&v=utyPmSlounc',
 'youtube.com/watch?feature=PlayList&index=2&p=C890829FE50FA652&v=EUj9FrGLfsc',
 'youtube.com/watch?feature=channel&v=rpBBCd8wxiM',
 'youtube.com/watch?v=7SLiN6yXeZs',
 'youtube.com/greenhatangel']

In [66]:
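# Drop a leading t= parameter only; [^&]* stops at the first & so later parameters such as v= are kept
df.loc[df['domain'] == 'youtube.com', 'URL_n'] = [re.sub('\?t=[^&]*&', '?', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'youtube.com', 'URL_n']]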
df.loc[df['domain'] == 'youtube.com', 'URL_n'] = [re.sub('\?t=.*$', '', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'youtube.com', 'URL_n']]
df.loc[df['domain'] == 'youtube.com', 'URL_n'] = [re.sub('\?.*&v=', '?v=', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'youtube.com', 'URL_n']]
df.loc[df['domain'] == 'youtube.com', 'URL_n'] = [re.sub('\?.*&search_query=', '?search_query=', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'youtube.com', 'URL_n']]
df.loc[df['domain'] == 'youtube.com', 'URL_n'] = [re.sub('&.*', '', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'youtube.com', 'URL_n']]
df.loc[df['domain'] == 'youtube.com', 'URL_n'] = [re.sub('\?(flow|feat).*', '', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'youtube.com', 'URL_n']]
df.loc[df['domain'] == 'youtube.com', 'URL_n'] = [re.sub('(;t=|#t|#at).*', '', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'youtube.com', 'URL_n']]

In [67]:
list(df.loc[df['domain'] == 'youtube.com', 'URL_n'])[0:5]

['youtube.com/watch?v=utyPmSlounc',
 'youtube.com/watch?v=EUj9FrGLfsc',
 'youtube.com/watch?v=rpBBCd8wxiM',
 'youtube.com/watch?v=7SLiN6yXeZs',
 'youtube.com/greenhatangel']

### 15. d-nb.info

No changes required.

In [68]:
list(df.loc[df['domain'] == 'd-nb.info', 'URL_n'])[0:5]

['d-nb.info/850838274',
 'd-nb.info/831224622',
 'd-nb.info/57228621X',
 'd-nb.info/365594431',
 'd-nb.info/572286309']

### 16. gbif.org

No changes required.

In [69]:
list(df.loc[df['domain'] == 'gbif.org', 'URL_n'])[0:5]

['gbif.org/UsesPrimaryData.pdf',
 'gbif.org/species/124789228',
 'gbif.org/species/2286534',
 'gbif.org/species/7022067',
 'gbif.org/species/7022098']

### 17. bbc.co.uk

Some old URLs include parameters that cannot be removed. Only queries that start with ocid, intlink, print or ns_ and #TWEET fragments are stripped.

In [70]:
list(df.loc[df['domain'] == 'bbc.co.uk', 'URL_n'])[0:5]

['bbc.co.uk/nature/wildfacts/factfiles/608.shtml',
 'bbc.co.uk/radio4/womanshour/2004_48_mon_01.shtml',
 'bbc.co.uk/comedy/coupling',
 'bbc.co.uk/wales/music/sites/datblygu/pages/datblygu_biography.shtml',
 'bbc.co.uk/radio4/markthomaspresents/pip/2f6i0']

In [71]:
df.loc[df['domain'] == 'bbc.co.uk', 'URL_n'] = [re.sub('\?(ocid|intlink|print|ns_).*', '', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'bbc.co.uk', 'URL_n']]
df.loc[df['domain'] == 'bbc.co.uk', 'URL_n'] = [re.sub('#TWEET.*', '', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'bbc.co.uk', 'URL_n']]

In [72]:
list(df.loc[df['domain'] == 'bbc.co.uk', 'URL_n'])[0:5]

['bbc.co.uk/nature/wildfacts/factfiles/608.shtml',
 'bbc.co.uk/radio4/womanshour/2004_48_mon_01.shtml',
 'bbc.co.uk/comedy/coupling',
 'bbc.co.uk/wales/music/sites/datblygu/pages/datblygu_biography.shtml',
 'bbc.co.uk/radio4/markthomaspresents/pip/2f6i0']

### 18. billboard.com

The chart/?f= parameter selects the chart, so it cannot be removed. Other parameters, such as the order, the start page or the rank, can also change the page, so no changes are made.

In [73]:
list(df.loc[df['domain'] == 'billboard.com', 'URL_n'])[0:5]

['billboard.com/bb/reviews/album_article_display.jsp?vnu_content_id=1001262309',
 'billboard.com/bb/daily/article_display.jsp?vnu_content_id=1000550918',
 'billboard.com/bbcom/news/article_display.jsp?imw=Y&vnu_content_id=1001433922',
 'billboard.com/bbcom/bio/index.jsp?cr=artist&kw=pray%20for%20the%20soul%20of%20betty&or=ASCENDING&pid=659311&sf=length',
 'billboard.com/bbcom/retrieve_chart_history.do?model.vnuAlbumId=703015&model.vnuArtistId=664010']

### 19. api.semanticscholar.org

No changes required.

In [74]:
list(df.loc[df['domain'] == 'api.semanticscholar.org', 'URL_n'])[0:5]

['api.semanticscholar.org/CorpusID:63576455',
 'api.semanticscholar.org/CorpusID:120313863',
 'api.semanticscholar.org/CorpusID:33224883',
 'api.semanticscholar.org/CorpusID:1749330',
 'api.semanticscholar.org/CorpusID:18918768']

### 20. musicbrainz.org

No changes required.

In [75]:
list(df.loc[df['domain'] == 'musicbrainz.org', 'URL_n'])[0:5]

['musicbrainz.org/artist/08874e35-f5de-4864-88aa-cdc4248a7138.html',
 'musicbrainz.org/artist/86864388-ecf6-4d80-b22a-692f38023563.html',
 'musicbrainz.org/artist/93266cca-6c66-4b82-84a1-7eae5bc35b09.html',
 'musicbrainz.org/artist/63f5e91e-6407-4eac-b3c3-0bfb82ef5951.html',
 'musicbrainz.org/artist/c6b0db5a-d750-4ed8-9caa-ddcfb75dcb0a.html']

### 21. data.bnf.fr

No changes required.

In [76]:
list(df.loc[df['domain'] == 'data.bnf.fr', 'URL_n'])[0:5]

['data.bnf.fr/12179004/pedro_de_aguado',
 'data.bnf.fr/12361945/nyoro__peuple_d_afrique_',
 'data.bnf.fr/12488303/nkole__peuple_d_afrique_',
 'data.bnf.fr/15742308/ngata__peuple_d_afrique_',
 'data.bnf.fr/semanticweb']

### 22. catalogue.bnf.fr

No changes required.

In [77]:
list(df.loc[df['domain'] == 'catalogue.bnf.fr', 'URL_n'])[0:5]

['catalogue.bnf.fr/',
 'catalogue.bnf.fr/',
 'catalogue.bnf.fr/',
 'catalogue.bnf.fr/',
 'catalogue.bnf.fr/ark:/12148/bpt6k200978m.table']

### 23. npgallery.nps.gov

One minor fragment (#page) is removed.

In [78]:
list(df.loc[df['domain'] == 'npgallery.nps.gov', 'URL_n'])[0:5]

['npgallery.nps.gov/nrhp/Download?path=/natreg/docs/All_Data.html',
 'npgallery.nps.gov/nrhp',
 'npgallery.nps.gov/nrhp/GetAsset?assetID=984b4b46-d5ec-47a5-a4dd-43e575a397bf',
 'npgallery.nps.gov/pdfhost/docs/NRHP/Text/85002566.pdf',
 'npgallery.nps.gov/nrhp/AssetDetail?assetID=8d915fd4-3677-4864-bd1a-933b1d3c92a9']

In [79]:
df.loc[df['domain'] == 'npgallery.nps.gov', 'URL_n'] = [re.sub('#page.*', '', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'npgallery.nps.gov', 'URL_n']]

In [80]:
list(df.loc[df['domain'] == 'npgallery.nps.gov', 'URL_n'])[0:5]

['npgallery.nps.gov/nrhp/Download?path=/natreg/docs/All_Data.html',
 'npgallery.nps.gov/nrhp',
 'npgallery.nps.gov/nrhp/GetAsset?assetID=984b4b46-d5ec-47a5-a4dd-43e575a397bf',
 'npgallery.nps.gov/pdfhost/docs/NRHP/Text/85002566.pdf',
 'npgallery.nps.gov/nrhp/AssetDetail?assetID=8d915fd4-3677-4864-bd1a-933b1d3c92a9']

### 24. theguardian.com

The INTCMP, newsfeed, cmp and feed parameters are not useful, so they are removed (fragments are kept).

In [81]:
list(df.loc[df['domain'] == 'theguardian.com', 'URL_n'])[0:5]

['theguardian.com/profile/jesscartnermorley',
 'theguardian.com/music/series/newbandoftheday',
 'theguardian.com/uk-news/2013/jul/24/alan-partridge-norwich-alpha-papa-movie-premiere',
 'theguardian.com/music/2013/sep/06/metamono-new-band-of-day',
 'theguardian.com/culture/2013/oct/22/how-we-made-sandman-gaiman?INTCMP=ILCNETTXT3487']

In [82]:
df.loc[df['domain'] == 'theguardian.com', 'URL_n'] = [re.sub('\?(INTCMP=|newsfeed=|cmp=|feed=).*#', '#', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'theguardian.com', 'URL_n']]
df.loc[df['domain'] == 'theguardian.com', 'URL_n'] = [re.sub('\?(INTCMP=|newsfeed=|cmp=|feed=).*', '', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'theguardian.com', 'URL_n']]

In [83]:
list(df.loc[df['domain'] == 'theguardian.com', 'URL_n'])[0:5]

['theguardian.com/profile/jesscartnermorley',
 'theguardian.com/music/series/newbandoftheday',
 'theguardian.com/uk-news/2013/jul/24/alan-partridge-norwich-alpha-papa-movie-premiere',
 'theguardian.com/music/2013/sep/06/metamono-new-band-of-day',
 'theguardian.com/culture/2013/oct/22/how-we-made-sandman-gaiman']

### 25. amigo.geneontology.org

One minor error is fixed: a trailing comma is removed.

In [84]:
list(df.loc[df['domain'] == 'amigo.geneontology.org', 'URL_n'])[0:5]

['amigo.geneontology.org/cgi-bin/amigo/go.cgi?depth=0&query=8373&search_constraint=terms&view=details',
 'amigo.geneontology.org/cgi-bin/amigo/gp-assoc.cgi?gp=MGI:MGI:2387648&session_id=9844amigo1367648742',
 'amigo.geneontology.org/cgi-bin/amigo/term_details?term=GO:0016006',
 'amigo.geneontology.org/amigo/term/GO:0001664',
 'amigo.geneontology.org/amigo/term/GO:0005102']

In [85]:
df.loc[df['domain'] == 'amigo.geneontology.org', 'URL_n'] = [re.sub(',$', '', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'amigo.geneontology.org', 'URL_n']]

In [86]:
list(df.loc[df['domain'] == 'amigo.geneontology.org', 'URL_n'])[0:5]

['amigo.geneontology.org/cgi-bin/amigo/go.cgi?depth=0&query=8373&search_constraint=terms&view=details',
 'amigo.geneontology.org/cgi-bin/amigo/gp-assoc.cgi?gp=MGI:MGI:2387648&session_id=9844amigo1367648742',
 'amigo.geneontology.org/cgi-bin/amigo/term_details?term=GO:0016006',
 'amigo.geneontology.org/amigo/term/GO:0001664',
 'amigo.geneontology.org/amigo/term/GO:0005102']

### 26. irmng.org

No changes required.

In [87]:
list(df.loc[df['domain'] == 'irmng.org', 'URL_n'])[0:5]

['irmng.org/aphia.php?id=10360369&p=taxdetails',
 'irmng.org/aphia.php?id=11043809&p=taxdetails',
 'irmng.org/aphia.php?id=10537593&p=taxdetails',
 'irmng.org/aphia.php?id=10537593&p=taxdetails',
 'irmng.org/aphia.php?id=11321007&p=taxdetails']

### 27. ui.adsabs.harvard.edu

The "/abstract" suffix is removed because the URL goes to the abstract view by default.

In [88]:
list(df.loc[df['domain'] == 'ui.adsabs.harvard.edu', 'URL_n'])[0:5]

['ui.adsabs.harvard.edu/',
 'ui.adsabs.harvard.edu/',
 'ui.adsabs.harvard.edu/#abs%2F1989A%26A...219..125M',
 'ui.adsabs.harvard.edu/#abs%2F2012BASI...40...51K',
 'ui.adsabs.harvard.edu/#abs%2F2000A%26A...361..945L']

In [89]:
df.loc[df['domain'] == 'ui.adsabs.harvard.edu', 'URL_n'] = [re.sub('/abstract$', '', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'ui.adsabs.harvard.edu', 'URL_n']]

In [90]:
list(df.loc[df['domain'] == 'ui.adsabs.harvard.edu', 'URL_n'])[0:5]

['ui.adsabs.harvard.edu/',
 'ui.adsabs.harvard.edu/',
 'ui.adsabs.harvard.edu/#abs%2F1989A%26A...219..125M',
 'ui.adsabs.harvard.edu/#abs%2F2012BASI...40...51K',
 'ui.adsabs.harvard.edu/#abs%2F2000A%26A...361..945L']

### 28. news.bbc.co.uk

Some parameters are not useful and are removed: a query made only of one to three letters, "rss=" and "from=".

In [91]:
list(df.loc[df['domain'] == 'news.bbc.co.uk', 'URL_n'])[0:5]

['news.bbc.co.uk/sport2/hi/football/africa/4636552.stm',
 'news.bbc.co.uk/sport2/hi/football/africa/4636562.stm',
 'news.bbc.co.uk/sport2/hi/football/africa/4634824.stm',
 'news.bbc.co.uk/sport2/hi/football/africa/4634982.stm',
 'news.bbc.co.uk/sport2/hi/other_sports/snooker/4637756.stm']

In [92]:
df.loc[df['domain'] == 'news.bbc.co.uk', 'URL_n'] = [re.sub(r'\?([a-zA-Z]{1,3}$|rss=.*|from=.*)', '', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'news.bbc.co.uk', 'URL_n']]

In [93]:
list(df.loc[df['domain'] == 'news.bbc.co.uk', 'URL_n'])[0:5]

['news.bbc.co.uk/sport2/hi/football/africa/4636552.stm',
 'news.bbc.co.uk/sport2/hi/football/africa/4636562.stm',
 'news.bbc.co.uk/sport2/hi/football/africa/4634824.stm',
 'news.bbc.co.uk/sport2/hi/football/africa/4634982.stm',
 'news.bbc.co.uk/sport2/hi/other_sports/snooker/4637756.stm']

### 29. inaturalist.org

No changes required.

In [94]:
list(df.loc[df['domain'] == 'inaturalist.org', 'URL_n'])[0:5]

['inaturalist.org/taxa/Cnemaspis_scalpensis',
 'inaturalist.org/taxa/160690-Clematis-morefieldii',
 'inaturalist.org/observations/1630409',
 'inaturalist.org/observations/1917484',
 'inaturalist.org/projects/reptiles-of-the-horn-of-africa-iucn-redlist-forum/assessments/122-agama-robecchii']

### 30. newspapers.com

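In this case, some URLs include parameters. The text that follows a numeric path segment, the "fcfToken" parameter and the trailing slash are removed.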

In [95]:
list(df.loc[df['domain'] == 'newspapers.com', 'URL_n'])[0:5]

['newspapers.com/',
 'newspapers.com/',
 'newspapers.com/image/67360166',
 'newspapers.com/image/47396396',
 'newspapers.com/image/6923243']

In [96]:
df.loc[df['domain'] == 'newspapers.com', 'URL_n'] = [re.sub('(/[0-9].*/)([a-zA-Z0-9?].*$)', r'\1', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'newspapers.com', 'URL_n']]
df.loc[df['domain'] == 'newspapers.com', 'URL_n'] = [re.sub('fcfToken=.*?&', '', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'newspapers.com', 'URL_n']]
df.loc[df['domain'] == 'newspapers.com', 'URL_n'] = [re.sub('&fcfToken=.*', '', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'newspapers.com', 'URL_n']]
df.loc[df['domain'] == 'newspapers.com', 'URL_n'] = [re.sub('/$', '', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'newspapers.com', 'URL_n']]

In [97]:
list(df.loc[df['domain'] == 'newspapers.com', 'URL_n'])[0:5]

['newspapers.com',
 'newspapers.com',
 'newspapers.com/image/67360166',
 'newspapers.com/image/47396396',
 'newspapers.com/image/6923243']
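
The four newspapers.com substitutions can be traced on individual strings. This is a minimal sketch that applies the same patterns in the same order. The first two samples come from the listing above; the last two are hypothetical URLs invented to exercise the capture group and the "fcfToken" rule, and `normalize_newspapers` is an illustrative helper name.

```python
import re

# Same patterns and order as the cell above: (pattern, replacement).
NEWSPAPERS_RULES = [
    (r'(/[0-9].*/)([a-zA-Z0-9?].*$)', r'\1'),  # keep up to the slash after the numeric segment
    (r'fcfToken=.*?&', ''),                    # token followed by another parameter
    (r'&fcfToken=.*', ''),                     # token as the last parameter
    (r'/$', ''),                               # trailing slash
]

def normalize_newspapers(url):
    for pattern, repl in NEWSPAPERS_RULES:
        url = re.sub(pattern, repl, url, flags=re.IGNORECASE)
    return url

samples = [
    'newspapers.com/',
    'newspapers.com/image/67360166',
    'newspapers.com/clip/1234567/some-title/',                       # hypothetical
    'newspapers.com/image/67360166?fcfToken=abc123&terms=example',   # hypothetical
]
cleaned = [normalize_newspapers(u) for u in samples]
print(cleaned)
```

The first rule only fires when a slash follows the numeric segment, which is why plain `image/<id>` URLs pass through unchanged.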

### 31. eol.org

The "category_id" parameter is removed.

In [101]:
list(df.loc[df['domain'] == 'eol.org', 'URL_n'])[0:5]

['eol.org/',
 'eol.org/',
 'eol.org/',
 'eol.org/search?q=Hyla&search_image=',
 'eol.org/']

In [102]:
df.loc[df['domain'] == 'eol.org', 'URL_n'] = [re.sub(r'\?category_id.*', '', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'eol.org', 'URL_n']]

In [103]:
list(df.loc[df['domain'] == 'eol.org', 'URL_n'])[0:5]

['eol.org/',
 'eol.org/',
 'eol.org/',
 'eol.org/search?q=Hyla&search_image=',
 'eol.org/']

### 32. sports-reference.com

No changes required.

In [104]:
list(df.loc[df['domain'] == 'sports-reference.com', 'URL_n'])[0:5]

['sports-reference.com/cbb/conferences/NJNY',
 'sports-reference.com/cbb/conferences/AMSO',
 'sports-reference.com/olympics/athletes/th/micheen-thornycroft-1.html',
 'sports-reference.com/cfb/coaches/james-horne-1.html',
 'sports-reference.com/cfb/coaches/james-sheldon-1.html']

### 33. idref.fr

No changes required.

In [105]:
list(df.loc[df['domain'] == 'idref.fr', 'URL_n'])[0:5]

['idref.fr/028329600',
 'idref.fr/030795834',
 'idref.fr/077803418',
 'idref.fr/112923844',
 'idref.fr/026975238']

### 34. allmusic.com

The "#no-js" fragment is removed.

In [106]:
list(df.loc[df['domain'] == 'allmusic.com', 'URL_n'])[0:5]

['allmusic.com/cg/x.dll?p=amg&sql=B6185',
 'allmusic.com/',
 'allmusic.com/',
 'allmusic.com/cg/amg.dll?p=amg',
 'allmusic.com/']

In [107]:
df.loc[df['domain'] == 'allmusic.com', 'URL_n'] = [re.sub('#no-js$', '', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'allmusic.com', 'URL_n']]

In [108]:
list(df.loc[df['domain'] == 'allmusic.com', 'URL_n'])[0:5]

['allmusic.com/cg/x.dll?p=amg&sql=B6185',
 'allmusic.com/',
 'allmusic.com/',
 'allmusic.com/cg/amg.dll?p=amg',
 'allmusic.com/']

### 35. data.bibliotheken.nl

No changes required.

In [109]:
list(df.loc[df['domain'] == 'data.bibliotheken.nl', 'URL_n'])[0:5]

['data.bibliotheken.nl/id/thes/p072610212',
 'data.bibliotheken.nl/id/thes/p170985776',
 'data.bibliotheken.nl/id/thes/p070845352',
 'data.bibliotheken.nl/id/thes/p06874949X',
 'data.bibliotheken.nl/id/thes/p40744727X']

### 36. census.gov

No changes required.

In [110]:
list(df.loc[df['domain'] == 'census.gov', 'URL_n'])[0:5]

['census.gov/geo/maps/urbanarea/uaoutline/UC2000/uc59923/uc59923_01.pdf',
 'census.gov/prod2/decennial/documents/1880a_v1-01.pdf',
 'census.gov/geo/maps/tribaltract2000/1125_FondduLac/CTN1125_001.pdf',
 'census.gov/geo/maps/tribaltract2000/4595_WhiteEarth/CTN4595_001.pdf',
 'census.gov/prod2/decennial/documents/33405927v1ch02.pdf']

### 37. id.worldcat.org

No changes required.

In [111]:
list(df.loc[df['domain'] == 'id.worldcat.org', 'URL_n'])[0:5]

['id.worldcat.org/fast/1917405',
 'id.worldcat.org/fast/393538',
 'id.worldcat.org/fast/1532152',
 'id.worldcat.org/fast/629264',
 'id.worldcat.org/fast/128702']

### 38. news.google.com

In this case, there is no parameter that can clearly be dropped: if sjid, nid or id is removed, the URL does not work. This happens, for example, with news.google.com/newspapers?nid=1873&dat=19610313&id=qpkoAAAAIBAJ&sjid=h8wEAAAAIBAJ&pg=6269,2359408. The language parameter (hl) can be removed, however.

In [112]:
list(df.loc[df['domain'] == 'news.google.com', 'URL_n'])[0:5]

['news.google.com/newspapers?dq==dan+healy&hl=n&id=hVxMAAAAIBAAJ&pg=3484,5915299&sjid=rOADAAAAIBAJ',
 'news.google.com/newspapers?id=P41ZAAAAIBAJ&pg=5323,4708879&sjid=Z0kNAAAAIBAJ',
 'news.google.com/newspapers?id=fPEcAAAAIBAJ&pg=5935,1532525&sjid=tmgEAAAAIBAJ',
 'news.google.com/newspapers?id=UTQiAAAAIBAJ&pg=1048,4560722&sjid=F6cFAAAAIBAJ',
 'news.google.com/newspapers?id=3fhTAAAAIBAJ&pg=6729,2620754&sjid=Ao0DAAAAIBAJ']

In [113]:
df.loc[df['domain'] == 'news.google.com', 'URL_n'] = [re.sub('(hl=.*?&|&hl=[a-zA-Z-]{2,6}$)', '', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'news.google.com', 'URL_n']]
df.loc[df['domain'] == 'news.google.com', 'URL_n'] = [re.sub(r'\?dq=.*?&', '?', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'news.google.com', 'URL_n']]

In [114]:
list(df.loc[df['domain'] == 'news.google.com', 'URL_n'])[0:5]

['news.google.com/newspapers?id=hVxMAAAAIBAAJ&pg=3484,5915299&sjid=rOADAAAAIBAJ',
 'news.google.com/newspapers?id=P41ZAAAAIBAJ&pg=5323,4708879&sjid=Z0kNAAAAIBAJ',
 'news.google.com/newspapers?id=fPEcAAAAIBAJ&pg=5935,1532525&sjid=tmgEAAAAIBAJ',
 'news.google.com/newspapers?id=UTQiAAAAIBAJ&pg=1048,4560722&sjid=F6cFAAAAIBAJ',
 'news.google.com/newspapers?id=3fhTAAAAIBAJ&pg=6729,2620754&sjid=Ao0DAAAAIBAJ']
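
As an alternative to the regular expressions used for news.google.com, the same result can be reached by splitting the query on `&` and filtering by parameter name. This is only a sketch under stated assumptions: `drop_params` is a hypothetical helper that is not part of the notebook, and it assumes the URL carries no fragment. The sample URL is the first entry of the listing above.

```python
def drop_params(url, names):
    """Remove the query parameters whose name is in `names`, keeping the others in order."""
    base, sep, query = url.partition('?')
    if not sep:
        return url  # no query string at all
    kept = [part for part in query.split('&')
            if part.split('=', 1)[0].lower() not in names]
    if not kept:
        return base
    return base + '?' + '&'.join(kept)

sample = ('news.google.com/newspapers?dq==dan+healy&hl=n&id=hVxMAAAAIBAAJ'
          '&pg=3484,5915299&sjid=rOADAAAAIBAJ')
result = drop_params(sample, {'hl', 'dq'})
print(result)
```

Because the query is never decoded and re-encoded, values such as `pg=3484,5915299` keep their commas, and id, sjid and pg stay untouched.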

### 39. aleph.nkp.cz

No changes required.

In [115]:
list(df.loc[df['domain'] == 'aleph.nkp.cz', 'URL_n'])[0:5]

['aleph.nkp.cz/F/CRBKIKHD3PPMUQ5IIAX1H1JVC4AY2F3FPFSHPHNNG6AIY6BK5J-27038?acc_sequence=000121710&func=FIND-ACC',
 'aleph.nkp.cz/F/4F68MISLEVADC8E66Y75CPM6FIYJ5EG9D9D6H99DYTG961GQRH-34741?acc_sequence=000121710&func=accref',
 'aleph.nkp.cz/F/?CON_LNG=ENG&ccl_term=wau=jn20000728817+or+wkw=jn20000728817&func=find-c&local_base=nkc',
 'aleph.nkp.cz/F/?ccl_term=wau=jk01040550+or+wkw=jk01040550&func=find-c&local_base=nkc',
 'aleph.nkp.cz/F/?ccl_term=wau=jk01151786+or+wkw=jk01151786&func=find-c&local_base=nkc']

### 40. int.soccerway.com

One parameter is removed.

In [119]:
list(df.loc[df['domain'] == 'int.soccerway.com', 'URL_n'])[0:5]

['int.soccerway.com/national/maldives/dhiraagu-dhivehi-league/2014/championship-round/r28159',
 'int.soccerway.com/teams/dominican-republic/barcelona/8534',
 'int.soccerway.com/venues/dominican-republic/cancha-municipal-puerto-plata/v20519',
 'int.soccerway.com/players/rahul-jaiswal/267434',
 'int.soccerway.com/matches/2009/08/08/italy/super-cup/fc-internazionale-milano/ss-lazio-roma/786439']

In [120]:
df.loc[df['domain'] == 'int.soccerway.com', 'URL_n'] = [re.sub(r'/\?icid=.*', '', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'int.soccerway.com', 'URL_n']]

In [121]:
list(df.loc[df['domain'] == 'int.soccerway.com', 'URL_n'])[0:5]

['int.soccerway.com/national/maldives/dhiraagu-dhivehi-league/2014/championship-round/r28159',
 'int.soccerway.com/teams/dominican-republic/barcelona/8534',
 'int.soccerway.com/venues/dominican-republic/cancha-municipal-puerto-plata/v20519',
 'int.soccerway.com/players/rahul-jaiswal/267434',
 'int.soccerway.com/matches/2009/08/08/italy/super-cup/fc-internazionale-milano/ss-lazio-roma/786439']

### 41. espncricinfo.com

In this case, removing ".html" breaks the URL, although it would make it possible to unify URLs. Only specific parameters can be removed.

In [116]:
list(df.loc[df['domain'] == 'espncricinfo.com', 'URL_n'])[0:5]

['espncricinfo.com/southafrica/content/story/472149.html',
 'espncricinfo.com/afghanistan/content/story/486563.html',
 'espncricinfo.com/wisdenalmanack/content/story/228030.html',
 'espncricinfo.com/ci/content/player/20324.html',
 'espncricinfo.com/wisdenalmanack/content/story/227835.html']

In [117]:
df.loc[df['domain'] == 'espncricinfo.com', 'URL_n'] = [re.sub(r'(\.html\?template|\.html\?view|\.html\?cmp|\.html\?version|\.html\?comments|\.html\?ex_cid|\.html\?innings|\.html\?index).*#', '.html#', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'espncricinfo.com', 'URL_n']]
df.loc[df['domain'] == 'espncricinfo.com', 'URL_n'] = [re.sub(r'(\.html\?template|\.html\?view|\.html\?cmp|\.html\?version|\.html\?comments|\.html\?ex_cid|\.html\?innings|\.html\?index).*', '.html', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'espncricinfo.com', 'URL_n']]

In [118]:
list(df.loc[df['domain'] == 'espncricinfo.com', 'URL_n'])[0:5]

['espncricinfo.com/southafrica/content/story/472149.html',
 'espncricinfo.com/afghanistan/content/story/486563.html',
 'espncricinfo.com/wisdenalmanack/content/story/228030.html',
 'espncricinfo.com/ci/content/player/20324.html',
 'espncricinfo.com/wisdenalmanack/content/story/227835.html']

### 42. itis.gov

The "#null" fragment and the print-related parameters are removed.

In [123]:
list(df.loc[df['domain'] == 'itis.gov', 'URL_n'])[0:5]

['itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&search_value=161216',
 'itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&search_value=593668',
 'itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&search_value=550347',
 'itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&search_value=77709',
 'itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&search_value=44014']

In [124]:
df.loc[df['domain'] == 'itis.gov', 'URL_n'] = [re.sub('#null$', '', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'itis.gov', 'URL_n']]
df.loc[df['domain'] == 'itis.gov', 'URL_n'] = [re.sub('(print_version=prt&|&source=(to_print|from_print|html)$)', '', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'itis.gov', 'URL_n']]

In [125]:
list(df.loc[df['domain'] == 'itis.gov', 'URL_n'])[0:5]

['itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&search_value=161216',
 'itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&search_value=593668',
 'itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&search_value=550347',
 'itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&search_value=77709',
 'itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&search_value=44014']

### 43. baseball-reference.com

No changes required. Some fragments (#) have changed and no longer work.

In [126]:
list(df.loc[df['domain'] == 'baseball-reference.com', 'URL_n'])[0:5]

['baseball-reference.com/leaders/RC_career.shtml',
 'baseball-reference.com/leaders/RC_season.shtml',
 'baseball-reference.com/leaders/RC_active.shtml',
 'baseball-reference.com/leaders/RC_leagues.shtml',
 'baseball-reference.com/teams/ROK/1871.shtml']

### 44. nla.gov.au

Some parameters are only used to highlight some terms.

In [127]:
list(df.loc[df['domain'] == 'nla.gov.au', 'URL_n'])[0:5]

['nla.gov.au/nla.pic-an23386505',
 'nla.gov.au/nla.pic-an23386438',
 'nla.gov.au/nla.map-erpc1',
 'nla.gov.au/guides/federation/people/lawsonl.html',
 'nla.gov.au/nla.pic-vn3046935']

In [128]:
df.loc[df['domain'] == 'nla.gov.au', 'URL_n'] = [re.sub('(#page|#pstart).*', '', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'nla.gov.au', 'URL_n']]
df.loc[df['domain'] == 'nla.gov.au', 'URL_n'] = [re.sub(r'(\?search|\?zoom|/view).*#', '#', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'nla.gov.au', 'URL_n']]
df.loc[df['domain'] == 'nla.gov.au', 'URL_n'] = [re.sub(r'(\?search|\?zoom|/view).*', '', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'nla.gov.au', 'URL_n']]

In [129]:
list(df.loc[df['domain'] == 'nla.gov.au', 'URL_n'])[0:5]

['nla.gov.au/nla.pic-an23386505',
 'nla.gov.au/nla.pic-an23386438',
 'nla.gov.au/nla.map-erpc1',
 'nla.gov.au/guides/federation/people/lawsonl.html',
 'nla.gov.au/nla.pic-vn3046935']

### 45. tvbythenumbers.zap2it.com

No changes required.

In [130]:
list(df.loc[df['domain'] == 'tvbythenumbers.zap2it.com', 'URL_n'])[0:5]

['tvbythenumbers.zap2it.com/2010/05/19/cws-melrose-place-canceled/51848',
 'tvbythenumbers.zap2it.com/2010/11/12/thursday-final-ratings-fringe-community-30-rock-outsourced-the-office-adjusted-down-bones-my-dad-says-adjusted-up/71871',
 'tvbythenumbers.zap2it.com/2010/11/12/thursday-final-ratings-fringe-community-30-rock-outsourced-the-office-adjusted-down-bones-my-dad-says-adjusted-up/71871',
 'tvbythenumbers.zap2it.com/2010/11/12/thursday-final-ratings-fringe-community-30-rock-outsourced-the-office-adjusted-down-bones-my-dad-says-adjusted-up/71871',
 'tvbythenumbers.zap2it.com/2010/11/12/thursday-final-ratings-fringe-community-30-rock-outsourced-the-office-adjusted-down-bones-my-dad-says-adjusted-up/71871']

### 46. geonames.usgs.gov

Fragments (#) are removed.

In [131]:
list(df.loc[df['domain'] == 'geonames.usgs.gov', 'URL_n'])[0:5]

['geonames.usgs.gov/antex.html',
 'geonames.usgs.gov/',
 'geonames.usgs.gov/fips55/fips55down.html',
 'geonames.usgs.gov/redirect.html',
 'geonames.usgs.gov/domestic/index.html']

In [132]:
df.loc[df['domain'] == 'geonames.usgs.gov', 'URL_n'] = [re.sub('#.*', '', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'geonames.usgs.gov', 'URL_n']]

In [133]:
list(df.loc[df['domain'] == 'geonames.usgs.gov', 'URL_n'])[0:5]

['geonames.usgs.gov/antex.html',
 'geonames.usgs.gov/',
 'geonames.usgs.gov/fips55/fips55down.html',
 'geonames.usgs.gov/redirect.html',
 'geonames.usgs.gov/domestic/index.html']

### 47. twitter.com

Some parameters are removed.

In [134]:
list(df.loc[df['domain'] == 'twitter.com', 'URL_n'])[0:5]

['twitter.com/justinmj/status/1395983675',
 'twitter.com/pleasetouch/status/1348070565',
 'twitter.com/socialcode',
 'twitter.com/Astro_127',
 'twitter.com/nasa']

In [135]:
df.loc[df['domain'] == 'twitter.com', 'URL_n'] = [re.sub(r'\?.*q=', '?q=', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'twitter.com', 'URL_n']]
df.loc[df['domain'] == 'twitter.com', 'URL_n'] = [re.sub(r'\?(lang|ref|s=|p=|tw_e=|dec).*', '', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'twitter.com', 'URL_n']]
df.loc[df['domain'] == 'twitter.com', 'URL_n'] = [re.sub(r'\.html$', '', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'twitter.com', 'URL_n']]

In [136]:
list(df.loc[df['domain'] == 'twitter.com', 'URL_n'])[0:5]

['twitter.com/justinmj/status/1395983675',
 'twitter.com/pleasetouch/status/1348070565',
 'twitter.com/socialcode',
 'twitter.com/Astro_127',
 'twitter.com/nasa']
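
Every domain in this notebook repeats the same pattern: select the rows of one domain and apply one or more substitutions in sequence. A small helper could express that as a list of rules. The sketch below is a hypothetical refactoring, not code from the notebook: `apply_rules` and `TWITTER_RULES` are illustrative names, and it works on a plain list so that it can be tried without the dataset. The first sample comes from the listing above; the other two are hypothetical.

```python
import re

def apply_rules(urls, rules):
    """Apply (pattern, replacement) pairs in order to every URL, ignoring case."""
    compiled = [(re.compile(pattern, re.IGNORECASE), repl) for pattern, repl in rules]
    result = []
    for url in urls:
        for pattern, repl in compiled:
            url = pattern.sub(repl, url)
        result.append(url)
    return result

# The three twitter.com substitutions from the cell above, in the same order.
TWITTER_RULES = [
    (r'\?.*q=', '?q='),
    (r'\?(lang|ref|s=|p=|tw_e=|dec).*', ''),
    (r'\.html$', ''),
]

urls = [
    'twitter.com/nasa',
    'twitter.com/nasa?lang=en',             # hypothetical
    'twitter.com/search?f=tweets&q=nasa',   # hypothetical
]
cleaned = apply_rules(urls, TWITTER_RULES)
print(cleaned)

# With the DataFrame of this notebook the call could look like this (not executed here):
# mask = df['domain'] == 'twitter.com'
# df.loc[mask, 'URL_n'] = apply_rules(df.loc[mask, 'URL_n'], TWITTER_RULES)
```

Compiling each pattern once and computing the domain mask once also avoids repeating the same work on every line of a cell.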

### 48. geohack.toolforge.org

No changes required.

In [137]:
list(df.loc[df['domain'] == 'geohack.toolforge.org', 'URL_n'])[0:5]

['geohack.toolforge.org/geohack.php?pagename=River_Itchen,_Hampshire&params=51.042044148996_N_1.1612396354662_W_region:GB_scale:100000',
 'geohack.toolforge.org/geohack.php?pagename=River_Itchen,_Hampshire&params=50.883389996042_N_1.3886940272936_W_region:GB_scale:100000',
 'geohack.toolforge.org/geohack.php?pagename=Croagh_Patrick&params=53.760062662731_N_9.6595264818095_W_region:IE_scale:25000',
 'geohack.toolforge.org/geohack.php?pagename=Ballard_Down&params=50.630493379119_N_1.9660143715323_W_region:GB_scale:25000',
 'geohack.toolforge.org/geohack.php?pagename=Turlough_Hill&params=53.024236643604_N_6.4165234583946_W_region:IE_scale:25000']

### 49. washingtonpost.com

Some parameters are removed.

In [138]:
list(df.loc[df['domain'] == 'washingtonpost.com', 'URL_n'])[0:5]

['washingtonpost.com/wp-dyn/content/article/2005/07/29/AR2005072902133_pf.html',
 'washingtonpost.com/ac2/wp-dyn?contentId=A43118-2001Mar8&node=&pagename=article',
 'washingtonpost.com/wp-dyn/content/article/2007/04/24/AR2007042401973.html',
 'washingtonpost.com/ac2/wp-dyn/A44371-2004Jun15?language=printer',
 'washingtonpost.com/ac2/wp-dyn/A45442-2004Jul12?language=printer']

In [139]:
df.loc[df['domain'] == 'washingtonpost.com', 'URL_n'] = [re.sub(r'(/?)\?(language|noredirect|wprss|tid|sid|hpid|nav|sub|referrer|arc404|utm_term|wp).*', '', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'washingtonpost.com', 'URL_n']]

In [140]:
list(df.loc[df['domain'] == 'washingtonpost.com', 'URL_n'])[0:5]

['washingtonpost.com/wp-dyn/content/article/2005/07/29/AR2005072902133_pf.html',
 'washingtonpost.com/ac2/wp-dyn?contentId=A43118-2001Mar8&node=&pagename=article',
 'washingtonpost.com/wp-dyn/content/article/2007/04/24/AR2007042401973.html',
 'washingtonpost.com/ac2/wp-dyn/A44371-2004Jun15',
 'washingtonpost.com/ac2/wp-dyn/A45442-2004Jul12']
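
The washingtonpost.com pattern only fires when one of the listed names comes immediately after the "?". The sketch below checks this on three strings: the first two come from the listing above, and the third is a hypothetical URL added to show that a listed parameter in a later position is left in place.

```python
import re

# Same pattern as in the cell above, written as a raw string.
WAPO = re.compile(
    r'(/?)\?(language|noredirect|wprss|tid|sid|hpid|nav|sub|referrer|arc404|utm_term|wp).*',
    re.IGNORECASE,
)

samples = [
    'washingtonpost.com/ac2/wp-dyn/A44371-2004Jun15?language=printer',
    'washingtonpost.com/ac2/wp-dyn?contentId=A43118-2001Mar8&node=&pagename=article',
    'washingtonpost.com/example/story.html?contentId=1&tid=abc',  # hypothetical
]
cleaned = [WAPO.sub('', u) for u in samples]
print(cleaned)
```

Only the first URL changes. In the other two the query starts with "contentId", which is not in the list, so the whole query is kept, including the later "tid" in the hypothetical case.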

### 50. historicengland.org.uk

No changes required.

In [141]:
list(df.loc[df['domain'] == 'historicengland.org.uk', 'URL_n'])[0:5]

['historicengland.org.uk/listing/what-is-designation/listed-buildings',
 'historicengland.org.uk/listing/what-is-designation/listed-buildings',
 'historicengland.org.uk/listing/what-is-designation/listed-buildings',
 'historicengland.org.uk/listing/what-is-designation/listed-buildings',
 'historicengland.org.uk/listing/what-is-designation/listed-buildings']

## Extra

### 51. telegraph.co.uk

Some parameters are removed.

In [142]:
list(df.loc[df['domain'] == 'telegraph.co.uk', 'URL_n'])[0:5]

['telegraph.co.uk/health/main.jhtml?xml=/health/2006/04/04/hrowing01.xml#aa',
 'telegraph.co.uk/money/main.jhtml?xml=/money/2002/09/13/cnwed113.xml',
 'telegraph.co.uk/core/Content/displayPrintable.jhtml?page=0&site=5&xml=/news/2004/12/13/db1302.xml',
 'telegraph.co.uk/sport/main.jhtml?xml=/sport/2007/11/08/uowoodward108.xml',
 'telegraph.co.uk/news/uknews/1584599/Extravagance-uncovered-during-Saudi-arms-probe.html']

In [143]:
df.loc[df['domain'] == 'telegraph.co.uk', 'URL_n'] = [re.sub(r'\.html\?(utm_|WT|li_medium).*', '.html', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'telegraph.co.uk', 'URL_n']]
df.loc[df['domain'] == 'telegraph.co.uk', 'URL_n'] = [re.sub(r'[/]{0,1}\?(utm_|WT|li_medium).*', '', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'telegraph.co.uk', 'URL_n']]
df.loc[df['domain'] == 'telegraph.co.uk', 'URL_n'] = [re.sub('#disqus_thread$', '', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'telegraph.co.uk', 'URL_n']]

In [144]:
list(df.loc[df['domain'] == 'telegraph.co.uk', 'URL_n'])[0:5]

['telegraph.co.uk/health/main.jhtml?xml=/health/2006/04/04/hrowing01.xml#aa',
 'telegraph.co.uk/money/main.jhtml?xml=/money/2002/09/13/cnwed113.xml',
 'telegraph.co.uk/core/Content/displayPrintable.jhtml?page=0&site=5&xml=/news/2004/12/13/db1302.xml',
 'telegraph.co.uk/sport/main.jhtml?xml=/sport/2007/11/08/uowoodward108.xml',
 'telegraph.co.uk/news/uknews/1584599/Extravagance-uncovered-during-Saudi-arms-probe.html']

### 52. independent.co.uk

If ".html" is removed, the URL does not work in some cases, but removing it makes it possible to unify URLs.

In [145]:
list(df.loc[df['domain'] == 'independent.co.uk', 'URL_n'])[0:5]

['independent.co.uk/sport/football/internationals/israel-showed-they-are-honest-people--we-have-to-behave-in-the-same-way-400969.html',
 'independent.co.uk/news/obituaries/boris-grushin-397539.html',
 'independent.co.uk/news/obituaries/sir-john-brown-730102.html',
 'independent.co.uk/arts-entertainment/films/features/ah-007-we-meet-again-a-brief-history-of-the-bond-villain-963852.html',
 'independent.co.uk/arts-entertainment/theatre-dance/features/leo-butler-thats-not-an-usher-thats-the-author-662973.html']

In [146]:
df.loc[df['domain'] == 'independent.co.uk', 'URL_n'] = [re.sub(r'\.html\?(amp|CMP|print|origin|dkdkd|service|r=).*', '.html', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'independent.co.uk', 'URL_n']]

In [147]:
list(df.loc[df['domain'] == 'independent.co.uk', 'URL_n'])[0:5]

['independent.co.uk/sport/football/internationals/israel-showed-they-are-honest-people--we-have-to-behave-in-the-same-way-400969.html',
 'independent.co.uk/news/obituaries/boris-grushin-397539.html',
 'independent.co.uk/news/obituaries/sir-john-brown-730102.html',
 'independent.co.uk/arts-entertainment/films/features/ah-007-we-meet-again-a-brief-history-of-the-bond-villain-963852.html',
 'independent.co.uk/arts-entertainment/theatre-dance/features/leo-butler-thats-not-an-usher-thats-the-author-662973.html']

### 53. variety.com

The use of parameters seems to be associated with an old version of the website. The parameters included in *variety.com/index.asp* URLs cannot be removed, and neither can the category parameter: in some cases the URL does not work without it.

In [148]:
list(df.loc[df['domain'] == 'variety.com', 'URL_n'])[0:5]

['variety.com/review/VE1117793667.html?categoryid=31&cs=1&p=0',
 'variety.com/index.asp?articleid=VR1117958975&content=jump&dept=berlin&jump=story&layout=features2007&nav=FBberlin',
 'variety.com/ref.asp?p=H2BE&sid=VE1117931860&u=IMDB',
 'variety.com/screening/la/default.asp?show=10',
 'variety.com/screening/ny/default.asp?show=10']

In [None]:
#df.loc[df['domain'] == 'variety.com', 'URL_n'] = [re.sub('\.html\?.*', '.html', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'variety.com', 'URL_n']]
#df.loc[df['domain'] == 'variety.com', 'URL_n'] = [re.sub('([0-9]|/)(\?.*)', r'\1', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'variety.com', 'URL_n']]
#df.loc[df['domain'] == 'variety.com', 'URL_n'] = [re.sub('\?(refcatid|categoryid|printerfriendly).*', '', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'variety.com', 'URL_n']]

In [149]:
list(df.loc[df['domain'] == 'variety.com', 'URL_n'])[0:5]

['variety.com/review/VE1117793667.html?categoryid=31&cs=1&p=0',
 'variety.com/index.asp?articleid=VR1117958975&content=jump&dept=berlin&jump=story&layout=features2007&nav=FBberlin',
 'variety.com/ref.asp?p=H2BE&sid=VE1117931860&u=IMDB',
 'variety.com/screening/la/default.asp?show=10',
 'variety.com/screening/ny/default.asp?show=10']

### 54. itunes.apple.com

There are some parameters that could be kept. mt and l (language) do not seem useful, whereas i identifies a specific song in an album. #see-all and #fullText would be removed.

In [150]:
list(df.loc[df['domain'] == 'itunes.apple.com', 'URL_n'])[0:5]

['itunes.apple.com/gb/album/they-may-talk-ep/id359886387',
 'itunes.apple.com/gb/album/an-emotional-victory/id373520393',
 'itunes.apple.com/us/album/turning-point/id366481125',
 'itunes.apple.com/us/album/jerk-single/id412674162',
 'itunes.apple.com/us/album/flaws-and-all-live/id268305666?i=268305829']

In [151]:
df.loc[df['domain'] == 'itunes.apple.com', 'URL_n'] = [re.sub(r'(\?mt=|&mt=|\?l=|&l=).*#', '#', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'itunes.apple.com', 'URL_n']]
df.loc[df['domain'] == 'itunes.apple.com', 'URL_n'] = [re.sub(r'(\?mt=|&mt=).*', '', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'itunes.apple.com', 'URL_n']]
df.loc[df['domain'] == 'itunes.apple.com', 'URL_n'] = [re.sub(r'(\?l=|&l=).*', '', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'itunes.apple.com', 'URL_n']]
df.loc[df['domain'] == 'itunes.apple.com', 'URL_n'] = [re.sub(r'\?app.*?&', '?', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'itunes.apple.com', 'URL_n']]
df.loc[df['domain'] == 'itunes.apple.com', 'URL_n'] = [re.sub(r'\?app.*#', '#', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'itunes.apple.com', 'URL_n']]
df.loc[df['domain'] == 'itunes.apple.com', 'URL_n'] = [re.sub(r'\?app.*', '', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'itunes.apple.com', 'URL_n']]

In [152]:
list(df.loc[df['domain'] == 'itunes.apple.com', 'URL_n'])[0:5]

['itunes.apple.com/gb/album/they-may-talk-ep/id359886387',
 'itunes.apple.com/gb/album/an-emotional-victory/id373520393',
 'itunes.apple.com/us/album/turning-point/id366481125',
 'itunes.apple.com/us/album/jerk-single/id412674162',
 'itunes.apple.com/us/album/flaws-and-all-live/id268305666?i=268305829']

### 55. timesofindia.indiatimes.com

Some parameters are removed.

In [153]:
list(df.loc[df['domain'] == 'timesofindia.indiatimes.com', 'URL_n'])[0:5]

['timesofindia.indiatimes.com/articleshow/1180863.cms',
 'timesofindia.indiatimes.com/articleshow/903897.cms',
 'timesofindia.indiatimes.com/articleshow/431188.cms',
 'timesofindia.indiatimes.com/articleshow/11243699.cms',
 'timesofindia.indiatimes.com/articleshow/msid-21724548,prtpage-1.cms']

In [154]:
df.loc[df['domain'] == 'timesofindia.indiatimes.com', 'URL_n'] = [re.sub(r'\.cms\?(referral=|intenttarget=|null|from=).*', '.cms', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'timesofindia.indiatimes.com', 'URL_n']]
df.loc[df['domain'] == 'timesofindia.indiatimes.com', 'URL_n'] = [re.sub(r'\.cms(#write$|#ixzz.*)', '.cms', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'timesofindia.indiatimes.com', 'URL_n']]

In [155]:
list(df.loc[df['domain'] == 'timesofindia.indiatimes.com', 'URL_n'])[0:5]

['timesofindia.indiatimes.com/articleshow/1180863.cms',
 'timesofindia.indiatimes.com/articleshow/903897.cms',
 'timesofindia.indiatimes.com/articleshow/431188.cms',
 'timesofindia.indiatimes.com/articleshow/11243699.cms',
 'timesofindia.indiatimes.com/articleshow/msid-21724548,prtpage-1.cms']

### 56. bbc.com

The main unnecessary parameters are removed.

In [156]:
list(df.loc[df['domain'] == 'bbc.com', 'URL_n'])[0:5]

['bbc.com/bbcfour/documentaries/storyville/house-saud.shtml',
 'bbc.com/cumbria/features/tall_ships/earl_of_pembroke.shtml',
 'bbc.com/pressoffice/speeches/stories/dyke_mactaggart2000.shtml',
 'bbc.com/dna/getwriting/module14',
 'bbc.com/pressoffice/pressreleases/stories/2003/12_december/10/bbc3_factual.shtml']

In [157]:
df.loc[df['domain'] == 'bbc.com', 'URL_n'] = [re.sub(r'\?(ocid|intlink|print|ns_).*', '', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'bbc.com', 'URL_n']]
df.loc[df['domain'] == 'bbc.com', 'URL_n'] = [re.sub('#TWEET.*', '', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'bbc.com', 'URL_n']]

In [158]:
list(df.loc[df['domain'] == 'bbc.com', 'URL_n'])[0:5]

['bbc.com/bbcfour/documentaries/storyville/house-saud.shtml',
 'bbc.com/cumbria/features/tall_ships/earl_of_pembroke.shtml',
 'bbc.com/pressoffice/speeches/stories/dyke_mactaggart2000.shtml',
 'bbc.com/dna/getwriting/module14',
 'bbc.com/pressoffice/pressreleases/stories/2003/12_december/10/bbc3_factual.shtml']

### 57. stat.gov.pl

The index is fixed: the "p_name" parameter that follows ".html" is removed.

In [159]:
list(df.loc[df['domain'] == 'stat.gov.pl', 'URL_n'])[0:5] 

['stat.gov.pl/bdren/bdrap.dane_cechter.czaskatpoag?P_NTS_ID=11&P_SZUKANIE=&P_TERY_ID=2788',
 'stat.gov.pl/',
 'stat.gov.pl/',
 'stat.gov.pl/gus/45_655_PLK_HTML.htm',
 'stat.gov.pl/gus/45_655_PLK_HTML.htm']

In [160]:
df.loc[df['domain'] == 'stat.gov.pl', 'URL_n'] = [re.sub(r'\.html\?p_name.*', '.html', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'stat.gov.pl', 'URL_n']]

In [161]:
list(df.loc[df['domain'] == 'stat.gov.pl', 'URL_n'])[0:5] 

['stat.gov.pl/bdren/bdrap.dane_cechter.czaskatpoag?P_NTS_ID=11&P_SZUKANIE=&P_TERY_ID=2788',
 'stat.gov.pl/',
 'stat.gov.pl/',
 'stat.gov.pl/gus/45_655_PLK_HTML.htm',
 'stat.gov.pl/gus/45_655_PLK_HTML.htm']

### 58. hollywoodreporter.com

URLs with *article_display.jsp* cannot be transformed.

In [162]:
list(df.loc[df['domain'] == 'hollywoodreporter.com', 'URL_n'])[0:5]

['hollywoodreporter.com/thr/television/brief_display.jsp?vnu_content_id=1000698662',
 'hollywoodreporter.com/thr/awards/sundance/reviews_display.jsp?vnu_content_id=1000760661',
 'hollywoodreporter.com/thr/crafts/feature_display.jsp?vnu_content_id=1000865347',
 'hollywoodreporter.com/thr/crafts/feature_display.jsp?vnu_content_id=1000865347',
 'hollywoodreporter.com/thr/film/feature_display.jsp?vnu_content_id=1000863174']

In [163]:
df.loc[df['domain'] == 'hollywoodreporter.com', 'URL_n'] = [re.sub(r'([0-9a-zA-Z-]{4})(\?(mobile_redirect|facebook).*)', r'\1', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'hollywoodreporter.com', 'URL_n']]

In [164]:
list(df.loc[df['domain'] == 'hollywoodreporter.com', 'URL_n'])[0:5]

['hollywoodreporter.com/thr/television/brief_display.jsp?vnu_content_id=1000698662',
 'hollywoodreporter.com/thr/awards/sundance/reviews_display.jsp?vnu_content_id=1000760661',
 'hollywoodreporter.com/thr/crafts/feature_display.jsp?vnu_content_id=1000865347',
 'hollywoodreporter.com/thr/crafts/feature_display.jsp?vnu_content_id=1000865347',
 'hollywoodreporter.com/thr/film/feature_display.jsp?vnu_content_id=1000863174']

### 59. deadline.com

Some parameters are removed.

In [165]:
list(df.loc[df['domain'] == 'deadline.com', 'URL_n'])[0:5]

['deadline.com/2010/01/imagine-does-transmedia-storytelling-deal',
 'deadline.com/2009/05/an-attempt-to-stop-the-disney-machine',
 'deadline.com/2010/05/maura-tierney-in-negotiations-for-abcs-whole-truth',
 'deadline.com/2010/06/maura-tierney-set-for-the-whole-truth',
 'deadline.com/2010/08/sony-pic-billboards-offer-help-for-virgins-and-cause-nationwide-controversy']

In [166]:
df.loc[df['domain'] == 'deadline.com', 'URL_n'] = [re.sub(r'(/?)\?(iframe|_escaped_fragment).*', '', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'deadline.com', 'URL_n']]

In [167]:
list(df.loc[df['domain'] == 'deadline.com', 'URL_n'])[0:5]

['deadline.com/2010/01/imagine-does-transmedia-storytelling-deal',
 'deadline.com/2009/05/an-attempt-to-stop-the-disney-machine',
 'deadline.com/2010/05/maura-tierney-in-negotiations-for-abcs-whole-truth',
 'deadline.com/2010/06/maura-tierney-set-for-the-whole-truth',
 'deadline.com/2010/08/sony-pic-billboards-offer-help-for-virgins-and-cause-nationwide-controversy']

### 60. animenewsnetwork.com

No changes required.

In [168]:
list(df.loc[df['domain'] == 'animenewsnetwork.com', 'URL_n'])[0:5]

['animenewsnetwork.com/article.php?id=2714',
 'animenewsnetwork.com/feature.php?id=161',
 'animenewsnetwork.com/feature.php?id=155',
 'animenewsnetwork.com/article.php?id=6081',
 'animenewsnetwork.com/article.php?id=7985']

### 61. reuters.com

Here it is more complex to identify the key parameters.

In [169]:
list(df.loc[df['domain'] == 'reuters.com', 'URL_n'])[0:5]

['reuters.com/newsArticle.jhtml?section=news&storyID=6725339&type=worldNews',
 'reuters.com/news_article.jhtml;jsessionid=1BRREDIJMOGRKCRBAELCFEY?StoryID=1493221&type=topnews',
 'reuters.com/article/businessNews/idUSTRE5507X420090601',
 'reuters.com/article/pressrelease/idus211064+23-Jun-2008+PRN20080623',
 'reuters.com/newsArticle.jhtml?section=news&storyID=6679582&type=worldNews']

In [170]:
df.loc[df['domain'] == 'reuters.com', 'URL_n'] = [re.sub('(irpc=|sp=|rpc=|feedtype=|feedname=|il=|locale=|type=|virtualbrandchannel=|view=).*?(&|$)', '', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'reuters.com', 'URL_n']]
df.loc[df['domain'] == 'reuters.com', 'URL_n'] = [re.sub('(#[a-zA-Z0-9]+\.[0-9]{2}$|#targetText.*)', '', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'reuters.com', 'URL_n']]
df.loc[df['domain'] == 'reuters.com', 'URL_n'] = [re.sub('(\?$|\?&$|&$)', '', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'reuters.com', 'URL_n']]

In [171]:
list(df.loc[df['domain'] == 'reuters.com', 'URL_n'])[0:5]

['reuters.com/newsArticle.jhtml?section=news&storyID=6725339',
 'reuters.com/news_article.jhtml;jsessionid=1BRREDIJMOGRKCRBAELCFEY?StoryID=1493221',
 'reuters.com/article/businessNews/idUSTRE5507X420090601',
 'reuters.com/article/pressrelease/idus211064+23-Jun-2008+PRN20080623',
 'reuters.com/newsArticle.jhtml?section=news&storyID=6679582']
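
The alternatives in the first substitution (for example `il=` or `type=`) are not anchored to a parameter boundary, so they may also match the tail of a longer key. As a hedged alternative, the sketch below splits the query string and compares whole parameter names. The function `drop_params` and the list name `REUTERS_DROP` are assumptions introduced here, not part of the original notebook, and the sketch does not reproduce the fragment cleanup of the second substitution.

```python
def drop_params(url, names):
    """Remove query parameters whose key is in `names` (case-insensitive)."""
    base, hash_sep, fragment = url.partition('#')
    path, _, query = base.partition('?')
    drop = {name.lower() for name in names}
    kept = [
        pair for pair in query.split('&')
        if pair and pair.split('=', 1)[0].lower() not in drop
    ]
    rebuilt = path + ('?' + '&'.join(kept) if kept else '')
    # The fragment, if any, is re-attached unchanged.
    return rebuilt + hash_sep + fragment

# Parameter names taken from the regular expression used for reuters.com.
REUTERS_DROP = ['irpc', 'sp', 'rpc', 'feedtype', 'feedname', 'il',
                'locale', 'type', 'virtualbrandchannel', 'view']

print(drop_params(
    'reuters.com/newsArticle.jhtml?section=news&storyID=6725339&type=worldNews',
    REUTERS_DROP))
```

Matching whole keys means a parameter such as `email=` is kept even though it ends in `il=`.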

### 62. thehindu.com

Query strings that start with homepage, css, sec, test, ref or _escaped_fragment_ are removed; a trailing fragment is kept.

In [172]:
list(df.loc[df['domain'] == 'thehindu.com', 'URL_n'])[0:5]

['thehindu.com/thehindu/mp/2003/02/06/stories/2003020600890100.htm',
 'thehindu.com/thehindu/2003/03/20/stories/2003032005180300.htm',
 'thehindu.com/2009/08/13/stories/2009081352170300.htm',
 'thehindu.com/mag/2007/07/22/stories/2007072250030400.htm',
 'thehindu.com/2009/03/27/stories/2009032758310300.html']

In [173]:
df.loc[df['domain'] == 'thehindu.com', 'URL_n'] = [re.sub('\?(homepage|css|sec|test|ref|_escaped_fragment_).*#', '#', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'thehindu.com', 'URL_n']]
df.loc[df['domain'] == 'thehindu.com', 'URL_n'] = [re.sub('\?(homepage|css|sec|test|ref|_escaped_fragment_).*', '', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'thehindu.com', 'URL_n']]

In [174]:
list(df.loc[df['domain'] == 'thehindu.com', 'URL_n'])[0:5]

['thehindu.com/thehindu/mp/2003/02/06/stories/2003020600890100.htm',
 'thehindu.com/thehindu/2003/03/20/stories/2003032005180300.htm',
 'thehindu.com/2009/08/13/stories/2009081352170300.htm',
 'thehindu.com/mag/2007/07/22/stories/2007072250030400.htm',
 'thehindu.com/2009/03/27/stories/2009032758310300.html']
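
The two substitutions work as a pair: the first handles URLs that carry a fragment after the query string, the second handles those that do not. Since none of the five URLs shown has a query string, the sketch below applies the same two patterns to hypothetical URLs (the article path is invented for illustration).

```python
import re

# The two patterns used for thehindu.com, written as raw strings.
RULES = [
    (r'\?(homepage|css|sec|test|ref|_escaped_fragment_).*#', '#'),
    (r'\?(homepage|css|sec|test|ref|_escaped_fragment_).*', ''),
]

def normalize_thehindu(url):
    """Apply the substitutions in order, as the notebook cell does."""
    for pattern, repl in RULES:
        url = re.sub(pattern, repl, url, flags=re.IGNORECASE)
    return url

# Hypothetical URLs, not taken from the dataset.
with_fragment = normalize_thehindu(
    'thehindu.com/news/article123.ece?homepage=true#comments')
without_fragment = normalize_thehindu(
    'thehindu.com/news/article123.ece?homepage=true')
print(with_fragment)
print(without_fragment)
```

The order matters: if the second pattern ran first, it would also delete the fragment.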

### 63. cricketarchive.com

Very few URLs include parameters, and the page appears to be the same with or without them.

In [175]:
list(df.loc[df['domain'] == 'cricketarchive.com', 'URL_n'])[0:5]

['cricketarchive.com/Archive/Grounds/4/10174.html',
 'cricketarchive.com/Archive/Players/187/187174/187174.html',
 'cricketarchive.com/Archive/Players/158/158668/158668.html',
 'cricketarchive.com/Kerala/Players/260/260744/260744.html',
 'cricketarchive.com/Archive/Players/28/28331/28331.html']

### 64. articles.latimes.com

No changes required.

In [176]:
list(df.loc[df['domain'] == 'articles.latimes.com', 'URL_n'])[0:5]

['articles.latimes.com/2007/07/25/calendar/et-scriptland25',
 'articles.latimes.com/2007/aug/03/entertainment/et-doublesuicide3',
 'articles.latimes.com/2004/apr/20/entertainment/et-dutka20',
 'articles.latimes.com/2007/jun/12/entertainment/et-shehori12',
 'articles.latimes.com/2008/jan/20/world/fg-farc20']

### 65. discogs.com

The main parameters are anv, filter_anv, noanv (these are used to link to alternative names) and release. Only layout= is removed.

In [177]:
list(df.loc[df['domain'] == 'discogs.com', 'URL_n'])[0:5]

['discogs.com/release/559377',
 'discogs.com/label/Digital+Hardcore+Recordings+(DHR)',
 'discogs.com/label/Cheap',
 'discogs.com/label/Stigmata',
 'discogs.com/artist/Imatran+Voima']

In [178]:
df.loc[df['domain'] == 'discogs.com', 'URL_n'] = [re.sub('(layout=).*?(&|$)', '', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'discogs.com', 'URL_n']]
df.loc[df['domain'] == 'discogs.com', 'URL_n'] = [re.sub('(\?$|\?&$|&$)', '', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'discogs.com', 'URL_n']]

In [179]:
list(df.loc[df['domain'] == 'discogs.com', 'URL_n'])[0:5]

['discogs.com/release/559377',
 'discogs.com/label/Digital+Hardcore+Recordings+(DHR)',
 'discogs.com/label/Cheap',
 'discogs.com/label/Stigmata',
 'discogs.com/artist/Imatran+Voima']

### 66. officialcharts.com

No changes required.

In [180]:
list(df.loc[df['domain'] == 'officialcharts.com', 'URL_n'])[0:5]

['officialcharts.com/artist/_/cocteau%20twins/#albums',
 'officialcharts.com/artist/_/kelly%20clarkson',
 'officialcharts.com/singles-chart',
 'officialcharts.com/artist/_/bruce%20springsteen/#albums',
 'officialcharts.com/artist/_/nick%20cave%20&%20the%20bad%20seeds/#albums']

### 67. metacritic.com

There are several parameters (filter, q, sort, page, ref, ftag, part, tag...) but only a few can be removed.

In [181]:
list(df.loc[df['domain'] == 'metacritic.com', 'URL_n'])[0:5]

['metacritic.com/video/titles/diamondmen',
 'metacritic.com/music/artists/explosionsinthesky/earthisnotacolddeadplace',
 'metacritic.com/music/artists/futureheads/futureheads',
 'metacritic.com/games/platforms/ps2/fullmetalalchemistdreamcarnival',
 'metacritic.com/music/artists/coxongraham/goldend']

In [182]:
df.loc[df['domain'] == 'metacritic.com', 'URL_n'] = [re.sub('\?q=.*#', '#', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'metacritic.com', 'URL_n']]
df.loc[df['domain'] == 'metacritic.com', 'URL_n'] = [re.sub('\?(q=|part=).*', '', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'metacritic.com', 'URL_n']]

In [183]:
list(df.loc[df['domain'] == 'metacritic.com', 'URL_n'])[0:5]

['metacritic.com/video/titles/diamondmen',
 'metacritic.com/music/artists/explosionsinthesky/earthisnotacolddeadplace',
 'metacritic.com/music/artists/futureheads/futureheads',
 'metacritic.com/games/platforms/ps2/fullmetalalchemistdreamcarnival',
 'metacritic.com/music/artists/coxongraham/goldend']

### 68. abc.net.au

Query strings that start with site or section are removed.

In [None]:
list(df.loc[df['domain'] == 'abc.net.au', 'URL_n'])[0:5]

In [None]:
df.loc[df['domain'] == 'abc.net.au', 'URL_n'] = [re.sub('\.htm\?(site|section).*#', '.htm#', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'abc.net.au', 'URL_n']]
df.loc[df['domain'] == 'abc.net.au', 'URL_n'] = [re.sub('\.htm\?(site|section).*', '.htm', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'abc.net.au', 'URL_n']]
df.loc[df['domain'] == 'abc.net.au', 'URL_n'] = [re.sub('(/\?|\?)(site|section).*', '', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'abc.net.au', 'URL_n']]

In [None]:
list(df.loc[df['domain'] == 'abc.net.au', 'URL_n'])[0:5]

### 69. books.google.co.uk

These URLs require a lot of transformations to be reduced.

In [184]:
list(df.loc[df['domain'] == 'books.google.co.uk', 'URL_n'])[0:5]

['books.google.co.uk/books?isbn=0304706566',
 'books.google.co.uk/books?dq=Mr+Coughron+mathematician&ei=Cz7jT_CrKInS0QXu7_SaAw&hl=en&id=MukGAAAAYAAJ&lpg=RA1-PA225&ots=3Mt8cBxmf7&pg=RA1-PA225&sa=X&sig=f1rBp8xK75LJuIF5pMWffUHP6Fg&source=bl&sqi=2&ved=0CDsQ6AEwAw#v%3Donepage%26q%3DMr%20Coughron%20mathematician%26f%3Dfalse',
 'books.google.co.uk/books?dq=james+robson+of+the+pretenders+army&ei=TgHiT-fcGaii0QX3tfXEAw&hl=en&id=-RtNAAAAMAAJ&lpg=PA69&ots=4sHitgd86J&pg=PA69&sa=X&sig=LoZRCHcysfZzSQHOw7RdY-kkXk0&source=bl&sqi=2&ved=0CDkQ6AEwAQ#v%3Dsnippet%26q%3Djames%20robson%26f%3Dfalse',
 'books.google.co.uk/books?cad=0&id=MB4IAAAAQAAJ&printsec=frontcover&source=gbs_ge_summary_r#v%3Donepage%26q%26f%3Dfalse',
 'books.google.co.uk/books?dq=ambassador+peter+onu&ei=rAOHVLzDCYvnUr2ugIgH&hl=en&id=S9NKCCXbzbQC&lpg=PA98&ots=SpY1hfSZ1K&pg=PA98&sa=X&sig=9NqfSmLJvEFih9ccn4dmU_bs6Y4&source=bl&ved=0CDkQ6AEwBA#v%3Donepage%26q%3Dambassador%20peter%20onu%26f%3Dfalse']

In [185]:
df.loc[df['domain'] == 'books.google.co.uk', 'URL_n'] = [re.sub('\?.*&id=', '?id=', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'books.google.co.uk', 'URL_n']]
df.loc[df['domain'] == 'books.google.co.uk', 'URL_n'] = [re.sub('(\?id=.*?&)(.*)', r'\1', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'books.google.co.uk', 'URL_n']]
df.loc[df['domain'] == 'books.google.co.uk', 'URL_n'] = [re.sub('\?.*&q=', '?q=', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'books.google.co.uk', 'URL_n']]
df.loc[df['domain'] == 'books.google.co.uk', 'URL_n'] = [re.sub('ngrams/graph\?.*&content=', 'ngrams/graph?content=', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'books.google.co.uk', 'URL_n']]
df.loc[df['domain'] == 'books.google.co.uk', 'URL_n'] = [re.sub('&.*', '', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'books.google.co.uk', 'URL_n']]
df.loc[df['domain'] == 'books.google.co.uk', 'URL_n'] = [re.sub('/books.*\?id=', '/?id=', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'books.google.co.uk', 'URL_n']]
df.loc[df['domain'] == 'books.google.co.uk', 'URL_n'] = [re.sub('/books.*\?vid=', '/?vid=', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'books.google.co.uk', 'URL_n']]
df.loc[df['domain'] == 'books.google.co.uk', 'URL_n'] = [re.sub('/books.*\?q=', '/?q=', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'books.google.co.uk', 'URL_n']]
df.loc[df['domain'] == 'books.google.co.uk', 'URL_n'] = [re.sub('#(v|search).*', '', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'books.google.co.uk', 'URL_n']]

In [186]:
list(df.loc[df['domain'] == 'books.google.co.uk', 'URL_n'])[0:5]

['books.google.co.uk/books?isbn=0304706566',
 'books.google.co.uk/?id=MukGAAAAYAAJ',
 'books.google.co.uk/?id=-RtNAAAAMAAJ',
 'books.google.co.uk/?id=MB4IAAAAQAAJ',
 'books.google.co.uk/?id=S9NKCCXbzbQC']
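
For the most common case, a URL that carries an id parameter, the chain of substitutions reduces the URL to the host plus `?id=`. A hedged sketch of that single case using `urllib.parse.parse_qs` follows. The function name `reduce_google_books` is an assumption, and the sketch deliberately leaves URLs without an id untouched, so it does not cover the vid, q or ngrams rules.

```python
from urllib.parse import parse_qs

def reduce_google_books(url):
    """Reduce a scheme-less Google Books URL to '<host>/?id=<id>' when possible."""
    base, _, _fragment = url.partition('#')   # the fragment is discarded
    path, _, query = base.partition('?')
    host = path.split('/', 1)[0]
    params = parse_qs(query)
    if 'id' in params:
        return f"{host}/?id={params['id'][0]}"
    return url  # no id parameter: left unchanged in this sketch

print(reduce_google_books(
    'books.google.co.uk/books?cad=0&id=MB4IAAAAQAAJ&printsec=frontcover'
    '&source=gbs_ge_summary_r#v%3Donepage%26q%26f%3Dfalse'))
print(reduce_google_books('books.google.co.uk/books?isbn=0304706566'))
```

Reading the parameter by name avoids depending on the order in which parameters appear in the query string.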

### 70. facebook.com

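The theater, theatre and stream_ flags and several tracking parameters (__tn__, pnref, fref, type, notif_t, permpage, total_comments) are removed.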

In [187]:
list(df.loc[df['domain'] == 'facebook.com', 'URL_n'])[0:5]

['facebook.com/terms.php',
 'facebook.com/apps/application.php?id=17895489176',
 'facebook.com/policy.php',
 'facebook.com/profile.php?id=789638867',
 'facebook.com/pages/Ivoryton-CT/The-Ivoryton-Playhouse-Foundation/6553109986?ref=ts']

In [188]:
df.loc[df['domain'] == 'facebook.com', 'URL_n'] = [re.sub('\?(theater|theatre|stream_).*?(&|$)', '?', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'facebook.com', 'URL_n']]
df.loc[df['domain'] == 'facebook.com', 'URL_n'] = [re.sub('(__tn__=|pnref=|fref=|type=|notif_t=|permpage=|total_comments=).*?(&|$)', '', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'facebook.com', 'URL_n']]
df.loc[df['domain'] == 'facebook.com', 'URL_n'] = [re.sub('(\?$|\?&$|&$)', '', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'facebook.com', 'URL_n']]

In [189]:
list(df.loc[df['domain'] == 'facebook.com', 'URL_n'])[0:5]

['facebook.com/terms.php',
 'facebook.com/apps/application.php?id=17895489176',
 'facebook.com/policy.php',
 'facebook.com/profile.php?id=789638867',
 'facebook.com/pages/Ivoryton-CT/The-Ivoryton-Playhouse-Foundation/6553109986?ref=ts']

### 71. cbc.ca

#ixzz fragments and query strings that start with __vfz=, autoplay=, rss=, cmp=, ref=, print= or r= are removed.

In [190]:
list(df.loc[df['domain'] == 'cbc.ca', 'URL_n'])[0:5]

['cbc.ca/olympics/sports/skijumping/stories/index.shtml?/story/olympics/national/2006/01/01/Sports/jakub_janda060101.html',
 'cbc.ca/canadavotes',
 'cbc.ca/news/background/paris_riots',
 'cbc.ca/canadavotes/riding/192',
 'cbc.ca/canadavotes/riding/114']

In [191]:
df.loc[df['domain'] == 'cbc.ca', 'URL_n'] = [re.sub('#ixzz.*', '', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'cbc.ca', 'URL_n']]
df.loc[df['domain'] == 'cbc.ca', 'URL_n'] = [re.sub('\.html\?(__vfz=|autoplay=|rss=|cmp=|ref=|print=|r=).*', '.html', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'cbc.ca', 'URL_n']]
df.loc[df['domain'] == 'cbc.ca', 'URL_n'] = [re.sub('\?(__vfz=|autoplay=|rss=|cmp=|ref=|print=|r=).*', '', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'cbc.ca', 'URL_n']]

In [192]:
list(df.loc[df['domain'] == 'cbc.ca', 'URL_n'])[0:5]

['cbc.ca/olympics/sports/skijumping/stories/index.shtml?/story/olympics/national/2006/01/01/Sports/jakub_janda060101.html',
 'cbc.ca/canadavotes',
 'cbc.ca/news/background/paris_riots',
 'cbc.ca/canadavotes/riding/192',
 'cbc.ca/canadavotes/riding/114']

### 72. amazon.com

ref= and keywords= cannot be removed in all cases.

In [193]:
list(df.loc[df['domain'] == 'amazon.com', 'URL_n'])[0:5]

['amazon.com/Nachume-Miller-Donald-B-Kuspit/dp/0913263206',
 'amazon.com/gp/pdp/profile/A1LTNPL5JQN85U/104-3914805-7956713',
 'amazon.com/Economic-Structures-of-Antiquity/dp/B000PY3KRQ',
 'amazon.com/s?field-author=Milton%20V.%20Backman&ie=UTF8&index=books&page=1&search-type=ss',
 'amazon.com/Childrens-Hospital-Chris-Adrian/dp/1932416609']

In [194]:
df.loc[df['domain'] == 'amazon.com', 'URL_n'] = [re.sub('(\?ie=|\?_encoding=).*', '', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'amazon.com', 'URL_n']]
df.loc[df['domain'] == 'amazon.com', 'URL_n'] = [re.sub('([0-9a-z]{2})(/ref=.*)', r'\1', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'amazon.com', 'URL_n']]

In [195]:
list(df.loc[df['domain'] == 'amazon.com', 'URL_n'])[0:5]

['amazon.com/Nachume-Miller-Donald-B-Kuspit/dp/0913263206',
 'amazon.com/gp/pdp/profile/A1LTNPL5JQN85U/104-3914805-7956713',
 'amazon.com/Economic-Structures-of-Antiquity/dp/B000PY3KRQ',
 'amazon.com/s?field-author=Milton%20V.%20Backman&ie=UTF8&index=books&page=1&search-type=ss',
 'amazon.com/Childrens-Hospital-Chris-Adrian/dp/1932416609']

### 73. espn.com

No changes required.

In [196]:
list(df.loc[df['domain'] == 'espn.com', 'URL_n'])[0:5]

['espn.com/',
 'espn.com/',
 'espn.com/',
 'espn.com/',
 'espn.com/page2/s/1986/revisit/alcs.html']

### 74. latimes.com

#axzz and #ixzz fragments and query strings that start with col, track, cset, _amp, outputType, barc or dlvrit are removed.

In [197]:
list(df.loc[df['domain'] == 'latimes.com', 'URL_n'])[0:5]

['latimes.com/news/nationworld/nation/la-na-banker8jan08,0,1764103.story?coll=la-home-headlines&track=morenews',
 'latimes.com/news/obituaries/la-me-pickett20jan20,0,2840087.story',
 'latimes.com/news/obituaries/la-me-maynardsmith24apr24,0,6745983.story?coll=la-news-obituaries',
 'latimes.com/news/science/la-sci-solarsail20jun20,1,2584884.story?coll=la-news-science&cset=true&ctrack=1',
 'latimes.com/classified/realestate/printedition/la-re-guide23may23,0,1036326.story?coll=la-class-realestate']

In [198]:
df.loc[df['domain'] == 'latimes.com', 'URL_n'] = [re.sub('(#axzz|#ixzz).*', '', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'latimes.com', 'URL_n']]
df.loc[df['domain'] == 'latimes.com', 'URL_n'] = [re.sub('\.html\?(col|track|cset|_amp|outputType|barc|dlvrit).*#pag', '.html#pag', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'latimes.com', 'URL_n']]
df.loc[df['domain'] == 'latimes.com', 'URL_n'] = [re.sub('\.html\?(col|track|cset|_amp|outputType|barc|dlvrit).*', '.html', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'latimes.com', 'URL_n']]
df.loc[df['domain'] == 'latimes.com', 'URL_n'] = [re.sub('(\?|\.html\?)(col|track|cset|_amp|outputType|barc|dlvrit).*', '', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'latimes.com', 'URL_n']]

In [199]:
list(df.loc[df['domain'] == 'latimes.com', 'URL_n'])[0:5]

['latimes.com/news/nationworld/nation/la-na-banker8jan08,0,1764103.story',
 'latimes.com/news/obituaries/la-me-pickett20jan20,0,2840087.story',
 'latimes.com/news/obituaries/la-me-maynardsmith24apr24,0,6745983.story',
 'latimes.com/news/science/la-sci-solarsail20jun20,1,2584884.story',
 'latimes.com/classified/realestate/printedition/la-re-guide23may23,0,1036326.story']
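
Every domain in this section follows the same pattern: an ordered list of substitutions applied to that domain's URLs. One possible way to organize this, sketched below under the assumption that the rules are kept in a dictionary keyed by domain, is a single helper. The names `RULES_BY_DOMAIN` and `normalize` are hypothetical; the patterns are the ones used for latimes.com and rollingstone.com.

```python
import re

LATIMES_KEYS = 'col|track|cset|_amp|outputType|barc|dlvrit'

# Ordered (pattern, replacement) pairs per domain.
RULES_BY_DOMAIN = {
    'latimes.com': [
        (r'(#axzz|#ixzz).*', ''),
        (r'\.html\?(' + LATIMES_KEYS + r').*#pag', '.html#pag'),
        (r'\.html\?(' + LATIMES_KEYS + r').*', '.html'),
        (r'(\?|\.html\?)(' + LATIMES_KEYS + r').*', ''),
    ],
    'rollingstone.com': [
        (r'#ixzz.*', ''),
        (r'\?source.*', ''),
    ],
}

def normalize(domain, url):
    """Apply the domain's rules in order; unknown domains are returned as-is."""
    for pattern, repl in RULES_BY_DOMAIN.get(domain, []):
        url = re.sub(pattern, repl, url, flags=re.IGNORECASE)
    return url

print(normalize(
    'latimes.com',
    'latimes.com/news/nationworld/nation/la-na-banker8jan08,0,1764103.story'
    '?coll=la-home-headlines&track=morenews'))
```

Keeping the rules as data makes it easier to review them per domain and to apply them with one loop instead of one assignment per substitution.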

### 75. usatoday.com

Query strings that start with POE=, loc, csp= or dlvrit= are removed.

In [200]:
list(df.loc[df['domain'] == 'usatoday.com', 'URL_n'])[0:5]

['usatoday.com/life/books/news/2006-07-17-magdalene-book_x.htm',
 'usatoday.com/news/nation/2005-08-20-counter-protest_x.htm',
 'usatoday.com/sports/college/football/2006-11-06-clock-loophole_x.htm?imw=Y',
 'usatoday.com/sports/preps/football/2003-09-10-mckissick-milestone_x.htm',
 'usatoday.com/sports/preps/football/2006-08-30-mckissick_x.htm']

In [201]:
df.loc[df['domain'] == 'usatoday.com', 'URL_n'] = [re.sub('\.htm\?(POE=|loc|csp=|dlvrit=).*#', '.htm#', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'usatoday.com', 'URL_n']]
df.loc[df['domain'] == 'usatoday.com', 'URL_n'] = [re.sub('\.htm\?(POE=|loc|csp=|dlvrit=).*', '.htm', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'usatoday.com', 'URL_n']]
df.loc[df['domain'] == 'usatoday.com', 'URL_n'] = [re.sub('(/\?|/1\?)(POE=|loc|csp=|dlvrit=).*', '', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'usatoday.com', 'URL_n']]

In [202]:
list(df.loc[df['domain'] == 'usatoday.com', 'URL_n'])[0:5]

['usatoday.com/life/books/news/2006-07-17-magdalene-book_x.htm',
 'usatoday.com/news/nation/2005-08-20-counter-protest_x.htm',
 'usatoday.com/sports/college/football/2006-11-06-clock-loophole_x.htm?imw=Y',
 'usatoday.com/sports/preps/football/2003-09-10-mckissick-milestone_x.htm',
 'usatoday.com/sports/preps/football/2006-08-30-mckissick_x.htm']

### 76. rollingstone.com

Similar to hollywoodreporter: #ixzz fragments and query strings that start with source are removed.

In [203]:
list(df.loc[df['domain'] == 'rollingstone.com', 'URL_n'])[0:5]

['rollingstone.com/reviews/cd/review.asp?aid=12318&cf=',
 'rollingstone.com/artists/theflaminglips/albums/album/88623/review/5946762/finally_the_punk_rockers_are_taking_acid_198388',
 'rollingstone.com/rockdaily/index.php/2006/09/22/rs-exclusive-download-the-oohlas-tripped',
 'rollingstone.com/',
 'rollingstone.com/']

In [204]:
df.loc[df['domain'] == 'rollingstone.com', 'URL_n'] = [re.sub('#ixzz.*', '', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'rollingstone.com', 'URL_n']]
df.loc[df['domain'] == 'rollingstone.com', 'URL_n'] = [re.sub('\?source.*', '', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'rollingstone.com', 'URL_n']]

In [205]:
list(df.loc[df['domain'] == 'rollingstone.com', 'URL_n'])[0:5]

['rollingstone.com/reviews/cd/review.asp?aid=12318&cf=',
 'rollingstone.com/artists/theflaminglips/albums/album/88623/review/5946762/finally_the_punk_rockers_are_taking_acid_198388',
 'rollingstone.com/rockdaily/index.php/2006/09/22/rs-exclusive-download-the-oohlas-tripped',
 'rollingstone.com/',
 'rollingstone.com/']

### 77. smh.com.au

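#ixzz fragments are removed, as are query strings after .html that start with from=, skin=, oneclick=, s_cid=, autostart=, rand=, feed=, fbclid=, deviceType= or ref=.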

In [206]:
list(df.loc[df['domain'] == 'smh.com.au', 'URL_n'])[0:5]

['smh.com.au/news/national/wall-street-comes-to-campus/2005/09/30/1127804663283.html',
 'smh.com.au/news/sport/dale-begs-the-question-can-australia-win-a-mogul-medal-in-turin/2006/01/23/1137864864166.html',
 'smh.com.au/articles/2004/05/19/1084917645003.html?from=storyrhs',
 'smh.com.au/news/national/granny-killer-found-dead-in-cell/2005/09/09/1125772681493.html',
 'smh.com.au/news/national/mystery-woman-pays-for-killers-funeral/2005/09/17/1126750168519.html']

In [207]:
df.loc[df['domain'] == 'smh.com.au', 'URL_n'] = [re.sub('#ixzz.*', '', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'smh.com.au', 'URL_n']]
df.loc[df['domain'] == 'smh.com.au', 'URL_n'] = [re.sub('\.html\?(from=|skin=|oneclick=|s_cid=|autostart=|rand=|feed=|fbclid=|deviceType=|ref=).*#', '.html#', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'smh.com.au', 'URL_n']]
df.loc[df['domain'] == 'smh.com.au', 'URL_n'] = [re.sub('\.html\?(from=|skin=|oneclick=|s_cid=|autostart=|rand=|feed=|fbclid=|deviceType=|ref=).*', '.html', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'smh.com.au', 'URL_n']]

In [208]:
list(df.loc[df['domain'] == 'smh.com.au', 'URL_n'])[0:5]

['smh.com.au/news/national/wall-street-comes-to-campus/2005/09/30/1127804663283.html',
 'smh.com.au/news/sport/dale-begs-the-question-can-australia-win-a-mogul-medal-in-turin/2006/01/23/1137864864166.html',
 'smh.com.au/articles/2004/05/19/1084917645003.html',
 'smh.com.au/news/national/granny-killer-found-dead-in-cell/2005/09/09/1125772681493.html',
 'smh.com.au/news/national/mystery-woman-pays-for-killers-funeral/2005/09/17/1126750168519.html']

### 78. forbes.com

Some parameters such as list= can be removed, but doing so changes the page.

In [209]:
list(df.loc[df['domain'] == 'forbes.com', 'URL_n'])[0:5]

['forbes.com/',
 'forbes.com/finance/mktguideapps/personinfo/FromPersonIdPersonTearsheet.jhtml?passedPersonId=938304',
 'forbes.com/technology/2004/03/04/cz_jw_0304soapbox.html',
 'forbes.com/finance/mktguideapps/personinfo/FromPersonIdPersonTearsheet.jhtml?passedPersonId=1106729',
 'forbes.com/finance/mktguideapps/personinfo/FromPersonIdPersonTearsheet.jhtml?passedPersonId=895590']

In [210]:
df.loc[df['domain'] == 'forbes.com', 'URL_n'] = [re.sub('(/*)#[a-z0-9]{12}.*', '', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'forbes.com', 'URL_n']]
df.loc[df['domain'] == 'forbes.com', 'URL_n'] = [re.sub('\.html\?(thisSpeed=|c=|boxes=).*#', '.html#', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'forbes.com', 'URL_n']]
df.loc[df['domain'] == 'forbes.com', 'URL_n'] = [re.sub('\.html\?(thisSpeed=|c=|boxes=).*', '.html', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'forbes.com', 'URL_n']]
df.loc[df['domain'] == 'forbes.com', 'URL_n'] = [re.sub('(/*)\?(thisSpeed=|c=|boxes=).*#', '#', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'forbes.com', 'URL_n']]
df.loc[df['domain'] == 'forbes.com', 'URL_n'] = [re.sub('(/*)\?(thisSpeed=|c=|boxes=).*', '', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'forbes.com', 'URL_n']]

In [211]:
list(df.loc[df['domain'] == 'forbes.com', 'URL_n'])[0:5]

['forbes.com/',
 'forbes.com/finance/mktguideapps/personinfo/FromPersonIdPersonTearsheet.jhtml?passedPersonId=938304',
 'forbes.com/technology/2004/03/04/cz_jw_0304soapbox.html',
 'forbes.com/finance/mktguideapps/personinfo/FromPersonIdPersonTearsheet.jhtml?passedPersonId=1106729',
 'forbes.com/finance/mktguideapps/personinfo/FromPersonIdPersonTearsheet.jhtml?passedPersonId=895590']

### 79. cnn.com

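A trailing hpt= parameter is removed, and the query string is cut from the first eref=, iref=, section= or _S= parameter onward.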

In [212]:
list(df.loc[df['domain'] == 'cnn.com', 'URL_n'])[0:5]

['cnn.com/2006/LAW/01/12/vermont.judge.ap',
 'cnn.com/2003/EDUCATION/08/01/student.tracking/index.html',
 'cnn.com/2006/POLITICS/01/03/abramoff.fallout',
 'cnn.com/2006/POLITICS/01/04/lobbyist.fraud.ap/index.html',
 'cnn.com/2005/LAW/03/11/atlanta.shooting/index.html']

In [213]:
df.loc[df['domain'] == 'cnn.com', 'URL_n'] = [re.sub('(\?|&)hpt=[a-z0-9_]+$', '', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'cnn.com', 'URL_n']]
df.loc[df['domain'] == 'cnn.com', 'URL_n'] = [re.sub('(\?|&)(eref|iref|section|_S)=.*', '', x, flags=re.IGNORECASE) for x in df.loc[df['domain'] == 'cnn.com', 'URL_n']]

In [214]:
list(df.loc[df['domain'] == 'cnn.com', 'URL_n'])[0:5]

['cnn.com/2006/LAW/01/12/vermont.judge.ap',
 'cnn.com/2003/EDUCATION/08/01/student.tracking/index.html',
 'cnn.com/2006/POLITICS/01/03/abramoff.fallout',
 'cnn.com/2006/POLITICS/01/04/lobbyist.fraud.ap/index.html',
 'cnn.com/2005/LAW/03/11/atlanta.shooting/index.html']

## Final

In [216]:
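# Strip trailing '/', '&' and '?' characters. Each pass removes a single run of
# one character, so three passes handle mixed endings such as '/?&'.
df['URL_n'] = [re.sub(r'(/+$|&+$|\?+$)', '', x) for x in df['URL_n']]
df['URL_n'] = [re.sub(r'(/+$|&+$|\?+$)', '', x) for x in df['URL_n']]
df['URL_n'] = [re.sub(r'(/+$|&+$|\?+$)', '', x) for x in df['URL_n']]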
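
The trailing-character cleanup can also be sketched as a single pass with a character class, which strips any mix of `/`, `&` and `?` at the end of the string. This is a suggested alternative rather than the notebook's method; the function names are hypothetical, and `example.org` is an invented URL.

```python
import re

TRAILING = re.compile(r'[/&?]+$')

def strip_trailing(url):
    """Single pass: remove any trailing mix of '/', '&' and '?'."""
    return TRAILING.sub('', url)

def strip_trailing_three_pass(url):
    """The approach used in the cell: the same substitution applied three times."""
    for _ in range(3):
        url = re.sub(r'(/+$|&+$|\?+$)', '', url)
    return url

print(strip_trailing('housing.berkeley.edu/housing/'))
print(strip_trailing('example.org/path/?&'))
print(strip_trailing_three_pass('example.org/path/?&'))
```

The two versions agree unless more than three runs of different characters alternate at the end of the URL, in which case only the single-pass version removes all of them. With pandas, the same pattern could be passed to `Series.str.replace` with `regex=True`.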

In [219]:
df = df.rename(columns={'el_from': 'page_id', 'el_to': 'URL'})
df.head()

Unnamed: 0,page_id,URL,URL_n,domain
0,3850540,http://www.housing.berkeley.edu/housing/,housing.berkeley.edu/housing,housing.berkeley.edu
1,3850540,http://www.freebornhall.com/History/ResidenceH...,freebornhall.com/History/ResidenceHalls,freebornhall.com
4,840171,http://www.kloster-einsiedeln.ch,kloster-einsiedeln.ch,kloster-einsiedeln.ch
5,1290279,http://www.erbzine.com/mag1/0117.html,erbzine.com/mag1/0117.html,erbzine.com
6,3856533,http://freepages.genealogy.rootsweb.com/~vanrc...,freepages.genealogy.rootsweb.com/~vanrcwisner/...,freepages.genealogy.rootsweb.com


There are 51,263,816 unique normalized URLs.

In [220]:
df['URL_n'].nunique()

51263816

In [221]:
df.to_csv('url_normalization/url_ext_norm.tsv', sep='\t', index=False)