# Categories analysis

This notebook contains all the data preprocessing and analysis of English Wikipedia categories carried out for the study "*Where is the science in Wikipedia? Identification and characterization of scientifically supported contents*", authored by:
* Wenceslao Arroyo-Machado
* Daniel Torres-Salinas
* Rodrigo Costas

# Libraries

In [1]:
import pandas as pd
import csv

# 1. Data preprocessing

The main file is `page` as it includes the `page id` and title of articles and categories.

In [2]:
df_page = pd.read_csv('data/page.tsv', sep='\t', quoting=csv.QUOTE_NONE)
df_page

Unnamed: 0,page_id,namespace,title,is_redirect,is_new,touched,links_updated,latest,len,content_model,page_edits,creation,editors,views,references
0,10,0,AccessibleComputing,1,0,20210607122734,2.021061e+13,1002250816,111,wikitext,14.0,2001-01-21,13.0,186.0,
1,12,0,Anarchism,0,0,20210701093040,2.021070e+13,1030472204,96584,wikitext,19819.0,2001-10-11,3773.0,237226.0,92.0
2,13,0,AfghanistanHistory,1,0,20210629133822,2.021061e+13,783865149,90,wikitext,6.0,2001-04-05,5.0,47.0,
3,14,0,AfghanistanGeography,1,0,20210607122734,2.021061e+13,783865160,92,wikitext,7.0,2001-01-21,7.0,23.0,
4,15,0,AfghanistanPeople,1,0,20210629123442,2.021061e+13,783865293,95,wikitext,8.0,2001-01-21,7.0,16.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
53710524,68103374,11,World_championships_in_2023,0,1,20210701094743,2.021070e+13,1031387361,22,wikitext,1.0,2021-07-01,1.0,,
53710525,68103375,3,2C0F:F5F0:4290:3F0:74A4:790D:12EC:9B01,0,1,20210701094755,2.021070e+13,1031387381,799,wikitext,1.0,2021-07-01,1.0,,
53710526,68103376,2,Asif_Khan_Tarand,0,1,20210701094801,2.021070e+13,1031387393,11,wikitext,3.0,2021-07-01,1.0,,
53710527,68103377,118,Juanita_Head_Walton,1,1,20210701094810,2.021070e+13,1031387407,80,wikitext,1.0,2021-07-01,1.0,,
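The call above passes `quoting=csv.QUOTE_NONE`. A likely reason, stated here as an assumption, is that page titles can contain literal double quotes, which the default parser settings would treat as field delimiters. The following minimal sketch uses an invented two-row TSV to show that quotes survive unchanged with this option:

```python
import csv
import io

import pandas as pd

# Hypothetical two-row TSV; the second title contains literal double quotes.
raw = 'page_id\tnamespace\ttitle\n12\t0\tAnarchism\n99\t0\t"Weird_Al"_Yankovic\n'

# QUOTE_NONE tells the parser to treat quote characters as ordinary text.
df_toy = pd.read_csv(io.StringIO(raw), sep='\t', quoting=csv.QUOTE_NONE)
print(df_toy['title'].tolist())
```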


The `page id` of each hidden category is identified from the `page properties` file, and the category name is then retrieved from the `page` file. There are 30,146 hidden categories.

In [3]:
df_prop = pd.read_csv('data/page_property.tsv', sep='\t')
hidden_cat_pages = df_prop[df_prop['property_name']=='hiddencat']['page_id'].tolist()
hidden_cat_pages = df_page[(df_page['namespace']==14) & (df_page['page_id'].isin(hidden_cat_pages))]['title'].tolist()
len(hidden_cat_pages)

30146
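The two-step lookup used for hidden categories can be illustrated on toy data. This is only a sketch: the tables `pages` and `props` and all their values are invented, and only the column names mirror those used in the notebook.

```python
import pandas as pd

# Toy stand-ins for the `page` and `page_property` tables (values are invented).
pages = pd.DataFrame({
    'page_id': [1, 2, 3, 4],
    'namespace': [0, 14, 14, 14],
    'title': ['Anarchism', 'Vietnam_stubs', 'Comedy', 'Unprintworthy_redirects'],
})
props = pd.DataFrame({
    'page_id': [2, 4, 4],
    'property_name': ['hiddencat', 'hiddencat', 'defaultsort'],
})

# Step 1: page ids flagged with the `hiddencat` property.
hidden_ids = props.loc[props['property_name'] == 'hiddencat', 'page_id']

# Step 2: keep only category pages (namespace 14) and read their titles.
is_hidden_cat = (pages['namespace'] == 14) & pages['page_id'].isin(hidden_ids)
hidden_titles = pages.loc[is_hidden_cat, 'title'].tolist()
print(hidden_titles)  # ['Vietnam_stubs', 'Unprintworthy_redirects']
```

Note that `isin` accepts a Series directly, so the intermediate `.tolist()` is optional.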

There are 2,094,407 categories, which are reduced to 2,064,261 (98.56%) after removing hidden categories. It should be noted that not all of them are used.

In [4]:
df_cat = pd.read_csv('data/category.tsv', sep='\t')
df_cat = df_cat[df_cat['title'].isin(df_page[df_page['namespace']==14].title.tolist())][['category_id', 'title']]
df_cat

Unnamed: 0,category_id,title
0,2,Unprintworthy_redirects
1,3,Computer_storage_devices
2,7,Unknown-importance_Animation_articles
3,8,Low-importance_Animation_articles
4,9,Vietnam_stubs
...,...,...
2179612,248517457,Fukushima_University
2179613,248517458,New_Birth_(band)_songs
2179617,248517464,ASUN_Conference_lacrosse
2179618,248517465,ASUN_Conference_men's_lacrosse


The ids of all categories, hidden ones included, are stored for further analysis.

In [5]:
df_cat_all = df_cat.category_id.tolist()

In [6]:
df_cat = df_cat[~df_cat['title'].isin(hidden_cat_pages)]
df_cat.shape

(2064261, 2)
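The counts reported so far can be cross-checked with simple arithmetic. The sketch below only reuses the figures quoted in the text (2,094,407 categories and 30,146 hidden categories); the variable names are illustrative.

```python
# Figures quoted in the text above.
total_categories = 2_094_407
hidden_categories = 30_146

# Categories left after removing the hidden ones, and their share of the total.
remaining = total_categories - hidden_categories
share_remaining = round(100 * remaining / total_categories, 2)
print(remaining, share_remaining)  # 2064261 98.56
```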

Before continuing with the analysis of categories, the `page` data.frame is reduced to articles that are not redirects, as the analysis focuses only on these 6,328,134 pages.

In [7]:
df_page = df_page[(df_page['is_redirect']==0) & (df_page['namespace']==0)][['page_id', 'title', 'references']]
df_page

Unnamed: 0,page_id,title,references
1,12,Anarchism,92.0
11,25,Autism,226.0
17,39,Albedo,37.0
59,290,A,30.0
67,303,Alabama,207.0
...,...,...,...
53710490,68103340,Karen_Doell,
53710499,68103349,John_W._Fewell,
53710509,68103359,Carrie_Flemmer,
53710515,68103365,Dapp_Browsers,


`page_category` includes the categories assigned to each Wikipedia page. There are a total of 71,149,534 links between pages and categories.

In [8]:
df_page_c = pd.read_csv('data/page_category.tsv', sep='\t')
df_page_c = df_page_c[(df_page_c['page_id'].isin(df_page.page_id.tolist())) &
                     (df_page_c['category_id'].isin(df_cat_all))][['page_id', 'category_id']]
df_page_c

Unnamed: 0,page_id,category_id
9733,18662323,2865710
19244,32626257,2865710
964450,56763534,2865710
1107648,58999938,2865710
1261045,60344561,2865710
...,...,...
165501690,68102341,248517441
165501697,68102945,247961470
165501698,68102945,247245428
165501699,68102958,34380941


A total of 1,495,072 categories are included in at least one page.

In [9]:
len(set(df_page_c.category_id))

1495072

Although hidden categories are only 1.38% of the categories, they account for 50% of the links.

In [10]:
df_page_c = df_page_c[df_page_c['category_id'].isin(df_cat.category_id.tolist())]
df_page_c

Unnamed: 0,page_id,category_id
3281334,12,1659
3281335,11826,1659
3281336,18048,1659
3281337,26775,1659
3281338,43421,1659
...,...,...
165501679,68102176,248517422
165501689,68102341,248517442
165501690,68102341,248517441
165501697,68102945,247961470
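The notebook does not show how the share of links held by hidden categories was computed. One possible way, sketched here on invented data with a hypothetical `hidden_category_ids` list, is to take the mean of a boolean mask over the link table:

```python
import pandas as pd

# Toy page-category links; category 99 plays the role of a hidden category.
links = pd.DataFrame({
    'page_id': [1, 1, 2, 2, 3, 3],
    'category_id': [10, 99, 10, 99, 11, 99],
})
hidden_category_ids = [99]  # hypothetical list of hidden category ids

# Share of links that point to a hidden category.
is_hidden = links['category_id'].isin(hidden_category_ids)
share_hidden_links = round(100 * is_hidden.mean(), 2)

# Share of the linked categories that are hidden.
share_hidden_cats = round(100 * len(hidden_category_ids) / links['category_id'].nunique(), 2)
print(share_hidden_links, share_hidden_cats)  # 50.0 33.33
```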


In [11]:
df_page_c = df_page_c[df_page_c['category_id']!=248440938]

# 2. Categories analysis

In [12]:
page_category = df_page_c.groupby('page_id').size().reset_index(name='categories')
category_page = df_page_c.groupby('category_id').size().reset_index(name='pages')

There are 1,473,628 categories included in at least one page.

In [13]:
len(set(category_page.category_id))

1473628

There are 442 pages without any category. They are added with a count of zero so that they are included in the frequency analysis.

In [42]:
page_category = pd.concat([page_category,
                           pd.DataFrame({'page_id':df_page[~df_page['page_id'].isin(page_category.page_id.tolist())].page_id.tolist(),
                                         'categories':0})])
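An alternative to concatenating a second data.frame is to reindex the grouped counts against the full list of pages. The sketch below uses invented data and is not the approach taken in the notebook; it only shows that both routes give pages without links a count of zero.

```python
import pandas as pd

# Toy data: four pages, only two of which have category links.
all_pages = pd.DataFrame({'page_id': [1, 2, 3, 4]})
links = pd.DataFrame({'page_id': [1, 1, 2], 'category_id': [10, 11, 10]})

# Count categories per page, then add the missing pages with a count of zero.
counts = links.groupby('page_id').size()
page_counts = (counts.reindex(all_pages['page_id'], fill_value=0)
                     .rename('categories')
                     .rename_axis('page_id')
                     .reset_index())
print(page_counts['categories'].tolist())  # [2, 1, 0, 0]
```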

There are 6,327,692 pages with at least one category.

In [15]:
len(set(page_category.page_id))

6327692

On average, categories are included in 24.017 Wikipedia pages (±867.834).

In [17]:
print(round(category_page.pages.mean(),3))
print(round(category_page.pages.std(),3))

24.017
867.834
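A standard deviation far larger than the mean suggests a heavily skewed distribution, in which a few very large categories dominate. The sketch below, on invented values, shows this effect and also makes explicit that `Series.std()` in pandas returns the sample standard deviation (`ddof=1`) by default.

```python
import pandas as pd

# Invented, heavily skewed distribution of pages per category.
pages_per_category = pd.Series([1, 1, 2, 3, 1000], name='pages')

mean = pages_per_category.mean()
median = pages_per_category.median()
std_sample = pages_per_category.std()            # pandas default: ddof=1
std_population = pages_per_category.std(ddof=0)  # population standard deviation

print(round(mean, 3), median)
print(round(std_sample, 3), round(std_population, 3))
```

In such cases the median can be a more informative summary than the mean.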


Pages, in turn, include a mean of 5.593 categories (±4.765).

In [18]:
print(round(page_category.categories.mean(),3))
print(round(page_category.categories.std(),3))

5.593
4.765


`page_category` and `category_page` data.frames are exported to generate plots using R.

In [18]:
page_category.to_csv('results/page_category.tsv', sep='\t', index=False)
category_page.to_csv('results/category_page.tsv', sep='\t', index=False)

# References

In [20]:
df_pubs = pd.read_csv('data/pub.tsv', sep='\t')
df_pubs

Unnamed: 0,pub_id,arxiv,asin,bibcode,doi,isbn,ismn,jfm,jstor,lccn,...,oclc,ol,osti,pmc,pmid,rfc,ssrn,url,usenetid,zbl
0,1,,,,10.4064/fm-17-1-152-170,,,,,,...,,,,,,,,eudml.org/doc/212513,,3.02701
1,2,,,,,9780959659634,,,,,...,,,,,,,,fieldgeologyclubsa.org.au/publications.htm,,
2,3,,,,10.1007/bf02086276,,,,,,...,,,,,,,,zenodo.org/record/1642598,,
3,4,,,,,0719013380,,,,,...,,,,,,,,books.google.com/?id=5367aaaaiaaj,,
4,5,,,,,0618133844,,,,,...,,,,,,,,archive.org/details/eastasiacultural00ebre_0,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2367543,2367544,,,,,,,,,,...,,17845409m,,,,,,,,
2367544,2367545,,,,,,,,,,...,,,,,,,2247615.0,,,
2367545,2367546,,,,,,,,,,...,,,,,,,2617574.0,,,
2367546,2367547,,,,,9781585440450,,,,,...,13093276.0,,,,,,,,,


In [21]:
df_page_pubs = pd.read_csv('data/page_pub.tsv', sep='\t')
df_page_pubs

Unnamed: 0,page_id,pub_id
0,10630303,1
1,12008665,2
2,23560884,3
3,3144280,3
4,652221,4
...,...,...
3728517,3148435,2367544
3728518,34084825,2367545
3728519,17547034,2367546
3728520,6265065,2367547


The number of references that include a DOI or an ISBN is calculated for each page and then merged with the `page` data.frame.

In [23]:
doi = df_page_pubs[df_page_pubs['pub_id'].isin(df_pubs[~df_pubs['doi'].isna()].pub_id.tolist())].groupby('page_id').size().reset_index(name='doi')
isbn = df_page_pubs[df_page_pubs['pub_id'].isin(df_pubs[~df_pubs['isbn'].isna()].pub_id.tolist())].groupby('page_id').size().reset_index(name='isbn')
df_page = df_page.merge(doi, how='left', on='page_id').merge(isbn, how='left', on='page_id')
df_page

Unnamed: 0,page_id,title,references,doi,isbn
0,12,Anarchism,92.0,8.0,56.0
1,25,Autism,226.0,170.0,8.0
2,39,Albedo,37.0,19.0,2.0
3,290,A,30.0,2.0,7.0
4,303,Alabama,207.0,3.0,20.0
...,...,...,...,...,...
6328129,68103340,Karen_Doell,,,
6328130,68103349,John_W._Fewell,,,
6328131,68103359,Carrie_Flemmer,,,
6328132,68103365,Dapp_Browsers,,,
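The per-page counting step can be sketched on toy data. The helper `count_per_page` is hypothetical and the DOI values are invented; only the column names follow the notebook. Pages without any matching reference keep `NaN` after the left merge, as in the output above.

```python
import pandas as pd

# Toy publications: pub 1 has only a DOI, pub 2 only an ISBN, pub 3 has both.
pubs = pd.DataFrame({
    'pub_id': [1, 2, 3],
    'doi': ['10.1000/a', None, '10.1000/c'],   # invented DOIs
    'isbn': [None, '0719013380', '0618133844'],
})
page_pubs = pd.DataFrame({'page_id': [12, 12, 25, 39], 'pub_id': [1, 3, 2, 2]})
pages = pd.DataFrame({'page_id': [12, 25, 39, 290]})


def count_per_page(id_column):
    """Count, per page, the linked publications that have `id_column` set."""
    with_id = pubs.loc[pubs[id_column].notna(), 'pub_id']
    linked = page_pubs[page_pubs['pub_id'].isin(with_id)]
    return linked.groupby('page_id').size().reset_index(name=id_column)


# Left merges keep every page; pages without matches get NaN.
pages = (pages.merge(count_per_page('doi'), how='left', on='page_id')
              .merge(count_per_page('isbn'), how='left', on='page_id'))
print(pages)
```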


In [32]:
top_categories = df_page_c.merge(df_page, how='right', on='page_id')[['category_id', 'doi', 'isbn']].groupby('category_id').sum().reset_index().merge(df_cat, how='inner', on='category_id')
top_categories

Unnamed: 0,category_id,doi,isbn,title
0,3.0,12.0,20.0,Computer_storage_devices
1,9.0,3.0,36.0,Vietnam_stubs
2,10.0,17.0,14.0,Rivers_of_Vietnam
3,16.0,28.0,77.0,Comedy
4,17.0,525.0,533.0,Sociolinguistics
...,...,...,...,...
1473623,248517444.0,0.0,0.0,January_2016_sports_events_in_New_Zealand
1473624,248517449.0,0.0,0.0,Volleyball_players_at_the_2020_Summer_Olympics
1473625,248517454.0,0.0,5.0,Phoenicians_in_the_New_Testament
1473626,248517455.0,0.0,4.0,MFSB_songs


In [35]:
top_categories.sort_values(by='doi', ascending=False).head(20)

Unnamed: 0,category_id,doi,isbn,title
62,173.0,82778.0,172018.0,Living_people
255516,518425.0,22887.0,13687.0,IUCN_Red_List_least_concern_species
871358,235791609.0,18885.0,184.0,Genes_on_human_chromosome_1
883527,239012017.0,13192.0,7998.0,Taxa_named_by_Carl_Linnaeus
192961,310057.0,12793.0,98.0,Transcription_factors
871374,235792035.0,12078.0,124.0,Genes_on_human_chromosome_17
871359,235791644.0,12025.0,112.0,Genes_on_human_chromosome_2
871368,235791887.0,11957.0,144.0,Genes_on_human_chromosome_11
76339,114545.0,11255.0,681.0,Durchmusterung_objects
235043,439236.0,11168.0,677.0,Hipparcos_objects


In [36]:
top_categories.sort_values(by='isbn', ascending=False).head(20)

Unnamed: 0,category_id,doi,isbn,title
62,173.0,82778.0,172018.0,Living_people
282035,612544.0,7.0,19423.0,English_Football_League_players
389,775.0,656.0,18616.0,English-language_films
388,774.0,562.0,17346.0,American_films
81350,121845.0,35.0,17080.0,English_footballers
255516,518425.0,22887.0,13687.0,IUCN_Red_List_least_concern_species
1172931,247354635.0,1030.0,10401.0,20th-century_American_male_writers
1201909,247462206.0,1380.0,8442.0,American_male_non-fiction_writers
1014055,246840892.0,1187.0,8281.0,20th-century_American_women_writers
883527,239012017.0,13192.0,7998.0,Taxa_named_by_Carl_Linnaeus


# 3. Publications distribution

## 3.1 DOI + ISBN

In [None]:
# TODO: compute pages per category and pages with DOI per category

In [54]:
df_cat_pages = df_page_c.groupby('category_id').size().reset_index(name='pages')
df_cat_pages

Unnamed: 0,category_id,pages
0,3,77
1,9,281
2,10,99
3,16,71
4,17,221
...,...,...
1473623,248517444,1
1473624,248517449,7
1473625,248517454,10
1473626,248517455,3


In [108]:
df_cat_pages.sort_values(by='pages', ascending=False).merge(df_cat, how='inner', on='category_id').head(50)

Unnamed: 0,category_id,pages,title
0,173,1002407,Living_people
1,460466,201989,Disambiguation_pages
2,517524,67378,Human_name_disambiguation_pages
3,775,65240,English-language_films
4,1801,64662,Surnames
5,598101,62089,Place_name_disambiguation_pages
6,795870,57989,Association_football_midfielders
7,774,56772,American_films
8,8571501,44932,Association_football_defenders
9,8611016,43347,Association_football_forwards


In [84]:
# Page-category links joined with the page-publication links of publications that have a DOI
doi_pubs = df_pubs[~df_pubs['doi'].isna()].pub_id.tolist()
df_cat_doi_links = df_page_c.merge(df_page_pubs[df_page_pubs['pub_id'].isin(doi_pubs)], how='inner', on='page_id')
# doi_unique: number of distinct publications with a DOI cited in each category
df_cat_doi_unique = df_cat_doi_links[['category_id', 'pub_id']].drop_duplicates().groupby('category_id').size().reset_index(name='doi_unique')
# doi_total: number of DOI links in each category; doi_pages: number of pages with at least one DOI
df_cat_doi = df_cat_doi_links.assign(pub_id=1)
df_cat_doi = df_cat_doi.groupby(['category_id', 'page_id']).agg(doi_total=('pub_id','sum'), doi_pages=('pub_id','min')).reset_index()
df_cat_doi = df_cat_doi.groupby('category_id')[['doi_total', 'doi_pages']].sum().reset_index()
df_cat_doi = df_cat_doi.merge(df_cat_doi_unique, how='inner', on='category_id')
df_cat_doi

Unnamed: 0,category_id,doi_total,doi_pages,doi_unique
0,3,12,6,12
1,9,3,3,3
2,10,17,2,17
3,16,28,11,28
4,17,525,98,505
...,...,...,...,...
293664,248517366,6,3,6
293665,248517372,4,3,4
293666,248517373,9,6,9
293667,248517404,22,4,22
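The three measures can be read as follows: `doi_total` counts every page-publication link in a category, `doi_pages` counts the pages with at least one DOI, and `doi_unique` counts distinct publications. A compact way to obtain the same three columns, sketched on invented data and assuming no duplicated rows in either table, is a single named aggregation:

```python
import pandas as pd

# Toy links: three pages in category 10; pages 1 and 2 both cite publication 100.
page_cat = pd.DataFrame({'page_id': [1, 2, 3], 'category_id': [10, 10, 10]})
page_doi = pd.DataFrame({'page_id': [1, 1, 2], 'pub_id': [100, 101, 100]})

joined = page_cat.merge(page_doi, how='inner', on='page_id')
summary = (joined.groupby('category_id')
                 .agg(doi_total=('pub_id', 'size'),      # all links
                      doi_pages=('page_id', 'nunique'),  # pages with a DOI
                      doi_unique=('pub_id', 'nunique'))  # distinct publications
                 .reset_index())
print(summary)
```

Here publication 100 is counted twice in `doi_total` but once in `doi_unique`, and page 3 does not contribute to `doi_pages`.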


In [85]:
df_cat_pages_doi = df_cat_pages.merge(df_cat_doi, how='left', on='category_id')
df_cat_pages_doi

Unnamed: 0,category_id,pages,doi_total,doi_pages,doi_unique
0,3,77,12.0,6.0,12.0
1,9,281,3.0,3.0,3.0
2,10,99,17.0,2.0,17.0
3,16,71,28.0,11.0,28.0
4,17,221,525.0,98.0,505.0
...,...,...,...,...,...
1473623,248517444,1,,,
1473624,248517449,7,,,
1473625,248517454,10,,,
1473626,248517455,3,,,


In [86]:
df_cat_pages_doi.sort_values(by='doi_unique', ascending=False)

Unnamed: 0,category_id,pages,doi_total,doi_pages,doi_unique
62,173,1002407,82778.0,21508.0,81102.0
255516,518425,28404,22887.0,8534.0,20630.0
871358,235791609,1443,18885.0,1358.0,14311.0
883527,239012017,6913,13192.0,2269.0,12426.0
192961,310057,727,12793.0,704.0,10120.0
...,...,...,...,...,...
1473623,248517444,1,,,
1473624,248517449,7,,,
1473625,248517454,10,,,
1473626,248517455,3,,,


In [87]:
df_cat_pages_doi = df_cat_pages_doi[~df_cat_pages_doi['doi_total'].isna()].sort_values(by='doi_unique', ascending=False).merge(df_cat, how='inner', on='category_id')
df_cat_pages_doi['perc'] = round(100*df_cat_pages_doi.doi_pages/df_cat_pages_doi.pages, 3)
df_cat_pages_doi.sort_values(by='doi_total', ascending=False)

Unnamed: 0,category_id,pages,doi_total,doi_pages,doi_unique,title,perc
0,173,1002407,82778.0,21508.0,81102.0,Living_people,2.146
1,518425,28404,22887.0,8534.0,20630.0,IUCN_Red_List_least_concern_species,30.045
2,235791609,1443,18885.0,1358.0,14311.0,Genes_on_human_chromosome_1,94.109
3,239012017,6913,13192.0,2269.0,12426.0,Taxa_named_by_Carl_Linnaeus,32.822
4,310057,727,12793.0,704.0,10120.0,Transcription_factors,96.836
...,...,...,...,...,...,...,...
229703,428299,1,1.0,1.0,1.0,Films_directed_by_Ardeshir_Irani,100.000
229702,428301,19,1.0,1.0,1.0,Films_directed_by_Jacques_Feyder,5.263
229701,428324,1,1.0,1.0,1.0,543_establishments,100.000
229700,428325,1,1.0,1.0,1.0,753_disestablishments,100.000


In [88]:
df_cat_pages_doi['ratio'] = round(df_cat_pages_doi.doi_unique/df_cat_pages_doi.doi_total,3)
df_cat_pages_doi.sort_values(by='ratio', ascending=True)

Unnamed: 0,category_id,pages,doi_total,doi_pages,doi_unique,title,perc,ratio
257452,188841,530,128.0,128.0,1.0,Localities_in_New_South_Wales,24.151,0.008
262177,217540339,172,103.0,103.0,1.0,Far_West_(New_South_Wales),59.884,0.010
232941,16153562,442,99.0,99.0,1.0,Drilliidae_stubs,22.398,0.010
203733,248240058,130,102.0,102.0,1.0,"Far_West,_New_South_Wales_geography_stubs",78.462,0.010
77002,116439829,806,814.0,804.0,8.0,Scopula,99.752,0.010
...,...,...,...,...,...,...,...,...
116497,203748,17,4.0,1.0,4.0,Military_history_of_the_Netherlands_during_Wor...,5.882,1.000
116498,167389,12,4.0,3.0,4.0,Interactive_geometry_software,25.000,1.000
116499,173224,119,4.0,4.0,4.0,Japanese_artists,3.361,1.000
116673,199630,839,4.0,4.0,4.0,Members_of_the_New_South_Wales_Legislative_Cou...,0.477,1.000
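The two derived indicators are `perc`, the percentage of a category's pages that cite at least one DOI, and `ratio`, the number of distinct publications divided by the total number of DOI links. The sketch below recomputes both for three rows whose counts are taken from the outputs in this notebook. A ratio close to 1 indicates that most links point to different publications, while a low ratio indicates that the same publications are cited on many pages of the category.

```python
import pandas as pd

# Counts copied from the outputs shown in this notebook.
rows = pd.DataFrame({
    'title': ['Living_people', 'Hipparcos_objects', 'Bibliometrics'],
    'pages': [1002407, 3709, 15],
    'doi_total': [82778, 11168, 86],
    'doi_pages': [21508, 2440, 9],
    'doi_unique': [81102, 4005, 86],
})

# Percentage of pages in the category with at least one DOI.
rows['perc'] = (100 * rows['doi_pages'] / rows['pages']).round(3)
# Distinct publications per DOI link.
rows['ratio'] = (rows['doi_unique'] / rows['doi_total']).round(3)
print(rows[['title', 'perc', 'ratio']])
```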


In [100]:
df_cat_pages_doi[df_cat_pages_doi['doi_total']>2500].sort_values(by='ratio', ascending=False)

Unnamed: 0,category_id,pages,doi_total,doi_pages,doi_unique,title,perc,ratio
108,293188,3839,2634.0,600.0,2628.0,Stanford_University_alumni,15.629,0.998
116,100405,3399,2552.0,605.0,2546.0,Cornell_University_alumni,17.799,0.998
113,299356,3357,2570.0,693.0,2560.0,University_of_Chicago_alumni,20.643,0.996
119,254627,3936,2520.0,645.0,2507.0,Princeton_University_alumni,16.387,0.995
69,129069,1637,3490.0,599.0,3474.0,Fellows_of_the_American_Association_for_the_Ad...,36.591,0.995
...,...,...,...,...,...,...,...,...
145,64755,1950,6018.0,1132.0,2176.0,Bayer_objects,58.051,0.362
57,12392534,3676,11021.0,2396.0,3968.0,Henry_Draper_Catalogue_objects,65.180,0.360
53,439236,3709,11168.0,2440.0,4005.0,Hipparcos_objects,65.786,0.359
177,134158,1771,5786.0,1164.0,1985.0,Flamsteed_objects,65.726,0.343


In [103]:
df_cat_pages_doi[df_cat_pages_doi['title']=='Bibliometrics']

Unnamed: 0,category_id,pages,doi_total,doi_pages,doi_unique,title,perc,ratio
12579,342919,15,86.0,9.0,86.0,Bibliometrics,60.0,1.0


In [66]:
# Page-category links joined with the page-publication links of publications that have an ISBN
isbn_pubs = df_pubs[~df_pubs['isbn'].isna()].pub_id.tolist()
df_cat_isbn_links = df_page_c.merge(df_page_pubs[df_page_pubs['pub_id'].isin(isbn_pubs)], how='inner', on='page_id')
# isbn_unique: number of distinct publications with an ISBN cited in each category
df_cat_isbn_unique = df_cat_isbn_links[['category_id', 'pub_id']].drop_duplicates().groupby('category_id').size().reset_index(name='isbn_unique')
# isbn_total: number of ISBN links in each category; isbn_pages: number of pages with at least one ISBN
df_cat_isbn = df_cat_isbn_links.assign(pub_id=1)
df_cat_isbn = df_cat_isbn.groupby(['category_id', 'page_id']).agg(isbn_total=('pub_id','sum'), isbn_pages=('pub_id','min')).reset_index()
df_cat_isbn = df_cat_isbn.groupby('category_id')[['isbn_total', 'isbn_pages']].sum().reset_index()
df_cat_isbn = df_cat_isbn.merge(df_cat_isbn_unique, how='inner', on='category_id')
df_cat_isbn

Unnamed: 0,category_id,isbn_total,isbn_pages,isbn_unique
0,3,20,11,19
1,9,36,21,36
2,10,14,4,14
3,16,77,28,77
4,17,533,112,500
...,...,...,...,...
722089,248517404,164,22,106
722090,248517436,4,1,4
722091,248517454,5,4,2
722092,248517455,4,2,4


In [67]:
df_cat_pages_isbn = df_cat_pages.merge(df_cat_isbn, how='left', on='category_id')
df_cat_pages_isbn

Unnamed: 0,category_id,pages,isbn_total,isbn_pages,isbn_unique
0,3,77,20.0,11.0,19.0
1,9,281,36.0,21.0,36.0
2,10,99,14.0,4.0,14.0
3,16,71,77.0,28.0,77.0
4,17,221,533.0,112.0,500.0
...,...,...,...,...,...
1473623,248517444,1,,,
1473624,248517449,7,,,
1473625,248517454,10,5.0,4.0,2.0
1473626,248517455,3,4.0,2.0,4.0


In [68]:
df_cat_pages_isbn.sort_values(by='isbn_unique', ascending=False)

Unnamed: 0,category_id,pages,isbn_total,isbn_pages,isbn_unique
62,173,1002407,172018.0,74732.0,138593.0
389,775,65240,18616.0,8140.0,11533.0
388,774,56772,17346.0,7891.0,10584.0
1172931,247354635,7661,10401.0,2446.0,9892.0
1201909,247462206,4732,8442.0,1696.0,8298.0
...,...,...,...,...,...
1473619,248517435,2,,,
1473621,248517441,1,,,
1473622,248517442,1,,,
1473623,248517444,1,,,


In [69]:
df_cat_pages_isbn = df_cat_pages_isbn[~df_cat_pages_isbn['isbn_total'].isna()].sort_values(by='isbn_unique', ascending=False).merge(df_cat, how='inner', on='category_id')
df_cat_pages_isbn['perc'] = round(100*df_cat_pages_isbn.isbn_pages/df_cat_pages_isbn.pages, 3)
df_cat_pages_isbn.sort_values(by='isbn_total', ascending=False)

Unnamed: 0,category_id,pages,isbn_total,isbn_pages,isbn_unique,title,perc
0,173,1002407,172018.0,74732.0,138593.0,Living_people,7.455
403,612544,26530,19423.0,11889.0,1493.0,English_Football_League_players,44.813
1,775,65240,18616.0,8140.0,11533.0,English-language_films,12.477
2,774,56772,17346.0,7891.0,10584.0,American_films,13.899
413,121845,22755,17080.0,10243.0,1480.0,English_footballers,45.014
...,...,...,...,...,...,...,...
612329,55826845,7,1.0,1.0,1.0,Japanese_expatriates_in_Spain,14.286
612330,55826802,3,1.0,1.0,1.0,Nose_flutes,33.333
612331,55813927,4,1.0,1.0,1.0,Uninhabited_islands_of_South_Korea,25.000
612332,55812450,2,1.0,1.0,1.0,Uninhabited_islands_of_Bangladesh,50.000


In [101]:
df_cat_pages_isbn['ratio'] = round(df_cat_pages_isbn.isbn_unique/df_cat_pages_isbn.isbn_total,3)
df_cat_pages_isbn.sort_values(by='ratio', ascending=True)

Unnamed: 0,category_id,pages,isbn_total,isbn_pages,isbn_unique,title,perc,ratio
610183,200554591,213,172.0,172.0,1.0,Nakhchivan_Autonomous_Republic_geography_stubs,80.751,0.006
563912,247673859,194,169.0,169.0,1.0,Lists_of_listed_buildings_in_Staffordshire,87.113,0.006
420299,592655,216,369.0,213.0,3.0,Kangxi_radicals,98.611,0.008
414111,246638391,121,360.0,120.0,3.0,Vehicle_registration_plates_of_the_United_Stat...,99.174,0.008
651801,72574970,148,107.0,107.0,1.0,Rijksmonuments_in_Friesland,72.297,0.009
...,...,...,...,...,...,...,...,...
323618,173292264,7,5.0,1.0,5.0,1954_in_French_sport,14.286,1.000
323619,233516223,23,5.0,1.0,5.0,Danish_male_ballet_dancers,4.348,1.000
323620,158850032,5,5.0,1.0,5.0,Ferries_of_China,20.000,1.000
323614,173293052,2,5.0,1.0,5.0,Sport_in_Sudan_by_sport,50.000,1.000


In [102]:
df_cat_pages_isbn[df_cat_pages_isbn['title']=='Bibliometrics']

Unnamed: 0,category_id,pages,isbn_total,isbn_pages,isbn_unique,title,perc,ratio
251031,342919,15,8.0,4.0,8.0,Bibliometrics,26.667,1.0


In [110]:
df_cat_pages_isbn[df_cat_pages_isbn['title']=='Bibliometricians']

Unnamed: 0,category_id,pages,isbn_total,isbn_pages,isbn_unique,title,perc,ratio
225227,248474434,11,10.0,4.0,10.0,Bibliometricians,36.364,1.0
