# Wikimedia Research - Translation Imbalances: Testing hypothesis #1


In this notebook we try to answer the research questions below by querying the data sources relevant to them. __The related statement is__:
> Cultural context content account for a relevant portion of Wikipedia articles (7%-49%) and is less shared on average, which is causing the language imbalance on Wikipedia. Mass media-dominated content, which is more dominated by certain cultures, is more widely shared, and therefore, translated from larger and global languages more often.

__Research questions:__
- RQ 4.1.1 What type of content receives the highest translation count?
- RQ 4.1.2 What type of content receives the lowest translation count?

__Data sources:__
- Data obtained from the APIs by language edition: https://www.mediawiki.org/wiki/API:Main_page
- Predictions of categories by article: https://www.mediawiki.org/wiki/ORES/Articletopic

__Steps:__
1. Define the tool to use to get the top level categories of articles
2. Define the categories to use
3. Get the count of pages by category
4. Get pages belonging to a category (IDs)
5. Make API calls to get translation counts by page and aggregate by category
6. Explore further the groups of languages being translated and their pairs



__Concerns:__
- When comparing different articles, should we consider the source language (the language the article was originally written in) as a factor for diversity?


In [5]:
main_categories = ['Research', 'Library_science', 'Culture', 'The_arts', 'Geography', 
                   'Places', 'Health', 'Self-care', 'Health_care_occupations', 'History',
                   'Events', 'Formal_sciences','Mathematics', 'Logic', 'Mathematical_sciences',
                  'Science', 'Natural_sciences', 'Nature', 'People', 'Personal_life',
                  'Self', 'Surnames', 'Philosophy', 'Thought', 'Religion', 'Belief', 
                  'Society', 'Social_sciences', 'Technology', 'Applied_sciences']

len(main_categories)

30

In [6]:
# https://www.mediawiki.org/wiki/ORES/Articletopic
# https://ores.wikimedia.org/v3/scores/enwiki/?models=articletopic&revids=1161522245

# 1. Define the tool to use to get the top level categories of articles

## Tests


### Testing the accuracy of the articletopic model from https://ores.wikimedia.org/

- We fetch random articles via: https://en.wikipedia.org/w/api.php?action=query&list=random&rnnamespace=0&rnlimit=10
- We take the page titles and fetch each page URL with the `query+info` module: https://en.wikipedia.org/w/api.php?action=help&modules=query%2Binfo
- We open each page, check the categories listed at the bottom and, based on them, estimate a list of 3-4 expected categories.
- Finally, we build a list with all of this information.

```
{
    "batchcomplete": "",
    "continue": {
        "rncontinue": "0.141236895977|0.141238067536|13009515|0",
        "continue": "-||"
    },
    "query": {
        "random": [
            {
                "id": 60265289,
                "ns": 0,
                "title": "African Women's classification in the Cape Epic"
            },
            {
                "id": 46434099,
                "ns": 0,
                "title": "Adrien Kela"
            },
            {
                "id": 60533443,
                "ns": 0,
                "title": "Kenneth Bunn"
            },
            {
                "id": 45482666,
                "ns": 0,
                "title": "Cyrtocris fulvicornis"
            },
            {
                "id": 1808010,
                "ns": 0,
                "title": "Islwyn (UK Parliament constituency)"
            },
              {
                "id": 47284944,
                "ns": 0,
                "title": "Capanne, San Marino",
                "url": "https://en.wikipedia.org/wiki/Capanne,_San_Marino"
            },
            
            {
                "id": 28276181,
                "ns": 0,
                "title": "1988 Virginia Slims of Arizona \u2013 Singles",
                "url": "https://en.wikipedia.org/wiki/1988_Virginia_Slims_of_Arizona_%E2%80%93_Singles"
            },
            
            {
                "id": 22369872,
                "ns": 0,
                "title": "Herbert Munk",
                "url": "https://en.wikipedia.org/wiki/Herbert_Munk"
            },
            
            {
                "id": 33607623,
                "ns": 0,
                "title": "Risky Business (House)",
                "url": "https://en.wikipedia.org/wiki/Risky_Business_(House)"
            },
            
            {
                "id": 5864890,
                "ns": 0,
                "title": "Saskatchewan Glacier",
                "url": "https://en.wikipedia.org/wiki/Saskatchewan_Glacier"
            },
            
            {
                "id": 4494842,
                "ns": 0,
                "title": "Printer cable",
                "url": "https://en.wikipedia.org/wiki/Printer_cable"
            },
            
            {
                "id": 43821884,
                "ns": 0,
                "title": "James Carruthers",
                "url": "https://en.wikipedia.org/wiki/James_Carruthers"
            },
            
            {
                "id": 41505702,
                "ns": 0,
                "title": "Hammer of Heaven",
                "url": "https://en.wikipedia.org/wiki/Hammer_of_Heaven"
            },
            
            {
                "id": 70186733,
                "ns": 0,
                "title": "The Kyiv Independent",
                "url": "https://en.wikipedia.org/wiki/The_Kyiv_Independent"
            },
            
            {
                "id": 1137772,
                "ns": 0,
                "title": "State room",
                "url": "https://en.wikipedia.org/wiki/State_room"
            },
            
            {
                "id": 43640804,
                "ns": 0,
                "title": "Trials of Kirstin Lobato",
                "url": "https://en.wikipedia.org/wiki/Trials_of_Kirstin_Lobato"
            },
            
            {
                "id": 62111001,
                "ns": 0,
                "title": "Jessa Dillow Crisp",
                "url": "https://en.wikipedia.org/wiki/Jessa_Dillow_Crisp"
            },
            
            {
                "id": 30634407,
                "ns": 0,
                "title": "Mingqi",
                "url": "https://en.wikipedia.org/wiki/Mingqi"
            },
            
            {
                "id": 8238870,
                "ns": 0,
                "title": "Metronome (artists' and writers' organ)",
                "url": "https://en.wikipedia.org/wiki/Metronome_(artists%27_and_writers%27_organ)"
            },
            
            {
                "id": 9155219,
                "ns": 0,
                "title": "Richard Beauchamp, 2nd Baron Beauchamp",
                "url": "https://en.wikipedia.org/wiki/Richard_Beauchamp,_2nd_Baron_Beauchamp"
            },
            
            {
                "id": 26207504,
                "ns": 0,
                "title": "Single point of failure",
                "url": "https://en.wikipedia.org/wiki/Single_point_of_failure"
            },
            
            {
                "id": 50151305,
                "ns": 0,
                "title": "Ibrahim Al-Subaie",
                "url": "https://en.wikipedia.org/wiki/Ibrahim_Al-Subaie"
            },
            
            {
                "id": 36718302,
                "ns": 0,
                "title": "Biathlon at the 1994 Winter Olympics \u2013 Women's individual",
                "url": "https://en.wikipedia.org/wiki/Biathlon_at_the_1994_Winter_Olympics_%E2%80%93_Women%27s_individual"
            },
            
            {
                "id": 34004381,
                "ns": 0,
                "title": "Viola renifolia",
                "url": "https://en.wikipedia.org/wiki/Viola_renifolia"
            },
            
            {
                "id": 45715050,
                "ns": 0,
                "title": "LNWR 4ft 6in Tank Class",
                "url": "https://en.wikipedia.org/wiki/LNWR_4ft_6in_Tank_Class"
            }
        ]
    }
}

```
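
The two API calls described above can be sketched as follows. This is a minimal, offline sketch rather than the exact code used to build the sample: the helper names (`random_pages_params`, `page_url_params`, `attach_urls`) are hypothetical, and it assumes that `prop=info` is called with `inprop=url`, in which case each page entry carries a `fullurl` field.

```python
from urllib.parse import urlencode

API_URL = "https://en.wikipedia.org/w/api.php"


def random_pages_params(limit=10):
    """Query parameters for a batch of random main-namespace pages."""
    return {
        "action": "query",
        "list": "random",
        "rnnamespace": 0,
        "rnlimit": limit,
        "format": "json",
    }


def page_url_params(titles):
    """Query parameters asking prop=info for the full URL of each title."""
    return {
        "action": "query",
        "prop": "info",
        "inprop": "url",
        "titles": "|".join(titles),
        "format": "json",
    }


def attach_urls(random_items, info_response):
    """Copy each page's 'fullurl' from a prop=info response onto the random items."""
    pages = info_response["query"]["pages"]
    merged = []
    for item in random_items:
        # Pages are keyed by page ID as a string; unknown pages get url=None.
        page = pages.get(str(item["id"]), {})
        merged.append({**item, "url": page.get("fullurl")})
    return merged


# Offline demonstration with a hand-written response in the assumed shape.
sample_random = [{"id": 5864890, "ns": 0, "title": "Saskatchewan Glacier"}]
sample_info = {
    "query": {
        "pages": {
            "5864890": {
                "pageid": 5864890,
                "title": "Saskatchewan Glacier",
                "fullurl": "https://en.wikipedia.org/wiki/Saskatchewan_Glacier",
            }
        }
    }
}
print(API_URL + "?" + urlencode(random_pages_params(10)))
print(attach_urls(sample_random, sample_info))
```

With `requests`, each parameter dictionary can be passed as `requests.get(API_URL, params=...)`, which also takes care of URL-encoding the titles.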

The expected results are:
1. '60265289': ['Sports', 'Africa', 'Culture']
2. '46434099': ['Biography', 'Sports', 'Women']
3. '60533443': ['Biography', 'Sports']
4. '45482666': ['STEM', 'Biology', 'Earth and environment']
5. '1808010': ['Politics', 'History', 'Geography']
6. '47284944': ['Geography', 'Europe', 'Southern Europe', 'History']
7. '28276181' : ['Sports', 'Society', 'Culture']
8. '22369872': ['Biography', 'History', 'Geography']
9. '33607623': ['Media', 'Television', 'Entertainment', 'North America']
10. '5864890': ['North America', 'Americas', 'Geography']
11. '4494842': ['STEM', 'Computing', 'Engineering']
12. '43821884': ['Biography', 'History']

In [7]:
import requests

In [28]:
## Define the expected results

random = [
    {
        "id": 60265289,
        "ns": 0,
        "title": "African Women's classification in the Cape Epic"
    },
    {
        "id": 46434099,
        "ns": 0,
        "title": "Adrien Kela"
    },
    {
        "id": 60533443,
        "ns": 0,
        "title": "Kenneth Bunn"
    },
    {
        "id": 45482666,
        "ns": 0,
        "title": "Cyrtocris fulvicornis"
    },
    {
        "id": 1808010,
        "ns": 0,
        "title": "Islwyn (UK Parliament constituency)"
    },
      {
        "id": 47284944,
        "ns": 0,
        "title": "Capanne, San Marino",
        "url": "https://en.wikipedia.org/wiki/Capanne,_San_Marino"
    },

    {
        "id": 28276181,
        "ns": 0,
        "title": "1988 Virginia Slims of Arizona \u2013 Singles",
        "url": "https://en.wikipedia.org/wiki/1988_Virginia_Slims_of_Arizona_%E2%80%93_Singles"
    },

    {
        "id": 22369872,
        "ns": 0,
        "title": "Herbert Munk",
        "url": "https://en.wikipedia.org/wiki/Herbert_Munk"
    },

    {
        "id": 33607623,
        "ns": 0,
        "title": "Risky Business (House)",
        "url": "https://en.wikipedia.org/wiki/Risky_Business_(House)"
    },

    {
        "id": 5864890,
        "ns": 0,
        "title": "Saskatchewan Glacier",
        "url": "https://en.wikipedia.org/wiki/Saskatchewan_Glacier"
    },

    {
        "id": 4494842,
        "ns": 0,
        "title": "Printer cable",
        "url": "https://en.wikipedia.org/wiki/Printer_cable"
    },

    {
        "id": 43821884,
        "ns": 0,
        "title": "James Carruthers",
        "url": "https://en.wikipedia.org/wiki/James_Carruthers"
    },

    {
        "id": 41505702,
        "ns": 0,
        "title": "Hammer of Heaven",
        "url": "https://en.wikipedia.org/wiki/Hammer_of_Heaven"
    },

    {
        "id": 70186733,
        "ns": 0,
        "title": "The Kyiv Independent",
        "url": "https://en.wikipedia.org/wiki/The_Kyiv_Independent"
    },

    {
        "id": 1137772,
        "ns": 0,
        "title": "State room",
        "url": "https://en.wikipedia.org/wiki/State_room"
    },

    {
        "id": 43640804,
        "ns": 0,
        "title": "Trials of Kirstin Lobato",
        "url": "https://en.wikipedia.org/wiki/Trials_of_Kirstin_Lobato"
    },

    {
        "id": 62111001,
        "ns": 0,
        "title": "Jessa Dillow Crisp",
        "url": "https://en.wikipedia.org/wiki/Jessa_Dillow_Crisp"
    },

    {
        "id": 30634407,
        "ns": 0,
        "title": "Mingqi",
        "url": "https://en.wikipedia.org/wiki/Mingqi"
    },

    {
        "id": 8238870,
        "ns": 0,
        "title": "Metronome (artists' and writers' organ)",
        "url": "https://en.wikipedia.org/wiki/Metronome_(artists%27_and_writers%27_organ)"
    },

    {
        "id": 9155219,
        "ns": 0,
        "title": "Richard Beauchamp, 2nd Baron Beauchamp",
        "url": "https://en.wikipedia.org/wiki/Richard_Beauchamp,_2nd_Baron_Beauchamp"
    },

    {
        "id": 26207504,
        "ns": 0,
        "title": "Single point of failure",
        "url": "https://en.wikipedia.org/wiki/Single_point_of_failure"
    },

    {
        "id": 50151305,
        "ns": 0,
        "title": "Ibrahim Al-Subaie",
        "url": "https://en.wikipedia.org/wiki/Ibrahim_Al-Subaie"
    },

    {
        "id": 36718302,
        "ns": 0,
        "title": "Biathlon at the 1994 Winter Olympics \u2013 Women's individual",
        "url": "https://en.wikipedia.org/wiki/Biathlon_at_the_1994_Winter_Olympics_%E2%80%93_Women%27s_individual"
    },

    {
        "id": 34004381,
        "ns": 0,
        "title": "Viola renifolia",
        "url": "https://en.wikipedia.org/wiki/Viola_renifolia"
    },

    {
        "id": 45715050,
        "ns": 0,
        "title": "LNWR 4ft 6in Tank Class",
        "url": "https://en.wikipedia.org/wiki/LNWR_4ft_6in_Tank_Class"
    }
]




# these are the expected categories by ID
expected_categories = {
'60265289': ['Culture.Sports', 'Geography.Regions.Africa.Africa*', 'Culture.Biography.Women'],
'46434099': ['Culture.Biography.Biography*', 'Culture.Sports'],
'60533443': ['Culture.Biography.Biography*', 'Culture.Sports'],
'45482666': ['STEM.STEM*', 'STEM.Biology', 'STEM.Earth and environment'],
'1808010': ['History and Society.Politics and government', 'History and Society.History', 'Geography.Geographical'],
'47284944': ['Geography.Geographical', 'Geography.Regions.Europe.Europe*', 'Geography.Regions.Europe.Southern Europe', 'History and Society.History'],
'28276181' : ['Culture.Sports', 'Geography.Regions.Americas.North America*', 'Culture.Biography.Women'],
'22369872': ['Culture.Biography.Biography*', 'History and Society.History', 'Geography.Geographical'],
'33607623': ['Culture.Media.Media*', 'Culture.Media.Television', 'Culture.Media.Entertainment', 'North America'],
'5864890': ['Geography.Regions.Americas.North America', 'Geography.Regions.Americas', 'Geography.Geographical'],
'4494842': ['STEM.STEM*', 'STEM.Computing', 'STEM.Engineering'],
'43821884': ['Culture.Biography.Biography*', 'History and Society.History'],
'41505702': ['Culture.Media.Media*', 'Culture.Media.Music','Culture.Media.Radio', 'Culture.Media.Entertainment'],
'70186733': ['Culture.Media.Media*', 'Culture.Media.Television','Culture.Internet culture', 'Culture.Media.Entertainment', 'History and Society.Politics and government'],
'1137772': ['History and Society.History', 'Geography.Geographical', 'Geography.Regions.Europe.Europe*', 'History and Society.Politics and government', 'Culture.Media.Media*', 'Culture.Performing arts', 'Culture.Visual arts.Visual arts*'],
'43640804': ['Culture.Biography.Women', 'History and Society.Society','History and Society.History'],
'62111001': ['Culture.Biography.Biography*', 'Culture.Media.Entertainment', 'History and Society.History'],
'30634407': ['History and Society.History', 'Geography.Regions.Asia.Asia*', 'Geography.Regions.Asia.North Asia*', 'Culture.Visual arts.Visual arts*'],
'8238870': ['Culture.Media.Media*', 'Culture.Visual arts.Visual arts*'],
'9155219': ['Culture.Biography.Biography*','Geography.Regions.Europe.Europe*'],
'26207504': ['STEM.STEM*', 'STEM.Engineering', 'STEM.Computing', 'STEM.Technology'],
'50151305': ['Culture.Biography.Biography*', 'History and Society.History','Culture.Sports'],
'36718302': ['Culture.Sports', 'Culture.Biography.Women'],
'34004381': ['Geography.Geographical', 'STEM.Biology', 'STEM.Earth and environment'],
'45715050': ['History and Society.History', 'STEM.STEM*', 'STEM.Engineering', 'History and Society.Transportation'],
    
}

expected_categories

{'60265289': ['Culture.Sports',
  'Geography.Regions.Africa.Africa*',
  'Culture.Biography.Women'],
 '46434099': ['Culture.Biography.Biography*', 'Culture.Sports'],
 '60533443': ['Culture.Biography.Biography*', 'Culture.Sports'],
 '45482666': ['STEM.STEM*', 'STEM.Biology', 'STEM.Earth and environment'],
 '1808010': ['History and Society.Politics and government',
  'History and Society.History',
  'Geography.Geographical'],
 '47284944': ['Geography.Geographical',
  'Geography.Regions.Europe.Europe*',
  'Geography.Regions.Europe.Southern Europe',
  'History and Society.History'],
 '28276181': ['Culture.Sports',
  'Geography.Regions.Americas.North America*',
  'Culture.Biography.Women'],
 '22369872': ['Culture.Biography.Biography*',
  'History and Society.History',
  'Geography.Geographical'],
 '33607623': ['Culture.Media.Media*',
  'Culture.Media.Television',
  'Culture.Media.Entertainment',
  'North America'],
 '5864890': ['Geography.Regions.Americas.North America',
  'Geography.Regions
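
Because the later comparison relies on exact string matches, a misspelled expected label can never match a predicted one. A possible sanity check is sketched below. The function name `find_unknown_labels` is hypothetical, the label set shown is only a small subset used for illustration, and the `'hypothetical'` entry is made up to show what a misspelling looks like. In practice the full label set could be taken from the keys of the `probability` dictionary that ORES returns.

```python
def find_unknown_labels(expected_by_page, known_labels):
    """Return {page_id: [labels missing from known_labels]} for pages with problems."""
    known = set(known_labels)
    problems = {}
    for page_id, labels in expected_by_page.items():
        unknown = [label for label in labels if label not in known]
        if unknown:
            problems[page_id] = unknown
    return problems


# A small subset of the model's labels, used here only for illustration.
sample_known_labels = {
    "Culture.Biography.Biography*",
    "Culture.Biography.Women",
    "Culture.Media.Entertainment",
    "Culture.Media.Media*",
    "Culture.Sports",
}

sample_expected = {
    "46434099": ["Culture.Biography.Biography*", "Culture.Sports"],
    # Made-up entry with a deliberately misspelled label.
    "hypothetical": ["Culture.Media.Media*", "Culture.Media.Entertainmant"],
}

print(find_unknown_labels(sample_expected, sample_known_labels))
```

Only the made-up entry is reported, since every label of page 46434099 is in the sample label set.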

In [29]:
## Process both lists and merge them together

articles_cat_merged = {}
titles = []

# for each item in the random list
for art in random:
    # get the id
    id_ = str(art['id'])
    article_cat_new = {'id': id_}
    article_cat_new['title'] = art['title'] 
    article_cat_new['expected_cat'] = expected_categories[id_]
    titles.append(art['title'])
    
    # set it
    articles_cat_merged[id_] = article_cat_new
    
articles_cat_merged

{'60265289': {'id': '60265289',
  'title': "African Women's classification in the Cape Epic",
  'expected_cat': ['Culture.Sports',
   'Geography.Regions.Africa.Africa*',
   'Culture.Biography.Women']},
 '46434099': {'id': '46434099',
  'title': 'Adrien Kela',
  'expected_cat': ['Culture.Biography.Biography*', 'Culture.Sports']},
 '60533443': {'id': '60533443',
  'title': 'Kenneth Bunn',
  'expected_cat': ['Culture.Biography.Biography*', 'Culture.Sports']},
 '45482666': {'id': '45482666',
  'title': 'Cyrtocris fulvicornis',
  'expected_cat': ['STEM.STEM*',
   'STEM.Biology',
   'STEM.Earth and environment']},
 '1808010': {'id': '1808010',
  'title': 'Islwyn (UK Parliament constituency)',
  'expected_cat': ['History and Society.Politics and government',
   'History and Society.History',
   'Geography.Geographical']},
 '47284944': {'id': '47284944',
  'title': 'Capanne, San Marino',
  'expected_cat': ['Geography.Geographical',
   'Geography.Regions.Europe.Europe*',
   'Geography.Regions.E

In [30]:
## Get the latest revision ID of each page

url = "https://en.wikipedia.org/w/api.php"
# Pass the query as params so that requests URL-encodes the titles
params = {
    "action": "query",
    "prop": "revisions",
    "titles": "|".join(titles),
    "rvprop": "ids",
    "format": "json",
}

# Make the GET request
response = requests.get(url, params=params)
status = response.status_code

# Maps page ID -> latest revision ID (stays empty if the request fails)
revisions_pages = {}

if status == 200:
    # Only parse the body once we know the request succeeded
    res = response.json()
    display(res)
    page_results = res['query']['pages']

    for id_ in articles_cat_merged:
        page = page_results[id_]
        # With several titles, the API returns one revision (the latest) per page
        revision_id = page['revisions'][-1]
        revisions_pages[id_] = str(revision_id['revid'])

revisions_pages
    

{'batchcomplete': '',
 'query': {'pages': {'28276181': {'pageid': 28276181,
    'ns': 0,
    'title': '1988 Virginia Slims of Arizona – Singles',
    'revisions': [{'revid': 1097360621, 'parentid': 1083563900}]},
   '46434099': {'pageid': 46434099,
    'ns': 0,
    'title': 'Adrien Kela',
    'revisions': [{'revid': 1131896564, 'parentid': 885948410}]},
   '60265289': {'pageid': 60265289,
    'ns': 0,
    'title': "African Women's classification in the Cape Epic",
    'revisions': [{'revid': 1074183937, 'parentid': 995883150}]},
   '36718302': {'pageid': 36718302,
    'ns': 0,
    'title': "Biathlon at the 1994 Winter Olympics – Women's individual",
    'revisions': [{'revid': 1114685840, 'parentid': 1072599705}]},
   '47284944': {'pageid': 47284944,
    'ns': 0,
    'title': 'Capanne, San Marino',
    'revisions': [{'revid': 1099758502, 'parentid': 881640249}]},
   '45482666': {'pageid': 45482666,
    'ns': 0,
    'title': 'Cyrtocris fulvicornis',
    'revisions': [{'revid': 104560579

{'60265289': '1074183937',
 '46434099': '1131896564',
 '60533443': '1080503608',
 '45482666': '1045605798',
 '1808010': '1162187838',
 '47284944': '1099758502',
 '28276181': '1097360621',
 '22369872': '1140978713',
 '33607623': '1095806953',
 '5864890': '1114491070',
 '4494842': '844125533',
 '43821884': '1089350121',
 '41505702': '997006968',
 '70186733': '1157444807',
 '1137772': '1112910754',
 '43640804': '1154572734',
 '62111001': '1143997655',
 '30634407': '1158345780',
 '8238870': '1133609915',
 '9155219': '1156496402',
 '26207504': '1144873299',
 '50151305': '1155281273',
 '36718302': '1114685840',
 '34004381': '1013874582',
 '45715050': '1118005998'}
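
The 25 titles above fit in a single request, but the MediaWiki Action API accepts at most 50 titles per request for regular clients, so a larger sample would need batching. The sketch below shows one way to do that. The helper names `chunked` and `latest_revisions` are hypothetical, and the entry keyed `"-1"` illustrates how the API reports a page that does not exist.

```python
def chunked(items, size=50):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def latest_revisions(pages):
    """Map page ID -> latest revision ID from the 'pages' part of a response.

    Entries without a 'revisions' list (for example missing pages) are skipped.
    """
    return {
        page_id: str(page["revisions"][-1]["revid"])
        for page_id, page in pages.items()
        if "revisions" in page
    }


# Offline demonstration: one real entry from the output above, one missing page.
sample_pages = {
    "28276181": {
        "pageid": 28276181,
        "ns": 0,
        "title": "1988 Virginia Slims of Arizona – Singles",
        "revisions": [{"revid": 1097360621, "parentid": 1083563900}],
    },
    "-1": {"ns": 0, "title": "A page that does not exist", "missing": ""},
}

batch_sizes = [len(batch) for batch in chunked(list(range(120)), 50)]
print(batch_sizes)
print(latest_revisions(sample_pages))
```

Each batch would be joined with `|` and sent as its own request, and the per-batch dictionaries merged with `dict.update`.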

In [31]:
## Get the articletopic predictions from ORES for every revision ID

list_revi = list(revisions_pages.values())

revisions_list = "|".join(list_revi)
url = f"https://ores.wikimedia.org/v3/scores/enwiki/?models=articletopic&revids={revisions_list}"

# Making a GET request
response = requests.get(url)
# See status code
print(response.status_code)

results_revisions = response.json()['enwiki']['scores']
results_revisions

200


{'1013874582': {'articletopic': {'score': {'prediction': ['STEM.Biology',
     'STEM.STEM*'],
    'probability': {'Culture.Biography.Biography*': 0.0031084317166924823,
     'Culture.Biography.Women': 0.0009803049121985946,
     'Culture.Food and drink': 0.09242336852454172,
     'Culture.Internet culture': 0.0008515363091863077,
     'Culture.Linguistics': 0.00043891420619759344,
     'Culture.Literature': 0.0017456126108263857,
     'Culture.Media.Books': 0.0004604924985425731,
     'Culture.Media.Entertainment': 0.0004764972955101787,
     'Culture.Media.Films': 0.0001387873874122645,
     'Culture.Media.Media*': 0.0027164223946625566,
     'Culture.Media.Music': 9.759401349092416e-05,
     'Culture.Media.Radio': 2.5379689320860085e-05,
     'Culture.Media.Software': 0.0009201269892380221,
     'Culture.Media.Television': 0.00014731442291482467,
     'Culture.Media.Video games': 2.4675836900989654e-05,
     'Culture.Performing arts': 0.00016681160955285787,
     'Culture.Philosophy 
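
The response nests each prediction under `scores -> revision ID -> articletopic -> score`. A small helper can flatten that structure and show the most probable labels. This is a sketch under assumptions: the function names are hypothetical, the probabilities are a trimmed subset of the output above, and the second sample entry (an `error` object instead of a `score`) is a made-up illustration of a revision that could not be scored.

```python
def extract_predictions(scores, model="articletopic"):
    """Map revision ID -> list of predicted labels, skipping entries without a score."""
    predictions = {}
    for rev_id, models in scores.items():
        score = models.get(model, {}).get("score")
        if score is None:
            # No score for this revision (for example an error entry).
            continue
        predictions[rev_id] = score["prediction"]
    return predictions


def top_labels(probabilities, k=3):
    """Return the k (label, probability) pairs with the highest probability."""
    ranked = sorted(probabilities.items(), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Trimmed sample in the shape of response.json()['enwiki']['scores'].
sample_scores = {
    "1013874582": {
        "articletopic": {
            "score": {
                "prediction": ["STEM.Biology", "STEM.STEM*"],
                "probability": {
                    "Culture.Biography.Biography*": 0.0031084317166924823,
                    "Culture.Food and drink": 0.09242336852454172,
                    "Culture.Media.Media*": 0.0027164223946625566,
                },
            }
        }
    },
    # Made-up entry: a revision for which no score is available.
    "0": {"articletopic": {"error": {"type": "hypothetical"}}},
}

print(extract_predictions(sample_scores))
sample_probs = sample_scores["1013874582"]["articletopic"]["score"]["probability"]
print(top_labels(sample_probs, k=2))
```

Looking at the ranked probabilities next to the `prediction` list can help judge how close a missed expected label was to being predicted.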

In [32]:
## Compare the expected results with the actual results
passes = 0
total_size = len(articles_cat_merged)

for page_id in revisions_pages:
    revision_id = revisions_pages[page_id]
    item = articles_cat_merged[page_id]
    # get the predicted categories and expected categories
    actual_results = results_revisions[revision_id]['articletopic']['score']['prediction']
    expected_results = item['expected_cat']
    
    
    # compare results by checking for common items
    print(f'\n\nCOMPARING FOR {page_id} with title <<{item["title"]}>>')
    print(f'\nEXPECTED RESULTS: {expected_results} \nACTUAL RESULTS:{actual_results}')
    intersect = list(set(actual_results).intersection(expected_results))
    
    if intersect:
        passes +=1
        print(f'The article with the id {page_id} got exactly {len(intersect)} match(es) in the predicted results list')
        print(f'Status: {passes} passes out of {total_size}')
        
print(f'\n\nFinal result is: {(passes/total_size)*100}%')



COMPARING FOR 60265289 with title <<African Women's classification in the Cape Epic>>

EXPECTED RESULTS: ['Culture.Sports', 'Geography.Regions.Africa.Africa*', 'Culture.Biography.Women'] 
ACTUAL RESULTS:['Culture.Biography.Biography*', 'Culture.Sports', 'Geography.Regions.Africa.Africa*', 'Geography.Regions.Africa.Southern Africa']
The article with the id 60265289 got exactly 2 match(es) in the predicted results list
Status: 1 passes out of 25


COMPARING FOR 46434099 with title <<Adrien Kela>>

EXPECTED RESULTS: ['Culture.Biography.Biography*', 'Culture.Sports'] 
ACTUAL RESULTS:['Culture.Sports', 'Geography.Regions.Oceania']
The article with the id 46434099 got exactly 1 match(es) in the predicted results list
Status: 2 passes out of 25


COMPARING FOR 60533443 with title <<Kenneth Bunn>>

EXPECTED RESULTS: ['Culture.Biography.Biography*', 'Culture.Sports'] 
ACTUAL RESULTS:['Culture.Biography.Biography*', 'Culture.Sports']
The article with the id 60533443 got exactly 2 match(es) in 