# Mass Merge of Relevant Data

This notebook merges and documents the data collected so far in this project.

In [111]:
import pandas as pd 
import requests
import unicodedata
from tqdm import tqdm
from bs4 import BeautifulSoup
import numpy as np
import seaborn as sns

iso_data = pd.read_json('../js_files/iso6393.json')
glottolog_data = pd.read_csv('../csv_files/glottolog_status_data_with_links.csv', index_col = 0)
wiki_data = pd.read_csv('../csv_files/wiki_languages_most_recent.csv', index_col = 0)
lat_long_dialects = pd.read_csv('../csv_files/languages_and_dialects_geo.csv', index_col = 0)
languoid_data = pd.read_csv('../csv_files/glottolog_languoid.csv', index_col = 0)
extinct_data = pd.read_csv('../csv_files/Extinct languages - DATA SUMMARY.csv', index_col = 0)


In [112]:
glottolog_data = glottolog_data.reset_index()
glottolog_data = glottolog_data.drop(columns = ['Unnamed: 0'])
# glottolog_data

# Update Location Later

Below you see the Glottolog Data (ISO 639-3 Code, Glottocode, Agglomerated Endangerment Status, and Wikipedia page URL) merged with the ISO Data (Language Name, Language Type, Language Scope, ISO 639-3 Code).

Glottolog Data Source:
* Glottolog Data was gathered for the ISO 639-3 Code, Glottocode, and Agglomerated Endangerment Status from XXX, and can be seen in ../csv_files/glottolog_status_data.
* That data frame was then expanded into ../csv_files/glottolog_status_data_with_links by adding each language's Wikipedia page URL, using the web scraper in ../scrapers_organized/glottolog_scraper.ipynb

ISO Data Source:
* ISO Data was downloaded from the iso6393.js file listed in the following GitHub repository: https://github.com/wooorm/iso-639-3. This file can be seen in ../js_files/iso6393.js
* The above file was then converted to a .json file using regular expressions, which can be seen in ../js_files/iso6393.json. 
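
The regular-expression conversion itself is not shown in this notebook. The following is only a minimal sketch of how such a conversion could look, assuming the .js file exports an array of object literals with unquoted keys and single-quoted values. The `js_source` excerpt and the variable names are hypothetical, and the real iso6393.js may be laid out differently.

```python
import json
import re

# Hypothetical excerpt in the style of a JavaScript module that exports an
# array of object literals; the real iso6393.js may be laid out differently.
js_source = """export const iso6393 = [
  {
    name: 'Ghotuo',
    type: 'living',
    scope: 'individual',
    iso6393: 'aaa'
  },
  {
    name: 'Alumu-Tesu',
    type: 'living',
    scope: 'individual',
    iso6393: 'aab'
  }
]
"""

# 1. Keep only the array literal, dropping the export statement around it.
array_text = js_source[js_source.index('['): js_source.rindex(']') + 1]

# 2. Quote the bare object keys at the start of each line: name: -> "name":
array_text = re.sub(r'^([ \t]*)(\w+):', r'\1"\2":', array_text, flags=re.MULTILINE)

# 3. Swap single-quoted string values for double-quoted ones.
#    This simple pattern assumes the values contain no quote characters.
array_text = re.sub(r"'([^']*)'", r'"\1"', array_text)

records = json.loads(array_text)
print(records[0]['iso6393'])
```

Names that contain apostrophes or trailing commas in the source would need extra handling that this sketch leaves out.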

In [113]:
iso_glotto_data = pd.merge(iso_data, glottolog_data, how = 'left', on = 'iso6393')
iso_glotto_data = iso_glotto_data.drop_duplicates(subset = 'iso6393')
# A left merge deduplicated on the key should have exactly one row per row of iso_data
print(len(iso_glotto_data) == len(iso_data))
iso_glotto_data

True


Unnamed: 0,name,type,scope,iso6393,glottocode,aes_status,Wikipedia_Url
0,Ghotuo,living,individual,aaa,ghot1243,not endangered,https://en.wikipedia.org/wiki/Ghotuo_language
1,Alumu-Tesu,living,individual,aab,alum1246,not endangered,https://en.wikipedia.org/wiki/Alumu_language
2,Ari,living,individual,aac,arii1243,moribund,https://en.wikipedia.org/wiki/Ari_language_(Ne...
3,Amal,living,individual,aad,amal1242,shifting,https://en.wikipedia.org/wiki/Amal_language
4,Arbëreshë Albanian,living,individual,aae,arbe1236,threatened,https://en.wikipedia.org/wiki/Arb%C3%ABresh_la...
...,...,...,...,...,...,...,...
7864,Youjiang Zhuang,living,individual,zyj,youj1238,not endangered,https://en.wikipedia.org/wiki/Youjiang_Zhuang
7865,Yongnan Zhuang,living,individual,zyn,yong1275,not endangered,https://en.wikipedia.org/wiki/Yongnan_languages
7866,Zyphe Chin,living,individual,zyp,zyph1238,not endangered,https://en.wikipedia.org/wiki/Zyphe_language
7867,Zaza,living,macrolanguage,zza,,,


Below you see the previously merged dataframe (iso_glotto_data) merged with the Wikipedia Data.

Wikipedia Data Source:
* All data obtained from Wikipedia in this data frame was scraped by the web scraper in the ../scrapers_organized/wikipedia_scraper.ipynb file. This gathered the following fields from the infobox in the top right side of the page, where applicable:
    * Language Name (as listed on Wikipedia)
    * Language Family 
    * Language Dialects 
    * ISO 639-3 Code
    * Glottocode 
    * Number of Speakers 
    * Regions wherein the language is spoken 
    * Nations wherein the language is an official language 
    * Nations wherein the language is a recognized minority language 
    * The Wikipedia URL used to access the page

Note: The name column is preserved from the ISO 639-3 .json file using a left merge, since this represents the internationally recognized name of a given language.

In [114]:
wiki_data['family'].isna().sum()

15

In [115]:
wiki_data.columns

Index(['lang', 'family', 'dialects', 'iso6393', 'glottocode', 'speakers',
       'regions', 'off_lang', 'rec_min_lang', 'Wikipedia_Url'],
      dtype='object')

In [116]:
iso_glotto_wiki_data = pd.merge(iso_glotto_data, wiki_data, how = 'left', on = ['iso6393', 'glottocode', 'Wikipedia_Url'])
# iso_glotto_wiki_data = iso_glotto_wiki_data.drop_duplicates(subset = 'iso6393')
print(len(iso_glotto_wiki_data) == len(iso_glotto_data))
# iso_glotto_wiki_data

False
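
A left merge keeps every row of the left frame, so its row count can only grow, and it grows when a key combination appears more than once in the right frame. The sketch below shows one way to find such repeated keys. The frames and their values are made up for illustration, and only the `iso6393` column name mirrors this notebook.

```python
import pandas as pd

# Toy frames; the values are placeholders, not real language data.
left = pd.DataFrame({'iso6393': ['xxa', 'xxb', 'xxc'],
                     'name': ['Lang A', 'Lang B', 'Lang C']})
right = pd.DataFrame({'iso6393': ['xxa', 'xxb', 'xxb'],
                      'speakers': ['value 1', 'value 2', 'value 3']})

keys = ['iso6393']

# Rows of the right frame whose key combination occurs more than once.
# Each extra occurrence adds one extra row to the result of a left merge.
repeated = right[right.duplicated(subset=keys, keep=False)]
print(repeated)

merged = pd.merge(left, right, how='left', on=keys)
print(len(merged), len(left))

# Keeping one row per key on the right restores one row per left row.
deduped = pd.merge(left, right.drop_duplicates(subset=keys), how='left', on=keys)
print(len(deduped) == len(left))
```

Passing `validate='many_to_one'` to `pd.merge` is another option: it raises a `MergeError` when the right-hand keys are not unique, rather than silently adding rows.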


Below you see the previously merged dataframe (iso_glotto_wiki_data) merged with the Latitude/Longitude Data.

Latitude/Longitude Data Source:
* The Latitude/Longitude Data can be seen in the ../csv_files/languages_and_dialects_geo.csv file. This was downloaded from Glottolog and can be found on the following page: https://glottolog.org/meta/downloads. 

In [117]:
lat_long_dialects = lat_long_dialects.reset_index()
lat_long_dialects = lat_long_dialects.rename(columns = {'isocodes': 'iso6393'})
# lat_long_dialects

In [118]:
iso_glotto_wiki_lat_long_data = pd.merge(iso_glotto_wiki_data, lat_long_dialects, how = 'left', on = ['iso6393', 'glottocode', 'name'])
print(len(iso_glotto_wiki_lat_long_data) == len(iso_glotto_wiki_data))
# iso_glotto_wiki_lat_long_data

True


Below you see the previously merged dataframe (iso_glotto_wiki_lat_long_data) merged with Glottolog's relevant 'Languoid' data, which provides the information listed below for every language in their database. This was acquired by downloading the glottolog_languoid.csv.zip file found on the following page: https://glottolog.org/meta/downloads.

Languoid Information:
* Glottocode 
* Glottolog's Family ID for the family of a given language 
* Glottolog's Parent ID for the parent language of a given language 
* The name of a given language (as listed by Glottolog)
* Glottolog's Bookkeeping value 
    * If this value is true, the languoid listed is not regarded as a 'real languoid' by Glottolog's editors, but has been given a glottocode for bookkeeping purposes. 
* The level of a given language 
* The Latitude and Longitude values of a given language 
* The ISO 639-3 Code
* A description and markup description for a given language
* A count of child families, child languages, and child dialects of a given language
* IDs of nations where a language is spoken. 

More information about any of the above descriptors can be found on the following page: 
Glottolog 5.0.
Leipzig: Max Planck Institute for Evolutionary Anthropology.
https://doi.org/10.5281/zenodo.8131084
(Available online at http://glottolog.org, Accessed on 2024-03-11.)

Note: The lat_long_dialects file is not redundant with this one, because it lists the macroarea and this file does not.

In [119]:
languoid_data = languoid_data.reset_index()
languoid_data = languoid_data.rename(columns = {'id': 'glottocode', 'iso639P3code': 'iso6393'})
# languoid_data

In [120]:
iso_glotto_wiki_lat_long_languoid_data = pd.merge(iso_glotto_wiki_lat_long_data, languoid_data, how = 'left', on = ['iso6393', 'glottocode', 'level', 'latitude', 'longitude'])
# Coalesce row by row: keep name_x where present, otherwise fall back to name_y
iso_glotto_wiki_lat_long_languoid_data['name'] = iso_glotto_wiki_lat_long_languoid_data['name_x'].fillna(iso_glotto_wiki_lat_long_languoid_data['name_y'])
iso_glotto_wiki_lat_long_languoid_data = iso_glotto_wiki_lat_long_languoid_data.drop(columns = ['name_x', 'name_y'])
print(len(iso_glotto_wiki_lat_long_languoid_data) == len(iso_glotto_wiki_lat_long_data))
# iso_glotto_wiki_lat_long_languoid_data

True
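
Because both frames carry a `name` column that is not one of the merge keys, pandas keeps both and suffixes them `_x` (left) and `_y` (right). A minimal sketch of combining the two row by row follows. The frames, glottocodes, and names are placeholders invented for illustration.

```python
import pandas as pd

# Placeholder frames: both have a 'name' column that is not a merge key.
left = pd.DataFrame({'glottocode': ['g1', 'g2', 'g3'],
                     'name': ['Name A', None, 'Name C']})
right = pd.DataFrame({'glottocode': ['g1', 'g2'],
                      'name': ['Alt A', 'Alt B']})

merged = pd.merge(left, right, how='left', on='glottocode')
# merged now holds name_x (from left) and name_y (from right).

# Row-wise coalesce: take name_x where present, otherwise name_y.
merged['name'] = merged['name_x'].fillna(merged['name_y'])
merged = merged.drop(columns=['name_x', 'name_y'])
print(merged['name'].tolist())
```

Only the second row falls back to the right-hand name here, since it is the only row whose left-hand name is missing.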


Below you see the previous dataframe (iso_glotto_wiki_lat_long_languoid_data) merged with data on extinct languages, as determined by UNESCO. This data can be seen in ../csv_files/extinct_languages_with_info.csv and was downloaded from the following page: https://www.theguardian.com/news/datablog/2011/apr/15/language-extinct-endangered. Although outdated, it provides more specific information on the speakers of many extinct and endangered languages, which are the focus of this project.

In [121]:
extinct_data = extinct_data.reset_index()
extinct_data = extinct_data.rename(columns = {'Name in English': 'name'})
# extinct_data

In [122]:
iso_glotto_wiki_lat_long_languoid_extinct_data = pd.merge(iso_glotto_wiki_lat_long_languoid_data, extinct_data, how = 'left', on = ['name'])
iso_glotto_wiki_lat_long_languoid_extinct_data = iso_glotto_wiki_lat_long_languoid_extinct_data.drop_duplicates(subset = 'iso6393')
print(len(iso_glotto_wiki_lat_long_languoid_extinct_data) == len(iso_glotto_wiki_lat_long_languoid_data))
# iso_glotto_wiki_lat_long_languoid_extinct_data

False
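
Since this merge matches on `name` alone, it can help to check how many rows actually found a partner in the extinct-language table. The sketch below uses the `indicator` parameter of `pd.merge` on made-up frames. The names and codes are placeholders, and only the column labels mirror this notebook.

```python
import pandas as pd

# Placeholder frames standing in for the merged data and the extinct-language data.
languages = pd.DataFrame({'name': ['Lang A', 'Lang B', 'Lang C'],
                          'iso6393': ['xxa', 'xxb', 'xxc']})
extinct = pd.DataFrame({'name': ['Lang B', 'Lang D'],
                        'Degree of endangerment': ['Extinct', 'Extinct']})

# indicator=True adds a '_merge' column: 'both' for matched rows,
# 'left_only' for rows that found no partner on the right.
merged = pd.merge(languages, extinct, how='left', on='name', indicator=True)
counts = merged['_merge'].value_counts()
print(counts['both'], counts['left_only'])
```

A low `both` count would suggest that the two sources spell many names differently, which a name-only join cannot reconcile.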


In [123]:
columns_reordered = ['name', 'lang', 'glottocode', 'iso6393', 'aes_status', 
        'Degree of endangerment', 'family', 'family_id', 'dialects', 'child_dialect_count', 
        'child_family_count', 'child_language_count', 'speakers', 'Number of speakers', 
        'regions', 'macroarea', 'latitude', 'longitude', 'level', 'type', 'scope', 
        'bookkeeping', 'description', 'markup_description', 'country_ids', 'parent_id', 
        'off_lang', 'rec_min_lang', 'Wikipedia_Url']

iso_glotto_wiki_lat_long_languoid_extinct_data = iso_glotto_wiki_lat_long_languoid_extinct_data[columns_reordered]
iso_glotto_wiki_lat_long_languoid_extinct_data = iso_glotto_wiki_lat_long_languoid_extinct_data.drop(columns = ['lang'])
iso_glotto_wiki_lat_long_languoid_extinct_data

Unnamed: 0,name,glottocode,iso6393,aes_status,Degree of endangerment,family,family_id,dialects,child_dialect_count,child_family_count,...,type,scope,bookkeeping,description,markup_description,country_ids,parent_id,off_lang,rec_min_lang,Wikipedia_Url
0,Ghotuo,ghot1243,aaa,not endangered,,"['Niger–Congo', '?\n', 'Atlantic–Congo', 'Volt...",atla1278,,0.0,0.0,...,living,individual,False,,,NG,afen1234,,,https://en.wikipedia.org/wiki/Ghotuo_language
1,Alumu-Tesu,alum1246,aab,not endangered,,"['Niger–Congo', '?\n', 'Atlantic–Congo', 'Benu...",,"['Alumu', 'Tesu']",,,...,living,individual,,,,,,,,https://en.wikipedia.org/wiki/Alumu_language
2,Ari,arii1243,aac,moribund,Severely endangered,"['Papuan Gulf', '\xa0?\n', 'Gogodala–Suki', 'G...",suki1244,,0.0,0.0,...,living,individual,False,,,PG,ariw1234,,,https://en.wikipedia.org/wiki/Ari_language_(Ne...
3,Amal,amal1242,aad,shifting,,"['Sepik', 'Upper Sepik', 'Amal–Kalou', 'Amal']",sepi1257,,0.0,0.0,...,living,individual,False,,,PG,sepi1257,,,https://en.wikipedia.org/wiki/Amal_language
4,Arbëreshë Albanian,arbe1236,aae,threatened,,"['Indo-European', 'Albanian', 'Tosk', 'Souther...",indo1319,"['Vaccarizzo Albanian', 'Palermitan Albanian\n...",4.0,0.0,...,living,individual,False,,,IT,sout3378,,,https://en.wikipedia.org/wiki/Arb%C3%ABresh_la...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7872,Youjiang Zhuang,youj1238,zyj,not endangered,,"['Kra–Dai', 'Tai', 'Northern Tai', ' (', 'Nort...",taik1256,,0.0,0.0,...,living,individual,False,,,CN,nort3189,,,https://en.wikipedia.org/wiki/Youjiang_Zhuang
7873,Yongnan Zhuang,yong1275,zyn,not endangered,,"['Kra–Dai', 'Tai', 'various Zhuang branches', ...",taik1256,,0.0,0.0,...,living,individual,False,,,CN VN,yong1274,,,https://en.wikipedia.org/wiki/Yongnan_languages
7874,Zyphe Chin,zyph1238,zyp,not endangered,,"['Sino-Tibetan', '\n(', 'Tibeto-Burman', ')', ...",,,,,...,living,individual,,,,,,,,https://en.wikipedia.org/wiki/Zyphe_language
7875,Zaza,,zza,,,,,,,,...,living,macrolanguage,,,,,,,,


In [125]:
iso_glotto_wiki_lat_long_languoid_extinct_data.to_csv('../csv_files/mass_merge.csv')
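
Columns such as `family` and `dialects` appear in the output above as list-like text. Assuming they are stored as string representations of Python lists (an assumption, since the scraper is not shown here), they can be turned back into lists after the saved file is reloaded. The in-memory CSV and the `parse_list` helper below are hypothetical stand-ins for illustration.

```python
import ast
import io

import pandas as pd

# In-memory stand-in for the saved CSV, with an unnamed index column first.
csv_text = '''\
,name,dialects
0,Alumu-Tesu,"['Alumu', 'Tesu']"
1,Ghotuo,
'''

# index_col=0 reads the saved index back instead of creating 'Unnamed: 0'.
df = pd.read_csv(io.StringIO(csv_text), index_col=0)

def parse_list(cell):
    """Turn "['a', 'b']" into ['a', 'b']; map missing values to an empty list."""
    if pd.isna(cell):
        return []
    return ast.literal_eval(cell)

df['dialects'] = df['dialects'].apply(parse_list)
print(df.loc[0, 'dialects'])
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for text that came from scraped pages.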