# Mass Merge of Relevant Data

This notebook merges and documents the data collected so far in this project.

In [129]:
import pandas as pd 
import requests
import unicodedata
from tqdm import tqdm
from bs4 import BeautifulSoup
import numpy as np
import seaborn as sns

iso_data = pd.read_json('../js_files/iso6393.json')
glottolog_data = pd.read_csv('../csv_files/glottolog_status_data_with_links.csv', index_col = 0)
wiki_data = pd.read_csv('../csv_files/wiki_languages_most_recent.csv', index_col = 0)
lat_long_dialects = pd.read_csv('../csv_files/languages_and_dialects_geo.csv', index_col = 0)
languoid_data = pd.read_csv('../csv_files/glottolog_languoid.csv', index_col = 0)
extinct_data = pd.read_csv('../csv_files/Extinct languages - DATA SUMMARY.csv', index_col = 0)


In [130]:
glottolog_data = glottolog_data.reset_index()
# glottolog_data

In [131]:
pd.read_csv('../csv_files/glottolog_status_data.csv', index_col = 0).columns

Index(['iso6393', 'glottocode', 'aes_status'], dtype='object')

# Update Location Later

Below you see the Glottolog Data (ISO 639-3 Code, Glottocode, Agglomerated Endangerment Status, and Wikipedia page URL) merged to the ISO Data (Language Name, Language Type, Language Scope, ISO 639-3 Code) with a left merge on the ISO 639-3 Code.

Glottolog Data Source:
* Glottolog Data was gathered for the ISO 639-3 Code, Glottocode, and Agglomerated Endangerment Status from XXX, and can be seen in ../csv_files/glottolog_status_data.csv.
* The above data frame was then expanded into ../csv_files/glottolog_status_data_with_links.csv by adding the Wikipedia page URL with the web scraper in ../scrapers_organized/glottolog_scraper.ipynb.

ISO Data Source:
* ISO Data was downloaded from the iso6393.js file listed in the following GitHub repository: https://github.com/wooorm/iso-639-3. This file can be seen in ../js_files/iso6393.js
* The above file was then converted to a .json file using regular expressions, which can be seen in ../js_files/iso6393.json. 
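
The conversion script itself is not shown in this notebook. As a rough sketch of how a regular-expression conversion like this could work, the snippet below turns a small JavaScript array of object literals into JSON. The sample text, the assumed file layout (unquoted keys, single-quoted values, no trailing commas), and the helper name `js_array_to_json` are all illustrative assumptions, not the actual script used in this project.

```python
import json
import re

# Hypothetical excerpt in the style of a JS module exporting an array of
# object literals; the real iso6393.js layout may differ.
js_source = """
export const iso6393 = [
  {
    name: 'Ghotuo',
    type: 'living',
    scope: 'individual',
    iso6393: 'aaa'
  },
  {
    name: 'Alumu-Tesu',
    type: 'living',
    scope: 'individual',
    iso6393: 'aab'
  }
]
"""


def js_array_to_json(source: str) -> str:
    """Convert a simple JS array literal to a JSON string (illustrative only)."""
    # Keep only the array literal: from the first '[' to the last ']'.
    array_text = source[source.index('['): source.rindex(']') + 1]
    # Quote bare object keys at the start of a line: `name:` becomes `"name":`.
    array_text = re.sub(r'^([ \t]*)(\w+):', r'\1"\2":', array_text, flags=re.MULTILINE)
    # Swap single-quoted string values for double-quoted ones.
    # Assumes no value contains an apostrophe or a double quote.
    array_text = re.sub(r"'([^']*)'", r'"\1"', array_text)
    return array_text


records = json.loads(js_array_to_json(js_source))
print(records[0]['iso6393'])
```

Language names that contain apostrophes or double quotes would break this simple substitution, so a real conversion would need to handle those cases separately.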

In [132]:
iso_glotto_data = pd.merge(iso_data, glottolog_data, how = 'left', on = 'iso6393')
iso_glotto_data = iso_glotto_data.drop_duplicates(subset = 'iso6393')
# Sanity check: the merged frame should have as many rows as the larger of the two inputs.
print(len(iso_glotto_data) == max(len(iso_data), len(glottolog_data)))
# iso_glotto_data

True
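
To illustrate why the duplicates are dropped before the row-count check, here is a minimal sketch using toy frames. The values and the names `left` and `right` are made up for illustration. A left merge repeats a left row once for every matching right row, and `indicator=True` adds a `_merge` column that shows which left rows found no match.

```python
import pandas as pd

# Toy frames (hypothetical values) standing in for iso_data and glottolog_data.
left = pd.DataFrame({'iso6393': ['aaa', 'aab', 'aac'],
                     'name': ['A', 'B', 'C']})
right = pd.DataFrame({'iso6393': ['aaa', 'aaa', 'aab'],
                      'glottocode': ['g1', 'g2', 'g3']})

merged = pd.merge(left, right, how='left', on='iso6393', indicator=True)
# The duplicated key 'aaa' on the right yields two rows for one left row.
print(len(merged))

deduped = merged.drop_duplicates(subset='iso6393')
# After dropping duplicates, the row count matches the left frame again.
assert len(deduped) == len(left)

# Rows marked 'left_only' had no counterpart in the right frame.
unmatched = deduped.loc[deduped['_merge'] == 'left_only', 'iso6393'].tolist()
print(unmatched)
```

Note that `drop_duplicates` keeps the first matching row, so which Glottocode survives for a duplicated ISO code depends on the row order of the right-hand frame.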


Below you see the previously merged dataframe (iso_glotto_data) merged to the Wikipedia Data.

Wikipedia Data Source:
* All data obtained from Wikipedia in this data frame was scraped by the web scraper in the ../scrapers_organized/wikipedia_scraper.ipynb file. This gathered the following fields from the infobox in the top right side of the page, where applicable:
    * Language Name (as listed on Wikipedia)
    * Language Family 
    * Language Dialects 
    * ISO 639-3 Code
    * Glottocode 
    * Number of Speakers 
    * Regions wherein the language is spoken 
    * Nations wherein the language is an official language 
    * Nations wherein the language is a recognized minority language 
    * The Wikipedia URL used to access the page

Note: The name column is preserved from the ISO 639-3 .json file using a left merge, since this represents the internationally recognized name of a given language.

In [133]:
iso_glotto_wiki_data = pd.merge(iso_glotto_data, wiki_data, how = 'left', on = ['iso6393', 'glottocode', 'Wikipedia_Url'])
iso_glotto_wiki_data = iso_glotto_wiki_data.drop_duplicates(subset = 'iso6393')
print(len(iso_glotto_wiki_data) == len(iso_glotto_data))
# iso_glotto_wiki_data

True


Below you see the previously merged dataframe (iso_glotto_wiki_data) merged to the Latitude/Longitude Data.

Latitude/Longitude Data Source:
* The Latitude/Longitude Data can be seen in the ../csv_files/languages_and_dialects_geo.csv file. This was downloaded from Glottolog and can be found on the following page: https://glottolog.org/meta/downloads. 

In [134]:
lat_long_dialects = lat_long_dialects.reset_index()
lat_long_dialects = lat_long_dialects.rename(columns = {'isocodes': 'iso6393'})
# lat_long_dialects

In [135]:
iso_glotto_wiki_lat_long_data = pd.merge(iso_glotto_wiki_data, lat_long_dialects, how = 'left', on = ['iso6393', 'glottocode', 'name'])
print(len(iso_glotto_wiki_lat_long_data) == len(iso_glotto_wiki_data))
# iso_glotto_wiki_lat_long_data

True
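
This merge has no `drop_duplicates` step, so the row-count check passes only if each key combination appears at most once in the right-hand frame. As a hedged alternative, pandas can enforce that assumption directly through the `validate` argument of `pd.merge`. The sketch below uses made-up toy frames to show the check failing on a duplicated key.

```python
import pandas as pd

# Toy frames (hypothetical values); the right frame repeats the key 'g1'.
left = pd.DataFrame({'glottocode': ['g1', 'g2'], 'name': ['A', 'B']})
right = pd.DataFrame({'glottocode': ['g1', 'g1'], 'latitude': [1.0, 2.0]})

try:
    # 'many_to_one' requires the merge keys to be unique in the right frame.
    pd.merge(left, right, how='left', on='glottocode', validate='many_to_one')
    duplicate_keys_detected = False
except pd.errors.MergeError:
    duplicate_keys_detected = True

print(duplicate_keys_detected)
```

Unlike comparing lengths after the merge, this raises an error as soon as the uniqueness assumption is violated.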


Below you see the previously merged dataframe (iso_glotto_wiki_lat_long_data) merged with Glottolog's relevant 'Languoid' data, which describes the information listed below for every language in its database. This was acquired by downloading the glottolog_languoid.csv.zip file found at the following page: https://glottolog.org/meta/downloads.

Languoid Information:
* Glottocode 
* Glottolog's Family ID for the family of a given language 
* Glottolog's Parent ID for the parent language of a given language 
* The name of a given language (as listed by Glottolog)
* Glottolog's Bookkeeping value 
    * If this value is true, the languoid listed is not regarded as a 'real languoid' by Glottolog's editors, but has been given a glottocode for bookkeeping purposes. 
* The level of a given language 
* The Latitude and Longitude values of a given language 
* The ISO 639-3 Code
* A description and markup description for a given language
* A count of child families, child languages, and child dialects of a given language
* IDs of nations where a language is spoken. 

More information about any of the above descriptors can be found on the following page: 
Glottolog 5.0.
Leipzig: Max Planck Institute for Evolutionary Anthropology.
https://doi.org/10.5281/zenodo.8131084
(Available online at http://glottolog.org, Accessed on 2024-03-11.)

Note: Since the macroarea is not listed in this file, but is listed in the lat_long_dialects file, the latter is not rendered redundant. 

In [136]:
languoid_data = languoid_data.reset_index()
languoid_data = languoid_data.rename(columns = {'id': 'glottocode', 'iso639P3code': 'iso6393'})
# languoid_data

In [137]:
iso_glotto_wiki_lat_long_languoid_data = pd.merge(iso_glotto_wiki_lat_long_data, languoid_data, how = 'left', on = ['iso6393', 'glottocode', 'level', 'latitude', 'longitude'])
# Prefer the ISO name (name_x) row by row; fall back to the Glottolog name (name_y) only where it is missing.
iso_glotto_wiki_lat_long_languoid_data['name'] = iso_glotto_wiki_lat_long_languoid_data['name_x'].fillna(iso_glotto_wiki_lat_long_languoid_data['name_y'])
iso_glotto_wiki_lat_long_languoid_data = iso_glotto_wiki_lat_long_languoid_data.drop(columns = ['name_x', 'name_y'])
print(len(iso_glotto_wiki_lat_long_languoid_data) == len(iso_glotto_wiki_lat_long_data))
# iso_glotto_wiki_lat_long_languoid_data

True
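
When both frames carry a `name` column, the merge produces `name_x` and `name_y`. A minimal sketch of combining them row by row is shown below on a toy frame whose values are chosen only for illustration. `Series.fillna` with another Series fills each missing `name_x` entry from the aligned `name_y` entry, rather than choosing one whole column.

```python
import pandas as pd

# Toy merge result (hypothetical values) with overlapping name columns.
merged = pd.DataFrame({'name_x': ['Ghotuo', None, 'Ambrak'],
                       'name_y': ['Ghotuo', 'Alumu-Tesu', None]})

# Take name_x where present, otherwise fall back to name_y for that row.
merged['name'] = merged['name_x'].fillna(merged['name_y'])
merged = merged.drop(columns=['name_x', 'name_y'])

print(merged['name'].tolist())
```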


Below you see the previous dataframe (iso_glotto_wiki_lat_long_languoid_data) merged with data on extinct languages, as determined by UNESCO. This data can be seen in ../csv_files/extinct_languages_with_info.csv. This was downloaded from the following page: https://www.theguardian.com/news/datablog/2011/apr/15/language-extinct-endangered. Although outdated, this dataset provides more specific information on the speakers of many extinct and endangered languages, which are the focus of this project.

In [138]:
extinct_data = extinct_data.reset_index()
extinct_data = extinct_data.rename(columns = {'Name in English': 'name'})
# extinct_data

In [139]:
iso_glotto_wiki_lat_long_languoid_extinct_data = pd.merge(iso_glotto_wiki_lat_long_languoid_data, extinct_data, how = 'left', on = ['name'])
iso_glotto_wiki_lat_long_languoid_extinct_data = iso_glotto_wiki_lat_long_languoid_extinct_data.drop_duplicates(subset = 'iso6393')
print(len(iso_glotto_wiki_lat_long_languoid_extinct_data) == len(iso_glotto_wiki_lat_long_languoid_data))
# iso_glotto_wiki_lat_long_languoid_extinct_data

True
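
Because this merge joins on the language name alone, rows match only when the spelling is identical in both sources. The sketch below shows one possible way to make a name-based join more tolerant, by normalizing accents, case, and surrounding whitespace into a helper key. This is not a step the notebook performs; the helper `normalize_name`, the column `name_key`, and the toy values are all hypothetical.

```python
import unicodedata

import pandas as pd


def normalize_name(name: str) -> str:
    """Strip accents, case-fold, and trim whitespace (illustrative helper)."""
    decomposed = unicodedata.normalize('NFKD', name)
    without_marks = ''.join(ch for ch in decomposed if not unicodedata.combining(ch))
    return without_marks.casefold().strip()


# Toy frames (hypothetical values) with inconsistent spellings of the same names.
languages = pd.DataFrame({'name': ['Lánguage A', 'Language B'],
                          'iso6393': ['xx1', 'xx2']})
extinct = pd.DataFrame({'name': ['language a', 'LANGUAGE B '],
                        'speakers': [0, 12]})

languages['name_key'] = languages['name'].map(normalize_name)
extinct['name_key'] = extinct['name'].map(normalize_name)

# Join on the normalized key while keeping the original name from the left frame.
merged = pd.merge(languages, extinct.drop(columns='name'), how='left', on='name_key')
print(merged[['name', 'iso6393', 'speakers']])
```

Normalization can also merge names that are genuinely distinct, so any matches produced this way would need to be spot-checked.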


In [140]:
iso_glotto_wiki_lat_long_languoid_extinct_data.to_csv('../csv_files/mass_merge.csv')