## Data ingestion and scraping 

### How to use this notebook

The goal of this notebook is to download the necessary datasets and structurally prepare them for data cleaning and analysis.

Use this notebook after downloading the datasets from their respective URLs. For the Wikipedia data, use the code below to scrape the tables directly from the webpages.

### Overview

We pull data from three main sources that describe various features of a language, then compile it into one dataset that will be used to train our model.

The datasets and features are as follows: 

1. **Endangered Languages Dataset**

  - *URL:* https://endangeredlanguages.com/userquery/
  - Number of speakers (numeric, discrete)
  - Areas and countries where spoken (categorical)
  - Level of endangerment (categorical)

2. **Wikipedia: List of official languages by country and territory, and List of languages by total number of speakers** $^*$

  - *URL:* https://en.wikipedia.org/wiki/List_of_official_languages_by_country_and_territory
  - *URL:* https://en.wikipedia.org/wiki/List_of_languages_by_total_number_of_speakers
  - Recognized as a country's official language by a government body (binary)
  - Widely spoken, regional, minority or national language (categorical)
  - Number of speakers of non-endangered languages (numeric, discrete)

3. **World Bank Indicators**
  - *URL:* https://data.worldbank.org/indicator?tab=all
  - Rate of urbanization in countries where it is spoken (numeric, continuous)
  - Percentage of population using the internet (numeric, continuous)

$^*$ **NOTE:** Under Wikipedia's Creative Commons Attribution-ShareAlike 4.0 International license, you are free to:
  - Share — copy and redistribute the material in any medium or format for any purpose, even commercially.
  - Adapt — remix, transform, and build upon the material for any purpose, even commercially.
  - The licensor cannot revoke these freedoms as long as you follow the license terms.

See more at https://creativecommons.org/licenses/by-sa/4.0/deed.en

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import os

### Endangered Languages Dataset

In [2]:
## Read csv

df1 = pd.read_csv('endangered_languages.csv', header=None)
df1.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13
0,,index,abbrv,official_name,other_names,level,speakers,root_1,root_2,root_3,root_4,country,continent,long_lat
1,1.0,3645,knw,!Xun,Ju; !Xun (Ekoka); Kung-Ekoka; !Kung; Ekoka-!Xû...,"Vulnerable (20 percent certain, based on the e...","14,000-18,000",Kx'a,"Southeastern !Xun, Northwestern !Xun, Central ...",,,South Africa;Namibia;Angola;,Africa,"-28.74358,23.983154; -17.560247, 18.050537; -1..."
2,2.0,3956,bpk,'Ôrôê,Orowe; Boewe; Neukaledonien;,"Endangered (20 percent certain, based on the e...",590,Austronesian; Malayo-Polynesian; Oceanic; New ...,,,,New Caledonia;,Pacific,"-21.4223,165.4678"
3,3.0,1933,taa,(Lower) Tanana,,"Critically Endangered (80 percent certain, bas...",25,Athabaskan-Eyak-Tlingit; Dene (Athabaskan),Minto-Nenana; Salcha; Chena,,Tanana is the language of the Lower Tanana riv...,USA;,North America,"65.157778, -149.37;64.521111, -146.980556;64.5..."
4,4.0,1043,con,A'ingae,Kofane; Cofán; Kofán; A'i; A'ingaé; Colin; Kof...,"Vulnerable (100 percent certain, based on the ...",1500,Isolate; South American,,,,Colombia;Ecuador;,South America,"0.054639, -77.409417"


In [3]:
## Insert column headers
df1 = df1.iloc[:, 1:].reset_index(drop=True)
df1.columns = ['index', 'abbrv', 'official_name', 'other_names', 'level', 'speakers', 'root_1', 'root_2', 'root_3', 
                'root_4', 'country', 'continent', 'long_lat']
df1 = df1[1:]
df1.head()

Unnamed: 0,index,abbrv,official_name,other_names,level,speakers,root_1,root_2,root_3,root_4,country,continent,long_lat
1,3645,knw,!Xun,Ju; !Xun (Ekoka); Kung-Ekoka; !Kung; Ekoka-!Xû...,"Vulnerable (20 percent certain, based on the e...","14,000-18,000",Kx'a,"Southeastern !Xun, Northwestern !Xun, Central ...",,,South Africa;Namibia;Angola;,Africa,"-28.74358,23.983154; -17.560247, 18.050537; -1..."
2,3956,bpk,'Ôrôê,Orowe; Boewe; Neukaledonien;,"Endangered (20 percent certain, based on the e...",590,Austronesian; Malayo-Polynesian; Oceanic; New ...,,,,New Caledonia;,Pacific,"-21.4223,165.4678"
3,1933,taa,(Lower) Tanana,,"Critically Endangered (80 percent certain, bas...",25,Athabaskan-Eyak-Tlingit; Dene (Athabaskan),Minto-Nenana; Salcha; Chena,,Tanana is the language of the Lower Tanana riv...,USA;,North America,"65.157778, -149.37;64.521111, -146.980556;64.5..."
4,1043,con,A'ingae,Kofane; Cofán; Kofán; A'i; A'ingaé; Colin; Kof...,"Vulnerable (100 percent certain, based on the ...",1500,Isolate; South American,,,,Colombia;Ecuador;,South America,"0.054639, -77.409417"
5,3581,aas,Aasáx,"Asax; Asá; Aasá; Assa; Asak; ""Ndorobo""; ""Dorob...",Dormant,0,Afro-Asiatic; Cushitic; South Cushitic,,,,Tanzania;,Africa,"-5.1948,37.738"


In [4]:
## Save updated csv to repo folder

df1.to_csv('endangered_languages.csv')
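
With its default settings, `to_csv` writes the row index as an extra first column with an empty header. When the file is read back, that column appears as `Unnamed: 0`. The following is a minimal sketch of one way to avoid this, assuming the index carries no information worth keeping. The toy frame and in-memory buffers are illustrative stand-ins, not the real dataset.

```python
import io
import pandas as pd

# Toy stand-in for the real table (two columns only, for illustration).
toy = pd.DataFrame({"abbrv": ["knw", "bpk"], "speakers": ["14,000-18,000", "590"]})

# Default save: the row index is written as an extra, unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
reloaded_default = pd.read_csv(with_index)

# Save without the index: the column names round-trip unchanged.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_clean = pd.read_csv(without_index)

print(list(reloaded_default.columns))
print(list(reloaded_clean.columns))
```

The same `index=False` argument could be passed to any of the `to_csv` calls in this notebook if the extra column is not wanted.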

### Wikipedia Datasets

In [5]:
## Official language: Scrape from Wikipedia page

!pip install lxml

## URL of the Wikipedia page
url = "https://en.wikipedia.org/wiki/List_of_official_languages_by_country_and_territory"

## Read all tables from the page
tables = pd.read_html(url)

## Inspect how many tables were found
print(f"Number of tables found: {len(tables)}")

## Example: View the second table (index 1), which is the main country table on this page
df2 = tables[1]
print(df2.head())

Number of tables found: 9
         Country/Region  Number of official (including de facto)  \
0           Abkhazia[a]                                        2   
1  Afghanistan[1][2][3]                                        2   
2            Albania[4]                                        1   
3            Algeria[5]                                        2   
4               Andorra                                        1   

    Official language(s)                               Regional language(s)  \
0         Abkhaz Russian                                                NaN   
1  Persian (Dari) Pashto  Uzbek[b] Turkmen[b] Pashayi[b] Nuristani[b] Ba...   
2               Albanian                                                NaN   
3          Arabic Berber                                                NaN   
4             Catalan[6]                                                NaN   

         Minority language(s)   National language(s)   Widely spoken  
0                  
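
Selecting a table by its position depends on the page layout at the time of scraping, and the position can shift if the page is edited. A possible alternative, sketched below, is to pick the table by the column names it should contain. The helper `pick_table` is hypothetical (not part of pandas), and the toy list stands in for the list of DataFrames that `pd.read_html` returns.

```python
import pandas as pd

def pick_table(tables, required_columns):
    """Return the first DataFrame whose columns include every required name."""
    required = set(required_columns)
    for table in tables:
        if required.issubset(set(table.columns)):
            return table
    raise ValueError(f"No table contains the columns {sorted(required)}")

# Toy stand-in for the list returned by pd.read_html(url).
toy_tables = [
    pd.DataFrame({"Note": ["navigation box"]}),
    pd.DataFrame({
        "Country/Region": ["Andorra"],
        "Official language(s)": ["Catalan"],
    }),
]

main_table = pick_table(toy_tables, ["Country/Region", "Official language(s)"])
print(main_table)
```

`pd.read_html` also accepts a `match` argument that keeps only tables containing a given text or regular expression, which may serve the same purpose at scrape time.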

In [6]:
## Save table to csv in repo folder 

df2.to_csv('official_languages.csv')

In [7]:
## Read new csv

df2 = pd.read_csv('official_languages.csv')
df2.head()

Unnamed: 0.1,Unnamed: 0,Country/Region,Number of official (including de facto),Official language(s),Regional language(s),Minority language(s),National language(s),Widely spoken
0,0,Abkhazia[a],2,Abkhaz Russian,,Georgian,Abkhaz,
1,1,Afghanistan[1][2][3],2,Persian (Dari) Pashto,Uzbek[b] Turkmen[b] Pashayi[b] Nuristani[b] Ba...,,Persian (Dari) Pashto,Persian (Dari)
2,2,Albania[4],1,Albanian,,Greek Macedonian Aromanian,,Italian
3,3,Algeria[5],2,Arabic Berber,,,Arabic Berber,French
4,4,Andorra,1,Catalan[6],,Spanish French Portuguese,,


In [8]:
## Number of Speakers: Scrape from Wikipedia page

!pip install lxml

## URL of the Wikipedia page
url = "https://en.wikipedia.org/wiki/List_of_languages_by_total_number_of_speakers"

## Read all tables from the page
tables_2 = pd.read_html(url)

## Inspect how many tables were found
print(f"Number of tables found: {len(tables_2)}")

## Example: View the first one (this is usually the main table)
df5 = tables_2[0]
print(df5.head())

Number of tables found: 7
                                            Language         Family  \
                                            Language         Family   
0                   English (excl. creole languages)  Indo-European   
1  Mandarin Chinese (incl. Standard Chinese but e...   Sino-Tibetan   
2                                 Hindi (excl. Urdu)  Indo-European   
3                   Spanish (excl. creole languages)  Indo-European   
4            Modern Standard Arabic (excl. dialects)   Afro-Asiatic   

       Branch Numbers of speakers (millions)                        \
       Branch           First- language (L1) Second- language (L2)   
0    Germanic                            390                  1138   
1     Sinitic                            990                   194   
2  Indo-Aryan                            345                   264   
3     Romance                            484                    74   
4     Semitic                           0[a]            

In [9]:
## Save table to csv in repo folder 

df5.to_csv('speaker_count.csv')

In [10]:
## Read new csv

df5 = pd.read_csv('speaker_count.csv')
df5.head()

Unnamed: 0.1,Unnamed: 0,Language,Family,Branch,Numbers of speakers (millions),Numbers of speakers (millions).1,Numbers of speakers (millions).2
0,,Language,Family,Branch,First- language (L1),Second- language (L2),Total (L1+L2)
1,0.0,English (excl. creole languages),Indo-European,Germanic,390,1138,1528
2,1.0,Mandarin Chinese (incl. Standard Chinese but e...,Sino-Tibetan,Sinitic,990,194,1184
3,2.0,Hindi (excl. Urdu),Indo-European,Indo-Aryan,345,264,609
4,3.0,Spanish (excl. creole languages),Indo-European,Romance,484,74,558
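
The scraped speaker table has a two-level column header, so the saved CSV contains two header lines and the second one is read back as a data row (row 0 in the preview above). One way to handle this, sketched here under the assumption that single-level column names are acceptable, is to flatten the header before saving. The helper `flatten_columns` and the toy frame are illustrative; the toy layout mimics the preview but is not the real scrape.

```python
import io
import pandas as pd

# Toy frame with a two-level header shaped like the scraped table.
columns = pd.MultiIndex.from_tuples([
    ("Language", "Language"),
    ("Numbers of speakers (millions)", "First- language (L1)"),
    ("Numbers of speakers (millions)", "Second- language (L2)"),
])
toy = pd.DataFrame([["English (excl. creole languages)", 390, 1138]], columns=columns)

def flatten_columns(frame):
    """Collapse a two-level header into single strings, dropping repeated levels."""
    flat = []
    for top, bottom in frame.columns:
        flat.append(top if top == bottom else f"{top}: {bottom}")
    result = frame.copy()
    result.columns = flat
    return result

flat = flatten_columns(toy)

# Round-trip through CSV: only one header line is written now.
buffer = io.StringIO()
flat.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(list(reloaded.columns))
```

An alternative that keeps both levels is to read the saved file with `header=[0, 1]`, which tells `read_csv` that the first two lines together form the column header.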


### World Bank Indicators: urban development

In [11]:
## Read csv, skipping the extra comment rows above the header

df3 = pd.read_csv('wb_urban.csv', skiprows=3)
df3.head()

Unnamed: 0,2,5,Albania,ALB,Urban population (% of total population),SP.URB.TOTL.IN.ZS,30.705,30.943,31.015,31.086,...,58.421,59.383,60.319,61.229,62.112,62.969,63.799,64.603,Unnamed: 70,Unnamed: 71
0,3,6,Andorra,AND,Urban population (% of total population),SP.URB.TOTL.IN.ZS,58.45,60.983,63.462,65.872,...,88.248,88.15,88.062,87.984,87.916,87.858,87.811,87.774,,
1,4,7,Arab World,ARB,Urban population (% of total population),SP.URB.TOTL.IN.ZS,31.010536,31.727163,32.451442,33.194309,...,57.653103,57.862747,58.010484,58.204542,58.496674,58.662927,59.054267,59.443854,,
2,5,8,United Arab Emirates,ARE,Urban population (% of total population),SP.URB.TOTL.IN.ZS,73.5,74.383,75.248,76.093,...,85.965,86.248,86.522,86.789,87.048,87.299,87.543,87.779,,
3,6,9,Argentina,ARG,Urban population (% of total population),SP.URB.TOTL.IN.ZS,73.611,74.217,74.767,75.309,...,91.627,91.749,91.87,91.991,92.111,92.229,92.347,92.463,,
4,7,10,Armenia,ARM,Urban population (% of total population),SP.URB.TOTL.IN.ZS,51.275,52.147,53.019,53.889,...,63.082,63.103,63.149,63.219,63.313,63.431,63.573,63.739,,


In [12]:
## Save updated csv to repo folder

df3.to_csv('wb_urban.csv')
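
The cells above read `wb_urban.csv`, skip three rows, and then overwrite the same file. If the notebook is re-run, `skiprows` is applied again to the already-trimmed file and drops real data rows; the preview above, where Albania's row appears as the header, is consistent with that. Below is a sketch of one way to make this step safe to re-run, assuming the original download is kept under a separate name. The function `ingest`, the file names, and the toy file contents are all hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

def ingest(raw_path, out_path, skiprows):
    """Read the untouched download and write the trimmed table to a new file."""
    frame = pd.read_csv(raw_path, skiprows=skiprows)
    frame.to_csv(out_path, index=False)
    return frame

# Toy stand-in for a download with three comment lines above the header.
raw_text = (
    "Data Source,illustrative only\n"
    "Last Updated Date,illustrative only\n"
    "Notes,illustrative only\n"
    "Country Name,Country Code,Value\n"
    "Albania,ALB,63.799\n"
    "Andorra,AND,87.811\n"
)

with tempfile.TemporaryDirectory() as tmp:
    raw_path = Path(tmp) / "wb_urban_raw.csv"    # hypothetical file name
    out_path = Path(tmp) / "wb_urban_clean.csv"  # hypothetical file name
    raw_path.write_text(raw_text)

    first = ingest(raw_path, out_path, skiprows=3)
    second = ingest(raw_path, out_path, skiprows=3)  # re-run reads the raw file again

print(list(first.columns))
print(first.equals(second))
```

Because the raw file is never overwritten, running the step twice yields the same table. The same pattern would apply to the internet-users file.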

### World Bank Indicators: internet users

In [13]:
## Read csv, skipping the extra comment rows above the header

df4 = pd.read_csv('wb_internet.csv', skiprows=3)
df4.head()

Unnamed: 0,2,5,Albania,ALB,Individuals using the Internet (% of population),IT.NET.USER.ZS,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,59.6,62.4,65.4,68.6,72.2,79.3,82.6,83.1,Unnamed: 70,Unnamed: 71
0,3,6,Andorra,AND,Individuals using the Internet (% of population),IT.NET.USER.ZS,,,,,...,89.7,91.6,,90.7,93.2,93.9,94.5,95.4,,
1,4,7,Arab World,ARB,Individuals using the Internet (% of population),IT.NET.USER.ZS,,,,,...,,,,,,,,,,
2,5,8,United Arab Emirates,ARE,Individuals using the Internet (% of population),IT.NET.USER.ZS,,,,,...,90.6,94.8,98.5,99.2,100.0,100.0,100.0,100.0,,
3,6,9,Argentina,ARG,Individuals using the Internet (% of population),IT.NET.USER.ZS,,,,,...,71.0,74.3,77.7,79.9,85.5,87.2,88.4,89.2,,
4,7,10,Armenia,ARM,Individuals using the Internet (% of population),IT.NET.USER.ZS,,,,,...,64.3,64.7,68.2,66.5,76.5,78.6,77.0,80.0,,


In [14]:
## Save updated csv to repo folder

df4.to_csv('wb_internet.csv')
