# Pre-Processing

In this notebook, we load the complete dataset (the [2020 Global State of Democracy Index](https://www.idea.int/gsod-indices/dataset-resources), which covers 1975-2019) and split it into train, test, and query sets for machine learning. We do this twice: once for the dataset as a whole and once for the most recent year only.

In [1]:
import pandas as pd

In [2]:
data = pd.read_csv('data/gsodi_pv_4.csv', dtype={
    'ID': int, 'ID_country_name': 'category', 'ID_country_code': int, 'ID_year': int, 
    'ID_country_year': int, 'ID_region': 'category', 'ID_subregion': 'category',
})

In [3]:
# data.head()
# data.describe()

## Current Year

Here we split the most recent year of data (2019) into separate files and training sets. We first load the complete set of 2019 data into `data_2019` and save it to `data/complete-2019.csv`.

In [4]:
data_2019 = data[data['ID_year'] == 2019]
data_2019.to_csv('data/complete-2019.csv')

Although we typically want to analyze the data at the country level, several rows in our dataset aggregate the information up to the world, region, or sub-region level. Here we drop all of these aggregate rows and save a version of the data that includes only countries.

In [27]:
drop_regions = ['African Union', 'ASEAN', 'European Union', 'OECD', 'OAS', 'East Africa', 'Central Africa', 
                'Southern Africa', 'West Africa', 'North Africa', 'Caribbean', 'Central America', 'South America', 
                'North America', 'Central Asia', 'East Asia','South Asia', 'South-East Asia', 'Oceania', 
                'Middle East', 'East-Central Europe', 'Eastern Europe', 'North/Western Europe', 'Southern Europe', 
                'Africa', 'Latin America/Caribbean','North America', 'Asia/Pacific', 'Middle East', 'Europe', 'World']
# Keep only rows whose name is not in drop_regions; `~` inverts the boolean mask
countries_2019 = data_2019[~data_2019['ID_country_name'].isin(drop_regions)]
countries_2019.to_csv('data/countries-2019.csv')
# countries_2019

| | ID | ID_country_name | ID_country_code | ID_year | ID_country_year | ID_region | ID_subregion | C_A1 | L_A1 | U_A1 | ... | v_51_02 | v_51_03 | v_51_04 | v_51_05 | v_51_06 | v_52_01 | v_53_01 | v_53_02 | v_54_01 | v_54_02 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 44 | 45 | United States | 2 | 2019 | 22019 | North America | North America | 0.815402 | 0.751830 | 0.878974 | ... | 0.683217 | 1.000000 | | | | 0.4702 | 0.000000 | 1.000000 | 0.908543 | 0.867543 |
| 89 | 90 | Canada | 20 | 2019 | 202019 | North America | North America | 0.829484 | 0.763739 | 0.895229 | ... | 0.755226 | 0.862970 | 0.579446 | 0.658272 | 0.706262 | 0.6242 | 0.019206 | 1.000000 | 0.969849 | 0.873963 |
| 134 | 135 | Cuba | 40 | 2019 | 402019 | Latin America/Caribbean | Caribbean | 0.218813 | 0.154375 | 0.283251 | ... | 0.306620 | 0.087140 | 0.317786 | 0.345063 | 0.092506 | 0.8082 | 0.193342 | 0.333333 | 0.723618 | 0.279630 |
| 179 | 180 | Haiti | 41 | 2019 | 412019 | Latin America/Caribbean | Caribbean | 0.468092 | 0.402041 | 0.534143 | ... | 0.540070 | 0.245576 | 0.677627 | 0.551368 | 0.637333 | 0.1782 | 0.000000 | 0.666667 | 0.943719 | 0.465946 |
| 224 | 225 | Dominican Republic | 42 | 2019 | 422019 | Latin America/Caribbean | Caribbean | 0.648965 | 0.581965 | 0.715965 | ... | 0.761905 | 0.782066 | 0.422236 | 0.362909 | 0.414026 | 0.6599 | 0.015365 | 1.000000 | 0.000000 | 0.547988 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6703 | 6704 | Indonesia | 850 | 2019 | 8502019 | Asia/Pacific | South-East Asia | 0.710154 | 0.645492 | 0.774817 | ... | 0.826655 | 0.558234 | 0.578835 | 0.496998 | 0.424238 | 0.8244 | 0.000000 | 1.000000 | 0.967839 | 0.751840 |
| 6721 | 6722 | Timor-Leste | 860 | 2019 | 8602019 | Asia/Pacific | South-East Asia | 0.773218 | 0.708266 | 0.838170 | ... | 0.613095 | 0.472948 | 0.549472 | 0.669947 | 0.451719 | 0.9153 | 0.023047 | 1.000000 | 0.991960 | 0.857836 |
| 6766 | 6767 | Australia | 900 | 2019 | 9002019 | Asia/Pacific | Oceania | 0.830800 | 0.762927 | 0.898672 | ... | 0.767567 | 0.501433 | | | | 0.8079 | 0.213828 | 1.000000 | 0.984925 | 0.868953 |
| 6811 | 6812 | Papua New Guinea | 910 | 2019 | 9102019 | Asia/Pacific | Oceania | 0.524209 | 0.461888 | 0.586529 | ... | 0.492305 | 0.568178 | | | | 1.0000 | 0.016645 | 1.000000 | 0.706533 | 0.582120 |
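
As a minimal sketch of the filtering step, the snippet below applies the same `isin` mask to a tiny made-up frame. The frame `toy`, the list `toy_drop`, and the numeric values are illustrative assumptions, not rows from the real dataset.

```python
import pandas as pd

# Toy stand-in for the real data: two countries and two aggregate rows.
# The C_A1 values are placeholders chosen for illustration only.
toy = pd.DataFrame({
    "ID_country_name": ["Canada", "World", "Cuba", "Europe"],
    "C_A1": [0.1, 0.2, 0.3, 0.4],
})
toy_drop = ["World", "Europe"]

# isin() yields a boolean mask that is True for aggregate rows;
# ~ inverts it so that only the country rows are selected.
is_aggregate = toy["ID_country_name"].isin(toy_drop)
toy_countries = toy[~is_aggregate]

print(toy_countries["ID_country_name"].tolist())
```

Note that the filtered frame keeps the original index labels of the surviving rows, which is why the country rows shown above have non-consecutive index values.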


Next, we split the data into test, query, and training datasets. We want a 60-20-20 train-query-test split, plus an 80-20 train-test split for any models that do not require a query set. To do this, we first create `test_2019` as our 20% test set, along with the complementary `train80_2019`, which holds the remaining 80% of the data. Dropping the `random_state` seed from the `sample` call would produce a different split on each run, but we keep the seed here for reproducibility.

In [28]:
test_2019 = countries_2019.sample(frac=0.2, random_state=0)
train80_2019 = countries_2019.drop(test_2019.index)
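
The sample-then-drop pattern can be checked on a small made-up frame. This is only a sketch: `toy` and its ten rows are assumptions for illustration, and the check relies on the index labels being unique, as they are with the default index.

```python
import pandas as pd

# Ten illustrative rows with the default, unique integer index.
toy = pd.DataFrame({"value": range(10)})

# Same pattern as above: sample 20% for testing, keep the rest for training.
toy_test = toy.sample(frac=0.2, random_state=0)
toy_train80 = toy.drop(toy_test.index)

# The two pieces should share no index labels and together cover every row.
overlap = toy_test.index.intersection(toy_train80.index)
print(len(toy_test), len(toy_train80), len(overlap))
```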

To complete the 60-20-20 split, we split `train80_2019` into `train_2019` and `query_2019`. Sampling 75% of the 80% training pool gives 0.75 × 0.8 = 60% of the complete 2019 countries dataset for `train_2019`, leaving the remaining 20% for `query_2019`.

In [29]:
train_2019 = train80_2019.sample(frac=0.75, random_state=0)
query_2019 = train80_2019.drop(train_2019.index)
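
To confirm the arithmetic, the sketch below runs both sampling steps on a made-up frame of 100 rows and reports each piece's share of the whole. The frame `toy` is an assumption for illustration. On the real data the row count is unlikely to divide evenly, so the shares will only be approximately 60-20-20.

```python
import pandas as pd

# 100 illustrative rows, so the shares come out as round numbers.
toy = pd.DataFrame({"value": range(100)})

# First cut: 20% test, 80% training pool.
toy_test = toy.sample(frac=0.2, random_state=0)
toy_train80 = toy.drop(toy_test.index)

# Second cut: 75% of the pool for training, the rest for the query set.
toy_train = toy_train80.sample(frac=0.75, random_state=0)
toy_query = toy_train80.drop(toy_train.index)

# Share of the full frame held by each piece.
shares = {
    name: len(part) / len(toy)
    for name, part in [("train", toy_train), ("query", toy_query), ("test", toy_test)]
}
print(shares)
```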

Finally, we save these datasets in CSV format for future use.

In [30]:
train_2019.to_csv('data/train-2019.csv')
query_2019.to_csv('data/query-2019.csv')
train80_2019.to_csv('data/train80-2019.csv')
test_2019.to_csv('data/test-2019.csv')
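
By default, `to_csv` writes the DataFrame index as an unnamed first column. The sketch below shows one way a later notebook might read such a file back so that the original row labels are restored. It uses an in-memory buffer and a made-up two-row frame in place of the real files, both of which are assumptions for illustration.

```python
import io

import pandas as pd

# Two illustrative rows with non-consecutive index labels.
toy = pd.DataFrame({"ID_country_name": ["Canada", "Cuba"]}, index=[89, 134])

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer)  # the index is written as an unnamed first column

# A plain read_csv treats that first column as ordinary data.
buffer.seek(0)
naive = pd.read_csv(buffer)

# Passing index_col=0 restores the first column as the index.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)

print(list(naive.columns))
print(list(restored.index))
```

If the row labels are not needed downstream, passing `index=False` to `to_csv` is an alternative that omits the column altogether.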

## Countries Over Time

We now repeat the process above without reducing the dataset to a single year. We drop the rows that describe regions and create our 80-20 and 60-20-20 splits as before.

In [36]:
# Keep only country rows; `~` inverts the boolean mask returned by isin()
complete_countries = data[~data['ID_country_name'].isin(drop_regions)]
test_countries = complete_countries.sample(frac=0.2, random_state=0)
train80_countries = complete_countries.drop(test_countries.index)

In [39]:
train_countries = train80_countries.sample(frac=0.75, random_state=0)
query_countries = train80_countries.drop(train_countries.index)

In [45]:
complete_countries.to_csv('data/complete-countries.csv')
train_countries.to_csv('data/train-countries.csv')
query_countries.to_csv('data/query-countries.csv')
train80_countries.to_csv('data/train80-countries.csv')
test_countries.to_csv('data/test-countries.csv')
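
Because the same steps are applied to both the 2019 data and the full dataset, they could be wrapped in a small helper. The function `split_60_20_20` below is a hypothetical name and not part of this notebook. It is a sketch that assumes the frame has unique index labels, since `drop` removes rows by label.

```python
import pandas as pd


def split_60_20_20(frame, seed=0):
    """Hypothetical helper: return (train, query, test, train80) pieces.

    Assumes frame has unique index labels, because drop() works by label.
    """
    # 20% test set and the complementary 80% training pool.
    test = frame.sample(frac=0.2, random_state=seed)
    train80 = frame.drop(test.index)
    # 75% of the pool (60% overall) for training, the rest for the query set.
    train = train80.sample(frac=0.75, random_state=seed)
    query = train80.drop(train.index)
    return train, query, test, train80


# Usage on a made-up frame of 50 rows.
toy = pd.DataFrame({"value": range(50)})
train, query, test, train80 = split_60_20_20(toy)
print(len(train), len(query), len(test), len(train80))
```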