# Pre-Processing

In this notebook, we load the complete dataset ([2020 Global State of Democracy Index](https://www.idea.int/gsod-indices/dataset-resources), which covers 1975-2019) and split it into train, test, and query sets for machine learning. We do this twice: once for the most recent year alone and once for the dataset as a whole.

In [1]:
import pandas as pd

In [2]:
data = pd.read_csv('data/gsodi_pv_4.csv', dtype={
    'ID': int, 'ID_country_name': 'category', 'ID_country_code': int, 'ID_year': int, 
    'ID_country_year': int, 'ID_region': 'category', 'ID_subregion': 'category',
})

In [3]:
# data.head()
# data.describe()
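Before splitting, it can help to confirm that the `dtype` mapping passed to `read_csv` took effect. The following is a minimal sketch on a made-up, in-memory CSV; only the column names are taken from the real file, and the rows and the `toy` names are assumptions for illustration.

```python
import io
import pandas as pd

# Toy stand-in for the real file; these rows are invented for illustration.
toy_csv = io.StringIO(
    "ID_country_name,ID_year,ID_region\n"
    "Alpha,2018,Europe\n"
    "Alpha,2019,Europe\n"
    "Beta,2019,Africa\n"
)
toy = pd.read_csv(toy_csv, dtype={
    'ID_country_name': 'category', 'ID_year': int, 'ID_region': 'category',
})

# Inspect the resulting column types and the range of years covered.
print(toy.dtypes)
print(toy['ID_year'].min(), toy['ID_year'].max())
```

The same two `print` calls could be run against `data` to check the year range stated in the introduction.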

## Current Year

Here we split the most recent year of data (2019) into separate files and training sets. We first load the complete set of 2019 data into `data_2019` and save it to `data/complete-2019.csv`.

In [4]:
data_2019 = data[data['ID_year'] == 2019]
data_2019.to_csv('data/complete-2019.csv')

Although we typically want to analyze the data at the country level, several rows in our dataset aggregate the information to the World, region, sub-region, or organization level. Here we drop all of these aggregate rows and save a version of the data that includes only the countries.

In [5]:
# Aggregate rows (organizations, sub-regions, regions, and the World) to exclude
drop_regions = ['African Union', 'ASEAN', 'European Union', 'OECD', 'OAS', 'East Africa', 'Central Africa',
                'Southern Africa', 'West Africa', 'North Africa', 'Caribbean', 'Central America', 'South America',
                'North America', 'Central Asia', 'East Asia', 'South Asia', 'South-East Asia', 'Oceania',
                'Middle East', 'East-Central Europe', 'Eastern Europe', 'North/Western Europe', 'Southern Europe',
                'Africa', 'Latin America/Caribbean', 'Asia/Pacific', 'Europe', 'World']
# Use ~ (boolean NOT) rather than unary minus to invert the mask
countries_2019 = data_2019[~data_2019['ID_country_name'].isin(drop_regions)]
countries_2019.to_csv('data/countries-2019.csv')
# countries_2019
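Because the filter relies on exact string matches, a misspelled entry in the drop list would silently leave an aggregate row in place. A minimal sketch of the filter plus a sanity check, on made-up rows (`toy`, `toy_drop`, and the country names are assumptions for illustration):

```python
import pandas as pd

# Invented rows: two countries and two aggregate entries.
toy = pd.DataFrame({
    'ID_country_name': pd.Categorical(['Alpha', 'Europe', 'Beta', 'World']),
    'ID_year': [2019, 2019, 2019, 2019],
})
# 'Oceania' is deliberately absent from the toy data.
toy_drop = ['Europe', 'World', 'Oceania']

# Boolean mask of aggregate rows, inverted with ~ to keep only countries.
is_aggregate = toy['ID_country_name'].isin(toy_drop)
toy_countries = toy[~is_aggregate]

# Names in the drop list that never matched a row (possible typos).
unmatched = sorted(set(toy_drop) - set(toy['ID_country_name']))
print(list(toy_countries['ID_country_name']))
print(unmatched)
```

An empty `unmatched` list for the real `drop_regions` would suggest that every entry matched at least one row.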

Next, we split the data into test, query, and training datasets. We want a 60-20-20 train-query-test split, plus an 80-20 train-test split for any models that do not require a query set. To do this, we first create `test_2019`, our 20% test set, and the complementary `train80_2019`, which holds the other 80% of the data for training in an 80-20 split. The `random_state` seed can be dropped from the `sample` call so that each run produces a different split, but we keep it here for reproducibility.

In [6]:
test_2019 = countries_2019.sample(frac=0.2, random_state=0)
train80_2019 = countries_2019.drop(test_2019.index)

To complete the 60-20-20 split, we divide `train80_2019` into `train_2019` and `query_2019`, which are 60% and 20% of the complete 2019 countries dataset, respectively. Sampling 75% of the remaining 80% yields the 60% training set (0.75 × 0.8 = 0.6).

In [7]:
train_2019 = train80_2019.sample(frac=0.75, random_state=0)
query_2019 = train80_2019.drop(train_2019.index)
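The proportions and the absence of overlap can be checked directly. This is a minimal sketch on a made-up 100-row frame (the `toy_*` names are assumptions for illustration). Note that `drop` removes rows by index label, so the approach assumes the index is unique.

```python
import pandas as pd

# Invented 100-row frame with a default, unique RangeIndex.
toy = pd.DataFrame({'value': range(100)})

# Same two-step split as in the notebook.
toy_test = toy.sample(frac=0.2, random_state=0)
toy_train80 = toy.drop(toy_test.index)
toy_train = toy_train80.sample(frac=0.75, random_state=0)
toy_query = toy_train80.drop(toy_train.index)

# Share of the full dataset that lands in each subset.
for name, part in [('train', toy_train), ('query', toy_query), ('test', toy_test)]:
    print(name, len(part) / len(toy))

# The three subsets should cover every row exactly once.
combined = pd.concat([toy_train, toy_query, toy_test])
print(combined.index.is_unique, len(combined) == len(toy))
```

When the row count is not evenly divisible, `sample` rounds the subset size, so the shares are only approximately 60%, 20%, and 20%.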

Finally, we save these datasets in CSV format for future use.

In [8]:
train_2019.to_csv('data/train-2019.csv')
query_2019.to_csv('data/query-2019.csv')
train80_2019.to_csv('data/train80-2019.csv')
test_2019.to_csv('data/test-2019.csv')
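By default, `to_csv` also writes the DataFrame index as an unnamed first column, so the saved files carry the original row labels. A minimal sketch of reading such a file back, using an in-memory buffer in place of a file on disk (the `toy` frame and its values are assumptions for illustration):

```python
import io
import pandas as pd

# Invented frame whose index mimics row labels left over after sampling.
toy = pd.DataFrame({'ID_year': [2019, 2019, 2019]}, index=[4, 7, 9])

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer)

# Reading without index_col turns the saved index into an ordinary column.
buffer.seek(0)
plain = pd.read_csv(buffer)

# Passing index_col=0 restores the saved row labels as the index.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)

print(list(plain.columns))
print(list(restored.index))
```

Whether to pass `index_col=0` when loading depends on whether later notebooks need the original row labels.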

## Countries Over Time

We now repeat the process above without restricting the dataset to a single year. We drop the rows that describe regions from our dataset, and create our 80-20 and 60-20-20 splits as before.

In [9]:
complete_countries = data[~data['ID_country_name'].isin(drop_regions)]
test_countries = complete_countries.sample(frac=0.2, random_state=0)
train80_countries = complete_countries.drop(test_countries.index)

In [10]:
train_countries = train80_countries.sample(frac=0.75, random_state=0)
query_countries = train80_countries.drop(train_countries.index)

In [11]:
complete_countries.to_csv('data/complete-countries.csv')
train_countries.to_csv('data/train-countries.csv')
query_countries.to_csv('data/query-countries.csv')
train80_countries.to_csv('data/train80-countries.csv')
test_countries.to_csv('data/test-countries.csv')
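Since the same two-step split is applied to both the 2019 data and the full dataset, it could be wrapped in a small function. The sketch below is one possible way to do that; `split_60_20_20` is a hypothetical helper, not part of the notebook, and the `toy` frame is made up for illustration.

```python
import pandas as pd

def split_60_20_20(df, seed=0):
    """Hypothetical helper: return (train, query, train80, test) subsets of df.

    Assumes df has a unique index, because rows are removed by index label.
    """
    test = df.sample(frac=0.2, random_state=seed)
    train80 = df.drop(test.index)
    train = train80.sample(frac=0.75, random_state=seed)
    query = train80.drop(train.index)
    return train, query, train80, test

# Invented multi-year frame standing in for the countries dataset.
toy = pd.DataFrame({'ID_year': [1975 + i % 45 for i in range(200)]})
train, query, train80, test = split_60_20_20(toy)
print(len(train), len(query), len(train80), len(test))
```

Calling the helper with the same `seed` on the same input reproduces the same subsets, which matches the fixed `random_state=0` used throughout.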