# 01 | Transform the Data
# Introduction
In this notebook, we walk through the data transformation steps required before we can proceed with our analysis. The same process will be reproduced as `.py` files later on.

# Goals:
The objectives targeted in this notebook are checked off below:

- [ ] Import the raw data
  - [ ] SIPRI dataset and capitals coordinates
- [ ] Store the raw data
- [x] Prepare the data
  - [x] Clean each individual table
  - [x] Store the transformed dataset
- [ ] Combine both datasets
  - [ ] Check if the merging column values match
- [ ] Store the final output

# Set up our working environment

In [1]:
# Import required libraries
import pandas as pd
import os

In [2]:
# Create directory folders to store our data
dirname = os.getcwd()

raw_data = f"{dirname}/data/raw/"
transformed_data = f"{dirname}/data/transformed/"
refined_data = f"{dirname}/data/refined/"

paths = [raw_data, transformed_data, refined_data]

for path in paths:
    if not os.path.exists(path):
        os.makedirs(path)
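
The existence check and the creation can also be folded into a single call. The following is a minimal sketch, assuming a Python version where `os.makedirs` accepts `exist_ok`; `demo_root` is a hypothetical temporary folder standing in for `dirname`, so the example does not touch the project directories.

```python
import os
import tempfile

# Hypothetical stand-in for `dirname`, so the sketch leaves the project untouched
demo_root = tempfile.mkdtemp()

demo_paths = [
    os.path.join(demo_root, "data", stage)
    for stage in ("raw", "transformed", "refined")
]

for path in demo_paths:
    # exist_ok=True turns the call into a no-op when the folder already exists
    os.makedirs(path, exist_ok=True)

# Running the loop a second time raises no error
for path in demo_paths:
    os.makedirs(path, exist_ok=True)
```

Paths built with `os.path.join` carry no trailing slash, so file names would also be appended with `os.path.join` instead of string concatenation.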

# Data Transformation | SIPRI dataset
Let's start by preparing our main dataset for data analysis.

## Import raw data
We'll first import the raw data that we stored during the previous step. Because three individual tables must be combined into one, we proceed in steps:
- Import a single table.
- Identify the transformation steps it needs.
- Apply the transformations.
- Loop through the 3 tables and apply the same steps (if possible).

### Import data
We can start by using table 5 (Constant US$ (2022)). 

In [4]:
sipri_raw = pd.read_csv(f"{raw_data}sipri_data_raw_tb_5.csv")
sipri_raw.head()

Unnamed: 0,"Military expenditure by country, in millions of US$ at current prices and exchange rates, 1948-2023 © SIPRI 2023",Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 68,Unnamed: 69,Unnamed: 70,Unnamed: 71,Unnamed: 72,Unnamed: 73,Unnamed: 74,Unnamed: 75,Unnamed: 76,Unnamed: 77
0,"Figures are in US $m., in current prices, conv...",,,,,,,,,,...,,,,,,,,,,
1,Figures in blue are SIPRI estimates. Figures i...,,,,,,,,,,...,,,,,,,,,,
2,""". ."" = data unavailable. ""xxx"" = country did ...",,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,Country,Notes,1948.0,1949.0,1950.0,1951.0,1952.0,1953.0,1954.0,1955.0,...,2014.0,2015.0,2016.0,2017.0,2018.0,2019.0,2020.0,2021.0,2022.0,2023.0


**Challenge** – the first rows of the table hold non-tabular information that we must remove. Because the number of extra rows may differ between `.csv` files, we need to identify the header row dynamically. We need to:
- Read the file.
- Iterate through its lines using `enumerate`.
- At the first line that contains `Country`, store the line index `i` as `header_row`.
  - This value is the number of rows to skip when reading the `.csv` file.

In [5]:
# Read the file line by line to find the header row
with open(f"{raw_data}sipri_data_raw_tb_5.csv", "r", encoding="utf-8") as file:
    for i, line in enumerate(file):
        if "Country" in line:
            header_row = i  # index of the first line that contains `Country`
            break
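
Since the same search has to run on each of the three tables, it may help to wrap it in a small function. The sketch below is one possible shape, not the notebook's final implementation: `find_header_row` and the in-memory `sample_file` are hypothetical names introduced for illustration.

```python
import io


def find_header_row(lines, marker="Country"):
    """Return the index of the first line containing `marker`, or None."""
    for i, line in enumerate(lines):
        if marker in line:
            return i
    return None


# Hypothetical in-memory sample that mimics the layout of the raw file
sample_file = io.StringIO(
    "Military expenditure by country\n"
    "Figures are in US $m.\n"
    "\n"
    "Country,Notes,1948,1949\n"
    "Algeria,§,...,...\n"
)

sample_header_row = find_header_row(sample_file)
if sample_header_row is None:
    raise ValueError("No line containing 'Country' was found")

print(sample_header_row)  # → 3
```

Returning `None` makes the no-match case explicit instead of leaving the variable undefined. The match is case-sensitive, so the lowercase `country` in the title line is not mistaken for the header.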

In [8]:
sipri_raw_tb5 = pd.read_csv(f"{raw_data}sipri_data_raw_tb_5.csv", skiprows=header_row, header=0)
sipri_raw_tb5.head()

Unnamed: 0,Country,Notes,1948,1949,1950,1951,1952,1953,1954,1955,...,2014,2015,2016,2017,2018,2019,2020,2021,2022,2023
0,,,,,,,,,,,...,,,,,,,,,,
1,Africa,,,,,,,,,,...,,,,,,,,,,
2,North Africa,,,,,,,,,,...,,,,,,,,,,
3,Algeria,§,...,...,...,...,...,...,...,...,...,9724.379971923256,10412.714002896393,10217.081699569308,10073.364021301344,9583.7242883703,10303.60057521065,9708.277440227255,9112.461105348943,9145.810174207281,18263.96796826213
4,Libya,‡§¶,...,...,...,...,...,...,...,...,...,3755.652496350929,...,...,...,...,...,...,...,...,...


In [9]:
sipri_raw_tb5.tail()

Unnamed: 0,Country,Notes,1948,1949,1950,1951,1952,1953,1954,1955,...,2014,2015,2016,2017,2018,2019,2020,2021,2022,2023
188,Syria,§,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
189,Türkiye,‖,...,197.68186020052622,212.97020550380432,231.81397994737964,257.7686126715495,294.0339899025812,332.0770817037616,382.9197184100121,...,17576.538470505897,15668.749999999998,17827.702150796664,17822.738263164494,19648.69382385138,20436.917121238785,17478.41368526983,15567.410029425082,10779.896284618242,15827.853255045886
190,United Arab Emirates,§,...,...,...,...,...,...,...,...,...,22755.071477195375,...,...,...,...,...,...,...,...,...
191,"Yemen, North",§,...,...,...,...,...,...,...,...,...,xxx,xxx,xxx,xxx,xxx,xxx,xxx,xxx,xxx,xxx
192,Yemen,§,...,...,...,...,...,...,...,...,...,1714.8308436874681,...,...,...,...,...,...,...,...,...


Great! Now, we can take a brief look at the structure of our dataset.

In [10]:
sipri_raw_tb5.shape

(193, 78)

In [11]:
sipri_raw_tb5.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 193 entries, 0 to 192
Data columns (total 78 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Country  192 non-null    object
 1   Notes    79 non-null     object
 2   1948     174 non-null    object
 3   1949     174 non-null    object
 4   1950     174 non-null    object
 5   1951     174 non-null    object
 6   1952     174 non-null    object
 7   1953     174 non-null    object
 8   1954     174 non-null    object
 9   1955     174 non-null    object
 10  1956     174 non-null    object
 11  1957     174 non-null    object
 12  1958     174 non-null    object
 13  1959     174 non-null    object
 14  1960     174 non-null    object
 15  1961     174 non-null    object
 16  1962     174 non-null    object
 17  1963     174 non-null    object
 18  1964     174 non-null    object
 19  1965     174 non-null    object
 20  1966     174 non-null    object
 21  1967     174 non-null    object
 22  19

In [12]:
sipri_raw_tb5["Country"].unique()

array([nan, 'Africa', 'North Africa', 'Algeria', 'Libya', 'Morocco',
       'Tunisia', 'sub-Saharan Africa', 'Angola', 'Benin', 'Botswana',
       'Burkina Faso', 'Burundi', 'Cameroon', 'Cape Verde',
       'Central African Republic', 'Chad', 'Congo, DR', 'Congo, Republic',
       "Cote d'Ivoire", 'Djibouti', 'Equatorial Guinea', 'Eritrea',
       'Ethiopia', 'Gabon', 'Gambia, The', 'Ghana', 'Guinea',
       'Guinea-Bissau', 'Kenya', 'Lesotho', 'Liberia', 'Madagascar',
       'Malawi', 'Mali', 'Mauritania', 'Mauritius', 'Mozambique',
       'Namibia', 'Niger', 'Nigeria', 'Rwanda', 'Senegal', 'Seychelles',
       'Sierra Leone', 'Somalia', 'South Africa', 'South Sudan', 'Sudan',
       'Eswatini', 'Tanzania', 'Togo', 'Uganda', 'Zambia', 'Zimbabwe',
       'Americas', 'Central America and the Caribbean', 'Belize',
       'Costa Rica', 'Cuba', 'Dominican Republic', 'El Salvador',
       'Guatemala', 'Haiti', 'Honduras', 'Jamaica', 'Mexico', 'Nicaragua',
       'Panama', 'Trinidad and Toba

After a quick glance, we've identified a few cleaning tasks:
- [ ] Remove unwanted columns
- [ ] Pivot year columns
- [ ] Remove regions
- [ ] Filter data
- [ ] Rename country names
- [ ] Rename column names


We need to perform these tasks for each table. Then, we need to:
- [ ] Join the three tables
- [ ] Define each data type for each column
  
### Remove unwanted columns
Remove the `Notes` column, which only holds footnote symbols.

In [14]:
# `Notes` is the second column (position 1)
columns_to_drop = sipri_raw_tb5.columns[1]
sipri_raw_tb5 = sipri_raw_tb5.drop(columns=columns_to_drop)
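
Selecting the column by position works as long as `Notes` stays in second place in every table. A hedged alternative is to drop it by name. The toy frame below is a hypothetical miniature of the raw table with made-up values.

```python
import pandas as pd

# Hypothetical miniature of the raw table (values are made up)
toy = pd.DataFrame(
    {
        "Country": ["Algeria", "Libya"],
        "Notes": ["§", "‡§¶"],
        "2022": ["100.0", "..."],
    }
)

# errors="ignore" skips the drop quietly if a table has no `Notes` column
toy_no_notes = toy.drop(columns=["Notes"], errors="ignore")

print(list(toy_no_notes.columns))  # → ['Country', '2022']
```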

In [15]:
sipri_raw_tb5.head()

Unnamed: 0,Country,1948,1949,1950,1951,1952,1953,1954,1955,1956,...,2014,2015,2016,2017,2018,2019,2020,2021,2022,2023
0,,,,,,,,,,,...,,,,,,,,,,
1,Africa,,,,,,,,,,...,,,,,,,,,,
2,North Africa,,,,,,,,,,...,,,,,,,,,,
3,Algeria,...,...,...,...,...,...,...,...,...,...,9724.379971923256,10412.714002896393,10217.081699569308,10073.364021301344,9583.7242883703,10303.60057521065,9708.277440227255,9112.461105348943,9145.810174207281,18263.96796826213
4,Libya,...,...,...,...,...,...,...,...,...,...,3755.652496350929,...,...,...,...,...,...,...,...,...


### Pivot year columns

In [16]:
columns_to_pivot = sipri_raw_tb5.columns[1:]
sipri_raw_tb5 = sipri_raw_tb5.melt(
    id_vars="Country",
    value_vars=columns_to_pivot,
    var_name="year",
    value_name="military_expenditures_usd_2022")

In [17]:
sipri_raw_tb5

Unnamed: 0,Country,year,military_expenditures_usd_2022
0,,1948,
1,Africa,1948,
2,North Africa,1948,
3,Algeria,1948,...
4,Libya,1948,...
...,...,...,...
14663,Syria,2023,...
14664,Türkiye,2023,15827.853255045886
14665,United Arab Emirates,2023,...
14666,"Yemen, North",2023,xxx
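
`melt` reshapes the table from wide (one column per year) to long (one row per country and year). A minimal sketch on a hypothetical two-country table makes the reshaping easier to follow; the names `wide` and `long_df` and the values are illustrative only.

```python
import pandas as pd

# Hypothetical wide table: one row per country, one column per year
wide = pd.DataFrame(
    {
        "Country": ["A", "B"],
        "2022": [1.0, 3.0],
        "2023": [2.0, 4.0],
    }
)

# Omitting value_vars melts every column that is not listed in id_vars
long_df = wide.melt(id_vars="Country", var_name="year", value_name="value")

print(long_df.shape)  # → (4, 3)
```

Each year column becomes a block of rows, so the long table has one row per country for every year column in the wide table.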


### Remove regions
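Region rows (such as `Africa` or `North Africa`) and the blank first row carry no figures, so their value is null after pivoting. Dropping the rows with a null `military_expenditures_usd_2022` therefore removes them. Countries with missing data keep their rows, because those cells hold the strings `...` and `xxx` instead of nulls.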

In [18]:
sipri_raw_tb5 = sipri_raw_tb5.dropna(subset=["military_expenditures_usd_2022"])

In [19]:
sipri_raw_tb5

Unnamed: 0,Country,year,military_expenditures_usd_2022
3,Algeria,1948,...
4,Libya,1948,...
5,Morocco,1948,...
6,Tunisia,1948,...
8,Angola,1948,...
...,...,...,...
14663,Syria,2023,...
14664,Türkiye,2023,15827.853255045886
14665,United Arab Emirates,2023,...
14666,"Yemen, North",2023,xxx
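
As the output shows, `dropna` removes only true nulls, so the placeholder strings `...` and `xxx` are still present. One possible way to handle them in the later filtering step is `pd.to_numeric` with `errors="coerce"`. This is a sketch under that assumption, not necessarily the approach this notebook ends up using, and `long_sample` is a hypothetical frame.

```python
import pandas as pd

# Hypothetical long-format sample with a region row and placeholder strings
long_sample = pd.DataFrame(
    {
        "Country": ["Africa", "Algeria", "Yemen, North", "Türkiye"],
        "year": ["2023"] * 4,
        "value": [None, "...", "xxx", "15827.85"],
    }
)

# Step 1: the region row holds a real null, so dropna removes it
countries = long_sample.dropna(subset=["value"]).copy()

# Step 2: coerce the placeholders to NaN, then drop them as well
countries["value"] = pd.to_numeric(countries["value"], errors="coerce")
numeric_only = countries.dropna(subset=["value"])

print(list(numeric_only["Country"]))  # → ['Türkiye']
```

Coercing also converts the column from `object` to `float64`, which covers part of the data type task listed earlier.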
