In [2]:
import pandas as pd
import numpy as np

This notebook loads and cleans the data in the `gdp_yearly.csv` file. The data is downloaded by the `driver.ipynb` notebook and placed in the folder `data/uncleaned`.

In [3]:
gdp = pd.read_csv('../data/uncleaned/gdp_yearly.csv')
print(gdp.shape)
gdp.head()

(5222, 8)


Unnamed: 0,LOCATION,INDICATOR,SUBJECT,MEASURE,FREQUENCY,TIME,Value,Flag Codes
0,AUS,GDP,TOT,MLN_USD,A,1960,25071.833,
1,AUS,GDP,TOT,MLN_USD,A,1961,25363.455,
2,AUS,GDP,TOT,MLN_USD,A,1962,27953.904,
3,AUS,GDP,TOT,MLN_USD,A,1963,30431.547,
4,AUS,GDP,TOT,MLN_USD,A,1964,32742.466,


We first find the unique values present in the `MEASURE` column:

In [4]:
gdp.MEASURE.unique()

array(['MLN_USD', 'USD_CAP'], dtype=object)

Looking at the unique values in the `FREQUENCY` column:

In [5]:
gdp.FREQUENCY.unique()

array(['A'], dtype=object)

And finally, looking at the unique values in the `SUBJECT` column:

In [6]:
gdp.SUBJECT.unique()

array(['TOT'], dtype=object)
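
The three checks above can also be run in a single pass. The following is a minimal sketch on a small made-up frame (`sample` and its values are illustrative assumptions, not the real GDP data) that collects the distinct values of each column into a dictionary:

```python
import pandas as pd

# Made-up frame that only mimics the column names of the GDP file
sample = pd.DataFrame({
    'LOCATION': ['AUS', 'AUS', 'AUT', 'AUT'],
    'SUBJECT': ['TOT'] * 4,
    'MEASURE': ['MLN_USD', 'USD_CAP', 'MLN_USD', 'USD_CAP'],
    'FREQUENCY': ['A'] * 4,
})

# Map each column of interest to the list of distinct values it contains
columns_to_check = ['MEASURE', 'FREQUENCY', 'SUBJECT']
unique_values = {col: sample[col].unique().tolist() for col in columns_to_check}

for col, values in unique_values.items():
    print(col, values)
```

On the real dataframe, `sample` would be replaced by `gdp`.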

The column `Flag Codes` is not useful for us, so we drop it from the dataframe.

In [7]:
gdp = gdp.drop(['Flag Codes'], axis=1)

We are only interested in the MEASURE `MLN_USD`, so we keep only the rows with that value.

In [8]:
gdp = gdp[(gdp['MEASURE']=='MLN_USD')]

Checking the head of our dataframe now:

In [9]:
gdp.head()

Unnamed: 0,LOCATION,INDICATOR,SUBJECT,MEASURE,FREQUENCY,TIME,Value
0,AUS,GDP,TOT,MLN_USD,A,1960,25071.833
1,AUS,GDP,TOT,MLN_USD,A,1961,25363.455
2,AUS,GDP,TOT,MLN_USD,A,1962,27953.904
3,AUS,GDP,TOT,MLN_USD,A,1963,30431.547
4,AUS,GDP,TOT,MLN_USD,A,1964,32742.466


We can check the number of unique values for each column:

In [10]:
gdp.nunique()

LOCATION       67
INDICATOR       1
SUBJECT         1
MEASURE         1
FREQUENCY       1
TIME           63
Value        2675
dtype: int64

We can now drop the columns `INDICATOR`, `SUBJECT`, `FREQUENCY` and `MEASURE`, since each holds only one value and adds no information.

In [11]:
gdp = gdp.drop(['INDICATOR','SUBJECT','FREQUENCY','MEASURE'], axis=1)
gdp.reset_index(drop=True, inplace=True)
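
Instead of listing the single-valued columns by hand, they could be derived from `nunique()`. This is a sketch under the assumption that every column with exactly one distinct value is safe to drop; the frame `sample` and the names `constant_cols` and `trimmed` are hypothetical:

```python
import pandas as pd

# Made-up frame with the same layout as the filtered GDP data
sample = pd.DataFrame({
    'LOCATION': ['AUS', 'AUS', 'AUT'],
    'INDICATOR': ['GDP'] * 3,
    'SUBJECT': ['TOT'] * 3,
    'MEASURE': ['MLN_USD'] * 3,
    'FREQUENCY': ['A'] * 3,
    'TIME': [1960, 1961, 1970],
    'Value': [1.0, 2.0, 3.0],
})

# Columns with exactly one distinct value carry no information
counts = sample.nunique()
constant_cols = counts[counts == 1].index.tolist()

# Drop them and renumber the rows from 0
trimmed = sample.drop(columns=constant_cols).reset_index(drop=True)

print(constant_cols)
print(trimmed.columns.tolist())
```

Note that `nunique()` ignores missing values by default, so a column that is entirely null would report 0 rather than 1 and would not be caught by this check.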

Checking our datatypes for the columns:

In [12]:
gdp.dtypes

LOCATION     object
TIME          int64
Value       float64
dtype: object

We want `LOCATION` and `TIME` to be strings.

In [13]:
gdp['LOCATION']= gdp['LOCATION'].astype('string')
gdp['TIME']= gdp['TIME'].astype('string')

Checking our datatypes again:

In [14]:
gdp.dtypes

LOCATION    string[python]
TIME        string[python]
Value              float64
dtype: object
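
The two conversions can also be written as one `astype` call that takes a column-to-dtype mapping. A minimal sketch on made-up data (`sample` and `converted` are illustrative names):

```python
import pandas as pd

# Made-up frame with the three remaining columns
sample = pd.DataFrame({
    'LOCATION': ['AUS', 'AUT'],
    'TIME': [1960, 1970],
    'Value': [1.0, 2.0],
})

# Convert several columns at once; columns not listed keep their dtype
converted = sample.astype({'LOCATION': 'string', 'TIME': 'string'})

print(converted.dtypes)
```

Because `TIME` now holds strings, comparisons on it are lexicographic. That gives the same ordering as numeric comparison only while every year has the same number of digits, as is the case here.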

We also check for any null values in our dataframe:

In [15]:
gdp.isna().sum()

LOCATION    0
TIME        0
Value       0
dtype: int64

Finally, our shape looks like this:

In [16]:
gdp.shape

(2675, 3)

We now need to check if we are missing any data. For all unique values of `LOCATION` in our dataframe, we perform an analysis to find the first year that their data is available from, the last year that the data is available for, and any missing years between this range.

In [17]:
highestmin = 1
highestmincountry = ''
lowestmax = 9999
lowestmaxcountry = ''

for country in gdp.LOCATION.unique():
    tempdf = gdp[gdp['LOCATION']==country]
    minyear = int(tempdf.TIME.min())
    if(highestmin<minyear):
        highestmin=minyear
        highestmincountry = country
        
    maxyear = int(tempdf.TIME.max())
    if(lowestmax>maxyear):
        lowestmax=maxyear
        lowestmaxcountry=country
    # Collect every year between the first and last available year that has no row
    missingyear = []
    for i in range(minyear,maxyear):
        if not (tempdf['TIME']==str(i)).any():
            missingyear.append(str(i))
    
    print('Country = ', country, 'Minimum Year = ',minyear, 'Maximum Year = ', maxyear, 'Missing Years = ', missingyear)
    
print('Highest Min = ', highestmin, ' Country = ', highestmincountry)
print('Lowest Max = ', lowestmax, ' Country = ', lowestmaxcountry)

Country =  AUS Minimum Year =  1960 Maximum Year =  2021 Missing Years =  []
Country =  AUT Minimum Year =  1970 Maximum Year =  2022 Missing Years =  []
Country =  BEL Minimum Year =  1970 Maximum Year =  2022 Missing Years =  []
Country =  CAN Minimum Year =  1961 Maximum Year =  2022 Missing Years =  []
Country =  CZE Minimum Year =  1990 Maximum Year =  2022 Missing Years =  []
Country =  DNK Minimum Year =  1966 Maximum Year =  2022 Missing Years =  []
Country =  FIN Minimum Year =  1970 Maximum Year =  2022 Missing Years =  []
Country =  FRA Minimum Year =  1960 Maximum Year =  2022 Missing Years =  []
Country =  DEU Minimum Year =  1970 Maximum Year =  2022 Missing Years =  []
Country =  GRC Minimum Year =  1960 Maximum Year =  2022 Missing Years =  []
Country =  HUN Minimum Year =  1991 Maximum Year =  2022 Missing Years =  []
Country =  ISL Minimum Year =  1970 Maximum Year =  2021 Missing Years =  []
Country =  IRL Minimum Year =  1970 Maximum Year =  2022 Missing Years =  []
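
The same coverage check can be expressed with `groupby`, returning a small summary table instead of printed lines. This is a sketch, not the notebook's own method; the function `year_coverage`, its output column names, and the sample data are all assumptions made for illustration:

```python
import pandas as pd

# Made-up sample: 'BBB' deliberately skips 1962 and 1963
sample = pd.DataFrame({
    'LOCATION': ['AAA', 'AAA', 'AAA', 'BBB', 'BBB', 'BBB'],
    'TIME': ['1960', '1961', '1962', '1961', '1964', '1965'],
})

def year_coverage(df):
    """Return one row per LOCATION with first year, last year and missing years."""
    rows = []
    for country, years in df.groupby('LOCATION')['TIME']:
        # TIME holds strings, so convert to plain ints before comparing
        present = {int(y) for y in years}
        first, last = min(present), max(present)
        missing = [y for y in range(first, last + 1) if y not in present]
        rows.append({'LOCATION': country, 'min_year': first,
                     'max_year': last, 'missing': missing})
    return pd.DataFrame(rows)

coverage = year_coverage(sample)
print(coverage)

# Location with the latest first year, and location with the earliest last year
latest_start = coverage.loc[coverage['min_year'].idxmax()]
earliest_end = coverage.loc[coverage['max_year'].idxmin()]
print('Highest Min = ', latest_start['min_year'], ' Country = ', latest_start['LOCATION'])
print('Lowest Max = ', earliest_end['max_year'], ' Country = ', earliest_end['LOCATION'])
```

Keeping the result as a dataframe makes it easy to filter for locations whose `missing` list is non-empty.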

Finally, we export our cleaned dataframe to a CSV file and place it in `data/cleaned`.

In [18]:
from pathlib import Path 
filepath = Path('../data/cleaned/gdp_yearly_cleaned.csv')
gdp.to_csv(filepath, mode='wb', index=False)  
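
Two details are worth keeping in mind when exporting: `to_csv` does not create missing folders, and a CSV file does not store dtypes, so `TIME` would be read back as an integer unless the dtype is given again. The sketch below illustrates both points; it writes to a temporary directory and uses a made-up frame and file name (`sample`, `sample_cleaned.csv`) rather than the project's real paths:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up frame with the same columns and dtypes as the cleaned data
sample = pd.DataFrame({
    'LOCATION': ['AUS', 'AUT'],
    'TIME': ['1960', '1970'],
    'Value': [1.0, 2.0],
}).astype({'LOCATION': 'string', 'TIME': 'string'})

# A temporary directory stands in for the project's data folder
with tempfile.TemporaryDirectory() as tmp:
    filepath = Path(tmp) / 'cleaned' / 'sample_cleaned.csv'
    # Create the target folder if it does not exist yet
    filepath.parent.mkdir(parents=True, exist_ok=True)
    sample.to_csv(filepath, index=False)
    # Restate the string dtypes when loading, since CSV does not preserve them
    reloaded = pd.read_csv(filepath, dtype={'LOCATION': 'string', 'TIME': 'string'})

print(reloaded.dtypes)
```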
