## Investment

This CSV file contains the investment data (the `GFCF` indicator) for each country. We load it from the `uncleaned` directory and take an initial look at it.

In [1]:
import pandas as pd
import numpy as np

In [2]:
investment = pd.read_csv('../data/uncleaned/investment_gfcf.csv')
print(investment.shape)
investment.head()

(24597, 8)


Unnamed: 0,LOCATION,INDICATOR,SUBJECT,MEASURE,FREQUENCY,TIME,Value,Flag Codes
0,AUS,GFCF,TOT,AGRWTH,A,1960,5.675337,
1,AUS,GFCF,TOT,AGRWTH,A,1961,1.774823,
2,AUS,GFCF,TOT,AGRWTH,A,1962,6.253368,
3,AUS,GFCF,TOT,AGRWTH,A,1963,9.366632,
4,AUS,GFCF,TOT,AGRWTH,A,1964,7.482981,


We need to check what types of values are present in each of the `MEASURE`, `FREQUENCY` and `SUBJECT` columns.

In [3]:
investment.MEASURE.unique()

array(['AGRWTH', 'MLN_USD', 'QGRWTH'], dtype=object)

In [4]:
investment.FREQUENCY.unique()

array(['A', 'Q'], dtype=object)

In [5]:
investment.SUBJECT.unique()

array(['TOT'], dtype=object)
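
The three checks above can also be run in one pass. A minimal sketch, assuming a small hypothetical frame `sample` that stands in for `investment` (its rows are illustrative, not taken from the dataset):

```python
import pandas as pd

# Hypothetical stand-in for the raw data; rows are illustrative only.
sample = pd.DataFrame({
    'MEASURE': ['AGRWTH', 'MLN_USD', 'QGRWTH', 'MLN_USD'],
    'FREQUENCY': ['A', 'A', 'Q', 'A'],
    'SUBJECT': ['TOT', 'TOT', 'TOT', 'TOT'],
})

# Collect the distinct values of each categorical column in one dictionary.
# unique() keeps values in order of first appearance.
categories = {
    col: sample[col].unique().tolist()
    for col in ['MEASURE', 'FREQUENCY', 'SUBJECT']
}

for col, values in categories.items():
    print(col, values)
```

Replacing `unique()` with `value_counts()` would also show how many rows each value accounts for, which can help when deciding what to filter out.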

To start with, we can drop the `Flag Codes` column, since it contains a lot of NaN values and is of no use to us.

In [6]:
investment = investment.drop(['Flag Codes'], axis=1)

For `FREQUENCY`, we only need the annual data, so we keep the rows with `FREQUENCY == 'A'`, which drops the quarterly (`'Q'`) rows.

In [7]:
investment = investment[(investment['FREQUENCY']=='A')]

For `MEASURE`, we are only interested in the values expressed in millions of USD, so we keep only the rows that have `MEASURE == 'MLN_USD'`.

In [8]:
investment = investment[(investment['MEASURE']=='MLN_USD')]
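
The two filters above can also be applied in a single step with a combined boolean mask. A minimal sketch on a small hypothetical frame (`sample`; the rows are illustrative, not taken from the dataset):

```python
import pandas as pd

# Hypothetical stand-in for the raw data; rows are illustrative only.
sample = pd.DataFrame({
    'LOCATION': ['AAA', 'AAA', 'AAA', 'BBB'],
    'MEASURE': ['AGRWTH', 'MLN_USD', 'MLN_USD', 'MLN_USD'],
    'FREQUENCY': ['A', 'A', 'Q', 'A'],
    'Value': [5.7, 7594.0, 1900.0, 3100.0],
})

# Each comparison yields a boolean Series; & combines them element-wise.
# The parentheses are required because & binds tighter than ==.
mask = (sample['FREQUENCY'] == 'A') & (sample['MEASURE'] == 'MLN_USD')
filtered = sample[mask]

print(filtered)
```

Both approaches keep the same rows; the combined mask just avoids creating an intermediate dataframe.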

In [9]:
investment.head()

Unnamed: 0,LOCATION,INDICATOR,SUBJECT,MEASURE,FREQUENCY,TIME,Value
62,AUS,GFCF,TOT,MLN_USD,A,1960,7594.023
63,AUS,GFCF,TOT,MLN_USD,A,1961,7555.709
64,AUS,GFCF,TOT,MLN_USD,A,1962,8263.204
65,AUS,GFCF,TOT,MLN_USD,A,1963,9144.787
66,AUS,GFCF,TOT,MLN_USD,A,1964,10182.874


Now, we check the number of unique values in every column of our dataframe.

In [10]:
investment.nunique()

LOCATION       67
INDICATOR       1
SUBJECT         1
MEASURE         1
FREQUENCY       1
TIME           63
Value        2635
dtype: int64
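
Columns with a single unique value carry no information after filtering, and they can be picked out programmatically instead of being read off the output. A sketch, assuming a hypothetical `sample` frame with illustrative rows:

```python
import pandas as pd

# Hypothetical stand-in for the filtered data; rows are illustrative only.
sample = pd.DataFrame({
    'LOCATION': ['AAA', 'BBB'],
    'INDICATOR': ['GFCF', 'GFCF'],
    'SUBJECT': ['TOT', 'TOT'],
    'TIME': ['1960', '1970'],
    'Value': [7594.0, 3100.0],
})

# nunique() returns one count per column; keep the names where the count is 1.
counts = sample.nunique()
constant_cols = counts[counts == 1].index.tolist()

# Drop the constant columns and renumber the rows from 0.
trimmed = sample.drop(columns=constant_cols).reset_index(drop=True)

print(constant_cols)
print(trimmed.columns.tolist())
```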

We can drop the columns `INDICATOR`, `SUBJECT`, `FREQUENCY` and `MEASURE`, since we don't need them anymore. We also reset our index, since we have deleted a lot of rows.

In [11]:
investment = investment.drop(['INDICATOR','SUBJECT','FREQUENCY','MEASURE'], axis=1)
investment.reset_index(drop=True, inplace=True)

After these changes, our table looks like this:

In [12]:
investment.head()

Unnamed: 0,LOCATION,TIME,Value
0,AUS,1960,7594.023
1,AUS,1961,7555.709
2,AUS,1962,8263.204
3,AUS,1963,9144.787
4,AUS,1964,10182.874


We now need to check the datatypes of each of the columns:

In [13]:
investment.dtypes

LOCATION     object
TIME         object
Value       float64
dtype: object

We want `LOCATION` and `TIME` to be strings, so we make those changes in the dataframe.

In [14]:
investment['LOCATION'] = investment['LOCATION'].astype('string')
investment['TIME'] = investment['TIME'].astype('string')

Now, our datatypes are:

In [15]:
investment.dtypes

LOCATION     string
TIME         string
Value       float64
dtype: object
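
The same conversion can be written as one call by passing `astype` a dictionary that maps column names to dtypes. A minimal sketch on a hypothetical `sample` frame (illustrative rows):

```python
import pandas as pd

# Hypothetical stand-in for the cleaned data; rows are illustrative only.
sample = pd.DataFrame({
    'LOCATION': ['AAA', 'BBB'],
    'TIME': ['1960', '1970'],
    'Value': [7594.0, 3100.0],
})

# Columns not named in the mapping (here, Value) keep their current dtype.
sample = sample.astype({'LOCATION': 'string', 'TIME': 'string'})

print(sample.dtypes)
```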

We need to ensure that all our data is present, so we check for any null values in our columns.

In [16]:
investment.isna().sum()

LOCATION    0
TIME        0
Value       0
dtype: int64

The shape of our final data looks like this:

In [17]:
investment.shape

(2635, 3)

We now need to check if we are missing any data. For each unique value of `LOCATION` in our dataframe, we find the first year for which data is available, the last year for which data is available, and any missing years within this range. We also track the latest first year and the earliest last year across all countries.

In [18]:
highestmin = 1
highestmincountry = ''
lowestmax = 9999
lowestmaxcountry = ''

for country in investment.LOCATION.unique():
    tempdf = investment[investment['LOCATION'] == country]

    # TIME holds four-digit year strings, so convert before comparing
    minyear = int(tempdf.TIME.min())
    if highestmin < minyear:
        highestmin = minyear
        highestmincountry = country

    maxyear = int(tempdf.TIME.max())
    if lowestmax > maxyear:
        lowestmax = maxyear
        lowestmaxcountry = country

    # Collect the years between minyear and maxyear that have no row.
    # A plain list is used here; pd.concat cannot combine a list with a Series.
    years = set(tempdf['TIME'])
    missingyear = [str(i) for i in range(minyear, maxyear) if str(i) not in years]

    print('Country = ', country, 'Minimum Year = ', minyear, 'Maximum Year = ', maxyear, 'Missing Years = ', missingyear)

print('Highest Min = ', highestmin, ' Country = ', highestmincountry)
print('Lowest Max = ', lowestmax, ' Country = ', lowestmaxcountry)

Country =  AUS Minimum Year =  1960 Maximum Year =  2021 Missing Years =  []
Country =  AUT Minimum Year =  1970 Maximum Year =  2022 Missing Years =  []
Country =  BEL Minimum Year =  1970 Maximum Year =  2022 Missing Years =  []
Country =  CAN Minimum Year =  1961 Maximum Year =  2022 Missing Years =  []
Country =  CZE Minimum Year =  1990 Maximum Year =  2022 Missing Years =  []
Country =  DNK Minimum Year =  1966 Maximum Year =  2022 Missing Years =  []
Country =  FIN Minimum Year =  1970 Maximum Year =  2022 Missing Years =  []
Country =  FRA Minimum Year =  1960 Maximum Year =  2022 Missing Years =  []
Country =  DEU Minimum Year =  1970 Maximum Year =  2022 Missing Years =  []
Country =  GRC Minimum Year =  1960 Maximum Year =  2022 Missing Years =  []
Country =  HUN Minimum Year =  1991 Maximum Year =  2022 Missing Years =  []
Country =  ISL Minimum Year =  1970 Maximum Year =  2021 Missing Years =  []
Country =  IRL Minimum Year =  1970 Maximum Year =  2022 Missing Years =  []
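
The same coverage analysis can be collected into a table with `groupby`, which makes the results easier to sort and filter than printed lines. This is a sketch rather than the method used above; the helper `year_coverage` and the `sample` frame (with a deliberate gap in 1962) are hypothetical:

```python
import pandas as pd

# Hypothetical data with one gap (AAA has no row for 1962).
sample = pd.DataFrame({
    'LOCATION': ['AAA', 'AAA', 'AAA', 'BBB', 'BBB', 'BBB'],
    'TIME': ['1960', '1961', '1963', '1970', '1971', '1972'],
    'Value': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

def year_coverage(df):
    """Return first year, last year and missing years for each LOCATION."""
    rows = []
    for country, years in df.groupby('LOCATION')['TIME']:
        present = {int(y) for y in years}
        first, last = min(present), max(present)
        # Years in the inclusive range [first, last] that have no row.
        missing = sorted(set(range(first, last + 1)) - present)
        rows.append({'LOCATION': country, 'first': first,
                     'last': last, 'missing': missing})
    return pd.DataFrame(rows)

coverage = year_coverage(sample)
print(coverage)

# Latest first year and earliest last year across all countries.
highest_min = coverage['first'].max()
lowest_max = coverage['last'].min()
print(highest_min, lowest_max)
```

The latest first year and the earliest last year bound the window in which every country could have data, which is what the `Highest Min` and `Lowest Max` values report.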

We can now export the data to our `temp` directory, which contains all the cleaned datasets.

In [19]:
from pathlib import Path
filepath = Path('../data/temp/investment_gfcf_cleaned.csv')
investment.to_csv(filepath, index=False)
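
A CSV file does not store dtypes, so a year column written as strings will be inferred as integers when the file is read back unless the dtype is given explicitly. A minimal round-trip sketch, assuming a hypothetical `sample` frame and a temporary directory in place of `../data/temp`:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for the cleaned data.
sample = pd.DataFrame({
    'LOCATION': ['AAA', 'AAA'],
    'TIME': ['1960', '1961'],
    'Value': [7594.0, 7555.5],
})

with tempfile.TemporaryDirectory() as tmp:
    # Create the output directory first; to_csv does not create it.
    out_dir = Path(tmp) / 'temp'
    out_dir.mkdir(parents=True, exist_ok=True)
    filepath = out_dir / 'sample_cleaned.csv'

    sample.to_csv(filepath, index=False)

    # Without dtype=..., TIME would be inferred as int64 on reload.
    inferred = pd.read_csv(filepath)
    reloaded = pd.read_csv(
        filepath, dtype={'LOCATION': 'string', 'TIME': 'string'}
    )

print(inferred.dtypes)
print(reloaded.dtypes)
```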