<a id="top"/>


### Come back to [Home](FinalProjectReport.ipynb)


# GDP PROCESSING

>The content of this section is:
>1. [Choosing and Obtaining data](#obtain) 
>1. [Data processing](#proc) 
>1. [Dataset's stats](#stats)

>Here we import the modules needed to load and manipulate the data.

In [1]:
import pandas as pd
import numpy as np
import csv
import json
import country_converter as coco

>In this section we preprocess the data into a shape that is convenient for further analysis. We will show:
* where we obtained our data
* how we converted the data into a form that is easier to use

<a id="obtain"/>

## Choosing the dataset and obtaining it
> We decided to correlate COVID-19 and market data with national GDP values because this gives a broader view of how GDP can affect the number of cases and deaths, and indicates whether further analysis of testing is required.

> We obtained GDP per capita (current US$) for each country from the [World Bank Data](https://data.worldbank.org/indicator/NY.GDP.PCAP.CD?view=map) portal. The dataset has one column per year from 1960 to 2019.

In [2]:
df = pd.read_csv('gdp_csv.csv')
df.head(1)

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2011,2012,2013,2014,2015,2016,2017,2018,2019,Unnamed: 64
0,Aruba,ABW,GDP per capita (current US$),NY.GDP.PCAP.CD,,,,,,,...,24985.993281,24713.698045,25025.099563,25533.56978,25796.380251,25239.600411,25630.266492,,,


In [3]:
df.shape

(264, 65)

<a id="proc"/>

## Data processing
> Once the data are downloaded, we perform the following steps:
  - melt the dataset so that there is a single 'year' column (instead of one column per year)
  - keep the most recent GDP value for each country
  - change the country code from alpha-3 to alpha-2
  - remove the columns that are not useful

First, we melt the dataset so that there is a single year column, `gdp_year`, instead of one column per year.

In [4]:
df = pd.melt(
    df,
    id_vars=['Country Name', 'Country Code', 'Indicator Name', 'Indicator Code'],
    value_vars=[str(i) for i in range(1960, 2020)],
    var_name='gdp_year',
    value_name='gdp',
)

In [5]:
df.head(1)

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,gdp_year,gdp
0,Aruba,ABW,GDP per capita (current US$),NY.GDP.PCAP.CD,1960,


In [6]:
df.shape

(15840, 6)
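
To make the reshaping concrete, here is a minimal sketch of the same wide-to-long melt on a toy table. The country codes `AAA` and `BBB` and all values are made up for illustration. Each input row produces one output row per year column, which is why the real dataset grows from 264 rows to 264 × 60 = 15840 rows.

```python
import pandas as pd

# Toy wide table: one column per year (codes and values are made up).
wide = pd.DataFrame({
    "Country Code": ["AAA", "BBB"],
    "2018": [1.0, 3.0],
    "2019": [2.0, None],
})

# One row per (country, year) pair instead of one column per year.
long = pd.melt(
    wide,
    id_vars=["Country Code"],
    value_vars=["2018", "2019"],
    var_name="gdp_year",
    value_name="gdp",
)

print(long.shape)  # 2 countries x 2 years -> (4, 3)
```

Missing values are carried over as `NaN` rows, so they can be dropped in a later step.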

Next, we keep the most recent available GDP value for each country.

In [7]:
# Drop rows without a value, then keep each country's latest remaining year.
df = df.dropna()
df['max_year'] = df.groupby(['Country Code'])['gdp_year'].transform('max')
df = df[df.gdp_year == df.max_year].drop(['max_year'], axis=1)

In [8]:
df.shape

(258, 6)
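
As an alternative to the helper `max_year` column, the same "latest non-missing year per country" selection can be sketched with a sort followed by `drop_duplicates`. This is only an illustration on made-up data, not the method used in this notebook. It also shows why different countries can end up with different years: a country with no value for the last year falls back to an earlier one.

```python
import pandas as pd

# Toy long table (codes and values are made up); BBB has no 2019 value.
long = pd.DataFrame({
    "Country Code": ["AAA", "AAA", "BBB", "BBB"],
    "gdp_year": ["2018", "2019", "2018", "2019"],
    "gdp": [1.0, 2.0, 3.0, None],
})

latest = (
    long.dropna(subset=["gdp"])                        # discard missing values
        .sort_values("gdp_year")                       # oldest first (4-digit strings sort correctly)
        .drop_duplicates("Country Code", keep="last")  # keep the newest row per country
        .reset_index(drop=True)
)

print(latest)
```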

Now we change the country code from the alpha-3 format to alpha-2, because the COVID dataset uses the latter.

In [9]:
def get_country_code_alpha2(alpha3):
    return coco.convert(names=alpha3, to='ISO2', not_found=np.nan)

In [10]:
df['countryCode'] = df['Country Code'].apply(get_country_code_alpha2)



This step emits many warnings because the dataset also includes GDP values for aggregates (groups of countries, continents, and other regions) that we are not interested in. Their codes cannot be converted, so they become NaN.
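
The effect of the conversion on aggregate rows can be sketched without the `country_converter` package. In the sketch below, `ALPHA3_TO_ALPHA2` and `to_alpha2` are hypothetical stand-ins for `coco.convert` with `not_found=np.nan`, and `WLD` is the World Bank code for the world aggregate. Codes that are not countries map to NaN and disappear at the next `dropna()`, which is why the processed dataset shrinks from 258 to 211 rows.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for coco.convert: a tiny alpha-3 -> alpha-2 lookup.
ALPHA3_TO_ALPHA2 = {"ABW": "AW", "MCO": "MC"}

def to_alpha2(alpha3):
    # Codes that are not real countries (e.g. aggregates) map to NaN.
    return ALPHA3_TO_ALPHA2.get(alpha3, np.nan)

codes = pd.DataFrame({"Country Code": ["ABW", "MCO", "WLD"]})
codes["countryCode"] = codes["Country Code"].apply(to_alpha2)

# Dropping NaN removes the aggregate row and keeps the two countries.
countries_only = codes.dropna()
print(countries_only)
```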

Finally, we remove the NaN values and the columns that are no longer useful.

In [11]:
df = df.drop(columns=['Country Name', 'Indicator Name', 'Indicator Code']).dropna()
df = df.rename(columns={'Country Code': 'countryCode3'})

In [12]:
df = df.sort_values(by='gdp', ascending=False).reset_index(drop=True)

<a id="stats"/>

## Dataset's stats

In [13]:
get_number_unique_values = lambda df, col: len(df[col].unique())
def print_info(df, main_feature):
    print("Number of %s:" % main_feature, get_number_unique_values(df, main_feature))
    features = df.columns.to_list()
    print("\nWe have %d features:" % len(features), features)
    print("\nThe total number of (rows, cols) is:", df.shape)
    print("\nIn memory occupies: ~%d MB\n" % (df.memory_usage(index=True).sum() / (2**20)))
    print(df.head(1))

### Original Dataset

In [14]:
df_ori = pd.read_csv('gdp_csv.csv')

In [15]:
print_info(df_ori, "Country Name")

Number of Country Name: 264

We have 65 features: ['Country Name', 'Country Code', 'Indicator Name', 'Indicator Code', '1960', '1961', '1962', '1963', '1964', '1965', '1966', '1967', '1968', '1969', '1970', '1971', '1972', '1973', '1974', '1975', '1976', '1977', '1978', '1979', '1980', '1981', '1982', '1983', '1984', '1985', '1986', '1987', '1988', '1989', '1990', '1991', '1992', '1993', '1994', '1995', '1996', '1997', '1998', '1999', '2000', '2001', '2002', '2003', '2004', '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018', '2019', 'Unnamed: 64']

The total number of (rows, cols) is: (264, 65)

In memory occupies: ~0 MB

  Country Name Country Code                Indicator Name  Indicator Code  \
0        Aruba          ABW  GDP per capita (current US$)  NY.GDP.PCAP.CD   

   1960  1961  1962  1963  1964  1965  ...          2011          2012  \
0   NaN   NaN   NaN   NaN   NaN   NaN  ...  24985.993281  24713.698045   

      

In [16]:
df_ori.describe()

Unnamed: 0,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,...,2011,2012,2013,2014,2015,2016,2017,2018,2019,Unnamed: 64
count,132.0,132.0,134.0,134.0,134.0,144.0,147.0,150.0,154.0,154.0,...,253.0,252.0,253.0,252.0,251.0,250.0,247.0,239.0,0.0,0.0
mean,484.657093,504.015717,529.605145,562.032811,610.188657,677.155901,734.04949,750.651535,770.136932,834.474792,...,16133.575074,16051.403267,16588.098623,16592.88196,14960.419108,15089.966357,14918.158187,15115.093976,,
std,618.234795,644.142567,675.552712,715.847268,782.824554,874.410998,945.92076,977.392321,1008.953075,1086.440709,...,24178.276523,23380.42769,24934.219232,25312.160011,22674.148159,22856.559361,21318.087315,22219.914937,,
min,40.537115,40.68939,34.790581,40.752237,41.083814,45.989354,37.488783,46.64277,48.784045,51.794609,...,249.577979,252.35898,256.976003,274.857948,293.455236,282.19313,292.997631,271.752044,,
25%,106.517055,109.11211,114.355157,122.754509,124.090568,140.376734,145.767969,157.983788,159.262062,161.937705,...,1739.14878,1905.937468,2004.504298,2104.200853,2067.475587,2124.675302,2042.465642,2032.214332,,
50%,225.669856,209.908521,229.024563,244.612208,249.981372,269.36368,277.233872,278.460057,291.141585,305.373736,...,6045.495551,6557.846749,6832.456891,6640.856256,6124.491643,5924.917489,6213.501276,6385.461626,,
75%,481.584427,523.724552,598.375139,603.188391,677.406317,783.039774,882.906899,875.547972,805.809962,860.595621,...,19034.149197,19300.530405,19916.019387,19462.312835,17106.400142,17821.571228,17136.270746,17203.9416,,
max,3007.123445,3066.562869,3243.843078,3374.515171,3573.941185,4443.405272,4571.181955,4336.426587,4695.92339,5032.144743,...,168785.940809,157515.899069,177593.351895,189170.895671,167290.939984,169915.80484,167101.759377,185741.279992,,


### Processed Dataset

In [17]:
print_info(df, "countryCode")

Number of countryCode: 211

We have 4 features: ['countryCode3', 'gdp_year', 'gdp', 'countryCode']

The total number of (rows, cols) is: (211, 4)

In memory occupies: ~0 MB

  countryCode3 gdp_year            gdp countryCode
0          MCO     2018  185741.279992          MC


In [18]:
df.describe()

Unnamed: 0,gdp
count,211.0
mean,17915.550291
std,26514.610111
min,271.752044
25%,2147.382923
50%,6941.235848
75%,23168.624671
max,185741.279992


In [19]:
df.to_csv("gdp_csv_processed.csv")
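
One pitfall when reading this file back is worth sketching: by default `pd.read_csv` treats the string `NA` as a missing value, and `NA` is also the alpha-2 code of Namibia (alpha-3 `NAM`). The example below uses a made-up two-row frame and an in-memory buffer instead of the real file. Whether Namibia is among the 211 rows is an assumption, so treat this as a precaution rather than a confirmed problem.

```python
import io
import pandas as pd

# Made-up frame with the same two code columns as the processed dataset.
processed = pd.DataFrame({
    "countryCode3": ["NAM", "MCO"],
    "countryCode": ["NA", "MC"],
})

buffer = io.StringIO()
processed.to_csv(buffer)  # as in the notebook, the index is written as the first column
csv_text = buffer.getvalue()

# Default parsing: the string "NA" is read back as NaN.
default_read = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Treat only empty fields as missing, so "NA" survives as a string.
safe_read = pd.read_csv(
    io.StringIO(csv_text),
    index_col=0,
    keep_default_na=False,
    na_values=[""],
)

print(default_read["countryCode"].tolist())
print(safe_read["countryCode"].tolist())
```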

<a id="graphs"/>

#### [Back to the top](#top)

#### Come back to [Home](FinalProjectReport.ipynb)