
# Description

In this exercise, we will use the `GrowthDJ` economic growth data from the AER package in R: https://www.rdocumentation.org/packages/AER/versions/1.2-9/topics/GrowthDJ

The purpose of this exercise is to explore the determinants of **gdpgrowth**.


#### Variables

- **oil**: Is the country an oil-producing country?
- **inter**: Does the country have better quality data?
- **oecd**: Is the country a member of the OECD?
- **gdp60**: Per capita GDP in 1960.
- **gdp85**: Per capita GDP in 1985.
- **gdpgrowth**: Average growth rate of per capita GDP from 1960 to 1985 (in percent).
- **popgrowth**: Average growth rate of working-age population 1960 to 1985 (in percent).
- **invest**: Average ratio of investment (including Government Investment) to GDP from 1960 to 1985 (in percent).
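- **school**: Average fraction of working-age population enrolled in secondary school from 1960 to 1985 (in percent).
- **literacy60**: Fraction of the population over 15 years old that was able to read and write in 1960 (in percent).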


In [2]:
url = 'https://vincentarelbundock.github.io/Rdatasets/csv/AER/GrowthDJ.csv'

In [4]:
import pandas as pd
import numpy as np

# Read the data

In [12]:
data_df = pd.read_csv(url)

# Data wrangling

#### Check the data dimensionality using `.shape`

In [7]:
data_df.shape

(121, 11)

#### How many rows with missing values?

- Check the functionality of `.isna()` and `.dropna()`:

- `.isna()` - Detect missing values.
- `.dropna()` - Drop the rows (or columns) that contain missing values.

In [13]:
data_df.isna()

Unnamed: 0.1,Unnamed: 0,oil,inter,oecd,gdp60,gdp85,gdpgrowth,popgrowth,invest,school,literacy60
0,False,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,True
4,False,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...
116,False,False,False,False,False,False,False,False,False,False,False
117,False,False,False,False,False,True,False,True,False,False,True
118,False,False,False,False,False,False,False,False,False,False,False
119,False,False,False,False,False,False,False,False,False,False,False
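The Boolean table above marks each missing cell but does not yet count the rows. One way to get that count is to reduce the mask row by row with `.any(axis=1)` and sum the result. The sketch below uses a small toy frame with made-up values (`toy_df` is hypothetical, not the GrowthDJ data) so that it runs on its own; in the notebook the same calls would be applied to `data_df`.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values (not the GrowthDJ data), for illustration only.
toy_df = pd.DataFrame({
    "gdpgrowth": [4.8, np.nan, 2.2, 2.9],
    "invest": [24.1, 5.8, 10.8, 12.7],
    "literacy60": [10.0, 5.0, np.nan, np.nan],
})

# True for every row that has at least one missing value.
row_has_na = toy_df.isna().any(axis=1)

# Summing Booleans counts the True entries.
n_rows_with_na = int(row_has_na.sum())

# Missing values per column, for comparison.
na_per_column = toy_df.isna().sum()

print(n_rows_with_na)
print(na_per_column)
```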


#### Finally, drop the rows with missing values

In [14]:
data_df.dropna()  # returns a new DataFrame; data_df itself is left unchanged

Unnamed: 0.1,Unnamed: 0,oil,inter,oecd,gdp60,gdp85,gdpgrowth,popgrowth,invest,school,literacy60
0,1,no,yes,no,2485.0,4371.0,4.8,2.6,24.1,4.5,10.0
1,2,no,no,no,1588.0,1171.0,0.8,2.1,5.8,1.8,5.0
2,3,no,no,no,1116.0,1071.0,2.2,2.4,10.8,1.8,5.0
4,5,no,no,no,529.0,857.0,2.9,0.9,12.7,0.4,2.0
5,6,no,no,no,755.0,663.0,1.2,1.7,5.1,0.4,14.0
...,...,...,...,...,...,...,...,...,...,...,...
115,116,no,yes,no,10367.0,6336.0,1.9,3.8,11.4,7.0,63.0
116,117,no,yes,yes,8440.0,13409.0,3.8,2.0,31.5,9.8,100.0
118,119,no,yes,no,879.0,2159.0,5.5,1.9,13.9,4.1,39.0
119,120,no,yes,yes,9523.0,12308.0,2.7,1.7,22.5,11.9,99.0
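Note that `.dropna()` returns a new DataFrame rather than modifying the original, so the result has to be assigned to a name if it is needed later. The `subset` argument restricts the check to selected columns. A minimal sketch on a toy frame with made-up values (the names `toy_df`, `complete_df` and `growth_df` are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values (not the GrowthDJ data), for illustration only.
toy_df = pd.DataFrame({
    "gdpgrowth": [4.8, np.nan, 2.2, 2.9],
    "literacy60": [10.0, 5.0, np.nan, 2.0],
})

# Drop every row that has a missing value in any column.
complete_df = toy_df.dropna()

# Drop only the rows where gdpgrowth itself is missing.
growth_df = toy_df.dropna(subset=["gdpgrowth"])

# The original frame keeps all of its rows.
print(toy_df.shape, complete_df.shape, growth_df.shape)
```

Whether to drop on all columns or only on the columns used in a given analysis is a judgment call; the stricter option discards more observations.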


# Data Subsetting

Try creating the following datasets:

- OECD countries
- Countries with an above-average literacy rate
 

In [24]:
oecd = data_df[data_df.oecd == 'yes']
mean_literacy = data_df['literacy60'].mean()
literate_countries = data_df[data_df.literacy60 > mean_literacy]
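Both subsets rely on Boolean masks: a comparison produces one `True`/`False` per row, and indexing with that mask keeps the `True` rows. Two details are worth keeping in mind: `.mean()` skips missing values by default, and a comparison against a missing value evaluates to `False`, so countries without a literacy figure never enter the above-average subset. A small sketch with made-up values (`toy_df` is hypothetical):

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values (not the GrowthDJ data), for illustration only.
toy_df = pd.DataFrame({
    "oecd": ["yes", "no", "yes", "no"],
    "literacy60": [90.0, 20.0, np.nan, 70.0],
})

# Mean over the three non-missing values: (90 + 20 + 70) / 3.
mean_lit = toy_df["literacy60"].mean()

# Row 2 has no literacy figure, so its comparison is False and it is excluded.
above_average = toy_df.loc[toy_df["literacy60"] > mean_lit]

# String columns are filtered the same way.
oecd_rows = toy_df.loc[toy_df["oecd"] == "yes"]

print(mean_lit)
print(above_average)
print(oecd_rows)
```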

# Data Exploration

#### Calculate the mean and standard deviation of `gdpgrowth`

In [29]:
gdpgrowth_mean = data_df['gdpgrowth'].mean()


In [28]:
gdpgrowth_std = data_df['gdpgrowth'].std()
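When comparing results across libraries, keep in mind that pandas computes the sample standard deviation (`ddof=1`) by default and skips missing values, whereas NumPy's `np.std` defaults to the population version (`ddof=0`). The sketch below shows the difference on a short made-up series (`growth` is a hypothetical stand-in for the `gdpgrowth` column):

```python
import numpy as np
import pandas as pd

# Made-up values with one missing entry, for illustration only.
growth = pd.Series([1.0, 2.0, 3.0, 4.0, np.nan])

# pandas default: sample standard deviation (divides by n - 1), NaN skipped.
sample_std = growth.std()

# Population standard deviation (divides by n), NaN still skipped.
population_std = growth.std(ddof=0)

# NumPy equivalent of the population version; nanstd ignores the NaN.
numpy_std = np.nanstd(growth.to_numpy())

print(sample_std, population_std, numpy_std)
```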

#### Run `.describe()` to see the data description

In [30]:
data_df.describe()

Unnamed: 0.1,Unnamed: 0,gdp60,gdp85,gdpgrowth,popgrowth,invest,school,literacy60
count,121.0,116.0,108.0,117.0,107.0,121.0,118.0,103.0
mean,61.0,3681.818966,5683.259259,4.094017,2.279439,18.157025,5.526271,48.165049
std,35.073732,7492.877637,5688.670819,1.891464,0.998748,7.85331,3.532037,35.354257
min,1.0,383.0,412.0,-0.9,0.3,4.1,0.4,1.0
25%,31.0,973.25,1209.25,2.8,1.7,12.0,2.4,15.0
50%,61.0,1962.0,3484.5,3.9,2.4,17.7,4.95,39.0
75%,91.0,4274.5,7718.75,5.3,2.9,24.1,8.175,83.5
max,121.0,77881.0,25635.0,9.2,6.8,36.9,12.1,100.0


#### Calculate the group averages

- For each categorical variable (`oil`, `inter`, `oecd`), calculate the mean of `gdpgrowth`.

In [49]:
# Group by one categorical column at a time, then average gdpgrowth within each group
oil_gdpgrowth = data_df.groupby('oil')['gdpgrowth'].mean()
inter_gdpgrowth = data_df.groupby('inter')['gdpgrowth'].mean()
oecd_gdpgrowth = data_df.groupby('oecd')['gdpgrowth'].mean()
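A group average needs one grouping column and one value column: group by the categorical variable, select `gdpgrowth`, then take the mean. The three categorical variables can also be handled in a single loop. This is a sketch on a toy frame with made-up values (`toy_df` and `group_means` are hypothetical names), not the notebook's output:

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values (not the GrowthDJ data), for illustration only.
toy_df = pd.DataFrame({
    "oil": ["no", "no", "yes", "no"],
    "oecd": ["yes", "no", "no", "yes"],
    "gdpgrowth": [4.0, 2.0, 3.0, np.nan],
})

# One Series of group means per categorical column; NaN values are skipped.
group_means = {
    col: toy_df.groupby(col)["gdpgrowth"].mean()
    for col in ["oil", "oecd"]
}

for col, means in group_means.items():
    print(col)
    print(means)
```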

#### Calculate the correlation between `gdpgrowth` and possible explanatory variables

In [55]:
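# Keep the numeric columns only; yes/no columns such as oil cannot be correlated directly
data_df.select_dtypes(include='number').corr()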

Unnamed: 0.1,Unnamed: 0,gdp60,gdp85,gdpgrowth,popgrowth,invest,school,literacy60
Unnamed: 0,1.0,0.191958,0.415553,0.024615,-0.125694,0.328623,0.564927,0.69159
gdp60,0.191958,1.0,0.630505,-0.122174,0.291304,0.091031,0.337358,0.257476
gdp85,0.415553,0.630505,1.0,0.139063,-0.222036,0.580661,0.697297,0.72911
gdpgrowth,0.024615,-0.122174,0.139063,1.0,0.242443,0.35051,0.197995,0.160969
popgrowth,-0.125694,0.291304,-0.222036,0.242443,1.0,-0.33193,-0.212766,-0.414744
invest,0.328623,0.091031,0.580661,0.35051,-0.33193,1.0,0.622444,0.639264
school,0.564927,0.337358,0.697297,0.197995,-0.212766,0.622444,1.0,0.818405
literacy60,0.69159,0.257476,0.72911,0.160969,-0.414744,0.639264,0.818405,1.0
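The full matrix contains more than the exercise asks for: only the `gdpgrowth` column is of interest, and `Unnamed: 0` is just the row number from the CSV file, not an explanatory variable. One possible way to narrow the output is to take the `gdpgrowth` column of the matrix, drop its correlation with itself, and sort. The sketch below runs on a toy frame with made-up values (`toy_df` and `growth_corr` are hypothetical names):

```python
import pandas as pd

# Toy frame with made-up values (not the GrowthDJ data), for illustration only.
toy_df = pd.DataFrame({
    "oil": ["no", "no", "yes", "no"],
    "gdpgrowth": [1.0, 2.0, 3.0, 4.0],
    "invest": [10.0, 20.0, 30.0, 40.0],
    "popgrowth": [4.0, 3.0, 2.0, 1.0],
})

# Correlations are only defined for numeric columns.
numeric_df = toy_df.select_dtypes(include="number")

# Keep the gdpgrowth column, drop its self-correlation, sort from high to low.
growth_corr = (
    numeric_df.corr()["gdpgrowth"]
    .drop("gdpgrowth")
    .sort_values(ascending=False)
)

print(growth_corr)
```

In the notebook, the row-number column could be removed beforehand with `.drop(columns=...)` so that it does not appear among the candidates.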
