# Progress Report - Team Outliers

## Project Title - Gross Domestic Product and Economic Status of Countries

### Team Members - 
1. Advait Pai - apai21@uic.edu
2. Divyasha Pahuja - dpahuj2@uic.edu
3. Uday Nair - unair2@uic.edu
4. Utsav Sharma - usharm4@uic.edu

## Any Changes

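We changed the classification target for research question 3 from development status (for example, Under-developed) to economic status based on Gross National Income (GNI).

Reason: a country's economic status is determined by its GNI, and GNI is closely related to GDP, so economic status can in turn be related to GDP.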

## Data

We are using data from the Organisation for Economic Co-operation and Development (OECD) and the World Bank for different countries to carry out our predictions. The CSV data is freely available on their websites, and we fetch these datasets programmatically using a `driver.ipynb` notebook, which downloads the files and places them in a folder called `/data`.
The downloaded CSV files are then cleaned using the steps shown later in the document.

Our data is split across a number of datasets:


| Sr.no | Dataset | Description | Rows | Columns |
|-------|---------|-------------|------|---------|
| 1 | Share Prices | Share price indices are calculated from the prices of common shares of companies traded on national or foreign stock exchanges. | 35098 | 8 |
| 2 | Consumer Price Index | Inflation measured by consumer price index (CPI) is defined as the change in the prices of a basket of goods and services that are typically purchased by specific groups of households.| 293811 | 8 |
| 3 | Long Term Interest Rates | Long-term interest rates refer to government bonds maturing in ten years. Rates are mainly determined by the price charged by the lender, the risk from the borrower and the fall in the capital value. | 27365 | 8 |
| 4 | Labour Force | The labour force, or currently active population, comprises all persons who fulfil the requirements for inclusion among the employed (civilian employment plus the armed forces) or the unemployed. | 33517 | 8 |
| 5 | Exports and Imports | Defined as the transactions in goods and services between residents and non-residents. It is measured in million USD at 2015 constant prices. | 15922 | 8 |
| 6 | Household Spending | Household spending is the amount of final consumption expenditure made by resident households to meet their everyday needs, such as food, clothing, housing (rent), energy, transport, health costs, leisure. | 8862 | 8 |
| 7 | Government Spending | General government spending provides an indication of the size of government across countries. The large variation in this indicator highlights the variety of countries' approaches to delivering public goods and services and providing social protection, not necessarily differences in resources spent. | 10290 | 8 |
| 8 | Tax Revenue | Tax revenue is defined as the revenues collected from taxes on income and profits, social security contributions, taxes levied on goods and services, payroll taxes, taxes on the ownership and transfer of property, and other tax. | 3549 | 8 |
| 9 | Investment GFCF | Gross fixed capital formation (GFCF), also called "investment", is defined as the acquisition of produced assets (including purchases of second-hand assets), including the production of such assets by producers for their own use, minus disposals. | 24513 | 8 |
| 10 | GDP | Gross domestic product (GDP) is the standard measure of the value added created through the production of goods and services in a country during a certain period. As such, it also measures the income earned from that production, or the total amount spent on final goods and services (less imports). | 5162 | 8 |
| 11 | Economic Category | This dataset contains the labels for the economic status of a country, which we will be using to perform classification. | 239 | 37 |


** Add sentence describing which dataset is used for what **

## Research Questions

1. Prediction of the Gross Domestic Product (GDP) of countries by carrying out regression on leading and lagging factors such as inflation, household spending, trade and tax revenue from Organisation for Economic Co-operation and Development (OECD) datasets.
2. Forecasting of the GDP of countries from their historical GDP values, treated as time series data.
3. Classification of countries as per their economic development status (Emerging, Developing and Developed Economies) using economic and social indicators gathered from Exploratory Data Analysis (EDA) of OECD data.
4. Cluster analysis on the indicators resulting from the EDA of the OECD data, to determine the optimal number of clusters for grouping countries by their economic development status.

## Data Cleaning - WIP

### GDP Yearly:

To clean this dataset, we load the CSV file from the `data/uncleaned` directory, and check the head to ensure that it loads properly. The initial shape of the dataframe is `(5222,8)` 

Then, we check the unique values in the column `MEASURE`. This tells us that this column has the values `MLN_USD` and `USD_CAP`.
We also check for the columns `FREQUENCY` and `SUBJECT`, where the unique values are `A` and `TOT` respectively.

The column `Flag_Codes` is of no use to us, so we drop it. For `MEASURE`, we keep only the rows with the value `MLN_USD`.

Before dropping any further columns, we check the number of unique values per column. The columns `INDICATOR`, `SUBJECT`, `FREQUENCY` and `MEASURE` now hold only one value each and add no information, so they can be dropped.

Next, we check for `null` values in all columns. The dataset has no `null` values, so we do not need to handle them.

After cleaning, the shape of the dataframe is `(2675,3)`. It is then exported to a .csv file and placed in the `data/temp` folder.
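
The cleaning steps above can be sketched roughly as follows. This is a minimal illustration on a toy dataframe, not the actual notebook code: the values are made up, and the helper name `clean_gdp` is hypothetical. The column names follow the ones mentioned in this section.

```python
import pandas as pd

# Toy stand-in for the raw GDP export; the values are made up for illustration.
raw = pd.DataFrame({
    "LOCATION":   ["AUS", "AUS", "AUS", "AUT"],
    "INDICATOR":  ["GDP"] * 4,
    "SUBJECT":    ["TOT"] * 4,
    "MEASURE":    ["MLN_USD", "USD_CAP", "MLN_USD", "MLN_USD"],
    "FREQUENCY":  ["A"] * 4,
    "TIME":       [1960, 1960, 1961, 1960],
    "Value":      [25071.8, 2400.0, 25363.5, 11000.0],
    "Flag_Codes": [None, None, None, "E"],
})


def clean_gdp(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: keep MLN_USD rows, drop flags and constant columns."""
    out = df[df["MEASURE"] == "MLN_USD"].drop(columns=["Flag_Codes"])
    # Columns with a single unique value carry no information.
    constant_cols = [c for c in out.columns if out[c].nunique() == 1]
    out = out.drop(columns=constant_cols)
    # Fail loudly if nulls appear, since this sketch does not handle them.
    assert not out.isnull().values.any(), "unexpected null values"
    return out.reset_index(drop=True)


cleaned = clean_gdp(raw)
print(cleaned.columns.tolist())
print(cleaned.shape)
```

On the toy data this leaves only `LOCATION`, `TIME` and `Value`, the same three-column layout as the cleaned files loaded below.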

In [1]:
import pandas as pd
base_path = '../data/temp/'

# 1. Share Prices Data
df_sp = pd.read_csv(base_path+'share_prices_cleaned.csv')
print("Share Prices Shape:",df_sp.shape)
# 2. Inflation CPI Data
df_cpi = pd.read_csv(base_path+'inflation_cpi_cleaned.csv')
print("Inflation CPI Spend Shape:",df_cpi.shape)
# 3. Long Term Interest Rates
df_lti = pd.read_csv(base_path+'long_term_ir_cleaned.csv')
print("Long Term Interest Rates Shape:",df_lti.shape)
# 4. Labour Force Data
df_lf = pd.read_csv(base_path+'labour_force_cleaned.csv')
print("Labour Force Shape:",df_lf.shape)
# 5. Trade in Goods and Services Data
df_exp_imp = pd.read_csv(base_path+'trade_in_gs_cleaned.csv')
print("Trade in Government Spend Shape:",df_exp_imp.shape)
# 6. Household Spend
df_hspend = pd.read_csv(base_path+'household_spend_cleaned.csv')
print("Household Spend Shape:",df_hspend.shape)
# 7. Government Spending
df_gspend = pd.read_csv(base_path+'government_spend_cleaned.csv')
print("Gov Spend Shape:",df_gspend.shape)
# 8. Tax Revenue
df_tax = pd.read_csv(base_path+'tax_revenue_cleaned.csv')
print("Tax Revenue Shape:",df_tax.shape)
# 9. Investment GFCF
df_gfcf = pd.read_csv(base_path+'investment_gfcf_cleaned.csv')
print("Investment GCFC Shape:",df_gfcf.shape)
# 10. GDP Yearly Data
df_gdp = pd.read_csv(base_path+'gdp_yearly_cleaned.csv')
print("GDP Yearly Shape:",df_gdp.shape)
# 11. FDI Data
# df_fdi = pd.read_csv(base_path+'fdi_cleaned.csv')
# print("GDP Yearly Shape:",df_fdi.shape)
# 12. GNI Data
df_gni = pd.read_csv(base_path+'economic_category_cleaned.csv')
print("GNI Yearly Shape:",df_gni.shape)

Share Prices Shape: (2065, 3)
Inflation CPI Spend Shape: (2840, 3)
Long Term Interest Rates Shape: (1587, 3)
Labour Force Shape: (1993, 3)
Trade in Government Spend Shape: (2641, 3)
Household Spend Shape: (1955, 3)
Gov Spend Shape: (654, 3)
Tax Revenue Shape: (1697, 3)
Investment GCFC Shape: (2635, 3)
GDP Yearly Shape: (2675, 3)
GNI Yearly Shape: (7630, 3)


In [2]:
## Renaming Columns

df_sp = df_sp.rename(columns={'Value':'Share_Price'})
df_cpi = df_cpi.rename(columns={'Value':'Inflation_CPI'})
df_lti = df_lti.rename(columns={'Value':'LT_Interest'})
df_lf = df_lf.rename(columns={'Value':'Labor_Force'})
df_exp_imp = df_exp_imp.rename(columns={'NTRVAL':'Trade_Goverment'})
df_hspend = df_hspend.rename(columns={'Value':'H_Spend'})
df_gspend = df_gspend.rename(columns={'Value':'G_Spend'})
df_tax = df_tax.rename(columns={'Value':'Tax'})
df_gfcf = df_gfcf.rename(columns={'Value':'Investment'})
df_gdp = df_gdp.rename(columns={'Value':'GDP'})
df_gni = df_gni.rename(columns={'Value':'GNI'})

In [3]:
df_sp.head()

Unnamed: 0,LOCATION,TIME,Share_Price
0,AUS,1958,2.613002
1,AUS,1959,3.256618
2,AUS,1960,3.966841
3,AUS,1961,3.653984
4,AUS,1962,3.67826


In [4]:
df_cpi.head()

Unnamed: 0,LOCATION,TIME,Inflation_CPI
0,AUS,1949,3.738101
1,AUS,1950,4.063153
2,AUS,1951,4.852566
3,AUS,1952,5.688414
4,AUS,1953,5.943812


In [5]:
df_lti.head()

Unnamed: 0,LOCATION,TIME,LT_Interest
0,PRT,1994,10.47833
1,PRT,1995,11.465
2,PRT,1996,8.559167
3,PRT,1997,6.358333
4,PRT,1998,4.8775


In [6]:
df_lf.head()

Unnamed: 0,LOCATION,TIME,Labor_Force
0,MEX,2005,43631.5
1,MEX,2006,44982.52
2,MEX,2007,45904.54
3,MEX,2008,46769.21
4,MEX,2009,48018.36


In [7]:
df_exp_imp.head()

Unnamed: 0,LOCATION,TIME,Trade_Goverment
0,AUS,1959,1221.21
1,AUS,1960,467.848
2,AUS,1961,3915.866
3,AUS,1962,1787.544
4,AUS,1963,2725.988


In [8]:
df_hspend.head()

Unnamed: 0,LOCATION,TIME,H_Spend
0,AUS,1970,30476.510257
1,AUS,1971,32799.844017
2,AUS,1972,35590.183565
3,AUS,1973,41114.456624
4,AUS,1974,48461.851872


In [9]:
df_gspend.head()

Unnamed: 0,LOCATION,TIME,G_Spend
0,AUS,2007,13737.93
1,AUS,2008,14835.57
2,AUS,2009,15963.85
3,AUS,2010,15802.56
4,AUS,2011,16535.76


In [10]:
df_tax.head()

Unnamed: 0,LOCATION,TIME,Tax
0,AUS,1965,5.608
1,AUS,1966,5.996
2,AUS,1967,6.631
3,AUS,1968,7.405
4,AUS,1969,8.488


In [11]:
df_gfcf.head()

Unnamed: 0,LOCATION,TIME,Investment
0,AUS,1960,7594.023
1,AUS,1961,7555.709
2,AUS,1962,8263.204
3,AUS,1963,9144.787
4,AUS,1964,10182.874


In [12]:
df_gdp.head()

Unnamed: 0,LOCATION,TIME,GDP
0,AUS,1960,25071.833
1,AUS,1961,25363.455
2,AUS,1962,27953.904
3,AUS,1963,30431.547
4,AUS,1964,32742.466


In [13]:
df_gni = df_gni.dropna()
df_gni.head()

Unnamed: 0,LOCATION,TIME,GNI
0,AFG,1987,L
2,DZA,1987,UM
3,ASM,1987,H
6,ATG,1987,UM
7,ARG,1987,UM


## Exploratory Data Analysis

### GDP Dataset

In [14]:
dfs = [df_lti,df_lf,df_exp_imp,df_hspend,df_tax,df_gfcf,df_gdp]
df_merge_gdp = pd.merge(df_sp,df_cpi,how='outer',on=['LOCATION','TIME'])
for d in dfs:
    df_merge_gdp = pd.merge(df_merge_gdp,d,how='outer',on=['LOCATION','TIME'])
df_merge_gdp = df_merge_gdp.dropna()
# df_merge_gdp = df_merge_gdp.drop(columns=['LOCATION','TIME','H_Spend',"Investment","Trade_Goverment"])
df_merge_gdp = df_merge_gdp.drop(columns=['LOCATION','TIME']).reset_index(drop=True)
print("Shape of GDP Dataset:",df_merge_gdp.shape)
df_merge_gdp.head()


Shape of GDP Dataset: (1065, 9)


Unnamed: 0,Share_Price,Inflation_CPI,LT_Interest,Labor_Force,Trade_Goverment,H_Spend,Tax,Investment,GDP
0,7.011195,9.078245,6.646667,5478.225,4651.015,30476.510257,9.566,18051.855,58911.123
1,5.85566,9.635477,6.713333,5622.972,8326.401,32799.844017,11.308,19303.25,64192.384
2,7.038809,10.21593,5.831666,5750.258,8642.889,35590.183565,13.825,20311.818,69871.924
3,6.581208,11.14465,6.933333,5899.899,825.695,41114.456624,19.44,22476.101,78914.75
4,4.882499,12.86278,9.036667,6052.784,2790.903,48461.851872,24.135,22330.928,86394.431


### Income Group Dataset

In [15]:
dfs = [df_lti,df_lf,df_exp_imp,df_hspend,df_tax,df_gfcf,df_gni]
df_merge_gni = pd.merge(df_sp,df_cpi,how='outer',on=['LOCATION','TIME'])
for d in dfs:
    df_merge_gni = pd.merge(df_merge_gni,d,how='outer',on=['LOCATION','TIME'])
df_merge_gni = df_merge_gni.dropna()
# df_merge_gni = df_merge_gni.drop(columns=['LOCATION','TIME','H_Spend',"Investment","Trade_Goverment"])
df_merge_gni = df_merge_gni.drop(columns=['LOCATION','TIME']).reset_index(drop=True)
print("Shape of gni Dataset:",df_merge_gni.shape)
df_merge_gni.head()

Shape of gni Dataset: (943, 9)


Unnamed: 0,Share_Price,Inflation_CPI,LT_Interest,Labor_Force,Trade_Goverment,H_Spend,Tax,Investment,GNI
0,31.04002,43.11586,13.19167,7749.125,16128.982,137662.80679,68.479,72484.821,H
1,26.97814,46.22707,12.10417,7964.5,6266.44,149366.901166,81.716,81271.683,H
2,28.96189,49.70977,13.40833,8224.468,6061.14,162749.855492,88.85,82948.542,H
3,26.76051,53.355,13.18,8443.166,16159.452,169853.765646,90.808,74785.917,H
4,27.82293,55.04992,10.69083,8480.917,20802.917,177805.033539,85.14,71752.851,H


In [16]:
df_merge_gni['GNI'].value_counts()

H     881
UM     61
LM      1
Name: GNI, dtype: int64

## Reflection

### What is the hardest part of the project that you’ve encountered so far?

The hardest part was creating the final datasets that would go into our models.
Here are two examples:
1. For the independent variables of Hypothesis 1, we had to clean separate datasets for each variable and collate them into a single dataset.
Each dataset had different measures, subjects and frequencies, which required individual inspection and cleaning before collation.
2. The predictors for GNI, which will be used to classify the income group of a country, were unclear and hard to find. Moreover, a historical dataset with income-class labels for all countries in previous years was not easily available. Once we found it, we had to unpivot the dataset to make it suitable for our classification model.
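
The unpivot step mentioned in point 2 could look roughly like the sketch below. It assumes the labels arrive as a wide table with one row per country and one column per year. The toy values and the variable names `wide` and `tidy` are illustrative and not taken from the actual dataset.

```python
import pandas as pd

# Toy wide table: one row per country, one column per year (illustrative values).
wide = pd.DataFrame({
    "LOCATION": ["AFG", "DZA", "ABW"],
    "1987": ["L", "UM", None],
    "1988": ["L", "UM", "H"],
})

# Unpivot to one row per (country, year) pair.
tidy = wide.melt(id_vars="LOCATION", var_name="TIME", value_name="Value")
tidy["TIME"] = tidy["TIME"].astype(int)

# Country-years without a label are dropped.
tidy = (
    tidy.dropna(subset=["Value"])
        .sort_values(["TIME", "LOCATION"])
        .reset_index(drop=True)
)
print(tidy)
```

The result has the same `LOCATION`, `TIME`, `Value` layout as the other cleaned datasets, so it can be merged on country and year.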

### What are your initial insights?

1. Initially, for Hypothesis 3, we had planned to classify countries by development status (Developed, Under-developed and Emerging). After researching data sources that would give us these labels per country per year, we realized that no solid dataset exists for this, and that a better indicator of a country's economic status is its income class, which can be identified from the country's GNI. Hence we have changed the scope of Hypothesis 3 to classifying countries into their respective income groups. This has a twofold benefit: we were able to build a dataset, and we found a direct correlation between GDP and GNI, where GNI is used to classify the countries into their income groups, which strengthens our research.
2. The year range differs for each country and also varies by feature, so combining the datasets of all the features leads to the loss of every sample that does not have a match in all of them.
3. The Government Spending dataset did not have enough data for us to include it in our model. For most of the countries, the data goes back only to 2007, which is insufficient for the model. Including it in our final dataframe would have resulted in too much data loss, which is why we are dropping this dataset.
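
The row loss described in points 2 and 3 can be illustrated with a small sketch. The three toy indicators `A`, `B` and `C` are hypothetical and cover different year ranges. The merge pattern (outer merge on `LOCATION` and `TIME`, then `dropna`) mirrors the one used in this report.

```python
from functools import reduce

import pandas as pd

keys = ["LOCATION", "TIME"]

# Toy indicators with different year coverage (illustrative values).
df_a = pd.DataFrame({"LOCATION": ["AUS"] * 4, "TIME": [1960, 1961, 1962, 1963],
                     "A": [1.0, 2.0, 3.0, 4.0]})
df_b = pd.DataFrame({"LOCATION": ["AUS"] * 3, "TIME": [1961, 1962, 1963],
                     "B": [5.0, 6.0, 7.0]})
df_c = pd.DataFrame({"LOCATION": ["AUS"] * 2, "TIME": [1962, 1963],
                     "C": [8.0, 9.0]})

# Outer-merge all frames on country and year, then keep only complete rows.
frames = [df_a, df_b, df_c]
outer = reduce(lambda left, right: pd.merge(left, right, how="outer", on=keys), frames)
complete = outer.dropna()

print("rows in outer merge:", len(outer))
print("rows with all features:", len(complete))
# Missing values per feature show which dataset limits the coverage.
print(outer.isna().sum())
```

Counting the missing values per column before calling `dropna` is one way to see which dataset costs the most rows, which is the kind of check that motivates dropping a sparse dataset.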

### Are there any concrete results you can show at this point? If not, why not?

#### Hypothesis 1: 

In [17]:
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import r2_score,mean_absolute_error

# Extra Trees Regressor Model
model = ExtraTreesRegressor(random_state=236)
X = df_merge_gdp.drop(columns=["GDP"])
Y = df_merge_gdp['GDP'].to_numpy()
scaler = MinMaxScaler()
X_norm = scaler.fit_transform(X)
X_train,X_test,Y_train,Y_test = train_test_split(X_norm,Y,test_size=0.3,random_state=236)
model.fit(X_train,Y_train)
Y_pred = model.predict(X_test)

# Model Metrics
print("R^2_Score:",r2_score(Y_test,Y_pred))
print("Mean Absolute Error:",mean_absolute_error(Y_test,Y_pred))

# Feature importances, printed as percentages.
# `total` accumulates the raw importances (avoids shadowing the built-in `sum`).
total = 0
print("\nFeature Importances:")
for name, importance in zip(X.columns, model.feature_importances_):
    print("\t", name + ":", importance * 100)
    total += importance

print("\nSum:", total)

R^2_Score: 0.9981114038963094
Mean Absolute Error: 26001.22811578123

Feature Importances:
	 Share_Price: 0.5441364742942831
	 Inflation_CPI: 1.4068186636229303
	 LT_Interest: 1.2097938638674077
	 Labor_Force: 15.787803761899022
	 Trade_Goverment: 8.176110399621143
	 H_Spend: 30.590589213475937
	 Tax: 17.58065495411976
	 Investment: 24.704092669099524

Sum: 1.0


#### Hypothesis 3:

In [18]:
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import accuracy_score,confusion_matrix
model = ExtraTreesClassifier()
X = df_merge_gni.drop(columns=["GNI"])
Y = df_merge_gni['GNI'].to_numpy()
scaler = MinMaxScaler()
# X = X.iloc[:,:3]
X_norm = scaler.fit_transform(X)
X_train,X_test,Y_train,Y_test = train_test_split(X_norm,Y,test_size=0.3)
model.fit(X_train,Y_train)
Y_pred = model.predict(X_test)
print(accuracy_score(Y_test,Y_pred))
# print(confusion_matrix(Y_test,Y_pred))
print(confusion_matrix(Y_test,Y_pred,labels=df_merge_gni['GNI'].unique()))
print(df_merge_gni['GNI'].unique())


0.9717314487632509
[[259   1   0]
 [  6  16   0]
 [  0   1   0]]
['H' 'UM' 'LM']
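
Because the labels are heavily skewed towards `H`, overall accuracy alone can hide how the smaller classes fare. As a rough check, the sketch below recomputes accuracy and per-class recall directly from the confusion matrix printed above, assuming the scikit-learn convention that rows are true labels and columns are predictions. Only the standard library is used, and the variable names are illustrative.

```python
# Confusion matrix reported above; rows are true labels, columns are predictions.
labels = ["H", "UM", "LM"]
cm = [
    [259, 1, 0],
    [6, 16, 0],
    [0, 1, 0],
]

total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(len(labels)))
accuracy = correct / total

# Recall per class: correct predictions divided by the number of true samples.
recall = {label: cm[i][i] / sum(cm[i]) for i, label in enumerate(labels)}
# Balanced accuracy is the unweighted mean of the per-class recalls.
balanced_accuracy = sum(recall.values()) / len(recall)

print(f"accuracy: {accuracy:.4f}")
for label, value in recall.items():
    print(f"recall[{label}]: {value:.4f}")
print(f"balanced accuracy: {balanced_accuracy:.4f}")
```

The recomputed accuracy matches the reported 0.9717, while recall is about 0.73 for `UM` and 0 for the single `LM` sample, giving a balanced accuracy of about 0.57.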


### Going forward, what are the current biggest problems you’re facing?

1. Elaborating on the initial insights, merging the datasets of all the GDP features (on country and year) causes data loss. Moving ahead, this will pose a problem for our regression model for GDP, as the model will have less data to train on.
2. Furthermore, on merging the GNI factors (FDI inflow and outflow) with all the GDP independent variables, the number of samples returned is low and the labels in the dataset are significantly skewed.
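
One possible way to keep the skewed labels represented in both the training and test sets is a stratified split. The sketch below is only an illustration on toy data, not something the project has adopted. The label counts and the `MIN_SAMPLES` cut-off are hypothetical. Classes with fewer than two samples are filtered out first, since they cannot be stratified.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy labels mimicking a skewed income-group column (counts are illustrative).
df = pd.DataFrame({
    "feature": range(21),
    "GNI": ["H"] * 16 + ["UM"] * 4 + ["LM"] * 1,
})

# Stratification needs at least two samples per class, so drop rarer classes first.
MIN_SAMPLES = 2  # hypothetical cut-off
counts = df["GNI"].value_counts()
kept = df[df["GNI"].isin(counts[counts >= MIN_SAMPLES].index)]

X_train, X_test, y_train, y_test = train_test_split(
    kept[["feature"]], kept["GNI"],
    test_size=0.25, random_state=0, stratify=kept["GNI"],
)
print(y_train.value_counts().to_dict())
print(y_test.value_counts().to_dict())
```

Stratifying preserves the class proportions in each split, but it does not add data: a class with a single sample still cannot be learned or evaluated.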

### Do you think you are on track with your project? If not, what parts do you need to dedicate more time to?

The project is currently on schedule.

### Given your initial exploration of the data, is it worth proceeding with your project, and why? If not, how are you going to change your project and why do you think it’s better than your current results?

After data gathering, cleaning, EDA and initial model results, we feel that the project is worth proceeding with.
However, as highlighted in the previous reflections, we have adjusted our project and research question slightly based on the insights from the initial exploration of our hypotheses.
For Hypothesis 3, we are now predicting the income class of each country instead of its development status.

## Next Steps

### What do you plan to accomplish in the next month and how do you plan to evaluate whether your project achieved the goals you set for it?

## Reference

### List all the resources you used.