## GDP Predictor
This notebook predicts the GDP of a given country for the period 2021 to 2030, using five datasets from The World Bank. In the web application, users select a country and the predicted GDP for that country (from 2021 to 2030) is displayed.

The notebook follows these steps:
1. Import the required libraries
2. Create the necessary user-defined functions
3. Load the data
4. Exploratory Data Analysis
        - Data Preprocessing
        - Data Visualization
5. Prepare feature matrix X and target vector y
6. Create a training and validation set
7. Compare models
8. Select the best model
9. Re-train the best model on the training set
10. Evaluate the model on the validation set
11. Predict using the testing data


#### Dataset

We use five datasets from [The World Bank](https://datacatalog.worldbank.org/home), one for each of the indicators listed below (four features and the target).

Number of Countries: 266

Features of interest from the datasets, for each country:
- Year (from 1980 to 2020)
- The following indicators:
        - Literacy rate, adult total (% of people ages 15 and above)
        - Population, total
        - Mortality rate, infant (per 1,000 live births)
        - Export Value Index (2000 = 100): the current value of exports (f.o.b.) converted to U.S. dollars and expressed as a percentage of the average for the base period (2000)

Target variable indicator:
- GDP (current US$)
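
The World Bank files are wide: one row per country and one column per year. As a minimal sketch of an alternative to the row-by-row extraction used later in this notebook, the toy frame below (made-up placeholder values, not World Bank figures) is reshaped to one row per country and year with `pandas.melt`. The frame `wide` and its contents are assumptions for illustration only.

```python
import pandas as pd

# Toy wide table in the World Bank layout; the numbers are illustrative placeholders.
wide = pd.DataFrame({
    'Country Name': ['A', 'B'],
    '1980': [1.0, 10.0],
    '1981': [2.0, 20.0],
})

# One row per (country, year); the year column headers become values.
long = wide.melt(id_vars='Country Name', var_name='Year', value_name='GDP')
long['Year'] = long['Year'].astype(int)
long = long.sort_values(['Country Name', 'Year']).reset_index(drop=True)
print(long)
```

Repeating the same reshape for each indicator and merging on `Country Name` and `Year` would give one table with all features, which may be simpler to filter by country.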

#### Libraries

In [172]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import cross_val_score, cross_validate, train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
import warnings

#### User Defined Functions

In [173]:
def get_country_data(country_name):
    '''Return the yearly data for the given country, combined from the five data sets.

        The returned dataframe has one row per year and the following columns:
        - Country Name, Year, GDP, Literacy Rate, Mortality Rate, Population and Export Value Index

    '''
    def country_row(data):
        # Yearly values for the country, skipping the 'Country Name' column
        return data.loc[data['Country Name'] == country_name].iloc[0, 1:].tolist()

    GDP = country_row(GDP_data)
    # Build the dataframe for this country only, one row per year
    return pd.DataFrame({
        'Country Name': [country_name] * len(GDP),
        'Year': GDP_data.columns[1:],
        'GDP': GDP,
        'Literacy Rate': country_row(literacy_rate_data),
        'Mortality Rate': country_row(mortality_rate_data),
        'Population': country_row(population_data),
        'Export Value Index': country_row(EVI_data),
    })


def handle_missing_value_train_data():
    '''Fill the missing values in the training set with the column mean, then drop
        any column that is still null (that is, a column whose values are all missing)
        return:
        - the training data

    '''
    # Fill the missing values in the training set with the column mean
    for column in ['Literacy Rate', 'Export Value Index', 'GDP in Ten Years', 'Mortality Rate', 'Population']:
        train_data[column] = train_data[column].fillna(train_data[column].mean())
    # A column that was entirely null has no mean, so it is still null here and is dropped
    train_data.dropna(axis=1, inplace=True)
    return train_data


def handle_missing_value_test_data():
    '''Fill the missing values in the testing set with the column mean, then drop
        any column that is still null (that is, a column whose values are all missing)
        return:
        - the testing data

    '''
    # Fill the missing values in the testing set with the column mean
    for column in ['Literacy Rate', 'Export Value Index', 'Mortality Rate', 'Population']:
        test_data[column] = test_data[column].fillna(test_data[column].mean())
    # A column that was entirely null has no mean, so it is still null here and is dropped
    test_data.dropna(axis=1, inplace=True)
    return test_data


def get_user_input():
    '''Prompt for a country name until the user enters one from the list of countries
        return:
        - the country name

    '''
    countries = GDP_data['Country Name'].unique().tolist()
    while True:
        country_Name = input("Please enter a country name from the country list displayed as written: ")
        if country_Name in countries:
            return country_Name
        # Invalid name: show the message and ask again instead of returning it
        print("You must enter a valid country name from the country list provided.")
    



#### Load the Data

In [174]:
# Read the data: column 0 (Country Name) plus the yearly columns at positions 24 to 64
# Forward slashes avoid invalid escape sequences such as '\A' in the file paths
GDP_data = pd.read_excel('dataset/API_NY.GDP.MKTP.CD_DS2_en_excel_v2_4770502.xls', usecols=np.r_[0, 24:65], skiprows=3)
literacy_rate_data = pd.read_excel('dataset/API_SE.ADT.LITR.ZS_DS2_en_excel_v2_4773710.xls', usecols=np.r_[0, 24:65], skiprows=3)
mortality_rate_data = pd.read_excel('dataset/API_SP.DYN.IMRT.IN_DS2_en_excel_v2_4770604.xls', usecols=np.r_[0, 24:65], skiprows=3)
population_data = pd.read_excel('dataset/API_SP.POP.TOTL_DS2_en_excel_v2_4770385.xls', usecols=np.r_[0, 24:65], skiprows=3)
EVI_data = pd.read_excel('dataset/API_TX.VAL.MRCH.XD.WD_DS2_en_excel_v2_4774581.xls', usecols=np.r_[0, 24:65], skiprows=3)
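
The `usecols=np.r_[0, 24:65]` argument selects columns by position. `np.r_` joins the scalar `0` and the slice `24:65` into a single index array, as the short sketch below prints. That positions 24 to 64 hold the years 1980 to 2020 is an assumption about the layout of these particular files.

```python
import numpy as np

# Position 0 is the country name; positions 24 to 64 are assumed to be the years 1980 to 2020.
cols = np.r_[0, 24:65]

# First few positions, the last position, and the total number of selected columns.
print(cols[:3], cols[-1], len(cols))
```

The array has 42 entries: one name column plus 41 yearly columns, which matches the 1980 to 2020 range described in the Dataset section.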

### Exploratory Data Analysis

#### Data Preprocessing
Choose a country and explore the data for that country.

In [175]:
countries = GDP_data['Country Name'].unique().tolist()
print(countries)
country_name = get_user_input()

['Aruba', 'Africa Eastern and Southern', 'Afghanistan', 'Africa Western and Central', 'Angola', 'Albania', 'Andorra', 'Arab World', 'United Arab Emirates', 'Argentina', 'Armenia', 'American Samoa', 'Antigua and Barbuda', 'Australia', 'Austria', 'Azerbaijan', 'Burundi', 'Belgium', 'Benin', 'Burkina Faso', 'Bangladesh', 'Bulgaria', 'Bahrain', 'Bahamas, The', 'Bosnia and Herzegovina', 'Belarus', 'Belize', 'Bermuda', 'Bolivia', 'Brazil', 'Barbados', 'Brunei Darussalam', 'Bhutan', 'Botswana', 'Central African Republic', 'Canada', 'Central Europe and the Baltics', 'Switzerland', 'Channel Islands', 'Chile', 'China', "Cote d'Ivoire", 'Cameroon', 'Congo, Dem. Rep.', 'Congo, Rep.', 'Colombia', 'Comoros', 'Cabo Verde', 'Costa Rica', 'Caribbean small states', 'Cuba', 'Curacao', 'Cayman Islands', 'Cyprus', 'Czechia', 'Germany', 'Djibouti', 'Dominica', 'Denmark', 'Dominican Republic', 'Algeria', 'East Asia & Pacific (excluding high income)', 'Early-demographic dividend', 'East Asia & Pacific', 'Euro

In [176]:
# To skip the prompt, set the country directly:
# country_name = 'Aruba'

df = get_country_data(country_name)
# Display the data for the selected country
model_data = df.loc[df['Country Name'] == country_name]
model_data.head()

Unnamed: 0,Country Name,Year,GDP,Literacy Rate,Mortality Rate,Population,Export Value Index
0,Canada,1980,273853800000.0,,10.3,24515667.0,
1,Canada,1981,306214900000.0,,9.7,24819915.0,
2,Canada,1982,313506500000.0,,9.2,25116942.0,
3,Canada,1983,340547700000.0,,8.7,25366451.0,
4,Canada,1984,355372600000.0,,8.3,25607053.0,


##### Prepare the training and testing data

In [177]:
# Pair the features of each year from 1980 to 2010 with the GDP ten years later (1990 to 2020)
model_data = model_data.copy()  # work on a copy so the assignment below raises no SettingWithCopyWarning
model_data['Year'] = model_data['Year'].astype(int)
model_GDP = model_data[['Year', 'GDP']]
row_GDP = (model_GDP['Year'] >= 1990) & (model_GDP['Year'] <= 2020)
model_data_GDP = model_GDP.loc[row_GDP, ['GDP']].reset_index(drop=True)
model_data_GDP = model_data_GDP.rename(columns={'GDP': 'GDP in Ten Years'})

country_data = model_data.drop(columns=['GDP'])
row_index = (country_data['Year'] >= 1980) & (country_data['Year'] <= 2010)
columns = ['Year', 'Literacy Rate', 'Population', 'Export Value Index', 'Mortality Rate']
model_data_range = country_data.loc[row_index, columns]

# Save data as training data for the specific country
# Both frames are indexed 0 to 30, so concat aligns them row by row
train_data = pd.concat([model_data_range, model_data_GDP], axis=1)
print(country_name, "Training Data")
train_data.head()

Canada Training Data


Unnamed: 0,Year,Literacy Rate,Population,Export Value Index,Mortality Rate,GDP in Ten Years
0,1980,,24515667.0,,10.3,593929600000.0
1,1981,,24819915.0,,9.7,610328200000.0
2,1982,,25116942.0,,9.2,592387700000.0
3,1983,,25366451.0,,8.7,577170800000.0
4,1984,,25607053.0,,8.3,578139300000.0
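
The target above pairs each year's features with the GDP ten years later. A minimal sketch of the same idea using `Series.shift`, on a toy frame where GDP simply equals the year so the result is easy to check. The frame `toy` is hypothetical, and the sketch assumes one row per consecutive year, sorted by year.

```python
import pandas as pd

# Toy series: GDP equals the year, so the shifted target is easy to check by eye.
toy = pd.DataFrame({'Year': range(1980, 2001)})
toy['GDP'] = toy['Year'].astype(float)

# shift(-10) pulls the value from ten rows later onto the current row.
toy['GDP in Ten Years'] = toy['GDP'].shift(-10)

# The last ten years have no known future value yet, so they are dropped.
train = toy.dropna(subset=['GDP in Ten Years']).drop(columns='GDP')
print(train.tail(3))
```

Compared with slicing two year ranges and concatenating them, the shift keeps features and target in one frame, so there is no reliance on the two indexes lining up.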


In [178]:
# Show the data information for training data
train_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 31 entries, 0 to 30
Data columns (total 6 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Year                31 non-null     int32  
 1   Literacy Rate       0 non-null      float64
 2   Population          31 non-null     float64
 3   Export Value Index  16 non-null     float64
 4   Mortality Rate      31 non-null     float64
 5   GDP in Ten Years    31 non-null     float64
dtypes: float64(5), int32(1)
memory usage: 1.6 KB


In [179]:
# Get the sum of the missing values in the training set
train_data.isna().sum()

Year                   0
Literacy Rate         31
Population             0
Export Value Index    15
Mortality Rate         0
GDP in Ten Years       0
dtype: int64

In [180]:
# Fill the missing values in the training set
train_data = handle_missing_value_train_data()
train_data.head()

Unnamed: 0,Year,Population,Export Value Index,Mortality Rate,GDP in Ten Years
0,1980,24515667.0,107.747811,10.3,593929600000.0
1,1981,24819915.0,107.747811,9.7,610328200000.0
2,1982,25116942.0,107.747811,9.2,592387700000.0
3,1983,25366451.0,107.747811,8.7,577170800000.0
4,1984,25607053.0,107.747811,8.3,578139300000.0
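
In the output above, the all-null `Literacy Rate` column is gone and the missing `Export Value Index` values are filled with the column mean. A minimal sketch of the same mean imputation written as a function that takes the frame as an argument and returns a copy, rather than modifying a global. The name `fill_with_column_means` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

def fill_with_column_means(frame):
    """Return a copy with NaNs replaced by column means; all-NaN columns are dropped."""
    filled = frame.dropna(axis=1, how='all')  # drop columns that contain no data at all
    return filled.fillna(filled.mean(numeric_only=True))

toy = pd.DataFrame({
    'Literacy Rate': [np.nan, np.nan, np.nan],
    'Export Value Index': [np.nan, 100.0, 110.0],
    'Population': [1.0, 2.0, 3.0],
})
result = fill_with_column_means(toy)
print(result)
```

Because the function does not mutate its input, the same code can serve both the training and the testing frame. Note that a single mean flattens any trend in the filled years, which may matter for a yearly series.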


In [181]:
# Prepare the new testing data for the periods from 2011 to 2020
row_index = (model_data['Year'] >= 2011) & (model_data['Year'] <= 2020)
columns = ['Year', 'Literacy Rate', 'Population', 'Export Value Index', 'Mortality Rate']
test_data = model_data.loc[row_index, columns].reset_index(drop=True)
print(country_name, "Testing Data")
test_data

Canada Testing Data


Unnamed: 0,Year,Literacy Rate,Population,Export Value Index,Mortality Rate
0,2011,,34339328.0,163.162287,4.9
1,2012,,34714222.0,164.701312,4.9
2,2013,,35082954.0,165.686768,4.8
3,2014,,35437435.0,172.187604,4.7
4,2015,,35702908.0,148.241911,4.7
5,2016,,36109487.0,140.985986,4.6
6,2017,,36545236.0,152.075042,4.6
7,2018,,37065084.0,162.948281,4.5
8,2019,,37601230.0,161.445246,4.4
9,2020,,38037204.0,141.205834,4.4


In [182]:
# Fill the missing values in the testing set
test_data = handle_missing_value_test_data()
test_data.head()

Unnamed: 0,Year,Population,Export Value Index,Mortality Rate
0,2011,34339328.0,163.162287,4.9
1,2012,34714222.0,164.701312,4.9
2,2013,35082954.0,165.686768,4.8
3,2014,35437435.0,172.187604,4.7
4,2015,35702908.0,148.241911,4.7


#### Data Visualization

In [183]:
plotTarget = sns.displot(data=train_data, x=train_data.Year, y=train_data.Population)
plotTarget.set(title='Population Distribution for Selected Country in Different Years')
plotTarget.set(xlabel='Year', ylabel='Population')


In [184]:
plotTarget1 = sns.displot(data=train_data, x=train_data.Year)
plotTarget1.set(title='Data Distribution for Different Years')
plotTarget1.set(xlabel='Year')


### Machine Learning Model

#### Feature Matrix, X and Target Vector, y

In [185]:
# Drop the target feature from the train data
X = train_data.drop('GDP in Ten Years', axis=1)
y = train_data['GDP in Ten Years']

# Shape and dimension
print("Dimension of X  = {}\nType of X  = {}\n\nDimension of y  = {}\nType of y  = {}".format(X.shape, type(X), y.shape, type(y)))

Dimension of X  = (31, 4)
Type of X  = <class 'pandas.core.frame.DataFrame'>

Dimension of y  = (31,)
Type of y  = <class 'pandas.core.series.Series'>


##### Correlation heatmap of features 

Pair-wise correlation shows whether two features are related and therefore carry similar information. Models generally benefit most from uncorrelated features.

In [186]:
dataMap = sns.heatmap(X.corr(), vmin= -1, vmax= 1, annot= True, cmap='BrBG')
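
To make the heatmap values concrete, here is a small sketch on synthetic data (not the World Bank features): column `b` is a noisy copy of `a`, while `c` is generated independently. `DataFrame.corr` computes the Pearson correlation by default.

```python
import numpy as np
import pandas as pd

# Synthetic features: 'b' is a noisy copy of 'a'; 'c' is generated independently.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    'a': a,
    'b': a + 0.1 * rng.normal(size=200),
    'c': rng.normal(size=200),
})

corr = toy.corr()  # Pearson correlation by default
print(corr.round(2))
```

A value near 1 or -1 (as for `a` and `b`) suggests that the pair carries largely redundant information, while a value near 0 (as for `a` and `c`) suggests little linear relationship.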

In [187]:
# Box plot of the features
plot = sns.boxplot(data=X)

In [188]:
# Create a training and a validation set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=298)
X_train.shape, X_test.shape

((24, 4), (7, 4))
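
`train_test_split` shuffles the rows by default, so the validation years above are interleaved with the training years. For yearly data, one alternative (not what this notebook does) is a chronological split, where the most recent years are held out. A minimal sketch on placeholder data, with hypothetical names `X_toy` and `y_toy`:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder data: 31 yearly rows, the same size as the training frame above.
X_toy = pd.DataFrame({'Year': np.arange(1980, 2011)})
y_toy = pd.Series(np.arange(31, dtype=float))

# shuffle=False keeps the row order, so the last 20% of years form the validation set.
X_tr, X_val, y_tr, y_val = train_test_split(X_toy, y_toy, test_size=0.2, shuffle=False)
print(X_tr['Year'].max(), X_val['Year'].min())
```

Holding out the latest years gives a validation score that is closer to the real task of predicting years the model has not seen.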

#### Compare model

In [189]:
# Compare the models using the mean 5-fold cross-validated R2 score on the training set
init_models = { 'Linear Regression': LinearRegression(),
                'Random forest': RandomForestRegressor(random_state=64),
                'Gradient Boosting Regressor': GradientBoostingRegressor(random_state=79),
               }
R2 = []
models_names = []
for key, model in init_models.items():
    models_names.append(key)
    # cross_val_score fits clones of the model, so no separate fit is needed here
    R2.append(np.mean(cross_val_score(model, X_train, y_train, cv=5)))
models_scores = pd.DataFrame({'Model Name': models_names, 'R2 Score': R2})
models_scores.head()

Unnamed: 0,Model Name,R2 Score
0,Linear Regression,0.845649
1,Random forest,0.896805
2,Gradient Boosting Regressor,0.898047
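
The scores above are means over five folds. For regressors, `cross_val_score` uses the estimator's own `score` method when no `scoring` is given, which is R2. The sketch below, on synthetic data with hypothetical names, shows the per-fold scores and confirms that the default matches `scoring='r2'`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic, nearly linear data; not related to the GDP features.
rng = np.random.default_rng(1)
X_toy = rng.uniform(0, 10, size=(40, 1))
y_toy = 3.0 * X_toy[:, 0] + rng.normal(scale=0.5, size=40)

# One R2 score per fold; the default scoring for a regressor is R2.
default_scores = cross_val_score(LinearRegression(), X_toy, y_toy, cv=5)
r2_scores = cross_val_score(LinearRegression(), X_toy, y_toy, cv=5, scoring='r2')
print(default_scores.round(3), default_scores.mean().round(3))
```

With only 24 training rows, each fold in the notebook holds four or five rows, so the per-fold scores can vary widely. Looking at the individual scores as well as the mean can help when two models score as closely as the two tree ensembles do above.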


In [190]:
# Choose the best model with the highest R2 score
models_scores.sort_values('R2 Score', ascending=False, inplace=True)
best_model = models_scores.iloc[0]

print('Best Model:')
print(best_model)

Best Model:
Model Name    Gradient Boosting Regressor
R2 Score                         0.898047
Name: 2, dtype: object


In [191]:
# Re-train the best model on the training set
best_model_name = best_model['Model Name']
best_model = init_models[best_model_name]
best_model.fit(X_train, y_train)

In [192]:
# Evaluate the re-trained model on the training and validation sets
y_pred_train = best_model.predict(X_train)
y_pred_test = best_model.predict(X_test)

# Create a dataframe with the R2 scores for the best model
best_model_scores = pd.DataFrame({'Model Name': [best_model_name],
                                  'R2 Train': [r2_score(y_train, y_pred_train)],
                                  'R2 Test': [r2_score(y_test, y_pred_test)]})
best_model_scores

#### Predict using the test data

In [193]:
prediction = best_model.predict(test_data)
prediction

array([1.72102102e+12, 1.72102102e+12, 1.72102102e+12, 1.72102102e+12,
       1.61274884e+12, 1.61274884e+12, 1.61274884e+12, 1.72102102e+12,
       1.72102102e+12, 1.61274884e+12])

In [194]:
# Build the results dataframe: each prediction is for ten years after the feature year
conclusion = pd.DataFrame()
conclusion['Year'] = test_data['Year'] + 10
conclusion['GDP'] = prediction
conclusion

Unnamed: 0,Year,GDP
0,2021,1721021000000.0
1,2022,1721021000000.0
2,2023,1721021000000.0
3,2024,1721021000000.0
4,2025,1612749000000.0
5,2026,1612749000000.0
6,2027,1612749000000.0
7,2028,1721021000000.0
8,2029,1721021000000.0
9,2030,1612749000000.0
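
The predictions above take only two distinct values. One possible explanation, offered as an assumption rather than something this notebook verifies, is that tree-based regressors cannot extrapolate: for inputs beyond the training range, every tree routes the row to the same outermost leaf. The sketch below illustrates the effect on a synthetic upward trend (not real GDP data) and compares it with a linear model.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression

# Synthetic upward trend on x = 0..30; not real GDP data.
X_toy = np.arange(31, dtype=float).reshape(-1, 1)
y_toy = 2.0 * X_toy[:, 0] + 5.0

# Ten inputs that all lie outside the training range.
X_future = np.arange(31, 41, dtype=float).reshape(-1, 1)

gbr = GradientBoostingRegressor(random_state=0).fit(X_toy, y_toy)
lin = LinearRegression().fit(X_toy, y_toy)

gbr_future = gbr.predict(X_future)  # the same value for every future input
lin_future = lin.predict(X_future)  # continues the trend
print(gbr_future.round(1))
print(lin_future.round(1))
```

If this is what happens here, the forecast for 2021 to 2030 mostly reflects GDP levels already seen in training, and a model that can follow a trend may be worth comparing on a chronological validation split.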


In [195]:
# Save model prediction as a csv file
conclusion.to_csv(country_name+'_prediction.csv', index=False)