## Importing necessary libraries

In [None]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import joblib

## Initial Check on Dataset

In [None]:
df = pd.read_csv("world (1).csv")
df.head()

In [None]:
df.info()

The dataset has 20 columns and 227 entries.

In [None]:
df.describe()

Only three columns have a numeric dtype so far. The `df.info()` output above shows that most columns are stored as `object`, so they need to be converted.

In [None]:
## Changing the Datatype

for col in ['Country', 'Region']:
    df[col] = df[col].astype('category')
    
for col in ['Pop. Density (per sq. mi.)', 'Coastline (coast/area ratio)','Net migration','Infant mortality (per 1000 births)','Literacy (%)','Phones (per 1000)','Arable (%)','Crops (%)','Other (%)','Climate','Birthrate','Deathrate','Agriculture','Industry','Service']:
    df[col] = df[col].astype('str')
    df[col] = df[col].str.replace(",",".").astype(float)   


The `Country` and `Region` columns are converted to the **category** dtype, while the remaining numeric columns, which were read as strings because they use a decimal comma, are converted to **float**. The category dtype in pandas is a hybrid: it looks and behaves like a string in many situations, but internally it is stored as an array of integer codes. This makes storage more efficient and allows the values to be sorted in a custom order.
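As a minimal sketch of both conversions, the toy frame below (made-up values, not the real dataset) shows the integer codes behind a categorical column and the decimal-comma parsing. The names `toy`, `region_levels`, and `region_codes` are illustrative only.

```python
import pandas as pd

# Toy stand-in for the real data; the values are illustrative only.
toy = pd.DataFrame({
    "Region": ["ASIA", "OCEANIA", "ASIA", "BALTICS"],
    "Birthrate": ["46,6", "15,11", "22,46", None],
})

# Categorical column: each label is stored once, rows hold small integer codes.
toy["Region"] = toy["Region"].astype("category")
region_levels = toy["Region"].cat.categories.tolist()
region_codes = toy["Region"].cat.codes.tolist()

# Decimal-comma strings: swap the comma for a point, then parse as numbers.
# Missing entries stay missing (NaN) after the conversion.
toy["Birthrate"] = pd.to_numeric(
    toy["Birthrate"].str.replace(",", ".", regex=False)
)

print(region_levels, region_codes)
print(toy.dtypes)
```

The codes index into the sorted list of labels, which is why repeated regions cost very little memory.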

In [None]:
df.info()

In [None]:
df.describe()

## Understanding more about the Dataset

The meaning of the values in a few columns (Climate, Agriculture, Industry, and Service) is not documented, so we need to look at them more closely.

In [None]:
df.loc[:, ['Country', 'Region', 'Climate', 'Agriculture', 'Industry', 'Service']].head()

It looks like the Agriculture, Industry, and Service columns give the share of a country's economy (GDP) contributed by each sector. To understand the Climate column, we can look at its distinct values and see which countries share the same value.

In [None]:
df['Climate'].unique()

In [None]:
h = {}
for cat in [1, 2, 3, 4, 1.5, 2.5]:
    h[cat] = df.loc[:, ['Country', 'Region', 'Climate']][df['Climate'] == cat].head()

pd.concat([h[1], h[2], h[3], h[4], h[1.5], h[2.5]])


A guess at what the categories mean:

**1**   - Hot, desert-like countries.\
**1.5** - Countries that are both hot and tropical.\
**2**   - Countries with a tropical climate.\
**2.5** - Countries that are both cold and tropical.\
**3**   - Countries with a cold climate.\
**4**   - These countries also seem to have a cold climate. It is not documented why they are separated from category 3.

## Data Cleaning

In [None]:
## Finding the number and percentage of null values in each column

num_missing = df.isnull().sum()
pct_missing = (df.isnull().mean() * 100).round(2)
missing_value_df = pd.DataFrame({'Column_name': df.columns, 'num_missing': num_missing, 'pct_missing': pct_missing})
missing_value_df

Only a small percentage of the data in each column is missing. A heatmap gives another view of where the gaps are.

In [None]:
sns.set(rc={'figure.figsize':(11,8)})
sns.heatmap(df.isnull()).set(title = 'Missing Data', xlabel = 'Columns', ylabel = 'Data Points')

Only a few values are **NULL**, and they fall in these columns: **{"Net Migration", "Infant Mortality", "GDP", "Literacy", "Phones", "Arable", "Crops", "Other", "Climate", "Birthrate", "Deathrate", "Agriculture", "Industry", "Service"}**. The rows with these values are handled in the cells below.
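To put a number on "only a few", the share of missing values per column can be computed with `isnull().mean()`. This is a minimal sketch on a toy frame with made-up values; `missing_summary` is an illustrative name.

```python
import numpy as np
import pandas as pd

# Toy frame with a few gaps; the values are illustrative only.
toy = pd.DataFrame({
    "Climate": [1.0, np.nan, 3.0, np.nan],
    "Literacy (%)": [36.0, 86.5, np.nan, 97.0],
    "Population": [100, 200, 300, 400],
})

# isnull() gives booleans, so sum() counts gaps and mean() gives their share.
missing_summary = pd.DataFrame({
    "num_missing": toy.isnull().sum(),
    "pct_missing": toy.isnull().mean() * 100,
}).sort_values("pct_missing", ascending=False)

print(missing_summary)
```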

In [None]:
## Checking Rows in which null values are present for each column

df1 = df[df['Net migration'].isna()]
df1

## Changes suggested for these Rows with NaN values

| Feature | Number of missing values | Proposed change |
|:----------|:-------------:|------:|
| Net migration | 3 | Belong to very small nations. Change to 0. |
| Infant mortality (per 1000 births) | 3 | Belong to very small nations. Change to 0. |
| GDP ($ per capita) | 1 | From a Google search, it is \$2500. Change to that value. |
| Literacy (\%) | 18 | Replace with the mean literacy of the country's region. |
| Phones (per 1000) | 4 | Replace with the mean phones value of the country's region. |
| Arable (\%) | 2 | Very small islands. Change to 0. |
| Crops (\%) | 2 | Very small islands. Change to 0. |
| Other (\%) | 2 | Very small islands. Change to 0. |
| Climate | 22 | Change to 0, which represents an "unknown" category. |
| Birthrate | 3 | Replace with the region's mean rate. |
| Deathrate | 4 | Replace with the region's mean rate. |
| Agriculture | 15 | Estimate based on similar countries. Change to 0.15. |
| Industry | 16 | Estimate based on similar countries. Change to 0.05. |
| Service | 15 | Estimate based on similar countries. Change to 0.8. |

In [None]:
# Columns filled with a fixed value
change1 = [("Net migration", 0), ("Infant mortality (per 1000 births)", 0), ("GDP ($ per capita)", 2500), ("Arable (%)", 0), ("Crops (%)", 0),("Other (%)",0),("Climate",0),("Agriculture",0.15), ("Industry", 0.05), ("Service", 0.8) ]
for col, value in change1:
    # Assign the result back: chained fillna(..., inplace=True) may not update df in recent pandas
    df[col] = df[col].fillna(value)

# Columns filled with the mean of the country's region
change2 = ["Literacy (%)", "Phones (per 1000)", "Birthrate", "Deathrate"]
for col in change2:
    df[col] = df[col].fillna(df.groupby('Region', observed=True)[col].transform('mean'))

In [None]:
print(df.isnull().sum())
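The region-mean rule is the least obvious of the fills, so here is a minimal sketch of it on a toy frame (made-up regions and values). `groupby(...).transform('mean')` returns one value per row, aligned with the original index, which is what lets it be passed straight to `fillna`.

```python
import numpy as np
import pandas as pd

# Toy frame: two regions, one gap in each. The values are illustrative only.
toy = pd.DataFrame({
    "Region": ["A", "A", "A", "B", "B"],
    "Literacy (%)": [80.0, 90.0, np.nan, 50.0, np.nan],
})

# Per-row mean of the row's own region (NaN values are skipped by mean).
region_mean = toy.groupby("Region")["Literacy (%)"].transform("mean")

# Only the missing entries are replaced; existing values are left untouched.
toy["Literacy (%)"] = toy["Literacy (%)"].fillna(region_mean)

print(toy)
```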

# EDA
## Correlation Heatmap

In [None]:
fig, ax = plt.subplots(figsize=(16,16)) 

# Only take numeric columns to avoid string-to-float conversion errors
numeric_df = df.select_dtypes(include='number')

sns.heatmap(numeric_df.corr(), annot=True, ax=ax, cmap='Spectral').set(
    title='Feature Correlation', xlabel='Columns', ylabel='Columns'
)
plt.show()


## Insights
**Expected Strong Correlation between :** 
1. Infant mortality and Birthrate 
2. GDP per capita and Phones

**Expected Strong Anticorrelation between:**
1. Infant mortality and Literacy
2. Arable and Other 
3. Birthrate and Literacy

**Unexpected Strong Correlation between:**
1. Infant mortality and Agriculture

**Unexpected Strong Anticorrelation between:**
1. Birthrate and Phones
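Reading pairs off a large heatmap is error-prone, so the same information can be pulled out of the correlation matrix directly. The sketch below uses a toy frame with made-up values and short column names; the 0.3 cut-off mirrors the threshold this notebook uses later for feature selection, and `corr_with_gdp` and `selected` are illustrative names.

```python
import pandas as pd

# Toy frame: one feature rising with GDP, one falling, one unrelated.
toy = pd.DataFrame({
    "GDP": [2, 4, 6, 8, 10],
    "Phones": [1, 2, 3, 4, 5],
    "Birthrate": [5, 4, 3, 2, 1],
    "Noise": [1, -1, 1, -1, 1],
})

# Correlation of every other column with the target column.
corr_with_gdp = toy.corr()["GDP"].drop("GDP")

# Keep features whose absolute correlation with the target exceeds 0.3.
selected = corr_with_gdp[corr_with_gdp.abs() > 0.3].index.tolist()

print(corr_with_gdp)
print(selected)
```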

In [None]:
f = sns.pairplot(df[['Population', 'Area (sq. mi.)', 'Net migration', 'GDP ($ per capita)', 'Climate']], hue = "Climate")
f.fig.suptitle('Feature Relations')
plt.show()

There is a fair correlation between GDP and migration, which makes sense, since migrants tend to move to countries with better opportunities and higher GDP per capita.

## Regional Analysis

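Checking the number of countries in each region, along with GDP per capita, net migration, and population, to get some insights. Note that the bar plots show the mean value per country within each region, not regional totals.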

In [None]:
fig = plt.figure(figsize=(15, 20))
# Use a figure-level title; plt.title here would create an extra axes behind the subplots
fig.suptitle('Regional Analysis')
ax1 = fig.add_subplot(4, 1, 1)
ax2 = fig.add_subplot(4, 1, 2)
ax3 = fig.add_subplot(4, 1, 3)
ax4 = fig.add_subplot(4, 1, 4)
sns.countplot(data= df, y= 'Region', ax= ax1, palette="flare")
sns.barplot(data= df, y= 'Region', x= 'GDP ($ per capita)', ax= ax2, palette="flare", ci= None)
sns.barplot(data= df, y= 'Region', x= 'Net migration', ax= ax3, palette="flare", ci= None)
sns.barplot(data= df, y= 'Region', x= 'Population', ax= ax4, palette="flare", ci= None)
plt.show()

## Insights
1. Sub-Saharan Africa and Latin America & Caribbean regions have the most countries.
2. Western Europe and North America have the highest GDP per capita, while Sub-Saharan Africa has the lowest GDP per capita.
3. Asia, North America, and North Europe are the main regions that receive migrants from other regions.
4. Asia has the largest mean population per country, and Oceania the smallest.
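Because `sns.barplot` aggregates with the mean by default, the bars above are per-country averages. A minimal sketch on a toy frame (made-up values) shows how the mean and the regional total can tell different stories; `per_region` is an illustrative name.

```python
import pandas as pd

# Toy frame: one region with few large countries, one with many small ones.
toy = pd.DataFrame({
    "Region": ["ASIA", "ASIA", "OCEANIA", "OCEANIA", "OCEANIA"],
    "Population": [1000, 200, 30, 20, 10],
})

# count = number of countries, mean = what the bar plot shows, sum = regional total.
per_region = toy.groupby("Region")["Population"].agg(["count", "mean", "sum"])

print(per_region)
```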

# GDP Analysis

This section studies the relationship between GDP per capita and infant mortality, literacy, and arable land.

In [None]:
sns.jointplot(data= df, x= 'Literacy (%)', y= 'GDP ($ per capita)', kind= "hist",color='coral')
sns.jointplot(data= df, x= 'Arable (%)', y= 'GDP ($ per capita)', kind= "hist", color='coral')
sns.jointplot(data= df, x= 'Infant mortality (per 1000 births)', y= 'GDP ($ per capita)', kind= "hist",color='coral')
plt.show()

## Analysis 

1. The higher a country's GDP per capita, the more literate its population tends to be, and vice versa.
2. There is no clear relationship between GDP per capita and the \% of arable land, which suggests that agriculture is not the strongest economic factor.
3. Poorer countries suffer more from infant mortality.

## Data Preprocessing

1. Transform the 'Region' column into numerical values.
2. Split the dataset into training and testing parts (80/20).
3. Prepare four variants for comparison: with/without feature selection, and with/without feature scaling.

In [None]:
# Importing libraries

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

In [None]:
# Transforming Region into one-hot (dummy) columns

df_final = pd.concat([df,pd.get_dummies(df['Region'], prefix='Region')], axis=1).drop(['Region'],axis=1)
print(df_final.info())

Now it has 227 entries and 30 Columns.
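The column count grows because each distinct region becomes its own indicator column. A minimal sketch on a toy frame (made-up rows; `encoded` is an illustrative name):

```python
import pandas as pd

# Toy frame with two distinct regions; the values are illustrative only.
toy = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Region": ["ASIA", "OCEANIA", "ASIA"],
    "Climate": [1.0, 2.0, 3.0],
})

# One indicator column per region, named "Region_<label>".
dummies = pd.get_dummies(toy["Region"], prefix="Region")

# Replace the original column with its indicator columns.
encoded = pd.concat([toy.drop(columns="Region"), dummies], axis=1)

print(encoded)
```

Each row has exactly one indicator set, so the number of added columns equals the number of distinct regions.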

In [None]:
# Without scaling , the full dataset
y = df_final["GDP ($ per capita)"]
X = df_final.drop(["GDP ($ per capita)",'Country'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=101)

# With Scaling
sc_X = StandardScaler()
X2_train = sc_X.fit_transform(X_train)
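# Reuse the statistics learned from the training split; refitting on the test split would scale it differently
X2_test = sc_X.transform(X_test)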
y2_train = y_train
y2_test = y_test

# Without scaling, Feature selected Dataset (corr > +/-0.3)
y3 = y
X3 = df_final.drop(['GDP ($ per capita)','Country','Population', 'Area (sq. mi.)', 'Coastline (coast/area ratio)', 'Arable (%)',
                      'Crops (%)', 'Other (%)', 'Climate', 'Deathrate', 'Industry'], axis=1)
X3_train, X3_test, y3_train, y3_test = train_test_split(X3, y3, test_size=0.2, random_state=101)


# With scaling
sc_X4 = StandardScaler()
X4_train = sc_X4.fit_transform(X3_train)
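# Reuse the statistics learned from the training split; refitting on the test split would scale it differently
X4_test = sc_X4.transform(X3_test)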
y4_train = y3_train
y4_test = y3_test
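Scaling statistics are normally learned from the training split only and then applied unchanged to the test split, so that both splits go through the same transformation. The sketch below illustrates this on synthetic data; `X_toy`, `y_toy`, and the other names are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features on an arbitrary scale; the values are illustrative only.
rng = np.random.default_rng(0)
X_toy = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=100)

Xtr, Xte, ytr, yte = train_test_split(X_toy, y_toy, test_size=0.2, random_state=101)

# Learn mean and standard deviation from the training split only.
scaler = StandardScaler().fit(Xtr)

# Apply the same learned statistics to both splits.
Xtr_scaled = scaler.transform(Xtr)
Xte_scaled = scaler.transform(Xte)

print(scaler.mean_)
print(Xte_scaled.mean(axis=0))  # close to, but generally not exactly, zero
```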

# Linear Regression 

Linear regression is tried first to check whether a linear relationship exists. A model is trained on each of the four datasets, predictions are made, and the results are evaluated to see whether feature selection or feature scaling brings any improvement.


In [None]:
# Model Training
lm1 = LinearRegression()
lm1.fit(X_train,y_train)

lm2 = LinearRegression()
lm2.fit(X2_train,y2_train)

lm3 = LinearRegression()
lm3.fit(X3_train,y3_train)

lm4 = LinearRegression()
lm4.fit(X4_train,y4_train)

# Predictions
lm1_pred = lm1.predict(X_test)
lm2_pred = lm2.predict(X2_test)
lm3_pred = lm3.predict(X3_test)
lm4_pred = lm4.predict(X4_test)

# Evaluation Function 
def eval(cond, y, pred):
    print(cond)
    print("____________________________\n")
    print('MAE:', metrics.mean_absolute_error(y, pred))
    print('RMSE:', np.sqrt(metrics.mean_squared_error(y, pred)))
    print('R2_Score: ', metrics.r2_score(y, pred))
    print("*****************************\n\n")
    
eval("All features, No scaling:",y_test,lm1_pred)
eval("\nAll features, with scaling:",y2_test,lm2_pred)
eval("\nSelected features, No scaling:",y3_test,lm3_pred)
eval("\nSelected features, with scaling:",y4_test,lm4_pred)  

## Analysis
1. **Feature selection** reduces the errors, so it is worth applying for this model.
2. **Feature scaling** has no significant effect on prediction performance. This is expected, since ordinary least squares predictions generally do not depend on the scale of the features.
3. Decent predictions are obtained with both **selection** and **scaling**.
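For reference, the three metrics printed by the evaluation function can be worked out by hand. The sketch below does so with NumPy on four made-up values; `y_true` and `y_pred` are illustrative, not notebook outputs.

```python
import numpy as np

# Made-up true values and predictions, chosen so the arithmetic is easy to follow.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 10.0])

residuals = y_true - y_pred                      # [1, 0, -1, -1]

mae = np.mean(np.abs(residuals))                 # mean absolute error
rmse = np.sqrt(np.mean(residuals ** 2))          # root mean squared error

ss_res = np.sum(residuals ** 2)                  # unexplained variation: 3
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean: 20
r2 = 1.0 - ss_res / ss_tot                       # coefficient of determination

print(mae, rmse, r2)
```

MAE and RMSE are in the units of the target (dollars per capita here), while R2 is unitless, which makes it the easier metric to compare across datasets.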

# SVM

In [None]:
# Model Training
svm1 = SVR(kernel='rbf')
svm1.fit(X_train,y_train)

svm2 = SVR(kernel='rbf')
svm2.fit(X2_train,y2_train)

svm3 = SVR(kernel='rbf')
svm3.fit(X3_train,y3_train)

svm4 = SVR(kernel='rbf')
svm4.fit(X4_train,y4_train)

# Predictions
svm1_pred = svm1.predict(X_test)
svm2_pred = svm2.predict(X2_test)
svm3_pred = svm3.predict(X3_test)
svm4_pred = svm4.predict(X4_test)

# Evaluation
eval("All features, No scaling:",y_test,svm1_pred)
eval("\nAll features, with scaling:",y2_test,svm2_pred)
eval("\nSelected features, No scaling:",y3_test,svm3_pred)
eval("\nSelected features, with scaling:",y4_test,svm4_pred)  

## Analysis

1. **Feature scaling** and **feature selection** made almost no difference to the prediction performance of the SVM.

2. The **SVM results are worse than those of linear regression (LR)**.

## Optimising SVM
Using **Grid Search**

In [None]:
param_grid = {'C': [1, 10, 100], 'gamma': [0.01,0.001,0.0001], 'kernel': ['rbf']} 
grid = GridSearchCV(SVR(),param_grid,refit=True,verbose=3)
grid.fit(X4_train,y4_train)

In [None]:
print("Best Parameters are : {}".format(grid.best_params_))
print("Best Estimators are : {}".format(grid.best_estimator_))
grid_predictions = grid.predict(X4_test)
eval("\nSelected features, with scaling:",y4_test,grid_predictions)

Performance has **improved, but it is still lower** than that of LR.
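One possible contributor, offered here as an assumption that this notebook does not test, is the scale of the target: GDP per capita is in the thousands, while the SVR defaults (`C=1.0`, `epsilon=0.1`) suit targets near unit scale. The sketch below uses synthetic data to compare an SVR on the raw target with one wrapped in scikit-learn's `TransformedTargetRegressor`, which standardizes the target for fitting and converts predictions back. All data and variable names are illustrative.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic target on a "thousands" scale; the values are illustrative only.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = 10000 + 4000 * X_toy[:, 0] - 2000 * X_toy[:, 1] + rng.normal(scale=500, size=200)

Xtr, Xte, ytr, yte = train_test_split(X_toy, y_toy, test_size=0.2, random_state=101)

# SVR with scaled features but the raw, large-valued target.
raw_svr = make_pipeline(StandardScaler(), SVR(kernel="rbf")).fit(Xtr, ytr)

# Same model, but the target is standardized during fitting and restored at predict time.
scaled_target_svr = TransformedTargetRegressor(
    regressor=make_pipeline(StandardScaler(), SVR(kernel="rbf")),
    transformer=StandardScaler(),
).fit(Xtr, ytr)

r2_raw = r2_score(yte, raw_svr.predict(Xte))
r2_scaled = r2_score(yte, scaled_target_svr.predict(Xte))
print(r2_raw, r2_scaled)
```

Whether this helps on the real data would need to be checked. An alternative is to extend the grid to much larger values of `C`.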

## Random Forest

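Random forests split on feature thresholds, so rescaling the features does not meaningfully change the fitted trees. The scaled datasets are therefore not analysed here.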

In [None]:
# Model Training
rf1 = RandomForestRegressor(random_state=101, n_estimators=200)
rf3 = RandomForestRegressor(random_state=101, n_estimators=200)
rf1.fit(X_train, y_train)
rf3.fit(X3_train, y3_train)

# Prediction
rf1_pred = rf1.predict(X_test)
rf3_pred = rf3.predict(X3_test)

# Evaluation
eval("All features, No scaling:",y_test,rf1_pred)
eval("\nSelected features, No scaling:",y3_test,rf3_pred)

## Optimising Random Forest

**Grid search** is used to find good hyperparameters. Only the following are tuned: `n_estimators`, `min_samples_leaf`, `max_features`, and `bootstrap`.

In [None]:
## Choosing params
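# 1.0 means "use all features", which is what 'auto' meant for regressors before it was removed from scikit-learn
rf_param_grid = {'max_features': ['sqrt', 1.0],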
              'min_samples_leaf': [1, 3, 5],
              'n_estimators': [100, 500, 1000],
             'bootstrap': [False, True]} 

rf_grid = GridSearchCV(estimator= RandomForestRegressor(), param_grid = rf_param_grid,  n_jobs=-1, verbose=0)
rf_grid.fit(X_train,y_train)

In [None]:
print("Best Parameters are : {}".format(rf_grid.best_params_))
print("Best Estimators are : {}".format(rf_grid.best_estimator_))
rf_grid_predictions = rf_grid.predict(X_test)
eval("\nAll features, no scaling:",y_test,rf_grid_predictions)

## Analysis

1. Tuning the RF regressor with grid search **has not changed its performance** significantly.
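One way to see why tuning changes so little is to look at the cross-validated scores of every grid point, not only the best one. The sketch below runs a deliberately tiny grid on synthetic data; the data, the grid values, and the names `toy_grid` and `cv_table` are illustrative.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression problem standing in for the real data.
X_toy, y_toy = make_regression(n_samples=120, n_features=5, noise=10.0, random_state=101)

toy_grid = GridSearchCV(
    RandomForestRegressor(random_state=101),
    param_grid={"n_estimators": [20, 50], "min_samples_leaf": [1, 5]},
    cv=3,
)
toy_grid.fit(X_toy, y_toy)

# One row per parameter combination, best-ranked first.
cv_table = (
    pd.DataFrame(toy_grid.cv_results_)
    [["params", "mean_test_score", "std_test_score", "rank_test_score"]]
    .sort_values("rank_test_score")
)
print(cv_table)
```

If the gaps between `mean_test_score` values are smaller than the `std_test_score` spread across folds, the grid points are hard to tell apart and tuning is unlikely to move the test metrics much.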

# Gradient Boosting

In [None]:
# Model training
gbm1 = GradientBoostingRegressor(learning_rate=0.1, n_estimators=100, min_samples_split=2, min_samples_leaf=1, max_depth=3,
                                 subsample=1.0, max_features= None, random_state=101)
gbm3 = GradientBoostingRegressor(learning_rate=0.1, n_estimators=100, min_samples_split=2, min_samples_leaf=1, max_depth=3,
                                 subsample=1.0, max_features= None, random_state=101)

gbm1.fit(X_train, y_train)
gbm3.fit(X3_train, y3_train)

# Prediction
gbm1_pred = gbm1.predict(X_test)
gbm3_pred = gbm3.predict(X3_test)

# Evaluation
eval("All features, No scaling:",y_test,gbm1_pred)
eval("\nSelected features, No scaling:",y3_test,gbm3_pred)

# Optimising GBM
Grid search is used to find good hyperparameters. Only the following are tuned: `n_estimators`, `learning_rate`, `max_depth`, `subsample`, `min_samples_leaf`, `min_samples_split`, and `max_features`.

In [None]:
## Choosing params
gbm_param_grid = {'learning_rate':[1,0.1, 0.01, 0.001], 
           'n_estimators':[100, 500, 1000],
          'max_depth':[3, 5, 8],
          'subsample':[0.7, 1], 
          'min_samples_leaf':[1, 20],
          'min_samples_split':[10, 20],
          'max_features':[4, 7]}

gbm_tuning = GridSearchCV(estimator =GradientBoostingRegressor(random_state=101),
                          param_grid = gbm_param_grid,
                          n_jobs=-1,
                          cv=5)
gbm_tuning.fit(X_train,y_train)

In [None]:
print("Best Parameters are : {}".format(gbm_tuning.best_params_))
print("Best Estimators are : {}".format(gbm_tuning.best_estimator_))
gbm_grid_predictions = gbm_tuning.predict(X_test)
eval("\nAll features, no scaling:",y_test,gbm_grid_predictions)

## Analysis

1. Gradient Boosting **performed well** even before optimisation.
2. Grid search **actually decreased GBM performance** slightly. Overall, GBM performs similarly to Random Forest on this dataset.
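With only 227 rows, a single 80/20 split leaves about 46 test samples, so small differences between models may be noise. Cross-validation gives a spread of scores to compare against. The sketch below uses `cross_val_score` on synthetic data; the data and the `models` dictionary are illustrative, not the notebook's fitted models.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic regression problem standing in for the real data.
X_toy, y_toy = make_regression(n_samples=150, n_features=6, noise=15.0, random_state=101)

models = {
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=101),
    "Gradient Boosting": GradientBoostingRegressor(random_state=101),
}

# Five R2 scores per model, one for each held-out fold.
cv_scores = {
    name: cross_val_score(model, X_toy, y_toy, cv=5, scoring="r2")
    for name, model in models.items()
}

for name, scores in cv_scores.items():
    print(f"{name}: mean R2 = {scores.mean():.3f}, std = {scores.std():.3f}")
```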

In [None]:
## Conclusion Plots 
fig, axs = plt.subplots(3, 2, figsize=(16,15))
axs[0, 0].scatter(y4_test,lm4_pred,color='coral', linewidths=2, edgecolors='k')
axs[0, 0].set_title('Linear Regression Prediction Performance (features selected and scaled)')
axs[0, 1].scatter(y4_test,grid_predictions,color='coral', linewidths=2, edgecolors='k')
axs[0, 1].set_title('Optimized SVM prediction Performance (with feature selection, and scaling)')
axs[1, 0].scatter(y_test,rf1_pred,color='coral', linewidths=2, edgecolors='k')
axs[1, 0].set_title('Random Forest prediction Performance (No feature selection)')
axs[1, 1].scatter(y_test,rf_grid_predictions,color='coral', linewidths=2, edgecolors='k')
axs[1, 1].set_title('Optimized Random Forest prediction Performance (No feature selection)')
axs[2, 0].scatter(y_test,gbm1_pred,color='coral', linewidths=2, edgecolors='k')
axs[2, 0].set_title('Gradient Boosting prediction Performance (No feature selection)')
axs[2, 1].scatter(y_test,gbm_grid_predictions,color='coral', linewidths=2, edgecolors='k')
axs[2, 1].set_title('Optimized Gradient Boosting prediction Performance')

for ax in axs.flat:
    ax.set(xlabel='True GDP per Capita', ylabel='Predictions')

# Hide x labels and tick labels for top plots and y ticks for right plots.
for ax in axs.flat:
    ax.label_outer()

**Random Forest shows the best prediction performance**

## Feature Importance



In [None]:
gbm_opt = GradientBoostingRegressor(learning_rate=0.01, n_estimators=500,max_depth=5, min_samples_split=10, min_samples_leaf=1, 
                                    subsample=0.7,max_features=7, random_state=101)
gbm_opt.fit(X_train,y_train)
feat_imp2 = pd.Series(gbm_opt.feature_importances_, list(X_train)).sort_values(ascending=False)
fig = plt.figure(figsize=(12, 6))
feat_imp2.plot(kind='bar', title='Importance of Features (Optimized)', color= 'skyblue')
plt.ylabel('Feature Importance Score')
plt.grid()
plt.show()

## Analysis

1. Some features, such as Phones, Agriculture, and Infant mortality, have high importance.
2. By comparison, features such as Arable and Area contribute very little.
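When reading the chart, it helps to remember that these impurity-based scores are relative: they sum to 1 and describe how much the fitted model relied on each feature, not a causal effect on GDP. A minimal sketch on synthetic data, where the column names `signal`, `weak`, and `noise` are made up:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data: the target depends strongly on "signal", weakly on "weak",
# and not at all on "noise".
rng = np.random.default_rng(101)
X_toy = pd.DataFrame({
    "signal": rng.normal(size=200),
    "weak": rng.normal(size=200),
    "noise": rng.normal(size=200),
})
y_toy = 10 * X_toy["signal"] + 1 * X_toy["weak"] + rng.normal(scale=0.1, size=200)

model = GradientBoostingRegressor(random_state=101).fit(X_toy, y_toy)

# Label the scores with column names and sort, as in the bar chart above.
importances = pd.Series(model.feature_importances_, index=X_toy.columns)
importances = importances.sort_values(ascending=False)

print(importances)
print(importances.sum())
```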

# Conclusion

Four regressors **(Linear Regression, SVM, Random Forest, and Gradient Boosting)** were tested for predicting GDP per capita. Ranked by prediction performance: \
\
**Random Forest > Gradient Boosting > Linear Regression > SVM**

The metrics for the best prediction performance, obtained with the Random Forest regressor using all features in the dataset, are:

1. MAE: 2125.24
2. RMSE: 3051.71
3. R2_Score:  0.8873

In [None]:
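# Save the model for deployment.
# Note: rf3 is the Random Forest trained on the selected-feature set (X3), so inputs at
# prediction time must have the same columns. Save rf1 instead for the all-features model.
joblib.dump(rf3,"random_forest_model.pkl")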
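On the deployment side, the saved file is read back with `joblib.load`. The sketch below shows a round trip with a small model trained on synthetic data and a temporary file; the file name `toy_model.pkl` and all variables are illustrative.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Small model on synthetic data, standing in for the notebook's Random Forest.
rng = np.random.default_rng(101)
X_toy = rng.normal(size=(50, 4))
y_toy = 3 * X_toy[:, 0] + rng.normal(scale=0.1, size=50)
model = RandomForestRegressor(n_estimators=20, random_state=101).fit(X_toy, y_toy)

# Save to a temporary location, then load it back as a deployment script would.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.pkl")
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions.
same_predictions = np.allclose(model.predict(X_toy), restored.predict(X_toy))
print(same_predictions)
```

The loading environment should use a scikit-learn version compatible with the one used for saving, and the input columns must match those the model was trained on, in the same order.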