

# Investor Risk Tolerance and Robo-advisors

The goal of this case study is to build a machine learning model to predict the risk tolerance or risk aversion of an investor, and use the model in a robo-advisor dashboard.


## Content

* [1. Problem Definition](#0)
* [2. Getting Started - Load Libraries and Dataset](#1)
    * [2.1. Load Libraries](#1.1)    
    * [2.2. Load Dataset](#1.2)
* [3. Data Preparation and Feature Selection](#2)
    * [3.1. Preparing the predicted variable](#2.1)    
    * [3.2. Feature Selection-Limit the Feature Space](#2.2)
* [4. Evaluate Algorithms and Models](#4)
    * [4.1. Train/Test Split](#4.1)
    * [4.2. Test Options and Evaluation Metrics](#4.2)
    * [4.3. Compare Models and Algorithms](#4.3)
* [5. Model Tuning and Grid Search](#5)  
* [6. Finalize the Model](#6)  
    * [6.1. Results on test dataset](#6.1)
    * [6.2. Feature Importance and Feature Intuition](#6.2)
    * [6.3. Save Model for Later Use](#6.3)


<a id='0'></a>
# 1. Problem Definition

This initiative focuses on developing an advanced machine learning system geared towards portfolio management optimization. The primary goal is to design an algorithm that can effectively diagnose investor risk profiles, construct efficient investment frontiers, and provide data-driven recommendations for maximizing portfolio returns. This sophisticated system caters to a spectrum of users, including both newcomers and seasoned investors, delivering insights that have the potential to reshape portfolio management strategies through data analytics and predictive modeling techniques.

## Phase 2 ##
This notebook marks the second phase of our project, where we delve into training a specialized machine learning model to forecast risk tolerance. Leveraging an array of pertinent features, the algorithm is designed to predict an investor's risk appetite based on their individual characteristics. This predictive prowess can substantially enhance our ability to tailor investment strategies, thus deepening the data-driven aspects of portfolio management.

## Disclaimer ##

This project is built upon the foundation laid by Hariom Tatsat in his book 'Machine Learning and Data Science Blueprints for Finance.' Specifically, it draws inspiration from Chapter 5's exploration of Supervised Learning, Regression, and Time Series models. The project closely follows Case Study 3, which delves into the critical area of Investor Risk Tolerance and Robo-advisors. The objective is to extend and build upon Tatsat's insights, methodologies, and findings to create an innovative machine learning application that optimizes investment portfolio management.

For this case study, the data comes from the Survey of Consumer Finances, which is conducted by the Federal Reserve
Board. The data source is:
https://www.federalreserve.gov/econres/scfindex.htm

<a id='1'></a>
# 2. Getting Started - Loading the Data and Python Packages

<a id='1.1'></a>
## 2.1. Loading the Python Packages

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
from RTfuncs import *
import numpy as np
import pandas as pd
#import pandas_datareader.data as web
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
import seaborn as sns
import copy 
from sklearn.preprocessing import StandardScaler

from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.neural_network import MLPRegressor

# Libraries for Deep Learning Models
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD
from keras.layers import LSTM
#from keras.wrappers.scikit_learn import KerasRegressor


#Libraries for Statistical Models
import statsmodels.api as sm

#Libraries for Saving the Model
from pickle import dump
from pickle import load

<a id='1.2'></a>
## 2.2. Loading the Data

In [None]:
path = '../raw_data/clean_dataset.csv'
dataset = pd.read_csv(path)

In [None]:
# Disable the warnings
import warnings
warnings.filterwarnings('ignore')

<a id='2'></a>
# 3. Data Preparation and Feature Selection

<a id='2.1'></a>
## 3.1. Preparing the predicted variable

In [None]:
dataset.head()

In [None]:
#Risk Tolerance 2019
RT= calculate_risk_tolerance(dataset)
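
`calculate_risk_tolerance` is imported from `RTfuncs`, and its body is not shown in this notebook. As a rough sketch only: a common way to define risk tolerance is the share of risky assets in the total of risky plus risk-free assets. The function name `risk_tolerance_sketch` and the column names used below are illustrative assumptions, not the helper's actual definition.

```python
import pandas as pd

def risk_tolerance_sketch(df, risky_cols, risk_free_cols):
    """Hypothetical stand-in: risky assets / (risky + risk-free assets)."""
    risky = df[risky_cols].sum(axis=1)
    risk_free = df[risk_free_cols].sum(axis=1)
    total = risky + risk_free
    # Rows with no assets at all get NaN instead of a division by zero
    return risky / total.where(total != 0)

# Toy data; the column names are assumptions made for this illustration
toy = pd.DataFrame({
    "STOCKS": [50.0, 0.0, 0.0],
    "BOND": [50.0, 0.0, 0.0],
    "LIQ": [100.0, 200.0, 0.0],
})
rt = risk_tolerance_sketch(toy, ["STOCKS", "BOND"], ["LIQ"])
print(rt)
```

Under this definition the value lies between 0 (only risk-free assets) and 1 (only risky assets), which makes it a natural regression target.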

In [None]:
deep_copy(dataset)

In [None]:
print('Null Values =',dataset.isnull().values.any())

In [None]:
preprocessed_dataset, null_values_exist = preprocess_dataset(dataset)
print('Null Values Exist:', null_values_exist)

Let us plot the risk tolerance of 2019. 

In [None]:
# sns.distplot is deprecated in recent seaborn releases; histplot is its replacement
sns.histplot(preprocessed_dataset['RT'],
             bins=int(180/5), color='blue',
             edgecolor='black')

The histogram above shows how risk tolerance is distributed across the individuals in the 2019 data.

<a id='2.2'></a>
## 3.2. Feature Selection - Limit the Feature Space

<a id='2.2.2'></a>
### 3.2.2. Feature Elimination

In [None]:
keep_list2 = ['HHSEX',
              'AGE',
              'EDCL',
              'MARRIED',
              'KIDS',
              'FAMSTRUCT',
              'OCCAT1',
              'INCOME',
              'WSAVED',
              'YESFINRISK',
              'NETWORTH',
              'RT'
              ]
dataset = preprocess_and_drop_columns(preprocessed_dataset, keep_list2)

Let us look at the correlation among the features.

In [None]:
correlation = dataset.corr()
plt.figure(figsize=(15,15))
plt.title('Correlation Matrix')
sns.heatmap(correlation, vmax=1, square=True,annot=True,cmap='inferno')

In [None]:
def plot_scatter_matrix(dataset, figsize=(15, 15)):
    scatter_matrix(dataset, figsize=figsize)
    plt.show()
    
plot_scatter_matrix(dataset)

<a id='4'></a>
# 4. Evaluate Algorithms and Models

Let us evaluate the algorithms and the models. 

<a id='4.1'></a>
## 4.1. Train/Test Split

In this step, we split the data into a training set and a validation set, holding out 20% of the rows for validation.

In [None]:
X_train, X_validation, Y_train, Y_validation = prepare_train_validation_sets(dataset, 'RT', test_size=0.2, random_state=3)
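
`prepare_train_validation_sets` lives in `RTfuncs` and is not shown here. A minimal sketch of what such a helper might do, assuming it separates the target column and delegates to scikit-learn's `train_test_split`; the name `split_features_target` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

def split_features_target(df, target, test_size=0.2, random_state=3):
    """Hypothetical stand-in: split a DataFrame into train/validation sets."""
    X = df.drop(columns=[target])  # every column except the target is a feature
    y = df[target]
    return train_test_split(X, y, test_size=test_size, random_state=random_state)

# Toy data with 100 rows, two features and an 'RT' target
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "AGE": rng.integers(1, 7, size=100),
    "INCOME": rng.normal(size=100),
    "RT": rng.uniform(size=100),
})
X_tr, X_val, y_tr, y_val = split_features_target(toy, "RT")
print(X_tr.shape, X_val.shape)
```

Fixing `random_state` keeps the split reproducible, so later model comparisons are run on the same rows.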

<a id='4.2'></a>
## 4.2. Test Options and Evaluation Metrics


In [None]:
# test options for regression
num_folds = 10
# Alternative metrics (uncomment one to use it instead of R^2):
# scoring = 'neg_mean_squared_error'
# scoring = 'neg_mean_absolute_error'
scoring = 'r2'

<a id='4.3'></a>
## 4.3. Compare Models and Algorithms

### Regression Models

In [None]:
regression_models = create_regression_models()

### K-folds cross validation

In [None]:
model_names, model_results, best_model = perform_cross_validation_and_store_results_with_best_model(regression_models, X_train, Y_train, num_folds=10, seed=3)
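
The cross-validation helper above also comes from `RTfuncs`. The following is a sketch under stated assumptions rather than its actual implementation: it assumes the models arrive as `(name, estimator)` pairs, scores each with k-fold cross-validation, and picks the one with the highest mean score. The function name `compare_models` and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

def compare_models(models, X, y, num_folds=10, seed=3, scoring="r2"):
    """Hypothetical stand-in: k-fold CV for each (name, estimator) pair."""
    names, results = [], []
    for name, model in models:
        kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)
        scores = cross_val_score(model, X, y, cv=kfold, scoring=scoring)
        names.append(name)
        results.append(scores)
    # Higher is better for 'r2' and for scikit-learn's negated error metrics
    best_idx = int(np.argmax([s.mean() for s in results]))
    return names, results, models[best_idx][1]

# Synthetic regression data, for illustration only
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
demo_models = [
    ("LR", LinearRegression()),
    ("RF", RandomForestRegressor(n_estimators=20, random_state=0)),
]
names, results, best = compare_models(demo_models, X_demo, y_demo, num_folds=5)
for name, scores in zip(names, results):
    print(f"{name}: mean={scores.mean():.3f} std={scores.std():.3f}")
```

The per-fold score arrays in `results` are what a box plot of the algorithm comparison would be drawn from.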

### Algorithm comparison

In [None]:
# compare algorithms
fig = plt.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(model_results)
ax.set_xticklabels(model_names)
fig.set_size_inches(15,8)
plt.show()

The nonlinear models perform better than the linear models, which suggests a nonlinear relationship between risk tolerance and the variables used to predict it. Since random forest regression is one of the best-performing methods, we use it for the grid search that follows.

<a id='5'></a>
# 5. Model Tuning and Grid Search

Because Random Forest is among the best-performing models, the grid search is performed on Random Forest.

In [None]:
perform_grid_search_random_forest(X_train, Y_train, best_model, num_folds=10, seed=3)

After the grid search, the random forest with 350 estimators is the best model.
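
The grid-search helper itself is defined in `RTfuncs` and is not shown. A minimal sketch of a comparable search with scikit-learn's `GridSearchCV`, assuming only `n_estimators` is tuned; the small grid and the synthetic data below are chosen to keep the example fast and are not the values used in this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic regression data, for illustration only
X_demo, y_demo = make_regression(n_samples=150, n_features=5, noise=10.0, random_state=0)

# Illustrative grid; a real search would cover a wider range of values
param_grid = {"n_estimators": [10, 20, 30]}
kfold = KFold(n_splits=3, shuffle=True, random_state=3)

grid = GridSearchCV(
    RandomForestRegressor(random_state=3),
    param_grid,
    scoring="r2",
    cv=kfold,
)
grid_result = grid.fit(X_demo, y_demo)

print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
for mean, params in zip(grid_result.cv_results_["mean_test_score"],
                        grid_result.cv_results_["params"]):
    print("%f with %r" % (mean, params))
```

Printing the mean score for every grid point, not only the best one, shows whether the winner is clearly ahead or within noise of its neighbours.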

<a id='6'></a>
# 6. Finalize the Model

Finalize the model with the best parameters found during the tuning step.

<a id='6.1'></a>
## 6.1. Results on the Test Dataset

In [None]:
# n_estimators=350 is the best value reported by the grid search above
model = create_and_fit_model(X_train, Y_train, RandomForestRegressor, n_estimators=350)

In [None]:
predictions_train = model.predict(X_train)
r2_train = calculate_r2_score(predictions_train, Y_train)
print("R^2 Score (Train):", r2_train)

In [None]:
mse, r2 = evaluate_regression_model(model, X_validation, Y_validation)
print("Mean Squared Error:", mse)
print("R^2 Score:", r2)

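Judging by the mean squared error and R² shown above for the held-out set (`X_validation`, `Y_validation`), the results look good. Comparing this R² with the training R² gives a sense of how much the model overfits.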
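
For reference, both metrics can be computed directly with scikit-learn. This is a sketch on made-up numbers, not the notebook's results, and it does not claim to reproduce the `RTfuncs` helpers. Note that `r2_score` expects the true values first and the predictions second, and the result changes if the two are swapped.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up targets and predictions, for illustration only
y_true = np.array([0.2, 0.4, 0.6, 0.8])
y_pred = np.array([0.25, 0.35, 0.65, 0.75])

mse = mean_squared_error(y_true, y_pred)  # average squared residual
r2 = r2_score(y_true, y_pred)             # 1 - SS_res / SS_tot

# Swapping the arguments changes SS_tot and therefore the score
r2_swapped = r2_score(y_pred, y_true)

print("MSE:", mse)
print("R^2:", r2)
print("R^2 with swapped arguments:", r2_swapped)
```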

<a id='6.2'></a>
## 6.2. Feature Importance and Feature Intuition

Given the results above, the random forest is worthy of further study.
Let us look at the feature importances of the RF model.

In [None]:
import pandas as pd
import numpy as np
model = RandomForestRegressor(n_estimators=200, n_jobs=-1)
model.fit(X_train, Y_train)
# Use the built-in feature_importances_ attribute of tree-based models
print(model.feature_importances_)
# Plot the feature importances for better visualization
feat_importances = pd.Series(model.feature_importances_, index=X_train.columns)
feat_importances.nlargest(10).plot(kind='barh')
plt.show()

From the chart above, income and net worth, followed by age and willingness to take risk, are the key variables in determining risk tolerance. These variables are widely treated as key determinants of risk tolerance in the literature.

<a id='6.3'></a>
## 6.3. Save Model for Later Use

In [None]:
model_filename = 'finalized_model.sav'
save_model_to_pickle(model, model_filename)

In [None]:
model_filename = 'finalized_model.sav'
load_and_evaluate_model(model_filename, X_validation, Y_validation)
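
`save_model_to_pickle` and `load_and_evaluate_model` are `RTfuncs` helpers whose bodies are not shown. A minimal sketch of the underlying round trip with the standard `pickle` module, assuming the helpers wrap `dump` and `load`; the temporary path and the demo model are illustrative.

```python
import os
import tempfile
from pickle import dump, load

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Small demo model on synthetic data, for illustration only
X_demo, y_demo = make_regression(n_samples=100, n_features=4, random_state=0)
rf = RandomForestRegressor(n_estimators=10, random_state=0).fit(X_demo, y_demo)

# Write the fitted model to disk in binary mode
path = os.path.join(tempfile.mkdtemp(), "demo_model.sav")
with open(path, "wb") as f:
    dump(rf, f)

# Read it back and confirm the restored model predicts identically
with open(path, "rb") as f:
    restored = load(f)

same = np.array_equal(restored.predict(X_demo), rf.predict(X_demo))
print("Predictions identical after reload:", same)
```

Pickled scikit-learn models are best loaded with the same library version that saved them, and pickle files from untrusted sources should not be loaded at all.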


__Conclusion__:

We showed that machine learning models might be able to objectively
analyze the behavior of different investors in a changing market and attribute these
changes to variables involved in determining risk appetite. With an increase in the
volume of investor data and the availability of rich machine learning infrastructure,
such models might prove to be more useful.

We saw that there is a nonlinear relationship between the variables and risk tolerance. Income and net worth, followed by age and willingness to take risk, are the key variables in determining risk tolerance. These variables are widely treated as key determinants of risk tolerance in the literature.
