<h1>Machine Learning for Data Analysis</h1>
<i><h2>Week 1 - Decision Trees</h2></i>

The data set I am using is the Gapminder data set, a collection of observational variables from independent sources that have been centralised into one set.

I am investigating the association between incomeperperson and armedforcesrate. My hypothesis is that a higher incomeperperson is associated with a lower armedforcesrate, so the coefficient of my explanatory variable should be less than 0. I will also look at the type of government, via the polityscore variable, to see whether it has a confounding effect.

For the purpose of <i>growing</i> an analysable decision tree I will keep my two explanatory variables, incomeperperson and polityscore, but will categorise both of them.

<h3>SET UP</h3>

<i>Import appropriate packages and set appropriate options</i>

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
import sklearn.metrics
from sklearn import tree

In [2]:
pd.set_option('display.float_format', '{:,.2f}'.format)

<i>Read in the data set</i>

In [3]:
usecols = ['incomeperperson', 'armedforcesrate', 'polityscore', 'country', 'lifeexpectancy', 'alcconsumption']
gap_data = pd.read_csv('../gapminder.csv', usecols = usecols, index_col='country')

<i>Replace single-space placeholders with empty strings and coerce the columns to numeric; the replacement is needed before the values can be converted to numeric</i>

In [4]:
# Assign whole columns (rather than via .loc[:, col]) so the numeric dtype is kept in newer pandas versions
for col in ['incomeperperson', 'armedforcesrate', 'polityscore', 'lifeexpectancy', 'alcconsumption']:
    gap_data[col] = pd.to_numeric(gap_data[col].replace(' ', ''))
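
An alternative to the explicit replacement is sketched below on a hypothetical toy frame (the values are made up to stand in for the Gapminder columns): `pd.to_numeric` accepts `errors='coerce'`, which turns anything it cannot parse, including a lone space, into `NaN`.

```python
import pandas as pd

# Hypothetical toy frame standing in for the raw Gapminder columns
toy = pd.DataFrame({'incomeperperson': ['103.78', ' ', '2231.99'],
                    'polityscore': ['7', '-2', ' ']})

# errors='coerce' converts unparseable entries (such as ' ') to NaN
cleaned = toy.apply(pd.to_numeric, errors='coerce')
print(cleaned)
print(cleaned.dtypes)
```

The rows containing `NaN` can then be removed with `dropna`, exactly as in the next step.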

<i>Remove any rows where there are nulls</i>

In [5]:
gap_data.dropna(inplace = True)

<i>Categorise the response variable so that the decision tree is an analysable size</i>

<b><i>armedforcesrate</i></b>

In [6]:
gap_data['armedforcesrate_cat'] = gap_data.loc[:,'armedforcesrate'].copy()

In [7]:
gap_data.loc[(gap_data['armedforcesrate_cat'] < 1) & (gap_data['armedforcesrate_cat'] >= 0), 'armedforcesrate_cat'] = 0
gap_data.loc[(gap_data['armedforcesrate_cat'] >= 1), 'armedforcesrate_cat'] = 1

In [8]:
# Assign the whole column so the new dtype is kept in newer pandas versions
gap_data['armedforcesrate_cat'] = gap_data['armedforcesrate_cat'].astype(int).astype(str)

<i>Categorise the explanatory variables to keep the decision tree interpretable</i>

<b><i>incomeperperson</i></b>

In [9]:
gap_data['incomeperperson_cat'] = gap_data.loc[:,'incomeperperson'].copy()

In [10]:
gap_data.loc[(gap_data['incomeperperson_cat'] < 5000) & (gap_data['incomeperperson_cat'] >= 0), 'incomeperperson_cat'] = 1
gap_data.loc[(gap_data['incomeperperson_cat'] >= 5000), 'incomeperperson_cat'] = 2

In [11]:
# Assign the whole column so the integer dtype is kept in newer pandas versions
gap_data['incomeperperson_cat'] = gap_data['incomeperperson_cat'].astype(int)

<b><i>polityscore</i></b>

In [12]:
gap_data['polityscore_cat'] = gap_data.loc[:,'polityscore'].copy()

In [13]:
gap_data.loc[(gap_data['polityscore_cat'] >= 0), 'polityscore_cat'] = 1
gap_data.loc[(gap_data['polityscore_cat'] < 0), 'polityscore_cat'] = 2

In [14]:
# Assign the whole column so the integer dtype is kept in newer pandas versions
gap_data['polityscore_cat'] = gap_data['polityscore_cat'].astype(int)
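
The in-place recoding above relies on the order of the assignments (the first replacement value must not fall into the range tested by the second). A minimal sketch of a one-step alternative using `np.where`, on hypothetical toy values rather than the real data, with the same cut-offs as above:

```python
import numpy as np
import pandas as pd

# Hypothetical toy values; the thresholds (5000 and 0) are the ones used above
toy = pd.DataFrame({'incomeperperson': [103.78, 4999.99, 5000.0, 39972.35],
                    'polityscore': [-10.0, -1.0, 0.0, 10.0]})

# Category 2 for incomeperperson >= 5000, category 1 otherwise
toy['incomeperperson_cat'] = np.where(toy['incomeperperson'] >= 5000, 2, 1)

# Category 1 for polityscore >= 0, category 2 otherwise
toy['polityscore_cat'] = np.where(toy['polityscore'] >= 0, 1, 2)

print(toy)
```

Because each category is computed from the original column in a single expression, the result does not depend on the order of the statements.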

<i>Look at some information about the data set</i>

In [15]:
gap_data.dtypes

incomeperperson        float64
alcconsumption         float64
armedforcesrate        float64
lifeexpectancy         float64
polityscore            float64
armedforcesrate_cat     object
incomeperperson_cat      int32
polityscore_cat          int32
dtype: object

In [16]:
gap_data.describe()

       incomeperperson  alcconsumption  armedforcesrate  lifeexpectancy  polityscore  incomeperperson_cat  polityscore_cat
count            149.0           149.0            149.0           149.0        149.0                149.0            149.0
mean           6816.62            6.91             1.38           68.92         3.88                 1.36             1.28
std            9891.11             5.1             1.54            9.99          6.2                 0.48             0.45
min             103.78            0.05              0.0           47.79        -10.0                  1.0              1.0
25%             561.71            2.69             0.45           62.09         -2.0                  1.0              1.0
50%            2231.99            6.12             0.93           72.64          7.0                  1.0              1.0
75%            7381.31           10.08             1.58           76.13          9.0                  2.0              2.0
max           39972.35           23.01             9.82           83.39         10.0                  2.0              2.0


<h3>MODEL DEVELOPMENT ('growing' a decision tree)</h3>

<i>Set up the constants for the model</i>

In [17]:
predictors = gap_data[['incomeperperson_cat','polityscore_cat']]
targets = gap_data['armedforcesrate_cat']

<i>Split the data set up into sections for use in training and testing the model</i>

In [18]:
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size = 0.4)

<i>This splits the data into roughly 60% training data and 40% test data, as we will see below</i>

In [19]:
print('Predictors : training',pred_train.shape, \
      '\nPredictors : testing ',pred_test.shape, \
      '\nTarget     : training',tar_train.shape, \
      '\nTarget     : testing ',tar_test.shape)

Predictors : training (89, 2) 
Predictors : testing  (60, 2) 
Target     : training (89,) 
Target     : testing  (60,)


In [20]:
print("training : ",round((89 / 149) * 100,2), "%", "\ntest     : ",round((60 / 149) * 100,2), "%")

training :  59.73 % 
test     :  40.27 %
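
The split above is random, so the counts in each class (and therefore the fitted tree and its accuracy) can change from run to run. A minimal sketch of a reproducible split, assuming a fixed seed is acceptable: `random_state=0` is an arbitrary choice, the toy arrays are hypothetical, and `stratify` is an optional extra that keeps the class proportions similar in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins with the same number of rows (149) as the cleaned data
X = np.arange(298).reshape(149, 2)
y = np.array([0] * 90 + [1] * 59)

# random_state fixes the shuffle; stratify keeps the class balance in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0, stratify=y)

print(X_train.shape, X_test.shape)
```

With 149 rows and `test_size=0.4` this gives the same 89/60 division as above.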


<i>Model</i>

In [21]:
classifier = DecisionTreeClassifier().fit(pred_train, tar_train)

In [22]:
predictions = classifier.predict(pred_test)

<i>Checking the model</i>

In [23]:
import sys
np.set_printoptions(threshold = sys.maxsize)  # np.nan is rejected by recent NumPy versions

In [24]:
sklearn.metrics.confusion_matrix(tar_test, predictions)

array([[33,  1],
       [24,  2]])

In [25]:
print(round(sklearn.metrics.accuracy_score(tar_test, predictions) * 100, 2), "%")

58.33 %
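
The accuracy can be recovered by hand from the confusion matrix, where rows are the true classes and columns the predicted classes. The sketch below uses only the four counts shown above; the per-class recall and the majority-class baseline are extra figures I have added for context, not part of the original output.

```python
import numpy as np

# Confusion matrix from above: rows = true class, columns = predicted class
cm = np.array([[33, 1],
               [24, 2]])

# Correct predictions sit on the diagonal: (33 + 2) / 60
accuracy = np.trace(cm) / cm.sum()

# Share of each true class that was predicted correctly: 33/34 and 2/26
recall_per_class = np.diag(cm) / cm.sum(axis=1)

# Accuracy of always predicting the most common true class: 34/60
baseline = cm.sum(axis=1).max() / cm.sum()

print(round(accuracy * 100, 2), "%")
print(np.round(recall_per_class * 100, 2))
print(round(baseline * 100, 2), "%")
```

On these counts the tree is only slightly ahead of always predicting 'forces rate < 1', and it recovers very few of the countries whose rate is 1 or above.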


<i>Outputting the decision tree to an external file</i>

In [26]:
tree.export_graphviz(classifier, out_file = 'tree.dot', \
                     class_names = ['forces rate < 1', 'forces rate >=1'], \
                     filled = True, rounded = True, \
                     feature_names = ['incomeperperson','polityscore'])

<h3>END ANALYSIS</h3>

Code from Spyder that rendered the decision tree; the file is created in the same directory as the tree.dot file:

<div style = "margin-left : 50px ;"><i>
import pydotplus
<br>import pydot
<br>from IPython.display import Image
<br>
<br>file = open('tree.dot', 'r')
<br>text=file.read()
<br>
<br>(graph,)=pydot.graph_from_dot_data(text)
<br>
<br>Image(graph.create_png())
</i></div>

I did it in Spyder as it wasn't working in Jupyter Notebook. I will investigate how to do this.
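
One possible workaround that avoids Graphviz altogether is sketched here on hypothetical toy data rather than the Gapminder set: scikit-learn (0.21 or later) provides `sklearn.tree.export_text`, which returns the fitted rules as plain text that can be printed directly in a notebook cell.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data: columns are [incomeperperson_cat, polityscore_cat]
X = [[1, 1], [2, 1], [1, 2], [2, 2]] * 5
y = ['0', '0', '0', '1'] * 5

# random_state is fixed only so that the printed tree is repeatable
clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# Plain-text rendering of the splits and leaf classes
rules = export_text(clf, feature_names=['incomeperperson_cat', 'polityscore_cat'])
print(rules)
```

This does not replace the image, but it shows the same splits without any external rendering step.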

<i>My Decision Tree</i>

<img src="decision_tree.png">

<i>The process</i>

A decision tree analysis was performed to test for non-linear relationships between the explanatory variables incomeperperson and polityscore and the response variable armedforcesrate, which was transformed into a binary variable. In <i>growing</i> a decision tree, all possible cut points of a quantitative variable and all potential separations of a categorical variable are tested and followed. No pruning of the branches is done in this instance, as it isn't supported in the version of the packages used, so I will analyse the full tree.
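
To make "testing a cut point" concrete: `DecisionTreeClassifier` uses the Gini impurity as its default splitting criterion, and a candidate split is scored by the weighted impurity of the two groups it creates (lower is better). The sketch below is a simplified illustration with hypothetical labels and helper names of my own, not scikit-learn's internal code.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0 means a pure group)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def weighted_gini(left, right):
    """Impurity of a split: group impurities weighted by group size."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Hypothetical labels for six countries and two candidate splits
parent = ['0', '0', '0', '1', '1', '1']
split_a = (['0', '0', '0'], ['1', '1', '1'])  # separates the classes perfectly
split_b = (['0', '1', '0'], ['1', '0', '1'])  # leaves both groups mixed

print(gini(parent))
print(weighted_gini(*split_a))
print(weighted_gini(*split_b))
```

The split with the lowest weighted impurity is the one chosen at that node, and the same search is then repeated within each branch.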

All the variables, both explanatory and response, have been categorised into two groups, so every variable in the model is binary.

<i>The tree</i>

The categorised polityscore variable was the first variable to separate the training sample into two sets. Countries with polityscore category 1 (actual value between 0 and 10), 75.3% of the sample group, go to the left branch, and countries with polityscore category 2 (actual value between -10 and -1), 24.7% of the sample group, go to the right branch.

Following the left branch, which now has 67 sample elements, there is a subdivision by incomeperperson. The countries with incomeperperson category 1 (actual value between \$0 and \$5000), 59.7% of the sample group, go to the left leaf node, which has class 'forces rate < 1', meaning the model will predict armedforcesrate category 0 (actual value between 0% and 1%). The countries with incomeperperson category 2 (actual value of \$5000 or more), 40.3% of the sample group, go to the right leaf node, which also has class 'forces rate < 1'. So in both instances the model predicts that if the polityscore category is 1 (actual value between 0 and 10, indicating a more democratic style of government) then, regardless of the incomeperperson value, the predicted armedforcesrate category is 0 (actual value between 0% and 1%, indicating that a low proportion of the available workforce is on active duty in the military).

Following the right branch, which has 22 sample elements, there is a subdivision by incomeperperson. The countries with incomeperperson category 1 (actual value between \$0 and \$5000), 72.7% of the sample group, go to the left leaf node, which has class 'forces rate < 1', meaning the model will predict armedforcesrate category 0 (actual value between 0% and 1%). The countries with incomeperperson category 2 (actual value of \$5000 or more), 27.3% of the sample group, go to the right leaf node with class 'forces rate >= 1', meaning the model will predict armedforcesrate category 1 (actual value of 1% or more). So the model predicts that, among countries with polityscore category 2 (actual value between -10 and -1), those with a lower incomeperperson will have a lower armedforcesrate than those with a higher incomeperperson.

Looking at the whole decision tree, most of the paths lead to the class 'forces rate < 1', which means most countries will have a predicted armedforcesrate category of 0 (actual value between 0% and 1%).
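
The tree described above can be written out as a small set of nested rules. This is a hand-written sketch that follows the description of the branches, with a hypothetical function name; it is not code generated from the fitted classifier.

```python
def predict_forces_rate(polityscore_cat, incomeperperson_cat):
    """Return the leaf class for one country, following the tree described above."""
    # Left branch: polityscore category 1 (score between 0 and 10)
    if polityscore_cat == 1:
        # Both leaves on this branch have the same class
        return 'forces rate < 1'
    # Right branch: polityscore category 2 (score between -10 and -1)
    if incomeperperson_cat == 1:
        return 'forces rate < 1'
    return 'forces rate >=1'

# All four combinations of the two binary predictors
for polity in (1, 2):
    for income in (1, 2):
        print(polity, income, predict_forces_rate(polity, income))
```

Only one of the four combinations, polityscore category 2 with incomeperperson category 2, leads to the 'forces rate >=1' class.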

The overall model has an accuracy of 58.33%, so 58.33% of the test sample countries had their <u>armedforcesrate</u> group correctly predicted by the model. This is not a great accuracy value, but it is currently a very simple model with only binary variables included. The accuracy would potentially increase if more information about the individual values of the variables was included. I didn't include them in this instance because the tree was so large when I did the analysis.