# Modelling 1 - Supervised Machine Learning
The goal of this notebook is to provide reference material for the implementation of the models seen in class. This should be especially useful when working on your final projects.

## Modelling Process
1. Task Type Identification
2. Import Dataset
3. Feature Engineering
4. Train/Test Split
5. Model Selection and Fitting
6. Model Evaluation
7. Model Use

## Data Science Tasks (From Class 1)

|Tasks|Definitions|Typical Questions|Model Type|
|-|-|-|-|
|**Classification and class probability estimation**|Predict which class, out of a given set of classes, the individual belongs to.|Is this customer likely to churn? Is this customer likely to respond?|Supervised|
|**Regression**|Predict a numerical value for some individual.|How much will this customer use the service if she accepts the offer?|Supervised|
|**Similarity matching**|Identify similar individuals or items.|Lookalike analysis; product recommendations|Both|
|**Clustering**|Group individuals based on similarity.|Do customers form natural groups or segments?|Unsupervised|
|**Co-occurrence grouping**|Find associations between items based on transactions involving them.|What items are commonly purchased together?|Unsupervised|
|**Profiling**|Characterize the typical behavior of an individual or population.|What is the typical product usage of this segment? What and where is this customer likely to buy?|Unsupervised|
|**Link prediction**|Predict existence or strength of connection between items.|Which unlinked people on a social network platform is a customer likely to know? Will a customer like this post if it is shown to her?|Both|
|**Data reduction**|Compress a large data set to a smaller data set.|Is there a small set of features representing most of the variability in customer behavior?|Both|
|**Causal modelling**|Identify events or actions that actually influence some outcome or a variable.|Has the marketing campaign actually improved chances of customer acquisition?|Supervised|

## Titanic Prediction Problem
You are given a dataset with passenger information and it is your job to predict if a passenger survived the sinking of the Titanic or not. 

This is a classic Kaggle challenge which can be found here: https://www.kaggle.com/c/titanic

### 1. Task Type Identification

What type of task?

What type of models?

### 2. Import Dataset

In [1]:
import pandas as pd
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

In [2]:
train.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [3]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB


In [4]:
train.describe()

,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


### 3. Feature Engineering

Identify important features, fill missing values, adjust data representation.
* **Survived**: the target; 0 means the passenger did not survive, 1 means the passenger survived.
* **Pclass**: passenger class.
* **Name**, **Sex**, **Age**: self-explanatory.
* **SibSp**: number of the passenger's siblings and spouses aboard the Titanic.
* **Parch**: number of the passenger's children and parents aboard the Titanic.
* **Ticket**: ticket id.
* **Fare**: price paid (in pounds).
* **Cabin**: passenger's cabin number.
* **Embarked**: port where the passenger boarded the Titanic.

Observations:
* We have numerical, categorical, and boolean variables. (*Some models will not work unless we perform data normalization / preprocessing.*)
* Let's make a hypothesis that Pclass, Sex, and Age will have high information content. (*Normally we would run a feature selection process.*)
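
The remark about normalization can be illustrated with a small sketch. It assumes synthetic data standing in for the three hypothesised features (the values are made up, not taken from `train.csv`) and wraps a `StandardScaler` and an `SVC` in a scikit-learn pipeline, so the scaling learned on the training data is reapplied automatically at prediction time. This is one possible preprocessing choice, not the approach used in the rest of this notebook.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for (Pclass, IsFemale, Age); NOT the real Titanic data.
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.integers(1, 4, size=200),   # Pclass in {1, 2, 3}
    rng.integers(0, 2, size=200),   # IsFemale in {0, 1}
    rng.uniform(1, 80, size=200),   # Age in years
]).astype(float)
y = rng.integers(0, 2, size=200)    # random labels, for illustration only

# The pipeline scales the features first, then fits the SVM on the scaled values.
scaled_svm = make_pipeline(StandardScaler(), SVC())
scaled_svm.fit(X, y)

# After scaling, every column has mean ~0 and standard deviation ~1,
# so Age no longer dominates the distance computations inside the SVM.
X_scaled = scaled_svm.named_steps['standardscaler'].transform(X)
print(X_scaled.mean(axis=0).round(6), X_scaled.std(axis=0).round(6))
```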

Data Preparation:
1. Fill missing **Age** values with the training-set median
2. Convert **Sex** to a 0/1 indicator (**IsFemale**)

In [5]:
# check for null values
train.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [6]:
# Fill missing values with median
impute_value = train['Age'].median()
train['Age'] = train['Age'].fillna(impute_value)
test['Age'] = test['Age'].fillna(impute_value)
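
The cell above learns the median from the training data and reuses it for the test data. The same idea can be sketched with scikit-learn's `SimpleImputer`, which stores the learned value so it can be applied to any later dataset. The two small frames below (`train_demo`, `test_demo`) are made-up examples, not the Titanic files.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Tiny illustrative frames; the values are invented for this sketch.
train_demo = pd.DataFrame({'Age': [22.0, np.nan, 38.0, 26.0, np.nan]})
test_demo = pd.DataFrame({'Age': [np.nan, 40.0]})

imputer = SimpleImputer(strategy='median')

# fit_transform learns the median from the training data only...
train_demo[['Age']] = imputer.fit_transform(train_demo[['Age']])
# ...and transform reuses that same value on the test data.
test_demo[['Age']] = imputer.transform(test_demo[['Age']])

print(imputer.statistics_)   # the median learned from train_demo
print(test_demo)
```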

In [7]:
# Convert sex to boolean
train['IsFemale'] = (train['Sex'] == 'female').astype(int)
test['IsFemale'] = (test['Sex'] == 'female').astype(int)

### 4. Train/Test Split

Kaggle already supplies separate training and test files, so no further split is made here. We only need to extract the features we selected and our target labels from the training data.

In [8]:
predictors = ['Pclass', 'IsFemale', 'Age']
X_train = train[predictors].values
y_train = train['Survived'].values

X_target = test[predictors].values
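
In this notebook `test` is only used to build `X_target`, and no labels are extracted from it. If you want a labelled hold-out set for your own project, one common option is `train_test_split`. The sketch below uses synthetic arrays (`X_demo`, `y_demo`) in place of `X_train` and `y_train`; the 80/20 ratio and the `random_state` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and labels.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 3))
y_demo = rng.integers(0, 2, size=100)

# Hold out 20% of the rows; stratify keeps the class balance similar in both parts.
X_fit, X_val, y_fit, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

print(X_fit.shape, X_val.shape)
```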

### 5. Model Selection and Fitting

We now use a variety of models to try to predict survival. In this example, I am using each model with its default settings; however, you will find that each model has a variety of parameters that help prevent overfitting. Finding the right parameters for your situation is where data science gets really funky.
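
As a rough sketch of what such parameters look like, the example below compares an unconstrained decision tree with a depth-limited one, and a strongly regularised logistic regression with a weakly regularised one. The data is synthetic and the specific values (`max_depth=3`, `min_samples_leaf=10`, `C=0.01`, `C=100.0`) are arbitrary illustrations, not recommendations.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy data; a stand-in for X_train / y_train, not the Titanic set.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 3))
y_demo = (X_demo[:, 0] + rng.normal(size=300) > 0).astype(int)

# Unconstrained tree: keeps splitting until its leaves are pure,
# so it memorises the training data (including the noise).
deep_tree = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

# Constrained tree: max_depth and min_samples_leaf limit how far it can grow.
shallow_tree = DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=10, random_state=0
).fit(X_demo, y_demo)

# Logistic regression: C is the inverse regularisation strength,
# so a smaller C shrinks the coefficients more.
strong_reg = LogisticRegression(C=0.01).fit(X_demo, y_demo)
weak_reg = LogisticRegression(C=100.0).fit(X_demo, y_demo)

print('deep tree   :', deep_tree.get_depth(), deep_tree.score(X_demo, y_demo))
print('shallow tree:', shallow_tree.get_depth(), shallow_tree.score(X_demo, y_demo))
print('coef norms  :', np.linalg.norm(strong_reg.coef_), np.linalg.norm(weak_reg.coef_))
```

A perfect training score for the unconstrained tree is a warning sign rather than a success: the score that matters is the one measured on data the model has not seen.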

In [9]:
# Logistic Regression
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

In [10]:
# Support Vector Machine
from sklearn.svm import SVC
svm_clf = SVC()
svm_clf.fit(X_train, y_train)



SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)

In [11]:
# Decision Tree Classifier
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier()
tree.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

### 6. Model Evaluation

To evaluate our model, we select an evaluation criterion that is important to solving our problem and then compare various versions of our model.

For this example I used cross validation and accuracy score. Other evaluation metrics were discussed in classes 3 and 4.

API documentation from scikit-learn on model selection and evaluation can be found here: http://scikit-learn.org/stable/modules/model_evaluation.html
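
Accuracy is only one possible criterion. As a minimal sketch of how other metrics can be computed from cross-validated predictions, the example below uses `cross_val_predict` together with `confusion_matrix`, `precision_score` and `recall_score`. The data is synthetic and stands in for `X_train` and `y_train`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in for the training data.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Each row is predicted by a model that did not see that row during fitting.
y_cv_pred = cross_val_predict(LogisticRegression(), X_demo, y_demo, cv=10)

# Rows of the confusion matrix are true classes, columns are predicted classes.
cm = confusion_matrix(y_demo, y_cv_pred)
precision = precision_score(y_demo, y_cv_pred)  # TP / (TP + FP)
recall = recall_score(y_demo, y_cv_pred)        # TP / (TP + FN)

print(cm)
print(precision, recall)
```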

In [12]:
from sklearn.model_selection import cross_val_score

In [13]:
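# With no scoring argument, cross_val_score uses the estimator's default score (accuracy for classifiers)
log_reg_scores = cross_val_score(log_reg, X_train, y_train, cv=10)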
log_reg_scores.mean()



0.7900942004312791

In [14]:
svm_clf_scores_acc = cross_val_score(svm_clf, X_train, y_train, cv=10, scoring='accuracy')
svm_clf_scores_acc.mean()



0.8002442969016001

In [15]:
tree_scores_acc = cross_val_score(tree, X_train, y_train, cv=10, scoring='accuracy')
tree_scores_ll = cross_val_score(tree, X_train, y_train, cv=10, scoring='neg_log_loss')

(tree_scores_acc.mean(),tree_scores_ll.mean())

(0.8193210191805698, -3.282816044753294)

To improve this result further, you could:
* Compare many more models and tune hyperparameters using cross validation and grid search.
* Do more feature engineering, for example:
  * replace **SibSp** and **Parch** with their sum,
  * try to identify parts of names that correlate well with the **Survived** attribute (e.g. if the name contains "Countess", then survival seems more likely),
  * try to convert numerical attributes to categorical attributes: for example, different age groups had very different survival rates, so it may help to create an age bucket category and use it instead of the age. Similarly, it may be useful to have a special category for people traveling alone since only 30% of them survived.
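
Some of the suggestions above can be sketched as follows. The frame `demo` is randomly generated and only borrows the Titanic column names; the new column names (`RelativesOnboard`, `IsAlone`, `AgeBucket`), the 15-year bucket width, and the parameter grid are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Randomly generated rows with Titanic-like column names; NOT the real data.
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    'Age': rng.uniform(1, 80, size=n).round(),
    'SibSp': rng.integers(0, 3, size=n),
    'Parch': rng.integers(0, 3, size=n),
    'Survived': rng.integers(0, 2, size=n),
})

# Replace SibSp and Parch with their sum.
demo['RelativesOnboard'] = demo['SibSp'] + demo['Parch']
# Special category for people travelling alone.
demo['IsAlone'] = (demo['RelativesOnboard'] == 0).astype(int)
# Age buckets of 15 years: 0 for [0, 15), 15 for [15, 30), and so on.
demo['AgeBucket'] = (demo['Age'] // 15 * 15).astype(int)

# Grid search: try every combination in param_grid with 5-fold cross validation.
features = ['RelativesOnboard', 'IsAlone', 'AgeBucket']
param_grid = {'max_depth': [2, 3, 5], 'min_samples_leaf': [1, 5, 10]}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0), param_grid, cv=5, scoring='accuracy'
)
search.fit(demo[features].values, demo['Survived'].values)

print(search.best_params_, search.best_score_)
```

Because the labels here are random, the resulting score carries no meaning; the point is only the mechanics of building the features and running the search.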

### 7. Model Use

In [16]:
# To obtain predictions on test set
y_predict = log_reg.predict(X_target)
y_confidence = log_reg.predict_proba(X_target)

In [17]:
y_predict[:10]

array([0, 0, 0, 0, 1, 0, 1, 0, 1, 0], dtype=int64)

In [18]:
y_confidence[:10]

array([[0.91880683, 0.08119317],
       [0.5654757 , 0.4345243 ],
       [0.89476325, 0.10523675],
       [0.90126679, 0.09873321],
       [0.3886951 , 0.6113049 ],
       [0.86282419, 0.13717581],
       [0.44432891, 0.55567109],
       [0.75194179, 0.24805821],
       [0.36183809, 0.63816191],
       [0.88488113, 0.11511887]])
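
The two outputs above are related: each row of `predict_proba` holds the probability of class 0 and class 1, and `predict` returns the class with the larger probability. The sketch below checks this on synthetic data and then arranges the predictions in the two-column `PassengerId`, `Survived` layout that the Kaggle Titanic competition expects for submissions. The arrays and the `passenger_ids` values are hypothetical stand-ins, not the real `X_target` or `test['PassengerId']`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic stand-ins for X_train, y_train and X_target.
rng = np.random.default_rng(0)
X_fit = rng.normal(size=(100, 3))
y_fit = (X_fit[:, 0] > 0).astype(int)
X_new = rng.normal(size=(10, 3))
passenger_ids = np.arange(892, 902)   # hypothetical ids, for illustration only

model = LogisticRegression().fit(X_fit, y_fit)
proba = model.predict_proba(X_new)    # columns follow model.classes_, here [0, 1]
labels = model.predict(X_new)

# predict() is equivalent to picking the most probable class in each row.
labels_from_proba = model.classes_[proba.argmax(axis=1)]

# Assemble the predictions; passing a file path to to_csv would write a file instead.
submission = pd.DataFrame({'PassengerId': passenger_ids, 'Survived': labels})
csv_text = submission.to_csv(index=False)
print(csv_text.splitlines()[0])
```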