# Real World: AI, Machine Learning & Data Science 

---

# Note: on data collection

- Collect all the data you can! (storage is cheap)

---

# Business value from real example

- Make correct business decisions
- Ask the right questions (with help from consultants, startups, or data analytics companies where needed)

# Demystify

This is a real-world example of how you'd solve a Machine Learning prediction problem.

**Use Cases:**
- Discover churn risk of customers
- Predict optimal price levels (investments / retail)
- Predict future revenues
- Build recommendation systems
- Customer value scoring
- Fraud detection
- Customer insights (characteristics)
- Predict sentiment of text
- Object detection in images
- etc.

## Why Python?

Python is a general-purpose language used for software development, web development, and AI. It has grown rapidly over the last couple of years, and many of the state-of-the-art Machine Learning libraries being developed today support Python.

<img src='https://zgab33vy595fw5zq-zippykid.netdna-ssl.com/wp-content/uploads/2017/09/growth_major_languages-1-1400x1200.png' width=400px></img>

Source: https://stackoverflow.blog/2017/09/06/incredible-growth-python/

# Everything is free!

The best software today is open source and it's also enterprise-ready. Anyone can download and use it for free (even for business purposes).

**Examples of great, free AI libraries:**
* Anaconda
* Google's TensorFlow
* Scikit-learn
* Pandas
* Keras
* Matplotlib
* SQL
* Spark
* Numpy

## State-of-the-Art algorithms

No matter what algorithm you want to use (Linear Regression, Random Forests, Neural Networks, or Deep Learning), **optimized implementations of the latest methods are available for Python**.

## Big Data

Python code runs on virtually any platform. Therefore, you can scale your computations, for example by using cloud resources to run big data jobs.

**Great tools for Big Data:**
- Spark
- Databricks
- Hadoop / MapReduce
- Kafka
- Amazon EC2
- Amazon S3

---

# Real world example of AI: Titanic Analysis

The Titanic notebook is open source, and all of our material is online. Anyone can develop the most sophisticated AI solutions.

## The difficult part is never to implement the algorithm

The hard part of a machine learning problem is to get data into the right format so you can solve the problem. We'll illustrate this below.

![data-x](http://oi64.tinypic.com/o858n4.jpg)


# __Titanic Survivor Analysis__


**Sources:** 
* **Training + explanations**: https://www.kaggle.com/c/titanic

___
___



# Understanding the connections between passenger information and survival rate

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history.  On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others.

### **Our task is to train a machine learning model that analyzes the trend and the information in the data in order to predict if the passengers survived or not.**

# Import packages

In [None]:
# No warnings
import warnings
warnings.filterwarnings('ignore') # Filter out warnings

# data analysis and wrangling
import pandas as pd
import numpy as np

# visualization
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

# machine learning
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB # Gaussian Naive Bayes
from sklearn.linear_model import Perceptron
from sklearn.tree import DecisionTreeClassifier

import xgboost as xgb

from plot_distribution import plot_distribution
plt.rcParams['figure.figsize'] = (9, 5)

### Load Data

In [None]:
df = pd.read_csv('data/train.csv')

<a id='sec3'></a>
___
## Part 2: Exploring the Data
**Data descriptions**

<img src="data/Titanic_Variable.png">

In [None]:
# preview the data
df.head(7)

In [None]:
# General data statistics
df.describe()

### Histograms

In [None]:
df.hist(figsize=(13,10));

In [None]:
# Balanced data set?
df['Survived'].map({0:'Deceased',1:'Survived'}).value_counts()

___

> #### __Brief Remarks Regarding the Data__

> * `PassengerId` is an incrementing row index, not a measured attribute, and thus does not contain any valuable information. 

> * `Survived, Passenger Class, Age, Siblings Spouses, Parents Children` and `Fare` are numerical values (no need to transform them) -- but, we might want to group them (i.e. create categorical variables). 

> * `Sex, Embarked` are categorical features that we need to map to integer values. `Name, Ticket` and `Cabin` might also contain valuable information.

___
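
The remarks above can be checked directly with pandas. The following is a minimal sketch on a tiny hand-made frame (the values are illustrative and are not rows from `train.csv`; the names `toy`, `summary` and `categorical_cols` are made up for this example). It lists each column's dtype and number of missing values, and picks out the non-numeric columns that still need encoding.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic frame (illustrative values only)
toy = pd.DataFrame({
    'Survived': [0, 1, 1, 0],
    'Pclass':   [3, 1, 3, 2],
    'Sex':      ['male', 'female', 'female', 'male'],
    'Age':      [22.0, 38.0, np.nan, 35.0],
    'Embarked': ['S', 'C', None, 'S'],
})

# One row per column: its dtype and how many values are missing
summary = pd.DataFrame({
    'dtype': toy.dtypes.astype(str),
    'missing': toy.isnull().sum(),
})
print(summary)

# Non-numeric columns are the ones that must be mapped to numbers
categorical_cols = toy.select_dtypes(exclude='number').columns.tolist()
print(categorical_cols)
```

On the real data, the same two calls (`isnull().sum()` and `select_dtypes`) can be run on `df` right after loading it.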

### Dropping Unnecessary data
__Note:__ It is important to remove variables that convey information already captured by some other variable. Doing so reduces the correlation between features, while also diminishing potential overfitting.

In [None]:
# Drop columns 'PassengerId', 'Ticket', 'Cabin' and 'Fare'.
# The same columns need to be dropped for both test and training data.

df = df.drop(['PassengerId','Ticket', 'Cabin','Fare'], axis=1)

<a id='sec4'></a>
____
## Part 3: Transforming the data

### 3.1 _The Title of the person can be used to predict survival_

In [None]:
# List example titles in Name column
df.Name

In [None]:
# Create column called Title

# Raw string so that '\.' is passed to the regex engine as a literal dot
df['Title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
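
To see what the pattern captures, here is a small sketch on a few names written in the same "Surname, Title. Given names" layout as the `Name` column. The series `names` is an illustrative stand-in, not the notebook's data. The pattern looks for a space, then a run of letters (the captured group), then a literal dot.

```python
import pandas as pd

# Illustrative names in the "Surname, Title. Given names" layout
names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
    'Heikkinen, Miss. Laina',
])

# A space, then letters (captured), then a literal dot
titles = names.str.extract(r' ([A-Za-z]+)\.', expand=False)
print(titles.tolist())  # ['Mr', 'Mrs', 'Miss']
```

With `expand=False` and a single capture group, `str.extract` returns a Series rather than a one-column DataFrame, which is why the result can be assigned straight to a new column.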

In [None]:
# Double check that our titles make sense (by comparing to sex)

pd.crosstab(df['Title'], df['Sex'])

In [None]:
# Group uncommon titles into a single 'Rare' category
df['Title'] = df['Title'].replace(
    ['Lady', 'Countess', 'Capt', 'Col', 'Don', 'Dr',
     'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')

df['Title'] = df['Title'].replace('Mlle', 'Miss') #Mademoiselle
df['Title'] = df['Title'].replace('Ms', 'Miss')
df['Title'] = df['Title'].replace('Mme', 'Mrs') #Madame

In [None]:
# We now have more logical (contemporary) titles, and fewer groups

df[['Title', 'Survived']].groupby(['Title']).mean()

In [None]:
# We can plot the number of survivors and deceased for each title

sns.countplot(x='Survived', hue="Title", data=df, order=[1,0])
plt.xticks(range(2),['Survived','Deceased']);

In [None]:
# Title dummy mapping: Map titles to binary dummy columns

binary_encoded = pd.get_dummies(df.Title)
df[binary_encoded.columns] = binary_encoded
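
As a minimal sketch of what this dummy mapping produces, the example below one-hot encodes a toy `Title` column (illustrative values; the frame `toy` is not the notebook's `df`). Passing `dtype=int` is an optional choice made here so the new columns hold 0/1 rather than booleans.

```python
import pandas as pd

toy = pd.DataFrame({'Title': ['Mr', 'Miss', 'Mrs', 'Mr', 'Rare']})

# One column per distinct title; dtype=int gives 0/1 instead of True/False
dummies = pd.get_dummies(toy['Title'], dtype=int)

# Attach the new columns next to the original one
toy = pd.concat([toy, dummies], axis=1)
print(toy)
```

Each row has exactly one `1` across the dummy columns, so the original `Title` column becomes redundant once the dummies are attached.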

In [None]:
# Remove unique variables for analysis (Title is generally bound to Name, so it's also dropped)
df = df.drop(['Name', 'Title'], axis=1)

In [None]:
df.head()

### Map Gender column to binary (male = 0, female = 1) categories

In [None]:
# convert categorical variable to numeric

df['Sex'] = df['Sex']. \
    map( {'female': 1, 'male': 0} ).astype(int)

df.head()

### Handle missing values for age

In [None]:
df.Age = df.Age.fillna(df.Age.median())
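
A small sketch of median imputation on a toy series (the values are illustrative): `median()` skips missing entries by default, so the fill value is computed from the known ages only.

```python
import numpy as np
import pandas as pd

ages = pd.Series([22.0, np.nan, 38.0, 4.0, np.nan])

median_age = ages.median()        # NaN values are skipped by default
filled = ages.fillna(median_age)  # replace each NaN with the median

print(median_age)
print(filled.tolist())
```

The median is less sensitive to a few very old or very young passengers than the mean, which is one common reason to prefer it for a fill value.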

### Split age into bands and look at survival rates

In [None]:
# Age bands
df['AgeBand'] = pd.cut(df['Age'], 5)
df[['AgeBand', 'Survived']].groupby(['AgeBand'], as_index=False)\
                    .mean().sort_values(by='AgeBand', ascending=True)

### Survival probability against age

In [None]:
# Plot distributions of Age of passengers who survived 
# or did not survive

plot_distribution( df , var = 'Age' , target = 'Survived' ,\
                  row = 'Sex' )

# Recall: {'male': 0, 'female': 1}

In [None]:
# Replace each Age value with the integer code (0-4) of its age band,
# turning Age into an ordinal categorical feature

df.loc[ df['Age'] <= 16, 'Age'] = 0
df.loc[(df['Age'] > 16) & (df['Age'] <= 32), 'Age'] = 1
df.loc[(df['Age'] > 32) & (df['Age'] <= 48), 'Age'] = 2
df.loc[(df['Age'] > 48) & (df['Age'] <= 64), 'Age'] = 3
df.loc[ df['Age'] > 64, 'Age']=4
df = df.drop(['AgeBand'], axis=1)

df.head()

# Note we could just run 
# df['Age'] = pd.cut(df['Age'], 5,labels=[0,1,2,3,4])
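
One detail worth knowing: when `pd.cut` is given a number of bins, it derives equal-width edges from the minimum and maximum of the data, so the edges need not coincide exactly with 16, 32, 48 and 64. The sketch below, on illustrative ages, passes the edges explicitly instead, which reproduces the hand-written thresholds. The names `ages`, `edges` and `age_code` are made up for this example.

```python
import numpy as np
import pandas as pd

ages = pd.Series([0.42, 16.0, 16.5, 32.0, 40.0, 64.0, 80.0])

# Explicit right-closed bands: (-inf, 16], (16, 32], (32, 48], (48, 64], (64, inf)
edges = [-np.inf, 16, 32, 48, 64, np.inf]
age_code = pd.cut(ages, bins=edges, labels=[0, 1, 2, 3, 4]).astype(int)

print(age_code.tolist())  # [0, 0, 1, 1, 2, 3, 4]
```

Because the intervals are closed on the right, an age of exactly 16 lands in band 0 and an age of exactly 64 in band 3, matching the `<=` comparisons used above.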

### Travel Party Size

How did the number of people the person traveled with impact the chance of survival?

In [None]:
# SibSp = number of siblings / spouses aboard
# Parch = number of parents / children aboard

df['FamilySize'] = df['SibSp'] + df['Parch'] + 1

# Survival chance against FamilySize
df[['FamilySize', 'Survived']].groupby(['FamilySize'], as_index=True) \
                                .mean().sort_values(by='Survived', ascending=False)

In [None]:
# Plot it, 1 is survived

sns.countplot(x='Survived', hue="FamilySize", data=df, order=[1,0]);

In [None]:
# Create binary variable if the person was alone or not

df['IsAlone'] = 0
df.loc[df['FamilySize'] == 1, 'IsAlone'] = 1

df[['IsAlone', 'Survived']].groupby(['IsAlone'], as_index=True).mean()
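
The same flag can also be built in one vectorized step by converting a boolean comparison to integers. A minimal sketch on toy values (the frame `toy` is illustrative):

```python
import pandas as pd

toy = pd.DataFrame({'SibSp': [1, 0, 0, 3], 'Parch': [0, 0, 2, 1]})

# The passenger counts as part of their own travel party, hence the + 1
toy['FamilySize'] = toy['SibSp'] + toy['Parch'] + 1

# True/False comparison converted to 1/0
toy['IsAlone'] = (toy['FamilySize'] == 1).astype(int)
print(toy)
```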

In [None]:
# We will only use the binary IsAlone feature for further analysis

df.drop(['Parch', 'SibSp', 'FamilySize'], axis=1, inplace=True)

df.head()

# Feature construction

In [None]:
# We can also create new features based on intuitive combinations.
# For example, suppose that age multiplied by passenger class is a determining factor

df['Age*Class'] = df.Age.values * df.Pclass.values

df.loc[:, ['Age*Class', 'Age', 'Pclass']].head()

## Port the person embarked from
Let's see how that influences chance of survival

<img src= "data/images/titanic_voyage_map.png">
>___

> #### __Interesting Fact:__ 

> Third Class passengers were the first to board, with First and Second Class passengers following up to an hour before departure. 

> Third Class passengers were inspected for ailments and physical impairments that might lead to their being refused entry to the United States, while First Class passengers were personally greeted by Captain Smith.

In [None]:
# Fill NaN 'Embarked' values with the most frequent port
freq_port = df['Embarked'].dropna().mode()[0]
df['Embarked'] = df['Embarked'].fillna(freq_port)
    
df[['Embarked', 'Survived']].groupby(['Embarked'], as_index=True) \
                    .mean().sort_values(by='Survived', ascending=False)

In [None]:
# Create categorical dummy variables for Embarked values

binary_encoded = pd.get_dummies(df.Embarked)
df[binary_encoded.columns] = binary_encoded
df.drop('Embarked', axis=1, inplace=True)

df.head()

### Finished -- Preprocessing Complete!

In [None]:
# All features are approximately on the same scale,
# so there is no need for feature scaling / normalization

df.head(7)

### Sanity Check: View the correlation between features

In [None]:
# Features that are only weakly correlated with each other carry less
# redundant information and are generally more useful together

colormap = plt.cm.viridis
plt.figure(figsize=(12,12))
plt.title('Pearson Correlation of Features', y=1.05, size=15)
sns.heatmap(df.corr().round(2)\
            ,linewidths=0.1,vmax=1.0, square=True, cmap=colormap, \
            linecolor='white', annot=True);
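
Besides reading the heatmap by eye, the strongest pairwise correlations can be listed in a table. This is a minimal sketch on synthetic data (columns `a`, `b` and `c` are made up; `b` is built to be nearly a multiple of `a`). It keeps the upper triangle of the correlation matrix so that each pair appears once.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    'a': a,
    'b': 2 * a + rng.normal(scale=0.1, size=200),  # nearly redundant with 'a'
    'c': rng.normal(size=200),                     # independent noise
})

corr = toy.corr()

# Keep only entries above the diagonal so each pair is listed once
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = (corr.where(upper)
             .stack()
             .dropna()
             .sort_values(key=abs, ascending=False))

print(pairs.round(2))
```

Pairs at the top of this list are candidates for the kind of redundancy that motivates dropping or combining features.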

<a id='sec5'></a>
___
### Machine Learning, Prediction and Artificial Intelligence
Now we will use Machine Learning algorithms in order to predict if the person survived. 

**We will choose the best model from:**
1. Logistic Regression
2. K-Nearest Neighbors (KNN) 
3. Support Vector Machines (SVM)
4. Perceptron
5. XGBoost
6. Random Forest
7. Neural Network (Deep Learning)

### Setup Training and Validation Sets

In [None]:
X = df.drop("Survived", axis=1) # Training & Validation data
Y = df["Survived"]              # Response / Target Variable

print(X.shape, Y.shape)

In [None]:
# Split the training set so that we validate on 20% of the data.
# Note that our algorithms will never have seen the validation data during training

np.random.seed(1337) # set random seed for reproducibility

from sklearn.model_selection import train_test_split

X_train, X_val, Y_train, Y_val = \
                train_test_split(X, Y, test_size=0.2)

print('Training Samples:', X_train.shape, Y_train.shape)
print('Validation Samples:', X_val.shape, Y_val.shape)
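
As an alternative to seeding NumPy's global random state, `train_test_split` also accepts a `random_state` argument, and `stratify` keeps the class ratio the same in both parts. The sketch below uses synthetic arrays (`X_demo`, `y_demo` are made up for illustration) with 60% of one class and 40% of the other.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 samples, 2 features, 60/40 class balance
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 60 + [1] * 40)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo,
    test_size=0.2,        # hold out 20% for validation
    random_state=1337,    # makes this particular split reproducible
    stratify=y_demo,      # preserve the 60/40 ratio in both parts
)

print(X_tr.shape, X_va.shape)
print(y_tr.mean(), y_va.mean())
```

Stratifying is a design choice rather than a requirement; it matters most when one class is much rarer than the other.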

___
> ## General ML workflow
> 1. Create Model Object
> 2. Train the Model
> 3. Predict on _unseen_ data
> 4. Evaluate accuracy.

___
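
The four steps above can be sketched as follows. This is a minimal example on synthetic data from `make_classification`, not the Titanic features; the variable names are made up for illustration. It spells out the predict and evaluate steps separately, whereas the model cells in this notebook combine them through `.score`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem (illustrative only)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000)  # 1. create model object
model.fit(X_tr, y_tr)                      # 2. train the model
pred = model.predict(X_va)                 # 3. predict on unseen data
acc = accuracy_score(y_va, pred)           # 4. evaluate accuracy

print(f'Validation accuracy: {acc:.2%}')
```

For classifiers, `model.score(X_va, y_va)` returns the same accuracy in a single call.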

## Compare Different Prediction Models

### 1. Logistic Regression

In [None]:
logreg = LogisticRegression()           # create
logreg.fit(X_train, Y_train)            # train
acc_log_2 = logreg.score(X_val, Y_val)  # predict & evaluate

print('Logistic Regression accuracy:',\
      str(round(acc_log_2*100,2)),'%')

### 2. K-Nearest Neighbors

In [None]:
knn = KNeighborsClassifier(n_neighbors = 5)                  # instantiate
knn.fit(X_train, Y_train)                                    # fit
acc_knn = knn.score(X_val, Y_val)                            # predict + evaluate

print('K-Nearest Neighbors labeling accuracy:', str(round(acc_knn*100,2)),'%')                                

### 3. Support Vector Machine

In [None]:
# Support Vector Machines Classifier (non-linear kernel)
svc = SVC()                                                  # instantiate
svc.fit(X_train, Y_train)                                    # fit
acc_svc = svc.score(X_val, Y_val)                            # predict + evaluate

print('Support Vector Machines labeling accuracy:', str(round(acc_svc*100,2)),'%')

### 4. Perceptron

In [None]:
perceptron = Perceptron()                                    # instantiate 
perceptron.fit(X_train, Y_train)                             # fit
acc_perceptron = perceptron.score(X_val, Y_val)              # predict + evaluate

print('Perceptron labeling accuracy:', str(round(acc_perceptron*100,2)),'%')

### 5. Gradient Boosting

In [None]:
# XGBoost, same API as scikit-learn
gradboost = xgb.XGBClassifier(n_estimators=1000)             # instantiate
gradboost.fit(X_train, Y_train)                              # fit
acc_xgboost = gradboost.score(X_val, Y_val)                  # predict + evaluate

print('XGBoost labeling accuracy:', str(round(acc_xgboost*100,2)),'%')

### 6. Random Forest

In [None]:
# Random Forest
random_forest = RandomForestClassifier(n_estimators=500)   # instantiate
random_forest.fit(X_train, Y_train)                         # fit
acc_rf = random_forest.score(X_val, Y_val)                  # predict + evaluate

print('Random Forest labeling accuracy:', str(round(acc_rf*100,2)),'%')

### 7. Neural Networks (Deep Learning)

In [None]:
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam  # class name is capitalized in current Keras

In [None]:
model = Sequential()
model.add( Dense(units=300, activation='relu', input_shape=(13,) ))
model.add( Dense(units=100, activation='relu'))
model.add( Dense(units=50, activation='relu'))
model.add( Dense(units=1, activation='sigmoid') )

In [None]:
model.compile(loss = 'binary_crossentropy', optimizer = 'adam', metrics = ['accuracy'])
model.fit(X_train, Y_train, epochs = 100, batch_size= 50)

In [None]:
# Evaluate the model accuracy on the validation set
print('Neural Network accuracy:',str(round(model.evaluate(X_val, Y_val, batch_size=50,verbose=False)[1]*100,2)),'%')
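
To choose among the candidates, it helps to collect the validation accuracies in one sorted table. The sketch below does this for three scikit-learn models on synthetic data (the data, the `candidates` dictionary and the `ranking` series are assumptions for illustration). In this notebook the same table could be built from the `acc_*` values computed in the cells above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification problem (illustrative only)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

candidates = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'K-Nearest Neighbors': KNeighborsClassifier(n_neighbors=5),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=0),
}

# Fit every model on the training part and score it on the validation part
scores = {name: clf.fit(X_tr, y_tr).score(X_va, y_va)
          for name, clf in candidates.items()}

ranking = pd.Series(scores, name='Validation accuracy').sort_values(ascending=False)
print(ranking.round(3))
```

With a validation set this small, differences of a few percentage points between models can come down to the particular split, so the ranking is best read as a rough guide.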

### Importance scores in the random forest model

In [None]:
# Look at importance of features for random forest

def plot_model_var_imp( model , X , y ):
    imp = pd.DataFrame( 
        model.feature_importances_  , 
        columns = [ 'Importance' ] , 
        index = X.columns 
    )
    imp = imp.sort_values( [ 'Importance' ] , ascending = True )
    # Sorted ascending, so the last ten rows are the ten most important features
    imp[ -10 : ].plot( kind = 'barh' )
    print('Training accuracy Random Forest:', model.score( X , y ))

plot_model_var_imp(random_forest, X_train, Y_train)

<a id='sec6'></a>
___

## Appendix I:
#### Why are our models maxing out at around 80%?


#### __John Jacob Astor__

<img src= "data/images/john-jacob-astor.jpg"> 

John Jacob Astor perished in the disaster even though our model predicted he would survive. Astor was the wealthiest person on the Titanic -- his ticket fare was valued at over 35,000 USD in 2016 -- so it seems likely that he would have been among the approximately 35 percent of men in first class to survive. However, this was not the case: although his pregnant wife survived, John Jacob Astor’s body was recovered a week later, along with a gold watch, a diamond ring with three stones, and no less than 92,481 USD (2016 value) in cash.

<br >


#### __Olaus Jorgensen Abelseth__

<img src= "data/images/olaus-jorgensen-abelseth.jpg">

Abelseth was a 25-year-old Norwegian sailor, a man in 3rd class, and not expected to survive by the classifier. However, once the ship sank, he survived by swimming for 20 minutes in the frigid North Atlantic water before joining other survivors on a waterlogged collapsible boat.

Abelseth got married three years later, settled down as a farmer in North Dakota, had 4 kids, and died in 1980 at the age of 94.

<br >

### __Key Takeaway__ 

As engineers and business professionals, we are trained to ask ourselves what we could do to improve on an 80 percent average. As is often the case, it’s easy to forget that these data points represent real people. Each time our model was wrong we should be glad -- in such misclassifications we will likely find incredible stories of human nature and courage triumphing over extremely difficult odds. 

__It is important to never lose sight of the human element when analyzing data that deals with people.__ 

<a id='sec7'></a>
___
## Appendix II: Resources and references to material we won't cover in detail

> * **Gradient Boosting:** http://blog.kaggle.com/2017/01/23/a-kaggle-master-explains-gradient-boosting/

> * **Jupyter Notebook (tutorial):** https://www.datacamp.com/community/tutorials/tutorial-jupyter-notebook

> * **K-Nearest Neighbors (KNN):** https://towardsdatascience.com/introduction-to-k-nearest-neighbors-3b534bb11d26

> * **Logistic Regression:** https://towardsdatascience.com/5-reasons-logistic-regression-should-be-the-first-thing-you-learn-when-become-a-data-scientist-fcaae46605c4

> * **Naive Bayes:** http://scikit-learn.org/stable/modules/naive_bayes.html

> * **Perceptron:** http://aass.oru.se/~lilien/ml/seminars/2007_02_01b-Janecek-Perceptron.pdf

> * **Random Forest:** https://medium.com/@williamkoehrsen/random-forest-simple-explanation-377895a60d2d

> * **Support Vector Machines (SVM):** https://towardsdatascience.com/https-medium-com-pupalerushikesh-svm-f4b42800e989


<br>
___
___

![](http://i67.tinypic.com/2jcbwcw.png)