# Data Science for Manufacturing - Workshop 7-1: Introduction to Machine Learning

## Objectives
- A realistic dataset
  - Numeric and categorical data types
  - Missing values
  - Mixture of useful and non-useful information
- A complete data analysis process using tools introduced previously
  - Introduction of the dataset
  - Univariate analysis
  - Bivariate analysis
  - Feature engineering and data carpentry
  - Inference
- Different Machine Learning (ML) models
  - Introduction to different ML models
  - Apply and compare ML models

## 1. Introduction to the dataset and libraries

![Sinking of the Titanic](https://upload.wikimedia.org/wikipedia/commons/6/6e/St%C3%B6wer_Titanic.jpg)


![The movie](https://resizing.flixster.com/-XZAfHZM39UwaGJIFWKAE8fS0ak=/v3/t/assets/p20056_v_h9_ab.jpg)

**Titanic dataset**  

Background:  
- On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This translates to a survival rate of about 32%.
- One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew.

<br>

Problem to solve:
- Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper class. Can we observe such patterns in the given dataset and draw conclusions from them?
 - Did women have a higher survival rate?
 - Did children have a higher chance of survival?
 - Were upper-class passengers more likely to have survived?

- Given a training set of passengers labelled with whether or not they survived the Titanic disaster, can our model predict whether the passengers in a test dataset, whose survival information is withheld from the model, survived?

In [None]:
# data analysis and wrangling
import pandas as pd
import numpy as np
import random as rnd

# visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# machine learning
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Perceptron
from sklearn.neural_network import MLPClassifier

In [None]:
url1 = 'https://raw.githubusercontent.com/drewsherlock/dsim_workshops/main/train.csv'
df = pd.read_csv(url1)

url2 = 'https://raw.githubusercontent.com/drewsherlock/dsim_workshops/main/test.csv'
test = pd.read_csv(url2)

In [None]:
test.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,0,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,1,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,0,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,0,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,1,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


In [None]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


Most columns are self-explanatory:
- 'PassengerId' is the ID number
- 'Survived' means whether a passenger survived or not
- 'Pclass' means the class of the ticket the passenger purchased
- 'Name' is the passenger's name
- 'Sex' is the passenger's sex
- 'Age' is the passenger's age
- 'SibSp' means the number of siblings or spouses aboard
- 'Parch' means the number of parents or children aboard
- 'Ticket' means the ticket number
- 'Fare' means the amount of money the passenger paid for the ticket
- 'Cabin' means the cabin number
- 'Embarked' means the place where the passenger boarded the Titanic.
 - There are 3 ports: C = Cherbourg, Q = Queenstown, S = Southampton

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


**Observations:**

Datatypes of different columns:  
- Categorical: Survived, Sex, Embarked, Pclass
- Numerical: Age, Fare, SibSp, Parch


Further categorisation of datatypes:
- Categorical (nominal): Survived, Sex, Embarked
- Ordinal: Pclass
- Continuous: Age, Fare
- Discrete: SibSp, Parch

Columns containing null values:
- The Cabin, Age and Embarked columns contain null values (687, 177 and 2 missing out of 891 rows, respectively)

Columns containing strings or a mixture of datatypes (the `float` entries counted below are missing values, which pandas stores as NaN):
- Name, Sex, Ticket, Cabin, Embarked

In [None]:
df['Name'].map(type).value_counts()

<class 'str'>    891
Name: Name, dtype: int64

In [None]:
df['Sex'].map(type).value_counts()

<class 'str'>    891
Name: Sex, dtype: int64

In [None]:
df['Ticket'].map(type).value_counts()

<class 'str'>    891
Name: Ticket, dtype: int64

In [None]:
df['Cabin'].map(type).value_counts()

<class 'float'>    687
<class 'str'>      204
Name: Cabin, dtype: int64

In [None]:
df['Embarked'].map(type).value_counts()

<class 'str'>      889
<class 'float'>      2
Name: Embarked, dtype: int64

The `describe()` method gives a brief statistical summary of the numeric columns of a dataframe. With `include=['O']`, it summarises the columns of `object` type instead.

In [None]:
df.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [None]:
df.describe(include=['O'])

Unnamed: 0,Name,Sex,Ticket,Cabin,Embarked
count,891,891,891,204,889
unique,891,2,681,147,3
top,"Braund, Mr. Owen Harris",male,347082,B96 B98,S
freq,1,577,7,4,644


In [None]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Survived     418 non-null    int64  
 2   Pclass       418 non-null    int64  
 3   Name         418 non-null    object 
 4   Sex          418 non-null    object 
 5   Age          332 non-null    float64
 6   SibSp        418 non-null    int64  
 7   Parch        418 non-null    int64  
 8   Ticket       418 non-null    object 
 9   Fare         417 non-null    float64
 10  Cabin        91 non-null     object 
 11  Embarked     418 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 39.3+ KB


**Observations:**
- Like the train dataset, the test dataset has missing values in the Age and Cabin columns. It has one extra problem: the Fare column has one missing value. The Embarked column has no missing values here.

## 2. Univariate analysis

In [None]:
"""
homework: show the distributions of age, sex and pclass columns
"""

## 3. Bivariate analysis

### 3.1 Pclass and survival

#### 3.1.1 `sns.FacetGrid(data, row=None, col=None)`

- This is a multi-plot grid for plotting conditional relationships. Each plot in the grid is a 'facet'.
- `row` and `col` name the variables whose values split the dataframe into facets.
- `sns.FacetGrid.map()` applies a plotting function to each facet's subset of the dataframe.

Observation:
- Pclass affects the survival rate.

### 3.2 Sex and survival

In [None]:
"""
homework: use FacetGrid() to plot sex distribution conditioning on survival
"""

Observation:
- Sex affects the survival rate.

### 3.3 Age and survival

Observations:
- The Age column has many distinct values, but we do not need to treat age at a high level of accuracy, i.e. 24, 24.5 and 28 should not be treated very differently. It might be useful to group ages into bands.

Observation:
- Age affects the survival rate.

## 4. Multivariate analysis
In the last notebook, multivariate analysis was correlation-focused: a matrix of pairwise correlations was calculated and visualised. Here, more tools for multivariate analysis are introduced, in which multiple variables are passed into the analysis functions simultaneously.
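
As a minimal sketch of passing several variables into one analysis function, a pandas pivot table can show the survival rate for each combination of two variables. The `toy` dataframe is made up for illustration; only the column names match the dataset.

```python
import pandas as pd

# Toy stand-in for the training data (values are made up for illustration).
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 1, 3, 3, 3, 3],
    "Sex":      ["female", "female", "male", "male",
                 "female", "female", "male", "male"],
    "Survived": [1, 1, 1, 0, 1, 0, 0, 0],
})

# Rows are Pclass values, columns are Sex values, and each cell is the
# mean of Survived (the survival rate) for that combination.
rates = toy.pivot_table(index="Pclass", columns="Sex",
                        values="Survived", aggfunc="mean")
print(rates)
```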

### 4.1 Pclass, age and survival

### 4.2 Pclass, sex and survival

In [None]:
"""
homework: look at the sex distribution conditioning on survived and pclass
"""

### 4.3 Pclass, sex, embarked and survival

### 4.4 Embarked, sex, fare and survival

In [None]:
"""
homework: look at how the four variables interact using FacetGrid and barplot
"""

## 5. Data carpentry and feature engineering
- Deal with missing values and wrong data type
- Deal with categorical data to prepare for inferences

### 5.1 Drop non-useful columns

### 5.2 Feature engineering for Name column
Problems:
- Contains non-useful strings and useful titles simultaneously
- Categorical 'str' type of data

To use categorical values together with numeric values, map the categorical values to numeric values.
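
One possible way to do this for the Name column is sketched below: extract the title with a regular expression, then map it to an integer. The regular expression, the integer codes and the choice to lump all other titles into one code are assumptions made for illustration; the last two names are hypothetical.

```python
import pandas as pd

names = pd.DataFrame({"Name": [
    "Braund, Mr. Owen Harris",
    "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
    "Heikkinen, Miss. Laina",
    "Doe, Master. John",   # hypothetical name
    "Doe, Col. Richard",   # hypothetical name
]})

# The title sits between the comma and the first full stop.
names["Title"] = (
    names["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
)

# Arbitrary integer codes (assumption); any title not listed gets code 5.
common_titles = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4}
names["TitleCode"] = names["Title"].map(common_titles).fillna(5).astype(int)
print(names[["Title", "TitleCode"]])
```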

### 5.3 Feature engineering for Sex column
Problems:
- Categorical data
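
A minimal sketch of one way to handle this: since the column has only two values, each can be mapped to an integer. Which value becomes 1 and which becomes 0 is an arbitrary choice made here for illustration.

```python
import pandas as pd

# Toy column (values are made up for illustration).
people = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})

# Replace each string with an integer code (the coding is an assumption).
people["Sex"] = people["Sex"].map({"female": 1, "male": 0}).astype(int)
print(people)
```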

### 5.4 Data carpentry and feature engineering for Age column
Problems:
- Redundant accuracy level that does not help inference, e.g. 35.5, 38.5 and 36 should be considered the same age band
- Contains missing values

#### 5.4.1 Deal with missing values

- Solution 1: replace with the mean value or the most frequent value
- Solution 2: infer the missing age from other columns, such as Pclass and Sex
 - there are 2*3 = 6 median values to calculate, because there are 3 Pclass values and 2 Sex values.
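
Solution 2 can be sketched with a grouped median. The `ages` dataframe is made up for illustration, and `groupby(...).transform("median")` is one of several ways to compute the per-group medians; it is not necessarily the approach used in the workshop.

```python
import numpy as np
import pandas as pd

# Toy data with two missing ages (values are made up for illustration).
ages = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3, 3],
    "Sex":    ["female", "female", "female", "male", "male", "male", "male"],
    "Age":    [30.0, 40.0, np.nan, 20.0, 22.0, 30.0, np.nan],
})

# For every row, the median Age of its (Pclass, Sex) group; NaN is ignored.
group_median = ages.groupby(["Pclass", "Sex"])["Age"].transform("median")

# Fill only the missing ages with the median of the matching group.
ages["Age"] = ages["Age"].fillna(group_median)
print(ages)
```

Here the missing first-class female age becomes 35.0 (the median of 30 and 40) and the missing third-class male age becomes 22.0 (the median of 20, 22 and 30).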

#### 5.4.2 Deal with redundant accuracy
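
A minimal sketch of banding with `pd.cut`, assuming five equal-width bands of 16 years (the band edges are an illustrative choice, not taken from the workshop):

```python
import pandas as pd

# Toy ages (values chosen for illustration).
ages = pd.DataFrame({"Age": [0.42, 16.0, 24.0, 24.5, 28.0, 35.5, 38.5, 64.0, 80.0]})

# Band edges (assumption). Each band includes its right edge: (0, 16], (16, 32], ...
bins = [0, 16, 32, 48, 64, 80]

# labels=False returns the integer index of the band instead of an interval.
ages["AgeBand"] = pd.cut(ages["Age"], bins=bins, labels=False)
print(ages)
```

With these edges, 24, 24.5 and 28 all fall into band 1, and 35.5 and 38.5 both fall into band 2.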

### 5.5 Data carpentry and feature engineering for Embarked column
Problems:
- 'object' type of data
- contains missing values

#### 5.5.1 Deal with missing values
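
Since only a couple of values are missing, one simple option is to fill them with the most frequent port. This is a sketch under that assumption, on made-up data:

```python
import numpy as np
import pandas as pd

# Toy column with two missing values (values are made up for illustration).
ports = pd.DataFrame({"Embarked": ["S", "C", np.nan, "S", "Q", "S", np.nan]})

# mode() returns the most frequent value(s); take the first one.
most_common = ports["Embarked"].mode()[0]

# Replace the missing entries with the most frequent port.
ports["Embarked"] = ports["Embarked"].fillna(most_common)
print(ports)
```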

#### 5.5.2 Convert categorical data into numeric data
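
A minimal sketch of the conversion, assuming the missing values have already been filled. The integer codes are an arbitrary choice made for illustration; note that they impose an order on the ports that has no real meaning.

```python
import pandas as pd

# Toy column without missing values (values are made up for illustration).
ports = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# Arbitrary integer code per port (assumption).
port_codes = {"S": 0, "C": 1, "Q": 2}
ports["Embarked"] = ports["Embarked"].map(port_codes).astype(int)
print(ports)
```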

### 5.6 Data carpentry for Fare column in the test dataset
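
The test dataset has a missing Fare value. One common option, assumed here for illustration, is to fill it with the median fare:

```python
import numpy as np
import pandas as pd

# Toy column with one missing fare (values are made up for illustration).
fares = pd.DataFrame({"Fare": [7.25, 8.05, np.nan, 71.2833, 53.1]})

# The median ignores NaN and is less sensitive to very expensive tickets
# than the mean.
median_fare = fares["Fare"].median()
fares["Fare"] = fares["Fare"].fillna(median_fare)
print(fares)
```

Whether the median should come from the test set itself or from the training set is a design choice; using the training set keeps all statistics tied to the data the model learned from.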

## 6. Inference with ML models

### 6.1 Drop non-useful columns
We need X, the feature data, and y, the label data, for inference.
- For label data: Survived column
- For feature data: other columns after engineering


Columns we select for the feature data:
- Pclass, Sex, Age, SibSp, Parch, Fare, Embarked, Title

Other columns:
- PassengerId, Cabin, Ticket are not used in inferences


Note: there is a convention in machine learning and deep learning of using an uppercase letter for feature data and a lowercase letter for label data: X_train, y_train, X_test, y_test. The reasons for this convention:
- X usually represents a matrix (or higher-dimensional array) of features, while y represents lower-dimensional data, typically a vector
- This convention is followed in many machine learning libraries, including scikit-learn.
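
The selection described in this section can be sketched as follows. The `train` dataframe is a made-up stand-in for the engineered training data, in which all feature columns are assumed to be numeric already.

```python
import pandas as pd

# Toy stand-in for the engineered training data (values are made up).
train = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Survived":    [0, 1, 1, 0],
    "Pclass":      [3, 1, 3, 1],
    "Sex":         [0, 1, 1, 0],
    "Age":         [1, 2, 1, 2],
    "SibSp":       [1, 1, 0, 0],
    "Parch":       [0, 0, 0, 0],
    "Fare":        [7.25, 71.28, 7.93, 53.10],
    "Embarked":    [0, 1, 0, 0],
    "Title":       [1, 3, 2, 1],
    "Ticket":      ["A", "B", "C", "D"],
    "Cabin":       [None, "C85", None, "C123"],
})

# Feature columns listed above; PassengerId, Cabin and Ticket are left out.
feature_cols = ["Pclass", "Sex", "Age", "SibSp", "Parch",
                "Fare", "Embarked", "Title"]

X_train = train[feature_cols]   # 2-D: one row per passenger, one column per feature
y_train = train["Survived"]     # 1-D: one label per passenger
print(X_train.shape, y_train.shape)
```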

### 6.2 Logistic regression model
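
A minimal sketch of the fit, score and predict pattern with scikit-learn's `LogisticRegression`. The arrays are made up for illustration, with two hypothetical numeric features standing in for the engineered columns.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy features: [Pclass, Sex] with Sex coded as 1 = female (made-up values).
X_train = np.array([[1, 1], [1, 0], [2, 1], [2, 0],
                    [3, 1], [3, 0], [3, 0], [1, 1]])
y_train = np.array([1, 0, 1, 0, 1, 0, 0, 1])
X_test = np.array([[1, 1], [3, 0]])

model = LogisticRegression()
model.fit(X_train, y_train)                      # learn from the labelled data

train_accuracy = model.score(X_train, y_train)   # fraction predicted correctly
predictions = model.predict(X_test)              # 0/1 label for each test row
print(train_accuracy, predictions)
```

The other scikit-learn classifiers imported at the top of this notebook expose the same `fit`, `score` and `predict` methods.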

### 6.3 K-NN model

### 6.4 Naive Bayes

### 6.5 Neural network

#### 6.5.1 Single-layer perceptron

#### 6.5.2 Multi-layer perceptron

### 6.6 Random Forest

### 6.7 Decision tree

### 6.8 Model comparison
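
One way to compare the models is to fit each of them on the same data and collect the accuracies in a table. The sketch below uses synthetic data from `make_classification` in place of the Titanic features, so the numbers it prints say nothing about the real dataset; the hyperparameters are illustrative assumptions.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 200 samples, 8 numeric features.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Hyperparameters are illustrative choices, not tuned values.
models = {
    "Logistic regression": LogisticRegression(max_iter=1000),
    "K-NN": KNeighborsClassifier(n_neighbors=3),
    "Naive Bayes": GaussianNB(),
    "Perceptron": Perceptron(random_state=0),
    "Multi-layer perceptron": MLPClassifier(max_iter=1000, random_state=0),
    "Random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Decision tree": DecisionTreeClassifier(random_state=0),
}

# Fit every model on the same split and record its held-out accuracy.
rows = []
for name, model in models.items():
    model.fit(X_train, y_train)
    rows.append({"Model": name, "Accuracy": model.score(X_test, y_test)})

results = (
    pd.DataFrame(rows)
    .sort_values("Accuracy", ascending=False)
    .reset_index(drop=True)
)
print(results)
```

Scoring on held-out rows rather than on the training rows matters here: flexible models such as decision trees can reach near-perfect training accuracy without generalising well.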