## Recap

1. Revisited One Hot Encoding using pd.get_dummies()
2. Revisited One Hot Encoding using the sklearn preprocessing library
3. max_depth parameter for Decision Trees.
4. min_samples_split parameter for Decision Trees.

## Agenda

1. Support Vector Machine for Classification
2. Support Vector Machine for Regression

## Loading the standard libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

## Loading the dataset

In [2]:
data = pd.read_excel('Titanic.xlsx')
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Unnamed: 12,Unnamed: 13,Unnamed: 14
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,,Row Labels,Count of Sex
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,,female,314
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,,male,577
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,,Grand Total,891
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,,,


In [3]:
data.shape

(891, 15)

## Data Preprocessing

In [4]:
data.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
Unnamed: 12    891
Unnamed: 13    887
Unnamed: 14    887
dtype: int64

#### Drop the columns with more than 30% missing values

In [5]:
data = data.drop(['Cabin', 'Unnamed: 12', 'Unnamed: 13', 'Unnamed: 14'], axis = 1)
data.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Embarked         2
dtype: int64
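
The 30% rule above can also be applied programmatically instead of listing the columns by hand. In the real data, Cabin is missing in 687 of 891 rows (about 77%), while Age is missing in 177 rows (about 20%) and therefore stays. The sketch below is only an illustration on a small made-up frame; `toy`, `threshold`, and `cols_to_drop` are hypothetical names, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Titanic data (values are made up for illustration).
toy = pd.DataFrame({
    'Age':   [22.0, np.nan, 26.0, 35.0, 35.0],
    'Cabin': [np.nan, 'C85', np.nan, np.nan, np.nan],
    'Fare':  [7.25, 71.2833, 7.925, 53.1, 8.05],
})

threshold = 0.30  # drop columns with more than 30% missing values

# isnull().mean() gives the fraction of missing entries in each column.
missing_fraction = toy.isnull().mean()
cols_to_drop = missing_fraction[missing_fraction > threshold].index.tolist()

toy_clean = toy.drop(columns=cols_to_drop)
print(missing_fraction)
print(cols_to_drop)
```

Here Age (20% missing) is kept and Cabin (80% missing) is dropped, mirroring the decision made on the full dataset.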

In [6]:
## Fill the missing values in Age and Embarked columns

from sklearn.impute import SimpleImputer
sim = SimpleImputer(strategy = 'most_frequent')
sim

SimpleImputer(strategy='most_frequent')

In [7]:
data[['Age', 'Embarked']] = sim.fit_transform(data[['Age', 'Embarked']])
data.isnull().sum()

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Embarked       0
dtype: int64
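
The cell above fills both columns with their most frequent value. A common alternative, sketched here under the assumption that Age should be treated as numeric, is to impute each column type separately: the median for Age and the most frequent value for Embarked. Fitting one imputer on a mix of numeric and string columns may also leave Age stored with a non-numeric dtype, which separate imputers avoid. The frame `toy` and the imputer names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with one numeric and one categorical column (made-up values).
toy = pd.DataFrame({
    'Age':      [22.0, np.nan, 26.0, 35.0, 35.0],
    'Embarked': ['S', 'C', np.nan, 'S', 'S'],
})

# Numeric column: fill missing entries with the median.
age_imputer = SimpleImputer(strategy='median')
toy['Age'] = age_imputer.fit_transform(toy[['Age']]).ravel()

# Categorical column: fill missing entries with the most frequent label.
port_imputer = SimpleImputer(strategy='most_frequent')
toy['Embarked'] = port_imputer.fit_transform(toy[['Embarked']]).ravel()

print(toy)
print(toy.isnull().sum())
```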

In [8]:
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,S


## Observations:

1. PassengerId, Name, and Ticket are of no use for our analysis: Name and Ticket hold only the passenger's name and ticket number, and PassengerId is just a serial identifier (1, 2, 3, ...).

2. SibSp = number of siblings and spouses of the passenger aboard
3. Parch = number of parents and children of the passenger aboard

In [9]:
data = data.drop(['PassengerId', 'Name', 'Ticket'], axis = 1)
data.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,male,22.0,1,0,7.25,S
1,1,1,female,38.0,1,0,71.2833,C
2,1,3,female,26.0,0,0,7.925,S
3,1,1,female,35.0,1,0,53.1,S
4,0,3,male,35.0,0,0,8.05,S


## Encoding the Sex and Embarked columns

In [10]:
## Manually encode the Sex column

dic = {'male' : 0, 'female' : 1}
data['Sex'] = data['Sex'].replace(dic)
data.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,0,22.0,1,0,7.25,S
1,1,1,1,38.0,1,0,71.2833,C
2,1,3,1,26.0,0,0,7.925,S
3,1,1,1,35.0,1,0,53.1,S
4,0,3,0,35.0,0,0,8.05,S
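
The same manual encoding can be written with `Series.map`, which is a common alternative to `replace` for a full label-to-number mapping. This is a minimal sketch on a made-up series; `sex` and `sex_encoded` are hypothetical names. Note that with `map`, any label missing from the dictionary becomes NaN, so the dictionary should cover every category.

```python
import pandas as pd

# Made-up sample of the Sex column.
sex = pd.Series(['male', 'female', 'female', 'male'], name='Sex')

# Map each label to its numeric code; labels absent from the dict would become NaN.
sex_encoded = sex.map({'male': 0, 'female': 1})
print(sex_encoded.tolist())
```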


In [12]:
## The Embarked column contains the port where each passenger boarded the Titanic. Port names have no order, hence
## perform One Hot Encoding

In [13]:
data_ohe = pd.get_dummies(data['Embarked'])
data_ohe

Unnamed: 0,C,Q,S
0,0,0,1
1,1,0,0
2,0,0,1
3,0,0,1
4,0,0,1
...,...,...,...
886,0,0,1
887,0,0,1
888,0,0,1
889,1,0,0
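
The output above shows 0/1 integers. Recent pandas versions return boolean columns from `pd.get_dummies()` by default, so passing `dtype=int` is one way to get the same 0/1 output everywhere. The sketch below also uses `prefix` to make the new column names self-describing; the series `ports` is a made-up example, and the `Embarked_` prefix is a choice for illustration, not something the notebook uses.

```python
import pandas as pd

# Made-up sample of the Embarked column.
ports = pd.Series(['S', 'C', 'S', 'Q'], name='Embarked')

# dtype=int forces 0/1 integers; prefix labels each new column with its source.
dummies = pd.get_dummies(ports, prefix='Embarked', dtype=int)
print(dummies)
```

Each row contains exactly one 1, marking the port for that passenger.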


In [14]:
## Join the encoded columns to original data

In [16]:
data = pd.concat([data, data_ohe], axis = 1)
data.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,C,Q,S
0,0,3,0,22.0,1,0,7.25,S,0,0,1
1,1,1,1,38.0,1,0,71.2833,C,1,0,0
2,1,3,1,26.0,0,0,7.925,S,0,0,1
3,1,1,1,35.0,1,0,53.1,S,0,0,1
4,0,3,0,35.0,0,0,8.05,S,0,0,1


In [17]:
### Since the encoded values of Embarked are now part of the data, the original Embarked column is no longer needed, hence drop it

In [18]:
data = data.drop('Embarked', axis = 1)
data.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,C,Q,S
0,0,3,0,22.0,1,0,7.25,0,0,1
1,1,1,1,38.0,1,0,71.2833,1,0,0
2,1,3,1,26.0,0,0,7.925,0,0,1
3,1,1,1,35.0,1,0,53.1,0,0,1
4,0,3,0,35.0,0,0,8.05,0,0,1


## Feature Scaling

In [19]:
## Age and Fare span a much wider range of values than the other columns, hence perform feature scaling on them

from sklearn.preprocessing import MinMaxScaler
mms = MinMaxScaler()
mms

MinMaxScaler()

In [21]:
data[['Age', 'Fare']] = mms.fit_transform(data[['Age', 'Fare']])
data.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,C,Q,S
0,0,3,0,0.271174,1,0,0.014151,0,0,1
1,1,1,1,0.472229,1,0,0.139136,1,0,0
2,1,3,1,0.321438,0,0,0.015469,0,0,1
3,1,1,1,0.434531,1,0,0.103644,0,0,1
4,0,3,0,0.434531,0,0,0.015713,0,0,1
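
MinMaxScaler rescales each column to the range [0, 1] using (x - min) / (max - min). The sketch below checks that formula against the scaler on five fare values taken from the first rows of the data; because the minimum and maximum here come from only these five values, the results differ from the scaled column shown above. The names `fares`, `manual`, and `scaled` are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Fare values from the first five rows, as a single-column 2D array.
fares = np.array([[7.25], [71.2833], [7.925], [53.1], [8.05]])

# Manual min-max scaling: (x - min) / (max - min), computed per column.
col_min = fares.min(axis=0)
col_max = fares.max(axis=0)
manual = (fares - col_min) / (col_max - col_min)

# The same transformation using scikit-learn.
scaled = MinMaxScaler().fit_transform(fares)

print(np.allclose(manual, scaled))
```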


## Separate X and y from the data

In [22]:
X = data.drop('Survived', axis = 1)
y = data['Survived']

## Split the data into train and test sets

In [23]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.3, random_state = 0)

## Apply Support Vector Machine for Classification on X_train and y_train

In [24]:
from sklearn.svm import SVC
sv = SVC()
sv

SVC()

In [25]:
sv.fit(X_train, y_train)

SVC()
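
`SVC()` with no arguments uses scikit-learn's defaults, which include an RBF kernel and `C=1.0`. In this notebook the scaler is fitted on the full dataset before the split; a common alternative, sketched below, is to put the scaler and the classifier in one pipeline so the scaling statistics come from the training rows only. The example uses synthetic data from `make_classification`, not the Titanic data, and `X_demo`, `y_demo`, `model`, and `demo_pred` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic binary classification data, used only for illustration.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# The pipeline fits the scaler on the training rows, then trains the SVC on the scaled data.
model = make_pipeline(MinMaxScaler(), SVC(kernel='rbf', C=1.0))
model.fit(X_tr, y_tr)

# predict() applies the already-fitted scaler to the test rows before classifying.
demo_pred = model.predict(X_te)
print(demo_pred[:10])
```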

## Perform Predictions on X_test

In [27]:
y_pred = sv.predict(X_test)
y_pred

array([0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1,
       0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0,
       1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1,
       1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1,
       0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0,
       1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0,
       1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1,
       0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1,
       0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1,
       0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0], dtype=int64)

In [28]:
X_test

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,C,Q,S
495,3,0,0.296306,0,0,0.028221,1,0,0
648,3,0,0.296306,0,0,0.014737,0,0,1
278,3,0,0.082684,4,1,0.056848,0,1,0
31,1,1,0.296306,1,0,0.285990,1,0,0
255,3,1,0.359135,0,2,0.029758,1,0,0
...,...,...,...,...,...,...,...,...,...
263,1,0,0.497361,0,0,0.000000,0,0,1
718,3,0,0.296306,0,0,0.030254,0,1,0
620,3,0,0.334004,1,0,0.028213,1,0,0
786,3,1,0.220910,0,0,0.014631,0,0,1


## Perform Evaluation

In [29]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)

0.7985074626865671

In [30]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, y_pred)

array([[146,  22],
       [ 32,  68]], dtype=int64)

## Observations:

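1. 146 + 68 = 214 of the 268 test samples are classified correctly (the diagonal of the matrix), which matches the accuracy of 214 / 268 ≈ 0.7985.
2. 22 + 32 = 54 test samples are classified incorrectly (the off-diagonal entries).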
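
Taking class 1 (survived) as the positive class, the four counts correspond to 146 true negatives, 68 true positives, 22 false positives, and 32 false negatives. The sketch below recomputes accuracy from these counts and also derives precision and recall, which the notebook does not report; the variable names are hypothetical.

```python
# Counts read from the confusion matrix, with class 1 (survived) as positive.
tn, fp, fn, tp = 146, 22, 32, 68

total = tn + fp + fn + tp

# Accuracy: share of all test samples that were classified correctly.
accuracy = (tp + tn) / total

# Precision: of the passengers predicted to survive, the share who actually did.
precision = tp / (tp + fp)

# Recall: of the passengers who actually survived, the share the model found.
recall = tp / (tp + fn)

print(total, accuracy, precision, recall)
```

Accuracy alone hides the fact that the two error types are not equally frequent here, which is why precision and recall are often reported alongside it.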