<img src="https://github.com/datastrider/titanic_svm/blob/main/titanic_comp_pic.jpg?raw=true" ></img>
# <p style="background-color:#0d101c;font-family:arial;color:#ffffff;font-size:150%;text-align:center;border-radius:20px;">Table of Contents</p>

[Project Motivation](#project_motivation)
_________________
1. [Import Modules](#import_modules)
2. [Load Data](#load_data)
3. [Data Exploration](#data_exploration)<br>
4. [Data Cleaning & Preprocessing](#data_preprocessing)<br>
    4.1 [Checking Missing Values](#check_missing_values)<br>
    4.2 [Checking Unusual/Invalid Values](#unusual_values)<br>
    4.3 [Encoding Categorical Data](#encode_data)<br>
    4.4 [Removing Columns/Attributes](#remove_cols)<br>
5. [Model Training: From Scratch](#no_sklearn)
6. [Model Training: Sklearn](#sklearn)
7. [Kaggle Submission](#kaggle_sub)<br>
    7.1 [Cleaning and Processing Test Data](#clean_test_data)<br>
    7.2 [Create submission csv](#create_submission_csv)<br>

<a class="anchor" id="project_motivation"></a>
# <p style="background-color:#0d101c;font-family:arial;color:#ffffff;font-size:150%;text-align:center;border-radius:20px;">Project Motivation</p>

The motivation for this project was to practise and improve the methods I use when investigating a dataset, cleaning and preprocessing it, and then using it with a machine learning algorithm.

It was also an opportunity to learn more about a specific machine learning algorithm: <b>Support Vector Machine</b><br>

Here I implement the base algorithm <b>from scratch</b> (no hyper-parameter optimisation), using gradient descent to calculate the weights and bias. I then use and optimise the <b>*Sklearn* implementation</b> to see how it compares, fit it on the full training dataset, make predictions on the test dataset and check them against Kaggle's answers.

The results obtained after uploading to Kaggle are not the best (about 0.66), but I enjoyed this project nonetheless.

<a class="anchor" id="import_modules"></a>
# <p style="background-color:#0d101c;font-family:arial;color:#ffffff;font-size:150%;text-align:center;border-radius:20px;">Import Modules</p>


In [None]:
import time
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import OneHotEncoder

from skopt import BayesSearchCV

<a class="anchor" id="load_data"></a>
# <p style="background-color:#0d101c;font-family:arial;color:#ffffff;font-size:150%;text-align:center;border-radius:20px;">Load Data</p>

In [None]:
train_data = pd.read_csv("/kaggle/input/titanic/train.csv")
test_data = pd.read_csv("/kaggle/input/titanic/test.csv")

In [None]:
train_data.head(n=10)

<a class="anchor" id="data_exploration"></a>
# <p style="background-color:#0d101c;font-family:arial;color:#ffffff;font-size:150%;text-align:center;border-radius:20px;">Data Exploration</p>

In this section, we will get a quick overview of some basic statistics of the dataset, and then look at how different attributes relate to who survived or not.

In [None]:
train_data.describe()

In [None]:
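# mean of each numeric attribute, grouped by whether the passenger survived
train_data.pivot_table(index="Survived", values=["Age", "Fare", "Pclass", "SibSp", "Parch"], aggfunc="mean")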

<b>How do Age and Fare correlate with survival?</b>

Below, we can see that there was no large disparity in the survival rate across different ages: the median age of the passengers who survived and of those who did not was the same. There was a difference in the fare price, where those who paid more tended to be more likely to survive.

In [None]:
fig, axes = plt.subplots(1,2)
fig.set_figwidth(25)
fig.set_figheight(8)

check_cols = ["Age", "Fare"]
sns.set_style("dark")
for i in range(len(check_cols)):
    
    sns.kdeplot(data=train_data.loc[train_data["Survived"] == 1, check_cols[i]],
                  ax=axes[i],
                  label="Survived",
                  color='blue',
                  fill=True)  # 'fill' replaces the deprecated 'shade' argument

    sns.kdeplot(data=train_data.loc[train_data["Survived"] == 0, check_cols[i]],
                  ax=axes[i],
                  label="Did not survive",
                  color='red',
                  fill=True)

    # plot vertical lines
    axes[i].axvline(train_data.loc[train_data["Survived"] == 1, check_cols[i]].median(),
                   color='blue')

    axes[i].axvline(train_data.loc[train_data["Survived"] == 0, check_cols[i]].median(),
                   color='red')
    
    # plot annotations of values corresponding to the vertical lines
    axes[i].annotate("Median {} Survived: {}".format(check_cols[i],
                                                        train_data.loc[train_data["Survived"] == 1, check_cols[i]].median()),
                                                        xy=(0.45, 0.95),
                                                        xycoords='axes fraction',
                                                        fontsize=15)
    
    axes[i].annotate("Median {} Not Survived: {}".format(check_cols[i],
                                                         train_data.loc[train_data["Survived"] == 0, check_cols[i]].median()),
                                                         xy=(0.45, 0.90),
                                                         xycoords='axes fraction',
                                                         fontsize=15)
    
    axes[i].title.set_text("{} and Survival".format(check_cols[i]))
    axes[i].legend()
plt.show()
plt.close()

<b>How do Pclass and Sex correlate with survival?</b>

We can see that those of a lower socio-economic status (a higher *Pclass* number) were more likely to have died than those of a higher class. This could potentially be caused by people of a higher class being let onto the lifeboats first, and/or by people of a higher socio-economic class being able to spend more. It was shown earlier that those who paid a higher fare were also more likely to have survived than those who paid a lower fare.

<b>It is clear that *'Age'*, *'Fare'*, *'Pclass'* and *'Sex'* are important attributes to consider when deciding on whether a passenger was likely to have survived or not </b>

In [None]:
fig, axes = plt.subplots(1,2)
fig.set_figwidth(15)
fig.set_figheight(5)
sns.countplot(x="Pclass", hue="Survived", data=train_data, palette=["#F34D4D", "#2B72D9"], ax=axes[0])
axes[0].set_xticks([1,2,3])
sns.countplot(x="Sex", hue="Survived", data=train_data, palette=["#F34D4D", "#2B72D9"], ax=axes[1])
axes[0].set_title("Pclass and Survival")
axes[1].set_title("Sex and Survival")
plt.show()
plt.close()

<a class="anchor" id="data_preprocessing"></a>
# <p style="background-color:#0d101c;font-family:arial;color:#ffffff;font-size:150%;text-align:center;border-radius:20px;">Data Cleaning & Preprocessing</p>

<b>Steps involved in Data Cleaning and Preprocessing</b>

* <b>Look for missing values and correct them</b>
    1. Delete observations with missing categorical data
    2. Replace missing numeric data with the median of that particular attribute
    <br><br>

* <b>Look for unusual values and correct them</b>
    1. Variations of the same text categorical data e.g. 'Male' and 'male'
    2. Invalid values of numeric data e.g. Age < 0
    3. Strong outlier values of numeric data
    <br><br>

* <b>Encode categorical data using the One-Hot encoder</b>
<br>
    Creates a column for each categorical value, and populates them with a 1 or 0
    e.g. 



|Embarked|
|--|
|S|
|C|
|Q|

changes to

|S|C|Q|
|--|--|--|
|1|0|0|
|0|1|0|
|0|0|1|


* <b>Remove columns deemed unnecessary</b>
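To make the encoding step concrete, here is a minimal sketch of one-hot encoding in plain Python. The helper `one_hot` is a hypothetical name used only for illustration; the notebook itself uses sklearn's `OneHotEncoder` further down, which does more than this (for example, it remembers the categories it was fitted on).

```python
def one_hot(values):
    """Return (categories, rows): one 0/1 column per distinct value."""
    # Sort the distinct values so the column order is predictable
    categories = sorted(set(values))
    # Each row has a 1 in the column matching its value and 0 elsewhere
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


# Hypothetical 'Embarked' values, not taken from the dataset
categories, rows = one_hot(["S", "C", "Q", "S"])
print(categories)
for row in rows:
    print(row)
```

Note that the columns come out in sorted order (`C`, `Q`, `S`), which is also the order used when naming the encoded columns later in this notebook.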

In [None]:
print(train_data.isnull().sum())

<a class="anchor" id="check_missing_values"></a>
## Checking Missing Values

<b>Correcting the *'Age'* column </b>
<br>
Replace the missing *'Age'* values with the <u>median</u> of all the present *'Age'* values.
<br>

<b>Why the median?</b>

Below, a histogram is plotted showing the distribution of the ages of the passengers. From the graph, it can be seen that the data is right-skewed, meaning that the distribution has a long right tail. In this case, it is better to use the median of the data over the mean, as the mean is affected more by extreme/outlier values and by skew. Below, the mean age is higher than the median age, as can be seen from the 2 plotted vertical lines. Thus, the missing values will be replaced with the median of the values in the *'Age'* column.
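As a small illustration of why the median is the safer choice for skewed data, the sketch below compares the two statistics on a made-up list of ages (these values are hypothetical and are not taken from the Titanic dataset).

```python
from statistics import mean, median

# Hypothetical ages: mostly young adults plus two much older passengers
ages = [22, 24, 25, 26, 28, 30, 71, 80]

# The two large values pull the mean upwards, but barely move the median
print("Mean age:  ", mean(ages))
print("Median age:", median(ages))
```

The mean lands well above the ages of most of the group, while the median stays close to the typical passenger, which is the behaviour we want from a fill-in value.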

In [None]:
sns.set(rc={"figure.figsize":(15,5)}) # set size of figure plotted
sns.set_style("dark")
sns.histplot(train_data["Age"], kde=True, bins=20, color="teal")
plt.axvline(train_data["Age"].median(), c="red", label="Median Age: {:.1f}".format(train_data["Age"].median()))
plt.axvline(train_data["Age"].mean(), c="blue", label="Mean Age: {:.1f}".format(train_data["Age"].mean()))
plt.legend()
plt.suptitle("Age of Passengers: Right Skewed", fontsize=20)
plt.xlabel("Age")
plt.ylabel("Count")
plt.show()

Replacing values in the *'Age'* column

In [None]:
# assign the result back instead of using inplace=True on a column selection
train_data['Age'] = train_data['Age'].fillna(train_data['Age'].median())

# Check if missing values have been filled for the 'Age' column
assert(train_data['Age'].isna().sum() == 0)
print("Missing 'Age' values:", train_data['Age'].isna().sum())

<b>Correcting the *'Embarked'* column</b>
<br>
Delete the rows with missing values in the *'Embarked'* column (There are only 2 missing, so this should be OK)

In [None]:
# drop the rows and reset the index so that later index-based merges stay aligned
train_data = train_data.loc[train_data['Embarked'].notna(), :].reset_index(drop=True)

# Check that no missing values remain in the 'Embarked' column
assert(train_data['Embarked'].isna().sum() == 0)
print("Missing 'Embarked' values:", train_data['Embarked'].isna().sum())

<b>Correcting the *'Cabin'* column</b>
<br>
There are too many missing values. It may be best to delete the column

In [None]:
train_data.drop("Cabin", axis=1, inplace=True)

# Check if 'Cabin' column has been deleted
assert('Cabin' not in train_data)
print("'Cabin' column present:", ('Cabin' in train_data))

<a class="anchor" id="unusual_values"></a>
## Checking Unusual/Invalid Values

### Categorical data

In [None]:
train_data.head()

In [None]:
categorical_cols = ["Survived", "Pclass", "Sex", "Embarked"]

for col in categorical_cols:
    print(col, ":", train_data[col].unique())

### Numerical Data

<b>Checking *'Age'* column</b>
<br>
Will look at the spread of *'Age'* values, and replace negative '*Age*' values with the median *'Age'* value. Negative ages are invalid.
<br>

From the previous histogram of *'Age'*, it is clear that there are no ages that are too big. To be safe, the largest age is also printed.

In [None]:
# replace negative ages with the median age
train_data.loc[train_data["Age"] < 0, "Age"] = train_data["Age"].median()

# assert that there are no negative (or zero) ages
assert((train_data["Age"] > 0).all())

# check the largest age
print("Largest age:", train_data["Age"].max())

<b>Checking *'SibSp'* and *'Parch*' columns</b>
<br>
Will look at the spread of the values of these columns, and replace negative values with 0

In [None]:
# replace negative values with 0
train_data.loc[train_data["SibSp"] < 0, "SibSp"] = 0
train_data.loc[train_data["Parch"] < 0, "Parch"] = 0

# assert that there are no negative values
assert((train_data["SibSp"] >= 0).all())
assert((train_data["Parch"] >= 0).all())

<b>Checking *'Fare'* column</b>
<br>
Will look at the spread of *'Fare'* values, and replace negative '*Fare*' values with the median *'Fare'* value. Negative fares are invalid

In [None]:
sns.set(rc={"figure.figsize":(15,5)}) # set size of figure plotted
sns.set_style("dark")
sns.histplot(train_data["Fare"], kde=True, bins=50, color="teal")
plt.axvline(train_data["Fare"].median(), c="red", label="Median Fare: {:.1f}".format(train_data["Fare"].median()))
plt.axvline(train_data["Fare"].mean(), c="blue", label="Mean Fare: {:.1f}".format(train_data["Fare"].mean()))
plt.legend()
plt.suptitle("Fare paid by Passengers: Right Skewed", fontsize=20)
plt.xlabel("Fare")
plt.ylabel("Count")
plt.show()

From this histogram we can see that the data is strongly right-skewed. Let us take a closer look at the tail end of the fare prices.
<br>

Below, the 50 highest fare prices in the dataset are printed out. From this, it can be seen that out of the 889 remaining observations, only a very small number have very high fares. This could correspond to the cabin prices, or perhaps to extra luxuries afforded to those passengers by paying more.

In [None]:
#train_data["Fare"].sort_values().tail(n=15)

survive_fare_df = train_data.loc[:, ["Survived","Fare"]].sort_values("Fare", ascending=False)
print(survive_fare_df["Fare"].head(n=50).to_numpy())

These outlier values may have a significant impact on the implementation of our model. Let us create a plot to investigate if this influences the survivability of the passengers

In [None]:
fig, axes = plt.subplots(1,2)
sns.countplot(x=survive_fare_df["Survived"][:50], ax=axes[0], palette=['#f75e5e',"#5da0e8"])
sns.countplot(x=survive_fare_df["Survived"][50:], ax=axes[1], palette=['#f75e5e',"#5da0e8"])
axes[0].set_title("Top 50 most expensive Fares & Survival")
axes[1].set_title("Excluding Top 50 most expensive Fares & Survival")
plt.show()

From this investigation and the graphs plotted above, we can see that the people who paid the highest fares were more likely to survive.

These results do not take other factors into account (they only hold if everything else is equal), but they are an indication that the high fares carry useful information, so <u>we will not be removing these outliers</u>

<a class="anchor" id="encode_data"></a>
## Encoding Categorical Data

Here, we will encode the categorical attributes using the One-Hot Encoder.

<b>Pclass</b>

In [None]:
enc = OneHotEncoder()
res = enc.fit_transform(train_data[["Pclass"]]).toarray()
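# reuse train_data's index so the index-based merge below pairs up the right rows
res = pd.DataFrame(res, columns=["1", "2", "3"], dtype='int8', index=train_data.index)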

In [None]:
train_data = pd.merge(train_data, res, left_index=True, right_index=True)
train_data.head()

<b>Sex</b>

Note: It may not be necessary to have both *'Male'* and *'Female'* columns, as a 0 in the *'Male'* column can only mean that the passenger is female. We have kept both columns though, as there doesn't seem to be a great need to save on data size.

In [None]:
res = enc.fit_transform(train_data[["Sex"]]).toarray()
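# reuse train_data's index so the index-based merge below pairs up the right rows
res = pd.DataFrame(res, columns=["Female", "Male"], dtype='int8', index=train_data.index)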

In [None]:
train_data = pd.merge(train_data, res, left_index=True, right_index=True)
train_data.head()

<b>Embarked</b>

In [None]:
res = enc.fit_transform(train_data[["Embarked"]]).toarray()
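# reuse train_data's index so the index-based merge below pairs up the right rows
res = pd.DataFrame(res, columns=["C", "Q", "S"], dtype='int8', index=train_data.index)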

In [None]:
train_data = pd.merge(train_data, res, left_index=True, right_index=True)
train_data.head()

<a class="anchor" id="remove_cols"></a>
## Removing Columns/Attributes

From the training data, it appears that *'PassengerId'*, *'Name'* and *'Ticket'* will be of little use to the model, as it is difficult to see how they relate to who survived. These columns will be removed.

<b>Remove category columns</b>

In [None]:
train_data.drop(["PassengerId", "Name", "Ticket"], axis=1, inplace=True)

assert("PassengerId" not in train_data)
assert("Name" not in train_data)
assert("Ticket" not in train_data)

In [None]:
train_data.head()

The *'SibSp'* and *'Parch'* columns can be merged together. We can also create another column representing whether the passenger was alone or not

In [None]:
train_data["Family"] = train_data["SibSp"] + train_data["Parch"]
train_data.drop(["SibSp", "Parch"], axis=1, inplace=True)

train_data["Alone"] = (train_data["Family"] == 0).astype(int) # alone is 1 if the passenger has no family aboard

assert("SibSp" not in train_data)
assert("Parch" not in train_data)
assert("Alone" in train_data)

In [None]:
train_data.head()

Removing *'Pclass'*, *'Sex'* and *'Embarked'* as they have already been processed using the OneHotEncoder and are no longer needed

In [None]:
train_data.drop(["Pclass", "Sex", "Embarked"], axis=1, inplace=True)

assert("Pclass" not in train_data)
assert("Sex" not in train_data)
assert("Embarked" not in train_data)

In [None]:
train_data.head()

<a class="anchor" id="no_sklearn"></a>
# <p style="background-color:#0d101c;font-family:arial;color:#ffffff;font-size:150%;text-align:center;border-radius:20px;">Model Training: From Scratch</p>

## How does a SVM work?

An SVM tries to find a hyperplane in a multi-dimensional space that divides the 2 classes. The margin (the distance between the hyperplane and the closest points of each class) is something that we also try to maximise. If we maximise the margin, then the points of the different classes will be far away from each other and from the separating hyperplane. This tends to improve how well the classifier generalises, but there may be misclassifications, as this approach allows for that.<br>

<b>*Support Vectors*</b> are the datapoints that touch the positive and negative hyperplanes. In the picture below, there are 2 support vectors touching the positive hyperplane, and 1 touching the negative hyperplane<br>

Image source: https://en.wikipedia.org/wiki/Support-vector_machine#/media/File:SVM_margin.png
<img src="https://github.com/datastrider/titanic_svm/blob/main/svm_diagram.jpg?raw=true" style="width:50%" align="left"/><br>


### Algorithm<br>

From the diagram, we can see that one class should not cross the line $w \cdot x - b = 1$, whereas the other class should not cross the line $w \cdot x - b = -1$ <br>

For this classifier, $y_{i}$ has to be either -1 or 1

if $y_{i} = 1$, then $w \cdot x_{i} - b \geq 1$ <br>
if $y_{i} = -1$, then $w \cdot x_{i} - b \leq -1$

The goal of the classifier is also to maximise the distance between the data points of the 2 classes.
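The two conditions above can be combined into a single check, $y_{i}(w \cdot x_{i} - b) \geq 1$. The sketch below shows this in plain Python; the function names and the values of `w` and `b` are hypothetical and chosen only for illustration.

```python
def decision_value(w, x, b):
    """Return w . x - b for a single observation x."""
    return sum(w_j * x_j for w_j, x_j in zip(w, x)) - b


def satisfies_margin(w, x, b, y):
    """True if x lies on the correct side of its class boundary (y is -1 or 1)."""
    return y * decision_value(w, x, b) >= 1


# Hypothetical weights and bias, not fitted to any data
w, b = [1.0, -1.0], 0.5

print(satisfies_margin(w, [3.0, 1.0], b, 1))    # decision value 1.5, class 1
print(satisfies_margin(w, [0.0, 1.0], b, -1))   # decision value -1.5, class -1
print(satisfies_margin(w, [1.0, 0.0], b, 1))    # decision value 0.5, inside the margin
```

The third point is on the correct side of the separating hyperplane but still fails the check, because it sits between the hyperplane and the positive boundary.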

### Cost Function<br>

In order to know how well the classifier does, we need a function that measures how far its predictions are from the true classes. For the SVM classifier, we will use a Hinge Loss Function

Hinge Loss: $l(y_{i}) = \max(0, 1-y_{i}(w \cdot x_{i} - b))$

This will return $0$ if $y_{i}(w \cdot x_{i} - b) \geq 1$, otherwise $1-y_{i}(w \cdot x_{i} - b)$. <br>

This means that points that are classified correctly and lie on or outside the margin do not increase the loss function. Only points that are misclassified, or that fall inside the margin, do. <br>
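To see the hinge loss in action, here is a minimal sketch for a single observation in plain Python. The function name `hinge_loss` and the values of `w` and `b` are hypothetical, used only for illustration.

```python
def hinge_loss(w, x, b, y):
    """Hinge loss max(0, 1 - y * (w . x - b)) for one observation, y in {-1, 1}."""
    score = sum(w_j * x_j for w_j, x_j in zip(w, x)) - b
    return max(0.0, 1.0 - y * score)


# Hypothetical weights and bias, not fitted to any data
w, b = [1.0, -1.0], 0.5

print(hinge_loss(w, [3.0, 1.0], b, 1))    # correct and outside the margin
print(hinge_loss(w, [1.0, 0.0], b, 1))    # correct side, but inside the margin
print(hinge_loss(w, [1.0, 0.0], b, -1))   # wrong side of the hyperplane
```

The loss is zero for the first point, small for the second, and largest for the misclassified third point, so the penalty grows the further a point is on the wrong side of its boundary.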

### Regularisation

It is also important to consider regularisation. This takes the form of <br>

$\lambda ||w||^2 $ <br>

Thus, the function we will want to optimise using gradient descent is: <br>

$J(w, b) = \lambda ||w||^2 + \frac{1}{n} \sum_{i=1}^{n} \max(0, 1-y_{i}(w \cdot x_{i} - b))$ <br>

The equation above is the <b>*Primal Form*</b> of the SVM. <br>

Making $\lambda$ larger puts more weight on keeping $||w||$ small, which makes the distance between the positive and negative hyperplanes ($\frac{2}{||w||}$) larger, at the cost of allowing more points to fall inside the margin or on the wrong side of the hyperplane. Using the Hinge Loss function with this trade-off in a SVM is known as a <b>*Soft Margin*</b>, because it allows for misclassifications.
_______________________________________

Regularisation is explained here, with visualisations: https://datascience.stackexchange.com/questions/4943/intuition-for-the-regularization-parameter-in-svm. As $\lambda$ tends to 0 (and the data is linearly separable), the solution tends to a <b>*Hard Margin*</b>, where no misclassifications are allowed.

Note that $C \sim \frac{1}{\lambda}$, with the exact relationship depending on the formulation of the SVM (https://stats.stackexchange.com/a/298886). We are not using the equation with $C$, but it achieves the same goal. The $C$ version: <br>

$\frac{1}{2}||w||^2 + C \frac{1}{n}\sum_{i=1}^{n} max(0, 1-y_{i}(w \cdot x_{i} - b))$
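
As a rough sketch of why $C \sim \frac{1}{\lambda}$ (writing $L(w, b)$ for the averaged hinge-loss term, a shorthand introduced here only for illustration): multiplying an objective by a positive constant does not change where its minimum lies, so

$$\lambda ||w||^2 + L(w, b) \quad \text{has the same minimiser as} \quad \frac{1}{2}||w||^2 + \frac{1}{2\lambda} L(w, b)$$

which matches the $C$ version with $C = \frac{1}{2\lambda}$. If the regulariser is written as $\frac{\lambda}{2}||w||^2$ instead, the same step gives $C = \frac{1}{\lambda}$. Either way, a large $\lambda$ corresponds to a small $C$ and vice versa.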

The 'C' parameter is explained very well here:<br>
https://stats.stackexchange.com/a/159051<br>
https://medium.com/@kushaldps1996/a-complete-guide-to-support-vector-machines-svms-501e71aec19e<br>

### Gradient Descent

As we are trying to minimise the loss function, we will be using gradient descent. We are trying to find the optimal values for $w$ and $b$, so we calculate the derivatives of the loss function with respect to each variable. There are 2 cases to take into account. <br>

$y_{i}(w \cdot x_{i} - b) \geq 1$ and $y_{i}(w \cdot x_{i} - b) < 1$<br>

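Let $J$ be the Hinge Loss function with regularisation defined above, then... <br>
________________________________
for the samples with $y_{i}(w \cdot x_{i} - b) \geq 1$, the hinge term is zero, so only the regularisation term contributes: <br>

$\frac{\partial J}{\partial w} = 2 \lambda w$

And

$\frac{\partial J}{\partial b} = 0$

_________________________________

for the samples with $y_{i}(w \cdot x_{i} - b) < 1$, the hinge term also contributes, with the sums running over these samples only: <br>

$\frac{\partial J}{\partial w} = 2 \lambda w - \frac{1}{n}\sum_{i:\, y_{i}(w \cdot x_{i} - b) < 1} y_{i} x_{i}$

And

$\frac{\partial J}{\partial b} = \frac{1}{n}\sum_{i:\, y_{i}(w \cdot x_{i} - b) < 1} y_{i}$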

_______________________________
### Update Value Rules

$w = w - \alpha \cdot dw$<br>
$b = b - \alpha \cdot db$ <br>

where $\alpha$ is the learning rate
________________________________

Before applying the learning rate to $dw$ and $db$, we need to calculate the summation values first, adding a term only for the samples with $y_{i}(w \cdot x_{i} - b) < 1$: <br>

$dw$: $-\frac{1}{n}\sum_{i:\, y_{i}(w \cdot x_{i} - b) < 1} y_{i} x_{i}$ <br>

$db$: $\frac{1}{n}\sum_{i:\, y_{i}(w \cdot x_{i} - b) < 1} y_{i}$

Only after this do we add $2 \lambda w$ to $dw$ and then apply $\alpha$, as shown in the update rules

After each iteration, $w$ and $b$ will change, influenced by the constants $\alpha$ and $\lambda$, until the gradient descent settles in a minimum.
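
The update rules can be sketched as a single gradient descent step in plain Python. This is a minimal illustration and not the class used below: the function name `gradient_step` and the toy data are hypothetical, and it assumes that the sums run only over the samples with $y_{i}(w \cdot x_{i} - b) < 1$ and that the regularisation term contributes $2 \lambda w$.

```python
def gradient_step(X, y, w, b, lam=0.01, lr=0.001):
    """Apply one gradient descent update and return the new (w, b).

    X is a list of observations, y holds labels in {-1, 1}.
    """
    n = len(X)
    dw = [0.0] * len(w)  # summation part of dJ/dw
    db = 0.0             # summation part of dJ/db

    for x_i, y_i in zip(X, y):
        margin = y_i * (sum(w_j * x_j for w_j, x_j in zip(w, x_i)) - b)
        if margin < 1:
            # only samples that violate the margin contribute to the sums
            for j, x_ij in enumerate(x_i):
                dw[j] -= y_i * x_ij
            db += y_i

    # add the regularisation term, average the sums, then apply the learning rate
    new_w = [w_j - lr * (2 * lam * w_j + dw_j / n) for w_j, dw_j in zip(w, dw)]
    new_b = b - lr * (db / n)
    return new_w, new_b


# Hypothetical toy data: one observation per class
X = [[1.0, 2.0], [-1.0, -1.0]]
y = [1, -1]
w, b = gradient_step(X, y, w=[0.0, 0.0], b=0.0, lam=0.01, lr=0.1)
print(w, b)
```

Starting from zero weights, both toy points violate the margin, so the step moves $w$ towards the direction that separates them, while $b$ stays put because the two labels cancel out.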

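In general, it is possible when using gradient descent to not get the optimal values, as the algorithm can get stuck in a local minimum when the function has more than one minimum. The objective used here is convex, so any minimum it reaches is the global one, but gradient descent may still stop short of it if the learning rate or the number of iterations is poorly chosen. There are ways to handle this, but they are not investigated in this project.<br>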

Image source: https://en.wikipedia.org/wiki/Maxima_and_minima#/media/File:Extrema_example_original.svg
<img src="https://github.com/datastrider/titanic_svm/blob/main/minimas.jpg?raw=true" style="width:50%" align="left"/><br>

In [None]:
class SVMClassifier:

    def __init__(self, lam=0.01, lr=0.001, n_iter=1000):

        self.lr = lr # learning rate
        self.n_iter = n_iter # number of iterations
        self.lam = lam
        self.w = 0 # weights of attributes of X
        self.b = 0 # bias
        self.loss_history = []

    def _hinge_loss(self, X: np.array, y: np.array, w: np.array, b: float):
        '''
        Returns the hinge loss of  a linear equation

        @param X: observed datapoints
        @param y: classification of corresponding x observations
        @param w: weights of the linear equation
        @param b: bias
        @return: sum of losses
        '''

        loss = [max(0, 1-y_ * (np.dot(w, x_.T) - b)) for y_, x_ in zip(y, X)]

        return np.sum(loss)

    def fit(self, X: np.array, y: np.array):
        '''
        Fits the model using the data, making use of the
        Hinge Loss Function and gradient descent to find the
        values of 'w' and 'b'
        
        @param X: observed datapoints
        @param y: classification of corresponding x observations
        @return:
        '''

        y_temp = np.where(y == 0, -1, 1)
        n_samples = X.shape[0]
        n_features = X.shape[1]

        self.w = np.zeros(n_features)
        self.b = 0

        for _ in range(self.n_iter):

            # track the hinge loss using the -1/1 labels the model is trained on
            loss = self._hinge_loss(X, y_temp, self.w, self.b)
            self.loss_history.append(loss)

            # both used for summing values of corresponding derivatives
            dw = np.zeros(n_features)
            db = 0

            for i, x_temp in enumerate(X):

                if y_temp[i] * (np.dot(x_temp, self.w) - self.b) >= 1:
                    # hinge function returns 0, so this sample adds nothing
                    pass
                else:
                    # summing values within the Sigma/Sum symbol
                    dw += -y_temp[i] * x_temp # summation part of derivative
                    db += y_temp[i] # summation part of derivative

            # once all samples are summed: add lambda term, average, apply learning rate
            self.w -= self.lr * (self.lam * self.w * 2 + dw/n_samples) # dw * 1/n as in the equation
            self.b -= self.lr * (db/n_samples) # db * 1/n as in the equation



    def predict(self, X: np.array):
        '''
        Makes predictions on given datapoints and returns them
        as either 0 or 1
        
        @param X: observed datapoints
        @return pred: predictions
        '''
        
        pred = np.dot(X, self.w) - self.b
        pred = np.sign(pred)
        pred = np.where(pred < 0, 0, 1)
        
        return pred



In [None]:
X_train, X_test, y_train, y_test = train_test_split(train_data.iloc[:, 1:].to_numpy(),
                                                    train_data.iloc[:, 0].to_numpy(),
                                                    random_state=123,
                                                    train_size=0.8)

In [None]:
clf = SVMClassifier()

clf.fit(X_train, y_train)
pred = clf.predict(X_test)

print("Accuracy:", accuracy_score(y_test, pred))

### Our SVM Classifier Results

From this simple implementation, with no optimisation, we can see that it scores closely with the sklearn implementation that we will see in the next [section](#sklearn), where we implement and optimise the sklearn implementation for the Kaggle submission.

Below is a graph that shows how the loss function changes during the fitting/training of our SVM Classifier. It does seem to be erratic at times (scaling the features or lowering the learning rate may smooth this out), but it is clear that there is a downward trend in the loss function, meaning our SVM Classifier is improving.

In [None]:
plt.figure()
plt.plot(clf.loss_history)
plt.title("Loss History of Training Own SVM Classifier")
plt.xlabel("Iterations")
plt.ylabel("Loss")
plt.show()

<a class="anchor" id="sklearn"></a>
# <p style="background-color:#0d101c;font-family:arial;color:#ffffff;font-size:150%;text-align:center;border-radius:20px;">Model Training: Sklearn</p>

In [None]:
random_state = 123
X_train, X_test, y_train, y_test = train_test_split(train_data.iloc[:, 1:], # X
                                                    train_data.iloc[:, 0], # y
                                                    random_state=random_state,
                                                    train_size=0.8)

In [None]:
clf = SVC()

clf.fit(X_train, y_train)

pred = clf.predict(X_test)

print("Accuracy: ", accuracy_score(y_test, pred))

We can see that the SVM Classifier coded from scratch is performing better than the Sklearn SVC version (with its default settings) on the same train/test data and `random_state`. <br> Perhaps this will soon change. We will now try to optimise the hyper-parameters.

In [None]:
params = {"C": (0.1, 2),
          "kernel": ["linear", "poly", "sigmoid"],
          "degree": (1, 3),
          "coef0": (0.0, 1.0)}

opt = BayesSearchCV(clf,
                    params,
                    cv=10,
                    random_state=random_state)

In [None]:
opt.fit(X_train, y_train)

In [None]:
print("Accuracy:",opt.score(X_test, y_test))

In [None]:
opt.best_params_

<a class="anchor" id="kaggle_sub"></a>
# <p style="background-color:#0d101c;font-family:arial;color:#ffffff;font-size:150%;text-align:center;border-radius:20px;">Kaggle Submission</p>

<a class="anchor" id="clean_test_data"></a>
## Cleaning and Processing Test Data

We will process the test data to bring it in line with the training data used.
<br>
All steps carried out below were carried out on the training data above.

In [None]:
test_id = test_data["PassengerId"]

# assign the results back instead of using inplace=True on a column selection
test_data['Age'] = test_data['Age'].fillna(test_data['Age'].median())
test_data["Fare"] = test_data["Fare"].fillna(test_data["Fare"].median())

# Check if missing values have been filled for the 'Age' column
assert(test_data['Age'].isna().sum() == 0)

test_data = test_data.loc[test_data['Embarked'].notna(), :]

# Check if missing values have been filled for the 'Embarked' column
assert(test_data['Embarked'].isna().sum() == 0)

test_data.drop("Cabin", axis=1, inplace=True)

# Check if 'Cabin' column has been deleted
assert('Cabin' not in test_data)

categorical_cols = ["Pclass", "Sex", "Embarked"]

# replace negative ages with the median age
test_data.loc[test_data["Age"] < 0, "Age"] = test_data["Age"].median()

# assert that there are no negative ages
assert((test_data["Age"] > 0).all())

# replace negative values with 0
test_data.loc[test_data["SibSp"] < 0, "SibSp"] = 0
test_data.loc[test_data["Parch"] < 0, "Parch"] = 0

# assert that there are no negative values
assert((test_data["SibSp"] >= 0).all())
assert((test_data["Parch"] >= 0).all())

enc = OneHotEncoder()
res = enc.fit_transform(test_data[["Pclass"]]).toarray()
res = pd.DataFrame(res, columns=["1", "2", "3"], dtype='int8', index=test_data.index) # keep rows aligned for the merge
test_data = pd.merge(test_data, res, left_index=True, right_index=True)

res = enc.fit_transform(test_data[["Sex"]]).toarray()
res = pd.DataFrame(res, columns=["Female", "Male"], dtype='int8', index=test_data.index) # keep rows aligned for the merge
test_data = pd.merge(test_data, res, left_index=True, right_index=True)

res = enc.fit_transform(test_data[["Embarked"]]).toarray()
res = pd.DataFrame(res, columns=["C", "Q", "S"], dtype='int8', index=test_data.index) # keep rows aligned for the merge
test_data = pd.merge(test_data, res, left_index=True, right_index=True)

test_data.drop(["PassengerId", "Name", "Ticket"], axis=1, inplace=True)

assert("PassengerId" not in test_data)
assert("Name" not in test_data)
assert("Ticket" not in test_data)

test_data["Family"] = test_data["SibSp"] + test_data["Parch"]
test_data.drop(["SibSp", "Parch"], axis=1, inplace=True)

test_data["Alone"] = (test_data["Family"] == 0).astype(int) # alone is 1 if the passenger has no family aboard

assert("SibSp" not in test_data)
assert("Parch" not in test_data)
assert("Alone" in test_data)

test_data.drop(["Pclass", "Sex", "Embarked"], axis=1, inplace=True)

assert("Pclass" not in test_data)
assert("Sex" not in test_data)
assert("Embarked" not in test_data)

<a class="anchor" id="create_submission_csv"></a>
## Create submission csv

In [None]:
clf = SVC()
clf.set_params(**opt.best_params_)

clf.fit(train_data.iloc[:,1:].to_numpy(), # X
        train_data.iloc[:,0].to_numpy()) # y

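# predict on the processed Kaggle test set, not on the earlier validation split (X_test)
pred = clf.predict(test_data.to_numpy())
print(pred)

# pair each prediction with the PassengerId of the row it was made for
submit = pd.DataFrame({"PassengerId": test_id.loc[test_data.index].to_numpy(), "Survived": pred})
submit.to_csv("submission.csv", index=False)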

## Thanks for Viewing