# COMP534 Lab 2 Exercise

*Try to get the best model possible and beat your peers' models.*

Unlike the toy dataset, the Titanic dataset has to be downloaded (JUST THE TRAIN SET) from the [Kaggle website](https://www.kaggle.com/c/titanic/data) and uploaded to Colab, just like in the last session.

The goal with this dataset is to predict whether a passenger survived the sinking of the Titanic using attributes such as gender and age. To use it as input for machine learning, we first need to clean and prepare it.


Clean and prepare the dataset:
  - remove unnecessary variables - you can use [Pandas Drop Function](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) for this
  - use plots to identify data issues, such as outliers and missing data, and address these issues - you can use [different functions](https://pandas.pydata.org/docs/reference/frame.html#missing-data-handling) for this
  - encode variables if necessary
  - normalize/standardize variables if necessary
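
The steps in the list above can be tried out on a small example first. The block below is a minimal sketch, assuming a hypothetical toy DataFrame (`toy`) with a few Titanic-style columns; the values are made up for illustration and are not taken from `train.csv`. It counts missing values, fills them with the median as one possible alternative to dropping rows, one-hot encodes a categorical column, and standardizes the numeric ones.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for a few Titanic-style columns
toy = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0, 54.0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 512.33],
    "Sex": ["male", "female", "female", "female", "male"],
})

# Identify missing data: number of missing values per column
missing_counts = toy.isna().sum()
print(missing_counts)

# Address missing data: fill Age with the column median instead of dropping the row
toy["Age"] = toy["Age"].fillna(toy["Age"].median())

# Encode: one-hot encode Sex, keeping a single indicator column (Sex_male)
toy = pd.get_dummies(toy, columns=["Sex"], drop_first=True)

# Standardize: rescale numeric columns to zero mean and unit variance
scaler = StandardScaler()
toy[["Age", "Fare"]] = scaler.fit_transform(toy[["Age", "Fare"]])
print(toy.round(2))
```

Whether to impute or drop, and whether to scale at all, depends on the model: distance-based methods such as kNN and SVM tend to be sensitive to feature scale, while tree-based methods usually are not. In a full workflow, the scaler would normally be fitted on the training split only and then applied to the test split.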

In [12]:
# coding here - create new code cells if necessary
import numpy as np
import pandas as pd

raw_train = pd.read_csv('train.csv')
raw_train = raw_train.drop(['PassengerId', 'Name', 'Cabin', 'Ticket'], axis=1)  # unnecessary variables
raw_train = raw_train.dropna()  # drop data points with missing values
print(raw_train.info())

raw_train['Embarked'] = raw_train['Embarked'].astype('category')
raw_train['Sex'] = raw_train['Sex'].astype('category')
cat_columns = raw_train.select_dtypes(['category']).columns
raw_train[cat_columns] = raw_train[cat_columns].apply(lambda x: x.cat.codes)  # encode categorical variables
print(raw_train.head())

X = raw_train.iloc[:, 1:]  # attributes: every column except Survived
y = raw_train.iloc[:, 0]   # target: Survived (first column)

<class 'pandas.core.frame.DataFrame'>
Index: 712 entries, 0 to 890
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  712 non-null    int64  
 1   Pclass    712 non-null    int64  
 2   Sex       712 non-null    object 
 3   Age       712 non-null    float64
 4   SibSp     712 non-null    int64  
 5   Parch     712 non-null    int64  
 6   Fare      712 non-null    float64
 7   Embarked  712 non-null    object 
dtypes: float64(2), int64(4), object(2)
memory usage: 50.1+ KB
None
   Survived  Pclass  Sex   Age  SibSp  Parch     Fare  Embarked
0         0       3    1  22.0      1      0   7.2500         2
1         1       1    0  38.0      1      0  71.2833         0
2         1       3    0  26.0      0      0   7.9250         2
3         1       1    0  35.0      1      0  53.1000         2
4         0       3    1  35.0      0      0   8.0500         2


Now that the dataset is clean, we can split it into training and test sets for our machine learning models. As seen in the lectures, there are several ways to split the data - you can check out some of those implemented in scikit-learn [here](https://scikit-learn.org/stable/api/sklearn.model_selection.html).
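
As an illustration of one of those other splitting strategies, here is a minimal sketch of stratified 5-fold cross-validation. It assumes synthetic data from `make_classification` (the names `X_demo` and `y_demo` are hypothetical stand-ins, not the Titanic data) and is not part of the required lab workflow.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary classification data (hypothetical stand-in for X and y)
X_demo, y_demo = make_classification(n_samples=200, n_features=7,
                                     weights=[0.6, 0.4], random_state=0)

# Five folds, each keeping roughly the same class proportions as the full set
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

model = RandomForestClassifier(n_estimators=10, random_state=0)

# One balanced-accuracy score per fold
scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="balanced_accuracy")
print("Mean:", scores.mean(), "Std:", scores.std())
```

Every data point is used for testing exactly once, so the mean score depends less on one particular split than a single train/test split does.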

In our case, we will use the [regular train/test split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html#sklearn.model_selection.train_test_split).
The cell below splits your data into train (80%) and test (20%), matching `test_size=0.20` in the code.

- X is a table (e.g. a DataFrame) with your attributes - if you have a variable with a different name, just change it here
- y is an array (e.g. a Series) with your target variable (Survived) - if you have a variable with a different name, just change it here

- X_train, y_train - the data and label you will use to **train**
- X_test, y_test - the data and label you will use to **test**

**CHANGE ONLY THE NAME OF THE VARIABLES - DO NOT CHANGE ANYTHING ELSE**

In [13]:
from sklearn.model_selection import train_test_split

# DO NOT CODE HERE - change the name of the variables only if necessary
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20,
                                                    random_state=42, stratify=y)

Now that your dataset is ready to use, what is the best accuracy you can achieve?
Feel free to experiment with different methods. Try tweaking the hyperparameters (like k for kNN or n_estimators for RandomForest).
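
To illustrate what tweaking a hyperparameter can look like, here is a minimal sketch that tries several values of k for kNN. It assumes synthetic data (`X_demo`, `y_demo`, and the split names `Xtr`, `Xte`, `ytr`, `yte` are hypothetical); with the lab data you would use your own X_train, X_test, y_train and y_test instead.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the prepared dataset (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=7, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.20,
                                      random_state=42, stratify=y_demo)

results = {}
for k in (1, 3, 5, 7, 9, 11):
    # Scale features before kNN, since distances depend on feature scale
    knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    knn.fit(Xtr, ytr)
    results[k] = balanced_accuracy_score(yte, knn.predict(Xte))

# k with the highest balanced accuracy among those tried
best_k = max(results, key=results.get)
print(results)
print("Best k:", best_k)
```

The same loop pattern works for other hyperparameters, for example n_estimators in RandomForestClassifier.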

For all trained models, use the [balanced accuracy](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.balanced_accuracy_score.html) **on the test set** (X_test and y_test) to evaluate your models.
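
Balanced accuracy is the average of the recall obtained on each class, so it does not reward a model for simply favouring the majority class. The small worked example below uses made-up labels (`y_true` and `y_pred` are hypothetical) to show how it can differ from plain accuracy.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Made-up labels: 8 passengers of class 0 and 2 of class 1
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
# The model gets every class-0 case right but misses one of the two class-1 cases
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 0])

acc = accuracy_score(y_true, y_pred)                            # 9 of 10 correct = 0.9
per_class_recall = recall_score(y_true, y_pred, average=None)   # [1.0, 0.5]
bal = balanced_accuracy_score(y_true, y_pred)                   # (1.0 + 0.5) / 2 = 0.75

print("Accuracy:", acc)
print("Recall per class:", per_class_recall)
print("Balanced accuracy:", bal)
```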

Once you're happy with your results, report them [here](https://docs.google.com/spreadsheets/d/1dKtIFVtamqOYAnhdTA5nVlSpy_iOu89NKMmlqaPBtBQ/edit?usp=sharing) and see if you can beat your colleagues' models.

In [16]:
# coding here - create new code cells if necessary
from sklearn.metrics import balanced_accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn import svm

rf = RandomForestClassifier(n_estimators=10, random_state=0)
rf.fit(X_train, y_train)
rf_pred = rf.predict(X_test)
print("Random Forest: ", balanced_accuracy_score(y_test, rf_pred))

sv = svm.SVC(kernel="linear", C=1)
sv.fit(X_train, y_train)
sv_pred = sv.predict(X_test)
print("SVM:", balanced_accuracy_score(y_test, sv_pred))

Random Forest:  0.75973630831643
SVM: 0.7538539553752536
