**Data Mining Lab 2: Classifiers, University of Victoria, Summer 2023** <br>
**TA: Jonas Buro (buro@uvic.ca)**

# Lab 2
1. Binary Classification
2. Performance Metrics for Classifiers
3. Classification Exercise

# 1) Binary Classification

The task in binary classification is to design or learn a function $f$ which, when applied to a feature vector $x$ drawn from the feature space $X$, outputs a predicted label $\hat{y}$ that equals $x$'s true label $y$ with high probability.

$$ f(x) = \hat{y}$$

Sometimes it is clear how the features of $x$ determine its true label $y$. If this relationship is easy to express, we can define $f$ explicitly and be sure that it outputs the correct classification for any $x \in X$.
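
As a minimal sketch of an explicitly defined $f$, consider a toy problem (an assumption made only for illustration, not part of the lab data) where the feature is an integer and the true label is whether it is even. The rule is easy to express, so no learning is needed:

```python
def is_even(x: int) -> bool:
    """Hand-written classifier f: the label is True exactly when x is even."""
    return x % 2 == 0

# Apply f to a few hypothetical inputs
inputs = [3, 10, 7, 42]
predictions = [is_even(x) for x in inputs]
print(predictions)  # [False, True, False, True]
```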

Often it is not clear how the features of $x$ combine to determine its true label or, if it is clear, the relationship is difficult to express. In such cases, supervised machine learning methods can be used to approximate $f$ from labeled examples.

Some applications of binary classification include:
- Detecting spam (is this email spam?)
- Disease diagnosis (does this patient have cancer?)
- Object recognition (does this image contain a dog?)



Some common machine learning methods for binary classification are:
- Decision Trees
- Random Forests
- SVMs
- Neural Networks
- Logistic Regressors

The selection of which model to use depends on the type and quantity of the data; each model has tradeoffs. By experimenting with multiple models and evaluating their performance, the analyst can determine the best choice for the problem at hand, as well as develop intuition for which models fare better under different circumstances.
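
One way to experiment with several models is to score each of them with the same cross-validation procedure. The sketch below assumes a synthetic dataset from `make_classification` and arbitrary hyperparameters; both are placeholders for illustration rather than recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (assumption: stands in for a real dataset)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Candidate models with illustrative hyperparameters
candidates = {
    "decision tree": DecisionTreeClassifier(max_depth=3, random_state=0),
    "random forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "logistic regression": LogisticRegression(max_iter=1000),
}

# Mean 5-fold cross-validated accuracy for each candidate
mean_scores = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    mean_scores[name] = scores.mean()
    print(f"{name}: {mean_scores[name]:.3f}")
```

The ranking on one dataset does not necessarily carry over to another, which is why the comparison is worth repeating for each new problem.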

# 2) Performance Metrics for Classifiers

In predictive modeling with regression, where the goal is to predict a numerical $\hat{y}$, the analyst cares about by how much the predictions are off rather than whether they are exactly right. This is why regression uses error-based evaluation metrics instead of accuracy (the ratio of correct predictions).

Determining the performance of classifiers is not as straightforward as it is for regressors. For classifiers, accuracy is sometimes not an acceptable metric: when the data is heavily skewed towards one class, a model can achieve high accuracy by always predicting the majority class while performing poorly on the minority class.
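
The effect of class imbalance on accuracy can be sketched with a baseline that always predicts the majority class. The 95/5 class split below is a made-up example, and `DummyClassifier` is used only as a stand-in for a model that has learned nothing useful.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical imbalanced labels: 95 negatives, 5 positives
y_true = np.array([0] * 95 + [1] * 5)
X = np.zeros((100, 1))  # features are ignored by this baseline

# Baseline that always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X, y_true)
y_pred = baseline.predict(X)

acc = accuracy_score(y_true, y_pred)            # 95 of 100 predictions are correct
minority_recall = recall_score(y_true, y_pred)  # none of the 5 positives are found
print(acc, minority_recall)
```

The baseline reaches 95% accuracy while identifying none of the positive examples.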

A more informative tool is the confusion matrix, which counts true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN):

<p align="center">
<img src="https://static.packt-cdn.com/products/9781838555078/graphics/C13314_06_05.jpg" alt="Confusion Matrix" width="35%" height="35%">
</p>

<p align="center">
<img src="https://onestopdataanalysis.com/wp-content/uploads/2020/02/confusion_matrix.png" alt="Confusion Matrix" width="35%" height="35%">
</p>
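
As a small sketch, the four cells of a confusion matrix can be computed with scikit-learn's `confusion_matrix`. The label vectors are invented for illustration. Note that scikit-learn puts true labels on the rows and predicted labels on the columns, with class 0 first, which may differ from the layout in the figures above.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary labels the matrix is [[TN, FP], [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
print(cm)
print("TP:", tp, "TN:", tn, "FP:", fp, "FN:", fn)
```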

## Precision and Recall

Accuracy evaluates the proportion of all predictions that are correct: $$ a = \frac{TP + TN}{TP + TN + FP + FN} $$

Precision evaluates the proportion of positive predictions that are actually correct: $$p = \frac{TP}{TP + FP}$$

Recall evaluates the proportion of actual positives that are correctly identified:  $$r = \frac{TP}{TP + FN}$$
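
A short worked example of the three formulas, using made-up confusion matrix counts:

```python
# Hypothetical counts from a confusion matrix
tp, fp, fn, tn = 40, 10, 20, 30

accuracy = (tp + tn) / (tp + tn + fp + fn)  # 70 / 100
precision = tp / (tp + fp)                  # 40 / 50
recall = tp / (tp + fn)                     # 40 / 60

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
# accuracy=0.700 precision=0.800 recall=0.667
```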

Precision and recall tend to trade off against each other. Improving precision means reducing FP, which usually makes the model less willing to predict the positive class. This can increase FN, reducing recall.

Conversely, improving recall means reducing FN, which usually makes the model more willing to predict the positive class. This can increase FP, reducing precision. For a fixed model, lowering one type of error typically raises the other.
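
The trade-off can be observed by varying the decision threshold of a probabilistic classifier. The sketch below assumes synthetic data and a logistic regression model, and evaluates on the training data purely for illustration. Raising the threshold makes positive predictions rarer, so recall can only stay the same or drop, while precision often (but not always) rises.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

# Synthetic, slightly noisy data (assumption: placeholder for a real dataset)
X, y = make_classification(n_samples=1000, n_features=5, flip_y=0.1, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Estimated probability of the positive class for each example
scores = model.predict_proba(X)[:, 1]

results = {}
for threshold in (0.3, 0.5, 0.7):
    # Predict positive only when the score reaches the threshold
    y_pred = (scores >= threshold).astype(int)
    results[threshold] = (
        precision_score(y, y_pred, zero_division=0),
        recall_score(y, y_pred),
    )
    p, r = results[threshold]
    print(f"threshold={threshold}: precision={p:.3f} recall={r:.3f}")
```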

Context determines which of these is the more valuable metric: decide whether a FP or a FN is the more costly error.

Consider the problem of cancer prediction in patients. Here, the analyst wants a model with high recall: if someone has cancer and the model classifies them as cancer-free (FN), the consequences can be catastrophic, potentially a matter of life or death.

In the spam detection problem, the analyst wants a model with high precision: if a legitimate email is classified as spam (FP), the user is inconvenienced and may miss a message they wanted.


## F measure
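The F measure strikes a balance between precision $p$ and recall $r$: $$ F_{\beta} = (1 + \beta^2) \frac{p \cdot r}{(\beta^2 \cdot p) + r}  $$

Here, $\beta$ is a tuning parameter that lets the analyst specify the importance of recall relative to precision: $\beta > 1$ weights recall more heavily, $\beta < 1$ weights precision more heavily, and $\beta = 1$ gives the $F_1$ score, the harmonic mean of the two.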
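
A minimal sketch of the formula, with a hypothetical helper `f_beta` and made-up precision and recall values. Because recall is the lower of the two here, the recall-weighted $F_2$ comes out lower than the precision-weighted $F_{0.5}$.

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """F measure computed directly from precision and recall."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical precision and recall values
precision, recall = 0.8, 0.6

f1 = f_beta(precision, recall, 1.0)      # harmonic mean of the two
f2 = f_beta(precision, recall, 2.0)      # recall counts more
f_half = f_beta(precision, recall, 0.5)  # precision counts more

print(f"F1={f1:.4f} F2={f2:.4f} F0.5={f_half:.4f}")
# F1=0.6857 F2=0.6316 F0.5=0.7500
```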



# 3) Exercise: Titanic Survivor Prediction

The task is to use data from the sinking of the Titanic to predict whether or not a passenger survived.

The data and information about it are located here: https://www.kaggle.com/competitions/titanic/data

Steps:

1. Get the data
2. Investigate the data
3. Prepare the data for ML
4. Select a model and train it
5. Evaluate your model

In [None]:
import os
import urllib.request

# Fetch data
TITANIC_PATH = os.path.join("datasets", "titanic")
DOWNLOAD_URL = "https://raw.githubusercontent.com/ageron/handson-ml2/master/datasets/titanic/"

def fetch_titanic_data(url=DOWNLOAD_URL, path=TITANIC_PATH):
    if not os.path.isdir(path):
        os.makedirs(path)
    for filename in ("train.csv", "test.csv"):
        filepath = os.path.join(path, filename)
        if not os.path.isfile(filepath):
            print("Downloading", filename)
            urllib.request.urlretrieve(url + filename, filepath)

fetch_titanic_data()

Downloading train.csv
Downloading test.csv


In [None]:
import pandas as pd
# Load data

def load_titanic_data(filename, titanic_path=TITANIC_PATH):
    csv_path = os.path.join(titanic_path, filename)
    return pd.read_csv(csv_path)

train_data = load_titanic_data("train.csv")
test_data = load_titanic_data("test.csv")


In [None]:
# Inspect the data shape and integrity
# Useful methods: .head(), .info(), .describe(), .value_counts()
# Note: in a notebook, only the last expression in a cell is displayed
train_data.head()
# Drop columns that are not used as features in this lab
drop_train = train_data.drop(["Name", "Ticket", "Cabin", "Embarked", "Fare"], axis=1)
drop_train.head()
drop_train.describe()
# Consider feature engineering at this point, such as replacing null values with medians, removing columns, etc.

| | PassengerId | Survived | Pclass | Age | SibSp | Parch |
|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699113 | 0.523008 | 0.381594 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526507 | 1.102743 | 0.806057 |
| min | 1.0 | 0.0 | 1.0 | 0.4167 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 |
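
The `Age` count above (714 of 891) indicates missing values. As a sketch of how to count missing values and fill them with the median, the example below uses a tiny made-up DataFrame (`toy` is hypothetical, not the Titanic data) so that it runs on its own:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature dataset with missing ages
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 38.0, 26.0, np.nan],
    "Pclass": [3, 1, 1, 3, 2],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Median of the observed ages (NaN values are skipped)
age_median = toy["Age"].median()

# New DataFrame with missing ages replaced by the median
filled = toy.assign(Age=toy["Age"].fillna(age_median))
print(filled)
```

The preprocessing pipeline later in this lab performs the same median fill with `SimpleImputer(strategy="median")`.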


In [None]:
test_data.head()

| | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
| 1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0 | NaN | S |
| 2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
| 3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S |
| 4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S |


In [None]:
# Survival rate by sex (Survived is 0 or 1, so sum / count is the survival rate)
women = train_data.loc[train_data.Sex == 'female']["Survived"]
men = train_data.loc[train_data.Sex == 'male']["Survived"]
print("Rate of women survived: ",sum(women)/len(women))
print("Rate of men survived: ",sum(men)/len(men))


Rate of women survived:  0.7420382165605095
Rate of men survived:  0.18890814558058924


In [None]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler,OneHotEncoder
from sklearn.compose import ColumnTransformer

# Prepare data for ML algorithms by building a preprocessing pipeline for numeric and categorical values
num_cols = ["Pclass","Age","SibSp","Parch"]
cat_cols = ["Sex"]

cat_pipeline = Pipeline([
    ('encoder', OneHotEncoder()), 
])
num_pipeline = Pipeline([
    ("imputer",SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
preprocessing_pipeline = ColumnTransformer([
    ("numeric", num_pipeline, num_cols),
    ("categorical", cat_pipeline, cat_cols),
])
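# Fit the pipeline on the training features and transform them.
# ColumnTransformer keeps only num_cols and cat_cols, so PassengerId and Survived are dropped.
x_train = preprocessing_pipeline.fit_transform(drop_train)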
print(x_train)

# extract labels from train_data into train_labels
y_train = train_data["Survived"]

[[ 0.82737724 -0.56573582  0.43279337 -0.47367361  0.          1.        ]
 [-1.56610693  0.6638609   0.43279337 -0.47367361  1.          0.        ]
 [ 0.82737724 -0.25833664 -0.4745452  -0.47367361  1.          0.        ]
 ...
 [ 0.82737724 -0.10463705  0.43279337  2.00893337  1.          0.        ]
 [-1.56610693 -0.25833664 -0.4745452  -0.47367361  0.          1.        ]
 [ 0.82737724  0.20276213 -0.4745452  -0.47367361  0.          1.        ]]
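
The last two columns of the array above come from the one-hot encoding of `Sex`. As a sketch of how to read them, the example below encodes a small made-up column (`toy` is hypothetical) and inspects `categories_`, which gives the column order used by the encoder:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical miniature categorical column
toy = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})

encoder = OneHotEncoder()
# The encoder returns a sparse matrix; convert it to a dense array for printing
encoded = encoder.fit_transform(toy[["Sex"]]).toarray()

# Categories are sorted, so column 0 is "female" and column 1 is "male"
categories = list(encoder.categories_[0])
print(categories)
print(encoded)
```

With this ordering, a row ending in `0. 1.` corresponds to a male passenger and a row ending in `1. 0.` to a female passenger.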


In [None]:
from sklearn.ensemble import RandomForestClassifier
# SVC and DecisionTreeClassifier are imported so that alternative models can be tried
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Select a model and train it using x_train, y_train
rf_model = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=2)
rf_model.fit(x_train, y_train)



In [None]:
# Evaluate model
# Options: k-fold cross validation, confusion matrix
# cross_val_score uses accuracy by default for classifiers
from sklearn.model_selection import cross_val_score
rf_scores = cross_val_score(rf_model, x_train, y_train, cv=5) # Change cv value as desired

print("Random Forest mean CV accuracy:", rf_scores.mean())

Random Forest mean CV accuracy: 0.8136777352331931
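
Cross-validated accuracy can be complemented with a confusion matrix, precision, and recall. The sketch below is one possible approach, not the lab's reference solution: it uses `cross_val_predict` to obtain out-of-fold predictions and wraps preprocessing and model in a single `Pipeline`, so the imputer and scaler are refit on each training fold. The DataFrame `toy` and its labels are randomly generated placeholders, so the numbers it prints say nothing about the Titanic data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data with Titanic-like columns
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "Age": rng.uniform(1, 80, n),
    "Pclass": rng.integers(1, 4, n),
    "Sex": rng.choice(["female", "male"], n),
})
toy.loc[rng.choice(n, 20, replace=False), "Age"] = np.nan  # inject missing ages

# Synthetic label: mostly determined by Sex, with about 20% of labels flipped
y = ((toy["Sex"] == "female") ^ (rng.random(n) < 0.2)).astype(int)

numeric = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
preprocess = ColumnTransformer([
    ("numeric", numeric, ["Age", "Pclass"]),
    ("categorical", OneHotEncoder(handle_unknown="ignore"), ["Sex"]),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("forest", RandomForestClassifier(n_estimators=100, max_depth=3, random_state=2)),
])

# Each prediction comes from a model that did not see that row during training
y_pred = cross_val_predict(model, toy, y, cv=5)

cm = confusion_matrix(y, y_pred)  # layout: [[TN, FP], [FN, TP]]
print(cm)
print("precision:", precision_score(y, y_pred, zero_division=0))
print("recall:", recall_score(y, y_pred))
```

For the lab itself, the same pattern would apply with `drop_train` and `train_data["Survived"]` in place of the synthetic data.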
