# Project 1 Tutorial

In this Jupyter notebook, we go over some functions that are essential for Project 1.

In [1]:
# Dealing with strings
import string

# Replacing substrings
text = "The Big Bang Theory is the worst tv show ever"
print(text)
text = text.replace("worst", "best")
print(text)
text = text.replace("The Big Bang Theory", "30 Rock")
print(text)
text = text.replace("e", "E")
print(text)

The Big Bang Theory is the worst tv show ever
The Big Bang Theory is the best tv show ever
30 Rock is the best tv show ever
30 Rock is thE bEst tv show EvEr


In [2]:
# Punctuation

print(type(string.punctuation))
print(string.punctuation)
print(string.punctuation[3])
print(string.punctuation[-1])

<class 'str'>
!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
$
~


In [3]:
# Upper and lower case
text = "Club de Cuervos"
print(text.upper())
print(text.lower())

CLUB DE CUERVOS
club de cuervos


In [4]:
# Splitting strings
text = "L'etat, c'est moi"
print(type(text.split()))
print(text.split())
print(text.split("'"))
print("-".join(text.split()))

<class 'list'>
["L'etat,", "c'est", 'moi']
['L', 'etat, c', 'est moi']
L'etat,-c'est-moi
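
The string tools shown so far can be combined into a simple text-cleaning step. The sketch below is only an illustration: the helper name `tokenize` and the choice to replace punctuation with spaces (rather than delete it) are assumptions, not requirements of the project.

```python
import string

def tokenize(text):
    """Lowercase text, replace each punctuation character with a space, and split into words."""
    text = text.lower()
    for ch in string.punctuation:
        # str.replace returns a new string, so we reassign on every pass
        text = text.replace(ch, " ")
    # split() with no argument splits on any run of whitespace
    return text.split()

print(tokenize("L'etat, c'est moi"))
# ['l', 'etat', 'c', 'est', 'moi']
```

Replacing punctuation with a space keeps `l` and `etat` as separate words; deleting it instead would merge them into `letat`.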


## Using SVMs for data classification

We will now go over how to use the SVM classifiers in the sklearn library.

In [5]:
# Using SVC
# Loading the dataset from sklearn
# (only for demo purposes)
from sklearn.datasets import load_breast_cancer

dataset = load_breast_cancer()
X = dataset.data
y = dataset.target

# Splitting into train and test data
# (only for demo purposes)

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)


In [6]:
# Creating the classifiers

import numpy as np
from sklearn.svm import SVC, LinearSVC

clf = SVC(kernel='linear', C=1.0, class_weight="balanced")
# Training the classifier
clf.fit(X_train, y_train)

# Predicting classes
y_pred = clf.predict(X_test)

print(y_pred)

[0 1 0 0 1 1 1 1 0 1 1 1 1 1 1 1 1 0 0 1 0 1 1 1 1 1 1 1 1 0 0 1 1 0 1 1 1
 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 1 1 0 1 1 1 1 1 1 1 0 1 0 0 1 1 1 1 1 1 1 1
 1 1 1 1 0 0 0 1 0 1 1 1 1 0 0 1 1 0 0 1 1 1 1 1 1 1 1 0 1 1 1 0 1 0 1 0 1
 1 1 0 1 1 0 0 1 0 1 1 0 1 0 1 0 1 1 1 0 1 0 1 1 1 0 1 1 1 1 0 1]


## Metrics and Imbalances

Now that we have trained our classifier, we need to check whether it performs well or whether its hyperparameters need to be adjusted.

Different metrics measure different aspects of the classifier's performance.

In [7]:
# Metrics
from sklearn import metrics

print(f"accuracy: {metrics.accuracy_score(y_test,y_pred)}")
print(f"f1 score: {metrics.f1_score(y_test,y_pred)}")
print(f"precision: {metrics.precision_score(y_test,y_pred)}")
print(f"recall: {metrics.recall_score(y_test,y_pred)}")

accuracy: 0.951048951048951
f1 score: 0.966183574879227
precision: 0.9615384615384616
recall: 0.970873786407767


In [8]:
# AUROC score

# AUROC measures how well the classifier ranks samples, so it should be
# computed from a continuous score rather than from hard class predictions

y_pred = clf.predict(X_test)
print(f"auroc with predict: {metrics.roc_auc_score(y_test,y_pred)}")
# decision_function() returns a signed score for each point: the sign gives the
# predicted side of the decision boundary and the magnitude grows with the
# distance from it, rather than simply returning the class prediction
y_pred = clf.decision_function(X_test)
# print(y_pred)

print(f"auroc with decision_function: {metrics.roc_auc_score(y_test,y_pred)}")

auroc with predict: 0.9354368932038835
auroc with decision_function: 0.9927184466019418


We can also use a confusion matrix to get the exact number of correct and incorrect classifications for the positive and negative classes.
From these four counts, we can derive most other performance metrics.

In [9]:
# Confusion Matrix
y_pred = clf.predict(X_test)
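# labels must be passed by keyword in recent sklearn versions.
# labels=[1, 0] puts the positive class first: rows are true labels, columns are predictions
conf_mat = metrics.confusion_matrix(y_test, y_pred, labels=[1, 0])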
tp = conf_mat[0, 0]
fn = conf_mat[0, 1]
fp = conf_mat[1, 0]
tn = conf_mat[1, 1]

print(f"True Positive: {tp}")
print(f"False Negative: {fn}")
print(f"False Positive: {fp}")
print(f"True Negative: {tn}")

True Positive: 100
False Negative: 3
False Positive: 4
True Negative: 36
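
As a quick check, the metrics printed earlier can be rebuilt from the four counts above using their standard definitions. This is only a sketch with the counts copied in by hand; the variable names are our own.

```python
# Counts copied from the confusion matrix output above
tp, fn, fp, tn = 100, 3, 4, 36

accuracy = (tp + tn) / (tp + fn + fp + tn)
precision = tp / (tp + fp)            # of the predicted positives, how many are correct
recall = tp / (tp + fn)               # of the actual positives, how many were found
specificity = tn / (tn + fp)          # of the actual negatives, how many were found
f1 = 2 * precision * recall / (precision + recall)
# The mean of recall and specificity equals AUROC computed from hard predictions
balanced_accuracy = (recall + specificity) / 2

print(f"accuracy: {accuracy}")
print(f"f1 score: {f1}")
print(f"precision: {precision}")
print(f"recall: {recall}")
print(f"balanced accuracy: {balanced_accuracy}")
```

The first four values agree with the output of the metrics cell, and the balanced accuracy agrees with the "auroc with predict" value.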


In [10]:
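# Setting class weights manually: class_weight scales C per class, so here
# errors on class 1 are penalized 100 times more heavily than errors on class 0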

clf = SVC(kernel='linear', C=1.0, class_weight={0: 1, 1: 100})
# Training the classifier
clf.fit(X_train, y_train)

# Predicting classes
y_pred = clf.predict(X_test)

print(y_pred)

[1 1 0 0 1 1 1 1 0 1 1 1 1 1 1 1 1 0 0 1 0 1 1 1 1 1 1 1 1 0 1 1 1 0 1 1 1
 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 1 1 1 1 1 1 1 1
 1 1 1 1 0 0 0 1 0 1 1 1 1 0 0 1 1 0 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 1 0 1
 1 1 0 1 1 0 0 1 0 1 1 0 1 1 1 0 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 1]
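
For comparison with the manual weights, `class_weight="balanced"` (used in the earlier classifier) derives the weights from the training labels. The sklearn documentation gives the formula as `n_samples / (n_classes * np.bincount(y))`. The sketch below reproduces that formula in plain Python; the helper name `balanced_class_weights` and the toy labels are assumptions for illustration.

```python
from collections import Counter

def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(y)
    n_samples = len(y)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

# Hypothetical imbalanced labels: 2 samples of class 0, 6 samples of class 1
y = [0] * 2 + [1] * 6
weights = balanced_class_weights(y)
print(weights[0])  # 2.0
print(weights[1])
```

The rarer class gets the larger weight, so mistakes on it cost more during training. A manual dictionary such as `{0: 1, 1: 100}` overrides this and lets us pick the trade-off ourselves.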
