<a href="https://colab.research.google.com/github/acedesci/scanalytics/blob/master/S7_intro_ML/S7_Demo2_Titanic.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Session 7: Introduction to ML - Classification

## Demo 2: Classification on Toy Dataset 2 (Titanic)

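This demo performs classification on the Titanic dataset from Kaggle (https://www.kaggle.com/c/titanic/data). The data file is provided to you; place the folder "data_titanic" in the same folder as this Jupyter notebook. Note that the loading cell below reads the training file from the course GitHub repository URL.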

In [0]:
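import sklearn
# Import the submodules used in this notebook explicitly instead of a wildcard import
import sklearn.linear_model, sklearn.metrics, sklearn.model_selection
import sklearn.preprocessing, sklearn.tree
sklearn.__version__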

In [0]:
import pandas #we also need pandas package here to load the csv data

# Load dataset
url = 'https://raw.githubusercontent.com/acedesci/scanalytics/master/S7_intro_ML/data/titanic_train.csv'
data = titanic_data = pandas.read_csv(url)
data.head(5)

In [0]:
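# Encode 'Sex' as an integer column; LabelEncoder sorts the classes, so female -> 0 and male -> 1
le_gender = sklearn.preprocessing.LabelEncoder()
data['binary_gender'] = le_gender.fit_transform(titanic_data['Sex'].values)
selected_features = ['Pclass','Age', 'SibSp', 'Parch', 'Fare','binary_gender']
# Missing values (e.g., in 'Age') are replaced by 0 before building the feature matrix
X = data[selected_features].fillna(0).values
y = data['Survived'].values
data.head(5)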
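
To make the encoding step above concrete, here is a minimal sketch of how `LabelEncoder` assigns integer codes. The `toy` DataFrame and the `encoder` and `mapping` names are illustrative assumptions and are not part of the Titanic data.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column standing in for the 'Sex' column (illustrative values, not the real data)
toy = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})

encoder = LabelEncoder()
toy["binary_gender"] = encoder.fit_transform(toy["Sex"])

# classes_ is sorted, and each class's position in it is its integer code
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)
print(toy)
```

Inspecting `classes_` in this way is a quick check of which integer corresponds to which category before interpreting a coefficient or a tree split on the encoded column.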

In [0]:
# Split into training and testing data (75/25 by default)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, random_state=0)
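
As a quick check of the default split proportions, the sketch below applies `train_test_split` to a small synthetic array. The names `X_toy` and `y_toy` are assumptions for illustration, and the `stratify` argument is an optional extra that the cell above does not use.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 synthetic samples with two features each; 60 of class 0 and 40 of class 1
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 60 + [1] * 40)

# Default behaviour: 25% of the samples go to the test set
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=0)
print(len(X_tr), len(X_te))

# Optional: stratify=y keeps the class proportions equal in both parts
_, _, _, y_te_strat = train_test_split(X_toy, y_toy, random_state=0, stratify=y_toy)
print(y_te_strat.mean())
```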

### 1. Logistic regression model

In [0]:
# Run Logistic regression
logreg = sklearn.linear_model.LogisticRegression().fit(X_train, y_train)

# Print the results
print("Logistic Regression: Training accuracy: {:.3f}".format(logreg.score(X_train, y_train)))
print("Logistic Regression: Testing accuracy: {:.3f}".format(logreg.score(X_test, y_test)))

# Intercept and one coefficient per feature, in the order of selected_features
print("intercept: "+str(logreg.intercept_))
print("coefficients: "+str(logreg.coef_[0]))
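
One common way to read logistic regression coefficients is to exponentiate them into odds ratios. The sketch below illustrates this on synthetic data; the data, the feature names `f0` to `f2`, and the variable names are assumptions for illustration, not results from the Titanic model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (hypothetical, for illustration only)
X_syn, y_syn = make_classification(n_samples=300, n_features=3, n_informative=2,
                                   n_redundant=0, random_state=0)
feature_names = ["f0", "f1", "f2"]

model = LogisticRegression().fit(X_syn, y_syn)

# exp(coefficient) is the multiplicative change in the odds of class 1
# for a one-unit increase in that feature, holding the others fixed
odds_ratios = np.exp(model.coef_[0])
for name, coef, ratio in zip(feature_names, model.coef_[0], odds_ratios):
    print("{}: coefficient={:.3f}, odds ratio={:.3f}".format(name, coef, ratio))
```

An odds ratio above 1 means the feature raises the odds of the positive class, and a ratio below 1 means it lowers them. Because the features are not standardized, the magnitudes are not directly comparable across features.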

### 2. Decision tree model

In [0]:
# Run decision tree
tree = sklearn.tree.DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("Decision Tree: Training accuracy: {:.3f}".format(tree.score(X_train, y_train)))
print("Decision Tree: Testing accuracy: {:.3f}".format(tree.score(X_test, y_test)))

#### 2.1 Explore feature importance and tree

In [0]:
print(selected_features)
print(tree.feature_importances_)
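
Printing the two lists separately makes it hard to match names to values. A minimal sketch of pairing and sorting them with a pandas Series follows; it uses synthetic data and hypothetical feature names (`f0` to `f3`) so that it runs on its own.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the Titanic features (illustration only)
X_syn, y_syn = make_classification(n_samples=300, n_features=4, n_informative=2,
                                   n_redundant=0, random_state=0)
feature_names = ["f0", "f1", "f2", "f3"]

clf = DecisionTreeClassifier(random_state=0).fit(X_syn, y_syn)

# Pair each importance with its feature name and sort from most to least important
importances = pd.Series(clf.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```

In the notebook itself, the same pattern would presumably work with `tree.feature_importances_` and `selected_features`.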

In [0]:
import matplotlib.pyplot as plt
plt.figure(figsize=(12,12))
sklearn.tree.plot_tree(tree, feature_names=selected_features, fontsize=10, max_depth=2)
plt.show()
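
When a plot is hard to read, the top levels of a tree can also be printed as text with `sklearn.tree.export_text`. The sketch below is one possible alternative view, shown on synthetic data with hypothetical feature names.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data and hypothetical feature names (illustration only)
X_syn, y_syn = make_classification(n_samples=300, n_features=4, n_informative=2,
                                   n_redundant=0, random_state=0)
feature_names = ["f0", "f1", "f2", "f3"]

clf = DecisionTreeClassifier(random_state=0).fit(X_syn, y_syn)

# Text rendering of the split rules, limited to the top levels of the tree
rules = export_text(clf, feature_names=feature_names, max_depth=2)
print(rules)
```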

#### 2.2 Performing prediction from the trained (fitted) model

In [0]:
#Note: input features = ['Pclass','Age', 'SibSp', 'Parch', 'Fare','binary_gender']
jack = [3., 19., 0., 0., 5.0, 1.0]
rose = [1., 17., 1., 2., 100.0, 0.0]

print([jack,rose])

# Predict the survival class (1 = survived, 0 = did not survive)
survivor_predictions = tree.predict([jack, rose])
print("Jack predicted survival:", survivor_predictions[0])
print("Rose predicted survival:", survivor_predictions[1])
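
Note that `predict` returns class labels, not probabilities. If an estimated probability is wanted, `predict_proba` can be used instead, as in the sketch below. It uses synthetic data and a depth-limited tree named `shallow_tree`, both of which are assumptions for illustration: a fully grown tree typically has pure leaves, so its probabilities are mostly exactly 0 or 1.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (illustration only); max_depth=3 is an arbitrary choice
X_syn, y_syn = make_classification(n_samples=300, n_features=6, random_state=0)
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_syn, y_syn)

two_rows = X_syn[:2]

# Hard class labels (0 or 1)
labels = shallow_tree.predict(two_rows)

# One row per sample, one column per class; column 1 is the estimated P(class 1)
probabilities = shallow_tree.predict_proba(two_rows)

print("labels:", labels)
print("P(class 1):", probabilities[:, 1])
```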

### 3. Evaluation - calculate precision and recall

Calculate precision and recall on the test set for one of the fitted models (either logistic regression or decision tree).

In [0]:
# calculate precision and recall

y_predict = logreg.predict(X_test) #use this for logistic regression 
# y_predict = tree.predict(X_test) #use this for decision tree

print("Precision score: {:.3f}".format(sklearn.metrics.precision_score(y_test, y_predict)))
print("Recall score: {:.3f}".format(sklearn.metrics.recall_score(y_test, y_predict)))
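
To show where these two scores come from, the sketch below computes them by hand from a confusion matrix on a small made-up example. The label vectors `y_true_toy` and `y_pred_toy` are invented for illustration and are not model output.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up true labels and predictions (illustration only)
y_true_toy = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_toy = [1, 0, 0, 1, 1, 0, 1, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

# Precision: share of predicted positives that are truly positive
manual_precision = tp / (tp + fp)
# Recall: share of true positives that the model found
manual_recall = tp / (tp + fn)

print("Precision: {:.3f}".format(manual_precision))
print("Recall: {:.3f}".format(manual_recall))
```

The manual values agree with `precision_score` and `recall_score` on the same inputs.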

The code below plots the precision-recall curve (see https://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html#sphx-glr-auto-examples-model-selection-plot-precision-recall-py).

Both options, logistic regression and decision tree, are included. However, the precision-recall curve is generally used for threshold analysis of logistic regression.

In [0]:
import matplotlib.pyplot as plt

# Scores used to rank the test samples (higher means more likely to have survived)
y_score = logreg.decision_function(X_test) #use this for logistic regression
# y_score = tree.predict_proba(X_test)[:,1]  #use this for decision tree

precision, recall, thresholds = sklearn.metrics.precision_recall_curve(y_test, y_score)
average_precision = sklearn.metrics.average_precision_score(y_test, y_score)

plt.figure()
plt.step(recall, precision)

plt.xlabel('Recall')
plt.ylabel('Precision')
plt.ylim([0.0, 1.05])
plt.xlim([0.0, 1.0])
plt.title('Average precision score: AP={0:0.2f}'.format(average_precision))
plt.show()
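
As an illustration of threshold analysis, the sketch below picks the threshold that maximizes the F1 score along the precision-recall curve and then applies it. The synthetic data, the F1 criterion, and all variable names are assumptions chosen for illustration; other criteria (for example a minimum required precision) may suit a given application better.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the Titanic features (illustration only)
X_syn, y_syn = make_classification(n_samples=400, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, random_state=0)

model = LogisticRegression().fit(X_tr, y_tr)
scores = model.decision_function(X_te)

# precision and recall have one more entry than thresholds;
# the final (precision=1, recall=0) point has no associated threshold
precision, recall, thresholds = precision_recall_curve(y_te, scores)
p, r = precision[:-1], recall[:-1]

# F1 at every candidate threshold (denominator guarded against 0/0)
f1 = 2 * p * r / np.maximum(p + r, 1e-12)
best = int(np.argmax(f1))
best_threshold = thresholds[best]

# Apply the chosen threshold: predict class 1 when the score reaches it
y_pred_custom = (scores >= best_threshold).astype(int)

print("Chosen threshold: {:.3f}".format(best_threshold))
print("Precision: {:.3f}".format(precision_score(y_te, y_pred_custom)))
print("Recall: {:.3f}".format(recall_score(y_te, y_pred_custom)))
```

By default, `predict` on a logistic regression corresponds to a threshold of 0 on `decision_function`, so choosing a different threshold trades precision against recall without refitting the model.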