## Active Learning

Download the Titanic dataset here: https://drive.google.com/file/d/0Bz9_0VdXvv9bbVhpOEMwUDJ2elU/view?usp=sharing

In this exercise, we will simulate active learning. We hold out a small sample of observations for testing and measure how the quality of the model improves as we use active learning to choose which observations get labelled.

In [1]:
import pandas as pd
import numpy as np

In [2]:
# Load the Data into variable df
df = pd.read_csv("titanic_dataset.csv")

In [3]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [4]:
df.shape

(891, 12)

### EDA

In [5]:
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [6]:
df.fillna({"Age": df.Age.mean(),
           "Embarked": "S",
           },
         inplace=True)

In [7]:
df["Embarked"].value_counts()

S    646
C    168
Q     77
Name: Embarked, dtype: int64

### Encoding

In [8]:
df = pd.get_dummies(df, columns=["Embarked", "Sex"])

In [9]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked_C,Embarked_Q,Embarked_S,Sex_female,Sex_male
0,1,0,3,"Braund, Mr. Owen Harris",22.0,1,0,A/5 21171,7.25,,0,0,1,0,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0,1,0,PC 17599,71.2833,C85,1,0,0,1,0
2,3,1,3,"Heikkinen, Miss. Laina",26.0,0,0,STON/O2. 3101282,7.925,,0,0,1,1,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0,1,0,113803,53.1,C123,0,0,1,1,0
4,5,0,3,"Allen, Mr. William Henry",35.0,0,0,373450,8.05,,0,0,1,0,1


### Split data

In [10]:
# TEST SAMPLE
# USE THIS SAMPLE ONLY FOR TESTING
test_df = df.sample(n=100, random_state=42)
# KEEP ONLY THOSE WHO ARE NOT IN THE TEST SET
df = df[~df.PassengerId.isin(test_df.PassengerId.tolist())]

In [11]:
df.shape

(791, 15)

In [12]:
# FIT THE FIRST MODEL ONLY ON THE DATAFRAME START_DF
start_df = df.sample(n=100, random_state=42)
# DROP OBS FROM START_DF FROM DF
df = df[~df.PassengerId.isin(start_df.PassengerId.tolist())]

In [13]:
df.shape

(691, 15)

### Tasks

1. Fit the first model on **start_df** only, using an **SVM**, and evaluate accuracy, precision and recall on **test_df**.
2. In each iteration, add 10 observations from **df** to your training set, chosen with an active learning approach:
    - score all observations in **df** and take the 10 for which the model is least sure of the class, i.e. those whose predicted probability of surviving is closest to 50%.
3. Refit the model and evaluate on **test_df** again.
4. The goal is to converge to the optimal solution as fast as possible by choosing the **right** observations in each iteration.
5. Plot a graph for each evaluation metric, with the iteration number on the x-axis and the metric value for that model on the y-axis.
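
The selection rule in the list above (take the observations whose survival probability is closest to 50%) can be sketched in a few lines. This is a minimal illustration, not the notebook's own code: the helper name `pick_most_uncertain` and the toy probabilities are made up for the example.

```python
import numpy as np

def pick_most_uncertain(p_survive, k=10):
    """Return the positions of the k probabilities closest to 0.5 (hypothetical helper)."""
    p_survive = np.asarray(p_survive, dtype=float)
    # Distance from 0.5 measures how unsure the model is: 0 means a coin flip.
    uncertainty_gap = np.abs(p_survive - 0.5)
    # A stable sort keeps the original order among ties.
    return np.argsort(uncertainty_gap, kind="stable")[:k]

# Toy survival probabilities, made up for illustration
p = [0.95, 0.52, 0.10, 0.47, 0.80]
print(pick_most_uncertain(p, k=2))  # positions 1 and 3 (0.52 and 0.47)
```

For a binary classifier this is equivalent to picking the rows with the smallest maximum class probability, since the larger of the two class probabilities is smallest exactly when both are near 0.5.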

In [14]:
from sklearn.svm import SVC

### First Model

In [15]:
X_train = start_df.drop(["Survived", "PassengerId", "Cabin", "Fare", "Ticket", "Name"], axis=1)
y_train = start_df["Survived"]

X_test = test_df.drop(["Survived", "PassengerId", "Cabin", "Fare", "Ticket", "Name"], axis=1)
y_test = test_df["Survived"]

X = df.drop(["Survived", "PassengerId", "Cabin", "Fare", "Ticket", "Name"], axis=1)
y = df["Survived"]

In [16]:
X_train

Unnamed: 0,Pclass,Age,SibSp,Parch,Embarked_C,Embarked_Q,Embarked_S,Sex_female,Sex_male
288,2,42.000000,0,0,0,0,1,0,1
42,3,29.699118,0,0,1,0,0,0,1
416,2,34.000000,1,1,0,0,1,1,0
329,1,16.000000,0,1,1,0,0,1,0
587,1,60.000000,1,1,1,0,0,0,1
...,...,...,...,...,...,...,...,...,...
10,3,4.000000,1,1,0,0,1,1,0
173,3,21.000000,0,0,0,0,1,0,1
431,3,29.699118,1,0,0,0,1,1,0
592,3,47.000000,0,0,0,0,1,0,1


In [17]:
print(y_train.value_counts())
print(y_train.shape)
print(y_test.value_counts())
print(y_test.shape)  # report the test set shape, not the train shape again

0    60
1    40
Name: Survived, dtype: int64
(100,)
0    60
1    40
Name: Survived, dtype: int64
(100,)


In [18]:
svc = SVC(probability=True)
svc.fit(X_train, y_train)
pred = svc.predict(X_test)

In [19]:
y_test.to_numpy()

array([1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1,
       1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0,
       0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1,
       0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0])

In [20]:
pred

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [21]:
from sklearn import metrics

In [22]:
accuracy = metrics.accuracy_score(y_test, pred)
precision = metrics.precision_score(y_test, pred)
recall = metrics.recall_score(y_test, pred)

  _warn_prf(average, modifier, msg_start, len(result))


In [23]:
print(accuracy, precision, recall, sep="\n")

0.6
0.0
0.0
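
These numbers follow directly from the predictions: the first model labels every test passenger as not surviving, so it never produces a true positive. The short sketch below recomputes the three metrics from the counts visible in the outputs above (60 non-survivors, 40 survivors, all predicted as 0); the variable names are illustrative only.

```python
# Confusion counts when every prediction is 0 on a 60/40 test set
tp, fp = 0, 0      # no positive predictions at all
fn, tn = 40, 60    # all survivors missed, all non-survivors correct

accuracy = (tp + tn) / (tp + fp + fn + tn)
# Precision is 0/0 here; treat the undefined case as 0.0
precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0

print(accuracy, precision, recall, sep="\n")
```

By default scikit-learn also returns 0.0 for the undefined precision and warns that the metric is ill-defined, which is where the `_warn_prf` line in the output above comes from.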


### Score observations and select 10 new samples

In [24]:
X_unlabelled = df.drop(["Survived", "PassengerId", "Cabin", "Fare", "Ticket", "Name"], axis=1)
y_unlabelled = df["Survived"]

In [25]:
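# class probabilities for every unlabelled row, one column per class
probabilities = svc.predict_proba(X_unlabelled)

# probability of the predicted class per row (a confidence score, not a label)
most_likely_labels = pd.Series(np.max(probabilities, axis=1), index=X_unlabelled.index)

# ascending sort puts the least confident predictions (closest to 0.5) first
least_confident_predictions = most_likely_labels.sort_values()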

# get indices of rows to add
indices_to_select = least_confident_predictions.head(10).index

# select rows to add
rows_to_add = X_unlabelled.loc[indices_to_select]
X_unlabelled.drop(indices_to_select, inplace=True)

# add rows to X_train
X_train = pd.concat([X_train, rows_to_add])

# select corresponding labels
labels_to_add = y_unlabelled.loc[indices_to_select]
y_unlabelled.drop(indices_to_select, inplace=True)

# add corresponding labels to y train
y_train = pd.concat([y_train, labels_to_add])

### Rerun model

In [118]:
svc.fit(X_train, y_train)
pred = svc.predict(X_test)

In [119]:
accuracy = metrics.accuracy_score(y_test, pred)
precision = metrics.precision_score(y_test, pred)
recall = metrics.recall_score(y_test, pred)
print(accuracy, precision, recall, sep="\n")

0.73
0.6181818181818182
0.85
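
To see how these scores evolve across refits, each evaluation can be stored and plotted against the iteration number. The following is a minimal sketch, assuming matplotlib is installed; `plot_history` and `toy_history` are hypothetical names, and the values are placeholders rather than results from this notebook.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line inside a notebook
import matplotlib.pyplot as plt

def plot_history(history):
    """Plot one line per metric; `history` is a list of dicts, one per iteration."""
    fig, ax = plt.subplots()
    iterations = range(len(history))
    for name in ("accuracy", "precision", "recall"):
        ax.plot(iterations, [h[name] for h in history], marker="o", label=name)
    ax.set_xlabel("iteration")
    ax.set_ylabel("metric value on test_df")
    ax.legend()
    return fig, ax

# Placeholder values for illustration only, not results from this notebook
toy_history = [
    {"accuracy": 0.5, "precision": 0.4, "recall": 0.3},
    {"accuracy": 0.6, "precision": 0.5, "recall": 0.6},
]
fig, ax = plot_history(toy_history)
```

In the notebook, one dict would be appended to the history list after each refit and evaluation.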


In [120]:
probabilities = svc.predict_proba(X_unlabelled)

most_likely_labels = pd.Series(np.max(probabilities, axis=1), index=X_unlabelled.index)

least_confident_predictions = most_likely_labels.sort_values()

# get indices of rows to add
indices_to_select = least_confident_predictions.head(10).index

# select rows to add
rows_to_add = X_unlabelled.loc[indices_to_select]
X_unlabelled.drop(indices_to_select, inplace=True)

# add rows to X_train
X_train = pd.concat([X_train, rows_to_add])

# select corresponding labels
labels_to_add = y_unlabelled.loc[indices_to_select]
y_unlabelled.drop(indices_to_select, inplace=True)

# add corresponding labels to y train
y_train = pd.concat([y_train, labels_to_add])

### Conclusion

There is some stochasticity in the results, but after 18 or 19 iterations the recall score seemed to be consistently around 0.9.

With more time, I would wrap all of this in a for loop and define functions, but I ran out of time.
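
One way the loop could look is sketched below. It is an assumption about how the manual cells might be generalised, not the code that produced the results above: the function names `evaluate` and `active_learning_loop` are hypothetical, and synthetic data from `make_classification` stands in for the Titanic frames so that the block runs on its own.

```python
import pandas as pd
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.svm import SVC

def evaluate(model, X_test, y_test):
    """Return accuracy, precision and recall on the held-out test set."""
    pred = model.predict(X_test)
    return {
        "accuracy": metrics.accuracy_score(y_test, pred),
        "precision": metrics.precision_score(y_test, pred, zero_division=0),
        "recall": metrics.recall_score(y_test, pred, zero_division=0),
    }

def active_learning_loop(X_train, y_train, X_pool, y_pool, X_test, y_test,
                         n_iterations=5, batch_size=10):
    """Refit after moving the least confident pool rows into the training set."""
    model = SVC(probability=True, random_state=0)
    model.fit(X_train, y_train)
    history = [evaluate(model, X_test, y_test)]
    for _ in range(n_iterations):
        if len(X_pool) < batch_size:
            break
        # Least-confidence sampling: smallest probability of the predicted class
        confidence = pd.Series(model.predict_proba(X_pool).max(axis=1),
                               index=X_pool.index)
        picked = confidence.nsmallest(batch_size).index
        # Move the picked rows (and their labels) from the pool to the train set
        X_train = pd.concat([X_train, X_pool.loc[picked]])
        y_train = pd.concat([y_train, y_pool.loc[picked]])
        X_pool, y_pool = X_pool.drop(picked), y_pool.drop(picked)
        model.fit(X_train, y_train)
        history.append(evaluate(model, X_test, y_test))
    return history, X_train

# Synthetic stand-in data (assumption: replaces test_df, start_df and df)
X_all, y_all = make_classification(n_samples=300, n_features=5, random_state=0)
X_all, y_all = pd.DataFrame(X_all), pd.Series(y_all)
X_test, y_test = X_all.iloc[:100], y_all.iloc[:100]
X_start, y_start = X_all.iloc[100:140], y_all.iloc[100:140]
X_pool, y_pool = X_all.iloc[140:], y_all.iloc[140:]

history, X_final = active_learning_loop(X_start, y_start, X_pool, y_pool,
                                        X_test, y_test)
print(len(history), len(X_final))
```

With the notebook's own variables, the call would take `X_train`, `y_train`, `X_unlabelled`, `y_unlabelled`, `X_test` and `y_test` in place of the synthetic frames.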