## Active Learning

Download the Titanic dataset here: https://drive.google.com/file/d/0Bz9_0VdXvv9bbVhpOEMwUDJ2elU/view?usp=sharing

In this exercise, we simulate active learning. We hold out a small sample of observations for testing only, and we track how the quality of the model improves when active learning chooses which observations are labeled and added to the training set.

In [1]:
import pandas as pd

In [2]:
df_original=pd.read_csv('titanic_dataset.csv')

In [3]:
df_original.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [4]:
df_original.shape

(891, 12)

In [5]:
# TEST SAMPLE
# USE THIS SAMPLE ONLY FOR TESTING
test_df = df_original.sample(n=100, random_state=42)
# KEEP ONLY THOSE WHO ARE NOT IN THE TEST SET
df = df_original[~df_original.PassengerId.isin(test_df.PassengerId.tolist())]

In [6]:
# FIT THE FIRST MODEL ONLY ON THE DATAFRAME START_DF
start_df = df.sample(n=100, random_state=42)
# DROP OBS FROM START_DF FROM DF
df = df[~df.PassengerId.isin(start_df.PassengerId.tolist())]

### Tasks

1. Fit the first model on **start_df** only, using an **SVM**, and evaluate accuracy, precision, and recall on test_df.
2. In each iteration, add 10 observations from **df** to your training set (choose them with an active learning approach), refit the model, and evaluate on test_df again.
3. The goal is to converge to the optimal solution as fast as possible by choosing the **right** observations in each iteration.
4. Plot a graph for each evaluation metric, with the iteration number on the x-axis and the metric value for that model on the y-axis.
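
One common way to pick the "right" observations in task 2 is uncertainty sampling: query the pool points the current model is least sure about. For an SVM, one natural uncertainty score is the absolute value of `decision_function`, which reflects the distance to the decision boundary. The sketch below only illustrates the selection step on made-up numbers; the helper name `pick_most_uncertain` and the toy scores are assumptions, not part of the exercise.

```python
import numpy as np

def pick_most_uncertain(scores, k=10):
    """Return the positions of the k pool points closest to the decision boundary.

    `scores` are signed decision values (e.g. from SVC.decision_function);
    a value near 0 means the model is unsure about that point.
    """
    scores = np.asarray(scores, dtype=float)
    k = min(k, scores.size)
    # Sorting by |score| puts the least confident points first
    return np.argsort(np.abs(scores), kind="stable")[:k]

# Toy decision values for six hypothetical pool observations
toy_scores = [2.1, -0.05, 0.7, -1.8, 0.2, -0.4]
picked = pick_most_uncertain(toy_scores, k=3)
print(picked)  # positions 1, 4 and 5 have the smallest |score|
```

Other query strategies (for example, probabilities closest to 0.5) would slot into the same place; this is just one option.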

In [7]:
df.shape

(691, 12)

In [8]:
test_df.shape

(100, 12)

In [9]:
start_df.shape

(100, 12)

In [10]:
start_df.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [11]:
start_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
288,289,1,2,"Hosono, Mr. Masabumi",male,42.0,0,0,237798,13.0,,S
42,43,0,3,"Kraeff, Mr. Theodor",male,,0,0,349253,7.8958,,C
416,417,1,2,"Drew, Mrs. James Vivian (Lulu Thorne Christian)",female,34.0,1,1,28220,32.5,,S
329,330,1,1,"Hippach, Miss. Jean Gertrude",female,16.0,0,1,111361,57.9792,B18,C
587,588,1,1,"Frolicher-Stehli, Mr. Maxmillian",male,60.0,1,1,13567,79.2,B41,C


In [12]:
start_df['Cabin'].value_counts()

C23 C25 C27    2
F G73          1
B41            1
B18            1
B51 B53 B55    1
E121           1
F33            1
E24            1
G6             1
A34            1
A19            1
D10 D12        1
B96 B98        1
Name: Cabin, dtype: int64

In [13]:
# Encode categorical variables as integer codes
# (pd.factorize assigns codes in order of first appearance)
start_df['Sex'] = pd.factorize(start_df['Sex'])[0]
start_df['Embarked'] = pd.factorize(start_df['Embarked'])[0]

In [14]:
# missing data
total = start_df.isnull().sum().sort_values(ascending=False)
percent = (start_df.isnull().sum()/start_df.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(20)

Unnamed: 0,Total,Percent
Cabin,86,0.86
Age,21,0.21
Embarked,0,0.0
Fare,0,0.0
Ticket,0,0.0
Parch,0,0.0
SibSp,0,0.0
Sex,0,0.0
Name,0,0.0
Pclass,0,0.0


In [15]:
start_df=start_df.drop('Cabin',axis=1)

In [16]:
start_df=start_df.drop('Name',axis=1)

In [17]:
# Fill age with median
start_df['Age']=start_df['Age'].fillna(start_df['Age'].median())

In [18]:
# Scale Numeric Vars (Age, Fare)
from sklearn.preprocessing import StandardScaler
num_scaled = StandardScaler().fit_transform(start_df[['Age','Fare']])
num_scaled=pd.DataFrame(num_scaled)

In [19]:
num_scaled.columns=['Age','Fare']

In [20]:
num_scaled.tail()

Unnamed: 0,Age,Fare
95,-1.870339,-0.233231
96,-0.6777,-0.368007
97,-0.116458,-0.242447
98,1.146337,-0.378375
99,-0.116458,-0.132629


In [21]:
start_df[['Pclass', 'Sex', 'SibSp',
       'Parch', 'Embarked']].reset_index().drop('index',axis=1)

Unnamed: 0,Pclass,Sex,SibSp,Parch,Embarked
0,2,0,0,0,0
1,3,0,0,0,1
2,2,1,1,1,0
3,1,1,0,1,1
4,1,0,1,1,1
...,...,...,...,...,...
95,3,1,1,1,0
96,3,0,0,0,0
97,3,1,1,0,0
98,3,0,0,0,0


In [22]:
y=start_df['Survived']

In [23]:
X=pd.concat([start_df[['Pclass', 'Sex', 'SibSp',
       'Parch', 'Embarked']].reset_index().drop('index',axis=1),num_scaled],axis=1)

In [24]:
y.shape

(100,)

In [25]:
X.shape

(100, 7)

In [26]:
from sklearn.svm import SVC
# SVC with default hyperparameters (RBF kernel)
clf = SVC()
clf.fit(X, y)

SVC()

## Cleaning test_df to prepare for prediction

In [27]:
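# Encode categorical variables
# Caveat: pd.factorize assigns codes in order of first appearance, so these
# codes are not guaranteed to match the ones assigned in start_df
test_df['Sex'] = pd.factorize(test_df['Sex'])[0]
test_df['Embarked'] = pd.factorize(test_df['Embarked'])[0]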
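
Because `pd.factorize` numbers categories in order of first appearance, encoding `start_df` and `test_df` separately can give the same label different codes. The sketch below shows the effect on two toy Series and one possible way around it, an explicit mapping. The Series and the dictionary name `SEX_CODES` are assumptions for illustration; the mapping male → 0, female → 1 was chosen to agree with the `start_df` output shown earlier.

```python
import pandas as pd

train_sex = pd.Series(["male", "female", "male"])
test_sex = pd.Series(["female", "male", "female"])

# factorize assigns codes by first appearance, so the two encodings disagree
print(pd.factorize(train_sex)[0])  # here male -> 0, female -> 1
print(pd.factorize(test_sex)[0])   # here female -> 0, male -> 1

# A fixed mapping (hypothetical name) gives the same code in every frame
SEX_CODES = {"male": 0, "female": 1}
train_encoded = train_sex.map(SEX_CODES)
test_encoded = test_sex.map(SEX_CODES)
print(train_encoded.tolist(), test_encoded.tolist())
```

The same idea would apply to `Embarked`, with whatever code assignment the training frame ended up using.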

In [28]:
# Fill age with median
test_df['Age']=test_df['Age'].fillna(test_df['Age'].median())

In [29]:
# Scale numeric variables (Age, Fare)
# Note: this fits a new scaler on test_df instead of reusing the one fitted on start_df
num_scaled = StandardScaler().fit_transform(test_df[['Age','Fare']])
num_scaled=pd.DataFrame(num_scaled)

In [30]:
num_scaled.columns=['Age','Fare']
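
Fitting a separate `StandardScaler` on `test_df` standardizes the test features with different means and variances than the training features. A common alternative, sketched here on made-up numbers, is to fit the scaler once on the training data and reuse it for the test data. The toy frames and variable names are assumptions, not the notebook's data.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy stand-ins for the Age and Fare columns of start_df and test_df
train = pd.DataFrame({"Age": [20.0, 30.0, 40.0], "Fare": [10.0, 20.0, 30.0]})
test = pd.DataFrame({"Age": [30.0, 50.0], "Fare": [20.0, 40.0]})

# Learn the mean and standard deviation from the training data only
scaler = StandardScaler().fit(train[["Age", "Fare"]])

# Apply the same transformation to both frames
train_scaled = pd.DataFrame(scaler.transform(train[["Age", "Fare"]]),
                            columns=["Age", "Fare"])
test_scaled = pd.DataFrame(scaler.transform(test[["Age", "Fare"]]),
                           columns=["Age", "Fare"])

print(scaler.mean_)  # training means, used for both frames
print(test_scaled)
```

With this approach a test value equal to the training mean maps to 0, so training and test features share one scale.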

In [31]:
y_test=test_df['Survived']

In [32]:
X_test=pd.concat([test_df[['Pclass', 'Sex', 'SibSp',
       'Parch', 'Embarked']].reset_index().drop('index',axis=1),num_scaled],axis=1)

In [33]:
X_test.shape

(100, 7)

In [34]:
y_test.shape

(100,)

In [35]:
#Predict the response for test dataset
y_pred = clf.predict(X_test)

In [36]:
from sklearn import metrics

In [37]:
# Confusion Matrix
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
cnf_matrix

array([[51,  9],
       [11, 29]])

In [38]:
# Accuracy, Precision, Recall
print ("accuracy: ", metrics.accuracy_score(y_test, y_pred))
print("Precision:",metrics.precision_score(y_test, y_pred))
print("Recall:",metrics.recall_score(y_test, y_pred))

accuracy:  0.8
Precision: 0.7631578947368421
Recall: 0.725
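
The three scores can be checked by hand from the confusion matrix above, using scikit-learn's layout (rows are true classes, columns are predicted classes). A small sketch; the variable names are only illustrative.

```python
# Confusion matrix from the cell above: rows = true class, columns = predicted class
tn, fp = 51, 9    # true class 0: predicted 0, predicted 1
fn, tp = 11, 29   # true class 1: predicted 0, predicted 1

accuracy = (tp + tn) / (tp + tn + fp + fn)   # (29 + 51) / 100
precision = tp / (tp + fp)                   # 29 / 38
recall = tp / (tp + fn)                      # 29 / 40

print(accuracy, precision, recall)
```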


## Now that my baseline model is done:

1. In each iteration, add 10 observations from df to the training set (choose them with an active learning approach), refit the model, and evaluate on test_df again.
2. The goal is to converge to the optimal solution as fast as possible by choosing the right observations in each iteration.
3. Plot a graph for each evaluation metric, with the iteration number on the x-axis and the metric value for that model on the y-axis.

In [None]:
# I want to choose samples the model handled poorly, or that it wasn't sure about.

# How do I find these samples? Use predict_proba (SVC needs probability=True for this),
# look at the samples with the lowest predicted-class probabilities, and see if there is a pattern.

# Then choose similar samples from df that match that pattern.
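
The idea in the notes above can be sketched as a pool-based loop with uncertainty sampling. This is a minimal illustration, not the notebook's solution: it runs on synthetic data from `make_classification` instead of the Titanic frames, it uses `decision_function` rather than `predict_proba` to measure uncertainty, and all variable names are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.svm import SVC

# Synthetic stand-in for the Titanic features (assumption: not the real data)
X_all, y_all = make_classification(n_samples=600, n_features=7, random_state=42)
X_test, y_test = X_all[:100], y_all[:100]          # held-out test set
X_train, y_train = X_all[100:200], y_all[100:200]  # plays the role of start_df
X_pool, y_pool = X_all[200:], y_all[200:]          # plays the role of df

history = []
for iteration in range(11):
    # Refit on the current training set and evaluate on the fixed test set
    clf = SVC().fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    history.append({
        "iteration": iteration,
        "train_size": len(y_train),
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred, zero_division=0),
        "recall": recall_score(y_test, y_pred, zero_division=0),
    })
    if len(y_pool) < 10:
        break
    # Uncertainty sampling: the 10 pool points closest to the decision boundary
    query = np.argsort(np.abs(clf.decision_function(X_pool)))[:10]
    X_train = np.vstack([X_train, X_pool[query]])
    y_train = np.concatenate([y_train, y_pool[query]])
    # Remove the queried points so they cannot be selected again
    X_pool = np.delete(X_pool, query, axis=0)
    y_pool = np.delete(y_pool, query)

for row in history:
    print(row)
```

The `history` list holds one row per iteration, so something like `pd.DataFrame(history).plot(x="iteration", y=["accuracy", "precision", "recall"])` could produce the requested metric curves. Comparing against a run that adds 10 random pool points per iteration would show whether the query strategy actually helps.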

In [80]:
df.shape

(691, 12)