##Logistic Regression

We will explore different methods for improving classification performance in the presence of class imbalance. We focus on the ‘vowel’ dataset, where the positive class makes up approximately 10% of the samples. All models are trained using logistic regression, and the metric for comparison is the F1 score. Train each of the following policies on every fold and report the mean (std-dev) F1 score for each policy.
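As a reminder of what is being reported, here is a minimal sketch of the F1 score computed from confusion counts, together with a "mean (std-dev)" summary over folds. The helper names `f1_from_counts` and `summarize` are hypothetical, and the confusion counts are made up for illustration; they are not results from the vowel data.

```python
import statistics

def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when the denominator is zero."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def summarize(scores):
    """Format per-fold scores as 'mean (std-dev)' using the sample standard deviation."""
    mean = statistics.mean(scores)
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return "{:.3f} ({:.3f})".format(mean, std)

# Made-up (TP, FP, FN) counts for three folds, for illustration only.
fold_counts = [(8, 2, 2), (9, 1, 3), (7, 3, 1)]
fold_f1 = [f1_from_counts(tp, fp, fn) for tp, fp, fn in fold_counts]
print(summarize(fold_f1))
```

The sketch uses the sample standard deviation (`statistics.stdev`); `np.std` defaults to the population version, so the two differ slightly on a handful of folds.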

In [4]:
import pandas as pd
import numpy as np
import sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import scale
from sklearn import metrics
import numpy.random as npr
import warnings
warnings.filterwarnings("ignore")

In [5]:
train1 = pd.read_csv("vowel_train1.csv", header = None)
train2 = pd.read_csv("vowel_train2.csv", header = None)
train3 = pd.read_csv("vowel_train3.csv", header = None)
train4 = pd.read_csv("vowel_train4.csv", header = None)
train5 = pd.read_csv("vowel_train5.csv", header = None)

In [6]:
tr_lb1 = pd.read_csv("vowel_tr_label1.csv", header = None)
tr_lb2 = pd.read_csv("vowel_tr_label2.csv", header = None)
tr_lb3 = pd.read_csv("vowel_tr_label3.csv", header = None)
tr_lb4 = pd.read_csv("vowel_tr_label4.csv", header = None)
tr_lb5 = pd.read_csv("vowel_tr_label5.csv", header = None)

In [7]:
test1 = pd.read_csv("vowel_test1.csv", header = None)
test2 = pd.read_csv("vowel_test2.csv", header = None)
test3 = pd.read_csv("vowel_test3.csv", header = None)
test4 = pd.read_csv("vowel_test4.csv", header = None)
test5 = pd.read_csv("vowel_test5.csv", header = None)

In [8]:
ts_lb1 = pd.read_csv("vowel_tst_label1.csv", header = None)
ts_lb2 = pd.read_csv("vowel_tst_label2.csv", header = None)
ts_lb3 = pd.read_csv("vowel_tst_label3.csv", header = None)
ts_lb4 = pd.read_csv("vowel_tst_label4.csv", header = None)
ts_lb5 = pd.read_csv("vowel_tst_label5.csv", header = None)

##Downsampling Policy

Write a function that will, given a training set, downsample the negative class samples so that the proportion of positive class samples is greater than 10%. Make this proportion a tuneable parameter of your function.
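The arithmetic behind the target proportion can be sketched as follows. If all `n_pos` positives are kept and the positive share should be at least `p`, then at most `n_pos * (1 - p) / p` negatives may remain. The helper `negatives_to_keep` is a hypothetical name, and rounding down is one choice among several.

```python
import math

def negatives_to_keep(n_pos, p):
    """Number of negatives to keep so that n_pos / (n_pos + n_neg) is at least p.

    Floating-point rounding can shift the result by one for some values of p.
    """
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return math.floor(n_pos * (1 - p) / p)

n_pos = 10
for p in (0.25, 0.5):
    n_neg = negatives_to_keep(n_pos, p)
    # Print the target, the negatives kept, and the resulting positive share.
    print(p, n_neg, n_pos / (n_pos + n_neg))
```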

In [9]:
train = [train1, train2, train3, train4, train5]
tr_lb = [tr_lb1, tr_lb2, tr_lb3, tr_lb4, tr_lb5]
test = [test1, test2, test3, test4, test5]
ts_lb = [ts_lb1, ts_lb2, ts_lb3, ts_lb4, ts_lb5]

In [64]:
def downsample(xtrain, ytrain, p):
    # Keep every positive sample and randomly drop negatives so that
    # positives make up roughly a fraction p of the returned training set.
    mask = ytrain[0] == 1
    pos_nm = int(mask.sum())
    neg_nm = int((~mask).sum())
    new_all_nm = int(float(pos_nm)/p + 1)
    # Never request more negatives than are available.
    new_neg_nm = min(new_all_nm - pos_nm, neg_nm)
    # Sample the negative row labels once and reuse them for x and y so that
    # features and labels stay aligned.
    neg_idx = ytrain[~mask].sample(n = new_neg_nm, random_state = 77).index
    new_xtrain = pd.concat([xtrain[mask], xtrain.loc[neg_idx]])
    new_ytrain = pd.concat([ytrain[mask], ytrain.loc[neg_idx]])
    return new_xtrain, new_ytrain

In [65]:
p_list = [0.1,0.11,0.12,0.13,0.14,0.15,0.16,0.17,0.18,0.19,0.2,0.21,0.22,0.23,0.24,0.25,0.26,0.27,0.28,0.29,0.3]
f1_score_dict = {}

In [66]:
for i in range(len(p_list)):
    f1_list = []
    for j in range(len(train)):
        new_x, new_y = downsample(train[j], tr_lb[j], p_list[i])
        pred = LogisticRegression().fit(new_x,new_y).predict(test[j])
        f1 = metrics.f1_score(ts_lb[j], pred)
        f1_list.append(f1)
    f1_score_dict[p_list[i]] = np.mean(f1_list)

In [67]:
f1_score_dict

{0.1: 0.76594086021505381,
 0.11: 0.76814674256799498,
 0.12: 0.78854978354978356,
 0.13: 0.79391961085509466,
 0.14: 0.80279894473442859,
 0.15: 0.80698598892147277,
 0.16: 0.80698598892147277,
 0.17: 0.80163839533858494,
 0.18: 0.81214752567693738,
 0.19: 0.78191958191958189,
 0.2: 0.80252913652366453,
 0.21: 0.79317241137746575,
 0.22: 0.79123487522940328,
 0.23: 0.79694312047253235,
 0.24: 0.79813739248912552,
 0.25: 0.78091855004898492,
 0.26: 0.76668220668220666,
 0.27: 0.76232323232323229,
 0.28: 0.76516011175585641,
 0.29: 0.76808408836404352,
 0.3: 0.77316187584480267}

##Upsampling

Implement a function that will upsample (via replication) the minority class so that the new class ratio is p : (1−p) where p is a tuneable parameter as above.
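A minimal, standard-library sketch of upsampling by replication, working on row indices rather than DataFrames. The name `upsample_indices` and the `seed` parameter are assumptions made for this illustration; the point is that a target ratio of p : (1 − p) with all negatives kept requires about `n_neg * p / (1 - p)` positives, drawn with replacement.

```python
import random

def upsample_indices(labels, p, seed=0):
    """Return row indices: positives drawn with replacement, followed by all
    negatives, so that positives : negatives is roughly p : (1 - p)."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    n_pos_new = int(len(neg) * p / (1 - p))
    rng = random.Random(seed)  # fixed seed makes the draw repeatable
    drawn = rng.choices(pos, k=n_pos_new)  # sampling with replacement
    return drawn + neg

# Toy labels: 2 positives and 18 negatives.
labels = [1] * 2 + [0] * 18
idx = upsample_indices(labels, p=0.25)
n_pos = sum(labels[i] for i in idx)
print(n_pos, len(idx) - n_pos)
```

Seeding the draw is a design choice: it makes the F1 scores for each value of p repeatable across runs.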

In [68]:
def upsample(xtrain, ytrain, p):
    # Keep every negative sample and draw positives with replacement so that
    # the positive:negative ratio is roughly p : (1 - p).
    mask = ytrain[0] == 0
    neg_nm = int(mask.sum())
    new_pos_nm = int(float(neg_nm)*p/float(1-p))
    # Row labels of the positive samples, resampled with replacement.
    # The draw is unseeded, so results vary from run to run.
    ind_pos = np.array(ytrain[~mask].index)
    masknew = npr.choice(ind_pos, size = new_pos_nm, replace = True)
    # .loc replaces the .ix indexer, which was removed in pandas 1.0.
    new_ytrain = pd.concat([ytrain.loc[masknew], ytrain[mask]])
    new_xtrain = pd.concat([xtrain.loc[masknew], xtrain[mask]])
    return new_xtrain, new_ytrain

In [69]:
f1_score_dict = {}

In [70]:
for i in range(len(p_list)):
    f1_list = []
    for j in range(len(train)):
        new_x, new_y = upsample(train[j], tr_lb[j], p_list[i])
        pred = LogisticRegression().fit(new_x,new_y).predict(test[j])
        f1 = metrics.f1_score(ts_lb[j], pred)
        f1_list.append(f1)
    f1_score_dict[p_list[i]] = np.mean(f1_list)

In [71]:
f1_score_dict

{0.1: 0.75352918586789563,
 0.11: 0.76371661168821414,
 0.12: 0.77610743963881446,
 0.13: 0.74255411255411263,
 0.14: 0.80655478613732901,
 0.15: 0.73073207443897104,
 0.16: 0.72339080459770122,
 0.17: 0.76257575757575757,
 0.18: 0.74603641456582637,
 0.19: 0.75567844342037893,
 0.2: 0.74945115289517561,
 0.21: 0.7759848484848485,
 0.22: 0.78353495206436374,
 0.23: 0.79192810457516338,
 0.24: 0.80326270221007046,
 0.25: 0.78239002932551327,
 0.26: 0.80168746286393355,
 0.27: 0.81658219495031636,
 0.28: 0.78526544484798744,
 0.29: 0.81831102167914305,
 0.3: 0.79577457518633987}

##Class Weighting

Give the minority class twice the weight of the majority class. Repeat the training process with this weighted logistic regression and report the F1 scores.
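To make the effect of the weights concrete, here is a small sketch of a class-weighted log loss. It assumes each sample's loss is multiplied by the weight of its true class, and it ignores regularization. The function `weighted_log_loss` is hypothetical and is not scikit-learn's internal implementation.

```python
import math

def weighted_log_loss(y_true, p_pred, class_weight):
    """Sum of per-sample log losses, each scaled by the weight of its true class."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        # p is the predicted probability of the positive class.
        loss = -math.log(p) if y == 1 else -math.log(1 - p)
        total += class_weight[y] * loss
    return total

# One positive and two negatives, all predicted with probability 0.5.
y_true = [1, 0, 0]
p_pred = [0.5, 0.5, 0.5]
plain = weighted_log_loss(y_true, p_pred, {0: 1, 1: 1})
weighted = weighted_log_loss(y_true, p_pred, {0: 1, 1: 2})
print(plain, weighted)
```

With weights `{0: 1, 1: 2}`, a misclassified positive costs twice as much as a misclassified negative, which pushes the fitted model toward predicting the minority class more often.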

In [58]:
f1_list = []

In [59]:
for j in range(len(train)):
    pred = LogisticRegression(class_weight = {0:1, 1:2}).fit(train[j],tr_lb[j]).predict(test[j])
    f1 = metrics.f1_score(ts_lb[j], pred)
    f1_list.append(f1)

In [60]:
np.mean(f1_list)

0.78494397759103651

##Vanilla Logistic Regression

Report the baseline performance of a simple logistic regression model.

In [61]:
f1_list = []

In [62]:
for j in range(len(train)):
    pred = LogisticRegression().fit(train[j],tr_lb[j]).predict(test[j])
    f1 = metrics.f1_score(ts_lb[j], pred)
    f1_list.append(f1)

In [63]:
np.mean(f1_list)

0.76239618094178729