In [122]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

In this notebook, we choose the techniques to apply in the final pipeline. Since this is a first-approach model, we pick each technique by cross-validating the candidates with a very simple, default LogisticRegression.
- If one technique proves to be significantly better than the others, we will use it in the model pipeline.
- If not, the simplest one will be chosen.

We kept track of the data indexes, since some of the data will be stored as sparse matrices, which do not carry the DataFrame index.
Even though the split was done with a seed, we also saved the indexes because we cannot ensure consistency between library versions or operating systems.
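A minimal sketch of persisting split indexes, assuming a one-column CSV layout like the one read in the next cell. The toy DataFrame, the 80/20 split and the temporary file are illustrative assumptions, not the original split code.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Toy frame standing in for the processed dataset (made-up data).
df_demo = pd.DataFrame({'Text': ['a', 'b', 'c', 'd', 'e'],
                        'Labels': [0, 1, 0, 1, 0]},
                       index=[10, 11, 12, 13, 14])

# Seeded shuffle of the index labels, then an 80/20 split.
rng = np.random.RandomState(0)
shuffled = rng.permutation(df_demo.index.to_numpy())
train_idx, test_idx = shuffled[:4], shuffled[4:]

with tempfile.TemporaryDirectory() as tmp:
    train_path = os.path.join(tmp, 'train_index.csv')
    # Store the labels themselves, so the split no longer depends on the seed.
    pd.Series(train_idx, name='index').to_csv(train_path, index=False)
    reloaded = pd.read_csv(train_path).values[:, 0]

# Selecting by label recovers exactly the same rows in any environment.
X_train_demo = df_demo.Text.loc[reloaded]
```

Because the rows are selected with `.loc`, the saved labels stay valid even after rows are dropped from the DataFrame, as long as the surviving rows keep their original index.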

We have chosen the f1_macro metric because it averages the F1 score of each class with equal weight: we care equally about false positives and false negatives, and about getting both categories right.
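To make the metric concrete, here is a small pure-Python sketch of what a macro-averaged F1 computes. The helpers `f1_for_class` and `f1_macro` and the toy labels are illustrative assumptions; in the notebook itself the score comes from scikit-learn.

```python
def f1_for_class(y_true, y_pred, positive):
    """F1 score treating `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def f1_macro(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Imbalanced toy labels: 8 negatives, 2 positives (made-up data).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

f1_neg = f1_for_class(y_true, y_pred, 0)   # majority class
f1_pos = f1_for_class(y_true, y_pred, 1)   # minority class
macro = f1_macro(y_true, y_pred)
print(f1_neg, f1_pos, macro)
```

The minority class counts as much as the majority class in the average, so a model that neglects the rare label is penalised even when its overall accuracy is high.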

In [21]:
df = pd.read_csv('processed_text_with_all.csv')
df = df.dropna()

train_index = pd.read_csv('train_index.csv').values[:, 0]
test_index = pd.read_csv('test_index.csv').values[:, 0]

In [22]:
X_train = df.Text.loc[train_index]
X_test = df.Text.loc[test_index]
y_train = df.Labels.loc[train_index]
y_test = df.Labels.loc[test_index]

# Text to vector

These are the algorithms that convert the processed text into vectors. We are going to try CountVectorizer, TF-IDF and Word2Vec. The Word2Vec training is in the Word2VecTraining notebook; here we only load the data.

- We have chosen 300 as the number of features to avoid memory problems and overfitting, keeping a reasonable ratio between the number of rows and columns. We will then compare all the models using 300 features each.

- We are going to train CountVectorizer and TF-IDF models with n-gram ranges from 1-1 to 1-3. The 2-grams are meant to capture words preceded by "not" and some compound words, and the 3-grams to avoid leaving out any useful extra information.
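The effect of the n-gram range on negations can be seen on a tiny example. This is only a sketch with a made-up two-sentence corpus and default CountVectorizer settings, not the notebook's data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up sentences that differ only by a negation.
corpus = ['the film was good', 'the film was not good']

unigrams = CountVectorizer(ngram_range=(1, 1)).fit(corpus)
uni_bigrams = CountVectorizer(ngram_range=(1, 2)).fit(corpus)

# Unigrams alone cannot represent the negated phrase as a single feature.
unigram_terms = sorted(unigrams.vocabulary_)
# With ngram_range=(1, 2), 'not good' becomes its own column.
bigram_terms = sorted(t for t in uni_bigrams.vocabulary_ if ' ' in t)

print(unigram_terms)
print(bigram_terms)
```

With `max_features` set, only the most frequent terms across the corpus are kept, so longer n-grams enter the vocabulary only if they are frequent enough to displace shorter ones.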

In [23]:
count_vect3 = CountVectorizer(ngram_range=(1, 3), max_features=300)
X_train_cv3 = count_vect3.fit_transform(X_train) 
# X_val_cv = count_vect.transform(X_val)
X_test_cv3 = count_vect3.transform(X_test)

In [70]:
count_vect2 = CountVectorizer(ngram_range=(1, 2), max_features=300)
X_train_cv2 = count_vect2.fit_transform(X_train) 
# X_val_cv = count_vect.transform(X_val)
X_test_cv2 = count_vect2.transform(X_test)

In [86]:
count_vect1 = CountVectorizer(ngram_range=(1, 1), max_features=300)
X_train_cv1 = count_vect1.fit_transform(X_train) 
# X_val_cv = count_vect.transform(X_val)
X_test_cv1 = count_vect1.transform(X_test)

In [89]:
count_vect2.vocabulary_ == count_vect3.vocabulary_

True

In [110]:
count_vect2.vocabulary_ == count_vect1.vocabulary_

False

In [24]:
tfidf3 = TfidfVectorizer(ngram_range=(1, 3), max_features=300)
X_train_tfidf3 = tfidf3.fit_transform(X_train)
# X_val_tfidf = tfidf.transform(X_val)
X_test_tfidf3 = tfidf3.transform(X_test)

In [71]:
tfidf2 = TfidfVectorizer(ngram_range=(1, 2), max_features=300)
X_train_tfidf2 = tfidf2.fit_transform(X_train)
# X_val_tfidf = tfidf.transform(X_val)
X_test_tfidf2 = tfidf2.transform(X_test)

In [90]:
tfidf1 = TfidfVectorizer(ngram_range=(1, 1), max_features=300)
X_train_tfidf1 = tfidf1.fit_transform(X_train)
# X_val_tfidf = tfidf.transform(X_val)
X_test_tfidf1 = tfidf1.transform(X_test)

In [91]:
tfidf3.vocabulary_ == tfidf2.vocabulary_

True

In [92]:
tfidf2.vocabulary_ == tfidf1.vocabulary_

False

In [50]:
X_train_w2v = pd.read_csv('w2v_window10_train_data.csv', index_col=0)
X_test_w2v = pd.read_csv('w2v_window10_test_data.csv', index_col=0)

In [93]:
train_data_from_models = {'count_vec_ngram1': {'x': X_train_cv1,
                                               'y': y_train},
                          'tfidf_ngram1': {'x': X_train_tfidf1,
                                           'y': y_train},
                          'count_vec_ngram2': {'x': X_train_cv2,
                                               'y': y_train},
                          'tfidf_ngram2': {'x': X_train_tfidf2,
                                           'y': y_train},
                          'word2vec': {'x': X_train_w2v,
                                       'y': y_train}}

In [94]:
from sklearn.model_selection import cross_val_score


def generate_cross_scoring_table_from_transf_data(train_data_dict, 
                                                  pred_model, scoring='f1_macro', cv=10):

    all_scores = [cross_val_score(pred_model, train_data['x'], 
                                  train_data['y'], cv=cv, scoring=scoring)
                  for train_data in train_data_dict.values()]

    return pd.DataFrame(all_scores, index=train_data_dict.keys())

In [95]:
lr = LogisticRegression()


result = generate_cross_scoring_table_from_transf_data(train_data_from_models, 
                                                       lr, scoring='f1_macro', cv=10)


result

| model | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| count_vec_ngram1 | 0.73072 | 0.744704 | 0.740865 | 0.744642 | 0.733056 | 0.746572 | 0.741397 | 0.748263 | 0.743674 | 0.738538 |
| tfidf_ngram1 | 0.747909 | 0.759543 | 0.754347 | 0.762461 | 0.754868 | 0.759464 | 0.753718 | 0.760397 | 0.75801 | 0.751561 |
| count_vec_ngram2 | 0.734709 | 0.746985 | 0.743778 | 0.748821 | 0.737387 | 0.747985 | 0.744234 | 0.749215 | 0.74373 | 0.741885 |
| tfidf_ngram2 | 0.751512 | 0.761197 | 0.756324 | 0.765519 | 0.757312 | 0.760695 | 0.753588 | 0.762479 | 0.759185 | 0.753557 |
| word2vec | 0.81996 | 0.833779 | 0.826369 | 0.834525 | 0.830219 | 0.83077 | 0.827782 | 0.834607 | 0.831262 | 0.83044 |


Word2Vec is clearly better than the rest: its scores are higher in every fold, so we will keep the Word2Vec data and model for the following steps.
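As a quick check of that reading, the fold scores of `word2vec` can be compared with those of `tfidf_ngram2`, the sparse representation with the highest mean. The numbers below are copied from the table above; the comparison itself is only an illustrative sketch.

```python
from statistics import mean

# Fold scores copied from the cross-validation table.
tfidf_ngram2 = [0.751512, 0.761197, 0.756324, 0.765519, 0.757312,
                0.760695, 0.753588, 0.762479, 0.759185, 0.753557]
word2vec = [0.81996, 0.833779, 0.826369, 0.834525, 0.830219,
            0.83077, 0.827782, 0.834607, 0.831262, 0.83044]

print(f'tfidf_ngram2 mean: {mean(tfidf_ngram2):.3f}')
print(f'word2vec mean:     {mean(word2vec):.3f}')

# The worst word2vec fold still beats the best tfidf_ngram2 fold.
no_overlap = min(word2vec) > max(tfidf_ngram2)
print(no_overlap)
```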

# Handle imbalanced labels

We will try simple random undersampling, random oversampling and SMOTE, and check whether the model results improve when those techniques are applied.
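For intuition, random oversampling duplicates minority rows until the classes are balanced, random undersampling drops majority rows instead, and SMOTE synthesises new minority rows by interpolating between neighbours. The sketch below shows only the oversampling idea in plain Python; `random_oversample` is a hypothetical helper, not imblearn's implementation.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate random minority rows until every class matches the majority."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        rows = [x for x, lab in zip(X, y) if lab == label]
        # Add as many random copies as this class is missing.
        for _ in range(target - count):
            X_out.append(rng.choice(rows))
            y_out.append(label)
    return X_out, y_out


# Made-up imbalanced data: six rows of class 0, two rows of class 1.
X_demo = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6], [1.1], [1.2]]
y_demo = [0, 0, 0, 0, 0, 0, 1, 1]

X_res, y_res = random_oversample(X_demo, y_demo)
print(Counter(y_res))
```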

Here, we decided not to use the same cross-validation technique, so that the model can be evaluated on data that has not been resampled, which avoids biasing the evaluation.

In [62]:
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler
from imblearn.over_sampling import SMOTE

In [112]:
X_us_res, y_us_res = RandomUnderSampler().fit_resample(X_train_w2v, y_train)

In [114]:
X_os_res, y_os_res = RandomOverSampler().fit_resample(X_train_w2v, y_train)

In [115]:
X_sm_res, y_sm_res = SMOTE().fit_resample(X_train_w2v, y_train)

In [125]:
imb_models = {'base': None,
              'undersampling': RandomUnderSampler(),
              'oversampling': RandomOverSampler(),
              'smote': SMOTE()}

In [157]:
from sklearn.model_selection import ShuffleSplit


def generate_imb_results_table(pred_model_class, imb_models, X_train_orig, y_train_orig):
    all_results = []
    for model in imb_models.values():
        # Same seed for every sampler, so all of them are scored on identical splits.
        ss = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
        ind_results = []
        # Split the data passed in, not the global X_train_w2v.
        for train_index, test_index in ss.split(X_train_orig):
            x_train = X_train_orig.iloc[train_index]
            y_train = y_train_orig.iloc[train_index]
            x_test = X_train_orig.iloc[test_index]
            y_test = y_train_orig.iloc[test_index]

            # Resample only the training part; the test part keeps its original distribution.
            if model is not None:
                x_train_imb, y_train_imb = model.fit_resample(x_train, y_train)
            else:
                x_train_imb, y_train_imb = x_train, y_train
            pred_model = pred_model_class()
            pred_model.fit(x_train_imb, y_train_imb)
            y_pred = pred_model.predict(x_test)
            ind_results.append(f1_score(y_test, y_pred, average='macro'))

        all_results.append(ind_results)
    return pd.DataFrame(all_results, index=imb_models.keys())

In [159]:
imb_result = generate_imb_results_table(LogisticRegression, imb_models, X_train_w2v, y_train)

In [160]:
imb_result

| sampler | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| base | 0.830428 | 0.831779 | 0.83053 | 0.832414 | 0.826815 | 0.82795 | 0.828817 | 0.826058 | 0.829886 | 0.828536 |
| undersampling | 0.804243 | 0.807803 | 0.801824 | 0.80243 | 0.803316 | 0.801522 | 0.80376 | 0.802825 | 0.803799 | 0.803864 |
| oversampling | 0.804965 | 0.808338 | 0.803273 | 0.802389 | 0.803654 | 0.803031 | 0.805854 | 0.803048 | 0.804639 | 0.804271 |
| smote | 0.810347 | 0.814086 | 0.809512 | 0.807804 | 0.808141 | 0.808434 | 0.80985 | 0.808486 | 0.809418 | 0.80897 |


In [117]:
from scipy.stats import ttest_ind, wilcoxon


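def test_ditr_significantly_greater_than_another(data1, data2, test_type, alpha=0.05):
    """One-sided test of whether data1 is greater than data2.

    'ttest' runs an independent two-sample t-test; 'wilcoxon' runs a paired
    signed-rank test, so data1 and data2 must then have the same length.
    """
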
    print(f'----------------------- {test_type.capitalize()} -----------------------')

    if test_type == 'ttest':
        test_func = ttest_ind
    elif test_type == 'wilcoxon':
        test_func = wilcoxon
    else:
        raise ValueError(f'test_type must be ttest or wilcoxon not {test_type}')

    statistic, p_value = test_func(data1, data2, alternative='greater')
    
    if p_value < alpha:
        print("The mean of the first distribution is significantly" +
            " greater than the mean of the second one.")
    else:
        print("The mean of the first distribution is not significantly" +
            " greater than the mean of the second one.")
        
    return statistic, p_value

In [161]:
statistic, p_value = ttest_ind(imb_result.loc['base'],
                               imb_result.loc['oversampling'], 
                               alternative='greater')

test_ditr_significantly_greater_than_another(imb_result.loc['base'], 
                                             imb_result.loc['oversampling'], 
                                             'ttest')
    
test_ditr_significantly_greater_than_another(imb_result.loc['base'], 
                                             imb_result.loc['oversampling'], 
                                             'wilcoxon')

----------------------- Ttest -----------------------
The mean of the first distribution is significantly greater than the mean of the second one.
----------------------- Wilcoxon -----------------------
The mean of the first distribution is significantly greater than the mean of the second one.


(55.0, 0.0009765625)
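The returned pair is the result of the last call, the Wilcoxon test. These two values are what an exact one-sided signed-rank test gives when all ten paired differences are positive, which is the case here because every `base` score exceeds its counterpart. A short arithmetic check, assuming scipy used its exact method for this sample size:

```python
n = 10  # number of paired splits

# With every difference positive, the statistic is the sum of all ranks 1..n.
statistic = sum(range(1, n + 1))

# Under the null hypothesis each difference is positive or negative with
# probability 1/2, so only 1 of the 2**n sign patterns is this extreme.
p_value = 1 / 2 ** n

print(statistic, p_value)
```

This also means the Wilcoxon result cannot get any stronger with ten splits: any comparison in which one technique wins on every split returns this same pair.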

In [162]:
test_ditr_significantly_greater_than_another(imb_result.loc['base'], 
                                             imb_result.loc['undersampling'], 
                                             'ttest')
    
test_ditr_significantly_greater_than_another(imb_result.loc['base'], 
                                             imb_result.loc['undersampling'], 
                                             'wilcoxon')

----------------------- Ttest -----------------------
The mean of the first distribution is significantly greater than the mean of the second one.
----------------------- Wilcoxon -----------------------
The mean of the first distribution is significantly greater than the mean of the second one.


(55.0, 0.0009765625)

In [163]:
test_ditr_significantly_greater_than_another(imb_result.loc['base'], 
                                             imb_result.loc['smote'], 
                                             'ttest')
    
test_ditr_significantly_greater_than_another(imb_result.loc['base'], 
                                             imb_result.loc['smote'], 
                                             'wilcoxon')

----------------------- Ttest -----------------------
The mean of the first distribution is significantly greater than the mean of the second one.
----------------------- Wilcoxon -----------------------
The mean of the first distribution is significantly greater than the mean of the second one.


(55.0, 0.0009765625)

It seems that, for this dataset, the imbalance-handling techniques worsen the results. Therefore, we are skipping them.
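Because every sampler was evaluated on the same ShuffleSplit partitions (the same `random_state` is used for each), the scores are paired by split, so a paired t-test is a natural complement to the independent one used above. The sketch below uses `scipy.stats.ttest_rel`, which is not part of the original analysis, with scores copied from the `base` and `smote` rows of the table.

```python
from scipy.stats import ttest_rel

# Scores copied from the 'base' and 'smote' rows of the results table.
base = [0.830428, 0.831779, 0.83053, 0.832414, 0.826815,
        0.82795, 0.828817, 0.826058, 0.829886, 0.828536]
smote = [0.810347, 0.814086, 0.809512, 0.807804, 0.808141,
         0.808434, 0.80985, 0.808486, 0.809418, 0.80897]

# Paired, one-sided test: is base greater than smote on the same splits?
statistic, p_value = ttest_rel(base, smote, alternative='greater')
print(p_value < 0.05)
```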