## Conventional Text Preprocessing with Bag of Words, Machine Learning + Bayesian Hyperparameters Tuning with TPE and Ensembles

In this kernel, I will clean the text and use conventional methods to prepare it for machine learning algorithms:
- CountVectorizer
- TFIDF

Once the input is in a format suitable for machine learning algorithms, I will train a few classical ML algorithms with default parameters:
- LogisticRegression
- Naive Bayes
- Random Forest
- XGBoost
- LightGBM

Next, I will try to optimize the hyperparameters of each model using Bayesian optimization with the Tree-structured Parzen Estimator (TPE) algorithm.

Finally, the optimized models will be ensembled in two ways: by averaging their predictions and by majority-rule voting.
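As a rough illustration of the two ensembling strategies, here is a minimal sketch. The probability matrix and the 0.5 threshold are made-up assumptions for the example, not values from this kernel.

```python
import numpy as np

# Hypothetical positive-class probabilities from three models
# (rows = models, columns = samples); the numbers are invented for illustration.
probs = np.array([
    [0.90, 0.40, 0.20, 0.95],
    [0.80, 0.55, 0.10, 0.45],
    [0.70, 0.35, 0.45, 0.40],
])
threshold = 0.5  # assumed decision threshold

# Averaging ensemble: average the probabilities first, then threshold once.
avg_pred = (probs.mean(axis=0) > threshold).astype(int)

# Majority-rule ensemble: threshold each model, then take the majority vote.
votes = (probs > threshold).astype(int)
majority_pred = (votes.sum(axis=0) > probs.shape[0] / 2).astype(int)

print(avg_pred)
print(majority_pred)
```

The two strategies can disagree: on the last sample one very confident model pulls the average above the threshold, while only one of the three models votes positive.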

References:
* https://www.kaggle.com/abhishek/approaching-almost-any-nlp-problem-on-kaggle
* https://mlwhiz.com/blog/2019/02/08/deeplearning_nlp_conventional_methods/

In [1]:
# Imports
import os 
import random
import copy
import time
import pandas as pd
import numpy as np
import gc
import re
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

from hyperopt import hp, tpe
from hyperopt.fmin import fmin
from hyperopt.pyll.stochastic import sample

from tqdm import tqdm_notebook, tnrange
from tqdm.auto import tqdm
tqdm.pandas()

import nltk
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.stem import SnowballStemmer

from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn import preprocessing, decomposition, model_selection, metrics
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB

# cross validation and metrics
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import train_test_split

import lightgbm as lgb
import xgboost as xgb

### Basic Parameters

In [2]:
SEED = 2576

## Data Preparation and Cleaning

Basic preprocessing techniques for text data:
* Cleaning special characters and removing punctuation
* Cleaning numbers
* Correcting misspellings
* Expanding contractions

Since we are going to create features for words in the feature creation step, it makes sense to reduce words to a common denominator so that ‘organize’, ‘organizes’ and ‘organizing’ could be referred to by a single word ‘organize’.

The two most common ways to do this are:
* Stemming
    * Stemming converts words to their base forms using crude heuristic rules. For example, one rule could remove ‘s’ from the end of any word, so that ‘cats’ becomes ‘cat’; another could replace ‘ies’ with ‘i’, so that ‘ponies’ becomes ‘poni’.
* Lemmatization
    * Lemmatization is very similar to stemming, but it aims to remove endings only if the resulting base form is present in a dictionary.
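To make the difference concrete, here is a toy sketch of both ideas. `toy_stem`, `toy_lemma` and the tiny `LEXICON` are hypothetical helpers written for this illustration; the real `SnowballStemmer` and `WordNetLemmatizer` used below are far more elaborate.

```python
def toy_stem(word):
    """Apply the two crude suffix rules from the text (illustrative only)."""
    if word.endswith("ies"):
        return word[:-3] + "i"       # 'ponies' -> 'poni'
    if word.endswith("s"):
        return word[:-1]             # 'cats' -> 'cat', but also 'bus' -> 'bu'
    return word

# A tiny made-up dictionary standing in for a real lexicon such as WordNet.
LEXICON = {"cat", "pony", "organize"}

def toy_lemma(word):
    """Strip an ending only when the result is a known dictionary word."""
    candidates = []
    if word.endswith("ies"):
        candidates.append(word[:-3] + "y")   # 'ponies' -> 'pony'
    if word.endswith("s"):
        candidates.append(word[:-1])         # 'cats' -> 'cat'
    for candidate in candidates:
        if candidate in LEXICON:
            return candidate
    return word                              # no dictionary form found: keep the word

for w in ["cats", "ponies", "bus"]:
    print(w, "->", toy_stem(w), "|", toy_lemma(w))
```

The stemmer blindly produces `bu` for `bus`, while the dictionary check leaves the word untouched.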


In [3]:
# This preprocessing is common to most text classification methods.

# remove punctuations:
puncts = [',', '.', '"', ':', ')', '(', '-', '!', '?', '|', ';', "'", '$', '&', '/', '[', ']', '>', '%', '=', '#', '*', '+', '\\', '•',  '~', '@', '£', 
 '·', '_', '{', '}', '©', '^', '®', '`',  '<', '→', '°', '€', '™', '›',  '♥', '←', '×', '§', '″', '′', 'Â', '█', '½', 'à', '…', 
 '“', '★', '”', '–', '●', 'â', '►', '−', '¢', '²', '¬', '░', '¶', '↑', '±', '¿', '▾', '═', '¦', '║', '―', '¥', '▓', '—', '‹', '─', 
 '▒', '：', '¼', '⊕', '▼', '▪', '†', '■', '’', '▀', '¨', '▄', '♫', '☆', 'é', '¯', '♦', '¤', '▲', 'è', '¸', '¾', 'Ã', '⋅', '‘', '∞', 
 '∙', '）', '↓', '、', '│', '（', '»', '，', '♪', '╩', '╚', '³', '・', '╦', '╣', '╔', '╗', '▬', '❤', 'ï', 'Ø', '¹', '≤', '‡', '√', ]

def clean_text(x):
    x = str(x)
    for punct in puncts:
        if punct in x:
            x = x.replace(punct, ' ')
    return x

# We won't clean numbers in conventional methods case since we might get extra info 
# from bigrams like 5 mins or 30 mins
def clean_numbers(x):
    if bool(re.search(r'\d', x)):
        x = re.sub('[0-9]{5,}', '#####', x)
        x = re.sub('[0-9]{4}', '####', x)
        x = re.sub('[0-9]{3}', '###', x)
        x = re.sub('[0-9]{2}', '##', x)
    return x

# correct common misspellings:
mispell_dict = {'colour': 'color', 'centre': 'center', 'favourite': 'favorite', 'travelling': 'traveling', 'counselling': 'counseling', 'theatre': 'theater', 'cancelled': 'canceled', 'labour': 'labor', 'organisation': 'organization', 'wwii': 'world war 2', 'citicise': 'criticize', 'youtu ': 'youtube ', 'Qoura': 'Quora', 'sallary': 'salary', 'Whta': 'What', 'narcisist': 'narcissist', 'howdo': 'how do', 'whatare': 'what are', 'howcan': 'how can', 'howmuch': 'how much', 'howmany': 'how many', 'whydo': 'why do', 'doI': 'do I', 'theBest': 'the best', 'howdoes': 'how does', 'mastrubation': 'masturbation', 'mastrubate': 'masturbate', "mastrubating": 'masturbating', 'pennis': 'penis', 'Etherium': 'Ethereum', 'narcissit': 'narcissist', 'bigdata': 'big data', '2k17': '2017', '2k18': '2018', 'qouta': 'quota', 'exboyfriend': 'ex boyfriend', 'airhostess': 'air hostess', "whst": 'what', 'watsapp': 'whatsapp', 'demonitisation': 'demonetization', 'demonitization': 'demonetization', 'demonetisation': 'demonetization'}

def _get_mispell(mispell_dict):
    mispell_re = re.compile('(%s)' % '|'.join(mispell_dict.keys()))
    return mispell_dict, mispell_re

mispellings, mispellings_re = _get_mispell(mispell_dict)
def replace_typical_misspell(text):
    def replace(match):
        return mispellings[match.group(0)]
    return mispellings_re.sub(replace, text)

# remove stopwords:
stopword_list = nltk.corpus.stopwords.words('english')
def remove_stopwords(text, is_lower_case=True):
    tokenizer = ToktokTokenizer()
    tokens = tokenizer.tokenize(text)
    tokens = [token.strip() for token in tokens]
    if is_lower_case:
        filtered_tokens = [token for token in tokens if token not in stopword_list]
    else:
        filtered_tokens = [token for token in tokens if token.lower() not in stopword_list]
    filtered_text = ' '.join(filtered_tokens)
    return filtered_text

# expand contractions:
contraction_dict = {"ain't": "is not", "aren't": "are not","can't": "cannot", "'cause": "because", "could've": "could have", "couldn't": "could not", "didn't": "did not",  "doesn't": "does not", "don't": "do not", "hadn't": "had not", "hasn't": "has not", "haven't": "have not", "he'd": "he would","he'll": "he will", "he's": "he is", "how'd": "how did", "how'd'y": "how do you", "how'll": "how will", "how's": "how is",  "I'd": "I would", "I'd've": "I would have", "I'll": "I will", "I'll've": "I will have","I'm": "I am", "I've": "I have", "i'd": "i would", "i'd've": "i would have", "i'll": "i will",  "i'll've": "i will have","i'm": "i am", "i've": "i have", "isn't": "is not", "it'd": "it would", "it'd've": "it would have", "it'll": "it will", "it'll've": "it will have","it's": "it is", "let's": "let us", "ma'am": "madam", "mayn't": "may not", "might've": "might have","mightn't": "might not","mightn't've": "might not have", "must've": "must have", "mustn't": "must not", "mustn't've": "must not have", "needn't": "need not", "needn't've": "need not have","o'clock": "of the clock", "oughtn't": "ought not", "oughtn't've": "ought not have", "shan't": "shall not", "sha'n't": "shall not", "shan't've": "shall not have", "she'd": "she would", "she'd've": "she would have", "she'll": "she will", "she'll've": "she will have", "she's": "she is", "should've": "should have", "shouldn't": "should not", "shouldn't've": "should not have", "so've": "so have","so's": "so as", "this's": "this is","that'd": "that would", "that'd've": "that would have", "that's": "that is", "there'd": "there would", "there'd've": "there would have", "there's": "there is", "here's": "here is","they'd": "they would", "they'd've": "they would have", "they'll": "they will", "they'll've": "they will have", "they're": "they are", "they've": "they have", "to've": "to have", "wasn't": "was not", "we'd": "we would", "we'd've": "we would have", "we'll": "we will", "we'll've": "we will have", "we're": "we are", 
"we've": "we have", "weren't": "were not", "what'll": "what will", "what'll've": "what will have", "what're": "what are",  "what's": "what is", "what've": "what have", "when's": "when is", "when've": "when have", "where'd": "where did", "where's": "where is", "where've": "where have", "who'll": "who will", "who'll've": "who will have", "who's": "who is", "who've": "who have", "why's": "why is", "why've": "why have", "will've": "will have", "won't": "will not", "won't've": "will not have", "would've": "would have", "wouldn't": "would not", "wouldn't've": "would not have", "y'all": "you all", "y'all'd": "you all would","y'all'd've": "you all would have","y'all're": "you all are","y'all've": "you all have","you'd": "you would", "you'd've": "you would have", "you'll": "you will", "you'll've": "you will have", "you're": "you are", "you've": "you have"}

def _get_contractions(contraction_dict):
    contraction_re = re.compile('(%s)' % '|'.join(contraction_dict.keys()))
    return contraction_dict, contraction_re

contractions, contractions_re = _get_contractions(contraction_dict)

def replace_contractions(text):
    def replace(match):
        return contractions[match.group(0)]
    return contractions_re.sub(replace, text)

# use Stemming to convert words to their base form
def stem_text(text):
    tokenizer = ToktokTokenizer()
    stemmer = SnowballStemmer('english')
    tokens = tokenizer.tokenize(text)
    tokens = [token.strip() for token in tokens]
    tokens = [stemmer.stem(token) for token in tokens]
    return ' '.join(tokens)


# use Lemmatization to keep dictionary form of words. Might be helpful if later we want to use word embeddings.
wordnet_lemmatizer = WordNetLemmatizer()
def lemma_text(text):
    tokenizer = ToktokTokenizer()
    tokens = tokenizer.tokenize(text)
    tokens = [token.strip() for token in tokens]
    tokens = [wordnet_lemmatizer.lemmatize(token) for token in tokens]
    return ' '.join(tokens)


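def clean_sentence(x):
    # NOTE: the order of the steps matters. Lowercasing first means mixed-case keys
    # in mispell_dict (e.g. 'Qoura', 'Whta') never match, and clean_text() strips
    # apostrophes, so replace_contractions() finds nothing left to expand.
    # The order is kept as-is so that the outputs shown below stay valid.
    x = x.lower()
    x = clean_text(x)
    x = replace_typical_misspell(x)
    x = remove_stopwords(x)
    x = replace_contractions(x)
    x = lemma_text(x)
    x = x.replace("'","")
    return x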

In [4]:
train_df = pd.read_csv("../input/train.csv")[:100000]
test_df = pd.read_csv("../input/test.csv")[:100000]
print("Train shape : ",train_df.shape)
print("Test shape : ",test_df.shape)

Train shape :  (100000, 3)
Test shape :  (100000, 2)


In [5]:
train_df.head()

| | qid | question_text | target |
|---|---|---|---|
| 0 | 00002165364db923c7e6 | How did Quebec nationalists see their province... | 0 |
| 1 | 000032939017120e6e44 | Do you have an adopted dog, how would you enco... | 0 |
| 2 | 0000412ca6e4628ce2cf | Why does velocity affect time? Does velocity a... | 0 |
| 3 | 000042bf85aa498cd78e | How did Otto von Guericke used the Magdeburg h... | 0 |
| 4 | 0000455dfa3e01eae3af | Can I convert montra helicon D to a mountain b... | 0 |


In [6]:
# clean the sentences
train_df['cleaned_text'] = train_df['question_text'].progress_apply(lambda x : clean_sentence(x))
test_df['cleaned_text'] = test_df['question_text'].progress_apply(lambda x : clean_sentence(x))

In [7]:
train_df.head()

| | qid | question_text | target | cleaned_text |
|---|---|---|---|---|
| 0 | 00002165364db923c7e6 | How did Quebec nationalists see their province... | 0 | quebec nationalist see province nation 1960s |
| 1 | 000032939017120e6e44 | Do you have an adopted dog, how would you enco... | 0 | adopted dog would encourage people adopt shop |
| 2 | 0000412ca6e4628ce2cf | Why does velocity affect time? Does velocity a... | 0 | velocity affect time velocity affect space geo... |
| 3 | 000042bf85aa498cd78e | How did Otto von Guericke used the Magdeburg h... | 0 | otto von guericke used magdeburg hemisphere |
| 4 | 0000455dfa3e01eae3af | Can I convert montra helicon D to a mountain b... | 0 | convert montra helicon mountain bike changing ... |


### Text representation for conventional ML
To use text in conventional machine learning models, we need to create features from words and represent those features as vectors.

Some representations that achieve this are:

__Bag of Words - Countvectorizer Features__

All words in the corpus are encoded with a dictionary (the vocabulary). Every sentence is then represented by the counts of the vocabulary words that appear in it.
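A hand-rolled sketch of this idea, using a made-up two-sentence corpus; `CountVectorizer` does the same job (plus tokenization and sparse storage) further below. The helper name `to_count_vector` is an assumption for this example.

```python
from collections import Counter

corpus = ["the cat sat", "the cat saw the dog"]  # made-up toy corpus

# Build the vocabulary: every distinct word gets one column, in sorted order.
vocabulary = sorted({word for doc in corpus for word in doc.split()})

def to_count_vector(doc):
    """Represent a sentence by how often each vocabulary word occurs in it."""
    counts = Counter(doc.split())
    return [counts.get(word, 0) for word in vocabulary]

print(vocabulary)
for doc in corpus:
    print(to_count_vector(doc), "<-", doc)
```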

__TFIDF Features__

Similar to CountVectorizer, but instead of raw counts each word gets a weight that reflects how significant it is. Two factors matter here:
* Term Frequency: how often does the word appear in the document?
* Inverse Document Frequency: how rare is the word across the whole corpus?

TFIDF is then just the product of these two scores. This technique highlights words that are frequent in a document but not very common in the corpus.
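The product can be worked through by hand on a tiny made-up corpus. This sketch assumes the plain textbook definition (relative term frequency times `log(N / df)`); scikit-learn's `TfidfVectorizer` uses a smoothed IDF and normalizes each row by default, so its numbers will differ.

```python
import math

# Made-up toy corpus, already tokenized.
docs = [
    "the cat sat".split(),
    "the dog sat".split(),
    "the cat chased the dog".split(),
]

def tf(term, doc):
    """Term frequency: share of the document's tokens that are `term`."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Inverse document frequency; assumes `term` occurs in at least one document."""
    df = sum(term in doc for doc in docs)
    return math.log(len(docs) / df)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

for term in ["the", "cat", "chased"]:
    print(term, round(tfidf(term, docs[2], docs), 4))
```

`the` appears in every document, so its weight drops to zero even though it is the most frequent token, while the rare `chased` gets the highest weight.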

__Hashing Features__

This technique uses hashing to reduce the vocabulary size. We won't explore it here.

__Word2vec Features__

This technique uses embeddings to represent each word in an n-dimensional space. It is very powerful when used with neural networks that have an Embedding layer. We won't explore it here.

#### Bag of words model using Count Vectorizer
We will use the simplest method "Bag of Words" to represent text as vectors before using it with machine learning algorithms.

![count vectorizer](../images/bag_of_words.png "Bag of Words")

Important parameters:
* `ngram_range`: we use (1,3). This means that unigrams, bigrams, and trigrams will be taken into account while creating features.
* `min_df`: minimum number of documents (here, questions) an n-gram must appear in to be used as a feature.
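The effect of these two parameters can be sketched in plain Python. The helper `ngrams` and the three example questions are assumptions made for this illustration, and `min_df=2` is chosen only to suit the tiny corpus.

```python
from collections import Counter

def ngrams(tokens, ngram_range=(1, 3)):
    """Return all word n-grams with n inside ngram_range (inclusive)."""
    n_min, n_max = ngram_range
    return [" ".join(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)]

docs = ["how long is 5 mins", "wait 5 mins", "how long"]  # made-up examples

# Document frequency: each n-gram is counted at most once per document.
doc_freq = Counter()
for doc in docs:
    doc_freq.update(set(ngrams(doc.split())))

min_df = 2
features = sorted(gram for gram, df in doc_freq.items() if df >= min_df)
print(features)
```

Only n-grams that occur in at least two documents survive, including the bigram `5 mins`, which is the kind of feature that cleaning numbers away would lose.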

In [8]:
cnt_vectorizer = CountVectorizer(dtype=np.float32,
            strip_accents='unicode', analyzer='word',token_pattern=r'\w{1,}',
            ngram_range=(1, 3),min_df=3)

# Fitting count vectorizer to both training and test sets (semi-supervised learning)
cnt_vectorizer.fit(list(train_df.cleaned_text.values) + list(test_df.cleaned_text.values))

xtrain =  cnt_vectorizer.transform(train_df.cleaned_text.values) 
y_train = train_df.target.values

In [9]:
xtrain.shape

(100000, 79652)

## Creating Machine Learning Models

#### Helper Functions

In [10]:
# helper function to find threshold and find best f score - Eval metric of competition
def bestThresshold(y_train,train_preds):
    tmp = [0,0,0] # idx, cur, max
    delta = 0
    for tmp[0] in np.arange(0.1, 0.501, 0.01):
        tmp[1] = f1_score(y_train, np.array(train_preds)>tmp[0])
        if tmp[1] > tmp[2]:
            delta = tmp[0]
            tmp[2] = tmp[1]
    # print('best threshold is {:.4f} with F1 score: {:.4f}'.format(delta, tmp[2]))
    return tmp[2]

In [11]:
# helper function to train a model with cross-validation and return out-of-fold predictions for the training set
def model_train_cv(x_train,y_train,nfold,model_obj):
    splits = list(StratifiedKFold(n_splits=nfold, shuffle=True, random_state=SEED).split(x_train, y_train))
    y_train = np.array(y_train)
    # matrix for the out-of-fold predictions
    train_oof_preds = np.zeros((x_train.shape[0]))
    for i, (train_idx, valid_idx) in enumerate(splits):

        x_train_fold = x_train[train_idx.astype(int)]
        y_train_fold = y_train[train_idx.astype(int)]
        x_val_fold = x_train[valid_idx.astype(int)]
        y_val_fold = y_train[valid_idx.astype(int)]

        clf = copy.deepcopy(model_obj)
        clf.fit(x_train_fold, y_train_fold)
        valid_preds_fold = clf.predict_proba(x_val_fold)[:,1]

        # storing OOF predictions
        train_oof_preds[valid_idx] = valid_preds_fold
    return train_oof_preds

## About Bayesian Optimization Method
A Bayesian method will be used to tune the model hyperparameters.

Bayesian methods differ from random or grid search in that they use past evaluation results to choose the next values to evaluate. The concept is: limit expensive evaluations of the objective function by choosing the next input values based on those that have done well in the past.

The main parts of a Bayesian optimization problem are:
1. __Objective Function__: what we want to minimize, in this case the validation error of a machine learning model with respect to the hyperparameters
1. __Domain Space__: hyperparameter values to search over
1. __Optimization algorithm__: method for constructing the surrogate model and choosing the next hyperparameter values to evaluate
1. __Result history__: stored outcomes from evaluations of the objective function consisting of the hyperparameters and validation loss

With those four pieces, we can optimize (find the minimum of) any function that returns a real value.

In this notebook, the `Hyperopt` library will be used with the Tree-structured Parzen Estimator (TPE) as the optimization algorithm.

> Note: for the slower models, the number of iterations for choosing hyperparameters will be limited to 10 or fewer due to the low capacity of my PC, so the results will be noisy and not optimal. If we let the optimization algorithm run longer, e.g. 500 iterations, we would expect to see convergence towards the optimal parameters.

Reference: https://towardsdatascience.com/automated-machine-learning-hyperparameter-tuning-in-python-dfda59b72f8a

### Fitting a simple Logistic Regression

In [12]:
train_oof_preds = model_train_cv(xtrain,y_train,5,LogisticRegression(C=1.0))
print ("F1 Score: %0.3f " % bestThresshold(y_train,train_oof_preds))



F1 Score: 0.549 


We are able to get an F1 local CV score of 0.549 with the default model, which just counts the number of times each n-gram appears in a sentence. Let's try tuning the parameters.

#### Parameter tuning for a simple Logistic Regression with Hyperopt

In this part we'll tune the logistic regression classifier using the hyperopt library. Hyperopt has a similar purpose to grid search, but instead of doing an exhaustive search of the parameter space, it evaluates a few well-chosen points and uses a model of the results to decide where the optimum is likely to be. In practice that means it often needs far fewer iterations to find a good solution.

The important parameters to tune are:
* `C` - inverse regularization strength.
    * Small values of C increase the regularization strength, which creates simpler models that may underfit the data.
    * Large values of C weaken the regularization, which allows the model to increase its complexity and therefore overfit the data.
* `solver` - algorithm to use in the optimization problem. Since we have a very large, sparse dataset, we'll use `saga`, a variant of Stochastic Average Gradient.
* `penalty` - norm used in the penalization. We'll use only `l2`, as it is supported by all solvers.
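The effect of `C` can be checked on a small synthetic problem. This is only a sketch: the data comes from `make_classification`, not from the competition, and the three values of `C` are arbitrary picks from the range searched below.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data, unrelated to the competition data.
X_demo, y_demo = make_classification(n_samples=500, n_features=20, random_state=0)

coef_norms = {}
for C in (0.001, 0.1, 10.0):
    clf = LogisticRegression(C=C, max_iter=1000).fit(X_demo, y_demo)
    # Stronger regularization (small C) shrinks the weight vector towards zero.
    coef_norms[C] = float(np.linalg.norm(clf.coef_))
    print("C=%g  ||w||=%.3f" % (C, coef_norms[C]))
```

The norm of the learned weights grows as `C` grows, which is what "the model is allowed to increase its complexity" means in practice.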

In [13]:
def objective(params):
    params = {'C': params['C']}        
    clf = LogisticRegression(
        solver='saga', 
        max_iter=1000,
        **params
    )
    
    # Perform n_folds cross validation
    n_folds = 5
    train_oof_preds = model_train_cv(xtrain,y_train,n_folds,clf)
    
    score = bestThresshold(y_train,train_oof_preds)
    loss = 1 - score  # objective function returns loss to minimize
    print("Loss {:.3f} params {}".format(loss, params))
    
    return loss

In [14]:
space = {
    'C': hp.loguniform('C', low=-3*np.log(10), high=np.log(10))  # C [0.001, 10]
}

best_logit = fmin(fn=objective,
            space=space,
            algo=tpe.suggest,
            max_evals=10,
            rstate = np.random.RandomState(50))

  0%|                                                                             | 0/10 [00:00<?, ?it/s, best loss: ?]








Loss 0.474 params {'C': 9.653515323349497}
 10%|█████▏                                              | 1/10 [03:39<32:52, 219.17s/it, best loss: 0.473708629755645]







Loss 0.463 params {'C': 2.7794458879762054}
Loss 0.450 params {'C': 0.974347331723907}
Loss 0.483 params {'C': 0.0427166098571098}
 40%|████████████████████                              | 4/10 [09:33<14:11, 141.83s/it, best loss: 0.45036282229427205]


Loss 0.694 params {'C': 0.0010721186361638802}
Loss 0.490 params {'C': 0.03240995351294748}
Loss 0.446 params {'C': 0.3376656942908817}
Loss 0.527 params {'C': 0.01246817925733405}
Loss 0.456 params {'C': 0.131346235470916}
Loss 0.450 params {'C': 0.9737908722066724}
100%|██████████████████████████████████████████████████| 10/10 [15:17<00:00, 75.63s/it, best loss: 0.44610820737392276]


In [15]:
print('Logit, best F1-score %0.3f ' % (1-0.44610820737392276))
print('Logit, best parameters chosen with Hyperopt are: '+str(best_logit))

Logit, best F1-score 0.554 
Logit, best parameters chosen with Hyperopt are: {'C': 0.3376656942908817}



So, with hyperparameter tuning we've found a better value, C=0.34, which improved the F1-score compared to the default model (C=1). We should explore this further by narrowing the distribution of C down to values between 0.2 and 1.0, as samples from that range yielded the best results.

### Fitting a simple Naive Bayes Model

In [16]:
train_oof_preds = model_train_cv(xtrain,y_train,5,MultinomialNB())
print ("F1 Score: %0.3f " % bestThresshold(y_train,train_oof_preds))

F1 Score: 0.498 


We are able to get an F1 local CV score of 0.498 with the default model. Let's try tuning the parameters.

#### Parameter tuning for a simple Naive Bayes with Hyperopt

In [17]:
def objective(params):
    params = {'alpha': params['alpha']}        
    clf = MultinomialNB(**params)
    
    # Perform n_folds cross validation
    n_folds = 5
    train_oof_preds = model_train_cv(xtrain,y_train,n_folds,clf)
    
    score = bestThresshold(y_train,train_oof_preds)
    loss = 1 - score  # objective function returns loss to minimize
    print("Loss {:.3f} params {}".format(loss, params))
    
    return loss

In [18]:
space = {
    'alpha': hp.uniform('alpha', 0.5, 1.5)
}

best_nb = fmin(fn=objective,
            space=space,
            algo=tpe.suggest,
            max_evals=100,
            rstate = np.random.RandomState(50))

Loss 0.528 params {'alpha': 1.496171372572428}
Loss 0.521 params {'alpha': 1.360989555833787}
Loss 0.515 params {'alpha': 1.2471784500414422}
Loss 0.499 params {'alpha': 0.907649194556314}
Loss 0.498 params {'alpha': 0.5075607113066694}
Loss 0.497 params {'alpha': 0.8776696020261421}
Loss 0.509 params {'alpha': 1.1321217350011867}
Loss 0.494 params {'alpha': 0.7739507593955738}
Loss 0.504 params {'alpha': 1.0296044074844866}
Loss 0.515 params {'alpha': 1.2471164248463555}
Loss 0.506 params {'alpha': 1.059792010384872}
Loss 0.518 params {'alpha': 1.3071413385798567}
Loss 0.495 params {'alpha': 0.6012325414277948}
Loss 0.509 params {'alpha': 1.1417322312889917}
Loss 0.524 params {'alpha': 1.4265514535647026}
Loss 0.495 params {'alpha': 0.8250262881583786}
Loss 0.498 params {'alpha': 0.5558786294367265}
Loss 0.494 params {'alpha': 0.6667158345220777}
Loss 0.505 params {'alpha': 1.036303350575253}
Loss 0.527 params {'alpha': 1.4758255838270777}
Loss 0.494 params {'alpha': 0.712478227619760

In [19]:
print('Naive Bayes, best F1-score %0.3f ' % (1-0.49354675895882905))
print('Naive Bayes, best parameters chosen with Hyperopt are: '+str(best_nb))

Naive Bayes, best F1-score 0.506 
Naive Bayes, best parameters chosen with Hyperopt are: {'alpha': 0.7124782276197604}



With hyperparameter tuning we've found a better value, alpha=0.71, which improved the F1-score compared to the default model (alpha=1).

> Note: Since this model is fast to train, we can run 100 iterations and see how the TPE algorithm starts to exploit the region of alpha where the lowest loss is achieved.

### Fitting a Random Forest Model
Random forests work by averaging predictions from many decision trees. The idea is that averaging irons out the mistakes of individual trees: each tree may be somewhat overfitted, but the averaged result should generalize well.
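The averaging can be verified directly in scikit-learn, where a fitted forest exposes its trees through `estimators_`. The sketch below uses small synthetic data (an assumption for illustration) and compares the hand-computed average of the per-tree probabilities with the forest's own output.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, unrelated to the competition data.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X_demo, y_demo)

# Average the per-tree class probabilities by hand...
tree_avg = np.mean([tree.predict_proba(X_demo) for tree in forest.estimators_], axis=0)

# ...and compare with what the forest itself reports.
forest_proba = forest.predict_proba(X_demo)
print(np.allclose(tree_avg, forest_proba))
```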

In [20]:
train_oof_preds = model_train_cv(xtrain,y_train,5,RandomForestClassifier(n_jobs=4, class_weight='balanced'))
print ("F1 Score: %0.3f " % bestThresshold(y_train,train_oof_preds))



F1 Score: 0.426 


We got a much lower F1 score on local CV with the default model compared to the simpler models above.
> Training a Random Forest takes a lot of time.

Let's try tuning the parameters, but with a low number of evaluations.

#### Parameter tuning for a Random Forest with Hyperopt

The important parameters to tune are:

* Number of trees in the forest (n_estimators)
* Tree complexity (max_depth)


In [21]:
def objective(params):
    params = {'n_estimators': int(params['n_estimators']), 
              'max_depth': int(params['max_depth'])} 
    clf = RandomForestClassifier(n_jobs=4, class_weight='balanced', **params)
    
    # Perform n_folds cross validation
    n_folds = 2
    train_oof_preds = model_train_cv(xtrain,y_train,n_folds,clf)
    
    score = bestThresshold(y_train,train_oof_preds)
    loss = 1 - score  # objective function returns loss to minimize
    print("Loss {:.3f} params {}".format(loss, params))
    
    return loss

In [22]:
space = {
    'n_estimators': hp.quniform('n_estimators', 25, 500, 25),
    'max_depth': hp.quniform('max_depth', 1, 10, 1),
}

best_rf = fmin(fn=objective,
            space=space,
            algo=tpe.suggest,
            max_evals=3,
            rstate = np.random.RandomState(50))

Loss 0.554 params {'n_estimators': 500, 'max_depth': 6}
Loss 0.554 params {'n_estimators': 425, 'max_depth': 5}
Loss 0.552 params {'n_estimators': 375, 'max_depth': 10}
100%|█████████████████████████████████████████████████████| 3/3 [00:53<00:00, 18.22s/it, best loss: 0.5521843275921068]


In [23]:
print('Random Forest, best parameters chosen with Hyperopt are: '+str(best_rf))

Random Forest, best parameters chosen with Hyperopt are: {'max_depth': 10.0, 'n_estimators': 375.0}


`Random Forest, best parameters chosen with Hyperopt are: {'n_estimators': 375, 'max_depth': 10}`



### Fitting an XGBoost Model
XGBoost is also based on an ensemble of decision trees, but it differs from random forest: the trees are not averaged but added. Each decision tree is trained to correct the residuals of the previous trees. The idea is that many small decision trees are trained, each adding a bit of information to improve the overall predictions.
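The "fit the residuals, then add" loop can be sketched with plain scikit-learn trees. This is a deliberately simplified version for squared error on synthetic regression data; it is not XGBoost's actual algorithm, which adds regularization and second-order gradient information. The data, tree depth, learning rate and number of rounds are all assumptions for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)  # synthetic target

learning_rate = 0.1
prediction = np.full_like(y_demo, y_demo.mean())   # start from a constant model
mse_history = [np.mean((y_demo - prediction) ** 2)]

for _ in range(50):
    residuals = y_demo - prediction                    # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2).fit(X_demo, residuals)
    prediction += learning_rate * tree.predict(X_demo) # trees are added, not averaged
    mse_history.append(np.mean((y_demo - prediction) ** 2))

print("MSE before: %.4f, after: %.4f" % (mse_history[0], mse_history[-1]))
```

Each small tree only nudges the prediction, which is why the number of trees and the learning rate have to be tuned together.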

In [24]:
clf = xgb.XGBClassifier(
        n_estimators=250,
        learning_rate=0.05,
        n_jobs=4,
    )
train_oof_preds = model_train_cv(xtrain,y_train,5,clf)
print ("F1 Score: %0.3f " % bestThresshold(y_train,train_oof_preds))

F1 Score: 0.482 


We again got a much lower F1 score on local CV with the default model compared to the simpler models above, but at least it is better and faster than Random Forest.

Let's try tuning the parameters.

#### Parameter tuning for XGBoost with Hyperopt

The most important parameters are:
* Number of trees (n_estimators)
* Learning rate - later trees have less influence (learning_rate)
* Tree complexity (max_depth)
* Gamma - makes individual trees more conservative, which reduces overfitting (gamma)
* Column sample per tree - reduces overfitting (colsample_bytree)

We will fix the number of trees to 250 and learning rate to 0.05 - then we can find good values for the other parameters. Later we can re-visit this.

In [25]:
def objective(params):
    params = {
        'max_depth': int(params['max_depth']),
        'gamma': "{:.3f}".format(params['gamma']),
        'colsample_bytree': '{:.3f}'.format(params['colsample_bytree']),
    }
    
    clf = xgb.XGBClassifier(
        n_estimators=250,
        learning_rate=0.05,
        n_jobs=4,
        **params
    )
    
    # Perform n_folds cross validation
    n_folds = 5
    train_oof_preds = model_train_cv(xtrain,y_train,n_folds,clf)
    
    score = bestThresshold(y_train,train_oof_preds)
    loss = 1 - score  # objective function returns loss to minimize
    print("Loss {:.3f} params {}".format(loss, params))
    
    return loss

In [26]:
space = {
    'max_depth': hp.quniform('max_depth', 2, 8, 1),
    'gamma': hp.uniform('gamma', 0.0, 0.5),
    'colsample_bytree': hp.uniform('colsample_bytree', 0.3, 1.0)
}

best_xgb = fmin(fn=objective,
            space=space,
            algo=tpe.suggest,
            max_evals=5,
            rstate = np.random.RandomState(50))

Loss 0.485 params {'max_depth': 8, 'gamma': '0.292', 'colsample_bytree': '0.719'}
Loss 0.489 params {'max_depth': 7, 'gamma': '0.226', 'colsample_bytree': '0.650'}
Loss 0.493 params {'max_depth': 6, 'gamma': '0.492', 'colsample_bytree': '0.921'}
Loss 0.509 params {'max_depth': 4, 'gamma': '0.031', 'colsample_bytree': '0.635'}
Loss 0.537 params {'max_depth': 2, 'gamma': '0.447', 'colsample_bytree': '0.447'}
100%|████████████████████████████████████████████████████| 5/5 [02:18<00:00, 25.81s/it, best loss: 0.48458817584638714]


In [27]:
print('XGBoost, best parameters chosen with Hyperopt are: '+str(best_xgb))

XGBoost, best parameters chosen with Hyperopt are: {'colsample_bytree': 0.7189583460013607, 'gamma': 0.2922725910062163, 'max_depth': 8.0}



### Fitting a LightGBM Model
LightGBM is very similar to XGBoost: it also uses a gradient boosted tree approach, so the explanation above mostly holds here as well.

In [28]:
clf = lgb.LGBMClassifier(
    n_estimators=500,
    learning_rate=0.01
)

train_oof_preds = model_train_cv(xtrain,y_train,5,clf)

In [29]:
print ("F1 Score: %0.3f " % bestThresshold(y_train,train_oof_preds))

F1 Score: 0.522 


#### Parameter tuning for a LightGBM with Hyperopt

The important parameters to tune are:
* Number of estimators
* Tree complexity - in lightgbm that is controlled by number of leaves (num_leaves)
* Learning rate
* Feature fraction

We will fix number of estimators to 500 and learning rate to 0.01 (chosen experimentally) and tune the remaining parameters with hyperopt. Then later we could revisit for better results! 

In [30]:
def objective(params):
    params = {
        'num_leaves': int(params['num_leaves']),
        'colsample_bytree': '{:.3f}'.format(params['colsample_bytree']),
    }
    
    clf = lgb.LGBMClassifier(
        n_estimators=500,
        learning_rate=0.01,
        **params
    )
    
    # Perform n_folds cross validation
    n_folds = 5
    train_oof_preds = model_train_cv(xtrain,y_train,n_folds,clf)
    
    score = bestThresshold(y_train,train_oof_preds)
    loss = 1 - score  # objective function returns loss to minimize
    print("Loss {:.3f} params {}".format(loss, params))
    
    return loss

In [31]:
space = {
    'num_leaves': hp.quniform('num_leaves', 8, 128, 2),
    'colsample_bytree': hp.uniform('colsample_bytree', 0.3, 1.0),
}

best_lgm = fmin(fn=objective,
            space=space,
            algo=tpe.suggest,
            max_evals=5)

Loss 0.464 params {'num_leaves': 112, 'colsample_bytree': '0.893'}
Loss 0.471 params {'num_leaves': 38, 'colsample_bytree': '0.541'}
Loss 0.465 params {'num_leaves': 118, 'colsample_bytree': '0.912'}
Loss 0.462 params {'num_leaves': 124, 'colsample_bytree': '0.617'}
Loss 0.459 params {'num_leaves': 106, 'colsample_bytree': '0.468'}
100%|████████████████████████████████████████████████████| 5/5 [11:24<00:00, 144.31s/it, best loss: 0.4586096416064377]


In [32]:
print('LightGBM, best parameters chosen with Hyperopt are: '+str(best_lgm))

LightGBM, best parameters chosen with Hyperopt are: {'colsample_bytree': 0.46753405887357136, 'num_leaves': 106.0}


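Result from another run of the search, which supplies the values used for the LightGBM model in the comparison below:

`LightGBM, best parameters chosen with Hyperopt are: {'colsample_bytree': 0.30915021459799613, 'num_leaves': 88.0}`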
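
`fmin` returns the raw sampled values, so `num_leaves` comes back as a float (`88.0`) and has to be cast before it is passed to the model constructor. A minimal sketch of that conversion follows. `to_lgbm_params` is a hypothetical helper (not part of hyperopt or LightGBM), and rounding `colsample_bytree` to three decimals simply mirrors the formatting used in the objective.

```python
def to_lgbm_params(best):
    """Convert the raw dict returned by fmin into constructor-ready values."""
    return {
        'num_leaves': int(best['num_leaves']),                   # quniform yields a float
        'colsample_bytree': round(best['colsample_bytree'], 3),  # keep 3 decimals
    }

# Values copied from the result quoted above
best_lgm = {'colsample_bytree': 0.30915021459799613, 'num_leaves': 88.0}
lgbm_params = to_lgbm_params(best_lgm)
print(lgbm_params)
```

The resulting dict could then be unpacked into the classifier, e.g. `lgb.LGBMClassifier(n_estimators=500, learning_rate=0.01, **lgbm_params)`.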

## Comparing the models

Now let's see how the models perform - if hyperopt has determined a sensible set of parameters for us...

In [33]:
logit_model = LogisticRegression(
    C=1.0,
    solver='saga',
    max_iter=4000
)

nb_model = MultinomialNB(
    alpha=0.7
)

rf_model = RandomForestClassifier(
    n_jobs=4,
    class_weight='balanced',
    n_estimators=325,
    max_depth=7
)

xgb_model = xgb.XGBClassifier(
    n_estimators=250,
    learning_rate=0.05,
    n_jobs=4,
    max_depth=8,
    colsample_bytree=0.7,
    gamma=0.3
)

lgbm_model = lgb.LGBMClassifier(
    n_estimators=500,
    learning_rate=0.01,
    num_leaves=88,
    colsample_bytree=0.3
)

models = [
    ('Logistic Regression', logit_model),
    ('Naive Bayes', nb_model),
    ('Random Forest', rf_model),
    ('XGBoost', xgb_model),
    ('LightGBM', lgbm_model)
]

In [34]:
for label, model in models:
    train_oof_preds = model_train_cv(xtrain,y_train,5,model)
    score = bestThresshold(y_train,train_oof_preds)
    print("F1-Score: %0.4f [%s]" % (score, label))

F1-Score: 0.5497 [Logistic Regression]
F1-Score: 0.5065 [Naive Bayes]
F1-Score: 0.4367 [Random Forest]
F1-Score: 0.5147 [XGBoost]
F1-Score: 0.5390 [LightGBM]


That's it for hyperparameter tuning. The tuned models achieved the following performance:
* F1-Score: 0.5497 [Logistic Regression]
* F1-Score: 0.5065 [Naive Bayes]
* F1-Score: 0.4367 [Random Forest]
* F1-Score: 0.5147 [XGBoost]
* F1-Score: 0.5390 [LightGBM]

It is important to note that Bayesian methods need a high number of evaluations to find the best parameters. The advantage is that the search starts exploiting promising parameter values, which we can use to narrow down the search range and converge faster than with Random Search or Grid Search.
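
One way to act on that observation is to shrink the search space around the best trials before the next round. The sketch below does this for the five LightGBM trials logged earlier. `narrow_range` is a hypothetical helper, and keeping the top three trials is an arbitrary choice made for illustration, not something hyperopt does on its own.

```python
# (loss, params) pairs copied from the LightGBM tuning log above
trials = [
    (0.464, {'num_leaves': 112, 'colsample_bytree': 0.893}),
    (0.471, {'num_leaves': 38, 'colsample_bytree': 0.541}),
    (0.465, {'num_leaves': 118, 'colsample_bytree': 0.912}),
    (0.462, {'num_leaves': 124, 'colsample_bytree': 0.617}),
    (0.459, {'num_leaves': 106, 'colsample_bytree': 0.468}),
]

def narrow_range(trials, name, top_k=3):
    """Return (min, max) of one parameter over the top_k lowest-loss trials."""
    best = sorted(trials, key=lambda t: t[0])[:top_k]
    values = [params[name] for _, params in best]
    return min(values), max(values)

for name in ('num_leaves', 'colsample_bytree'):
    print(name, narrow_range(trials, name))
```

With only five evaluations such a range is very rough; it becomes more meaningful once many more trials have been collected.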

References:
* https://www.kaggle.com/abhishek/approaching-almost-any-nlp-problem-on-kaggle

## Making predictions from ensemble
Here I will try to make predictions from three previously trained models.

In [35]:
# split data to train and validation sets
from sklearn.model_selection import train_test_split  # in case it is not imported above

X_t, X_v, y_t, y_v = train_test_split(
    xtrain,
    y_train,
    test_size=0.1,
    random_state=42,
    stratify=y_train
)
print(X_t.shape, y_t.shape, X_v.shape, y_v.shape)

(90000, 79652) (90000,) (10000, 79652) (10000,)


Let's try to get a better result by ensembling 3 models that work in very different ways:

In [36]:
# Logistic Regression
logit_model.fit(X_t, y_t)
# Naive Bayes
nb_model.fit(X_t, y_t)
# LightGBM
lgbm_model.fit(X_t, y_t)

LGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=0.3,
        importance_type='split', learning_rate=0.01, max_depth=-1,
        min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0,
        n_estimators=500, n_jobs=-1, num_leaves=88, objective=None,
        random_state=None, reg_alpha=0.0, reg_lambda=0.0, silent=True,
        subsample=1.0, subsample_for_bin=200000, subsample_freq=0)

In [37]:
# predict on hold-out set
pred_val_1 = logit_model.predict_proba(X_v)[:,1]
pred_val_2 = nb_model.predict_proba(X_v)[:,1]
pred_val_3 = lgbm_model.predict_proba(X_v)[:,1]

In [38]:
# put all predictions and the true labels into a dataframe
# (the 'prediction' column holds the ground truth y_v)
validation_df = pd.DataFrame({
    'val_1': pred_val_1,
    'val_2': pred_val_2,
    'val_3': pred_val_3,
    'prediction': np.asarray(y_v),
})
validation_df.to_csv('validation.csv', index=False)

In [39]:
validation_df.head(5)

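|   | val_1 | val_2 | val_3 | prediction |
|---|-------|-------|-------|------------|
| 0 | 0.001799 | 0.000002 | 0.016429 | 0 |
| 1 | 0.122895 | 0.997764 | 0.272211 | 1 |
| 2 | 0.009584 | 0.000371 | 0.016544 | 0 |
| 3 | 0.006439 | 0.001017 | 0.037953 | 0 |
| 4 | 0.002904 | 0.000003 | 0.023346 | 0 |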


Here we see the predictions on the hold-out set from all three models, with the ground truth in the last column (named `prediction`). The models clearly differ: for example, model 2 (Naive Bayes) is very certain about the positive example in the second row (index 1), while the other two models are not.

The idea of ensembling is to combine many "specialized" models so that, as a group of experts, they score better than any individual model.

#### Take the average prediction from ensemble
One of the simplest techniques for combining results is to take the average or a weighted average of all predictions.

In [40]:
validation_df['mean_pred'] = validation_df.iloc[:, 0:3].agg(np.mean, axis=1)
validation_df['weight_pred'] = validation_df.val_1 * 0.7 + validation_df.val_2 * 0.15 + validation_df.val_3 * 0.15
validation_df.head()

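|   | val_1 | val_2 | val_3 | prediction | mean_pred | weight_pred |
|---|-------|-------|-------|------------|-----------|-------------|
| 0 | 0.001799 | 0.000002 | 0.016429 | 0 | 0.006076 | 0.003724 |
| 1 | 0.122895 | 0.997764 | 0.272211 | 1 | 0.464290 | 0.276523 |
| 2 | 0.009584 | 0.000371 | 0.016544 | 0 | 0.008833 | 0.009246 |
| 3 | 0.006439 | 0.001017 | 0.037953 | 0 | 0.015136 | 0.010353 |
| 4 | 0.002904 | 0.000003 | 0.023346 | 0 | 0.008751 | 0.005535 |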


In [41]:
print("F1 Score: %0.3f [Average Prediction]" % bestThresshold(validation_df.prediction, validation_df.mean_pred))

F1 Score: 0.540 [Average Prediction]


In [42]:
print("F1 Score: %0.3f [Weighted Average]" % bestThresshold(validation_df.prediction, validation_df.weight_pred))

F1 Score: 0.552 [Weighted Average]


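This strategy did not really work: the simple average (0.540) scores below the best single classifier (Logistic Regression, 0.5497 in cross-validation), and the weighted average with hand-picked weights (0.552) is only marginally above it. The two numbers are also not strictly comparable, because the ensemble is scored on the 10% hold-out set while the single-model scores come from 5-fold out-of-fold predictions.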
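
Instead of hand-picking the weights (0.7 / 0.15 / 0.15 above), they could be searched over a coarse grid. The following is a minimal sketch on made-up toy data, not the notebook's predictions. `f1_at_best_threshold` is a simplified stand-in for the threshold search used in this notebook (its definition is not shown in this section), and `search_weights` is a hypothetical helper.

```python
from itertools import product

def f1(y_true, y_pred):
    """F1 score for binary labels in {0, 1}."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)

def f1_at_best_threshold(y_true, probs):
    """Best F1 over thresholds 0.01..0.99 (simplified stand-in)."""
    return max(
        f1(y_true, [int(p > t / 100) for p in probs])
        for t in range(1, 100)
    )

def search_weights(y_true, model_probs, steps=10):
    """Try all weight triples on a grid with spacing 1/steps that sum to 1."""
    best_score, best_w = -1.0, None
    for i, j in product(range(steps + 1), repeat=2):
        if i + j > steps:
            continue
        w = (i / steps, j / steps, (steps - i - j) / steps)
        blended = [
            w[0] * a + w[1] * b + w[2] * c
            for a, b, c in zip(*model_probs)
        ]
        score = f1_at_best_threshold(y_true, blended)
        if score > best_score:
            best_score, best_w = score, w
    return best_score, best_w

# Toy data, invented for illustration only
y_true = [0, 1, 0, 0, 1, 1, 0, 1]
model_probs = [
    [0.10, 0.80, 0.30, 0.20, 0.30, 0.70, 0.35, 0.90],
    [0.05, 0.60, 0.55, 0.10, 0.65, 0.30, 0.20, 0.95],
    [0.20, 0.45, 0.25, 0.50, 0.70, 0.60, 0.15, 0.85],
]
best_score, best_w = search_weights(y_true, model_probs)
print("best F1 %.3f with weights %s" % (best_score, best_w))
```

Tuning both the weights and the threshold on the same hold-out set will overstate the score, so in practice a separate validation fold would be advisable.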

#### Create simple Voting Classifier
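A voting classifier combines the predictions of several estimators. With `voting='soft'`, as used here, it averages the predicted class probabilities of all five tuned models; plain majority rule over predicted labels would be `voting='hard'`.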
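
To make the two voting modes of `VotingClassifier` concrete: soft voting averages the predicted probabilities, while hard voting counts the class labels predicted by each model. The toy probabilities below are invented for illustration, and the 0.5 cut-off is an assumption (this notebook tunes the threshold instead).

```python
def soft_vote(probs, threshold=0.5):
    """Average positive-class probabilities, then threshold the mean."""
    return int(sum(probs) / len(probs) > threshold)

def hard_vote(probs, threshold=0.5):
    """Threshold each model separately, then take the majority label."""
    votes = [int(p > threshold) for p in probs]
    return int(sum(votes) > len(votes) / 2)

# Two models lean slightly negative, one is very confident it is positive
probs = [0.45, 0.45, 0.95]
print("soft:", soft_vote(probs), "hard:", hard_vote(probs))
```

Here the single confident model decides the soft vote, while the hard vote follows the two lukewarm models.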

In [43]:
clf = VotingClassifier(estimators=models, voting='soft')

In [44]:
clf.fit(X_t, y_t)

VotingClassifier(estimators=[('Logistic Regression', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=4000, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='saga',
          tol=0.0001, verbose=0, warm_start=False)...0, reg_lambda=0.0, silent=True,
        subsample=1.0, subsample_for_bin=200000, subsample_freq=0))],
         flatten_transform=None, n_jobs=None, voting='soft', weights=None)

In [45]:
ens_preds = clf.predict_proba(X_v)[:,1]

In [46]:
print ("F1 Score: %0.3f " % bestThresshold(y_v, ens_preds))

F1 Score: 0.546 


Oh, disappointing... The result did not improve with the ensemble compared to using simple Logistic Regression!

### Conclusion

There must be smarter ways to create an ensemble, but this is my first encounter with ensembling and I didn't have much time to improve it.

Some additional steps that may be taken to improve could be:

1. Running the tuning for longer. In my case it took a few hours to tune the forest classifiers with only 5 evaluations! Bayesian Optimization requires a high number of rounds to converge to optimal parameters.
1. Implementing a good cross-validation strategy when training the models to find optimal parameter values.
1. Introducing a greater variety of base models. The more uncorrelated their results, the better the final score tends to be.
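
A quick way to check how correlated the base models are is to compute pairwise Pearson correlations between their predicted probabilities. The sketch below uses only the five hold-out rows displayed earlier, so the numbers are purely illustrative and dominated by the single positive row. On the full `validation_df` the same check could be done with `validation_df[['val_1', 'val_2', 'val_3']].corr()`. The `pearson` helper is written out here only to keep the example self-contained.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# First five hold-out predictions, copied from the table shown earlier
preds = {
    'val_1': [0.001799, 0.122895, 0.009584, 0.006439, 0.002904],
    'val_2': [0.000002, 0.997764, 0.000371, 0.001017, 0.000003],
    'val_3': [0.016429, 0.272211, 0.016544, 0.037953, 0.023346],
}
names = list(preds)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        print(a, b, round(pearson(preds[a], preds[b]), 3))
```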



### Next...
Since everyone in the competition was using neural networks, I decided to test them as well. In the next notebook we check the performance of an LSTM neural network, an architecture known to work well on this type of problem.

Additionally, with neural networks we can use _word embeddings_ to efficiently represent the words of each sentence in an n-dimensional space. Using embeddings, we keep all words without stemming them, which preserves their natural meaning, and the algorithm can infer similarities between words, giving an extra boost to performance.