# Text Classification

All models : https://www.analyticsvidhya.com/blog/2018/04/a-comprehensive-guide-to-understand-and-implement-text-classification-in-python/

CNN Text Classification
https://github.com/cmasch/cnn-text-classification/blob/master/Evaluation.ipynb

CNN Multichannel Text Classification + Hierarchical attention + ...
https://github.com/gaurav104/TextClassification/blob/master/CNN%20Multichannel%20Text%20Classification.ipynb

Notes for Deep Learning
https://arxiv.org/pdf/1808.09772.pdf

Doc classification with NLP
https://github.com/mdh266/DocumentClassificationNLP/blob/master/NLP.ipynb

Paragraph Topic Classification
http://cs229.stanford.edu/proj2016/report/NhoNg-ParagraphTopicClassification-report.pdf

1D convolutional neural networks for NLP
https://github.com/Tixierae/deep_learning_NLP/blob/master/cnn_imdb.ipynb

Hierarchical Attention for text classification
https://github.com/Tixierae/deep_learning_NLP/blob/master/HAN/HAN_final.ipynb

Multi-class classification scikit learn (Random forest, SVM, logistic regression)
https://towardsdatascience.com/multi-class-text-classification-with-scikit-learn-12f1e60e0a9f
https://github.com/susanli2016/Machine-Learning-with-Python/blob/master/Consumer_complaints.ipynb

Text feature extraction TFIDF mathematics
https://dzone.com/articles/machine-learning-text-feature-0

Classification Yelp Reviews (AWS)
http://www.developintelligence.com/blog/2017/06/practical-neural-networks-keras-classifying-yelp-reviews/

Convolutional Neural Networks for Text Classification (waouuuuu)
http://www.davidsbatista.net/blog/2018/03/31/SentenceClassificationConvNets/
https://github.com/davidsbatista/ConvNets-for-sentence-classification


**3 ways to interpretate your NLP model** [Lime, ELI5, Skater]
https://github.com/makcedward/nlp/blob/master/sample/nlp-model_interpretation.ipynb
https://towardsdatascience.com/3-ways-to-interpretate-your-nlp-model-to-management-and-customer-5428bc07ce15
https://medium.freecodecamp.org/how-to-improve-your-machine-learning-models-by-explaining-predictions-with-lime-7493e1d78375

Deep Learning for text made easy with AllenNLP
https://medium.com/swlh/deep-learning-for-text-made-easy-with-allennlp-62bc79d41f31

Ensemble Classifiers
https://www.learndatasci.com/tutorials/predicting-reddit-news-sentiment-naive-bayes-text-classifiers/

In [1]:
from sklearn import model_selection, preprocessing, linear_model, naive_bayes, metrics, svm
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn import decomposition, ensemble

import pandas, xgboost, numpy, textblob, string
#from keras.preprocessing import text, sequence
#from keras import layers, models, optimizers



# <font color='BLUE'>DATA LOADING </font>

### A. FIRST dataset: Consumer Reviews [Amazon] [2 labels / binary classification]

In [None]:
# load the dataset
data = open('corpus.txt', encoding="utf8").read()
labels, texts = [], []
for line in data.split("\n"):
    content = line.split()
    if not content:  # skip blank lines (e.g. a trailing newline at the end of the file)
        continue
    labels.append(content[0])
    texts.append(" ".join(content[1:]))

# create a dataframe using texts and labels
trainDF = pandas.DataFrame()
trainDF['text'] = texts
trainDF['label'] = labels

In [None]:
trainDF.head()

In [None]:
trainDF.shape

In [None]:
# split the dataset into training and validation datasets 
train_x, valid_x, train_y, valid_y = model_selection.train_test_split(trainDF['text'], trainDF['label'])

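# label encode the target variable
# (fit the encoder once on all labels so that both splits share the same mapping)
encoder = preprocessing.LabelEncoder()
encoder.fit(trainDF['label'])
train_y = encoder.transform(train_y)
valid_y = encoder.transform(valid_y)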

In [None]:
train_x.shape

### B. Second dataset: Consumer Complaints [Banking industry] [multi-classification]

In [1]:
import pandas as pd
df = pd.read_csv('C:/Users/adsieg/Desktop/link_news/ML/Consumer_Complaints.csv')
df.head()

df = df[pd.notnull(df['Consumer complaint narrative'])]

col = ['Product', 'Consumer complaint narrative']
df = df[col]
df.columns = ['Product', 'Consumer_complaint_narrative']

df['category_id'] = df['Product'].factorize()[0]
from io import StringIO
category_id_df = df[['Product', 'category_id']].drop_duplicates().sort_values('category_id')
category_to_id = dict(category_id_df.values)
id_to_category = dict(category_id_df[['category_id', 'Product']].values)

In [None]:
id_to_category

In [2]:
# We take only 10,000 customer complaints to speed up algorithms
df = df[:10000]

df.head(5)

|  | Product | Consumer_complaint_narrative | category_id |
|---|---|---|---|
| 1 | Credit reporting | I have outdated information on my credit repor... | 0 |
| 2 | Consumer Loan | I purchased a new car on XXXX XXXX. The car de... | 1 |
| 7 | Credit reporting | An account on my credit report has a mistaken ... | 0 |
| 12 | Debt collection | This company refuses to provide me verificatio... | 2 |
| 16 | Debt collection | This complaint is in regards to Square Two Fin... | 2 |


In [7]:
df['Consumer_complaint_narrative'].iloc[2]

'An account on my credit report has a mistaken date. I mailed in a debt validation letter to allow XXXX to correct the information. I received a letter in the mail, stating that Experian received my correspondence and found it to be " suspicious \'\' and that " I did n\'t write it \'\'. Experian \'s letter is worded to imply that I am incapable of writing my own letter. I was deeply offended by this implication. \nI called Experian to figure out why my letter was so suspicious. I spoke to a representative who was incredibly unhelpful, She did not effectively answer any questions I asked of her, and she kept ignoring what I was saying regarding the offensive letter and my dispute process. I feel the representative did what she wanted to do, and I am not satisfied. It is STILL not clear to me why I received this letter. I typed this letter, I signed this letter, and I paid to mail this letter, yet Experian willfully disregarded my lawful request. \nI am disgusted with this entire situati

#### Imbalanced dataset

Conventional algorithms are often **biased towards the majority class**, not taking the data distribution into consideration. In the worst case, **minority classes are treated as outliers and ignored**. 

For some cases, such as fraud detection or cancer prediction, we would need to carefully configure our model or artificially balance the dataset, for example by **undersampling or oversampling each class.**

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
fig = plt.figure(figsize=(8,6))
df.groupby('Product').Consumer_complaint_narrative.count().plot.bar(ylim=0)
plt.show()
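
As a rough illustration of the oversampling idea mentioned above, the sketch below resamples every minority class (with replacement) up to the size of the largest class. The helper name `oversample_to_majority` and the toy frame are assumptions made for this example and are not part of the original notebook. In practice it would be applied to the training split only, so that duplicated rows do not leak into the validation set.

```python
import pandas as pd
from sklearn.utils import resample

def oversample_to_majority(frame, label_col, random_state=0):
    """Hypothetical helper: duplicate minority-class rows until every class
    has as many rows as the largest one."""
    target = frame[label_col].value_counts().max()
    parts = []
    for _, group in frame.groupby(label_col):
        if len(group) < target:
            # sample with replacement so a small class can reach the target size
            group = resample(group, replace=True, n_samples=target,
                             random_state=random_state)
        parts.append(group)
    # concatenate and shuffle so the classes are not stored in blocks
    balanced = pd.concat(parts).sample(frac=1, random_state=random_state)
    return balanced.reset_index(drop=True)

# toy data standing in for df[['Product', 'Consumer_complaint_narrative']]
toy = pd.DataFrame({
    'Product': ['Credit reporting'] * 6 + ['Consumer Loan'] * 2,
    'Consumer_complaint_narrative': ['complaint %d' % i for i in range(8)],
})
balanced = oversample_to_majority(toy, 'Product')
print(balanced['Product'].value_counts())
```

Undersampling would follow the same pattern with `replace=False` and the size of the smallest class as the target.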

# <font color='BLUE'>DATA CLEANING</font>

### A. --- A quick and easy function to clean my text

In [None]:
import re
from nltk.corpus import stopwords
import pandas as pd
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

def preprocess(raw_text):

    # keep only words
    letters_only_text = re.sub("[^a-zA-Z]", " ", raw_text)

    # convert to lower case and split 
    words = letters_only_text.lower().split()

    # remove stopwords
    stopword_set = set(stopwords.words("english"))
    meaningful_words = [w for w in words if w not in stopword_set]
    
    #stemmed words
    ps = PorterStemmer()
    stemmed_words = [ps.stem(word) for word in meaningful_words]
    
    #join the cleaned words in a list
    cleaned_word_list = " ".join(stemmed_words)

    return cleaned_word_list

In [None]:
df['Consumer_complaint_narrative'] = df['Consumer_complaint_narrative'].apply(lambda line : preprocess(line))

### B. How to list all the word forms that share a given stem

In [None]:
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

def split_dataset_into_words(dataset):
    datawords = dataset.apply(lambda x: x.split())
    return list(datawords)

#  my_list = all_incidents 
# dictionnary
def buffer_stemmisation_keywords(my_list):
    my_list = [item for sublist in my_list for item in sublist]
    aux = pd.DataFrame(my_list, columns =['word'] )
    aux['word_stemmed'] = aux['word'].apply(lambda x : stemmer.stem(x))
    aux = aux.groupby('word_stemmed').transform(lambda x: ', '.join(x))
    aux['word_stemmed'] = aux['word'].apply(lambda x : stemmer.stem(x.split(',')[0]))
    aux.index = aux['word_stemmed']
    del aux['word_stemmed']
    my_dict = aux.to_dict('dict')['word']
    return my_dict

In [None]:
dictionnary_all_words_unstemmed = buffer_stemmisation_keywords(split_dataset_into_words(df['Consumer_complaint_narrative']))

# Dictionary de-duplicated
for key, value in dictionnary_all_words_unstemmed.items():
    # value is a ", "-joined string of word forms: split it, drop the commas and the duplicates
    new_value = list(set(word.strip(",") for word in value.split()))
    dictionnary_all_words_unstemmed[key] = new_value

In [None]:
dictionnary_all_words_unstemmed
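
The stem-to-word-forms dictionary built above can also be sketched without pandas. The version below is only an illustration: `group_words_by_stem` and `toy_stem` are hypothetical names, and `toy_stem` is a crude stand-in stemmer used so that the example runs without NLTK.

```python
from collections import defaultdict

def group_words_by_stem(documents, stem):
    """Map each stem to the sorted list of distinct word forms that produce it.
    `stem` is any callable taking a word and returning its stem."""
    groups = defaultdict(set)
    for doc in documents:
        for word in doc.split():
            groups[stem(word)].add(word)
    return {key: sorted(words) for key, words in groups.items()}

def toy_stem(word):
    """Stand-in stemmer for the demo: strips a few common English suffixes."""
    for suffix in ('ing', 'ed', 's'):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

docs = ["report reports reporting", "reported charge charges"]
stem_to_words = group_words_by_stem(docs, toy_stem)
print(stem_to_words)
```

Passing `PorterStemmer().stem` as the `stem` argument should give the same kind of mapping for the stemmer used in this notebook. Note that the mapping is only informative when it is built from text that has not been stemmed yet.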

# <font color='BLUE'>FEATURE ENGINEERING</font>

2.1 Count Vectors as features

2.2 TF-IDF Vectors as features
- --- Word level
- --- N-Gram level
- --- Character level

2.3 Word Embeddings as features

2.4 Text / NLP based features

2.5 Topic Models as features

The different types of **word embeddings** can be broadly classified into two categories-

- **Frequency based Embedding**
    - Count Vector
    - TF-IDF Vector
    - Co-Occurrence Matrix with a fixed context window (with SVD)
- **Prediction based Embedding**
    - CBOW (Continuous Bag of Words)
    - Skip-gram model

### 2.1 Count Vectors as features

Count Vector is a matrix notation of the dataset in which every row represents a document from the corpus, every column represents a term from the corpus, and every cell represents the frequency count of a particular term in a particular document.
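
To make the row/column/cell description concrete, here is a minimal sketch on three made-up documents (the toy corpus is an assumption for illustration, not data from this notebook).

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = [
    "the card was charged twice",
    "the loan was denied",
    "card card card",
]

toy_vect = CountVectorizer()
toy_counts = toy_vect.fit_transform(toy_docs)   # sparse matrix: documents x terms

# vocabulary_ maps each term to its column index
toy_vocab = sorted(toy_vect.vocabulary_, key=toy_vect.vocabulary_.get)
print(toy_vocab)
print(toy_counts.toarray())

# cell (document 2, term "card") holds the number of times "card" occurs there
card_col = toy_vect.vocabulary_['card']
print(toy_counts[2, card_col])
```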

In [None]:
# split the dataset into training and validation datasets 
train_x, valid_x, train_y, valid_y = model_selection.train_test_split(df['Consumer_complaint_narrative'], df['Product'])

# label encode the target variable 
encoder = preprocessing.LabelEncoder()
train_y = encoder.fit_transform(train_y)
valid_y = encoder.fit_transform(valid_y)

In [None]:
# create a count vectorizer object 
count_vect = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')
count_vect.fit(df['Consumer_complaint_narrative'])

# transform the training and validation data using count vectorizer object
xtrain_count =  count_vect.transform(train_x)
xvalid_count =  count_vect.transform(valid_x)

### 2.2 TF-IDF Vectors as features

a. **Word Level TF-IDF**: Matrix representing tf-idf scores of every term in different documents

b. **N-gram Level TF-IDF**: N-grams are sequences of N consecutive terms. This matrix holds the tf-idf scores of those N-grams

c. **Character Level TF-IDF**: Matrix representing tf-idf scores of character level n-grams in the corpus
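
The three levels differ only in what counts as a "term". A small sketch, using the vectorizer's `build_analyzer()` on made-up strings, shows the terms each setting would extract (the example strings are assumptions for illustration).

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# each analyzer turns a raw string into the list of terms the vectorizer would score
word_level = TfidfVectorizer(analyzer='word').build_analyzer()
ngram_level = TfidfVectorizer(analyzer='word', ngram_range=(2, 3)).build_analyzer()
char_level = TfidfVectorizer(analyzer='char', ngram_range=(2, 3)).build_analyzer()

word_features = word_level("late fee charged")
ngram_features = ngram_level("late fee charged")
char_features = char_level("loan")

print(word_features)    # single words
print(ngram_features)   # word bigrams and trigrams
print(char_features)    # character bigrams and trigrams
```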

Most often **term-frequency** alone is **not** a good measure of the **importance of a word/term to a document's topic**.  Very common words like "the", "a", "to" are almost always the terms with the **highest frequency in the text**. Thus, a high raw count of the number of times a term appears in a document does not necessarily mean that the corresponding word is more important. Furthermore, longer documents can show high frequencies for terms that do not correlate with the document topic and occur often solely because of the length of the document.

To circumvent the limitation of term-frequency, we often normalize it by the **inverse document frequency (idf)**.  This results in the **term frequency-inverse document frequency (tf-idf)** matrix.  The **inverse document frequency is a measure of how much information the word provides, that is, whether the term is common or rare across all documents in the corpus**.  We can give a formal definition of the inverse document frequency by letting $\mathcal{D}$ be the corpus (the set of all documents), $N_{\mathcal{D}}$ the number of documents in the corpus, and $N_{t,\mathcal{D}}$ the number of documents that contain the term $t$. Then,

$$idf(t,\mathcal{D}) \, = \,  \log\left(\frac{N_{\mathcal{D}}}{1 + N_{t,\mathcal{D}}}\right) \, = \, -  \log\left(\frac{1 + N_{t,\mathcal{D}}}{N_{\mathcal{D}}}\right) $$

The $1$ in the denominator is there for smoothing.  Without it, a term/word that did not appear in any training document would cause a division by zero, i.e. $idf(t,\mathcal{D}) = \infty$.  With the $1$ present, such a term instead gets the finite value $idf(t,\mathcal{D}) = \log N_{\mathcal{D}}$.


Now we can formally define the term frequency-inverse document frequency as a normalized version of term-frequency,


$$\text{tf-idf}(t,d) \, = \, tf(t,d) \cdot idf(t,\mathcal{D}) $$

Like the term-frequency, the term frequency-inverse document frequency is a sparse matrix, where again, each row is a document in our training corpus ($\mathcal{D}$) and each column corresponds to a term/word in the bag-of-words list.
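
As a worked example of the two formulas above, the sketch below computes idf and tf-idf by hand on a made-up four-document corpus. The corpus and the helper names `idf` and `tf_idf` are assumptions for illustration. Note that scikit-learn's `TfidfVectorizer` uses a slightly different smoothing by default, $\ln\left(\frac{1 + N_{\mathcal{D}}}{1 + N_{t,\mathcal{D}}}\right) + 1$, so its numbers will not match these exactly.

```python
import math

corpus = [
    "credit report error",
    "credit card fee",
    "credit loan late fee fee",
    "mortgage payment",
]

def idf(term, corpus):
    """idf(t, D) = log(N_D / (1 + N_{t,D})), as defined above."""
    n_docs = len(corpus)
    n_containing = sum(1 for doc in corpus if term in doc.split())
    return math.log(n_docs / (1 + n_containing))

def tf_idf(term, doc, corpus):
    """tf-idf(t, d) = tf(t, d) * idf(t, D), with tf as the raw count."""
    tf = doc.split().count(term)
    return tf * idf(term, corpus)

idf_credit = idf("credit", corpus)        # in 3 of 4 documents: log(4 / 4)
idf_mortgage = idf("mortgage", corpus)    # in 1 of 4 documents: log(4 / 2)
idf_unseen = idf("unseen", corpus)        # in no document: log(4 / 1), finite
tfidf_fee = tf_idf("fee", corpus[2], corpus)   # tf = 2, idf = log(4 / 3)

print(idf_credit, idf_mortgage, idf_unseen, tfidf_fee)
```

With this definition, a term present in every document gets a slightly negative idf, $\log\left(N_{\mathcal{D}} / (N_{\mathcal{D}} + 1)\right)$, which is one reason library implementations tend to use other smoothing variants.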
_________________________________________
**EXAMPLE:**

**from** sklearn.feature_extraction.text **import** TfidfVectorizer

tfidf = TfidfVectorizer(**sublinear_tf**=True, **min_df**=5, **norm**='l2', **encoding**='latin-1', **ngram_range**=(1, 2), **stop_words**='english')

features = tfidf.fit_transform(df['text']).toarray()
labels = df.category_id
_________________________________________

- **sublinear_tf** is set to True to use a logarithmic form for frequency.
- **min_df** is the minimum number of documents a word must be present in to be kept.
- **norm** is set to l2, to ensure all our feature vectors have a Euclidean norm of 1.
- **ngram_range** is set to (1, 2) to indicate that we want to consider both unigrams and bigrams.
- **stop_words** is set to "english" to remove common English stop words ("a", "the", ...) and reduce the number of noisy features.

In [None]:
# word level tf-idf
tfidf_vect = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features=5000)
tfidf_vect.fit(trainDF['text'])
xtrain_tfidf =  tfidf_vect.transform(train_x)
xvalid_tfidf =  tfidf_vect.transform(valid_x)

# ngram level tf-idf 
tfidf_vect_ngram = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', ngram_range=(2,3), max_features=5000)
tfidf_vect_ngram.fit(trainDF['text'])
xtrain_tfidf_ngram =  tfidf_vect_ngram.transform(train_x)
xvalid_tfidf_ngram =  tfidf_vect_ngram.transform(valid_x)

# characters level tf-idf (token_pattern is only used when analyzer='word', so it is omitted here)
tfidf_vect_ngram_chars = TfidfVectorizer(analyzer='char', ngram_range=(2,3), max_features=5000)
tfidf_vect_ngram_chars.fit(trainDF['text'])
xtrain_tfidf_ngram_chars =  tfidf_vect_ngram_chars.transform(train_x) 
xvalid_tfidf_ngram_chars =  tfidf_vect_ngram_chars.transform(valid_x) 

The term-frequency is a **sparse matrix** where **each row is a document in our training corpus** ($\mathcal{D}$) and each **column corresponds to a term/word in the bag-of-words list**

In [None]:
print('- Size of the matrix is', xvalid_tfidf.shape, 'as we passed 5,000 words and we have 1725 customer comments')

In [None]:
# get_feature_names_out() needs scikit-learn >= 1.0; older releases use get_feature_names()
print('Here is my bag of words:', tfidf_vect.get_feature_names_out())

In [None]:
# get_feature_names_out() needs scikit-learn >= 1.0; older releases use get_feature_names()
print('Size of my bag of words:', len(tfidf_vect.get_feature_names_out()))

#### Other implementation

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(sublinear_tf=True, min_df=5, norm='l2', encoding='latin-1', ngram_range=(1, 2), stop_words='english')

features = tfidf.fit_transform(df.Consumer_complaint_narrative).toarray()
labels = df.category_id

print('- Size of the matrix is', features.shape)
print('- Each of', features.shape[0], 'consumer complaint narratives is represented by', features.shape[1], 'features, representing the tf-idf score for different unigrams and bigrams.')
print('-', features.shape[0], 'is the # of document / complaint and', features.shape[1], 'is my bag of words containing unigram and bigram')

The term-frequency is a **sparse matrix** where **each row is a document in our training corpus** ($\mathcal{D}$) and each **column corresponds to a term/word in the bag-of-words list**

- **sklearn.feature_selection.chi2** to find the terms that are the most correlated with each of the products

In [None]:
from sklearn.feature_selection import chi2
import numpy as np

N = 2
for Product, category_id in sorted(category_to_id.items()):
    features_chi2 = chi2(features, labels == category_id)
    indices = np.argsort(features_chi2[0])
    # get_feature_names_out() needs scikit-learn >= 1.0; older releases use get_feature_names()
    feature_names = np.array(tfidf.get_feature_names_out())[indices]
    unigrams = [v for v in feature_names if len(v.split(' ')) == 1]
    bigrams = [v for v in feature_names if len(v.split(' ')) == 2]
    print("# '{}':".format(Product))
    print("  . Most correlated unigrams:\n       . {}".format('\n       . '.join(unigrams[-N:])))
    print("  . Most correlated bigrams:\n       . {}".format('\n       . '.join(bigrams[-N:])))

### 2.3 Word Embeddings

A word embedding is a **way of representing words and documents** using a **dense vector representation**. The position of a word within the vector space is learned from text and is based on the **words that surround the word** when it is used. Word embeddings can be trained on the input corpus **itself** or can **be taken from pre-trained word embeddings** such as **GloVe**, **FastText**, and **Word2Vec**. Any one of them can be downloaded and **reused as a form of transfer learning**. 

Four essential steps:
- Loading the pretrained word embeddings
- Creating a tokenizer object
- Transforming text documents into sequences of tokens and padding them
- Creating a mapping between tokens and their respective embeddings

https://www.slideshare.net/PyData/sujit-pal-applying-the-fourstep-embed-encode-attend-predict-framework-to-predict-document-similarity

In [None]:
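# `text` and `sequence` come from Keras (import commented out in the first cell; deprecated in recent Keras releases)
from keras.preprocessing import text, sequence

# load the pre-trained word-embedding vectors 
embeddings_index = {}
with open('data/wiki-news-300d-1M.vec', encoding='utf8') as f:
    for line in f:
        values = line.split()
        embeddings_index[values[0]] = numpy.asarray(values[1:], dtype='float32')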

# create a tokenizer 
token = text.Tokenizer()
token.fit_on_texts(trainDF['text'])
word_index = token.word_index

# convert text to sequence of tokens and pad them to ensure equal length vectors 
train_seq_x = sequence.pad_sequences(token.texts_to_sequences(train_x), maxlen=70)
valid_seq_x = sequence.pad_sequences(token.texts_to_sequences(valid_x), maxlen=70)

# create token-embedding mapping
embedding_matrix = numpy.zeros((len(word_index) + 1, 300))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
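
For readers without Keras or the pre-trained vectors at hand, the same four steps can be sketched on toy data with NumPy only. Everything here is an assumption for illustration: the 3-dimensional vectors are invented, and the tokenizer and padding are simplified imitations (frequency-ranked indices starting at 1, left padding and left truncation) of what the Keras utilities do by default, without lowercasing or punctuation filtering.

```python
import numpy as np
from collections import Counter

toy_texts = ["great product great price", "terrible product"]

# step 1: "pre-trained" vectors (invented 3-d values, "terrible" is deliberately missing)
toy_embeddings_index = {
    "great": np.array([0.1, 0.2, 0.3]),
    "product": np.array([0.4, 0.5, 0.6]),
    "price": np.array([0.7, 0.8, 0.9]),
}

# step 2: tokenizer, most frequent word gets index 1 (index 0 is reserved for padding)
toy_counts = Counter(word for t in toy_texts for word in t.split())
toy_word_index = {word: i + 1 for i, (word, _) in enumerate(toy_counts.most_common())}

# step 3: texts -> index sequences, padded/truncated on the left to a fixed length
def pad(seq, maxlen):
    seq = seq[-maxlen:]                        # keep the last maxlen tokens
    return [0] * (maxlen - len(seq)) + seq     # fill the front with zeros

toy_sequences = [[toy_word_index[w] for w in t.split()] for t in toy_texts]
toy_padded = np.array([pad(s, maxlen=5) for s in toy_sequences])

# step 4: row i of the matrix holds the vector of the word with index i
toy_matrix = np.zeros((len(toy_word_index) + 1, 3))
for word, i in toy_word_index.items():
    vector = toy_embeddings_index.get(word)
    if vector is not None:                     # unknown words keep an all-zero row
        toy_matrix[i] = vector

print(toy_word_index)
print(toy_padded)
print(toy_matrix)
```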

### 2.4 Text / NLP based features

- 1. **Word Count of the documents** – total number of words in the documents
- 2. **Character Count of the documents** – total number of characters in the documents
- 3. **Average Word Density of the documents** – average length of the words used in the documents
- 4. **Punctuation Count in the documents** – total number of punctuation marks in the documents
- 5. **Upper Case Count in the documents** – total number of upper case words in the documents
- 6. **Title Word Count in the documents** – total number of proper case (title) words in the documents
- 7. **Frequency distribution of Part of Speech Tags:**
    - Noun Count
    - Verb Count
    - Adjective Count
    - Adverb Count
    - Pronoun Count

In [None]:
trainDF['char_count'] = trainDF['text'].apply(len)
trainDF['word_count'] = trainDF['text'].apply(lambda x: len(x.split()))
trainDF['word_density'] = trainDF['char_count'] / (trainDF['word_count']+1)
trainDF['punctuation_count'] = trainDF['text'].apply(lambda x: len("".join(_ for _ in x if _ in string.punctuation))) 
trainDF['title_word_count'] = trainDF['text'].apply(lambda x: len([wrd for wrd in x.split() if wrd.istitle()]))
trainDF['upper_case_word_count'] = trainDF['text'].apply(lambda x: len([wrd for wrd in x.split() if wrd.isupper()]))

In [None]:
pos_family = {
    'noun' : ['NN','NNS','NNP','NNPS'],
    'pron' : ['PRP','PRP$','WP','WP$'],
    'verb' : ['VB','VBD','VBG','VBN','VBP','VBZ'],
    'adj' :  ['JJ','JJR','JJS'],
    'adv' : ['RB','RBR','RBS','WRB']
}

# count the words in a given sentence whose part-of-speech tag belongs to the given family
def check_pos_tag(x, flag):
    cnt = 0
    try:
        wiki = textblob.TextBlob(x)
        for tup in wiki.tags:
            ppo = list(tup)[1]
            if ppo in pos_family[flag]:
                cnt += 1
    except Exception:
        # tagging can fail (e.g. missing tagger data); such rows keep a count of 0
        pass
    return cnt

trainDF['noun_count'] = trainDF['text'].apply(lambda x: check_pos_tag(x, 'noun'))
trainDF['verb_count'] = trainDF['text'].apply(lambda x: check_pos_tag(x, 'verb'))
trainDF['adj_count'] = trainDF['text'].apply(lambda x: check_pos_tag(x, 'adj'))
trainDF['adv_count'] = trainDF['text'].apply(lambda x: check_pos_tag(x, 'adv'))
trainDF['pron_count'] = trainDF['text'].apply(lambda x: check_pos_tag(x, 'pron'))

In [None]:
import nltk
nltk.download('averaged_perceptron_tagger')

### 2.5 Topic Models as features (LDA)

In [None]:
# train a LDA Model
lda_model = decomposition.LatentDirichletAllocation(n_components=20, learning_method='online', max_iter=20)
X_topics = lda_model.fit_transform(xtrain_count)
topic_word = lda_model.components_ 
vocab = count_vect.get_feature_names_out()  # get_feature_names() on scikit-learn < 1.0

# view the topic models
n_top_words = 10
topic_summaries = []
for i, topic_dist in enumerate(topic_word):
    topic_words = numpy.array(vocab)[numpy.argsort(topic_dist)][:-(n_top_words+1):-1]
    topic_summaries.append(' '.join(topic_words))
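
The cell above fits the topic model, but the document-topic matrix `X_topics` is what would actually serve as features. A minimal sketch of that last step, on a made-up corpus with two topics and a logistic regression as the downstream classifier (corpus, labels and classifier choice are all assumptions for illustration):

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

toy_docs = [
    "credit report error credit bureau",
    "credit score report dispute",
    "loan payment late fee",
    "car loan payment interest",
    "report credit bureau dispute",
    "loan interest fee payment",
]
toy_labels = [0, 0, 1, 1, 0, 1]

toy_counts = CountVectorizer().fit_transform(toy_docs)

# each row of doc_topics is the topic distribution of one document
toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = toy_lda.fit_transform(toy_counts)

# the topic proportions are used as a low-dimensional feature matrix
toy_clf = LogisticRegression().fit(doc_topics, toy_labels)
toy_pred = toy_clf.predict(doc_topics)

print(doc_topics.round(2))
print(toy_pred)
```

For new documents, `toy_lda.transform(...)` on their count vectors would give features in the same topic space.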

# <font color='BLUE'>MODEL BUILDING</font>

- Naive Bayes Classifier
- Linear Classifier
- Support Vector Machine
- Bagging Models
- Boosting Models
- Shallow Neural Networks
- Deep Neural Networks
- Convolutional Neural Network (CNN)
- Long Short-Term Memory (LSTM)
- Gated Recurrent Unit (GRU)
- Bidirectional RNN
- Recurrent Convolutional Neural Network (RCNN)
- Other Variants of Deep Neural Networks

### 3.1 Naive Bayes

-------------------

One of the most basic models for text classification is the <a href="https://en.wikipedia.org/wiki/Naive_Bayes_classifier">Naive Bayes model</a>. The Naive Bayes classification model predicts the document topic, $y \in \{C_{1},C_{2},\ldots, C_{K}\}$, where $C_{k}$ is a class or topic and $K$ is the number of classes, based on the document features $\textbf{x} \in \mathbb{N}^{p}$, where $p$ is the number of terms in our bag-of-words list.  The feature vector,

$$\textbf{x} \, = \, \left[ x_{1}, x_{2}, \ldots , x_{p} \right] $$

contains the value $x_{i}$ (a count, or its $\text{tf-idf}$ weight) of the $i$-th term in our bag-of-words list.  Using <a href="https://en.wikipedia.org/wiki/Bayes%27_theorem">Bayes' theorem</a> we can develop a model to predict the topic class  ($C_{k}$) of a document from its feature vector $\textbf{x}$,

$$P\left(C_{k} \, \vert \, x_{1}, \ldots , x_{p} \right) \; = \; \frac{P\left(x_{1}, \ldots, x_{p} \, \vert \, C_{k} \right)P(C_{k})}{P\left(x_{1}, \ldots, x_{p} \right)}$$

The Naive Bayes model makes the "naive" assumption that, given the class, each term's $\text{tf-idf}$ value is **conditionally independent** of every other term's.  This reduces our **conditional probability function** to the product,

$$ P\left(x_{1}, \ldots, x_{p} \, \vert \, C_{k} \right) \; = \; \Pi_{i=1}^{p} P\left(x_{i} \, \vert \, C_{k} \right)$$

Subsequently Bayes' theorem for our classification problem becomes,

$$P\left(C_{k} \, \vert \, x_{1}, \ldots , x_{p} \right) \; = \; \frac{ P(C_{k}) \, \Pi_{i=1}^{p} P\left(x_{i} \, \vert \, C_{k} \right)}{P\left(x_{1}, \ldots, x_{p} \right)}$$


Since the denominator is independent of the class ($C_{k}$) we can use a <a href="https://en.wikipedia.org/wiki/Maximum_a_posteriori">Maximum A Posteriori</a> method to estimate the document topic,

$$ \hat{y} \, = \, \text{arg max}_{k}\;  P(C_{k}) \,  \Pi_{i=1}^{p} P\left(x_{i} \, \vert \, C_{k} \right)$$ 


The **prior**, $P(C_{k}),$ is often taken to be the relative frequency of the class in the training corpus, while the form of the conditional distribution $P\left(x_{i} \, \vert \, C_{k} \right)$ is a choice of the modeler and determines the type of Naive Bayes classifier. 


We will use a <a href="http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB">multinomial Naive Bayes</a> model, which is designed for discrete features such as term counts; in practice it also works with fractional weights such as those in our $\text{tf-idf}$ matrix.  In the multinomial Naive Bayes model the conditional probability takes the form,


$$ P\left(x_{1}, \ldots, x_{p} \, \vert \, C_{k} \right) \, = \, \frac{\left(\sum_{i=1}^{p} x_{i}\right)!}{\Pi_{i=1}^{p} x_{i}!}  \Pi_{i=1}^{p} p_{k,i}^{x_{i}}$$


where $p_{k,i}$ is the probability that the $k$-th class will have the $i$-th bag-of-words term in its feature vector. This leads to our **posterior distribution** having the functional form,

$$P\left(C_{k} \, \vert \, x_{1}, \ldots , x_{p} \right) \; = \; \frac{ P(C_{k})}{P\left(x_{1}, \ldots, x_{p} \right)} \, \frac{\left(\sum_{i=1}^{p} x_{i}\right)!}{\Pi_{i=1}^{p} x_{i}!}  \Pi_{i=1}^{p} p_{k,i}^{x_{i}}$$
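
To connect the formulas to code, here is a small sketch that evaluates the MAP rule by hand in log space and compares it with scikit-learn's `MultinomialNB`. The toy count matrix and the function names are assumptions for illustration. The multinomial coefficient is dropped because it does not depend on the class, and the term probabilities $p_{k,i}$ are estimated with additive (Laplace) smoothing, `alpha=1.0`, which is an assumption not discussed in the text above but matches scikit-learn's default.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# toy term-count matrix: 4 documents, 3 terms, 2 classes
X_toy = np.array([[2, 1, 0],
                  [3, 0, 1],
                  [0, 1, 3],
                  [1, 0, 2]])
y_toy = np.array([0, 0, 1, 1])
X_new = np.array([[1, 0, 0],
                  [0, 0, 2]])

def fit_multinomial_nb(X, y, alpha=1.0):
    classes = np.unique(y)
    # log P(C_k): relative frequency of each class
    log_prior = np.log(np.array([np.mean(y == c) for c in classes]))
    # p_{k,i}: smoothed share of term i among all term occurrences in class k
    term_counts = np.array([X[y == c].sum(axis=0) for c in classes]) + alpha
    log_p = np.log(term_counts / term_counts.sum(axis=1, keepdims=True))
    return classes, log_prior, log_p

def predict_multinomial_nb(X, classes, log_prior, log_p):
    # log P(C_k) + sum_i x_i * log p_{k,i}, maximised over k
    scores = log_prior + X @ log_p.T
    return classes[np.argmax(scores, axis=1)]

classes, log_prior, log_p = fit_multinomial_nb(X_toy, y_toy)
manual_pred = predict_multinomial_nb(X_new, classes, log_prior, log_p)

reference = MultinomialNB(alpha=1.0).fit(X_toy, y_toy)
print(manual_pred, reference.predict(X_new))
```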



We can instantiate a multinomial Naive Bayes classifier using the Scikit-learn library and fit it to our  $\text{tf-idf}$ matrix using the commands,

In [None]:
# Naive Bayes on Count Vectors
accuracy = train_model(naive_bayes.MultinomialNB(), xtrain_count, train_y, xvalid_count)
print("NB, Count Vectors: ", accuracy)

# Naive Bayes on Word Level TF IDF Vectors

accuracy = train_model(naive_bayes.MultinomialNB(), xtrain_tfidf, train_y, xvalid_tfidf)
print("NB, WordLevel TF-IDF: ", accuracy)

# Naive Bayes on Ngram Level TF IDF Vectors
accuracy = train_model(naive_bayes.MultinomialNB(), xtrain_tfidf_ngram, train_y, xvalid_tfidf_ngram)
print("NB, N-Gram Vectors: ", accuracy)

# Naive Bayes on Character Level TF IDF Vectors
accuracy = train_model(naive_bayes.MultinomialNB(), xtrain_tfidf_ngram_chars, train_y, xvalid_tfidf_ngram_chars)
print("NB, CharLevel Vectors: ", accuracy)
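
The calls above rely on a `train_model` helper that is not defined anywhere in this notebook. A minimal sketch of what it presumably does (fit the classifier, predict on the validation features, return the accuracy) is given below; it would have to be run before the cell above. The optional `valid_label` argument is an addition made for this sketch so the function can be tried on its own; when it is omitted, the notebook-level `valid_y` is used.

```python
from sklearn import metrics, naive_bayes
from sklearn.feature_extraction.text import CountVectorizer

def train_model(classifier, feature_vector_train, label, feature_vector_valid,
                valid_label=None):
    """Fit the classifier and return its accuracy on the validation set."""
    if valid_label is None:
        # assumption: fall back to the validation labels defined in the notebook
        valid_label = valid_y
    classifier.fit(feature_vector_train, label)
    predictions = classifier.predict(feature_vector_valid)
    return metrics.accuracy_score(valid_label, predictions)

# toy check on made-up reviews (1 = positive, 0 = negative)
toy_train = ["good great", "great value", "bad awful", "awful service"]
toy_valid = ["good value", "bad service"]
toy_vect = CountVectorizer().fit(toy_train)
toy_accuracy = train_model(naive_bayes.MultinomialNB(),
                           toy_vect.transform(toy_train), [1, 1, 0, 0],
                           toy_vect.transform(toy_valid), valid_label=[1, 0])
print("toy accuracy:", toy_accuracy)
```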

#### Other implementation

In [None]:
# Look at my dataframe
trainDF.head()

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

X_train, X_test, y_train, y_test = train_test_split(trainDF['text'], trainDF['label'], random_state = 0)
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

mod = MultinomialNB()
clf = mod.fit(X_train_tfidf, y_train)

In [None]:
from sklearn.metrics import accuracy_score
X_test_tf = count_vect.transform(X_test)
X_test_tfidf = tfidf_transformer.transform(X_test_tf)

predicted = mod.predict(X_test_tfidf)
print("Accuracy:", accuracy_score(y_test, predicted))

-- A -- **Accuracy** - Accuracy is the most intuitive performance measure: it is simply **the ratio of correctly predicted observations to the total number of observations.** One may think that **a model with high accuracy is automatically the best one**. Accuracy is indeed a useful measure, but only for symmetric datasets where the numbers of false positives and false negatives are almost the same. Otherwise, you have to look at other metrics to evaluate the performance of your model. **For our model, we have got 0.803 which means our model is approx. 80% accurate.**

**Accuracy = (TP + TN) / (TP + FP + FN + TN)**

-- B -- **Precision** - Precision is the ratio of **correctly predicted positive observations to the total predicted positive observations.** The question this metric answers is: **of all the passengers we labeled as survived, how many actually survived?** High precision corresponds to a low false positive rate. We have got 0.788 precision, which is pretty good.

**Precision = TP / (TP + FP)**

-- C -- **Recall (Sensitivity)** - Recall is the **ratio of correctly predicted positive observations to all observations in the actual positive class**. The question recall answers is: **of all the passengers that truly survived, how many did we label?** We have got a recall of 0.631, which is good for this model as it’s above 0.5.

**Recall = TP / (TP + FN)**

-- D -- **F1 score** - F1 Score is the **harmonic mean of Precision and Recall.** Therefore, this score takes both false positives and false negatives into account. Intuitively it is not as easy to understand as accuracy, but F1 is usually more useful than accuracy, especially if you have an uneven class distribution. Accuracy works best if false positives and false negatives have similar cost. If the costs of false positives and false negatives are very different, it’s better to look at both Precision and Recall. In our case, F1 score is 0.701.

**F1 Score = 2*(Recall * Precision) / (Recall + Precision)**
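
The four formulas can be checked on a small made-up set of binary predictions by counting TP, FP, FN and TN directly and comparing against scikit-learn's metric functions (the labels below are invented for illustration and are unrelated to the scores quoted above).

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_toy = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

pairs = list(zip(y_true_toy, y_pred_toy))
TP = sum(1 for t, p in pairs if t == 1 and p == 1)
FP = sum(1 for t, p in pairs if t == 0 and p == 1)
FN = sum(1 for t, p in pairs if t == 1 and p == 0)
TN = sum(1 for t, p in pairs if t == 0 and p == 0)

accuracy = (TP + TN) / (TP + FP + FN + TN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * (recall * precision) / (recall + precision)

print(accuracy, accuracy_score(y_true_toy, y_pred_toy))
print(precision, precision_score(y_true_toy, y_pred_toy))
print(recall, recall_score(y_true_toy, y_pred_toy))
print(f1, f1_score(y_true_toy, y_pred_toy))
```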

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test, predicted))

In [None]:
### Prediction
# apply the same count -> tf-idf pipeline that the classifier was trained on
print(clf.predict(tfidf_transformer.transform(count_vect.transform([""" spent 3 days on the phone with countless "agents" - most of that time trying to get each one to understand what I wanted - simply wanted to change my email address in my account. Based on the terrible telephone service I assume they are all located in South America where it is known to be poor quality phone service. Spent 15 - 25 minutes just getting them to understand what the problem was. Ended up having to create a new account losing my entire order history. This has to be the worse phone customer service out there!"""]))))

### 3.2 Linear Classifier / Logistic Regression

https://stlong0521.github.io/20160228%20-%20Logistic%20Regression.html

<div><p>A generative classification model, such as Naive Bayes, learns the class-conditional probabilities and then predicts by using Bayes' rule to calculate the posterior, <span class="math">\(p(y|\textbf{x})\)</span>. Discriminative classifiers, by contrast, model the posterior directly. As one of the most popular discriminative classifiers, logistic regression directly models a linear decision boundary.</p>
<h3>Binary Logistic Regression Classifier<sup id="fnref:1"><a class="footnote-ref" href="#fn:1" rel="footnote">1</a></sup></h3>
<p>Let us start with the binary case. For an M-dimensional feature vector <span class="math">\(\textbf{x}=[x_1,x_2,...,x_M]^T\)</span>, the posterior probability of class <span class="math">\(y\in\{\pm{1}\}\)</span> given <span class="math">\(\textbf{x}\)</span> is assumed to satisfy
</p>
<div class="math">\begin{equation}
\ln{\frac{p(y=1|\textbf{x})}{p(y=-1|\textbf{x})}}=\textbf{w}^T\textbf{x},
\end{equation}</div>
<p>
where <span class="math">\(\textbf{w}=[w_1,w_2,...,w_M]^T\)</span> is the weighting vector to be learned. Given the constraint that <span class="math">\(p(y=1|\textbf{x})+p(y=-1|\textbf{x})=1\)</span>, it follows that
</p>
<div class="math">\begin{equation} \label{Eqn:Prob_Binary}
p(y|\textbf{x})=\frac{1}{1+\exp(-y\textbf{w}^T\textbf{x})}=\sigma(y\textbf{w}^T\textbf{x}),
\end{equation}</div>
<p>
in which we can observe the logistic sigmoid function <span class="math">\(\sigma(a)=\frac{1}{1+\exp(-a)}\)</span>.</p>
<p>Based on the assumptions above, the weighting vector, <span class="math">\(\textbf{w}\)</span>, can be learned by maximum likelihood estimation (MLE). More specifically, given training data set <span class="math">\(\mathcal{D}=\{(\textbf{x}_1,y_1),(\textbf{x}_2,y_2),...,(\textbf{x}_N,y_N)\}\)</span>,
</p>
<div class="math">\begin{align}
\begin{aligned}
\textbf{w}^*&amp;=\arg\max_{\textbf{w}}{\mathcal{L}(\textbf{w})}\\
&amp;=\arg\max_{\textbf{w}}{\sum_{i=1}^N\ln{{p(y_i|\textbf{x}_i)}}}\\
&amp;=\arg\max_{\textbf{w}}{\sum_{i=1}^N{\ln{\frac{1}{1+\exp(-y_i\textbf{w}^T\textbf{x}_i)}}}}\\
&amp;=\arg\min_{\textbf{w}}{\sum_{i=1}^N{\ln{(1+\exp(-y_i\textbf{w}^T\textbf{x}_i))}}}.
\end{aligned}
\end{align}</div>
<p>
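We have a convex objective function here (the negative log-likelihood in the last line, which we keep denoting <span class="math">\(\mathcal{L}(\textbf{w})\)</span> below), and we can calculate the optimal solution by applying gradient descent. The gradient can be derived as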
</p>
<div class="math">\begin{align}
\begin{aligned}
\nabla{\mathcal{L}(\textbf{w})}&amp;=\sum_{i=1}^N{\frac{-y_i\textbf{x}_i\exp(-y_i\textbf{w}^T\textbf{x}_i)}{1+\exp(-y_i\textbf{w}^T\textbf{x}_i)}}\\
&amp;=-\sum_{i=1}^N{y_i\textbf{x}_i(1-p(y_i|\textbf{x}_i))}.
\end{aligned}
\end{align}</div>
<p>
Then, we can learn the optimal <span class="math">\(\textbf{w}\)</span> by starting with an initial <span class="math">\(\textbf{w}_0\)</span> and iterating as follows:
</p>
<div class="math">\begin{equation} \label{Eqn:Iteration_Binary}
\textbf{w}_{t+1}=\textbf{w}_{t}-\eta_t\nabla{\mathcal{L}(\textbf{w}_t)},
\end{equation}</div>
<p>
where <span class="math">\(\eta_t\)</span> is the learning step size. It can be constant over time, but a time-varying step size could potentially reduce the convergence time, e.g., setting <span class="math">\(\eta_t\propto{1/\sqrt{t}}\)</span> so that the step size decreases as <span class="math">\(t\)</span> increases.</p>
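<p>The update rule above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the names <code>sigmoid</code> and <code>fit_binary_logreg</code>, the toy data, the initial step size <code>eta0</code>, and the use of a constant first feature as a bias term are all choices made for this example.</p>

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid, sigma(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def fit_binary_logreg(X, y, eta0=0.1, n_iter=500):
    """Gradient descent on sum_i ln(1 + exp(-y_i w^T x_i)), with y_i in {-1, +1}."""
    w = np.zeros(X.shape[1])                      # initial weights w_0
    for t in range(1, n_iter + 1):
        p = sigmoid(y * (X @ w))                  # p(y_i | x_i) under the current w
        grad = -(X * (y * (1.0 - p))[:, None]).sum(axis=0)
        w = w - (eta0 / np.sqrt(t)) * grad        # step size eta_t proportional to 1/sqrt(t)
    return w

# Toy data (hypothetical): a constant bias feature plus one informative feature.
X = np.array([[1.0, 2.0], [1.0, 3.0], [1.0, -2.0], [1.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = fit_binary_logreg(X, y)
pred = np.sign(X @ w)                             # hard decision in {-1, +1}
```

<p>On this separable toy set the sign of the learned score should reproduce the training labels; on real data a regularization term is usually added, which this sketch omits.</p>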
<h3>Multiclass Logistic Regression Classifier<sup id="fnref:1"><a class="footnote-ref" href="#fn:1" rel="footnote">1</a></sup></h3>
<p>To generalize to the multiclass case, the logistic regression model needs to be adapted accordingly. Now we have <span class="math">\(K\)</span> possible classes, that is, <span class="math">\(y\in\{1,2,...,K\}\)</span>. It is assumed that the posterior probability of class <span class="math">\(y=k\)</span> given <span class="math">\(\textbf{x}\)</span> follows
</p>
<div class="math">\begin{equation}
p(y=k|\textbf{x})\propto\exp(\textbf{w}_k^T\textbf{x}),
\end{equation}</div>
<p>
where <span class="math">\(\textbf{w}_k\)</span> is a column weighting vector corresponding to class <span class="math">\(k\)</span>. Considering all classes <span class="math">\(k=1,2,...,K\)</span>, we would have a weighting matrix that includes all <span class="math">\(K\)</span> weighting vectors. That is, <span class="math">\(\textbf{W}=[\textbf{w}_1,\textbf{w}_2,...,\textbf{w}_K]\)</span>.
Under the constraint
</p>
<div class="math">\begin{equation}
\sum_{k=1}^K{p(y=k|\textbf{x})}=1,
\end{equation}</div>
<p>
it then follows that
</p>
<div class="math">\begin{equation} \label{Eqn:Prob_Multiple}
p(y=k|\textbf{x})=\frac{\exp(\textbf{w}_k^T\textbf{x})}{\sum_{j=1}^K{\exp(\textbf{w}_j^T\textbf{x})}}.
\end{equation}</div>
<p>The weighting matrix, <span class="math">\(\textbf{W}\)</span>, can be similarly learned by maximum likelihood estimation (MLE). More specifically, given training data set <span class="math">\(\mathcal{D}=\{(\textbf{x}_1,y_1),(\textbf{x}_2,y_2),...,(\textbf{x}_N,y_N)\}\)</span>,
</p>
<div class="math">\begin{align}
\begin{aligned}
\textbf{W}^*&amp;=\arg\max_{\textbf{W}}{\mathcal{L}(\textbf{W})}\\
&amp;=\arg\max_{\textbf{W}}{\sum_{i=1}^N\ln{{p(y_i|\textbf{x}_i)}}}\\
&amp;=\arg\max_{\textbf{W}}{\sum_{i=1}^N{\ln{\frac{\exp(\textbf{w}_{y_i}^T\textbf{x}_i)}{\sum_{j=1}^K{\exp(\textbf{w}_j^T\textbf{x}_i)}}}}}.
\end{aligned}
\end{align}</div>
<p>
The gradient of the objective function with respect to each <span class="math">\(\textbf{w}_k\)</span> can be calculated as
</p>
<div class="math">\begin{align}
\begin{aligned}
\frac{\partial{\mathcal{L}(\textbf{W})}}{\partial{\textbf{w}_k}}&amp;=\sum_{i=1}^N{\textbf{x}_i\left(I(y_i=k)-\frac{\exp(\textbf{w}_k^T\textbf{x}_i)}{\sum_{j=1}^K{\exp(\textbf{w}_j^T\textbf{x}_i)}}\right)}\\
&amp;=\sum_{i=1}^N{\textbf{x}_i(I(y_i=k)-p(y_i=k|\textbf{x}_i))},
\end{aligned}
\end{align}</div>
<p>
where <span class="math">\(I(\cdot)\)</span> is a binary indicator function. Applying gradient descent, the optimal solution can be obtained by iterating as follows:
</p>
<div class="math">\begin{equation}\label{Eqn:Iteration_Multiple}
\textbf{w}_{k,t+1}=\textbf{w}_{k,t}+\eta_{t}\frac{\partial{\mathcal{L}(\textbf{W})}}{\partial{\textbf{w}_k}}.
\end{equation}</div>
<p>
Note that we have "<span class="math">\(+\)</span>" instead of "<span class="math">\(-\)</span>", because the maximum likelihood estimation in the binary case is eventually converted to a minimization problem, while here we keep performing maximization.</p>
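<p>A minimal NumPy sketch of this gradient ascent is given below. It is an illustration under assumptions rather than a definitive implementation: the names <code>softmax</code> and <code>fit_softmax</code>, the toy data, and the fixed step size are hypothetical, and classes are indexed from 0 to K-1 (not 1 to K as in the text) to match NumPy indexing.</p>

```python
import numpy as np

def softmax(Z):
    """Row-wise softmax; subtracting the row maximum avoids overflow in exp."""
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def fit_softmax(X, y, K, eta=0.1, n_iter=300):
    """Gradient ascent on the log-likelihood; column k of W is the vector w_k."""
    W = np.zeros((X.shape[1], K))
    Y = np.eye(K)[y]                       # one-hot rows, i.e. I(y_i = k)
    for _ in range(n_iter):
        P = softmax(X @ W)                 # P[i, k] = p(y = k | x_i)
        W = W + eta * (X.T @ (Y - P))      # "+" because the objective is maximized
    return W

# Toy data (hypothetical): bias feature plus two informative features, three classes.
X = np.array([[1.0, 2.0, 0.0], [1.0, 3.0, 0.0],
              [1.0, -2.0, 0.0], [1.0, -3.0, 0.0],
              [1.0, 0.0, 2.0], [1.0, 0.0, 3.0]])
y = np.array([0, 0, 1, 1, 2, 2])

W = fit_softmax(X, y, K=3)
probs = softmax(X @ W)
pred = probs.argmax(axis=1)
```

<p>Column <code>k</code> of <code>X.T @ (Y - P)</code> is exactly the sum over samples of <code>x_i (I(y_i = k) - p(y = k | x_i))</code>, so all K weight vectors are updated in one matrix product.</p>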
<h3>How to Perform Predictions?</h3>
<p>Once the optimal weights are learned from the logistic regression model, for any new feature vector <span class="math">\(\textbf{x}\)</span>, we can easily calculate the probability that it is associated with each class label <span class="math">\(k\)</span>, in either the binary case or the multiclass case. With the probabilities for each class label available, we can then perform:</p>
<ul>
<li>a hard decision by identifying the class label with the highest probability, or</li>
<li>a soft decision by showing the top <span class="math">\(k\)</span> most probable class labels with their corresponding probabilities.</li>
</ul>
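<p>The two decision rules can be sketched as below, assuming the class probabilities are already available as a NumPy vector. The helper names <code>hard_decision</code> and <code>soft_decision</code> and the example probabilities are hypothetical.</p>

```python
import numpy as np

def hard_decision(probs):
    """Return the index of the most probable class."""
    return int(np.argmax(probs))

def soft_decision(probs, top_k=3):
    """Return the top_k (class index, probability) pairs, most probable first."""
    order = np.argsort(probs)[::-1][:top_k]
    return [(int(k), float(probs[k])) for k in order]

# Example probabilities for three classes (made up for illustration).
probs = np.array([0.1, 0.6, 0.3])
best = hard_decision(probs)
top2 = soft_decision(probs, top_k=2)
```

<p>With scikit-learn estimators, the same vector would typically come from one row of <code>predict_proba</code>.</p>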

In [None]:
# Linear Classifier on Count Vectors
accuracy = train_model(linear_model.LogisticRegression(), xtrain_count, train_y, xvalid_count)
print("LR, Count Vectors: ", accuracy)

# Linear Classifier on Word Level TF IDF Vectors
accuracy = train_model(linear_model.LogisticRegression(), xtrain_tfidf, train_y, xvalid_tfidf)
print("LR, WordLevel TF-IDF: ", accuracy)

# Linear Classifier on Ngram Level TF IDF Vectors
accuracy = train_model(linear_model.LogisticRegression(), xtrain_tfidf_ngram, train_y, xvalid_tfidf_ngram)
print("LR, N-Gram Vectors: ", accuracy)

# Linear Classifier on Character Level TF IDF Vectors
accuracy = train_model(linear_model.LogisticRegression(), xtrain_tfidf_ngram_chars, train_y, xvalid_tfidf_ngram_chars)
print("LR, CharLevel Vectors: ", accuracy)

### 3.3 Implementing an SVM Model

In [None]:
# https://svivek.com/teaching/machine-learning/fall2018/slides/svm/svm-sgd.pdf
# https://medium.com/deep-math-machine-learning-ai/chapter-3-support-vector-machine-with-math-47d6193c82be

In [None]:


# SVM on Ngram Level TF IDF Vectors
accuracy = train_model(svm.SVC(), xtrain_tfidf_ngram, train_y, xvalid_tfidf_ngram)
print("SVM, N-Gram Vectors: ", accuracy)

In [None]:
from sklearn.pipeline import Pipeline

In [None]:
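from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier

# Linear SVM (hinge loss) trained with SGD on TF-IDF features.
# `n_iter` was removed from SGDClassifier; `max_iter=5, tol=None` keeps the original 5 epochs.
text_clf_svm = Pipeline([('vect', CountVectorizer()),
                         ('tfidf', TfidfTransformer()),
                         ('clf-svm', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, max_iter=5, tol=None, random_state=42))])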

In [None]:
import numpy as np

# twenty_train / twenty_test are the 20 Newsgroups subsets loaded with fetch_20newsgroups
_ = text_clf_svm.fit(twenty_train.data, twenty_train.target)
predicted_svm = text_clf_svm.predict(twenty_test.data)
np.mean(predicted_svm == twenty_test.target)

### 3.4 Bagging Model

In [None]:
# RF on Count Vectors
accuracy = train_model(ensemble.RandomForestClassifier(), xtrain_count, train_y, xvalid_count)
print("RF, Count Vectors: ", accuracy)

# RF on Word Level TF IDF Vectors
accuracy = train_model(ensemble.RandomForestClassifier(), xtrain_tfidf, train_y, xvalid_tfidf)
print("RF, WordLevel TF-IDF: ", accuracy)

### 3.5 Boosting Model

In [None]:
# Extreme Gradient Boosting on Count Vectors
accuracy = train_model(xgboost.XGBClassifier(), xtrain_count.tocsc(), train_y, xvalid_count.tocsc())
print("Xgb, Count Vectors: ", accuracy)

# Extreme Gradient Boosting on Word Level TF IDF Vectors
accuracy = train_model(xgboost.XGBClassifier(), xtrain_tfidf.tocsc(), train_y, xvalid_tfidf.tocsc())
print("Xgb, WordLevel TF-IDF: ", accuracy)

# Extreme Gradient Boosting on Character Level TF IDF Vectors
accuracy = train_model(xgboost.XGBClassifier(), xtrain_tfidf_ngram_chars.tocsc(), train_y, xvalid_tfidf_ngram_chars.tocsc())
print("Xgb, CharLevel Vectors: ", accuracy)

### 3.6 Shallow Neural Networks

In [None]:
def create_model_architecture(input_size):
    # create input layer 
    input_layer = layers.Input((input_size, ), sparse=True)
    
    # create hidden layer
    hidden_layer = layers.Dense(100, activation="relu")(input_layer)
    
    # create output layer
    output_layer = layers.Dense(1, activation="sigmoid")(hidden_layer)

    classifier = models.Model(inputs = input_layer, outputs = output_layer)
    classifier.compile(optimizer=optimizers.Adam(), loss='binary_crossentropy')
    return classifier 

classifier = create_model_architecture(xtrain_tfidf_ngram.shape[1])
accuracy = train_model(classifier, xtrain_tfidf_ngram, train_y, xvalid_tfidf_ngram, is_neural_net=True)
print("NN, Ngram Level TF IDF Vectors",  accuracy)

### 3.7.1 Convolutional Neural Network [Deep Neural Networks]

In [None]:
from IPython.display import Image
from IPython.core.display import HTML 
Image(url= "cnn.png")

In [None]:
def create_cnn():
    # Add an Input Layer
    input_layer = layers.Input((70, ))

    # Add the word embedding Layer
    embedding_layer = layers.Embedding(len(word_index) + 1, 300, weights=[embedding_matrix], trainable=False)(input_layer)
    embedding_layer = layers.SpatialDropout1D(0.3)(embedding_layer)

    # Add the convolutional Layer
    conv_layer = layers.Convolution1D(100, 3, activation="relu")(embedding_layer)

    # Add the pooling Layer
    pooling_layer = layers.GlobalMaxPool1D()(conv_layer)

    # Add the output Layers
    output_layer1 = layers.Dense(50, activation="relu")(pooling_layer)
    output_layer1 = layers.Dropout(0.25)(output_layer1)
    output_layer2 = layers.Dense(1, activation="sigmoid")(output_layer1)

    # Compile the model
    model = models.Model(inputs=input_layer, outputs=output_layer2)
    model.compile(optimizer=optimizers.Adam(), loss='binary_crossentropy')
    
    return model

classifier = create_cnn()
accuracy = train_model(classifier, train_seq_x, train_y, valid_seq_x, is_neural_net=True)
print("CNN, Word Embeddings",  accuracy)

### 3.7.2 Recurrent Neural Network – LSTM [Deep Neural Networks]

In [None]:
def create_rnn_lstm():
    # Add an Input Layer
    input_layer = layers.Input((70, ))

    # Add the word embedding Layer
    embedding_layer = layers.Embedding(len(word_index) + 1, 300, weights=[embedding_matrix], trainable=False)(input_layer)
    embedding_layer = layers.SpatialDropout1D(0.3)(embedding_layer)

    # Add the LSTM Layer
    lstm_layer = layers.LSTM(100)(embedding_layer)

    # Add the output Layers
    output_layer1 = layers.Dense(50, activation="relu")(lstm_layer)
    output_layer1 = layers.Dropout(0.25)(output_layer1)
    output_layer2 = layers.Dense(1, activation="sigmoid")(output_layer1)

    # Compile the model
    model = models.Model(inputs=input_layer, outputs=output_layer2)
    model.compile(optimizer=optimizers.Adam(), loss='binary_crossentropy')
    
    return model

classifier = create_rnn_lstm()
accuracy = train_model(classifier, train_seq_x, train_y, valid_seq_x, is_neural_net=True)
print("RNN-LSTM, Word Embeddings",  accuracy)

### 3.7.3 Recurrent Neural Network – GRU [Deep Neural Networks]

In [None]:
def create_rnn_gru():
    # Add an Input Layer
    input_layer = layers.Input((70, ))

    # Add the word embedding Layer
    embedding_layer = layers.Embedding(len(word_index) + 1, 300, weights=[embedding_matrix], trainable=False)(input_layer)
    embedding_layer = layers.SpatialDropout1D(0.3)(embedding_layer)

    # Add the GRU Layer
    gru_layer = layers.GRU(100)(embedding_layer)

    # Add the output Layers
    output_layer1 = layers.Dense(50, activation="relu")(gru_layer)
    output_layer1 = layers.Dropout(0.25)(output_layer1)
    output_layer2 = layers.Dense(1, activation="sigmoid")(output_layer1)

    # Compile the model
    model = models.Model(inputs=input_layer, outputs=output_layer2)
    model.compile(optimizer=optimizers.Adam(), loss='binary_crossentropy')
    
    return model

classifier = create_rnn_gru()
accuracy = train_model(classifier, train_seq_x, train_y, valid_seq_x, is_neural_net=True)
print("RNN-GRU, Word Embeddings",  accuracy)

### 3.7.4 Bidirectional RNN [Deep Neural Networks]

In [None]:
def create_bidirectional_rnn():
    # Add an Input Layer
    input_layer = layers.Input((70, ))

    # Add the word embedding Layer
    embedding_layer = layers.Embedding(len(word_index) + 1, 300, weights=[embedding_matrix], trainable=False)(input_layer)
    embedding_layer = layers.SpatialDropout1D(0.3)(embedding_layer)

    # Add the bidirectional GRU Layer
    bigru_layer = layers.Bidirectional(layers.GRU(100))(embedding_layer)

    # Add the output Layers
    output_layer1 = layers.Dense(50, activation="relu")(bigru_layer)
    output_layer1 = layers.Dropout(0.25)(output_layer1)
    output_layer2 = layers.Dense(1, activation="sigmoid")(output_layer1)

    # Compile the model
    model = models.Model(inputs=input_layer, outputs=output_layer2)
    model.compile(optimizer=optimizers.Adam(), loss='binary_crossentropy')
    
    return model

classifier = create_bidirectional_rnn()
accuracy = train_model(classifier, train_seq_x, train_y, valid_seq_x, is_neural_net=True)
print("RNN-Bidirectional, Word Embeddings",  accuracy)

### 3.7.5 Recurrent Convolutional Neural Network

Other deep network variants worth trying:

- Hierarchical Attention Networks
- Sequence to Sequence Models with Attention
- Bidirectional Recurrent Convolutional Neural Networks
- CNNs and RNNs with more layers

In [None]:
def create_rcnn():
    # Add an Input Layer
    input_layer = layers.Input((70, ))

    # Add the word embedding Layer
    embedding_layer = layers.Embedding(len(word_index) + 1, 300, weights=[embedding_matrix], trainable=False)(input_layer)
    embedding_layer = layers.SpatialDropout1D(0.3)(embedding_layer)
    
    # Add the recurrent layer
    rnn_layer = layers.Bidirectional(layers.GRU(50, return_sequences=True))(embedding_layer)
    
    # Add the convolutional Layer on top of the recurrent layer's output
    conv_layer = layers.Convolution1D(100, 3, activation="relu")(rnn_layer)

    # Add the pooling Layer
    pooling_layer = layers.GlobalMaxPool1D()(conv_layer)

    # Add the output Layers
    output_layer1 = layers.Dense(50, activation="relu")(pooling_layer)
    output_layer1 = layers.Dropout(0.25)(output_layer1)
    output_layer2 = layers.Dense(1, activation="sigmoid")(output_layer1)

    # Compile the model
    model = models.Model(inputs=input_layer, outputs=output_layer2)
    model.compile(optimizer=optimizers.Adam(), loss='binary_crossentropy')
    
    return model

classifier = create_rcnn()
accuracy = train_model(classifier, train_seq_x, train_y, valid_seq_x, is_neural_net=True)
print("RCNN, Word Embeddings",  accuracy)

# <font color='BLUE'>EXPLAIN MODELS</font>

# TextExplainer: debugging black-box text classifiers

https://eli5.readthedocs.io/en/latest/tutorials/black-box-text-classifiers.html

https://eli5.readthedocs.io/en/latest/tutorials/black-box-text-classifiers.html#example-problem-lsa-svm-for-20-newsgroups-dataset

**Goal:** explain predictions of arbitrary classifiers, including text classifiers (when it is hard to get exact mapping between model coefficients and text features, e.g. if there is dimension reduction involved)

### Example problem: LSA+SVM for 20 Newsgroups dataset

In [None]:
from sklearn.datasets import fetch_20newsgroups

categories = ['alt.atheism', 'soc.religion.christian',
              'comp.graphics', 'sci.med']
twenty_train = fetch_20newsgroups(
    subset='train',
    categories=categories,
    shuffle=True,
    random_state=42,
    remove=('headers', 'footers'),
)
twenty_test = fetch_20newsgroups(
    subset='test',
    categories=categories,
    shuffle=True,
    random_state=42,
    remove=('headers', 'footers'),
)

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import Pipeline, make_pipeline

vec = TfidfVectorizer(min_df=3, stop_words='english',
                      ngram_range=(1, 2))

# The dimension of the input documents is reduced to 100, and then a kernel SVM is used to classify the documents.
svd = TruncatedSVD(n_components=100, n_iter=7, random_state=42)
lsa = make_pipeline(vec, svd)

clf = SVC(C=150, gamma=2e-2, probability=True)
pipe = make_pipeline(lsa, clf)
pipe.fit(twenty_train.data, twenty_train.target)
pipe.score(twenty_test.data, twenty_test.target)

In [None]:
def print_prediction(doc):
    y_pred = pipe.predict_proba([doc])[0]
    for target, prob in zip(twenty_train.target_names, y_pred):
        print("{:.3f} {}".format(prob, target))

doc = twenty_test.data[0]

print(twenty_test.data[0])
print('------------------------------------ What is the prediction?-------------------------------------------------------')
print_prediction(doc)

### TextExplainer

1. Create a TextExplainer instance, 
2. ... then pass the document to explain and a black-box classifier (a function which returns probabilities) to the fit() method, 
3. ... then check the explanation:

In [None]:
import eli5
from eli5.lime import TextExplainer

doc = twenty_test.data[0]

te = TextExplainer(random_state=42)
te.fit(doc, pipe.predict_proba)
te.show_prediction(target_names=twenty_train.target_names)

### Why does it work?

The explanation makes sense: we expect a reasonable classifier to **take the highlighted words into account**. But how can we be sure this is **how the pipeline works**, not just a nice-looking lie?

A simple **sanity check** is to **remove or change the highlighted words** and confirm that **they change the outcome**.

In [None]:
import re
doc2 = re.sub(r'(recall|kidney|stones|medication|pain|tech)', '', doc, flags=re.I)
print_prediction(doc2)

**Predicted probabilities changed a lot indeed.**

And in fact, TextExplainer did something similar to get the explanation. TextExplainer generated a lot of texts similar to the document (by removing some of the words), and then trained a white-box classifier which predicts the output of the black-box classifier (not the true labels!). The explanation we saw is for this white-box classifier.

This approach follows the LIME algorithm; for text data the algorithm is actually pretty straightforward:

- generate distorted versions of the text;
- predict probabilities for these distorted texts using the black-box classifier;
- train another classifier (one of those eli5 supports) which tries to predict output of a black-box classifier on these texts.

The algorithm works because even though it could be hard or impossible to approximate a black-box classifier globally (for every possible text), approximating it in a small neighbourhood near a given text often works well, even with simple white-box classifiers.
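
The three steps above can be sketched in a few lines. This is a simplified illustration of the idea, not eli5's or LIME's actual implementation: the function `lime_text_sketch`, the random word-dropping scheme, the crude proximity weight, the Ridge surrogate, and the toy black box are all assumptions made for this example.

```python
import numpy as np
from sklearn.linear_model import Ridge

def lime_text_sketch(doc, predict_proba, target, n_samples=500, seed=0):
    """Fit a local linear surrogate for one class of a black-box text classifier."""
    rng = np.random.RandomState(seed)
    tokens = doc.split()
    # 1. Generate distorted texts by randomly dropping words (1 = keep, 0 = drop).
    masks = rng.randint(0, 2, size=(n_samples, len(tokens)))
    masks[0] = 1                                   # keep the original text as one sample
    texts = [" ".join(t for t, keep in zip(tokens, m) if keep) for m in masks]
    # 2. Ask the black box for probabilities on the distorted texts.
    probs = np.asarray(predict_proba(texts))[:, target]
    # 3. Train a white-box model to predict the black-box output (not the true labels),
    #    weighting samples by the fraction of words kept as a crude proximity measure.
    proximity = masks.mean(axis=1)
    surrogate = Ridge(alpha=1.0).fit(masks, probs, sample_weight=proximity)
    return sorted(zip(tokens, surrogate.coef_), key=lambda pair: -abs(pair[1]))

def toy_black_box(texts):
    """Hypothetical black box: class 1 is likely whenever 'kidney' is present."""
    p = np.array([0.9 if "kidney" in t.split() else 0.1 for t in texts])
    return np.column_stack([1.0 - p, p])

explanation = lime_text_sketch("my kidney hurts a lot", toy_black_box, target=1)
top_word, top_weight = explanation[0]
```

Because the surrogate is linear in the keep/drop indicators, each coefficient reads as the approximate change in the black-box probability when that word is present, which is what the highlighted weights in the explanation convey.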

Generated samples (distorted texts) are available in samples_ attribute:

In [None]:
print(te.samples_[0])

In [None]:
# By default TextExplainer generates 5000 distorted texts (use n_samples argument to change the amount):
len(te.samples_)

### Customizing TextExplainer: classifier

In [None]:
from sklearn.tree import DecisionTreeClassifier

te5 = TextExplainer(clf=DecisionTreeClassifier(max_depth=2), random_state=0)
te5.fit(doc, pipe.predict_proba)
print(te5.metrics_)
te5.show_weights()

So according to this tree, if **“kidney” is not in the document** and **“pain” is not in the document**, then the **probability of the document** belonging to **sci.med** drops to **0.65**. If at least one of these words remains, the sci.med probability stays at **0.9+**.

# 3 ways to interpretate NLP model

https://github.com/makcedward/nlp/blob/master/sample/nlp-model_interpretation.ipynb

**Goal**: understand why the model makes a wrong prediction.

**1. Interpretability**
- **Intrinsic**: No additional model is needed to explain the target model; for example, a decision tree or a linear model is interpretable by itself.
- **Post hoc**: The model is a black box, so another model is needed to interpret it.

**2. Approach**
- **Model-specific**: Some tools are limited to specific model families, such as linear models or neural networks.
- **Model-agnostic**: Other tools can explain any model by building a white-box model around it.

**3. Level**
- **Global**: Explains the overall model, for example through feature weights. This gives a picture of the model's general behavior.
- **Local**: Explains a specific prediction.
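
To make the global/local distinction concrete, here is a small sketch using an intrinsically interpretable linear model. The four toy documents, their labels, and the variable names are made up for illustration; the sketch assumes only scikit-learn's `TfidfVectorizer` and `LogisticRegression`.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy corpus (hypothetical): label 1 = positive review, 0 = negative review.
docs = ["good great film", "great acting good plot",
        "bad boring film", "boring plot bad acting"]
labels = [1, 1, 0, 0]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)
clf = LogisticRegression().fit(X, labels)

# Feature names ordered by column index (works across scikit-learn versions).
features = np.array(sorted(vec.vocabulary_, key=vec.vocabulary_.get))
coef = clf.coef_[0]                      # weights towards class 1

# Global explanation: the words with the largest weights in the whole model.
global_top = features[np.argsort(coef)[::-1][:2]].tolist()

# Local explanation: per-word contribution (tf-idf value * weight) for one document.
x0 = X[0].toarray().ravel()
contrib = x0 * coef
local_top = str(features[np.argmax(contrib)])
```

The global view ranks words by weight regardless of any document, while the local view only credits words that actually occur in the document being explained.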

In [None]:
import random
import pandas as pd
import IPython
import xgboost

import eli5
from eli5.lime import TextExplainer
from lime.lime_text import LimeTextExplainer
print('ELI5 Version:', eli5.__version__)
print('XGBoost Version:', xgboost.__version__)

In [None]:
from sklearn.datasets import fetch_20newsgroups
train_raw_df = fetch_20newsgroups(subset='train')
test_raw_df = fetch_20newsgroups(subset='test')

In [None]:
x_train = train_raw_df.data
y_train = train_raw_df.target

x_test = test_raw_df.data
y_test = test_raw_df.target

In [None]:
x_train

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from xgboost import XGBClassifier

In [None]:
names = ['Logistic Regression', 'Random Forest', 'XGBoost Classifier']

In [None]:
def build_model(names, x, y):
    pipelines = []
    vec = TfidfVectorizer()
    vec.fit(x)

    for name in names:
        print('train %s' % name)
        
        if name == 'Logistic Regression':
            estimator = LogisticRegression(solver='newton-cg', n_jobs=-1)
            pipeline = make_pipeline(vec, estimator)
        elif name == 'Random Forest':
            estimator = RandomForestClassifier(n_jobs=-1)
            pipeline = make_pipeline(vec, estimator)
        elif name == 'XGBoost Classifier':
            estimator = XGBClassifier()
            pipeline = make_pipeline(vec, estimator)
            
        pipeline.fit(x, y)
        pipelines.append({
            'name': name,
            'pipeline': pipeline
        })
        
    return pipelines, vec

In [None]:
pipelines, vec = build_model(names, x_train, y_train)

### 1. ELI5

#### A. - ELI5 - Global Interpretation

In [None]:
for pipeline in pipelines:
    print('Estimator: %s' % (pipeline['name']))
    labels = pipeline['pipeline'].classes_.tolist()
    
    if pipeline['name'] in ['Logistic Regression', 'Random Forest']:
        estimator = pipeline['pipeline']
    elif pipeline['name'] == 'XGBoost Classifier':
        estimator = pipeline['pipeline'].steps[1][1].get_booster()
#     Not support Keras
#     elif pipeline['name'] == 'keras':
#         estimator = pipeline['pipeline']
    else:
        continue
    
    IPython.display.display(
        eli5.show_weights(estimator=estimator, top=10, target_names=labels, vec=vec))

#### B. - ELI5 - Local Interpretation

In [None]:
number_of_sample = 1
sample_ids = [random.randint(0, len(x_test) -1 ) for p in range(0, number_of_sample)]

for idx in sample_ids:
    print('Index: %d' % (idx))
#     print('Index: %d, Feature: %s' % (idx, x_test[idx]))
    for pipeline in pipelines:
        print('-' * 50)
        print('Estimator: %s' % (pipeline['name']))
        
        print('True Label: %s, Predicted Label: %s' % (y_test[idx], pipeline['pipeline'].predict([x_test[idx]])[0]))
        labels = pipeline['pipeline'].classes_.tolist()
  
        if pipeline['name'] in ['Logistic Regression', 'Random Forest']:
            estimator = pipeline['pipeline'].steps[1][1]
        elif pipeline['name'] == 'XGBoost Classifier':
            estimator = pipeline['pipeline'].steps[1][1].get_booster()
        #     Not support Keras
#         elif pipeline['name'] == 'Keras':
#             estimator = pipeline['pipeline'].model
        else:
            continue

        IPython.display.display(
            eli5.show_prediction(estimator, x_test[idx], top=10, vec=vec, target_names=labels))

### 2. LIME [2 independent examples]

## <font color='red'>1st example</font>

https://www.kaggle.com/emanceau/interpreting-machine-learning-lime-explainer/notebook

Dataset contains text from works of fiction written by spooky authors of the public domain:
- Edgar Allan Poe (EAP)
- HP Lovecraft (HPL)
- Mary Wollstonecraft Shelley (MWS)

The objective is to **accurately identify the author of the sentences in the test set**

**The mission of the Lime explainer** is to help humans **understand decisions made by machine learning models**. Basically, the Lime explainer creates **a local linear model** around the prediction and tries to **explain the influence of each factor**.

In [2]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn import ensemble, metrics, model_selection, naive_bayes
from sklearn.pipeline import make_pipeline

from lime import lime_text
from lime.lime_text import LimeTextExplainer
import itertools  
%matplotlib inline
import warnings
warnings.simplefilter('ignore')

In [3]:
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

In [4]:
train_df.head()

|   | id | text | author |
|---|---|---|---|
| 0 | id26305 | This process, however, afforded me no means of... | EAP |
| 1 | id17569 | It never once occurred to me that the fumbling... | HPL |
| 2 | id11008 | In his left hand was a gold snuff box, from wh... | EAP |
| 3 | id27763 | How lovely is spring As we looked from Windsor... | MWS |
| 4 | id12958 | Finding nothing else, not even gold, the Super... | HPL |


In [5]:
test_df.head()

|   | id | text |
|---|---|---|
| 0 | id02310 | Still, as I urged our leaving Ireland with suc... |
| 1 | id24541 | If a fire wanted fanning, it could readily be ... |
| 2 | id00134 | And when they had broken down the frail door t... |
| 3 | id27757 | While I was thinking how I should possibly man... |
| 4 | id04081 | I am not sure to what limit his knowledge may ... |


#### Explainer with basic model

In [6]:
class_names = ['EAP', 'HPL', 'MWS']
cols_to_drop = ['id', 'text']
train_X = train_df.drop(cols_to_drop+['author'], axis=1)

## Prepare the data for modeling ###
author_mapping_dict = {'EAP':0, 'HPL':1, 'MWS':2}
train_y = train_df['author'].map(author_mapping_dict)
train_id = train_df['id'].values

tfidf_vec = TfidfVectorizer(ngram_range=(1,5), analyzer='char')
full_tfidf = tfidf_vec.fit_transform(train_df['text'].values.tolist() + test_df['text'].values.tolist())
train_tfidf = tfidf_vec.transform(train_df['text'].values.tolist())

X_train, X_test, y_train, y_test = train_test_split(train_tfidf, train_y, test_size=0.33, random_state=14)
model_tf = naive_bayes.MultinomialNB()
model_tf.fit(X_train, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [None]:
print(X_train)

In [None]:
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

In [None]:
y_pred = model_tf.predict(X_test)

# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test, y_pred)
np.set_printoptions(precision=2)

# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names,
                      title='Confusion matrix, without normalization')
plt.show()

In [None]:
import re
c_tf = make_pipeline(tfidf_vec, model_tf)

split_expression = lambda s: re.split(r'\W+', s)
explainer = LimeTextExplainer(class_names=class_names, split_expression=split_expression)

In [None]:
comp = y_test.to_frame()
comp['idx'] = comp.index.values
comp['pred'] = y_pred
comp.rename(columns={'author': 'real'}, inplace=True)

### Explaining errors

#### A --- True POE but classified in HPL

In [None]:
wrong_poe_hpl = comp[(comp.real ==0) & (comp.pred ==1)]
wrong_poe_hpl.shape
print(wrong_poe_hpl.idx)
idx = wrong_poe_hpl.idx.iloc[1]

print('We got', len(wrong_poe_hpl.idx), 'EAP sentences misclassified as HPL, matching the confusion matrix above')

In [None]:
c_tf.predict_proba

In [None]:
tokenizer = lambda doc: re.compile(r"(?u)\b\w\w+\b").findall(doc)
# Named explainer_tf because the cells below refer to the explainer under that name
explainer_tf = LimeTextExplainer(class_names=class_names, split_expression=tokenizer)
exp = explainer_tf.explain_instance(train_df['text'][idx], c_tf.predict_proba, num_features=6)

This error is caused by the use of Ancient Greek words. Could the model be improved to handle them?

In [None]:
idx = wrong_poe_hpl.idx.iloc[3]
exp = explainer_tf.explain_instance(train_df['text'][idx], c_tf.predict_proba, num_features=4, top_labels=2)
exp.show_in_notebook(text=train_df['text'][idx], labels=(0,1))

OK, a very difficult case: the sentence has only three words, which is not enough to classify it properly. No improvement seems possible.

#### B. --- True POE but classified in MWS

In [None]:
wrong_poe_mws = comp[(comp.real ==0) & (comp.pred ==2)]
print(wrong_poe_mws.shape)
idx = wrong_poe_mws.idx.iloc[12]

In [None]:
exp = explainer_tf.explain_instance(train_df['text'][idx], c_tf.predict_proba, num_features=4, top_labels=3)
exp.show_in_notebook(text=train_df['text'][idx], labels=(0,1))

OK, this text contains anaphora; the model could possibly be improved with an anaphora feature.

In [None]:
idx = wrong_poe_mws.idx.iloc[18]
exp = explainer_tf.explain_instance(train_df['text'][idx], c_tf.predict_proba, num_features=4, top_labels=3)
exp.show_in_notebook(text=train_df['text'][idx], labels=(0,1,2))

OK, probabilities (EAP and MWS) are very close. Possible to improve the model.

#### C. --- True MWS but classified in HPL

In [None]:
wrong_mws_hpl = comp[(comp.real ==2) & (comp.pred ==1)]
print(wrong_mws_hpl.shape)
idx = wrong_mws_hpl.idx.iloc[8]

In [None]:
exp = explainer_tf.explain_instance(train_df['text'][idx], c_tf.predict_proba, num_features=4, top_labels=3)
exp.show_in_notebook(text=train_df['text'][idx], labels=(0,1,2))

OK, probabilities (HPL and MWS) are very close. Possible to improve the model.

In [None]:
idx = wrong_mws_hpl.idx.iloc[5]
exp = explainer_tf.explain_instance(train_df['text'][idx], c_tf.predict_proba, num_features=4, top_labels=3)
exp.show_in_notebook(text=train_df['text'][idx], labels=(0,1,2))

OK, probabilities (EAP, HPL, MWS) are all very close. Possible to improve the model (using a repetition pattern?).
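
Several of the cases above come down to class probabilities being "very close". One way to find such cases systematically, rather than by picking indices by hand, is to compute the margin between the two most probable classes. The helper `top2_margin` and the example matrix below are hypothetical; in the notebook, the input would be the output of something like `c_tf.predict_proba`.

```python
import numpy as np

def top2_margin(proba):
    """Difference between the highest and second-highest class probability per row."""
    ordered = np.sort(proba, axis=1)          # ascending within each row
    return ordered[:, -1] - ordered[:, -2]

# Made-up probabilities for two sentences over the classes (EAP, HPL, MWS).
proba = np.array([[0.50, 0.45, 0.05],
                  [0.90, 0.05, 0.05]])
margins = top2_margin(proba)
most_ambiguous = int(np.argmin(margins))      # row with the closest top-2 classes
```

Sorting misclassified sentences by this margin puts the borderline cases first; those are the ones where a small model improvement is most likely to flip the prediction.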

## <font color='red'>2nd example</font>

https://marcotcr.github.io/lime/tutorials/Lime%20-%20basic%20usage%2C%20two%20class%20case.html

### 1st step : Fetching data, training a classifier

For simplicity, we'll use a **2-class subset**: atheism and christianity

In [None]:
from __future__ import print_function  # must come first in the cell; a no-op on Python 3
import lime
import numpy as np
import sklearn
import sklearn.ensemble
import sklearn.feature_extraction.text
import sklearn.metrics

In [None]:
from sklearn.datasets import fetch_20newsgroups
categories = ['alt.atheism', 'soc.religion.christian']
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories)
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories)
class_names = ['atheism', 'christian']

Let's use the **tfidf vectorizer**, commonly used for text.

In [None]:
vectorizer = sklearn.feature_extraction.text.TfidfVectorizer(lowercase=False)
train_vectors = vectorizer.fit_transform(newsgroups_train.data)
test_vectors = vectorizer.transform(newsgroups_test.data)

Now, let's say we want to use **random forests for classification**. It's usually hard to understand what random forests are doing, especially with many trees.

In [None]:
rf = sklearn.ensemble.RandomForestClassifier(n_estimators=500)
rf.fit(train_vectors, newsgroups_train.target)

In [None]:
pred = rf.predict(test_vectors)
sklearn.metrics.f1_score(newsgroups_test.target, pred, average='binary')

We see that this classifier achieves a very high F score.

### 2nd step : Explaining predictions using lime

Lime explainers assume that **classifiers act on raw text**, but **sklearn classifiers** act on a **vectorized representation of texts**. For this purpose, we use sklearn's pipeline, which implements predict_proba on lists of raw text.

In [None]:
from lime import lime_text
from sklearn.pipeline import make_pipeline
c = make_pipeline(vectorizer, rf)

In [None]:
print(c.predict_proba([newsgroups_test.data[0]]))
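
To make the raw-text interface concrete, here is a minimal self-contained sketch on made-up data (the `toy_*` names, texts and labels are assumptions for illustration). Unlike the cell above, which wraps an already fitted vectorizer and forest, this sketch fits the whole pipeline in one call; either way, `predict_proba` ends up accepting a list of strings.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Made-up training data, only for illustration.
toy_texts = ["I believe in God", "The church prays",
             "There is no god", "Atheists doubt religion"]
toy_labels = [1, 1, 0, 0]

toy_pipeline = make_pipeline(
    TfidfVectorizer(lowercase=False),
    RandomForestClassifier(n_estimators=10, random_state=0),
)
toy_pipeline.fit(toy_texts, toy_labels)

# The pipeline takes a list of raw strings and returns one row of
# class probabilities per string.
toy_probs = toy_pipeline.predict_proba(["God is good", "no religion for me"])
print(toy_probs.shape)
```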

Now we create an explainer object. We pass the class_names as an argument for prettier display.

In [None]:
from lime.lime_text import LimeTextExplainer
import re
split_expression = lambda s: re.split(r'\W+', s)
explainer = LimeTextExplainer(class_names=class_names, split_expression=split_expression)
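
To see what this tokenizer hands to Lime, here is a quick sketch applying the same regex split to a made-up string (`split_on_non_word` is just a renamed copy of the lambda above). Note that punctuation at the end of the text leaves an empty string as the last token.

```python
import re

# Same splitting rule as the lambda passed to LimeTextExplainer above.
split_on_non_word = lambda s: re.split(r'\W+', s)

tokens = split_on_non_word("Hello, world!")
print(tokens)  # ['Hello', 'world', '']
```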

We then generate an explanation with at most 6 features for an arbitrary document in the test set.

In [None]:
idx = 83
exp = explainer.explain_instance(newsgroups_test.data[idx], c.predict_proba, num_features=6)
print('Document id: %d' % idx)
print('Probability(christian) =', c.predict_proba([newsgroups_test.data[idx]])[0,1])
print('True class: %s' % class_names[newsgroups_test.target[idx]])

The classifier got this example right (it predicted atheism).

In [None]:
# The explanation is presented below as a list of weighted features

exp.as_list()
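
`exp.as_list()` returns `(feature, weight)` pairs for the explained label. Below is a minimal sketch of how such weights can be added up for a chosen set of words. The weights are placeholders made up for illustration, not the output of this notebook, and `total_weight` is a hypothetical helper.

```python
# Placeholder (feature, weight) pairs in the shape returned by exp.as_list().
# These numbers are made up for illustration; they are NOT the notebook's output.
example_weights = [("Posting", -0.2), ("Host", -0.1), ("edu", -0.05)]

def total_weight(weighted_features, words):
    """Sum the weights of the given words (hypothetical helper)."""
    wanted = set(words)
    return sum(weight for feature, weight in weighted_features if feature in wanted)

removed = total_weight(example_weights, ["Posting", "Host"])
# Removing the words takes their contribution away, so the local linear
# model expects the class-1 probability to change by about -removed.
expected_shift = -removed
print(round(expected_shift, 2))
```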

These weighted features form a linear model that approximates the **behaviour of the random forest classifier in the vicinity of the test example**. Roughly, if we remove 'Posting' and 'Host' from the document, the prediction should move towards the opposite class (Christianity) by about 0.27 (the sum of the weights for both features). Let's see if this is the case.

In [None]:
print('Original prediction:', rf.predict_proba(test_vectors[idx])[0,1])
tmp = test_vectors[idx].copy()
tmp[0,vectorizer.vocabulary_['Posting']] = 0
tmp[0,vectorizer.vocabulary_['Host']] = 0
print('Prediction removing some features:', rf.predict_proba(tmp)[0,1])
print('Difference:', rf.predict_proba(tmp)[0,1] - rf.predict_proba(test_vectors[idx])[0,1])

Pretty close!
**The words that explain the model around this document seem very arbitrary** - not much to do with either Christianity or Atheism.
In fact, these are words that appear in the email headers (you will see this clearly soon), which **make distinguishing between the classes much easier.**
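
One way to check how much the model leans on header tokens is to strip the headers before training. `fetch_20newsgroups` accepts a `remove=('headers', 'footers', 'quotes')` argument for this. The sketch below shows the same idea by hand on a made-up message, assuming the header is everything before the first blank line; `strip_header` is a hypothetical helper, not part of the notebook.

```python
def strip_header(message):
    """Drop everything up to the first blank line (assumed to end the header)."""
    _header, separator, body = message.partition("\n\n")
    # If there is no blank line, treat the whole message as body.
    return body if separator else message

# Made-up message, only for illustration.
toy_message = (
    "From: someone@example.edu\n"
    "NNTP-Posting-Host: example.edu\n"
    "\n"
    "Is there a god?"
)
print(strip_header(toy_message))
```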

### 3rd step : Visualizing explanations

In [None]:
%matplotlib inline
fig = exp.as_pyplot_figure()

In [None]:
exp.show_in_notebook(text=False)
# exp.save_to_file('/tmp/oi.html')

In [None]:
# Note how the words that affect the classifier the most are all in the email header.
exp.show_in_notebook(text=True)

# Clustering documents using similarity features

https://github.com/dipanjanS/practical-machine-learning-with-python/blob/master/bonus%20content/feature%20engineering%20text%20data/Feature%20Engineering%20Text%20Data%20-%20Traditional%20Strategies.ipynb