## Naive Bayes Classifier


Text data is usually represented as strings, which are sequences of characters. The type and length of the text vary across projects.

By its nature, text is very different from numeric features, so we need to process it before we can analyse it or apply machine learning algorithms to it.

This notebook covers the Naive Bayes classifier, a simple technique that works well on labelled text data.

In [1]:
import os

In [2]:
os.getcwd()

'/Users/ariedamuco/Dropbox (CEU Econ)/ML-for-NLP/code/Text'

In [3]:
os.chdir("/Users/ariedamuco/Dropbox (CEU Econ)/ML-for-NLP")

In [4]:
with open('Inputs/smsspamcollection/SMSSpamCollection') as f:  # close the file handle after reading
    file = f.readlines()[0:5]

In [5]:
file

['ham\tGo until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...\n',
 'ham\tOk lar... Joking wif u oni...\n',
 "spam\tFree entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's\n",
 'ham\tU dun say so early hor... U c already then say...\n',
 "ham\tNah I don't think he goes to usf, he lives around here though\n"]

In [None]:
#now open the same file in pandas
import pandas as pd
data = pd.read_csv('Inputs/smsspamcollection/SMSSpamCollection', sep='\t',names=["label", "message"])

In [None]:
data.head()

In [None]:
data.info()

In [None]:
data.describe()

In [None]:
data.groupby('label').describe()

In [None]:
data['length'] = data['message'].apply(len)

In [None]:
data.head()

In [None]:
data.length

In [None]:
data['length'].plot(bins=100, kind='hist', color='red')

In [None]:
data['length'].describe()

In [None]:
# 910 characters: let's see what this message looks like; .iloc[0] shows the full text
data[data['length'] == 910]['message'].iloc[0]

In [None]:
data.hist(column='length', by='label',color='blue', bins=50, figsize=(10,4), range=[0, 250])

## Text Pre-Processing

Classification algorithms need a numerical feature vector in order to perform the classification task.
There are many methods to convert a corpus to a vector format. The simplest is the bag-of-words approach, where each unique word in the corpus is assigned one number (its column index), and each text is represented by the counts of those words.

### Bag of Words Approach (BOW)

The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). When using this representation, we discard most of the structure of the input text and count the frequency of each word in the text. Disregarding the structure and counting only word occurrences leads to the
mental image of representing text as a `bag`. 

Computing the bag-of-words representation for a corpus of documents
consists of the following three steps: 

i) Tokenization: Split each document into words (called `tokens`), for example by splitting on whitespace and
punctuation.

ii) Vocabulary building: Collect a vocabulary of all words that appear
in any of the documents.

iii) Encoding: For each document, we count how many times each word appears.
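The three steps above can be sketched in plain Python on a toy corpus. This is only an illustration of the idea, not what scikit-learn does internally; the names `toy_corpus`, `tokenize`, `vocabulary`, `encode` and `bow_matrix` are made up for this example.

```python
import re
from collections import Counter

# Hypothetical toy corpus, not taken from the SMS data
toy_corpus = ["Free entry, free tickets!", "Ok, see you there"]

def tokenize(document):
    # i) Tokenization: lowercase, then keep runs of word characters
    return re.findall(r"\w+", document.lower())

# ii) Vocabulary building: map every distinct token to a column index
vocabulary = {}
for document in toy_corpus:
    for token in tokenize(document):
        vocabulary.setdefault(token, len(vocabulary))

def encode(document):
    # iii) Encoding: count how often each vocabulary word occurs in the document
    counts = Counter(tokenize(document))
    return [counts[token] for token in vocabulary]

bow_matrix = [encode(document) for document in toy_corpus]
print(vocabulary)
print(bow_matrix)
```

Each row of `bow_matrix` is one document and each column is one vocabulary word, which is the same layout that `CountVectorizer` produces (there as a sparse matrix).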


For this purpose, we will use the NLTK library (alternatively, you can load the stop-word list that I have provided).
NLTK and spaCy are standard Python libraries for processing text and have many useful features. We'll only use some of the basic ones here.

In [None]:
from nltk.corpus import stopwords as nltk_stopwords
# Store the list of English stop words under the name used by the cells below
stopwords = nltk_stopwords.words('english')
stopwords

In [None]:
#Alternatively 
stopwords=open('Inputs/nltk_stopwords.txt').readlines()

In [None]:
stopwords

In [None]:
stopwords=[element.replace("\n", "") for element in stopwords]

In [None]:
stopwords[0:3]

In [None]:
import re

In [None]:
#https://stackoverflow.com/questions/265960/best-way-to-strip-punctuation-from-a-string

In [None]:
string_original = "string. With. Punctuation?"
string_replaced = re.sub(r'\W',' ', string_original)

In [None]:
string_replaced

In [None]:
def remove_punct_tokenize(text):
    text = re.sub(r'[^\w\s]','', text)
    text = text.lower()   
    return text.split()  

Reminder 1: `\w` matches a word character (letters, digits and the underscore; `[0-9a-zA-Z_]` for ASCII text), `\W` matches any non-word character, and `\s` matches whitespace. See http://www.pyregex.com/
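As a quick illustration of these character classes, the following sketch applies them to a made-up string (`sample` and the other variable names are hypothetical):

```python
import re

sample = "Free entry: call 0800_123 now!"

words = re.findall(r"\w+", sample)        # runs of word characters
non_words = re.findall(r"\W", sample)     # every single non-word character
pieces = re.split(r"\s+", sample)         # split on runs of whitespace
cleaned = re.sub(r"[^\w\s]", "", sample)  # drop everything that is neither word char nor whitespace

print(words)
print(non_words)
print(pieces)
print(cleaned)
```

Note that the underscore counts as a word character, so `0800_123` stays in one piece, while the colon and the exclamation mark are removed by the last substitution.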

Reminder 2: You can also use the NLTK library to do the tokenization.

Let's check what we have done.

In [None]:
remove_punct_tokenize("let's try this one....")

In [None]:
def remove_stopwords(text):
    # Keep only the tokens that are not stop words and join them back into one string
    tokens = [element for element in remove_punct_tokenize(text) if element not in stopwords]
    return " ".join(tokens)

In [None]:
remove_stopwords("let's try this one....")

In [None]:
data['message'].apply(remove_stopwords).head()

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
#CountVectorizer will convert text into token counts
bow_transformer = CountVectorizer()

In [None]:
bow_transformer

In [None]:
bow_transformer = CountVectorizer(preprocessor = remove_stopwords).fit(data['message'])

In [None]:
bow_transformer

In [None]:
print (len(bow_transformer.vocabulary_))

In [None]:
message9 = data['message'][8]

In [None]:
message9

In [None]:
bow9 = bow_transformer.transform([message9])
print (bow9.shape)

In [None]:
type(bow9)

In [None]:
print(bow9)

Let's check which tokens are stored at column indices 217 and 2218, i.e. the entries (0, 217) and (0, 2218).

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print (len(bow_transformer.get_feature_names_out()))

In [None]:
print (bow_transformer.get_feature_names_out()[217])

In [None]:
print (bow_transformer.get_feature_names_out()[2218])

In [None]:
data[data['length'] == 910]

In [None]:
message_romeo = data['message'][1085]

In [None]:
message_romeo

In [None]:
bow_romeo = bow_transformer.transform([message_romeo])

In [None]:
bow_romeo.shape

In [None]:
print (bow_romeo)

In [None]:
type(bow_transformer.get_feature_names_out())

In [None]:
# Print the column index of every vocabulary word that contains "love"
for index, word in enumerate(bow_transformer.get_feature_names_out()):
    if "love" in word:
        print (index, word)

In [None]:
# now transform the whole dataset
data_bow = bow_transformer.transform(data['message'])

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer

In [None]:
tfidf_transformer = TfidfTransformer().fit(data_bow)

In [None]:
tfidf9 = tfidf_transformer.transform(bow9)
print (tfidf9)

In [None]:
print (tfidf_transformer.idf_[bow_transformer.vocabulary_['claim']])

In [None]:
print (tfidf_transformer.idf_[bow_transformer.vocabulary_['love']])

In [None]:
data_tfidf = tfidf_transformer.transform(data_bow)
print (data_tfidf.shape)
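The cells above use `TfidfTransformer` without spelling out what it computes. As a rough sketch, assuming its default settings (`smooth_idf=True`, `norm='l2'`, raw counts as term frequency), each count is multiplied by `idf = ln((1 + n) / (1 + df)) + 1` and each row is then scaled to unit length. The toy matrix `counts` and the helper `tfidf_row` are made up for this example.

```python
import math

# Hypothetical toy count matrix: 3 documents (rows) x 3 terms (columns)
counts = [
    [1, 1, 0],
    [1, 0, 2],
    [1, 0, 0],
]
n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: in how many documents each term occurs
df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed inverse document frequency: rare terms get a larger weight
idf = [math.log((1 + n_docs) / (1 + df_j)) + 1 for df_j in df]

def tfidf_row(row):
    # Weight raw counts by idf, then scale the row to unit Euclidean length
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(value ** 2 for value in weighted))
    return [value / norm if norm else 0.0 for value in weighted]

tfidf = [tfidf_row(row) for row in counts]
print(idf)
print(tfidf)
```

The first term occurs in every document and gets the smallest weight, which is why a word that appears in many messages ends up with a lower `idf_` value than a word that appears in only a few.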

### Naive Bayes Classifier

Naive Bayes is one of the most practical machine learning algorithms. It performs very well on text data, it trains and predicts quickly, and it needs little storage. It is named after Bayes because it applies Bayes' theorem. It is called "naive" because all features are assumed to be independent of each other given the class. This is rarely the case; nevertheless, the algorithm often achieves very good accuracy in practice even when the independence assumption does not hold.
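To make the idea concrete, here is a minimal from-scratch sketch of a multinomial Naive Bayes classifier on word counts. It is not the scikit-learn implementation; the toy data `train` and the helpers `log_posterior` and `predict` are invented for illustration. It uses add-one (Laplace) smoothing, which corresponds to the default `alpha=1.0` of `MultinomialNB`.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training set of (tokens, label) pairs
train = [
    (["free", "prize", "win"], "spam"),
    (["win", "cash", "free"], "spam"),
    (["see", "you", "tonight"], "ham"),
    (["call", "you", "later"], "ham"),
    (["lunch", "tonight"], "ham"),
]

class_counts = Counter(label for _, label in train)
word_counts = defaultdict(Counter)
for tokens, label in train:
    word_counts[label].update(tokens)
vocabulary = {token for tokens, _ in train for token in tokens}

def log_posterior(tokens, label, alpha=1.0):
    # log P(label) + sum of log P(token | label); the sum is the "naive" independence assumption
    score = math.log(class_counts[label] / len(train))
    total = sum(word_counts[label].values())
    denominator = total + alpha * len(vocabulary)
    for token in tokens:
        if token not in vocabulary:
            continue  # ignore words never seen in training
        score += math.log((word_counts[label][token] + alpha) / denominator)
    return score

def predict(tokens):
    # Pick the class with the highest (unnormalised) log posterior
    return max(class_counts, key=lambda label: log_posterior(tokens, label))

print(predict(["free", "cash"]))
print(predict(["see", "you", "later"]))
```

Working with sums of logarithms instead of products of probabilities avoids numerical underflow when messages contain many words.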

In [None]:
from sklearn.naive_bayes import MultinomialNB

In [None]:
# MultinomialNB accepts sparse matrices directly, so no dense conversion is needed
spam_detect_model = MultinomialNB().fit(data_tfidf, data['label'])

In [None]:
all_predictions = spam_detect_model.predict(data_tfidf)
print (all_predictions)


In [None]:
true_val = data['label']

In [None]:
print (true_val)

In [None]:
#check what is the prediction for tfidf9
spam_detect_model.predict(tfidf9)

In [None]:
data['label'][8]

In [None]:
from sklearn.metrics import classification_report
print (classification_report(data['label'], all_predictions))

In [None]:
from sklearn.model_selection import train_test_split
msg_train, msg_test, label_train, label_test = train_test_split(data['message'], data['label'], test_size=0.2, random_state=1)

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
#create pipeline
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(preprocessor = remove_stopwords )),  # raw text to weighted TF-IDF scores
    ('classifier', MultinomialNB()),  #Naive Bayes classifier
])

In [None]:
#Now we can directly pass message text data and the pipeline will do our pre-processing for us!
pipeline.fit(msg_train, label_train)
predictions = pipeline.predict(msg_test)

In [None]:
len(label_test)

In [None]:
len(predictions)

In [None]:
print (classification_report(label_test,  predictions))

In [None]:
from sklearn.metrics import confusion_matrix
#tn, fp, fn, tp = confusion_matrix(label_test,predictions).ravel()
confusion_matrix(label_test, predictions)
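The numbers in the classification report can be derived from the four cells of the confusion matrix. The sketch below does this by hand on made-up labels (`y_true` and `y_pred` are hypothetical, not results from the SMS data), treating `spam` as the positive class. scikit-learn sorts the labels, so with `ham` and `spam` the matrix above is laid out as `[[tn, fp], [fn, tp]]`, with rows for the true labels and columns for the predictions.

```python
# Hypothetical true and predicted labels, not results from the SMS data
y_true = ["ham", "ham", "spam", "spam", "ham", "spam", "ham", "ham"]
y_pred = ["ham", "spam", "spam", "ham", "ham", "spam", "ham", "ham"]

positive = "spam"
pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == positive and p == positive)  # spam caught
fp = sum(1 for t, p in pairs if t != positive and p == positive)  # ham flagged as spam
fn = sum(1 for t, p in pairs if t == positive and p != positive)  # spam missed
tn = sum(1 for t, p in pairs if t != positive and p != positive)  # ham passed through

precision = tp / (tp + fp)   # share of flagged messages that really are spam
recall = tp / (tp + fn)      # share of spam messages that were flagged
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / len(pairs)

print(tn, fp, fn, tp)
print(precision, recall, f1, accuracy)
```

For a spam filter, a false positive (a legitimate message flagged as spam) is usually the more costly error, so precision on the `spam` class deserves particular attention.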

In [None]:
#Predict out of sample messages

In [None]:
pipeline.predict(["I am prince Ali, you win 2000$"])

In [None]:
pipeline.predict(["Hello, it's me, I was wondering if after all these years..."])


### References 

- Data (UC Irvine): https://archive.ics.uci.edu/ml/datasets/sms+spam+collection

- Precision and recall: https://en.wikipedia.org/wiki/Precision_and_recall

- Feature engineering: https://en.wikipedia.org/wiki/Feature_engineering

- Naive Bayes: https://en.wikipedia.org/wiki/Naive_Bayes_classifier and https://scikit-learn.org/stable/modules/naive_bayes.html

- Confusion matrix: https://en.wikipedia.org/wiki/Confusion_matrix