<img src="Images/PU.png" width="100%">

### Course Name : ML 501 Practical Machine Learning  
#### Notebook compiled by : Rajiv Kale, Consultant at Learning and Development  
**Important!** For internal circulation only

# Extracting features from the Text
In many machine learning applications, such as sentiment analysis, text data is used as an explanatory variable. The text must first be converted to a representation that captures as much of its information as possible in a feature vector.
<img src="Images/Text_Data.png" width="80%">


# The bag-of-words representation

Let’s assume that we are working on a document classification problem. The collection of all the documents is called the corpus.

In [2]:
X = ["Hackethon program was challenging and we enjoyed every bit of it",
     "Amazing initiative as Hackethon brings out best from innovaters", "The program had too much of a theory"]

In [3]:
len(X)


3

scikit-learn offers three encoders for text. CountVectorizer scores words by their count. HashingVectorizer applies a hash function to each word so that the vector length is fixed in advance. TfidfVectorizer scores each word by its occurrence in the document, weighted by the inverse of its occurrence across all documents.
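
Of the three encoders, only HashingVectorizer is not demonstrated in this notebook, so here is a minimal sketch. The variable name `docs` and the value `n_features=16` are assumptions chosen to keep the output small; the library default is much larger.

```python
from sklearn.feature_extraction.text import HashingVectorizer

docs = ["Hackethon program was challenging and we enjoyed every bit of it",
        "Amazing initiative as Hackethon brings out best from innovaters",
        "The program had too much of a theory"]

# n_features fixes the number of columns up front (16 is an illustrative value)
hasher = HashingVectorizer(n_features=16)

# The hasher is stateless: there is no vocabulary to learn, so transform() works without fit()
X_hashed = hasher.transform(docs)
print(X_hashed.shape)
```

Because words are mapped to columns by a hash, different words can collide in the same column, and there is no vocabulary_ attribute to map a column back to a word.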

In [5]:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectorizer.fit(X)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [6]:
vectorizer.vocabulary_

{'amazing': 0,
 'and': 1,
 'as': 2,
 'best': 3,
 'bit': 4,
 'brings': 5,
 'challenging': 6,
 'enjoyed': 7,
 'every': 8,
 'from': 9,
 'hackethon': 10,
 'had': 11,
 'initiative': 12,
 'innovaters': 13,
 'it': 14,
 'much': 15,
 'of': 16,
 'out': 17,
 'program': 18,
 'the': 19,
 'theory': 20,
 'too': 21,
 'was': 22,
 'we': 23}

In [7]:
X_bag_of_words = vectorizer.transform(X)
X_bag_of_words


<3x24 sparse matrix of type '<class 'numpy.int64'>'
	with 27 stored elements in Compressed Sparse Row format>

In [8]:
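# print the shape explicitly; only the last expression in a cell is displayed
print(X_bag_of_words.shape)
X_bag_of_words[1, 9]  # count of 'from' (column 9) in the second document

(3, 24)


1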

In [9]:
X_bag_of_words.toarray()

array([[0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0,
        1, 1],
       [1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0,
        0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1,
        0, 0]], dtype=int64)

#### Adding stop words

In [10]:
#my_list=['is','of']
my_list=['was','from', 'of', 'and']

In [11]:

vectorizer = CountVectorizer(stop_words=my_list)
vectorizer.fit(X)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None,
        stop_words=['was', 'from', 'of', 'and'], strip_accents=None,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, vocabulary=None)

In [12]:
vectorizer.vocabulary_

{'amazing': 0,
 'as': 1,
 'best': 2,
 'bit': 3,
 'brings': 4,
 'challenging': 5,
 'enjoyed': 6,
 'every': 7,
 'hackethon': 8,
 'had': 9,
 'initiative': 10,
 'innovaters': 11,
 'it': 12,
 'much': 13,
 'out': 14,
 'program': 15,
 'the': 16,
 'theory': 17,
 'too': 18,
 'we': 19}

In [13]:
X_bag_of_words = vectorizer.transform(X)
print(X_bag_of_words.shape)
X_bag_of_words.toarray()

(3, 20)


array([[0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1],
       [1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0]],
      dtype=int64)

# Finding Important Words in Text Using TF-IDF
TF-IDF stands for "Term Frequency, Inverse Document Frequency". It scores the importance of a word (or "term") in a document based on how often the word appears in that document and how many documents in the corpus contain it.

+ If a word appears frequently in a document, it's important. Give the word a high score.
+ But if a word appears in many documents, it's not a unique identifier. Give the word a low score.

Please find more math details [here](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction)

In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer()
tfidf_vectorizer.fit(X)

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words=None, strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

In [15]:
import numpy as np
np.set_printoptions(precision=2)

print(tfidf_vectorizer.transform(X).toarray())

[[0.   0.32 0.   0.   0.32 0.   0.32 0.32 0.32 0.   0.24 0.   0.   0.
  0.32 0.   0.24 0.   0.24 0.   0.   0.   0.32 0.32]
 [0.34 0.   0.34 0.34 0.   0.34 0.   0.   0.   0.34 0.26 0.   0.34 0.34
  0.   0.   0.   0.34 0.   0.   0.   0.   0.   0.  ]
 [0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.4  0.   0.
  0.   0.4  0.31 0.   0.31 0.4  0.4  0.4  0.   0.  ]]
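
The scores in the first row above can be reproduced by hand. The sketch below assumes the TfidfVectorizer defaults shown in the output of `In [14]` (smooth_idf=True, norm='l2', tokens of two or more characters), which correspond to idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalisation of each row. The names `docs`, `tokenized`, `doc_freq` and `tfidf` are illustrative only.

```python
import math
import re
from collections import Counter

docs = ["Hackethon program was challenging and we enjoyed every bit of it",
        "Amazing initiative as Hackethon brings out best from innovaters",
        "The program had too much of a theory"]

# Lowercase and keep tokens of two or more word characters
tokenized = [re.findall(r"\b\w\w+\b", d.lower()) for d in docs]
n_docs = len(docs)

# Document frequency: number of documents that contain each term
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf(tokens):
    """Return {term: L2-normalised tf-idf weight} for one document."""
    tf = Counter(tokens)
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    raw = {t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
           for t, c in tf.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

weights = tfidf(tokenized[0])
print(round(weights["challenging"], 2))  # term found in one document only
print(round(weights["program"], 2))      # term found in two documents
```

The word "program" appears in two of the three documents, so it receives a lower weight (0.24) than "challenging" (0.32), which appears in only one.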


# N-Grams
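An n-gram is a sequence of n consecutive tokens. Setting ngram_range=(2, 3) makes the vectorizer look for two-word and three-word sequences instead of single words.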
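
To make the idea concrete, here is a minimal pure-Python sketch of n-gram extraction. The helper `word_ngrams` is hypothetical and splits on whitespace only, so it is a simplification of what CountVectorizer does (the default tokenizer also drops single-character tokens and punctuation).

```python
def word_ngrams(text, n_min, n_max):
    """Return all runs of n consecutive words, for n_min <= n <= n_max."""
    tokens = text.lower().split()
    grams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of width n across the token list
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

grams = word_ngrams("The program had too much", 2, 3)
print(grams)
```

This yields four bigrams followed by three trigrams for the five-word input.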

In [16]:
Ngram_vectorizer = CountVectorizer(ngram_range=(2, 3))
Ngram_vectorizer.fit(X)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(2, 3), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [17]:
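# get_feature_names() was removed in scikit-learn 1.2; on current versions use
# get_feature_names_out(), which returns a NumPy array (hence .tolist())
Ngram_vectorizer.get_feature_names_out().tolist()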

['amazing initiative',
 'amazing initiative as',
 'and we',
 'and we enjoyed',
 'as hackethon',
 'as hackethon brings',
 'best from',
 'best from innovaters',
 'bit of',
 'bit of it',
 'brings out',
 'brings out best',
 'challenging and',
 'challenging and we',
 'enjoyed every',
 'enjoyed every bit',
 'every bit',
 'every bit of',
 'from innovaters',
 'hackethon brings',
 'hackethon brings out',
 'hackethon program',
 'hackethon program was',
 'had too',
 'had too much',
 'initiative as',
 'initiative as hackethon',
 'much of',
 'much of theory',
 'of it',
 'of theory',
 'out best',
 'out best from',
 'program had',
 'program had too',
 'program was',
 'program was challenging',
 'the program',
 'the program had',
 'too much',
 'too much of',
 'was challenging',
 'was challenging and',
 'we enjoyed',
 'we enjoyed every']

In [18]:
Ngram_vectorizer.transform(X).toarray()

array([[0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1,
        1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1,
        1],
       [1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0,
        0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0,
        0]], dtype=int64)

# SMS Spam Collection Data Set


The dataset is available at the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection). It is a collection of more than **5 thousand SMS phone messages**.
<img src="Images/spam.jpg" width="80%">

In [19]:
import pandas as pd
import numpy as np
import seaborn as sns

In [20]:
import matplotlib.pyplot as plt
%matplotlib inline

In [21]:
sms = pd.read_csv('./Datasets/SMSSpamCollection', sep='\t', names=["label", "message"])
sms.head()

  label                                            message
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...


In [22]:
# examine the class distribution
sms.label.value_counts()

ham     4825
spam     747
Name: label, dtype: int64

In [23]:
# convert label to a numerical variable
sms['label_num'] = sms.label.map({'ham':0, 'spam':1})

You may consider adding a feature column such as "length of a message" if you think it might help distinguish spam messages from ham messages.


#sms['length']=sms['message'].map(lambda text: len(text))
#sms['length'].hist()
#sms.hist(column='length', by='label', bins=50)


In [24]:
# check that the conversion worked
sms.head(10)

  label                                            message  label_num
0   ham  Go until jurong point, crazy.. Available only ...          0
1   ham                      Ok lar... Joking wif u oni...          0
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...          1
3   ham  U dun say so early hor... U c already then say...          0
4   ham  Nah I don't think he goes to usf, he lives aro...          0
5  spam  FreeMsg Hey there darling it's been 3 week's n...          1
6   ham  Even my brother is not like to speak with me. ...          0
7   ham  As per your request 'Melle Melle (Oru Minnamin...          0
8  spam  WINNER!! As a valued network customer you have...          1
9  spam  Had your mobile 11 months or more? U R entitle...          1


In [25]:
X = sms.message
y = sms.label_num

In [26]:
# split X and y into training and testing sets
#from sklearn.cross_validation import train_test_split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

### Vectorizing our dataset

In [27]:
# instantiate the vectorizer
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()

In [28]:
# learn training data vocabulary, then use it to create a document-term matrix
vect.fit(X_train)
X_train_dtm = vect.transform(X_train)
X_train_dtm

<4179x7456 sparse matrix of type '<class 'numpy.int64'>'
	with 55209 stored elements in Compressed Sparse Row format>

In [29]:
# transform testing data (using fitted vocabulary) into a document-term matrix
X_test_dtm = vect.transform(X_test)
X_test_dtm

<1393x7456 sparse matrix of type '<class 'numpy.int64'>'
	with 17604 stored elements in Compressed Sparse Row format>
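
The test matrix above has the same 7456 columns as the training matrix because transform() reuses the vocabulary learned by fit(); words that appear only in the test data are dropped. A minimal sketch with made-up sentences (`train_docs`, `test_docs` and `vect_demo` are illustrative names, not part of the SMS workflow):

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["free entry to win", "call me later"]
test_docs = ["free pizza later"]  # "pizza" never appears in train_docs

vect_demo = CountVectorizer()
vect_demo.fit(train_docs)                   # vocabulary comes from training text only
test_dtm = vect_demo.transform(test_docs)   # unseen words are silently ignored

print(sorted(vect_demo.vocabulary_))
print(test_dtm.toarray())
```

Only the columns for "free" and "later" are non-zero; "pizza" has no column at all.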

# Machine Learning 

In [30]:
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()

In [31]:
# train the model using X_train_dtm 
nb.fit(X_train_dtm, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [32]:
# make class predictions for X_test_dtm
y_pred_class = nb.predict(X_test_dtm)

In [33]:
# calculate accuracy of class predictions
from sklearn import metrics
metrics.accuracy_score(y_test, y_pred_class)

0.9885139985642498

In [34]:
# print the confusion matrix
metrics.confusion_matrix(y_test, y_pred_class)

array([[1203,    5],
       [  11,  174]], dtype=int64)
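
The confusion matrix above is laid out with actual classes as rows and predicted classes as columns, so it reads [[TN, FP], [FN, TP]]. As a worked check, the sketch below recomputes the accuracy from those four counts and also derives precision and recall for the spam class. The variable names are illustrative.

```python
# Counts copied from the confusion matrix: rows are actual, columns are predicted
tn, fp = 1203, 5    # actual ham:  predicted ham, predicted spam
fn, tp = 11, 174    # actual spam: predicted ham, predicted spam

accuracy = (tp + tn) / (tn + fp + fn + tp)
precision = tp / (tp + fp)   # share of predicted spam that really is spam
recall = tp / (tp + fn)      # share of actual spam that was caught

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
```

The accuracy matches the value returned by metrics.accuracy_score, while the lower recall reflects the 11 spam messages that slipped through as ham.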

In [35]:
# print message text for the false positives (ham incorrectly classified as spam)
X_test[y_test < y_pred_class]

574               Waiting for your call.
3375             Also andros ice etc etc
45      No calls..messages..missed calls
3415             No pic. Please re-send.
1988    No calls..messages..missed calls
Name: message, dtype: object

In [36]:
# print message text for the false negatives (spam incorrectly classified as ham)
X_test[y_test > y_pred_class]

3132    LookAtMe!: Thanks for your purchase of a video...
5       FreeMsg Hey there darling it's been 3 week's n...
3530    Xmas & New Years Eve tickets are now on sale f...
684     Hi I'm sue. I am 20 years old and work as a la...
1875    Would you like to see my XXX pics they are so ...
1893    CALL 09090900040 & LISTEN TO EXTREME DIRTY LIV...
4298    thesmszone.com lets you send free anonymous an...
4949    Hi this is Amy, we will be sending you a free ...
2821    INTERFLORA - It's not too late to order Inter...
2247    Hi ya babe x u 4goten bout me?' scammers getti...
4514    Money i have won wining number 946 wot do i do...
Name: message, dtype: object

In [37]:
# example false negative
X_test[3132]

"LookAtMe!: Thanks for your purchase of a video clip from LookAtMe!, you've been charged 35p. Think you can do better? Why not send a video in a MMSto 32323."

# Time for Testing 

In [41]:
# example text for model testing
simple_test = ["Frre entry to Awesome orbit session"]

In [42]:
X_temp = vect.transform(simple_test)
X_temp.toarray()

array([[0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [43]:
nb.predict(X_temp)

array([1], dtype=int64)

In [44]:
nb.predict(X_temp)[0]

1

#### Play with this at home

#### CMU has released a dataset of 1 lakh (100,000) tweets with labels (positive tweet OR negative tweet). Use it to get familiar with the bag-of-words approach

More on tfidf with small example

#### How to use Keras for similar work

Bag-of-Words with Keras
The Keras Python library for deep learning also provides tools for encoding text using the bag-of-words model in the Tokenizer class.

As above, the encoder must be trained on source documents and then can be used to encode training data, test data and any other data in the future. The API also has the benefit of performing basic tokenization prior to encoding the words.

The snippet below demonstrates how to train and encode some small text documents using the Keras API and the ‘count’ type scoring of words. 


In [None]:
from keras.preprocessing.text import Tokenizer
# define 5 documents
docs = ['Well done!',
        'Good work',
        'Great effort',
        'nice work',
        'Excellent!']
# create the tokenizer
t = Tokenizer()
# fit the tokenizer on the documents
t.fit_on_texts(docs)
# summarize what was learned
print("\nWord Counts=")
print(t.word_counts)
print("\nDocumentCount=")
print(t.document_count)
print("\nWord Index=")
print(t.word_index)
print("\nWord Docs=")
print(t.word_docs)
# encode documents as a count matrix (one row per document, one column per word index)
encoded_docs = t.texts_to_matrix(docs, mode='count')
print("\nEncoded Docs=")
print(encoded_docs)



In [None]:
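# How to use on test text? Reuse the fitted tokenizer; words it has not seen
# during fitting (here 'Very' and 'Amazing') are ignored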
test_docs = ['Very Well done!',
        'Amazing work',
        'effort Great',
        'nice work',
        'Excellent! work']

encoded_test_docs = t.texts_to_matrix(test_docs, mode='count')
print("\nEncoded Test Docs=")
print(encoded_test_docs)

##### Note: Word order in the original sentence is not maintained in the representation
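
A quick way to see this, sketched with a plain word counter rather than Keras (the helper `bag_of_words` is hypothetical): two texts with the same words in a different order produce identical bags.

```python
from collections import Counter

def bag_of_words(text):
    """Count each lowercase word; positions are discarded."""
    return Counter(text.lower().split())

same = bag_of_words("Great effort") == bag_of_words("effort Great")
print(same)
```

This is why 'Great effort' and 'effort Great' in the test documents above cannot be told apart by a model trained on this representation.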