<img src="Images/PU.png" width="100%">

## Course Name : Machine Learning for Professionals (ML 701)    
#### Notebook compiled by : Bhushan Garware, Project Lead at Learning and Development  
**Important!** For internal circulation only

# Extracting features from the Text
In many machine learning applications, such as sentiment analysis, text data is used as the explanatory variable. Because most learning algorithms expect numeric input, the text must be converted to a representation that captures as much of its information as possible in a feature vector.
<img src="Images/Text_Data.png" width="80%">


# The bag-of-words representation

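Let's assume that we are working on a document classification problem. The collection of all the documents is called the corpus. In the bag-of-words representation, each document becomes a vector of word counts over the corpus vocabulary, and word order is ignored.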

In [None]:
X = ["Orbit program is important for all of us",
     "Orbit program is very interesting"]

In [None]:
len(X)

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectorizer.fit(X)

In [None]:
vectorizer.vocabulary_

In [None]:
X_bag_of_words = vectorizer.transform(X)
X_bag_of_words

In [None]:
X_bag_of_words.shape

In [None]:
X_bag_of_words.toarray()
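
To make the counts above concrete, the following is a minimal sketch that rebuilds the same document-term matrix by hand. It assumes the default `CountVectorizer` behaviour (lowercasing, tokens of two or more word characters, alphabetically sorted vocabulary); the names `tokenize`, `vocabulary`, and `matrix` are illustrative only and not part of scikit-learn.

```python
import re
from collections import Counter

docs = ["Orbit program is important for all of us",
        "Orbit program is very interesting"]

def tokenize(text):
    # Assumption: mimic CountVectorizer's default token pattern,
    # i.e. lowercase words of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized = [tokenize(d) for d in docs]

# Vocabulary: every distinct term, indexed in alphabetical order.
terms = sorted({t for tokens in tokenized for t in tokens})
vocabulary = {term: idx for idx, term in enumerate(terms)}

# One row per document, one column per vocabulary term.
matrix = []
for tokens in tokenized:
    counts = Counter(tokens)
    matrix.append([counts[t] for t in terms])

print(vocabulary)
for row in matrix:
    print(row)
```

Each row has one entry per vocabulary term, which is why the matrix has 2 rows and 10 columns for this corpus.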

### Adding stop words

In [None]:
# custom stop words to drop from the vocabulary
my_list = ['is', 'of']

In [None]:
vectorizer = CountVectorizer(stop_words=my_list)
vectorizer.fit(X)

In [None]:
vectorizer.vocabulary_

In [None]:
X_bag_of_words = vectorizer.transform(X)
print(X_bag_of_words.shape)
X_bag_of_words.toarray()

# Finding Important Words in Text Using TF-IDF
TF-IDF stands for "Term Frequency, Inverse Document Frequency". It is a way to score the importance of words (or "terms") in a document based on how frequently they appear in that document and in how many documents of the corpus they appear.

+ If a word appears frequently in a document, it's important. Give the word a high score.
+ But if a word appears in many documents, it's not a unique identifier. Give the word a low score.

The mathematical details are described in the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction).
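
The two rules above can be worked through by hand. The sketch below is not the library's code; it assumes the smoothed formula that scikit-learn documents for the `TfidfVectorizer` defaults, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalisation of each row. The helper names `tokenize` and `tfidf_row` are hypothetical.

```python
import math
import re
from collections import Counter

docs = ["Orbit program is important for all of us",
        "Orbit program is very interesting"]

def tokenize(text):
    # Assumption: lowercase tokens of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized = [tokenize(d) for d in docs]
vocab = sorted({t for tokens in tokenized for t in tokens})
n_docs = len(docs)

# Document frequency: in how many documents does each term occur?
df = Counter(t for tokens in tokenized for t in set(tokens))

# Smoothed inverse document frequency (assumed default formula).
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

def tfidf_row(tokens):
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in vocab]      # tf * idf
    norm = math.sqrt(sum(w * w for w in raw))      # L2 norm of the row
    return [w / norm for w in raw]

tfidf = [tfidf_row(tokens) for tokens in tokenized]

# Scores for the second document.
for term, score in zip(vocab, tfidf[1]):
    print(f"{term:12s} {score:.2f}")
```

In the second document, "very" and "interesting" (each found in one document only) score about 0.53, while "orbit", "program", and "is" (found in both documents) score about 0.38.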

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer()
tfidf_vectorizer.fit(X)

In [None]:
import numpy as np
np.set_printoptions(precision=2)

print(tfidf_vectorizer.transform(X).toarray())

# N-Grams
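An n-gram is a sequence of n consecutive tokens. Setting `ngram_range=(2, 3)` makes the vectorizer count two-word and three-word sequences instead of single words.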

In [None]:
Ngram_vectorizer = CountVectorizer(ngram_range=(2, 3))
Ngram_vectorizer.fit(X)

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
Ngram_vectorizer.get_feature_names_out()

In [None]:
Ngram_vectorizer.transform(X).toarray()
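
To see where such features come from, the sliding window over tokens can be sketched in plain Python. This is an illustration only: `word_ngrams` is a hypothetical helper, and it splits on whitespace rather than using the vectorizer's own tokenizer.

```python
def word_ngrams(text, n_min, n_max):
    """Return all word n-grams of length n_min..n_max, shortest first."""
    tokens = text.lower().split()
    grams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of n tokens across the sentence.
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

grams = word_ngrams("Orbit program is very interesting", 2, 3)
for g in grams:
    print(g)
```

A five-word sentence yields four bigrams and three trigrams, so seven features in total.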

# SMS Spam Collection Data Set


The dataset is available at the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection). It is a collection of more than **5 thousand SMS phone messages**.
<img src="Images/spam.jpg" width="80%">

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
sms = pd.read_csv('./Datasets/SMSSpamCollection', sep='\t', names=["label", "message"])
sms.head()

In [None]:
# examine the class distribution
sms.label.value_counts()

In [None]:
# convert label to a numerical variable
sms['label_num'] = sms.label.map({'ham':0, 'spam':1})

In [None]:
# check that the conversion worked
sms.head(10)

In [None]:
X = sms.message
y = sms.label_num

In [None]:
# split X and y into training and testing sets
# (sklearn.cross_validation was removed; train_test_split lives in model_selection)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

### Vectorizing our dataset

In [None]:
# instantiate the vectorizer
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()

In [None]:
# learn training data vocabulary, then use it to create a document-term matrix
vect.fit(X_train)
X_train_dtm = vect.transform(X_train)
X_train_dtm

In [None]:
# transform testing data (using fitted vocabulary) into a document-term matrix
X_test_dtm = vect.transform(X_test)
X_test_dtm

# Machine Learning 

In [None]:
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()

In [None]:
# train the model using X_train_dtm 
nb.fit(X_train_dtm, y_train)

In [None]:
# make class predictions for X_test_dtm
y_pred_class = nb.predict(X_test_dtm)

In [None]:
# calculate accuracy of class predictions
from sklearn import metrics
metrics.accuracy_score(y_test, y_pred_class)

In [None]:
# print the confusion matrix
metrics.confusion_matrix(y_test, y_pred_class)

In [None]:
# print message text for the false positives (ham incorrectly classified as spam)
X_test[y_test < y_pred_class]

In [None]:
# print message text for the false negatives (spam incorrectly classified as ham)
X_test[y_test > y_pred_class]

In [None]:
# example false negative (3132 is the original row label, not a position)
X_test.loc[3132]
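
The comparisons `y_test < y_pred_class` and `y_test > y_pred_class` work because the labels are encoded as 0 (ham) and 1 (spam). The sketch below shows the idea on made-up toy labels, not on the SMS data; `y_true` and `y_pred` are illustrative names.

```python
# Toy labels (invented for illustration): 0 = ham, 1 = spam
y_true = [0, 0, 1, 1, 0, 1]
y_pred = [0, 1, 1, 0, 0, 1]

pairs = list(zip(y_true, y_pred))

# true 0, predicted 1  ->  true < predicted  ->  false positive
false_positives = [i for i, (t, p) in enumerate(pairs) if t < p]

# true 1, predicted 0  ->  true > predicted  ->  false negative
false_negatives = [i for i, (t, p) in enumerate(pairs) if t > p]

print("false positives at positions:", false_positives)
print("false negatives at positions:", false_negatives)
```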

# Time for Testing 

In [None]:
# example text for model testing
simple_test = [" Awesome orbit session"]

In [None]:
X_temp = vect.transform(simple_test)
X_temp.toarray()

In [None]:
nb.predict(X_temp)