# Contents:

- 1) Introduction to Natural Language Processing
- 2) Sentiment Analysis
    - Model Selection in scikit-learn
    - Extracting features
        - Bag-of-words
    - Logistic Regression classification
    - Tfidf
    - N-gram
- 3) Text Classification
    - Using sklearn's NaiveBayes Classifier

# 1. Introduction to Natural Language Processing
Natural Language Processing (NLP) is a branch of data science concerned with systematically analyzing, understanding, and deriving information from text data. With NLP and its components, one can organize massive amounts of text, automate numerous tasks, and solve a wide range of problems such as automatic summarization, machine translation, named entity recognition, relationship extraction, sentiment analysis, speech recognition, and topic segmentation.

A popular use case, sentiment analysis of Amazon reviews, is a good way to see how text can be parsed and processed into something useful for machine learning.


# 2. Case Study: Sentiment Analysis

We will be working on a large dataset of reviews of unlocked mobile phones sold on Amazon.com, collected by Crawlers et al. in December 2016. The dataset consists of about 400 thousand reviews and can be used to find insights about reviews, ratings, prices, and the relationships between them.

#### Dataset Content 

Given below are the fields:

- Product Title
- Brand
- Price
- Rating
- Review text
- Number of people who found the review helpful

Our end goal here is to learn how to extract meaningful information from a subset of these reviews and build a machine learning model that can predict whether a reviewer liked or disliked a mobile phone.

In [1]:
import pandas as pd
import numpy as np

In [2]:
# Read in the data
df = pd.read_csv('Amazon_Unlocked_Mobile.csv', encoding="utf8")

# Shuffle the rows (frac=1.0 samples 100% of the rows in random order)
df = df.sample(frac=1.0, random_state=10)
df.head()
len(df)

413840

In [3]:
# Drop missing values
df.dropna(inplace=True)

# Remove any 'neutral' ratings equal to 3
# (.copy() avoids a possible SettingWithCopyWarning when the new column is added below)
df = df[df['Rating'] != 3].copy()

# Encode 4s and 5s as 1 (rated positively)
# Encode 1s and 2s as 0 (rated poorly)
df['Positively Rated'] = np.where(df['Rating'] > 3, 1, 0)
df

Unnamed: 0,Product Name,Brand Name,Price,Rating,Reviews,Review Votes,Positively Rated
34377,Apple iPhone 5c 8GB (Pink) - Verizon Wireless,Apple,194.99,1,"The phone needed a SIM card, would have been n...",1.0,0
248521,Motorola Droid RAZR MAXX XT912 M Verizon Smart...,Motorola,174.99,5,I was 3 months away from my upgrade and my Str...,3.0,1
167661,CNPGD [U.S. Office Extended Warranty] Smartwat...,CNPGD,49.99,1,an experience i want to forget,0.0,0
73287,Apple iPhone 7 Unlocked Phone 256 GB - US Vers...,Apple,922.00,5,GREAT PHONE WORK ACCORDING MY EXPECTATIONS.,1.0,1
277158,Nokia N8 Unlocked GSM Touch Screen Phone Featu...,Nokia,95.00,5,I fell in love with this phone because it did ...,0.0,1
100311,Blackberry Torch 2 9810 Unlocked Phone with 1....,BlackBerry,77.49,5,I am pleased with this Blackberry phone! The p...,0.0,1
251669,Motorola Moto E (1st Generation) - Black - 4 G...,Motorola,89.99,5,"Great product, best value for money smartphone...",0.0,1
279878,OtterBox 77-29864 Defender Series Hybrid Case ...,OtterBox,9.99,5,I've bought 3 no problems. Fast delivery.,0.0,1
406017,Verizon HTC Rezound 4G Android Smarphone - 8MP...,HTC,74.99,4,Great phone for the price...,0.0,1
302567,"RCA M1 Unlocked Cell Phone, Dual Sim, 5Mp Came...",RCA,159.99,5,My mom is not good with new technoloy but this...,4.0,1


In [4]:
df['Positively Rated'].mean()  # fraction of positive reviews: the classes are imbalanced

0.7482686025879323

# Model Selection in scikit-learn

In [5]:
from sklearn.model_selection import train_test_split

# Split data into train and test subsets (by default 75% train, 25% test)
X_train, X_test, y_train, y_test = train_test_split(df['Reviews'], 
                                                    df['Positively Rated'], 
                                                    random_state=0)

In [6]:
# Look at one review from the training set (position 10, i.e. the 11th review)
X_train.iloc[10]

"I own 2 of these, one is international version that I used during travels and another is T-Mobile's.T-Mobile's came with Android 2.3.5 and international one with 2.3.6, I liked 2.3.6 more, less glitchesand runs smoother,also battery efficient. Anyway,T-mobile applies a software update pretty right awayand now both run the same version. International has faster cpu (1.5Ghz) vs T-Mobile's 1Ghz, thoughhonestly I don't feel difference. I don't run games and use the usual set of applications - Google'sGmail,Maps,Talk, Facebook's client and messenger, Skype (which is not officially supported on SamsungExhibit / Wonder but works great) and few other social networking / informational apps. T-Mobile activationwas a blaze.What I like: screen is nice except in sunny places, surprisingly clear for TFT however my eyes like AMOLEDmore. Size is great if you can live with smaller screen. CPU speed is superb for both 1Ghz and 1.5Ghz. Battery lifeis fantastic.What I don't like: TFT screen,don't see any

In [7]:
# X_train size
X_train.size

231207

In [8]:
# X_test size
X_test.size

77070

# Extracting features from text files


Text documents are ordered sequences of words. To run machine learning algorithms on them, we need to convert the text into numerical feature vectors. We will use the bag-of-words model.

## Bag-of-words (BOW)
The BOW model allows us to represent text as numerical feature vectors. The idea behind BOW is quite simple and can be summarized as follows:
- 1) Create a vocabulary of unique tokens (or words) from the entire set of documents.
- 2) Construct a feature vector for each document that contains the counts of how often each word occurs in that document.

Since the unique words in each document represent only a small subset of all the words in the bag-of-words vocabulary, the feature vectors will consist of mostly zeros, which is why we call them sparse. For this reason we say that bags of words are typically <b>high-dimensional sparse datasets</b>.

Briefly, for our example: we segment each document into words (for English, by splitting on spaces), count the number of times each word occurs in each document, and finally assign each word an integer id. Each unique word in our dictionary corresponds to one feature (descriptive feature).
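
The two steps above can be sketched in plain Python, without scikit-learn. This is only an illustration of the idea, not how `CountVectorizer` is implemented: the names `tokenize`, `vocabulary` and `count_vectors` are made up for this example, and the tokenizer assumes lowercasing and words of at least two characters, which mirrors the default behavior of `CountVectorizer`.

```python
import re

docs = [
    'The sun is shining',
    'The weather is sweet',
    'The sun is shining and the weather is sweet',
]

def tokenize(text):
    # Assumed tokenizer: lowercase, keep runs of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

# Step 1: vocabulary of unique tokens, each mapped to an integer id
tokens_per_doc = [tokenize(d) for d in docs]
unique_tokens = sorted({tok for tokens in tokens_per_doc for tok in tokens})
vocabulary = {tok: i for i, tok in enumerate(unique_tokens)}

# Step 2: one count vector per document
count_vectors = []
for tokens in tokens_per_doc:
    row = [0] * len(vocabulary)
    for tok in tokens:
        row[vocabulary[tok]] += 1
    count_vectors.append(row)

print(vocabulary)
for row in count_vectors:
    print(row)
```

Most entries are zero even in this tiny example; with a vocabulary of tens of thousands of words the effect is far stronger, which is why such matrices are stored in a sparse format.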


### Transform words into vectors (CountVectorizer)
To construct a bag-of-words model based on the word counts in the respective documents, we can use the high-level <b>`CountVectorizer`</b> class implemented in `scikit-learn`. As we will see in the following code, `CountVectorizer` takes an array of text data, which can be documents or just sentences, and constructs the bag-of-words model for us:

In [9]:
from sklearn.feature_extraction.text import CountVectorizer

docs = np.array([
    'The sun is shining',
    'The weather is sweet',
    'The sun is shining and the weather is sweet'])

# Fit the CountVectorizer to the example documents (this learns the vocabulary)
vect1 = CountVectorizer().fit(docs)

# Transform the documents into a document-term matrix
bag = vect1.transform(docs)

In [10]:
vect1.vocabulary_

{'and': 0, 'is': 1, 'shining': 2, 'sun': 3, 'sweet': 4, 'the': 5, 'weather': 6}

In [11]:
vect1.get_feature_names()  # in scikit-learn >= 1.0, use get_feature_names_out() instead

['and', 'is', 'shining', 'sun', 'sweet', 'the', 'weather']

In [12]:
train1= bag.toarray()

In [13]:
train1

array([[0, 1, 1, 1, 0, 1, 0],
       [0, 1, 0, 0, 1, 1, 1],
       [1, 2, 1, 1, 1, 2, 1]], dtype=int64)

# Exercise

1) Apply `CountVectorizer` to the training data

2) Determine:
- The number of features
- The shape of the sparse matrix

In [14]:
# Your code is here
from sklearn.feature_extraction.text import CountVectorizer

# Fit the CountVectorizer to the training data 
vect1=CountVectorizer().fit(X_train)

# transform the documents in the training data to a document-term matrix. 
bag = vect1.transform(X_train)

In [15]:
vect1.vocabulary_

{'it': 26074,
 'an': 4929,
 'otterbox': 33635,
 'what': 51628,
 'do': 15833,
 'you': 52893,
 'expect': 18613,
 'sturdy': 45292,
 'and': 4962,
 'makes': 28973,
 'the': 47008,
 'tablet': 46182,
 'easier': 16644,
 'to': 47699,
 'hold': 23637,
 'on': 33050,
 'very': 50519,
 'good': 21809,
 'phone': 35063,
 'was': 51235,
 'in': 24683,
 'great': 22209,
 'condition': 11918,
 'functions': 20870,
 'well': 51539,
 'only': 33132,
 'problem': 36816,
 'is': 25971,
 'with': 52022,
 'battery': 7124,
 'life': 27853,
 'but': 9055,
 'that': 46980,
 'given': 21579,
 'when': 51670,
 'buy': 9116,
 'used': 49923,
 'previous': 36614,
 'owner': 33956,
 'most': 30794,
 'likely': 28008,
 'wore': 52220,
 'down': 16069,
 'all': 4501,
 'also': 4635,
 'shipped': 42554,
 'extreamly': 18902,
 'fast': 19255,
 'amazing': 4754,
 'had': 22671,
 'two': 48731,
 'of': 32800,
 'them': 47064,
 'from': 20646,
 'how': 23878,
 'much': 31030,
 'enjoyed': 17389,
 'simplicity': 42939,
 'highly': 23481,
 'recommended': 38849,
 'for'

In [16]:
vect1.get_feature_names()

['00',
 '000',
 '0000',
 '00000',
 '000000',
 '0000000',
 '00000000000',
 '0000from',
 '0001',
 '0004',
 '000ma',
 '000mah',
 '000mh',
 '000restricted',
 '0051',
 '006',
 '007',
 '00am',
 '00bucks',
 '00for',
 '00it',
 '00k',
 '00now',
 '00pm',
 '00x2',
 '01',
 '011',
 '013435003182980',
 '014',
 '0155379',
 '016',
 '016s',
 '019s',
 '02',
 '02may13',
 '02mbps',
 '03',
 '032g',
 '0330',
 '03pm',
 '04',
 '0400',
 '044',
 '04pm',
 '04th',
 '04the',
 '05',
 '050',
 '0500tkx',
 '050mms',
 '050prot',
 '051',
 '056',
 '0572013',
 '0577454',
 '05788690',
 '05th',
 '05the',
 '05using',
 '06',
 '061',
 '062',
 '0630',
 '066',
 '06pm',
 '07',
 '0780',
 '07am',
 '07gb',
 '07nov2015',
 '08',
 '0804245',
 '0808',
 '0825',
 '0829',
 '087',
 '087581287',
 '08in',
 '08mms',
 '08this',
 '09',
 '0909853',
 '09on',
 '0_1439_7',
 '0_150511',
 '0_print_120716',
 '0_user_manual',
 '0a',
 '0also',
 '0b3tbzlidhq7dce1bv05qdefaota',
 '0bj7255rf1f1a1118w65',
 '0c',
 '0cant',
 '0cesqfjad',
 '0dislikes',
 '0expand

In [17]:
train1 = bag  # keep the sparse matrix; a dense copy of this size would need far too much memory

# Logistic Regression classification

We will train a logistic regression model to classify the Amazon reviews as positive or negative, using the bag-of-words feature matrix.

In [18]:
from sklearn.linear_model import LogisticRegression

# Train the model
model = LogisticRegression()
model.fit(train1, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [19]:
from sklearn.metrics import roc_auc_score

# Predict the transformed test documents
predictions = model.predict(vect1.transform(X_test))

print('AUC: ', roc_auc_score(y_test, predictions))

AUC:  0.9231168526742961
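
Note that the cell above passes hard 0/1 predictions to `roc_auc_score`. ROC AUC is normally computed from continuous scores; with hard labels it reduces to the average of the true-positive and true-negative rates. The sketch below illustrates the difference on made-up numbers, using a small hand-written `pairwise_auc` helper (hypothetical, not part of scikit-learn) that computes AUC as the fraction of positive/negative pairs that are ranked correctly.

```python
def pairwise_auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels and predicted probabilities, for illustration only
y_true = [1, 1, 1, 0, 0, 0]
probs = [0.9, 0.6, 0.4, 0.55, 0.2, 0.1]
hard = [int(p >= 0.5) for p in probs]  # thresholded 0/1 predictions

auc_from_scores = pairwise_auc(y_true, probs)
auc_from_labels = pairwise_auc(y_true, hard)

# Average of true-positive rate and true-negative rate for the hard labels
tpr = sum(h == 1 for y, h in zip(y_true, hard) if y == 1) / y_true.count(1)
tnr = sum(h == 0 for y, h in zip(y_true, hard) if y == 0) / y_true.count(0)

print(auc_from_scores, auc_from_labels, (tpr + tnr) / 2)
```

With scikit-learn, continuous scores for a fitted `LogisticRegression` can be obtained from `model.decision_function(...)` or `model.predict_proba(...)[:, 1]` and passed to `roc_auc_score` in place of `predictions`.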


In [20]:
model.coef_

array([[-0.51217593, -0.12377712,  0.06281552, ...,  0.34869348,
         0.257767  ,  0.257767  ]])

In [21]:
model.coef_[0].argsort()

array([52365, 26658, 52373, ..., 18341, 18306, 18305], dtype=int64)

In [22]:
# get the feature names as numpy array
feature_names = np.array(vect1.get_feature_names())

# Sort the coefficients from the model
sorted_coef_index = model.coef_[0].argsort()

# Find the 10 smallest and 10 largest coefficients
print('Smallest Coefs:' )
print(feature_names[sorted_coef_index[:10]])
      
print('\n Largest Coefs:')      
print(feature_names[sorted_coef_index[:-11:-1]])

Smallest Coefs:
['worst' 'junk' 'worthless' 'unusable' 'mony' 'horrible' 'false' 'nope'
 'garbage' 'terrible']

 Largest Coefs:
['excelent' 'excelente' 'excellent' 'loving' 'exelente' 'loves' 'amazing'
 'perfecto' 'love' 'lovely']


# Tfidf

When we are analyzing text data, we often encounter words that occur across multiple documents from both classes. Those frequently occurring words typically don't contain useful or discriminatory information. In this subsection, we will learn about a useful technique called **term frequency-inverse document frequency** (*tf-idf*) that can be used to downweight those frequently occurring words in the feature vectors. In other words, tf-idf reduces the weight of common words (the, is, an, etc.) that occur in nearly all documents.

The *tf-idf* can be defined as the product of the term frequency and the inverse document frequency:

\begin{align}
\textit{tf-idf}(t,d) = tf(t,d) \times idf(t,d)
\end{align}

Here *tf(t,d)* is the term frequency, i.e. the number of times the term *t* occurs in document *d* divided by the total number of words in *d*. The inverse document frequency *idf(t,d)* can be calculated as:

\begin{align}
idf(t,d) = \log\frac{n_d}{1+\text{df}(d,t)}
\end{align}

where $n_d$ is the total number of documents, and *df(d,t)* is the number of documents *d* that contain the term *t*. Note that adding the constant 1 to the denominator is optional; it avoids a division by zero for terms that occur in none of the training samples. The log is used to ensure that low document frequencies are not given too much weight.


scikit-learn implements yet another vectorizer, the TfidfVectorizer, that creates feature vectors as tf-idfs.


In [23]:
from sklearn.feature_extraction.text import TfidfVectorizer


docs = np.array([
    'The sun is shining',
    'The weather is sweet',
    'The sun is shining and the weather is sweet'])

vect2 = TfidfVectorizer().fit(docs)
bag2 = vect2.transform(docs)
bag2.toarray()

array([[0.        , 0.43370786, 0.55847784, 0.55847784, 0.        ,
        0.43370786, 0.        ],
       [0.        , 0.43370786, 0.        , 0.        , 0.55847784,
        0.43370786, 0.55847784],
       [0.40474829, 0.47810172, 0.30782151, 0.30782151, 0.30782151,
        0.47810172, 0.30782151]])
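
The values above do not come from the textbook formula directly. As far as the scikit-learn documentation describes it, `TfidfVectorizer` with default settings uses the raw term count as tf, a smoothed idf of the form $\ln\frac{1+n_d}{1+\text{df}(d,t)} + 1$, and then scales each row to unit Euclidean length. The following sketch reproduces the numbers by hand under those assumptions; the helper and variable names are illustrative only.

```python
import math
import re

docs = [
    'The sun is shining',
    'The weather is sweet',
    'The sun is shining and the weather is sweet',
]

# Assumed tokenizer: lowercase, words of at least two characters
tokenized = [re.findall(r"\b\w\w+\b", d.lower()) for d in docs]
vocab = sorted({tok for tokens in tokenized for tok in tokens})
n_docs = len(docs)

# Document frequency and smoothed idf for every term
df = {t: sum(t in tokens for tokens in tokenized) for t in vocab}
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

def tfidf_row(tokens):
    # Raw count times idf, then L2 normalization of the whole row
    raw = [tokens.count(t) * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

rows = [tfidf_row(tokens) for tokens in tokenized]
for row in rows:
    print([round(v, 8) for v in row])
```

Because 'is' and 'the' occur in every document, their idf is the smallest possible value (1.0), so they end up with lower weights than 'sun' or 'shining' in the first row even though each word occurs once.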

In [24]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Fit the TfidfVectorizer to the training data,
# ignoring terms that appear in fewer than 5 documents (min_df=5)
vect = TfidfVectorizer(min_df=5).fit(X_train)
X_train_vectorized = vect.transform(X_train)

model = LogisticRegression()
model.fit(X_train_vectorized, y_train)

predictions = model.predict(vect.transform(X_test))

print('AUC: ', roc_auc_score(y_test, predictions))

AUC:  0.9269570672514825


In [25]:
len(vect.get_feature_names())

18025

# Exercise

- Predict whether the two reviews below are negative or positive using your model:

      ['not an issue, phone is working', 'an issue, phone is not working']       

In [26]:
# Your code is here
model.predict(vect.transform(['not an issue, phone is working', 'an issue, phone is not working']))

array([0, 0])
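
Both reviews receive the same prediction, and with single-word features this is unavoidable: the two sentences contain exactly the same words, only in a different order, so their feature vectors are identical. A quick check in plain Python (the `tokenize` helper is an illustrative stand-in for the vectorizer's tokenizer, not the scikit-learn code):

```python
import re
from collections import Counter

def tokenize(text):
    # Assumed tokenizer: lowercase, words of at least two characters
    return re.findall(r"\b\w\w+\b", text.lower())

bag_a = Counter(tokenize('not an issue, phone is working'))
bag_b = Counter(tokenize('an issue, phone is not working'))

# Same words with the same counts means the same bag-of-words vector
same_bag = bag_a == bag_b
print(same_bag)
```

Any model trained on unigram counts or unigram tf-idf values therefore cannot tell these two reviews apart; word order has to enter the features in some other way.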

# n-grams

The bag-of-words model that we just created is also called the 1-gram or unigram model: each item or token in the vocabulary represents a single word. More generally, <b>a contiguous sequence of n items in NLP</b> (words, letters, or symbols) is called an n-gram. The choice of the number n in the n-gram model depends on the particular application. For instance, spam filtering applications tend to use n=3 or n=4 for good performance.

To summarize the concept of the n-gram representation, the 1-gram and 2-gram representations of our first document "the sun is shining" would be constructed as follows:
- 1-gram: "the", "sun", "is", "shining"
- 2-gram: "the sun", "sun is", "is shining"

The CountVectorizer class in scikit-learn allows us to use different n-gram models via its ngram_range parameter. By default, it uses a 1-gram representation.
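
As a small illustration of the definition, the n-grams of a tokenized sentence can be generated with a sliding window. The `ngrams` function below is a hand-written helper for this example, not a scikit-learn function.

```python
def ngrams(tokens, n):
    """Return all contiguous sequences of n tokens, joined by spaces."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = 'the sun is shining'.split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)

print(unigrams)
print(bigrams)
```

A sentence with k tokens yields k - n + 1 n-grams, so adding 2-grams roughly doubles the number of features extracted per document and usually enlarges the vocabulary considerably.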

In [27]:
# Use 1-grams and 2-grams together: ngram_range=(1, 2)
docs = np.array([
    'The sun is shining',
    'The weather is sweet',
    'The sun is shining and the weather is sweet'])

vect3=CountVectorizer(ngram_range=(1,2)).fit(docs)
bag3=vect3.transform(docs)

In [28]:
vect3.get_feature_names()

['and',
 'and the',
 'is',
 'is shining',
 'is sweet',
 'shining',
 'shining and',
 'sun',
 'sun is',
 'sweet',
 'the',
 'the sun',
 'the weather',
 'weather',
 'weather is']

In [29]:
len(vect3.get_feature_names())

15

In [30]:
len(vect.get_feature_names())

18025

In [31]:
vect = CountVectorizer(min_df=5, ngram_range=(1,2)).fit(X_train)

X_train_vectorized = vect.transform(X_train)

In [32]:
model = LogisticRegression()
model.fit(X_train_vectorized, y_train)

predictions = model.predict(vect.transform(X_test))

print('AUC: ', roc_auc_score(y_test, predictions))

AUC:  0.9634697117885772


In [33]:
feature_names = np.array(vect.get_feature_names())

sorted_coef_index = model.coef_[0].argsort()

print('Smallest Coefs:' )
print(feature_names[sorted_coef_index[:10]])
      
print('\n Largest Coefs:')      
print(feature_names[sorted_coef_index[:-11:-1]])

Smallest Coefs:
['no good' 'junk' 'worst' 'horrible' 'not good' 'garbage' 'not happy'
 'terrible' 'not satisfied' 'not very']

 Largest Coefs:
['excelent' 'excelente' 'not bad' 'excellent' 'perfect' 'no problems'
 'awesome' 'exelente' 'great' 'no issues']


In [34]:
print(model.predict(vect.transform(['no an issue, phone is working',
                                    'an issue, phone is not working'])))

[1 0]


# 3. Text Classification

## Using sklearn's NaiveBayes Classifier


### Exercise:
1. Classify the Amazon reviews using a Naive Bayes classifier
2. Evaluate your classification

In [35]:
from sklearn import metrics
from sklearn.naive_bayes import MultinomialNB

In [36]:
from sklearn.metrics import roc_auc_score

# Train on the 1- and 2-gram count features (X_train_vectorized) built earlier
mnb = MultinomialNB()
mnb.fit(X_train_vectorized, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [37]:
predicted_model=mnb.predict(vect.transform(X_test))

In [38]:
mnb_f1=metrics.f1_score(y_test,predicted_model,average='micro')
print("Multinomial NB - F1 score: {:.3f}".format(mnb_f1))

Multinomial NB - F1 score: 0.951
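
One caveat when reading this score: for a single-label problem like this one, the micro-averaged F1 score is the same number as plain accuracy, because every misclassified review counts once as a false positive (for the predicted class) and once as a false negative (for the true class). The sketch below checks this on made-up labels with a hand-written `micro_f1` helper (hypothetical, not the scikit-learn implementation).

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP and FN over all classes, then compute F1."""
    tp = fp = fn = 0
    for c in set(y_true) | set(y_pred):
        for t, p in zip(y_true, y_pred):
            tp += (t == c and p == c)
            fp += (t != c and p == c)
            fn += (t == c and p != c)
    return 2 * tp / (2 * tp + fp + fn)

# Made-up labels, for illustration only
y_true = [1, 1, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 0, 1, 0, 1, 1, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
f1_micro = micro_f1(y_true, y_pred)

print(accuracy, f1_micro)
```

Given the class imbalance (about 75% of the reviews are positive), a class-wise average such as `metrics.f1_score(y_test, predicted_model, average='macro')` may be a more informative way to compare this model with the logistic regression models above.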
