### Natural Language Processing
Natural language processing (NLP) is a branch of machine learning that deals with processing, analyzing, and sometimes generating human language (“natural language”), whether written or spoken.

NLP is one of the most widely used branches of ML. Typical use cases include:
* Customer call centers (analysing incoming emails and calls)
* Analysing reviews and interviews
* Finding and tagging news items, classifying documents, etc.


In [1]:
import string
from collections import Counter
from pprint import pprint
import gzip
import matplotlib.pyplot as plt 
import numpy as np
%matplotlib inline

#### 01.Text Feature Extraction:

* __Tokenization__ : Tokenization is the process of breaking text into pieces, called tokens. Tokens can be either words or sentences. Depending on the tokenizer, characters like punctuation marks (,. “ ‘) and spaces are either ignored or split off as separate tokens, as spaCy does below.


In [2]:
import spacy
from spacy.lang.en import English
nlp = English()

Breaking text into words.

In [3]:
text = """When learning data science, you shouldn't get discouraged!
Challenges and setbacks aren't failures, they're just part of the journey. You've got this!"""

# "nlp" Object is used to create documents with linguistic annotations.
my_doc = nlp(text)
# Create list of word tokens
token_list = []
for token in my_doc:
    token_list.append(token.text)
print(token_list)

['When', 'learning', 'data', 'science', ',', 'you', 'should', "n't", 'get', 'discouraged', '!', '\n', 'Challenges', 'and', 'setbacks', 'are', "n't", 'failures', ',', 'they', "'re", 'just', 'part', 'of', 'the', 'journey', '.', 'You', "'ve", 'got', 'this', '!']


Breaking text into sentences.

In [4]:
# Create the pipeline 'sentencizer' component (spaCy v2 API)
sbd = nlp.create_pipe('sentencizer')

# Add the component to the pipeline
# (in spaCy v3 these two steps become a single call: nlp.add_pipe('sentencizer'))
nlp.add_pipe(sbd)

text = """When learning data science, you shouldn't get discouraged!
Challenges and setbacks aren't failures, they're just part of the journey. You've got this!"""

#  "nlp" Object is used to create documents with linguistic annotations.
doc = nlp(text)

# create list of sentence tokens
sents_list = []
for sent in doc.sents:
    sents_list.append(sent.text)
print(sents_list)

["When learning data science, you shouldn't get discouraged!", "\nChallenges and setbacks aren't failures, they're just part of the journey.", "You've got this!"]


* __Stop words removal__ :
Most text data contains many words that are not useful for analysis. These words, called stop words (for example “the”, “and”, “is”), help human communication flow, but they contribute little to data analysis.


In [5]:
spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS

#Printing the total number of stop words:
print('Number of stop words: %d' % len(spacy_stopwords))

#Printing twenty stop words (STOP_WORDS is a set, so the order is arbitrary):
print('Twenty stop words: %s' % list(spacy_stopwords)[:20])

Number of stop words: 326
Twenty stop words: ['every', 'top', 'otherwise', 'therein', 'your', 'where', 'beforehand', 'until', 'many', 'nevertheless', '’d', 'towards', '’m', 'other', 'thereupon', 'hereby', 'had', 'keep', 'into', "'s"]


In [6]:
from spacy.lang.en.stop_words import STOP_WORDS

#Implementation of stop words:
filtered_sent=[]

#  "nlp" Object is used to create documents with linguistic annotations.
doc = nlp(text)

# filtering stop words (keep only tokens that are not stop words)
for word in doc:
    if not word.is_stop:
        filtered_sent.append(word)
print("Filtered Sentence:",filtered_sent)

Filtered Sentence: [learning, data, science, ,, discouraged, !, 
, Challenges, setbacks, failures, ,, journey, ., got, !]


* __Lexicon Normalization__

Most words in a language appear in several different forms.
For example, the words eat, eating, and eaten have the same meaning for analytics purposes.

* __Lemmatization__ : It is a way of dealing with the fact that while words like connect, connection, connecting, connected, etc. aren’t exactly the same, they all have the same essential meaning: connect. The differences in spelling have grammatical functions in human language, but for machine processing those differences can be confusing, so we need a way to change all the words that are forms of the word connect into the word connect itself.

* __Stemming__ : It involves simply lopping off easily identified prefixes and suffixes to produce what’s often the simplest version of a word. Connection, for example, would have the -ion suffix removed and be correctly reduced to connect. This kind of simple stemming is often all that’s needed, but lemmatization, which actually looks up words and their roots (called lemmas) as described in a dictionary, is more precise (as long as the words exist in the dictionary).
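To make the difference concrete, here is a minimal pure-Python sketch. It is a toy, not spaCy's or any library's algorithm: the suffix list `SUFFIXES`, the lookup table `LEMMA_TABLE`, and the helper names `toy_stem` and `toy_lemma` are all hypothetical and chosen only for illustration.

```python
# Toy illustration of stemming vs. lemmatization (not a real NLP algorithm).

# Hypothetical, deliberately tiny suffix list for the toy stemmer.
SUFFIXES = ("ing", "ion", "ed", "s")

# Hypothetical mini-dictionary mapping inflected forms to their lemma.
LEMMA_TABLE = {
    "connection": "connect",
    "connecting": "connect",
    "connected": "connect",
    "eating": "eat",
    "eaten": "eat",
    "ate": "eat",
}


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemma(word):
    """Look the word up in the dictionary; fall back to the lowercased word."""
    word = word.lower()
    return LEMMA_TABLE.get(word, word)


for w in ["connection", "connecting", "connected", "eating", "eaten", "ate"]:
    print(f"{w:<12} stem={toy_stem(w):<8} lemma={toy_lemma(w)}")
```

Suffix stripping handles the regular forms of connect, but irregular forms such as eaten and ate are only mapped to eat by the dictionary lookup, which is the practical reason lemmatization is described as more precise.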

* Term Frequency & Inverse Document Frequency

In [7]:
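# Implementing lemmatization
# Note: `nlp` is still the blank English() pipeline created earlier. When no
# lemmatization data is available to it, each lemma simply echoes the token text,
# as in the output below. A pipeline with lemmatization data (for example a
# trained model such as en_core_web_sm) is needed to get real lemmas.
lem = nlp("run runs running runner")
# finding lemma for each word
for word in lem:
    print(word.text,word.lemma_)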

run run
runs runs
running running
runner runner


spaCy also offers entity detection (named entity recognition).

In [8]:
from spacy import displacy
import en_core_web_sm
nlp = en_core_web_sm.load()

nytimes= nlp(u"""New York City on Tuesday declared a public health emergency and ordered mandatory measles vaccinations amid an outbreak, becoming the latest national flash point over refusals to inoculate against dangerous diseases.
At least 285 people have contracted measles in the city since September, mostly in Brooklyn’s Williamsburg neighborhood. The order covers four Zip codes there, Mayor Bill de Blasio (D) said Tuesday.
The mandate orders all unvaccinated people in the area, including a concentration of Orthodox Jews, to receive inoculations, including for children as young as 6 months old. Anyone who resists could be fined up to $1,000.""")

entities=[(i,i.label_,i.label) for i in nytimes.ents]

In [9]:
entities

[(New York City, 'GPE', 384),
 (Tuesday, 'DATE', 391),
 (At least 285, 'CARDINAL', 397),
 (September, 'DATE', 391),
 (Brooklyn, 'GPE', 384),
 (Williamsburg, 'GPE', 384),
 (four, 'CARDINAL', 397),
 (Bill de Blasio, 'PERSON', 380),
 (Tuesday, 'DATE', 391),
 (Orthodox Jews, 'PERSON', 380),
 (6 months old, 'DATE', 391),
 (up to $1,000, 'MONEY', 394)]

In [10]:
displacy.render(nytimes, style = "ent",jupyter = True)

### Text Classification
In the case of classification, a pipeline looks as follows:
![text classification](images/text-classification-python-spacy.png)

We will try to classify Amazon Alexa reviews as positive or negative.

In [11]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.base import TransformerMixin
from sklearn.pipeline import Pipeline


In [12]:
# Loading TSV file
df_amazon = pd.read_csv ("data/amazon_alexa.tsv", sep="\t")
# 5 randomly sampled records
df_amazon.sample(5)

| | rating | date | variation | verified_reviews | feedback |
|---|---|---|---|---|---|
| 2988 | 4 | 30-Jul-18 | White Dot | Handy if you don't expect much out of it much ... | 1 |
| 1461 | 5 | 30-Jul-18 | Black Show | | 1 |
| 906 | 5 | 29-Jul-18 | Charcoal Fabric | The best part of this product is you can contr... | 1 |
| 428 | 5 | 9-Jul-18 | Black | Good as new | 1 |
| 2641 | 4 | 30-Jul-18 | Black Dot | love it | 1 |


In [13]:
# View data information
df_amazon.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3150 entries, 0 to 3149
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   rating            3150 non-null   int64 
 1   date              3150 non-null   object
 2   variation         3150 non-null   object
 3   verified_reviews  3150 non-null   object
 4   feedback          3150 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 123.2+ KB


In [14]:
# Feedback Value count
df_amazon.feedback.value_counts()


1    2893
0     257
Name: feedback, dtype: int64
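
The counts above show a strongly imbalanced dataset. As a quick sanity check, the short sketch below (plain Python, with the two counts copied by hand from the output above into a hypothetical `counts` dictionary) computes the share of the majority class, which is also the accuracy a model would reach by always predicting `1`.

```python
# Class counts copied from the value_counts() output above.
counts = {1: 2893, 0: 257}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)
majority_share = counts[majority_label] / total

print(f"total reviews: {total}")
print(f"majority class: {majority_label} ({majority_share:.1%} of all reviews)")
```

Any accuracy figure reported for this dataset is best read against that share rather than against 50%.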

Now that we know what we’re working with, let’s create a custom tokenizer function using spaCy. We’ll use this function to automatically strip information we don’t need, like stopwords and punctuation, from each review.

We’ll start by importing the English language class from spaCy, as well as Python’s string module, which provides the ASCII punctuation marks in string.punctuation. We’ll create variables that contain the punctuation marks and stopwords we want to remove, and a parser that runs input through spaCy’s English module.

Then, we’ll create a spacy_tokenizer() function that accepts a sentence as input and processes the sentence into tokens, performing lemmatization, lowercasing, and removing stop words. This is similar to what we did in the examples earlier in this tutorial, but now we’re putting it all together into a single function for preprocessing each user review we’re analyzing.

In [15]:
import string
from spacy.lang.en.stop_words import STOP_WORDS
from spacy.lang.en import English

# Create our list of punctuation marks
punctuations = string.punctuation

# Create our set of stopwords
stop_words = STOP_WORDS

# Create a blank English pipeline; it provides the tokenizer only
# (no trained tagger, parser, NER or word vectors are loaded)
parser = English()

# Creating our tokenizer function
def spacy_tokenizer(sentence):
    # Creating our token object, which is used to create documents with linguistic annotations.
    mytokens = parser(sentence)

    # Lemmatizing each token and converting each token into lowercase
    mytokens = [ word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in mytokens ]

    # Removing stop words
    mytokens = [ word for word in mytokens if word not in stop_words and word not in punctuations ]

    # return preprocessed list of tokens
    return mytokens

**Defining a Custom Transformer** 

To further clean our text data, we’ll also want to create a custom transformer for removing leading and trailing spaces and converting text into lower case. Here, we will create a custom predictors class which inherits from the TransformerMixin class. This class defines the transform, fit and get_params methods. We’ll also create a clean_text() function that removes spaces and converts text into lowercase.

In [16]:
# Custom transformer that cleans the raw text before vectorization
class predictors(TransformerMixin):
    def transform(self, X, **transform_params):
        # Cleaning Text
        return [clean_text(text) for text in X]

    def fit(self, X, y=None, **fit_params):
        return self

    def get_params(self, deep=True):
        return {}

# Basic function to clean the text
def clean_text(text):
    # Removing spaces and converting text into lowercase
    return text.strip().lower()

### Bag of Words
* CountVectorizer

In [17]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

In [18]:
cat_in_the_hat_docs=[
       "One Cent, Two Cents, Old Cent, New Cent: All About Money (Cat in the Hat's Learning Library)",
       "Inside Your Outside: All About the Human Body (Cat in the Hat's Learning Library)",
       "Oh, The Things You Can Do That Are Good for You: All About Staying Healthy (Cat in the Hat's Learning Library)",
       "On Beyond Bugs: All About Insects (Cat in the Hat's Learning Library)",
       "There's No Place Like Space: All About Our Solar System (Cat in the Hat's Learning Library)" 
      ]


# The documents are passed to fit_transform(), not to the constructor
cv = CountVectorizer(lowercase=True,
                     stop_words='english',
                     ngram_range=(2,2)
                    )
count_vector=cv.fit_transform(cat_in_the_hat_docs)

# show resulting vocabulary; the numbers are not counts, they are each bigram's column index in the sparse count matrix.
cv.vocabulary_

{'cent cents': 3,
 'cents old': 6,
 'old cent': 18,
 'cent new': 5,
 'new cent': 16,
 'cent money': 4,
 'money cat': 15,
 'cat hat': 2,
 'hat learning': 8,
 'learning library': 13,
 'inside outside': 12,
 'outside human': 19,
 'human body': 10,
 'body cat': 0,
 'oh things': 17,
 'things good': 24,
 'good staying': 7,
 'staying healthy': 23,
 'healthy cat': 9,
 'bugs insects': 1,
 'insects cat': 11,
 'place like': 20,
 'like space': 14,
 'space solar': 22,
 'solar cat': 21}
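
To show what happens conceptually, here is a rough pure-Python sketch of a bag-of-bigrams. It is not scikit-learn's implementation: the stop list `STOP`, the two example documents, and the helper `bigrams` are hypothetical, and the only behaviours borrowed from `CountVectorizer` are its default token pattern (words of two or more characters) and the alphabetical ordering of the vocabulary.

```python
import re
from collections import Counter

# Hypothetical tiny stop list (scikit-learn's 'english' list is much longer).
STOP = {"the", "in", "all", "about"}


def bigrams(doc):
    """Lowercase, tokenize, drop stop words, then join neighbouring tokens."""
    tokens = [t for t in re.findall(r"\b\w\w+\b", doc.lower()) if t not in STOP]
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


# Hypothetical example documents.
example_docs = ["All About the Human Body", "The Human Body in Space"]
doc_bigrams = [bigrams(d) for d in example_docs]

# Vocabulary: each distinct bigram gets a column index (alphabetical order).
all_terms = sorted({b for bs in doc_bigrams for b in bs})
vocabulary = {term: idx for idx, term in enumerate(all_terms)}

# Count matrix: one row per document, one column per vocabulary entry.
count_rows = []
for bs in doc_bigrams:
    term_counts = Counter(bs)
    count_rows.append([term_counts.get(term, 0) for term in all_terms])

print(vocabulary)
print(count_rows)
```

The vocabulary maps each bigram to a column, and the counts themselves live in the rows of the matrix, which is why the numbers in `cv.vocabulary_` should not be read as frequencies.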

* TfidfVectorizer


![Tfidf](images/TF-IDF_web.png)

* TF-IDF is the product of two quantities, the term frequency (TF) and the inverse document frequency (IDF). It is useful for finding terms that are
important for a specific document (high TF) and uncommon in the corpus as a whole
(large IDF, small DF).

* In particular, a term that occurs in every document carries no information for
distinguishing between documents.

* Stop words are naturally down-weighted because they appear in most or all documents.
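
For reference, the weighting can be written out explicitly. The form below is an assumption about the variant in use: it is the one scikit-learn's `TfidfVectorizer` applies with its default settings (`smooth_idf=True`, `norm='l2'`), and other libraries and textbooks use slightly different IDF definitions. Here $n$ is the number of documents, $\mathrm{tf}(t,d)$ is the raw count of term $t$ in document $d$, and $\mathrm{df}(t)$ is the number of documents that contain $t$.

$$
\mathrm{idf}(t) = \ln\frac{1+n}{1+\mathrm{df}(t)} + 1,
\qquad
\mathrm{tfidf}(t,d) = \mathrm{tf}(t,d)\cdot\mathrm{idf}(t)
$$

Each document vector is then divided by its Euclidean norm. Under this variant a term that appears in every document gets $\mathrm{idf}(t) = \ln 1 + 1 = 1$, the smallest possible value, so it is down-weighted relative to rarer terms rather than removed entirely.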


In [19]:
import pandas as pd
  
# a toy example, only meant to illustrate how TfidfVectorizer is used
docs=["the house had a tiny little mouse",
      "the cat saw the mouse",
      "the mouse ran away from the house",
      "the cat finally ate the mouse",
      "the end of the mouse story"
     ]

# settings that you use for count vectorizer will go here
tfidf_vectorizer=TfidfVectorizer(use_idf=True)
 
# just send in all your docs here
tfidf_vectorizer_vectors=tfidf_vectorizer.fit_transform(docs)

In [20]:
# get the vector for the third document (index 2): "the mouse ran away from the house"
doc_vector_tfidfvectorizer=tfidf_vectorizer_vectors[2]
# place tf-idf values in a pandas data frame
# (newer scikit-learn releases replace get_feature_names() with get_feature_names_out())
df = pd.DataFrame(doc_vector_tfidfvectorizer.T.todense(), index=tfidf_vectorizer.get_feature_names(), columns=["tfidf"])
df.sort_values(by=["tfidf"],ascending=False)

| | tfidf |
|---|---|
| away | 0.457093 |
| from | 0.457093 |
| ran | 0.457093 |
| the | 0.435614 |
| house | 0.36878 |
| mouse | 0.217807 |
| ate | 0.0 |
| cat | 0.0 |
| end | 0.0 |
| finally | 0.0 |
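
The scores above can be checked by hand. The sketch below recomputes them in plain Python under the assumption that `TfidfVectorizer` runs with its defaults: smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation, and a tokenizer that keeps words of two or more characters. The helper names `tokenize` and `tfidf` are hypothetical; this is an illustration, not the library's code.

```python
import math
import re
from collections import Counter

docs = [
    "the house had a tiny little mouse",
    "the cat saw the mouse",
    "the mouse ran away from the house",
    "the cat finally ate the mouse",
    "the end of the mouse story",
]


def tokenize(doc):
    """Keep words of 2+ characters, mirroring the default token pattern."""
    return re.findall(r"\b\w\w+\b", doc.lower())


tokenized = [tokenize(d) for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term occur?
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))


def tfidf(tokens):
    """Smoothed TF-IDF followed by L2 normalisation (assumed defaults)."""
    term_freq = Counter(tokens)
    raw = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in term_freq.items()
    }
    norm = math.sqrt(sum(value * value for value in raw.values()))
    return {term: value / norm for term, value in raw.items()}


# Third document (index 2), the same one inspected above.
scores = tfidf(tokenized[2])
for term, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{term:<6} {score:.6f}")
```

The word "the" occurs twice in the document, but because it appears in all five documents its IDF is the minimum of 1, so the rarer words "away", "from" and "ran" still end up with higher scores.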


Let us continue with the classification.

In [21]:
bow_vector = CountVectorizer(tokenizer = spacy_tokenizer, ngram_range=(1,1))

In [22]:
from sklearn.model_selection import train_test_split

X = df_amazon['verified_reviews'] # the features we want to analyze
ylabels = df_amazon['feedback'] # the labels, or answers, we want to test against

X_train, X_test, y_train, y_test = train_test_split(X, ylabels, test_size=0.3)

In [56]:
# Logistic Regression Classifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline  # only used by the commented-out alternative below
classifier = LogisticRegression()

# Create pipeline using Bag of Words
pipe = Pipeline([
                 ('vectorizer', bow_vector),
                 ('classifier', classifier)])

# vec = CountVectorizer()
# clf = LogisticRegression()
# pipe = make_pipeline(vec, clf)

# model generation
pipe.fit(X_train,y_train)

Pipeline(memory=None,
         steps=[('vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function spacy_tokenizer at 0x107a2d730>,
                                 vocabulary=None)),
                ('classifier',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
                                    fit_intercept=True, intercept_scaling=1,
                                    l1_ratio=None, max_iter=100,
            

In [57]:
from sklearn import metrics
# Predicting with a test dataset
predicted = pipe.predict(X_test)

# Model Accuracy
print("Logistic Regression Accuracy:",metrics.accuracy_score(y_test, predicted))
print("Logistic Regression Precision:",metrics.precision_score(y_test, predicted))
print("Logistic Regression Recall:",metrics.recall_score(y_test, predicted))

Logistic Regression Accuracy: 0.9248677248677248
Logistic Regression Precision: 0.9348534201954397
Logistic Regression Recall: 0.9873853211009175


### Model Explanations

In [58]:
import eli5
eli5.show_weights(pipe, top=(25,25))

| Weight? | Feature |
|---|---|
| +2.656 | love |
| +2.138 | `<BIAS>` |
| +1.700 | easy |
| +1.597 | great |
| +1.070 | best |
| +1.048 | fun |
| +1.037 | amazing |
| +1.005 | perfect |
| +0.988 | expected |
| +0.959 | good |


In [59]:
import eli5
from eli5.lime import TextExplainer
te = TextExplainer(random_state=42)
te.fit(X_test.iloc[15], pipe.predict_proba)
te.show_prediction(target_names=[0,1])

| Contribution? | Feature |
|---|---|
| 1.552 | love |
| 0.715 | different |
| 0.539 | trouble |
| 0.518 | music |
| 0.513 | `<BIAS>` |
| 0.464 | pleased |
| 0.45 | i |
| 0.402 | with |
| 0.381 | the |
| 0.349 | of |


In [60]:
te.metrics_

{'mean_KL_divergence': 0.007541006102304846, 'score': 0.9993869129174795}