# NLP
Natural language processing (NLP) is a branch of data science that consists of systematic processes for analyzing, understanding, and deriving information from text data efficiently. Using NLP and its components, one can organize massive amounts of text data, perform numerous automated tasks, and solve a wide range of problems such as automatic summarization, machine translation, named entity recognition, relationship extraction, sentiment analysis, speech recognition, and topic segmentation.

# Benefits of NLP

Blogs, social websites, and web pages generate enormous volumes of text data every day.

Many companies gather this data to understand users and their interests, and report their findings to other companies so that those companies can adjust their plans.

Such data could show, for example, that people in Brazil are happy with product A (which could be a movie or anything else) while people in the US are happy with product B. The result could even be available in real time. Search engines do something similar: they give the appropriate results to the right people at the right time.

Search engines are not the only application of natural language processing (NLP); there are many other impressive ones.

 

# NLP Implementations


These are some of the successful implementations of natural language processing (NLP):

    Search engines like Google and Yahoo. The Google search engine can recognize that you are a tech person and show you results relevant to you.
    Social website feeds like the Facebook news feed. The news feed algorithm uses natural language processing to understand your interests and shows you related ads and posts ahead of other posts.
    Speech engines like Apple Siri.
    Spam filters like Google's spam filters. Beyond the usual spam filtering, modern spam filters analyze the content of an email to decide whether it is spam.

 

# NLP Libraries

There are many open source natural language processing (NLP) libraries, including:

    Natural Language Toolkit (NLTK).
    Apache OpenNLP.
    Stanford NLP suite.
    GATE NLP library.

The Natural Language Toolkit (NLTK) is one of the most popular NLP libraries. It is written in Python and has a large community behind it.

NLTK is also easy to learn, which makes it a good first NLP library.

In this NLP tutorial, we will use the Python NLTK library.

# Feature Extraction
Feature extraction is a technique for converting content into numerical vectors so that machine learning can be performed on it. Two common kinds are:

    Text feature extraction
    Image feature extraction

# Bag of Words
Bag of words is used to convert text data into numerical feature vectors of a fixed size. For a collection of text documents, the steps are:

1. Tokenizing: assign a fixed integer id to each word.
2. Counting: count the number of occurrences of each word.
3. Store: store the count as the value of the feature.
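
The steps above can be sketched in plain Python. This is only an illustration under assumptions: the two-document corpus and the variable names are made up for the example, and scikit-learn's own implementation differs in its details.

```python
import re
from collections import Counter

# Hypothetical toy corpus, invented for this sketch.
docs = ["the cat sat", "the cat saw the dog"]

# Tokenizing: split each document into lowercase word tokens.
tokenized = [re.findall(r"\w+", doc.lower()) for doc in docs]

# Assign a fixed integer id to each distinct word (sorted for reproducibility).
all_words = sorted({word for tokens in tokenized for word in tokens})
vocab = {word: idx for idx, word in enumerate(all_words)}

# Counting: number of occurrences of each word in each document.
counts = [Counter(tokens) for tokens in tokenized]

# Store: put each count at the position given by the word's id.
vectors = [[count[word] for word in vocab] for count in counts]

print(vocab)
print(vectors)
```

Every document ends up with a vector of the same length (the vocabulary size), regardless of how many words it contains.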

CountVectorizer Class Signature

class sklearn.feature_extraction.text.CountVectorizer(input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.int64'>)

The main parts of the signature are:

    CountVectorizer: the class
    input: file name or sequence of strings
    encoding: encoding used to decode the input
    strip_accents: removes accents
    tokenizer: overrides the string tokenizer
    stop_words: built-in stop words list
    max_df: maximum threshold
    min_df: minimum threshold
    max_features: specifies the number of components to keep
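
As a rough illustration of how a few of these parameters behave, the sketch below uses a made-up three-document corpus and a hand-written stop word list (both are assumptions for the example, not part of the original material). It restricts the vocabulary first with `stop_words` and `max_features`, then with `min_df`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus, invented for this sketch.
docs = [
    "the food was good",
    "the food was bad",
    "good food and service",
]

# Drop a custom list of stop words and keep only the 2 most frequent terms.
vect = CountVectorizer(stop_words=["the", "was", "and"], max_features=2)
X = vect.fit_transform(docs)
print(sorted(vect.vocabulary_))  # terms that survived
print(X.toarray())               # one row per document, one column per term

# Keep only terms that appear in at least 2 documents.
vect_min = CountVectorizer(min_df=2)
vect_min.fit(docs)
print(sorted(vect_min.vocabulary_))
```

`max_df` works the same way as `min_df` but sets an upper threshold, which is useful for dropping terms that occur in almost every document.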

Text Feature Extraction Considerations
Sparse
The vectorizers store their output in memory as sparse matrices. Sparse data is common when extracting feature values, especially for large document datasets.
Vectorizer
It implements tokenization and occurrence counting. By default, only words with at least two letters are tokenized. The analyzer function can be used to see how the text data is split before it is vectorized.
Tf-idf
It is a term weighting utility based on term frequency and inverse document frequency. Term frequency indicates how often a particular term occurs in the document. Inverse document frequency is a factor that diminishes the weight of terms that occur frequently across documents.
Decoding
This utility can decode text files if their encoding is specified.
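
Two of these points can be checked directly. The sketch below, which assumes a small two-line corpus chosen for the example, uses `build_analyzer()` to show that single-letter words are dropped by default, and inspects the sparse matrix returned by `fit_transform`.

```python
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer()

# The analyzer applies lowercasing and the default token pattern.
analyze = vect.build_analyzer()
tokens = analyze("I am a cat")
print(tokens)  # single-letter words such as "I" and "a" are not kept

# fit_transform returns a sparse matrix of shape (documents, vocabulary size).
X = vect.fit_transform(["it was the best of times", "it was the worst of times"])
print(X.shape)  # (number of documents, number of distinct tokens)
print(X.nnz)    # number of non-zero entries actually stored
```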

# Model Training
An important task in model training is to identify the right model for the given dataset. The choice of model depends largely on the type of dataset, in particular on whether it comes with labels.
Unsupervised
Models identify patterns in the data and extract its structure. They are also used to group documents using clustering algorithms.
Example: K-means
Supervised
Models learn from the features and responses of a given dataset, then predict the outcome of new observations and classify new documents.
Example: Naïve Bayes, SVM, linear regression, k-nearest neighbors (K-NN)
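
To make the distinction concrete, here is a minimal sketch that fits one model of each kind on the same bag-of-words features. The four short reviews and their labels are invented for the example, and the parameter values (`n_clusters=2`, `n_init=10`, `random_state=0`) are arbitrary choices, not recommendations.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical reviews and labels, invented for this sketch.
docs = ["good great tasty", "great friendly good", "bad awful slow", "awful rude bad"]
labels = ["pos", "pos", "neg", "neg"]

vect = CountVectorizer()
X = vect.fit_transform(docs)

# Unsupervised: K-means groups the documents without seeing the labels.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
clusters = kmeans.fit_predict(X)
print(clusters)

# Supervised: Naive Bayes learns from the labels and classifies new documents.
clf = MultinomialNB()
clf.fit(X, labels)
predictions = clf.predict(vect.transform(["good tasty", "bad rude"]))
print(predictions)
```

Note that the cluster ids returned by K-means are arbitrary numbers; only the grouping is meaningful, whereas the classifier returns the label names it was trained on.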


# Grid Search and Multiple Parameters
Document classifiers can have many parameters, and a grid approach helps to search for the best parameters for training the model and predicting the outcome accurately.

In a grid search, each parameter is given a set of candidate values, and the combinations of those values form a grid. The search can be run over the entire grid or over a subset of the combinations.
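
One way to see what a grid looks like is scikit-learn's `ParameterGrid`, which enumerates every combination of candidate values. In this sketch the parameter names `alpha` and `fit_prior` are borrowed from `MultinomialNB` purely as an example, and the candidate values are arbitrary.

```python
from sklearn.model_selection import ParameterGrid

# Candidate values for two parameters (example values, chosen arbitrarily).
param_grid = {"alpha": [0.1, 1.0], "fit_prior": [True, False]}

# Each element is one point in the grid: a dict of parameter settings.
combinations = list(ParameterGrid(param_grid))
for params in combinations:
    print(params)
```

A grid search trains and scores the model once for each of these settings (typically with cross-validation) and keeps the best one.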

Pipeline
A pipeline is a combination of vectorizers, transformers, and model training.

1. Vectorizer: converts a collection of text documents into numerical feature vectors.
2. Transformer (tf-idf): reweights the extracted features so that informative words carry more weight.
3. Model training (document classifiers): fits a classifier on the transformed features so that the model predicts accurately.
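
A pipeline and a grid search are often used together. The following is a minimal sketch, assuming an invented eight-document dataset and an arbitrary choice of classifier (`MultinomialNB`), grid values, and `cv=2`; a real project would use far more data and a grid suited to the problem.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical labeled documents, invented for this sketch.
docs = [
    "good food", "great service", "tasty good meal", "friendly great staff",
    "bad food", "awful service", "slow bad meal", "rude awful staff",
]
labels = ["pos"] * 4 + ["neg"] * 4

# Vectorizer -> tf-idf transformer -> classifier, chained as one estimator.
pipe = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])

# Parameters are addressed as <step name>__<parameter name>.
param_grid = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "clf__alpha": [0.1, 1.0],
}

search = GridSearchCV(pipe, param_grid, cv=2)
search.fit(docs, labels)

print(search.best_params_)
prediction = search.predict(["good great"])
print(prediction)
```

Because the vectorizer is part of the pipeline, it is refitted inside every cross-validation split, so its parameters can be tuned in the same grid as the classifier's.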


# Example of the Bag-of-Words Model

Let’s make the bag-of-words model concrete with a worked example.

Step 1: Collect Data

Below is a snippet of the first few lines of text from the book “A Tale of Two Cities” by Charles Dickens, taken from Project Gutenberg.

    It was the best of times,
    it was the worst of times,
    it was the age of wisdom,
    it was the age of foolishness,

For this small example, let’s treat each line as a separate “document” and the 4 lines as our entire corpus of documents.

Step 2: Design the Vocabulary

Now we can make a list of all of the words in our model vocabulary.

The unique words here (ignoring case and punctuation) are:

    “it”
    “was”
    “the”
    “best”
    “of”
    “times”
    “worst”
    “age”
    “wisdom”
    “foolishness”

That is a vocabulary of 10 words from a corpus containing 24 words.
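
The vocabulary above can be reproduced with a few lines of plain Python. This is a sketch rather than a prescribed method: the regular expression used to ignore case and punctuation is an assumption, chosen because it is enough for these four lines.

```python
import re

corpus = [
    "It was the best of times,",
    "it was the worst of times,",
    "it was the age of wisdom,",
    "it was the age of foolishness,",
]

# Ignore case and punctuation: lowercase each line and keep runs of letters.
tokens_per_doc = [re.findall(r"[a-z]+", line.lower()) for line in corpus]

# Collect unique words in order of first appearance.
vocabulary = []
for tokens in tokens_per_doc:
    for token in tokens:
        if token not in vocabulary:
            vocabulary.append(token)

total_words = sum(len(tokens) for tokens in tokens_per_doc)
print(vocabulary)
print(len(vocabulary), "unique words out of", total_words)
```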

Step 3: Create Document Vectors

The next step is to score the words in each document.

The objective is to turn each document of free text into a vector that we can use as input or output for a machine learning model.

Because we know the vocabulary has 10 words, we can use a fixed-length document representation of 10, with one position in the vector to score each word.

The simplest scoring method is to mark the presence of words as a boolean value, 0 for absent, 1 for present.

Using the arbitrary ordering of words listed above in our vocabulary, we can step through the first document (“It was the best of times”) and convert it into a binary vector.

The scoring of the document would look as follows:

    “it” = 1
    “was” = 1
    “the” = 1
    “best” = 1
    “of” = 1
    “times” = 1
    “worst” = 0
    “age” = 0
    “wisdom” = 0
    “foolishness” = 0
As a binary vector, this would look as follows:
[1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
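
The binary scoring can be written as a small function. In this sketch the helper name `binary_vector` is made up for the example, and the document is assumed to be already stripped of punctuation.

```python
# Vocabulary in the same arbitrary order as listed above.
vocabulary = ["it", "was", "the", "best", "of", "times",
              "worst", "age", "wisdom", "foolishness"]


def binary_vector(document, vocabulary):
    """Return 1 for each vocabulary word present in the document, else 0."""
    words = set(document.lower().split())
    return [1 if word in words else 0 for word in vocabulary]


first = binary_vector("It was the best of times", vocabulary)
print(first)
```

The same function can be applied to any other document, as long as the vocabulary and its ordering stay fixed.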

The other three documents would look as follows:
"it was the worst of times" = [1, 1, 1, 0, 1, 1, 1, 0, 0, 0]
"it was the age of wisdom" = [1, 1, 1, 0, 1, 0, 0, 1, 1, 0]
"it was the age of foolishness" = [1, 1, 1, 0, 1, 0, 0, 1, 0, 1]

Managing Vocabulary

As the vocabulary size increases, so does the length of the vector representation of documents.

In the previous example, the length of the document vector is equal to the number of known words.

You can imagine that for a very large corpus, such as thousands of books, the length of the vector might be thousands or millions of positions. Further, each document may contain very few of the known words in the vocabulary.

This results in a vector with lots of zero scores, called a sparse vector or sparse representation.

When stored as ordinary dense arrays, sparse vectors require more memory and computational resources when modeling, and the vast number of positions or dimensions can make the modeling process very challenging for traditional algorithms.

As such, there is pressure to decrease the size of the vocabulary when using a bag-of-words model.

There are simple text cleaning techniques that can be used as a first step, such as:

    Ignoring case
    Ignoring punctuation
    Ignoring frequent words that don’t contain much information, called stop words, like “a,” “of,” etc.
    Fixing misspelled words.
    Reducing words to their stem (e.g. “play” from “playing”) using stemming algorithms.

A more sophisticated approach is to create a vocabulary of grouped words. This both changes the scope of the vocabulary and allows the bag-of-words to capture a little bit more meaning from the document.

In this approach, each word or token is called a “gram”. Creating a vocabulary of two-word pairs is, in turn, called a bigram model. Again, only the bigrams that appear in the corpus are modeled, not all possible bigrams.

    An N-gram is an N-token sequence of words: a 2-gram (more commonly called a bigram) is a two-word sequence of words like “please turn”, “turn your”, or “your homework”, and a 3-gram (more commonly called a trigram) is a three-word sequence of words like “please turn your”, or “turn your homework”.
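
N-grams are easy to generate from a list of tokens. The helper below (`ngrams` is a name made up for this sketch) slides a window of N tokens over the sequence and joins each window into one string.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "please turn your homework".split()
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(bigrams)
print(trigrams)
```

In scikit-learn the same effect is obtained through the `ngram_range` parameter of the vectorizers, for example `ngram_range=(1, 2)` to use both single words and bigrams.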

Scoring Words

Once a vocabulary has been chosen, the occurrence of words in example documents needs to be scored.

In the worked example, we have already seen one very simple approach to scoring: a binary scoring of the presence or absence of words.

Some additional simple scoring methods include:

    Counts. Count the number of times each word appears in a document.
    Frequencies. Calculate the frequency that each word appears in a document out of all the words in the document.
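
Both scoring methods can be sketched with the standard library. The example document here is simply the first two lines of the corpus joined together, an arbitrary choice so that some words occur more than once.

```python
from collections import Counter

tokens = "it was the best of times it was the worst of times".split()

# Counts: number of times each word appears in the document.
counts = Counter(tokens)

# Frequencies: each count divided by the total number of words in the document.
total = len(tokens)
frequencies = {word: count / total for word, count in counts.items()}

print(counts)
print(frequencies)
```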


CountVectorizer

CountVectorizer works on term frequency, i.e. it counts the occurrences of tokens and builds a sparse matrix of documents x tokens.

In [1]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
vect = CountVectorizer()
vect

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

TF-IDF Vectorizer

TF-IDF stands for term frequency-inverse document frequency. TF-IDF weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus.

    Term Frequency (TF): a score of how frequently the word occurs in the current document. Because documents differ in length, a term may appear many more times in long documents than in short ones, so the term frequency is often divided by the document length to normalize it.

    Inverse Document Frequency (IDF): a score of how rare the word is across documents. The rarer the term, the higher its IDF score.
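
The two scores can be combined by hand. The sketch below uses a common textbook definition, tf = count / document length and idf = log(N / document frequency), on a tokenized corpus invented for the example. It is an assumption that this is the variant intended here; scikit-learn's `TfidfVectorizer` uses a smoothed formula and normalizes the result, so its numbers will differ.

```python
import math

# Hypothetical tokenized corpus, invented for this sketch.
docs = [
    ["the", "food"],
    ["the", "service"],
    ["the", "food", "menu"],
]


def tf(term, doc):
    """Term frequency: occurrences of the term divided by document length."""
    return doc.count(term) / len(doc)


def idf(term, docs):
    """Inverse document frequency; the term must occur in at least one document."""
    doc_freq = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / doc_freq)


def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)


for term in ["the", "food", "menu"]:
    print(term, tfidf(term, docs[2], docs))
```

A word that appears in every document ("the") gets a weight of zero, while a word found in only one document ("menu") gets the highest weight.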

In [2]:
# import and instantiate TfidfVectorizer (with the default parameters)
from sklearn.feature_extraction.text import TfidfVectorizer
vect = TfidfVectorizer()
vect

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words=None, strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

In [3]:
corpus = ["hotel food","hote food service"]
vect = TfidfVectorizer()
X = vect.fit_transform(corpus)
idf = vect.idf_
vect.vocabulary_

{'hotel': 2, 'food': 0, 'hote': 1, 'service': 3}
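
The learned `idf_` values can be compared with a hand calculation. With the default `smooth_idf=True`, scikit-learn documents the formula idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing the term. The sketch below applies it to the same two-document corpus (kept exactly as above, including the token "hote"); the document frequencies are counted by hand.

```python
import math
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["hotel food", "hote food service"]
vect = TfidfVectorizer()  # smooth_idf=True by default
vect.fit(corpus)

# Document frequencies counted by hand for this corpus.
n_docs = len(corpus)
doc_freq = {"food": 2, "hote": 1, "hotel": 1, "service": 1}

# Smoothed idf as documented by scikit-learn.
expected = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

# idf_ is indexed by the column ids stored in vocabulary_.
learned = {term: float(vect.idf_[idx]) for term, idx in vect.vocabulary_.items()}

for term in sorted(learned):
    print(term, learned[term], expected[term])
```

"food" occurs in both documents, so it receives the lowest idf; the other three terms occur in one document each and share the same, higher value.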

In [4]:
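def computeTF(wordDict, bow):
    """Return the term frequency of each word: its count divided by the document length."""
    tfDict = {}
    bowCount = len(bow)  # total number of tokens in the document
    for word, count in wordDict.items():
        # Normalize the raw count by the document length.
        tfDict[word] = count / float(bowCount)
    return tfDict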