# Natural Language Processing

According to industry estimates, only 21% of the available data is present in structured form. Data is being generated as we speak, as we tweet, as we send messages on WhatsApp and in various other activities. The majority of this data exists in textual form, which is highly unstructured in nature.

A few notable examples include tweets and posts on social media, user-to-user chat conversations, news, blogs and articles, product or service reviews, and patient records in the healthcare sector. More recent examples include chatbots and other voice-driven bots.

Despite its high dimensionality, the information present in this data is not directly accessible unless it is processed (read and understood) manually or analyzed by an automated system. To produce significant and actionable insights from text data, it is important to get acquainted with the techniques and principles of Natural Language Processing (NLP).

In [0]:
! pip install nltk



In [0]:
import nltk  
nltk.download()

NLTK Downloader
---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------
Downloader> q


True

In [0]:
! pip install spacy



## Introduction to Natural Language Processing

NLP is a branch of data science that consists of systematic processes for analyzing, understanding, and deriving information from text data in a smart and efficient manner. By utilizing NLP and its components, one can organize massive chunks of text data, perform numerous automated tasks, and solve a wide range of problems such as automatic summarization, machine translation, named entity recognition, relationship extraction, sentiment analysis, speech recognition, and topic segmentation.

Before moving further, let's have a look at a few frequently used keywords.

Tokenization – process of converting a text into tokens

Tokens – words or entities present in the text

Text object – a sentence or a phrase or a word or an article
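To make these terms concrete, here is a minimal tokenization sketch using only the standard library. The function name `simple_tokenize` and the token pattern are assumptions made for illustration; library tokenizers such as NLTK's handle punctuation and contractions with more care.

```python
import re

def simple_tokenize(text):
    # Lowercase the text object, then keep runs of letters, digits and apostrophes as tokens.
    return re.findall(r"[a-z0-9']+", text.lower())

# A sentence is the text object; the returned list holds its tokens.
tokens = simple_tokenize("NLP turns raw text into tokens, doesn't it?")
print(tokens)
```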

### Text Preprocessing

Since text is the most unstructured form of all the available data, various types of noise are present in it, and the data is not readily analyzable without pre-processing. The entire process of cleaning and standardizing text, making it noise-free and ready for analysis, is known as text preprocessing.

It predominantly comprises three steps:

* Noise Removal
* Lexicon Normalization
* Object Standardization

### Noise Removal

Any piece of text that is not relevant to the context of the data and the end output can be considered noise.

Examples include language stopwords (commonly used words of a language such as is, am, the, of, in), URLs or links, social media entities (mentions, hashtags), punctuation, and industry-specific words. This step removes all types of noisy entities present in the text. A general approach is to prepare a dictionary of noisy entities and iterate over the text object token by token (or word by word), eliminating the tokens that are present in the noise dictionary.

In [0]:
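# The stop word list is exposed by sklearn.feature_extraction.text;
# the older sklearn.feature_extraction.stop_words module is no longer available in recent releases
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

print(ENGLISH_STOP_WORDS)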

frozenset({'bottom', 'be', 'nor', 'whom', 'those', 'whence', 'three', 'thereupon', 'describe', 'move', 'except', 'out', 'must', 'per', 'cant', 'and', 'during', 'hereafter', 'eleven', 'however', 'one', 'there', 'etc', 'everything', 'then', 'anyway', 'where', 'yet', 'done', 'her', 'of', 'hers', 'its', 'nobody', 'him', 'toward', 'anyone', 'whenever', 'about', 'un', 'again', 'meanwhile', 'can', 'hereby', 'all', 'see', 'yourself', 'to', 'formerly', 'bill', 'therefore', 'no', 'into', 'made', 'although', 'around', 'between', 'every', 'might', 'myself', 'ourselves', 'ltd', 'she', 'by', 'someone', 'system', 'some', 'go', 'once', 'find', 'since', 'your', 'other', 'take', 'everywhere', 'us', 'name', 'found', 'what', 'do', 'elsewhere', 'over', 'how', 'empty', 'eg', 'six', 'two', 'well', 'enough', 'anywhere', 'been', 'whose', 'co', 'moreover', 'too', 'anyhow', 'each', 'forty', 'seems', 'was', 'side', 'even', 'whoever', 'below', 'together', 'whether', 'get', 'become', 'throughout', 'because', 'whate

In [0]:
# Sample code to remove noisy words from a text

noise_list = ["is", "a", "this", "...",'to', 'and'] 
def remove_noise(input_text):
    words = input_text.split() 
    noise_free_words = [word for word in words if word not in noise_list] 
    noise_free_text = " ".join(noise_free_words) 
    return noise_free_text

remove_noise("this is a sample text and i'll go to market now")

"sample text i'll go market now"

In [0]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [0]:
from nltk.corpus import stopwords
set(stopwords.words('english'))

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 'r

Another approach is to use regular expressions when dealing with special patterns of noise.

In [0]:
# Sample code to remove a regex pattern 
import re 

def remove_regex(input_text, regex_pattern):
    # re.sub removes every match directly; re-using each matched string as a
    # new pattern would misfire when the match contains regex metacharacters
    return re.sub(regex_pattern, '', input_text)

regex_pattern = r"#[\w]*"

remove_regex("remove this #hashtag from my given string object", regex_pattern)

'remove this  from my given string object'
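The same idea extends to the other noisy entities mentioned earlier (URLs and mentions). The sketch below is one possible way to chain several patterns; the pattern list and the name `strip_noise` are assumptions for illustration, and real-world URL matching usually needs a more careful expression.

```python
import re

# Illustrative patterns only; adjust them to the noise present in your data.
NOISE_PATTERNS = [
    r"https?://\S+",  # URLs
    r"@\w+",          # social media mentions
    r"#\w+",          # hashtags
]

def strip_noise(text):
    for pattern in NOISE_PATTERNS:
        text = re.sub(pattern, "", text)
    # Collapse the extra whitespace left behind by the removed entities.
    return " ".join(text.split())

print(strip_noise("check https://example.com via @user #nlp now"))
```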

**Lexicon Normalization**

Another type of textual noise comes from the multiple representations of a single word.

For example, “play”, “player”, “played”, “plays” and “playing” are all variations of the word “play”. Though they mean different things, they are contextually similar. This step converts all the variations of a word into their normalized form (also known as the lemma). Normalization is a pivotal step for feature engineering with text, as it converts high-dimensional features (N different features) to a low-dimensional space (1 feature), which is ideal for any ML model.

The most common lexicon normalization practices are:

* Stemming: a rudimentary rule-based process of stripping suffixes (“ing”, “ly”, “es”, “s” etc.) from a word.
* Lemmatization: an organized, step-by-step procedure for obtaining the root form of a word. It makes use of vocabulary (the dictionary importance of words) and morphological analysis (word structure and grammar relations).
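The difference between the two practices can be sketched with a toy example. This is not how the Porter stemmer or the WordNet lemmatizer work internally; the suffix list comes from the text above, while `LEMMA_TABLE`, `toy_stem` and `toy_lemmatize` are hypothetical names used only for illustration.

```python
SUFFIXES = ("ing", "ly", "es", "s")  # suffixes named in the text above

# Hypothetical miniature vocabulary standing in for a real lexical database.
LEMMA_TABLE = {"played": "play", "playing": "play", "plays": "play", "better": "good"}

def toy_stem(word):
    # Rule-based: strip the first matching suffix, keeping at least three characters.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    # Vocabulary-based: look the word up, fall back to the word itself.
    return LEMMA_TABLE.get(word, word)

for w in ["playing", "studies", "better"]:
    print(w, toy_stem(w), toy_lemmatize(w))
```

Note how the stemmer can produce a non-word such as "studi", while the lookup can map an irregular form such as "better" to "good", which no suffix rule would find.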

In [0]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [0]:
from nltk.stem.wordnet import WordNetLemmatizer 
lem = WordNetLemmatizer()

from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
stem = PorterStemmer()
new_stem = SnowballStemmer("english")

word = "dictionary" 

print(lem.lemmatize(word, "v"))
print(stem.stem(word))
print(new_stem.stem(word))

dictionary
dictionari
dictionari


**Object Standardization**

Text data often contains words or phrases that are not present in any standard lexical dictionary. These pieces are not recognized by search engines and models.

Some examples are acronyms, hashtags with attached words, and colloquial slang. With the help of regular expressions and manually prepared data dictionaries, this type of noise can be fixed. The code below uses a dictionary lookup method to replace social media slang in a text.

In [0]:
lookup_dict = {'rt':'Retweet', 'dm':'direct message', "awsm" : "awesome", "luv" :"love"}
def lookup_words(input_text):
    words = input_text.split() 
    new_words = [] 
    for word in words:
        if word.lower() in lookup_dict:
            word = lookup_dict[word.lower()]
        new_words.append(word) 
    # Join once after the loop so that empty input also returns a string
    new_text = " ".join(new_words) 
    return new_text

print(lookup_words("RT this is a retweeted tweet by Doland J.Trump"))

Retweet this is a retweeted tweet by Doland J.Trump


### Text to Features (Feature Engineering on text data)
To analyse preprocessed data, it needs to be converted into features. Depending upon the usage, text features can be constructed using assorted techniques: syntactic parsing, entities / N-grams / word-based features, statistical features, and word embeddings. Read on to understand these techniques in detail.

**Syntactic Parsing**

Syntactic parsing involves analyzing the words in a sentence for grammar and arranging them in a manner that shows the relationships among the words. Dependency grammar and part of speech tags are the important attributes of text syntax.

* Dependency Trees – Sentences are composed of words sewn together. The relationships among the words in a sentence are determined by the basic dependency grammar. Dependency grammar is a class of syntactic text analysis that deals with (labeled) asymmetrical binary relations between two lexical items (words). Every relation can be represented in the form of a triplet (relation, governor, dependent). For example, consider the sentence “**Bills on ports and immigration were submitted by Senator Brownback, Republican of Kansas**.” The relationships among the words can be observed in the tree representation shown below.

![alt text](https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2017/01/11181146/image-2.png)


The tree shows that “submitted” is the root word of this sentence and is linked to two subtrees (the subject and object subtrees). Each subtree is itself a dependency tree with relations such as (“Bills” <-> “ports” <by> “preposition” relation) and (“ports” <-> “immigration” <by> “conjunction” relation).

This type of tree, when parsed recursively in a top-down manner, gives grammar relation triplets as output, which can be used as features for many NLP problems such as entity-wise sentiment analysis, actor and entity identification, and text classification. The Python wrapper StanfordCoreNLP (for Stanford CoreNLP by the Stanford NLP Group, which is released under the GPL with separate commercial licensing available) and NLTK dependency grammars can be used to generate dependency trees.
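The recursive top-down traversal can be sketched without any parser. The tree below is a hand-written fragment of the example sentence, and the relation labels (`nsubjpass`, `prep_on`, `conj_and`, `agent`) are assumptions loosely modeled on Stanford-style names rather than the output of an actual parser.

```python
# Hand-written (hypothetical) fragment of the dependency tree for the example sentence.
tree = {
    "word": "submitted",
    "children": [
        ("nsubjpass", {"word": "Bills", "children": [
            ("prep_on", {"word": "ports", "children": [
                ("conj_and", {"word": "immigration", "children": []}),
            ]}),
        ]}),
        ("agent", {"word": "Brownback", "children": []}),
    ],
}

def triplets(node):
    """Walk the tree top-down, yielding (relation, governor, dependent)."""
    for relation, child in node["children"]:
        yield (relation, node["word"], child["word"])
        yield from triplets(child)

for triplet in triplets(tree):
    print(triplet)
```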

**Part of speech tagging** – Apart from the grammar relations, every word in a sentence is also associated with a part of speech (POS) tag (noun, verb, adjective, adverb etc.). The POS tags define the usage and function of a word in the sentence. The Penn Treebank tag set, defined at the University of Pennsylvania, lists the possible POS tags. The following code uses NLTK to perform POS tagging on input text. (NLTK provides several implementations; the default one is the perceptron tagger.)

In [0]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [0]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [0]:
from nltk import word_tokenize, pos_tag
text = "I am going to travel the world and click lot of beautiful pictures"
tokens = word_tokenize(text)
print(tokens)
print (pos_tag(tokens))

['I', 'am', 'going', 'to', 'travel', 'the', 'world', 'and', 'click', 'lot', 'of', 'beautiful', 'pictures']
[('I', 'PRP'), ('am', 'VBP'), ('going', 'VBG'), ('to', 'TO'), ('travel', 'VB'), ('the', 'DT'), ('world', 'NN'), ('and', 'CC'), ('click', 'NN'), ('lot', 'NN'), ('of', 'IN'), ('beautiful', 'JJ'), ('pictures', 'NNS')]


**A. Word sense disambiguation**: Some words have multiple meanings depending on their usage. For example, in the two sentences below:

I. “Please book my flight for Delhi”

II. “I am going to read this book in the flight”

“Book” is used in different contexts, and the part of speech tag differs between the two cases. In sentence I, the word “book” is used as a verb, while in II it is used as a noun. (The Lesk algorithm is also used for similar purposes.)

**B. Improving word-based features**: A learning model can learn different contexts of a word when words are used as features; however, if the part of speech tag is linked with them, the context is preserved, making for stronger features. For example:

Sentence – “book my flight, I will read this book”

Tokens – (“book”, 2), (“my”, 1), (“flight”, 1), (“I”, 1), (“will”, 1), (“read”, 1), (“this”, 1)

Tokens with POS – (“book_VB”, 1), (“my_PRP$”, 1), (“flight_NN”, 1), (“I_PRP”, 1), (“will_MD”, 1), (“read_VB”, 1), (“this_DT”, 1), (“book_NN”, 1)
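The two feature sets above can be reproduced with a few lines of Python. The tagged tokens are typed in by hand from the example (they are not produced by a tagger here), and the `word_TAG` naming scheme simply mirrors the notation used above.

```python
from collections import Counter

# Hand-tagged tokens for "book my flight, I will read this book" (tags copied from the example).
tagged = [("book", "VB"), ("my", "PRP$"), ("flight", "NN"), ("I", "PRP"),
          ("will", "MD"), ("read", "VB"), ("this", "DT"), ("book", "NN")]

# Plain word counts merge both uses of "book" into a single feature.
word_features = Counter(word for word, _ in tagged)

# Attaching the POS tag keeps the verb and noun uses apart.
pos_features = Counter(f"{word}_{tag}" for word, tag in tagged)

print(word_features["book"])
print(pos_features["book_VB"], pos_features["book_NN"])
```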

**C. Normalization and Lemmatization**: POS tags are the basis of the lemmatization process for converting a word to its base form (lemma).

**D. Efficient stopword removal**: POS tags are also useful for efficient removal of stopwords.

For example, some tags always mark the low-frequency or less important words of a language: (IN – “within”, “upon”, “except”), (CD – “one”, “two”, “hundred”), (MD – “may”, “must” etc.)
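A tag-based filter of this kind might look like the sketch below. The tagged sentence is invented and hand-labeled for illustration, and the choice of which tags to drop is an assumption that depends on the task.

```python
# Tags named in the text above; treat this set as a tunable assumption.
LOW_VALUE_TAGS = {"IN", "CD", "MD"}

# Hand-tagged, made-up sentence: "I may read two books within weeks"
tagged = [("I", "PRP"), ("may", "MD"), ("read", "VB"), ("two", "CD"),
          ("books", "NNS"), ("within", "IN"), ("weeks", "NNS")]

# Keep only the words whose tag is not in the low-value set.
kept = [word for word, tag in tagged if tag not in LOW_VALUE_TAGS]
print(kept)
```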

### Entity Extraction (Entities as features)
Entities are defined as the most important chunks of a sentence: noun phrases, verb phrases or both. Entity detection algorithms are generally ensemble models of rule-based parsing, dictionary lookups, POS tagging and dependency parsing. Entity detection is applied in automated chatbots, content analyzers and consumer insights.


![alt text](https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2017/01/11181407/image-3.png)

**A. Named Entity Recognition (NER)**

The process of detecting named entities such as person names, location names and company names in text is called NER. For example:

Sentence – Sergey Brin, the manager of Google Inc. is walking in the streets of New York.

Named Entities –  ( “person” : “Sergey Brin” ), (“org” : “Google Inc.”), (“location” : “New York”)

A typical NER model consists of three blocks:

* Noun phrase identification: This step deals with extracting all the noun phrases from a text using dependency parsing and part of speech tagging.

* Phrase classification: This is the classification step in which all the extracted noun phrases are classified into their respective categories (locations, names etc.). The Google Maps API provides a good way to disambiguate locations, while open databases such as DBpedia and Wikipedia can be used to identify person names or company names. Apart from this, one can curate lookup tables and dictionaries by combining information from different sources.

* Entity disambiguation: Entities are sometimes misclassified, so creating a validation layer on top of the results is useful. Knowledge graphs can be exploited for this purpose. Popular knowledge graphs include Google Knowledge Graph, IBM Watson and Wikipedia.
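The lookup-table idea from the phrase classification step can be sketched as follows. This covers only dictionary-based classification, not noun phrase identification or disambiguation; the `GAZETTEER` contents and the function name `tag_entities` are hypothetical.

```python
# Hypothetical lookup table; a real system would curate this from several sources.
GAZETTEER = {
    "Sergey Brin": "person",
    "Google Inc.": "org",
    "New York": "location",
}

def tag_entities(sentence):
    found = []
    for phrase, label in GAZETTEER.items():
        start = sentence.find(phrase)
        if start != -1:
            found.append((start, label, phrase))
    # Report entities in the order they appear in the sentence.
    return [(label, phrase) for _, label, phrase in sorted(found)]

sentence = "Sergey Brin, the manager of Google Inc. is walking in the streets of New York."
print(tag_entities(sentence))
```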

 

**B. Topic Modeling**

Topic modeling is the process of automatically identifying the topics present in a text corpus. It derives the hidden patterns among the words in the corpus in an unsupervised manner. Topics are defined as “a repeating pattern of co-occurring terms in a corpus”. A good topic model results in “health”, “doctor”, “patient”, “hospital” for the topic “Healthcare”, and “farm”, “crops”, “wheat” for the topic “Farming”.



Topic modeling is different from rule-based text mining approaches that use regular expressions or dictionary-based keyword searching techniques. It is an unsupervised approach used for finding and observing groups of words (called “topics”) in large clusters of texts.

Topic models are very useful for document clustering, organizing large blocks of textual data, information retrieval from unstructured text, and feature selection. For example, the New York Times uses topic models to boost its user-article recommendation engines. Recruitment professionals use topic models to extract latent features of job descriptions and map them to the right candidates. Topic models are also used to organize large datasets of emails, customer reviews, and user social media profiles.

![alt text](https://www.analyticsvidhya.com/wp-content/uploads/2016/08/Modeling1.png)

In [0]:
# gensim.corpora ships with gensim itself; the separate PyPI packages
# named "corpora" and "corpus" are unrelated and are not needed here
! pip install gensim

**Latent Dirichlet Allocation for Topic Modeling**

There are many approaches for obtaining topics from a text, such as Term Frequency and Inverse Document Frequency, and Non-negative Matrix Factorization techniques. Latent Dirichlet Allocation (LDA) is the most popular topic modeling technique, and it is the one discussed here.

LDA assumes documents are produced from a mixture of topics. Those topics then generate words based on their probability distribution. Given a dataset of documents, LDA backtracks and tries to figure out what topics would create those documents in the first place.

LDA can be viewed as a matrix factorization technique. In vector space, any corpus (collection of documents) can be represented as a document-term matrix. The following matrix shows a corpus of N documents D1, D2, D3 … Dn and a vocabulary of M words W1, W2 … Wm. The value of cell (i, j) gives the frequency count of word Wj in document Di.

![alt text](https://www.analyticsvidhya.com/wp-content/uploads/2016/08/Modeling2.png)
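A document-term matrix like the one pictured can be built by hand for a tiny corpus. The two toy documents below are made up for illustration, and the names `vocab` and `dtm` are arbitrary.

```python
# Toy corpus (an assumption for illustration), already split into tokens.
docs = [
    "sugar is bad".split(),
    "my sister likes sugar".split(),
]

# The vocabulary fixes the column order: one column per unique word.
vocab = sorted({word for doc in docs for word in doc})

# Cell (i, j) holds the count of vocab[j] in docs[i].
dtm = [[doc.count(word) for word in vocab] for doc in docs]

print(vocab)
for row in dtm:
    print(row)
```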

LDA converts this document-term matrix into two lower-dimensional matrices, M1 and M2.
M1 is a document-topics matrix and M2 is a topic-terms matrix, with dimensions (N, K) and (K, M) respectively, where N is the number of documents, K is the number of topics and M is the vocabulary size.

![alt text](https://www.analyticsvidhya.com/wp-content/uploads/2016/08/modeling3.png)

![alt text](https://www.analyticsvidhya.com/wp-content/uploads/2016/08/Modeling4.png)

Notice that these two matrices already provide topic-word and document-topic distributions. However, these distributions need to be improved, which is the main aim of LDA. LDA makes use of sampling techniques to improve these matrices.

It iterates through each word “w” in each document “d” and tries to adjust the current topic-word assignment with a new assignment. A new topic “k” is assigned to word “w” with a probability P, which is the product of two probabilities p1 and p2.

For every topic, two probabilities p1 and p2 are calculated. p1 = p(topic t | document d) is the proportion of words in document d that are currently assigned to topic t. p2 = p(word w | topic t) is the proportion of assignments to topic t, over all documents, that come from this word w.

The current topic-word assignment is then updated with a new topic, chosen with probability p1 × p2. In this step, the model assumes that all the existing word-topic assignments except that of the current word are correct. The product is essentially the probability that topic t generated word w, so it makes sense to adjust the current word’s topic using this probability.

After a number of iterations, a steady state is achieved where the document topic and topic term distributions are fairly good. This is the convergence point of LDA.
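The p1 × p2 calculation for a single word can be sketched as below. This is a deliberately simplified illustration, not gensim's implementation: the assignments are a made-up snapshot, there are no Dirichlet smoothing priors, and the current word's own assignment is not excluded from the counts, all of which a real sampler would handle.

```python
# Hypothetical snapshot of current assignments: (document, word, topic).
assignments = [
    ("d1", "sugar", 0), ("d1", "bad", 0), ("d1", "sister", 1),
    ("d2", "sister", 1), ("d2", "driving", 1), ("d2", "sugar", 1),
]

def topic_probabilities(doc, word, num_topics):
    """Return the normalized p1 * p2 score of each topic for one word occurrence."""
    in_doc = [a for a in assignments if a[0] == doc]
    scores = []
    for t in range(num_topics):
        in_topic = [a for a in assignments if a[2] == t]
        # p1: share of words in this document currently assigned to topic t
        p1 = sum(1 for a in in_doc if a[2] == t) / len(in_doc)
        # p2: share of topic t's assignments that come from this word
        p2 = sum(1 for a in in_topic if a[1] == word) / len(in_topic) if in_topic else 0.0
        scores.append(p1 * p2)
    total = sum(scores)
    return [s / total for s in scores] if total else scores

probs = topic_probabilities("d1", "sugar", num_topics=2)
print(probs)
```

A sampler would then draw the new topic for the word from these probabilities (for instance with `random.choices`) and repeat the process over every word until the assignments stabilize.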

In [0]:
doc1 = "Sugar is bad to consume. My sister likes to have sugar, but not my father." 
doc2 = "My father spends a lot of time driving my sister around to dance practice."
doc3 = "Doctors suggest that driving may cause increased stress and blood pressure."
doc_complete = [doc1, doc2, doc3]
doc_clean = [doc.split() for doc in doc_complete]

import gensim
# import corpora

# Creating the term dictionary of our corpus, where every unique term is assigned an index.  
dictionary = gensim.corpora.Dictionary(doc_clean)

# Converting list of documents (corpus) into Document Term Matrix using dictionary prepared above. 
doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]

# Creating the object for LDA model using gensim library
Lda = gensim.models.ldamodel.LdaModel

# Running and Training LDA model on the document term matrix
ldamodel = Lda(doc_term_matrix, num_topics=3, id2word = dictionary, passes=50)

# Results 
print(ldamodel.print_topics())

[(0, '0.089*"to" + 0.051*"my" + 0.051*"My" + 0.051*"sister" + 0.051*"consume." + 0.051*"is" + 0.051*"sugar," + 0.051*"not" + 0.051*"Sugar" + 0.051*"bad"'), (1, '0.053*"driving" + 0.053*"sister" + 0.053*"My" + 0.053*"my" + 0.053*"father" + 0.053*"around" + 0.053*"a" + 0.053*"time" + 0.053*"spends" + 0.053*"lot"'), (2, '0.060*"driving" + 0.060*"that" + 0.060*"and" + 0.060*"blood" + 0.060*"cause" + 0.060*"increased" + 0.060*"suggest" + 0.060*"Doctors" + 0.060*"may" + 0.060*"pressure."')]


### Statistical Features

Text data can also be quantified directly into numbers using several techniques described in this section:

**Term Frequency – Inverse Document Frequency (TF – IDF)**

TF-IDF is a weighted model commonly used for information retrieval problems. It aims to convert text documents into vector models on the basis of the occurrence of words in the documents, without considering their exact ordering. For example, say there is a dataset of N text documents. For any document “D”, TF and IDF are defined as follows:

Term Frequency (TF) – TF for a term “t” is defined as the count of the term “t” in a document “D”.

Inverse Document Frequency (IDF) – IDF for a term “t” is defined as the logarithm of the ratio of the total number of documents in the corpus to the number of documents containing the term “t”.

TF . IDF – The TF-IDF formula gives the relative importance of a term in a corpus (list of documents): TF-IDF(t, D) = TF(t, D) × log(N / n), where n is the number of documents containing “t”. The following code uses Python’s scikit-learn package to convert a text into TF-IDF vectors:

In [0]:
from sklearn.feature_extraction.text import TfidfVectorizer
obj = TfidfVectorizer()
corpus = ['This is sample document.', 'another random document.', 'third sample document text']
X = obj.fit_transform(corpus)
print (X)

  (0, 1)	0.34520501686496574
  (0, 4)	0.444514311537431
  (0, 2)	0.5844829010200651
  (0, 7)	0.5844829010200651
  (1, 3)	0.652490884512534
  (1, 0)	0.652490884512534
  (1, 1)	0.3853716274664007
  (2, 5)	0.5844829010200651
  (2, 6)	0.5844829010200651
  (2, 1)	0.34520501686496574
  (2, 4)	0.444514311537431


The model creates a vocabulary dictionary and assigns an index to each word. Each row in the output contains a tuple (i,j) and a tf-idf value of word at index j in document i.
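The TF and IDF definitions given earlier can also be computed by hand, as in the sketch below. It follows the plain textbook formula and assumes the term occurs in at least one document. The numbers will not match the scikit-learn output above, because `TfidfVectorizer` by default uses a smoothed IDF and L2-normalizes each row.

```python
import math

# The same three documents as above, lowercased and stripped of punctuation by hand.
corpus = [
    "this is sample document".split(),
    "another random document".split(),
    "third sample document text".split(),
]

def tf(term, doc):
    # Raw count of the term in one document.
    return doc.count(term)

def idf(term, corpus):
    # log(total documents / documents containing the term)
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing)

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

print(tf_idf("document", corpus[0], corpus))  # term appears in every document
print(tf_idf("sample", corpus[0], corpus))    # term appears in two of three documents
```

A term that appears in every document gets an IDF of log(1) = 0, which is how the weighting discounts words that do not help tell documents apart.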

### Word Embedding (text vectors)
Word embedding is the modern way of representing words as vectors. The aim of word embedding is to map high-dimensional word features into low-dimensional feature vectors while preserving the contextual similarity in the corpus. Embeddings are widely used in deep learning models such as Convolutional Neural Networks and Recurrent Neural Networks.

Word2Vec and GloVe are the two popular models for creating word embeddings of a text. These models take a text corpus as input and produce word vectors as output.

The Word2Vec model is composed of a preprocessing module, a shallow neural network model called Continuous Bag of Words, and another shallow neural network model called skip-gram. These models are widely used for other NLP problems. Word2Vec first constructs a vocabulary from the training corpus and then learns word embedding representations. The following code uses the gensim package to prepare word embeddings as vectors.

In [0]:
from gensim.models import Word2Vec
sentences = [['data', 'science'], ['vidhya', 'science', 'data', 'analytics', 'physics'],['machine', 'learning'], ['deep', 'learning']]

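# train the model on your corpus  
model = Word2Vec(sentences, min_count = 1)

# Word vectors live on model.wv; calling similarity() or indexing on the model itself is deprecated
print (model.wv.similarity('physics', 'science'))
print (model.wv['learning'] )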

-0.08600392
[-1.2198028e-03  2.4706037e-03  2.6814300e-03 -1.6301633e-03
 -3.0342906e-04  1.5185301e-03  4.5139566e-03  4.8952134e-05
 -1.0094991e-03  3.8598652e-03 -1.5097521e-03  2.8059029e-03
  3.6757116e-03  1.7180959e-03 -2.5798606e-03 -3.7834109e-03
  2.7867212e-04  3.1320988e-03 -3.2025119e-03  3.5009619e-03
 -2.9597154e-03  1.4892552e-04 -3.2243966e-03 -2.4942739e-03
  1.8790006e-03  8.3901529e-04 -4.9113007e-03 -3.2897454e-03
  4.3058014e-03  2.8676714e-03  4.3029212e-03  1.9690786e-03
 -3.1480938e-03 -4.2104749e-03  9.0599561e-04 -3.6311063e-03
 -3.9923638e-03  4.6447222e-03  1.9224291e-03 -3.4577441e-03
 -4.4117952e-03 -1.5797585e-03 -1.0015120e-03  4.1363607e-03
 -1.0758891e-03 -2.5922281e-03 -6.3116715e-04  3.0007577e-03
  8.2647125e-04 -2.4536974e-03  8.8104629e-04 -9.3406107e-04
  4.5654322e-03  3.5452223e-04 -3.4287060e-03 -1.4621671e-03
  1.1387101e-03 -1.0127380e-03 -7.4882066e-04 -2.2539236e-03
 -2.2875336e-03  2.9483379e-03  2.8683969e-03 -7.0170069e-04
  1.5491229e



These vectors can be used as feature vectors for ML models, to measure text similarity using cosine similarity, and for word clustering and text classification.
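Cosine similarity itself is simple to compute. The sketch below uses plain Python lists with made-up two-dimensional vectors; it assumes neither vector is all zeros, and real embeddings would have many more dimensions.

```python
import math

def cosine_similarity(u, v):
    # Dot product divided by the product of the two vector lengths.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
```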

### Text Classification
Text classification is one of the classical problems of NLP. Well-known examples include email spam identification, topic classification of news, sentiment classification, and organization of web pages by search engines.

Text classification, in simple terms, is a technique for systematically classifying a text object (document or sentence) into one of a fixed set of categories. It is really helpful when the amount of data is too large, especially for organizing, information filtering, and storage purposes.

A typical natural language classifier consists of two parts: (a) training and (b) prediction, as shown in the image below. First, the text input is processed and features are created. The machine learning model then learns from these features and is used to make predictions on new text.

![alt text](https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2017/01/11182015/image-5.png)

In [0]:
training_corpus = [
                   ('I am exhausted of this work.', 'Class_B'),
                   ("I can't cooperate with this", 'Class_B'),
                   ('He is my badest enemy!', 'Class_B'),
                   ('My management is poor.', 'Class_B'),
                   ('I love this burger.', 'Class_A'),
                   ('This is an brilliant place!', 'Class_A'),
                   ('I feel very good about these dates.', 'Class_A'),
                   ('This is my best work.', 'Class_A'),
                   ("What an awesome view", 'Class_A'),
                   ('I do not like this dish', 'Class_B')]
test_corpus = [
                ("I am not feeling well today.", 'Class_B'), 
                ("I feel brilliant!", 'Class_A'), 
                ('Gary is a friend of mine.', 'Class_A'), 
                ("I can't believe I'm doing this.", 'Class_B'), 
                ('The date was good.', 'Class_A'), ('I do not enjoy my job', 'Class_B')]


In [0]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import svm
from sklearn.metrics import classification_report

# Split the (text, label) pairs defined above into parallel lists for scikit-learn
train_data = [text for text, label in training_corpus]
train_labels = [label for text, label in training_corpus]
test_data = [text for text, label in test_corpus]
test_labels = [label for text, label in test_corpus]

print(train_data)
# Build TF-IDF features; min_df=0.4 keeps only terms found in at least 40% of the documents
vectorizer = TfidfVectorizer(min_df=0.4, max_df=0.9)
# Learn the vocabulary and IDF weights from the training data and vectorize it
train_vectors = vectorizer.fit_transform(train_data)
# Vectorize the test data with the vocabulary learned above
test_vectors = vectorizer.transform(test_data)

print(train_vectors)
# Perform classification with SVM, kernel=linear
model = svm.SVC(kernel='linear')
model.fit(train_vectors, train_labels)
prediction = model.predict(test_vectors)

print(classification_report(test_labels, prediction))

['I am exhausted of this work.', "I can't cooperate with this", 'He is my badest enemy!', 'My management is poor.', 'I love this burger.', 'This is an brilliant place!', 'I feel very good about these dates.', 'This is my best work.', 'What an awesome view', 'I do not like this dish']
  (0, 1)	1.0
  (1, 1)	1.0
  (2, 0)	1.0
  (3, 0)	1.0
  (4, 1)	1.0
  (5, 0)	0.776355388538153
  (5, 1)	0.6302954154107211
  (7, 0)	0.776355388538153
  (7, 1)	0.6302954154107211
  (9, 1)	1.0
              precision    recall  f1-score   support

     Class_A       0.50      0.67      0.57         3
     Class_B       0.50      0.33      0.40         3

    accuracy                           0.50         6
   macro avg       0.50      0.50      0.49         6
weighted avg       0.50      0.50      0.49         6



Text classification models are heavily dependent upon the quality and quantity of features. When applying any machine learning model, it is always good practice to include more training data.

**NLP problems / tasks**
* Text Summarization – Given a text article or paragraph, summarize it automatically to produce the most important and relevant sentences in order.
* Machine Translation – Automatically translate text from one human language to another, taking care of grammar, semantics and information about the real world.
* Natural Language Generation and Understanding – Converting information from computer databases or semantic intents into readable human language is called language generation. Converting chunks of text into more logical structures that are easier for computer programs to manipulate is called language understanding.
* Optical Character Recognition – Given an image representing printed text, determine the corresponding text.
* Document to Information – This involves parsing textual data present in documents (websites, files, PDFs and images) into an analyzable and clean format.

 

**Important Libraries for NLP (python)**

* Scikit-learn: Machine learning in Python
* Natural Language Toolkit (NLTK): The complete toolkit for all NLP techniques.
* Pattern – A web mining module for Python with tools for NLP and machine learning.
* TextBlob – Easy to use NLP tools API, built on top of NLTK and Pattern.
* spaCy – Industrial strength NLP with Python and Cython.
* Gensim – Topic Modelling for Humans
* Stanford Core NLP – NLP services and packages by Stanford NLP Group.