In [1]:
import nltk

#### The entire process of making text noise-free and ready for analysis is called text preprocessing.

    It consists of three steps:
        (1). Noise Removal
        (2). Lexicon Normalization
        (3). Object Standardization
        
  Raw Text ->
  
  Noise Removal ([Stop Words, URLs, Punctuation, ...]) -> 
  
  Lexicon Normalization ([Tokenization, Lemmatization, Stemming]) -> 
  
  Object Standardization ([Regular Expressions, Lookup Tables]) -> 
  
  Cleaned Text

#### What is Noise - Anything that is not valuable in the given context is called noise.

Examples of noise:

    (1). Language stop words
    (2). URLs and media links
    (3). Industry-specific words

A general approach to removing noise is to build a dictionary of noisy entities (words) and iterate through the text, dropping every token that appears in it.

In [5]:
# sample code to remove the noisy words from text

noise_entity = set(['a', 'an', 'the', 'that', 'This', 'is'])

def _remove_noise(text):
    text_list = text.split()
    cleaned_list = [item for item in text_list if item not in noise_entity]
    cleaned_text = ' '.join(cleaned_list)
    return cleaned_text
_remove_noise("This is a text")
    

'text'
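
The lookup above is case-sensitive and treats "text." and "text" as different tokens. A minimal sketch of a more forgiving variant follows; the function name and the stop-word set are assumptions made for illustration, not a standard list.

```python
import string

# Hypothetical stop-word set for illustration; real projects usually start
# from a curated list and extend it with domain-specific terms.
NOISE_WORDS = {"a", "an", "the", "that", "this", "is"}

def remove_noise_normalized(text, noise_words=NOISE_WORDS):
    """Drop tokens whose lowercased, punctuation-stripped form is a noise word."""
    kept = []
    for token in text.split():
        # Strip leading/trailing punctuation and lowercase before the lookup
        key = token.strip(string.punctuation).lower()
        if key and key not in noise_words:
            kept.append(token)
    return " ".join(kept)

print(remove_noise_normalized("This is a text. THE end, is near!"))
```

Only the lookup key is normalized; the tokens that survive keep their original casing and punctuation.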

In [9]:
# Regular expressions can also be used to filter out some kinds of noise
import re
def _remove_noise_regex(text, pattern):
    # re.sub removes every match in one pass; re-using the matched text as a
    # new, unescaped pattern would break on regex metacharacters
    return re.sub(pattern, '', text)

regex_pattern = r'#\w*'
_remove_noise_regex("remove this #hashtag from analytics vidhya", regex_pattern)
    

'remove this  from analytics vidhya'
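
URLs are listed above as a typical kind of noise. The sketch below shows one possible way to strip them with the same regex approach; the pattern is a deliberately simplified assumption and will not cover every valid URL.

```python
import re

# Simplified URL pattern (an assumption for illustration): it matches
# http/https links and bare "www." hosts up to the next whitespace.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")

def remove_urls(text):
    """Remove URL-like substrings, then collapse leftover whitespace."""
    without_urls = URL_PATTERN.sub("", text)
    return re.sub(r"\s+", " ", without_urls).strip()

print(remove_urls("read more at https://example.com/post?id=1 or www.example.org today"))
```

Collapsing whitespace afterwards avoids the double space that a plain removal leaves behind.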

#### Lexicon Normalization
    (1). Stemming - A rudimentary rule-based process that strips suffixes such as "s", "es", "ly" and "ing" from a word.
    (2). Lemmatization - An organized way of finding the root form (lemma) of a word. 
        It takes word structure and grammatical relations into account.

In [23]:
from nltk.stem.wordnet import WordNetLemmatizer
lem = WordNetLemmatizer()

from nltk.stem.porter import PorterStemmer
stem = PorterStemmer()

word='playing'
print(lem.lemmatize(word, 'v'))

print(stem.stem(word))

play
play
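
To make the "rule-based suffix removal" idea concrete, here is a toy stemmer. It is a rough sketch with a made-up suffix list and a made-up minimum stem length; it is not the Porter algorithm, which applies many more rules.

```python
# Hypothetical suffix list, checked in this order; not the Porter rule set.
SUFFIXES = ("ing", "ly", "es", "s")

def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix if the remaining stem is long enough."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

for w in ["playing", "quickly", "boxes", "cats", "is", "studies"]:
    print(w, "->", naive_stem(w))
```

The last example comes out as "studi", which is not a dictionary word. That is the kind of result a lemmatizer is meant to avoid.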


#### Object Standardization - Handling words and phrases that are not present in any standard lexical dictionary.
Some examples are acronyms, hashtags, and colloquial slang.

Examples - awsm, rt, dm, luv

With the help of regular expressions and manually prepared data dictionaries, these cases can be handled.

In [28]:
lookup_dict = {'awsm': 'awesome', 'luv': 'Love', 'rt': 'Retweet'}

def _lookup_words(text):
    words = text.split()
    new_words = []
    for word in words:
        if word.lower() in lookup_dict:
            temp = lookup_dict[word.lower()]
            new_words.append(temp)
        else:
            new_words.append(word)
    return ' '.join(new_words)
_lookup_words("RT this is retweeted by me saying NLP is awsm")

'Retweet this is retweeted by me saying NLP is awesome'
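
The split-based lookup above misses tokens with punctuation attached, such as "awsm!". One possible alternative combines the lookup table with a regular expression; the table entries and names below are illustrative assumptions.

```python
import re

# Same idea as the lookup table above; entries are illustrative only.
SLANG_LOOKUP = {"awsm": "awesome", "luv": "love", "rt": "retweet", "dm": "direct message"}

# One alternation of all keys, anchored on word boundaries so that "rt"
# does not match inside words such as "art" or "retweeted".
SLANG_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, SLANG_LOOKUP)) + r")\b", re.IGNORECASE
)

def standardize(text):
    """Replace every known slang token, leaving punctuation untouched."""
    return SLANG_PATTERN.sub(lambda m: SLANG_LOOKUP[m.group(1).lower()], text)

print(standardize("RT: NLP is awsm! luv it, dm me"))
```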

## Text to Features

#### Depending on the use case, text can be converted into features using various techniques
    1. Syntactic Parsing - Analysis of the words in a sentence according to grammar. 
        Dependency trees and POS tags are the key attributes of text syntax.
    
    2. N-grams and other word-based features
    3. Statistical Features
    4. Word Embeddings
    

In [7]:
from nltk import word_tokenize, pos_tag

text = "I am learning Natural Language Processing from various sources"
tokens = word_tokenize(text)
print(pos_tag(tokens))

[('I', 'PRP'), ('am', 'VBP'), ('learning', 'VBG'), ('Natural', 'NNP'), ('Language', 'NNP'), ('Processing', 'VBG'), ('from', 'IN'), ('various', 'JJ'), ('sources', 'NNS')]


#### POS tagging is useful for many NLP tasks.
    1. Word sense disambiguation - "Please book my flight." vs. "I am reading this book in flight." Here "book" is a verb in the first sentence and a noun in the second.
    2. Improving word-based features - POS tags preserve some of a word's context, which makes for stronger features.
    3. Normalization & lemmatization - POS tags are the basis of lemmatization.
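
The lemmatizer call earlier in this notebook takes a WordNet POS letter (`'v'`), while `pos_tag` returns Penn Treebank tags. A common way to bridge the two is a small mapping function; the sketch below is one such mapping, with the function name and the noun fallback chosen for illustration.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter (default: noun)."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # noun, and fallback for everything else

tagged = [("I", "PRP"), ("am", "VBP"), ("learning", "VBG"), ("various", "JJ"), ("sources", "NNS")]
for word, tag in tagged:
    print(word, tag, "->", penn_to_wordnet(tag))
```

The returned letter can then be passed as the second argument of `lem.lemmatize(word, pos)`.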
    

### Entity Extraction

Entities are the most important chunks of a sentence. 

Entity detection algorithms are generally ensembles of rule-based parsing, dictionary lookups, POS tagging, and dependency parsing.

Applications include automated chatbots, content analyzers, and consumer-insight tools.
                  
  
                  
***At the W party Thursday{day} night{time} at Nandi-Hills{place}.***



Topic modeling and Named Entity Recognition (NER) are the two key entity detection methods in NLP.

__1. Named Entity Recognition__

Sentence - Sergey Brin, the manager of Google Inc. is walking in the streets of New York.

Named Entities - ("Sergey Brin", Person), ("Google Inc.", Org), ("New York", Place)

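A typical NER model consists of three blocks:

_Noun Phrase Identification_ - Extracting all noun phrases from a text using POS tagging or dependency parsing.

_Phrase Classification_ - Assigning each extracted noun phrase to a category such as person, organization, or place.

_Entity Disambiguation_ - Validating the classified entities, for example against a knowledge base, to correct wrong labels.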
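
As a rough illustration of noun phrase identification from POS tags, the sketch below groups runs of consecutive proper nouns into candidate entities. The tags are written by hand for the example sentence and are an assumption, not the output of a real tagger; the function name is also made up.

```python
from itertools import groupby

# Hand-written (word, tag) pairs in Penn Treebank style; these tags are an
# assumption for illustration, not the output of a real tagger.
tagged = [
    ("Sergey", "NNP"), ("Brin", "NNP"), (",", ","), ("the", "DT"),
    ("manager", "NN"), ("of", "IN"), ("Google", "NNP"), ("Inc.", "NNP"),
    ("is", "VBZ"), ("walking", "VBG"), ("in", "IN"), ("the", "DT"),
    ("streets", "NNS"), ("of", "IN"), ("New", "NNP"), ("York", "NNP"), (".", "."),
]

def proper_noun_chunks(tagged_tokens):
    """Join each run of consecutive proper nouns (NNP/NNPS) into one candidate."""
    chunks = []
    for is_proper, group in groupby(tagged_tokens, key=lambda wt: wt[1].startswith("NNP")):
        if is_proper:
            chunks.append(" ".join(word for word, _ in group))
    return chunks

print(proper_noun_chunks(tagged))
```

This only finds candidate spans. Deciding whether a span is a person, an organization, or a place is a separate step.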

__2. Topic Modeling__

Topic modeling is the process of identifying the topics present in a text corpus in an unsupervised manner.

Topics are defined as "a repeating pattern of co-occurring terms in a corpus."
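
To give a feel for "co-occurring terms", the sketch below counts how often pairs of words appear in the same document. It is not a topic model, only a small illustration of the signal that topic models build on; the corpus is made up.

```python
from collections import Counter
from itertools import combinations

# Tiny made-up corpus, already tokenized and lowercased (illustrative only)
docs = [
    ["doctor", "patient", "hospital"],
    ["patient", "hospital", "health"],
    ["wheat", "farm", "crops"],
    ["farm", "crops", "doctor"],
]

def cooccurrence_counts(tokenized_docs):
    """Count in how many documents each unordered pair of terms appears together."""
    counts = Counter()
    for tokens in tokenized_docs:
        # sorted(set(...)) gives each pair one canonical order per document
        counts.update(combinations(sorted(set(tokens)), 2))
    return counts

for pair, n in cooccurrence_counts(docs).most_common(2):
    print(pair, n)
```

Pairs that keep showing up together across documents are the raw material from which topics emerge.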

A good topic model should produce terms such as "health", "doctor", "patient", "hospital" for the topic _Healthcare_, and

"farm", "wheat", "crops" for the topic _Farming_.

**LDA (Latent Dirichlet Allocation)** is the most popular topic modeling technique.


In [None]:
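doc1 = "Sugar is bad to consume. My sister likes to have sugar, but not my father." 
doc2 = "My father spends a lot of time driving my sister around to dance practice."
doc3 = "Doctors suggest that driving may cause increased stress and blood pressure."

doc = [doc1, doc2, doc3]
doc_clean = [d.split() for d in doc]

import gensim
from gensim import corpora

# Map each unique token to an integer id, then convert every document
# into a bag-of-words list of (token_id, count) pairs
dictionary = corpora.Dictionary(doc_clean)
doc_term_matrix = [dictionary.doc2bow(d) for d in doc_clean]

# Train an LDA model; num_topics and passes are illustrative choices
lda = gensim.models.LdaModel(doc_term_matrix, num_topics=3, id2word=dictionary, passes=50)
print(lda.print_topics(num_topics=3, num_words=3))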

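#### N-Gram Features

A combination of N consecutive words is called an N-gram. N-grams with N > 1 are generally more informative than single words (unigrams), and bigrams (N=2) are often considered the most important of these features.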

In [11]:
def n_gram(string, n):
    # Slide a window of n tokens over the whitespace-split text, one step at a time
    string_list = string.split()
    return [string_list[i: i+n] for i in range(len(string_list) - n + 1)]
n_gram("My Name is Anshu Kumar", 2)
        
    

[['My', 'Name'], ['Name', 'is'], ['is', 'Anshu'], ['Anshu', 'Kumar']]
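
To use N-grams as features, they are usually counted. A minimal sketch of that step, assuming whitespace tokenization and lowercasing (both simplifications), could look like this:

```python
from collections import Counter

def ngram_counts(text, n):
    """Count each n-gram (as a tuple of lowercased tokens) in the text."""
    tokens = text.lower().split()
    # zip over n staggered views of the token list yields every window of size n
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams)

counts = ngram_counts("the cat sat on the mat and the cat slept", 2)
print(counts.most_common(1))
```

Tuples are used instead of lists so that each N-gram can serve as a dictionary key.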

### Statistical Features

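__TF-IDF__ - TF-IDF converts text documents into vectors based on how often each word occurs in a document (term frequency), down-weighted by how many documents in the corpus contain that word (inverse document frequency).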


In [19]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
tfidf = TfidfVectorizer()
corpus = ['This is sample document.', 'another random document.', 'third sample document text']
# X is a sparse matrix with one row per document and one column per vocabulary term
X = tfidf.fit_transform(corpus)

In [25]:
print(CountVectorizer().fit_transform(corpus).toarray())

[[0 1 1 0 1 0 0 1]
 [1 1 0 1 0 0 0 0]
 [0 1 0 0 1 1 1 0]]


In [28]:
print(X)

  (0, 7)	0.58448290102
  (0, 2)	0.58448290102
  (0, 4)	0.444514311537
  (0, 1)	0.345205016865
  (1, 1)	0.385371627466
  (1, 0)	0.652490884513
  (1, 3)	0.652490884513
  (2, 4)	0.444514311537
  (2, 1)	0.345205016865
  (2, 6)	0.58448290102
  (2, 5)	0.58448290102


In [27]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
tfidf.get_feature_names_out().tolist()

['another', 'document', 'is', 'random', 'sample', 'text', 'third', 'this']
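
The weights printed above can be reproduced by hand. The sketch below assumes a smoothed inverse document frequency, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row, and a tokenizer that keeps lowercase tokens of two or more word characters. These assumptions are meant to mirror the vectorizer's default settings; the function names are made up.

```python
import math
import re
from collections import Counter

corpus = ['This is sample document.', 'another random document.', 'third sample document text']

def tokenize(doc):
    # Lowercase and keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", doc.lower())

def tfidf_vectors(docs):
    """Return (vocabulary, rows) with smoothed idf and L2-normalized rows."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for tokens in tokenized for t in tokens})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term
    df = Counter(t for tokens in tokenized for t in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        raw = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows

vocab, rows = tfidf_vectors(corpus)
for term, weight in zip(vocab, rows[0]):
    if weight:
        print(f"{term}: {weight:.4f}")
```

"document" appears in all three texts, so it gets the lowest weight in every row, while words unique to one text get the highest.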

### Word Embeddings

Word embeddings are the modern way of representing words as vectors of real numbers. A word embedding represents each word in a fixed number of dimensions.

For example, the word "man" might be represented in 5 dimensions as [4.2, 4.5, 1.1, 3.76, 1.9], where each value is the magnitude along a particular dimension.

The aim of word embeddings is to map high-dimensional word features into low-dimensional feature vectors
while preserving contextual similarity in the corpus.

They are widely used as inputs to deep learning models such as CNNs and RNNs.

Word2Vec and GloVe are two popular models for creating word embeddings from text.

The Word2Vec model is composed of a preprocessing module and two shallow neural network models: 
    1. Continuous Bag of Words (CBOW)
    2. Skip-Gram
Word2Vec first constructs a vocabulary from the given corpus and then learns word embedding representations.
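
Skip-gram learns to predict the surrounding context words from a target word, and CBOW does the reverse. The sketch below shows only how (target, context) training pairs could be generated for skip-gram; the neural network itself is left out, and the function name and window size are illustrative assumptions.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for every word within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

sentence = ["vidhya", "science", "data", "analytics"]
for pair in skipgram_pairs(sentence, window=1):
    print(pair)
```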


In [44]:
from gensim.models import Word2Vec

sentences = [['data', 'science'], ['vidhya', 'science', 'data', 'analytics'],['machine', 'learning'], ['deep', 'learning']]

# min_count=1 keeps every word, even those that occur only once
model = Word2Vec(sentences, min_count=1)
# Word vectors live on model.wv in current gensim releases
print('model similarity: ', model.wv.similarity('data', 'learning'))

model similarity:  -0.0208757325844


These similarity scores are cosine similarities between word vectors, and they can be used as a building block for measuring text similarity.
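
Cosine similarity is simple enough to compute by hand. The sketch below uses the 5-d "man" vector from the text above and a second vector that is made up purely for illustration; returning 0.0 for zero vectors is a convention chosen here.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)

man = [4.2, 4.5, 1.1, 3.76, 1.9]      # example vector from the text above
woman = [4.0, 4.4, 1.3, 3.9, 1.7]     # hypothetical vector, made up for this sketch
print(round(cosine_similarity(man, woman), 4))
```

Values close to 1 mean the vectors point in nearly the same direction, values near 0 mean they are unrelated, and negative values mean they point in opposite directions.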