<a href="https://colab.research.google.com/github/chain-veerender/datascience/blob/master/Natural_language_processing_Introduction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction
This session covers the fundamental concepts and algorithms in Natural Language Processing (NLP). The goal of these sessions is to understand text using relevant computational statistics.

This course will start with basic text processing techniques (such as regular expressions) and then cover advanced techniques (text classification and topic modeling). The emphasis will be on contemporary best practices in industry, including Deep Learning and text embeddings. Along the way we will touch upon text mining, information retrieval, and computational linguistics.

This is curated in a "buffet" format: a sample of many things, without going into full depth on any one topic. People get a PhD in each of these individual topics 🥇

Remember: a little bit of knowledge and a lot of "how to" go a long way in Data Science.

credits: [brianspiering Lecture](https://github.com/brianspiering/nlp-course)

![An image](https://lawtomated.com/wp-content/uploads/2019/04/structuredVsUnstructuredIgneos.png)

Image credits: https://lawtomated.com/wp-content/uploads/2019/04/structuredVsUnstructuredIgneos.png

![An image](https://www.oriresults.com/wp-content/uploads/Blog-Whats-Hiding-in-Your-Unstructured-Data-1000x592px.png)

Image credits: https://www.oriresults.com/wp-content/uploads/Blog-Whats-Hiding-in-Your-Unstructured-Data-1000x592px.png

In order to produce significant and actionable insights from text data, it is important to get acquainted with the techniques and principles of Natural Language Processing (NLP).

NLP is a branch of data science that consists of systematic processes for analyzing, understanding, and deriving information from the text data in a smart and efficient manner.

Import all libraries here

In [None]:
import re

import nltk
from nltk import pos_tag, word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Download the corpora and models used in this notebook
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


**Text Pre-processing**

Text is the most unstructured form of all available data. It contains various types of noise and is not readily analyzable without pre-processing.

The entire process of cleaning and standardizing text, making it noise-free and ready for analysis, is known as text pre-processing.

It consists of 3 steps:

* Noise Removal

* Lexicon Normalization

* Object Standardization

![An image](https://cdn.analyticsvidhya.com/wp-content/uploads/2017/01/11180616/Image-1.png)

credits: https://cdn.analyticsvidhya.com/wp-content/uploads/2017/01/11180616/Image-1.png



**Noise Removal**

Any piece of text that is not relevant to the context of the data and the end output can be considered noise.

For example: language stopwords (commonly used words of a language such as is, am, the, of, in), URLs or links, social media entities (mentions, hashtags), punctuation, and industry-specific words. This step removes all such noisy entities from the text.



In [None]:
# Sample code to remove every match of a regex pattern
def remove_symbols(input_text, symbol_pattern):
    # re.sub replaces all matches of the pattern in a single pass
    return re.sub(symbol_pattern, '', input_text)

# Raw string so that \w reaches the regex engine unchanged
symbol_pattern = r"#\w*"

remove_symbols("remove this #hashtag from this text", symbol_pattern)

'remove this  from this text'
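
The same idea extends to the other kinds of noise listed above. The following is a minimal sketch, assuming a deliberately tiny hand-picked stopword list and simple regex patterns. `STOPWORDS`, `URL_PATTERN`, `MENTION_PATTERN`, and `remove_noise` are illustrative names and are not part of NLTK.

```python
import re
import string

# Illustrative, deliberately tiny stopword list (an assumption for this sketch)
STOPWORDS = {"is", "am", "the", "of", "in", "a", "an", "and", "this", "at", "it"}

# Simple patterns for links and social media entities (assumed, not exhaustive)
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
MENTION_PATTERN = re.compile(r"[@#]\w+")

# Translation table that maps every punctuation character to a space
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def remove_noise(input_text):
    """Strip URLs, mentions/hashtags, punctuation, and stopwords."""
    text = URL_PATTERN.sub(" ", input_text)
    text = MENTION_PATTERN.sub(" ", text)
    text = text.translate(PUNCT_TABLE)
    words = [w for w in text.split() if w.lower() not in STOPWORDS]
    return " ".join(words)

print(remove_noise("Check this #nlp tutorial at https://example.com, it is great! @user"))
# prints: Check tutorial great
```

URLs are removed before punctuation so that the `://` and dots inside a link are not split into separate tokens first.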

**Lexicon Normalization**

Another type of textual noise is the multiple representations of a single word.

For example, “play”, “player”, “played”, “plays” and “playing” are variations of the word “play”. Though their meanings differ, they are contextually similar. This step converts all the variants of a word into their normalized form (also known as the lemma).

The most common lexicon normalization practices are:

**Stemming:** Stemming is a rudimentary rule-based process of stripping suffixes (“ing”, “ly”, “es”, “s”, etc.) from a word.

**Lemmatization:** Lemmatization, on the other hand, is an organized, step-by-step procedure for obtaining the root form of a word. It makes use of a vocabulary (a dictionary of word forms) and morphological analysis (word structure and grammar relations).

In [None]:
lem = WordNetLemmatizer()
stem = PorterStemmer()

word = "carrying"

print("lemmatized word:", lem.lemmatize(word, "v"))
print("stemming word:", stem.stem(word))

lemmatized word: carry
stemming word: carri
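
To make the "rule-based suffix stripping" idea concrete, here is a toy stemmer. It is a minimal sketch and not the Porter algorithm used above: the suffix list (which adds "ed" to the suffixes mentioned earlier), the `min_stem_length` parameter, and the `toy_stem` name are assumptions made for illustration, and it ignores the many special cases a real stemmer handles.

```python
# Suffixes are tried in order, longest first, so "ing" is checked before "s"
SUFFIXES = ("ing", "ly", "es", "ed", "s")

def toy_stem(word, min_stem_length=3):
    """Strip the first matching suffix, keeping at least min_stem_length characters."""
    lowered = word.lower()
    for suffix in SUFFIXES:
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= min_stem_length:
            return lowered[: -len(suffix)]
    return lowered

for w in ["play", "played", "plays", "playing", "player"]:
    print(w, "->", toy_stem(w))
# play -> play
# played -> play
# plays -> play
# playing -> play
# player -> player
```

These toy rules leave "player" unchanged, which shows how quickly a plain suffix list runs out of coverage.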


**Object Standardization**

Text data often contains words or phrases which are not present in any standard lexical dictionaries. These pieces are not recognized by search engines and models.

Some examples are acronyms, hashtags with attached words, and colloquial slang. With the help of regular expressions and manually prepared data dictionaries, this type of noise can be fixed.

In [None]:
# Manually prepared lookup dictionary mapping short forms to full forms
lookup_dict = {'rt': 'Retweet', 'dm': 'direct message', 'awsm': 'awesome'}

def decode_fullform(input_text):
    words = input_text.split()
    # Replace each word found in the lookup (case-insensitive); keep others as-is
    new_words = [lookup_dict.get(word.lower(), word) for word in words]
    return " ".join(new_words)

decode_fullform("This is a RT and dm by famous celebrity and this is awsm")

'This is a Retweet and direct message by famous celebrity and this is awesome'
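
Hashtags with attached words can be handled with a regular expression. The following is a minimal sketch, assuming the hashtag is written in CamelCase; `split_hashtag` is a hypothetical helper name. All-lowercase hashtags such as `#naturallanguage` carry no case boundaries and would need a dictionary-based approach instead.

```python
import re

def split_hashtag(hashtag):
    """Split a CamelCase hashtag such as '#NaturalLanguageProcessing' into words."""
    body = hashtag.lstrip("#")
    # Match, in order: runs of capitals (acronyms) not followed by a lowercase
    # letter, capitalized or lowercase words, or runs of digits
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", body)
    return " ".join(parts)

print(split_hashtag("#NaturalLanguageProcessing"))  # prints: Natural Language Processing
print(split_hashtag("#NLPIsAwsm"))                  # prints: NLP Is Awsm
```

The split words can then be passed through a lookup such as `decode_fullform` to expand any remaining slang.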

**Text to Features (Feature Engineering on text data)**

1. Syntactical Parsing

    1.a Dependency Grammar

    1.b Part of Speech Tagging

2. Entity Parsing

    2.a Phrase Detection

    2.b Named Entity Recognition

    2.c Topic Modelling

    2.d N-Grams

3. Statistical features

   3.a TF – IDF

   3.b Frequency / Density Features

   3.c Readability Features

4. Word Embeddings

**1. Syntactic Parsing**

Syntactic parsing involves analyzing the words in a sentence for grammar and arranging them in a way that shows the relationships among the words.

**1.a Dependency Trees**

Sentences are composed of words sewn together. The relationships among the words in a sentence are determined by the basic dependency grammar. Dependency grammar is a class of syntactic text analysis that deals with (labeled) asymmetrical binary relations between two lexical items (words). Every relation can be represented as a triplet (relation, governor, dependent). For example, consider the sentence: “Bills on ports and immigration were submitted by Senator Brownback, Republican of Kansas.”

![An image](https://cdn.analyticsvidhya.com/wp-content/uploads/2017/01/11181146/image-2.png)

Image Credits: https://cdn.analyticsvidhya.com/wp-content/uploads/2017/01/11181146/image-2.png

The tree shows that “submitted” is the root word of this sentence and is linked to two subtrees (the subject and object subtrees). Each subtree is itself a dependency tree, with relations such as (“Bills” <-> “ports” by a “preposition” relation) and (“ports” <-> “immigration” by a “conjunction” relation).

This type of tree, when parsed recursively in a top-down manner, gives grammar relation triplets as output. These can be used as features for many NLP problems such as entity-wise sentiment analysis, actor and entity identification, and text classification.
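
The recursive top-down traversal can be sketched in plain Python. The tree below is hand-written for the example sentence (it is not the output of a parser), covers only part of the sentence, and its relation labels are assumed Stanford-style names. `extract_triplets` is a hypothetical helper.

```python
# Each node is (word, [(relation, child_node), ...]); hand-built for illustration
tree = ("submitted", [
    ("nsubjpass", ("Bills", [
        ("prep", ("on", [
            ("pobj", ("ports", [
                ("cc", ("and", [])),
                ("conj", ("immigration", [])),
            ])),
        ])),
    ])),
    ("auxpass", ("were", [])),
    ("prep", ("by", [
        ("pobj", ("Brownback", [
            ("nn", ("Senator", [])),
        ])),
    ])),
])

def extract_triplets(node):
    """Walk the tree top-down and return (relation, governor, dependent) triplets."""
    governor, children = node
    triplets = []
    for relation, child in children:
        dependent = child[0]
        triplets.append((relation, governor, dependent))
        # Recurse into the child so that its own dependents are emitted too
        triplets.extend(extract_triplets(child))
    return triplets

for triplet in extract_triplets(tree):
    print(triplet)
```

Each printed tuple has the form (relation, governor, dependent), for example `('nsubjpass', 'submitted', 'Bills')`, and can be used directly as a categorical feature.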

**1.b Part of Speech Tagging**

Apart from grammar relations, every word in a sentence is also associated with a part-of-speech (POS) tag (noun, verb, adjective, adverb, etc.). POS tags define the usage and function of a word in the sentence.



In [None]:
text = "I am learning Natural Language Processing at Prime Classes"
tokens = word_tokenize(text)
print(pos_tag(tokens))

[('I', 'PRP'), ('am', 'VBP'), ('learning', 'VBG'), ('Natural', 'NNP'), ('Language', 'NNP'), ('Processing', 'NNP'), ('at', 'IN'), ('Prime', 'NNP'), ('Classes', 'NNP')]
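
The tagged output can be turned into simple features. The following is a minimal sketch that uses only the standard library and the tagged pairs printed above, copied in by hand so that the block runs without NLTK. `tagged`, `tag_counts`, and `proper_noun_phrases` are illustrative names.

```python
from collections import Counter
from itertools import groupby

# Tagged pairs copied from the pos_tag output above
tagged = [('I', 'PRP'), ('am', 'VBP'), ('learning', 'VBG'), ('Natural', 'NNP'),
          ('Language', 'NNP'), ('Processing', 'NNP'), ('at', 'IN'),
          ('Prime', 'NNP'), ('Classes', 'NNP')]

# Frequency of each tag, usable as a simple feature vector
tag_counts = Counter(tag for _, tag in tagged)

# Group consecutive proper nouns (NNP) into candidate phrases
proper_noun_phrases = [
    " ".join(word for word, _ in group)
    for is_nnp, group in groupby(tagged, key=lambda pair: pair[1] == "NNP")
    if is_nnp
]

print(tag_counts["NNP"])      # prints: 5
print(proper_noun_phrases)    # prints: ['Natural Language Processing', 'Prime Classes']
```

Grouping adjacent `NNP` tokens is a rough heuristic for spotting candidate named entities, not a substitute for a trained entity recognizer.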


Credits: https://www.analyticsvidhya.com/blog/2017/01/ultimate-guide-to-understand-implement-natural-language-processing-codes-in-python/