# **Introduction to Natural Language Processing (NLP)**

By the end of this workshop, you will understand:

*   the basic NLP pipeline
*   the different word embeddings and their pros and cons 
*   how to use custom stop words
*   the implementation of NLP in Python 
*   how to use Google Colab, if you are not already familiar with it :)



## **What is NLP?** 

- A subfield of computer science and artificial intelligence that focuses on the interaction between computers and human (natural) languages
- Applies machine learning (ML) and deep learning (DL) algorithms to text and speech data
- Applications: speech recognition, machine translation, spam detection, autocomplete (next-word suggestion), chatbots, etc.


# **NLP Pipeline**

Here, the NLP pipeline refers to the pre-processing steps applied to the text data before moving on to the machine learning part of the project.

For example, suppose the **objective** of a project is to identify e-mails as spam or non-spam.

1.   *Type of ML problem*: classification (using text data)
2.   Candidate ML algorithms: Multinomial Naive Bayes, Logistic Regression, and Support Vector Machine

Great! We have an idea about the type of problem and what possible ML algorithms to use. But before that, how do we process the text data?  

Here is an outline of the steps that we could use for processing the text data:

### **Text Pre-processing** 

1.   Spell check (depending on the context)
2.   Sentence tokenization
3.   Word tokenization
4.   Conversion to lower case
5.   Lemmatization and stemming
6.   Removal of punctuation and stop words (and numbers, depending on the context)
7.   Part-of-speech (POS) tagging
8.   Creation of n-grams
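
A few of the steps above can be sketched with only the Python standard library. This is a deliberately naive illustration, not the workshop's implementation: the regular expressions, the tiny `STOP_WORDS` set, the helper names, and the sample text are all assumptions made for this example. Spell checking, lemmatization/stemming, and POS tagging are left out because they need a dedicated NLP library.

```python
import re

# Tiny illustrative stop-word list (an assumption). In practice you would start
# from a library-provided list and extend it with custom, domain-specific words.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "you", "your"}


def sentence_tokenize(text):
    """Naively split text into sentences after '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def word_tokenize(sentence):
    """Lowercase the sentence and keep alphabetic runs only.

    This drops punctuation and numbers, and it also splits contractions
    such as "don't" into "don" and "t", which a real tokenizer handles better.
    """
    return re.findall(r"[a-z]+", sentence.lower())


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop every token that appears in the stop-word set."""
    return [t for t in tokens if t not in stop_words]


def ngrams(tokens, n=2):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Made-up sample text for illustration.
text = "Congratulations! You are the winner of a free prize. Claim your prize today."

sentences = sentence_tokenize(text)
tokens = [remove_stop_words(word_tokenize(s)) for s in sentences]
bigrams = [ngrams(t, 2) for t in tokens]

for sentence, toks, grams in zip(sentences, tokens, bigrams):
    print(sentence, "->", toks, "->", grams)
```

Passing a different set as `stop_words` is one simple way to experiment with custom stop words.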

### **Exploratory Analysis**

1.   Word Cloud
2.   Distribution of data with respect to each class 
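
Both exploratory steps come down to counting. The sketch below uses `collections.Counter` on a handful of made-up labelled messages (the `data` list and its labels are hypothetical): it counts messages per class and words per class, which is the information a word cloud displays visually. With the data in a pandas DataFrame, the class counts could instead come from `value_counts()` on the label column.

```python
from collections import Counter

# Hypothetical labelled messages, made up for illustration.
data = [
    ("spam", "win a free prize now"),
    ("ham", "are we still meeting for lunch"),
    ("spam", "free entry claim your prize"),
    ("ham", "see you at lunch"),
    ("ham", "call me when you are free"),
]

# Distribution of the data with respect to each class.
class_counts = Counter(label for label, _ in data)
total = sum(class_counts.values())
for label, count in class_counts.most_common():
    print(f"{label}: {count} messages ({count / total:.0%})")

# Word frequencies per class: the raw numbers behind a word cloud.
word_counts = {}
for label, message in data:
    word_counts.setdefault(label, Counter()).update(message.split())

for label, counts in word_counts.items():
    print(label, counts.most_common(3))
```

A strongly imbalanced class distribution is worth noticing at this stage, because it affects how a classifier should later be evaluated.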

### **Word Embeddings** 

1.   Bag-of-Words (BoW)
2.   Term Frequency (TF)
3.   Term Frequency - Inverse Document Frequency (TF-IDF)
4.   Pre-trained (Neural) Word Embeddings 

> * Word-level embeddings: Word2Vec and GloVe
> * Character-level embeddings: ELMo and Flair
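
To make the first three representations concrete, here is a minimal sketch that computes BoW, TF, and TF-IDF vectors by hand. The three toy documents and the helper names are assumptions for illustration. The IDF used here is the plain textbook form, log(N / df); libraries often apply smoothed variants, so their numbers will generally differ from these.

```python
import math
from collections import Counter

# Toy corpus, made up for illustration.
docs = [
    "free prize claim your free prize",
    "lunch meeting today",
    "claim your lunch",
]
tokenized = [d.split() for d in docs]

# Fixed, sorted vocabulary so every document maps to a vector of the same length.
vocab = sorted({word for doc in tokenized for word in doc})


def bow(doc):
    """Bag-of-Words: raw count of each vocabulary word in the document."""
    counts = Counter(doc)
    return [counts[word] for word in vocab]


def tf(doc):
    """Term Frequency: counts divided by the document length."""
    return [count / len(doc) for count in bow(doc)]


def idf(word):
    """Inverse Document Frequency: log(N / number of documents containing word)."""
    df = sum(word in doc for doc in tokenized)
    return math.log(len(tokenized) / df)


def tfidf(doc):
    """TF-IDF: term frequency weighted by how rare the word is in the corpus."""
    return [t * idf(word) for t, word in zip(tf(doc), vocab)]


print(vocab)
for doc in tokenized:
    print(bow(doc), [round(x, 3) for x in tfidf(doc)])
```

In the first document, "free" and "claim" both occur, but "free" gets the larger TF-IDF weight because it is more frequent in that document and appears in fewer documents overall.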

We will explore each of these topics using a dataset. 


```python
import pandas as pd
import numpy as np
```