# Introduction to Natural Language Processing (NLP)  

Outline of the course
---------------

* Statistical principles of NLP 

* Elements of NLP: computational representations (Entities, Tokens, Syntax, Lexicons, ...)

* Important open-source NLP libraries: NLTK and SpaCy

* Vectorial Representations (Word Embeddings)

* NLP using Neural Networks 

* Applications of NLP: Chatbots, Sentiment analysis, Spam filtering


# Introduction: Background


What is NLP? 
-------------------
* Natural Language Processing, or NLP, is an area of computer science that focuses on developing techniques to produce machine-driven analyses of text

* In the broad field of artificial intelligence, the ability to parse and understand natural language is an important goal with many types of application 

* Text data is unstructured, so it needs formatting and feature engineering before it can be analysed and given a mathematical representation

* NLP began in the 1950s as the intersection of artificial intelligence and linguistics. NLP was originally distinct from text information retrieval (IR), which employs highly scalable statistics-based techniques to index and search large volumes of text efficiently
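To make the idea of a "mathematical representation" of text concrete, here is a minimal sketch of a bag-of-words model using only the Python standard library. The function names `tokenize` and `bag_of_words` and the two example sentences are assumptions made for illustration; real pipelines usually rely on a library for this step.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def bag_of_words(documents):
    """Return (vocabulary, vectors) with one word-count vector per document."""
    tokenized = [tokenize(doc) for doc in documents]
    # The vocabulary is the sorted set of all distinct tokens.
    vocabulary = sorted({token for doc in tokenized for token in doc})
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

docs = ["The cat sat on the mat", "The dog chased the cat"]
vocabulary, vectors = bag_of_words(docs)
print(vocabulary)
for vector in vectors:
    print(vector)
```

Each document becomes a vector of counts over a shared vocabulary, so unstructured text can be compared and fed to statistical models. Word order is lost in this representation, which is one reason richer representations such as word embeddings exist.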





Why is Natural Language Processing Important? 
---------------------------------------------------------------------

* NLP expands the sheer amount of data that can be used for insight. Since so much of the data available is in the form of text, this is extremely important for analytics and predictions 

* A common application of NLP is machine translation: each time you use a language conversion tool, the techniques that convert text from one language to another fall under the umbrella of "natural language processing"

Why is NLP a "hard" problem?
---------------------------------------------

* Language is inherently ambiguous. One person's interpretation of a sentence may well differ from another person's. Because language cannot be made consistently clear, it is hard to build an NLP technique that works perfectly.


# NLP and Computational Linguistics

* One area is closely related to NLP and sometimes confused with it: Computational Linguistics (CL)


* Computational Linguistics is a more theoretical field that develops computational methods to answer the scientific questions from the point of view of linguists
    
    
* Natural Language Processing is dedicated to solving engineering problems related to natural language, with a focus on people


* A handy summary of the distinction: "CL is science" and "NLP is engineering"

![](./data/cl.jpeg?raw=true)
##### source: zipfslaw.org

# Statistical NLP

* Descriptive statistics are often used to quantify a particular quality, such as accuracy or robustness, as in the following examples

* Word error rate, usually defined as the number of deletions, insertions and substitutions divided by the number of words in the test sample, is the standard measure of accuracy for automatic speech recognition systems

* Accuracy rate (or percent correct), defined as the number of correct cases divided by the total number of cases, is commonly used as a measure of accuracy for part-of-speech tagging and word sense disambiguation 

* Recall and precision, often defined as the number of true positives divided by, respectively, the sum of true positives and false negatives (recall) and the sum of true positives and false positives (precision), are used as measures of accuracy for a wide range of applications including part-of-speech tagging, syntactic parsing and information retrieval
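The measures above can be sketched in a few lines of Python. This is an illustrative implementation, not a reference one: the function names are made up for this example, the word error rate is computed with a standard word-level edit distance and divided by the length of the reference transcript (an assumption about what "the test sample" means), and the tag sequences are invented.

```python
def word_error_rate(reference, hypothesis):
    """(deletions + insertions + substitutions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # dist[i][j] = minimum number of edits turning ref[:i] into hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dist[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # match or substitution
    return dist[len(ref)][len(hyp)] / len(ref)

def accuracy(gold, predicted):
    """Number of correct cases divided by the total number of cases."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)

def precision_recall(gold, predicted, positive):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN) for one class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
gold_tags = ["NOUN", "VERB", "NOUN", "ADJ", "NOUN"]
predicted_tags = ["NOUN", "NOUN", "NOUN", "ADJ", "VERB"]
acc = accuracy(gold_tags, predicted_tags)
precision, recall = precision_recall(gold_tags, predicted_tags, "NOUN")
print(wer, acc, precision, recall)
```

In the speech example there is one substitution ("sat" to "sit") and one deletion ("the"), so the word error rate is 2 edits over 6 reference words.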

 Topics in NLP
-------------------------
* Computational linguistics
* Statistical models
* Neural networks
* Elements of NLP: Computational Representations (Entities, Tokenisation, Syntax, Lexicons)
* Vector space embeddings
* Pre-processing of data 
* Convolutional neural networks
* RNNs: LSTMs and GRUs in practice
* Advanced models:  Bi-directional LSTM and stacked LSTM
* Hyper-parameter tuning
* Applications of NLP: Chatbots, Sentiment analysis, Spam filtering, News flows classification
* Important NLP libraries: NLTK, SpaCy, etc.
* Text Classification: Linear classifiers, Naive Bayes and deep learning techniques
* Information Retrieval and Extraction
* Named entity recognition and Relationship extraction
* Topic segmentation
* Language Modeling and Sequence Tagging 

# NLP Implementations

These are some successful implementations of Natural Language Processing (NLP):

* Search engines such as Google and Yahoo

* Social media feeds such as the Facebook news feed, where the algorithm uses NLP to understand content and give relevant suggestions

* Speech engines such as Apple Siri

* Spam filters such as Google's, which analyse the content of an email and check whether it is spam
 

 NLP Libraries 
-------------

* There are many open-source Natural Language Processing (NLP) libraries, including:

    * Natural Language Toolkit (NLTK)
    * Apache OpenNLP
    * Stanford NLP
    * GATE NLP library
    * SpaCy

* Natural Language Toolkit (NLTK) is one of the most popular libraries for natural language processing (NLP). It is written in Python, easy to learn, and widely used for teaching and research: [nltk](https://www.nltk.org/)

* SpaCy is an open-source software library written in Python and Cython. It offers statistical neural network models for English, German, Spanish, Portuguese, French, Italian and Dutch, multi-language named-entity recognition (NER), and tokenisation for various other languages: [SpaCy](https://spacy.io/)

# What can NLP be used for?

* NLP algorithms are typically based on machine learning. Instead of hand-coding large sets of rules, NLP can rely on machine learning to learn these rules automatically by analysing a set of examples (i.e. a corpus, anything from a whole book down to a collection of sentences) and making statistical inferences.

* Summarise blocks of text using a summariser to extract the most important and central ideas while ignoring irrelevant information.

* Create a chatbot and use Part-of-Speech tagging.

* Automatically generate keyword tags from content using probabilistic topic modelling algorithms such as Latent Dirichlet Allocation (LDA).

* Identify the type of an extracted entity, such as a person, place, or organization, using Named Entity Recognition.
    
* Use Sentiment Analysis to identify the sentiment of a string of text, from very negative to neutral to very positive.
    
* Reduce words to their root, or stem, using PorterStemmer, or break up text into tokens using Tokenizer.
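As a rough illustration of the last item, the sketch below tokenises a sentence with a regular expression and strips a few common suffixes. It is deliberately a toy: the suffix list and the `toy_stem` function are assumptions made for this example and are much cruder than the Porter stemming algorithm that `PorterStemmer` implements.

```python
import re

# Hypothetical, tiny suffix list; the real Porter algorithm applies
# several ordered rule phases with conditions on the remaining stem.
SUFFIXES = ("ing", "ed", "es", "s")

def tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

tokens = tokenize("The cats were jumping, then rested.")
stems = [toy_stem(token) for token in tokens if token.isalpha()]
print(tokens)
print(stems)
```

A crude rule set like this over-strips and under-strips many words, which is why established stemmers and lemmatisers are used in practice.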


NLP related Tasks
-------------------------

* Spell and Grammar Checking

* Word Prediction: predicting the most probable next word

* Information retrieval (IR): given a word query, retrieve documents that are relevant to the query

* Information filtering (text categorisation): group documents based on topics/categories
    * E.g. categories for browsing
    * E.g. e-mail filters
    * E.g. news services

* Information extraction: given a text, fill a template with the relevant information. This is the task closest to language understanding
    * E.g. house advertisements (get location, price, features) or contact information for companies
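The house advertisement example can be sketched with regular expressions. Everything specific here is an assumption made for illustration: the advertisement text, the `extract_listing` function, the dollar price format, the "in <Place>" location pattern and the short feature keyword list. Real information extraction systems need far more robust patterns or learned models.

```python
import re

# Hypothetical feature keywords for this sketch.
KNOWN_FEATURES = ("garden", "garage", "balcony")

def extract_listing(ad_text):
    """Fill a small template (location, price, features) from an advertisement."""
    # Assumes the location is a capitalised word following "in".
    location = re.search(r"\bin\s+([A-Z][a-z]+)", ad_text)
    # Assumes the price is written as a dollar amount, e.g. "$250,000".
    price = re.search(r"\$\s?(\d[\d,]*)", ad_text)
    lowered = ad_text.lower()
    return {
        "location": location.group(1) if location else None,
        "price": int(price.group(1).replace(",", "")) if price else None,
        "features": [feature for feature in KNOWN_FEATURES if feature in lowered],
    }

ad = "Charming house in Leeds with garden and garage. Price: $250,000."
listing = extract_listing(ad)
print(listing)
```

The output is a structured record rather than free text, which is what makes the extracted information usable by other programs.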

![](./data/NLP-flowchart.png?raw=true)
##### source: https://www.nltk.org/book/ch00.html

# NLP Applications

With NLP, we can do the following:

* Summarising blocks of text

* Creating chatbots

* Machine translation

* Fighting spam

* Extracting information

* Automatically generating keyword tags

* Identifying types of entities extracted

* Identifying the sentiment of a string with sentiment analysis

* Reducing words to their roots
    
   
    


# NLP at different levels of difficulty

Easy (mostly solved)

* Spell and grammar checking
* Some text categorization tasks
* Some named-entity recognition tasks


Intermediate (good progress)

* Information retrieval
* Sentiment analysis
* Machine translation
* Information extraction


Difficult (still hard)

* Question answering
* Summarization
* Dialog systems

# Tasks in NLP

#### With Natural Language Processing, we carry out five different tasks

* Lexical Analysis: identifies and analyses the structure of words. The whole chunk of text is divided into paragraphs, sentences, and words

* Syntactic Analysis: also called parsing, it analyses the words in a sentence for grammar and arranges them to show how they relate to each other. It rejects sentences like “The apple eats the girl”

* Semantic Analysis: extracts the dictionary meaning from the text. It maps syntactic structures onto objects in the task domain and checks them for meaningfulness. It rejects phrases like “tall stub”

* Discourse Integration: the meaning of the current sentence depends on the sentence before it, and in turn influences the meaning of the sentence after it

* Pragmatic Analysis: reinterprets what was said to determine what was actually meant. It derives those aspects of language that require knowledge of the real world
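The first of these tasks, dividing a chunk of text into paragraphs, sentences and words, can be sketched with the standard library alone. The splitting rules below are naive assumptions made for illustration (blank lines separate paragraphs, and `.`, `!` or `?` followed by whitespace ends a sentence); abbreviations such as "Dr." would break them, and the function names are hypothetical.

```python
import re

def split_paragraphs(text):
    """Paragraphs are assumed to be separated by one or more blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def split_sentences(paragraph):
    """Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip()]

def split_words(sentence):
    """Words are runs of letters, digits and apostrophes."""
    return re.findall(r"[A-Za-z0-9']+", sentence)

text = "NLP is fun. It is also hard!\n\nLanguage is ambiguous."
structure = [
    [split_words(sentence) for sentence in split_sentences(paragraph)]
    for paragraph in split_paragraphs(text)
]
print(structure)
```

The result is a nested list: paragraphs contain sentences, and sentences contain words.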


![](./data/nlp_task.jpeg?raw=true)
##### source: Medium

Machine Translation
------------------------------

* Translating a text from one language to another

* This needs corpus-based statistical and neural techniques, which can give better translations by handling differences in linguistic typology, translating idioms, and isolating anomalies

![](./data/mt.gif?raw=true)
##### source: Medium

Speech recognition
--------------------------

Speech to text

* Input: wave sound file

* Output: typed text representing the words

* To disambiguate the next word, one can use sequence models to predict the most likely next word, based on the past words
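The simplest sequence model for predicting the next word from past words is a bigram model, which only looks at the single previous word. The sketch below assumes a tiny invented corpus and hypothetical function names; speech recognisers use much larger models, but the counting idea is the same.

```python
from collections import Counter, defaultdict

def train_bigrams(sentences):
    """Count, for every word, which words follow it in the training sentences."""
    following = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for previous, nxt in zip(words, words[1:]):
            following[previous][nxt] += 1
    return following

def predict_next(following, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    candidates = following.get(word.lower())
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]

corpus = [
    "i want to recognise speech",
    "i want to go home",
    "i like to recognise speech",
]
model = train_bigrams(corpus)
print(predict_next(model, "to"))
print(predict_next(model, "i"))
```

In this corpus "to" is followed by "recognise" twice and "go" once, so "recognise" is predicted.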


![](./data/sprec1.jpeg?raw=true)
##### source: Medium

Speech synthesis
--------------------------

Text to speech

* Input: typed text representing the words

* Output: wave sound file


![](./data/tts.png?raw=true)
##### source: Medium


Sentiment Analysis  
-------------------

* Sentiment analysis involves building a system to collect and determine the emotional tone behind words. 

* This is important because it allows you to gain an understanding of the attitudes, opinions and emotions of the people in your data. 

* It involves natural language processing of the actual text, transforming it into a format that a machine can read, and using statistics to determine the actual sentiment.

* Labelled data is important: to accomplish sentiment analysis computationally, we have to use techniques that allow us to learn from data that has already been labelled
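Learning sentiment from labelled data can be sketched with a small Naive Bayes classifier. This is one possible statistical technique, chosen here for illustration; the four labelled reviews and the function names are invented, and add-one smoothing is an assumption of this sketch.

```python
import math
from collections import Counter

def train(labelled_texts):
    """Count words per label from (text, label) pairs."""
    word_counts = {}          # label -> Counter of words
    label_counts = Counter()  # label -> number of training texts
    vocabulary = set()
    for text, label in labelled_texts:
        words = text.lower().split()
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(words)
        vocabulary.update(words)
    return word_counts, label_counts, vocabulary

def classify(text, model):
    """Return the label with the highest log-probability under Naive Bayes."""
    word_counts, label_counts, vocabulary = model
    total_texts = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_texts)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            likelihood = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

training_data = [
    ("great film loved it", "positive"),
    ("wonderful acting great story", "positive"),
    ("terrible film hated it", "negative"),
    ("boring story awful acting", "negative"),
]
model = train(training_data)
print(classify("great loved wonderful", model))
print(classify("terrible boring awful", model))
```

The classifier has no built-in notion of sentiment: everything it knows comes from the labels in the training data, which is why labelled data matters.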
 
 ![](./data/senti.png?raw=true)
##### source: Glue Labs

# Projects related to NLP applications 

Development of applications which carry out

* Information Retrieval
* Information Extraction
* Text Summarization
* Question Answering
* Sentiment Analysis
* Machine Translation


These applications include the following components

* Part-of-speech tagging
* Syntactic parsing
* Lexical semantics
* Discourse analysis
* Named-entity recognition

Glossary of terms in NLP
----------------------------


Some common terminology

<b>Corpus: </b> (Plural: Corpora) a collection of written texts that serves as our dataset

<b>nltk: </b> (Natural Language Toolkit) the Python module which has a lot of useful built-in NLP techniques

<b>Token: </b> a string of contiguous characters between two spaces, or between a space and a punctuation mark. A token can also be an integer, a real number, or a number containing a colon


<b>Part-of-speech tagging: </b> the process of tagging each word with a lexical category: noun, adjective, verb, etc.



<b>WordNet:</b> a semantic graph for words. NLTK provides an interface to it



<b>Chunks: </b> Chunking is the process of grouping patterns of parts of speech together into units that represent some meaning.


<b>Entity Recognition - Chunking: </b> The goal is to detect entities: Person, Location, Time, etc.
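Chunking can be illustrated on a sentence that has already been part-of-speech tagged. The sketch below groups an optional determiner, any adjectives and one or more nouns into a noun-phrase chunk. The tags follow the Penn Treebank convention (`DT`, `JJ`, `NN`) and were assigned by hand for this example; the `chunk_noun_phrases` function is hypothetical and much simpler than a grammar-based chunker.

```python
def chunk_noun_phrases(tagged):
    """Group runs matching DT? JJ* NN+ in a list of (word, tag) pairs."""
    chunks = []
    i, n = 0, len(tagged)
    while i < n:
        start = i
        if tagged[i][1] == "DT":                    # optional determiner
            i += 1
        while i < n and tagged[i][1] == "JJ":        # any number of adjectives
            i += 1
        noun_start = i
        while i < n and tagged[i][1].startswith("NN"):  # one or more nouns
            i += 1
        if i > noun_start:
            chunks.append(" ".join(word for word, _ in tagged[start:i]))
        else:
            i = start + 1  # no noun found: restart one token later
    return chunks

tagged_sentence = [
    ("The", "DT"), ("quick", "JJ"), ("brown", "JJ"), ("fox", "NN"),
    ("jumped", "VBD"), ("over", "IN"),
    ("the", "DT"), ("lazy", "JJ"), ("dog", "NN"),
]
chunks = chunk_noun_phrases(tagged_sentence)
print(chunks)
```

Entity recognition builds on the same idea: once candidate chunks are found, each one can be classified as a person, location, time, and so on.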




General Summary
---------------------------------

* Natural Language Processing is a branch of AI which helps computers to understand, interpret and manipulate human language

* NLP does not focus on voice modulation; it does draw on contextual patterns

* The five essential components of Natural Language Processing are 1) Morphological and Lexical Analysis 2) Syntactic Analysis 3) Semantic Analysis 4) Discourse Integration 5) Pragmatic Analysis

* The three types of writing system are 1) Logographic 2) Syllabic 3) Alphabetic

* Machine learning and statistical inference are two methods for implementing Natural Language Processing

* Essential applications of NLP include Information Retrieval and Web Search, Grammar Correction, Question Answering, Text Summarization, Machine Translation, etc.

* With the help of NLP and Data Science, future computers will be able to learn from information online and apply it in the real world; however, a lot of work is still needed in this regard

* Natural language is ambiguous, while computer languages are designed to be unambiguous. The biggest advantage of an NLP system is that it offers exact answers to questions, without unnecessary or unwanted information

* The biggest drawback of an NLP system is that it is built for a single, specific task, so it cannot adapt to new domains and problems because of its limited functions



# Next steps


### Statistical NLP 

### NLTK tutorial

### Advanced NLP tutorial