The following is based on:
- https://scionanalytics.com/text-mining-vs-natural-language-processing/#:~:text=NLP%20and%20text%20mining%20differ,than%20the%20meaning%20of%20content.
- https://www.folio3.ai/blog/why-natural-language-processing-is-important/#:~:text=By%20enabling%20machines%20to%20understand,%2C%20spell%20checks%2C%20and%20summarization.
- https://nexocode.com/blog/posts/definitive-guide-to-nlp/
- https://monkeylearn.com/blog/introduction-to-topic-modeling/#:~:text=Topic%20modeling%20is%20an%20unsupervised,characterize%20a%20set%20of%20documents.
- https://www.turing.com/kb/which-language-is-useful-for-nlp-and-why
- https://www.activestate.com/blog/natural-language-processing-nltk-vs-spacy/

# 1. Introduction

In today’s digital world, businesses are overwhelmed with unstructured data. Without adequate technology, it’s virtually impossible for businesses to analyze and process this massive volume of data. That’s where Natural Language Processing (NLP) comes to the rescue…

## 1.1. Some definitions

Natural Language Processing (NLP) is a subfield of AI concerned with enabling computers to analyze and interpret human language in an efficient and useful way. Its aim is to bring machines closer to a human-level understanding of language.

The earliest NLP applications were rule-based systems that only performed certain tasks. These programs lacked exception handling and scalability, hindering their capabilities when processing large volumes of text data. We then moved towards more complex and powerful NLP solutions based on ML and deep learning techniques.

<div align="center">
    <img src="images/DCF_NLP-in-the-Data-Center_ML-DL-Diagram.png" alt="drawing" width="400"/>
</div>

Text mining, by contrast, is used to extract information from unstructured and structured content. It focuses on structure rather than on the meaning of the content.

## 1.2. NLP use cases

Many NLP tasks target particular problem areas. These tasks can be broken down into several different categories.

<div align="center">
    <img src="images/NLP-tasks.jpg" alt="drawing" width="500"/>
    <img src="images/NLP-tasks-difficulty.png" alt="drawing" width="400"/>
</div>

### Keyword Extraction
The keyword extraction task aims to identify all the keywords in a given natural language input. Keyword extractors are useful for indexing data to be searched or creating tag clouds, among other things.

Services like PubMed auto-tag their articles based on AI keyword extraction.
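
As a rough illustration of the idea, the sketch below ranks words by raw frequency after dropping stop words. It is a minimal baseline, not how any particular service works: the `STOP_WORDS` list, the `extract_keywords` helper and the sample abstract are all assumptions made up for this example.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "are", "as", "by", "for", "in", "is",
              "it", "of", "on", "or", "that", "the", "to", "with"}


def extract_keywords(text, top_n=3):
    """Return the top_n most frequent tokens that are not stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    return [word for word, _ in counts.most_common(top_n)]


abstract = (
    "Protein folding is studied with deep learning. "
    "Deep learning models predict protein structure, "
    "and protein structure guides drug design."
)
keywords = extract_keywords(abstract)
print(keywords)
```

Here "protein" ranks first because it occurs three times. Frequency alone is a crude signal, and real extractors usually also weight terms by how rare they are across a whole collection.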

### Topic modeling
Topic modeling is an unsupervised machine learning technique that’s capable of scanning a set of documents, detecting word and phrase patterns within them, and automatically clustering word groups and similar expressions that best characterize a set of documents.

<div align="center">
    <img src="images/topic-modeling.png" alt="drawing" width="800"/>
</div>

### Text Classification

The text classification task involves assigning a class to an arbitrary piece of natural language input such as documents, email messages, or tweets. Text classification has many applications, from spam filtering (e.g., spam, not spam) to the analysis of electronic health records (classifying different medical conditions).
Deep learning methods have proven very effective at text classification, achieving state-of-the-art results on a suite of standard academic benchmark problems.
<div align="center">
    <img src="images/TextClassificationExample.png" alt="drawing" width="400"/>
</div>
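
To make the spam example concrete, here is a minimal sketch of a multinomial Naive Bayes classifier written with the standard library only. It is a classical baseline rather than a deep learning method, and the class name, the whitespace tokenizer and the four training messages are assumptions made up for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


class NaiveBayesClassifier:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.label_counts = Counter()            # label -> number of documents
        self.vocabulary = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.label_counts[label] += 1
            for word in tokenize(text):
                self.word_counts[label][word] += 1
                self.vocabulary.add(word)

    def predict(self, text):
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.label_counts.items():
            # Start from the log prior of the class.
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in tokenize(text):
                # Laplace (add-one) smoothing keeps unseen words from zeroing the score.
                count = self.word_counts[label][word] + 1
                score += math.log(count / (total_words + len(self.vocabulary)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


classifier = NaiveBayesClassifier()
classifier.fit(
    ["win a free prize now", "free money click now",
     "meeting agenda for monday", "lunch with the project team"],
    ["spam", "spam", "not spam", "not spam"],
)
print(classifier.predict("claim your free prize"))
print(classifier.predict("project meeting on monday"))
```

With this toy training set, the first message is labeled `spam` and the second `not spam`. Four training examples are of course far too few for anything beyond a demonstration.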

### Named Entity Recognition
The entity recognition task involves detecting mentions of specific types of information in natural language input. Typical entities of interest for entity recognition include people, organizations, locations, events, and products.
<div align="center">
    <img src="images/NER.png" alt="drawing" width="600"/>
</div>
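
The sketch below shows the shape of the task's output (a span and a label for each mention) using a simple dictionary lookup. This is a rule-based stand-in, not a statistical NER model: the `GAZETTEER` contents, the `find_entities` helper and the sample sentence are hypothetical.

```python
import re

# Hypothetical gazetteer: entity label -> known surface forms.
GAZETTEER = {
    "PERSON": ["Ada Lovelace", "Alan Turing"],
    "ORG": ["Acme Corp"],
    "LOCATION": ["London", "Paris"],
}


def find_entities(text):
    """Return (start, end, label, surface) tuples sorted by start offset."""
    entities = []
    for label, names in GAZETTEER.items():
        for name in names:
            pattern = r"\b" + re.escape(name) + r"\b"
            for match in re.finditer(pattern, text):
                entities.append((match.start(), match.end(), label, match.group()))
    return sorted(entities)


sentence = "Alan Turing joined Acme Corp in London."
entities = find_entities(sentence)
for start, end, label, surface in entities:
    print(f"{surface!r} [{start}:{end}] -> {label}")
```

A lookup like this only finds names it already knows. Trained models generalize to unseen names by using the surrounding context.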

### Text Summarization
The text summarization task produces a short extract of arbitrary natural language input, typically a document or article. The goal in the sentence-level text summarization (SLTS) tasks is to create a summary that retains the meaning and style of the source: synthesizing high-level concepts while maintaining factual accuracy without excessive detail.
<div align="center">
    <img src="images/Text-summarization.png" alt="drawing" width="600"/>
</div>
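
One simple way to produce such an extract is to score each sentence by how frequent its words are in the whole text and keep the best ones. The following is a minimal sketch of that frequency-based extractive approach; the `summarize` helper, the stop-word list and the sample text are assumptions for illustration only.

```python
import re
from collections import Counter

# Small illustrative stop-word list (an assumption).
STOP_WORDS = {"a", "an", "and", "the", "is", "was", "on", "in",
              "into", "of", "to", "now", "many"}


def content_words(text):
    """Lowercased alphabetic tokens with stop words removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def summarize(text, max_sentences=1):
    """Keep the max_sentences sentences with the highest average word frequency."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(content_words(text))

    def score(sentence):
        tokens = content_words(sentence)
        return sum(freq[w] for w in tokens) / max(len(tokens), 1)

    top = sorted(sentences, key=score, reverse=True)[:max_sentences]
    # Emit the selected sentences in their original order.
    return " ".join(s for s in sentences if s in top)


text = (
    "Solar panels convert sunlight into electricity. "
    "Many homes now install solar panels on the roof to cut energy bills. "
    "The weather was nice yesterday."
)
summary = summarize(text)
print(summary)
```

On this sample, the first sentence is selected because it is the densest in frequently repeated words. This approach only copies sentences; it does not rephrase or synthesize concepts.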

### Conversational Agents
The main task of a conversational agent is to hold conversations with humans. The most popular type of conversational agent is the chatbot, which returns simple responses based on a given input. Its function is to provide an answer or perform the requested action. Chatbots are used in many different fields: telecommunications (providing support), marketing and sales (24/7 sales and customer assistance), and education (language learning).

However, designing this kind of technology comes with challenges: natural language understanding alone cannot answer every question, an AI that does not respond to emotions the way a human would can feel dehumanizing, and the complexity inherent in natural languages limits accuracy. Currently, most conversational agents operate within a certain field or subject where most of the scenarios have been well defined.
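
A chatbot that returns simple responses based on its input can be sketched as a list of keyword rules with a fallback. The rules, the responses and the `reply` function below are hypothetical and only illustrate why such agents work best in a narrow, well-defined domain.

```python
import re

# Hypothetical rules for a telecom helpdesk bot: trigger keywords -> canned response.
RULES = [
    ({"hello", "hi"}, "Hello! How can I help you today?"),
    ({"bill", "invoice"}, "You can view your latest bill in the customer portal."),
    ({"bye", "goodbye"}, "Goodbye!"),
]
FALLBACK = "Sorry, I did not understand that. Could you rephrase?"


def reply(message):
    """Return the response of the first rule whose keywords appear in the message."""
    tokens = set(re.findall(r"[a-z]+", message.lower()))
    for keywords, response in RULES:
        if tokens & keywords:
            return response
    return FALLBACK


print(reply("Hi there"))
print(reply("Where can I find my invoice?"))
print(reply("Tell me a joke"))
```

The last message falls outside the defined scenarios, so the bot can only return its fallback.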

## 1.3. NLP challenges
Natural language processing is not always accurate: the complexity inherent in human languages can lead machines astray when they try to understand our words and sentences.

### Ambiguity
In natural language, few sentences can be interpreted without any ambiguity. In natural language processing, ambiguity refers to sentences and phrases that can be interpreted in two or more ways. Because such sentences admit multiple readings, an NLP system may struggle to determine which meaning is intended.
<div align="center">
    <img src="images/NLP-ambiguity.png" alt="drawing" width="600"/>
</div>

### Irony and Sarcasm
Irony, sarcasm, puns, and jokes all rely on this natural language ambiguity for their humor. These are especially challenging for sentiment analysis, where sentences may sound positive or negative but actually mean the opposite.
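
The sketch below illustrates the problem with a naive word-counting sentiment scorer. The `POSITIVE` and `NEGATIVE` word lists and the example sentences are made up for this demonstration.

```python
import re

# Tiny hypothetical sentiment lexicons.
POSITIVE = {"great", "love", "wonderful", "perfect"}
NEGATIVE = {"terrible", "hate", "awful", "broken"}


def lexicon_sentiment(text):
    """Count positive and negative words and return the dominant polarity."""
    tokens = re.findall(r"[a-z]+", text.lower())
    score = sum((t in POSITIVE) - (t in NEGATIVE) for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


literal = "This phone is terrible and the screen is broken."
sarcastic = "Oh great, my flight got cancelled again. I just love waiting at the airport."

print(lexicon_sentiment(literal))
print(lexicon_sentiment(sarcastic))
```

The literal complaint is scored as negative, but the sarcastic one is scored as positive because "great" and "love" are counted at face value, even though a human reader would understand it as a complaint.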

### Domain-specific Knowledge
Natural language processing isn’t limited to understanding what words mean; it also involves interpreting how they are used within the wider context. This includes background information that is not explicitly stated but must be inferred from the surrounding text and from domain-specific knowledge.

Models trained to process legal documents are very different from those designed to process healthcare texts. The same holds for domain-specific chatbots: those designed to work as a helpdesk for telecommunication companies differ greatly from AI-based bots for mental health support.

### Support for Multiple Languages
The most widely used languages, such as English or Chinese, have large amounts of data and statistics available to analyze in depth. However, many smaller languages get only a fraction of the attention they deserve, and consequently far less data is gathered for them. This problem can be explained simply by the fact that not every language market is lucrative enough to be targeted by common solutions.

### Lack of Trust Towards Machines
Another challenge is designing NLP systems that humans feel comfortable using, without feeling dehumanized by interactions with AI agents that seem apathetic rather than empathetic, as people would typically expect.

## 1.4. Python packages

Although languages such as Java and R are used for natural language processing, Python is favored, thanks to its numerous libraries, simple syntax, and its ability to easily integrate with other programming languages.

NLTK and spaCy are two of the most popular Natural Language Processing (NLP) tools available in Python. There’s a real philosophical difference between NLTK and spaCy. NLTK was built by scholars and researchers as a tool to help you create complex NLP functions. It almost acts as a toolbox of NLP algorithms. In contrast, spaCy is similar to a service: it helps you get specific tasks done.

Each library trades time against space differently. NLTK returns results much more slowly than spaCy, while spaCy uses considerably more memory (spaCy is a memory hog!). spaCy’s speed is attributed to the fact that it was written in Cython from the ground up.

spaCy also supports more functionality than NLTK.

<div align="center">
    <img src="images/spacy-comparision.png" alt="drawing" width="600"/>
</div>

The rest of this course will focus on implementing NLP using spaCy because of its user-friendliness, feature availability, performance, and philosophy. spaCy just gets the job done!