<a href="https://colab.research.google.com/github/brendenwest/ad450/blob/master/10_natural_language_processing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Natural Language Processing

### Reading
- https://towardsdatascience.com/gentle-start-to-natural-language-processing-using-python-6e46c07addf3
- https://towardsdatascience.com/introduction-to-natural-language-processing-for-text-df845750fb63

### Tutorials
- https://www.kaggle.com/learn/natural-language-processing
- https://spacy.io/usage/spacy-101
- https://www.datacamp.com/community/tutorials/stemming-lemmatization-python
- https://www.datacamp.com/community/tutorials/machine-learning-hotel-reviews
- https://www.datacamp.com/community/tutorials/simplifying-sentiment-analysis-python
- https://realpython.com/natural-language-processing-spacy-python/

### Practice
- https://learn.datacamp.com/courses/introduction-to-natural-language-processing-in-python

### Reference
- [spaCy](https://spacy.io/usage/spacy-101)
- [regular expressions](https://digitalfortress.tech/tricks/top-15-commonly-used-regex/)


### Learning Outcomes
- what is natural language processing
- common Python libraries
- word tokenization
- stemming & lemmatization
- n-grams
- text classification & sentiment analysis
- word vectors


# Overview

Natural Language Processing (NLP) applies data analysis and machine learning techniques to text and speech.

Some common scenarios for NLP are:
- word frequency analysis
- sentiment analysis
- plagiarism detection
- text-to-speech
- text generation (chatbots)

## Common libraries

The Python ecosystem has two widely used NLP libraries that provide proven solutions for routine text-processing tasks:

- [Natural Language Toolkit (NLTK)](https://www.nltk.org/)
- [spaCy](https://spacy.io/)

In this course we'll focus on spaCy, but each is a solid choice with support for:
- classification
- tokenization
- stemming
- tagging 
- parsing 
- named-entity recognition
- semantic reasoning


# Summary

- **NLP** applies machine learning algorithms to text and speech.
- **NLTK** (Natural Language Toolkit) is a leading Python library for NLP.
- **Sentence tokenization** divides a string of written language into its component sentences.
- **Word tokenization** divides a string of written language into its component words.
- **Stemming** and **lemmatization** reduce word variants to a common base form.
- **Stop words** are words that are filtered out before or after text processing. They are usually the most common words in a language.
- A **regular expression** is a sequence of characters that defines a search pattern.
- The **bag-of-words** model is a simple feature extraction technique that describes the occurrence of each word within a document.
- **TF-IDF** is a statistical measure used to evaluate a word's **importance** to a document relative to a collection of documents.
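
To make the TF-IDF definition concrete, here is a minimal sketch using only the standard library. It assumes one common formulation, term frequency multiplied by `log(N / df)`; library implementations often use smoothed or normalized variants, so the exact numbers will differ. The `tf_idf` helper and the sample documents are hypothetical and exist only for illustration.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {word: score} dict per document, using tf * log(N / df).

    This is one common textbook variant, not the formula of any
    particular library.
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each word.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            word: (count / len(tokens)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return scores

# Made-up sample documents for illustration.
documents = ["the cat sat", "the dog barked", "the cat purred"]
scores = tf_idf(documents)
for doc, doc_scores in zip(documents, scores):
    print(doc, doc_scores)
```

In this sketch a word that appears in every document ("the") scores 0, while a word unique to one document ("sat") scores higher than a word shared by two documents ("cat").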

## Tokenization
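
As a rough illustration of sentence and word tokenization, the sketch below uses regular expressions from the standard library. The helpers `split_sentences` and `split_words` are hypothetical names, not functions from spaCy or NLTK, and the rules are deliberately naive: abbreviations such as "Dr." would be split incorrectly, which is one reason to prefer a library tokenizer in practice.

```python
import re

def split_sentences(text):
    """Split text after '.', '!' or '?' when followed by whitespace."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if s]

def split_words(text):
    """Return lowercase word tokens, keeping internal apostrophes (don't)."""
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?", text.lower())

text = "NLP is fun. Don't you think so? Let's tokenize!"
sentences = split_sentences(text)
print(sentences)
print(split_words(sentences[1]))
```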

## Stemming & Lemmatization

Stemming and lemmatization are two different forms of text **normalization**: both aim to reduce the inflected variants of a word to a common base form.

- **Stemming** usually refers to a crude heuristic process that chops off the ends of words in the hope of reaching that base form correctly most of the time.

- **Lemmatization** uses a vocabulary and morphological analysis of words to remove grammatical endings only and to return the base or dictionary form of a word (the **lemma**).

A stemmer operates without knowledge of context, so it cannot distinguish between words whose meaning depends on their part of speech. Stemmers are, however, easier to implement and usually run faster.
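
The difference can be sketched with two toy functions. Both are assumptions made for illustration: `crude_stem` is a made-up suffix stripper (far simpler than a real stemming algorithm), and `toy_lemmatize` is a hand-written lookup table standing in for the vocabulary and morphological analysis that a real lemmatizer performs.

```python
# Toy suffix-stripping stemmer: no vocabulary, no context.
SUFFIXES = ("ing", "ed", "es", "s")  # checked in this order

def crude_stem(word):
    word = word.lower()
    for suffix in SUFFIXES:
        # Only strip when at least three characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Toy lemmatizer: a tiny hand-written table keyed by (word, part of speech).
LEMMAS = {
    ("meeting", "NOUN"): "meeting",
    ("meeting", "VERB"): "meet",
    ("studies", "NOUN"): "study",
    ("studies", "VERB"): "study",
    ("was", "VERB"): "be",
}

def toy_lemmatize(word, pos):
    # Fall back to the lowercase word when it is not in the table.
    return LEMMAS.get((word.lower(), pos), word.lower())

print(crude_stem("meeting"), crude_stem("studies"))
print(toy_lemmatize("meeting", "NOUN"), toy_lemmatize("meeting", "VERB"))
print(toy_lemmatize("studies", "VERB"), toy_lemmatize("was", "VERB"))
```

The stemmer returns "meet" for "meeting" whether it is a noun or a verb, and turns "studies" into the non-word "studi". The lookup-based lemmatizer returns dictionary forms and can treat the noun and the verb differently, but only because it is told the part of speech.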

## Word Vectors & Similarity

Machine learning algorithms cannot work with raw text directly, so we convert the text into vectors of numbers (`word vectors` or `word embeddings`). 

Word vectors represent each word in a text numerically such that the vector corresponds to how that word is used or what it means. This allows algorithms to determine similarity of text.

One common approach is the `bag-of-words` method, which describes the occurrence of every word within a document. The simplest scoring scheme marks each word in the vocabulary with 1 if it is present in the document and 0 if it is absent.
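
A minimal sketch of this binary scoring, using only the standard library, might look as follows. The helper names and the two sample documents are hypothetical, and splitting on whitespace is a simplifying assumption in place of a proper tokenizer.

```python
def build_vocabulary(documents):
    """Return a sorted list of every distinct lowercase word in the documents."""
    return sorted({word for doc in documents for word in doc.lower().split()})

def binary_vector(document, vocabulary):
    """Return 1 for each vocabulary word present in the document, else 0."""
    present = set(document.lower().split())
    return [1 if word in present else 0 for word in vocabulary]

# Made-up sample documents for illustration.
documents = ["the cat sat", "the dog sat down"]
vocabulary = build_vocabulary(documents)
vectors = [binary_vector(doc, vocabulary) for doc in documents]

print(vocabulary)  # ['cat', 'dog', 'down', 'sat', 'the']
print(vectors)     # [[1, 0, 0, 1, 1], [0, 1, 1, 1, 1]]
```

Each document becomes a vector with one position per vocabulary word, so word order is discarded and only presence is recorded.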

Other, more complex algorithms, such as `word2vec`, derive vectors that take the word's context into account. Words that appear in similar contexts will have similar vectors. Relations between words can then be examined with mathematical (matrix) operations.
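
The idea that words used in similar contexts end up with similar vectors can be sketched without a trained model. The example below is not `word2vec`: as a simplified stand-in it counts the neighbors of each word (a co-occurrence vector) and compares words with cosine similarity. The helper names, the window size, and the four sample sentences are assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict

def cooccurrence_vectors(sentences, window=1):
    """Map each word to a Counter of words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors (Counters)."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Made-up sample sentences for illustration.
sentences = [
    "the cat drinks milk",
    "the dog drinks water",
    "a cat chases mice",
    "a dog chases cars",
]
vectors = cooccurrence_vectors(sentences)
print(cosine_similarity(vectors["cat"], vectors["dog"]))
print(cosine_similarity(vectors["cat"], vectors["milk"]))
```

In this toy corpus "cat" and "dog" share exactly the same neighbors, so their similarity is higher than that of "cat" and "milk". Trained embeddings capture the same intuition with dense vectors learned from far more text.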

