# NLP
## Tokenisation
- Tokenisation is the process of splitting a document or sentence into smaller units (tokens) that can be analysed.
- Tokenisation can be performed by word or by sentence.
  - To tokenise by sentence, you provide a document containing at least one sentence.
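
Word and sentence tokenisation can be sketched with regular expressions. This is a minimal illustration, not how NLTK does it: the function names `sentence_tokenize` and `word_tokenize_simple` are hypothetical, and the splitting rules are naive assumptions (for example, abbreviations such as "e.g." would be split incorrectly).

```python
import re

def sentence_tokenize(document):
    """Split on whitespace that follows '.', '!' or '?' (a naive heuristic)."""
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    return [s for s in sentences if s]

def word_tokenize_simple(sentence):
    """Return runs of word characters, with each punctuation mark as its own token."""
    return re.findall(r"\w+|[^\w\s]", sentence)

doc = "I enjoy biking. The trails are long!"
print(sentence_tokenize(doc))                   # ['I enjoy biking.', 'The trails are long!']
print(word_tokenize_simple("I enjoy biking."))  # ['I', 'enjoy', 'biking', '.']
```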

## Normalisation 
- Normalisation converts words into a standard form, for example by correcting misspellings or reducing inflected words to their base form
- It gets the text into a readable format and lets us build other use cases on top of it.

### Stemming 
- Removes the suffix from a word to reduce it to its stem, which is not always a real word
- e.g. reduces "horses" to "horse" and "ponies" to "poni"
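
The two examples above match the plural-stripping rules of step 1a of the Porter stemming algorithm. The sketch below implements only that one step; the function name `stem_step_1a` is hypothetical, and a full stemmer applies further steps that may shorten words more.

```python
def stem_step_1a(word):
    """Apply the plural-suffix rules of Porter step 1a; the first matching rule wins."""
    if word.endswith("sses"):
        return word[:-2]   # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]   # ponies -> poni
    if word.endswith("ss"):
        return word        # caress -> caress (unchanged)
    if word.endswith("s"):
        return word[:-1]   # horses -> horse
    return word

for w in ["horses", "ponies", "caresses", "caress"]:
    print(w, "->", stem_step_1a(w))
```

Note that "poni" is not a real word: stemming only strips characters by rule and does not consult a dictionary.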

### Lemmatisation 
- Reduces a word to its original dictionary form (its lemma), usually by removing or replacing the suffix
- Tends to be a smoother cut than stemming
  - Tries to return the original root word
- Lemmatisation always returns a real word
- e.g. "am" --> "be" 
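
Because lemmatisation maps each word to a dictionary form, it can be pictured as a lookup rather than a suffix cut. The sketch below uses a tiny hand-written table, `LEMMAS`, which is purely hypothetical; real lemmatisers rely on a full lexicon and usually on the word's part of speech.

```python
# Hypothetical lookup table for illustration only; a real lemmatiser uses a full lexicon.
LEMMAS = {
    "am": "be",
    "is": "be",
    "are": "be",
    "ponies": "pony",
    "horses": "horse",
}

def lemmatize(word):
    """Return the dictionary form of a word, or the lowercased word if it is not in the table."""
    lowered = word.lower()
    return LEMMAS.get(lowered, lowered)

print(lemmatize("am"))      # be
print(lemmatize("ponies"))  # pony
```

Unlike the stemming example, "ponies" comes back as the real word "pony".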

## Part-of-Speech (PoS) Tagging

In [1]:
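# Requires NLTK's tokenizer and tagger data, installed via nltk.download (resource names vary by NLTK version)
import nltk
from nltk import word_tokenize
# Split the sentence into word tokens, then tag each token with its part of speech
text = word_tokenize("I enjoy biking on the trails")
output = nltk.pos_tag(text)
print(output)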

[('I', 'PRP'), ('enjoy', 'VBP'), ('biking', 'VBG'), ('on', 'IN'), ('the', 'DT'), ('trails', 'NNS')]


## n-gram 
- A contiguous sequence of n items (such as words) from a given text
- Common n-grams:
  - Unigram - Size 1 n-gram
  - Bigram - Size 2 n-gram
  - Trigram - Size 3 n-gram
- For instance:
  - "I like pizzas"
    - Unigram - "I", "like", and "pizzas"
    - Bigram - "I like" and "like pizzas"
    - Trigram - "I like pizzas"
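
The "I like pizzas" example can be reproduced with a sliding window over the tokens. This is a minimal sketch; the helper name `ngrams` is an assumption, and splitting on whitespace stands in for a proper tokeniser.

```python
def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "I like pizzas".split()
print(ngrams(tokens, 1))  # [('I',), ('like',), ('pizzas',)]
print(ngrams(tokens, 2))  # [('I', 'like'), ('like', 'pizzas')]
print(ngrams(tokens, 3))  # [('I', 'like', 'pizzas')]
```

If n is larger than the number of tokens, the result is an empty list.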



## NLP Analyses

### 1. Syntactic Analysis 

- Checks each element of a sentence or document against its dictionary definition

- Ignores the words that come before or after the word in question

### 2. Sentiment Analysis 

- Pertains to what the text means in terms of the opinion or emotion it expresses

- Produces a score for how positive or negative the meaning of the text is as a whole
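
One simple way to produce such a score is to sum per-word scores from a sentiment lexicon. The sketch below assumes a tiny hypothetical lexicon, `LEXICON`, with made-up weights; practical systems use much larger lexicons or trained models and handle negation, which this does not.

```python
import re

# Hypothetical word scores for illustration only: positive > 0, negative < 0.
LEXICON = {"good": 1, "great": 2, "love": 2, "bad": -1, "terrible": -2, "hate": -2}

def sentiment_score(text):
    """Sum the lexicon scores of all words in the text; unknown words count as 0."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(LEXICON.get(word, 0) for word in words)

print(sentiment_score("I love these great trails"))  # 4
print(sentiment_score("The weather was terrible"))   # -2
```

A score above zero suggests the text as a whole is positive, and a score below zero suggests it is negative.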

### 3. Semantic Analysis 

- Entails extracting the meaning of the text 

- Tends to analyse the meaning of each word and then relate it to the meaning of the text as a whole

## Named-Entity Recognition (NER) 

- Takes a document and finds all of the important terms (named entities, such as people, organisations, and places) in it

- A model is trained on data labelled with important entities so that it can better identify which entities should be labelled in a different dataset.
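
The notes do not say how the training data is labelled. One common convention, assumed here, is BIO tagging: `B-` marks the first token of an entity, `I-` a continuation, and `O` a token outside any entity. The sketch below only shows how labelled entities are read back out of such tags; the function name, the example sentence, and the labels `PER` and `LOC` are hypothetical, and no model is trained.

```python
def entities_from_bio(tokens, tags):
    """Collect (entity text, label) pairs from tokens labelled with BIO tags."""
    entities = []
    current_tokens, current_label = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; close any entity that was still open.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-") and current_tokens and tag[2:] == current_label:
            # Continuation of the open entity.
            current_tokens.append(token)
        else:
            # 'O' or an inconsistent tag: close any open entity.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None
    if current_tokens:
        entities.append((" ".join(current_tokens), current_label))
    return entities

tokens = ["Ada", "Lovelace", "visited", "London"]
tags = ["B-PER", "I-PER", "O", "B-LOC"]
print(entities_from_bio(tokens, tags))  # [('Ada Lovelace', 'PER'), ('London', 'LOC')]
```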

## NLP Pipeline

### 1. Raw Text 

- Start with the raw data 

### 2. Tokenisation 

- Break the text down from paragraphs into sentences, and from sentences into individual words

### 3. Stop Words Filtering 

- Remove common words like "a" and "the" that add no real value to what we are looking to analyse 
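
Stop word filtering amounts to dropping every token that appears in a stop list. The sketch below assumes a short hypothetical list, `STOP_WORDS`; real pipelines typically use a longer, library-provided list.

```python
# Hypothetical stop list for illustration only.
STOP_WORDS = {"a", "an", "the", "and", "of", "on", "in", "to", "is"}

def remove_stop_words(tokens):
    """Keep only the tokens that are not in the stop list (case-insensitive)."""
    return [token for token in tokens if token.lower() not in STOP_WORDS]

tokens = ["I", "enjoy", "biking", "on", "the", "trails"]
print(remove_stop_words(tokens))  # ['I', 'enjoy', 'biking', 'trails']
```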

### 4. Term Frequency-Inverse Document Frequency (TF-IDF)

- Statistically rank the words by importance: a word scores higher the more often it appears in a document and the fewer documents it appears in

- This is also when the words are converted from text to numbers 
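
The text-to-numbers step can be sketched in plain Python. This uses one textbook form of the weighting, assumed here: term frequency is a word's count divided by the document length, and inverse document frequency is log(N / df), where N is the number of documents and df is the number of documents containing the word. Libraries often use smoothed variants, so their numbers will differ. The function name `tf_idf` and the sample documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: score} dict per document, using tf * log(N / df)."""
    n_docs = len(documents)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))
    scores = []
    for doc in documents:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

docs = [["i", "like", "pizzas"], ["i", "like", "trails"]]
for doc_scores in tf_idf(docs):
    print(doc_scores)
```

Words that appear in every document ("i", "like") score 0, while words unique to one document ("pizzas", "trails") receive a positive score.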

### 5. Machine Learning

- Put everything together and run it through a machine learning model to produce an output

In [None]:
import os
# Find the latest version of Spark 3.0 from http://www.apache.org/dist/spark/ and enter it as the Spark version
# For example:
# spark_version = 'spark-3.0.3'
spark_version = 'spark-3.0.3'
os.environ['SPARK_VERSION']=spark_version

# Install Spark and Java
!apt-get update
!apt-get install openjdk-11-jdk-headless -qq > /dev/null
!wget -q http://www.apache.org/dist/spark/$SPARK_VERSION/$SPARK_VERSION-bin-hadoop2.7.tgz
!tar xf $SPARK_VERSION-bin-hadoop2.7.tgz
!pip install -q findspark

# Set Environment Variables (os is already imported above)
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"
os.environ["SPARK_HOME"] = f"/content/{spark_version}-bin-hadoop2.7"

# Start a SparkSession
import findspark
findspark.init()