<a href="https://colab.research.google.com/github/ashikshafi08/Learning-Fastai/blob/main/Intro_to_NLP/Intro_to_NLP_Fastai_Part_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# A Code-First Introduction to NLP course by Fastai (Rachel Thomas)

In this notebook I will jot down notes and code snippets as I go through this course.

Link for the video tutorials: https://www.youtube.com/playlist?list=PLtmWHNX-gukKocXQOkQjuVxglSDYWsSh9 

Github repo for the materials: https://github.com/fastai/course-nlp


## Syllabus Topics Covered:

### 1. What is NLP?
- A changing field
- Resources
- Tools
- Python libraries
- Example applications
- Ethics issues

### 2. Topic Modeling with NMF and SVD

- Stop words, stemming, & lemmatization
- Term-document matrix
- Term Frequency-Inverse Document Frequency (TF-IDF)
- Singular Value Decomposition (SVD)
- Non-negative Matrix Factorization (NMF)
- Truncated SVD, Randomized SVD

### 3. Sentiment classification with Naive Bayes, Logistic regression, and ngrams

- Sparse matrix storage
- Counters
- the fastai library
- Naive Bayes
- Logistic regression
- Ngrams
- Logistic regression with Naive Bayes features, with trigrams

### 4. Regex (and re-visiting tokenization)

### 5. Language modeling & sentiment classification with deep learning

- Language model
- Transfer learning
- Sentiment classification

### 6. Translation with RNNs

- Review Embeddings
- BLEU metric
- Teacher Forcing
- Bidirectional
- Attention

### 7. Translation with the Transformer architecture

- Transformer Model
- Multi-head attention
- Masking
- Label smoothing

### 8. Bias & ethics in NLP
- bias in word embeddings
- types of bias
- attention economy
- drowning in fraudulent/fake info


In [None]:
!pip install fastai --upgrade

In [2]:
import fastai 
print(fastai.__version__)
# fastai v2 exposes the text API through fastai.text.all
from fastai.text.all import *

2.4


## Topic Modelling with NMF and SVD

Topic modeling is a good way to start studying NLP. We will use two popular matrix decomposition techniques: SVD and NMF.

Suppose we lay out a **term-document matrix** of character names against the acts of Shakespeare's plays. This is an example of a bag-of-words representation, because word order is ignored. Decomposing such a matrix with SVD is also known as Latent Semantic Analysis.

- Terms: the names of the characters.
- Documents: the acts.
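
To make the term-document idea concrete, here is a minimal sketch in plain Python. The three toy documents are hypothetical stand-ins for real documents, and whitespace splitting is an assumption that a real tokenizer would improve on.

```python
from collections import Counter

# Hypothetical toy corpus: each string stands in for one document
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]

# Tokenize naively on whitespace (real tokenizers are more careful)
tokenized = [doc.split() for doc in docs]

# The vocabulary is the sorted set of all terms seen in any document
vocab = sorted({term for tokens in tokenized for term in tokens})

# Count each term per document; word order is discarded (bag of words)
counts = [Counter(tokens) for tokens in tokenized]

# Rows are terms, columns are documents
term_doc = [[c[term] for c in counts] for term in vocab]

for term, row in zip(vocab, term_doc):
    print(f"{term:>5} {row}")
```

Each row shows how often one term occurs in each document. Most entries are zero, which is why sparse storage matters for real corpora.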

The dataset we are about to use consists of roughly 18,000 newsgroup posts across 20 topics.

In [3]:
# Importing the things we need 
import numpy as np 
from sklearn.datasets import fetch_20newsgroups # the dataset we're going to use
from sklearn import decomposition
from scipy import linalg
import matplotlib.pyplot as plt 

In [4]:
np.set_printoptions(suppress=True)

In [5]:
fetch_20newsgroups

<function sklearn.datasets._twenty_newsgroups.fetch_20newsgroups>

In [6]:
# Picking only 4 topics 
categories = ['alt.atheism' , 'talk.religion.misc' , 'comp.graphics' , 'sci.space' ]

# Things to be removed 
remove = ['headers' , 'footers' , 'quotes']

In [None]:
# Creating a train and test set 
newsgroups_train = fetch_20newsgroups(subset= 'train' , 
                                      categories = categories , 
                                      remove = remove)
newsgroups_test = fetch_20newsgroups(subset = 'test' , 
                                     categories = categories , 
                                     remove = remove)


In [10]:
# Checking the shapes (Post and target)
newsgroups_train.filenames.shape , newsgroups_train.target.shape

((2034,), (2034,))

In [12]:
# Checking the first 3 posts (the raw text lives in .data)
print('\n'.join(newsgroups_train.data[:3]))

Hi,

I've noticed that if you only save a model (with all your mapping planes
positioned carefully) to a .3DS file that when you reload it after restarting
3DS, they are given a default position and orientation.  But if you save
to a .PRJ file their positions/orientation are preserved.  Does anyone
know why this information is not stored in the .3DS file?  Nothing is
explicitly said in the manual about saving texture rules in the .PRJ file. 
I'd like to be able to read the texture rule information, does anyone have 
the format for the .PRJ file?

Is the .CEL file format available from somewhere?

Rych


Seems to be, barring evidence to the contrary, that Koresh was simply
another deranged fanatic who thought it neccessary to take a whole bunch of
folks with him, children and all, to satisfy his delusional mania. Jim
Jones, circa 1993.


Nope - fruitcakes like Koresh have been demonstrating such evil corruption
for centuries.

 >In article <1993Apr19.020359.26996@sq.sq.com>, msb@sq.sq.c

In [16]:
# What are targets of the above sentences? 
np.array(newsgroups_train.target_names)[newsgroups_train.target[:3]]

array(['comp.graphics', 'talk.religion.misc', 'sci.space'], dtype='<U18')

In [17]:
# The target attribute holds the integer index of each post's category
newsgroups_train.target[:10]

array([1, 3, 2, 0, 2, 0, 2, 1, 2, 1])

In [18]:
# The number of topics we want to find and the number of top words to show per topic
num_topics , num_top_words = 6 , 8

## Stop words, stemming, lemmatization

#### Stopwords 
https://nlp.stanford.edu/IR-book/html/htmledition/dropping-common-terms-stop-words-1.html

Some extremely common words that appear to be of little value in helping select documents matching a user need are excluded from the vocabulary entirely.

These words are called stopwords. 

The general trend in IR systems over time has been from standard use of quite large stop lists (200-300 terms) to very small stop lists (7-12 terms) to no stop list whatsoever.
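
As a rough sketch of what dropping stop words looks like in practice, the snippet below filters a sentence against a stop list. The list `STOP_WORDS` and the helper `remove_stop_words` are hypothetical and deliberately tiny. Real stop lists are much longer.

```python
# Hypothetical, deliberately tiny stop list (real lists are much longer)
STOP_WORDS = {"a", "the", "of", "to", "is", "in", "and"}

def remove_stop_words(text):
    """Lowercase, split on whitespace, and drop tokens in STOP_WORDS."""
    return [tok for tok in text.lower().split() if tok not in STOP_WORDS]

sentence = "The orbit of the probe is stable and close to the planet"
filtered = remove_stop_words(sentence)
print(filtered)
```

Only the content-bearing words survive, which shrinks the vocabulary a model has to deal with.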

**Things I have to read** 
- https://stackoverflow.com/questions/1787110/what-is-the-difference-between-lemmatization-vs-stemming
- https://www.datacamp.com/community/tutorials/stemming-lemmatization-python

In [20]:
# Printing out some of the stopwords 
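# The stop_words module was deprecated in scikit-learn 0.22 and later removed;
# aliasing the text module keeps stop_words.ENGLISH_STOP_WORDS working
from sklearn.feature_extraction import text as stop_words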

# Displaying 20 stop words
sorted(list(stop_words.ENGLISH_STOP_WORDS))[:20]

['a',
 'about',
 'above',
 'across',
 'after',
 'afterwards',
 'again',
 'against',
 'all',
 'almost',
 'alone',
 'along',
 'already',
 'also',
 'although',
 'always',
 'am',
 'among',
 'amongst',
 'amoungst']

#### Stemming and Lemmatization 
https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html

- Stemming and lemmatization both generate the root form of words.

- Lemmatization uses the rules about a language. The resulting tokens are all actual words. 

- Lemmatization is the process of grouping together the different inflected forms of a word so they can be analysed as a single item.

Are the below words the same?

*organize, organizes, and organizing*

*democracy, democratic, and democratization*

> "Stemming is the poor-man’s lemmatization." (Noah Smith, 2011) Stemming is a crude heuristic that chops the ends off of words. The resulting tokens may not be actual words. Stemming is faster.

We will use NLTK to demonstrate these types of techniques. 
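
Before reaching for a library, the contrast between the two ideas can be sketched with toy rules. This is not the Porter algorithm or WordNet. The suffix list `SUFFIXES`, the dictionary `LEMMAS`, and both helper functions are hypothetical and only illustrate "chop the ending" versus "look up the dictionary form".

```python
# Toy suffix list, tried in order (hypothetical; Porter's rules are far richer)
SUFFIXES = ("ization", "izing", "izes", "ize", "ing", "es", "s")

def crude_stem(word):
    """Chop the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Toy lemma dictionary (hypothetical stand-in for a resource like WordNet)
LEMMAS = {"organizes": "organize", "organizing": "organize", "feet": "foot"}

def toy_lemmatize(word):
    """Look the word up; unknown words are returned unchanged."""
    return LEMMAS.get(word, word)

words = ["organize", "organizes", "organizing", "feet"]
print([crude_stem(w) for w in words])
print([toy_lemmatize(w) for w in words])
```

The stemmer maps the first three words to `organ`, which is a different word altogether, and leaves the irregular plural `feet` alone. The lookup returns actual dictionary forms, but only for words it knows about.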

##### NLTK

In [21]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [23]:
from nltk import stem
wnl = stem.WordNetLemmatizer() # instantiating lemmatization function
porter = stem.porter.PorterStemmer()

In [24]:
# Creating a word list 
word_list = ['feet' , 'foot' , 'foots' , 'footing']

In [26]:
# Performing lemmatization
[wnl.lemmatize(word) for word in word_list]

['foot', 'foot', 'foot', 'footing']

In [27]:
# Performing stemming 
[porter.stem(word) for word in word_list]

['feet', 'foot', 'foot', 'foot']

In [28]:
# Creating lists of words to perform stemming and lemmatization 

fl_list = ['flies' , 'flying' , 'fly']
org_list = ['organize' , 'organizes' , 'organizing']
un_list = ['universe' , 'university']

In [29]:
# Performing lemmatization on above list 
print([wnl.lemmatize(word) for word in fl_list])
print([wnl.lemmatize(word) for word in org_list])
print([wnl.lemmatize(word) for word in un_list])

['fly', 'flying', 'fly']
['organize', 'organizes', 'organizing']
['universe', 'university']


In [30]:
# Performing stemming on the same list of words 
print([porter.stem(word) for word in fl_list])
print([porter.stem(word) for word in org_list])
print([porter.stem(word) for word in un_list])

['fli', 'fli', 'fli']
['organ', 'organ', 'organ']
['univers', 'univers']


##### Spacy

Trying out the same with Spacy. 

In [34]:
# Trying out the lemmatization function (spacy.lemmatizer is the spaCy 2.x API)
import spacy
from spacy.lemmatizer import Lemmatizer
from spacy.lookups import Lookups

In [37]:
# Creating an instance; with an empty Lookups table, lookup() returns each word unchanged
lookups = Lookups()
lemmatizer = Lemmatizer(lookups= lookups)

[lemmatizer.lookup(word) for word in word_list]

['feet', 'foot', 'foots', 'footing']

spaCy doesn't offer a stemmer (since lemmatization is considered better).

Also, stop words vary from library to library.

In [41]:
nlp = spacy.load('en_core_web_sm')
sorted(list(nlp.Defaults.stop_words))[:20]

["'d",
 "'ll",
 "'m",
 "'re",
 "'s",
 "'ve",
 'a',
 'about',
 'above',
 'across',
 'after',
 'afterwards',
 'again',
 'against',
 'all',
 'almost',
 'alone',
 'along',
 'already',
 'also']

In [43]:
# Exercise: What stop words appear in spacy but not in sklearn?
nlp.Defaults.stop_words - stop_words.ENGLISH_STOP_WORDS

{"'d",
 "'ll",
 "'m",
 "'re",
 "'s",
 "'ve",
 'ca',
 'did',
 'does',
 'doing',
 'just',
 'make',
 "n't",
 'n‘t',
 'n’t',
 'quite',
 'really',
 'regarding',
 'say',
 'unless',
 'used',
 'using',
 'various',
 '‘d',
 '‘ll',
 '‘m',
 '‘re',
 '‘s',
 '‘ve',
 '’d',
 '’ll',
 '’m',
 '’re',
 '’s',
 '’ve'}

In [44]:
# Exercise: And what stop words are in sklearn but not spacy?
stop_words.ENGLISH_STOP_WORDS - nlp.Defaults.stop_words

frozenset({'amoungst',
           'bill',
           'cant',
           'co',
           'con',
           'couldnt',
           'cry',
           'de',
           'describe',
           'detail',
           'eg',
           'etc',
           'fill',
           'find',
           'fire',
           'found',
           'hasnt',
           'ie',
           'inc',
           'interest',
           'ltd',
           'mill',
           'sincere',
           'system',
           'thick',
           'thin',
           'un'})

These were long considered standard techniques, but they can often hurt your performance if using deep learning. Stemming, lemmatization, and removing stop words all involve throwing away information.

However, they can still be useful when working with simpler models.
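
A small sketch of the "throwing away information" point: if the stop list contains a negation word, two sentences with opposite sentiment can collapse to the same bag of words. The stop list and the helper `bag_of_words` are hypothetical and were chosen to make the effect visible.

```python
# Hypothetical stop list that happens to include the negation "not"
STOP_WORDS = {"this", "is", "was", "a", "the", "not"}

def bag_of_words(text):
    """Lowercase, drop stop words, and return the remaining tokens sorted."""
    return sorted(tok for tok in text.lower().split() if tok not in STOP_WORDS)

positive = "this movie was good"
negative = "this movie was not good"

print(bag_of_words(positive))
print(bag_of_words(negative))
```

After filtering, a model can no longer tell the two reviews apart. This is one way such preprocessing can hurt when a model could otherwise have used the full text.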

Sub-word tokens: https://github.com/google/sentencepiece