<a href="https://colab.research.google.com/github/aduwillie/NLP-Tutorial/blob/main/nlp_tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install nltk
!pip install spacy
!pip install scikit-learn
!pip install gensim
!pip install transformers
!pip install torch torchaudio torchvision torchtext torchdata
!pip install tensorflow
!pip install numpy
!pip install unidecode
!pip install textblob
!pip install afinn

!python -m spacy download en_core_web_sm


# Understanding Text Data

Text data is unstructured and comes in many forms, such as articles, social media posts, emails, and chat messages.

## Text Preprocessing

Preprocessing is a crucial step that ensures text is clean and in a format that models can analyze. It includes:

1. Punctuation removal
2. Stop word removal
3. Removal of non-essential elements, such as special characters and HTML tags, that add no value
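
The cleanup of non-essential elements can be sketched with the standard library alone. The function name `clean_text` and the sample string are illustrative assumptions, and the patterns are deliberately simple; real HTML is usually better handled by a proper parser.

```python
import html
import re

def clean_text(raw):
    """Strip HTML tags, decode entities, and drop special characters."""
    no_tags = re.sub(r"<[^>]+>", " ", raw)                 # remove anything that looks like a tag
    decoded = html.unescape(no_tags)                       # e.g. "&amp;" becomes "&"
    no_special = re.sub(r"[^A-Za-z0-9\s]", " ", decoded)   # keep letters, digits, whitespace
    return re.sub(r"\s+", " ", no_special).strip()         # collapse repeated whitespace

raw = "<p>Hello, <b>NLP</b> world! Visit us &amp; learn more :)</p>"
cleaned = clean_text(raw)
print(cleaned)
```

Note that this also removes punctuation, so it combines items 1 and 3 of the list in one pass.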

## Standardization

Standardization converts text to a consistent format. It includes:

1. Lowercasing
2. Stemming - Reducing words to their base or root forms
3. Lemmatization - Similar to stemming, but reduces words to their dictionary (canonical) form.

In [None]:
# Display text
text = "This is a sample text to play with. My name is Michael Kyle"
print(text)

text_len = len(text)
print("Length of text is:", text_len)

unique_chars = set(text)
print("Unique chars:", unique_chars)

words = text.split()
print("Words in text:", words)

words_len = len(words)
print("Total count of words:", words_len)

This is a sample text to play with. My name is Michael Kyle
Length of text is: 59
Unique chars: {'i', 'n', 'a', 't', 'K', 'l', 'e', 'h', 'w', 'T', 's', 'p', 'y', '.', 'x', 'o', 'c', 'm', ' ', 'M'}
Words in text: ['This', 'is', 'a', 'sample', 'text', 'to', 'play', 'with.', 'My', 'name', 'is', 'Michael', 'Kyle']
Total count of words: 13


# Challenges with Text Data

There are several challenges with text data that should be considered:

1. Ambiguity - Multiple meanings for words
2. Variability - Different sentence structures, styles, slang, etc.
3. Noisy data - Text may include irrelevant or redundant information
4. High dimensionality - Each unique word can be considered a dimension, so the vocabulary size can be very large.
5. Sentiment and subjectivity - Text data often contains subjective information, e.g. opinions and emotions.
6. Context and dependency - Understanding a text often requires considering the context and the dependencies between words.
7. Language diversity
8. Sarcasm and irony

In [None]:
import string

text = "Natural Language Processing (NLP) enables computers to understand human language"

text_lower = text.lower()
print("Lowercased:", text_lower)

text_no_punct = text.translate(str.maketrans('', '', string.punctuation))
print("No punctuations:", text_no_punct)

Lowercased: natural language processing (nlp) enables computers to understand human language
No punctuations: Natural Language Processing NLP enables computers to understand human language


# Text Cleanup

Text cleanup is a crucial part of the preprocessing pipeline. It transforms raw text into a clean, standardized format and includes the following:

## Stop word removal

Stop words are common words that carry minimal meaningful information, e.g. "the", "is", "in", "and". They are often filtered out to reduce noise. The benefits include:

1. Dimensionality reduction
2. Processing speed
3. Improved accuracy - Classification and sentiment analysis can improve when stop words are removed


In [None]:
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")

text = "Natural Language Processing (NLP) enables computers to understand a human language!"
tokens = text.split() # In this case, this is word tokens
print("Original tokens", tokens)

en_stop_words = set(stopwords.words("english"))
print("English stop words:", en_stop_words)

filtered_tokens = [word for word in tokens if word.lower() not in en_stop_words] # Removes the stop words 'to' and 'a' from the list of tokens
print("Filtered tokens:", filtered_tokens)

Original tokens ['Natural', 'Language', 'Processing', '(NLP)', 'enables', 'computers', 'to', 'understand', 'a', 'human', 'language!']
English stop words: {'my', 'should', "mightn't", 't', "won't", 'its', 'about', 'what', 'only', 'been', 'if', 'no', 'those', 'his', 'off', 'but', "shouldn't", 'our', 'there', 'under', 'once', "shan't", 'the', "don't", 'y', 'such', "she's", 'has', 'does', 'd', 'theirs', 'me', 'in', "needn't", 'ourselves', 'own', 've', "doesn't", 'on', 'these', 'again', 'until', 'didn', 'below', 'him', 'than', 'don', 'myself', "aren't", 'herself', 'their', 'here', 'whom', 'yourselves', "isn't", 'just', 'to', 'with', 'into', 'she', 'not', 'an', 'having', 'where', 'aren', 'hers', 'your', "you'll", 'won', 'from', 'of', "should've", 'yourself', 'will', 'or', 'between', "wasn't", 'yours', 'am', 'same', 'that', "weren't", 'be', 'it', "hasn't", 'you', 'isn', 'her', 'other', 'by', 'as', 'are', 'have', 'wouldn', 'needn', 'wasn', "mustn't", 'haven', 'while', 'so', 'too', "that'll", 

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Stemming

Stemming reduces words to their base or root form. It is important for the following reasons:

1. Dimensionality reduction - Reduces the number of unique words
2. Improved accuracy - Text classification, search engines, and sentiment analysis can treat different forms of a word as the same term.
3. Resource efficiency - The vocabulary shrinks, which helps when dealing with large datasets.

### Use cases of stemming

1. Search engines - When a user inputs a search term, the search engine can find all relevant documents that contain any form of that term. This broadens and improves the results.
2. Text classification - Stemming reduces the dimensionality of the text, which is a big benefit for classification algorithms.
3. Sentiment analysis - Different forms of a word do not skew the analysis. E.g., "happy" and "happiness" reduce to the same stem and are not treated as separate words.

Note that a big issue with stemming is **context ignorance**: it strips suffixes without considering a word's meaning or part of speech. Also, the generated stem does not always match the base form found in a dictionary.
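
To make these two issues concrete, here is a minimal sketch of a naive suffix stripper. It is a toy stand-in written for this illustration (the function `naive_stem` and its suffix list are assumptions), not the Porter algorithm, which applies many more rules.

```python
def naive_stem(word):
    """Strip the first matching suffix; a toy stand-in for a real stemmer."""
    for suffix in ("ing", "ies", "es", "ed", "s"):
        # Only strip when at least three characters remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["ponies", "studying", "news", "better"]
stems = {word: naive_stem(word) for word in words}
print(stems)
```

Here "ponies" becomes "pon", which is not a dictionary word, and "news" becomes "new", which is a different word altogether because the rule ignores meaning.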

In [None]:
import nltk
from nltk.stem import PorterStemmer

text = "Natural Language Processing (NLP) enables computers to understand a human language!"
tokens = text.split() # In this case, this is word tokens
print("Original tokens", tokens)

stemmer = PorterStemmer()
stemmed_tokens = [stemmer.stem(word) for word in tokens]
print("Stemmed tokens:", stemmed_tokens)

Original tokens ['Natural', 'Language', 'Processing', '(NLP)', 'enables', 'computers', 'to', 'understand', 'a', 'human', 'language!']
Stemmed tokens: ['natur', 'languag', 'process', '(nlp)', 'enabl', 'comput', 'to', 'understand', 'a', 'human', 'language!']


## Lemmatization

Everything under Stemming applies here as well. The main difference is that the lemma is the base form as represented in a dictionary, obtained through morphological analysis. Lemmatization also requires knowledge of the word's part of speech (POS). E.g., the word "saw" can be a noun or a verb, and lemmatization can distinguish between these cases.
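
The role of the part of speech can be sketched with a tiny lookup table. The table `LEMMA_TABLE` and the function `lemmatize` are hypothetical and hand-written for this illustration; real lemmatizers rely on a full lexicon and morphological rules.

```python
# Hypothetical mini-lexicon: (surface form, part of speech) -> lemma
LEMMA_TABLE = {
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
    ("better", "ADJ"): "good",
    ("computers", "NOUN"): "computer",
}

def lemmatize(word, pos):
    """Return the lemma for (word, pos); fall back to the lowercased word."""
    return LEMMA_TABLE.get((word.lower(), pos), word.lower())

print(lemmatize("saw", "VERB"))   # the past tense of "see"
print(lemmatize("saw", "NOUN"))   # the cutting tool
```

The same surface form maps to different lemmas depending on the POS tag, which is exactly the information a stemmer does not use.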

In [None]:
import spacy # Used below under the "spacy" section
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet")
nltk.download("omw-1.4")

text = "Natural Language Processing (NLP) enables computers to understand a human language!"
tokens = text.split() # In this case, this is word tokens
print("Original tokens", tokens)

lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(word.lower()) for word in tokens]
print("Lemmatized tokens", lemmatized_tokens)

# Try the same with spaCy
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
lemmatized_tokens = [token.lemma_ for token in doc]
print("spacy: Lemmatized tokens:", lemmatized_tokens) # Consider the word "enables" -> "enable" and "computers" -> "computer"



Original tokens ['Natural', 'Language', 'Processing', '(NLP)', 'enables', 'computers', 'to', 'understand', 'a', 'human', 'language!']
Lemmatized tokens ['natural', 'language', 'processing', '(nlp)', 'enables', 'computer', 'to', 'understand', 'a', 'human', 'language!']


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


spacy: Lemmatized tokens: ['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'enable', 'computer', 'to', 'understand', 'a', 'human', 'language', '!']


## Regular Expressions

Regular expressions are a powerful tool for text processing and data manipulation. They allow searching, matching, and manipulating text based on a specific pattern. Practical applications include:

1. Text search - Finding specific words or phrases in text data.
2. Data validation - Verifying that a string matches a pattern, e.g. an email address
3. Manipulation - Extracting or replacing parts of a string based on a pattern
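
Validation and extraction can be sketched as follows. The pattern `EMAIL_PATTERN` and the sample addresses are assumptions made for this example; the pattern is deliberately simple and does not cover every valid email address.

```python
import re

# Deliberately simple email pattern for illustration only
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def is_email(candidate):
    """Validation: the whole string must match the pattern."""
    return EMAIL_PATTERN.fullmatch(candidate) is not None

# Extraction: find every substring that matches the pattern
message = "Contact alice@example.com or bob@example.org for details."
extracted = EMAIL_PATTERN.findall(message)

print(is_email("alice@example.com"))
print(is_email("not-an-email"))
print(extracted)
```

`fullmatch` is used for validation because `search` would also accept strings that merely contain an address.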

In [None]:
import re

text = "Natural Language Processing (NLP) enables computers to understand a human language!"
print("Text:", text)

pattern = r"computers"

found = re.search(pattern, text)
if found:
    print("The word \"computers\" was found")
else:
    print("The word \"computers\" was not found")

# Replace the lowercase word "language" with "speech" (the match is case-sensitive)
replace_pattern = r"language"
target = "speech"
new_text = re.sub(replace_pattern, target, text)
print("New text", new_text)

Text: Natural Language Processing (NLP) enables computers to understand a human language!
The word "computers" was found
New text Natural Language Processing (NLP) enables computers to understand a human speech!


## Tokenization

Tokenization breaks a piece of text down into smaller units called **tokens**. The benefits of tokenization include:

1. Simplification - We can focus on the components rather than the whole text
2. Standardization - Consistent and uniform representation of the text. Variations would lead to inconsistencies and errors in analysis.
3. Feature extraction - These features can be words, phrases and other elements that hold valuable information

Types of tokenization include:

1. Word tokenization
2. Sentence tokenization
3. Character tokenization
4. Subword tokenization (advanced, and mostly used for neural networks)

Applications of tokenization vary. Consider the following:

1. Word tokenization helps with text classification projects such as spam detection and topic labeling, as well as general organization.
2. Word tokenization also helps with sentiment analysis projects by looking for key words.
3. Named Entity Recognition (NER) is a good use case for word tokenization.
4. Both word and sentence tokenization are useful in machine translation contexts
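
Subword tokenization is the only type in the list above that is easy to overlook, so here is a rough sketch of the idea. It uses greedy longest-match splitting in the style of WordPiece, with a tiny hand-made vocabulary; the function `subword_tokenize`, the `##` continuation marker as used here, and the vocabulary are assumptions for illustration, and real tokenizers learn their vocabulary from a corpus.

```python
def subword_tokenize(word, vocab):
    """Greedily split a word into the longest pieces found in vocab."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate   # mark pieces that continue a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1                           # try a shorter piece
        if piece is None:
            return ["[UNK]"]                   # no piece fits: unknown word
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical vocabulary, chosen by hand for this example
vocab = {"token", "##ization", "##ize", "un", "##believ", "##able"}

print(subword_tokenize("tokenization", vocab))
print(subword_tokenize("unbelievable", vocab))
print(subword_tokenize("xyz", vocab))
```

Splitting into subwords lets a model handle words it has never seen as whole units, as long as their pieces are in the vocabulary.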

In [None]:
import spacy
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

nltk.download("punkt") # NLTK tokenizer. Need for actual tokenization

text = "Natural Language Processing (NLP) enables computers to understand a human language!"
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)

# Word tokenization
word_tokens = word_tokenize(text)
print("Word tokens:", word_tokens)

en_word_tokens = [token.text for token in doc]
print("Spacy: Word tokens:", en_word_tokens)

# Sentence tokenization
sentence_tokens = sent_tokenize(text)
print("Sentence tokens:", sentence_tokens)

en_sentence_tokens = [sents.text for sents in doc.sents]
print("Spacy: Sentence tokens:", en_sentence_tokens)

# Character tokenization
char_tokens = list(text)
print("Character tokens:", char_tokens)



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Word tokens: ['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'enables', 'computers', 'to', 'understand', 'a', 'human', 'language', '!']
Spacy: Word tokens: ['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'enables', 'computers', 'to', 'understand', 'a', 'human', 'language', '!']
Sentence tokens: ['Natural Language Processing (NLP) enables computers to understand a human language!']
Spacy: Sentence tokens: ['Natural Language Processing (NLP) enables computers to understand a human language!']
Character tokens: ['N', 'a', 't', 'u', 'r', 'a', 'l', ' ', 'L', 'a', 'n', 'g', 'u', 'a', 'g', 'e', ' ', 'P', 'r', 'o', 'c', 'e', 's', 's', 'i', 'n', 'g', ' ', '(', 'N', 'L', 'P', ')', ' ', 'e', 'n', 'a', 'b', 'l', 'e', 's', ' ', 'c', 'o', 'm', 'p', 'u', 't', 'e', 'r', 's', ' ', 't', 'o', ' ', 'u', 'n', 'd', 'e', 'r', 's', 't', 'a', 'n', 'd', ' ', 'a', ' ', 'h', 'u', 'm', 'a', 'n', ' ', 'l', 'a', 'n', 'g', 'u', 'a', 'g', 'e', '!']


# Feature Engineering

The goal of feature engineering is to transform text into numerical representations while preserving the underlying meaning and structure.

The process involves various techniques such as the following:

1. Bag of words
2. TF-IDF
3. Word Embeddings
4. BERT Embeddings

## Bag of Words

It converts text into numerical features by treating each document as an unordered collection of words, ignoring grammar, word order, and context, but retaining the frequency of words.

To achieve this, there are 3 steps involved:

1. Tokenizing the text, e.g. into words
2. Building the vocabulary
3. Vectorizing the text

Bag of Words provides the following advantages:

1. Simple - Very straightforward to implement
2. Efficiency - It processes text relatively quickly
3. Baseline - It serves as a baseline for other complex models

However, there are disadvantages that should be considered. These include:

1. Loss of context - It neglects order and context.
2. High dimensionality - Vocabulary size can grow with large data/corpus
3. Sparsity - Many elements would map to 0 which results in sparse representations.
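
The three steps can be sketched by hand with the standard library. The two sample sentences are made up for this example, and the tokenization rule (lowercase, words of two or more characters) is an assumption chosen to mirror common defaults.

```python
import re
from collections import Counter

documents = [
    "Language models are important in NLP.",
    "NLP is fun and language is everywhere.",
]

# Step 1: tokenize (lowercase, keep words of two or more characters)
tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in documents]

# Step 2: build the vocabulary as a sorted list of unique tokens
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# Step 3: vectorize each document as a list of counts over the vocabulary
vectors = []
for tokens in tokenized:
    counts = Counter(tokens)
    vectors.append([counts[term] for term in vocabulary])

print("Vocabulary:", vocabulary)
print("Vectors:", vectors)
```

Each vector has one entry per vocabulary word, which is why the representation becomes sparse and high-dimensional as the corpus grows.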

In [None]:
import sklearn
from sklearn.feature_extraction.text import CountVectorizer # Implements BOW

# Sample corpus eg. a document split into sentences
documents = [
    "Natural language processing is fun as it can interpret human language.",
    "Language models are important in NLP.",
    "I care about AI but I need the foundations first."
]

vectorizer = CountVectorizer() # Convert a collection of text documents to a matrix of token counts.
matrix = vectorizer.fit_transform(documents)

bag_of_words = matrix.toarray()
print("Bag of words matrix", bag_of_words)

vocab = vectorizer.get_feature_names_out()
print("Vocabulary:", vocab)

Bag of words matrix [[0 0 0 1 0 1 0 0 0 1 1 0 0 1 1 1 2 0 1 0 0 1 0]
 [0 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 1 1 0 0 1 0 0]
 [1 1 0 0 1 0 1 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1]]
Vocabulary: ['about' 'ai' 'are' 'as' 'but' 'can' 'care' 'first' 'foundations' 'fun'
 'human' 'important' 'in' 'interpret' 'is' 'it' 'language' 'models'
 'natural' 'need' 'nlp' 'processing' 'the']


### Text Classification using Bag of Words

This is a simple project where we use Bag of Words (BOW) for feature engineering.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB # A classifier (model) based on the Naive Bayes algorithm
from sklearn.model_selection import train_test_split # A helper to split the input into train and test sets by fraction
from sklearn.metrics import accuracy_score # A metric: the fraction of predictions that match the true labels

documents = [
    "Natural language processing is fun.",
    "Language models are important in NLP.",
    "I enjoy learning about artificial intelligence.",
    "Machine learning and NLP are closely related",
    "Deep learning is a subset of machine learning."
]
# For the labels
# 1 -> NLP-related, 0 -> AI-related
labels = [ 1, 1, 0, 1, 0 ]

# Get a vectorizer for Bag of Words (BOW)
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(documents)

"""
In ML, the data is usually split into two parts - X and Y
X represents the inputs (features)
Y represents the labels that the model learns to predict

In this example the bow_matrix is the numerical representation of the input text.
The text data needs to be converted to a numerical representation before it is passed to the model.
"""

x_train, x_test, y_train, y_test = train_test_split(bow_matrix, labels, test_size = 0.2, random_state = 31)

# Initialize classifier (classification model)
classifier = MultinomialNB()
classifier.fit(x_train, y_train)

# Generate the predictions
y_pred = classifier.predict(x_test)

# Verify the accuracy of the classifier/model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 1.0


## Term Frequency - Inverse Document Frequency (TF-IDF)

Unlike Bag of Words (BOW), TF-IDF takes into account the importance of each word, as related to the entire text corpus.

This method helps in distinguishing the unique aspects of a document, thereby improving the performance of various NLP tasks such as document classification, clustering and information retrieval.

Consider the following advantages to TF-IDF:

1. Importance weighting - Assigns higher weights to words that are particularly significant to a document while assigning lower weights to common words.
2. Reduction of noise - Less important words such as "the", "is" are assigned lower weights.
3. Capture of nuanced representation - TF-IDF captures both frequency of words in a document as well as within the entire corpus
4. Effective for large corpora

There are downsides to this as well, such as:

1. Sparsity - For corpora with large vocabularies, many elements in the feature vectors may end up being zero, wasting storage.
2. Context ignorance - It does not capture the semantics of words, i.e., it treats each word independently without considering its surrounding words/context.
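
To show where the weights come from, the sketch below computes TF-IDF by hand. It assumes the variant used by scikit-learn's `TfidfVectorizer` with default settings: raw term counts, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each document vector. Other libraries and textbooks use slightly different formulas.

```python
import math
import re
from collections import Counter

documents = [
    "Natural language processing is fun as it can interpret human language.",
    "Language models are important in NLP.",
    "I care about AI but I need the foundations first.",
]

# Lowercase and keep words of two or more characters
tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in documents]
vocabulary = sorted({token for tokens in tokenized for token in tokens})
n_docs = len(documents)

# Document frequency: in how many documents does each term appear?
doc_freq = Counter(token for tokens in tokenized for token in set(tokens))

# Smoothed inverse document frequency (assumed variant, see above)
idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1 for term in vocabulary}

def tfidf_vector(tokens):
    """Term count times IDF, scaled so the vector has unit L2 norm."""
    counts = Counter(tokens)
    raw = [counts[term] * idf[term] for term in vocabulary]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]

tfidf = [tfidf_vector(tokens) for tokens in tokenized]

language_index = vocabulary.index("language")
print("Weight of 'language' in document 1:", round(tfidf[0][language_index], 8))
print("Weight of 'language' in document 2:", round(tfidf[1][language_index], 8))
```

"language" appears in two of the three documents, so its IDF is lower than that of words found in only one document.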

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample corpus eg. a document split into sentences
documents = [
    "Natural language processing is fun as it can interpret human language.",
    "Language models are important in NLP.",
    "I care about AI but I need the foundations first."
]

vectorizer = TfidfVectorizer() # Convert a collection of text documents to a matrix of TfIdf features.
matrix = vectorizer.fit_transform(documents)

tf_idf = matrix.toarray()
print("TF-IDF matrix:", tf_idf)

vocab = vectorizer.get_feature_names_out()
print("Vocabulary:", vocab)

TF-IDF matrix: [[0.         0.         0.         0.29730323 0.         0.29730323
  0.         0.         0.         0.29730323 0.29730323 0.
  0.         0.29730323 0.29730323 0.29730323 0.45221354 0.
  0.29730323 0.         0.         0.29730323 0.        ]
 [0.         0.         0.42339448 0.         0.         0.
  0.         0.         0.         0.         0.         0.42339448
  0.42339448 0.         0.         0.         0.32200242 0.42339448
  0.         0.         0.42339448 0.         0.        ]
 [0.35355339 0.35355339 0.         0.         0.35355339 0.
  0.35355339 0.35355339 0.35355339 0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.35355339 0.         0.         0.35355339]]
Vocabulary: ['about' 'ai' 'are' 'as' 'but' 'can' 'care' 'first' 'foundations' 'fun'
 'human' 'important' 'in' 'interpret' 'is' 'it' 'language' 'models'
 'natural' 'need' 'nlp' 'processing' 'the']


### Text Classification using TF-IDF

This is a simple project where we use Term Frequency - Inverse Document Frequency (TF-IDF) for feature engineering.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB # A classifier (model) based on the Naive Bayes algorithm
from sklearn.model_selection import train_test_split # A helper to split the input into train and test sets by fraction
from sklearn.metrics import accuracy_score # A metric: the fraction of predictions that match the true labels

documents = [
    "Natural language processing is fun.",
    "Language models are important in NLP.",
    "I enjoy learning about artificial intelligence.",
    "Machine learning and NLP are closely related",
    "Deep learning is a subset of machine learning."
]
# For the labels
# 1 -> NLP-related, 0 -> AI-related
labels = [ 1, 1, 0, 1, 0 ]

# Get a vectorizer for TF-IDF
vectorizer = TfidfVectorizer()
tf_idf_matrix = vectorizer.fit_transform(documents)

"""
In ML, the data is usually split into two parts - X and Y
X represents the inputs (features)
Y represents the labels that the model learns to predict

In this example the tf_idf_matrix is the numerical representation of the input text.
The text data needs to be converted to a numerical representation before it is passed to the model.
"""

x_train, x_test, y_train, y_test = train_test_split(tf_idf_matrix, labels, test_size = 0.2, random_state = 31)

# Initialize classifier (classification model)
classifier = MultinomialNB()
classifier.fit(x_train, y_train)

# Generate the predictions
y_pred = classifier.predict(x_test)

# Verify the accuracy of the classifier/model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 1.0


## Word Embeddings

Word embeddings represent words as vectors in a continuous vector space. This approach has significant advantages over traditional BOW and TF-IDF: embeddings are designed to capture semantic relationships between words, so that words with similar meanings have similar representations in the vector space.

Here are key concepts around word embeddings:

1. Semantic similarity - E.g., "king" and "queen" might have similar vectors because they often appear in similar contexts such as royalty, governance, or historical narratives.
2. Continuous vector space - Each word is represented as a point in the continuous vector space. E.g., the difference between "king" and "man" should be approximately the same as the difference between "queen" and "woman".
3. Dimensionality reduction - Words are represented as dense vectors with far fewer dimensions than the vocabulary size, and similar words map to points that are close together.
4. Transfer learning - Pre-trained word embeddings can be reused in other NLP tasks. The same embeddings can be used in machine translation, sentiment analysis, and text classification projects.
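
Semantic similarity and the "king"/"queen" relationship can be sketched with cosine similarity. The 4-dimensional vectors below are hypothetical and hand-picked so that the arithmetic works out exactly; learned embeddings have many more dimensions and only satisfy such relationships approximately.

```python
import math

# Hypothetical embeddings; the dimensions loosely mean (royalty, male, female, food)
embeddings = {
    "king":  [1.0, 1.0, 0.0, 0.0],
    "queen": [1.0, 0.0, 1.0, 0.0],
    "man":   [0.0, 1.0, 0.0, 0.0],
    "woman": [0.0, 0.0, 1.0, 0.0],
    "apple": [0.0, 0.0, 0.0, 1.0],
}

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print("king vs queen:", cosine(embeddings["king"], embeddings["queen"]))
print("king vs apple:", cosine(embeddings["king"], embeddings["apple"]))

# Analogy: king - man + woman should land near queen
analogy = [k - m + w for k, m, w in zip(embeddings["king"], embeddings["man"], embeddings["woman"])]
candidates = [word for word in embeddings if word not in {"king", "man", "woman"}]
best = max(candidates, key=lambda word: cosine(analogy, embeddings[word]))
print("king - man + woman is closest to:", best)
```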

Popular word embeddings are:

1. Word2Vec
2. GloVe

### Word2Vec

Developed by Google. It comes in 2 main variants:

1. Continuous Bag of Words (CBOW) - It takes a set of context words as input and attempts to guess the word that is most likely to fit in the middle of these context words. _**This is very effective in identifying words that frequently appear in similar contexts**_.
2. Skip-Gram - It predicts the context words, given the target word. It essentially takes a single word as input and tries to predict the words that are likely to appear around it within a specified window of context. _**This is especially useful for identifying rare words and their context.**_

Both variants aim to capture the intricate relationship between words based on their context, thereby enabling more nuanced and sophisticated language models.
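
The difference between the two variants is easiest to see in the training examples they are built from. The sketch below only generates those examples for a fixed window; the helper names `skipgram_pairs` and `cbow_examples` are assumptions, and no neural network is trained here.

```python
def skipgram_pairs(tokens, window):
    """Skip-gram: one (target, context word) pair per neighbor in the window."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

def cbow_examples(tokens, window):
    """CBOW: one (list of context words, target) example per position."""
    examples = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, target))
    return examples

tokens = "the quick brown fox jumps over the lazy dog".split()

fox_context = [context for target, context in skipgram_pairs(tokens, window=2) if target == "fox"]
print("Skip-gram context for 'fox':", fox_context)

print("CBOW example for 'fox':", cbow_examples(tokens, window=2)[3])
```

Skip-gram turns each target word into several small prediction tasks, while CBOW makes a single prediction per position from the combined context.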

In [None]:
import nltk
from gensim.models import Word2Vec
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download("punkt") # Download the punkt tokenizer

# Sample text/corpus
text = "Natural language processing is fun and exciting. Language models are important in NLP. I enjoy learning about artificial intellignence. Machine learning and NLP are closely related. Deep learning is a subset of machine learning"

sentences = sent_tokenize(text)
tokens = [word_tokenize(sentence) for sentence in sentences]
print("Tokens", tokens)

"""
The model to use is Word2Vec. It accepts a collection of all tokenized sentences in the corpus.

Dimensionality refers to the number of elements in the word vector.
For example, if a word vector has 100 dimensions, it means each word is represented by a vector of 100 numbers.
The dimensionality of these vectors is a crucial factor in how well they capture the nuances of word meanings and relationships.

The model defined has a vector_size of 100. This is the same as the dimensionality of the vector.

In Word2Vec, the window size determines how many words before and after the target word are considered as context.
For example, with a window size of 2, the context for the word "fox" in the sentence
"The quick brown fox jumps over the lazy dog" would be ["quick", "brown", "jumps", "over"].

sg selects the training algorithm
1 -> Skip-gram i.e., predict context words given target word
0 -> CBOW (the default) i.e., predict target word given context words

min_count=1 means to ignore all words with a total frequency lower than this (1), so here every word is kept.
A higher value can be used to remove rare words and focus on common ones.

"""

# Create and train a Word2Vec model for the corpus
model = Word2Vec(sentences = tokens, vector_size=100, window=5, sg=1, min_count=1, epochs=3)

# Test model
target = "language"
vector = model.wv[target]
print("Vector for language:", vector)

similar_words = model.wv.most_similar(target, topn=2)
print("Similar words", similar_words)

Tokens [['Natural', 'language', 'processing', 'is', 'fun', 'and', 'exciting', '.'], ['Language', 'models', 'are', 'important', 'in', 'NLP', '.'], ['I', 'enjoy', 'learning', 'about', 'artificial', 'intellignence', '.'], ['Machine', 'learning', 'and', 'NLP', 'are', 'closely', 'related', '.'], ['Deep', 'learning', 'is', 'a', 'subset', 'of', 'machine', 'learning']]
Vector for language: [-0.00515624 -0.00666834 -0.00777684  0.00831073 -0.00198234 -0.00685496
 -0.00415439  0.00514413 -0.00286914 -0.00374966  0.00162143 -0.00277629
 -0.00158436  0.00107449 -0.00297794  0.00851928  0.00391094 -0.00995886
  0.0062596  -0.00675425  0.00076943  0.00440423 -0.00510337 -0.00211067
  0.00809548 -0.00424379 -0.00763626  0.00925791 -0.0021555  -0.00471943
  0.0085708   0.00428334  0.00432484  0.00928451 -0.00845308  0.00525532
  0.00203935  0.00418828  0.0016979   0.00446413  0.00448629  0.00610452
 -0.0032021  -0.00457573 -0.00042652  0.00253373 -0.00326317  0.00605772
  0.00415413  0.00776459  0.002

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


### Global Vectors for Word Representation (GloVe)

Developed by Stanford University. Unlike Word2Vec, which is based on predicting context words, GloVe relies on matrix factorization of word co-occurrence matrices.

To break it down, GloVe works this way:

1. Building the co-occurrence matrix - This is a matrix that records how often each pair of words appears together within a certain context window. For example, if our context window is 5 words, the matrix will count how often the word "cat" appears within 5 words of "dog," "fish," "pet," etc.
2. Calculating co-occurrence probabilities - Once the matrix is built, GloVe calculates the probability of seeing a pair of words together. This probability is based on the counts from the co-occurrence matrix. For instance, the probability that "cat" appears near "dog" might be higher than "cat" near "fish."
3. Creating the word embeddings - GloVe aims to find word vectors (embeddings) that can capture the co-occurrence probabilities. It does this by factorizing the co-occurrence matrix. This means GloVe breaks down the large matrix into smaller matrices of lower dimensions, making it easier to work with. The goal is to find word vectors such that the dot product of any two word vectors approximates the logarithm of their co-occurrence probability.
4. Optimizing the word embeddings - GloVe uses an optimization process to adjust the word vectors so that they best represent the relationships captured in the co-occurrence matrix. It minimizes a loss function that measures the difference between the actual co-occurrence probabilities and the probabilities predicted by the dot products of the word vectors.

Due to its approach, GloVe is able to capture both local context and broader, corpus-wide context. This can lead to more accurate embeddings for certain tasks.
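
The first two steps can be sketched with plain counting. This is a simplified illustration on a made-up two-sentence corpus; the function `cooccurrence_counts` is an assumption, and it leaves out details of the actual GloVe implementation as well as the factorization and optimization steps.

```python
from collections import Counter

def cooccurrence_counts(sentences, window):
    """Count how often each ordered pair of words appears within the window."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

# Tiny made-up corpus, already tokenized
sentences = [
    ["the", "cat", "and", "dog", "play"],
    ["the", "dog", "and", "cat", "sleep"],
]
counts = cooccurrence_counts(sentences, window=2)

def cooccurrence_probability(word, neighbor):
    """Estimate P(neighbor | word) from the co-occurrence counts."""
    total = sum(count for (first, _), count in counts.items() if first == word)
    return counts[(word, neighbor)] / total if total else 0.0

print("count(cat, dog):", counts[("cat", "dog")])
print("P(dog | cat):", cooccurrence_probability("cat", "dog"))
print("P(play | cat):", cooccurrence_probability("cat", "play"))
```

GloVe then looks for vectors whose dot products reproduce the logarithms of statistics like these across the whole vocabulary.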

In [None]:
import gensim.downloader as downloader

# Load pretrained GloVe embeddings
model = downloader.load("glove-wiki-gigaword-100")

target = "food"
vector = model[target]
print(f"Vector for {target}:", vector)

similar_words = model.most_similar(target)
print("Similar words:", similar_words)

Vector for food: [-1.6486e-01  9.1997e-01  2.2737e-01 -4.9031e-01 -1.8082e-03 -3.3803e-01
  5.7221e-02  1.4601e-01  4.0202e-01 -2.8858e-01 -4.7495e-01 -5.6369e-01
  2.7037e-01  5.1702e-01 -1.1241e-01  1.8314e-01  2.2066e-01 -4.8606e-01
 -8.7284e-01 -6.2587e-02  4.3016e-02  2.3641e-01  5.9705e-01 -3.8640e-01
 -2.5194e-01  9.6862e-01 -4.3112e-01 -4.8370e-01 -1.1396e+00  9.2425e-02
 -1.1476e-01 -7.4291e-02 -6.2524e-02 -9.5122e-02 -2.2714e-01  8.8291e-01
  3.9978e-01  7.6631e-01 -6.7697e-01 -6.2829e-01 -1.1872e-01 -2.4492e-01
 -5.8893e-01 -8.5088e-01  1.1107e+00  4.2190e-01 -1.5072e+00 -1.9509e-01
 -2.6712e-01 -7.0801e-01  5.5075e-01 -4.6929e-02 -2.5203e-01  7.4411e-01
 -1.8325e-01 -1.4885e+00 -4.6393e-01 -1.0338e-01  2.3525e+00 -1.5421e-01
  3.9833e-01  1.5344e-02  8.0708e-02 -2.7373e-01  9.7057e-01 -1.9383e-02
  2.0899e-01 -6.4033e-01  9.2509e-01 -4.5371e-01 -7.0564e-01 -1.6033e-01
 -7.1761e-02  6.2856e-01  3.5732e-01  8.8802e-01 -6.9127e-01  4.9634e-02
 -9.3347e-01  6.5396e-01  3.71

### Bidirectional Encoder Representations from Transformers (BERT)

BERT was developed by Google and uses the Transformer architecture to solve the problem with static embeddings. Unlike Word2Vec and GloVe, which provide static representations of words that remain unchanged regardless of context, BERT generates context-aware embeddings. This means the representation of the same word can change depending on its context in a sentence. E.g., "bank" will have different embeddings in the contexts of "river bank" and "bank account".

BERT uses 2 main approaches:

1. Pre-training - Trained on a large corpus using 2 unsupervised tasks, i.e., Masked Language Modeling (MLM) and Next Sentence Prediction (NSP)
2. Fine-tuning - The pre-trained model is trained further on labeled data for a specific task

To give an example: suppose you have a text classification task where you want to classify emails as spam or not. You can start with a pre-trained BERT model that already understands language nuances. During fine-tuning, the model is trained on the labeled data and adjusts its parameters slightly to optimize for this specific task without losing the broad language understanding it gained during the pre-training phase.
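
The Masked Language Modeling task can be sketched as a data preparation step: hide some tokens and keep the originals as the labels the model must recover. This is a simplified illustration; the function `mask_tokens` is an assumption, and the original BERT procedure selects about 15% of tokens and sometimes substitutes a random token or leaves the selected token unchanged instead of always using `[MASK]`.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace each token with [MASK] with probability mask_prob.

    Returns the masked sequence and the labels: the original token at
    masked positions, None elsewhere (those positions are not scored).
    """
    rng = random.Random(seed)   # fixed seed so the example is repeatable
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(token)
        else:
            masked.append(token)
            labels.append(None)
    return masked, labels

tokens = "natural language processing is fun and exciting".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3)
print("Masked input:", masked)
print("Labels:", labels)
```

During pre-training the model sees the masked input and is trained to predict the label at each masked position from the surrounding context on both sides.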

In [None]:
import torch # Import Pytorch for Neural Network-related work
from transformers import BertTokenizer, BertModel

# Load a pretrained BERT model and tokenizer
model = BertModel.from_pretrained("bert-base-uncased") # Uncased model: the tokenizer lowercases the input text
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased", clean_up_tokenization_spaces=True)

# Sample text
text = "Natural Language processing is fun"
print("Text:", text)

inputs = tokenizer(text, return_tensors="pt") # Ensure that returned tensors are in pytorch format/friendly

# Without computing the gradients with backpropagation
# Another way to think of this is to not adjust the model weights, from the original training
with torch.no_grad():
  outputs = model(**inputs)
  #print(outputs)

"""
The model output has several fields. One of them is last_hidden_state,
a 3-dimensional tensor of shape (batch_size x number_of_tokens x hidden_size).
Index 0 along the token axis is the [CLS] token, whose vector is commonly
used as a summary embedding for the whole text.
"""
cls_embedding = outputs.last_hidden_state[:, 0, :]
print("BERT embedding for text:", cls_embedding)


Text: Natural Language processing is fun
BERT embedding for text: tensor([[-3.1769e-01, -5.6979e-02, -2.9087e-01,  1.6923e-02, -4.7647e-01,
         -4.3632e-01,  1.4920e-01,  6.0166e-01, -8.7724e-02, -3.4493e-01,
         -3.2336e-01,  3.3781e-02, -2.4300e-01, -2.4146e-03,  2.9958e-01,
          5.7035e-03, -3.3114e-01,  2.3717e-01,  1.4185e-01, -1.2382e-01,
         -2.4081e-01, -5.9463e-02, -2.3821e-01, -1.5302e-01,  1.4933e-02,
         -1.6121e-01,  1.9571e-01, -2.4481e-01,  3.0754e-01, -4.2377e-02,
         -1.1753e-01,  6.8966e-02, -1.9826e-01, -2.4759e-01,  3.0578e-01,
         -7.2803e-02,  1.2484e-01, -4.4241e-02,  1.2230e-01,  4.0596e-02,
         -1.0609e-01,  1.2684e-01,  1.5265e-01,  8.7406e-03, -2.3450e-01,
         -2.6255e-01, -2.4217e+00, -4.5520e-02, -4.0230e-01, -2.6297e-01,
         -2.9933e-01, -1.1200e-01,  2.8905e-01,  6.4164e-01,  1.9518e-01,
          3.1939e-01, -1.0347e-01,  7.0154e-01,  1.7883e-01,  1.6085e-01,
          9.7144e-02,  3.1741e-01, -2.7081e-01

# Language Modeling

Language modeling involves predicting the next word or sequence of words in a sentence. It serves as the backbone for speech recognition, machine translation, text generation and more.

## N-grams

N-grams are contiguous sequences of N items derived from a given sample of text or speech. N can be 1, 2, 3, etc.

There are challenges with N-grams such as:

1. Sparsity - As N increases, many N-grams may not appear in the training corpus, making it difficult to estimate their probabilities accurately.
2. Context limitation - N-grams only capture a fixed window of context, which may be insufficient for capturing long-range dependencies in a language.
3. Memory usage - The number of distinct N-grams that must be stored grows quickly with N and with vocabulary size
4. Inability to capture semantic meaning
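
The sparsity problem can be made concrete with a small experiment. The sketch below uses two made-up sentences and plain whitespace tokenization (both are assumptions for illustration) and counts how many n-grams of a test sentence were never seen in the training sentence as N grows.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every contiguous window of n tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

train = "the cat sat on the mat and the dog chased the cat".split()
test = "the dog sat on the mat".split()

unseen_by_n = {}
for n in (1, 2, 3):
    seen = ngram_counts(train, n)
    test_grams = list(ngram_counts(test, n).elements())
    unseen = [g for g in test_grams if g not in seen]
    unseen_by_n[n] = len(unseen)
    print(f"n={n}: {len(unseen)} of {len(test_grams)} test n-grams never seen in training")
```

Every word of the test sentence appears in the training text, yet the share of unseen n-grams rises with N because longer word combinations are rarer.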

Despite these challenges, n-grams are applicable in the following areas:

1. Text prediction
2. Speech recognition - we can filter out improbable word combinations
3. Machine translation
4. Text generation

In [None]:
import nltk
from nltk.tokenize import word_tokenize
from nltk import ngrams
from collections import Counter

# Download the punkt tokenizer
nltk.download("punkt")

# Sample text
text = "Natural language processing is a fascinating field to study"
print("Text:", text)

tokens = word_tokenize(text)
print("Tokens:",tokens)

def generate_ngrams(tokens, n):
  n_grams = ngrams(tokens, n)
  return [' '.join(grams) for grams in n_grams]

print("Unigrams:", generate_ngrams(tokens, 1))
print("Bigrams:", generate_ngrams(tokens, 2))
print("Trigrams:", generate_ngrams(tokens, 3))

Text: Natural language processing is a fascinating field to study
Tokens: ['Natural', 'language', 'processing', 'is', 'a', 'fascinating', 'field', 'to', 'study']
Unigrams: ['Natural', 'language', 'processing', 'is', 'a', 'fascinating', 'field', 'to', 'study']
Bigrams: ['Natural language', 'language processing', 'processing is', 'is a', 'a fascinating', 'fascinating field', 'field to', 'to study']
Trigrams: ['Natural language processing', 'language processing is', 'processing is a', 'is a fascinating', 'a fascinating field', 'fascinating field to', 'field to study']


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## Training an N-Gram Model

In [None]:
import nltk
from nltk.util import bigrams
from nltk.tokenize import word_tokenize
from collections import defaultdict

nltk.download("punkt")

# Sample corpus
documents = [
    "Natural language processing is fun.",
    "Language models are important in NLP.",
    "I enjoy learning about artificial intelligence.",
    "Machine learning and NLP are closely related",
    "Deep learning is a subset of machine learning."
]

tokens = [word_tokenize(sentence) for sentence in documents]
print("Tokens", tokens)

# Define model
model = defaultdict(lambda: defaultdict(lambda: 0))

# Count bigrams
for sentence in tokens:
  bi_grams = list(bigrams(sentence))
  for w1, w2 in bi_grams:
    model[w1][w2] += 1

# Calculate probabilities
for w1 in model:
  total = float(sum(model[w1].values()))
  for w2 in model[w1]:
    model[w1][w2] /= total


def get_bigram_prob(w1, w2):
  return model[w1][w2]

print("Prob:", get_bigram_prob("and", "NLP")) # Change a character in either word to see an unseen pair return 0



Tokens [['Natural', 'language', 'processing', 'is', 'fun', '.'], ['Language', 'models', 'are', 'important', 'in', 'NLP', '.'], ['I', 'enjoy', 'learning', 'about', 'artificial', 'intelligence', '.'], ['Machine', 'learning', 'and', 'NLP', 'are', 'closely', 'related'], ['Deep', 'learning', 'is', 'a', 'subset', 'of', 'machine', 'learning', '.']]
Prob: 0


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
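
With raw counts, any word pair that never occurs in the corpus gets a probability of exactly 0. One common remedy, which the cell above does not use, is add-one (Laplace) smoothing. The sketch below is a minimal illustration under stated assumptions: the two-sentence corpus, the whitespace tokenization and the helper name `smoothed_bigram_prob` are all chosen for this example.

```python
from collections import Counter

sentences = [
    "natural language processing is fun",
    "language models are important in nlp",
]
tokenized = [s.split() for s in sentences]

# Count each bigram and how often each word appears as the first word of a bigram
bigram_counts = Counter(
    (w1, w2) for sent in tokenized for w1, w2 in zip(sent, sent[1:])
)
context_counts = Counter(w1 for (w1, _w2) in bigram_counts.elements())
vocab = sorted({w for sent in tokenized for w in sent})

def smoothed_bigram_prob(w1, w2):
    """Add-one estimate: (count(w1, w2) + 1) / (count(w1, *) + vocabulary size)."""
    return (bigram_counts[(w1, w2)] + 1) / (context_counts[w1] + len(vocab))

print(smoothed_bigram_prob("language", "models"))  # pair seen in the corpus
print(smoothed_bigram_prob("language", "fun"))     # unseen pair, still above 0
```

Adding 1 to every count moves a little probability mass from seen pairs to unseen ones, so the estimates for a given first word still sum to 1 over the vocabulary.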


## Recurrent Neural Networks (RNN)

RNNs are a class of neural networks designed for sequential data. Their ability to remember past inputs makes them well-suited to a wide range of tasks that involve time series data or natural language, where order matters. They are adept at problems where the sequential order of data is important, such as speech recognition, video analysis and financial forecasting.

One of the biggest problems with RNNs is the vanishing gradient problem. Imagine you're playing a game of "telephone" with a long line of friends. In this game, you whisper a message to the first person, they whisper it to the next person, and so on. The goal is to see if the last person hears the same message you started with.

If each person whispers very softly, the message gets quieter and quieter.

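By the time it reaches the last person, the message might be so quiet that it's almost impossible to hear anything.

This is similar to what happens in Recurrent Neural Networks (RNNs) when they're learning from data. During training, the error signal (the gradient) is passed backward through every time step and can shrink a little at each one. By the time it reaches the earliest steps, it may be too small to change the weights meaningfully, so the network struggles to learn long-range dependencies.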
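
The shrinking can be reproduced numerically. The sketch below assumes a deliberately tiny scalar recurrence, `h_t = tanh(w * h_{t-1})`, rather than a real RNN layer, and the function name and the weight `w=0.9` are arbitrary choices for illustration. It multiplies the per-step derivatives together, which is what backpropagation through time does.

```python
import math

def gradient_through_time(w, steps, h0=0.5):
    """Return |d h_T / d h_0| for the scalar recurrence h_t = tanh(w * h_{t-1})."""
    h, grad = h0, 1.0
    for _ in range(steps):
        h = math.tanh(w * h)
        # Chain rule: d tanh(w * h_prev) / d h_prev = w * (1 - tanh(w * h_prev)^2)
        grad *= w * (1.0 - h * h)
    return abs(grad)

for steps in (1, 5, 20, 50):
    print(steps, gradient_through_time(w=0.9, steps=steps))
```

Each step multiplies the gradient by a factor below 1, so the influence of the first input on the final state decays roughly exponentially with the number of steps.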

In [None]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, SimpleRNN
from tensorflow.keras.utils import to_categorical

# Sample text
text = "hello world"

# Character level vocabulary
chars = sorted(set(text))
char_to_idx = { char:idx for idx, char in enumerate(chars) }
idx_to_char = { idx:char for char, idx in char_to_idx.items() }

"""
We want to generate input/output pairs
Eg, give the model "hel" and ask it to predict the next character, "l"

To train the model, we need to map each 3-char input to its output char
X = input chars
y = output char
"""
seq_n = 3 # Expect 3 chars to be input
X = [] # Start empty
y = [] # Start empty

"""
The goal is to map each 3-char window to the char that follows it
Example:
hel -> l
ell -> o
llo -> " "
"""
for i in range(len(text) - seq_n):
  X.append([char_to_idx[char] for char in text[i:i + seq_n]])
  y.append(char_to_idx[text[i + seq_n]])

X = np.array(X)
y = to_categorical(y, num_classes=len(chars))

# Reshape input
X = X.reshape(X.shape[0], X.shape[1], 1)

# Define the RNN Model
model = Sequential()
model.add(SimpleRNN(50, input_shape=(seq_n, 1)))
model.add(Dense(len(chars), activation="softmax"))

# Compile the model
model.compile(optimizer="adam", loss="categorical_crossentropy")

# Train the model
model.fit(X, y, epochs=10, verbose=1)

def generate_text(start_str, num_generate):
  # Encode the seed string as an array of shape (batch=1, seq_len, features=1)
  input_eval = [char_to_idx[s] for s in start_str]
  input_eval = np.array(input_eval).reshape((1, len(input_eval), 1))

  text_gen = []

  for i in range(num_generate):
    preds = model.predict(input_eval, verbose=0)
    pred_id = np.argmax(preds[-1])

    # Slide the window: drop the oldest char and append the prediction.
    # The appended value must be 3-dimensional to match input_eval.
    input_eval = np.append(input_eval[:, 1:], [[[pred_id]]], axis=1)
    text_gen.append(idx_to_char[pred_id])

  return start_str + ''.join(text_gen)

# Generate some text
start_str = "hel"
gen_text = generate_text(start_str, 5)
print("Generated:", gen_text)



Epoch 1/10
1/1 ━━━━━━━━━━━━━━━━━━━━ 2s 2s/step - loss: 2.3599
Epoch 2/10
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 45ms/step - loss: 2.2643
Epoch 3/10
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 38ms/step - loss: 2.1769
Epoch 4/10
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 55ms/step - loss: 2.0977
Epoch 5/10
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 67ms/step - loss: 2.0263
Epoch 6/10
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 48ms/step - loss: 1.9621
Epoch 7/10
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 55ms/step - loss: 1.9042
Epoch 8/10
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 57ms/step - loss: 1.8517
Epoch 9/10
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 56ms/step - loss: 1.8038
Epoch 10/10
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 57ms/step - loss: 1.7596

### RNN Applications

1. One to one - Image classification
2. One to many - Image captioning. A single image is fed but we output a sequence of characters describing the image
3. Many to one - Sentiment classification
4. Many to many - Machine translation
5. *Many to many - Video translation

Note that the biggest advantage of this model is that computation takes historical data into account. However, it can lose information from far back in the sequence.

In [None]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, LSTM

mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()

print(x_train.shape) # (60k, 28, 28)
# This is a sequence, each row is part of a seq
print(x_train[0].shape) # (28,28)

# Normalize the data
x_train = x_train / 255.0
x_test = x_test / 255.0

model = Sequential()
# Set return_sequences=True only when the next layer is another recurrent layer, not a Dense layer
model.add(LSTM(128, input_shape=(x_train.shape[1:]), activation="relu", return_sequences=True))
model.add(Dropout(0.2)) # 20% dropout

model.add(LSTM(128, activation="relu"))
model.add(Dropout(0.2))

model.add(Dense(32, activation="relu"))
model.add(Dropout(0.2))

model.add(Dense(10, activation="softmax"))

opt = tf.keras.optimizers.Adam()

model.compile(loss="sparse_categorical_crossentropy", optimizer=opt, metrics=["accuracy"])

model.fit(x_train, y_train, epochs=3, validation_data=(x_test, y_test))


(60000, 28, 28)
(28, 28)


  super().__init__(**kwargs)


Epoch 1/3
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 199s 103ms/step - accuracy: 0.5658 - loss: 1.2381 - val_accuracy: 0.9554 - val_loss: 0.1445
Epoch 2/3
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 189s 101ms/step - accuracy: 0.9499 - loss: 0.1819 - val_accuracy: 0.9707 - val_loss: 0.1041
Epoch 3/3
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 202s 101ms/step - accuracy: 0.9700 - loss: 0.1148 - val_accuracy: 0.9819 - val_loss: 0.0672


<keras.src.callbacks.history.History at 0x7a45aee73d30>

### RNN using PyTorch

In [None]:
import os
import numpy as np
from unidecode import unidecode
import string

import torch
import torch.nn as nn

## Download data file if not exists
if not os.path.isdir("/content/data"):
  print("Downloading data and unzipping")
  !wget https://download.pytorch.org/tutorial/data.zip
  !unzip *.zip
else:
  print("No need to download data")

# Helpers
def unicode_to_ascii(text: str) -> str:
  return unidecode(text)

def get_country_from_filename(f: str) -> str:
  # Another alternative is to use a regular expression to extract the name before .txt
  return f.split(".")[0]

# Prepare
letters = string.ascii_letters + " .,;'"
n_letters = len(letters)
print("All letters:", n_letters)

all_names = []
all_country = []

"""
Pull all the names and country from the file
This is the expected output:

all_names = [n1, n2, n3]
all_country = [c1, c1, c2]

Goal is to get the same size for names and countries (x -> y mapping)
"""
for f in os.listdir("/content/data/names"):
  # Read every name in the file; the with-block closes the file afterwards
  with open("/content/data/names/" + f, "r") as fl:
    lines = [line.strip() for line in fl.readlines()]
  cleaned_list = list(map(unidecode, lines))
  n_names_in_file = len(cleaned_list)
  all_names.extend(cleaned_list)

  country_name = get_country_from_filename(f)
  all_country.extend([country_name] * n_names_in_file)

n_rows = len(all_names)
print("Rows to process:", n_rows)
print("All names count:", len(all_names))
print("All countries count", len(all_country))

# Prepare the inputs for processing
# Start with one-hot encoding
emb = torch.eye(n_letters) # Represent each char with a one-hot encoding
print("One-hot encoding:", emb)
print("Shape of embedding:", emb.shape) # 57 unique chars

# Create a country to index map eg. { french: 0, english: 1, ... }
unique_countries = np.unique(all_country)
mapping = {str(country): idx for idx, country in enumerate(unique_countries)}
print("Country to index map:", mapping)

# Get data from an index, from names and country list
def get_data(idx):
  name = all_names[idx]
  country = all_country[idx]

  char_list_for_name = np.array(list(name)) # Convert name to numpy
  """
  There are 2 sides
  (1)char_list_for_name -> ['a', 'b', 'c']
  char_list_for_name[..., None] -> [['a'], ['b'], ['c']] # Trick to reshape
  (2) np.array(list(letters)) -> ['a', 'b', 'c']

  The np.where performs a comparison where each char on the left is compared to all letters
  [[ True, False, False, ..., False],  # 'a' compared with 'a', 'b', 'c', ...
  [False,  True, False, ..., False],  # 'b' compared with 'a', 'b', 'c', ...
  [False, False,  True, ..., False]]  # 'c' compared with 'a', 'b', 'c', ...

  np.where returns a tuple of row indices and column indices
  [1] corresponds to the column indices
  """
  indices = np.where(char_list_for_name[..., None] == np.array(list(letters)))[1]
  return emb[torch.from_numpy(indices)], torch.tensor(mapping[country]) # name (X) to country (y)


# print(get_data(0)) # [A, b, l] -> [26, 1, 11]

# Define the RNN model
class RNN(nn.Module):
  def __init__(self, n_country, n_letters):
    super(RNN, self).__init__()
    self.rnn = nn.RNN(input_size=n_letters, hidden_size=2 * n_letters) # 57 (letters), hidden is 57 * 2 (whatever)
    self.fc = nn.Linear(2 * n_letters, n_country)

  def forward(self, x):
    out, _ = self.rnn(x)
    out1 = self.fc(out[-1, :])
    return out1

model = RNN(len(np.unique(all_country)), n_letters)
print(model)

# Define loss function
loss_fn = nn.CrossEntropyLoss()

# Define optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Define epochs
n_epochs = 10
all_losses = []
for epoch in range(n_epochs):
  arr = np.arange(n_rows)
  np.random.shuffle(arr)

  epoch_loss = 0
  for idx in arr:
    x, y = get_data(idx)
    pred = model(x)
    loss = loss_fn(pred, y)
    epoch_loss += loss.item() # Accumulate the loss as a plain Python float

    # Per-sample printing is left out: with about 20k names per epoch it floods the output

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

  all_losses.append(epoch_loss)
  print("Epoch:", epoch, "Loss:", epoch_loss)


No need to download data
All letters: 57
Rows to process: 20074
All names count: 20074
All countries count 20074
One-hot encoding: tensor([[1., 0., 0.,  ..., 0., 0., 0.],
        [0., 1., 0.,  ..., 0., 0., 0.],
        [0., 0., 1.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 1., 0., 0.],
        [0., 0., 0.,  ..., 0., 1., 0.],
        [0., 0., 0.,  ..., 0., 0., 1.]])
Shape of embedding: torch.Size([57, 57])
Country to index map: {'Arabic': 0, 'Chinese': 1, 'Czech': 2, 'Dutch': 3, 'English': 4, 'French': 5, 'German': 6, 'Greek': 7, 'Irish': 8, 'Italian': 9, 'Japanese': 10, 'Korean': 11, 'Polish': 12, 'Portuguese': 13, 'Russian': 14, 'Scottish': 15, 'Spanish': 16, 'Vietnamese': 17}
RNN(
  (rnn): RNN(57, 114)
  (fc): Linear(in_features=114, out_features=18, bias=True)
)
Epoch: 0 Loss: 22554.54584160888


KeyboardInterrupt: 

# Syntax and Parsing

This section focuses on the structure and organization of sentences. With these techniques, we can decipher the grammatical structure of sentences, which is essential for enabling machines to interpret and generate human language accurately.

## Part-of-Speech (POS) Tagging

This is the process of assigning grammatical categories such as nouns, verbs, adjectives, and adverbs to each word in a sentence.

In English, common categories (shown with their Penn Treebank tags) are:

1. Noun (NN)
2. Verb (VB)
3. Adjective (JJ)
4. Adverb (RB)
5. Pronoun (PRP)
6. Preposition (IN)

Taggers are usually evaluated with accuracy, i.e. the fraction of tokens that receive the correct tag. However, the performance of these taggers can vary significantly depending on several factors:

1. Text domain - A tagger trained on news articles may not perform well on social media text because of the language, style, or vocabulary used.
2. Language - Some languages may have better taggers compared to others
3. Ambiguity - The word "run" can be both a noun and a verb
4. Quality of training data
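
Token-level accuracy is straightforward to compute once gold and predicted tags are lined up. The sketch below is a minimal version; the function name and the hand-written tag sequences are hypothetical and do not come from a real tagger.

```python
def tagging_accuracy(gold, predicted):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    correct = sum(g == p for (_, g), (_, p) in zip(gold, predicted))
    return correct / len(gold)

# Hand-written example: the tagger mistakes the verb "run" for a noun
gold = [("I", "PRP"), ("run", "VBP"), ("every", "DT"), ("morning", "NN")]
pred = [("I", "PRP"), ("run", "NN"), ("every", "DT"), ("morning", "NN")]
print(tagging_accuracy(gold, pred))  # 3 of 4 tags match -> 0.75
```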

In [None]:
import nltk
from nltk import word_tokenize, pos_tag

nltk.download("punkt")
nltk.download('averaged_perceptron_tagger')

text = "Natural Language Procesing with Python is very fun."

tokens = word_tokenize(text)
print("Tokens:", tokens)

pos_tags = pos_tag(tokens)
print("POS Tags:", pos_tags)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


Tokens: ['Natural', 'Language', 'Procesing', 'with', 'Python', 'is', 'very', 'fun', '.']
POS Tags: [('Natural', 'JJ'), ('Language', 'NNP'), ('Procesing', 'VBG'), ('with', 'IN'), ('Python', 'NNP'), ('is', 'VBZ'), ('very', 'RB'), ('fun', 'JJ'), ('.', '.')]


## Named Entity Recognition (NER)

This is a subtask of information extraction that aims to identify and classify named entities mentioned within unstructured text.

Common categories extracted include:

1. Person (PER)
2. Organization (ORG)
3. Location (LOC)
4. Miscellaneous (MISC)

In [None]:
import spacy

nlp = spacy.load("en_core_web_sm")

text = "Apple is looking to buy a U.K. startup at 2 million dollars"
doc = nlp(text)

print("Named entities:")
for ent in doc.ents:
  print(ent.text, ent.label_)

Named entities:
Apple ORG
U.K. GPE
2 million dollars MONEY


In [None]:
# Training a custom NER

import spacy
from spacy.tokens import DocBin
from spacy.training import Example
from spacy.util import minibatch, compounding

# Create a blank English model
nlp = spacy.blank("en")

# Create an NER component and add it to the pipeline
ner = nlp.add_pipe("ner")
ner.add_label("GADGET")

# Sample training data
TRAIN_DATA = [
  # Entity offsets are (start_char, end_char, label); end_char is exclusive
  ("Apple is releasing a new iPhone", {"entities": [(25, 31, "GADGET")]}),
  ("The new iPad Pro is amazing.", {"entities": [(8, 16, "GADGET")]}),
]

# Convert the training tuples to spaCy Example objects
examples = []
for text, annot in TRAIN_DATA:
  doc = nlp.make_doc(text)
  example = Example.from_dict(doc, annot)
  examples.append(example)

# Train the NER model
optimizer = nlp.initialize(lambda: examples) # spaCy v3 name for begin_training
for i in range(10):
  losses = {}
  batches = minibatch(examples, size=compounding(4.0, 32.0, 1.001))
  for batch in batches:
    nlp.update(batch, sgd=optimizer, losses=losses)
  print("Losses:", losses)

# Test the trained model
doc = nlp("I just bought a new iPhone and an iPad Pro")
print("Entities:", [(ent.text, ent.label_) for ent in doc.ents])

Losses: {'ner': 10.333333730697632}
Losses: {'ner': 9.872534215450287}
Losses: {'ner': 8.998019397258759}
Losses: {'ner': 7.608083248138428}
Losses: {'ner': 5.530070662498474}
Losses: {'ner': 3.4145753979682922}
Losses: {'ner': 1.820153184235096}
Losses: {'ner': 0.9663957273587584}
Losses: {'ner': 6.857765997759998}
Losses: {'ner': 2.010009991150582}
Entities: []
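
Character offsets like those in `TRAIN_DATA` are easy to get wrong when typed by hand, and a span that does not line up with the text is typically ignored during training. One way to avoid this is to compute the offsets from the text. The helper below is a hypothetical convenience function written for this example, not part of spaCy.

```python
def entity_span(text, phrase, label):
    """Return (start_char, end_char, label) for the first occurrence of phrase.

    end_char is exclusive, so text[start_char:end_char] == phrase.
    """
    start = text.find(phrase)
    if start == -1:
        raise ValueError(f"{phrase!r} not found in {text!r}")
    return (start, start + len(phrase), label)

text = "Apple is releasing a new iPhone"
span = entity_span(text, "iPhone", "GADGET")
print(span)
print(text[span[0]:span[1]])
```

Slicing the text with the computed offsets is a quick sanity check that each annotation covers exactly the intended words.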


# Sentiment Analysis

Sentiment analysis, also called opinion mining, identifies the sentiment expressed in text. By understanding the sentiment behind text, businesses and organizations can gain valuable insights into public opinion, customer satisfaction and overall sentiment trends.

There are 3 main approaches:

1. Rule-based
2. Machine Learning
3. Deep learning

## Rule-based Sentiment Analysis

This approach relies on manually crafted rules. The steps involved are:

1. Tokenization - Split text into words or tokens
2. Normalization - Convert to standard form such as lowercase
3. **Lexicon lookup** - A sentiment lexicon is used to assign a sentiment score to each token. Popular sentiment lexicons include AFINN, SentiWordNet and the NRC Emotion Lexicon.
4. Rule application
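
The four steps above can be sketched end to end in plain Python. Everything here is illustrative: the tiny `LEXICON` and its scores are invented for this example (real lexicons contain thousands of scored words), and the single negation rule is only one of many rules a real system would apply.

```python
import re

# Hypothetical mini-lexicon; scores are made up for illustration
LEXICON = {"love": 3, "amazing": 4, "good": 2, "bad": -2, "worst": -3, "disappointed": -2}
NEGATIONS = {"not", "never", "no"}

def rule_based_score(text):
    # Steps 1-2: tokenize and normalize to lowercase
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0
    for i, tok in enumerate(tokens):
        value = LEXICON.get(tok, 0)                 # Step 3: lexicon lookup
        if i > 0 and tokens[i - 1] in NEGATIONS:    # Step 4: flip the score after a negation
            value = -value
        score += value
    return score

print(rule_based_score("I love this product! It's amazing."))  # 3 + 4 = 7
print(rule_based_score("This is not good."))                   # negated 2 = -2
```

A positive total suggests positive sentiment and a negative total suggests negative sentiment; words missing from the lexicon contribute nothing.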

In [None]:
from textblob import TextBlob

text = "I love this product! It's amazing."
print("Text:", text)

# Perform sentiment analysis
blob = TextBlob(text)
print("Sentiment:", blob.sentiment)

"""
Polarity scores range from -1 (very negative) to 1 (very positive)
Subjectivity scores range from 0 (objective) to 1 (subjective)
"""

Text: I love this product! It's amazing.
Sentiment: Sentiment(polarity=0.6125, subjectivity=0.75)


## Machine Learning Approach

This approach involves training models to automatically learn patterns from labeled data. Unlike rule-based approaches, which rely on predefined linguistic rules and often struggle with nuanced language, machine learning methods can capture more complex patterns and relationships in data.

The analysis typically follows these steps:

1. Data collection - Gathering diverse labeled data
2. Data Preprocessing - Tokenization, normalization and vectorization
3. Feature extraction - Techniques such as TF-IDF, word embeddings or more advanced methods like BERT are used to capture the semantic meaning and context of the text.
4. Model Training - Algorithms such as Naive Bayes, Support Vector Machines (SVM) and advanced deep learning techniques such as Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) are fitted to the extracted features
5. Model Evaluation - accuracy, precision, recall or F1 score
6. Prediction
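
The evaluation metrics named in step 5 can be computed by hand from the counts of true positives, false positives and false negatives. The sketch below is a minimal version for binary labels; the function name and the label lists are made up for illustration.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
print(precision_recall_f1(y_true, y_pred))
```

Here 2 positives are found correctly, 1 negative is wrongly flagged and 1 positive is missed, so precision, recall and F1 all come out at 2/3.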

Advantages in this approach:

1. Better Performance compared to rule-based approaches
2. Scalable - More suitable for real-world applications and can be trained on large datasets.
3. Flexibility - Can be adapted to different domains and languages. A single model can be fine-tuned for different domains

However, the limitations of this include:

1. Data dependency - Requires a large amount of data for training. Without sufficient data, performance degrades.
2. Complexity - Involves extensive experimentation and parameter optimization, which can be time and resource intensive.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample text corpus
corpus = [
  "I love this product!",
  "This is the worst service I have ever experienced",
  "I am very happy with my purchase",
  "I am disappointed with the quality of this item"
]
labels=[1, 0, 1, 0] # Each matching a single sentence in the corpus

# Transform text to vectors
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

# Split the data
x_train, x_test, y_train, y_test = train_test_split(X, labels, test_size=0.2, random_state=42)

# Train the model
model = LogisticRegression(class_weight="balanced")
model.fit(x_train, y_train)

# Make predictions
y_pred = model.predict(x_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
print("Classification Report:\n", classification_report(y_test, y_pred))

Accuracy: 0.0
Classification Report:
               precision    recall  f1-score   support

           0       0.00      0.00      0.00       1.0
           1       0.00      0.00      0.00       0.0

    accuracy                           0.00       1.0
   macro avg       0.00      0.00      0.00       1.0
weighted avg       0.00      0.00      0.00       1.0



  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


## Deep Learning Approach

Deep learning models can handle long-range dependencies and large vocabularies, and can learn hierarchical representations of text, making them particularly powerful for sentiment analysis. LSTMs are particularly adept at maintaining context over longer text sequences.

Popular architectures include:

1. Convolutional Neural Networks (CNNs)
2. Recurrent Neural Networks (RNNs)
3. Long Short-Term Memory Networks (LSTM)
4. Transformer-Based - E.g., BERT

In [None]:
# LSTMs approach

import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split

# Sample text corpus
corpus = [
  "I love this product!",
  "This is the worst service I have ever experienced",
  "I am very happy with my purchase",
  "I am disappointed with the quality of this item"
]
labels=[1, 0, 1, 0] # Each matching a single sentence in the corpus

# Tokenize and pad the text data
tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(corpus)
sequences = tokenizer.texts_to_sequences(corpus)
padded_sequences = pad_sequences(sequences, maxlen=100) # Pad (or truncate) every sequence to 100 tokens, matching input_length below

# Split the data
x_train, x_test, y_train, y_test = train_test_split(padded_sequences, labels, test_size=0.2, random_state=42)

# Convert lists to NumPy arrays
x_train = np.array(x_train)
y_train = np.array(y_train)
x_test = np.array(x_test)
y_test = np.array(y_test)

# Define the LSTM
model = Sequential()
model.add(Embedding(input_dim=5000, output_dim=128, input_length=100))
model.add(LSTM(100))
model.add(Dense(1, activation="sigmoid"))

# Compile the model
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

# Train the model
model.fit(x_train, y_train, epochs=5, verbose=1, validation_data=(x_test, y_test))

# Evaluate the model
loss, accuracy = model.evaluate(x_test, y_test)
print("Loss:", loss)
print("Accuracy:", accuracy)

# Predict the sentiment of new text
new_text = [ "This product is excellent and I love it." ]
new_text_seq = tokenizer.texts_to_sequences(new_text)
new_text_padded = pad_sequences(new_text_seq, maxlen=100) # Use the same length the model was trained on
prediction = model.predict(new_text_padded)
print("Prediction:", "Positive" if prediction[0][0] > 0.5 else "Negative")


Epoch 1/5
1/1 ━━━━━━━━━━━━━━━━━━━━ 3s 3s/step - accuracy: 0.6667 - loss: 0.6888 - val_accuracy: 0.0000e+00 - val_loss: 0.7048
Epoch 2/5
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 139ms/step - accuracy: 0.6667 - loss: 0.6726 - val_accuracy: 0.0000e+00 - val_loss: 0.7217
Epoch 3/5
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 304ms/step - accuracy: 0.6667 - loss: 0.6560 - val_accuracy: 0.0000e+00 - val_loss: 0.7431
Epoch 4/5
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 292ms/step - accuracy: 0.6667 - loss: 0.6376 - val_accuracy: 0.0000e+00 - val_loss: 0.7726
Epoch 5/5
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 142ms/step - accuracy: 0.6667 - loss: 0.6165 - val_accuracy: 0.0000e+00 - val_loss: 0.8164
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 52ms/step - accuracy: 0.0000e+00 - loss: 0.8164
Loss: 0.8163905143737793
Accuracy: 0.0
1/1 ━━━━━━━━━━━━━━━━━━━━

In [None]:
# Using BERT
import numpy as np
import tensorflow as tf
from transformers import BertTokenizer, TFBertForSequenceClassification
from sklearn.model_selection import train_test_split

# Sample text corpus
corpus = [
  "I love this product!",
  "This is the worst service I have ever experienced",
  "I am very happy with my purchase",
  "I am disappointed with the quality of this item"
]
labels=[1, 0, 1, 0] # Each matching a single sentence in the corpus

# Tokenize and pad the text data
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
X = tokenizer(corpus, padding=True, truncation=True, return_tensors="tf", max_length=10)

# Convert the TensorFlow tensor to a NumPy array before splitting
input_ids = X["input_ids"].numpy() # Convert to NumPy array

# Split the data
x_train, x_test, y_train, y_test = train_test_split(input_ids, labels, test_size=0.2, random_state=42)

# Define the BERT model
model = TFBertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Compile the model
model.compile(optimizer="adam",
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), metrics=["accuracy"])

# Train the model
model.fit(x_train, np.array(y_train), epochs=3, batch_size=5, validation_data=(x_test, np.array(y_test)))

# Evaluate the model
loss, accuracy = model.evaluate(x_test, np.array(y_test))
print("Loss:", loss)
print("Accuracy:", accuracy)

# Predict the sentiment of new text
new_text = [ "This product is excellent and I love it." ]
new_text_enc = tokenizer(new_text, padding=True, truncation=True, return_tensors="tf", max_length=10)
prediction = model.predict(new_text_enc["input_ids"])
print("Logits:", np.argmax(prediction.logits)) # argmax gives the predicted class index (0 or 1), not the raw logits
print("Prediction:", "Positive" if np.argmax(prediction.logits) == 1 else "Negative")

All PyTorch model weights were used when initializing TFBertForSequenceClassification.

Some weights or buffers of the TF 2.0 model TFBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1/3
Epoch 2/3
Epoch 3/3
Loss: 2.815908908843994
Accuracy: 0.0
Logits: 1
Prediction: Positive


# Topic Modeling

A technique to automatically identify the underlying topics present in a collection of documents. This helps in organizing, understanding, and summarizing large datasets by discovering the hidden thematic structure within the text.

Various techniques are used, including:

1. Latent Semantic Analysis (LSA)
2. Latent Dirichlet Allocation (LDA)
3. Hierarchical Dirichlet Process (HDP)

## Latent Semantic Analysis

This is a foundational technique in the field of topic modeling and information retrieval that has been extensively studied and applied in various domains.

It works by reducing the dimensionality of the text data, which involves transforming the original term-document matrix into a lower-dimensional space. This transformation is achieved through a mathematical process known as Singular Value Decomposition (SVD), which decomposes the matrix into three component matrices, conventionally written U, Σ and V^T.

The steps involved are:

1. Create a term-document matrix - Represent the text data as a matrix in which each row corresponds to a term (word), each column corresponds to a document, and each entry is the frequency of the term in the respective document.
2. Apply SVD (Singular Value Decomposition)
3. Reduce dimensionality
4. Interpret topics
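
The four steps can be traced on a tiny hand-built matrix with NumPy. This is a sketch under stated assumptions: the term list and the counts are made up, and `np.linalg.svd` is used directly instead of a dedicated LSA implementation.

```python
import numpy as np

terms = ["cat", "dog", "mat", "table"]

# Step 1: term-document matrix (rows = terms, columns = documents, entries = counts)
A = np.array([
    [1, 1, 0, 1],   # cat
    [0, 1, 0, 0],   # dog
    [1, 0, 1, 1],   # mat
    [0, 0, 1, 0],   # table
], dtype=float)

# Step 2: SVD factorizes A into U @ diag(S) @ Vt
U, S, Vt = np.linalg.svd(A, full_matrices=False)

# Step 3: keep only the k largest singular values
k = 2
A_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]

# Step 4: each column of U gives the weight of every term in one latent topic
for topic in range(k):
    weights = dict(zip(terms, np.round(U[:, topic], 3)))
    print(f"Topic {topic}:", weights)
```

`A_k` is the rank-`k` approximation of the original matrix; the terms with the largest absolute weights in a column of `U` are the ones that characterize that topic.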

Advantages:

1. Dimensionality Reduction
2. Captures Synonymy
3. Noise Reduction
4. Enhanced Information Retrieval

Disadvantages:

1. Linear Assumption
2. Interpretability
3. Computationally Intensive
4. Limited Context Understanding


In [None]:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

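# Sample text corpus
# Each string needs a trailing comma: Python silently concatenates adjacent
# string literals, which would merge two documents and create a bogus "matthe" token
corpus = [
  "The cat sat on the mat",
  "The dog is chasing the cat",
  "The mat is on the table",
  "The cat is sitting on the mat",
  "The cat is chasing the dog"
]

# Create a TF-IDF vectorizer
vectorizer = TfidfVectorizer()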
X = vectorizer.fit_transform(corpus)

# Apply SVD
lsa = TruncatedSVD(n_components=2, random_state=42)
X_reduced = lsa.fit_transform(X)

# Print the terms and their corresponding components
terms = vectorizer.get_feature_names_out()
for i, comp in enumerate(lsa.components_):
  terms_comp = zip(terms, comp)
  sorted_terms = sorted(terms_comp, key=lambda x: x[1], reverse=True)[:10]
  print("Topic", i)
  for term, weight in sorted_terms:
    print(f"- {term}: {weight:.4f}")

Topic 0
- the: 0.6749
- is: 0.3495
- cat: 0.3495
- on: 0.2639
- mat: 0.2411
- chasing: 0.2263
- dog: 0.2263
- sat: 0.1529
- table: 0.1529
- matthe: 0.1076
Topic 1
- mat: 0.5070
- table: 0.3215
- sat: 0.3215
- on: 0.3088
- the: 0.0778
- matthe: -0.1593
- sitting: -0.1593
- is: -0.2085
- cat: -0.2085
- dog: -0.3854


## Latent Dirichlet Allocation (LDA)

Unlike LSA, which is rooted in linear algebra, LDA is a generative probabilistic model that aims to uncover the hidden thematic structure in a collection of documents by assuming a statistical process that produced them.

The core assumption is that each document is a mixture of topics, and each topic is itself a mixture of words with certain probabilities. By fitting LDA, researchers and practitioners can discover the underlying topics that best explain the observed documents, which helps in understanding the thematic composition of large text corpora.
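
To make the "mixture of topics, mixture of words" assumption concrete, here is a minimal sketch of the generative story using only the standard library. The topic names, word probabilities, and document mixture are hypothetical values chosen for illustration; in full LDA these distributions are themselves drawn from Dirichlet priors and are inferred from data rather than fixed by hand.

```python
import random

rng = random.Random(0)

# Hypothetical topics: each is a probability distribution over words
topics = {
    "pets": {"cat": 0.5, "dog": 0.4, "mat": 0.1},
    "furniture": {"table": 0.5, "mat": 0.4, "cat": 0.1},
}

# Hypothetical document: a mixture of topics
doc_topic_mix = {"pets": 0.7, "furniture": 0.3}

def sample(distribution):
    """Draw one key from a {key: probability} mapping."""
    keys = list(distribution)
    weights = list(distribution.values())
    return rng.choices(keys, weights=weights, k=1)[0]

def generate_document(topic_mix, n_words):
    """Generate each word by first picking a topic, then a word from that topic."""
    words = []
    for _ in range(n_words):
        topic = sample(topic_mix)
        words.append(sample(topics[topic]))
    return words

doc = generate_document(doc_topic_mix, n_words=8)
print(doc)
```

Training an LDA model runs this story in reverse: given only the words, it estimates the topic-word and document-topic distributions that most plausibly generated them.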

LDA can be applied in areas such as:

1. Document classification
2. Recommendation systems
3. Gaining insights from massive datasets in fields like social sciences and digital humanities.

LDA offers significant advantages in terms of its probabilistic foundation, flexibility and interpretability, making it a powerful tool for topic modeling. However, it also faces limitations related to scalability, hyperparameter tuning, and the validity of its underlying assumptions.

In [None]:
!pip install pprintpp

import gensim
from gensim import corpora
from gensim.models import LdaModel
from pprintpp import pprint

# Sample text corpus
# (a comma after every string keeps adjacent literals from being concatenated)
corpus = [
  "The cat sat on the mat",
  "The dog is chasing the cat",
  "The mat is on the table",
  "The cat is sitting on the mat",
  "The cat is chasing the dog"
]

# Tokenize the text: lowercase and split on whitespace (stop words are not removed here)
texts = [[word for word in doc.lower().split()] for doc in corpus]

# Create a dictionary representation of the documents
dictionary = corpora.Dictionary(texts)

# Create a bag-of-words representation of the documents
corpus_bow = [dictionary.doc2bow(text) for text in texts]

# Train the LDA model
lda_model = LdaModel(corpus=corpus_bow, id2word=dictionary, num_topics=2, random_state=42, passes=10)

# Print the topics and their associated words
pprint(lda_model.print_topics(num_words=5))

# Assign topics to a new document
new_doc = "The cat is chasing the dog"
new_doc_bow = dictionary.doc2bow(new_doc.lower().split())
print(lda_model.get_document_topics(new_doc_bow))

[
    (
        0,
        '0.269*"the" + 0.127*"is" + 0.127*"cat" + 0.099*"on" + 0.070*"chasing"',
    ),
    (
        1,
        '0.094*"the" + 0.093*"mat" + 0.092*"sat" + 0.092*"on" + 0.091*"table"',
    ),
]
[(0, 0.92085874), (1, 0.07914128)]


## Hierarchical Dirichlet Process (HDP)

HDP is a nonparametric extension of LDA that removes the need to specify the number of topics in advance. Instead, it infers an appropriate number of topics from the data it analyzes.
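
One way to build intuition for how the number of topics can grow with the data is the Chinese restaurant process, a standard picture of the Dirichlet process that HDP builds on. The sketch below is only that intuition, not HDP itself (which applies the idea hierarchically so that topics are shared across documents); the function name and the `alpha` value are assumptions for illustration.

```python
import random

def chinese_restaurant_process(n_customers, alpha, seed=0):
    """Seat customers one by one; return the list of table sizes."""
    rng = random.Random(seed)
    tables = []  # tables[i] = number of customers currently at table i
    for _ in range(n_customers):
        # Join an existing table with probability proportional to its size,
        # or open a new table with probability proportional to alpha.
        weights = tables + [alpha]
        choice = rng.choices(range(len(weights)), weights=weights, k=1)[0]
        if choice == len(tables):
            tables.append(1)
        else:
            tables[choice] += 1
    return tables

for n in (10, 100, 1000):
    sizes = chinese_restaurant_process(n, alpha=1.0)
    print(f"{n} customers -> {len(sizes)} tables")
```

Reading "customers" as words and "tables" as topics, the number of topics is not fixed up front: new ones appear as more data arrives, at a rate controlled by `alpha`.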

In [None]:
!pip install pprintpp

import gensim
from gensim import corpora
from gensim.models import HdpModel
from pprintpp import pprint

# Sample text corpus
# (a comma after every string keeps adjacent literals from being concatenated)
corpus = [
  "The cat sat on the mat",
  "The dog is chasing the cat",
  "The mat is on the table",
  "The cat is sitting on the mat",
  "The cat is chasing the dog"
]

# Tokenize the text: lowercase and split on whitespace (stop words are not removed here)
texts = [[word for word in doc.lower().split()] for doc in corpus]

# Create a dictionary representation of the documents
dictionary = corpora.Dictionary(texts)

# Create a bag-of-words representation of the documents
corpus_bow = [dictionary.doc2bow(text) for text in texts]

# Train the HDP model
hdp_model = HdpModel(corpus=corpus_bow, id2word=dictionary)

# Print the topics and their associated words
pprint(hdp_model.print_topics(num_topics=4, num_words=5))

# Assign topics to a new document
new_doc = "The cat is chasing the dog"
new_doc_bow = dictionary.doc2bow(new_doc.lower().split())
print(hdp_model[new_doc_bow])

[
    (0, '0.238*sitting + 0.211*the + 0.155*mat + 0.154*on + 0.090*dog'),
    (1, '0.255*sat + 0.214*the + 0.190*is + 0.073*cat + 0.061*dog'),
    (2, '0.306*the + 0.199*sat + 0.166*matthe + 0.131*on + 0.064*dog'),
    (3, '0.410*sitting + 0.179*mat + 0.152*the + 0.127*table + 0.039*on'),
]
[(0, 0.05881033096301249), (1, 0.8810314306848641), (2, 0.021141745942392966), (3, 0.013692257035095827)]


# Text Summarization

The primary objective of text summarization is to retain the most important information and key points while significantly reducing the amount of text that needs to be read.

Summarization methods fall into two broad categories:

1. Extractive summarization
2. Abstractive summarization

## Extractive Summarization

This involves selecting the most important sentences from the original text and combining them to form a summary. The steps involved are:

1. Preprocessing - Tokenization, stop word removal, normalization
2. Sentence scoring - Term frequency, sentence position, and similarity to the title
3. Sentence selection
4. Summary generation

This method is straightforward but comes with limitations. The biggest is that the resulting summary may lack coherence and fluency. Additionally, it doesn't generate new sentences or paraphrase content.
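
The four steps can be sketched with the standard library alone. This is a minimal term-frequency summarizer under stated assumptions: the tiny `STOP_WORDS` set, the regex-based sentence splitter, and the average-frequency scoring rule are illustrative choices, not a reference implementation.

```python
import re
from collections import Counter

# Small illustrative stop-word list (an assumption; real pipelines use a fuller list)
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in", "on", "it", "that", "this", "with"}

def summarize(text, n_sentences=2):
    # 1. Preprocessing: split into sentences, then lowercase tokens without stop words
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]
    tokenized = [
        [w for w in re.findall(r"[a-z']+", s.lower()) if w not in STOP_WORDS]
        for s in sentences
    ]

    # 2. Sentence scoring: average document-wide frequency of the sentence's words
    freq = Counter(w for words in tokenized for w in words)
    scores = [
        sum(freq[w] for w in words) / len(words) if words else 0.0
        for words in tokenized
    ]

    # 3. Sentence selection: indices of the top-scoring sentences
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:n_sentences]

    # 4. Summary generation: emit the chosen sentences in their original order
    return " ".join(sentences[i] for i in sorted(top))

text = (
    "Topic models discover themes in documents. "
    "Some people enjoy long walks. "
    "Topic models represent documents as mixtures of themes. "
    "Themes are distributions over words."
)
print(summarize(text, n_sentences=2))
```

On this toy text the off-topic sentence about walks scores lowest because none of its words recur elsewhere, so it is dropped. The output is still a verbatim copy of source sentences, which illustrates the coherence limitation noted above.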

In addition to the simple term frequency method, there are more advanced techniques for extractive summarization, including:

1. TextRank - A graph-based ranking algorithm that uses sentence similarity to rank sentences
2. Latent Semantic Analysis (LSA) - An unsupervised learning technique that captures the latent structure of the text and identifies key sentences.
3. Supervised learning - Using labeled data to train a machine learning model to score and select sentences for summarization.
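
As a rough illustration of the graph-based idea behind TextRank, the sketch below builds a sentence-similarity graph and scores sentences with PageRank-style power iteration. The Jaccard word-overlap similarity, the `damping` value of 0.85, and the fixed iteration count are assumptions chosen for simplicity; published TextRank variants differ in these details.

```python
import re

def textrank(sentences, damping=0.85, iterations=50):
    """Score sentences by power iteration over a weighted similarity graph."""
    tokens = [set(re.findall(r"[a-z']+", s.lower())) for s in sentences]
    n = len(sentences)

    # Edge weights: Jaccard overlap between the word sets of two sentences
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and tokens[i] | tokens[j]:
                weights[i][j] = len(tokens[i] & tokens[j]) / len(tokens[i] | tokens[j])

    out_sum = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Each neighbour j passes on its score in proportion to the edge weight
            rank = sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n)
                if out_sum[j] > 0
            )
            new_scores.append((1 - damping) / n + damping * rank)
        scores = new_scores
    return scores

sentences = [
    "The cat sat on the mat.",
    "The dog chased the cat.",
    "The cat and the dog sat on the mat.",
    "Stock prices fell sharply today.",
]
scores = textrank(sentences)
for score, sentence in sorted(zip(scores, sentences), reverse=True):
    print(f"{score:.3f}  {sentence}")
```

Sentences that share vocabulary with many others accumulate score from their neighbours, while an unrelated sentence receives only the baseline `(1 - damping) / n`. A summary is then formed by taking the top-ranked sentences.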

## Abstractive Summarization

This is a more advanced and sophisticated technique. It involves generating new sentences that effectively convey the meaning of the original text. It goes beyond simply selecting key sentences.

This technique involves two main components that work together to transform lengthy input text into a concise summary:

1. Encoder - Convert the input text into a fixed-size context vector.
2. Decoder - Generate the summary

Commonly used architectures include RNNs, LSTMs, and Transformer-based models.

For advanced scenarios, various pre-trained models can be used, such as:

1. BERT (Bidirectional Encoder Representations from Transformers)
2. GPT (Generative Pre-trained Transformer)
3. BART (Bidirectional and Auto-Regressive Transformers)

In [None]:
from transformers import BartForConditionalGeneration, BartTokenizer

# Load the pre-trained BART model and tokenizer
model_name = "facebook/bart-large-cnn"
model = BartForConditionalGeneration.from_pretrained(model_name)
tokenizer = BartTokenizer.from_pretrained(model_name)

# Sample text to summarize
text = """Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language.
It is a field focused on enabling computers to understand, interpret, and generate human language.
Challenges in natural language processing frequently involve speech recognition, natural language understanding, and machine translation."""

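# Tokenize and encode the text (inputs longer than 1024 tokens are truncated)
# Note: the "summarize: " prefix is a T5 convention; BART does not require a task prefix
inputs = tokenizer.encode(f"summarize: {text}", return_tensors="pt", max_length=1024, truncation=True)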

# Generate the summary
summary_ids = model.generate(inputs, max_length=150, min_length=30, length_penalty=2.0, num_beams=4, early_stopping=True)
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

print(f"Summary: {summary}")



Summary: Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence. It is a field focused on enabling computers to understand, interpret, and generate human language.Challenges in natural language processing frequently involve speech recognition and machine translation.


# Machine Translation

