<a href="https://colab.research.google.com/github/chenwh0/Natural-Language-Processing-work/blob/main/module2/TextProcessingEncodingTechniques.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Text Processing with Encoding Techniques**
This lab explores three fundamental encoding techniques for NLP: one-hot encoding, bag-of-words (BoW), and TF-IDF. You'll implement these methods on text data and analyze their differences.
# *Sources used*
* https://github.com/opengeos/geospatial-data-catalogs
* https://www.geeksforgeeks.org/pandas/pandas-access-columns/
* https://www.geeksforgeeks.org/python/difference-between-map-applymap-and-apply-methods-in-pandas/
* https://www.geeksforgeeks.org/python/how-to-compare-two-dataframes-with-pandas-compare/
* https://www.geeksforgeeks.org/pandas/python-pandas-dataframe-sum/
* https://www.geeksforgeeks.org/pandas/python-pandas-series-nlargest/

# *Installs & Imports*

In [None]:
!pip install nltk -q

In [None]:
from typing import List
# Data preprocessing Libraries
import pandas as pd
import re
import nltk
nltk.download("stopwords")
from nltk.corpus import stopwords # To remove stopwords in NLTK
nltk_stopwords = set(stopwords.words("english")) # Load English stopwords once for efficiency

# Libraries
from sklearn.feature_extraction.text import CountVectorizer # To create one-hot encoding & bag of words
from sklearn.feature_extraction.text import TfidfVectorizer # To create TF-IDF

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


# **1. Data Preparation**
I chose this dataset because I wanted to learn more about how to do natural language processing on geospatial-related data. The dataset's source also contained concise instructions on how to retrieve the dataset.

In [None]:
# Select text dataset
url = 'https://github.com/opengeos/geospatial-data-catalogs/raw/master/nasa_cmr_catalog.tsv'
dataframe = pd.read_csv(url, sep='\t')
title_description_dataframe = dataframe[["title", "description"]]
title_description_dataframe.head()

# Preprocess text by lowercasing and removing punctuation, extra whitespace, and stopwords.
def preprocess_text(text: str) -> str:
    text = text.lower() # Lowercase all text
    text = re.sub(r"[^\w\s]", "", text) # Remove punctuation
    text = re.sub(r"\s+", " ", text) # Collapse runs of whitespace into a single space
    tokens = [token for token in text.split() if token not in nltk_stopwords] # Keep only the non-stopwords; split() also drops empty tokens
    return " ".join(tokens)

preprocessed_dataframe = title_description_dataframe.copy() # Make a copy of original dataframe
preprocessed_dataframe["description"] = title_description_dataframe["description"].map(preprocess_text) # Preprocess the copy's data

# Compare the first description before and after preprocessing
print("Original:", title_description_dataframe["description"][0])
print("Preprocessed:", preprocessed_dataframe["description"][0])

Original: Indian Remote Sensing satellites (IRS) are a series of Earth Observation satellites, built, launched and maintained by Indian Space Research Organisation. The IRS series provides many remote sensing services to India and international ground stations. With 5 m resolution and products covering areas up to 70 km x 70 km IRS LISS-IV mono data provide a cost effective solution for mapping tasks up to 1:25'000 scale.
Preprocessed: indian remote sensing satellites irs series earth observation satellites built launched maintained indian space research organisation irs series provides many remote sensing services india international ground stations 5 resolution products covering areas 70 km x 70 km irs lissiv mono data provide cost effective solution mapping tasks 125000 scale


# **2. Implement Encoding Techniques**

*TF vs. IDF components*:

>**Formula for Term Frequency (TF)**
$$
\mathrm{TF}(t, d) = \frac{\text{number of occurrences of term } t \text{ in document } d}{\text{total number of terms in document } d}
$$

>**Formula for Inverse Document Frequency (IDF)**
$$
\mathrm{IDF}(t) = \log\left(\frac{1 + N}{1 + \mathrm{df}(t)}\right) + 1
$$
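where *N* is the total number of documents and df(*t*) is the number of documents containing term *t*. This is the smoothed variant that scikit-learn's `TfidfVectorizer` uses by default.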

> **Formula for Term Frequency-Inverse Document Frequency (TF-IDF)**
$$
\mathrm{TF\text{-}IDF}(t, d) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t)
$$
TF-IDF is a weighting scheme that highlights words that are frequent within a document but rare across the corpus, which reduces the impact of common words. It is widely used in information retrieval, search engines, and text classification.
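
As a rough check on the formulas above, the sketch below computes the smoothed IDF by hand and compares it with `TfidfVectorizer`. The `toy_docs` corpus is a made-up assumption, not part of the lab's dataset. Two details are worth keeping in mind: scikit-learn uses the raw term count as TF (not the length-normalized fraction), and by default it L2-normalizes each row afterwards. Passing `norm=None` turns that normalization off so the numbers can be compared directly.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus (not from the NASA CMR dataset), for illustration only
toy_docs = [
    "cloud data also",
    "cloud cloud samples also",
    "data products also",
]

# Raw term counts: rows are documents, columns are vocabulary terms (alphabetical)
counts = CountVectorizer().fit_transform(toy_docs).toarray()
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)  # df(t): number of documents containing each term

# Smoothed IDF, following the formula above (natural logarithm)
manual_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1
manual_tfidf = counts * manual_idf  # raw count TF times IDF

# norm=None disables the default L2 row normalization
vectorizer = TfidfVectorizer(norm=None)
sklearn_tfidf = vectorizer.fit_transform(toy_docs).toarray()

idf_matches = np.allclose(manual_idf, vectorizer.idf_)
tfidf_matches = np.allclose(manual_tfidf, sklearn_tfidf)
print("IDF matches scikit-learn:", idf_matches)
print("TF-IDF matches scikit-learn (norm=None):", tfidf_matches)
```

With the default `norm="l2"`, each row is additionally divided by its Euclidean length, so the values shown in this lab are rescaled per document relative to this sketch.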



In [None]:
# Extract unique words to form vocabulary for each text
descriptions = preprocessed_dataframe["description"].tolist()[:10] # Use the first 10 preprocessed descriptions as the corpus
vectorizer_vocab = CountVectorizer()
vectorizer_vocab.fit(descriptions)
vocab = vectorizer_vocab.get_feature_names_out()

# One-hot encoding
vectorizer = CountVectorizer(vocabulary=vocab, binary=True)
onehot = vectorizer.transform(descriptions)
onehot_dataframe = pd.DataFrame(onehot.toarray(),
                      columns=vocab) # 1 if the word appears in the document, 0 otherwise
print("one-hot encoding dataframe:")
display(onehot_dataframe.head())

# Bag of Words
vectorizer = CountVectorizer()
vectorizer.fit(descriptions) # Fit vectorizer to list of text to build vocabulary
bag_of_words = vectorizer.transform(descriptions) # Transform list of text into a bag-of-words matrix
bow_dataframe = pd.DataFrame(bag_of_words.toarray(),
                      columns=vocab) # Number of times each word appears in each document
print("\n\nBag of words dataframe:")
display(bow_dataframe.head())

# TF-IDF
vectorizer = TfidfVectorizer()
vectorizer.fit(descriptions) # Fit vectorizer to list of text to build vocabulary
tfidf_matrix = vectorizer.transform(descriptions) # Transform the documents into a TF-IDF-weighted term-document matrix
tfidf_dataframe = pd.DataFrame(tfidf_matrix.toarray(),
                      columns=vocab) # TF-IDF weight of each word in each document
print("\n\nTF-IDF matrix dataframe:")
display(tfidf_dataframe.head())

one-hot encoding dataframe:


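|  | 000017966259 | 00344257 | 025 | 101109tgrs20172734070 | 11 | 113 | 125000 | 190pp207216 | 19781101 | 19822016 | ... | windsat | within | work | wã¼rzler | year | yearly | yy | yyyymmdd | zircon | âimplications |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |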




Bag of words dataframe:


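|  | 000017966259 | 00344257 | 025 | 101109tgrs20172734070 | 11 | 113 | 125000 | 190pp207216 | 19781101 | 19822016 | ... | windsat | within | work | wã¼rzler | year | yearly | yy | yyyymmdd | zircon | âimplications |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |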




TF-IDF matrix dataframe:


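|  | 000017966259 | 00344257 | 025 | 101109tgrs20172734070 | 11 | 113 | 125000 | 190pp207216 | 19781101 | 19822016 | ... | windsat | within | work | wã¼rzler | year | yearly | yy | yyyymmdd | zircon | âimplications |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.133695 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.049577 | ... | 0.0 | 0.042145 | 0.0 | 0.049577 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.056143 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.073856 | 0.0 | 0.0 | ... | 0.0 | 0.062785 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.073856 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |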


In [None]:
# Compare one-hot encoding with bag of words encoding
print("Onehot = self. BoW = other. NaN = same values")
display(onehot_dataframe.compare(bow_dataframe))

Onehot = self. BoW = other. NaN = same values


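|  | 2017 (self) | 2017 (other) | 2018 (self) | 2018 (other) | 70 (self) | 70 (other) | absorption (self) | absorption (other) | aerosol (self) | aerosol (other) | ... | transantarctic (self) | transantarctic (other) | upb (self) | upb (other) | used (self) | used (other) | wagner (self) | wagner (other) | zircon (self) | zircon (other) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | NaN | NaN | NaN | 1.0 | 2.0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | NaN | NaN | 1.0 | 2.0 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | 6.0 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | 1.0 | 2.0 | NaN | NaN | NaN | NaN |
| 4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | 3.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 5 | 1.0 | 3.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | 3.0 | NaN | NaN |
| 6 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 7 | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | 3.0 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 8 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 1.0 | 2.0 | 1.0 | 3.0 | NaN | NaN | NaN | NaN | 1.0 | 4.0 |
| 9 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 1.0 | 2.0 | 1.0 | 3.0 | NaN | NaN | NaN | NaN | 1.0 | 4.0 |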


# **3. Analysis and Visualization**

Let's compare the top 5 features (words) from each method.


In [None]:
# Show the n word columns with the largest column sums over the 10 documents
def display_top_n_features(dataframe, n=5):
    column_totals = dataframe.sum(axis=0) # Sum each column (words with more 1s/larger values have larger sums)
    top_n_columns = column_totals.nlargest(n).index # Get the labels of the n columns with the largest sums
    display(dataframe[top_n_columns])
print("TOP 5 WORDS FROM EACH METHOD\n")
print("One hot encoding:")
display_top_n_features(onehot_dataframe)

print("\n\nBag of words:")
display_top_n_features(bow_dataframe)

print("\n\nTF-IDF:")
display_top_n_features(tfidf_dataframe)

TOP 5 WORDS FROM EACH METHOD

One hot encoding:


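|  | data | also | esa | products | project |
|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 1 | 0 |
| 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | 1 | 1 | 1 | 1 | 1 |
| 3 | 1 | 0 | 1 | 0 | 1 |
| 4 | 1 | 0 | 1 | 1 | 1 |
| 5 | 1 | 1 | 1 | 1 | 1 |
| 6 | 1 | 1 | 1 | 1 | 1 |
| 7 | 1 | 1 | 1 | 1 | 1 |
| 8 | 0 | 1 | 0 | 0 | 0 |
| 9 | 0 | 1 | 0 | 0 | 0 |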




Bag of words:


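|  | data | dataset | samples | also | cloud |
|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | 5 | 9 | 0 | 2 | 13 |
| 2 | 5 | 4 | 0 | 2 | 0 |
| 3 | 4 | 0 | 0 | 0 | 0 |
| 4 | 1 | 1 | 0 | 0 | 0 |
| 5 | 3 | 2 | 0 | 2 | 0 |
| 6 | 3 | 4 | 0 | 2 | 0 |
| 7 | 4 | 2 | 0 | 3 | 0 |
| 8 | 0 | 0 | 7 | 1 | 0 |
| 9 | 0 | 0 | 7 | 1 | 0 |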




TF-IDF:


|  | data | samples | dataset | cloud | products |
|---|---|---|---|---|---|
| 0 | 0.059349 | 0.0 | 0.0 | 0.0 | 0.065171 |
| 1 | 0.110039 | 0.0 | 0.239528 | 0.644499 | 0.024167 |
| 2 | 0.146586 | 0.0 | 0.141815 | 0.0 | 0.032193 |
| 3 | 0.131143 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.073153 | 0.0 | 0.088465 | 0.0 | 0.240988 |
| 5 | 0.071781 | 0.0 | 0.057871 | 0.0 | 0.052549 |
| 6 | 0.157849 | 0.0 | 0.254519 | 0.0 | 0.115556 |
| 7 | 0.170128 | 0.0 | 0.102869 | 0.0 | 0.093409 |
| 8 | 0.0 | 0.445671 | 0.0 | 0.0 | 0.0 |
| 9 | 0.0 | 0.445671 | 0.0 | 0.0 | 0.0 |


# **4. Technical Reflection**
## Comparing encoding techniques: one-hot encoding, bag-of-words matrix, and TF-IDF
**One-hot encoding** reveals that the top 5 "important" words were "data", "also", "esa", "products", and "project". These words show up most consistently across the 10 documents (descriptions), because a column's one-hot sum counts how many documents contain the word.

**Bag-of-words (BoW) matrix** reveals that the top 5 "important" words were "data", "dataset", "samples", "also", and "cloud". These are the most frequently recurring words in the 10 documents. Words such as "samples" and "cloud" did not appear consistently across all 10 documents, but they recurred often enough within just a few documents (descriptions) to rank among the top 5 words in BoW.

**TF-IDF matrix** reveals that the top 5 "important" words were "data", "samples", "dataset", "cloud", and "products". These are words that either appeared frequently overall or appeared in only a few documents but many times within them.

## BoW vs TF-IDF
* BoW scores were high for common, stopword-like words (e.g., "also") that occur frequently but carry little meaning.
* TF-IDF mitigates this by assigning lower weights to words that occur across many of the documents (descriptions), so a word like "also" no longer ranks above a more distinctive word like "cloud".

## Limitations of sparse representations
Sparse representations can't capture the meaning of words, only their occurrences and frequencies. As a result, these encodings are more limited in what they can express than other text representations.
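
A small sketch of this limitation, using made-up sentences (`toy_docs` is an assumption): two documents that describe nearly the same thing with different words share no columns, so their bag-of-words cosine similarity is zero, while two documents on different subjects that happen to share words look similar.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus, for illustration only
toy_docs = [
    "satellite images ocean",     # document 0
    "spacecraft pictures sea",    # document 1: near-synonyms of document 0
    "satellite images forest",    # document 2: different subject, shared words
]

bow = CountVectorizer().fit_transform(toy_docs)  # sparse count matrix
similarity = cosine_similarity(bow)

# Fraction of zero entries in the document-term matrix
sparsity = 1 - bow.nnz / (bow.shape[0] * bow.shape[1])

print("Similarity of doc 0 and doc 1 (synonyms):", similarity[0, 1])
print("Similarity of doc 0 and doc 2 (shared words):", similarity[0, 2])
print("Sparsity of the matrix:", round(sparsity, 3))
```

Even in this three-document corpus more than half of the matrix entries are zero, and the fraction grows as the vocabulary grows.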