<img src='https://www.di.uniroma1.it/sites/all/themes/sapienza_bootstrap/logo.png' width="200"/>  

# Part_1_11_Vector Semantics (Sparse)

In Natural Language Processing (`NLP`), vector semantics provides a powerful framework for representing words and documents in a numerical form, enabling efficient computation and semantic analysis. Sparse vector representations, such as **Bag of Words (`BoW`)**, **TF-IDF**, and **Positive Pointwise Mutual Information (`PPMI`)**, have been foundational in the evolution of `NLP`. These approaches rely on statistical co-occurrence patterns and word frequency to capture linguistic meaning, forming the basis for more sophisticated methods like dense embeddings and contextualized models.

Sparse representations are particularly useful for understanding the core principles of vector semantics and for building intuition about the role of word-document relationships in tasks like text classification, clustering, and retrieval.

### **Objectives:**
In this notebook, Parham provides an overview of sparse vector semantics, including the key methods used to represent text data and their significance in `NLP`. Through practical exercises, Parham will demonstrate the implementation of **Bag of Words (`BoW`)** for document representation, **TF-IDF** to highlight significant terms within documents, and **PPMI** to extract meaningful statistical relationships from co-occurrence matrices.

### **References:**
- [https://www.datacamp.com/tutorial/python-bag-of-words-model](https://www.datacamp.com/tutorial/python-bag-of-words-model)  
- [https://spotintelligence.com/2022/12/20/bag-of-words-python](https://spotintelligence.com/2022/12/20/bag-of-words-python/)  
- [https://stackoverflow.com/questions/58701337/how-to-construct-ppmi-matrix-from-a-text-corpus](https://stackoverflow.com/questions/58701337/how-to-construct-ppmi-matrix-from-a-text-corpus)   

### **Tutors**:
- Professor Stefano Faralli
    - <img src="https://upload.wikimedia.org/wikipedia/commons/7/7e/Gmail_icon_%282020%29.svg" alt="Logo" width="20" height="20"> **Email**: Stefano.faralli@uniroma1.it
    - <img src="https://www.iconsdb.com/icons/preview/red/linkedin-6-xxl.png" alt="Logo" width="20" height="20"> **LinkedIn**: [LinkedIn](https://www.linkedin.com/in/stefano-faralli-b1183920/) 
- Professor Iacopo Masi
    - <img src="https://upload.wikimedia.org/wikipedia/commons/7/7e/Gmail_icon_%282020%29.svg" alt="Logo" width="20" height="20"> **Email**: masi@di.uniroma1.it  
    - <img src="https://www.iconsdb.com/icons/preview/red/linkedin-6-xxl.png" alt="Logo" width="20" height="20"> **LinkedIn**: [LinkedIn](https://www.linkedin.com/in/iacopomasi/)  
    - <img src="https://upload.wikimedia.org/wikipedia/commons/a/ae/Github-desktop-logo-symbol.svg" alt="Logo" width="20" height="20"> **GitHub**: [GitHub](https://github.com/iacopomasi)  
    

### **Contributors:**
- Parham Membari  
    - <img src="https://upload.wikimedia.org/wikipedia/commons/7/7e/Gmail_icon_%282020%29.svg" alt="Logo" width="20" height="20"> **Email**: p.membari96@gmail.com  
    - <img src="https://www.iconsdb.com/icons/preview/red/linkedin-6-xxl.png" alt="Logo" width="20" height="20"> **LinkedIn**: [LinkedIn](https://www.linkedin.com/in/p-mem/)  
    - <img src="https://upload.wikimedia.org/wikipedia/commons/a/ae/Github-desktop-logo-symbol.svg" alt="Logo" width="20" height="20"> **GitHub**: [GitHub](https://github.com/parham075)  
    - <img src="https://upload.wikimedia.org/wikipedia/commons/e/ec/Medium_logo_Monogram.svg" alt="Logo" width="20" height="20"> **Medium**: [Medium](https://medium.com/@p.membari96)  

**Table of Contents:**
1. Import Libraries  
2. Introduction to Vector Semantics  
3. Bag of Words (`BoW`) Representation      
4. Term Frequency-Inverse Document Frequency (`TF-IDF`)
5. Pointwise Mutual Information (`PMI`)   
6. Closing Thoughts


## 1. Import Libraries 

In [15]:
import os
import requests
import tarfile
import zipfile
import pandas as pd
import nltk
import numpy as np
import spacy
import torch
from nltk import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tag import pos_tag
from loguru import logger
from tqdm import tqdm
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer
from pprint import pprint
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /home/p/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /home/p/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## 2. Introduction to Vector Semantics
Vector semantics is a methodology in `NLP` that uses numerical representations to encode the meaning of words, phrases, or entire documents. These numerical representations are essential for enabling machines to process and analyze human language. By transforming textual data into vectors, computational models can perform mathematical operations to assess similarity, context, and relationships between words or documents.

### Why Vector Semantics?
- **Quantitative Representation**: Text data, being inherently qualitative, is challenging for computers to process directly. Vector semantics bridges this gap by converting text into a mathematical form.
- **Efficient Computation**: Mathematical representations allow for quick calculations of similarity, clustering, and classification tasks.
- **Foundation for Machine Learning Models**: Many machine learning models rely on vectorized representations of data as input.
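
To make the idea of computing similarity on vectors concrete, here is a minimal sketch of cosine similarity, one common way to compare two document vectors. The three count vectors are hypothetical and assume a shared four-word vocabulary; returning `0.0` for an all-zero vector is a convention chosen for this sketch.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

# Hypothetical count vectors over a shared four-word vocabulary
doc_a = [2, 1, 0, 0]
doc_b = [1, 1, 0, 0]
doc_c = [0, 0, 3, 1]

sim_ab = cosine_similarity(doc_a, doc_b)  # documents sharing words
sim_ac = cosine_similarity(doc_a, doc_c)  # documents with no word in common

print(f"a vs b: {sim_ab:.3f}")
print(f"a vs c: {sim_ac:.3f}")
```

Documents that share words score close to 1, while documents with no words in common score exactly 0.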


## 3. Bag of Words (`BoW`) Representation 

**Bag of Words** (`BoW`) is a simple and widely used representation of text data in Natural Language Processing. It represents text as a collection of words, ignoring grammar, word order, and context, while preserving the frequency of words. `BoW` is broadly used in tasks such as text classification and sentiment analysis. It matters because most machine learning algorithms cannot operate on raw text and expect numerical input. The process of converting text to numbers is known as feature extraction or feature encoding.
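
A quick way to see what ignoring word order means in practice is to count the words of two sentences that differ only in order. The sketch below uses a deliberately naive whitespace tokenizer and a hypothetical helper name, `bag_of_words`; it is an illustration, not the preprocessing used later in this notebook.

```python
from collections import Counter

def bag_of_words(text):
    # Lowercase and split on whitespace: a deliberately naive tokenizer
    return Counter(text.lower().split())

bow_a = bag_of_words("the dog chased the cat")
bow_b = bag_of_words("the cat chased the dog")

print(bow_a["the"])    # frequency is preserved
print(bow_a == bow_b)  # word order is not
```

The two sentences mean different things, yet their bags of words are identical, which is exactly the information this representation gives up.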

### 3.1 Understanding Bag of Words with an example:
Imagine two sentences:
1. Document 1: "Natural Language Processing is amazing."
2. Document 2: "Language models are important for NLP."


The `BoW` model begins by creating a vocabulary: the list of unique words across the corpus. Each document is then represented as a vector of word frequencies. The table below shows the Bag of Words vectors for the two documents:

| **Vocabulary** | **Document 1** | **Document 2** |
|-----------------|----------------|----------------|
| Natural         | 1              | 0              |
| Language        | 1              | 1              |
| Processing      | 1              | 0              |
| is              | 1              | 0              |
| amazing         | 1              | 0              |
| models          | 0              | 1              |
| are             | 0              | 1              |
| important       | 0              | 1              |
| for             | 0              | 1              |
| NLP             | 0              | 1              |

Each position in the vector corresponds to a word in the vocabulary, and the value represents its frequency in the document.
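
The table can be reproduced with a few lines of plain Python. This is a minimal sketch under two assumptions that match the table rather than a typical pipeline: case is preserved (so `Language` and `NLP` stay as written) and the vocabulary is ordered by first appearance. The `tokenize` helper is hypothetical.

```python
import re
from collections import Counter

documents = [
    "Natural Language Processing is amazing.",
    "Language models are important for NLP.",
]

def tokenize(text):
    # Keep alphabetic runs only; case is preserved to match the table
    return re.findall(r"[A-Za-z]+", text)

tokenized = [tokenize(doc) for doc in documents]

# Vocabulary in order of first appearance across the corpus
vocabulary = []
for tokens in tokenized:
    for token in tokens:
        if token not in vocabulary:
            vocabulary.append(token)

# One count vector per document, aligned with the vocabulary
vectors = []
for tokens in tokenized:
    counts = Counter(tokens)
    vectors.append([counts[word] for word in vocabulary])

for word, doc1, doc2 in zip(vocabulary, vectors[0], vectors[1]):
    print(f"{word:<12}{doc1:<4}{doc2}")
```

Note that scikit-learn's `CountVectorizer` lowercases text by default and sorts its vocabulary alphabetically, so its output would be ordered differently from this table.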

### How to implement `BoW`
The steps involved in creating a `BoW` representation are:
- Tokenization: Split the text into individual words or tokens.
- Preprocessing:
    - Convert text to lowercase.
    - Remove special characters, punctuation, and numbers.
    - Remove stopwords (e.g., "the", "is", "and").
    - Apply stemming or lemmatization to normalize words.
- Vocabulary Creation: Build a unique list of words from the corpus.
- Vectorization: Represent each document as a vector of word frequencies based on the vocabulary.

In [21]:

paragraph = """
In a world where language became the ultimate bridge, 
NLP emerged as humanity's greatest ally, 
decoding ancient texts, translating cultures, and even giving voice to the voiceless. 

As algorithms learned to empathize, crafting poetry, 
solving conflicts, and teaching forgotten tongues, 
they blurred the line between human creativity and machine precision.

The future no longer feared miscommunication; instead, it thrived on understanding, 
with NLP not just interpreting words but shaping a world where every voice, 
in any language, could be heard, understood, and celebrated.
"""
# Step 1: Split the paragraph into sentences; each sentence is treated as one document
sentences = nltk.sent_tokenize(paragraph)

# Step 2: Preprocess each sentence
ps = PorterStemmer()
# Build the stopword set once instead of rebuilding it for every word
stop_words = set(stopwords.words('english'))
corpus = []

for sentence in sentences:
    # Replace every non-letter character (digits, punctuation, symbols) with a space
    text = re.sub('[^a-zA-Z]', ' ', sentence)
    # Convert to lowercase and split into words
    tokens = text.lower().split()
    # Remove stopwords and apply stemming
    tokens = [ps.stem(word) for word in tokens if word not in stop_words]
    # Join the stems back into one string and add it to the corpus
    corpus.append(' '.join(tokens))

# Step 3: Create Bag of Words model
cv = CountVectorizer(max_features=1500)  # Limit to top 1500 features (if necessary)
X = cv.fit_transform(corpus).toarray()

# Get feature names (vocabulary)
vocabulary = cv.get_feature_names_out()

# Create DataFrame with each sentence as a column
bow_df = pd.DataFrame(X.T, index=vocabulary, columns=[f"Document {i+1}" for i in range(X.shape[0])])
bow_df

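|           | Document 1 | Document 2 | Document 3 |
|-----------|------------|------------|------------|
| algorithm | 0          | 1          | 0          |
| alli      | 1          | 0          | 0          |
| ancient   | 1          | 0          | 0          |
| becam     | 1          | 0          | 0          |
| blur      | 0          | 1          | 0          |
| bridg     | 1          | 0          | 0          |
| celebr    | 0          | 0          | 1          |
| conflict  | 0          | 1          | 0          |
| could     | 0          | 0          | 1          |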


## 4. Term Frequency-Inverse Document Frequency (`TF-IDF`)  

## 5. Pointwise Mutual Information (`PMI`) 

## 6. Closing Thoughts 