# About Me  

My name is **Federico Barusolo**, and I’m a **Computer Engineer** with a Master of Science (MSc) specializing in **Artificial Intelligence, Data Science, and Machine Learning**.  

I studied at **Politecnico di Milano** (2013-2018), then worked in consultancy (2018-2021) and at a startup (2021 to date). 

In [None]:
import pandas as pd
pd.set_option('display.max_columns', None)
import warnings as wn

wn.filterwarnings("ignore")

# Text Mining

# 📖 Introduction to Text Mining  

## What is Text Mining?  

**Text Mining** is a set of techniques and methodologies used to extract meaningful information from unstructured textual data. By leveraging **natural language processing (NLP), machine learning, and statistical analysis**, text mining enables the identification of patterns, relationships, and valuable insights within text data.  

## Applications  
Text mining is widely used across various fields, including:  
- **Automatic document classification** for organizing and managing large volumes of information
- **Information Retrieval** to find documents matching some specific queries  
- **Information extraction** to identify names, locations, events, and key concepts in text
- **Sentiment analysis** to understand opinions and trends in social media and reviews
- **Fraud detection and anomaly recognition** to uncover hidden patterns in financial or legal documents
- **Natural Language Understanding and Generation** (arguably the hottest trend right now)

Text mining plays a crucial role in transforming raw text into actionable knowledge, driving data-driven decision-making.  


#### Structured Data vs. Unstructured Data

In [None]:
# Creating a structured dataset (tabular format, as you would find in a CSV file)
data = {
    "Name": ["Alice", "Bob", "Charlie", "Carol"],
    "Age": [25, 30, 35, 40],
    "Salary": [50000, 60000, 70000, 80000]
}

df = pd.DataFrame(data)

# Display the structured data
print("Structured Data (Tabular Format):")
display(df)

In [None]:
# Compute the average Salary
average_salary = df["Salary"].mean()
print(f"\nAverage Salary: {average_salary}")

In [None]:
# Extract the age of Charlie
charlies_age = df.loc[df["Name"] == "Charlie", "Age"].iloc[0]

print(f"\nCharlie's age: {charlies_age}")

In [None]:
import re
from collections import Counter

# Simple example of "unstructured" text
text_data = """
Alice is 25 years old and earns $50,000. Bob is 30 years old and earns $60,000.
Charlie is 35 years old and earns $70,000. Carol is 40 years old and earns $80,000.
"""

print(text_data)

In [None]:
# Obtain the previous Name-Age-Salary table from the given text

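# One possible approach: a regular expression that mirrors the sentence template
pattern = r"(\w+) is (\d+) years old and earns \$([\d,]+)"
rows = [(name, int(age), int(salary.replace(",", ""))) for name, age, salary in re.findall(pattern, text_data)]
df = pd.DataFrame(rows, columns=["Name", "Age", "Salary"])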
display(df)

In [None]:
# Real-world unstructured text
text_data = """
Alice is 25 years old and earns $50,000. She has less experience than Bob, who's 30, 
and has a salary of $60000. Then Charlie (35) earns 70k dollars annually, and Carol, who's
forty years old, is the most senior and therefore earns $10,000 more than Charlie.
"""

print(text_data)
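
To see why free-form text is harder to mine, here is a minimal sketch. It assumes a rigid regular expression (the `pattern` below is illustrative, not part of the original exercise) that fits sentences of the form "X is N years old and earns $S", and applies it to the real-world paragraph above.

```python
import re

# Rigid pattern that only fits the templated sentence "X is N years old and earns $S"
pattern = r"(\w+) is (\d+) years old and earns \$([\d,]+)"

messy_text = (
    "Alice is 25 years old and earns $50,000. She has less experience than Bob, who's 30, "
    "and has a salary of $60000. Then Charlie (35) earns 70k dollars annually, and Carol, who's "
    "forty years old, is the most senior and therefore earns $10,000 more than Charlie."
)

matches = re.findall(pattern, messy_text)
print(matches)  # [('Alice', '25', '50,000')]
```

Only one of the four people is recovered: the other three are described with different wording, number formats ("70k", "forty") or facts that must be inferred (Carol's salary), which is exactly what makes unstructured text hard.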

## Did You Know?  
It is often estimated that roughly **80% of the world's data is unstructured**. Unlike structured data, which is neatly organized in tables and databases, unstructured data lacks a predefined format, making it more complex to process and analyze.

## Examples of textual unstructured data:

-

## Document Representation

#### Bag-of-words (BoW)

Bag of Words (BoW) is a text representation model used in NLP. It represents a document as a bag (multiset) of its words, ignoring grammar and word order. Each document is converted into a vector that counts how often each word of a predefined vocabulary occurs in it. While simple and effective for many tasks, BoW does not capture word meaning or context.
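
As a minimal sketch of the idea, assuming whitespace tokenization and two made-up toy documents (`docs`, `vocab` and `to_bow_vector` are illustrative names), each document becomes a vector of counts over a shared vocabulary:

```python
from collections import Counter

docs = ["good is so good", "bad is so bad"]

# Shared vocabulary: every distinct word across the collection, in sorted order
vocab = sorted({word for doc in docs for word in doc.split()})

def to_bow_vector(doc, vocab):
    """Count how often each vocabulary word occurs in the document."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

vectors = [to_bow_vector(doc, vocab) for doc in docs]
print(vocab)    # ['bad', 'good', 'is', 'so']
print(vectors)  # [[0, 2, 1, 1], [2, 0, 1, 1]]
```

Both vectors have the same length and the same column order, which is what allows documents to be compared numerically later on.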

In [None]:
text_data = """
Bad is so bad, that we cannot but think good an accident; 
good is so good, that we feel certain that evil could be explained.
"""

print(text_data)

In [None]:
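# perform some basic cleaning: lowercase, drop newlines and punctuation
clean_text = re.sub(r"[^\w\s]", "", text_data.replace("\n", "").lower())

# create a vocabulary: the sorted set of distinct words
vocabulary = sorted(set(clean_text.split()))
print(f"vocabulary: {vocabulary}")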

In [None]:
# transform the text into a BoW: a list of (word, count) pairs
bow = list(Counter(clean_text.split()).items())

# sort by occurrences and then alphabetically
print(sorted(bow, key=lambda x: (-x[1], x[0])))

In [None]:
# ... or just use the proper tools
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X = vectorizer.fit_transform([clean_text])

bow = list(zip(vectorizer.get_feature_names_out(), X.toarray()[0]))
result = sorted(bow, key=lambda x: (-x[1], x[0]))
print(result)

#### Term Frequency

In [None]:
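# Compute the frequency of each term in text (count divided by total number of terms)
words = clean_text.split()
tf = {term: count / len(words) for term, count in Counter(words).items()}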

print(tf)

#### Term Frequency - Inverse Document Frequency

TF-IDF is a statistical measure used to evaluate how important a word is to a document in a collection (or corpus). It balances the frequency of a term within a document and its rarity across the entire corpus, helping to highlight significant terms.

TF-IDF(t, d) = TF(t, d) * IDF(t) = TF(t, d) * log(N / DF(t))

where:
- TF(t, d): frequency of term t in document d
- IDF(t): inverse document frequency of term t across the corpus
- N: number of documents in the corpus
- DF(t): number of documents containing term t
- the logarithm dampens the ratio N / DF(t), so that very rare terms do not dominate the score
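
The formula can be sketched directly in plain Python. This is an illustration under stated assumptions: TF is taken as the relative frequency of the term in the document, the logarithm is natural, and the three toy documents are made up. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ from the ones computed here.

```python
import math
from collections import Counter

docs = [
    "pasta eggs guanciale pecorino",
    "pasta eggs cream guanciale",
    "pasta tomato guanciale pecorino",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# DF(t): number of documents containing term t
df_counts = Counter(term for tokens in tokenized for term in set(tokens))

def tf_idf(term, tokens):
    """TF(t, d) * log(N / DF(t)), with TF as the relative frequency of t in d."""
    tf = tokens.count(term) / len(tokens)
    idf = math.log(n_docs / df_counts[term])
    return tf * idf

print(tf_idf("pasta", tokenized[1]))  # 0.0: "pasta" occurs in every document
print(tf_idf("cream", tokenized[1]))  # positive: "cream" occurs in one document only
```

A term that appears in every document gets IDF = log(1) = 0 and therefore carries no weight, while a term that is specific to one document is boosted.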

In [None]:
documents = [
"""
Recipe: Carbonara; Ingredients: pasta, guanciale, eggs, pecorino romano cheese, black pepper, salt.
""",

"""
The ingredients you need to use to make this recipe (Carbonara) are: pasta, eggs, roman pecorino, guanciale, salt and black pepper.
""",

"""
Recipe - the Carbonara: eggs, cream, guanciale, pasta, salt, pecorino cheese, black pepper.
""",

"""
For making carbonara you need these ingredients: egg, guanciale, pasta, salt, pecorino roman cheese and black pepper.
""",

"""
To make roman carbonara you need to use the eggs, guanciale, pecorino, pasta, salt and pepper.
"""]

for ix, doc in enumerate(documents):
    print(f"Document {ix + 1}: {doc}")

In [None]:
# perform same cleaning as before
clean_documents = [re.sub(r"[^\w\s]", "", x.replace("\n", "").lower()) for x in documents]
print(clean_documents)

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Create a TfidfVectorizer object
vectorizer = TfidfVectorizer()

# Fit and transform the documents
tfidf_matrix = vectorizer.fit_transform(clean_documents)

# Get the feature names (words)
feature_names = vectorizer.get_feature_names_out()

# Convert the result to a dense array (optional, but easier to read)
dense_matrix = tfidf_matrix.todense()

# Print the TF-IDF matrix with corresponding words
df = pd.DataFrame(dense_matrix, index=[f"doc{x+1}" for x in range(len(clean_documents))], columns=feature_names)

print("\nTF-IDF DataFrame:")
df.round(3)

## Further pre-processing

### Stemming

Stemming is a text normalization technique in **Natural Language Processing (NLP)** that reduces words to their root or base form. It helps in **reducing inflected or derived words** to a common base, which improves text analysis tasks such as search and classification.

For example:
- **"running" → "run"**
- **"flies" → "fli"** (stemming may not always produce a valid word)
- **"better" → "better"** (stemming does not map irregular forms to their lemma, here "good"; that is the job of lemmatization)

One of the most commonly used stemming algorithms is **Porter’s Stemmer**, which applies a set of rules to remove common suffixes from words.
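
To make the idea of rule-based suffix removal concrete, here is a deliberately naive sketch. It is not Porter's algorithm: `toy_stem` and its `SUFFIXES` list are hypothetical and only strip a handful of endings.

```python
# Suffixes are tried in order, longest first (hypothetical rule list)
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for word in ["cooking", "cooked", "cooks", "eggs", "pasta", "running"]:
    print(word, "->", toy_stem(word))
```

The last word comes out as "runn" rather than "run": real stemmers such as Porter's add further rules (for example, undoubling consonants) to handle cases like this.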


In [None]:
import nltk
from nltk.stem import PorterStemmer

# Download necessary NLTK data (if not already done)
nltk.download('punkt')

# Initialize the stemmer
stemmer = PorterStemmer()

# Example word for stemming
word = "cooking"

print(f"stemmed version of {word}: {stemmer.stem(word)}")

In [None]:
# use the Porter Stemmer to stem the documents in our collection

stemmed_documents = [" ".join(stemmer.stem(word) for word in doc.split()) for doc in clean_documents]

print(stemmed_documents)

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Create a TfidfVectorizer object
vectorizer = TfidfVectorizer()

# Fit and transform the documents
tfidf_matrix = vectorizer.fit_transform(stemmed_documents)

# Get the feature names (words)
feature_names = vectorizer.get_feature_names_out()

# Convert the result to a dense array (optional, but easier to read)
dense_matrix = tfidf_matrix.todense()

# Print the TF-IDF matrix with corresponding words
df = pd.DataFrame(dense_matrix, index=[f"doc{x+1}" for x in range(len(stemmed_documents))], columns=feature_names)

print("\nTF-IDF DataFrame:")
df.round(3)

## Stopword Removal

Stopwords are **common words** (e.g., *"the"*, *"is"*, *"and"*) that **do not carry significant meaning** and are often removed in text preprocessing for **Natural Language Processing (NLP)** tasks.

### Why Remove Stopwords?
- They appear frequently but contribute **little to meaning**.
- Removing them helps **reduce text size** and **improve model performance** in tasks like search, classification, and sentiment analysis.

### Example of Stopwords:
- **English:** *"the", "is", "in", "and", "a", "to", "of"*
- **Spanish:** *"el", "la", "y", "de", "en"*
- Different languages have different sets of stopwords.
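
As a minimal sketch, assuming a tiny hand-picked stopword set (the English examples listed above, not NLTK's full list) and plain whitespace tokenization, stopword removal is just a filter over the tokens. `STOPWORDS` and `drop_stopwords` are illustrative names.

```python
# Tiny hand-picked stopword set taken from the English examples above
STOPWORDS = {"the", "is", "in", "and", "a", "to", "of"}

def drop_stopwords(text):
    """Lowercase, split on whitespace, and keep only non-stopword tokens."""
    return [token for token in text.lower().split() if token not in STOPWORDS]

sentence = "To make roman carbonara you need to use the eggs"
print(drop_stopwords(sentence))
# ['make', 'roman', 'carbonara', 'you', 'need', 'use', 'eggs']
```

Words such as "you" survive here only because the toy set is so small; a full stopword list would typically remove them as well.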

In [None]:
from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words("english"))

In [None]:
# Import necessary libraries
import nltk
from nltk.tokenize import word_tokenize

# Function to remove stopwords
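def remove_stopwords(doc):
    # Tokenize the document and keep only the tokens that are not stopwords
    tokens = word_tokenize(doc)
    return " ".join(token for token in tokens if token not in stop_words)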

# Apply stopword removal to each document
cleaned_documents = [remove_stopwords(doc) for doc in clean_documents]

In [None]:
# Stem documents
split_documents = [x.split(" ") for x in cleaned_documents]
stemmed_documents = [" ".join([stemmer.stem(word) for word in doc]) for doc in split_documents]

# Show the cleaned documents
for i, doc in enumerate(stemmed_documents):
    print(f"Document {i+1}: {doc}")

In [None]:
# Create a TfidfVectorizer object
vectorizer = TfidfVectorizer()

# Fit and transform the documents
tfidf_matrix = vectorizer.fit_transform(stemmed_documents)

# Get the feature names (words)
feature_names = vectorizer.get_feature_names_out()

# Convert the result to a dense array (optional, but easier to read)
dense_matrix = tfidf_matrix.todense()

# Print the TF-IDF matrix with corresponding words
df = pd.DataFrame(dense_matrix, index=[f"doc{x+1}" for x in range(len(stemmed_documents))], columns=feature_names)

print("\nTF-IDF DataFrame:")
df.round(3)

#### Cosine Similarity

**Cosine Similarity** measures the similarity between two vectors (words or text documents) by calculating the cosine of the angle between their vector representations. It is commonly used in NLP for comparing document or word similarity.
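
For two vectors u and v, the cosine of the angle between them is their dot product divided by the product of their lengths. A minimal plain-Python sketch follows; the `cosine` helper and the choice to return 0.0 for an all-zero vector are assumptions made for illustration.

```python
import math

def cosine(u, v):
    """(u . v) / (||u|| * ||v||); returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

print(cosine([1, 2, 0], [2, 4, 0]))  # same direction: approximately 1
print(cosine([1, 0], [0, 1]))        # orthogonal: 0.0
```

Because only the direction matters, a vector and any positive multiple of it are maximally similar, which makes the measure insensitive to document length.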

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

word1 = "carbonara"
word2 = "pasta"

word_index1 = vectorizer.vocabulary_.get(word1)
word_index2 = vectorizer.vocabulary_.get(word2)

if word_index1 is not None and word_index2 is not None:
    # Get the TF-IDF vector for the two words
    word_tfidf_vector1 = tfidf_matrix[:, word_index1].toarray().reshape(1, -1)  # Reshaping to make it a 2D array (1xN)
    word_tfidf_vector2 = tfidf_matrix[:, word_index2].toarray().reshape(1, -1)  # Reshaping to make it a 2D array (1xN)

    # Calculate cosine similarity
    cosine_sim_words = cosine_similarity(word_tfidf_vector1, word_tfidf_vector2)

    print(f"Cosine similarity between the word '{word1}' and the word '{word2}': {cosine_sim_words[0][0]:.3f}")
else:
    print("At least one of the two words is not in the TF-IDF vocabulary.")

## Questions

1. What are the two most similar documents in the list? Why do you think that is?
2. Add a new document to the list such that:
    - the score of the stem 'egg' in the new document is higher than in all other documents;
    - the cosine similarity between the words 'carbonara' and 'pasta' drops below 1.

## Information Retrieval Task Example

In [None]:
# Add more recipes to our collection
documents += [

    """
    To prepare amatriciana you need pasta, guanciale, peeled tomatoes, roman pecorino cheese.
    """,

    """
    Ingredients for amatriciana: guanciale, pasta, tomato sauce, pecorino romano DOP and black pepper.
    """,

    """
    Pasta all'amatriciana - ingredients: guanciale, white wine, peeled tomatoes, pecorino, pasta.
    """
    
]

In [None]:
clean_documents = [re.sub(r"[^\w\s]", "", x.replace("\n", "").lower()) for x in documents]
clean_documents = [remove_stopwords(doc) for doc in clean_documents]

split_documents = [x.split(" ") for x in clean_documents]
stemmed_documents = [" ".join([stemmer.stem(word) for word in doc]) for doc in split_documents]

# Create a TfidfVectorizer object
vectorizer = TfidfVectorizer()

# Fit and transform the documents
tfidf_matrix = vectorizer.fit_transform(stemmed_documents)

# Get the feature names (words)
feature_names = vectorizer.get_feature_names_out()

# Convert the result to a dense array (optional, but easier to read)
dense_matrix = tfidf_matrix.todense()

# Print the TF-IDF matrix with corresponding words
df = pd.DataFrame(dense_matrix, index=[f"doc{x+1}" for x in range(len(stemmed_documents))], columns=feature_names)

print("\nTF-IDF DataFrame:")
df.round(3)

In [None]:
query = "how do I prepare amatriciana pasta?"

print(query)

In [None]:
# Preprocess query and apply tf-idf
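clean_query = remove_stopwords(re.sub(r"[^\w\s]", "", query.replace("\n", "").lower()))
split_query = clean_query.split(" ")
stemmed_query = " ".join(stemmer.stem(word) for word in split_query)

# Reuse the vectorizer fitted on the collection, so the query shares its vocabulary and IDF weights
vector_query = vectorizer.transform([stemmed_query])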

In [None]:
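# Rank our documents based on similarity with respect to our query
similarities = cosine_similarity(vector_query, tfidf_matrix)[0]
rank = pd.Series(similarities, index=df.index).sort_values(ascending=False)

In [None]:
# Select top4 relevant documents
rank = rank.head(4)

rank

In [None]:
# Evaluate precision and recall of our search task. What can we say about these metrics?
from sklearn.metrics import precision_score, recall_score

# Assumption: the relevant documents are the three amatriciana recipes (doc6, doc7, doc8)
y_true = [int(doc_id in {"doc6", "doc7", "doc8"}) for doc_id in df.index]
y_pred = [int(doc_id in rank.index) for doc_id in df.index]

print(f"The precision of our retrieval task is {100 * precision_score(y_true, y_pred):.0f}%")
print(f"The recall of our retrieval task is {100 * recall_score(y_true, y_pred):.0f}%")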

# Tokenization in Text Mining

## What is Tokenization?
Tokenization is the process of breaking down text into smaller units called **tokens**. These tokens can be **words**, **sentences**, or **subwords**, depending on the approach used.

## Why is Tokenization Important?
- It is typically the first step of a **Natural Language Processing (NLP)** pipeline.
- It underpins **text analysis, search engines, chatbots, and machine learning models**.
- It turns raw text into units that are easier to process and analyze.

## Types of Tokenization:
1. **Word Tokenization**: Splitting text into individual words.
2. **Sentence Tokenization**: Splitting text into sentences.
3. **Subword Tokenization**: Splitting words into smaller meaningful parts (used in deep learning models like BERT).
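
Word and sentence tokenization can be approximated with regular expressions. The sketch below is a simplification under stated assumptions (sentences end with ".", "!" or "?" followed by whitespace); it is not how NLTK's tokenizers work, and it would break on abbreviations such as "Dr.".

```python
import re

text = "War is peace. Freedom is slavery. Ignorance is strength."

# Sentence tokens: split on whitespace that follows ., ! or ?
sentences = re.split(r"(?<=[.!?])\s+", text.strip())

# Word tokens: runs of word characters, or single punctuation marks
words = re.findall(r"\w+|[^\w\s]", sentences[0])

print(sentences)  # ['War is peace.', 'Freedom is slavery.', 'Ignorance is strength.']
print(words)      # ['War', 'is', 'peace', '.']
```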


In [None]:
import nltk
from nltk.tokenize import word_tokenize

text = "One cannot be too careful with words, they change their minds just as people do."
word_tokens = word_tokenize(text)
print(f"text: {text}")
print("Word Tokens:", word_tokens)
print("\n")

# Sentence tokenization with NLTK
from nltk.tokenize import sent_tokenize

text = "War is peace. Freedom is slavery. Ignorance is strength."
sent_tokens = sent_tokenize(text)
print(f"text: {text}")
print("Sentence Tokens:", sent_tokens)

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text = "Tokenization is crucial for modern text-mining models."
tokens = tokenizer.tokenize(text)
print("Subword Tokens:", tokens)

# Attention Mechanism

The **Attention Mechanism** is a technique that allows models to focus on specific parts of an input sequence when making predictions, rather than processing the entire sequence uniformly. This mechanism is especially useful in tasks like machine translation, summarization, and other sequence-to-sequence tasks.

In traditional sequence models like Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks, each input is processed sequentially. The attention mechanism enhances this by letting the model weigh the most relevant words or tokens of the input sequence at each prediction step.
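
One common formulation is scaled dot-product attention, as described in "Attention Is All You Need": each query is compared with all keys, the scores are turned into weights with a softmax, and the output is the weighted average of the values. The sketch below is a plain-Python illustration under simplifying assumptions: the toy vectors are made up, and the same vectors are used as queries, keys and values, whereas real models derive them through learned projections and use several heads.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Each output is a weighted average of the value vectors
        output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights

# Toy 2-dimensional vectors for three tokens (made-up numbers)
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(x, x, x)
print(weights[0])  # attention of token 0 over all three tokens; the weights sum to 1
```

Token 0 puts more weight on tokens 0 and 2, whose vectors point in a similar direction, than on token 1: this is the "focus" the text refers to.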


In [None]:
from transformers import BertModel, BertTokenizer
from bertviz import head_view
import torch

In [None]:
text = """
The kid was playing with the ball. Then he threw it at his grandma.
"""

print(text)

In [None]:
# Let's load a vanilla BERT-base model. 
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

In [None]:
tokens = tokenizer.encode(text)
inputs = torch.tensor(tokens).unsqueeze(0)  # unsqueeze adds a batch dimension: (seq_len,) -> (1, seq_len)
inputs

In [None]:
attention = model(inputs, output_attentions=True)[2]

In [None]:
tokens_as_list = tokenizer.convert_ids_to_tokens(inputs[0])
head_view(attention, tokens_as_list)

# Transformer Architectures: Encoder-Decoder, BERT, GPT

Transformers are a type of neural network architecture that revolutionized natural language processing (NLP). They rely on the **attention mechanism** to process input data in parallel (unlike older sequential models like RNNs), making them faster and more effective for tasks like translation, summarization, and text generation.

---

## Key Components of Transformers

1. **Encoder**:  
   - The encoder processes the input data (e.g., a sentence) and converts it into a set of hidden representations (vectors) that capture the meaning of the input.  
   - It uses **self-attention** to understand relationships between all words in the input, regardless of their distance from each other.

2. **Decoder**:  
   - The decoder generates output data (e.g., a translated sentence) based on the encoder's hidden representations.  
   - It also uses self-attention (masked, so that a position cannot look at later positions) and adds a cross-attention step that focuses on the encoder's output, keeping the generated output aligned with the input.
  

![Annotated Transformer architecture](https://aiml.com/wp-content/uploads/2023/09/Annotated-Transformers-Architecture.png)

---

## Encoder-Decoder Architecture

The original Transformer model (introduced in the paper *"Attention Is All You Need"*) uses both an encoder and a decoder. This architecture is ideal for **sequence-to-sequence tasks**, such as:
- **Machine Translation**: Translating a sentence from one language to another.
- **Text Summarization**: Generating a shorter version of a long document.

---

## BERT vs. GPT: Two Popular Transformer Variants

While both BERT and GPT are based on the Transformer architecture, they are designed for different purposes:

### BERT (Bidirectional Encoder Representations from Transformers)
- **Architecture**: Uses only the **encoder** part of the Transformer.
- **Key Feature**: Bidirectional attention, meaning it looks at both the left and right context of a word simultaneously. This makes BERT great for understanding the meaning of words in context.
- **Use Cases**:  
  - Sentence classification (e.g., spam detection).  
  - Question answering (e.g., finding answers in a paragraph).  
  - Named entity recognition (e.g., identifying names, dates, or locations in text).

### GPT (Generative Pre-trained Transformer)
- **Architecture**: Uses only the **decoder** part of the Transformer.
- **Key Feature**: Unidirectional (causal) attention, meaning each token attends only to the tokens to its left. This makes GPT well suited to generating coherent and contextually relevant text one token at a time.
- **Use Cases**:  
  - Text generation (e.g., writing essays, stories, or code).  
  - Chatbots and conversational AI.  
  - Autocompletion (e.g., suggesting the next word in a sentence).
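
The difference between the two attention patterns can be pictured as a mask that says which tokens each position may look at. This is a minimal sketch (`attention_mask` is a hypothetical helper, not part of any library): in a BERT-style encoder every token sees every token, while in a GPT-style decoder token i sees only tokens 0 to i.

```python
def attention_mask(n_tokens, causal):
    """mask[i][j] is 1 if token i may attend to token j, else 0."""
    return [
        [1 if (not causal or j <= i) else 0 for j in range(n_tokens)]
        for i in range(n_tokens)
    ]

for row in attention_mask(4, causal=False):  # BERT-style: full (bidirectional) mask
    print(row)
print()
for row in attention_mask(4, causal=True):   # GPT-style: lower-triangular (causal) mask
    print(row)
```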

---

## Summary

| Feature               | BERT (Encoder)                     | GPT (Decoder)                     |
|-----------------------|------------------------------------|------------------------------------|
| **Attention**         | Bidirectional (looks at both sides)| Unidirectional (left-to-right)    |
| **Use Case**          | Understanding text                 | Generating text                   |
| **Example Tasks**     | Question answering, classification | Text generation, chatbots         |

Transformers have become the backbone of modern NLP, and understanding their architecture and variants (like BERT and GPT) is key to leveraging their power for real-world applications.

# References

- https://www.ibm.com/think/insights/managing-unstructured-data (managing unstructured data)
- https://machinelearningmastery.com/gentle-introduction-bag-words-model/ (Bag of Words deep dive)
- https://en.wikipedia.org/wiki/Tf%E2%80%93idf (tf-idf wiki)
- https://www.kopp-online-marketing.com/what-is-bm25 (more powerful representation with bm25)
- https://spotintelligence.com/2023/09/07/vector-space-model/ (Vector space model for documents)

- https://arxiv.org/abs/1706.03762 (Attention is all you need - first paper on transformers)
- https://h2o.ai/wiki/bert/ (BERT deep dive)
- https://www.ibm.com/think/topics/gpt (GPT deep dive)

### Other NLP Tasks
- https://www.techtarget.com/searchbusinessanalytics/definition/opinion-mining-sentiment-mining (Sentiment Analysis)
- https://huggingface.co/tasks/text-classification (Text Classification)