<a href="https://colab.research.google.com/github/alessandrocapialbi/Book_Detection/blob/main/A1/Assignment_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Group members

|  Name   |  Surname   |     Email                            |    Student ID                                             |
| :-----: | :--------: | :----------------------------------: | :-----------------------------------------------------: |
| Ludovico  | Gorrieri   | `ludovico.gorrieri@studio.unibo.it`   |  To Be Determined |
| Alessandro  | Capialbi | `alessandro.capialbi@studio.unibo.it`  | 0001191564 |
| Faezeh  | Sarlakifar | `faezeh.sarlakifar@studio.unibo.it`  | 0001164608 |

## Task 1 & 2

### Download the dataset

In [None]:
# !wget https://github.com/nlp-unibo/nlp-course-material/tree/main/2025-2026/Assignment%201/data

In [None]:
!git clone https://github.com/nlp-unibo/nlp-course-material.git
%cd "nlp-course-material/2025-2026/Assignment 1"

Cloning into 'nlp-course-material'...
remote: Enumerating objects: 391, done.
remote: Counting objects: 100% (391/391), done.
remote: Compressing objects: 100% (288/288), done.
remote: Total 391 (delta 174), reused 294 (delta 90), pack-reused 0 (from 0)
Receiving objects: 100% (391/391), 8.56 MiB | 20.16 MiB/s, done.
Resolving deltas: 100% (174/174), done.
/content/nlp-course-material/2025-2026/Assignment 1/data/nlp-course-material/2025-2026/Assignment 1/nlp-course-material/2025-2026/Assignment 1/nlp-course-material/2025-2026/Assignment 1/nlp-course-material/2025-2026/Assignment 1/nlp-course-material/2025-2026/Assignment 1/nlp-course-material/2025-2026/Assignment 1


# **Tweet Preprocessing and Label Aggregation Script**

This script prepares the dataset of tweets for NLP tasks.
It handles text cleaning, tokenization, lemmatization, and label aggregation for supervised learning.
Below is a detailed explanation of each section.

# 1. Importing Required Libraries:

    a) pandas / numpy → for data manipulation.
    b) re → regular expressions for text cleaning.
    c) nltk → for tokenization, POS tagging, and lemmatization.
    d) Counter → to count occurrences of labels and select the majority vote.

# 2. Preparing NLTK Resources:

    This block ensures that all required NLTK corpora and models are available locally.
    If a resource is missing, it is automatically downloaded.

# 3. Initializing Tools:

    WhitespaceTokenizer → splits text based on spaces (useful after cleaning).
    WordNetLemmatizer → reduces words to their base or dictionary form using WordNet.

# 4. Helper Function: get_wordnet_key(pos_tag):
    This function maps Penn Treebank POS tags (e.g., NN, VB, JJ) to WordNet’s format (noun, verb, adjective, adverb).
    This step is essential because WordNetLemmatizer requires the part of speech to perform accurate lemmatization.
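
As a rough illustration of this mapping, the sketch below hard-codes the single-letter codes that NLTK exposes as `wordnet.ADJ`, `wordnet.VERB`, `wordnet.NOUN`, and `wordnet.ADV` (`'a'`, `'v'`, `'n'`, `'r'`), so it runs without NLTK installed. The name `penn_to_wordnet` is hypothetical; the noun fallback mirrors the choice made in this notebook.

```python
# Single-letter WordNet POS codes (the values of wordnet.ADJ, VERB, NOUN, ADV in NLTK).
WORDNET_ADJ, WORDNET_VERB, WORDNET_NOUN, WORDNET_ADV = "a", "v", "n", "r"

# First letter of a Penn Treebank tag -> WordNet POS code.
_PREFIX_TO_WORDNET = {
    "J": WORDNET_ADJ,   # JJ, JJR, JJS
    "V": WORDNET_VERB,  # VB, VBD, VBG, ...
    "N": WORDNET_NOUN,  # NN, NNS, NNP, ...
    "R": WORDNET_ADV,   # RB, RBR, RBS
}


def penn_to_wordnet(tag: str) -> str:
    """Map a Penn Treebank tag to a WordNet POS code, defaulting to noun."""
    return _PREFIX_TO_WORDNET.get(tag[:1], WORDNET_NOUN)


for tag in ["JJ", "VBD", "NNS", "RB", "IN"]:
    print(tag, "->", penn_to_wordnet(tag))
```

Tags outside the four groups (for example `IN` for prepositions) fall back to noun, which is also the default that `WordNetLemmatizer.lemmatize` uses when no part of speech is given.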

# 5. Lemmatization Function: lem_text(row):
    This function:

    1) Tokenizes the tweet into words.
    2) Assigns POS tags using NLTK’s pos_tag.
    3) Lemmatizes each word according to its part of speech.
    4) Returns the lemmatized tweet as a single string.

# 6. Cleaning Function: cleaner(row):

    Purpose: Remove noise and standardize text before analysis.

    Steps:

    1)	lower(): Converts all text to lowercase
    2)	Remove URLs: Regex https?:\/\/.\S+ removes URLs and links
    3)	Remove mentions & hashtags:	Regex [@#].\S+ removes @user and #topic
    4)	Remove emojis/symbols: Unicode ranges cover emoticons, flags, pictographs
    5)	Remove non-alphanumeric:	Keeps only letters, digits, and spaces
    6)	Normalize whitespace:	Collapses multiple spaces into one
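
The six steps above can be sketched on a single string as follows. This is a minimal illustration, not the notebook's exact function: `clean_tweet` is a hypothetical name, and the URL and mention patterns are slightly simplified variants of the ones listed.

```python
import re

# Emoji/symbol ranges: emoticons, pictographs, transport symbols, flags.
EMOJI_RE = re.compile(
    "["
    "\U0001F600-\U0001F64F"
    "\U0001F300-\U0001F5FF"
    "\U0001F680-\U0001F6FF"
    "\U0001F1E0-\U0001F1FF"
    "]+"
)


def clean_tweet(text: str) -> str:
    text = text.lower()                       # 1) lowercase
    text = re.sub(r"https?://\S+", "", text)  # 2) URLs
    text = re.sub(r"[@#]\S+", "", text)       # 3) mentions and hashtags
    text = EMOJI_RE.sub("", text)             # 4) emojis/symbols
    text = re.sub(r"[^a-z0-9\s]+", "", text)  # 5) non-alphanumeric characters
    return " ".join(text.split())             # 6) collapse whitespace


sample = "Check THIS out https://example.com @user #topic \U0001F600 it's   GREAT!!!"
print(clean_tweet(sample))
```

For the sample string this prints `check this out its great`. Note that removing the apostrophe turns `it's` into `its`, a side effect worth keeping in mind before lemmatization.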

# 7. Label Aggregation:

    This part aggregates multiple label votes for a tweet into a single numeric label.

    How it works:

    1) For each row, it collects all values in labels_task2 except "UNKNOWN".
    2) Uses Counter to find the most common label (majority vote).
    3) Maps that label to a numerical ID using the mapping dictionary.
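
A minimal sketch of the majority vote, using only the standard library. The function name `aggregate_votes` and the `None` return for rows without usable votes are assumptions made for illustration; the mapping is the one used in this notebook.

```python
from collections import Counter

# Label-to-ID mapping, as used in this notebook.
MAPPING = {"-": 0, "DIRECT": 1, "JUDGEMENTAL": 2, "REPORTED": 3}


def aggregate_votes(votes, mapping=MAPPING, ignore="UNKNOWN"):
    """Return the ID of the most frequent label, or None if no valid vote remains."""
    valid = [v for v in votes if v != ignore]
    if not valid:
        # Assumption: the caller decides how to treat rows with no usable votes.
        return None
    label, _count = Counter(valid).most_common(1)[0]
    return mapping[label]


print(aggregate_votes(["DIRECT", "-", "DIRECT", "UNKNOWN", "REPORTED", "DIRECT"]))
print(aggregate_votes(["UNKNOWN", "UNKNOWN"]))
```

On ties, `Counter.most_common` returns the label that was encountered first, so the result depends on annotator order; a different tie-breaking rule may be preferable depending on the task.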

# Summary

    This script prepares tweets by performing:

    1) cleaner():	Remove unwanted characters and normalize text
    2) lem_text():	Lemmatize words for consistent representation
    3) aggregator():	Convert multiple annotations into a single label

    Together, these functions create a clean, normalized, and labeled dataset,
    ideal for tasks like the text classification we perform next.


In [None]:
import json
import pandas as pd
import numpy as np
from collections import Counter
import re
import nltk
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag
from nltk.corpus import wordnet
from nltk.tokenize import (word_tokenize,
                            sent_tokenize,
                            WhitespaceTokenizer)

# Prepare NLTK
resources = [
    ('corpora/omw-1.4', 'omw-1.4'),
    ('corpora/wordnet', 'wordnet'),
    ('taggers/averaged_perceptron_tagger', 'averaged_perceptron_tagger'),
    ('taggers/averaged_perceptron_tagger_eng', 'averaged_perceptron_tagger_eng'),
    ('tokenizers/punkt_tab', 'punkt_tab'),
    ('tokenizers/punkt', 'punkt')
]

for resource_path, download_name in resources:
    try:
        nltk.data.find(resource_path)
    except LookupError:
        nltk.download(download_name, quiet=True)

tokenizer = WhitespaceTokenizer()
lemmatizer = WordNetLemmatizer()


def get_wordnet_key(pos_tag):
    if pos_tag.startswith('J'):
        return wordnet.ADJ
    elif pos_tag.startswith('V'):
        return wordnet.VERB
    elif pos_tag.startswith('N'):
        return wordnet.NOUN
    elif pos_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN  # Default to noun for all other tags

# Lemmatize a row's tweet
def lem_text(row):
    tokens = tokenizer.tokenize(row.tweet)
    tagged = pos_tag(tokens)
    words = [lemmatizer.lemmatize(word, get_wordnet_key(tag))
             for word, tag in tagged]
    return " ".join(words)

# Clean a row's tweet
def cleaner(row):
    text = row.tweet
    text = text.lower()
    text = re.sub(r'https?:\/\/.\S+', '', text)
    text = re.sub(r'[@#].\S+', '', text)
    text = re.sub(
        "["
        "\U0001F600-\U0001F64F"  # Emoticons
        "\U0001F300-\U0001F5FF"  # Symbols & pictographs
        "\U0001F680-\U0001F6FF"  # Transport & map symbols
        "\U0001F1E0-\U0001F1FF"  # Flags
        "]+", '', text
    )
    text = re.sub(r'[^a-z0-9\s]+', '', text)  # Keep only letters, digits, and whitespace
    text = ' '.join(text.split())
    return text

# Aggregate the labels (labels_task2) by majority vote, ignoring "UNKNOWN" votes
aggregator = lambda row: \
    mapping[Counter([vote for vote in row.labels_task2 if vote != "UNKNOWN"]).most_common(1)[0][0]]

mapping = {
    '-': 0,
    'DIRECT': 1,
    'JUDGEMENTAL': 2,
    'REPORTED': 3
}

### Clean, split and lemmatize the dataset.

In [None]:
# Load the files
with open("data/training.json", "r") as tr, \
     open("data/validation.json", "r") as va, \
     open("data/test.json", "r") as te:
    train_json = json.load(tr)
    val_json = json.load(va)
    test_json = json.load(te)

# Create the dataframes (setting the index to id_EXIST)
dts = {
    "train": pd.DataFrame.from_dict(train_json, orient="index").set_index("id_EXIST"),
    "test": pd.DataFrame.from_dict(test_json, orient="index").set_index("id_EXIST"),
    "val": pd.DataFrame.from_dict(val_json, orient="index").set_index("id_EXIST")
}

# Unnecessary columns
drop_cols = ["number_annotators", "annotators", "gender_annotators",
    "age_annotators", "labels_task1", "labels_task3", "split"]

# Clean and lemmatize the data
for name, df in dts.items():
    df = df[df.lang == "en"] # Keep English tweets only (drops Spanish).

    df = df.drop(columns=drop_cols) # Drop unnecessary cols.

    df["labels"] = df.apply(aggregator, axis=1) # Aggregate the labels (maj. voting).
    df = df.drop(columns="labels_task2")

    for func in [cleaner, lem_text]:
        df["tweet"] = df.apply(func, axis=1) # Clean, then lemmatize the tweets.

    dts[name] = df

train, test, val = dts.values()

## Task 3: Text Encoding

### Setup

In [None]:
!pip install gensim

Collecting gensim
  Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.metadata (8.4 kB)
Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (27.9 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 27.9/27.9 MB 58.9 MB/s eta 0:00:00
Installing collected packages: gensim
Successfully installed gensim-4.4.0


In [None]:
!wget http://nlp.stanford.edu/data/glove.twitter.27B.zip
!unzip -q glove.twitter.27B.zip

--2025-10-24 05:02:46--  http://nlp.stanford.edu/data/glove.twitter.27B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.twitter.27B.zip [following]
--2025-10-24 05:02:46--  https://nlp.stanford.edu/data/glove.twitter.27B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://downloads.cs.stanford.edu/nlp/data/glove.twitter.27B.zip [following]
--2025-10-24 05:02:46--  https://downloads.cs.stanford.edu/nlp/data/glove.twitter.27B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1520408563 (1.4G) [ap

In [None]:
import os

# Select the backend before importing Keras; setting it afterwards has no effect.
os.environ["KERAS_BACKEND"] = "tensorflow"

In [None]:
import tensorflow as tf
tf_data = tf.data
import keras
from keras import layers

import gensim
import gensim.downloader as gloader
from gensim.models import KeyedVectors

import numpy as np

### Build the vocabulary

In [None]:
texts = train["tweet"].values
labels = train["labels"].values

text_ds = tf.data.Dataset.from_tensor_slices((texts, labels)).batch(64)

In [None]:
vectorizer = layers.TextVectorization(max_tokens=20000, output_sequence_length=100)
vectorizer.adapt(text_ds.map(lambda x, y: x))

In [None]:
vocab = vectorizer.get_vocabulary()
print(vocab[:10])

['', '[UNK]', np.str_('be'), np.str_('the'), np.str_('a'), np.str_('to'), np.str_('and'), np.str_('of'), np.str_('i'), np.str_('it')]


### Use GloVe Embedding vectors

#### Load the GloVe vectors with Gensim (Word2Vec text format, no header)

In [None]:
# Embedding dimension: 100 (initial choice; this hyperparameter will be tuned later for better results)
glove_file = "glove.twitter.27B.100d.txt"

# Load GloVe into Gensim
twitter_glove = KeyedVectors.load_word2vec_format(glove_file, binary=False, no_header=True)

print(f"Loaded Twitter GloVe with {len(twitter_glove.key_to_index):,} tokens")

Loaded Twitter GloVe with 1,193,514 tokens


#### Build TensorFlow embedding matrix

In [None]:
vocab = vectorizer.get_vocabulary()
embedding_dim = twitter_glove.vector_size
embedding_matrix = np.zeros((len(vocab), embedding_dim))

for i, word in enumerate(vocab):
    if word in twitter_glove:
        embedding_matrix[i] = twitter_glove[word]

### OOV handling

#### Random embedding initialization for OOV words

Words missing from GloVe are initialized with random vectors, which are then learned during training.

In [None]:
for i, word in enumerate(vocab):
    if word in twitter_glove:
        embedding_matrix[i] = twitter_glove[word]
    else:
        embedding_matrix[i] = np.random.normal(scale=0.6, size=(embedding_dim,))
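
Before settling on an OOV strategy, it can help to know how much of the vocabulary is actually missing from the pretrained vectors. The sketch below is one possible check in plain Python; `oov_report` and the toy inputs are hypothetical, and `known` stands in for any container that supports `in`, such as the `twitter_glove` object above.

```python
def oov_report(vocab, known, special=("", "[UNK]")):
    """Split a vocabulary into covered and out-of-vocabulary words.

    Special tokens (padding and unknown) are skipped, since they
    typically have no pretrained vector.
    """
    words = [w for w in vocab if w not in special]
    oov = [w for w in words if w not in known]
    rate = len(oov) / len(words) if words else 0.0
    return oov, rate


toy_vocab = ["", "[UNK]", "be", "the", "covfefe", "lol"]
toy_known = {"be", "the", "lol"}  # stand-in for the pretrained vocabulary
oov_words, oov_rate = oov_report(toy_vocab, toy_known)
print(oov_words, f"{oov_rate:.0%}")
```

A high OOV rate suggests that random initialization will affect many rows of the embedding matrix, which makes the choice of `trainable` for the embedding layer more consequential.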

#### Create Keras Embedding layer

In [None]:
embedding_layer = layers.Embedding(
    input_dim=len(vocab),
    output_dim=embedding_dim,
    embeddings_initializer=keras.initializers.Constant(embedding_matrix),
    trainable=True  # Allow the model to adapt the embeddings during training:
                    # the randomly initialized OOV vectors can be learned into something more meaningful.
                    # This also fine-tunes the pre-trained embeddings on our specific task (could this be a problem?)
)