# Homemade LSTM

In [1]:
import re
import string
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
import nltk
from tqdm import tqdm

## Tokenization

In [2]:
df = pd.read_csv('../data/tweets.csv')

In [3]:
df[["airline_sentiment", "text"]].head()

|     | airline_sentiment | text                                              |
| --- | ----------------- | ------------------------------------------------- |
| 0   | neutral           | @VirginAmerica What @dhepburn said.               |
| 1   | positive          | @VirginAmerica plus you've added commercials t... |
| 2   | neutral           | @VirginAmerica I didn't today... Must mean I n... |
| 3   | negative          | @VirginAmerica it's really aggressive to blast... |
| 4   | negative          | @VirginAmerica and it's a really big bad thing... |


### Removing mentions, hashtags and stop words

In [4]:
def remove_mentions(text):
    return ' '.join(word for word in text.split() if not word.startswith('@'))

def remove_hashtags(text):
    return ' '.join(word for word in text.split() if not word.startswith('#'))

df["cleaned_text"] = df["text"].apply(remove_mentions).apply(remove_hashtags).str.lower()
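The same cleaning can be sketched with a single regular expression, which would put the otherwise unused `re` import to work. The pattern and the helper name `strip_mentions_and_hashtags` are illustrative assumptions, not part of the notebook. The sketch is meant to drop the same whitespace-delimited tokens as the two functions above.

```python
import re

# A token counts as a mention or hashtag only if '@' or '#' is its first
# character, i.e. it is preceded by whitespace or the start of the string.
MENTION_OR_HASHTAG = re.compile(r"(?<!\S)[@#]\S*")

def strip_mentions_and_hashtags(text):
    """Remove @mentions and #hashtags, then normalise whitespace."""
    without_tags = MENTION_OR_HASHTAG.sub(" ", text)
    return " ".join(without_tags.split())

example = strip_mentions_and_hashtags("@VirginAmerica What @dhepburn said.")
email_kept = strip_mentions_and_hashtags("mail me at a@b.com #fail")
```

Because the lookbehind requires the symbol to start a token, an address such as `a@b.com` is left alone, which matches the `startswith` behaviour of the original helpers.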

In [5]:
df["cleaned_text"]

0                                               what said.
1        plus you've added commercials to the experienc...
2        i didn't today... must mean i need to take ano...
3        it's really aggressive to blast obnoxious "ent...
4                 and it's a really big bad thing about it
                               ...                        
14635    thank you we got on a different flight to chic...
14636    leaving over 20 minutes late flight. no warnin...
14637                    please bring american airlines to
14638    you have my money, you change my flight, and d...
14639    we have 8 ppl so we need 2 know how many seats...
Name: cleaned_text, Length: 14640, dtype: object

Next, we remove stop words (e.g. "the", "a", ...), which carry little information about sentiment.

In [6]:
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

[nltk_data] Error loading stopwords: <urlopen error [WinError 10054]
[nltk_data]     An existing connection was forcibly closed by the
[nltk_data]     remote host>


In [7]:
def clean_text(text):
    # Strip punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Drop stop words
    cleaned = ' '.join(word for word in text.split() if word not in stop_words)
    return cleaned

In [8]:
df["cleaned_text"] = df["cleaned_text"].apply(clean_text)

In [9]:
df["cleaned_text"]

0                                                     said
1            plus youve added commercials experience tacky
2             didnt today must mean need take another trip
3        really aggressive blast obnoxious entertainmen...
4                                     really big bad thing
                               ...                        
14635                   thank got different flight chicago
14637                       please bring american airlines
14638    money change flight dont answer phones suggest...
14639    8 ppl need 2 know many seats next flight plz p...
Name: cleaned_text, Length: 14640, dtype: object
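The output above still contains tokens such as "didnt" and "youve". A likely explanation is the order of operations: punctuation is stripped first, so contractions no longer match stop-word entries that contain an apostrophe. The sketch below illustrates the effect with a tiny hand-written stop-word set (an assumption for the demo, not the full NLTK list).

```python
import string

# Tiny illustrative stop-word set (assumption: NOT the NLTK English list).
STOP_WORDS = {"i", "to", "the", "didn't", "you've"}

def strip_punct(text):
    """Delete every ASCII punctuation character."""
    return text.translate(str.maketrans("", "", string.punctuation))

def drop_stop_words(text, stop_words):
    """Keep only the whitespace-delimited tokens that are not stop words."""
    return " ".join(w for w in text.split() if w not in stop_words)

text = "i didn't today... must mean"

# Order used in the notebook: punctuation first, then stop words.
punct_first = drop_stop_words(strip_punct(text), STOP_WORDS)

# Alternative order: stop words first, then punctuation.
stops_first = strip_punct(drop_stop_words(text, STOP_WORDS))
```

With this toy set, `punct_first` keeps "didnt" while `stops_first` drops the contraction entirely. Whether that matters for the classifier is an open question; negations like "didnt" may actually be useful sentiment signals.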

### Stemming

We apply stemming to reduce words to a common root by stripping suffixes (e.g. "flights" and "flight" both become "flight"). The resulting stem is not always a dictionary word: "airlines" becomes "airlin".

In [10]:
nltk.download('punkt')
stemmer = PorterStemmer()

[nltk_data] Error loading punkt: <urlopen error [WinError 10054] An
[nltk_data]     existing connection was forcibly closed by the remote
[nltk_data]     host>


In [11]:
def stem_text(text):
    return ' '.join([stemmer.stem(word) for word in text.split()])

In [12]:
df["cleaned_text"] = df["cleaned_text"].apply(stem_text)

We could also have used lemmatization, which is more sophisticated: it takes the part of speech (verb, noun, etc.) into account and returns the canonical form of the word:

| Method        | Pros                                     | Cons                                             |
| ------------- | ---------------------------------------- | ------------------------------------------------ |
| Stemming      | Fast, simple                             | Can produce non-existent words                   |
| Lemmatization | More accurate, linguistically correct    | Somewhat slower, needs a library such as spaCy   |


For this first iteration we stick with nltk's stemming for simplicity.

In [13]:
df["cleaned_text"]

0                                                     said
1                        plu youv ad commerci experi tacki
2               didnt today must mean need take anoth trip
3        realli aggress blast obnoxi entertain guest fa...
4                                     realli big bad thing
                               ...                        
14635                      thank got differ flight chicago
14636    leav 20 minut late flight warn commun 15 minut...
14637                          pleas bring american airlin
14638    money chang flight dont answer phone suggest m...
14639    8 ppl need 2 know mani seat next flight plz pu...
Name: cleaned_text, Length: 14640, dtype: object
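To see why stems such as "airlin" are not real words, here is a deliberately naive suffix stripper. It is NOT the Porter algorithm used above (which applies many more rules), only a toy sketch with a hypothetical `toy_stem` helper and an arbitrary suffix list.

```python
# Suffixes are tried in this order; the first match wins (toy rule set).
SUFFIXES = ("ing", "ly", "es", "s", "ed")

def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix if at least min_stem_len chars remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

stems = {w: toy_stem(w) for w in ["flights", "airlines", "delayed", "thing", "bus"]}
```

The rule-based nature is the point: cutting "es" from "airlines" gives "airlin", which groups related word forms together without guaranteeing a valid word. The toy rules will disagree with Porter on many inputs.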

Tokenization is a simple whitespace split, since the text has already been cleaned:

In [14]:
def tokenize(text):
    return text.split()

We split the data into train, validation and test sets:

In [15]:
from sklearn.model_selection import train_test_split

In [16]:
X, y = df["cleaned_text"], df["airline_sentiment"]
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.1, random_state=42, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.1, random_state=42, stratify=y_temp)
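Since the second split takes 10% of what remains after the first, the validation set is about 9% of the full data, not 10%. The arithmetic can be sketched as follows, assuming the held-out count is rounded up at each split (the rounding rule is an assumption here, and `split_sizes` is a hypothetical helper).

```python
import math

def split_sizes(n_samples, held_out_fraction):
    """Return (kept, held_out) counts, rounding the held-out part up."""
    n_held_out = math.ceil(n_samples * held_out_fraction)
    return n_samples - n_held_out, n_held_out

n_total = 14640  # Length reported for df["cleaned_text"]

n_temp, n_test = split_sizes(n_total, 0.1)   # first split: test set
n_train, n_val = split_sizes(n_temp, 0.1)    # second split: validation set

# Share of the full dataset that ends up in each split
shares = {
    "train": n_train / n_total,
    "val": n_val / n_total,
    "test": n_test / n_total,
}
```

Under this assumption the test set has 1464 rows, which agrees with the total support shown in the classification report at the end of the notebook. The overall proportions are roughly 81% / 9% / 10%.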

## Building the vocabulary

We now use a small helper to walk through all of the training text (X_train)
and build a structure that maps each unique word to an integer index. This is the vocabulary.

We also reserve a special index for unknown words (Out-Of-Vocabulary, or OOV), plus one for padding.

The vocabulary must be built ONLY on the training set, so that no information from the validation or test sets leaks into the model.

In [17]:
from collections import Counter

In [18]:
def build_vocab(texts, min_freq=1):
    counter = Counter()

    for text in texts:
        tokens = tokenize(text)
        counter.update(tokens)

    vocab = ["<PAD>", "<UNK>"] + [w for w, f in counter.items() if f >= min_freq]
    word2idx = {w:i for i, w in enumerate(vocab)}

    return vocab, word2idx


In [19]:
vocab, word2idx = build_vocab(X_train)

Encoding each tweet as a sequence of indices:

In [20]:
def encode(text, word2idx):
    return [word2idx.get(t, word2idx["<UNK>"]) for t in tokenize(text)]
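A toy run makes the mapping concrete. The sketch below re-declares simplified copies of `build_vocab` and `encode` so that it runs on its own; the three-sentence corpus is made up for illustration.

```python
from collections import Counter

def build_vocab(texts, min_freq=1):
    """Map each word seen at least min_freq times to an index; 0 and 1 are reserved."""
    counter = Counter()
    for text in texts:
        counter.update(text.split())
    vocab = ["<PAD>", "<UNK>"] + [w for w, f in counter.items() if f >= min_freq]
    return vocab, {w: i for i, w in enumerate(vocab)}

def encode(text, word2idx):
    """Replace each token by its index, falling back to <UNK> for unseen words."""
    return [word2idx.get(t, word2idx["<UNK>"]) for t in text.split()]

toy_train = ["late flight", "flight cancel", "thank flight"]  # made-up corpus

vocab, word2idx = build_vocab(toy_train)
encoded = encode("late flight delay", word2idx)  # "delay" was never seen

# Raising min_freq prunes rare words, which then also map to <UNK>.
vocab_pruned, _ = build_vocab(toy_train, min_freq=2)
```

Here "delay" is encoded as index 1 (`<UNK>`), and with `min_freq=2` only "flight" survives alongside the two special tokens. Raising `min_freq` on the real data would be one way to shrink the embedding table, at the cost of more unknown tokens.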

In [21]:
from torch.nn.utils.rnn import pad_sequence
import torch

def encode_dataset(texts, word2idx):
    sequences = [torch.tensor(encode(t, word2idx)) for t in texts]
    return pad_sequence(sequences, batch_first=True, padding_value=word2idx["<PAD>"])


In [22]:
X_train_tensor = encode_dataset(X_train, word2idx)
X_val_tensor   = encode_dataset(X_val, word2idx)
X_test_tensor  = encode_dataset(X_test, word2idx)
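For intuition, here is a plain-Python sketch of the right-padding that `pad_sequence(..., batch_first=True)` performs: every sequence is extended with the padding value up to the length of the longest one. `pad_right` is a hypothetical helper that works on lists rather than tensors.

```python
def pad_right(sequences, pad_value=0):
    """Pad each list on the right so all rows share the longest length."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (max_len - len(seq)) for seq in sequences]

# Three encoded "tweets" of different lengths; 0 plays the role of <PAD>.
batch = pad_right([[2, 3], [4, 3, 5, 6], [7]])
widths = {len(row) for row in batch}
```

Note that `encode_dataset` pads each split to the longest tweet in that split, so the train, validation and test tensors may end up with different widths. That is usually harmless for a recurrent model, but it does mean the model processes the trailing padding tokens unless it masks them.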

Next we encode the labels:

In [23]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y_train_enc = torch.tensor(le.fit_transform(y_train))
y_val_enc   = torch.tensor(le.transform(y_val))
y_test_enc  = torch.tensor(le.transform(y_test))
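`LabelEncoder` keeps its classes sorted, so the integer assigned to each sentiment follows alphabetical order. The mapping can be reproduced in plain Python as a sanity check; the five labels below are just a sample, and `class_to_idx` is a name chosen for this sketch.

```python
labels = ["neutral", "positive", "neutral", "negative", "negative"]

# Sorted unique labels play the role of le.classes_
classes = sorted(set(labels))
class_to_idx = {name: idx for idx, name in enumerate(classes)}

encoded = [class_to_idx[name] for name in labels]
```

This gives negative = 0, neutral = 1 and positive = 2, which is also the row order of the classification report. As with the vocabulary, the encoder is fitted on the training labels only and then reused for validation and test.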


Then we create the PyTorch datasets and dataloaders:

In [24]:
from torch.utils.data import DataLoader, TensorDataset

train_dataset = TensorDataset(X_train_tensor, y_train_enc)
val_dataset   = TensorDataset(X_val_tensor, y_val_enc)
test_dataset  = TensorDataset(X_test_tensor, y_test_enc)

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader   = DataLoader(val_dataset, batch_size=32)
test_loader  = DataLoader(test_dataset, batch_size=32)


Next we instantiate the model, an RNN (a simple LSTM):

In [25]:
import os
import sys
parent_dir = os.path.abspath(os.path.join(os.getcwd(), ".."))
sys.path.append(parent_dir)

In [26]:
from src.lstm_model_from_scratch import ManualLSTM
import torch.nn as nn

In [27]:
model = ManualLSTM(
    vocab_size=len(vocab),
    embed_dim=64,
    hidden_dim=128,
    num_classes=len(le.classes_)
).to("cpu")
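The source of `ManualLSTM` is not shown in this notebook, so the following is only the textbook LSTM cell update that a from-scratch implementation typically follows; the actual class may differ in details such as gate ordering or bias handling. Here $x_t$ would be the embedding of token $t$ (dimension 64) and $h_t$, $c_t$ the hidden and cell states (dimension 128); the weight names $W$, $U$, $b$ are notation chosen for this sketch.

$$
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

The input gate $i_t$ controls how much of the candidate $\tilde{c}_t$ is written, the forget gate $f_t$ how much of the previous cell state is kept, and the output gate $o_t$ how much of the cell state is exposed as $h_t$.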

In [28]:
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
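`nn.CrossEntropyLoss` takes raw logits and integer class labels, applies a log-softmax, and by default averages the negative log-likelihood over the batch. The computation can be sketched without PyTorch; the helper names below are hypothetical and the logits are made-up values.

```python
import math

def cross_entropy(logits, target):
    """Negative log-softmax probability of the target class (numerically stable)."""
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    return log_sum_exp - logits[target]

def batch_cross_entropy(batch_logits, targets):
    """Mean of the per-example losses, mirroring the default 'mean' reduction."""
    losses = [cross_entropy(l, t) for l, t in zip(batch_logits, targets)]
    return sum(losses) / len(losses)

# Three classes with identical logits: the model is maximally unsure.
uniform_loss = cross_entropy([0.0, 0.0, 0.0], target=1)

# A confident prediction is cheap when right and expensive when wrong.
right = cross_entropy([5.0, 0.0, 0.0], target=0)
wrong = cross_entropy([5.0, 0.0, 0.0], target=1)
mean_loss = batch_cross_entropy([[5.0, 0.0, 0.0], [5.0, 0.0, 0.0]], [0, 1])
```

With uniform logits over three classes the loss equals `log(3)`, a handy reference point when reading per-batch losses for this three-class problem.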

We then train the model using the loss and optimizer defined above:

In [29]:
def train_model(model, train_loader, val_loader, epochs=3, device="cpu"):
    # Relies on the notebook-level `criterion` and `optimizer` defined above.
    for epoch in range(epochs):
        model.train()
        total_loss = 0

        # tqdm progress bar over the training batches
        train_iter = tqdm(train_loader, desc=f"Epoch {epoch+1}/{epochs}", leave=False)

        for X_batch, y_batch in train_iter:
            X_batch = X_batch.to(device)
            y_batch = y_batch.to(device)

            optimizer.zero_grad()
            logits = model(X_batch)
            loss = criterion(logits, y_batch)
            loss.backward()
            optimizer.step()

            # Accumulate the per-batch loss (a sum over batches, not a mean)
            total_loss += loss.item()

            # Update the progress bar
            train_iter.set_postfix(loss=loss.item())

        # Validation
        model.eval()
        correct = 0
        total = 0

        with torch.no_grad():
            for X_batch, y_batch in val_loader:
                X_batch = X_batch.to(device)
                y_batch = y_batch.to(device)
                preds = model(X_batch).argmax(dim=1)

                correct += (preds == y_batch).sum().item()
                total += y_batch.size(0)

        val_acc = correct / total
        print(f"Epoch {epoch+1} | Train Loss = {total_loss:.3f} | Val Acc = {val_acc:.4f}")


In [30]:
train_model(model, train_loader, val_loader, epochs=5)

Epoch 1 | Train Loss = 298.614 | Val Acc = 0.6737
Epoch 2 | Train Loss = 246.872 | Val Acc = 0.7185
Epoch 3 | Train Loss = 218.428 | Val Acc = 0.7329
Epoch 4 | Train Loss = 194.101 | Val Acc = 0.7360
Epoch 5 | Train Loss = 167.097 | Val Acc = 0.7511
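The "Train Loss" values above are sums of per-batch losses, so their scale depends on the number of batches. Dividing by the batch count gives a mean that is easier to compare across runs. In the notebook that count is `len(train_loader)`; the sketch below instead assumes about 11858 training rows (81% of 14640, an assumption about how the splits round) and the batch size of 32.

```python
import math

# Summed losses copied from the training log above
reported_totals = [298.614, 246.872, 218.428, 194.101, 167.097]

n_train = 11858   # ASSUMPTION: approximate training-set size after both splits
batch_size = 32   # as passed to the training DataLoader

# Number of batches per epoch, counting the final partial batch
n_batches = math.ceil(n_train / batch_size)

# Mean loss per batch for each epoch
mean_losses = [total / n_batches for total in reported_totals]
```

Inside `train_model`, the same figure could be printed as `total_loss / len(train_loader)`.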


Final evaluation on the test set:

In [31]:
from sklearn.metrics import classification_report

model.eval()
all_preds = []
all_labels = []

with torch.no_grad():
    # Batch-local names avoid overwriting the full X and y defined earlier
    for X_batch, y_batch in test_loader:
        preds = model(X_batch.to("cpu")).argmax(dim=1)
        all_preds.extend(preds.cpu().numpy())
        all_labels.extend(y_batch.numpy())

print(classification_report(all_labels, all_preds, target_names=le.classes_))

              precision    recall  f1-score   support

    negative       0.87      0.84      0.85       918
     neutral       0.57      0.59      0.58       310
    positive       0.61      0.66      0.63       236

    accuracy                           0.76      1464
   macro avg       0.68      0.69      0.69      1464
weighted avg       0.76      0.76      0.76      1464
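The two average rows can be recomputed from the per-class rows. The macro average weights every class equally, while the weighted average uses each class's support. The sketch below does this for the F1 column, using the rounded values printed in the report, so the result can differ from the exact figures in the last digit.

```python
# (f1-score, support) per class, copied from the report above
report = {
    "negative": (0.85, 918),
    "neutral": (0.58, 310),
    "positive": (0.63, 236),
}

total_support = sum(support for _, support in report.values())

# Macro average: plain mean over classes
macro_f1 = sum(f1 for f1, _ in report.values()) / len(report)

# Weighted average: mean over classes weighted by support
weighted_f1 = sum(f1 * support for f1, support in report.values()) / total_support
```

The gap between the two averages reflects the class imbalance: the dominant negative class scores well and pulls the weighted average up, while the macro average exposes the weaker neutral and positive classes.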

