# Recurrent Neural Network (Simple RNN)

In [13]:
import re
import string
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
import nltk

## Tokenization

In [2]:
df = pd.read_csv('../data/tweets.csv')

In [3]:
df[["airline_sentiment", "text"]].head()

|   | airline_sentiment | text |
| - | ----------------- | ---- |
| 0 | neutral  | @VirginAmerica What @dhepburn said. |
| 1 | positive | @VirginAmerica plus you've added commercials t... |
| 2 | neutral  | @VirginAmerica I didn't today... Must mean I n... |
| 3 | negative | @VirginAmerica it's really aggressive to blast... |
| 4 | negative | @VirginAmerica and it's a really big bad thing... |


### Removing mentions, hashtags and stop words

In [4]:
def remove_mentions(text):
    return ' '.join(word for word in text.split() if not word.startswith('@'))

def remove_hashtags(text):
    return ' '.join(word for word in text.split() if not word.startswith('#'))

df["cleaned_text"] = df["text"].apply(remove_mentions).apply(remove_hashtags).str.lower()

In [5]:
df["cleaned_text"]

0                                               what said.
1        plus you've added commercials to the experienc...
2        i didn't today... must mean i need to take ano...
3        it's really aggressive to blast obnoxious "ent...
4                 and it's a really big bad thing about it
                               ...                        
14635    thank you we got on a different flight to chic...
14636    leaving over 20 minutes late flight. no warnin...
14637                    please bring american airlines to
14638    you have my money, you change my flight, and d...
14639    we have 8 ppl so we need 2 know how many seats...
Name: cleaned_text, Length: 14640, dtype: object
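The two helpers above only drop tokens that *start* with `@` or `#`, so a mention glued to a leading character (for example `.@VirginAmerica`) passes through untouched. The sketch below compares them with a regex-based alternative. The function `strip_mentions_and_hashtags`, the pattern, and the sample tweet are illustrative assumptions, not part of the original notebook.

```python
import re

# Hypothetical pattern: an @ or # followed by word characters, anywhere in the text
MENTION_OR_HASHTAG = re.compile(r"[@#]\w+")

def remove_mentions(text):
    # Same logic as the notebook cell above
    return ' '.join(word for word in text.split() if not word.startswith('@'))

def remove_hashtags(text):
    # Same logic as the notebook cell above
    return ' '.join(word for word in text.split() if not word.startswith('#'))

def strip_mentions_and_hashtags(text):
    # Replace every @mention and #hashtag with a space, even inside a token
    without_tags = MENTION_OR_HASHTAG.sub(" ", text)
    # Collapse the whitespace left behind
    return " ".join(without_tags.split())

sample = ".@VirginAmerica thanks! #bestflight ever"  # made-up tweet
print(remove_hashtags(remove_mentions(sample)))  # .@VirginAmerica thanks! ever
print(strip_mentions_and_hashtags(sample))       # . thanks! ever
```

One trade-off to keep in mind: the regex also eats the tail of anything containing `@`, such as an e-mail address, so it is not a drop-in improvement for every dataset.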

We then want to remove stop words (e.g. "the", "a", ...), which contribute little to the sentiment of a tweet.

In [11]:
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\clemm\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


In [None]:
def clean_text(text):
    # Remove punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Remove stop words
    cleaned = ' '.join(word for word in text.split() if word not in stop_words)
    return cleaned

In [14]:
df["cleaned_text"] = df["cleaned_text"].apply(clean_text)

In [15]:
df["cleaned_text"]

0                                                     said
1                  plus added commercials experience tacky
2             didnt today must mean need take another trip
3        really aggressive blast obnoxious entertainmen...
4                                     really big bad thing
                               ...                        
14635                                got different chicago
14637                              bring american airlines
14638    money change answer phones suggestions make co...
14639    8 ppl need know many next plz put us standby next
Name: cleaned_text, Length: 14640, dtype: object
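Note that `didnt` survives in row 2 of the output above. `clean_text` strips punctuation before filtering, so `didn't` becomes `didnt`, which no longer matches the apostrophe form in the stop-word list. The sketch below illustrates the effect of the ordering. It uses a tiny hand-written stop-word set (an assumption, standing in for the NLTK list) and two hypothetical helper functions.

```python
import string

# Tiny hypothetical stop-word set standing in for the NLTK English list
STOP_WORDS = {"i", "to", "didn't", "you've"}

# Translation table that deletes every ASCII punctuation character
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def punctuation_first(text):
    # Order used in clean_text: strip punctuation, then filter stop words
    text = text.translate(PUNCT_TABLE)
    return " ".join(w for w in text.split() if w not in STOP_WORDS)

def stop_words_first(text):
    # Alternative order: filter stop words, then strip punctuation
    kept = " ".join(w for w in text.split() if w not in STOP_WORDS)
    return kept.translate(PUNCT_TABLE)

sample = "i didn't today... must mean"
print(punctuation_first(sample))  # didnt today must mean
print(stop_words_first(sample))   # today must mean
```

Neither order is perfect: filtering first misses a stop word that has punctuation attached (such as `to.`), so which one to use depends on what matters more for the downstream model.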

### Stemming

We apply stemming to reduce each word to its stem, which is not necessarily a real word (e.g. "airlines" becomes "airlin" and "really" becomes "realli").

In [16]:
nltk.download('punkt')
stemmer = PorterStemmer()

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\clemm\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


In [21]:
def stem_text(text):
    return ' '.join([stemmer.stem(word) for word in text.split()])

In [23]:
df["cleaned_text"] = df["cleaned_text"].apply(stem_text)

We could also have used lemmatization, which is more sophisticated: it takes the part of speech (verb, noun, etc.) into account and maps each word to its canonical form (lemma):

| Method        | Advantages                            | Drawbacks                                                                          |
| ------------- | ------------------------------------- | ---------------------------------------------------------------------------------- |
| Stemming      | Fast, simple                          | Can produce non-existent words                                                     |
| Lemmatization | More accurate, linguistically correct | Somewhat slower, requires a dictionary or model (e.g. WordNet via NLTK, or spaCy)  |


For this first iteration, however, we stick with NLTK's stemmer for simplicity.

In [24]:
df["cleaned_text"]

0                                                     said
1                             plu ad commerci experi tacki
2               didnt today must mean need take anoth trip
3        realli aggress blast obnoxi entertain guest fa...
4                                     realli big bad thing
                               ...                        
14635                                   got differ chicago
14636    leav 20 late warn commun 15 late that call shi...
14637                                bring american airlin
14638         money chang answer phone suggest make commit
14639    8 ppl need know mani next plz put us standbi next
Name: cleaned_text, Length: 14640, dtype: object
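To see what the stemmer does word by word, the sketch below replays row 1 of the outputs above (`plus added commercials experience tacky`) through `PorterStemmer`. It assumes `nltk` is installed; the stemmer itself needs no corpus download. The variable names are illustrative.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Row 1 of the cleaned text before stemming, as shown earlier in the notebook
words = "plus added commercials experience tacky".split()

# Stem each token independently, exactly as stem_text does
stemmed = [stemmer.stem(word) for word in words]

for before, after in zip(words, stemmed):
    print(f"{before:12} -> {after}")
```

Several of the resulting stems (`plu`, `commerci`, `tacki`) are not English words, which is the drawback listed in the table above. That is acceptable here as long as the same stemming is applied consistently at training and inference time.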

## Building the Vocabulary