# Projet - Classification de documents d'opinions
---
- Un jeu de données textuelles est mis à disposition sur Moodle.  
- Il s'agit d'un corpus d'à peu près 8000 documents contenant des avis d'internautes sur des films.  
- A chaque document est associé sa polarité selon l'avis (+1 : positif, -1 : négatif).  
- Le fichier des documents est formaté dans un tableau cvs (un avis par ligne), un autre fichier csv contient les polarités d'avis par document (- 1/+1).  
- Une correspondance directe existe entre les numéros des lignes des documents et des polarités.

## Configuration de l'environnement
---

In [1]:
import pandas as pd
import numpy as np

from collections import defaultdict

from nltk import pos_tag
from nltk import punkt
from nltk.corpus import stopwords, wordnet as wn
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

from sklearn import model_selection, naive_bayes, svm
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.utils import shuffle

##### UNCOMMENT THIS SECTION IF FIRT TIME RUNNING
#import nltk
#nltk.download('wordnet')
#nltk.download('stopwords')
#nltk.download('word_tokenize')
#nltk.download('averaged_perceptron_tagger')
#####

# Set seed for random results base calculation
np.random.seed(500)

import seaborn as sns
sns.set(style = "darkgrid")

## Lecture des fichiers
---

- Récupération du fichier dataset.csv et labels.csv
- Intégration de leur contenu dans 2 dataframes pandas différents

In [2]:
print("\nDataframe des avis")
df_avis = pd.read_csv('Dataset/dataset.csv', sep = '\t', header = None, names = ['Avis'], encoding ='utf8')
display(df_avis.head())
display(df_avis.shape)

print("\nDataframe des scores")
df_score = pd.read_csv('Dataset/labels.csv', sep = '\t', header = None, names = ['Score'], encoding ='utf8')
display(df_score.head())
display(df_score.shape)

print("\nDataframe merged")
df = df_avis.join(df_score)
display(df.head())
display(df.shape)


Dataframe des avis


Unnamed: 0,Avis
0,Obviously made to show famous 1950s stripper M...
1,This film was more effective in persuading me ...
2,Unless you are already familiar with the pop s...
3,From around the time Europe began fighting Wor...
4,Im not surprised that even cowgirls get the bl...


(10000, 1)


Dataframe des scores


Unnamed: 0,Score
0,-1
1,-1
2,-1
3,-1
4,-1


(10000, 1)


Dataframe merged


Unnamed: 0,Avis,Score
0,Obviously made to show famous 1950s stripper M...,-1
1,This film was more effective in persuading me ...,-1
2,Unless you are already familiar with the pop s...,-1
3,From around the time Europe began fighting Wor...,-1
4,Im not surprised that even cowgirls get the bl...,-1


(10000, 2)

## Pré-traitement des données
---

Structuration des données afin d'y réaliser des manipulations

### Tokenization

In [3]:
# Copie du DF originale
df2 = shuffle(df)
# Réinitialisation des index
#df2.reset_index()
df2.reset_index(inplace = True, 
                drop = True)
display(df2.head())

Unnamed: 0,Avis,Score
0,After having read two or three negative review...,1
1,I recently (May 2008) discovered that this chi...,1
2,"Pathetic is the word. Bad acting, pathetic scr...",-1
3,Spencer Tracy and Katherine Hepburn would roll...,-1
4,This in my opinion is one of the best action m...,1


In [4]:
# Change all characters to lower case
df2['Avis'] = [entry.lower() for entry in df['Avis']]

# Tokenization
df2['Avis'] = [word_tokenize(entry) for entry in df['Avis']]

display(df2.head())

Unnamed: 0,Avis,Score
0,"[Obviously, made, to, show, famous, 1950s, str...",1
1,"[This, film, was, more, effective, in, persuad...",1
2,"[Unless, you, are, already, familiar, with, th...",-1
3,"[From, around, the, time, Europe, began, fight...",-1
4,"[Im, not, surprised, that, even, cowgirls, get...",1


### Lemmatization

In [5]:
# Lemmenting

## liste d'étiquettes
tag_map = defaultdict(lambda : wn.NOUN)
tag_map['J'] = wn.ADJ
tag_map['V'] = wn.VERB
tag_map['R'] = wn.ADV

# version rapide
#df2_small = df2.iloc[:100]

for index, entry in enumerate(df2['Avis']):
    final_words = []
    word_lem = WordNetLemmatizer()
    for word, tag in pos_tag(entry):
        if word not in stopwords.words('english') and word.isalpha():
            word_final = word_lem.lemmatize(word, tag_map[tag[0]])
            final_words.append(word_final)
    df2.loc[index, 'Avis'] = str(final_words)

display(df2.head())

Unnamed: 0,Avis,Score
0,"['Obviously', 'make', 'show', 'famous', 'strip...",1
1,"['This', 'film', 'effective', 'persuade', 'Zio...",1
2,"['Unless', 'already', 'familiar', 'pop', 'star...",-1
3,"['From', 'around', 'time', 'Europe', 'begin', ...",-1
4,"['Im', 'surprise', 'even', 'cowgirls', 'get', ...",1


### Checkpoint 

In [6]:
df3 = df2
# manipulation de df3 par la suite

### Train/Test Data sets

In [7]:
Train_X, Test_X, Train_Y, Test_Y = model_selection.train_test_split(df3['Avis'],
                                                                  df3['Score'],
                                                                  test_size = 0.3)

### Encoding

In [8]:
Encoder = LabelEncoder()
Train_Y = Encoder.fit_transform(Train_Y)
Test_Y = Encoder.fit_transform(Test_Y)
#Train_X = Encoder.fit_transform(Train_X)
#Test_X = Encoder.fit_transform(Test_X)

### Word Vectorization

In [9]:
vect = TfidfVectorizer(max_features = 5000)
vect.fit(df3['Avis'])

Train_X_vect = vect.transform(Train_X)
Test_X_vect = vect.transform(Test_X)

print(vect.vocabulary_)



### ML algorithms
---
#### Naive Bayes

In [10]:
Naive = naive_bayes.MultinomialNB()
Naive.fit(Train_X_vect, Train_Y)

res_NB = Naive.predict(Test_X_vect)

print("NB accuracy -> ", accuracy_score(res_NB, Test_Y)*100, "%")

NB accuracy ->  48.13333333333333 %


#### SVM

In [11]:
SVM = svm.SVC(C = 1.0, kernel = 'linear', degree = 3, gamma = 'auto')
SVM.fit(Train_X_vect, Train_Y)

res_SVM = SVM.predict(Test_X_vect)

print("SVM accuracy -> ", accuracy_score(res_SVM, Test_Y)*100, "%")

SVM accuracy ->  47.5 %
