## Project HMIN326 -- Data Mining

# Classifying documents by opinion
Supervisors:
Dino Ienco, Konstantin Todorov, Pascal Poncelet
October 2020

The goal of this project is to implement and evaluate methods for classifying documents by opinion.
### The corpus

A textual dataset is provided. It is a corpus of 10000 documents containing reviews of films written by internet users. Each document is associated with its polarity (+1: positive, -1: negative). The documents are stored in a CSV file (one review per line), and a second CSV file contains the polarity of each document (-1/+1). Line numbers correspond directly between the documents file and the polarities file.

## Step 1: Data transformation

We use scikit-learn instead of WEKA for the transformations and vectorization, so there is no need to convert the data to .arff as WEKA would require. Next, the textual values must be turned into numeric values using a frequency-based weighting (tf-idf, tf, or others). Do not forget normalization.



In [None]:
import pandas as pd
import numpy as np
import nltk
import time
import re
from difflib import SequenceMatcher
import string 
import matplotlib.pyplot as plt
# To display word clouds
from wordcloud import WordCloud
import seaborn as sns
#from nltk.tokenize import word_tokenize 
from nltk import pos_tag,word_tokenize
from nltk.corpus import stopwords 
from nltk.corpus import wordnet 
from nltk.stem import WordNetLemmatizer 
from nltk.stem.porter import PorterStemmer 

In [None]:
# Download all the NLTK resources
#nltk.download('punkt')
#nltk.download('wordnet')
#nltk.download('omw')
#nltk.download('stopwords')
#nltk.download('averaged_perceptron_tagger')
nltk.download('all')

In [None]:
# Create DataFrames from the CSV files
Dataset=pd.read_csv('dataset.csv', sep='\t', index_col=False, header = None)
Labels = pd.read_csv('labels.csv', sep='\t', index_col=False, header = None)
pd.options.display.max_colwidth = 300

In [None]:
Dataset.info()
Dataset.head(10)  # show the first 10 rows (.head() shows 5 by default)

In [None]:
Labels.info()
Labels.head(100)

### Displaying and exploring the dataset

In [None]:
""" Check for missing values: Machine learning models usually require complete data. """

# Returns the number of missing values per column, i.e. empty cells or absent data
# See https://miamioh.instructure.com/courses/38817/pages/data-cleaning for more background (to be added to the appendix)
Dataset.isnull().sum()   

#### Concatenating the two files, dataset and labels (positive and negative)

In [None]:
# Increase the number of columns displayed
pd.set_option('display.max_columns', 350)

In [None]:
# Concatenating lets us work with a single DataFrame; we then just refer to the column we need
fullDataset=pd.concat([Dataset,Labels],axis=1,ignore_index=True,verify_integrity=True)
fullDataset

In [None]:
fullDataset=fullDataset.rename(columns={0:'comment', 
                                        1:'labels'
                                        })
fullDataset.columns

In [None]:
Features=['comment','labels']
full_Dataset=pd.DataFrame(fullDataset,columns=Features)
full_Dataset

In [None]:
# Count the documents in each label class
fullDataset['labels'].value_counts()

## Graphical visualization

In [None]:
plt.figure(figsize=(6,5))
sns.set(style = "darkgrid" , font_scale = 1.2)
sns.countplot(x=fullDataset.labels, y=None)

### Displaying the positive words of our dataset as a word cloud

In [None]:
# WordCloud generates word clouds.
# With the wordcloud package we can produce an image showing the most representative words
# in a chosen set of reviews. Here we keep the 150 most frequent words.
positive_values = fullDataset[(fullDataset.comment.notnull()) & (fullDataset.labels == 1)]
# Join all positive reviews so the cloud reflects the whole set rather than a single review
wordcloud = WordCloud(width=500,height=250, max_font_size=80, max_words=150, background_color="white").generate(" ".join(positive_values.comment))

f = plt.figure() 
f.set_figwidth(15) 
f.set_figheight(10)

plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.margins(x=0, y=0)
plt.show()

### Displaying the negative words as a word cloud

In [None]:
negative_values = fullDataset[(fullDataset.comment.notnull()) & (fullDataset.labels == -1)]

# Join all negative reviews so the cloud reflects the whole set rather than a single review
wordcloud = WordCloud(width=500,height=250, max_font_size=80, max_words=150, background_color="white").generate(" ".join(negative_values.comment))

f = plt.figure() 
f.set_figwidth(15) 
f.set_figheight(10)

plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.margins(x=0, y=0)
 
plt.show()

## Step 2: Document preprocessing
You will use different types of input data depending on the preprocessing. The goal is to use your texts with different levels of information, by preparing 3 versions of the corpus:
(1) Raw texts (with or without stop-word removal),
(2) Lemmatized texts,
(3) Lemmatized texts with part-of-speech analysis (using the Tree-tagger tool seen in class).


In [None]:
# Check for whitespace-only comments (it's OK if there aren't any!):
blanks = []  # start with an empty list

for index, message, labels in full_Dataset.itertuples():  # iterate over the DataFrame
    if isinstance(message, str):     # skip NaN values
        if message.isspace():        # test the comment for whitespace
            blanks.append(index)     # add matching index numbers to the list

print(len(blanks))

# A length of 0 means that no comment consists only of whitespace

text = fullDataset['comment']

# Expand contractions (e.g. "don't" -> "do not")
import contractions
comments=text.apply(contractions.fix)

# Regular expression: replace every character other than a letter, a hyphen or a space
# with a space (rather than with nothing, so that two words separated by punctuation are not glued together),
# then collapse runs of spaces into a single space.
# Numeric values could be spelled out as text, but that is not useful here.
# Hyphens are kept for compound words.
# [^A-Za-z -] is used because the range A-z would also match [ \ ] ^ _ and `
comments = comments.str.replace('[^A-Za-z -]', ' ', regex=True).str.replace(' +', ' ', regex=True).str.strip()

# splitwords = [nltk.word_tokenize(str(comment)) for comment in comments]  # one way to tokenize
# print(splitwords)  # not displayed, too large
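The cleanup step can be tried in isolation on a single string. The sketch below uses only the standard `re` module; the helper name `clean_comment` and the sample sentence are made up for illustration. It also shows why the character range matters: `A-z` spans the ASCII characters that sit between `Z` and `a`, so it lets `_`, `[` and `]` through.

```python
import re

def clean_comment(text: str) -> str:
    """Replace everything except letters, spaces and hyphens with a space, then squeeze spaces."""
    text = re.sub(r"[^A-Za-z -]", " ", text)  # explicit ranges: letters, space, hyphen
    text = re.sub(r" +", " ", text)           # collapse runs of spaces
    return text.strip()

sample = "Well-made film_2020, but... too long [3h]!"  # made-up review

print(clean_comment(sample))

# The looser class [^A-z -] keeps '_', '[' and ']' because they lie between 'Z' and 'a' in ASCII.
loose = re.sub(r"[^A-z -]", " ", sample)
print(loose)
```

The same two patterns are what the pandas `str.replace` calls apply to every comment in the column.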



#### Tokenizing the raw text

In [None]:
comments_bruts = comments.copy().apply(word_tokenize)  # a second way to apply the same tool
comments_bruts.head()                                  # first 5 rows by default

#### Raw text with stop-word removal

In [None]:
# Load the English stop words and add our own.
# The list is stored under a new name so that it does not shadow the imported nltk `stopwords` module.
stop_words = set(nltk.corpus.stopwords.words('english'))
stop_words.update(('us', 'oh'))  # customize the stop-word list

def supprime_sw_and_nonalpha(text):  # like remove_stopwords, but also drops non-alphabetic tokens
    return [word for word in text if word.lower() not in stop_words and word.isalpha()]

def remove_stopwords(text):
    """Return the tokens of `text` that are not stop words."""
    return [word for word in text if word.lower() not in stop_words]


comments_bruts_sans_sw = comments.copy().apply(word_tokenize).apply(remove_stopwords)
comments_bruts_sans_sw.head()

#### Lemmatization with WordNet

In [None]:
# Lemmatization covering nouns, adjectives, adverbs and verbs

wn = WordNetLemmatizer()  # shared by all the helpers below

# Lemmatize `word` according to its POS tag. Not effective here: it skips words
# whose tag is not covered, so one function per tag type is defined instead.
def lemmatize_withtag(word,tag):
    if tag.startswith("NN"):  # noun
        return wn.lemmatize(word, pos='n')
    elif tag.startswith('VB'): # verb
        return wn.lemmatize(word, pos='v')
    elif tag.startswith('JJ'): # adjective
        return wn.lemmatize(word, pos='a')
    elif tag.startswith('RB'): # adverb
        return wn.lemmatize(word, pos='r')
    else:
        return word

# pos_tag returns the list of words of a text together with the tag of each one
def lemmatize_text_verbe(text):
    return [wn.lemmatize(word, pos='v') for word in text]
def lemmatize_text_nom(text):
    return [wn.lemmatize(word, pos='n') for word in text]
def lemmatize_text_adj(text):
    return [wn.lemmatize(word, pos='a') for word in text]
def lemmatize_text_adv(text):
    return [wn.lemmatize(word, pos='r') for word in text]

# Apply the four lemmatizers in sequence: verbs, nouns, adjectives, adverbs
comments_lemmes_wnet = comments_bruts_sans_sw.copy().apply(lemmatize_text_verbe)
comments_lemmes_wnet = comments_lemmes_wnet.apply(lemmatize_text_nom)
comments_lemmes_wnet = comments_lemmes_wnet.apply(lemmatize_text_adj)
comments_lemmes_wnet = comments_lemmes_wnet.apply(lemmatize_text_adv)


comments_lemmes_wnet.head()

#### Lemmatization with the spaCy library

In [None]:
import spacy
from spacy import displacy

nlp = spacy.load('en_core_web_sm')  # the model must be loaded before nlp() is called
doc = nlp("A three movie I really wanted to love was terrible. \
I'm sure the producers had the best intentions, but the execution was lacking.")
displacy.render(doc, style='ent', jupyter=True)


In [None]:
# This cell takes a while to run. Be patient!
import spacy
nlp = spacy.load('en_core_web_sm')

# Import the displaCy library
from spacy import displacy

comments_lemmes_spacy = comments.copy().apply(nlp)

# Show the result for the first 5 rows
for comment in comments_lemmes_spacy.head():
    displacy.render(comment, style='ent', jupyter=True)
    #print([token.lemma_ for token in comment])



In [None]:
displacy.render(list(comments_lemmes_spacy.head()), style='ent', jupyter=True)

## Step 3: Implementing classification algorithms
The rest of the work consists of using Weka and rigorously evaluating the classification results obtained with the different corpora prepared in the previous step as input. Recall that many learning approaches can be used for text classification:
• K nearest neighbors,
• Decision trees,
• Naïve Bayes,
• Support vector machines,
• Association rules,
• Ensemble classifiers.

Parameter setting: each classification method has several parameters to choose, such as K for the k-nearest-neighbors algorithm, the kernel for SVMs, the support for rules, etc.

With scikit-learn, the GridSearchCV class can be used to test different parameter values for the classifiers.
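As an illustration of the parameter search mentioned above, here is a minimal sketch of `GridSearchCV` wrapped around a tf-idf + `LinearSVC` pipeline. The miniature corpus, the step names `tfidf` and `clf`, and the grid values are assumptions chosen only to keep the example self-contained; they are not tuned for the project corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Made-up miniature corpus, only to keep the sketch self-contained.
toy_docs = [
    "a wonderful and moving film", "great acting and a great story",
    "I loved every minute", "a brilliant and funny movie",
    "superb direction and lovely music", "truly enjoyable and touching",
    "a dull and boring film", "terrible acting and a weak story",
    "I hated every minute", "a stupid and annoying movie",
    "awful direction and horrible music", "truly painful and tedious",
]
toy_labels = [1] * 6 + [-1] * 6

pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("clf", LinearSVC())])

# Parameters of a pipeline step are addressed as <step name>__<parameter name>.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

# 3-fold cross-validation over every combination in the grid.
search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(toy_docs, toy_labels)

print(search.best_params_)
print(round(search.best_score_, 3))
```

On the real corpus the same pattern would be fitted on `X_train` and `y_train`, with one grid per classifier (for example `n_neighbors` for `KNeighborsClassifier`).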


In [None]:
# Libraries for classification
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
from sklearn.pipeline import Pipeline

### Split the data into train & test sets:

In [None]:
X = fullDataset['comment']  # this time we want to look at the raw text
y = fullDataset['labels']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
X_train_t, X_test_t, y_train, y_test = train_test_split(X.apply(word_tokenize), y, test_size=0.33, random_state=42)

# Scikit-learn's CountVectorizer
Text preprocessing, tokenizing and the ability to filter out stopwords are all included in CountVectorizer, which builds a dictionary of features and transforms documents to feature vectors.

In [None]:
count_vect = CountVectorizer()

X_train_counts = count_vect.fit_transform(X_train)
X_train_counts.shape

This shows that our training set comprises 6700 documents and 42772 features.
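To make the document-term matrix more concrete, here is a small sketch on two made-up documents. The names `toy_docs`, `toy_vect` and `toy_counts` are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ["good film good cast", "bad film"]  # made-up documents

toy_vect = CountVectorizer()
toy_counts = toy_vect.fit_transform(toy_docs)   # sparse matrix, one row per document

# vocabulary_ maps each token to its column index (tokens are sorted alphabetically).
print(sorted(toy_vect.vocabulary_.items(), key=lambda kv: kv[1]))

# Dense view: each cell is the number of times a token occurs in a document.
print(toy_counts.toarray())
```

Each row is one document and each column one vocabulary entry, which is exactly what the `(6700, 42772)` shape describes at full scale.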

### Transform Counts to Frequencies with Tf-idf
While counting words is helpful, longer documents will have higher average count values than shorter documents, even though they might talk about the same topics.
To avoid this we can simply divide the number of occurrences of each word in a document by the total number of words in the document: these new features are called tf for Term Frequencies.
Another refinement on top of tf is to downscale weights for words that occur in many documents in the corpus and are therefore less informative than those that occur only in a smaller portion of the corpus.
This downscaling is called tf–idf for “Term Frequency times Inverse Document Frequency”.
Both tf and tf–idf can be computed as follows using TfidfTransformer:

In [None]:
tfidf_transformer = TfidfTransformer()

X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape
print(X_train_tfidf)
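The tf and tf-idf weighting described above can be worked through by hand on a toy corpus. The sketch below uses only the standard library and made-up documents. The idf formula is the smoothed variant that scikit-learn applies by default (`smooth_idf=True`); the sketch divides counts by document length as in the text, and leaves out the L2 normalization that `TfidfTransformer` applies, so the numbers illustrate the idea rather than reproduce its exact output.

```python
import math
from collections import Counter

toy_docs = [["good", "film", "good", "cast"], ["bad", "film"]]  # made-up tokenized documents

def term_frequencies(tokens):
    """tf: occurrences of each word divided by the document length."""
    counts = Counter(tokens)
    return {word: n / len(tokens) for word, n in counts.items()}

def inverse_document_frequencies(docs):
    """Smoothed idf: ln((1 + n_docs) / (1 + df)) + 1, where df is the document frequency."""
    n_docs = len(docs)
    df = Counter(word for tokens in docs for word in set(tokens))
    return {word: math.log((1 + n_docs) / (1 + d)) + 1 for word, d in df.items()}

idf = inverse_document_frequencies(toy_docs)
tfidf = [{w: tf * idf[w] for w, tf in term_frequencies(doc).items()} for doc in toy_docs]

print(term_frequencies(toy_docs[0]))  # 'good' is half of the first document
print(idf)                            # 'film' occurs in every document, so it gets the lowest idf
print(tfidf[0])
```

Words that appear in every document, such as `film` here, keep a weight equal to their plain term frequency, while rarer words are scaled up.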

### Combine Steps with TfidfVectorizer
From now on, we can combine the CountVectorizer and TfidfTransformer steps into one using TfidfVectorizer:

In [None]:
vectorizer = TfidfVectorizer()

X_train_tfidf = vectorizer.fit_transform(X_train) # remember to use the original X_train set
X_train_tfidf.shape

# Train  Classifiers

### In what follows, each classifier is wrapped in a pipeline under the generic name text_clf.

# Test classifiers and display results

### 1 - Support vector machines
Here we'll introduce an SVM classifier that's similar to SVC, called LinearSVC. LinearSVC handles sparse input better, and scales well to large numbers of samples.

####  Build a Pipeline
Remember that only our training set has been vectorized into a full vocabulary. In order to perform an analysis on our test set we'll have to submit it to the same procedures. Fortunately scikit-learn offers a Pipeline class that behaves like a compound classifier.

In [None]:

text_clf = Pipeline([('tfidf', TfidfVectorizer()),
                     ('text_clf', LinearSVC()),
])

# Feed the training data through the pipeline
text_clf.fit(X_train, y_train)  

In [None]:
# Form a prediction set
predictions = text_clf.predict(X_test)

In [None]:
# Report the confusion matrix
print(metrics.confusion_matrix(y_test,predictions))

In [None]:
# Print a classification report
print(metrics.classification_report(y_test,predictions))

In [None]:
# Print the overall accuracy
acc1 = metrics.accuracy_score(y_test,predictions)
print('The prediction score :',acc1)

## Test sentences

In [None]:
myreview = "A movie I really wanted to love was terrible. \
I'm sure the producers had the best intentions, but the execution was lacking."

In [None]:
print(text_clf.predict([myreview]))  # be sure to put "myreview" inside square brackets

#### interpretation
[-1] means that the opinion is negative  

In [None]:
myreview2 = "A movie I really liked."

In [None]:
print(text_clf.predict([myreview2]))  # be sure to put "myreview2" inside square brackets

[1] means that the opinion is positive


### 2 - K nearest neighbors

In [None]:
text_clf = Pipeline([('tfidf', TfidfVectorizer()),
                     ('text_clf', KNeighborsClassifier()),
])

# Feed the training data through the pipeline
text_clf.fit(X_train, y_train)  


In [None]:
predictions = text_clf.predict(X_test)
print(metrics.confusion_matrix(y_test,predictions))


In [None]:
print(metrics.classification_report(y_test,predictions))

In [None]:
acc2 = metrics.accuracy_score(y_test,predictions)
print('The prediction score :',acc2)

### 3 - Decision tree

In [None]:
text_clf = Pipeline([('tfidf', TfidfVectorizer()),
                     ('text_clf', DecisionTreeClassifier()),
])

# Feed the training data through the pipeline
text_clf.fit(X_train, y_train)  


In [None]:
predictions = text_clf.predict(X_test)
print(metrics.confusion_matrix(y_test,predictions))


In [None]:
print(metrics.classification_report(y_test,predictions))

In [None]:
acc3 = metrics.accuracy_score(y_test,predictions)
print('The prediction score :',acc3)

### 4 - Train a naïve Bayes classifier:

In [None]:
# Naïve Bayes:
text_clf = Pipeline([('tfidf', TfidfVectorizer()),
                     ('text_clf', MultinomialNB()),
])


In [None]:
# Run predictions and analyze the results (naïve Bayes) 
text_clf.fit(X_train, y_train)
# Form a prediction set
predictions = text_clf.predict(X_test)
# Report the confusion matrix
print(metrics.confusion_matrix(y_test,predictions))
# Print a classification report
print(metrics.classification_report(y_test,predictions))

In [None]:
# Print the overall accuracy
acc4 = metrics.accuracy_score(y_test,predictions)
print('The prediction score :',acc4)

### 5 - LogisticRegression

In [None]:
from sklearn.linear_model import LogisticRegression
text_clf = Pipeline([('tfidf', TfidfVectorizer()),
                     ('text_clf', LogisticRegression()),
])


In [None]:
# Run predictions and analyze the results (logistic regression)
text_clf.fit(X_train, y_train)
# Form a prediction set
predictions = text_clf.predict(X_test)
# Report the confusion matrix
print(metrics.confusion_matrix(y_test,predictions))
# Print a classification report
print(metrics.classification_report(y_test,predictions))

In [None]:
# Print the overall accuracy
acc5 = metrics.accuracy_score(y_test,predictions)
print('The prediction score :',acc5)

### 6 - Ensemble Classifier

In [None]:
from sklearn.ensemble import VotingClassifier
# Hard-voting ensemble over the four classifiers used above
ensemble = VotingClassifier(estimators=[('SVM', LinearSVC()), ('DTree', DecisionTreeClassifier()), ('KPPVoisin', KNeighborsClassifier()), ('NaiveBaye', MultinomialNB())], voting='hard')
# Use the ensemble itself as the final pipeline step
voting_clf = Pipeline([('tfidf', TfidfVectorizer()),
                     ('voting_clf', ensemble),
])


In [None]:
# Run predictions and analyze the results (ensemble classifier)
voting_clf.fit(X_train, y_train)
# Form a prediction set
predictions = voting_clf.predict(X_test)
# Report the confusion matrix
print(metrics.confusion_matrix(y_test,predictions))
# Print a classification report
print(metrics.classification_report(y_test,predictions))

In [None]:
# Print the overall accuracy
acc6 = metrics.accuracy_score(y_test,predictions)
print('The prediction score :',acc6)

# Comparison

In [None]:

fig = plt.figure(figsize=(7,5))
ax = fig.add_axes([0,0,1,1])
models = ['SVM', 'K_PP_Voisin', 'Dtree', 'NaiveBaye', 'L_Reg', 'Classifier_Set']
accuracy = [acc1*100, acc2*100, acc3*100, acc4*100, acc5*100, acc6*100]
ax.bar(models, accuracy, color=['b', 'g', 'm', 'r', 'c', 'y'], width=0.8)  # one color per bar
plt.xlabel("Classifier model",size=15)
plt.ylabel("Accuracy (%)",size=15)

plt.show()
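The per-classifier cells above repeat the same fit, predict and score steps. One possible way to make the comparison more compact is to loop over a dictionary of classifiers, as sketched below. The toy corpus and the names `candidates` and `scores` are illustrative assumptions; on the project data, `X_train`, `X_test`, `y_train` and `y_test` would take their place.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Made-up miniature train/test split, only to keep the sketch self-contained.
train_docs = ["a wonderful film", "great story", "loved it", "brilliant movie",
              "a boring film", "weak story", "hated it", "awful movie"]
train_labels = [1, 1, 1, 1, -1, -1, -1, -1]
test_docs = ["a great film", "an awful story"]
test_labels = [1, -1]

candidates = {
    "SVM": LinearSVC(),
    "NaiveBayes": MultinomialNB(),
    "LogReg": LogisticRegression(),
}

scores = {}
for name, clf in candidates.items():
    # Same two-step pipeline for every classifier: tf-idf features, then the model.
    model = Pipeline([("tfidf", TfidfVectorizer()), ("clf", clf)])
    model.fit(train_docs, train_labels)
    scores[name] = accuracy_score(test_labels, model.predict(test_docs))

for name, acc in scores.items():
    print(f"{name}: {acc:.2f}")
```

The resulting `scores` dictionary can be passed straight to a bar chart, which avoids keeping separate `acc1` ... `acc6` variables in sync.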

# Conclusion 