# Project - Opinion document classification
---
- A text dataset is provided on Moodle.  
- It is a corpus of roughly 8,000 documents containing user reviews of films (the files loaded below contain 10,000 rows).  
- Each document is labelled with the polarity of its review (+1: positive, -1: negative).  
- The documents are stored in a CSV file (one review per line); a second CSV file holds the polarity of each review (-1/+1).  
- Line numbers correspond directly between the two files: line i of the labels file gives the polarity of line i of the documents file.
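
Because the two files are linked only by line position, it can be worth checking that they contain the same number of rows before joining them. The sketch below uses in-memory stand-ins for the two files; the sample reviews and the helper name `pair_reviews_with_labels` are illustrative and not part of the project.

```python
import csv
import io

def pair_reviews_with_labels(reviews_file, labels_file):
    """Pair each review with the label found on the same line number."""
    reviews = [row[0] for row in csv.reader(reviews_file, delimiter="\t")]
    labels = [int(row[0]) for row in csv.reader(labels_file, delimiter="\t")]
    if len(reviews) != len(labels):
        raise ValueError(f"{len(reviews)} reviews but {len(labels)} labels")
    return list(zip(reviews, labels))

# In-memory stand-ins for the two files (hypothetical content).
reviews_file = io.StringIO("A wonderful film\nDull and far too long\n")
labels_file = io.StringIO("1\n-1\n")

pairs = pair_reviews_with_labels(reviews_file, labels_file)
print(pairs)
```

Raising on a length mismatch makes a misaligned pair of files fail loudly instead of silently attaching the wrong polarity to a review.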

## Environment setup
---

In [59]:
import pandas as pd
import numpy as np
import unicodedata
import contractions
import inflect
import re

from collections import defaultdict

from nltk import pos_tag
from nltk import punkt
from nltk.corpus import stopwords, wordnet as wn
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

from sklearn import model_selection, naive_bayes, svm
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.utils import shuffle

##### UNCOMMENT THIS SECTION IF RUNNING FOR THE FIRST TIME
# import nltk
# nltk.download('wordnet')
# nltk.download('stopwords')
# nltk.download('punkt')  # tokenizer models required by word_tokenize
# nltk.download('averaged_perceptron_tagger')
#####

# Fix the NumPy random seed so that shuffling and splitting are reproducible
np.random.seed(500)

import seaborn as sns
sns.set(style = "darkgrid")

## Reading the files
---

- Load dataset.csv and labels.csv
- Store their contents in two separate pandas DataFrames, then join them

In [60]:
print("\nDataframe des avis")
df_avis = pd.read_csv('Dataset/dataset.csv', sep = '\t', header = None, names = ['Avis'], encoding ='utf-8')
display(df_avis.head())
display(df_avis.shape)

print("\nDataframe des scores")
df_score = pd.read_csv('Dataset/labels.csv', sep = '\t', header = None, names = ['Score'], encoding ='utf-8')
display(df_score.head())
display(df_score.shape)

print("\nDataframe merged")
df = df_avis.join(df_score)
display(df.head())
display(df.shape)


Dataframe des avis


Unnamed: 0,Avis
0,Obviously made to show famous 1950s stripper M...
1,This film was more effective in persuading me ...
2,Unless you are already familiar with the pop s...
3,From around the time Europe began fighting Wor...
4,Im not surprised that even cowgirls get the bl...


(10000, 1)


Dataframe des scores


Unnamed: 0,Score
0,-1
1,-1
2,-1
3,-1
4,-1


(10000, 1)


Dataframe merged


Unnamed: 0,Avis,Score
0,Obviously made to show famous 1950s stripper M...,-1
1,This film was more effective in persuading me ...,-1
2,Unless you are already familiar with the pop s...,-1
3,From around the time Europe began fighting Wor...,-1
4,Im not surprised that even cowgirls get the bl...,-1


(10000, 2)

## Data preprocessing
---

Structuring the data so that it can be manipulated

### Creating a shuffled copy of the dataset

In [61]:
# Shuffled copy of the original DataFrame
df2 = shuffle(df)
# Reset the index, dropping the old one
df2.reset_index(inplace = True, 
                drop = True)
display(df2.head())

Unnamed: 0,Avis,Score
0,After having read two or three negative review...,1
1,I recently (May 2008) discovered that this chi...,1
2,"Pathetic is the word. Bad acting, pathetic scr...",-1
3,Spencer Tracy and Katherine Hepburn would roll...,-1
4,This in my opinion is one of the best action m...,1


### Unused code, set aside

### ASCII encoding of the dataset

In [63]:
def remove_non_ascii(words):
    '''Removes non-ascii characters from words in dataset'''
    new_words = []
    for word in words:
        new_word = unicodedata.normalize('NFKD', word).encode('ascii', 'ignore').decode('utf-8', 'ignore')
        new_words.append(new_word)
    return new_words

print('Avant')
display(df2.head())

df2['Avis'] = remove_non_ascii(df2['Avis'])
print('Après')
display(df2.head())

Avant


Unnamed: 0,Avis,Score
0,After having read two or three negative review...,1
1,I recently (May 2008) discovered that this chi...,1
2,"Pathetic is the word. Bad acting, pathetic scr...",-1
3,Spencer Tracy and Katherine Hepburn would roll...,-1
4,This in my opinion is one of the best action m...,1


Après


Unnamed: 0,Avis,Score
0,After having read two or three negative review...,1
1,I recently (May 2008) discovered that this chi...,1
2,"Pathetic is the word. Bad acting, pathetic scr...",-1
3,Spencer Tracy and Katherine Hepburn would roll...,-1
4,This in my opinion is one of the best action m...,1
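
The five rows shown are identical before and after because they contain only ASCII characters. To see what the normalization does, here is a small sketch on an accented string; the sample text and the name `to_ascii` are illustrative.

```python
import unicodedata

def to_ascii(text):
    """Decompose accented characters (NFKD), then drop anything outside ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

sample = "Amélie is a naïve café romance"
print(to_ascii(sample))   # Amelie is a naive cafe romance
print(to_ascii("Søren"))  # Sren: 'ø' has no decomposition, so it is dropped
```

Characters that decompose into a base letter plus a combining mark keep their base letter; characters without a decomposition disappear entirely.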


### Expanding contractions

In [64]:
def replace_contractions(sentences):
    new_sentences = []
    for sentence in sentences:
        new_sentences.append(contractions.fix(sentence))
    return new_sentences

print('Avant')
display(df2.head())

df2['Avis'] = replace_contractions(df2['Avis'])
print('Après')
display(df2.head())

Avant


Unnamed: 0,Avis,Score
0,After having read two or three negative review...,1
1,I recently (May 2008) discovered that this chi...,1
2,"Pathetic is the word. Bad acting, pathetic scr...",-1
3,Spencer Tracy and Katherine Hepburn would roll...,-1
4,This in my opinion is one of the best action m...,1


Après


Unnamed: 0,Avis,Score
0,After having read two or three negative review...,1
1,I recently (May 2008) discovered that this chi...,1
2,"Pathetic is the word. Bad acting, pathetic scr...",-1
3,Spencer Tracy and Katherine Hepburn would roll...,-1
4,This in my opinion is one of the best action m...,1
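
The rows shown contain no contractions in their visible part, so the effect is not apparent here. The sketch below illustrates the idea with a tiny hand-written mapping; it is a simplified stand-in for `contractions.fix`, not the package's actual implementation, and the mapping is an assumption made for the example.

```python
import re

# Tiny hand-written mapping used only for illustration; the real
# `contractions` package covers far more forms.
CONTRACTIONS = {
    "don't": "do not",
    "isn't": "is not",
    "can't": "cannot",
    "it's": "it is",
}

PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b",
    flags=re.IGNORECASE,
)

def expand_contractions(text):
    """Replace each known contraction with its expanded form."""
    return PATTERN.sub(lambda match: CONTRACTIONS[match.group(0).lower()], text)

print(expand_contractions("I don't think it's bad"))  # I do not think it is bad
```

Expanding negations into an explicit "not" matters for this notebook because the stop-word step further down deliberately keeps "not".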


### Lowercasing the corpus

In [65]:
# Change all characters to lower case
print('Avant')
display(df2.head())

df2['Avis'] = [entry.lower() for entry in df2['Avis']]
print('Après')
display(df2.head())

Avant


Unnamed: 0,Avis,Score
0,After having read two or three negative review...,1
1,I recently (May 2008) discovered that this chi...,1
2,"Pathetic is the word. Bad acting, pathetic scr...",-1
3,Spencer Tracy and Katherine Hepburn would roll...,-1
4,This in my opinion is one of the best action m...,1


Après


Unnamed: 0,Avis,Score
0,after having read two or three negative review...,1
1,i recently (may 2008) discovered that this chi...,1
2,"pathetic is the word. bad acting, pathetic scr...",-1
3,spencer tracy and katherine hepburn would roll...,-1
4,this in my opinion is one of the best action m...,1


### Tokenization

In [66]:
print('Avant')
display(df2.head())

# Tokenization
df2['Avis'] = [word_tokenize(entry) for entry in df2['Avis']]
print('Après')
display(df2.head())

Avant


Unnamed: 0,Avis,Score
0,after having read two or three negative review...,1
1,i recently (may 2008) discovered that this chi...,1
2,"pathetic is the word. bad acting, pathetic scr...",-1
3,spencer tracy and katherine hepburn would roll...,-1
4,this in my opinion is one of the best action m...,1


Après


Unnamed: 0,Avis,Score
0,"[after, having, read, two, or, three, negative...",1
1,"[i, recently, (, may, 2008, ), discovered, tha...",1
2,"[pathetic, is, the, word, ., bad, acting, ,, p...",-1
3,"[spencer, tracy, and, katherine, hepburn, woul...",-1
4,"[this, in, my, opinion, is, one, of, the, best...",1


### Replacing numbers with words

In [67]:
def replace_numbers(tokens):
    '''Transforms all digits into their letter equivalents in dataset'''
    p = inflect.engine()
    new_tokens = []

    for token in tokens:
        if token.isdigit():
            new_token = p.number_to_words(token)
            new_tokens.append(new_token)
        else:
            new_tokens.append(token)
    return new_tokens

print('Avant')
display(df2.head())

# Number Replacement
df2['Avis'] = [replace_numbers(tokens) for tokens in df2['Avis']]
print('Après')
display(df2.head())

Avant


Unnamed: 0,Avis,Score
0,"[after, having, read, two, or, three, negative...",1
1,"[i, recently, (, may, 2008, ), discovered, tha...",1
2,"[pathetic, is, the, word, ., bad, acting, ,, p...",-1
3,"[spencer, tracy, and, katherine, hepburn, woul...",-1
4,"[this, in, my, opinion, is, one, of, the, best...",1


Après


Unnamed: 0,Avis,Score
0,"[after, having, read, two, or, three, negative...",1
1,"[i, recently, (, may, two thousand and eight, ...",1
2,"[pathetic, is, the, word, ., bad, acting, ,, p...",-1
3,"[spencer, tracy, and, katherine, hepburn, woul...",-1
4,"[this, in, my, opinion, is, one, of, the, best...",1
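
Only tokens made up entirely of digits are converted, and the result is a single token containing spaces ("two thousand and eight"). The short sketch below, on made-up tokens, shows which kinds of tokens `str.isdigit()` lets through.

```python
tokens = ["2008", "1970s", "10/10", "3.5", "seven"]

# str.isdigit() is True only when every character is a digit,
# so mixed tokens are left untouched by replace_numbers.
converted = [token for token in tokens if token.isdigit()]
skipped = [token for token in tokens if not token.isdigit()]

print(converted)  # ['2008']
print(skipped)    # ['1970s', '10/10', '3.5', 'seven']
```

This is consistent with tokens such as `1970s` surviving in the later outputs.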


### Stop Words

In [68]:
def remove_stopwords(tokens, exceptions):
    '''Removes all stopwords from dataset'''
    new_tokens = []
    stop = set(stopwords.words('english')) - exceptions
    
    for token in tokens:
        if token not in stop:
            new_tokens.append(token)
    return new_tokens

exceptions = set(('not',))

print('Avant')
display(df2.head())

# Stop Words Removal
df2['Avis'] = [remove_stopwords(tokens, exceptions) for tokens in df2['Avis']]
print('Après')
display(df2.head())

Avant


Unnamed: 0,Avis,Score
0,"[after, having, read, two, or, three, negative...",1
1,"[i, recently, (, may, two thousand and eight, ...",1
2,"[pathetic, is, the, word, ., bad, acting, ,, p...",-1
3,"[spencer, tracy, and, katherine, hepburn, woul...",-1
4,"[this, in, my, opinion, is, one, of, the, best...",1


Après


Unnamed: 0,Avis,Score
0,"[read, two, three, negative, reviews, main, pa...",1
1,"[recently, (, may, two thousand and eight, ), ...",1
2,"[pathetic, word, ., bad, acting, ,, pathetic, ...",-1
3,"[spencer, tracy, katherine, hepburn, would, ro...",-1
4,"[opinion, one, best, action, movies, 1970s, .,...",1
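
The exception set keeps "not" so that negation survives the filtering. A minimal sketch of the same set arithmetic, using a hypothetical miniature stop list instead of NLTK's English list:

```python
# Hypothetical miniature stop list; the notebook uses NLTK's English list.
STOP_WORDS = {"this", "is", "not", "a", "the", "it"}
KEEP = {"not"}

# Build the effective stop set once, outside the per-document loop.
effective_stop = STOP_WORDS - KEEP

def drop_stopwords(tokens):
    """Return the tokens that are not in the effective stop set."""
    return [token for token in tokens if token not in effective_stop]

print(drop_stopwords(["this", "is", "not", "a", "good", "film"]))
# ['not', 'good', 'film']
```

Computing the set difference once rather than inside the function avoids rebuilding it for every review.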


### Punctuation removal

In [69]:
def remove_punctuation(words):
    '''Removes all punctuation from dataset'''
    new_words = []
    for word in words:
        new_word = re.sub(r'[^\w\s]', '', word)
        if new_word != '':
            new_words.append(new_word)
    return new_words

print('Avant')
display(df2.head())

# Punctuation removal
df2['Avis'] = [remove_punctuation(tokens) for tokens in df2['Avis']]
print('Après')
display(df2.head())

Avant


Unnamed: 0,Avis,Score
0,"[read, two, three, negative, reviews, main, pa...",1
1,"[recently, (, may, two thousand and eight, ), ...",1
2,"[pathetic, word, ., bad, acting, ,, pathetic, ...",-1
3,"[spencer, tracy, katherine, hepburn, would, ro...",-1
4,"[opinion, one, best, action, movies, 1970s, .,...",1


Après


Unnamed: 0,Avis,Score
0,"[read, two, three, negative, reviews, main, pa...",1
1,"[recently, may, two thousand and eight, discov...",1
2,"[pathetic, word, bad, acting, pathetic, script...",-1
3,"[spencer, tracy, katherine, hepburn, would, ro...",-1
4,"[opinion, one, best, action, movies, 1970s, no...",1
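
Tokens that consist only of punctuation disappear, while punctuation inside a token is deleted and the remaining pieces are fused together. The sketch below applies the same regular expression to a few made-up tokens.

```python
import re

def strip_punctuation(tokens):
    """Delete non-word, non-space characters and drop tokens left empty."""
    cleaned = (re.sub(r"[^\w\s]", "", token) for token in tokens)
    return [token for token in cleaned if token]

print(strip_punctuation(["10/10", ".", "well-made", "(", "great", "!"]))
# ['1010', 'wellmade', 'great']
```

This fusing behaviour is a plausible source of vocabulary entries such as `1010` or `101yearold` that appear in the feature list later in the notebook.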


### Lemmatization

In [70]:
#creation of the default dictionary of POS tags
tag_map = defaultdict(lambda : wn.NOUN)
tag_map['J'] = wn.ADJ
tag_map['V'] = wn.VERB
tag_map['R'] = wn.ADV

# Initializing WordNetLemmatizer()
word_Lemmatized = WordNetLemmatizer()

def lemmatize(tokens):
    # Lemmatized tokens, in their original order
    Final_words = []
    # pos_tag returns (word, tag) pairs; the first letter of the tag (N, V, J, R, ...)
    # selects the WordNet part of speech, defaulting to noun
    for word, tag in pos_tag(tokens):
        word_Final = word_Lemmatized.lemmatize(word, tag_map[tag[0]])
        Final_words.append(word_Final)
    return Final_words

print('Avant')
display(df2.head())

df2['Avis'] = [lemmatize(tokens) for tokens in df2['Avis']]
print('Après')
display(df2.head())

Avant


Unnamed: 0,Avis,Score
0,"[read, two, three, negative, reviews, main, pa...",1
1,"[recently, may, two thousand and eight, discov...",1
2,"[pathetic, word, bad, acting, pathetic, script...",-1
3,"[spencer, tracy, katherine, hepburn, would, ro...",-1
4,"[opinion, one, best, action, movies, 1970s, no...",1


Après


Unnamed: 0,Avis,Score
0,"[read, two, three, negative, review, main, pag...",1
1,"[recently, may, two thousand and eight, discov...",1
2,"[pathetic, word, bad, act, pathetic, script, c...",-1
3,"[spencer, tracy, katherine, hepburn, would, ro...",-1
4,"[opinion, one, best, action, movie, 1970s, not...",1
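
The lemmatizer depends on the part of speech it is given, which is why "acting" becomes "act" when tagged as a verb. The sketch below isolates the tag mapping, using the plain string values of the WordNet constants so that it runs without NLTK data; the example tags are chosen by hand.

```python
from collections import defaultdict

# WordNet POS constants as plain strings (wn.NOUN, wn.ADJ, wn.VERB, wn.ADV).
NOUN, ADJ, VERB, ADV = "n", "a", "v", "r"

tag_map = defaultdict(lambda: NOUN)
tag_map["J"] = ADJ
tag_map["V"] = VERB
tag_map["R"] = ADV

# Example Penn Treebank tags of the kind pos_tag returns.
example_tags = ["JJ", "VBG", "RB", "NNS", "CD"]
mapped = [tag_map[tag[0]] for tag in example_tags]

for tag, pos in zip(example_tags, mapped):
    print(tag, "->", pos)
```

Any tag whose first letter is not J, V or R, including numbers (`CD`), falls back to noun.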


### Checkpoint 

In [71]:
df3 = df2.copy()
# df3 is the DataFrame used from here on

### Word Vectorization

In [72]:
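def identity_tokenizer(text):
    # The reviews are already token lists, so hand them through unchanged
    return text

vect = TfidfVectorizer(tokenizer=identity_tokenizer, lowercase=False)
vectors = vect.fit_transform(df3['Avis'])
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() there
display(vect.get_feature_names())

# Keep the TF-IDF matrix sparse: MultinomialNB and SVC accept it directly
X = vectors
y = df3['Score']

# Preview only the first rows: densifying the whole matrix raised the MemoryError recorded below
df_test = pd.DataFrame(data=vectors[:5].toarray(), columns=vect.get_feature_names())
display(df_test)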

['0',
 '00',
 '0000000000001',
 '000001',
 '0001',
 '00015',
 '001',
 '002',
 '00383042',
 '007',
 '00s',
 '01',
 '010',
 '01000',
 '010guinea',
 '010ps',
 '0110',
 '012310',
 '02',
 '02i',
 '0310',
 '04',
 '048',
 '05',
 '050',
 '0510',
 '053105',
 '06',
 '07',
 '089',
 '09082009',
 '091505',
 '0mood',
 '0stars',
 '1',
 '10',
 '100',
 '1000',
 '10000',
 '100000',
 '1000000',
 '10000000',
 '1000lb',
 '100am',
 '100k',
 '100m',
 '100minute',
 '100mph',
 '100square',
 '100th',
 '100thgrade',
 '100x',
 '100yards',
 '100years',
 '101',
 '1010',
 '1010eliason',
 '1010james',
 '1010replayable',
 '1010zafoid',
 '1011',
 '1012',
 '1013',
 '1014',
 '1015',
 '101503',
 '101end',
 '101year',
 '101yearold',
 '102030',
 '1025',
 '102nd',
 '1035',
 '1040a',
 '105lbs',
 '107',
 '1075',
 '10acting',
 '10as',
 '10bad',
 '10check',
 '10day',
 '10four',
 '10hour',
 '10i',
 '10line',
 '10minute',
 '10now',
 '10pm',
 '10ps',
 '10scale',
 '10second',
 '10star',
 '10th',
 '10the',
 '10this',
 '10times',
 '10

MemoryError: 
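
The `MemoryError` is raised when `.toarray()` turns the sparse TF-IDF matrix into a dense one with a column for every vocabulary entry. A back-of-the-envelope estimate shows the scale; the vocabulary size used here is a hypothetical figure, since the notebook does not report it (it can be read from `len(vect.vocabulary_)`).

```python
def dense_size_gib(n_rows, n_columns, bytes_per_value=8):
    """Memory needed for a dense float64 matrix, in GiB."""
    return n_rows * n_columns * bytes_per_value / 2**30

n_documents = 10_000          # row count reported by df.shape above
assumed_vocabulary = 80_000   # hypothetical; check len(vect.vocabulary_)

print(f"{dense_size_gib(n_documents, assumed_vocabulary):.1f} GiB")  # 6.0 GiB
```

Since most entries of a TF-IDF matrix are zero, keeping the sparse output of `fit_transform` avoids this cost.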

### Train/Test Data sets

In [7]:
# Hold out 30% of the reviews for testing and train on the remaining 70%
test_size = 0.3
train_size = 1 - test_size
Train_X, Test_X, Train_Y, Test_Y = model_selection.train_test_split(
    X,
    y,
    train_size = train_size,
    test_size = test_size
)

### ML algorithms
---
#### Naive Bayes

In [10]:
Naive = naive_bayes.MultinomialNB()
Naive.fit(Train_X, Train_Y)

res_NB = Naive.predict(Test_X)
# accuracy_score expects (y_true, y_pred)
print("NB accuracy -> ", accuracy_score(Test_Y, res_NB)*100, "%")

NB accuracy ->  48.13333333333333 %
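
An accuracy of about 48% on a two-class task is at the level of a constant predictor if the classes are roughly balanced, so it is useful to compare it against a majority-class baseline. A minimal sketch on a made-up label sample; the helper name is illustrative.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

# Hypothetical balanced label sample.
labels = [1, -1, 1, -1, 1, -1, 1, -1]
print(majority_baseline_accuracy(labels))  # 0.5
```

A model that does not beat this figure has learned nothing usable from the features, which usually points to a problem upstream of the classifier.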


#### Gaussian Naive Bayes

In [None]:
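# GaussianNB needs dense arrays: a sparse TF-IDF matrix must be converted
# with .toarray() first, which may not fit in memory for the full vocabulary
GNB = naive_bayes.GaussianNB()
GNB.fit(Train_X, Train_Y)

res_GNB = GNB.predict(Test_X)
print("GNB accuracy -> ", accuracy_score(Test_Y, res_GNB)*100, "%")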

#### SVM

In [11]:
# degree and gamma are ignored by the linear kernel, so they are omitted
SVM = svm.SVC(C = 1.0, kernel = 'linear')
SVM.fit(Train_X, Train_Y)

res_SVM = SVM.predict(Test_X)

print("SVM accuracy -> ", accuracy_score(Test_Y, res_SVM)*100, "%")

SVM accuracy ->  47.5 %
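
Both recorded accuracies are close to 50%. One thing worth ruling out is a mismatch between the feature matrix and the labels. The sketch below keeps vectorization and classification in a single scikit-learn pipeline so that the two always stay aligned. It runs on a toy corpus (hypothetical data) and uses `LinearSVC` in place of `svm.SVC(kernel='linear')`, which is a substitution made for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def passthrough(tokens):
    """The documents are already token lists, so return them unchanged."""
    return tokens

# Toy pre-tokenized reviews (hypothetical data).
docs = [
    ["great", "movie"], ["wonderful", "acting"], ["great", "script"],
    ["wonderful", "movie"], ["bad", "movie"], ["awful", "acting"],
    ["bad", "script"], ["awful", "movie"],
]
labels = [1, 1, 1, 1, -1, -1, -1, -1]

# Stratify so that both classes appear in the training and test parts.
train_docs, test_docs, train_y, test_y = train_test_split(
    docs, labels, test_size=0.25, stratify=labels, random_state=0
)

# The vectorizer is fitted on the training documents only.
model = make_pipeline(TfidfVectorizer(analyzer=passthrough), LinearSVC())
model.fit(train_docs, train_y)

predictions = model.predict(test_docs)
print(accuracy_score(test_y, predictions))
```

The toy corpus is far too small for the printed score to mean anything; the point is the structure, in which the same split drives both the features and the labels.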


### TODO - Optimization
1. preserve '!' and '?' in corpus
1. preserve fully capitalized words
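
One possible way to approach these two items is sketched below. It assumes that lowercasing is moved after tokenization, which differs from the current order of the notebook, and the function names are illustrative.

```python
import re

def normalize_token(token):
    """Lowercase a token unless it is a fully capitalized word of 2+ letters."""
    if len(token) > 1 and token.isalpha() and token.isupper():
        return token
    return token.lower()

def strip_punctuation_keep_marks(token):
    """Remove punctuation except '!' and '?'."""
    return re.sub(r"[^\w\s!?]", "", token)

tokens = ["This", "movie", "is", "AWFUL", "!", "Why", "?", ",", "I", "left", "."]
processed = [strip_punctuation_keep_marks(normalize_token(t)) for t in tokens]
processed = [t for t in processed if t]
print(processed)
# ['this', 'movie', 'is', 'AWFUL', '!', 'why', '?', 'i', 'left']
```

The length check keeps the pronoun "I" from being treated as an emphasized word.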