<p>BELLEL OUSSAMA : 21817080</p>
<p>BOURGIN Jérémy : 20140477</p>
<p>DIABATE Amara : 21509428</p>
<p>HERRI Abdallah : 21712664</p>

<h1>Data initialization</h1>

In [1]:
import pandas as pd
import sys
import re

sys.path.insert(0, "../lib")

import os

import nltk
from nltk import ngrams
from nltk.util import everygrams
from nltk.parse import CoreNLPParser
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
import pickle

os.environ["NLTK_DATA"] = "../"
nltk.data.path.append("../nltk_data")

pos_tagger = CoreNLPParser(url='http://localhost:9000', tagtype='pos')

<h1>Importing and joining the data</h1>

In [2]:
df_set = pd.read_csv('../data/dataset.csv',sep="\t", header=None, encoding="utf8")
df_set = df_set.rename(index = int, columns = {0: "comment"})
#display(df1)

df_labels = pd.read_csv('../data/labels.csv',sep="\t", header=None, encoding="utf8")
df_labels = df_labels.rename(index = int, columns = {0: "result"})
#display(df2)

df_raw = pd.concat([df_set, df_labels], axis=1)

df_raw

Unnamed: 0,comment,result
0,Obviously made to show famous 1950s stripper M...,-1
1,This film was more effective in persuading me ...,-1
2,Unless you are already familiar with the pop s...,-1
3,From around the time Europe began fighting Wor...,-1
4,Im not surprised that even cowgirls get the bl...,-1
5,(48 out of 278 people found this comment usefu...,-1
6,Went to watch this movie expecting a 'nothing ...,-1
7,A good cast and they do their best with what t...,-1
8,The only thing that kept me from vomiting afte...,-1
9,"I just watched this film 15 minutes ago, and I...",-1
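
<p>A minimal sketch of the join above, using two tiny in-memory frames instead of the CSV files (the sample rows are made up for illustration). pd.concat with axis=1 pairs rows by index, so the comments and the labels must be stored in the same order.</p>

```python
import pandas as pd

# Hypothetical stand-ins for dataset.csv and labels.csv
comments = pd.DataFrame({"comment": ["great film", "boring plot"]})
labels = pd.DataFrame({"result": [1, -1]})

# axis=1 places the frames side by side, matching rows by index
joined = pd.concat([comments, labels], axis=1)

print(joined.shape)          # (2, 2)
print(list(joined.columns))  # ['comment', 'result']
```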


<h1>Preprocessing</h1>

<h2>Text cleaning</h2>
<ul>
    <li>Remove links</li>
    <li>Remove e-mail addresses</li>
    <li>Remove double quotes (NLTK would treat "word" and word as two distinct words)</li>
    <li>Add a space before and after each punctuation mark (if someone makes a typo and writes word1.word2, NLTK would treat it as a single word)</li>
    <li>Remove unnecessary single quotes</li>
</ul>

In [3]:
def clean_text(text):
    replace_chars = [
        [r'''(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))''', "", True],
        [r'[\w\.-]+@[\w\.-]+', "", True],
        ["\"", "", False],
        ["|", " ", False],
        [".", " . ", False],
        [",", " , ", False],
        [";", " ; ", False],
        ["?", " ? ", False],
        [":", " : ", False],
        ["!", " ! ", False],
        ["``", "", False],
        ["`", "'", False],
        [" '", " ", False],
        ["' ", " ", False]
    ]

    for e in replace_chars:
        if (e[2]):
            text = re.sub(e[0], e[1], text, flags=re.MULTILINE)
        else:
            text = text.replace(e[0], e[1])

    return text
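
<p>To illustrate why punctuation is padded with spaces, here is a small sketch. The helper space_punctuation is hypothetical (a regex shortcut for the chained replacements above), and plain whitespace splitting stands in for the NLTK tokenizer to keep the example self-contained.</p>

```python
import re

def space_punctuation(text):
    # Surround each listed punctuation mark with spaces
    return re.sub(r"([.,;?:!])", r" \1 ", text)

raw = "Great movie.Loved it!"
tokens_before = raw.split()
tokens_after = space_punctuation(raw).split()

print(tokens_before)  # ['Great', 'movie.Loved', 'it!']
print(tokens_after)   # ['Great', 'movie', '.', 'Loved', 'it', '!']
```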


<h2>Tokenizing text with NLTK</h2>

In [4]:
def text_to_token(text):
    tokens = nltk.word_tokenize(text)
    
    return tokens


<p>Detokenizer</p>

In [5]:
def token_to_text(tokens, delete_words = [], rename_words = {}, concat_words = []):    
    r = ""
    rename_words_keys = rename_words.keys()
    
    for w in tokens:
        if (w in delete_words):
            continue
        
        if (w in rename_words_keys):
            w = rename_words[w]
        
        r += w
        
        if (w in concat_words):
            r += "|"
        else:
            r += " "
    
    return r

<h2>Token filtering</h2>
<p>Adjectives qualify the things a sentence talks about, which is why we keep only the adjectives (the tag list passed in pre_traitement also includes comparative and superlative adverbs).</p>

In [6]:
def filter_tags(tokens, ls):
    temp_tokens = []
    
    for e in tokens:
        o = e.split("|")
        e = o[-1]
        
        if (len(e) == 0):
            e = o[0]
        
        temp_tokens.append(e)
        
    #old way : NLTK
    tags = nltk.pos_tag(temp_tokens)
    
    #new way : stanford
    #tags = pos_tagger.tag(temp_tokens)
    
    r = []
    
    for i in range(0, len(tags)):
        e = tags[i]
        
        for t in ls:
            if e[1] == t:
                w = tokens[i].replace("|", "")
                r.append(w)
        
    return r
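
<p>The filtering step can be sketched without a tagger by starting from (word, tag) pairs. The pairs below are hard-coded assumptions for illustration, not real tagger output; the tag set is the one used later in pre_traitement (Penn Treebank tags for adjectives and for comparative and superlative adverbs).</p>

```python
# (word, tag) pairs as a POS tagger might return them; hard-coded here
tagged = [
    ("this", "DT"), ("film", "NN"), ("is", "VBZ"),
    ("awful", "JJ"), ("and", "CC"), ("worse", "JJR"),
    ("than", "IN"), ("expected", "VBN"),
]

keep_tags = {"JJ", "JJR", "JJS", "RBR", "RBS"}

# Keep only the words whose tag is in the allowed set
kept = [word for word, tag in tagged if tag in keep_tags]

print(kept)  # ['awful', 'worse']
```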


<h2>NLTK: n-grams, everygrams, and stop words</h2>

<p>Test</p>
<p>Conclusion of the test: n-grams and everygrams are both interesting, but it is still hard to choose between them. We nevertheless expect everygrams to do better, since they may adapt better to other datasets, whereas n-grams may be better on the initial dataset but worse on others. Note that both techniques need stop-word removal to produce more meaningful word associations.</p>

In [7]:
text = "N-GRAMS and EVERYGRAMS are pretty good, but I don't know which will be better. This exemple is not so good. I'm into bad."
tokens = nltk.word_tokenize(text)

n = 4
twograms = list(ngrams(tokens, n))
twoeverygrams = list(everygrams(tokens, max_len=n))

stopWords = set(stopwords.words('english'))
tokens_word = word_tokenize(text)

tokens_filtered = []
text_filtered = ""

for w in tokens_word:
    if (w == "n't"):
        w = "not"
    
    if (w == "not") or (w not in stopWords):
        tokens_filtered.append(w)
        text_filtered += w + " "

twogramsstopword = list(ngrams(tokens_filtered, n))
twoeverygramsstopword = list(everygrams(tokens_filtered, max_len=n))

print("----------------------------------------------")

print ("original text : ")
print(text)

print ("stop word : ")
print(text_filtered)

print("----------------------------------------------")

print("TWO GRAMS")
display (twograms)
print("EVERY GRAMS")
display (twoeverygrams)

print("----------------------------------------------")

print("TWO GRAMS + stop words")
display (twogramsstopword)
print("EVERY GRAMS + stop words")
display (twoeverygramsstopword)

print("----------------------------------------------")

----------------------------------------------
original text : 
N-GRAMS and EVERYGRAMS are pretty good, but I don't know which will be better. This exemple is not so good. I'm into bad.
stop word : 
N-GRAMS EVERYGRAMS pretty good , I not know better . This exemple not good . I 'm bad . 
----------------------------------------------
TWO GRAMS


[('N-GRAMS', 'and', 'EVERYGRAMS', 'are'),
 ('and', 'EVERYGRAMS', 'are', 'pretty'),
 ('EVERYGRAMS', 'are', 'pretty', 'good'),
 ('are', 'pretty', 'good', ','),
 ('pretty', 'good', ',', 'but'),
 ('good', ',', 'but', 'I'),
 (',', 'but', 'I', 'do'),
 ('but', 'I', 'do', "n't"),
 ('I', 'do', "n't", 'know'),
 ('do', "n't", 'know', 'which'),
 ("n't", 'know', 'which', 'will'),
 ('know', 'which', 'will', 'be'),
 ('which', 'will', 'be', 'better'),
 ('will', 'be', 'better', '.'),
 ('be', 'better', '.', 'This'),
 ('better', '.', 'This', 'exemple'),
 ('.', 'This', 'exemple', 'is'),
 ('This', 'exemple', 'is', 'not'),
 ('exemple', 'is', 'not', 'so'),
 ('is', 'not', 'so', 'good'),
 ('not', 'so', 'good', '.'),
 ('so', 'good', '.', 'I'),
 ('good', '.', 'I', "'m"),
 ('.', 'I', "'m", 'into'),
 ('I', "'m", 'into', 'bad'),
 ("'m", 'into', 'bad', '.')]

EVERY GRAMS


[('N-GRAMS',),
 ('and',),
 ('EVERYGRAMS',),
 ('are',),
 ('pretty',),
 ('good',),
 (',',),
 ('but',),
 ('I',),
 ('do',),
 ("n't",),
 ('know',),
 ('which',),
 ('will',),
 ('be',),
 ('better',),
 ('.',),
 ('This',),
 ('exemple',),
 ('is',),
 ('not',),
 ('so',),
 ('good',),
 ('.',),
 ('I',),
 ("'m",),
 ('into',),
 ('bad',),
 ('.',),
 ('N-GRAMS', 'and'),
 ('and', 'EVERYGRAMS'),
 ('EVERYGRAMS', 'are'),
 ('are', 'pretty'),
 ('pretty', 'good'),
 ('good', ','),
 (',', 'but'),
 ('but', 'I'),
 ('I', 'do'),
 ('do', "n't"),
 ("n't", 'know'),
 ('know', 'which'),
 ('which', 'will'),
 ('will', 'be'),
 ('be', 'better'),
 ('better', '.'),
 ('.', 'This'),
 ('This', 'exemple'),
 ('exemple', 'is'),
 ('is', 'not'),
 ('not', 'so'),
 ('so', 'good'),
 ('good', '.'),
 ('.', 'I'),
 ('I', "'m"),
 ("'m", 'into'),
 ('into', 'bad'),
 ('bad', '.'),
 ('N-GRAMS', 'and', 'EVERYGRAMS'),
 ('and', 'EVERYGRAMS', 'are'),
 ('EVERYGRAMS', 'are', 'pretty'),
 ('are', 'pretty', 'good'),
 ('pretty', 'good', ','),
 ('good', ',', 'b

----------------------------------------------
TWO GRAMS + stop words


[('N-GRAMS', 'EVERYGRAMS', 'pretty', 'good'),
 ('EVERYGRAMS', 'pretty', 'good', ','),
 ('pretty', 'good', ',', 'I'),
 ('good', ',', 'I', 'not'),
 (',', 'I', 'not', 'know'),
 ('I', 'not', 'know', 'better'),
 ('not', 'know', 'better', '.'),
 ('know', 'better', '.', 'This'),
 ('better', '.', 'This', 'exemple'),
 ('.', 'This', 'exemple', 'not'),
 ('This', 'exemple', 'not', 'good'),
 ('exemple', 'not', 'good', '.'),
 ('not', 'good', '.', 'I'),
 ('good', '.', 'I', "'m"),
 ('.', 'I', "'m", 'bad'),
 ('I', "'m", 'bad', '.')]

EVERY GRAMS + stop words


[('N-GRAMS',),
 ('EVERYGRAMS',),
 ('pretty',),
 ('good',),
 (',',),
 ('I',),
 ('not',),
 ('know',),
 ('better',),
 ('.',),
 ('This',),
 ('exemple',),
 ('not',),
 ('good',),
 ('.',),
 ('I',),
 ("'m",),
 ('bad',),
 ('.',),
 ('N-GRAMS', 'EVERYGRAMS'),
 ('EVERYGRAMS', 'pretty'),
 ('pretty', 'good'),
 ('good', ','),
 (',', 'I'),
 ('I', 'not'),
 ('not', 'know'),
 ('know', 'better'),
 ('better', '.'),
 ('.', 'This'),
 ('This', 'exemple'),
 ('exemple', 'not'),
 ('not', 'good'),
 ('good', '.'),
 ('.', 'I'),
 ('I', "'m"),
 ("'m", 'bad'),
 ('bad', '.'),
 ('N-GRAMS', 'EVERYGRAMS', 'pretty'),
 ('EVERYGRAMS', 'pretty', 'good'),
 ('pretty', 'good', ','),
 ('good', ',', 'I'),
 (',', 'I', 'not'),
 ('I', 'not', 'know'),
 ('not', 'know', 'better'),
 ('know', 'better', '.'),
 ('better', '.', 'This'),
 ('.', 'This', 'exemple'),
 ('This', 'exemple', 'not'),
 ('exemple', 'not', 'good'),
 ('not', 'good', '.'),
 ('good', '.', 'I'),
 ('.', 'I', "'m"),
 ('I', "'m", 'bad'),
 ("'m", 'bad', '.'),
 ('N-GRAMS', 'EVER

----------------------------------------------
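
<p>The difference between the two outputs above can be sketched in pure Python. These two helpers are illustrative reimplementations, not NLTK's own code, and the order in which NLTK returns everygrams may differ. For the 29-token test sentence and n = 4, the first helper would yield 29 - 4 + 1 = 26 tuples, which matches the length of the first list printed above.</p>

```python
def simple_ngrams(tokens, n):
    # All contiguous windows of exactly n tokens
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def simple_everygrams(tokens, max_len):
    # All n-grams for n = 1 .. max_len, shortest first
    grams = []
    for n in range(1, max_len + 1):
        grams.extend(simple_ngrams(tokens, n))
    return grams

sample_tokens = ["not", "so", "good", "."]

print(simple_ngrams(sample_tokens, 2))
# [('not', 'so'), ('so', 'good'), ('good', '.')]
print(len(simple_everygrams(sample_tokens, 3)))  # 4 + 3 + 2 = 9
```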


<h1>Stop words + negation detection</h1>

In [8]:
def clean_and_negation(text):
    tokens = word_tokenize(text)
    stopWords = set(stopwords.words('english'))
    stopWords.remove('not') 
    
    r = token_to_text(
        tokens,
        delete_words=stopWords,
        rename_words={"n't": "not"},
        concat_words=["not"]
    )
    
    return r
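
<p>A self-contained sketch of the idea behind clean_and_negation: stop words are dropped, "n't" is normalized to "not", and every "not" is glued to the following word so that the negated word becomes a feature of its own. The stop-word set and the helper mark_negation are assumptions made for this example; the notebook relies on NLTK's English list and on the "|" marker instead.</p>

```python
# Tiny stand-in stop-word list; the notebook uses NLTK's English list
STOP_WORDS = {"this", "is", "the", "a", "do"}

def mark_negation(tokens):
    kept = []
    for w in tokens:
        if w == "n't":
            w = "not"          # normalize the contracted form
        if w in STOP_WORDS:
            continue           # drop stop words ("not" is not in the list)
        kept.append(w)

    # Glue every "not" to the word that follows it
    merged = []
    glue = False
    for w in kept:
        if glue:
            merged[-1] += w
            glue = False
        else:
            merged.append(w)
        if w == "not":
            glue = True
    return merged

print(mark_negation(["this", "is", "not", "good"]))  # ['notgood']
print(mark_negation(["do", "n't", "like", "it"]))    # ['notlike', 'it']
```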

<h1>Preprocessing function</h1>

In [9]:
def pre_traitement(text):
    text = clean_text(text)
    text = clean_and_negation(text)
    tokens = text_to_token(text)
    r = filter_tags(tokens, ["JJ", "JJR", "JJS", "RBR", "RBS"])
    
    return r


<p>Preprocessing example</p>

In [10]:
exemple = df_raw["comment"][4]
filtered_tags = pre_traitement(exemple)

exemple2 = "this is not good"
filtered_tags2 = pre_traitement(exemple2)

display(filtered_tags)
display(filtered_tags2)

['cowgirls', 'better', 'awful', 'incapable']

['notgood']

<h1>Helper functions</h1>

<p>Function to apply an algorithm</p>

In [11]:
def apply_algo(algo_avis_fun, algo_sarcasme_fun, text):
    result_avis = algo_avis_fun(text)
    result_sarcasme = algo_sarcasme_fun(text)
    
    if (result_sarcasme):
        return -result_avis
    else:
        return result_avis


<p>Function to measure the success rate of an algorithm</p>

In [12]:
def test_algo(algo_avis_fun, algo_sarcasme_fun, df):
    size = df.shape[0]
    coef = 0
    acc = []

    for i in range(0, size):
        filtered_tags = df["filtered_tags"][i]
        algo_result = apply_algo(algo_avis_fun, algo_sarcasme_fun, filtered_tags)
        acc.append(algo_result)
        
    accuracy = accuracy_score(df["result"], acc)
    report = classification_report(df["result"], acc)
    confusion = confusion_matrix(df["result"], acc)
    r = {"accuracy": accuracy, "report": report, "confusion": confusion}
    
    return r


<p>Default sarcasm algorithm (only for testing, as long as we have no algorithm to detect sarcasm)</p>

In [13]:
def dumb_sarcasme(text):
    return False


<h1>Preprocessing function for a DataFrame</h1>
<p>Function that builds a new DataFrame with all the preprocessing steps applied</p>

In [14]:
def parse_df(df):
    size = df.shape[0]
    
    data = {
        "filtered_tags": [],
        "text_filtered_tags": [],
        "clean_comment": [],
        "neg_comment": []
    }
    
    for i in range(0, size):
        text = df["comment"][i]
        text_cleaned = clean_text(text)
        text_neg = clean_and_negation(text_cleaned).replace("|", "")
        filtered_tags = pre_traitement(text)
        text_filtered_tags = token_to_text(filtered_tags)
        
        data["filtered_tags"].append(filtered_tags)
        data["text_filtered_tags"].append(text_filtered_tags)
        data["clean_comment"].append(text_cleaned)
        data["neg_comment"].append(text_neg)
    
    df_temp = pd.DataFrame(data)
    df_result = pd.concat([df, df_temp], axis=1)
    
    return df_result

In [15]:
clean_df = parse_df(df_raw)

clean_df

Unnamed: 0,comment,result,filtered_tags,text_filtered_tags,clean_comment,neg_comment
0,Obviously made to show famous 1950s stripper M...,-1,"[famous, bad, little, tale, innocent, white, l...",famous bad little tale innocent white likable ...,Obviously made to show famous 1950s stripper M...,Obviously made show famous 1950s stripper Mist...
1,This film was more effective in persuading me ...,-1,"[effective, Jewish, Freshman, alarmist, palata...",effective Jewish Freshman alarmist palatable p...,This film was more effective in persuading me ...,This film effective persuading Zionist conspir...
2,Unless you are already familiar with the pop s...,-1,"[familiar, stop, next, insist, Japanese, teen,...",familiar stop next insist Japanese teen idol p...,Unless you are already familiar with the pop s...,"Unless already familiar pop stars star film , ..."
3,From around the time Europe began fighting Wor...,-1,"[significant, young, enlist, serviceman, appea...",significant young enlist serviceman appear slo...,From around the time Europe began fighting Wor...,From around time Europe began fighting World W...
4,Im not surprised that even cowgirls get the bl...,-1,"[cowgirls, better, awful, incapable]",cowgirls better awful incapable,Im not surprised that even cowgirls get the bl...,Im notsurprised even cowgirls get blues movie ...
5,(48 out of 278 people found this comment usefu...,-1,"[found, useful, much, hollow, iconic, worse, m...",found useful much hollow iconic worse murder-s...,(48 out of 278 people found this comment usefu...,"( 48 278 people found comment useful , countin..."
6,Went to watch this movie expecting a 'nothing ...,-1,"[Went, much, disappointed, opening, little, fi...",Went much disappointed opening little first re...,Went to watch this movie expecting a nothing r...,Went watch movie expecting nothing really much...
7,A good cast and they do their best with what t...,-1,"[good, best, inexplicable, many, unintentional...",good best inexplicable many unintentional pier...,A good cast and they do their best with what t...,"A good cast best 're given , story makes sense..."
8,The only thing that kept me from vomiting afte...,-1,"[main, much, root, abyssmal, bad, worse, poor,...",main much root abyssmal bad worse poor first,The only thing that kept me from vomiting afte...,The thing kept vomiting seeing movie fact acto...
9,"I just watched this film 15 minutes ago, and I...",-1,"[star, realistic, murdered, sister, investigat...",star realistic murdered sister investigate loc...,"I just watched this film 15 minutes ago , and...","I watched film 15 minutes ago , I still idea I..."


<h1>Naive classifier</h1>

In [16]:
# dumb classification
def dumb_classifier(df, begin, end, step):
    adj_ls = {}

    for i in range(begin, end):
        filtered_tags = df["filtered_tags"][i]

        for e in filtered_tags:
            if not (e in adj_ls.keys()):
                adj_ls[e] = [0, 0]

            if (df["result"][i] == 1):
                j = 0
            else:
                j = 1

            adj_ls[e][j] += 1

    result = {}

    for key, e in adj_ls.items():
        calc = e[0] - e[1]
        if (calc > step or calc < -step):
            result[key] = e

    return result


In [17]:
def dumb_algo(text):
    result = 0
    
    # if already pre-treated
    if (isinstance(text, list)):
        filtered_tags = text
    else:  
        filtered_tags = pre_traitement(text)
    
    for e in filtered_tags:
        if (not (e in dumb_classified.keys())):
            continue
        
        adj_weight = dumb_classified[e]
        weight = adj_weight[0] - adj_weight[1]
        result += weight

    if (result >= 0):
        return 1
    else:
        return -1
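
<p>The two functions above amount to a word-count lexicon: each kept word stores how often it appears in positive and in negative comments, and a text is scored by summing the differences. Here is a compact sketch on made-up data; the names train, counts, score, and min_gap are illustrative, and the threshold is applied at scoring time rather than when the lexicon is built.</p>

```python
from collections import defaultdict

# Made-up training data: (list of kept words, label)
train = [
    (["good", "fun"], 1),
    (["good", "great"], 1),
    (["bad", "boring"], -1),
    (["bad", "good"], -1),
    (["bad"], -1),
]

# counts[word] = [occurrences in positive docs, occurrences in negative docs]
counts = defaultdict(lambda: [0, 0])
for words, label in train:
    for w in words:
        counts[w][0 if label == 1 else 1] += 1

def score(words, min_gap=0):
    total = 0
    for w in words:
        pos, neg = counts.get(w, (0, 0))
        if abs(pos - neg) > min_gap:   # ignore words without a clear tendency
            total += pos - neg
    return 1 if total >= 0 else -1

print(counts["good"])             # [2, 1]
print(score(["good", "fun"]))     # 1
print(score(["bad", "boring"]))   # -1
```

<p>As in dumb_algo, a total of 0 is labeled positive, so a text without any known word also falls into the positive class.</p>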


<h2>Testing the dumb classifier</h2>

<p>Classification</p>

In [18]:
# example of dumb classification
size = clean_df.shape[0]
dumb_classified = dumb_classifier(clean_df, 0, size, 5)

dumb_classified

{'famous': [169, 98],
 'bad': [540, 2936],
 'little': [1055, 856],
 'tale': [70, 24],
 'innocent': [85, 53],
 'white': [237, 193],
 'likable': [85, 44],
 'real': [932, 715],
 'underwear': [9, 15],
 'atrocious': [5, 81],
 'terrible': [96, 628],
 'notenough': [32, 72],
 'effective': [112, 38],
 'possible': [168, 232],
 'good': [2516, 2212],
 'familiar': [96, 57],
 'stop': [33, 39],
 'next': [283, 272],
 'Japanese': [172, 87],
 'teen': [29, 45],
 'nonsensical': [5, 25],
 'ready': [48, 57],
 'endless': [14, 50],
 'various': [129, 91],
 'dramatic': [140, 71],
 'stupid': [92, 541],
 'spent': [11, 29],
 'mysterious': [82, 57],
 'pointless': [5, 65],
 'piece': [28, 66],
 'insipid': [3, 28],
 'mindless': [11, 27],
 'young': [826, 358],
 'soldier': [29, 20],
 'devoid': [3, 24],
 'best': [1436, 635],
 'bunch': [38, 119],
 'musical': [217, 90],
 'leading': [16, 10],
 'naive': [42, 33],
 'meant': [25, 40],
 'better': [732, 1008],
 'awful': [46, 568],
 'incapable': [4, 15],
 'much': [1067, 882],
 'h

<p>Test the result on a positive and a negative comment from the initial dataset</p>

In [19]:
text1 = clean_df["comment"][4]
result1 = apply_algo(dumb_algo, dumb_sarcasme, text1)

text2 = clean_df["comment"][5285]
result2 = apply_algo(dumb_algo, dumb_sarcasme, text2)

text_neg = "this is not good"
result3 = apply_algo(dumb_algo, dumb_sarcasme, text_neg)

print(result1, result2, result3)

-1 1 -1


<p>Evaluate the success rate on the initial dataset</p>

In [20]:
dumb_coef = test_algo(dumb_algo, dumb_sarcasme, clean_df)

print('accuracy du jeu de données initial:\n')
print(dumb_coef["accuracy"])
print ('\n matrice de confusion \n',dumb_coef["confusion"])
print(dumb_coef["report"])

accuracy du jeu de données initial:

0.7638

 matrice de confusion 
 [[3168 1832]
 [ 530 4470]]
              precision    recall  f1-score   support

          -1       0.86      0.63      0.73      5000
           1       0.71      0.89      0.79      5000

   micro avg       0.76      0.76      0.76     10000
   macro avg       0.78      0.76      0.76     10000
weighted avg       0.78      0.76      0.76     10000
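
<p>The figures in the report can be recomputed from the confusion matrix printed above. This sketch assumes scikit-learn's convention (rows are true classes, columns are predicted classes, in the order -1, 1).</p>

```python
import numpy as np

# Confusion matrix printed above: rows = true class, columns = predicted class
cm = np.array([[3168, 1832],
               [530, 4470]])

accuracy = np.trace(cm) / cm.sum()
precision = np.diag(cm) / cm.sum(axis=0)   # correct / all predicted as that class
recall = np.diag(cm) / cm.sum(axis=1)      # correct / all truly in that class

print(round(float(accuracy), 4))   # 0.7638
print(precision.round(2))          # [0.86 0.71]
print(recall.round(2))             # [0.63 0.89]
```

<p>The low recall for class -1 (0.63) corresponds to the 1832 negative comments that were predicted positive.</p>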



<h1>scikit-learn classifiers</h1>

In [21]:
def classification(clf, x, y, min_df = 0.0015, max_df=0.3, ngram_range=(1,3), is_to_array=False):
    vectorizer = TfidfVectorizer(min_df = min_df, max_df=max_df, ngram_range=ngram_range)
    k_fold = KFold(n_splits=10, random_state=None, shuffle=True)
    x = vectorizer.fit_transform(x)
    
    if is_to_array: x = x.toarray()
    
    clf.fit(x, y)
    
    r = {"clf": clf, "k_fold": k_fold, "vectorizer": vectorizer, "x": x}
    
    return r

In [22]:
def scoring_classification(clf, k_fold, x, y):
    score = cross_val_score(clf, x, y, cv=k_fold, scoring='accuracy')

    print('Les différentes accuracy pour les 10 évaluations sont :')
    print(score)
    print ('Accuracy moyenne :', score.mean())
    print('standard deviation', score.std())
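
<p>In classification, the vectorizer is fitted on the whole dataset before cross-validation, so every validation fold has already contributed to the vocabulary and to the IDF weights. One possible alternative, sketched here on a made-up toy corpus with default min_df and max_df, is to cross-validate a Pipeline so that the vectorizer is refitted inside each fold. This is a suggestion, not what the notebook does, and its effect on the scores reported below has not been measured.</p>

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up toy corpus, only to make the sketch runnable
texts = ["good film", "great fun", "good acting", "really great",
         "bad film", "awful plot", "bad acting", "really awful"]
labels = [1, 1, 1, 1, -1, -1, -1, -1]

pipe = Pipeline([
    ("vect", TfidfVectorizer(ngram_range=(1, 2))),
    ("clf", MultinomialNB()),
])

# The pipeline is refitted on the training part of every fold,
# so the vocabulary and IDF weights never see the validation fold
cv = KFold(n_splits=4, shuffle=True, random_state=0)
scores = cross_val_score(pipe, texts, labels, cv=cv, scoring="accuracy")

print(len(scores))  # 4
```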

<h2>MultinomialNB</h2>

In [24]:
r = classification(
    MultinomialNB(),
    clean_df["clean_comment"],
    clean_df["result"]
)

clf_multi_nb = r["clf"]
k_fold = r["k_fold"]
vectorizer_multi_nb = r["vectorizer"]

scoring_classification(
    clf_multi_nb,
    k_fold,
    r["x"],
    clean_df["result"]
)

Les différentes accuracy pour les 10 évaluations sont :
[0.917 0.901 0.911 0.908 0.902 0.903 0.92  0.903 0.919 0.905]
Accuracy moyenne : 0.9089
standard deviation 0.00700642562224135


<h2>GaussianNB</h2>

In [25]:
r = classification(
    GaussianNB(),
    clean_df["clean_comment"],
    clean_df["result"],
    ngram_range=(1,3),
    is_to_array=True
)

clf_gaussian = r["clf"]
k_fold = r["k_fold"]
vectorizer_gaussian = r["vectorizer"]

scoring_classification(
    clf_gaussian,
    k_fold,
    r["x"],
    clean_df["result"]
)

Les différentes accuracy pour les 10 évaluations sont :
[0.859 0.865 0.88  0.882 0.883 0.88  0.887 0.877 0.884 0.876]
Accuracy moyenne : 0.8773
standard deviation 0.008343260753446468


<h2>DecisionTreeClassifier</h2>

In [26]:
r = classification(
    DecisionTreeClassifier(random_state=0),
    clean_df["text_filtered_tags"],
    clean_df["result"],
    ngram_range=(1,2)
)

clf_tree = r["clf"]
k_fold = r["k_fold"]
vectorizer_tree = r["vectorizer"]

scoring_classification(
    clf_tree,
    k_fold,
    r["x"],
    clean_df["result"]
)


Les différentes accuracy pour les 10 évaluations sont :
[0.75  0.762 0.761 0.719 0.733 0.742 0.739 0.735 0.733 0.763]
Accuracy moyenne : 0.7436999999999999
standard deviation 0.014092906016858283


<h1>Testing the different classifiers</h1>

<h2>Testing on an independent dataset</h2>
<p>To check whether a classifier is effective, we need an average estimate of the good and bad results it produces. However, computing this average on the dataset used for training can be misleading: the classifier may fit that dataset too closely (overfitting), in which case results on another dataset may be poor. It is therefore important to keep a separate dataset, never used for training, to obtain a "realistic" success rate.</p>
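
<p>When no separate file is available, the same idea can be approximated by holding out part of the data. The sketch below uses a made-up toy corpus and scikit-learn's train_test_split; it illustrates the principle only and is not the evaluation used in this notebook, which reads an independent file.</p>

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up toy corpus
texts = ["good film", "great fun", "good acting", "really great",
         "bad film", "awful plot", "bad acting", "really awful"]
labels = [1, 1, 1, 1, -1, -1, -1, -1]

# Keep 25% of the samples aside; they are never used for fitting
x_train, x_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(x_train, y_train)

train_acc = accuracy_score(y_train, model.predict(x_train))
test_acc = accuracy_score(y_test, model.predict(x_test))

print(train_acc, test_acc)
```

<p>A training accuracy much higher than the held-out accuracy is the usual sign that the model fits its training data too closely.</p>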

In [27]:
df_test = pd.read_csv('../data/test.csv',sep="\t", header=None, encoding="utf8")
del df_test[0]
df_test = df_test.rename(index = int, columns = {1: "comment", 2: "result"})

clean_df_test = parse_df(df_test)


<h3>Testing the dumb classifier</h3>

In [28]:
dumb_coef_test = test_algo(dumb_algo, dumb_sarcasme, clean_df_test)

print('accuracy du jeu de données de IMDB:\n')
print(dumb_coef_test["accuracy"])
print ('\n matrice de confusion \n',dumb_coef_test["confusion"])
print(dumb_coef_test["report"])

accuracy du jeu de données de IMDB:

0.71842

 matrice de confusion 
 [[13733 11267]
 [ 2812 22188]]
              precision    recall  f1-score   support

          -1       0.83      0.55      0.66     25000
           1       0.66      0.89      0.76     25000

   micro avg       0.72      0.72      0.72     50000
   macro avg       0.75      0.72      0.71     50000
weighted avg       0.75      0.72      0.71     50000



<h2>Challenge dataset</h2>

In [29]:
df_challenge_comment = pd.read_csv('../data/challenge/test_data.csv',sep="\t", header=None, encoding="utf8")
df_challenge_comment = df_challenge_comment.rename(index = int, columns = {0: "comment"})

df_challenge_result = pd.read_csv('../data/challenge/test_labels.csv',sep="\t", header=None, encoding="utf8")
df_challenge_result = df_challenge_result.rename(index = int, columns = {0: "result"})

df_challenge_raw = pd.concat([df_challenge_comment, df_challenge_result], axis=1)
df_challenge_clean = parse_df(df_challenge_raw)

<h3>Testing the dumb classifier</h3>

In [30]:
dumb_coef_test = test_algo(dumb_algo, dumb_sarcasme, df_challenge_clean)

print('accuracy du jeu de données du challenge:\n')
print(dumb_coef_test["accuracy"])
print ('\n matrice de confusion \n',dumb_coef_test["confusion"])
print(dumb_coef_test["report"])

accuracy du jeu de données du challenge:

0.7765

 matrice de confusion 
 [[1308  692]
 [ 202 1798]]
              precision    recall  f1-score   support

          -1       0.87      0.65      0.75      2000
           1       0.72      0.90      0.80      2000

   micro avg       0.78      0.78      0.78      4000
   macro avg       0.79      0.78      0.77      4000
weighted avg       0.79      0.78      0.77      4000



<h2>Pipeline</h2>

<p>Testing MultinomialNB on:</p>
<ul>
    <li>the training dataset</li>
    <li>the IMDB dataset</li>
    <li>the challenge dataset</li>
</ul>

In [35]:
def display_result(result, y):
    accuracy = accuracy_score(y, result)
    report = classification_report(y, result)
    confusion = confusion_matrix(y, result)
    
    print(accuracy)
    print ('\n matrice de confusion \n',confusion)
    print(report)

In [36]:


pipeline = Pipeline([
    ('vect', TfidfVectorizer(min_df = 0.0015, max_df=0.3, ngram_range=(1,3))),
    # note: TfidfVectorizer already applies TF-IDF weighting, so this step re-weights its output
    ('trans', TfidfTransformer()),
    ('clf', MultinomialNB())
])

pipeline.fit(clean_df["clean_comment"], clean_df["result"])

pred1 = pipeline.predict(clean_df["clean_comment"])
pred2 = pipeline.predict(clean_df_test["clean_comment"])
pred3 = pipeline.predict(df_challenge_clean["clean_comment"])

print("résulat jeu de données d'entraînement :")
print(display_result(pred1, clean_df["result"]))

print ("--------------------------------------------------")

print("résulat jeu de données IMDB :")
print(display_result(pred2, clean_df_test["result"]))

print ("--------------------------------------------------")

print("résulat jeu de données challenge :")
print(display_result(pred3, df_challenge_clean["result"]))

résulat jeu de données d'entraînement :
0.9515

 matrice de confusion 
 [[4730  270]
 [ 215 4785]]
              precision    recall  f1-score   support

          -1       0.96      0.95      0.95      5000
           1       0.95      0.96      0.95      5000

   micro avg       0.95      0.95      0.95     10000
   macro avg       0.95      0.95      0.95     10000
weighted avg       0.95      0.95      0.95     10000

None
--------------------------------------------------
résulat jeu de données IMDB :
0.86502

 matrice de confusion 
 [[20475  4525]
 [ 2224 22776]]
              precision    recall  f1-score   support

          -1       0.90      0.82      0.86     25000
           1       0.83      0.91      0.87     25000

   micro avg       0.87      0.87      0.87     50000
   macro avg       0.87      0.87      0.86     50000
weighted avg       0.87      0.87      0.86     50000

None
--------------------------------------------------
résulat jeu de données challenge :
0.9167
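
<p>The pipeline above chains TfidfVectorizer with TfidfTransformer. According to the scikit-learn documentation, TfidfVectorizer is equivalent to CountVectorizer followed by TfidfTransformer, so the extra step re-weights values that are already TF-IDF weighted. The sketch below checks that equivalence on a made-up corpus with default parameters; whether removing the extra step would change the scores above has not been tested here.</p>

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

docs = ["good film", "bad film", "good good acting"]

# One step: counts and TF-IDF weighting together
one_step = TfidfVectorizer().fit_transform(docs)

# Two steps: raw counts first, then TF-IDF weighting
two_steps = TfidfTransformer().fit_transform(
    CountVectorizer().fit_transform(docs)
)

same = np.allclose(one_step.toarray(), two_steps.toarray())
print(same)  # True
```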

<p>Saving the model</p>

In [38]:
#filename = "../z.pkl"
#pickle.dump(pipeline, open(filename, "wb"))
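
<p>A minimal sketch of a full save-and-reload round trip, using a tiny made-up model and a temporary file (the path and the model are assumptions for this example; the notebook's own path is the commented one above). Context managers make sure the file is closed after writing and reading.</p>

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up model standing in for the fitted pipeline
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(["good film", "bad film"], [1, -1])

# Hypothetical file location inside a temporary directory
path = os.path.join(tempfile.mkdtemp(), "model.pkl")

with open(path, "wb") as f:
    pickle.dump(model, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

# The reloaded model should predict exactly like the original one
same = list(restored.predict(["good"])) == list(model.predict(["good"]))
print(same)  # True
```

<p>Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that produced them.</p>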