# Plateforme Agnostique de Traitement et d'Analyse des Textes
### Experiment notebook

## Context
- Our research project, a platform for classifying French-language news articles, needs a dataset of sufficient quality and size.


- We did not find a suitable existing dataset.
- We spent 6 months building one.


- This experiment confirms that the dataset is sufficient to move forward,
- and that we can expect good predictions from supervised machine learning algorithms.


## Progress report, September 2022
This experiment:
- Builds on the *data* previously assembled by the project
- Illustrates the research *approach* at the core of the project
- Shows preliminary *results*

It provides partial answers to the questions:
- *Is it possible to predict whether an article is an infox (fake news) using the simplest solutions?*
- *Do we have enough data, of sufficient quality, to move on to the next phases?*

The solutions studied rely on:
- A dataset of 455 manually labeled articles
- A *Bag of Words* representation of the texts
- Supervised machine learning: *Logistic Regression* and others

We are at the beginning of this research project. Our approach is to start with the simplest methods, measure their results, then refine the approach to obtain better results.

---


## Context, Environment and Observations
---

### Technologies used

A state-of-the-art environment:
- *python 3.9.13* - https://www.python.org/
- *jupyter notebook* - https://jupyter.org/
- *pandas* - https://pandas.pydata.org/
- *scikit-learn* - https://scikit-learn.org/stable/index.html

And the platform we are developing as part of the project:
- *patat* - https://github.com/fmaine/patat

### Initialization

In [1]:
cd ../..

/Users/fm/Desktop/Work/Patat


In [2]:
import importlib
import pandas as pd

### Loading the data

The dataset consists of:
- The texts of about 87,000 French-language news articles
- Annotations, across about twenty labels, for a subset of 455 of these articles



#### Articles

In [3]:
article_filename = 'data/demo/221008-Article.csv'

In [4]:
import patat.db.article_db

article_db = patat.db.article_db.ArticleDb(article_filename)

article_db.df()

Unnamed: 0,url,title,article,site,author,date_iso,url_h
0,https://reseauinternational.net/tous-les-jeune...,"Tous les jeunes, portez la nouvelle (russe)",par Pepe Escobar.\nL’OCS à Samarcande et l’Ass...,reseauinternational.net,,2022-09-30T00:00:00,f95a294c94ff76cc9626ae06300a8b38067f89cf
1,https://reseauinternational.net/adhesion-a-la-...,Adhésion à la Russie : 93% pour le « oui » dan...,"Dans les régions de Zaporijia et de Kherson, 9...",reseauinternational.net,,2022-09-30T00:00:00,c924dab7ded47578d81c3ae46f8be0964b3c50f1
2,https://lemediaen442.fr/onu-le-premier-ministr...,ONU – Le Premier ministre de Nouvelle-Zélande ...,L’argument principal de la ministre est que le...,lemediaen442.fr,,2022-09-29T00:00:00,3d47a59ef99274fd9ee96c209cc2ab41d6e1f6bb
3,https://www.francesoir.fr/societe-environnemen...,Compostage humain: les “funérailles vertes” ga...,"Aux États-Unis, les différents gouvernements r...",www.francesoir.fr,Auteur(s)\nFranceSoir,2022-09-28T13:15:00,0c0341a1f5fae820ee307cb54024df6b06a93d85
4,https://www.breizh-info.com/2022/09/27/208410/...,Donatello : génie de la Renaissance,"Portrait du sculpteur Donatello (1386-1466), p...",www.breizh-info.com,,2022-09-27T00:00:00,fcce819327d0302c4cf5e3a8a43b54327ffb8e63
...,...,...,...,...,...,...,...
87179,https://www.profession-gendarme.com/zelensky-e...,Zelensky est « une marionnette qui fait inutil...,Encore un journaliste surpris qu’un colonel am...,www.profession-gendarme.com,,,e346761c2a99d892b26c9388480e0ee6ad6b303b
87180,https://www.profession-gendarme.com/zelensky-l...,Zélensky : l’arnaque de la contre-offensive uk...,Le président Zelensky et ses alliés de l’Otan ...,www.profession-gendarme.com,,,77525354644316eadab53960efa8f5fd028c7f67
87181,https://www.profession-gendarme.com/zelensky-m...,Zelensky massacre maintenant des citoyens ukra...,ZELENSKY A ORDONNÉ QUE TOUTES LES PERSONNES FU...,www.profession-gendarme.com,,,fc6a074da5c02032bc9fc3f35f1044bc4029042c
87182,https://www.profession-gendarme.com/zero-mort-...,Zéro mort du coronavirus : comment expliquer l...,Le Vietnam est une exception dans le monde : a...,www.profession-gendarme.com,,,e4f0a8961d79c9a5f66c80d7aa4196cb679af2b0


A database of about 87,000 articles.

We are only interested in the text of the titles and bodies of the articles.

Sample articles:

In [5]:
for index,row in article_db._df.sample(10).iterrows():
    print(f"{row['title']}\n{row['url']}\n{row['article']}")
    print('\n\n--------------------\n\n')

Musée Grévin: Nikos Aliagas, l'animateur de "The Voice", a désormais son double de cire
https://www.francesoir.fr/culture-art-expo/musee-grevin-nikos-aliagas-lanimateur-de-voice-desormais-son-double-de-cire
Le Musée Grévin à Paris vient d'accueillir un nouveau pensionnaire en son sein. Depuis mercredi, Nikos Aliagas, l'animateur de "The Voice" a désormais son double de cire au sein du temple des célébrités.  Désormais, Nikos Aliagas a lui aussi son double de cire. L'animateur de The Voice était mercredi 7 au soir au Musée Grévin pour inaugurer sa statue. Visiblement ému d'avoir fait son entrée au sein du temple des célébrités 23 ans après ses débuts de journaliste, il a dédié cette "victoire" au public et à sa famille. "Je passe ma vie à mettre les autres dans la lumière. Je vois dans cette statue de cire mon grand-père et mon père Andreas à son arrivée en France qui lui a ouvert les bras", a-t-il déclaré. Et d'ajouter: "c'est très étrange... Je suis t

#### Labels

In [7]:
label_filename = 'data/demo/221008-Label.csv'

In [8]:
import patat.db.label_db

label_db = patat.db.label_db.LabelDb(label_filename)

label_db.df()

Unnamed: 0,url,label,value,owner,url_h
0,https://www.anguillesousroche.com/actualite/ou...,trop_mots,0.000000,recueil,365862d336e45b4bfa5d4010536ca7c04f26f780
1,https://www.anguillesousroche.com/actualite/ou...,trop_chiffres,0.000000,recueil,365862d336e45b4bfa5d4010536ca7c04f26f780
2,https://www.anguillesousroche.com/actualite/ou...,sophisme,0.000000,recueil,365862d336e45b4bfa5d4010536ca7c04f26f780
3,https://www.anguillesousroche.com/actualite/ou...,inversion_preuve,0.000000,recueil,365862d336e45b4bfa5d4010536ca7c04f26f780
4,https://www.anguillesousroche.com/actualite/ou...,inverifiable,0.000000,recueil,365862d336e45b4bfa5d4010536ca7c04f26f780
...,...,...,...,...,...
98958,https://www.profession-gendarme.com/zelensky-e...,infox,0.556761,220930-tf_lr.pp,e346761c2a99d892b26c9388480e0ee6ad6b303b
98959,https://www.profession-gendarme.com/zelensky-l...,infox,0.164531,220930-tf_lr.pp,77525354644316eadab53960efa8f5fd028c7f67
98960,https://www.profession-gendarme.com/zelensky-m...,infox,0.673791,220930-tf_lr.pp,fc6a074da5c02032bc9fc3f35f1044bc4029042c
98961,https://www.profession-gendarme.com/zero-mort-...,infox,0.103192,220930-tf_lr.pp,e4f0a8961d79c9a5f66c80d7aa4196cb679af2b0


### Preparing the corpus

Concatenation: text = title + article

In [9]:
article_db.df()['text'] = article_db.df()['title'] + '\n' + article_db.df()['article']
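
String concatenation in pandas propagates missing values: if a title or a body is missing, the resulting `text` is missing too, and a vectorizer would later reject it. The sketch below uses a toy frame with invented rows (not project data) to show one possible guard, which fills missing fields with empty strings before concatenating. Whether the real corpus contains such rows is not stated in this notebook.

```python
import pandas as pd

# Toy data, invented for illustration; the real frame comes from article_db.df()
toy = pd.DataFrame({
    "title": ["Titre A", None, "Titre C"],
    "article": ["Corps A", "Corps B", None],
})

# Plain concatenation: any missing part makes the whole text missing
naive = toy["title"] + "\n" + toy["article"]

# Filling missing parts with '' keeps whatever text is available
safe = (toy["title"].fillna("") + "\n" + toy["article"].fillna("")).str.strip()

print(naive.isna().sum())  # rows lost by the naive version
print(safe.tolist())
```

The final `.str.strip()` only removes the stray newline left when one side is empty.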

In [10]:
df_corpus = article_db.df()

In [11]:
df_label = label_db._df
df_recueil_label = df_label[df_label['owner']=='recueil'].pivot_table(index='url_h',values='value',columns='label')

In [12]:
df_recueil_label = df_recueil_label.reset_index()
df_recueil_label

label,url_h,cherry_picking,denigrement,entites_coherentes,entites_nommees,exageration,faits,fausse_nouvelle,infox,insinuations,...,ouverture_esprit,propos_raportes,qualite_ecriture,scientifique_sulfureux,signe,sophisme,sources_citees,titre_decale,trop_chiffres,trop_mots
0,009770d8b01c877bc11b1c3ca7c481b7ea4b1876,,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,...,1.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
1,00f659a2b16dbcc63f93844489c0c3175b2a344c,,,1.0,1.0,,1.0,1.0,1.0,1.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,016cb74a38bcdf27c834b76f5792264fcbadce36,,,1.0,1.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,01a106ca9d80b922fc5979b57156a66a8cd46e38,,,0.0,0.0,,0.0,0.0,1.0,1.0,...,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
4,01a787224f485ecd4cbdfd43aa53780efdaec0f4,,,0.0,1.0,,1.0,1.0,1.0,1.0,...,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
491,fd7163e05b2b053b37dca2156cc8e3571027329a,,,1.0,1.0,,1.0,0.0,0.0,0.0,...,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0
492,fdb789f4333f4eb6f72c2f77111a3048a1a778b1,,,1.0,1.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
493,fee8bffbd8f2188ff0d862ebfb14b26bb9030dc5,,,1.0,1.0,1.0,1.0,0.0,1.0,1.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
494,ff26c53b66457a06ff73c98b765cc23158445316,,,0.0,0.0,,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
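
To make the reshaping step concrete, here is a minimal sketch on invented rows (the hashes and values are hypothetical). `pivot_table` turns the long `(url_h, label, value)` table into one row per article and one column per label. Its default aggregation is the mean, so two annotations of the same label on the same article are averaged.

```python
import pandas as pd

# Invented long-format labels, mimicking the columns of label_db.df()
long_labels = pd.DataFrame({
    "url_h": ["a1", "a1", "a2", "a2", "a2"],
    "label": ["infox", "sophisme", "infox", "infox", "sophisme"],
    "value": [1.0, 0.0, 0.0, 1.0, 1.0],
    "owner": ["recueil"] * 5,
})

# Keep manual annotations only, then go from long to wide
wide = (
    long_labels[long_labels["owner"] == "recueil"]
    .pivot_table(index="url_h", columns="label", values="value")
    .reset_index()
)
print(wide)
```

Article `a2` carries two conflicting `infox` annotations here, so its pivoted value is their mean rather than 0 or 1; a label column used as a classification target may therefore need rounding or filtering.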


In [13]:
df_recueil=pd.merge(article_db._df,df_recueil_label)

This experiment is applied to the `infox` label.

In [14]:
df_recueil = df_recueil[['url','text','infox']]
df_recueil

Unnamed: 0,url,text,infox
0,https://reseauinternational.net/tous-les-jeune...,"Tous les jeunes, portez la nouvelle (russe)\np...",0.0
1,https://reseauinternational.net/adhesion-a-la-...,Adhésion à la Russie : 93% pour le « oui » dan...,0.0
2,https://lemediaen442.fr/onu-le-premier-ministr...,ONU – Le Premier ministre de Nouvelle-Zélande ...,1.0
3,https://www.francesoir.fr/societe-environnemen...,Compostage humain: les “funérailles vertes” ga...,0.0
4,https://www.dreuz.info/2022/09/qui-est-elle-an...,"Qui est-elle ? Anti-UE, pro-Otan, pro-Ukraine,...",0.0
...,...,...,...
440,https://www.profession-gendarme.com/leffet-kis...,L’effet Kiss Coll de la guerre\nPar WD\n\n\nNo...,1.0
441,https://www.profession-gendarme.com/les-labora...,Les laboratoires du Pentagone et la dépopulati...,1.0
442,https://www.profession-gendarme.com/lettre-ouv...,« LETTRE OUVERTE A LA FRANCE ET AUX NATIONS »\...,1.0
443,https://www.profession-gendarme.com/tour-de-fr...,Tour de France – Abandon de Victor Lafay :« On...,1.0


## Experiment 1
- Bag of Words with `sklearn.feature_extraction.text.CountVectorizer`
- Logistic regression with `sklearn.linear_model.LogisticRegression`

---

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df_recueil['text'])

In [None]:
y = df_recueil['infox']

In [None]:
X.shape

In [None]:
y.value_counts()

### Building the training and test datasets

In [None]:
import sklearn.model_selection

#X_train,X_test,y_train,y_test = sklearn.model_selection.train_test_split(X,y,train_size=0.8,shuffle=True)
#X_train,X_test,y_train,y_test = sklearn.model_selection.train_test_split(X,y,random_state=42,train_size=0.5)

X_train,X_test,y_train,y_test = sklearn.model_selection.train_test_split(X,y,random_state=42,train_size=0.75)

X_train.shape
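
With only a few hundred labeled articles, a random split can leave the two sets with different class proportions. A possible variation, shown here on synthetic data (the arrays below are invented stand-ins for `X` and `y`), is to pass `stratify` to `train_test_split` so that both sets keep the same share of positive examples.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for X and y: 100 samples, 40% positives
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 5))
y_toy = np.array([1] * 40 + [0] * 60)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, train_size=0.75, random_state=42, stratify=y_toy
)

# Both sets keep the 40% positive rate of the full data
print(y_tr.mean(), y_te.mean())
```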

### Training the model

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

#classifier = LogisticRegression(C=100.0, random_state=42, solver='lbfgs', multi_class='ovr', max_iter=1000)
classifier = LogisticRegression(max_iter=1000)

# Fit the model
classifier.fit(X_train, y_train)


### Measuring the results

In [None]:
y_pred = classifier.predict(X_test)

In [None]:
from sklearn import metrics
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
print('Confusion matrix')
print(cnf_matrix)
print(f'Accuracy score {metrics.accuracy_score(y_test, y_pred)*100:.2f}%')
print(f'Recall score {metrics.recall_score(y_test, y_pred)*100:.2f}%')
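
The two scores printed above can be read directly off the confusion matrix. The sketch below uses invented predictions to show the arithmetic, and adds precision, which this notebook does not report but which is derived from the same four counts.

```python
import numpy as np
from sklearn import metrics

# Invented ground truth and predictions, for illustration only
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_hat = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# scikit-learn orders the binary confusion matrix as [[tn, fp], [fn, tp]]
tn, fp, fn, tp = metrics.confusion_matrix(y_true, y_hat).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of correct predictions
recall = tp / (tp + fn)                     # share of true infox that were found
precision = tp / (tp + fp)                  # share of predicted infox that are correct

print(accuracy, recall, precision)
```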

In [None]:
y_proba = classifier.predict_proba(X_test)
y_score = y_proba[:, 1]  # probability of the positive class (infox = 1)
fpr, tpr, _ = metrics.roc_curve(y_test, y_score)
roc_display = metrics.RocCurveDisplay(fpr=fpr, tpr=tpr).plot()
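
A ROC curve is often summarized by the area under it (AUC). As a sketch on four invented scores, the area can be computed either from the curve points or directly from labels and scores; both calls below are standard scikit-learn functions.

```python
import numpy as np
from sklearn import metrics

# Invented labels and predicted scores, for illustration only
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

fpr, tpr, _ = metrics.roc_curve(y_true, scores)
auc_from_curve = metrics.auc(fpr, tpr)             # trapezoidal area under the curve
auc_direct = metrics.roc_auc_score(y_true, scores)  # same quantity, computed directly

print(auc_from_curve, auc_direct)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking of positives above negatives.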

## Experiment 2
- Removal of rare words and common words
- TF-IDF weighting instead of raw counts
- Bag of Words with `sklearn.feature_extraction.text.TfidfVectorizer` + stop words
- Logistic regression with `sklearn.linear_model.LogisticRegression`

---

### Word counts for the infox label

In [None]:
import patat.ml.lex_analyser

In [None]:
lex = patat.ml.lex_analyser.LexAnalyser()

In [None]:
def merge_texts(texts):
    # Concatenate all texts, each one followed by a newline
    return ''.join(text + '\n' for text in texts)

In [None]:
def count_words_label(df,label):
    df = df[df[label].notna()]
    df_texts = pd.pivot_table(df, values='text', index=None, columns=label, aggfunc=merge_texts)
    wc = {}
    word_analysis = {}
    for key in df_texts.keys():
        text = df_texts[key]['text']
        count_colname = label+'_'+str(int(key))
        wc[key] = lex.count_tokens(lex.get_words(text))
        for word in wc[key]:
            word_dic = word_analysis.get(word,{})
            word_dic[count_colname]=wc[key][word]
            word_analysis[word]=word_dic
    return word_analysis

In [None]:
df = df_recueil
label = 'infox'
df_count = pd.DataFrame(count_words_label(df, label)).T
df_count = df_count.fillna(0)

In [None]:
df_count

In [None]:
df_count['words']=df_count.index

In [None]:
#df_text = pd.pivot_table(df_count, values=['infox_0','infox_1'], index=None, columns='words', aggfunc=sum).T
#

#### Identifying the words to ignore

In [None]:
# Rare words: fewer than occ_rare occurrences in both classes

occ_rare = 3
#occ_rare = 5
def is_rare(row):
    return row['infox_0'] < occ_rare and row['infox_1'] < occ_rare

df_rare = df_count[df_count.apply(is_rare,axis=1)]

rare_words = list(df_rare.index)
len(rare_words)

In [None]:
# Common words: among the common_size most frequent words of both classes

common_size = 100
#common_size = 300
top_0 = df_count.sort_values('infox_0',ascending=False).head(common_size).index
top_1 = df_count.sort_values('infox_1',ascending=False).head(common_size).index
common_words = []
for word in top_0:
    if word in top_1:
        common_words.append(word)
len(common_words)


In [None]:
ignore_words = common_words + rare_words
ignore_set = set(ignore_words)  # set lookup avoids scanning the list for every word
word_vocabulary = [word for word in df_count['words'] if word not in ignore_set]
print(f'Vocabulary = total words - common words - rare words = {len(word_vocabulary)}')
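
For comparison, scikit-learn vectorizers offer a related but different filter: `min_df` and `max_df` prune words by the number of documents they appear in, whereas the cells above use per-class occurrence counts. The sketch below, on an invented corpus with arbitrary thresholds, is an alternative to consider and not the method used in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented corpus, for illustration only
docs = [
    "alpha beta gamma",
    "alpha beta delta",
    "alpha gamma epsilon",
    "alpha beta gamma",
]

# min_df=2   : drop words present in fewer than 2 documents (rare words)
# max_df=0.9 : drop words present in more than 90% of documents (common words)
pruned = TfidfVectorizer(min_df=2, max_df=0.9)
pruned.fit(docs)

kept = sorted(pruned.vocabulary_)
print(kept)
```

Here `alpha` is dropped as too common, `delta` and `epsilon` as too rare.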

### Vectorizer

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
vectorizer = TfidfVectorizer(stop_words=ignore_words)
#vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df_recueil['text'])

In [None]:
y = df_recueil['infox']

In [None]:
X.shape

In [None]:
y.value_counts()

### Building the training and test datasets

In [None]:
import sklearn.model_selection

#X_train,X_test,y_train,y_test = sklearn.model_selection.train_test_split(X,y,train_size=0.8,shuffle=True)
X_train,X_test,y_train,y_test = sklearn.model_selection.train_test_split(X,y,random_state=42,train_size=0.75)
#X_train,X_test,y_train,y_test = sklearn.model_selection.train_test_split(X,y,random_state=42,train_size=0.5)

X_train.shape

### Training the model

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

classifier = LogisticRegression(C=100.0, random_state=42, solver='lbfgs', multi_class='ovr', max_iter=1000)
#classifier = LogisticRegression(max_iter=1000)

# Fit the model
classifier.fit(X_train, y_train)


### Measuring the results

In [None]:
y_pred = classifier.predict(X_test)

In [None]:
from sklearn import metrics
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
print('Confusion matrix')
print(cnf_matrix)
print(f'Accuracy score {metrics.accuracy_score(y_test, y_pred)*100:.2f}%')
print(f'Recall score {metrics.recall_score(y_test, y_pred)*100:.2f}%')

In [None]:
y_proba = classifier.predict_proba(X_test)
y_score = y_proba[:, 1]  # probability of the positive class (infox = 1)
fpr, tpr, _ = metrics.roc_curve(y_test, y_score)
roc_display = metrics.RocCurveDisplay(fpr=fpr, tpr=tpr).plot()

## Experiment 3
- Benchmark of a few algorithms
- based on the features produced in Experiment 2

---

In [None]:
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.inspection import DecisionBoundaryDisplay

In [None]:
names = [
    "Logistic Regression",
    "Nearest Neighbors",
    "Linear SVM",
    "RBF SVM",
    "Gaussian Process",
    "Decision Tree",
    "Random Forest",
    "Neural Net",
    "AdaBoost",
    "Naive Bayes",
    "QDA",
]

classifiers = [
    LogisticRegression(C=100.0, random_state=42, solver='lbfgs', multi_class='ovr', max_iter=1000),
    KNeighborsClassifier(3),
    SVC(kernel="linear", C=0.025),
    SVC(gamma=2, C=1),
    GaussianProcessClassifier(1.0 * RBF(1.0)),
    DecisionTreeClassifier(max_depth=5),
    RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
    MLPClassifier(alpha=1, max_iter=1000),
    AdaBoostClassifier(),
    GaussianNB(),
    QuadraticDiscriminantAnalysis(),
]

X = X.toarray()

X_train,X_test,y_train,y_test = sklearn.model_selection.train_test_split(X,y,random_state=42,train_size=0.75)

results =[]
for name, clf in zip(names, classifiers):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
    print('------------------------------------')
    print(f'Classifier : {name}')
    print('Confusion matrix')
    print(cnf_matrix)
    acc = metrics.accuracy_score(y_test, y_pred)
    print(f'Accuracy score {acc*100:.2f}%')
    recall = metrics.recall_score(y_test, y_pred)
    print(f'Recall score {recall*100:.2f}%')
    results.append({
        'Classifier' : name,
        'Accuracy' : round(acc*100),
        'Recall' : round(recall*100),
    })
print('------------------------------------')
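
The benchmark above scores each classifier on a single train/test split, which can be noisy with a few hundred articles. One possible complement, sketched here on an invented toy corpus, is k-fold cross-validation with the vectorizer placed inside a pipeline, so that the vocabulary and IDF weights are learned from the training folds only. The texts, labels and number of folds are assumptions made for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Invented toy corpus: 10 texts per class
texts = [f"complot secret revelation {i}" for i in range(10)] + \
        [f"conseil municipal budget {i}" for i in range(10)]
labels = [1] * 10 + [0] * 10

# The vectorizer is refit on the training part of every fold
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, texts, labels, cv=cv, scoring="accuracy")

print(f"accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")
```

Reporting the mean and spread over folds gives a sense of how much a score depends on the particular split.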

### Summary

In [None]:
df = pd.DataFrame(results)
df = df.sort_values(['Accuracy','Recall'],ascending=False)
df

## Analysis of results and conclusions
---

- Using the simplest methods, we already obtain a predictor whose performance looks promising
- The accuracy of nearly 76% of predictions must be validated by manual inspection
- The two algorithms that stand out are the Neural Network and Logistic Regression
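
One way to put an accuracy figure in context is to compare it with a trivial baseline that always predicts the majority class. The sketch below uses synthetic labels with an invented 60/40 split, not the project's actual class distribution, which is not shown in this export.

```python
import numpy as np
from sklearn import metrics
from sklearn.dummy import DummyClassifier

# Synthetic labels with a 60/40 class split (invented, not the project's data)
y_toy = np.array([0] * 60 + [1] * 40)
X_toy = np.zeros((100, 1))  # features are ignored by the dummy model

baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
y_base = baseline.predict(X_toy)

base_accuracy = metrics.accuracy_score(y_toy, y_base)
base_recall = metrics.recall_score(y_toy, y_base)
print(base_accuracy, base_recall)
```

In this toy case the baseline reaches 60% accuracy with a recall of zero, which is why accuracy is best read alongside recall and the class balance.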

## Next steps
---
- Tune each algorithm in the benchmark to achieve better results
- Continue in this direction: use the lemmas of the texts
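
As an illustration of what the tuning step could look like, here is a minimal sketch using `GridSearchCV` on synthetic data. The data, the parameter grid and the choice of logistic regression are assumptions made for the example, not decisions taken by the project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the (X, y) produced in Experiment 2
X_toy, y_toy = make_classification(n_samples=200, n_features=20, random_state=0)

# Hypothetical grid over the regularization strength C
grid = {"C": [0.01, 0.1, 1.0, 10.0, 100.0]}
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid=grid,
    scoring="accuracy",
    cv=5,
)
search.fit(X_toy, y_toy)

print(search.best_params_, round(search.best_score_, 3))
```

The same pattern applies to the other classifiers of the benchmark by swapping the estimator and its grid.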

---

In [None]:
from patat.util.file import pickle_save
data = { 'X':X, 'y':y }
pickle_save(data,'data/tmp/221009-Xy.p')