# Catégorisez automatiquement des questions

projet : https://openclassrooms.com/fr/paths/148/projects/111/assignment
        
données : https://data.stackexchange.com/stackoverflow/query

### Description du projet
##### Objectifs
- Permettre la prédiction de tags 
- Exposer une API permettant l'interaction avec les models

#### Moyens
- ML non supervisé
- ML supervisé

#### Livrables
- Modèles 
- API dont l'entrypoint est exposé sur le net
- Ouvrir le gestionnaire de conf

# Imports

In [10]:
import pandas as pd
pd.options.mode.chained_assignment = None
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from datetime import datetime


from nltk.corpus import wordnet
from nltk.corpus import stopwords
import nltk
import nltk.data
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
import string
from langdetect import detect
import re
import spacy
from spacy.symbols import ORTH, NORM
from langdetect import detect_langs




import warnings
warnings.simplefilter("ignore", UserWarning)
warnings.simplefilter(action='ignore', category=pd.errors.PerformanceWarning)

pd.set_option('display.max_columns', 200)
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_colwidth', None)


# Dataset

Le dataset est constitué par aggrégation du résultat de tirs successifs par tranches de 250 000 items (id non continu !) de la requête suivante :


select title,body,tags
from posts
where tags is not null 
and id > 750000
and id < 1000000;

L'aggrégation est automatisée par le shell ./dataset/generated_csv/aggregate.sh

In [70]:
data = pd.read_csv('./dataset/dataset.csv',sep=",",nrows=1000)

In [71]:
data.head()

Unnamed: 0,title,body,tags
0,IEnumerable Or Enumerable for Enumerating composite objects,"<p>Hello I need some help with the following:</p>\n\n<p>I have a ListView control with many ListViewItems, and each ListViewItem's tag property references a domain object like Customer etc.</p>\n\n<p>when retrieving the collection of selected listviewItems, I would like to pass an <code>IEnumerable&lt;Customer&gt;</code> instead of the <code>IEnumerable&lt;ListViewItem&gt;</code> to a function to do something with it.</p>\n\n<p>what would be the best way to achieve this with only one loop?</p>\n\n<p>Thanks.</p>\n",<c#>
1,How to programmatically edit the routing table,"<p>I am writing a daemon running on an embedded platform that needs to change the default route of the device according to which interface it is connecting on at a given time. How can I do this programatically? I know I can use system(""route del default &amp;&amp; route add default gateway blah""); etc but is there a more direct way?</p>\n\n<p>UPDATE: I solved my particular problem by discovering a patch to pppd that allows a replacedefaultroute option. The patch also includes code for programmatically modifying the routing table. It is included in this gentoo bug report <a href=""http://bugs.gentoo.org/227321"" rel=""nofollow noreferrer"">http://bugs.gentoo.org/227321</a></p>\n",<c><linux><networking>
2,How to erase the content on a graphics in GDI+?,"<p>I am using gdi+, in c++.</p>\n\n<pre><code>Bitmap canvasImg = new Bitmap(400, 300, PixelFormat32bppARGB);\nGraphics canvas = new Graphics(&amp;canvasImg );\n\ncanvas.DrawImage(XXXX);\n</code></pre>\n\n<p>There are two problems.\n<p>\n1. I find the canvasImg is black. How can I change the color to white? I mean I want an white canvas. <p>\n2. If I have drawn some thing on the canvas, How can I clear the canvas?<p></p>\n\n<p>Many Thanks!</p>\n",<canvas><gdi+>
3,How do I access the state of individual bits of a word in MIPS?,<p>I'm writing a program and I need to determine if bits 3 and 6 are set. I know that I can rotate a word or left/right shift it.</p>\n\n<p>But how do I access individual bit's state? Do I use a bitwise operator like and/xor?</p>\n,<assembly><bit-manipulation><mips>
4,How to handle a closing application event in Java?,"<p>Having a console application, a server accepting several connections from clients, is it possible to have a listener or an event on a closing application? I want, in this event, tell all connected clients to gently disconnect before the application really closes itself.</p>\n\n<p>Any solution? Thank you!</p>\n",<java><events><listener>


# Constats

Dans l'objectif de faire du NLP nous pouvons déjà dégager des éléments remarquables qui vont conduire notre pré-traitement du dataset : 

- a priori beaucoup de question sont rédigées en anglais
    - il faudra faire un choix pour le corpus en fonction de la langue prépondérante et écarter les enregistrements qui ne sont pas rédigés dans cette langue
- le body et la liste de tags associée sont au format html
    - il faudra prétraiter les colonnes pour revenir à un format dépollué
    - on mettra également de côté les sauts de ligne (\n)
- on constate que le body peut contenir des éléments qui ne vont pas directement servir pour le nlp:
    - des extraits de code cadrés par les balises html '\<code\>\</code\>'
    - des liens hypertexte cadrés par les balises html '\<a href\>\</a\>'

Les fonctions suivantes vont servir à effectuer les opérations élémentaires de prétraitement.

Elles seront reprises dans un pipeline complet pour permettre leur enchainement automatique.


In [72]:
def logIt(caller,s):
    dt = datetime.today().strftime('%Y-%m-%d %H:%M:%S')
    show_log = f'{caller: <16} ' + s
    print(dt + ' - ' + show_log)

In [73]:
"""
-----------------------------------------------------------------
Desc:   Detect lang
Input:  Dataset
Output: languages representativeness  

Traitements appliqués : 
- detection langue
-----------------------------------------------------------------
"""
def langRepresentativeness(df, caller):
    
    langs = {}

    logIt(caller,'Get lang from title')
    
    for title in df['title']:
        #print(detect(title) + ' : ' + title)
        test = langs.get(detect(title))
        if test:
            langs[detect(title)] += 1
        else:
            langs[detect(title)] = 1   
    
    logIt(caller,'=> done')
    
    logIt(caller,'Plot representativeness')

    plt.bar(range(len(langs)), list(langs.values()), align='center')
    plt.xticks(range(len(langs)), list(langs.keys()))

    plt.show()    
    

In [74]:
langRepresentativeness(data, 'test')

2021-04-14 23:06:25 - test             Get lang from title


KeyError: 'es'

Commentaire sur la répartition des langues + choix

In [82]:
"""
-----------------------------------------------------------------
Desc:   Clean code
Input:  Dataset
Output: dataset with body feature without tags   

Traitements appliqués : 
- remove code tags contents
-----------------------------------------------------------------
"""
def removeUnnecessaryContents(df, caller):
    
    input_file = open("./dataset/dataset.csv", "r")
    output_file = open("./dataset/dataset_ok.csv", "w")
    code = ""
    pre = ""
    href = ""

    for line in input_file:
        match_code = re.match(r'\s.*(<code>.*</code>)', line) 
        match_pre = re.match(r'\s.*(<pre>.*</pre>)', line) 
        match_href = re.match(r'\s.*(<a href>.*</a>)', line) 

        if match_code:
            print('match_code: ' + line)
            code = match_code.group(1)
            print('--> ' + code)
            line = line.replace(code,'')
        elif match_pre:
            pre = match_pre.group(1)
            line = line.replace(pre,'')

        elif match_href:
            href = match_ot.group(2)
            line = line.replace(href,'')

        output_file.write(line)
        
    df = pd.read_csv('./dataset/dataset_ok.csv',sep=",",nrows=1000)
    
    return df
    

In [83]:
data = removeUnnecessaryContents(data, 'test')

match_code:   When you create <code>autoparts()</code> or <code>splittext()</code>, the idea is that this will be a function that you can call, and it can (and should) give something back.

--> <code>splittext()</code>
match_code:   Once you figure out the output that you want your function to have, you need to put it in a <code>return</code> statement.</p>

--> <code>return</code>
match_code:   <code>-ls</code> gives me the space-delimited filename fragments

--> <code>-ls</code>
match_code:   <code>-fprint</code> doesn't do any better.</p>

--> <code>-fprint</code>
match_code:     <code>AutoCompleteMode = SuggestAppend</code><br>

--> <code>AutoCompleteMode = SuggestAppend</code>
match_code:     <code>AutoCompleteSource = ListItems</code><br>

--> <code>AutoCompleteSource = ListItems</code>
match_code:     <code>DropDownStyle = DropDown</code></p>

--> <code>DropDownStyle = DropDown</code>
match_code:   <p>System.Configuration.ConfigurationErrorsException: Exactly one <code>&lt;siteM

In [77]:
data.head()

Unnamed: 0,title,body,tags
0,IEnumerable Or Enumerable for Enumerating composite objects,"<p>Hello I need some help with the following:</p>\n\n<p>I have a ListView control with many ListViewItems, and each ListViewItem's tag property references a domain object like Customer etc.</p>\n\n<p>when retrieving the collection of selected listviewItems, I would like to pass an <code>IEnumerable&lt;Customer&gt;</code> instead of the <code>IEnumerable&lt;ListViewItem&gt;</code> to a function to do something with it.</p>\n\n<p>what would be the best way to achieve this with only one loop?</p>\n\n<p>Thanks.</p>\n",<c#>
1,How to programmatically edit the routing table,"<p>I am writing a daemon running on an embedded platform that needs to change the default route of the device according to which interface it is connecting on at a given time. How can I do this programatically? I know I can use system(""route del default &amp;&amp; route add default gateway blah""); etc but is there a more direct way?</p>\n\n<p>UPDATE: I solved my particular problem by discovering a patch to pppd that allows a replacedefaultroute option. The patch also includes code for programmatically modifying the routing table. It is included in this gentoo bug report <a href=""http://bugs.gentoo.org/227321"" rel=""nofollow noreferrer"">http://bugs.gentoo.org/227321</a></p>\n",<c><linux><networking>
2,How to erase the content on a graphics in GDI+?,"<p>I am using gdi+, in c++.</p>\n\n<pre><code>Bitmap canvasImg = new Bitmap(400, 300, PixelFormat32bppARGB);\nGraphics canvas = new Graphics(&amp;canvasImg );\n\ncanvas.DrawImage(XXXX);\n</code></pre>\n\n<p>There are two problems.\n<p>\n1. I find the canvasImg is black. How can I change the color to white? I mean I want an white canvas. <p>\n2. If I have drawn some thing on the canvas, How can I clear the canvas?<p></p>\n\n<p>Many Thanks!</p>\n",<canvas><gdi+>
3,How do I access the state of individual bits of a word in MIPS?,<p>I'm writing a program and I need to determine if bits 3 and 6 are set. I know that I can rotate a word or left/right shift it.</p>\n\n<p>But how do I access individual bit's state? Do I use a bitwise operator like and/xor?</p>\n,<assembly><bit-manipulation><mips>
4,How to handle a closing application event in Java?,"<p>Having a console application, a server accepting several connections from clients, is it possible to have a listener or an event on a closing application? I want, in this event, tell all connected clients to gently disconnect before the application really closes itself.</p>\n\n<p>Any solution? Thank you!</p>\n",<java><events><listener>
