<h1>Projet_OC_05: Automatically categorize questions (API)</h1>

# Contents:

**Part 1: Notebook setup**

 - <a href="#C11">P1.1: Loading libraries </a>
 - <a href="#C12">P1.2: Functions </a>
 - <a href="#C13">P1.3: Loading the data</a>
 
**Part 2: Data representation**

 - <a href="#C21">P2.1: Data structure </a>
 - <a href="#C22">P2.2: NaN and duplicates </a>
 - <a href="#C23">P2.3: Data inspection </a> 
 
**Part 3: Data analysis**

 - <a href="#C31">P3.1: Data structure </a>
 - <a href="#C32">P3.2: Data cleaning </a>
 - <a href="#C33">P3.3: Word cloud visualization </a>
 
**Part 4: Saving the data**

 - <a href="#C41">P4.1: Saving the data </a>

<h1>Part 1: Notebook setup</h1>

# <a name="C11"> P1.1: Loading libraries </a>

In [24]:
import os, sys, time

import numpy as np
import pandas as pd

import re
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize, wordpunct_tokenize, RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer
import spacy
from bs4 import BeautifulSoup
import string

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier, LogisticRegression, Perceptron
from sklearn.multiclass import OneVsRestClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import hamming_loss, jaccard_score
import tensorflow_hub as hub

import joblib

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set()

**The libraries listed above are loaded.**

# <a name="C12"> P1.2: Functions </a>

In [3]:
def tags_process(sentence):
    tags = sentence.replace('<', ' ').replace('>', ' ').replace('/', ' ').strip()
    return tags
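
As a quick check of `tags_process`, the sketch below applies it to the raw tag string of the first row of the dataset. The function is repeated so that the snippet runs on its own; splitting the result on whitespace is an extra step added for illustration.

```python
def tags_process(sentence):
    # Replace the angle brackets (and slashes) that delimit tags with spaces
    return sentence.replace('<', ' ').replace('>', ' ').replace('/', ' ').strip()

raw_tags = '<javascript><jquery><ajax><json><jsonp>'
clean_tags = tags_process(raw_tags)

# Adjacent delimiters leave double spaces, so split() is used to get a clean list
tag_list = clean_tags.split()
print(tag_list)
```

The cleaned string keeps double spaces between tags, which is harmless as long as downstream code splits on whitespace, as `rare_tag_remove` does.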

In [4]:
def rare_tag_remove(sentence, most_freq_tags):
    sentence_filter = [w for w in sentence.split() if w in most_freq_tags]
    
    if sentence_filter:
        return ' '.join(sentence_filter)
    else:
        return np.nan
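
To see how `rare_tag_remove` behaves together with a frequency count, here is a small dependency-free sketch. The toy tag strings are hypothetical, `collections.Counter` stands in for `nltk.FreqDist`, and `math.nan` stands in for `np.nan`.

```python
import math
from collections import Counter

def rare_tag_remove(sentence, most_freq_tags):
    # Keep only the tags that belong to the frequent-tag vocabulary
    kept = [w for w in sentence.split() if w in most_freq_tags]
    # math.nan replaces np.nan here to avoid the NumPy dependency
    return ' '.join(kept) if kept else math.nan

# Hypothetical toy data: one tag string per question
tag_strings = ['python pandas', 'python pandas numpy', 'java', 'python']

# Count every tag and keep the 2 most frequent ones
counts = Counter(' '.join(tag_strings).split())
most_freq_tags = {tag for tag, _ in counts.most_common(2)}

filtered = [rare_tag_remove(s, most_freq_tags) for s in tag_strings]
print(filtered)
```

Questions whose tags are all rare end up as NaN, which is what lets a later `dropna` discard them.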

In [5]:
class TXTProcesser(TransformerMixin, BaseEstimator):
    
    def __init__(self,
              stop_words=list(set(stopwords.words('english'))) + ['[', ']', ',', '.', ':', '?', '(', ')', 'good', 'idea', 'solution', 'issue', 'problem', 'way', 'example', 
                 'case', 'question', 'questions', 'something', 'everything', 'anything', 'thing', 'things', 'answer', 'thank', 'thanks', 'none', 'end', 'anyone', 
                 'test', 'lot', 'one', 'someone', 'help'], 
              authorized_pos=['NN', 'NNS', 'NNP', 'NNPS'], 
              no_pos_tag_list=['r', 'c', 'android', 'java', 'javascript', 'ios', 'iphone', 'c#', 'c++'], 
              no_lem_stem_list=['ios', 'pandas', 'objectivec', 'string', 'spring', 'windows', 'gems', 'named', 'cookie', 'cookies', 'css', 'r', 'files', 'types', 'functions', 'sass'],
              nlp = spacy.load("en_core_web_sm", disable=['parser', 'ner']),
              embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")):
        
        self.stop_words = stop_words
        self.authorized_pos = authorized_pos
        self.no_pos_tag_list = no_pos_tag_list
        self.no_lem_stem_list = no_lem_stem_list
        self.nlp = nlp
        self.embed = embed
    
    def fit(self, X, Y=None):
        # Stateless transformer: nothing is learned from the data
        return self
    
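    def transform(self, X, Y=None):
        
        X_clean = []
        
        if isinstance(X, list):
            # List of documents: clean each one, then embed them all
            for x in X:
                X_clean.append(self.txt_clean(x))            
    
            return self.txt_use_feature(X_clean)
            
        else:        
            # Single string: repeat it 10 times to fill one embedding batch, then keep the first embedding
            for x in [X]*10:
                X_clean.append(self.txt_clean(x))            

            return [self.txt_use_feature(X_clean)[0]]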
    
    def txt_clean(self, X, Y=None):

        sentence_lower = X.lower()
        sentence_no_html_raw = BeautifulSoup(sentence_lower, "html.parser")
        for data in sentence_no_html_raw(['style', 'script', 'code', 'a']):
            # Remove tags
            data.decompose()
        sentence_no_html = ' '.join(sentence_no_html_raw.stripped_strings)
        # Expand common contractions and drop comparison filler; ' - ' is handled before '-' so that both replacements take effect
        sentence_no_abb = sentence_no_html.replace("what's", "what is ").replace("\'ve", " have ").replace("can't", "can not ").replace("n't", " not ").replace("i'm", "i am ")\
                           .replace("\'re", " are ").replace("\'d", " would ").replace("\'ll", " will ").replace("\'scuse", " excuse ").replace(' - ', ' ').replace('-', '_')\
                           .replace(' vs ', ' ').replace(' vs. ', ' ').replace('difference between', ' ').replace('::', '')

        sentence_no_new_line = re.sub(r'\n', ' ', sentence_no_abb)

        # Protect technology names before punctuation is stripped (hyphens are already '_' at this point)
        sentence_protected = sentence_no_new_line.replace('asp.net', 'aspdotnet').replace('.net', 'dotnet').replace('node.js', 'nodedotjs')\
                                                 .replace('visual studio', 'visualstudio').replace('android studio', 'androidstudio')\
                                                 .replace('objective_c', 'objectivec').replace('sql_server', 'sqlserver').replace('+', 'plus').replace('#', 'sharp')

        # Replace the remaining punctuation with spaces, keeping '_' so that hyphenated terms can be restored later
        translator = str.maketrans(dict.fromkeys(string.punctuation.replace('_', ''), ' '))
        sentence_no_caracter_process = sentence_protected.translate(translator).strip()

        sentence_no_stopwords = ' '.join([word for word in sentence_no_caracter_process.split() if word not in self.stop_words])


        sentence_tokens =  [token.text for token in self.nlp(sentence_no_stopwords) if token.tag_ in self.authorized_pos and len(token.text)>=3 or token.text in self.no_pos_tag_list] 


        # Lemmatize the kept tokens, except names that lemmatization would distort
        lemmatizer = WordNetLemmatizer()
        lem_or_stem_tokens = [lemmatizer.lemmatize(word) if word not in self.no_lem_stem_list else word for word in sentence_tokens]


        # Join the lemmatized tokens and restore the protected names
        final_sentence = ' '.join(lem_or_stem_tokens).replace('plus', '+').replace('_', '-').replace('sharp', '#').replace('dotnet', '.net').replace('aspdotnet', 'asp.net')\
                                                        .replace('nodedotjs', 'node.js').replace('visualstudio', 'visual-studio').replace('objectivec', 'objective-c')\
                                                        .replace('androidstudio', 'android-studio').replace('sqlserver', 'sql-server').strip()

        return final_sentence  
    
    def txt_use_feature(self, X, Y=None):
        batch_size = 10
        features = []
        
        # Embed in batches of 10; the last batch may be shorter, so no sample is dropped
        for idx in range(0, len(X), batch_size):
            features.append(self.embed(X[idx:idx+batch_size]))

        return np.concatenate(features)
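
Tag names such as `c#` or `asp.net` contain punctuation, which is why `txt_clean` maps them to plain-word placeholders and maps them back at the end. A minimal standalone sketch of that round trip follows. The helper names `protect`, `strip_punctuation`, and `restore` are hypothetical, only a few of the substitutions are included, and the sketch assumes the placeholders are inserted before punctuation is stripped.

```python
import string

# Placeholder pairs (a subset of those used in txt_clean), longest pattern first
PLACEHOLDERS = [('asp.net', 'aspdotnet'), ('node.js', 'nodedotjs'),
                ('.net', 'dotnet'), ('#', 'sharp'), ('+', 'plus')]

def protect(text):
    # Swap punctuation-bearing names for plain-word placeholders
    for raw, placeholder in PLACEHOLDERS:
        text = text.replace(raw, placeholder)
    return text

def strip_punctuation(text):
    # Replace every punctuation character with a space, then normalize spacing
    table = str.maketrans(dict.fromkeys(string.punctuation, ' '))
    return ' '.join(text.translate(table).split())

def restore(text):
    # Map placeholders back to the original spelling (aspdotnet before dotnet)
    for raw, placeholder in PLACEHOLDERS:
        text = text.replace(placeholder, raw)
    return text

sample = 'how to call c# code from asp.net, or from c++?'
cleaned = restore(strip_punctuation(protect(sample)))
print(cleaned)
```

One caveat of plain substring replacement: restoring `plus` to `+` also rewrites ordinary words that happen to contain `plus`, so a stricter version would probably match whole tokens only.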

In [7]:
class TXTModel(TransformerMixin, BaseEstimator):
    
    def __init__(self, clf=OneVsRestClassifier(SGDClassifier()), ml_binarizer=MultiLabelBinarizer(), arg_sort=np):
        # Store the arguments as given so that a custom classifier or binarizer is not silently ignored
        self.clf = clf
        self.ml_binarizer = ml_binarizer
        self.arg_sort = arg_sort
    
    def fit(self, X, Y):
        self.ml_binarizer.fit(Y)
        self.clf.fit(X, self.ml_binarizer.transform(Y))
        return self
        
    def predict(self, X):
        
        return self.clf.predict(X)
    
    def decision_function(self, X):

        dfun = self.clf.decision_function(X)
        # argsort is ascending, so the last 5 columns hold the indices of the 5 highest scores
        most_common_idx = self.arg_sort.argsort(dfun)[:, -5:]
        return self.classes_(most_common_idx)
    
    def transform(self, Y):
        
        return self.ml_binarizer.transform(Y) 
        
    def inverse_transform(self, Yt):
        
        return self.ml_binarizer.inverse_transform(Yt)      
    
    def classes_(self, Y_idx):
        
        return self.ml_binarizer.classes_[Y_idx]
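
`decision_function` returns, for each sample, the five labels with the highest scores, ordered from lowest to highest score. The sketch below shows the same selection for a single row in plain Python. `top_k_labels`, the label list, and the scores are made up for illustration, and `sorted` replaces `np.argsort`.

```python
def top_k_labels(scores, classes, k=5):
    # Sort label indices by ascending score and keep the last k,
    # which mirrors np.argsort(scores)[-k:]
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return [classes[i] for i in order[-k:]]

# Hypothetical label vocabulary and decision scores for one question
classes = ['c#', 'java', 'javascript', 'pandas', 'python', 'sql']
scores = [-1.2, -0.4, -2.0, 0.9, 1.5, -0.8]

top3 = top_k_labels(scores, classes, k=3)
print(top3)
```

Because the order is ascending, the best label comes last; reversing the result would list the most likely tag first.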
    

# <a name="C13"> P1.3: Loading the data </a>

In [8]:
raw_txt_data = pd.read_csv('data.csv')
raw_txt_data = raw_txt_data.select_dtypes(include=object)

print('-'*150)
print('Data size:', raw_txt_data.shape)
print('-'*150)
raw_txt_data.head()

------------------------------------------------------------------------------------------------------------------------------------------------------
Data size: (27832, 3)
------------------------------------------------------------------------------------------------------------------------------------------------------


Unnamed: 0,Title,Body,Tags
0,Post data to JsonP,<p>Is it possible to post data to JsonP? Or do...,<javascript><jquery><ajax><json><jsonp>
1,How do I run NUnit in debug mode from Visual S...,<p>I've recently been building a test framewor...,<c#><visual-studio-2008><unit-testing><testing...
2,Output different precision by column with pand...,<h2>Question</h2>\n\n<p>Is it possible to spec...,<python><csv><numpy><floating-point><pandas>
3,How to use p4merge as the merge/diff tool for ...,"<p>Does anyone know how to setup <a href=""http...",<macos><configuration><mercurial><perforce><p4...
4,How to change the locale in chrome browser,<p>I want to change Accept-language request he...,<google-chrome><google-chrome-extension><brows...


In [17]:
raw_txt_data['Tags'] = raw_txt_data.Tags.apply(tags_process)

# Keep only the 100 most frequent tags
all_tags = ' '.join(raw_txt_data['Tags'].tolist()).split()
keywords = nltk.FreqDist(all_tags)
most_freq_tags = [w for (w, f) in keywords.most_common(100)]

# Remove rare tags, then drop the questions left without any tag
raw_txt_data['Tags'] = raw_txt_data.Tags.apply(lambda sentence: rare_tag_remove(sentence, most_freq_tags))
raw_txt_data.dropna(inplace=True)

In [18]:
data = raw_txt_data.sample(frac=0.1)

X = (data.Title + ' ' + data.Body).tolist()

y = data.Tags

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 1) 

y_train_list = y_train.apply(lambda x: x.split()).tolist()
y_test_list = y_test.apply(lambda x: x.split()).tolist()
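
The label lists built above are what the `MultiLabelBinarizer` inside `TXTModel` turns into a binary indicator matrix, with one column per tag. As a rough, dependency-free sketch of that encoding (the function `binarize` and the toy labels are hypothetical, and this is not scikit-learn's implementation):

```python
def binarize(label_lists):
    # Sorted vocabulary of all labels seen, one column per label
    classes = sorted({label for labels in label_lists for label in labels})
    index = {label: i for i, label in enumerate(classes)}
    matrix = []
    for labels in label_lists:
        row = [0] * len(classes)
        for label in labels:
            row[index[label]] = 1
        matrix.append(row)
    return classes, matrix

# Tag strings are split on whitespace, as done for y_train and y_test
tag_strings = ['python pandas', 'java', 'python']
label_lists = [s.split() for s in tag_strings]

classes, matrix = binarize(label_lists)
print(classes)
print(matrix)
```

Each row can carry several ones, which is what makes the task multi-label rather than multi-class.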

# <a name="C21"> P2.1: Pipeline </a>

In [19]:
pipe = Pipeline([('transformer', TXTProcesser()), ('model', TXTModel())])

In [20]:
# Fit on the training split (X_train, not the full X) so that texts and labels stay aligned
pipe.fit(X_train[:(len(X_train)//10)*10], y_train_list[:(len(X_train)//10)*10])
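
The slice `[:(len(X_train)//10)*10]` trims the training set to a multiple of 10, matching the batch size used in `txt_use_feature`. A possible alternative is to let the last batch be shorter so that no sample is discarded. A minimal sketch, in which `embed_in_batches` and `fake_embed` are hypothetical stand-ins (the latter replaces the Universal Sentence Encoder):

```python
def embed_in_batches(texts, embed, batch_size=10):
    # Step through the inputs batch by batch; the final slice may be shorter
    features = []
    for start in range(0, len(texts), batch_size):
        features.extend(embed(texts[start:start + batch_size]))
    return features

def fake_embed(batch):
    # Stand-in encoder: one 1-dimensional "vector" per text (its length)
    return [[len(text)] for text in batch]

texts = ['question %d' % i for i in range(23)]
features = embed_in_batches(texts, fake_embed)
print(len(features))
```

With 23 inputs the loop runs three times (10, 10, and 3 texts) and returns one vector per input.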

# <a name="C22"> P2.2: Pipeline deployment </a>

In [25]:
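# Pass the pipeline object itself; quoting the name would only save the string 'pipe'.
# Every attribute of the pipeline must be picklable for this call to succeed.
joblib.dump(pipe, 'pipeline_housing.joblib')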

['pipeline_housing.joblib']
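
A persistence call stores whatever object it is given, so a quoted name is saved as a plain string rather than as the fitted object. The sketch below illustrates the difference with the standard-library `pickle` module as a stand-in for `joblib`. The dictionary `model` is a hypothetical placeholder for a fitted pipeline.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a fitted pipeline
model = {'classes': ['java', 'python'], 'threshold': 0.5}
path = os.path.join(tempfile.mkdtemp(), 'model.pkl')

# Serializing the quoted name stores only a string
with open(path, 'wb') as f:
    pickle.dump('model', f)
with open(path, 'rb') as f:
    loaded_name = pickle.load(f)

# Serializing the object itself brings the content back on reload
with open(path, 'wb') as f:
    pickle.dump(model, f)
with open(path, 'rb') as f:
    loaded_model = pickle.load(f)

print(type(loaded_name).__name__, type(loaded_model).__name__)
```

Reloading the file right after saving and checking the type of the result is a cheap way to confirm that the intended object was stored.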