<h1>Projet_OC_05 : Catégorisez automatiquement des questions (API)</h1>

# Sommaire :

**Partie 1 : Configuration du notebook**

 - <a href="#C11">P1.1 : Chargement des librairies </a>
 - <a href="#C12">P1.2 : Fonctions </a>
 - <a href="#C13">P1.3 : Classes </a>
 - <a href="#C14">P1.4 : Chargement des données</a>
 
**Partie 2 : Modèle et déploiement**

 - <a href="#C21">P2.1 : Modèle </a>
 - <a href="#C22">P2.2 : Déploiement du modèle (mlflow) </a>
 - <a href="#C23">P2.3 : Essai du modèle (mlflow) </a> 


<h1>Partie 1 : Configuration du notebook</h1>

# <a name="C11"> P1.1 : Chargement des librairies </a>

In [13]:
import os, sys, time

import numpy as np
import pandas as pd

import re
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize, wordpunct_tokenize, RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer
import spacy
from bs4 import BeautifulSoup
import string

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier, LogisticRegression, Perceptron
from sklearn.multiclass import OneVsRestClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score, jaccard_score, recall_score, accuracy_score

import tensorflow_hub as hub

import joblib
from mlflow.models.signature import infer_signature
import mlflow.sklearn 

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set()

**La liste des librairies ci-dessus sont chargées.**

# <a name="C12"> P1.2 : Fonctions </a>

In [3]:
def print_score(scores, y_test, y_pred, clf, average='weighted'):
    
    '''
    print_score(scores, y_test, y_pred, clf, average='weighted')
    
    This function evaluates model performance using 4 scores: accuracy_score, recall_score, f1_score, jaccard_score
    
    parameters:
        score: list
            list appends all scores, useful to compare several models 
        y_test: list 
            true target
        y_pred: lisy
            predictions
        average: str, default 'weighted'
            average use dto compute final score
    '''
    
    
    
    score_1 = accuracy_score(y_test, y_pred)
    score_2 = recall_score(y_test, y_pred, average=average)
    score_3 = f1_score(y_pred, y_test, average=average)
    score_4 = jaccard_score(y_test, y_pred, average=average)


    print("Clf: ", clf.__class__.__name__)
    print("accuracy_score: {}".format(score_1))
    print("recall_score: {}".format(score_2))
    print("f1_score: {}".format(score_3))
    print("Jaccard score: {}".format(score_4))

    scores_temp = np.concatenate((scores, np.around([score_1, score_2, score_3, score_4], 2)))

    return scores_temp

**La fonction ci-dessus evalue les performance de modèle en utilisant 4 scores: accuracy, recall, f1_score, et jaccard .**

In [2]:
def tags_process(sentence, most_freq_tags):
    
    '''
    Function tags_process(sentence, most_freq_tags)
    
    This function removes < and > from a sentence and filter words, only keeping most_freq_tags
    
    parameters:
        sentence: str
            String to be cleaned 
        most_freq_tags: list 
            list of words (string) containing the words to be kept 
    '''
    
    sentence_process = (sentence.replace('<', ' ').replace('>', ' ').replace('/', ' ').strip()).split()
    
    sentence_filter = [word for word in sentence_process if word in most_freq_tags]                
    
    if sentence_filter:
        return ' '.join(sentence_filter)
    else:
        return np.nan

**La fonction ci-dessus assure le traitement des Tags. Elle supprimer "<" et ">" et filtre les tags les plus fréquents.**

In [6]:
def tag_ponc_process(sentence):
    
    '''
    tag_ponc_process(sentence)
    
    This function replaces the pnctuations in the 50-most used tags.  
    
    parameters:
        sentence: str
            String to be cleaned 
    '''
    
    return sentence.replace('c#', 'csharp').replace('c++', 'cplusplus').replace('.net', 'dotnet').replace('objective-c', 'objectivec').replace('ruby-on-rails', 'rubyonrails')\
                .replace('sql-server', 'sqlserver').replace('node.js', 'nodedotjs').replace('aspdotnet-mvc', 'aspdotnetmvc').replace('visual-studio', 'visualstudio').replace('visual studio', 'visualstudio')\
                .replace('unit-testing', 'unittesting').replace('cocoa-touch', 'cocoatouch').replace('python-3.x', 'python3x').replace('entity-framework', 'entityframework')\
                .replace('language-agnostic', 'languageagnostic').replace('amazon-web-services', 'amazonwebservices').replace('google-chrome', 'googlechrome').replace('user-interface', 'userinterface')\
                .replace('design-patterns', 'designpatterns').replace('version-control', 'versioncontrol').strip()

**La fonction ci-dessus remplace ou supprime les ponctuations, dans les tags les plus fréquents, pour éviter de perdre ces tags lors de traitement de text.**

In [7]:
def inverse_tag_ponc_process(sentence):
    
    '''
    inverse_tag_ponc_process(sentence)
    
    This function inverses the processing of the "tag_ponc_process()" function. It coverts the 50-most used tags into their original formats.  
    
    parameters:
        sentence: str
            String to be cleaned 
    '''
    
    return sentence.replace('csharp', 'c#').replace('cplusplus', 'c++').replace('dotnet', '.net').replace('objectivec', 'objective-c').replace('rubyonrails', 'ruby-on-rails')\
                .replace('sqlserver', 'sql-server').replace('nodedotjs', 'node.js').replace('aspdotnetmvc', 'aspdotnet-mvc').replace('visualstudio', 'visual-studio')\
                .replace('unittesting', 'unit-testing').replace('cocoatouch', 'cocoa-touch').replace('python3x', 'python-3.x').replace('entityframework', 'entity-framework')\
                .replace('languageagnostic', 'language-agnostic').replace('amazonwebservices', 'amazon-web-services').replace('googlechrome', 'google-chrome').replace('userinterface', 'user-interface')\
                .replace('designpatterns', 'design-patterns').replace('versioncontrol', 'version-control').strip()

**La fonction ci-dessus inverse le traitement de la fonction "tag_ponc_process()".**

In [8]:
def txt_process(sentence, stop_words, authorized_pos, no_pos_tag_list, no_lem_stem_list, force_is_alpha=False, method='spacy', lem_or_stem='lem'):
    
    '''
    txt_process(sentence, stop_words, authorized_pos, no_pos_tag_list, no_lem_stem_list, force_is_alpha=False, method='spacy', lem_or_stem='lem')
    
    This function is a set of text processing steps: lower, html tags\abbreviation\ponctuation\stop_words removing, tokonization, pos_tags filetring, and lemmatization\stemming.  
    
    parameters:
        sentence: str
            String to be cleaned 
        stop_words: list
            stop_words to be removed
        authorized_pos: list
            pos_tags to be kept
        no_pos_tag_list: list
            contains tokens, needed to be kept (ex: targets)
        no_lem_stem_list: list
            contains tokens, needed to be kept in their original formats (ex: Tags, keywords, targets)
        force_is_alpha: bool, default False
            if True, only alphabetic tokens are kept
        method: str, default 'spacy'
            defines the tokenization method: 'word'==>word_tokenize(), 'wordpunct'==>wordpunct_tokenize, 'spacy'==>nlp\spacy
        lem_or_stem: str, default 'lem'
            choice between lemmatization or stemming: 'lem'==> lemmatization, 'stem'==> stemming
    '''
    
    
    sentence_lower = sentence.lower()
    
    sentence_no_html_raw = BeautifulSoup(sentence_lower, "html.parser")

    for data in sentence_no_html_raw(['style', 'script', 'code', 'a']):
        # Remove tags
        data.decompose()
        
    sentence_no_html = ' '.join(sentence_no_html_raw.stripped_strings)
    
    sentence_no_abb = sentence_no_html.replace("what's", "what is ").replace("\'ve", " have ").replace("can't", "can not ").replace("n't", " not ").replace("i'm", "i am ")\
                       .replace("\'re", " are ").replace("\'d", " would ").replace("\'ll", " will ").replace("\'scuse", " excuse ").replace(' vs ', ' ').replace('difference between', ' ')

    sentence_no_abb_trans = tag_ponc_process(sentence_no_abb)

    sentence_no_new_line = re.sub(r'\n', ' ', sentence_no_abb_trans)

    translator = str.maketrans(dict.fromkeys(string.punctuation, ' '))
    sentence_no_caracter = sentence_no_new_line.translate(translator)
    
    sentence_no_stopwords = ' '.join([word for word in sentence_no_caracter.split() if word not in stop_words])
    
    if method=='word':
        tokens_list = word_tokenize(sentence_no_stopwords)
        sentence_tokens = [word for (word, tag) in nltk.pos_tag(tokens_list) if tag in authorized_pos and len(word)>=3 or word in no_pos_tag_list] 
    elif method=='wordpunct':
        tokens_list = wordpunct_tokenize(sentence_no_stopwords)
        sentence_tokens = [word for (word, tag) in nltk.pos_tag(tokens_list) if tag in authorized_pos and len(word)>=3 or word in no_pos_tag_list]
    elif method=='spacy':
        sentence_tokens =  [token.text for token in nlp(sentence_no_stopwords) if token.tag_ in authorized_pos and len(token.text)>=3 or token.text in no_pos_tag_list] 
    else: 
        tokens_list = RegexpTokenizer(r"\w+").tokenize(sentence_no_stopwords)
        sentence_tokens = [word for (word, tag) in nltk.pos_tag(tokens_list) if tag in authorized_pos and len(word)>=3 or word in no_pos_tag_list]
    
    if force_is_alpha:
        alpha_tokens = [word for word in sentence_tokens if word.isalpha()]
    else:
        alpha_tokens = sentence_tokens
    
    if lem_or_stem=='lem':
        lemmatizer = WordNetLemmatizer()
        lem_or_stem_tokens = [lemmatizer.lemmatize(word) if word not in no_lem_stem_list else word for word in alpha_tokens]
        
    else:
        stemmer = PorterStemmer()
        lem_or_stem_tokens = [stemmer.stem(word) if word not in no_lem_stem_list else word for word in alpha_tokens]
    
    final_sentence = inverse_tag_ponc_process(' '.join(sentence_tokens))
    
    return final_sentence

**La fonction ci-dessus assure le traitement de texte: suppression de html tags-ponctuation-stop_words-etc, tokennisation, lemmatisation-stemming, etc.**

In [11]:
def feature_USE_fct(sentences, b_size=8) :
    
    '''
    feature_USE_fctsentences, b_size=8)
    
    This function extacts features from text, using USE (universal sentence encoder) model.  
    
    parameters:
        sentence: str
            String to be cleaned 
        b_size: int, default 8
            sentence set treated at once, fixed to 8 in coherence with GPU architecture.
    '''

    batch_size = b_size
    time1 = time.time()

    for step in range(len(sentences)//batch_size) :
        idx = step*batch_size
        feat = embed(sentences[idx:idx+batch_size])

        if step ==0 :
            features = feat
        else :
            features = np.concatenate((features,feat))

    time2 = np.round(time.time() - time1,0)
    return features

**La fonction ci-dessus utilise USE codage pour extraire des features.**

# <a name="C13"> P1.3 : Classes </a>

In [14]:
class TXTModel(TransformerMixin, BaseEstimator):

    '''
    TXTModel(TransformerMixin, BaseEstimator)
    
    This classe groups a classifier "clf" and "MultiLabelBinarizer" transformer together. "MultiLabelBinarizer" is used in a post-processing step.  
    
    parameters:
        clf: sklearn classifier
            classifier to be used, ex: OneVsRestClassifier(LinearSVC()). 
        ml_binarizer: sklearn transformer
            MultiLabelBinarizer() from sklearn.
    '''
    
    def __init__(self, clf, ml_binarizer):
        self.clf = clf
        self.ml_binarizer = ml_binarizer
        
    def transform(self, Y):
        
        return self.ml_binarizer.transform(Y.tolist()) 
    
    def fit(self, X, Y):
        
        self.ml_binarizer.fit(Y.tolist())
        self.clf.fit(X, self.ml_binarizer.transform(Y.tolist()))
        
    def predict(self, X):
        
        dfun = self.clf.decision_function(X)
        most_common_idx = dfun.argsort()[:, -5:]
        return self.classes_(most_common_idx)
    
    def decision_function(self, X):

        return self.clf.decision_function(X)    
    
    def classes_(self, Y_idx):
        
        return self.ml_binarizer.classes_[Y_idx]
    

**La classe ci-dessus regroupe un classificateur et le transformateur MultiLabelBinarizer() pour assurer une multi-classification et un post-traitement du résultat.**

# <a name="C14"> P1.4 : Chargement des données </a>

In [9]:
raw_txt_data = pd.read_csv('data.csv')
raw_txt_data = raw_txt_data.select_dtypes(include=object)
raw_txt_data.dropna(inplace=True)

print('-'*150)
print('Data size:', raw_txt_data.shape)
print('-'*150)
raw_txt_data.head()

------------------------------------------------------------------------------------------------------------------------------------------------------
Data size: (99997, 3)
------------------------------------------------------------------------------------------------------------------------------------------------------


Unnamed: 0,Title,Body,Tags
0,Find Mime type of file or url using php for al...,<p>Hi I am looking for best way to find out mi...,<php><amazon-web-services><mime-types><content...
1,native zlib inflate/deflate for swift3 on iOS,<p>I'd like to be able to inflate/deflate Swif...,<ios><swift><swift3><zlib><swift-data>
2,`Sudo pip install matplotlib` fails to find fr...,<p>I already have <code>matplotlib-1.2.1</code...,<python><numpy><matplotlib><homebrew><osx-mave...
3,Serialization in C# without using file system,<p>I have a simple 2D array of strings and I w...,<c#><sharepoint><serialization><moss><wss>
4,How do I prevent IIS from compiling website?,<p>I have an ASP .NET web application which on...,<asp.net><performance><web-services><iis><asmx>


**Ci-dessus, les données originales sont chargées.**

<h1>Partie 2 : Modèle et déploiement</h1>

# <a name="C21"> P2.1 : Modèle </a>

In [22]:
most_common_val = 50
all_tags = ' '.join(raw_txt_data.Tags.apply(lambda sentence: sentence.replace('<', ' ').replace('>', ' ')).tolist()).split()
unique_tags = list(set(all_tags))
keywords = nltk.FreqDist(all_tags)
most_common_tags = [word[0] for word in keywords.most_common(most_common_val)]

raw_txt_data['Tags'] = raw_txt_data.Tags.apply(lambda sentence: tags_process(sentence, most_common_tags))
raw_txt_data.dropna(inplace=True)

**La variable "Tags" est nétoyée et filtrée: 50 tags les plus fréquents sont uniquement gardés.**

In [23]:
nlp = spacy.load("en_core_web_sm", disable=['parser', 'ner'])
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
authorized_pos = ['NN', 'NNS', 'NNP', 'NNPS']
no_pos_tag_list = tag_ponc_process(' '.join(most_common_tags)).split()
no_lem_stem_list = tag_ponc_process(' '.join(most_common_tags)).split()
stop_words = list(set(stopwords.words('english'))) + \
                ['[', ']', ',', '.', ':', '?', '(', ')']
stop_words.extend(['good', 'idea', 'solution', 'issue', 'problem', 'way', 'example', 'case', 'question', 'questions', 'something', 'everything',
                   'anything', 'thing', 'things', 'answer', 'thank', 'thanks', 'none', 'end', 'anyone', 'test', 'lot', 'one', 'someone', 'help'])

clf = OneVsRestClassifier(LinearSVC())
ml_binarizer = MultiLabelBinarizer()

**La définition des variables du modèle est faite ci-dessus.**

In [40]:
raw_txt_data['Title_Body'] = raw_txt_data['Title'] + ' ' + raw_txt_data['Body']

X = raw_txt_data['Title_Body'].sample(frac=0.01).apply(lambda sen: txt_process(sen, stop_words, authorized_pos, no_pos_tag_list, no_lem_stem_list))
y = pd.Series(raw_txt_data.Tags.values, index=raw_txt_data.Tags.index)

**Test processing et création des variables X, y.**

In [41]:
idx_norm = 8
X_use = feature_USE_fct(X, idx_norm)

**Création des features USE est faite ci-dessus.**

In [244]:
X_train, X_test, y_train, y_test = train_test_split(X_use, y[:(len(X_use)//idx_norm)*idx_norm], test_size = 0.25, random_state = 1) 

X_train = X_train[:(len(X_train)//idx_norm)*idx_norm]
X_test = X_test[:(len(X_test)//idx_norm)*idx_norm]
y_train = y_train[:(len(y_train)//idx_norm)*idx_norm]
y_test = y_test[:(len(y_test)//idx_norm)*idx_norm]

y_train = y_train.apply(lambda x: x.split())
y_test = y_test.apply(lambda x: x.split())

**Split des variable X, y.**

In [245]:
clf = TXTModel(clf, ml_binarizer)
clf.fit(X_train, y_train)

**Ci-dessus, fitting du modèle.**

In [252]:
ml_binarizer_score = MultiLabelBinarizer()
ml_binarizer_score.fit(y_train)
scores = print_score([], ml_binarizer_score.transform(clf.predict(X_test)), ml_binarizer_score.transform(y_test), clf, average='micro')

Clf:  TXTModel
accuracy_score: 0.0004152823920265781
recall_score: 0.2968069398301956
f1_score: 0.4475522529292254
Jacard score: 0.2882881267815206


**Le score du modèle est donné ci-dessus. Le score est inférieur, car obtenu avec la méthode "decision_function" en considérant les 5 premiers Tags.**

# <a name="C22"> P2.2 : déploiement du modèle (mlflow)</a>

In [253]:
joblib.dump(clf, 'clf_housing.joblib')

['clf_housing.joblib']

**Sérialisation du modèle à l'aide de joblib.**

In [254]:
signature = infer_signature(X_train, y_train)

**Extraction de la signature de données d'entrée et de sortie.**

In [256]:
mlflow.sklearn.save_model(clf, 'mlflow_model', signature=signature)

**Le modèle "clf" est sauvegardé à l'aide de la fonction save_model.**

mlflow models serve -m mlflow_model/

**Une API REST (mlflow) est créée.**

# <a name="C23"> P2.3 : Essai du modèle (mlflow)</a>

In [61]:
test_idx = np.random.randint(0, raw_txt_data.shape[0], size=1)[0]
raw_txt_data['Title'].iloc[test_idx]

'storing raw json in redis by using spring-data-redis'

In [62]:
raw_txt_data['Body'].iloc[test_idx]

'<p>I am using RedisCacheManager to store my cache data in my spring-boot application. Default serializer seems to serialize everything into byte and deserialize from byte to appropriate java type. </p>\n\n<p>However, I want to make the cache data be stored as json so that I can read it from none-java clients.</p>\n\n<p>I found that switching from default one to other serializers such as Jackson2JsonRedisSerializer supposed to work. After doing this, deserialization phase fails.</p>\n\n<p>pom.xml</p>\n\n<pre><code>    &lt;dependency&gt;\n        &lt;groupId&gt;org.springframework.data&lt;/groupId&gt;\n        &lt;artifactId&gt;spring-data-redis&lt;/artifactId&gt;\n    &lt;/dependency&gt;\n\n    &lt;dependency&gt;\n        &lt;groupId&gt;redis.clients&lt;/groupId&gt;\n        &lt;artifactId&gt;jedis&lt;/artifactId&gt;\n    &lt;/dependency&gt;\n</code></pre>\n\n<p>CacheConfig.java</p>\n\n<pre><code>@Configuration\n@EnableCaching\npublic class CacheConfig {\n\n    @Bean\n    public RedisC

**Ci-dessus, un titre et une question sont générés pour tester l'API.**

In [63]:
raw_txt_data['Tags'].iloc[test_idx]

'java json spring'

**La vrai target est donnée ci-dessus.**