## Project context

You have just joined an IT services company (ESN) as an AI developer to strengthen the Data Science team.

Your first assignment is with a client that is digitizing its newspaper articles and wants to classify them automatically into five categories: tech, business, sport, entertainment, or politics. Because the client plans to enter an innovation competition, it requires the AI component to be a neural network.

Feel free to reduce the size of the dataset (but not by too much) if the operations take too long on your machine.


### 1. Importing the libraries


In [32]:
import pandas as pd
import spacy
import numpy as np
import gensim.downloader


from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

from gensim.models.word2vec import Word2Vec

### 2. Loading the data

In [2]:
df = pd.read_csv('289df373-42e6-40fe-a3ab-8c8110f0a571.csv')
df.head()

Unnamed: 0,category,text
0,tech,tv future in the hands of viewers with home th...
1,business,worldcom boss left books alone former worldc...
2,sport,tigers wary of farrell gamble leicester say ...
3,sport,yeading face newcastle in fa cup premiership s...
4,entertainment,ocean s twelve raids box office ocean s twelve...


In [3]:
df.category.value_counts()

sport            511
business         510
politics         417
tech             401
entertainment    386
Name: category, dtype: int64

In [4]:
nlp = spacy.load('en_core_web_sm')

### 3. Cleaning the text
We lowercase the text and remove the most common words (stopwords) and punctuation tokens.

In [6]:
def clean_docs(texts, remove_stopwords=False, n_process=4):
    """Lowercase each text, drop punctuation tokens and, optionally, stopwords."""
    # Skip the pipeline components that the cleaning step does not use.
    docs = nlp.pipe(texts,
                    n_process=n_process,
                    disable=['parser', 'ner',
                             'lemmatizer', 'textcat'])
    stopwords = nlp.Defaults.stop_words

    docs_cleaned = []
    for doc in docs:
        # Keep every non-punctuation token, lowercased and stripped.
        tokens = [tok.text.lower().strip() for tok in doc if not tok.is_punct]
        if remove_stopwords:
            tokens = [tok for tok in tokens if tok not in stopwords]
        doc_clean = ' '.join(tokens)
        docs_cleaned.append(doc_clean)

    return docs_cleaned

In [12]:
df['text_clean'] = clean_docs(df['text'], remove_stopwords=True)
df.head()

Unnamed: 0,category,text,text_clean,y
0,tech,tv future in the hands of viewers with home th...,tv future hands viewers home theatre systems ...,4
1,business,worldcom boss left books alone former worldc...,worldcom boss left books worldcom boss berni...,0
2,sport,tigers wary of farrell gamble leicester say ...,tigers wary farrell gamble leicester rushed ...,3
3,sport,yeading face newcastle in fa cup premiership s...,yeading face newcastle fa cup premiership newc...,3
4,entertainment,ocean s twelve raids box office ocean s twelve...,ocean s raids box office ocean s crime caper ...,1


### 4. Encoding the target

In [9]:
le = LabelEncoder()
df['y'] = le.fit_transform(df['category'])
le.classes_

array(['business', 'entertainment', 'politics', 'sport', 'tech'],
      dtype=object)
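
`LabelEncoder` assigns integer codes following the alphabetical order of `le.classes_`, so predicted codes can be mapped back to category names with `inverse_transform`. A minimal sketch, assuming the five category names of this dataset; `le_demo` and the predicted codes are made up for illustration:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# The five categories of the dataset; fitting sorts them alphabetically.
categories = ["tech", "business", "sport", "entertainment", "politics"]
le_demo = LabelEncoder().fit(categories)

# Hypothetical predicted codes, for illustration only.
predicted_codes = np.array([4, 0, 3])
predicted_names = le_demo.inverse_transform(predicted_codes)
print(list(predicted_names))  # ['tech', 'business', 'sport']
```

This matches the `y` values shown in the dataframe preview: `tech` is encoded as 4, `business` as 0 and `sport` as 3.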

In [20]:
X_train, X_test, y_train, y_test = train_test_split(df['text_clean'].values, 
                                                    df['y'].values, 
                                                    test_size=0.2, 
                                                    random_state=33,
                                                    stratify = df['y'].values)
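
The `stratify` argument keeps the class proportions nearly identical in the training and test sets. The sketch below illustrates this with the class counts reported by `value_counts()`; the features are placeholders (`dummy_X` is a hypothetical name), so only the label distribution is meaningful:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Class counts reported by df.category.value_counts().
counts = {"sport": 511, "business": 510, "politics": 417,
          "tech": 401, "entertainment": 386}
labels = np.repeat(list(counts.keys()), list(counts.values()))
dummy_X = np.arange(len(labels))  # placeholder features, not real text

_, _, y_tr, y_te = train_test_split(dummy_X, labels, test_size=0.2,
                                    random_state=33, stratify=labels)

# Compare the share of each class in the two subsets.
for name in counts:
    share_train = np.mean(y_tr == name)
    share_test = np.mean(y_te == name)
    print(f"{name:<14} train={share_train:.3f} test={share_test:.3f}")
```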

In [18]:
# Linear SVM classifier, evaluated by cross-validation on the training set
clf = LinearSVC(max_iter=10000, C=0.1)

In [14]:
def fit_vectorizers(vectorizer):
    pipeline = Pipeline(
    [
        ("vect", vectorizer()),
        ("scaling", StandardScaler(with_mean=False)),
        ("clf", clf),
    ]
    )

    parameters = {
        "vect__ngram_range": ((1, 1), (1, 2)),  # unigrams only, or unigrams + bigrams
        "vect__stop_words": ("english", None)
    }

    grid_search = GridSearchCV(pipeline, parameters, scoring='f1_micro',
                               cv=4, n_jobs=4, verbose=1)
    grid_search.fit(X_train, y_train)

    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(parameters.keys()):
        print("\t%s: %r" % (param_name, best_parameters[param_name]))

    print(f"CV scores {grid_search.cv_results_['mean_test_score']}")
    print(f"Mean F1 {np.mean(grid_search.cv_results_['mean_test_score'])}")
    
    return grid_search

In [23]:
cv_bow = fit_vectorizers(CountVectorizer)

Fitting 4 folds for each of 4 candidates, totalling 16 fits
	vect__ngram_range: (1, 2)
	vect__stop_words: 'english'
CV scores [0.95898876 0.95898876 0.9752809  0.9747191 ]
Mean F1 0.9669943820224719


In [24]:
cv_tfidf = fit_vectorizers(TfidfVectorizer)

Fitting 4 folds for each of 4 candidates, totalling 16 fits
	vect__ngram_range: (1, 2)
	vect__stop_words: 'english'
CV scores [0.96292135 0.96292135 0.9747191  0.9747191 ]
Mean F1 0.9688202247191011
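
Note that the "CV scores" printed above contain one mean score per grid candidate, and "Mean F1" averages them over all four candidates. For the bag-of-words run, the best candidate scores about 0.975 while the average over candidates is about 0.967. The score of the selected configuration is available directly as `best_score_`. A minimal sketch on a made-up toy corpus (the texts, labels and the reduced grid are assumptions for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy corpus, for illustration only: 0 = business, 1 = sport.
texts = ["shares rise on profit", "bank profit falls", "market shares drop",
         "oil price and profit", "team wins the cup", "striker scores a goal",
         "coach praises the team", "cup final goal"]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

pipe = Pipeline([("vect", TfidfVectorizer()),
                 ("clf", LinearSVC(C=0.1, max_iter=10000))])
grid = GridSearchCV(pipe, {"vect__ngram_range": [(1, 1), (1, 2)]},
                    scoring="f1_micro", cv=2)
grid.fit(texts, labels)

print("Best parameters:", grid.best_params_)
print("Best mean CV score:", grid.best_score_)
print("Average over all candidates:", grid.cv_results_["mean_test_score"].mean())
```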


In [31]:
X_train_tokens = [text.split() for text in X_train]
w2v_model = Word2Vec(X_train_tokens, vector_size=200, window=5, 
                     min_count=1, workers=4)

In [33]:
def get_mean_vector(w2v_vectors, words):
    words = [word for word in words if word in w2v_vectors]
    if words:
        avg_vector = np.mean(w2v_vectors[words], axis=0)
    else:
        # No known word: fall back to a zero vector of the embedding size.
        avg_vector = np.zeros(w2v_vectors.vector_size, dtype=np.float32)
    return avg_vector

def fit_w2v_avg(w2v_vectors):
    X_train_vectors = np.array([get_mean_vector(w2v_vectors, words)
                                for words in X_train_tokens])
    
    scores = cross_val_score(clf, X_train_vectors, y_train, 
                         cv=4, scoring='f1_micro', n_jobs=4)

    print(f"CV scores {scores}")
    print(f"Mean F1 {np.mean(scores)}")
    return scores
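
The averaging step can be illustrated without gensim. The sketch below uses a plain dictionary of made-up 3-dimensional vectors; `toy_vectors` and `mean_vector` are hypothetical names, and a real run would use the trained embeddings instead:

```python
import numpy as np

# Hypothetical 3-dimensional embeddings, for illustration only.
toy_vectors = {
    "cup": np.array([1.0, 0.0, 2.0]),
    "final": np.array([3.0, 2.0, 0.0]),
}

def mean_vector(vectors, words, dim):
    """Average the vectors of known words; return zeros if none is known."""
    known = [vectors[w] for w in words if w in vectors]
    if known:
        return np.mean(known, axis=0)
    return np.zeros(dim)

doc_vector = mean_vector(toy_vectors, ["cup", "final", "unseen"], dim=3)
empty_vector = mean_vector(toy_vectors, ["unseen"], dim=3)
print(doc_vector)    # [2. 1. 1.]
print(empty_vector)  # [0. 0. 0.]
```

Out-of-vocabulary words are simply ignored, and a document with no known word is represented by a zero vector.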

In [34]:
cv_w2vec = fit_w2v_avg(w2v_model.wv)

CV scores [0.87191011 0.85842697 0.83146067 0.8494382 ]
Mean F1 0.852808988764045


In [None]:
glove_model = gensim.downloader.load('glove-wiki-gigaword-200')

In [None]:
cv_w2vec_transfert = fit_w2v_avg(glove_model)

In [None]:
# Summary table: mean F1 score of each text representation, best first
perfs = pd.DataFrame(
    [np.mean(cv_bow.cv_results_['mean_test_score']),
     np.mean(cv_tfidf.cv_results_['mean_test_score']),
     np.mean(cv_w2vec),
     np.mean(cv_w2vec_transfert)],
    index=['Bag-of-Words', 'TF-IDF', 'Word2Vec (not pre-trained)', 'GloVe (pre-trained)'],
    columns=["Mean F1 score"]
).sort_values("Mean F1 score", ascending=False)
perfs
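
The brief asks for a neural network, whereas the comparison above uses `LinearSVC`. One possible way to meet that requirement is to swap the classifier for scikit-learn's `MLPClassifier` in the same kind of pipeline. This is only a sketch under assumptions: the toy corpus, the hidden layer size and the iteration budget are made up and have not been tuned on the real dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline

# Toy corpus, for illustration only.
toy_texts = ["shares rise on profit", "bank profit falls",
             "market shares drop", "team wins the cup",
             "striker scores a goal", "coach praises the team"]
toy_labels = ["business", "business", "business",
              "sport", "sport", "sport"]

nn_pipeline = Pipeline([
    ("vect", TfidfVectorizer(ngram_range=(1, 2))),
    # Small feed-forward network; the sizes are assumptions, not tuned values.
    ("clf", MLPClassifier(hidden_layer_sizes=(32,), max_iter=500,
                          random_state=33)),
])
nn_pipeline.fit(toy_texts, toy_labels)

pred = nn_pipeline.predict(["bank shares and profit"])
print(pred)
```

On the real data, the same pipeline could be passed to `GridSearchCV` or `cross_val_score` in place of the one built around `clf`.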