# 1) Introduction

This notebook preprocesses data exported from Stack Overflow with the Stack Exchange Data Explorer,
which exposes a large amount of real data from the platform, in order to detect
topics and generate tags.

## 1.1. Preprocessing steps

**1) Noise removal**

1. Removing HTML markup
2. Expanding contractions
3. Spelling correction
4. Lowercasing the text

**2) Character removal**

1. Removing punctuation, special characters and numbers
2. Removing single-character tokens (optional and task-specific)

**3) Stop word removal**

1. Removing the most frequent words
2. Removing words of a given type (optional and task-specific)

**4) Stemming / Lemmatization**

1. Stemming
2. Lemmatization


This preprocessing is intended for simple topic detection (LDA, NMF, etc.) or classification;
information needed by other kinds of analysis may be lost along the way.
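
As a rough illustration of how the steps listed above chain together, here is a minimal sketch using only the standard library. Every function, the `PIPELINE` list and the `TOY_STOP_WORDS` set are hypothetical stand-ins, not the notebook's actual implementation (which relies on BeautifulSoup, contractions and NLTK in the following sections).

```python
import html
import re
from functools import reduce

# Hypothetical stand-ins for the notebook's steps; each takes and returns a string.
def strip_tags(text):
    """Crude regex-based tag removal (the notebook uses BeautifulSoup instead)."""
    return html.unescape(re.sub(r"<[^>]+>", " ", text))

def lowercase(text):
    return text.lower()

def keep_letters(text):
    """Replace every run of non-letters with a single space."""
    return re.sub(r"[^a-z]+", " ", text).strip()

TOY_STOP_WORDS = {"i", "do", "not", "the", "a", "to", "how"}  # illustrative only

def drop_stop_words(text):
    return " ".join(w for w in text.split() if w not in TOY_STOP_WORDS)

PIPELINE = [strip_tags, lowercase, keep_letters, drop_stop_words]

def preprocess(text, steps=PIPELINE):
    """Apply each step in order, feeding the output of one into the next."""
    return reduce(lambda acc, step: step(acc), steps, text)

print(preprocess("<p>How do I sort a <code>dict</code> by value?</p>"))
```

Keeping the steps in a list makes the order explicit, which matters here: lowercasing has to happen before the letters-only filter, for instance.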

 

### 1.1.1. Vocabulary
A short list of concepts used in this notebook.

Tokenization: "The process of converting a string into a list of substrings, called tokens."

Text normalization: "The process of transforming text into a single canonical form that it might not have
had before (for example lowercasing, expanding contractions, spelling correction, stemming / lemmatization).
Text normalization requires knowing what type of text is to be normalized and how it will be processed
afterwards; there is no universal normalization procedure."

Noise removal: "The process of removing anything that may interfere with your analysis
(for example removing HTML markup, lowercasing, removing punctuation and special characters)."

Stemming: "The process of reducing words to their word stem, base or root form, generally a written
word form (for example "fishing", "fished" and "fisher" reduce to the stem "fish")."

Lemmatization: "The process of grouping together the inflected forms of a word so they can be analysed
as a single item, identified by the word's lemma, or dictionary form (for example "walking" becomes "walk" and
"better" becomes "good")."

Stop word: "Words that are filtered out before or after processing natural language data (text). Stop words
usually refer to the most common words in a language (such as "the" and "a" in
English)."

# 2) Libraries and Dataset 

In [1]:
! pip install bs4
! pip install contractions
! pip install autocorrect 



In [2]:
# Generic libraries
import pandas as pd

# Text libraries
import re
from bs4 import BeautifulSoup
import nltk
from nltk.tokenize import ToktokTokenizer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
import contractions
from autocorrect import Speller


In [3]:
# https://numpy.org/devdocs/user/basics.types.html

dtypes_questions = {'Id':'int32', 'Score': 'int16', 'Title': 'string', 'Body': 'string', 'Tags': 'string'}

In [4]:
%%time

nrows=10000

df_questions_python = pd.read_csv('input/QueryResults python score more than 20.csv',
                           usecols=dtypes_questions.keys(),
                           encoding = "ISO-8859-1",
                           dtype=dtypes_questions,
                           nrows=nrows
                          )

df_questions_r = pd.read_csv('input/QueryResults r score more than 20.csv',
                           usecols=['Id', 'Score', 'Title', 'Body', 'Tags'],
                           encoding = "ISO-8859-1",
                           dtype=dtypes_questions,
                           nrows=nrows
                          )

df_questions = pd.concat([df_questions_python,df_questions_r])
df_questions['Title_raw']=df_questions['Title']
df_questions['Body_raw']=df_questions['Body']
df_questions = df_questions.sample(frac=1,random_state=1).reset_index(drop=True)

CPU times: user 194 ms, sys: 11 ms, total: 205 ms
Wall time: 204 ms


In [5]:
df_questions[['Title', 'Body', 'Tags']] = df_questions[[
    'Title', 'Body','Tags'
]].applymap(lambda x: str(x).encode("utf-8", errors='surrogatepass').decode(
    "ISO-8859-1", errors='surrogatepass'))

In [6]:
df_questions.dtypes

Id            int32
Title        object
Body         object
Tags         object
Score         int16
Title_raw    string
Body_raw     string
dtype: object

In [7]:
spell = Speller()
token = ToktokTokenizer()
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()
charac = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~0123456789'
stop_words = set(stopwords.words("english"))
adjective_tag_list = {'JJ', 'JJR', 'JJS', 'RBR', 'RBS'}  # Penn Treebank tags to drop: adjectives (JJ, JJR, JJS) and comparative/superlative adverbs (RBR, RBS)


**NLTK tag list**
List of the tags used by the NLTK tagger (`pos_tag` function):
https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

In [8]:
df_questions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13913 entries, 0 to 13912
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Id         13913 non-null  int32 
 1   Title      13913 non-null  object
 2   Body       13913 non-null  object
 3   Tags       13913 non-null  object
 4   Score      13913 non-null  int16 
 5   Title_raw  13913 non-null  string
 6   Body_raw   13913 non-null  string
dtypes: int16(1), int32(1), object(3), string(2)
memory usage: 625.1+ KB


In [9]:
df_questions['qTitle'] = df_questions['Tags'].apply(lambda x : len((x).split(" ")))

# 3) Noise removal

Noise removal means removing anything that could interfere with the text analysis. It plays the same role as the data-cleaning step in a classic ML project.

## 3.1. Removing HTML markup

In [10]:
df_questions['Body'][10]

"<p>I do not know python very much (never used it before :D), but I can't seem to find anything online. Maybe I just didn't google the right question, but here I go:</p>\n<p>I want to change an instance's implementation of a specific method. When I googled for it, I found you could do it, but it changes the implementation for all other instances of the same class, for example:</p>\n<pre><code>def showyImp(self):\n    print self.y\n\nclass Foo:\n    def __init__(self):\n        self.x = &quot;x = 25&quot;\n        self.y = &quot;y = 4&quot;\n\n    def showx(self):\n        print self.x\n\n    def showy(self):\n         print &quot;y = woohoo&quot;\n\nclass Bar:\n    def __init__(self):\n        Foo.showy = showyImp\n        self.foo = Foo()\n\n    def show(self):\n        self.foo.showx()\n        self.foo.showy()\n\nif __name__ == '__main__':\n    b = Bar()\n    b.show()\n    f = Foo()\n    f.showx()\n    f.showy()\n</code></pre>\n<p>This does not work as expected, because the output i

In [11]:
%%time

# Parse the question body and the title, then keep only the text
df_questions['Body'] = df_questions['Body'].apply(
    lambda x: BeautifulSoup(x, 'html.parser').get_text())
df_questions['Title'] = df_questions['Title'].apply(
    lambda x: BeautifulSoup(x, 'html.parser').get_text())


CPU times: user 5.86 s, sys: 24.8 ms, total: 5.88 s
Wall time: 5.88 s


BeautifulSoup removes most of the HTML markup efficiently, but some residue remains (newlines, for example, as the output below shows).
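
To make the kind of residue concrete, here is a small sketch that extracts text with the standard library's `html.parser` rather than BeautifulSoup (a deliberate swap so the example has no third-party dependency). `TextExtractor` and `html_to_text` are hypothetical names. Tags disappear and entities are decoded, but newlines and non-breaking spaces survive.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode entities such as &quot;
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def html_to_text(fragment):
    parser = TextExtractor()
    parser.feed(fragment)
    parser.close()
    return "".join(parser.parts)

sample = "<p>Use&nbsp;<code>print &quot;hi&quot;</code></p>\n<p>Done</p>"
text = html_to_text(sample)
print(repr(text))  # the tags are gone, but '\xa0' and '\n' are still there
```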

In [12]:
df_questions['Body'][10]

'I do not know python very much (never used it before :D), but I can\'t seem to find anything online. Maybe I just didn\'t google the right question, but here I go:\nI want to change an instance\'s implementation of a specific method. When I googled for it, I found you could do it, but it changes the implementation for all other instances of the same class, for example:\ndef showyImp(self):\n    print self.y\n\nclass Foo:\n    def __init__(self):\n        self.x = "x = 25"\n        self.y = "y = 4"\n\n    def showx(self):\n        print self.x\n\n    def showy(self):\n         print "y = woohoo"\n\nclass Bar:\n    def __init__(self):\n        Foo.showy = showyImp\n        self.foo = Foo()\n\n    def show(self):\n        self.foo.showx()\n        self.foo.showy()\n\nif __name__ == \'__main__\':\n    b = Bar()\n    b.show()\n    f = Foo()\n    f.showx()\n    f.showy()\n\nThis does not work as expected, because the output is the following:\n\nx = 25\ny = 4\nx = 25\ny = 4\n\nAnd I want it 

So we remove what is left here.

In [13]:
def clean_text(text):
    """Replace newlines and non-breaking spaces with spaces, then collapse repeated whitespace."""
    text = re.sub(r"\'", "'", text)  # no-op kept from the original: an apostrophe is replaced by itself
    text = re.sub(r"\n", " ", text)  # replace every line feed (newline) with a single space
    text = re.sub(r"\xa0", " ", text)  # replace every non-breaking space with a single space
    text = re.sub(r"\s+", " ", text)  # collapse runs of whitespace into a single space
    text = text.strip(' ')
    return text


In [14]:
%%time

df_questions['Title'] = df_questions['Title'].apply(lambda x: clean_text(x))
df_questions['Body'] = df_questions['Body'].apply(lambda x: clean_text(x))

CPU times: user 609 ms, sys: 0 ns, total: 609 ms
Wall time: 609 ms


In [15]:
df_questions['Body'][10]

'I do not know python very much (never used it before :D), but I can\'t seem to find anything online. Maybe I just didn\'t google the right question, but here I go: I want to change an instance\'s implementation of a specific method. When I googled for it, I found you could do it, but it changes the implementation for all other instances of the same class, for example: def showyImp(self): print self.y class Foo: def __init__(self): self.x = "x = 25" self.y = "y = 4" def showx(self): print self.x def showy(self): print "y = woohoo" class Bar: def __init__(self): Foo.showy = showyImp self.foo = Foo() def show(self): self.foo.showx() self.foo.showy() if __name__ == \'__main__\': b = Bar() b.show() f = Foo() f.showx() f.showy() This does not work as expected, because the output is the following: x = 25 y = 4 x = 25 y = 4 And I want it to be: x = 25 y = 4 x = 25 y = woohoo I tried to change Bar\'s init method with this: def __init__(self): self.foo = Foo() self.foo.showy = showyImp But I ge

The tags column needs cleaning as well; this is done in section 8.


## 3.2. Expanding contractions

In [16]:
def expand_contractions(text):
    """Expand contracted forms, e.g. "don't" to "do not"."""
    text = contractions.fix(text)
    return text

In [17]:
%%time

df_questions['Title'] = df_questions['Title'].apply(lambda x: expand_contractions(x))
df_questions['Body'] = df_questions['Body'].apply(lambda x: expand_contractions(x))

CPU times: user 661 ms, sys: 0 ns, total: 661 ms
Wall time: 661 ms


In [18]:
df_questions['Body'][10]

'I do not know python very much (never used it before :D), but I can not seem to find anything online. Maybe I just did not google the right question, but here I go: I want to change an instance\'s implementation of a specific method. When I googled for it, I found you could do it, but it changes the implementation for all other instances of the same class, for example: def showyImp(self): print self.y class Foo: def __init__(self): self.x = "x = 25" self.y = "y = 4" def showx(self): print self.x def showy(self): print "y = woohoo" class Bar: def __init__(self): Foo.showy = showyImp self.foo = Foo() def show(self): self.foo.showx() self.foo.showy() if __name__ == \'__main__\': b = Bar() b.show() f = Foo() f.showx() f.showy() This does not work as expected, because the output is the following: x = 25 y = 4 x = 25 y = 4 And I want it to be: x = 25 y = 4 x = 25 y = woohoo I tried to change Bar\'s init method with this: def __init__(self): self.foo = Foo() self.foo.showy = showyImp But I g

## 3.3. Spelling correction

For 1,000 entries this correction takes about 10 minutes, which is why the calls below are commented out.
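
Since the same words recur across many questions, one possible way to reduce this cost is to correct each distinct word only once and reuse the result. The sketch below shows the idea with `functools.lru_cache`. `slow_correct` is a hypothetical stand-in for an expensive corrector (such as the `Speller` instance), the toy dictionary is invented for the example, and tokens are split on whitespace rather than with `ToktokTokenizer`. How much time this would save on the real dataset has not been measured.

```python
from functools import lru_cache

CALLS = {"count": 0}  # counts how often the slow path actually runs

def slow_correct(word):
    """Hypothetical stand-in for an expensive spelling corrector."""
    CALLS["count"] += 1
    toy_fixes = {"pyhton": "python", "fucntion": "function"}  # illustrative only
    return toy_fixes.get(word, word)

@lru_cache(maxsize=None)
def cached_correct(word):
    """Correct each distinct word once; repeats are served from the cache."""
    return slow_correct(word)

def autocorrect_cached(text):
    return " ".join(cached_correct(w) for w in text.split())

docs = ["pyhton fucntion call", "call a pyhton fucntion", "pyhton"]
fixed = [autocorrect_cached(d) for d in docs]
print(fixed)
print("slow calls:", CALLS["count"], "for", sum(len(d.split()) for d in docs), "tokens")
```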

In [19]:
def autocorrect(text):
    """Correct the spelling of every token in the text."""
    words_id = token.tokenize(text)
    words_correct = [spell(i) for i in words_id]
    return ' '.join(map(str, words_correct))  # join the tokens back into a single string

#df_questions['Title'] = df_questions['Title'].apply(lambda x: autocorrect(x))
#df_questions['Body'] = df_questions['Body'].apply(lambda x: autocorrect(x))

## 3.4. Lowercasing the text

I lowercase the text after running the contractions package, because expanding contractions can reintroduce capital letters. Lowercasing is a classic and useful step of noise removal or text normalization: it reduces the vocabulary, normalizes the text and costs almost nothing.

In [20]:
%%time

df_questions['Title'] = df_questions['Title'].str.lower()
df_questions['Body'] = df_questions['Body'].str.lower()
df_questions['Tags'] = df_questions['Tags'].str.lower()

CPU times: user 24.4 ms, sys: 45 µs, total: 24.5 ms
Wall time: 23.9 ms


In [21]:
df_questions['Body'][10]

'i do not know python very much (never used it before :d), but i can not seem to find anything online. maybe i just did not google the right question, but here i go: i want to change an instance\'s implementation of a specific method. when i googled for it, i found you could do it, but it changes the implementation for all other instances of the same class, for example: def showyimp(self): print self.y class foo: def __init__(self): self.x = "x = 25" self.y = "y = 4" def showx(self): print self.x def showy(self): print "y = woohoo" class bar: def __init__(self): foo.showy = showyimp self.foo = foo() def show(self): self.foo.showx() self.foo.showy() if __name__ == \'__main__\': b = bar() b.show() f = foo() f.showx() f.showy() this does not work as expected, because the output is the following: x = 25 y = 4 x = 25 y = 4 and i want it to be: x = 25 y = 4 x = 25 y = woohoo i tried to change bar\'s init method with this: def __init__(self): self.foo = foo() self.foo.showy = showyimp but i g

# 4) Character removal

## 4.1. Removing punctuation, special characters and numbers

ALL non-alphabetical characters are removed (including punctuation, numbers and special characters). As a result, important terms that contain special characters (such as "C#" in programming) are not preserved.
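
If such terms matter for a given analysis, one possible workaround is to swap them for letter-only placeholders before any character is stripped. This is only a sketch: the `PROTECTED` mapping and its placeholder spellings are hypothetical, and the notebook itself does not do this.

```python
import re

# Illustrative whitelist; the placeholders are hypothetical and must be purely alphabetic.
PROTECTED = {"c#": "csharp", "c++": "cplusplus", ".net": "dotnet"}

def protect_terms(text, mapping=PROTECTED):
    """Replace whitelisted terms with letter-only placeholders before cleaning."""
    # Longest terms first so that overlapping entries are handled predictably.
    for term in sorted(mapping, key=len, reverse=True):
        # Only match when the term is not glued to other letters.
        pattern = r"(?<![a-z])" + re.escape(term) + r"(?![a-z])"
        text = re.sub(pattern, " " + mapping[term] + " ", text)
    return text

def keep_letters_only(text):
    """Same idea as the notebook's cleaning: keep lowercase letters only."""
    return re.sub(r"[^a-z]+", " ", text).strip()

raw = "calling c++ from c# in .net 4.5"  # already lowercased, as in the notebook
print(keep_letters_only(raw))
print(keep_letters_only(protect_terms(raw)))
```

The protection step has to run on lowercased text and before both removal functions, otherwise "c#" has already collapsed to "c".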

In [22]:
def remove_punctuation_and_number(text):
    """Delete every punctuation character and digit listed in `charac`."""
    return text.translate(str.maketrans("", "", charac))

def remove_non_alphabetical_character(text):
    """Replace every run of non-alphabetical characters with a single space."""
    text = re.sub(r"[^a-z]+", " ", text)  # replace runs of non-alphabetical characters with a space
    text = re.sub(r"\s+", " ", text)  # collapse the whitespace left by the previous step
    return text

In [23]:
%%time

df_questions['Title'] = df_questions['Title'].apply(lambda x: remove_punctuation_and_number(x))
df_questions['Body'] = df_questions['Body'].apply(lambda x: remove_punctuation_and_number(x))

CPU times: user 141 ms, sys: 3.78 ms, total: 145 ms
Wall time: 144 ms


In [24]:
%%time

df_questions['Title'] = df_questions['Title'].apply(lambda x: remove_non_alphabetical_character(x))
df_questions['Body'] = df_questions['Body'].apply(lambda x: remove_non_alphabetical_character(x))

CPU times: user 903 ms, sys: 0 ns, total: 903 ms
Wall time: 902 ms


In [25]:
df_questions['Body'][10]

'i do not know python very much never used it before d but i can not seem to find anything online maybe i just did not google the right question but here i go i want to change an instances implementation of a specific method when i googled for it i found you could do it but it changes the implementation for all other instances of the same class for example def showyimpself print selfy class foo def initself selfx x selfy y def showxself print selfx def showyself print y woohoo class bar def initself fooshowy showyimp selffoo foo def showself selffooshowx selffooshowy if name main b bar bshow f foo fshowx fshowy this does not work as expected because the output is the following x y x y and i want it to be x y x y woohoo i tried to change bars init method with this def initself selffoo foo selffooshowy showyimp but i get the following error message showyimp takes exactly argument given so yeah i tried using setattr but seems like it is the same as selffooshowy showyimp any clue '

## 4.2. Removing single-character tokens


The reason for removing single characters is that in programming we often use a single letter as a variable name ("x", "y", "z", etc.). When I tried to detect topics without removing them, many topics contained them, and there was even one topic I could have named "Variable name"...

In [26]:
def remove_single_letter(text):
    """remove single alphabetical character"""
    text = re.sub(r"\b\w\b", "", text)  # remove every single-character word
    text = re.sub(r"\s+", " ", text)  # collapse the whitespace left by the previous step
    text = text.strip(" ")
    return text

In [27]:
%%time

#df_questions['Title'] = df_questions['Title'].apply(lambda x: remove_single_letter(x))
#df_questions['Body'] = df_questions['Body'].apply(lambda x: remove_single_letter(x))

# We cannot remove single letters because we want to keep the letter R!

CPU times: user 1e+03 ns, sys: 0 ns, total: 1e+03 ns
Wall time: 3.58 µs
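
The cell above is left commented out because dropping every single letter would also drop "r", the name of one of the two languages in the dataset. A possible compromise, sketched here and not applied in this notebook, is to remove single letters except those on a small whitelist. `KEEP_SINGLE` and `remove_single_letter_except` are hypothetical names.

```python
import re

KEEP_SINGLE = {"r"}  # illustrative whitelist of one-letter terms worth keeping

def remove_single_letter_except(text, keep=KEEP_SINGLE):
    """Drop one-letter tokens unless they are whitelisted."""
    def replace(match):
        letter = match.group(0)
        return letter if letter in keep else ""
    text = re.sub(r"\b[a-z]\b", replace, text)
    return re.sub(r"\s+", " ", text).strip()

print(remove_single_letter_except("plot x and y in r with ggplot"))
```

The trade-off is that an "r" used as a variable name is kept as well, so some of the "variable name" noise would remain.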


In [28]:
df_questions['Body'][10]

'i do not know python very much never used it before d but i can not seem to find anything online maybe i just did not google the right question but here i go i want to change an instances implementation of a specific method when i googled for it i found you could do it but it changes the implementation for all other instances of the same class for example def showyimpself print selfy class foo def initself selfx x selfy y def showxself print selfx def showyself print y woohoo class bar def initself fooshowy showyimp selffoo foo def showself selffooshowx selffooshowy if name main b bar bshow f foo fshowx fshowy this does not work as expected because the output is the following x y x y and i want it to be x y x y woohoo i tried to change bars init method with this def initself selffoo foo selffooshowy showyimp but i get the following error message showyimp takes exactly argument given so yeah i tried using setattr but seems like it is the same as selffooshowy showyimp any clue '

In [29]:
df_questions.iloc[110]

Id                                                     1482141
Title        what does it mean weaklyreferenced object no l...
Body         i am running a python code and i get the follo...
Tags                                                  <python>
Score                                                       28
Title_raw    What does it mean "weakly-referenced object no...
Body_raw     <p>I am running a Python code and I get the fo...
qTitle                                                       1
Name: 110, dtype: object

# 5) Stop word removal

## 5.1. Removing the most frequent words

Removing the most frequent words is a classic NLP step. In most cases the most frequent words add little information (since they appear in almost every sentence). Removing them leaves more "room" for other words that may carry more useful information.
You can use predefined lists from libraries such as scikit-learn, NLTK and others. Be aware, however, that these lists can cause more problems than they solve (the scikit-learn list in particular; see [Stop Word Lists in Free Open-source Software Packages](https://www.aclweb.org/anthology/W18-2502.pdf) for more information).
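
As an alternative or complement to a predefined list, stop words can also be derived from the corpus itself by looking at how many documents each word appears in. The following is a minimal sketch of that idea. The `frequent_words` helper, the 0.8 threshold and the toy documents are assumptions made for illustration, not something this notebook uses.

```python
from collections import Counter

def frequent_words(docs, min_doc_ratio=0.8):
    """Return the words that occur in at least `min_doc_ratio` of the documents."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc.split()))  # count each word once per document
    threshold = min_doc_ratio * len(docs)
    return {word for word, count in doc_freq.items() if count >= threshold}

docs = [
    "how to read a csv file in python",
    "how to plot a histogram in r",
    "how to merge two dataframes in pandas",
    "how to parse json in python",
    "how to reshape a dataframe in r",
]
custom_stop_words = frequent_words(docs)
print(sorted(custom_stop_words))
```

On a corpus restricted to Python and R questions, a low threshold would eventually flag "python" and "r" themselves, so the ratio needs to be chosen with the target tags in mind.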

In [30]:
def remove_stopwords(text):
    """Remove common English words using the stop word list from nltk.corpus."""
    words_idx = token.tokenize(text)
    filtered = [i for i in words_idx if i not in stop_words]

    return ' '.join(map(str, filtered))  # join the tokens back into a single string

In [31]:
%%time

df_questions['Title'] = df_questions['Title'].apply(lambda x: remove_stopwords(x))
df_questions['Body'] = df_questions['Body'].apply(lambda x: remove_stopwords(x))

CPU times: user 2.22 s, sys: 0 ns, total: 2.22 s
Wall time: 2.22 s


In [32]:
df_questions['Body'][10]

'know python much never used seem find anything online maybe google right question go want change instances implementation specific method googled found could changes implementation instances class example def showyimpself print selfy class foo def initself selfx x selfy def showxself print selfx def showyself print woohoo class bar def initself fooshowy showyimp selffoo foo def showself selffooshowx selffooshowy name main b bar bshow f foo fshowx fshowy work expected output following x x want x x woohoo tried change bars init method def initself selffoo foo selffooshowy showyimp get following error message showyimp takes exactly argument given yeah tried using setattr seems like selffooshowy showyimp clue'

In [33]:
df_questions.iloc[110]

Id                                                     1482141
Title               mean weaklyreferenced object longer exists
Body         running python code get following error messag...
Tags                                                  <python>
Score                                                       28
Title_raw    What does it mean "weakly-referenced object no...
Body_raw     <p>I am running a Python code and I get the fo...
qTitle                                                       1
Name: 110, dtype: object

## 5.2. Removing adjectives

I chose to remove adjectives in addition to the NLTK stop word list. Why? Simply because, when I first tried topic detection in the notebook that follows this one, doing so improved the results. I also assumed that adjectives would not add any useful information. The same reasoning could apply to verbs, but I kept them because the Stack Overflow dataset is about programming, where many verbs, or words that can be read as verbs, may be important ("return", "get", "request", "replace", etc.).
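
The filtering logic itself is independent of the tagger, so it can be illustrated on hand-tagged tokens. In this sketch the tags are written by hand for illustration (a real tagger may label some of these words differently) and `drop_tags` is a hypothetical helper mirroring the list comprehension used in this section.

```python
ADJECTIVE_TAGS = {"JJ", "JJR", "JJS", "RBR", "RBS"}  # same set as adjective_tag_list

def drop_tags(tagged_tokens, undesired_tags):
    """Keep the words whose tag is not in `undesired_tags`."""
    return [word for word, tag in tagged_tokens if tag not in undesired_tags]

# Hand-written (word, tag) pairs, for illustration only.
tagged = [
    ("simple", "JJ"), ("function", "NN"), ("return", "VB"),
    ("faster", "JJR"), ("result", "NN"),
]

print(drop_tags(tagged, ADJECTIVE_TAGS))
# Also dropping verbs would lose "return", which is the reason verbs are kept here.
print(drop_tags(tagged, ADJECTIVE_TAGS | {"VB"}))
```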

In [34]:

# noinspection PyTypeChecker
def remove_by_tag(text, undesired_tag):
    """Remove all words whose NLTK part-of-speech tag is in `undesired_tag` (adjectives, verbs, etc.)."""
    words_idx = token.tokenize(text)  # tokenize the text
    words_tagged = nltk.pos_tag(tokens=words_idx, tagset=None, lang='eng')  # tag each word; returns a list of tuples, e.g. ("have", "VB")
    filtered = [i[0] for i in words_tagged if i[1] not in undesired_tag]  # keep the words that do not carry an undesired tag

    return ' '.join(map(str, filtered))  # join the tokens back into a single string

In [35]:
%%time
df_questions['Title'] = df_questions['Title'].apply(
    lambda x: remove_by_tag(x, adjective_tag_list))
df_questions['Body'] = df_questions['Body'].apply(
    lambda x: remove_by_tag(x, adjective_tag_list))

CPU times: user 42.2 s, sys: 152 ms, total: 42.3 s
Wall time: 42.3 s


In [36]:
df_questions['Body'][10]

'know python never used seem find anything maybe google question go change instances implementation method googled found could changes implementation instances class example def showyimpself print class foo def initself selfx selfy def showxself print def showyself print class bar def initself fooshowy selffoo foo def showself selffooshowx name b bar bshow foo fshowx fshowy work expected output following x want x woohoo tried change bars init def initself selffoo foo selffooshowy showyimp get following error message showyimp takes exactly given yeah tried using setattr seems like selffooshowy showyimp clue'

In [37]:
df_questions.iloc[110]

Id                                                     1482141
Title                             mean weaklyreferenced exists
Body         running python code get following error messag...
Tags                                                  <python>
Score                                                       28
Title_raw    What does it mean "weakly-referenced object no...
Body_raw     <p>I am running a Python code and I get the fo...
qTitle                                                       1
Name: 110, dtype: object

# 6) Stemming / Lemmatization

Stemming and lemmatization are operations that:
- can reduce computation time by shrinking the vocabulary
- help the model generalize by grouping words together (for example, "am", "are" and "is" are all mapped to "be" by lemmatization)

## 6.1. Stemming

I did not use stemming here, but it is always worth considering as an alternative because it is much cheaper.

Stemming is the process of reducing inflected words to their word stem, base or root form, generally a written word form ("fishing", "fished" and "fisher" reduce to the stem "fish"). It usually works by stripping a word's affix. An affix can be a suffix or a prefix (for example "-ed", "-ing"). It is simple, but it does not work when the word is irregular ("ran" and "run"). It is a simpler operation than lemmatization that may be sufficient in some cases but makes too many errors in others.
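
To see why affix stripping handles regular forms but not irregular ones, here is a deliberately naive suffix stripper. It is a toy written for this explanation, not the Porter algorithm used by `PorterStemmer`, and the `SUFFIXES` tuple is an assumption chosen for the example.

```python
SUFFIXES = ("ing", "ed", "er", "s")  # illustrative; far smaller than Porter's rule set

def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix if enough of the word remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

for w in ["fishing", "fished", "fisher", "ran", "run"]:
    print(w, "->", toy_stem(w))
```

The three regular forms collapse to "fish", while "ran" and "run" stay distinct because no suffix rule relates them.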

In [38]:
words = ["program", "programs", "programer", "programing", "programers"]
  
for w in words:
    print(w, " : ", stemmer.stem(w))

program  :  program
programs  :  program
programer  :  program
programing  :  program
programers  :  program


In [39]:
def stem_text(text):
    """Stem every token of the text with the Porter stemmer."""
    words_idx = nltk.word_tokenize(text)  # tokenize the text into a list of tokens
    stemmed_text = []
    for word in words_idx:
        stemmed_text.append(stemmer.stem(word))  # stem each word
    return " ".join(stemmed_text)  # join the stems back into a single string

In [40]:
# %%time

# df_questions['Title'] = df_questions['Title'].apply(lambda x: stem_text(x))
# df_questions['Body'] = df_questions['Body'].apply(lambda x: stem_text(x))

In [41]:
df_questions.iloc[110]

Id                                                     1482141
Title                             mean weaklyreferenced exists
Body         running python code get following error messag...
Tags                                                  <python>
Score                                                       28
Title_raw    What does it mean "weakly-referenced object no...
Body_raw     <p>I am running a Python code and I get the fo...
qTitle                                                       1
Name: 110, dtype: object

## 6.2. Lemmatization

As stated at the beginning, lemmatization replaces a word with its lemma (canonical or dictionary form). In some cases, however, a lemmatizer cannot find the right lemma unless you specify the part of speech, as shown below.

In [42]:
print(lemmatizer.lemmatize("stripes", "v"))
print(lemmatizer.lemmatize("stripes", "n"))  
print(lemmatizer.lemmatize("are"))
print(lemmatizer.lemmatize("are", "v"))

strip
stripe
are
be


One way around this problem is to run a part-of-speech tagger and pass the word type to the lemmatize function. BUT this is really expensive. In that respect, stemming or a plain lemmatization is much cheaper.

In [43]:
def lemmatize_text(text):
    """Lemmatize the text, using part-of-speech tags to pick the right lemma."""

    tokens_tagged = nltk.pos_tag(nltk.word_tokenize(text))  # tokenize the text, then return a list of (token, nltk_tag) tuples
    lemmatise_text = []
    for word, tag in tokens_tagged:
        if tag.startswith('J'):
            lemmatise_text.append(lemmatizer.lemmatize(word,'a')) # Lemmatise adjectives. Rarely triggered, since adjectives were removed earlier
        elif tag.startswith('V'):
            lemmatise_text.append(lemmatizer.lemmatize(word,'v')) # Lemmatise verbs
        elif tag.startswith('N'):
            lemmatise_text.append(lemmatizer.lemmatize(word,'n')) # Lemmatise nouns
        elif tag.startswith('R'):
            lemmatise_text.append(lemmatizer.lemmatize(word,'r')) # Lemmatise adverbs
        else:
            lemmatise_text.append(lemmatizer.lemmatize(word)) # For any other tag, fall back to the default (noun) lemmatisation
    return " ".join(lemmatise_text) # Join the lemmas back into a single string

In [44]:
%%time

df_questions['Title'] = df_questions['Title'].apply(lambda x: lemmatize_text(x))
df_questions['Body'] = df_questions['Body'].apply(lambda x: lemmatize_text(x))

CPU times: user 40.8 s, sys: 128 ms, total: 40.9 s
Wall time: 40.9 s
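
The if/elif chain in `lemmatize_text` maps the first letter of a Penn Treebank tag to the part-of-speech letter that the WordNet lemmatizer expects. That mapping could also be written as a lookup table, as in this sketch. `PENN_TO_WORDNET` and `wordnet_pos` are hypothetical names, and the example deliberately avoids calling NLTK so it runs without any corpus download.

```python
# First letter of a Penn Treebank tag -> part-of-speech letter expected by WordNet.
PENN_TO_WORDNET = {"J": "a", "V": "v", "N": "n", "R": "r"}

def wordnet_pos(penn_tag, default="n"):
    """Map a Penn Treebank tag to a WordNet POS letter, falling back to noun."""
    return PENN_TO_WORDNET.get(penn_tag[:1], default)

for tag in ["JJ", "VBD", "NNS", "RB", "IN"]:
    print(tag, "->", wordnet_pos(tag))
```

With such a helper, the loop body would reduce to a single call of the form `lemmatizer.lemmatize(word, wordnet_pos(tag))`. The noun fallback matches the lemmatizer's default part of speech.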


In [45]:
df_questions['Body'][10]

'know python never use seem find anything maybe google question go change instance implementation method google find could change implementation instance class example def showyimpself print class foo def initself selfx selfy def showxself print def showyself print class bar def initself fooshowy selffoo foo def showself selffooshowx name b bar bshow foo fshowx fshowy work expect output follow x want x woohoo try change bar init def initself selffoo foo selffooshowy showyimp get follow error message showyimp take exactly give yeah tried use setattr seem like selffooshowy showyimp clue'

# 7) Feature engineering

Using the title and the body together gives much better results for topic detection.

In [46]:
df_questions['Text'] = df_questions['Title'] + ' ' + df_questions['Body']
df_questions['Text_raw'] = df_questions['Title_raw'] + ' ' + df_questions['Body_raw']

# 8) Dataframe with the tags

In [47]:
df_questions['Tags'][10]

'<python>'

In [48]:
def clean_split_tag(text):
    """
        Turn a raw tag string such as '<python><pandas>' into a list of tag names.
    """
    text = re.sub(r"<", "", text)  # drop every opening angle bracket
    text = re.sub(r">", ",", text)  # turn every closing angle bracket into a comma separator
    text = re.sub(r" ", "_", text)  # replace spaces with underscores
    text = text.strip(',')  # remove the trailing comma left after the last tag
    text = text.split(',')
    #text = re.sub(r",", " ", text)
    #text = text.rstrip('>')
    return text

In [49]:
%%time

df_questions['Tags'] = df_questions['Tags'].apply(lambda x: clean_split_tag(x))

CPU times: user 138 ms, sys: 25 µs, total: 138 ms
Wall time: 137 ms
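
For comparison, the same extraction can be sketched with a single regular expression that captures whatever sits between each pair of angle brackets. `split_tags` is a hypothetical alternative, not the function used above. Unlike `clean_split_tag`, it leaves spaces untouched and returns an empty list for an empty string.

```python
import re

def split_tags(raw_tags):
    """Extract tag names from a string such as '<python><pandas>'."""
    return re.findall(r"<([^<>]+)>", raw_tags)

print(split_tags("<python><python-3.x><setup.py>"))
```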


In [50]:
tags_list = df_questions[['Id','Tags']].explode('Tags')
print(tags_list.head(10))

def untokenize(words_id):
    return ' '.join(map(str, words_id))  # join the tokens back into a single string

df_questions['Tags'] = df_questions['Tags'].apply(lambda x: untokenize(x))

         Id       Tags
0   5649407     python
0   5649407  bytearray
1   7974849     python
2  29554796          r
2  29554796    ggplot2
3    250151     python
3    250151  scripting
3    250151        lua
4   1342000     python
4   1342000    unicode


In [51]:
df_questions['Tags']

0              python bytearray
1                        python
2                     r ggplot2
3          python scripting lua
4                python unicode
                  ...          
13908    python mysql timedelta
13909         python amazon-ec2
13910                   r dplyr
13911         python django ide
13912               r dataframe
Name: Tags, Length: 13913, dtype: object

In [52]:
df_questions[df_questions['Title']=='']

Unnamed: 0,Id,Title,Body,Tags,Score,Title_raw,Body_raw,qTitle,Text,Text_raw
7720,1471994,,anyone please explain configure used,python python-3.x setup.py pypi python-packaging,1223,What is setup.py?,<p>Can anyone please explain what <code>setup....,1,anyone please explain configure used,What is setup.py? <p>Can anyone please explain...
7868,9831097,,write lot python code want read file know two ...,python file,20,Is open().read() safe?,<p>I write a lot of Python code where I just w...,1,write lot python code want read file know two...,Is open().read() safe? <p>I write a lot of Pyt...
10426,12013953,,gb reason take extremely time write never actu...,r file-io csv dataframe data.table,55,write.csv for large data.table,<p>I have a <code>data.table</code> that is no...,1,gb reason take extremely time write never act...,write.csv for large data.table <p>I have a <co...
11247,11476190,,duplicate python operator behaves unexpectedly...,python debugging integer cpython,130,Why (0-6) is -6 = False?,<blockquote>  <p><strong>Possible Duplicate:<...,1,duplicate python operator behaves unexpectedl...,Why (0-6) is -6 = False? <blockquote>  <p><st...
11261,2671376,,question see python list realize probably conc...,python data-structures hash immutability,86,"Hashable, immutable","<p>From a recent SO question (see <a href=""htt...",1,question see python list realize probably con...,"Hashable, immutable <p>From a recent SO questi..."
13607,470139,,python evaluate expression ever put print answ...,python evaluation operator-precedence,29,Why does 1+++2 = 3?,<p>How does Python evaluate the expression <co...,1,python evaluate expression ever put print ans...,Why does 1+++2 = 3? <p>How does Python evaluat...


In [53]:
df_questions=df_questions[df_questions['Title']!='']
# Drop the rows whose title became empty after cleaning

# 9) Exporting the data

In [54]:
print(df_questions.head(10),'\n \n',tags_list.head(10))
#df_questions.drop(['Tags'], axis=1).to_csv('df_questions_fullclean.csv', encoding='utf-8')
df_questions.to_csv('df_questions_fullclean.csv', encoding='utf-8')
tags_list.to_csv('tags_list.csv', encoding='utf-8')

         Id                                              Title  \
0   5649407                                string array python   
1   7974849                   make one python file run another   
2  29554796               mean band width ggplot geomsmooth lm   
3    250151                 lua generalpurpose script language   
4   1342000  make python interpreter correctly character st...   
5   7460938                          run python script webpage   
6  16455777        python count element object match attribute   
7    323972                                    way kill thread   
8    563022            python practice import offering feature   
9  18060116                add legend ggplot line add manually   

                                                Body  \
0  long string represent series value type conver...   
1  make one python file run another example two p...   
2  follow code libraryggplot ggplotmtcars aesxwt ...   
3  see thing ever read embed often anything world