<h1>Table of Contents<span class="tocSkip"></span></h1>
<br>
<div class="toc">
	<ul class="toc-item">
		<li>
			<span>
				<a href="#Import-des-données" data-toc-modified-id="Import-des-données-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Data Import</a>
			</span>
		</li>
		<li>
			<span>
				<a href="#Préprocessing-Steps/Options" data-toc-modified-id="Préprocessing-Steps/Options-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Preprocessing Steps/Options</a>
			</span>
			<ul class="toc-item">
				<li><span><a href="#Cleaning-Data" data-toc-modified-id="Cleaning-Data-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Cleaning Data</a></span>
					<ul class="toc-item">
							<li><span><a href="#Clean-Tags" data-toc-modified-id="Clean-Tags-2.2.1"><span class="toc-item-num">2.1.1&nbsp;&nbsp;</span>Clean Tags</a></span>
							</li>
							<li><span><a href="#Clean-Posts" data-toc-modified-id="Clean-Posts-2.2.2"><span class="toc-item-num">2.1.2&nbsp;&nbsp;</span>Clean Posts</a></span>
							</li>
					</ul>
				</li>
				<li><span><a href="#Tokenize-(Gensim)" data-toc-modified-id="Tokenize-(Gensim)-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Tokenize (Gensim)</a></span>
				</li>
				<li><span><a href="#Make-bigrams-/-trigram-with-Gensim" data-toc-modified-id="Make-bigrams-/-trigram-with-Gensim-2.2"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Make bigrams / trigram with Gensim</a></span>
				</li>
				<li><span><a href="#Lemmatize-(Spacy)" data-toc-modified-id="Lemmatize-(Spacy)-2.2"><span class="toc-item-num">2.4&nbsp;&nbsp;</span>Lemmatize (Spacy)</a></span>
				</li>
			</ul>
		</li>
		<li>
			<span>
				<a href="#Save-Corpus" data-toc-modified-id="Save-Corpus-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Save Corpus</a>
			</span>
		</li>
	</ul>
</div>

In [1]:
import os
import itertools
import re
import string
import pandas as pd
import glob
import numpy as np
import spacy
import gensim


from collections import defaultdict

from bs4 import BeautifulSoup
import nltk
from nltk.corpus import stopwords
from chardet.universaldetector import UniversalDetector

# Data Import

We imported 21 CSV files obtained from the Stack Overflow site.

In [2]:
posts_path = os.path.normpath('Data/')

In [3]:
query_files = glob.glob(os.path.join(posts_path, '*'))
print(query_files)

['Data\\QueryResults1.csv', 'Data\\QueryResults10.csv', 'Data\\QueryResults11.csv', 'Data\\QueryResults12.csv', 'Data\\QueryResults13.csv', 'Data\\QueryResults14.csv', 'Data\\QueryResults15.csv', 'Data\\QueryResults16.csv', 'Data\\QueryResults17.csv', 'Data\\QueryResults18.csv', 'Data\\QueryResults19.csv', 'Data\\QueryResults2.csv', 'Data\\QueryResults20.csv', 'Data\\QueryResults21.csv', 'Data\\QueryResults3.csv', 'Data\\QueryResults4.csv', 'Data\\QueryResults5.csv', 'Data\\QueryResults6.csv', 'Data\\QueryResults7.csv', 'Data\\QueryResults8.csv', 'Data\\QueryResults9.csv']
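The listing above is in lexicographic order (`QueryResults10.csv` comes before `QueryResults2.csv`). The order has no effect on the shape of the merged data, but it does determine the row index after concatenation, and therefore which rows a seeded sample picks. If a numeric order is preferred, a minimal sketch could look like this; `natural_key` is a hypothetical helper, not part of the notebook:

```python
import re

def natural_key(path):
    """Split a file name into text and integer chunks so that 2 sorts before 10."""
    return [int(chunk) if chunk.isdigit() else chunk.lower()
            for chunk in re.split(r'(\d+)', path)]

# Illustrative file names following the pattern shown above
names = ['QueryResults1.csv', 'QueryResults10.csv', 'QueryResults2.csv']
ordered = sorted(names, key=natural_key)
print(ordered)
```

In the notebook this would amount to `sorted(query_files, key=natural_key)` before the files are read.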


In [4]:
df_data = pd.concat([pd.read_csv(file)
                     for file in query_files], ignore_index=True)

print(df_data.shape)

(552541, 23)


After merging the 21 CSV files, we obtain a dataset with 552,541 rows and 23 variables.

In [5]:
# List the variables in our dataset
df_data.columns

Index(['Id', 'PostTypeId', 'AcceptedAnswerId', 'ParentId', 'CreationDate',
       'DeletionDate', 'Score', 'ViewCount', 'Body', 'OwnerUserId',
       'OwnerDisplayName', 'LastEditorUserId', 'LastEditorDisplayName',
       'LastEditDate', 'LastActivityDate', 'Title', 'Tags', 'AnswerCount',
       'CommentCount', 'FavoriteCount', 'ClosedDate', 'CommunityOwnedDate',
       'ContentLicense'],
      dtype='object')

Only `Title`, `Body` and `Tags` are useful for our study.

In [6]:
data = df_data[['Title', 'Body', 'Tags']]

In [7]:
# Display the first element of the Body variable.
data['Body'].iloc[0]

'<p>I want to use a <code>Track-Bar</code> to change a <code>Form</code>\'s opacity.</p>\n<p>This is my code:</p>\n<pre class="lang-cs prettyprint-override"><code>decimal trans = trackBar1.Value / 5000;\nthis.Opacity = trans;\n</code></pre>\n<p>When I build the application, it gives the following error:</p>\n<blockquote>\n<p>Cannot implicitly convert type <code>decimal</code> to <code>double</code></p>\n</blockquote>\n<p>I have tried using <code>trans</code> and <code>double</code> but then the <code>Control</code> doesn\'t work. This code worked fine in a past VB.NET project.</p>\n'

In [8]:
# Display the first element of the Title variable.
data['Title'].iloc[0]

'Convert Decimal to Double'

In [9]:
# Display the first element of the Tags variable.
data['Tags'].iloc[0]

'<c#><floating-point><type-conversion><double><decimal>'

In [10]:
# Count the missing values in our dataset
data.isna().sum()

Title    1
Body     0
Tags     0
dtype: int64

Our dataset contains a single missing value, in the `Title` variable.

We will now remove the duplicates on the `Body` variable.

In [11]:
data.nunique()

Title    551881
Body     552487
Tags     321536
dtype: int64

In [12]:
df_data1 = data.drop_duplicates(subset=['Title','Body']).copy()
df_data1.shape

(552523, 3)

In [13]:
df_dataFinal = df_data1.drop_duplicates(subset=['Body']).copy()
df_dataFinal.shape

(552487, 3)
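The two passes above end with exactly as many rows as there are distinct `Body` values (552,487). The plain-Python sketch below mimics "keep the first occurrence" deduplication, which is the pandas default (`keep='first'`), to illustrate why deduplicating on `Body` alone gives the same rows as the two-step version. The rows are made-up `(Title, Body)` pairs, and this `drop_duplicates` function is a stand-in, not the pandas method:

```python
def drop_duplicates(rows, key):
    """Keep the first row for each distinct key(row), preserving order."""
    seen = set()
    kept = []
    for row in rows:
        k = key(row)
        if k not in seen:
            seen.add(k)
            kept.append(row)
    return kept

# Hypothetical (Title, Body) rows
rows = [('t1', 'b1'), ('t1', 'b1'), ('t2', 'b1'), ('t3', 'b2')]

# Step 1 on (Title, Body), then step 2 on Body, as in cells 12 and 13
two_step = drop_duplicates(drop_duplicates(rows, key=lambda r: r),
                           key=lambda r: r[1])
# Single pass on Body only
one_step = drop_duplicates(rows, key=lambda r: r[1])
print(two_step == one_step)
```

The first occurrence of a `Body` is never removed by the first pass, so the second pass always keeps the same row as a single pass would.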

Given our machine constraints (compute power and memory) and time constraints, we draw a random sample of 200,000 posts (an arbitrary size) for the rest of the study.

In [14]:
# sample 200,000 posts out of the 552,487 deduplicated posts
df_dataFinalSample = df_dataFinal.sample(n = 200000, replace=False, random_state=1)
df_dataFinalSample.head()

Unnamed: 0,Title,Body,Tags
337049,How do I make a shrinkable scrollbar?,<p><strong>What I want:</strong> </p>\n\n<pre...,<html><css><scrollbars>
478707,"Do you prefer ""if (var)"" or ""if (var != 0)""?",<p>I've been programming in C-derived language...,<c><perl><coding-style>
191651,SQL ServerReporting Services - Export report t...,<p>I have a report which has a document map fo...,<c#><asp.net><reporting-services>
228069,WIX: How can I register a new ISAPI Extension ...,"<p>I've seen the <a href=""http://wix.sourcefor...",<installation><wix><windows-installer><isapi>
26946,C# - Tetris clone - Can't get block to respond...,<p>I'm working on programming a Tetris game in...,<c#><keydown><arrow-keys>


Before sampling, removing the duplicates left 552,487 rows and 3 variables (`Body`, `Tags` and `Title`); the 200,000 posts above are drawn from these rows.

### Concatenate Title and Body

In [15]:
# concatenate title and body (a missing Title is treated as an empty string)
df_dataFinalSample['Text'] = df_dataFinalSample['Title'].fillna('') + ' ' + df_dataFinalSample['Body']
df_dataFinalSample.head()

Unnamed: 0,Title,Body,Tags,Text
337049,How do I make a shrinkable scrollbar?,<p><strong>What I want:</strong> </p>\n\n<pre...,<html><css><scrollbars>,How do I make a shrinkable scrollbar? <p><stro...
478707,"Do you prefer ""if (var)"" or ""if (var != 0)""?",<p>I've been programming in C-derived language...,<c><perl><coding-style>,"Do you prefer ""if (var)"" or ""if (var != 0)""? <..."
191651,SQL ServerReporting Services - Export report t...,<p>I have a report which has a document map fo...,<c#><asp.net><reporting-services>,SQL ServerReporting Services - Export report t...
228069,WIX: How can I register a new ISAPI Extension ...,"<p>I've seen the <a href=""http://wix.sourcefor...",<installation><wix><windows-installer><isapi>,WIX: How can I register a new ISAPI Extension ...
26946,C# - Tetris clone - Can't get block to respond...,<p>I'm working on programming a Tetris game in...,<c#><keydown><arrow-keys>,C# - Tetris clone - Can't get block to respond...


### Save

In [16]:
# save data (create the output folder if it does not exist yet)
os.makedirs('PickleData', exist_ok=True)
df_dataFinalSample.to_pickle('PickleData/raw_data.pkl')

# Preprocessing Steps/Options

## Cleaning Data

### Clean Tags

In [17]:
def get_tags(tags):
    """
        Transform a string of tags into a list of tags
        
        return: list of tags, or None if the input is not a string
    """
    try:
        return re.findall(r'<(.*?)>', tags)
    except TypeError:
        # raised by re.findall when tags is not a string (e.g. NaN)
        return None
    
df_dataFinalSample.loc[:, 'Tags'] = df_dataFinalSample['Tags'].apply(get_tags)

df_dataFinalSample.head()

Unnamed: 0,Title,Body,Tags,Text
337049,How do I make a shrinkable scrollbar?,<p><strong>What I want:</strong> </p>\n\n<pre...,"[html, css, scrollbars]",How do I make a shrinkable scrollbar? <p><stro...
478707,"Do you prefer ""if (var)"" or ""if (var != 0)""?",<p>I've been programming in C-derived language...,"[c, perl, coding-style]","Do you prefer ""if (var)"" or ""if (var != 0)""? <..."
191651,SQL ServerReporting Services - Export report t...,<p>I have a report which has a document map fo...,"[c#, asp.net, reporting-services]",SQL ServerReporting Services - Export report t...
228069,WIX: How can I register a new ISAPI Extension ...,"<p>I've seen the <a href=""http://wix.sourcefor...","[installation, wix, windows-installer, isapi]",WIX: How can I register a new ISAPI Extension ...
26946,C# - Tetris clone - Can't get block to respond...,<p>I'm working on programming a Tetris game in...,"[c#, keydown, arrow-keys]",C# - Tetris clone - Can't get block to respond...
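As a quick check of the pattern used in `get_tags`, here it is applied to the tag string of the first post displayed earlier. The non-greedy `(.*?)` stops at the first `>`, so each `<...>` pair yields exactly one tag. `TAG_PATTERN` is just a name chosen for this sketch:

```python
import re

TAG_PATTERN = re.compile(r'<(.*?)>')

raw_tags = '<c#><floating-point><type-conversion><double><decimal>'
tags = TAG_PATTERN.findall(raw_tags)
print(tags)  # ['c#', 'floating-point', 'type-conversion', 'double', 'decimal']
```

Symbols such as `#`, `-` and `.` inside a tag are preserved, which matters for tags like `c#` or `asp.net`.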


### Clean Posts

In [18]:
# Transform the raw post (HTML) into a cleaned, lowercased string
def post_prepare(post, REPLACE_BY_SPACE, BAD_SYMBOLS):
    """
        Strip HTML markup, links and code from a post, then normalize its text
        
        return: cleaned text, or 'ERROR' if the post could not be processed
    """
    try:
        # an explicit parser avoids the bs4 "no parser was explicitly specified" warning
        post = BeautifulSoup(post, 'html.parser')
        # delete url tag and code tag
        for tag in post.find_all(['a', 'code']):
            tag.decompose()
        text = post.text
        text = text.lower() #lowercase text
        text = re.sub(REPLACE_BY_SPACE, " ", text) # replace REPLACE_BY_SPACE symbols by space in text
        text = re.sub(BAD_SYMBOLS, "", text) # delete symbols which are in BAD_SYMBOLS from text
    except Exception:
        text = 'ERROR'
    return text


# raw string: avoids invalid escape sequence warnings for \[ and \|
REPLACE_BY_SPACE = re.compile(r'[/(){}\[\]\|@,;]')
BAD_SYMBOLS = re.compile('[^0-9a-z #+_]')
df_dataFinalSample['Text'] = df_dataFinalSample['Text'].apply(
    post_prepare, args=(REPLACE_BY_SPACE, BAD_SYMBOLS))

print('Number of errors: {}'.format(
    (df_dataFinalSample['Text'] == 'ERROR').sum()))

# remove posts that raised an error
print('Number of posts before removing errors: {}'.format(df_dataFinalSample.shape[0]))
df_dataFinalSample = df_dataFinalSample.loc[df_dataFinalSample['Text'] != 'ERROR'].copy()
print('Number of posts after removing errors: {}'.format(df_dataFinalSample.shape[0]))

df_dataFinalSample.head()

Number of errors: 0
Number of posts before removing errors: 200000
Number of posts after removing errors: 200000


Unnamed: 0,Title,Body,Tags,Text
337049,How do I make a shrinkable scrollbar?,<p><strong>What I want:</strong> </p>\n\n<pre...,"[html, css, scrollbars]",how do i make a shrinkable scrollbar what i wa...
478707,"Do you prefer ""if (var)"" or ""if (var != 0)""?",<p>I've been programming in C-derived language...,"[c, perl, coding-style]",do you prefer if var or if var 0 ive been...
191651,SQL ServerReporting Services - Export report t...,<p>I have a report which has a document map fo...,"[c#, asp.net, reporting-services]",sql serverreporting services export report to...
228069,WIX: How can I register a new ISAPI Extension ...,"<p>I've seen the <a href=""http://wix.sourcefor...","[installation, wix, windows-installer, isapi]",wix how can i register a new isapi extension o...
26946,C# - Tetris clone - Can't get block to respond...,<p>I'm working on programming a Tetris game in...,"[c#, keydown, arrow-keys]",c# tetris clone cant get block to respond pr...
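To make the effect of the cleaning step concrete, the sketch below reproduces its logic on a shortened version of the first post. It uses the standard library's `html.parser` as a stand-in for BeautifulSoup, so `VisibleText` and `clean_post` are hypothetical names and the result may differ from the notebook's on malformed HTML; the two regular expressions are the ones defined above.

```python
import re
from html.parser import HTMLParser

REPLACE_BY_SPACE = re.compile(r'[/(){}\[\]\|@,;]')
BAD_SYMBOLS = re.compile('[^0-9a-z #+_]')

class VisibleText(HTMLParser):
    """Collect text that is not nested inside <a> or <code> elements."""
    SKIPPED = {'a', 'code'}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside a skipped element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)

def clean_post(html):
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    text = ''.join(parser.parts).lower()
    text = REPLACE_BY_SPACE.sub(' ', text)   # separators become spaces
    return BAD_SYMBOLS.sub('', text)         # everything else outside the allowed set is dropped

sample = ("Convert Decimal to Double <p>I want to use a <code>Track-Bar</code> "
          "to change a <code>Form</code>'s opacity.</p>")
cleaned = clean_post(sample)
print(cleaned.split())
```

The content of the `<code>` elements disappears entirely, punctuation is removed, and `#`, `+` and `_` survive, which is why `c#` is still intact at this stage.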


In [19]:
# keep raw data
df_dataFinalSample['raw'] = df_dataFinalSample['Text'].copy()

## Tokenize (Gensim)

In [20]:
# Tokenize & Remove stopwords
# (the NLTK stop word list must be available locally: nltk.download('stopwords'))

mystopwords = set(stopwords.words('english')) | set(string.punctuation)


def tokenize_text(text):
    # tokenize, normalize and filter numbers
    words = gensim.utils.tokenize(text, lowercase=True)
    # remove stop words
    return [x for x in words if x not in mystopwords]

In [21]:
df_dataFinalSample.loc[:, 'Text'] = df_dataFinalSample['Text'].apply(tokenize_text)
df_dataFinalSample['Text'].head(5)

337049    [make, shrinkable, scrollbar, want, notice, sh...
478707    [prefer, var, var, ive, programming, cderived,...
191651    [sql, serverreporting, services, export, repor...
228069    [wix, register, new, isapi, extension, script,...
26946     [c, tetris, clone, cant, get, block, respond, ...
Name: Text, dtype: object
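Note that `c#` has become `c` in the last row: the tokenizer only keeps alphabetic runs, so digits and symbols such as `#` and `+` are lost at this stage. The sketch below approximates that behaviour with a plain regular expression. The pattern is our understanding of what `gensim.utils.tokenize` does rather than a copy of its implementation, and `STOP_WORDS` is a tiny made-up list standing in for NLTK's English stop words:

```python
import re

# Runs of word characters that are not digits (approximation of gensim's tokenizer)
ALPHABETIC = re.compile(r'(?:(?!\d)\w)+')

# Tiny hypothetical stop list, for illustration only
STOP_WORDS = {'i', 'a', 'to', 'do', 'you', 'how', 'or', 'if'}

def tokenize_sketch(text):
    tokens = ALPHABETIC.findall(text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

tokens = tokenize_sketch('c# tetris clone cant get block to respond 2 times')
print(tokens)
```

If language tags such as `c#` or `c++` need to survive as tokens, they would have to be protected before this step, for instance by mapping them to alphabetic placeholders.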

After this processing, some posts may end up empty, so we look up the indexes of these empty posts and remove them from our dataset.

In [22]:
# Index labels of the posts whose token list is empty
NulTextIndex = df_dataFinalSample.index[df_dataFinalSample['Text'].str.len() == 0]
NumOfNulText = len(NulTextIndex)

print('Number of empty posts: {}'.format(NumOfNulText))

Number of empty posts: 5


In [23]:
# Remove empty posts
df_dataFinalSample = df_dataFinalSample.drop(NulTextIndex)
df_dataFinalSample.head()

Unnamed: 0,Title,Body,Tags,Text,raw
337049,How do I make a shrinkable scrollbar?,<p><strong>What I want:</strong> </p>\n\n<pre...,"[html, css, scrollbars]","[make, shrinkable, scrollbar, want, notice, sh...",how do i make a shrinkable scrollbar what i wa...
478707,"Do you prefer ""if (var)"" or ""if (var != 0)""?",<p>I've been programming in C-derived language...,"[c, perl, coding-style]","[prefer, var, var, ive, programming, cderived,...",do you prefer if var or if var 0 ive been...
191651,SQL ServerReporting Services - Export report t...,<p>I have a report which has a document map fo...,"[c#, asp.net, reporting-services]","[sql, serverreporting, services, export, repor...",sql serverreporting services export report to...
228069,WIX: How can I register a new ISAPI Extension ...,"<p>I've seen the <a href=""http://wix.sourcefor...","[installation, wix, windows-installer, isapi]","[wix, register, new, isapi, extension, script,...",wix how can i register a new isapi extension o...
26946,C# - Tetris clone - Can't get block to respond...,<p>I'm working on programming a Tetris game in...,"[c#, keydown, arrow-keys]","[c, tetris, clone, cant, get, block, respond, ...",c# tetris clone cant get block to respond pr...


## Make bigrams / trigram with Gensim

In [24]:
# Build the bigram and trigram models
bigram = gensim.models.Phrases(df_dataFinalSample['Text'], min_count=50, threshold=1000) # higher threshold fewer phrases.
trigram = gensim.models.Phrases(bigram[df_dataFinalSample['Text']], threshold=1000)
# Faster way to get a sentence clubbed as a trigram/bigram
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)

# save bigram and trigram model for heroku app
bigram_mod.save('HerokuApp/bigram_model')
trigram_mod.save('HerokuApp/trigram_model')

In [25]:
def make_bigrams(text):
    return bigram_mod[text]

def make_trigrams(text):
    return trigram_mod[text]

df_dataFinalSample.loc[:, 'Text'] = df_dataFinalSample['Text'].apply(make_bigrams)
df_dataFinalSample.loc[:, 'Text'] = df_dataFinalSample['Text'].apply(make_trigrams)
df_dataFinalSample['Text'].head(5)

337049    [make, shrinkable, scrollbar, want, notice, sh...
478707    [prefer, var, var, ive, programming, cderived,...
191651    [sql, serverreporting, services, export, repor...
228069    [wix, register, new, isapi, extension, script,...
26946     [c, tetris, clone, cant, get, block, respond, ...
Name: Text, dtype: object
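The first rows look unchanged after this step, which is plausible given `threshold=1000`. As far as we understand it, the default scorer in gensim's `Phrases` follows the formula from Mikolov et al.: a pair is merged only if `(count_ab - min_count) / (count_a * count_b) * vocab_size` exceeds the threshold. The sketch below evaluates that formula on made-up counts; `phrase_score` is a hypothetical helper, not a gensim function:

```python
def phrase_score(count_a, count_b, count_ab, vocab_size, min_count):
    """Score of the candidate bigram (a, b); higher means more phrase-like."""
    return (count_ab - min_count) / count_a / count_b * vocab_size

THRESHOLD = 1000  # value used in the notebook

# Frequent words that often co-occur: the score stays far below the threshold
common = phrase_score(count_a=100, count_b=100, count_ab=90,
                      vocab_size=10000, min_count=50)

# Two words that always appear together, in a large vocabulary: the score passes
fused = phrase_score(count_a=100, count_b=100, count_ab=100,
                     vocab_size=500000, min_count=50)

print(common > THRESHOLD, fused > THRESHOLD)
```

Lowering `threshold` (or `min_count`) would let more pairs through, at the cost of noisier phrases.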

## Lemmatize (Spacy)

In [26]:
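#!python -m spacy download en_core_web_sm
# the 'en' shortcut only exists in spaCy 2.x; the full model name also works in later versions
nlp = spacy.load('en_core_web_sm')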

def lemmatize(text):
    text = ' '.join(text)
    doc = nlp(text)
    text = [token.lemma_ for token in doc]
    return text

df_dataFinalSample.loc[:, 'Text'] = df_dataFinalSample['Text'].apply(lemmatize)
df_dataFinalSample['Text'].head(5)

337049    [make, shrinkable, scrollbar, want, notice, sh...
478707    [prefer, var, var, -PRON-, have, program, cder...
191651    [sql, serverreporte, services, export, report,...
228069    [wix, register, new, isapi, extension, script,...
26946     [c, tetris, clone, can, not, get, block, respo...
Name: Text, dtype: object
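The lemmatized output now contains `-PRON-`, the placeholder lemma that spaCy 2.x assigns to pronouns, as well as words such as `have`, `can` and `not` that reappear once `ive` and `cant` are split. If these tokens are unwanted downstream, one option is a second filtering pass. This is a sketch only: `DROP` is an illustrative set and `drop_placeholders` a hypothetical helper, neither is part of the pipeline above:

```python
# Illustrative set of tokens to discard after lemmatization
DROP = {'-PRON-', 'have', 'can', 'not', 'be'}

def drop_placeholders(lemmas, drop=DROP):
    """Remove placeholder lemmas and stop words reintroduced by lemmatization."""
    return [lemma for lemma in lemmas if lemma not in drop]

cleaned = drop_placeholders(['prefer', 'var', 'var', '-PRON-', 'have', 'program'])
print(cleaned)  # ['prefer', 'var', 'var', 'program']
```

In the notebook this could be applied with `df_dataFinalSample['Text'].apply(drop_placeholders)`; reusing the stop word list from the tokenization step would be a natural alternative to a hand-written set.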

In [27]:
df_dataFinalSample.shape

(199995, 5)

# Save Corpus

In [28]:
df_dataFinalSample.to_pickle('PickleData/corpus.pkl')