In this notebook, I focus on data cleansing, such as removing HTML tags from the body text, along with some further text processing, so that the dataset is ready for feature extraction.

In [1]:
import os

HOME_DIR = os.curdir
DATA_DIR = os.path.join(HOME_DIR, "data")

In [2]:
import nltk
import pandas as pd
from bs4 import BeautifulSoup
from tqdm import tqdm

pd.options.display.max_colwidth = 255
tqdm.pandas()

In [3]:
df = pd.read_pickle(f"{DATA_DIR}/eda.pkl")

In [4]:
df.head()

Unnamed: 0,id,title,body,tag,tag_count
0,80,SQLStatement.execute() - multiple queries in o...,<p>I've written a database generation script i...,"[flex, actionscript-3, air]",3
1,90,Good branching and merging tutorials for Torto...,<p>Are there any really good tutorials explain...,"[svn, tortoisesvn, branch, branching-and-merging]",4
2,120,ASP.NET Site Maps,<p>Has anyone got experience creating <strong>...,"[sql, asp.net, sitemap]",3
3,180,Function for creating color wheels,<p>This is something I've pseudo-solved many t...,"[algorithm, language-agnostic, colors, color-s...",4
4,260,Adding scripting functionality to .NET applica...,<p>I have a little game written in C#. It uses...,"[c#, .net, scripting, compiler-construction]",4


### Understand text length

In [28]:
min_title_length = df["title"].str.len().min()
max_title_length = df["title"].str.len().max()
min_body_length = df["body"].str.len().min()
max_body_length = df["body"].str.len().max()

In [29]:
print(f"min_title_length: {min_title_length}")
print(f"max_title_length: {max_title_length}")
print(f"min_body_length: {min_body_length}")
print(f"max_body_length: {max_body_length}")

min_title_length: 9
max_title_length: 189
min_body_length: 18
max_body_length: 46489
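
The minimum and maximum only show the extremes. As a minimal sketch, assuming a small toy DataFrame in place of `df` (the rows below are made up for illustration), quantiles give a rough picture of how unusual the longest bodies are:

```python
import pandas as pd

# Toy stand-in for the real dataset (hypothetical rows, not taken from eda.pkl)
toy = pd.DataFrame({
    "body": [
        "<p>Any ideas?</p>\n",       # 18 characters
        "<p>" + "x" * 93 + "</p>",   # 100 characters
        "<p>" + "x" * 993 + "</p>",  # 1000 characters
    ]
})

body_len = toy["body"].str.len()

# Median, 90th and 99th percentile of the body length
summary = body_len.quantile([0.5, 0.9, 0.99])
print(summary)
```

On the real data, the same call on `df["body"].str.len()` would show whether a body of 46489 characters is an isolated outlier or part of a long tail.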


In [30]:
df[df["title"].str.len() == min_title_length]

Unnamed: 0,id,title,body,tag,tag_count
9695,622900,C# hashes,<p>I'm new to C#</p>\n\n<blockquote>\n <ol>\n...,"[c#, hash]",2


In [31]:
df[df["title"].str.len() == max_title_length]

Unnamed: 0,id,title,body,tag,tag_count
694578,23691000,How to convert Office 365 ï¿½Éï¿½ï¿½ï¿½ï¿½é¼...,<p>I am reading source of the following url bu...,"[c#, character-encoding]",2


<img src="images/question-encoding-error.png" width="500" height="567" /> 
We can see that the title itself contains an encoding error, so there is not much we can do about it.
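
Although such titles cannot be repaired, they can at least be flagged. The sketch below is one possible check, not part of the original pipeline: it assumes the debris is the Unicode replacement character U+FFFD, either as is or in its mis-decoded three-character form. The helper name `has_encoding_debris` is hypothetical.

```python
REPLACEMENT_CHAR = "\ufffd"

# U+FFFD encoded as UTF-8 and then wrongly decoded as Latin-1
# appears as these three characters
MOJIBAKE_REPLACEMENT = "\u00ef\u00bf\u00bd"


def has_encoding_debris(text: str) -> bool:
    """Return True if the text contains replacement-character debris."""
    return REPLACEMENT_CHAR in text or MOJIBAKE_REPLACEMENT in text


print(has_encoding_debris("How to convert Office 365 \u00ef\u00bf\u00bd"))
print(has_encoding_debris("C# hashes"))
```

Applied to the dataset, `df[df["title"].map(has_encoding_debris)]` would list the affected rows so they can be inspected or excluded.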

In [32]:
df[df["body"].str.len() == min_body_length]

Unnamed: 0,id,title,body,tag,tag_count
14480,858790,How to setup TeamCity under IIS?,<p>Any ideas?</p>\n,"[version-control, teamcity]",2


In [37]:
df[df["body"].str.len() == max_body_length]

Unnamed: 0,id,title,body,tag,tag_count
1168127,37657980,Elasticsearch - How to provide custom synonyms when querying?,"<p>I'm developping a search engine for my client which has to use synonym expansion. I can properly setup my index with a synonym token filter and a custom file (synonym.txt). </p>\n\n<p>Example: ipod, i-pod, i pod</p>\n\n<p>However, whenever we want ...",[elasticsearch],1


<img src="images/question-long-body-text.png" width="500" height="300" /> 
The prose in the body is not particularly long; rather, the original poster has included a long block of code for reference. We need to decide whether to retain this kind of content, as it may deviate considerably from what the model assumes about its input.
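
If we decide to drop such code, one option is to remove the code blocks before extracting the text. This is a minimal sketch, assuming that code blocks are wrapped in `<pre>` elements; the helper `strip_html` and its `drop_code` flag are hypothetical names, and the built-in `html.parser` is used here only to keep the example dependency-free:

```python
from bs4 import BeautifulSoup


def strip_html(text: str, drop_code: bool = True) -> str:
    """Return the plain text of an HTML fragment, optionally without <pre> blocks."""
    soup = BeautifulSoup(text, "html.parser")
    if drop_code:
        # Remove every <pre> element together with its contents
        for block in soup.find_all("pre"):
            block.decompose()
    return soup.get_text()


sample = (
    "<p>How do I add synonyms?</p>\n"
    "<pre><code>PUT /my_index</code></pre>\n"
    "<p>Thanks</p>\n"
)
print(strip_html(sample))
```

Inline `<code>` spans inside paragraphs are left untouched by this sketch, so short identifiers mentioned in the prose are still kept.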

### Use BeautifulSoup to remove HTML tags from body text

In [70]:
df["body"] = df["body"].progress_apply(lambda text: BeautifulSoup(text, "lxml").text)

100%|██████████| 1264216/1264216 [07:52<00:00, 2676.96it/s]


In [71]:
df.head()

Unnamed: 0,id,title,body,tag,tag_count
0,80,SQLStatement.execute() - multiple queries in one statement,"I've written a database generation script in SQL and want to execute it in my Adobe AIR application:\nCreate Table tRole (\n roleID integer Primary Key\n ,roleName varchar(40)\n);\nCreate Table tFile (\n fileID integer Primary Key\n ,f...","[flex, actionscript-3, air]",3
1,90,Good branching and merging tutorials for TortoiseSVN?,Are there any really good tutorials explaining branching and merging with Apache Subversion? \nAll the better if it's specific to TortoiseSVN client.\n,"[svn, tortoisesvn, branch, branching-and-merging]",4
2,120,ASP.NET Site Maps,"Has anyone got experience creating SQL-based ASP.NET site-map providers?\nI've got the default XML file web.sitemap working properly with my Menu and SiteMapPath controls, but I'll need a way for the users of my site to create and modify pages dynamic...","[sql, asp.net, sitemap]",3
3,180,Function for creating color wheels,"This is something I've pseudo-solved many times and never quite found a solution. That's stuck with me. The problem is to come up with a way to generate N colors, that are as distinguishable as possible where N is a parameter.\n","[algorithm, language-agnostic, colors, color-space]",4
4,260,Adding scripting functionality to .NET applications,"I have a little game written in C#. It uses a database as back-end. It's \na trading card game, and I wanted to implement the function of the cards as a script.\nWhat I mean is that I essentially have an interface, ICard, which a card class implements...","[c#, .net, scripting, compiler-construction]",4


In [72]:
# checkpoint
df.to_pickle(f"{DATA_DIR}/tp-1.pkl")

### Lowercase, remove newlines and punctuation; tokenize and handle symbols in topics

In [5]:
df["body"] = df["body"].str.lower()

In [6]:
import nltk
nltk.download("punkt")

# because of how nltk handles word tokenization, we keep a list of topics with symbols or digits
# that people actually type, so that the isalpha() filter below does not drop them
# this list includes tags that have more than 10,000 questions as of 2020 Jan
topics_with_symbols = ["c#", "c++", ".net", "asp.net", "node.js", "objective-c", "unity3d", "html5", "css3",
                       "d3.js", "utf-8", "neo4j", "scikit-learn", "f#", "3d", "x86"]

# build the allow-list once as a set, instead of rebuilding a list for every word
allowed_tokens = set("+#") | set(topics_with_symbols)

df["body_tokenized"] = df["body"].progress_apply(lambda text: [word for word in nltk.word_tokenize(text)
                                                               if word.isalpha() or word in allowed_tokens])

  0%|          | 10/1264216 [00:00<3:32:52, 98.98it/s]

[nltk_data] Downloading package punkt to /home/ec2-user/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


100%|██████████| 1264216/1264216 [35:24<00:00, 594.98it/s] 


In [None]:
# merge split tokens back into topics that include meaningful symbols, such as c#, c++ and f#
mwe_tokenizer = nltk.MWETokenizer(separator="")
mwe_tokenizer.add_mwe(("c", "#"))
mwe_tokenizer.add_mwe(("c", "+", "+"))
mwe_tokenizer.add_mwe(("f", "#"))

df["body_tokenized"] = df["body_tokenized"].progress_apply(mwe_tokenizer.tokenize)
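
To illustrate what the multi-word expression tokenizer does, here is a small standalone sketch on a hand-written token list (the list is made up for illustration and is not output of the cells above):

```python
from nltk.tokenize import MWETokenizer

# An empty separator joins the matched pieces without any character in between
mwe = MWETokenizer(separator="")
mwe.add_mwe(("c", "#"))
mwe.add_mwe(("c", "+", "+"))

# Hypothetical tokens in which the symbols have been split from the letters
tokens = ["i", "use", "c", "#", "and", "c", "+", "+"]
merged = mwe.tokenize(tokens)
print(merged)  # ['i', 'use', 'c#', 'and', 'c++']
```

Tokens that do not match a registered expression pass through unchanged.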

### Remove stop words

In [None]:
from nltk.corpus import stopwords

nltk.download("stopwords")
stop_words = set(stopwords.words("english"))

def filter_stop_words(words):
    """Return the tokens in `words` that are not English stop words."""
    return [word for word in words if word not in stop_words]

df["body_tokenized"] = df["body_tokenized"].progress_apply(filter_stop_words)

Let's take a look at the results:

In [20]:
df.sample(5)

Unnamed: 0,id,title,body,tag,tag_count,body_tokenized
349655,12519690,spring-mock in production code,"one of our applications has the spring-mock.jar in the ear. i am the cm and i'm not a developer, but it doesn't seem like you want mock services in your production application. i thought the spring-mock.jar allows you to mimic certain services while t...","[java, spring, jar]",3,"[one, applications, ear, cm, developer, seem, like, want, mock, services, production, application, thought, allows, mimic, certain, services, testing, see, code, dependencies, upon, classes, always, possible, runtime, dependencies, compile, time, depe..."
79976,3480790,Subdomain and favicons,i'm having trouble showing the favicon of a subdomain (which redirects to a php file)\nmc.company.com redirects to www.company.com/mc.php\nin the mc.php file i have included a header (templates) which has the link and stuff for the favicon:\n<link rel...,"[html, subdomain, favicon]",3,"[trouble, showing, favicon, subdomain, redirects, php, file, redirects, file, included, header, templates, link, stuff, favicon, link, shortcut, icon, http, link, icon, http, noticed, main, index, subdirectory, creates, frameset, frame, file, frame, b..."
2153,174400,Using VS 2005 to design abstract forms,there's a famous bug in visual studio that prevents you from using the form designer on a subclass of an abstract form. \nthis problem has already been elucidated and solved most elegantly by urban potato; that's not the part i'm having trouble with....,"[c#, .net, visual-studio, winforms, designer]",5,"[famous, bug, visual, studio, prevents, using, form, designer, subclass, abstract, form, problem, already, elucidated, solved, elegantly, urban, potato, part, trouble, trouble, duplicated, technique, described, urban, potato, included, project, happen..."
401950,14218320,click/clickAndWait selenese command fails when used in play framework auto-test mode(headless browser),"while writing some acceptance tests for webapp (playframework based),i got confused by the usage of some selenium commands.\nclick/clickandwait works well when i run in browser using this command. \nplay run \n\nit fails when i run in command prompt(h...","[testing, selenium, playframework, selenium-ide, playframework-1.x]",5,"[writing, acceptance, tests, webapp, playframework, based, got, confused, usage, selenium, commands, works, well, run, browser, using, command, play, run, fails, run, command, prompt, headless, browser, using, command, play, tried, commands, click, li..."
126042,5098050,How do I prevent redeclaration errors when using Mock classes that implement the IteratorAggregate interface when testing with PHPUnit?,"i'm writing a unit test that relies on an external class, exceptionmanager. i want to be able to predict what some specific functions on this class will return, so i'm using a mock object. the code is quite straightforward:\n$mockexceptionmanager = $t...","[php, unit-testing, mocking, phpunit]",4,"[writing, unit, test, relies, external, class, exceptionmanager, want, able, predict, specific, functions, class, return, using, mock, object, code, quite, straightforward, mockexceptionmanager, getmock, trouble, exception, manager, implements, iterat..."


### Repeat the steps above on the title column as well

In [None]:
df["title"] = df["title"].str.lower()

df["title_tokenized"] = df["title"].progress_apply(lambda text: [word for word in nltk.word_tokenize(text)
                                                               if word.isalpha() or word in list("+#") + topics_with_symbols])

df["title_tokenized"] = df["title_tokenized"].progress_apply(lambda tokens: [token for token in mwe_tokenizer.tokenize(tokens)])

df["title_tokenized"] = df["title_tokenized"].progress_apply(filter_stop_words)
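
Since the same sequence of steps runs on both columns, it could also be bundled into a single callable. The sketch below shows the idea with a hypothetical `make_pipeline` helper and toy stand-ins for the tokenizer and stop-word list, so that it runs without any nltk downloads; it is an illustration, not the code used in this notebook:

```python
import re


def make_pipeline(*steps):
    """Chain single-argument functions into one callable, applied left to right."""
    def run(value):
        for step in steps:
            value = step(value)
        return value
    return run


# Hypothetical stand-ins for nltk.word_tokenize and the nltk stop-word list
TOY_STOP_WORDS = {"how", "to", "a", "in"}


def toy_tokenize(text):
    # Runs of letters, optionally followed by '#', so that "c#" stays one token
    return re.findall(r"[a-z]+#?", text)


def toy_filter_stop_words(tokens):
    return [token for token in tokens if token not in TOY_STOP_WORDS]


clean = make_pipeline(str.lower, toy_tokenize, toy_filter_stop_words)
print(clean("How to hash a string in C#"))  # ['hash', 'string', 'c#']
```

With the real steps substituted in, the same callable could be passed to `progress_apply` for both the `body` and `title` columns, which keeps the two columns processed identically.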

In [26]:
df.sample(5)

Unnamed: 0,id,title,body,tag,tag_count,body_tokenized,title_tokenized
63709,2887940,re-adjusting a binary heap after removing the minimum element,"after removing the minimum element in a binary heap, i.e. after removing the root, i understand that the heap must be adjusted in order to maintain the heap property. \nbut the preferred method for doing this appears to be to assign the last leaf to t...","[algorithm, binary-heap]",2,"[removing, minimum, element, binary, heap, removing, root, understand, heap, must, adjusted, order, maintain, heap, property, preferred, method, appears, assign, last, leaf, root, sift, wondering, take, lesser, child, used, root, keep, sifting, childr...","[binary, heap, removing, minimum, element]"
408320,14430440,"how to ""walk"" through word document making changes to the content?","i want to replace each character in file with another one.\nnow i'm implementing it by using find.execute() method, but in this case it spends time for searching and then replaces it, then search for another character from the beginning of file again,...","[c#, .net, ms-word]",3,"[want, replace, character, file, another, one, implementing, using, method, case, spends, time, searching, replaces, search, another, character, beginning, file, want, replace, alphabetic, letters, go, whole, document, lower, case, upper, case, times,...","[walk, word, document, making, changes, content]"
466985,16342040,"using an @autowired resource with ""try with resource""","given the following:\npublic class resourceone implements autocloseable {...}\n\nwith an instance of resourceone instantiated in (spring) xml config. \nhow should this object (when autowired) be used with the ""try-with-resources statement"", since you ...","[java, spring]",2,"[given, following, public, class, resourceone, implements, autocloseable, instance, resourceone, instantiated, spring, xml, config, object, autowired, used, statement, since, required, instantiate, resource, try, block, one, approach, could, use, refe...","[using, autowired, resource, try, resource]"
653505,22397870,facing an error in program of image capturing and processing using opencv(cv2) and python,"in the following program, it captures and processes image runtime. but i am facing a lot problems in the code. the first problem is, when camera is initialized for the first time and if it is unable to detect red colour in captured frame then it give...","[opencv, python-2.7]",2,"[following, program, captures, processes, image, runtime, facing, lot, problems, code, first, problem, camera, initialized, first, time, unable, detect, red, colour, captured, frame, gives, following, error, traceback, recent, call, last, file, line, ...","[facing, error, program, image, capturing, processing, using, opencv, python]"
1194463,38354640,grouping list by nth element,"i have a 2d list like the ones below \noriginal_list = [['2', 'out', 'words', 'test3', '21702-1201', 'us', 41829.0, 'vn', 'post', 'nai'],\n ['test', 'info', 'more info', 'stuff', '63123-7802', 'us', 40942.0, 'cm', 'user info', 'vai'],\...","[python, list]",2,"[list, like, ones, info, info, user, already, sorted, list, zip, code, want, split, list, zip, code, element, group, zip, codes, new, list, would, also, like, sort, first, numbers, zip, code, ignoring, last, tried, use, zip, function, could, get, grou...","[grouping, list, nth, element]"


In [28]:
# checkpoint: rename "tag" to "tags" and save only the columns needed for feature extraction
df.rename(columns={"tag": "tags"}, inplace=True)
df[["id", "title_tokenized", "body_tokenized", "tags"]].to_pickle(f"{DATA_DIR}/tp-2.pkl")