In this notebook, I extract features from the two text columns, title and body, so that the dataset is ready for model training.

In [1]:
import os

HOME_DIR = os.curdir
DATA_DIR = os.path.join(HOME_DIR, "data")

In [4]:
import pandas as pd
from tqdm import tqdm

pd.options.display.max_colwidth = 255
tqdm.pandas()

In [3]:
df = pd.read_pickle(f"{DATA_DIR}/tp-2.pkl")

In [4]:
df.sample(5)

| | id | title_tokenized | body_tokenized | tags |
|---|---|---|---|---|
| 170505 | 6628270 | [view, renders, correctly, browser, tests, fail, helper, method] | [simple, view, helper, defines, title, everything, works, fine, pull, view, browser, rspec, tests, fail, tests, describe, pagescontroller, ror, sample, app, end, describe, get, successful, get, end, right, title, get, title, content, +, home, end, end... | [ruby-on-rails, view, rspec, helper] |
| 1181127 | 37999670 | [parsing, linux, iscsi, python, nested, dictionaries] | [writing, script, involves, multipath, objects, standard, configuration, file, example, #, basic, configuration, file, examples, device, mapper, #, multipath, #, #, use, user, friendly, names, instead, using, wwids, names, defaults, yes, #, #, devices... | [python, dictionary, config, iscsi] |
| 408156 | 14425410 | [unable, handle, kernel, paging, request, x, intercepting, system, call] | [possible, duplicate, linux, kernel, system, call, hooking, example, trying, hook, system, calls, kernel, got, basic, idea, system, call, trying, intercept, fork, found, address, turned, wrote, module, #, include, #, include, #, include, #, include, #... | [c, linux, kernel-module, kernel] |
| 330749 | 11898920 | [working] | [using, url, routing, structure, web, use, url, routing, especially, multi, segment, css, js, file, used, work, void, routemap, routecollection, route, false, resource, work, resource, work, work, work, guide, key, key, void, object, sender, eventargs... | [asp.net, routing, url-routing] |
| 1012433 | 33451620 | [hibernate, findbyproperty, use, different, types] | [trying, write, search, using, hibernate, would, run, search, different, types, variables, model, movie, properties, title, director, genre, year, title, director, genre, strings, year, int, jsp, file, select, choose, property, want, search, text, inp... | [java, hibernate-criteria] |


### Number of tags (i.e. classes)

In [7]:
from collections import Counter

tag_count = Counter()

def count_tag(tags):
    for tag in tags:
        tag_count[tag] += 1

df["tags"].apply(count_tag)

len(tag_count.values())

38147

There are over 38,000 tags in the dataset, which is too many for multi-label classification. I will therefore keep only the 4,000 most common tags (which cover 90% of the questions), as suggested by the earlier exploratory data analysis.
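
To make the trade-off concrete, here is a minimal sketch of how coverage for a given cut-off could be measured. It assumes "coverage" means the fraction of questions that keep at least one tag after filtering; `coverage_at_k` and `toy_tags` are hypothetical names with made-up data, not part of the original analysis.

```python
from collections import Counter

def coverage_at_k(tag_lists, k):
    """Fraction of questions that keep at least one tag when only the
    k most frequent tags are retained."""
    counts = Counter(tag for tags in tag_lists for tag in tags)
    top_k = {tag for tag, _ in counts.most_common(k)}
    kept = sum(1 for tags in tag_lists if any(tag in top_k for tag in tags))
    return kept / len(tag_lists)

# Toy data standing in for df["tags"] (hypothetical values)
toy_tags = [
    ["python", "pandas"],
    ["python"],
    ["java"],
    ["java", "hibernate"],
    ["rspec"],
]

print(coverage_at_k(toy_tags, 2))  # only "python" and "java" are kept
print(coverage_at_k(toy_tags, 5))  # every tag is kept
```

On the real data the same function could be called as `coverage_at_k(df["tags"], 4000)` to check the cut-off before filtering.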

In [8]:
# use a set so that each membership test is O(1) instead of a scan over a 4,000-item list
most_common_tags = {tag for tag, _ in tag_count.most_common(4000)}
df["tags"] = df["tags"].progress_apply(lambda tags: [tag for tag in tags if tag in most_common_tags])

100%|██████████| 1264216/1264216 [00:46<00:00, 27418.12it/s]


In [9]:
df[df["tags"].map(lambda tags: len(tags) > 0)].shape

(1250951, 4)

In [10]:
print(f"Only {1264216 - 1250951:,} rows of data will be dropped while number of classes is reduced from {len(tag_count.values()):,} to 4,000, which is great!")

Only 13,265 rows of data will be dropped while number of classes is reduced from 38,147 to 4,000, which is great!


In [11]:
df = df[df["tags"].map(lambda tags: len(tags) > 0)]

In [12]:
# checkpoint
df.to_pickle(f"{DATA_DIR}/fe-1.pkl")

In [5]:
df = pd.read_pickle(f"{DATA_DIR}/fe-1.pkl")

### tf-idf

In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer

# the text is already tokenized, so an identity function is passed to bypass the vectorizer's own tokenization
def dummy_tokenizer(tokens):
    return tokens

# keep only the 10,000 most common title words to limit the size of the dataset
title_vectorizer = TfidfVectorizer(tokenizer=dummy_tokenizer, lowercase=False, max_features=10000)
x_title = title_vectorizer.fit_transform(df["title_tokenized"])

In [7]:
# keep the 100,000 most common body words
body_vectorizer = TfidfVectorizer(tokenizer=dummy_tokenizer, lowercase=False, max_features=100000)
x_body = body_vectorizer.fit_transform(df["body_tokenized"])

Let's take a look at an example:

In [8]:
df.iloc[[10]]

| | id | title_tokenized | body_tokenized | tags |
|---|---|---|---|---|
| 10 | 930 | [connect, database, loop, recordset, c#] | [simplest, way, connect, query, database, set, records, c#] | [c#, database, loops, connection] |


In [9]:
# requires scikit-learn >= 1.0 (get_feature_names() was removed in 1.2)
pd.DataFrame(x_title[:11].toarray(), columns=title_vectorizer.get_feature_names_out()) \
  .iloc[10].sort_values(ascending=False).where(lambda v: v > 0).dropna().head(10)

recordset    0.668319
connect      0.433807
loop         0.376748
database     0.338899
c#           0.329195
Name: 10, dtype: float64

In [10]:
pd.DataFrame(x_body[:11].toarray(), columns=body_vectorizer.get_feature_names_out()) \
  .iloc[10].sort_values(ascending=False).where(lambda v: v > 0).dropna().head(10)

simplest    0.557716
records     0.401627
connect     0.373088
c#          0.356147
query       0.294480
database    0.282837
set         0.231035
way         0.203765
Name: 10, dtype: float64

The result looks reasonable: the highest-weighted features include keywords such as `connect`, `loop`, `c#` and `database`, which closely match the actual tags (`c#`, `database`, `loops`, `connection`).
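
The rare word `recordset` gets the highest weight in the title because tf-idf rewards terms that appear in few documents. The sketch below reproduces that effect in pure Python, using the weighting that scikit-learn documents as the `TfidfVectorizer` default (raw term counts, smoothed idf, L2-normalized rows). The function `tfidf` and the corpus `toy_titles` are hypothetical illustrations, not the notebook's data.

```python
import math
from collections import Counter

def tfidf(docs):
    """TF-IDF for pre-tokenized documents: raw counts, smoothed idf, L2 norm."""
    n_docs = len(docs)
    # document frequency: number of documents containing each term
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log((1 + n_docs) / (1 + freq)) + 1
           for term, freq in doc_freq.items()}
    rows = []
    for doc in docs:
        weights = {term: count * idf[term] for term, count in Counter(doc).items()}
        # fall back to 1.0 so that an empty document does not divide by zero
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({term: w / norm for term, w in weights.items()})
    return rows

# Made-up corpus; the first entry mirrors the example title above
toy_titles = [
    ["connect", "database", "loop", "recordset", "c#"],
    ["connect", "database", "python"],
    ["loop", "list", "python"],
    ["database", "query", "c#"],
]

weights = tfidf(toy_titles)
for term, weight in sorted(weights[0].items(), key=lambda kv: -kv[1]):
    print(f"{term:10s} {weight:.3f}")
```

In this toy corpus `recordset` occurs in one document and `database` in three, so they end up with the largest and smallest weight in the first row respectively.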

### Concatenate dataset and train/test split

In [6]:
# the tag named "nan" is stored as a float NaN, which causes a string comparison error later; convert it back to a string
df["tags"] = df["tags"].apply(lambda tags: [tag if not isinstance(tag, float) else "nan" for tag in tags])
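
The snippet below is a small sketch of the failure this guards against. It assumes the error comes from sorting a tag list that mixes `str` values with a float NaN; the source does not show where the error was actually raised, and the list `tags` is made up.

```python
# A tag list in which the tag "nan" has turned into a float NaN (hypothetical data)
tags = ["c#", float("nan"), "database"]

try:
    sorted(tags)
    sort_error = None
except TypeError as err:
    # str and float cannot be ordered against each other
    sort_error = err
print(f"sorting the raw tags fails: {sort_error}")

# The same replacement as in the cell above, applied to one list
cleaned = [tag if not isinstance(tag, float) else "nan" for tag in tags]
print(sorted(cleaned))
```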

In [12]:
# give a weight of 2 to title as it should contain more important words than body
x_title = x_title * 2

In [7]:
from scipy.sparse import hstack
from sklearn.model_selection import train_test_split

X = hstack([x_title, x_body])
y = df[["tags"]]

In [8]:
from sklearn.preprocessing import MultiLabelBinarizer

multi_label_binarizer = MultiLabelBinarizer()
y = multi_label_binarizer.fit_transform(y["tags"])
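
As an illustration of what the binarizer produces, here is a pure-Python sketch of multi-label binarization: one column per distinct tag in sorted order, with a 1 wherever a question carries that tag. `binarize` and `toy` are hypothetical names; the real `MultiLabelBinarizer` returns a NumPy array and offers more options.

```python
def binarize(tag_lists):
    """Return the sorted class list and a 0/1 indicator row per question."""
    classes = sorted({tag for tags in tag_lists for tag in tags})
    column = {tag: i for i, tag in enumerate(classes)}
    matrix = []
    for tags in tag_lists:
        row = [0] * len(classes)
        for tag in tags:
            row[column[tag]] = 1
        matrix.append(row)
    return classes, matrix

# Made-up tag lists standing in for df["tags"]
toy = [["c#", "database"], ["python"], ["database", "python"]]
classes, y_toy = binarize(toy)

print(classes)
for row in y_toy:
    print(row)
```

With 4,000 retained tags, each row of the real `y` has 4,000 columns, most of them zero.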

In [15]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [11]:
# checkpoint

import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23

joblib.dump(X_train, f"{DATA_DIR}/x_train.pkl")
joblib.dump(X_test, f"{DATA_DIR}/x_test.pkl")
joblib.dump(y_train, f"{DATA_DIR}/y_train.pkl")
joblib.dump(y_test, f"{DATA_DIR}/y_test.pkl")
joblib.dump(multi_label_binarizer.classes_, f"{DATA_DIR}/y_classes.pkl")

['./data/y_classes.pkl']