# Predict tags of StackOverflow questions

In this task, you will learn how to predict tags for posts from [StackOverflow](https://stackoverflow.com). To solve it, you will use a multilabel classification approach, since a single post can carry several tags at once.
<img src="images/stackoverflow tagging.png">

## Libraries

In this task you will need the following libraries:
- [Numpy](http://www.numpy.org) — a package for scientific computing.
- [Pandas](https://pandas.pydata.org) — a library providing high-performance, easy-to-use data structures and data analysis tools for Python.
- [scikit-learn](http://scikit-learn.org/stable/index.html) — a tool for data mining and data analysis.
- [NLTK](http://www.nltk.org) — a platform to work with natural language.

So let's start by importing the aforementioned libraries:

In [1]:
import numpy as np
import pandas as pd
import nltk

## Read & visualize data
In this task you will work with a dataset of post titles from StackOverflow. The data is split into two sets: *train* and *test*. Each set contains post titles and their corresponding tags (100 tags are available). Let's read the data with pandas and take a look at it.

In [2]:
from ast import literal_eval

def read_data(filename, size=None):
    data = pd.read_csv(filename, sep='\t')
    data['tags'] = data['tags'].apply(literal_eval)
    if size: data = data.iloc[:size,:] # take a subset of the data
    return data
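
The `literal_eval` step in `read_data` is there because a TSV file can only store each tag list as text. A minimal sketch of the effect, assuming the `tags` column holds the string form of a Python list (the two-row sample below is made up for illustration and is not read from the real files):

```python
import io
from ast import literal_eval

import pandas as pd

# Hypothetical two-row sample in the assumed TSV layout (title <TAB> tags).
sample_tsv = (
    "title\ttags\n"
    "Configuring Tomcat to Use SSL\t['java']\n"
    "get current time via jquery\t['javascript', 'jquery']\n"
)

demo = pd.read_csv(io.StringIO(sample_tsv), sep="\t")
raw_type = type(demo["tags"].iloc[0])         # plain text right after reading
demo["tags"] = demo["tags"].apply(literal_eval)
parsed_type = type(demo["tags"].iloc[0])      # a real Python list after parsing

print(raw_type, parsed_type)
print(demo["tags"].iloc[1])
```

Without this conversion, every later step that iterates over the tags of a post would iterate over the characters of a string instead.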

In [3]:
train = read_data('data/stackoverflow tagging/train.tsv', 20000)
test = read_data('data/stackoverflow tagging/validation.tsv', 5000)

train.head()

| | title | tags |
|---|---|---|
| 0 | How to draw a stacked dotplot in R? | [r] |
| 1 | mysql select all records where a datetime fiel... | [php, mysql] |
| 2 | How to terminate windows phone 8.1 app | [c#] |
| 3 | get current time in a specific country via jquery | [javascript, jquery] |
| 4 | Configuring Tomcat to Use SSL | [java] |


In [4]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 2 columns):
title    20000 non-null object
tags     20000 non-null object
dtypes: object(2)
memory usage: 312.6+ KB


In [5]:
print('titles data type is', type(train['title'].iloc[0]))
print('tags data type is', type(train['tags'].iloc[0]))

titles data type is <class 'str'>
tags data type is <class 'list'>


As you can see, the *title* column contains the titles of the posts and the *tags* column contains their tags. Note that the number of tags per post is not fixed: a post can have as many as it needs.

For convenience, initialize *X_train*, *X_test*, *y_train*, and *y_test*.

In [6]:
X_train, y_train = train['title'].values, train['tags'].values
X_test, y_test = test['title'].values, test['tags'].values

In [7]:
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(20000,) (20000,)
(5000,) (5000,)


## Preprocessing

In [8]:
for ix in [10,27,28,33,37]:
    print(X_train[ix])

Obtaining the object the java.lang.Class<T> object for a parameterized type without constructing the class in q_uestion?
How to keep 2 objects in view at all time by scaling the field of view? (or z&y axis)
AttributeError: 'NoneType' object has no attribute 'split'
Creating JSON in Windows 8 C# using IJsonValue: Cannot Implicitly convert type 'string' to 'Windows.Data.Json.IJsonValue'
Laravel 4: Input::all() returns no data with $.ajax POST


What's weird about the previous examples? One of the best-known difficulties of working with natural language data is that it is unstructured. For example, if you use it "as is" and extract tokens just by splitting the titles on whitespace, you will get many "weird" tokens like *all()*, *'string'*, *Class<>*, etc. To avoid such problems, it is usually useful to clean the data first. In this task you'll write a function to clean all titles in our data.
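
To see the problem concretely, here is a small sketch that splits one of the titles printed above on whitespace. The variable names are only for illustration:

```python
# One of the raw titles printed above.
raw_title = "Laravel 4: Input::all() returns no data with $.ajax POST"

# Naive tokenization: split on whitespace only.
naive_tokens = raw_title.split()

# Tokens that contain anything besides letters and digits.
noisy_tokens = [tok for tok in naive_tokens if not tok.isalnum()]

print(naive_tokens)
print(noisy_tokens)
```

Tokens such as `Input::all()` or `4:` would each become a separate vocabulary entry, even though the useful words inside them (`input`, `all`, `4`) already occur elsewhere.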

**Task 1 (preprocessing).** Implement the function *text_prepare* that will do the following:
- convert text to lower case
- replace symbols (e.g. brackets & slashes) with space
- delete weird symbols (e.g. ? ! * < >)
- delete stop words

After that, run the function *test_text_prepare* to test it on tiny cases and check that your implementation is correct.

<p style="color:white"> *HINT:* use regular expressions, the **sub** function in particular </p>
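
The list above separates "replace with space" from "delete" for a reason. A minimal sketch of the difference with `re.sub`, using small illustrative patterns that are assumptions for this example and not the ones required by the task:

```python
import re

text = "read/write file(s)?"

# Replacing separators with a space keeps the neighbouring words apart.
spaced = re.sub(r"[/()]", " ", text)
# Deleting the remaining punctuation, then collapsing repeated spaces.
cleaned = " ".join(re.sub(r"[?!*<>]", "", spaced).split())

# Deleting everything in one go glues words together instead.
glued = re.sub(r"[/()?]", "", text)

print(cleaned)
print(glued)
```

Slashes and brackets usually sit between two words, so turning them into spaces preserves the word boundary, while symbols like `?` can simply be dropped.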

In [9]:
import re
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [10]:
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [11]:
# Raw strings keep the backslash escapes intact and avoid invalid-escape warnings
REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')
STOPWORDS = set(stopwords.words('english'))

def text_prepare(text):
    """
        text: a string
        return: modified initial string
    """
    text = text.lower() # lowercase text
    text = REPLACE_BY_SPACE_RE.sub(' ', text) # replace REPLACE_BY_SPACE_RE symbols by space in text
    text = BAD_SYMBOLS_RE.sub('', text) # delete symbols which are in BAD_SYMBOLS_RE from text
    text = ' '.join([word for word in text.split() if word not in STOPWORDS]) # delete stopwords from text
    return text

In [12]:
def test_text_prepare():
    examples = ["SQL Server - any equivalent of Excel's CHOOSE function?",
                "How to free c++ memory vector<int> * arr?"]
    answers = ["sql server equivalent excels choose function", 
               "free c++ memory vectorint arr"]
    for ex, ans in zip(examples, answers):
        if text_prepare(ex) != ans:
            print('ans:', ans)
            print('text_prepare(ex):', text_prepare(ex))
            return "Wrong answer for the case: '%s'" % ex
    return 'Basic tests are passed.'

print(test_text_prepare())

Basic tests are passed.


Now apply the preprocessing function to our data (X_train & X_test):

In [13]:
X_train = [text_prepare(x) for x in X_train]
X_test = [text_prepare(x) for x in X_test]

## Transforming text to vector of numbers (Bag of Words)
**Task 2:** get a list of all unique words in our training data (let's call it the vocabulary)

In [14]:
def get_vocabulary(sentences):
    """ Given several sentences, extract all unique words sorted from a to z
    params:
        sentences: list of strings
    returns:
        list of all unique words in sentences sorted from a to z
    example:
        if sentences = ['i liked the movie', 'the movie was great']
        output should be = ['great', 'i', 'liked', 'movie', 'the', 'was']
    """
    words = []
    for sentence in sentences:
        w = sentence.split()
        words.extend(w)
    words = sorted(set(words))
    return words

**Let's test our implementation of get_vocabulary function:**

In [15]:
def test_get_vocabulary():
    examples = ['i liked the movie', 'the movie was great']
    answers = ['great', 'i', 'liked', 'movie', 'the', 'was']
    vocab = get_vocabulary(examples)
    if vocab != answers:
        print('correct output >>', answers)
        print('your output >>', vocab)
        print('please go back and update get_vocabulary function')
    else:
        print('Basic tests are passed :)')

test_get_vocabulary()

Basic tests are passed :)


**Apply the function to our training data:**

In [16]:
vocabulary = get_vocabulary(X_train)
print('# unique words:',len(vocabulary))

# unique words: 12249


Once we have the vocabulary (unique words), we can build the BoW representation and convert our text to vectors. <br>
**Task 3:** implement bag_of_words function

In [17]:
def bag_of_words(sentences, vocabulary):
    """
    params:
        sentences: list of strings
        vocabulary: list of all unique words
    returns:
        bag of words of the input sentences: numpy array of shape (#sentences, #vocabulary)
    """
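    # Map each word to its column once; list.index() would rescan the vocabulary for every token
    word_to_index = {word: i for i, word in enumerate(vocabulary)}
    bow = np.zeros((len(sentences), len(vocabulary)), dtype=np.int8)
    for index, sentence in enumerate(sentences):
        for token in sentence.split():
            column = word_to_index.get(token)
            if column is not None:  # skip words outside the vocabulary
                bow[index, column] = 1
    return bow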

**Let's test our implementation of bag_of_words function:**

In [18]:
from IPython.display import display

def test_bag_of_words():
    docs = ['machine learning is great',
           'natural language processing is a complex field',
           'natural language processing is used in machine learning']
    bow_true = np.array([[0,0,0,1,0,1,0,1,1,0,0,0],[1,1,1,0,0,1,1,0,0,1,1,0],[0,0,0,0,1,1,1,1,1,1,1,1]])
    vocab = get_vocabulary(docs)
    if vocab != ['a', 'complex', 'field', 'great', 'in', 'is', 'language', 'learning', 'machine', 'natural', 'processing', 'used']:
        print('Error in get_vocabulary function!')
        return
    bow = bag_of_words(docs, vocab)
    if not np.array_equal(bow, bow_true):
        print('Something went wrong :( re-check your implementation!')
    else:
        print('Basic tests are passed :)')
        df = pd.DataFrame(data=bow, columns=vocab, index=docs)
        display(df.head())

test_bag_of_words()

Basic tests are passed :)


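| | a | complex | field | great | in | is | language | learning | machine | natural | processing | used |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| machine learning is great | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 |
| natural language processing is a complex field | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 |
| natural language processing is used in machine learning | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |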


**Apply bag_of_words function to our train & test data:**

In [19]:
X_train_mybag = bag_of_words(X_train, vocabulary)
X_test_mybag = bag_of_words(X_test, vocabulary)

In [20]:
print(X_train_mybag.shape, X_test_mybag.shape)

(20000, 12249) (5000, 12249)


In [21]:
from scipy import sparse as sp_sparse

# Convert the dense arrays to CSR in one call instead of stacking row by row
X_train_mybag_sparse = sp_sparse.csr_matrix(X_train_mybag)
X_test_mybag_sparse = sp_sparse.csr_matrix(X_test_mybag)

### FYI
Bag of words is a very common approach in NLP, so off-the-shelf implementations are available in Python. For more info, check the following scikit-learn classes: [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html), [TfidfVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html), and [HashingVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html).

In [23]:
from sklearn.feature_extraction.text import CountVectorizer

docs = ['machine learning is great',
           'natural language processing is a complex field',
           'natural language processing is used in machine learning']

v = CountVectorizer(binary=True, token_pattern='\\w+')
o = v.fit_transform(docs)

cols = sorted(v.vocabulary_.keys())
df = pd.DataFrame(data=o.toarray(), columns=cols, index=docs)
display(df.head())

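| | a | complex | field | great | in | is | language | learning | machine | natural | processing | used |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| machine learning is great | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 |
| natural language processing is a complex field | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 |
| natural language processing is used in machine learning | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |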
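
For comparison, here is a rough sketch of the same three documents passed through `TfidfVectorizer`, one of the other classes linked above. It is only meant to show how the output differs from the binary bag of words; the variable names are illustrative, and the rest of this notebook keeps using the binary features.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_docs = ['machine learning is great',
              'natural language processing is a complex field',
              'natural language processing is used in machine learning']

# Same token pattern as the CountVectorizer cell, so one-letter words like "a" are kept.
tfidf = TfidfVectorizer(token_pattern=r'\w+')
weights = tfidf.fit_transform(tfidf_docs)

# Words that occur in every document get a lower inverse document frequency.
idf_is = tfidf.idf_[tfidf.vocabulary_['is']]
idf_great = tfidf.idf_[tfidf.vocabulary_['great']]
# With the default settings each row is scaled to unit Euclidean length.
row_norms = np.linalg.norm(weights.toarray(), axis=1)

print(weights.shape)
print(idf_is, idf_great)
print(row_norms)
```

Instead of 0/1 entries, each cell now holds a weight that is smaller for words appearing in many documents (such as *is*) and larger for rarer ones (such as *great*).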


## Classification

In [26]:
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression, RidgeClassifier

# prepare labels
all_tags = []
for tags in y_train:
    for tag in tags:
        if tag not in all_tags: all_tags.append(tag)

mlb = MultiLabelBinarizer(classes=sorted(all_tags))
y_train = mlb.fit_transform(y_train)
y_test = mlb.transform(y_test)  # reuse the binarizer already fitted on the training tags

# train classifier
classif = OneVsRestClassifier(LogisticRegression())
classif.fit(X_train_mybag_sparse, y_train)

OneVsRestClassifier(estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
          n_jobs=1)
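
To make the label preparation step more concrete, here is a small sketch of what `MultiLabelBinarizer` produces for the five tag lists shown by `train.head()`. The `demo_` names are illustrative and do not touch the notebook's own variables:

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Tag lists of the first five training posts.
demo_tags = [['r'], ['php', 'mysql'], ['c#'], ['javascript', 'jquery'], ['java']]

demo_mlb = MultiLabelBinarizer()
demo_y = demo_mlb.fit_transform(demo_tags)

print(list(demo_mlb.classes_))   # one column per distinct tag, in sorted order
print(demo_y)                    # one row per post, 1 where the post has the tag
print(demo_mlb.inverse_transform(demo_y)[1])
```

Each column of the resulting matrix is a separate yes/no problem, which is what `OneVsRestClassifier` relies on: it fits one binary classifier per tag.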

## Prediction

In [30]:
y_val_predicted_labels_mybag = classif.predict(X_test_mybag_sparse)
y_val_predicted_scores_mybag = classif.decision_function(X_test_mybag_sparse)

y_val_pred_inversed = mlb.inverse_transform(y_val_predicted_labels_mybag)
y_val_inversed = mlb.inverse_transform(y_test)
for i in range(80):
    print('{}\nTitle:\t{}\nTrue labels:\t{}\nPredicted labels:\t{}\n\n'.format(
        i,
        X_test[i],
        ','.join(y_val_inversed[i]),
        ','.join(y_val_pred_inversed[i])
    ))

# good : 28, 30, 31, 33, 34, 38, 46, 60, 70, 71, 72, 74
# bad  : 41, 64, 66, 0, 6, 24, 32, 35

0
Title:	odbc_exec always fail
True labels:	php,sql
Predicted labels:	


1
Title:	access base classes variable within child class
True labels:	javascript
Predicted labels:	


2
Title:	contenttype application json required rails
True labels:	ruby,ruby-on-rails
Predicted labels:	json,ruby-on-rails


3
Title:	sessions sinatra used pass variable
True labels:	ruby,session
Predicted labels:	


4
Title:	getting error type json exist postgresql rake db migrate
True labels:	json,ruby,ruby-on-rails
Predicted labels:	json


5
Title:	library found
True labels:	c++,ios,iphone,xcode
Predicted labels:	


6
Title:	csproj file programmatic adding deleting files
True labels:	c#
Predicted labels:	


7
Title:	typeerror makedirs got unexpected keyword argument exists_ok
True labels:	django,python
Predicted labels:	python


8
Title:	pan div using jquery
True labels:	html,javascript,jquery
Predicted labels:	javascript,jquery


9
Title:	hibernate intermediate advanced tutorials
True labels:	hibernate,java
Pre

In [39]:
def display_predictions(indexes):
    examples_titles = [X_test[i] for i in indexes]
    examples_actual = [y_val_inversed[i] for i in indexes]
    examples_pred = [y_val_pred_inversed[i] for i in indexes]
    df = pd.DataFrame(data={'title':examples_titles, 'True labels':examples_actual, 'Predicted labels':examples_pred})
    df = df[['title','True labels', 'Predicted labels']]
    display(df.head())

good_examples = [28, 30, 31, 33, 34, 38, 46, 60, 70, 71, 72, 74]
bad_examples = [41, 64, 66, 0, 6, 24, 32, 35]
print('Good Examples:')
display_predictions(good_examples)
print('Bad Examples:')
display_predictions(bad_examples)

Good Examples:


| | title | True labels | Predicted labels |
|---|---|---|---|
| 0 | rspec controller tests put update action neste... | (ruby-on-rails,) | (ruby-on-rails,) |
| 1 | ajax data uncaught referenceerror date defined | (ajax, javascript, jquery) | (ajax, javascript, jquery) |
| 2 | lifecycle method fragment execute asynctask | (android, java) | (android, java) |
| 3 | remove html tags format text returns spaces et... | (.net, c#, html) | (c#,) |
| 4 | initialization makes pointer integer without c... | (c, pointers) | (c, pointers) |


Bad Examples:


| | title | True labels | Predicted labels |
|---|---|---|---|
| 0 | hide statusbar uitabbarcontroller enters lands... | (ios, iphone, objective-c) | () |
| 1 | javamathbiginteger cannot cast javalanginteger | (hibernate, java, mysql, sql) | () |
| 2 | cannot convert value type error swift3 | (ios, swift) | () |
| 3 | odbc_exec always fail | (php, sql) | () |
| 4 | csproj file programmatic adding deleting files | (c#,) | () |
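
Many of the bad examples end up with no predicted tag at all. The prediction cell also computes decision scores that are not used afterwards, and they give some insight into this. The sketch below uses a tiny made-up dataset (not the StackOverflow data) to show how the scores relate to the predicted labels. The relaxed threshold of `-0.5` is an arbitrary assumption chosen for illustration, not a tuned value.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Toy binary features and two labels per sample (made up for this sketch).
toy_X = np.array([[1, 0, 0], [1, 1, 0], [0, 1, 0], [0, 1, 1], [0, 0, 1], [1, 0, 1]])
toy_y = np.array([[1, 0], [1, 1], [0, 1], [0, 1], [0, 0], [1, 0]])

toy_clf = OneVsRestClassifier(LogisticRegression())
toy_clf.fit(toy_X, toy_y)

toy_labels = toy_clf.predict(toy_X)
toy_scores = toy_clf.decision_function(toy_X)

# A label is predicted exactly when its decision score is positive.
same_as_predict = np.array_equal(toy_labels, (toy_scores > 0).astype(int))

# A lower cut-off can only add labels, at the cost of more false positives.
relaxed_labels = (toy_scores > -0.5).astype(int)

print(same_as_predict)
print(toy_labels.sum(), relaxed_labels.sum())
```

On the real data, inspecting `y_val_predicted_scores_mybag` for the rows with empty predictions would show which tags came closest to the cut-off.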
