# Linear models to predict StackOverflow tags

The data comes from the Kaggle dataset 'StackSample: 10% of Stack Overflow Q&A': https://www.kaggle.com/stackoverflow/stacksample/data

This is a dataset with the text of 10% of questions and answers from the Stack Overflow programming Q&A website.

It is organised as three tables: Questions, Answers and Tags. Only two are used here:
+ Questions contains the title, body, creation date, closed date (if applicable), score, and owner ID for all non-deleted Stack Overflow questions whose Id is a multiple of 10.
+ Tags contains the tags on each of these questions

The task is to use questions to predict tags. This is a multilabel classification task.

In [1]:
import pandas as pd
import numpy as np
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/charlottefettes/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [2]:
questions = pd.read_csv('Questions.csv', encoding='latin-1')

tags = pd.read_csv('Tags.csv')
tags["Tag"] = tags["Tag"].astype(str)
tags = tags.groupby('Id').agg(lambda x: x.tolist())

In [3]:
data = pd.merge(tags, questions, how='inner', on='Id')
data = data[['Title','Tag']]

In [4]:
from sklearn.model_selection import train_test_split 

train, test = train_test_split(data, train_size = 0.8, random_state=42)

In [5]:
X_train, y_train = train['Title'].values, train['Tag'].values
X_test, y_test = test['Title'].values, test['Tag'].values

## Text preprocessing

### Prepare the text

In [6]:
import re

In [7]:
REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;<>]')  # raw string avoids invalid-escape warnings
BAD_SYMBOLS_RE = re.compile('[^0-9a-z #+_]')
STOPWORDS = set(stopwords.words('english'))

def text_prepare(text):
    text = text.lower() # lowercase text
    text = REPLACE_BY_SPACE_RE.sub(' ', text) # replace REPLACE_BY_SPACE_RE symbols by space in text
    text = BAD_SYMBOLS_RE.sub('', text) # delete symbols which are in BAD_SYMBOLS_RE from text
    text = ' '.join([x for x in text.split() if x and x not in STOPWORDS]) # delete stopwords from text
    return text

In [8]:
def test_text_prepare():
    examples = ["SQL Server - any equivalent of Excel's CHOOSE function?",
                "How to free c++ memory vector<int> * arr?"]
    answers = ["sql server equivalent excels choose function", 
               "free c++ memory vector int arr"]
    for ex, ans in zip(examples, answers):
        if text_prepare(ex) != ans:
            return "Wrong answer for the case: '%s'" % ex
    return 'Basic tests are passed.'

print(test_text_prepare())

Basic tests are passed.


In [9]:
X_train = [text_prepare(x) for x in X_train]
X_test = [text_prepare(x) for x in X_test]

In [10]:
X_train[:3]

['get 0 180 degree onsensorchanged',
 'ria service generated code support partial class',
 'error building packages r 320 latex error creating pdf version']

### Word and tag counts

Find the most common words and tags in the training set.

In [11]:
y_train[:3]

array([list(['android', 'android-layout']),
       list(['wcf-ria-services', 'ado.net-entity-data-model']),
       list(['r', 'package'])], dtype=object)

In [12]:
from collections import Counter

# Dictionary of all tags from train corpus with their counts.
tags_counts = Counter()

# Dictionary of all words from train corpus with their counts.
words_counts = Counter()

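# loop variable renamed so it does not overwrite the `tags` DataFrame loaded earlier
for question_tags in y_train:
    for tag in question_tags:
        tags_counts[tag] += 1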
        
for words in X_train:
    for word in words.split():
        words_counts[word] += 1

In [13]:
most_common_tags = sorted(tags_counts.items(), key=lambda x: x[1], reverse=True)[:3]
most_common_words = sorted(words_counts.items(), key=lambda x: x[1], reverse=True)[:3]

print('Most common tags: ', most_common_tags)
print('Most common words: ', most_common_words)

Most common tags:  [('javascript', 99082), ('java', 92219), ('c#', 80973)]
Most common words:  [('using', 74079), ('file', 42948), ('error', 39977)]


### Transforming text to a vector

A bag-of-words representation will be used to convert the text data to numeric vectors. To do this, the following steps will be taken:
1. Find the N most popular words in the training corpus and enumerate them, creating a dictionary that maps each of these words to an index.
2. For each title in the corpus, create a zero vector of dimension N.
3. For each title in the corpus, iterate over its words that are in the dictionary and increase the corresponding coordinate by 1.
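
The three steps can be traced by hand on a toy corpus. The sketch below is only an illustration: the titles, `N = 3` and the `toy_` names are made up for the example and are not part of the dataset or of the cells that follow.

```python
from collections import Counter

# Hypothetical, already-preprocessed titles (not taken from the dataset).
toy_titles = ["python list sort", "java list sort", "python list"]

# Step 1: count words and number the N most frequent ones.
N = 3
word_counts = Counter(word for title in toy_titles for word in title.split())
toy_words_to_index = {word: i for i, (word, _) in enumerate(word_counts.most_common(N))}

def toy_bag_of_words(title, words_to_index, dict_size):
    # Step 2: start from a zero vector of length N.
    vector = [0] * dict_size
    # Step 3: add 1 to the coordinate of every in-dictionary word.
    for word in title.split():
        if word in words_to_index:
            vector[words_to_index[word]] += 1
    return vector

toy_vectors = [toy_bag_of_words(t, toy_words_to_index, N) for t in toy_titles]
print(toy_words_to_index)
print(toy_vectors)
```

Here 'java' occurs only once, so it falls outside the three-word dictionary and contributes nothing to the vector of the second title.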

In [14]:
DICT_SIZE = 5000
INDEX_TO_WORDS = sorted(words_counts.keys(), key=lambda x: words_counts[x], reverse=True)[:DICT_SIZE]
WORDS_TO_INDEX = {word:i for i, word in enumerate(INDEX_TO_WORDS)}
ALL_WORDS = WORDS_TO_INDEX.keys()

def my_bag_of_words(text, words_to_index, dict_size):

    result_vector = np.zeros(dict_size)

    for word in text.split():
        if word in words_to_index:
            result_vector[words_to_index[word]] += 1
    return result_vector

In [15]:
def test_my_bag_of_words():
    words_to_index = {'hi': 0, 'you': 1, 'me': 2, 'are': 3}
    examples = ['hi how are you']
    answers = [[1, 1, 0, 1]]
    for ex, ans in zip(examples, answers):
        if (my_bag_of_words(ex, words_to_index, 4) != ans).any():
            return "Wrong answer for the case: '%s'" % ex
    return 'Basic tests are passed.'

print(test_my_bag_of_words())

Basic tests are passed.


In [16]:
from scipy import sparse as sp_sparse

In [None]:
X_train_mybag = sp_sparse.vstack([sp_sparse.csr_matrix(my_bag_of_words(text, WORDS_TO_INDEX, DICT_SIZE)) for text in X_train])
X_test_mybag = sp_sparse.vstack([sp_sparse.csr_matrix(my_bag_of_words(text, WORDS_TO_INDEX, DICT_SIZE)) for text in X_test])

print('X_train shape ', X_train_mybag.shape)
print('X_test shape ', X_test_mybag.shape)

The data has been transformed into a sparse representation for efficient storage. The CSR (compressed sparse row) format is used because scikit-learn estimators work efficiently with it.
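
To give a feel for the storage saving, the following sketch compares the memory used by a dense array and by its CSR counterpart. The matrix size and the one-non-zero-per-row pattern are assumptions chosen for illustration; they are far smaller and simpler than the real bag-of-words matrix.

```python
import numpy as np
from scipy import sparse as sp_sparse

# Hypothetical sizes, much smaller than the real data.
n_rows, n_cols = 1000, 5000

# Build a dense matrix with a single non-zero entry per row.
dense = np.zeros((n_rows, n_cols))
dense[np.arange(n_rows), np.arange(n_rows) % n_cols] = 1.0

csr = sp_sparse.csr_matrix(dense)

# Bytes used by the dense array versus the three arrays backing the CSR matrix.
dense_bytes = dense.nbytes
csr_bytes = csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes

print('dense bytes:', dense_bytes)
print('csr bytes:  ', csr_bytes)
```

Because titles are short, almost every coordinate of a bag-of-words vector is zero, which is the situation where CSR pays off.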

### TF-IDF

As an alternative, the bag-of-words framework can be extended to take into account how frequently each word occurs across the corpus. This penalises words that are too frequent and gives a better feature space.

The vectoriser is trained using the train corpus.
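
The penalising effect can be seen in a small hand computation. This is a sketch under stated assumptions: the titles are made up, the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 is assumed to match scikit-learn's default, and the final vector normalisation that `TfidfVectorizer` applies is left out.

```python
import math

# Hypothetical preprocessed titles (not from the dataset).
toy_titles = ["error python file", "error java file", "error python list"]
n_docs = len(toy_titles)

# Document frequency: number of titles containing each word.
doc_freq = {}
for title in toy_titles:
    for word in set(title.split()):
        doc_freq[word] = doc_freq.get(word, 0) + 1

# Smoothed inverse document frequency (assumed formula, see the text above).
idf = {word: math.log((1 + n_docs) / (1 + df)) + 1 for word, df in doc_freq.items()}

def toy_tfidf(title):
    # Term frequency times IDF, without the final vector normalisation.
    words = title.split()
    return {word: words.count(word) * idf[word] for word in set(words)}

# Words from lowest to highest IDF: the most widespread word comes first.
for word in sorted(idf, key=idf.get):
    print(word, round(idf[word], 3))
```

The word 'error' appears in every title and so receives the lowest weight, while 'java' and 'list', which appear once each, receive the highest.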

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
def tfidf_features(X_train, X_test):
    # create TF-IDF vectorizer, fit on train set, transform train and test sets and return the result
    
    tfidf_vectorizer = TfidfVectorizer(min_df=5, max_df=0.9, ngram_range=(1, 2), token_pattern=r'(\S+)')
    
    X_train = tfidf_vectorizer.fit_transform(X_train)
    X_test = tfidf_vectorizer.transform(X_test)
    
    return X_train, X_test, tfidf_vectorizer.vocabulary_

In [None]:
# build the TF-IDF features; 'c++' and 'c#' are looked up in the vocabulary in the next cell as a check
X_train_tfidf, X_test_tfidf, tfidf_vocab = tfidf_features(X_train, X_test)
tfidf_reversed_vocab = {i:word for word,i in tfidf_vocab.items()}

In [None]:
print(tfidf_vocab['c++'])
print(tfidf_vocab['c#'])

## MultiLabel classifier

To deal with questions that have multiple tags, and therefore multilabel predictions, the labels need to be transformed to binary form. The prediction will then be a mask of 0s and 1s.

MultiLabelBinarizer can be used for this.
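
As a minimal sketch of what the binary form looks like, the example below binarises three made-up tag lists. The `toy_` names and the tags are illustrative only.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical tag lists for three questions.
toy_tags = [['python', 'pandas'], ['java'], ['python']]

toy_mlb = MultiLabelBinarizer()
toy_mask = toy_mlb.fit_transform(toy_tags)

print(toy_mlb.classes_)   # column order (sorted tag names)
print(toy_mask)           # one row per question, one column per tag

# inverse_transform maps each 0/1 row back to a tuple of tags.
print(toy_mlb.inverse_transform(toy_mask))
```

Each column corresponds to one tag, so a question with two tags has two 1s in its row.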

In [None]:
from sklearn.preprocessing import MultiLabelBinarizer

In [None]:
mlb = MultiLabelBinarizer(classes=sorted(tags_counts.keys()), sparse_output=True)

# the classes are fixed to the tags seen in the training set, so fit once on train
# and reuse the same mapping for test; y_train and y_test are NumPy arrays, so
# pd.concat cannot be used on them
y_train = mlb.fit_transform(y_train)

# tags that occur only in the test set are ignored (with a warning)
y_test = mlb.transform(y_test)

Now the classifier will be trained. A one-vs-rest approach is used, where k binary classifiers are trained, one per tag.

Logistic regression will be used. It is a simple classifier, not necessarily the best-performing one, but it should do fairly well on text classification.
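
The one-vs-rest idea can be tried on a tiny made-up problem before running it on the full data. In this sketch the feature matrix, the label mask and the `toy_` names are all hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Hypothetical data: 6 samples, 3 features, 2 labels (columns of toy_y).
toy_X = np.array([[1, 0, 0],
                  [1, 1, 0],
                  [0, 1, 0],
                  [0, 1, 1],
                  [0, 0, 1],
                  [1, 0, 1]], dtype=float)
toy_y = np.array([[1, 0],
                  [1, 1],
                  [0, 1],
                  [0, 1],
                  [0, 0],
                  [1, 0]])

# One binary logistic regression is fitted per label column.
toy_ovr = OneVsRestClassifier(LogisticRegression(solver='liblinear'))
toy_ovr.fit(toy_X, toy_y)

toy_pred = toy_ovr.predict(toy_X)
print(len(toy_ovr.estimators_))   # number of fitted binary classifiers
print(toy_pred)                   # 0/1 mask, one column per label
```

Each label gets its own independent classifier, so the cost of training grows with the number of tags.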

In [None]:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression, RidgeClassifier

In [None]:
def train_classifier(X_train, y_train, penalty='l2', C=1.0):
    # Create and fit LogisticRegression wrapped into OneVsRestClassifier.
    lr = LogisticRegression(penalty=penalty, C=C, solver='liblinear')
    ovr = OneVsRestClassifier(lr)
    ovr.fit(X_train, y_train)
    return ovr

In [None]:
#train classifiers for different data transformations: bag-of-words and tf-idf
classifier_mybag = train_classifier(X_train_mybag, y_train)
classifier_tfidf = train_classifier(X_train_tfidf, y_train)

In [None]:
#create predictions for the data - 2 types: labels and scores
y_test_predicted_labels_mybag = classifier_mybag.predict(X_test_mybag)
y_test_predicted_scores_mybag = classifier_mybag.decision_function(X_test_mybag)

y_test_predicted_labels_tfidf = classifier_tfidf.predict(X_test_tfidf)
y_test_predicted_scores_tfidf = classifier_tfidf.decision_function(X_test_tfidf)

In [None]:
#try examples
y_test_pred_inversed = mlb.inverse_transform(y_test_predicted_labels_tfidf)
y_test_inversed = mlb.inverse_transform(y_test)
for i in range(3):
    print('Title:\t{}\nTrue labels:\t{}\nPredicted labels:\t{}\n\n'.format(
        X_test[i],
        ','.join(y_test_inversed[i]),
        ','.join(y_test_pred_inversed[i])
    ))

## Evaluation

Several classification metrics will be used to check performance: accuracy, F1-score, area under the ROC curve, and area under the precision-recall curve.
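
The averaging modes can give quite different numbers when tags are unbalanced. The sketch below works through macro and micro F1 and subset accuracy by hand on made-up masks, assuming the usual definition F1 = 2TP / (2TP + FP + FN). For multilabel input, scikit-learn's `accuracy_score` is the subset accuracy computed at the end.

```python
# Hypothetical true and predicted 0/1 masks: 4 questions, 2 tags.
y_true = [[1, 0], [1, 0], [1, 1], [1, 0]]
y_pred = [[1, 0], [1, 0], [1, 0], [0, 1]]

def counts_for_tag(j):
    # True positives, false positives and false negatives for tag column j.
    tp = sum(t[j] == 1 and p[j] == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t[j] == 0 and p[j] == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t[j] == 1 and p[j] == 0 for t, p in zip(y_true, y_pred))
    return tp, fp, fn

def f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

per_tag = [counts_for_tag(j) for j in range(2)]

# Macro: average the per-tag F1 scores, so every tag counts equally.
macro_f1 = sum(f1(*c) for c in per_tag) / len(per_tag)

# Micro: pool the counts over all tags first, so frequent tags dominate.
micro_f1 = f1(*(sum(col) for col in zip(*per_tag)))

# Subset accuracy: a question counts only if its whole mask is right.
subset_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(macro_f1, micro_f1, subset_accuracy)
```

The rare second tag is never predicted correctly, which pulls the macro average down much further than the micro average.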

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score 
from sklearn.metrics import average_precision_score
from sklearn.metrics import recall_score

In [None]:
def print_evaluation_scores(y_test, predicted):
    print('Accuracy:', accuracy_score(y_test, predicted))
    print('F1-score macro:', f1_score(y_test, predicted, average='macro'))
    print('F1-score micro:', f1_score(y_test, predicted, average='micro'))
    print('F1-score weighted:', f1_score(y_test, predicted, average='weighted'))
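    # average_precision_score summarises the precision-recall curve; it is computed
    # here from hard 0/1 predictions rather than from decision scores
    print('Average precision macro:', average_precision_score(y_test, predicted, average='macro'))
    print('Average precision micro:', average_precision_score(y_test, predicted, average='micro'))
    print('Average precision weighted:', average_precision_score(y_test, predicted, average='weighted'))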

In [None]:
print('Bag-of-words')
print_evaluation_scores(y_test, y_test_predicted_labels_mybag)
print('Tfidf')
print_evaluation_scores(y_test, y_test_predicted_labels_tfidf)

Plot some generalisations of the ROC curve.

In [None]:
# roc_auc is imported from a helper module named metrics, which is not shown here
from metrics import roc_auc
%matplotlib inline

In [None]:
n_classes = len(tags_counts)
roc_auc(y_test, y_test_predicted_scores_mybag, n_classes)

In [None]:
n_classes = len(tags_counts)
roc_auc(y_test, y_test_predicted_scores_tfidf, n_classes)

### Optimisation 

Compare the bag-of-words and TF-IDF approaches on F1-score, experimenting with L1 and L2 regularisation and with different values of C (the inverse regularisation strength).

In [None]:
for penalty in ('l1', 'l2'):
    for C in (0.1, 0.6, 1, 3):
        print('Penalty:', penalty, 'C=', C)
        classifier_mybag = train_classifier(X_train_mybag, y_train, penalty, C)
        classifier_tfidf = train_classifier(X_train_tfidf, y_train, penalty, C)
        y_test_predicted_labels_mybag = classifier_mybag.predict(X_test_mybag)

        y_test_predicted_labels_tfidf = classifier_tfidf.predict(X_test_tfidf)
        print('Bag-of-words')
        print('F1-score weighted:', f1_score(y_test, y_test_predicted_labels_mybag, average='weighted'))
        print('Tfidf')
        print('F1-score weighted:', f1_score(y_test, y_test_predicted_labels_tfidf, average='weighted'))

In [None]:
classifier_best = train_classifier(X_train_, 
                                   y_train, penalty='', C=)

In [None]:
#predictions
y_test_predicted_labels_final = classifier_best.predict(   
    X_test_)

y_test_predicted_scores_final = classifier_best.decision_function(
    X_test_)



In [None]:
print('Evaluation Scores for optimal model')
print_evaluation_scores(y_test, y_test_predicted_labels_final)

## Analysis of most important features (words)

Find the words with the largest positive and negative weights in the logistic regression model.
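
The selection relies on `argsort`, which returns indices ordered from the smallest to the largest value. A minimal sketch with made-up coefficients and a made-up index-to-word map (both hypothetical, and using the top 3 rather than the top 5) shows how the slices pick out each end of the ranking.

```python
import numpy as np

# Hypothetical coefficients for one tag and the matching index-to-word map.
toy_coef = np.array([0.2, -1.5, 3.1, 0.0, -0.4, 2.2, 0.9])
toy_index_to_words = {0: 'file', 1: 'java', 2: 'pointer', 3: 'using',
                      4: 'python', 5: 'malloc', 6: 'struct'}

order = toy_coef.argsort()          # indices from smallest to largest weight

# Last 3 indices, read backwards: largest weights first.
top_positive = [toy_index_to_words[i] for i in order[-1:-4:-1]]
# First 3 indices: most negative weights first.
top_negative = [toy_index_to_words[i] for i in order[:3]]

print(top_positive)
print(top_negative)
```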

In [None]:
def print_words_for_tag(classifier, tag, tags_classes, index_to_words, all_words):
    print('Tag:\t{}'.format(tag))
    
    # Extract the binary estimator fitted for the given tag,
    # then take its feature coefficients (one weight per vocabulary entry).
    estimator = classifier.estimators_[tags_classes.index(tag)]
    coef = estimator.coef_[0]
    
    top_positive_words = [index_to_words[idx] for idx in coef.argsort()[-1:-6:-1]]  # top-5 words sorted by coefficient
    top_negative_words = [index_to_words[idx] for idx in coef.argsort()[:5]]  # bottom-5 words sorted by coefficient
    print('Top positive words:\t{}'.format(', '.join(top_positive_words)))
    print('Top negative words:\t{}\n'.format(', '.join(top_negative_words)))

In [None]:
print_words_for_tag(classifier_tfidf, 'c', mlb.classes, tfidf_reversed_vocab, ALL_WORDS)
print_words_for_tag(classifier_tfidf, 'c++', mlb.classes, tfidf_reversed_vocab, ALL_WORDS)
print_words_for_tag(classifier_tfidf, 'linux', mlb.classes, tfidf_reversed_vocab, ALL_WORDS)