
# Objective

Build a model that automatically predicts tags for a given Stack Exchange question from the text of the question.

![Stack Exchange logo](https://cdn.sstatic.net/Sites/stackoverflow/company/img/logos/se/se-logo.svg?v=d29f0785ebb7)

__Dataset Specs__: Over 85,000 questions

__License__

All Stack Exchange user contributions are licensed under [CC-BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/) with [attribution required](http://blog.stackoverflow.com/2009/06/attribution-required/).

<br>

***

In [None]:
# optional step (if you are working on colab)
# mount Google drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Steps to Follow
1. Loading Data
2. Text Cleaning
3. Merge Tags with Questions
4. Dataset Preparation
5. Feature Engineering using TF-IDF
6. Model Building
    1. Naive Bayes
    2. Logistic Regression
    3. Model Building Summary

7. Final Question Tagging Pipeline



In [1]:
# Importing Required Libraries

# for string matching
import re

# for handling data
import pandas as pd

# for numerical computing
import numpy as np

# for handling html data
from bs4 import BeautifulSoup

# for NLP related tasks
import spacy
nlp=spacy.load('en_core_web_sm',disable=["tagger", "parser","ner"])

pd.set_option('display.max_colwidth', 200)

# Loading Data

In [6]:
!ls /content

ProjectMultilabeltextclassification-201029-191838.zip  sample_data


In [7]:
!unzip /content/ProjectMultilabeltextclassification-201029-191838.zip

Archive:  /content/ProjectMultilabeltextclassification-201029-191838.zip
  inflating: Project_ Auto tagging stack exchange questions.ipynb  
  inflating: Tags.csv                
  inflating: Questions.csv           


In [8]:
# load questions
questions_df = pd.read_csv('/content/Questions.csv',encoding='latin-1')
print('Shape=>',questions_df.shape)
questions_df.head()

Shape=> (85085, 6)


Unnamed: 0,Id,OwnerUserId,CreationDate,Score,Title,Body
0,6,5.0,2010-07-19T19:14:44Z,272,The Two Cultures: statistics vs. machine learning?,"<p>Last year, I read a blog post from <a href=""http://anyall.org/"">Brendan O'Connor</a> entitled <a href=""http://anyall.org/blog/2008/12/statistics-vs-machine-learning-fight/"">""Statistics vs. Mach..."
1,21,59.0,2010-07-19T19:24:36Z,4,Forecasting demographic census,<p>What are some of the ways to forecast demographic census with some validation and calibration techniques?</p>\n\n<p>Some of the concerns:</p>\n\n<ul>\n<li>Census blocks vary in sizes as rural\n...
2,22,66.0,2010-07-19T19:25:39Z,208,Bayesian and frequentist reasoning in plain English,<p>How would you describe in plain English the characteristics that distinguish Bayesian from Frequentist reasoning?</p>\n
3,31,13.0,2010-07-19T19:28:44Z,138,What is the meaning of p values and t values in statistical tests?,"<p>After taking a statistics course and then trying to help fellow students, I noticed one subject that inspires much head-desk banging is interpreting the results of statistical hypothesis tests...."
4,36,8.0,2010-07-19T19:31:47Z,58,Examples for teaching: Correlation does not mean causation,"<p>There is an old saying: ""Correlation does not mean causation"". When I teach, I tend to use the following standard examples to illustrate this point:</p>\n\n<ol>\n<li>number of storks and birth ..."


1. Id: Question ID
2. OwnerUserId: User ID
3. CreationDate: Date of posting question
4. Score: Net vote score of the question (upvotes minus downvotes)
5. Title: Title of the question
6. Body: Text body of the question

In [9]:
# load tags
tags_df = pd.read_csv('/content/Tags.csv')
print('Shape=>',tags_df.shape)
tags_df.head()

Shape=> (244228, 2)


Unnamed: 0,Id,Tag
0,1,bayesian
1,1,prior
2,1,elicitation
3,2,distributions
4,2,normality


# Text Cleaning

Let's define a function to clean the text data.

In [10]:
def cleaner(text):

  # strip HTML tags (the parser is named explicitly to avoid a bs4 warning)
  text = BeautifulSoup(text, "html.parser").get_text()

  # keep only alphabetic characters
  text = re.sub(r"[^a-zA-Z]", " ", text)

  # convert text to lower case
  text = text.lower()

  # collapse runs of whitespace into a single space
  text = re.sub(r"\s+", " ", text)

  # creating doc object
  doc = nlp(text)

  # remove stopwords and lemmatize the text
  tokens = [token.lemma_ for token in doc if not token.is_stop]

  return " ".join(tokens)

In [11]:
# Pre-processing Questions
questions_df['cleaned_text'] = questions_df['Body'].apply(cleaner)



In [12]:
questions_df['Body'][1]

"<p>What are some of the ways to forecast demographic census with some validation and calibration techniques?</p>\n\n<p>Some of the concerns:</p>\n\n<ul>\n<li>Census blocks vary in sizes as rural\nareas are a lot larger than condensed\nurban areas. Is there a need to account for the area size difference?</li>\n<li>if let's say I have census data\ndating back to 4 - 5 census periods,\nhow far can i forecast it into the\nfuture?</li>\n<li>if some of the census zone change\nlightly in boundaries, how can i\naccount for that change?</li>\n<li>What are the methods to validate\ncensus forecasts? for example, if i\nhave data for existing 5 census\nperiods, should I model the first 3\nand test it on the latter two? or is\nthere another way?</li>\n<li>what's the state of practice in\nforecasting census data, and what are\nsome of the state of the art methods?</li>\n</ul>\n"

In [13]:
questions_df['cleaned_text'][1]

'ways forecast demographic census validation calibration techniques concerns census blocks vary sizes rural areas lot larger condensed urban areas need account area size difference let s census data dating census periods far forecast future census zone change lightly boundaries account change methods validate census forecasts example data existing census periods model test way s state practice forecasting census data state art methods'
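
To make the regex steps of `cleaner` concrete, here is a minimal sketch that isolates them from the HTML stripping and the spaCy pass. The helper name `normalize_letters` and the sample sentence are illustrative assumptions, not part of the notebook.

```python
import re

def normalize_letters(text):
    # hypothetical helper: mirrors only the regex and lowercase steps of cleaner()
    text = re.sub(r"[^a-zA-Z]", " ", text)   # every non-letter becomes a space
    text = text.lower()
    return re.sub(r"\s+", " ", text)          # collapse runs of whitespace

sample = "What's the p-value for n=30?"
print(normalize_letters(sample))
```

Digits and punctuation are dropped entirely, which is why fragments such as a lone `s` (left over from contractions like "let's") appear in the cleaned text above.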

# Merge Tags with Questions

Let's now explore the tags data.

In [14]:
tags_df.head()

Unnamed: 0,Id,Tag
0,1,bayesian
1,1,prior
2,1,elicitation
3,2,distributions
4,2,normality


In [15]:
# count of unique tags
len(tags_df['Tag'].unique())

1315

In [16]:
tags_df['Tag'].value_counts()

Tag,count
r,13236
regression,10959
machine-learning,6089
time-series,5559
probability,4217
...,...
fmincon,1
netflix-prize,1
american-community-survey,1
propensity,1


In [17]:
# remove "-" from the tags
tags_df['Tag']= tags_df['Tag'].apply(lambda x:re.sub("-"," ",x))

In [18]:
# group tags Id wise
tags_df = tags_df.groupby('Id').apply(lambda x:x['Tag'].values).reset_index(name='tags')
tags_df.head()

Unnamed: 0,Id,tags
0,1,"[bayesian, prior, elicitation]"
1,2,"[distributions, normality]"
2,3,"[software, open source]"
3,4,"[distributions, statistical significance]"
4,6,[machine learning]
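
The grouping step collects all tags that share a question `Id` into one list. A plain-Python sketch of the same idea, assuming the rows are available as `(Id, Tag)` tuples (the `rows` list below is copied from the first rows of the tags table, and `tags_by_id` is a name chosen for illustration):

```python
from collections import defaultdict

# illustrative (Id, Tag) rows taken from the head of the tags table
rows = [(1, "bayesian"), (1, "prior"), (1, "elicitation"),
        (2, "distributions"), (2, "normality")]

tags_by_id = defaultdict(list)
for question_id, tag in rows:
    # append each tag to the list kept for its question
    tags_by_id[question_id].append(tag)

print(dict(tags_by_id))
```

This is a dictionary-based stand-in for the pandas `groupby`, not the notebook's own code; it only shows the shape of the result, one list of tags per question.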


In [19]:
# merge tags and questions
df = pd.merge(questions_df,tags_df,how='inner',on='Id')

In [None]:
df = df[['Id','Body','cleaned_text','tags']]
print('Shape=>',df.shape)
df.head()

Shape=> (85085, 4)


Unnamed: 0,Id,Body,cleaned_text,tags
0,6,"<p>Last year, I read a blog post from <a href=""http://anyall.org/"">Brendan O'Connor</a> entitled <a href=""http://anyall.org/blog/2008/12/statistics-vs-machine-learning-fight/"">""Statistics vs. Mach...",year read blog post brendan o connor entitle statistic vs machine learn fight discuss difference field andrew gelman respond favorably simon blomberg r s fortune package paraphrase provocatively m...,[machine learning]
1,21,<p>What are some of the ways to forecast demographic census with some validation and calibration techniques?</p>\n\n<p>Some of the concerns:</p>\n\n<ul>\n<li>Census blocks vary in sizes as rural\n...,way forecast demographic census validation calibration technique concern census block vary size rural area lot large condense urban area need account area size difference let s census datum date c...,"[forecasting, population, census]"
2,22,<p>How would you describe in plain English the characteristics that distinguish Bayesian from Frequentist reasoning?</p>\n,describe plain english characteristic distinguish bayesian frequentist reason,"[bayesian, frequentist]"
3,31,"<p>After taking a statistics course and then trying to help fellow students, I noticed one subject that inspires much head-desk banging is interpreting the results of statistical hypothesis tests....",take statistic course try help fellow student notice subject inspire head desk bang interpret result statistical hypothesis test student easily learn perform calculation require give test hang int...,"[hypothesis testing, t test, p value, interpretation, intuition]"
4,36,"<p>There is an old saying: ""Correlation does not mean causation"". When I teach, I tend to use the following standard examples to illustrate this point:</p>\n\n<ol>\n<li>number of storks and birth ...",old say correlation mean causation teach tend use follow standard example illustrate point numb stork birth rate denmark numb priest america alcoholism start th century note strong correlation num...,"[correlation, teaching]"


There are over 85,000 unique questions and over 1,300 unique tags.

# Dataset Preparation

In [20]:
# check frequency of occurrence of each tag
freq = {}
for tags in df['tags']:
  for tag in tags:
    # start from 0 the first time a tag is seen
    freq[tag] = freq.get(tag, 0) + 1

Let's find out the most frequent tags.

In [21]:
# sort the dictionary in descending order
freq = dict(sorted(freq.items(), key=lambda x:x[1],reverse=True))

In [22]:
freq.items()

dict_items([('r', 13236), ('regression', 10959), ('machine learning', 6089), ('time series', 5559), ('probability', 4217), ('hypothesis testing', 3869), ('self study', 3732), ('distributions', 3501), ('logistic', 3316), ('classification', 2881), ('correlation', 2871), ('statistical significance', 2666), ('bayesian', 2656), ('anova', 2505), ('normal distribution', 2181), ('multiple regression', 2054), ('mixed model', 1998), ('clustering', 1952), ('neural networks', 1897), ('mathematical statistics', 1888), ('confidence interval', 1776), ('categorical data', 1703), ('generalized linear model', 1614), ('variance', 1576), ('data visualization', 1549), ('estimation', 1533), ('forecasting', 1422), ('t test', 1418), ('pca', 1395), ('sampling', 1363), ('cross validation', 1344), ('repeated measures', 1335), ('spss', 1296), ('svm', 1283), ('chi squared', 1261), ('maximum likelihood', 1209), ('predictive models', 1189), ('multivariate analysis', 1116), ('survival', 1081), ('references', 1076), (

In [23]:
# Top 10 most frequent tags
common_tags = list(freq.keys())[:10]
common_tags

['r',
 'regression',
 'machine learning',
 'time series',
 'probability',
 'hypothesis testing',
 'self study',
 'distributions',
 'logistic',
 'classification']
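
The counting, sorting, and slicing steps above can also be expressed with `collections.Counter`. The sketch below is an alternative under the assumption that the tags are available as a list of lists; `tag_lists` holds made-up sample data and `tag_counts` is an illustrative name.

```python
from collections import Counter

# made-up tag lists; in the notebook these would come from df['tags']
tag_lists = [["r", "time series"], ["regression", "r"],
             ["r"], ["regression", "probability"]]

# count every tag across all questions
tag_counts = Counter(tag for tags in tag_lists for tag in tags)

# most_common(n) returns the n most frequent (tag, count) pairs
top_two = [tag for tag, _ in tag_counts.most_common(2)]
print(top_two)
```

With the real data, `most_common(10)` would play the role of sorting the dictionary and taking the first ten keys.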

We will keep only those questions that carry at least two of the above 10 tags, and for each of them we retain only the tags that belong to this top-10 list.

In [24]:
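x=[]
y=[]

for i in range(len(df['tags'])):

  # keep only the tags that are among the 10 most frequent
  temp=[]
  for j in df['tags'][i]:
    if j in common_tags:
      temp.append(j)

  # keep the question only if more than one common tag remains
  if(len(temp)>1):
    x.append(df['cleaned_text'][i])
    y.append(temp)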

In [25]:
# number of questions left
len(x)

11106
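
The effect of the `len(temp)>1` condition is easy to miss, so here is a small sketch on made-up data. The names `common`, `questions`, and `kept` are assumptions for illustration only.

```python
common = {"r", "regression", "time series"}   # stand-in for common_tags

# made-up (text, tags) pairs
questions = [
    ("text a", ["r", "time series", "arima"]),
    ("text b", ["regression", "anova"]),
    ("text c", ["pca"]),
]

kept = []
for text, tags in questions:
    hits = [t for t in tags if t in common]
    if len(hits) > 1:            # same condition as in the loop above
        kept.append((text, hits))

print(kept)
```

Only "text a" survives: "text b" has a single common tag and "text c" has none, so both are dropped. This is why roughly 11,000 of the 85,000 questions remain.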

In [26]:
x[:5]

['recently started working tuberculosis clinic meet periodically discuss number tb cases currently treating number tests administered etc d like start modeling counts guessing unusual unfortunately ve little training time series exposure models continuous data stock prices large numbers counts influenza deal cases month mean median var distributed like image lost mists time image eaten grue ve found articles address models like d greatly appreciate hearing suggestions approaches r packages use implement approaches edit mbq s answer forced think carefully m asking got hung monthly counts lost actual focus question d like know fairly visible decline onward reflect downward trend overall number cases looks like number cases monthly reflects stable process maybe seasonality overall stable present looks like process changing overall number cases declining monthly counts wobble randomness seasonality test s real change process identify decline use trend seasonality estimate number cases upco

In [27]:
y[:5]

[['r', 'time series'],
 ['regression', 'distributions'],
 ['distributions', 'probability', 'hypothesis testing'],
 ['hypothesis testing', 'self study'],
 ['r', 'regression', 'time series']]

In [28]:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()

# Getting Labels
y = mlb.fit_transform(y)
y.shape

(11106, 10)

In [29]:
y[0,:]

array([0, 0, 0, 0, 0, 0, 1, 0, 0, 1])

In [30]:
mlb.classes_

array(['classification', 'distributions', 'hypothesis testing',
       'logistic', 'machine learning', 'probability', 'r', 'regression',
       'self study', 'time series'], dtype=object)
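
`MultiLabelBinarizer` orders its classes alphabetically and marks each tag of a question with a 1. The sketch below reproduces that encoding in plain Python for the first label set, `['r', 'time series']`; the helper `multi_hot` is a hypothetical name, not a scikit-learn function.

```python
classes = sorted(["r", "regression", "machine learning", "time series",
                  "probability", "hypothesis testing", "self study",
                  "distributions", "logistic", "classification"])

def multi_hot(tags, classes):
    # hypothetical helper: 1 where the class is among the question's tags
    present = set(tags)
    return [1 if c in present else 0 for c in classes]

print(multi_hot(["r", "time series"], classes))
```

The positions of the ones (index 6 for `r`, index 9 for `time series`) match the row `y[0,:]` shown above.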

We can now split the dataset into training set and validation set.

In [31]:
from sklearn.model_selection import train_test_split
x_tr,x_val,y_tr,y_val=train_test_split(x, y, test_size=0.2, random_state=0,shuffle=True)

In [32]:
print('x_tr:',len(x_tr),'y_tr:',len(y_tr))
print('x_val:',len(x_val),'y_val:',len(y_val))

x_tr: 8884 y_tr: 8884
x_val: 2222 y_val: 2222
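
The split sizes follow from simple arithmetic. A quick check, assuming the validation count is rounded up (which is consistent with the sizes printed above):

```python
import math

n_samples = 11106
test_size = 0.2

# validation count rounded up, remainder goes to training
n_val = math.ceil(test_size * n_samples)
n_train = n_samples - n_val
print(n_train, n_val)
```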


# Feature Engineering using TF-IDF

In [33]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [34]:
# initialize TFIDF
word_vectorizer = TfidfVectorizer(max_features=5000)

In [35]:
# Fitting Vectorizer on Train set
word_vectorizer.fit(x_tr)

In [36]:
# create TF-IDF vectors for Train Set
train_word_features = word_vectorizer.transform(x_tr)

In [37]:
train_word_features

<8884x5000 sparse matrix of type '<class 'numpy.float64'>'
	with 403799 stored elements in Compressed Sparse Row format>

In [38]:
# create TF-IDF vectors for Validation Set
test_word_features = word_vectorizer.transform(x_val)
test_word_features

<2222x5000 sparse matrix of type '<class 'numpy.float64'>'
	with 98503 stored elements in Compressed Sparse Row format>
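
For readers who want to see what a TF-IDF weight is, here is a minimal from-scratch sketch. It assumes whitespace tokenization and uses raw term counts, the smoothed idf `ln((1 + n) / (1 + df)) + 1`, and L2 normalization, which correspond to scikit-learn's default settings. The three tiny documents are made up, and the vocabulary cap (`max_features`) and scikit-learn's own tokenizer are not modeled.

```python
import math
from collections import Counter

# made-up corpus of already-cleaned documents
docs = ["census forecast census", "forecast regression", "regression model"]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
n_docs = len(tokenized)

# document frequency: number of documents containing each term
doc_freq = Counter(w for doc in tokenized for w in set(doc))

# smoothed inverse document frequency
idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocab}

def tfidf_vector(tokens):
    # raw count times idf, then scale the vector to unit L2 length
    counts = Counter(tokens)
    raw = [counts[w] * idf[w] for w in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw] if norm else raw

vectors = [tfidf_vector(doc) for doc in tokenized]
print(vocab)
print([round(v, 3) for v in vectors[0]])
```

Terms that are frequent in a document but rare across the corpus (here `census` in the first document) receive the largest weights, and terms absent from a document get 0, which is why the real matrices above are stored in sparse format.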

# Model Building

## Naive Bayes

In [39]:
# Importing for modeling
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import f1_score

In [40]:
# Defining Model
nb_model=OneVsRestClassifier(MultinomialNB())

In [41]:
# Training Model
nb_model.fit(train_word_features,y_tr)

In [42]:
# Make predictions for train set
train_pred_nb=nb_model.predict_proba(train_word_features)

In [43]:
train_pred_nb[:5]

array([[0.01544179, 0.03526473, 0.33214277, 0.02112472, 0.0187124 ,
        0.02357314, 0.86165451, 0.10499437, 0.06557019, 0.94539406],
       [0.00824331, 0.00947724, 0.02731332, 0.09658217, 0.0183064 ,
        0.01234313, 0.52049994, 0.73306767, 0.01271752, 0.469521  ],
       [0.06337962, 0.03257936, 0.05316659, 0.31364521, 0.09487311,
        0.03441643, 0.37576543, 0.59728314, 0.05801358, 0.03665974],
       [0.02109324, 0.02021033, 0.06908385, 0.03041695, 0.07338991,
        0.02021796, 0.55730314, 0.33654205, 0.06710397, 0.63118261],
       [0.00599055, 0.00151706, 0.00399711, 0.00723145, 0.0377916 ,
        0.00173069, 0.94674773, 0.14018032, 0.003072  , 0.994584  ]])

The predictions are in terms of probabilities for each of the 10 tags. Hence we need to have a threshold value to convert these probabilities to 0 or 1.

Let's specify a set of candidate threshold values. We will select the threshold value that performs the best for the train set.

In [44]:
# Function for converting probabilities into classes or tags based on a threshold value
def classify(pred_prob,threshold):
  y_pred_seq = []

  for i in pred_prob:
    temp=[]
    for j in i:
      if j>=threshold:
        temp.append(1)
      else:
        temp.append(0)
    y_pred_seq.append(temp)

  return y_pred_seq
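
The same thresholding can be written as a nested list comprehension. The sketch below is a compact variant, not a replacement for `classify`; `classify_rows` is a hypothetical name, and the input row is the first row of the Naive Bayes train probabilities printed above.

```python
def classify_rows(pred_prob, threshold):
    # hypothetical compact variant of classify():
    # 1 when the probability reaches the threshold, else 0
    return [[int(p >= threshold) for p in row] for row in pred_prob]

# first row of the Naive Bayes train probabilities
row = [0.01544179, 0.03526473, 0.33214277, 0.02112472, 0.0187124,
       0.02357314, 0.86165451, 0.10499437, 0.06557019, 0.94539406]

print(classify_rows([row], 0.26))
```

With a threshold of 0.26, three probabilities in this row (0.332, 0.862, and 0.945) are converted to 1.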

In [45]:
# Function for finding optimum value of threshold
def optimum_threshold(actual,pred_prob):
  #define candidate threshold values
  thresholds  = np.arange(0,0.5,0.01)

  score=[]
  for value in thresholds:
    # Getting classes for each threshold
    pred_classes= classify(pred_prob,value)
    # Getting F1-score for every threshold
    score.append(f1_score(actual,pred_classes,average="weighted"))

  return thresholds[score.index(max(score))]

In [46]:
# Finding Optimum value
print("Optimal threshold=>",optimum_threshold(y_tr,train_pred_nb))

Optimal threshold=> 0.28
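
The threshold search ranks candidates by the weighted F1-score, that is, the per-tag F1 averaged with weights equal to each tag's number of true occurrences (its support). A from-scratch sketch of that metric on made-up labels, with `weighted_f1` as a hypothetical helper name:

```python
def weighted_f1(y_true, y_pred):
    # per-label F1, averaged with weights equal to each label's support
    n_labels = len(y_true[0])
    total_support = 0
    weighted_sum = 0.0
    for k in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[k] == 1 and p[k] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[k] == 0 and p[k] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[k] == 1 and p[k] == 0)
        support = tp + fn
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        weighted_sum += support * f1
        total_support += support
    return weighted_sum / total_support if total_support else 0.0

# made-up labels: 3 questions, 2 tags
y_true = [[1, 0], [1, 1], [0, 1]]
y_pred = [[1, 0], [0, 1], [0, 1]]
print(round(weighted_f1(y_true, y_pred), 4))
```

In this toy case the first tag has F1 = 2/3 and the second has F1 = 1, each with support 2, so the weighted average is 5/6. Frequent tags such as `r` and `regression` therefore dominate the score used to pick the threshold.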


In [47]:
# Getting classes using a threshold of 0.26 (the search above reported 0.28)
train_pred_nb_class=classify(train_pred_nb,0.26)

In [48]:
train_pred_nb_class[:5]

[[0, 0, 1, 0, 0, 0, 1, 0, 0, 1],
 [0, 0, 0, 0, 0, 0, 1, 1, 0, 1],
 [0, 0, 0, 1, 0, 0, 1, 1, 0, 0],
 [0, 0, 0, 0, 0, 0, 1, 1, 0, 1],
 [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]]

In [49]:
mlb.inverse_transform(np.array(train_pred_nb_class[:5]))

[('hypothesis testing', 'r', 'time series'),
 ('r', 'regression', 'time series'),
 ('logistic', 'r', 'regression'),
 ('r', 'regression', 'time series'),
 ('r', 'time series')]

In [50]:
# Evaluating on Training Set
print("F1-score on Train Set:",f1_score(y_tr,train_pred_nb_class,average="weighted"))

F1-score on Train Set: 0.7465320319440658


In [51]:
# Make Predictions on Validation Set
val_pred_nb=nb_model.predict_proba(test_word_features)

# Getting Classes
val_pred_nb_class=classify(val_pred_nb,0.26)

# Evaluating on Validation Set
print("F1-score on Validation Set:",f1_score(y_val,val_pred_nb_class,average="weighted"))

F1-score on Validation Set: 0.6783940007095779


## Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
# Defining Model
lr_model=OneVsRestClassifier(LogisticRegression())

# Training Model
lr_model.fit(train_word_features,y_tr)

OneVsRestClassifier(estimator=LogisticRegression(C=1.0, class_weight=None,
                                                 dual=False, fit_intercept=True,
                                                 intercept_scaling=1,
                                                 l1_ratio=None, max_iter=100,
                                                 multi_class='auto',
                                                 n_jobs=None, penalty='l2',
                                                 random_state=None,
                                                 solver='lbfgs', tol=0.0001,
                                                 verbose=0, warm_start=False),
                    n_jobs=None)

In [None]:
# Make Predictions on Train Set
train_pred_lr=lr_model.predict_proba(train_word_features)

In [None]:
train_pred_lr[:5]

array([[0.02628422, 0.03802301, 0.32864872, 0.0283266 , 0.04047505,
        0.0257426 , 0.7458225 , 0.14566368, 0.06306098, 0.68532804],
       [0.0205731 , 0.02353522, 0.06335822, 0.09140213, 0.03628546,
        0.01972218, 0.49634887, 0.89388967, 0.04181783, 0.56007417],
       [0.0984721 , 0.02163289, 0.03923887, 0.2188406 , 0.30334228,
        0.02699562, 0.52203059, 0.73301885, 0.06461424, 0.03406334],
       [0.01823237, 0.01210113, 0.35977266, 0.02255171, 0.07155586,
        0.00912383, 0.5446768 , 0.37655145, 0.07385593, 0.59889956],
       [0.02696836, 0.00531752, 0.01962782, 0.02888206, 0.07182421,
        0.00483854, 0.90914529, 0.19631622, 0.01220511, 0.97522508]])

In [None]:
# Finding Optimum value
print("Optimal threshold=>",optimum_threshold(y_tr,train_pred_lr))

Optimal threshold=> 0.34


In [None]:
# Getting classes using optimum threshold
train_pred_lr_class=classify(train_pred_lr,0.34)
train_pred_lr_class[:5]

[[0, 0, 0, 0, 0, 0, 1, 0, 0, 1],
 [0, 0, 0, 0, 0, 0, 1, 1, 0, 1],
 [0, 0, 0, 0, 0, 0, 1, 1, 0, 0],
 [0, 0, 1, 0, 0, 0, 1, 1, 0, 1],
 [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]]

In [None]:
# Evaluating on Training Set
print("F1-score on Train Set:",f1_score(y_tr,train_pred_lr_class,average="weighted"))

F1-score on Train Set: 0.8219911260031991


In [None]:
# Make Predictions on Validation Set
val_pred_lr=lr_model.predict_proba(test_word_features)

# Getting Classes
val_pred_lr_class=classify(val_pred_lr,0.34)

# Evaluating on Validation Set
print("F1-score on Validation Set:",f1_score(y_val,val_pred_lr_class,average="weighted"))

F1-score on Validation Set: 0.7561692360643565


## Model Building Summary
|        Model        | Train F1 (weighted) | Validation F1 (weighted) |
|:-------------------:|:-------------------:|:------------------------:|
|     Naive Bayes     |        0.7465       |          0.6784          |
| Logistic Regression |        0.8220       |          0.7562          |

Logistic Regression outperforms Naive Bayes on both the train and validation sets, so it is the model used in the final pipeline.

# Final Question Tagging Pipeline

In [None]:
def tagging(question):
  # Text Cleaning
  cleaned_question=cleaner(question)

  # Feature Engineering
  vector=word_vectorizer.transform([cleaned_question])

  # Predicting Probabilities
  pred_prob=lr_model.predict_proba(vector)

  # Converting Probabilities into classes
  pred_class=classify(pred_prob,0.34)

  return mlb.inverse_transform(np.array(pred_class))

<font size=4>**Sample Question:**</font>
<p>I'm using SVM classification (Matlab) within my research works, and I want to know:  </p><ol><li>The advantages and disadvantages of each training algorithm, i.e., SMO, LS and QP<br></li><li>In general case, what is the suitable algorithm?<br></li><li>If the choice of one of them can impact the classification performance?<br></li><li>If the are any relationship between the choice of training method (or algorithm) and the choice of the kernel function.</li></ol>

In [None]:
tagging("<p>I'm using SVM classification (Matlab) within my research works, and I want to know:  </p><ol><li>The advantages and disadvantages of each training algorithm, i.e., SMO, LS and QP<br></li><li>In general case, what is the suitable algorithm?<br></li><li>If the choice of one of them can impact the classification performance?<br></li><li>If the are any relationship between the choice of training method (or algorithm) and the choice of the kernel function.</li></ol>")

[('classification', 'machine learning')]
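
The last two steps of `tagging` (thresholding and mapping back to tag names) can be sketched without the fitted objects. The probabilities below are made up for illustration and are not actual model output; `decode_tags` is a hypothetical helper, and `classes` copies the order of `mlb.classes_`.

```python
classes = ["classification", "distributions", "hypothesis testing",
           "logistic", "machine learning", "probability", "r",
           "regression", "self study", "time series"]

def decode_tags(prob_row, threshold, classes):
    # hypothetical helper: keep the tags whose probability reaches the threshold
    return tuple(c for c, p in zip(classes, prob_row) if p >= threshold)

# made-up probabilities, NOT produced by the trained model
probs = [0.61, 0.05, 0.04, 0.03, 0.47, 0.02, 0.10, 0.08, 0.06, 0.01]
print(decode_tags(probs, 0.34, classes))
```

Any tag whose probability falls below 0.34 is left out, so a question can receive fewer than two tags, or none at all, at prediction time.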