1. Load Data and Import Libraries
2. Text Cleaning
3. Merge Tags with Questions
4. Dataset Preparation
5. Text Representation
6. Model Building
    1. Define Model Architecture
    2. Train the Model
7. Model Predictions
8. Model Evaluation


# Load Data and Import Libraries

In [66]:
import re

# for reading data
import pandas as pd

# for handling html data
from bs4 import BeautifulSoup

# for visualization
import matplotlib.pyplot as plt

pd.set_option('display.max_colwidth', 200)

In [67]:
# load the stackoverflow questions dataset
questions_df = pd.read_csv('Questions.csv',encoding='latin-1')

# load the tags dataset
tags_df = pd.read_csv('Tags.csv')

In [68]:
#print first 5 rows
questions_df.head()

Unnamed: 0,Id,OwnerUserId,CreationDate,Score,Title,Body
0,6,5.0,2010-07-19T19:14:44Z,272,The Two Cultures: statistics vs. machine learning?,"<p>Last year, I read a blog post from <a href=""http://anyall.org/"">Brendan O'Connor</a> entitled <a href=""http://anyall.org/blog/2008/12/statistics-vs-machine-learning-fight/"">""Statistics vs. Mach..."
1,21,59.0,2010-07-19T19:24:36Z,4,Forecasting demographic census,<p>What are some of the ways to forecast demographic census with some validation and calibration techniques?</p>\n\n<p>Some of the concerns:</p>\n\n<ul>\n<li>Census blocks vary in sizes as rural\n...
2,22,66.0,2010-07-19T19:25:39Z,208,Bayesian and frequentist reasoning in plain English,<p>How would you describe in plain English the characteristics that distinguish Bayesian from Frequentist reasoning?</p>\n
3,31,13.0,2010-07-19T19:28:44Z,138,What is the meaning of p values and t values in statistical tests?,"<p>After taking a statistics course and then trying to help fellow students, I noticed one subject that inspires much head-desk banging is interpreting the results of statistical hypothesis tests...."
4,36,8.0,2010-07-19T19:31:47Z,58,Examples for teaching: Correlation does not mean causation,"<p>There is an old saying: ""Correlation does not mean causation"". When I teach, I tend to use the following standard examples to illustrate this point:</p>\n\n<ol>\n<li>number of storks and birth ..."


# Text Cleaning

In [69]:
def cleaner(text):

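  # take off html tags (name the parser explicitly so the result does not depend on which parsers are installed)
  text = BeautifulSoup(text, "html.parser").get_text()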

  # fetch alphabetic characters
  text = re.sub("[^a-zA-Z]", " ", text)

  # convert text to lower case
  text = text.lower()

  # split text into tokens to remove whitespaces
  tokens = text.split()

  return " ".join(tokens)

In [70]:
# call preprocessing function
questions_df['cleaned_text'] = questions_df['Body'].apply(cleaner)

In [71]:
questions_df['Body'][1]

"<p>What are some of the ways to forecast demographic census with some validation and calibration techniques?</p>\n\n<p>Some of the concerns:</p>\n\n<ul>\n<li>Census blocks vary in sizes as rural\nareas are a lot larger than condensed\nurban areas. Is there a need to account for the area size difference?</li>\n<li>if let's say I have census data\ndating back to 4 - 5 census periods,\nhow far can i forecast it into the\nfuture?</li>\n<li>if some of the census zone change\nlightly in boundaries, how can i\naccount for that change?</li>\n<li>What are the methods to validate\ncensus forecasts? for example, if i\nhave data for existing 5 census\nperiods, should I model the first 3\nand test it on the latter two? or is\nthere another way?</li>\n<li>what's the state of practice in\nforecasting census data, and what are\nsome of the state of the art methods?</li>\n</ul>\n"

In [72]:
questions_df['cleaned_text'][1]

'what are some of the ways to forecast demographic census with some validation and calibration techniques some of the concerns census blocks vary in sizes as rural areas are a lot larger than condensed urban areas is there a need to account for the area size difference if let s say i have census data dating back to census periods how far can i forecast it into the future if some of the census zone change lightly in boundaries how can i account for that change what are the methods to validate census forecasts for example if i have data for existing census periods should i model the first and test it on the latter two or is there another way what s the state of practice in forecasting census data and what are some of the state of the art methods'
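The effect of the regex step is easiest to see on a short fragment. The sketch below repeats only the non-HTML part of `cleaner` using the standard library; the helper name `normalize` is hypothetical and not part of the notebook. Note that digits and apostrophes are dropped, which is why "4 - 5 census periods" becomes "census periods" and "let's" becomes "let s" in the output above.

```python
import re

def normalize(text):
    # Replace every non-letter (digits, punctuation, newlines) with a space
    text = re.sub("[^a-zA-Z]", " ", text)
    # Lower-case, then split and re-join to collapse runs of whitespace
    return " ".join(text.lower().split())

sample = "if let's say I have census data\ndating back to 4 - 5 census periods,"
print(normalize(sample))
# if let s say i have census data dating back to census periods
```

If numbers carry meaning for the tags (for example "t test" versus "2 sample t test"), the character class could be widened, but that is a design choice the notebook does not make.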

# Merge Tags with Questions

In [73]:
tags_df.head()

Unnamed: 0,Id,Tag
0,1,bayesian
1,1,prior
2,1,elicitation
3,2,distributions
4,2,normality


In [74]:
# count of unique tags
len(tags_df['Tag'].unique())

1315

In [75]:
tags_df['Tag'].value_counts()

Tag,count
r,13236
regression,10959
machine-learning,6089
time-series,5559
probability,4217
...,...
fmincon,1
shapley-value,1
american-community-survey,1
propensity,1


In [76]:
# remove "-" from the tags
tags_df['Tag']= tags_df['Tag'].apply(lambda x:re.sub("-"," ",x))

In [77]:
# group tags Id wise (select the 'Tag' column before applying, so the grouping column is not passed to the function)
tags_df = tags_df.groupby('Id')['Tag'].apply(list).reset_index(name='tags')
tags_df.head()


Unnamed: 0,Id,tags
0,1,"[bayesian, prior, elicitation]"
1,2,"[distributions, normality]"
2,3,"[software, open source]"
3,4,"[distributions, statistical significance]"
4,6,[machine learning]


In [78]:
# merge tags and questions
df = pd.merge(questions_df,tags_df,how='inner',on='Id')

In [79]:
df = df[['Id','Body','cleaned_text','tags']]
df.head()

Unnamed: 0,Id,Body,cleaned_text,tags
0,6,"<p>Last year, I read a blog post from <a href=""http://anyall.org/"">Brendan O'Connor</a> entitled <a href=""http://anyall.org/blog/2008/12/statistics-vs-machine-learning-fight/"">""Statistics vs. Mach...",last year i read a blog post from brendan o connor entitled statistics vs machine learning fight that discussed some of the differences between the two fields andrew gelman responded favorably to ...,[machine learning]
1,21,<p>What are some of the ways to forecast demographic census with some validation and calibration techniques?</p>\n\n<p>Some of the concerns:</p>\n\n<ul>\n<li>Census blocks vary in sizes as rural\n...,what are some of the ways to forecast demographic census with some validation and calibration techniques some of the concerns census blocks vary in sizes as rural areas are a lot larger than conde...,"[forecasting, population, census]"
2,22,<p>How would you describe in plain English the characteristics that distinguish Bayesian from Frequentist reasoning?</p>\n,how would you describe in plain english the characteristics that distinguish bayesian from frequentist reasoning,"[bayesian, frequentist]"
3,31,"<p>After taking a statistics course and then trying to help fellow students, I noticed one subject that inspires much head-desk banging is interpreting the results of statistical hypothesis tests....",after taking a statistics course and then trying to help fellow students i noticed one subject that inspires much head desk banging is interpreting the results of statistical hypothesis tests it s...,"[hypothesis testing, t test, p value, interpretation, intuition]"
4,36,"<p>There is an old saying: ""Correlation does not mean causation"". When I teach, I tend to use the following standard examples to illustrate this point:</p>\n\n<ol>\n<li>number of storks and birth ...",there is an old saying correlation does not mean causation when i teach i tend to use the following standard examples to illustrate this point number of storks and birth rate in denmark number of ...,"[correlation, teaching]"


# Dataset Preparation

In [80]:
# check frequency of occurrence of each tag
freq= {}
for i in df['tags']:
  for j in i:
    if j in freq.keys():
      freq[j] = freq[j] + 1
    else:
      freq[j] = 1

In [81]:
# sort the dictionary in descending order
freq = dict(sorted(freq.items(), key=lambda x:x[1],reverse=True))

In [82]:
# Top 10 most frequent tags
common_tags = list(freq.keys())[:10]
common_tags

['r',
 'regression',
 'machine learning',
 'time series',
 'probability',
 'hypothesis testing',
 'self study',
 'distributions',
 'logistic',
 'classification']
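The counting and sorting done above can also be sketched with `collections.Counter`, which combines both steps. The toy `tag_lists` below is a made-up stand-in for `df['tags']`, not data from the notebook.

```python
from collections import Counter

# Toy stand-in for df['tags']: one list of tags per question (assumed data)
tag_lists = [["r", "regression"], ["r", "time series"], ["regression"], ["r"]]

# Count every tag across all questions
freq = Counter(tag for tags in tag_lists for tag in tags)

# most_common(n) returns the n most frequent (tag, count) pairs, highest first
top_two = [tag for tag, _ in freq.most_common(2)]
print(top_two)
# ['r', 'regression']
```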

In [83]:
x=[]
y=[]

for i in range(len(df['tags'])):

  temp=[]
  for j in df['tags'][i]:
    if j in common_tags:
      temp.append(j)

  if(len(temp)>1):
    x.append(df['cleaned_text'][i])
    y.append(temp)

In [84]:
y[:10]

[['r', 'time series'],
 ['regression', 'distributions'],
 ['distributions', 'probability', 'hypothesis testing'],
 ['hypothesis testing', 'self study'],
 ['r', 'regression', 'time series'],
 ['r', 'time series', 'self study'],
 ['probability', 'hypothesis testing'],
 ['r', 'regression'],
 ['r', 'regression'],
 ['regression', 'logistic']]

In [85]:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()

y = mlb.fit_transform(y)
y.shape

(11106, 10)

In [86]:
y[0,:]

array([0, 0, 0, 0, 0, 0, 1, 0, 0, 1])
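To read this vector, it helps to know that `MultiLabelBinarizer` orders its columns by sorted class name. The following is a minimal pure-Python sketch of the same encoding, not the scikit-learn implementation; the function name `binarize` is hypothetical. It reproduces the row above for the first sample, `['r', 'time series']`.

```python
# The ten common tags, sorted alphabetically as the binarizer would order them
classes = sorted([
    "r", "regression", "machine learning", "time series", "probability",
    "hypothesis testing", "self study", "distributions", "logistic",
    "classification",
])

def binarize(labels, classes):
    # One column per class: 1 if the sample carries that tag, else 0
    present = set(labels)
    return [1 if c in present else 0 for c in classes]

print(classes.index("r"), classes.index("time series"))
# 6 9
print(binarize(["r", "time series"], classes))
# [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
```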

In [87]:
from sklearn.model_selection import train_test_split
x_tr,x_val,y_tr,y_val=train_test_split(x, y, test_size=0.2, random_state=0,shuffle=True)

# Text Representation

In [88]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

#prepare a tokenizer
x_tokenizer = Tokenizer()

x_tokenizer.fit_on_texts(x_tr)

In [89]:
len(x_tokenizer.word_index)

25312

There are around 25,000 tokens in the training dataset. Let's see how many tokens appear at least 3 times in the dataset.

In [90]:
thresh = 3

cnt=0
for key,value in x_tokenizer.word_counts.items():
  if value>=thresh:
    cnt=cnt+1

print(cnt)

12574


Over 12,000 tokens have appeared three times or more in the training set.
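The same frequency cut can be illustrated without Keras. This is a small sketch on made-up sentences (`docs` is assumed data, and the threshold is lowered to 2 so the toy corpus has something to keep); it mirrors the loop over `word_counts` above.

```python
from collections import Counter

# Toy corpus standing in for x_tr (assumed data)
docs = ["the model fits the data", "the data is noisy", "fit the model"]

# Count how often each whitespace-separated token occurs
counts = Counter(word for doc in docs for word in doc.split())

thresh = 2
# Keep only tokens seen at least `thresh` times
frequent = [word for word, c in counts.items() if c >= thresh]
print(len(frequent), sorted(frequent))
# 3 ['data', 'model', 'the']
```

Rare tokens such as "fits" and "fit" fall below the cut; in the notebook they are mapped to the out-of-vocabulary token instead of getting their own index.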

In [91]:
# prepare the tokenizer again
x_tokenizer = Tokenizer(num_words=cnt,oov_token='unk')

#prepare vocabulary
x_tokenizer.fit_on_texts(x_tr)

Now that we have encoded every token to an integer, let's convert the text sequences to integer sequences. After that we will pad the integer sequences to the maximum sequence length, i.e., 100.

In [92]:
# maximum sequence length allowed
max_len = 100

#convert text sequences into integer sequences
x_tr_seq = x_tokenizer.texts_to_sequences(x_tr)
x_val_seq = x_tokenizer.texts_to_sequences(x_val)

#padding up with zero
x_tr_seq = pad_sequences(x_tr_seq,  padding='post', maxlen=max_len)
x_val_seq = pad_sequences(x_val_seq, padding='post', maxlen=max_len)
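What the padding call does to a single sequence can be sketched in plain Python. The helper `pad_post` below is hypothetical and assumes `maxlen > 0`; it mimics `padding='post'` together with the default `truncating='pre'`, so sequences longer than `maxlen` keep their last `maxlen` tokens.

```python
def pad_post(seq, maxlen, value=0):
    # Default truncating='pre': drop tokens from the front when too long
    seq = seq[-maxlen:]
    # padding='post': append the pad value until the length is maxlen
    return seq + [value] * (maxlen - len(seq))

print(pad_post([5, 3, 8], 5))
# [5, 3, 8, 0, 0]
print(pad_post([1, 2, 3, 4, 5, 6, 7], 5))
# [3, 4, 5, 6, 7]
```

With `max_len = 100`, questions longer than 100 tokens therefore lose their opening words; passing `truncating='post'` to `pad_sequences` would keep the beginning instead.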


Since we are padding the sequences with zeros, we must increment the vocabulary size by one.

In [93]:
#no. of unique words
x_voc_size = x_tokenizer.num_words + 1
x_voc_size

12575

# Model Building


In [94]:
from keras.models import Sequential
from keras.layers import Embedding, SimpleRNN, LSTM, GRU, Dense
from keras.callbacks import ModelCheckpoint

In [95]:
def create_simplernn_model(x_voc_size, max_len):
    model = Sequential(name="SimpleRNN_Model") # Naming the model is good practice
    model.add(Embedding(x_voc_size, 50, input_shape=(max_len,), mask_zero=True))
    model.add(SimpleRNN(128, activation='relu'))
    model.add(Dense(128, activation='relu'))
    model.add(Dense(10, activation='sigmoid'))
    return model

def create_lstm_model(x_voc_size, max_len):
    model = Sequential(name="LSTM_Model")
    model.add(Embedding(x_voc_size, 50, input_shape=(max_len,), mask_zero=True))
    model.add(LSTM(128, activation='relu'))
    model.add(Dense(128, activation='relu'))
    model.add(Dense(10, activation='sigmoid'))
    return model

def create_gru_model(x_voc_size, max_len):
    model = Sequential(name="GRU_Model")
    model.add(Embedding(x_voc_size, 50, input_shape=(max_len,), mask_zero=True))
    model.add(GRU(128, activation='relu'))
    model.add(Dense(128, activation='relu'))
    model.add(Dense(10, activation='sigmoid'))
    return model

# Instantiate all three models
simplernn_model = create_simplernn_model(x_voc_size, max_len)
lstm_model = create_lstm_model(x_voc_size, max_len)
gru_model = create_gru_model(x_voc_size, max_len)

models = [simplernn_model, lstm_model, gru_model]



In [96]:
for model in models:
  model.summary()


In [97]:
#define optimizer and loss
for model in models:
  model.compile(optimizer='adam',loss='binary_crossentropy')
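Because each of the 10 sigmoid outputs is an independent yes/no decision, the loss is binary cross-entropy averaged over the tags. The sketch below computes it by hand for one sample; the function and the example values are illustrative assumptions, not Keras internals (Keras additionally averages over the batch).

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    total = 0.0
    for t, p in zip(y_true, y_prob):
        # Clip probabilities so log(0) can never occur
        p = min(max(p, eps), 1 - eps)
        # Penalise log-probability assigned to the wrong outcome
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    # Average over the tags of this sample
    return total / len(y_true)

good = binary_crossentropy([1, 0, 0, 1], [0.9, 0.1, 0.2, 0.6])
bad = binary_crossentropy([1, 0, 0, 1], [0.1, 0.9, 0.8, 0.4])
print(good < bad)
# True
```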

In [98]:
# checkpoint to save best model during training
from tensorflow.keras.callbacks import ModelCheckpoint
mc = ModelCheckpoint("weights.best.keras", monitor='val_loss', verbose=1, save_best_only=True, mode='min')

# Train the Model

In [100]:
import time

histories = {}
best_val_losses = {}
for model in models:
    print(f"\nTraining {model.name}...")
    checkpoint_path = f"{model.name}_best_weights.keras"
    mc = ModelCheckpoint(checkpoint_path, monitor='val_loss', verbose=1, save_best_only=True, mode='min')

    start_time = time.time()
    history = model.fit(
        x_tr_seq, y_tr,
        epochs=10,
        batch_size=128,
        validation_data=(x_val_seq, y_val),
        callbacks=[mc],
        verbose=1
    )
    end_time = time.time()

    histories[model.name] = {
        'history': history,
        'training_time': end_time - start_time
    }

    best_val_loss = min(history.history['val_loss'])
    best_val_losses[model.name] = {
        'loss': best_val_loss,
        'weights': checkpoint_path
    }


Training SimpleRNN_Model...
Epoch 1/10
68/70 ━━━━━━━━━━━━━━━━━━━━ 0s 12ms/step - loss: 0.2226
Epoch 1: val_loss improved from inf to 0.47562, saving model to SimpleRNN_Model_best_weights.keras
70/70 ━━━━━━━━━━━━━━━━━━━━ 1s 16ms/step - loss: 0.2229 - val_loss: 0.4756
Epoch 2/10
65/70 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step - loss: 0.2134
Epoch 2: val_loss did not improve from 0.47562
70/70 ━━━━━━━━━━━━━━━━━━━━ 1s 12ms/step - loss: 0.2130 - val_loss: 0.4758
Epoch 3/10
66/70 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step - loss: 0.1889
Epoch 3: val_loss did not improve from 0.47562
70/70 ━━━━━━━━━━━━━━━━━━━━ 1s 11ms/step - loss: 0.1890 - val_loss: 0.4899
Epoch 4/10
66/70 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step - loss: 0.1719
Epoch 4: val_loss did not improve from 0.47562
70/

# Model Predictions


In [101]:
# Find the model with the lowest validation loss
best_model_name = min(best_val_losses, key=lambda k: best_val_losses[k]['loss'])
best_weights_path = best_val_losses[best_model_name]['weights']
print(f"\nBest model: {best_model_name} with validation loss: {best_val_losses[best_model_name]['loss']}")

# Load the best model structure and weights
if best_model_name == "SimpleRNN_Model":
    best_model = create_simplernn_model(x_voc_size, max_len)
elif best_model_name == "LSTM_Model":
    best_model = create_lstm_model(x_voc_size, max_len)
else:
    best_model = create_gru_model(x_voc_size, max_len)

best_model.load_weights(best_weights_path)
print(f"Loaded best weights into {best_model_name}.")

# Predict using the best model
pred_prob = best_model.predict(x_val_seq)


Best model: GRU_Model with validation loss: 0.3459194600582123
Loaded best weights into GRU_Model.
70/70 ━━━━━━━━━━━━━━━━━━━━ 1s 11ms/step


In [102]:
pred_prob[0]

array([0.00922482, 0.01939741, 0.04479589, 0.16451098, 0.07125839,
       0.00337581, 0.84735906, 0.52113706, 0.02138527, 0.29382733],
      dtype=float32)

The predictions are in terms of probabilities for each of the 10 tags. Hence we need to have a threshold value to convert these probabilities to 0 or 1.

Let's specify a set of candidate threshold values. We will select the threshold value that performs the best for the validation set.

In [103]:
#define candidate threshold values
import numpy as np
threshold  = np.arange(0,0.5,0.01)
threshold

array([0.  , 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1 ,
       0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.17, 0.18, 0.19, 0.2 , 0.21,
       0.22, 0.23, 0.24, 0.25, 0.26, 0.27, 0.28, 0.29, 0.3 , 0.31, 0.32,
       0.33, 0.34, 0.35, 0.36, 0.37, 0.38, 0.39, 0.4 , 0.41, 0.42, 0.43,
       0.44, 0.45, 0.46, 0.47, 0.48, 0.49])

Let's define a function that takes a threshold value and uses it to convert probabilities into 1 or 0.

In [104]:
# convert probabilities into classes or tags based on a threshold value
def classify(pred_prob,thresh):
  y_pred_seq = []

  for i in pred_prob:
    temp=[]
    for j in i:
      if j>=thresh:
        temp.append(1)
      else:
        temp.append(0)
    y_pred_seq.append(temp)

  return y_pred_seq
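As a concrete illustration, the sketch below applies a threshold to the probabilities printed for `pred_prob[0]` and maps the surviving columns back to tag names. It assumes the columns follow the alphabetically sorted tag order used by `MultiLabelBinarizer`; the helper `tags_above` is hypothetical.

```python
# Ten common tags in sorted order (assumed to match mlb.classes_)
classes = [
    "classification", "distributions", "hypothesis testing", "logistic",
    "machine learning", "probability", "r", "regression", "self study",
    "time series",
]

# Probabilities copied from the pred_prob[0] output above
pred_prob_0 = [0.00922482, 0.01939741, 0.04479589, 0.16451098, 0.07125839,
               0.00337581, 0.84735906, 0.52113706, 0.02138527, 0.29382733]

def tags_above(probs, thresh):
    # Keep every tag whose probability reaches the threshold
    return [tag for tag, p in zip(classes, probs) if p >= thresh]

print(tags_above(pred_prob_0, 0.37))
# ['r', 'regression']
print(tags_above(pred_prob_0, 0.25))
# ['r', 'regression', 'time series']
```

Lowering the threshold adds tags (higher recall, lower precision), which is why the threshold is tuned on the validation set.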


In [105]:
from sklearn import metrics
score=[]

#convert to 1d array
y_true = np.array(y_val).ravel()

for thresh in threshold:

    #classes for each threshold
    y_pred_seq = classify(pred_prob,thresh)

    #convert to 1d array
    y_pred = np.array(y_pred_seq).ravel()

    score.append(metrics.f1_score(y_true,y_pred))
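Flattening the label matrix before calling `f1_score` means every (question, tag) pair is scored as one binary decision. The sketch below shows that computation and the threshold search on made-up values; `f1`, `y_true_flat`, `probs_flat` and `candidates` are illustrative names and data, not taken from the notebook.

```python
def f1(y_true, y_pred):
    # Count true positives, false positives and false negatives
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # Harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

# Flattened toy labels and predicted probabilities (assumed data)
y_true_flat = [1, 0, 0, 1, 0, 1]
probs_flat = [0.9, 0.4, 0.1, 0.35, 0.2, 0.8]
candidates = [0.1, 0.3, 0.5]

# Score each candidate threshold, then keep the best one
scores = [f1(y_true_flat, [int(p >= t) for p in probs_flat]) for t in candidates]
best = candidates[scores.index(max(scores))]
print(best)
# 0.3
```

Tuning and reporting on the same validation split gives an optimistic score; a separate held-out test set would give a cleaner estimate.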

# Model Evaluation

In [106]:
# pick the candidate threshold with the highest F1 score on the validation set
opt = threshold[score.index(max(score))]
opt

np.float64(0.37)

In [107]:
#predictions for optimal threshold
y_pred_seq = classify(pred_prob,opt)
y_pred = np.array(y_pred_seq).ravel()

In [110]:
print(metrics.classification_report(y_true,y_pred))

              precision    recall  f1-score   support

           0       0.91      0.89      0.90     17520
           1       0.62      0.66      0.64      4700

    accuracy                           0.84     22220
   macro avg       0.76      0.77      0.77     22220
weighted avg       0.85      0.84      0.84     22220



In [111]:
y_pred = mlb.inverse_transform(np.array(y_pred_seq))
y_true = mlb.inverse_transform(np.array(y_val))

df = pd.DataFrame({'comment':x_val,'actual':y_true,'predictions':y_pred})

In [112]:
df.sample(10)

Unnamed: 0,comment,actual,predictions
2148,i have million financial time series data minute data each minute sample data point i want to find the causality in real time like if someone give me a data point how i know that one point is caus...,"(r, time series)","(r, time series)"
835,i have data that includes clicks spend signups and date for week i turn off advertising spend to see what clicks and signups are the next week i turn advertising back on to see what the new clicks...,"(r, regression, time series)","(regression,)"
1947,there is quite some content online interpreting odds in a logistic model with a dichotomous predictor my problem is understanding coefficients when there are more than levels for a categorical var...,"(logistic, regression)","(logistic, r, regression)"
1150,i am currently attempting to model student achievement using both categorical mostly demographic factors and numerical exam scores data i am specifically looking for an explanation of the observed...,"(r, regression)","(r, regression, time series)"
283,i am trying to fit a generalized logistic function to a dataset and am having trouble computing the partial derivatives with respect to each of the variables my cost function is as follows j theta...,"(logistic, regression)","(distributions, probability)"
848,in model output first i have got significant varaibles later after i tried to predict marginal effect all variables become insignificant what is problem with my step and does it affect my interpre...,"(logistic, regression)","(machine learning, regression)"
1883,i have a doubt about use of linear regression if the correlation between two variables is is there any use of applying linear regression on those variables if possible can you explain when we shou...,"(r, regression)","(regression,)"
1417,we have n realisations of five individual iid random variables x x x x and x we define another random variable s x x x x x now for a given s generated from the same process the individual componen...,"(machine learning, probability)","(distributions, probability, self study)"
1955,a quality characteristic of a product is normally distributed with mean and stdev specs on the characteristic are x a unit that falls with spec results in profit c if x profit is c if x profit is ...,"(probability, self study)","(distributions, probability, self study)"
1828,i ve implemented knn algorithm in python and now i am testing it on iris data set i have two questions the performance seems to be bad if i run the program times and then calculate the average acc...,"(classification, machine learning)","(classification, machine learning, r)"


In [113]:
def predict_tag(comment):
  text=[]

  #preprocess
  text = [cleaner(comment)]

  #convert to integer sequences
  seq = x_tokenizer.texts_to_sequences(text)

  #pad the sequence
  pad_seq = pad_sequences(seq,  padding='post', maxlen=max_len)

  #make predictions with the best checkpointed model
  pred_prob = best_model.predict(pad_seq)
  classes = classify(pred_prob,opt)[0]

  classes = np.array([classes])
  classes = mlb.inverse_transform(classes)
  return classes

In [116]:
comment = "I just finished a great course on regression and probability."

print("Comment:",comment)
print("Predicted Tags:",predict_tag(comment))

Comment: I just finished a great course on regression and probability.
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 69ms/step
Predicted Tags: [('distributions', 'probability')]
