# Problem Statement

We have observed that many questions on web-based question-answering and discussion platforms go unanswered for a long time. The main reasons are that the question was asked in the wrong category, or that a similar question has been asked before, so people tend not to answer it. The CrowdSource team at Google Research, a group dedicated to advancing NLP and other types of ML science via crowdsourcing, has collected data on a number of these quality-scoring aspects.
We use that dataset to build predictive algorithms for different subjective aspects of question answering. The question-answer pairs were gathered from nearly 70 different websites and rated in a "common-sense" fashion: the raters received minimal guidance and training and relied largely on their subjective interpretation of the prompts. Each prompt was therefore crafted to be as intuitive as possible, so that raters could simply use their common sense to complete the task. Demonstrating that these subjective labels can be predicted reliably can shine a new light on this research area.

The fundamental tasks of our project are:

- Classify the questions based on the labels into various categories

- Relevant question-answer retrieval using semantic similarity


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
# Import everything from tensorflow.keras so that layers, models and utilities
# all come from the same Keras implementation (mixing `keras` and `tensorflow.keras` can break)
from tensorflow.keras import backend as k
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
from tensorflow.keras.layers import (Input, Embedding, Dense, Flatten, Conv1D, Conv2D,
                                     GlobalMaxPooling1D, GlobalMaxPool1D, SimpleRNN, LSTM, GRU,
                                     Bidirectional, Concatenate, TimeDistributed)
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import plot_model, to_categorical

In [None]:
train_df=pd.read_csv('../input/google-quest-challenge/train.csv')
train_df.head()

In [None]:
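# Keep only the columns we need; .copy() avoids SettingWithCopyWarning when the columns are cleaned later
train_df = train_df[['question_title', 'question_body', 'answer', 'category']].copy()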
train_df

In [None]:
#This is used to label encode the labels for categorization
from sklearn.preprocessing import LabelEncoder
label_y= LabelEncoder()
labels=label_y.fit_transform(train_df['category'])
labels

# Removing URLs from the text
URLs (Uniform Resource Locators) are references to locations on the web; for our task they carry little useful information. We therefore remove them using Python's `re` module, which provides regular expression matching operations.

In [None]:
import re
def remove_url(s):
  return re.sub(r'http\S+', '', s)

train_df['question_body'] = train_df['question_body'].apply(remove_url)
train_df['answer'] = train_df['answer'].apply(remove_url)

# Removing HTML tags from the text
Text collected from the web often contains HTML tags. These tags only control how a browser renders the page and add no value to the text itself, so we remove them using the `re` library.

In [None]:
def remove_tag(s):
  return re.sub(r'<.*?>', ' ', s)


train_df['question_body'] = train_df['question_body'].apply(remove_tag)
train_df['answer'] = train_df['answer'].apply(remove_tag)
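
One interaction between the two cleaning steps is worth checking. The pattern `http\S+` runs until the next whitespace, so when a URL sits inside an HTML attribute, removing URLs first can also swallow the closing markup and the link text. The sketch below redefines the two helpers from the cells above and compares both orders; the sample string is made up for illustration and is not taken from the dataset.

```python
import re

def remove_url(s):
    # Same pattern as above: 'http' followed by any run of non-whitespace
    return re.sub(r'http\S+', '', s)

def remove_tag(s):
    # Same pattern as above: non-greedy match of anything between < and >
    return re.sub(r'<.*?>', ' ', s)

# Hypothetical sample: one URL inside an anchor tag, one bare URL
sample = 'see <a href="http://example.com/page">docs</a> for details http://example.com/faq'

url_then_tag = remove_tag(remove_url(sample))   # order used in this notebook
tag_then_url = remove_url(remove_tag(sample))   # alternative order

print(repr(url_then_tag))
print(repr(tag_then_url))
```

With URLs removed first, the link text `docs` is lost and a fragment of the opening tag is left behind. With tags removed first, the link text survives and both URLs are still removed. Whether such markup actually occurs in this dataset would need to be checked before changing the order.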

# Lowercasing the text
The text contains both uppercase and lowercase characters. String comparison is case sensitive, so "the" and "The" would be treated as different words, which would not only increase the number of words we have to process but also give the same word multiple representations. Hence we lowercase the entire text.

In [None]:
def lower_words(s):
   return s.lower()

train_df['question_body'] = train_df['question_body'].apply(lower_words)
train_df['answer'] = train_df['answer'].apply(lower_words)

# Expand contracted words in the text
In everyday spoken and written communication, many of us contract common phrases, so that “you are” becomes “you’re”. Expanding contractions to their full form maps both variants to the same tokens, so the model sees a more consistent vocabulary.

In [None]:
def decontracted(phrase):
  """decontracted takes text and convert contractions into natural form.
     ref: https://stackoverflow.com/questions/19790188/expanding-english-language-contractions-in-python/47091490#47091490"""

  # specific
  phrase = re.sub(r"won\'t", "will not", phrase)
  phrase = re.sub(r"can\'t", "can not", phrase)
  phrase = re.sub(r"won\’t", "will not", phrase)
  phrase = re.sub(r"can\’t", "can not", phrase)

  # general
  phrase = re.sub(r"n\'t", " not", phrase)
  phrase = re.sub(r"\'re", " are", phrase)
  phrase = re.sub(r"\'s", " is", phrase)
  phrase = re.sub(r"\'d", " would", phrase)
  phrase = re.sub(r"\'ll", " will", phrase)
  phrase = re.sub(r"\'t", " not", phrase)
  phrase = re.sub(r"\'ve", " have", phrase)
  phrase = re.sub(r"\'m", " am", phrase)

  phrase = re.sub(r"n\’t", " not", phrase)
  phrase = re.sub(r"\’re", " are", phrase)
  phrase = re.sub(r"\’s", " is", phrase)
  phrase = re.sub(r"\’d", " would", phrase)
  phrase = re.sub(r"\’ll", " will", phrase)
  phrase = re.sub(r"\’t", " not", phrase)
  phrase = re.sub(r"\’ve", " have", phrase)
  phrase = re.sub(r"\’m", " am", phrase)

  return phrase

train_df['question_body'] = train_df['question_body'].apply(decontracted)
train_df['answer'] = train_df['answer'].apply(decontracted)
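
The general rules above are purely pattern based, which has one side effect that is easy to miss: `'s` is always expanded to ` is`, even when it marks a possessive. The sketch below uses a hypothetical helper `expand_general` containing only three of the rules from `decontracted`, applied to two made-up phrases.

```python
import re

def expand_general(phrase):
    # Subset of the general rules used in decontracted() above (illustrative only)
    phrase = re.sub(r"n't", " not", phrase)
    phrase = re.sub(r"'re", " are", phrase)
    phrase = re.sub(r"'s", " is", phrase)
    return phrase

contraction = expand_general("it's what they're saying, isn't it")
possessive = expand_general("the user's profile")

print(contraction)
print(possessive)
```

The first phrase expands as intended, while the possessive becomes "the user is profile". In this pipeline the effect is limited, because 'is' is part of the stop-word set and is removed in a later step.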

# Remove words with numbers
Tokens that contain digits (for example IDs or version strings) tend to add noise rather than signal for our task, so we remove them.

In [None]:
def remove_words_with_nums(s):
  return re.sub(r"\S*\d\S*", "", s)


train_df['question_body'] = train_df['question_body'].apply(remove_words_with_nums)
train_df['answer'] = train_df['answer'].apply(remove_words_with_nums)
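
Note that the pattern removes the whole token as soon as it contains a single digit. The short sketch below repeats the helper from the cell above and applies it to a made-up sentence to show what is dropped.

```python
import re

def remove_words_with_nums(s):
    # Same pattern as above: any whitespace-delimited token containing a digit
    return re.sub(r"\S*\d\S*", "", s)

# Hypothetical sample sentence
sample = "install python3 on ubuntu 18.04 using utf-8 encoding"
cleaned = remove_words_with_nums(sample)

print(cleaned.split())
```

Technical terms such as `python3` or `utf-8` disappear along with plain numbers. For category classification this may discard some useful tokens, so it is a trade-off rather than a free win.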

# Remove special characters
Special characters such as - (hyphen) or / (slash) add little value here, so we remove them. Which characters to remove depends on the use case: for example, if currency plays no role in the task (as in sentiment analysis), currency signs can be removed as well.

In [None]:
def remove_special_character(s):
  return re.sub('[^A-Za-z0-9]+', ' ', s)

train_df['question_body'] = train_df['question_body'].apply(remove_special_character)
train_df['answer'] = train_df['answer'].apply(remove_special_character)

# Stop Word Removal
Apart from URLs, HTML tags and special characters, there are words that are not needed for tasks such as sentiment analysis or text classification. Words like I, me, you and he increase the size of the text data but do not improve results much, so it is a good idea to remove them.

Instead of using the standard NLTK stop-word set, we define our own, because the NLTK set also includes negation words like 'not' that could be useful for the task.

In [None]:
stopwords= set(['br', 'the', 'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've",\
            "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', \
            'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their',\
            'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', \
            'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', \
            'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', \
            'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after',\
            'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further',\
            'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',\
            'most', 'other', 'some', 'such', 'only', 'own', 'same', 'so', 'than', 'too', 'very', \
            's', 't', 'can', 'will', 'just', 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', \
            've', 'y', 'ain', 'aren'])

In [None]:
def remove_stopword(s):
    # split() without arguments also drops the empty strings left by repeated spaces
    res = ' '.join([word for word in s.split() if word not in stopwords])
    return res

train_df['question_body'] = train_df['question_body'].apply(remove_stopword)
train_df['answer'] = train_df['answer'].apply(remove_stopword)
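
To illustrate why keeping negations matters, the sketch below filters one made-up sentence with a small subset of the custom set and then with the same subset plus 'not'. The name `custom_stopwords` and this two-argument variant of `remove_stopword` are assumptions for illustration only.

```python
# Small subset of the custom stop-word set above (illustrative only)
custom_stopwords = {'the', 'is', 'it', 'this', 'a', 'to', 'do', 'does'}

def remove_stopword(s, stopwords):
    # Keep every whitespace-separated word that is not in the given stop-word set
    return ' '.join(word for word in s.split() if word not in stopwords)

sentence = 'this plugin does not work'
kept_negation = remove_stopword(sentence, custom_stopwords)
dropped_negation = remove_stopword(sentence, custom_stopwords | {'not'})

print(kept_negation)
print(dropped_negation)
```

Once 'not' is treated as a stop word, "plugin not work" collapses to "plugin work", which reverses the meaning of the sentence.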

# Lemmatization
Now that we have removed the “noise” from the text, it is time to normalize the data set. A word may appear in multiple forms, such as stop and stopped (past participle) or price and prices (plural). Lemmatization converts such variations to the root form of the word.

In [None]:
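# WordNetLemmatizer needs the WordNet corpus; run nltk.download('wordnet') once if it is missing
from nltk.stem.wordnet import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
def lemmatization(s):
    # lemmatize() treats every word as a noun by default, so plurals are reduced
    # ("prices" -> "price") while verb forms such as "stopped" are left unchanged
    res = ' '.join([lemmatizer.lemmatize(word) for word in s.split(' ')])
    return res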

train_df['question_body'] = train_df['question_body'].apply(lemmatization)
train_df['answer'] = train_df['answer'].apply(lemmatization)


In [None]:
def preprocess_text(text):
    text = remove_url(text)
    text = remove_tag(text)
    text = lower_words(text)
    text = decontracted(text)
    text = remove_words_with_nums(text)
    text = remove_special_character(text)
    text = remove_stopword(text)
    text = lemmatization(text)
    return text

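# question_body was already cleaned step by step above, so this call is mostly redundant;
# preprocess_text is mainly needed for new text such as titles and queries
train_df['question_body'] = train_df['question_body'].apply(preprocess_text)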

# Building a Bare Minimal Neural Network 

Here, we build a stand-alone neural network model that classifies each question into its category label. We will use RNNs, LSTMs and GRUs for this. LSTM-based networks have been a fundamental building block of many robust sequence architectures, but for the first part we focus on standard RNNs.


## Recurrent Neural Networks

Recurrent neural networks (RNNs) are a class of neural networks that is powerful for modeling sequence data such as time series or natural language. Schematically, an RNN layer uses a for loop to iterate over the timesteps of a sequence, while maintaining an internal state that encodes information about the timesteps it has seen so far. The forward pass of a classical RNN updates the hidden state $h_t$ from the current input $x_t$ and the previous state $h_{t-1}$:

$$h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$$
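
The loop over timesteps can be sketched in a few lines of NumPy. This is a minimal illustration under assumptions, not the Keras `SimpleRNN` implementation: the function name `rnn_forward`, the weight names and the toy dimensions are all chosen here for the example.

```python
import numpy as np

def rnn_forward(x, W_xh, W_hh, b_h):
    """Run a tanh RNN over x of shape (timesteps, input_dim); return all hidden states."""
    hidden_dim = W_hh.shape[0]
    h = np.zeros(hidden_dim)              # initial hidden state
    states = []
    for x_t in x:                         # explicit loop over timesteps
        # Row-vector form of the recurrence: h_t = tanh(x_t W_xh + h_{t-1} W_hh + b_h)
        h = np.tanh(x_t @ W_xh + h @ W_hh + b_h)
        states.append(h)
    return np.stack(states)

# Toy dimensions and random weights (assumptions for illustration)
rng = np.random.default_rng(0)
timesteps, input_dim, hidden_dim = 4, 3, 2
x = rng.normal(size=(timesteps, input_dim))
W_xh = rng.normal(size=(input_dim, hidden_dim))
W_hh = rng.normal(size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)

states = rnn_forward(x, W_xh, W_hh, b_h)
print(states.shape)  # (4, 2)
```

Each row of `states` is the hidden state after one timestep, which is what a recurrent layer returns when `return_sequences=True`.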


## Classical RNN architecture


The structure of a classic RNN is shown below:


<img src="https://miro.medium.com/max/627/1*go8PHsPNbbV6qRiwpUQ5BQ.png">



## Simple RNN

In [None]:
# Important parameters when training the embedding from scratch (no pretrained vectors)
maxlen=1000        # number of tokens kept per question
max_features=5000  # vocabulary size
embed_size=768     # embedding dimension


# Design a simple model
# Layers:
# 1. Input
# 2. Embedding
# 3. Simple RNN, bidirectional so that each timestep sees context from both directions
# 4. GlobalMaxPooling (optional)
# 5. Dense layer with ReLU activation
# 6. Final Dense layer whose number of output units equals the number of unique labels in the corpus (5 here)

inp=Input(shape=(maxlen,))
z=Embedding(max_features,embed_size,input_length=maxlen)(inp)
z=Bidirectional(SimpleRNN(60,return_sequences=True))(z)
z=GlobalMaxPool1D()(z)
z=Dense(16,activation='relu')(z)
z=Dense(5,activation='softmax')(z)
model=Model(inputs=inp,outputs=z)
model.compile(loss='sparse_categorical_crossentropy',optimizer='adam',metrics=['accuracy'])
model.summary()
plot_model(
    model,
    to_file="Simple_RNN.png",
    show_shapes=True,
    show_layer_names=True,
    rankdir="TB",
    expand_nested=False,
    dpi=96,
)

#Split the training and test datasets
train_y=labels
train_x,test_x,train_y,test_y=train_test_split(train_df['question_body'],train_y,test_size=0.2,random_state=42)
val_x=test_x

#Tokenizing steps- must be remembered
tokenizer=Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(list(train_x))
train_x=tokenizer.texts_to_sequences(train_x)
val_x=tokenizer.texts_to_sequences(val_x)

#Pad the sequence- To allow same length for all vectorized words
train_x=pad_sequences(train_x,maxlen=maxlen)
val_x=pad_sequences(val_x,maxlen=maxlen)
val_y=test_y
print("Padded and tokenized training sequences:", train_x.shape)
print("Training target shape:", train_y.shape)
print("Padded and tokenized validation sequences:", val_x.shape)
print("Validation target shape:", val_y.shape)

#Run the model with the dataset with 128 batch size ,10 epochs and validation data.
model.fit(train_x,train_y,batch_size=128,epochs=10,verbose=2,validation_data=(val_x,val_y))
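
The tokenizing and padding steps in the cell above can be sketched in plain Python to show what the model actually receives. This is a simplified stand-in, not the Keras implementation: the helper names `fit_word_index`, `texts_to_sequences` and `pad_pre` and the toy texts are assumptions made for illustration, and the real `Tokenizer` additionally lowercases text and filters punctuation.

```python
from collections import Counter

def fit_word_index(texts):
    # Rank words by frequency, starting at 1; index 0 is reserved for padding
    counts = Counter(word for text in texts for word in text.split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index, num_words):
    # Keep only indices below num_words; rarer and unseen words are dropped
    return [[word_index[w] for w in text.split()
             if w in word_index and word_index[w] < num_words]
            for text in texts]

def pad_pre(sequences, maxlen):
    # Left-pad with zeros and keep the last maxlen tokens (the Keras defaults are 'pre')
    return [([0] * maxlen + seq)[-maxlen:] for seq in sequences]

# Toy corpus (assumption for illustration)
texts = ['plugin not work', 'plugin update break theme', 'theme not load']
word_index = fit_word_index(texts)
sequences = texts_to_sequences(texts, word_index, num_words=5)
padded = pad_pre(sequences, maxlen=4)

print(sequences)
print(padded)
```

Two consequences follow for the notebook's settings. Words ranked at or beyond `max_features` are silently dropped, and questions longer than `maxlen` tokens lose their beginning rather than their end, because truncation also happens at the front by default.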

## Model Architecture

The model architecture for the Bidirectional Simple RNN is shown below:

<img src="https://i.imgur.com/QFsESSn.png">

In [None]:
temp = 'I am new to Wordpress. i have issue with Feature image. just i need to add URL to feature image(when we click on that feature image , it should redirect to that particular URL). also is it possible to give URL to Title of the Portfolio categories page which i used in normal page. This is Portfolio , i have used in the "mypage" . so in that" mypage" when we click on that image and title it should be redirect to the link (should able to give individual link) Any help would be appreciated. Thanks.'

In [None]:
temp = [preprocess_text(temp)]
temp = tokenizer.texts_to_sequences(temp)
temp = pad_sequences(temp,maxlen=maxlen)
temp_prediction = model.predict(temp)

In [None]:
temp_label = np.argmax(temp_prediction)
print('Predicted Category', label_y.inverse_transform([temp_label])[0])

## Creating Embedding Matrix using GloVe Word Embeddings

In [None]:
# Using pretrained GloVe embeddings (200 dimensions).
# Pretrained embeddings give the model word vectors that already encode semantic similarity.
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
maxlen=1000
max_features=5000 
# embed_size is set further down from the dimensionality of the loaded GloVe vectors

train_y=labels
train_x,test_x,train_y,test_y=train_test_split(train_df['question_body'],train_y,test_size=0.2,random_state=42)
val_x=test_x

#Tokenizing steps- must be remembered
tokenizer=Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(list(train_x))
train_x=tokenizer.texts_to_sequences(train_x)
val_x=tokenizer.texts_to_sequences(val_x)

#Pad the sequence- To allow same length for all vectorized words
train_x=pad_sequences(train_x,maxlen=maxlen)
val_x=pad_sequences(val_x,maxlen=maxlen)
val_y=test_y
print("Padded and tokenized training sequences:", train_x.shape)
print("Training target shape:", train_y.shape)
print("Padded and tokenized validation sequences:", val_x.shape)
print("Validation target shape:", val_y.shape)

EMBEDDING_FILE = '../input/glove-global-vectors-for-word-representation/glove.6B.200d.txt'

def get_coefs(word, *arr):
    # Each line of the GloVe file is a word followed by its vector components
    return word, np.asarray(arr, dtype='float32')

with open(EMBEDDING_FILE, encoding='utf8') as f:
    embeddings_index = dict(get_coefs(*o.rstrip().split(" ")) for o in f if len(o) > 100)

# np.stack needs a sequence, so materialise the dict view as a list
all_embs = np.stack(list(embeddings_index.values()))
emb_mean, emb_std = all_embs.mean(), all_embs.std()
embed_size = all_embs.shape[1]

# Words without a GloVe vector keep a random row drawn from the same distribution
word_index = tokenizer.word_index
nb_words = min(max_features, len(word_index))
embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))
for word, i in word_index.items():
    if i >= nb_words:
        continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
plt.plot(embedding_matrix[10])
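
The construction of the embedding matrix is easier to follow on a toy example. The sketch below uses a made-up three-dimensional "pretrained" dictionary and a three-word vocabulary, both assumptions for illustration, and follows the same logic as the cell above: known words copy their pretrained vector, unknown words keep a random row.

```python
import numpy as np

# Toy stand-ins (assumptions for illustration): 3-dimensional "pretrained" vectors
pretrained = {
    'plugin': np.array([0.1, 0.2, 0.3], dtype='float32'),
    'theme': np.array([0.4, 0.5, 0.6], dtype='float32'),
}
word_index = {'plugin': 1, 'widget': 2, 'theme': 3}   # as produced by a tokenizer
vocab_size, dim = 4, 3                                 # row 0 is the padding index

# Initialise every row randomly with the mean and std of the pretrained vectors
rng = np.random.default_rng(0)
all_vecs = np.stack(list(pretrained.values()))
matrix = rng.normal(all_vecs.mean(), all_vecs.std(), (vocab_size, dim))

for word, i in word_index.items():
    vec = pretrained.get(word)
    if i < vocab_size and vec is not None:
        matrix[i] = vec        # known word: copy the pretrained vector
    # unknown word ('widget'): the random row is kept

print(matrix.shape)
```

Row `i` of the matrix is the vector the `Embedding` layer looks up for token index `i`, which is why the matrix has to be indexed with the tokenizer's `word_index`.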

## Simple RNN with Glove200D pretrained embeddings


In [None]:
# Important parameters when using the pretrained GloVe 200d embeddings
maxlen=1000
max_features=5000 
embed_size=200


# Design a simple model
# Layers:
# 1. Input
# 2. Embedding, initialised with the pretrained GloVe weights
# 3. Simple RNN, bidirectional so that each timestep sees context from both directions
# 4. GlobalMaxPooling (optional)
# 5. Dense layer with ReLU activation
# 6. Final Dense layer whose number of output units equals the number of unique labels in the corpus (5 here)

inp=Input(shape=(maxlen,))
z=Embedding(max_features,embed_size,weights=[embedding_matrix])(inp)
z=Bidirectional(SimpleRNN(60,return_sequences=True))(z)
z=GlobalMaxPool1D()(z)
z=Dense(16,activation='relu')(z)
z=Dense(5,activation='softmax')(z)
model=Model(inputs=inp,outputs=z)
model.compile(loss='sparse_categorical_crossentropy',optimizer='adam',metrics=['accuracy'])
model.summary()
plot_model(
    model,
    to_file="Simple_RNN_Glove200d.png",
    show_shapes=True,
    show_layer_names=True,
    rankdir="TB",
    expand_nested=False,
    dpi=96,
)

#Split the training and test datasets
train_y=labels
train_x,test_x,train_y,test_y=train_test_split(train_df['question_body'],train_y,test_size=0.2,random_state=42)
val_x=test_x

#Tokenizing steps- must be remembered
tokenizer=Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(list(train_x))
train_x=tokenizer.texts_to_sequences(train_x)
val_x=tokenizer.texts_to_sequences(val_x)

#Pad the sequence- To allow same length for all vectorized words
train_x=pad_sequences(train_x,maxlen=maxlen)
val_x=pad_sequences(val_x,maxlen=maxlen)
val_y=test_y
print("Padded and tokenized training sequences:", train_x.shape)
print("Training target shape:", train_y.shape)
print("Padded and tokenized validation sequences:", val_x.shape)
print("Validation target shape:", val_y.shape)

#Run the model with the dataset with 128 batch size ,10 epochs and validation data.
model.fit(train_x,train_y,batch_size=128,epochs=10,verbose=2,validation_data=(val_x,val_y))


The model architecture is shown below:

<img src="https://i.imgur.com/3ZBQApl.png">

# LSTM- Long Short Term Memory

[LSTMs](https://colah.github.io/posts/2015-08-Understanding-LSTMs/) are gated recurrent networks with three gates (input, forget and output) built from sigmoid units, plus a tanh layer that proposes candidate cell values. Before transformers, LSTM-based architectures were the standard building block for sequence models. At each timestep, an LSTM cell receives three signals: h (the hidden state produced at the previous timestep), c (the cell state from the previous timestep), and x (the current input vector). It produces the updated hidden state h_t and the updated cell state c_t; the output gate o controls how much of the cell state is exposed through h_t.
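
A single LSTM timestep can be written out in NumPy to make the roles of the gates concrete. This is a minimal sketch under assumptions: the function name `lstm_step`, the packing of all four weight blocks into one matrix, and the gate order (input, forget, candidate, output) are choices made for this example, not a description of any particular library's internals.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM timestep. W: (input_dim, 4*units), U: (units, 4*units), b: (4*units,)."""
    units = h_prev.shape[0]
    z = x_t @ W + h_prev @ U + b
    i = sigmoid(z[0 * units:1 * units])   # input gate: how much new information to write
    f = sigmoid(z[1 * units:2 * units])   # forget gate: how much of the old cell state to keep
    g = np.tanh(z[2 * units:3 * units])   # candidate cell values
    o = sigmoid(z[3 * units:4 * units])   # output gate: how much of the cell state to expose
    c_t = f * c_prev + i * g              # updated cell state
    h_t = o * np.tanh(c_t)                # updated hidden state
    return h_t, c_t

# Toy setup with all-zero weights (assumption for illustration)
input_dim, units = 3, 2
W = np.zeros((input_dim, 4 * units))
U = np.zeros((units, 4 * units))
b = np.zeros(4 * units)
x_t = np.ones(input_dim)
h_prev = np.zeros(units)
c_prev = np.array([1.0, -2.0])

h_t, c_t = lstm_step(x_t, h_prev, c_prev, W, U, b)
print(h_t, c_t)
```

With all-zero weights every gate evaluates to 0.5 and the candidate to 0, so the cell state is simply halved. This makes it easy to verify by hand that the forget gate scales the previous cell state.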


<img src="https://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-chain.png">

## LSTM model with Glove200D pretrained embeddings

Now we apply the GloVe embeddings (200d) to the LSTM model to see whether they improve performance.

In [None]:
# Important parameters when using the pretrained GloVe 200d embeddings
maxlen=1000
max_features=5000 
embed_size=200


# Design a simple model
# Layers:
# 1. Input
# 2. Embedding, initialised with the pretrained GloVe weights
# 3. LSTM, bidirectional so that each timestep sees context from both directions
# 4. GlobalMaxPooling (optional)
# 5. Dense layer with ReLU activation
# 6. Final Dense layer whose number of output units equals the number of unique labels in the corpus (5 here)

inp=Input(shape=(maxlen,))
z=Embedding(max_features,embed_size,weights=[embedding_matrix])(inp)
z=Bidirectional(LSTM(60,return_sequences=True))(z)
z=GlobalMaxPool1D()(z)
z=Dense(16,activation='relu')(z)
z=Dense(5,activation='softmax')(z)
model=Model(inputs=inp,outputs=z)
model.compile(loss='sparse_categorical_crossentropy',optimizer='adam',metrics=['accuracy'])
model.summary()
plot_model(
    model,
    to_file="Simple_LSTM_Glove200d.png",
    show_shapes=True,
    show_layer_names=True,
    rankdir="TB",
    expand_nested=False,
    dpi=96,
)

#Split the training and test datasets
train_y=labels
train_x,test_x,train_y,test_y=train_test_split(train_df['question_body'],train_y,test_size=0.2,random_state=42)
val_x=test_x

#Tokenizing steps- must be remembered
tokenizer=Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(list(train_x))
train_x=tokenizer.texts_to_sequences(train_x)
val_x=tokenizer.texts_to_sequences(val_x)

#Pad the sequence- To allow same length for all vectorized words
train_x=pad_sequences(train_x,maxlen=maxlen)
val_x=pad_sequences(val_x,maxlen=maxlen)
val_y=test_y
print("Padded and tokenized training sequences:", train_x.shape)
print("Training target shape:", train_y.shape)
print("Padded and tokenized validation sequences:", val_x.shape)
print("Validation target shape:", val_y.shape)

#Run the model with the dataset with 128 batch size ,10 epochs and validation data.
model.fit(train_x,train_y,batch_size=128,epochs=10,verbose=2,validation_data=(val_x,val_y))


The model architecture is as follows:

<img src="https://i.imgur.com/oOmKx56.png">

## Gated Recurrent Units

[GRUs](https://colah.github.io/posts/2015-08-Understanding-LSTMs/) (Gated Recurrent Units), introduced by Cho, et al. (2014), are a slightly more dramatic variation on the LSTM. A GRU combines the forget and input gates into a single “update gate”, merges the cell state and hidden state, and makes some other changes. The resulting model is simpler than standard LSTM models and has become increasingly popular.
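
For comparison with the LSTM, one GRU timestep can be sketched in NumPy as follows. The function name `gru_step`, the weight names and the omission of bias terms are assumptions made for brevity. The final interpolation follows the convention of the linked post; some libraries swap the roles of `z` and `1 - z`.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru_step(x_t, h_prev, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU timestep without bias terms. W*: (input_dim, units), U*: (units, units)."""
    z = sigmoid(x_t @ Wz + h_prev @ Uz)               # update gate (merged forget/input gate)
    r = sigmoid(x_t @ Wr + h_prev @ Ur)               # reset gate
    h_cand = np.tanh(x_t @ Wh + (r * h_prev) @ Uh)    # candidate hidden state
    return (1.0 - z) * h_prev + z * h_cand            # single state: no separate cell state

# Toy setup with all-zero weights (assumption for illustration)
input_dim, units = 3, 2
Wz = Wr = Wh = np.zeros((input_dim, units))
Uz = Ur = Uh = np.zeros((units, units))
x_t = np.ones(input_dim)
h_prev = np.array([1.0, -2.0])

h_t = gru_step(x_t, h_prev, Wz, Uz, Wr, Ur, Wh, Uh)
print(h_t)
```

The function returns only `h_t`, which reflects the merge of cell state and hidden state described above, and the single gate `z` decides both how much of the old state to keep and how much of the candidate to write.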


<img src="https://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-var-GRU.png">

## BiDirectional GRU

## Model Architecture for vanilla GRU

The model architecture is as follows:

<img src="https://i.imgur.com/jaZegBX.png">

## Relevant question-answer retrieval using semantic similarity

## Universal Sentence Encoder

<img src="https://jinglescode.github.io/assets/img/posts/build-textual-similarity-analysis-web-app-09.jpg">

We'll download the Universal Sentence Encoder model from TensorFlow Hub and use it to obtain embeddings for the titles of all the question-answer pairs.

In [None]:
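import tensorflow_hub as hub
model = hub.KerasLayer("https://tfhub.dev/google/universal-sentence-encoder-large/5?tf-hub-format=compressed")
train_df['question_title'] = train_df['question_title'].apply(preprocess_text)
# Embed all question titles once, in batches of 256 to limit memory use
titles = train_df['question_title'].tolist()
question_embeddings = np.vstack([model(titles[i:i+256]).numpy() for i in range(0, len(titles), 256)])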

# Semantic Similarity Based Retrieval

We'll compute the cosine similarity between the query and every question title and return the question title with the maximum similarity.

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def query_match(query):
    # Embed the query and compare it with every precomputed question-title embedding
    query_embedding = model([query]).numpy()
    sim = cosine_similarity(question_embeddings, query_embedding)  # shape: (n_titles, 1)
    # Index of the title with the highest cosine similarity
    return int(np.argmax(sim[:, 0]))
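
The retrieval step itself does not depend on the encoder and can be illustrated with toy vectors. The sketch below uses a hypothetical helper `top_match` and made-up three-dimensional embeddings; it normalises the vectors so that a dot product equals the cosine similarity.

```python
import numpy as np

def top_match(corpus_embeddings, query_embedding):
    # Normalise rows so that a dot product equals cosine similarity
    corpus_norm = corpus_embeddings / np.linalg.norm(corpus_embeddings, axis=1, keepdims=True)
    query_norm = query_embedding / np.linalg.norm(query_embedding)
    scores = corpus_norm @ query_norm
    return int(np.argmax(scores)), scores

# Toy embeddings (assumptions for illustration): three "titles" and one query
corpus = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 1.0],
                   [2.0, 2.0, 0.0]])
query_vec = np.array([0.0, 3.0, 2.9])

best_idx, scores = top_match(corpus, query_vec)
print(best_idx, scores)
```

Cosine similarity ignores vector length, so the third row is not favoured for having the largest norm. The second row wins because it points in almost the same direction as the query.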

In [None]:
print('Input the query you want to search')
# query = input()
query = "delete facebook appeal"
cleaned_query = preprocess_text(query)
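# The titles were preprocessed before embedding, so match against the cleaned query as well
query_idx = query_match(cleaned_query)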
print('Here is the result')
print(train_df.iloc[query_idx].question_title)