# Quora Question Pairs

## Business Problem

### Description
Quora is a place to gain and share knowledge—about anything. It’s a platform to ask questions and connect with people who contribute unique insights and quality answers. This empowers people to learn from each other and to better understand the world.

Over 100 million people visit Quora every month, so it's no surprise that many people ask similarly worded questions. Multiple questions with the same intent can cause seekers to spend more time finding the best answer to their question, and make writers feel they need to answer multiple versions of the same question. Quora values canonical questions because they provide a better experience to active seekers and writers, and offer more value to both of these groups in the long term.

### Problem Statement
- Identify which questions asked on Quora are duplicates of questions that have already been asked.
- This could be useful to instantly provide answers to questions that have already been answered.
- We are tasked with predicting whether a pair of questions are duplicates or not.

### Sources/Useful Links
- Source : https://www.kaggle.com/c/quora-question-pairs

Useful Links
- Discussions : https://www.kaggle.com/anokas/data-analysis-xgboost-starter-0-35460-lb/comments
- Blog 1 : https://engineering.quora.com/Semantic-Question-Matching-with-Deep-Learning
- Blog 2 : https://towardsdatascience.com/identifying-duplicate-questions-on-quora-top-12-on-kaggle-4c1cf93f1c30
        
        

## Importing the Libraries

- Import the libraries needed for loading and analysing the data.

In [1]:
from nltk.corpus import stopwords
# This package is used for finding the longest common substring between two strings;
# you can write your own DP code for this instead

from sklearn.manifold import TSNE
# Import the packages required for word-cloud generation

### Reading data and basic stats 

In [4]:
# Data fields

# - id - the id of a training set question pair
# - qid1, qid2 - unique ids of each question (only available in train.csv)
# - question1, question2 - the full text of each question
# - is_duplicate - the target variable, set to 1 if question1 and question2 have essentially the same meaning, and 0 otherwise.

### Distribution of data points


In [7]:
# - Number of duplicate (similar) and non-duplicate (non-similar) question pairs
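
A minimal sketch of the count this cell describes. The labels below are made up for illustration (they are not taken from train.csv), and plain Python is used so the snippet runs without the dataset.

```python
from collections import Counter

# Hypothetical labels standing in for the is_duplicate column.
is_duplicate = [0, 1, 0, 0, 1, 0, 1, 0]

counts = Counter(is_duplicate)
total = len(is_duplicate)

# Share of each class, in percent.
share = {label: 100.0 * n / total for label, n in counts.items()}

print("non-duplicate pairs:", counts[0], f"({share[0]:.1f}%)")
print("duplicate pairs:", counts[1], f"({share[1]:.1f}%)")
```

A strong imbalance between the two classes would be worth keeping in mind when choosing a train/test split strategy later on.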

### Number of unique questions

In [8]:
# print len(np.unique(qids))

### Checking for Duplicates 

In [9]:
# Checking whether there are any repeated pairs of questions

### Number of occurrences of each question

In [10]:
# Check the Number of occurrences of each question
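
The three checks above (unique questions, repeated pairs, occurrences per question) can be sketched together. The `pairs` list is a hypothetical stand-in for the `qid1` and `qid2` columns, and `Counter` is used here in place of whatever the notebook cell actually calls.

```python
from collections import Counter

# Hypothetical (qid1, qid2) pairs standing in for the two id columns.
pairs = [(1, 2), (3, 4), (1, 5), (6, 2), (1, 2)]

# Repeated pairs: any (qid1, qid2) combination that appears more than once.
pair_counts = Counter(pairs)
repeated_pairs = {p: c for p, c in pair_counts.items() if c > 1}

# Flatten both id columns into one list of question ids.
qids = [q for pair in pairs for q in pair]

# Occurrences of each question across both columns.
occurrences = Counter(qids)
n_unique = len(occurrences)
n_repeated_questions = sum(1 for c in occurrences.values() if c > 1)
max_occurrences = max(occurrences.values())

print("unique questions:", n_unique)
print("questions appearing more than once:", n_repeated_questions)
print("max occurrences of a single question:", max_occurrences)
print("repeated pairs:", repeated_pairs)
```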

### Checking for NULL values

In [12]:
# Checking whether there are any rows with null values

# Filling the null values with ' '
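
A small sketch of the null check and fill, assuming the data sits in a pandas DataFrame. The two-row frame is invented for illustration.

```python
import pandas as pd

# Hypothetical frame with one missing question.
df = pd.DataFrame({
    "question1": ["How do I learn Python?", None],
    "question2": ["What is the best way to learn Python?", "Why is the sky blue?"],
})

# Rows that contain at least one null value.
nan_rows = df[df.isnull().any(axis=1)]
print(nan_rows)

# Replace nulls with a single space, then confirm none remain.
df = df.fillna(" ")
print(df.isnull().values.any())
```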

### Feature Engineering

In [13]:
# Let us now construct a few features like:

# - freq_qid1 = Frequency of qid1's
# - freq_qid2 = Frequency of qid2's
# - q1len = Length of q1
# - q2len = Length of q2
# - q1_n_words = Number of words in Question 1
# - q2_n_words = Number of words in Question 2
# - word_Common = Number of common unique words in Question 1 and Question 2
# - word_Total = Total number of words in Question 1 + Total number of words in Question 2
# - word_share = word_Common / word_Total
# - freq_q1+freq_q2 = Sum of the frequencies of qid1 and qid2
# - freq_q1-freq_q2 = Absolute difference of the frequencies of qid1 and qid2
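
The length and word-overlap features listed above could be computed roughly as follows. This is a sketch under one assumption that the list does not settle: words are lowercased and de-duplicated per question before counting `word_Common` and `word_Total`. The function name `basic_features` is hypothetical, and the frequency features are left out.

```python
def basic_features(q1: str, q2: str) -> dict:
    """Length and word-overlap features for one question pair."""
    # Assumption: unique lowercase words per question.
    w1 = {w.lower().strip() for w in q1.split(" ") if w.strip()}
    w2 = {w.lower().strip() for w in q2.split(" ") if w.strip()}
    word_common = len(w1 & w2)
    word_total = len(w1) + len(w2)
    return {
        "q1len": len(q1),
        "q2len": len(q2),
        "q1_n_words": len(q1.split()),
        "q2_n_words": len(q2.split()),
        "word_Common": word_common,
        "word_Total": word_total,
        # Guard against two empty questions.
        "word_share": word_common / word_total if word_total else 0.0,
    }


feats = basic_features("What is machine learning", "What is deep learning")
print(feats)
```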

### Analysis of some of the extracted features

In [16]:
# Feature: word_share

# Feature: word_Common

# The distributions of the word_Common feature in similar and non-similar questions are highly overlapping

### Preprocessing of Text


In [17]:
# Preprocessing:

# - Removing html tags
# - Removing punctuation
# - Performing stemming
# - Removing Stopwords
# - Expanding contractions etc.
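
A sketch of part of this pipeline using only the standard library. It covers HTML tags, contractions, and punctuation. Stemming and stop-word removal are left out here because they need NLTK resources. The `CONTRACTIONS` table is a short hypothetical sample, not the notebook's list.

```python
import re

# A few illustrative contractions; a real list would be longer.
CONTRACTIONS = {
    "can't": "cannot",
    "won't": "will not",
    "what's": "what is",
    "i'm": "i am",
    "n't": " not",
    "'re": " are",
    "'ll": " will",
}


def preprocess(text: str) -> str:
    text = str(text).lower().strip()
    text = text.replace("\u2019", "'")      # normalise curly apostrophes
    text = re.sub(r"<[^>]+>", " ", text)    # drop HTML tags
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)    # expand contractions
    text = re.sub(r"[^\w\s]", " ", text)    # drop punctuation
    return re.sub(r"\s+", " ", text).strip()


cleaned = preprocess("<b>What's</b> the best way, if I can't code?")
print(cleaned)
```

Expanding contractions before stripping punctuation matters: once the apostrophe is gone, "can't" can no longer be told apart from "cant".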

### Advanced Feature Extraction (NLP and Fuzzy Features)

In [18]:
# Definitions:

# Token: You get a token by splitting a sentence on spaces
# Stop_Word: Stop words as per NLTK
# Word: A token that is not a stop_word

In [20]:
# Converting the Sentence into Tokens: 
# Get the non-stopwords in Questions
# Get the stopwords in Questions
# Get the common non-stopwords from Question pair
# Get the common stopwords from Question pair
# Get the common Tokens from Question pair
# Whether the last word of both questions is the same
# Whether the first word of both questions is the same
# Average token length of both questions
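
The steps in this cell can be sketched as one function. Two things here are assumptions: the tiny `STOP_WORDS` set stands in for NLTK's English list, and the `*_min` ratios are taken to mean "common count divided by the smaller of the two counts". The feature names `cwc_min`, `csc_min`, and `ctc_min` appear later in the notebook, but their formulas are not spelled out there.

```python
# Small illustrative stop-word set; the notebook uses NLTK's English list.
STOP_WORDS = {"what", "is", "the", "a", "of", "to", "in", "how", "do", "i"}
SAFE_DIV = 1e-4  # avoids division by zero


def token_features(q1: str, q2: str) -> dict:
    q1_tokens, q2_tokens = q1.split(), q2.split()
    if not q1_tokens or not q2_tokens:
        return {}

    # Non-stop-words and stop-words in each question.
    q1_words = {t for t in q1_tokens if t not in STOP_WORDS}
    q2_words = {t for t in q2_tokens if t not in STOP_WORDS}
    q1_stops = {t for t in q1_tokens if t in STOP_WORDS}
    q2_stops = {t for t in q2_tokens if t in STOP_WORDS}

    # Counts shared by the pair.
    common_words = len(q1_words & q2_words)
    common_stops = len(q1_stops & q2_stops)
    common_tokens = len(set(q1_tokens) & set(q2_tokens))

    return {
        "cwc_min": common_words / (min(len(q1_words), len(q2_words)) + SAFE_DIV),
        "csc_min": common_stops / (min(len(q1_stops), len(q2_stops)) + SAFE_DIV),
        "ctc_min": common_tokens / (min(len(q1_tokens), len(q2_tokens)) + SAFE_DIV),
        "last_word_eq": int(q1_tokens[-1] == q2_tokens[-1]),
        "first_word_eq": int(q1_tokens[0] == q2_tokens[0]),
        "mean_len": (len(q1_tokens) + len(q2_tokens)) / 2,
    }


f = token_features("what is the capital of france",
                   "what is the capital city of france")
print(f)
```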

In [21]:
# Get the longest common substring
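
As the import comments note, this can be done with hand-written dynamic programming instead of a package. The sketch below is one such version. The ratio at the end (substring length over the shorter question's length plus one) is an assumed normalisation, not something this cell states.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Return one longest contiguous substring shared by a and b."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common run that ended at the previous characters.
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


def longest_substr_ratio(a: str, b: str) -> float:
    # Assumed normalisation: the +1 keeps the ratio defined for empty strings.
    return len(longest_common_substring(a, b)) / (min(len(a), len(b)) + 1)


print(longest_common_substring("abcdef", "zcdey"))
print(longest_substr_ratio("abcdef", "zcdey"))
```

Keeping only the previous row of the table holds memory at O(len(b)) while the running time stays O(len(a) * len(b)).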

In [22]:
# preprocessing each question

In [23]:
# Merging Features with dataset

In [24]:
# Computing fuzzy features and merging them with the dataset

# For background, read this blog: http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/

In [25]:
# The token sort approach involves tokenizing the string in question, sorting the tokens alphabetically, and
# then joining them back into a string. We then compare the transformed strings with a simple ratio().
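
To make the token sort idea concrete, here is a sketch that does not call fuzzywuzzy at all. It substitutes the standard library's `difflib.SequenceMatcher` for the simple `ratio()`, so scores may differ slightly from the library's. The function names are hypothetical.

```python
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> int:
    """Similarity of two strings on a 0-100 scale (difflib stand-in)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def token_sort_ratio(a: str, b: str) -> int:
    """Sort the tokens of each string alphabetically, rejoin, then compare."""
    sorted_a = " ".join(sorted(a.lower().split()))
    sorted_b = " ".join(sorted(b.lower().split()))
    return simple_ratio(sorted_a, sorted_b)


s1 = "new york mets vs atlanta braves"
s2 = "atlanta braves vs new york mets"

print(simple_ratio(s1, s2))      # penalised by the different word order
print(token_sort_ratio(s1, s2))  # word order no longer matters
```

Because both strings contain the same tokens, sorting makes them identical, which is why this feature is robust to reordered questions.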

### Analysis of extracted features

In [26]:
# Plotting word clouds
# Creating word clouds of duplicate and non-duplicate question pairs
# We can observe the most frequently occurring words

In [31]:
# Convert the 2D array of q1 and q2 into a flat array, e.g. {{1,2},{3,4}} to {1,2,3,4}
# Required to avoid Unicode issues
# Save the NumPy array to a text file
# Read the text files and remove the stop words:

### Pair plot of features ['ctc_min', 'cwc_min', 'csc_min', 'token_sort_ratio'] 

In [32]:
# Distribution of the token_sort_ratio

### Visualization

In [34]:
# Using t-SNE to reduce the 15 features (generated after cleaning the data) to 3 dimensions
# Draw the plot in the appropriate place in the grid
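
A minimal sketch of the reduction step only, without the plotting. The feature matrix is random stand-in data, and both the `MinMaxScaler` step and the `perplexity` value are assumptions chosen so the example runs quickly on 60 points.

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data: 60 question pairs x 15 engineered features.
rng = np.random.RandomState(0)
X = rng.rand(60, 15)

# Scale features to [0, 1] so no single feature dominates the distances.
X_scaled = MinMaxScaler().fit_transform(X)

# Reduce to 3 dimensions; perplexity must be smaller than the number of samples.
tsne = TSNE(n_components=3, perplexity=10, init="random", random_state=0)
embedding = tsne.fit_transform(X_scaled)

print(embedding.shape)
```

On the real data a sample of the rows is usually enough, since t-SNE scales poorly with the number of points.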

### Featurizing text data with tfidf weighted word-vectors

In [35]:
# extract word2vec vectors

In [37]:
# avoid decoding problems
# encode questions to unicode
# https://stackoverflow.com/a/6812069
# df['question1'] = df['question1'].apply(lambda x: unicode(str(x),"utf-8"))
# df['question2'] = df['question2'].apply(lambda x: unicode(str(x),"utf-8"))

In [39]:
# Now merge texts
# dict key:word and value:tf-idf score

In [40]:
# After we find the TF-IDF scores, we convert each question to a weighted average of word vectors, using these scores as weights.
# Here we use a pre-trained GloVe model which comes free with spaCy. https://spacy.io/usage/vectors-similarity
# It is trained on Wikipedia and is therefore stronger in terms of word semantics.

In [41]:
# en_vectors_web_lg, which includes over 1 million unique vectors.
# tqdm is used to print the progress bar
# 384 is the number of dimensions of vectors 
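
The weighting scheme can be illustrated without spaCy. In this sketch the word vectors are random 4-dimensional placeholders (an assumption made purely so the code runs on its own), and the IDF is computed by hand with the smoothed formula that scikit-learn's `TfidfVectorizer` uses by default.

```python
import math

import numpy as np

questions = [
    "how do i learn python",
    "what is the best way to learn python",
    "why is the sky blue",
]

# Smoothed inverse document frequency per word: ln((1 + n) / (1 + df)) + 1.
n_docs = len(questions)
doc_freq = {}
for q in questions:
    for w in set(q.split()):
        doc_freq[w] = doc_freq.get(w, 0) + 1
idf = {w: math.log((1 + n_docs) / (1 + df)) + 1 for w, df in doc_freq.items()}

# Hypothetical 4-dimensional word vectors; the notebook takes these from spaCy.
DIM = 4
rng = np.random.RandomState(0)
word_vectors = {w: rng.rand(DIM) for w in idf}


def weighted_question_vector(question: str) -> np.ndarray:
    """IDF-weighted average of the word vectors in one question."""
    vec = np.zeros(DIM)
    weight_sum = 0.0
    for w in question.split():
        if w in word_vectors:
            vec += idf[w] * word_vectors[w]
            weight_sum += idf[w]
    return vec / weight_sum if weight_sum else vec


v = weighted_question_vector(questions[0])
print(v.shape)
```

Rare words such as "sky" get a larger weight than common ones such as "is", so they pull the question vector more strongly toward their own direction.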

In [42]:
# prepro_features_train.csv (Simple Preprocessing Features)
# nlp_features_train.csv (NLP Features)

In [43]:
# dataframe of nlp features

In [44]:
# data before preprocessing 

In [45]:
# Questions 1 tfidf weighted word2vec
# Questions 2 tfidf weighted word2vec

### Machine Learning Models

#### Reading data from file and storing into sql table

In [47]:
# Creating db file from csv
# try to sample the data according to the computing power you have
# for selecting the first 100K rows (plus one extra row that is removed below)
# data = pd.read_sql_query("""SELECT * FROM data LIMIT 100001;""", conn_r)
# for selecting random points
# remove the first row
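
A sketch of the two selection styles mentioned above, using the standard library's `sqlite3` with an in-memory database. The table layout and the three columns are invented for illustration; the notebook's own table is built from the feature CSV.

```python
import sqlite3

# In-memory database for the sketch; the notebook writes a .db file instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (id INTEGER, word_share REAL, is_duplicate INTEGER)")

# Hypothetical rows standing in for the feature CSV.
rows = [(i, i / 10.0, i % 2) for i in range(10)]
conn.executemany("INSERT INTO data VALUES (?, ?, ?)", rows)
conn.commit()

# Selecting the first few rows.
first_rows = conn.execute("SELECT * FROM data LIMIT 3").fetchall()

# Selecting random rows.
random_rows = conn.execute("SELECT * FROM data ORDER BY RANDOM() LIMIT 3").fetchall()

conn.close()
print(first_rows)
print(random_rows)
```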

### Converting strings to numerics

In [48]:
# after reading from the sql table, each entry is a string
# we convert all the features to numeric before we apply any model
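
A short sketch of this conversion, assuming the rows have been loaded into a pandas DataFrame of strings. The column names and values are made up.

```python
import pandas as pd

# Hypothetical frame as it might look after reading from SQL: everything is a string.
data = pd.DataFrame({
    "word_share": ["0.25", "0.5"],
    "q1len": ["31", "47"],
    "is_duplicate": ["0", "1"],
})

# Separate the target and cast it to int.
y_true = data["is_duplicate"].astype(int).tolist()

# Convert every remaining feature column to a numeric dtype.
features = data.drop(columns=["is_duplicate"]).apply(pd.to_numeric)

print(features.dtypes)
print(y_true)
```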

### Random train test split (70:30)
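
A 70:30 split like the one named in this heading might look as follows. The data is random stand-in data, and `stratify=y` is an assumption that keeps the class ratio the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 pairs, 5 features, 37% duplicates.
rng = np.random.RandomState(0)
X = rng.rand(100, 5)
y = np.array([0] * 63 + [1] * 37)

# 70% train, 30% test, preserving the class ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=0
)

print(X_train.shape, X_test.shape)
```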

In [49]:
# This function plots the confusion matrices given y_i, y_i_hat.
from sklearn.metrics import confusion_matrix

def plot_confusion_matrix(test_y, predict_y):
    C = confusion_matrix(test_y, predict_y)
    # C is a 2x2 matrix here (binary target); each cell (i,j) is the number of points of class i predicted as class j

    A = (((C.T) / (C.sum(axis=1))).T)
    # recall matrix: divide each element of the confusion matrix by the sum of the elements in its row
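
To see what the transpose trick in `A` does, here is a small numeric sketch with a made-up 2x2 confusion matrix. The column-normalised matrix `B` is an addition for comparison and is not taken from the function above.

```python
import numpy as np

# Hypothetical confusion matrix: rows are true classes, columns are predicted classes.
C = np.array([[50, 10],
              [5, 35]])

# Recall matrix: each row divided by its row sum (same expression as in the function).
A = (C.T / C.sum(axis=1)).T

# Precision matrix: each column divided by its column sum.
B = C / C.sum(axis=0)

print(A)
print(B)
```

Rows of `A` sum to 1, so its diagonal holds the per-class recall. Columns of `B` sum to 1, so its diagonal holds the per-class precision.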

### Building a random model (Finding worst-case log-loss) 

In [50]:
# we need to generate 2 numbers (one per class) and the sum of the numbers should be 1
# one solution is to generate 2 numbers and divide each of the numbers by their sum
# we create an output array that has exactly the same size as the CV data
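
The random baseline can be sketched with NumPy alone. The labels are randomly generated stand-ins, and the log loss is computed by hand here instead of with a library call.

```python
import numpy as np

rng = np.random.RandomState(42)

# Hypothetical binary labels standing in for the CV or test targets.
y_true = rng.randint(0, 2, size=1000)

# For each point, draw 2 random numbers and divide by their sum
# so that each row is a valid probability pair.
raw = rng.rand(len(y_true), 2)
probs = raw / raw.sum(axis=1, keepdims=True)

# Log loss: mean negative log of the probability given to the true class.
eps = 1e-15
p_true = np.clip(probs[np.arange(len(y_true)), y_true], eps, 1 - eps)
random_log_loss = -np.mean(np.log(p_true))

print(random_log_loss)
```

Any trained model should score below this value. A model that cannot beat random probabilities has learned nothing useful from the features.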

### Logistic Regression with hyperparameter tuning 

In [51]:
# read more about SGDClassifier() at http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html

# default parameters
# SGDClassifier(loss='hinge', penalty='l2', alpha=0.0001, l1_ratio=0.15, fit_intercept=True, max_iter=None, tol=None,
# shuffle=True, verbose=0, epsilon=0.1, n_jobs=1, random_state=None, learning_rate='optimal', eta0=0.0, power_t=0.5,
# class_weight=None, warm_start=False, average=False, n_iter=None)

# some of the methods
# fit(X, y[, coef_init, intercept_init, …]): Fit linear model with Stochastic Gradient Descent.
# predict(X): Predict class labels for samples in X.

### Linear SVM with hyperparameter tuning

In [52]:
alpha = [10 ** x for x in range(-5, 2)] # hyperparam for SGD classifier.

# default parameters
# SGDClassifier(loss='hinge', penalty='l2', alpha=0.0001, l1_ratio=0.15, fit_intercept=True, max_iter=None, tol=None,
# shuffle=True, verbose=0, epsilon=0.1, n_jobs=1, random_state=None, learning_rate='optimal', eta0=0.0, power_t=0.5,
# class_weight=None, warm_start=False, average=False, n_iter=None)

# some of the methods
# fit(X, y[, coef_init, intercept_init, …]): Fit linear model with Stochastic Gradient Descent.
# predict(X): Predict class labels for samples in X.
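
A sketch of how the `alpha` grid above could be tuned for a linear SVM. Several parts are assumptions: the data comes from `make_classification`, not from the question-pair features; `CalibratedClassifierCV` is added so that the hinge-loss model yields probabilities; and the best `alpha` is picked by validation log loss.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Hypothetical data standing in for the engineered features.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_cv, y_train, y_cv = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=0
)

alpha = [10 ** x for x in range(-5, 2)]  # same grid as in the cell above

cv_losses = []
for a in alpha:
    # Hinge loss with an L2 penalty gives a linear SVM trained by SGD.
    clf = SGDClassifier(loss="hinge", penalty="l2", alpha=a, random_state=42)
    # Hinge loss has no predict_proba, so calibrate the scores into probabilities.
    calibrated = CalibratedClassifierCV(clf, method="sigmoid")
    calibrated.fit(X_train, y_train)
    cv_losses.append(log_loss(y_cv, calibrated.predict_proba(X_cv)))

best_alpha = alpha[int(np.argmin(cv_losses))]
print("best alpha:", best_alpha)
```

The same loop would apply to the logistic regression section by changing the `loss` argument. The accepted name for the logistic loss differs between scikit-learn versions, so check the installed version's documentation.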

### XGBoost

In [56]:
# XGBoost is used for the final implementation.
# Of all the algorithms tried, XGBoost gives the best result.
# Its test log loss is 0.357054433715, which is lower than that of the other algorithms.