# Quora Question Pairs (Complete Case Study)

## Introduction 
**Quora** is a place to gain and share knowledge. It is a platform to ask questions and connect with people who contribute unique insights and quality answers.<br>
When this competition was first hosted on Kaggle (5 years ago), the site was receiving more than 100 million visitors every month; that number has since grown to 300 million visitors per month. Anyone can ask a question about any topic they want and get answers and insights from others.


## The Problem from Business Point of View
Because such a large number of visitors can ask similar questions (for example, questions with slightly different wording but the same meaning), Quora tries to identify questions that are similar enough to be considered duplicates.<br>
Once duplicates are identified, Quora can answer a newly asked question instantly, based on its high similarity to a question that has already been answered.<br>
This can greatly improve the user experience (UX) in the web or mobile application, for three reasons:<br>
- It provides answers instantly to the person asking.<br>
- It surfaces the best answers, which may already have been upvoted by many people on the older question.<br>
- It helps users who are searching for a question find the right answer quickly instead of being overwhelmed by many near-identical questions.


## Real World Business Constraints
1.	**No latency concerns**: this process does not need to finish in milliseconds. It may take a few seconds or minutes, because an accurate similarity prediction matters more than speed.
2.	**The cost of misclassification can be very high**: if someone asks a question and the system recommends answers from a question that seems similar but is not actually a duplicate, the user receives wrong answers, which badly affects the user experience.
3.	You must **define a threshold** on the predicted probability, above which a pair is treated as duplicate.
4.	**Interpretability is partially important**: it is useful to know what the classification depends on.
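
Constraint 3 can be illustrated with a minimal sketch. The probabilities and the threshold values below are made up for illustration; in practice the threshold would be tuned on held-out data so that non-duplicates are rarely flagged as duplicates (constraint 2).

```python
def flag_duplicates(probabilities, threshold=0.8):
    """Return 1 where the predicted duplicate probability reaches the threshold, else 0."""
    return [1 if p >= threshold else 0 for p in probabilities]


# Hypothetical model outputs for five question pairs.
predicted = [0.95, 0.81, 0.55, 0.30, 0.79]

# A stricter threshold flags fewer pairs, trading missed duplicates for fewer wrong suggestions.
strict = flag_duplicates(predicted, threshold=0.8)
lenient = flag_duplicates(predicted, threshold=0.5)

print(strict)
print(lenient)
```

Raising the threshold reduces the chance of suggesting a wrong answer, at the price of missing some true duplicates.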


## Data 
-	CSV file for training data: **train.csv**.
-	Size of the file: 60.4 MB.
-	Every record contains two question IDs, the text of each question, and a binary label indicating whether the pair is a duplicate.
-	The number of rows, the features, and sample records are explored below.

## The Problem from Machine Learning Point of View
We can frame this as a binary classification problem, since we need to predict whether a pair of questions is duplicate or not.<br>
Put simply, we need to learn a function f(q1, q2) → {0, 1}.

Evaluation metrics:
-	**Log-loss**, the metric Quora specified on the Kaggle competition page; it is a natural choice because the model outputs a probability.
-	**Binary confusion matrix**, to examine the cost of misclassification as we go and to support interpretability.
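
As a rough sketch of the two metrics, the snippet below computes binary log-loss and a 2x2 confusion matrix in plain Python. The labels, probabilities, and the 0.5 cut-off are toy values made up for illustration, and the helper functions are written out only to show the arithmetic; scikit-learn offers `sklearn.metrics.log_loss` and `sklearn.metrics.confusion_matrix` for real use.

```python
import math


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


def confusion_matrix(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for binary labels (rows: actual, columns: predicted)."""
    matrix = [[0, 0], [0, 0]]
    for y, y_hat in zip(y_true, y_pred):
        matrix[y][y_hat] += 1
    return matrix


# Toy labels and predicted probabilities (illustrative only).
y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.1, 0.6, 0.7]
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]

loss = log_loss(y_true, y_prob)
cm = confusion_matrix(y_true, y_pred)

print(round(loss, 4))
print(cm)
```

Log-loss penalises confident wrong predictions heavily, which fits the high misclassification cost described above, while the confusion matrix shows where the errors fall (false positives versus false negatives).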


## Exploratory Data Analysis 
Now let's learn more about our data.


#### Importing required libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline 
import seaborn as sns

#### Reading data and basic stats

In [5]:
# importing the data (one csv file)
df = pd.read_csv("train.csv")
# printing the shape to know how many columns (features) and rows (records)
df.shape

(404290, 6)

In [6]:
# take a look at the head to see data sample 
df.head()

| | id | qid1 | qid2 | question1 | question2 | is_duplicate |
|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 2 | What is the step by step guide to invest in sh... | What is the step by step guide to invest in sh... | 0 |
| 1 | 1 | 3 | 4 | What is the story of Kohinoor (Koh-i-Noor) Dia... | What would happen if the Indian government sto... | 0 |
| 2 | 2 | 5 | 6 | How can I increase the speed of my internet co... | How can Internet speed be increased by hacking... | 0 |
| 3 | 3 | 7 | 8 | Why am I mentally very lonely? How can I solve... | Find the remainder when [math]23^{24}[/math] i... | 0 |
| 4 | 4 | 9 | 10 | Which one dissolve in water quikly sugar, salt... | Which fish would survive in salt water? | 0 |


In [10]:
# more info about the fields to make a proper description of the data
# using info() rather than describe() because the numeric columns are only IDs and the label,
# and also to find the number of non-null values per column
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 404290 entries, 0 to 404289
Data columns (total 6 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   id            404290 non-null  int64 
 1   qid1          404290 non-null  int64 
 2   qid2          404290 non-null  int64 
 3   question1     404289 non-null  object
 4   question2     404288 non-null  object
 5   is_duplicate  404290 non-null  int64 
dtypes: int64(4), object(2)
memory usage: 18.5+ MB


#### Our data consists of 404290 rows and 6 columns.
##### The 6 basic columns: 
- id: unique id for the record (row).
- qid1 and qid2: unique identifiers for the two questions in the record.
- question1 and question2: the actual text content of the two questions.
- is_duplicate: the label we are trying to learn and predict in the final model; it indicates whether the two questions are similar enough to be considered duplicates.
<br><br>
##### We have only 3 null values in the data: 1 missing in question1 and 2 missing in question2.
*	This fraction, 3/404290 (about 0.0007%), is negligible, so we can safely drop these rows or fill them in without affecting the analysis.
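
The two options for handling the missing questions can be sketched as follows. The small DataFrame is a made-up stand-in for `train.csv` so the snippet runs on its own; on the real data the same calls would be applied to `df`.

```python
import pandas as pd

# Toy stand-in for train.csv with the same text columns; the rows are made up.
toy = pd.DataFrame({
    "question1": ["How do I learn Python?", None, "What is AI?"],
    "question2": ["How can I learn Python?", "What is ML?", None],
    "is_duplicate": [1, 0, 0],
})

# Count missing values per column.
null_counts = toy.isnull().sum()

# Option 1: drop rows where either question is missing.
dropped = toy.dropna(subset=["question1", "question2"])

# Option 2: keep the rows and replace missing text with an empty string.
filled = toy.fillna({"question1": "", "question2": ""})

print(null_counts)
print(len(dropped), len(filled))
```

With only 3 affected rows out of 404290, either option should have a negligible effect on later results; dropping is the simpler choice, while filling keeps the row count unchanged.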
