BT4222 Project
The dataset used in this project is publicly available on Kaggle, provided by Quora. It consists of 404,352 question pairs with 6 fields: a unique identifier for the question pair, unique identifiers for the first and second questions, the full text of each question, and the duplicate label. The data can be found here: Kaggle Quora Competition Website.
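A minimal sketch of loading the question pairs, assuming the Kaggle competition's CSV layout (the `train.csv` path and column names follow the competition file and are not taken from this project's code):

```python
import csv

def load_pairs(path):
    """Yield (question1, question2, is_duplicate) tuples from the Kaggle CSV.

    Column names follow the competition's train.csv; adjust if your copy
    of the data uses a different layout.
    """
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            yield row["question1"], row["question2"], int(row["is_duplicate"])
```

In practice a library like pandas would be the idiomatic choice; the stdlib version above just shows the record structure.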
The project pipeline covers the following steps:
- Data Validation
- Feature Engineering
- Feature Encoding
- Data Pre-processing
- Feature Selection
For pre-processing, we combined a number of different techniques. After common steps such as stop-word removal and lemmatization, we used bag-of-words representations for topic modeling and word2vec to build our embedding layer. Other pre-processing steps included scaling the numerical features and one-hot encoding the generated topics.
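A minimal sketch of the text-cleaning step, using a toy stop-word list for illustration (the project used a full stop-word list and lemmatization from standard NLP libraries, which are omitted here to keep the example self-contained):

```python
import re

# Toy stop-word list for illustration only; a real pipeline would use a
# full list (e.g. from NLTK) and add lemmatization.
STOP_WORDS = {"a", "an", "the", "is", "are", "what", "how", "of", "to"}

def preprocess(text):
    """Lower-case, strip punctuation, and drop stop words."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]
```

The resulting token lists feed both the bag-of-words topic model and the word2vec embedding layer.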
Here are some example features:
- NLP Features
  - Latent Dirichlet Allocation (topic modeling)
  - Common Ratio
  - Frequency
  - Doc2vec
  - Word2vec
  - Similarity measures (Jaccard)
  - Sentiment
  - Fuzzy
- Distance Features
  - Canberra
  - Euclidean
  - Cosine
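Two of the features above can be sketched directly on token lists: Jaccard similarity over token sets, and cosine distance over bag-of-words count vectors (the Canberra and Euclidean distances follow the same pattern):

```python
import math
from collections import Counter

def jaccard(tokens_a, tokens_b):
    """Jaccard similarity between two token sets."""
    a, b = set(tokens_a), set(tokens_b)
    return len(a & b) / len(a | b) if a | b else 0.0

def cosine_distance(tokens_a, tokens_b):
    """Cosine distance between bag-of-words count vectors."""
    ca, cb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return 1.0 - dot / (na * nb) if na and nb else 1.0
```

In the project these distances were computed on embedding vectors (word2vec/doc2vec) as well as on raw counts; the count-vector version above is just the simplest illustration.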
We trained and compared the following models:
- Logistic Regression
- Random Forest
- LightGBM
- XGBoost
- BERT
- MLP
- Siamese BiLSTM
- Stacking
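The stacking setup can be sketched with scikit-learn; for a self-contained example, the actual base learners (LightGBM, XGBoost, and the neural models) are swapped for stand-ins available in scikit-learn, and the data is synthetic:

```python
# Sketch only: stand-in base learners and synthetic data, not the
# project's actual stacking configuration.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ],
    final_estimator=LogisticRegression(),  # meta-learner on base predictions
    cv=3,  # out-of-fold predictions avoid leaking training labels
)
stack.fit(X, y)
```

The key design point is the `cv` argument: the meta-learner is fit on out-of-fold predictions from the base models, so it never sees predictions made on data the base models were trained on.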
We also tried a Manhattan-distance Siamese LSTM and a Support Vector Machine, but both were discarded due to poor performance or long training times.
- Web Scraping
We scraped a random sample of questions from Stack Overflow (since crawling Quora is against its rules) to test how our model would perform in production.
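A minimal sketch of the extraction step using only the standard library; the `question-hyperlink` CSS class is an assumption about Stack Overflow's listing-page markup (it may change), and a real scraper would also respect robots.txt and rate limits:

```python
from html.parser import HTMLParser

class QuestionTitleParser(HTMLParser):
    """Collect question titles from a Stack Overflow listing page.

    Assumes titles appear in <a class="question-hyperlink"> links,
    which is an assumption about the page markup, not a stable API.
    """

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_link = False

    def handle_starttag(self, tag, attrs):
        if tag == "a" and "question-hyperlink" in (dict(attrs).get("class") or ""):
            self._in_link = True

    def handle_data(self, data):
        if self._in_link and data.strip():
            self.titles.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_link = False
```

Usage: fetch a listing page (e.g. with `urllib.request`), then call `parser.feed(html)` and read `parser.titles`; the collected titles were then run through the same pre-processing pipeline as the training data.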