BT4222 Project
The dataset used in this project is publicly available on Kaggle, provided by Quora. It consists of 404,352 question pairs with 6 fields: a unique identifier for the question pair, unique identifiers for the first and second questions, the full text of each question, and the duplicate label. The data can be found here: Kaggle Quora Competition Website.
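A minimal sketch of loading the question pairs, assuming the Kaggle competition's CSV layout (the `train.csv` path and column names follow the competition file and are not taken from this project's code):

```python
import csv

def load_pairs(path):
    """Yield (question1, question2, is_duplicate) tuples from the Kaggle CSV.

    Column names follow the competition's train.csv; adjust if your copy
    of the data uses a different layout.
    """
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            yield row["question1"], row["question2"], int(row["is_duplicate"])
```

In practice a library like pandas would be the idiomatic choice; the stdlib version above just shows the record structure.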
The project pipeline covers the following steps:
- Data Validation
- Feature Engineering
- Feature Encoding
- Data Pre-processing
- Feature Selection
For pre-processing, we combined a number of different techniques. After common steps such as stop-word removal and lemmatization, we used bag-of-words representations for topic modeling and word2vec to build our embedding layer. Other pre-processing steps included scaling the numerical features and one-hot encoding the generated topics.
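A minimal sketch of the text-cleaning step, using a toy stop-word list for illustration (the project used a full stop-word list and lemmatization from standard NLP libraries, which are omitted here to keep the example self-contained):

```python
import re

# Toy stop-word list for illustration only; a real pipeline would use a
# full list (e.g. from NLTK) and add lemmatization.
STOP_WORDS = {"a", "an", "the", "is", "are", "what", "how", "of", "to"}

def preprocess(text):
    """Lower-case, strip punctuation, and drop stop words."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]
```

The resulting token lists feed both the bag-of-words topic model and the word2vec embedding layer.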
Here are some example features:
- NLP Features
  - Latent Dirichlet Allocation (topic modeling)
  - Common Ratio
  - Frequency
  - Doc2vec
  - Word2vec
  - Similarity measures (Jaccard)
  - Sentiment
  - Fuzzy
- Distance Features
  - Canberra
  - Euclidean
  - Cosine
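Two of the features above can be sketched directly on token lists: Jaccard similarity over token sets, and cosine distance over bag-of-words count vectors (the Canberra and Euclidean distances follow the same pattern):

```python
import math
from collections import Counter

def jaccard(tokens_a, tokens_b):
    """Jaccard similarity between two token sets."""
    a, b = set(tokens_a), set(tokens_b)
    return len(a & b) / len(a | b) if a | b else 0.0

def cosine_distance(tokens_a, tokens_b):
    """Cosine distance between bag-of-words count vectors."""
    ca, cb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return 1.0 - dot / (na * nb) if na and nb else 1.0
```

In the project these distances were computed on embedding vectors (word2vec/doc2vec) as well as on raw counts; the count-vector version above is just the simplest illustration.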
We trained and compared the following models:
- Logistic Regression
- Random Forest
- LightGBM
- XGBoost
- BERT
- MLP
- Siamese BiLSTM
- Stacking
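The stacking setup can be sketched with scikit-learn; for a self-contained example, the actual base learners (LightGBM, XGBoost, and the neural models) are swapped for stand-ins available in scikit-learn, and the data is synthetic:

```python
# Sketch only: stand-in base learners and synthetic data, not the
# project's actual stacking configuration.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ],
    final_estimator=LogisticRegression(),  # meta-learner on base predictions
    cv=3,  # out-of-fold predictions avoid leaking training labels
)
stack.fit(X, y)
```

The key design point is the `cv` argument: the meta-learner is fit on out-of-fold predictions from the base models, so it never sees predictions made on data the base models were trained on.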
We also tried a Manhattan-distance Siamese LSTM and a Support Vector Machine, but both were discarded due to poor performance or long training times.
- Web Scraping
We scraped a random sample of questions from Stack Overflow (since crawling Quora is against its rules) to test how our model would perform in production.
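A minimal sketch of the extraction step using only the standard library; the `question-hyperlink` CSS class is an assumption about Stack Overflow's listing-page markup (it may change), and a real scraper would also respect robots.txt and rate limits:

```python
from html.parser import HTMLParser

class QuestionTitleParser(HTMLParser):
    """Collect question titles from a Stack Overflow listing page.

    Assumes titles appear in <a class="question-hyperlink"> links,
    which is an assumption about the page markup, not a stable API.
    """

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_link = False

    def handle_starttag(self, tag, attrs):
        if tag == "a" and "question-hyperlink" in (dict(attrs).get("class") or ""):
            self._in_link = True

    def handle_data(self, data):
        if self._in_link and data.strip():
            self.titles.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_link = False
```

Usage: fetch a listing page (e.g. with `urllib.request`), then call `parser.feed(html)` and read `parser.titles`; the collected titles were then run through the same pre-processing pipeline as the training data.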