quora-ques

Summary of our work on finding similar questions in the Quora Questions data set.

As a first Kaggle assignment, we started by reading a top-rated Notebook [https://www.kaggle.com/anokas/data-analysis-xgboost-starter-0-35460-lb] by anokas. The Notebook analyzes some basic properties of the given data, such as data size, the number of positive examples, maximum sentence size, word frequency, TF-IDF, and the words shared between the two questions.

Before applying a neural network to this problem, we decided to try more basic techniques first. Step 1: Machine Learning algorithms such as Decision Trees, Random Forest, and boosting algorithms. Step 2: Logistic Regression. Step 3: RNN.

However, steps 1 and 2 require extracting features by hand. The two features we ended up using are word_share_match and tfidf_word_share_match. Using these two features, we implemented Logistic Regression, Decision Trees, Random Forest, and boosting algorithms.
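To make the two features more concrete, here is a minimal sketch of how a word-share score and a weighted word-share score could be computed. The function names (`word_share`, `rarity_weights`, `weighted_word_share`), the whitespace tokenization, and the inverse-count weighting are all assumptions made for illustration; the notebooks may define word_share_match and tfidf_word_share_match differently.

```python
from collections import Counter


def tokens(question):
    """Lowercase and split on whitespace (a simplifying assumption)."""
    return set(question.lower().split())


def word_share(q1, q2):
    """Share of distinct words that appear in both questions, in [0, 1]."""
    w1, w2 = tokens(q1), tokens(q2)
    if not w1 or not w2:
        return 0.0
    return 2 * len(w1 & w2) / (len(w1) + len(w2))


def rarity_weights(questions, smoothing=1.0):
    """Give rare words a larger weight: 1 / (document count + smoothing).

    This is a simple stand-in for a TF-IDF style weight, not the exact formula
    used in any particular notebook.
    """
    counts = Counter(w for q in questions for w in tokens(q))
    return {w: 1.0 / (c + smoothing) for w, c in counts.items()}


def weighted_word_share(q1, q2, weights):
    """Like word_share, but each word counts by its weight instead of 1."""
    w1, w2 = tokens(q1), tokens(q2)
    total = sum(weights.get(w, 0.0) for w in w1) + sum(weights.get(w, 0.0) for w in w2)
    if total == 0:
        return 0.0
    shared = sum(weights.get(w, 0.0) for w in w1 & w2)
    return 2 * shared / total


corpus = ["how do i learn python", "how do i learn java", "what is python"]
weights = rarity_weights(corpus)
plain = word_share(corpus[0], corpus[1])
weighted = weighted_word_share(corpus[0], corpus[1], weights)
print(plain, weighted)
```

In this toy corpus the weighted score comes out lower than the plain score, because the one word that differs ("java") is rarer than the shared words and therefore carries more weight.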

The Gradient Boosting model scored 0.35535 on the leaderboard.

Later, an RNN was applied using a Dual Encoder LSTM. Interestingly, while the model was being fitted on the training data, accuracy was increasing even though logloss was also increasing. Why can accuracy increase while logloss is also increasing? (Arjun, please comment.)
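One possible explanation, offered as a sketch rather than a diagnosis of our model: accuracy only checks which side of the 0.5 threshold a prediction falls on, while logloss also punishes confident mistakes heavily. The toy predictions below are made up for illustration (`epoch_a` and `epoch_b` are hypothetical, not outputs of our LSTM). They show a case where more examples are classified correctly, yet a single very confident wrong prediction pushes logloss up.

```python
import math


def log_loss(y_true, p_pred, eps=1e-15):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


def accuracy(y_true, p_pred, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    correct = sum((p >= threshold) == bool(y) for y, p in zip(y_true, p_pred))
    return correct / len(y_true)


y_true = [1, 1, 0, 0]
epoch_a = [0.60, 0.45, 0.40, 0.55]  # cautious predictions, 2 of 4 correct
epoch_b = [0.90, 0.60, 0.10, 0.99]  # 3 of 4 correct, one very confident mistake

for name, preds in (("epoch_a", epoch_a), ("epoch_b", epoch_b)):
    print(name, "accuracy:", accuracy(y_true, preds), "logloss:", round(log_loss(y_true, preds), 4))
```

If something like this is happening during training, the model is getting more confident overall, including on the examples it gets wrong.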

Current state:

We could not compute or submit the NN's score on the test set because we are facing technical issues with the submission.

Understanding ROC_AUC (just for information)

We learnt about metrics such as ROC_AUC and accuracy. What is ROC? It is a plot of the True Positive Rate (on the Y-axis) against the False Positive Rate (on the X-axis) for every possible classification threshold. True Positive Rate = # of true positives / # of all actual positives. False Positive Rate = # of false positives / # of all actual negatives. A very good explanation: http://www.dataschool.io/roc-curves-and-auc-explained/
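The definition above can be sketched in a few lines of plain Python. The function name `roc_points` and the toy labels and scores are made up for illustration; in practice a library routine would normally be used instead.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs, one per distinct threshold, from strictest to loosest.

    An example is predicted positive when its score is >= the threshold.
    """
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


y_true = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.6, 0.3]
points = roc_points(y_true, scores)
for fpr, tpr in points:
    print("FPR = %.2f, TPR = %.2f" % (fpr, tpr))
```

Plotting these points with FPR on the X-axis and TPR on the Y-axis gives the ROC curve for this toy classifier.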

The AUC score (the area under the ROC curve) tells us how good the classifier is. An AUC score of 0.5 means a very poor classifier (equivalent to random guessing). An AUC score of 1 means a perfect classifier.
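Another way to read the AUC score is as the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one. The sketch below computes AUC directly from that pairwise view; `auc_score` and the toy data are hypothetical and only meant to show the two extreme cases.

```python
def auc_score(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. This is O(n^2) and only meant for
    small illustrative inputs.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


y_true = [1, 1, 0, 0]
perfect = auc_score(y_true, [0.9, 0.8, 0.2, 0.1])        # every positive outranks every negative
uninformative = auc_score(y_true, [0.5, 0.5, 0.5, 0.5])  # scores carry no information
print(perfect, uninformative)
```

A score below 0.5 would mean the ranking is worse than random, which usually indicates that the predicted labels are inverted.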

arjunjauhari/quora-ques
