Quora Question Pairs
- Unzip Project_btekwani.zip.
- If you want to create a virtual environment, run `virtualenv <the name of the environment, say 'tekwani'>`. Creating the virtualenv within Anaconda is highly recommended because of TensorFlow, Keras, and their dependencies.
- To begin using the virtual environment, you must activate it: `source tekwani/bin/activate`
- Install the packages specified in requirements.txt (the file was generated with `pip freeze > requirements.txt`, which freezes the state of an environment): `pip install -r requirements.txt`
- Depending on your installation, NLTK might require the Porter Stemmer and stopwords data. To install everything, run: `sudo python -m nltk.downloader -d /usr/local/share/nltk_data all`
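As a rough illustration of what the stemming/stopword step provides, here is a stdlib-only stand-in (the real pipeline uses NLTK's full English stopword list and Porter stemmer; the tiny stopword set below is purely illustrative):

```python
import re

# Tiny stand-in stopword list; the project uses NLTK's full English list.
STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "in", "and", "what", "how"}

def clean_question(text):
    """Lowercase, strip punctuation, and drop stopwords.

    A stdlib-only approximation of the NLTK cleaning step
    (the real pipeline also applies the Porter stemmer).
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

print(clean_question("What is the best way to learn Python?"))
# -> ['best', 'way', 'learn', 'python']
```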
The `src` folder contains all the files used to generate the visualizations, statistics, models, and features used throughout the report.
- `blend.py` is the blended model made of 6 regressors.
- `features.py` does all the feature engineering (FS-1 through FS-4) and generates 4 .npy files, which correspond to the word embedding vectors in train (q1 and q2) and test (q1 and q2). It also creates train_f.csv and test_f.csv, which contain the cleaned text and all the computed features.
- `utilities` is a module containing the paths used for saving and loading NumPy arrays and CSV files, helpers for loading data as pandas DataFrames, and LSTM settings (constant values).
- `makesub.py` is a hack to create a submission file after `blend.py` has finished executing.
- `Questions.py` has simple pandas operations used to do basic counts, calculate average question lengths, etc.
- `XGB_Baseline.py` is the XGBoost model behind the baseline submission I made.
- `XGBoost_GridSearch.py` is a custom wrapper around xgboost so that we can do a grid search the way scikit-learn allows.
- `Visualizations.py` generates the violin plots and bar plots used in the report. It probably will not run as-is.
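To give a flavor of the kind of handcrafted feature `features.py` computes, here is a sketch of a common question-pair feature, word match share. This is illustrative only; the actual FS-1 through FS-4 feature sets are not reproduced here and may differ.

```python
def word_match_share(q1, q2):
    """Fraction of tokens shared between two questions — a common
    handcrafted feature for duplicate detection (illustrative; the
    real features in features.py may be computed differently)."""
    w1, w2 = set(q1.lower().split()), set(q2.lower().split())
    if not w1 or not w2:
        return 0.0
    shared = w1 & w2
    return 2 * len(shared) / (len(w1) + len(w2))

print(word_match_share("how do I learn python",
                       "how can I learn python fast"))
# -> 0.7272... (4 shared tokens out of 5 + 6)
```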
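`blend.py` combines six regressors; as a minimal sketch of the blending idea, here is a weighted average of per-model predictions. The actual weighting or stacking scheme in `blend.py` is not reproduced here and may differ.

```python
import numpy as np

def blend_predictions(preds, weights=None):
    """Weighted average of per-model probability predictions.

    preds: list of 1-D arrays, one per base regressor.
    A simplified sketch — blend.py's 6-model blend may combine
    predictions differently (e.g. via stacking).
    """
    preds = np.asarray(preds, dtype=float)
    if weights is None:
        weights = np.full(len(preds), 1.0 / len(preds))
    return np.average(preds, axis=0, weights=weights)

# Three toy "models" predicting duplicate probability for two pairs:
p = blend_predictions([[0.2, 0.8], [0.4, 0.6], [0.3, 0.7]])
print(p)  # -> [0.3 0.7], the equal-weight mean of the three rows
```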
- data
  + train.csv
  + sample_submission.csv
  + test.csv
- report
- src
  + utilities
    + __init__.py
    + utilities.py
  + other *.py files
- sub [generated submission files go here]
- viz [plots and stuff]
- GoogleNews-vectors-negative300.bin.gz
- `xgb_gridsearch.out` contains the GridSearch logs.
- `sub/XGB_Baseline_logs.txt` contains the XGB outputs for various features.
- `lstm_274_118_0.20_0.37.h5` contains the weights for the LSTM; opening and viewing it requires an HDF5 library in Python.
- All submitted prediction files are in the `sub` folder.
- Download the original CSV files and feature files here.
- Download Google's pretrained Word2Vec model, GoogleNews-vectors-negative300.bin.gz.
- Create a folder `sub` inside the project folder so that all the submission files can be written there.
- Run `features.py` once you have at least the train.csv and test.csv files in the `data` folder. train_f.csv and test_f.csv will be overwritten if they are already stored there.
- Run `XGB_Baseline.py` for the baseline model.
- Run `XGBoost_GridSearch.py` to perform a scikit-learn-style grid search over the XGBoost parameters.
- Run `makesub.py` to generate a submission file (only for the blend model).
- Run `lstm.py`. This will only run if you have Keras and TensorFlow installed.
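The grid search over XGBoost parameters follows the standard scikit-learn pattern. The sketch below uses scikit-learn's own `GridSearchCV` with a `GradientBoostingClassifier` standing in for xgboost (the actual wrapper in `XGBoost_GridSearch.py` is not reproduced here; toy data and parameter values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Toy data standing in for the engineered question-pair features.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Generic grid-search pattern; XGBoost_GridSearch.py wraps xgboost so the
# same scikit-learn machinery applies (parameter names are illustrative).
param_grid = {"n_estimators": [50, 100], "max_depth": [2, 3]}
search = GridSearchCV(GradientBoostingClassifier(random_state=0),
                      param_grid, cv=3, scoring="neg_log_loss")
search.fit(X, y)
print(search.best_params_)
```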
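The submission file follows the competition's CSV format, with `test_id` and `is_duplicate` columns. A stdlib-only sketch of roughly what `makesub.py` produces (illustrative; its actual logic is not reproduced here):

```python
import csv
import io

def write_submission(fileobj, test_ids, probs):
    """Write predictions in the Quora Question Pairs submission
    format: a header row, then one row per test pair.
    (Illustrative sketch — makesub.py's actual logic may differ.)"""
    writer = csv.writer(fileobj)
    writer.writerow(["test_id", "is_duplicate"])
    for tid, p in zip(test_ids, probs):
        writer.writerow([tid, f"{p:.6f}"])

buf = io.StringIO()
write_submission(buf, [0, 1, 2], [0.12, 0.87, 0.5])
print(buf.getvalue())
```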