Information Retrieval and Data Mining
**Generate features:**
- Warning: This could take half a day to complete.
- Ensure the datasets data.original/attributes.csv, data.original/product_descriptions.csv, data.original/test.csv, and data.original/train.csv are available (see the loading sketch after these steps).
- Set the desired features to generate in the desiredFeatures variable of RunMe.py.
- Run RunMe.py.
- The generated CSV will be located at data/features_full.csv.
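For orientation, a minimal sketch of the loading-and-merge step that precedes feature generation. RunMe.py is the authoritative driver; the ISO-8859-1 encoding is an assumption (the raw CSVs are not clean UTF-8), while the file names and the product_uid/name/value columns come from the dataset notes at the end of this README.

```python
import pandas as pd

# Load the raw datasets listed above. The ISO-8859-1 encoding is an
# assumption to cope with non-UTF-8 characters in the raw files.
train = pd.read_csv('data.original/train.csv', encoding='ISO-8859-1')
test = pd.read_csv('data.original/test.csv', encoding='ISO-8859-1')
descriptions = pd.read_csv('data.original/product_descriptions.csv')
attributes = pd.read_csv('data.original/attributes.csv')

# Attach the long description to each search/product pair.
df = train.merge(descriptions, how='left', on='product_uid')

# Collapse each product's attribute name/value pairs into a single text
# field (one of the data-processing steps listed later in this README).
attr_text = (attributes.dropna(subset=['value'])
             .groupby('product_uid')
             .apply(lambda g: ' '.join(g['name'].astype(str) + ' ' +
                                       g['value'].astype(str)))
             .reset_index(name='attribute_text'))
df = df.merge(attr_text, how='left', on='product_uid')
print(df.columns.tolist())
```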
**Train and generate results for Ordinal Regression:**
- Ensure the feature set data/features_doc2vec_sense2vec_pmi_20170418.csv is available.
- If using a different feature set, change the file reference in OrdinalRegressionRanker.py (the myFeatureSetFileReference variable).
- Run OrdinalRegressionRanker.py.
- Results will be printed on screen or written to ordinal_private.RMSE_NDCG_.csv and ordinal_public.RMSE_NDCG_.csv (a minimal sketch of this step follows).
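A minimal sketch of this step, assuming the feature CSV holds an 'id' column, the 'relevance' target, and feature columns; OrdinalRegressionRanker.py is the authoritative version, and the alpha value and split are illustrative. The Ridge variant shown is the one noted as the best performer under Model Selection below.

```python
import mord
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('data/features_doc2vec_sense2vec_pmi_20170418.csv')
X = df.drop(columns=['id', 'relevance']).values

# mord expects ordinal integer targets; the raw relevance is a real number
# in [1, 3], so map each distinct score to a rank and back again afterwards.
levels = sorted(df['relevance'].unique())
to_rank = {v: i for i, v in enumerate(levels)}
y = df['relevance'].map(to_rank).values

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)

model = mord.OrdinalRidge(alpha=1.0)  # the Ridge variant noted below
model.fit(X_tr, y_tr)

# OrdinalRidge predicts a continuous score; snap it back onto the scale.
ranks = np.clip(np.rint(model.predict(X_te)), 0, len(levels) - 1).astype(int)
pred = np.array(levels)[ranks]
true = np.array(levels)[y_te]
print('RMSE:', np.sqrt(np.mean((pred - true) ** 2)))
```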
**Train and generate results for DNN:**
- Ensure the feature set data/features_doc2vec_sense2vec_20170416.csv is available.
- If using a different feature set, change the file reference in DNN.ipynb (the full_features_filename variable).
- Run DNN.ipynb.
- RMSE results will be printed on screen, and results can be written to the file specified (a minimal sketch follows).
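A minimal Keras sketch of a pointwise DNN regressor on the feature set. DNN.ipynb is the authoritative version; the layer sizes, dropout rate, and epoch count here are illustrative assumptions, not the notebook's actual architecture.

```python
import numpy as np
import pandas as pd
from keras.models import Sequential
from keras.layers import Dense, Dropout
from sklearn.model_selection import train_test_split

df = pd.read_csv('data/features_doc2vec_sense2vec_20170416.csv')
X = df.drop(columns=['id', 'relevance']).values
y = df['relevance'].values

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)

model = Sequential([
    Dense(128, activation='relu', input_shape=(X.shape[1],)),
    Dropout(0.5),
    Dense(64, activation='relu'),
    Dense(1),  # single linear output: the predicted relevance score
])
model.compile(optimizer='adam', loss='mean_squared_error')
model.fit(X_tr, y_tr, epochs=10, batch_size=64, validation_data=(X_te, y_te))

# evaluate() returns the MSE loss; take the square root for RMSE.
print('RMSE:', np.sqrt(model.evaluate(X_te, y_te, verbose=0)))
```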
**Train and generate results for XGBoost:**
- Ensure the feature set data/features_final_20170419.csv is available.
- If using a different feature set, change the file reference in XGBoostRanker.py (line 263).
- Run XGBoostRanker.py.
- RMSE results will be printed on screen, and results.csv is written to the file path specified (a minimal sketch follows).
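A minimal sketch of the XGBoost step, under the same feature-CSV assumptions as above; XGBoostRanker.py is the authoritative version and the hyperparameters are illustrative.

```python
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split

df = pd.read_csv('data/features_final_20170419.csv')
X = df.drop(columns=['id', 'relevance']).values
y = df['relevance'].values

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)

# Gradient-boosted regression trees on the pointwise relevance target.
model = xgb.XGBRegressor(n_estimators=500, max_depth=6, learning_rate=0.05)
model.fit(X_tr, y_tr)

pred = model.predict(X_te)
print('RMSE:', np.sqrt(np.mean((pred - y_te) ** 2)))
```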
**Generate ensemble results:**
- Ensure the public and private test-set predictions are available as CSVs with 'id','pred_revelance' columns.
- Run ensemble.ipynb to generate the ensemble prediction CSV and print RMSE and NDCG scores on screen (a minimal sketch follows).
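A minimal sketch of the weighted-average ensemble; ensemble.ipynb is the authoritative version, and the file names and weights below are hypothetical placeholders.

```python
import pandas as pd

# Hypothetical prediction files; any CSVs with 'id','pred_revelance'
# columns (the format noted above) will work. Weights should sum to 1.
files_and_weights = [('public_model_a.csv', 0.4), ('public_model_b.csv', 0.6)]

ensemble = None
for path, weight in files_and_weights:
    preds = pd.read_csv(path).set_index('id')['pred_revelance'] * weight
    ensemble = preds if ensemble is None else ensemble.add(preds)

# Clip to the valid relevance range and write the ensemble submission.
ensemble.clip(1, 3).rename('pred_revelance').reset_index().to_csv(
    'ensemble_pred.csv', index=False)
```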
**Generate features – for Random Forest & Bagging Algorithms:**
- Ensure the datasets attributes.csv, product_descriptions.csv, test.csv, and train.csv are available.
- Run RandomForestRanker.ipynb.
- The generated CSV will be located at data/features_rf_bag_lg.csv.
**Train and generate results for Random Forest & Bagging Algorithms:**
- Ensure the feature set data/features_rf_bag_lg.csv is available. (Optional) If using a different feature set, uncomment the appropriate line; if it was constructed from scratch, change the file reference in RandomForestRanker.ipynb (df_full_clean.to_csv('features_rf_bag_lg.csv', index=False)).
- Run RandomForestRanker.ipynb.
- RMSE results for the selected algorithm (Random Forest, Bagging, or Logistic Regression) will be printed on screen, and results can be written to the file path specified (a minimal sketch follows).
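A minimal sketch of the Random Forest and Bagging runs (Logistic Regression omitted for brevity), under the same feature-CSV assumptions as above; RandomForestRanker.ipynb is the authoritative version, and the estimator counts are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor, BaggingRegressor
from sklearn.model_selection import train_test_split

df = pd.read_csv('data/features_rf_bag_lg.csv')
X = df.drop(columns=['id', 'relevance']).values
y = df['relevance'].values

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)

for name, model in [('Random Forest', RandomForestRegressor(n_estimators=200)),
                    ('Bagging', BaggingRegressor(n_estimators=200))]:
    model.fit(X_tr, y_tr)
    rmse = np.sqrt(np.mean((model.predict(X_te) - y_te) ** 2))
    print(name, 'RMSE:', rmse)
```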
Ideas for Data Processing (a minimal sketch follows this list):
- Stemming + lower case (Chun Siong: Done)
- Spelling correction (Chun Siong: Done)
- Remove punctuation (Chun Siong: Done. There is a tokeniser based on RE under FeatureEngineering.py)
- Remove Non-ASCII (Kah Siong: Done)
- Stopword removal (Min: Done)
- Merge all attributes key value pair into a single text field (Min: Done)
- Brand Column (Chun Siong: Done)
- Color and material Column (Chun Siong: Done)
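A minimal sketch of the pipeline above: lower-casing, a regex tokeniser that drops punctuation and non-ASCII, stop-word removal, and stemming. Spelling correction is omitted, and FeatureEngineering.py (with its RE-based tokeniser) is the authoritative version.

```python
import re
from nltk.corpus import stopwords          # requires: nltk.download('stopwords')
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer('english')
stop_words = set(stopwords.words('english'))
token_re = re.compile(r"[a-z0-9]+")        # drops punctuation and non-ASCII

def preprocess(text):
    tokens = token_re.findall(text.lower())
    return [stemmer.stem(t) for t in tokens if t not in stop_words]

print(preprocess('Husky 18 in. Total Tech Bag'))
# -> ['huski', '18', 'total', 'tech', 'bag']
```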
Ideas for Feature Engineering (more feature ideas can be found at https://www.microsoft.com/en-us/research/project/mslr/):
- Query-independent (Document only)
- Document Length (Chun Siong: Done)
- Brand Column (Chun Siong: Done)
- Document, search term, title length (Chun Siong: Done)
- Query-dependent (Document and query)
- TF-IDF (Chun Siong: Done)
- Binary indicator if color/material in search term is also in product (Chun Siong: Done)
- Binary indicator if brand in search term is also product brand (Chun Siong: Done)
- BM25 on product title and description combined (Kah Siong: Done) (see the BM25 sketch after this list)
- LMIR.ABS (Min: WIP, Feature_LMIR.py implemented, but not incorporated into FeatureEngineering.py)
- LMIR.DIR
- LMIR.JM
- LDA
- PMI (Kah Siong: Done)
- Sense2vec (Min: Done)
- product_uid (Min: Done)
- spaCy noun chunks (Min: Done)
- Cosine Similarity (Chun Siong: Done)
- Doc2Vec (Chun Siong: Done)
- Word2Vec (Kah Siong: Done)
- Query expansion (Kah Siong: Done with Word2Vec query expansion)
- KL (Kullback-Leibler divergence); mentioned in comparisons that include BM25, LMIR, and KL: https://www.microsoft.com/en-us/research/publication/relevance-ranking-using-kernels/
- Output all computed features in RankLib format
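As a concrete example of a query-dependent feature, a minimal BM25 sketch using the standard Okapi formula with k1 = 1.5 and b = 0.75; the project's actual implementation lives in FeatureEngineering.py and may differ in detail.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score each tokenised document against the tokenised query."""
    N = len(docs_tokens)
    avgdl = sum(len(d) for d in docs_tokens) / N
    df = Counter()                      # document frequency of each term
    for d in docs_tokens:
        df.update(set(d))
    scores = []
    for d in docs_tokens:
        tf = Counter(d)                 # term frequency within this document
        score = 0.0
        for t in query_tokens:
            if t not in tf:
                continue
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            score += (idf * tf[t] * (k1 + 1) /
                      (tf[t] + k1 * (1 - b + b * len(d) / avgdl)))
        scores.append(score)
    return scores

docs = [['husky', 'tool', 'bag'], ['landscape', 'edging', 'no', 'dig']]
print(bm25_scores(['husky', 'bag'], docs))
```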
Ideas for Model Selection:
- Pointwise
- Logistic Regression (Kah Siong: Done)
- Ordinal Regression and variants (LAD, LOGIT, LOGAT) (Kah Siong: Done; stick with the Ridge variant, the best performer) (uses MORD; if MORD doesn't work, fall back to https://gist.github.com/agramfort/2071994)
- Factorisation Machine multiclass classifier (Kah Siong: Done) (Not good: it runs, but one-vs-all multiclass prediction is poor, with RMSE above 1)
- Support Vector Machine
- Boosted Regression
- Perceptron
- Gradient Boosted Regression Trees (Chun Siong: Done)
- Deep learning methods
- RNN: Match-Tensor (https://arxiv.org/pdf/1701.07795.pdf) (Min: Done)
- CNN - (Min: Done)
- DNN - (Min: Done)
- Pairwise
- RankNet (RankLib)
- RankBoost (RankLib)
- Coordinate Ascent (RankLib)
- LambdaMart (RankLib)
- MART (RankLib)
- Random Forests (RankLib)
- Listwise
- ListNet (RankLib)
- AdaRank (RankLib)
- Ensemble
- Weighted ensemble (Min: Done)
Ideas for Evaluation:
- NDCG (Kah Siong: Done; adapted to accommodate our datasets)
- RMSE (see the metrics sketch below)
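A minimal sketch of both metrics; the project's NDCG implementation (adapted to our datasets, per the note above) is authoritative, and the 2^rel gain with log2 discount used below is one common NDCG variant, not necessarily the one used there.

```python
import math
import numpy as np

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

def ndcg_at_k(relevances_in_predicted_order, k=10):
    """NDCG with the common 2^rel - 1 gain and log2 position discount."""
    def dcg(rels):
        return sum((2 ** r - 1) / math.log2(i + 2)
                   for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances_in_predicted_order, reverse=True))
    return dcg(relevances_in_predicted_order) / ideal if ideal > 0 else 0.0

print(rmse([3, 2.67], [2.8, 2.5]))
print(ndcg_at_k([2.67, 3, 1]))  # true relevances, ordered by predicted score
```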
Predict a relevance score for each provided combination of search term and product. Submissions contain:
id, relevance
where id is the id of the test sample and relevance is a score between 1 and 3.
Important note to raters.
The relevance scores were assigned according to the following guidelines:
- Examine the product description.
- Do not use only the product title to determine relevance.
- Focus on attributes (e.g. brand, material, and functionality) when comparing the product to the query.
- Brand is as important as functionality!
Glossary
Item | Explanation |
---|---|
Relevance | Real number between 1 (not relevant) and 3 (relevant); e.g. 1.33. |
Search/Product pairs | One or more search terms paired with the product they returned. |
Evaluations | A search-product pair together with the relevance score (as defined above) assigned to that match. |
Id | Unique id identifying a row in the train or test sets. |
Product_uid | Id identifying the product; a product may appear in more than one record. |
product_title | Short title of the product. |
search_term | One or more search words separated by a space. |
product_description | A long description of the product (written like marketing copy). |
Name | Name of an attribute (e.g. Material). This should be used with 'value'. |
Value | Value of an attribute (e.g. the value of a 'Material' attribute could be 'Steel'). |
Datasets
File | Description | Sample |
---|---|---|
train.csv | Our training set: products, searches, and relevance scores. | id, product_uid, product_title, search_term, relevance<br>1,100001,"Husky 18 in. Total Tech Bag","husky tool bag",3<br>2,100002,"Vigoro 60 ft. No-Dig Edging","landscape edging",2.67 |
test.csv | Same as the training set, but without relevance scores. | id, product_uid, product_title, search_term<br>1,100001,"Husky 18 in. Total Tech Bag","husky tool bag"<br>2,100002,"Vigoro 60 ft. No-Dig Edging","landscape edging" |
product_descriptions.csv | Description of each product. | "product_uid","product_description"<br>100001,"Not only do angles make joints stronger, they also provide more consistent, straight corners. … SD screws" |
attributes.csv | Extended information about a subset of the products (typically detailed technical specifications). Not every product has attributes. | "product_uid","name","value"<br>100001,"Bullet01","Versatile connector for various 90° connections and home repair projects"<br>100001,"Material","Galvanized Steel" |