Information Retrieval and Data Mining
**Generate features:**
- Warning: This could take half a day to complete.
- Ensure the datasets data.original/attributes.csv, data.original/product_descriptions.csv, data.original/test.csv, and data.original/train.csv are available (see the loading sketch after these steps).
- Set the desired features to generate in the desiredFeatures variable of RunMe.py.
- Run RunMe.py.
- The generated CSV will be located at data/features_full.csv.
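For orientation, a minimal sketch of the loading-and-merge step that precedes feature generation. RunMe.py is the authoritative driver; the ISO-8859-1 encoding is an assumption (the raw CSVs are not clean UTF-8), while the file names and the product_uid/name/value columns come from the dataset notes at the end of this README.

```python
import pandas as pd

# Load the raw datasets listed above. The ISO-8859-1 encoding is an
# assumption to cope with non-UTF-8 characters in the raw files.
train = pd.read_csv('data.original/train.csv', encoding='ISO-8859-1')
test = pd.read_csv('data.original/test.csv', encoding='ISO-8859-1')
descriptions = pd.read_csv('data.original/product_descriptions.csv')
attributes = pd.read_csv('data.original/attributes.csv')

# Attach the long description to each search/product pair.
df = train.merge(descriptions, how='left', on='product_uid')

# Collapse each product's attribute name/value pairs into a single text
# field (one of the data-processing steps listed later in this README).
attr_text = (attributes.dropna(subset=['value'])
             .groupby('product_uid')
             .apply(lambda g: ' '.join(g['name'].astype(str) + ' ' +
                                       g['value'].astype(str)))
             .reset_index(name='attribute_text'))
df = df.merge(attr_text, how='left', on='product_uid')
print(df.columns.tolist())
```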
**Train and generate results for Ordinal Regression:**
- Ensure the feature set data/features_doc2vec_sense2vec_pmi_20170418.csv is available.
- If using a different feature set, change the file reference in OrdinalRegressionRanker.py (the myFeatureSetFileReference variable).
- Run OrdinalRegressionRanker.py.
- Results will be printed on screen or written to ordinal_private.RMSE_NDCG_.csv and ordinal_public.RMSE_NDCG_.csv (a minimal sketch of this step follows).
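A minimal sketch of this step, assuming the feature CSV holds an 'id' column, the 'relevance' target, and feature columns; OrdinalRegressionRanker.py is the authoritative version, and the alpha value and split are illustrative. The Ridge variant shown is the one noted as the best performer under Model Selection below.

```python
import mord
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('data/features_doc2vec_sense2vec_pmi_20170418.csv')
X = df.drop(columns=['id', 'relevance']).values

# mord expects ordinal integer targets; the raw relevance is a real number
# in [1, 3], so map each distinct score to a rank and back again afterwards.
levels = sorted(df['relevance'].unique())
to_rank = {v: i for i, v in enumerate(levels)}
y = df['relevance'].map(to_rank).values

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)

model = mord.OrdinalRidge(alpha=1.0)  # the Ridge variant noted below
model.fit(X_tr, y_tr)

# OrdinalRidge predicts a continuous score; snap it back onto the scale.
ranks = np.clip(np.rint(model.predict(X_te)), 0, len(levels) - 1).astype(int)
pred = np.array(levels)[ranks]
true = np.array(levels)[y_te]
print('RMSE:', np.sqrt(np.mean((pred - true) ** 2)))
```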
**Train and generate results for DNN:**
- Ensure the feature set data/features_doc2vec_sense2vec_20170416.csv is available.
- If using a different feature set, change the file reference in DNN.ipynb (the full_features_filename variable).
- Run DNN.ipynb.
- RMSE results will be printed on screen, and results can be written to the file specified (a minimal sketch follows).
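A minimal Keras sketch of a pointwise DNN regressor on the feature set. DNN.ipynb is the authoritative version; the layer sizes, dropout rate, and epoch count here are illustrative assumptions, not the notebook's actual architecture.

```python
import numpy as np
import pandas as pd
from keras.models import Sequential
from keras.layers import Dense, Dropout
from sklearn.model_selection import train_test_split

df = pd.read_csv('data/features_doc2vec_sense2vec_20170416.csv')
X = df.drop(columns=['id', 'relevance']).values
y = df['relevance'].values

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)

model = Sequential([
    Dense(128, activation='relu', input_shape=(X.shape[1],)),
    Dropout(0.5),
    Dense(64, activation='relu'),
    Dense(1),  # single linear output: the predicted relevance score
])
model.compile(optimizer='adam', loss='mean_squared_error')
model.fit(X_tr, y_tr, epochs=10, batch_size=64, validation_data=(X_te, y_te))

# evaluate() returns the MSE loss; take the square root for RMSE.
print('RMSE:', np.sqrt(model.evaluate(X_te, y_te, verbose=0)))
```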
**Train and generate results for XGBoost:**
- Ensure the feature set data/features_final_20170419.csv is available.
- If using a different feature set, change the file reference in XGBoostRanker.py (line 263).
- Run XGBoostRanker.py.
- RMSE results will be printed on screen, and results.csv is written to the file path specified (a minimal sketch follows).
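A minimal sketch of the XGBoost step, under the same feature-CSV assumptions as above; XGBoostRanker.py is the authoritative version and the hyperparameters are illustrative.

```python
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split

df = pd.read_csv('data/features_final_20170419.csv')
X = df.drop(columns=['id', 'relevance']).values
y = df['relevance'].values

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)

# Gradient-boosted regression trees on the pointwise relevance target.
model = xgb.XGBRegressor(n_estimators=500, max_depth=6, learning_rate=0.05)
model.fit(X_tr, y_tr)

pred = model.predict(X_te)
print('RMSE:', np.sqrt(np.mean((pred - y_te) ** 2)))
```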
**Generate ensemble results:**
- Ensure the public and private test-set predictions are available as CSVs with 'id','pred_revelance' columns.
- Run ensemble.ipynb to generate the ensemble prediction CSV and print RMSE and NDCG scores on screen (a minimal sketch follows).
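A minimal sketch of the weighted-average ensemble; ensemble.ipynb is the authoritative version, and the file names and weights below are hypothetical placeholders.

```python
import pandas as pd

# Hypothetical prediction files; any CSVs with 'id','pred_revelance'
# columns (the format noted above) will work. Weights should sum to 1.
files_and_weights = [('public_model_a.csv', 0.4), ('public_model_b.csv', 0.6)]

ensemble = None
for path, weight in files_and_weights:
    preds = pd.read_csv(path).set_index('id')['pred_revelance'] * weight
    ensemble = preds if ensemble is None else ensemble.add(preds)

# Clip to the valid relevance range and write the ensemble submission.
ensemble.clip(1, 3).rename('pred_revelance').reset_index().to_csv(
    'ensemble_pred.csv', index=False)
```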
**Generate features – for Random Forest & Bagging Algorithms:**
- Ensure the datasets attributes.csv, product_descriptions.csv, test.csv, and train.csv are available.
- Run RandomForestRanker.ipynb.
- The generated CSV will be located at data/features_rf_bag_lg.csv.
**Train and generate results for Random Forest & Bagging Algorithms:**
- Ensure the feature set data/features_rf_bag_lg.csv is available. (Optional) If using a different feature set, uncomment the appropriate line; if it was constructed from scratch, change the file reference in RandomForestRanker.ipynb (df_full_clean.to_csv('features_rf_bag_lg.csv', index=False)).
- Run RandomForestRanker.ipynb.
- RMSE results for the selected algorithm (Random Forest, Bagging, or Logistic Regression) will be printed on screen, and results can be written to the file path specified (a minimal sketch follows).
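A minimal sketch of the Random Forest and Bagging runs (Logistic Regression omitted for brevity), under the same feature-CSV assumptions as above; RandomForestRanker.ipynb is the authoritative version, and the estimator counts are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor, BaggingRegressor
from sklearn.model_selection import train_test_split

df = pd.read_csv('data/features_rf_bag_lg.csv')
X = df.drop(columns=['id', 'relevance']).values
y = df['relevance'].values

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)

for name, model in [('Random Forest', RandomForestRegressor(n_estimators=200)),
                    ('Bagging', BaggingRegressor(n_estimators=200))]:
    model.fit(X_tr, y_tr)
    rmse = np.sqrt(np.mean((model.predict(X_te) - y_te) ** 2))
    print(name, 'RMSE:', rmse)
```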
Ideas for Data Processing (a minimal sketch follows this list):
- Stemming + lower case (Chun Siong: Done)
- Spelling correction (Chun Siong: Done)
- Remove punctuation (Chun Siong: Done. There is a tokeniser based on RE under FeatureEngineering.py)
- Remove Non-ASCII (Kah Siong: Done)
- Stopword removal (Min: Done)
- Merge all attributes key value pair into a single text field (Min: Done)
- Brand Column (Chun Siong: Done)
- Color and material Column (Chun Siong: Done)
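A minimal sketch of the pipeline above: lower-casing, a regex tokeniser that drops punctuation and non-ASCII, stop-word removal, and stemming. Spelling correction is omitted, and FeatureEngineering.py (with its RE-based tokeniser) is the authoritative version.

```python
import re
from nltk.corpus import stopwords          # requires: nltk.download('stopwords')
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer('english')
stop_words = set(stopwords.words('english'))
token_re = re.compile(r"[a-z0-9]+")        # drops punctuation and non-ASCII

def preprocess(text):
    tokens = token_re.findall(text.lower())
    return [stemmer.stem(t) for t in tokens if t not in stop_words]

print(preprocess('Husky 18 in. Total Tech Bag'))
# -> ['huski', '18', 'total', 'tech', 'bag']
```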
Ideas for Feature Engineering (more feature ideas can be found at https://www.microsoft.com/en-us/research/project/mslr/):
- Query-independent (Document only)
- Document Length (Chun Siong: Done)
- Brand Column (Chun Siong: Done)
- Document, search term, title length (Chun Siong: Done)
- Query-dependent (Document and query)
- TF-IDF (Chun Siong: Done)
- Binary indicator if color/material in search term is also in product (Chun Siong: Done)
- Binary indicator if brand in search term is also product brand (Chun Siong: Done)
- BM25 on product title and description combined (Kah Siong: Done) (see the BM25 sketch after this list)
- LMIR.ABS (Min: WIP, Feature_LMIR.py implemented, but not incorporated into FeatureEngineering.py)
- LMIR.DIR
- LMIR.JM
- LDA
- PMI (Kah Siong: Done)
- Sense2vec (Min: Done)
- product_uid (Min: Done)
- spaCy noun chunks (Min: Done)
- Cosine Similarity (Chun Siong: Done)
- Doc2Vec (Chun Siong: Done)
- Word2Vec (Kah Siong: Done)
- Query expansion (Kah Siong: Done with Word2Vec query expansion)
- KL (Kullback-Leibler divergence); mentioned in comparisons that include BM25, LMIR, and KL: https://www.microsoft.com/en-us/research/publication/relevance-ranking-using-kernels/
- Output all computed features in RankLib format
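As a concrete example of a query-dependent feature, a minimal BM25 sketch using the standard Okapi formula with k1 = 1.5 and b = 0.75; the project's actual implementation lives in FeatureEngineering.py and may differ in detail.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score each tokenised document against the tokenised query."""
    N = len(docs_tokens)
    avgdl = sum(len(d) for d in docs_tokens) / N
    df = Counter()                      # document frequency of each term
    for d in docs_tokens:
        df.update(set(d))
    scores = []
    for d in docs_tokens:
        tf = Counter(d)                 # term frequency within this document
        score = 0.0
        for t in query_tokens:
            if t not in tf:
                continue
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            score += (idf * tf[t] * (k1 + 1) /
                      (tf[t] + k1 * (1 - b + b * len(d) / avgdl)))
        scores.append(score)
    return scores

docs = [['husky', 'tool', 'bag'], ['landscape', 'edging', 'no', 'dig']]
print(bm25_scores(['husky', 'bag'], docs))
```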
Ideas for Model Selection:
- Pointwise
- Logistic Regression (Kah Siong: Done)
- Ordinal Regression and variants (LAD, LOGIT, LOGAT) (Kah Siong: Done; stick with the Ridge variant, the best performer) (uses MORD; if MORD doesn't work, fall back to https://gist.github.com/agramfort/2071994)
- Factorisation Machine multiclass classifier (Kah Siong: Done) (Not good: it runs, but one-vs-all multiclass prediction is poor, with RMSE above 1)
- Support Vector Machine
- Boosted Regression
- Perceptron
- Gradient Boosted Regression Trees (Chun Siong: Done)
- Deep learning methods
- RNN: Match-Tensor (https://arxiv.org/pdf/1701.07795.pdf) (Min: Done)
- CNN - (Min: Done)
- DNN - (Min: Done)
- Pairwise
- RankNet (RankLib)
- RankBoost (RankLib)
- Coordinate Ascent (RankLib)
- LambdaMart (RankLib)
- MART (RankLib)
- Random Forests (RankLib)
- Listwise
- ListNet (RankLib)
- AdaRank (RankLib)
- Ensemble
- Weighted ensemble (Min: Done)
Ideas for Evaluation:
- NDCG (Kah Siong: Done; adapted to accommodate our datasets)
- RMSE (see the metrics sketch below)
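A minimal sketch of both metrics; the project's NDCG implementation (adapted to our datasets, per the note above) is authoritative, and the 2^rel gain with log2 discount used below is one common NDCG variant, not necessarily the one used there.

```python
import math
import numpy as np

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

def ndcg_at_k(relevances_in_predicted_order, k=10):
    """NDCG with the common 2^rel - 1 gain and log2 position discount."""
    def dcg(rels):
        return sum((2 ** r - 1) / math.log2(i + 2)
                   for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances_in_predicted_order, reverse=True))
    return dcg(relevances_in_predicted_order) / ideal if ideal > 0 else 0.0

print(rmse([3, 2.67], [2.8, 2.5]))
print(ndcg_at_k([2.67, 3, 1]))  # true relevances, ordered by predicted score
```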
Predict a relevance score for each provided combination of search term and product. Submissions contain:
id, relevance
where id is the id of the test sample and relevance is a score between 1 and 3.
Important note to raters.
The relevance scores were assigned according to the following guidelines:
- Examine the product description.
- Do not use only the product title to determine relevance.
- Focus on attributes (e.g. brand, material, and functionality) when comparing the product to the query.
- Brand is as important as functionality!
Glossary
Item | Explanation |
---|---|
Relevance | Real number between 1 (not relevant) and 3 (relevant); e.g. 1.33. |
Search/Product pairs | One or more search terms paired with the product they returned. |
Evaluations | A search-product pair together with the relevance score (as defined above) assigned to that match. |
Id | Unique id identifying a row in the train or test sets. |
Product_uid | Id identifying the product; a product may appear in more than one record. |
product_title | Short title of the product. |
search_term | One or more search words separated by a space. |
product_description | A long description of the product (written like marketing copy). |
Name | Name of an attribute (e.g. Material). This should be used with 'value'. |
Value | Value of an attribute (e.g. the value of a 'Material' attribute could be 'Steel'). |
Datasets
File | Description | Sample |
---|---|---|
train.csv | Our training set: products, searches, and relevance scores. | id, product_uid, product_title, search_term, relevance<br>1,100001,"Husky 18 in. Total Tech Bag","husky tool bag",3<br>2,100002,"Vigoro 60 ft. No-Dig Edging","landscape edging",2.67 |
test.csv | Same as the training set, but without relevance scores. | id, product_uid, product_title, search_term<br>1,100001,"Husky 18 in. Total Tech Bag","husky tool bag"<br>2,100002,"Vigoro 60 ft. No-Dig Edging","landscape edging" |
product_descriptions.csv | Description of each product. | "product_uid","product_description"<br>100001,"Not only do angles make joints stronger, they also provide more consistent, straight corners. … SD screws" |
attributes.csv | Extended information about a subset of the products (typically detailed technical specifications). Not every product has attributes. | "product_uid","name","value"<br>100001,"Bullet01","Versatile connector for various 90° connections and home repair projects"<br>100001,"Material","Galvanized Steel" |