Study guestwalk's 'base' solution #12

Open
atkm opened this Issue Oct 11, 2018 · 2 comments

atkm commented Oct 11, 2018

@atkm atkm created this issue from a note in Models (To Do) Oct 11, 2018

atkm commented Oct 11, 2018

Look at gen_data.py more carefully.

base/run.py:

  • uses cols ['id','click','hour','banner_pos','device_id','device_ip','device_model','device_conn_type','C14','C17','C20','C21'], and adds cols ['pub_id','pub_domain','pub_category','device_id_count','device_ip_count','user_count','smooth_user_hour_count','user_click_histroy'].
  • util/gen_data.py: feature engineering
    • add count features
    • a row is flagged "is_app" iff its site_id is null (== 85f751fd). Then define pub_id/domain/category := the app var if is_app, else the site var.
    • split the data into two based on "is_app".
    • user := device_ip & device_model if device_id is null (== a99f214a), else device_id.
    • do these for the training set (tr_*) and validation set (va_*).
  • util/parallelizer.py: unclear; it looks like it splits a file.
  • mark1: call train.cpp
  • pickle, merge (this is where predict_proba is computed), and unpickle (write predicted proba to a csv).
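
The is_app, pub_* and user rules noted for util/gen_data.py could be sketched roughly as below. This is a minimal illustration, not the code from the solution: the helper names (is_app, add_pub_and_user, split_by_is_app) and the "-" separator used to join ip and model are assumptions, and rows are assumed to be plain dicts keyed by column name. The two null hashes come from the notes above.

```python
SITE_ID_NULL = "85f751fd"    # site_id value treated as null in the notes
DEVICE_ID_NULL = "a99f214a"  # device_id value treated as null in the notes


def is_app(row):
    """A row counts as an app row iff its site_id is the null hash."""
    return row["site_id"] == SITE_ID_NULL


def add_pub_and_user(row):
    """Return a copy of row with pub_* and user added (hypothetical helper)."""
    out = dict(row)
    src = "app" if is_app(row) else "site"
    out["pub_id"] = row[src + "_id"]
    out["pub_domain"] = row[src + "_domain"]
    out["pub_category"] = row[src + "_category"]
    if row["device_id"] == DEVICE_ID_NULL:
        # Assumed separator; the notes only say "ip & model".
        out["user"] = row["device_ip"] + "-" + row["device_model"]
    else:
        out["user"] = row["device_id"]
    return out


def split_by_is_app(rows):
    """Split rows into (app_rows, site_rows), adding the derived fields."""
    app_rows, site_rows = [], []
    for row in rows:
        target = app_rows if is_app(row) else site_rows
        target.append(add_pub_and_user(row))
    return app_rows, site_rows
```

The same functions would be applied to both the tr_* and va_* files.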

base/converter/2.py:

  • for features where counts are used, use the raw feature if its count is high enough; else use its count. The solution uses count > 1000 as a threshold for device_ip/id, and count > 30 for user and hourly user.
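
A rough sketch of that thresholding rule, to make the idea concrete. The function name, the field keys in THRESHOLDS, and the "field-value" / "field-less-count" token format are assumptions for illustration only; the thresholds (1000 and 30) are the ones noted above.

```python
# Thresholds from the notes; the dictionary keys are hypothetical field names.
THRESHOLDS = {
    "device_ip": 1000,
    "device_id": 1000,
    "user": 30,
    "user_hour": 30,
}


def encode_with_threshold(field, value, count):
    """Use the raw value if it is frequent enough, else fall back to its count."""
    if count > THRESHOLDS[field]:
        return field + "-" + value
    # Rare values share a token per count, so they are grouped by frequency.
    return field + "-less-" + str(count)
```

For example, a device_ip seen 1001 times keeps its own token, while every device_ip seen exactly 3 times maps to the same token.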

@atkm atkm moved this from To Do to In Progress in Models Oct 11, 2018

atkm commented Oct 16, 2018

Slide 8 of https://www.csie.ntu.edu.tw/~r01922136/slides/kaggle-avazu.pdf
Implement count features for the following:

  • device_ip, device_id
  • user
  • hourly user
  • hourly impression. They define an impression as the concatenation of all 'raw features'; not sure what they mean by that. An impression usually means that an ad is shown to a user.
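
One possible way to compute the first three kinds of counts, as a sketch only. It assumes rows are dicts held in a list (they are iterated more than once), that each row already has a user field as defined earlier in this issue, and that "hourly user" means the (user, hour) pair. The function name and the user_hour_count column are hypothetical, no smoothing is applied, and hourly impression is left out because its definition is still unclear.

```python
from collections import Counter


def add_count_features(rows):
    """Count each key over all rows, then attach the counts to every row."""
    ip_count = Counter(r["device_ip"] for r in rows)
    id_count = Counter(r["device_id"] for r in rows)
    user_count = Counter(r["user"] for r in rows)
    # Assumed meaning of "hourly user": occurrences of a user within one hour.
    user_hour_count = Counter((r["user"], r["hour"]) for r in rows)

    out = []
    for r in rows:
        new = dict(r)
        new["device_ip_count"] = ip_count[r["device_ip"]]
        new["device_id_count"] = id_count[r["device_id"]]
        new["user_count"] = user_count[r["user"]]
        new["user_hour_count"] = user_hour_count[(r["user"], r["hour"])]
        out.append(new)
    return out
```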

Slide 10:

  • click history

@atkm atkm moved this from In Progress to To Do in Models Oct 16, 2018