# Problem Statement

## The Task
Given the sequence of click events performed by a user during a typical session on an e-commerce website, the goal is to predict whether the user will buy something and, if so, which items. The task can therefore be divided into two sub-goals:

1. Is the user going to buy items in this session? Yes|No
2. If yes, which items are going to be bought?

## The Data
### Training Data Files
The training data comprises two files:

yoochoose-clicks.dat - Click events. Each record/line in the file has the following fields:
- Session ID – the id of the session. In one session there are one or many clicks.
- Timestamp – the time when the click occurred.
- Item ID – the unique identifier of the item.
- Category – the category of the item.


yoochoose-buys.dat - Buy events. Each record/line in the file has the following fields:
- Session ID – the id of the session. In one session there are one or many buying events.
- Timestamp – the time when the buy occurred.
- Item ID – the unique identifier of the item.
- Price – the price of the item.
- Quantity – how many of this item were bought.

Every Session ID in yoochoose-buys.dat also exists in yoochoose-clicks.dat; the records with the same Session ID together form the sequence of click events of a certain user during the session. A session can be short (a few minutes) or very long (a few hours), and it can contain one click or hundreds of clicks, depending on the user's activity.

### Test File
The test data is a single file:

yoochoose-test.dat - Click events, structured identically to yoochoose-clicks.dat in the training data:
- Session ID
- Timestamp
- Item ID
- Category


### Solution File
The task is to predict, for each session in the test file, whether there will be a buying event in that session and, if so, which items will be bought. There is no need to predict quantities.

The solution file to be submitted comprises records with exactly two fields:

solution.dat
- Session ID
- Comma-separated list of the Item IDs that are bought in this session

Every Session ID in the solution file must come from the test file. The solution file lists all the Session IDs that the challenger predicts will end with at least one buying event. The second field in each record/line is a list of one or more Item IDs separated by the “,” character. The field delimiter between the Session ID and the list of Item IDs is “;”. No spaces are required in the file.

If a Session ID exists in the test file but not in the solution file, the challenger is predicting that the session will not end with a buying event. Since approximately 95% of the sessions end without a buying event, the challenge is twofold: first, recognize exactly which sessions end with a buying event, and second, predict which items will be bought in those sessions.
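
The record layout described above can be sketched as follows. This is only an illustration: `format_solution` and the `predictions` mapping are hypothetical names that do not appear in the challenge material, and the IDs are placeholders rather than real predictions.

```python
def format_solution(predictions):
    """Render {session_id: [item_id, ...]} as solution-file lines.

    Sessions with an empty item list are left out, which the challenge
    reads as "no buying event predicted" for that session.
    """
    lines = []
    for session_id in sorted(predictions):
        items = predictions[session_id]
        if not items:
            continue  # predicted non-buying session: omit it from the file
        item_list = ",".join(str(item) for item in items)
        lines.append(f"{session_id};{item_list}")  # ";" separates the two fields
    return "\n".join(lines)


# Placeholder IDs for illustration only.
predictions = {11: [214821277, 214839313], 12: [], 21: [214826705]}
solution_text = format_solution(predictions)
print(solution_text)
```

Session 12 has no predicted items, so it does not appear in the output at all.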





Read the full problem statement and evaluation criteria here: https://2015.recsyschallenge.com/challenge.html

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.model_selection import train_test_split
from sklearn import metrics
from pandas import DataFrame

# pip3 install --no-deps fastai

In [2]:
# Any results you write to the current directory are saved as output.
PATH = '/Users/abasar/data/recsys-challenge-2015' 
import os
for dirname, _, filenames in os.walk(PATH):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/Users/abasar/data/recsys-challenge-2015/yoochoose-test.dat
/Users/abasar/data/recsys-challenge-2015/dataset-README.txt
/Users/abasar/data/recsys-challenge-2015/yoochoose-clicks.dat
/Users/abasar/data/recsys-challenge-2015/yoochoose-buys.dat


In [3]:
# https://github.com/fastai/fastai/blob/master/fastai/tabular/transform.py#L43
import re
import calendar
from datetime import date, datetime  # referenced by cyclic_dt_features
from functools import partial


def add_cyclic_datepart(df:DataFrame, field_name:str, prefix:str=None, drop:bool=True, time:bool=False, add_linear:bool=False):
    "Helper function that adds trigonometric date/time features to a date in the column `field_name` of `df`."
    make_date(df, field_name)
    field = df[field_name]
    prefix = ifnone(prefix, re.sub('[Dd]ate$', '', field_name))
    series = field.apply(partial(cyclic_dt_features, time=time, add_linear=add_linear))
    columns = [prefix + c for c in cyclic_dt_feat_names(time, add_linear)]
    df_feats = pd.DataFrame([item for item in series], columns=columns, index=series.index)
    for column in columns: df[column] = df_feats[column]
    if drop: df.drop(field_name, axis=1, inplace=True)
    return df

def add_datepart(df:DataFrame, field_name:str, prefix:str=None, drop:bool=True, time:bool=False):
    "Helper function that adds columns relevant to a date in the column `field_name` of `df`."
    make_date(df, field_name)
    field = df[field_name]
    prefix = ifnone(prefix, re.sub('[Dd]ate$', '', field_name))
    attr = ['Year', 'Month', 'Week', 'Day', 'Dayofweek', 'Dayofyear', 'Is_month_end', 'Is_month_start', 
            'Is_quarter_end', 'Is_quarter_start', 'Is_year_end', 'Is_year_start']
    if time: attr = attr + ['Hour', 'Minute', 'Second']
    for n in attr: df[prefix + n] = getattr(field.dt, n.lower())
    df[prefix + 'Elapsed'] = field.astype(np.int64) // 10 ** 9
    if drop: df.drop(field_name, axis=1, inplace=True)
    return df

def make_date(df:DataFrame, date_field:str):
    "Make sure `df[date_field]` is of the right date type."
    field_dtype = df[date_field].dtype
    if isinstance(field_dtype, pd.core.dtypes.dtypes.DatetimeTZDtype):
        field_dtype = np.datetime64
    if not np.issubdtype(field_dtype, np.datetime64):
        df[date_field] = pd.to_datetime(df[date_field], infer_datetime_format=True)
        
        
def ifnone(a,b):
    "`a` if `a` is not None, otherwise `b`."
    return b if a is None else a



def cyclic_dt_features(d, time:bool=True, add_linear:bool=False):
    "Calculate the cos and sin of date/time cycles."
    tt,fs = d.timetuple(), [np.cos, np.sin]
    day_year,days_month = tt.tm_yday, calendar.monthrange(d.year, d.month)[1]
    days_year = 366 if calendar.isleap(d.year) else 365
    rs = d.weekday()/7, (d.day-1)/days_month, (d.month-1)/12, (day_year-1)/days_year
    feats = [f(r * 2 * np.pi) for r in rs for f in fs]
    if time and isinstance(d, datetime) and type(d) != date:
        rs = tt.tm_hour/24, tt.tm_hour%12/12, tt.tm_min/60, tt.tm_sec/60
        feats += [f(r * 2 * np.pi) for r in rs for f in fs]
    if add_linear:
        if type(d) == date: feats.append(d.year + rs[-1])
        else:
            secs_in_year = (datetime(d.year+1, 1, 1) - datetime(d.year, 1, 1)).total_seconds()
            feats.append(d.year + ((d - datetime(d.year, 1, 1)).total_seconds() / secs_in_year))
    return feats

def cyclic_dt_feat_names(time:bool=True, add_linear:bool=False):
    "Return feature names of date/time cycles as produced by `cyclic_dt_features`."
    fs = ['cos','sin']
    attr = [f'{r}_{f}' for r in 'weekday day_month month_year day_year'.split() for f in fs]
    if time: attr += [f'{r}_{f}' for r in 'hour clock min sec'.split() for f in fs]
    if add_linear: attr.append('year_lin')
    return attr
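
The idea behind the cyclic features above is to place a periodic quantity on the unit circle, so that the two ends of a cycle (for example Sunday and Monday) end up close together. A minimal sketch of that idea follows; `encode_cyclic` is a hypothetical helper written for this illustration and is not part of fastai.

```python
import numpy as np


def encode_cyclic(value, period):
    """Return (cos, sin) of `value` placed on a circle of length `period`."""
    angle = 2 * np.pi * value / period
    return np.cos(angle), np.sin(angle)


# Weekday numbering as in datetime.weekday(): Monday=0 ... Sunday=6.
monday = np.array(encode_cyclic(0, 7))
tuesday = np.array(encode_cyclic(1, 7))
sunday = np.array(encode_cyclic(6, 7))

# On the circle, Sunday is exactly as close to Monday as Tuesday is,
# whereas the raw integers 6 and 0 look far apart.
dist_sun_mon = np.linalg.norm(sunday - monday)
dist_tue_mon = np.linalg.norm(tuesday - monday)
print(tuesday, dist_sun_mon, dist_tue_mon)
```

The Tuesday encoding is approximately (0.62349, 0.781831), the same pair that shows up in the `start_tsweekday_cos` and `start_tsweekday_sin` columns for sessions starting on Tuesday 2014-04-01.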



In [4]:
limit = 100000

In [5]:
def read_buys(limit=None):
    print("Loading buys")
    buys = pd.read_csv(f"{PATH}/yoochoose-buys.dat",
                    names=["session", "timestamp", "item", "price", "qty"],
                    parse_dates=["timestamp"], 
                    nrows = limit)
    buys = buys.sort_values(by=["timestamp", "session"])
    if limit:
        buys = buys.iloc[:limit]
    return buys

buys = read_buys(limit)
buys

Loading buys


,session,timestamp,item,price,qty
15774,263073,2014-04-01 03:05:31.743000+00:00,214716982,2617,1
20274,466654,2014-04-01 03:21:30.255000+00:00,214820812,313,1
20275,466654,2014-04-01 03:21:30.271000+00:00,214584907,4711,1
48952,185089,2014-04-01 03:29:09.230000+00:00,214819376,418,1
48953,185089,2014-04-01 03:29:09.280000+00:00,214819430,418,1
...,...,...,...,...,...
99143,1158269,2014-04-21 23:29:29.844000+00:00,214829004,6282,1
99463,1219516,2014-04-21 23:39:39.581000+00:00,214699022,48066,1
99188,1245741,2014-04-22 00:11:27.396000+00:00,214551900,732,1
96462,1214317,2014-04-22 01:33:10.934000+00:00,214600502,26075,1


In [6]:
def read_clicks(limit=None):
    print("Loading clicks")
    filename = f"{PATH}/yoochoose-clicks.dat"
    df = pd.read_csv(filename,
                     names=["session", "timestamp", "item", "category"],
                     parse_dates=["timestamp"],
                     converters={"category": lambda c: -1 if c == "S" else c}, 
                     nrows = limit)
    df = df.sort_values(by=["timestamp", "session"])
    if limit:
        df = df.iloc[:limit]
    print("Clicks shape %s %s" % df.shape)
    return df

clicks = read_clicks(limit)
clicks

Loading clicks
Clicks shape 100000 4


,session,timestamp,item,category
17670,5182,2014-04-01 03:02:21.719000+00:00,214821026,0
76947,23422,2014-04-01 03:02:33.088000+00:00,214821371,0
17671,5182,2014-04-01 03:03:11.600000+00:00,214821028,0
91911,30213,2014-04-01 03:03:30.774000+00:00,214832557,0
48140,14611,2014-04-01 03:03:45.041000+00:00,214585644,0
...,...,...,...,...
15293,4986,2014-04-08 02:47:37.676000+00:00,214510044,0
15294,4986,2014-04-08 02:48:30.134000+00:00,214537537,0
15295,4986,2014-04-08 02:48:57.006000+00:00,214582387,0
15296,4986,2014-04-08 02:49:23.683000+00:00,214510044,0


In [7]:
def process_clicks(clicks, rolling_days=15):
    # Compute dwell time for each click
    print("Processing clicks")
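    # Timestamp of the previous click in the same session (NaT for the first click)
    clicks['prev_ts'] = clicks.groupby('session')['timestamp'].transform(lambda x: x.shift())
    clicks['diff_prev'] = clicks["timestamp"] - clicks["prev_ts"] # Timedelta since the previous click
    # Dwell = time until the next click in the session, in minutes (NaN for the last click)
    clicks["dwell"] = clicks.groupby('session')['diff_prev'].transform(lambda x: x.shift(-1)).dt.seconds/60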
    clicks = clicks.sort_values(by=["session", "timestamp"])
    print("Processed clicks shape %s %s" % clicks.shape)
    return clicks

clicks_proc = process_clicks(clicks)
clicks_proc

Processing clicks
Processed clicks shape 100000 7


,session,timestamp,item,category,prev_ts,diff_prev,dwell
0,1,2014-04-07 10:51:09.277000+00:00,214536502,0,NaT,NaT,3.000000
1,1,2014-04-07 10:54:09.868000+00:00,214536500,0,2014-04-07 10:51:09.277000+00:00,00:03:00.591000,0.616667
2,1,2014-04-07 10:54:46.998000+00:00,214536506,0,2014-04-07 10:54:09.868000+00:00,00:00:37.130000,2.216667
3,1,2014-04-07 10:57:00.306000+00:00,214577561,0,2014-04-07 10:54:46.998000+00:00,00:02:13.308000,
4,2,2014-04-07 13:56:37.614000+00:00,214662742,0,NaT,NaT,0.683333
...,...,...,...,...,...,...,...
98003,32766,2014-04-05 17:13:29.754000+00:00,214707883,0,2014-04-05 17:12:41.133000+00:00,00:00:48.621000,0.433333
98004,32766,2014-04-05 17:13:56.017000+00:00,214585554,0,2014-04-05 17:13:29.754000+00:00,00:00:26.263000,2.716667
98005,32766,2014-04-05 17:16:39.422000+00:00,214819762,0,2014-04-05 17:13:56.017000+00:00,00:02:43.405000,
97999,32767,2014-04-03 14:25:12.899000+00:00,214512467,0,NaT,NaT,9.683333
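
The dwell computation above can be checked on a small hand-made frame. The sketch below is an equivalent formulation, not the notebook's exact code: it shifts the timestamp directly and uses `.dt.total_seconds()`; the `toy` frame and its values are invented for illustration.

```python
import pandas as pd

# Invented clicks: three in session 1, one in session 2.
toy = pd.DataFrame({
    "session": [1, 1, 1, 2],
    "timestamp": pd.to_datetime([
        "2014-04-07 10:51:00",
        "2014-04-07 10:54:00",
        "2014-04-07 10:54:30",
        "2014-04-07 13:56:00",
    ]),
})

toy = toy.sort_values(["session", "timestamp"])
# Timestamp of the next click within the same session (NaT for the last click).
next_ts = toy.groupby("session")["timestamp"].shift(-1)
# Dwell in minutes: how long the user stayed on a click before the next one.
toy["dwell"] = (next_ts - toy["timestamp"]).dt.total_seconds() / 60
print(toy)
```

One difference from the cell above: `.dt.seconds` returns only the whole-seconds component of a timedelta (dropping days and fractions of a second), while `.dt.total_seconds()` keeps the full duration. For gaps shorter than a day the two differ only by the sub-second part.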


In [8]:
def process_buys(limit=None):
    # Group into sessions, compute nr of items bought and set label column
    buys = read_buys(limit=limit)
    print("Processing buys")
    print("Buys from %s to %s" % (buys.timestamp.min(), buys.timestamp.max()))
    grouped = buys.groupby("session")
    buys_g = pd.DataFrame(index=grouped.groups.keys())
    buys_g["items_bought"] = grouped.item.count() # quantity may be zero, which is odd, so don't use it
    buys_g["is_buy"] = 1 # for easier merge later on
    buys_g.index.name = "session"
    print("Buys grouped by session %s %s" % buys_g.shape)
    return buys_g

buys_proc = process_buys(limit)
buys_proc

Loading buys
Processing buys
Buys from 2014-04-01 03:05:31.743000+00:00 to 2014-04-22 01:47:44.603000+00:00
Buys grouped by session 52270 2


session,items_bought,is_buy
11,2,1
12,1,1
21,2,1
33,6,1
46,2,1
...,...,...
1286027,1,1
1286032,1,1
1286049,2,1
1286071,1,1


In [9]:
def get_items_cats_percent(clicks, limit=None):
    buys = read_buys(limit=limit)
    # percent bought
    item_id_bought_pct = buys.item.value_counts(normalize=True)
    cat_id_viewed_pct = clicks.category.value_counts(normalize=True)
    item_id_viewed_pct = clicks.item.value_counts(normalize=True)

    return dict(views=dict(item=item_id_viewed_pct, cat=cat_id_viewed_pct), buys=item_id_bought_pct)

items_cats_percent = get_items_cats_percent(clicks, limit)
items_cats_percent

Loading buys


{'views': {'item': 214839313    0.00949
  214826705    0.00897
  214826955    0.00827
  214717003    0.00755
  214821277    0.00750
                ...   
  214683819    0.00001
  214843490    0.00001
  214714520    0.00001
  214573205    0.00001
  214829055    0.00001
  Name: item, Length: 10067, dtype: float64,
  'cat': 0    1.0
  Name: category, dtype: float64},
 'buys': 214821277    0.01624
 214839313    0.01481
 214821305    0.00989
 214826705    0.00754
 214821285    0.00738
               ...   
 214629952    0.00001
 214550055    0.00001
 214683174    0.00001
 214828477    0.00001
 214839302    0.00001
 Name: item, Length: 7936, dtype: float64}

In [10]:
def process_sessions(processed_clicks, limit=None):
    print("Preprocessing - Grouping clicks into sessions")
    clicks = processed_clicks
    
    # Group clicks by session
    grouped = clicks.groupby("session")
    sessions = pd.DataFrame(index=grouped.groups.keys())
    
    # Session counters
    sessions["total_clicks"] = grouped.item.count()
    sessions["total_items"] = grouped.item.unique().apply(lambda x: len(x))
    sessions["total_cats"] = grouped.category.unique().apply(lambda x: len(x))
    print("Computed counters")
    
    # Session duration
    sessions["max_dwell"] = grouped.dwell.max()
    sessions["mean_dwell"] = grouped.dwell.mean()
    sessions["start_ts"] = grouped.timestamp.min()
    sessions["end_ts"] = grouped.timestamp.max()
    sessions["total_duration"] = (sessions["end_ts"] - sessions["start_ts"]).dt.seconds / 60
    print("Computed dwell and duration")
    
    # Click rate
    sessions["total_duration_secs"] = (sessions["end_ts"] - sessions["start_ts"]).dt.seconds
    sessions["click_rate"] = sessions["total_clicks"] / sessions["total_duration_secs"]
    sessions.click_rate = sessions.click_rate.replace(np.inf, np.nan)
    sessions.click_rate = sessions.click_rate.fillna(0)
    del sessions["total_duration_secs"]
    print("Computed click rate")
    
    # Dates
    #sessions = add_datepart(sessions, "start_ts", drop=False)
    #sessions = add_datepart(sessions, "end_ts", drop=False)
    sessions = add_cyclic_datepart(sessions, "start_ts", drop=False)
    sessions = add_cyclic_datepart(sessions, "end_ts", drop=False)
    print("Computed cyclic date parts")
    
    # What is the item and cat most viewed in this session?
    # How many times were they viewed?
    sessions["cat_most_viewed_n_times"] = grouped.category.value_counts().unstack().max(axis=1)
    sessions["cat_most_viewed"] = grouped.category.value_counts().unstack().idxmax(axis=1)
    sessions["item_most_viewed_n_times"] = grouped.item.value_counts().unstack().max(axis=1)
    sessions["item_most_viewed"] = grouped.item.value_counts().unstack().idxmax(axis=1)
    print("Computed most viewed item/cat per session")

    # For the item most viewed in each session, what is its global buy/view frequency?
    freqs = get_items_cats_percent(clicks, limit=limit)
    cat_views = pd.DataFrame(freqs["views"]["cat"])
    cat_views.columns = ["cat_views_freqs"]
    sessions = sessions.merge(cat_views, how="left", left_on="cat_most_viewed", right_index=True)
    sessions.cat_views_freqs = sessions.cat_views_freqs.fillna(0)
    item_views = pd.DataFrame(freqs["views"]["item"])
    item_views.columns = ["item_views_freqs"]
    sessions = sessions.merge(item_views, how="left", left_on="item_most_viewed", right_index=True)
    sessions.item_views_freqs = sessions.item_views_freqs.fillna(0)
    item_buys = pd.DataFrame(freqs["buys"])
    item_buys.columns = ["item_buys_freqs"]
    sessions = sessions.merge(item_buys, how="left", left_on="item_most_viewed", right_index=True)
    sessions.item_buys_freqs = sessions.item_buys_freqs.fillna(0)
    print("Computed most viewed/bought freqs")
    
    # Sorting sessions
    sessions = sessions.sort_values(by=["start_ts"])
    sessions.index.name = "session"
    
    print("Sessions shape %s %s" % sessions.shape)
    print("Sessions columns %s " % sessions.columns)
    print("Sessions from %s to %s" % (sessions.start_ts.min(), sessions.start_ts.max()))
    return sessions


sessions = process_sessions(clicks_proc,limit)
sessions

Preprocessing - Grouping clicks into sessions
Computed counters
Computed dwell and duration
Computed click rate
Computed cyclic date parts
Computed most viewed item/cat per session
Loading buys
Computed most viewed/bought freqs
Sessions shape 25292 32
Sessions columns Index(['total_clicks', 'total_items', 'total_cats', 'max_dwell', 'mean_dwell',
       'start_ts', 'end_ts', 'total_duration', 'click_rate',
       'start_tsweekday_cos', 'start_tsweekday_sin', 'start_tsday_month_cos',
       'start_tsday_month_sin', 'start_tsmonth_year_cos',
       'start_tsmonth_year_sin', 'start_tsday_year_cos',
       'start_tsday_year_sin', 'end_tsweekday_cos', 'end_tsweekday_sin',
       'end_tsday_month_cos', 'end_tsday_month_sin', 'end_tsmonth_year_cos',
       'end_tsmonth_year_sin', 'end_tsday_year_cos', 'end_tsday_year_sin',
       'cat_most_viewed_n_times', 'cat_most_viewed',
       'item_most_viewed_n_times', 'item_most_viewed', 'cat_views_freqs',
       'item_views_freqs', 'item_buys_freqs'],

session,total_clicks,total_items,total_cats,max_dwell,mean_dwell,start_ts,end_ts,total_duration,click_rate,start_tsweekday_cos,...,end_tsmonth_year_sin,end_tsday_year_cos,end_tsday_year_sin,cat_most_viewed_n_times,cat_most_viewed,item_most_viewed_n_times,item_most_viewed,cat_views_freqs,item_views_freqs,item_buys_freqs
5182,6,6,1,1.783333,0.833333,2014-04-01 03:02:21.719000+00:00,2014-04-01 03:06:34.307000+00:00,4.200000,0.023810,0.62349,...,1.0,0.021516,0.999769,6,0,1.0,214706432,1.0,0.00192,0.00089
23422,8,3,1,4.766667,0.895238,2014-04-01 03:02:33.088000+00:00,2014-04-01 03:08:53.298000+00:00,6.333333,0.021053,0.62349,...,1.0,0.021516,0.999769,8,0,5.0,214530572,1.0,0.00008,0.00004
30213,3,2,1,0.666667,0.608333,2014-04-01 03:03:30.774000+00:00,2014-04-01 03:04:45.164000+00:00,1.233333,0.040541,0.62349,...,1.0,0.021516,0.999769,3,0,2.0,214832557,1.0,0.00499,0.00353
14611,2,2,1,1.466667,1.466667,2014-04-01 03:03:45.041000+00:00,2014-04-01 03:05:14.036000+00:00,1.466667,0.022727,0.62349,...,1.0,0.021516,0.999769,2,0,1.0,214561496,1.0,0.00005,0.00005
17111,3,2,1,1.466667,1.108333,2014-04-01 03:06:47.766000+00:00,2014-04-01 03:09:01.676000+00:00,2.216667,0.022556,0.62349,...,1.0,0.021516,0.999769,3,0,2.0,214561500,1.0,0.00021,0.00003
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3686,3,2,1,2.383333,1.791667,2014-04-08 00:52:04.863000+00:00,2014-04-08 00:55:40.934000+00:00,3.600000,0.013889,0.62349,...,1.0,-0.098820,0.995105,3,0,2.0,214585397,1.0,0.00084,0.00086
554,8,8,1,5.716667,2.119048,2014-04-08 01:36:10.320000+00:00,2014-04-08 01:51:03.413000+00:00,14.883333,0.008959,0.62349,...,1.0,-0.098820,0.995105,8,0,1.0,214652882,1.0,0.00008,0.00001
751,5,4,1,2.550000,1.658333,2014-04-08 01:58:10.922000+00:00,2014-04-08 02:04:49.923000+00:00,6.650000,0.012531,0.62349,...,1.0,-0.098820,0.995105,5,0,2.0,214640893,1.0,0.00033,0.00063
29047,1,1,1,,,2014-04-08 02:38:50.755000+00:00,2014-04-08 02:38:50.755000+00:00,0.000000,0.000000,0.62349,...,1.0,-0.098820,0.995105,1,0,1.0,214687867,1.0,0.00006,0.00010
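
The per-session counters and duration computed above can also be expressed with pandas named aggregation. The following is a compact sketch on invented data, not a replacement for `process_sessions`; `toy_clicks` and `toy_sessions` are names introduced here for illustration.

```python
import pandas as pd

# Invented clicks: session 1 views item 10 twice and item 11 once.
toy_clicks = pd.DataFrame({
    "session": [1, 1, 1, 2],
    "timestamp": pd.to_datetime([
        "2014-04-07 10:51:00",
        "2014-04-07 10:54:00",
        "2014-04-07 10:54:30",
        "2014-04-07 13:56:00",
    ]),
    "item": [10, 11, 10, 12],
})

# One row per session; each keyword is output_column=(input_column, aggregation).
toy_sessions = toy_clicks.groupby("session").agg(
    total_clicks=("item", "count"),
    total_items=("item", "nunique"),
    start_ts=("timestamp", "min"),
    end_ts=("timestamp", "max"),
)
# Session duration in minutes; a single-click session has duration 0.
toy_sessions["total_duration"] = (
    toy_sessions["end_ts"] - toy_sessions["start_ts"]
).dt.total_seconds() / 60
print(toy_sessions)
```

A single-click session gets a duration of 0, which is why the click rate in the cell above needs the `inf` and `NaN` handling after the division.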


In [13]:
def prep(limit=None):
    print("Prepping data for classification")
    buys = process_buys(limit=limit)
    clicks = read_clicks(limit=limit)
    processed_clicks = process_clicks(clicks)
    sessions = process_sessions(processed_clicks, limit=limit)
    
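    print("Merging clicks and buys")
    # NOTE: `buys` also carries `items_bought`, which is non-null only for sessions
    # with a purchase, so it leaks the label into X unless it is dropped before training.
    X = pd.merge(sessions, buys, how="left", left_index=True, right_index=True)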
    
    X = X.sort_values(by=["start_ts"])
    y = X["is_buy"]
    y = y.fillna(0)
    
    X["cat_most_viewed"] = X["cat_most_viewed"].astype("float64")
    
    # Delete label
    del X["is_buy"]
    
    # Delete time columns (can't be used as is, and we already have the cyclic date parts)
    del X["start_ts"]
    del X["end_ts"]
    
    return X, y

X,  y  = prep(limit)

Prepping data for classification
Loading buys
Processing buys
Buys from 2014-04-01 03:05:31.743000+00:00 to 2014-04-22 01:47:44.603000+00:00
Buys grouped by session 52270 2
Loading clicks
Clicks shape 100000 4
Processing clicks
Processed clicks shape 100000 7
Preprocessing - Grouping clicks into sessions
Computed counters
Computed dwell and duration
Computed click rate
Computed cyclic date parts
Computed most viewed item/cat per session
Loading buys
Computed most viewed/bought freqs
Sessions shape 25292 32
Sessions columns Index(['total_clicks', 'total_items', 'total_cats', 'max_dwell', 'mean_dwell',
       'start_ts', 'end_ts', 'total_duration', 'click_rate',
       'start_tsweekday_cos', 'start_tsweekday_sin', 'start_tsday_month_cos',
       'start_tsday_month_sin', 'start_tsmonth_year_cos',
       'start_tsmonth_year_sin', 'start_tsday_year_cos',
       'start_tsday_year_sin', 'end_tsweekday_cos', 'end_tsweekday_sin',
       'end_tsday_month_cos', 'end_tsday_month_sin', 'end_tsmonth

In [14]:
X

session,total_clicks,total_items,total_cats,max_dwell,mean_dwell,total_duration,click_rate,start_tsweekday_cos,start_tsweekday_sin,start_tsday_month_cos,...,end_tsday_year_cos,end_tsday_year_sin,cat_most_viewed_n_times,cat_most_viewed,item_most_viewed_n_times,item_most_viewed,cat_views_freqs,item_views_freqs,item_buys_freqs,items_bought
5182,6,6,1,1.783333,0.833333,4.200000,0.023810,0.62349,0.781831,1.000000,...,0.021516,0.999769,6,0.0,1.0,214706432,1.0,0.00192,0.00089,
23422,8,3,1,4.766667,0.895238,6.333333,0.021053,0.62349,0.781831,1.000000,...,0.021516,0.999769,8,0.0,5.0,214530572,1.0,0.00008,0.00004,
30213,3,2,1,0.666667,0.608333,1.233333,0.040541,0.62349,0.781831,1.000000,...,0.021516,0.999769,3,0.0,2.0,214832557,1.0,0.00499,0.00353,
14611,2,2,1,1.466667,1.466667,1.466667,0.022727,0.62349,0.781831,1.000000,...,0.021516,0.999769,2,0.0,1.0,214561496,1.0,0.00005,0.00005,
17111,3,2,1,1.466667,1.108333,2.216667,0.022556,0.62349,0.781831,1.000000,...,0.021516,0.999769,3,0.0,2.0,214561500,1.0,0.00021,0.00003,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3686,3,2,1,2.383333,1.791667,3.600000,0.013889,0.62349,0.781831,0.104528,...,-0.098820,0.995105,3,0.0,2.0,214585397,1.0,0.00084,0.00086,
554,8,8,1,5.716667,2.119048,14.883333,0.008959,0.62349,0.781831,0.104528,...,-0.098820,0.995105,8,0.0,1.0,214652882,1.0,0.00008,0.00001,
751,5,4,1,2.550000,1.658333,6.650000,0.012531,0.62349,0.781831,0.104528,...,-0.098820,0.995105,5,0.0,2.0,214640893,1.0,0.00033,0.00063,
29047,1,1,1,,,0.000000,0.000000,0.62349,0.781831,0.104528,...,-0.098820,0.995105,1,0.0,1.0,214687867,1.0,0.00006,0.00010,


In [15]:
y

session
5182     0.0
23422    0.0
30213    0.0
14611    0.0
17111    0.0
        ... 
3686     0.0
554      0.0
751      0.0
29047    0.0
4986     0.0
Name: is_buy, Length: 25292, dtype: float64

In [20]:
from sklearn import linear_model

def classify(X, y):
    
    print("Splitting into train and test")
    X = X.fillna(X.median())
    # shuffle=False keeps the time ordering: train on earlier sessions, test on later ones
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, test_size=0.25, shuffle=False)

    # The log message mentions lightgbm, but the model fitted below is scikit-learn's LogisticRegression
    print("Building datasets for lightgbm")
    
    print("Starting classification")
    model = linear_model.LogisticRegression()
    model.fit(X_train, y_train)
    
    # predict() already returns hard 0/1 class labels, so no rounding is needed
    y_pred = model.predict(X_test)
    print('The accuracy of prediction is:', metrics.accuracy_score(y_test, y_pred))
    # AUC computed on hard labels; model.predict_proba(X_test)[:, 1] would give a score-based AUC
    print('The roc_auc_score of prediction is:', metrics.roc_auc_score(y_test, y_pred))
    print('The null acccuracy is:', max(y_test.mean(), 1 - y_test.mean()))
    
    return model

classify(X,  y)

Splitting into train and test
Building datasets for lightgbm
Starting classification
The accuracy of prediction is: 0.9356318203384469
The roc_auc_score of prediction is: 0.5
The null acccuracy is: 0.9356318203384469


LogisticRegression()
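
In the run above the accuracy equals the null accuracy and the AUC is exactly 0.5, which is what one would expect if the model predicted the majority class for every test session. The sketch below illustrates this on synthetic data and shows one possible way to get a more informative score. The data from `make_classification`, the stratified random split, the `StandardScaler` step and `class_weight="balanced"` are all assumptions of this example and are not part of the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the session features: roughly 5% positives.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

# A constant prediction cannot rank sessions, so its AUC is 0.5.
all_zero = np.zeros_like(y_te)
auc_all_zero = roc_auc_score(y_te, all_zero)

# Scale the features and reweight the classes, then score with probabilities.
clf = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
clf.fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]  # probability of the positive class
auc_from_scores = roc_auc_score(y_te, proba)

print("AUC of all-zero predictions:", auc_all_zero)
print("AUC from predicted probabilities:", auc_from_scores)
```

Passing probabilities instead of hard labels lets `roc_auc_score` evaluate the ranking of sessions, which tends to be more useful than accuracy when about 95% of sessions contain no purchase.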

In [21]:
X, y = prep(limit=100)

Prepping data for classification
Loading buys
Processing buys
Buys from 2014-04-01 07:42:09.535000+00:00 to 2014-04-07 17:53:29.459000+00:00
Buys grouped by session 49 2
Loading clicks
Clicks shape 100 4
Processing clicks
Processed clicks shape 100 7
Preprocessing - Grouping clicks into sessions
Computed counters
Computed dwell and duration
Computed click rate
Computed cyclic date parts
Computed most viewed item/cat per session
Loading buys
Computed most viewed/bought freqs
Sessions shape 23 32
Sessions columns Index(['total_clicks', 'total_items', 'total_cats', 'max_dwell', 'mean_dwell',
       'start_ts', 'end_ts', 'total_duration', 'click_rate',
       'start_tsweekday_cos', 'start_tsweekday_sin', 'start_tsday_month_cos',
       'start_tsday_month_sin', 'start_tsmonth_year_cos',
       'start_tsmonth_year_sin', 'start_tsday_year_cos',
       'start_tsday_year_sin', 'end_tsweekday_cos', 'end_tsweekday_sin',
       'end_tsday_month_cos', 'end_tsday_month_sin', 'end_tsmonth_year_cos',


In [22]:
model = classify(X, y)

Splitting into train and test
Building datasets for lightgbm
Starting classification
The accuracy of prediction is: 0.8333333333333334
The roc_auc_score of prediction is: 0.5
The null acccuracy is: 0.8333333333333334


In [23]:
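# LogisticRegression has no feature_importance() (that is the LightGBM Booster method).
# Use the absolute fitted coefficients as a rough importance measure instead; the features
# are not standardized, so the magnitudes are only loosely comparable.
vs = np.abs(model.coef_[0])
ks = X.columns
d = dict(zip(ks, vs))
sorted(d.items(), key=lambda x: x[1], reverse=True)[:15]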