## EDA to Justify the Proposed Feature Transform

### Imports and Reading in Data

In [None]:
import pandas as pd
import numpy as np
%matplotlib inline

In [3]:
train = pd.read_csv("train.csv")

In [4]:
len(train)

250874

In [5]:
train.head()

Unnamed: 0,ex_id,user_id,prod_id,rating,label,date,review
0,0,923,0,3.0,1,2014-12-08,The food at snack is a selection of popular Gr...
1,1,924,0,3.0,1,2013-05-16,This little place in Soho is wonderful. I had ...
2,2,925,0,4.0,1,2013-07-01,ordered lunch for 15 from Snack last Friday. ...
3,3,926,0,4.0,1,2011-07-28,This is a beautiful quaint little restaurant o...
4,4,927,0,4.0,1,2010-11-01,Snack is great place for a casual sit down lu...


In [34]:
train.user_id.nunique()

125679

With 250,874 reviews from 125,679 distinct users, the first thing we want to understand is how much user repetition there is in our dataset.

### Exploring Dataset for User Counts

In [6]:
counts = pd.DataFrame(train.user_id.value_counts()).reset_index()
counts.columns = ["id","count"]
counts.head()

Unnamed: 0,id,count
0,3504,121
1,2980,99
2,3459,84
3,1011,82
4,3324,81


In [7]:
len(counts)

125679

In [8]:
for i in range(1,11):
    subset = counts[counts['count']>i]
    proportion = (len(subset) / len(counts))*100
    print("The amount of users with more than",i,"review(s) make up",proportion,"% of the dataset")
    

The amount of users with more than 1 review(s) make up 30.643942106477613 % of the dataset
The amount of users with more than 2 review(s) make up 15.833989767582493 % of the dataset
The amount of users with more than 3 review(s) make up 9.942790760588483 % of the dataset
The amount of users with more than 4 review(s) make up 6.911258046292539 % of the dataset
The amount of users with more than 5 review(s) make up 5.167132138225162 % of the dataset
The amount of users with more than 6 review(s) make up 4.0961497147494805 % of the dataset
The amount of users with more than 7 review(s) make up 3.320363783925716 % of the dataset
The amount of users with more than 8 review(s) make up 2.742701644666173 % of the dataset
The amount of users with more than 9 review(s) make up 2.3313361818601357 % of the dataset
The amount of users with more than 10 review(s) make up 1.994764439564287 % of the dataset
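The percentages above are computed over the rows of `counts`, so they are shares of distinct users rather than shares of reviews. As a minimal sketch of the same calculation without the intermediate DataFrame (the helper name `share_with_more_than` and the toy IDs are illustrative, not part of the notebook):

```python
import pandas as pd

def share_with_more_than(user_ids: pd.Series, thresholds) -> pd.Series:
    """Percentage of distinct users whose review count exceeds each threshold."""
    per_user = user_ids.value_counts()  # number of reviews per user
    return pd.Series(
        {k: (per_user > k).mean() * 100 for k in thresholds},
        name="pct_of_users",
    )

# Toy data: user 1 has 3 reviews, user 2 has 2, users 3 and 4 have 1 each.
toy_ids = pd.Series([1, 1, 1, 2, 2, 3, 4])
shares = share_with_more_than(toy_ids, range(1, 4))
print(shares)
```

On the real data this would be called as `share_with_more_than(train.user_id, range(1, 11))`.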


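Note that these percentages are shares of distinct users, not of reviews: about 31% of users wrote more than one review, and only about 5% wrote more than five. Now let's see how the label distribution changes as we filter by number of reviews.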

### Checking Label Distribution Based on Filtering By # of Reviews

In [9]:
train.label.value_counts(normalize=True)



0    0.897084
1    0.102916
Name: label, dtype: float64

In [10]:
for j in [1,5,10,15,20,100]:
    print("-----------------------------------------------------")
    print("LABEL DISTRIBUTION DELETING USERS WITH <=",j,"REVIEWS")
    copy = train.copy()
    filtered = counts[counts['count']<=j]
    print("size of counts df reduced from",len(counts),"to",len(filtered))
    toDelete = filtered['id'].tolist()
    print("We are deleting",len(toDelete),"users")
    print("Original training size:",len(copy))
    copy = copy[~copy['user_id'].isin(toDelete)]
    print("New training size:",len(copy))
    print(copy.label.value_counts(normalize=True))
    print("-----------------------------------------------------")

-----------------------------------------------------
LABEL DISTRIBUTION DELETING USERS WITH <= 1 REVIEWS
size of counts df reduced from 125679 to 87166
We are deleting 87166 users
Original training size: 250874
New training size: 163708
0    0.952256
1    0.047744
Name: label, dtype: float64
-----------------------------------------------------
-----------------------------------------------------
LABEL DISTRIBUTION DELETING USERS WITH <= 5 REVIEWS
size of counts df reduced from 125679 to 119185
We are deleting 119185 users
Original training size: 250874
New training size: 78070
0    0.980236
1    0.019764
Name: label, dtype: float64
-----------------------------------------------------
-----------------------------------------------------
LABEL DISTRIBUTION DELETING USERS WITH <= 10 REVIEWS
size of counts df reduced from 125679 to 123172
We are deleting 123172 users
Original training size: 250874
New training size: 48478
0    0.9816
1    0.0184
Name: label, dtype: float64
-----------
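The filtering loop above can also be expressed without building a list of users to delete. The following is a sketch under the assumption that the columns are named `user_id` and `label` as in `train`; the function name `label_share_by_min_reviews` and the toy frame are hypothetical.

```python
import pandas as pd

def label_share_by_min_reviews(df, min_reviews, user_col="user_id", label_col="label"):
    """Label proportions among rows whose user has more than `min_reviews` reviews."""
    # Attach each row's per-user review count, then keep rows above the cutoff.
    per_row_count = df[user_col].map(df[user_col].value_counts())
    kept = df[per_row_count > min_reviews]
    return kept[label_col].value_counts(normalize=True)

# Toy data: user 1 has 3 reviews, user 2 has 2, user 3 has 1.
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 3],
    "label":   [0, 0, 0, 0, 1, 1],
})
more_than_one = label_share_by_min_reviews(toy, 1)
more_than_two = label_share_by_min_reviews(toy, 2)
print(more_than_one)
print(more_than_two)
```

Mapping the counts back onto the rows avoids copying the full DataFrame on every iteration of the loop.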

### Summary of Results and Appropriate Data Transformation

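Filtering out users with few reviews shifts the label balance sharply: label 1 makes up 10.3% of the full training set, 4.8% after removing users with a single review, and about 2.0% after removing users with 5 or fewer reviews. Beyond that point the balance barely moves (1.8% at a cutoff of 10), so a "splitting criterion" of approximately 5 reviews captures most of the effect. However, hard-coding this threshold into our dataset might cause some problems, so an alternative approach to encode this information is discussed below.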

Thinking about the training set at the granularity of a single instance, it would be data leakage to add a column that gives the total number of reviews written by a user in the set, because that total includes reviews written after the instance in question. Luckily, we can use the <i>date</i> column to instead compute the <b>number of reviews by the user up to the date of the current instance</b>, which circumvents the data leakage issue and still encodes what we believe (based on the evidence above) to be a feature with fairly significant predictive power.
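To make the distinction concrete, here is a small sketch on made-up data contrasting the leaky total with a running count. The column names `total_reviews` and `reviews_to_date` and the toy rows are assumptions for illustration only; the running count here uses a plain sort plus `cumcount`, which is one possible realization of the idea.

```python
import pandas as pd

toy = pd.DataFrame({
    "user_id": [7, 7, 7, 8],
    "date": ["2014-03-01", "2014-01-01", "2014-02-01", "2014-01-15"],
})
toy["date"] = pd.to_datetime(toy["date"])

# Leaky: every row sees the user's final total, including future reviews.
toy["total_reviews"] = toy["user_id"].map(toy["user_id"].value_counts())

# Non-leaky: count only the reviews the user had written by this row's date.
toy = toy.sort_values(["user_id", "date"])
toy["reviews_to_date"] = toy.groupby("user_id").cumcount() + 1
toy = toy.sort_index()  # restore the original row order

print(toy)
```

User 7's earliest review gets `reviews_to_date == 1` even though `total_reviews` is 3 on every one of that user's rows.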

## Implementation of Feature Transform

In [73]:
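def engineered_df(df):
    """Add a 'reviewsToDate' column: the position of each review among its
    user's reviews when ordered by date. Modifies df in place and returns it."""
    rolling_rev = []
    user_dict = {}  # user_id -> sorted list of that user's review dates
    for _, row in df.iterrows():
        curr_date = row['date']
        curr_user = row['user_id']

        # Build and cache the sorted date list the first time we see a user.
        # Dates are ISO strings (YYYY-MM-DD), so string order is date order.
        if curr_user not in user_dict:
            user_dict[curr_user] = sorted(df.loc[df.user_id == curr_user, 'date'])

        # list.index returns the first match, so reviews by the same user on
        # the same date share the same value.
        position = user_dict[curr_user].index(curr_date)
        rolling_rev.append(position + 1)

    df['reviewsToDate'] = rolling_rev
    return df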

In [79]:
transformed_df = engineered_df(train)

In [80]:
transformed_df.head(10)

Unnamed: 0,ex_id,user_id,prod_id,rating,label,date,review,reviewsToDate
0,0,923,0,3.0,1,2014-12-08,The food at snack is a selection of popular Gr...,25
1,1,924,0,3.0,1,2013-05-16,This little place in Soho is wonderful. I had ...,1
2,2,925,0,4.0,1,2013-07-01,ordered lunch for 15 from Snack last Friday. ...,1
3,3,926,0,4.0,1,2011-07-28,This is a beautiful quaint little restaurant o...,1
4,4,927,0,4.0,1,2010-11-01,Snack is great place for a casual sit down lu...,2
5,5,928,0,4.0,1,2009-09-02,A solid 4 stars for this greek food spot. If ...,1
6,7,930,0,4.0,1,2007-05-20,Love this place! Try the Chicken sandwich or ...,1
7,8,931,0,4.0,1,2005-12-27,My friend and I were intrigued by the nightly ...,14
8,10,933,0,5.0,1,2014-01-21,pretty cool place...good food...good people,1
9,12,935,0,5.0,1,2011-01-31,Fabulous Authentic Greek Food!!! This little s...,1
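The row-by-row loop in `engineered_df` filters the whole DataFrame once per distinct user, which is slow on 250k rows. As a hedged alternative, a grouped rank with `method="min"` should give the same values, since it assigns tied dates the position of the first of them. The function name `reviews_to_date` and the toy data are illustrative, and this version returns a new Series instead of modifying the DataFrame.

```python
import pandas as pd

def reviews_to_date(df, user_col="user_id", date_col="date"):
    """Per-user rank of each review by date; ties share the lowest rank."""
    dates = pd.to_datetime(df[date_col])
    ranks = dates.groupby(df[user_col]).rank(method="min")
    return ranks.astype(int)

# Toy data with a same-day tie for user 1.
toy = pd.DataFrame({
    "user_id": [1, 2, 1, 1, 1],
    "date": ["2013-11-19", "2010-01-01", "2013-11-04", "2014-01-04", "2013-11-19"],
})
toy["reviewsToDate"] = reviews_to_date(toy)
print(toy)
```

If same-day reviews should instead get distinct consecutive values, `method="first"` would be the setting to try, at the cost of depending on row order within a day.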


### Sanity Check 

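In the first 10 rows, three users (923, 927, and 931) have a reviewsToDate value greater than 1. Let's verify these results by listing each user's reviews in date order. Note that reviews a user posted on the same date share a single value.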

In [82]:
train[train['user_id']==923].sort_values(by=['date'])

Unnamed: 0,ex_id,user_id,prod_id,rating,label,date,review,reviewsToDate
202384,289603,923,759,5.0,1,2013-11-04,"The falafel were superb, stuffed grape leaved ...",1
122365,174979,923,131,5.0,1,2013-11-11,The food is simply excellent. Everything is as...,2
172282,246460,923,906,5.0,1,2013-11-19,I had Nasi Lemak and Nyonya Seafood Fried Rice...,3
166153,237627,923,622,5.0,1,2013-11-19,This place is amazing.We really love good lati...,3
81951,117298,923,675,5.0,1,2014-01-04,I recently ate at Olea again and continue to b...,5
129686,185412,923,505,5.0,1,2014-01-04,I was in the neighborhood with out-of-town gue...,5
4180,5992,923,19,5.0,1,2014-01-14,The restaurant is on the ground floor of a typ...,7
50805,72648,923,649,5.0,1,2014-02-12,This restaurant was quite a pleasant surprise....,8
128545,183766,923,502,5.0,1,2014-02-12,The crab and pork noodles is delicious!!! I re...,8
182849,261658,923,337,5.0,1,2014-02-18,My husband and I went to this restaurant for a...,10


In [83]:
train[train['user_id']==927].sort_values(by=['date'])

Unnamed: 0,ex_id,user_id,prod_id,rating,label,date,review,reviewsToDate
130005,185887,927,540,5.0,1,2010-10-24,By far one of my favorite restaurants in the c...,1
4,4,927,0,4.0,1,2010-11-01,Snack is great place for a casual sit down lu...,2
120874,172833,927,470,4.0,1,2010-11-22,One of the BEST brunches I have had in the cit...,3


In [84]:
train[train['user_id']==931].sort_values(by=['date'])

Unnamed: 0,ex_id,user_id,prod_id,rating,label,date,review,reviewsToDate
19817,28217,931,78,5.0,1,2005-11-03,"If you have a large group in New York, then th...",1
20823,29684,931,79,4.0,1,2005-11-08,Any restaurant that you can go into at 2 am an...,2
204552,292709,931,766,5.0,1,2005-12-06,"Walking in on a cold fall night, it is not har...",3
1173,1676,931,7,3.0,1,2005-12-08,Very good food in a very casual atmosphere for...,4
228393,326747,931,860,4.0,1,2005-12-09,My cousin and I went to grab a drink on Stone ...,5
115455,165136,931,465,5.0,1,2005-12-12,I honestly do not understand how anyone can no...,6
239312,342337,931,901,4.0,1,2005-12-13,In order to get tapas and a small table you wi...,7
193841,277404,931,723,5.0,1,2005-12-13,I've been to Alta five times since they opened...,7
52140,74496,931,209,4.0,1,2005-12-13,Their wings are great and they make them hot! ...,7
107338,153559,931,442,4.0,1,2005-12-13,"Song is fabulous, even better than the origina...",7
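Spot checks like the three above can be complemented by a programmatic check over all users. The sketch below recomputes, for each row, one plus the number of that user's reviews with a strictly earlier date, and returns any rows where the stored feature disagrees. The helper name `check_reviews_to_date` is hypothetical; the toy rows reuse dates and values shown above for users 923 and 927.

```python
import pandas as pd

def check_reviews_to_date(df, feature_col="reviewsToDate",
                          user_col="user_id", date_col="date"):
    """Return rows whose feature differs from 1 + (# strictly earlier reviews)."""
    dates = pd.to_datetime(df[date_col])
    expected = pd.Series(0, index=df.index)
    for idx in df.groupby(user_col).groups.values():
        d = dates.loc[idx]
        # For each review, count the same user's reviews on earlier dates.
        expected.loc[idx] = [int((d < x).sum()) + 1 for x in d]
    return df[df[feature_col] != expected]

good = pd.DataFrame({
    "user_id": [923, 923, 923, 923, 923, 927, 927, 927],
    "date": ["2013-11-04", "2013-11-11", "2013-11-19", "2013-11-19",
             "2014-01-04", "2010-10-24", "2010-11-01", "2010-11-22"],
    "reviewsToDate": [1, 2, 3, 3, 5, 1, 2, 3],
})
mismatches = check_reviews_to_date(good)

# Corrupt one value to confirm the check flags it.
bad = good.copy()
bad.loc[3, "reviewsToDate"] = 4
flagged = check_reviews_to_date(bad)
print(len(mismatches), len(flagged))
```

The inner list comprehension is quadratic in a user's review count, which is acceptable for a one-off check but not intended as the feature computation itself.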
