You have been provided descriptions of products on Amazon and Flipkart, including details like product title, ratings, reviews, and actual prices. In this challenge, **you will predict discounted prices of the listed products based on their ratings and actual prices.**

# Random Forest
We will use a random forest (RF) regressor here to predict the discounted price.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

### Loading Data

In [3]:
train_set = pd.read_csv('Data/train.csv')
train_set.head(3)

Unnamed: 0,id,title,Rating,maincateg,platform,price1,actprice1,Offer %,norating1,noreviews1,star_5f,star_4f,star_3f,star_2f,star_1f,fulfilled1
0,16695,Fashionable & Comfortable Bellies For Women (...,3.9,Women,Flipkart,698,999,30.13%,38.0,7.0,17.0,9.0,6.0,3,3,0
1,5120,Combo Pack of 4 Casual Shoes Sneakers For Men ...,3.8,Men,Flipkart,999,1999,50.03%,531.0,69.0,264.0,92.0,73.0,29,73,1
2,18391,Cilia Mode Leo Sneakers For Women (White),4.4,Women,Flipkart,2749,4999,45.01%,17.0,4.0,11.0,3.0,2.0,1,0,1


In [4]:
# Loading X_train & y_train

X_train_orig = train_set.drop(['Offer %', 'price1'], axis=1)
print(X_train_orig.shape)  # same as X_test !
X_train_orig.head(2)

(15730, 14)


Unnamed: 0,id,title,Rating,maincateg,platform,actprice1,norating1,noreviews1,star_5f,star_4f,star_3f,star_2f,star_1f,fulfilled1
0,16695,Fashionable & Comfortable Bellies For Women (...,3.9,Women,Flipkart,999,38.0,7.0,17.0,9.0,6.0,3,3,0
1,5120,Combo Pack of 4 Casual Shoes Sneakers For Men ...,3.8,Men,Flipkart,1999,531.0,69.0,264.0,92.0,73.0,29,73,1


In [5]:
# y_train
y_train_orig = train_set[['Offer %', 'price1']]
y_train_orig.head()

Unnamed: 0,Offer %,price1
0,30.13%,698
1,50.03%,999
2,45.01%,2749
3,15.85%,518
4,40.02%,1379


In [6]:
y_train_price = y_train_orig['price1']
y_train_offer = y_train_orig['Offer %']
y_train_price.shape

(15730,)

### Encode Columns

Columns:

    title - name of the product
    Rating - average rating given to a product
    maincateg - category that the product is listed under (Men/Women)
    platform - platform on which it is sold (e.g. Amazon, Flipkart)
    actprice1 - actual price of the listed product
    norating1 - number of ratings available for a particular product
    noreviews1 - number of reviews available for a particular product
    star_5f - number of five-star ratings given to a particular product
    star_4f - number of four-star ratings given to a particular product
    star_3f - number of three-star ratings given to a particular product
    star_2f - number of two-star ratings given to a particular product
    star_1f - number of one-star ratings given to a particular product
    fulfilled1 - whether it is Amazon fulfilled or not
    Offer % - discount percent
    price1 - discounted price of the listed product
    
The goal is to predict **discounted prices** of the listed products based on their ratings and actual prices.

In [7]:
X_train_orig.isna().sum()

id              0
title           0
Rating          0
maincateg     526
platform        0
actprice1       0
norating1     678
noreviews1    578
star_5f       588
star_4f       539
star_3f       231
star_2f         0
star_1f         0
fulfilled1      0
dtype: int64

In [8]:
# Filling maincateg NaN using title
def fill_maincateg(df):
    for ind, item in enumerate(df.maincateg):
        # print(item)
        
        # anything other than "Men"/"Women" is treated as missing; pd.isna(item) would test for NaN explicitly
        if(item!="Men" and item != "Women"):
            # print(df.title[ind])
            #if(df.title[ind].str.contains('Men')):
            if("Men" in df.title[ind]):
                df.loc[ind, "maincateg"] = 'Men'
            else:
                df.loc[ind, "maincateg"] = 'Women'
    print("Done")
    
    return df
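
The loop above indexes `df.title[ind]` by position, so it relies on the frame having a default 0..n-1 index. As a minimal sketch of an alternative, the same rule can be written without a Python loop. The function name `fill_maincateg_vectorized` and the demo frame are hypothetical, and unlike the original this variant only fills true NaN values and returns a copy instead of modifying its argument.

```python
import numpy as np
import pandas as pd

def fill_maincateg_vectorized(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical vectorized variant of fill_maincateg; returns a copy."""
    out = df.copy()
    missing = out["maincateg"].isna()
    # Case-sensitive match: "Women" does not contain the substring "Men".
    has_men = out["title"].str.contains("Men", regex=False, na=False)
    out.loc[missing, "maincateg"] = np.where(has_men[missing], "Men", "Women")
    return out

# Made-up rows with a non-default index, to show that no positional lookup is needed.
demo = pd.DataFrame(
    {
        "title": ["Sneakers For Men (Black)", "Bellies For Women (Pink)", "Casual Shoes For Men"],
        "maincateg": [np.nan, np.nan, "Men"],
    },
    index=[10, 20, 30],
)
filled = fill_maincateg_vectorized(demo)
print(filled["maincateg"].tolist())
```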

In [9]:
train_na_cols = {'norating1': X_train_orig.norating1.mean(), 'noreviews1': X_train_orig.noreviews1.mean()}
train_na_cols

{'norating1': 3057.6607759766143, 'noreviews1': 423.97630675818374}

In [10]:
# for encoding 'train_set'

def encode_train_cols(X):
    # Filling maincateg using title
    #X.maincateg = X.maincateg.fillna('Men' if X.title.str.con)
    fill_maincateg(X)
    
    # Drop "id", "title" and the per-star rating counts
    cols_to_drop = ['id', 'title', 'star_5f', 'star_4f', 'star_3f', 'star_2f', 'star_1f']
    X.drop(cols_to_drop, axis=1, inplace=True)
    
    # Handling Missing values
    # replacing with the column means computed on the train set (train_na_cols)
    X.fillna(train_na_cols, inplace=True)
    
    # OHE "maincateg" & "platform"
    dummy_features = ['maincateg', 'platform']
    X = pd.get_dummies(X, columns=dummy_features)
    
    return X

In [11]:
test_na_cols = {'Rating': X_train_orig.Rating.mean()}
test_na_cols

{'Rating': 4.012873490146217}

In [12]:
# for encoding 'test set'
def encode_test_cols(X):
    # Filling maincateg using title
    fill_maincateg(X)
    
    # Drop "id", "title" and the per-star rating counts
    cols_to_drop = ['id', 'title', 'star_5f', 'star_4f', 'star_3f', 'star_2f', 'star_1f']
    X.drop(cols_to_drop, axis=1, inplace=True)
    
    # Handling Missing values
    # replacing missing Rating with the mean Rating of the train set (test_na_cols)
    X.fillna(test_na_cols, inplace=True)
    
    # OHE "maincateg" & "platform"
    dummy_features = ['maincateg', 'platform']
    X = pd.get_dummies(X, columns=dummy_features)
    
    return X
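
Both encoders call `pd.get_dummies` independently, so the train and test frames only end up with the same columns if every category appears in both files. A minimal sketch of one way to guard against a mismatch is to reindex the encoded test frame to the training columns. The tiny frames below are made up for illustration.

```python
import pandas as pd

# Made-up frames: "Amazon" appears in train but not in test.
train = pd.DataFrame({"platform": ["Flipkart", "Amazon", "Flipkart"], "actprice1": [999, 1999, 4999]})
test = pd.DataFrame({"platform": ["Flipkart", "Flipkart"], "actprice1": [999, 499]})

train_enc = pd.get_dummies(train, columns=["platform"])
test_enc = pd.get_dummies(test, columns=["platform"])

# Reindex the test frame to the training columns; dummy columns
# missing from the test frame are created and filled with 0.
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(list(test_aligned.columns))
```

With the data in this notebook both platforms and both categories show up in train and test, so the columns already match, but the reindex makes that assumption explicit.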

In [13]:
X_train_orig = encode_train_cols(X_train_orig)
print(X_train_orig.shape)
X_train_orig.head()

Done
(15730, 9)


Unnamed: 0,Rating,actprice1,norating1,noreviews1,fulfilled1,maincateg_Men,maincateg_Women,platform_Amazon,platform_Flipkart
0,3.9,999,38.0,7.0,0,False,True,False,True
1,3.8,1999,531.0,69.0,1,True,False,False,True
2,4.4,4999,17.0,4.0,1,False,True,False,True
3,4.2,724,46413.0,6229.0,1,True,False,False,True
4,3.9,2299,77.0,3.0,1,True,False,False,True


### Training Model - RF

In [14]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor   # price1 is continuous, so a regressor (not a classifier) is needed

In [15]:
X_train, X_valid, y_train_both, y_valid_both = train_test_split(X_train_orig, y_train_orig, test_size=0.15, random_state=0)
X_train.shape

(13370, 9)
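
The shape above follows from the split arithmetic. As a small check, assuming `train_test_split` rounds the held-out share up to a whole row and gives the remainder to the training set:

```python
import math

n_samples = 15730   # rows in train.csv
test_size = 0.15    # fraction held out for validation

n_valid = math.ceil(test_size * n_samples)  # held-out rows, rounded up
n_train = n_samples - n_valid               # remaining rows used for fitting
print(n_train, n_valid)
```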

In [17]:
X_train.head()

Unnamed: 0,Rating,actprice1,norating1,noreviews1,fulfilled1,maincateg_Men,maincateg_Women,platform_Amazon,platform_Flipkart
13681,4.2,3499,3057.660776,423.976307,1,True,False,False,True
11135,3.8,998,2419.0,313.0,1,True,False,False,True
2206,4.2,1474,14908.0,2466.0,1,False,True,False,True
4446,3.7,1400,108.0,21.0,1,False,True,False,True
14137,4.0,2999,10.0,2.0,0,False,True,False,True


In [None]:
# # y_train
# y_train_offer = y_train['Offer %']
# y_train_price = y_train['price1']
# y_train_offer.head()
# y_train_price.shape

In [18]:
y_train = y_train_both['price1']
y_valid = y_valid_both['price1']

In [19]:
rf = RandomForestRegressor(n_estimators=20)
rf.fit(X_train, y_train)

RandomForestRegressor(n_estimators=20)

In [20]:
print(rf.score(X_train, y_train))
rf.score(X_valid, y_valid)

# noticeably lower R^2 on the validation set - overfitting?

0.9818623156142355


0.9070186174860693
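
For a regressor, `score` returns the coefficient of determination R², not an accuracy. A minimal sketch of what that number measures, with a hypothetical helper `r2_manual` and made-up predictions (the true values are the first five `price1` entries shown earlier):

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = [698, 999, 2749, 518, 1379]
y_pred = [700, 950, 2700, 600, 1400]   # made-up predictions, for illustration only
print(r2_manual(y_true, y_pred))
```

A perfect model scores 1.0, and a model that always predicts the mean of `y_true` scores 0.0.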

In [None]:
# from sklearn.metrics import accuracy_score
# pred_val = rf.predict(X_valid)
# print(accuracy_score(y_valid, pred_val))

In [21]:
from sklearn.metrics import mean_squared_error

pred_train = rf.predict(X_train)
print("Train: ", np.sqrt(mean_squared_error(y_train, pred_train)))

pred_val = rf.predict(X_valid)
print("Val: ", np.sqrt(mean_squared_error(y_valid, pred_val)))

Train:  87.74420445382452
Val:  194.26846167677164
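
The RMSE printed above is in the same units as `price1`, which makes it easier to read than R². A minimal sketch of the computation with plain NumPy; the helper name `rmse` is hypothetical:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: square the errors, average them, take the square root."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Errors of 3 and 4 give sqrt((9 + 16) / 2).
print(rmse([0, 0], [3, 4]))
```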


#### Generate submission file for rf


In [22]:
X_test = pd.read_csv('Data/test.csv')
test_id = X_test['id']
print(test_id[:3])

0     2242
1    20532
2    10648
Name: id, dtype: int64


In [23]:
X_test = encode_test_cols(X_test)
X_test.head()

Done


Unnamed: 0,Rating,actprice1,norating1,noreviews1,fulfilled1,maincateg_Men,maincateg_Women,platform_Amazon,platform_Flipkart
0,3.8,999,27928,3543,1,True,False,False,True
1,3.9,499,3015,404,1,False,True,False,True
2,3.9,999,449,52,1,False,True,False,True
3,3.9,2999,290,40,1,True,False,False,True
4,3.9,999,2423,326,0,True,False,False,True


In [24]:
pred_test = rf.predict(X_test)
pred_test[:5]

array([426.36727273, 294.60277778, 458.18333333, 895.4       ,
       400.4       ])

In [25]:
subm_file = pd.DataFrame(test_id)
subm_file['price1'] = pred_test
subm_file.head()

Unnamed: 0,id,price1
0,2242,426.367273
1,20532,294.602778
2,10648,458.183333
3,20677,895.4
4,12593,400.4


In [26]:
import os
if not os.path.exists("subm"):
    print("Created subm dir")
    os.makedirs("subm")
    
subm_file.to_csv('subm/5_rf.csv', index=False)

**Score: 199**, roughly in line with the validation RMSE (about 194), but worse than hoped.

## Hyperparameter-tuning

In [27]:
def score(model, title):
    model.fit(X_train, y_train)
    
    print("RMSE for", title, ": ")
    
    pred_train = model.predict(X_train)
    print("Train: ", np.sqrt(mean_squared_error(y_train, pred_train)))

    pred_val = model.predict(X_valid)
    print("Val: ", np.sqrt(mean_squared_error(y_valid, pred_val)))
    
    # print("R^2 for", title)
    # print("\tTrain: ", model.score(X_train, y_train))
    # print("\tValid: ", model.score(X_valid, y_valid))

In [28]:
RF = RandomForestRegressor(n_estimators=1000, max_depth=10, random_state=0)
score(RF, "RandomForest")

RMSE for RandomForest : 
Train:  173.29092056470597
Val:  220.28506755619608


#### Generate submission file for RF 

In [29]:
# pred_val = RF.predict(X_valid)
# print("Val: ", np.sqrt(mean_squared_error(y_valid, pred_val)))

# no refit needed: score() already fitted RF, so it can predict on X_test directly

In [30]:
pred_test2 = RF.predict(X_test)

In [31]:
subm_file = pd.DataFrame(test_id)
subm_file['price1'] = pred_test2
subm_file.head()

Unnamed: 0,id,price1
0,2242,436.641355
1,20532,293.234288
2,10648,443.537603
3,20677,950.391864
4,12593,412.550063


In [32]:
subm_file.to_csv('subm/5_rf_2.csv', index=False)

**Score: 225**, worse, as expected: capping `max_depth` at 10 also raised the validation RMSE (220 vs. 194).

### RF with more params

In [33]:
SRF = RandomForestRegressor(max_depth=30,max_features=5,min_samples_leaf=1,min_samples_split=2,n_estimators=580,bootstrap=True)
score(SRF, "SRF")

RMSE for SRF : 
Train:  80.84492525665672
Val:  183.66551430813826


In [36]:
def gen_subm_file(model):
    X_test = pd.read_csv('Data/test.csv')
    test_id = X_test['id']
    
    X_test = encode_test_cols(X_test)
    
    pred_test = model.predict(X_test)
    
    subm_file = pd.DataFrame(test_id)
    subm_file['price1'] = pred_test
    
    return subm_file

In [37]:
subm_file = gen_subm_file(SRF)
subm_file.to_csv("subm/5_rf_3.csv", index=False)

Done


In [38]:
# subm_file = pd.read_csv('5_rf_3.csv')
# print(subm_file.isna().sum())
# subm_file.head()

**Score: 191**, the best so far.

# Feature Scaling

In [39]:
def normalize(X):
    features = X.columns
    X[features] /= X_train[features].max()
    return X
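
`normalize` divides every column by its maximum in the global `X_train`, and it modifies the frame passed in. A minimal sketch of the same max-scaling idea with the reference frame passed explicitly and no in-place change; `max_scale` and the demo frames are hypothetical, and a column whose maximum is 0 would need separate handling.

```python
import pandas as pd

def max_scale(X: pd.DataFrame, reference: pd.DataFrame) -> pd.DataFrame:
    """Divide each column of X by that column's maximum in `reference`; returns a new frame."""
    col_max = reference[X.columns].max()
    return X / col_max

# Made-up rows for illustration.
train = pd.DataFrame({"Rating": [3.9, 3.8, 5.0], "actprice1": [999, 1999, 4999]})
test = pd.DataFrame({"Rating": [4.0], "actprice1": [9998]})

train_scaled = max_scale(train, reference=train)
test_scaled = max_scale(test, reference=train)   # test rows reuse the training maxima
print(test_scaled)
```

Because the test frame is scaled with training statistics, values above 1 are possible when a test row exceeds the training maximum.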

In [40]:
X_train_norm = normalize(X_train_orig.copy())
X_train_norm.head()

Unnamed: 0,Rating,actprice1,norating1,noreviews1,fulfilled1,maincateg_Men,maincateg_Women,platform_Amazon,platform_Flipkart
0,0.78,0.074005,0.000131,0.000154,0.0,0.0,1.0,0.0,1.0
1,0.76,0.148085,0.001831,0.001518,1.0,1.0,0.0,0.0,1.0
2,0.88,0.370324,5.9e-05,8.8e-05,1.0,0.0,1.0,0.0,1.0
3,0.84,0.053634,0.16006,0.137058,1.0,1.0,0.0,0.0,1.0
4,0.78,0.170309,0.000266,6.6e-05,1.0,1.0,0.0,0.0,1.0


In [41]:
X_train_orig.head()  # should not get normalized

Unnamed: 0,Rating,actprice1,norating1,noreviews1,fulfilled1,maincateg_Men,maincateg_Women,platform_Amazon,platform_Flipkart
0,3.9,999,38.0,7.0,0,False,True,False,True
1,3.8,1999,531.0,69.0,1,True,False,False,True
2,4.4,4999,17.0,4.0,1,False,True,False,True
3,4.2,724,46413.0,6229.0,1,True,False,False,True
4,3.9,2299,77.0,3.0,1,True,False,False,True


In [42]:
X_train_norm.shape  # should be (15730, 9)

(15730, 9)

In [43]:
y_train_offer = y_train_orig['Offer %']
y_train_offer.shape   # should be (15730,)

(15730,)

In [44]:
y_train_offer.head()

0    30.13%
1    50.03%
2    45.01%
3    15.85%
4    40.02%
Name: Offer %, dtype: object

In [45]:
y_train_offer = y_train_offer.str.replace(r'%', '')
y_train_offer = y_train_offer.astype(float)
y_train_offer.head()

0    30.13
1    50.03
2    45.01
3    15.85
4    40.02
Name: Offer %, dtype: float64

In [46]:
y_train_offer /= 100
y_train_offer.head()
y_train_offer.describe()

count    15730.000000
mean         0.468025
std          0.192687
min          0.000000
25%          0.359400
50%          0.500700
75%          0.601600
max          0.889300
Name: Offer %, dtype: float64

In [47]:
y_train_offer.shape  # should be (15730,)

(15730,)

### Training - after feature scaling

In [48]:
X_train2, X_valid2, y_train2, y_valid2 = train_test_split(X_train_norm, y_train_offer,test_size=0.15, random_state=0)
X_train2.shape

(13370, 9)

In [49]:
def score2(model, title):
    print("fitting the model..")
    model.fit(X_train2, y_train2)
    
    print("RMSE for", title, ": ")
    
    pred_train = model.predict(X_train2)
    print("Train: ", np.sqrt(mean_squared_error(y_train2, pred_train)))

    pred_val = model.predict(X_valid2)
    print("Val: ", np.sqrt(mean_squared_error(y_valid2, pred_val)))

In [50]:
rf2 = RandomForestRegressor(n_estimators=20)
score2(rf2, "RF2")

fitting the model..
RMSE for RF2 : 
Train:  0.04810694527140603
Val:  0.11005138954937278


In [51]:
# offer is the predicted discount as a fraction (0.30 means 30% off)
def predict_price(offer, test_actprice):
    # return a new Series instead of modifying test_actprice in place
    return test_actprice * (1 - offer)
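
As a quick worked check of the relation behind `predict_price`, the first three training rows shown at the top of the notebook can be plugged in. This is only a sanity check on the formula, not part of the pipeline.

```python
import numpy as np

# actprice1 and Offer % (as fractions) from the first three rows of train.csv.
actprice = np.array([999.0, 1999.0, 4999.0])
offer = np.array([0.3013, 0.5003, 0.4501])

# discounted price = actual price * (1 - discount fraction)
price = actprice * (1.0 - offer)
print(np.round(price))   # compare with price1 = 698, 999, 2749
```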

In [52]:
def gen_subm_file2(model):
    X_test = pd.read_csv('Data/test.csv')
    test_id = X_test['id']
    test_actprice = X_test['actprice1']
    
    X_test = encode_test_cols(X_test)
    X_test = normalize(X_test)
    
    pred_test_offer = model.predict(X_test)
#     print("offer: ", pred_test_offer[:5])
    pred_test_price = predict_price(pred_test_offer, test_actprice)
#     print("price: ", pred_test_price[:5])
    
    subm_file = pd.DataFrame(test_id)
    subm_file['price1'] = pred_test_price
    
    return subm_file

In [53]:
subm_file = gen_subm_file2(rf2)
subm_file.to_csv("subm/5_rf_norm.csv", index=False)

Done


In [55]:
subm_file = pd.read_csv('subm/5_rf_norm.csv')
subm_file.head()

Unnamed: 0,id,price1
0,2242,424.536705
1,20532,297.599982
2,10648,387.247365
3,20677,867.07088
4,12593,402.1974


**Score: 197**. I was expecting a significant improvement from scaling, but this is barely better than the unscaled `rf` (199).

### Hyperparameter search using GridSearchCV (on the scaled data, via `score2`)

In [56]:
from sklearn.model_selection import GridSearchCV

In [57]:
rfc=RandomForestRegressor(random_state=0)
param_grid = { 
    'n_estimators': [200, 500],
    'max_features': ['auto', 'sqrt', 'log2'],  # 'auto' (all features) was removed in scikit-learn 1.3; use 1.0 there
    'max_depth' : [4,5,6,7,8,9,10],
    'criterion' :['squared_error']
}
param_grid

{'n_estimators': [200, 500],
 'max_features': ['auto', 'sqrt', 'log2'],
 'max_depth': [4, 5, 6, 7, 8, 9, 10],
 'criterion': ['squared_error']}

In [55]:
# param_grid = {  'bootstrap': [True], 'max_depth': [5, 10, None], 'max_features': ['auto', 'log2'], 'n_estimators': [5, 6, 7, 8, 9, 10, 11, 12, 13, 15]}

In [58]:
# takes time to fit GridSearchCV
CV_rfc = GridSearchCV(estimator=rfc, param_grid=param_grid, cv= 5)
score2(CV_rfc, "CV RF")
print(CV_rfc.best_params_)

fitting the model..
RMSE for CV RF : 
Train:  0.12391851669903312
Val:  0.13590428983143404
{'criterion': 'squared_error', 'max_depth': 10, 'max_features': 'auto', 'n_estimators': 500}
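
Without a `scoring` argument, `GridSearchCV` ranks candidates by the estimator's own `score`, which is R² for a regressor. Since this notebook reports RMSE everywhere else, a minimal sketch of selecting on RMSE directly is shown below. The data and the small grid are made up so the example runs quickly; they are not the settings used above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
# Synthetic regression problem, for illustration only.
X = rng.uniform(0, 1, size=(120, 3))
y = 2.0 * X[:, 0] + rng.normal(0, 0.05, size=120)

search = GridSearchCV(
    estimator=RandomForestRegressor(n_estimators=20, random_state=0),
    param_grid={"max_depth": [2, 4], "max_features": ["sqrt", 1.0]},
    scoring="neg_root_mean_squared_error",  # higher is better, hence the negated RMSE
    cv=3,
)
search.fit(X, y)

print(search.best_params_)
print("CV RMSE:", -search.best_score_)
```

After fitting, `search.best_estimator_` is the model refit on all the data passed to `fit` with the best parameters, and `search.predict` delegates to it.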


In [59]:
CV_rfc.best_params_

{'criterion': 'squared_error',
 'max_depth': 10,
 'max_features': 'auto',
 'n_estimators': 500}

In [60]:
# CV_rfc.best_params_ = {'criterion': 'squared_error', 'max_depth': 10, 'max_features': 'auto', 'n_estimators': 500}

In [61]:
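# NOTE: rf2 is passed here, not CV_rfc, so this file is identical to subm/5_rf_norm.csv.
# Pass CV_rfc instead to submit the grid-searched model.
subm_file = gen_subm_file2(rf2)
subm_file.to_csv("subm/5_rf_CV.csv", index=False)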

Done


In [62]:
subm_file.head()

Unnamed: 0,id,price1
0,2242,424.536705
1,20532,297.599982
2,10648,387.247365
3,20677,867.07088
4,12593,402.1974


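**Score: 197**, identical to the `rf2` submission because `rf2` (not `CV_rfc`) was passed to `gen_subm_file2` above.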