# Problem Definition

Predicting product rating from the Wish.com dataset. I will start by implementing data exploration techniques to evaluate the nature and significance of the given features. I will then pre-process the data so that it contains only the information that is valuable for model prediction. Several models will be hyperparameter-tuned and compared based on their mean F-score.

Steps:

1. Loading data / libraries

2. Data exploration

3. Data pre-processing

4. Normalizing / Scaling / Encoding / Feature generation

5. Final data preparation

6. Modeling and performance evaluation

7. Feature evaluation

8. Findings

9. Predict rating for test_new.csv to be submitted on Kaggle

# Loading Data / Libraries

In [None]:
# list of code resources used

# https://www.kaggle.com/nibukdk93/summer-sales-wish-acc-100
# https://www.kaggle.com/niteshhalai/sales-of-summer-clothes-in-e-commerce-wish
# https://www.kaggle.com/mudithsilva/unit-sold-predict-model
# https://www.kaggle.com/tanujdhiman/summer-product-sale-rating

In [None]:
# load the libraries used
from google.colab import drive
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import preprocessing


from sklearn.model_selection import KFold, train_test_split, cross_val_score, cross_val_predict
from sklearn.metrics import f1_score
from statistics import mean, stdev
from imblearn.over_sampling import RandomOverSampler

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import VotingClassifier



In [None]:
# grab dataset from google drive
drive.mount('/content/drive')

# loading the train_new.csv data as a pandas dataframe
df = pd.read_csv("/content/drive/My Drive/CISC873_Assignment0_data/train_new.csv")

# loading the test_new.csv data as a pandas dataframe to be used later for kaggle prediction
df_test_for_kaggle = pd.read_csv("/content/drive/My Drive/CISC873_Assignment0_data/test_new.csv")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# grabbing the IDs to be later used in predicting using x_test dataset
df_test_id = pd.read_csv("/content/drive/My Drive/CISC873_Assignment0_data/test_new.csv")
test_id=df_test_id['id']
print(test_id)

# Data exploration

In [None]:
# peek into data

# print first and last few rows of the dataset
print(df.head())
print(df.tail())

# print all column headers + type
print(df.info())

# checking for missing values for every feature
df.isnull().sum()

#NOTES: These statistics show how the data is structured and what it contains, and give an idea of how much 'fun' this is going to be

In [None]:
# synonym features check

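# correlation heat map to check for synonym columns (identical or almost identical features)
plt.figure(figsize = (20, 10))
# numeric_only=True skips text columns; recent pandas versions raise an error on them otherwise
sns.heatmap(df.corr(numeric_only=True), annot = True, cmap = 'coolwarm', center = 0)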
plt.show()

#NOTES: the correlation heat map can be used to identify near-identical features (columns), since they show up as highly correlated
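
Reading a large heat map by eye is easy to get wrong, so the same check can be sketched as a list of column pairs above a threshold. This is only an illustration: the `toy` frame, the `correlated_pairs` helper and the 0.95 threshold are my own assumptions, not part of the dataset or the original analysis.

```python
import pandas as pd

# Toy frame standing in for the numeric columns of df (hypothetical values).
toy = pd.DataFrame({
    "price": [1.0, 2.0, 3.0, 4.0, 5.0],
    "price_copy": [1.1, 2.0, 3.1, 4.0, 5.1],  # nearly identical to price
    "units_sold": [50, 10, 40, 20, 30],
})

def correlated_pairs(frame, threshold=0.95):
    """Return (col_a, col_b, |r|) for column pairs with |r| >= threshold."""
    corr = frame.corr(numeric_only=True).abs()
    cols = list(corr.columns)
    pairs = []
    # Walk the upper triangle only, so each pair is reported once.
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = float(corr.iloc[i, j])
            if r >= threshold:
                pairs.append((cols[i], cols[j], r))
    return pairs

pairs = correlated_pairs(toy)
print(pairs)
```

One column of each reported pair is a candidate for dropping; which one to keep is still a judgment call.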

In [None]:
# uni-value features check

# check for uni-value columns
def is_unique(s):
    a = s.to_numpy()
    return (a[0] == a).all()
# returns True if all values of a feature are the same
is_unique(df['currency_buyer']) # replace 'currency_buyer' with any feature you want to check

#NOTES: checking for features that have the same or mostly the same values for every row as they are not descriptive enough 
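
The check above handles one column at a time and only catches columns that are fully constant. A possible way to cover "mostly the same" values for every column at once is sketched below. The `toy` frame, the `dominant_share` helper and the 0.75 cut-off are assumptions made for illustration.

```python
import pandas as pd

# Toy frame standing in for df (hypothetical values).
toy = pd.DataFrame({
    "currency_buyer": ["EUR"] * 5,
    "theme": ["summer", "summer", "summer", "summer", "winter"],
    "price": [1.0, 2.0, 3.0, 4.0, 5.0],
})

def dominant_share(frame):
    """Share of rows taken by the most frequent value in each column."""
    return frame.apply(
        lambda s: s.value_counts(normalize=True, dropna=False).iloc[0]
    )

shares = dominant_share(toy)
# Columns where a single value covers at least 75% of the rows.
near_constant = shares[shares >= 0.75].index.tolist()
print(near_constant)
```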

In [None]:
# features/ target features distribution check

# checking the distribution of features
select_feature = df['rating'] # replace 'rating' with any feature you want to check

feature_distribution_hist = plt.hist(select_feature) # distribution using a histogram
plt.show()

feature_distribution_kde = sns.kdeplot(select_feature) # dist using kde
plt.show()

# stats summary of features
print(round(df.describe().T))
# numeric_only=True restricts the median to numeric columns
print(df.median(numeric_only=True))


#NOTES: checking the distribution helps understand the feature in terms of its range. It will help see if the values are centered or scattered.
#NOTES: I am checking with 2 graphs (kde and histogram) since some features are better represented by one than by the other
#NOTES: I included stats summary (mean) and median to mathematically check to see if any features are heavily + or - skewed
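
Comparing mean and median by eye works, but the skew can also be put in a single table. The sketch below uses a made-up `toy` frame; the column values are not taken from the dataset.

```python
import pandas as pd

# Toy frame standing in for the numeric columns of df (hypothetical values).
toy = pd.DataFrame({
    "units_sold": [10, 10, 50, 100, 100, 1000, 20000],
    "rating": [3, 4, 4, 4, 4, 4, 5],
})

# One row per feature: a mean far above the median and a large positive
# skew both point to a long right tail.
summary = pd.DataFrame({
    "mean": toy.mean(),
    "median": toy.median(),
    "skew": toy.skew(),
})
print(summary)
```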

In [None]:
# checking for significant relationships between features/features and features/target


# to see if there is a relationship between features and target, I first have to determine what is considered a good rating and flag those products with is_successful_rating
# converting to int since the values were all .0 anyway
df.rating = df.rating.astype(int)
def is_successful_rating(rating):
    if rating >= 4:
        return 1
    else:
        return 0

# creating the column that flags products with ratings >= 4 and adding it to df
df['is_successful_rating'] = df['rating'].apply(is_successful_rating)

# check the difference between price and retail_price and see if there is an association with successful vs unsuccessful products
print('Overall stats:')
print(df['price'].mean())
print(df['retail_price'].mean())
print('----------------------')
print('Stats for successful products:')
print(df[df['is_successful_rating'] == 1]['price'].mean())
print(df[df['is_successful_rating'] == 1]['retail_price'].mean())
print('----------------------')
print('Stats for unsuccessful products:')
print(df[df['is_successful_rating'] == 0]['price'].mean())
print(df[df['is_successful_rating'] == 0]['retail_price'].mean())

# checking to see if difference in price between retail and price is associated with ratings
df['diff_in_price'] = round(df['retail_price'] - df['price'],2)
sns.violinplot(data=df, y='diff_in_price', x='is_successful_rating') # y (for sns plot) was changed to various features to see the results


# Data cleaning and pre-processing


In [None]:
# fixing some of the feature formatting for uniformity 

# fixing 'has_urgency_banner'
# has_urgency_banner needs to be binary, so missing values are filled with 0
df['has_urgency_banner'] = df['has_urgency_banner'].fillna(0)
# converting to int to match the other binary columns
df.has_urgency_banner = df.has_urgency_banner.astype(int)

# fixing 'urgency_text'
# urgency_text needs to be binary, so missing values are filled with 0
df['urgency_text'] = df['urgency_text'].fillna(0)
# coerce to numeric so that text such as 'Quantité limitée !' becomes NaN
df.urgency_text = pd.to_numeric(df.urgency_text, errors='coerce')
# NaN to 1 to create a binary field
df['urgency_text'] = df['urgency_text'].fillna(1)
# converting to int to match the other binary columns
df.urgency_text = df.urgency_text.astype(int)

In [None]:
# unifying different variations of the same thing under one category e.g blue and Blue and light blue = blue

# Since this feature consists mainly of US and China we can combine the rest into others
df['origin_country'] = df['origin_country'].replace('VE', 'Other')
df['origin_country'] = df['origin_country'].replace('AT', 'Other')
df['origin_country'] = df['origin_country'].replace('SG', 'Other')
df['origin_country'] = df['origin_country'].replace('GB', 'Other')
df['origin_country'] = df['origin_country'].replace(np.nan, 'Other')
# visual check to see if it correctly categorized everything
sns.countplot(x='origin_country', data=df) # the column is passed as x= because recent seaborn versions reject it as a positional argument



# Replacing different variations of sizes into one
df['product_variation_size_id'] = df['product_variation_size_id'].replace('S.', 'S')
df['product_variation_size_id'] = df['product_variation_size_id'].replace('XS.', 'S')
df['product_variation_size_id'] = df['product_variation_size_id'].replace('M.', 'M')
df['product_variation_size_id'] = df['product_variation_size_id'].replace('Size S', 'S')
df['product_variation_size_id'] = df['product_variation_size_id'].replace('Size-XS', 'S')
df['product_variation_size_id'] = df['product_variation_size_id'].replace('SIZE XS', 'S')
df['product_variation_size_id'] = df['product_variation_size_id'].replace('Size-S', 'S')
df['product_variation_size_id'] = df['product_variation_size_id'].replace('Size4XL', 'XL')
df['product_variation_size_id'] = df['product_variation_size_id'].replace('size S', 'S')
df['product_variation_size_id'] = df['product_variation_size_id'].replace('Size M', 'M')
df['product_variation_size_id'] = df['product_variation_size_id'].replace('Size -XXS', 'S')
df['product_variation_size_id'] = df['product_variation_size_id'].replace('SIZE-XXS', 'S')
df['product_variation_size_id'] = df['product_variation_size_id'].replace('Size S.', 'S')
df['product_variation_size_id'] = df['product_variation_size_id'].replace('s', 'S')
df['product_variation_size_id'] = df['product_variation_size_id'].replace('SizeL', 'L')
df['product_variation_size_id'] = df['product_variation_size_id'].replace('5XL', 'XL')
df['product_variation_size_id'] = df['product_variation_size_id'].replace('4XL', 'XL')
df['product_variation_size_id'] = df['product_variation_size_id'].replace('3XL', 'XL')
df['product_variation_size_id'] = df['product_variation_size_id'].replace('2XL', 'XL')

# map anything that is not one of the known sizes to OTHER to account for other variations
def pr_var(name):
    if name == 'XXXS' \
    or name == 'XXS' \
    or name == 'XS' \
    or name == 'S' \
    or name == 'M' \
    or name == 'L' \
    or name == 'XL' \
    or name == 'XXL' \
    or name == 'XXXXL' \
    or name == 'XXXXXL':
        return name
    else:
        return "OTHER"

# replace missing values with OTHER 
df['product_variation_size_id'] = df['product_variation_size_id'].replace(np.nan, 'OTHER')
# adding the new categories to df['feature']
df['product_variation_size_id'] = df['product_variation_size_id'].apply(pr_var)


# graph to look at size distribution to see if it correctly categorized everything
fig_dims = (10, 10)
fig, ax = plt.subplots(figsize=fig_dims)
sns.countplot(x='product_variation_size_id',
              order = df['product_variation_size_id'].value_counts().index,
              palette="magma",
              data = df,
              ax = ax)
ax.set(xlabel='Size', ylabel='Count')
plt.show()



# Replacing different variations of colors into one and change missing values to OTHER

df['product_color'] = df['product_color'].replace('Black', 'black')
df['product_color'] = df['product_color'].replace('White', 'white')
df['product_color'] = df['product_color'].replace('navyblue', 'blue')
df['product_color'] = df['product_color'].replace('lightblue', 'blue')
df['product_color'] = df['product_color'].replace('skyblue', 'blue')
df['product_color'] = df['product_color'].replace('darkblue', 'blue')
df['product_color'] = df['product_color'].replace('navy', 'blue')
df['product_color'] = df['product_color'].replace('winered', 'red')
df['product_color'] = df['product_color'].replace('rosered', 'red')
df['product_color'] = df['product_color'].replace('rose', 'red')
df['product_color'] = df['product_color'].replace('orange-red', 'red')
df['product_color'] = df['product_color'].replace('lightpink', 'pink')
df['product_color'] = df['product_color'].replace('armygreen', 'green')
df['product_color'] = df['product_color'].replace('khaki', 'green')
df['product_color'] = df['product_color'].replace('lightgreen', 'green')
df['product_color'] = df['product_color'].replace('fluorescentgreen', 'green')
df['product_color'] = df['product_color'].replace('gray', 'grey')
df['product_color'] = df['product_color'].replace('coffee', 'brown')
df['product_color'] = df['product_color'].replace('multicolor', 'other')
df['product_color'] = df['product_color'].replace('floral', 'other')
df['product_color'] = df['product_color'].replace('leopard', 'other')
df['product_color'] = df['product_color'].replace('camouflage', 'other')
df['product_color'] = df['product_color'].replace('white & green', 'dual')
df['product_color'] = df['product_color'].replace('black & green', 'dual')
df['product_color'] = df['product_color'].replace('black & white', 'dual')
df['product_color'] = df['product_color'].replace(np.nan, 'other')

# graph to look at colour distribution to see if it correctly categorized everything
fig_dims = (10, 15)
fig, ax = plt.subplots(figsize=fig_dims)
sns.countplot(x='product_color',
              data = df,
              order = df['product_color'].value_counts().iloc[:15].index,
              ax = ax)
ax.set(xlabel='Product Colour', ylabel='Count')
plt.xticks(rotation=45, ha='right')
plt.show()




# checking for distribution of units sold 
df['units_sold'].value_counts()

# we have some units less than 10 so we're going to group them
def below_ten(units_sold):
    if units_sold < 10:
        return 10
    else:
        return units_sold

df['units_sold'] = df['units_sold'].apply(below_ten)
# checking for distribution of units sold to see if it correctly categorized everything
df['units_sold'].value_counts()
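
The long chains of `replace` calls in this cell could also be written as one lookup table per feature, which makes the mapping easier to review and to reuse on the test set. A minimal sketch for sizes follows; `SIZE_MAP` holds only a few of the variations listed above and the `sizes` series is made up.

```python
import numpy as np
import pandas as pd

# Made-up sample of raw size labels.
sizes = pd.Series(["S.", "Size-XS", "M", "4XL", "s", np.nan, "32/L"])

# Partial lookup table: raw label -> unified label.
SIZE_MAP = {"S.": "S", "Size-XS": "S", "s": "S", "4XL": "XL"}
# Labels that are kept as they are; everything else becomes OTHER.
KNOWN_SIZES = {"XXXS", "XXS", "XS", "S", "M", "L", "XL", "XXL", "XXXXL", "XXXXXL"}

cleaned = sizes.replace(SIZE_MAP)
# Missing values and unknown labels both fall through to OTHER.
cleaned = cleaned.where(cleaned.isin(KNOWN_SIZES), "OTHER")
print(cleaned.tolist())
```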

In [None]:
# checking for duplicates in df
df.duplicated().sum()

# removing duplicates
df.drop_duplicates(inplace=True)

In [None]:
# drop unneeded columns from dataset
df.drop(['currency_buyer', 'theme','crawl_month', 'shipping_option_name','inventory_total','merchant_name','merchant_title','merchant_info_subtitle','id','merchant_profile_picture','urgency_text','is_successful_rating','countries_shipped_to','product_color','product_variation_size_id','merchant_id','shipping_is_express','badge_local_product'], axis=1, inplace=True)


# NOTE: this list of dropped columns changed often; features were added and removed to optimize accuracy. For example, if "product_color" is dropped here then it cannot be encoded in the next section

# Normalizing / scaling / encoding features

In [None]:
# one-hot encoding the feature to prepare data for model prediction (for color)
df = pd.get_dummies(df, 
                    columns = ['product_color'],
                    prefix = 'color_',
                    drop_first = True)
# quick check to see if it worked
df.head()

In [None]:
# one-hot encoding the feature to prepare data for model prediction (for size)
df = pd.get_dummies(df, 
                    columns = ['product_variation_size_id'],
                    prefix = 'size_',
                    drop_first = True)
# quick check to see if it worked
df.head()

In [None]:
# one-hot encoding the feature to prepare data for model prediction (for origin_country)
df = pd.get_dummies(df, columns = ['origin_country'],
                    prefix = 'country_')
# quick check to see if it worked
df.head()

In [None]:
# feature manipulation
# breaking up the tag column to see the number of tags per row. 

def tag_count(tags):
    tag_str = tags
    # separate words by commas
    prod_tags = tag_str.split(',')
    return len(prod_tags)
    
# replaced tags with tag_count from above to the df
df['tag_count'] = df['tags'].apply(tag_count)
df.drop(['tags'], axis=1, inplace=True)

# NOTES: this is done because tags contains only strings, which cannot be processed by the models, so this is one way to encode it.
# NOTES: if you don't want to create "tag_count" make sure to remove "tags" in the drop columns cell
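
The tag count can also be computed without a helper function, in a way that does not fail on missing tags. This sketch assumes a missing value should count as 0 tags, which is my assumption and not something the dataset confirms; the `tags` series is made up.

```python
import numpy as np
import pandas as pd

# Made-up sample of the tags column, including a missing value.
tags = pd.Series(["summer,dress,women", "shirt", np.nan])

# Split on commas, count the pieces, and treat missing tags as 0.
tag_count = tags.str.split(",").str.len().fillna(0).astype(int)
print(tag_count.tolist())
```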

In [None]:
# FEATURE ENGINEERING : comparing rating and rating count
# Run only if you want to add this feature to df
# mean and standard deviation are computed once, outside the function, and used in the conditionals
mean_count = mean(df['rating_count'])
stdev_count = stdev(df['rating_count'])

def new_feature(rating_count):
  # 2 = well above average, 1 = around average, 0 = well below average
  if int(rating_count) >= mean_count + stdev_count:
    return 2
  elif int(rating_count) >= mean_count - stdev_count:
    return 1
  else:
    return 0


df['new_feature_rating'] = df['rating_count'].apply(new_feature)

# NOTES: this feature helps identify those that have high-rating/low-rating-count and vice versa. 
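
The bucketing rule in this cell (2 if the count is at least one standard deviation above the mean, 1 if it is within one standard deviation, 0 otherwise) can be sketched in vectorized form with `np.select`. The `counts` series below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rating counts.
counts = pd.Series([1, 50, 50, 50, 50, 50, 99], name="rating_count")

mean_count = counts.mean()
stdev_count = counts.std()  # sample standard deviation, like statistics.stdev

# Conditions are checked in order, so the second one only applies to
# rows that failed the first.
bucket = np.select(
    [counts >= mean_count + stdev_count, counts >= mean_count - stdev_count],
    [2, 1],
    default=0,
)
print(bucket.tolist())
```

With heavily right-skewed counts, `mean - stdev` can be negative, in which case no row lands in bucket 0; it is worth checking the bucket sizes before relying on the feature.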

In [None]:
print(df['new_feature_rating'].to_numpy()[0:1000])

In [None]:
# FEATURE ENGINEERING : comparing merchant rating and merchant rating count
# Run only if you want to add this feature to df

# mean and standard deviation are computed once, outside the function, and used in the conditionals
merchant_mean_count = mean(df['merchant_rating_count'])
merchant_stdev_count = stdev(df['merchant_rating_count'])

def new_feature(merchant_rating_count):
  # 2 = well above average, 1 = around average, 0 = well below average
  if int(merchant_rating_count) >= merchant_mean_count + merchant_stdev_count:
    return 2
  elif int(merchant_rating_count) >= merchant_mean_count - merchant_stdev_count:
    return 1
  else:
    return 0


df['new_feature_merchant_rating'] = df['merchant_rating_count'].apply(new_feature)

# NOTES: this feature helps identify those that have high-rating/low-rating-count and vice versa. Same feature as previous cell but this is for merchants not products

In [None]:
# CAUTION: only run if you want to normalize values
# MinMaxScaler scales and translates each feature individually such that it is in the given range on the training set, e.g. between zero and one, where min, max = feature_range.
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(df)
df_normalized=pd.DataFrame(x_scaled, columns=df.columns)

# NOTES: normalizes numerical values and saves them to a new df called "df_normalized" so we don't lose the non-normalized values
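
For reference, with the default range of zero to one, min-max scaling maps each value x of a feature to (x - min) / (max - min), using that feature's own minimum and maximum. The sketch below checks this formula against `MinMaxScaler` on a small made-up array.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up data: two features on very different scales.
X = np.array([[1.0, 10.0],
              [3.0, 20.0],
              [5.0, 40.0]])

# Column-wise formula written out by hand.
manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Same transformation through scikit-learn.
scaled = MinMaxScaler().fit_transform(X)

print(np.allclose(manual, scaled))
```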


In [None]:
# quick check before modeling

# View info un-normalized dataset
print(df)
print(df.info())

# View info normalized dataset
print(df_normalized)
print(df_normalized.info())

# Final data preparation

Cross validation dataset 

*only run one*

In [None]:
# training x and y without splitting to be used for cross validation if needed (FOR UN-NORMALIZED)
X_no_split = df.drop(['rating'], axis = 1)
y_no_split = df['rating']

In [None]:
# training x and y without splitting to be used for cross validation if needed (FOR NORMALIZED)
X_no_split = df_normalized.drop(['rating'], axis = 1)
y_no_split = df['rating']

Un-normalized and normalized dataset 

*only run one*

In [None]:
# X/y train/test creation UN-NORMALIZED (only run this or NORMALIZED)

X_not_nor = df.drop(['rating'], axis = 1)
y_not_nor = df['rating']


X_train, X_test, y_train, y_test = train_test_split(X_not_nor, y_not_nor, 
                                                    test_size = 0.2)
# NOTE : uses df
# NOTE : seed was set when optimizing models but I removed it after

In [None]:
# X/y train/test creation for NORMALIZED (only run this or UN-NORMALIZED)


X_nor = df_normalized.drop(['rating'], axis = 1)
# we don't want a normalized target attribute since this is a classification problem
y_nor = df['rating']


X_train, X_test, y_train, y_test = train_test_split(X_nor, y_nor, 
                                                    test_size = 0.2)
# NOTE : uses df_normalized
# NOTE : seed was set when optimizing models but I removed it after

Resampled (over-sampled) dataset

*do not run if using cross validated dataset*

*you can run on un-normalized and normalized datasets*

In [None]:
# over sampling data (only run if you want to over sample the data) !!!OPTIONAL!!!
## CAUTION: have to do some variable editing if you want to use oversampled data
ros = RandomOverSampler(random_state=42)
X_res, y_res = ros.fit_resample(X_train, y_train)


# NOTE : used for imbalanced dataset where you don't want to lose data to equally represent every target level (under sampling)
# NOTE : use 'X_res' and  'y_res' to train model if you want to use oversampled data
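
To make clear what random over-sampling does, here is a hand-rolled stand-in written with pandas only. It is not the imblearn implementation: it keeps every original row and duplicates randomly chosen rows of the smaller classes until all classes match the largest one. The `train` frame is made up.

```python
import pandas as pd

# Made-up training split with an imbalanced target.
train = pd.DataFrame({
    "price": [1, 2, 3, 4, 5, 6],
    "rating": [4, 4, 4, 4, 3, 5],
})

target_size = train["rating"].value_counts().max()

parts = []
for _, group in train.groupby("rating"):
    parts.append(group)
    deficit = target_size - len(group)
    if deficit > 0:
        # Duplicate random rows of this class until it reaches target_size.
        parts.append(group.sample(n=deficit, replace=True, random_state=42))

balanced = pd.concat(parts, ignore_index=True)
print(balanced["rating"].value_counts().to_dict())
```

Only the training split is resampled here, so the duplicated rows cannot leak into the data used for evaluation.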

In [None]:
# quick check to see if it did the oversampling
print(np.count_nonzero(y_res == 1))
print(np.count_nonzero(y_res == 2))
print(np.count_nonzero(y_res == 3))
print(np.count_nonzero(y_res == 4))
print(np.count_nonzero(y_res == 5))
print(np.count_nonzero(y_res == 6))

In [None]:
# quick check of X and y before modeling to make sure we are using the intended dataset
print(X_train)
print(y_train)

# Modeling and evaluation

*Some parameter tuning was done for the models below. I deleted it from the final Jupyter notebook, but it is explained in my documentation file.*



Gradient boost

In [None]:
# Gradient boosting (scikit-learn GradientBoostingClassifier) NOT cross validated
classifier_XGBC = GradientBoostingClassifier()
classifier_XGBC.fit(X_train, y_train)
y_pred_XGBC = classifier_XGBC.predict(X_test)

f1_scores_XGBC=f1_score(y_test, y_pred_XGBC, average=None)
print(f1_scores_XGBC)
mean(f1_scores_XGBC)
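
The mean F-score used throughout this section is the unweighted mean of the per-class F1 scores, which is what scikit-learn calls the macro average. The sketch below shows the equivalence on made-up labels.

```python
import numpy as np
from sklearn.metrics import f1_score

# Made-up true and predicted ratings.
y_true = [3, 4, 4, 4, 5, 5]
y_pred = [3, 4, 4, 5, 5, 4]

per_class = f1_score(y_true, y_pred, average=None)   # one score per class
macro = f1_score(y_true, y_pred, average="macro")    # their unweighted mean

print(per_class, macro)
print(np.isclose(per_class.mean(), macro))
```

Printing the per-class scores is still useful, since a reasonable mean can hide a class the model never predicts correctly.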

In [None]:
# Gradient boosting CROSS VALIDATED
classifier_XGBC = GradientBoostingClassifier()
kfold = KFold(n_splits=10)
y_pred_XGBC = cross_val_predict(classifier_XGBC, X_no_split, y_no_split, cv=kfold) #Cross validation on training set

f1_scores_XGBC=f1_score(y_no_split, y_pred_XGBC, average=None)
mean(f1_scores_XGBC)
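
Plain `KFold` splits rows in order, so with an imbalanced target some folds may contain few or none of the rare ratings. One option, which I did not use in the runs above, is `StratifiedKFold`, which keeps the class proportions in each fold. The data below is random and only meant to show the calls.

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.tree import DecisionTreeClassifier

# Random features and an imbalanced, ordered target (made up).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = np.array([3] * 10 + [4] * 40 + [5] * 10)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
y_pred = cross_val_predict(DecisionTreeClassifier(random_state=0), X, y, cv=skf)
macro = f1_score(y, y_pred, average="macro")

# Every test fold contains all three classes.
fold_classes = [set(y[test_idx].tolist()) for _, test_idx in skf.split(X, y)]
print(macro, fold_classes)
```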

In [None]:
# Gradient boosting OVER SAMPLED
classifier_XGBC = GradientBoostingClassifier()
# fit on the over-sampled training data, as in the other over-sampled cells
classifier_XGBC.fit(X_res, y_res)
y_pred_XGBC = classifier_XGBC.predict(X_test)

f1_scores_XGBC=f1_score(y_test, y_pred_XGBC, average=None)
print(f1_scores_XGBC)
mean(f1_scores_XGBC)

Decision tree

In [None]:
# Decision tree NOT cross validated
classifier_DTC = DecisionTreeClassifier()
classifier_DTC.fit(X_train, y_train)
y_pred_DTC = classifier_DTC.predict(X_test)

f1_scores_DTC=f1_score(y_test, y_pred_DTC, average=None)
print(f1_scores_DTC)
mean(f1_scores_DTC)

In [None]:
# Decision tree CROSS VALIDATED

classifier_DTC = DecisionTreeClassifier()
kfold = KFold(n_splits=10)
y_pred_DTC = cross_val_predict(classifier_DTC, X_no_split, y_no_split, cv=kfold) #Cross validation on training set

f1_scores_DTC=f1_score(y_no_split, y_pred_DTC, average=None)
mean(f1_scores_DTC)

In [None]:
# Decision tree OVER SAMPLED
classifier_DTC = DecisionTreeClassifier()
classifier_DTC.fit(X_res, y_res)
y_pred_DTC = classifier_DTC.predict(X_test)

f1_scores_DTC=f1_score(y_test, y_pred_DTC, average=None)
print(f1_scores_DTC)
mean(f1_scores_DTC)

Random forest

In [None]:
# Random forest NOT cross validated
classifier_RFC = RandomForestClassifier()
classifier_RFC.fit(X_train, y_train)
y_pred_RFC = classifier_RFC.predict(X_test)

f1_scores_RFC=f1_score(y_test, y_pred_RFC, average=None)
print(f1_scores_RFC)
mean(f1_scores_RFC)

In [None]:
# Random forest CROSS VALIDATED
classifier_RFC = RandomForestClassifier()
kfold = KFold(n_splits=10)
y_pred_RFC = cross_val_predict(classifier_RFC, X_no_split, y_no_split, cv=kfold) #Cross validation on training set

f1_scores_RFC=f1_score(y_no_split, y_pred_RFC, average=None)
mean(f1_scores_RFC)

In [None]:
# Random forest OVER SAMPLED
classifier_RFC = RandomForestClassifier()
classifier_RFC.fit(X_res, y_res)
y_pred_RFC = classifier_RFC.predict(X_test)

f1_scores_RFC=f1_score(y_test, y_pred_RFC, average=None)
print(f1_scores_RFC)
mean(f1_scores_RFC)

Adaboost

In [None]:
# Adaboost NOT cross validated
classifier_ABC_RF = AdaBoostClassifier(RandomForestClassifier(), n_estimators= 30, learning_rate = 0.01) 
classifier_ABC_RF.fit(X_train, y_train)
y_pred_ABC_RF = classifier_ABC_RF.predict(X_test)

f1_scores_ABC=f1_score(y_test, y_pred_ABC_RF, average=None)
mean(f1_scores_ABC)

In [None]:
# Adaboost CROSS VALIDATED
classifier_ABC_RF = AdaBoostClassifier(RandomForestClassifier(), learning_rate = 0.01) 
kfold = KFold(n_splits=10)
y_pred_ABC_RF = cross_val_predict(classifier_ABC_RF, X_no_split, y_no_split, cv=kfold) #Cross validation on training set

f1_scores_ABC=f1_score(y_no_split, y_pred_ABC_RF, average=None)
mean(f1_scores_ABC)

In [None]:
# Adaboost OVER SAMPLED
classifier_ABC_RF = AdaBoostClassifier(RandomForestClassifier(), n_estimators= 30, learning_rate = 0.01) 
classifier_ABC_RF.fit(X_res, y_res)
y_pred_ABC_RF = classifier_ABC_RF.predict(X_test)

f1_scores_ABC=f1_score(y_test, y_pred_ABC_RF, average=None)
mean(f1_scores_ABC)

Neural net

In [None]:
# MLP NOT cross validated
classifier_neural_MLP = MLPClassifier(random_state=42)
classifier_neural_MLP.fit(X_train, y_train)
y_pred_MLP = classifier_neural_MLP.predict(X_test)

f1_scores_MLP=f1_score(y_test, y_pred_MLP, average=None)
mean(f1_scores_MLP)

In [None]:
# MLP CROSS VALIDATED
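# MLPClassifier is used here: a regressor is not imported and its continuous outputs cannot be scored with f1_score
classifier_neural_MLP = MLPClassifier(random_state=42)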
kfold = KFold(n_splits=10)
y_pred_MLP = cross_val_predict(classifier_neural_MLP, X_no_split, y_no_split, cv=kfold) #Cross validation on training set

f1_scores_MLP=f1_score(y_no_split, y_pred_MLP, average=None)
mean(f1_scores_MLP)

In [None]:
# MLP OVER SAMPLED
classifier_neural_MLP = MLPClassifier(random_state=42)
classifier_neural_MLP.fit(X_res, y_res)
y_pred_MLP = classifier_neural_MLP.predict(X_test)

f1_scores_MLP=f1_score(y_test, y_pred_MLP, average=None)
mean(f1_scores_MLP)

Ensemble learning

In [None]:
# ensemble learning NOT cross validated

voting_cl = VotingClassifier(estimators = [('Random Forest',classifier_RFC), ('Ada Boost',classifier_ABC_RF)], voting = 'hard')
voting_cl.fit(X_train, y_train)
y_pred_vcl = voting_cl.predict(X_test)


f1_scores_Ensemble=f1_score(y_test, y_pred_vcl, average=None)
mean(f1_scores_Ensemble)

# NOTE: the models used for ensemble learning change depending on which models you want to try here. This is further discussed in the pdf

In [None]:
# ensemble learning CROSS VALIDATED

voting_cl = VotingClassifier(estimators = [('Decision Tree', classifier_DTC),
                                              ('Random Forest',classifier_RFC),
                                              ('Ada Boost',classifier_ABC_RF)], 
                                voting = 'hard')
kfold = KFold(n_splits=10)
y_pred_vcl = cross_val_predict(voting_cl, X_no_split, y_no_split, cv=kfold) # Cross validation on training set


f1_scores_Ensemble=f1_score(y_no_split, y_pred_vcl, average=None)
mean(f1_scores_Ensemble)

# NOTE: the models used for ensemble learning change depending on which models you want to try here. This is further discussed in the pdf

In [None]:
# Over sampled 
voting_cl = VotingClassifier(estimators = [('Random Forest',classifier_RFC), ('Ada Boost',classifier_ABC_RF)], voting = 'hard')
voting_cl.fit(X_res, y_res)
y_pred_vcl = voting_cl.predict(X_test)


f1_scores_Ensemble=f1_score(y_test, y_pred_vcl, average=None)
mean(f1_scores_Ensemble)

# NOTE: the models used for ensemble learning change depending on which models you want to try here. This is further discussed in the pdf

# Feature evaluation

In [None]:
# determining which features contribute most significantly to prediction
importances = classifier_RFC.feature_importances_
std = np.std([tree.feature_importances_ for tree in classifier_RFC.estimators_],
             axis=0)
indices = np.argsort(importances)[::-1]

# Print the feature ranking
print("Feature ranking:")

# X_train holds whichever feature set (normalized or un-normalized) the model was fitted on
for f in range(X_train.shape[1]):
    print("%d. feature %d (%s) (%f)" % (f + 1, indices[f], X_train.columns[indices[f]], importances[indices[f]]))
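
The same ranking can be sketched as a labelled pandas Series, which avoids index bookkeeping. The data and the column names `signal` and `noise` are made up: the target depends only on `signal`, so it should rank first.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Made-up data: the target is fully determined by "signal".
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "signal": rng.normal(size=200),
    "noise": rng.normal(size=200),
})
y = (X["signal"] > 0).astype(int)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Importances labelled by column name and sorted from most to least important.
ranking = pd.Series(model.feature_importances_, index=X.columns)
ranking = ranking.sort_values(ascending=False)
print(ranking)
```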

# Predict rating for test_new.csv to be submitted on Kaggle

Preparing X_test the same way as X_train.

Different cells were run to try different model variants.

In [None]:
# loading the test_new.csv data as a pandas dataframe to be used later for kaggle prediction
df_test_for_kaggle = pd.read_csv("/content/drive/My Drive/test_new.csv")

In [None]:
# peek into data

# print first and last few rows of the dataset
print(df_test_for_kaggle.head())
print(df_test_for_kaggle.tail())

# print all column headers + type
print(df_test_for_kaggle.info())

# checking for missing values for every feature
df_test_for_kaggle.isnull().sum()

In [None]:
# fixing some of the feature formatting for uniformity 

# fixing 'has_urgency_banner'
# has_urgency_banner needs to be binary, so missing values are filled with 0
df_test_for_kaggle['has_urgency_banner'] = df_test_for_kaggle['has_urgency_banner'].fillna(0)
# converting to int to match the other binary columns
df_test_for_kaggle.has_urgency_banner = df_test_for_kaggle.has_urgency_banner.astype(int)

# fixing 'urgency_text'
# urgency_text needs to be binary, so missing values are filled with 0
df_test_for_kaggle['urgency_text'] = df_test_for_kaggle['urgency_text'].fillna(0)
# coerce to numeric so that text such as 'Quantité limitée !' becomes NaN
df_test_for_kaggle.urgency_text = pd.to_numeric(df_test_for_kaggle.urgency_text, errors='coerce')
# NaN to 1 to create a binary field
df_test_for_kaggle['urgency_text'] = df_test_for_kaggle['urgency_text'].fillna(1)
# converting to int to match the other binary columns
df_test_for_kaggle.urgency_text = df_test_for_kaggle.urgency_text.astype(int)

# Since this feature consists mainly of US and China we can combine the rest into others
df_test_for_kaggle['origin_country'] = df_test_for_kaggle['origin_country'].replace('VE', 'Other')
df_test_for_kaggle['origin_country'] = df_test_for_kaggle['origin_country'].replace('AT', 'Other')
df_test_for_kaggle['origin_country'] = df_test_for_kaggle['origin_country'].replace('SG', 'Other')
df_test_for_kaggle['origin_country'] = df_test_for_kaggle['origin_country'].replace('GB', 'Other')
df_test_for_kaggle['origin_country'] = df_test_for_kaggle['origin_country'].replace(np.nan, 'Other')
# visual check to see if it correctly categorized everything
sns.countplot(x='origin_country', data=df_test_for_kaggle) # the column is passed as x= because recent seaborn versions reject it as a positional argument

def below_ten(units_sold):
    if units_sold < 10:
        return 10
    else:
        return units_sold

df_test_for_kaggle['units_sold'] = df_test_for_kaggle['units_sold'].apply(below_ten)
# checking for distribution of units sold to see if it correctly categorized everything
df_test_for_kaggle['units_sold'].value_counts()

In [None]:
# one-hot encoding the feature to prepare data for model prediction (for origin_country)
df_test_for_kaggle = pd.get_dummies(df_test_for_kaggle, columns = ['origin_country'],
                    prefix = 'country_')
# quick check to see if it worked
df_test_for_kaggle.head()

In [None]:
# breaking up the tag column to see the number of tags per row. 
def tag_count(tags):
    tag_str = tags
    prod_tags = tag_str.split(',')
    return len(prod_tags)
    
# replaced tags with tag_count from above to the df
df_test_for_kaggle['tag_count'] = df_test_for_kaggle['tags'].apply(tag_count)
df_test_for_kaggle.drop(['tags'], axis=1, inplace=True)

In [None]:
# adding the same price-difference feature (retail_price - price) that was created for the training data
df_test_for_kaggle['diff_in_price'] = round(df_test_for_kaggle['retail_price'] - df_test_for_kaggle['price'],2)


In [None]:
# drop unneeded columns from dataset
df_test_for_kaggle.drop(['currency_buyer', 'theme','crawl_month', 'shipping_option_name','inventory_total','merchant_name','merchant_title','merchant_info_subtitle','id','merchant_profile_picture','urgency_text','countries_shipped_to','product_color','product_variation_size_id','merchant_id','shipping_is_express','badge_local_product'], axis=1, inplace=True)



In [None]:
# CAUTION: only run if you want to normalize values
min_max_scaler_kaggle = preprocessing.MinMaxScaler()
x_scaled_kaggle = min_max_scaler_kaggle.fit_transform(df_test_for_kaggle)
df_normalized_kaggle=pd.DataFrame(x_scaled_kaggle, columns=df_test_for_kaggle.columns)
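
The cell above fits a new scaler on the test data, so the same raw value can end up scaled differently in train and test. A common alternative, sketched here on made-up frames, is to fit the scaler on the training features only, line up the test columns with the training columns, and reuse the fitted scaler. The frames `train_X` and `test_X` and their column names are illustrative assumptions.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Made-up training features (target already removed).
train_X = pd.DataFrame({
    "price": [1.0, 5.0, 9.0],
    "country__CN": [1, 1, 0],
    "country__US": [0, 0, 1],
})
# Made-up test features: one dummy column is missing and the order differs.
test_X = pd.DataFrame({"country__CN": [1, 1], "price": [3.0, 11.0]})

# Same columns, same order as training; absent dummy columns are filled with 0.
test_aligned = test_X.reindex(columns=train_X.columns, fill_value=0)

# Fit on training data only, then apply the same transformation to the test data.
scaler = MinMaxScaler().fit(train_X)
test_scaled = pd.DataFrame(scaler.transform(test_aligned), columns=train_X.columns)
print(test_scaled)
```

Test values outside the training range land outside the zero to one interval, which is expected with this approach.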

In [None]:
# quick look
print(df_normalized.info())
print(df_normalized_kaggle.info())

In [None]:
# name the set
X_test_kaggle = df_normalized_kaggle


In [None]:
# run this to create file with the id attached
def gen_kaggle_submission(id,y_pred,filename="test_out.csv"):
  with open("/content/drive/My Drive/"+filename,"w") as file_out:
    file_out.write("id,rating\n")
    for i in zip(id,y_pred):
      str_out = f"{i[0]},{i[1]:.1f}\n"
      file_out.write(str_out)
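
The same submission layout (an `id` column and a `rating` column with one decimal) could also be produced with pandas. In this sketch the `build_submission` helper is hypothetical and the output goes to an in-memory buffer instead of Google Drive.

```python
import io
import pandas as pd

def build_submission(ids, y_pred):
    """Pair each id with its predicted rating (hypothetical helper)."""
    return pd.DataFrame({
        "id": list(ids),
        "rating": [float(p) for p in y_pred],
    })

submission = build_submission([101, 102], [4, 5])

# float_format keeps one decimal for the rating; integer ids are left as they are.
buffer = io.StringIO()
submission.to_csv(buffer, index=False, float_format="%.1f")
csv_text = buffer.getvalue()
print(csv_text)
```

Passing a file path to `to_csv` instead of the buffer would write the file to disk.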

In [None]:
# Random forest model output
# run this cell for every different variation of this model
y_pred_kaggle=classifier_RFC.predict(X_test_kaggle)
gen_kaggle_submission(test_id, y_pred_kaggle,"rf.1_out.csv")

# NOTE : remember to change file name 'rf.1_out' for each output prediction

In [None]:
# ensemble learning model output
# run this cell for every different variation of this model
y_pred_kaggle=voting_cl.predict(X_test_kaggle)
gen_kaggle_submission(test_id, y_pred_kaggle,"vc.1.1_out.csv")

# NOTE : remember to change file name 'vc.1.1_out' for each output prediction
# NOTE : my kaggle submissions are all called vc because at that point I did not change the file name for every variant of my models. The models themselves were changed, just not the file name, my apologies :(

In [None]:
# Adaboost model output
# run this cell for every different variation of this model
y_pred_kaggle=classifier_ABC_RF.predict(X_test_kaggle)
gen_kaggle_submission(test_id, y_pred_kaggle,"ab.1_out.csv")

# NOTE : remember to change file name 'ab.1_out' for each output prediction

In [None]:
# comparing the differences between the files. just to make sure 2 files aren't the exact same ya know
!diff "/content/drive/My Drive/vc.3_out.csv" "/content/drive/My Drive/vc.3_out-new.csv"