# Instacart Market Basket Analysis: Data Wrangling, Exploratory Data Analysis, Data Visualization, Algorithms and Machine Learning


This iPython notebook contains the code for the final project, Instacart Market Basket Analysis. The data is first cleaned and put into the format needed for further analysis. Data manipulation and exploration are then used to uncover interesting insights, with the different datasets transformed and merged along the way. Inferential statistics are performed, and data visualizations are used to tell a story with the data and to get a better understanding of Instacart users. Finally, the notebook contains the feature engineering carried out for the project and the models and algorithms that were implemented.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline

In [None]:
aisles= pd.read_csv('./aisles.csv')
departments=pd.read_csv('./departments.csv')
order_products_prior=pd.read_csv('./order_products__prior.csv')
order_products_train=pd.read_csv('./order_products__train.csv')
orders=pd.read_csv('./orders.csv')
products=pd.read_csv('./products.csv')

In [None]:
aisles.head()

In [None]:
departments.head()

In [None]:
order_products_prior.head()

In [None]:
order_products_train.head()

In [None]:
orders.head()

In [None]:
products.head()

Now we'll clean the data and look at the missing values in each data set. 


In [None]:
len(orders)

# Data Cleaning

In [None]:
#checking for missing values
total=orders.isnull().sum()
total

In [None]:
#checking for the percentage
percentage=total/orders.isnull().count()
percentage

In [None]:
missing_value_table_orders = pd.concat([total,percentage],keys=['Total','Percentage'],axis=1)
missing_value_table_orders

We can see that only about 6% of the values in the days_since_prior_order column are null, so we can exclude those rows and use the rest of the data.

In [None]:
orders_new=orders[orders['days_since_prior_order'].notnull()]
orders_new.head()

Similarly, we check for missing values for all the other 5 data sets to clean the data.
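
The same three steps (count the nulls, divide by the number of rows, combine into a table) are repeated for every data set, so they could be wrapped in a small helper. The sketch below is only an illustration: `missing_value_table` and the toy frame are hypothetical names, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def missing_value_table(frame):
    """Return the number and fraction of missing values for each column."""
    total = frame.isnull().sum()        # nulls per column
    fraction = total / len(frame)       # share of rows that are null
    return pd.concat([total, fraction], keys=['Total', 'Percentage'], axis=1)

# Hypothetical toy frame standing in for one of the Instacart tables
toy = pd.DataFrame({
    'order_id': [1, 2, 3, 4],
    'days_since_prior_order': [np.nan, 7.0, 30.0, np.nan],
})
table = missing_value_table(toy)
print(table)
```

Note that `isnull().sum()` counts the missing entries, while `isnull().count()` only returns the number of rows.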

In [None]:
#aisles
# .sum() counts the nulls per column (.count() would return the number of rows)
total_a=aisles.isnull().sum()
total_a

In [None]:
percentage_a=total_a/aisles.isnull().count()
percentage_a

In [None]:
missing_value_table_aisles = pd.concat([total_a, percentage_a],keys=['Total','Percentage'],axis=1)
missing_value_table_aisles

In [None]:
#departments
# .sum() counts the nulls per column (.count() would return the number of rows)
total_d=departments.isnull().sum()
total_d

In [None]:
percentage_d=total_d/departments.isnull().count()
percentage_d

In [None]:
missing_value_table_departments = pd.concat([total_d,percentage_d],keys=['Total','Percentage'],axis=1)
missing_value_table_departments

In [None]:
#orders_prior
total_order_p_p=order_products_prior.isnull().sum()
total_order_p_p

In [None]:
percentage_order_p_p=total_order_p_p/order_products_prior.isnull().count()
percentage_order_p_p

In [None]:
missing_value_table_order_p_p = pd.concat([total_order_p_p,percentage_order_p_p],keys=['Total','Percentage'],axis=1)
missing_value_table_order_p_p

In [None]:
#order_train
total_order_train=order_products_train.isnull().sum()
total_order_train

In [None]:
percentage_order_train=total_order_train/order_products_train.isnull().count()
percentage_order_train

In [None]:
missing_value_table_order_train = pd.concat([total_order_train,percentage_order_train],keys=['Total','Percentage'],axis=1)
missing_value_table_order_train

In [None]:
#products
total_products=products.isnull().sum()
total_products

In [None]:
percentage_products=total_products/products.isnull().count()
percentage_products

In [None]:
missing_value_table_products = pd.concat([total_products,percentage_products],keys=['Total','Percentage'],axis=1)
missing_value_table_products

Looking at the other 5 data sets we see that there are no missing values and hence conclude the data cleaning process.

# Exploratory Data Analysis & Data Visualization

We now count the orders in each of the three evaluation sets (prior, train and test) and plot them to get an idea of the distribution.

In [None]:
count=orders['eval_set'].value_counts()
count

In [None]:
plt.figure(figsize=(12,8))
sns.barplot(x=count.index, y=count.values)
plt.ylabel('Number of Occurrences in the dataset', fontsize=14)
plt.xlabel('Evaluation set type', fontsize=14)
plt.title('Eval_set breakdown in orders dataset', fontsize=16)

From the above graph we can see that the test evaluation set has 75,000 samples, on which we will obtain predictions. Now let's see the distribution with respect to the hour of the day.

In [None]:
count_hour_of_day=orders['order_hour_of_day'].value_counts()
count_hour_of_day

In [None]:
plt.figure(figsize=(12,8))
ax = sns.countplot(x="order_hour_of_day", data=orders,palette='BuGn_d')

From the graph above we can see that the number of orders peaks around 10 and 11AM, followed by 3-4PM, which makes sense since these are the hours before lunch and dinner. On the other hand, orders are lowest at 3-4AM, when most people are asleep. To learn more, let's see how the orders vary across the days of the week.

In [None]:
count_dow=orders['order_dow'].value_counts()
count_dow

In [None]:
plt.figure(figsize=(12,8))
ax = sns.countplot(x="order_dow", data=orders,palette='GnBu_d')

From this graph we can see that the number of orders is highest on Sunday and Monday (reading order_dow 0 as Sunday), which makes sense since people want to shop for groceries either at the weekend or at the start of the week. It is lowest in the middle of the week, on Thursday followed by Wednesday. To learn more, let's look at the orders by hour for a given day of the week.

In [None]:
plt.figure(figsize=(12,8))
ax=sns.countplot(x='order_hour_of_day',data=orders[orders['order_dow']==0])

This is the plot of orders on Sunday. We can see that the maximum number of orders is placed around 2-3PM. Similarly, let's look at the hourly distribution of orders for the rest of the days to get an idea of weekend vs. weekday patterns.

In [None]:
plt.figure(figsize=(12,8))
ax=sns.countplot(x='order_hour_of_day',data=orders[orders['order_dow']==1])

On Monday, the peak is reached at 10AM. Most of the orders are placed in the morning from 9-11AM, followed by 12-3PM.

In [None]:
plt.figure(figsize=(12,8))
ax=sns.countplot(x='order_hour_of_day',data=orders[orders['order_dow']==2])

Tuesday follows much the same trend as Monday, with most orders placed in the morning from 10-11AM.

In [None]:
#wednesday
plt.figure(figsize=(12,8))
ax=sns.countplot(x='order_hour_of_day',data=orders[orders['order_dow']==3])

In [None]:
#thursday
plt.figure(figsize=(12,8))
ax=sns.countplot(x='order_hour_of_day',data=orders[orders['order_dow']==4])

In [None]:
#friday
plt.figure(figsize=(12,8))
ax=sns.countplot(x='order_hour_of_day',data=orders[orders['order_dow']==5])

In [None]:
#saturday
plt.figure(figsize=(12,8))
ax=sns.countplot(x='order_hour_of_day',data=orders[orders['order_dow']==6])

The graphs for Wednesday, Thursday and Friday largely follow the trend for Monday and Tuesday. In the graph above for Saturday, however, the peak time is in the afternoon, around 2-3PM. So during the weekend peak orders are in the afternoon from 2-4PM, whereas on weekdays they are in the morning from 10AM-12PM.

Now, let's get the orders in terms of hour of the day and day of the week in a single dataset by using the groupby option for better visualization.

In [None]:
grouped_orders = orders.groupby(["order_dow", "order_hour_of_day"])["order_number"].aggregate("count").reset_index()
grouped_orders

In [None]:
grouped_orders.head()

In [None]:
#pivoting the table for clarity
grouped_orders = grouped_orders.pivot(index='order_dow', columns='order_hour_of_day', values='order_number')
grouped_orders

In [None]:
plt.figure(figsize=(18,10))
sns.heatmap(grouped_orders)
plt.title("Frequency of Day of week Vs Hour of day")
plt.show()

From the above heatmap, we can see that orders peak on Sunday and Monday, between 9AM and 4PM.
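
Reading the peak off a heatmap can be cross-checked numerically. The following is a minimal sketch on a hypothetical toy frame (`toy_orders` and its values are made up for illustration); it builds the same day-by-hour table with `pd.crosstab` and looks up the busiest hour per day and the busiest cell overall.

```python
import pandas as pd

# Hypothetical toy orders; the real table has one row per order
toy_orders = pd.DataFrame({
    'order_dow':         [0, 0, 0, 1, 1, 6, 6, 6, 6],
    'order_hour_of_day': [14, 14, 10, 10, 10, 15, 15, 15, 9],
})

# Rows are days of the week, columns are hours, cells are order counts
table = pd.crosstab(toy_orders['order_dow'], toy_orders['order_hour_of_day'])

# Busiest hour for each day
peak_hour_per_day = table.idxmax(axis=1)

# Day that contains the single busiest cell, and the hour of that cell
peak_day = table.max(axis=1).idxmax()
peak_hour = peak_hour_per_day.loc[peak_day]
print(peak_hour_per_day)
print(peak_day, peak_hour)
```

Applied to the real `orders` table, the same calls would list the peak hour for each value of `order_dow`.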

In [None]:
plt.figure(figsize=(12,8))
sns.countplot(x="days_since_prior_order", data=orders)
plt.ylabel('Count', fontsize=14)
plt.xlabel('Days since prior order', fontsize=14)
plt.title("Frequency distribution by days since prior order", fontsize=16)
plt.show()

From this plot we can see a spike at day 7, followed by relatively small peaks at days 14, 21 and 28, which indicates that many users order weekly. There is also a large peak at day 30, indicating a monthly pattern; note that the dataset caps days_since_prior_order at 30, so this bar also collects all longer gaps.
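
The weekly pattern can also be checked without a plot, by asking whether each multiple of 7 is more frequent than its neighbouring days. This is only a sketch on hypothetical values; `days` and `is_local_peak` are illustrative names.

```python
import pandas as pd

# Hypothetical gaps (in days) between consecutive orders
days = pd.Series([7, 7, 7, 6, 8, 14, 14, 13, 21, 30, 30, 30, 3, 7, 28])

counts = days.value_counts().sort_index()

def is_local_peak(d):
    """True if day d is more frequent than both d-1 and d+1."""
    return counts.get(d, 0) > max(counts.get(d - 1, 0), counts.get(d + 1, 0))

weekly_peaks = [d for d in (7, 14, 21, 28) if is_local_peak(d)]

# Share of all gaps that fall exactly on a multiple of 7
share_weekly = days.isin([7, 14, 21, 28]).mean()
print(weekly_peaks, round(share_weekly, 2))
```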

In [None]:
# percentage of re-orders in orders_products_prior
order_products_prior.reordered.sum() / len(order_products_prior)

Approximately 59% of the products in the prior dataset are re-orders.

In [None]:
# percentage of re-orders in orders_products_train
order_products_train.reordered.sum() / len(order_products_train)

Approximately 60% of the products in the train dataset are re-orders.

In [None]:
#merging order_products_prior and products
op_prior_merged = pd.merge(order_products_prior, products, on='product_id', how='left')


In [None]:
#merging op_merged with aisles
op_prior_merged = pd.merge(op_prior_merged, aisles, on='aisle_id', how='left')

In [None]:
#merging the new op_prior_merged with departments
op_prior_merged= pd.merge(op_prior_merged, departments, on='department_id', how='left')

In [None]:
#let's see the new op_merged
op_prior_merged.head()

In [None]:
#del op_prior_merged['department_y']
#op_prior_merged.head()

In [None]:
count_products = op_prior_merged['product_name'].value_counts().reset_index().head(20)
count_products.columns=['product_name','frequency']

In [None]:
count_products

In [None]:
plt.figure(figsize=(30,15))
sns.barplot(x=count_products.product_name, y=count_products.frequency, alpha=0.8)
plt.ylabel('Frequencies', fontsize=14)
plt.xlabel('Products', fontsize=10)
plt.xticks(rotation='vertical')
plt.show()

From this graph, the most-ordered products are fruits such as bananas and strawberries, along with other organic products.

In [None]:
count_aisles = op_prior_merged['aisle'].value_counts().head(20)
count_aisles

In [None]:
plt.figure(figsize=(30,15))
sns.barplot(x=count_aisles.index, y=count_aisles.values, alpha=0.8)
plt.ylabel('Frequencies', fontsize=14)
plt.xlabel('Aisle', fontsize=10)
plt.xticks(rotation='vertical')
plt.show()

From this graph we can see that the fresh fruits and fresh vegetables aisles are the ones ordered from most frequently. We can do the same analysis for departments and also check the reordered items against the day of the week and the hour of the day.

In [None]:
count_dept = op_prior_merged['department'].value_counts()
count_dept

In [None]:
plt.figure(figsize=(30,15))
sns.barplot(x=count_dept.index, y=count_dept.values, alpha=0.8)
plt.ylabel('Frequencies', fontsize=14)
plt.xlabel('Departments', fontsize=10)
plt.xticks(rotation='vertical')
plt.show()

From the graph we can see that produce is the most frequent department, which aligns with the aisle frequencies, followed by dairy eggs.

In [None]:
#merge order_product_prior with orders 
merged_reorders = pd.merge(order_products_prior, orders, on='order_id', how='left')
merged_reorders.head()

In [None]:
count_reordered = merged_reorders['reordered'].value_counts()
count_reordered

In [None]:
plt.figure(figsize=(6,12))
sns.barplot(x=count_reordered.index, y=count_reordered.values)
plt.ylabel('Frequencies', fontsize=14)
plt.xlabel('Reordered', fontsize=14)
plt.show()

In [None]:
#finding reorders against day of the week
# sum (not count) so that only rows with reordered == 1 are counted
grouped_reorders_dow = merged_reorders.groupby(["order_dow"])["reordered"].sum().reset_index()
grouped_reorders_dow

In [None]:
plt.figure(figsize=(6,12))
sns.barplot(x=grouped_reorders_dow.order_dow, y=grouped_reorders_dow.reordered)
plt.ylabel('Total number of reordered products', fontsize=14)
plt.xlabel('order_day_of_week', fontsize=14)
plt.show()

From this graph, we can see that most products are reordered on Sunday, followed by Monday and Saturday, which follows the same trend as orders placed over the week.

In [None]:
#finding reorders against hour of the day
# sum (not count) so that only rows with reordered == 1 are counted
grouped_reorders = merged_reorders.groupby(["order_hour_of_day"])["reordered"].sum().reset_index()
grouped_reorders

In [None]:
plt.figure(figsize=(12,12))
sns.barplot(x=grouped_reorders.order_hour_of_day, y=grouped_reorders.reordered)
plt.ylabel('Total number of reordered products', fontsize=14)
plt.xlabel('order_hour_of_day', fontsize=14)
plt.show()

This graph shows that most products are reordered from 10-11AM, followed by 1-3PM. This aligns with the hourly pattern of orders seen on weekdays and weekends.
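
When aggregating the 0/1 `reordered` flag, the choice of aggregation matters: a count gives all ordered items, a sum gives only the reorders, and a mean gives the reorder rate. The sketch below shows the three side by side on a hypothetical toy frame (`toy` and its values are made up).

```python
import pandas as pd

# Hypothetical order lines: one row per product in an order
toy = pd.DataFrame({
    'order_hour_of_day': [10, 10, 10, 10, 15, 15],
    'reordered':         [1, 1, 0, 1, 0, 1],
})

by_hour = toy.groupby('order_hour_of_day')['reordered'].agg(
    items='count',         # every ordered item, reordered or not
    reorders='sum',        # only the items with reordered == 1
    reorder_rate='mean',   # share of items that are reorders
)
print(by_hour)
```

Because busy hours have more items of every kind, the rate column is the one that shows whether reordering itself varies with the hour.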

In [None]:
from scipy import stats

In [None]:
from scipy.stats import norm

In [None]:
test_stats_hour_of_day, p_value_hour_of_day = stats.shapiro(orders.order_hour_of_day)
print("Test statistic for hour of the day is ",test_stats_hour_of_day)
print("P-value for hour of the day is",p_value_hour_of_day)

In [None]:
test_stats_dow, p_value_dow =stats.shapiro(orders.order_dow)
print("Test statistic for day of the week ",test_stats_dow)
print("P-value for day of the week is",p_value_dow)

In [None]:
# using orders_new, since the NaNs in orders.days_since_prior_order would make the result NaN
test_stats_days_since_prior, p_value_days_since_prior =stats.shapiro(orders_new.days_since_prior_order)
print("Test statistic for days_since_prior ",test_stats_days_since_prior)
print("P-value for days_since_prior",p_value_days_since_prior)
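
The SciPy documentation notes that for more than 5000 observations the Shapiro-Wilk statistic is accurate but the p-value may not be. One possible workaround, sketched here on hypothetical data, is to test a random subsample; the `hours` array and the sample size of 5000 are assumptions for illustration only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical stand-in for orders.order_hour_of_day: integer hours 0-23
hours = rng.integers(0, 24, size=100_000)

# Test a random subsample of 5000 values, drawn without replacement
sample = rng.choice(hours, size=5000, replace=False)
w_stat, p_value = stats.shapiro(sample)
print(w_stat, p_value)
```

A small p-value here is expected, since integer-valued hours cannot be normally distributed.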

In [None]:
stats.normaltest(orders.order_dow, axis=0)

In [None]:
stats.normaltest(orders_new.days_since_prior_order,axis = 0)

In [None]:
stats.normaltest(orders.order_hour_of_day,axis = 0)

In [None]:
stats.normaltest(orders.order_number,axis = 0)

In [None]:
stats.normaltest(order_products_prior.reordered,axis = 0)

Let's see a correlation plot to check for correlation between the different columns.
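
Correlations are only defined for numeric columns, so text columns such as `product_name` have to be left out before calling `corr()`. A minimal sketch, using a hypothetical miniature of the products table (the rows are made up):

```python
import pandas as pd

# Hypothetical miniature of the products table
toy_products = pd.DataFrame({
    'product_id':    [1, 2, 3, 4],
    'product_name':  ['Banana', 'Milk', 'Bread', 'Eggs'],
    'aisle_id':      [24, 84, 112, 86],
    'department_id': [4, 16, 3, 16],
})

# Keep only the numeric columns, then compute Pearson correlations
numeric = toy_products.select_dtypes(include='number')
corr = numeric.corr()
print(corr)
```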

In [None]:
plt.figure(figsize=(12,12))
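# restrict to numeric columns; product_name is a string and cannot be correlated
corr_products = products.select_dtypes(include='number').corr()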
sns.heatmap(corr_products, 
            xticklabels=corr_products.columns.values,
            yticklabels=corr_products.columns.values)

Trying the Anderson-Darling normality test for the variables.

In [None]:
anderson_results_dow = stats.anderson(orders.order_dow)
print(anderson_results_dow)

In [None]:
anderson_hour_of_day = stats.anderson(orders.order_hour_of_day)
print(anderson_hour_of_day)

In [None]:
anderson_order_number = stats.anderson(orders.order_number)
print(anderson_order_number)

In [None]:
anderson_days_since_prior_order = stats.anderson(orders_new.days_since_prior_order)
print(anderson_days_since_prior_order)

In [None]:
anderson_reordered = stats.anderson(order_products_prior.reordered)
print(anderson_reordered)

In [None]:
plt.figure(figsize=(12,12))
corr_orders = orders.select_dtypes(include='number').corr()
sns.heatmap(corr_orders, 
            xticklabels=corr_orders.columns.values,
            yticklabels=corr_orders.columns.values,annot=True)

In [None]:
plt.figure(figsize=(12,12))
corr_op = op_prior_merged.select_dtypes(include='number').corr()
sns.heatmap(corr_op, 
            xticklabels=corr_op.columns.values,
            yticklabels=corr_op.columns.values, annot=True)

In [None]:
plt.figure(figsize=(12,12))
corr_reorders = merged_reorders.select_dtypes(include='number').corr()
sns.heatmap(corr_reorders, 
            xticklabels=corr_reorders.columns.values,
            yticklabels=corr_reorders.columns.values, annot=True)

In [None]:
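# note: cdf='norm' is the standard normal N(0, 1); the column is tested as-is, without standardizing
ks_result_dow = stats.kstest(orders.order_dow, cdf='norm')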
ks_result_dow

In [None]:
ks_order_hour_of_day = stats.kstest(orders.order_hour_of_day, cdf='norm')
ks_order_hour_of_day

In [None]:
ks_order_number= stats.kstest(orders.order_number, cdf='norm')
ks_order_number

In [None]:
ks_days_since_prior_order = stats.kstest(orders_new.days_since_prior_order, cdf='norm')
ks_days_since_prior_order

In [None]:
ks_reordered = stats.kstest(order_products_prior.reordered, cdf='norm')
ks_reordered
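
With `cdf='norm'`, `stats.kstest` compares the data against the standard normal distribution, so a variable with a different mean or spread is rejected for that reason alone. The sketch below, on a hypothetical sample, shows the effect of standardizing first. Estimating the mean and standard deviation from the same data makes the resulting p-value only approximate, so this is an illustration rather than a rigorous test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical normally distributed sample that is NOT standard normal
x = rng.normal(loc=13.0, scale=4.0, size=2000)

# Tested as-is against N(0, 1): rejected because of location and scale alone
raw = stats.kstest(x, 'norm')

# Standardized first, so that only the shape is compared
z = (x - x.mean()) / x.std(ddof=1)
standardized = stats.kstest(z, 'norm')

print(raw.statistic, standardized.statistic)
```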

## Feature Engineering

The features that will be used to build our models are as follows:

1) Predict whether a product will be reordered or not:
Order_id, Order_number, Average_days_between_orders, Nb_orders (number of orders), Average_basket, Total items, Aisle, Department, Product, User_id, Order_hour_of_day, Order_dow (day of week), Days_since_prior_order, Days_since_ratio

2) Predict which department a product will belong to:
Order_id, Order_number, Average_days_between_orders, Nb_orders (number of orders), Average_basket, Orders, Reorders, Reordered rate, Total items, User_id, Order_hour_of_day, Order_dow (day of week), Days_since_prior_order, Days_since_ratio
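
Several of these features are per-user aggregates that have to be attached to item-level rows. Assigning a Series indexed by `user_id` directly to a column aligns it on the frame's row labels, not on the user, so one way to attach it is to look each row's `user_id` up with `map`. A minimal sketch on hypothetical toy frames (`toy_orders`, `items` and their values are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical order history: one row per order, NaN for a user's first order
toy_orders = pd.DataFrame({
    'user_id':                [101, 101, 101, 202, 202],
    'days_since_prior_order': [np.nan, 6.0, 8.0, np.nan, 30.0],
})

# Hypothetical item rows to which the user-level features are attached
items = pd.DataFrame({
    'user_id':                [202, 101, 101],
    'days_since_prior_order': [15.0, 7.0, 7.0],
})

per_user = toy_orders.dropna().groupby('user_id')['days_since_prior_order']
avg_days = per_user.mean()    # Series indexed by user_id
nb_orders = per_user.size()   # Series indexed by user_id

# map() looks each row's user_id up in the per-user Series
items['average_days_between_orders'] = items['user_id'].map(avg_days)
items['nb_orders'] = items['user_id'].map(nb_orders)
items['days_since_ratio'] = (
    items['days_since_prior_order'] / items['average_days_between_orders']
)
print(items)
```

A `pd.merge` on `user_id` would give the same result; `map` is used here only because each feature is a single column.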



In [None]:
merged1 = pd.merge(order_products_train, orders, on='order_id', how='left')
merged1.head()

In [None]:
df_merged1 = pd.merge(merged1, products, on='product_id', how='left')
df_merged1.head()

In [None]:
#merging all the datasets to get a final train dataset
df = pd.merge(df_merged1, departments, on='department_id', how='left')
df.head()

In [None]:
df_new = df.copy()
df_new.head()

In [None]:
del df['eval_set']

In [None]:
del df['add_to_cart_order']

In [None]:
df.head()

In [None]:
#Getting average days between orders as a feature by using days_since_prior_order
# map the per-user mean through user_id; direct assignment would align on df's row index instead
avg_days = orders_new.groupby('user_id')['days_since_prior_order'].mean().astype(np.float32)
df['average_days_between_orders'] = df['user_id'].map(avg_days)
df['average_days_between_orders'] = df['average_days_between_orders'].replace(np.nan, 0)

In [None]:
df['average_days_between_orders'] = df['average_days_between_orders'].replace(0, 1)

In [None]:
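#number of orders as a feature using the orders_new dataset
# looked up per row via user_id so that the values line up with the right user
nb_orders = orders_new.groupby('user_id').size().astype(np.int16)
df['nb_orders'] = df['user_id'].map(nb_orders)
df['nb_orders'] = df['nb_orders'].replace(np.nan, 0)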

In [None]:
df.head()

In [None]:
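#Getting the total items using the entire dataset 
# looked up per row via user_id so that the values line up with the right user
total_items = df_merged1.groupby('user_id').size().astype(np.int16)
df['total_items'] = df['user_id'].map(total_items)
df['total_items'] = df['total_items'].replace(np.nan, 0)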

In [None]:
df.head()

In [None]:
#getting average basket as a feature by using total items and number of orders
df['average_basket'] = (df.total_items /df.nb_orders).astype(np.float32)
df.head()

In [None]:
df['average_basket'] = df['average_basket'].replace(np.nan, 0)

In [None]:
# creating a days_since_ratio using days_since_prior_order and average_days_between_orders
df['days_since_ratio'] = df.days_since_prior_order / df.average_days_between_orders

In [None]:
df.head()

In [None]:
del df['user_id']

In [None]:
del df['product_name']

In [None]:
del df['department']

In [None]:
df.head()

Converting hour, aisle, dept, product, days_since_prior_order, day of week into categories.
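
The conversion assigns each distinct value an integer code in order of first appearance. As a sketch of what that produces, the example below builds the dictionary-based encoding on a hypothetical column and compares it with `pd.factorize`, which assigns codes in the same way; `department_id` here is a made-up Series, not the real column.

```python
import pandas as pd

# Hypothetical column of department ids in row order
department_id = pd.Series([16, 4, 16, 7, 4])

# Dictionary-based encoding: each new value gets the next integer
mapping = {c: i for i, c in enumerate(department_id.unique())}
codes_from_dict = [mapping[t] for t in department_id]

# pd.factorize also assigns codes in order of first appearance
codes, uniques = pd.factorize(department_id)
print(list(codes), list(uniques))
```

The resulting codes are arbitrary labels, so their numeric order carries no meaning of its own.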

In [None]:
hour = {c:i for i,c in enumerate(df['order_hour_of_day'].unique())}

In [None]:
aisle = {c:i for i,c in enumerate(df['aisle_id'].unique())}
dept = {c:i for i,c in enumerate(df['department_id'].unique())}
product = {c:i for i,c in enumerate(df['product_id'].unique())}

In [None]:
df['aisle_new'] = [float(aisle[t]) for t in df.aisle_id]

In [None]:
df['dept_new'] = [float(dept[t]) for t in df.department_id]

In [None]:
df['product_new'] = [float(product[t]) for t in df.product_id]

In [None]:
df['order_hour_of_day_new'] = [float(hour[t]) for t in df.order_hour_of_day]

In [None]:
df['order_hour_of_day_new'].value_counts()

In [None]:
dow = {c:i for i,c in enumerate(df['order_dow'].unique())}

In [None]:
df['order_dow_new'] = [float(dow[t]) for t in df.order_dow]

In [None]:
dspo = {c:i for i,c in enumerate(df['days_since_prior_order'].unique())}

In [None]:
df['days_since_prior_order__new'] = [float(dspo[t]) for t in df.days_since_prior_order]

In [None]:
df['reordered'] = df['reordered'].astype('float')

In [None]:
df.head()

In [None]:
del df['aisle_id']

In [None]:
del df['department_id']

In [None]:
df.head()

In [None]:
del df['order_hour_of_day']
del df['order_dow']
del df['days_since_prior_order']
del df['product_id']

In [None]:
df.head()

In [None]:
#Variable to be predicted
y=df['reordered']

In [None]:
del df['reordered']

In [None]:
#final df which will be used to run our algorithms
df.head()

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
Xtr, Xtest, ytr, ytest = train_test_split(df, y, test_size=0.30, random_state=5)

In [None]:
Xtr.shape

In [None]:
ytr=np.ravel(ytr)

In [None]:
ytest=np.ravel(ytest)

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
from sklearn.metrics import log_loss

In [None]:
from sklearn.metrics import accuracy_score

In [None]:
#Logistic Regression model
clf=(LogisticRegression(C=0.02))

In [None]:
#fitting the model
clf.fit(Xtr, ytr)

In [None]:
#predictions
pred=clf.predict(Xtest)

In [None]:
pred

In [None]:
#accuracy score of Logistic Regression Model
print(accuracy_score(ytest, pred))

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
#Random Forest model
clfrf = RandomForestClassifier(max_features="log2", max_depth=11, n_estimators=24,min_samples_split=1000, 
                               oob_score=True)

In [None]:
#fitting
clfrf.fit(Xtr, ytr)

In [None]:
#predictions
predrf=clfrf.predict(Xtest)

In [None]:
#accuracy score for the random forest model
accuracy_score(ytest, predrf)

In [None]:
plt.figure(figsize=(12,8))
feature_imp_reordered = pd.Series(clfrf.feature_importances_,index= df.columns)
feature_imp_reordered.sort_values(ascending=False).plot(kind='bar')

Looking at the feature importances for predicting whether a product will be reordered or not, the most important features turn out to be order_number, department, product, days since prior order and aisle.
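
To make the ranking above easier to interpret, here is a small sketch of how `feature_importances_` behaves on hypothetical data where only one column carries signal. The data, the column names and the `random_state` are assumptions for illustration. The scikit-learn documentation also cautions that these impurity-based importances can be misleading for features with many unique values, which may be relevant for id-like columns.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Hypothetical data: only 'signal' determines the label, the rest is noise
X = pd.DataFrame({
    'signal': rng.normal(size=500),
    'noise_a': rng.normal(size=500),
    'noise_b': rng.normal(size=500),
})
y = (X['signal'] > 0).astype(int)

model = RandomForestClassifier(n_estimators=24, random_state=0)
model.fit(X, y)

# Impurity-based importances, one per column, summing to 1
importances = pd.Series(model.feature_importances_, index=X.columns)
ranked = importances.sort_values(ascending=False)
print(ranked)
```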

In [None]:
from sklearn.ensemble import AdaBoostClassifier

In [None]:
#AdaBoost Classifier
clfa = AdaBoostClassifier(n_estimators=24, random_state=1)

In [None]:
#fitting
clfa.fit(Xtr, ytr)

In [None]:
#predictions
preda = clfa.predict(Xtest)

In [None]:
#Accuracy Score for AdaBoost Classifier
accuracy_score(ytest, preda)

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

In [None]:
#Gradient Boosting Classifier
clfg= GradientBoostingClassifier(max_features="log2", max_depth=11, n_estimators=24,min_samples_split=1000)

In [None]:
#fitting
clfg.fit(Xtr, ytr)

In [None]:
#predictions
predg = clfg.predict(Xtest)

In [None]:
#accuracy score for Gradient Boosting Classifier
accuracy_score(ytest, predg)

In [None]:
# for predicting the department variable
df_new.head()

In [None]:
# creating new features such as orders,reorders and reorder_rate for predicting the department variable

# per-product aggregates are mapped back through product_id so that they line up with each row
product_orders = df_new.groupby('product_id').size().astype(np.int32)
df_new['orders'] = df_new['product_id'].map(product_orders)
df_new['orders'] = df_new['orders'].replace(np.nan,0)
product_reorders = df_new.groupby('product_id')['reordered'].sum().astype(np.float32)
df_new['reorders'] = df_new['product_id'].map(product_reorders)
df_new['reorders'] = df_new['reorders'].replace(np.nan,0)
df_new['reorder_rate'] = (df_new.reorders / df_new.orders).astype(np.float32)
df_new['reorder_rate'] = df_new['reorder_rate'].replace(np.nan,0)

In [None]:
df_new.head()

In [None]:
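# map the per-user mean through user_id so that it lines up with each row
avg_days_new = orders_new.groupby('user_id')['days_since_prior_order'].mean().astype(np.float32)
df_new['average_days_between_orders'] = df_new['user_id'].map(avg_days_new)
df_new['average_days_between_orders'] = df_new['average_days_between_orders'].replace(np.nan, 0)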

In [None]:
df_new['average_days_between_orders'] = df_new['average_days_between_orders'].replace(0, 1)

In [None]:
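# map the per-user order count through user_id so that it lines up with each row
nb_orders_new = orders_new.groupby('user_id').size().astype(np.int16)
df_new['nb_orders'] = df_new['user_id'].map(nb_orders_new)
df_new['nb_orders'] = df_new['nb_orders'].replace(np.nan, 0)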

In [None]:
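# map the per-user item count through user_id so that it lines up with each row
total_items_new = df_merged1.groupby('user_id').size().astype(np.int16)
df_new['total_items'] = df_new['user_id'].map(total_items_new)
df_new['total_items'] = df_new['total_items'].replace(np.nan, 0)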

In [None]:
df_new['average_basket'] = (df_new.total_items /df_new.nb_orders).astype(np.float32)

In [None]:
df_new.head()

In [None]:
df_new['average_basket'] = df_new['average_basket'].replace(np.nan, 0)

In [None]:
df_new['days_since_ratio'] = df_new.days_since_prior_order / df_new.average_days_between_orders

In [None]:
df_new['order_hour_of_day_new'] = [float(hour[t]) for t in df_new.order_hour_of_day]

In [None]:
df_new['reordered'] = df_new['reordered'].astype('float')

In [None]:
df_new['order_dow_new'] = [float(dow[t]) for t in df_new.order_dow]

In [None]:
df_new['days_since_prior_order__new'] = [float(dspo[t]) for t in df_new.days_since_prior_order]

In [None]:
df_new['dept_new'] = [float(dept[t]) for t in df_new.department_id]

In [None]:
df_new['product_new'] = [float(product[t]) for t in df_new.product_id]

In [None]:
del df_new['days_since_prior_order']
del df_new['order_dow']
del df_new['order_hour_of_day']
del df_new['department_id']
del df_new['aisle_id']
del df_new['product_id']
#del df_new['user_id']
del df_new['add_to_cart_order']
del df_new['eval_set']
del df_new['department']
del df_new['product_name']
del df_new['product_new']

In [None]:
#final df which will be used to run our model to predict the category of department
df_new.head()

In [None]:
# our variable to be predicted
ynew = df_new['dept_new']

In [None]:
del df_new['dept_new']


In [None]:
Xtrnew, Xtestnew, ytrnew, ytestnew = train_test_split(df_new, ynew, test_size=0.30, random_state=5)

In [None]:
#Random Forest classifier
clfrfnew = RandomForestClassifier(max_features="log2", max_depth=11, n_estimators=24,min_samples_split=1000, 
                               oob_score=True)

In [None]:
#fitting
clfrfnew.fit(Xtrnew, ytrnew)

In [None]:
#predictions and probabilities
predrfnewp =clfrfnew.predict_proba(Xtestnew)

In [None]:
plt.figure(figsize=(12,8))
feature_imp_dept = pd.Series(clfrfnew.feature_importances_,index= df_new.columns)
feature_imp_dept.sort_values(ascending=False).plot(kind='bar')

The most important features while predicting the department are: Reordered, day of week, order number, user_id and order_id.

In [None]:
# Log loss for the Random Forest model
log_loss( ytestnew,predrfnewp)

In [None]:
#Gradient Boosting Classifier
clfgb= GradientBoostingClassifier(max_features="log2", max_depth=11, n_estimators=24,min_samples_split=1000)

In [None]:
#fitting
clfgb.fit(Xtrnew,ytrnew)

In [None]:
#predictions and probabilities
predgbnewp =clfgb.predict_proba(Xtestnew)

In [None]:
#Log loss score for Gradient Boosting Classifier
log_loss( ytestnew,predgbnewp)

In [None]:
#AdaBoost Classifier
clfada = AdaBoostClassifier(n_estimators=24, random_state=1)

In [None]:
#fitting
clfada.fit(Xtrnew, ytrnew)

In [None]:
#predictions and probabilities
predadap = clfada.predict_proba(Xtestnew)

In [None]:
#Log loss score for AdaBoost Classifier
log_loss( ytestnew,predadap)

The best model for predicting whether a product will be reordered or not is the Gradient Boosting Classifier, with an accuracy of 0.67. For predicting the department category, the Random Forest Classifier is the best model, with a log loss of 2.342.
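
The two tasks are scored with different metrics, so the two numbers are not directly comparable. As a rough aid to reading the log loss, the sketch below computes it by hand for a hypothetical 3-class example and also computes the score of a uniform guess, which is ln(K) for K classes. The value K = 21 is an assumption about the number of department classes, and `multiclass_log_loss` is an illustrative helper, not the scikit-learn function.

```python
import numpy as np

def multiclass_log_loss(y_true, proba):
    """Mean negative log of the probability given to the true class."""
    proba = np.clip(proba, 1e-15, 1.0)
    rows = np.arange(len(y_true))
    return -np.mean(np.log(proba[rows, y_true]))

# Hypothetical 3-class example: true class indices and predicted probabilities
y_true = np.array([0, 2, 1])
proba = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
])
example_loss = multiclass_log_loss(y_true, proba)

# Baseline: a uniform guess over K classes scores ln(K)
n_departments = 21  # assumption: number of department classes
uniform_loss = np.log(n_departments)
print(round(example_loss, 3), round(uniform_loss, 3))
```

Under that assumption a uniform guess would score about 3.04, which gives some context for a log loss of 2.342.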