# Report


This notebook prepares the data from the CSV files.
We merge the data files into a single dataframe so that preprocessing is done once rather than separately for each file.
The prediction target is the `reordered` variable, that is, whether a specific product will be reordered.
After merging, we drop `eval_set`, which carries no information relevant to the prediction, and `product_name`, which is redundant with `product_id`.
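
The merge-and-drop step described above can be sketched on toy data. The three small frames below are made-up stand-ins for the CSV files (their values are illustrative assumptions, not taken from the dataset); only the column names follow the notebook.

```python
import pandas as pd

# Toy stand-ins for the three CSV files; all values are invented for illustration.
order_products = pd.DataFrame({
    "order_id": [1, 1, 2],
    "product_id": [10, 11, 10],
    "add_to_cart_order": [1, 2, 1],
    "reordered": [0, 1, 1],
})
orders = pd.DataFrame({
    "order_id": [1, 2],
    "user_id": [100, 101],
    "eval_set": ["prior", "prior"],
    "days_since_prior_order": [None, 7.0],
})
products = pd.DataFrame({
    "product_id": [10, 11],
    "product_name": ["Bananas", "Milk"],
    "aisle_id": [24, 84],
    "department_id": [4, 16],
})

# Inner joins on the shared key columns give one row per ordered product.
combined = (
    order_products
    .merge(orders, on="order_id")
    .merge(products, on="product_id")
)

# Separate the target, then drop it and the uninformative columns from the features.
target = combined["reordered"]
features = (
    combined
    .drop(columns=["reordered", "eval_set", "product_name"])
    .fillna(0)  # same missing-value handling as in the notebook
)
```

Passing `on=` explicitly is a design choice in this sketch: it makes the join key visible instead of relying on pandas to infer the common column.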


After preparing the dataset, we select the most informative attributes using scikit-learn's SelectKBest.
It scores each attribute against the target with an ANOVA F-test (f_classif) and ranks the attributes by F-score, highest first.
The eight attributes we keep are ['order_number', 'add_to_cart_order', 'department_id', 'days_since_prior_order', 'order_hour_of_day', 'order_dow', 'product_id', 'aisle_id']
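
As a rough illustration of how this ranking behaves, the sketch below runs SelectKBest with f_classif on synthetic data. The column names `informative` and `noise`, the sample size, and the random seed are assumptions made for the example; they do not come from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
n = 500
y = rng.integers(0, 2, size=n)  # synthetic binary target

# "informative" shifts with the class label; "noise" is independent of it.
X = pd.DataFrame({
    "noise": rng.normal(size=n),
    "informative": rng.normal(size=n) + 2.0 * y,
})

# Score every feature with the ANOVA F-test and keep the single best one.
selector = SelectKBest(score_func=f_classif, k=1).fit(X, y)

# F-scores sorted from highest to lowest, labelled by column name.
scores = pd.Series(selector.scores_, index=X.columns).sort_values(ascending=False)
selected = list(X.columns[selector.get_support()])
print(scores)
print(selected)
```

A feature whose mean differs strongly between the two classes gets a large F-score, while a feature unrelated to the target scores close to 1.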


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pickle
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from keras.utils import np_utils


Using TensorFlow backend.


In [2]:
order_prior = pd.read_csv("order_products__prior.csv")

In [3]:
orders = pd.read_csv("orders.csv")
# Keep only the first 1,000,000 rows of orders.csv.
orders = orders.iloc[:1000000]
print(orders)

        order_id  user_id eval_set  order_number  order_dow  \
0        2539329        1    prior             1          2   
1        2398795        1    prior             2          3   
2         473747        1    prior             3          3   
3        2254736        1    prior             4          4   
4         431534        1    prior             5          4   
5        3367565        1    prior             6          2   
6         550135        1    prior             7          1   
7        3108588        1    prior             8          1   
8        2295261        1    prior             9          1   
9        2550362        1    prior            10          4   
10       1187899        1    train            11          4   
11       2168274        2    prior             1          2   
12       1501582        2    prior             2          5   
13       1901567        2    prior             3          1   
14        738281        2    prior             4       

In [4]:
products = pd.read_csv("products.csv")

In [5]:
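# Inner join on order_id; rows whose order is not among the orders kept above are dropped.
order_prior = order_prior.merge(orders, on="order_id")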

In [6]:
order_prior.keys()

Index(['order_id', 'product_id', 'add_to_cart_order', 'reordered', 'user_id',
       'eval_set', 'order_number', 'order_dow', 'order_hour_of_day',
       'days_since_prior_order'],
      dtype='object')

In [7]:
# Inner join on product_id to attach product_name, aisle_id and department_id.
order_prior = order_prior.merge(products, on="product_id")

In [8]:
order_prior.keys()

Index(['order_id', 'product_id', 'add_to_cart_order', 'reordered', 'user_id',
       'eval_set', 'order_number', 'order_dow', 'order_hour_of_day',
       'days_since_prior_order', 'product_name', 'aisle_id', 'department_id'],
      dtype='object')

In [22]:
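# Keep the target separately, then build the feature frame from a copy.
reord = order_prior["reordered"]
orders = order_prior.copy()
# Replace missing values with 0.
orders = orders.fillna(0)
# Drop the target and the columns that are not used as features.
orders = orders.drop(columns=["reordered", "eval_set", "product_name"])
print(orders)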

         order_id  product_id  add_to_cart_order  user_id  order_number  \
0               6       40462                  1    22352             4   
1           15854       40462                  2     5374             7   
2           21553       40462                  1    31136            13   
3           59858       40462                  1    59606             4   
4          119328       40462                  1     1409             4   
5          138567       40462                  1    57161            11   
6          143326       40462                  1    57882             5   
7          208931       40462                  1     7137             4   
8          270044       40462                  1    42843             3   
9          290017       40462                  6    29836             5   
10         424301       40462                  1    31136             7   
11         432041       40462                  3    53624             3   
12         455382       4

Source for the cell below: https://gist.github.com/olgabradford/f04f23692c78fc0beb377894ce5e5e59

In [24]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif

# Score every feature against the target with the ANOVA F-test (k='all' keeps all of them).
skb = SelectKBest(f_classif, k='all').fit(orders, order_prior["reordered"])
scores = skb.scores_
all_features = orders.columns.values

# Feature indices sorted by score, highest first.
sort_index = np.argsort(scores)[::-1]
ranked_features = []
print("Ranking of features is ")
for rank, x in enumerate(sort_index, start=1):
    print(rank, ". Score  ", all_features[x], " is ", scores[x])
    ranked_features.append(all_features[x])


Ranking of features is 
1 . Score   order_number  is  970805.6672021801
2 . Score   add_to_cart_order  is  177015.956502899
3 . Score   department_id  is  14074.846143402254
4 . Score   days_since_prior_order  is  5312.091992387544
5 . Score   order_hour_of_day  is  4640.594450832234
6 . Score   order_dow  is  403.1715221922778
7 . Score   aisle_id  is  125.13937518414353
8 . Score   product_id  is  115.94167431770306
9 . Score   order_id  is  11.621544153480675
10 . Score   user_id  is  7.9379103280417045
['order_id' 'product_id' 'add_to_cart_order' 'user_id' 'order_number'
 'order_dow' 'order_hour_of_day' 'days_since_prior_order']


In [26]:
BestCol = ['order_number', 'add_to_cart_order', 'department_id', 'days_since_prior_order', 'order_hour_of_day', 'order_dow', 'product_id', 'aisle_id']

In [33]:
X = orders[BestCol]
Y = reord


In [34]:
# Save the feature frame; the with-block closes the file automatically.
with open("Input.pickle", "wb") as pickle_out:
    pickle.dump(X, pickle_out)

In [35]:
# Save the target series; the with-block closes the file automatically.
with open("Output.pickle", "wb") as pickle_out:
    pickle.dump(Y, pickle_out)
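
A later notebook would presumably read these pickles back. The sketch below shows the round trip on a small made-up frame written to a temporary directory, so it does not touch the real Input.pickle; the frame `X_demo` and its values are assumptions for illustration.

```python
import os
import pickle
import tempfile

import pandas as pd

# Small made-up frame standing in for the real feature matrix X.
X_demo = pd.DataFrame({"order_number": [4, 7], "add_to_cart_order": [1, 2]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "Input.pickle")

    # Write the frame in the same way as the cells above.
    with open(path, "wb") as f:
        pickle.dump(X_demo, f)

    # Read it back; "rb" mirrors the "wb" used for writing.
    with open(path, "rb") as f:
        X_loaded = pickle.load(f)

# True when the loaded frame matches the original exactly.
round_trip_ok = X_loaded.equals(X_demo)
print(round_trip_ok)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.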