In [13]:
import pandas_profiling
import pandas as pd
from sklearn.model_selection import train_test_split

# The setting below makes a cell display the result of every expression,
# not just the last one.

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [7]:
# We start by importing the new data set
home_folder = '/Users/anthonymiyoro/Documents/code/DataTho/'

#Read dataset
train_modified = pd.read_csv(home_folder + 'train_modified.csv')
test_bundas = pd.read_csv(home_folder + 'bundas_test.csv')

## Splitting into train and test

To understand how well our model generalizes, we need to split our data into a training set and a test set. To get an honest estimate of predictive performance, the model must never see the test labels during training. Instead, we ask the fitted model to score the test set using only its explanatory variables.
To do this, we split the dataframe into a train portion and a test portion.


### X_train, y_train, X_test, y_test

However, we also need to separate the explanatory features from our outcome feature.
To do this we need to create four subsets of data:


X_train: the explanatory features to train the algorithm


y_train: the outcome feature associated with the training features - in this case, the item store sales (Item_Store_Sales)


X_test: the explanatory features to test the algorithm


y_test: the true values of the outcome feature associated with the testing features - again, in this case, the item store sales (Item_Store_Sales).

To do this we will use train_test_split from sklearn.model_selection, which we imported at the beginning of the notebook.
The names used below (X_train, X_test, y_train, y_test) are the conventional names for train and test data, so you will see them repeatedly.
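As a rough illustration of the four subsets described above, here is a minimal sketch on a made-up ten-row dataframe. The dataframe `toy` and its column names (`feature_a`, `feature_b`, `target`) are hypothetical and are not part of the notebook's dataset; only `train_test_split` itself comes from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows, two explanatory features and one outcome column.
toy = pd.DataFrame({
    "feature_a": np.arange(10),
    "feature_b": np.arange(10) * 2,
    "target": np.arange(10) * 3,
})

# Separate the explanatory features (X) from the outcome feature (y).
X = toy.drop("target", axis=1)
y = toy["target"]

# Hold out 20% of the rows for testing; random_state makes the split repeatable.
X_train_toy, X_test_toy, y_train_toy, y_test_toy = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train_toy.shape, X_test_toy.shape)
print(y_train_toy.shape, y_test_toy.shape)
```

With ten rows and test_size=0.2, eight rows go to training and two to testing, and each y subset keeps the same row index as its matching X subset.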

In [9]:
print(train_modified.columns)

Index(['Item_ID', 'Weight', 'Visibility', 'Max_Price',
       'Store_Establishment_Year', 'Item_Store_Sales', 'FatContent_LF',
       'FatContent_Low_Fat', 'FatContent_Regular', 'FatContent_low_fat',
       'FatContent_reg', 'Category_Baking_Goods', 'Category_Breads',
       'Category_Breakfast', 'Category_Canned', 'Category_Dairy',
       'Category_Frozen_Foods', 'Category_Fruits_and_Vegetables',
       'Category_Hard_Drinks', 'Category_Health_and_Hygiene',
       'Category_Household', 'Category_Meat', 'Category_Others',
       'Category_Seafood', 'Category_Snack_Foods', 'Category_Soft_Drinks',
       'Category_Starchy_Foods', 'Store_Size_High', 'Store_Size_Medium',
       'Store_Size_Small', 'Store_Location_Type_Tier_1',
       'Store_Location_Type_Tier_2', 'Store_Location_Type_Tier_3',
       'Store_Type_Grocery_Store', 'Store_Type_Supermarket_Type1',
       'Store_Type_Supermarket_Type2', 'Store_Type_Supermarket_Type3',
       'Store_ID_OUT013', 'Store_ID_OUT018', 'Store_ID_OUT019'

### We then remove the Item_ID feature as it does not correlate with Sales

In [25]:
train_modified = train_modified.drop('Item_ID', axis=1)

In [26]:
train, test = train_test_split(train_modified, test_size=0.2)
len(train)
len(test)

4890

1223

scikit-learn requires that the explanatory variables be stored separately from the outcome variable.

In [27]:
# rf_trainX = all the features except the target
# rf_trainY = the values of the target variable (Item_Store_Sales) for the same rows
rf_trainY = train_modified['Item_Store_Sales']
rf_trainX = train_modified.drop('Item_Store_Sales', axis=1)

In [28]:
X_train, X_test, y_train, y_test = train_test_split(rf_trainX, rf_trainY, test_size=0.2, random_state=42)

len(X_train)
len(X_test)
len(y_train)
len(y_test)

4890

1223

4890

1223

### Decision Tree

Before we get into some of the more sophisticated models, let's first try an individual Decision Tree and see how it performs. 

After training the model, we will be able to assess its performance using sklearn's .score method, which calculates the r2 value (the coefficient of determination) for the data provided.

We will first print the r2 score for the training data, and then the r2 score for the test data.
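To make the r2 value concrete, the sketch below computes it by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares, and compares the result with `.score` and `sklearn.metrics.r2_score`. The data here is synthetic and the names `X_demo`, `y_demo` and `model` are hypothetical; `max_depth=3` is chosen only to keep the toy tree small.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration): a noisy linear relationship.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(0, 1, size=200)

# Fit a shallow tree and predict on the same data.
model = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_demo, y_demo)
pred = model.predict(X_demo)

# r2 = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_demo - pred) ** 2)
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

# All three values should agree up to floating-point error.
print(r2_manual, r2_score(y_demo, pred), model.score(X_demo, y_demo))
```

An r2 of 1.0 means the predictions match the true values exactly, while 0.0 means the model does no better than always predicting the mean of y.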

In [29]:
from sklearn.tree import DecisionTreeRegressor
# Step 1: Instantiate the DecisionTreeRegressor (random_state=0 makes the result repeatable)
decision_regressor = DecisionTreeRegressor(random_state=0)
# Step 2: Train the algorithm on X_train (the features) and y_train (the associated target values)
decision_regressor.fit(X_train, y_train)
# Step 3: Calculate the r2 score on the training and testing datasets
dt_training_score = decision_regressor.score(X_train, y_train)
dt_testing_score = decision_regressor.score(X_test, y_test)
print("Train score: " + str(dt_training_score))
print("Test score: " + str(dt_testing_score))

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
                      max_leaf_nodes=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      presort=False, random_state=0, splitter='best')

Train score: 1.0
Test score: 0.15780932934084002
