## Business context

Aida is a data analyst at Electronics Online (EO), an online electronics retailer. She has been asked to determine, from historical data, whether the company performed well and how accurate its forecasts are. Aida expects that the board of directors will want to know what predictions can be made about the stock price given new inputs, which requires a model that predicts a continuous value. She therefore decides to build a decision tree regression model.

## Prepare workstation

In [1]:
# Import the necessary libraries, packages, and modules.
import pandas as pd 
import numpy as np 
import scipy as scp
import sklearn
import statsmodels.api as sm  
import math

from sklearn import metrics
from sklearn.model_selection import train_test_split

import warnings  
warnings.filterwarnings('ignore')  

# Read the provided CSV file/data set.
df = pd.read_csv('ecommerce.csv')

# View the DataFrame.
print(df.info())
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 12 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Sale       506 non-null    float64
 1   por_OS     506 non-null    float64
 2   por_NON    506 non-null    float64
 3   recc       506 non-null    int64  
 4   avg_no_it  506 non-null    float64
 5   age        506 non-null    float64
 6   dis        506 non-null    float64
 7   diff_reg   506 non-null    int64  
 8   tax        506 non-null    int64  
 9   bk         506 non-null    float64
 10  lowstat    506 non-null    float64
 11  Median_s   506 non-null    float64
dtypes: float64(9), int64(3)
memory usage: 47.6 KB
None


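|   | Sale | por_OS | por_NON | recc | avg_no_it | age | dis | diff_reg | tax | bk | lowstat | Median_s |
|---|------|--------|---------|------|-----------|-----|-----|----------|-----|----|---------|----------|
| 0 | 0.63 | 18.0 | 2.31 | 0 | 6.575 | 65.2 | 4.09 | 1 | 296 | 396.9 | 4.98 | 24.0 |
| 1 | 2.73 | 0.0 | 7.07 | 0 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 396.9 | 9.14 | 21.6 |
| 2 | 2.73 | 0.0 | 7.07 | 0 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 392.83 | 4.03 | 34.7 |
| 3 | 3.24 | 0.0 | 2.18 | 0 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 394.63 | 2.94 | 33.4 |
| 4 | 6.91 | 0.0 | 2.18 | 0 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 396.9 | 5.33 | 36.2 |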


## Build and fit the decision tree model

In [2]:
# Select every column except the target column, Median_s,
# so that the features and the target can be separated.
cols = df.columns[df.columns != 'Median_s']  

# Specify 'X' as the independent variables 
# and 'y' as the dependent variable:
X = df[cols]
y = df['Median_s']


# Split the data into training (70%) and testing (30%) sets:
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.3,
                                                    random_state=42)

# Import the ‘DecisionTreeRegressor’ class from sklearn.
from sklearn.tree import DecisionTreeRegressor  

# Create a DecisionTreeRegressor instance. The class has many parameters;
# only random_state is set here (to 42) so that the results are reproducible:
regressor = DecisionTreeRegressor(random_state=42)

# Fit the regressor object to the training data.
regressor.fit(X_train,y_train)  
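
To check what `test_size=0.3` does to a data set of this size, the sketch below applies the same `train_test_split` call to a synthetic stand-in with 506 rows and 11 feature columns. The random values and the names `X_demo`, `y_demo`, `X_tr`, and `X_te` are assumptions for illustration only; they are not the contents of `ecommerce.csv`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in (assumption): 506 rows and 11 features,
# matching the shape of X above but not its values.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(506, 11))
y_demo = rng.normal(size=506)

# Same split settings as in the cell above.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.3,
                                          random_state=42)

# Show how many rows land in each set.
print(len(X_tr), len(X_te))
```

With 506 rows, scikit-learn rounds the test share up, which gives 152 test rows and 354 training rows. The error metrics reported later are therefore averages over 152 predictions.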

## Determine the accuracy of the model

### Because the model predicts a continuous value, we evaluate it with the error metrics MAE, MSE, and RMSE

In [3]:
# Predict the response for the test data.
y_predict = regressor.predict(X_test)  

# Specify to print the MAE and MSE (to evaluate the accuracy of the new model):
print("Mean Absolute Error: ", metrics.mean_absolute_error(y_test, y_predict))
print("Mean Squared Error: ", metrics.mean_squared_error(y_test, y_predict))
# Calculate the RMSE.
print("Root Mean Squared Error: ", 
     math.sqrt(metrics.mean_squared_error(y_test, y_predict))) 

Mean Absolute Error:  3.0651315789473683
Mean Squared Error:  23.632039473684216
Root Mean Squared Error:  4.861279612785529
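
To make the three metrics concrete, the sketch below computes them by hand: MAE is the mean of the absolute errors, MSE is the mean of the squared errors, and RMSE is the square root of the MSE. The four actual and predicted values are hypothetical and chosen only to keep the arithmetic easy to follow; they are not output from the model above.

```python
import math

# Hypothetical actual and predicted values (assumption, not model output).
y_true = [24.0, 22.0, 35.0, 33.0]
y_pred = [22.0, 24.0, 35.0, 27.0]

# Error of each prediction: 2, -2, 0, 6.
errors = [t - p for t, p in zip(y_true, y_pred)]

mae = sum(abs(e) for e in errors) / len(errors)   # (2 + 2 + 0 + 6) / 4
mse = sum(e ** 2 for e in errors) / len(errors)   # (4 + 4 + 0 + 36) / 4
rmse = math.sqrt(mse)

print(mae, mse, round(rmse, 3))  # 2.5 11.0 3.317
```

The single 6-unit miss contributes 36 of the 44 squared-error units, which is why the RMSE (about 3.32) sits above the MAE (2.5). RMSE can never be smaller than MAE, and the two are equal only when every error has the same magnitude.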


## Conclusion

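The MAE shows that the model's predictions differ from the actual `Median_s` values by about 3.07 units on average. The RMSE (4.86) is higher because squaring gives large errors more weight. The gap between the two, RMSE minus MAE, is about 1.80; the closer this gap is to 0, the more uniform the errors are, so a gap of this size indicates that some predictions miss by considerably more than the average. The gap describes how the errors are spread rather than how effective the model is: effectiveness has to be judged against the scale of the target. The first five rows of `Median_s` range from 21.6 to 36.2, so an average error of about 3 units suggests that the model is reasonably effective, although this should be confirmed against the full range of the target.
In the context of the business case, this suggests that forecasting based on historical data can be an effective means of predicting future outcomes, provided that this level of error is acceptable to the board.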
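
One way to put error values like these into perspective is to compare the tree against a naive baseline that always predicts the mean of the training target, and to report R² alongside the MAE. The sketch below does this on synthetic data; the data-generating formula and the names `X_demo`, `y_demo`, `baseline`, and `tree` are assumptions for illustration, so the printed numbers say nothing about `ecommerce.csv`.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption): the target depends on the first
# feature plus random noise; the other two features are irrelevant.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(506, 3))
y_demo = 20 + 3 * X_demo[:, 0] + rng.normal(0, 1, size=506)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.3,
                                          random_state=42)

# Baseline: always predict the mean of the training target.
baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
tree = DecisionTreeRegressor(random_state=42).fit(X_tr, y_tr)

baseline_mae = mean_absolute_error(y_te, baseline.predict(X_te))
tree_mae = mean_absolute_error(y_te, tree.predict(X_te))
tree_r2 = r2_score(y_te, tree.predict(X_te))

print(f"Baseline MAE: {baseline_mae:.2f}")
print(f"Tree MAE:     {tree_mae:.2f}")
print(f"Tree R^2:     {tree_r2:.2f}")
```

If the tree's MAE is clearly below the baseline's and R² is well above 0, the model is extracting real signal from the features. Applying the same comparison to `X_train`, `X_test`, `y_train`, and `y_test` from the notebook would give a firmer basis for the conclusion above.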