# What is this data?
- This dataset can be found here:  https://www.kaggle.com/srolka/ecommerce-customers
- Read the data dictionary!
- TL;DR: This data was collected by a company that wanted to track its users' activity on its mobile and web platforms for later analysis.

In [None]:
# Import the required packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as py  # note: aliased as py (not the usual plt) throughout this notebook
import seaborn as sns
%matplotlib inline

# Read the file into a DataFrame
customers = pd.read_csv('Ecommerce Customers.csv')

In [None]:
# Take a first look at the data
customers.head()

In [None]:
# Summary statistics (count, mean, std, quartiles) for the numeric columns
customers.describe()

## Exploratory Data Analysis

In [None]:
# The first step is to explore the data: which relationships exist, and how strongly are the columns correlated?
# Create a jointplot (using seaborn) of Time on Website against Yearly Amount Spent to check whether the two look correlated.
sns.jointplot(x='Time on Website',y='Yearly Amount Spent', data=customers)

In [None]:
# The same comparison for Time on App
sns.jointplot(x='Time on App',y='Yearly Amount Spent', data=customers)

In [None]:
# A pairplot shows the pairwise relationships across the entire dataset.
# Notice the positive linear relationship between Yearly Amount Spent and Length of Membership.
sns.pairplot(data=customers)
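
The pairplot is a visual check; a numeric cross-check is the Pearson correlation of each column with the target. The sketch below uses a small synthetic DataFrame (`demo`, with made-up values and only three of the column names) as a stand-in for `customers`, so the numbers it prints say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the customers DataFrame (all values are made up).
rng = np.random.default_rng(0)
n = 200
membership = rng.uniform(0.5, 6.0, n)
time_on_website = rng.uniform(34.0, 40.0, n)
# Spend depends linearly on membership (plus noise) and not on website time.
spent = 300 + 60 * membership + rng.normal(0, 20, n)

demo = pd.DataFrame({
    'Length of Membership': membership,
    'Time on Website': time_on_website,
    'Yearly Amount Spent': spent,
})

# Pearson correlation of every column with the target, strongest first.
corr_with_target = demo.corr()['Yearly Amount Spent'].sort_values(ascending=False)
print(corr_with_target)
```

If the real frame also contains text columns, `customers.corr(numeric_only=True)` should restrict the calculation to numeric columns in recent pandas versions.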

In [None]:
# Dig deeper into this relationship with a linear model plot (seaborn's lmplot) of Yearly Amount Spent vs. Length of Membership
sns.lmplot(x='Length of Membership',y='Yearly Amount Spent',data=customers)
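
By default, the line that `lmplot` draws is an ordinary least-squares fit, so its slope and intercept can be recovered numerically. A minimal sketch with `np.polyfit`, using made-up `x` and `y` arrays rather than the real columns:

```python
import numpy as np

# Made-up data standing in for Length of Membership (x) and Yearly Amount Spent (y).
rng = np.random.default_rng(1)
x = rng.uniform(0.5, 6.0, 100)
y = 250 + 65 * x + rng.normal(0, 15, 100)

# A degree-1 polynomial fit is a straight least-squares line: y ~ slope * x + intercept.
slope, intercept = np.polyfit(x, y, deg=1)
print(f'slope={slope:.2f}, intercept={intercept:.2f}')
```

On the real data, the same call with `customers['Length of Membership']` and `customers['Yearly Amount Spent']` would give the numbers behind the plotted line.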

## Recap - What did you learn?
- The essence of linear regression: identifying a linear relationship between features in your data.
- How to explore data and visually identify features that have a linear relationship.
- How to use seaborn to draw the linear regression line through your dataset.

## Hands on!
- Load the cab_rides.csv dataset and explore its features for linear relationships. Use seaborn to generate a linear regression line through the dataset.
### Challenge
- Load the kendall_homes.csv dataset and explore its features for linear relationships. Again, use seaborn to generate the regression line.

# Bonus: Generate a machine learning model that uses the ecommerce data to make predictions

In [None]:
# Select the features that we want to use to make predictions
X = customers[['Avg. Session Length', 'Time on App', 'Time on Website', 'Length of Membership']]
X.head()

In [None]:
# Select the target column that we want to predict
Y = customers['Yearly Amount Spent']
Y.head()

In [None]:
# Let's split the data into 70% training and 30% testing
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size=0.3,random_state=101)
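
To see what the split produces, the sketch below applies the same `train_test_split` call to a synthetic 500-row frame (`X_demo` and `Y_demo` are made-up stand-ins, and the row count is an assumption chosen for illustration). It checks the 70/30 sizes and that features and target stay aligned row by row.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up feature matrix and target, standing in for X and Y.
rng = np.random.default_rng(2)
cols = ['Avg. Session Length', 'Time on App', 'Time on Website', 'Length of Membership']
X_demo = pd.DataFrame(rng.normal(size=(500, 4)), columns=cols)
Y_demo = pd.Series(rng.normal(size=500), name='Yearly Amount Spent')

# Same arguments as in the cell above: 30% held out, fixed seed for reproducibility.
X_tr, X_te, Y_tr, Y_te = train_test_split(X_demo, Y_demo, test_size=0.3, random_state=101)

print(X_tr.shape, X_te.shape)
# Rows are shuffled, but each X row keeps the same index label as its Y value.
print(X_tr.index.equals(Y_tr.index))
```

Fixing `random_state` means rerunning the cell gives the same split, which keeps later metrics comparable between runs.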

In [None]:
# Now it's time to train the model on the training data. lm is our linear model.
from sklearn.linear_model import LinearRegression
lm = LinearRegression()  # Create an instance of the LinearRegression model
lm.fit(X_train, Y_train)

In [None]:
# These are the coefficients that the model found:
# Avg. Session Length,Time on App,Time on Website,Length of Membership
print(lm.coef_)
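
The coefficients are only half of the fitted model; the other half is the intercept, and a prediction is the intercept plus the coefficient-weighted sum of the features. The sketch below checks this on synthetic data (`X_demo`, `y_demo`, and `true_coef` are made-up values, not the notebook's results).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data generated from known coefficients plus a little noise.
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(200, 4))
true_coef = np.array([26.0, 38.0, 0.5, 61.0])
y_demo = 100.0 + X_demo @ true_coef + rng.normal(0, 1.0, 200)

model = LinearRegression().fit(X_demo, y_demo)

# Reproduce predict() by hand: intercept + X . coef
manual = model.intercept_ + X_demo @ model.coef_
print(np.allclose(manual, model.predict(X_demo)))
print(model.coef_)
```

For the notebook's own model, the corresponding values are `lm.intercept_` and `lm.coef_`.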

In [None]:
# Now that the model is fit, evaluate its performance by predicting on the held-out test set.
prediction = lm.predict(X_test)

In [None]:
# Scatter the real test values against the predicted values; points close to a straight diagonal indicate a good fit.
py.scatter(Y_test, prediction)

In [None]:
# Evaluate the predictions with three common regression error metrics
from sklearn import metrics
print('MAE: ', metrics.mean_absolute_error(Y_test, prediction))
print('MSE: ', metrics.mean_squared_error(Y_test, prediction))
print('RMSE:', np.sqrt(metrics.mean_squared_error(Y_test, prediction)))
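
For reference, the three metrics are simple averages over the prediction errors. A small worked example with four made-up values (`y_true` and `y_pred` are illustrative, not taken from the dataset):

```python
import numpy as np

# Four made-up actual and predicted values.
y_true = np.array([500.0, 450.0, 610.0, 380.0])
y_pred = np.array([490.0, 470.0, 600.0, 400.0])

errors = y_true - y_pred           # [10, -20, 10, -20]

mae = np.mean(np.abs(errors))      # mean absolute error: (10 + 20 + 10 + 20) / 4
mse = np.mean(errors ** 2)         # mean squared error: (100 + 400 + 100 + 400) / 4
rmse = np.sqrt(mse)                # root of the MSE, back in the target's units

print(mae, mse, rmse)
```

MAE and RMSE are in the same units as the target (dollars here), while MSE is in squared units; RMSE penalizes large errors more heavily than MAE.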

In [None]:
# Plot a histogram of the residuals (actual minus predicted) with py.hist() and check that it looks roughly normally distributed.
py.hist(Y_test - prediction, bins=50)

In [None]:
# Put the coefficients in a DataFrame, labeled by feature name
co = pd.DataFrame(lm.coef_, X.columns)
co.columns = ['Coefficient']
co

## Analysis and Conclusions

- Holding all other features fixed, a 1-unit increase in Avg. Session Length is associated with an increase of 25.98 dollars in Yearly Amount Spent.
- Holding all other features fixed, a 1-unit increase in Time on App is associated with an increase of 38.59 dollars in Yearly Amount Spent.
- Holding all other features fixed, a 1-unit increase in Time on Website is associated with an increase of 0.19 dollars in Yearly Amount Spent.
- Holding all other features fixed, a 1-unit increase in Length of Membership is associated with an increase of 61.27 dollars in Yearly Amount Spent.
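
Because the model is linear, these effects add up. As a rough illustration, the sketch below takes the coefficients reported in the list and computes the change in predicted spend for a hypothetical customer whose Time on App rises by 2 units and whose Length of Membership rises by 0.5 units (the `change` values are invented for the example). The intercept is not needed, since it cancels out when comparing two predictions.

```python
# Coefficients as reported in the list above (dollars per one-unit increase).
coef = {
    'Avg. Session Length': 25.98,
    'Time on App': 38.59,
    'Time on Website': 0.19,
    'Length of Membership': 61.27,
}

# Hypothetical changes in feature values; features not listed stay fixed.
change = {'Time on App': 2.0, 'Length of Membership': 0.5}

# Change in predicted Yearly Amount Spent = sum of coefficient * change.
delta = sum(coef[name] * amount for name, amount in change.items())
print(f'Predicted change in yearly spend: {delta:.2f} dollars')
```

This describes an association in the fitted model, not a guarantee that changing a customer's behavior would change their spending by that amount.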

## Recommendations

- There are two ways to act on this:
  1. Develop the website so that it catches up with the performance of the mobile app, or
  2. Develop the app further, since that is what is already working better.
- As data practitioners, we can present both options to the company, backed by the numbers, and help them make the decision.