## Linear Regression

In this notebook I analyze an e-commerce company’s customer data to decide whether the company should focus its efforts on the mobile app experience or the website experience.

Import section

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

Get Data

In [None]:
df = pd.read_csv("Ecommerce Customers.csv")
df.head()

In [None]:
df.describe()

## Exploratory Data Analysis

In [None]:
# Does more time on the website go with more money spent?
# Set the style first so that it applies to this plot.
sns.set_style('whitegrid')
sns.jointplot(data=df, x='Time on Website', y='Yearly Amount Spent')

In [None]:
sns.jointplot(data=df, x='Time on App', y='Yearly Amount Spent')

The plots show a stronger correlation between “Yearly Amount Spent” and “Time on App” than between “Yearly Amount Spent” and “Time on Website”.

Let's check all the features at once.

In [None]:
sns.pairplot(df)

The pair plot shows that “Yearly Amount Spent” is more strongly correlated with “Length of Membership” than with “Time on App”.
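The visual impression from the pair plot can be cross-checked with numbers. The following is a minimal sketch, assuming Pearson correlation is an acceptable measure here. The small frame is synthetic, made-up data that only borrows the column names, and `rank_by_correlation` is a hypothetical helper, not part of the original notebook. With the real data one would pass `df` instead of `demo`.

```python
import numpy as np
import pandas as pd

def rank_by_correlation(frame, target):
    """Return Pearson correlations of each numeric column with `target`, strongest first."""
    numeric = frame.select_dtypes(include="number")
    corr = numeric.corr()[target].drop(target)
    # Order by absolute strength so strong negative relationships are not buried.
    return corr.reindex(corr.abs().sort_values(ascending=False).index)

# Synthetic stand-in data (NOT the real customer file), built so that
# membership length drives spending most and website time not at all.
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "Time on App": rng.normal(12, 1, n),
    "Time on Website": rng.normal(37, 1, n),
    "Length of Membership": rng.normal(3.5, 1, n),
})
demo["Yearly Amount Spent"] = (
    40 * demo["Time on App"]
    + 60 * demo["Length of Membership"]
    + rng.normal(0, 10, n)
)

ranking = rank_by_correlation(demo, "Yearly Amount Spent")
print(ranking)
```

Sorting by absolute value is a design choice: a feature with a strong negative correlation is just as informative as one with a strong positive correlation.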

In [None]:
sns.lmplot(data=df, x="Length of Membership", y="Yearly Amount Spent")


## Training and Testing Data

In [None]:
from sklearn.model_selection import train_test_split

The dependent (target) variable y

In [None]:
y = df["Yearly Amount Spent"]
y

In [None]:
df.columns

All the independent variables in X

In [None]:
X = df[['Avg. Session Length', 'Time on App',
       'Time on Website', 'Length of Membership']]
X

Split the data into training and test sets, holding out 30% of the rows for testing.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=101)
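To make the effect of `test_size` and `random_state` concrete, here is a small sketch on a toy array (made-up numbers, not the customer data). It assumes nothing beyond the `train_test_split` call already used above: 30% of the rows go to the test set, and the same seed reproduces the same partition.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for X and y: 10 rows, 2 features (not the real customer data).
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# First split with a fixed seed.
a_train, a_test, ya_train, ya_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=101
)
# A second split with the same seed reproduces the same partition.
b_train, b_test, yb_train, yb_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=101
)

print(a_train.shape, a_test.shape)        # row counts of the two parts
print(np.array_equal(a_test, b_test))     # same seed, same test rows
```

Rows of X and entries of y are shuffled together, so each feature row keeps its own target value after the split.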


Importing Linear Regression Model

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
lm = LinearRegression()

Fit the Linear Regression model to the training data.

In [None]:
lm.fit(X_train,y_train)

Get Coefficients

In [None]:
lm.coef_

Get predictions for the independent test data set (X_test)

In [None]:
predictions = lm.predict(X_test)
predictions

In [None]:
plt.scatter(y_test,predictions)
plt.xlabel('Y Test')
plt.ylabel('Predicted Y')
plt.grid()

Import metrics to measure the prediction error

In [None]:
from sklearn import metrics
import numpy as np

In [None]:
print(f"MAE: {metrics.mean_absolute_error(y_test,predictions)}")
print(f"RMSE: {np.sqrt(metrics.mean_squared_error(y_test,predictions))}")

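Both errors are small compared with the typical yearly amount spent (see the mean in the `df.describe()` output above), so the model fits the data well.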
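For readers who want to see what the two metrics measure, here is a sketch that recomputes them by hand with NumPy on four made-up values (not results from this data set) and compares them with scikit-learn. The R² score is an addition that the original analysis does not report. It is included only as one possible scale-free way to judge how small the error is.

```python
import numpy as np
from sklearn import metrics

# Made-up true values and predictions, chosen so the arithmetic is easy to follow.
y_true = np.array([500.0, 450.0, 600.0, 550.0])
y_pred = np.array([510.0, 440.0, 590.0, 570.0])

errors = y_true - y_pred                 # [-10, 10, 10, -20]
mae = np.mean(np.abs(errors))            # (10 + 10 + 10 + 20) / 4
rmse = np.sqrt(np.mean(errors ** 2))     # sqrt((100 + 100 + 100 + 400) / 4)

# R^2: share of the variance in y_true that the predictions account for.
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print("MAE :", mae, metrics.mean_absolute_error(y_true, y_pred))
print("RMSE:", rmse, np.sqrt(metrics.mean_squared_error(y_true, y_pred)))
print("R^2 :", r2, metrics.r2_score(y_true, y_pred))
```

RMSE squares the errors before averaging, so a single large miss raises it more than it raises MAE.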

Let's quickly explore the residuals to make sure everything was okay with our data. 

In [None]:
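# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is its replacement.
sns.histplot(y_test - predictions, bins=50, kde=True)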
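A residual histogram that is roughly bell shaped and centred on zero suggests the linear model is a reasonable fit. The sketch below is a numeric companion to that plot, run on synthetic data that is linear by construction (an assumption made for illustration, not the customer data). Unlike the cell above, it uses residuals on the training data, where an ordinary least squares fit with an intercept gives a mean of zero up to rounding.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with a truly linear relationship plus Gaussian noise.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(500, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=500)

model = LinearRegression().fit(X_demo, y_demo)
residuals = y_demo - model.predict(X_demo)   # training residuals

resid_mean = residuals.mean()
resid_std = residuals.std()
# For roughly normal residuals, about 95% should fall within two standard deviations.
share_within_2sd = np.mean(np.abs(residuals - resid_mean) < 2 * resid_std)

print(resid_mean, resid_std, share_within_2sd)
```

On held-out test data the mean will not be exactly zero, but a value far from zero or a strongly skewed histogram would be a reason to revisit the model.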

Create a new data frame with the coefficients

In [None]:
new_df = pd.DataFrame(lm.coef_,X.columns)
new_df.columns = ["Coefficients"]
new_df
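Each coefficient can be read as the change in the predicted value when that feature rises by one unit and all other features stay fixed. The sketch below checks this reading on made-up, exactly linear data (not the customer data); the names `base` and `bumped` are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: y = 2*x0 + 5*x1 + 1 exactly, so the fit recovers these numbers.
X_demo = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [1.0, 3.0], [4.0, 2.0]])
y_demo = 2.0 * X_demo[:, 0] + 5.0 * X_demo[:, 1] + 1.0

model = LinearRegression().fit(X_demo, y_demo)

base = np.array([[3.0, 3.0]])
bumped = base + np.array([[0.0, 1.0]])   # raise only the second feature by one unit

# The prediction changes by exactly the second coefficient.
change = model.predict(bumped)[0] - model.predict(base)[0]
print(model.coef_, model.intercept_, change)
```

Applied to the table above, this means the feature with the largest coefficient adds the most to the predicted yearly amount per extra unit, provided the features are measured in comparable units.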

## Conclusion
There are two ways to think about this: develop the website so that it catches up with the performance of the mobile app, or develop the app further since that is what is working better. The answer depends on other factors at the company. Before coming to a conclusion, you would probably want to explore the relationship between Length of Membership and the app or the website.

Now we can check how “Length of Membership” relates to “Time on App” and “Time on Website”.

In [None]:
df.columns

In [None]:
y_dependent_var = df["Length of Membership"]
X_independent_var = df[['Avg. Session Length', 'Time on App',
       'Time on Website','Yearly Amount Spent']]

In [None]:
X2_train, X2_test, y2_train, y2_test = train_test_split(X_independent_var,y_dependent_var,test_size=0.3,random_state=101)

In [None]:
lm2 = LinearRegression()

In [None]:
lm2.fit(X2_train,y2_train)

In [None]:
lm2.coef_

In [None]:
prediction2 = lm2.predict(X2_test)

In [None]:
plt.scatter(y2_test,prediction2)
plt.xlabel('Y2 Test')
plt.ylabel('Predicted2 Y')

In [None]:
print('MAE:', metrics.mean_absolute_error(y2_test, prediction2))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y2_test, prediction2)))

In [None]:
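# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is its replacement.
sns.histplot(y2_test - prediction2, bins=50, kde=True)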

In [None]:
coefficients = pd.DataFrame(lm2.coef_, X_independent_var.columns)
coefficients.columns = ['Coefficient']
coefficients