
In this lesson, we use linear regression to predict the tip a guest will leave after a restaurant meal, based on several known variables such as the total bill and the size of the party.

The dataset for this example is the tips dataset from Seaborn (https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv). This dataset contains information about restaurant tips provided by diners. Note that the full dataset has 244 rows (diners).

Exploratory Data Analysis:
*   Heatmap and pairplot analysis to check that the X variables are not highly correlated with each other
*   Heatmap and pairplot analysis to check that the X variables are correlated with the y variable
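
As a rough illustration of what these two checks look for, the sketch below computes a correlation matrix on a small made-up table (the values are invented for illustration and are not taken from the tips dataset). It assumes a pandas version that supports the `numeric_only` argument of `corr`.

```python
import pandas as pd

# Made-up numeric sample; "tip" plays the role of the y variable
toy = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0, 40.0, 50.0],
    "size": [1, 2, 2, 3, 4],
    "tip": [1.5, 3.0, 4.0, 6.5, 7.0],
})

# Pairwise Pearson correlations between the numeric columns
corr = toy.corr(numeric_only=True)

# Check 1: how strongly each X variable is correlated with y
print(corr["tip"].drop("tip"))

# Check 2: how strongly the X variables are correlated with each other
print(corr.loc["total_bill", "size"])
```

A value close to 1 or -1 in the second check would suggest that two X variables carry largely the same information.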

Data Scrubbing:
*   One-hot encoding for the variables time, day, and sex
*   Delete variable smoker
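
To make the one-hot encoding step concrete, here is a minimal sketch on a made-up three-row table (the rows are illustrative only). It uses `pd.get_dummies`, the same function applied to the full dataset later in this notebook.

```python
import pandas as pd

# Tiny made-up sample that mimics a few columns of the tips dataset
toy = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01],
    "time": ["Dinner", "Lunch", "Dinner"],
    "sex": ["Female", "Male", "Male"],
})

# Each category becomes its own indicator column, e.g. time_Dinner and time_Lunch
encoded = pd.get_dummies(toy, columns=["time", "sex"])
print(encoded.columns.tolist())
```

The numeric column is left unchanged, while each text column is replaced by one indicator column per category.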

Independent Variables:
*   total_bill
*   sex
*   day
*   time
*   size

Dependent Variable:
*   tip

Evaluation:
*   Mean absolute error
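
Mean absolute error (MAE) is the average of the absolute differences between the actual and predicted values. The sketch below computes it by hand on made-up numbers and compares the result with Scikit-learn's `mean_absolute_error`; the arrays are invented for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up actual tips and predicted tips, in dollars
actual = np.array([3.0, 2.0, 5.0, 4.0])
predicted = np.array([2.5, 2.0, 4.0, 4.5])

# MAE = mean of |actual - predicted|
mae_by_hand = np.mean(np.abs(actual - predicted))
mae_sklearn = mean_absolute_error(actual, predicted)

print(mae_by_hand, mae_sklearn)
```

Because MAE is expressed in the same units as the y variable, an MAE of 0.5 here would mean the predictions are off by 50 cents on average.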

In [None]:
# 1) Import the following Python libraries: A) pandas B) train_test_split from Scikit-learn C) LinearRegression from Scikit-learn D) mean_absolute_error from Scikit-learn E) seaborn F) matplotlib
import pandas as pd 
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import mean_absolute_error

In [None]:
# 2) Import dataset from the web: https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv
df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv')
df

In [None]:
# 3) Exploratory data analysis using heatmap and pairplot to check the correlation between variables
# Exploratory data analysis: heatmap
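# numeric_only=True skips the text columns (sex, smoker, day, time); recent pandas versions raise an error without it
df_corr = df.corr(numeric_only=True)
sns.heatmap(df_corr, annot=True, cmap='coolwarm')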

In [None]:
# Exploratory data analysis: pairplot
sns.pairplot(df)

In [None]:
# 4) Delete smoker variable
del df['smoker']

In [None]:
# 5) Convert non-numeric variables using one-hot encoding. These variables include: time, day, and sex
df = pd.get_dummies(df, columns=['time', 'day', 'sex'])

In [None]:
# 6) Assign the X and y variables
X = df.drop('tip',axis=1)
y = df['tip']

In [None]:
# 7) Shuffle the dataset and split the data into test/train sets (70/30 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=True)
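
Because the split in step 7 is reshuffled on every run, the error values reported later can change from run to run. One possible way to make the split repeatable is the `random_state` parameter of `train_test_split`. The sketch below uses made-up data, and the seed value 10 and the `toy_` names are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 rows with 2 features each
toy_X = np.arange(20).reshape(10, 2)
toy_y = np.arange(10)

# The same random_state produces the same shuffle, so both splits are identical
split_a = train_test_split(toy_X, toy_y, test_size=0.3, shuffle=True, random_state=10)
split_b = train_test_split(toy_X, toy_y, test_size=0.3, shuffle=True, random_state=10)

toy_X_train, toy_X_test, toy_y_train, toy_y_test = split_a
print(len(toy_X_train), len(toy_X_test))
```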

In [None]:
# 8) Assign LinearRegression as the model's algorithm
model = LinearRegression()

In [None]:
# 9) Link model to X and y variables using the fit function
model.fit(X_train, y_train)

In [None]:
# 10) Inspect the fitted model (predictions on the test data are made in step 11)
# Find y-intercept
model.intercept_

In [None]:
# Find x coefficients
model.coef_
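
The raw `coef_` array is easier to read when each value is labeled with its variable name. A minimal sketch, assuming the model was fitted on a DataFrame so that the coefficients follow its column order; the data and the `toy_` names below are made up for illustration.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up data generated from tip = 1.0 + 0.1 * total_bill + 0.5 * size
toy_X = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0, 40.0, 25.0],
    "size": [1, 2, 2, 4, 3],
})
toy_y = 1.0 + 0.1 * toy_X["total_bill"] + 0.5 * toy_X["size"]

toy_model = LinearRegression().fit(toy_X, toy_y)

# coef_ follows the column order of toy_X, so the column names can label each value
coefficients = pd.Series(toy_model.coef_, index=toy_X.columns)
print(coefficients)
print(toy_model.intercept_)
```

Each coefficient is the change in the predicted y value for a one-unit increase in that variable, with the other variables held fixed.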

In [None]:
# 11) Evaluate predictions by comparing the model's predictions and the actual outcome of the test data using mean absolute error
# Check prediction error for training data using MAE
mae_train = mean_absolute_error(y_train, model.predict(X_train))
print("Training Set Mean Absolute Error: %.2f" % mae_train)

In [None]:
# Check prediction error for test data using MAE
mae_test = mean_absolute_error(y_test, model.predict(X_test))
print("Test Set Mean Absolute Error: %.2f" % mae_test)

In [None]:
# 12) Make a prediction with the model using a sample data point and the predict function
# Data point to predict
# Values must follow the column order of X
sample_data_point = [
    40,  # total_bill
    2,   # size
    1,   # time_Dinner
    0,   # time_Lunch
    1,   # day_Fri
    0,   # day_Sat
    0,   # day_Sun
    0,   # day_Thur
    1,   # sex_Female
    0,   # sex_Male
]

In [None]:
# Make prediction
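# Wrap the sample in a DataFrame with the training column names so the features line up with X
prediction = model.predict(pd.DataFrame([sample_data_point], columns=X.columns))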
prediction