## Regression Plots

In [None]:
# Our standard imports for visualization
import numpy as np
from numpy.random import randn
import pandas as pd

from scipy import stats

import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

Since Seaborn is a statistical visualization package, we will use its *regplot()* function to "Plot data and a linear regression model fit."
https://seaborn.pydata.org/generated/seaborn.regplot.html

We'll start by loading a sample Seaborn dataset that contains information on tips received by restaurant staff.

In [None]:
tips = sns.load_dataset("tips")

tips.head()

Previewing the data, we can see that we will probably want to use tip as our dependent variable, so we will choose that as *y*. Let's start by examining total_bill as the independent variable, *x*.

In [None]:
sns.regplot(x="total_bill", y="tip", data=tips)
plt.show()

We see the standard scatter plot of the points, and we also see that by default Seaborn draws a regression line and a shaded 95% confidence interval for the regression line (we will discuss confidence intervals in the next course section).

We would probably like to know some basic information about the regression model, such as its equation and *r* value. Seaborn does not show us that information, but we can easily calculate it with SciPy.

In [None]:
lin_reg = stats.linregress(x=tips["total_bill"], y=tips["tip"])
print(lin_reg)

Now we have all the values we need. To put them into the standard equation for a line, we just need to remember *y = mx + b*. The type of the lin_reg variable is LinregressResult, but it inherits from tuple, so we can index it like a tuple to select the values we need.

In [None]:
slope = lin_reg[0]
intercept = lin_reg[1]
r_value = lin_reg[2]
print(f"y = {slope}x + {intercept}")
print(f"Correlation Coefficient r: {r_value}")
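
As an alternative to positional indexing, the result returned by *linregress()* also exposes its values as named attributes, which can be easier to read. The sketch below uses a small made-up dataset (not the tips data) so it runs on its own. It also squares *r* to get the share of the variance in *y* that the line explains.

```python
import numpy as np
from scipy import stats

# Small made-up dataset (not the tips data), used only for illustration.
x = np.array([10.0, 15.0, 20.0, 25.0, 30.0])
y = np.array([1.5, 2.0, 3.1, 3.4, 4.6])

result = stats.linregress(x, y)

# Named attributes return the same values as positional indexing.
slope, intercept, r_value = result.slope, result.intercept, result.rvalue
print(slope == result[0], intercept == result[1], r_value == result[2])

# r squared: the share of the variance in y explained by the line.
r_squared = r_value ** 2
print(f"r^2 = {r_squared:.4f}")
```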

This line is called the Least Squares Regression Line (LSRL), and the method that produces it is Ordinary Least Squares (OLS) regression. The name comes from what the line minimizes: the sum of the squared residuals over all of the points. A residual is the prediction error for an actual point (the vertical distance from the point to the line).

To visualize what the LSRL is minimizing, try this applet when you have time (or if you finish this notebook early):
https://phet.colorado.edu/sims/html/least-squares-regression/latest/least-squares-regression_en.html

First select a dataset from the drop down at the top. Next, select the two checkboxes on the right that say "Residuals" and "Squared Residuals". Use the sliders to adjust your line to try to minimize the total area of all the boxes (shown next to "sum"). When you think you have minimized the sum, select the "Best Fit Line" on the left and see how it compares to your line.
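
The same experiment can be sketched numerically. The example below is a minimal illustration on made-up points (not the tips data): it computes the least squares slope and intercept from the standard closed-form formulas for a single predictor, then nudges the line in each direction to show that the sum of squared residuals only gets larger. The function name `sum_squared_residuals` is just a helper chosen for this sketch.

```python
# Made-up points for illustration (not the tips data).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1]

def sum_squared_residuals(slope, intercept):
    """Total area of the 'squared residual' boxes for a candidate line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

# Closed-form least squares solution for a single predictor.
n = len(xs)
x_mean = sum(xs) / n
y_mean = sum(ys) / n
best_slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
              / sum((x - x_mean) ** 2 for x in xs))
best_intercept = y_mean - best_slope * x_mean

best_sse = sum_squared_residuals(best_slope, best_intercept)
print(f"best fit: y = {best_slope:.4f}x + {best_intercept:.4f}, sum = {best_sse:.4f}")

# Nudging the slope or intercept in either direction makes the sum larger.
for d_slope, d_intercept in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.0, -0.1)]:
    nudged = sum_squared_residuals(best_slope + d_slope, best_intercept + d_intercept)
    print(f"nudged by ({d_slope:+.1f}, {d_intercept:+.1f}): sum = {nudged:.4f}")
```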

#### Make a Prediction

If we want to use the regression line to make a prediction for tip based on a bill of $37.50, we can make the calculation manually using this equation.

In [None]:
bill = 37.50

# TODO: Calculate the predicted tip using slope, intercept, and bill:
tip_pred = 

print(f"We predict a bill of ${bill} to generate a tip of ${tip_pred}.")

One quick change before we continue making plots: let's fix the decimals on that prediction statement by modifying the f-string.

In [None]:
print(f"We predict a bill of ${bill:.2f} to generate a tip of ${tip_pred:.2f}.")
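
The part after the colon inside the braces is a format specification. As a quick reference, here are a few common variations applied to an arbitrary number chosen for illustration:

```python
value = 1234.5678  # arbitrary number for illustration

print(f"{value:.2f}")   # fixed-point, two decimals
print(f"{value:,.2f}")  # same, with a thousands separator
print(f"{value:.1e}")   # scientific notation, one decimal
```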

If we want to change or remove the shaded confidence interval, we can do that with the ci argument. We can also show our equation and *r* value in a legend by passing a label to the line through the line_kws argument.

In [None]:
# TODO: Fix the label so that Python uses an f-string to automatically
# insert the actual values for slope, intercept, and r_value
ax = sns.regplot(x="total_bill", y="tip", data=tips, ci=None, 
                 line_kws={'label':f"y = slope * x + intercept\nr: r_value"})
ax.legend()
plt.show()

We can continue modifying the plot with additional arguments. We can change properties of the points and the line separately.

In [None]:
ax = sns.regplot(x="total_bill", y="tip", data=tips, ci=None, 
                 scatter_kws={'color': 'olive'},
                 line_kws={'label':f"y = {slope:.4f}x + {intercept:.4f}\nr: {r_value:.4f}",
                           'linewidth': 1, "color": "darkorchid"})
ax.legend()
plt.show()

If we think that perhaps this data has a higher-order trend (i.e. polynomial instead of linear), we can specify what degree polynomial to fit with the *order* argument. Note that I'm removing the legend, since the linear equation no longer describes the fitted curve.

In [None]:
sns.regplot(x="total_bill", y="tip", data=tips, ci=None, order=4,
            scatter_kws={'color': 'olive'},
            line_kws={'linewidth': 1, "color": "darkorchid"})
plt.show()
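
To get a feel for what a higher-degree fit does, here is a rough sketch using NumPy's *polyfit()* on made-up data (not the tips data). I am assuming this resembles what Seaborn does when *order* is greater than 1, but treat that as an assumption rather than a statement about its internals. The helper name `sse_for_degree` is invented for this example.

```python
import numpy as np

# Made-up data with a mild curve (not the tips data).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
y = np.array([1.0, 1.8, 2.1, 2.9, 4.2, 4.8, 7.1, 9.3])

def sse_for_degree(degree):
    """Fit a polynomial of the given degree and return its sum of squared residuals."""
    coefficients = np.polyfit(x, y, degree)  # coefficients, highest power first
    predictions = np.polyval(coefficients, x)
    return float(np.sum((y - predictions) ** 2))

sse_linear = sse_for_degree(1)
sse_quartic = sse_for_degree(4)
print(f"degree 1: sum of squared residuals = {sse_linear:.4f}")
print(f"degree 4: sum of squared residuals = {sse_quartic:.4f}")
```

A higher-degree polynomial always fits the points it was trained on at least as closely as a lower-degree one, so a smaller sum by itself does not show that the higher-order model is the better description of the data.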

Let's say that now we want to examine if there is any relationship involving the categorical variables in the data. Let's start by looking for a difference by sex.

To do this we will switch from the *regplot()* function to the *lmplot()* function, since *regplot()* is for a single regression model, and *lmplot()* is designed to display multiple models on a single plot. The arguments are mostly the same.
https://seaborn.pydata.org/generated/seaborn.lmplot.html#seaborn.lmplot

We can color the points by passing our variable of choice to the *hue* argument. We will also change the markers to make them easier to separate visually.

In [None]:
sns.lmplot(x="total_bill", y="tip", data=tips, hue="sex", markers=["x", "o"])
plt.show()
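
Each colored line in a plot like this is a separate regression fit to one category's points only. A minimal sketch of that idea, using made-up data with a two-level category (not the tips data) and NumPy's *polyfit()* for the straight-line fits:

```python
import numpy as np

# Made-up data with a two-level category (not the tips data).
x = np.array([10.0, 12.0, 15.0, 20.0, 25.0, 11.0, 14.0, 18.0, 22.0, 30.0])
y = np.array([1.5, 2.0, 2.2, 3.1, 3.9, 1.2, 2.1, 2.4, 3.5, 4.1])
group = np.array(["A"] * 5 + ["B"] * 5)

fits = {}
for level in np.unique(group):
    mask = group == level
    # A degree-1 polyfit returns (slope, intercept) for this subset only.
    slope, intercept = np.polyfit(x[mask], y[mask], 1)
    fits[str(level)] = (float(slope), float(intercept))
    print(f"{level}: y = {slope:.4f}x + {intercept:.4f}")
```

Comparing the slopes and intercepts in `fits` is the numerical counterpart of comparing the lines by eye.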

Take a moment to describe what you see in the above plot.

Now let's try comparing by day of the week and see if there are any trends visible.

In [None]:
# TODO: Change the order of the days in the legend so that they are reversed,
# from Sunday to Thursday (Look at the documentation for lmplot())
sns.lmplot(x="total_bill", y="tip", data=tips, hue="day")
plt.show()

Describe this plot. Does it provide any information beyond what the previous plots showed?