# Introduction to Linear Regression

In [None]:
# imports
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# this allows plots to appear directly in the notebook
%matplotlib inline

## Example: Iris Data

Let's take a look at some data, ask some questions about that data, and then use linear regression to answer those questions!

In [None]:
# load seaborn's built-in Iris dataset into a DataFrame
data = sns.load_dataset("iris")

data.head()

What are the **features**?
- sepal_length
- sepal_width
- petal_length
- petal_width

What is the **response**?
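- species (a categorical label; the regression later in this notebook uses petal_width as its quantitative response instead)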

In [None]:
# print the shape of the DataFrame
data.shape

There are 150 **observations** in the dataset.

The Iris dataset has 50 samples from each of three species of Iris flower (setosa, versicolor, and virginica).

Four features were measured (in centimeters) for each sample: the length and width of the sepals and petals. Let's look at summary statistics for the dataset:

In [None]:
data.describe()

In [None]:
data['species'].value_counts()

In [None]:
print('Iris-setosa')
setosa = data['species'] == 'setosa'
print(data[setosa].describe())
print('\nIris-versicolor')
versicolor = data['species'] == 'versicolor'
print(data[versicolor].describe())
print('\nIris-virginica')
virginica = data['species'] == 'virginica'
print(data[virginica].describe())
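
The three boolean-mask blocks above can also be collapsed into a single `groupby` call. The sketch below uses a tiny made-up frame (the values are illustrative, not real Iris measurements) with the same column names as `data`. With the notebook's `data` in scope, `data.groupby('species').describe()` should give the per-species summaries in one table.

```python
import pandas as pd

# Toy stand-in for the Iris DataFrame; the values are made up for illustration.
toy = pd.DataFrame({
    "species": ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"],
    "petal_length": [1.4, 1.6, 4.0, 4.6, 5.5, 6.1],
})

# One summary row per species instead of three separate masks.
per_species = toy.groupby("species")["petal_length"].describe()
print(per_species)
```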

In [None]:
sns.set()
sns.swarmplot(x="species", y="petal_length", data=data)

In [None]:
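# distplot is deprecated in recent seaborn releases; histplot with kde=True is the closest replacement
sns.histplot(data['petal_width'], bins=40, color='m', kde=True, stat='density')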

In [None]:
sns.boxplot(x='species', y='sepal_width', data=data, palette='YlGnBu')

In [None]:
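# numeric_only=True skips the non-numeric species column (pandas >= 2.0 raises an error otherwise)
sns.heatmap(data.corr(numeric_only=True), cmap="YlGnBu", linecolor='white', linewidths=1)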

In [None]:
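print('Covariance:')
# numeric_only=True restricts the covariance matrix to the four measurement columns
data.cov(numeric_only=True)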
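
Covariance depends on the units of each column, while correlation is the same quantity rescaled by the two standard deviations. As a minimal sketch of that relationship, assuming a small made-up frame (the numbers are illustrative, not real Iris measurements), the correlation matrix can be rebuilt from the covariance matrix:

```python
import numpy as np
import pandas as pd

# Toy values for illustration only.
toy = pd.DataFrame({
    "petal_length": [1.4, 1.6, 4.0, 4.6, 5.5, 6.1],
    "petal_width": [0.2, 0.3, 1.2, 1.5, 1.9, 2.3],
})

cov = toy.cov()   # pairwise sample covariance
std = toy.std()   # sample standard deviation of each column

# corr(x, y) = cov(x, y) / (std(x) * std(y))
corr_from_cov = cov / np.outer(std, std)
print(corr_from_cov)
```

The result should match `toy.corr()` up to floating-point error, which is why the heatmap of correlations is easier to compare across features than the raw covariances.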

## Pair Plot

A pair plot shows the pairwise relationships between all the variables in a dataset; the diagonal axes show the univariate distribution of each variable.

The example below takes the entire dataset as input and colors the points by species.

In [None]:
sns.pairplot(data, hue='species', palette="OrRd")

## Regression Plots

A regression plot draws a scatter plot of the data together with a fitted linear regression line.

In the example below, we plot petal_length (y-axis) against petal_width (x-axis).

In [None]:
sns.regplot(x='petal_width', y='petal_length', data=data)
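
The straight line that `regplot` draws is an ordinary least squares fit (by default it also adds a shaded confidence band). As a rough sketch of the line alone, assuming made-up toy values rather than the real Iris columns, a degree-1 `np.polyfit` gives the slope and intercept:

```python
import numpy as np

# Toy values for illustration only.
x = np.array([0.2, 0.3, 1.2, 1.5, 1.9, 2.3])   # plays the role of petal_width
y = np.array([1.4, 1.6, 4.0, 4.6, 5.5, 6.1])   # plays the role of petal_length

# polyfit returns coefficients from the highest degree down: [slope, intercept]
slope, intercept = np.polyfit(x, y, deg=1)

fitted = intercept + slope * x   # points on the fitted line
residuals = y - fitted           # vertical distances from the line
print(slope, intercept)
```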

## Missing data

In [None]:
data.isnull().sum()

## Simple Linear Regression

Simple linear regression is an approach for predicting a **quantitative response** using a **single feature** (or "predictor" or "input variable"). It takes the following form:

$y = \beta_0 + \beta_1x$

What does each term represent?
- $y$ is the response
- $x$ is the feature
- $\beta_0$ is the intercept
- $\beta_1$ is the coefficient for $x$

Together, $\beta_0$ and $\beta_1$ are called the **model coefficients**. To create a model, you must "learn" the values of these coefficients. Once we've learned them, we can use the model to predict the response (here, petal width from petal length).

<img src="images/17/estimating_coefficients.png">

What elements are present in the diagram?
- The black dots are the **observed values** of x and y.
- The blue line is our **least squares line**.
- The red lines are the **residuals**, which are the distances between the observed values and the least squares line.
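
Written out, the least squares line is the choice of $\beta_0$ and $\beta_1$ that minimizes the sum of squared residuals over the $n$ observations $(x_i, y_i)$. The index $i$ and the name RSS are notation introduced here for illustration; they do not appear in the diagram.

$\mathrm{RSS} = \sum_{i=1}^{n} \left(y_i - (\beta_0 + \beta_1 x_i)\right)^2$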

How do the model coefficients relate to the least squares line?
- $\beta_0$ is the **intercept** (the value of $y$ when $x = 0$)
- $\beta_1$ is the **slope** (the change in $y$ divided by the change in $x$)

Here is a graphical depiction of those calculations:

<img src="images/17/slope_intercept.png">
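
For a single feature, the least squares slope and intercept have a closed form: the slope is the covariance of $x$ and $y$ divided by the variance of $x$, and the intercept makes the line pass through the point of means. The sketch below computes both with NumPy on made-up toy values (not real Iris measurements); the variable names are chosen here for illustration.

```python
import numpy as np

# Toy values for illustration only.
x = np.array([1.4, 1.6, 4.0, 4.6, 5.5, 6.1])   # feature, e.g. petal_length
y = np.array([0.2, 0.3, 1.2, 1.5, 1.9, 2.3])   # response, e.g. petal_width

x_mean, y_mean = x.mean(), y.mean()

# slope = sum((x - x_mean) * (y - y_mean)) / sum((x - x_mean)^2)
beta_1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)

# intercept = y_mean - slope * x_mean
beta_0 = y_mean - beta_1 * x_mean
print(beta_0, beta_1)
```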

Let's use **Statsmodels** to estimate the model coefficients for the Iris data:

In [None]:
# this is the standard import if you're using "formula notation" (similar to R)
import statsmodels.formula.api as smf

# create a fitted model in one line
lm = smf.ols(formula='petal_width ~ petal_length', data=data).fit()

# print the coefficients
lm.params
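
Once the coefficients are known, a prediction is just $\beta_0 + \beta_1 x$. The sketch below fits the same formula on a small made-up frame (illustrative values, not the real Iris data) and checks that `predict` agrees with the by-hand calculation. The names `toy`, `toy_lm`, and `new_flowers` are hypothetical; with the notebook's fitted `lm`, the same `predict` call should work on any DataFrame that has a `petal_length` column.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Toy values for illustration only.
toy = pd.DataFrame({
    "petal_length": [1.4, 1.6, 4.0, 4.6, 5.5, 6.1],
    "petal_width": [0.2, 0.3, 1.2, 1.5, 1.9, 2.3],
})
toy_lm = smf.ols(formula="petal_width ~ petal_length", data=toy).fit()

# Hypothetical new observations to predict for.
new_flowers = pd.DataFrame({"petal_length": [2.0, 5.0]})
predicted = toy_lm.predict(new_flowers)

# The same prediction by hand: intercept + slope * x
manual = (toy_lm.params["Intercept"]
          + toy_lm.params["petal_length"] * new_flowers["petal_length"])
print(predicted)
```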

In [None]:
# print a summary of the fitted model
lm.summary()
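
The summary table is dense, so it can help to pull out individual numbers. As a sketch on made-up toy values (not the real Iris data, and with the hypothetical name `toy_lm`), the fitted results object exposes R-squared, p-values, and confidence intervals directly:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Toy values for illustration only.
toy = pd.DataFrame({
    "petal_length": [1.4, 1.6, 4.0, 4.6, 5.5, 6.1],
    "petal_width": [0.2, 0.3, 1.2, 1.5, 1.9, 2.3],
})
toy_lm = smf.ols(formula="petal_width ~ petal_length", data=toy).fit()

r_squared = toy_lm.rsquared       # share of variance explained by the model
p_values = toy_lm.pvalues         # one p-value per coefficient
conf_int = toy_lm.conf_int()      # 95% confidence interval per coefficient by default
print(r_squared)
print(p_values)
print(conf_int)
```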
