# Regression Analysis Using Python

## Linear Regression Basics

* Linear regression is a predictive modeling technique for predicting a numeric response variable based on one or more explanatory variables. 

* The term "regression" in predictive modeling generally refers to any modeling task that involves predicting a real number (as opposed to classification, which involves predicting a category or class).

* The term "linear" in the name linear regression refers to the fact that the method models data with a linear combination of the explanatory variables.

* A linear combination is an expression where one or more variables are scaled by a constant factor and added together. 

* In the case of linear regression with a single explanatory variable, the linear combination can be expressed as:
 
 $$ \text{response} = \text{intercept} + \text{constant} \times \text{explanatory} $$
 
The right side of the equation defines a line: the intercept is its y-intercept, and the constant is the slope that multiplies the explanatory variable.

In other words, linear regression in its most basic form fits a straight line through the data to predict the response variable. 
The model is designed to fit the line that minimizes the sum of squared differences between the observed and predicted responses (these differences are called errors or residuals).
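
The quantity being minimized can be written compactly. The symbols here are introduced only for illustration: $y_i$ is the observed response for observation $i$, $x_i$ is the explanatory value, $b_0$ is the intercept, $b_1$ is the constant (slope), and $n$ is the number of observations.

$$ \sum_{i=1}^{n} \left( y_i - (b_0 + b_1 x_i) \right)^2 $$

Linear regression chooses the $b_0$ and $b_1$ that make this sum as small as possible.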

* We won't go into all the math behind how the model actually minimizes the squared errors, but the end result is a line intended to give the "best fit" to the data. 
    * The theory behind minimizing the sum of squared errors is covered in an earlier module.
    

* Since linear regression fits data with a line, it is most effective in cases where the response and explanatory variable have a linear relationship.
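
As a quick numerical check of what "best fit" means, the sketch below fits a line to a small made-up dataset and confirms that nudging the slope away from the fitted value increases the sum of squared errors. The data values and the helper name `sum_squared_errors` are hypothetical and chosen only for illustration; `np.polyfit` stands in here for the least-squares fit.

```python
import numpy as np

# Hypothetical toy data, made up for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])


def sum_squared_errors(intercept, slope):
    """Sum of squared residuals for the line intercept + slope * x."""
    residuals = y - (intercept + slope * x)
    return float(np.sum(residuals ** 2))


# np.polyfit with degree 1 returns the least-squares [slope, intercept]
slope, intercept = np.polyfit(x, y, 1)
sse_best = sum_squared_errors(intercept, slope)

# A line with a slightly different slope should fit worse (larger SSE)
sse_nudged = sum_squared_errors(intercept, slope + 0.1)

print(f"slope={slope:.3f}, intercept={intercept:.3f}")
print(f"SSE at fit={sse_best:.4f}, SSE with nudged slope={sse_nudged:.4f}")
```

The same comparison holds for any other change to the slope or intercept, which is what it means for the fitted line to minimize the squared errors.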


## Simple Linear Regression

Using the Longley dataset, we will examine whether employment in a country is a function of *Gross National Product (GNP)*.

In [11]:
import numpy as np
import statsmodels.api as sm
import pandas as pd
 
# Read the data
df = pd.read_csv('http://vincentarelbundock.github.io/Rdatasets/csv/datasets/longley.csv', index_col=0)

In [12]:
# dataset info
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 16 entries, 1947 to 1962
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   GNP.deflator  16 non-null     float64
 1   GNP           16 non-null     float64
 2   Unemployed    16 non-null     float64
 3   Armed.Forces  16 non-null     float64
 4   Population    16 non-null     float64
 5   Year          16 non-null     int64  
 6   Employed      16 non-null     float64
dtypes: float64(6), int64(1)
memory usage: 1.0 KB


In [13]:
# See the first few rows of the data
df.head()

| | GNP.deflator | GNP | Unemployed | Armed.Forces | Population | Year | Employed |
|---|---|---|---|---|---|---|---|
| 1947 | 83.0 | 234.289 | 235.6 | 159.0 | 107.608 | 1947 | 60.323 |
| 1948 | 88.5 | 259.426 | 232.5 | 145.6 | 108.632 | 1948 | 61.122 |
| 1949 | 88.2 | 258.054 | 368.2 | 161.6 | 109.773 | 1949 | 60.171 |
| 1950 | 89.5 | 284.599 | 335.1 | 165.0 | 110.929 | 1950 | 61.187 |
| 1951 | 96.2 | 328.975 | 209.9 | 309.9 | 112.075 | 1951 | 63.221 |


In [16]:
# Dependent (response) variable
y = df.Employed

# Independent (explanatory) variable: a single predictor
X = df.GNP
# Add a column of ones so the model also estimates an intercept
X = sm.add_constant(X)
X

| | const | GNP |
|---|---|---|
| 1947 | 1.0 | 234.289 |
| 1948 | 1.0 | 259.426 |
| 1949 | 1.0 | 258.054 |
| 1950 | 1.0 | 284.599 |
| 1951 | 1.0 | 328.975 |
| 1952 | 1.0 | 346.999 |
| 1953 | 1.0 | 365.385 |
| 1954 | 1.0 | 363.112 |
| 1955 | 1.0 | 397.469 |
| 1956 | 1.0 | 419.18 |
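
The `const` column is simply a column of ones placed in front of the explanatory variable. Multiplying it by the intercept adds the same amount to every prediction. A minimal NumPy sketch of the same layout, using the first five GNP values from the output (the variable names `gnp` and `design` are illustrative):

```python
import numpy as np

# First five GNP values, copied from the output shown in this section
gnp = np.array([234.289, 259.426, 258.054, 284.599, 328.975])

# Column of ones first, explanatory variable second (same layout as the table)
design = np.column_stack([np.ones(len(gnp)), gnp])
print(design)
```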


In [18]:
est = sm.OLS(y, X)
est

<statsmodels.regression.linear_model.OLS at 0x13593d9c348>

In [19]:
est = est.fit()


In [20]:
est.summary()


| | | | |
|---|---|---|---|
| Dep. Variable: | Employed | R-squared: | 0.967 |
| Model: | OLS | Adj. R-squared: | 0.965 |
| Method: | Least Squares | F-statistic: | 415.1 |
| Date: | Wed, 05 Oct 2016 | Prob (F-statistic): | 8.36e-12 |
| Time: | 12:13:41 | Log-Likelihood: | -14.904 |
| No. Observations: | 16 | AIC: | 33.81 |
| Df Residuals: | 14 | BIC: | 35.35 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| const | 51.8436 | 0.681 | 76.087 | 0.000 | 50.382 53.305 |
| GNP | 0.0348 | 0.002 | 20.374 | 0.000 | 0.031 0.038 |

| | | | |
|---|---|---|---|
| Omnibus: | 1.925 | Durbin-Watson: | 1.619 |
| Prob(Omnibus): | 0.382 | Jarque-Bera (JB): | 1.215 |
| Skew: | 0.664 | Prob(JB): | 0.545 |
| Kurtosis: | 2.759 | Cond. No. | 1660.0 |
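
Several entries in the summary are related by simple arithmetic. As a rough sketch, the residual degrees of freedom and the adjusted R-squared can be recomputed from the rounded values printed above, using the standard adjusted R-squared formula (the variable names are illustrative):

```python
# Values copied from the summary output above
r_squared = 0.967
n_obs = 16      # No. Observations
df_model = 1    # Df Model (number of explanatory variables)

# One degree of freedom is used per explanatory variable, plus one for the intercept
df_resid = n_obs - df_model - 1

# Adjusted R-squared penalizes R-squared for the number of explanatory variables
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / df_resid

print(df_resid, round(adj_r_squared, 3))  # prints: 14 0.965
```

Both results match the `Df Residuals` and `Adj. R-squared` entries reported in the summary.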


In [20]:
est.fittedvalues                             

1947    59.985670
1948    60.859238
1949    60.811558
1950    61.734058
1951    63.276226
1952    63.902601
1953    64.541557
1954    64.462565
1955    65.656549
1956    66.411057
1957    67.230828
1958    67.292583
1959    68.618661
1960    69.310128
1961    69.851290
1962    71.127429
dtype: float64
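
Each fitted value is the intercept plus the GNP coefficient times that year's GNP. The sketch below recomputes the first three from the rounded coefficients in the summary. Because those coefficients are rounded to four decimals, the results should agree with `est.fittedvalues` only approximately (to within about 0.02 here).

```python
# Coefficients as printed (rounded) in the summary output
intercept = 51.8436
slope = 0.0348

# GNP for the first three years, copied from the data shown earlier
gnp = {1947: 234.289, 1948: 259.426, 1949: 258.054}

# Fitted values reported by est.fittedvalues for the same years
reported = {1947: 59.985670, 1948: 60.859238, 1949: 60.811558}

predicted = {year: intercept + slope * value for year, value in gnp.items()}
for year in gnp:
    print(year, round(predicted[year], 3), reported[year])
```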

In [21]:
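%matplotlib inline
import matplotlib.pyplot as plt
import scipy.stats as stats

# Normal Q-Q plot of the residuals; probplot only draws when `plot` is given
stats.probplot(est.resid, dist="norm", plot=plt)
plt.show()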

## Multiple Linear Regression

In [24]:
# The code below uses the `sm` alias, so import the API module under that name
import statsmodels.api as sm


In [27]:
# Read the CSV file
stud_reg = pd.read_csv('D:\\PythonFiles\\Regression Analysis\\2_Codes\\Students.csv')
print("The list of row indices")
print(stud_reg.index)
print("The column headings")
print(stud_reg.columns)

# Two explanatory variables and the response
X = stud_reg[['PLACE_RATE', 'NO_GRAD_STUD']]
y = stud_reg.APPLICATIONS
X = sm.add_constant(X)

print(X.head())
print(y.head())

# Fit the OLS model and show the summary
est = sm.OLS(y, X)
est = est.fit()
est.summary()


The list of row indices
RangeIndex(start=0, stop=1320, step=1)
The column headings
Index(['APPLICATIONS', 'PLACE_RATE', 'NO_GRAD_STUD'], dtype='object')
   const  PLACE_RATE  NO_GRAD_STUD
0    1.0          61         13742
1    1.0          50         14744
2    1.0          53         13588
3    1.0          55         13000
4    1.0          50         12500
0    4633
1    7075
2    5615
3    4806
4    5056
Name: APPLICATIONS, dtype: int64


| | | | |
|---|---|---|---|
| Dep. Variable: | APPLICATIONS | R-squared: | 1.0 |
| Model: | OLS | Adj. R-squared: | 1.0 |
| Method: | Least Squares | F-statistic: | 24510000000.0 |
| Date: | Sun, 07 Jun 2020 | Prob (F-statistic): | 0.0 |
| Time: | 20:16:29 | Log-Likelihood: | -515.49 |
| No. Observations: | 1320 | AIC: | 1037.0 |
| Df Residuals: | 1317 | BIC: | 1053.0 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 805.9738 | 0.096 | 8369.282 | 0.000 | 805.785 | 806.163 |
| PLACE_RATE | -139.9912 | 0.001 | -1.79e+05 | 0.000 | -139.993 | -139.990 |
| NO_GRAD_STUD | 0.8999 | 6.63e-06 | 1.36e+05 | 0.000 | 0.900 | 0.900 |

| | | | |
|---|---|---|---|
| Omnibus: | 26.66 | Durbin-Watson: | 2.024 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 23.884 |
| Skew: | 0.275 | Prob(JB): | 6.51e-06 |
| Kurtosis: | 2.636 | Cond. No. | 123000.0 |
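
With two explanatory variables, the prediction is still a linear combination: the constant plus each coefficient times its variable. As a rough check, the sketch below applies the rounded coefficients from the summary to the five rows printed by `X.head()` and compares the result with `y.head()`. The array names are illustrative, and small differences are expected because the coefficients are rounded.

```python
import numpy as np

# Coefficients as printed (rounded): const, PLACE_RATE, NO_GRAD_STUD
beta = np.array([805.9738, -139.9912, 0.8999])

# First five rows of X (including the constant column), copied from the output
X_head = np.array([
    [1.0, 61, 13742],
    [1.0, 50, 14744],
    [1.0, 53, 13588],
    [1.0, 55, 13000],
    [1.0, 50, 12500],
])
# Observed APPLICATIONS for the same rows, copied from the output
y_head = np.array([4633, 7075, 5615, 4806, 5056])

# Each prediction is the dot product of a row of X with the coefficients
predicted = X_head @ beta
residuals = y_head - predicted

for obs, pred in zip(y_head, predicted):
    print(obs, round(float(pred), 2))
```

For these five rows the predictions land within one application of the observed counts, consistent with the R-squared of 1.0 reported in the summary.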
