# Regression (Chapter 11) - Think Stats, 2nd Edition

## Universidade Federal de Alagoas - UFAL
## Centro de Tecnologia - CTEC
## Programa de Pós-Graduação Recursos Hídricos e Saneamento - PPGRHS
### Statistics Course

Clebson Farias

In [1]:
from __future__ import print_function, division

%matplotlib inline

import numpy as np
import pandas as pd

import thinkstats2
import thinkplot

In [2]:
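# Manso data: daily series indexed by date
dados = pd.read_csv("manso.csv", index_col=0, parse_dates=True)
# Keep the study period, 1 Aug 1981 to 31 Dec 1989
date_start = pd.to_datetime("01/08/1981", dayfirst=True)
date_end = pd.to_datetime("31/12/1989", dayfirst=True)
dados = dados.loc[date_start:date_end]
# Replace the numeric column codes with readable names
dados.rename(index=str,
             columns={"1455008": "COIMBRA_P", "66210000": "MANSO_JUS", "66231000": "COIMBRA_F"},
             inplace=True)
# rename(index=str) turns the index into strings, so convert it back to datetimes
dados.index = pd.to_datetime(dados.index, errors='coerce')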
dados.head()

Date,COIMBRA_P,MANSO_JUS,COIMBRA_F,MANSO
1981-08-01,0.0,,88.46,
1981-08-02,0.0,,87.233,
1981-08-03,0.0,,87.233,
1981-08-04,0.0,,86.011,
1981-08-05,0.0,,84.794,


In [3]:
# Monthly aggregates: total precipitation and mean flows
dados_chuva = pd.Series(dados["COIMBRA_P"].groupby(pd.Grouper(freq='M')).sum(), name='Prec')
dados_vazao_obs = pd.Series(dados["COIMBRA_F"].groupby(pd.Grouper(freq='M')).mean(), name='Obs')
dados_vazao_nat = pd.Series(dados["MANSO"].groupby(pd.Grouper(freq='M')).mean(), name='Nat')
# One row per month, with columns Obs, Nat and Prec
dados_month = pd.DataFrame([dados_vazao_obs, dados_vazao_nat, dados_chuva]).T
dados_month.head()

Date,Obs,Nat,Prec
1981-08-31,82.929097,,0.0
1981-09-30,80.033133,,8.1
1981-10-31,109.033903,,67.6
1981-11-30,166.606933,,194.2
1981-12-31,216.804194,,160.4


## Regression

**The goal of regression analysis is to describe the relationship between one set of variables**, called the dependent variables, and another set of variables, called the independent or explanatory variables.

If the relationship between the dependent and explanatory variables is linear, the analysis is called linear regression.

\begin{equation}
    y = \beta_{0} + \beta_{1}x_{1} + \beta_{2}x_{2} + \epsilon
\end{equation}

where

- $x_{1}$ and $x_{2}$ = explanatory variables
- $y$ = dependent variable
- $\beta_{0}$ = intercept
- $\beta_{1}$ and $\beta_{2}$ = parameters associated with $x_{1}$ and $x_{2}$
- $\epsilon$ = residual

Given a sequence of values for $y$ and sequences for $x_{1}$ and $x_{2}$, **we can find the parameters** $\beta_{0}$, $\beta_{1}$, and $\beta_{2}$ that minimize the sum of $\epsilon^{2}$. This process is called **ordinary least squares**.
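The least-squares step described above can be sketched with NumPy alone. The data below are synthetic, and the "true" coefficients (2.0, 1.5, -0.7), the sample size, and the noise level are illustrative assumptions, not values from the Manso data set.

```python
import numpy as np

# Synthetic data (assumed for illustration only).
rng = np.random.default_rng(0)
n = 200
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 5, n)
eps = rng.normal(0, 0.1, n)          # residual term
y = 2.0 + 1.5 * x1 - 0.7 * x2 + eps  # y = b0 + b1*x1 + b2*x2 + eps

# Design matrix: a column of ones for the intercept, then x1 and x2.
X = np.column_stack([np.ones(n), x1, x2])

# Ordinary least squares: the beta that minimizes sum((y - X @ beta)**2).
beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)

residuals = y - X @ beta
sse = float(np.sum(residuals ** 2))  # the minimized sum of squared residuals
print("beta:", beta)
print("sum of squared residuals:", sse)
```

Any other choice of parameters gives a sum of squared residuals at least as large as `sse`, which is what "ordinary least squares" means in practice.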

## StatsModels

- **StatsModels** is a Python package that provides several forms of regression and other analyses.

In [4]:
import statsmodels.formula.api as smf
formula = "Nat ~ Obs"
model = smf.ols(formula, data=dados_month)
results = model.fit()
results.summary()

Dep. Variable:,Nat,R-squared:,0.996
Model:,OLS,Adj. R-squared:,0.996
Method:,Least Squares,F-statistic:,21300.0
Date:,"Sat, 01 Jun 2019",Prob (F-statistic):,1.34e-112
Time:,23:54:15,Log-Likelihood:,-337.07
No. Observations:,96,AIC:,678.1
Df Residuals:,94,BIC:,683.3
Df Model:,1,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-0.1543,1.543,-0.100,0.921,-3.219,2.910
Obs,0.9953,0.007,145.954,0.000,0.982,1.009

Omnibus:,50.003,Durbin-Watson:,1.575
Prob(Omnibus):,0.0,Jarque-Bera (JB):,400.978
Skew:,-1.356,Prob(JB):,8.49e-88
Kurtosis:,12.638,Cond. No.,418.0
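The summary reports 96 observations, although August 1981 to December 1989 spans 101 months. This is consistent with the five months at the start of the monthly table where Nat is missing: by default the formula interface drops rows with missing values before fitting. The sketch below mimics that behavior with `dropna` and `np.polyfit` on a synthetic stand-in frame; the values, the seed, and the slope of 0.99 are assumptions for illustration, not the Manso data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for dados_month: 101 rows, first 5 'Nat' values missing.
rng = np.random.default_rng(1)
obs = rng.uniform(50, 400, 101)
nat = 0.99 * obs + rng.normal(0, 5, 101)
nat[:5] = np.nan
frame = pd.DataFrame({"Obs": obs, "Nat": nat})

# Keep only rows where both variables are present (listwise deletion).
complete = frame.dropna(subset=["Obs", "Nat"])
n_used = len(complete)

# Degree-1 least-squares fit; polyfit returns the slope first, then the intercept.
slope, inter = np.polyfit(complete["Obs"].to_numpy(), complete["Nat"].to_numpy(), 1)
print("rows used:", n_used)
print("slope:", slope, "intercept:", inter)
```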


In [5]:
inter = results.params['Intercept']
slope = results.params['Obs']
print('Inter: ', inter, '\nSlope: ', slope)

Inter:  -0.1543438145463405 
Slope:  0.9952755902710447


\begin{equation}
    y = -0.1543 + 0.9953 \cdot x_{1} + \epsilon
\end{equation}
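As a small worked example, the fitted line can be used to estimate Nat from a given Obs value. The coefficients are the rounded values printed by the notebook; the function name `predict_nat` and the input of 100 are hypothetical, chosen only for illustration.

```python
# Rounded coefficients from the fitted model Nat ~ Obs.
INTERCEPT = -0.1543
SLOPE = 0.9953

def predict_nat(obs_flow):
    """Point estimate of Nat for a given Obs value (residual term ignored)."""
    return INTERCEPT + SLOPE * obs_flow

example = predict_nat(100.0)  # hypothetical observed monthly mean flow
print(example)
```

For an input of 100 the estimate is about 99.4, in the same units as the flow series, which reflects a slope close to 1 and an intercept close to 0.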

In [6]:
slope_pvalue = results.pvalues['Obs']
slope_pvalue

1.3366723172701494e-112

In [7]:
r2 = results.rsquared
r2

0.9956067651557987
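The value of `rsquared` is the share of the variance in the dependent variable explained by the model, $R^{2} = 1 - SS_{res}/SS_{tot}$. The sketch below computes it by hand on synthetic data (an assumption, not the Manso series) and checks that, for a regression with a single explanatory variable, it matches the squared Pearson correlation.

```python
import numpy as np

# Synthetic data with a strong linear relationship (illustrative values).
rng = np.random.default_rng(42)
x = rng.uniform(50, 400, 96)
y = 1.0 * x + rng.normal(0, 8, 96)

# Least-squares line; polyfit returns the slope first, then the intercept.
slope, inter = np.polyfit(x, y, 1)
fitted = inter + slope * x

ss_res = np.sum((y - fitted) ** 2)    # variation left in the residuals
ss_tot = np.sum((y - y.mean()) ** 2)  # total variation around the mean
r2 = 1.0 - ss_res / ss_tot

# With one explanatory variable, R^2 equals the squared correlation.
corr = np.corrcoef(x, y)[0, 1]
print("R^2:", r2)
print("corr^2:", corr ** 2)
```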

## Multiple Regression

We can fit a single model that includes both explanatory variables. With the formula **Nat ~ Obs + Prec** (that is, MANSO ~ COIMBRA_F + COIMBRA_P in terms of the original columns), we get:

In [8]:
formula = "Nat ~ Obs + Prec"
model = smf.ols(formula, data=dados_month)
results = model.fit()
results.summary()

Dep. Variable:,Nat,R-squared:,0.996
Model:,OLS,Adj. R-squared:,0.996
Method:,Least Squares,F-statistic:,10700.0
Date:,"Sat, 01 Jun 2019",Prob (F-statistic):,1.23e-110
Time:,23:54:24,Log-Likelihood:,-336.36
No. Observations:,96,AIC:,678.7
Df Residuals:,93,BIC:,686.4
Df Model:,2,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-0.1467,1.540,-0.095,0.924,-3.205,2.912
Obs,1.0028,0.009,107.247,0.000,0.984,1.021
Prec,-0.0120,0.010,-1.179,0.241,-0.032,0.008

Omnibus:,53.663,Durbin-Watson:,1.595
Prob(Omnibus):,0.0,Jarque-Bera (JB):,492.955
Skew:,-1.439,Prob(JB):,9.04e-108
Kurtosis:,13.722,Cond. No.,502.0


In [9]:
inter = results.params['Intercept']
slope1 = results.params['Obs']
slope2 = results.params['Prec']
print('Inter: ', inter, '\nSlope1: ', slope1, '\nSlope2: ', slope2)

Inter:  -0.1467144371975948 
Slope1:  1.0028386674964787 
Slope2:  -0.012007305986208361


\begin{equation}
    y(x_{1}, x_{2}) = -0.14671 + 1.00284 \cdot x_{1} - 0.012 \cdot x_{2} + \epsilon
\end{equation}

In [10]:
slope1_pvalue = results.pvalues['Obs']
slope2_pvalue = results.pvalues['Prec']
print('p-Value (Slope1): ', slope1_pvalue, '\np-Value (Slope2): ', slope2_pvalue) 

p-Value (Slope1):  2.91197008807201e-99 
p-Value (Slope2):  0.24128261095545656
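The p-values can be read together with the 95% confidence intervals in the summary table, which are approximately coefficient ± t × standard error. The sketch below rebuilds the intervals from the rounded coefficients and standard errors printed above. The critical value 1.986 is an assumed approximation of the 97.5th percentile of Student's t with 93 degrees of freedom, and because the inputs are rounded, the third decimal may differ slightly from the table.

```python
# Approximate 97.5th percentile of Student's t with 93 degrees of freedom (assumed).
T_CRIT = 1.986

# (coefficient, standard error) as rounded in the summary table.
coef = {"Obs": (1.0028, 0.009), "Prec": (-0.0120, 0.010)}

intervals = {}
for name, (estimate, std_err) in coef.items():
    half_width = T_CRIT * std_err
    low, high = estimate - half_width, estimate + half_width
    intervals[name] = (low, high)
    contains_zero = low <= 0.0 <= high
    print(name, round(low, 3), round(high, 3), "contains zero:", contains_zero)
```

The interval for Prec contains zero, which agrees with its p-value of about 0.24: at the 5% level the data do not show a clear effect of precipitation once Obs is in the model. The interval for Obs is far from zero, matching its very small p-value.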


In [11]:
r2 = results.rsquared
r2

0.9956714960232174
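A compact way to compare the two fits is to recompute the information criteria and the adjusted $R^{2}$ from the numbers already reported in the summaries. The sketch below uses the standard definitions AIC = 2k - 2·logL and BIC = k·ln(n) - 2·logL, with k counting the intercept; the dictionary layout and variable names are illustrative assumptions.

```python
import math

N_OBS = 96  # number of observations reported in both summaries

# k = number of estimated parameters (including the intercept),
# llf = log-likelihood, r2 = R-squared, all taken from the outputs above.
models = {
    "Nat ~ Obs": {"k": 2, "llf": -337.07, "r2": 0.9956067651557987},
    "Nat ~ Obs + Prec": {"k": 3, "llf": -336.36, "r2": 0.9956714960232174},
}

summary = {}
for formula, m in models.items():
    k, llf, r2 = m["k"], m["llf"], m["r2"]
    aic = 2 * k - 2 * llf
    bic = k * math.log(N_OBS) - 2 * llf
    # Adjusted R^2 penalizes extra parameters: n - k is the residual df.
    adj_r2 = 1 - (1 - r2) * (N_OBS - 1) / (N_OBS - k)
    summary[formula] = (aic, bic, adj_r2)
    print(formula, "| AIC:", round(aic, 1), "| BIC:", round(bic, 1),
          "| adj. R^2:", round(adj_r2, 3))
```

The recomputed values agree with the summaries (AIC 678.1 and 678.7, BIC 683.3 and 686.4). Both criteria are slightly lower for the one-variable model, which suggests that adding Prec does not improve the fit enough to justify the extra parameter.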

Add one more variable and build a table summarizing the results.

## Nonlinear Relationships