# Practical examples of Systools - Linear regression module

In [1]:
# importing
import sys
import os

sys.path.append(os.path.abspath(os.path.join('..')))

from systool import data, linreg
import pandas as pd

*linreg* is a module for running linear regressions over a dataset. In transport planning we often want to regress on several variables and choose the best combination. This module lets you test all combinations (using **loop_models**) and then analyse the detailed results of the one you choose (using **fit_model**).
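To make the idea of "testing all combinations" concrete, here is a minimal sketch that does not use systool at all. It enumerates every non-empty subset of candidate columns with `itertools.combinations` and scores each one by R² using plain NumPy least squares. The helper names (`r_squared`, `all_subsets`) and the toy data are hypothetical, and nothing here is claimed to match how **loop_models** works internally.

```python
from itertools import combinations

import numpy as np


def r_squared(X, y, intercept=True):
    """Fit ordinary least squares with NumPy and return R^2."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    if intercept:
        # Prepend a column of ones so the model has a constant term.
        X = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ coef
    ss_res = float(residuals @ residuals)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot


def all_subsets(columns):
    """Yield every non-empty combination of the candidate columns."""
    for size in range(1, len(columns) + 1):
        yield from combinations(columns, size)


# Hypothetical toy data: y depends on x1 only; x2 is pure noise.
rng = np.random.default_rng(0)
toy = {"x1": rng.normal(size=50), "x2": rng.normal(size=50)}
y = 3.0 * toy["x1"] + rng.normal(scale=0.1, size=50)

# Score each combination of explanatory variables.
scores = {
    combo: r_squared(np.column_stack([toy[c] for c in combo]), y)
    for combo in all_subsets(list(toy))
}
for combo, score in scores.items():
    print(combo, round(score, 3))
```

With an intercept, adding a variable can never lower R², so R² alone tends to favour the largest combination. That is one reason to look at significance tests as well before choosing a model.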


### First, open the data
It is good practice to check what you are reading. You can use _.head()_ to look at the first rows of the DataFrame.

In [2]:
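# build the path with os.path.join: the backslash in 'examples_databases\input_geracao.xlsx' is an invalid escape and only works on Windows
df_trip = data.open_file(os.path.join('examples_databases', 'input_geracao.xlsx'), kwargs={'sheet_name':'VIAGENS'})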
df_trip.head()

Unnamed: 0,ZONA,ATRA,PROD
0,1,34768,45430
1,2,61501,58691
2,3,25362,35433
3,4,97055,96527
4,5,35215,66098


Or use _.info()_ to also check the dtypes.

In [3]:
df_data = data.open_file(os.path.join('examples_databases', 'input_geracao.xlsx'), kwargs={'sheet_name':'DADOS'})
df_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119 entries, 0 to 118
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   ZONAS       119 non-null    int64
 1   EMPREGOS    119 non-null    int64
 2   ENSINO      119 non-null    int64
 3   POP         119 non-null    int64
 4   DOMICILIOS  119 non-null    int64
 5   PEA         119 non-null    int64
dtypes: int64(6)
memory usage: 5.7 KB


In [4]:
df = df_trip.merge(df_data.rename(columns={'ZONAS':'ZONA'}), how='outer')
df.head()

Unnamed: 0,ZONA,ATRA,PROD,EMPREGOS,ENSINO,POP,DOMICILIOS,PEA
0,1,34768,45430,42585,327,483330,132420,473457
1,2,61501,58691,230621,697,855048,705679,141116
2,3,25362,35433,41957,264,325801,122293,91585
3,4,97055,96527,156704,673,1000086,829397,399884
4,5,35215,66098,60290,273,412396,348140,96418
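An outer merge keeps zones that appear in only one of the two tables and fills the missing side with NaN, so it can be worth checking for unmatched zones before fitting anything. The sketch below uses pandas' `indicator=True` option on two small hypothetical frames (`trips`, `zone_data`); the column values are made up for illustration.

```python
import pandas as pd

# Hypothetical miniature versions of the two sheets.
trips = pd.DataFrame({"ZONA": [1, 2, 3], "ATRA": [10, 20, 30]})
zone_data = pd.DataFrame({"ZONA": [1, 2, 4], "POP": [100, 200, 400]})

# indicator=True adds a "_merge" column saying where each row came from:
# "both", "left_only" or "right_only".
checked = trips.merge(zone_data, how="outer", on="ZONA", indicator=True)

# Rows found in only one table; these carry NaN in the other table's columns.
unmatched = checked[checked["_merge"] != "both"]
print(unmatched)
```

If `unmatched` is empty, every zone was found in both tables and the merged frame has no NaN introduced by the join.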


## ATTRACTION
### Loop for all possibilities

Run a loop over all candidate combinations and find the ones that are significant:
* Use the parameter *mask* to remove zones that you want to treat as outliers by default (airport zones, for example)
* *keepAll = True* also returns regressions that are not significant (with R² < CUT_R)

In [5]:
df_regs = linreg.loop_models(df, Xcols=['EMPREGOS','ENSINO','PEA'], Ycol='ATRA', mask=None, keepAll=True, CUT_R=0.5, force_intercept=False)
df_regs

Default regressions performed with 100.0% of data


Getting combinations for all variables: 100%|████████████████████████████████████████████████████| 3/3 [00:00<?, ?it/s]


6 possibilites of regression to do!
Get correlated pairs...


Droping combinations with correlated values...: 100%|██████████████████████████████████| 3/3 [00:00<00:00, 3003.08it/s]


3 non-correlated possibilities to do!


Making regressions...: 100%|█████████████████████████████████████████████████████████████| 3/3 [00:01<00:00,  1.65it/s]


    6 regressões decentes (R² > 0.5) foram geradas
    1 regressões passaram nos testes


TypeError: loop of ufunc does not support argument 0 of type float which has no callable rint method
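In general, this kind of `TypeError` is raised when NumPy is asked to round an array whose dtype is `object` rather than a numeric dtype. Whether that is what happens inside **loop_models** here is not confirmed, so treat the following as a generic sketch of the symptom and a usual remedy (converting to a numeric dtype before rounding). The variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Python floats stored in an object-dtype Series (hypothetical values).
mixed = pd.Series([0.51234, 0.87654], dtype=object)

try:
    # Rounding an object array typically fails because plain Python floats
    # do not provide the method the ufunc looks for.
    np.round(mixed.to_numpy(), 2)
    print("rounding worked on this NumPy version")
except TypeError as err:
    print("TypeError:", err)

# Converting to a numeric dtype first makes rounding well defined.
fixed = pd.to_numeric(mixed).round(2)
print(fixed.dtype, fixed.tolist())
```

If the result table of a regression loop ends up with object-dtype columns (for example because numbers and `None` are mixed), checking `DataFrame.dtypes` before rounding is a reasonable first diagnostic step.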

### Choose the best regression and get a report
Use the result to choose a suitable regression and inspect all of its statistical results (*plot=True*) in an HTML report saved to *path*.

In [None]:
model, model_out = linreg.fit_model(x=df[['EMPREGOS']], y=df['ATRA'], intercept=False, plot=True, path=os.getcwd())
# model and model_out are statsmodels (0.14.0) statsmodels.regression.linear_model.OLS objects
# model is the complete model, fitted with all the zones sent to fit_model
# model_out is the same model fitted without outliers: up to 10% of the records are removed and the best resulting fit is kept
# check the HTML file for a better understanding
model.params
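The `intercept` argument decides whether the fitted line is forced through the origin. As a small worked illustration, independent of systool and statsmodels, the sketch below fits the same hypothetical data both ways with NumPy. Without an intercept the single-variable slope reduces to `sum(x*y) / sum(x*x)`.

```python
import numpy as np

# Hypothetical data, roughly y = 2x.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])

# Regression through the origin (no intercept): y ~ b * x.
slope_origin = float((x @ y) / (x @ x))

# Regression with an intercept: y ~ a + b * x,
# solved by least squares on a design matrix with a column of ones.
design = np.column_stack([np.ones_like(x), x])
(intercept, slope), *_ = np.linalg.lstsq(design, y, rcond=None)

print("through origin:", round(slope_origin, 4))
print("with intercept:", round(float(intercept), 4), round(float(slope), 4))
```

For trip generation, a model without an intercept predicts zero trips for a zone with zero jobs or population, which is often the behaviour wanted. Note that R² values from models with and without an intercept are computed differently in many packages and are not directly comparable.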

## PRODUCTION
Repeat the process

In [None]:
df_regs = linreg.loop_models(df, Xcols=['POP', 'DOMICILIOS','PEA'], Ycol='PROD', mask=None, keepAll=False, CUT_R=0.5, force_intercept=False)
df_regs

In [None]:
# you can even check a regression that is not in the result above
model, model_out = linreg.fit_model(x=df[['POP', 'DOMICILIOS', 'PEA']], y=df['PROD'], intercept=True, plot=True, path=os.getcwd())
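A combination can be missing from the loop result because its explanatory variables are correlated with each other; the loop output reports dropping such combinations. Before fitting one by hand, a quick pairwise correlation check can show whether that is the case. The sketch below is an assumption-laden illustration: the `correlated_pairs` helper, the 0.8 threshold and the toy data are all hypothetical and are not taken from systool.

```python
from itertools import combinations

import numpy as np
import pandas as pd


def correlated_pairs(frame, threshold=0.8):
    """Return column pairs whose absolute Pearson correlation exceeds threshold."""
    corr = frame.corr().abs()
    return [
        (a, b)
        for a, b in combinations(frame.columns, 2)
        if corr.loc[a, b] > threshold
    ]


# Hypothetical data: households track population closely, OTHER is unrelated.
rng = np.random.default_rng(1)
pop = rng.uniform(1_000, 50_000, size=100)
toy = pd.DataFrame(
    {
        "POP": pop,
        "DOMICILIOS": pop / 3 + rng.normal(scale=100, size=100),
        "OTHER": rng.uniform(0, 1, size=100),
    }
)

pairs = correlated_pairs(toy)
print(pairs)
```

When two predictors are this strongly correlated, their individual coefficients become unstable even if the overall fit looks good, so the coefficients of such a model should be read with care.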
