# **Regression Models Challenge.**

## ***Challenge: Estimating the Property Price.***

Suppose you work at a real estate company. Last week you explored the main
variables that affect property prices. Now the company has asked you, as a member
of the data team, to build a model that estimates the price of a property.
To do so, you will use the same dataset as last week
(desafio_01_preco_imoveis.csv). How could you help them?

The dataset contains the following information:

● id: property identifier code;

● date: date the property was listed;

● price: property price;

● bedrooms: number of bedrooms;

● bathrooms: number of bathrooms;

● sqft_living: living area of the property (inside the house);

● sqft_lot: size of the lot / land;

● floors: number of floors;

● waterfront: flag indicating whether the property has a sea view: 1 if yes, 0 otherwise;

● view: indicates the number of rooms with a view;

● condition: condition of the property, on a scale from 1 to 5;

● grade: property grade;

● sqft_above: size of the house above ground (excluding the basement);

● sqft_basement: size of the basement;

● yr_built: year the house was built;

● yr_renovated: year the house was renovated;

● zipcode: ZIP code ("CEP") of the property;

● lat: latitude of the property;

● long: longitude of the property.

1 - Select the main variables you would like to include in the property pricing
model.

2 - Build a multiple linear regression model to estimate the property price.
Remember to analyze the regression table and the model residuals, and to
interpret the results.

3 - Now suppose you have found an ideal model. Explain how you would put this
model into production. By "model in production" we mean a model that prices each
new property inserted into the database.

Supporting materials:

https://medium.com/creditas-tech/terminei-a-modelagem-e-agora-parte-i-604232bb5114 (content in Portuguese)

https://medium.com/analytics-vidhya/deploying-linear-regression-ml-model-as-web-application-on-docker-3409f9464a27 (content in English)

https://docs.microsoft.com/en-us/sql/machine-learning/tutorials/python-ski-rental-linear-regression-deploy-model?view=sql-server-ver15 (content in English)

**Importing Libraries**

In [1]:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
import statsmodels.api as sm
import os

**Reading the Data**

In [2]:
df = pd.read_csv('desafio_01_preco_imoveis-230209-165044.csv')

**Viewing the Data**

In [3]:
df.head()

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long
0,7129300520,20141013T000000,221900.0,3,1.0,1180,5650,1.0,0,0,3,7,1180,0,1955,0,98178,47.5112,-122.257
1,6414100192,20141209T000000,538000.0,3,2.25,2570,7242,2.0,0,0,3,7,2170,400,1951,1991,98125,47.721,-122.319
2,5631500400,20150225T000000,180000.0,2,1.0,770,10000,1.0,0,0,3,6,770,0,1933,0,98028,47.7379,-122.233
3,2487200875,20141209T000000,604000.0,4,3.0,1960,5000,1.0,0,0,5,7,1050,910,1965,0,98136,47.5208,-122.393
4,1954400510,20150218T000000,510000.0,3,2.0,1680,8080,1.0,0,0,3,8,1680,0,1987,0,98074,47.6168,-122.045


**Checking the Size of the Dataset**

In [4]:
df.shape

(21613, 19)

**Checking the Data Types**

In [5]:
type(df)

pandas.core.frame.DataFrame

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21613 entries, 0 to 21612
Data columns (total 19 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             21613 non-null  int64  
 1   date           21613 non-null  object 
 2   price          21613 non-null  float64
 3   bedrooms       21613 non-null  int64  
 4   bathrooms      21613 non-null  float64
 5   sqft_living    21613 non-null  int64  
 6   sqft_lot       21613 non-null  int64  
 7   floors         21613 non-null  float64
 8   waterfront     21613 non-null  int64  
 9   view           21613 non-null  int64  
 10  condition      21613 non-null  int64  
 11  grade          21613 non-null  int64  
 12  sqft_above     21613 non-null  int64  
 13  sqft_basement  21613 non-null  int64  
 14  yr_built       21613 non-null  int64  
 15  yr_renovated   21613 non-null  int64  
 16  zipcode        21613 non-null  int64  
 17  lat            21613 non-null  float64
 18  long           21613 non-null  float64

**Descriptive Statistics**

In [7]:
df.describe().round(2)

Unnamed: 0,id,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long
count,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0
mean,4580302000.0,540088.14,3.37,2.11,2079.9,15106.97,1.49,0.01,0.23,3.41,7.66,1788.39,291.51,1971.01,84.4,98077.94,47.56,-122.21
std,2876566000.0,367127.2,0.93,0.77,918.44,41420.51,0.54,0.09,0.77,0.65,1.18,828.09,442.58,29.37,401.68,53.51,0.14,0.14
min,1000102.0,75000.0,0.0,0.0,290.0,520.0,1.0,0.0,0.0,1.0,1.0,290.0,0.0,1900.0,0.0,98001.0,47.16,-122.52
25%,2123049000.0,321950.0,3.0,1.75,1427.0,5040.0,1.0,0.0,0.0,3.0,7.0,1190.0,0.0,1951.0,0.0,98033.0,47.47,-122.33
50%,3904930000.0,450000.0,3.0,2.25,1910.0,7618.0,1.5,0.0,0.0,3.0,7.0,1560.0,0.0,1975.0,0.0,98065.0,47.57,-122.23
75%,7308900000.0,645000.0,4.0,2.5,2550.0,10688.0,2.0,0.0,0.0,4.0,8.0,2210.0,560.0,1997.0,0.0,98118.0,47.68,-122.12
max,9900000000.0,7700000.0,33.0,8.0,13540.0,1651359.0,3.5,1.0,4.0,5.0,13.0,9410.0,4820.0,2015.0,2015.0,98199.0,47.78,-121.32


**1 - Select the main variables you would like to include in the property pricing model.**

In [8]:
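# Correlation matrix (numeric_only=True skips the text column "date", which recent pandas versions refuse to correlate)
matriz_correlacao = df.corr(numeric_only=True).round(4)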
matriz_correlacao

Unnamed: 0,id,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long
id,1.0,-0.0168,0.0013,0.0052,-0.0123,-0.1321,0.0185,-0.0027,0.0116,-0.0238,0.0081,-0.0108,-0.0052,0.0214,-0.0169,-0.0082,-0.0019,0.0208
price,-0.0168,1.0,0.3083,0.5251,0.702,0.0897,0.2568,0.2664,0.3973,0.0364,0.6674,0.6056,0.3238,0.054,0.1264,-0.0532,0.307,0.0216
bedrooms,0.0013,0.3083,1.0,0.5159,0.5767,0.0317,0.1754,-0.0066,0.0795,0.0285,0.357,0.4776,0.3031,0.1542,0.0188,-0.1527,-0.0089,0.1295
bathrooms,0.0052,0.5251,0.5159,1.0,0.7547,0.0877,0.5007,0.0637,0.1877,-0.125,0.665,0.6853,0.2838,0.506,0.0507,-0.2039,0.0246,0.223
sqft_living,-0.0123,0.702,0.5767,0.7547,1.0,0.1728,0.3539,0.1038,0.2846,-0.0588,0.7627,0.8766,0.435,0.318,0.0554,-0.1994,0.0525,0.2402
sqft_lot,-0.1321,0.0897,0.0317,0.0877,0.1728,1.0,-0.0052,0.0216,0.0747,-0.009,0.1136,0.1835,0.0153,0.0531,0.0076,-0.1296,-0.0857,0.2295
floors,0.0185,0.2568,0.1754,0.5007,0.3539,-0.0052,1.0,0.0237,0.0294,-0.2638,0.4582,0.5239,-0.2457,0.4893,0.0063,-0.0591,0.0496,0.1254
waterfront,-0.0027,0.2664,-0.0066,0.0637,0.1038,0.0216,0.0237,1.0,0.4019,0.0167,0.0828,0.0721,0.0806,-0.0262,0.0929,0.0303,-0.0143,-0.0419
view,0.0116,0.3973,0.0795,0.1877,0.2846,0.0747,0.0294,0.4019,1.0,0.046,0.2513,0.1676,0.2769,-0.0534,0.1039,0.0848,0.0062,-0.0784
condition,-0.0238,0.0364,0.0285,-0.125,-0.0588,-0.009,-0.2638,0.0167,0.046,1.0,-0.1447,-0.1582,0.1741,-0.3614,-0.0606,0.003,-0.0149,-0.1065


In [9]:
# Variables most correlated with "price", sorted in descending order of correlation
corr_with_price = matriz_correlacao['price'].sort_values(ascending=False).round(4)
corr_with_price

price            1.0000
sqft_living      0.7020
grade            0.6674
sqft_above       0.6056
bathrooms        0.5251
view             0.3973
sqft_basement    0.3238
bedrooms         0.3083
lat              0.3070
waterfront       0.2664
floors           0.2568
yr_renovated     0.1264
sqft_lot         0.0897
yr_built         0.0540
condition        0.0364
long             0.0216
id              -0.0168
zipcode         -0.0532
Name: price, dtype: float64
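
One possible way to turn a ranking like this into a feature list is to keep the columns whose absolute correlation with `price` exceeds a cutoff, and then flag pairs of selected features that are strongly correlated with each other (in this dataset, `sqft_living` and `sqft_above` have a correlation of 0.8766). The sketch below runs on a small synthetic frame; the helper name `select_features` and the 0.3 and 0.8 cutoffs are illustrative assumptions, not values used in the notebook.

```python
import numpy as np
import pandas as pd

def select_features(data, target, min_corr=0.3, max_pair_corr=0.8):
    """Return (selected columns, strongly correlated pairs among them)."""
    corr = data.select_dtypes("number").corr()
    with_target = corr[target].drop(target).abs().sort_values(ascending=False)
    selected = with_target[with_target >= min_corr].index.tolist()
    # Pairs of selected features that carry nearly the same information
    redundant = [
        (a, b)
        for i, a in enumerate(selected)
        for b in selected[i + 1:]
        if abs(corr.loc[a, b]) >= max_pair_corr
    ]
    return selected, redundant

# Synthetic data (made up for the example)
rng = np.random.default_rng(0)
n = 500
living = rng.normal(2000, 500, n)
above = 0.8 * living + rng.normal(0, 50, n)      # almost a copy of `living`
noise = rng.normal(0, 1, n)                      # unrelated to price
price = 200 * living + rng.normal(0, 50_000, n)
demo = pd.DataFrame({"price": price, "sqft_living": living,
                     "sqft_above": above, "noise": noise})

selected, redundant = select_features(demo, "price")
print("selected:", selected)
print("redundant pairs:", redundant)
```

A flagged pair does not have to be dropped, but keeping both members makes the individual coefficients harder to interpret.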

**2 - Build a multiple linear regression model to estimate the property price. Remember to analyze the regression table and the model residuals, and to interpret the results.**

In [10]:
# Importing train_test_split from the scikit-learn library
from sklearn.model_selection import train_test_split

In [11]:
# Creating a pandas Series to store the property prices (y)
y = df['price']

In [12]:
# Creating a pandas DataFrame to store the explanatory variables (X)
X = df[['sqft_living', 'grade', 'sqft_above', 'bathrooms', 'view', 'sqft_basement',
        'bedrooms', 'lat', 'waterfront', 'floors', 'yr_renovated', 'sqft_lot', 'yr_built',
        'condition', 'long', 'id', 'zipcode']]

In [13]:
# Creating the training and test datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=200)

In [14]:
# Importing LinearRegression and metrics from the scikit-learn library
from sklearn.linear_model import LinearRegression
from sklearn import metrics

In [15]:
# Instantiating the LinearRegression() class
modelo = LinearRegression()

In [16]:
# Using the fit() method to estimate the linear model with the TRAINING data (y_train and X_train)
modelo.fit(X_train, y_train)

In [17]:
# Computing the coefficient of determination (R²) of the model on the TRAINING data
# (built-in round() works whether score() returns a NumPy or a plain Python float)
print('R² = {}'.format(round(modelo.score(X_train, y_train), 2)))

R² = 0.7


In [18]:
# Generating the model's predictions for the TEST data
y_previsto = modelo.predict(X_test)

In [19]:
# Computing the coefficient of determination (R²) for the model's predictions on the TEST data
print('R² = %s' % round(metrics.r2_score(y_test, y_previsto), 2))

R² = 0.69
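
R² alone does not say how large the errors are in price units. As a minimal sketch, the snippet below computes R², RMSE and MAE directly from their definitions with NumPy. The function name `regression_metrics` and the four price pairs are made up for the example; in the notebook the same function could be called with `y_test` and `y_previsto`.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """R², root mean squared error and mean absolute error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    ss_res = np.sum(errors ** 2)                      # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total variation
    return {
        "r2": float(1.0 - ss_res / ss_tot),
        "rmse": float(np.sqrt(np.mean(errors ** 2))),
        "mae": float(np.mean(np.abs(errors))),
    }

# Made-up prices, only to exercise the function
y_true = [200_000, 300_000, 400_000, 500_000]
y_pred = [210_000, 290_000, 420_000, 480_000]
result = regression_metrics(y_true, y_pred)
print(result)
```

RMSE and MAE are expressed in the same unit as `price`, which makes them easier to discuss with the business team than R².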


In [24]:
# Generating the model's predictions for the TRAINING data
y_previsto_train = modelo.predict(X_train)

In [20]:
# Fitting the regression with statsmodels to obtain the regression table
resultado_regressao = sm.OLS(y, X).fit()

In [21]:
print(resultado_regressao.summary())

                            OLS Regression Results                            
Dep. Variable:                  price   R-squared:                       0.699
Model:                            OLS   Adj. R-squared:                  0.699
Method:                 Least Squares   F-statistic:                     3340.
Date:                Fri, 17 Mar 2023   Prob (F-statistic):               0.00
Time:                        18:19:15   Log-Likelihood:            -2.9464e+05
No. Observations:               21613   AIC:                         5.893e+05
Df Residuals:                   21597   BIC:                         5.894e+05
Df Model:                          15                                         
Covariance Type:            nonrobust                                         
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
sqft_living     114.7283      2.132     53.819
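
When reading a table like this, keep in mind that `sm.OLS` fits exactly the columns it receives and does not add an intercept on its own; statsmodels offers `sm.add_constant` for that purpose. To illustrate what the extra column of ones changes, the sketch below fits the same synthetic data with and without it using `numpy.linalg.lstsq`. The data and variable names are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(500, 4000, size=200)                       # e.g. living area
y = 100_000 + 150 * x + rng.normal(0, 10_000, size=200)    # true intercept and slope

# Without an intercept the fitted line is forced through the origin
X_no_const = x.reshape(-1, 1)
slope_only, *_ = np.linalg.lstsq(X_no_const, y, rcond=None)

# With an intercept: prepend a column of ones to the design matrix
X_const = np.column_stack([np.ones_like(x), x])
(intercept, slope), *_ = np.linalg.lstsq(X_const, y, rcond=None)

print(f"no intercept  : slope = {slope_only[0]:.1f}")
print(f"with intercept: intercept = {intercept:.0f}, slope = {slope:.1f}")
```

Without the constant, the slope absorbs part of the baseline price and is biased upward in this example.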

In [27]:
residuo = y_train - y_previsto_train

In [28]:
# Storing the regression residuals
df['Residuos'] = resultado_regressao.resid
df.head()

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,Residuos
0,7129300520,20141013T000000,221900.0,3,1.0,1180,5650,1.0,0,0,3,7,1180,0,1955,0,98178,47.5112,-122.257,4565.555078
1,6414100192,20141209T000000,538000.0,3,2.25,2570,7242,2.0,0,0,3,7,2170,400,1951,1991,98125,47.721,-122.319,-199137.972233
2,5631500400,20150225T000000,180000.0,2,1.0,770,10000,1.0,0,0,3,6,770,0,1933,0,98028,47.7379,-122.233,-161297.584234
3,2487200875,20141209T000000,604000.0,4,3.0,1960,5000,1.0,0,0,5,7,1050,910,1965,0,98136,47.5208,-122.393,136362.429753
4,1954400510,20150218T000000,510000.0,3,2.0,1680,8080,1.0,0,0,3,8,1680,0,1987,0,98074,47.6168,-122.045,60376.973445
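
Beyond listing the residuals, a few summary checks help judge whether the assumptions of the linear model are plausible: the mean should be close to zero, the distribution roughly symmetric, and the spread should not grow with the fitted values. The sketch below is one way to compute such checks with NumPy on synthetic data. The function name `residual_summary` is hypothetical; in the notebook it could be fed `resultado_regressao.resid` and `resultado_regressao.fittedvalues`.

```python
import numpy as np

def residual_summary(residuals, fitted):
    """Basic numeric diagnostics for regression residuals."""
    residuals = np.asarray(residuals, dtype=float)
    fitted = np.asarray(fitted, dtype=float)
    centered = residuals - residuals.mean()
    std = residuals.std()
    return {
        "mean": float(residuals.mean()),
        "std": float(std),
        # Skewness is 0 for a symmetric distribution
        "skewness": float(np.mean(centered ** 3) / std ** 3),
        # Clearly positive values suggest the error spread grows with the prediction
        "corr_abs_resid_fitted": float(np.corrcoef(np.abs(residuals), fitted)[0, 1]),
    }

# Synthetic example with heteroscedastic noise (spread proportional to the fitted value)
rng = np.random.default_rng(1)
fitted = rng.uniform(100_000, 1_000_000, size=1_000)
residuals = rng.normal(0, 0.2 * fitted)

summary = residual_summary(residuals, fitted)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

A scatter plot of residuals against fitted values conveys the same information visually and is a common complement to these numbers.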


**3 - Now suppose you have found an ideal model. Explain how you would put this model into production. By "model in production" we mean a model that prices each new property inserted into the database.**

A simple simulator that produces a price estimate from a set of attributes describing a property.

In [32]:
# Attributes of the property to be priced
id = 7129300520
date = 20141013  # not a model feature
bedrooms = 3
bathrooms = 1.00
sqft_living = 1180
sqft_lot = 5650
floors = 1.0
waterfront = 0
view = 0
condition = 3
grade = 7
sqft_above = 1180
sqft_basement = 0
yr_built = 1955
yr_renovated = 0
zipcode = 98178
lat = 47.5112
long = -122.257

# Building the input from the values above, in the same column order used to train the model (X)
entrada = pd.DataFrame([[sqft_living, grade, sqft_above, bathrooms, view, sqft_basement,
                         bedrooms, lat, waterfront, floors, yr_renovated, sqft_lot, yr_built,
                         condition, long, id, zipcode]], columns=X.columns)

print('$ {0:.2f}'.format(modelo.predict(entrada)[0]))

$ 214870.48
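
One common route to production is to persist the fitted estimator once and reload it wherever new properties need to be priced. The sketch below illustrates the idea with Python's `pickle` module and a tiny model fitted on synthetic data. The file name `modelo_preco.pkl`, the two-feature layout and the helper `price_new_properties` are assumptions made for the example, not part of the notebook.

```python
import os
import pickle
import tempfile

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

FEATURES = ["sqft_living", "grade"]  # illustrative subset of the notebook's features

# 1) Train and persist the model (done once, offline)
rng = np.random.default_rng(0)
train = pd.DataFrame({
    "sqft_living": rng.uniform(500, 4000, 300),
    "grade": rng.integers(3, 12, 300),
})
train["price"] = (150 * train["sqft_living"] + 40_000 * train["grade"]
                  + rng.normal(0, 20_000, 300))
model = LinearRegression().fit(train[FEATURES], train["price"])

model_path = os.path.join(tempfile.mkdtemp(), "modelo_preco.pkl")
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# 2) Load the model and score new rows (done by the service or batch job)
def price_new_properties(new_rows, path=model_path):
    """Return a copy of `new_rows` with a `predicted_price` column."""
    with open(path, "rb") as f:
        loaded = pickle.load(f)
    scored = new_rows.copy()
    scored["predicted_price"] = loaded.predict(scored[FEATURES])
    return scored

new_rows = pd.DataFrame({"sqft_living": [1180, 2570], "grade": [7, 7]})
scored = price_new_properties(new_rows)
print(scored)
```

Unpickling executes code, so only files produced by your own pipeline should be loaded, ideally with the same scikit-learn version that was used for training.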


**Interactive Simulator for Jupyter**

In [35]:
# Importing libraries
from ipywidgets import widgets, HBox, VBox
from IPython.display import display

# Creating the form controls
id = widgets.Text(description="Code")
date = widgets.Text(description="Date")
bedrooms = widgets.Text(description="Bedrooms")
bathrooms = widgets.Text(description="Bathrooms")
sqft_living = widgets.Text(description="Living area")
sqft_lot = widgets.Text(description="Lot size")
floors = widgets.Text(description="Floors")
waterfront = widgets.Text(description="Sea view")
view = widgets.Text(description="Rooms with view")
condition = widgets.Text(description="Condition")
grade = widgets.Text(description="Grade")
sqft_above = widgets.Text(description="Area above ground")
sqft_basement = widgets.Text(description="Basement area")
yr_built = widgets.Text(description="Year built")
yr_renovated = widgets.Text(description="Year renovated")
zipcode = widgets.Text(description="ZIP code")
lat = widgets.Text(description="Latitude")
long = widgets.Text(description="Longitude")

botao = widgets.Button(description="Simulate")

# Positioning the controls
left = VBox([id, date, bedrooms, bathrooms, sqft_living, sqft_lot, floors, waterfront, view])
right = VBox([condition, grade, sqft_above, sqft_basement, yr_built, yr_renovated, zipcode, lat, long])
inputs = HBox([left, right])

# Simulation function
def simulador(sender):
    # Controls in the same order as the columns used to train the model (X);
    # "date" is shown in the form but is not a model feature
    campos = [sqft_living, grade, sqft_above, bathrooms, view, sqft_basement,
              bedrooms, lat, waterfront, floors, yr_renovated, sqft_lot, yr_built,
              condition, long, id, zipcode]
    # Empty fields are treated as 0
    valores = [float(campo.value if campo.value else 0) for campo in campos]
    entrada = pd.DataFrame([valores], columns=X.columns)
    print('$ {0:.2f}'.format(modelo.predict(entrada)[0]))

# Assigning the "simulador" function to the button's click event
botao.on_click(simulador)

In [36]:
display(inputs, botao)

HBox(children=(VBox(children=(Text(value='', description='Code'), Text(value='', description='Date'), Text(v…

Button(description='Simulate', style=ButtonStyle())
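
Because the form reads every field as text and converts it with `float()`, an input such as `abc` raises a bare `ValueError` inside the click handler. A small parsing helper is one way to make that step more forgiving. The sketch below is a hypothetical helper, `parse_number`, that is not part of the notebook; it assumes a comma in the input is meant as a decimal separator.

```python
def parse_number(text, default=0.0):
    """Convert a form field to float; empty text falls back to `default`."""
    # Assumption: a comma is a decimal separator, so "2,25" means 2.25
    text = text.strip().replace(",", ".")
    if not text:
        return default
    try:
        return float(text)
    except ValueError:
        raise ValueError(f"Not a valid number: {text!r}") from None

print(parse_number("1180"))
print(parse_number(" 2,25 "))
print(parse_number(""))
```

In a production service the same validation would normally also check plausible ranges (for example, a non-negative area) before calling the model.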