In [1]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from sklearn.model_selection import train_test_split

# Dummy Variables

Let's start by creating dummy variables for our dataset.

Reading the data

In [2]:
df_diamonds = pd.read_csv('diamonds.csv')

Variable descriptions

- price: price in US dollars (\$326--\$18,823)

- carat: weight of the diamond (0.2--5.01)

- cut: quality of the cut (Fair, Good, Very Good, Premium, Ideal)

- color: diamond colour, from J (worst) to D (best)

- clarity: a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))

- x: length in mm (0--10.74)

- y: width in mm (0--58.9)

- z: depth in mm (0--31.8)

- depth: total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43--79)

- table: width of top of diamond relative to widest point (43--95)

Creating dummies for the cut variable

We can either pass the column to be encoded, or pass the whole DataFrame, in which case the function identifies the categorical variables and encodes all of them.

In [3]:
df_diamonds_subset = df_diamonds.loc[:,['price', 'carat', 'cut']]

In [4]:
df_diamonds_subset = pd.get_dummies(df_diamonds_subset, drop_first=True)

In [5]:
df_diamonds_subset

Unnamed: 0,price,carat,cut_Good,cut_Ideal,cut_Premium,cut_Very Good
0,326,0.23,0,1,0,0
1,326,0.21,0,0,1,0
2,327,0.23,1,0,0,0
3,334,0.29,0,0,1,0
4,335,0.31,1,0,0,0
...,...,...,...,...,...,...
53935,2757,0.72,0,1,0,0
53936,2757,0.72,1,0,0,0
53937,2757,0.70,0,0,0,1
53938,2757,0.86,0,0,1,0
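
The two ways of calling `pd.get_dummies` described above can be sketched on a small toy DataFrame. The values below are illustrative and only stand in for `diamonds.csv`; the names `toy`, `by_column` and `whole_frame` are hypothetical. In recent pandas versions the dummy columns come back as booleans by default, so `dtype=int` is passed here on the assumption that 0/1 columns like the ones in the output above are wanted.

```python
import pandas as pd

# Toy data standing in for diamonds.csv (illustrative values only)
toy = pd.DataFrame({
    "price": [326, 327, 334, 335],
    "carat": [0.23, 0.23, 0.29, 0.31],
    "cut": ["Ideal", "Good", "Premium", "Fair"],
})

# Option 1: name the column(s) to encode explicitly
by_column = pd.get_dummies(toy, columns=["cut"], drop_first=True, dtype=int)

# Option 2: pass the whole DataFrame and let pandas pick the categorical columns
whole_frame = pd.get_dummies(toy, drop_first=True, dtype=int)

# drop_first=True drops the first category in sorted order ("Fair"),
# which then acts as the reference level
print(by_column.columns.tolist())
print(by_column.equals(whole_frame))
```

Dropping one level avoids perfect collinearity between the dummies and the intercept that is added later.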


Fitting the model

X will be a DataFrame containing the predictor variables

Y will be a Series with the values of price, our response variable

In [8]:
X = df_diamonds_subset.drop(['price'], axis = 1)
Y = df_diamonds_subset['price']

Here we need to make sure the model has an intercept: sm.OLS does not add one on its own, so we add a constant column

In [9]:
X = sm.add_constant(X)

Initializing the model and starting the estimation of the beta parameters

In [10]:
modelo_linear_multiplo = sm.OLS(Y, X)
modelo_linear_multiplo = modelo_linear_multiplo.fit()

Summarizing the results in a table

In [11]:
print(modelo_linear_multiplo.summary())

                            OLS Regression Results                            
Dep. Variable:                  price   R-squared:                       0.856
Model:                            OLS   Adj. R-squared:                  0.856
Method:                 Least Squares   F-statistic:                 6.437e+04
Date:                Tue, 10 Jan 2023   Prob (F-statistic):               0.00
Time:                        19:23:38   Log-Likelihood:            -4.7142e+05
No. Observations:               53940   AIC:                         9.429e+05
Df Residuals:                   53934   BIC:                         9.429e+05
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
const         -3875.4697     40.408    -95.908

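Note that the model captures that the Ideal cut is the one that raises the price the most. Each cut coefficient is measured relative to the reference category Fair, which drop_first=True removed, holding carat fixed.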

# Train/Test Split

To split our data into training and test sets, we will use the train_test_split function from sklearn.

We pass random_state = 123 for reproducibility and train_size to set the fraction of the dataset used for training (0.7, i.e. 70%).

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, train_size=0.7, random_state=123)

In [14]:
X_train.head()

Unnamed: 0,const,carat,cut_Good,cut_Ideal,cut_Premium,cut_Very Good
13934,1.0,1.06,0,1,0,0
38054,1.0,0.4,0,1,0,0
34090,1.0,0.41,0,0,1,0
31919,1.0,0.3,0,1,0,0
17402,1.0,1.09,0,1,0,0


Note that it shuffles the dataset before splitting the data (shuffle=True is the default)!
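
The behavior of `random_state`, `train_size` and shuffling can be checked with a small sketch. The arrays below are toy data, and the names `X_toy`, `y_toy` and the split variables are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.arange(200).reshape(100, 2)  # 100 rows, 2 features
y_toy = np.arange(100)

# Two calls with the same random_state
a_train, a_test, ya_train, ya_test = train_test_split(
    X_toy, y_toy, train_size=0.7, random_state=123)
b_train, b_test, yb_train, yb_test = train_test_split(
    X_toy, y_toy, train_size=0.7, random_state=123)

print(a_train.shape, a_test.shape)         # 70 training rows, 30 test rows
print(np.array_equal(ya_train, yb_train))  # same seed gives the same split

# shuffle=False keeps the original order: first 70% train, last 30% test
_, _, y_ordered, _ = train_test_split(
    X_toy, y_toy, train_size=0.7, shuffle=False)
print(y_ordered[:5])
```

Turning shuffling off is mainly useful when row order matters, for example with time-ordered data.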