## Question 2 - Diamond Price Prediction

Dataset source: https://www.kaggle.com/datasets/shivam2503/diamonds

**Context**

This classic dataset contains the prices and other attributes of almost 54,000 diamonds. It's a great dataset for beginners learning to work with data analysis and visualization.

**Content**

**price** price in US dollars (\$326--\$18,823)

**carat** weight of the diamond (0.2--5.01)

**cut** quality of the cut (Fair, Good, Very Good, Premium, Ideal)

**color** diamond colour, from J (worst) to D (best)

**clarity** a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))

**x** length in mm (0--10.74)

**y** width in mm (0--58.9)

**z** depth in mm (0--31.8)

**depth** total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43--79)

**table** width of top of diamond relative to widest point (43--95)
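
The `depth` definition above can be checked directly against the dimension columns. The sketch below is illustrative only: the helper name `depth_percent` is hypothetical, and the factor of 100 is an assumption made here so that the result lands on the 43--79 scale quoted above. The sample values are the `x`, `y`, `z` of the first row of the dataset.

```python
def depth_percent(x: float, y: float, z: float) -> float:
    """Total depth percentage: z divided by the mean of x and y, times 100."""
    if x + y == 0:
        raise ValueError("x + y must be non-zero")
    return 100 * 2 * z / (x + y)


# x, y, z in mm for one diamond (first row of the dataset)
first_row_depth = depth_percent(3.95, 3.98, 2.43)
print(round(first_row_depth, 2))
```

The recorded `depth` for that row is 61.5, so the column is close to, but not exactly, what the rounded dimensions give.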

**Importing Necessary Packages**

In [9]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
import sklearn

**Creating the DataFrame**

In [10]:
data = pd.read_csv('diamonds.csv')
data.head(10)

,Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
0,1,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
1,2,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
2,3,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
3,4,0.29,Premium,I,VS2,62.4,58.0,334,4.2,4.23,2.63
4,5,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75
5,6,0.24,Very Good,J,VVS2,62.8,57.0,336,3.94,3.96,2.48
6,7,0.24,Very Good,I,VVS1,62.3,57.0,336,3.95,3.98,2.47
7,8,0.26,Very Good,H,SI1,61.9,55.0,337,4.07,4.11,2.53
8,9,0.22,Fair,E,VS2,65.1,61.0,337,3.87,3.78,2.49
9,10,0.23,Very Good,H,VS1,59.4,61.0,338,4.0,4.05,2.39


In [11]:
data = data.drop(columns=['Unnamed: 0'])  # Drop the leftover "Unnamed: 0" index column
data

,carat,cut,color,clarity,depth,table,price,x,y,z
0,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
1,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
2,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
3,0.29,Premium,I,VS2,62.4,58.0,334,4.20,4.23,2.63
4,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75
...,...,...,...,...,...,...,...,...,...,...
53935,0.72,Ideal,D,SI1,60.8,57.0,2757,5.75,5.76,3.50
53936,0.72,Good,D,SI1,63.1,55.0,2757,5.69,5.75,3.61
53937,0.70,Very Good,D,SI1,62.8,60.0,2757,5.66,5.68,3.56
53938,0.86,Premium,H,SI2,61.0,58.0,2757,6.15,6.12,3.74


**Checking the DataFrame**

In [12]:
# Check for null values
data.isnull().sum().sum()

0

In [13]:
# Summary statistics for the numeric columns
data.describe()

,carat,depth,table,price,x,y,z
count,53940.0,53940.0,53940.0,53940.0,53940.0,53940.0,53940.0
mean,0.79794,61.749405,57.457184,3932.799722,5.731157,5.734526,3.538734
std,0.474011,1.432621,2.234491,3989.439738,1.121761,1.142135,0.705699
min,0.2,43.0,43.0,326.0,0.0,0.0,0.0
25%,0.4,61.0,56.0,950.0,4.71,4.72,2.91
50%,0.7,61.8,57.0,2401.0,5.7,5.71,3.53
75%,1.04,62.5,59.0,5324.25,6.54,6.54,4.04
max,5.01,79.0,95.0,18823.0,10.74,58.9,31.8
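
The `min` row reports 0.0 for `x`, `y` and `z`, which cannot be real measurements, and the maxima for `y` (58.9) and `z` (31.8) sit far above the 75th percentile. The notebook keeps these rows. One possible cleanup is sketched below; the toy frame `sample` and the choice to drop rows rather than impute values are assumptions of this sketch, not steps taken in the original analysis.

```python
import pandas as pd

# Toy stand-in for the diamonds frame; values are made up for illustration
sample = pd.DataFrame({
    "x": [3.95, 0.0, 4.05, 6.15],
    "y": [3.98, 4.10, 0.0, 6.12],
    "z": [2.43, 2.50, 2.31, 0.0],
})

dims = ["x", "y", "z"]

# A row is suspect if any dimension is recorded as zero
has_zero_dim = (sample[dims] == 0).any(axis=1)
print("rows with a zero dimension:", int(has_zero_dim.sum()))

# Keep only rows where every dimension is strictly positive
cleaned = sample.loc[~has_zero_dim].reset_index(drop=True)
print(cleaned)
```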


**Data Processing**

*Categorical Data*

In [14]:
# Inspect the distinct values of each categorical column

print(data['cut'].unique())
print(data['color'].unique())
print(data['clarity'].unique())

['Ideal' 'Premium' 'Good' 'Very Good' 'Fair']
['E' 'I' 'J' 'H' 'F' 'G' 'D']
['SI2' 'SI1' 'VS1' 'VS2' 'VVS2' 'VVS1' 'I1' 'IF']


In [15]:
# Map categorical values to ordinal codes, from worst (0) to best

cut_mapping = {
    'Ideal': 4,
    'Premium': 3,
    'Very Good': 2,
    'Good': 1,
    'Fair': 0
}

color_mapping = {
    'D': 6,
    'E': 5,
    'F': 4,
    'G': 3,
    'H': 2,
    'I': 1,
    'J': 0
}

clarity_mapping = {
    'IF': 7,
    'VVS1': 6,
    'VVS2': 5,
    'VS1': 4,
    'VS2': 3,
    'SI1': 2,
    'SI2': 1,
    'I1': 0
}


data['cut'] = data['cut'].map(cut_mapping)
data['color'] = data['color'].map(color_mapping)
data['clarity'] = data['clarity'].map(clarity_mapping)
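
`Series.map` with a dict yields `NaN` for any value missing from the dict, so a typo in a mapping key would silently create missing values. A quick guard is sketched here on a toy series: `cuts` is hypothetical, and `cut_mapping` repeats the dict above.

```python
import pandas as pd

cut_mapping = {"Fair": 0, "Good": 1, "Very Good": 2, "Premium": 3, "Ideal": 4}

# Toy column standing in for data['cut']
cuts = pd.Series(["Ideal", "Premium", "Good", "Very Good", "Fair", "Ideal"])

# Any label absent from the mapping would become NaN after .map()
unmapped = set(cuts.unique()) - set(cut_mapping)
assert not unmapped, f"labels without a code: {unmapped}"

encoded = cuts.map(cut_mapping)
print(encoded.tolist())
```

The same check would also fail if the mapping cell were re-run on a column that is already encoded, since the integer codes are not keys of the dict.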

In [16]:
# Check that the categorical columns were re-encoded
data.sample(5)

,carat,cut,color,clarity,depth,table,price,x,y,z
12264,1.0,3,2,3,59.3,61.0,5208,6.47,6.45,3.83
21527,1.06,4,6,4,62.0,57.0,9625,6.49,6.54,4.04
9418,1.01,1,3,2,63.5,60.0,4588,6.3,6.33,4.01
41015,0.51,1,4,2,64.0,57.0,1186,5.01,5.06,3.22
15530,1.0,1,4,3,63.1,57.0,6223,6.32,6.36,4.0


*Train/Test Split*

In [22]:
from sklearn.model_selection import train_test_split  # Utility for splitting data into train and test sets

X = data.drop(columns=['price'])  # Features: every column except price
y = data['price']  # Target: price

X.columns = ['carat', 'cut', 'color', 'clarity', 'depth', 'table', 'x', 'y', 'z']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# splitting the data into train and test. 
# The data is split in a 80-20 ratio of train:test.
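
The split above does not fix a seed, so each run produces a different partition and the scores reported in this notebook will shift slightly between runs. Passing `random_state` makes the split repeatable. This sketch uses synthetic arrays; the seed value 42 is arbitrary and is not what produced the numbers in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for X and y (100 rows, 3 features)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = rng.normal(size=100)

# Same seed, same partition on every call
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

X_train_a, X_test_a, y_train_a, y_test_a = split_a
print(X_train_a.shape, X_test_a.shape)

# Compare the test features of the two splits
print(np.array_equal(split_a[1], split_b[1]))
```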

*Standardizing the Data*

In [23]:
# Standardize features to zero mean and unit variance

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
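
The scaler is fitted on the training split only and then applied to the test split, so no test-set statistics leak into training. The arithmetic behind that pattern is sketched below with NumPy on made-up numbers. `StandardScaler` uses the population standard deviation (`ddof=0`), which the sketch mirrors; the array names are illustrative.

```python
import numpy as np

# Made-up feature matrices: 4 training rows, 2 test rows, 2 features
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
test = np.array([[3.0, 35.0], [5.0, 50.0]])

# "fit": learn per-column mean and standard deviation from the training rows only
mean = train.mean(axis=0)
std = train.std(axis=0)  # ddof=0, the population standard deviation

# "transform": apply the training statistics to both splits
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled.mean(axis=0))  # approximately 0 per column
print(train_scaled.std(axis=0))   # approximately 1 per column
print(test_scaled)                # scaled with training statistics, not re-centered
```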

**Building the Models**

In [27]:
# Import the estimator classes from scikit-learn

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import GradientBoostingRegressor

In [28]:
# Instantiate and fit each model with default hyperparameters

LR = LinearRegression()
LR.fit(X_train, y_train)

LA = Lasso()
LA.fit(X_train, y_train)

DT = DecisionTreeRegressor()
DT.fit(X_train, y_train)

RF = RandomForestRegressor()
RF.fit(X_train, y_train)

KNN = KNeighborsRegressor()
KNN.fit(X_train, y_train)

GB = GradientBoostingRegressor()
GB.fit(X_train, y_train)
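
The six fit calls follow one pattern, so they can be driven from a dictionary. The sketch below shows the idea on synthetic data, with two of the estimators to keep it quick. The names `models`, `X_demo` and `y_demo` are illustrative, and `random_state=0` is an added assumption for repeatability.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem: y is a noisy linear function of two features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=0),
}

# Fit every estimator with the same call
for name, model in models.items():
    model.fit(X_demo, y_demo)
    print(name, "fitted")
```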

*Linear Regression*

In [39]:
from sklearn.metrics import mean_squared_error, r2_score

# Run the model on the test set
y_pred_LR = LR.predict(X_test)

# Compute the MSE
mse_LR = mean_squared_error(y_test, y_pred_LR)

# Compute R²
r2_LR = r2_score(y_test, y_pred_LR)

# Print the results
print('MSE and R² for Linear Regression: MSE = {:.2f}, R² = {:.4f}'.format(mse_LR, r2_LR))


MSE and R² for Linear Regression: MSE = 1505643.75, R² = 0.9047
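
For reference, the two metrics reduce to a few lines of arithmetic: MSE is the mean of the squared residuals, and R² is one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the true values. The sketch below computes both with NumPy on made-up prices; the arrays are illustrative and unrelated to the notebook's results.

```python
import numpy as np

# Made-up true and predicted prices
y_true = np.array([300.0, 500.0, 700.0, 900.0])
y_hat = np.array([320.0, 480.0, 710.0, 880.0])

residuals = y_true - y_hat

# Mean squared error: average of the squared residuals
mse = np.mean(residuals ** 2)

# R²: 1 - SS_res / SS_tot
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"MSE = {mse:.2f}, R² = {r2:.4f}")
```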


*Lasso*

In [40]:
# Run the model on the test set
y_pred_LA = LA.predict(X_test)

# Compute the MSE
mse_LA = mean_squared_error(y_test, y_pred_LA)

# Compute R²
r2_LA = r2_score(y_test, y_pred_LA)

# Print the results
print('MSE and R² for Lasso: MSE = {:.2f}, R² = {:.4f}'.format(mse_LA, r2_LA))


MSE and R² for Lasso: MSE = 1504904.24, R² = 0.9047


*Decision Tree*

In [41]:
# Run the model on the test set
y_pred_DT = DT.predict(X_test)

# Compute the MSE
mse_DT = mean_squared_error(y_test, y_pred_DT)

# Compute R²
r2_DT = r2_score(y_test, y_pred_DT)

# Print the results
print('MSE and R² for Decision Tree: MSE = {:.2f}, R² = {:.4f}'.format(mse_DT, r2_DT))


MSE and R² for Decision Tree: MSE = 570438.65, R² = 0.9639


*Random Forest*

In [42]:
# Run the model on the test set
y_pred_RF = RF.predict(X_test)

# Compute the MSE
mse_RF = mean_squared_error(y_test, y_pred_RF)

# Compute R²
r2_RF = r2_score(y_test, y_pred_RF)

# Print the results
print('MSE and R² for Random Forest: MSE = {:.2f}, R² = {:.4f}'.format(mse_RF, r2_RF))


MSE and R² for Random Forest: MSE = 297870.41, R² = 0.9811


*KNN*

In [43]:
# Run the model on the test set
y_pred_KNN = KNN.predict(X_test)

# Compute the MSE
mse_KNN = mean_squared_error(y_test, y_pred_KNN)

# Compute R²
r2_KNN = r2_score(y_test, y_pred_KNN)

# Print the results
print('MSE and R² for KNeighbors: MSE = {:.2f}, R² = {:.4f}'.format(mse_KNN, r2_KNN))


MSE and R² for KNeighbors: MSE = 566475.53, R² = 0.9641


*Gradient Boosting*

In [44]:
# Run the model on the test set
y_pred_GB = GB.predict(X_test)

# Compute the MSE
mse_GB = mean_squared_error(y_test, y_pred_GB)

# Compute R²
r2_GB = r2_score(y_test, y_pred_GB)

# Print the results
print('MSE and R² for Gradient Boosting: MSE = {:.2f}, R² = {:.4f}'.format(mse_GB, r2_GB))

MSE and R² for Gradient Boosting: MSE = 398894.59, R² = 0.9747


*Comparing Results*

In [50]:
# Collect the metrics for all models

resultados = {
    'Model': ['Linear Regression', 'Lasso', 'Decision Tree', 'Random Forest', 'KNN', 'Gradient Boosting'],
    'MSE': [mse_LR, mse_LA, mse_DT, mse_RF, mse_KNN, mse_GB],
    'R²': [r2_LR, r2_LA, r2_DT, r2_RF, r2_KNN, r2_GB]
}

# Round the values to 3 decimal places
for key in resultados.keys():
    if key != 'Model':
        resultados[key] = [round(value, 3) for value in resultados[key]]

# Build a DataFrame of the results
df_resultados = pd.DataFrame(resultados)
print(df_resultados)

               Model          MSE     R²
0  Linear Regression  1505643.749  0.905
1              Lasso  1504904.236  0.905
2      Decision Tree   570438.652  0.964
3      Random Forest   297870.410  0.981
4                KNN   566475.530  0.964
5  Gradient Boosting   398894.588  0.975
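
MSE is expressed in squared dollars, which makes the table hard to read against the \$326--\$18,823 price range. Taking the square root gives RMSE in dollars. The sketch below reuses the MSE values printed above; it is an added convenience, not part of the original comparison.

```python
import math

# MSE values copied from the comparison table above
mse_by_model = {
    "Linear Regression": 1505643.749,
    "Lasso": 1504904.236,
    "Decision Tree": 570438.652,
    "Random Forest": 297870.410,
    "KNN": 566475.530,
    "Gradient Boosting": 398894.588,
}

# RMSE is the square root of MSE, so it is in the same unit as price (US dollars)
rmse_by_model = {name: math.sqrt(mse) for name, mse in mse_by_model.items()}

# Print from lowest to highest error
for name, rmse in sorted(rmse_by_model.items(), key=lambda item: item[1]):
    print(f"{name:<18} RMSE = {rmse:8.2f}")
```

On this split, Random Forest has the lowest error, at roughly \$546 RMSE, against roughly \$1,227 for the two linear models.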
