## DATA SCIENCE PROJECT ON DIAMOND PRICE ANALYSIS

## BUSINESS CASE: BASED ON GIVEN FEATURES OF DIAMOND DATASET, WE NEED TO PREDICT THE PRICE OF DIAMONDS

#### MODEL CREATION & EVALUATION SUMMARY:
* Load and explore data
* Encode categorical features
* Split training and testing data
* Feature scaling using StandardScaler (fitted on the training split only)
* Model creation, prediction & evaluation
* Save the final model and scaler

### IMPORT NECESSARY LIBRARIES

In [7]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [8]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.neural_network import MLPRegressor
import joblib

### LOADING PREPROCESSED DATA

In [None]:
df = pd.read_csv("89diamonds.csv")
df.head()

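| Unnamed: 0.1 | Unnamed: 0 | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 2 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 3 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 4 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.2 | 4.23 | 2.63 |
| 4 | 5 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |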


### DEFINE INDEPENDENT & DEPENDENT FEATURES

In [11]:
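# Encode categorical variables.
# Keep one fitted encoder per column so the same mapping can be reused at prediction time
# (a single shared LabelEncoder would only remember the last column it was fitted on).
label_cols = ['cut', 'color', 'clarity']
encoders = {}
for col in label_cols:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])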
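
LabelEncoder assigns integer codes in alphabetical order of the labels, which does not follow the quality ranking of cut, color, or clarity. As an alternative, here is a minimal sketch of an order-aware encoding on a toy frame. It assumes this file uses the category labels of the classic diamonds dataset; the names QUALITY_ORDER and encode_ordinal are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Quality orderings from worst to best, as documented for the classic diamonds dataset.
# Assumption: the CSV used in this notebook has the same labels.
QUALITY_ORDER = {
    "cut": ["Fair", "Good", "Very Good", "Premium", "Ideal"],
    "color": ["J", "I", "H", "G", "F", "E", "D"],
    "clarity": ["I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF"],
}


def encode_ordinal(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with each quality column replaced by its rank (0 = worst)."""
    out = frame.copy()
    for col, order in QUALITY_ORDER.items():
        # Labels missing from `order` are encoded as -1 by pandas.
        out[col] = pd.Categorical(out[col], categories=order, ordered=True).codes
    return out


# Toy rows (hypothetical values, used only to show the mapping).
toy = pd.DataFrame({
    "cut": ["Ideal", "Premium", "Good"],
    "color": ["E", "E", "I"],
    "clarity": ["SI2", "SI1", "VS2"],
})
encoded = encode_ordinal(toy)
print(encoded)
```

Whether an ordered encoding actually helps the MLP here is untested; the point is only that the integer codes then carry the quality ranking.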

In [12]:
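# Define features and target
# Note: X still contains the leftover index columns 'Unnamed: 0.1' and 'Unnamed: 0'.
# If the rows are ordered by price (as the preview suggests), these columns can leak
# the target; consider dropping them and re-evaluating.
X = df.drop('price', axis=1)
y = df['price']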

### SPLIT TRAINING AND TESTING DATA

In [13]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

### FEATURE SCALING USING STANDARD SCALER

In [14]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
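
To make the scaling step concrete, the sketch below checks on toy numbers (hypothetical values, not taken from the dataset) that transforming the test rows applies the mean and standard deviation learned from the training rows only, which is why the scaler is fitted on the training split and merely applied to the test split.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy "training" and "test" rows with two features (hypothetical values).
train_demo = np.array([[0.2, 55.0], [0.3, 58.0], [0.7, 61.0]])
test_demo = np.array([[0.5, 57.0]])

demo_scaler = StandardScaler().fit(train_demo)

# StandardScaler standardizes with the training mean and the population
# standard deviation (ddof=0), so this manual formula should match it.
manual_scaled = (test_demo - train_demo.mean(axis=0)) / train_demo.std(axis=0)
sklearn_scaled = demo_scaler.transform(test_demo)

print("manual :", manual_scaled)
print("sklearn:", sklearn_scaled)
```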

### MODEL CREATION, PREDICTION AND EVALUATION

#### AIM
* Build a model that hits the bias-variance sweet spot (low bias, low variance), so it fits the training data without overfitting

#### ALGORITHM USED
* Artificial Neural Network [MLP Regressor]

### Artificial Neural Network [MLP Regressor]

In [15]:
mlp = MLPRegressor(hidden_layer_sizes=(100, 50), max_iter=500, random_state=42)
mlp.fit(X_train_scaled, y_train)



In [16]:
# Prediction
pred_mlp = mlp.predict(X_test_scaled)

In [18]:
# Evaluation
# mean_squared_error returns the MSE; take the square root to get the RMSE
mse = mean_squared_error(y_test, pred_mlp)
rmse_mlp = np.sqrt(mse)

# R² is printed in the next cell, so compute it here
r2_mlp = r2_score(y_test, pred_mlp)

print("RMSE (manual):", rmse_mlp)


RMSE (manual): 49.51686925088389


In [19]:
print("MLP Regressor → R²:", r2_mlp, " RMSE:", rmse_mlp)

MLP Regressor → R²: 0.9998457603397052  RMSE: 49.51686925088389
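
For reference, both metrics can be written out directly from the residuals. The sketch below uses a few hypothetical price values (not taken from the test set) and compares the hand-written formulas with the scikit-learn functions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical true and predicted prices, for illustration only.
y_true_demo = np.array([326.0, 334.0, 335.0, 400.0])
y_pred_demo = np.array([330.0, 330.0, 340.0, 390.0])

residuals = y_true_demo - y_pred_demo

# RMSE: square root of the mean squared residual (same units as price).
rmse_manual = np.sqrt(np.mean(residuals ** 2))

# R²: 1 minus (residual sum of squares / total sum of squares around the mean).
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print("RMSE:", rmse_manual, " R²:", r2_manual)
```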


### FINAL MODEL SELECTION AND SAVING

In [20]:
final_model = mlp
joblib.dump(final_model, 'diamond_price_ann_model.pkl')
joblib.dump(scaler, 'scaler.pkl')

['scaler.pkl']
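
One possible alternative to saving the model and the scaler as two separate files is to bundle both steps in a scikit-learn pipeline and persist that single object. The sketch below shows the idea on synthetic data; the file name demo_pipeline.pkl, the smaller network, and the data are assumptions for illustration, not the notebook's actual setup.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (hypothetical, only to make the sketch runnable).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.1, size=200)

# The pipeline scales the inputs and then fits the network, so the saved
# object always applies the matching scaler before predicting.
pipe = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(16,), max_iter=300, random_state=42),
)
pipe.fit(X_demo, y_demo)

# Save and reload through a temporary directory to keep the sketch side-effect free.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_pipeline.pkl")
    joblib.dump(pipe, path)
    restored = joblib.load(path)

round_trip_ok = np.allclose(pipe.predict(X_demo[:5]), restored.predict(X_demo[:5]))
print("Round-trip predictions match:", round_trip_ok)
```

With this layout, prediction code calls predict on the raw (encoded but unscaled) features, and there is no second artifact to keep in sync.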

### PREDICTION ON NEW SAMPLE

In [21]:
# Use the first test row as a stand-in for a new, unseen diamond
sample = X_test.iloc[[0]]
sample_scaled = scaler.transform(sample)
predicted_price = final_model.predict(sample_scaled)
print("Predicted Price for sample:", predicted_price[0])

Predicted Price for sample: 488.63532356632845


## CONCLUSION:
* An Artificial Neural Network [Multilayer Perceptron] was used for this regression task.
* On the test data it reached an R² of about 0.9998 and an RMSE of about 49.52 (in price units).
* The model and the fitted scaler are saved with joblib and can be loaded to predict diamond prices from the same input features.