# Gemstone-price-prediction


## Problem Statement

You are hired by Gem Stones Co Ltd, a cubic zirconia manufacturer. You are provided with a dataset containing the prices and other attributes of almost 27,000 cubic zirconia stones (cubic zirconia is an inexpensive diamond alternative with many of the same qualities as a diamond). The company earns different profits in different price slots. Your task is to predict the price of a stone from the details given in the dataset, so that the company can distinguish between more profitable and less profitable stones and improve its profit share. Also, provide the 5 attributes that are most important for the prediction.


## Tasks You Can Perform

📌 Clean and preprocess the data

📌 Do Exploratory Data Analysis (EDA) to get some insight into data

📌 Do Feature Engineering

📌 Build a model, i.e., regression analysis

📌 Evaluate the model

📌 Go back to any of the previous steps until the result is sufficient.

Dataset Link: https://www.kaggle.com/datasets/colearninglounge/gemstone-price-prediction

## Importing Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Loading the Dataset

In [None]:
train_df=pd.read_csv('/kaggle/input/playground-series-s3e8/train.csv')
train_df

In [None]:
test_df=pd.read_csv('/kaggle/input/playground-series-s3e8/test.csv')
test_df

In [None]:
train_df.info()

In [None]:
train_df.describe()

In [None]:
## Finding Null Values
train_df.isnull().sum()

In [None]:
train_df.head(2)

In [None]:
train_df.drop(['id'],axis=1,inplace=True)

In [None]:
## Printing all unique values of the categorical features
print(train_df['cut'].unique())
print(train_df['color'].unique())
print(train_df['clarity'].unique())

## Exploratory Data Analysis

In [None]:
train_df['cut'].value_counts()

In [None]:
plt.figure(figsize=(10,10))
sns.barplot(data=train_df,x='cut',y='price',palette = "Blues")
plt.title("Price Change With Cut Change",fontsize=14,pad=14)
plt.xlabel('Cut',fontsize=12)
plt.ylabel('Price',fontsize=12)

plt.tick_params(axis='x',which='major',labelsize=12,rotation=90)
plt.tick_params(axis='y',which='major',labelsize=12)



- Premium and Fair cuts have the highest average price

In [None]:
plt.figure(figsize=(10,10))
sns.barplot(data=train_df,x='color',y='price',palette = "Oranges")
plt.title("Price Change With Different Color ",fontsize=14,pad=14)
plt.xlabel('Color',fontsize=12)
plt.ylabel('Price',fontsize=12)

plt.tick_params(axis='x',which='major',labelsize=12)
plt.tick_params(axis='y',which='major',labelsize=12)

- Color J has the highest average price

In [None]:
plt.figure(figsize=(10,10))
sns.barplot(data=train_df,x='clarity',y='price',palette ="Blues")
plt.title("Price Change With Different Clarity ",fontsize=14,pad=14)
plt.xlabel('Clarity',fontsize=12)
plt.ylabel('Price',fontsize=12)

plt.tick_params(axis='x',which='major',labelsize=12)
plt.tick_params(axis='y',which='major',labelsize=12)

- SI2 clarity has the highest average price
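
The bar plots above show the mean price per category, which is the default estimator of `sns.barplot`. The same numbers can be read as a table with a group-by. The following is a minimal sketch; `toy_df` and its values are made up for illustration and are not the competition data.

```python
import pandas as pd

# Hypothetical toy data standing in for train_df (values are made up).
toy_df = pd.DataFrame({
    "cut": ["Fair", "Fair", "Ideal", "Ideal", "Premium"],
    "price": [4000, 5000, 3000, 3500, 4800],
})

# Mean price per category, sorted from highest to lowest.
# This is the quantity each bar of sns.barplot represents by default.
mean_price_by_cut = (
    toy_df.groupby("cut")["price"].mean().sort_values(ascending=False)
)
print(mean_price_by_cut)
```

On the real data, the same call can be applied to `train_df`, and repeated with `'color'` and `'clarity'` in place of `'cut'`.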

In [None]:
sns.pairplot(train_df)

## How the Numeric Features Relate to Price

In [None]:
correlation=train_df.corr(numeric_only=True)  # numeric columns only; cut, color and clarity are text
correlation

### Price is highly correlated with carat: the price of a gemstone depends mainly on its carat weight, and also on its dimensions along the x, y and z axes.

#### In other words, as carat weight decreases the price decreases too, and the same holds for the x, y and z dimensions.

In [None]:
plt.figure(figsize=(15,10))
sns.heatmap(data=correlation,annot=True,center=0)
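
To read the relationship with price without scanning the whole heatmap, the `price` column of the correlation matrix can be sorted by absolute value. This is a sketch on made-up data: `toy_df` is hypothetical and is constructed so that price depends mostly on carat.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; price is built to depend mostly on carat.
rng = np.random.default_rng(0)
carat = rng.uniform(0.2, 2.0, size=200)
depth = rng.uniform(55, 70, size=200)
toy_df = pd.DataFrame({
    "carat": carat,
    "depth": depth,
    "cut": rng.choice(["Fair", "Ideal"], size=200),
    "price": 5000 * carat + rng.normal(0, 100, size=200),
})

# numeric_only=True skips the text column; then drop the target's
# correlation with itself and rank the rest by magnitude.
price_corr = (
    toy_df.corr(numeric_only=True)["price"]
    .drop("price")
    .sort_values(key=abs, ascending=False)
)
print(price_corr)
```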

## Preprocessing and Feature Engineering

In [None]:
train_df.info()

In [None]:
## Checking Null Values
train_df.isnull().sum()

In [None]:
## Checking Duplicate value
train_df.duplicated().sum()

## Identifying Input and Target Columns

In [None]:
train_df.columns

In [None]:
input_cols=train_df.columns[0:-1]
input_cols

In [None]:
target_col="price"
target_col

## Making Input and Target DataFrames

In [None]:
input_df=train_df[input_cols].copy()
input_df

In [None]:
target=train_df[target_col]
target

## Identifying Numeric and categorical Columns

In [None]:
numeric_cols=input_df.select_dtypes(include=['int64','float64']).columns.tolist()
numeric_cols

In [None]:
categorical_cols=input_df.select_dtypes('object').columns.tolist()
categorical_cols

## Scaling Numeric Columns

In [None]:
input_df[numeric_cols].describe().loc[['min','max']]

In [None]:
from sklearn.preprocessing import MinMaxScaler

In [None]:
scaler=MinMaxScaler()

In [None]:
scaler.fit(train_df[numeric_cols])

In [None]:
input_df[numeric_cols]=scaler.transform(input_df[numeric_cols])
input_df.describe().loc[['min','max']]

## Encoding Categorical Columns

In [None]:
input_df.shape

In [None]:
from sklearn.preprocessing import OneHotEncoder


In [None]:
# sparse_output replaces the older `sparse` argument (scikit-learn >= 1.2)
encoder=OneHotEncoder(sparse_output=False,handle_unknown='ignore')

In [None]:
encoder.fit(input_df[categorical_cols])

In [None]:
encoded_cols=list(encoder.get_feature_names_out(categorical_cols))
encoded_cols

In [None]:
print(len(encoded_cols))

In [None]:
input_df[categorical_cols]

In [None]:
input_df[encoded_cols]=encoder.transform(input_df[categorical_cols])
input_df

## Training and Validation Set

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
train_inputs,val_inputs,train_target,val_target=train_test_split(input_df[numeric_cols + encoded_cols],
                                                                 target,
                                                                 test_size=0.25,
                                                                 random_state=42)

In [None]:
train_inputs

In [None]:
train_target

In [None]:
val_inputs

In [None]:
val_target

## Training Models

In [None]:
from sklearn.linear_model import LinearRegression
linear_model=LinearRegression()
linear_model.fit(train_inputs,train_target)

In [None]:
print(linear_model.coef_,linear_model.intercept_)

## Making Prediction and Evaluating the Model

In [None]:
from sklearn.metrics import mean_squared_error

In [None]:
train_pred=linear_model.predict(train_inputs)

In [None]:
# RMSE is the square root of the MSE; the `squared` argument is deprecated in recent scikit-learn
train_rmse=np.sqrt(mean_squared_error(train_target,train_pred))


In [None]:
print("The RMSE Loss for Training Data is $",train_rmse)

In [None]:
## Predicting On Validation Set
val_pred=linear_model.predict(val_inputs)
val_pred

In [None]:
# RMSE is the square root of the MSE; the `squared` argument is deprecated in recent scikit-learn
val_rmse=np.sqrt(mean_squared_error(val_target,val_pred))

In [None]:
print("The RMSE Loss for Validation Data is $",val_rmse)
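
The RMSE reported above is the square root of the mean squared difference between actual and predicted prices, so it is expressed in the same unit as the price. The sketch below works through the formula on four made-up values and compares the result with scikit-learn. The R² score is included only as an additional, commonly used check and is not part of the original evaluation.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up actual and predicted prices for illustration.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 320.0, 380.0])

# RMSE by hand: square the errors, average them, take the square root.
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

# Same value through scikit-learn's MSE.
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))

# R^2: share of the variance in y_true explained by the predictions.
r2 = r2_score(y_true, y_pred)

print(rmse_manual, rmse_sklearn, r2)
```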

## Feature Importance

In [None]:
weights=linear_model.coef_

weights_df=pd.DataFrame({
    'columns':train_inputs.columns,
    'weight':weights,

}).sort_values('weight',ascending=False)


In [None]:
weights_df.head()
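
Sorting by `weight` in descending order lists the largest positive coefficients first, so a feature with a large negative coefficient ends up at the bottom of the table. When the goal is the most influential features regardless of sign, ranking by absolute value is one option. This is a sketch on made-up data; `X`, `y` and `toy_weights_df` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy features already on a 0-1 scale (made-up data).
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.uniform(0, 1, size=(200, 3)), columns=["a", "b", "c"])
# The target depends strongly and negatively on "b", weakly on "a", not on "c".
y = 2.0 * X["a"] - 10.0 * X["b"] + rng.normal(0, 0.1, size=200)

model = LinearRegression().fit(X, y)

toy_weights_df = pd.DataFrame({
    "columns": X.columns,
    "weight": model.coef_,
})
# Rank by magnitude so strong negative effects are not pushed to the bottom.
toy_weights_df["abs_weight"] = toy_weights_df["weight"].abs()
toy_weights_df = toy_weights_df.sort_values("abs_weight", ascending=False)
print(toy_weights_df.head())
```

Comparing coefficients this way is only meaningful when the inputs share a common scale, as they do here after min-max scaling and one-hot encoding.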

## Making Prediction On Test Data

In [None]:
test_df.info()

In [None]:
test_df.isnull().sum()

In [None]:
test_df.duplicated().sum()

## Preprocessing the Test Data

In [None]:
test_df[numeric_cols],test_df[categorical_cols]

In [None]:
## Scaling Numeric Columns and encoding Categorical Columns
test_df[numeric_cols]=scaler.transform(test_df[numeric_cols])
test_df[encoded_cols]=encoder.transform(test_df[categorical_cols])

In [None]:
test_inputs=test_df[numeric_cols + encoded_cols]

In [None]:
test_df.shape

In [None]:
test_pred=linear_model.predict(test_inputs)
test_pred

In [None]:
train_pred

In [None]:
submission_df=pd.read_csv('/kaggle/input/playground-series-s3e8/sample_submission.csv')
submission_df.head()

In [None]:
submission_df['price']=test_pred

In [None]:
submission_df.to_csv('submission.csv',index=False)

In [None]:
submission_df