 # <h1 style='background:#682F2F; border:2; border-radius: 10px; font-size:250%; font-weight: bold; color:white'><center>DIAMOND PRICE PREDICTION: REGRESSION</center></h1> 
 

<h1 style='background:#682F2F; border:0; border-radius: 10px; color:white'><center>TABLE OF CONTENTS</center></h1>

### [**1. IMPORTING LIBRARIES**](#title-one)
    
### [**2. LOADING DATA**](#title-two)

### [**3. DATA ANALYSIS**](#title-three)

### [**4. DATA PREPROCESSING**](#title-four)

### [**5. MODEL BUILDING**](#title-five) 

### [**6. END**](#title-six)

<a id = "title-one"></a>
<h1 style='background:#682F2F; border:0; border-radius: 10px; color:white'><center>IMPORTING LIBRARIES</center></h1>

In [None]:
import warnings
warnings.filterwarnings('ignore')

import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from xgboost import XGBRegressor

from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error
from sklearn import metrics 

<a id = "title-two"></a>
<h1 style='background:#682F2F; border:0; border-radius: 10px; color:white'><center>LOADING DATA</center></h1>


In [None]:
data_df = pd.read_csv("../input/diamonds/diamonds.csv")
data_df.sample(10)

# **<span style="color:#682F2F;"><center>LABELLED DIMENSIONS OF A DIAMOND</center></span>**


<div style="border-radius:10px;
            border : #682F2F solid;
            background-color:#FFF8DC;
           font-size:110%;
            text-align: left">
    
## <h2 style='border:0; color:#682F2F'><center>About the data (Description of attributes)</center></h2>

**This classic dataset contains the prices and other attributes of almost 54,000 diamonds. It has 10 attributes, including the target, i.e. price.**

* **carat (0.2-5.01):** The carat is the diamond's physical weight, measured in metric carats. One carat equals 0.20 gram and is subdivided into 100 points. 
* **cut (Fair, Good, Very Good, Premium, Ideal):** The quality of the cut. The more precisely a diamond is cut, the more captivating it is to the eye and the higher its grade. 
* **color (from J (worst) to D (best)):** Gem-quality diamonds occur in many hues, ranging from colourless to light yellow or light brown. Colourless diamonds are the rarest. Other natural colours (blue, red, pink, for example) are known as "fancy", and their colour grading differs from that of white, colourless diamonds. 
* **clarity (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best)):** Diamonds can have internal characteristics known as inclusions or external characteristics known as blemishes. Diamonds without inclusions or blemishes are rare; however, most characteristics can only be seen with magnification. 
* **depth (43-79):** The total depth percentage, which equals z / mean(x, y) = 2 * z / (x + y). The depth of the diamond is its height (in millimetres) measured from the culet (bottom tip) to the table (flat, top surface), as shown in the labelled diagram above. 
* **table (43-95):** The width of the top of the diamond relative to its widest point. It gives the diamond its fire and brilliance by reflecting light in all directions, which makes it look lustrous to an observer. 
* **price (\$326 - \$18,823):** The price of the diamond in US dollars. **This is the target column of the dataset.**
* **x (0 - 10.74):** Length of the diamond (in mm) 
* **y (0 - 58.9):** Width of the diamond (in mm) 
* **z (0 - 31.8):** Depth of the diamond (in mm) 
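
The depth formula above can be checked by hand. The snippet below is a minimal sketch: the function name and the measurements are made up for illustration, and the factor of 100 is an assumption inferred from the stated 43-79 range (the formula itself yields a ratio).

```python
def total_depth_percentage(x_mm: float, y_mm: float, z_mm: float) -> float:
    """Return total depth as a percentage: 100 * z / mean(x, y)."""
    mean_diameter = (x_mm + y_mm) / 2
    if mean_diameter <= 0:
        raise ValueError("x and y must be positive to compute a depth percentage")
    return 100 * z_mm / mean_diameter


# Illustrative measurements in millimetres (assumed values, not a quoted row).
example_depth = total_depth_percentage(x_mm=3.95, y_mm=3.98, z_mm=2.43)
print(round(example_depth, 2))
```

Because x, y and z are rounded measurements, a value recomputed this way will usually differ slightly from the recorded depth column.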

<a id = "title-three"></a>
<h1 style='background: #682F2F; border:0; border-radius: 10px; color:white'><center>DATA ANALYSIS</center></h1>

### **<span style="color:#682F2F;"><center>Checking for missing values & categorical variables</center></span>**

In [None]:
# Checking for missing values and categorical variables in the dataset
data_df.info()

### **<span style="color:#682F2F;">Note: </span>**
##### Every attribute has 53940 non-null values, so there are no missing values.
##### The features 'cut', 'color' & 'clarity' have the "object" datatype and need to be converted into numerical variables (done in data preprocessing) before we feed the data to the algorithms. 
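
The same two checks can also be done programmatically instead of reading the `info()` output. This is only a sketch on a tiny made-up frame (`toy_df` and its values are hypothetical); on the real data the same calls would be applied to `data_df`.

```python
import pandas as pd

# Tiny made-up frame that only mimics the column types of the diamonds data.
toy_df = pd.DataFrame({
    "carat": [0.23, 0.31, 1.01],
    "cut": ["Ideal", "Good", "Premium"],
    "color": ["E", "J", "G"],
    "price": [326, 335, 4500],
})

# Number of missing values in each column.
missing_per_column = toy_df.isnull().sum()

# Columns that are not numeric and therefore need encoding.
categorical_columns = toy_df.select_dtypes(exclude="number").columns.tolist()

print(missing_per_column)
print(categorical_columns)
```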

### **<span style="color:#682F2F;"><center>Evaluating categorical features</center></span>**

In [None]:
plt.figure(figsize=(10,8))
cols = ["#A0522D","#A52A2A","#CD853F","#F4A460","#DEB887"]
ax = sns.violinplot(x="cut",y="price", data=data_df, palette=cols,scale= "count")
ax.set_title("Diamond Cut for Price", color="#774571", fontsize = 20)
ax.set_ylabel("Price", color="#4e4c39", fontsize = 15)
ax.set_xlabel("Cut", color="#4e4c39", fontsize = 15)
plt.show()

In [None]:
plt.figure(figsize=(12,8))
ax = sns.violinplot(x="color",y="price", data=data_df, palette=cols,scale= "count")
ax.set_title("Diamond Colors for Price", color="#774571", fontsize = 20)
ax.set_ylabel("Price", color="#4e4c39", fontsize = 15)
ax.set_xlabel("Color", color="#4e4c39", fontsize = 15)
plt.show()

In [None]:
plt.figure(figsize=(13,8))
ax = sns.violinplot(x="clarity",y="price", data=data_df, palette=cols,scale= "count")
ax.set_title("Diamond Clarity for Price", color="#774571", fontsize = 20)
ax.set_ylabel("Price", color="#4e4c39", fontsize = 15)
ax.set_xlabel("Clarity", color="#4e4c39", fontsize = 15)
plt.show()

### **<span style="color:#682F2F;">Note: </span>**
##### "Ideal" is the most common cut and "Fair" the least common. For every cut, most diamonds fall in the lower price range. 
##### "J", the worst color grade, is the rarest, whereas "H" and "G" are more common even though they are also of inferior quality.
##### Diamonds of "IF" clarity (the best) and "I1" clarity (the worst) are both very rare; most diamonds have in-between clarities. 

### **<span style="color:#682F2F;"><center>Descriptive Statistics</center></span>**

In [None]:
# Doing Univariate Analysis for statistical description and understanding of dispersion of data
data_df.describe().T

### **<span style="color:#682F2F;">Note: </span>**
##### "price" is right-skewed, as expected: most data points sit at the lower end of the range. 
##### The dimensional features 'x', 'y' & 'z' have a minimum value of 0. Such a data point would describe a 1D or 2D diamond, which makes no physical sense, so these values need to be either imputed with an appropriate value or dropped altogether.
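
Before deciding between imputing and dropping, it helps to count how many rows are affected. The following is a minimal sketch on made-up values (`toy_df` is hypothetical); with the real data, the mask would be built from `data_df` in the same way.

```python
import pandas as pd

# Made-up dimensions: rows 1 and 2 contain at least one zero.
toy_df = pd.DataFrame({
    "x": [3.95, 0.0, 4.05, 6.10],
    "y": [3.98, 0.0, 4.07, 6.05],
    "z": [2.43, 0.0, 0.0, 3.70],
})

# True for every row where any of x, y or z equals zero.
has_zero_dimension = (toy_df[["x", "y", "z"]] == 0).any(axis=1)
zero_rows = int(has_zero_dimension.sum())

# Keep only the physically plausible rows.
cleaned_df = toy_df.loc[~has_zero_dimension]

print(zero_rows, len(cleaned_df))
```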

In [None]:
# Doing Bivariate Analysis by examining a pairplot
ax = sns.pairplot(data_df, hue= "cut", palette = cols)

### **<span style="color:#682F2F;">Note: </span>**
##### There is a useless feature, "Unnamed: 0", which is just an index and needs to be removed. 
##### Several features have data points that lie far from the rest of the dataset (outliers), which need to be dealt with or they will affect our model.
##### "y" and "z" contain some dimensional outliers that need to be eliminated.
##### "depth" & "table" should be capped once we confirm the outliers by examining the line plots.

### **<span style="color:#682F2F;"><center>Checking for Potential Outliers</center></span>**

In [None]:
lm = sns.lmplot(x="price", y="y", data=data_df, scatter_kws={"color": "#BC8F8F"}, line_kws={"color": "#8B4513"})
plt.title("Line Plot on Price vs 'y'", color="#774571", fontsize = 20)
plt.show()

In [None]:
lm = sns.lmplot(x="price", y="z", data=data_df, scatter_kws={"color": "#BC8F8F"}, line_kws={"color": "#8B4513"})
plt.title("Line Plot on Price vs 'z'", color="#774571", fontsize = 20)
plt.show()

In [None]:
lm = sns.lmplot(x="price", y="depth", data=data_df, scatter_kws={"color": "#BC8F8F"}, line_kws={"color": "#8B4513"})
plt.title("Line Plot on Price vs 'depth'", color="#774571", fontsize = 20)
plt.show()

In [None]:
lm = sns.lmplot(x="price", y="table", data=data_df, scatter_kws={"color": "#BC8F8F"}, line_kws={"color": "#8B4513"})
plt.title("Line Plot on Price vs 'Table'", color="#774571", fontsize = 20)
plt.show()

### **<span style="color:#682F2F;">Note: </span>**
##### In the line plots of the features above, the outliers are easy to spot; we will drop them before feeding the data to the algorithms.
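
Spotting outliers by eye can be complemented with a numeric rule. The sketch below uses the common 1.5 × IQR rule, which is a swapped-in technique for illustration and not the approach this notebook takes (the notebook picks cut-offs from the plots). The function name and the values are hypothetical.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values lying more than k * IQR outside the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Made-up widths in mm; the last value is far away from the rest.
toy_y = pd.Series([3.9, 4.0, 4.1, 4.2, 4.3, 58.9])
outlier_mask = iqr_outlier_mask(toy_y)

print(toy_y[outlier_mask])
```

A rule like this tends to flag more points than hand-picked thresholds, so it is better suited to listing candidates than to dropping rows automatically.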

<a id = "title-four"></a>
<h1 style='background: #682F2F; border:0; border-radius: 10px; color:white'><center>DATA PREPROCESSING</center></h1>

### **<span style="color:#682F2F;"><center>Data Cleaning</center></span>**

In [None]:
# Removing the feature "Unnamed"
data_df = data_df.drop(["Unnamed: 0"], axis=1)
data_df.shape

In [None]:
# Removing the datapoints having min 0 value in either x, y or z features 
data_df = data_df.drop(data_df[data_df["x"]==0].index)
data_df = data_df.drop(data_df[data_df["y"]==0].index)
data_df = data_df.drop(data_df[data_df["z"]==0].index)
data_df.shape

### **<span style="color:#682F2F;"><center>Removing Outliers</center></span>**

In [None]:
# Dropping the outliers (affordable, since the dataset is large) using thresholds chosen per feature
data_df = data_df[(data_df["depth"]<75)&(data_df["depth"]>45)]
data_df = data_df[(data_df["table"]<80)&(data_df["table"]>40)]
data_df = data_df[(data_df["x"]<40)]
data_df = data_df[(data_df["y"]<40)]
data_df = data_df[(data_df["z"]<40)&(data_df["z"]>2)]
data_df.shape 

### **<span style="color:#682F2F;"><center>Encoding Categorical Variables</center></span>**

In [None]:
# Making a copy to keep original data in its form intact
data1 = data_df.copy()

# Applying label encoder to columns with categorical data
columns = ['cut','color','clarity']
label_encoder = LabelEncoder()
for col in columns:
    data1[col] = label_encoder.fit_transform(data1[col])
data1.describe()

### **<span style="color:#682F2F;">Note: </span>**
##### Since the categorical features have been converted into numerical columns, we now get the five-number summary along with count, mean & std for them as well. 
##### Next, we can analyze the correlation matrix for possible feature selection, to make the dataset cleaner and more compact before we feed it to the algorithms.
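
One caveat: `LabelEncoder` assigns codes in sorted label order, so for 'cut' the codes follow the alphabet (Fair, Good, Ideal, Premium, Very Good) rather than the quality ranking. If the ranking should be preserved, an explicit mapping is one option. The sketch below assumes the worst-to-best orders given in the attribute description; `encode_ordinal` and `toy_df` are hypothetical names, and this is an alternative to the encoding used in this notebook, not a description of it.

```python
import pandas as pd

# Worst-to-best orders, taken from the attribute description.
cut_order = ["Fair", "Good", "Very Good", "Premium", "Ideal"]
color_order = ["J", "I", "H", "G", "F", "E", "D"]
clarity_order = ["I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF"]


def encode_ordinal(values: pd.Series, order: list) -> pd.Series:
    """Map each label to its rank in `order` (0 = worst)."""
    mapping = {label: rank for rank, label in enumerate(order)}
    return values.map(mapping)


# Made-up rows, for illustration only.
toy_df = pd.DataFrame({
    "cut": ["Ideal", "Fair", "Very Good"],
    "color": ["D", "J", "G"],
    "clarity": ["IF", "I1", "VS2"],
})

encoded = pd.DataFrame({
    "cut": encode_ordinal(toy_df["cut"], cut_order),
    "color": encode_ordinal(toy_df["color"], color_order),
    "clarity": encode_ordinal(toy_df["clarity"], clarity_order),
})
print(encoded)
```

Tree-based models are largely insensitive to this choice, while linear models and nearest-neighbour models may benefit from codes that respect the ranking.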

### **<span style="color:#682F2F;"><center>Correlation Matrix</center></span>**

In [None]:
# Examining the correlation matrix using a heatmap
# The list of colours below is passed to the heatmap as a discrete colormap
cols = ["#682F2F", "#9E726F", "#D6B2B1", "#B9C0C9", "#9F8A78", "#F3AB60"]
corrmat = data1.corr()
f, ax = plt.subplots(figsize=(15,12))
sns.heatmap(corrmat, cmap=cols, annot=True)

### **<span style="color:#682F2F;">Note: </span>**
##### Features "carat", "x", "y" and "z" are highly correlated with our target variable, price. 
##### Features "cut", "clarity" and "depth" are only weakly correlated with price (|r| < 0.1) and could be removed; since we have only a few features, we keep them.
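
Reading the correlations off a heatmap can be error-prone, so it may help to rank the features by their absolute correlation with the target. This is a minimal sketch on made-up numbers (`toy_df` is hypothetical); on the real data the same expression would be applied to `data1`.

```python
import pandas as pd

# Made-up values: price rises with carat, depth is only loosely related.
toy_df = pd.DataFrame({
    "carat": [0.3, 0.5, 0.7, 1.0, 1.5],
    "depth": [61.5, 62.0, 61.0, 62.5, 61.8],
    "price": [400, 1200, 2500, 5000, 11000],
})

# Correlation of every feature with price, strongest first.
price_corr = (
    toy_df.corr()["price"]
    .drop("price")
    .sort_values(key=abs, ascending=False)
)
print(price_corr)
```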

<a id = "title-five"></a>
<h1 style='background: #682F2F; border:0; border-radius: 10px; color:white'><center>MODEL BUILDING</center></h1>

In [None]:
# Defining the independent and dependent variables
X= data1.drop(["price"],axis =1)
y= data1["price"]
X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.20, random_state=25)

In [None]:
# Building pipelines of standard scaler and model for various regressors.

pipeline_lr=Pipeline([("scalar1",StandardScaler()),
                     ("lr",LinearRegression())])

pipeline_lasso=Pipeline([("scalar2", StandardScaler()),
                      ("lasso",Lasso())])

pipeline_dt=Pipeline([("scalar3",StandardScaler()),
                     ("dt",DecisionTreeRegressor())])

pipeline_rf=Pipeline([("scalar4",StandardScaler()),
                     ("rf",RandomForestRegressor())])


pipeline_kn=Pipeline([("scalar5",StandardScaler()),
                     ("kn",KNeighborsRegressor())])


pipeline_xgb=Pipeline([("scalar6",StandardScaler()),
                     ("xgb",XGBRegressor())])

# List of all the pipelines
pipelines = [pipeline_lr, pipeline_lasso, pipeline_dt, pipeline_rf, pipeline_kn, pipeline_xgb]

# Dictionary of pipelines and model types for ease of reference
pipeline_dict = {0: "LinearRegression", 1: "Lasso", 2: "DecisionTree", 3: "RandomForest",4: "KNeighbors", 5: "XGBRegressor"}

# Fit the pipelines
for pipe in pipelines:
    pipe.fit(X_train, y_train)

In [None]:
cv_results_rms = []
for i, model in enumerate(pipelines):
    cv_score = cross_val_score(model, X_train,y_train,scoring="neg_root_mean_squared_error", cv=12)
    cv_results_rms.append(cv_score)
    print("%s: %f " % (pipeline_dict[i], -1 * cv_score.mean()))

In [None]:
# Model prediction on test data with XGBRegressor, which gave us the lowest cross-validated RMSE
pred = pipeline_xgb.predict(X_test)
print("R^2:",metrics.r2_score(y_test, pred))
print("Adjusted R^2:",1 - (1-metrics.r2_score(y_test, pred))*(len(y_test)-1)/(len(y_test)-X_test.shape[1]-1))
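
The adjusted R² expression used above is easier to read and reuse as a small helper. This is a sketch only; the function name and the example numbers are made up and are not results from this notebook.

```python
def adjusted_r2(r2: float, n_samples: int, n_features: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    degrees_of_freedom = n_samples - n_features - 1
    if degrees_of_freedom <= 0:
        raise ValueError("need more samples than features + 1")
    return 1 - (1 - r2) * (n_samples - 1) / degrees_of_freedom


# Illustrative numbers only: R^2 of 0.90 on 101 samples with 10 features.
example_adjusted = adjusted_r2(r2=0.90, n_samples=101, n_features=10)
print(round(example_adjusted, 4))
```

With thousands of test rows and only a handful of features, the correction factor is close to 1, so adjusted R² stays very close to R².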

## **<span style="color:#682F2F;"> The model explains about 98.27% of the variance in price on the test set (R²). We can take the model into production. </span>**


