<a href="https://colab.research.google.com/github/elif-tr/ML-Model-for-Diamonds-Data/blob/main/Diamonds.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.compose import ColumnTransformer


In [3]:
#read in the data set
diamonds = pd.read_csv('diamonds.csv', index_col=None)

In [4]:
diamonds.head()

,Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
0,1,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
1,2,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
2,3,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
3,4,0.29,Premium,I,VS2,62.4,58.0,334,4.2,4.23,2.63
4,5,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75


In [5]:
diamonds.tail()

,Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
53935,53936,0.72,Ideal,D,SI1,60.8,57.0,2757,5.75,5.76,3.5
53936,53937,0.72,Good,D,SI1,63.1,55.0,2757,5.69,5.75,3.61
53937,53938,0.7,Very Good,D,SI1,62.8,60.0,2757,5.66,5.68,3.56
53938,53939,0.86,Premium,H,SI2,61.0,58.0,2757,6.15,6.12,3.74
53939,53940,0.75,Ideal,D,SI2,62.2,55.0,2757,5.83,5.87,3.64


### Dropping the Unnamed: 0 Column
This column appears to duplicate the row index of the data, counting from 1. I will drop it and rely on the default pandas index, which starts from 0.

In [6]:
diamonds.drop('Unnamed: 0',axis = 1,  inplace=True)
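
As an alternative to dropping the column after loading, the index column can be consumed at read time. The sketch below is only an illustration: `csv_text` is a hypothetical two-row excerpt standing in for `diamonds.csv`, and it assumes the file stores the old index as its first, unnamed column.

```python
import io

import pandas as pd

# Hypothetical two-row excerpt standing in for diamonds.csv.
# The first header field is empty, which pandas would otherwise name "Unnamed: 0".
csv_text = """,carat,cut,color,clarity,depth,table,price,x,y,z
1,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
2,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
"""

# index_col=0 uses the first column as the index instead of a data column;
# reset_index(drop=True) then swaps the 1-based index for the default 0-based one.
sample = pd.read_csv(io.StringIO(csv_text), index_col=0).reset_index(drop=True)

print(sample.columns.tolist())
print(sample.index.tolist())
```

Either approach leaves the same ten data columns; this one simply avoids mutating the frame with `inplace=True`.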

In [7]:
diamonds.head()

,carat,cut,color,clarity,depth,table,price,x,y,z
0,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
1,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
2,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
3,0.29,Premium,I,VS2,62.4,58.0,334,4.2,4.23,2.63
4,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75


In [8]:
diamonds.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53940 entries, 0 to 53939
Data columns (total 10 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   carat    53940 non-null  float64
 1   cut      53940 non-null  object 
 2   color    53940 non-null  object 
 3   clarity  53940 non-null  object 
 4   depth    53940 non-null  float64
 5   table    53940 non-null  float64
 6   price    53940 non-null  int64  
 7   x        53940 non-null  float64
 8   y        53940 non-null  float64
 9   z        53940 non-null  float64
dtypes: float64(6), int64(1), object(3)
memory usage: 4.1+ MB


### Data dictionary 

*  price - price in US dollars (\$326--\$18,823)

*  carat - weight of the diamond (0.2--5.01)

*  cut - quality of the cut (Fair, Good, Very Good, Premium, Ideal)

*  color - diamond colour, from D (best) to J (worst)

*  clarity - a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))

*  x - length in mm (0--10.74)

*  y - width in mm (0--58.9)

*  z - depth in mm (0--31.8)

*  depth - total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43--79)

*  table - width of top of diamond relative to widest point (43--95)
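
The `depth` definition above can be checked against the data. The sketch below recomputes it for the first three rows shown by `diamonds.head()`; the factor of 100 is an assumption made here to express the ratio as a percentage, and the column names `depth_check` and `abs_diff` are hypothetical.

```python
import pandas as pd

# First three rows from diamonds.head(), restricted to the relevant columns.
rows = pd.DataFrame({
    "x": [3.95, 3.89, 4.05],
    "y": [3.98, 3.84, 4.07],
    "z": [2.43, 2.31, 2.31],
    "depth": [61.5, 59.8, 56.9],
})

# depth = 2 * z / (x + y), multiplied by 100 (assumed) to get a percentage.
# Note: rows where x + y == 0 would divide by zero; the data dictionary
# lists 0 as the minimum of x and y, so the full data set may contain such rows.
rows["depth_check"] = 100 * 2 * rows["z"] / (rows["x"] + rows["y"])
rows["abs_diff"] = (rows["depth"] - rows["depth_check"]).abs()

print(rows.round(2))
```

For these rows the recomputed values land within a few tenths of a percentage point of the recorded `depth`, which is consistent with rounding in the measurements.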



### Analysis of the data - Steps that need to be taken 

*   `diamonds.info()` reports 53,940 non-null entries for every column, so the data has no missing values and no imputation is needed
*   The data set mixes numerical and categorical variables (`cut`, `color` and `clarity`), so the categorical ones need to be encoded before modeling 
*   The numerical columns are on different scales (for example, `carat` tops out near 5 while `table` ranges from 43 to 95), so they need to be scaled before modeling
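
These three observations can also be checked programmatically. The following is a minimal sketch on a hypothetical three-row `sample` frame (copied from the `head()` output); on the real data the same calls would be applied to `diamonds`.

```python
import pandas as pd

# Hypothetical three-row sample with the same columns as the diamonds data.
sample = pd.DataFrame({
    "carat": [0.23, 0.21, 0.23],
    "cut": ["Ideal", "Premium", "Good"],
    "color": ["E", "E", "E"],
    "clarity": ["SI2", "SI1", "VS1"],
    "depth": [61.5, 59.8, 56.9],
    "table": [55.0, 61.0, 65.0],
    "price": [326, 326, 327],
    "x": [3.95, 3.89, 4.05],
    "y": [3.98, 3.84, 4.07],
    "z": [2.43, 2.31, 2.31],
})

# 1. Missing values per column (all zeros means no imputation is needed).
missing_per_column = sample.isna().sum()

# 2. Split the columns by type: numeric versus everything else.
numerical_columns = sample.select_dtypes(include="number").columns.tolist()
categorical_columns = sample.select_dtypes(exclude="number").columns.tolist()

# 3. Compare value ranges to see how different the scales are.
value_ranges = sample[numerical_columns].agg(["min", "max"])

print(missing_per_column)
print(categorical_columns)
print(value_ranges)
```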



#### Define our X and y variables

In [10]:
y_full_data = diamonds['price']
X_full_data = diamonds.drop('price', axis = 1)

#### Write a function that will split our data into train,validation and test sets

In [11]:
def train_val_test_split(X, y, test_size = 0.2, val_size = 0.25, random_state = 42):
  '''
  Function that splits the data into train/validation/test sets
  :param X: Input features for our model 
  :param y: Response feature for our model
  :param test_size: fraction of the data points in X, y that we want to keep for the test set
  :param val_size: fraction of the remaining (non-test) data points that we want to keep for validation
  :param random_state: seed used for both splits so that the result is reproducible
  :return: (X_train, y_train), (X_val, y_val), (X_test, y_test)
  '''
  # Split the arguments X and y (not the global X_full_data / y_full_data)
  X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size = test_size, random_state = random_state)
  X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=val_size, random_state = random_state)
  return (X_train, y_train), (X_val, y_val), (X_test, y_test)
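
With the default arguments, `val_size = 0.25` is applied to the 80% of rows left after the test split, so the overall proportions come out as 60% train, 20% validation and 20% test. The sketch below illustrates this on a hypothetical 1,000-row array rather than the diamonds data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 1,000 rows, one feature.
X_demo = np.arange(1000).reshape(-1, 1)
y_demo = np.arange(1000)

# First split: hold out 20% of all rows for testing.
X_train_val, X_test, y_train_val, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Second split: hold out 25% of the remaining 800 rows for validation.
X_train, X_val, y_train, y_val = train_test_split(
    X_train_val, y_train_val, test_size=0.25, random_state=42
)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```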

#### Write a function that will create and return our pipeline

In [12]:
def get_predictor_pipeline(data_processor, training_model, data_processor_key = 'preprocessor', model_key = 'linear'):
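  '''
  Function that creates and returns the entire pipeline for the model

  :param data_processor: transformer that preprocesses the data to prepare it for modeling
  :param training_model: the estimator to train on the preprocessed data 
  :param data_processor_key: name of the preprocessing step inside the pipeline
  :param model_key: name of the model step inside the pipeline
  :return: an unfitted sklearn Pipeline with the two named steps
  '''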
  lm = Pipeline(steps = [
               (data_processor_key, data_processor),
               (model_key, training_model)
])
  return lm
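
The step names matter because they are how a fitted step is looked up later. The sketch below builds a pipeline with the same two default names, `'preprocessor'` and `'linear'`, on hypothetical toy data (`X_toy`, `y_toy`) and reads the fitted model back through `named_steps`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data following y = 2 * x + 1.
X_toy = np.array([[1.0], [2.0], [3.0], [4.0]])
y_toy = np.array([3.0, 5.0, 7.0, 9.0])

# Same two-step layout and step names as get_predictor_pipeline's defaults.
toy_pipeline = Pipeline(steps=[
    ("preprocessor", StandardScaler()),
    ("linear", LinearRegression()),
])
toy_pipeline.fit(X_toy, y_toy)

# Individual fitted steps are reachable by name.
fitted_model = toy_pipeline.named_steps["linear"]
print(type(fitted_model).__name__)

# predict() runs the scaler and then the model on new data.
prediction = toy_pipeline.predict(np.array([[5.0]]))
print(prediction)
```

Note that the coefficients stored on `fitted_model` refer to the scaled features, not to the original units.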

#### Write a function that will run our entire pipeline and return the score

In [13]:
def diamonds_pdf(data_frame):
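  '''
  Function that takes in the data frame, builds the entire pipeline, trains the model on the
  training split and then produces a dictionary holding the score and its value. 
  The score is the R^2 returned by the pipeline's score method on the test split.
  
  :param data_frame: data frame that needs to be processed; it must contain a 'price' column
  :return: dictionary of the form {"Score": R^2 rounded to two decimals}
  '''
  # Separate X and y values for data
  X = data_frame.drop('price', axis = 1)
  y = data_frame['price']
  
  # function call to split the data (the validation split is not used further below)
  (X_train, y_train), (X_val, y_val), (X_test, y_test) = train_val_test_split(X, y)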
 
  # transformers for our pipeline
  n_transformer = Pipeline(steps=[('scale', StandardScaler())])
  c_transformer = Pipeline(steps=[('encode', OneHotEncoder())])

  #determine categorical and numerical columns for transformation 
  numerical_columns = ['carat', 'depth', 'table', 'x', 'y', 'z' ]
  categorical_columns = ['cut', 'color', 'clarity']

  #Set the preprocessor for the data 
  preprocessor =  ColumnTransformer(
    transformers = [
                    ('num', n_transformer, numerical_columns),
                    ('cat', c_transformer, categorical_columns) ]
  )

  #call the function for the full pipeline 
  full_pipeline = get_predictor_pipeline(preprocessor, LinearRegression())

  #Fit the full pipeline to our training data 
  full_pipeline.fit(X_train, y_train)

  # Store the R^2 score on the test set, rounded to two decimals, in a dictionary 
  score_dict = {"Score" : round(full_pipeline.score(X_test, y_test), 2)}

  return score_dict


In [14]:
diamonds_pdf(diamonds)


{'Score': 0.92}
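
`diamonds_pdf` creates a validation split but reports only the test score. One possible extension, sketched below, is to report R^2 on all three splits so that a gap between train and validation scores becomes visible. The `toy` data frame, its price formula and the `handle_unknown="ignore"` setting are assumptions made for this illustration, not part of the notebook above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical synthetic data: price depends linearly on carat plus a cut bonus.
rng = np.random.default_rng(42)
n_rows = 500
toy = pd.DataFrame({
    "carat": rng.uniform(0.2, 2.0, n_rows),
    "cut": rng.choice(["Fair", "Good", "Ideal"], n_rows),
})
cut_bonus = toy["cut"].map({"Fair": 0.0, "Good": 300.0, "Ideal": 600.0})
toy["price"] = 4000 * toy["carat"] + cut_bonus + rng.normal(0, 200, n_rows)

X = toy.drop("price", axis=1)
y = toy["price"]

# Same 60/20/20 split scheme as train_val_test_split.
X_train_val, X_test, y_train_val, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_train_val, y_train_val, test_size=0.25, random_state=42
)

# Scale numeric columns, one-hot encode categorical ones.
# handle_unknown="ignore" (an assumption here) avoids errors if a category
# appears only in the validation or test split.
preprocessor = ColumnTransformer(transformers=[
    ("num", StandardScaler(), ["carat"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["cut"]),
])
pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("linear", LinearRegression()),
])
pipeline.fit(X_train, y_train)

# R^2 on each split, rounded to two decimals.
scores = {
    "train": round(pipeline.score(X_train, y_train), 2),
    "validation": round(pipeline.score(X_val, y_val), 2),
    "test": round(pipeline.score(X_test, y_test), 2),
}
print(scores)
```

Keeping the test score for a single final check and comparing model variants on the validation score is the usual reason for carrying three splits.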