# Diamonds Analysis - DSCI 100 Project Proposal
###### Group 5: Clare Pan, Wendy Phung, Jessie Sheng, Jason Wang

## Introduction
Diamond is a mineral composed of pure carbon in which each carbon atom is covalently bonded to four others, giving it a very strong structure. Its hardness makes diamond useful in industrial applications, and its high dispersion of light makes it desirable as jewelry. Even a small difference between two diamonds can make one much more valuable than the other. For this reason, multiple organizations use the “four Cs” (color, cut, clarity, and carat) to certify and grade diamonds. In this project, we will use a data set from Kaggle to predict a diamond's price from the four Cs.

## Preliminary exploratory data analysis

#### Loading the libraries

In [38]:
library(tidyverse)
library(repr)
library(tidymodels)
library(RColorBrewer)
library(cowplot)

#### Reading the dataset from the web into R

In [39]:
url <- "https://raw.githubusercontent.com/cpan0/project_proposal/main/diamonds.csv"
diamonds <- read_csv(url)
diamonds <- diamonds %>% 
    select(carat, cut, color, clarity, price)   # selecting the necessary variables/columns
head(diamonds)

Parsed with column specification:
cols(
  carat = col_double(),
  cut = col_character(),
  color = col_character(),
  clarity = col_character(),
  depth = col_double(),
  table = col_double(),
  x = col_double(),
  y = col_double(),
  z = col_double(),
  price = col_double()
)



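| carat | cut | color | clarity | price |
|---|---|---|---|---|
| `<dbl>` | `<chr>` | `<chr>` | `<chr>` | `<dbl>` |
| 0.23 | Ideal | E | SI2 | 326 |
| 0.21 | Premium | E | SI1 | 326 |
| 0.23 | Good | E | VS1 | 327 |
| 0.29 | Premium | I | VS2 | 334 |
| 0.31 | Good | J | SI2 | 335 |
| 0.24 | Very Good | J | VVS2 | 336 |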
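
The column specification above shows that `cut`, `color`, and `clarity` are read in as plain character columns, although each is a graded scale. As a minimal sketch, assuming the usual grade orderings (cut from Fair to Ideal, color from J to D, clarity from I1 to IF), the columns could be converted to ordered factors. The `toy` tibble and the `*_levels` vectors are illustrative names, not part of the original analysis; the rows are the first three printed above.

```r
library(dplyr)

# Illustrative rows copied from the head() output above
toy <- tibble(
    carat   = c(0.23, 0.21, 0.29),
    cut     = c("Ideal", "Premium", "Good"),
    color   = c("E", "E", "I"),
    clarity = c("SI2", "SI1", "VS2"),
    price   = c(326, 326, 334)
)

# Assumed grade orderings, listed from worst to best
cut_levels     <- c("Fair", "Good", "Very Good", "Premium", "Ideal")
color_levels   <- c("J", "I", "H", "G", "F", "E", "D")
clarity_levels <- c("I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF")

toy_ordered <- toy %>%
    mutate(cut     = factor(cut, levels = cut_levels, ordered = TRUE),
           color   = factor(color, levels = color_levels, ordered = TRUE),
           clarity = factor(clarity, levels = clarity_levels, ordered = TRUE))

# Ordered factors allow comparisons between grades
premium_or_better <- toy_ordered %>%
    filter(cut >= "Premium")
premium_or_better
```

With ordered factors, plots and summary tables list the grades in a meaningful order rather than alphabetically.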


#### Splitting the dataset into training (75%) and testing (25%) sets, stratified by cut

In [40]:
set.seed(1)

diamonds_split <- initial_split(diamonds, prop = 0.75, strata = cut)
diamonds_train <- training(diamonds_split)
diamonds_test <- testing(diamonds_split) 

glimpse(diamonds_train)
glimpse(diamonds_test)

Rows: 40,456
Columns: 5
$ carat   <dbl> 0.23, 0.23, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23, 0.23, 0.31, 0…
$ cut     <chr> "Ideal", "Good", "Good", "Very Good", "Very Good", "Very Good…
$ color   <chr> "E", "E", "J", "J", "I", "H", "E", "H", "J", "J", "E", "E", "…
$ clarity <chr> "SI2", "VS1", "SI2", "VVS2", "VVS1", "SI1", "VS2", "VS1", "VS…
$ price   <dbl> 326, 327, 335, 336, 336, 337, 337, 338, 340, 344, 345, 345, 3…
Rows: 13,484
Columns: 5
$ carat   <dbl> 0.21, 0.29, 0.30, 0.22, 0.23, 0.23, 0.23, 0.23, 0.26, 0.32, 0…
$ cut     <chr> "Premium", "Premium", "Good", "Premium", "Very Good", "Very G…
$ color   <chr> "E", "I", "J", "F", "F", "F", "F", "D", "D", "H", "F", "I", "…
$ clarity <chr> "SI1", "VS2", "SI1", "SI1", "VS1", "VS1", "VS1", "VS1", "VS2"…
$ price   <dbl> 326, 334, 339, 342, 357, 402, 402,
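
One way to check that stratifying on `cut` behaves as intended is to compare the share of each cut in the two sets. The sketch below does this on a small made-up table rather than the real data; `toy`, `cut_share`, and `comparison` are hypothetical names, and the prices are random placeholders.

```r
library(dplyr)
library(rsample)

set.seed(1)

# Made-up data standing in for the diamonds table
toy <- tibble(
    cut   = rep(c("Fair", "Good", "Ideal"), times = c(80, 120, 200)),
    price = round(runif(400, min = 300, max = 5000))
)

toy_split <- initial_split(toy, prop = 0.75, strata = cut)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# Share of each cut within a data frame
cut_share <- function(df) {
    df %>%
        count(cut) %>%
        mutate(prop = n / sum(n)) %>%
        select(cut, prop)
}

# Side-by-side proportions; similar values suggest the strata were preserved
comparison <- full_join(cut_share(toy_train),
                        cut_share(toy_test),
                        by = "cut",
                        suffix = c("_train", "_test"))
comparison
```

The same `cut_share()` helper could be applied to `diamonds_train` and `diamonds_test` to confirm that each cut is represented in roughly the same proportion in both.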

#### Exploratory data analysis (summary)

##### Range, mean and standard deviation of diamond carat in the training data

In [41]:
diamonds_carat_min_max <- diamonds_train %>% 
    summarize(min_carat = min(carat),
              max_carat = max(carat),
              mean_carat = mean(carat),
              sd_carat = sd(carat))
diamonds_carat_min_max

| min_carat | max_carat | mean_carat | sd_carat |
|---|---|---|---|
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 0.2 | 4.5 | 0.7958088 | 0.4728762 |
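
Since price is the quantity to be predicted, the same four statistics may be useful for `price` as well. A possible sketch, assuming dplyr 1.0 or later for `across()`, computes them for both numeric columns in one call. `toy_train` is an illustrative stand-in built from the first four training rows printed by `glimpse()` above; `toy_summary` is likewise a hypothetical name.

```r
library(dplyr)

# First four training rows from the glimpse() output, numeric columns only
toy_train <- tibble(
    carat = c(0.23, 0.23, 0.31, 0.24),
    price = c(326, 327, 335, 336)
)

# One column per variable/statistic pair, e.g. carat_min, price_sd
toy_summary <- toy_train %>%
    summarize(across(c(carat, price),
                     list(min = min, max = max, mean = mean, sd = sd)))
toy_summary
```

Replacing `toy_train` with `diamonds_train` would give the corresponding summary for the full training set.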


##### Number of diamonds of each cut in the training data