# Read diamonds Dataset with R

## Introduction
This notebook imports the `diamonds` dataset. This entails:
1. Reading the datafile (into a dataframe)
2. Checking the datatypes (of each column in the dataframe)
3. Setting these datatypes (if they were not initially read correctly)

The sections of this notebook (listed below, except the Setup section) correspond to these steps.
Note that the numeric columns of the diamonds dataset are initially read correctly; only the categorical columns `cut`, `color`, and `clarity` need to be set to factors.
Other notebooks require more work to set the column datatypes correctly.

## Contents
1. Setup
2. Read datafile
3. Check column types
4. Set column types

## 1. Setup

The notebook `Include` 
- contains some references 
- loads libraries
- defines the function `get_filepaths` in R and Python to facilitate locating the datafile

Display the notebook results to see these references and the libraries.

Set the path at which we expect to find the datafile `diamonds.csv`.

In [7]:
%r
diamonds_filepath = '/dbfs/mnt/datalab-datasets/file-samples/diamonds.csv'

In [8]:
%python
diamonds_filepath = '/dbfs/mnt/datalab-datasets/file-samples/diamonds.csv'

## 2. Read dataset using `read_csv` from `readr` (R)

The code below uses the `read_csv` function from the `readr` library.
For details on the function, see:  
- https://readr.tidyverse.org/reference/read_delim.html

The code also uses the pipe operator from the `magrittr` library.
For details on this operator, see:
- `/Data Lab notebooks/R/Essentials/Pipes`
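As a quick illustration of the pipe operator, the sketch below shows that `x %>% f()` passes `x` as the first argument of `f`. The vector `values` and the names `piped` and `nested` are made up for this example and do not come from the dataset.

```r
library(magrittr)

# Illustrative input only; not taken from the diamonds datafile.
values <- c(4, 9, 16)

# Piped form: read left to right (take values, take square roots, then sum).
piped <- values %>% sqrt() %>% sum()

# Equivalent nested form: read inside out.
nested <- sum(sqrt(values))

# Both forms compute the same result.
identical(piped, nested)
```

The cells in this notebook use the piped form so that each step (read, convert, inspect) appears on its own line.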

In [11]:
%r 
library(magrittr)
library(readr)
diamonds_filepath %>%
read_csv() %>%
str()

Notice that the numeric columns are all read correctly. The `cut`, `color` and `clarity` columns, though, are read as character columns and should be factors.

The following cell uses the `mutate` function from `dplyr` to change these variables into factors.
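Besides reading the `str()` output by eye, the column types can also be collected into a named vector. The sketch below is one possible way to do this; it uses a small, made-up CSV written to a temporary file (the path `sample_path` and its three rows are assumptions for illustration, not the contents of `diamonds.csv`).

```r
library(readr)

# Hypothetical stand-in for the datafile: three invented rows in a temp file.
sample_path <- tempfile(fileext = ".csv")
writeLines(c("carat,cut,price",
             "0.30,Good,400",
             "0.40,Ideal,500",
             "0.50,Good,600"),
           sample_path)

# Read with readr's default type guessing (messages about guessed types suppressed).
sample_df <- suppressMessages(read_csv(sample_path))

# First class of each column: text columns come back as "character", not "factor".
col_classes <- sapply(sample_df, function(col) class(col)[1])
print(col_classes)
```

Any column reported as `"character"` that holds a fixed set of categories is a candidate for conversion to a factor.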

In [13]:
%r 
library(magrittr)
library(readr)
library(dplyr)
diamonds_filepath %>%
read_csv() %>%
# Convert each categorical column from its own values; levels follow order of first appearance
dplyr::mutate(cut=parse_factor(cut,unique(cut)),
              color=parse_factor(color,unique(color)),
              clarity=parse_factor(clarity,unique(clarity))
             ) %>%
str()

The datafile has been read into a dataframe with the correct column types.
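An alternative to converting columns after reading is to declare the factor columns when the file is read, through the `col_types` argument of `read_csv`. The sketch below is not the approach used in this notebook; it again relies on a small, made-up CSV in a temporary file (`sample_path` and its rows are assumptions for illustration).

```r
library(readr)

# Hypothetical stand-in for the datafile: three invented rows in a temp file.
sample_path <- tempfile(fileext = ".csv")
writeLines(c("carat,cut,color",
             "0.30,Good,E",
             "0.40,Ideal,D",
             "0.50,Good,E"),
           sample_path)

# Declare the categorical columns as factors at read time.
# levels = NULL lets readr take the levels from the data, in order of appearance.
# Columns not named in cols() are still guessed.
typed_df <- read_csv(
  sample_path,
  col_types = cols(
    cut   = col_factor(levels = NULL),
    color = col_factor(levels = NULL)
  )
)

str(typed_df)
levels(typed_df$cut)
```

Declaring the types up front keeps the reading and typing in a single step, whereas the `mutate` approach above makes the conversion explicit as its own step.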

__The End__