# Read diamonds Dataset in Python with pandas

## Introduction
This notebook imports the `diamonds` dataset. This entails:
2. Reading the datafile (into a dataframe)
3. Checking the datatypes (of each column in the dataframe)
4. Setting these datatypes (if they were not initially read correctly)

The sections of this notebook (listed below, except the Setup section) correspond to each step. 
Note that the numeric columns of the diamonds dataset are all initially read correctly; only the categorical columns (`cut`, `color`, `clarity`) need their datatypes set.
Other notebooks require more work to set the column datatypes correctly.

## Contents
1. Setup
2. Read datafile
3. Check column types
4. Set column types

## 1. Setup

In [5]:
%r
diamonds_filepath = '/dbfs/mnt/datalab-datasets/file-samples/diamonds.csv'

In [6]:
%python
diamonds_filepath = '/dbfs/mnt/datalab-datasets/file-samples/diamonds.csv'

## 2. Read using `read_csv` from `pandas` (Python)

The pandas module provides many functions for working with (including reading) dataframes in Python.
In particular, we use the `read_csv` function from pandas.
For details on the function, see:
- http://pandas.pydata.org/pandas-docs/version/0.19/generated/pandas.read_csv.html

Notice the version of pandas used on the cluster.

In [9]:
%python
import pandas as pd
pd.__version__

The code cell below uses the `read_csv` function to read the `diamonds.csv` datafile into a pandas dataframe.

The `info()` method is used to display the datatypes of the columns.

In [11]:
%python
import pandas as pd
pd.read_csv(diamonds_filepath) \
  .info()

Notice that the numeric columns are all read correctly as numeric types (`float64` or `int64`).

The `cut`, `color`, and `clarity` columns, though, are categorical variables and should be read as type `category`, not `object`.

In [13]:
%python
import pandas as pd
pd.read_csv(diamonds_filepath,
            dtype={'cut':'category',
                   'color':'category',
                   'clarity':'category'}
           ) \
  .info()
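
The same result can also be reached after reading, by converting the columns with `astype('category')`. The cell below is a minimal sketch of that alternative, not part of the original notebook. It assumes a small, hypothetical in-memory sample (`sample_csv`) with a subset of diamonds-style columns in place of `diamonds.csv`, so that it runs without access to the cluster filesystem.

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for diamonds.csv (illustration only).
sample_csv = io.StringIO(
    "carat,cut,color,clarity,price\n"
    "0.23,Ideal,E,SI2,326\n"
    "0.21,Premium,E,SI1,326\n"
    "0.23,Good,E,VS1,327\n"
)

categorical_columns = ['cut', 'color', 'clarity']

# Read with default type inference and record the inferred datatypes.
sample = pd.read_csv(sample_csv)
default_dtypes = {name: str(dtype) for name, dtype in sample.dtypes.items()}

# Convert the text columns to the category datatype after reading.
for column in categorical_columns:
    sample[column] = sample[column].astype('category')
converted_dtypes = {name: str(dtype) for name, dtype in sample.dtypes.items()}

print(default_dtypes)
print(converted_dtypes)

# Each categorical column now exposes its distinct levels.
print(list(sample['cut'].cat.categories))
```

Passing `dtype=` to `read_csv`, as in the cell above, sets the types in a single step at read time. Converting afterwards with `astype` may be more convenient when the dataframe has already been loaded.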

__The End__