# Read diamonds Dataset in Python with pandas

## Introduction
This notebook imports the `diamonds` dataset. This entails:
2. Reading the datafile (into a dataframe)
3. Checking the datatypes (of each column in the dataframe)
4. Setting these datatypes (if they were not initially read correctly)

The sections of this notebook (listed below, except the Setup section) correspond to each step. 
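Note that the numeric columns of the diamonds dataset are all initially read correctly, but the three categorical columns (`cut`, `color` and `clarity`) are read as type `object` and need to be set to type `category`.
Other notebooks require more work to set the column datatypes correctly.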

## Contents
1. Setup
2. Read datafile
3. Check column types
4. Set column types

## 1. Setup

In [0]:
diamonds_filepath = 'https://raw.githubusercontent.com/datalab-datasets/file-samples/master/diamonds.csv'

## 2. Read using `read_csv` from `pandas` (Python)

The pandas module provides many functions for working with (including reading) dataframes in Python.
In particular, we use the `read_csv` function from pandas.
For details on the function see:
- http://pandas.pydata.org/pandas-docs/version/0.19/generated/pandas.read_csv.html

Notice the version of pandas used on the cluster. The documentation linked above is for version 0.19, so some details may differ.

In [6]:
import pandas as pd
pd.__version__

'0.25.3'

The code cell below uses the `read_csv` function to read the `diamonds.csv` datafile into a pandas dataframe.

The `info()` method is used to display the datatypes of the columns.

In [7]:
import pandas as pd
pd.read_csv(diamonds_filepath) \
  .info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53940 entries, 0 to 53939
Data columns (total 11 columns):
Unnamed: 0    53940 non-null int64
carat         53940 non-null float64
cut           53940 non-null object
color         53940 non-null object
clarity       53940 non-null object
depth         53940 non-null float64
table         53940 non-null float64
price         53940 non-null int64
x             53940 non-null float64
y             53940 non-null float64
z             53940 non-null float64
dtypes: float64(6), int64(2), object(3)
memory usage: 4.5+ MB
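
The same check can be made in code instead of by reading the `info()` printout. The sketch below is only an illustration: it reads a few hand-typed rows laid out like the diamonds columns from an in-memory string (so it runs without network access), and the names `sample_csv`, `sample`, `column_types` and `non_numeric` are assumptions made for this example, not part of the original notebook.

```python
import io
import pandas as pd

# A few illustrative rows in the shape of the diamonds columns.
# The empty first header field is what pandas names 'Unnamed: 0'.
sample_csv = io.StringIO(
    ",carat,cut,color,clarity,depth,table,price,x,y,z\n"
    "1,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43\n"
    "2,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31\n"
    "3,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31\n"
)

sample = pd.read_csv(sample_csv)

# One dtype per column, as a Series indexed by column name.
column_types = sample.dtypes
print(column_types)

# Columns that were not read as a numeric type are the candidates
# for conversion to 'category'.
non_numeric = sample.select_dtypes(exclude='number').columns.tolist()
print(non_numeric)
```

Listing the non-numeric columns this way gives the set of column names to pass to the `dtype` argument of `read_csv`.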


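Notice that the numeric columns are all read correctly: `carat`, `depth`, `table`, `x`, `y` and `z` as type `float64`, and `price` as type `int64`.

The `cut`, `color` and `clarity` columns, though, are categorical variables and should be read as type `category`, not `object`. The cell below sets these types with the `dtype` argument of `read_csv` and drops the `Unnamed: 0` column, which looks like a row index saved in the datafile rather than a variable.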

In [8]:
import pandas as pd
pd.read_csv(diamonds_filepath,
            dtype={'cut':'category',
                   'color':'category',
                   'clarity':'category'}
           ) \
  .drop(columns=['Unnamed: 0']) \
  .info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53940 entries, 0 to 53939
Data columns (total 10 columns):
carat      53940 non-null float64
cut        53940 non-null category
color      53940 non-null category
clarity    53940 non-null category
depth      53940 non-null float64
table      53940 non-null float64
price      53940 non-null int64
x          53940 non-null float64
y          53940 non-null float64
z          53940 non-null float64
dtypes: category(3), float64(6), int64(1)
memory usage: 3.0 MB
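
As an alternative to the `dtype` argument, the column types can also be set after the file has been read, using `astype`. The sketch below is a minimal illustration on a few hand-typed rows laid out like the diamonds columns. The names `sample_csv`, `categorical_columns` and `sample` are assumptions made for this example and do not appear in the original notebook.

```python
import io
import pandas as pd

# Illustrative rows only; the real notebook reads diamonds_filepath.
sample_csv = io.StringIO(
    ",carat,cut,color,clarity,depth,table,price,x,y,z\n"
    "1,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43\n"
    "2,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31\n"
    "3,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31\n"
)

categorical_columns = ['cut', 'color', 'clarity']

sample = (
    pd.read_csv(sample_csv)
      .drop(columns=['Unnamed: 0'])  # drop the unnamed first column
      .astype({name: 'category' for name in categorical_columns})
)

print(sample.dtypes)

# The levels of a categorical column are available through the .cat accessor.
print(sample['cut'].cat.categories.tolist())
```

Setting the types inside `read_csv`, as the notebook does, avoids building the intermediate `object` columns first. The `astype` form may be more convenient when the dataframe already exists.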


__The End__