# Read Iris Dataset in Python with pandas

## Introduction
This notebook imports the `iris` dataset. This entails:
1. Reading the datafile (into a dataframe)
2. Checking the datatypes (of each column in the dataframe)
3. Setting these datatypes (if they were not initially read correctly)

Apart from the Setup section, the sections of this notebook (listed below) each correspond to one of these steps.
Note that the numeric columns of the Iris dataset are read correctly by default; only the `Name` column needs its datatype set explicitly.
Other datasets require more work to set the column datatypes correctly.

## Contents
1. Setup
2. Read datafile
3. Check column types
4. Set column types

## 1. Setup

In [0]:
iris_filepath = 'https://raw.githubusercontent.com/datalab-datasets/file-samples/master/iris.csv'

## 2. Read using `read_csv` from `pandas` (Python)

The pandas module provides many functions for working with (including reading) dataframes in Python.
In particular, we use the `read_csv` function from pandas.
For details on the function, see:
- http://pandas.pydata.org/pandas-docs/version/0.19/generated/pandas.read_csv.html

Note the version of pandas installed on the cluster. The documentation linked above is for version 0.19, so some details may differ.

In [10]:
import pandas as pd
pd.__version__

'0.25.3'

The code cell below uses the `read_csv` function to read the `iris.csv` datafile into a pandas dataframe.

The `info()` method is used to display the datatypes of the columns.

In [11]:
import pandas as pd
pd.read_csv(iris_filepath) \
  .info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
SepalLength    150 non-null float64
SepalWidth     150 non-null float64
PetalLength    150 non-null float64
PetalWidth     150 non-null float64
Name           150 non-null object
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


Notice that the four numeric columns are all read correctly as type `float64`. 

The `Name` column, though, is a categorical variable and should be read as type `category`, not `object`. Passing the `dtype` argument to `read_csv` sets this type when the file is read.
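
As a complement to reading the `info()` output by eye, the column types can also be checked programmatically. The sketch below is a minimal illustration and not part of the original notebook. It reads a few inline rows (an assumption standing in for `iris.csv`, so that it runs without network access), and the names `sample_csv`, `float_columns`, and `other_columns` are hypothetical.

```python
import io
import pandas as pd

# A few illustrative rows with the same column names as iris.csv
# (an inline stand-in, not the real datafile).
sample_csv = io.StringIO(
    "SepalLength,SepalWidth,PetalLength,PetalWidth,Name\n"
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "6.3,3.3,6.0,2.5,Iris-virginica\n"
)

iris = pd.read_csv(sample_csv)

# The dtypes attribute holds one datatype per column.
print(iris.dtypes)

# Split the columns into those read as floats and the rest.
float_columns = [c for c in iris.columns
                 if pd.api.types.is_float_dtype(iris[c])]
other_columns = [c for c in iris.columns if c not in float_columns]
print(float_columns)
print(other_columns)
```

Any column that lands in `other_columns` is a candidate for an explicit datatype, as is the case for `Name` here.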

In [13]:
import pandas as pd
pd.read_csv(iris_filepath,
            dtype={'Name':'category'}) \
  .info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
SepalLength    150 non-null float64
SepalWidth     150 non-null float64
PetalLength    150 non-null float64
PetalWidth     150 non-null float64
Name           150 non-null category
dtypes: category(1), float64(4)
memory usage: 5.0 KB
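
If a dataframe has already been read, the column type can instead be set afterwards with `astype`. The following is a minimal sketch under the same assumption as before: it uses a few inline rows in place of `iris.csv`, and `sample_csv`, `name_dtype`, and `categories` are hypothetical names introduced only for illustration.

```python
import io
import pandas as pd

# A few illustrative rows with the same column names as iris.csv
# (an inline stand-in, not the real datafile).
sample_csv = io.StringIO(
    "SepalLength,SepalWidth,PetalLength,PetalWidth,Name\n"
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "6.3,3.3,6.0,2.5,Iris-virginica\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
)

# Read with default types, then convert the Name column after the fact.
iris = pd.read_csv(sample_csv)
iris['Name'] = iris['Name'].astype('category')

# Inspect the resulting datatype and the distinct category labels.
name_dtype = iris['Name'].dtype.name
categories = list(iris['Name'].cat.categories)
print(name_dtype)
print(categories)
```

Setting the type in `read_csv` via `dtype` avoids the intermediate column, while `astype` is convenient when the dataframe comes from somewhere other than a fresh read.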


__The End__