# Read Per Diem Dataset in Python with pandas

## Introduction
This notebook imports the Per Diem dataset. This entails:
2. Reading the datafile (into a dataframe)
3. Checking the datatypes (of each column in the dataframe)
4. Setting these datatypes (if they were not initially read correctly)

The sections of this notebook (listed below, except the Setup section) correspond to each step. 
Whether the columns of the Per Diem dataset are read with the correct datatypes can be checked from the `info()` output in section 2.
Other notebooks may require more work to set the column datatypes correctly.

## Contents
1. Setup
2. Read datafile
3. Check column types
4. Set column types

## 1. Setup

In [5]:
%sh ls /dbfs/mnt/datalab-datasets/per_diem/*.xls

In [6]:
%python
per_diem_filepath = '/dbfs/mnt/datalab-datasets/per_diem/February2018PD.xls'

## 2. Read using `read_excel` from `pandas` (Python)

The pandas module provides many functions for working with (including reading) dataframes in Python.
In particular, we use the `read_excel` function from pandas, because the Per Diem datafiles are Excel spreadsheets (`.xls`).
For details on the function see:
- http://pandas.pydata.org/pandas-docs/version/0.19/generated/pandas.read_excel.html

Note the version of pandas used on the cluster, since the documentation linked above is specific to version 0.19.
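
A quick way to see which version is installed is to print the `__version__` attribute of the module. This is a minimal sketch; the value printed depends on the cluster, so no expected output is shown.

```python
import pandas as pd

# Print the installed pandas version so it can be compared with the
# version number that appears in the documentation URL.
print(pd.__version__)
```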

The code cell below uses the `read_excel` function to read the Per Diem spreadsheet (`February2018PD.xls`) into a pandas dataframe.

The `info()` method is used to display the datatypes of the columns.

In [10]:
%python
import pandas as pd
pd.read_excel(per_diem_filepath) \
  .info()

__Exercise:__ create a function called `get_per_diem_dataframe` that has one input parameter `filepath` and returns the pandas dataframe returned by the `read_excel` function for the given value of `filepath`. For instance, 

`get_per_diem_dataframe('/dbfs/mnt/datalab-datasets/per_diem/February2018PD.xls')` 

would return the dataframe 

`pd.read_excel('/dbfs/mnt/datalab-datasets/per_diem/February2018PD.xls')`
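
One possible solution sketch for this exercise is shown below. It assumes the default `read_excel` options (first sheet, first row used as the header) are appropriate for the Per Diem spreadsheets, which is not verified here.

```python
import pandas as pd

def get_per_diem_dataframe(filepath):
    """Read one Per Diem spreadsheet into a pandas dataframe."""
    # Assumption: the defaults of read_excel (first sheet, header in the
    # first row) suit the Per Diem files.
    return pd.read_excel(filepath)
```

Calling `get_per_diem_dataframe(per_diem_filepath)` on the cluster would then return the same dataframe as the `read_excel` call in the cell above.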

__Exercise:__ modify the function `get_per_diem_dataframe` so that the column names of the dataframe it returns are in lower case with spaces replaced with the underscore (`_`).
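
A minimal sketch of one way to do this uses the `rename` method with a function applied to each column name. The helper name `clean_column_names` and the example column names are hypothetical and chosen only for illustration; they are not taken from the Per Diem files.

```python
import pandas as pd

def clean_column_names(df):
    """Return a copy of df with lower-case column names and spaces replaced by underscores."""
    return df.rename(columns=lambda name: str(name).lower().replace(' ', '_'))

def get_per_diem_dataframe(filepath):
    """Read one Per Diem spreadsheet and normalise its column names."""
    return clean_column_names(pd.read_excel(filepath))

# Demonstration on a small in-memory dataframe (hypothetical column names).
example_df = pd.DataFrame({'Country Name': ['X'], 'Max Rate': [1]})
print(clean_column_names(example_df).columns.tolist())  # ['country_name', 'max_rate']
```

Keeping the renaming in a separate helper makes it possible to try it out without access to the spreadsheets.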

__Exercise:__ find the number of rows in the resulting dataframe from the `get_per_diem_dataframe` function, for one of the files.
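
As a hint, the row count of any pandas dataframe is available through `len` or through the first element of its `shape` attribute. The sketch below uses a small in-memory dataframe with hypothetical columns and values as a stand-in for the output of `get_per_diem_dataframe`.

```python
import pandas as pd

# Stand-in for a Per Diem dataframe (hypothetical columns and values).
per_diem_df = pd.DataFrame({'country': ['A', 'B', 'C'], 'rate': [100, 120, 90]})

n_rows = len(per_diem_df)          # number of rows
n_rows_alt = per_diem_df.shape[0]  # same value, from the (rows, columns) tuple
print(n_rows)  # 3
```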

__Exercise:__ using the pandas `concat` function, union two dataframes produced by the `get_per_diem_dataframe` function (with two files as input).

This means the resulting (unioned) dataframe will have the same number of columns and its rows include rows from each dataframe.
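
The union can be sketched as follows, using two small in-memory dataframes (hypothetical columns and values) in place of the dataframes read from two spreadsheets. Passing `ignore_index=True` is a choice made here so that the row labels of the result run from 0 upwards instead of repeating.

```python
import pandas as pd

# Stand-ins for two Per Diem dataframes with the same columns.
first_df = pd.DataFrame({'country': ['A', 'B', 'C'], 'rate': [100, 120, 90]})
second_df = pd.DataFrame({'country': ['D', 'E'], 'rate': [80, 110]})

# Stack the rows of both dataframes; the columns stay the same.
unioned_df = pd.concat([first_df, second_df], ignore_index=True)
print(unioned_df.shape)  # (5, 2)
```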

In [15]:
%python
from glob import glob
per_diem_filepath_list = glob('/dbfs/mnt/datalab-datasets/per_diem/*.xls')

__Exercise:__ the code cell above creates a list of filepaths (one filepath for each spreadsheet in the directory).
Use a list comprehension or the `map` function to create a list of dataframes (one for each filepath) using the `get_per_diem_dataframe` function.

__Exercise:__ create a single dataframe from all of the dataframes (in the list created above) using the pandas `concat` function.
This single dataframe should have the same columns as the individual dataframes.
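
The last two exercises can be combined into one sketch: a list comprehension builds the list of dataframes and `concat` stacks them. The function name `read_all_per_diem` and its `reader` parameter are hypothetical, introduced here only so the logic can be tried with any reading function.

```python
import pandas as pd

def get_per_diem_dataframe(filepath):
    """Read one Per Diem spreadsheet into a pandas dataframe."""
    return pd.read_excel(filepath)

def read_all_per_diem(filepath_list, reader=get_per_diem_dataframe):
    """Read every file in filepath_list and stack the rows into one dataframe."""
    # One dataframe per filepath.
    dataframe_list = [reader(filepath) for filepath in filepath_list]
    # concat raises a ValueError if the list is empty.
    return pd.concat(dataframe_list, ignore_index=True)
```

On the cluster, `read_all_per_diem(per_diem_filepath_list)` would return a single dataframe, assuming every spreadsheet shares the same columns. The equivalent with `map` is `list(map(get_per_diem_dataframe, per_diem_filepath_list))`.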

__The End__