# `pandas` - Read data files

This notebook demonstrates the `read_csv`, `read_excel`, and `read_json` functions from pandas.

## Contents
1. Setup
1. `read_csv` & fixing errors
1. `read_excel`
1. `read_json`

## 1. Setup

Load the `pandas` Python library.

In [6]:
%python
import pandas as pd

In [7]:
%sh ls -hot /dbfs/mnt/datalab-datasets/538_data/mad-men

In [8]:
%sh ls -hot /dbfs/mnt/datalab-datasets/538_data/mad-men/*

Notice that the output above changes depending on whether the asterisk is included in the filepath of the `ls` command.

The asterisk is required in order to produce a list of full filepaths.
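The same distinction can be reproduced in Python. The following is only a rough sketch: the temporary directory and the file names are made up for illustration and are not the DBFS mount used above. `os.listdir` plays the role of `ls` without the asterisk, and `glob.glob` plays the role of `ls` with it.

```python
import glob
import os
import tempfile

# Hypothetical stand-in for the data directory: a temp folder with two files.
with tempfile.TemporaryDirectory() as data_dir:
    for name in ("show-data.csv", "other-file.csv"):
        open(os.path.join(data_dir, name), "w").close()

    # Like `ls <dir>`: bare file names only.
    names = sorted(os.listdir(data_dir))

    # Like `ls <dir>/*`: the pattern is expanded into paths that
    # include the directory.
    paths = sorted(glob.glob(os.path.join(data_dir, "*")))

    print(names)
    print(paths)
```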

## 2. `read_csv` & fixing errors

Attempt to read the `show-data.csv` file with the pandas `read_csv` function. 

It produces an error, which we will fix below.

In [12]:
%python # error encountered
mad_men_show_data_pdf = pd.read_csv('/dbfs/mnt/datalab-datasets/538_data/mad-men/show-data.csv')
mad_men_show_data_pdf

Notice that the error above is also repeated below when simply displaying the file with the `cat` command.

In [14]:
%sh # error encountered
cat /dbfs/mnt/datalab-datasets/538_data/mad-men/show-data.csv

In this case, when encountering an unknown error, I search for posts about the error on Google. 

Search for: "read_csv pandas UnicodeDecodeError utf-8 codec can't decode byte  in position invalid continuation byte". 

Notice I added "read_csv" and "pandas" to the search string. 
I removed `0xd5` and `3032` as these are specific to our file and won't help the search find other posts that solve the same problem, but with a different file. 

The search found: https://stackoverflow.com/questions/18171739/unicodedecodeerror-when-reading-csv-file-in-pandas-with-python

This link suggests trying `encoding = "ISO-8859-1"` as a parameter to `read_csv` (it works, see below).

In [16]:
%python
mad_men_show_data_pdf = pd.read_csv('/dbfs/mnt/datalab-datasets/538_data/mad-men/show-data.csv', 
                                    encoding = 'ISO-8859-1')
mad_men_show_data_pdf.info()

The command did work. 
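To see the encoding problem in isolation, here is a minimal sketch on a made-up two-line CSV held in memory (not the Mad Men file). The performer name is invented; the only thing borrowed from the real file is the byte `0xd5`.

```python
import io
import pandas as pd

# Made-up CSV bytes; 0xd5 followed by "H" is not a valid UTF-8 sequence.
raw = b"performer,show\nPat O\xd5Hara,Example Show\n"

try:
    pd.read_csv(io.BytesIO(raw))  # the default encoding is UTF-8
    utf8_failed = False
except UnicodeDecodeError as err:
    utf8_failed = True
    print("UTF-8 failed:", err.reason)

# ISO-8859-1 maps every possible byte to a character, so decoding cannot fail.
latin1_pdf = pd.read_csv(io.BytesIO(raw), encoding="ISO-8859-1")
print(latin1_pdf)
```

Because ISO-8859-1 never raises a decoding error, a successful read does not prove that the characters are the ones the author of the file intended. In this sketch the byte `0xd5` is decoded as `Õ`, so it is worth looking at a few affected values after reading.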

Notice the column names have non-alphabetic characters (including spaces).
The next step is to read in the columns with better names.

The `names` parameter specifies the column names to use (in the dataframe returned by `read_csv`).

In [19]:
%python
mad_men_show_data_pdf = pd.read_csv('/dbfs/mnt/datalab-datasets/538_data/mad-men/show-data.csv', 
                                    names=['performer', 'show', 'show_start', 'show_end',
                                           'status', 'char_end', 'years_since', 'lead', 
                                           'support', 'shows', 'score', 'score_y',
                                           'lead_notes', 'support_notes', 'show_notes'],
                                    header=0, 
                                    encoding = 'ISO-8859-1')
mad_men_show_data_pdf.info()

Notice the column names are better, but that we needed to specify, with the `header` parameter, that row `0` contains the original column names, so that they are replaced rather than read as data.
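The interaction between `names` and `header` is easier to see on a tiny example. This sketch uses a made-up CSV string (the column names and rows are invented) and reads it twice, with and without `header=0`.

```python
import io
import pandas as pd

# Made-up CSV whose header row uses awkward column names.
csv_text = "Performer Name,Show Start\nA,2001\nB,2005\n"
new_names = ["performer", "show_start"]

# header=0 says "row 0 is a header"; names= then replaces it.
replaced = pd.read_csv(io.StringIO(csv_text), names=new_names, header=0)

# Without header=0, the old header row is kept as the first data row,
# which also forces show_start to be read as text.
kept = pd.read_csv(io.StringIO(csv_text), names=new_names)

print(replaced)
print(kept)
```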

The column types are in the right hand column of the output from the `info` method.
If a column is read in as a `float64` or `int64` then we can assume it was read correctly. 

To check for columns that were not read correctly we only need to select columns with `object` type. 
The next command does this.

In [22]:
mad_men_show_data_pdf \
  .select_dtypes(include=['object']) \
  .head()

Notice that `show_end` and `score_y` should be numbers.

The following command specifies both `show_start` and `show_end` as integers. We leave `score_y` until later.

In [24]:
%python # error encountered
import numpy as np
mad_men_show_data_pdf = pd.read_csv('/dbfs/mnt/datalab-datasets/538_data/mad-men/show-data.csv', 
                                    names=['performer', 'show', 'show_start', 'show_end',
                                           'status', 'char_end', 'years_since', 'lead', 
                                           'support', 'shows', 'score', 'score_y',
                                           'lead_notes', 'support_notes', 'show_notes'],
                                    dtype={'show_start': np.int64, 'show_end': np.int64},
                                    header=0, 
                                    encoding = 'ISO-8859-1')
mad_men_show_data_pdf.info()

An error is encountered, indicating that `PRESENT` is an invalid value for an integer. 

So we specify `PRESENT` to indicate a missing value. This should work.

In [26]:
%python # error encountered
import numpy as np
mad_men_show_data_pdf = pd.read_csv('/dbfs/mnt/datalab-datasets/538_data/mad-men/show-data.csv', 
                                    names=['performer', 'show', 'show_start', 'show_end',
                                           'status', 'char_end', 'years_since', 'lead', 
                                           'support', 'shows', 'score', 'score_y',
                                           'lead_notes', 'support_notes', 'show_notes'],
                                    dtype={'show_start': np.int64, 'show_end': np.int64},
                                    header=0, 
                                    na_values=['PRESENT'],
                                    encoding = 'ISO-8859-1')
mad_men_show_data_pdf.info()

But this does not work. So we check Google again for the error.

Searching for "ValueError: Integer column has NA values in column 3" yields this link:
- https://stackoverflow.com/questions/21287624/convert-pandas-column-containing-nans-to-dtype-int

The post indicates that integer columns cannot hold missing values in pandas and that such columns should be typed as floats instead (here, `np.float32`). 
Both columns (`show_start` and `show_end`) were coded this way for consistency. 
It's awkward, but it works.
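The behavior can be reproduced on a small in-memory example. This is a sketch on made-up rows, assuming only that `PRESENT` appears in an otherwise numeric column, as in the file above.

```python
import io
import numpy as np
import pandas as pd

csv_text = "show,show_end\nShow A,2010\nShow B,PRESENT\n"  # made-up rows

# Requesting int64 fails once PRESENT is turned into a missing value.
try:
    pd.read_csv(io.StringIO(csv_text), dtype={"show_end": np.int64},
                na_values=["PRESENT"])
    int_failed = False
except ValueError as err:
    int_failed = True
    print("int64 failed:", err)

# A float column can hold NaN, so the same read succeeds.
float_pdf = pd.read_csv(io.StringIO(csv_text), dtype={"show_end": np.float32},
                        na_values=["PRESENT"])
print(float_pdf.dtypes)
```

Newer pandas releases (0.24 and later) also provide a nullable integer type, `Int64`, which may be an alternative to floats. It is not available in the `0.19.2` version used in this notebook.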

In [29]:
%python
import numpy as np
mad_men_show_data_pdf = pd.read_csv('/dbfs/mnt/datalab-datasets/538_data/mad-men/show-data.csv', 
                                    names=['performer', 'show', 'show_start', 'show_end',
                                           'status', 'char_end', 'years_since', 'lead', 
                                           'support', 'shows', 'score', 'score_y',
                                           'lead_notes', 'support_notes', 'show_notes'],
                                    dtype={'show_start': np.float32, 'show_end': np.float32},
                                    header=0, 
                                    na_values=['PRESENT'],
                                    encoding = 'ISO-8859-1')
mad_men_show_data_pdf.info()

Recall that `score_y` should be a `float`. The command below specifies this in the `dtype` parameter.

In [31]:
%python # error encountered
import numpy as np
mad_men_show_data_pdf = pd.read_csv('/dbfs/mnt/datalab-datasets/538_data/mad-men/show-data.csv', 
                                    names=['performer', 'show', 'show_start', 'show_end',
                                           'status', 'char_end', 'years_since', 'lead', 
                                           'support', 'shows', 'score', 'score_y',
                                           'lead_notes', 'support_notes', 'show_notes'],
                                    dtype={'show_start': np.float32, 
                                           'show_end': np.float32,
                                           'score_y': np.float32},
                                    header=0, 
                                    na_values=['PRESENT'],
                                    encoding = 'ISO-8859-1')
mad_men_show_data_pdf.info()

The error indicates that `#DIV/0!` is a value in the `score_y` field. So this value is included in the `na_values` parameter (below).
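As a small illustration of what `na_values` does here, the sketch below reads a made-up CSV string (not the Mad Men file) twice: once with no help, and once with the stray spreadsheet text listed as a missing value.

```python
import io
import numpy as np
import pandas as pd

csv_text = "show,score_y\nShow A,1.5\nShow B,#DIV/0!\n"  # made-up rows

# With no dtype given, the stray text makes the whole column non-numeric.
inferred = pd.read_csv(io.StringIO(csv_text))

# Listing the text in na_values turns it into NaN, so float32 is accepted.
cleaned = pd.read_csv(io.StringIO(csv_text),
                      dtype={"score_y": np.float32},
                      na_values=["PRESENT", "#DIV/0!"])

print(inferred.dtypes)
print(cleaned)
```

The values in the `na_values` list apply to every column, so one list can cover both `PRESENT` and `#DIV/0!`.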

In [33]:
%python
import numpy as np
mad_men_show_data_pdf = pd.read_csv('/dbfs/mnt/datalab-datasets/538_data/mad-men/show-data.csv', 
                                    names=['performer', 'show', 'show_start', 'show_end',
                                           'status', 'char_end', 'years_since', 'lead', 
                                           'support', 'shows', 'score', 'score_y',
                                           'lead_notes', 'support_notes', 'show_notes'],
                                    dtype={'show_start': np.float32, 
                                           'show_end': np.float32,
                                           'score_y': np.float32},
                                    header=0, 
                                    na_values=['PRESENT','#DIV/0!'],
                                    encoding = 'ISO-8859-1')
mad_men_show_data_pdf.info()

Great. All columns are read with acceptable datatypes. The following command displays the first few records for `object` columns.

In [35]:
%python
import numpy as np
mad_men_show_data_pdf = pd.read_csv('/dbfs/mnt/datalab-datasets/538_data/mad-men/show-data.csv', 
                                    names=['performer', 'show', 'show_start', 'show_end',
                                           'status', 'char_end', 'years_since', 'lead', 
                                           'support', 'shows', 'score', 'score_y',
                                           'lead_notes', 'support_notes', 'show_notes'],
                                    dtype={'show_start': np.float32, 
                                           'show_end': np.float32,
                                           'score_y': np.float32},
                                    header=0, 
                                    na_values=['PRESENT','#DIV/0!'],
                                    encoding = 'ISO-8859-1')
mad_men_show_data_pdf \
  .select_dtypes(include=['object']) \
  .head(5)

One final modification is to check on the `status` column. It is likely that it should have datatype of `category`.

First, check the unique values of this column.

In [37]:
mad_men_show_data_pdf.status.value_counts()

Now change the datatype for the `status` column with the `dtype` parameter.

In [39]:
%python
import numpy as np
mad_men_show_data_pdf = pd.read_csv('/dbfs/mnt/datalab-datasets/538_data/mad-men/show-data.csv', 
                                    names=['performer', 'show', 'show_start', 'show_end',
                                           'status', 'char_end', 'years_since', 'lead', 
                                           'support', 'shows', 'score', 'score_y',
                                           'lead_notes', 'support_notes', 'show_notes'],
                                    dtype={'show_start': np.float32, 
                                           'show_end': np.float32,
                                           'score_y': np.float32,
                                           'status': 'category'},
                                    header=0, 
                                    na_values=['PRESENT','#DIV/0!'],
                                    encoding = 'ISO-8859-1')
mad_men_show_data_pdf \
  .info()

Great. Now, all columns really are read with acceptable datatypes.
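For a feel of what the `category` datatype stores, here is a minimal sketch on made-up rows. The status labels below are invented for illustration and are not taken from the file.

```python
import io
import pandas as pd

# Made-up rows; the status labels are invented.
csv_text = "performer,status\nA,END\nB,LEFT\nC,END\nD,END\n"

as_category = pd.read_csv(io.StringIO(csv_text), dtype={"status": "category"})

# A categorical column stores each distinct label once, plus one
# small integer code per row that points at a label.
print(as_category["status"].cat.categories.tolist())
print(as_category["status"].cat.codes.tolist())
print(as_category["status"].value_counts())
```

This is why `category` suits columns such as `status` that repeat a handful of values, and suits free-text columns such as the notes fields much less.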

## 3. `read_excel`

List the spreadsheets in the `per_diem` directory.

In [43]:
%sh ls -hot /dbfs/mnt/datalab-datasets/per_diem/*.xls

The following command uses the pandas function `read_excel` to read in the spreadsheet.

In [45]:
%python
import pandas as pd
may_2017_pdf = pd.read_excel('/dbfs/mnt/datalab-datasets/per_diem/May2017PD.xls')
may_2017_pdf.info()

Notice the column names (they are in mixed case and contain spaces) and notice the columns with type `object`. 

We fix the column names with the `names` parameter.

In [47]:
%python
import pandas as pd
may_2017_pdf = pd.read_excel('/dbfs/mnt/datalab-datasets/per_diem/May2017PD.xls',
                             names=['country_name', 'location', 'season_code', 'season_start_date', 
                                    'season_end_date', 'lodging', 'meals_incidentals', 'per_diem', 
                                    'effective_date', 'footnote_reference', 'location_code']
                            )
may_2017_pdf.info()

In [48]:
may_2017_pdf \
  .select_dtypes(include=['object']) \
  .head()

The first three columns should have datatype `category`. Make it so, Number One.

In [50]:
%python
import pandas as pd
may_2017_pdf = pd.read_excel('/dbfs/mnt/datalab-datasets/per_diem/May2017PD.xls',
                             names=['country_name', 'location', 'season_code', 'season_start_date', 
                                    'season_end_date', 'lodging', 'meals_incidentals', 'per_diem', 
                                    'effective_date', 'footnote_reference', 'location_code'],
                             dtypes={'country_name': 'category',
                                     'location': 'category',
                                     'season_code': 'category'}
                            )
may_2017_pdf.info()

Note that the cell above spells the parameter `dtypes`; the parameter is actually named `dtype`. The next cell uses the correct name, but it produces an error.

In [51]:
%python # error encountered
import pandas as pd
may_2017_pdf = pd.read_excel('/dbfs/mnt/datalab-datasets/per_diem/May2017PD.xls',
                             names=['country_name', 'location', 'season_code', 'season_start_date', 
                                    'season_end_date', 'lodging', 'meals_incidentals', 'per_diem', 
                                    'effective_date', 'footnote_reference', 'location_code'],
                             dtype={'country_name': 'category',
                                    'location': 'category',
                                    'season_code': 'category'}
                            )
may_2017_pdf.info()

Checking the error message on Google reveals that we should use the `converters` parameter. See [Stack Overflow](https://stackoverflow.com/questions/32591466/python-pandas-how-to-specify-data-types-when-reading-an-excel-file).

Further checking [the documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_excel.html) reveals that the `dtype` parameter was added in version `0.20.0` and we are running `0.19.2`. (See below.)

In [53]:
pd.__version__
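
A check like this can also be done in code. The sketch below is one possible approach, not something from the pandas API: `version_tuple` is a hypothetical helper written for this example, and the `0.20.0` threshold is taken from the documentation cited above.

```python
import re
import pandas as pd

def version_tuple(version):
    """Turn a version string such as '0.19.2' into (0, 19, 2) for comparison."""
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)  # keep leading digits, drop suffixes like "rc1"
        parts.append(int(match.group()) if match else 0)
    return tuple(parts)

# Per the documentation cited above, read_excel gained `dtype` in 0.20.0.
DTYPE_ADDED_IN = (0, 20, 0)

has_excel_dtype = version_tuple(pd.__version__) >= DTYPE_ADDED_IN
print(pd.__version__, "supports read_excel(dtype=...):", has_excel_dtype)
```

Comparing tuples of integers avoids the trap of comparing version strings directly, where `'0.9'` would sort after `'0.19'`.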

There seems to be little documentation on the `converters` parameter. Because of this, I use the `assign` method to change the column types after reading the spreadsheet with `read_excel`, as follows.

In [55]:
%python
import pandas as pd
may_2017_pdf = \
pd.read_excel('/dbfs/mnt/datalab-datasets/per_diem/May2017PD.xls',
              names=['country_name', 'location', 'season_code', 'season_start_date', 
                     'season_end_date', 'lodging', 'meals_incidentals', 'per_diem', 
                     'effective_date', 'footnote_reference', 'location_code']
             ) \
  .assign(country_name=lambda df: df.country_name.astype('category'),
          location    =lambda df: df.location.astype('category'),
          season_code =lambda df: df.season_code.astype('category')
         )
        
may_2017_pdf.info()

See the documentation on the `assign` method:
- https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.assign.html
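
The `assign` pattern used above can be tried without an Excel file. This sketch builds a small dataframe from made-up values (the country names, locations, and lodging rates are invented) and applies the same kind of lambdas.

```python
import pandas as pd

# Made-up rows standing in for the spreadsheet (no Excel file is read here).
per_diem_pdf = pd.DataFrame({
    "country_name": ["A", "A", "B"],
    "location": ["X", "Y", "Z"],
    "lodging": [100, 120, 90],
})

# assign returns a new dataframe; each lambda receives the dataframe
# and returns the replacement column.
typed_pdf = per_diem_pdf.assign(
    country_name=lambda df: df.country_name.astype("category"),
    location=lambda df: df.location.astype("category"),
)

print(per_diem_pdf.dtypes)  # the original dataframe is left untouched
print(typed_pdf.dtypes)
```

Using lambdas rather than referring to a named dataframe is what lets `assign` be chained directly onto the result of `read_excel`, before that result has been given a name.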

Make one last check on the columns of type `object`.

In [58]:
%python
may_2017_pdf \
  .select_dtypes(include=['object']) \
  .head()

That looks like a reasonable `object` column. The `may_2017_pdf` dataframe is ready to be analyzed.

## 4. `read_json`

List the JSON files in the `/dbfs/mnt/datalab-datasets/JSON` directory.

In [62]:
%sh ls -hot /dbfs/mnt/datalab-datasets/JSON

The first example uses the `zips.json` file. It only contains three records, which are displayed below.

In [64]:
%sh head -n 3 /dbfs/mnt/datalab-datasets/JSON/zips.json

Read the file using `read_json`. The `lines=True` parameter indicates that there should be one JSON object per line. These JSON objects correspond to rows/records. The key names correspond to column names.

In [66]:
%python
zips_pdf = pd.read_json('/dbfs/mnt/datalab-datasets/JSON/zips.json',
                        lines=True)
zips_pdf.info()
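
To make the line-delimited layout concrete, here is a small sketch that reads made-up records from a string. The key names are modeled on the columns discussed in this section and the values are invented; none of it is copied from `zips.json`.

```python
import io
import pandas as pd

# Made-up records in the one-object-per-line layout (values are illustrative).
json_lines = (
    '{"city": "ALPHA", "loc": [-72.6, 42.1], "pop": 100, "state": "MA"}\n'
    '{"city": "BETA", "loc": [-72.1, 42.4], "pop": 200, "state": "MA"}\n'
    '{"city": "GAMMA", "loc": [-73.0, 42.2], "pop": 300, "state": "NY"}\n'
)

# Each line becomes one row; each key becomes one column.
demo_pdf = pd.read_json(io.StringIO(json_lines), lines=True)

print(demo_pdf.dtypes)
print(type(demo_pdf["loc"].iloc[0]))  # each cell of this column holds a Python list
```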

List the first five records of the dataframe. Notice the `loc` column. It seems to contain a list, but has the `object` datatype.

In [68]:
zips_pdf.head()

In [69]:
%python 
zips_pdf.loc

The `loc` indexer masks the `loc` column, so the column should be renamed. The `read_json` function has no `names` parameter so we use the `rename` method. See the documentation for details on the `rename` method: 
- https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rename.html
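
The masking can be seen on a one-row dataframe. This is a minimal sketch with made-up values; only the column name `loc` and the new name `lon_lat` come from this notebook.

```python
import pandas as pd

demo_pdf = pd.DataFrame({"city": ["ALPHA"], "loc": [[-72.6, 42.1]]})  # made-up row

# Attribute access finds the built-in .loc indexer, not the column.
print(type(demo_pdf.loc).__name__)

# Bracket access still reaches the column.
print(demo_pdf["loc"].iloc[0])

# rename returns a new dataframe in which attribute access works.
renamed_pdf = demo_pdf.rename(columns={"loc": "lon_lat"})
print(renamed_pdf.lon_lat.iloc[0])
```

So the column is not lost when it is called `loc`; it is just unreachable with dot notation, which matters for the lambda style used with `assign` in this notebook.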

In [71]:
%python
zips_pdf = \
pd.read_json('/dbfs/mnt/datalab-datasets/JSON/zips.json',
             lines=True) \
  .rename(columns={'loc':'lon_lat'})
zips_pdf.info()

The `city` and `state` columns should have datatype `category`. The `dtype` parameter didn't work so the `assign` method is used.

In [73]:
%python
zips_pdf = \
pd.read_json('/dbfs/mnt/datalab-datasets/JSON/zips.json',
             lines=True) \
  .rename(columns={'loc':'lon_lat'}) \
  .assign(city =lambda df: df.city .astype('category'),
          state=lambda df: df.state.astype('category')
         )

zips_pdf.info()

The `lon_lat` column should be split into two columns, but that will have to wait until later.
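
As a preview, one possible way to do the split is sketched below on made-up rows. It assumes each cell holds a two-item list with longitude first and latitude second, which is only a guess based on the column name; the names `lon` and `lat` are chosen for this example.

```python
import pandas as pd

# Made-up rows; assumes [longitude, latitude] order in each list.
zips_demo_pdf = pd.DataFrame({
    "city": ["ALPHA", "BETA"],
    "lon_lat": [[-72.6, 42.1], [-72.1, 42.4]],
})

# Expand each two-item list into one row of a small two-column dataframe,
# keeping the original index so the rows line up.
coords = pd.DataFrame(zips_demo_pdf["lon_lat"].tolist(),
                      index=zips_demo_pdf.index,
                      columns=["lon", "lat"])

# Swap the list column for the two numeric columns.
split_pdf = zips_demo_pdf.drop(columns="lon_lat").join(coords)
print(split_pdf)
```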

__The End__