# `pandas` - Read data files

This notebook demonstrates the `read_csv`, `read_excel`, and `read_json` functions from pandas.

## Contents
1. Setup
1. `read_csv` & fixing errors
1. `read_excel`
1. `read_json`

## 1. Setup

In [14]:
%%sh
git clone https://github.com/datalab-datasets/538-mad-men

Cloning into '538-mad-men'...


In [15]:
%ls  /content/538-mad-men/*

/content/538-mad-men/performer-scores.csv  /content/538-mad-men/show-data.csv
/content/538-mad-men/README.md


Load the `pandas` Python library.

In [0]:
import pandas as pd

In [19]:
%ls -hot /content/538-mad-men/

total 48K
-rw-r--r-- 1 root 8.0K Jun 18 14:52 performer-scores.csv
-rw-r--r-- 1 root 2.0K Jun 18 14:52 README.md
-rw-r--r-- 1 root  35K Jun 18 14:52 show-data.csv


In [20]:
%ls -hot /content/538-mad-men/*

-rw-r--r-- 1 root 8.0K Jun 18 14:52 /content/538-mad-men/performer-scores.csv
-rw-r--r-- 1 root 2.0K Jun 18 14:52 /content/538-mad-men/README.md
-rw-r--r-- 1 root  35K Jun 18 14:52 /content/538-mad-men/show-data.csv


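Notice that the output above changes depending on whether the asterisk is included in the filepath of the `ls` command.

With the asterisk, the shell expands the pattern to one full filepath per file, so `ls` prints a list of full filepaths. Without it, `ls` lists the contents of the directory by file name only.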
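
The same distinction can be reproduced in Python. The following is a minimal sketch, assuming the repository was cloned to `/content/538-mad-men` as above. The helper names `list_names` and `list_full_paths` are hypothetical, and the standard-library `os` and `glob` modules stand in for the `ls` command.

```python
import glob
import os


def list_names(directory):
    """Roughly like `ls directory`: file names only, sorted."""
    return sorted(os.listdir(directory))


def list_full_paths(directory):
    """Roughly like `ls directory/*`: one full filepath per file, sorted."""
    return sorted(glob.glob(os.path.join(directory, '*')))


# Only runs if the clone from the Setup section exists (an assumption).
if os.path.isdir('/content/538-mad-men'):
    print(list_names('/content/538-mad-men'))
    print(list_full_paths('/content/538-mad-men'))
```

A list of full filepaths is convenient when each file will later be passed to a function such as `read_csv`.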

## 2. `read_csv` & fixing errors

Attempt to read the `show-data.csv` file with the pandas `read_csv` function. 

It produces an error, which we will fix below.

In [21]:
import pandas as pd
mad_men_show_data_pdf = pd.read_csv('/content/538-mad-men/show-data.csv')
mad_men_show_data_pdf

UnicodeDecodeError: ignored

Notice the error above is also repeated below when simply displaying the file with the `cat` command.

In [24]:
%%sh 
cat /content/538-mad-men/show-data.csv

Performer,Show,Show Start,Show End,Status?,CharEnd,Years Since,#LEAD,#SUPPORT,#Shows,Score,Score/Y,lead_notes,support_notes,show_notesSteven Hill,Law & Order,1990,2010,END,2000,15,0,0,0,0,0,,,Kelli Williams,The Practice,1997,2014,END,2003,12,0,1,6,6.25,0.520833333,,Any Day Now (2012),"Medical Investigation, Season 1; Lie To Me, Season 1-3; Army Wives, Season 6-7"LisaGay Hamilton,The Practice,1997,2014,END,2003,12,2,0,2,4,0.333333333,"Life of a King, 2014; Go For Sisters, 2013",,"Men of a Certain Age, Season 1-2"Lara Flynn Boyle,The Practice,1997,2014,END,2003,12,0,0,0,0,0,,,Dylan McDermott,The Practice,1997,2014,END,2004,11,2,7,6,9.75,0.886363636,"Olympus Has Fallen, 2013; Freezer, 2014","The Messengers, 2007; Unbeatable Harold, 2009; Burning Palms, 2011; The Perks of Being a Wallflower, 2012; Nobody Walks, 2012; Behaving Badly, 2014; Automata, 2014","Big Shots, Season 1; Dark Blue, Season 1-2; American Horror Story, Season 1; Hostages, Season 1; Stalker, Season 1"Camryn Manheim,

In this case, when encountering an unknown error, I search for posts about the error on Google. 

Search for: "read_csv pandas UnicodeDecodeError utf-8 codec can't decode byte  in position invalid continuation byte". 

Notice I added "read_csv" and "pandas" to the search string. 
I removed `0xd5` and `3032` as these are specific to our file and won't help the search find other posts that solve the same problem, but with a different file. 

The search found: https://stackoverflow.com/questions/18171739/unicodedecodeerror-when-reading-csv-file-in-pandas-with-python

This link suggests trying `encoding = "ISO-8859-1"` as a parameter to `read_csv` (it works, see below).

In [25]:
mad_men_show_data_pdf = pd.read_csv('/content/538-mad-men/show-data.csv', 
                                    encoding = 'ISO-8859-1')
mad_men_show_data_pdf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 248 entries, 0 to 247
Data columns (total 15 columns):
Performer        248 non-null object
Show             248 non-null object
Show Start       248 non-null int64
Show End         248 non-null object
Status?          248 non-null object
CharEnd          248 non-null int64
Years Since      248 non-null int64
#LEAD            248 non-null int64
#SUPPORT         248 non-null int64
#Shows           248 non-null int64
Score            248 non-null float64
Score/Y          248 non-null object
lead_notes       89 non-null object
support_notes    135 non-null object
show_notes       138 non-null object
dtypes: float64(1), int64(6), object(8)
memory usage: 29.1+ KB


The command did work. 
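
To see why the encoding matters, here is a minimal sketch on a made-up two-line CSV (the performer and show values are hypothetical, not taken from `show-data.csv`). It contains the byte `0xd5` from the error message, which is not valid UTF-8 at that position but is a valid character in ISO-8859-1.

```python
import io

import pandas as pd

# Hypothetical CSV bytes; 0xd5 is the byte reported in the error message.
raw = b"performer,show\nO\xd5Neil,Example Show\n"

# Decoding as UTF-8 fails, because 0xd5 must be followed by a continuation byte.
try:
    raw.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError as err:
    utf8_ok = False
    print(err.reason)

# ISO-8859-1 maps every byte to a character, so the same bytes can be read.
latin1_pdf = pd.read_csv(io.BytesIO(raw), encoding='ISO-8859-1')
print(latin1_pdf)
```

Because ISO-8859-1 assigns a character to every possible byte, reading with it never raises a decoding error. Whether the resulting characters are the ones the author of the file intended depends on the file's true encoding, so it is worth inspecting a few affected values.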

Notice the column names have non-alphabetic characters (including spaces).
The next step is to read in the columns with better names.

The `names` parameter specifies the column names to use (in the dataframe returned by `read_csv`).

In [26]:
mad_men_show_data_pdf = pd.read_csv('/content/538-mad-men/show-data.csv', 
                                    names=['performer', 'show', 'show_start', 'show_end',
                                           'status', 'char_end', 'years_since', 'lead', 
                                           'support', 'shows', 'score', 'score_y',
                                           'lead_notes', 'support_notes', 'show_notes'],
                                    header=0, 
                                    encoding = 'ISO-8859-1')
mad_men_show_data_pdf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 248 entries, 0 to 247
Data columns (total 15 columns):
performer        248 non-null object
show             248 non-null object
show_start       248 non-null int64
show_end         248 non-null object
status           248 non-null object
char_end         248 non-null int64
years_since      248 non-null int64
lead             248 non-null int64
support          248 non-null int64
shows            248 non-null int64
score            248 non-null float64
score_y          248 non-null object
lead_notes       89 non-null object
support_notes    135 non-null object
show_notes       138 non-null object
dtypes: float64(1), int64(6), object(8)
memory usage: 29.1+ KB


Notice the column names are better, but that we needed to specify, with the `header` parameter, that row `0` contains the original column names. Without it, that row would be read as data.
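
The interaction between `names` and `header` can be checked on a small in-memory file. This is a sketch with made-up rows; only the two original column names are borrowed from `show-data.csv`.

```python
import io

import pandas as pd

csv_text = "Show Start,#LEAD\n1990,0\n1997,2\n"
new_names = ['show_start', 'lead']

# header=0: row 0 holds the old names and is replaced by `names`.
replaced_pdf = pd.read_csv(io.StringIO(csv_text), names=new_names, header=0)

# No header argument: when `names` is given, row 0 is treated as data.
kept_pdf = pd.read_csv(io.StringIO(csv_text), names=new_names)

print(len(replaced_pdf))
print(len(kept_pdf))
print(kept_pdf.head(1))
```

In the second dataframe the old header row becomes the first record, which also prevents the columns from being read as numbers.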

The column types are in the right hand column of the output from the `info` method.
If a column is read in as a `float64` or `int64` then we can assume it was read correctly. 

To check for columns that were not read correctly we only need to select columns with `object` type. 
The next command does this.

In [27]:
mad_men_show_data_pdf \
  .select_dtypes(include=['object']) \
  .head()

Unnamed: 0,performer,show,show_end,status,score_y,lead_notes,support_notes,show_notes
0,Steven Hill,Law & Order,2010,END,0.0,,,
1,Kelli Williams,The Practice,2014,END,0.520833333,,Any Day Now (2012),"Medical Investigation, Season 1; Lie To Me, Se..."
2,LisaGay Hamilton,The Practice,2014,END,0.333333333,"Life of a King, 2014; Go For Sisters, 2013",,"Men of a Certain Age, Season 1-2"
3,Lara Flynn Boyle,The Practice,2014,END,0.0,,,
4,Dylan McDermott,The Practice,2014,END,0.886363636,"Olympus Has Fallen, 2013; Freezer, 2014","The Messengers, 2007; Unbeatable Harold, 2009;...","Big Shots, Season 1; Dark Blue, Season 1-2; Am..."


Notice that `show_end` and `score_y` should be numbers.

The following command specifies both `show_start` and `show_end` as integers. We leave `score_y` until later.

In [29]:
import numpy as np
mad_men_show_data_pdf = pd.read_csv('/content/538-mad-men/show-data.csv', 
                                    names=['performer', 'show', 'show_start', 'show_end',
                                             'status', 'char_end', 'years_since', 'lead', 
                                             'support', 'shows', 'score', 'score_y',
                                             'lead_notes', 'support_notes', 'show_notes'],
                                    dtype={'show_start': np.int64, 'show_end': np.int64},
                                    header=0, 
                                    encoding = 'ISO-8859-1')
mad_men_show_data_pdf.info()

ValueError: ignored

An error is encountered, indicating that `PRESENT` is an invalid value for an integer. 

So we use the `na_values` parameter to mark `PRESENT` as a missing value. This should work.

In [32]:
import numpy as np
mad_men_show_data_pdf = pd.read_csv('/content/538-mad-men/show-data.csv', 
                                    names=['performer', 'show', 'show_start', 'show_end',
                                             'status', 'char_end', 'years_since', 'lead', 
                                             'support', 'shows', 'score', 'score_y',
                                             'lead_notes', 'support_notes', 'show_notes'],
                                    dtype={'show_start': np.int64, 'show_end': np.int64},
                                    header=0, 
                                    na_values=['PRESENT'],
                                    encoding = 'ISO-8859-1')
mad_men_show_data_pdf.info()

ValueError: ignored

But this does not work. So we check Google again for the error.

Searching for "ValueError: Integer column has NA values in column 3" yields this link:
- https://stackoverflow.com/questions/21287624/convert-pandas-column-containing-nans-to-dtype-int

The post indicates that integer columns cannot hold missing values in pandas (a missing value is stored as `NaN`, which is a float) and that such columns should be retyped to a float type. Here we use `np.float32`.
Both columns (`show_start` and `show_end`) were coded this way for consistency.
It's awkward, but it works.

In [33]:
import numpy as np
mad_men_show_data_pdf = pd.read_csv('/content/538-mad-men/show-data.csv', 
                                    names=['performer', 'show', 'show_start', 'show_end',
                                           'status', 'char_end', 'years_since', 'lead', 
                                           'support', 'shows', 'score', 'score_y',
                                           'lead_notes', 'support_notes', 'show_notes'],
                                    dtype={'show_start': np.float32, 'show_end': np.float32},
                                    header=0, 
                                    na_values=['PRESENT'],
                                    encoding = 'ISO-8859-1')
mad_men_show_data_pdf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 248 entries, 0 to 247
Data columns (total 15 columns):
performer        248 non-null object
show             248 non-null object
show_start       248 non-null float32
show_end         219 non-null float32
status           248 non-null object
char_end         248 non-null int64
years_since      248 non-null int64
lead             248 non-null int64
support          248 non-null int64
shows            248 non-null int64
score            248 non-null float64
score_y          248 non-null object
lead_notes       89 non-null object
support_notes    135 non-null object
show_notes       138 non-null object
dtypes: float32(2), float64(1), int64(5), object(7)
memory usage: 27.2+ KB
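
As an alternative to a float column, pandas `0.24` introduced a nullable integer type, written `'Int64'` with a capital `I`. The following is a minimal sketch on made-up rows (the show named `Example Show` is hypothetical). It converts the column after reading rather than through the `dtype` parameter of `read_csv`, which is an assumption made to keep the example independent of the pandas version.

```python
import io

import pandas as pd

csv_text = "show,show_end\nLaw & Order,2010\nExample Show,PRESENT\n"

# PRESENT becomes a missing value, so show_end is first read as a float column.
shows_pdf = pd.read_csv(io.StringIO(csv_text), na_values=['PRESENT'])

# Convert to the nullable integer type, which keeps the missing value.
shows_pdf['show_end'] = shows_pdf['show_end'].astype('Int64')

print(shows_pdf.dtypes)
print(shows_pdf)
```

Whether a nullable integer column suits the later analysis steps is a judgment call; some operations may still expect plain NumPy types.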


Recall that `score_y` should be a `float`. The command below specifies this in the `dtype` parameter.

In [34]:
import numpy as np
mad_men_show_data_pdf = pd.read_csv('/content/538-mad-men/show-data.csv', 
                                    names=['performer', 'show', 'show_start', 'show_end',
                                           'status', 'char_end', 'years_since', 'lead', 
                                           'support', 'shows', 'score', 'score_y',
                                           'lead_notes', 'support_notes', 'show_notes'],
                                    dtype={'show_start': np.float32, 
                                           'show_end': np.float32,
                                           'score_y': np.float32},
                                    header=0, 
                                    na_values=['PRESENT'],
                                    encoding = 'ISO-8859-1')
mad_men_show_data_pdf.info()

ValueError: ignored

The error indicates that `#DIV/0!` is a value in the `score_y` field. So this value is included in the `na_values` parameter (below).

In [35]:
import numpy as np
mad_men_show_data_pdf = pd.read_csv('/content/538-mad-men/show-data.csv', 
                                    names=['performer', 'show', 'show_start', 'show_end',
                                           'status', 'char_end', 'years_since', 'lead', 
                                           'support', 'shows', 'score', 'score_y',
                                           'lead_notes', 'support_notes', 'show_notes'],
                                    dtype={'show_start': np.float32, 
                                           'show_end': np.float32,
                                           'score_y': np.float32},
                                    header=0, 
                                    na_values=['PRESENT','#DIV/0!'],
                                    encoding = 'ISO-8859-1')
mad_men_show_data_pdf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 248 entries, 0 to 247
Data columns (total 15 columns):
performer        248 non-null object
show             248 non-null object
show_start       248 non-null float32
show_end         219 non-null float32
status           248 non-null object
char_end         248 non-null int64
years_since      248 non-null int64
lead             248 non-null int64
support          248 non-null int64
shows            248 non-null int64
score            248 non-null float64
score_y          245 non-null float32
lead_notes       89 non-null object
support_notes    135 non-null object
show_notes       138 non-null object
dtypes: float32(3), float64(1), int64(5), object(6)
memory usage: 26.2+ KB
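
The list passed to `na_values` applies to every column. If a marker such as `PRESENT` could be a legitimate value elsewhere, `na_values` also accepts a dictionary keyed by column name. The following sketch uses made-up rows to show the difference; the show named `PRESENT` is hypothetical and exists only to demonstrate that it is left untouched.

```python
import io

import pandas as pd

csv_text = (
    "show,show_end,score_y\n"
    "PRESENT,2010,0.5\n"
    "Example Show,PRESENT,#DIV/0!\n"
)

# Each marker is treated as missing only in the column it is listed under.
per_column_na = {'show_end': ['PRESENT'], 'score_y': ['#DIV/0!']}

scores_pdf = pd.read_csv(io.StringIO(csv_text), na_values=per_column_na)
print(scores_pdf)
print(scores_pdf.dtypes)
```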


Great. All columns are read with acceptable datatypes. The following command displays the first few records for `object` columns.

In [37]:
import numpy as np
mad_men_show_data_pdf = pd.read_csv('/content/538-mad-men/show-data.csv', 
                                    names=['performer', 'show', 'show_start', 'show_end',
                                           'status', 'char_end', 'years_since', 'lead', 
                                           'support', 'shows', 'score', 'score_y',
                                           'lead_notes', 'support_notes', 'show_notes'],
                                    dtype={'show_start': np.float32, 
                                           'show_end': np.float32,
                                           'score_y': np.float32},
                                    header=0, 
                                    na_values=['PRESENT','#DIV/0!'],
                                    encoding = 'ISO-8859-1')
mad_men_show_data_pdf \
  .select_dtypes(include=['object']) \
  .head(5)

Unnamed: 0,performer,show,status,lead_notes,support_notes,show_notes
0,Steven Hill,Law & Order,END,,,
1,Kelli Williams,The Practice,END,,Any Day Now (2012),"Medical Investigation, Season 1; Lie To Me, Se..."
2,LisaGay Hamilton,The Practice,END,"Life of a King, 2014; Go For Sisters, 2013",,"Men of a Certain Age, Season 1-2"
3,Lara Flynn Boyle,The Practice,END,,,
4,Dylan McDermott,The Practice,END,"Olympus Has Fallen, 2013; Freezer, 2014","The Messengers, 2007; Unbeatable Harold, 2009;...","Big Shots, Season 1; Dark Blue, Season 1-2; Am..."


One final modification is to check on the `status` column. It is likely that it should have datatype of `category`.

First, check the unique values of this column.

In [38]:
mad_men_show_data_pdf.status.value_counts()

END     202
LEFT     29
End      17
Name: status, dtype: int64

Now change the datatype for the `status` column with the `dtype` parameter.

In [40]:
import numpy as np
mad_men_show_data_pdf = pd.read_csv('/content/538-mad-men/show-data.csv', 
                                    names=['performer', 'show', 'show_start', 'show_end',
                                           'status', 'char_end', 'years_since', 'lead', 
                                           'support', 'shows', 'score', 'score_y',
                                           'lead_notes', 'support_notes', 'show_notes'],
                                    dtype={'show_start': np.float32, 
                                           'show_end': np.float32,
                                           'score_y': np.float32,
                                           'status': 'category'},
                                    header=0, 
                                    na_values=['PRESENT','#DIV/0!'],
                                    encoding = 'ISO-8859-1')
mad_men_show_data_pdf \
  .info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 248 entries, 0 to 247
Data columns (total 15 columns):
performer        248 non-null object
show             248 non-null object
show_start       248 non-null float32
show_end         219 non-null float32
status           248 non-null category
char_end         248 non-null int64
years_since      248 non-null int64
lead             248 non-null int64
support          248 non-null int64
shows            248 non-null int64
score            248 non-null float64
score_y          245 non-null float32
lead_notes       89 non-null object
support_notes    135 non-null object
show_notes       138 non-null object
dtypes: category(1), float32(3), float64(1), int64(5), object(5)
memory usage: 24.6+ KB


Great. Now, all columns really are read with acceptable datatypes.
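
One caveat: the `value_counts` output above lists both `END` and `End`, and a `category` column treats them as two different categories. Assuming they are meant to be the same status (the source data does not say), a minimal sketch of merging them is to normalize the case before converting:

```python
import pandas as pd

# A small stand-in for the status column.
status = pd.Series(['END', 'LEFT', 'End', 'END'])

# Converting directly keeps END and End as separate categories.
as_read = status.astype('category')

# Upper-casing first merges them into a single category.
normalized = status.str.upper().astype('category')

print(list(as_read.cat.categories))
print(list(normalized.cat.categories))
```

On the full dataframe the same idea could be applied with `assign` after `read_csv`, since the `dtype` parameter converts the values exactly as they appear in the file.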

## 3. `read_excel`

List the spreadsheets in the `per_diem` directory.

In [41]:
%%sh
git clone https://github.com/datalab-datasets/per_diem

Cloning into 'per_diem'...


In [43]:
%ls /content/per_diem/*.xls

/content/per_diem/April2017PD.xls     /content/per_diem/June2017PD.xls
/content/per_diem/April2018PD.xls     /content/per_diem/June2018PD.xls
/content/per_diem/August2017PD.xls    /content/per_diem/March2017PD.xls
/content/per_diem/August2018PD.xls    /content/per_diem/March2018PD.xls
/content/per_diem/December2017PD.xls  /content/per_diem/May2017PD.xls
/content/per_diem/February2017PD.xls  /content/per_diem/May2018PD.xls
/content/per_diem/February2018PD.xls  /content/per_diem/November2017PD.xls
/content/per_diem/January2017PD.xls   /content/per_diem/October2017PD.xls
/content/per_diem/January2018PD.xls   /content/per_diem/September2017PD.xls
/content/per_diem/July2017PD.xls      /content/per_diem/September2018PD.xls
/content/per_diem/July2018PD.xls


The following command uses the pandas function `read_excel` to read in the spreadsheet.

In [44]:
import pandas as pd
may_2017_pdf = pd.read_excel('/content/per_diem/May2017PD.xls')
may_2017_pdf.info()



Notice the column names (they are in mixed case and contain spaces) and notice the columns with type `object`. 

We fix the column names with the `names` parameter.

In [45]:
import pandas as pd
may_2017_pdf = pd.read_excel('/content/per_diem/May2017PD.xls',
                             names=['country_name', 'location', 'season_code', 'season_start_date', 
                                    'season_end_date', 'lodging', 'meals_incidentals', 'per_diem', 
                                    'effective_date', 'footnote_reference', 'location_code']
                            )
may_2017_pdf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1105 entries, 0 to 1104
Data columns (total 11 columns):
country_name          1105 non-null object
location              1105 non-null object
season_code           1105 non-null object
season_start_date     1105 non-null datetime64[ns]
season_end_date       1105 non-null datetime64[ns]
lodging               1105 non-null int64
meals_incidentals     1105 non-null int64
per_diem              1105 non-null int64
effective_date        1105 non-null datetime64[ns]
footnote_reference    340 non-null object
location_code         1105 non-null int64
dtypes: datetime64[ns](3), int64(4), object(4)
memory usage: 95.0+ KB


In [0]:
may_2017_pdf \
  .select_dtypes(include=['object']) \
  .head()

The first three columns should have datatype `category`. Make it so, Number One.

In [46]:
import pandas as pd
may_2017_pdf = pd.read_excel('/content/per_diem/May2017PD.xls',
                             names=['country_name', 'location', 'season_code', 'season_start_date', 
                                    'season_end_date', 'lodging', 'meals_incidentals', 'per_diem', 
                                    'effective_date', 'footnote_reference', 'location_code'],
                             dtypes={'country_name': 'category',
                                     'location': 'category',
                                     'season_code': 'category'}
                            )
may_2017_pdf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1105 entries, 0 to 1104
Data columns (total 11 columns):
country_name          1105 non-null object
location              1105 non-null object
season_code           1105 non-null object
season_start_date     1105 non-null datetime64[ns]
season_end_date       1105 non-null datetime64[ns]
lodging               1105 non-null int64
meals_incidentals     1105 non-null int64
per_diem              1105 non-null int64
effective_date        1105 non-null datetime64[ns]
footnote_reference    340 non-null object
location_code         1105 non-null int64
dtypes: datetime64[ns](3), int64(4), object(4)
memory usage: 95.0+ KB

Notice that the first three columns still have type `object`. The keyword above is misspelled as `dtypes` and had no effect. The cell below uses the correct spelling, `dtype`.


In [48]:
import pandas as pd
may_2017_pdf = pd.read_excel('/content/per_diem/May2017PD.xls',
                             names=['country_name', 'location', 'season_code', 'season_start_date', 
                                    'season_end_date', 'lodging', 'meals_incidentals', 'per_diem', 
                                    'effective_date', 'footnote_reference', 'location_code'],
                             dtype={'country_name': 'category',
                                     'location': 'category',
                                     'season_code': 'category'}
                            )
may_2017_pdf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1105 entries, 0 to 1104
Data columns (total 11 columns):
country_name          1105 non-null category
location              1105 non-null category
season_code           1105 non-null category
season_start_date     1105 non-null datetime64[ns]
season_end_date       1105 non-null datetime64[ns]
lodging               1105 non-null int64
meals_incidentals     1105 non-null int64
per_diem              1105 non-null int64
effective_date        1105 non-null datetime64[ns]
footnote_reference    340 non-null object
location_code         1105 non-null int64
dtypes: category(3), datetime64[ns](3), int64(4), object(1)
memory usage: 123.2+ KB


The `dtype` parameter works here, but it is not available in every version of pandas. For older versions, a Google search suggests using the `converters` parameter instead. See [stackoverflow](https://stackoverflow.com/questions/32591466/python-pandas-how-to-specify-data-types-when-reading-an-excel-file).

Further checking [the documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_excel.html) reveals that the `dtype` parameter was added in version `0.20.0`, so it is unavailable in earlier versions such as `0.19.2`. The command below displays the version in use.

In [49]:
pd.__version__

'0.24.2'

There seems to be little documentation on the `converters` parameter. Because of this, I use the `assign` method to change the type of the columns after reading the spreadsheet with `read_excel`, as follows.

In [51]:
import pandas as pd
may_2017_pdf = \
pd.read_excel('/content/per_diem/May2017PD.xls',
              names=['country_name', 'location', 'season_code', 'season_start_date', 
                     'season_end_date', 'lodging', 'meals_incidentals', 'per_diem', 
                     'effective_date', 'footnote_reference', 'location_code']
             ) \
  .assign(country_name=lambda df: df.country_name.astype('category'),
          location    =lambda df: df.location.astype('category'),
          season_code =lambda df: df.season_code.astype('category')
         )
        
may_2017_pdf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1105 entries, 0 to 1104
Data columns (total 11 columns):
country_name          1105 non-null category
location              1105 non-null category
season_code           1105 non-null category
season_start_date     1105 non-null datetime64[ns]
season_end_date       1105 non-null datetime64[ns]
lodging               1105 non-null int64
meals_incidentals     1105 non-null int64
per_diem              1105 non-null int64
effective_date        1105 non-null datetime64[ns]
footnote_reference    340 non-null object
location_code         1105 non-null int64
dtypes: category(3), datetime64[ns](3), int64(4), object(1)
memory usage: 133.3+ KB


See the documentation on the `assign` method:
- https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.assign.html
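
The pattern used above can be tried on a small dataframe without a spreadsheet. This is a sketch with made-up values; it also shows `astype` with a dictionary, which is an alternative not used elsewhere in this notebook.

```python
import pandas as pd

# A tiny stand-in for the per diem data.
per_diem_pdf = pd.DataFrame({
    'country_name': ['A', 'A', 'B'],
    'lodging': [100, 120, 90],
})

# assign returns a new dataframe; the lambda receives the dataframe being built.
with_assign = per_diem_pdf.assign(
    country_name=lambda df: df.country_name.astype('category'))

# astype with a dictionary converts only the listed columns.
with_astype = per_diem_pdf.astype({'country_name': 'category'})

print(with_assign.dtypes)
print(with_astype.dtypes)
```

Both calls leave `per_diem_pdf` itself unchanged, which is why the result is assigned to a name in the cell above.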

Make one last check on the columns of type `object`.

In [52]:
may_2017_pdf \
  .select_dtypes(include=['object']) \
  .head()

Unnamed: 0,footnote_reference
0,192.0
1,192.0
2,
3,
4,2.0


That looks like a reasonable `object` column. The `may_2017_pdf` dataframe is ready to be analyzed.

## 4. `read_json`

Clone the `file-samples` repository and list the JSON files it contains.

In [54]:
%%sh
git clone https://github.com/datalab-datasets/file-samples

Cloning into 'file-samples'...


In [57]:
%ls /content/file-samples/*.json

/content/file-samples/dict_of_lists.json
/content/file-samples/each_line.json
/content/file-samples/enron.json
/content/file-samples/list_of_dicts.json
/content/file-samples/one_dictionary.json
/content/file-samples/one_list.json
/content/file-samples/one_list_with_metadata.json
/content/file-samples/simple_dict.json
/content/file-samples/simple_list.json
/content/file-samples/stocks.json
/content/file-samples/world_bank.json
/content/file-samples/zips.json


The first example uses the `zips.json` file. Its first three records are displayed below.

In [59]:
%%sh 
head -n 3 /content/file-samples/zips.json

{ "city" : "AGAWAM", "loc" : [ -72.622739, 42.070206 ], "pop" : 15338, "state" : "MA", "_id" : "01001" }
{ "city" : "CUSHMAN", "loc" : [ -72.51564999999999, 42.377017 ], "pop" : 36963, "state" : "MA", "_id" : "01002" }
{ "city" : "BARRE", "loc" : [ -72.10835400000001, 42.409698 ], "pop" : 4546, "state" : "MA", "_id" : "01005" }


Read the file using `read_json`. The `lines=True` parameter indicates that there should be one JSON object per line. These JSON objects correspond to rows/records. The key names correspond to column names.

In [62]:
zips_pdf = pd.read_json('/content/file-samples/zips.json',
                        lines=True)
zips_pdf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29467 entries, 0 to 29466
Data columns (total 5 columns):
_id      29467 non-null int64
city     29467 non-null object
loc      29467 non-null object
pop      29467 non-null int64
state    29467 non-null object
dtypes: int64(2), object(3)
memory usage: 1.1+ MB
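
Notice that `_id` was read as `int64`, so a value such as `"01001"` loses its leading zero. The following sketch reads two of the records shown above (with the `loc` key left out for brevity) from an in-memory string. It assumes that keeping `_id` as text is the desired behavior, and uses the `dtype` parameter of `read_json` to request it.

```python
import io

import pandas as pd

# One JSON object per line, as in zips.json.
json_lines = (
    '{ "city" : "AGAWAM", "pop" : 15338, "state" : "MA", "_id" : "01001" }\n'
    '{ "city" : "CUSHMAN", "pop" : 36963, "state" : "MA", "_id" : "01002" }\n'
)

# Default behavior: each line becomes a row and _id is converted to a number.
default_pdf = pd.read_json(io.StringIO(json_lines), lines=True)

# Requesting str for _id keeps the values as text.
text_id_pdf = pd.read_json(io.StringIO(json_lines), lines=True,
                           dtype={'_id': str})

print(default_pdf.dtypes)
print(text_id_pdf['_id'].tolist())
```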


List the first five records of the dataframe. Notice the `loc` column. It seems to contain a list, but has the `object` datatype.

In [63]:
zips_pdf.head()

Unnamed: 0,_id,city,loc,pop,state
0,1001,AGAWAM,"[-72.622739, 42.070206]",15338,MA
1,1002,CUSHMAN,"[-72.51565, 42.377017]",36963,MA
2,1005,BARRE,"[-72.108354, 42.409698]",4546,MA
3,1007,BELCHERTOWN,"[-72.410953, 42.275103]",10579,MA
4,1008,BLANDFORD,"[-72.936114, 42.182949]",1240,MA


In [64]:
zips_pdf.loc

<pandas.core.indexing._LocIndexer at 0x7f4cf47cd0e8>

The `loc` indexer masks the `loc` column, so `zips_pdf.loc` returns the indexer rather than the column. The column should therefore be renamed. The `read_json` function has no `names` parameter, so we use the `rename` method. See the documentation for details on the `rename` method: 
- https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rename.html

In [65]:
zips_pdf = \
pd.read_json('/content/file-samples/zips.json',
             lines=True) \
  .rename(columns={'loc':'lon_lat'})
zips_pdf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29467 entries, 0 to 29466
Data columns (total 5 columns):
_id        29467 non-null int64
city       29467 non-null object
lon_lat    29467 non-null object
pop        29467 non-null int64
state      29467 non-null object
dtypes: int64(2), object(3)
memory usage: 1.1+ MB
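
The masking only affects attribute access. The following sketch, on a two-row stand-in for the zips data, shows that bracket access still reaches a column named `loc`, and that attribute access works again after the rename.

```python
import pandas as pd

zips_sample_pdf = pd.DataFrame({
    '_id': [1001, 1002],
    'loc': [[-72.622739, 42.070206], [-72.51565, 42.377017]],
})

attribute_result = zips_sample_pdf.loc    # the indexer, not the column
bracket_result = zips_sample_pdf['loc']   # the column

renamed_pdf = zips_sample_pdf.rename(columns={'loc': 'lon_lat'})
renamed_column = renamed_pdf.lon_lat      # attribute access now works

print(type(attribute_result).__name__)
print(type(bracket_result).__name__)
```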


The `city` and `state` columns should have datatype `category`. The `dtype` parameter didn't work so the `assign` method is used.

In [66]:
zips_pdf = \
pd.read_json('/content/file-samples/zips.json',
             lines=True) \
  .rename(columns={'loc':'lon_lat'}) \
  .assign(city =lambda df: df.city .astype('category'),
          state=lambda df: df.state.astype('category')
         )

zips_pdf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29467 entries, 0 to 29466
Data columns (total 5 columns):
_id        29467 non-null int64
city       29467 non-null category
lon_lat    29467 non-null object
pop        29467 non-null int64
state      29467 non-null category
dtypes: category(2), int64(2), object(1)
memory usage: 1.5+ MB


The `lon_lat` column should be split into two columns, but that will have to wait until later.
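
For reference, one possible way to do the split is sketched below on a two-row stand-in for the zips data. It assumes every list has exactly two elements, with longitude first, as the column name suggests. The names `lon` and `lat` are hypothetical.

```python
import pandas as pd

zips_sample_pdf = pd.DataFrame({
    '_id': [1001, 1002],
    'lon_lat': [[-72.622739, 42.070206], [-72.51565, 42.377017]],
})

# Build a two-column dataframe from the lists, aligned on the same index.
coords_pdf = pd.DataFrame(zips_sample_pdf['lon_lat'].tolist(),
                          columns=['lon', 'lat'],
                          index=zips_sample_pdf.index)

# Replace the list column with the two numeric columns.
split_pdf = zips_sample_pdf.drop(columns='lon_lat').join(coords_pdf)

print(split_pdf)
```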

__The End__