# Parsing CSV & JSON Files

Thanks to advances in data storage technology, data from various sources is stored in many different formats
and file types.
Some formats, such as CSV, JSON, and XML, store data in a way that a machine can handle easily.
These are usually referred to as machine-readable formats.
In contrast, other formats or file types store data in a way meant to be read by a human
using front-end desktop tools.
These are often referred to as hard-to-parse formats.
We will use a series of examples to demonstrate how to extract data stored in
both machine-readable and hard-to-parse formats,
and then store the extracted data in formats that downstream data wrangling tasks can readily use.
This chapter will cover how to read the common machine-readable formats:
* **CSV**: Comma Separated Values
* **JSON**: JavaScript Object Notation

In most cases, these two formats, together with XML, are the best available resources when you are scraping data from
the web or requesting data directly from an organization or agency.
They are more easily ingested by programming languages such as Python.
We suggest that you try your best to get data in these formats before you start looking
into other formats that might be hard to parse, like PDFs.

There are many ways of reading and storing data in those formats,
depending on the programming language you use.
Here we are going to focus on Python.
Searching the Internet, you will find many online tutorials on handling data stored in different
data formats with Python.
We suggest the following:
* "*Data Loading, Storage, and File Formats*", Chapter 6 of "**Python for Data Analysis**": This chapter covers reading files in a variety of formats, loading data from databases, and interacting with the Internet via APIs. Please read pages 155-166, and download and run the Python scripts from [the author's GitHub site](https://github.com/pydata/pydata-book). 📖

The dataset used in this chapter was downloaded from
[data.gov.au](https://data.melbourne.vic.gov.au/Transport-Movement/Melbourne-bike-share/tdvh-n9dv). 
It is available in the following formats: CSV, JSON, XML, RDF, etc.
The first two formats are used, i.e., the following two files
* Melbourne_bike_share.csv
* Melbourne_bike_share.json

In the following sections, you will learn how to extract data from the two
example files and store the extracted data in a Pandas DataFrame.

### Example scenario
Assume that you are going to analyze and predict the status of Hubway bicycle stations to answer the following questions:
* What do usage patterns look like with respect to specific stations, and how do they translate to imbalances in the system?
* Can we integrate these explanatory variables and these usage patterns into a predictive algorithm that would predict empty and full stations in the near future?
* What form should that algorithm take?
* How do environmental variables affect the future state of Hubway stations?

See <a href="http://cs109hubway.github.io/classp/"><font color="red">Predicting Hubway Stations status in Boston</font></a> for more discussion.

The first step is to acquire the hub station data as well as weather data. Here, for demonstration purposes, we use the Melbourne bike share data published by the government. The files have been downloaded and are provided alongside this notebook.

* * *

## 1. Parsing CSV file
A CSV is a Comma Separated Values file, which allows data to be saved in a tabular format.
Each row of the file is a data record; each column is a field (or an attribute).
Each data record consists of one or more fields, separated by commas.
As one of the most popular file formats,
it is supported by most spreadsheet programs, such as
Microsoft Excel, Open Office Calc, and Google Spreadsheets.
Because of its simplicity,
it differs from other spreadsheet file types, such as Excel, in that a file can store only a single sheet.
It cannot be used to store cell, column, or row styling, figures, or formulas.
To make our CSV file, i.e., Melbourne_bike_share.csv, easier to view here,
a sample of the data with a trimmed-down set of records is shown below.
You should see something similar to this when you open the CSV file in your text editor:
![csv1.png](./csv1.png)

Note that tabs can also be used to separate values of different fields.
This type of file is usually called TSV (Tab Separated Values).
Sometimes TSVs get classified as CSVs.
The only difference between CSVs and TSVs is the delimiter.
Essentially, the two types of files behave the same in Python and most other
programming languages.
Both are plain text files in which the fields of each record are
separated by a delimiter character.
This section will show you how to use Pandas 
[read_csv()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) function to
load our CSV file, and how to tidy the loaded data a bit.
Before we start importing our CSV file, it might be good for you to read [Pandas tutorial
on reading CSV files](http://pandas.pydata.org/pandas-docs/stable/io.html#io-read-csv-table) 📖.
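
To illustrate the point about delimiters, here is a minimal sketch. The two-record sample and the variable names are made up for illustration and are not taken from the bike share file. It reads the same records once as CSV and once as TSV, changing only the `sep` argument of `read_csv`.

```python
import io
import pandas as pd

# Hypothetical two-record sample, written once with commas and once with tabs.
csv_text = "ID,NBBikes,NBEmptydoc\n2,9,14\n4,12,10\n"
tsv_text = csv_text.replace(",", "\t")

# Only the `sep` argument changes between the two calls.
from_csv = pd.read_csv(io.StringIO(csv_text))
from_tsv = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# Both calls produce the same DataFrame.
print(from_csv.equals(from_tsv))
```

`io.StringIO` is used here only so that the sketch does not depend on any file on disk.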

### Alternative approach to inspect your data

In [2]:
with open("./Melbourne_bike_share.csv", 'r') as f:
    for line in f.readlines()[:3]: # first three lines
        print (line)

ID,Featurename,TerminalName,NBBikes,NBEmptydoc,UploadDate,Coordinates

2,Harbour Town - Docklands Dve - Docklands,60000,9,14,28/01/2016 12:30:05 PM +0000,"(-37.814022, 144.939521)"

4,Federation Square - Flinders St / Swanston St - City,60001,12,10,28/01/2016 12:30:05 PM +0000,"(-37.817523, 144.967814)"



In [1]:
with open("./Melbourne_bike_share.csv", 'r') as f:
    for line in f.readlines()[-3:]: # last three lines
        print (line)

53,Victoria Market - Elizabeth St / Victoria St - City,60049,15,10,28/01/2016 12:30:06 PM +0000,"(-37.806091, 144.959017)"

55,Coventry St / Clarendon St - South Melbourne,60050,7,4,28/01/2016 12:30:06 PM +0000,"(-37.831776, 144.960818)"

57,Fitzroy Street - St Kilda,60052,19,12,28/01/2016 12:30:06 PM +0000,"(-37.858655, 144.978818)"



### 1.1. Importing CSV data
Importing CSV files with the Pandas <a href='http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html'><font color = "blue">read_csv()</font></a> function and converting the data into a form Python can understand
is simple.
It only takes a couple of lines of code.
The imported data will be stored in a Pandas DataFrame.

In [2]:
import pandas as pd
csvdf = pd.read_csv("./Melbourne_bike_share.csv")
type(csvdf)

pandas.core.frame.DataFrame

Or you can use the <a href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_table.html"><font color='blue'>read_table()</font></a> function

In [3]:
csvdf_1 = pd.read_table("./Melbourne_bike_share.csv", sep=",")
type(csvdf_1)

pandas.core.frame.DataFrame

Now, the data should be loaded into Python. 
Let's have a look at the first 5 records in the dataset.
There are a couple of ways to retrieve these records.
For example, you can use
* <font color='blue'>csvdf.head(n = 5)</font>: returns the first `n` rows of a DataFrame; n = 5 by default.
* <font color='blue'>csvdf[:5]</font>: uses slicing to retrieve the first 5 rows.

Refer to "[Indexing and Selecting Data](http://pandas.pydata.org/pandas-docs/stable/indexing.html)"
for how to slice, dice, and generally get and set subsets of pandas objects.
Here, we use slicing; the commented-out lines in the cell show equivalent calls.
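
As a small sketch of how these retrieval methods relate, the following uses a made-up toy table (not the bike share data) to check that `head`, slicing, `iloc`, and `loc` select the same rows when the index is the default integer range.

```python
import pandas as pd

# Hypothetical toy table standing in for the bike share data.
toy = pd.DataFrame({"ID": [2, 4, 5, 6], "NBBikes": [9, 12, 16, 9]})

first_by_head = toy.head(2)    # first two rows via head()
first_by_slice = toy[:2]       # same rows via positional slicing
first_by_iloc = toy.iloc[:2]   # same rows via integer-location indexing
first_by_loc = toy.loc[:1]     # label-based: the end label 1 is included

print(first_by_head.equals(first_by_slice))
print(first_by_head.equals(first_by_loc))
```

Note that `loc` slices by label and includes the end label, which is why `toy.loc[:1]` returns two rows here.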

In [4]:
# csvdf.head()
# csvdf.loc[:4]
csvdf[:5]

Unnamed: 0,ID,Featurename,TerminalName,NBBikes,NBEmptydoc,UploadDate,Coordinates
0,2,Harbour Town - Docklands Dve - Docklands,60000,9,14,28/01/2016 12:30:05 PM +0000,"(-37.814022, 144.939521)"
1,4,Federation Square - Flinders St / Swanston St ...,60001,12,10,28/01/2016 12:30:05 PM +0000,"(-37.817523, 144.967814)"
2,5,Plum Garland Reserve - Beaconsfield Pde - Albe...,60002,16,1,28/01/2016 12:30:05 PM +0000,"(-37.84782, 144.948196)"
3,6,State Library - Swanston St / Little Lonsdale ...,60003,9,2,28/01/2016 12:30:05 PM +0000,"(-37.810702, 144.964417)"
4,7,Bourke Street Mall - 205 Bourke St - City,60004,9,2,28/01/2016 12:30:05 PM +0000,"(-37.813088, 144.967437)"


In [5]:
# csvdf.tail()
csvdf[-5:]

Unnamed: 0,ID,Featurename,TerminalName,NBBikes,NBEmptydoc,UploadDate,Coordinates
45,51,ANZ - Collins St - Docklands,60044,9,10,28/01/2016 12:30:06 PM +0000,"(-37.821568, 144.944488)"
46,52,Flagstaff Gardens - Peel St - West Melbourne,60048,6,5,28/01/2016 12:30:06 PM +0000,"(-37.809216, 144.955223)"
47,53,Victoria Market - Elizabeth St / Victoria St -...,60049,15,10,28/01/2016 12:30:06 PM +0000,"(-37.806091, 144.959017)"
48,55,Coventry St / Clarendon St - South Melbourne,60050,7,4,28/01/2016 12:30:06 PM +0000,"(-37.831776, 144.960818)"
49,57,Fitzroy Street - St Kilda,60052,19,12,28/01/2016 12:30:06 PM +0000,"(-37.858655, 144.978818)"


Currently, the row indices are integers automatically generated by Pandas.
Suppose you want to set IDs as row indices and delete the ID column.
Resetting the row indices can be easily done with the following DataFrame function
```python
    DataFrame.set_index(keys, drop=True, append=False, inplace=False, verify_integrity=False)
```
See its [API webpage](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.set_index.html) 
for the detailed usage.
The keys are going to be the IDs in the first column.
By setting `inplace = True`, the change is made in place: the DataFrame itself is modified, and no new DataFrame object is returned.
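
The effect of the `keys` and `drop` arguments can be sketched on a made-up toy table (the names `toy`, `indexed`, and `kept` are illustrative only). Passing the column name moves the column into the index, whereas passing the Series itself leaves the original column in place.

```python
import pandas as pd

# Hypothetical toy table, not the real bike share data.
toy = pd.DataFrame({"ID": [2, 4, 5], "NBBikes": [9, 12, 16]})

# Passing the column *name*: with drop=True (the default) the column
# is moved into the index and no longer appears among the columns.
indexed = toy.set_index("ID")
print(indexed.index.tolist())
print(list(indexed.columns))

# Passing the Series itself: its values become the index,
# but the original "ID" column is kept.
kept = toy.set_index(toy.ID)
print(list(kept.columns))
```

Neither call uses `inplace=True`, so `toy` itself is left unchanged and each call returns a new DataFrame.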

In [6]:
#len(csvdf.ID.unique())

In [7]:
# use the values of the ID column as the index
csvdf.set_index(csvdf.ID, inplace = True)
csvdf.head()

Unnamed: 0_level_0,ID,Featurename,TerminalName,NBBikes,NBEmptydoc,UploadDate,Coordinates
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
2,2,Harbour Town - Docklands Dve - Docklands,60000,9,14,28/01/2016 12:30:05 PM +0000,"(-37.814022, 144.939521)"
4,4,Federation Square - Flinders St / Swanston St ...,60001,12,10,28/01/2016 12:30:05 PM +0000,"(-37.817523, 144.967814)"
5,5,Plum Garland Reserve - Beaconsfield Pde - Albe...,60002,16,1,28/01/2016 12:30:05 PM +0000,"(-37.84782, 144.948196)"
6,6,State Library - Swanston St / Little Lonsdale ...,60003,9,2,28/01/2016 12:30:05 PM +0000,"(-37.810702, 144.964417)"
7,7,Bourke Street Mall - 205 Bourke St - City,60004,9,2,28/01/2016 12:30:05 PM +0000,"(-37.813088, 144.967437)"


To remove the now-redundant ID column, use the DataFrame `drop` function with `axis = 1` and `inplace = True`:
```python
    DataFrame.drop(labels, axis=0, level=None, inplace=False, errors='raise')
```

In [8]:
# remove the redundant column (axis=1 means "drop a column")
csvdf.drop('ID', axis=1, inplace = True)
csvdf.head()

Unnamed: 0_level_0,Featurename,TerminalName,NBBikes,NBEmptydoc,UploadDate,Coordinates
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2,Harbour Town - Docklands Dve - Docklands,60000,9,14,28/01/2016 12:30:05 PM +0000,"(-37.814022, 144.939521)"
4,Federation Square - Flinders St / Swanston St ...,60001,12,10,28/01/2016 12:30:05 PM +0000,"(-37.817523, 144.967814)"
5,Plum Garland Reserve - Beaconsfield Pde - Albe...,60002,16,1,28/01/2016 12:30:05 PM +0000,"(-37.84782, 144.948196)"
6,State Library - Swanston St / Little Lonsdale ...,60003,9,2,28/01/2016 12:30:05 PM +0000,"(-37.810702, 144.964417)"
7,Bourke Street Mall - 205 Bourke St - City,60004,9,2,28/01/2016 12:30:05 PM +0000,"(-37.813088, 144.967437)"


Instead of setting the row indices to IDs after loading, you can specify which column
is used for the row indices while reading the CSV file. See the API reference page for
[pandas.read_csv](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html).
To do so, use the <font color='blue'>index_col</font> argument of <font color='blue'>read_csv()</font>.

In [9]:
# specify the index column while reading the data
csvdf = pd.read_csv("./Melbourne_bike_share.csv", index_col = "ID")
csvdf.head()

Unnamed: 0_level_0,Featurename,TerminalName,NBBikes,NBEmptydoc,UploadDate,Coordinates
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2,Harbour Town - Docklands Dve - Docklands,60000,9,14,28/01/2016 12:30:05 PM +0000,"(-37.814022, 144.939521)"
4,Federation Square - Flinders St / Swanston St ...,60001,12,10,28/01/2016 12:30:05 PM +0000,"(-37.817523, 144.967814)"
5,Plum Garland Reserve - Beaconsfield Pde - Albe...,60002,16,1,28/01/2016 12:30:05 PM +0000,"(-37.84782, 144.948196)"
6,State Library - Swanston St / Little Lonsdale ...,60003,9,2,28/01/2016 12:30:05 PM +0000,"(-37.810702, 144.964417)"
7,Bourke Street Mall - 205 Bourke St - City,60004,9,2,28/01/2016 12:30:05 PM +0000,"(-37.813088, 144.967437)"


Similarly, with the <font color='blue'>read_table()</font> function, you can also set the value of <font color='blue'> index_col</font> to "ID".

### 1.2. Manipulating the Data

So far, you have learned a little bit about the Melbourne_bike_share data.
Let's further process the data by splitting the coordinates into latitude and longitude.
First figure out what type of data we're dealing with, i.e., the data type of the "Coordinates" column.

In [10]:
type(csvdf['Coordinates']) 
# type(csvdf.Coordinates)

pandas.core.series.Series

The data type of this column is Pandas Series, i.e., 
a one-dimensional labeled array capable of holding any data type.
Next, in order to split the coordinates, you should know the data type of those coordinates. Are they strings?
Let's check them by printing the first element in the Series and its type.

In [11]:
print (csvdf['Coordinates'].iloc[0])
type(csvdf['Coordinates'].iloc[0]) 

(-37.814022, 144.939521)


str

Those coordinates are indeed strings. Thus, to extract both latitude and longitude, you 
can either use regular expressions introduced in the previous chapter or common string operations.

To use regular expressions, the key is figuring out the patterns of characters. Then
according to those patterns, you formulate your regular expressions.
Looking at the first couple of coordinates in the Series object, i.e.:
```
    (-37.814022, 144.939521)
    (-37.817523, 144.967814)
    (-37.84782, 144.948196)
```
You will find that latitudes are always negative real values, and longitudes are positive real values.
That is because Australia lies between latitudes 9° and 44°S, and longitudes 112° and 154°E.
The regular expression is
```
    r"-?\d+\.?\d*"
```
![](./regex1.jpg)
It contains four parts
* "-?": optionally matches a single '-'.
* "\d+": matches one or more digits.
* "\\.?": optionally matches a single dot.
* "\d*": matches zero or more digits.

The following code extracts all real values matching this regular expression.
<font color="blue">re.findall()</font> returns all matched values in a Python list.

In [12]:
import re
str1 = csvdf['Coordinates'].iloc[0] # csvdf.Coordinates
re.findall(r"-?\d+\.?\d*", str1)

['-37.814022', '144.939521']
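
The same number pattern can also be applied to a whole Series at once. The sketch below is an aside rather than the approach taken in this section. It assumes every value has the exact form `(lat, lon)` and uses the pandas `Series.str.extract` method with two capture groups. The sample values are copied from the output above, and the names `coords`, `pattern`, and `latlon` are illustrative.

```python
import pandas as pd

# Two sample coordinate strings in the same "(lat, lon)" form as the data.
coords = pd.Series(["(-37.814022, 144.939521)", "(-37.84782, 144.948196)"])

# Two capture groups, each using the number pattern from the text.
pattern = r"\((-?\d+\.?\d*), (-?\d+\.?\d*)\)"

# extract() returns one column per capture group; convert the text to floats.
latlon = coords.str.extract(pattern).astype(float)
latlon.columns = ["lat", "lon"]
print(latlon)
```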

Using common string operations might be simpler than using regular expressions.
<font color="blue">str.split()</font> is the function used here to extract both latitudes and longitudes.
However, you should choose a proper delimiter to split a string.
First, split the string by ', ' (a comma followed by a space):

In [23]:
s = csvdf['Coordinates'].iloc[1].split(', ') # assuming they're all '(x, y)'
print ('lat = ', s[0], ' long = ', s[1])
print(s[0])
print(s[1])

lat =  (-37.817523  long =  144.967814)
(-37.817523
144.967814)


The printout shows that the latitude contains '(' and the longitude contains ')'.
You should remove both the left and the right parentheses.
Of course, the `split` function could be used again.
However, the goal here is only to remove the leading and trailing parentheses.
Python's string class provides two methods for these two operations:
* <font color="blue">string.lstrip()</font>: returns a copy of the string with leading characters removed.
* <font color="blue">string.rstrip()</font>: returns a copy of the string with trailing characters removed.

Let's try the two functions.

In [21]:
print (s[0].lstrip('('))
print (s[1].rstrip(')'))

-37.817523
144.967814
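
As a compact variation on the two calls above, `str.strip()` accepts a set of characters and removes them from both ends, so the parentheses can be removed before splitting. This is a minimal sketch on one hard-coded coordinate string; the variable names are illustrative.

```python
coordinate = "(-37.817523, 144.967814)"

# strip("()") removes any leading or trailing '(' or ')' characters in one call.
lat_text, lon_text = coordinate.strip("()").split(", ")

# Convert the cleaned text to floating-point numbers.
lat, lon = float(lat_text), float(lon_text)
print(lat, lon)
```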


The latitude and longitude in this coordinate have been successfully extracted.
Next, we are going to apply the extraction process to every coordinate in the DataFrame.
There are multiple ways of doing that.
The most straightforward way is to write a for loop that iterates over all the coordinates
and applies the above code to each individual coordinate.
Two Pandas Series can then be used to store the latitudes and longitudes.
However, we are going to show you how to use some more advanced Python functionality.
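
For comparison, the loop-based approach described above could be sketched as follows. The two sample coordinates are copied from the data shown earlier, and the names `coords`, `lat_series`, and `lon_series` are illustrative.

```python
import pandas as pd

# Two sample coordinates, indexed by station ID as in the DataFrame.
coords = pd.Series(
    ["(-37.814022, 144.939521)", "(-37.817523, 144.967814)"], index=[2, 4]
)

lats, lons = [], []
for coordinate in coords:
    # Split into the two halves, then remove the parenthesis from each half.
    lat_text, lon_text = coordinate.split(", ")
    lats.append(lat_text.lstrip("("))
    lons.append(lon_text.rstrip(")"))

# Store the results in two Series that share the original index.
lat_series = pd.Series(lats, index=coords.index)
lon_series = pd.Series(lons, index=coords.index)
print(lat_series)
```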

Pandas Series class implements an [`apply()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.apply.html) method that applies a given function
to all values in a Series object, and returns a new one.
Please note that the given function is applied to one value at a time.
To apply <font color="blue">str.split()</font> to every coordinate and
get latitudes and longitudes, you can use the following two lines of code:

In [24]:
csvdf['lat'] = csvdf['Coordinates'].apply(lambda x: x.split(', ')[0])
csvdf['lon'] = csvdf['Coordinates'].apply(lambda x: x.split(', ')[1])
csvdf.head()

Unnamed: 0_level_0,Featurename,TerminalName,NBBikes,NBEmptydoc,UploadDate,Coordinates,lat,lon
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
2,Harbour Town - Docklands Dve - Docklands,60000,9,14,28/01/2016 12:30:05 PM +0000,"(-37.814022, 144.939521)",(-37.814022,144.939521)
4,Federation Square - Flinders St / Swanston St ...,60001,12,10,28/01/2016 12:30:05 PM +0000,"(-37.817523, 144.967814)",(-37.817523,144.967814)
5,Plum Garland Reserve - Beaconsfield Pde - Albe...,60002,16,1,28/01/2016 12:30:05 PM +0000,"(-37.84782, 144.948196)",(-37.84782,144.948196)
6,State Library - Swanston St / Little Lonsdale ...,60003,9,2,28/01/2016 12:30:05 PM +0000,"(-37.810702, 144.964417)",(-37.810702,144.964417)
7,Bourke Street Mall - 205 Bourke St - City,60004,9,2,28/01/2016 12:30:05 PM +0000,"(-37.813088, 144.967437)",(-37.813088,144.967437)


The first line extracts all the latitudes and stores them in a column in our DataFrame.
The second line extracts all the longitudes.
You might wonder what "lambda" is in the code.
It is a Python keyword used to construct small anonymous functions at runtime. (See [Section 4.7.5. Lambda Expressions](https://docs.python.org/2/tutorial/controlflow.html) 📖 )
You can use a similar approach to remove the leading and trailing parentheses.

In [25]:
csvdf['lat'] = csvdf['lat'].apply(lambda x: x.lstrip('(')) # remove (
csvdf['lon'] = csvdf['lon'].apply(lambda x: x.rstrip(')')) # remove )
csvdf.drop('Coordinates', axis=1, inplace = True) # remove the Coordinates column
csvdf.head()

Unnamed: 0_level_0,Featurename,TerminalName,NBBikes,NBEmptydoc,UploadDate,lat,lon
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
2,Harbour Town - Docklands Dve - Docklands,60000,9,14,28/01/2016 12:30:05 PM +0000,-37.814022,144.939521
4,Federation Square - Flinders St / Swanston St ...,60001,12,10,28/01/2016 12:30:05 PM +0000,-37.817523,144.967814
5,Plum Garland Reserve - Beaconsfield Pde - Albe...,60002,16,1,28/01/2016 12:30:05 PM +0000,-37.84782,144.948196
6,State Library - Swanston St / Little Lonsdale ...,60003,9,2,28/01/2016 12:30:05 PM +0000,-37.810702,144.964417
7,Bourke Street Mall - 205 Bourke St - City,60004,9,2,28/01/2016 12:30:05 PM +0000,-37.813088,144.967437


So far, we have split the "Coordinates" column into two columns, "lat" and "lon", in the DataFrame,
and dropped the "Coordinates" column.
The last step is to give the remaining object columns more suitable types.
The latitudes, longitudes, and dates are still encoded as strings in the current DataFrame.
We would like to convert those values to the types they are supposed to have.
In [26]:
# convert_objects() was removed in pandas 1.0; pd.to_numeric is its replacement for numeric data.
# Among the numeric columns, only 'lat' and 'lon' are still stored as strings.
csvdf[['lat', 'lon']] = csvdf[['lat', 'lon']].apply(pd.to_numeric)
csvdf.dtypes

Featurename      object
TerminalName      int64
NBBikes           int64
NBEmptydoc        int64
UploadDate       object
lat             float64
lon             float64
dtype: object

However, dates are still strings, because numeric conversion does not turn date strings into datetime
objects.
You need to convert them explicitly to datetime objects with [`pd.to_datetime`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_datetime.html).

In [27]:
csvdf['UploadDate'] = pd.to_datetime(csvdf['UploadDate'])
print (csvdf.dtypes)
csvdf

Featurename             object
TerminalName             int64
NBBikes                  int64
NBEmptydoc               int64
UploadDate      datetime64[ns]
lat                    float64
lon                    float64
dtype: object


Unnamed: 0_level_0,Featurename,TerminalName,NBBikes,NBEmptydoc,UploadDate,lat,lon
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
2,Harbour Town - Docklands Dve - Docklands,60000,9,14,2016-01-28 12:30:05,-37.814022,144.939521
4,Federation Square - Flinders St / Swanston St ...,60001,12,10,2016-01-28 12:30:05,-37.817523,144.967814
5,Plum Garland Reserve - Beaconsfield Pde - Albe...,60002,16,1,2016-01-28 12:30:05,-37.84782,144.948196
6,State Library - Swanston St / Little Lonsdale ...,60003,9,2,2016-01-28 12:30:05,-37.810702,144.964417
7,Bourke Street Mall - 205 Bourke St - City,60004,9,2,2016-01-28 12:30:05,-37.813088,144.967437
8,Melbourne Uni - Tin Alley - Carlton,60005,2,17,2016-01-28 12:30:05,-37.79625,144.960858
9,RMIT - Swanston St / Franklin St - City,60006,9,2,2016-01-28 12:30:05,-37.807699,144.963095
10,St Paul's Cathedral - Swanston St / Flinders S...,60007,4,7,2016-01-28 12:30:05,-37.817189,144.967409
11,MSAC - Aughtie Dve - Albert Park,60008,9,18,2016-01-28 12:30:05,-37.842395,144.961868
12,Fitzroy Town Hall - Moor St - Fitzroy,60009,3,4,2016-01-28 12:30:05,-37.801813,144.979209


Finally, you have loaded the given CSV file into Python with Pandas. 
You have also tidied the data a bit by getting latitudes and longitudes out
from the strings.
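
One caveat about the date conversion: the date strings in this file are day-first (`28/01/2016`), and without a format `pd.to_datetime` has to guess the order of day and month. A hedged sketch of passing an explicit `format` is shown below. The second date is a made-up ambiguous value added for illustration, and a reasonably recent pandas version that supports `%z` is assumed. Because the strings carry a `+0000` offset, the result here is timezone-aware, unlike the `datetime64[ns]` column shown above.

```python
import pandas as pd

# First value copied from the data; the second is a hypothetical ambiguous date.
raw = pd.Series(["28/01/2016 12:30:05 PM +0000", "03/02/2016 01:05:00 AM +0000"])

# %d/%m/%Y = day first; %I with %p = 12-hour clock; %z = UTC offset.
parsed = pd.to_datetime(raw, format="%d/%m/%Y %I:%M:%S %p %z")

# With the explicit format, 03/02/2016 is read as 3 February, not 2 March.
print(parsed.dt.month.tolist())
print(parsed.dt.day.tolist())
```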

Besides `read_csv`, there are other parsing functions in pandas for 
reading tabular data as a DataFrame object. They include
* [`read_table`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_table.html): Reads general delimited file into DataFrame. The default delimiter is '\t'.
* [`read_fwf`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_fwf.html): Reads a table of fixed-width formatted lines into DataFrame.
* [`read_clipboard`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_clipboard.html): Reads text from clipboard and passes to read_table. See read_table for the full argument list.
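
As a brief sketch of two of these functions, the following reads a made-up tab-delimited snippet with `read_table` and a made-up fixed-width snippet with `read_fwf`. The sample text, the column widths, and the variable names are assumptions chosen for illustration.

```python
import io
import pandas as pd

# read_table: tab is the default delimiter, so no sep argument is needed.
tab_text = "id\tbikes\n2\t9\n4\t12\n"
tab_df = pd.read_table(io.StringIO(tab_text))

# read_fwf: each field occupies a fixed number of characters per line.
fixed_width_text = (
    "id  name   bikes\n"
    "2   Alpha  9\n"
    "4   Beta   12\n"
)
# widths gives the character width of each column (4, 7, and 5 here).
fwf_df = pd.read_fwf(io.StringIO(fixed_width_text), widths=[4, 7, 5])

print(tab_df)
print(fwf_df)
```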
* * *

## 2. Parsing JSON files

JSON (JavaScript Object Notation) is one of the most commonly used formats 
for transferring data between web services and other applications via HTTP requests.
Nowadays, many sites have JSON-enabled APIs, and
JSON is quickly becoming the encoding format of choice.
As a lightweight data-interchange format inspired by JavaScript,
it is clean, easy to read, and easy to parse.
Here is a simple example adapted from the [Wikipedia page on JSON](https://en.wikipedia.org/wiki/JSON):
```
[
{
  "firstName": "John",
  "lastName": "Smith",
  "age": 25,
  "address": {
    "streetAddress": "21 2nd Street",
    "city": "New York",
    "state": "NY",
    "postalCode": "10021"
   }
}
]

```

From the above example, you can see that each data record looks like a [Python dictionary](https://docs.python.org/2/tutorial/datastructures.html#dictionaries).
A JSON file usually contains a list of dictionaries, delimited by '[' and ']'.
In each of those dictionaries,
there is a key-value pair for each field, with the key and the value separated by a colon.
Different key-value pairs are separated by commas.
Note that a value can also be a dictionary; see "address" in the example.
The basic JSON types are object, array, string, number, boolean (true/false), and null.
If you would like to know more about JSON, please refer to
* [Introducing JSON](http://www.json.org/): the JSON org website gives a very good diagrammatic explanation of JSON 📖.
* [Introduction to JSON](https://www.youtube.com/watch?v=WWa0cg_xMC8): a 15-minute YouTube video on JSON, recommended for visual learners.

(Of course, you can also go and find your own materials on JSON by searching the Internet.)
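
To make the dictionary analogy above concrete, here is a minimal sketch that parses a shortened version of the Wikipedia example with Python's standard `json` module. The variable names are illustrative.

```python
import json

# A shortened version of the example record shown above.
json_text = """
[
  {
    "firstName": "John",
    "lastName": "Smith",
    "age": 25,
    "address": {"city": "New York", "state": "NY"}
  }
]
"""

records = json.loads(json_text)   # JSON array  -> Python list
person = records[0]               # JSON object -> Python dict

print(type(records).__name__, type(person).__name__)
print(person["address"]["city"])      # nested objects become nested dicts
print(type(person["age"]).__name__)   # JSON numbers become int or float
```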

In the rest of this section, we will start from a simple example, walking through the steps of acquiring JSON data from the Google Maps Elevation API and normalizing the data into a flat table. Then we revisit the dataset used in the previous section (except that it is now in JSON format), parse the data, and store it in a Pandas DataFrame object.
Before we start, it might be good for you to view one of the following tutorials on parsing JSON files:
* [Working with JSON data](http://wwwlyndacom.ezproxy.lib.monash.edu.au/Python-tutorials/Working-JSON-data/122467/142575-4.html): A Lynda tutorial on parsing JSON data. You need a Monash account to access this website.
[Here](http://resources.lib.monash.edu.au/eresources/lynda-guide.pdf) is the Lynda setup guide.
* A [YouTube video](https://www.youtube.com/watch?v=9Xt2e9x4xwQ) on extracting data from JSON files (**optional**).

### 2.1 Acquiring JSON Data From The Internet
This section starts by showing you how to acquire a small chunk of JSON data
from the Internet via HTTP requests and load it into Python with the `json` library.
The example we use is inspired by a question asked on [Stack Overflow](http://stackoverflow.com/questions/21104592/json-to-pandas-dataframe).
In the example, the goal is to extract elevation data from the
[Google Maps Elevation API](https://developers.google.com/maps/web-services/overview) along
a path specified by latitude and longitude, and convert the JSON data
into a Pandas DataFrame object, which could look similar to the following (but the actual values might vary!)

||elevation|location.lat|location.lng|resolution|
|------|------|------|------|------|
|0|243.346268|42.974049|-81.205203|19.087904|
|1|244.131866|42.974298|-81.195755|19.087904|


The first step is to make an HTTP request to get the data from the Google Maps API.
Here we are going to use the [`urllib2`](https://docs.python.org/2/library/urllib2.html) library (`urllib.request` in Python 3).
It defines a set of functions and classes that help in opening URLs.

In order to run the following code, please follow the instructions on https://developers.google.com/maps/documentation/elevation/start
to request an API key.

In [None]:
locations = "42.974049,-81.205203|42.974298,-81.195755"
try:
    from urllib2 import Request, urlopen # for python 2
except ImportError:
    from urllib.request import urlopen, Request # for python 3

api_key = "YOUR API-KEY" # use your own API key here
request = Request("https://maps.googleapis.com/maps/api/elevation/json?locations="+locations+"&key="+api_key)

response = urlopen(request)
elevations = response.read()
#elevations.splitlines()

In the above code, we:
1. Define a path with the coordinates of the start and end points.
2. Import the Request class and the <font color="blue">urlopen() </font> function from the `urllib2` module (`urllib.request` in Python 3).
3. Create a URL Request object. Note that you can change the output format by replacing '/json' with '/xml'.
4. Open the URL, which returns a file-like object.
5. Read the data returned from the HTTP request.
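
The URL-building step can also be sketched without any network access. The snippet below is an aside for Python 3 only: it uses the standard library's `urllib.parse.urlencode` instead of string concatenation, so that special characters such as `,` and `|` are percent-encoded. The key value is a placeholder, not a working key.

```python
from urllib.parse import urlencode  # Python 3 only

base_url = "https://maps.googleapis.com/maps/api/elevation/json"
params = {
    "locations": "42.974049,-81.205203|42.974298,-81.195755",
    "key": "YOUR_API_KEY",  # placeholder, not a real key
}

# urlencode() joins the parameters with '&' and percent-encodes their values.
url = base_url + "?" + urlencode(params)
print(url)
```

No request is sent here; the resulting `url` string could be passed to `Request` in the same way as the concatenated string above.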

The returned data is stored in a string (a `bytes` object under Python 3).
You can check this using Python's built-in function `type`:
```python
    type(elevations)
```
What does the data look like?
Instead of printing the data as one single string, one can use
```python
    elevations.splitlines()
```
to print the data as a list of lines in the string, breaking
at line boundaries, i.e., '\n'. 
The printout you get should look like
```
['{',
 '   "results" : [',
 '      {',
 '         "elevation" : 243.3462677001953,',
 '         "location" : {',
 '            "lat" : 42.974049,',
 '            "lng" : -81.205203',
 '         },',
 '         "resolution" : 19.08790397644043',
 '      },',
 '      {',
 '         "elevation" : 244.1318664550781,',
 '         "location" : {',
 '            "lat" : 42.974298,',
 '            "lng" : -81.19575500000001',
 '         },',
 '         "resolution" : 19.08790397644043',
 '      }',
 '   ],',
 '   "status" : "OK"',
 '}']
```
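It is easy to dump the data into a JSON file, which just takes three lines of code.
Note that `elevations` is decoded with `json.loads` first; dumping the raw string directly would write a single quoted JSON string rather than the JSON object itself:
```python
    import json
    with open("elevations.json", "w") as outfile:
         json.dump(json.loads(elevations), outfile)
```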

To read the acquired JSON data, you can use the `json` module as follows:

In [None]:
import json
data = json.loads(elevations)
print (type(data))
data

It loads the data into a Python dictionary.
The data we want is stored under the "results" key.
The value of this entry is a list of two dictionaries, each of which corresponds to a record.
See [JSON encoder and decoder](https://docs.python.org/2/library/json.html) for more on reading
JSON files.
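
To show how the parsed dictionary can be navigated without calling the API, the sketch below parses a trimmed, hard-coded copy of the response printed earlier (one record only). The variable names are illustrative.

```python
import json

# Trimmed, hard-coded copy of the response shown earlier (one record only).
sample = (
    '{"results": [{"elevation": 243.3462677001953, '
    '"location": {"lat": 42.974049, "lng": -81.205203}, '
    '"resolution": 19.08790397644043}], "status": "OK"}'
)

parsed = json.loads(sample)

# Check the status field before reading the records under "results".
if parsed["status"] == "OK":
    for record in parsed["results"]:
        location = record["location"]
        print(location["lat"], location["lng"], record["elevation"])
```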

As mentioned earlier in this section,
we will convert the JSON data into a Pandas DataFrame.
Pandas provides its own functions for reading JSON.
If you would like to know about those functions, you can read the Pandas tutorial on [Reading JSON](http://pandas.pydata.org/pandas-docs/stable/io.html#io-json-reader) (**optional**).
Let's first try the <font color="blue">read_json()</font> function.

In [None]:
df = pd.read_json(elevations)
df

Unfortunately, the DataFrame returned by `read_json` is not the one we want.
You might wonder why.
There is a straightforward answer.
Let's try to build a DataFrame from `data` returned by 
```
    data = json.loads(elevations)
```
What do you get?

In [None]:
pd.DataFrame(data)

You get a DataFrame that is exactly the same as the one returned by `read_json`.
This is due to the way Pandas constructs a DataFrame from a dictionary.
See [Intro to Data Structures](http://pandas.pydata.org/pandas-docs/stable/dsintro.html)
for constructing a DataFrame from a dictionary
and "Object Creation" in [10 Minutes to Pandas](http://pandas.pydata.org/pandas-docs/stable/10min.html) 📖.
It is not hard to figure out that the dictionary keys
are used as column
labels, and the values, whatever their data types, are used as column values.
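
This behaviour can be reproduced offline with a small hand-written dictionary shaped like the API response. The elevation values below are rounded and the names `nested` and `frame` are illustrative. Each top-level key becomes a column, and the nested record dictionaries are stored as they are, without being flattened.

```python
import pandas as pd

# Hand-written dictionary with the same shape as the API response.
nested = {
    "results": [
        {"elevation": 243.35, "location": {"lat": 42.974049, "lng": -81.205203}},
        {"elevation": 244.13, "location": {"lat": 42.974298, "lng": -81.195755}},
    ],
    "status": "OK",
}

frame = pd.DataFrame(nested)

print(frame.columns.tolist())                  # top-level keys become columns
print(type(frame.loc[0, "results"]).__name__)  # each cell still holds a dict
print(frame["status"].tolist())                # the scalar is repeated per row
```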

What we want is to flatten the JSON object into a flat table.
Fortunately, Pandas provides a JSON normalization function [(<font color="blue">json_normalize()</font>)](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.json.json_normalize.html)
that takes a dict or a list of dicts and normalizes semi-structured data into a flat table.

In [None]:
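# pandas >= 1.0 exposes this function at the top level; on older versions use: from pandas.io.json import json_normalize
from pandas import json_normalize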
json_normalize(data['results'])

Finally, the <font color="blue">json_normalize()</font> function returns the DataFrame we want.
However, flattening objects with embedded arrays/lists is not as trivial.
See [Flattening JSON objects in Python](https://gist.github.com/amirziai/2808d06f59a38138fa2d)
for more information.
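
As a hedged sketch of both cases, the following assumes pandas 1.0 or later, where the function is available as `pd.json_normalize`. The first call flattens nested dictionaries into dotted column names. The second uses the `record_path` and `meta` arguments to expand an embedded list into one row per element. All sample data and names are made up for illustration.

```python
import pandas as pd

# Case 1: nested dictionaries are flattened into dotted column names.
results = [
    {"elevation": 243.35, "location": {"lat": 42.974049, "lng": -81.205203}},
    {"elevation": 244.13, "location": {"lat": 42.974298, "lng": -81.195755}},
]
flat = pd.json_normalize(results)
print(flat.columns.tolist())

# Case 2: an embedded list is expanded into one row per list element;
# meta copies the parent's "station" value onto each of those rows.
stations = [
    {"station": "A", "readings": [{"hour": 9, "bikes": 5}, {"hour": 10, "bikes": 7}]},
    {"station": "B", "readings": [{"hour": 9, "bikes": 2}]},
]
per_reading = pd.json_normalize(stations, record_path="readings", meta=["station"])
print(per_reading)
```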

### 2.2. Parsing the "Melbourne_bike_share.json"  File
Now you have learned how to use the `json` module and Pandas together to parse a simple JSON file.
In this section we will walk you through the process of extracting bike hub station statistical data from "Melbourne_bike_share.json", and then produce the same DataFrame as the one in Section 1.

Remember that the first step is always to glance through the JSON file with your favorite editor.
Below are the first 20 lines of our JSON file.

<img src = "./json20.png" width = "700", hight = "800">

This JSON file is much more complex than the one used in the previous section.
It might take a bit of time to figure out that this file is a dictionary with
two large entries, one with key "meta" and another with key "data".
The "meta" entry contains all the meta information, including column names.
The "data" entry actually contains the data we want.
In the following subsection, we will show you how to extract records from the "data"
entry, while leaving the task of extracting column labels from the "meta" entry as an exercise.
As before, our JSON data can be read into Python as follows.

In [None]:
import json
with open("./Melbourne_bike_share.json") as json_file:
    json_data = json.load(json_file)
print (type(json_data))
json_data['meta']['view']

The loaded JSON data has been saved in a Python dictionary with two entries, one for "data" and another for "meta".
Using `json_normalize`, you can flatten the "data" entry into a table and save it in a DataFrame.

In [None]:
df = json_normalize(json_data,'data')
df.head()

We seem to have a lot of extra columns.
The data we want starts at column 8.
Therefore, dump all the irrelevant preceding columns.

In [None]:
# Drop the first eight columns (labelled 0-7); range() exists in both Python 2 and 3
df.drop(columns=list(range(8)), inplace=True)

df.head()
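
To see why the columns carry integer labels and how dropping by label works, here is a minimal sketch on a hypothetical miniature of the file layout. The contents are invented, and the plain `DataFrame` constructor is used here in place of `json_normalize`:

```python
import json
import pandas as pd

# Hypothetical miniature: "data" holds records stored as plain lists
raw = """
{"meta": {"note": "column information lives here"},
 "data": [[1, "x", 10, "a"],
          [2, "y", 20, "b"]]}
"""
toy = json.loads(raw)

# Records without field names get integer column labels 0, 1, 2, ...
toy_df = pd.DataFrame(toy["data"])

# Drop the first two columns by their integer labels
trimmed = toy_df.drop(columns=[0, 1])
print(trimmed.columns.tolist())
```

Because the records are lists rather than dictionaries, no field names are available, so the column labels must be assigned afterwards.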

Next, rename all the columns with the field names given by the CSV file. 
You can programmatically extract field names from the "meta" dictionary.
We will leave it for you to do as an exercise.
As with the CSV file, the IDs are unique and can be used as row indices. 

In [None]:
df.columns = ['id','featurename','terminalname','nbbikes','nbemptydoc','uploaddate','coordinates']
df.set_index('id', inplace=True)  # use the unique IDs as row labels; this also removes the 'id' column
df.head()

What's in the last two columns?
"uploaddate" is supposed to hold values in a standard datetime format,
and "coordinates" should hold pairs of latitude and longitude,
both of which should be real numbers.
At the moment, a datetime is encoded as an integer (the number of seconds since the Unix epoch),
and a coordinate is a Python list such as
```python
 [u'{"address":"","city":"","state":"","zip":""}',
 u'-37.814022',
 u'144.939521',
 None,
 False]
```
Let's first convert those integers into standard datetimes.
The following Python code converts 
one of these integers into a standard datetime using Python's
[`datetime`](https://docs.python.org/2/library/datetime.html) module:
```python
    import datetime
    date = datetime.datetime.fromtimestamp(df.iloc[0,4])
    print(date)
```
The output, expressed in local time, is 
```
    2016-01-28 23:45:05
```
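
Since `fromtimestamp` returns local time by default, the printed value depends on the machine's time zone. A minimal sketch of a time-zone-independent variant, using only the standard library and the example timestamp `1453985105` (assumed here to be in seconds):

```python
from datetime import datetime, timezone

ts = 1453985105  # seconds since the Unix epoch (example value)

# Passing tz makes the result independent of the machine's local time zone
utc_time = datetime.fromtimestamp(ts, tz=timezone.utc)
print(utc_time.isoformat())  # 2016-01-28T12:45:05+00:00
```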
Similar to the way of splitting coordinates in Section 2.1, 
one can use `pandas.Series.apply` to invoke  `datetime.datetime.fromtimestamp`
on each individual integer in the column. 
Please try this method by yourself.

Instead, we will show you a Pandas-specific way of converting 
timestamp values in seconds into standard datetimes.
Here we use the Pandas [`to_datetime`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_datetime.html)
function.

In [None]:
df['uploaddate'] = pd.to_datetime(df['uploaddate'], unit='s')
df.head()

Note that the `unit` argument must be specified explicitly. It accepts one of `D`, `s`, `ms`, `us`, or `ns`, and defaults to `ns` (nanoseconds).
Without it, `1453985105`, for example, is interpreted as nanoseconds since the epoch and converted to a date just after the epoch:
```
    Timestamp('1970-01-01 00:00:01.453985105')
```
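
The effect of `unit` can be checked on a single value. This is a small sketch using the example timestamp above; note that `to_datetime` returns a time-zone-naive timestamp in UTC, not local time:

```python
import pandas as pd

ts = 1453985105

as_seconds = pd.to_datetime(ts, unit="s")  # interpreted as seconds since the epoch
as_default = pd.to_datetime(ts)            # interpreted as nanoseconds since the epoch

print(as_seconds)  # 2016-01-28 12:45:05
print(as_default)  # 1970-01-01 00:00:01.453985105
```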
You can compare the converted dates with those in the DataFrame constructed from our CSV file.
For example,

In [None]:
print (csvdf.iloc[0,4]) # the csv date
print (df.iloc[0,4]) 

The difference arises because the two files were downloaded one after another.
However, the time format is the same.

The last step is to extract latitudes and longitudes into two columns.
Each coordinate in the last column of the DataFrame is a Python list.
The second and the third entries are latitude and longitude respectively.
It is very easy to get the two entries into a list.
We will apply the following anonymous function to all the coordinates one after another
```python
    lambda col: col[i]
```
where i = 1 or 2. When i = 1, it returns the latitude; when i = 2, it returns the longitude.

In [None]:
df['lat'] = df['coordinates'].apply(lambda col: col[1]) # second entry: latitude
df['lon'] = df['coordinates'].apply(lambda col: col[2])
df.head()
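
As an alternative sketch, the `.str` accessor can also index into list-valued cells, and `pd.to_numeric` turns the extracted strings into floats. The Series below is a hypothetical stand-in for the "coordinates" column, with made-up values:

```python
import pandas as pd

# Hypothetical column with the same list layout as "coordinates"
coords = pd.Series([
    ['{"address":""}', "-37.814022", "144.939521", None, False],
    ['{"address":""}', "-37.817523", "144.967814", None, False],
])

# .str[i] picks element i of each list; to_numeric parses the strings as floats
lat = pd.to_numeric(coords.str[1])
lon = pd.to_numeric(coords.str[2])
print(lat.tolist(), lon.tolist())
```

This gives the same values as the `apply` approach, with the numeric conversion done in the same step.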

Now, drop the "coordinates" column and convert the remaining columns to appropriate data types.

In [None]:
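df.drop(columns='coordinates', inplace=True)
# convert_objects() was removed in pandas 1.0; convert the numeric columns explicitly
num_cols = ['nbbikes', 'nbemptydoc', 'lat', 'lon']
df[num_cols] = df[num_cols].apply(pd.to_numeric)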
df

## 3. Summary

Files in either CSV or JSON format are the easiest ones to preview, understand and parse. 
In this chapter, you have learned how to pull data out of files stored in those two formats
using Pandas. You should now be familiar with both formats.

## Exercises
1. To further parse the CSV file, try the following: 
    1. Split the "Featurename" into bike hub station's street name, and suburb name, then store them in three columns.
    2. Extract the date and the time from the "UploadDate" column, and store them in two different columns.
2.  Section 2.2 has shown you how to extract data from the given JSON file. However, it did not show
how to programmatically extract column labels from the metadata. The task here is to extract all
the column labels from the metadata using either the <font color='blue'>json_normalize()</font> function or any other way you prefer.