# JSON Parsing

JavaScript Object Notation (JSON) is an increasingly popular data interchange format. There are a variety of specialty derivatives of JSON, such as GeoJSON.

JSON objects are made up of key-value pairs. For example, an object with two keys is encoded as text like this: ` { "key1" : "value1", "key2" : "value2" } `

Values can also be arrays, which are ordered lists of values such as objects. Here, each `{...}` represents an object in the array:
`[ {...}, {...} ] `
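
As a small illustration of the two structures described above, the sketch below parses a JSON array of objects from a string using `json.loads`. The keys and values are made up for the example and do not come from any dataset used in this notebook.

```python
import json

# Hypothetical JSON text: an array holding two objects
text = '[ {"key1": "value1", "key2": "value2"}, {"key1": "value3", "key2": "value4"} ]'

records = json.loads(text)      # parse the JSON text into Python objects
first = records[0]              # the first object in the array

print(type(records).__name__)   # the array becomes a Python list
print(first["key1"])            # look up a value by its key
```

Note that `json.loads` parses a string, while `json.load` reads from an open file.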

In fact, this notebook is encoded as a JSON file.  If you open it in a regular text editor, you can see the structure of it.

As an example of how to parse JSON, we'll use the Syrian IDP JSON file.  Note that we use the `json` module for this rather than `BeautifulSoup`.

In [None]:
import json

file_data = json.load(open('/dsa/data/all_datasets/Syria_IDPSites_2015LateJun_HIU_DoS.geojson', encoding='latin-1'))
# print() renders the line breaks and indentation instead of showing an escaped string
print(json.dumps(file_data, sort_keys=True, indent=4, separators=(',', ': ')))


You will notice the data looks JSON-ish above.

How does it look as a Python object?

In [None]:
import json

file_data = json.load(open('/dsa/data/all_datasets/Syria_IDPSites_2015LateJun_HIU_DoS.geojson', encoding='latin-1'))
print(file_data)

Basically the same.  It helps to remember the equivalent names in JSON and Python:

JSON object = Python dictionary (`dict`), i.e., name-value pairs
   * Read more about dict here: https://docs.python.org/3.3/library/stdtypes.html#dict
 
JSON array = Python list
   * Read more about list here: https://docs.python.org/3.3/library/stdtypes.html#lists
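
To see the correspondence for the other JSON value types as well, the following sketch parses a small, made-up document and records the Python type each value was converted to. The key names are hypothetical and chosen only for illustration.

```python
import json

# Hypothetical document that mixes every JSON value type
text = '{"name": "site", "count": 3, "ratio": 0.5, "open": true, "note": null, "tags": ["a", "b"]}'
parsed = json.loads(text)

# Record the Python type that each JSON value was converted to
python_types = {key: type(value).__name__ for key, value in parsed.items()}
for key, type_name in python_types.items():
    print(key, '->', type_name)
```

Strings, numbers, `true`/`false`, and `null` map to `str`, `int` or `float`, `bool`, and `None` respectively.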


A common operation is to produce JSON formatted files from other data, such as CSV files.  This can be done with the `csv` and `json` modules, but it is simpler to read and write the data using `pandas`.
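
For comparison, the `csv` and `json` route might look like the sketch below. It uses a short, made-up CSV string wrapped in `io.StringIO` in place of a real file, so the column names and values are assumptions for illustration only.

```python
import csv
import io
import json

# Hypothetical CSV content standing in for a file on disk
csv_text = "name,latitude,longitude\nSite A,36.2,37.1\nSite B,35.9,36.6\n"

# DictReader yields one dictionary per row, keyed by the header line
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Serialize the list of row dictionaries as a JSON array of objects
json_text = json.dumps(rows, indent=2)
print(json_text)
```

One thing to watch with this approach is that `csv` reads every field as a string, so numeric columns have to be converted by hand. The pandas cells that follow infer column types automatically.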

In [None]:
# Read the file using pandas
import pandas as pd
filepath = '/dsa/data/all_datasets/SyriaIDPSites2015LateJunHIUDoS.csv'
df = pd.read_csv(filepath, encoding='latin-1')
print(df)

In [None]:
# Write the file using pandas
jsonfile = 'MyOutput.json'
df.to_json(jsonfile)

In [None]:
# Confirm that the JSON file results in the same dataframe
df_json = pd.read_json(jsonfile)
print(df.shape)
print(df_json.shape)
print(df_json)       # Columns and rows are in a different order, but looks the same.

Besides file I/O, JSON is commonly used as a configuration file format and as a data transmission format (e.g., JSON-RPC, or pushing/pulling data between a Python program and a web service).
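
A minimal sketch of the configuration and transmission use case is shown below: a Python dictionary is serialized to a JSON string with `json.dumps`, as it would be before being written to a config file or sent to a service, and then parsed back with `json.loads`. The setting names and values are hypothetical.

```python
import json

# Hypothetical settings for a program
config = {"host": "localhost", "port": 8080, "debug": False, "paths": ["/tmp/a", "/tmp/b"]}

payload = json.dumps(config, sort_keys=True)   # serialize to a JSON string
restored = json.loads(payload)                 # parse the string back into Python objects

print(payload)
print(restored == config)   # the round trip preserves the data
```

The round trip is lossless here because the dictionary only holds types that JSON can represent directly. Other types, such as tuples or dates, would not come back unchanged.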

Here, we use the `json` module to read a GeoJSON file because `pandas` can't read those files.  However, there is now a `GeoPandas` module that *can* [read GeoJSON](http://geopandas.org/io.html).  This may be a worthwhile alternative.

In [None]:
import json
import pandas
file_data = json.load(open('/dsa/data/all_datasets/Syria_IDPSites_2015LateJun_HIU_DoS.geojson', encoding='latin-1'))
df = pandas.DataFrame(file_data['features'])
df.head()

## Flatten your pandas
Looking at the output above, we can see that some columns in the DataFrame (e.g. geometry and properties) are actually embedded dictionaries. This is a common occurrence when dealing with hierarchical data. You will encounter this in many data scraping / parsing scenarios.

#### Dig into your data
The key is to interrogate your data and iterate on your code until you have flattened the data.

In [None]:
# Show a single entry of the geometry column
# (print is needed because a notebook cell only displays its last expression)
print(df.geometry.iloc[0])

# Show a single entry of the properties column
print(df.properties.iloc[0])

### How to do it
Look at the pandas API (http://pandas.pydata.org/pandas-docs/stable/api.html) and the code above, and try to create a new DataFrame that completely flattens the record for each feature.

The efficient way to do this is to build a list of flattened rows, then instantiate the DataFrame from the list of rows. The alternative, repeatedly combining the growing DataFrame with a new one-row DataFrame, copies the data on every iteration and is much slower.

So, we create an empty list, then append each flattened row to it as a dictionary.

See: https://docs.python.org/3/tutorial/datastructures.html 
  * Note the `list.append(x)` method.

See: https://docs.python.org/2/library/stdtypes.html#dict.update 
  * Note the dictionary merge method, `update(x)`.
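
The two methods can be tried out on plain Python data before involving pandas. The sketch below uses made-up records loosely shaped like GeoJSON features; the field names and values are assumptions for illustration, not the contents of the Syrian IDP file.

```python
# Hypothetical nested records, loosely shaped like GeoJSON features
records = [
    {"type": "Feature", "properties": {"name": "Site A", "population": 120}},
    {"type": "Feature", "properties": {"name": "Site B", "population": 75}},
]

list_of_rows = []
for record in records:
    flat = dict(record)                 # shallow copy so the input is left untouched
    nested = flat.pop("properties")     # remove the nested dictionary from the row
    flat.update(nested)                 # merge its keys into the top level
    list_of_rows.append(flat)           # collect the flattened row

print(list_of_rows)
```

A list built this way can be passed straight to `pandas.DataFrame(...)`, which turns each dictionary into one row.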


**First run this code block.**  It has a `break` statement in it to only run the first iteration, and it is annotated with several print statements to show what is happening.  This code only flattens 'properties'.  It is left to you to do the same with 'geometry'.

In [None]:
list_of_rows = []
for index, row in df.iterrows():

    # Some columns are actually dictionaries for column values
    print('Here\'s what one row looks like:\n{}\n'.format(row))
    print('Type of "row": {}\n'.format(type(row)))
   
    rowDict = row.to_dict()   # convert the row object, which is a pandas Series, into a dictionary
    
    print('rowDict: \n{}\n'.format(rowDict))
    print('Type of "rowDict": {}\n'.format(type(rowDict)))
    
    # Note that the properties and geometry keys refer to dictionaries themselves.
    # Pull out properties and geometry, and append them to the row  using the "update" method.
    
    properties = rowDict.pop('properties')   # remove the properties field into a variable named properties
    
    rowDict.update(properties)     # merge in the flattened properties
    
    print('rowDict: \n{}\n'.format(rowDict))
    
    # Append the newly flattened row to the list of rows
    list_of_rows.append(rowDict)
    
    break  # Stop here before looping to the next row
          
df2 = pandas.DataFrame(list_of_rows)
df2.head()

### <span style="background:yellow;">Your Turn</span>

Now that you have seen the possibilities with the example above, completely flatten the JSON into a pandas DataFrame by flattening the 'geometry' field as well as 'properties', then compute the average longitude and latitude using the coordinates array within the geometry column.

In [None]:
import json
import pandas
file_data = json.load(open('/dsa/data/all_datasets/Syria_IDPSites_2015LateJun_HIU_DoS.geojson', encoding='latin-1'))
df = pandas.DataFrame(file_data['features'])

list_of_rows = []
for index, row in df.iterrows():
    # Do Flattening
    # Replace pass with code
    pass



df2 = pandas.DataFrame(list_of_rows)
df2.head()

# SAVE YOUR NOTEBOOK