<font size="+3"><strong>Working with JSON files</strong></font>

In this project, we'll be looking at tracking corporate bankruptcies in Poland. To do that, we'll need to get data that's been stored in a `JSON` file, explore it, and turn it into a DataFrame that we'll use to train our model.

In [3]:
import gzip
import json

import pandas as pd

# Prepare Data

## Explore

**Task 1:** Using a context manager, open the file `poland-bankruptcy-data-2009.json` and load it as a dictionary with the variable name `poland_data`.


A **context manager** lets you acquire and release a resource exactly when you need it. The most common way to use one is the `with` statement, which pairs a setup step with a cleanup step and runs your block of code in between. When you open a file this way, the file is closed as soon as the block ends, even if the block raises an error.
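To make the pairing concrete, here is a minimal sketch that reads the same file twice, once with manual cleanup and once with a `with` block. The file name `example.json` and its contents are made up for illustration and are not part of the project data.

```python
import json
import os
import tempfile

# Write a small JSON file to a temporary folder (illustrative data only).
path = os.path.join(tempfile.mkdtemp(), "example.json")
with open(path, "w") as f:
    json.dump({"company_id": 1, "bankrupt": False}, f)

# Without a context manager we have to close the file ourselves.
f = open(path, "r")
try:
    manual = json.load(f)
finally:
    f.close()

# With a context manager the file is closed when the block exits,
# even if an exception is raised inside it.
with open(path, "r") as f:
    managed = json.load(f)

print(f.closed)           # True
print(manual == managed)  # True
```

Both approaches load the same dictionary; the `with` version is shorter and harder to get wrong.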

In [4]:
# Open file and load JSON
with open("data/poland-bankruptcy-data-2009.json", "r") as read_file:
    poland_data = json.load(read_file)

print(type(poland_data))

<class 'dict'>


In [5]:
# Print `poland_data` keys
poland_data.keys()

dict_keys(['schema', 'data', 'metadata'])

In [6]:
poland_data["schema"]["fields"]

[{'name': 'company_id', 'type': 'integer'},
 {'name': 'Attr_1', 'type': 'number'},
 {'name': 'Attr_2', 'type': 'number'},
 {'name': 'Attr_3', 'type': 'number'},
 {'name': 'Attr_4', 'type': 'number'},
 {'name': 'Attr_5', 'type': 'number'},
 {'name': 'Attr_6', 'type': 'number'},
 {'name': 'Attr_7', 'type': 'number'},
 {'name': 'Attr_8', 'type': 'number'},
 {'name': 'Attr_9', 'type': 'number'},
 {'name': 'Attr_10', 'type': 'number'},
 {'name': 'Attr_11', 'type': 'number'},
 {'name': 'Attr_12', 'type': 'number'},
 {'name': 'Attr_13', 'type': 'number'},
 {'name': 'Attr_14', 'type': 'number'},
 {'name': 'Attr_15', 'type': 'number'},
 {'name': 'Attr_16', 'type': 'number'},
 {'name': 'Attr_17', 'type': 'number'},
 {'name': 'Attr_18', 'type': 'number'},
 {'name': 'Attr_19', 'type': 'number'},
 {'name': 'Attr_20', 'type': 'number'},
 {'name': 'Attr_21', 'type': 'number'},
 {'name': 'Attr_22', 'type': 'number'},
 {'name': 'Attr_23', 'type': 'number'},
 {'name': 'Attr_24', 'type': 'number'},
 {'na

In [11]:
poland_data["metadata"]

{'title': 'Ensemble Boosted Trees with Synthetic Features Generation in Application to Bankruptcy Prediction',
 'authors': 'Zieba, M., Tomczak, S. K., & Tomczak, J. M.',
 'journal': 'Expert Systems with Applications',
 'publicationYear': 2016,
 'dataYear': 2009,
 'articleLink': 'doi:10.1016/j.eswa.2016.04.001',
 'datasetLink': 'https://archive.ics.uci.edu/ml/datasets/Polish+companies+bankruptcy+data'}

In [8]:
poland_data["schema"].keys()

dict_keys(['fields', 'primary_key', 'pandas_version'])

In [9]:
# Continue Exploring `poland_data`
poland_data["data"][0]

{'company_id': 1,
 'Attr_1': 0.17419,
 'Attr_2': 0.41299,
 'Attr_3': 0.14371,
 'Attr_4': 1.348,
 'Attr_5': -28.982,
 'Attr_6': 0.60383,
 'Attr_7': 0.21946,
 'Attr_8': 1.1225,
 'Attr_9': 1.1961,
 'Attr_10': 0.46359,
 'Attr_11': 0.21946,
 'Attr_12': 0.53139,
 'Attr_13': 0.14233,
 'Attr_14': 0.21946,
 'Attr_15': 592.24,
 'Attr_16': 0.6163,
 'Attr_17': 2.4213,
 'Attr_18': 0.21946,
 'Attr_19': 0.12272,
 'Attr_20': 37.573,
 'Attr_21': 0.9969,
 'Attr_22': 0.2951,
 'Attr_23': 0.097402,
 'Attr_24': 0.75641,
 'Attr_25': 0.46359,
 'Attr_26': 0.50669,
 'Attr_27': 1.9737,
 'Attr_28': 0.32417,
 'Attr_29': 5.9473,
 'Attr_30': 0.22493,
 'Attr_31': 0.12272,
 'Attr_32': 100.82,
 'Attr_33': 3.6203,
 'Attr_34': 0.71453,
 'Attr_35': 0.2951,
 'Attr_36': 1.8079,
 'Attr_37': 123140.0,
 'Attr_38': 0.46359,
 'Attr_39': 0.16501,
 'Attr_40': 0.21282,
 'Attr_41': 0.041124,
 'Attr_42': 0.16501,
 'Attr_43': 95.682,
 'Attr_44': 58.109,
 'Attr_45': 0.94621,
 'Attr_46': 0.90221,
 'Attr_47': 44.941,
 'Attr_48': 0.26003,

This dataset includes all the information we need to figure out whether or not a Polish company went bankrupt in 2009. It contains many features, each of which corresponds to some element of a company's balance sheet. You can explore the features by looking at the `data dictionary`. Most importantly, we also know whether or not the company went bankrupt. That's the last key-value pair in each record.
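Because Python dictionaries preserve insertion order (guaranteed since Python 3.7), the last key-value pair of a record can be pulled out directly. The sketch below uses a shortened, hypothetical record rather than the real one; the `bankrupt` key name matches the final column of the DataFrame built later in this notebook.

```python
# Hypothetical, shortened record shaped like one entry of poland_data["data"].
record = {"company_id": 1, "Attr_1": 0.17419, "Attr_2": 0.41299, "bankrupt": False}

# items() follows insertion order, so the final item is the last key-value pair.
last_key, last_value = list(record.items())[-1]
print(last_key, last_value)  # bankrupt False
```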

**Task 2:** Calculate the number of companies included in the dataset.


In [15]:
# Calculate number of companies
len(poland_data["data"])

9977

**Task 3:** Calculate the number of features associated with the first company in the dataset (`company_id` 1).

In [20]:
# Calculate number of features
len(poland_data["data"][0])

66

Since we're dealing with data stored in a JSON file, which is common for semi-structured data, we can't assume that all companies have the same features. So let's check!

**Task 4:** Iterate through the companies in `poland_data["data"]` and check that they all have the same number of features.


In [23]:
# Iterate through companies; no output means every record has 66 features
for item in poland_data["data"]:
    if len(item) != 66:
        print("alert")
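A silent loop confirms that the lengths match, but it does not say which records differ when they don't. One possible alternative, sketched here on a few hypothetical records rather than the real data, is to count record lengths and compare each record's keys against the first record's keys.

```python
from collections import Counter

# Hypothetical records standing in for poland_data["data"].
records = [
    {"company_id": 1, "Attr_1": 0.17, "bankrupt": False},
    {"company_id": 2, "Attr_1": 0.14, "bankrupt": False},
    {"company_id": 3, "Attr_1": 0.00},  # this record is missing "bankrupt"
]

# How many records have each length? A single entry means all lengths agree.
length_counts = Counter(len(r) for r in records)
print(length_counts)  # Counter({3: 2, 2: 1})

# Which records do not share the first record's set of keys?
expected_keys = set(records[0])
mismatched = [r["company_id"] for r in records if set(r) != expected_keys]
print(mismatched)  # [3]
```

Comparing key sets is slightly stricter than comparing lengths, since two records can have the same number of features but different feature names.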

**Task 5:** Using a context manager, open the file `poland-bankruptcy-data-2009.json.gz` and load it as a dictionary with the variable name `poland_data_gz`. 


In [25]:
# Open compressed file and load contents
with gzip.open("data/poland-bankruptcy-data-2009.json.gz", "r") as read_file:
    poland_data_gz = json.load(read_file)

print(type(poland_data_gz))

<class 'dict'>
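Note that `gzip.open` treats mode `"r"` as binary, so the cell above hands bytes to `json.load`, which decodes them itself. If you prefer to work with text explicitly, you can use the `"rt"` and `"wt"` modes. The sketch below writes and re-reads a tiny, made-up payload in a temporary file; the file name and contents are illustrative only.

```python
import gzip
import json
import os
import tempfile

# Illustrative payload, not the real dataset.
payload = {"data": [{"company_id": 1, "bankrupt": False}]}
path = os.path.join(tempfile.mkdtemp(), "example.json.gz")

# "wt" compresses text as it is written.
with gzip.open(path, "wt", encoding="utf-8") as f:
    json.dump(payload, f)

# "rt" decompresses and decodes back to text while reading.
with gzip.open(path, "rt", encoding="utf-8") as f:
    restored = json.load(f)

print(type(restored))         # <class 'dict'>
print(list(restored.keys()))  # ['data']
```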


**Task 6:** Explore `poland_data_gz` to confirm that it contains the same data as `poland_data`, in the same format.

In [26]:
# Explore `poland_data_gz`
print(poland_data_gz.keys())
print(len(poland_data_gz["data"]))
print(len(poland_data_gz["data"][0]))

dict_keys(['schema', 'data', 'metadata'])
9977
66
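Matching keys and lengths are good evidence, but they don't prove the contents are identical. Python compares nested dictionaries and lists element by element with `==`, so a stricter check is possible. The sketch below demonstrates the idea on a small, hypothetical JSON string parsed twice. One caveat: if the data contained `NaN` values, this comparison could report a difference even for identical files, because `NaN` is not equal to itself.

```python
import json

# Hypothetical JSON text with the same overall layout as the project file.
raw = '{"schema": {"fields": []}, "data": [{"company_id": 1, "bankrupt": false}]}'

first = json.loads(raw)
second = json.loads(raw)

# Two separately parsed copies compare equal, all the way down.
same_before = first == second
print(same_before)  # True

# Changing a single nested value is enough to break equality.
second["data"][0]["bankrupt"] = True
same_after = first == second
print(same_after)  # False
```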


**Task 7:** Create a DataFrame `df` that contains all the companies in the dataset, indexed by `"company_id"`. Remember the principles of *tidy data* that you learned in Project 1, and make sure your DataFrame has shape `(9977, 65)`.

In [27]:
df = pd.DataFrame.from_dict(poland_data_gz["data"]).set_index("company_id")
print(df.shape)
df.head()

(9977, 65)


| company_id | Attr_1 | Attr_2 | Attr_3 | Attr_4 | Attr_5 | Attr_6 | Attr_7 | Attr_8 | Attr_9 | Attr_10 | ... | Attr_56 | Attr_57 | Attr_58 | Attr_59 | Attr_60 | Attr_61 | Attr_62 | Attr_63 | Attr_64 | bankrupt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.17419 | 0.41299 | 0.14371 | 1.348 | -28.982 | 0.60383 | 0.21946 | 1.1225 | 1.1961 | 0.46359 | ... | 0.16396 | 0.37574 | 0.83604 | 7e-06 | 9.7145 | 6.2813 | 84.291 | 4.3303 | 4.0341 | False |
| 2 | 0.14624 | 0.46038 | 0.2823 | 1.6294 | 2.5952 | 0.0 | 0.17185 | 1.1721 | 1.6018 | 0.53962 | ... | 0.027516 | 0.271 | 0.90108 | 0.0 | 5.9882 | 4.1103 | 102.19 | 3.5716 | 5.95 | False |
| 3 | 0.000595 | 0.22612 | 0.48839 | 3.1599 | 84.874 | 0.19114 | 0.004572 | 2.9881 | 1.0077 | 0.67566 | ... | 0.007639 | 0.000881 | 0.99236 | 0.0 | 6.7742 | 3.7922 | 64.846 | 5.6287 | 4.4581 | False |
| 4 | 0.024526 | 0.43236 | 0.27546 | 1.7833 | -10.105 | 0.56944 | 0.024526 | 1.3057 | 1.0509 | 0.56453 | ... | 0.048398 | 0.043445 | 0.9516 | 0.14298 | 4.2286 | 5.0528 | 98.783 | 3.695 | 3.4844 | False |
| 5 | 0.18829 | 0.41504 | 0.34231 | 1.9279 | -58.274 | 0.0 | 0.23358 | 1.4094 | 1.3393 | 0.58496 | ... | 0.17648 | 0.32188 | 0.82635 | 0.073039 | 2.5912 | 7.0756 | 100.54 | 3.6303 | 4.6375 | False |
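Each record has 66 key-value pairs, yet the DataFrame has 65 columns, because `company_id` moves out of the columns and becomes the index. The sketch below shows the same effect on two hypothetical, shortened records. It uses `pd.DataFrame.from_records`, which is one of several ways to build a DataFrame from a list of dictionaries.

```python
import pandas as pd

# Hypothetical, shortened records standing in for poland_data_gz["data"].
records = [
    {"company_id": 1, "Attr_1": 0.17419, "bankrupt": False},
    {"company_id": 2, "Attr_1": 0.14624, "bankrupt": False},
]

# Three keys per record, but one of them becomes the index.
df_example = pd.DataFrame.from_records(records).set_index("company_id")

print(df_example.shape)          # (2, 2)
print(list(df_example.columns))  # ['Attr_1', 'bankrupt']
print(df_example.index.name)     # company_id
```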


## Import

**Task 8:** Create a `wrangle` function that takes the name of a compressed file as input and returns a tidy DataFrame. After you confirm that your function is working as intended, submit it to the grader. 

In [28]:
def wrangle(filename):
    # Open compressed file and load it into a dictionary
    with gzip.open(filename, "r") as read_file:
        poland_data_gz = json.load(read_file)

    # Build a DataFrame from the records and index it by company ID
    df = pd.DataFrame.from_dict(poland_data_gz["data"]).set_index("company_id")

    return df

In [29]:
df = wrangle("data/poland-bankruptcy-data-2009.json.gz")
print(df.shape)
df.head()

(9977, 65)


| company_id | Attr_1 | Attr_2 | Attr_3 | Attr_4 | Attr_5 | Attr_6 | Attr_7 | Attr_8 | Attr_9 | Attr_10 | ... | Attr_56 | Attr_57 | Attr_58 | Attr_59 | Attr_60 | Attr_61 | Attr_62 | Attr_63 | Attr_64 | bankrupt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.17419 | 0.41299 | 0.14371 | 1.348 | -28.982 | 0.60383 | 0.21946 | 1.1225 | 1.1961 | 0.46359 | ... | 0.16396 | 0.37574 | 0.83604 | 7e-06 | 9.7145 | 6.2813 | 84.291 | 4.3303 | 4.0341 | False |
| 2 | 0.14624 | 0.46038 | 0.2823 | 1.6294 | 2.5952 | 0.0 | 0.17185 | 1.1721 | 1.6018 | 0.53962 | ... | 0.027516 | 0.271 | 0.90108 | 0.0 | 5.9882 | 4.1103 | 102.19 | 3.5716 | 5.95 | False |
| 3 | 0.000595 | 0.22612 | 0.48839 | 3.1599 | 84.874 | 0.19114 | 0.004572 | 2.9881 | 1.0077 | 0.67566 | ... | 0.007639 | 0.000881 | 0.99236 | 0.0 | 6.7742 | 3.7922 | 64.846 | 5.6287 | 4.4581 | False |
| 4 | 0.024526 | 0.43236 | 0.27546 | 1.7833 | -10.105 | 0.56944 | 0.024526 | 1.3057 | 1.0509 | 0.56453 | ... | 0.048398 | 0.043445 | 0.9516 | 0.14298 | 4.2286 | 5.0528 | 98.783 | 3.695 | 3.4844 | False |
| 5 | 0.18829 | 0.41504 | 0.34231 | 1.9279 | -58.274 | 0.0 | 0.23358 | 1.4094 | 1.3393 | 0.58496 | ... | 0.17648 | 0.32188 | 0.82635 | 0.073039 | 2.5912 | 7.0756 | 100.54 | 3.6303 | 4.6375 | False |
