# Transform a batch of JSON files into a single CSV file

This tutorial uses the Python module [pandas (Python Data Analysis Library)](https://pandas.pydata.org) to open a batch of JSON files and transform their contents into a single CSV file.

## 1. Install pandas

If you do not have pandas installed yet, uncomment and run ONE of the following commands.

In [1]:
# ! conda install pandas --yes

# OR

# ! pip install pandas

## 2. Import modules

Next, we will import a few Python modules.

In [2]:
import csv
import json
import os
import pandas as pd

print("modules imported")

modules imported


## 3. Declare the file paths and names

Next, declare your file paths and names. For this tutorial, we are going to open 3 JSON files that are in the local folder called `sample-jsons`.

Then, we enter the name `pandas-output` for the CSV file that will be created.

In [3]:
json_path = r"sample-jsons" # point to the folder path
csv_name = "pandas-output" # name for the csv to be created

print("file paths and names declared")

file paths and names declared


## 4. Create an empty list

Before we run a Python loop, we need to create an empty list that will store the information. We name it `jsonMetadata` and set it equal to an empty list (`= []`).

When we print the list, we see that it is empty.

In [4]:
jsonMetadata = [] # empty list

print(jsonMetadata)

[]


## 5. Open the JSON files and add them to a Python List

The code uses `os.walk` to step through the folder and any subfolders. It opens each JSON file, reads the metadata, and adds it to the list called `jsonMetadata` (the one we created in the last step).

When we print the list, we can see all of the metadata from the JSONs. The whole list is within square brackets `[]`, and the metadata from each file is within curly braces `{}`.

In [5]:
for path, dirs, files in os.walk(json_path):
    for filename in files:
        if filename.endswith(".json"):
            file_path = os.path.join(path, filename)
            # "with" closes the file automatically once it has been read
            with open(file_path, 'rb') as json_file:
                data = json_file.read().decode('utf-8', errors='ignore')
            loaded = json.loads(data)
            jsonMetadata.append(loaded)

# print once, after every file has been added
print(jsonMetadata)

[{'geoblacklight_version': '1.0', 'dc_identifier_s': '9d298d5f-6eb6-453a-a266-992aa38db665', 'dc_title_s': 'LiDAR-Derived Countywide DEM for Iron County, WI 2019', 'dc_description_s': 'This data represents a LiDAR-derived countywide Digital Elevation Model (DEM) for Iron County, Wisconsin in 2019. A DEM represents the bare-Earth surface, removing all natural and built features. This dataset contains a single file covering the geographic extent of the entire county.', 'dc_rights_s': 'Public', 'dct_provenance_s': 'WisconsinView', 'layer_id_s': '', 'layer_slug_s': '9d298d5f-6eb6-453a-a266-992aa38db665', 'layer_geom_type_s': 'Raster', 'layer_modified_dt': '2022-01-22T20:12:43Z', 'dc_format_s': 'GeoTIFF', 'dc_language_s': 'English', 'dct_isPartOf_sm': ['Wisconsin Elevation Data', 'Coastal'], 'dc_creator_sm': ['U.S. Geological Survey'], 'dc_publisher_sm': '', 'dc_type_s': 'Dataset', 'dc_subject_sm': ['Elevation'], 'dct_spatial_sm': [''], 'dct_temporal_sm': ['2019'], 'solr_year_i': 2019, 'dct
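As a variation on the loop above, the same idea can be wrapped in a reusable function. This is only a minimal sketch: the function name `load_json_folder` and the sample files are made up for illustration, it uses `pathlib` instead of `os.walk`, and it assumes the files are valid UTF-8 (it does not ignore decoding errors the way the cell above does).

```python
import json
import tempfile
from pathlib import Path


def load_json_folder(folder):
    """Return a list with the parsed contents of every .json file under folder."""
    records = []
    # rglob searches the folder and all of its subfolders, much like os.walk
    for file_path in sorted(Path(folder).rglob("*.json")):
        with open(file_path, encoding="utf-8") as json_file:
            records.append(json.load(json_file))
    return records


# Demonstration with made-up files in a temporary folder
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.json").write_text('{"dc_title_s": "Sample A"}', encoding="utf-8")
    Path(tmp, "b.json").write_text('{"dc_title_s": "Sample B"}', encoding="utf-8")
    Path(tmp, "notes.txt").write_text("not a JSON file", encoding="utf-8")
    sample_records = load_json_folder(tmp)

print(sample_records)
```

In the tutorial's setting, the call would presumably be `load_json_folder(json_path)`.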

## 6. Convert the List into a pandas DataFrame

Here is where **pandas** finally comes in. We convert the list (`jsonMetadata`) into a special object called a *pandas DataFrame*. Here, we use the convention of calling the DataFrame `df`. We will print out the DataFrame so you can see how the metadata is restructured into columns, one for each metadata field name.

In [6]:
df = pd.DataFrame(jsonMetadata)
print(df)

  geoblacklight_version                       dc_identifier_s  \
0                   1.0  9d298d5f-6eb6-453a-a266-992aa38db665   
1                   1.0  669cfb43-a931-4036-b0d1-1d6494a76939   
2                   1.0  903a388a-248b-416d-8dec-ff0258d84023   

                                          dc_title_s  \
0  LiDAR-Derived Countywide DEM for Iron County, ...   
1  LiDAR-Derived Breaklines (QL1) for Florence Co...   
2  LiDAR-Derived Intensity Images (QL2) for Ashla...   

                                    dc_description_s dc_rights_s  \
0  This data represents a LiDAR-derived countywid...      Public   
1  This data represents LiDAR-derived breaklines ...      Public   
2  This data represents LiDAR-derived intensity i...      Public   

  dct_provenance_s layer_id_s                          layer_slug_s  \
0    WisconsinView             9d298d5f-6eb6-453a-a266-992aa38db665   
1    WisconsinView             669cfb43-a931-4036-b0d1-1d6494a76939   
2    WisconsinView          
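To see how the conversion works on a small scale, here is a minimal sketch that uses two made-up records instead of the sample JSONs. Each dictionary becomes a row and each field name becomes a column; when a record lacks a field, pandas leaves that cell empty (`NaN`).

```python
import pandas as pd

# Two made-up records; the second one lacks the "dc_format_s" field
sample_records = [
    {"dc_identifier_s": "id-1", "dc_title_s": "First title", "dc_format_s": "GeoTIFF"},
    {"dc_identifier_s": "id-2", "dc_title_s": "Second title"},
]

sample_df = pd.DataFrame(sample_records)

print(sample_df.shape)          # (number of rows, number of columns)
print(list(sample_df.columns))  # one column per distinct field name
print(sample_df["dc_format_s"].isna().tolist())  # True where the field was absent
```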

## 7. Drop one of the columns

Now that all of the metadata from the JSON files is loaded into a pandas DataFrame, we can manipulate it in various ways. For example, let's say we do not want to include the column called `geoblacklight_version` in our final output. We can call the `.drop` method. When the DataFrame is printed out again, the first column is gone!

In [7]:
df = df.drop(columns=['geoblacklight_version'])
print(df)

                        dc_identifier_s  \
0  9d298d5f-6eb6-453a-a266-992aa38db665   
1  669cfb43-a931-4036-b0d1-1d6494a76939   
2  903a388a-248b-416d-8dec-ff0258d84023   

                                          dc_title_s  \
0  LiDAR-Derived Countywide DEM for Iron County, ...   
1  LiDAR-Derived Breaklines (QL1) for Florence Co...   
2  LiDAR-Derived Intensity Images (QL2) for Ashla...   

                                    dc_description_s dc_rights_s  \
0  This data represents a LiDAR-derived countywid...      Public   
1  This data represents LiDAR-derived breaklines ...      Public   
2  This data represents LiDAR-derived intensity i...      Public   

  dct_provenance_s layer_id_s                          layer_slug_s  \
0    WisconsinView             9d298d5f-6eb6-453a-a266-992aa38db665   
1    WisconsinView             669cfb43-a931-4036-b0d1-1d6494a76939   
2    WisconsinView             903a388a-248b-416d-8dec-ff0258d84023   

  layer_geom_type_s     layer_modified_dt dc
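The `.drop` method also accepts several column names at once. The sketch below uses a small made-up DataFrame; the column `not_a_real_column` is deliberately absent to show that `errors="ignore"` lets pandas skip names it cannot find instead of raising an error.

```python
import pandas as pd

# A one-row, made-up DataFrame with three columns
sample_df = pd.DataFrame(
    {"geoblacklight_version": ["1.0"], "layer_id_s": [""], "dc_title_s": ["Sample"]}
)

# Drop several columns in one call; errors="ignore" skips names that are absent
unwanted = ["geoblacklight_version", "layer_id_s", "not_a_real_column"]
trimmed_df = sample_df.drop(columns=unwanted, errors="ignore")

print(list(trimmed_df.columns))
```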

## 8. Write the DataFrame to a CSV file

We can perform other data conversions or analysis at this step as well, such as changing the column names, rearranging them, or other manipulations. For now, we will write the DataFrame to a CSV to look at.

In [8]:
df.to_csv("{}.csv".format(csv_name))
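The manipulations mentioned above (renaming and rearranging columns) might look like the following sketch. The DataFrame and the new column name `Title` are made up for illustration, and the CSV is written to an in-memory buffer rather than to disk. It also passes `index=False`, an optional argument that leaves the row numbers out of the CSV; without it, `to_csv` writes the index as an extra first column, which shows up as `Unnamed: 0` when the file is read back.

```python
import io

import pandas as pd

sample_df = pd.DataFrame(
    {"dc_identifier_s": ["id-1"], "dc_title_s": ["Sample title"], "dc_rights_s": ["Public"]}
)

# Rename one column (the new name "Title" is only an illustration)
renamed_df = sample_df.rename(columns={"dc_title_s": "Title"})

# Rearrange the columns by listing them in the desired order
reordered_df = renamed_df[["Title", "dc_identifier_s", "dc_rights_s"]]

# index=False leaves the row numbers (0, 1, 2, ...) out of the CSV
buffer = io.StringIO()  # in-memory stand-in for a file on disk
reordered_df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

print(csv_text)
```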

## 9. Inspect the new CSV file

In practice, you will likely open a generated CSV file in a spreadsheet editor to prepare the metadata for publishing. However, let's take a look at it within this Notebook using the pandas `.read_csv` function.

In [9]:
new_csv = pd.read_csv("pandas-output.csv")
new_csv.head(3)  # display the first 3 rows

Unnamed: 0.1,Unnamed: 0,dc_identifier_s,dc_title_s,dc_description_s,dc_rights_s,dct_provenance_s,layer_id_s,layer_slug_s,layer_geom_type_s,layer_modified_dt,...,dct_spatial_sm,dct_temporal_sm,solr_year_i,dct_issued_s,dct_references_s,solr_geom,thumbnail_path_ss,uw_supplemental_s,uw_notice_s,uw_deprioritize_item_b
0,0,9d298d5f-6eb6-453a-a266-992aa38db665,"LiDAR-Derived Countywide DEM for Iron County, ...",This data represents a LiDAR-derived countywid...,Public,WisconsinView,,9d298d5f-6eb6-453a-a266-992aa38db665,Raster,2022-01-22T20:12:43Z,...,[''],['2019'],2019,,"{""http://schema.org/url"":""https://www.sco.wisc...","ENVELOPE(-90.55228446,-89.92830784,46.59004326...",,,,False
1,1,669cfb43-a931-4036-b0d1-1d6494a76939,LiDAR-Derived Breaklines (QL1) for Florence Co...,This data represents LiDAR-derived breaklines ...,Public,WisconsinView,,669cfb43-a931-4036-b0d1-1d6494a76939,Line,2022-01-22T20:12:43Z,...,[''],['2019'],2019,,"{""http://schema.org/url"":""https://www.sco.wisc...","ENVELOPE(-88.68421375,-88.05822859,46.02121159...",,,,False
2,2,903a388a-248b-416d-8dec-ff0258d84023,LiDAR-Derived Intensity Images (QL2) for Ashla...,This data represents LiDAR-derived intensity i...,Public,WisconsinView,,903a388a-248b-416d-8dec-ff0258d84023,Raster,2022-01-22T20:12:43Z,...,[''],['2019'],2019,,"{""http://schema.org/url"":""https://www.sco.wisc...","ENVELOPE(-90.92761352,-90.3000452,47.08077476,...",,,,False
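One thing to notice in the output above is that multi-valued fields such as `dct_temporal_sm` come back from the CSV as text (for example `['2019']`), not as Python lists. If you need real lists again, one possible approach is `ast.literal_eval` from the standard library. The sketch below uses a made-up single-column DataFrame and an in-memory buffer in place of a CSV file on disk.

```python
import ast
import io

import pandas as pd

# Made-up DataFrame whose cells hold Python lists
original_df = pd.DataFrame({"dct_temporal_sm": [["2019"], ["2018", "2019"]]})

# Round trip through CSV text held in memory
buffer = io.StringIO()
original_df.to_csv(buffer, index=False)
buffer.seek(0)
reloaded_df = pd.read_csv(buffer)

# After the round trip, each cell is a string that only looks like a list
first_cell_before = reloaded_df.loc[0, "dct_temporal_sm"]
print(type(first_cell_before))

# literal_eval safely parses the text back into a Python list
reloaded_df["dct_temporal_sm"] = reloaded_df["dct_temporal_sm"].apply(ast.literal_eval)
first_cell_after = reloaded_df.loc[0, "dct_temporal_sm"]
print(type(first_cell_after))
```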


*For a more complex version of this script, see the Recipes section.*