# Transform a batch of JSON files into a single CSV file

This tutorial uses the Python module [pandas (Python Data Analysis Library)](https://pandas.pydata.org) to open a batch of JSON files and transform the contents into a single CSV file.

## About pandas

Pandas is a Python library that contains many functions for analyzing data. For the GeoBTAA workflows, we are most interested in how it eases transformations between JSON and CSV files:

*CSV files*: Pandas can easily read and write CSV files using its `read_csv()` and `to_csv()` methods, respectively. These methods can handle many CSV formats, including different delimiter characters, header options, and data types. Once the CSV data is loaded into a Pandas DataFrame, it can be easily manipulated and analyzed using Pandas' powerful data manipulation tools, such as filtering, grouping, and aggregation.

*JSON data*: Pandas can also read and write JSON data using its `read_json()` and `to_json()` methods. These methods can handle various JSON formats, such as normal JSON objects, JSON arrays, and JSON lines. Once the JSON data is loaded into a Pandas DataFrame, it can be easily manipulated and analyzed using the same data manipulation tools used for CSV data.

*pandas DataFrame*: A DataFrame is similar to a Python list or dictionary, but it has rows and columns, like a spreadsheet. This makes it simpler to convert between JSON and CSV. To review these Python terms, refer to the glossary.
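
As a minimal sketch of that JSON-to-DataFrame-to-CSV path, the following uses in-memory strings instead of files. The records and the field names `id` and `title` are made up for illustration and are not taken from the GeoBTAA metadata.

```python
import io

import pandas as pd

# Hypothetical sample data: a JSON array of two records.
json_text = '[{"id": "a1", "title": "Roads"}, {"id": "b2", "title": "Lakes"}]'

# read_json parses the array into a DataFrame:
# one row per record, one column per key.
df = pd.read_json(io.StringIO(json_text))

# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_text = df.to_csv(index=False)
print(csv_text)

# read_csv reverses the step and rebuilds the table from the CSV text.
df_again = pd.read_csv(io.StringIO(csv_text))
print(df_again.shape)
```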

## 1. Install pandas

If you do not have pandas installed yet, choose ONE of the following commands and uncomment it.

In [None]:
# ! conda install pandas --yes

# OR

# ! pip install pandas

## 2. Import modules

Next, we will import a few Python modules.

In [None]:
import csv
import json
import os
import pandas as pd

print("modules imported")

## 3. Declare the file paths and names

Next, declare your file paths and names. For this tutorial, we are going to open 3 JSON files that are in the local folder called `sample-jsons`.

Then, we enter the name `pandas-output` for the CSV file that will be created.

In [None]:
json_path = r"sample-jsons" # point to the folder path
csv_name = "pandas-output" # name for the csv to be created

## 4. Create an empty list

Before we run a Python loop, we need to create an empty list that will store the information. We name it `jsonMetadata` and set it equal to an empty list (`= []`).

In [None]:
jsonMetadata = [] # empty list

## 5. Open the JSON files and add them to a Python List

The code uses `os.walk` to visit every file in the folder. For each file whose name ends in `.json`, it opens the file, reads the metadata, and adds it to the list called `jsonMetadata` (the one we created in the last step).

In [None]:
for path, dirs, files in os.walk(json_path):
    for filename in files:
        if filename.endswith(".json"):
            file_path = os.path.join(path, filename)
            # read the raw bytes, then decode, dropping any bytes that are not valid UTF-8
            with open(file_path, 'rb') as json_file_open:
                data = json_file_open.read().decode('utf-8', errors='ignore')
            loaded = json.loads(data)  # parse the JSON text into a Python object
            jsonMetadata.append(loaded)
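
The same loop can be wrapped in a function that skips files that are not valid JSON instead of stopping at the first one. This is only a sketch under assumptions: the function name `load_json_folder` is hypothetical, and the demonstration writes two made-up files to a temporary folder rather than reading `sample-jsons`.

```python
import json
import os
import tempfile


def load_json_folder(folder):
    """Return a list with the parsed contents of every .json file under folder."""
    records = []
    for path, _dirs, files in os.walk(folder):
        for filename in sorted(files):
            if not filename.endswith(".json"):
                continue
            file_path = os.path.join(path, filename)
            # The with-block closes the file even if parsing fails.
            with open(file_path, encoding="utf-8", errors="ignore") as f:
                try:
                    records.append(json.load(f))
                except json.JSONDecodeError as err:
                    print(f"Skipping {file_path}: {err}")
    return records


# Demonstration on a throwaway folder with one valid and one broken file.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "good.json"), "w", encoding="utf-8") as f:
        json.dump({"geoblacklight_version": "1.0", "title": "Example"}, f)
    with open(os.path.join(tmp, "bad.json"), "w", encoding="utf-8") as f:
        f.write("{not valid json")
    sample_records = load_json_folder(tmp)

print(sample_records)
```

Only the valid file ends up in the returned list; the broken one is reported and skipped.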

## 6. Convert the List into a pandas DataFrame

Here is where **pandas** finally comes in. We convert the list (jsonMetadata) into a special object called a *pandas DataFrame*. Here, we use the convention of calling the DataFrame `df`. We will print out the DataFrame so you can see how it is structured.

In [None]:
df = pd.DataFrame(jsonMetadata)
print(df)
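
When the JSON files do not all contain the same keys, the DataFrame still builds. A small illustration with made-up records (the field names here are placeholders, not the actual metadata fields):

```python
import pandas as pd

# Hypothetical records; the second one has no "publisher" key.
records = [
    {"id": "a1", "title": "Roads", "publisher": "City A"},
    {"id": "b2", "title": "Lakes"},
]

df_sample = pd.DataFrame(records)

# The columns are the union of all keys, in order of first appearance.
# A record that lacks a key gets NaN in that column.
print(df_sample)
print(df_sample["publisher"].isna().tolist())
```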

## 7. Drop one of the columns

Now that all of the metadata from the JSON files is loaded into a pandas DataFrame, we can manipulate it in various ways. For example, let's say we do not want to include the column called `geoblacklight_version` in our final output. We can call the `.drop` method. When the DataFrame is printed out again, that column is gone!

In [None]:
df = df.drop(columns=['geoblacklight_version'])
print(df)
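
Note that `.drop` raises a `KeyError` if the named column is not present, which happens, for example, when the cell is run a second time. One possible variant, shown here on a small made-up DataFrame, passes `errors="ignore"` so that missing columns are skipped.

```python
import pandas as pd

# Made-up values, for illustration only.
df_sample = pd.DataFrame([{"geoblacklight_version": "1.0", "title": "Roads"}])

# The first call removes the column.
df_sample = df_sample.drop(columns=["geoblacklight_version"])

# The column is already gone; without errors="ignore" this would raise KeyError.
df_sample = df_sample.drop(columns=["geoblacklight_version"], errors="ignore")

print(list(df_sample.columns))
```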

## 8. Write the DataFrame to a CSV file

We can perform other data conversions or analysis at this step as well, such as renaming or rearranging the columns. For now, we will write the DataFrame to a CSV file so we can inspect it.

In [None]:
df.to_csv("{}.csv".format(csv_name), index=False) # index=False leaves out the row-number column
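
The other manipulations mentioned in this step, such as renaming and rearranging columns, would run before the `to_csv` call. The sketch below uses made-up records with placeholder column names. It also shows that `to_csv` writes the DataFrame's row index as an unnamed first column by default, and that passing `index=False` leaves it out.

```python
import pandas as pd

# Made-up records; the column names are placeholders.
df_sample = pd.DataFrame(
    [{"title": "Roads", "id": "a1"}, {"title": "Lakes", "id": "b2"}]
)

# Rename a column; names not listed in the mapping are left unchanged.
df_sample = df_sample.rename(columns={"title": "Title"})

# Reorder by selecting the columns in the desired order.
df_sample = df_sample[["id", "Title"]]

# With no path argument, to_csv returns the CSV text instead of writing a file.
with_index = df_sample.to_csv()
without_index = df_sample.to_csv(index=False)

print(with_index)     # header begins with an empty name for the index column
print(without_index)  # header contains only the column names
```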

## 9. Inspect the new CSV file

In practice, you will likely open a generated CSV file in a spreadsheet editor to prepare the metadata for publishing. However, let's take a look at it within this Notebook using the pandas `read_csv` function.

In [None]:
new_csv = pd.read_csv("pandas-output.csv")
new_csv.head(5) # displays the first 5 rows for us

*For a more complex version of this script, see the Recipes section.*