<center><img src="image.png" width=500></center>
<p>

You've recently started a new position as a Data Engineer at an energy company. Previously, analysts on other teams had to manually retrieve and clean data every quarter to understand changes in the sales and capability of different energy types. This process normally took days and was something that most analysts dreaded. Your job is to automate this process by building a data pipeline. You'll write this data pipeline to pull data each month, providing faster insights and freeing up time for your data consumers.

You will achieve this using the `pandas` library and its powerful parsing features. You'll be working with two raw files: `electricity_sales.csv` and `electricity_capability_nested.json`.
    
Below, you'll find a data dictionary for the `electricity_sales.csv` dataset, which you'll be transforming in just a bit. Good luck!

| Field | Data Type |
| :---- | :-------: |
| period  | `str`        |
| stateid | `str` |
| stateDescription | `str` |
| sectorid | `str` |
| sectorName | `str` |
| price | `float` |
| price-units | `str` |

In [1]:
import pandas as pd
import json

First, define an `extract_tabular_data()` function to ingest tabular data. This function will take a single parameter, `file_path`. If `file_path` ends with `.csv`, use the `pd.read_csv()` function to extract the data. If `file_path` ends with `.parquet`, use the `pd.read_parquet()` function to extract the data. Otherwise, raise an exception with the message: "Warning: Invalid file extension. Please try with .csv or .parquet!".

In [2]:
def extract_tabular_data(file_path: str):
    """Extract data from a tabular file_format, with pandas."""
    if file_path.endswith('.csv'): 
        return pd.read_csv(file_path)
    elif file_path.endswith('.parquet'): 
        return pd.read_parquet(file_path)
    else: 
        raise Exception("Warning: Invalid file extension. Please try with .csv or .parquet!")

Create another function with the name `extract_json_data()`, which takes a `file_path`. Use the `json_normalize()` function from the **pandas** library to flatten the nested JSON data, and return a pandas **DataFrame**.

In [3]:
def extract_json_data(file_path):
    """Extract and flatten data from a JSON file."""
    with open(file_path, 'r') as f:
        data = json.load(f)
    return pd.json_normalize(data)
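
As a rough illustration of what `pd.json_normalize()` does, the sketch below flattens a couple of hand-written nested records. The keys and values here are hypothetical; the real structure of `electricity_capability_nested.json` may differ.

```python
import pandas as pd

# Hypothetical nested records; the real file's keys may differ.
nested_records = [
    {"period": "2023-12", "state": {"id": "CA", "name": "California"}, "capability": 100.5},
    {"period": "2023-12", "state": {"id": "TX", "name": "Texas"}, "capability": 250.0},
]

# Nested dictionaries become columns whose names join the key path with a dot.
flat_df = pd.json_normalize(nested_records)

print(sorted(flat_df.columns))
print(flat_df.loc[0, "state.id"])
```

Each nested key ends up as its own column (for example `state.id`), so the result is an ordinary flat DataFrame with one row per record.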

Next, we'll build a function to transform the electricity sales data. Create a function called `transform_electricity_sales_data()` that takes a single parameter, `raw_data`, of type **pd.DataFrame**. The function needs to fulfill the requirements described in the docstring that follows the function definition.

In [4]:
def transform_electricity_sales_data(raw_data: pd.DataFrame):
    """
    Transform electricity sales to find the total amount of electricity sold
    in the residential and transportation sectors.
    
    To transform the electricity sales data, you'll need to do the following:
    - Drop any records with NA values in the `price` column. Do this inplace.
    - Only keep records with a `sectorName` of "residential" or "transportation".
    - Create a `year` column using the first 4 characters of the values in `period`.
    - Create a `month` column using the last 2 characters of the values in `period`.
    - Return the transformed `DataFrame`, keeping only the columns `year`, `month`, `stateid`, `price` and `price-units`.
    """
    raw_data.dropna(subset=['price'], inplace=True)
    filtered_data = raw_data[raw_data['sectorName'].isin(['residential', 'transportation'])].copy()
    # `period` is a string; the 4-character slice is the year, the 2-character slice the month
    filtered_data['year'] = filtered_data['period'].str[:4]
    filtered_data['month'] = filtered_data['period'].str[-2:]
    return filtered_data[['year', 'month', 'stateid', 'price', 'price-units']]
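
The filtering and string-slicing steps can be tried on a small hand-made DataFrame. This is only a sketch: the rows are invented, and it assumes `period` values look like `"2023-12"`, in which case the first four characters are the year and the last two are the month.

```python
import pandas as pd

# Made-up rows; `period` is assumed to be a "YYYY-MM" string.
sample_sales = pd.DataFrame(
    {
        "period": ["2023-12", "2023-11", "2023-12", "2023-10"],
        "stateid": ["CA", "TX", "NY", "WA"],
        "sectorName": ["residential", "commercial", "transportation", "residential"],
        "price": [0.25, 0.12, None, 0.11],
        "price-units": ["cents per kilowatt-hour"] * 4,
    }
)

# Drop rows with a missing price, then keep only the two sectors of interest.
with_price = sample_sales.dropna(subset=["price"])
kept = with_price[with_price["sectorName"].isin(["residential", "transportation"])].copy()

# Slice the period string from each end.
kept["year"] = kept["period"].str[:4]
kept["month"] = kept["period"].str[-2:]

print(kept[["year", "month", "stateid", "price", "price-units"]])
```

In this sample the NY row is dropped for its missing price and the TX row for its sector, leaving the CA and WA rows.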

To load a DataFrame to a file, we'll define one more function called `load()`, which takes a DataFrame and a `file_path`. If the `file_path` ends with `.csv`, write the DataFrame to a CSV file. If instead the `file_path` ends with `.parquet`, write the DataFrame to a Parquet file. Otherwise, raise an exception that outputs a message in this format: "Warning: {file_path} is not a valid file type. Please try again!"

In [5]:
def load(dataframe: pd.DataFrame, file_path: str):
    """Load a DataFrame to a file in either CSV or Parquet format."""
    if file_path.endswith('.csv'):
        dataframe.to_csv(file_path, index=False)
    elif file_path.endswith('.parquet'):
        dataframe.to_parquet(file_path, index=False)
    else:
        raise Exception(f"Warning: {file_path} is not a valid file type. Please try again!")
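
One detail worth a quick look is the `index=False` argument. The sketch below, using a made-up two-row DataFrame and temporary files, compares what comes back from a CSV written with and without it.

```python
import os
import tempfile

import pandas as pd

# Made-up values, used only to compare the two ways of writing the file.
frame = pd.DataFrame({"stateid": ["CA", "TX"], "price": [0.25, 0.14]})

with tempfile.TemporaryDirectory() as tmp_dir:
    with_index_path = os.path.join(tmp_dir, "with_index.csv")
    without_index_path = os.path.join(tmp_dir, "without_index.csv")

    # Default behaviour: the row index is written as an extra, unnamed first column.
    frame.to_csv(with_index_path)
    # With index=False only the real columns are written.
    frame.to_csv(without_index_path, index=False)

    columns_with_index = pd.read_csv(with_index_path).columns.tolist()
    columns_without_index = pd.read_csv(without_index_path).columns.tolist()

print(columns_with_index)     # ['Unnamed: 0', 'stateid', 'price']
print(columns_without_index)  # ['stateid', 'price']
```

Leaving the index out keeps the loaded file's columns identical to the DataFrame's, which is usually what downstream consumers expect.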

In [6]:
# Ready for the moment of truth? It's time to test the functions that you wrote!
raw_electricity_capability_df = extract_json_data("electricity_capability_nested.json")
raw_electricity_sales_df = extract_tabular_data("electricity_sales.csv")    

cleaned_electricity_sales_df = transform_electricity_sales_data(raw_electricity_sales_df)

load(raw_electricity_capability_df, "loaded__electricity_capability.parquet")
load(cleaned_electricity_sales_df, "loaded__electricity_sales.csv")