## Running an ETL Pipeline

Ready to run your first ETL pipeline? Let's get to it!

Here, the functions extract(), transform(), and load() have been defined for you. To run this data ETL pipeline, you're going to execute each of these functions. If you're curious, take a peek at what the extract() function looks like.

<code>
def extract(file_name):
    print(f"Extracting data from {file_name}")
    return pd.read_csv(file_name)
</code>

### Instructions
    - Use the extract() function to extract data from the raw_data.csv file.
    - Transform the extracted_data DataFrame using the transform() function.
    - Finally, load the transformed_data DataFrame to the cleaned_data SQL table.

In [1]:
import pandas as pd

def extract(file_name):
    print(f"Extracting data from {file_name}.")
    return pd.read_csv(file_name)

def transform(data_frame):
    print(f"Transforming {data_frame.shape[0]} rows of raw data.")
    # Placeholder: return the data unchanged so load() receives a DataFrame
    return data_frame

def load(data_frame, target_table):
    print(f"Loading cleaned data to {target_table}.")

In [None]:
# Extract data from the raw_data.csv file
extracted_data = extract(file_name="raw_data.csv")

# Transform the extracted_data
transformed_data = transform(data_frame=extracted_data)

# Load the transformed_data to the cleaned_data table
load(data_frame=transformed_data, target_table="cleaned_data")
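
The functions in this exercise only print messages, so here is a minimal runnable sketch of the same three steps. It rests on several assumptions that are not part of the exercise: an inline CSV string stands in for raw_data.csv, an in-memory SQLite database stands in for the SQL database, the column names are invented, and the cleaning rule (drop incomplete rows, add a total column) is purely illustrative.

```python
import io
import sqlite3

import pandas as pd

# Hypothetical sample standing in for raw_data.csv
RAW_CSV = io.StringIO(
    "item_name,quantity,price\n"
    "Laptop,2,900.0\n"
    "Mouse,,25.0\n"
    "Monitor,3,150.0\n"
)


def extract(file_name):
    # pd.read_csv accepts a path or a file-like object
    return pd.read_csv(file_name)


def transform(data_frame):
    # Assumed cleaning rule: drop incomplete rows, then derive a total
    cleaned = data_frame.dropna().copy()
    cleaned["total"] = cleaned["quantity"] * cleaned["price"]
    return cleaned


def load(data_frame, target_table, connection):
    # Write the DataFrame to a SQL table, replacing it if it already exists
    data_frame.to_sql(target_table, connection, if_exists="replace", index=False)


connection = sqlite3.connect(":memory:")
extracted_data = extract(RAW_CSV)
transformed_data = transform(extracted_data)
load(transformed_data, "cleaned_data", connection)

# Confirm that only the complete rows reached the target table
row_count = connection.execute("SELECT COUNT(*) FROM cleaned_data").fetchone()[0]
print(row_count)
```

The order matters here: the data is cleaned in memory first, so only transformed rows ever reach the target table.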

## ELT in Action

Feeling pretty good about running ETL processes? Well, it's time to give ELT pipelines a try. Like before, the extract(), load(), and transform() functions have been defined for you; all you'll have to worry about is running these functions. Good luck!

### Instructions
    - Use the appropriate ELT function to extract data from the raw_data.csv file.
    - Load the raw_data DataFrame into the raw_data table in a data warehouse.
    - Call the transform() function to transform the data in the raw_data source table.

In [2]:
def extract(file_name):
    print(f"Extracting data from {file_name}.")
    return pd.read_csv(file_name)

def load(data_frame, table_name):
    print(f"Loading raw data to {table_name}.")
    data_warehouse.load_table(data_frame, table_name)

def transform(source_table, target_table):
    data_warehouse.run_sql(f"""
        CREATE TABLE {target_table} AS
            SELECT
                CONCAT('Product ID: ', product_id),
                quantity * price
            FROM {source_table};
    """)

In [None]:
# Extract data from the raw_data.csv file
raw_data = extract(file_name="raw_data.csv")

# Load the raw_data DataFrame to the raw_data table
load(data_frame=raw_data, table_name="raw_data")

# Transform data in the raw_data table
transform(
    source_table="raw_data",
    target_table="cleaned_data"
)
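
The data_warehouse object is supplied by the exercise environment, so the cell above will not run on its own. The sketch below shows one way to try the ELT flow locally. SQLiteWarehouse is a hypothetical stand-in, not the course's implementation: it wraps an in-memory SQLite database and exposes the two methods used above. The sample rows and the column aliases are also assumptions, and the SQL uses SQLite's || operator for string concatenation.

```python
import sqlite3

import pandas as pd


class SQLiteWarehouse:
    """Hypothetical stand-in for the exercise's data_warehouse object."""

    def __init__(self):
        self.connection = sqlite3.connect(":memory:")

    def load_table(self, data_frame, table_name):
        # Store the raw DataFrame as a table, replacing any existing one
        data_frame.to_sql(
            table_name, self.connection, if_exists="replace", index=False
        )

    def run_sql(self, statement):
        self.connection.execute(statement)
        self.connection.commit()


data_warehouse = SQLiteWarehouse()


def load(data_frame, table_name):
    data_warehouse.load_table(data_frame, table_name)


def transform(source_table, target_table):
    # The transformation runs inside the warehouse, after the raw load
    data_warehouse.run_sql(f"""
        CREATE TABLE {target_table} AS
        SELECT
            'Product ID: ' || product_id AS product_label,
            quantity * price AS total_sales
        FROM {source_table};
    """)


# Made-up rows standing in for the extracted CSV
raw_data = pd.DataFrame(
    {"product_id": [1, 2], "quantity": [3, 1], "price": [10.0, 4.5]}
)
load(data_frame=raw_data, table_name="raw_data")
transform(source_table="raw_data", target_table="cleaned_data")

cleaned_data = pd.read_sql("SELECT * FROM cleaned_data", data_warehouse.connection)
print(cleaned_data)
```

Unlike the ETL version, transform() here takes table names rather than a DataFrame, because the raw data is already in the warehouse when the transformation runs.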

## Building an ETL Pipeline

Ready to ratchet up the fun? In this exercise, you'll be responsible for building the rest of the load() function before running each step in the ETL process. The extract() and transform() functions have been defined for you. Good luck!

### Instructions
    - Complete the load() function by writing the transformed_data DataFrame to a .csv file, using file_name.
    - Use the transform() function to clean the extracted_data DataFrame.
    - Load transformed_data to the transformed_data.csv file using the load() function.

In [4]:
def extract(file_name):
    print("Extracting data from a source system")
    return pd.DataFrame([{"item_name": "Laptop", "quantity": 11}])

def transform(data_frame):
    print("Transformed the raw data returned from extract()")
    return data_frame

def load(data_frame, file_name):
    # Write data_frame to a CSV using file_name
    data_frame.to_csv(file_name)
    print(f"Successfully loaded data to {file_name}")

In [None]:
def load(data_frame, file_name):
    # Write data_frame to a CSV using file_name
    data_frame.to_csv(file_name)
    print(f"Successfully loaded data to {file_name}")

extracted_data = extract(file_name="raw_data.csv")

# Transform extracted_data using transform() function
transformed_data = transform(data_frame=extracted_data)

# Load transformed_data to the file transformed_data.csv
load(data_frame=transformed_data, file_name="transformed_data.csv")
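
One detail worth knowing about load(): by default, to_csv() also writes the DataFrame's index as an unnamed first column. The short sketch below compares the default with index=False. It writes to in-memory buffers instead of transformed_data.csv, which is a choice made for illustration only.

```python
import io

import pandas as pd

data_frame = pd.DataFrame([{"item_name": "Laptop", "quantity": 11}])

# Default: the index is written as an extra, unnamed first column
with_index = io.StringIO()
data_frame.to_csv(with_index)

# index=False writes only the DataFrame's own columns
without_index = io.StringIO()
data_frame.to_csv(without_index, index=False)

print(with_index.getvalue())
print(without_index.getvalue())
```

Whether to keep the index depends on the consumer of the file. If the file is read back with pd.read_csv(), the default output produces an extra column unless index_col=0 is passed.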

## The "T" in ELT

Let's not forget about ELT! Here, the extract() and load() functions have been defined for you. Now, all that's left is to finish defining the transform() function and run the pipeline. Go get 'em!

### Instructions
    - Update the transform() function to call the .execute() method on the data_warehouse object.
    - Use the newly updated transform() function to populate data in the total_sales target table by transforming data in the raw_sales_data source table.

In [5]:
def load(data_frame, table_name):
    print(f"Loading extracted data to {table_name}.")

def extract(file_name):
    print("Extracted data from file storage.")
    return None

In [None]:
# Complete building the transform() function
def transform(source_table, target_table):
    data_warehouse.execute(f"""
        CREATE TABLE {target_table} AS
            SELECT
                CONCAT('Product ID: ', product_id),
                quantity * price
            FROM {source_table};
    """)

extracted_data = extract(file_name="raw_sales_data.csv")
load(data_frame=extracted_data, table_name="raw_sales_data")

# Populate total_sales by transforming raw_sales_data
transform(source_table="raw_sales_data", target_table="total_sales")

## Extracting, Transforming, and Loading Student Scores Data

Alright, it's time to build your own ETL pipeline from scratch. In this exercise, you'll build three functions: extract(), transform(), and load(). Then, you'll use these functions to run your pipeline.

The pandas library has been imported as pd. Enjoy!

### Instructions 1/4
    - In the extract() function, use the appropriate pandas function to read a CSV into memory.

In [None]:
def extract(file_name):
    # Read a CSV with a path stored using file_name into memory
    return pd.read_csv(file_name)

### Instructions 2/4
    - Update the transform() function to filter the data_frame to only include the columns industry_name and number_of_firms.

In [None]:
def extract(file_name):
    return pd.read_csv(file_name)

def transform(data_frame):
    # Filter the data_frame to only include a subset of columns
    return data_frame.loc[:, ["industry_name", "number_of_firms"]]
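
To see what the .loc selection in transform() does, here is a small sketch on made-up data. Only the two column names come from the exercise. The rows and the total_employees column are assumptions added so there is something to filter out.

```python
import pandas as pd

# Hypothetical rows; only the first two column names come from the exercise
data_frame = pd.DataFrame(
    {
        "industry_name": ["Retail", "Mining"],
        "number_of_firms": [120, 15],
        "total_employees": [3400, 800],
    }
)

# Keep every row (:) but only the listed columns
subset = data_frame.loc[:, ["industry_name", "number_of_firms"]]
print(list(subset.columns))

# Asking for a column label that does not exist raises a KeyError
try:
    data_frame.loc[:, ["industry_name", "missing_column"]]
except KeyError as error:
    print(f"KeyError: {error}")
```

Because a missing column raises an error rather than being silently skipped, a renamed column in the source file surfaces in transform() instead of further downstream.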

### Instructions 3/4
    - In the load() function, write the data_frame to the path stored in the file_name parameter.

In [None]:
def extract(file_name):
    return pd.read_csv(file_name)

def transform(data_frame):
    return data_frame.loc[:, ["industry_name", "number_of_firms"]]

def load(data_frame, file_name):
    # Write the data_frame to a CSV
    data_frame.to_csv(file_name)

### Instructions 4/4
    - Pass the transformed_data DataFrame to the load() function, and run the ETL pipeline.

In [None]:
def extract(file_name):
    return pd.read_csv(file_name)

def transform(data_frame):
    return data_frame.loc[:, ["industry_name", "number_of_firms"]]

def load(data_frame, file_name):
    data_frame.to_csv(file_name)

extracted_data = extract(file_name="raw_industry_data.csv")
transformed_data = transform(data_frame=extracted_data)

# Pass the transformed_data DataFrame to the load() function
load(data_frame=transformed_data, file_name="number_of_firms.csv")