# Part I: Upload and Process Data

In this part of the project, we will use the `main.py` script to set up the database and run scripts that read NHL data from AWS S3 buckets, process it, and insert it into the database.

## Table of Contents
1. [Introduction](#introduction)
2. [Running the `main.py` Script](#running-the-mainpy-script)
3. [Understanding the Logging](#understanding-the-logging)
4. [Verifying the Processed Data](#verifying-the-processed-data)
5. [Conclusion](#conclusion)


## 1. Introduction

This notebook will guide you through running the `main.py` script, which automates the process of downloading NHL data from AWS S3 buckets, processing it, and inserting it into the `hockey_stats` database. We will also review the log file generated during this process to ensure everything has executed correctly.


## 2. Running the `main.py` Script

The `main.py` script is the entry point for the data processing workflow. It runs a series of Python scripts that handle different parts of the data processing pipeline. These scripts are responsible for downloading data, processing it, and inserting it into the database.

Below is the code from `main.py`:


In [1]:
import logging
import subprocess

# Configure logging
logging.basicConfig(
    filename="data_processing.log",
    level=logging.INFO,
    format="%(asctime)s:%(levelname)s:%(message)s",
)

def run_script(script_name):
    """Run one processing script as a subprocess and log its output.

    Each script downloads data from AWS S3 buckets and inserts it
    into tables in the hockey_stats database.
    """
    try:
        result = subprocess.run(
            ["python", script_name], capture_output=True, text=True, check=True
        )
        logging.info(f"Output of {script_name}:\n{result.stdout}")
        if result.stderr:
            logging.error(f"Errors in {script_name}:\n{result.stderr}")
        print(result.stdout)
        print(result.stderr)
    except subprocess.CalledProcessError as e:
        # Raised when the script exits with a non-zero status (because check=True)
        logging.error(f"Script {script_name} failed with error: {e.stderr}")
        print(f"Script {script_name} failed with error: {e.stderr}")
    except Exception as e:
        logging.error(f"Failed to run {script_name}: {e}")
        print(f"Failed to run {script_name}: {e}")

def main():
    """Run each processing script in order."""
    scripts = [
        "game_processor.py",
        "game_shifts_processor.py",
        "game_skater_stats_processor.py",
        "game_plays_processor.py",
        "player_info_processor.py",
    ]

    for script in scripts:
        run_script(script)

if __name__ == "__main__":
    main()


Script game_processor.py failed with error: Traceback (most recent call last):
  File "/Users/ericwiniecke/.pyenv/versions/3.12.4/envs/cost_cup_env/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 146, in __init__
    self._dbapi_connection = engine.raw_connection()
                             ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ericwiniecke/.pyenv/versions/3.12.4/envs/cost_cup_env/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 3300, in raw_connection
    return self.pool.connect()
           ^^^^^^^^^^^^^^^^^^^
  File "/Users/ericwiniecke/.pyenv/versions/3.12.4/envs/cost_cup_env/lib/python3.12/site-packages/sqlalchemy/pool/base.py", line 449, in connect
    return _ConnectionFairy._checkout(self)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ericwiniecke/.pyenv/versions/3.12.4/envs/cost_cup_env/lib/python3.12/site-packages/sqlalchemy/pool/base.py", line 1263, in _checkout
    fairy = _ConnectionRecord.checkout(pool)
            ^^^^^^^^^^^^
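
The traceback above is cut off, but its frames sit in SQLAlchemy's connection checkout, so `game_processor.py` appears to have failed while opening a database connection. As written, `main.py` logs such a failure and moves on to the next script. If later scripts depend on earlier ones (an assumption; the source does not say), a fail-fast runner may be preferable. The sketch below is one hedged way to do that. The name `run_pipeline` and the stand-in commands are hypothetical and not part of the project.

```python
import subprocess
import sys


def run_pipeline(commands):
    """Run commands in order; stop at the first non-zero exit status.

    Returns the list of (command, exit_status) pairs that were attempted.
    """
    attempted = []
    for command in commands:
        status = subprocess.run(command, check=False).returncode
        attempted.append((command, status))
        if status != 0:
            break  # later steps may depend on this one
    return attempted


# Stand-in commands so the sketch runs anywhere; the second one fails on purpose.
demo_commands = [
    [sys.executable, "-c", "print('step 1 ok')"],
    [sys.executable, "-c", "import sys; sys.exit(1)"],
    [sys.executable, "-c", "print('never runs')"],
]
results = run_pipeline(demo_commands)
print([status for _command, status in results])
```

For the real pipeline, each command would be a list such as `[sys.executable, "game_processor.py"]`.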

To run the script from Jupyter, prefix the command with `!`, which passes it to the system shell. The effect is much like running `python main.py` in a terminal opened in the notebook's working directory.

In [None]:
# Run the main.py script
!python main.py
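
`!python` runs whichever `python` comes first on the shell's `PATH`, which may not be the interpreter or virtual environment that the notebook kernel uses. As a hedged alternative, the sketch below launches a script with `sys.executable` and returns its exit status together with the captured output. The helper name `run_with_kernel_python` is hypothetical and not part of the project.

```python
import subprocess
import sys


def run_with_kernel_python(script_path):
    """Run a script with the interpreter that is executing this code.

    Hypothetical helper for illustration; returns (exit_status, stdout, stderr).
    """
    result = subprocess.run(
        [sys.executable, script_path],
        capture_output=True,
        text=True,
        check=False,  # report the exit status instead of raising
    )
    return result.returncode, result.stdout, result.stderr
```

In a notebook cell this could be called as `status, out, err = run_with_kernel_python("main.py")`. Because `main.py` as shown catches the failures of the individual scripts, its own exit status stays 0 even when one of them fails, so the log file remains the place to check.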


## 3. Understanding the Logging

As `main.py` runs, it logs detailed information to a file named `data_processing.log`. This log file is crucial for debugging and understanding what happened during the execution of the scripts.

Let's take a look at the contents of `data_processing.log` to ensure everything ran smoothly.


In [None]:
# Display the contents of the data_processing.log file
with open('data_processing.log', 'r') as log_file:
    log_content = log_file.read()

print(log_content)


### Analyzing the Log

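Search the log for entries at the `ERROR` level. `main.py` writes one when a script exits with a non-zero status, cannot be started, or prints anything to stderr, so an `ERROR` entry usually, but not always, means a script failed. Resolve any failures before moving on. If the log contains only `INFO` entries with the expected output, the scripts ran to completion, and the next step is to confirm that the data reached the database.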
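
For a long log, scanning by eye is error-prone. The sketch below counts records per level and collects the first line of each `ERROR` record. It assumes the format string configured in `main.py` (`%(asctime)s:%(levelname)s:%(message)s`) and logging's default timestamp layout. The function name `summarize_log` and the sample text are made up for illustration.

```python
import re
from collections import Counter

# Matches the start of a record written with the format
# "%(asctime)s:%(levelname)s:%(message)s" and logging's default timestamp.
RECORD_START = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}):([A-Z]+):(.*)$"
)


def summarize_log(text):
    """Return (record counts per level, first line of each ERROR record)."""
    counts = Counter()
    errors = []
    for line in text.splitlines():
        match = RECORD_START.match(line)
        if match is None:
            continue  # continuation line of a multi-line message
        _timestamp, level, message = match.groups()
        counts[level] += 1
        if level == "ERROR":
            errors.append(message)
    return counts, errors


# Made-up sample in the same format, for illustration only.
sample = (
    "2024-01-01 10:00:00,000:INFO:Output of a.py:\n"
    "rows inserted\n"
    "2024-01-01 10:00:05,000:ERROR:Script b.py failed with error: "
    "Traceback (most recent call last):\n"
    '  File "b.py", line 1, in <module>\n'
)
counts, errors = summarize_log(sample)
print(dict(counts))
print(errors)
```

With the real file, pass the string read from `data_processing.log` instead of `sample`.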


## 4. Verifying the Processed Data

After running the `main.py` script, it's important to verify that the data has been correctly processed and inserted into the database. We can do this by querying the database and checking the contents of the tables.


In [None]:
import pandas as pd
from sqlalchemy import create_engine

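# Create a database connection.
# Replace the placeholder with a SQLAlchemy URL for the hockey_stats database,
# in the form dialect+driver://username:password@host:port/database
engine = create_engine('your_database_connection_string')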

# Example query to check the data in one of the tables
df = pd.read_sql("SELECT * FROM game_skater_stats LIMIT 5;", engine)

# Display the data
df.head()
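
Looking at five rows of one table confirms that the table is not empty, but it says nothing about the others. A row count per table is a quick complement. The sketch below is a minimal version that uses an in-memory SQLite database as a stand-in, so it runs without the real `hockey_stats` database. The helper `count_rows`, the table name `game`, and the column names are assumptions for illustration; only `game_skater_stats` is named in this notebook.

```python
import sqlite3


def count_rows(connection, table_names):
    """Return {table_name: row_count} for each table.

    Written against a DB-API style connection (cursor/execute/fetchone).
    """
    counts = {}
    cursor = connection.cursor()
    for name in table_names:
        # Table names cannot be bound as query parameters, so accept only
        # plain identifiers before formatting them into the SQL text.
        if not name.isidentifier():
            raise ValueError(f"unexpected table name: {name!r}")
        cursor.execute(f"SELECT COUNT(*) FROM {name}")
        counts[name] = cursor.fetchone()[0]
    return counts


# In-memory SQLite stand-in so the sketch runs without the real database.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE game_skater_stats (game_id INTEGER, player_id INTEGER)")
demo.execute("CREATE TABLE game (game_id INTEGER)")
demo.executemany("INSERT INTO game_skater_stats VALUES (?, ?)", [(1, 10), (1, 11)])

row_counts = count_rows(demo, ["game", "game_skater_stats"])
empty_tables = [name for name, n in row_counts.items() if n == 0]
print(row_counts)
print("empty tables:", empty_tables)
```

An empty table after a run that logged no errors is worth investigating before moving on.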


## 5. Conclusion

In this notebook, we ran the `main.py` script, reviewed its log, and queried the database to check that the data was processed and inserted. Once the log shows no failures and the tables contain the expected rows, the data is ready for the more advanced analysis in subsequent parts of the project.
