# NCAA Stats Data Pipeline

This "pipeline" is a notebook used to set up NCAA data in our Databricks sandbox. It's largely a workaround, since we don't have access to DLT or jobs in our sandbox environment. For now, I'll just run the scripts manually like a peasant, but in real life this could be converted to DLT pipelines, jobs, etc.

The steps in this notebook:
1. Set up the initial schema for landing NCAA data
1. Load raw data into Databricks
1. Run ETL scripts to clean up and transform data into a format suitable for analysis

## Setup
Run the cells in this section to get your environment set up.

In [1]:
# Set up module autoreload
%load_ext autoreload
%autoreload 2

In [2]:
# Load environment variables using dotenv

from dotenv import load_dotenv

load_dotenv()

True

In [3]:
# Create a Spark session for the Databricks compute environment
from pyspark.sql import SparkSession
from ncaa_tournament_predictor.config import Config
from ncaa_tournament_predictor.databricks import get_databricks_spark_session

# Explicitly type this as SparkSession to help out IntelliSense; DatabricksSession IntelliSense
# isn't very good. In all my exploration so far, DatabricksSession has been compatible with SparkSession.
spark: SparkSession = get_databricks_spark_session(Config.databricks_profile())

In [None]:
# Run all cells above this one to set up your environment

## Schema Setup

Initial steps to create a Databricks schema for holding NCAA men's basketball data

In [None]:
# Create the ncaa_mens_basketball schema
spark.sql("create schema if not exists object_computing.ncaa_mens_basketball;")

## Raw Data Volumes
Set up volumes for holding raw data files from various external data sources (CSVs, text files, etc.)

In [None]:
# Create a volume for raw Kaggle stats data

from ncaa_tournament_predictor import volumes

raw_kaggle_stats_sql_object = volumes.as_sql_object(volumes.raw_kaggle_stats)
spark.sql(f"create volume if not exists {raw_kaggle_stats_sql_object}")
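
The `volumes` helpers used here are not shown in this notebook. As a rough sketch of what they might do, assuming `volumes.raw_kaggle_stats` is a `dbfs:/Volumes/<catalog>/<schema>/<volume>` URI (inferred from the `source_filename` value in the sample row at the end of this notebook), the functions below are hypothetical stand-ins rather than the package's actual code:

```python
# Hypothetical stand-ins for helpers in ncaa_tournament_predictor.volumes.
# The real implementations are not shown in this notebook.
DBFS_PREFIX = "dbfs:"
VOLUMES_ROOT = "/Volumes/"

# Assumed value, inferred from the source_filename column in the sample output
raw_kaggle_stats = "dbfs:/Volumes/object_computing/ncaa_mens_basketball/raw_kaggle_stats"


def without_dbfs_protocol(volume_uri: str) -> str:
    """Strip a leading 'dbfs:' so the path starts at /Volumes/..."""
    if volume_uri.startswith(DBFS_PREFIX):
        return volume_uri[len(DBFS_PREFIX):]
    return volume_uri


def as_sql_object(volume_uri: str) -> str:
    """Turn a volume URI into a catalog.schema.volume identifier for SQL."""
    path = without_dbfs_protocol(volume_uri)
    if not path.startswith(VOLUMES_ROOT):
        raise ValueError(f"Not a volume path: {volume_uri}")
    parts = path[len(VOLUMES_ROOT):].strip("/").split("/")
    if len(parts) != 3:
        raise ValueError(f"Expected catalog/schema/volume, got: {volume_uri}")
    return ".".join(parts)


print(as_sql_object(raw_kaggle_stats))
# object_computing.ncaa_mens_basketball.raw_kaggle_stats
```

Keeping the URI as the single source of truth and deriving the SQL identifier from it means the `create volume` statement and the file paths cannot drift apart.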

In [None]:
# Copy raw data into the raw_kaggle_stats volume

import os

from ncaa_tournament_predictor import volumes

notebook_dir = os.path.abspath(os.getcwd())
kaggle_dataset_path = os.path.abspath(
    os.path.join(notebook_dir, "../datasets/kaggle_ncaa_stats")
)

for filename in os.listdir(kaggle_dataset_path):
    spark.copyFromLocalToFs(
        local_path=os.path.join(kaggle_dataset_path, filename),
        dest_path=os.path.join(volumes.without_dbfs_protocol(volumes.raw_kaggle_stats), filename)
    )

In [None]:
# Read the Kaggle stats dataset
from ncaa_tournament_predictor import transformation, volumes

raw_kaggle_stats = (
    spark.read.format("csv")
        .options(header=True, inferSchema=True, mergeSchema=True)
        .load(volumes.raw_kaggle_stats)
)
cleaned_ncaa_data = transformation.get_cleaned_kaggle_stats(raw_kaggle_stats)

In [None]:
# Create a volume for raw head-to-head data

from ncaa_tournament_predictor import volumes

spark.sql(f"create volume if not exists {volumes.as_sql_object(volumes.raw_head_to_head)}")

In [None]:
# Copy raw data into the raw_head_to_head volume

import os

from ncaa_tournament_predictor import volumes

notebook_dir = os.path.abspath(os.getcwd())
head_to_head_dataset_path = os.path.abspath(
    os.path.join(notebook_dir, "../datasets/kenpom_head_to_head")
)

for filename in os.listdir(head_to_head_dataset_path):
    spark.copyFromLocalToFs(
        local_path=os.path.join(head_to_head_dataset_path, filename),
        dest_path=os.path.join(volumes.without_dbfs_protocol(volumes.raw_head_to_head), filename)
    )
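
The two copy cells above follow the same pattern, so they could share a small helper. The sketch below is one possible shape for it; `plan_uploads` is a hypothetical name, not part of `ncaa_tournament_predictor`. It only builds the `(local_path, dest_path)` pairs, skips subdirectories, and joins the destination with `posixpath` so the volume path always uses forward slashes, even when the notebook runs on Windows.

```python
import os
import posixpath
from typing import Iterator, Tuple


def plan_uploads(local_dir: str, volume_path: str) -> Iterator[Tuple[str, str]]:
    """Yield (local_path, dest_path) pairs for each regular file in local_dir.

    volume_path is assumed to be a plain /Volumes/... path with no dbfs: prefix.
    """
    for filename in sorted(os.listdir(local_dir)):
        local_path = os.path.join(local_dir, filename)
        if not os.path.isfile(local_path):
            continue  # skip subdirectories and anything that is not a regular file
        yield local_path, posixpath.join(volume_path, filename)


# Usage with the session and paths from the cells above (not executed here):
# for local_path, dest_path in plan_uploads(
#     kaggle_dataset_path, volumes.without_dbfs_protocol(volumes.raw_kaggle_stats)
# ):
#     spark.copyFromLocalToFs(local_path=local_path, dest_path=dest_path)
```

Separating the path planning from the copy call also makes it easy to print the pairs first and check them before anything is uploaded.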

## Data Cleanup & Transformation
Process the raw data, clean it up, and transform it for analysis

In [None]:
# Create the cleaned Kaggle datasets table
from ncaa_tournament_predictor.jobs import kaggle_stats

kaggle_stats.write_cleaned_kaggle_stats()

In [None]:
# Create the cleaned head-to-head table

from ncaa_tournament_predictor.jobs import head_to_head

head_to_head.write_cleaned_head_to_head_data()

## Creating Training Datasets
Combine datasets to create the dataset used for training an ML model

In [5]:
from ncaa_tournament_predictor import transformation

training_dataset = transformation.get_training_dataset()
training_dataset.head()

Row(year=2013, team='Troy', conference='SB', games=33, wins=12, adjusted_offensive_efficiency=97.3, adjusted_defensive_efficiency=107.5, power_rating=0.2416, effective_field_goal_percentage=45.3, effective_field_goal_percentage_allowed=51.0, turnover_rate=16.5, caused_turnover_rate=19.1, offensive_rebound_rate=30.7, defensive_rebound_rate=33.0, freethrows_attempted_rate=24.5, freethrows_allowed_rate=36.6, two_point_shooting_percentage=44.5, two_point_shooting_percentage_allowed=49.5, three_point_shooting_percentage=31.2, three_point_shooting_percentage_allowed=35.8, adjusted_tempo=61.1, wins_above_bubble=-16.6, postseason_result='N/A', tournament_seed=None, source_filename='dbfs:/Volumes/object_computing/ncaa_mens_basketball/raw_kaggle_stats/cbb2013.csv', game_date=datetime.date(2013, 1, 1), team_1='Yale', team_1_score=70, team_2='Iowa St.', team_2_score=80, team_1_won=False, winning_team='Iowa St.')
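
The last two fields in the row, `team_1_won` and `winning_team`, look like they are derived from the two scores. The actual logic lives in `transformation` and is not shown here, so the function below is only an illustrative sketch with a hypothetical name, written in plain Python rather than Spark:

```python
from typing import Optional, Tuple


def game_outcome(
    team_1: str, team_1_score: int, team_2: str, team_2_score: int
) -> Tuple[Optional[bool], Optional[str]]:
    """Return (team_1_won, winning_team) for a single game.

    Equal scores return (None, None) rather than guessing a winner.
    """
    if team_1_score == team_2_score:
        return None, None
    team_1_won = team_1_score > team_2_score
    return team_1_won, team_1 if team_1_won else team_2


# Values taken from the sample row above
print(game_outcome("Yale", 70, "Iowa St.", 80))
# (False, 'Iowa St.')
```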