## ETL PROCESS FOR CHADWICK BASEBALL DATA
This is an ETL process to import data from the Chadwick Databank into a Microsoft SQL Server database, either locally hosted or on AWS.
https://github.com/chadwickbureau/baseballdatabank

## INTENT:
* Become part of an AWS Lambda job
* Import from S3
* Deploy the schema fresh
* Call an SSMS job to do post-import updates

### TODO:
* Integrate Python's native logging framework
* Define schema and key relationships for the entire Chadwick database upon import
* Validation testing on imports: a basic metadata catalog to check the number of rows, the full set of tables, etc.
* Normalization in SQL post-processing
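
The validation item above could start as small as a catalog of expected row counts per table. The sketch below is one possible shape, not part of the current notebook: `count_csv_rows` and `check_catalog` are hypothetical helper names, and the expected counts would have to come from a trusted source such as a previous import.

```python
import csv
import io


def count_csv_rows(handle):
    """Count data rows (excluding the header line) in an open CSV file handle."""
    reader = csv.reader(handle)
    next(reader, None)  # skip the header row, if any
    return sum(1 for _ in reader)


def check_catalog(expected, actual):
    """Compare expected row counts per table against what was actually loaded.

    Both arguments are dicts of table name -> row count. Returns a list of
    human-readable problems; an empty list means every expected table is
    present with the expected number of rows.
    """
    problems = []
    for table, n_expected in expected.items():
        if table not in actual:
            problems.append(f"missing table: {table}")
        elif actual[table] != n_expected:
            problems.append(
                f"{table}: expected {n_expected} rows, found {actual[table]}"
            )
    return problems


# Tiny in-memory demo; a real run would open the CSV files on disk instead.
demo_rows = count_csv_rows(io.StringIO("playerID,yearID\na,1999\nb,2000\n"))
demo_problems = check_catalog({"core_People": 2}, {"core_People": demo_rows})
```

An empty `demo_problems` list means the catalog and the import agree; anything else could be logged or used to fail the job.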

In [68]:
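import os

import numpy as np  # needed for the np.inf / np.nan replacement below
import pandas as pd
import pyodbc  # driver used by the mssql+pyodbc SQLAlchemy dialect
import sqlalchemy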

In [111]:
# migrate me to a config file.
root_dir = "../../baseballdatabank/"

server = r"(localdb)\MSSQLLocalDB"  # raw string so the backslash is taken literally
database = "baseball"

In [111]:
engine = sqlalchemy.create_engine("mssql+pyodbc://" + server + "/" + database + "?trusted_connection=yes&driver=ODBC+Driver+17+for+SQL+Server")
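
One way to act on the "migrate me to a config file" note is an INI file read with the standard-library `configparser`. The sketch below rests on several assumptions: the section and key names (`paths`, `mssql`, `driver`) and the helper `build_connection_url` are made up for illustration, and the settings are inlined as a string so the block runs on its own. It uses the `odbc_connect` form of the `mssql+pyodbc` URL, which passes a URL-encoded ODBC connection string and so avoids hand-escaping the parentheses and backslash in the server name.

```python
import configparser
from urllib.parse import quote_plus

# Hypothetical settings; in practice these would live in a file such as
# settings.ini and be loaded with settings.read("settings.ini").
EXAMPLE_INI = r"""
[paths]
root_dir = ../../baseballdatabank/

[mssql]
server = (localdb)\MSSQLLocalDB
database = baseball
driver = ODBC Driver 17 for SQL Server
"""


def build_connection_url(settings):
    """Build a mssql+pyodbc URL from the [mssql] section of the settings."""
    mssql = settings["mssql"]
    odbc = ";".join(
        [
            "DRIVER={" + mssql["driver"] + "}",
            "SERVER=" + mssql["server"],
            "DATABASE=" + mssql["database"],
            "Trusted_Connection=yes",
        ]
    )
    # quote_plus URL-encodes the whole ODBC string, including "(", ")" and "\".
    return "mssql+pyodbc:///?odbc_connect=" + quote_plus(odbc)


settings = configparser.ConfigParser()
settings.read_string(EXAMPLE_INI)
url = build_connection_url(settings)
```

The resulting `url` could then be handed to `sqlalchemy.create_engine(url)` in place of the concatenated string; this has not been tested against the notebook's database.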

In [134]:
subdirs = ["core","contrib"]

with engine.connect() as conn:

    for subdir in subdirs:
        print(subdir)
    
        for i in os.listdir(root_dir + subdir):

            if i.endswith(".csv"):

                file_name = root_dir + subdir + "/" + i
                table_name = subdir + "_" + i.replace(".csv","")

                df = pd.read_csv(file_name)
                
                # didn't realize that inf was an actual valid state for a pandas float
                # infinite ERAs are unfortunate.
                df.replace({np.inf: np.nan, -np.inf: np.nan}, inplace=True)  
                
                # should probably add data validation checks at this step prior to import into sql

                df.to_sql(name=table_name, con=engine, if_exists='replace', index=False)

                print(i + " successfully uploaded.")
    

core
AllstarFull.csv successfully uploaded.
Appearances.csv successfully uploaded.
Batting.csv successfully uploaded.
BattingPost.csv successfully uploaded.
Fielding.csv successfully uploaded.
FieldingOF.csv successfully uploaded.
FieldingOFsplit.csv successfully uploaded.
FieldingPost.csv successfully uploaded.
HomeGames.csv successfully uploaded.
Managers.csv successfully uploaded.
ManagersHalf.csv successfully uploaded.
Parks.csv successfully uploaded.
People.csv successfully uploaded.
Pitching.csv successfully uploaded.
PitchingPost.csv successfully uploaded.
SeriesPost.csv successfully uploaded.
Teams.csv successfully uploaded.
TeamsFranchises.csv successfully uploaded.
TeamsHalf.csv successfully uploaded.
contrib
AwardsManagers.csv successfully uploaded.
AwardsPlayers.csv successfully uploaded.
AwardsShareManagers.csv successfully uploaded.
AwardsSharePlayers.csv successfully uploaded.
CollegePlaying.csv successfully uploaded.
HallOfFame.csv successfully uploaded.
Salaries.csv su
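
The loop above names each table after its subdirectory and file stem (for example `core/People.csv` becomes `core_People`) and reports progress with `print`. As a minimal sketch of the logging item in the TODO list, the same discovery step could be written with `pathlib` and the standard `logging` module. The function name `discover_tables` and the logger name `chadwick_etl` are hypothetical, and the sketch only lists files; it does not read or upload them.

```python
import logging
import tempfile
from pathlib import Path

logger = logging.getLogger("chadwick_etl")  # hypothetical logger name


def discover_tables(root_dir, subdirs):
    """Return (csv_path, table_name) pairs for every .csv file in each subdir.

    The table name follows the notebook's convention: "<subdir>_<file stem>".
    """
    found = []
    for subdir in subdirs:
        for path in sorted(Path(root_dir, subdir).glob("*.csv")):
            table_name = f"{subdir}_{path.stem}"
            logger.info("found %s -> %s", path.name, table_name)
            found.append((path, table_name))
    return found


# Demo against a throwaway directory rather than the real databank checkout.
logging.basicConfig(level=logging.INFO)
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "core").mkdir()
    (Path(tmp) / "core" / "People.csv").write_text("playerID\n")
    demo_tables = [name for _, name in discover_tables(tmp, ["core"])]
```

Each pair could then feed the existing `pd.read_csv` and `df.to_sql` calls, with `logger.info` replacing the final `print`.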

### Everything past this point is scratch work that may be unnecessary.

In [135]:
# from sqlalchemy.ext.automap import automap_base
# from sqlalchemy.orm import Session

# Base = automap_base()
# Base.prepare(engine, reflect=True)



# still in progress: generating read call in sqlalchemy to pull data from defined view
#metadata = sqlalchemy.MetaData()
#sqlalchemy.MetaData.reflect(metadata)
#test_view = sqlalchemy.Table('test_view', metadata)

In [136]:
# Base.classes.keys()

[]