# Guided Project
### Designing and Creating a Database

## Getting to Know the Data

In this guided project, we're going to learn how to:
* Import data into SQLite
* Design a normalized database schema
* Create tables for our schema
* Insert data into our schema


We will be working with a file of [Major League Baseball](https://en.wikipedia.org/wiki/Major_League_Baseball) games from [Retrosheet](http://www.retrosheet.org/). Retrosheet compiles detailed statistics on baseball games from the 1800s through to today. The main file we will be working from, `game_log.csv`, has been produced by combining 127 separate CSV files from Retrosheet, and has been pre-cleaned to remove some inconsistencies. The game log has hundreds of data points on each game, which we will normalize into several separate tables using SQL, providing a robust database of game-level statistics.<br>

In addition to the main file, we have also included three 'helper' files, also sourced from Retrosheet:
* park_codes.csv
* person_codes.csv
* team_codes.csv

These three helper files in some cases contain extra data, but will also make things easier as they will form the basis for three of our normalized tables.<br>

An important first step when working with any new data is to perform exploratory data analysis (EDA). EDA gets us familiar with the data and gives us a level of background knowledge that will help us throughout our project. The methods you use when performing EDA will depend on what you plan to do with the data. In our case, we want to create a normalized database, so our focus should be:

* Becoming familiar, at a high level, with the meaning of each column in each file.
* Thinking about the relationships between columns within each file.
* Thinking about the relationships between columns across different files.

We have included a `game_log_fields.txt` file from Retrosheet which explains the fields included in our main file, which will be useful to assist our EDA. You can use `!cat game_log_fields.txt` in its own Jupyter cell to read the contents of the file.<br>

**If you're not familiar with baseball, some of this can seem overwhelming at first. However, this presents a great opportunity.**

### When you are working with data professionally, you'll often encounter data in an industry you might be unfamiliar with - it might be digital marketing, geological engineering or industrial machinery. 

In these instances, you'll have to perform research in order to understand the data you're working with.<br>

Baseball is a great topic to practice these skills with. Because of the long history within baseball of the collection and analysis of statistics (most famously the [Sabermetrics](https://en.wikipedia.org/wiki/Sabermetrics) featured in the movie [Moneyball](https://en.wikipedia.org/wiki/Moneyball_%28film%29)), there is a wide range of online resources available to help you get answers to any questions you may have.<br>

Let's get started by using pandas to read and explore the data. We recommend setting the following options after you import pandas. They prevent the DataFrame output from being truncated, given the size of the main game log file:

```python
# Full option names are unambiguous across pandas versions.
pd.set_option('display.max_columns', 180)
pd.set_option('display.max_rows', 200000)
pd.set_option('display.max_colwidth', 5000)
```

In [1]:
import pandas as pd

In [2]:
# Because of GitHub's upload file size limit, the data file
# is stored zipped. Use the command below to unzip it
# into your current working folder.

# For the same reason, this repository does not contain
# the 'mlb.db' database file. Running the code below
# will create it.

# On the first run of this notebook,
# uncomment the following line and run it once.

#!unzip ./game_log.zip

Archive:  ./game_log.zip
  inflating: game_log.csv            
   creating: __MACOSX/
  inflating: __MACOSX/._game_log.csv  


In [3]:
# Full option names are unambiguous across pandas versions.
pd.set_option('display.max_columns', 180)
pd.set_option('display.max_rows', 200000)
pd.set_option('display.max_colwidth', 5000)

In [56]:
game_log = pd.read_csv('game_log.csv')



In [4]:
game_log = pd.read_csv('game_log.csv', low_memory = False)

In [5]:
park_codes = pd.read_csv('park_codes.csv')
person_codes = pd.read_csv("person_codes.csv")
team_codes = pd.read_csv("team_codes.csv")

* Using pandas, read in each of the four CSV files: `game_log.csv`, `park_codes.csv`, `person_codes.csv`, `team_codes.csv`. For each:
  * Use methods and attributes like `DataFrame.shape`, `DataFrame.head()`, and `DataFrame.tail()` to explore the data.
  * Write a brief paragraph to describe each file, including for the helper files how the data intersects with the main log file.
* Research any fields you are not familiar with, using both the text file and Google as needed. In particular, you should explore and write a short paragraph on:
  * What each defensive position number represents.
  * The values in the various league fields, and which leagues they represent.
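
As a starting point for the defensive position research, the standard baseball scorekeeping numbers can be written down as a small lookup table. This is a sketch rather than part of the original notebook: the name `DEFENSIVE_POSITIONS` and the helper `position_name()` are hypothetical, and the entry for `10` assumes the game log follows Retrosheet's convention of using it for the designated hitter.

```python
# Standard scorekeeping numbers for defensive positions.
# 10 is an assumption based on Retrosheet's designated-hitter convention.
DEFENSIVE_POSITIONS = {
    1: "Pitcher",
    2: "Catcher",
    3: "First base",
    4: "Second base",
    5: "Third base",
    6: "Shortstop",
    7: "Left field",
    8: "Center field",
    9: "Right field",
    10: "Designated hitter",
}


def position_name(def_pos):
    """Translate a *_def_pos value (stored as a float in the CSV) to a name."""
    return DEFENSIVE_POSITIONS.get(int(def_pos), "Unknown")


print(position_name(2.0))  # Catcher
```

Missing values (`NaN`) would need to be filtered out before calling `position_name()`, since they cannot be converted to an integer.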

In [6]:
for (df, name) in zip([game_log, park_codes, person_codes, team_codes],
              ['game_log', 'park_codes', 'person_codes', 'team_codes']):
    
    print('')
    print('#'*10, name, '#'*10)
    print('shape:',df.shape)
    print('columns:', df.columns)

    print(df.head(3))
    print(df.tail(3))
    print('')
    


########## game_log ##########
shape: (171907, 161)
columns: Index(['date', 'number_of_game', 'day_of_week', 'v_name', 'v_league',
       'v_game_number', 'h_name', 'h_league', 'h_game_number', 'v_score',
       ...
       'h_player_7_name', 'h_player_7_def_pos', 'h_player_8_id',
       'h_player_8_name', 'h_player_8_def_pos', 'h_player_9_id',
       'h_player_9_name', 'h_player_9_def_pos', 'additional_info',
       'acquisition_info'],
      dtype='object', length=161)
       date  number_of_game day_of_week v_name v_league  v_game_number h_name  \
0  18710504               0         Thu    CL1      NaN              1    FW1   
1  18710505               0         Fri    BS1      NaN              1    WS3   
2  18710506               0         Sat    CL1      NaN              2    RC1   

  h_league  h_game_number  v_score  h_score  length_outs day_night completion  \
0      NaN              1        0        2         54.0         D        NaN   
1      NaN              1       20   

            date  number_of_game day_of_week v_name v_league  v_game_number  \
171904  20161002               0         Sun    LAN       NL            162   
171905  20161002               0         Sun    PIT       NL            162   
171906  20161002               0         Sun    MIA       NL            161   

       h_name h_league  h_game_number  v_score  h_score  length_outs  \
171904    SFN       NL            162        1        7         51.0   
171905    SLN       NL            162        4       10         51.0   
171906    WAS       NL            162        7       10         51.0   

       day_night completion forefeit protest park_id  attendance  \
171904         D        NaN      NaN     NaN   SFO03     41445.0   
171905         D        NaN      NaN     NaN   STL10     44615.0   
171906         D        NaN      NaN     NaN   WAS11     28730.0   

        length_minutes v_line_score h_line_score  v_at_bats  v_hits  \
171904           184.0    000100000    23000002x  

In [46]:
# game log data from 1871.05.04 - 2016.10.02
game_log.date.unique()

array([18710504, 18710505, 18710506, ..., 20160930, 20161001, 20161002])

In [6]:
print('park_id' in game_log.columns)
print('id' in game_log.columns)
print('team_id' in game_log.columns)

True
False
False


In [7]:
# player id

import re
print(re.findall(r'[0-9a-z_]*_id_[0-9a-z_]*', ' '.join(game_log.columns)))
print(re.findall(r'[0-9a-z_]*_id', ' '.join(game_log.columns)))

['winning_rbi_batter_id_name']
['park_id', 'hp_umpire_id', '1b_umpire_id', '2b_umpire_id', '3b_umpire_id', 'lf_umpire_id', 'rf_umpire_id', 'v_manager_id', 'h_manager_id', 'winning_pitcher_id', 'losing_pitcher_id', 'saving_pitcher_id', 'winning_rbi_batter_id', 'winning_rbi_batter_id', 'v_starting_pitcher_id', 'h_starting_pitcher_id', 'v_player_1_id', 'v_player_2_id', 'v_player_3_id', 'v_player_4_id', 'v_player_5_id', 'v_player_6_id', 'v_player_7_id', 'v_player_8_id', 'v_player_9_id', 'h_player_1_id', 'h_player_2_id', 'h_player_3_id', 'h_player_4_id', 'h_player_5_id', 'h_player_6_id', 'h_player_7_id', 'h_player_8_id', 'h_player_9_id']


In [8]:
# There is no team ID column in the game log.
# Instead, 'v_name' and 'h_name' hold team codes
# in the same format as 'team_id' in the 'team_codes' dataset.

print(game_log.v_name[:3])
print(game_log.h_name[:3])

0    CL1
1    BS1
2    CL1
Name: v_name, dtype: object
0    FW1
1    WS3
2    RC1
Name: h_name, dtype: object


### Note 1 : dataset relationship

* `game_log.csv` [ `v_player_[num]_id, ...` ] --- `person_codes.csv` [ `id` ] 
* `game_log.csv` [ `park_id` ] --- `park_codes.csv` [ `park_id` ]
* `game_log.csv` [ `v_name`, `h_name` ] --- `team_codes.csv` [ `team_id` ]
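
One way to check relationships like these is to look for values in the main file that never appear in the helper file. The sketch below uses two tiny stand-in DataFrames so that it runs on its own; the frame names, the helper `missing_keys()`, and the value `XXX99` are all hypothetical. On the real data you would pass `game_log` and `park_codes` instead.

```python
import pandas as pd

# Tiny stand-in frames (hypothetical values) so the sketch is self-contained.
game_log_sample = pd.DataFrame({"park_id": ["FOR01", "WAS01", "XXX99"]})
park_codes_sample = pd.DataFrame({"park_id": ["FOR01", "WAS01"]})


def missing_keys(child, child_col, parent, parent_col):
    """Return values of child[child_col] that never appear in parent[parent_col]."""
    values = child[child_col].dropna()
    return sorted(set(values[~values.isin(parent[parent_col])]))


print(missing_keys(game_log_sample, "park_id", park_codes_sample, "park_id"))
```

An empty list means every key in the main file has a match in the helper file, which is what we would hope to see before relying on the helper file as a lookup table.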

### Note 2 : game_log_fields

```
Field(s)  Meaning
    1     Date in the form "yyyymmdd"
    2     Number of game:
             "0" -- a single game
             "1" -- the first game of a double (or triple) header
                    including seperate admission doubleheaders
             "2" -- the second game of a double (or triple) header
                    including seperate admission doubleheaders
             "3" -- the third game of a triple-header
             "A" -- the first game of a double-header involving 3 teams
             "B" -- the second game of a double-header involving 3 teams
    3     Day of week  ("Sun","Mon","Tue","Wed","Thu","Fri","Sat")
  4-5     Visiting team and league
    6     Visiting team game number
          For this and the home team game number, ties are counted as
          games and suspended games are counted from the starting
          rather than the ending date.
  7-8     Home team and league
    9     Home team game number
10-11     Visiting and home team score (unquoted)
   12     Length of game in outs (unquoted).  A full 9-inning game would
          have a 54 in this field.  If the home team won without batting
          in the bottom of the ninth, this field would contain a 51.
   13     Day/night indicator ("D" or "N")
   14     Completion information.  If the game was completed at a
          later date (either due to a suspension or an upheld protest)
          this field will include:
             "yyyymmdd,park,vs,hs,len" Where
          yyyymmdd -- the date the game was completed
          park -- the park ID where the game was completed
          vs -- the visitor score at the time of interruption
          hs -- the home score at the time of interruption
          len -- the length of the game in outs at time of interruption
          All the rest of the information in the record refers to the
          entire game.
   15     Forfeit information:
             "V" -- the game was forfeited to the visiting team
             "H" -- the game was forfeited to the home team
             "T" -- the game was ruled a no-decision
   16     Protest information:
             "P" -- the game was protested by an unidentified team
             "V" -- a disallowed protest was made by the visiting team
             "H" -- a disallowed protest was made by the home team
             "X" -- an upheld protest was made by the visiting team
             "Y" -- an upheld protest was made by the home team
          Note: two of these last four codes can appear in the field
          (if both teams protested the game).
   17     Park ID
   18     Attendance (unquoted)
   19     Time of game in minutes (unquoted)
20-21     Visiting and home line scores.  For example:
             "010000(10)0x"
          Would indicate a game where the home team scored a run in
          the second inning, ten in the seventh and didn't bat in the
          bottom of the ninth.
22-38     Visiting team offensive statistics (unquoted) (in order):
             at-bats
             hits
             doubles
             triples
             homeruns
             RBI
             sacrifice hits.  This may include sacrifice flies for years
                prior to 1954 when sacrifice flies were allowed.
             sacrifice flies (since 1954)
             hit-by-pitch
             walks
             intentional walks
             strikeouts
             stolen bases
             caught stealing
             grounded into double plays
             awarded first on catcher's interference
             left on base
39-43     Visiting team pitching statistics (unquoted)(in order):
             pitchers used ( 1 means it was a complete game )
             individual earned runs
             team earned runs
             wild pitches
             balks
44-49     Visiting team defensive statistics (unquoted) (in order):
             putouts.  Note: prior to 1931, this may not equal 3 times
                the number of innings pitched.  Prior to that, no
                putout was awarded when a runner was declared out for
                being hit by a batted ball.
             assists
             errors
             passed balls
             double plays
             triple plays
50-66     Home team offensive statistics
67-71     Home team pitching statistics
72-77     Home team defensive statistics
78-79     Home plate umpire ID and name
80-81     1B umpire ID and name
82-83     2B umpire ID and name
84-85     3B umpire ID and name
86-87     LF umpire ID and name
88-89     RF umpire ID and name
          If any umpire positions were not filled for a particular game
          the fields will be "","(none)".
90-91     Visiting team manager ID and name
92-93     Home team manager ID and name
94-95     Winning pitcher ID and name
96-97     Losing pitcher ID and name
98-99     Saving pitcher ID and name--"","(none)" if none awarded
100-101   Game Winning RBI batter ID and name--"","(none)" if none
          awarded
102-103   Visiting starting pitcher ID and name
104-105   Home starting pitcher ID and name
106-132   Visiting starting players ID, name and defensive position,
          listed in the order (1-9) they appeared in the batting order.
133-159   Home starting players ID, name and defensive position
          listed in the order (1-9) they appeared in the batting order.
  160     Additional information.  This is a grab-bag of informational
          items that might not warrant a field on their own.  The field 
          is alpha-numeric. Some items are represented by tokens such as:
             "HTBF" -- home team batted first.
             Note: if "HTBF" is specified it would be possible to see
             something like "01002000x" in the visitor's line score.
          Changes in umpire positions during a game will also appear in 
          this field.  These will be in the form:
             umpchange,inning,umpPosition,umpid with the latter three
             repeated for each umpire.
          These changes occur with umpire injuries, late arrival of 
          umpires or changes from completion of suspended games. Details
          of suspended games are in field 14.
  161     Acquisition information:
             "Y" -- we have the complete game
             "N" -- we don't have any portion of the game
             "D" -- the game was derived from box score and game story
             "P" -- we have some portion of the game.  We may be missing
                    innings at the beginning, middle and end of the game.
 
Missing fields will be NULL.
```
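
The line score format described for fields 20-21 is compact, so a small parser can help when exploring it. The function below is a sketch based only on the description above: single digits are per-inning runs, a parenthesised number is an inning with ten or more runs, and `x` marks a half-inning that was not played. The name `parse_line_score()` is hypothetical, and the input is assumed to be a string.

```python
import re


def parse_line_score(line_score):
    """Split a line score string into a list of runs per inning.

    An unplayed half-inning ("x") is returned as None.
    """
    tokens = re.findall(r"\((\d+)\)|(\d)|(x)", line_score)
    innings = []
    for multi, single, unplayed in tokens:
        if unplayed:
            innings.append(None)
        else:
            innings.append(int(multi or single))
    return innings


runs = parse_line_score("010000(10)0x")
print(runs)                                   # [0, 1, 0, 0, 0, 0, 10, 0, None]
print(sum(r for r in runs if r is not None))  # 11
```

In the sample output shown earlier, some line scores appear as plain integers (for example `10010000`), which suggests pandas parsed that column numerically and dropped leading zeros. Reading the line score columns as strings would be needed before a parser like this gives reliable results.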

## Importing Data into SQLite

To insert data into a normalized database, we'll need a single column that can be used as a primary key. The game log file does not have such a column to uniquely identify each game. There are three ways that we could handle this:
* Make a compound primary key, such as a primary key of the `date`, `h_name`, and `number_of_game` columns.
* Insert an integer primary key, e.g. where the first row is `1`, the second row is `2`, etc.
* Insert a new column using a custom format.

Because we have not yet normalized our data, it's better not to start with a compound primary key - if we do this, we might end up needing to create a compound key in another table that includes this compound key, which would quickly become cumbersome to work with. An integer primary key is a good choice, but we should first explore whether Retrosheet already has a system for uniquely identifying each game. If it does, this is a better option. It means that if at some later stage we choose to incorporate more detailed game data into our database, the keys we use will be compatible with other sources.<br>

Exploring the Retrosheet site, we can find this [data dictionary](http://www.retrosheet.org/eventfile.htm) for their event files, which list every event within each game. This includes the following description:
* `id`: Each game begins with a **twelve character ID record** which identifies the `date`, `home team`, and `number of the game`. For example, `ATL198304080` should be read as follows. 
  * The first three characters identify the home team (the Braves). 
  * The next four are the year (1983). 
  * The next two are the month (April) using the standard numeric notation, 04, followed by the day (08). 
  * The last digit indicates if this is a single game (`0`), first game (`1`) or second game (`2`) if more than one game is played during a day, usually a double header. The id record starts the description of a game, thus ending the description of the preceding game in the file.

You might notice that this essentially makes a custom key using the three columns we identified in our composite key example earlier. After we import the data, we'll construct this column to use as a primary key in our final database.<br>
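
To make the format concrete, here is a sketch in plain Python that builds an ID from its three parts and splits one back apart. Both function names are hypothetical, and the sketch assumes every team code is exactly three characters long, as in the Retrosheet example.

```python
from datetime import date


def build_game_id(h_name, game_date, number_of_game):
    """Concatenate home team, yyyymmdd date and game number."""
    return f"{h_name}{game_date}{number_of_game}"


def parse_game_id(game_id):
    """Split a twelve-character game ID back into its three components."""
    if len(game_id) != 12:
        raise ValueError(f"expected 12 characters, got {len(game_id)}")
    home_team = game_id[:3]
    played_on = date(int(game_id[3:7]), int(game_id[7:9]), int(game_id[9:11]))
    number_of_game = game_id[11]
    return home_team, played_on, number_of_game


print(build_game_id("ATL", 19830408, 0))  # ATL198304080
print(parse_game_id("ATL198304080"))
```

Being able to reverse the ID is a useful sanity check: if parsing fails for some rows, the key was probably built from missing or malformed values.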

Our next task is to import the data into SQLite. There are **three key ways to import data into a SQLite database**:

### 1. Using the Python SQLite library
The [Python SQLite library](https://docs.python.org/3/library/sqlite3.html) gives us ultimate control when importing data. We will first need to get the data into Python - we might choose to use the [csv module](https://docs.python.org/3/library/csv.html) for this. Next, we would use the [`Cursor.execute()` method](https://docs.python.org/3/library/sqlite3.html#sqlite3.Cursor.execute) to create a table for our data.<br>

Lastly, we can use the [`Cursor.executemany()` method](https://docs.python.org/3/library/sqlite3.html#sqlite3.Cursor.executemany) to insert multiple rows of data in a single command. If we create our [connection object](https://docs.python.org/3/library/sqlite3.html#connection-objects) with a filename that doesn't exist, the sqlite module will create the database file for us.<br>

We should take advantage of the `?` placeholder value syntax instead of using [Python string formatting](https://pyformat.info/) **to prevent [SQL injection attacks](https://en.wikipedia.org/wiki/SQL_injection#Incorrect_type_handling)** (like the hilarious [XKCD 'Bobby Tables' comic example](https://xkcd.com/327/)) and maintain the correct data types.

### Even though in this project we won't be running any external user code, this is an extremely good habit to get into. 

Here's what our syntax would look like for the last step:

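```python
# Each inner list is one row; its length must match the number of `?` placeholders.
my_list_of_lists = [
    [4, 4, 8, 2],
    [5, 1, 6, 3],
    [5, 2, 4, 6]
]
# `[...]` stands for the remaining columns and placeholders (a template, not literal SQL).
c = """
INSERT INTO table_name (
    column_one,
    column_two,
    [...]
) VALUES (
    ?,
    ?,
    [...]
);
"""
# `cur` is a cursor created from an open sqlite3 connection.
cur.executemany(c, my_list_of_lists)
```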

The advantage of this method is that we have the highest level of control over what we're doing. Additionally, if we have larger data, we can write a loop that iterates over our source line by line so that we don't have to read all of it into memory at once.<br>

**The disadvantage is that there is a lot of manual data handling required**.
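
The steps described in this section can be sketched end to end as follows. This is an illustration only: it uses an in-memory database and a small CSV held in a string so that it runs anywhere, and the table name `park_sample`, its columns, and the second data row are hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical CSV content standing in for a file on disk.
csv_text = (
    "park_id,name,city\n"
    "ALB01,Riverside Park,Albany\n"
    "XXX01,Hypothetical Park,Nowhere\n"
)

# ":memory:" keeps the sketch self-contained; a filename would create the file if missing.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE park_sample (park_id TEXT PRIMARY KEY, name TEXT, city TEXT);")

reader = csv.reader(io.StringIO(csv_text))
next(reader)  # skip the header row

# executemany() pulls rows from the reader one at a time,
# so the whole source never has to sit in memory at once.
insert = "INSERT INTO park_sample (park_id, name, city) VALUES (?, ?, ?);"
cur.executemany(insert, reader)
conn.commit()

rows = cur.execute("SELECT park_id, name, city FROM park_sample ORDER BY park_id;").fetchall()
print(rows)
conn.close()
```

With a real file, `io.StringIO(csv_text)` would be replaced by an open file handle, and the rest of the code would stay the same.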

### 2. Using pandas
The pandas library includes a handy [`DataFrame.to_sql()` method](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_sql.html) that we can use to send the contents of a dataframe to a SQLite connection object. We can either create the table first using the method above, or if the table does not exist, pandas will create it for us. Here's an example of what that looks like:

```python
my_dataframe.to_sql('table_name', sqlite_connection_object, index=False)
```
Most of the time, we'll want to use `index=False`, otherwise pandas will create an extra column for the pandas index.<br>

The advantage of this method is that **it can often be done with a line or two of code**.<br>

The disadvantage is that pandas **may alter the data as it reads it in and converts the columns to types automatically**. Additionally, this **requires the data to be small enough to be able to be stored in-memory** using pandas.
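
Here is a small, self-contained sketch of the pandas route, using an in-memory database and a two-row DataFrame with hypothetical column names. It also inspects the table afterwards with `PRAGMA table_info`, which is a quick way to see which column types pandas chose on our behalf.

```python
import sqlite3

import pandas as pd

# Hypothetical sample data; the real project would use the DataFrames read from CSV.
df = pd.DataFrame({"team_id": ["CL1", "BS1"], "games": [1, 2]})

conn = sqlite3.connect(":memory:")

# index=False avoids an extra "index" column;
# if_exists="replace" lets the cell be re-run without an error.
df.to_sql("team_sample", conn, index=False, if_exists="replace")

# Each row of table_info is (cid, name, type, notnull, default, pk).
info = conn.execute("PRAGMA table_info(team_sample);").fetchall()
cols = [row[1] for row in info]
print(cols)
print([row[2] for row in info])  # the types pandas picked

round_trip = pd.read_sql("SELECT * FROM team_sample", conn)
print(round_trip)
```

Note that `if_exists="replace"` drops and recreates the table, so it is only appropriate while the table holds nothing but the imported data.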

### 3. From the SQLite shell
The last method is to use the SQLite shell to import the data. Like the pandas method, we can either create the table manually ourselves, or rely on SQLite to do it for us. Here are the commands you would use to import a CSV using the SQLite shell:

```bash
sqlite> .mode csv
sqlite> .import filename.csv table_name
```

### This is one of the quickest methods to use and works well with large data sources. 

There are several **minor inconveniences** to this method. 
* SQLite **does not infer column types from the data** when it creates the table for you (imported columns are stored as text), which can lead to incorrect types.
* You'll need **SQLite shell access**, which you won't always have.
* Lastly, if you want to create the table yourself, you will **need to remove the header from the first line** of your CSV, otherwise SQLite will make that the first row of your table.

With all of these methods, unless we explicitly create the table, **the table will be created with no primary key**. For now this isn't a problem as we'll be migrating this data into new, normalized tables.<br>

We'll use the pandas method in this instance, because we've already read the data into dataframes.

### The type conversion isn't a big issue.
As outlined above, we will move the data into new tables and can handle type conversion then.<br>

#### Hint: 
Just like in the previous guided project, our database retains 'state', so if we run a query that creates or modifies a table twice, the query will fail. [Some commands like `CREATE TABLE`](https://sqlite.org/lang_createtable.html) support `IF NOT EXISTS` which will allow you to run your notebook without these errors. You should consult the [SQLite documentation](https://sqlite.org/lang.html) for the availability and syntax of these clauses.
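
The difference these clauses make can be seen in a short sketch. It uses an in-memory database and a hypothetical `person_sample` table; the point is only that the statements with `IF NOT EXISTS` and `IF EXISTS` can be run repeatedly, while the plain `CREATE TABLE` fails the second time.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

create = "CREATE TABLE IF NOT EXISTS person_sample (person_id TEXT PRIMARY KEY, name TEXT);"
conn.execute(create)
conn.execute(create)  # second run is a no-op rather than an error

try:
    # Without the clause, creating an existing table raises an error.
    conn.execute("CREATE TABLE person_sample (person_id TEXT PRIMARY KEY, name TEXT);")
    failed = False
except sqlite3.OperationalError as err:
    failed = True
    print("without IF NOT EXISTS:", err)

conn.execute("DROP TABLE IF EXISTS person_sample;")
conn.execute("DROP TABLE IF EXISTS person_sample;")  # also safe to repeat

remaining = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table';").fetchall()
print(remaining)
```

Not every statement has such a clause. `ALTER TABLE ... ADD COLUMN`, for example, does not accept `IF NOT EXISTS`, so re-running a cell that adds a column will still raise an error.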

* Recreate the `run_command()` and `run_query()` functions from the previous guided project, which you can use to run SQL against the database.
* Use `DataFrame.to_sql()` to create tables for each of our dataframes in a new SQLite database, `mlb.db`:
  * The table name should be the same as each CSV filename without the extension, e.g. `game_log.csv` should be imported to a table called `game_log`.
* Using `run_command()`, create a new column in the `game_log` table called `game_id`:
  * Use **SQL string concatenation** to update the new columns with a unique ID using the Retrosheet format outlined above.

In [7]:
import sqlite3

In [8]:
def run_query(query):
    with sqlite3.connect('mlb.db') as conn:
        return pd.read_sql(query, conn)
    
def run_command(query):
    with sqlite3.connect('mlb.db') as conn:
        conn.execute(query)
        conn.commit()

* Create an empty database named `mlb.db` from the terminal. This step is optional, since `sqlite3.connect('mlb.db')` creates the file if it does not exist.

In [9]:
%%bash
sqlite3 mlb.db

In [10]:
con = sqlite3.connect('mlb.db')
game_log.to_sql('game_log', con)

In [11]:
run_query('SELECT * FROM game_log LIMIT 5')

Unnamed: 0,index,date,number_of_game,day_of_week,v_name,v_league,v_game_number,h_name,h_league,h_game_number,v_score,h_score,length_outs,day_night,completion,forefeit,protest,park_id,attendance,length_minutes,v_line_score,h_line_score,v_at_bats,v_hits,v_doubles,v_triples,v_homeruns,v_rbi,v_sacrifice_hits,v_sacrifice_flies,v_hit_by_pitch,v_walks,v_intentional_walks,v_strikeouts,v_stolen_bases,v_caught_stealing,v_grounded_into_double,v_first_catcher_interference,v_left_on_base,v_pitchers_used,v_individual_earned_runs,v_team_earned_runs,v_wild_pitches,v_balks,v_putouts,v_assists,v_errors,v_passed_balls,v_double_plays,v_triple_plays,h_at_bats,h_hits,h_doubles,h_triples,h_homeruns,h_rbi,h_sacrifice_hits,h_sacrifice_flies,h_hit_by_pitch,h_walks,h_intentional_walks,h_strikeouts,h_stolen_bases,h_caught_stealing,h_grounded_into_double,h_first_catcher_interference,h_left_on_base,h_pitchers_used,h_individual_earned_runs,h_team_earned_runs,h_wild_pitches,h_balks,h_putouts,h_assists,h_errors,h_passed_balls,h_double_plays,h_triple_plays,hp_umpire_id,hp_umpire_name,1b_umpire_id,1b_umpire_name,2b_umpire_id,2b_umpire_name,3b_umpire_id,3b_umpire_name,lf_umpire_id,lf_umpire_name,rf_umpire_id,rf_umpire_name,v_manager_id,v_manager_name,h_manager_id,h_manager_name,winning_pitcher_id,winning_pitcher_name,losing_pitcher_id,losing_pitcher_name,saving_pitcher_id,saving_pitcher_name,winning_rbi_batter_id,winning_rbi_batter_id_name,v_starting_pitcher_id,v_starting_pitcher_name,h_starting_pitcher_id,h_starting_pitcher_name,v_player_1_id,v_player_1_name,v_player_1_def_pos,v_player_2_id,v_player_2_name,v_player_2_def_pos,v_player_3_id,v_player_3_name,v_player_3_def_pos,v_player_4_id,v_player_4_name,v_player_4_def_pos,v_player_5_id,v_player_5_name,v_player_5_def_pos,v_player_6_id,v_player_6_name,v_player_6_def_pos,v_player_7_id,v_player_7_name,v_player_7_def_pos,v_player_8_id,v_player_8_name,v_player_8_def_pos,v_player_9_id,v_player_9_name,v_player_9_def_pos,h_player_1_id,h_player_1_name,h_player_
1_def_pos,h_player_2_id,h_player_2_name,h_player_2_def_pos,h_player_3_id,h_player_3_name,h_player_3_def_pos,h_player_4_id,h_player_4_name,h_player_4_def_pos,h_player_5_id,h_player_5_name,h_player_5_def_pos,h_player_6_id,h_player_6_name,h_player_6_def_pos,h_player_7_id,h_player_7_name,h_player_7_def_pos,h_player_8_id,h_player_8_name,h_player_8_def_pos,h_player_9_id,h_player_9_name,h_player_9_def_pos,additional_info,acquisition_info
0,0,18710504,0,Thu,CL1,,1,FW1,,1,0,2,54.0,D,,,,FOR01,200.0,120.0,0,10010000,30.0,4.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,,6.0,1.0,,-1.0,,4.0,1.0,1.0,1.0,0.0,0.0,27.0,9.0,0.0,3.0,0.0,0.0,31.0,4.0,1.0,0.0,0.0,2.0,0.0,0.0,0.0,1.0,,0.0,0.0,,-1.0,,3.0,1.0,0.0,0.0,0.0,0.0,27.0,3.0,3.0,1.0,1.0,0.0,boakj901,John Boake,,,,,,,,,,,paboc101,Charlie Pabor,lennb101,Bill Lennon,mathb101,Bobby Mathews,prata101,Al Pratt,,,,,prata101,Al Pratt,mathb101,Bobby Mathews,whitd102,Deacon White,2.0,kimbg101,Gene Kimball,4.0,paboc101,Charlie Pabor,7.0,allia101,Art Allison,8.0,white104,Elmer White,9.0,prata101,Al Pratt,1.0,sutte101,Ezra Sutton,5.0,carlj102,Jim Carleton,3.0,bassj101,John Bass,6.0,selmf101,Frank Sellman,5.0,mathb101,Bobby Mathews,1.0,foraj101,Jim Foran,3.0,goldw101,Wally Goldsmith,6.0,lennb101,Bill Lennon,2.0,caret101,Tom Carey,4.0,mince101,Ed Mincher,7.0,mcdej101,James McDermott,8.0,kellb105,Bill Kelly,9.0,,Y
1,1,18710505,0,Fri,BS1,,1,WS3,,1,20,18,54.0,D,,,,WAS01,5000.0,145.0,107000435,640113030,41.0,13.0,1.0,2.0,0.0,13.0,0.0,0.0,0.0,18.0,,5.0,3.0,,-1.0,,12.0,1.0,6.0,6.0,1.0,0.0,27.0,13.0,10.0,1.0,2.0,0.0,49.0,14.0,2.0,0.0,0.0,11.0,0.0,0.0,0.0,10.0,,2.0,1.0,,-1.0,,14.0,1.0,7.0,7.0,0.0,0.0,27.0,20.0,10.0,2.0,3.0,0.0,dobsh901,Henry Dobson,,,,,,,,,,,wrigh101,Harry Wright,younn801,Nick Young,spala101,Al Spalding,braia102,Asa Brainard,,,,,spala101,Al Spalding,braia102,Asa Brainard,wrigg101,George Wright,6.0,barnr102,Ross Barnes,4.0,birdd102,Dave Birdsall,9.0,mcvec101,Cal McVey,2.0,wrigh101,Harry Wright,8.0,goulc101,Charlie Gould,3.0,schah101,Harry Schafer,5.0,conef101,Fred Cone,7.0,spala101,Al Spalding,1.0,watef102,Fred Waterman,5.0,forcd101,Davy Force,6.0,mille105,Everett Mills,3.0,allid101,Doug Allison,2.0,hallg101,George Hall,7.0,leona101,Andy Leonard,4.0,braia102,Asa Brainard,1.0,burrh101,Henry Burroughs,9.0,berth101,Henry Berthrong,8.0,HTBF,Y
2,2,18710506,0,Sat,CL1,,2,RC1,,1,12,4,54.0,D,,,,RCK01,1000.0,140.0,610020003,10020100,49.0,11.0,1.0,1.0,0.0,8.0,0.0,0.0,0.0,0.0,,1.0,0.0,,-1.0,,10.0,1.0,0.0,0.0,2.0,0.0,27.0,12.0,8.0,5.0,0.0,0.0,36.0,7.0,2.0,1.0,0.0,2.0,0.0,0.0,0.0,0.0,,3.0,5.0,,-1.0,,5.0,1.0,3.0,3.0,1.0,0.0,27.0,12.0,13.0,3.0,0.0,0.0,mawnj901,J.H. Manny,,,,,,,,,,,paboc101,Charlie Pabor,hasts101,Scott Hastings,prata101,Al Pratt,fishc102,Cherokee Fisher,,,,,prata101,Al Pratt,fishc102,Cherokee Fisher,whitd102,Deacon White,2.0,kimbg101,Gene Kimball,4.0,paboc101,Charlie Pabor,7.0,allia101,Art Allison,8.0,white104,Elmer White,9.0,prata101,Al Pratt,1.0,sutte101,Ezra Sutton,5.0,carlj102,Jim Carleton,3.0,bassj101,John Bass,6.0,mackd101,Denny Mack,3.0,addyb101,Bob Addy,4.0,fishc102,Cherokee Fisher,1.0,hasts101,Scott Hastings,8.0,ham-r101,Ralph Ham,5.0,ansoc101,Cap Anson,2.0,sagep101,Pony Sager,6.0,birdg101,George Bird,7.0,stirg101,Gat Stires,9.0,,Y
3,3,18710508,0,Mon,CL1,,3,CH1,,1,12,14,54.0,D,,,,CHI01,5000.0,150.0,101403111,77000000,46.0,15.0,2.0,1.0,2.0,10.0,0.0,0.0,0.0,0.0,,1.0,0.0,,-1.0,,7.0,1.0,6.0,6.0,0.0,0.0,27.0,15.0,11.0,6.0,0.0,0.0,43.0,11.0,2.0,0.0,0.0,8.0,0.0,0.0,0.0,4.0,,2.0,1.0,,-1.0,,6.0,1.0,4.0,4.0,0.0,0.0,27.0,14.0,7.0,2.0,0.0,0.0,willg901,Gardner Willard,,,,,,,,,,,paboc101,Charlie Pabor,woodj106,Jimmy Wood,zettg101,George Zettlein,prata101,Al Pratt,,,,,prata101,Al Pratt,zettg101,George Zettlein,whitd102,Deacon White,2.0,kimbg101,Gene Kimball,4.0,paboc101,Charlie Pabor,7.0,allia101,Art Allison,8.0,white104,Elmer White,9.0,prata101,Al Pratt,1.0,sutte101,Ezra Sutton,5.0,carlj102,Jim Carleton,3.0,bassj101,John Bass,6.0,mcatb101,Bub McAtee,3.0,kingm101,Marshall King,8.0,hodec101,Charlie Hodes,2.0,woodj106,Jimmy Wood,4.0,simmj101,Joe Simmons,9.0,folet101,Tom Foley,7.0,duffe101,Ed Duffy,6.0,pinke101,Ed Pinkham,5.0,zettg101,George Zettlein,1.0,,Y
4,4,18710509,0,Tue,BS1,,2,TRO,,1,9,5,54.0,D,,,,TRO01,3250.0,145.0,2232,101003000,46.0,17.0,4.0,1.0,0.0,6.0,0.0,0.0,0.0,2.0,,0.0,1.0,,-1.0,,12.0,1.0,2.0,2.0,0.0,0.0,27.0,12.0,5.0,0.0,1.0,0.0,36.0,9.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,3.0,,0.0,2.0,,-1.0,,7.0,1.0,3.0,3.0,1.0,0.0,27.0,11.0,7.0,3.0,0.0,0.0,leroi901,Isaac Leroy,,,,,,,,,,,wrigh101,Harry Wright,pikel101,Lip Pike,spala101,Al Spalding,mcmuj101,John McMullin,,,,,spala101,Al Spalding,mcmuj101,John McMullin,wrigg101,George Wright,6.0,barnr102,Ross Barnes,4.0,birdd102,Dave Birdsall,9.0,mcvec101,Cal McVey,2.0,wrigh101,Harry Wright,8.0,goulc101,Charlie Gould,3.0,schah101,Harry Schafer,5.0,conef101,Fred Cone,7.0,spala101,Al Spalding,1.0,flync101,Clipper Flynn,9.0,mcgem101,Mike McGeary,2.0,yorkt101,Tom York,8.0,mcmuj101,John McMullin,1.0,kings101,Steve King,7.0,beave101,Edward Beavens,4.0,bells101,Steve Bellan,5.0,pikel101,Lip Pike,3.0,cravb101,Bill Craver,6.0,HTBF,Y


In [12]:
add_column_game_id_query = '''
        ALTER TABLE game_log
        ADD COLUMN game_id
'''
run_command(add_column_game_id_query)

In [13]:
update_game_id_concat_query = '''
        UPDATE game_log
        SET game_id = h_name||date||number_of_game;
'''
run_command(update_game_id_concat_query)

In [14]:
run_query('SELECT game_id FROM game_log LIMIT 5')

Unnamed: 0,game_id
0,FW1187105040
1,WS3187105050
2,RC1187105060
3,CH1187105080
4,TRO187105090
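
Before relying on `game_id` as a primary key, it is worth confirming that it is unique and never `NULL` (in SQLite, concatenating with `||` yields `NULL` if any operand is `NULL`). The sketch below shows one way to check this on a three-row stand-in table, `game_log_sample`, which is hypothetical; the same two queries could be run against `game_log` with `run_query()`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE game_log_sample "
    "(h_name TEXT, date INTEGER, number_of_game INTEGER, game_id TEXT);"
)
conn.executemany(
    "INSERT INTO game_log_sample (h_name, date, number_of_game) VALUES (?, ?, ?);",
    [("FW1", 18710504, 0), ("WS3", 18710505, 0), ("RC1", 18710506, 0)],
)
conn.execute("UPDATE game_log_sample SET game_id = h_name || date || number_of_game;")

# Any game_id that occurs more than once would show up here.
duplicates = conn.execute("""
    SELECT game_id, COUNT(*)
    FROM game_log_sample
    GROUP BY game_id
    HAVING COUNT(*) > 1;
""").fetchall()

# Rows where the concatenation produced NULL.
null_ids = conn.execute(
    "SELECT COUNT(*) FROM game_log_sample WHERE game_id IS NULL;"
).fetchone()[0]

print(duplicates, null_ids)
```

Both results should be empty or zero before the column is declared as a primary key in the normalized schema.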


## Looking for Normalization Opportunities

When we spoke about database normalization in the previous mission, we mentioned that there were **normal forms**, a series of 5 progressive stages. Each of these stages has specific rules about the structure of the data that you can use to normalize.<br>

Rather than learn and follow specific normalized forms, we're going to look for specific opportunities to normalize our data by reducing repetition. Here are two examples of repetition we can find and remove:<br>

### Repetition in columns

Let's look at the following segment of data:

In [15]:
run_query('''SELECT v_player_1_id, v_player_1_name, v_player_1_def_pos,
                    v_player_2_id, v_player_2_name, v_player_2_def_pos
                FROM game_log
                LIMIT 5
''')

Unnamed: 0,v_player_1_id,v_player_1_name,v_player_1_def_pos,v_player_2_id,v_player_2_name,v_player_2_def_pos
0,whitd102,Deacon White,2.0,kimbg101,Gene Kimball,4.0
1,wrigg101,George Wright,6.0,barnr102,Ross Barnes,4.0
2,whitd102,Deacon White,2.0,kimbg101,Gene Kimball,4.0
3,whitd102,Deacon White,2.0,kimbg101,Gene Kimball,4.0
4,wrigg101,George Wright,6.0,barnr102,Ross Barnes,4.0


We have three columns that relate to one player, followed by three columns that relate to another player. We could restructure our data to remove this repetition - we would need to add an extra column to include the data that was previously only contained in the name of the column:

id|name|def_pos|off_pos
---|---|---|---
villj001|Jonathan Villar|5.0|1.0
granc001|Curtis Granderson|8.0|1.0
kendh001|Howie Kendrick|7.0|1.0
jasoj001|John Jaso|3.0|1.0
gordd002|Dee Gordon|4.0|1.0
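
The restructuring described above can be sketched with a `UNION ALL` that stacks each group of player columns into rows and records the batting position, which was previously only encoded in the column name. This is a minimal illustration on a toy in-memory table with a single row copied from the output above, not the approach the project ultimately uses. The cut-down `game_log` definition is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Cut-down, hypothetical version of game_log with only two player slots.
conn.execute("""
    CREATE TABLE game_log (
        game_id TEXT,
        v_player_1_id TEXT, v_player_1_def_pos REAL,
        v_player_2_id TEXT, v_player_2_def_pos REAL
    )
""")
conn.execute(
    "INSERT INTO game_log VALUES "
    "('FW1187105040', 'whitd102', 2.0, 'kimbg101', 4.0)"
)

# One SELECT per player slot; the literal 1 / 2 becomes the off_pos column.
long_rows = conn.execute("""
    SELECT game_id, v_player_1_id AS id, v_player_1_def_pos AS def_pos, 1 AS off_pos
    FROM game_log
    UNION ALL
    SELECT game_id, v_player_2_id, v_player_2_def_pos, 2
    FROM game_log
    ORDER BY off_pos
""").fetchall()

for row in long_rows:
    print(row)
```

Each wide row becomes one row per player, so adding a tenth player slot would mean adding rows rather than columns.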

### Non-primary key columns should be attributes of the primary key
The primary key of our game log is `game_id`, but the players' names are not attributes of a game; they are attributes of the player ID. If the only data we had was the game log, we would remove the name columns and create a new table that held the name of each player. As it happens, our `person_codes` table already has a list of our player IDs and names, so we can remove the name columns without needing to create a new table first.
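
To see why dropping the name columns loses nothing, here is a small sketch that recovers a player's name with a join on the player ID. The two tables are toy, in-memory stand-ins with one row each, and the trimmed column lists are assumptions made for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Toy stand-ins: only the columns needed for the illustration.
    CREATE TABLE person_codes (id TEXT PRIMARY KEY, first TEXT, last TEXT);
    INSERT INTO person_codes VALUES ('whitd102', 'Deacon', 'White');

    CREATE TABLE game_log (game_id TEXT, v_player_1_id TEXT);
    INSERT INTO game_log VALUES ('FW1187105040', 'whitd102');
""")

# The name is looked up through the ID instead of being stored per game.
row = conn.execute("""
    SELECT g.game_id, p.first || ' ' || p.last AS player_name
    FROM game_log AS g
    JOIN person_codes AS p ON p.id = g.v_player_1_id
""").fetchone()
print(row)
```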

### Redundant Data
Lastly, we want to eliminate any redundant data, that is, columns where the data is available elsewhere. A good example of this can be found in our `park_codes` table, which will form the basis of our eventual park table. Let's look at the first few rows (we won't display the notes column, as it is not relevant to our discussion):

In [16]:
con = sqlite3.connect('mlb.db')
park_codes.to_sql('park_codes', con)

In [17]:
run_query('SELECT park_id, name, aka, city, state, end, league FROM park_codes LIMIT 5')

Unnamed: 0,park_id,name,aka,city,state,end,league
0,ALB01,Riverside Park,,Albany,NY,05/30/1882,NL
1,ALT01,Columbia Park,,Altoona,PA,05/31/1884,UA
2,ANA01,Angel Stadium of Anaheim,Edison Field; Anaheim Stadium,Anaheim,CA,,AL
3,ARL01,Arlington Stadium,,Arlington,TX,10/03/1993,AL
4,ARL02,Rangers Ballpark in Arlington,The Ballpark in Arlington; Ameriquest Fl,Arlington,TX,,AL


The start and end columns show the first and last games played at the park; however, we will be able to derive this information by looking at the park information for each game. Similarly, the league information is going to be available elsewhere in our database.
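
As a rough sketch of how the first and last game at each park could be derived from game-level data, the query below groups games by `park_id` and takes the earliest and latest date. The rows are invented for illustration, and the three-column `game_log` is an assumed, simplified stand-in for the real table. Because dates are stored as `YYYYMMDD` integers, `MIN` and `MAX` sort them chronologically.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Simplified, hypothetical game_log with made-up game IDs.
    CREATE TABLE game_log (game_id TEXT, date INTEGER, park_id TEXT);
    INSERT INTO game_log VALUES
        ('a', 18800911, 'ALB01'),
        ('b', 18820530, 'ALB01'),
        ('c', 18840430, 'ALT01');
""")

# First and last game per park, derived rather than stored.
park_spans = conn.execute("""
    SELECT park_id, MIN(date) AS first_game, MAX(date) AS last_game
    FROM game_log
    GROUP BY park_id
    ORDER BY park_id
""").fetchall()

for row in park_spans:
    print(row)
```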

* Looking at the various files, look for opportunities to normalize the data and record your observations in a markdown cell.

In [18]:
person_codes.to_sql('person_codes', con); team_codes.to_sql('team_codes', con)

In [19]:
run_query('SELECT * FROM person_codes LIMIT 5')

Unnamed: 0,index,id,last,first,player_debut,mgr_debut,coach_debut,ump_debut
0,0,aardd001,Aardsma,David,04/06/2004,,,
1,1,aaroh101,Aaron,Hank,04/13/1954,,,
2,2,aarot101,Aaron,Tommie,04/10/1962,,04/06/1979,
3,3,aased001,Aase,Don,07/26/1977,,,
4,4,abada001,Abad,Andy,09/10/2001,,,


In [20]:
run_query('SELECT * FROM team_codes LIMIT 5')

Unnamed: 0,index,team_id,league,start,end,city,nickname,franch_id,seq
0,0,ALT,UA,1884,1884,Altoona,Mountain Cities,ALT,1
1,1,ARI,NL,1998,0,Arizona,Diamondbacks,ARI,1
2,2,BFN,NL,1879,1885,Buffalo,Bisons,BFN,1
3,3,BFP,PL,1890,1890,Buffalo,Bisons,BFP,1
4,4,BL1,,1872,1874,Baltimore,Canaries,BL1,1


In [249]:
print(list(game_log.columns))

['date', 'number_of_game', 'day_of_week', 'v_name', 'v_league', 'v_game_number', 'h_name', 'h_league', 'h_game_number', 'v_score', 'h_score', 'length_outs', 'day_night', 'completion', 'forefeit', 'protest', 'park_id', 'attendance', 'length_minutes', 'v_line_score', 'h_line_score', 'v_at_bats', 'v_hits', 'v_doubles', 'v_triples', 'v_homeruns', 'v_rbi', 'v_sacrifice_hits', 'v_sacrifice_flies', 'v_hit_by_pitch', 'v_walks', 'v_intentional_walks', 'v_strikeouts', 'v_stolen_bases', 'v_caught_stealing', 'v_grounded_into_double', 'v_first_catcher_interference', 'v_left_on_base', 'v_pitchers_used', 'v_individual_earned_runs', 'v_team_earned_runs', 'v_wild_pitches', 'v_balks', 'v_putouts', 'v_assists', 'v_errors', 'v_passed_balls', 'v_double_plays', 'v_triple_plays', 'h_at_bats', 'h_hits', 'h_doubles', 'h_triples', 'h_homeruns', 'h_rbi', 'h_sacrifice_hits', 'h_sacrifice_flies', 'h_hit_by_pitch', 'h_walks', 'h_intentional_walks', 'h_strikeouts', 'h_stolen_bases', 'h_caught_stealing', 'h_grounded_

In [22]:
'v_player_2_id' in game_log.columns

True

In [23]:
team_codes.head()

Unnamed: 0,team_id,league,start,end,city,nickname,franch_id,seq
0,ALT,UA,1884,1884,Altoona,Mountain Cities,ALT,1
1,ARI,NL,1998,0,Arizona,Diamondbacks,ARI,1
2,BFN,NL,1879,1885,Buffalo,Bisons,BFN,1
3,BFP,PL,1890,1890,Buffalo,Bisons,BFP,1
4,BL1,,1872,1874,Baltimore,Canaries,BL1,1


## Planning a Normalized Schema

Now that we've started to think about normalization ideas, it's time to start planning our schema. The best way to work is visually, with a schema diagram just like the ones we've used so far in this course. Start by creating a diagram of the four existing tables and their columns, and then gradually create new tables that move the data into a more normalized state.<br>

Some people like to do this on paper, while others use diagramming tools like Sketch or Figma, or image editors like Photoshop. We recommend using a schema design tool like [DbDesigner.net](https://dbdesigner.net/). This free tool allows you to create a schema and draws lines that show foreign key relations clearly.

![dbdesigner](https://s3.amazonaws.com/dq-content/193/dbdesigner-screenshot.png)

In the end, you should choose the tool that you feel like you will be able to work quickly in as you plan out your schema.

Here are some tips when planning out your schema:

* **Don't be afraid to experiment**. It's unlikely that your first few steps will be there in your finished product - try things and see how they look.
* If you're using a tool like DbDesigner, which automatically shows lines for foreign key relationships, don't worry if your lines look messy. This is normal. You can move the tables around to neaten things up at the end, but don't waste time on it while you are still normalizing.
* The following facts about the data may help you with your normalization decisions:
  * **Historically, teams sometimes move between leagues.**
  * **The same person might be in a single game as both a player and a manager.**
  * **Because of how pitchers are represented in the game log, not all pitchers used in a game will be shown. We only want to worry about the pitchers mentioned via position or the 'winning pitcher'/ 'losing pitcher'.**
* It is possible to over-normalize. We want to **finish with about 7-8 tables total**.

Lastly, we advise spending between 60 and 90 minutes planning your schema. In the next step, we will introduce the suggested schema that we will work with for the rest of the project, but working as much of it out yourself as you can is highly recommended.

* Using whichever design tool you feel most comfortable with, plan a schema for our baseball database.
* When you are happy with your schema, [insert a screenshot or photo into a markdown cell](https://daringfireball.net/projects/markdown/syntax#img).

## Notes: Ideation on Database Normalization

### From the `game_log` table

* `manager` table
  * `manager_id`, `name`
  * PRIMARY KEY = `manager_id`

* `player` table
  * `player_id`, `name`, `def_pos`, `off_pos`
  * PRIMARY KEY = `player_id`

* `umpire` table
  * `umpire_id`, `name`
  * PRIMARY KEY = `umpire_id`
* `match_meta` table >>> ORIGIN
  * `game_id`, `date`, `h_name`, `v_name`, `h_line_score`, `v_line_score`, `length_outs`, `day_night`, `completion`, `forefeit`, `protest`, `park_id`, `attendance`, `hp_umpire_id`, `1b_umpire_id`, `2b_umpire_id`, `3b_umpire_id`, `lf_umpire_id`, `rf_umpire_id`
  * PRIMARY KEY = `game_id`

* `match_lineup` table
  * `game_id`
  * `v_manager_id`, `h_manager_id`
  * `winning_pitcher_id`, `losing_pitcher_id`, `saving_pitcher_id`
  * `winning_rbi_batter_id`
  * `v_starting_pitcher_id`, `h_starting_pitcher_id`
  * `v_player_1_id`, `v_player_1_def_pos`, `v_player_2_id`, `v_player_2_def_pos`, `v_player_3_id`, `v_player_3_def_pos`, `v_player_4_id`, `v_player_4_def_pos`, `v_player_5_id`, `v_player_5_def_pos`, `v_player_6_id`, `v_player_6_def_pos`, `v_player_7_id`, `v_player_7_def_pos`, `v_player_8_id`,`v_player_8_def_pos`, `v_player_9_id`, `v_player_9_def_pos`
  * `h_player_1_id`, `h_player_1_def_pos`, `h_player_2_id`, `h_player_2_def_pos`, `h_player_3_id`, `h_player_3_def_pos`, `h_player_4_id`, `h_player_4_def_pos`, `h_player_5_id`, `h_player_5_def_pos`, `h_player_6_id`, `h_player_6_def_pos`, `h_player_7_id`, `h_player_7_def_pos`, `h_player_8_id`,`h_player_8_def_pos`, `h_player_9_id`, `h_player_9_def_pos`
  * PRIMARY KEY = `game_id`

* `match_def_stat` table
  * `game_id`, `h_name`, `v_name`
  * `h_putouts`, `h_assists`, `h_passed_balls`, `h_double_plays`, `h_triple_plays`
  * `v_putouts`, `v_assists`, `v_passed_balls`, `v_double_plays`, `v_triple_plays`
  * PRIMARY KEY = `game_id`, `h_name`, `v_name`

* `match_off_stat` table
  * `game_id`, `h_name`, `v_name`
  * `h_at_bats`, `h_hits`, `h_doubles`, `h_triples`, `h_homeruns`, `h_rbi`, `h_sacrifice_flies`, `h_hit_by_pitch`, `h_walks`, `h_intentional_walks`, `h_strikeouts`, `h_stolen_bases`, `h_caught_stealing`, `h_grounded_into_double`, `h_left_on_base`
  * `v_at_bats`, `v_hits`, `v_doubles`, `v_triples`, `v_homeruns`, `v_rbi`, `v_sacrifice_flies`, `v_hit_by_pitch`, `v_walks`, `v_intentional_walks`, `v_strikeouts`, `v_stolen_bases`, `v_caught_stealing`, `v_grounded_into_double`, `v_left_on_base`
  * PRIMARY KEY = `game_id`, `h_name`, `v_name`

* `match_pitch_stat` table
  * `game_id`, `h_name`, `v_name`
  * `h_pitchers_used`, `h_individual_earned_runs`, `h_team_earned_runs`, `h_wild_pitches`, `h_balks`
  * `v_pitchers_used`, `v_individual_earned_runs`, `v_team_earned_runs`, `v_wild_pitches`, `v_balks`
  * PRIMARY KEY = `game_id`, `h_name`, `v_name`

## Creating Tables Without Foreign Key Relations

So that we can work through the rest of the steps together, we will provide a schema for the rest of this guided project. As we work through each table, we'll explain some of the decisions made when normalizing and creating the schema. Below is the schema we will use:

![mlb-schema](https://s3.amazonaws.com/dq-content/193/mlb_schema.svg)

As we work through creating the tables in this schema, **we'll talk about why we made particular choices during the normalization process**. We'll start by creating the tables that don't contain any foreign key relations. It's important to start with these tables, as **other tables will have relations to these tables, and so these tables will need to exist first**.<br>

The tables we will create are below, with some notes on the normalization choices made:

* `person`
  * All of the 'debut' columns have been omitted, as this data can be found in other tables.
  * Since the game log file has no data on coaches, we made the decision not to include this data.
* `park`
  * The start, end, and league columns contain data that is found in the main game log and can be removed.
* `league`
  * Because some of the older leagues are not well known, we will create a table to store league names.
* `appearance_type`
  * Our appearance table will include data on players with positions, umpires, managers, and awards (like winning pitcher). This table will store information on what different types of appearances are available.

We'll first create each table, and then we'll insert the data. Previously, we learned to use the `INSERT` statement with the `VALUES` clause to manually specify values to be inserted. For the `person` and `park` tables, we have `person_codes` and `park_codes`, which contain the data we'll need. To use this data, we'll `INSERT` with `SELECT`:

```sql
INSERT INTO table_one
SELECT * FROM table_two;
```

Note that you will need to adjust the `SELECT` statement to list the columns you need in the order the new table expects, since the original table and the new table have different numbers of columns in different orders.<br>

Similar to `IF NOT EXISTS`, we can use [INSERT OR IGNORE as specified in the SQLite documentation](https://www.sqlite.org/lang_insert.html) to prevent our code from failing if we run it a second time in our notebook.<br>

For the `league` table you will need to manually specify the values, and for `appearance_type` we have provided an `appearance_type.csv` file that you can import, which contains all the values you need for this table.
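
The sketch below puts the pieces above together on a toy in-memory database: an `INSERT ... SELECT` that names its columns in the order the target table expects, combined with `IF NOT EXISTS` and `INSERT OR IGNORE` so the statements can be run twice without failing or duplicating rows. The single `person_codes` row and the trimmed column list are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Toy source table; note that `last` comes before `first` here.
    CREATE TABLE person_codes (id TEXT, last TEXT, first TEXT, player_debut TEXT);
    INSERT INTO person_codes VALUES ('aardd001', 'Aardsma', 'David', '04/06/2004');

    CREATE TABLE IF NOT EXISTS person (
        person_id TEXT PRIMARY KEY,
        first_name TEXT,
        last_name TEXT
    );
""")

# Columns are selected in the target table's order, not the source's.
insert_sql = """
    INSERT OR IGNORE INTO person (person_id, first_name, last_name)
    SELECT id, first, last FROM person_codes
"""
conn.execute(insert_sql)
conn.execute(insert_sql)  # re-running is a no-op: the duplicate key is ignored

rows = conn.execute("SELECT * FROM person").fetchall()
print(rows)
```

Without `OR IGNORE`, the second run would raise an error because `person_id` is a primary key and the row already exists.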

* Create the `person` table with columns and primary key as shown in the schema diagram.
  * Select the appropriate type based on the data.
  * Insert the data from the `person_codes` table.
  * Write a query to display the first few rows of the table.


In [24]:
person_codes.head()

Unnamed: 0,id,last,first,player_debut,mgr_debut,coach_debut,ump_debut
0,aardd001,Aardsma,David,04/06/2004,,,
1,aaroh101,Aaron,Hank,04/13/1954,,,
2,aarot101,Aaron,Tommie,04/10/1962,,04/06/1979,
3,aased001,Aase,Don,07/26/1977,,,
4,abada001,Abad,Andy,09/10/2001,,,


In [34]:
create_person_table = '''
        CREATE TABLE person (
            person_id TEXT,
            first_name TEXT,
            last_name TEXT,
            PRIMARY KEY(person_id)
        )
'''

run_command(create_person_table)

In [35]:
insert_data_to_person = '''
        INSERT OR IGNORE INTO person
        SELECT id, first, last FROM person_codes;
'''

run_command(insert_data_to_person)

In [36]:
run_query('SELECT * FROM person LIMIT 5')

Unnamed: 0,person_id,first_name,last_name
0,aardd001,David,Aardsma
1,aaroh101,Hank,Aaron
2,aarot101,Tommie,Aaron
3,aased001,Don,Aase
4,abada001,Andy,Abad


* Create the `park` table with columns and primary key as shown in the schema diagram.
  * Select the appropriate type based on the data.
  * Insert the data from the `park_codes` table.
  * Write a query to display the first few rows of the table.

In [29]:
park_codes.head()

Unnamed: 0,park_id,name,aka,city,state,start,end,league,notes
0,ALB01,Riverside Park,,Albany,NY,09/11/1880,05/30/1882,NL,TRN:9/11/80;6/15&9/10/1881;5/16-5/18&5/30/1882
1,ALT01,Columbia Park,,Altoona,PA,04/30/1884,05/31/1884,UA,
2,ANA01,Angel Stadium of Anaheim,Edison Field; Anaheim Stadium,Anaheim,CA,04/19/1966,,AL,
3,ARL01,Arlington Stadium,,Arlington,TX,04/21/1972,10/03/1993,AL,
4,ARL02,Rangers Ballpark in Arlington,The Ballpark in Arlington; Ameriquest Fl,Arlington,TX,04/11/1994,,AL,


In [37]:
create_table_park = '''
        CREATE TABLE park (
                park_id TEXT,
                name TEXT,
                nickname TEXT,
                city TEXT,
                state TEXT,
                notes VARCHAR,
                PRIMARY KEY(park_id)
        );
'''
run_command(create_table_park)

In [38]:
insert_data_park = '''
        INSERT OR IGNORE INTO park
        SELECT park_id, name, aka, city, state, notes
            FROM park_codes;
'''
run_command(insert_data_park)

In [39]:
run_query('SELECT * FROM park LIMIT 5')

Unnamed: 0,park_id,name,nickname,city,state,notes
0,ALB01,Riverside Park,,Albany,NY,TRN:9/11/80;6/15&9/10/1881;5/16-5/18&5/30/1882
1,ALT01,Columbia Park,,Altoona,PA,
2,ANA01,Angel Stadium of Anaheim,Edison Field; Anaheim Stadium,Anaheim,CA,
3,ARL01,Arlington Stadium,,Arlington,TX,
4,ARL02,Rangers Ballpark in Arlington,The Ballpark in Arlington; Ameriquest Fl,Arlington,TX,


* Create & insert data into table : `league`
<br><br>
* Create the league table with columns and primary key as shown in the schema diagram.
  * Select the appropriate type based on the data.
  * Insert the data manually based on your research on the names of the six league IDs.
  * Write a query to display the table.

In [43]:
print(game_log.v_league.unique())
print(game_log.h_league.unique())

[nan 'NL' 'AA' 'UA' 'PL' 'AL' 'FL']
[nan 'NL' 'AA' 'UA' 'PL' 'AL' 'FL']


In [44]:
create_table_league = '''
        CREATE TABLE league (
                league_id TEXT,
                name TEXT,
                PRIMARY KEY(league_id)
        );
'''
run_command(create_table_league)

research source:
* https://en.wikipedia.org/wiki/List_of_organized_baseball_leagues

In [45]:
insert_data_league = '''
        INSERT OR IGNORE INTO league
        (league_id, name)
        VALUES
        ('NL', 'National League'),
        ('AL', 'American League'),
        ('AA', 'American Association'),
        ('UA', 'Union Association'),
        ('PL', 'Players'' League'),
        ('FL', 'Federal League');
'''
run_command(insert_data_league)

In [47]:
run_query('SELECT * FROM league')

Unnamed: 0,league_id,name
0,NL,National League
1,AL,American League
2,AA,American Association
3,UA,Union Association
4,PL,Players' League
5,FL,Federal League


* Create the `appearance_type` table with columns and primary key as shown in the schema diagram.
  * Select the appropriate type based on the data.
  * Import and insert the data from `appearance_type.csv`.
  * Write a query to display the table.

In [54]:
app_type = pd.read_csv('appearance_type.csv')

print(app_type.shape)
app_type.head()

(31, 3)


Unnamed: 0,appearance_type_id,name,category
0,O1,Batter 1,offense
1,O2,Batter 2,offense
2,O3,Batter 3,offense
3,O4,Batter 4,offense
4,O5,Batter 5,offense


In [153]:
run_command('DROP TABLE IF EXISTS appearance_type;')

In [154]:
# import csv dataframe to sql database as table

app_type.to_sql('appearance_type_noprime', con)
con.commit()

# check
run_query('SELECT * FROM appearance_type_noprime LIMIT 5')

Unnamed: 0,index,appearance_type_id,name,category
0,0,O1,Batter 1,offense
1,1,O2,Batter 2,offense
2,2,O3,Batter 3,offense
3,3,O4,Batter 4,offense
4,4,O5,Batter 5,offense


In [155]:
create_table_appearance_type = '''
        CREATE TABLE appearance_type (
                appearance_type_id TEXT,
                name TEXT,
                category TEXT,
                PRIMARY KEY (appearance_type_id)
        );
'''

run_command(create_table_appearance_type)

In [157]:
insert_data_appearance_type = '''
        INSERT INTO appearance_type
        SELECT
            appearance_type_id, name, category
        FROM appearance_type_noprime;
'''
run_command(insert_data_appearance_type)

In [159]:
run_query('SELECT * FROM appearance_type LIMIT 5')

Unnamed: 0,appearance_type_id,name,category
0,O1,Batter 1,offense
1,O2,Batter 2,offense
2,O3,Batter 3,offense
3,O4,Batter 4,offense
4,O5,Batter 5,offense


In [160]:
run_command('DROP TABLE appearance_type_noprime')

## Adding The Team and Game Tables

Now that we have added all of the tables that don't have foreign key relationships, let's add the next two tables. The `game` and `team` tables need to exist before our two appearance tables are created. Here is the schema of these tables, and of the two tables they have foreign key relations to:

![adding-the-team-and-game-tables](https://s3.amazonaws.com/dq-content/193/mlb_schema_2.svg)

Here are some notes on the normalization choices made with each of these tables:
* `team`
  * The start, end, and sequence columns can be derived from the game level data.
* `game`
  * We have chosen to include all columns from the game log that don't refer to one specific team or player, instead putting those in two appearance tables.
  * We have removed the column with the day of the week, as this can be derived from the date.
  * We have changed the `day_night` column to `day`, with the intention of making this a boolean column. Even though SQLite doesn't support the `BOOLEAN` type, we can use this when creating our table and SQLite will manage the underlying types behind the scenes (for more on how this works [refer to the SQLite documentation](https://www.sqlite.org/datatype3.html)). This means that anyone querying the schema of our database in the future understands how that column is intended to be used.
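
Two of the choices above can be illustrated with a short sketch on a toy in-memory table: a column declared as `BOOLEAN` is accepted by SQLite and stores `1` as an integer, and the day of the week can be recomputed from the `YYYYMMDD` date. The single row is copied from the first game in the log, while the reduced `game` table and the `WEEKDAYS` list are assumptions for illustration.

```python
import sqlite3
from datetime import datetime

conn = sqlite3.connect(":memory:")
# Reduced, hypothetical version of the game table.
conn.execute(
    "CREATE TABLE game (game_id TEXT PRIMARY KEY, date INTEGER, day BOOLEAN)"
)
conn.execute("INSERT INTO game VALUES ('FW1187105040', 18710504, 1)")

# typeof() reports how SQLite actually stored the value.
date_value, day_value, storage_class = conn.execute(
    "SELECT date, day, typeof(day) FROM game"
).fetchone()

# Derive the weekday from the date instead of storing it.
WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
parsed = datetime.strptime(str(date_value), "%Y%m%d")
weekday = WEEKDAYS[parsed.weekday()]

print(storage_class, day_value, weekday)
```

The derived weekday matches the `day_of_week` value recorded for that game in the original log.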

In an earlier mission, we discussed in passing that by default, **SQLite doesn't enforce foreign key relationships**. To ensure the integrity of our data, we need to enable enforcement each time we create a sqlite3 connection object in Python.<br>

If you are re-using the `run_command()` function from the earlier guided project, you can **add a single line to enable enforcement of foreign key constraints**:

```python
def run_command(c):
    with sqlite3.connect(DB) as conn:
        conn.execute('PRAGMA foreign_keys = ON;')
        conn.isolation_level = None
        conn.execute(c)
```
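
To see what enforcement changes in practice, here is a minimal sketch on a toy in-memory database. With the pragma enabled, inserting a team whose league does not exist raises `sqlite3.IntegrityError`, while a `NULL` league is allowed because `NULL` foreign key values are not checked. The two-column tables and the team ID `'XXX'` are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON;")  # must be set per connection

conn.execute("CREATE TABLE league (league_id TEXT PRIMARY KEY)")
conn.execute("""
    CREATE TABLE team (
        team_id TEXT PRIMARY KEY,
        league_id TEXT,
        FOREIGN KEY (league_id) REFERENCES league(league_id)
    )
""")

conn.execute("INSERT INTO league VALUES ('NL')")
conn.execute("INSERT INTO team VALUES ('ARI', 'NL')")   # parent row exists
conn.execute("INSERT INTO team VALUES ('BL1', NULL)")   # NULL is not checked

try:
    conn.execute("INSERT INTO team VALUES ('XXX', 'ZZ')")  # no such league
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

team_count = conn.execute("SELECT COUNT(*) FROM team").fetchone()[0]
print(rejected, team_count)
```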

Let's create the team and game tables.

In [60]:
# update run_command func

def run_command(q):
    with sqlite3.connect('mlb.db') as conn:
        conn.execute('PRAGMA foreign_keys = ON;')
        conn.isolation_level = None
        conn.execute(q)
        conn.commit()

* Create the `team` table with columns, primary key, and foreign key as shown in the schema diagram.
  * Select the appropriate type based on the data.
  * Insert the data from the `team_codes` table.
  * Write a query to display the first few rows of the table.


In [57]:
team_codes.head()

Unnamed: 0,team_id,league,start,end,city,nickname,franch_id,seq
0,ALT,UA,1884,1884,Altoona,Mountain Cities,ALT,1
1,ARI,NL,1998,0,Arizona,Diamondbacks,ARI,1
2,BFN,NL,1879,1885,Buffalo,Bisons,BFN,1
3,BFP,PL,1890,1890,Buffalo,Bisons,BFP,1
4,BL1,,1872,1874,Baltimore,Canaries,BL1,1


In [68]:
q = '''
        DROP TABLE IF EXISTS team;
'''
run_command(q)

In [69]:
create_table_team = '''
        CREATE TABLE team (
                team_id TEXT,
                league_id TEXT,
                city TEXT,
                nickname TEXT,
                franch_id TEXT,
                PRIMARY KEY(team_id),
                FOREIGN KEY(league_id) REFERENCES league(league_id)
        );
'''
run_command(create_table_team)

In [72]:
insert_data_team = '''
        INSERT OR IGNORE INTO team
        SELECT
            team_id, league, city, nickname, franch_id
        FROM team_codes;
'''
run_command(insert_data_team)

In [73]:
run_query('SELECT * FROM team LIMIT 5')

Unnamed: 0,team_id,league_id,city,nickname,franch_id
0,ALT,UA,Altoona,Mountain Cities,ALT
1,ARI,NL,Arizona,Diamondbacks,ARI
2,BFN,NL,Buffalo,Bisons,BFN
3,BFP,PL,Buffalo,Bisons,BFP
4,BL1,,Baltimore,Canaries,BL1


* Create the `game` table with columns, primary key, and foreign key as shown in the schema diagram.
  * Select the appropriate type based on the data.
  * Insert the data from the `game_log` table.
  * Write a query to display the first few rows of the table.

In [83]:
run_query('SELECT * FROM game_log LIMIT 5')

Unnamed: 0,index,date,number_of_game,day_of_week,v_name,v_league,v_game_number,h_name,h_league,h_game_number,v_score,h_score,length_outs,day_night,completion,forefeit,protest,park_id,attendance,length_minutes,v_line_score,h_line_score,v_at_bats,v_hits,v_doubles,v_triples,v_homeruns,v_rbi,v_sacrifice_hits,v_sacrifice_flies,v_hit_by_pitch,v_walks,v_intentional_walks,v_strikeouts,v_stolen_bases,v_caught_stealing,v_grounded_into_double,v_first_catcher_interference,v_left_on_base,v_pitchers_used,v_individual_earned_runs,v_team_earned_runs,v_wild_pitches,v_balks,v_putouts,v_assists,v_errors,v_passed_balls,v_double_plays,v_triple_plays,h_at_bats,h_hits,h_doubles,h_triples,h_homeruns,h_rbi,h_sacrifice_hits,h_sacrifice_flies,h_hit_by_pitch,h_walks,h_intentional_walks,h_strikeouts,h_stolen_bases,h_caught_stealing,h_grounded_into_double,h_first_catcher_interference,h_left_on_base,h_pitchers_used,h_individual_earned_runs,h_team_earned_runs,h_wild_pitches,h_balks,h_putouts,h_assists,h_errors,h_passed_balls,h_double_plays,h_triple_plays,hp_umpire_id,hp_umpire_name,1b_umpire_id,1b_umpire_name,2b_umpire_id,2b_umpire_name,3b_umpire_id,3b_umpire_name,lf_umpire_id,lf_umpire_name,rf_umpire_id,rf_umpire_name,v_manager_id,v_manager_name,h_manager_id,h_manager_name,winning_pitcher_id,winning_pitcher_name,losing_pitcher_id,losing_pitcher_name,saving_pitcher_id,saving_pitcher_name,winning_rbi_batter_id,winning_rbi_batter_id_name,v_starting_pitcher_id,v_starting_pitcher_name,h_starting_pitcher_id,h_starting_pitcher_name,v_player_1_id,v_player_1_name,v_player_1_def_pos,v_player_2_id,v_player_2_name,v_player_2_def_pos,v_player_3_id,v_player_3_name,v_player_3_def_pos,v_player_4_id,v_player_4_name,v_player_4_def_pos,v_player_5_id,v_player_5_name,v_player_5_def_pos,v_player_6_id,v_player_6_name,v_player_6_def_pos,v_player_7_id,v_player_7_name,v_player_7_def_pos,v_player_8_id,v_player_8_name,v_player_8_def_pos,v_player_9_id,v_player_9_name,v_player_9_def_pos,h_player_1_id,h_player_1_name,h_player_
1_def_pos,h_player_2_id,h_player_2_name,h_player_2_def_pos,h_player_3_id,h_player_3_name,h_player_3_def_pos,h_player_4_id,h_player_4_name,h_player_4_def_pos,h_player_5_id,h_player_5_name,h_player_5_def_pos,h_player_6_id,h_player_6_name,h_player_6_def_pos,h_player_7_id,h_player_7_name,h_player_7_def_pos,h_player_8_id,h_player_8_name,h_player_8_def_pos,h_player_9_id,h_player_9_name,h_player_9_def_pos,additional_info,acquisition_info,game_id
0,0,18710504,0,Thu,CL1,,1,FW1,,1,0,2,54.0,D,,,,FOR01,200.0,120.0,0,10010000,30.0,4.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,,6.0,1.0,,-1.0,,4.0,1.0,1.0,1.0,0.0,0.0,27.0,9.0,0.0,3.0,0.0,0.0,31.0,4.0,1.0,0.0,0.0,2.0,0.0,0.0,0.0,1.0,,0.0,0.0,,-1.0,,3.0,1.0,0.0,0.0,0.0,0.0,27.0,3.0,3.0,1.0,1.0,0.0,boakj901,John Boake,,,,,,,,,,,paboc101,Charlie Pabor,lennb101,Bill Lennon,mathb101,Bobby Mathews,prata101,Al Pratt,,,,,prata101,Al Pratt,mathb101,Bobby Mathews,whitd102,Deacon White,2.0,kimbg101,Gene Kimball,4.0,paboc101,Charlie Pabor,7.0,allia101,Art Allison,8.0,white104,Elmer White,9.0,prata101,Al Pratt,1.0,sutte101,Ezra Sutton,5.0,carlj102,Jim Carleton,3.0,bassj101,John Bass,6.0,selmf101,Frank Sellman,5.0,mathb101,Bobby Mathews,1.0,foraj101,Jim Foran,3.0,goldw101,Wally Goldsmith,6.0,lennb101,Bill Lennon,2.0,caret101,Tom Carey,4.0,mince101,Ed Mincher,7.0,mcdej101,James McDermott,8.0,kellb105,Bill Kelly,9.0,,Y,FW1187105040
1,1,18710505,0,Fri,BS1,,1,WS3,,1,20,18,54.0,D,,,,WAS01,5000.0,145.0,107000435,640113030,41.0,13.0,1.0,2.0,0.0,13.0,0.0,0.0,0.0,18.0,,5.0,3.0,,-1.0,,12.0,1.0,6.0,6.0,1.0,0.0,27.0,13.0,10.0,1.0,2.0,0.0,49.0,14.0,2.0,0.0,0.0,11.0,0.0,0.0,0.0,10.0,,2.0,1.0,,-1.0,,14.0,1.0,7.0,7.0,0.0,0.0,27.0,20.0,10.0,2.0,3.0,0.0,dobsh901,Henry Dobson,,,,,,,,,,,wrigh101,Harry Wright,younn801,Nick Young,spala101,Al Spalding,braia102,Asa Brainard,,,,,spala101,Al Spalding,braia102,Asa Brainard,wrigg101,George Wright,6.0,barnr102,Ross Barnes,4.0,birdd102,Dave Birdsall,9.0,mcvec101,Cal McVey,2.0,wrigh101,Harry Wright,8.0,goulc101,Charlie Gould,3.0,schah101,Harry Schafer,5.0,conef101,Fred Cone,7.0,spala101,Al Spalding,1.0,watef102,Fred Waterman,5.0,forcd101,Davy Force,6.0,mille105,Everett Mills,3.0,allid101,Doug Allison,2.0,hallg101,George Hall,7.0,leona101,Andy Leonard,4.0,braia102,Asa Brainard,1.0,burrh101,Henry Burroughs,9.0,berth101,Henry Berthrong,8.0,HTBF,Y,WS3187105050
2,2,18710506,0,Sat,CL1,,2,RC1,,1,12,4,54.0,D,,,,RCK01,1000.0,140.0,610020003,10020100,49.0,11.0,1.0,1.0,0.0,8.0,0.0,0.0,0.0,0.0,,1.0,0.0,,-1.0,,10.0,1.0,0.0,0.0,2.0,0.0,27.0,12.0,8.0,5.0,0.0,0.0,36.0,7.0,2.0,1.0,0.0,2.0,0.0,0.0,0.0,0.0,,3.0,5.0,,-1.0,,5.0,1.0,3.0,3.0,1.0,0.0,27.0,12.0,13.0,3.0,0.0,0.0,mawnj901,J.H. Manny,,,,,,,,,,,paboc101,Charlie Pabor,hasts101,Scott Hastings,prata101,Al Pratt,fishc102,Cherokee Fisher,,,,,prata101,Al Pratt,fishc102,Cherokee Fisher,whitd102,Deacon White,2.0,kimbg101,Gene Kimball,4.0,paboc101,Charlie Pabor,7.0,allia101,Art Allison,8.0,white104,Elmer White,9.0,prata101,Al Pratt,1.0,sutte101,Ezra Sutton,5.0,carlj102,Jim Carleton,3.0,bassj101,John Bass,6.0,mackd101,Denny Mack,3.0,addyb101,Bob Addy,4.0,fishc102,Cherokee Fisher,1.0,hasts101,Scott Hastings,8.0,ham-r101,Ralph Ham,5.0,ansoc101,Cap Anson,2.0,sagep101,Pony Sager,6.0,birdg101,George Bird,7.0,stirg101,Gat Stires,9.0,,Y,RC1187105060
3,3,18710508,0,Mon,CL1,,3,CH1,,1,12,14,54.0,D,,,,CHI01,5000.0,150.0,101403111,77000000,46.0,15.0,2.0,1.0,2.0,10.0,0.0,0.0,0.0,0.0,,1.0,0.0,,-1.0,,7.0,1.0,6.0,6.0,0.0,0.0,27.0,15.0,11.0,6.0,0.0,0.0,43.0,11.0,2.0,0.0,0.0,8.0,0.0,0.0,0.0,4.0,,2.0,1.0,,-1.0,,6.0,1.0,4.0,4.0,0.0,0.0,27.0,14.0,7.0,2.0,0.0,0.0,willg901,Gardner Willard,,,,,,,,,,,paboc101,Charlie Pabor,woodj106,Jimmy Wood,zettg101,George Zettlein,prata101,Al Pratt,,,,,prata101,Al Pratt,zettg101,George Zettlein,whitd102,Deacon White,2.0,kimbg101,Gene Kimball,4.0,paboc101,Charlie Pabor,7.0,allia101,Art Allison,8.0,white104,Elmer White,9.0,prata101,Al Pratt,1.0,sutte101,Ezra Sutton,5.0,carlj102,Jim Carleton,3.0,bassj101,John Bass,6.0,mcatb101,Bub McAtee,3.0,kingm101,Marshall King,8.0,hodec101,Charlie Hodes,2.0,woodj106,Jimmy Wood,4.0,simmj101,Joe Simmons,9.0,folet101,Tom Foley,7.0,duffe101,Ed Duffy,6.0,pinke101,Ed Pinkham,5.0,zettg101,George Zettlein,1.0,,Y,CH1187105080
4,4,18710509,0,Tue,BS1,,2,TRO,,1,9,5,54.0,D,,,,TRO01,3250.0,145.0,2232,101003000,46.0,17.0,4.0,1.0,0.0,6.0,0.0,0.0,0.0,2.0,,0.0,1.0,,-1.0,,12.0,1.0,2.0,2.0,0.0,0.0,27.0,12.0,5.0,0.0,1.0,0.0,36.0,9.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,3.0,,0.0,2.0,,-1.0,,7.0,1.0,3.0,3.0,1.0,0.0,27.0,11.0,7.0,3.0,0.0,0.0,leroi901,Isaac Leroy,,,,,,,,,,,wrigh101,Harry Wright,pikel101,Lip Pike,spala101,Al Spalding,mcmuj101,John McMullin,,,,,spala101,Al Spalding,mcmuj101,John McMullin,wrigg101,George Wright,6.0,barnr102,Ross Barnes,4.0,birdd102,Dave Birdsall,9.0,mcvec101,Cal McVey,2.0,wrigh101,Harry Wright,8.0,goulc101,Charlie Gould,3.0,schah101,Harry Schafer,5.0,conef101,Fred Cone,7.0,spala101,Al Spalding,1.0,flync101,Clipper Flynn,9.0,mcgem101,Mike McGeary,2.0,yorkt101,Tom York,8.0,mcmuj101,John McMullin,1.0,kings101,Steve King,7.0,beave101,Edward Beavens,4.0,bells101,Steve Bellan,5.0,pikel101,Lip Pike,3.0,cravb101,Bill Craver,6.0,HTBF,Y,TRO187105090


In [86]:
# data type & length check for some columns.

print('date - datatype? :',game_log.date.dtype, '\n')

for cols in ['completion', 'forefeit', 'protest',
            'attendance', 'length_minutes',
            'additional_info', 'acquisition_info']:
    print(cols)
    print('max length:', max([len(str(content)) for content in game_log[cols]]))
    print(game_log[cols].unique()[:5])
    print('')

date - datatype? : int64 

completion
max length: 23
[nan '19200904,,0,6,36' '19210630,,3,2,45' '19340731,,1,5,41'
 '19430913,,4,4,34']

forefeit
max length: 3
[nan 'V' 'H' 'T']

protest
max length: 3
[nan 'Y' 'H' 'V' 'X']

attendance
max length: 7
[  200.  5000.  1000.  3250.  2500.]

length_minutes
max length: 6
[ 120.  145.  140.  150.  105.]

additional_info
max length: 113
[nan 'HTBF' 'umpchange,2,umphome,highd101' 'umpchange,?,umphome,morgb101'
 'umpchange,4,umphome,lougw901']

acquisition_info
max length: 3
['Y' nan]



In [87]:
#run_command('DROP TABLE IF EXISTS game;')

In [88]:
create_table_game = '''
        CREATE TABLE game (
                game_id TEXT,
                date INTEGER,
                number_of_game INTEGER,
                park_id TEXT,
                length_outs FLOAT,
                day NUMERIC,
                completion TEXT,
                forefeit TEXT,
                protest TEXT,
                attendance FLOAT,
                length_minutes FLOAT,
                additional_info VARCHAR,
                acquisition_info TEXT,
                PRIMARY KEY(game_id),
                FOREIGN KEY(park_id) REFERENCES park(park_id)
        );
'''
run_command(create_table_game)

In [89]:
insert_data_game = '''
        INSERT INTO game
        SELECT
            game_id, date, number_of_game,
            park_id, length_outs,
                -- map D/N to 1/0 and keep unknown values as NULL
                CASE
                    WHEN day_night = 'D' THEN 1
                    WHEN day_night = 'N' THEN 0
                    ELSE NULL
                END AS day,
            completion, forefeit, protest,
            attendance, length_minutes,
            additional_info, acquisition_info
        FROM game_log;
'''
run_command(insert_data_game)

In [90]:
run_query('SELECT * FROM game LIMIT 5')

Unnamed: 0,game_id,date,number_of_game,park_id,length_outs,day,completion,forefeit,protest,attendance,length_minutes,additional_info,acquisition_info
0,FW1187105040,18710504,0,FOR01,54.0,1,,,,200.0,120.0,,Y
1,WS3187105050,18710505,0,WAS01,54.0,1,,,,5000.0,145.0,HTBF,Y
2,RC1187105060,18710506,0,RCK01,54.0,1,,,,1000.0,140.0,,Y
3,CH1187105080,18710508,0,CHI01,54.0,1,,,,5000.0,150.0,,Y
4,TRO187105090,18710509,0,TRO01,54.0,1,,,,3250.0,145.0,HTBF,Y


## Adding the Team Appearance Table

At this point, because we have told SQLite to enforce foreign key constraints and have inserted data that obeys these constraints, we'll get an error if we try to drop a table or delete rows that other tables still reference. For example, you might try running `DELETE FROM park WHERE park_id = "FOR01";`. If you get stuck, one option is to run `!rm mlb.db` in its own Jupyter cell to delete the database file, then rerun all your cells to recreate the database file, tables, and data.<br>
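To see why such a delete is rejected, here is a minimal sketch using an in-memory database. The two tables are cut-down stand-ins for `park` and `game` (not the full project schema), and the sketch assumes enforcement was switched on with `PRAGMA foreign_keys = ON;`, which SQLite needs on each new connection.

```python
import sqlite3

# Toy in-memory database; these tables are simplified stand-ins
# for the project's park and game tables.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON;")  # enforcement is off by default
conn.execute("CREATE TABLE park (park_id TEXT PRIMARY KEY);")
conn.execute("""
    CREATE TABLE game (
        game_id TEXT PRIMARY KEY,
        park_id TEXT,
        FOREIGN KEY (park_id) REFERENCES park(park_id)
    );
""")
conn.execute("INSERT INTO park VALUES ('FOR01');")
conn.execute("INSERT INTO game VALUES ('FW1187105040', 'FOR01');")

# Deleting a park that a game still references violates the constraint.
try:
    conn.execute("DELETE FROM park WHERE park_id = 'FOR01';")
    delete_failed = False
except sqlite3.IntegrityError as err:
    delete_failed = True
    print("Delete rejected:", err)

# Removing the referencing row first makes the parent delete legal.
conn.execute("DELETE FROM game WHERE park_id = 'FOR01';")
conn.execute("DELETE FROM park WHERE park_id = 'FOR01';")
parks_left = conn.execute("SELECT COUNT(*) FROM park;").fetchone()[0]
```

Deleting child rows before parent rows is a gentler alternative to removing the whole database file when only one table needs rebuilding.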

Our next task is to add the `team_appearance` table. Here is the schema of the table and the three tables it has foreign key relations to:

![team-appearance-table-relations](https://s3.amazonaws.com/dq-content/193/mlb_schema_3.svg)

The `team_appearance` table has a compound primary key composed of the team name and the game ID. In addition, a boolean column `home` is used to differentiate between the home and the away team. The rest of the columns are scores or statistics that in our original game log are repeated for each of the home and away teams.<br>

In order to insert this data cleanly, we'll need to use a `UNION` clause:

```sql
INSERT INTO team_appearance
    SELECT
        h_name,
        game_id,
        1 AS home,
        h_league,
        h_score,
        h_line_score,
        h_at_bats,
        [...]
    FROM game_log

UNION

    SELECT    
        v_name,
        game_id,
        0 AS home,
        v_league,
        v_score,
        v_line_score,
        v_at_bats,
        [...]
    FROM game_log;
```

In order to save yourself from having to manually type all the column names, you might like to use a query like the following to extract the schema from the `game_log` table, and use that as a starting point for your query:

```sql
SELECT sql FROM sqlite_master
WHERE name = "game_log"
  AND type = "table";
```
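As a rough sketch of how that schema lookup could feed the query, the snippet below reads the stored `CREATE` statement and then pulls the column names with `PRAGMA table_info`. The `game_log` table here is a toy stand-in with only a few columns, and the variable names `h_cols` and `v_cols` are chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Toy stand-in for game_log with only a handful of columns.
conn.execute("""
    CREATE TABLE game_log (
        game_id TEXT, v_name TEXT, v_score INTEGER, v_hits INTEGER,
        h_name TEXT, h_score INTEGER, h_hits INTEGER
    );
""")

# The CREATE statement that SQLite stored for the table.
schema_sql = conn.execute(
    "SELECT sql FROM sqlite_master WHERE name = 'game_log' AND type = 'table';"
).fetchone()[0]

# PRAGMA table_info returns one row per column; index 1 is the column name.
columns = [row[1] for row in conn.execute("PRAGMA table_info(game_log);")]
h_cols = [c for c in columns if c.startswith("h_")]
v_cols = [c for c in columns if c.startswith("v_")]

# Comma-joined lists can be pasted or formatted into each SELECT.
print(", ".join(h_cols))
print(", ".join(v_cols))
```

Filtering by prefix keeps the home and visitor lists in the same column order, which matters because both halves of a `UNION` must line up position by position.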

* Create the `team_appearance` table with columns, primary key, and foreign keys as shown in the schema diagram.
  * Select the appropriate type based on the data.
  * Insert the data from the `game_log` table, using a `UNION` clause to combine the data from the column sets for the home and away teams.
  * Write a query to verify that your data was inserted correctly.

In [93]:
# start index point of visitor team stat
print(list(game_log.columns).index('v_at_bats'))

# end index point of visitor team stat
print(list(game_log.columns).index('v_triple_plays'))

# start index point of home team stat
print(list(game_log.columns).index('h_at_bats'))

# end index point of home team stat
print(list(game_log.columns).index('h_triple_plays'))

21
48
49
76


* Create joined strings to make queries

In [109]:
v_stats = ','.join(list(game_log.columns)[21:49])
h_stats = ','.join(list(game_log.columns)[49:77])

In [110]:
v_stats

'v_at_bats,v_hits,v_doubles,v_triples,v_homeruns,v_rbi,v_sacrifice_hits,v_sacrifice_flies,v_hit_by_pitch,v_walks,v_intentional_walks,v_strikeouts,v_stolen_bases,v_caught_stealing,v_grounded_into_double,v_first_catcher_interference,v_left_on_base,v_pitchers_used,v_individual_earned_runs,v_team_earned_runs,v_wild_pitches,v_balks,v_putouts,v_assists,v_errors,v_passed_balls,v_double_plays,v_triple_plays'

In [111]:
h_stats

'h_at_bats,h_hits,h_doubles,h_triples,h_homeruns,h_rbi,h_sacrifice_hits,h_sacrifice_flies,h_hit_by_pitch,h_walks,h_intentional_walks,h_strikeouts,h_stolen_bases,h_caught_stealing,h_grounded_into_double,h_first_catcher_interference,h_left_on_base,h_pitchers_used,h_individual_earned_runs,h_team_earned_runs,h_wild_pitches,h_balks,h_putouts,h_assists,h_errors,h_passed_balls,h_double_plays,h_triple_plays'

In [116]:
default_stats_to_insert = ','.join([sp[2:]+' FLOAT' for sp in v_stats.split(',')])
default_stats_to_insert

'at_bats FLOAT,hits FLOAT,doubles FLOAT,triples FLOAT,homeruns FLOAT,rbi FLOAT,sacrifice_hits FLOAT,sacrifice_flies FLOAT,hit_by_pitch FLOAT,walks FLOAT,intentional_walks FLOAT,strikeouts FLOAT,stolen_bases FLOAT,caught_stealing FLOAT,grounded_into_double FLOAT,first_catcher_interference FLOAT,left_on_base FLOAT,pitchers_used FLOAT,individual_earned_runs FLOAT,team_earned_runs FLOAT,wild_pitches FLOAT,balks FLOAT,putouts FLOAT,assists FLOAT,errors FLOAT,passed_balls FLOAT,double_plays FLOAT,triple_plays FLOAT'

In [127]:
#run_command('DROP TABLE IF EXISTS team_appearance')

In [129]:
create_table_team_appearance = '''
        CREATE TABLE team_appearance (
                team_id TEXT,
                game_id TEXT,
                home NUMERIC,
                league_id TEXT,
                score INTEGER,
                line_score TEXT,
'''+default_stats_to_insert+''',
                PRIMARY KEY(team_id, game_id),
                FOREIGN KEY(team_id) REFERENCES team(team_id),
                FOREIGN KEY(game_id) REFERENCES game(game_id),
                FOREIGN KEY(league_id) REFERENCES league(league_id)
        );
'''

run_command(create_table_team_appearance)

In [130]:
insert_data_team_appearance = '''
        INSERT INTO team_appearance
            SELECT
                h_name,
                game_id,
                1 AS home,
                h_league,
                h_score,
                h_line_score,
'''+h_stats+'''
                FROM game_log
        UNION
            SELECT
                v_name,
                game_id,
                0 AS home,
                v_league,
                v_score,
                v_line_score,
'''+v_stats+'''
            FROM game_log;
'''

run_command(insert_data_team_appearance)

In [131]:
run_query('SELECT * FROM team_appearance LIMIT 5')

Unnamed: 0,team_id,game_id,home,league_id,score,line_score,at_bats,hits,doubles,triples,homeruns,rbi,sacrifice_hits,sacrifice_flies,hit_by_pitch,walks,intentional_walks,strikeouts,stolen_bases,caught_stealing,grounded_into_double,first_catcher_interference,left_on_base,pitchers_used,individual_earned_runs,team_earned_runs,wild_pitches,balks,putouts,assists,errors,passed_balls,double_plays,triple_plays
0,ALT,ALT188404300,1,UA,2,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,ALT,ALT188405020,1,UA,3,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,ALT,ALT188405030,1,UA,5,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,ALT,ALT188405050,1,UA,2,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4,ALT,ALT188405100,1,UA,9,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [132]:
run_query('SELECT * FROM team_appearance ORDER BY home LIMIT 5')

Unnamed: 0,team_id,game_id,home,league_id,score,line_score,at_bats,hits,doubles,triples,homeruns,rbi,sacrifice_hits,sacrifice_flies,hit_by_pitch,walks,intentional_walks,strikeouts,stolen_bases,caught_stealing,grounded_into_double,first_catcher_interference,left_on_base,pitchers_used,individual_earned_runs,team_earned_runs,wild_pitches,balks,putouts,assists,errors,passed_balls,double_plays,triple_plays
0,ALT,CNU188404170,0,UA,2,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,ALT,CNU188404180,0,UA,2,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,ALT,CNU188404190,0,UA,6,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,ALT,SLU188404240,0,UA,2,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4,ALT,SLU188404260,0,UA,3,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
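Beyond eyeballing a few rows, one possible check is that every game contributes exactly two rows, one home and one visitor. The sketch below runs that check on a toy in-memory copy of the tables; the column set is reduced and the team names and scores are illustrative values only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Reduced toy versions of game_log and team_appearance (illustrative rows).
conn.executescript("""
    CREATE TABLE game_log (
        game_id TEXT, v_name TEXT, v_score INTEGER,
        h_name TEXT, h_score INTEGER
    );
    INSERT INTO game_log VALUES
        ('FW1187105040', 'CL1', 0, 'FW1', 2),
        ('WS3187105050', 'BS1', 20, 'WS3', 18);

    CREATE TABLE team_appearance (
        team_id TEXT, game_id TEXT, home NUMERIC, score INTEGER,
        PRIMARY KEY (team_id, game_id)
    );
    INSERT INTO team_appearance
        SELECT h_name, game_id, 1 AS home, h_score FROM game_log
    UNION
        SELECT v_name, game_id, 0 AS home, v_score FROM game_log;
""")

n_games = conn.execute("SELECT COUNT(*) FROM game_log;").fetchone()[0]
n_rows = conn.execute("SELECT COUNT(*) FROM team_appearance;").fetchone()[0]

# Games that do not have exactly one home row and one visitor row.
bad_games = conn.execute("""
    SELECT game_id
    FROM team_appearance
    GROUP BY game_id
    HAVING COUNT(*) != 2 OR SUM(home) != 1;
""").fetchall()

print(n_games, n_rows, bad_games)
```

An empty `bad_games` list together with `n_rows == 2 * n_games` suggests the `UNION` neither dropped nor duplicated appearances.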


## Adding the Person Appearance Table

The final table we need to create is `person_appearance`. Here is the schema of the table and the four tables it has foreign key relations to:

![person-appearance-table](https://s3.amazonaws.com/dq-content/193/mlb_schema_4.svg)

The `person_appearance` table will be used to store information on appearances in games by managers, players, and umpires as detailed in the `appearance_type` table.<br>

We'll use a similar technique to insert data as we did with the `team_appearance` table, but the queries will be much larger: one `SELECT` for each person column rather than one for each team. For each column, we'll need to work out the `appearance_type_id` by cross-referencing the column with the `appearance_type` table.<br>

We have decided to create an integer primary key for this table, because a compound primary key spanning every column quickly becomes cumbersome when writing queries. In SQLite, if you have an integer primary key and don't specify a value for this column when inserting rows, [SQLite will autoincrement this column for you](https://sqlite.org/autoinc.html).
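A small sketch of that behavior, using a cut-down `person_appearance` table in an in-memory database (only three of the real columns are kept):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Cut-down version of person_appearance for illustration.
conn.execute("""
    CREATE TABLE person_appearance (
        appearance_id INTEGER,
        person_id TEXT,
        appearance_type_id TEXT,
        PRIMARY KEY (appearance_id)
    );
""")

# appearance_id is left out of the column list, so SQLite assigns it.
conn.executemany(
    "INSERT INTO person_appearance (person_id, appearance_type_id) VALUES (?, ?);",
    [("boakj901", "UHP"), ("dobsh901", "UHP"), ("mawnj901", "UHP")],
)

ids = [row[0] for row in conn.execute(
    "SELECT appearance_id FROM person_appearance ORDER BY appearance_id;"
)]
print(ids)
```

This is why the insert queries for this table name their four target columns explicitly and never mention `appearance_id`.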

Below is an excerpt of the query that we'll write to insert the data:

```sql
INSERT INTO person_appearance (
    game_id,
    team_id,
    person_id,
    appearance_type_id
)
    SELECT
        game_id,
        NULL,
        lf_umpire_id,
        "ULF"
    FROM game_log
    WHERE lf_umpire_id IS NOT NULL

UNION

    SELECT
        game_id,
        NULL,
        rf_umpire_id,
        "URF"
    FROM game_log
    WHERE rf_umpire_id IS NOT NULL

UNION

    SELECT
        game_id,
        v_name,
        v_manager_id,
        "MM"
    FROM game_log
    WHERE v_manager_id IS NOT NULL

UNION

    SELECT
        game_id,
        h_name,
        h_manager_id,
        "MM"
    FROM game_log
    WHERE h_manager_id IS NOT NULL

UNION

    SELECT
        game_id,
        CASE
            WHEN h_score > v_score THEN h_name
            ELSE v_name
            END,
        winning_pitcher_id,
        "AWP"
    FROM game_log
    WHERE winning_pitcher_id IS NOT NULL

UNION
    [...]
```

When we get to the offensive and defensive positions for both teams, we essentially are performing `36` permutations: `2` (*home, away*) `* 2` (*offense + defense*) `* 9` (*9 positions*).<br>

To save us from manually copying this out, we can instead use a loop and [Python string formatting](https://pyformat.info/) to generate the queries:

```python
template = """
INSERT INTO person_appearance (
    game_id,
    team_id,
    person_id,
    appearance_type_id
) 
    SELECT
        game_id,
        {hv}_name,
        {hv}_player_{num}_id,
        "O{num}"
    FROM game_log
    WHERE {hv}_player_{num}_id IS NOT NULL

UNION

    SELECT
        game_id,
        {hv}_name,
        {hv}_player_{num}_id,
        "D" || CAST({hv}_player_{num}_def_pos AS INT)
    FROM game_log
    WHERE {hv}_player_{num}_id IS NOT NULL;
"""

# c1 and c2 are query strings defined outside this excerpt.
run_command(c1)
run_command(c2)

for hv in ["h","v"]:
    for num in range(1,10):
        query_vars = {
            "hv": hv,
            "num": num
        }
        # run_command is a helper function that runs
        # a command against our database.
        run_command(template.format(**query_vars))
```

* Create the `person_appearance` table with columns, primary key, and foreign keys as shown in the schema diagram.
  * Select the appropriate type based on the data.
  * Insert the data from the `game_log` table, using `UNION` clauses to combine the data from the columns for managers, umpires, pitchers, and awards.
  * Use a loop with string formatting to insert the data for offensive and defensive positions from the `game_log` table.
  * Write a query to verify that your data was inserted correctly.

In [139]:
#run_command('DROP TABLE IF EXISTS person_appearance;')

In [140]:
create_table_person_appearance = '''
        CREATE TABLE person_appearance (
                appearance_id INTEGER,
                person_id TEXT,
                team_id TEXT,
                game_id TEXT,
                appearance_type_id TEXT,
                PRIMARY KEY(appearance_id),
                FOREIGN KEY(person_id) REFERENCES person(person_id),
                FOREIGN KEY(team_id) REFERENCES team(team_id),
                FOREIGN KEY(game_id) REFERENCES game(game_id),
                FOREIGN KEY(appearance_type_id) REFERENCES appearance_type(appearance_type_id)                
        );
'''

run_command(create_table_person_appearance)

In [143]:
run_query('SELECT * FROM appearance_type;')

Unnamed: 0,index,appearance_type_id,name,category
0,0,O1,Batter 1,offense
1,1,O2,Batter 2,offense
2,2,O3,Batter 3,offense
3,3,O4,Batter 4,offense
4,4,O5,Batter 5,offense
5,5,O6,Batter 6,offense
6,6,O7,Batter 7,offense
7,7,O8,Batter 8,offense
8,8,O9,Batter 9,offense
9,9,D1,Pitcher,defense


In [202]:
# insert query for umpires
umpire_loc = ['hp', '1b', '2b', '3b', 'lf', 'rf']
umpire_loc_id = ['UHP', 'U1B', 'U2B', 
                 'U3B', 'ULF', 'URF']

insert_umpire_template = """
        INSERT INTO person_appearance (
            game_id,
            team_id,
            person_id,
            appearance_type_id
        )
            SELECT
                game_id,
                NULL,
                {loc}_umpire_id,
                "{loc_id}"
            FROM game_log
            WHERE {loc}_umpire_id IS NOT NULL;
"""

for i, uloc in enumerate(umpire_loc):
    query_vars = {
        "loc": uloc,
        "loc_id":umpire_loc_id[i]
    }
    run_command(insert_umpire_template.format(**query_vars))
    

OperationalError: unrecognized token: "1b_umpire_id"

### An unexpected error occurred
After checking the `game_log` table in the database file:

* These queries work:
```
run_query('SELECT hp_umpire_id FROM game_log LIMIT 5')
run_query('SELECT lf_umpire_id FROM game_log LIMIT 5')
run_query('SELECT rf_umpire_id FROM game_log LIMIT 5')
```

* These do not work, because the column names start with a digit and SQLite cannot parse such a name as an unquoted identifier:
```
run_query('SELECT 1b_umpire_id FROM game_log LIMIT 5')
run_query('SELECT 2b_umpire_id FROM game_log LIMIT 5')
run_query('SELECT 3b_umpire_id FROM game_log LIMIT 5')
```

* Wrapping the same column names in double quotes makes the queries work:
```
run_query('SELECT "1b_umpire_id" FROM game_log LIMIT 5')
run_query('SELECT "2b_umpire_id" FROM game_log LIMIT 5')
run_query('SELECT "3b_umpire_id" FROM game_log LIMIT 5')
```

In [206]:
# Reproducing the error
run_query('SELECT hp_umpire_id FROM game_log LIMIT 3')

Unnamed: 0,hp_umpire_id
0,boakj901
1,dobsh901
2,mawnj901


In [205]:
run_query('SELECT 1b_umpire_id FROM game_log LIMIT 3')

DatabaseError: Execution failed on sql 'SELECT 1b_umpire_id FROM game_log LIMIT 3': unrecognized token: "1b_umpire_id"

In [204]:
run_query('SELECT "1b_umpire_id" FROM game_log LIMIT 3')

Unnamed: 0,1b_umpire_id
0,
1,
2,


### The umpire columns need two kinds of queries: unquoted names for `hp`, `lf`, and `rf`, and quoted names for `1b`, `2b`, and `3b`.
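As an alternative to splitting the umpire columns into two groups, quoting the identifier inside the template lets a single loop cover all six positions. The sketch below assumes toy in-memory versions of `game_log` and `person_appearance` with illustrative rows; `umpire_types` is a name introduced here. String literals use single quotes so that only identifiers carry double quotes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Toy tables with illustrative rows; only the umpire columns are kept.
conn.executescript("""
    CREATE TABLE game_log (
        game_id TEXT,
        hp_umpire_id TEXT, "1b_umpire_id" TEXT, "2b_umpire_id" TEXT,
        "3b_umpire_id" TEXT, lf_umpire_id TEXT, rf_umpire_id TEXT
    );
    INSERT INTO game_log VALUES
        ('FW1187105040', 'boakj901', NULL, NULL, NULL, NULL, NULL),
        ('WS3187105050', 'dobsh901', 'highd101', NULL, NULL, NULL, NULL);
    CREATE TABLE person_appearance (
        appearance_id INTEGER PRIMARY KEY,
        person_id TEXT, team_id TEXT, game_id TEXT, appearance_type_id TEXT
    );
""")

# Column prefix -> appearance_type_id.
umpire_types = {"hp": "UHP", "1b": "U1B", "2b": "U2B",
                "3b": "U3B", "lf": "ULF", "rf": "URF"}

# Double quotes mark identifiers, so names starting with a digit parse.
template = """
    INSERT INTO person_appearance (game_id, team_id, person_id, appearance_type_id)
    SELECT game_id, NULL, "{loc}_umpire_id", '{type_id}'
    FROM game_log
    WHERE "{loc}_umpire_id" IS NOT NULL;
"""
for loc, type_id in umpire_types.items():
    conn.execute(template.format(loc=loc, type_id=type_id))

rows = conn.execute("""
    SELECT game_id, person_id, appearance_type_id
    FROM person_appearance
    ORDER BY appearance_id;
""").fetchall()
print(rows)
```

One caveat: SQLite may fall back to treating a double-quoted name as a string literal when no such column exists, so a misspelled column name can insert the literal text instead of raising an error.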

In [239]:
# reset table
run_command('DROP TABLE IF EXISTS person_appearance;')

create_table_person_appearance = '''
        CREATE TABLE person_appearance (
                appearance_id INTEGER,
                person_id TEXT,
                team_id TEXT,
                game_id TEXT,
                appearance_type_id TEXT,
                PRIMARY KEY(appearance_id),
                FOREIGN KEY(person_id) REFERENCES person(person_id),
                FOREIGN KEY(team_id) REFERENCES team(team_id),
                FOREIGN KEY(game_id) REFERENCES game(game_id),
                FOREIGN KEY(appearance_type_id) REFERENCES appearance_type(appearance_type_id)                
        );
'''

run_command(create_table_person_appearance)

In [240]:
# hp, lf, rf umpire

# insert query for umpires
umpire_loc = ['hp', 'lf', 'rf']
umpire_loc_id = ['UHP', 'ULF', 'URF']

insert_umpire_template = """
        INSERT INTO person_appearance (
            game_id,
            team_id,
            person_id,
            appearance_type_id
        )
            SELECT
                game_id,
                NULL,
                {loc}_umpire_id,
                "{loc_id}"
            FROM game_log
            WHERE {loc}_umpire_id IS NOT NULL;
"""

for i, uloc in enumerate(umpire_loc):
    query_vars = {
        "loc": uloc,
        "loc_id":umpire_loc_id[i]
    }
    run_command(insert_umpire_template.format(**query_vars))
    

In [241]:
# 1b umpire
insert_umpire_template_1b = '''
        INSERT INTO person_appearance (
            game_id,
            team_id,
            person_id,
            appearance_type_id
        )
            SELECT
                game_id,
                NULL,
                "1b_umpire_id",
                "U1B"
            FROM game_log
            WHERE "1b_umpire_id" IS NOT NULL
'''

run_command(insert_umpire_template_1b)

In [242]:
# 2b umpire
insert_umpire_template_2b = '''
        INSERT INTO person_appearance (
            game_id,
            team_id,
            person_id,
            appearance_type_id
        )
            SELECT
                game_id,
                NULL,
                "2b_umpire_id",
                "U2B"
            FROM game_log
            WHERE "2b_umpire_id" IS NOT NULL
'''

run_command(insert_umpire_template_2b)

In [243]:
# 3b umpire

insert_umpire_template_3b = '''
        INSERT INTO person_appearance (
            game_id,
            team_id,
            person_id,
            appearance_type_id
        )
            SELECT
                game_id,
                NULL,
                "3b_umpire_id",
                "U3B"
            FROM game_log
            WHERE "3b_umpire_id" IS NOT NULL
'''

run_command(insert_umpire_template_3b)

# inserting umpire info completed.

In [244]:
# manager

insert_manager_template = '''
        INSERT INTO person_appearance (
            game_id,
            team_id,
            person_id,
            appearance_type_id
        )
    
            SELECT
                game_id,
                v_name,
                v_manager_id,
                "MM"
            FROM game_log
            WHERE v_manager_id IS NOT NULL
        
        UNION
        
            SELECT
                game_id,
                h_name,
                h_manager_id,
                "MM"
            FROM game_log
            WHERE h_manager_id IS NOT NULL;
'''

run_command(insert_manager_template)

# inserting manager info completed.

Reference: the pitcher and batter columns in `game_log`, and the appearance types they map to:
```
pit_bats = ['winning_pitcher_id',
             'winning_pitcher_name',
             'losing_pitcher_id',
             'losing_pitcher_name',
             'saving_pitcher_id',
             'saving_pitcher_name',
             'winning_rbi_batter_id',
             'winning_rbi_batter_id_name',
             'v_starting_pitcher_id',
             'v_starting_pitcher_name',
             'h_starting_pitcher_id',
             'h_starting_pitcher_name']

pit_bats_dict = {'AWP':'Winning Pitcher',
                'ALP':'Losing Pitcher',
                'ASP':'Saving Pitcher',
                'AWB':'Winning RBI Batter',
                'PSP':'Starting Pitcher'}
```

In [245]:
# pitcher & batter



insert_pit_bat_template = '''
        INSERT INTO person_appearance (
            game_id,
            team_id,
            person_id,
            appearance_type_id
        )
        
            SELECT
                game_id,
                CASE
                    WHEN h_score > v_score THEN h_name
                    ELSE v_name
                    END,
                winning_pitcher_id,
                "AWP"
            FROM game_log
            WHERE winning_pitcher_id IS NOT NULL
        
        UNION
            
            SELECT
                game_id,
                CASE
                    WHEN h_score < v_score THEN h_name
                    ELSE v_name
                    END,
                losing_pitcher_id,
                "ALP"
            FROM game_log
            WHERE losing_pitcher_id IS NOT NULL
            
        UNION
            
            SELECT
                game_id,
                CASE
                    WHEN h_score > v_score THEN h_name
                    ELSE v_name
                    END,
                saving_pitcher_id,
                "ASP"
            FROM game_log
            WHERE saving_pitcher_id IS NOT NULL
            
        UNION
        
            SELECT
                game_id,
                CASE
                    WHEN h_score > v_score THEN h_name
                    ELSE v_name
                    END,
                winning_rbi_batter_id,
                "AWB"
            FROM game_log
            WHERE winning_rbi_batter_id IS NOT NULL
            
        UNION
        
            SELECT
                game_id,
                v_name,
                v_starting_pitcher_id,
                "PSP"
            FROM game_log
            WHERE v_starting_pitcher_id IS NOT NULL
            
        UNION
        
            SELECT
                game_id,
                h_name,
                h_starting_pitcher_id,
                "PSP"
            FROM game_log
            WHERE h_starting_pitcher_id IS NOT NULL;
'''

run_command(insert_pit_bat_template)

# inserting pitchers & batters info completed.

In [246]:
# players (general)

template = """
        INSERT INTO person_appearance (
            game_id,
            team_id,
            person_id,
            appearance_type_id
        ) 
            SELECT
                game_id,
                {hv}_name,
                {hv}_player_{num}_id,
                "O{num}"
            FROM game_log
            WHERE {hv}_player_{num}_id IS NOT NULL

        UNION

            SELECT
                game_id,
                {hv}_name,
                {hv}_player_{num}_id,
                "D" || CAST({hv}_player_{num}_def_pos AS INT)
            FROM game_log
            WHERE {hv}_player_{num}_id IS NOT NULL;
"""

for hv in ["h","v"]:
    for num in range(1,10):
        query_vars = {
            "hv": hv,
            "num": num
        }
        # run_command is a helper function that runs
        # a command against our database.
        run_command(template.format(**query_vars))
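
To verify the inserts, one option is to count rows per appearance type and to look for rows whose type has no match in `appearance_type`. This is a sketch on toy in-memory tables: the `person_id` values `player_a` and `player_b` are placeholders, and only two appearance types are loaded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Toy tables; person_id values are placeholders, not real Retrosheet IDs.
conn.executescript("""
    CREATE TABLE appearance_type (
        appearance_type_id TEXT PRIMARY KEY, name TEXT, category TEXT
    );
    INSERT INTO appearance_type VALUES
        ('O1', 'Batter 1', 'offense'),
        ('D1', 'Pitcher', 'defense');
    CREATE TABLE person_appearance (
        appearance_id INTEGER PRIMARY KEY,
        person_id TEXT, team_id TEXT, game_id TEXT, appearance_type_id TEXT
    );
    INSERT INTO person_appearance
        (person_id, team_id, game_id, appearance_type_id)
    VALUES
        ('player_a', 'FW1', 'FW1187105040', 'O1'),
        ('player_a', 'FW1', 'FW1187105040', 'D1'),
        ('player_b', 'CL1', 'FW1187105040', 'O1');
""")

# Number of appearances per category and type.
counts = conn.execute("""
    SELECT atype.category, pa.appearance_type_id, COUNT(*) AS appearances
    FROM person_appearance pa
    INNER JOIN appearance_type atype
        ON atype.appearance_type_id = pa.appearance_type_id
    GROUP BY atype.category, pa.appearance_type_id
    ORDER BY atype.category, pa.appearance_type_id;
""").fetchall()

# Rows whose appearance_type_id has no match in appearance_type.
orphans = conn.execute("""
    SELECT COUNT(*)
    FROM person_appearance pa
    LEFT JOIN appearance_type atype
        ON atype.appearance_type_id = pa.appearance_type_id
    WHERE atype.appearance_type_id IS NULL;
""").fetchone()[0]

print(counts, orphans)
```

On the real database, comparing each count with the number of non-null values in the matching `game_log` column gives a stronger check, as long as it runs before the original tables are dropped.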

## Removing the Original Tables

We've now created all normalized tables and inserted all of our data!<br>
Our last task is to remove the tables we created to import the original CSVs.

In [247]:
table_to_drop = ['game_log', 'park_codes', 'team_codes', 'person_codes']

for t in table_to_drop:
    run_command('DROP TABLE IF EXISTS '+t)
    

## FINISHED!

![mlb-schema](https://s3.amazonaws.com/dq-content/193/mlb_schema.svg)
![person-appearance-table](https://s3.amazonaws.com/dq-content/193/mlb_schema_4.svg)


## Next Steps

In this guided project, we learned how to:
* Import CSV data into a database.
* Design a normalized schema for a large, predominantly single table data set.
* Create tables that match the schema design.
* Migrate data from unnormalized tables into our normalized tables.

To extend this project, you might like to consider one or more of the following:
* Transform the dates into a SQLite-compatible format.
* Extract the line scores into innings level data in a new table.
* Create views to make querying stats easier, e.g.:
  * Season level stats.
  * All time records.
* Supplement the database using new data, for instance:
  * Add data from Retrosheet game logs for years after 2016.
  * Source and add missing pitcher information.
  * Add player level per-game stats.
  * Source and include base coach data.
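
As a starting point for the date idea, here is a hedged sketch that rewrites the integer `YYYYMMDD` values as ISO 8601 text (`YYYY-MM-DD`), which SQLite's date functions understand. It uses a two-column toy `game` table rather than the full one, and `iso_date` is a name chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Toy game table holding only the ID and the integer date.
conn.execute("CREATE TABLE game (game_id TEXT PRIMARY KEY, date INTEGER);")
conn.executemany(
    "INSERT INTO game VALUES (?, ?);",
    [("FW1187105040", 18710504), ("WS3187105050", 18710505)],
)

# SUBSTR treats the integer as text, so the date can be sliced
# into year, month, and day and rejoined with dashes.
rows = conn.execute("""
    SELECT
        game_id,
        SUBSTR(date, 1, 4) || '-' || SUBSTR(date, 5, 2) || '-' || SUBSTR(date, 7, 2)
            AS iso_date
    FROM game
    ORDER BY game_id;
""").fetchall()

# Once in ISO form, date functions such as strftime can be applied.
years = [row[0] for row in conn.execute("""
    SELECT strftime('%Y',
        SUBSTR(date, 1, 4) || '-' || SUBSTR(date, 5, 2) || '-' || SUBSTR(date, 7, 2))
    FROM game
    ORDER BY game_id;
""")]
print(rows, years)
```

The same expression could be used in the `INSERT INTO game` query so that the `date` column is stored as text from the start.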