# Guided Project
### Designing and Creating a Database

## Getting to Know the Data

In this guided project, we're going to learn how to:
* Import data into SQLite
* Design a normalized database schema
* Create tables for our schema
* Insert data into our schema


We will be working with a file of [Major League Baseball](https://en.wikipedia.org/wiki/Major_League_Baseball) games from [Retrosheet](http://www.retrosheet.org/). Retrosheet compiles detailed statistics on baseball games from the 1800s through to today. The main file we will be working from, `game_log.csv`, has been produced by combining 127 separate CSV files from Retrosheet, and has been pre-cleaned to remove some inconsistencies. The game log has hundreds of data points on each game, which we will normalize into several separate tables using SQL, providing a robust database of game-level statistics.<br>

In addition to the main file, we have also included three 'helper' files, also sourced from Retrosheet:
* park_codes.csv
* person_codes.csv
* team_codes.csv

These three helper files in some cases contain extra data, but will also make things easier as they will form the basis for three of our normalized tables.<br>

An important first step when working with any new data is to perform exploratory data analysis (EDA). EDA gets us familiar with the data and gives us a level of background knowledge that will help us throughout our project. The methods you use when performing EDA will depend on what you plan to do with the data. In our case, we want to create a normalized database, so our focus should be:

* Becoming familiar, at a high level, with the meaning of each column in each file.
* Thinking about the relationships between columns within each file.
* Thinking about the relationships between columns across different files.

We have included a `game_log_fields.txt` file from Retrosheet which explains the fields included in our main file, which will be useful to assist our EDA. You can use `!cat game_log_fields.txt` in its own Jupyter cell to read the contents of the file.<br>

**If you're not familiar with baseball some of this can seem overwhelming at first, however this presents a great opportunity.** 

When you are working with data professionally, you'll often encounter data from an industry you are unfamiliar with: it might be digital marketing, geological engineering, or industrial machinery.

In these instances, you'll have to perform research in order to understand the data you're working with.<br>

Baseball is a great topic to practice these skills with. Because of the long history within baseball of the collection and analysis of statistics (most famously the [Sabermetrics](https://en.wikipedia.org/wiki/Sabermetrics) featured in the movie [Moneyball](https://en.wikipedia.org/wiki/Moneyball_%28film%29)), there is a wide range of online resources available to help you get answers to any questions you may have.<br>

Let's get started by using pandas to read and explore the data. We recommend setting the following options after you import pandas, since they prevent the DataFrame output from being truncated, given the size of the main game log file:

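```python
# Full option names avoid "pattern matched multiple keys" errors in recent pandas versions
pd.set_option('display.max_columns', 180)
pd.set_option('display.max_rows', 200000)
pd.set_option('display.max_colwidth', 5000)
```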

In [53]:
import pandas as pd

In [54]:
# Because of GitHub's upload file size limit, the data file is zipped.
# Use the command below to unzip it into your current working folder.

# For the same reason, this repository does not contain the 'mlb.db'
# database file; running the code below will create it.

# On the first run of this notebook,
# uncomment the next line and run it once.

# !unzip ./game_log.zip

Archive:  ./game_log.zip
  inflating: game_log.csv            
   creating: __MACOSX/
  inflating: __MACOSX/._game_log.csv  


In [55]:
pd.set_option('display.max_columns', 180)
pd.set_option('display.max_rows', 200000)
pd.set_option('display.max_colwidth', 5000)

In [56]:
game_log = pd.read_csv('game_log.csv')



In [57]:
game_log = pd.read_csv('game_log.csv', low_memory = False)

In [58]:
park_codes = pd.read_csv('park_codes.csv')
person_codes = pd.read_csv("person_codes.csv")
team_codes = pd.read_csv("team_codes.csv")

* Using pandas, read in each of the four CSV files: `game_log.csv`, `park_codes.csv`, `person_codes.csv`, `team_codes.csv`. For each:
  * Use methods and attributes like `DataFrame.shape`, `DataFrame.head()`, and `DataFrame.tail()` to explore the data.
  * Write a brief paragraph to describe each file, including for the helper files how the data intersects with the main log file.
* Research any fields you are not familiar with, using both the text file and Google as needed. In particular, you should explore and write a short paragraph on:
  * What each defensive position number represents.
  * The values in the various league fields, and which leagues they represent.
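
As a starting point for that research, here is a small lookup sketch. It assumes the standard baseball scoring convention for positions 1 to 9, with 10 used for the designated hitter, and the historical league abbreviations that appear in the helper files. The names `DEF_POSITIONS`, `LEAGUES` and `describe_position` are illustrative only; check the mappings against `game_log_fields.txt` and the Retrosheet site before relying on them.

```python
# Assumed mapping: standard scoring numbers, with 10 for the designated hitter
DEF_POSITIONS = {
    1: "Pitcher", 2: "Catcher", 3: "First base", 4: "Second base",
    5: "Third base", 6: "Shortstop", 7: "Left field", 8: "Center field",
    9: "Right field", 10: "Designated hitter",
}

# Assumed mapping of league codes to league names
LEAGUES = {
    "NL": "National League", "AL": "American League",
    "AA": "American Association", "FL": "Federal League",
    "PL": "Players League", "UA": "Union Association",
}

def describe_position(code):
    """Translate a def_pos value such as 6.0 into a position name."""
    if code is None or code != code:  # NaN is the only value not equal to itself
        return "Unknown"
    return DEF_POSITIONS.get(int(code), "Unknown")

print(describe_position(6.0))
print(LEAGUES.get("UA", "Unknown"))
```

The early rows of the game log have no league value at all (`NaN`), which is worth investigating separately.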

In [59]:
for (df, name) in zip([game_log, park_codes, person_codes, team_codes],
              ['game_log', 'park_codes', 'person_codes', 'team_codes']):
    
    print('')
    print('#'*10, name, '#'*10)
    print('shape:',df.shape)
    print('columns:', df.columns)

    print(df.head(3))
    print(df.tail(3))
    print('')
    


########## game_log ##########
shape: (171907, 161)
columns: Index(['date', 'number_of_game', 'day_of_week', 'v_name', 'v_league',
       'v_game_number', 'h_name', 'h_league', 'h_game_number', 'v_score',
       ...
       'h_player_7_name', 'h_player_7_def_pos', 'h_player_8_id',
       'h_player_8_name', 'h_player_8_def_pos', 'h_player_9_id',
       'h_player_9_name', 'h_player_9_def_pos', 'additional_info',
       'acquisition_info'],
      dtype='object', length=161)
       date  number_of_game day_of_week v_name v_league  v_game_number h_name  \
0  18710504               0         Thu    CL1      NaN              1    FW1   
1  18710505               0         Fri    BS1      NaN              1    WS3   
2  18710506               0         Sat    CL1      NaN              2    RC1   

  h_league  h_game_number  v_score  h_score  length_outs day_night completion  \
0      NaN              1        0        2         54.0         D        NaN   
1      NaN              1       20   

            date  number_of_game day_of_week v_name v_league  v_game_number  \
171904  20161002               0         Sun    LAN       NL            162   
171905  20161002               0         Sun    PIT       NL            162   
171906  20161002               0         Sun    MIA       NL            161   

       h_name h_league  h_game_number  v_score  h_score  length_outs  \
171904    SFN       NL            162        1        7         51.0   
171905    SLN       NL            162        4       10         51.0   
171906    WAS       NL            162        7       10         51.0   

       day_night completion forefeit protest park_id  attendance  \
171904         D        NaN      NaN     NaN   SFO03     41445.0   
171905         D        NaN      NaN     NaN   STL10     44615.0   
171906         D        NaN      NaN     NaN   WAS11     28730.0   

        length_minutes v_line_score h_line_score  v_at_bats  v_hits  \
171904           184.0    000100000    23000002x  

In [46]:
# game log data from 1871.05.04 - 2016.10.02
game_log.date.unique()

array([18710504, 18710505, 18710506, ..., 20160930, 20161001, 20161002])

In [6]:
print('park_id' in game_log.columns)
print('id' in game_log.columns)
print('team_id' in game_log.columns)

True
False
False


In [7]:
# player id

import re
print(re.findall(r'[0-9a-z_]*_id_[0-9a-z_]*', ' '.join(game_log.columns)))
print(re.findall(r'[0-9a-z_]*_id', ' '.join(game_log.columns)))

['winning_rbi_batter_id_name']
['park_id', 'hp_umpire_id', '1b_umpire_id', '2b_umpire_id', '3b_umpire_id', 'lf_umpire_id', 'rf_umpire_id', 'v_manager_id', 'h_manager_id', 'winning_pitcher_id', 'losing_pitcher_id', 'saving_pitcher_id', 'winning_rbi_batter_id', 'winning_rbi_batter_id', 'v_starting_pitcher_id', 'h_starting_pitcher_id', 'v_player_1_id', 'v_player_2_id', 'v_player_3_id', 'v_player_4_id', 'v_player_5_id', 'v_player_6_id', 'v_player_7_id', 'v_player_8_id', 'v_player_9_id', 'h_player_1_id', 'h_player_2_id', 'h_player_3_id', 'h_player_4_id', 'h_player_5_id', 'h_player_6_id', 'h_player_7_id', 'h_player_8_id', 'h_player_9_id']


In [8]:
# there is no team id
# instead team name exists in 'v_name', 'h_name' 
# in the format of 'team_id' in 'team_codes' dataset

print(game_log.v_name[:3])
print(game_log.h_name[:3])

0    CL1
1    BS1
2    CL1
Name: v_name, dtype: object
0    FW1
1    WS3
2    RC1
Name: h_name, dtype: object


### Note 1 : dataset relationship

* `game_log.csv` [ `v_player_[num]_id, ...` ] --- `person_codes.csv` [ `id` ] 
* `game_log.csv` [ `park_id` ] --- `park_codes.csv` [ `park_id` ]
* `game_log.csv` [ `v_name`, `h_name` ] --- `team_codes.csv` [ `team_id` ]
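
One way to sanity-check relationships like these is to look for key values in the game log that have no match in the helper file. The sketch below uses two tiny stand-in DataFrames with made-up values (including a deliberately unmatched `XXX`) rather than the real files, and `missing_keys` is a hypothetical helper name.

```python
import pandas as pd

# Tiny stand-ins for the real DataFrames (values are illustrative)
game_log_sample = pd.DataFrame({"v_name": ["CL1", "BS1"], "h_name": ["FW1", "XXX"]})
team_codes_sample = pd.DataFrame({"team_id": ["CL1", "BS1", "FW1"]})

def missing_keys(child, parent):
    """Return the non-null values of `child` that have no match in `parent`."""
    values = pd.Series(child).dropna().unique()
    return sorted(set(values) - set(parent))

# Both the visiting and home team columns refer to team_id
teams_in_log = pd.concat([game_log_sample["v_name"], game_log_sample["h_name"]])
print(missing_keys(teams_in_log, team_codes_sample["team_id"]))
```

An empty list would suggest that every team in the log can be joined to the helper table.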

### Note 2 : game_log_fields

```
Field(s)  Meaning
    1     Date in the form "yyyymmdd"
    2     Number of game:
             "0" -- a single game
             "1" -- the first game of a double (or triple) header
                    including seperate admission doubleheaders
             "2" -- the second game of a double (or triple) header
                    including seperate admission doubleheaders
             "3" -- the third game of a triple-header
             "A" -- the first game of a double-header involving 3 teams
             "B" -- the second game of a double-header involving 3 teams
    3     Day of week  ("Sun","Mon","Tue","Wed","Thu","Fri","Sat")
  4-5     Visiting team and league
    6     Visiting team game number
          For this and the home team game number, ties are counted as
          games and suspended games are counted from the starting
          rather than the ending date.
  7-8     Home team and league
    9     Home team game number
10-11     Visiting and home team score (unquoted)
   12     Length of game in outs (unquoted).  A full 9-inning game would
          have a 54 in this field.  If the home team won without batting
          in the bottom of the ninth, this field would contain a 51.
   13     Day/night indicator ("D" or "N")
   14     Completion information.  If the game was completed at a
          later date (either due to a suspension or an upheld protest)
          this field will include:
             "yyyymmdd,park,vs,hs,len" Where
          yyyymmdd -- the date the game was completed
          park -- the park ID where the game was completed
          vs -- the visitor score at the time of interruption
          hs -- the home score at the time of interruption
          len -- the length of the game in outs at time of interruption
          All the rest of the information in the record refers to the
          entire game.
   15     Forfeit information:
             "V" -- the game was forfeited to the visiting team
             "H" -- the game was forfeited to the home team
             "T" -- the game was ruled a no-decision
   16     Protest information:
             "P" -- the game was protested by an unidentified team
             "V" -- a disallowed protest was made by the visiting team
             "H" -- a disallowed protest was made by the home team
             "X" -- an upheld protest was made by the visiting team
             "Y" -- an upheld protest was made by the home team
          Note: two of these last four codes can appear in the field
          (if both teams protested the game).
   17     Park ID
   18     Attendance (unquoted)
   19     Time of game in minutes (unquoted)
20-21     Visiting and home line scores.  For example:
             "010000(10)0x"
          Would indicate a game where the home team scored a run in
          the second inning, ten in the seventh and didn't bat in the
          bottom of the ninth.
22-38     Visiting team offensive statistics (unquoted) (in order):
             at-bats
             hits
             doubles
             triples
             homeruns
             RBI
             sacrifice hits.  This may include sacrifice flies for years
                prior to 1954 when sacrifice flies were allowed.
             sacrifice flies (since 1954)
             hit-by-pitch
             walks
             intentional walks
             strikeouts
             stolen bases
             caught stealing
             grounded into double plays
             awarded first on catcher's interference
             left on base
39-43     Visiting team pitching statistics (unquoted)(in order):
             pitchers used ( 1 means it was a complete game )
             individual earned runs
             team earned runs
             wild pitches
             balks
44-49     Visiting team defensive statistics (unquoted) (in order):
             putouts.  Note: prior to 1931, this may not equal 3 times
                the number of innings pitched.  Prior to that, no
                putout was awarded when a runner was declared out for
                being hit by a batted ball.
             assists
             errors
             passed balls
             double plays
             triple plays
50-66     Home team offensive statistics
67-71     Home team pitching statistics
72-77     Home team defensive statistics
78-79     Home plate umpire ID and name
80-81     1B umpire ID and name
82-83     2B umpire ID and name
84-85     3B umpire ID and name
86-87     LF umpire ID and name
88-89     RF umpire ID and name
          If any umpire positions were not filled for a particular game
          the fields will be "","(none)".
90-91     Visiting team manager ID and name
92-93     Home team manager ID and name
94-95     Winning pitcher ID and name
96-97     Losing pitcher ID and name
98-99     Saving pitcher ID and name--"","(none)" if none awarded
100-101   Game Winning RBI batter ID and name--"","(none)" if none
          awarded
102-103   Visiting starting pitcher ID and name
104-105   Home starting pitcher ID and name
106-132   Visiting starting players ID, name and defensive position,
          listed in the order (1-9) they appeared in the batting order.
133-159   Home starting players ID, name and defensive position
          listed in the order (1-9) they appeared in the batting order.
  160     Additional information.  This is a grab-bag of informational
          items that might not warrant a field on their own.  The field 
          is alpha-numeric. Some items are represented by tokens such as:
             "HTBF" -- home team batted first.
             Note: if "HTBF" is specified it would be possible to see
             something like "01002000x" in the visitor's line score.
          Changes in umpire positions during a game will also appear in 
          this field.  These will be in the form:
             umpchange,inning,umpPosition,umpid with the latter three
             repeated for each umpire.
          These changes occur with umpire injuries, late arrival of 
          umpires or changes from completion of suspended games. Details
          of suspended games are in field 14.
  161     Acquisition information:
             "Y" -- we have the complete game
             "N" -- we don't have any portion of the game
             "D" -- the game was derived from box score and game story
             "P" -- we have some portion of the game.  We may be missing
                    innings at the beginning, middle and end of the game.
 
Missing fields will be NULL.
```
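
The line score format in fields 20-21 is easier to understand with a small parser. This is only a sketch of how the description above could be read: `parse_line_score` is a hypothetical helper, and it assumes that parenthesised groups are the only way innings of ten or more runs are written.

```python
import re

def parse_line_score(line_score):
    """Split a Retrosheet line score into a list of runs per inning.

    Parenthesised groups such as "(10)" are innings with ten or more runs;
    "x" marks a half-inning that was not played and is returned as None.
    """
    tokens = re.findall(r"\((\d+)\)|(\d)|(x)", line_score)
    innings = []
    for multi, single, not_played in tokens:
        if not_played:
            innings.append(None)
        else:
            innings.append(int(multi or single))
    return innings

runs = parse_line_score("010000(10)0x")
print(runs)
print(sum(r for r in runs if r is not None))
```

For the example from the field description, this gives one run in the second inning, ten in the seventh, and no value for the unplayed bottom of the ninth.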

## Importing Data into SQLite

To insert data into a normalized database, we'll need a single column that can be used as a primary key. The game log file does not have a single column that can be used as a primary key to uniquely identify each game. There are three ways that we could handle this:
* Make a compound primary key, such as a primary key of the `date`, `h_name`, and `number_of_game` columns.
* Insert an integer primary key, eg where the first row is `1`, the second row is `2`, etc.
* Insert a new column using a custom format.

Because we have not yet normalized our data, it's better not to start with a compound primary key. If we did, we might end up needing to create a compound key in another table that includes this compound key, which would quickly become cumbersome to work with. An integer primary key is a good choice, but we should first explore whether Retrosheet already has a system for uniquely identifying each game. If it does, this is a better option: if at some later stage we choose to incorporate more detailed game data into our database, the keys we use will be compatible with other sources.<br>

Exploring the Retrosheet site, we can find this [data dictionary](http://www.retrosheet.org/eventfile.htm) for their event files, which list every event within each game. This includes the following description:
* `id`: Each game begins with a **twelve character ID record** which identifies the `date`, `home team`, and `number of the game`. For example, `ATL198304080` should be read as follows. 
  * The first three characters identify the home team (the Braves). 
  * The next four are the year (1983). 
  * The next two are the month (April) using the standard numeric notation, 04, followed by the day (08). 
  * The last digit indicates if this is a single game (`0`), first game (`1`) or second game (`2`) if more than one game is played during a day, usually a double header. The ID record starts the description of a game, thus ending the description of the preceding game in the file.

You might notice that this essentially makes a custom key using the three columns we identified in our composite key example earlier. After we import the data, we'll construct this column to use as a primary key in our final database.<br>
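
The format described above can be sketched in plain Python before we write it in SQL. `make_game_id` is a hypothetical helper used only for illustration.

```python
def make_game_id(h_name, date, number_of_game):
    """Build a Retrosheet-style game ID: home team + yyyymmdd + game number."""
    return f"{h_name}{date}{number_of_game}"

# The example from the Retrosheet data dictionary
print(make_game_id("ATL", 19830408, 0))
# The first row of our game log
print(make_game_id("FW1", 18710504, 0))
```

Each ID is twelve characters long: three for the team, eight for the date and one for the game number.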

Our next task is to import the data into SQLite. There are **three key ways to import data into a SQLite database**:

### 1. Using the Python SQLite library
The [Python SQLite library](https://docs.python.org/3/library/sqlite3.html) gives us ultimate control when importing data. We will first need to get the data into Python - we might choose to use the [csv module](https://docs.python.org/3/library/csv.html) for this. Next, we would use the [`Cursor.execute()` method](https://docs.python.org/3/library/sqlite3.html#sqlite3.Cursor.execute) to create a table for our data.<br>

Lastly, we can use the [`Cursor.executemany()` method](https://docs.python.org/3/library/sqlite3.html#sqlite3.Cursor.executemany) to insert multiple rows of data in a single command. If we create our [connection object](https://docs.python.org/3/library/sqlite3.html#connection-objects) with a filename that doesn't exist, the sqlite module will create the database file for us.<br>

We should take advantage of the `?` placeholder value syntax instead of using [Python string formatting](https://pyformat.info/) **to prevent [SQL injection attacks](https://en.wikipedia.org/wiki/SQL_injection#Incorrect_type_handling)** (like the hilarious [XKCD 'Bobby Tables' comic example](https://xkcd.com/327/)) and maintain the correct data types. 

**Even though in this project we won't be running any external user code, this is an extremely good habit to get into.**

Here's what our syntax would look like for the last step:

```
my_list_of_lists = [
    [4, 4, 8, 2],
    [5, 1, 6, 3],
    [5, 2, 4, 6]
]
c = """
INSERT INTO table_name (
    column_one,
    column_two,
    [...]
) VALUES (
    ?,
    ?
    [...]
);
"""
cur.executemany(c, my_list_of_lists)
```

The advantage of this method is that we have the highest level of control over what we're doing. Additionally, if we have larger data, we can write a loop that iterates over our source line by line so that we don't have to read all of it into memory at once.<br>

**The disadvantage is that there is a lot of manual data handling required**.
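
To make the steps above concrete, here is a minimal end-to-end sketch using only the standard library. It is not the approach used later in this notebook: the `demo_parks` table and its two columns are invented for illustration, an in-memory database stands in for a file, and an `io.StringIO` object stands in for an open CSV file.

```python
import csv
import io
import sqlite3

# Stand-in for an open CSV file; with real data use open("some_file.csv", newline="")
csv_file = io.StringIO("park_id,city\nALB01,Albany\nALT01,Altoona\n")

# A filename here (instead of ":memory:") would create the file if it is missing
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo_parks (park_id TEXT PRIMARY KEY, city TEXT)")

reader = csv.reader(csv_file)
next(reader)  # skip the header row

# executemany accepts any iterable, so the reader is consumed row by row
# rather than being loaded into a list first
conn.executemany("INSERT INTO demo_parks (park_id, city) VALUES (?, ?)", reader)
conn.commit()

rows = conn.execute("SELECT park_id, city FROM demo_parks ORDER BY park_id").fetchall()
print(rows)
```

Because the table is created explicitly, we also get to choose the column types and the primary key.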

### 2. Using pandas
The pandas library includes a handy [`DataFrame.to_sql()` method](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_sql.html) that we can use to send the contents of a dataframe to a SQLite connection object. We can either create the table first using the method above, or if the table does not exist, pandas will create it for us. Here's an example of what that looks like:

```python
my_dataframe.to_sql('table_name', sqlite_connection_object, index=False)
```
Most of the time, we'll want to use `index=False`, otherwise pandas will create an extra column for the pandas index.<br>

The advantage of this method is that **it can often be done with a line or two of code**.<br>

The disadvantage is that pandas **may alter the data as it reads it in and converts the columns to types automatically**. Additionally, this **requires the data to be small enough to be able to be stored in-memory** using pandas.
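
A small self-contained sketch of the pandas route, using a made-up two-row DataFrame and an in-memory database. The `demo_teams` table name is hypothetical, and `if_exists="replace"` is an optional choice that makes the cell safe to re-run.

```python
import sqlite3
import pandas as pd

demo_df = pd.DataFrame({"team_id": ["ALT", "ARI"], "league": ["UA", "NL"]})

conn = sqlite3.connect(":memory:")
# if_exists="replace" lets the cell be re-run without a "table already exists" error
demo_df.to_sql("demo_teams", conn, index=False, if_exists="replace")

print(pd.read_sql("SELECT * FROM demo_teams", conn))

# Inspect the column names and types pandas chose for the new table
table_info = conn.execute("PRAGMA table_info(demo_teams)").fetchall()
print(table_info)
```

Checking `PRAGMA table_info` after an import is a quick way to see which types pandas assigned.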

### 3. From the SQLite shell
The last method is to use the SQLite shell to import the data. Like the pandas method, we can either create the table manually ourselves, or rely on SQLite to do it for us. Here are the commands you would use to import a CSV using the SQLite shell:

```bash
sqlite> .mode csv
sqlite> .import filename.csv table_name
```

This is one of the quickest methods to use and works well with large data sources.

There are several **minor inconveniences** to this method. 
* When SQLite creates the table for you, it **does not infer column types from the data** (the shell typically declares every column as `TEXT`), which can lead to incorrect types.
* You'll need **SQLite shell access**, which you won't always have.
* Lastly, if you want to create the table yourself, you will **need to remove the header from the first line** of your CSV, otherwise SQLite will make that the first row of your table.

With all of these methods, unless we explicitly create the table, **the table will be created with no primary key**. For now this isn't a problem as we'll be migrating this data into new, normalized tables.<br>

We'll use the pandas method in this instance, because we've already read the data into dataframes.

The type conversion isn't a big issue: as outlined above, we will move the data into new tables and can handle type conversion then.<br>

#### Hint: 
Just like in the previous guided project, our database retains 'state', so if we run a query that creates or modifies a table twice, the query will fail. [Some commands like `CREATE TABLE`](https://sqlite.org/lang_createtable.html) support `IF NOT EXISTS` which will allow you to run your notebook without these errors. You should consult the [SQLite documentation](https://sqlite.org/lang.html) for the availability and syntax of these clauses.
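
Here is a short sketch of what re-runnable commands can look like. The `demo_league` table is invented for illustration; `IF NOT EXISTS`, `IF EXISTS` and `INSERT OR IGNORE` are standard SQLite clauses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

create_demo = """
CREATE TABLE IF NOT EXISTS demo_league (
    league_id TEXT PRIMARY KEY,
    name TEXT
);
"""
# Running the same statement twice succeeds because of IF NOT EXISTS
conn.execute(create_demo)
conn.execute(create_demo)

# INSERT OR IGNORE skips rows whose primary key already exists
for _ in range(2):
    conn.execute(
        "INSERT OR IGNORE INTO demo_league VALUES (?, ?)",
        ("NL", "National League"),
    )

# DROP TABLE also accepts IF EXISTS, so this does not fail for a missing table
conn.execute("DROP TABLE IF EXISTS table_that_was_never_created")

count = conn.execute("SELECT COUNT(*) FROM demo_league").fetchone()[0]
print(count)
```

Note that `ALTER TABLE ... ADD COLUMN` has no equivalent clause, so a cell that adds a column will still fail on a second run.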

* Recreate the `run_command()` and `run_query()` functions from the previous guided project, which you can use to interact with the database.
* Use `DataFrame.to_sql()` to create tables for each of our dataframes in a new SQLite database, `mlb.db`:
  * The table name should be the same as each CSV filename without the extension, eg `game_log.csv` should be imported to a table called `game_log`.
* Using `run_command()`, create a new column in the `game_log` table called `game_id`:
  * Use **SQL string concatenation** to update the new column with a unique ID using the Retrosheet format outlined above.

In [9]:
import sqlite3

In [14]:
def run_query(query):
    with sqlite3.connect('mlb.db') as conn:
        return pd.read_sql(query, conn)
    
def run_command(query):
    with sqlite3.connect('mlb.db') as conn:
        conn.execute(query)
        conn.commit()

In [11]:
%%bash
sqlite3 mlb.db

In [12]:
con = sqlite3.connect('mlb.db')
game_log.to_sql('game_log', con)

In [15]:
run_query('SELECT * FROM game_log LIMIT 5')

Unnamed: 0,index,date,number_of_game,day_of_week,v_name,v_league,v_game_number,h_name,h_league,h_game_number,...,h_player_7_name,h_player_7_def_pos,h_player_8_id,h_player_8_name,h_player_8_def_pos,h_player_9_id,h_player_9_name,h_player_9_def_pos,additional_info,acquisition_info
0,0,18710504,0,Thu,CL1,,1,FW1,,1,...,Ed Mincher,7.0,mcdej101,James McDermott,8.0,kellb105,Bill Kelly,9.0,,Y
1,1,18710505,0,Fri,BS1,,1,WS3,,1,...,Asa Brainard,1.0,burrh101,Henry Burroughs,9.0,berth101,Henry Berthrong,8.0,HTBF,Y
2,2,18710506,0,Sat,CL1,,2,RC1,,1,...,Pony Sager,6.0,birdg101,George Bird,7.0,stirg101,Gat Stires,9.0,,Y
3,3,18710508,0,Mon,CL1,,3,CH1,,1,...,Ed Duffy,6.0,pinke101,Ed Pinkham,5.0,zettg101,George Zettlein,1.0,,Y
4,4,18710509,0,Tue,BS1,,2,TRO,,1,...,Steve Bellan,5.0,pikel101,Lip Pike,3.0,cravb101,Bill Craver,6.0,HTBF,Y


In [16]:
add_column_game_id_query = '''
        ALTER TABLE game_log
        ADD COLUMN game_id
'''
run_command(add_column_game_id_query)

In [18]:
update_game_id_concat_query = '''
        UPDATE game_log
        SET game_id = h_name||date||number_of_game;
'''
run_command(update_game_id_concat_query)

In [20]:
run_query('SELECT game_id FROM game_log LIMIT 5')

Unnamed: 0,game_id
0,FW1187105040
1,WS3187105050
2,RC1187105060
3,CH1187105080
4,TRO187105090


## Looking for Normalization Opportunities

When we spoke about database normalization in the previous mission, we mentioned that there were **normal forms**, a series of 5 progressive stages. Each of these stages has specific rules about the structure of the data that you can use to normalize.<br>

Rather than learn and follow specific normalized forms, we're going to look for specific opportunities to normalize our data by reducing repetition. Here are two examples of repetition we can find and remove:<br>

### Repetition in columns

Let's look at the following segment of data:

In [22]:
run_query('''SELECT v_player_1_id, v_player_1_name, v_player_1_def_pos,
                    v_player_2_id, v_player_2_name, v_player_2_def_pos
                FROM game_log
                LIMIT 5
''')

Unnamed: 0,v_player_1_id,v_player_1_name,v_player_1_def_pos,v_player_2_id,v_player_2_name,v_player_2_def_pos
0,whitd102,Deacon White,2.0,kimbg101,Gene Kimball,4.0
1,wrigg101,George Wright,6.0,barnr102,Ross Barnes,4.0
2,whitd102,Deacon White,2.0,kimbg101,Gene Kimball,4.0
3,whitd102,Deacon White,2.0,kimbg101,Gene Kimball,4.0
4,wrigg101,George Wright,6.0,barnr102,Ross Barnes,4.0


We have three columns that relate to one player, followed by three columns that relate to another player. We could restructure our data to remove this repetition - we would need to add an extra column to include the data that was previously only contained in the name of the column:

id|name|def_pos|off_pos
---|---|---|---
villj001|Jonathan Villar|5.0|1.0
granc001|Curtis Granderson|8.0|1.0
kendh001|Howie Kendrick|7.0|1.0
jasoj001|John Jaso|3.0|1.0
gordd002|Dee Gordon|4.0|1.0

### Non-primary key columns should be attributes of the primary key
The primary key of our game log is our `game_id`, but the players' names are not attributes of a game; they are attributes of the player ID. If the only data we had was the game log, we would remove these name columns and create a new table that had the names of each player. As it happens, our `person_codes` table already has a list of our player IDs and names, so we can remove these without the need for creating a new table first.

### Redundant Data
Lastly, we want to eliminate any redundant data - that is, columns where the data is available elsewhere. A good example of this can be found in our `park_codes` table, which will form the basis of our eventual park table. Let's look at the first few rows (we won't display the notes column as it is not relevant to our discussion):

In [26]:
con = sqlite3.connect('mlb.db')
park_codes.to_sql('park_codes', con)

In [28]:
run_query('SELECT park_id, name, aka, city, state, end, league FROM park_codes LIMIT 5')

Unnamed: 0,park_id,name,aka,city,state,end,league
0,ALB01,Riverside Park,,Albany,NY,05/30/1882,NL
1,ALT01,Columbia Park,,Altoona,PA,05/31/1884,UA
2,ANA01,Angel Stadium of Anaheim,Edison Field; Anaheim Stadium,Anaheim,CA,,AL
3,ARL01,Arlington Stadium,,Arlington,TX,10/03/1993,AL
4,ARL02,Rangers Ballpark in Arlington,The Ballpark in Arlington; Ameriquest Fl,Arlington,TX,,AL


The start and end columns show the first and last games played at the park, however we will be able to derive this information by looking at the park information for each game. Similarly, the league information is going to be available elsewhere in our database.
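
As a sketch of how the first and last game at each park could be derived from game data, consider the query below. It runs against a tiny invented `games_demo` table with made-up game rows, not the real `game_log`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE games_demo (game_id TEXT, date INTEGER, park_id TEXT)")
conn.executemany(
    "INSERT INTO games_demo VALUES (?, ?, ?)",
    [
        # Made-up rows for illustration only
        ("AAA188205010", 18820501, "ALB01"),
        ("AAA188205300", 18820530, "ALB01"),
        ("BBB188405310", 18840531, "ALT01"),
    ],
)

# yyyymmdd integers sort chronologically, so MIN/MAX give the first and last game
query = """
    SELECT park_id, MIN(date) AS first_game, MAX(date) AS last_game
    FROM games_demo
    GROUP BY park_id
    ORDER BY park_id
"""
park_spans = conn.execute(query).fetchall()
print(park_spans)
```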

* Looking at the various files, look for opportunities to normalize the data and record your observations in a markdown cell.

In [29]:
person_codes.to_sql('person_codes', con)
team_codes.to_sql('team_codes', con)

In [31]:
run_query('SELECT * FROM person_codes LIMIT 5')

Unnamed: 0,index,id,last,first,player_debut,mgr_debut,coach_debut,ump_debut
0,0,aardd001,Aardsma,David,04/06/2004,,,
1,1,aaroh101,Aaron,Hank,04/13/1954,,,
2,2,aarot101,Aaron,Tommie,04/10/1962,,04/06/1979,
3,3,aased001,Aase,Don,07/26/1977,,,
4,4,abada001,Abad,Andy,09/10/2001,,,


In [33]:
run_query('SELECT * FROM team_codes LIMIT 5')

Unnamed: 0,index,team_id,league,start,end,city,nickname,franch_id,seq
0,0,ALT,UA,1884,1884,Altoona,Mountain Cities,ALT,1
1,1,ARI,NL,1998,0,Arizona,Diamondbacks,ARI,1
2,2,BFN,NL,1879,1885,Buffalo,Bisons,BFN,1
3,3,BFP,PL,1890,1890,Buffalo,Bisons,BFP,1
4,4,BL1,,1872,1874,Baltimore,Canaries,BL1,1


In [38]:
ten_cols = []
for i, colname in enumerate(game_log.columns):
    
    if i != 0 and i % 9 == 0:
        print(ten_cols)
        ten_cols = []
    elif i == (len(game_log.columns)-1):
        print(ten_cols)
    else:
        ten_cols.append(colname)
    

['date', 'number_of_game', 'day_of_week', 'v_name', 'v_league', 'v_game_number', 'h_name', 'h_league', 'h_game_number']
['h_score', 'length_outs', 'day_night', 'completion', 'forefeit', 'protest', 'park_id', 'attendance']
['v_line_score', 'h_line_score', 'v_at_bats', 'v_hits', 'v_doubles', 'v_triples', 'v_homeruns', 'v_rbi']
['v_sacrifice_flies', 'v_hit_by_pitch', 'v_walks', 'v_intentional_walks', 'v_strikeouts', 'v_stolen_bases', 'v_caught_stealing', 'v_grounded_into_double']
['v_left_on_base', 'v_pitchers_used', 'v_individual_earned_runs', 'v_team_earned_runs', 'v_wild_pitches', 'v_balks', 'v_putouts', 'v_assists']
['v_passed_balls', 'v_double_plays', 'v_triple_plays', 'h_at_bats', 'h_hits', 'h_doubles', 'h_triples', 'h_homeruns']
['h_sacrifice_hits', 'h_sacrifice_flies', 'h_hit_by_pitch', 'h_walks', 'h_intentional_walks', 'h_strikeouts', 'h_stolen_bases', 'h_caught_stealing']
['h_first_catcher_interference', 'h_left_on_base', 'h_pitchers_used', 'h_individual_earned_runs', 'h_team_ea
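
Note that the loop above never appends the column whose index is a multiple of nine (for example, `v_score` is absent from the second list), and it also skips the final column. A small sketch that prints every name in fixed-size chunks is shown here on a short stand-in list rather than the real `game_log.columns`; `chunked` is a hypothetical helper.

```python
def chunked(items, size):
    """Yield consecutive slices of `items`, each at most `size` long."""
    items = list(items)
    for start in range(0, len(items), size):
        yield items[start:start + size]

demo_columns = ["date", "number_of_game", "day_of_week", "v_name", "v_league"]
for chunk in chunked(demo_columns, 2):
    print(chunk)
```

Calling `chunked(game_log.columns, 10)` in the notebook would list all 161 columns, ten at a time.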

In [39]:
'v_player_2_id' in game_log.columns

True

In [37]:
team_codes.head()

Unnamed: 0,team_id,league,start,end,city,nickname,franch_id,seq
0,ALT,UA,1884,1884,Altoona,Mountain Cities,ALT,1
1,ARI,NL,1998,0,Arizona,Diamondbacks,ARI,1
2,BFN,NL,1879,1885,Buffalo,Bisons,BFN,1
3,BFP,PL,1890,1890,Buffalo,Bisons,BFP,1
4,BL1,,1872,1874,Baltimore,Canaries,BL1,1


## Note: Ideation on database normalization

### from `game_log` table

* `manager` table
  * `manager_id`, `name`
  * PRIMARY KEY = `manager_id`

* `player` table
  * `player_id`, `name`, `def_pos`, `off_pos`
  * PRIMARY KEY = `player_id`

* `umpire` table
  * `umpire_id`, `name`
  * PRIMARY KEY = `umpire_id`
* `match_meta` table >>> ORIGIN
  * `game_id`, `date`, `h_name`, `v_name`, `h_line_score`, `v_line_score`,`length_outs`, `day_night`, `completion`, `forefeit`, `protest`, `park_id`, `attendance`, `hp_umpire_id`, `1b_umpire_id`, `2b_umpire_id`, `3b_umpire_id`, `lf_umpire_id`, `rf_umpire_id`
  * PRIMARY KEY = `game_id`

* `match_lineup` table
  * `game_id`
  * `v_manager_id`, `h_manager_id`
  * `winning_pitcher_id`, `losing_pitcher_id`, `saving_pitcher_id`
  * `winning_rbi_batter_id`
  * `v_starting_pitcher_id`, `h_starting_pitcher_id`
  * `v_player_1_id`, `v_player_1_def_pos`, `v_player_2_id`, `v_player_2_def_pos`, `v_player_3_id`, `v_player_3_def_pos`, `v_player_4_id`, `v_player_4_def_pos`, `v_player_5_id`, `v_player_5_def_pos`, `v_player_6_id`, `v_player_6_def_pos`, `v_player_7_id`, `v_player_7_def_pos`, `v_player_8_id`,`v_player_8_def_pos`, `v_player_9_id`, `v_player_9_def_pos`
  * `h_player_1_id`, `h_player_1_def_pos`, `h_player_2_id`, `h_player_2_def_pos`, `h_player_3_id`, `h_player_3_def_pos`, `h_player_4_id`, `h_player_4_def_pos`, `h_player_5_id`, `h_player_5_def_pos`, `h_player_6_id`, `h_player_6_def_pos`, `h_player_7_id`, `h_player_7_def_pos`, `h_player_8_id`,`h_player_8_def_pos`, `h_player_9_id`, `h_player_9_def_pos`
  * PRIMARY KEY = `game_id`

* `match_def_stat` table
  * `game_id`, `h_name`, `v_name`
  * `h_putouts`, `h_assists`, `h_passed_balls`, `h_double_plays`, `h_triple_plays`
  * `v_putouts`, `v_assists`, `v_passed_balls`, `v_double_plays`, `v_triple_plays`
  * PRIMARY KEY = `game_id`, `h_name`, `v_name`

* `match_off_stat` table
  * `game_id`, `h_name`, `v_name`
  * `h_at_bats`, `h_hits`, `h_doubles`, `h_triples`, `h_homeruns`, `h_rbi`, `h_sacrifice_flies`, `h_hit_by_pitch`, `h_walks`, `h_intentional_walks`, `h_strikeouts`, `h_stolen_bases`, `h_caught_stealing`, `h_grounded_into_double`, `h_left_on_base`
  * `v_at_bats`, `v_hits`, `v_doubles`, `v_triples`, `v_homeruns`, `v_rbi`, `v_sacrifice_flies`, `v_hit_by_pitch`, `v_walks`, `v_intentional_walks`, `v_strikeouts`, `v_stolen_bases`, `v_caught_stealing`, `v_grounded_into_double`, `v_left_on_base`
  * PRIMARY KEY = `game_id`, `h_name`, `v_name`

* `match_pitch_stat` table
  * `game_id`, `h_name`, `v_name`
  * `h_pitchers_used`, `h_individual_earned_runs`, `h_team_earned_runs`, `h_wild_pitches`, `h_balks`
  * `v_pitchers_used`, `v_individual_earned_runs`, `v_team_earned_runs`, `v_wild_pitches`, `v_balks`
  * PRIMARY KEY = `game_id`, `h_name`, `v_name`
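
Each table idea above names a candidate primary key. Before committing to one, it can help to confirm that the chosen columns really identify a row uniquely. The following is a minimal sketch of such a check using Python's built-in `sqlite3` module. The helper `is_unique_key`, the table `game_log_sample`, and the three rows in it are all hypothetical and exist only for illustration; the notes assume a `game_id` column is available, and how it is built is not shown here.

```python
import sqlite3


def is_unique_key(conn, table, columns):
    """Return True if no two rows in `table` share the same values for `columns`.

    Table and column names are interpolated directly into the SQL text,
    so only pass trusted identifiers.
    """
    cols = ", ".join(columns)
    total = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    distinct = conn.execute(
        f"SELECT COUNT(*) FROM (SELECT DISTINCT {cols} FROM {table})"
    ).fetchone()[0]
    return total == distinct


# Made-up sample rows (not real Retrosheet data).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE game_log_sample (game_id TEXT, h_name TEXT, v_name TEXT, park_id TEXT)"
)
conn.executemany(
    "INSERT INTO game_log_sample VALUES (?, ?, ?, ?)",
    [
        ("G1", "AAA", "BBB", "P01"),
        ("G2", "AAA", "BBB", "P01"),  # same two teams meet again
        ("G3", "BBB", "AAA", "P02"),
    ],
)

game_id_ok = is_unique_key(conn, "game_log_sample", ["game_id"])
teams_ok = is_unique_key(conn, "game_log_sample", ["h_name", "v_name"])
print(game_id_ok, teams_ok)
```

In this toy sample, `game_id` alone is unique while the pair of team names is not. If the same holds for the real data, `game_id` by itself may be enough as a primary key for the per-game tables.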

## Planning a Normalized Schema

Now that we've started to think about normalization ideas, it's time to start planning our schema. The best way to do this is to work visually with a schema diagram, just like the ones we've used so far in this course. Start by creating a diagram of the four existing tables and their columns, and then gradually create new tables that move the data into a more normalized state.<br>

Some people like to do this on paper, while others use diagramming tools like Sketch or Figma, or image editors like Photoshop. We recommend using a schema design tool like [DbDesigner.net](https://dbdesigner.net/). This free tool lets you create a schema and draws lines that show foreign key relations clearly.

![dbdesigner](https://s3.amazonaws.com/dq-content/193/dbdesigner-screenshot.png)

In the end, you should choose the tool that lets you work quickly as you plan out your schema.

Here are some tips when planning out your schema:

* **Don't be afraid to experiment**. It's unlikely that your first few steps will survive into your finished product, so try things and see how they look.
* If you're using a tool like DbDesigner, which automatically draws lines for foreign key relationships, don't worry if your lines look messy. This is normal; you can move the tables around to neaten things up at the end, but don't waste time on it while you are still normalizing.
* The following facts about the data may help you with your normalization decisions:
  * **Historically, teams sometimes move between leagues.**
  * **The same person might be in a single game as both a player and a manager.**
  * **Because of how pitchers are represented in the game log, not all pitchers used in a game will be shown. We only want to worry about the pitchers mentioned via position or the 'winning pitcher'/ 'losing pitcher'.**
* It is possible to over-normalize. We want to **finish with about 7-8 tables total**.
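
To make one of the facts above concrete: because the same person can be both a player and a manager in one game, a design with separate `player` and `manager` tables would store that person twice. The following is a minimal sketch, not the project's suggested schema, of one possible alternative using Python's built-in `sqlite3` module. The table names `person` and `person_appearance`, the `role` column, and the sample values are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default

# Hypothetical tables: one row per person, one row per (game, person, role).
conn.executescript("""
CREATE TABLE person (
    person_id TEXT PRIMARY KEY,
    name      TEXT
);
CREATE TABLE person_appearance (
    game_id   TEXT,
    person_id TEXT REFERENCES person(person_id),
    role      TEXT,
    PRIMARY KEY (game_id, person_id, role)
);
""")

# Made-up sample data: one person fills two roles in the same game.
conn.execute("INSERT INTO person VALUES ('p001', 'Example Person')")
conn.executemany(
    "INSERT INTO person_appearance VALUES (?, ?, ?)",
    [("G1", "p001", "manager"), ("G1", "p001", "player")],
)

roles = [
    row[0]
    for row in conn.execute(
        "SELECT role FROM person_appearance "
        "WHERE game_id = 'G1' AND person_id = 'p001' ORDER BY role"
    )
]
print(roles)
```

Including `role` in the composite primary key is what allows two rows for the same person in the same game; a key of only `game_id` and `person_id` would reject the second insert.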

Lastly, we advise spending 60 to 90 minutes planning your schema. In the next step, we will introduce the suggested schema that we will work with for the rest of the project, but working out as much of it as you can yourself is highly recommended.

* Using whichever design tool you feel most comfortable with, plan a schema for our baseball database.
* When you are happy with your schema, [insert a screenshot or photo into a markdown cell](https://daringfireball.net/projects/markdown/syntax#img).