# database-normalization-and-relations

* [Go direct to summary note](#summary)

In the previous mission, we learned how to create a foreign key to reference a record in another table and how to use joins to query across tables using the foreign key. In this mission, we'll dive more deeply into relations between tables, learn about data normalization, and see how we can take advantage of both to perform more complex joins.

* In this mission, we'll work with data on Academy Award nominations from 2001 to 2010 for just the lead and supporting acting roles. The Academy Awards, also known as the Oscars, is an annual awards ceremony held to recognize achievements in the film industry. There are many different award categories, and the members of the Academy vote every year to decide which artist or film should get each award. The full dataset, containing data on all award categories from years 1927 to 2010, can be found [here](https://www.aggdata.com/awards/oscar). We've cleaned and transformed the data and created `academy_awards.db`.

The database file `academy_awards.db` contains 2 tables:

* `nominations`, where each row describes an individual actor's nomination.
* `ceremonies`, where each row describes an individual Academy Awards ceremony.

The `nominations` table has the following schema:

* `id` - integer field, primary key for uniquely identifying rows.
* `ceremony_id` - integer field, foreign key reference to the `id` column from the `ceremonies` table.
* `category`: text field, award category. Can only be one of the following 4 values:
    * `Actor -- Leading Role`.
    * `Actor -- Supporting Role`.
    * `Actress -- Leading Role`.
    * `Actress -- Supporting Role`.
* `nominee`: text field, name of the actor or actress.
* `movie`: text field, name of the movie the actor or actress was nominated for.
* `character`: text field, name of the character this actor or actress played.
* `won`: Boolean field, if the actor or actress won the award (either 0 or 1).

The `won` column is specified as Boolean in the schema, but since **SQLite doesn't have a Boolean type**, SQLite uses the integer data type instead. The integer `0` represents False while the integer `1` represents True.
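The point about Boolean storage can be checked with a short sketch. It uses a throwaway in-memory database and a hypothetical `demo` table (not part of `academy_awards.db`), so treat it as an illustration rather than part of the mission's data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The column is declared BOOLEAN, but SQLite has no separate Boolean
# storage class, so the values end up stored as integers.
conn.execute("CREATE TABLE demo (id INTEGER PRIMARY KEY, won BOOLEAN)")
conn.executemany("INSERT INTO demo (won) VALUES (?)", [(True,), (False,)])

# typeof() reports the storage class SQLite actually used for each value.
rows = conn.execute("SELECT won, typeof(won) FROM demo ORDER BY id").fetchall()
print(rows)
conn.close()
```

Python's `True` and `False` come back as `1` and `0`, which is why the `won` column can be filtered with `won == 1`.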

In [104]:
import pandas as pd
import numpy as np
import re

In [137]:
academy_awards = pd.read_csv('data/academy_awards.csv',
                            encoding='latin1')

In [138]:
academy_awards.head(3)

Unnamed: 0,Year,Category,Nominee,Additional Info,Won?,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10
0,2010 (83rd),Actor -- Leading Role,Javier Bardem,Biutiful {'Uxbal'},NO,,,,,,
1,2010 (83rd),Actor -- Leading Role,Jeff Bridges,True Grit {'Rooster Cogburn'},NO,,,,,,
2,2010 (83rd),Actor -- Leading Role,Jesse Eisenberg,The Social Network {'Mark Zuckerberg'},NO,,,,,,


### Ceremonies

Since awards are only given to one winner, 4 nominees for each award lose and 1 nominee wins. You'll notice that among the nominees for each award, there was 1 nominee with a value of 1 for won and 4 nominees with a value of 0 for won. You may have also noticed that the movie `The King's Speech` shows up twice. This is because separate actors were nominated for different roles in the same movie.<br>

The `ceremonies` table has the following schema:

* `id` - integer field, primary key for uniquely identifying rows.
* `year` - integer field, the year of the ceremony.
  * years range from 2000 to 2010.
* `host` - text field, the host for that ceremony.


In [139]:
year = range(2000, 2011)
host = ['Billy Crystal', 'Steve Martin', 'Whoopi Goldberg',
       'Steve Martin', 'Billy Crystal', 'Chris Rock',
       'Jon Stewart', 'Ellen DeGeneres', 'Jon Stewart',
       'Hugh Jackman', 'Steve Martin']

ceremonies = pd.DataFrame(columns=['id','year', 'host'])
ceremonies['year'] = year
ceremonies['host'] = host
ceremonies['id'] = ceremonies.index + 1

Here's what the first 3 rows of the `ceremonies` table, which only contains 11 rows, look like:

In [140]:
ceremonies.head(3)

Unnamed: 0,id,year,host
0,1,2000,Billy Crystal
1,2,2001,Steve Martin
2,3,2002,Whoopi Goldberg


### Clean and re-format data


In [141]:
academy_awards = \
    academy_awards.rename(columns={
            'Year':'year',
            'Category':'category',
            'Nominee':'nominee',
            'Additional Info':'movie',
            'Won?':'won',
            'Unnamed: 5':'character'
        })

In [142]:
academy_awards.drop(['Unnamed: 6',
                    'Unnamed: 7',
                    'Unnamed: 8',
                    'Unnamed: 9',
                    'Unnamed: 10'],
                    axis=1,
                   inplace=True)

In [143]:
academy_awards.movie.unique().tolist()[:10]

["Biutiful {'Uxbal'}",
 "True Grit {'Rooster Cogburn'}",
 "The Social Network {'Mark Zuckerberg'}",
 "The King's Speech {'King George VI'}",
 "127 Hours {'Aron Ralston'}",
 "The Fighter {'Dicky Eklund'}",
 "Winter's Bone {'Teardrop'}",
 "The Town {'James Coughlin'}",
 "The Kids Are All Right {'Paul'}",
 "The King's Speech {'Lionel Logue'}"]

In [144]:
academy_awards['is_actor'] = academy_awards.category.map(lambda x: 1 if x.startswith('Act') else 0)

In [145]:
academy_awards = academy_awards[academy_awards.is_actor==1]

In [146]:
academy_awards['movie'].apply(lambda x: 1 if type(x) == str else 0).value_counts()

1    1568
0      15
Name: movie, dtype: int64

In [147]:
academy_awards['is_movie'] = academy_awards['movie'].apply(lambda x: 1 if type(x) == str else 0)

In [148]:
academy_awards = academy_awards[academy_awards.is_movie == 1]

In [149]:
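# collect any remaining float (NaN) entries; report only if none are left
float_movies = [mov for mov in academy_awards.movie if type(mov) == float]
for mov in float_movies:
    print(mov, type(mov))
if not float_movies:
    print('Non float.')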

Non float.


In [150]:
academy_awards.head(3)

Unnamed: 0,year,category,nominee,movie,won,character,is_actor,is_movie
0,2010 (83rd),Actor -- Leading Role,Javier Bardem,Biutiful {'Uxbal'},NO,,1,1
1,2010 (83rd),Actor -- Leading Role,Jeff Bridges,True Grit {'Rooster Cogburn'},NO,,1,1
2,2010 (83rd),Actor -- Leading Role,Jesse Eisenberg,The Social Network {'Mark Zuckerberg'},NO,,1,1


In [151]:
academy_awards['year'] = academy_awards['year'].apply(lambda x: x[:4]).astype(int)

In [152]:
academy_awards = academy_awards[academy_awards['year'] >= 2000]

In [153]:
academy_awards['character'] = academy_awards['movie'].apply(lambda x: re.findall(r'\{.*\}', x)[0][2:-2])

In [154]:
academy_awards['movie'] = academy_awards['movie'].apply(lambda x: x.split('{')[0][:-1])

In [155]:
nominations = academy_awards.copy()

In [156]:
#nominations['ceremony_id'] = nominations.ceremony_id.astype(int)

In [157]:
nominations['year'].unique()

array([2010, 2009, 2008, 2007, 2006, 2005, 2004, 2003, 2002, 2001, 2000])

In [158]:
nominations['won'].unique()

array(['NO', 'YES'], dtype=object)

In [159]:
nominations['won'] = nominations['won'].map({'YES':1, 'NO':0})

In [160]:
nominations['id'] = nominations.index+1

In [161]:
# re-select & order the columns of dataframe
nominations = nominations[['id', 'year', 'category',
                          'nominee', 'movie', 'character', 'won']]

### nominations dataframe preprocess done.

Here's what the first 3 rows in the `nominations` table look like:

In [162]:
nominations.head(3)

Unnamed: 0,id,year,category,nominee,movie,character,won
0,1,2010,Actor -- Leading Role,Javier Bardem,Biutiful,Uxbal,0
1,2,2010,Actor -- Leading Role,Jeff Bridges,True Grit,Rooster Cogburn,0
2,3,2010,Actor -- Leading Role,Jesse Eisenberg,The Social Network,Mark Zuckerberg,0


### check if there is any nan value in dataframe.

In [163]:
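# list the columns that contain NaN; report only if there are none
nan_cols = [col for col in nominations.columns
            if nominations[col].isnull().any()]
for col in nan_cols:
    print(col)
if not nan_cols:
    print('No nan value')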

No nan value


In [164]:
# same check for the ceremonies dataframe
nan_cols = [col for col in ceremonies.columns
            if ceremonies[col].isnull().any()]
for col in nan_cols:
    print(col)
if not nan_cols:
    print('No nan value')

No nan value


### Input DataFrame data into sqlite db file.

In [182]:
import sqlite3

# load the two dataframes into the sqlite database as two tables.

In [183]:
# access empty sqlite db, already created via local cmd.
conn = sqlite3.connect('data/academy.db')
cursor = conn.cursor()

In [184]:
sql = 'drop table if exists ceremonies;'
cursor.execute(sql); conn.commit()

In [185]:
sql = '''
        create table ceremonies (
            id int,
            year int,
            host varchar,
            primary key(id)
        );
'''

cursor.execute(sql)
conn.commit()

In [186]:
sql = 'drop table if exists nominations;'
cursor.execute(sql); conn.commit()

In [187]:
sql = '''
        create table nominations (
            id int,
            year int,
            category varchar,
            nominee varchar,
            movie varchar,
            character varchar,
            won int,
            primary key(id)
        );
'''

cursor.execute(sql)
conn.commit()

In [191]:
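# insert data into table 'ceremonies' in academy.db
# '?' placeholders let sqlite3 handle quoting of the values.

for i in range(ceremonies.shape[0]):

    row = ceremonies.iloc[i]

    sql = '''
            insert into ceremonies
            values (?, ?, ?)
    '''

    # cast numpy integers to plain int so sqlite3 can bind them
    cursor.execute(sql, (int(row['id']), int(row['year']), row['host']))

conn.commit()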

In [188]:
# insert data into table 'nominations' in academy.db
# '?' placeholders keep values containing quote characters safe.

for i in range(nominations.shape[0]):

    row = nominations.iloc[i]

    sql = '''
            insert into nominations
            values (?, ?, ?, ?, ?, ?, ?)
    '''

    # cast numpy integers to plain int so sqlite3 can bind them
    cursor.execute(sql, (int(row['id']), int(row['year']), row['category'],
                         row['nominee'], row['movie'], row['character'],
                         int(row['won'])))

conn.commit()

In [189]:
sql = 'pragma table_info(nominations);'
pd.read_sql(sql, conn)

Unnamed: 0,cid,name,type,notnull,dflt_value,pk
0,0,id,int,0,,1
1,1,year,int,0,,0
2,2,category,varchar,0,,0
3,3,nominee,varchar,0,,0
4,4,movie,varchar,0,,0
5,5,character,varchar,0,,0
6,6,won,int,0,,0


In [192]:
sql = 'pragma table_info(ceremonies);'
pd.read_sql(sql, conn)

Unnamed: 0,cid,name,type,notnull,dflt_value,pk
0,0,id,int,0,,1
1,1,year,int,0,,0
2,2,host,varchar,0,,0


In [193]:
sql = 'select * from nominations limit 3'
pd.read_sql(sql, conn)

Unnamed: 0,id,year,category,nominee,movie,character,won
0,1,2010,Actor -- Leading Role,Javier Bardem,Biutiful,Uxbal,0
1,2,2010,Actor -- Leading Role,Jeff Bridges,True Grit,Rooster Cogburn,0
2,3,2010,Actor -- Leading Role,Jesse Eisenberg,The Social Network,Mark Zuckerberg,0


In [194]:
sql = 'select * from ceremonies limit 3'
pd.read_sql(sql, conn)

Unnamed: 0,id,year,host
0,1,2000,Billy Crystal
1,2,2001,Steve Martin
2,3,2002,Whoopi Goldberg


### Data transporting done.
### Now we return to dataquest and continue the course.

The `ceremonies` table contains just the information on the actual awards ceremony while the `nominations` table contains just the information on individual nominations. If we had instead stored the `year` and `host` values as columns within the `nominations` table and avoided using a `ceremonies` table altogether, our `nominations` table would look like this:

In [196]:
sql = '''
        select * from nominations N
        inner join ceremonies C
        on N.year == C.year
        limit 3
'''
pd.read_sql(sql, conn)

Unnamed: 0,id,year,category,nominee,movie,character,won,id.1,year.1,host
0,1,2010,Actor -- Leading Role,Javier Bardem,Biutiful,Uxbal,0,11,2010,Steve Martin
1,2,2010,Actor -- Leading Role,Jeff Bridges,True Grit,Rooster Cogburn,0,11,2010,Steve Martin
2,3,2010,Actor -- Leading Role,Jesse Eisenberg,The Social Network,Mark Zuckerberg,0,11,2010,Steve Martin


While this representation is easier to query, since you don't have to do a join each time you want to get the `year` or `host` information, **it has a few problems:**

* it contains a lot of redundant data, which means the database will take up more disk space and cost more to store.
* if we want to update or remove a value in the `year` or `host` columns, we need to update every row that contains that same value. This is cumbersome to remember and can cause human error.
* updating or removing many rows can be slow for larger databases. As your data grows bigger, which is often the case with databases used in production, the update and removal speeds become significantly worse.

**We instead chose to normalize the data**, which involves:

1. separating data into smaller tables with less redundant information and
2. creating relations between the appropriate tables.

By having the `year` and `host` columns in a separate `ceremonies` table, we get the following benefits:

* much less data redundancy since the actual values for `year` and `host` are only stored in 1 row in `ceremonies`, instead of replicated for each relevant row in nominations.
* separation of concerns and ease of updating.

You can read more about the benefits of database normalization [here](https://en.wikipedia.org/wiki/Database_normalization#Objectives).
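The "ease of updating" benefit can be made concrete with a small sketch. It builds both designs in an in-memory database; the table name `nominations_flat` and the replacement host value `'S. Martin'` are hypothetical and only serve the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Denormalized: year and host are repeated on every nomination row.
    CREATE TABLE nominations_flat (
        id INTEGER PRIMARY KEY, year INTEGER, host TEXT, nominee TEXT
    );
    -- Normalized: year and host are stored once per ceremony.
    CREATE TABLE ceremonies (id INTEGER PRIMARY KEY, year INTEGER, host TEXT);
    CREATE TABLE nominations (
        id INTEGER PRIMARY KEY,
        ceremony_id INTEGER REFERENCES ceremonies(id),
        nominee TEXT
    );
""")

conn.execute("INSERT INTO ceremonies VALUES (11, 2010, 'Steve Martin')")
for name in ["Javier Bardem", "Jeff Bridges", "Jesse Eisenberg"]:
    conn.execute(
        "INSERT INTO nominations_flat (year, host, nominee) "
        "VALUES (2010, 'Steve Martin', ?)", (name,))
    conn.execute(
        "INSERT INTO nominations (ceremony_id, nominee) VALUES (11, ?)", (name,))

# Changing the host touches every matching row in the flat table...
flat_changed = conn.execute(
    "UPDATE nominations_flat SET host = 'S. Martin' WHERE year = 2010").rowcount
# ...but only a single row in the normalized design.
normalized_changed = conn.execute(
    "UPDATE ceremonies SET host = 'S. Martin' WHERE year = 2010").rowcount
print(flat_changed, normalized_changed)
```

With only three nominations the difference is small, but the number of rows touched in the flat table grows with every nomination added for that year, while the normalized update stays at one row.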

There are many types of relations you can create between tables to represent the links between columns. In this mission, we'll focus on the 2 most common relations:

* one-to-many.
* many-to-many.

### A one-to-many relation
A one-to-many relation exists whenever many rows in one table need to relate to one row in the other table. The relation between `ceremonies` and `nominations` is a one-to-many relation since many rows in the `nominations` table can be linked to an individual row in the `ceremonies` table. A row in the `ceremonies` table contains no reference to the `nominations` table. However, many rows in the `nominations` table contain a reference to the `ceremonies` table using the `ceremony_id` **foreign key**.<br>

Below is a diagram that demonstrates how multiple rows in the `nominations` table, that share the same `ceremony_id` of `10`, relate to the row in the `ceremonies` table whose id is `10`:

![](img/2.png)

An important thing to remember in a one-to-many relation is that the reference is one-sided. The `nominations` table contains a foreign key reference to the `id` column in ceremonies but the `ceremonies` table contains no references to values in the `nominations` table.<br>
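As a rough sketch of this one-sided reference, the snippet below declares `ceremony_id` as a foreign key in an in-memory database. It assumes the SQLite build supports foreign keys (the default) and turns enforcement on with `PRAGMA foreign_keys`; the ceremony id `99` and the nominee `'Nobody'` are made-up values used only to trigger the check.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE ceremonies (
        id INTEGER PRIMARY KEY, year INTEGER, host TEXT
    );
    -- Only the "many" side carries the reference.
    CREATE TABLE nominations (
        id INTEGER PRIMARY KEY,
        ceremony_id INTEGER NOT NULL REFERENCES ceremonies(id),
        nominee TEXT
    );
    INSERT INTO ceremonies VALUES (11, 2010, 'Steve Martin');
    INSERT INTO nominations (ceremony_id, nominee) VALUES (11, 'Javier Bardem');
    INSERT INTO nominations (ceremony_id, nominee) VALUES (11, 'Jeff Bridges');
""")

# A nomination pointing at a ceremony that does not exist is rejected.
try:
    conn.execute(
        "INSERT INTO nominations (ceremony_id, nominee) VALUES (99, 'Nobody')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# Many nominations resolve to the same single ceremony row.
count = conn.execute("""
    SELECT COUNT(*) FROM nominations
    INNER JOIN ceremonies ON nominations.ceremony_id = ceremonies.id
    WHERE ceremonies.id = 11
""").fetchone()[0]
print(rejected, count)
```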

Here are some other examples of one-to-many relations:

* a car insurance policy can have multiple people on it, but each person can only belong to one policy.
* a mother can have many children, but each child can only have one birth mother.
* a reporter can have many articles but each article can only have one associated reporter.

### As with many things in software development, there is a tradeoff to database normalization.
We need to write **longer queries** sometimes and **use joins more often** to grab information from multiple tables. Many companies have databases with hundreds or thousands of tables with many relations in between, so **this can get complicated quickly!** As you become more familiar with querying normalized databases, you'll be able to overcome the added complexity much more easily.<br>

To write a query that involves 2 tables that are in a one-to-many relation, **you need to join on the foreign key column that the "many" side uses to reference the "one" side**. When using the `WHERE` statement to express filtering criteria, we can use columns from both tables. For example, to return all of the movies that won an award in 2010, we'd need to write the following query:


In [198]:
sql = '''
        select movie from nominations
        inner join ceremonies
        on nominations.year == ceremonies.year
        where ceremonies.year == 2010
        and nominations.won == 1;
'''
pd.read_sql(sql, conn)

Unnamed: 0,movie
0,Black Swan
1,The Fighter
2,The Fighter
3,The King's Speech


In the `WHERE` statement, we expressed that we were only interested in rows where the `year` value was `2010` from the `ceremonies` table and where the `won` value was `1` from the `nominations` table. <br>

When joining 2 tables, you can be more explicit about which columns you want returned from which tables using dot notation -- e.g. `nominations.movie`. In the following query, we modified the earlier query to select the `year` and `host` columns from `ceremonies` and the `movie` and `nominee` columns from `nominations`:

In [199]:
sql = '''
        select ceremonies.year,
            ceremonies.host,
            nominations.movie,
            nominations.nominee
        from nominations
        inner join ceremonies
        on nominations.year ==
            ceremonies.year
        where ceremonies.year == 2010
        limit 3;
'''
pd.read_sql(sql, conn)

Unnamed: 0,year,host,movie,nominee
0,2010,Steve Martin,The Fighter,Amy Adams
1,2010,Steve Martin,The Kids Are All Right,Annette Bening
2,2010,Steve Martin,The Fighter,Christian Bale


In the denormalized schema, which had the `year` and `host` columns in `nominations` itself, we'd only need to write the following query to accomplish the same result:

```sql
        select year, host, movie, nominee from nominations
        where nominations.year == 2010
        limit 3;
```

#### instructions
* We've imported the `sqlite3` library for you already and connected to the `academy_awards.db` database. The Connection instance is named `conn`. <br>
* Write a query that returns all of the years that the actress `Natalie Portman` was nominated for an award.
  * Only return the `year` column from `ceremonies` and the `movie` column from `nominations`.
  * Run the query and assign the full results list to the variable `portman_movies`.
  * Then display `portman_movies` using the `print` function.

In [200]:
sql = '''
        select C.year, N.movie
        from ceremonies C
        inner join nominations N
        on C.year == N.year
        where N.nominee == "Natalie Portman"
'''

portman_movies = conn.execute(sql).fetchall()

for mov in portman_movies:
    print(mov)

(2010, 'Black Swan')
(2004, 'Closer')


If we wanted to extend our analysis to study how Academy Awards affect a nominee's career, we'd need to first add more data on which movies each actor starred in. We need a way to represent the relation between `actors` and `movies`. Our first instinct might be to use a `movies` table, an `actors` table, and specify a one-to-many relationship between them. The `movies` table could contain an `actor_id` field that acts as a **foreign key** reference to the `id` column from the `actors` table. <br>

We immediately run into a roadblock. Each movie contains many actors, and since the `actor_id` column would be an integer field, we have no way to reference multiple rows from the `actors` table. We could add a separate row to `movies` for each actor, each with a different value for `actor_id`, and cover all the actors in the movie that way. This unfortunately means a large amount of data duplication, since the rest of the columns describing that movie probably won't be different.

What if we had a list data type where we could store multiple values:

id|movie|actors
---|---|---
1|The Fighter|Christian Bale, Amy Adams, ...
2|Doubt|Meryl Streep, Amy Adams, ...
3|Junebug|Embeth Davidtz, Amy Adams, ...

SQLite unfortunately doesn't contain a list data type, so we can't simply store the list of actor names. While some other databases do contain a list data type, this is still a poor design for our data. While searching for a movie by name would be a simple `SELECT` query, searching by actors would be incredibly cumbersome and slow.<br>

You may have noticed that the actress **Amy Adams** stars in all 3 of the movies above. If we wanted to write a query that searched every element in the `actors` list for every row in `movies`, the query would take a long time to return as our table starts to hold more than a few thousand movies. We can't use a one-to-many relation since SQLite, and many databases, don't contain a list data type and it would be inefficient to query.
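To illustrate why the list-in-a-column design is awkward to query, here is a minimal sketch that stores the actor lists from the table above as plain text (shortened to the names shown there) in an in-memory database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE movies (id INTEGER PRIMARY KEY, movie TEXT, actors TEXT)")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?, ?)",
    [
        (1, "The Fighter", "Christian Bale, Amy Adams"),
        (2, "Doubt", "Meryl Streep, Amy Adams"),
        (3, "Junebug", "Embeth Davidtz, Amy Adams"),
    ],
)

# Searching by movie name is an exact comparison on one column.
by_movie = conn.execute(
    "SELECT id FROM movies WHERE movie = 'Doubt'").fetchall()

# Searching by actor needs a substring match against every row's list,
# and a looser pattern could also match unrelated, similar names.
by_actor = conn.execute(
    "SELECT movie FROM movies WHERE actors LIKE '%Amy Adams%' ORDER BY id"
).fetchall()
print(by_movie, by_actor)
```

The `LIKE '%...%'` pattern has to inspect the text of every row, which is the cost the paragraph above describes.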

The right way to model actors and movies is to use a many-to-many relation. A **many-to-many** relation allows us to flexibly represent both:

* the actors in a movie and
* the movies an actor has starred in.

### To represent a many-to-many relation, we need to use an intermediate table called a **join table**, which we'll learn more about in the next screen.

# Many-to-many relation

To model a many-to-many relationship, we need to create a separate table that contains the foreign keys to each of the tables that we're creating a many-to-many relationship between. This table is called a **join table**, but is often referenced by [many other names](https://en.wikipedia.org/wiki/Associative_entity). The rows in the join table contain the foreign keys to the 2 other tables. Here's what a join table representing the many-to-many relationship between movies and actors would look like:

![](img/3.png)

In a many-to-many relation, we separate the data contained within the rows from the actual relation between the rows. This means we can, for example, edit a movie's name without touching the many actor records that are related to that movie. Each table above has its own `id` column:

* the `movies.id` column is used as a foreign key reference by the `movies_actors.movie_id` column.
* the `actors.id` column is used as a foreign key reference by the `movies_actors.actor_id` column.
* the `movies_actors.id` column is used just to uniquely identify each row in `movies_actors`.

The `movies_actors` table is no different than any other table in our database, and **its role as a join table between `movies` and `actors` is a design pattern.**

For example, we can add more columns to the `movies_actors` table just like with any other table. We could take advantage of this to add attributes that are specific to that movie-actor combination (e.g. `Salary` or `Awards Nominated`). <br>

Creating a join table is similar to creating a regular table except that **there need to be 2 foreign key columns that reference the 2 tables in the many-to-many relationship:**
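A minimal sketch of such a join table, built in an in-memory database, might look like the following. The `nominated` column and the `UNIQUE` constraint are assumptions added for illustration (they are not part of the tables created later in these notes), and only a handful of rows are loaded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE movies (id INTEGER PRIMARY KEY, movie TEXT);
    CREATE TABLE actors (id INTEGER PRIMARY KEY, actor TEXT);
    CREATE TABLE movies_actors (
        id INTEGER PRIMARY KEY,
        movie_id INTEGER REFERENCES movies(id),
        actor_id INTEGER REFERENCES actors(id),
        nominated INTEGER DEFAULT 0,   -- hypothetical per-pair attribute
        UNIQUE (movie_id, actor_id)    -- one row per movie-actor pair
    );
    INSERT INTO movies VALUES (1, 'The Fighter'), (2, 'Doubt');
    INSERT INTO actors VALUES
        (1, 'Christian Bale'), (2, 'Amy Adams'), (3, 'Meryl Streep');
    INSERT INTO movies_actors (movie_id, actor_id, nominated) VALUES
        (1, 1, 1), (1, 2, 1), (2, 2, 1), (2, 3, 1);
""")

# Direction 1: the actors in a movie.
actors_in_fighter = [r[0] for r in conn.execute("""
    SELECT actors.actor FROM movies
    INNER JOIN movies_actors ON movies.id = movies_actors.movie_id
    INNER JOIN actors ON movies_actors.actor_id = actors.id
    WHERE movies.movie = 'The Fighter'
    ORDER BY actors.id
""")]

# Direction 2: the movies an actor has starred in.
movies_of_adams = [r[0] for r in conn.execute("""
    SELECT movies.movie FROM actors
    INNER JOIN movies_actors ON actors.id = movies_actors.actor_id
    INNER JOIN movies ON movies_actors.movie_id = movies.id
    WHERE actors.actor = 'Amy Adams'
    ORDER BY movies.id
""")]
print(actors_in_fighter, movies_of_adams)
```

The same join table answers both questions; only the starting table and the filter change.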

### 1. Create many-to-many join table using pandas.

In [201]:
sql = 'drop table if exists movies_actors;'
conn.execute(sql); conn.commit()

In [202]:
nominations.nominee.value_counts()

Cate Blanchett            4
Judi Dench                4
Meryl Streep              4
Kate Winslet              4
Javier Bardem             3
Renée Zellweger           3
Helen Mirren              3
Sean Penn                 3
Laura Linney              3
Johnny Depp               3
Philip Seymour Hoffman    3
Jeff Bridges              3
George Clooney            3
Amy Adams                 3
Penélope Cruz             3
Nicole Kidman             3
Colin Firth               2
Benicio Del Toro          2
Djimon Hounsou            2
Tom Wilkinson             2
Jeremy Renner             2
Geoffrey Rush             2
Marcia Gay Harden         2
Melissa Leo               2
Marisa Tomei              2
Ed Harris                 2
Frances McDormand         2
Heath Ledger              2
Michelle Williams         2
Natalie Portman           2
                         ..
Casey Affleck             1
Mo'Nique                  1
Eddie Murphy              1
Saoirse Ronan             1
Denzel Washington   

In [118]:
nominations.movie.value_counts()

From Here to Eternity              5
Bonnie and Clyde                   5
Tom Jones                          5
Peyton Place                       5
Network                            5
All about Eve                      5
On the Waterfront                  5
Mrs. Miniver                       5
The Godfather Part II              5
Who's Afraid of Virginia Woolf?    4
My Man Godfrey                     4
Terms of Endearment                4
Coming Home                        4
The Defiant Ones                   4
Chicago                            4
A Star Is Born                     4
The Last Picture Show              4
The Turning Point                  4
Guess Who's Coming to Dinner       4
Othello                            4
Gentleman's Agreement              4
The Song of Bernadette             4
For Whom the Bell Tolls            4
Judgment at Nuremberg              4
The Godfather                      4
Kramer vs. Kramer                  4
Johnny Belinda                     4
R

In [203]:
# create movies dataframe

movies = pd.DataFrame(nominations['movie'].unique(),\
                      columns=['movie'])
movies['id'] = movies.index + 1
movies = movies[['id', 'movie']]

movies.head()

Unnamed: 0,id,movie
0,1,Biutiful
1,2,True Grit
2,3,The Social Network
3,4,The King's Speech
4,5,127 Hours


In [209]:
movies.id.dtype

dtype('int64')

In [204]:
# create actors dataframe

actors = pd.DataFrame(nominations['nominee'].unique(),\
                     columns=['actor'])
actors['id'] = actors.index + 1
actors = actors[['id', 'actor']]

actors.head()

Unnamed: 0,id,actor
0,1,Javier Bardem
1,2,Jeff Bridges
2,3,Jesse Eisenberg
3,4,Colin Firth
4,5,James Franco


In [211]:
actors.id.dtype

dtype('int64')

In [219]:
# create empty dataframe to stack movies.id and actors.id matched.

movies_actors = pd.DataFrame(columns=['movie_id', 'actor_id'])
movies_actors

Unnamed: 0,movie_id,actor_id


In [220]:
# match many-to-many relationship between movies - actors.
# matching standard = nominations (dataframe)

pairs_df = nominations[['movie', 'nominee']]
pairs_df.head()

Unnamed: 0,movie,nominee
0,Biutiful,Javier Bardem
1,True Grit,Jeff Bridges
2,The Social Network,Jesse Eisenberg
3,The King's Speech,Colin Firth
4,127 Hours,James Franco


In [221]:
actors_ids, movies_ids = [], []

for i in range(pairs_df.shape[0]):
    
    row = pairs_df.iloc[i]
    
    row_actor = row['nominee']
    row_movie = row['movie']
    
    row_actor_id = actors[actors.actor==row_actor].id.values[0]
    row_movie_id = movies[movies.movie==row_movie].id.values[0]
    
    actors_ids.append(row_actor_id)
    movies_ids.append(row_movie_id)
    

In [224]:
movies_actors['movie_id'] = movies_ids
movies_actors['actor_id'] = actors_ids
movies_actors['id'] = movies_actors.index + 1

movies_actors = movies_actors[['movie_id', 'id', 'actor_id']]
movies_actors.head()

Unnamed: 0,movie_id,id,actor_id
0,1,1,1
1,2,2,2
2,3,3,3
3,4,4,4
4,5,5,5


### Input 3 dataframes to sql tables.

In [225]:
conn.execute('drop table if exists movies'); conn.commit()
conn.execute('drop table if exists actors'); conn.commit()
conn.execute('drop table if exists movies_actors'); conn.commit()

In [226]:
sql = '''
        create table movies (
            id integer primary key,
            movie varchar
        );
'''

conn.execute(sql); conn.commit()

sql = '''
        create table actors (
            id integer primary key,
            actor varchar
        );
'''
conn.execute(sql); conn.commit()

In [227]:
# insert data into table 'movies'
# '?' placeholders keep titles containing quote characters safe.

for i in range(movies.shape[0]):

    row = movies.iloc[i]

    sql = '''
            insert into movies
            values (?, ?)
    '''

    cursor.execute(sql, (int(row['id']), row['movie']))

conn.commit()

In [228]:
# insert data into table 'actors'
# '?' placeholders keep names containing quote characters safe.

for i in range(actors.shape[0]):

    row = actors.iloc[i]

    sql = '''
            insert into actors
            values (?, ?)
    '''

    cursor.execute(sql, (int(row['id']), row['actor']))

conn.commit()

In [229]:
sql = '''
        create table movies_actors (
            id integer primary key,
            movie_id integer,
            actor_id integer,
            foreign key(movie_id) references movies(id),
            foreign key(actor_id) references actors(id)
        );
'''

conn.execute(sql); conn.commit()

In [230]:
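# insert data into table 'movies_actors'
# column order in the table is (id, movie_id, actor_id)

for i in range(movies_actors.shape[0]):

    row = movies_actors.iloc[i]

    sql = '''
            insert into movies_actors
            values (?, ?, ?)
    '''

    # cast numpy integers to plain int so sqlite3 can bind them
    cursor.execute(sql, (int(row['id']), int(row['movie_id']),
                         int(row['actor_id'])))

conn.commit()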


Let's explore the data in these tables we just discussed a bit further. We've added information about all of the actors and movies from the `nominations` table to the `movies`, `actors`, and `movies_actors` tables. This will enable us to practice using many-to-many relations.

#### instructions
* Write a query that returns the first 5 rows in `movies_actors` and assign the results to `five_join_table`.
* Write a query that returns the first 5 rows in `actors` and assign the results to `five_actors`.
* Write a query that returns the first 5 rows in `movies` and assign the results to `five_movies`.
* Then use the print function to display `five_join_table`, `five_actors`, and `five_movies`.

In [231]:
# first 5 rows in movies_actors and assign the results
sql = '''
        select * from movies_actors limit 5
'''
five_join_table = conn.execute(sql).fetchall()


# first 5 rows in actors and assign the results
sql = '''
        select * from actors limit 5
'''
five_actors = conn.execute(sql).fetchall()


# first 5 rows in movies and assign the results
sql = '''
        select * from movies limit 5
'''
five_movies = conn.execute(sql).fetchall()


# display five_join_table, five_actors, five_movies.

for j, a, m in zip(five_join_table, five_actors, five_movies):
    
    print('five_join_table :', j)
    print('five_actors :', a)
    print('five_movies :', m)

five_join_table : (1, 1, 1)
five_actors : (1, 'Javier Bardem')
five_movies : (1, 'Biutiful')
five_join_table : (2, 2, 2)
five_actors : (2, 'Jeff Bridges')
five_movies : (2, 'True Grit')
five_join_table : (3, 3, 3)
five_actors : (3, 'Jesse Eisenberg')
five_movies : (3, 'The Social Network')
five_join_table : (4, 4, 4)
five_actors : (4, 'Colin Firth')
five_movies : (4, "The King's Speech")
five_join_table : (5, 5, 5)
five_actors : (5, 'James Franco')
five_movies : (5, '127 Hours')


Recall that the values in our join table, `movies_actors`, are all just *integer* id values that refer to different rows in the `movies` and `actors` tables. If we wanted to know the actors who starred in `The Fighter` that were nominated for an Academy Award between 2001 and 2010, we'd have to use multiple joins in our query across all 3 tables.<br>

Let's first join the `movies` table with the `movies_actors` table:


In [239]:
query = '''
    SELECT 
        movies.id 'movies.id', 
        movies.movie 'movies.movie',
        movies_actors.id 'movies_actors.id',
        movies_actors.movie_id 'movies_actors.movie_id',
        movies_actors.actor_id 'movies_actors.actor_id'
    FROM movies
    INNER JOIN movies_actors
    ON movies.id == movies_actors.movie_id
    LIMIT 10
'''
pd.read_sql(query, conn)

Unnamed: 0,movies.id,movies.movie,movies_actors.id,movies_actors.movie_id,movies_actors.actor_id
0,1,Biutiful,1,1,1
1,2,True Grit,2,2,2
2,3,The Social Network,3,3,3
3,4,The King's Speech,4,4,4
4,5,127 Hours,5,5,5
5,6,The Fighter,6,6,6
6,7,Winter's Bone,7,7,7
7,8,The Town,8,8,8
8,9,The Kids Are All Right,9,9,9
9,4,The King's Speech,10,4,10


```sql
SELECT * FROM movies
INNER JOIN movies_actors ON movies.id == movies_actors.movie_id
```

![](img/4.png)

### You may have noticed that the `movies_actors.id` column skips from `5` to `10`. 
We wanted to demonstrate that there's not just one row in the result for each movie in `movies`, since the movie `The King's Speech` shows up twice in the sample results. The results of the query so far are every pairing of a row in `movies` with a row in `movies_actors` whose `movie_id` matches that movie's `id`, so a movie appears once for each of its join-table rows. <br>
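A small sketch can make the pairing of rows concrete. It compares an unconditioned `CROSS JOIN` with the `INNER JOIN ... ON` used above, on three join-table rows copied from the sample output, in an in-memory database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE movies (id INTEGER PRIMARY KEY, movie TEXT);
    CREATE TABLE movies_actors (
        id INTEGER PRIMARY KEY, movie_id INTEGER, actor_id INTEGER
    );
    INSERT INTO movies VALUES (4, 'The King''s Speech'), (6, 'The Fighter');
    INSERT INTO movies_actors VALUES (4, 4, 4), (6, 6, 6), (10, 4, 10);
""")

# Every movie paired with every join-table row: 2 * 3 combinations.
cross = conn.execute(
    "SELECT COUNT(*) FROM movies CROSS JOIN movies_actors").fetchone()[0]

# The inner join keeps only the pairs where the ids match.
inner = conn.execute("""
    SELECT movies.movie, movies_actors.id FROM movies
    INNER JOIN movies_actors ON movies.id = movies_actors.movie_id
    ORDER BY movies_actors.id
""").fetchall()

# Filtering the cross join with the same condition gives the same rows.
filtered_cross = conn.execute("""
    SELECT movies.movie, movies_actors.id FROM movies
    CROSS JOIN movies_actors
    WHERE movies.id = movies_actors.movie_id
    ORDER BY movies_actors.id
""").fetchall()
print(cross, inner)
```

`The King's Speech` appears twice in `inner` because two join-table rows reference it, which matches the repeated row in the sample results.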

We then need to join these results with the `actors` table. To do this, we add another `JOIN` statement to our query:


In [240]:
query = '''
    SELECT 
        movies.id 'movies.id', 
        movies.movie 'movies.movie',
        movies_actors.id 'movies_actors.id',
        movies_actors.movie_id 'movies_actors.movie_id',
        movies_actors.actor_id 'movies_actors.actor_id',
        actors.id 'actors.id',
        actors.actor 'actors.actor'
    
    FROM movies
    
    INNER JOIN movies_actors
    ON movies.id == movies_actors.movie_id
    
    INNER JOIN actors
    ON movies_actors.actor_id == actors.id
    
    LIMIT 10
'''
pd.read_sql(query, conn)

| | movies.id | movies.movie | movies_actors.id | movies_actors.movie_id | movies_actors.actor_id | actors.id | actors.actor |
|---|---|---|---|---|---|---|---|
| 0 | 1 | Biutiful | 1 | 1 | 1 | 1 | Javier Bardem |
| 1 | 2 | True Grit | 2 | 2 | 2 | 2 | Jeff Bridges |
| 2 | 3 | The Social Network | 3 | 3 | 3 | 3 | Jesse Eisenberg |
| 3 | 4 | The King's Speech | 4 | 4 | 4 | 4 | Colin Firth |
| 4 | 5 | 127 Hours | 5 | 5 | 5 | 5 | James Franco |
| 5 | 6 | The Fighter | 6 | 6 | 6 | 6 | Christian Bale |
| 6 | 7 | Winter's Bone | 7 | 7 | 7 | 7 | John Hawkes |
| 7 | 8 | The Town | 8 | 8 | 8 | 8 | Jeremy Renner |
| 8 | 9 | The Kids Are All Right | 9 | 9 | 9 | 9 | Mark Ruffalo |
| 9 | 4 | The King's Speech | 10 | 4 | 10 | 10 | Geoffrey Rush |


![](img/5.png)

We now have one row for every actor-movie pairing: each actor in `actors` appears alongside each movie in `movies` that they played in. However, if you go back to the original problem, you'll notice that we were only interested in actors who starred in `The Fighter`. To accomplish this, we can modify the columns returned in the `SELECT` clause and add filtering criteria using the `WHERE` clause:

```sql
SELECT actors.actor FROM movies
INNER JOIN movies_actors ON movies.id == movies_actors.movie_id
INNER JOIN actors ON movies_actors.actor_id == actors.id
WHERE movies.movie == "The Fighter";
```

```python
query = '''
    SELECT actors.actor FROM movies
    INNER JOIN movies_actors ON movies.id == movies_actors.movie_id
    INNER JOIN actors ON movies_actors.actor_id == actors.id
    WHERE movies.movie == "The Fighter";
'''
pd.read_sql(query, conn)
```

| | actor |
|---|---|
| 0 | Christian Bale |
| 1 | Amy Adams |
| 2 | Melissa Leo |


We'll get back just the 3 actors in our database who starred in `The Fighter`:

![](img/6.png)

You may have noticed that we used **dot notation to specify the column name** in the query above:

* `movies.movie` in the `WHERE` statement.
* `actors.actor` in the `SELECT` statement.

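**Strictly speaking, dot notation is only required where a column name would otherwise be ambiguous, such as the `id` column in the `JOIN` conditions, which exists in all three tables. For names that are unique across the joined tables, like `movie` and `actor` in the `SELECT` and `WHERE` clauses, it's optional.** 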

### It's a good habit, however, to write out the full name (instead of just `movie` or `actor`) of the column using dot notation. 

Besides the `id` column, you'll often work with multiple tables that contain the same column names and using dot notation helps us and the database system know what exactly we're referring to.
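
As a quick illustration of what happens without dot notation, here is a hedged sketch against empty in-memory tables that share the column name `id`. The schema is a simplified stand-in for the real one, and the exact wording of the error message may vary between SQLite versions.

```python
import sqlite3

# Empty stand-in tables; only the column names matter for this demo.
demo = sqlite3.connect(":memory:")
demo.executescript("""
    CREATE TABLE movies (id INTEGER PRIMARY KEY, movie TEXT);
    CREATE TABLE actors (id INTEGER PRIMARY KEY, actor TEXT);
    CREATE TABLE movies_actors (id INTEGER PRIMARY KEY, movie_id INTEGER, actor_id INTEGER);
""")

try:
    # `id` exists in both joined tables, so SQLite can't tell which one we mean.
    demo.execute("""
        SELECT id FROM movies
        INNER JOIN movies_actors ON movies.id = movies_actors.movie_id
    """)
    error_message = None
except sqlite3.OperationalError as exc:
    error_message = str(exc)

print(error_message)

# `movie` only exists in `movies`, so the unqualified name is accepted.
unqualified_ok = demo.execute("""
    SELECT movie FROM movies
    INNER JOIN movies_actors ON movies.id = movies_actors.movie_id
""").fetchall()
demo.close()
```

The first query is rejected before it runs, while the second succeeds (and returns no rows, since the tables are empty).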

In the query above,

* we started with the `movies` table (in our `FROM` clause),
* joined with the `movies_actors` table,
* and then finally joined with the `actors` table.

We could have actually accomplished the same thing by:

* starting with the `actors` table,
* joining with the `movies_actors` table,
* and then joining with the `movies` table,

since the filtering criteria are still the same (`WHERE movies.movie == "The Fighter"`). <br>
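
A small sketch can confirm that the join order doesn't change the result. It uses an in-memory database with a handful of rows; the names come from the outputs above, but the `id` values for Amy Adams are made up for illustration.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.executescript("""
    CREATE TABLE movies (id INTEGER PRIMARY KEY, movie TEXT);
    CREATE TABLE actors (id INTEGER PRIMARY KEY, actor TEXT);
    CREATE TABLE movies_actors (
        id INTEGER PRIMARY KEY,
        movie_id INTEGER REFERENCES movies(id),
        actor_id INTEGER REFERENCES actors(id)
    );
    INSERT INTO movies VALUES (4, 'The King''s Speech'), (6, 'The Fighter');
    -- The ids for Amy Adams (11) are hypothetical sample values.
    INSERT INTO actors VALUES (4, 'Colin Firth'), (6, 'Christian Bale'), (11, 'Amy Adams');
    INSERT INTO movies_actors VALUES (4, 4, 4), (6, 6, 6), (11, 6, 11);
""")

# Start from movies, as in the query above.
from_movies = demo.execute("""
    SELECT actors.actor FROM movies
    INNER JOIN movies_actors ON movies.id = movies_actors.movie_id
    INNER JOIN actors ON movies_actors.actor_id = actors.id
    WHERE movies.movie = 'The Fighter'
""").fetchall()

# Start from actors instead; the filter stays the same.
from_actors = demo.execute("""
    SELECT actors.actor FROM actors
    INNER JOIN movies_actors ON actors.id = movies_actors.actor_id
    INNER JOIN movies ON movies_actors.movie_id = movies.id
    WHERE movies.movie = 'The Fighter'
""").fetchall()

print(sorted(from_movies) == sorted(from_actors))
demo.close()
```

The results are sorted before comparing because neither query has an `ORDER BY`, so the row order isn't guaranteed.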


### Here's a good summary of the steps you need to do when querying tables that are in a many-to-many relation:

* first, state the question you want answered:
  * we want all of the actors that starred in `The Fighter`. Information on `The Fighter` is in the `movies` table and there's a join table we'll need to use to get the related information from the `actors` table.
* then, understand what joins you need and the filtering criteria you need:
  * we need to join the `movies` table with the `movies_actors` table, then join the results with the `actors` table.
  * our filtering criterion is that we only want the row from `movies` corresponding to `The Fighter`.
* finally, write the query you need based on the joins, columns, and filtering criteria you need.
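
The steps above can be sketched as a small reusable function. The helper name `actors_in_movie` and the in-memory sample data are assumptions for illustration, not part of the mission's code. It also uses a `?` placeholder rather than pasting the title into the SQL string, which avoids quoting problems with titles such as `The King's Speech`.

```python
import sqlite3


def actors_in_movie(connection, title):
    """Return the actors linked to `title` through the movies_actors join table."""
    query = """
        SELECT actors.actor
        FROM movies
        INNER JOIN movies_actors ON movies.id = movies_actors.movie_id
        INNER JOIN actors ON movies_actors.actor_id = actors.id
        WHERE movies.movie = ?
        ORDER BY actors.actor
    """
    # The placeholder lets sqlite3 handle quoting of the title for us.
    return [row[0] for row in connection.execute(query, (title,))]


# Hypothetical sample data, mirroring a few rows from the outputs above.
demo = sqlite3.connect(":memory:")
demo.executescript("""
    CREATE TABLE movies (id INTEGER PRIMARY KEY, movie TEXT);
    CREATE TABLE actors (id INTEGER PRIMARY KEY, actor TEXT);
    CREATE TABLE movies_actors (
        id INTEGER PRIMARY KEY,
        movie_id INTEGER REFERENCES movies(id),
        actor_id INTEGER REFERENCES actors(id)
    );
    INSERT INTO movies VALUES (4, 'The King''s Speech'), (5, '127 Hours');
    INSERT INTO actors VALUES (4, 'Colin Firth'), (5, 'James Franco'), (10, 'Geoffrey Rush');
    INSERT INTO movies_actors VALUES (4, 4, 4), (5, 5, 5), (10, 4, 10);
""")

kings_cast = actors_in_movie(demo, "The King's Speech")
print(kings_cast)
demo.close()
```

With the real database, the same function could be called with `conn` in place of `demo`.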

Writing multiple joins to query tables in a many-to-many relation can be overwhelming at first but it's nothing that practice can't help you overcome. As you practice more, you'll find yourself skipping right to writing the query itself as this kind of querying becomes second nature to you.

#### instructions
* Modify the query we wrote earlier to instead return all the actors that starred in `The King's Speech`.
  * We're interested in both the actor name as well as the movie name this time (in that order).
* Run the query and assign the results list to `kings_actors`.
* Then, use the `print` function to display `kings_actors`.

```python
query = '''
    SELECT actors.actor, movies.movie
    FROM movies
    INNER JOIN movies_actors ON movies.id == movies_actors.movie_id
    INNER JOIN actors ON movies_actors.actor_id == actors.id
    WHERE movies.movie == "The King's Speech";
'''

kings_actors = conn.execute(query).fetchall()

for actmov in kings_actors:
    print(actmov)
```

```
('Colin Firth', "The King's Speech")
('Geoffrey Rush', "The King's Speech")
('Helena Bonham Carter', "The King's Speech")
```


#### instructions

* Write a query that returns all of the movies that `"Natalie Portman"` played in.
  * We want to return only the `movie` name (from the `movies` table) and the `actor` name (from the `actors` table).
  * You need to first join the `movies` table with the `movies_actors` table.
  * Then, you need to join the `movies_actors` table with the `actors` table.
  * Finally, you need to add a `where` statement that limits the results to just where `actors.actor` is equal to `Natalie Portman`.
* Run the query and assign the full results list to `portman_joins`.
* Use the print function to display `portman_joins`.

```python
query = '''
    SELECT movies.movie, actors.actor
    FROM movies
    INNER JOIN movies_actors ON movies.id == movies_actors.movie_id
    INNER JOIN actors ON movies_actors.actor_id == actors.id
    WHERE actors.actor == "Natalie Portman";
'''

portman_joins = conn.execute(query).fetchall()

for movact in portman_joins:
    print(movact)
```

```
('Black Swan', 'Natalie Portman')
('Closer', 'Natalie Portman')
```


# <a name='summary'></a>Summary Note.

### While normalization helps reduce data redundancy and allows us to decouple related columns into separate tables, 
too much normalization can do more harm than good. A highly normalized database means that even some basic queries can involve joining multiple tables together. <br>

You may have wondered why we didn't try to separate out the actors (`nominee` column) and the movie names (`movie` column) in the `nominations` table. We could have replaced these columns with foreign key references to the `actors` and `movies` tables instead. **We didn't, because we probably wouldn't have realized the gains of normalization by replacing the actual values with foreign key references.** <br>
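
To make the tradeoff concrete, here is a sketch comparing the two designs in an in-memory database. The table `nominations_normalized` and its `actor_id` and `movie_id` columns are hypothetical (they don't exist in `academy_awards.db`), and the `nominations` table here is trimmed to three columns.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.executescript("""
    CREATE TABLE actors (id INTEGER PRIMARY KEY, actor TEXT);
    CREATE TABLE movies (id INTEGER PRIMARY KEY, movie TEXT);

    -- Current design (simplified): the names are stored directly in each row.
    CREATE TABLE nominations (
        id INTEGER PRIMARY KEY, nominee TEXT, movie TEXT, won INTEGER
    );

    -- Hypothetical alternative: the names are replaced by foreign keys.
    CREATE TABLE nominations_normalized (
        id INTEGER PRIMARY KEY,
        actor_id INTEGER REFERENCES actors(id),
        movie_id INTEGER REFERENCES movies(id),
        won INTEGER
    );

    INSERT INTO actors VALUES (1, 'Natalie Portman');
    INSERT INTO movies VALUES (1, 'Black Swan');
    INSERT INTO nominations VALUES (1, 'Natalie Portman', 'Black Swan', 1);
    INSERT INTO nominations_normalized VALUES (1, 1, 1, 1);
""")

# Current design: one table, no joins.
simple = demo.execute(
    "SELECT nominee, movie, won FROM nominations"
).fetchall()

# Fully normalized design: two joins just to read the names back.
joined = demo.execute("""
    SELECT actors.actor, movies.movie, nominations_normalized.won
    FROM nominations_normalized
    INNER JOIN actors ON nominations_normalized.actor_id = actors.id
    INNER JOIN movies ON nominations_normalized.movie_id = movies.id
""").fetchall()

print(simple == joined)
demo.close()
```

Both queries return the same information, but the normalized version needs two joins for what is likely the most common lookup.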

If we think that we'll almost always be accessing the movie and actor names when we're querying the `nominations` table, then forcing the user to do multiple joins to get the relevant information is quite cumbersome. In addition, we know that **once an awards ceremony has finished, the movies and nominees are not going to change**. 

### This means that another benefit of normalization, easy updating and editing of related values, probably won't be realized. <br>

When we represented the `year` and `host` columns in a separate table from the nominations table, **we made the assumption that we don't always need to access both columns every single time when querying the `nominations` table**. 
### We preferred having less data redundancy and writing a join when we needed to. <br>

### Lastly, it's important to remember that the schema isn't set in stone. 
In many cases, it's best to start out with a denormalized representation of your data with one, or a few, giant tables. As your data grows and your use cases change, you can rethink your schema and restructure your data accordingly. **When structuring your data and writing a schema, it's important to remember the tradeoffs that come with normalization**.

```python
conn.close()
```