# 2. Merging Tables with Different Join Types

## Left Join

### Counting missing rows with left join
The Movie Database is supported by volunteers going out into the world, collecting data, and entering it into the database. This includes financial data, such as movie budget and revenue. If you wanted to know which movies are still missing data, you could use a left join to identify them. Practice using a left join by merging the ```movies``` table and the ```financials``` table.

In [1]:
import pandas as pd

movies = pd.read_pickle("movies.p")
financials = pd.read_pickle("financials.p")

print(movies.info())
print(financials.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   id            4803 non-null   int64  
 1   title         4803 non-null   object 
 2   popularity    4803 non-null   float64
 3   release_date  4802 non-null   object 
dtypes: float64(1), int64(1), object(2)
memory usage: 150.2+ KB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3229 entries, 0 to 3228
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   id       3229 non-null   int64  
 1   budget   3229 non-null   int64  
 2   revenue  3229 non-null   float64
dtypes: float64(1), int64(2)
memory usage: 75.8 KB
None


In [2]:
# Merge movies and financials with a left join
movies_financials = movies.merge(financials, on=("id"), how="left")

In [3]:
# Merge the movies table with the financials table with a left join
movies_financials = movies.merge(financials, on='id', how='left')

# Count the number of rows in the budget column that are missing
number_of_missing_fin = movies_financials['budget'].isnull().sum()

# Print the number of movies missing financials
print(number_of_missing_fin)

1574


### Enriching a dataset

Setting ```how='left'``` with the ```.merge()```method is a useful technique for enriching or enhancing a dataset with additional information from a different table. In this exercise, you will start off with a sample of movie data from the movie series Toy Story. Your goal is to enrich this data by adding the marketing tag line for each movie. You will compare the results of a left join versus an inner join.

In [4]:
dict_toystory = {
    "id": [10193, 863, 862],
    "title": ["Toy Story 3", "Toy Story 2", "Toy Story"],
    "popularity": [59995, 73575, 73640],
    "release_date": ["2010-06-16", "1999-10-30", "1995-10-30"]
}

toy_story = pd.DataFrame(dict_toystory)
print(toy_story)

      id        title  popularity release_date
0  10193  Toy Story 3       59995   2010-06-16
1    863  Toy Story 2       73575   1999-10-30
2    862    Toy Story       73640   1995-10-30


In [5]:
taglines = pd.read_pickle("taglines.p")
print(taglines.info())

<class 'pandas.core.frame.DataFrame'>
Index: 3955 entries, 0 to 4801
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   id       3955 non-null   int64 
 1   tagline  3955 non-null   object
dtypes: int64(1), object(1)
memory usage: 92.7+ KB
None


In [6]:
# Merge the toy_story and taglines tables with a left join
toystory_tag = toy_story.merge(taglines, on=("id"), how="left")

# Print the rows and shape of toystory_tag
print(toystory_tag)
print(toystory_tag.shape)

      id        title  popularity release_date                   tagline
0  10193  Toy Story 3       59995   2010-06-16  No toy gets left behind.
1    863  Toy Story 2       73575   1999-10-30        The toys are back!
2    862    Toy Story       73640   1995-10-30                       NaN
(3, 5)


In [7]:
# Merge the toy_story and taglines tables with a inner join
toystory_tag = toy_story.merge(taglines, on=("id"))

# Print the rows and shape of toystory_tag
print(toystory_tag)
print(toystory_tag.shape)

      id        title  popularity release_date                   tagline
0  10193  Toy Story 3       59995   2010-06-16  No toy gets left behind.
1    863  Toy Story 2       73575   1999-10-30        The toys are back!
(2, 5)


In [8]:
left_table.merge(one_to_one, on='id', how='left').shape
left_table.merge(one_to_many, on='id', how='left').shape

NameError: name 'left_table' is not defined

### The output of a one-to-many merge with a left join will have greater than or equal rows than the left table.

### Right Join

Most of the recent big-budget science fiction movies can also be classified as action movies. 
You are given a table of science fiction movies called ```scifi_movies``` and another table of action movies called ```action_movies```. Your goal is to find which movies are considered only science fiction movies. 
Once you have this table, you can merge the movies table in to see the movie names. Since this exercise is related to science fiction movies, use a right join as your superhero power to solve this problem.

In [None]:
# Merge action_movies to the scifi_movies with right join
action_scifi = action_movies.merge(scifi_movies, on='movie_id', how='right',
                                   suffixes=('_act','_sci'))

# From action_scifi, select only the rows where the genre_act column is null
scifi_only = action_scifi[action_scifi['genre_act'].isnull()]

# Merge the movies and scifi_only tables with an inner join
movies_and_scifi_only = movies.merge(scifi_only, left_on="id", right_on="movie_id")

# Print the first few rows and shape of movies_and_scifi_only
print(movies_and_scifi_only.head())
print(movies_and_scifi_only.shape)

Popular genres with right join

What are the genres of the most popular movies? 
To answer this question, you need to merge data from the ```movies``` and ```movie_to_genres``` tables. 

In a table called ```pop_movies```, the top 10 most popular movies in the movies table have been selected. 
To ensure that you are analyzing all of the popular movies, merge it with the ```movie_to_genres``` table using a right join. 

To complete your analysis, count the number of different genres. 
Also, the two tables can be merged by the movie ID. 
However, in ```pop_movies``` that column is called ```id```, and in ```movie_to_genres``` it's called ```movie_id```.

In [None]:
# Use right join to merge the movie_to_genres and pop_movies tables
genres_movies = movie_to_genres.merge(pop_movies, how='right', 
                                      left_on="movie_id", 
                                      right_on="id")

# Count the number of genres
genre_count = genres_movies.groupby('genre').agg({'id':'count'})

# Plot a bar chart of the genre_count
genre_count.plot(kind='bar')
plt.show()

## Outer Join

One cool aspect of using an outer join is that, because it returns all rows from both merged tables and null where they do not match, you can use it to find rows that do not have a match in the other table. 

To try for yourself, you have been given two tables with a list of actors from two popular movies: Iron Man 1 and Iron Man 2. Most of the actors played in both movies. Use an outer join to find actors who did not act in both movies.

The Iron Man 1 table is called ```iron_1_actors```, and Iron Man 2 table is called ```iron_2_actors```. 

In [None]:
# Merge iron_1_actors to iron_2_actors on id with outer join using suffixes
iron_1_and_2 = iron_1_actors.merge(iron_2_actors,
                                     how="outer",
                                     on="id",
                                     suffixes= ("_1", "_2"))

# Create an index that returns true if name_1 or name_2 are null
m = ((iron_1_and_2['name_1'].isnull()) | 
     (iron_1_and_2['name_2'].isnull()))

# Print the first few rows of iron_1_and_2
print(iron_1_and_2[m].head())

## Self Join

- hierarchical relationships
- sequential relationships
- graph data

Merging a table to itself can be useful when you want to compare values in a column to other values in the same column. In this exercise, you will practice this by creating a table that for each movie will list the movie director and a member of the crew on one row. You have been given a table called crews, which has columns id, job, and name. First, merge the table to itself using the movie ID. This merge will give you a larger table where for each movie, every job is matched against each other. Then select only those rows with a director in the left table, and avoid having a row where the director's job is listed in both the left and right tables. This filtering will remove job combinations that aren't with the director.

The crews table has been loaded for you.

In [9]:
crews = pd.read_pickle("crews.p")
print(crews.info())

<class 'pandas.core.frame.DataFrame'>
Index: 42502 entries, 0 to 129580
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   id          42502 non-null  int64 
 1   department  42502 non-null  object
 2   job         42502 non-null  object
 3   name        42502 non-null  object
dtypes: int64(1), object(3)
memory usage: 1.6+ MB
None


In [10]:
# To a variable called crews_self_merged, merge the crews table to itself on the id column using an inner join, setting the suffixes to '_dir' and '_crew' for the left and right tables respectively.

# Merge the crews table to itself
crews_self_merged = crews.merge(crews, on="id", suffixes=("_dir", "_crew"))

In [11]:
# Create a Boolean index, named boolean_filter, that selects rows from the left table with the job of 'Director' and avoids rows with the job of 'Director' in the right table.

# Merge the crews table to itself
crews_self_merged = crews.merge(crews, on='id', how='inner',
                                suffixes=('_dir','_crew'))

# Create a Boolean index to select the appropriate
boolean_filter = ((crews_self_merged['job_dir'] == 'Director') & 
     (crews_self_merged['job_crew'] != 'Director'))
direct_crews = crews_self_merged[boolean_filter]

In [12]:
# Merge the crews table to itself
crews_self_merged = crews.merge(crews, on='id', how='inner',
                                suffixes=('_dir','_crew'))

# Create a boolean index to select the appropriate rows
boolean_filter = ((crews_self_merged['job_dir'] == 'Director') & 
                  (crews_self_merged['job_crew'] != 'Director'))
direct_crews = crews_self_merged[boolean_filter]

# Print the first few rows of direct_crews
print(direct_crews.head())

        id department_dir   job_dir       name_dir department_crew  \
156  19995      Directing  Director  James Cameron         Editing   
157  19995      Directing  Director  James Cameron           Sound   
158  19995      Directing  Director  James Cameron      Production   
160  19995      Directing  Director  James Cameron         Writing   
161  19995      Directing  Director  James Cameron             Art   

           job_crew          name_crew  
156          Editor  Stephen E. Rivkin  
157  Sound Designer  Christopher Boyes  
158         Casting          Mali Finn  
160          Writer      James Cameron  
161    Set Designer    Richard F. Mays  


## Merging on indexes

To practice merging on indexes, you will merge movies and a table called ratings that holds info about movie ratings. Ensure that your merge returns all rows from the movies table, and only matching rows from the ratings table

In [13]:
ratings = pd.read_pickle("ratings.p")

In [14]:
# Merge to the movies table the ratings table on the index
movies_ratings = movies.merge(ratings, how="left", on="id")

# Print the first few rows of movies_ratings
print(movies_ratings.head())

      id                 title  popularity release_date  vote_average  \
0    257          Oliver Twist   20.415572   2005-09-23           6.7   
1  14290  Better Luck Tomorrow    3.877036   2002-01-12           6.5   
2  38365             Grown Ups   38.864027   2010-06-24           6.0   
3   9672              Infamous    3.680896   2006-11-16           6.4   
4  12819       Alpha and Omega   12.300789   2010-09-17           5.3   

   vote_count  
0       274.0  
1        27.0  
2      1705.0  
3        60.0  
4       124.0  


It is time to put together many of the aspects that you have learned in this chapter. In this exercise, you'll find out which movie sequels earned the most compared to the original movie. To answer this question, you will merge a modified version of the sequels and financials tables where their index is the movie ID. You will need to choose a merge type that will return all of the rows from the sequels table and not all the rows of financials table need to be included in the result. From there, you will join the resulting table to itself so that you can compare the revenue values of the original movie to the sequel. Next, you will calculate the difference between the two revenues and sort the resulting dataset.

In [15]:
sequels = pd.read_pickle("sequels.p")

In [17]:
# With the sequels table on the left, merge to it the financials table on index named id, ensuring that all the rows from the sequels are returned and some rows from the other table may not be returned, Save the results to sequels_fin.

# Merge sequels and financials on index id
sequels_fin = sequels.merge(financials, on='id', how='left')

# Self merge with suffixes as inner join with left on sequel and right on id
orig_seq = sequels_fin.merge(sequels_fin, how='inner', left_on='sequel', 
                             right_on='id', right_index=True,
                             suffixes=('_org','_seq'))

# Add calculation to subtract revenue_org from revenue_seq 
orig_seq['diff'] = orig_seq['revenue_seq'] - orig_seq['revenue_org']

# Select the title_org, title_seq, and diff 
titles_diff = orig_seq[['title_org','title_seq','diff']]

# Print the first rows of the sorted titles_diff
print(titles_diff.sort_values("diff", ascending=False).head())

                                 title_org                 title_seq  \
2929                        Before Sunrise  The Amazing Spider-Man 2   
1256   Star Trek III: The Search for Spock                The Matrix   
293   Indiana Jones and the Temple of Doom              Man of Steel   
1084                                   Saw          Superman Returns   
1334                        The Terminator          Star Trek Beyond   

             diff  
2929  700182027.0  
1256  376517383.0  
293   329845518.0  
1084  287169523.0  
1334  265100616.0  
