
# Project: TMDB Movie Data

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction


The TMDB Movie Data dataset describes about 11,000 movies and records several important details for each one, such as:
<ul>
    <li>A Unique Id</li>
    <li>Movie Name</li>
    <li>Cast</li>
    <li>Director(s)</li>
    <li>Runtime</li>
    <li>Production Companies</li>
    <li>Release Year</li>
    <li>Vote Average</li>
    <li>Budget (adjusted to 2010 rates)</li>
    <li>Revenues (adjusted to 2010 rates)</li>
    <li>Popularity Score</li>
</ul>

The questions I would like to explore with this dataset are:
<ol>
    <li>Which production companies had the highest budgets?</li>
    <li>Which production companies had the highest revenues?</li>
    <li>Did big-budget movies produce higher revenues? Are there exceptions?</li>
    <li>Which movies had revenues lower than their production budgets?</li>
    <li>Who were the most popular directors?</li>
    <li>Who were the most popular actors?</li>
    <li>Which genres were the most popular?</li>
    <li>Which actors appeared in the most movies?</li>
    <li>Which directors made the most movies?</li>
    <li>Which genres had the most movies?</li>
    <li>Is there a correlation between movie runtime and popularity?</li>
    <li>Are more movies produced each year?</li>
    <li>Who are directors' favorite actors?</li>
    <li>Who are actors' favorite directors?</li>
</ol>

In [3]:
# Import the packages used throughout the analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Render plots inline in the notebook
%matplotlib inline

<a id='wrangling'></a>
## Data Wrangling


Steps in Data Wrangling:
<ul>
    <li>Load the data from the "tmdb-movies.csv" file into a dataframe called "df_movie_list".</li>
    <li>Drop the following columns from the dataframe:
    <ul>
        <li>budget: I am planning to use the budget_adj column instead</li>
        <li>revenue: I am planning to use the revenue_adj column instead</li>
        <li>Columns not relevant to my analysis:
        <ul>
            <li>homepage</li>
            <li>tagline</li>
            <li>keywords</li>
            <li>overview</li>
        </ul>
        </li>
    </ul>
    </li>
    <li>For each column that holds pipe-separated values (cast, director, genres, production_companies), create a new dataframe with one row per value:
    <ul>
        <li>df_cast</li>
        <li>df_dir</li>
        <li>df_genre</li>
        <li>df_prod_co</li>
    </ul>
    </li>
</ul>
    

### General Properties

In [6]:
# Load the movie list into a dataframe
df_movie_list = pd.read_csv("tmdb-movies.csv")

# Drop columns not part of the analysis
df_movie_list.drop(['budget', 'revenue', 'homepage', 'tagline', 'keywords', 'overview'], axis = 1, inplace = True)

df_movie_list.head()

Unnamed: 0,id,imdb_id,popularity,original_title,cast,director,runtime,genres,production_companies,release_date,vote_count,vote_average,release_year,budget_adj,revenue_adj
0,135397,tt0369610,32.985763,Jurassic World,Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...,Colin Trevorrow,124,Action|Adventure|Science Fiction|Thriller,Universal Studios|Amblin Entertainment|Legenda...,6/9/15,5562,6.5,2015,137999900.0,1392446000.0
1,76341,tt1392190,28.419936,Mad Max: Fury Road,Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic...,George Miller,120,Action|Adventure|Science Fiction|Thriller,Village Roadshow Pictures|Kennedy Miller Produ...,5/13/15,6185,7.1,2015,137999900.0,348161300.0
2,262500,tt2908446,13.112507,Insurgent,Shailene Woodley|Theo James|Kate Winslet|Ansel...,Robert Schwentke,119,Adventure|Science Fiction|Thriller,Summit Entertainment|Mandeville Films|Red Wago...,3/18/15,2480,6.3,2015,101200000.0,271619000.0
3,140607,tt2488496,11.173104,Star Wars: The Force Awakens,Harrison Ford|Mark Hamill|Carrie Fisher|Adam D...,J.J. Abrams,136,Action|Adventure|Science Fiction|Fantasy,Lucasfilm|Truenorth Productions|Bad Robot,12/15/15,5292,7.5,2015,183999900.0,1902723000.0
4,168259,tt2820852,9.335014,Furious 7,Vin Diesel|Paul Walker|Jason Statham|Michelle ...,James Wan,137,Action|Crime|Thriller,Universal Pictures|Original Film|Media Rights ...,4/1/15,2947,7.3,2015,174799900.0,1385749000.0


In [102]:
# Create new dataframes by splitting the pipe-separated values into separate rows.
# Keep these dataframes separate so that we can merge them with the main dataframe for analysis.

# Copy only the id and cast columns and drop rows with no cast
df_cast_tmp = df_movie_list[['id', 'cast']].dropna()

# Split "cast" on "|" once; each value lands in its own numbered column
cast_split = df_cast_tmp['cast'].str.split("|", expand = True)

# Stack up to the first five values into rows: all first values, then all second values, and so on
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
df_cast = pd.concat([df_cast_tmp.assign(cast = cast_split[i]) for i in cast_split.columns[:5]])

# Not every movie has five cast members, so drop the rows left empty by the split
df_cast = df_cast.dropna()

# Add a "row_type" discriminator column for when we combine all these dataframes into one again,
# and rename the "cast" column to the generic name "row_value"
df_cast = df_cast.assign(row_type = 'Cast').rename(columns = {'cast': 'row_value'})

# Verification
# df_cast.query('id == 135397')
# df_cast['row_value'].isnull().sum()
# cast_split
df_cast.head()

Unnamed: 0,id,row_value,row_type
0,135397,Chris Pratt,Cast
1,76341,Tom Hardy,Cast
2,262500,Shailene Woodley,Cast
3,140607,Harrison Ford,Cast
4,168259,Vin Diesel,Cast
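
The split above is limited to the first five values per movie. One possible way to handle any number of separated values is `Series.str.split` followed by `DataFrame.explode` (available from pandas 0.25). The sketch below is only an illustration: `split_to_rows` is a hypothetical helper that is not part of this notebook, and the toy dataframe stands in for `df_movie_list`.

```python
import pandas as pd

def split_to_rows(df, column, row_type, sep="|"):
    """Return one row per separated value in `column`, keyed by `id`.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    out = df[["id", column]].dropna().copy()
    out[column] = out[column].str.split(sep)   # each cell becomes a list of values
    out = out.explode(column)                  # one row per list element
    out = out.rename(columns={column: "row_value"})
    out["row_type"] = row_type                 # discriminator column
    return out.reset_index(drop=True)

# Toy data standing in for df_movie_list (values are invented)
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "cast": ["A|B|C", "D", None],
})

cast_rows = split_to_rows(toy, "cast", "Cast")
print(cast_rows)
```

Unlike the stacked approach, the rows come out grouped by movie rather than by position in the list, so `head()` would show all cast members of the first movie first.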


In [103]:
# Repeat these steps for the director column
# Copy only the id and director columns and drop rows with no director
df_dir_tmp = df_movie_list[['id', 'director']].dropna()

# Split "director" on "|" once; each value lands in its own numbered column
dir_split = df_dir_tmp['director'].str.split("|", expand = True)

# Stack up to the first five values into rows: all first values, then all second values, and so on
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
df_dir = pd.concat([df_dir_tmp.assign(director = dir_split[i]) for i in dir_split.columns[:5]])

# Not every movie has five directors, so drop the rows left empty by the split
df_dir = df_dir.dropna()

# Add a "row_type" discriminator column for when we combine all these dataframes into one again,
# and rename the "director" column to the generic name "row_value"
df_dir = df_dir.assign(row_type = 'Director').rename(columns = {'director': 'row_value'})

# Verification
# df_dir.query('id == 135397')
# df_dir['row_value'].isnull().sum()
# dir_split
df_dir.head()

Unnamed: 0,id,row_value,row_type
0,135397,Colin Trevorrow,Director
1,76341,George Miller,Director
2,262500,Robert Schwentke,Director
3,140607,J.J. Abrams,Director
4,168259,James Wan,Director


In [100]:
# Repeat these steps for the genres column
# Copy only the id and genres columns and drop rows with no genres
df_genres_tmp = df_movie_list[['id', 'genres']].dropna()

# Split "genres" on "|" once; each value lands in its own numbered column
genres_split = df_genres_tmp['genres'].str.split("|", expand = True)

# Stack up to the first five values into rows: all first values, then all second values, and so on
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
df_genre = pd.concat([df_genres_tmp.assign(genres = genres_split[i]) for i in genres_split.columns[:5]])

# Not every movie has five genres, so drop the rows left empty by the split
df_genre = df_genre.dropna()

# Add a "row_type" discriminator column for when we combine all these dataframes into one again
df_genre = df_genre.assign(row_type = 'Genre')

# Verification
# df_genre.query('id == 135397')
# df_genre['genres'].isnull().sum()
# genres_split
df_genre.head()

Unnamed: 0,id,genres,row_type
0,135397,Action,Genre
1,76341,Action,Genre
2,262500,Adventure,Genre
3,140607,Action,Genre
4,168259,Action,Genre


In [101]:
# Repeat these steps for the production_companies column
# Copy only the id and production_companies columns and drop rows with no production company
df_prod_co_tmp = df_movie_list[['id', 'production_companies']].dropna()

# Split "production_companies" on "|" once; each value lands in its own numbered column
prod_co_split = df_prod_co_tmp['production_companies'].str.split("|", expand = True)

# Stack up to the first five values into rows: all first values, then all second values, and so on
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
df_prod_co = pd.concat([df_prod_co_tmp.assign(production_companies = prod_co_split[i]) for i in prod_co_split.columns[:5]])

# Not every movie has five production companies, so drop the rows left empty by the split
df_prod_co = df_prod_co.dropna()

# Add a "row_type" discriminator column for when we combine all these dataframes into one again
df_prod_co = df_prod_co.assign(row_type = 'Production Company')

# Verification
# df_prod_co.query('id == 135397')
# df_prod_co['production_companies'].isnull().sum()
# prod_co_split
df_prod_co.head()

Unnamed: 0,id,production_companies,row_type
0,135397,Universal Studios,Production Company
1,76341,Village Roadshow Pictures,Production Company
2,262500,Summit Entertainment,Production Company
3,140607,Lucasfilm,Production Company
4,168259,Universal Pictures,Production Company


In [97]:
# Combine all of the above into a single dataframe
# (pd.concat replaces DataFrame.append; sort = True keeps the columns in alphabetical order)
df_movie_transposed = pd.concat([df_cast, df_dir, df_genre, df_prod_co], sort = True)

df_movie_transposed.head()

Unnamed: 0,cast,director,genres,id,production_companies,row_type
0,Chris Pratt,,,135397,,Cast
1,Tom Hardy,,,76341,,Cast
2,Shailene Woodley,,,262500,,Cast
3,Harrison Ford,,,140607,,Cast
4,Vin Diesel,,,168259,,Cast
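
Once the values sit in a long table with a discriminator column, they can be merged back onto the main dataframe to answer questions such as "which genres had the most movies?". The sketch below uses made-up toy data and assumes every row stores its value in a single `row_value` column, which is not yet the case for all four dataframes above; the names `movies`, `long_rows`, and `summary` are illustrative only.

```python
import pandas as pd

# Toy stand-in for the main movie dataframe (values are invented)
movies = pd.DataFrame({
    "id": [1, 2, 3],
    "original_title": ["X", "Y", "Z"],
    "popularity": [3.0, 1.0, 2.0],
})

# Toy stand-in for the combined long-format dataframe
long_rows = pd.DataFrame({
    "id": [1, 1, 2, 3, 3],
    "row_value": ["Action", "Drama", "Action", "Drama", "Ann Actor"],
    "row_type": ["Genre", "Genre", "Genre", "Genre", "Cast"],
})

# Keep only the genre rows, then attach the movie-level columns by id
genres = long_rows[long_rows["row_type"] == "Genre"]
merged = genres.merge(movies, on="id", how="left")

# Count distinct movies and average popularity per genre
summary = (
    merged.groupby("row_value")
    .agg(movie_count=("id", "nunique"), mean_popularity=("popularity", "mean"))
    .sort_values("movie_count", ascending=False)
)
print(summary)
```

Filtering on `row_type` first keeps the merge small and lets the same pattern be reused for cast, directors, and production companies.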



### Data Cleaning

In [None]:
# After discussing the structure of the data and any problems that need to be
#   cleaned, perform those cleaning steps in the second part of this section.


<a id='eda'></a>
## Exploratory Data Analysis


### Research Question 1

In [None]:
# Use this, and more code cells, to explore your data. Don't forget to add
#   Markdown cells to document your observations and findings.


### Research Question 2

In [None]:
# Continue to explore the data to address your additional research
#   questions. Add more headers as needed if you have more questions to
#   investigate.


<a id='conclusions'></a>
## Conclusions
