# Project: Movies Dataset

## Files to review:
- You can review the file called **```Assessing Data```**, where I analyzed the dataset we are going to work with.
- I also did some data cleansing in the file called **```Data Cleansing```**.
- In this notebook, we're going to answer some questions using visuals and descriptive statistics.

## Questions to Answer:
In this part I list the questions I want to answer, so I don't forget them.
- [x] Highest revenue per year, and from which movie
- [x] Which genre generated the most revenue, and which genre has the best votes?
- [x] Average Revenue per Genre from 1960-2015
- [x] Correlations between Revenue and Budget
- [x] Correlations between Budget and Popularity

## Table of Contents
<ul>
<li><a href="#eda"><b>Exploratory Data Analysis</b></a></li>
<li><a href="#f1"><b>Function for plotting</b></a></li>
<li><a href="#f2"><b>Function creating top 10</b></a></li>
<li><a href="#single"><b>Single Variable Analysis</b></a></li>
<ul>    
<li><a href="#qs1">Movies created per year</a></li>
<li><a href="#qs2">Top 10 production companies</a></li>
<li><a href="#qs3">Top 10 Actors</a></li>
</ul>
<li><a href="#multiple"><b>Multiple Variable Exploration</b></a></li>
<ul>  
<li><a href="#qm1">Highest Revenue Per Year and from which movie</a></li>
<li><a href="#qm2">Top 10 Movies that generated most revenue</a></li>
<li><a href="#qm3">Revenue in Millions from 1960-2015</a></li>
<li><a href="#qm4">Movie that made the most money</a></li> 
<li><a href="#qm5">Movie that made the least money</a></li>
<li><a href="#qm6">Which genre generated the most revenue, and which genre has the best votes?</a></li>
</ul>
<li><a href="#conclusions"><b>Conclusions</b></a></li>
</ul>


<a id='intro'></a>
## Introduction

> I think movies are a topic we all like and feel familiar with, and that's why I chose to work with this dataset. In the following sections **I'll be analyzing each of the questions stated above to understand more about this dataset and about the movie industry.**

In [1]:
# Import the packages used throughout the notebook.
# The %matplotlib inline magic renders the plots inside the notebook.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

<a id='eda'></a>
## Exploratory Data Analysis


> With the goal of addressing the research questions I used descriptive statistics to find insights and relationships between variables. 

<a id='f1'></a>
> ### Function for plotting
I'm going to create a function that I will call every time I want to add a title and axis labels to a plot.

In [2]:
def plotTitle(title, xlabel, ylabel):
    """
    Set the title and the axis labels of the current plot.
    """
    plt.title(title, fontsize = 22)
    plt.xlabel(xlabel, labelpad = 15, fontsize = 18)
    plt.ylabel(ylabel, labelpad = 15, fontsize = 18) 

<a id='f2'></a>
> ### Function creating the top 10 of any column
I'm going to create a function that I will call every time I want to plot the top 10 values of a column.

In [3]:
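def top10(column):
    """
    Plot the 10 most frequent values of a column of df as a horizontal bar chart.
    """
    # local name differs from the function name to avoid shadowing it
    counts = df[column].value_counts()
    # keep the 10 highest counts; sorting ascending puts the largest bar on top
    counts = counts.nlargest(10).sort_values(ascending = True)
    return counts.plot(kind = 'barh', figsize =(12,8))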

In [4]:
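def group_by(data, column1, column2):
    """
    Plot the 10 largest sums of column2, grouped by column1, as a horizontal bar chart.
    """
    # use the dataframe passed in as `data` instead of the global df
    variable = data.groupby(column1)[column2].sum()
    variable = variable.nlargest(10).sort_values(ascending = True)
    return variable.plot(kind = 'barh',figsize =(12,10), legend = True)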

<a id='q1'></a>
> ### Highest revenue per year, and from which movie
- **Jaws** made **1907.005842** million USD in revenue.

In [None]:
# Load the cleaned dataset and preview the first rows
df = pd.read_csv('clean_dataset.csv')
df.head(2)

In [None]:
df.drop_duplicates(subset = 'id' , inplace = True) 

In [None]:
df[df.budget_adj == 0]

In [None]:
sum(df.id.duplicated())

<a id='single'></a>
## Single Variable Analysis

<a id='qs1'></a>
> ### Movies created per year

In [None]:
#checking the descriptive statistics of the release_year column
df.release_year.describe()

In [None]:
#creating the plot of the column
df['release_year'].value_counts().plot(kind = 'barh', figsize =(18,20))
plotTitle('Movies created per year', 'Movies created', 'Years');

In the previous graph, we can see that **the number of movies created increased throughout the years**.
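
A trend over time is usually easier to read when the counts are ordered by year rather than by size. The following is only a minimal sketch of that idea, assuming a small hypothetical `toy` DataFrame (made-up rows, not the real dataset) in place of `df`:

```python
import pandas as pd

# Hypothetical stand-in for df; only the release_year column matters here.
toy = pd.DataFrame({'release_year': [1960, 1961, 1961, 2015, 2015, 2015]})

# value_counts() orders the result by frequency;
# sort_index() puts it back in chronological order.
movies_per_year = toy['release_year'].value_counts().sort_index()
print(movies_per_year)
```

On the real data, the same two calls could be chained before `.plot(...)` so that the bars follow the years.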

<a id='qs2'></a>
> ### Top 10 Production Companies

In [None]:
#checking the descriptive statistics of the production_companies column
df.production_companies.describe()

In [None]:
#creating the plot of the column
top10('production_companies')
plotTitle('Top 10 Production Companies', 'Number of movies created', 'Companies');

It looks like **Universal Pictures** is the company that has created the most movies in the dataset.

<a id='qs3'></a>
> ### Top 10 Actors

In [None]:
top10('cast')
plotTitle('Top 10 Actors', 'Number of movies', 'Actors');

It looks like **Nicolas Cage** has appeared in a lot of movies!

<a id='multiple'></a>
## Multiple Variable Exploration

#### Creation of two new columns

Before I dive deeper into the analysis, I converted the adjusted revenue and budget columns to integers, renamed them, and created two new columns:
- ```revenue_in_millions```: the adjusted revenue divided by one million, so that the numbers in the graphs are easier to read.
- ```budget_in_millions```: the adjusted budget divided by one million, for the same reason.

I also eliminated the negative ```revenue_in_USD_2010``` values, because there were very few of them and they were distorting my visualizations.

In [None]:
df["revenue_adj"] = df["revenue_adj"].astype(int)

In [None]:
df["budget_adj"] = df["budget_adj"].astype(int)

In [None]:
df = df.rename(columns = {'revenue_adj': 'revenue_in_USD_2010', 'budget_adj': 'budget_in_USD_2010'}, inplace = False)

In [None]:
df['revenue_in_millions'] = df['revenue_in_USD_2010']/1000000

In [None]:
df['budget_in_millions'] = df['budget_in_USD_2010']/1000000

In [None]:
df = df[(df['revenue_in_millions'] >= 0)]

In [None]:
df.head(2)

#### I created a temporary dataframe with the values of:
- ```release_year```
- ```revenue_in_millions```
- ```original_title```

Then I sorted the values by ```release_year``` and ```revenue_in_millions```. Once I did this, I created a **line graph** to evaluate the revenue of the **best movies** from **1960-2015**.

In [None]:
temp_df = df[['release_year', 'revenue_in_millions', 'original_title']].sort_values(['release_year', 'revenue_in_millions'], ascending=False)

<a id='qm1'></a>
> ### Highest Revenue Per Year and from which movie

In [None]:
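# temp_df is sorted by revenue in descending order within each year,
# so the first title of each group belongs to the highest-revenue movie
temp_df.groupby('release_year').agg({'revenue_in_millions': 'max', 'original_title': 'first'})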
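
Pairing each year's maximum revenue with a title only works if the rows are sorted first. One alternative that does not depend on row order is to look up the row of the maximum with `idxmax`. This is a minimal sketch, assuming a hypothetical `toy` DataFrame with made-up titles and revenues:

```python
import pandas as pd

# Hypothetical data; titles and revenues are invented for illustration.
toy = pd.DataFrame({
    'release_year': [1975, 1975, 1977, 1977],
    'original_title': ['Movie A', 'Movie B', 'Movie C', 'Movie D'],
    'revenue_in_millions': [100.0, 250.0, 300.0, 50.0],
})

# idxmax() returns, for each year, the row label of the highest revenue.
best_rows = toy.groupby('release_year')['revenue_in_millions'].idxmax()

# Use those labels to pull the full rows, whatever order toy is in.
best_per_year = toy.loc[best_rows, ['release_year', 'original_title', 'revenue_in_millions']]
print(best_per_year)
```

If two movies tie for the highest revenue in a year, `idxmax` keeps only the first one it finds.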

<a id='qm2'></a>
> ### Top 10 Movies that generated most revenue

In [None]:
revenue_per_movie = df.groupby('original_title')["revenue_in_millions"].sum()
# keep the 10 highest totals; sorting ascending puts the largest bar on top
revenue_per_movie = revenue_per_movie.nlargest(10).sort_values(ascending = True)
revenue_per_movie.plot(kind = 'barh',figsize =(12,10), legend = True)
plotTitle("Top 10 Movies that generated most revenue","Revenue in Millions USD", "Movie")

<a id='qm3'></a>
> ### Revenue in Millions from 1960-2015

In [None]:
sns.set(font_scale=1.4)
temp_df.set_index('release_year')['revenue_in_millions'].plot(figsize=(20, 10), linewidth=2.5, color='blue', xlim = (1960, 2016))
#using the plotTitle Function that I created
plotTitle("Best movies revenues over years","Years", "Revenue in Millions USD (2010)")

<a id='qm4'></a>
> ### Movie that made the most money
I also wanted to review which movie has made the most money in my dataset.

In [None]:
df.loc[df['revenue_in_millions'].idxmax()]

<a id='qm5'></a>
> ### Movie that made the least money

In [None]:
df.loc[df['revenue_in_millions'].idxmin()]

<a id='qm6'></a>
> ### Which genre generated the most revenue, and which genre has the best votes?
- To do this, I grouped the data by the ```genres``` column, which holds the genres of each movie, and summed the variable ```revenue_in_millions``` for each genre.

In [None]:
# Total revenue (in millions) per genre, sorted from lowest to highest
genres_revenue = df.groupby('genres')["revenue_in_millions"].sum().sort_values(ascending = True)
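
Grouping by genre like this assumes each row carries a single genre. The sketch below shows one way such a column could be produced, under the assumption (not confirmed in this notebook) that the raw data stores several genres in one string separated by `|`. The `raw` DataFrame and its values are hypothetical:

```python
import pandas as pd

# Hypothetical raw data: several genres packed into one string per movie.
raw = pd.DataFrame({
    'original_title': ['Movie A', 'Movie B'],
    'genres': ['Action|Adventure', 'Drama'],
    'revenue_in_millions': [200.0, 50.0],
})

# Split each string into a list, then give every genre its own row.
one_genre_per_row = raw.assign(genres=raw['genres'].str.split('|')).explode('genres')

# The same groupby as in the notebook, now on one genre per row.
genres_revenue_toy = one_genre_per_row.groupby('genres')['revenue_in_millions'].sum()
print(genres_revenue_toy)
```

With this layout, a movie with several genres contributes its full revenue to each of them, so the per-genre totals add up to more than the total revenue of the dataset.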

In [None]:
#Plotting the revenue for each genre
ax = genres_revenue.plot(kind = 'barh',figsize =(10,8), legend = True)
#using the plotTitle function that I created; in a horizontal bar chart the x axis holds the revenue
plotTitle("Revenue Per Genre","Revenue in Millions USD (2010)", "Genres")

### Knowing which are the most popular genres
- I'm going to review which genres do better over the years, keeping just the top genres.

In [None]:
genres_pop = df.groupby('genres')["popularity"].sum().sort_values(ascending = False)
genres_pop

To **better understand** the data, I drop the genres that I don't want to track, because I just want to see the top 8 genres.

> ### Function for dropping genres

In [None]:
test_df = pd.read_csv('clean_dataset - Copy.csv')

def dropping():
    """
    This function removes from test_df the rows of the genres that I don't want to analyze.
    """
    genres = ['Horror', 'Animation', 'Mystery', 'Music', 'War', 'Western', 'Documentary', 'TV Movie', 'Foreign', 'Fantasy', 'Family', 'Crime']
    # drop(..., inplace=True) modifies test_df and returns None, so nothing is assigned
    test_df.drop(test_df.index[test_df['genres'].isin(genres)], inplace = True)

# calling the function dropping
dropping()

In [None]:
genre_over_time = test_df.groupby(['release_year','genres'])['popularity'].mean()
genre_over_time

In [None]:
genre_over_time = genre_over_time.unstack()
genre_over_time.head(2)

In [None]:
genre_over_time.plot(figsize= (15,8), linewidth=3, xlim = (1960,2015), ylim = (0,8))
plotTitle("How the popularity of genres changes over the years","Years (1960-2015)", "Average popularity")

<a id='q3'></a>
### Question 3: Revenue per Company

> Revenue per company: to compute this, I grouped the data by ```production_companies``` and summed ```revenue_in_millions```.

### Top 10 Companies with most revenue

In [None]:
revenue_per_company = df.groupby('production_companies')["revenue_in_millions"].sum()
# keep the 10 highest totals; sorting ascending puts the largest bar on top
revenue_per_company = revenue_per_company.nlargest(10).sort_values(ascending = True)
revenue_per_company.plot(kind = 'barh',figsize =(12,10), legend = True)
plotTitle("Top 10 Companies with most revenue","Revenue in Millions USD", "Production Company")

### Top 10 Companies with least revenue

In [None]:
revenue_per_company = df.groupby('production_companies')["revenue_in_millions"].sum()
# keep the 10 lowest totals
revenue_per_company = revenue_per_company.nsmallest(10).sort_values(ascending = True)
revenue_per_company.plot(kind = 'barh',figsize =(12,10), legend = True)
plotTitle("Top 10 Companies with least revenue","Revenue in Millions USD", "Production Company")

<a id='q3'></a>
### Question 3: Revenues per actor

> Revenue per actor: to compute this, I grouped the data by ```cast``` and summed ```revenue_in_millions```.

### Top 10 Actors with most revenue

In [None]:
# total revenue of the movies each actor appears in (this is movie revenue, not the actor's pay)
revenue_per_actor = df.groupby('cast')["revenue_in_millions"].sum()
revenue_per_actor = revenue_per_actor.nlargest(10).sort_values(ascending = True)
revenue_per_actor.plot(kind = 'barh',figsize =(12,10), legend = True)
plotTitle("Top 10 Actors with most revenue","Revenue in Millions USD", "Actor")

### Top 10 Actors with least revenue

In [None]:
# total revenue of the movies each actor appears in (this is movie revenue, not the actor's pay)
revenue_per_actor = df.groupby('cast')["revenue_in_millions"].sum()
revenue_per_actor = revenue_per_actor.nsmallest(10).sort_values(ascending = True)
revenue_per_actor.plot(kind = 'barh',figsize =(12,10), legend = True)
plotTitle("Top 10 Actors with least revenue","Revenue in Millions USD", "Actor")

<a id='q3'></a>
### Question 3: Average Revenue per Genre from 1960-2015

> Average revenue per genre: to compute this, I grouped the data by ```genres``` and took the mean of ```revenue_in_millions```.

In [None]:
genres_avg_revenue = df.groupby('genres')["revenue_in_millions"].mean().sort_values(ascending = True)

In [None]:
ax = genres_avg_revenue.plot(kind = 'barh',figsize =(10,10), legend = True)
plotTitle("Revenue Per Genre","Revenue in Millions USD", "Genres")

<a id='q4'></a>
### Question 4: Correlations between revenue and budget

- To create this correlation, I first keep only the rows where ```revenue_in_millions``` is non-negative, because the negative values distort the visualization.

In [None]:
# keep only the rows with non-negative revenue
correlation = df[(df['revenue_in_millions'] >= 0)]
correlation.revenue_in_millions

In [None]:
# plot the filtered dataframe created in the previous cell
correlation.plot(x='budget_in_millions', y='revenue_in_millions', kind = 'scatter', figsize = (12,10), ylim = (0,2000), xlim = (0,300))
plotTitle("Correlation between a movie's Budget and its revenue","Budget (in Millions) to create a Movie", "Revenue in Millions")

### Correlation between a Movie's Budget and its Popularity

- We can see that there is a positive relationship between these two variables.

In [None]:
df.plot(x='budget_in_millions', y='popularity', kind = 'scatter', figsize = (12,10), ylim = (0,10), xlim = (0,300))
plotTitle("Correlation between a movie's Budget and its popularity" ,"Budget (in Millions) to create a Movie", "Popularity")
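
To put a number on the relationship seen in a scatter plot, one option is the Pearson correlation coefficient. This is a minimal sketch, assuming a small hypothetical `toy` DataFrame whose values are made up and not taken from the dataset:

```python
import pandas as pd

# Hypothetical values, chosen only to illustrate the call.
toy = pd.DataFrame({
    'budget_in_millions': [10.0, 50.0, 100.0, 200.0],
    'popularity': [0.5, 1.0, 2.5, 4.0],
})

# Pearson correlation coefficient between the two columns (ranges from -1 to 1).
r = toy['budget_in_millions'].corr(toy['popularity'])
print(round(r, 2))
```

On the real data, the same call would be `df['budget_in_millions'].corr(df['popularity'])`. Values close to 1 indicate a strong positive linear relationship, while values near 0 indicate little or none.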

<a id='conclusions'></a>
## Conclusions

> - As I analyzed this dataset, I found out that the higher the budget, **the better** the chances for the movie to be more popular. I think it is also important to understand how correlations work.
- I also found out that **Action**, **Adventure**, **Drama**, **Comedy** and **Thriller** are the genres that have generated the most revenue from 1960 to 2015.
- Overall I had a lot of fun analyzing this dataset, and if the reader has any suggestions on this work, please let me know.