# Project: TMDB 5000 Movie Dataset

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#packages">Import Packages</a></li>    
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

> This dataset contains information about 5000 movies collected from The Movie Database (TMDb), including user ratings and revenue.

### What can we do?
> We can explore this dataset and ask some important questions that can be answered with this information, such as:
> <ul> 
    <li>Which genres are most popular from year to year?</li>
    <li>Which movies have been the most popular over the years?</li>
    <li>What kinds of properties are associated with movies that have high revenues?</li>
    <li>Is there a relationship between budget and revenue?</li>
  </ul>

<a id='packages'></a>
## Import Packages

> In this section we import the necessary data and visualization packages, which make the process enjoyable and help us gain valuable insights.

In [177]:
import pandas as pd
import matplotlib.pyplot as plt
import json

<a id='wrangling'></a>
## Data Wrangling

> In this section of the report, we will load the data, check for cleanliness, and then trim and clean our dataset for analysis.

In [178]:
# Load our data and print out a few lines.
df = pd.read_csv('../input/tmdb-movie-metadata/tmdb_5000_movies.csv')
df.head()

In [179]:
# Perform operations to inspect data
# types and look for instances of missing or possibly errant data.
df.isnull().sum()

There is a lot of missing data in the columns "**homepage**" and "**tagline**", but we can drop these columns because we don't need them in the analysis.
<br>We can also drop any columns or rows that don't give us valuable information, and fill in the missing data that we can use in the analysis.

In [180]:
df.drop(['homepage','tagline','id'], axis = 1,inplace = True) 
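To decide which columns are worth dropping, it can help to look at the share of missing values rather than the raw counts. The following is a minimal sketch on a made-up frame; the `toy` data and the 50% threshold are assumptions for illustration, not values taken from this analysis.

```python
import pandas as pd

# Made-up rows standing in for the real dataset (illustration only).
toy = pd.DataFrame({
    'homepage': [None, None, 'http://example.com', None],
    'runtime': [90.0, 100.0, None, 120.0],
    'title': ['A', 'B', 'C', 'D'],
})

# isnull() gives booleans, so the column mean is the fraction of missing values.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)

# Hypothetical rule: drop columns where more than half of the values are missing.
cols_to_drop = missing_share[missing_share > 0.5].index.tolist()
print(cols_to_drop)
```

Columns with only a few gaps, such as `runtime` here, are better candidates for filling than for dropping.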

In [181]:
df.shape

There are two missing values in the runtime column; we can fill them with the average runtime.

In [182]:
df['runtime'] = df['runtime'].fillna(df['runtime'].mean())

Then drop the remaining rows with missing data.

In [183]:
df.dropna(inplace = True)

Now we can make sure there is no missing data left.

In [184]:
df.isnull().sum()

In [185]:
df.shape

We can also check that there are no duplicate rows.

In [186]:
df.duplicated().sum()

We can take a look at the data types and edit them if needed.

In [187]:
df.info()

We can add a new column "year" to make it easier to work with dates.

In [188]:
df['release_date'] = pd.to_datetime(df['release_date'])
df['year'] = df['release_date'].dt.year
df['year']
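A small sketch of the same date handling on made-up strings, including one value that cannot be parsed. The `toy` frame is an assumption for illustration; `errors='coerce'` is an optional safeguard that the cell above does not use.

```python
import pandas as pd

# Made-up release dates; the last entry is deliberately invalid.
toy = pd.DataFrame({'release_date': ['2009-12-10', '2015-06-17', 'not a date']})

# errors='coerce' turns unparseable strings into NaT instead of raising.
toy['release_date'] = pd.to_datetime(toy['release_date'], errors='coerce')

# .dt.year extracts the year; rows with NaT get NaN, so the column becomes float.
toy['year'] = toy['release_date'].dt.year
print(toy)
```

If every date parses cleanly, as after dropping rows with missing values, the `year` column stays an integer type.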

Now we have a clean dataset we can work on to pose some questions and answer them freely. <br> Let's explore our data.

<a id='eda'></a>
## Exploratory Data Analysis

> Now let's explore the columns and variables and show some statistics.
 


In [189]:
# explore columns
df.head(1)

In [190]:
# show some statistics
df.describe()

From these brief statistics: <br>
* I suspect there is a positive correlation between budget spent and revenue
* It is also amazing that at least one movie has a very high vote_average
* It's a little weird to have a movie with a runtime of 0
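The zero runtime noted above can be followed up by counting exact zeros per numeric column. This is a minimal sketch on made-up rows; the `toy` frame is hypothetical, and treating a zero as a placeholder for an unknown value is an assumption that has not been verified for this dataset.

```python
import pandas as pd

# Made-up rows (illustration only).
toy = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'budget': [1000, 0, 5000, 0],
    'runtime': [95.0, 0.0, 120.0, 88.0],
})

# Comparing to 0 gives booleans; summing them counts the zeros in each column.
zero_counts = (toy[['budget', 'runtime']] == 0).sum()
print(zero_counts)

# The affected rows can then be inspected before deciding how to treat them.
print(toy[toy['runtime'] == 0])
```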

### How can we find the most popular movies across all years, in a specific year, or in a period of time?


In [191]:
def get_insights_popularity(year_1 = 0 , year_2 = 0) :
    # Plot the 10 most popular movies as a horizontal bar chart.
    # Parameters :
    #    year_1 : int (a release year present in the dataset)
    #           if omitted (0), movies from all years are considered
    #           if given alone, only movies from that year are considered
    #    year_2 : int (a year not earlier than year_1)
    #           if given, movies from year_1 to year_2 (inclusive) are considered
    sorted_popularity = df.sort_values(by = 'popularity',ascending= False)
    if year_2 == 0 :
        year_2 = year_1
    if year_1 != 0 :
        sorted_popularity = sorted_popularity.query('year >= @year_1 and year <= @year_2')
    top_10 = sorted_popularity.head(10)
    plt.subplots(figsize=(10,8))
    plt.gca().invert_yaxis()
    plt.barh(top_10['title'],top_10['popularity'])
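The selection step of the function above can also be sketched without plotting, which makes it easy to check on a small frame. The helper name `top_popular` and the `toy` data are hypothetical; `DataFrame.nlargest` is used here as an alternative to sorting followed by `head`.

```python
import pandas as pd

# Made-up rows (illustration only).
toy = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'popularity': [10.5, 99.1, 42.0, 7.3],
    'year': [2013, 2014, 2013, 2015],
})

def top_popular(frame, n=10, start=None, end=None):
    """Return the n most popular rows, optionally limited to years start..end."""
    if start is not None:
        # A single year is treated as the range start..start.
        end = start if end is None else end
        frame = frame[frame['year'].between(start, end)]
    return frame.nlargest(n, 'popularity')

print(top_popular(toy, n=2)['title'].tolist())
print(top_popular(toy, n=2, start=2013)['title'].tolist())
print(top_popular(toy, n=2, start=2014, end=2015)['title'].tolist())
```

Using `None` instead of `0` as the "no filter" default is a design choice of this sketch, not something the function above does.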

## Which movies have the highest popularity across all years?

If you don't pass any arguments, the function gives insights for all years.

In [192]:
get_insights_popularity()

**Minions** is the most popular movie across all years.

## Which movie is the most popular in a specific year or period of time?

Here we can find which movie is the most popular in a specific year or in a period, for example between 2000 and 2015.

In [193]:
get_insights_popularity(2013)

The most popular movie in 2013 is **Frozen**.

In [194]:
get_insights_popularity(2013,2015)

The most popular movie from 2013 to 2015 is **Minions**.

### How can we find the most popular genres across all years, in a specific year, or in a period of time?

In [195]:
def get_insights_genres(year_1 = 0 , year_2 = 0) :
    # Plot how often each genre appears among the more popular half of the movies.
    # Parameters :
    #    year_1 : int (a release year present in the dataset)
    #           if omitted (0), movies from all years are considered
    #           if given alone, only movies from that year are considered
    #    year_2 : int (a year not earlier than year_1)
    #           if given, movies from year_1 to year_2 (inclusive) are considered
    
    sorted_popularity = df.sort_values(by = 'popularity',ascending= False)
    if year_2 == 0 :
        year_2 = year_1
    if year_1 != 0 :
        sorted_popularity = sorted_popularity.query('year >= @year_1 and year <= @year_2')
    print(sorted_popularity.shape)
    # Keep only the more popular half of the selected movies.
    half = (len(sorted_popularity) + 1) // 2
    z = list()
    for raw in sorted_popularity['genres'].head(half) :
        genres = json.loads(raw)
        # Skip movies without genres. Note: only the last listed genre of
        # each movie is counted; append inside a loop over `genres`
        # instead to count every genre.
        if genres :
            z.append(genres[-1].get('name'))
    counts = dict()
    for i in z:
        counts[i] = counts.get(i, 0) + 1 
    plt.subplots(figsize=(10,8))
    plt.gca().invert_yaxis()
    plt.barh(list(counts.keys()),list(counts.values()))
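The `genres` column stores a JSON list of objects per movie, so one movie can carry several genres. Below is a minimal sketch, on made-up rows in the same shape, of counting every listed genre with `collections.Counter`. The `toy` frame is an assumption for illustration, and this variant is not necessarily what the function above computes.

```python
import json
from collections import Counter

import pandas as pd

# Made-up rows mimicking the JSON layout of the `genres` column.
toy = pd.DataFrame({
    'genres': [
        '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]',
        '[{"id": 18, "name": "Drama"}]',
        '[]',
        '[{"id": 28, "name": "Action"}, {"id": 18, "name": "Drama"}]',
    ]
})

# Parse each JSON string and count the name of every genre it lists.
# Rows with an empty list simply contribute nothing.
genre_counts = Counter(
    genre['name']
    for raw in toy['genres']
    for genre in json.loads(raw)
)
print(genre_counts.most_common())
```

The resulting counts can be passed to `plt.barh` in the same way as the `counts` dictionary.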

## Which genres have the highest popularity across all years?

In [196]:
get_insights_genres()

## Which genres have the highest popularity in a specific year or period of time?

In [197]:
get_insights_genres(2008)

In [198]:
get_insights_genres(2010,2012)

### What kinds of properties are associated with movies that have high revenues?<br> Or, is there a relationship between budget and revenue?

In [199]:
# Split the movies at the median revenue; defined at module level because
# a later cell uses both groups.
low_revenues = df[df['revenue'] < df['revenue'].median()]
high_revenues = df[df['revenue'] >= df['revenue'].median()]

def companies_make_movies_with_high_revenues() :
    # Print how often each production company appears among the high-revenue movies.
    print(high_revenues.shape)
    z = list()
    for raw in high_revenues['production_companies'] :
        companies = json.loads(raw)
        # Skip movies without companies. Note: only the last listed
        # company of each movie is counted.
        if companies :
            z.append(companies[-1].get('name'))
    counts = dict()
    for i in z:
        counts[i] = counts.get(i, 0) + 1 
    s = pd.DataFrame(list(counts.values()),list(counts.keys()),columns=['counts'])
    s = s.sort_values(by='counts',ascending= False)
    print(s)

In [200]:
companies_make_movies_with_high_revenues()

In [201]:
high_revenues['budget'].mean() , low_revenues['budget'].mean()
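The comparison of average budgets across a median split can be sketched with a single `groupby`. The `toy` frame and the `revenue_group` column name are assumptions for illustration only.

```python
import pandas as pd

# Made-up budgets and revenues (illustration only).
toy = pd.DataFrame({
    'budget': [10, 20, 200, 300],
    'revenue': [5, 50, 500, 900],
})

# Label each movie by whether its revenue reaches the median.
median_revenue = toy['revenue'].median()
toy['revenue_group'] = (toy['revenue'] >= median_revenue).map(
    {True: 'high', False: 'low'}
)

# Mean budget per group, computed in one pass.
mean_budget = toy.groupby('revenue_group')['budget'].mean()
print(mean_budget)
```

A difference in group means suggests an association, but it says nothing about how strong or how linear the relationship is.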

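On average, the high-revenue movies have a higher budget than the low-revenue ones, so a bigger budget tends to go with higher revenue. <br> Warner Bros. also appears very often among the high-revenue movies, so I think it is the best company.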

### Is there a relationship between budget and revenue?

In [202]:
plt.scatter(df['budget'],df['revenue'])

There is a positive correlation between budget and revenue.
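A scatter plot shows the direction of the relationship; a correlation coefficient puts a number on it. The following is a minimal sketch on made-up values, using the Pearson correlation that `Series.corr` computes by default. The `toy` frame is hypothetical and its result says nothing about the real dataset.

```python
import pandas as pd

# Made-up budgets and revenues (illustration only).
toy = pd.DataFrame({
    'budget': [10, 20, 30, 40, 50],
    'revenue': [15, 35, 30, 80, 90],
})

# Pearson correlation: +1 is a perfect increasing line, 0 is no linear relation.
r = toy['budget'].corr(toy['revenue'])
print(round(r, 3))
```

On the real data the same call would be `df['budget'].corr(df['revenue'])`.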

<a id='conclusions'></a>
## Conclusions

> We analyzed this amazing dataset and found the most popular movies in each year; **Minions** is the most popular movie across all years.
> The most popular genres are Thriller, Drama, and Romance.
> There is a positive correlation between budget and revenue.

## Limitations

> We dropped some rows, which may affect the insights, but this is only about 5 rows.