# IMDB 5000 movie metadata

We start by loading the dataset into a pandas DataFrame and inspecting the attributes of a single entry (the second row, at index 1).

In [2]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_csv("./data/movie_metadata.csv")
print(df.iloc[1])


color                                                                    Color
director_name                                                   Gore Verbinski
num_critic_for_reviews                                                     302
duration                                                                   169
director_facebook_likes                                                    563
actor_3_facebook_likes                                                    1000
actor_2_name                                                     Orlando Bloom
actor_1_facebook_likes                                                   40000
gross                                                              3.09404e+08
genres                                                Action|Adventure|Fantasy
actor_1_name                                                       Johnny Depp
movie_title                          Pirates of the Caribbean: At World's End 
num_voted_users                                     

# Explanation of the attributes of the data

The dataset consists of 28 attributes that together describe a movie.
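The attribute count and the column types can be checked directly on the DataFrame. The following is a minimal sketch: `summarize_columns` and the three-row `sample` frame are hypothetical stand-ins introduced here so the snippet runs without the CSV file. On the real data, `df.shape` would be expected to report 28 columns.

```python
import pandas as pd


def summarize_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with its dtype and number of missing values."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "missing": frame.isna().sum(),
    })


# Tiny hypothetical stand-in for the real file (values are illustrative only).
sample = pd.DataFrame({
    "movie_title": ["A", "B", "C"],
    "duration": [169.0, None, 120.0],
    "imdb_score": [7.1, 6.8, 8.0],
})

n_rows, n_cols = sample.shape      # (number of movies, number of attributes)
overview = summarize_columns(sample)
print(n_rows, n_cols)
print(overview)
```

Applied to the real frame, the same call would be `summarize_columns(df)`.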

## Attribute description

The table below describes each attribute, states whether it is discrete or continuous, and gives its measurement scale (nominal, ordinal, interval or ratio).
<br>

<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;}
.tg td{font-family:Arial, sans-serif;font-size:14px;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;}
.tg th{font-family:Arial, sans-serif;font-size:14px;font-weight:normal;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;}
.tg .tg-baqh{text-align:center;vertical-align:top}
.tg .tg-amwm{font-weight:bold;text-align:center;vertical-align:top}
.tg .tg-yw4l{vertical-align:top}
</style>
<table class="tg">
  <tr>
    <th class="tg-amwm">Attribute</th>
    <th class="tg-amwm">Description</th>
    <th class="tg-amwm">Discrete/Continuous</th>
    <th class="tg-amwm">Type of attribute</th>
  </tr>
  <tr>
    <td class="tg-yw4l">movie_title</td>
    <td class="tg-yw4l">Holds title of the movie.</td>
    <td class="tg-baqh">Discrete</td>
    <td class="tg-baqh">Nominal</td>
  </tr>
  <tr>
    <td class="tg-yw4l">director_name</td>
    <td class="tg-yw4l">Name of director of the movie.</td>
    <td class="tg-baqh">Discrete</td>
    <td class="tg-baqh">Nominal</td>
  </tr>
  <tr>
    <td class="tg-yw4l">color</td>
    <td class="tg-yw4l">Shown in color or black and white.</td>
    <td class="tg-baqh">Discrete</td>
    <td class="tg-baqh">Nominal</td>
  </tr>
  <tr>
    <td class="tg-yw4l">duration</td>
    <td class="tg-yw4l">Duration of the movie in minutes.</td>
    <td class="tg-baqh">Discrete</td>
    <td class="tg-baqh">Ratio</td>
  </tr>
  <tr>
    <td class="tg-yw4l">actor_1_name</td>
    <td class="tg-yw4l">Name of lead actor.</td>
    <td class="tg-baqh">Discrete</td>
    <td class="tg-baqh">Nominal</td>
  </tr>
  <tr>
    <td class="tg-yw4l">actor_2_name</td>
    <td class="tg-yw4l">Name of second actor.</td>
    <td class="tg-baqh">Discrete</td>
    <td class="tg-baqh">Nominal</td>
  </tr>
  <tr>
    <td class="tg-yw4l">actor_3_name</td>
    <td class="tg-yw4l">Name of third actor.</td>
    <td class="tg-baqh">Discrete</td>
    <td class="tg-baqh">Nominal</td>
  </tr>
  <tr>
    <td class="tg-yw4l">title_year</td>
    <td class="tg-yw4l">Year of release.</td>
    <td class="tg-baqh">Discrete</td>
    <td class="tg-baqh">Interval</td>
  </tr>
  <tr>
    <td class="tg-yw4l">genres</td>
    <td class="tg-yw4l">Genres the movie belongs to.</td>
    <td class="tg-baqh">Discrete</td>
    <td class="tg-baqh">Nominal</td>
  </tr>
  <tr>
    <td class="tg-yw4l">aspect_ratio</td>
    <td class="tg-yw4l">Aspect ratio</td>
    <td class="tg-baqh">Discrete</td>
    <td class="tg-baqh">Nominal</td>
  </tr>
  <tr>
    <td class="tg-yw4l">facenumber_in_poster</td>
    <td class="tg-yw4l">Number of faces shown in movie poster.</td>
    <td class="tg-baqh">Discrete</td>
    <td class="tg-baqh">Ratio</td>
  </tr>
  <tr>
    <td class="tg-yw4l">language</td>
    <td class="tg-yw4l">Language spoken in the movie.</td>
    <td class="tg-baqh">Discrete</td>
    <td class="tg-baqh">Nominal</td>
  </tr>
  <tr>
    <td class="tg-yw4l">country</td>
    <td class="tg-yw4l">Country where the movie is filmed.</td>
    <td class="tg-baqh">Discrete</td>
    <td class="tg-baqh">Nominal</td>
  </tr>
  <tr>
    <td class="tg-yw4l">budget</td>
    <td class="tg-yw4l">Cost of the movie.</td>
    <td class="tg-baqh">Continuous</td>
    <td class="tg-baqh">Ratio</td>
  </tr>
  <tr>
    <td class="tg-yw4l">gross</td>
    <td class="tg-yw4l">Income of the movie.</td>
    <td class="tg-baqh">Continuous</td>
    <td class="tg-baqh">Ratio</td>
  </tr>
  <tr>
    <td class="tg-yw4l">movie_facebook_likes</td>
    <td class="tg-yw4l">Number of Facebook likes for the movie.</td>
    <td class="tg-baqh">Discrete</td>
    <td class="tg-baqh">Ratio</td>
  </tr>
  <tr>
    <td class="tg-yw4l">director_facebook_likes</td>
    <td class="tg-yw4l">Number of Facebook likes the director has.</td>
    <td class="tg-baqh">Discrete</td>
    <td class="tg-baqh">Ratio</td>
  </tr>
  <tr>
    <td class="tg-yw4l">actor_1_facebook_likes</td>
    <td class="tg-yw4l">Facebook likes actor 1 has.</td>
    <td class="tg-baqh">Discrete</td>
    <td class="tg-baqh">Ratio</td>
  </tr>
  <tr>
    <td class="tg-yw4l">actor_2_facebook_likes</td>
    <td class="tg-yw4l">Facebook likes actor 2 has.</td>
    <td class="tg-baqh">Discrete</td>
    <td class="tg-baqh">Ratio</td>
  </tr>
  <tr>
    <td class="tg-yw4l">actor_3_facebook_likes</td>
    <td class="tg-yw4l">Facebook likes actor 3 has.</td>
    <td class="tg-baqh">Discrete</td>
    <td class="tg-baqh">Ratio</td>
  </tr>
  <tr>
    <td class="tg-yw4l">cast_total_facebook_likes</td>
    <td class="tg-yw4l">Total Facebook likes for the whole cast of the movie.</td>
    <td class="tg-baqh">Discrete</td>
    <td class="tg-baqh">Ratio</td>
  </tr>
  <tr>
    <td class="tg-yw4l">plot_keywords</td>
    <td class="tg-yw4l">Keywords describing the movie.</td>
    <td class="tg-baqh">Discrete</td>
    <td class="tg-baqh">Nominal</td>
  </tr>
  <tr>
    <td class="tg-yw4l">content_rating</td>
    <td class="tg-yw4l">Content rating of the movie (for example PG-13 or R).</td>
    <td class="tg-baqh"></td>
    <td class="tg-baqh"></td>
  </tr>
  <tr>
    <td class="tg-yw4l">num_user_for_reviews</td>
    <td class="tg-yw4l">Number of users who wrote reviews.</td>
    <td class="tg-baqh">Discrete</td>
    <td class="tg-baqh">Ratio</td>
  </tr>
  <tr>
    <td class="tg-yw4l">num_critic_for_reviews</td>
    <td class="tg-yw4l">Number of critics who wrote reviews.</td>
    <td class="tg-baqh">Discrete</td>
    <td class="tg-baqh">Ratio</td>
  </tr>
  <tr>
    <td class="tg-yw4l">num_voted_users</td>
    <td class="tg-yw4l">Number of users who have voted on the movie.</td>
    <td class="tg-baqh">Discrete</td>
    <td class="tg-baqh">Ratio</td>
  </tr>
  <tr>
    <td class="tg-yw4l">movie_imdb_link</td>
    <td class="tg-yw4l">Link to the movie's page on IMDb.</td>
    <td class="tg-baqh">Discrete</td>
    <td class="tg-baqh">Nominal</td>
  </tr>
  <tr>
    <td class="tg-yw4l">imdb_score</td>
    <td class="tg-yw4l">Movie score on IMDB.</td>
    <td class="tg-baqh">Continuous</td>
    <td class="tg-baqh">Ordinal</td>
  </tr>
</table>
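The measurement scale determines which summaries are meaningful for an attribute. As a rough sketch, a few rows of the table can be recorded as a mapping and used to pick a summary per scale. The names `ATTRIBUTE_TYPES`, `columns_with_scale` and `summarize`, and the three-row `sample` frame, are hypothetical and only serve the illustration; the choice of mode, median and mean is one common convention, not the only one.

```python
import pandas as pd

# A few rows of the attribute table, written as attribute -> (kind, scale).
ATTRIBUTE_TYPES = {
    "movie_title": ("discrete", "nominal"),
    "color": ("discrete", "nominal"),
    "duration": ("discrete", "ratio"),
    "title_year": ("discrete", "interval"),
    "budget": ("continuous", "ratio"),
    "imdb_score": ("continuous", "ordinal"),
}


def columns_with_scale(scale: str) -> list:
    """Return the attributes whose measurement scale matches `scale`."""
    return [name for name, (_, s) in ATTRIBUTE_TYPES.items() if s == scale]


def summarize(column: pd.Series, scale: str):
    """Pick a summary that is meaningful for the given measurement scale."""
    if scale == "nominal":
        return column.mode().iloc[0]  # most frequent category
    if scale == "ordinal":
        return column.median()        # relies on order only
    return column.mean()              # interval and ratio scales


# Hypothetical sample values, not taken from the dataset.
sample = pd.DataFrame({
    "color": ["Color", "Color", "Black and White"],
    "duration": [169, 120, 101],
    "imdb_score": [7.1, 6.8, 8.0],
})

nominal_columns = columns_with_scale("nominal")
typical_color = summarize(sample["color"], "nominal")
mean_duration = summarize(sample["duration"], "ratio")
median_score = summarize(sample["imdb_score"], "ordinal")
print(nominal_columns, typical_color, mean_duration, median_score)
```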

## Summary statistics

Summary statistics for the numeric attributes of the dataset, computed over the rows that have no missing values.

In [13]:
df = df.dropna()  # Drop every row with a missing value so all statistics are computed over the same rows
df.describe()

|  | num_critic_for_reviews | duration | director_facebook_likes | actor_3_facebook_likes | actor_1_facebook_likes | gross | num_voted_users | cast_total_facebook_likes | facenumber_in_poster | num_user_for_reviews | budget | title_year | actor_2_facebook_likes | imdb_score | aspect_ratio | movie_facebook_likes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 3756.0 | 3756.0 | 3756.0 | 3756.0 | 3756.0 | 3756.0 | 3756.0 | 3756.0 | 3756.0 | 3756.0 | 3756.0 | 3756.0 | 3756.0 | 3756.0 | 3756.0 | 3756.0 |
| mean | 167.378328 | 110.257987 | 807.336528 | 771.279553 | 7751.338658 | 52612820.0 | 105826.7 | 11527.10197 | 1.377263 | 336.843184 | 46236850.0 | 2002.976571 | 2021.775825 | 6.465282 | 2.111014 | 9353.82934 |
| std | 123.45204 | 22.646717 | 3068.171683 | 1894.249869 | 15519.339621 | 70317870.0 | 152035.4 | 19122.176905 | 2.041541 | 411.227368 | 226010300.0 | 9.888108 | 4544.908236 | 1.056247 | 0.353068 | 21462.889123 |
| min | 2.0 | 37.0 | 0.0 | 0.0 | 0.0 | 162.0 | 91.0 | 0.0 | 0.0 | 4.0 | 218.0 | 1927.0 | 0.0 | 1.6 | 1.18 | 0.0 |
| 25% | 77.0 | 96.0 | 11.0 | 194.0 | 745.0 | 8270233.0 | 19667.0 | 1919.75 | 0.0 | 110.0 | 10000000.0 | 1999.0 | 384.75 | 5.9 | 1.85 | 0.0 |
| 50% | 138.5 | 106.0 | 64.0 | 436.0 | 1000.0 | 30093110.0 | 53973.5 | 4059.5 | 1.0 | 210.0 | 25000000.0 | 2004.0 | 685.5 | 6.6 | 2.35 | 227.0 |
| 75% | 224.0 | 120.0 | 235.0 | 691.0 | 13000.0 | 66881940.0 | 128602.0 | 16240.0 | 2.0 | 398.25 | 50000000.0 | 2010.0 | 976.0 | 7.2 | 2.35 | 11000.0 |
| max | 813.0 | 330.0 | 23000.0 | 23000.0 | 640000.0 | 760505800.0 | 1689764.0 | 656730.0 | 43.0 | 5060.0 | 12215500000.0 | 2016.0 | 137000.0 | 9.3 | 16.0 | 349000.0 |
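Calling `dropna()` before `describe()` changes which rows enter the statistics. `describe()` on its own already skips missing values column by column, whereas `dropna()` removes every row that has a missing value in any column. A small sketch on a hypothetical four-row frame (the values are made up for illustration) shows the difference:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in each column.
sample = pd.DataFrame({
    "duration": [100.0, 120.0, np.nan, 90.0],
    "gross": [1.0e6, np.nan, 3.0e6, 5.0e6],
})

# NaN values are skipped per column: each column keeps its 3 valid entries.
per_column = sample.describe()

# Rows with any NaN are removed first: only rows 0 and 3 remain.
complete = sample.dropna().describe()

print(per_column.loc["count"])
print(complete.loc["count"])
```

After `dropna()`, every column reports the same count, which is why all columns in the summary above show 3756 rows. The trade-off is that rows with a gap in only one attribute no longer contribute to the statistics of the other attributes.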
