# <u>TMDB Dataset</u>
  ### *An Initial Look: Exploratory Data Analysis and some key insights*
> Archie Porteous DA13

### <u>Contents</u> 
### **1.** Importing the dataset    
 * **1.1** Importing the csv from a local path into a Pandas DataFrame     
 * **1.2** Quick Look using `.describe()` & `.head()`      

### **2.** Data Cleaning  
 * **2.1** Reformatting the '*genres*', '*keywords*', etc. columns  
 * **2.2** Checking for nulls
 * **2.3** Checking for Duplicates

### **3.** Exploratory Insights/Keypoints
 *  **3.1** Budget Efficiency (revenue/dollar)
 *  **3.2** What Categories do Voters like?

<img src = "https://images.unsplash.com/photo-1542204165-65bf26472b9b?q=80&w=1374&auto=format&fit=crop&ixlib=rb-4.1.0&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D" width=400px height=300px />

# **1.** Importing the dataset

### **1.1** Importing the csv from a local path into a Pandas DataFrame

In [None]:
# Importing Required Libraries
import numpy as np  # maths
import seaborn as sns # visualisation
import matplotlib.pyplot as plt # visualisation
import pandas as pd # dataframes & data analysis
import re # RegEx

In [None]:
path = "TMDB_movies.csv"

df = pd.read_csv(path)

### **1.2** Quick Look using `.describe()` & `.head()`

In [None]:
df.shape

In [None]:
df.describe(include="all")

In [None]:
df.head()

# **2.** Data Cleaning 

### **2.1** Reformatting the '*genres*', '*keywords*', etc. columns

In [None]:
df.spoken_languages[0]

In [None]:
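def clean_str(string):
    # Return every value stored under a "name" key in the JSON-like string.
    # A raw string keeps \s from being treated as an invalid Python escape sequence.
    return re.findall(r'"name":\s"([^}{"]+)"', string)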

In [None]:
clean_str('[{"id": 28, "name": "Action"},{"id": 60, "name": "Adventure"},{"name": "Ingenious Film Partners", "id": 289},{"iso_3166_1": "US", "name": "United States of America"},{"iso_639_1": "en", "name": "English"}]')
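The regular expression above works because every entry stores its label under a `"name"` key. Since these columns appear to hold JSON-encoded lists, the same names could probably also be extracted with the standard `json` module. The sketch below is one possible alternative, not the method used in this notebook: `names_from_json` is a hypothetical helper name, and it assumes each cell is a valid JSON string.

```python
import json

def names_from_json(string):
    """Return the "name" value of every object in a JSON-encoded list (hypothetical helper)."""
    # Assumption: the cell is valid JSON, e.g. '[{"id": 28, "name": "Action"}]'
    return [item["name"] for item in json.loads(string) if "name" in item]

# Illustrative input in the same shape as the dataset's list columns
sample = '[{"id": 28, "name": "Action"}, {"iso_639_1": "en", "name": "English"}]'
names = names_from_json(sample)
print(names)
```

Parsing the structure is stricter than pattern matching: it fails loudly on malformed cells instead of silently returning fewer names.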

In [None]:
df_eda = df.copy()

In [None]:
for column in ['genres','keywords','production_companies','production_countries','spoken_languages']:
    df_eda[column] = df_eda[column].apply(clean_str)

In [None]:
df_eda.head()

### **2.2** Checking for nulls

In [None]:
# counting nulls
null_df = df_eda.isnull() # truth table
null_df.sum()

### **2.3** Checking for Duplicates

In [None]:
unique_id = set(df_eda.id)
len(unique_id)
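Comparing `len(unique_id)` with the row count from `df.shape` shows whether any `id` is repeated. A minimal sketch of the same check with pandas' built-in `duplicated` and `drop_duplicates`, run on a small made-up frame (`toy` and its values are illustrative assumptions, not rows from the dataset):

```python
import pandas as pd

# Made-up data: id 2 appears twice
toy = pd.DataFrame({"id": [1, 2, 2, 3], "title": ["A", "B", "B", "C"]})

# duplicated() flags every repeat of an id after its first occurrence
n_duplicate_ids = int(toy["id"].duplicated().sum())

# drop_duplicates() keeps the first row for each id
deduped = toy.drop_duplicates(subset="id", keep="first")

print(n_duplicate_ids, len(deduped))
```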

# **3.** Exploratory Insights/Keypoints

## **3.1** Budget Efficiency (revenue/dollar)

### Top 10 Films by Budget

In [None]:
top_10_budget = df_eda.sort_values('budget',ascending=False)[['budget', 'title']].head(10)
top_10_budget

In [None]:
sns.barplot(x='title',
            y='budget',
            data=top_10_budget,
            palette = 'magma')


plt.xticks(rotation=45,
           horizontalalignment='right',
           fontweight ='light',
           fontsize ='large'
           )

plt.title('Top 10 Budgets')

plt.show()

> ### Conclusion:
> *Pirates of the Caribbean: On Stranger Tides* is the film with the largest budget in our dataset. This is likely due to its many filming locations and extensive use of state-of-the-art visual effects.

### Top 10 Films by Revenue

In [None]:
top_10_revenue = df_eda.sort_values('revenue',ascending=False)[['revenue', 'title']].head(10)
top_10_revenue

In [None]:
sns.barplot(x='title',
            y='revenue',
            data=top_10_revenue,
            palette = 'Greens_r')


plt.xticks(rotation=45,
           horizontalalignment='right',
           fontweight ='light',
           fontsize ='large'
           )

plt.title('Top 10 Revenues')

plt.show()

> ### Conclusion:
> *Avatar* was the highest-grossing film in our dataset. *Avatar* had a captivating story and amazing visuals, which plausibly contributed to its success.

### **Top 10 Films by Revenue per Dollar Budget**

In [None]:
print(df_eda[df_eda.budget==0].shape[0])
print(df_eda[df_eda.revenue==0].shape[0])

In [None]:
df_eda[df_eda.budget==0].index

In [None]:
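df_zeros = df_eda.copy()
# Drop rows with a zero budget or zero revenue, since the ratio is meaningless for them.
# The masks are built from df_zeros itself so they always match its current index.
df_zeros.drop(index=df_zeros[df_zeros.budget==0].index, inplace=True)
df_zeros.drop(index=df_zeros[df_zeros.revenue==0].index, inplace=True)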


In [None]:
df_zeros['rev_bud'] = df_zeros['revenue']/df_zeros['budget']
top_10_efficiency = df_zeros.sort_values('rev_bud',ascending=False)[['rev_bud', 'title']].head(10)
top_10_efficiency
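The filter-then-divide step above can be illustrated on a small made-up frame. The sketch below uses a single boolean mask rather than dropping rows; `toy`, `nonzero`, `ranked`, and all values are illustrative assumptions, not rows from the dataset.

```python
import pandas as pd

# Made-up data: B has no recorded budget, C has no recorded revenue
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "budget": [100, 0, 50, 200],
    "revenue": [300, 500, 0, 100],
})

# Keep only rows where both budget and revenue are non-zero,
# so the ratio is neither infinite nor trivially zero.
nonzero = toy[(toy["budget"] != 0) & (toy["revenue"] != 0)].copy()
nonzero["rev_bud"] = nonzero["revenue"] / nonzero["budget"]

# Rank films by revenue earned per dollar of budget
ranked = nonzero.sort_values("rev_bud", ascending=False)
print(ranked[["title", "rev_bud"]])
```

A ratio above 1 means the film earned back more than its budget; zero budgets are excluded because dividing by them would give infinite values.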

In [None]:
sns.barplot(x='title',
            y='rev_bud',
            data=top_10_efficiency,
            palette = 'PuRd_r')


plt.xticks(rotation=45,
           horizontalalignment='right',
           fontweight ='light',
           fontsize ='large'
           )

plt.title('Top 10 Revenue/Budget')

plt.yscale('log')
plt.show()

> ### Conclusion:
> *Modern Times* was first shown in theatres in 1936 and was named one of the American Film Institute's Top 100 Films in 1998. This longevity is a likely factor in its great success. It is also worth bearing in mind that the dollar amounts in the dataset are not stated to be adjusted for inflation!
> *Paranormal Activity* is another interesting example shown in the graph above. It had a budget of a mere $15,000 and was shot over just 10 days. It had basic scenes with static camera angles, and the script consisted only of rough guidelines for the actors.

In [None]:
df_zeros[df_zeros['title']=="Paranormal Activity"]['budget']

## **3.2** What Categories do Voters like?