# 📺 IMDB MOVIES ANALYSIS

### Introduction
In this notebook, I analyze the [IMDB movies dataset](https://www.kaggle.com/datasets/ashpalsingh1525/imdb-movies-dataset). The dataset contains information on 9,660 movies, such as the title, genre, score, actors and more.

Our goal is to gain insights about the movies and answer some key questions. We will be using the following libraries:
- Pandas
- Matplotlib
- Seaborn
- Numpy

### Table of Contents
 - Questions
 - Data cleaning
 - Exploratory Data Analysis
 - Conclusions

---

### Questions
- Q1: Which genres are the most profitable?
- Q2: Which genres have the highest popularity?
- Q3: What is the relationship between the popularity and profit?
- Q4: Does higher budget result in higher profit?
- Q5: Are the longest movies more popular?
- Q6: Are there trends over time for profit and popularity? If so, what are they?
---


### Data Cleaning
I will import the libraries and load the data.

In [16]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

df = pd.read_csv('data/movies.csv')

Now I will explore the data in order to get some basic information about the dataset.

In [17]:
df.head(3)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"['Action', 'Adventure', 'Fantasy', 'Science Fi...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"['Adventure', 'Fantasy', 'Action']",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500
2,245000000,"['Action', 'Adventure', 'Crime']",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466
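
In the preview above, the `genres` values look like string representations of Python lists. If that is how they are stored (an assumption based only on the preview), they would need to be parsed before any per-genre analysis. A minimal sketch on a small hypothetical frame (`toy`), using two of the rows shown above:

```python
import ast

import pandas as pd

# Hypothetical toy frame; the genre strings are copied from the preview above.
toy = pd.DataFrame({
    "original_title": ["Pirates of the Caribbean: At World's End", "Spectre"],
    "genres": [
        "['Adventure', 'Fantasy', 'Action']",
        "['Action', 'Adventure', 'Crime']",
    ],
})

# ast.literal_eval safely converts each string into a real Python list.
toy["genres"] = toy["genres"].apply(ast.literal_eval)

# explode produces one row per (movie, genre) pair.
per_genre = toy.explode("genres")

print(per_genre["genres"].value_counts())
```

With one row per genre, grouping by `genres` becomes straightforward for questions such as Q1 and Q2.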


In [18]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 20 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   budget                4803 non-null   int64  
 1   genres                4803 non-null   object 
 2   homepage              1712 non-null   object 
 3   id                    4803 non-null   int64  
 4   keywords              4803 non-null   object 
 5   original_language     4803 non-null   object 
 6   original_title        4803 non-null   object 
 7   overview              4800 non-null   object 
 8   popularity            4803 non-null   float64
 9   production_companies  4803 non-null   object 
 10  production_countries  4803 non-null   object 
 11  release_date          4802 non-null   object 
 12  revenue               4803 non-null   int64  
 13  runtime               4801 non-null   float64
 14  spoken_languages      4803 non-null   object 
 15  status               

#### Handle Duplicates

In [19]:
df.duplicated().sum()

np.int64(0)

There are no duplicates, so we can move on to the next step.
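
Note that `df.duplicated()` only flags rows that match on every column. As an extra check, one could compare rows on a subset of identifying columns. The sketch below uses a small hypothetical frame (`toy`); treating `original_title` plus `release_date` as a key is an assumption, not something the dataset documents.

```python
import pandas as pd

# Hypothetical toy data: the first two rows describe the same film
# but differ slightly in popularity.
toy = pd.DataFrame({
    "original_title": ["Film A", "Film A", "Film B"],
    "release_date": ["2009-12-10", "2009-12-10", "2015-10-26"],
    "popularity": [150.4, 150.5, 107.4],
})

# Full-row comparison finds nothing because popularity differs.
full_row_dupes = int(toy.duplicated().sum())

# Comparing only the assumed key columns flags the second "Film A" row.
key_dupes = int(toy.duplicated(subset=["original_title", "release_date"]).sum())

print(full_row_dupes, key_dupes)
```
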
#### Deleting columns
In this step, I will remove the columns that won't be useful for the analysis.

In [20]:
df.drop(['homepage','id', 'keywords', 'overview','production_companies', 'production_countries', 'spoken_languages', 'tagline', 'status', 'title', 'vote_count', 'vote_average'], axis=1, inplace=True)
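
One practical caveat: re-running a cell that drops columns in place raises a `KeyError`, because the columns are already gone. A possible alternative, sketched on a hypothetical toy frame (`toy`) with a few of the column names above, is to pass `errors="ignore"` and reassign the result:

```python
import pandas as pd

# Hypothetical toy frame with a few of the notebook's column names.
toy = pd.DataFrame({
    "budget": [237000000],
    "revenue": [2787965087],
    "homepage": ["http://www.avatarmovie.com/"],
    "tagline": ["Enter the World of Pandora."],
})

unused = ["homepage", "tagline"]

# errors="ignore" skips labels that no longer exist,
# so running the drop twice does not raise a KeyError.
trimmed = toy.drop(columns=unused, errors="ignore")
trimmed = trimmed.drop(columns=unused, errors="ignore")

print(list(trimmed.columns))
```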

#### Handle Missing Values

In [24]:
df.isna().sum() * 100 / len(df)

budget               0.000000
genres               0.000000
original_language    0.000000
original_title       0.000000
popularity           0.000000
release_date         0.020820
revenue              0.000000
runtime              0.041641
dtype: float64

Only two columns contain nulls: release_date and runtime. Since they affect well under 0.1% of the rows, I will drop those rows.

In [26]:
df.dropna(subset=['runtime', 'release_date'], axis=0, inplace=True)
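
The same two steps (measuring the share of nulls, then dropping the affected rows) can be illustrated on a small hypothetical frame (`toy`). This is only a sketch with made-up values; `isna().mean() * 100` is used here as an equivalent way to get the percentage of nulls per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing runtime and one missing release date.
toy = pd.DataFrame({
    "runtime": [162.0, np.nan, 148.0, 169.0],
    "release_date": ["2009-12-10", "2007-05-19", None, "2015-10-26"],
    "budget": [1, 2, 3, 4],
})

# isna().mean() is the fraction of nulls per column; times 100 gives a percentage.
null_pct = toy.isna().mean() * 100

# Drop every row that is missing either column, returning a new frame.
cleaned = toy.dropna(subset=["runtime", "release_date"])

print(null_pct)
print(len(toy), "->", len(cleaned))
```

Re-running `isna().sum()` on the cleaned frame is a quick way to confirm that no nulls remain.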

#### Creating new columns
In this case, I will create a single new column, "profit", computed as "revenue" minus "budget".

In [28]:
# Profit is revenue minus budget, so a film that earns more than it cost has a positive profit.
df['profit'] = df['revenue'] - df['budget']
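
As a sanity check of the definition (revenue minus budget), the sketch below recomputes the profit for the three movies shown in the `df.head(3)` output. The frame name `sample` is hypothetical; the budget and revenue figures are copied from that preview.

```python
import pandas as pd

# Budget and revenue values copied from the df.head(3) preview.
sample = pd.DataFrame({
    "original_title": [
        "Avatar",
        "Pirates of the Caribbean: At World's End",
        "Spectre",
    ],
    "budget": [237000000, 300000000, 245000000],
    "revenue": [2787965087, 961000000, 880674609],
})

# Profit = revenue - budget; all three films earned more than they cost.
sample["profit"] = sample["revenue"] - sample["budget"]

print(sample[["original_title", "profit"]])
```

All three values come out positive, which is what one would expect for films whose revenue exceeds their budget.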

#### Changing the data type
As a final step in the data preparation, I will convert the 'release_date' column to a datetime type.

In [29]:
df['release_date'] = pd.to_datetime(df['release_date'])
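
Once the column is a datetime, date parts are available through the `.dt` accessor, which is useful for time-based questions such as Q6. A minimal sketch on a hypothetical toy frame (`toy`), using the release dates from the preview; the `release_year` column name is an assumption introduced for illustration:

```python
import pandas as pd

# Release dates copied from the df.head(3) preview.
toy = pd.DataFrame({"release_date": ["2009-12-10", "2007-05-19", "2015-10-26"]})

# An explicit format makes the expected layout of the strings clear.
toy["release_date"] = pd.to_datetime(toy["release_date"], format="%Y-%m-%d")

# Hypothetical helper column: the release year, for grouping by year later.
toy["release_year"] = toy["release_date"].dt.year

print(toy.dtypes)
print(toy["release_year"].tolist())
```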

---

### Exploratory Data Analysis

### Q1: Which genres are the most profitable?