# IMDB Top 1000 Movies and TV Shows Analysis

This notebook explores and analyzes the IMDB dataset of the top 1000 movies and TV shows.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

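# Set plotting style
# The bare 'seaborn' style name is deprecated since Matplotlib 3.6;
# 'seaborn-v0_8' is its renamed equivalent.
plt.style.use('seaborn-v0_8')
sns.set_palette('deep')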



## Data Loading and Initial Exploration

In [2]:
# Load the dataset
df = pd.read_csv('imdb-agent/data/imdb_top_1000.csv')

# Display basic information about the dataset
print("Dataset Shape:", df.shape)
print("\nColumns:")
print(df.columns.tolist())
print("\nData Types:")
print(df.dtypes)
print("\nMissing Values:")
print(df.isnull().sum())

Dataset Shape: (1000, 16)

Columns:
['Poster_Link', 'Series_Title', 'Released_Year', 'Certificate', 'Runtime', 'Genre', 'IMDB_Rating', 'Overview', 'Meta_score', 'Director', 'Star1', 'Star2', 'Star3', 'Star4', 'No_of_Votes', 'Gross']

Data Types:
Poster_Link       object
Series_Title      object
Released_Year     object
Certificate       object
Runtime           object
Genre             object
IMDB_Rating      float64
Overview          object
Meta_score       float64
Director          object
Star1             object
Star2             object
Star3             object
Star4             object
No_of_Votes        int64
Gross             object
dtype: object

Missing Values:
Poster_Link        0
Series_Title       0
Released_Year      0
Certificate      101
Runtime            0
Genre              0
IMDB_Rating        0
Overview           0
Meta_score       157
Director           0
Star1              0
Star2              0
Star3              0
Star4              0
No_of_Votes        0
Gross   
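
The dtype listing above shows `Released_Year`, `Runtime`, and `Gross` stored as `object`, so they cannot be used directly in numeric analysis. The following is a minimal sketch of one way to convert them, run on a small hypothetical sample rather than the real file. It assumes `Runtime` values look like `142 min` (as in the notebook's later preview) and that `Gross` may contain thousands separators; the helper name `to_numeric_columns` and the sample values are illustrative, not part of the original notebook.

```python
import pandas as pd

# Hypothetical sample mimicking the assumed raw string formats.
sample = pd.DataFrame({
    "Released_Year": ["1994", "1972", "unknown"],
    "Runtime": ["142 min", "175 min", "96 min"],
    "Gross": ["28,341,469", "134,966,411", None],
})


def to_numeric_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with the three string columns converted to numbers."""
    out = frame.copy()
    # Values that cannot be parsed become NaN instead of raising an error.
    out["Released_Year"] = pd.to_numeric(out["Released_Year"], errors="coerce")
    # Strip the assumed " min" suffix before converting.
    out["Runtime"] = pd.to_numeric(
        out["Runtime"].str.replace(" min", "", regex=False), errors="coerce"
    )
    # Strip assumed thousands separators before converting.
    out["Gross"] = pd.to_numeric(
        out["Gross"].str.replace(",", "", regex=False), errors="coerce"
    )
    return out


cleaned = to_numeric_columns(sample)
print(cleaned.dtypes)
print(cleaned.isnull().sum())
```

Using `errors="coerce"` keeps the conversion from failing on unexpected entries; any value that could not be parsed then appears in the `isnull().sum()` counts, so it is worth rechecking the missing-value totals after converting.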

In [3]:
df.shape

(1000, 16)

In [4]:
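import sys

# Make the project directory importable (machine-specific absolute path)
sys.path.append('/Users/curtispond/Documents/imdb-agent')
# Import only the helpers used below instead of a wildcard import
from text_preprocessing import preprocess_text, extract_keywords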

In [5]:
# Load the data
df = pd.read_csv('imdb-agent/data/imdb_top_1000.csv')

# Preprocess the Overview column
df['processed_overview'] = df['Overview'].apply(preprocess_text)

# Extract keywords from overviews
df['keywords'] = df['Overview'].apply(lambda x: extract_keywords(x, n_keywords=5))

In [6]:
df.head()

,Poster_Link,Series_Title,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,Star3,Star4,No_of_Votes,Gross,processed_overview,keywords
0,https://m.media-amazon.com/images/M/MV5BMDFkYT...,The Shawshank Redemption,1994,A,142 min,Drama,9.3,Two imprisoned men bond over a number of years...,80.0,Frank Darabont,Tim Robbins,Morgan Freeman,Bob Gunton,William Sadler,2343110,28341469,two imprisoned men bond number year finding so...,"[years, solace, redemption, number, men]"
1,https://m.media-amazon.com/images/M/MV5BM2MyNj...,The Godfather,1972,A,175 min,"Crime, Drama",9.2,An organized crime dynasty's aging patriarch t...,100.0,Francis Ford Coppola,Marlon Brando,Al Pacino,James Caan,Diane Keaton,1620367,134966411,organized crime dynasty aging patriarch transf...,"[transfers, son, reluctant, patriarch, organized]"
2,https://m.media-amazon.com/images/M/MV5BMTMxNT...,The Dark Knight,2008,UA,152 min,"Action, Crime, Drama",9.0,When the menace known as the Joker wreaks havo...,84.0,Christopher Nolan,Christian Bale,Heath Ledger,Aaron Eckhart,Michael Caine,2303232,534858444,menace known joker wreaks havoc chaos people g...,"[wreaks, havoc, accept, batman, chaos]"
3,https://m.media-amazon.com/images/M/MV5BMWMwMG...,The Godfather: Part II,1974,A,202 min,"Crime, Drama",9.0,The early life and career of Vito Corleone in ...,90.0,Francis Ford Coppola,Al Pacino,Robert De Niro,Robert Duvall,Diane Keaton,1129952,57300000,early life career vito corleone new york city ...,"[york, vito, career, city, corleone]"
4,https://m.media-amazon.com/images/M/MV5BMWU4N2...,12 Angry Men,1957,U,96 min,"Crime, Drama",9.0,A jury holdout attempts to prevent a miscarria...,96.0,Sidney Lumet,Henry Fonda,Lee J. Cobb,Martin Balsam,John Fiedler,689845,4360000,jury holdout attempt prevent miscarriage justi...,"[reconsider, prevent, miscarriage, justice, jury]"
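
The `text_preprocessing` module is not shown in this notebook, so the exact behavior of `preprocess_text` and `extract_keywords` is unknown. Judging from the preview above, the first appears to lowercase the text and drop common words. The sketch below is a simplified stand-in under that assumption, using only the standard library. The function names, the tiny stop-word list, and the frequency-based keyword ranking are all hypothetical, and the sketch does no lemmatization, so its output will not match the real module's.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; a real pipeline would
# likely rely on a fuller list from an NLP library).
STOP_WORDS = {
    "a", "an", "and", "the", "of", "over", "to", "in",
    "as", "when", "by", "for", "on", "with",
}


def preprocess_text_sketch(text):
    """Lowercase, keep alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return " ".join(t for t in tokens if t not in STOP_WORDS)


def extract_keywords_sketch(text, n_keywords=5):
    """Return up to n_keywords of the most frequent remaining tokens."""
    tokens = preprocess_text_sketch(text).split()
    counts = Counter(tokens)
    return [word for word, _ in counts.most_common(n_keywords)]


# Visible (truncated) start of the first overview in the preview above.
overview = "Two imprisoned men bond over a number of years"
print(preprocess_text_sketch(overview))
print(extract_keywords_sketch(overview, n_keywords=3))
```

Overviews are short, so plain frequency counts rarely separate one word from another. The keywords in the preview suggest the real module ranks words differently, possibly with a corpus-level weighting such as TF-IDF, though that is a guess.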
