## IMDB Data Analysis

#### First, import pandas, load the dataset, and display the first few rows to understand its structure.

In [2]:
import pandas as pd
file_path = 'imdb.csv'
df = pd.read_csv(file_path)
display(df.head())

Unnamed: 0,Poster_Link,Series_Title,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,Star3,Star4,No_of_Votes,Gross
0,https://m.media-amazon.com/images/M/MV5BMDFkYT...,The Shawshank Redemption,1994,A,142 min,Drama,9.3,Two imprisoned men bond over a number of years...,80.0,Frank Darabont,Tim Robbins,Morgan Freeman,Bob Gunton,William Sadler,2343110,28341469
1,https://m.media-amazon.com/images/M/MV5BM2MyNj...,The Godfather,1972,A,175 min,"Crime, Drama",9.2,An organized crime dynasty's aging patriarch t...,100.0,Francis Ford Coppola,Marlon Brando,Al Pacino,James Caan,Diane Keaton,1620367,134966411
2,https://m.media-amazon.com/images/M/MV5BMTMxNT...,The Dark Knight,2008,UA,152 min,"Action, Crime, Drama",9.0,When the menace known as the Joker wreaks havo...,84.0,Christopher Nolan,Christian Bale,Heath Ledger,Aaron Eckhart,Michael Caine,2303232,534858444
3,https://m.media-amazon.com/images/M/MV5BMWMwMG...,The Godfather: Part II,1974,A,202 min,"Crime, Drama",9.0,The early life and career of Vito Corleone in ...,90.0,Francis Ford Coppola,Al Pacino,Robert De Niro,Robert Duvall,Diane Keaton,1129952,57300000
4,https://m.media-amazon.com/images/M/MV5BMWU4N2...,12 Angry Men,1957,U,96 min,"Crime, Drama",9.0,A jury holdout attempts to prevent a miscarria...,96.0,Sidney Lumet,Henry Fonda,Lee J. Cobb,Martin Balsam,John Fiedler,689845,4360000


#### 1. Movie with the highest IMDB rating

In [3]:
highest_rated_movie = df.loc[df['IMDB_Rating'].idxmax(), ['Series_Title', 'IMDB_Rating']]
print("Movie with the highest IMDB rating is", highest_rated_movie)

Movie with the highest IMDB rating is Series_Title    The Shawshank Redemption
IMDB_Rating                          9.3
Name: 0, dtype: object


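Another way, reading the title and the rating as scalar values with `df.at` so the result prints as a single sentence.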

In [4]:
highest_idx = df['IMDB_Rating'].idxmax()
highest_title = df.at[highest_idx, 'Series_Title']
highest_rating = df.at[highest_idx, 'IMDB_Rating']
print(f"Movie with the highest IMDB rating is {highest_title} with the rating of {highest_rating}.")

Movie with the highest IMDB rating is The Shawshank Redemption with the rating of 9.3.
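
Both approaches rely on `idxmax`, which returns only the first row holding the maximum, so a tie for the top rating would go unnoticed. A minimal sketch of one way to list every top-rated title; the `toy` frame and its titles are made up for illustration and are not taken from `imdb.csv`:

```python
import pandas as pd

# Hypothetical toy data with a deliberate tie for the top rating.
toy = pd.DataFrame({
    'Series_Title': ['Film A', 'Film B', 'Film C'],
    'IMDB_Rating': [9.0, 9.3, 9.3],
})

# Keep every row whose rating equals the maximum, not just the first one.
top_rating = toy['IMDB_Rating'].max()
top_movies = toy.loc[toy['IMDB_Rating'] == top_rating, 'Series_Title'].tolist()
print(top_movies)
```

On the real data the same filter can be applied to `df`; if it returns a single title, the `idxmax` result above is already complete.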


#### 2. Difference between release years of the oldest and newest movies

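*Released_Year* must be numeric before the calculation, so any non-numeric entries are coerced to NaN and those rows are dropped. Note that `df` is reassigned here, so later steps run on the reduced frame.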

In [5]:
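# Coerce non-numeric entries (if any) to NaN, then drop those rows.
df['Released_Year'] = pd.to_numeric(df['Released_Year'], errors='coerce')
# .copy() guards against a possible SettingWithCopyWarning on the next assignment.
df = df.dropna(subset=['Released_Year']).copy()
df['Released_Year'] = df['Released_Year'].astype(int)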
oldest_year = df['Released_Year'].min()
newest_year = df['Released_Year'].max()
year_difference = newest_year - oldest_year
print("Difference between oldest and newest movie years is", year_difference, "years.")

Difference between oldest and newest movie years is 100 years.
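
Because `errors='coerce'` silently turns bad entries into NaN, it can be useful to look at what was coerced before dropping it. A small sketch on a made-up column; the `years` values, including the stray `'PG'`, are hypothetical:

```python
import pandas as pd

# Hypothetical toy column: one entry is not a year.
years = pd.Series(['1994', '1972', 'PG', '2008'])

numeric_years = pd.to_numeric(years, errors='coerce')

# Entries that were present originally but became NaN after coercion.
bad_values = years[numeric_years.isna() & years.notna()]
print(bad_values.tolist())

# Same min/max difference as in the cell above, on the cleaned values.
clean_years = numeric_years.dropna().astype(int)
year_span = clean_years.max() - clean_years.min()
print(year_span)
```

Running the same check on `df['Released_Year']` before the conversion would show which rows, if any, the `dropna` call removes.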


#### 3. Average runtime of all movies

In [6]:
df['Runtime'] = df['Runtime'].astype(str).str.replace(' min', '')
df['Runtime'] = pd.to_numeric(df['Runtime'], errors='coerce')
avg_runtime = df['Runtime'].mean()
print("Average runtime of all movies is", avg_runtime, "minutes.")

Average runtime of all movies is 122.87387387387388 minutes.


#### 4. Total revenue made by all movies

In [7]:
df['Gross'] = df['Gross'].astype(str).str.replace(',', '')
df['Gross'] = pd.to_numeric(df['Gross'], errors='coerce')
total_revenue = df['Gross'].sum(skipna=True)
print("Total revenue made by all movies is", total_revenue)

Total revenue made by all movies is 56363040043.0


I also reformatted the total with thousands separators to make the number easier to read.

In [8]:
formatted_revenue = f"{total_revenue:,.0f}"
print("Total revenue made by all movies is", formatted_revenue)

Total revenue made by all movies is 56,363,040,043
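
Since the sum skips missing values, the total only covers movies that have a *Gross* figure. A minimal sketch, on hypothetical toy values, of counting how many entries are left out of such a sum:

```python
import pandas as pd

# Hypothetical toy column: the middle entry has no gross figure.
toy_gross = pd.Series(['28,341,469', None, '4,360,000'])

# Same cleaning idea as above: strip separators, then coerce to numbers.
cleaned = pd.to_numeric(
    toy_gross.astype(str).str.replace(',', '', regex=False),
    errors='coerce',
)

missing_count = int(cleaned.isna().sum())  # entries excluded from the sum
total = cleaned.sum()                      # NaN is skipped by default
print(f"{total:,.0f} from {len(cleaned) - missing_count} of {len(cleaned)} entries")
```

Reporting the count alongside the total makes it clear that the figure is a lower bound whenever some values are missing.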


#### 5. Bottom 10 movies by Meta score

In [9]:
bottom_10_meta_score = df.nsmallest(10, 'Meta_score')[['Series_Title', 'Meta_score']]
display(bottom_10_meta_score)

Unnamed: 0,Series_Title,Meta_score
788,I Am Sam,28.0
942,The Butterfly Effect,30.0
356,Tropa de Elite,33.0
917,Seven Pounds,36.0
735,Kai po che!,40.0
957,Fear and Loathing in Las Vegas,41.0
648,The Boondock Saints,44.0
677,Predator,45.0
760,Flipped,45.0
935,Jeux d'enfants,45.0
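
The last three rows share a score of 45.0, so other movies with the same score may have been cut off by the limit of 10. `nsmallest` accepts `keep='all'` to include every row tied with the last value. A small sketch on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data: three movies tie at 45.0.
toy = pd.DataFrame({
    'Series_Title': ['A', 'B', 'C', 'D'],
    'Meta_score': [28.0, 45.0, 45.0, 45.0],
})

# Default behaviour: exactly 2 rows, ties at the cutoff are dropped.
bottom_2 = toy.nsmallest(2, 'Meta_score')

# keep='all' returns more than 2 rows when the cutoff value is tied.
bottom_2_with_ties = toy.nsmallest(2, 'Meta_score', keep='all')
print(bottom_2_with_ties)
```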


#### 6. Average IMDB rating for each genre

In [15]:
df_genres = df.dropna(subset=['Genre', 'IMDB_Rating']).copy()
df_genres['Genre'] = df_genres['Genre'].str.split(', ')
df_genres = df_genres.explode('Genre')
genre_ratings = df_genres.groupby('Genre')['IMDB_Rating'].mean()
print(genre_ratings)

Genre
Action       7.948677
Adventure    7.953846
Animation    7.930488
Biography    7.935780
Comedy       7.903433
Crime        7.954545
Drama        7.959889
Family       7.912500
Fantasy      7.931818
Film-Noir    7.989474
History      7.960000
Horror       7.887500
Music        7.914286
Musical      7.947059
Mystery      7.967677
Romance      7.925600
Sci-Fi       7.977612
Sport        7.926316
Thriller     7.909489
War          8.013725
Western      8.000000
Name: IMDB_Rating, dtype: float64
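
The averages are close together, and some genres have few movies, so it can help to show the number of movies next to each mean and to sort the result. A sketch of that variation on a hypothetical three-row frame:

```python
import pandas as pd

# Hypothetical toy data in the same shape as the Genre / IMDB_Rating columns.
toy = pd.DataFrame({
    'Genre': ['Crime, Drama', 'Drama', 'Action, Crime, Drama'],
    'IMDB_Rating': [9.2, 9.3, 9.0],
})

# One row per (movie, genre) pair, as in the cell above.
exploded = toy.assign(Genre=toy['Genre'].str.split(', ')).explode('Genre')

# Mean rating and number of movies per genre, highest mean first.
summary = (
    exploded.groupby('Genre')['IMDB_Rating']
    .agg(['mean', 'count'])
    .sort_values('mean', ascending=False)
)
print(summary)
```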


#### 7. Frequency distribution of each genre (individually)

In [12]:
genre_split = df['Genre'].dropna().str.split(', ')
genre_exploded = genre_split.explode()
genre_frequency = genre_exploded.value_counts()
print(genre_frequency)

Drama        723
Comedy       233
Crime        209
Adventure    195
Action       189
Thriller     137
Romance      125
Biography    109
Mystery       99
Animation     82
Sci-Fi        67
Fantasy       66
Family        56
History       55
War           51
Music         35
Horror        32
Western       20
Film-Noir     19
Sport         19
Musical       17
Name: Genre, dtype: int64
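
A movie can carry several genres, so these counts add up to more than the number of movies. One way to put them in context is to express each count as a share of all movies. A minimal sketch on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy column with multi-genre entries.
toy_genre = pd.Series(['Crime, Drama', 'Drama', 'Action, Crime, Drama'])

# Count each genre individually, as in the cell above.
counts = toy_genre.str.split(', ').explode().value_counts()

# Share of movies tagged with each genre (shares can sum to more than 1).
share_of_movies = counts / len(toy_genre)
print(share_of_movies.round(2))
```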


#### 8. Actor with the most star points

In [27]:
melted_df = df.melt(value_vars=['Star1', 'Star2', 'Star3', 'Star4'], var_name='Star_Position', value_name='Actor')
weight_map = {'Star1': 4, 'Star2': 3, 'Star3': 2, 'Star4': 1}
melted_df['Points'] = melted_df['Star_Position'].map(weight_map)
actor_points_df = melted_df.groupby('Actor')['Points'].sum().reset_index()
most_starred_actor = actor_points_df.loc[actor_points_df['Points'].idxmax()]
print(f"Actor with the most star points is {most_starred_actor['Actor']} with {most_starred_actor['Points']} points.")

Actor with the most star points is Robert De Niro with 59 points.
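
To make the scoring concrete: each appearance earns 4 points as *Star1*, 3 as *Star2*, 2 as *Star3* and 1 as *Star4*, and the points are summed per actor. A worked sketch on a hypothetical two-movie frame (the actor names are placeholders):

```python
import pandas as pd

# Hypothetical toy cast lists for two movies.
toy = pd.DataFrame({
    'Star1': ['Actor X', 'Actor Y'],
    'Star2': ['Actor Y', 'Actor Z'],
    'Star3': ['Actor Z', 'Actor X'],
    'Star4': ['Actor W', 'Actor W'],
})

weights = {'Star1': 4, 'Star2': 3, 'Star3': 2, 'Star4': 1}

# Long format: one row per (billing position, actor) pair.
melted = toy.melt(var_name='Star_Position', value_name='Actor')
melted['Points'] = melted['Star_Position'].map(weights)

# Sum points per actor, highest first.
points = melted.groupby('Actor')['Points'].sum().sort_values(ascending=False)
print(points)
```

Here Actor Y scores 3 + 4 = 7 and Actor X scores 4 + 2 = 6, so an actor with one lead and one second billing outranks one with a lead and a third billing.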
