**Movie Dataset Project**

**By: Ivan Yu**

**Goal (Defining the problem): This project explores a movie data set,
where we try to extract meaningful patterns and the correlations (and possible causes) behind trends.**


**Source of data set: https://www.kaggle.com/datasets/carolzhangdc/imdb-5000-movie-dataset/code?datasetId=7181&sortBy=voteCount**

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv("movie_metadata.csv")
df.head()

### **2. Data Collection and Preparation**

In [None]:
# Let's see if there's any correlation between gross and budget of the movie
# 3. Data Exploration and Analysis
import matplotlib.pyplot as plt

In [None]:
plt.figure(figsize=(10,5))
plt.scatter(df['budget'], df['gross'], alpha=0.5)
plt.xlabel("budget")
plt.ylabel("gross")
plt.title("Budget vs Gross")
plt.show()
# low correlation

**We can see that a higher budget does not guarantee a higher gross**
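
A scatter plot only gives a visual impression, so it can help to put a number on it. Below is a minimal sketch, assuming we want the Pearson coefficient and want to skip rows where either value is missing. The helper name `pearson_r` and the toy values are made up for illustration and are not part of the original notebook or data set.

```python
import pandas as pd

def pearson_r(frame: pd.DataFrame, x: str, y: str) -> float:
    """Pearson correlation between two columns, ignoring rows with a missing value."""
    pair = frame[[x, y]].dropna()
    return pair[x].corr(pair[y])

# Toy values for illustration only (NOT taken from movie_metadata.csv)
toy = pd.DataFrame({
    "budget": [1.0, 2.0, 3.0, 4.0, None],
    "gross": [2.0, 4.0, 6.0, 8.0, 5.0],
})

# The last row is dropped because its budget is missing
r = pearson_r(toy, "budget", "gross")
print(round(r, 3))
```

On the notebook's table the call would look like `pearson_r(df, "budget", "gross")`. A value near 0 would back up the "low correlation" reading, while a value near 1 or -1 would contradict it.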

**Let's clean the table a little bit, dropping rows with missing data**

In [None]:
df = df.dropna(subset=["budget", "gross"]) #Drops the rows with missing values
df.head()

In [None]:
# Taking a look at any correlation between score and gross
plt.figure(figsize=(7,6))
plt.scatter(df['imdb_score'],df['gross'], alpha=1)
plt.xlabel('imdb_score')
plt.ylabel('gross')
plt.title("Score vs Gross")
plt.show()
#Correlation found! 

**Nice! It looks like there is a positive correlation between the score the movie received and the gross!**
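
One way to double-check a trend like this is to bucket the movies by rounded score and compare the typical gross in each bucket. This is only a sketch under assumptions: the helper `median_by_score_bin` is a made-up name, the bucket width of one point is an arbitrary choice, and the toy numbers are not from the data set.

```python
import pandas as pd

def median_by_score_bin(frame: pd.DataFrame,
                        score_col: str = "imdb_score",
                        value_col: str = "gross") -> pd.Series:
    """Median of value_col for each score rounded to the nearest whole point."""
    clean = frame[[score_col, value_col]].dropna()
    # Rename the key so it does not clash with the original score column
    score_bin = clean[score_col].round().rename("score_bin")
    return clean.groupby(score_bin)[value_col].median()

# Toy values for illustration only (NOT taken from movie_metadata.csv)
toy = pd.DataFrame({
    "imdb_score": [5.1, 5.4, 7.2, 6.8, 8.9],
    "gross": [10.0, 20.0, 50.0, 70.0, 100.0],
})

medians = median_by_score_bin(toy)
print(medians)
```

If the medians rise from one score bucket to the next on the real table, that supports the positive trend seen in the scatter plot.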

**Let's use one more graph for data exploration**

In [None]:
df["Profit"] = df["gross"] - df["budget"]

In [None]:
plt.figure(figsize=(10,5))
plt.scatter(df['title_year'], df["Profit"], alpha = 0.7)
plt.xlabel("Year")
plt.ylabel("Profit")
plt.title("Year vs Profit")
plt.show()

**Wow! How interesting. According to this data, many movies struggled to even break even until around the year 2000, when profits really started to shine. There are certainly some outliers to this pattern, though!**
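
To check the "break even" idea beyond eyeballing the scatter plot, we could compute the share of movies with a positive profit in each decade. The sketch below assumes the `title_year` and `Profit` columns used above. The helper `profitable_share_by_decade` and the toy rows are hypothetical and only show the mechanics.

```python
import pandas as pd

def profitable_share_by_decade(frame: pd.DataFrame,
                               year_col: str = "title_year",
                               profit_col: str = "Profit") -> pd.Series:
    """Fraction of movies with Profit > 0, grouped by decade of release."""
    clean = frame[[year_col, profit_col]].dropna()
    # 1994 -> 1990, 2003 -> 2000, and so on
    decade = (clean[year_col] // 10 * 10).astype(int).rename("decade")
    # The mean of a True/False column is the share of True values
    return (clean[profit_col] > 0).groupby(decade).mean()

# Toy values for illustration only (NOT taken from movie_metadata.csv)
toy = pd.DataFrame({
    "title_year": [1994.0, 1998.0, 2003.0, 2007.0, 2009.0, None],
    "Profit": [-5.0, 10.0, 20.0, -1.0, 30.0, 4.0],
})

share = profitable_share_by_decade(toy)
print(share)
```

Running `profitable_share_by_decade(df)` on the real table would show whether the share of profitable movies actually changes around 2000, or whether the plot simply has more points in later years.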

In [None]:
# Sort by profit to find the most profitable movie (the top outlier)
df.sort_values("Profit", ascending=False).head(1)

**For the data exploration chapter, we noticed that movies struggled to break even in earlier years but have done better since around the year 2000.
We also noticed that a high budget does not necessarily correlate with a high gross.
And finally, we found that there is in fact a POSITIVE correlation between movie gross and score.**

In [None]:
df.head(1) #Just to take a look at the columns to find more correlations

In [None]:
plt.figure(figsize=(10,5))
plt.scatter(df['duration'], df['Profit'])
plt.xlabel('Duration')
plt.ylabel('Profit')
plt.title('Duration vs Profit')
plt.show()

**Not much correlation here**

In [None]:
plt.figure(figsize=(10,5))
plt.scatter(df['imdb_score'],df['Profit'])
plt.xlabel('imdb_score')
plt.ylabel('Profit')
plt.title('imdb_score Vs Profit')
plt.show()

**Again, not much correlation here**

In [None]:
plt.figure(figsize=(10,5))
plt.scatter(df['num_critic_for_reviews'],df['Profit'])
plt.xlabel('Number of Critics')
plt.ylabel('Profit')
plt.title('# of Critics Vs Profit')
plt.show()

In [None]:
df.shape

In [None]:
df.info()

In [None]:
subset = df[['Profit', 'gross', 'title_year', 'director_name', 'cast_total_facebook_likes','movie_facebook_likes', 'genres']]
subset.head()

**Much easier to look at with columns that are relevant**

In [None]:
subset.shape

In [None]:
subset.duplicated().sum()

In [None]:
#Drop Duplicates to make subset Table a little less messy
# drop_duplicates() returns a new table, so assign it back to keep the result
subset = subset.drop_duplicates()


In [None]:
plt.figure(figsize=(10,5))
plt.scatter(subset['movie_facebook_likes'],subset['Profit'])
plt.xlabel('Facebook Likes')
plt.ylabel('Profit')
plt.title('Facebook likes vs Profit')
plt.show()

In [None]:
#Let's use a function to find more correlations in the 'df' table
#df.corr() #(Error because some columns could not be converted into float value)
#Let's drop some columns
#df = df.drop('actor_2_name', axis =1) #already ran, same with the color column
df.head()
#df.corr()

In [None]:
#Selecting all columns with numeric values
#df.drop(['genres','actor_1_name','movie_title','language','country','content_rating',], axis=1, inplace=True)
df.head(1)
#Perfect!

In [None]:
#df.drop(['actor_3_name','plot_keywords','movie_imdb_link'], axis=1, inplace=True)

In [None]:
df_numeric = df.copy() # Differentiate the original df from the numeric table used for corr(); copy() keeps later drops from also changing df

In [None]:
df_numeric.drop(['actor_2_name'],axis=1,inplace=True)
df_numeric.head(1)

In [None]:
df_numeric.drop(['genres'],axis=1,inplace=True)

In [None]:
#df_numeric.drop(['actor_3_name','plot_keywords','movie_imdb_link'],axis=1,inplace=True)
# numeric_only=True skips any remaining text columns, which corr() cannot convert to float
df_numeric_corr = df_numeric.corr(numeric_only=True)
#df_numeric_corr.sort_values('Profit', ascending=False)
#df_numeric.shape (3891,17)

In [None]:
df_numeric_corr
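
A full correlation matrix is a lot to scan by eye. A minimal sketch of one way to pull out just the `Profit` column and rank the other features by strength is shown below. The helper name `rank_correlations` and the toy table are made up, and `numeric_only=True` assumes pandas 1.5 or newer.

```python
import pandas as pd

def rank_correlations(frame: pd.DataFrame, target: str = "Profit") -> pd.Series:
    """Correlation of every numeric column with target, strongest first."""
    corr = frame.corr(numeric_only=True)
    # Drop the target's correlation with itself, then sort by absolute value
    return corr[target].drop(target).sort_values(
        key=lambda s: s.abs(), ascending=False
    )

# Toy values for illustration only (NOT taken from movie_metadata.csv)
toy = pd.DataFrame({
    "Profit": [1.0, 2.0, 3.0, 4.0],
    "a": [2.0, 4.0, 6.0, 8.0],
    "b": [4.0, 1.0, 3.0, 2.0],
    "title": ["w", "x", "y", "z"],  # text column, skipped by numeric_only=True
})

ranked = rank_correlations(toy)
print(ranked)
```

Sorting by absolute value keeps strong negative correlations near the top instead of burying them at the bottom of the list.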

**Let's do some feature engineering; Basically combining existing data into newer data to extract more insights.**

In [None]:
df_numeric['Star Power'] = (df_numeric['actor_3_facebook_likes']+ df_numeric['actor_1_facebook_likes'] + df_numeric['director_facebook_likes'] + df_numeric['actor_2_facebook_likes'])
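
One thing to keep in mind with `+` is that a single missing like count makes the whole sum missing for that movie. Below is a sketch of an alternative that tolerates partial gaps. It assumes the four like columns used above, and the helper `star_power` and the toy rows are hypothetical.

```python
import pandas as pd

LIKE_COLUMNS = [
    "actor_1_facebook_likes",
    "actor_2_facebook_likes",
    "actor_3_facebook_likes",
    "director_facebook_likes",
]

def star_power(frame: pd.DataFrame, columns=LIKE_COLUMNS) -> pd.Series:
    """Row-wise sum of like counts that skips missing values."""
    # min_count=1 keeps the result NaN only when every like count is missing
    return frame[columns].sum(axis=1, min_count=1)

# Toy values for illustration only (NOT taken from movie_metadata.csv)
toy = pd.DataFrame({
    "actor_1_facebook_likes": [100.0, None, 1.0],
    "actor_2_facebook_likes": [50.0, None, 2.0],
    "actor_3_facebook_likes": [None, None, 3.0],
    "director_facebook_likes": [10.0, None, 4.0],
})

combined = star_power(toy)
print(combined)
```

Whether a partly missing row should count as a smaller sum or as missing is a judgment call. The `+` version treats it as missing, while this version treats it as a smaller sum.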

In [None]:
df_numeric.head()

**Nice, so now let's sort by profit and then run the 'corr' function again to get a better understanding of the relationship between the combined likes and profits.**

In [None]:
df_numeric.sort_values('Profit', ascending=False)

In [None]:
#Let's put the "profit" column on the left side of the table for better readability
df_numeric = df_numeric[['Profit'] + [col for col in df_numeric.columns if col !='Profit']]
df_numeric.head()

In [None]:
df_numeric.corr(numeric_only=True) #Nice!

In [None]:
df.describe()

In [None]:
df_numeric.describe()

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df = df.dropna()

In [None]:
df.head()

In [None]:
yearly = df.groupby("title_year").size()

In [None]:
plt.figure(figsize=(10,5))
yearly.plot(kind='bar')
plt.title('Movies Released Per Year')
plt.xlabel('Year')
plt.ylabel('Number of Movies')
plt.show()

In [None]:
category = subset.groupby('genres').size()
category.sort_values(ascending=False)

In [None]:
# Sort by count (and keep the result) so head(10) picks the ten most common genre entries
category_range = category.dropna().sort_values(ascending=False)
genre_popularity = category_range.head(10)

In [None]:
plt.figure(figsize=(10,5))
genre_popularity.plot(kind='barh')
plt.xlabel('Count')
plt.ylabel('Genre')
plt.title('Genre combinations by movie count')
plt.show()
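
Grouping on the raw `genres` column counts each full genre string as its own category. If a movie lists several genres in one string, a sketch like the one below could count each genre on its own. It assumes the genres are separated by `|`, and the helper `count_individual_genres` and the toy rows are made up for illustration.

```python
import pandas as pd

def count_individual_genres(frame: pd.DataFrame,
                            genre_col: str = "genres",
                            sep: str = "|") -> pd.Series:
    """Count how many movies list each single genre."""
    # Split "Action|Comedy" into ["Action", "Comedy"], then give each genre its own row
    genres = frame[genre_col].dropna().str.split(sep).explode()
    return genres.value_counts()

# Toy values for illustration only (NOT taken from movie_metadata.csv)
toy = pd.DataFrame({
    "genres": ["Action|Adventure", "Action|Comedy", "Comedy", None],
})

counts = count_individual_genres(toy)
print(counts)
```

With this approach a movie tagged with three genres is counted once under each of them, so the counts add up to more than the number of movies.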

### **Conclusion**

This project offered a valuable hands-on experience working with real-world movie data. Through data cleaning, exploratory analysis, and visualizations, I gained deeper insight into the distribution of genres, content trends across time, and factors that potentially influence popularity. Although this dataset had its quirks and limitations, it provided a strong foundation for building my data science workflow. I'm proud of what I learned and produced, even if it isn’t perfect. With this project complete, I’m excited to move on and apply what I’ve learned to new datasets and fresh challenges.