## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<div style="border-radius: 10px; border: 2px solid #3498db; padding: 20px; background-color: #f0f0f0; box-shadow: 0px 4px 8px rgba(0, 0, 0, 0.1);">
   <div style="display: flex; align-items: flex-start;">
      <div style="flex: 1;">
         <h3 style="color: #2c3e50; text-shadow: 2px 2px 4px rgba(0, 0, 0, 0.3); font-weight: bold;">Steps of Data Science Project</h3>
         <div style="margin-top: 20px; text-align: left;">
            <ul style="list-style-type: decimal; margin-left: 10px; font-size: 18px; color: #333;">
            <li>Asking Questions &#128587;</li>
            <li>Data Collection &#128194;</li>
            <li>Data Cleaning &#128269;</li>
            <li>Exploratory Data Analysis (EDA) &#128220;</li>
            <li>Communicate Results &#128640;</li>
        </ul>
         </div>
      </div>
   </div>
</div>

<a id='intro'></a>
<div style="border-radius: 10px; border: 2px solid #FFD700; padding: 20px; background-color: #E8F6EF; box-shadow: 0px 6px 12px rgba(0, 0, 0, 0.1); text-align: left;">
<h1>Introduction to the Movie Dataset</h1>
<p>This notebook explores a comprehensive dataset containing information about various movies. The dataset encompasses a wide range of attributes for each movie, providing valuable insights into the world of cinema.</p>

<h2>Dataset Overview:</h2>

<ul>
  <li><strong>Movie Details</strong>: Such as title, release date, and runtime.</li>
  <li><strong>Financial Metrics</strong>: Including budget and revenue figures.</li>
  <li><strong>Popularity Metrics</strong>: Such as popularity scores and vote counts.</li>
  <li><strong>Casting Information</strong>: Names of the main cast members.</li>
  <li><strong>Production Details</strong>: Involving production companies and directors.</li>
  <li><strong>Genre Classification</strong>: Categorizing movies into different genres.</li>
  <li><strong>Additional Information</strong>: Such as taglines, keywords, and an overview of the movie.</li>
</ul>

<h2>Purpose:</h2>

<p>The primary purpose of this notebook is to conduct exploratory data analysis (EDA) on the movie dataset. Through EDA, we aim to:</p>

<ol>
  <li>Gain insights into the distribution and characteristics of movies in the dataset.</li>
  <li>Explore trends and patterns in movie budgets, revenues, and popularity scores.</li>
  <li>Analyze the relationship between different attributes such as budget and revenue, genre popularity, and more.</li>
  <li>Identify interesting correlations or anomalies within the dataset.</li>
  <li>Extract actionable insights that could be useful for various stakeholders in the film industry, including producers, directors, and investors.</li>
</ol>

<h2>Methodology:</h2>

<p>Our analysis combines statistical summaries with data visualization to explore the dataset. We'll use the Python programming language along with popular data analysis and visualization libraries such as Pandas, NumPy, Matplotlib, and Seaborn.</p>

<h2>Conclusion:</h2>

<p>By the end of this analysis, we aim to have a better understanding of the movie industry landscape, key factors influencing the success of movies, and potentially uncover hidden insights that could inform decision-making processes in the entertainment industry.</p>
</div>

<div style="background:#fff7f7;padding:10px;border-radius:6px;border:2px blue solid;margin:10px;">
  <h2>Movie Dataset Description</h2>
  <ol>
    <li><strong>id:</strong> An integer value serving as a unique identifier for each entry in the dataset.</li>
    <li><strong>imdb_id:</strong> A unique identifier provided by IMDb for each movie.</li>
    <li><strong>popularity:</strong> A float value indicating the popularity score of the movie.</li>
    <li><strong>budget:</strong> An integer value representing the budget of the movie.</li>
    <li><strong>revenue:</strong> An integer value representing the revenue generated by the movie.</li>
    <li><strong>original_title:</strong> The original title of the movie.</li>
    <li><strong>cast:</strong> Names of the main cast members of the movie.</li>
    <li><strong>homepage:</strong> The URL of the movie's official website, if available.</li>
    <li><strong>director:</strong> The name of the director of the movie.</li>
    <li><strong>tagline:</strong> A short memorable phrase associated with the movie, often used in marketing.</li>
    <li><strong>keywords:</strong> Keywords or phrases associated with the movie for indexing and searching purposes.</li>
    <li><strong>overview:</strong> A brief summary or description of the movie.</li>
    <li><strong>runtime:</strong> The duration of the movie in minutes.</li>
    <li><strong>genres:</strong> Categories or genres that the movie belongs to.</li>
    <li><strong>production_companies:</strong> Names of the production companies involved in making the movie.</li>
    <li><strong>release_date:</strong> The date when the movie was released.</li>
    <li><strong>vote_count:</strong> The count of votes given to the movie.</li>
    <li><strong>vote_average:</strong> The average rating given to the movie.</li>
    <li><strong>release_year:</strong> The year when the movie was released.</li>
    <li><strong>budget_adj:</strong> The budget of the movie adjusted for inflation.</li>
    <li><strong>revenue_adj:</strong> The revenue generated by the movie adjusted for inflation.</li>
  </ol>
</div>

<div style="border-radius: 10px; border: 2px solid #FFD700; padding: 20px; background-color: #E8F6EF; box-shadow: 0px 6px 12px rgba(0, 0, 0, 0.1); text-align: left;">
    <h2 style="color: #17A05D; text-shadow: 2px 2px 4px rgba(0, 0, 0, 0.3); font-size: 24px; font-weight: bold; margin-bottom: 10px;">
        Question(s) for Analysis</h2>
        <ol style="font-size: 18px; color: #333;">
            <li>What is the total popularity by genre?</li>
            <li>How many movies are there in each genre?</li>
            <li>What is the average popularity by genre?</li>
            <li>What is the average revenue by genre?</li>
            <li>What is the average vote count by genre?</li>
            <li>What is the average vote average by genre?</li>
            <li>What is the average profit by genre?</li>
            <li>Which genre is the overall best-performing?</li>
            <li>What are the top 10 production companies by profit?</li>
            <li>Who are the top directors with the highest counts of movies?</li>
            <li>Who are the top directors with popular movies?</li>
            <li>How does the count of movies vary over the years?</li>
            <li>What are the top 10 movies by profit?</li>
            <li>What are the top 10 movies by popularity?</li>
            <li>What are the top 10 movies by vote count?</li>
            <li>What are the top 10 movies by vote average?</li>
            <li>What are the top 10 movies by score?</li>
            <li>What is the distribution of runtime in movies?</li>
            <li>How many movies are there for each duration category?</li>
            <li>How many movies are released each month?</li>
            <li>On which day are most movies released?</li>
            <li>What percentage of movies are released in each decade?</li>
            <li>What is the total profit by decade?</li>
            <li>Is there a correlation between revenue and budget?</li>
            <li>Is there a correlation between popularity and revenue?</li>
            <li>Is there a correlation between popularity and runtime?</li>
        </ol>
</div>



## <div style="text-align: left; background-color: #BB4ED8 ; font-family: Trebuchet MS; color: white; padding: 15px; line-height:1;border-radius:1px; margin-bottom: 0em; text-align: center; font-size: 25px; border-radius: 8px 8px 0 0;  ">1| Import Libraries📚</div>

In [None]:
# Packages used in this notebook
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

<a id='wrangling'></a>
## <div style="text-align: left; background-color: #BB4ED8 ; font-family: Trebuchet MS; color: white; padding: 15px; line-height:1;border-radius:1px; margin-bottom: 0em; text-align: center; font-size: 25px; border-radius: 8px 8px 0 0;  ">2| Data Wrangling</div>

In [None]:
df = pd.read_csv("/kaggle/input/tmdb-movies/tmdb-movies.csv")

<div style='background-color: #fff7f7; border: 2px solid; padding :8px; border-radius: 8px 8px 0 0;'>
    <font size="+2" color="green" ><b>Some Information</b></font>
</div>

In [None]:
df.info()

<div style='background-color: #fff7f7; border: 2px solid; padding :8px; border-radius: 8px 8px 0 0;'>
    <font size="+2" color="green" ><b>Statistics</b></font>
</div>

In [None]:
df.describe()

<div style='background-color: #fff7f7; border: 2px solid; padding :8px; border-radius: 8px 8px 0 0;'>
    <font size="+2" color="green" ><b>Number of Unique Values in Each Column</b></font>
</div>

In [None]:
df.nunique()

<a id="2"></a>
## <div style="text-align: left; background-color: #BB4ED8 ; font-family: Trebuchet MS; color: white; padding: 15px; line-height:1;border-radius:1px; margin-bottom: 0em; text-align: center; font-size: 25px; border-radius: 8px 8px 0 0;  ">3| Data Cleaning</div>

<div style='background-color: #fff7f7; border: 2px solid; padding :8px; border-radius: 8px 8px 0 0;'>
    <font size="+2" color="green" ><b>3.1 | Drop Unnecessary Columns</b></font>
</div>

In [None]:
# These columns are not used in this notebook
dropped_columns = ['homepage', 'keywords', 'tagline']
df.drop(columns=dropped_columns, inplace=True)

<div style='background-color: #fff7f7; border: 2px solid; padding :8px; border-radius: 8px 8px 0 0;'>
    <font size="+2" color="green" ><b>3.2 | Convert release_date to Datetime</b></font>
</div>

In [None]:
# Convert these columns to suitable data types
df['release_date'] = pd.to_datetime(df['release_date'])
df['release_year'] = df['release_year'].astype(int)
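
If `release_date` is stored with two-digit years (an assumption about this CSV, for example `11/15/66`), `%y` parsing maps 00-68 to 2000-2068, so a 1966 release can end up dated 2066. Below is a minimal sketch of one possible correction. The sample rows are made up, and it treats `release_year` as the trusted year:

```python
import pandas as pd

# Hypothetical sample rows; the M/D/YY string format is an assumption.
sample = pd.DataFrame({
    "release_date": ["6/9/15", "11/15/66"],
    "release_year": [2015, 1966],
})

# Two-digit years 00-68 are read as 2000-2068, so 66 becomes 2066 here.
parsed = pd.to_datetime(sample["release_date"], format="%m/%d/%y")

# Rebuild each date from the trusted year plus the parsed month and day.
fixed = pd.to_datetime(pd.DataFrame({
    "year": sample["release_year"],
    "month": parsed.dt.month,
    "day": parsed.dt.day,
}))

print(parsed.dt.year.tolist())
print(fixed.dt.year.tolist())
```

Comparing `df['release_date'].dt.year` with `df['release_year']` is a quick way to check whether the real data is affected at all.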

<div style='background-color: #fff7f7; border: 2px solid; padding :8px; border-radius: 8px 8px 0 0;'>
    <font size="+2" color="green" ><b>3.3 | Check For Duplicates</b></font>
</div>

In [None]:
df[df.duplicated()]

In [None]:
# Drop the duplicate rows; they match on every column, including imdb_id
df.drop_duplicates(inplace=True)

<div style='background-color: #fff7f7; border: 2px solid; padding :8px; border-radius: 8px 8px 0 0;'>
    <font size="+2" color="green" ><b>3.4 | Handling NaN Values</b></font>
</div>

In [None]:
# Counting nans
df.isna().sum()

In [None]:
# Drop rows with NaN values because there are relatively few of them
df.dropna(inplace=True)
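
Before dropping rows, it can help to measure how much data would be lost. The sketch below uses a small made-up frame (not the real dataset) to show the share of missing values per column and the share of rows that `dropna()` would remove:

```python
import pandas as pd

# Hypothetical toy frame standing in for the movie data.
toy = pd.DataFrame({
    "director": ["A", None, "C", "D"],
    "genres": ["Drama", "Comedy", None, "Action"],
    "runtime": [90, 100, 110, 120],
})

# Percent of missing values in each column.
missing_pct = toy.isna().mean() * 100

# Percent of rows that contain at least one missing value.
rows_lost_pct = toy.isna().any(axis=1).mean() * 100

print(missing_pct)
print(rows_lost_pct)
```

The same two expressions can be run on `df` before calling `df.dropna(inplace=True)`.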

<div style='background-color: #fff7f7; border: 2px solid; padding :8px; border-radius: 8px 8px 0 0;'>
    <font size="+2" color="green" ><b>3.5 | Handling Pipe-Separated Values</b></font>
</div>

In [None]:
# Split each pipe-separated string into a list of values
df['cast'] = df['cast'].apply(lambda x: x.split('|'))
df['genres'] = df['genres'].apply(lambda x: x.split('|'))
df['production_companies'] = df['production_companies'].apply(lambda x: x.split('|'))
df["genres"].head()
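
The list columns matter later because `explode` turns each list element into its own row. Here is a minimal sketch on made-up titles, using the vectorized `str.split` instead of `apply`:

```python
import pandas as pd

# Hypothetical toy rows, not taken from the dataset.
toy = pd.DataFrame({
    "original_title": ["Movie A", "Movie B"],
    "genres": ["Action|Adventure", "Drama"],
    "popularity": [2.0, 1.0],
})

# Split the pipe-separated string into a list per row.
toy["genres"] = toy["genres"].str.split("|")

# One row per (movie, genre) pair; other columns are repeated.
exploded = toy.explode("genres")

print(exploded)
```

A movie with two genres contributes its popularity to both of them, so per-genre totals add up to more than the overall total.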

<div style='background-color: #fff7f7; border: 2px solid; padding :8px; border-radius: 8px 8px 0 0;'>
    <font size="+2" color="green" ><b>3.6 | Handling Unrealistic Values</b></font>
</div>

In [None]:
# Look for unrealistic values, such as zeros in budget and revenue
zero_budget = df.loc[df["budget"] == 0]
zero_budget.head()

<div style="position: relative; background-color: #EAEAEA; font-size: 20px; font-family: Georgia; border: 3px solid #FF5733; padding: 10px; margin: 10px 0; color: #333;">
    <div style="position: absolute; top: -20px; left: 40%; transform: translateX(-50%); background-color: #FF5733; padding: 5px 15px; font-weight: bold; font-size: 24px; color: white;">Insights</div>
    <ul style="list-style-type: disc; padding-left: 20px;">
        <li style="margin-bottom: 10px;">We have 4749 rows with a zero budget, which is very odd</li>
        <li style="margin-bottom: 10px;">We can treat these values as outliers</li>
        <li style="margin-bottom: 10px;">I can't drop them because there are too many, about 50% of the dataset</li>
        <li style="margin-bottom: 10px;">Budget graphs that include them would be misleading</li>
        <li style="margin-bottom: 10px;">So let's find out more about these zero budgets</li>
    </ul>
</div>


In [None]:
yrs_counts_zero_budget = zero_budget["release_year"].value_counts()

# Use percentages rather than raw counts, because the number of movies per year grows over time
# Percentages make the years comparable
budget_zero_percent = (yrs_counts_zero_budget / df["release_year"].value_counts()) *100


sns.set_style("darkgrid")
sns.set_palette("deep")

sns.lineplot(x=budget_zero_percent.index,y=budget_zero_percent.values)

plt.title('Percentage of Zero-Budget Movies by Year')
plt.xlabel('Year')
plt.ylabel('Zero-Budget Movies (%)')

plt.show()
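
The division above aligns the two `value_counts()` results by year. A small sketch on made-up numbers shows one consequence: a year with no zero-budget rows becomes NaN rather than 0%. Taking the mean of a boolean mask per year is an alternative that avoids this:

```python
import pandas as pd

# Hypothetical toy data: 2002 has no zero-budget movie.
toy = pd.DataFrame({
    "release_year": [2000, 2000, 2001, 2001, 2001, 2002],
    "budget": [0, 10, 0, 0, 5, 7],
})

# Same approach as the cell above: zero-budget counts divided by all counts.
zero_counts = toy.loc[toy["budget"] == 0, "release_year"].value_counts()
all_counts = toy["release_year"].value_counts()
pct_by_division = (zero_counts / all_counts * 100).sort_index()

# Alternative: the mean of a boolean column is the share of True values.
pct_by_groupby = toy["budget"].eq(0).groupby(toy["release_year"]).mean() * 100

print(pct_by_division)
print(pct_by_groupby)
```

If the NaN years should be plotted as 0%, `pct_by_division.fillna(0)` gives the same result as the groupby version.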

<div style="position: relative; background-color: #EAEAEA; font-size: 20px; font-family: Georgia; border: 3px solid #FF5733; padding: 10px; margin: 10px 0; color: #333;">
    <div style="position: absolute; top: -20px; left: 40%; transform: translateX(-50%); background-color: #FF5733; padding: 5px 15px; font-weight: bold; font-size: 24px; color: white;">Insights</div>
    <ul style="list-style-type: disc; padding-left: 20px;">
        <li style="margin-bottom: 10px;">The share of zero budgets appears to vary randomly from year to year</li>
        <li style="margin-bottom: 10px;">But it is lower from the late 1990s to the early 2000s</li>
    </ul>
</div>


In [None]:
zero_revenue = df.loc[df["revenue"] == 0]
len(zero_revenue)  # number of rows with zero revenue (.size would count rows x columns)

In [None]:
# Replace the zeros with NaN so they won't affect the EDA.
# np.nan keeps both columns numeric (float64).
# To drop these rows instead, uncomment the last line.

df['budget'] = df['budget'].replace(0, np.nan)
df['revenue'] = df['revenue'].replace(0, np.nan)
# df.dropna(subset=['budget', 'revenue'], inplace=True)
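
To illustrate why the zeros are turned into missing values, here is a minimal sketch on made-up numbers. It uses `np.nan` as the missing marker: pandas skips NaN when averaging, and any profit computed from a missing budget or revenue is itself missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values; 0 stands for "unknown" here.
toy = pd.DataFrame({
    "budget": [0, 100, 200],
    "revenue": [50, 0, 500],
})

cleaned = toy.replace(0, np.nan)

mean_with_zeros = toy["budget"].mean()    # zeros pull the average down
mean_without = cleaned["budget"].mean()   # NaN is skipped by default

# Profit is NaN whenever budget or revenue is missing.
profit = cleaned["revenue"] - cleaned["budget"]

print(mean_with_zeros, mean_without)
print(profit)
```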

<div style='background-color: #fff7f7; border: 2px solid; padding :8px; border-radius: 8px 8px 0 0;'>
    <font size="+2" color="green" ><b>3.7 | Outliers</b></font>
</div>

In [None]:
# Use the IQR rule to detect outliers and return the outlier count for the column
def detect_outliers(df: pd.DataFrame, column: str):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]
    return f"{column} : {outliers[column].size}"

columns_to_check = ['popularity', 'budget', 'revenue', 'runtime',  'vote_count', 'vote_average',
        'budget_adj', 'revenue_adj']

# Detect outliers for each column and store the results in a list
all_outliers = [detect_outliers(df, column) for column in columns_to_check]
all_outliers
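
As a worked example of the IQR rule used in `detect_outliers`, the sketch below applies the same steps to five made-up values:

```python
import pandas as pd

# Hypothetical values with one obvious extreme.
values = pd.Series([1, 2, 3, 4, 100])

q1 = values.quantile(0.25)   # 2.0
q3 = values.quantile(0.75)   # 4.0
iqr = q3 - q1                # 2.0

# Anything outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR] is flagged.
lower = q1 - 1.5 * iqr       # -1.0
upper = q3 + 1.5 * iqr       # 7.0
outliers = values[(values < lower) | (values > upper)]

print(outliers.tolist())
```

Only `100` falls outside the fences. On heavily skewed columns such as budget or revenue, the rule tends to flag a large share of the upper tail.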

<div style="position: relative; background-color: #EAEAEA; font-size: 20px; font-family: Georgia; border: 3px solid #FF5733; padding: 10px; margin: 10px 0; color: #333;">
    <div style="position: absolute; top: -20px; left: 40%; transform: translateX(-50%); background-color: #FF5733; padding: 5px 15px; font-weight: bold; font-size: 24px; color: white;">Insights</div>
    <ul style="list-style-type: disc; padding-left: 20px;">
        <li style="margin-bottom: 10px;">There are many outliers, which is expected because some movies have far larger budgets, revenues, and so on than others</li>
        <li style="margin-bottom: 10px;">Runtime is the only column where I would accept these values as real outliers, because most movies have similar runtimes</li>
        <li style="margin-bottom: 10px;">But I won't remove anything</li>
    </ul>
</div>


<div style='background-color: #fff7f7; border: 2px solid; padding :8px; border-radius: 8px 8px 0 0;'>
    <font size="+2" color="green" ><b>3.8 | Make Profit Column</b></font>
</div>

In [None]:
# Making the profit column to use it in the EDA
df["profit"] = df["revenue"] - df["budget"]
df["profit_adj"] = df["revenue_adj"] - df["budget_adj"]

<a id='eda'></a>
## <div style="text-align: left; background-color: #BB4ED8; font-family: Trebuchet MS; color: white; padding: 15px; line-height: 1; border-radius: 1px; margin-bottom: 0em; text-align: center; font-size: 25px">4|Exploratory Data Analysis</div>

## <div style="text-align: left; background-color: #ffd700; font-family: Trebuchet MS; padding: 15px; line-height: 1; border-radius: 1px; margin-bottom: 0em; text-align: center; font-size: 22px; color:#333">Research Question 1 | What is the total popularity by genre?</div>

In [None]:
def groupby_plots(df: pd.DataFrame,
                  by: str,
                  column: str,
                  operation: str,
                  count: int = None) -> pd.Series:
    """
    Generate a grouped bar plot based on a specified operation (mean or sum) for a given DataFrame.

    Parameters:
    - df (DataFrame): The DataFrame containing the data.
    - by (str): The column to group by.
    - column (str): The column to perform the operation on and plot.
    - operation (str): The operation to perform. Options: 'mean' or 'sum'.
    - count (int, optional): The number of groups to plot (based on the sorted values). Default is None (plot all).

    Returns:
    - Shows the bar plot
    - grouped (Series): The result of the grouped operation.

    Example:
    groupby_plots(df, by="genres", column="popularity", operation="sum", count=10)
    """
    operation_val = "Average" if operation == "mean" else "Total"
    
    grouped = df.groupby(by)[column].agg(operation).sort_values(ascending=False)[:count]

    plt.figure(figsize=(10, 6))
    sns.barplot(x=grouped.index,y=grouped.values,palette="viridis")
    plt.title(f'{operation_val} {column.title()} by {by.title()}')
    plt.xlabel(f'{by.title()}')
    plt.ylabel(f'{operation_val} {column.title()}')
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    plt.show()
    return grouped
    
groupby_plots(df=df.explode('genres'),by="genres",column="popularity",operation="sum")


<div style="position: relative; background-color: #EAEAEA; font-size: 20px; font-family: Georgia; border: 3px solid #FF5733; padding: 10px; margin: 10px 0; color: #333;">
    <div style="position: absolute; top: -20px; left: 40%; transform: translateX(-50%); background-color: #FF5733; padding: 5px 15px; font-weight: bold; font-size: 24px; color: white;">Insights</div>
    <ul style="list-style-type: disc; padding-left: 20px;">
        <li style="margin-bottom: 10px;">I chose to start with the total popularity of each genre</li>
        <li style="margin-bottom: 10px;">Drama has the highest total popularity</li>
        <li style="margin-bottom: 10px;">TV Movie has the lowest total popularity</li>
    </ul>
</div>

## <div style="text-align: left; background-color: #ffd700; font-family: Trebuchet MS; padding: 15px; line-height: 1; border-radius: 1px; margin-bottom: 0em; text-align: center; font-size: 22px">Research Question 2 |How many movies are there in each genre?</div>

In [None]:
def column_counts(df: pd.DataFrame,
                  column: str,
                  xlabel: str,
                  ylabel: str,
                  count: int = None,
                  kind: str = "bar",
                  fig_x: int = 10,
                  fig_y: int = 6,
                  palette: str = "viridis") -> pd.Series:   
    """
    Generate a count plot or line plot based on the values of a specified column in a DataFrame.

    Parameters:
    - df (DataFrame): The DataFrame containing the data.
    - column (str): The column for which to generate counts.
    - xlabel (str): Label for the x-axis.
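    - ylabel (str): Label for the y-axis.
    - count (int, optional): Number of most frequent values to keep. Default is None (keep all).
    - kind (str, optional): Type of plot to generate. Options: 'bar' (default) or 'line'.
    - fig_x (int, optional): Width of the figure in inches. Default is 10.
    - fig_y (int, optional): Height of the figure in inches. Default is 6.
    - palette (str, optional): Seaborn palette name for the bar plot. Default is "viridis".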

    Returns:
    - Show the plot
    - value_counts (Series): The counts of unique values in the specified column.

    Example:
    column_counts(df, column="genres", xlabel="Genre", ylabel="Number of Movies", kind="bar", fig_x=12, fig_y=8)
    """
    value_counts = df[column].value_counts()[:count]
    plt.figure(figsize=(fig_x, fig_y))
    
    if kind == "bar":
        sns.barplot(x=value_counts.index,y=value_counts.values,palette=palette)
        plt.xticks(rotation=45, ha='right')
    elif kind == "line":
        sns.lineplot(x=value_counts.index,y=value_counts.values)
    
    plt.title(f'Movies Counts By {xlabel}')
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)

    plt.tight_layout()
    plt.show()
    return value_counts

count_movies_gener = column_counts(df=df.explode('genres'),column="genres",xlabel="Genre",ylabel="Counts")

<div style="position: relative; background-color: #EAEAEA; font-size: 20px; font-family: Georgia; border: 3px solid #FF5733; padding: 10px; margin: 10px 0; color: #333;">
    <div style="position: absolute; top: -20px; left: 40%; transform: translateX(-50%); background-color: #FF5733; padding: 5px 15px; font-weight: bold; font-size: 24px; color: white;">Insights</div>
    <ul style="list-style-type: disc; padding-left: 20px;">
        <li style="margin-bottom: 10px;">Drama has the largest number of movies, which explains its high total popularity</li>
        <li style="margin-bottom: 10px;">Science Fiction has relatively few movies but a high total popularity</li>
        <li style="margin-bottom: 10px;">So, to be more accurate, I will look at the mean popularity of each genre</li>
    </ul>
</div>
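
The relationship behind these insights is that a genre's total equals its movie count times its mean, so totals favor genres with many movies. A minimal sketch with made-up numbers:

```python
import pandas as pd

# Hypothetical toy data: many average Drama movies, one popular Science Fiction movie.
toy = pd.DataFrame({
    "genres": ["Drama", "Drama", "Drama", "Drama", "Science Fiction"],
    "popularity": [1.0, 1.0, 1.0, 1.0, 3.0],
})

# count, sum and mean side by side; sum == count * mean for every genre.
stats = toy.groupby("genres")["popularity"].agg(["count", "sum", "mean"])

print(stats)
print(stats["sum"].idxmax(), stats["mean"].idxmax())
```

Here Drama wins on total popularity while Science Fiction wins on the mean, which is why the two rankings can disagree.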

## <div style="text-align: left; background-color: #ffd700; font-family: Trebuchet MS; padding: 15px; line-height: 1; border-radius: 1px; margin-bottom: 0em; text-align: center; font-size: 22px">Research Question 3 |What is the average popularity by genre?</div>

In [None]:
popularity_mean_gener = groupby_plots(df=df.explode('genres'),by="genres",column="popularity",operation="mean")

<div style="position: relative; background-color: #EAEAEA; font-size: 20px; font-family: Georgia; border: 3px solid #FF5733; padding: 10px; margin: 10px 0; color: #333;">
    <div style="position: absolute; top: -20px; left: 40%; transform: translateX(-50%); background-color: #FF5733; padding: 5px 15px; font-weight: bold; font-size: 24px; color: white;">Insights</div>
    <ul style="list-style-type: disc; padding-left: 20px;">
        <li style="margin-bottom: 10px;">This graph is more accurate than the previous one</li>
        <li style="margin-bottom: 10px;">Adventure has the highest average popularity</li>
        <li style="margin-bottom: 10px;">The differences in average popularity between genres are small</li>
    </ul>
</div>

## <div style="text-align: left; background-color: #ffd700; font-family: Trebuchet MS; padding: 15px; line-height: 1; border-radius: 1px; margin-bottom: 0em; text-align: center; font-size: 22px">Research Question 4 |What is the average revenue by genre?</div>

In [None]:
revenue_mean_gener = groupby_plots(df=df.explode('genres'),by="genres",column="revenue_adj",operation="mean")

<div style="position: relative; background-color: #EAEAEA; font-size: 20px; font-family: Georgia; border: 3px solid #FF5733; padding: 10px; margin: 10px 0; color: #333;">
    <div style="position: absolute; top: -20px; left: 40%; transform: translateX(-50%); background-color: #FF5733; padding: 5px 15px; font-weight: bold; font-size: 24px; color: white;">Insights</div>
    <ul style="list-style-type: disc; padding-left: 20px;">
        <li style="margin-bottom: 10px;">Fantasy and Adventure are high</li>
        <li style="margin-bottom: 10px;">From Family to Science Fiction, the values are nearly the same</li>
        <li style="margin-bottom: 10px;">From Crime to Drama, the values are nearly the same</li>
        <li style="margin-bottom: 10px;">Documentary and Foreign have very low revenues</li>
        <li style="margin-bottom: 10px;">TV Movie is the lowest</li>
    </ul>
</div>

## <div style="text-align: left; background-color: #ffd700; font-family: Trebuchet MS; padding: 15px; line-height: 1; border-radius: 1px; margin-bottom: 0em; text-align: center; font-size: 22px;color:#333;">Research Question 5 |What is the average vote count by genre?</div>

In [None]:
vote_count_mean_gener = groupby_plots(df=df.explode('genres'),by="genres",column="vote_count",operation="mean")

<div style="position: relative; background-color: #EAEAEA; font-size: 20px; font-family: Georgia; border: 3px solid #FF5733; padding: 10px; margin: 10px 0; color: #333;">
    <div style="position: absolute; top: -20px; left: 40%; transform: translateX(-50%); background-color: #FF5733; padding: 5px 15px; font-weight: bold; font-size: 24px; color: white;">Insights</div>
    <ul style="list-style-type: disc; padding-left: 20px;">
        <li style="margin-bottom: 10px;">Adventure has the highest average vote count</li>
        <li style="margin-bottom: 10px;">Foreign has the lowest average vote count</li>
    </ul>
</div>

## <div style="text-align: left; background-color: #ffd700; font-family: Trebuchet MS; padding: 15px; line-height: 1; border-radius: 1px; margin-bottom: 0em; text-align: center; font-size: 22px;color:#333;">Research Question 6 |What is the average vote average by genre?</div>

In [None]:
vote_average_mean_gener = groupby_plots(df=df.explode('genres'),by="genres",column="vote_average",operation="mean")

<div style="position: relative; background-color: #EAEAEA; font-size: 20px; font-family: Georgia; border: 3px solid #FF5733; padding: 10px; margin: 10px 0; color: #333;">
    <div style="position: absolute; top: -20px; left: 40%; transform: translateX(-50%); background-color: #FF5733; padding: 5px 15px; font-weight: bold; font-size: 24px; color: white;">Insights</div>
    <ul style="list-style-type: disc; padding-left: 20px;">
        <li style="margin-bottom: 10px;">Most values are very close to each other</li>
        <li style="margin-bottom: 10px;">This graph gives no clear insights</li>
    </ul>
</div>

## <div style="text-align: left; background-color: #ffd700; font-family: Trebuchet MS; padding: 15px; line-height: 1; border-radius: 1px; margin-bottom: 0em; text-align: center; font-size: 22px;color:#333;">Research Question 7 |What is the average profit by genre?</div>

In [None]:
profit_mean_gener = groupby_plots(df=df.explode('genres'),by="genres",column="profit_adj",operation="mean")

<div style="position: relative; background-color: #EAEAEA; font-size: 20px; font-family: Georgia; border: 3px solid #FF5733; padding: 10px; margin: 10px 0; color: #333;">
    <div style="position: absolute; top: -20px; left: 40%; transform: translateX(-50%); background-color: #FF5733; padding: 5px 15px; font-weight: bold; font-size: 24px; color: white;">Insights</div>
    <ul style="list-style-type: disc; padding-left: 20px;">
        <li style="margin-bottom: 10px;">This does not accurately reflect real life, because the budget and revenue columns contain many misleading values, such as a large number of zeros</li>
        <li style="margin-bottom: 10px;">So this shows what the dataset says, not what actually happened</li>
        <li style="margin-bottom: 10px;">Adventure is very high</li>
        <li style="margin-bottom: 10px;">Foreign movies are losing money!</li>
        <li style="margin-bottom: 10px;">TV Movie has 0 profit</li>
        <li style="margin-bottom: 10px;">Documentary has the lowest profit</li>
    </ul>
</div>

## <div style="text-align: left; background-color: #ffd700; font-family: Trebuchet MS; padding: 15px; line-height: 1; border-radius: 1px; margin-bottom: 0em; text-align: center; font-size: 22px;color:#333;">Research Question 8 |Which genre is the overall best-performing?</div>

In [None]:
# Rank each genre per metric (rank 1 = best), then subtract the rank from 21 so that a higher score is better
profit_edited = 21 - profit_mean_gener.rank(method='min', ascending=False)
vote_average_edited = 21 - vote_average_mean_gener.rank(method='min', ascending=False)
vote_count_edited = 21 - vote_count_mean_gener.rank(method='min', ascending=False)
popularity_edited = 21 - popularity_mean_gener.rank(method='min', ascending=False)
movie_counts_edited = 21 - count_movies_gener.rank(method='min', ascending=False)

# Average the five scores, sort them once, and plot the result
overall_mean_gener = (profit_edited+popularity_edited+vote_average_edited+vote_count_edited+movie_counts_edited) / 5
overall_sorted = overall_mean_gener.sort_values()

plt.figure(figsize=(10, 6))
sns.barplot(x=overall_sorted.index,y=overall_sorted.values,palette="viridis")
plt.title('Overall Best Genres')
plt.xlabel('Genres')
plt.ylabel('Score')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()


<div style="position: relative; background-color: #EAEAEA; font-size: 20px; font-family: Georgia; border: 3px solid #FF5733; padding: 10px; margin: 10px 0; color: #333;">
    <div style="position: absolute; top: -20px; left: 40%; transform: translateX(-50%); background-color: #FF5733; padding: 5px 15px; font-weight: bold; font-size: 24px; color: white;">Insights</div>
    <ul style="list-style-type: disc; padding-left: 20px;">
        <li style="margin-bottom: 10px;">
        This graph averages each genre's rank-based scores for popularity, profit, vote count, vote average, and movie count
        </li>
        <li style="margin-bottom: 10px;">Adventure is the best genre according to this dataset</li>
        <li style="margin-bottom: 10px;">The results are not surprising and not very different from the other graphs</li>
        <li style="margin-bottom: 10px;">There is no single best genre; it depends on the viewer</li>
        <li style="margin-bottom: 10px;">This is just what the numbers say</li>
    </ul>
</div>
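
The rank-based score can be sketched on a tiny made-up example. The helper name `to_score` and all numbers are hypothetical. Instead of the constant 21, the sketch uses `n + 1`, where `n` is the number of genres, so the best genre on a metric always receives the score `n`:

```python
import pandas as pd

# Hypothetical per-genre metrics sharing the same index.
profit = pd.Series({"Adventure": 90.0, "Drama": 30.0, "Foreign": -5.0})
votes = pd.Series({"Adventure": 500.0, "Drama": 200.0, "Foreign": 20.0})
counts = pd.Series({"Adventure": 100, "Drama": 400, "Foreign": 10})

n = len(profit)

def to_score(metric: pd.Series) -> pd.Series:
    # The largest value gets rank 1, so (n + 1 - rank) gives it the top score n.
    return (n + 1) - metric.rank(method="min", ascending=False)

# Series are added by index label, so the genre order does not matter.
overall = (to_score(profit) + to_score(votes) + to_score(counts)) / 3

print(overall.sort_values(ascending=False))
```

Ranks discard the size of the gaps between genres: a narrow win counts the same as a large one.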

## <div style="text-align: left; background-color: #ffd700; font-family: Trebuchet MS; padding: 15px; line-height: 1; border-radius: 1px; margin-bottom: 0em; text-align: center; font-size: 22px;color:#333;">Research Question 9 |What are the top 10 production companies by profit?</div>

In [None]:
groupby_plots(df=df.explode('production_companies'),by="production_companies",column="profit_adj",operation="sum",count=20)

<div style="position: relative; background-color: #EAEAEA; font-size: 20px; font-family: Georgia; border: 3px solid #FF5733; padding: 10px; margin: 10px 0; color: #333;">
    <div style="position: absolute; top: -20px; left: 40%; transform: translateX(-50%); background-color: #FF5733; padding: 5px 15px; font-weight: bold; font-size: 24px; color: white;">Insights</div>
    <ul style="list-style-type: disc; padding-left: 20px;">
        <li style="margin-bottom: 10px;">
        I chose the total profit because I think it is more accurate: the total reflects both the number of movies and the profit of each movie
        </li>
        <li style="margin-bottom: 10px;">Warner Bros. makes the largest total profit</li>
        <li style="margin-bottom: 10px;">The last 10 are very close together</li>
    </ul>
</div>

## <div style="text-align: left; background-color: #ffd700; font-family: Trebuchet MS; padding: 15px; line-height: 1; border-radius: 1px; margin-bottom: 0em; text-align: center; font-size: 22px;color:#333;">Research Question 10 |Who are the top directors with the highest counts of movies?</div>

In [None]:
column_counts(df=df,column="director",xlabel="Director",ylabel="Counts",count=15)

<div style="position: relative; background-color: #EAEAEA; font-size: 20px; font-family: Georgia; border: 3px solid #FF5733; padding: 10px; margin: 10px 0; color: #333;">
    <div style="position: absolute; top: -20px; left: 40%; transform: translateX(-50%); background-color: #FF5733; padding: 5px 15px; font-weight: bold; font-size: 24px; color: white;">Insights</div>
    <ul style="list-style-type: disc; padding-left: 20px;">
        <li style="margin-bottom: 10px;">
        This graph shows the directors with the highest number of movies
        </li>
        <li style="margin-bottom: 10px;">Woody Allen has the highest number of movies (42)</li>
    </ul>
</div>

## <div style="text-align: left; background-color: #ffd700; font-family: Trebuchet MS; padding: 15px; line-height: 1; border-radius: 1px; margin-bottom: 0em; text-align: center; font-size: 22px;color:#333;">Research Question 10 |Who are the top directors with popular movies?</div>

In [None]:
groupby_plots(df,by="director",column="popularity",operation="sum",count=15)

<div style="position: relative; background-color: #EAEAEA; font-size: 20px; font-family: Georgia; border: 3px solid #FF5733; padding: 10px; margin: 10px 0; color: #333;">
    <div style="position: absolute; top: -20px; left: 40%; transform: translateX(-50%); background-color: #FF5733; padding: 5px 15px; font-weight: bold; font-size: 24px; color: white;">Insights</div>
    <ul style="list-style-type: disc; padding-left: 20px;">
        <li style="margin-bottom: 10px;">
        This graph shows the total popularity of movies by director
        </li>
        <li style="margin-bottom: 10px;">Christopher Nolan has the highest total popularity</li>
        <li style="margin-bottom: 10px;">The last 10 directors are very close together, with values between 28 and 35</li>
    </ul>
</div>

## <div style="text-align: left; background-color: #ffd700; font-family: Trebuchet MS; padding: 15px; line-height: 1; border-radius: 1px; margin-bottom: 0em; text-align: center; font-size: 22px;color:#333;">Research Question 11 | How does the count of movies vary over the years?</div>

In [None]:
column_counts(df=df,column="release_year",xlabel="Year",ylabel="Counts",kind="line")

<div style="position: relative; background-color: #EAEAEA; font-size: 20px; font-family: Georgia; border: 3px solid #FF5733; padding: 10px; margin: 10px 0; color: #333;">
    <div style="position: absolute; top: -20px; left: 40%; transform: translateX(-50%); background-color: #FF5733; padding: 5px 15px; font-weight: bold; font-size: 24px; color: white;">Insights</div>
    <ul style="list-style-type: disc; padding-left: 20px;">
        <li style="margin-bottom: 10px;">
        This graph shows the number of movies released each year
        </li>
        <li style="margin-bottom: 10px;">2013 has the highest number of movies (635)</li>
        <li style="margin-bottom: 10px;">1969 has the lowest number of movies (29)</li>
    </ul>
</div>
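
The busiest and quietest years can also be read directly from the counts rather than from the plot. This is a small sketch on made-up years, not the notebook's `column_counts` helper; the values are hypothetical.

```python
import pandas as pd

# Toy release years (hypothetical values).
release_year = pd.Series([2011, 2013, 2013, 1969, 2013, 2011], name="release_year")

# Count movies per year and order the result chronologically.
counts = release_year.value_counts().sort_index()

# idxmax / idxmin return the year (index label) with the most / fewest movies.
busiest, quietest = counts.idxmax(), counts.idxmin()
print(counts)
print(busiest, quietest)
```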

## <div style="text-align: left; background-color: #ffd700; font-family: Trebuchet MS; padding: 15px; line-height: 1; border-radius: 1px; margin-bottom: 0em; text-align: center; font-size: 22px;color:#333;">Research Question 12 | What are the top 10 movies by profit?</div>

In [None]:
def get_top_10_movies(df: pd.DataFrame,by_col: str):
    """
    Display the top 10 movies based on a specified column in a DataFrame.

    Parameters:
    - df (DataFrame): The DataFrame containing movie data.
    - by_col (str): The column name based on which the movies are sorted.

    Returns:
    - None: Displays a bar plot showing the top 10 movies.
    """
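    # Copy the top rows so adding a column does not modify (or warn about) the original DataFrame
    top_movies = df.sort_values(by=by_col, ascending=False).head(10).copy()
    top_movies["title_with_year"] = top_movies["original_title"] + " (" + top_movies["release_year"].astype(str) + ")"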

    plt.figure(figsize=(10, 6))
    p = sns.barplot(x=top_movies[by_col], y=top_movies["title_with_year"], palette="viridis")

    plt.xlabel(f"{by_col.title()}")
    plt.ylabel("Movie Title")
    plt.title(f"Top 10 Movies by {by_col.title()}")

    for container in p.containers:
        p.bar_label(container,color="yellow",
                    bbox = {'boxstyle': 'square', 'facecolor': 'black', 'edgecolor': 'red'})
    
    plt.show()

get_top_10_movies(df,"profit_adj")

<div style="position: relative; background-color: #EAEAEA; font-size: 20px; font-family: Georgia; border: 3px solid #FF5733; padding: 10px; margin: 10px 0; color: #333;">
    <div style="position: absolute; top: -20px; left: 40%; transform: translateX(-50%); background-color: #FF5733; padding: 5px 15px; font-weight: bold; font-size: 24px; color: white;">Insights</div>
    <ul style="list-style-type: disc; padding-left: 20px;">
        <li style="margin-bottom: 10px;">
        This graph shows the movies with the highest profits
        </li>
        <li style="margin-bottom: 10px;">Star Wars has the highest profit of all the movies (2.7 billion)</li>
        <li style="margin-bottom: 10px;">Each decade has 2-3 movies that make a very large profit</li>
    </ul>
</div>

## <div style="text-align: left; background-color: #ffd700; font-family: Trebuchet MS; padding: 15px; line-height: 1; border-radius: 1px; margin-bottom: 0em; text-align: center; font-size: 22px;color:#333;">Research Question 13 | What are the top 10 movies by popularity?</div>

In [None]:
get_top_10_movies(df,"popularity")

<div style="position: relative; background-color: #EAEAEA; font-size: 20px; font-family: Georgia; border: 3px solid #FF5733; padding: 10px; margin: 10px 0; color: #333;">
    <div style="position: absolute; top: -20px; left: 40%; transform: translateX(-50%); background-color: #FF5733; padding: 5px 15px; font-weight: bold; font-size: 24px; color: white;">Insights</div>
    <ul style="list-style-type: disc; padding-left: 20px;">
        <li style="margin-bottom: 10px;">
        This graph shows the movies with the highest popularity
        </li>
        <li style="margin-bottom: 10px;">Most of the top popular movies were released in 2014 and 2015</li>
        <li style="margin-bottom: 10px;">Star Wars is the only older movie in the list</li>
    </ul>
</div>

## <div style="text-align: left; background-color: #ffd700; font-family: Trebuchet MS; padding: 15px; line-height: 1; border-radius: 1px; margin-bottom: 0em; text-align: center; font-size: 22px;color:#333;">Research Question 14 | What are the top 10 movies by vote count?</div>

In [None]:
get_top_10_movies(df,"vote_count")

<div style="position: relative; background-color: #EAEAEA; font-size: 20px; font-family: Georgia; border: 3px solid #FF5733; padding: 10px; margin: 10px 0; color: #333;">
    <div style="position: absolute; top: -20px; left: 40%; transform: translateX(-50%); background-color: #FF5733; padding: 5px 15px; font-weight: bold; font-size: 24px; color: white;">Insights</div>
    <ul style="list-style-type: disc; padding-left: 20px;">
        <li style="margin-bottom: 10px;">
        This graph shows the movies with the highest vote counts
        </li>
        <li style="margin-bottom: 10px;">The movies with the highest vote counts were released between 2008 and 2014</li>
        <li style="margin-bottom: 10px;">Inception is the movie with the highest vote count</li>
    </ul>
</div>

## <div style="text-align: left; background-color: #ffd700; font-family: Trebuchet MS; padding: 15px; line-height: 1; border-radius: 1px; margin-bottom: 0em; text-align: center; font-size: 22px;color:#333;">Research Question 15 | What are the top 10 movies by vote average?</div>

In [None]:
get_top_10_movies(df,"vote_average")

<div style="position: relative; background-color: #EAEAEA; font-size: 20px; font-family: Georgia; border: 3px solid #FF5733; padding: 10px; margin: 10px 0; color: #333;">
    <div style="position: absolute; top: -20px; left: 40%; transform: translateX(-50%); background-color: #FF5733; padding: 5px 15px; font-weight: bold; font-size: 24px; color: white;">Insights</div>
    <ul style="list-style-type: disc; padding-left: 20px;">
        <li style="margin-bottom: 10px;">
        This graph shows the movies with the highest vote average
        </li>
        <li style="margin-bottom: 10px;">Using the average rating alone as the score is not fair: a movie with an 8.9 average rating from only 3 votes cannot be considered better than a movie with a 7.8 average rating from 40 votes.</li>
        <li style="margin-bottom: 10px;">So to get the best movies we can use IMDb's weighted rating</li>
        <img src="https://image.ibb.co/jYWZp9/wr.png">
        <li style="margin-bottom: 10px;">where:</li>
        <li>v is the number of votes for the movie;</li>
        <li>m is the minimum votes required to be listed in the chart;</li>
        <li>R is the average rating of the movie;</li>
        <li>C is the mean vote across the whole report.</li>
    </ul>
</div>
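
As a quick worked example of the weighted rating, the sketch below scores the two movies mentioned above (8.9 from 3 votes, 7.8 from 40 votes). The values chosen for `m` and `C` are assumptions for illustration only; the notebook computes its own values from the data in the next cell.

```python
def weighted_rating(v, R, m, C):
    """Weighted rating: pulls R toward the global mean C when v is small relative to m."""
    return (v / (v + m)) * R + (m / (v + m)) * C

# Hypothetical threshold and global mean, chosen only for this example.
m, C = 50, 6.0

few_votes = weighted_rating(v=3, R=8.9, m=m, C=C)    # high average, very few votes
many_votes = weighted_rating(v=40, R=7.8, m=m, C=C)  # lower average, more votes

print(round(few_votes, 2), round(many_votes, 2))
```

With these assumed values, the movie with more votes ends up with the higher score even though its raw average is lower, which is the behavior the formula is meant to provide.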

## <div style="text-align: left; background-color: #ffd700; font-family: Trebuchet MS; padding: 15px; line-height: 1; border-radius: 1px; margin-bottom: 0em; text-align: center; font-size: 22px;color:#333;">IMDb Weighted Rating Formula</div>

In [None]:
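# v (vote_count) and R (vote_average) are already columns in the dataset
# C is the mean vote average across all movies
C = df["vote_average"].mean()
print(C)

# M is the minimum number of votes required to be listed,
# taken here as the 90th percentile of vote_count
M = df["vote_count"].quantile(.9)
print(M)
# So only the top 10 percent of the dataset by vote count is plotted (582 votes or more)

# Filter the dataset according to M; copy so a score column can be added later

top_10p_movies = df.loc[df["vote_count"] >= M].copy()

top_10p_movies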

<div style="position: relative; background-color: #EAEAEA; font-size: 20px; font-family: Georgia; border: 3px solid #FF5733; padding: 10px; margin: 10px 0; color: #333;">
    <div style="position: absolute; top: -20px; left: 40%; transform: translateX(-50%); background-color: #FF5733; padding: 5px 15px; font-weight: bold; font-size: 24px; color: white;">Insights</div>
    <ul style="list-style-type: disc; padding-left: 20px;">
        <li style="margin-bottom: 10px;">All the required variables are now set up</li>
        <li style="margin-bottom: 10px;">So let's build the rating function</li>
    </ul>
</div>

## <div style="text-align: left; background-color: #ffd700; font-family: Trebuchet MS; padding: 15px; line-height: 1; border-radius: 1px; margin-bottom: 0em; text-align: center; font-size: 22px;color:#333;">Research Question 16 | What are the top 10 movies by score?</div>

In [None]:
def weighted_rating(row, M=M, C=C):
    """Return the IMDb weighted rating for one row of the DataFrame."""
    v = row['vote_count']
    R = row['vote_average']

    # Calculation based on the IMDb formula
    return (v/(v+M) * R) + (M/(M+v) * C)

# axis=1 applies the function to each row; assign() returns a new DataFrame
# instead of writing into a filtered slice
top_10p_movies = top_10p_movies.assign(score=top_10p_movies.apply(weighted_rating, axis=1))


get_top_10_movies(top_10p_movies,"score")

<div style="position: relative; background-color: #EAEAEA; font-size: 20px; font-family: Georgia; border: 3px solid #FF5733; padding: 10px; margin: 10px 0; color: #333;">
    <div style="position: absolute; top: -20px; left: 40%; transform: translateX(-50%); background-color: #FF5733; padding: 5px 15px; font-weight: bold; font-size: 24px; color: white;">Insights</div>
    <ul style="list-style-type: disc; padding-left: 20px;">
        <li style="margin-bottom: 10px;">
        This graph shows the movies with the highest score
        </li>
        <li style="margin-bottom: 10px;">Most of the movies are old</li>
        <li style="margin-bottom: 10px;">Interstellar, Inception and The Dark Knight are the newest movies</li>
        <li style="margin-bottom: 10px;">The movie with the highest score is The Shawshank Redemption</li>
        <li style="margin-bottom: 10px;">The score depends on both vote count and vote average</li>
    </ul>
</div>

## <div style="text-align: left; background-color: #ffd700; font-family: Trebuchet MS; padding: 15px; line-height: 1; border-radius: 1px; margin-bottom: 0em; text-align: center; font-size: 22px;color:#333;">Research Question 17 | What is the distribution of runtime in movies?</div>

In [None]:
plt.hist(df["runtime"])
plt.xlabel("Runtime (minutes)")
plt.ylabel("Number of movies")
plt.show()

<div style="position: relative; background-color: #EAEAEA; font-size: 20px; font-family: Georgia; border: 3px solid #FF5733; padding: 10px; margin: 10px 0; color: #333;">
    <div style="position: absolute; top: -20px; left: 40%; transform: translateX(-50%); background-color: #FF5733; padding: 5px 15px; font-weight: bold; font-size: 24px; color: white;">Insights</div>
    <ul style="list-style-type: disc; padding-left: 20px;">
        <li style="margin-bottom: 10px;">Most runtimes fall between 90 and 150 minutes</li>
        <li style="margin-bottom: 10px;">To look closer, I will separate them into 3 categories:
            <ul>
                <li>Less than 1.5 hours</li>
                <li>Between 1.5 - 2.5 hours</li>
                <li>More than 2.5 hours</li>
            </ul>
        </li>
    </ul>
</div>


## <div style="text-align: left; background-color: #ffd700; font-family: Trebuchet MS; padding: 15px; line-height: 1; border-radius: 1px; margin-bottom: 0em; text-align: center; font-size: 22px;color:#333;">Research Question 18 | How many movies are there for each duration category?</div>

In [None]:
runtime = df["runtime"]

# Create a new column called duration_bin using boolean masks
df.loc[runtime < 90, 'duration_bin'] = 'Less than 1.5 hours'
df.loc[(runtime >= 90) & (runtime < 150), 'duration_bin'] = 'Between 1.5 - 2.5 hours'
df.loc[runtime >= 150, 'duration_bin'] = 'More than 2.5 hours'
df.head()

In [None]:
column_counts(df=df,column="duration_bin",xlabel="Duration",ylabel="Counts",kind="bar",fig_x=4)


<div style="position: relative; background-color: #EAEAEA; font-size: 20px; font-family: Georgia; border: 3px solid #FF5733; padding: 10px; margin: 10px 0; color: #333;">
    <div style="position: absolute; top: -20px; left: 40%; transform: translateX(-50%); background-color: #FF5733; padding: 5px 15px; font-weight: bold; font-size: 24px; color: white;">Insights</div>
    <ul style="list-style-type: disc; padding-left: 20px;">
        <li style="margin-bottom: 10px;">This chart agrees with the previous histogram</li>
        <li style="margin-bottom: 10px;">But it gives an exact count for each category</li>
        <li style="margin-bottom: 10px;">7498 movies are between 1.5 and 2.5 hours</li>
    </ul>
</div>
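
The three duration categories could also be built in one step with `pandas.cut`. This is an alternative sketch on made-up runtimes, not the approach used in the cell above. Assumptions to note: the lower edge of 0 presumes runtimes are non-negative, and `right=False` makes each bin include its left edge, matching the `>= 90` and `>= 150` boundaries.

```python
import pandas as pd

# Toy runtimes in minutes (hypothetical values).
runtime = pd.Series([45, 89, 90, 120, 149, 150, 200], name="runtime")

labels = ["Less than 1.5 hours", "Between 1.5 - 2.5 hours", "More than 2.5 hours"]

# Bins are [0, 90), [90, 150), [150, inf) because right=False.
duration_bin = pd.cut(runtime, bins=[0, 90, 150, float("inf")], right=False, labels=labels)

counts = duration_bin.value_counts()
print(counts)
```

One design difference: `pd.cut` returns an ordered categorical column, so the categories keep a fixed logical order in later tables.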


## <div style="text-align: left; background-color: #ffd700; font-family: Trebuchet MS; padding: 15px; line-height: 1; border-radius: 1px; margin-bottom: 0em; text-align: center; font-size: 22px;color:#333;">Research Question 19 | How many movies are released each month?</div>

In [None]:
df["month"] = df["release_date"].dt.month
sns.histplot(df["month"],bins=12,kde=True)
plt.show()

<div style="position: relative; background-color: #EAEAEA; font-size: 20px; font-family: Georgia; border: 3px solid #FF5733; padding: 10px; margin: 10px 0; color: #333;">
    <div style="position: absolute; top: -20px; left: 40%; transform: translateX(-50%); background-color: #FF5733; padding: 5px 15px; font-weight: bold; font-size: 24px; color: white;">Insights</div>
    <ul style="list-style-type: disc; padding-left: 20px;">
        <li style="margin-bottom: 10px;">The 9th month (September) has the most movies</li>
        <li style="margin-bottom: 10px;">The 2nd month (February) has the fewest movies</li>
        <li style="margin-bottom: 10px;">The histogram may be roughly symmetric</li>
    </ul>
</div>
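
Exact per-month counts are an alternative to reading bar heights off the histogram. A minimal sketch on made-up dates follows; the dates are hypothetical, and `reindex` is used so that months with no releases still appear with a count of 0.

```python
import pandas as pd

# Toy release dates (hypothetical values).
release_date = pd.to_datetime(
    pd.Series(["2014-09-05", "2015-09-18", "2013-02-01", "2012-12-25", "2011-09-09"])
)

# Extract the month number (1-12) and count releases per month.
month = release_date.dt.month
per_month = month.value_counts().reindex(range(1, 13), fill_value=0)

print(per_month)
print("Busiest month:", per_month.idxmax())
```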


## <div style="text-align: left; background-color: #ffd700; font-family: Trebuchet MS; padding: 15px; line-height: 1; border-radius: 1px; margin-bottom: 0em; text-align: center; font-size: 22px;color:#333;">Research Question 20 | On which day are most movies released?</div>

In [None]:
df["day_name"] = df["release_date"].dt.day_name()
column_counts(df=df,column="day_name",xlabel="Day",ylabel="Counts",kind="bar")

<div style="position: relative; background-color: #EAEAEA; font-size: 20px; font-family: Georgia; border: 3px solid #FF5733; padding: 10px; margin: 10px 0; color: #333;">
    <div style="position: absolute; top: -20px; left: 40%; transform: translateX(-50%); background-color: #FF5733; padding: 5px 15px; font-weight: bold; font-size: 24px; color: white;">Insights</div>
    <ul style="list-style-type: disc; padding-left: 20px;">
        <li style="margin-bottom: 10px;">Friday has the most releases, probably because the weekend follows</li>
        <li style="margin-bottom: 10px;">Monday has the fewest releases; it is the start of the work week</li>
    </ul>
</div>
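
When comparing weekdays it can help to show them in calendar order instead of sorted by count. The sketch below does this on made-up dates; it is independent of the notebook's `column_counts` helper, and the dates are hypothetical.

```python
import pandas as pd

# Toy release dates (hypothetical values).
release_date = pd.to_datetime(
    pd.Series(["2015-06-12", "2015-06-19", "2015-06-15", "2015-06-17"])
)

order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# Count releases per weekday name, then put the days in calendar order.
per_day = release_date.dt.day_name().value_counts().reindex(order, fill_value=0)
print(per_day)
```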


## <div style="text-align: left; background-color: #ffd700; font-family: Trebuchet MS; padding: 15px; line-height: 1; border-radius: 1px; margin-bottom: 0em; text-align: center; font-size: 22px;color:#333;">Research Question 21 | What percentage of movies are released in each decade?</div>

In [None]:
df['decade'] = df['release_year'] // 10 * 10
decade_counts = df["decade"].value_counts()
plt.pie(decade_counts, labels=decade_counts.index.astype(str), autopct='%1.1f%%')
plt.show()

<div style="position: relative; background-color: #EAEAEA; font-size: 20px; font-family: Georgia; border: 3px solid #FF5733; padding: 10px; margin: 10px 0; color: #333;">
    <div style="position: absolute; top: -20px; left: 40%; transform: translateX(-50%); background-color: #FF5733; padding: 5px 15px; font-weight: bold; font-size: 24px; color: white;">Insights</div>
    <ul style="list-style-type: disc; padding-left: 20px;">
        <li style="margin-bottom: 10px;">About two thirds of the movies were released between 2000 and 2015</li>
        <li style="margin-bottom: 10px;">The 1960s have the fewest movies</li>
        <li style="margin-bottom: 10px;">The number of movies increases from decade to decade</li>
        <li style="margin-bottom: 10px;">But the 2010s cover only the years up to 2015, not a whole decade</li>
    </ul>
</div>
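
Because one decade is only partly covered, dividing each decade's count by the number of release years actually present makes the decades easier to compare. This is a sketch on made-up years; the values are hypothetical, and "years covered" is taken as the number of distinct release years seen in the data, which is an assumption about how to normalize.

```python
import pandas as pd

# Toy release years (hypothetical values).
release_year = pd.Series([1995, 1995, 1999, 2001, 2001, 2001, 2003,
                          2010, 2010, 2010, 2010, 2012])

decade = release_year // 10 * 10

# Movies per decade, and how many distinct years each decade contributes.
movies = decade.value_counts().sort_index()
years_covered = release_year.groupby(decade).nunique()

# Average number of movies per covered year within each decade.
per_year = movies / years_covered
print(per_year)
```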


## <div style="text-align: left; background-color: #ffd700; font-family: Trebuchet MS; padding: 15px; line-height: 1; border-radius: 1px; margin-bottom: 0em; text-align: center; font-size: 22px;color:#333;">Research Question 22 | What is the total profit by decade?</div>

In [None]:
groupby_plots(df=df,by="decade",column="profit_adj",operation="sum")

<div style="position: relative; background-color: #EAEAEA; font-size: 20px; font-family: Georgia; border: 3px solid #FF5733; padding: 10px; margin: 10px 0; color: #333;">
    <div style="position: absolute; top: -20px; left: 40%; transform: translateX(-50%); background-color: #FF5733; padding: 5px 15px; font-weight: bold; font-size: 24px; color: white;">Insights</div>
    <ul style="list-style-type: disc; padding-left: 20px;">
        <li style="margin-bottom: 10px;">The 2000s have the highest total profit</li>
        <li style="margin-bottom: 10px;">The 1960s have the lowest total profit</li>
        <li style="margin-bottom: 10px;">This graph is affected by the number of movies in each decade</li>
    </ul>
</div>
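
Since the totals depend on how many movies each decade has, the mean profit per movie is a useful companion figure. A minimal sketch on made-up numbers is shown here; the profit values are hypothetical and only the column names `decade` and `profit_adj` come from this notebook.

```python
import pandas as pd

# Toy data (hypothetical values).
toy = pd.DataFrame({
    "decade": [1970, 1970, 2000, 2000, 2000, 2000],
    "profit_adj": [300.0, 100.0, 120.0, 80.0, 150.0, 90.0],
})

# Total profit, number of movies, and mean profit per movie for each decade.
by_decade = toy.groupby("decade")["profit_adj"].agg(total="sum", movies="count", mean="mean")
print(by_decade)
```

In this toy table the 2000s have the larger total only because they have more movies; the 1970s have the higher mean.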


## <div style="text-align: left; background-color: #ffd700; font-family: Trebuchet MS; padding: 15px; line-height: 1; border-radius: 1px; margin-bottom: 0em; text-align: center; font-size: 22px;color:#333;">Research Question 23 | Is there a correlation between revenue and budget?</div>

In [None]:
def plot_and_correlate(df: pd.DataFrame, x_col: str, y_col: str) -> float:
    """
    Plot a scatter plot with a regression line and calculate the correlation coefficient.

    Parameters:
    - df (DataFrame): The DataFrame containing the data.
    - x_col (str): The column for the x-axis.
    - y_col (str): The column for the y-axis.

    Returns:
    - correlation (float): The Pearson correlation coefficient between the two columns.
    """
    correlation = df[x_col].corr(df[y_col])
    print(f"Pearson correlation coefficient between {x_col} and {y_col}: {correlation:.2f}")

    fig, ax = plt.subplots(figsize=(8, 6))
    sns.regplot(x=df[x_col], y=df[y_col], line_kws={"color": "r"}, ax=ax)
    ax.set_xlabel(x_col.capitalize())
    ax.set_ylabel(y_col.capitalize())
    ax.set_title(f"Relationship between {x_col.capitalize()} and {y_col.capitalize()}")

    plt.show()
    return correlation

plot_and_correlate(df=df,x_col="budget_adj",y_col="revenue_adj")


<div style="position: relative; background-color: #EAEAEA; font-size: 20px; font-family: Georgia; border: 3px solid #FF5733; padding: 10px; margin: 10px 0; color: #333;">
    <div style="position: absolute; top: -20px; left: 40%; transform: translateX(-50%); background-color: #FF5733; padding: 5px 15px; font-weight: bold; font-size: 24px; color: white;">Insights</div>
    <ul style="list-style-type: disc; padding-left: 20px;">
        <li style="margin-bottom: 10px;">There is a relationship between budget and revenue</li>
        <li style="margin-bottom: 10px;">But because the data contains many zeros, the graph is not ideal</li>
        <li style="margin-bottom: 10px;">As the budget increases, the revenue also tends to increase</li>
    </ul>
</div>
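
One way to see how much the zeros influence the result is to compute the correlation twice: once on all rows and once only on rows where both values are reported. The sketch below uses made-up numbers and treats 0 as "not reported", which is an assumption about this dataset; it is not what the cell above does.

```python
import pandas as pd

# Toy data (hypothetical values); zeros stand in for unreported figures.
toy = pd.DataFrame({
    "budget_adj":  [0, 0, 10, 20, 30, 40, 0],
    "revenue_adj": [0, 50, 25, 45, 70, 85, 0],
})

# Pearson correlation on every row, zeros included.
r_all = toy["budget_adj"].corr(toy["revenue_adj"])

# Keep only rows where both budget and revenue are positive.
reported = toy[(toy["budget_adj"] > 0) & (toy["revenue_adj"] > 0)]
r_reported = reported["budget_adj"].corr(reported["revenue_adj"])

print(f"all rows: {r_all:.2f}, reported only: {r_reported:.2f}")
```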


## <div style="text-align: left; background-color: #ffd700; font-family: Trebuchet MS; padding: 15px; line-height: 1; border-radius: 1px; margin-bottom: 0em; text-align: center; font-size: 22px;color:#333;">Research Question 24 | Is there a correlation between popularity and revenue?</div>

In [None]:
plot_and_correlate(df=df,x_col="popularity",y_col="revenue_adj")

<div style="position: relative; background-color: #EAEAEA; font-size: 20px; font-family: Georgia; border: 3px solid #FF5733; padding: 10px; margin: 10px 0; color: #333;">
    <div style="position: absolute; top: -20px; left: 40%; transform: translateX(-50%); background-color: #FF5733; padding: 5px 15px; font-weight: bold; font-size: 24px; color: white;">Insights</div>
    <ul style="list-style-type: disc; padding-left: 20px;">
        <li style="margin-bottom: 10px;">There is a relationship between popularity and revenue</li>
        <li style="margin-bottom: 10px;">As the popularity increases, the revenue also tends to increase</li>
    </ul>
</div>


## <div style="text-align: left; background-color: #ffd700; font-family: Trebuchet MS; padding: 15px; line-height: 1; border-radius: 1px; margin-bottom: 0em; text-align: center; font-size: 22px;color:#333;">Research Question 25 | Is there a correlation between popularity and runtime?</div>

In [None]:
plot_and_correlate(df=df,x_col="popularity",y_col="runtime")

<div style="position: relative; background-color: #EAEAEA; font-size: 20px; font-family: Georgia; border: 3px solid #FF5733; padding: 10px; margin: 10px 0; color: #333;">
    <div style="position: absolute; top: -20px; left: 40%; transform: translateX(-50%); background-color: #FF5733; padding: 5px 15px; font-weight: bold; font-size: 24px; color: white;">Insights</div>
    <ul style="list-style-type: disc; padding-left: 20px;">
        <li style="margin-bottom: 10px;">There is no clear relationship between popularity and runtime</li>
    </ul>
</div>


## <div style="text-align: left; background-color: #BB4ED8 ; font-family: Trebuchet MS; color: white; padding: 15px; line-height:1;border-radius:1px; margin-bottom: 0em; text-align: center; font-size: 25px; border-radius: 8px 8px 0 0;  ">6| Communicate Results</div>

<a id='conclusions'></a>
<div style="border-radius: 10px; border: 2px solid #FFD700; padding: 20px; background-color: #E8F6EF; box-shadow: 0px 6px 12px rgba(0, 0, 0, 0.1); text-align: left;">
    <h2 style="color: #17A05D; text-shadow: 2px 2px 4px rgba(0, 0, 0, 0.3); font-size: 24px; font-weight: bold; margin-bottom: 10px;">
        Conclusions</h2>
        <ol style="font-size: 18px; color: #333;">
            <li>Drama has the largest number of movies</li>
            <li>Adventure has the highest average popularity</li>
            <li>Adventure makes the largest profit</li>
            <li>Warner Bros. makes the largest total profit</li>
            <li>Woody Allen has the highest number of movies (42)</li>
            <li>Christopher Nolan has the highest total popularity</li>
            <li>2013 has the highest number of movies (635)</li>
            <li>1969 has the lowest number of movies (29)</li>
            <li>Star Wars has the highest profit of all the movies (2.7 billion)</li>
            <li>Most of the top popular movies were released in 2014 and 2015</li>
            <li>Star Wars is the only older movie among the most popular</li>
            <li>Inception has the highest vote count</li>
            <li>The movie with the highest score is The Shawshank Redemption</li>
            <li>Interstellar is the newest movie with a high score</li>
            <li>Most movie runtimes fall between 90 and 150 minutes</li>
            <li>September has the highest number of movies</li>
            <li>February has the lowest number of movies</li>
            <li>Friday has the highest number of movies</li>
            <li>About two thirds of the movies were released between 2000 and 2015</li>
            <li>The 2000s have the highest total profit</li>
            <li>In most cases, as the budget increases, the revenue also increases</li>
        </ol>
</div>

<div style="background:#fff7f7;padding:10px;border-radius:6px;border:2px blue solid;margin:10px;">
  <h2>Limitations</h2>
    <p>Nearly 50% of the rows in this dataset have zeros in revenue or budget. Dropping them would lose a lot of other data, so the profit graphs are not fully accurate. The dataset also has missing data and incorrect datatypes.
    </p>
</div>
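
The share of zeros can be measured directly before deciding whether to drop or keep those rows. This is a sketch on made-up values; the column names `budget_adj` and `revenue_adj` are taken from the notebook, and the numbers are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical values).
toy = pd.DataFrame({
    "budget_adj":  [0, 0, 10, 20],
    "revenue_adj": [0, 50, 0, 45],
})

# Fraction of zeros in each column (the mean of a boolean mask is a proportion).
zero_share = (toy[["budget_adj", "revenue_adj"]] == 0).mean()

# Fraction of rows where at least one of the two values is zero.
either_zero = ((toy["budget_adj"] == 0) | (toy["revenue_adj"] == 0)).mean()

print(zero_share)
print(either_zero)
```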

<div style="background-color:white;font-size:15px;font-family:Georgia;border-style: solid;border-color: #ffd700;border-width:3px;padding:10px;margin: 1px;color:#254E58;overflow:hidden"> 
<div style="position: absolute; top: -0px; left: 50%; transform: translateX(-50%); background-color: #ffd700; padding: 5px 15px; font-weight: bold; font-size: 24px; color: #333;">Some information:</div>

<h4><b style = 'color: red;'>Author :</b> Obay Mohammed </h4>


<b>Read more projects :</b> https://www.kaggle.com/obayprogrammer <br>
<b>Email me :</b>  obay.zamir@gmail.com<br>
<b>Connect on LinkedIn :</b> https://www.linkedin.com/in/obaym <br>
    
<hr>
    
<center> <strong> If you liked this notebook, please upvote. </strong></center>
<center> <strong> Thanks to <a href="https://www.kaggle.com/zabihullah18">zabihullah18</a>, whose Markdown style I used. </strong></center>

<img src="https://raw.githubusercontent.com/ntclai/PictureForMyProject/main/87481-of-thanks-letter-text-logo-calligraphy-drawing%20(1).png">
</div>