# Data Analysis Project: IMDb Movie Data
- This notebook is designed as a summative project for you to demonstrate your data analysis skills using the IMDb movie dataset.
- Complete the tasks provided below, following the instructions for each part. Good luck!

## Task Instructions and Questions

### Data Overview
1. Load the dataset and display the first five rows.
2. Describe the dataset showing the basic statistics of numerical columns.
3. How many unique directors are represented in the dataset?
4. What is the average movie duration?
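
For question 3, it can help to check whether every director in a lookup sheet actually appears in the movie table. The sketch below uses small invented frames rather than the real data, and it assumes the lookup sheet keys directors by an `id` column that matches `director_id`; that column name is a guess and may differ in the workbook.

```python
import pandas as pd

# Toy stand-ins for the two sheets; all values are invented for illustration.
movies = pd.DataFrame({
    "movie_title": ["A", "B", "C", "D"],
    "director_id": [1, 1, 2, 2],
    "duration": [100, 120, 90, 110],
})
# Assumption: the lookup sheet identifies directors by an `id` column.
directors = pd.DataFrame({
    "id": [1, 2, 3],
    "director_name": ["Ann", "Bob", "Cy"],
})

# Count only the directors that have at least one movie in the table.
n_represented = movies["director_id"].nunique()

# Attach names so later steps can report them instead of raw ids.
movies_named = movies.merge(
    directors, left_on="director_id", right_on="id", how="left"
)

avg_duration = movies["duration"].mean()

print("Directors represented:", n_represented)
print("Average duration:", avg_duration)
```

In this toy example the lookup sheet lists three directors but only two have movies, so counting names in the lookup sheet and counting ids in the movie table can give different answers.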

### Data Cleaning
5. Are there any missing values in the dataset? If yes, fill them with appropriate values.
6. Are there any duplicate rows? If yes, remove them.
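
One possible approach to questions 5 and 6 is to pick the fill value by column type, so that no column is skipped. This is a minimal sketch on invented rows, not the project data; the median and the `"Unknown"` placeholder are illustrative choices, not the only appropriate ones.

```python
import numpy as np
import pandas as pd

# Invented rows: one missing duration, missing names, and one exact duplicate.
df = pd.DataFrame({
    "movie_title": ["A", "B", "B", "C"],
    "director_name": ["Ann", None, None, "Cy"],
    "duration": [100.0, 120.0, 120.0, np.nan],
})

missing_before = df.isnull().sum()

# Fill numeric columns with their median and all other columns with a placeholder.
numeric_cols = df.select_dtypes(include="number").columns
text_cols = df.select_dtypes(exclude="number").columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())
df[text_cols] = df[text_cols].fillna("Unknown")

# Count exact duplicate rows, then drop them and renumber the index.
n_duplicates = int(df.duplicated().sum())
df = df.drop_duplicates().reset_index(drop=True)

print(missing_before)
print("Duplicates removed:", n_duplicates, "| new shape:", df.shape)
```

Checking `df.isnull().sum()` again after filling confirms that nothing was left behind.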

### Exploratory Data Analysis
7. Calculate the average IMDb score.
8. Find the top 5 movies with the highest gross earnings.
9. Which movie has the longest duration and what is its IMDb score?
10. Create a histogram of IMDb scores.

### Visualization Tasks
11. Plot the distribution of movie durations.
12. Create a scatter plot to analyze the relationship between 'duration' and 'imdb_score'.
13. Group the data by 'title_year' and plot the average 'imdb_score' for each year.
14. Create a bar chart showing the top 5 countries with the most movies produced.
15. Visualize the correlation matrix of the numerical features in the dataset.

### Further Analysis
16. Which director has the highest average IMDb score and how many movies have they directed?
17. What is the trend of movie durations over the years?
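
Questions 16 and 17 both come down to a `groupby`. The sketch below uses invented rows with the column names shown in the dataset (`director_id`, `imdb_score`, `title_year`, `duration`); the result names `avg_score` and `n_movies` are hypothetical labels chosen for this example.

```python
import pandas as pd

# Invented rows for illustration only.
movies = pd.DataFrame({
    "director_id": [1, 1, 2, 3, 3, 3],
    "imdb_score": [8.0, 9.0, 9.0, 7.0, 8.0, 9.0],
    "title_year": [1990, 1995, 2000, 2000, 2005, 2005],
    "duration": [100, 120, 150, 90, 110, 130],
})

# Average score and movie count per director, best average first.
per_director = (
    movies.groupby("director_id")
    .agg(avg_score=("imdb_score", "mean"), n_movies=("imdb_score", "count"))
    .sort_values("avg_score", ascending=False)
)
top_director = per_director.index[0]

# Mean duration per release year, ready to plot as a line chart.
duration_by_year = movies.groupby("title_year")["duration"].mean()

print(per_director)
print(duration_by_year)
```

Here the top-ranked director has only one movie, which is why reporting the count next to the average matters when interpreting the ranking.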

### Conclusions
18. Summarize your findings from the above tasks.
19. Reflect on what the data tells us about the trends in movie ratings and durations over the years.

### Dashboard Creation
20. Create a dashboard that displays three distinct visualizations. Your dashboard should allow users to interact with the visualizations using controllers.

> Requirements:

- Develop a dashboard. Use ipywidgets or a similar library to build a dashboard that integrates the plots you've created.
- Include three different visualizations. Each should represent a distinct aspect of your data.
- Add interactive controllers. Implement at least three controllers (e.g., sliders, dropdowns, buttons) that let users manipulate the visualizations dynamically.
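
The requirements above could be met in many ways. What follows is a minimal sketch, assuming ipywidgets and matplotlib are installed and the code runs in a Jupyter notebook. The helper names `filter_movies` and `build_dashboard`, the `ALL` sentinel, and the sample rows are hypothetical, introduced only for this example.

```python
import pandas as pd

ALL = "All"  # hypothetical sentinel meaning "do not filter by country"


def filter_movies(df, year_range, min_score, country=ALL):
    """Return the rows inside the year range, at or above min_score, for one country."""
    lo, hi = year_range
    mask = df["title_year"].between(lo, hi) & (df["imdb_score"] >= min_score)
    if country != ALL:
        mask &= df["country_id"] == country
    return df[mask]


def build_dashboard(df):
    """Show three linked plots driven by three controllers (notebook only)."""
    # Imported here so filter_movies stays usable outside a notebook.
    import ipywidgets as widgets
    import matplotlib.pyplot as plt
    from IPython.display import display

    first, last = int(df["title_year"].min()), int(df["title_year"].max())
    years = widgets.IntRangeSlider(
        value=[first, last], min=first, max=last, description="Years"
    )
    score = widgets.FloatSlider(
        value=0.0, min=0.0, max=10.0, step=0.1, description="Min score"
    )
    country = widgets.Dropdown(
        options=[ALL] + sorted(df["country_id"].unique().tolist()),
        description="Country",
    )

    def draw(year_range, min_score, country):
        subset = filter_movies(df, year_range, min_score, country)
        fig, axes = plt.subplots(1, 3, figsize=(15, 4))
        axes[0].hist(subset["imdb_score"], bins=10)
        axes[0].set_title("IMDb score distribution")
        axes[1].scatter(subset["duration"], subset["imdb_score"])
        axes[1].set_title("Duration vs. IMDb score")
        by_year = subset.groupby("title_year")["imdb_score"].mean()
        axes[2].plot(by_year.index, by_year.values)
        axes[2].set_title("Average IMDb score by year")
        plt.show()

    out = widgets.interactive_output(
        draw, {"year_range": years, "min_score": score, "country": country}
    )
    display(widgets.VBox([widgets.HBox([years, score, country]), out]))


# Small sample for trying the filter; the last row is invented.
movies = pd.DataFrame({
    "title_year": [1972, 1994, 1999, 2008],
    "imdb_score": [9.2, 9.3, 8.5, 9.0],
    "duration": [175, 142, 189, 152],
    "country_id": [1, 1, 1, 2],
})
recent_high = filter_movies(movies, (1990, 2010), 9.0)
print(recent_high)
```

In a notebook cell, calling `build_dashboard(movies)` would display the three controllers above the plots. Keeping the filtering in a plain function makes it easy to check the logic without any widgets.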

### Use of AI

> Avoid using AI this time so that you demonstrate your full understanding of the data analysis concepts you learned in the previous notebooks.
    >> You may use any code you already have in your project files.


In [10]:
# Write all your code below

### Task 1: Data Overview

import pandas as pd
import numpy as np

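# Open the workbook once and read the sheets we need
xls = pd.ExcelFile('imdb.xlsx')
df2 = xls.parse('directors')  # director lookup sheet

df = xls.parse(xls.sheet_names[0])  # first sheet holds the movie records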

print("First five rows of the dataset:")
print(df.head())

print("\nBasic statistics of numerical columns:")
print(df.describe())

print("\nUnique Directors")
print(df2['director_name'].nunique())

print("\nAverage Movie Duration")
print(df['duration'].mean())


### Task 2: Data Cleaning

# Reload the first sheet for cleaning (pandas is already imported above)
df = pd.read_excel("/Users/bdawg/PycharmProjects/PythonProject/PythonProject2/Fundamental2/Project_IMDb/imdb.xlsx")

# 5. Check for missing values
missing_values = df.isnull().sum()
print("Missing values per column:\n", missing_values)

# Fill missing values appropriately
df.fillna({
    'duration': df['duration'].median(),  # Fill missing durations with median
    'imdb_score': df['imdb_score'].mean(),  # Fill missing IMDb scores with mean
    'director_name': 'Unknown',  # Replace missing director names with 'Unknown'
}, inplace=True)

print("\nMissing values after filling:\n", df.isnull().sum())

# 6. Check for duplicate rows
duplicates = df.duplicated().sum()
print(f"\nNumber of duplicate rows: {duplicates}")

# Remove duplicate rows
df.drop_duplicates(inplace=True)

print("\nDuplicate rows removed. New dataset shape:", df.shape)

First five rows of the dataset:
                movie_title  director_id  country_id content_rating  \
0  The Shawshank Redemption           34           1              R   
1            The Green Mile           34           1              R   
2             The Godfather           33           1              R   
3    The Godfather: Part II           33           1              R   
4            Apocalypse Now           33           1              R   

   title_year  imdb_score      gross  duration  
0        1994         9.3   28341469       142  
1        1999         8.5  136801374       189  
2        1972         9.2  134821952       175  
3        1974         9.0   57300000       220  
4        1979         8.5   78800000       289  

Basic statistics of numerical columns:
       director_id  country_id   title_year  imdb_score         gross  \
count   178.000000  178.000000   178.000000  178.000000  1.780000e+02   
mean     60.460674    3.275281  1996.292135    8.294382  1.03