# Assignment 01: Exploratory Data Analysis & Data Preprocessing
### Seasons of Code - AnimAI
---
## Introduction

Welcome to your first assignment for the AnimAI project! In this assignment, you'll work with a dataset containing popularity statistics for various cartoon characters across different countries. This dataset has intentionally been made "messy" with outliers, missing values, and inconsistencies to simulate real-world data challenges.

The skills you learn in this assignment will form the foundation for more advanced machine learning and AI applications in later weeks of the project.

>**Objective**: To perform exploratory data analysis and preprocessing on a cartoon character popularity dataset, applying fundamental concepts of data cleaning, visualization, and statistical analysis.


## Dataset Overview

The dataset `cartoon_popularity_data.csv` contains information about cartoon character popularity across various countries with the following columns:

- `Character`: Name of the cartoon character
- `Country`: Country where the popularity was measured
- `Popularity_Score`: A rating from 0-100 indicating popularity (though some entries may fall outside this range)
- `Avg_Episodes_Watched_Per_Year`: Average number of episodes watched per viewer per year
- `Merchandise_Revenue_MillionUSD`: Revenue generated from character merchandise in millions of USD

First, let's import the necessary libraries and load our dataset:

In [2]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Set figure size for better readability
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['axes.grid'] = True

# Load the dataset
file_path = 'cartoon_popularity_data.csv' # Fill here with the path to your dataset

# Check if the file exists
try:
    with open(file_path, 'r') as f:
        pass
except FileNotFoundError:
    print(f"File {file_path} not found. Please check the path and try again.")
    raise  # Re-raise to stop the cell; calling exit() would stop the notebook kernel

# Read the dataset
df = pd.read_csv(file_path)

# Display the first few rows
print(f"Dataset shape: {df.shape}")
df.head(10)

Dataset shape: (10000, 5)


| | Character | Country | Popularity_Score | Avg_Episodes_Watched_Per_Year | Merchandise_Revenue_MillionUSD |
|---|---|---|---|---|---|
| 0 | Shinchan | Canada | 74.06677446676758 | 80.0 | 3.949924724368964 |
| 1 | Paw Patrol | France | 80.58192518328079 | 48.0 | 22.31606244865129 |
| 2 | SpongeBob SquarePants | Russia | 37.85343772083535 | 127.5 | 258.92694939230086 |
| 3 | Motu Patlu | UK | 83.41104266407503 | 50.0 | 27.430804382862224 |
| 4 | Mr Bean | Egypt | 76.83135775923722 | 9.0 | 40.715313915210565 |
| 5 | Mr Bean | Russia | 74.49889820916066 | 100.0 | 89.88446547664522 |
| 6 | Shinchan | China | 42.215996679968406 | 53.0 | 47.33693889068305 |
| 7 | Shinchan | France | 29.350012864742613 | 5.0 | 87.70944109744121 |
| 8 | Motu Patlu | Australia | 53.937903011962575 | 72.0 | 60.29551386156268 |
| 9 | Motu Patlu | Saudi Arabia | 48.59904633166138 | 73.0 | 13.710754985476516 |


## Part 1: Exploratory Data Analysis (EDA)

### Task 1.1: Basic Data Exploration

- Display the shape of the dataset
- Check the data types of each column
- Generate basic statistics using `describe()`
- Check for missing values in each column
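As a starting point, the four checks above can be sketched on a small stand-in frame. The `toy` DataFrame and its values are hypothetical and only mimic the column names of the real dataset; in your notebook you would run the same calls on `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset (values are made up)
toy = pd.DataFrame({
    "Character": ["Shinchan", "Mr Bean", None, "Paw Patrol"],
    "Country": ["Canada", "Egypt", "France", "UK"],
    "Popularity_Score": [74.1, np.nan, 29.4, 80.6],
})

print(toy.shape)       # (number of rows, number of columns)
print(toy.dtypes)      # data type of each column
print(toy.describe())  # summary statistics for the numeric columns

# Count missing values per column
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

Note that `describe()` only summarizes numeric columns by default; passing `include="all"` also covers the text columns.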

In [None]:
# TODO: Your code here

### Task 1.2: Data Visualization

Create appropriate visualizations to explore the dataset:

1. Distribution of popularity scores (histogram)
2. Average episodes watched by character (bar chart)
3. Merchandise revenue by country (box plot)
4. Correlation heatmap between numerical variables
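One possible layout for the four plots, using only Matplotlib, is sketched below. The data is randomly generated (the `toy` frame and its distributions are assumptions, not the real dataset), so treat it as a template for the plotting calls rather than as expected results.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical random data that only mimics the dataset's column names
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "Character": rng.choice(["Shinchan", "Mr Bean", "Paw Patrol"], size=n),
    "Country": rng.choice(["Canada", "France", "UK"], size=n),
    "Popularity_Score": rng.uniform(0, 100, size=n),
    "Avg_Episodes_Watched_Per_Year": rng.integers(1, 150, size=n).astype(float),
    "Merchandise_Revenue_MillionUSD": rng.gamma(2.0, 20.0, size=n),
})

fig, axes = plt.subplots(2, 2, figsize=(12, 9))

# 1. Histogram of popularity scores
axes[0, 0].hist(toy["Popularity_Score"], bins=20)
axes[0, 0].set_title("Popularity score distribution")

# 2. Bar chart of mean episodes watched per character
mean_eps = toy.groupby("Character")["Avg_Episodes_Watched_Per_Year"].mean()
axes[0, 1].bar(list(mean_eps.index), mean_eps.to_numpy())
axes[0, 1].set_title("Mean episodes watched by character")

# 3. Box plot of merchandise revenue, one box per country
countries = sorted(toy["Country"].unique())
groups = [
    toy.loc[toy["Country"] == c, "Merchandise_Revenue_MillionUSD"].to_numpy()
    for c in countries
]
axes[1, 0].boxplot(groups)
axes[1, 0].set_xticks(range(1, len(countries) + 1))
axes[1, 0].set_xticklabels(countries)
axes[1, 0].set_title("Merchandise revenue by country")

# 4. Correlation heatmap of the numeric columns
corr = toy.select_dtypes(include=[np.number]).corr()
im = axes[1, 1].imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
axes[1, 1].set_xticks(range(len(corr.columns)))
axes[1, 1].set_xticklabels(corr.columns, rotation=45, ha="right")
axes[1, 1].set_yticks(range(len(corr.columns)))
axes[1, 1].set_yticklabels(corr.columns)
axes[1, 1].set_title("Correlation heatmap")
fig.colorbar(im, ax=axes[1, 1])

fig.tight_layout()
plt.show()
```

On the real data, missing values and outliers will distort some of these plots, which is part of what Task 1.3 asks you to notice.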

In [None]:
# TODO: Your code here

### Task 1.3: Identifying Data Issues

Based on your exploration:
- List all data quality issues you've found
- Categorize them (missing values, outliers, inconsistent formats, etc.)
- Explain how each issue might affect your analysis

In [None]:
# TODO: Your code here

## Part 2: Data Cleaning and Preprocessing

### Task 2.1: Handling Missing Values

Implement strategies to handle missing values in the dataset:

- For categorical columns: Replace with mode or a placeholder
- For numerical columns: Replace with mean, median, or a calculated value
- Document your approach and justify your choices
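A minimal sketch of mode and median imputation follows, assuming a hypothetical `toy` frame with one categorical and one numerical column. Whether mode and median are the right choices for the real dataset is something you should justify yourself.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value per column
toy = pd.DataFrame({
    "Country": ["Canada", "Canada", None, "UK"],
    "Popularity_Score": [70.0, np.nan, 30.0, 80.0],
})

filled = toy.copy()

# Categorical column: fill with the most frequent value (the mode)
country_mode = filled["Country"].mode().iloc[0]
filled["Country"] = filled["Country"].fillna(country_mode)

# Numerical column: fill with the median, which is less sensitive to outliers than the mean
score_median = filled["Popularity_Score"].median()
filled["Popularity_Score"] = filled["Popularity_Score"].fillna(score_median)

print(filled)
```

Working on a copy keeps the original frame available, so you can compare the distributions before and after imputation.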

In [None]:
# TODO: Your code here

### Task 2.2: Handling Inconsistent Data

Fix inconsistencies in the dataset:

- Standardize character names (capitalization, extra spaces)
- Correct country name spellings
- Convert any string values in numerical columns to appropriate numeric types
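The three fixes above might look like the following sketch. The messy values and the `country_fixes` mapping are invented for illustration; the real dataset may contain different misspellings, which you would discover with something like `df["Country"].value_counts()`.

```python
import pandas as pd

# Hypothetical messy values; the real dataset's inconsistencies may differ
toy = pd.DataFrame({
    "Character": ["  shinchan", "MR BEAN ", "Paw  Patrol"],
    "Country": ["Cananda", "UK", "Frnace"],
    "Popularity_Score": ["74.1", "eighty", "29.4"],
})

clean = toy.copy()

# Trim whitespace, collapse repeated spaces, and normalize capitalization
clean["Character"] = (
    clean["Character"]
    .str.strip()
    .str.replace(r"\s+", " ", regex=True)
    .str.title()
)

# Map known misspellings to the correct spelling (hypothetical mapping)
country_fixes = {"Cananda": "Canada", "Frnace": "France"}
clean["Country"] = clean["Country"].replace(country_fixes)

# Convert strings to numbers; values that cannot be parsed become NaN
clean["Popularity_Score"] = pd.to_numeric(clean["Popularity_Score"], errors="coerce")

print(clean)
```

One caveat: `str.title()` would turn a name like `SpongeBob SquarePants` into `Spongebob Squarepants`, so for mixed-case names an explicit mapping to the canonical spelling may be safer.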

In [None]:
# TODO: Your code here

### Task 2.3: Outlier Detection and Handling

Detect and handle outliers in numerical columns:

- Use visualization methods (box plots) to identify outliers
- Use statistical methods (Z-score or IQR) to confirm outliers
- Implement an appropriate strategy (capping, removing, or transforming)
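As an illustration of the IQR method with capping, here is a short sketch on a hypothetical series of scores. The 1.5 × IQR fence is the conventional choice, not a requirement, and capping is only one of the strategies listed above.

```python
import pandas as pd

# Hypothetical scores with one obviously extreme value
scores = pd.Series(
    [40.0, 45.0, 50.0, 55.0, 60.0, 65.0, 70.0, 500.0],
    name="Popularity_Score",
)

# Interquartile range and the usual 1.5 * IQR fences
q1 = scores.quantile(0.25)
q3 = scores.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Flag values outside the fences
is_outlier = (scores < lower) | (scores > upper)
print(scores[is_outlier])

# Capping: pull outliers back to the nearest fence instead of dropping rows
capped = scores.clip(lower=lower, upper=upper)
print(capped)
```

For a column with a known valid range, such as `Popularity_Score` (0 to 100), domain limits can be a more defensible cap than statistical fences.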

In [None]:
# TODO: Your code here

### Task 2.4: Data Transformation

Apply appropriate transformations to prepare the data for analysis:

- Standardize or normalize numerical features if needed
- Create any useful derived features
- Encode categorical variables if necessary
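The sketch below shows one example of each transformation using pandas only. The `toy` frame is hypothetical, and the derived feature `Revenue_Per_Episode` is an invented example, not a feature the assignment requires.

```python
import pandas as pd

# Hypothetical toy frame mimicking the dataset's column names
toy = pd.DataFrame({
    "Country": ["Canada", "UK", "Canada", "France"],
    "Avg_Episodes_Watched_Per_Year": [80.0, 50.0, 100.0, 10.0],
    "Merchandise_Revenue_MillionUSD": [4.0, 25.0, 90.0, 85.0],
})

out = toy.copy()
col = "Merchandise_Revenue_MillionUSD"

# Standardize: subtract the mean and divide by the (population) standard deviation
out["Revenue_z"] = (out[col] - out[col].mean()) / out[col].std(ddof=0)

# Derived feature (invented for illustration): revenue per episode watched
out["Revenue_Per_Episode"] = out[col] / out["Avg_Episodes_Watched_Per_Year"]

# One-hot encode the categorical column
out = pd.get_dummies(out, columns=["Country"], prefix="Country")

print(out)
```

Using `ddof=0` matches the convention of scikit-learn's `StandardScaler`; the pandas default (`ddof=1`) gives slightly different values on small samples.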

In [None]:
# TODO: Your code here

## Part 3: Advanced Analysis (Optional)

### Task 3.1: Multicollinearity Analysis using VIF

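The Variance Inflation Factor (VIF) measures how strongly each feature can be linearly predicted from the other features, which helps you identify correlated features in your dataset: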

> You'll need to install statsmodels first: `pip install statsmodels`

In [None]:
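from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Keep only numeric columns; VIF cannot handle missing values, so drop those rows
numeric_df = df.select_dtypes(include=[np.number]).dropna()

# variance_inflation_factor does not add an intercept itself, so add one explicitly
X = add_constant(numeric_df)

# Calculate VIF for each feature (column 0 is the constant, so skip it)
vif_data = pd.DataFrame()
vif_data["feature"] = numeric_df.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(1, X.shape[1])]

print(vif_data)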

> **Note**: As a rule of thumb, a VIF above 5 suggests high multicollinearity (some references use 10 as the cutoff). [Learn more about VIF here](https://www.geeksforgeeks.org/detecting-multicollinearity-with-vif-python/)
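To see where the number comes from, the VIF of feature j is 1 / (1 − R²), where R² is obtained by regressing feature j on all the other features. The sketch below computes this with NumPy on synthetic data; the helper name `vif_by_hand` and the generated features are hypothetical and exist only for illustration.

```python
import numpy as np

def vif_by_hand(X, j):
    """VIF of column j: regress it on the other columns (plus an intercept)."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])  # intercept + other features
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1.0 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)

# Synthetic features: c is almost a copy of a, b is independent of both
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)
X = np.column_stack([a, b, c])

vifs = [vif_by_hand(X, j) for j in range(X.shape[1])]
print(vifs)
```

Here `a` and `c` should receive large VIFs because each is almost fully explained by the other, while `b` should stay close to 1.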

### Task 3.2: Dimensionality Reduction with PCA (Optional)

If you'd like to explore dimensionality reduction:
> **Note**: [Learn more about PCA here](https://www.kaggle.com/code/vipulgandhi/pca-beginner-friendly-detailed-explanation)

In [None]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize the data (PCA cannot handle missing values, so drop those rows first)
scaler = StandardScaler()
scaled_data = scaler.fit_transform(numeric_df.dropna())

# Apply PCA
pca = PCA()
pca_data = pca.fit_transform(scaled_data)

# Plot the cumulative explained variance against the number of components
cumulative_variance = np.cumsum(pca.explained_variance_ratio_)
plt.figure(figsize=(10, 6))
plt.plot(range(1, len(cumulative_variance) + 1), cumulative_variance, marker='o')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('Explained Variance by Components')
plt.grid(True)
plt.show()

## Part 4: Conclusion and Documentation

### Task 4.1: Summarize Your Findings

Write a summary of:
- The initial state of the data
- All issues identified
- Methods used to address each issue
- The final state of the cleaned dataset
- Any insights gained during the process

### Task 4.2: Save Your Cleaned Dataset

Save your cleaned and preprocessed dataset:

In [None]:
# TODO: Your code here (store your final cleaned DataFrame in a variable named `df_cleaned`)


# END OF TODO
df_cleaned.to_csv('cartoon_popularity_cleaned.csv', index=False)

## Submission Guidelines

1. Submit your completed Jupyter notebook (.ipynb file)
2. Include the original and cleaned datasets
3. Make sure all code cells are executed and outputs are visible
4. Add appropriate markdown cells explaining your approach and findings
5. Ensure your notebook is well-organized and follows a logical flow



## Resources

### Pandas and Data Manipulation
- [Pandas Documentation](https://pandas.pydata.org/docs/)
- [10 Minutes to Pandas](https://pandas.pydata.org/docs/user_guide/10min.html)
- [Kaggle: Pandas Tutorial](https://www.kaggle.com/learn/pandas)

### Data Visualization
- [Matplotlib Documentation](https://matplotlib.org/stable/contents.html)
- [Matplotlib Gallery](https://matplotlib.org/stable/gallery/index.html)
- [Kaggle: Data Visualization](https://www.kaggle.com/learn/data-visualization)

### Data Cleaning and Preprocessing
- [Handling Missing Values](https://towardsdatascience.com/data-cleaning-with-python-and-pandas-detecting-missing-values-3e9c6ebcf78b)
- [Outlier Detection Methods](https://machinelearningmastery.com/how-to-use-statistics-to-identify-outliers-in-data/)
- [GeeksforGeeks: Data Preprocessing](https://www.geeksforgeeks.org/data-preprocessing-in-data-mining/)

### Advanced Topics
- [Understanding VIF](https://www.geeksforgeeks.org/detecting-multicollinearity-with-vif-python/)
- [PCA Explained](https://www.kaggle.com/code/vipulgandhi/pca-beginner-friendly-detailed-explanation)

---

**Good luck with your assignment! Remember, data preprocessing is as much an art as a science. There are often multiple valid approaches to handling data issues, so feel free to use your intuition wherever you need it.**