In [3]:
# 📚 Intro to the "category" Data Type in pandas 🏷️
# Let's explore **categorical data** and how it can optimize data analysis and visualization.

# -----------------------------
# 1️⃣ Importing Required Libraries
# -----------------------------
import pandas as pd

# -----------------------------
# 2️⃣ Understanding Categorical Data
# -----------------------------
"""
📝 **What is a Categorical Data Type?**
- **Definition**: A **categorical** data type represents a fixed number of discrete values or labels.
  Examples:
  - Gender: "Male", "Female", "Other"
  - Movie Genre: "Comedy", "Action", "Drama"
  - Income Groups: "Low", "Middle", "High"

🎯 **Why Use Categorical Data Types?**
1. **Memory Efficiency**:
   - A string column stores every value separately; a categorical column stores each unique label once, plus a small integer code per row.
   - Great for columns with repeated values (e.g., "Genre").
   
2. **Performance Boost**:
   - Filtering, grouping, and aggregations can run faster, because they operate on integer codes instead of strings.

3. **Semantic Clarity**:
   - Ordered categories preserve a meaningful order (e.g., "Low" < "Middle" < "High"); unordered categories carry no ranking.
"""

# -----------------------------
# 3️⃣ Loading Example Data: Movie Ratings 🎥
# -----------------------------
# Imagine we have a dataset of movies with various features.
movies = pd.DataFrame({
    "Film": ["Inception", "The Dark Knight", "Interstellar", "Joker", "Titanic"],
    "Genre": ["Action", "Action", "Sci-Fi", "Drama", "Romance"],
    "CriticRating": [91, 94, 93, 88, 89],
    "AudienceRating": [92, 96, 94, 87, 85],
    "BudgetMillions": [160, 185, 165, 70, 200],
    "Year": [2010, 2008, 2014, 2019, 1997]
})

# Display the dataset
print("🎥 Movie Ratings Dataset:")
print(movies)

# -----------------------------
# 4️⃣ Converting a Column to "Category" Type
# -----------------------------
# Let's optimize the "Genre" column using the category type.
movies["Genre"] = movies["Genre"].astype("category")

# Verify the change
print("\n🛠️ Data Types After Conversion:")
print(movies.dtypes)

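# Observe memory usage (the conversion has already run, so this reports the "after" state only)
print("\n📊 Memory Usage Before and After:")
movies.info()  # info() prints its report itself and returns None, so it is not wrapped in print()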
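
# What the conversion stores internally, as a small sketch on a stand-alone
# Series (demo_genre is a hypothetical name; the labels mirror the "Genre" column).
import pandas as pd  # repeated here so the snippet also runs on its own

demo_genre = pd.Series(["Action", "Action", "Sci-Fi", "Drama", "Romance"], dtype="category")

genre_labels = demo_genre.cat.categories.tolist()  # each unique label, stored once
genre_codes = demo_genre.cat.codes.tolist()        # one integer per row, pointing at a label

print(genre_labels)  # ['Action', 'Drama', 'Romance', 'Sci-Fi']
print(genre_codes)   # [0, 0, 3, 1, 2]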

# -----------------------------
# 5️⃣ Adding Ordered Categories: Income Group Example
# -----------------------------
# Ordered categories add ranking to categorical data.
income_levels = pd.Categorical(
    ["Low income", "Lower middle income", "Upper middle income", "High income"],
    categories=["Low income", "Lower middle income", "Upper middle income", "High income"],
    ordered=True
)

# Create a dataset with Income Groups
income_data = pd.DataFrame({
    "Country": ["Country A", "Country B", "Country C", "Country D"],
    "IncomeGroup": ["High income", "Low income", "Upper middle income", "Lower middle income"]
})

# Convert "IncomeGroup" to an ordered category in one step, reusing the dtype of income_levels
income_data["IncomeGroup"] = income_data["IncomeGroup"].astype(income_levels.dtype)

# Sort by income level
sorted_income_data = income_data.sort_values("IncomeGroup")

# Display the result
print("\n🔢 Ordered Income Group Dataset:")
print(sorted_income_data)
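
# A short sketch of what ordered=True enables beyond sorting: rank-aware
# comparisons and min/max. demo_levels and demo_groups are hypothetical names.
import pandas as pd  # repeated here so the snippet also runs on its own

demo_levels = ["Low income", "Lower middle income", "Upper middle income", "High income"]
demo_groups = pd.Series(
    pd.Categorical(
        ["High income", "Low income", "Upper middle income"],
        categories=demo_levels,
        ordered=True,
    )
)

# Comparison follows the declared category order, not alphabetical order
above_lower_middle = demo_groups > "Lower middle income"
print(above_lower_middle.tolist())  # [True, False, True]
print(demo_groups.max())            # High income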

# -----------------------------
# ✨ Key Takeaways ✨
# -----------------------------
"""
1️⃣ **Categorical Data Types**:
   - Ideal for repetitive text data (e.g., "Genre", "IncomeGroup").
   - Save memory and enhance computational efficiency.

2️⃣ **Ordered Categories**:
   - Useful for ranked or hierarchical data (e.g., "Low" < "Medium" < "High").
   - Enables logical sorting and comparison.

3️⃣ **Real-World Applications**:
   - Demographics analysis (e.g., income brackets).
   - Product categories in e-commerce.
   - User preferences in recommendation systems.

🛠️ **Pro Tip**:
   Consider converting columns whose unique values are few relative to the number of rows into categories. It’s a small change that can noticeably reduce memory use; columns where almost every value is unique (e.g., "Film") gain little.
"""

🎥 Movie Ratings Dataset:
              Film    Genre  CriticRating  AudienceRating  BudgetMillions  \
0        Inception   Action            91              92             160   
1  The Dark Knight   Action            94              96             185   
2     Interstellar   Sci-Fi            93              94             165   
3            Joker    Drama            88              87              70   
4          Titanic  Romance            89              85             200   

   Year  
0  2010  
1  2008  
2  2014  
3  2019  
4  1997  

🛠️ Data Types After Conversion:
Film                object
Genre             category
CriticRating         int64
AudienceRating       int64
BudgetMillions       int64
Year                 int64
dtype: object

📊 Memory Usage Before and After:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype   
---  ------          --------------  -----   
 0   Film        
