# Restaurant Menu Complexity: Feature Engineering

## Objectives
Since we don't have explicit menu item counts, we'll engineer a **menu complexity proxy** using:

1. **Category diversity** - More cuisine tags = broader menu
2. **Review text mining** - Extract mentions of menu size/variety
3. **Price range** - Higher-end restaurants often have more elaborate menus
4. **Cuisine-specific patterns** - Some cuisines typically have simpler or more complex menus
5. **Business attributes** - Group seating, delivery, catering, and table service suggest menu breadth

## Treatment Variable
We'll create a binary treatment:
- **Simple Menu** (0): At or below median complexity
- **Complex Menu** (1): Above median complexity
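
A minimal sketch of the split on made-up scores, using the same strict `>` comparison as Cell 12. The score values are hypothetical and only illustrate how ties at the median are handled.

```python
from statistics import median

# Hypothetical complexity scores; two of them tie exactly at the median.
scores = [18.0, 22.5, 24.5, 24.5, 27.0, 31.0]

median_complexity = median(scores)

# Strictly above the median -> 1 (complex); everything else -> 0 (simple).
complex_menu = [int(score > median_complexity) for score in scores]

print(median_complexity)
print(complex_menu)
```

Scores tied with the median fall into the simple group, so the two groups need not be exactly equal in size.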

---

In [None]:
# Cell 1: Imports and Setup
import pandas as pd
import numpy as np
import json
import re
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
import warnings
warnings.filterwarnings('ignore')

# Set display options
pd.set_option('display.max_columns', None)

# Plotting
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['text.usetex'] = False  # Prevent LaTeX errors

print("Imports loaded")

In [None]:
# Cell 2: Load Processed Restaurant Data
print("Loading restaurant data...")

restaurants_df = pd.read_csv('data/processed/restaurants_filtered.csv')

print(f"Loaded {len(restaurants_df):,} restaurants")
print(f"Shape: {restaurants_df.shape}")
print(f"\nFirst few rows:")
print(restaurants_df.head(3))

In [None]:
# Cell 3: Parse Business Attributes
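import ast

def parse_attributes(attr_str):
    """
    Parse attributes from string to dictionary
    Handles dicts, Python-literal strings, and JSON strings in Yelp data
    """
    # Already parsed: check this first, since pd.isna() is meant for scalars
    if isinstance(attr_str, dict):
        return attr_str

    if pd.isna(attr_str) or attr_str in ('None', ''):
        return {}

    # Try the Python-literal format first, then JSON.
    # ast.literal_eval only evaluates literals, so it cannot run arbitrary code.
    for parser in (ast.literal_eval, json.loads):
        try:
            parsed = parser(attr_str)
        except (ValueError, SyntaxError, TypeError):
            continue
        if isinstance(parsed, dict):
            return parsed
    return {}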

print("Parsing business attributes...")
restaurants_df['attrs_parsed'] = restaurants_df['attributes'].apply(parse_attributes)

# Check what attributes are available
all_attrs = []
for attrs in restaurants_df['attrs_parsed']:
    if isinstance(attrs, dict):
        all_attrs.extend(attrs.keys())

attr_counts = Counter(all_attrs)
print(f"\nFound {len(attr_counts)} unique attribute types")
print("\nTop 20 most common attributes:")
for attr, count in attr_counts.most_common(20):
    print(f"  {attr}: {count:,} restaurants ({count/len(restaurants_df)*100:.1f}%)")

In [None]:
# Cell 4: Extract Key Attributes
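import ast

def safe_extract_attr(attrs, key):
    """Safely extract attribute value, parsing nested dict strings"""
    if not isinstance(attrs, dict):
        return None
    val = attrs.get(key)
    # Nested dicts are stored as strings such as "{'key': False}"
    if isinstance(val, str) and val.startswith('{'):
        try:
            # literal_eval parses literals only, unlike eval
            return ast.literal_eval(val)
        except (ValueError, SyntaxError):
            return val
    return val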

print("Extracting key attributes...")

# Key attributes that may indicate menu complexity
attribute_keys = [
    'RestaurantsPriceRange2',
    'RestaurantsGoodForGroups',
    'RestaurantsTakeOut',
    'RestaurantsDelivery',
    'OutdoorSeating',
    'RestaurantsReservations',
    'GoodForKids',
    'Alcohol',
    'HasTV',
    'WiFi',
    'Caters',
    'RestaurantsTableService',
    'RestaurantsAttire'
]

for key in attribute_keys:
    restaurants_df[key] = restaurants_df['attrs_parsed'].apply(
        lambda x: safe_extract_attr(x, key)
    )

print("Extracted attributes")
print("\nAttribute coverage:")
for key in attribute_keys:
    coverage = restaurants_df[key].notna().sum()
    print(f"  {key}: {coverage:,} ({coverage/len(restaurants_df)*100:.1f}%)")

In [None]:
# Cell 5: Feature 1 - Category Diversity
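def count_categories(categories):
    """Count number of distinct, non-empty categories"""
    if pd.isna(categories):
        return 0
    # A set drops repeated tags; the filter drops empty entries from stray commas
    cats = {c.strip() for c in str(categories).split(',') if c.strip()}
    return len(cats)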

def get_primary_cuisine(categories):
    """Extract primary cuisine type"""
    if pd.isna(categories):
        return 'Unknown'
    
    cats_lower = str(categories).lower()
    
    # Priority-ordered cuisine mapping
    cuisine_map = {
        'chinese': 'Chinese',
        'japanese': 'Japanese',
        'sushi': 'Japanese',
        'thai': 'Thai',
        'indian': 'Indian',
        'mexican': 'Mexican',
        'italian': 'Italian',
        'pizza': 'Pizza',
        'french': 'French',
        'mediterranean': 'Mediterranean',
        'greek': 'Greek',
        'middle eastern': 'Middle Eastern',
        'korean': 'Korean',
        'vietnamese': 'Vietnamese',
        'american': 'American',
        'steakhouse': 'Steakhouses',
        'seafood': 'Seafood',
        'burger': 'Burgers',
        'sandwich': 'Sandwiches',
        'breakfast': 'Breakfast/Brunch',
        'brunch': 'Breakfast/Brunch',
        'bbq': 'BBQ',
        'soul food': 'Soul Food',
        'southern': 'Southern',
        'latin': 'Latin American',
        'spanish': 'Spanish',
        'tapas': 'Tapas'
    }
    
    for keyword, cuisine in cuisine_map.items():
        if keyword in cats_lower:
            return cuisine
    
    return 'Other'

print("Engineering category-based features...")

restaurants_df['category_count'] = restaurants_df['categories'].apply(count_categories)
restaurants_df['primary_cuisine'] = restaurants_df['categories'].apply(get_primary_cuisine)

print("Category features created")
print(f"\nCategory count distribution:")
print(restaurants_df['category_count'].describe())

print(f"\nPrimary cuisine distribution:")
print(restaurants_df['primary_cuisine'].value_counts().head(15))

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Category count distribution
axes[0].hist(restaurants_df['category_count'], bins=range(1, 15), 
             edgecolor='black', alpha=0.7)
axes[0].set_xlabel('Number of Categories')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Distribution of Category Counts per Restaurant')

# Top cuisines
top_cuisines = restaurants_df['primary_cuisine'].value_counts().head(15)
axes[1].barh(range(len(top_cuisines)), top_cuisines.values)
axes[1].set_yticks(range(len(top_cuisines)))
axes[1].set_yticklabels(top_cuisines.index)
axes[1].set_xlabel('Number of Restaurants')
axes[1].set_title('Top 15 Cuisine Types')
axes[1].invert_yaxis()

plt.tight_layout()
plt.savefig('outputs/figures/category_features.png', dpi=300, bbox_inches='tight')
plt.show()

In [None]:
# Cell 6: Feature 2 - Price Range Processing
print("Processing price range...")

# Convert price range to numeric
def convert_price_range(price_str):
    """Convert price string to numeric value"""
    if pd.isna(price_str):
        return np.nan
    
    # Convert to string and strip whitespace
    price_str = str(price_str).strip().strip("'\"")  # Remove quotes too
    
    # Handle different formats
    if price_str in ['1', '1.0']:
        return 1
    elif price_str in ['2', '2.0']:
        return 2
    elif price_str in ['3', '3.0']:
        return 3
    elif price_str in ['4', '4.0']:
        return 4
    elif price_str == '$':
        return 1
    elif price_str == '$$':
        return 2
    elif price_str == '$$$':
        return 3
    elif price_str == '$$$$':
        return 4
    else:
        # Try to extract first digit if present
        match = re.search(r'\d', price_str)
        if match:
            digit = int(match.group())
            if 1 <= digit <= 4:
                return digit
        return np.nan

restaurants_df['price_numeric'] = restaurants_df['RestaurantsPriceRange2'].apply(convert_price_range)

print("Price range converted")
print(f"\nPrice range distribution:")
print(restaurants_df['price_numeric'].value_counts().sort_index())

# Check for any remaining issues
print(f"\nMissing prices: {restaurants_df['price_numeric'].isna().sum():,}")

# For missing prices, impute with cuisine median
cuisine_price_median = restaurants_df.groupby('primary_cuisine')['price_numeric'].median()
print("\nMedian price by cuisine (top 10):")
print(cuisine_price_median.sort_values(ascending=False).head(10))

def impute_price(row):
    """Impute missing price with cuisine median"""
    if pd.notna(row['price_numeric']):
        return row['price_numeric']
    cuisine_median = cuisine_price_median.get(row['primary_cuisine'], np.nan)
    if pd.notna(cuisine_median):
        return cuisine_median
    return 2.0  # Default to mid-range if no cuisine median

restaurants_df['price_imputed'] = restaurants_df.apply(impute_price, axis=1)

print(f"\nImputed {restaurants_df['price_numeric'].isna().sum():,} missing prices")
print(f"Final price distribution:")
print(restaurants_df['price_imputed'].value_counts().sort_index())

In [None]:
# Cell 7: Load and Process Reviews for Text Mining
print("Loading review data...")
print("(This will take 2-3 minutes for full dataset)")

# Load reviews
restaurant_ids = set(restaurants_df['business_id'])
all_reviews = []

with open('data/raw/yelp_academic_dataset_review.json', 'r', encoding='utf-8') as f:
    for i, line in enumerate(f):
        review = json.loads(line)
        if review['business_id'] in restaurant_ids:
            all_reviews.append({
                'business_id': review['business_id'],
                'text': review['text'],
                'stars': review['stars']
            })
        
        # Progress indicator
        if (i + 1) % 500000 == 0:
            print(f"  Processed {i+1:,} reviews... (found {len(all_reviews):,} relevant)")

reviews_df = pd.DataFrame(all_reviews)
print(f"\nLoaded {len(reviews_df):,} reviews for our restaurants")
print(f"Average reviews per restaurant: {len(reviews_df)/len(restaurants_df):.1f}")

In [None]:
# Cell 8: Feature 3 - Review Text Mining for Menu Signals
def extract_menu_signals(text):
    """
    Extract signals about menu size/complexity from review text
    Returns dict with various menu-related signals
    """
    if pd.isna(text):
        return {
            'mentions_extensive': 0,
            'mentions_limited': 0,
            'mentions_variety': 0,
            'mentions_options': 0,
            'menu_words': 0
        }
    
    text_lower = text.lower()
    
    # Extensive menu indicators
    extensive_phrases = [
        'so many options', 'extensive menu', 'huge menu', 
        'lots of choices', 'wide variety', 'tons of options',
        'overwhelming menu', 'big menu', 'massive menu',
        'endless options', 'too many choices'
    ]
    
    # Limited menu indicators
    limited_phrases = [
        'limited menu', 'small menu', 'not many options',
        'few choices', 'short menu', 'limited selection',
        'simple menu', 'focused menu', 'concise menu'
    ]
    
    # Variety/selection mentions
    variety_words = ['variety', 'selection', 'options', 'choices']
    
    return {
        'mentions_extensive': sum(1 for phrase in extensive_phrases if phrase in text_lower),
        'mentions_limited': sum(1 for phrase in limited_phrases if phrase in text_lower),
        'mentions_variety': sum(1 for word in variety_words if word in text_lower),
        'mentions_options': text_lower.count('option'),
        'menu_words': text_lower.count('menu')
    }

print("Mining review text for menu signals...")
print("(This takes 3-5 minutes)")

# Process in chunks for progress tracking
chunk_size = 50000
review_signals = []

for i in range(0, len(reviews_df), chunk_size):
    chunk = reviews_df.iloc[i:i+chunk_size]
    signals = chunk['text'].apply(extract_menu_signals)
    review_signals.extend(signals)
    print(f"  Processed {min(i+chunk_size, len(reviews_df)):,} / {len(reviews_df):,} reviews")

# Convert to dataframe
signals_df = pd.DataFrame(review_signals)
reviews_df = pd.concat([reviews_df.reset_index(drop=True), signals_df], axis=1)

print("\nText mining complete")
print("\nMenu signal statistics:")
print(signals_df.describe())

In [None]:
# Cell 9: Aggregate Review Signals by Restaurant
print("Aggregating review signals by restaurant...")

# Group by business
business_review_signals = reviews_df.groupby('business_id').agg({
    'mentions_extensive': 'sum',
    'mentions_limited': 'sum',
    'mentions_variety': 'sum',
    'mentions_options': 'sum',
    'menu_words': 'sum',
    'text': 'count'  # Total reviews
}).reset_index()

business_review_signals.rename(columns={'text': 'review_count_analyzed'}, inplace=True)

# Create net complexity signal
business_review_signals['net_complexity_signal'] = (
    business_review_signals['mentions_extensive'] - 
    business_review_signals['mentions_limited']
)

# Normalize by review count
business_review_signals['extensive_per_review'] = (
    business_review_signals['mentions_extensive'] / 
    business_review_signals['review_count_analyzed']
)

business_review_signals['limited_per_review'] = (
    business_review_signals['mentions_limited'] / 
    business_review_signals['review_count_analyzed']
)

print("Aggregation complete")
print(f"\nRestaurants with review signals: {len(business_review_signals):,}")
print("\nAggregated signal statistics:")
print(business_review_signals[['mentions_extensive', 'mentions_limited', 
                                'net_complexity_signal']].describe())

# Merge back to main dataframe
restaurants_df = restaurants_df.merge(business_review_signals, 
                                     on='business_id', 
                                     how='left')

# Fill NAs (restaurants with no signals)
signal_cols = ['mentions_extensive', 'mentions_limited', 'mentions_variety', 
               'mentions_options', 'menu_words', 'net_complexity_signal',
               'extensive_per_review', 'limited_per_review']

for col in signal_cols:
    if col in restaurants_df.columns:
        restaurants_df[col] = restaurants_df[col].fillna(0)

print("Merged signals to restaurant dataframe")

In [None]:
# Cell 10: Feature 4 - Cuisine-Specific Complexity Patterns
print("Creating cuisine-based complexity adjustments...")

# Define typical complexity by cuisine type
# Based on general knowledge of these cuisines
cuisine_complexity = {
    'Chinese': 5,      # Typically extensive menus
    'Indian': 5,       # Many options across appetizers, breads, curries
    'Thai': 4,         # Moderate-high variety
    'Mexican': 4,      # Tacos, burritos, various proteins
    'Italian': 4,      # Pasta, pizza, entrees
    'American': 4,     # Diverse offerings
    'Japanese': 3,     # Focused but varied
    'Korean': 3,       # Moderate variety
    'Mediterranean': 3,
    'Middle Eastern': 3,
    'Vietnamese': 3,
    'French': 3,       # Often focused, refined
    'Steakhouses': 2,  # Limited to steaks, sides
    'Seafood': 3,
    'BBQ': 2,          # Meat-focused
    'Pizza': 1,        # Very focused
    'Burgers': 1,      # Limited menu
    'Sandwiches': 2,   # Somewhat limited
    'Breakfast/Brunch': 3,
    'Soul Food': 3,
    'Southern': 3,
    'Latin American': 4,
    'Spanish': 3,
    'Tapas': 4,        # Many small plates
    'Greek': 3,
    'Other': 3         # Neutral
}

restaurants_df['cuisine_complexity_base'] = restaurants_df['primary_cuisine'].map(
    cuisine_complexity
).fillna(3)

print("Cuisine complexity patterns assigned")
print("\nCuisine complexity scores:")
for cuisine, score in sorted(cuisine_complexity.items(), key=lambda x: x[1], reverse=True):
    count = (restaurants_df['primary_cuisine'] == cuisine).sum()
    print(f"  {cuisine}: {score} ({count:,} restaurants)")

In [None]:
# Cell 11: Create Composite Menu Complexity Score
print("Creating composite menu complexity score...")

def calculate_menu_complexity_score(row):
    """
    Calculate comprehensive menu complexity score
    
    Components:
    1. Category diversity (0 to 15 points)
    2. Review text signals (-5 to 20 points)
    3. Price range proxy (3 to 12 points)
    4. Cuisine baseline (1 to 5 points)
    5. Business attributes (0 to 6 points)
    
    Raw total: -1 to 58 points, clipped to 5-60 (higher = more complex)
    """
    score = 0
    
    # 1. Category diversity (cap at 15 points)
    score += min(row['category_count'] * 2.5, 15)
    
    # 2. Review signals (up to 20 points)
    # Net signal: extensive mentions - limited mentions
    net_signal = row.get('net_complexity_signal', 0)
    score += min(max(net_signal * 2, -5), 15)  # Cap positive, small negative penalty
    
    # Variety mentions
    score += min(row.get('mentions_variety', 0) * 0.5, 5)
    
    # 3. Price range (higher price often = more elaborate menu)
    price = row.get('price_imputed', 2)
    score += price * 3
    
    # 4. Cuisine baseline
    score += row.get('cuisine_complexity_base', 3)
    
    # 5. Business attributes suggesting menu breadth
    if row.get('RestaurantsGoodForGroups') in ['True', True]:
        score += 2  # Group dining suggests variety
    if row.get('RestaurantsDelivery') in ['True', True]:
        score += 1
    if row.get('Caters') in ['True', True]:
        score += 2  # Catering suggests diverse menu
    if row.get('RestaurantsTableService') in ['True', True]:
        score += 1  # Full service suggests more elaborate
    
    # Apply ceiling and floor
    return max(5, min(score, 60))

restaurants_df['menu_complexity_score'] = restaurants_df.apply(
    calculate_menu_complexity_score, 
    axis=1
)

print("Menu complexity score calculated")
print("\nComplexity score distribution:")
print(restaurants_df['menu_complexity_score'].describe())

# Visualize - FIX: Disable LaTeX rendering in matplotlib
import matplotlib as mpl
mpl.rcParams['text.usetex'] = False  # This is the key fix

fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# Distribution
axes[0, 0].hist(restaurants_df['menu_complexity_score'], bins=40, 
                edgecolor='black', alpha=0.7)
median_val = restaurants_df['menu_complexity_score'].median()
axes[0, 0].axvline(median_val, 
                   color='red', linestyle='--', 
                   label=f'Median: {median_val:.1f}')
axes[0, 0].set_xlabel('Menu Complexity Score')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].set_title('Distribution of Menu Complexity Scores')
axes[0, 0].legend()

# By cuisine
cuisine_complexity_data = restaurants_df.groupby('primary_cuisine').agg({
    'menu_complexity_score': 'mean',
    'business_id': 'count'
}).reset_index()
cuisine_complexity_data = cuisine_complexity_data[cuisine_complexity_data['business_id'] >= 100]
cuisine_complexity_data = cuisine_complexity_data.sort_values('menu_complexity_score', ascending=False)

axes[0, 1].barh(range(len(cuisine_complexity_data)), cuisine_complexity_data['menu_complexity_score'])
axes[0, 1].set_yticks(range(len(cuisine_complexity_data)))
axes[0, 1].set_yticklabels(cuisine_complexity_data['primary_cuisine'])
axes[0, 1].set_xlabel('Average Complexity Score')
axes[0, 1].set_title('Menu Complexity by Cuisine Type (n>=100)')
axes[0, 1].invert_yaxis()

# Complexity vs Rating
axes[1, 0].scatter(restaurants_df['menu_complexity_score'], 
                   restaurants_df['stars'], 
                   alpha=0.3, s=10)
axes[1, 0].set_xlabel('Menu Complexity Score')
axes[1, 0].set_ylabel('Rating (Stars)')
axes[1, 0].set_title('Menu Complexity vs Restaurant Rating')

# Add trend line
z = np.polyfit(restaurants_df['menu_complexity_score'], restaurants_df['stars'], 1)
p = np.poly1d(z)
x_line = np.linspace(restaurants_df['menu_complexity_score'].min(), 
                     restaurants_df['menu_complexity_score'].max(), 100)
axes[1, 0].plot(x_line, p(x_line), "r--", alpha=0.8, linewidth=2, 
                label=f'Trend: y={z[0]:.4f}x+{z[1]:.2f}')
axes[1, 0].legend()

# By price range
price_complexity = restaurants_df.groupby('price_imputed')['menu_complexity_score'].mean()
axes[1, 1].bar(price_complexity.index, price_complexity.values, 
               edgecolor='black', alpha=0.7)
axes[1, 1].set_xlabel('Price Range')
axes[1, 1].set_ylabel('Average Complexity Score')
axes[1, 1].set_title('Menu Complexity by Price Range')
axes[1, 1].set_xticks([1, 2, 3, 4])
# Use plain text labels instead of dollar signs
axes[1, 1].set_xticklabels(['Budget', 'Moderate', 'Upscale', 'Fine Dining'])

plt.tight_layout()
plt.savefig('outputs/figures/menu_complexity_analysis.png', dpi=300, bbox_inches='tight')
plt.show()

# Calculate correlation
corr = restaurants_df[['menu_complexity_score', 'stars', 'review_count']].corr()
print("\nCorrelation matrix:")
print(corr)

In [None]:
# Cell 12: Create Binary Treatment Variable
print("Creating binary treatment variable...")

# Use median split
median_complexity = restaurants_df['menu_complexity_score'].median()
restaurants_df['complex_menu'] = (
    restaurants_df['menu_complexity_score'] > median_complexity
).astype(int)

print(f"Treatment variable created (median split at {median_complexity:.1f})")
print(f"\nTreatment distribution:")
print(restaurants_df['complex_menu'].value_counts())
print(f"\nAs percentages:")
print(restaurants_df['complex_menu'].value_counts(normalize=True) * 100)

# Compare groups
print("\n=== TREATMENT GROUP COMPARISON ===")
comparison = restaurants_df.groupby('complex_menu').agg({
    'stars': ['mean', 'std', 'count'],
    'review_count': ['mean', 'median'],
    'menu_complexity_score': ['min', 'max', 'mean']
})
print(comparison)

# Visualize treatment groups
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Rating distribution by treatment
restaurants_df.boxplot(column='stars', by='complex_menu', ax=axes[0])
axes[0].set_xlabel('Menu Type (0=Simple, 1=Complex)')
axes[0].set_ylabel('Rating (Stars)')
axes[0].set_title('Rating Distribution by Menu Complexity')
plt.sca(axes[0])
plt.xticks([1, 2], ['Simple Menu', 'Complex Menu'])

# Review count by treatment
restaurants_df.boxplot(column='review_count', by='complex_menu', ax=axes[1])
axes[1].set_xlabel('Menu Type (0=Simple, 1=Complex)')
axes[1].set_ylabel('Review Count (log scale)')
axes[1].set_yscale('log')
axes[1].set_title('Review Count by Menu Complexity')
plt.sca(axes[1])
plt.xticks([1, 2], ['Simple Menu', 'Complex Menu'])

plt.suptitle('')  # Remove auto-generated title
plt.tight_layout()
plt.savefig('outputs/figures/treatment_comparison.png', dpi=300, bbox_inches='tight')
plt.show()

In [None]:
# Cell 13: Create Validation Sample
print("Creating validation sample for manual checking...")

# Sample restaurants across complexity spectrum
validation_sample = pd.concat([
    restaurants_df.nsmallest(25, 'menu_complexity_score'),  # 25 simplest
    restaurants_df.sample(50, random_state=42),  # 50 random (seeded for reproducibility)
    restaurants_df.nlargest(25, 'menu_complexity_score')   # 25 most complex
]).drop_duplicates(subset='business_id')  # Random draw may overlap the extremes

validation_cols = [
    'business_id', 'name', 'city', 'state', 'categories', 
    'primary_cuisine', 'stars', 'review_count',
    'category_count', 'price_imputed', 
    'mentions_extensive', 'mentions_limited',
    'menu_complexity_score', 'complex_menu'
]

validation_export = validation_sample[validation_cols].copy()
validation_export.to_csv('data/validation/validation_sample.csv', index=False)

print(f"Saved {len(validation_export)} restaurants to data/validation/validation_sample.csv")
print("\nSample of validation data:")
print(validation_export.head(10))

print("\n" + "="*60)
print("MANUAL VALIDATION INSTRUCTIONS")
print("="*60)
print("1. Open data/validation/validation_sample.csv")
print("2. For 20-30 restaurants, look up their actual menu online")
print("3. Rate if complexity score seems reasonable (1-5 scale)")
print("4. Note any major misclassifications")
print("5. This validates our proxy measure!")

In [None]:
# Cell 14: Save Final Feature-Engineered Dataset
print("Saving feature-engineered dataset...")

# Select final columns
final_columns = [
    # Identifiers
    'business_id', 'name', 'city', 'state', 'postal_code',
    'latitude', 'longitude',
    
    # Outcomes
    'stars', 'review_count',
    
    # Treatment
    'complex_menu', 'menu_complexity_score',
    
    # Features for matching/controls
    'primary_cuisine', 'category_count',
    'price_imputed', 'is_open',
    
    # Review signals
    'mentions_extensive', 'mentions_limited', 
    'mentions_variety', 'net_complexity_signal',
    
    # Business attributes
    'RestaurantsPriceRange2', 'RestaurantsGoodForGroups',
    'RestaurantsTakeOut', 'RestaurantsDelivery',
    'RestaurantsReservations', 'Caters',
    
    # Original data
    'categories', 'attributes'
]

# Keep only columns that exist
final_columns = [col for col in final_columns if col in restaurants_df.columns]

restaurants_final = restaurants_df[final_columns].copy()
restaurants_final.to_csv('data/processed/restaurants_with_features.csv', index=False)

print(f"Saved {len(restaurants_final):,} restaurants with {len(final_columns)} features")
print(f"File: data/processed/restaurants_with_features.csv")

# Save feature summary
feature_summary = pd.DataFrame({
    'Feature': final_columns,
    'Non-Null Count': [restaurants_final[col].notna().sum() for col in final_columns],
    'Data Type': [restaurants_final[col].dtype for col in final_columns]
})
feature_summary['Coverage %'] = (feature_summary['Non-Null Count'] / len(restaurants_final) * 100).round(2)
feature_summary.to_csv('data/processed/feature_summary.csv', index=False)

print("\nSaved feature summary")
print("\nFeature summary:")
print(feature_summary)

---

## Summary

### Features Created

**1. Menu Complexity Score (Continuous)**
- Range: 5 to 55
- Median: 24.5
- Components: Category diversity, review signals, price, cuisine patterns, attributes

**2. Treatment Variable (Binary)**
- Simple Menu (0): 8,893 restaurants (score at or below the median)
- Complex Menu (1): 8,732 restaurants (score above the median)
- Split at median complexity score 24.5

### Key Component Contributions

| Component | Points | Rationale |
|-----------|--------|-----------|
| Category diversity | 0 to 15 | More cuisine tags = broader menu |
| Review signals | -5 to 20 | Customer perceptions of menu size |
| Price range | 3 to 12 | Higher-end → more elaborate |
| Cuisine baseline | 1 to 5 | Typical complexity patterns |
| Business attributes | 0 to 6 | Groups, delivery, catering, table service suggest variety |
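
The scoring in `calculate_menu_complexity_score` can be traced by hand for one restaurant. The sketch below applies the same arithmetic as Cell 11 to a hypothetical row; every input value is made up for illustration, and `is_true` is a helper introduced here, not part of the notebook.

```python
# Hypothetical restaurant; all values are illustrative, not taken from the data.
row = {
    "category_count": 4,
    "net_complexity_signal": 3,   # extensive mentions minus limited mentions
    "mentions_variety": 6,
    "price_imputed": 2,
    "cuisine_complexity_base": 4,
    "RestaurantsGoodForGroups": True,
    "RestaurantsDelivery": "True",
    "Caters": None,
    "RestaurantsTableService": True,
}

def is_true(value):
    """Attributes appear either as booleans or as the string 'True'."""
    return value in ("True", True)

category_pts = min(row["category_count"] * 2.5, 15)
net_signal_pts = min(max(row["net_complexity_signal"] * 2, -5), 15)
variety_pts = min(row["mentions_variety"] * 0.5, 5)
price_pts = row["price_imputed"] * 3
cuisine_pts = row["cuisine_complexity_base"]
attribute_pts = (
    2 * is_true(row["RestaurantsGoodForGroups"])
    + 1 * is_true(row["RestaurantsDelivery"])
    + 2 * is_true(row["Caters"])
    + 1 * is_true(row["RestaurantsTableService"])
)

raw_total = (category_pts + net_signal_pts + variety_pts
             + price_pts + cuisine_pts + attribute_pts)

# Same floor and ceiling as the notebook function.
score = max(5, min(raw_total, 60))
print(score)
```

For this row the components are 10 + 6 + 3 + 6 + 4 + 4, giving a score of 33.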

### Initial Observations

[Fill in after running]:
1. **Complexity-Rating Correlation**: 
2. **Cuisine Patterns**: 
3. **Price Relationship**: 

### Next Steps

In Notebook 03, we'll:

1. Conduct comprehensive EDA
2. Identify potential confounders
3. Check covariate balance between treatment groups
4. Visualize relationships
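
One common way to check covariate balance is the standardized mean difference (SMD) between the two treatment groups. The sketch below is an assumption about how that check might look, not the method Notebook 03 uses; the function name and the toy price levels are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def standardized_mean_difference(treated, control):
    """SMD = (mean_t - mean_c) / sqrt((var_t + var_c) / 2), using sample variances."""
    pooled_sd = sqrt((variance(treated) + variance(control)) / 2)
    if pooled_sd == 0:
        return 0.0  # Both groups are constant; treat as no measurable difference
    return (mean(treated) - mean(control)) / pooled_sd

# Toy price levels for each group (made up for illustration).
complex_prices = [2, 3, 2, 3]
simple_prices = [1, 2, 1, 2]

smd = standardized_mean_difference(complex_prices, simple_prices)
print(round(smd, 2))
```

An absolute SMD below roughly 0.1 is often treated as acceptable balance, so a value this large would flag price as a covariate to adjust for.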