### Final Thoughts: From Data to Business Action

The journey from raw data to actionable business insights involves several key steps we've practiced today:

1. **Data Understanding** → Know your data quality, structure, and business context
2. **Exploratory Analysis** → Discover patterns and relationships in your data  
3. **Model Selection** → Choose appropriate techniques based on your business question
4. **Interpretation** → Translate technical results into business language
5. **Action Planning** → Convert insights into concrete business strategies

### Practice Exercises (Try These!)

1. **Classification Challenge**: Try predicting a different outcome using the same dataset
2. **Clustering Experiment**: Use different numbers of clusters (k=3, k=5) and compare results  
3. **Feature Engineering**: Create new features combining existing ones (e.g., ratios, categories)
4. **Business Scenarios**: Apply these techniques to your own industry or organization
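
For the clustering experiment, one possible way to compare two values of k is the silhouette score (higher means tighter, better-separated clusters). The sketch below is only an illustration: it uses synthetic data from `make_blobs` as a stand-in for the notebook's `scaled_data`, and the names `X_demo` and `silhouette_by_k` are made up for this example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the notebook's scaled feature matrix (assumption)
X_demo, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

silhouette_by_k = {}
for k in (3, 5):
    # Fit K-means with k clusters and score how well separated the clusters are
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X_demo)
    silhouette_by_k[k] = silhouette_score(X_demo, labels)
    print(f"k={k}: silhouette = {silhouette_by_k[k]:.3f}")
```

On your own data, swap `X_demo` for your scaled features and read the score alongside the elbow plot rather than instead of it.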

### Resources for Continued Learning

- **Python Practice**: Continue with basic Python tutorials and pandas documentation
- **Business Analytics**: Focus on translating business problems to data questions
- **Domain Knowledge**: The more you understand your business context, the better your analysis will be
- **Experimentation**: Try different approaches and compare results

Remember: The goal isn't just to run algorithms, but to generate insights that drive better business decisions!

---

**End of Notebook**

*Happy analyzing! 🚀📊*

In [None]:
# Anomaly Detection using Isolation Forest
from sklearn.ensemble import IsolationForest

print("Anomaly Detection for Business Intelligence")
print("="*45)

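# Use the same scaled data from clustering
# contamination=0.1 tells the model to flag roughly the most unusual 10% of rows,
# so the detected rate below is set by this parameter, not discovered from the data
iso_forest = IsolationForest(contamination=0.1, random_state=42)
anomaly_labels = iso_forest.fit_predict(scaled_data)

# fit_predict returns 1 for normal points and -1 for anomalies;
# convert to binary labels (1 = normal, 0 = anomaly)
anomaly_binary = (anomaly_labels == 1).astype(int)
n_anomalies = int((anomaly_labels == -1).sum())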

print(f"Detected {n_anomalies} anomalies out of {len(anomaly_labels)} samples")
print(f"Anomaly rate: {n_anomalies/len(anomaly_labels)*100:.1f}%")

# Add anomaly information to our dataframe
df_with_anomalies = df_clustered.copy()
df_with_anomalies['Is_Anomaly'] = (anomaly_labels == -1)

# Analyze anomalies
print(f"\nAnomalous Data Points:")
print("-" * 25)

anomalous_data = df_with_anomalies[df_with_anomalies['Is_Anomaly'] == True]

if len(anomalous_data) > 0:
    print("Characteristics of anomalous data points:")
    for feature in clustering_features[:3]:  # Show the first 3 features (list order, not ranked)
        normal_mean = df_with_anomalies[df_with_anomalies['Is_Anomaly'] == False][feature].mean()
        anomaly_mean = anomalous_data[feature].mean()
        print(f"  {feature}:")
        print(f"    Normal: {normal_mean:.2f}")
        print(f"    Anomalies: {anomaly_mean:.2f}")
        print(f"    Difference: {anomaly_mean - normal_mean:.2f}")

    # Visualize anomalies
    if len(clustering_features) >= 2:
        plt.figure(figsize=(12, 5))
        
        # Plot 1: Anomalies in feature space
        plt.subplot(1, 2, 1)
        
        if len(clustering_features) > 2:
            # Use PCA data if available
            normal_points = pca_data[anomaly_labels == 1]
            anomaly_points = pca_data[anomaly_labels == -1]
            
            plt.scatter(normal_points[:, 0], normal_points[:, 1], 
                       c='blue', alpha=0.6, label='Normal', s=20)
            plt.scatter(anomaly_points[:, 0], anomaly_points[:, 1], 
                       c='red', alpha=0.8, label='Anomaly', s=60, marker='x')
            plt.xlabel('PC1')
            plt.ylabel('PC2')
        else:
            feature1, feature2 = clustering_features[0], clustering_features[1]
            normal_data = df_with_anomalies[df_with_anomalies['Is_Anomaly'] == False]
            anomalous_data = df_with_anomalies[df_with_anomalies['Is_Anomaly'] == True]
            
            plt.scatter(normal_data[feature1], normal_data[feature2], 
                       c='blue', alpha=0.6, label='Normal', s=20)
            plt.scatter(anomalous_data[feature1], anomalous_data[feature2], 
                       c='red', alpha=0.8, label='Anomaly', s=60, marker='x')
            plt.xlabel(feature1)
            plt.ylabel(feature2)
        
        plt.title('Anomaly Detection Results')
        plt.legend()
        
        # Plot 2: Anomaly distribution by cluster
        plt.subplot(1, 2, 2)
        
        anomaly_by_cluster = df_with_anomalies.groupby('Cluster')['Is_Anomaly'].sum()
        total_by_cluster = df_with_anomalies.groupby('Cluster').size()
        anomaly_rate_by_cluster = (anomaly_by_cluster / total_by_cluster) * 100
        
        bars = anomaly_rate_by_cluster.plot(kind='bar', color='orange', alpha=0.7)
        plt.title('Anomaly Rate by Customer Segment')
        plt.xlabel('Cluster')
        plt.ylabel('Anomaly Rate (%)')
        plt.xticks(rotation=0)
        
        # Add value labels
        for i, v in enumerate(anomaly_rate_by_cluster):
            plt.text(i, v + max(anomaly_rate_by_cluster)*0.02, f'{v:.1f}%', 
                     ha='center', va='bottom')
        
        plt.tight_layout()
        plt.show()
        
        print(f"\nBusiness Applications of Anomaly Detection:")
        print("=" * 45)
        print("🔍 Fraud Detection: Identify unusual transaction patterns")
        print("⚠️  Quality Control: Detect products outside normal specifications")
        print("📊 Performance Monitoring: Spot unusual business metrics")
        print("🛡️  Risk Management: Identify high-risk customers or accounts")
        print("\n💡 Next Step: Investigate these anomalies to understand if they represent:")
        print("   • Data entry errors that need correction")
        print("   • Genuinely unusual but legitimate cases")
        print("   • Potential fraud or problems requiring immediate attention")

else:
    print("No anomalies detected in the current dataset.")

### Additional Unsupervised Learning Example: Anomaly Detection

Anomaly detection is another powerful unsupervised learning technique that can help businesses identify:
- Fraudulent transactions
- Unusual spending patterns  
- Equipment failures
- Quality control issues

Let's implement a simple anomaly detection example to identify unusual patterns in our dataset.
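
Before running this on real data, it can help to see the idea on a toy example. The sketch below is a minimal illustration, assuming synthetic data: 200 "typical" points plus 10 planted far-away points. The names `typical`, `unusual`, and `X_demo` are hypothetical and not part of the notebook's dataset.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
# 200 typical points near the origin plus 10 planted points far away (synthetic)
typical = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
unusual = rng.uniform(low=6.0, high=8.0, size=(10, 2))
X_demo = np.vstack([typical, unusual])

# contamination sets the share of rows the model will flag as anomalies
model = IsolationForest(contamination=0.05, random_state=42)
labels = model.fit_predict(X_demo)        # 1 = normal, -1 = anomaly
scores = model.decision_function(X_demo)  # lower score = more anomalous

n_flagged = int((labels == -1).sum())
print(f"Flagged {n_flagged} of {len(X_demo)} points")
print(f"Planted points flagged: {int((labels[-10:] == -1).sum())} of 10")
```

Note that `contamination` is a choice you make, not something the model learns: raising it flags more rows, so the anomaly rate mostly reflects that setting.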

---

## Part 5: Summary and Next Steps

### What We've Learned Today

In this notebook, we've covered the fundamentals of translating business problems into data science solutions:

#### 🐍 Python Basics
- **Variables & Calculations**: Storing and manipulating business data
- **Lists & Dictionaries**: Organizing multiple data points
- **Conditional Logic**: Making data-driven decisions  
- **Loops**: Processing multiple records efficiently
- **Functions**: Creating reusable business calculations

#### 📊 Data Analysis with Pandas
- Loading and exploring business datasets
- Understanding data structure and quality
- Calculating summary statistics
- Creating meaningful visualizations

#### 🎯 Supervised Learning (Classification)
- **Business Application**: Predicting business success
- **Models Used**: Logistic Regression and Random Forest
- **Key Insight**: Understanding which factors most influence outcomes
- **Business Value**: Make informed decisions about resource allocation

#### 👥 Unsupervised Learning (Clustering)
- **Business Application**: Customer segmentation
- **Method Used**: K-means clustering  
- **Key Insight**: Identifying distinct customer groups
- **Business Value**: Targeted marketing strategies and personalized services

### Key Business Insights

1. **Data-Driven Decisions**: Instead of relying on intuition alone, we can use historical data to predict future outcomes
2. **Pattern Recognition**: Machine learning helps identify hidden patterns that might not be obvious to human analysis
3. **Customer Understanding**: Segmentation reveals different customer types, enabling more effective business strategies
4. **Predictive Power**: Models can help anticipate business outcomes and inform strategic planning

In [None]:
# Analyze cluster characteristics
print("Customer Segment Analysis")
print("="*30)

# Calculate cluster statistics
cluster_stats = df_clustered.groupby('Cluster')[clustering_features].agg(['mean', 'std', 'count'])
cluster_stats = cluster_stats.round(2)

# Display cluster characteristics
for cluster in sorted(df_clustered['Cluster'].unique()):
    print(f"\nCluster {cluster} Profile:")
    print("-" * 20)
    cluster_data = df_clustered[df_clustered['Cluster'] == cluster]
    print(f"Size: {len(cluster_data)} customers ({len(cluster_data)/len(df_clustered)*100:.1f}%)")
    
    # Show top characteristics for this cluster
    if len(clustering_features) > 0:
        print("Key characteristics:")
        for feature in clustering_features[:5]:  # Show the first 5 features (list order, not ranked)
            mean_val = cluster_data[feature].mean()
            overall_mean = df_clustered[feature].mean()
            # Flag a feature when the cluster mean differs from the overall mean by more
            # than 20% of its magnitude; abs() keeps the test correct for negative means
            margin = 0.2 * abs(overall_mean)
            if abs(mean_val) > 0.01:  # Only show if meaningful value
                if mean_val > overall_mean + margin:
                    print(f"  ▲ High {feature}: {mean_val:.2f} (avg: {overall_mean:.2f})")
                elif mean_val < overall_mean - margin:
                    print(f"  ▼ Low {feature}: {mean_val:.2f} (avg: {overall_mean:.2f})")
                else:
                    print(f"  = Average {feature}: {mean_val:.2f}")

# Visualize cluster characteristics
if len(clustering_features) >= 2:
    # Use PCA for visualization if we have many features
    if len(clustering_features) > 2:
        pca = PCA(n_components=2)
        pca_data = pca.fit_transform(scaled_data)
        feature1, feature2 = 'PC1', 'PC2'
        x_data, y_data = pca_data[:, 0], pca_data[:, 1]
        
        print(f"\nPCA Analysis:")
        print(f"PC1 explains {pca.explained_variance_ratio_[0]:.1%} of variance")
        print(f"PC2 explains {pca.explained_variance_ratio_[1]:.1%} of variance")
        print(f"Total explained: {sum(pca.explained_variance_ratio_):.1%}")
        
    else:
        # Use first two features directly
        feature1, feature2 = clustering_features[0], clustering_features[1]
        x_data = df_clustered[feature1]
        y_data = df_clustered[feature2]
    
    # Create cluster visualization
    plt.figure(figsize=(15, 5))
    
    # Plot 1: Scatter plot of clusters
    plt.subplot(1, 3, 1)
    scatter = plt.scatter(x_data, y_data, c=cluster_labels, cmap='viridis', alpha=0.6)
    plt.colorbar(scatter)
    plt.title('Customer Segments Visualization')
    plt.xlabel(feature1)
    plt.ylabel(feature2)
    
    # Add cluster centers if using original features
    if len(clustering_features) <= 2:
        centers = kmeans.cluster_centers_
        if len(clustering_features) == 2:
            # Transform centers back to original scale
            centers_original = scaler.inverse_transform(centers)
            plt.scatter(centers_original[:, 0], centers_original[:, 1], 
                       c='red', marker='x', s=200, linewidth=3, label='Centers')
            plt.legend()
    
    # Plot 2: Cluster size comparison
    plt.subplot(1, 3, 2)
    cluster_counts.plot(kind='pie', autopct='%1.1f%%', colors=plt.cm.viridis(np.linspace(0, 1, len(cluster_counts))))
    plt.title('Cluster Size Distribution')
    plt.ylabel('')
    
    # Plot 3: Business success by cluster (if isOpen exists)
    plt.subplot(1, 3, 3)
    if 'isOpen' in df_clustered.columns:
        success_by_cluster = df_clustered.groupby('Cluster')['isOpen'].mean()
        bars = success_by_cluster.plot(kind='bar', color='lightcoral')
        plt.title('Business Success Rate by Cluster')
        plt.xlabel('Cluster')
        plt.ylabel('Success Rate')
        plt.xticks(rotation=0)
        plt.ylim(0, 1)
        
        # Add percentage labels
        for i, v in enumerate(success_by_cluster):
            plt.text(i, v + 0.02, f'{v:.1%}', ha='center', va='bottom')
    else:
        plt.text(0.5, 0.5, 'isOpen column\nnot available', ha='center', va='center')
        plt.title('Business Success by Cluster')
    
    plt.tight_layout()
    plt.show()

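# Create business personas for each cluster
# Note: K-means label numbers are arbitrary, so these names and strategies are
# illustrative placeholders; rename them after inspecting the cluster profiles above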
print("\n" + "="*50)
print("BUSINESS PERSONAS")
print("="*50)

personas = {
    0: "Conservative Customers",
    1: "High-Value Customers", 
    2: "Growing Customers",
    3: "Budget-Conscious Customers"
}

for cluster in sorted(df_clustered['Cluster'].unique()):
    cluster_name = personas.get(cluster, f"Cluster {cluster}")
    cluster_size = len(df_clustered[df_clustered['Cluster'] == cluster])
    
    print(f"\n🎯 {cluster_name}")
    print(f"   Size: {cluster_size} customers ({cluster_size/len(df_clustered)*100:.1f}% of total)")
    
    if 'isOpen' in df_clustered.columns:
        success_rate = df_clustered[df_clustered['Cluster'] == cluster]['isOpen'].mean()
        print(f"   Business Success Rate: {success_rate:.1%}")
    
    print("   Marketing Strategy:")
    if cluster == 0:
        print("   → Focus on trust-building and reliability")
        print("   → Offer stable, proven products")
    elif cluster == 1:
        print("   → Provide premium services and exclusive offers")
        print("   → Focus on personalized experiences")
    elif cluster == 2:
        print("   → Nurture growth with educational content")
        print("   → Offer scalable solutions")
    else:
        print("   → Emphasize value and cost-effectiveness")
        print("   → Provide budget-friendly options")

### 4.2 Analyzing Customer Segments
Let's examine the characteristics of each cluster to understand different customer types.
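
One way to make cluster characteristics comparable across features is to express each cluster's mean as a number of standard deviations from the overall mean. The sketch below is a minimal illustration on a tiny made-up table; `df_demo`, `annual_spend`, and `visits` are hypothetical names, not columns from our dataset.

```python
import pandas as pd

# Tiny hypothetical dataset with cluster labels already assigned
df_demo = pd.DataFrame({
    "Cluster": [0, 0, 1, 1, 2, 2],
    "annual_spend": [100.0, 120.0, 900.0, 950.0, 400.0, 420.0],
    "visits": [2, 3, 20, 22, 8, 9],
})
features = ["annual_spend", "visits"]

# Per-cluster means, expressed as standard deviations from the overall mean
cluster_means = df_demo.groupby("Cluster")[features].mean()
profile = (cluster_means - df_demo[features].mean()) / df_demo[features].std()
print(profile.round(2))
```

Positive values mark features where a cluster sits above the overall average and negative values mark features where it sits below, regardless of the feature's original units.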

In [None]:
# Import clustering libraries
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

print("Preparing Data for Customer Segmentation")
print("="*45)

# Select features for clustering (use numeric columns)
clustering_features = df.select_dtypes(include=[np.number]).columns.tolist()
if 'isOpen' in clustering_features:
    clustering_features.remove('isOpen')  # Remove target variable

print(f"Using {len(clustering_features)} features for clustering:")
for feature in clustering_features[:5]:  # Show first 5
    print(f"  - {feature}")
if len(clustering_features) > 5:
    print(f"  ... and {len(clustering_features) - 5} more")

# Prepare clustering data
cluster_data = df[clustering_features].fillna(0)

# Standardize the features (important for K-means)
scaler = StandardScaler()
scaled_data = scaler.fit_transform(cluster_data)

print(f"\nData scaled and ready for clustering")
print(f"Shape: {scaled_data.shape[0]} samples, {scaled_data.shape[1]} features")

# Determine optimal number of clusters using elbow method
print("\nFinding optimal number of clusters...")

inertias = []
k_range = range(2, 11)

for k in k_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(scaled_data)
    inertias.append(kmeans.inertia_)

# Plot elbow curve
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.plot(k_range, inertias, 'bo-')
plt.title('Elbow Method for Optimal k')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia (Within-cluster sum of squares)')
plt.grid(True, alpha=0.3)

# Add annotations for key points
for i, inertia in enumerate(inertias):
    if i % 2 == 0:  # Annotate every other point to avoid crowding
        plt.annotate(f'{inertia:.0f}', (k_range[i], inertia), 
                    textcoords="offset points", xytext=(0,10), ha='center')

# Choose optimal k (for demonstration, we'll use 4)
optimal_k = 4
print(f"Using k = {optimal_k} clusters for analysis")

# Perform clustering
kmeans = KMeans(n_clusters=optimal_k, random_state=42, n_init=10)
cluster_labels = kmeans.fit_predict(scaled_data)

# Add cluster labels to original data
df_clustered = df.copy()
df_clustered['Cluster'] = cluster_labels

print(f"\nClustering completed!")
print(f"Cluster distribution:")
cluster_counts = pd.Series(cluster_labels).value_counts().sort_index()
for cluster, count in cluster_counts.items():
    percentage = (count / len(cluster_labels)) * 100
    print(f"  Cluster {cluster}: {count} samples ({percentage:.1f}%)")

plt.subplot(1, 2, 2)
cluster_counts.plot(kind='bar', color='skyblue')
plt.title('Cluster Size Distribution')
plt.xlabel('Cluster')
plt.ylabel('Number of Customers')
plt.xticks(rotation=0)

# Add value labels on bars
for i, count in enumerate(cluster_counts):
    plt.text(i, count + max(cluster_counts)*0.01, str(count), 
             ha='center', va='bottom')

plt.tight_layout()
plt.show()

### 4.1 Preparing Data for Clustering
We'll use K-means clustering to segment our data into meaningful groups.
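
K-means groups points by distance, so a feature measured in large units can dominate one measured in small units. The sketch below illustrates what standardization does, using made-up numbers (revenue in dollars and a 1-5 rating); `X_demo` is a hypothetical array, not our dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales: revenue (dollars) and rating (1-5)
X_demo = np.array([
    [120000.0, 4.5],
    [80000.0, 3.0],
    [300000.0, 4.0],
    [50000.0, 2.5],
])

# Rescale each column to mean 0 and standard deviation 1
X_scaled = StandardScaler().fit_transform(X_demo)

print("Raw std per feature:    ", X_demo.std(axis=0).round(2))
print("Scaled mean per feature:", X_scaled.mean(axis=0).round(2))
print("Scaled std per feature: ", X_scaled.std(axis=0).round(2))
```

After scaling, both columns contribute on the same footing to the distances K-means computes.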

---

## Part 4: Unsupervised Learning - Clustering

**Unsupervised Learning** is like finding hidden patterns in data without knowing the "right answer" beforehand. In business, this could be:
- Segmenting customers based on spending behavior
- Identifying unusual transaction patterns
- Grouping products by similarity

We'll perform customer segmentation to help with targeted marketing strategies.

In [None]:
# Analyze feature importance (works with Random Forest)
if 'Random Forest' in results:
    rf_model = results['Random Forest']['model']
    
    # Get feature importance
    importance = rf_model.feature_importances_
    feature_names = X.columns
    
    # Create a DataFrame for easier analysis
    importance_df = pd.DataFrame({
        'feature': feature_names,
        'importance': importance
    }).sort_values('importance', ascending=False)
    
    print("Top 10 Most Important Features for Business Success:")
    print("="*55)
    
    for i, (_, row) in enumerate(importance_df.head(10).iterrows(), 1):
        print(f"{i:2d}. {row['feature']:<25} {row['importance']:.4f}")
    
    # Visualize feature importance
    plt.figure(figsize=(12, 8))
    top_features = importance_df.head(10)
    
    plt.subplot(2, 1, 1)
    bars = plt.bar(range(len(top_features)), top_features['importance'])
    plt.title('Top 10 Feature Importance for Predicting Business Success')
    plt.xlabel('Features')
    plt.ylabel('Importance Score')
    plt.xticks(range(len(top_features)), top_features['feature'], rotation=45, ha='right')
    
    # Add value labels on bars
    for i, bar in enumerate(bars):
        plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.001, 
                f'{top_features.iloc[i]["importance"]:.3f}', 
                ha='center', va='bottom', fontsize=8)
    
    # Model performance comparison
    plt.subplot(2, 1, 2)
    model_names = list(results.keys())
    accuracies = [results[name]['accuracy'] for name in model_names]
    
    bars = plt.bar(model_names, accuracies, color=['lightblue', 'lightgreen'])
    plt.title('Model Performance Comparison')
    plt.ylabel('Accuracy')
    plt.ylim(0, 1)
    
    # Add percentage labels on bars
    for i, bar in enumerate(bars):
        plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01, 
                f'{accuracies[i]:.1%}', ha='center', va='bottom')
    
    plt.tight_layout()
    plt.show()
    
    # Business insights
    print(f"\nBusiness Insights:")
    print("-" * 20)
    print(f"The Random Forest model identified the following key success factors:")
    
    top_3_features = importance_df.head(3)
    for i, (_, row) in enumerate(top_3_features.iterrows(), 1):
        print(f"{i}. {row['feature']} (importance: {row['importance']:.3f})")
    
    print(f"\nThese features account for {top_3_features['importance'].sum():.1%} "
          f"of the model's total feature importance.")
    
else:
    print("Feature importance analysis requires Random Forest model.")

### 3.3 Feature Importance Analysis
Understanding which factors most influence business success can provide valuable insights for decision-making.
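
Random Forest's built-in importances are based on how much each feature reduces impurity during training, and they can favor features with many distinct values. A common cross-check is permutation importance: shuffle one feature at a time and measure how much test accuracy drops. The sketch below is a hedged illustration on synthetic data from `make_classification`; the variable names are made up for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the business dataset (assumption)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, n_informative=3, n_redundant=0, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Impurity-based importances (what feature_importances_ reports); they sum to 1
impurity_importance = rf.feature_importances_

# Permutation importance: average drop in test accuracy when one feature is shuffled
perm = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=42)

for i in range(X_demo.shape[1]):
    print(f"feature_{i}: impurity={impurity_importance[i]:.3f}  "
          f"permutation={perm.importances_mean[i]:.3f}")
```

When both measures rank a feature highly, that is stronger evidence it matters; either way, importance describes what the model uses, not what causes business success.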

In [None]:
# Train different classification models
models = {
    # max_iter raised from the default 100 to help convergence on unscaled features
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42)
}

results = {}

for name, model in models.items():
    print(f"Training {name}...")
    
    # Train the model
    model.fit(X_train, y_train)
    
    # Make predictions
    y_pred = model.predict(X_test)
    
    # Calculate accuracy
    accuracy = accuracy_score(y_test, y_pred)
    results[name] = {
        'model': model,
        'accuracy': accuracy,
        'predictions': y_pred
    }
    
    print(f"{name} Accuracy: {accuracy:.3f}")

print("\n" + "="*50)
print("Model Comparison:")
print("-" * 20)

for name, result in results.items():
    print(f"{name}: {result['accuracy']:.1%} accuracy")

# Detailed analysis of the best model
best_model_name = max(results.keys(), key=lambda x: results[x]['accuracy'])
best_model = results[best_model_name]['model']
best_predictions = results[best_model_name]['predictions']

print(f"\nDetailed Analysis - {best_model_name}")
print("="*40)

# Classification report
print("Classification Report:")
print(classification_report(y_test, best_predictions, 
                          target_names=['Closed', 'Open']))

# Confusion Matrix (scikit-learn convention: rows = actual, columns = predicted)
cm = confusion_matrix(y_test, best_predictions)
print(f"\nConfusion Matrix:")
print("Rows = actual, columns = predicted")
print("        Closed  Open")
print(f"Closed    {cm[0,0]:4d}   {cm[0,1]:3d}")
print(f"Open      {cm[1,0]:4d}   {cm[1,1]:3d}")

# Business interpretation
correct_predictions = cm[0,0] + cm[1,1]
total_predictions = cm.sum()
print(f"\nBusiness Impact:")
print(f"  Correct overall: {correct_predictions} of {total_predictions} predictions")
print(f"  Correctly identified {cm[1,1]} businesses that stayed open")
print(f"  Correctly identified {cm[0,0]} businesses that closed")
print(f"  Misclassified {cm[0,1]} closing businesses as staying open")
print(f"  Misclassified {cm[1,0]} successful businesses as closing")

### 3.2 Training Classification Models
Let's train two different types of models to predict business success and compare their performance.
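
A single train/test split can make one model look better than another by chance. One common way to get a steadier comparison is k-fold cross-validation. The sketch below is a minimal illustration on synthetic data; `X_demo`, `y_demo`, and `candidates` are hypothetical names used only here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the prepared features and target (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

cv_scores = {}
for name, model in candidates.items():
    # 5-fold cross-validation: train on 4 folds, score on the 5th, repeat 5 times
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_scores[name] = scores
    print(f"{name}: {scores.mean():.3f} ± {scores.std():.3f}")
```

If the two models' score ranges overlap heavily, the difference between them is probably not meaningful for the business decision.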

In [None]:
# Import machine learning libraries
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.preprocessing import LabelEncoder

# Check if we have the target column
if 'isOpen' not in df.columns:
    print("Warning: 'isOpen' column not found. Creating a sample target for demonstration.")
    # Create a sample target based on some business logic
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    if len(numeric_cols) > 0:
        # Create target based on above-median performance in first numeric column
        df['isOpen'] = (df[numeric_cols[0]] > df[numeric_cols[0]].median()).astype(int)
    else:
        # Random target for demonstration
        np.random.seed(42)
        df['isOpen'] = np.random.choice([0, 1], size=len(df))

# Prepare features for machine learning
print("Preparing data for machine learning...")
print("=" * 40)

# Select numeric columns for features
numeric_features = df.select_dtypes(include=[np.number]).columns.tolist()
if 'isOpen' in numeric_features:
    numeric_features.remove('isOpen')

print(f"Using {len(numeric_features)} numeric features:")
for feature in numeric_features:
    print(f"  - {feature}")

# Handle categorical columns if they exist
categorical_features = df.select_dtypes(include=['object']).columns.tolist()
processed_df = df.copy()

if categorical_features:
    print(f"\nProcessing {len(categorical_features)} categorical features...")
    label_encoders = {}
    
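    # Label encoding assigns arbitrary integer codes; that is fine for tree models,
    # but the implied ordering can mislead linear models such as logistic regression
    for col in categorical_features[:3]:  # Limit to first 3 categorical columns
        le = LabelEncoder()
        # astype(str) guards against columns that mix strings with other types
        processed_df[col + '_encoded'] = le.fit_transform(processed_df[col].fillna('Unknown').astype(str))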
        label_encoders[col] = le
        numeric_features.append(col + '_encoded')
        print(f"  - Encoded {col}")

# Prepare X (features) and y (target)
X = processed_df[numeric_features].fillna(0)  # Fill missing values with 0
y = processed_df['isOpen']

print(f"\nFinal feature set: {X.shape[1]} features")
print(f"Target distribution:")
print(f"  Open businesses: {sum(y == 1)} ({sum(y == 1)/len(y)*100:.1f}%)")
print(f"  Closed businesses: {sum(y == 0)} ({sum(y == 0)/len(y)*100:.1f}%)")

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"\nData split complete:")
print(f"  Training set: {X_train.shape[0]} samples")
print(f"  Testing set: {X_test.shape[0]} samples")

### 3.1 Preparing Data for Machine Learning
Before we can train a model, we need to prepare our data by selecting features and cleaning it.
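
One preparation detail worth knowing: when the target is imbalanced (for example, far more open businesses than closed ones), a plain random split can leave the test set with a different class mix than the training set. The sketch below shows the `stratify` option of `train_test_split` on a made-up target; `X_demo` and `y_demo` are hypothetical and the 80/20 mix is an assumption for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 80 open (1) and 20 closed (0) businesses
y_demo = np.array([1] * 80 + [0] * 20)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# stratify=y_demo keeps the open/closed ratio the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(f"Open share overall:  {y_demo.mean():.2f}")
print(f"Open share in train: {y_tr.mean():.2f}")
print(f"Open share in test:  {y_te.mean():.2f}")
```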

---

## Part 3: Supervised Learning - Classification

**Supervised Learning** is like learning from examples with known answers. In business, this could be:
- Predicting if a business will stay open based on financial indicators
- Determining if a customer will default on a loan
- Classifying transactions as fraudulent or legitimate

We'll use the `isOpen` column as our target to predict business success.

In [None]:
# Create visualizations to understand our data better

fig, axes = plt.subplots(2, 2, figsize=(15, 12))
fig.suptitle('Business Data Analysis Dashboard', fontsize=16, y=1.02)

# Plot 1: Business Status Distribution (if exists)
if 'isOpen' in df.columns:
    status_counts = df['isOpen'].value_counts()
    labels = ['Closed' if x == 0 else 'Open' for x in status_counts.index]
    axes[0, 0].pie(status_counts.values, labels=labels, autopct='%1.1f%%', startangle=90)
    axes[0, 0].set_title('Business Status Distribution')
else:
    # Alternative plot if isOpen doesn't exist
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    if len(numeric_cols) > 0:
        axes[0, 0].hist(df[numeric_cols[0]], bins=20, alpha=0.7)
        axes[0, 0].set_title(f'Distribution of {numeric_cols[0]}')

# Plot 2: Spending distribution (first numeric column)
numeric_cols = df.select_dtypes(include=[np.number]).columns
if len(numeric_cols) > 0:
    spending_col = numeric_cols[0]
    axes[0, 1].hist(df[spending_col], bins=20, alpha=0.7, color='skyblue')
    axes[0, 1].set_title(f'Distribution of {spending_col}')
    axes[0, 1].set_xlabel(spending_col)
    axes[0, 1].set_ylabel('Frequency')

# Plot 3: Correlation heatmap (if we have multiple numeric columns)
if len(numeric_cols) >= 2:
    correlation_matrix = df[numeric_cols[:5]].corr()  # Limit to 5 columns for readability
    sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0, ax=axes[1, 0])
    axes[1, 0].set_title('Correlation Matrix')
else:
    axes[1, 0].text(0.5, 0.5, 'Not enough numeric\ncolumns for correlation', 
                    ha='center', va='center', transform=axes[1, 0].transAxes)
    axes[1, 0].set_title('Correlation Matrix')

# Plot 4: Category analysis (if categorical columns exist)
if len(categorical_cols) > 0:
    cat_col = categorical_cols[0]
    category_counts = df[cat_col].value_counts().head(10)  # Top 10 categories
    category_counts.plot(kind='bar', ax=axes[1, 1])
    axes[1, 1].set_title(f'Top Categories in {cat_col}')
    axes[1, 1].set_xlabel(cat_col)
    axes[1, 1].set_ylabel('Count')
    axes[1, 1].tick_params(axis='x', rotation=45)
else:
    axes[1, 1].text(0.5, 0.5, 'No categorical\ncolumns found', 
                    ha='center', va='center', transform=axes[1, 1].transAxes)
    axes[1, 1].set_title('Category Analysis')

plt.tight_layout()
plt.show()

# Additional summary statistics
print("\nKey Business Metrics:")
print("=" * 30)
if len(numeric_cols) > 0:
    for col in numeric_cols[:3]:  # Show the first 3 numeric columns
        print(f"{col}:")
        print(f"  Mean: {df[col].mean():.2f}")
        print(f"  Std Dev: {df[col].std():.2f}")
        print(f"  Range: {df[col].min():.2f} to {df[col].max():.2f}")
        print()

### 2.3 Data Visualization
Visual representations help us understand patterns in our business data more easily than looking at numbers alone.

In [None]:
# Explore categorical columns
print("Unique values in categorical columns:")
print("-" * 40)

categorical_cols = df.select_dtypes(include=['object']).columns
for col in categorical_cols:
    print(f"{col}: {df[col].nunique()} unique values")
    print(f"Values (first 10): {list(df[col].unique()[:10])}")  # Cap output for high-cardinality columns
    print()

# Check for missing values
print("Missing values:")
print(df.isnull().sum())

print("\n" + "="*50)

# Business insights from the data
print("Business Analysis:")
print("-" * 20)

if 'isOpen' in df.columns:
    business_status = df['isOpen'].value_counts()
    print(f"Business Status Distribution:")
    for status, count in business_status.items():
        percentage = (count / len(df)) * 100
        print(f"  {'Open' if status == 1 else 'Closed'}: {count} ({percentage:.1f}%)")

# Analyze spending patterns if numeric spending columns exist (matched by column name)
spending_cols = [col for col in df.select_dtypes(include=[np.number]).columns
                 if 'spend' in col.lower() or 'amount' in col.lower()]
if spending_cols:
    print(f"\nSpending Analysis:")
    for col in spending_cols[:3]:  # Show first 3 spending columns
        print(f"{col}:")
        print(f"  Average: ${df[col].mean():,.2f}")
        print(f"  Median: ${df[col].median():,.2f}")
        print(f"  Min: ${df[col].min():,.2f}")
        print(f"  Max: ${df[col].max():,.2f}")
        print()

### 2.2 Data Exploration and Analysis
Let's explore our business data to understand customer patterns, spending behavior, and business performance.

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set up plotting style
plt.style.use('default')
sns.set_palette("husl")

# Load the dataset
df = pd.read_csv('../data/w2--dataset.csv')

print("Dataset loaded successfully!")
print(f"Shape: {df.shape[0]} rows, {df.shape[1]} columns")
print("\n" + "="*50)

# First look at the data
print("First 5 rows:")
print(df.head())

print("\n" + "="*50)
print("Column information:")
df.info()  # info() prints its summary directly and returns None

print("\n" + "="*50)
print("Basic statistics:")
print(df.describe())

---

## Part 2: Working with Data Using Pandas

Now that we understand Python basics, let's learn how to work with real business data using Pandas - Python's most popular data analysis library. Think of Pandas as Excel, but much more powerful!
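
To make the Excel comparison concrete, here is a small sketch of three everyday spreadsheet tasks done in pandas: adding a calculated column, filtering rows, and summarizing by group. The `sales` table and its column names are made up for illustration and are not part of our dataset.

```python
import pandas as pd

# Hypothetical sales table (like a small spreadsheet)
sales = pd.DataFrame({
    "region": ["North", "South", "North", "West"],
    "units": [10, 4, 7, 12],
    "unit_price": [25.0, 40.0, 25.0, 30.0],
})

# Calculated column, like an Excel formula filled down
sales["revenue"] = sales["units"] * sales["unit_price"]

# Filter rows, like an Excel filter
big_orders = sales[sales["revenue"] > 200]

# Group and total, like a pivot table
revenue_by_region = sales.groupby("region")["revenue"].sum()

print(big_orders)
print(revenue_by_region)
```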

### 2.1 Loading and Exploring Our Dataset

In [None]:
# Functions for common business calculations

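def calculate_roi(initial_investment, final_value):
    """Calculate Return on Investment (ROI) as a percentage"""
    if initial_investment == 0:
        # ROI is undefined when nothing was invested
        raise ValueError("initial_investment must be non-zero")
    roi = ((final_value - initial_investment) / initial_investment) * 100
    return roi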

def categorize_expense(amount):
    """Categorize expenses as Low, Medium, or High"""
    if amount < 1000:
        return "Low"
    elif amount < 5000:
        return "Medium"
    else:
        return "High"

def calculate_break_even(fixed_costs, price_per_unit, variable_cost_per_unit):
    """Calculate break-even point in units"""
    if price_per_unit <= variable_cost_per_unit:
        return None  # No break-even possible
    break_even_units = fixed_costs / (price_per_unit - variable_cost_per_unit)
    return break_even_units

# Using the functions
print("ROI Analysis:")
investments = [
    {"name": "Project A", "initial": 50000, "final": 65000},
    {"name": "Project B", "initial": 30000, "final": 42000},
    {"name": "Project C", "initial": 80000, "final": 85000}
]

for investment in investments:
    roi = calculate_roi(investment["initial"], investment["final"])
    print(f"{investment['name']}: {roi:.1f}% ROI")

print("\nExpense Categorization:")
expenses = [800, 2500, 15000, 450, 8500]
for expense in expenses:
    category = categorize_expense(expense)
    print(f"${expense:,} - {category}")

print("\nBreak-Even Analysis:")
fixed_costs = 100000
price_per_unit = 50
variable_cost = 20

break_even = calculate_break_even(fixed_costs, price_per_unit, variable_cost)
if break_even is not None:  # Explicit None check: a break-even of 0 units is still valid
    print(f"Fixed Costs: ${fixed_costs:,}")
    print(f"Price per unit: ${price_per_unit}")
    print(f"Variable cost per unit: ${variable_cost}")
    print(f"Break-even point: {break_even:.0f} units")
    print(f"Break-even revenue: ${break_even * price_per_unit:,.0f}")
else:
    print("Break-even not possible - price too low!")

### 1.6 Functions - Reusable Code
Functions are like formulas in Excel that you can use over and over again with different inputs. They help organize code and avoid repetition.
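
As one more small sketch of the same idea, the function below wraps a gross-margin calculation so it can be reused for several products. The name `gross_margin` and the sample figures are illustrative assumptions, not part of the course dataset.

```python
def gross_margin(revenue, cost_of_goods_sold):
    """Return gross margin as a percentage of revenue."""
    if revenue == 0:
        return None  # Margin is undefined when there is no revenue
    return (revenue - cost_of_goods_sold) / revenue * 100

# Reuse the same function with different inputs (made-up numbers)
products = [
    ("Widget", 20000, 12000),
    ("Gadget", 15000, 12000),
]

for name, revenue, cogs in products:
    margin = gross_margin(revenue, cogs)
    print(f"{name}: {margin:.1f}% gross margin")
```

Writing the formula once inside a function means a later correction (for example, a different treatment of zero revenue) only has to be made in one place.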

In [None]:
# For loops - processing multiple items
departments = ['Sales', 'Marketing', 'Operations', 'Finance', 'HR']
budgets = [150000, 80000, 120000, 100000, 60000]

print("Department Budget Analysis:")
grand_total = sum(budgets)  # Compute the overall total once, outside the loop
total_budget = 0
for dept, budget in zip(departments, budgets):  # zip pairs each department with its budget
    total_budget += budget
    percentage = (budget / grand_total) * 100
    print(f"{dept}: ${budget:,} ({percentage:.1f}% of total)")

print(f"Total Budget: ${total_budget:,}")

print("\n" + "="*40)

# While loop - compound interest calculation
principal = 10000
interest_rate = 0.05
target_amount = 15000
years = 0

print(f"Growing ${principal:,} at {interest_rate*100}% annual interest:")
print("Year\tAmount")
print("----\t------")

current_amount = principal
while current_amount < target_amount and years < 20:
    print(f"{years}\t${current_amount:,.0f}")
    current_amount = current_amount * (1 + interest_rate)
    years += 1

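print(f"{years}\t${current_amount:,.0f}")
if current_amount >= target_amount:
    print(f"\nIt takes {years} years to reach ${target_amount:,}")
else:
    # The loop also stops at the 20-year cap, so report that case separately
    print(f"\nTarget of ${target_amount:,} not reached within {years} years")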

### 1.5 Loops - Repeating Tasks
Loops help us perform the same task multiple times, like calculating monthly payments or processing multiple transactions.
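
To illustrate the "processing multiple transactions" case, here is a minimal sketch that keeps a running balance. The transaction amounts and the opening balance are made-up values chosen only for the example.

```python
# Hypothetical transactions: positive = deposit, negative = withdrawal
transactions = [1200, -450, 300, -75, -980]

balance = 5000  # Assumed opening balance
running_balances = []

for amount in transactions:
    balance += amount                 # Apply the transaction
    running_balances.append(balance)  # Record the balance after each step
    print(f"Transaction: {amount:+,} -> Balance: ${balance:,}")

print(f"Closing balance: ${balance:,}")
```

Iterating directly over the list (`for amount in transactions`) avoids index bookkeeping when only the values are needed.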

In [None]:
# If statements for business decisions
credit_score = 720
annual_income = 85000
debt_to_income = 0.25

# Credit approval logic
if credit_score >= 750 and debt_to_income < 0.3:
    approval = "Approved - Excellent"
    interest_rate = 3.5
elif credit_score >= 700 and debt_to_income < 0.4:
    approval = "Approved - Good"
    interest_rate = 4.2
elif credit_score >= 650 and debt_to_income < 0.5:
    approval = "Approved - Fair"
    interest_rate = 5.8
else:
    approval = "Declined"
    interest_rate = None

print("Credit Application Review:")
print(f"Credit Score: {credit_score}")
print(f"Annual Income: ${annual_income:,}")
print(f"Debt-to-Income Ratio: {debt_to_income*100:.1f}%")
print(f"Decision: {approval}")
if interest_rate is not None:  # Only approved applications have a rate
    print(f"Interest Rate: {interest_rate}%")

# Performance categorization
quarterly_sales = 180000
target_sales = 150000

if quarterly_sales >= target_sales * 1.2:
    performance = "Excellent"
elif quarterly_sales >= target_sales * 1.1:
    performance = "Above Target"
elif quarterly_sales >= target_sales:
    performance = "Met Target"
else:
    performance = "Below Target"

print("\nQuarterly Performance:")
print(f"Sales: ${quarterly_sales:,}")
print(f"Target: ${target_sales:,}")
print(f"Performance: {performance}")
print(f"Achievement: {(quarterly_sales/target_sales)*100:.1f}% of target")

### 1.4 If Statements - Making Decisions
If statements help us make decisions based on conditions, like determining credit approval or categorizing performance.
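
One detail worth seeing in isolation: Python checks the conditions from top to bottom and runs only the first branch that is true. The sketch below applies that to a receivables follow-up rule; the thresholds, variable names, and status labels are hypothetical.

```python
invoice_amount = 12500  # Hypothetical invoice total
days_overdue = 45       # Hypothetical days past the due date

# Conditions are tested in order; the first true one wins
if days_overdue <= 0:
    status = "Current"
elif days_overdue <= 30:
    status = "Overdue: send reminder"
elif days_overdue <= 90:
    status = "Overdue: escalate to collections"
else:
    status = "Consider write-off"

print(f"Invoice ${invoice_amount:,}, {days_overdue} days overdue: {status}")
```

Because 45 fails the first two checks and passes the third, the `else` branch is never evaluated.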

In [None]:
# Dictionaries - like a lookup table
company_info = {
    'name': 'ABC Retail Corp',
    'employees': 250,
    'revenue': 2500000,
    'founded': 2010,
    'industry': 'Retail'
}

# Monthly expenses by category
monthly_expenses = {
    'Rent': 15000,
    'Utilities': 3500,
    'Supplies': 2800,
    'Marketing': 8000,
    'Insurance': 1200
}

print("Company Information:")
for key, value in company_info.items():
    print(f"{key.capitalize()}: {value}")

print("\nMonthly Expenses:")
total_expenses = 0
for category, amount in monthly_expenses.items():
    print(f"{category}: ${amount:,}")
    total_expenses += amount

print(f"Total Monthly Expenses: ${total_expenses:,}")

# Accessing specific values
print(f"\nThe company {company_info['name']} spends ${monthly_expenses['Marketing']:,} on marketing each month.")

### 1.3 Dictionaries - Key-Value Pairs
Dictionaries are like a two-column table where each key has a corresponding value. Perfect for storing related business information.
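
Beyond reading values, dictionaries can be updated in place and queried safely. The following sketch uses an assumed set of account balances (the account names and amounts are illustrative) to show updating a key, adding a key, and looking up a key that may not exist.

```python
# Hypothetical account balances keyed by account name
account_balances = {
    "Cash": 25000,
    "Accounts Receivable": 18000,
    "Inventory": 32000,
}

account_balances["Cash"] += 5000         # Update an existing key
account_balances["Prepaid Rent"] = 6000  # Add a new key

# .get() returns a default value instead of raising KeyError for a missing key
goodwill = account_balances.get("Goodwill", 0)

total_assets = sum(account_balances.values())
print(f"Goodwill: ${goodwill:,}")
print(f"Total assets: ${total_assets:,}")
```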

In [None]:
# Lists - like columns in Excel
monthly_sales = [45000, 52000, 48000, 61000, 55000, 49000]
expense_categories = ['Rent', 'Utilities', 'Supplies', 'Marketing', 'Insurance']

print("Monthly Sales:")
print(monthly_sales)
print(f"Total Sales: ${sum(monthly_sales):,}")
print(f"Average Monthly Sales: ${sum(monthly_sales)/len(monthly_sales):,.0f}")
print(f"Highest Month: ${max(monthly_sales):,}")
print(f"Lowest Month: ${min(monthly_sales):,}")

print("\nExpense Categories:")
print(expense_categories)
print(f"Number of categories: {len(expense_categories)}")
print(f"First category: {expense_categories[0]}")
print(f"Last category: {expense_categories[-1]}")

### 1.2 Lists - Collections of Data
Lists are like a column in a spreadsheet. They can store multiple values, such as monthly sales figures or expense categories.
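
Lists can also grow and be sliced, much like extending or selecting part of a spreadsheet column. This is a small sketch with made-up quarterly revenue figures.

```python
quarterly_revenue = [210000, 245000, 198000]  # Hypothetical Q1-Q3 figures
quarterly_revenue.append(260000)              # Add Q4 to the end of the list

first_half = quarterly_revenue[:2]   # Slice: the first two quarters
second_half = quarterly_revenue[2:]  # Slice: everything from the third quarter on
ranked = sorted(quarterly_revenue, reverse=True)  # New list, highest first

print(f"First half total: ${sum(first_half):,}")
print(f"Second half total: ${sum(second_half):,}")
print(f"Quarters ranked by revenue: {ranked}")
```

Note that `sorted()` returns a new list and leaves `quarterly_revenue` in its original order.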

In [None]:
# Variables - like account balances
revenue = 150000
expenses = 120000
tax_rate = 0.25

# Basic calculations
gross_profit = revenue - expenses
net_profit = gross_profit * (1 - tax_rate)

print(f"Revenue: ${revenue:,}")
print(f"Expenses: ${expenses:,}")
print(f"Gross Profit: ${gross_profit:,}")
print(f"Net Profit: ${net_profit:,.0f}")  # net_profit is a float, so round for display
print(f"Profit Margin: {(net_profit/revenue)*100:.1f}%")

# Week 2: Business Problems → Data Solutions

## Learning Objectives
- Understand basic Python programming concepts  
- Learn how to translate accounting/finance problems into data science tasks
- Differentiate between supervised and unsupervised learning
- Apply simple machine learning models to accounting datasets

---

## Part 1: Introduction to Python Basics

Before we dive into data analysis, let's cover essential Python concepts using accounting and business examples. Don't worry if you're new to programming - we'll start with the basics!

### 1.1 Variables and Basic Calculations
Variables are like labeled containers that store values. In accounting, think of them as account names that hold monetary values.
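
Variables can hold different kinds of values, and a variable can be reassigned as its "account" changes. Here is a minimal sketch; the variable names and amounts are illustrative.

```python
company_name = "ABC Retail Corp"  # Text (str)
cash_balance = 25000              # Whole number (int)
interest_rate = 0.045             # Decimal number (float)
is_profitable = True              # Yes/no value (bool)

# Reassign the variable after a deposit, like posting to an account
cash_balance = cash_balance + 1500

print(f"{company_name} cash balance: ${cash_balance:,}")
print(type(cash_balance).__name__, type(interest_rate).__name__)
```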