# Pandas Integration Examples

This notebook demonstrates how to use pranaam with pandas DataFrames for real-world data processing and analysis.

We'll cover:
1. Basic DataFrame processing
2. Data analysis with predictions
3. Confidence-based filtering
4. Saving and exporting results
5. Building a summary report

Let's start by importing our dependencies:

In [None]:
import pandas as pd

import pranaam

print(f"Pandas version: {pd.__version__}")
print(f"Pranaam version: {getattr(pranaam, '__version__', 'unknown')}")

## Creating Sample Data

First, let's create a sample employee dataset to work with:

In [None]:
def create_sample_data():
    """Create sample employee data for demonstration."""
    return pd.DataFrame({
        "employee_id": [1001, 1002, 1003, 1004, 1005, 1006],
        "name": [
            "Shah Rukh Khan",
            "Priya Sharma",
            "Mohammed Ali",
            "Raj Patel",
            "Fatima Khan",
            "Amitabh Bachchan",
        ],
        "department": [
            "Engineering",
            "Marketing",
            "Finance",
            "HR",
            "Engineering",
            "Management",
        ],
        "salary": [75000, 65000, 70000, 60000, 80000, 120000],
    })

# Create our sample data
df = create_sample_data()
print("Original employee data:")
print(df)

## 📊 Basic DataFrame Processing

Now let's add religion predictions to our DataFrame using pranaam:

In [None]:
# Get predictions for the name column
print("Getting predictions for all names...")
predictions = pranaam.pred_rel(df["name"], lang="eng")
print("\nPredictions:")
print(predictions)

In [None]:
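# Merge predictions back into the original DataFrame.
# pranaam returns the columns name, pred_label, and pred_prob_muslim.
# Deduplicate on name first so repeated names cannot multiply rows in the merge.
df_with_predictions = df.merge(
    predictions[["name", "pred_label", "pred_prob_muslim"]].drop_duplicates(subset="name"),
    on="name",
    how="left",
)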

print("Combined data with predictions:")
print(df_with_predictions)
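
Merging on `name` assumes each name appears only once in the predictions frame; a repeated name on the right side would multiply rows in the result. Below is a minimal sketch of a guarded merge. It uses a hand-made stand-in for the predictions, so the labels and probabilities are made up for illustration and are not pranaam output.

```python
import pandas as pd

# Employees, with one name appearing twice
employees = pd.DataFrame({
    "employee_id": [1, 2, 3],
    "name": ["Raj Patel", "Fatima Khan", "Raj Patel"],
})

# Stand-in for the prediction frame (illustrative values, not real model output)
fake_predictions = pd.DataFrame({
    "name": ["Raj Patel", "Fatima Khan", "Raj Patel"],
    "pred_label": ["not-muslim", "muslim", "not-muslim"],
    "pred_prob_muslim": [20.0, 85.0, 20.0],
})

# Keep one prediction row per name so the merge cannot duplicate employees
unique_predictions = fake_predictions.drop_duplicates(subset="name")

merged = employees.merge(
    unique_predictions,
    on="name",
    how="left",
    validate="many_to_one",  # raises MergeError if a name repeats on the right
)

print(merged)
print(f"Rows before: {len(employees)}, rows after: {len(merged)}")
```

The `validate` argument is optional, but it turns a silent row explosion into an immediate error, which is usually easier to debug.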

## 📈 Data Analysis with Predictions

Now let's perform some analysis using the religion predictions:

In [None]:
# Basic statistics
print("Religion distribution in our dataset:")
religion_counts = df_with_predictions["pred_label"].value_counts()
print(religion_counts)
print("\nPercentage breakdown:")
print(religion_counts / len(df_with_predictions) * 100)

In [None]:
# Average salary by predicted religion
print("Salary analysis by predicted religion:")
salary_by_religion = df_with_predictions.groupby("pred_label")["salary"].agg([
    'mean', 'median', 'min', 'max', 'count'
])
print(salary_by_religion)

In [None]:
# Department distribution by predicted religion
print("Department vs Religion cross-tabulation:")
dept_religion = pd.crosstab(
    df_with_predictions["department"],
    df_with_predictions["pred_label"],
    margins=True
)
print(dept_religion)
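
Raw counts are hard to compare when departments differ in size. As a small sketch, `pd.crosstab` can also report row-wise proportions through its `normalize` argument. The data here is a made-up stand-in, not model output.

```python
import pandas as pd

# Stand-in data with made-up labels, for illustration only
demo = pd.DataFrame({
    "department": ["Engineering", "Engineering", "Finance", "HR"],
    "pred_label": ["muslim", "not-muslim", "muslim", "not-muslim"],
})

# normalize="index" turns each row into proportions that sum to 1
row_share = pd.crosstab(demo["department"], demo["pred_label"], normalize="index")

# Show the proportions as percentages
print((row_share * 100).round(1))
```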

## 🎯 Confidence-Based Analysis

Not all predictions are equally certain. Let's analyze the confidence levels and filter based on them:

In [None]:
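# Confidence is the probability of the predicted class: max(p, 100 - p),
# where p is pred_prob_muslim on a 0-100 scale.
# It ranges from 50 (no better than a coin flip) to 100 (certain).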
df_with_predictions['confidence'] = df_with_predictions['pred_prob_muslim'].apply(
    lambda x: max(x, 100 - x)
)

# Show confidence distribution
print("Detailed prediction analysis:")
print("=" * 70)
print(f"{'Name':<18} | {'Prediction':<10} | {'Muslim %':<8} | {'Confidence':<10}")
print("-" * 70)

for _, row in df_with_predictions.iterrows():
    print(f"{row['name']:<18} | {row['pred_label']:<10} | {row['pred_prob_muslim']:>7.1f}% | {row['confidence']:>9.1f}%")
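
For larger DataFrames, the same confidence score can be computed without a row-wise `apply`. The sketch below assumes probabilities on a 0-100 scale, as in the cell above, and uses made-up values.

```python
import pandas as pd

# Illustrative probabilities on a 0-100 scale (made-up values)
prob = pd.Series([5.0, 50.0, 72.5, 99.0], name="pred_prob_muslim")

# Distance from the 50% midpoint, shifted back onto a 50-100 scale;
# equivalent to max(p, 100 - p) for each value
confidence = (prob - 50).abs() + 50

# The row-wise version from the cell above, for comparison
confidence_apply = prob.apply(lambda p: max(p, 100 - p))

print(pd.DataFrame({"prob": prob, "confidence": confidence}))
```

Both forms give the same numbers; the vectorized one simply avoids calling a Python function once per row.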

In [None]:
# Filter high-confidence predictions (>90%)
high_confidence_mask = df_with_predictions['confidence'] > 90
high_confidence_df = df_with_predictions[high_confidence_mask]

print("High-confidence predictions (confidence > 90%):")
print(f"Found {len(high_confidence_df)} out of {len(df_with_predictions)} predictions")
print("\nHigh-confidence results:")
print(high_confidence_df[['name', 'pred_label', 'pred_prob_muslim', 'confidence']])

In [None]:
# Confidence level categorization
df_with_predictions['confidence_level'] = pd.cut(
    df_with_predictions['confidence'],
    bins=[0, 70, 85, 95, 100],
    labels=['Low', 'Medium', 'High', 'Very High'],
    include_lowest=True
)

print("Confidence level distribution:")
conf_dist = df_with_predictions['confidence_level'].value_counts().sort_index()
print(conf_dist)
print("\nPercentage:")
print(conf_dist / len(df_with_predictions) * 100)
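
It helps to know how `pd.cut` treats values that land exactly on a bin edge. The sketch below reuses the bins from the cell above on made-up confidence values chosen to sit on or near the edges.

```python
import pandas as pd

# Made-up confidence values chosen to sit on or near the bin edges
confidence = pd.Series([50.0, 70.0, 70.1, 85.0, 95.0, 100.0])

levels = pd.cut(
    confidence,
    bins=[0, 70, 85, 95, 100],
    labels=["Low", "Medium", "High", "Very High"],
    include_lowest=True,
)

# Intervals are closed on the right: 70 is "Low", 70.1 is "Medium"
print(pd.DataFrame({"confidence": confidence, "level": levels}))
```

Because the intervals are closed on the right, a confidence of exactly 95 is labeled "High" rather than "Very High". Note also that the "High" bin here covers 85 to 95, which is not the same cutoff as the `> 90` filter used earlier.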

## 💾 Saving and Exporting Results

Let's save our enriched dataset to CSV and read it back to confirm what was written:

In [None]:
# Prepare final dataset with clean column names
final_df = df_with_predictions[[
    'employee_id', 'name', 'department', 'salary',
    'pred_label', 'pred_prob_muslim', 'confidence', 'confidence_level'
]].rename(columns={
    'pred_label': 'predicted_religion',
    'pred_prob_muslim': 'muslim_probability',
    'confidence': 'prediction_confidence'
})

print("Final dataset with clean column names:")
print(final_df)
print(f"\nDataset shape: {final_df.shape}")

In [None]:
# Save to CSV (most common format)
output_file = "employee_predictions.csv"
final_df.to_csv(output_file, index=False)
print(f"✅ Results saved to {output_file}")

# Show what was saved
print("\nSaved data preview:")
saved_df = pd.read_csv(output_file)
print(saved_df.head())
print("\nFile info:")
print(f"- Rows: {len(saved_df)}")
print(f"- Columns: {list(saved_df.columns)}")
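
CSV stores plain text, so pandas-specific dtypes such as the categorical `confidence_level` column do not survive a round trip. The sketch below, using a small made-up stand-in for the enriched dataset, shows this and also writes JSON Lines as one alternative format. The file names and the temporary directory are illustrative choices.

```python
import os
import tempfile

import pandas as pd

# Small stand-in for the enriched dataset (illustrative values only)
demo_df = pd.DataFrame({
    "name": ["Example One", "Example Two"],
    "prediction_confidence": [91.0, 64.0],
})
demo_df["confidence_level"] = pd.cut(
    demo_df["prediction_confidence"],
    bins=[0, 70, 85, 95, 100],
    labels=["Low", "Medium", "High", "Very High"],
    include_lowest=True,
)

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "demo.csv")
    json_path = os.path.join(tmp_dir, "demo.jsonl")

    demo_df.to_csv(csv_path, index=False)
    # One JSON object per line; convenient for streaming and log pipelines
    demo_df.to_json(json_path, orient="records", lines=True)

    # Read both files back before the temporary directory is removed
    from_csv = pd.read_csv(csv_path)
    from_json = pd.read_json(json_path, lines=True)

print("Original dtype:", demo_df["confidence_level"].dtype)
print("After CSV round trip:", from_csv["confidence_level"].dtype)
print(from_json)
```

If the categorical ordering matters downstream, it can be rebuilt after loading with `pd.Categorical(..., categories=[...], ordered=True)`.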

In [None]:
# Clean up the demo file
import os

if os.path.exists(output_file):
    os.remove(output_file)
    print(f"🧹 Demo file {output_file} removed")

## 🔍 Advanced Analytics Example

Let's create a summary report of our analysis:

In [None]:
# Create a comprehensive summary
print("📊 EMPLOYEE RELIGION PREDICTION ANALYSIS REPORT")
print("=" * 60)

# Dataset overview
total_employees = len(final_df)
print("\n📋 Dataset Overview:")
print(f"   Total employees analyzed: {total_employees}")
print(f"   Departments: {final_df['department'].nunique()} ({', '.join(final_df['department'].unique())})")
print(f"   Salary range: ${final_df['salary'].min():,} - ${final_df['salary'].max():,}")

# Religion predictions
religion_summary = final_df['predicted_religion'].value_counts()
print("\n🔮 Religion Predictions:")
for religion, count in religion_summary.items():
    pct = count / total_employees * 100
    print(f"   {religion.title()}: {count} employees ({pct:.1f}%)")

# Confidence analysis
avg_confidence = final_df['prediction_confidence'].mean()
high_conf_count = (final_df['prediction_confidence'] > 90).sum()
print("\n📈 Confidence Analysis:")
print(f"   Average confidence: {avg_confidence:.1f}%")
print(f"   High confidence predictions (>90%): {high_conf_count}/{total_employees} ({high_conf_count/total_employees*100:.1f}%)")

# Department insights
dept_analysis = final_df.groupby('department').agg({
    'predicted_religion': lambda x: x.value_counts().index[0],  # most common religion
    'prediction_confidence': 'mean',
    'salary': 'mean'
})
print("\n🏢 Department Analysis:")
for dept in dept_analysis.index:
    most_common = dept_analysis.loc[dept, 'predicted_religion']
    avg_conf = dept_analysis.loc[dept, 'prediction_confidence']
    avg_sal = dept_analysis.loc[dept, 'salary']
    print(f"   {dept}: Mostly {most_common} (avg confidence: {avg_conf:.1f}%, avg salary: ${avg_sal:,.0f})")

print("\n✅ Analysis complete!")

## Key Takeaways

🐼 **Pandas Integration**: `pred_rel` accepts a pandas Series of names and returns a DataFrame of predictions  
🔗 **Easy Merging**: Use `.merge()` on the `name` column to combine predictions with existing data  
📊 **Rich Analytics**: Leverage pandas' groupby, crosstab, and aggregation functions  
🎯 **Confidence Filtering**: Use confidence scores to keep only the more reliable predictions  
💾 **Export Ready**: Save enriched datasets to CSV, Excel, or other formats  
📈 **Business Insights**: Transform name data into actionable demographic insights  

## Next Steps

- **[CSV Processing](csv_processing.ipynb)**: Learn to process large CSV files
- **[Performance Benchmarks](performance_benchmarks.ipynb)**: Optimize for large datasets
- **[Basic Usage](basic_usage.ipynb)**: Review fundamental concepts

## Best Practices

1. **Always check confidence scores** - Don't trust all predictions equally
2. **Use batch processing** - Process multiple names at once for efficiency
3. **Handle missing data** - Check for NaN values in name columns before processing
4. **Validate results** - Spot-check predictions against domain knowledge
5. **Document assumptions** - Note the model's limitations and biases in your analysis
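
Practices 2 and 3 can be combined into a small preprocessing step that runs before any prediction call. The helper below, `clean_names`, is hypothetical and not part of pranaam. It is a minimal sketch that drops missing and empty names, trims whitespace, and removes duplicates so each distinct name is predicted only once.

```python
import pandas as pd


def clean_names(names: pd.Series) -> pd.Series:
    """Return the unique, non-empty, whitespace-trimmed names from a Series.

    Hypothetical helper for illustration; not part of pranaam.
    """
    cleaned = names.dropna().astype(str).str.strip()
    cleaned = cleaned[cleaned != ""]
    return cleaned.drop_duplicates().reset_index(drop=True)


raw = pd.Series(["Priya Sharma", None, "  Raj Patel ", "", "Priya Sharma"])
to_predict = clean_names(raw)
print(to_predict.tolist())

# The cleaned Series could then be passed to the prediction call in one batch,
# for example pranaam.pred_rel(to_predict, lang="eng") as in the cells above.
```

When merging predictions back, apply the same trimming to the DataFrame's `name` column so the keys match.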