# 🤖 AI-Powered Data Analytics Demo

Welcome to the future of data analysis! This demo showcases how AI can accelerate and enhance your data analytics workflows.

## 📋 What You'll Learn
- Set up AI tools for data work (OpenAI/Azure OpenAI)
- Use AI for automated data cleaning and exploration
- Generate analysis code through natural language prompts
- Create visualizations with AI assistance
- Build comprehensive reports using AI
- Understand best practices and limitations

## ⚡ Prerequisites
- Python 3.8+ (newer releases of the `openai` package no longer support 3.7)
- OpenAI API key (or Azure OpenAI access)
- Basic understanding of data analysis concepts

## 🚀 Let's Begin!
Follow along as we transform how you approach data analytics with AI assistance.

## 🔧 Section 1: Setting Up AI Tools for Data Work

Before we begin, we need to configure our environment for AI-assisted data analysis.

### Step 1: Install Required Libraries

In [None]:
# Install required libraries (run this once)
# The client code below uses the openai>=1.0 interface (openai.OpenAI)
# !pip install "openai>=1.0" pandas numpy matplotlib seaborn plotly python-dotenv

# Import essential libraries
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from dotenv import load_dotenv
import openai
import json
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

print("✅ Libraries imported successfully!")
print(f"OpenAI version: {openai.__version__}")
print(f"Pandas version: {pd.__version__}")

In [None]:
# Load environment variables from .env file
load_dotenv()

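# Set up OpenAI client
client = None  # stays None if no API key is found, so later cells can check for it
model_name = None
try: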
    # Option 1: Standard OpenAI API
    if os.getenv('OPENAI_API_KEY'):
        client = openai.OpenAI(api_key=os.getenv('OPENAI_API_KEY'))
        model_name = os.getenv('OPENAI_MODEL', 'gpt-4')
        print("✅ OpenAI API configured successfully!")
    
    # Option 2: Azure OpenAI (uncomment if using Azure)
    # elif os.getenv('AZURE_OPENAI_KEY'):
    #     client = openai.AzureOpenAI(
    #         api_key=os.getenv('AZURE_OPENAI_KEY'),
    #         api_version=os.getenv('AZURE_OPENAI_VERSION', '2024-02-15-preview'),
    #         azure_endpoint=os.getenv('AZURE_OPENAI_ENDPOINT')
    #     )
    #     # On Azure, the `model` argument is the name of your deployment
    #     model_name = os.getenv('AZURE_OPENAI_MODEL', 'gpt-4')
    #     print("✅ Azure OpenAI configured successfully!")
    
    else:
        print("❌ No API key found! Please check your .env file.")
        print("Make sure you have either OPENAI_API_KEY or AZURE_OPENAI_KEY set.")
        
except Exception as e:
    print(f"❌ Error setting up AI client: {e}")
    print("Please check your API key and try again.")
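
Before calling the API, it can help to confirm which of the variables read by the setup cell are actually set, without echoing a full secret to the notebook output. Here is a minimal sketch, assuming a hypothetical helper named `env_status` (it is not part of this demo) and a made-up environment dictionary:

```python
import os

# Variable names come from the setup cell; the helper itself is hypothetical.
EXPECTED_VARS = ["OPENAI_API_KEY", "OPENAI_MODEL", "AZURE_OPENAI_KEY", "AZURE_OPENAI_ENDPOINT"]


def env_status(names, env=None):
    """Return {name: masked value or None} without exposing full secrets."""
    env = os.environ if env is None else env
    status = {}
    for name in names:
        value = env.get(name)
        if not value:
            status[name] = None  # missing or empty
        elif "KEY" in name:
            # Show only the last 4 characters of anything that looks like a secret.
            status[name] = "*" * max(len(value) - 4, 0) + value[-4:]
        else:
            status[name] = value
    return status


# Usage with a made-up environment so no real secret is involved.
fake_env = {"OPENAI_API_KEY": "sk-abcdef1234", "OPENAI_MODEL": "gpt-4"}
for name, shown in env_status(EXPECTED_VARS, env=fake_env).items():
    print(f"{name}: {shown if shown is not None else 'NOT SET'}")
```

Calling `env_status(EXPECTED_VARS)` without the `env` argument checks the real environment after `load_dotenv()` has run.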

In [None]:
# Helper function for AI API calls
def ask_ai(prompt, temperature=0.3, max_tokens=1000):
    """
    Send a prompt to the AI and get a response
    """
    try:
        response = client.chat.completions.create(
            model=model_name,
            messages=[
                {"role": "system", "content": "You are an expert data analyst assistant. Provide clear, accurate, and actionable insights."},
                {"role": "user", "content": prompt}
            ],
            temperature=temperature,
            max_tokens=max_tokens
        )
        return response.choices[0].message.content
    except Exception as e:
        return f"Error: {e}"

# Test the connection
test_response = ask_ai("Say 'Hello! AI is ready for data analytics!' in a professional tone.")
print("🤖 AI Response:", test_response)
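
`ask_ai` returns an error string on the first failure. Transient problems such as rate limits or timeouts are often worth retrying instead. The sketch below shows one way to do that with exponential backoff. The helper `with_retries` and its parameters are hypothetical, and it retries on any `Exception`, which you would normally narrow to the transient errors your client raises:

```python
import time


def with_retries(func, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on an exception, wait and retry with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))  # 1s, 2s, 4s, ...


# Usage with a stand-in function that fails twice before succeeding.
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("temporary failure")
    return "ok"


delays = []
result = with_retries(flaky, sleep=delays.append)  # record delays instead of sleeping
print(result, delays)
```

Because `ask_ai` catches exceptions itself, a wrapper like this would need to sit around the `client.chat.completions.create` call inside it rather than around `ask_ai`.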

## 🎯 Section 2: First AI-Assisted Data Analysis Demo

Let's dive into our first AI-assisted data analysis using the retail sales sample data!

In [None]:
# Load the sample retail sales data
df = pd.read_csv('retail-sales-sample.csv')

print("📊 Dataset loaded successfully!")
print(f"Shape: {df.shape}")
print("\nFirst 5 rows:")
print(df.head())

print("\nDataset Info:")
df.info()  # info() prints its own output and returns None, so print() would add a stray "None"

print("\nBasic Statistics:")
print(df.describe())

In [None]:
# Let's ask AI to analyze our dataset structure
dataset_info = f"""
Dataset: Retail Sales Data
Columns: {list(df.columns)}
Shape: {df.shape}
Data types: {df.dtypes.to_dict()}
Sample data: {df.head(3).to_dict()}
"""

ai_analysis = ask_ai(f"""
Analyze this retail sales dataset and provide:
1. Key insights about the data structure
2. Potential data quality issues to watch for
3. 3 interesting business questions we could explore
4. Recommended next steps for analysis

Dataset info: {dataset_info}
""")

print("🤖 AI Analysis:")
print(ai_analysis)
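
The dataset description above embeds `df.dtypes.to_dict()` and `df.head(3).to_dict()` directly, which produces verbose, column-first output. One possible alternative is a compact JSON summary. This is a sketch only: `build_dataset_context` is a hypothetical helper, and the toy DataFrame stands in for the retail data:

```python
import json

import pandas as pd


def build_dataset_context(frame, n_rows=3):
    """Summarize a DataFrame as a compact JSON string for use in a prompt."""
    context = {
        "shape": list(frame.shape),
        # dtype objects are converted to short strings such as "float64"
        "dtypes": {col: str(dtype) for col, dtype in frame.dtypes.items()},
        # one dict per row is easier to read than to_dict()'s column-first layout
        "sample_rows": frame.head(n_rows).to_dict(orient="records"),
        "missing_values": {col: int(n) for col, n in frame.isna().sum().items()},
    }
    # default=str covers values json cannot encode directly, e.g. Timestamps
    return json.dumps(context, indent=2, default=str)


# Toy data; these column names are illustrative, not the real dataset's.
toy = pd.DataFrame({
    "region": ["North", "South", "North"],
    "price": [4.5, 3.0, None],
    "quantity": [10, 7, 3],
})
print(build_dataset_context(toy, n_rows=2))
```

Including the missing-value counts gives the model something concrete to base its data quality comments on.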

## 🧹 Section 3: AI for Data Cleaning

Now let's use AI to help us clean and prepare our data for analysis.

In [None]:
# Ask AI to generate comprehensive data quality check code
quality_check_prompt = f"""
Create Python code to perform comprehensive data quality checks on this retail sales dataset:
Columns: {list(df.columns)}
Data types: {df.dtypes.to_dict()}

Include checks for:
1. Missing values
2. Duplicate records
3. Outliers in numerical columns
4. Data type consistency
5. Date format validation
6. Categorical value validation

Return only the Python code with comments.
"""

ai_code = ask_ai(quality_check_prompt, temperature=0.1)
print("🤖 AI-Generated Data Quality Check Code:")
print(ai_code)

In [None]:
# Let's run our own data quality checks based on AI suggestions
print("📋 Data Quality Assessment")
print("=" * 50)

# 1. Missing Values Check
print("1. Missing Values:")
missing_data = df.isnull().sum()
print(missing_data[missing_data > 0] if missing_data.sum() > 0 else "✅ No missing values found")

# 2. Duplicate Records Check
print(f"\n2. Duplicate Records: {df.duplicated().sum()}")

# 3. Date Validation
print("\n3. Date Format Check:")
try:
    df['date_parsed'] = pd.to_datetime(df['date'])
    print("✅ Date format is valid")
    print(f"Date range: {df['date_parsed'].min()} to {df['date_parsed'].max()}")
except (ValueError, TypeError) as e:
    # Catch only parsing errors so unrelated problems are not silently hidden
    print(f"❌ Date format issues detected: {e}")

# 4. Numerical Outliers (using IQR method)
print(f"\n4. Outlier Detection:")
numerical_cols = ['price', 'quantity', 'customer_satisfaction', 'return_rate']
for col in numerical_cols:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    outliers = df[(df[col] < (Q1 - 1.5 * IQR)) | (df[col] > (Q3 + 1.5 * IQR))]
    print(f"  {col}: {len(outliers)} outliers detected")

# 5. Categorical Value Validation
print(f"\n5. Categorical Data Summary:")
categorical_cols = ['category', 'region', 'sales_rep']
for col in categorical_cols:
    print(f"  {col}: {df[col].nunique()} unique values")
    print(f"    Values: {list(df[col].unique())}")

print("\n✅ Data quality assessment complete!")
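
The outlier check above uses the IQR rule: a value is flagged when it falls below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR. Here is a small worked example on made-up numbers. It uses the standard library's `statistics.quantiles` with `method="inclusive"`, which should interpolate the same way as the default pandas `quantile`. The helper name `iqr_outliers` is hypothetical:

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return (lower_bound, upper_bound, outliers) using the IQR rule."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return lower, upper, outliers


# Made-up prices; 100 is the obvious outlier.
prices = [1, 2, 3, 4, 100]
lower, upper, flagged = iqr_outliers(prices)
print(lower, upper, flagged)  # Q1 = 2, Q3 = 4, IQR = 2, so the bounds are -1 and 7
```

A flagged value is only a candidate for review. Whether it is an error or a genuine large order is a domain question.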

## 🔍 Section 4: AI-Powered Data Exploration

Let's use natural language to explore our data and generate insights!

In [None]:
# Natural Language Data Query Function
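def query_data_with_ai(question, dataframe=None):
    """
    Ask questions about the data in natural language
    """
    # Resolve the default at call time so a reloaded or reassigned df is picked up
    if dataframe is None:
        dataframe = df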
    data_context = f"""
    Dataset info:
    - Shape: {dataframe.shape}
    - Columns: {list(dataframe.columns)}
    - Sample data: {dataframe.head(3).to_dict()}
    - Data types: {dataframe.dtypes.to_dict()}
    """
    
    prompt = f"""
    You are a data analyst. Answer this question about the retail sales dataset:
    
    Question: {question}
    
    {data_context}
    
    Provide:
    1. A direct answer to the question
    2. Python code to verify your answer (if applicable)
    3. Any interesting insights or patterns you notice
    
    Be specific and actionable.
    """
    
    return ask_ai(prompt)

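# Let's try some natural language queries
# Note: the prompt only contains 3 sample rows, so treat the direct answers as
# hypotheses and run the suggested verification code against the full data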
questions = [
    "Which region has the highest average sales?",
    "What's the relationship between price and customer satisfaction?",
    "Which sales rep is performing best?",
    "Are there any seasonal patterns in the data?"
]

for i, question in enumerate(questions, 1):
    print(f"\n{'='*60}")
    print(f"❓ Question {i}: {question}")
    print("🤖 AI Response:")
    response = query_data_with_ai(question)
    print(response)

## 🎯 Section 5: Prompt Engineering for Data Tasks

Learn how to craft effective prompts for better AI assistance in data analysis.

In [None]:
# Prompt Engineering Examples

# Example 1: Vague vs Specific Prompts
print("🔴 POOR PROMPT:")
poor_prompt = "Analyze the data"
print(f"Prompt: '{poor_prompt}'")
response = ask_ai(poor_prompt)
print("Response:", response[:200] + "...")

print("\n" + "="*50)

print("\n🟢 GOOD PROMPT:")
good_prompt = f"""
Analyze this retail sales dataset and provide:

Context: E-commerce ice cream sales data with {df.shape[0]} records
Goal: Identify top 3 business improvement opportunities

Dataset summary:
- Columns: {list(df.columns)}
- Date range: {df['date'].min()} to {df['date'].max()}
- Key metrics: price, quantity, customer_satisfaction

Required output:
1. Top 3 specific findings with supporting data
2. Python code to verify each finding
3. Business recommendations

Focus on actionable insights that could increase revenue or customer satisfaction.
"""

print("Prompt:", good_prompt)
response = ask_ai(good_prompt)
print("🤖 AI Response:")
print(response)
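
The good prompt above has four parts: context, goal, dataset summary, and required output. If you write many prompts with this shape, a small template function keeps them consistent. This is a sketch only: `build_prompt` is a hypothetical helper, and the values passed to it are made up:

```python
def build_prompt(context, goal, data_summary, required_output):
    """Assemble a structured prompt from the four parts of the good example."""
    summary_lines = "\n".join(f"- {key}: {value}" for key, value in data_summary.items())
    output_lines = "\n".join(
        f"{i}. {item}" for i, item in enumerate(required_output, start=1)
    )
    return (
        f"Context: {context}\n"
        f"Goal: {goal}\n\n"
        f"Dataset summary:\n{summary_lines}\n\n"
        f"Required output:\n{output_lines}"
    )


prompt = build_prompt(
    context="Retail sales data with 500 records",  # record count is made up
    goal="Identify top 3 business improvement opportunities",
    data_summary={"Columns": ["date", "region", "price"], "Date range": "illustrative"},
    required_output=[
        "Top 3 findings with supporting data",
        "Python code to verify each finding",
    ],
)
print(prompt)
```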

## 💻 Section 6: AI for Code Generation

Watch AI generate analysis code from natural language descriptions!

In [None]:
# AI Code Generator Function
def generate_analysis_code(description):
    """
    Generate Python code for data analysis tasks
    """
    prompt = f"""
    Generate Python pandas code for this analysis task:
    
    Task: {description}
    
    Dataset variable name: df
    Available columns: {list(df.columns)}
    
    Requirements:
    1. Include comments explaining each step
    2. Use proper pandas/matplotlib syntax
    3. Handle potential errors gracefully
    4. Return only the Python code
    
    Code:
    """
    
    return ask_ai(prompt, temperature=0.1)

# Example code generation requests
tasks = [
    "Create a summary table showing average sales by region and category",
    "Build a correlation matrix for numerical columns with visualization",
    "Find the top 5 best-performing products by total revenue",
    "Create a time series plot showing sales trends over time"
]

for i, task in enumerate(tasks, 1):
    print(f"\n{'='*60}")
    print(f"📝 Task {i}: {task}")
    print("\n🤖 Generated Code:")
    code = generate_analysis_code(task)
    print(code)
    print("\n" + "-"*40 + " EXECUTING CODE " + "-"*40)
    
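    # Models often wrap code in Markdown fences; drop those lines so exec() receives plain Python
    fence = "`" * 3
    runnable = "\n".join(line for line in code.splitlines() if not line.strip().startswith(fence))
    
    try:
        # Caution: exec() runs arbitrary code with full access to this session.
        # Review AI-generated code before running it outside a demo.
        exec(runnable)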
    except Exception as e:
        print(f"Error executing code: {e}")
        print("Code may need manual review and adjustment.")
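
Running AI-generated code directly is risky, and model replies often mix prose with fenced code. A lightweight precaution is to extract the code, parse it, and look for obviously dangerous constructs before deciding whether to run it. The sketch below is one way to do that. `extract_code`, `find_problems`, and the deny-lists are hypothetical, and a static check like this is not a sandbox:

```python
import ast
import re

# Matches a fenced block with or without a "python" tag; `{3} means three backticks.
FENCE_RE = re.compile(r"`{3}(?:python|py)?[ \t]*\n(.*?)`{3}", re.DOTALL)

# Hypothetical deny-lists; a real review would be stricter than this.
BLOCKED_MODULES = {"os", "subprocess", "shutil", "sys"}
BLOCKED_CALLS = {"exec", "eval", "open", "__import__"}


def extract_code(response):
    """Return the first fenced code block, or the whole response if none is found."""
    match = FENCE_RE.search(response)
    return (match.group(1) if match else response).strip()


def find_problems(code):
    """Parse the code and list syntax errors or deny-listed imports and calls."""
    try:
        tree = ast.parse(code)
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg}"]
    problems = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name.split(".")[0] for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [(node.module or "").split(".")[0]]
        else:
            names = []
        problems += [f"blocked import: {n}" for n in names if n in BLOCKED_MODULES]
        if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
                and node.func.id in BLOCKED_CALLS):
            problems.append(f"blocked call: {node.func.id}")
    return problems


# A made-up model reply: prose followed by a fenced code block.
reply = "Here is the code:\n" + "`" * 3 + "python\nimport os\nprint(1 + 1)\n" + "`" * 3
code_only = extract_code(reply)
print(code_only)
print(find_problems(code_only))
```

An empty list from `find_problems` only means nothing on the deny-list was found. It does not make the code safe, so a human review is still the real safeguard.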

## 📊 Section 7: AI-Generated Visualizations

Let AI suggest and create the perfect visualizations for your data!

In [None]:
# AI Visualization Recommender
def recommend_visualizations(analysis_goal):
    """
    Get AI recommendations for the best visualizations
    """
    prompt = f"""
    I want to analyze: {analysis_goal}
    
    Dataset context:
    - Retail sales data with {df.shape[0]} records
    - Columns: {list(df.columns)}
    - Numerical: price, quantity, customer_satisfaction, return_rate
    - Categorical: name, product, category, sales_rep, region
    - Temporal: date
    
    Recommend:
    1. The 2 best chart types for this analysis
    2. Python code using matplotlib/seaborn/plotly
    3. Why these charts are most effective
    4. What insights they might reveal
    
    Include complete, executable Python code.
    """
    
    return ask_ai(prompt, temperature=0.2)

# Example visualization requests
viz_goals = [
    "Compare performance across different regions",
    "Show the relationship between price and customer satisfaction",
    "Identify sales trends over time",
    "Display the distribution of return rates by category"
]

for goal in viz_goals:
    print(f"\n{'='*60}")
    print(f"🎯 Analysis Goal: {goal}")
    print("\n🤖 AI Recommendation:")
    recommendation = recommend_visualizations(goal)
    print(recommendation)

In [None]:
# Let's create some quick visualizations
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# 1. Regional Performance
regional_sales = df.groupby('region').agg({
    'quantity': 'sum',
    'customer_satisfaction': 'mean'
}).round(2)

axes[0,0].bar(regional_sales.index, regional_sales['quantity'])
axes[0,0].set_title('Total Quantity Sold by Region')
axes[0,0].set_ylabel('Quantity')

# 2. Price vs Satisfaction
axes[0,1].scatter(df['price'], df['customer_satisfaction'], alpha=0.6)
axes[0,1].set_title('Price vs Customer Satisfaction')
axes[0,1].set_xlabel('Price ($)')
axes[0,1].set_ylabel('Customer Satisfaction')

# 3. Sales Trends
df['date_parsed'] = pd.to_datetime(df['date'])
daily_sales = df.groupby('date_parsed')['quantity'].sum()
axes[1,0].plot(daily_sales.index, daily_sales.values)
axes[1,0].set_title('Sales Trends Over Time')
axes[1,0].set_xlabel('Date')
axes[1,0].set_ylabel('Quantity Sold')
axes[1,0].tick_params(axis='x', rotation=45)

# 4. Return Rate Distribution
df.boxplot(column='return_rate', by='category', ax=axes[1,1])
fig.suptitle('')  # remove the automatic "Boxplot grouped by ..." figure title
axes[1,1].set_title('Return Rate Distribution by Category')
axes[1,1].set_xlabel('Category')
axes[1,1].set_ylabel('Return Rate')

plt.tight_layout()
plt.show()

print("📊 Visualizations created! AI can help you choose the right charts for your analysis goals.")

## 📈 Section 8: AI for Statistical Analysis

Use AI to choose appropriate statistical tests and interpret results!

In [None]:
# AI Statistical Analysis Advisor
def get_statistical_advice(research_question):
    """
    Get AI advice on statistical analysis approach
    """
    prompt = f"""
    Research Question: {research_question}
    
    Dataset Context:
    - Sample size: {df.shape[0]} records
    - Variables: {list(df.columns)}
    - Numerical variables: price, quantity, customer_satisfaction, return_rate
    - Categorical variables: region, category, sales_rep
    - Data appears to be retail sales transactions
    
    Provide:
    1. Appropriate statistical test(s) to use
    2. Assumptions to check before running the test
    3. Python code to perform the analysis
    4. How to interpret the results
    5. Potential limitations or caveats
    
    Be specific and practical.
    """
    
    return ask_ai(prompt, temperature=0.1)

# Example statistical questions
stat_questions = [
    "Is there a significant difference in customer satisfaction between regions?",
    "Does price have a statistically significant correlation with return rate?",
    "Are sales representatives performing significantly differently from each other?"
]

for question in stat_questions:
    print(f"\n{'='*60}")
    print(f"❓ Research Question: {question}")
    print("\n🤖 AI Statistical Advice:")
    advice = get_statistical_advice(question)
    print(advice)
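
Whatever test the AI suggests, it helps to have a simple baseline you can run and understand yourself. One option for a question like "do two regions differ in satisfaction?" is a permutation test on the difference in means, which needs only the standard library. This is a sketch with made-up scores for two groups. `permutation_test` is a hypothetical helper, and comparing more than two regions would need a different test:

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation test for a difference in group means.

    Returns the observed difference and an approximate p-value.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Made-up satisfaction scores for two regions.
north = [4.5, 4.7, 4.6, 4.8, 4.4, 4.9]
south = [3.9, 4.0, 4.1, 3.8, 4.2, 4.0]
observed, p_value = permutation_test(north, south)
print(f"difference in means: {observed:.2f}, approximate p-value: {p_value:.4f}")
```

A small p-value here says the difference is unlikely under random relabeling of the scores. It says nothing about why the regions differ.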

## 📝 Section 9: Automated Reporting with AI

Generate comprehensive analysis reports automatically!

In [None]:
# AI Report Generator
def generate_executive_report():
    """
    Generate a comprehensive executive report using AI
    """
    # Gather key statistics
    total_revenue = (df['price'] * df['quantity']).sum()
    avg_satisfaction = df['customer_satisfaction'].mean()
    top_region = df.groupby('region')['quantity'].sum().idxmax()
    best_product = df.groupby('product')['quantity'].sum().idxmax()
    
    report_data = f"""
    Dataset Summary:
    - Total Records: {df.shape[0]}
    - Date Range: {df['date'].min()} to {df['date'].max()}
    - Total Revenue: ${total_revenue:,.2f}
    - Average Customer Satisfaction: {avg_satisfaction:.2f}/5.0
    - Top Performing Region: {top_region}
    - Best Selling Product: {best_product}
    
    Regional Performance:
    {df.groupby('region').agg({'quantity': 'sum', 'customer_satisfaction': 'mean'}).to_string()}
    
    Category Analysis:
    {df.groupby('category').agg({'quantity': 'sum', 'return_rate': 'mean'}).to_string()}
    """
    
    prompt = f"""
    Create an executive summary report for retail sales performance based on this data analysis:
    
    {report_data}
    
    Structure the report with:
    1. Executive Summary (key findings)
    2. Performance Highlights
    3. Areas of Concern
    4. Strategic Recommendations
    5. Next Steps
    
    Write in a professional business tone suitable for C-level executives.
    Include specific metrics and actionable insights.
    """
    
    return ask_ai(prompt, max_tokens=1500)

# Generate the report
print("📊 AUTOMATED EXECUTIVE REPORT")
print("=" * 80)

report = generate_executive_report()
print(report)

print("\n" + "=" * 80)
print("✅ Report generated automatically using AI!")
print("💡 This report can be customized, exported to PDF, or integrated into dashboards.")
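
To keep a generated report around, the simplest export is a timestamped Markdown file. This is a minimal sketch. `save_report` and the file naming scheme are hypothetical, and the usage writes placeholder text to a temporary directory:

```python
from datetime import datetime
from pathlib import Path
import tempfile


def save_report(report_text, directory, title="Executive Report"):
    """Write the report to a timestamped Markdown file and return its path."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    now = datetime.now()
    path = directory / f"report_{now:%Y%m%d_%H%M%S}.md"
    header = f"# {title}\n\nGenerated: {now:%Y-%m-%d %H:%M}\n\n"
    path.write_text(header + report_text + "\n", encoding="utf-8")
    return path


# Usage with placeholder text and a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_report("Placeholder report body.", tmp)
    content = saved.read_text(encoding="utf-8")
    print(saved.name)
    print(content)
```

In the notebook you would pass the `report` string and a real output folder instead of the placeholder values.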

## ⚠️ Section 10: Best Practices and Limitations

Understanding when and how to use AI effectively in data analytics.

In [None]:
# Best Practices for AI-Assisted Data Analysis

print("🔍 VERIFICATION STRATEGIES")
print("=" * 50)
print("""
1. Always verify AI-generated code before executing
2. Cross-check AI insights with manual analysis
3. Use multiple AI queries for complex questions
4. Validate statistical claims with domain knowledge
5. Test AI code on a subset of the data first
""")

print("\n✅ WHEN TO USE AI:")
print("""
• Exploratory data analysis and pattern discovery
• Code generation for routine tasks
• Documentation and report writing
• Statistical test selection guidance
• Visualization recommendations
• Data cleaning automation
""")

print("\n❌ WHEN NOT TO USE AI:")
print("""
• Critical business decisions without human review
• Sensitive or proprietary data analysis
• Complex statistical modeling without validation
• Final production code without testing
• Regulatory compliance reporting
• Domain-specific analysis requiring expertise
""")

print("\n🛡️ DATA PRIVACY CONSIDERATIONS:")
print("""
• Remove or anonymize PII before AI analysis
• Check company policies on AI tool usage
• Consider data residency and storage policies
• Be aware of AI model training implications
• Use synthetic data for training and demos
""")

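# Example: Verifying AI insights
claimed_region = "North"  # hypothetical AI claim, used only for this example
print("\n🔬 VERIFICATION EXAMPLE:")
print(f"AI claimed: '{claimed_region} region has highest customer satisfaction'")

# Manual verification
regional_satisfaction = df.groupby('region')['customer_satisfaction'].mean().sort_values(ascending=False)
print("\nManual verification:")
print(regional_satisfaction)

# Compare the claim with the data instead of assuming it holds
top_region = regional_satisfaction.index[0]
if top_region == claimed_region:
    print(f"\n✅ Verified: {top_region} region has highest satisfaction ({regional_satisfaction.iloc[0]:.2f})")
else:
    print(f"\n❌ Not supported: {top_region} region has highest satisfaction ({regional_satisfaction.iloc[0]:.2f})")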

print("\n💡 Always validate AI claims with data!")

print("\n" + "=" * 80)
print("🎓 DEMO COMPLETE!")
print("You've learned how to:")
print("• Set up AI tools for data analytics")
print("• Use AI for data cleaning and exploration")
print("• Generate code and visualizations with AI")
print("• Create automated reports")
print("• Apply best practices and recognize limitations")
print("\nReady to revolutionize your data analysis workflow! 🚀")
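
The privacy list above recommends removing or anonymizing PII before AI analysis. For free-text fields, a first pass can be done with regular expressions. The patterns below are deliberately simple and purely illustrative: `redact_pii` is a hypothetical helper, and real PII detection needs far more care than two regexes:

```python
import re

# Deliberately simple patterns for illustration only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact_pii(text):
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


# Made-up contact details.
sample = "Contact jane.doe@example.com or call +1 (555) 010-9999 about order 42."
print(redact_pii(sample))
```

Applying a function like this to text columns before building a prompt keeps obvious identifiers out of the request. Names, addresses, and IDs would still need their own handling.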