# 🏥 HIPAA-Compliant Sentiment Analysis Demo - Google Colab Version

## Complete Self-Contained Analysis with AI Assistant

This notebook is designed for Google Colab and includes:
- ✅ **Complete sentiment analysis code** (no external data files or services required)
- ✅ **Synthetic healthcare data generation**
- ✅ **Service combination analysis**
- ✅ **Rich visualizations**
- ✅ **Simple keyword-based AI assistant** for result interpretation

**Ready to run in Google Colab!** 🚀

## 1. Install Required Packages

In [None]:
# Install required packages for Google Colab
!pip install transformers torch nltk scikit-learn plotly seaborn

print("✅ All packages installed successfully!")

In [None]:
# Complete Sentiment Analysis System for Google Colab
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import random
from datetime import datetime, timedelta
from collections import Counter, defaultdict
import re
from pathlib import Path

# NLP libraries
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# ML libraries
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation, PCA
from sklearn.metrics import silhouette_score

# Visualization
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots

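# Download NLTK data (VADER lexicon, stopwords, tokenizer, WordNet)
try:
    for resource in ("vader_lexicon", "stopwords", "punkt", "wordnet"):
        nltk.download(resource, quiet=True)
except Exception as exc:
    # Report download problems instead of silently ignoring them
    print(f"⚠️ NLTK download failed: {exc}")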

warnings.filterwarnings("ignore")
plt.style.use("default")

class HealthcareSentimentAnalyzer:
    """
    Complete sentiment analysis system for healthcare feedback.
    Designed for Google Colab with all dependencies included.
    """
    
    def __init__(self):
        self.sia = SentimentIntensityAnalyzer()
        self.lemmatizer = WordNetLemmatizer()
        self.stop_words = set(stopwords.words("english"))
        self.tfidf_vectorizer = None
        self.kmeans_model = None
        self.lda_model = None
        self.data = None
        self.analyzed_data = None
        self.combinations = []
        
    def generate_sample_data(self, n_samples=1000):
        """Generate realistic healthcare service feedback data."""
        print(f"📊 Generating {n_samples} sample healthcare feedback entries...")
        
        services = [
            "Telemedicine Consultation", "Emergency Care", "Physical Therapy",
            "Mental Health Counseling", "Pharmacy Services", "Laboratory Testing",
            "Radiology Services", "Surgical Procedures", "Preventive Care",
            "Specialist Consultation", "Home Healthcare", "Urgent Care"
        ]
        
        age_groups = ["18-25", "26-35", "36-45", "46-55", "56-65", "65+"]
        genders = ["Male", "Female", "Other", "Prefer not to say"]
        insurance_types = ["Private", "Medicare", "Medicaid", "Uninsured"]
        
        feedback_templates = {
            "positive": [
                "The {service} was excellent. Staff was professional and caring.",
                "Outstanding {service} experience. Highly recommend to others.",
                "Very satisfied with {service}. Quick and efficient service.",
                "The {service} team was knowledgeable and helpful throughout.",
                "Exceptional {service} quality. Will definitely return.",
                "Great experience with {service}. Everything went smoothly.",
                "Amazing {service}. The staff was wonderful and caring.",
                "Perfect {service} experience. Exceeded my expectations."
            ],
            "negative": [
                "Disappointed with {service}. Long wait times and poor communication.",
                "The {service} experience was frustrating. Staff seemed overwhelmed.",
                "Not satisfied with {service}. Expected better quality of care.",
                "Poor {service} experience. Would not recommend to others.",
                "The {service} was disappointing. Room for significant improvement.",
                "Unsatisfactory {service}. Did not meet my expectations.",
                "The {service} could be much better. Several issues encountered.",
                "Below average {service}. Needs improvement in several areas."
            ],
            "neutral": [
                "The {service} was adequate. Met basic expectations.",
                "Standard {service} experience. Nothing particularly notable.",
                "The {service} was okay. Average quality overall.",
                "Decent {service}. Met my basic needs.",
                "The {service} was fine. Standard level of care.",
                "Acceptable {service} experience. Met expectations.",
                "The {service} was satisfactory. No major issues.",
                "Average {service} experience. Standard quality."
            ]
        }
        
        data = []
        random.seed(42)
        
        for i in range(n_samples):
            service = random.choice(services)
            sentiment_type = random.choices(
                ["positive", "negative", "neutral"],
                weights=[0.5, 0.3, 0.2]
            )[0]
            
            # Generate feedback text
            template = random.choice(feedback_templates[sentiment_type])
            feedback_text = template.format(service=service)
            
            # Add service combinations (30% chance)
            if random.random() < 0.3:
                other_services = [s for s in services if s != service]
                additional_service = random.choice(other_services)
                combo_texts = [
                    f" Combined with {additional_service} - both were good.",
                    f" Also used {additional_service} recently.",
                    f" The {additional_service} service was also helpful.",
                    f" Along with {additional_service}, the experience was positive."
                ]
                feedback_text += random.choice(combo_texts)
            
            # Generate rating based on sentiment
            if sentiment_type == "positive":
                rating = random.choices([4, 5], weights=[0.3, 0.7])[0]
            elif sentiment_type == "negative":
                rating = random.choices([1, 2], weights=[0.7, 0.3])[0]
            else:
                rating = random.choice([3, 4])
            
            # Generate date within last 6 months
            days_ago = random.randint(1, 180)
            date = datetime.now() - timedelta(days=days_ago)
            
            data.append({
                "id": f"FB_{i+1:04d}",
                "service_type": service,
                "feedback_text": feedback_text,
                "rating": rating,
                "date": date,
                "age_group": random.choice(age_groups),
                "gender": random.choice(genders),
                "insurance_type": random.choice(insurance_types),
                # True whenever a second service was appended, whichever phrasing was chosen
                "has_combination": any(s in feedback_text for s in services if s != service)
            })
        
        self.data = pd.DataFrame(data)
        print(f"✅ Generated {len(self.data)} feedback entries")
        return self.data
    
    def analyze_sentiment(self):
        """Perform comprehensive sentiment analysis."""
        print("🔍 Performing sentiment analysis...")
        
        # VADER sentiment analysis
        sentiments = []
        compound_scores = []
        
        for text in self.data["feedback_text"]:
            scores = self.sia.polarity_scores(text)
            compound_scores.append(scores["compound"])
            
            if scores["compound"] >= 0.05:
                sentiments.append("positive")
            elif scores["compound"] <= -0.05:
                sentiments.append("negative")
            else:
                sentiments.append("neutral")
        
        self.data["vader_sentiment"] = sentiments
        self.data["vader_compound"] = compound_scores
        
        # TF-IDF clustering
        self.tfidf_vectorizer = TfidfVectorizer(
            max_features=1000,
            stop_words="english",
            ngram_range=(1, 2)
        )
        
        tfidf_matrix = self.tfidf_vectorizer.fit_transform(self.data["feedback_text"])
        
        # Find optimal number of clusters
        silhouette_scores = []
        K_range = range(2, 11)
        
        for k in K_range:
            kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
            cluster_labels = kmeans.fit_predict(tfidf_matrix)
            silhouette_avg = silhouette_score(tfidf_matrix, cluster_labels)
            silhouette_scores.append(silhouette_avg)
        
        optimal_k = K_range[np.argmax(silhouette_scores)]
        
        # Final clustering
        self.kmeans_model = KMeans(n_clusters=optimal_k, random_state=42, n_init=10)
        clusters = self.kmeans_model.fit_predict(tfidf_matrix)
        self.data["cluster"] = clusters
        
        # LDA Topic Modeling
        self.lda_model = LatentDirichletAllocation(
            n_components=5,
            random_state=42,
            max_iter=10
        )
        
        lda_topics = self.lda_model.fit_transform(tfidf_matrix)
        dominant_topics = np.argmax(lda_topics, axis=1)
        self.data["dominant_topic"] = dominant_topics
        
        print(f"✅ Sentiment analysis completed")
        print(f"🎯 Optimal clusters: {optimal_k}")
        print(f"📋 Topics discovered: {self.lda_model.n_components}")
        
        self.analyzed_data = self.data.copy()
        return self.analyzed_data
    
    def find_service_combinations(self):
        """Find and analyze service combinations."""
        print("🔗 Analyzing service combinations...")
        
        combinations = []
        
        # Service names to search for (same list used in generate_sample_data)
        services = [
            "Telemedicine Consultation", "Emergency Care", "Physical Therapy",
            "Mental Health Counseling", "Pharmacy Services", "Laboratory Testing",
            "Radiology Services", "Surgical Procedures", "Preventive Care",
            "Specialist Consultation", "Home Healthcare", "Urgent Care"
        ]
        
        for idx, row in self.analyzed_data.iterrows():
            text = row["feedback_text"]
            
            # Look for service mentions
            mentioned_services = []
            
            for service in services:
                if service.lower() in text.lower():
                    mentioned_services.append(service)
            
            # Check for combination language
            has_combo_language = any(phrase in text.lower() for phrase in [
                "combined with", "also used", "along with", "in addition to"
            ])
            
            if len(mentioned_services) > 1 or has_combo_language:
                combinations.append({
                    "id": row["id"],
                    "primary_service": row["service_type"],
                    "mentioned_services": mentioned_services,
                    "sentiment": row["vader_sentiment"],
                    "sentiment_score": row["vader_compound"],
                    "feedback": text,
                    "age_group": row["age_group"],
                    "gender": row["gender"],
                    "insurance_type": row["insurance_type"],
                    "rating": row["rating"]
                })
        
        self.combinations = combinations
        print(f"✅ Found {len(combinations)} service combinations")
        return combinations
    
    def create_visualizations(self):
        """Create comprehensive visualizations."""
        print("📈 Creating visualizations...")
        
        # Set up the plotting style
        plt.style.use("default")
        fig = plt.figure(figsize=(20, 15))
        
        # 1. Sentiment Distribution
        ax1 = plt.subplot(3, 4, 1)
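        sentiment_counts = self.analyzed_data["vader_sentiment"].value_counts()
        # Pick colors by label so green/red/orange always mean positive/negative/neutral
        sentiment_colors = {"positive": "#2ecc71", "negative": "#e74c3c", "neutral": "#f39c12"}
        colors = [sentiment_colors[label] for label in sentiment_counts.index]
        ax1.pie(sentiment_counts.values, labels=sentiment_counts.index, autopct="%1.1f%%",
               colors=colors, startangle=90)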
        ax1.set_title("Sentiment Distribution", fontsize=12, fontweight="bold")
        
        # 2. Service Performance
        ax2 = plt.subplot(3, 4, 2)
        service_performance = self.analyzed_data.groupby("service_type")["vader_compound"].mean().sort_values(ascending=False)
        top_services = service_performance.head(6)
        bars = ax2.bar(range(len(top_services)), top_services.values, color="lightblue", alpha=0.7)
        ax2.set_title("Top Services by Sentiment", fontsize=12, fontweight="bold")
        ax2.set_xticks(range(len(top_services)))
        ax2.set_xticklabels(top_services.index, rotation=45, ha="right")
        ax2.grid(True, alpha=0.3)
        
        # 3. Rating vs Sentiment
        ax3 = plt.subplot(3, 4, 3)
        ax3.scatter(self.analyzed_data["rating"], self.analyzed_data["vader_compound"],
                   alpha=0.6, color="purple")
        ax3.set_title("Rating vs Sentiment Score", fontsize=12, fontweight="bold")
        ax3.set_xlabel("Rating")
        ax3.set_ylabel("Sentiment Score")
        ax3.grid(True, alpha=0.3)
        
        # 4. Service Combinations
        ax4 = plt.subplot(3, 4, 4)
        if self.combinations:
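            combo_sentiments = [c["sentiment"] for c in self.combinations]
            combo_counts = Counter(combo_sentiments)
            # Pick each bar's color from its label rather than its position
            combo_colors = {"positive": "#2ecc71", "negative": "#e74c3c", "neutral": "#f39c12"}
            bars = ax4.bar(list(combo_counts.keys()), list(combo_counts.values()),
                          color=[combo_colors[s] for s in combo_counts])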
            ax4.set_title("Service Combination Sentiment", fontsize=12, fontweight="bold")
            ax4.set_ylabel("Count")
        else:
            ax4.text(0.5, 0.5, "No combinations\nfound", ha="center", va="center",
                    transform=ax4.transAxes, fontsize=12)
            ax4.set_title("Service Combinations", fontsize=12, fontweight="bold")
        
        # 5. Age Group Analysis
        ax5 = plt.subplot(3, 4, 5)
        age_sentiment = self.analyzed_data.groupby("age_group")["vader_compound"].mean().sort_values(ascending=False)
        bars = ax5.bar(range(len(age_sentiment)), age_sentiment.values, color="lightgreen", alpha=0.7)
        ax5.set_title("Sentiment by Age Group", fontsize=12, fontweight="bold")
        ax5.set_xticks(range(len(age_sentiment)))
        ax5.set_xticklabels(age_sentiment.index, rotation=45)
        ax5.grid(True, alpha=0.3)
        
        # 6. Insurance Type Analysis
        ax6 = plt.subplot(3, 4, 6)
        insurance_sentiment = self.analyzed_data.groupby("insurance_type")["vader_compound"].mean().sort_values(ascending=False)
        bars = ax6.bar(range(len(insurance_sentiment)), insurance_sentiment.values, color="lightcoral", alpha=0.7)
        ax6.set_title("Sentiment by Insurance", fontsize=12, fontweight="bold")
        ax6.set_xticks(range(len(insurance_sentiment)))
        ax6.set_xticklabels(insurance_sentiment.index, rotation=45)
        ax6.grid(True, alpha=0.3)
        
        # 7. Temporal Trends
        ax7 = plt.subplot(3, 4, 7)
        self.analyzed_data["month"] = self.analyzed_data["date"].dt.to_period("M")
        monthly_sentiment = self.analyzed_data.groupby("month")["vader_compound"].mean()
        ax7.plot(range(len(monthly_sentiment)), monthly_sentiment.values, marker="o", color="green", linewidth=2)
        ax7.set_title("Monthly Sentiment Trend", fontsize=12, fontweight="bold")
        ax7.set_xticks(range(len(monthly_sentiment)))
        ax7.set_xticklabels([str(x) for x in monthly_sentiment.index], rotation=45)
        ax7.grid(True, alpha=0.3)
        
        # 8. Cluster Analysis
        ax8 = plt.subplot(3, 4, 8)
        pca = PCA(n_components=2)
        tfidf_matrix = self.tfidf_vectorizer.transform(self.analyzed_data["feedback_text"])
        tfidf_2d = pca.fit_transform(tfidf_matrix.toarray())
        scatter = ax8.scatter(tfidf_2d[:, 0], tfidf_2d[:, 1], c=self.analyzed_data["cluster"],
                            cmap="viridis", alpha=0.6)
        ax8.set_title("Text Clusters", fontsize=12, fontweight="bold")
        ax8.set_xlabel("PC1")
        ax8.set_ylabel("PC2")
        plt.colorbar(scatter, ax=ax8)
        
        # 9. Service Volume
        ax9 = plt.subplot(3, 4, 9)
        service_counts = self.analyzed_data["service_type"].value_counts().head(8)
        bars = ax9.barh(range(len(service_counts)), service_counts.values, color="lightblue", alpha=0.7)
        ax9.set_title("Service Volume", fontsize=12, fontweight="bold")
        ax9.set_yticks(range(len(service_counts)))
        ax9.set_yticklabels(service_counts.index)
        ax9.grid(True, alpha=0.3)
        
        # 10. Sentiment Score Distribution
        ax10 = plt.subplot(3, 4, 10)
        ax10.hist(self.analyzed_data["vader_compound"], bins=30, alpha=0.7, color="skyblue", edgecolor="black")
        ax10.axvline(self.analyzed_data["vader_compound"].mean(), color="red", linestyle="--",
                    label=f"Mean: {self.analyzed_data['vader_compound'].mean():.3f}")
        ax10.set_title("Sentiment Score Distribution", fontsize=12, fontweight="bold")
        ax10.set_xlabel("Sentiment Score")
        ax10.set_ylabel("Frequency")
        ax10.legend()
        ax10.grid(True, alpha=0.3)
        
        # 11. Gender Analysis
        ax11 = plt.subplot(3, 4, 11)
        gender_sentiment = self.analyzed_data.groupby("gender")["vader_compound"].mean().sort_values(ascending=False)
        bars = ax11.bar(range(len(gender_sentiment)), gender_sentiment.values, color="lightpink", alpha=0.7)
        ax11.set_title("Sentiment by Gender", fontsize=12, fontweight="bold")
        ax11.set_xticks(range(len(gender_sentiment)))
        ax11.set_xticklabels(gender_sentiment.index, rotation=45)
        ax11.grid(True, alpha=0.3)
        
        # 12. Overall Summary
        ax12 = plt.subplot(3, 4, 12)
        ax12.axis("off")
        summary_text = f"""
        📊 ANALYSIS SUMMARY
        
        Total Feedback: {len(self.analyzed_data):,}
        Positive: {(self.analyzed_data["vader_sentiment"] == "positive").sum()}
        Negative: {(self.analyzed_data["vader_sentiment"] == "negative").sum()}
        Neutral: {(self.analyzed_data["vader_sentiment"] == "neutral").sum()}
        
        Avg Sentiment: {self.analyzed_data["vader_compound"].mean():.3f}
        Service Types: {self.analyzed_data["service_type"].nunique()}
        Combinations: {len(self.combinations)}
        Clusters: {self.analyzed_data["cluster"].nunique()}
        """
        ax12.text(0.1, 0.9, summary_text, transform=ax12.transAxes, fontsize=10,
                verticalalignment="top", fontfamily="monospace")
        
        plt.tight_layout()
        plt.show()
        
        print("✅ Visualizations completed!")
        return fig

print("✅ Healthcare Sentiment Analyzer class loaded!")
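
The `analyze_sentiment` method turns each VADER compound score into a label using cutoffs of ±0.05. The standalone sketch below isolates that rule so the boundary cases are easy to see. The function name `label_from_compound` and its `threshold` parameter are illustrative and not part of the class, and the scores are hand-picked rather than real model output.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER-style compound score in [-1, 1] to a sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Hand-picked scores (not real VADER output) to show where the cutoffs fall
for score in (0.62, 0.05, 0.0, -0.04, -0.30):
    print(f"{score:+.2f} -> {label_from_compound(score)}")
```

Scores exactly at the cutoff count as positive or negative; only values strictly between -0.05 and 0.05 are labeled neutral.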

## 2. Initialize and Run Analysis

In [None]:
# Initialize the analyzer
analyzer = HealthcareSentimentAnalyzer()

# Generate sample data
data = analyzer.generate_sample_data(1000)

print(f"\n📊 Dataset Overview:")
print(f"Total entries: {len(data):,}")
print(f"Service types: {data['service_type'].nunique()}")
print(f"Date range: {data['date'].min().date()} to {data['date'].max().date()}")

# Show sample data
print("\n📋 Sample Data:")
display(data.head())

In [None]:
# Run comprehensive sentiment analysis
analyzed_data = analyzer.analyze_sentiment()

# Find service combinations
combinations = analyzer.find_service_combinations()

print(f"\n📈 Analysis Results:")
print(f"Sentiment distribution:")
print(analyzed_data["vader_sentiment"].value_counts())
print(f"\nAverage sentiment score: {analyzed_data['vader_compound'].mean():.3f}")
print(f"Service combinations found: {len(combinations)}")
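
`find_service_combinations` flags an entry when its text names more than one known service or contains a combination phrase. The sketch below reproduces that rule on plain strings, assuming a shortened service list for brevity. The names `SERVICES`, `COMBO_PHRASES`, and `detect_combination` are illustrative and do not exist in the class.

```python
# Shortened service list for illustration; the notebook uses twelve names
SERVICES = ["Physical Therapy", "Pharmacy Services", "Urgent Care"]
COMBO_PHRASES = ["combined with", "also used", "along with", "in addition to"]


def detect_combination(text, services=SERVICES, phrases=COMBO_PHRASES):
    """Return (mentioned services, is_combination) for one feedback text."""
    lowered = text.lower()
    # Case-insensitive substring match, as in the class method
    mentioned = [s for s in services if s.lower() in lowered]
    has_phrase = any(p in lowered for p in phrases)
    return mentioned, (len(mentioned) > 1 or has_phrase)


examples = [
    "The Urgent Care was okay. Average quality overall.",
    "The Physical Therapy was fine. Also used Pharmacy Services recently.",
    "Great experience with Physical Therapy. The Pharmacy Services service was also helpful.",
]
for text in examples:
    print(detect_combination(text))
```

The third example contains none of the combination phrases and is caught only because two service names appear, which is why the rule checks both conditions.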

## 3. Comprehensive Visualizations

In [None]:
# Create comprehensive visualizations
fig = analyzer.create_visualizations()
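
The "Rating vs Sentiment Score" panel compares the two measures visually. A numeric complement is to convert each rating to a label and count how often it matches the VADER label. The sketch below assumes a mapping of 1-2 to negative, 3 to neutral, and 4-5 to positive; that mapping, the function names, and the toy inputs are assumptions for illustration. Because the generator assigns neutral feedback a rating of 3 or 4, some disagreement is expected by construction.

```python
def label_from_rating(rating):
    """Assumed mapping: 1-2 negative, 3 neutral, 4-5 positive."""
    if rating >= 4:
        return "positive"
    if rating <= 2:
        return "negative"
    return "neutral"


def agreement_rate(ratings, sentiments):
    """Fraction of entries whose rating-derived label equals the sentiment label."""
    if len(ratings) != len(sentiments):
        raise ValueError("ratings and sentiments must have the same length")
    if not ratings:
        return 0.0
    matches = sum(label_from_rating(r) == s for r, s in zip(ratings, sentiments))
    return matches / len(ratings)


# Toy inputs, not results from the notebook
ratings = [5, 4, 1, 3, 4]
sentiments = ["positive", "neutral", "negative", "neutral", "positive"]
print(f"Agreement: {agreement_rate(ratings, sentiments):.0%}")
```

On the notebook's data this could be called as `agreement_rate(analyzed_data["rating"].tolist(), analyzed_data["vader_sentiment"].tolist())`.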

## 4. AI Assistant Setup

In [None]:
# Simple AI Assistant for result interpretation
class SimpleAIAssistant:
    def __init__(self, analyzer):
        self.analyzer = analyzer
        self.prepare_context()
    
    def prepare_context(self):
        self.context = {
            "total_feedback": len(self.analyzer.analyzed_data),
            "sentiment_dist": dict(self.analyzer.analyzed_data["vader_sentiment"].value_counts()),
            "avg_sentiment": self.analyzer.analyzed_data["vader_compound"].mean(),
            "service_count": self.analyzer.analyzed_data["service_type"].nunique(),
            "combinations": len(self.analyzer.combinations),
            "clusters": self.analyzer.analyzed_data["cluster"].nunique()
        }
    
    def answer_question(self, question):
        q = question.lower()
        
        if "top" in q and "service" in q:
            top_services = self.analyzer.analyzed_data.groupby("service_type")["vader_compound"].mean().sort_values(ascending=False).head(3)
            return f"The top performing services are: {', '.join(top_services.index.tolist())}. These services show the highest average sentiment scores."
        
        elif "combination" in q:
            if self.analyzer.combinations:
                combo_df = pd.DataFrame(self.analyzer.combinations)
                avg_combo_sentiment = combo_df["sentiment_score"].mean()
                return f"Found {len(self.analyzer.combinations)} service combinations. Average sentiment for combinations: {avg_combo_sentiment:.3f}. This shows how patients feel when using multiple services together."
            else:
                return "No service combinations were detected in the current dataset."
        
        elif "sentiment" in q:
            # .get() avoids a KeyError if no feedback was classified as positive
            pos_pct = (self.context["sentiment_dist"].get("positive", 0) / self.context["total_feedback"]) * 100
            return f"Overall sentiment analysis shows {pos_pct:.1f}% positive feedback, with an average sentiment score of {self.context['avg_sentiment']:.3f}. This indicates generally positive patient experiences."
        
        elif "recommendation" in q:
            return "Key recommendations: 1) Focus on replicating success factors from top-performing services, 2) Address underperforming services with targeted improvements, 3) Monitor service combinations for cross-selling opportunities, 4) Implement regular sentiment monitoring."
        
        elif "cluster" in q:
            return f"Text clustering identified {self.context['clusters']} distinct groups of feedback, revealing natural patterns in patient experiences and common themes."
        
        else:
            return f"Based on the analysis of {self.context['total_feedback']} feedback entries across {self.context['service_count']} service types, the sentiment analysis reveals patterns in patient experiences. Ask a more specific question for detailed insights."

print("✅ Simple AI Assistant class loaded!")
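
`answer_question` checks keywords in a fixed order and returns the first branch that matches, so a question mentioning both "sentiment" and another topic is answered by whichever branch comes first. The sketch below expresses that first-match-wins routing as a table. `ROUTES` and `route_question` are illustrative names, not part of `SimpleAIAssistant`, and the function returns only a topic label rather than a full answer.

```python
# Ordered (required keywords, topic) pairs mirroring the if/elif chain
ROUTES = [
    (("top", "service"), "top_services"),
    (("combination",), "combinations"),
    (("sentiment",), "sentiment"),
    (("recommendation",), "recommendations"),
    (("cluster",), "clusters"),
]


def route_question(question):
    """Return the topic of the first route whose keywords all appear in the question."""
    q = question.lower()
    for keywords, topic in ROUTES:
        if all(keyword in q for keyword in keywords):
            return topic
    return "general"  # fallback, like the final else branch


for question in (
    "What are the top performing services based on sentiment analysis?",
    "What insights can you provide about demographic patterns in the sentiment analysis?",
    "How do different age groups rate the services?",
):
    print(f"{route_question(question):>12}  <-  {question}")
```

Listing the routes as data makes the precedence explicit and makes it easier to see which questions fall through to the general summary.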

In [None]:
# Initialize AI Assistant
ai_assistant = SimpleAIAssistant(analyzer)

print("🤖 AI Assistant ready! Ask questions about your analysis results.")
print("\n💡 Example questions you can ask:")
print("- What are the top performing services?")
print("- How do service combinations affect sentiment?")
print("- What insights can you provide about the sentiment analysis?")
print("- What recommendations do you have based on the results?")
print("- How do different demographic groups rate the services?")

## 5. Interactive Q&A

In [None]:
# Interactive Q&A function
def ask_question(question):
    print(f"\n❓ Question: {question}")
    print("🤖 AI Assistant Response:")
    print("=" * 50)
    response = ai_assistant.answer_question(question)
    print(response)
    print("=" * 50)
    return response

# Example questions
ask_question("What are the top performing services based on sentiment analysis?")
ask_question("How do service combinations affect patient satisfaction?")
ask_question("What recommendations do you have for improving healthcare services?")

## 6. Ask Your Own Questions

In [None]:
# Ask your own question here!
your_question = "What insights can you provide about demographic patterns in the sentiment analysis?"
ask_question(your_question)

# Try more questions (those without a recognized keyword return a general summary):
# ask_question("How do different age groups rate the services?")
# ask_question("What are the main themes identified in the feedback?")
# ask_question("How can we improve underperforming services?")
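
Questions about age groups or other demographics contain none of the keywords the assistant recognizes, so they receive the general summary. One way such an answer could be computed is sketched below: average the sentiment score per group and sort from highest to lowest. The helper `mean_sentiment_by_group` and the toy records are assumptions for illustration; on the notebook's DataFrame the same result comes from `groupby(...)["vader_compound"].mean()`, which the visualization code already uses.

```python
from collections import defaultdict
from statistics import mean


def mean_sentiment_by_group(records, group_key, score_key="vader_compound"):
    """Average the score per group, sorted from highest to lowest mean."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record[group_key]].append(record[score_key])
    averages = {group: mean(scores) for group, scores in grouped.items()}
    return dict(sorted(averages.items(), key=lambda item: item[1], reverse=True))


# Toy records, not output from the notebook
records = [
    {"age_group": "18-25", "vader_compound": 0.5},
    {"age_group": "18-25", "vader_compound": 0.25},
    {"age_group": "65+", "vader_compound": -0.25},
    {"age_group": "65+", "vader_compound": 0.75},
]
for group, avg in mean_sentiment_by_group(records, "age_group").items():
    print(f"{group}: {avg:+.3f}")
```

The notebook's data can be passed in as `analyzed_data.to_dict("records")`, with `"gender"` or `"insurance_type"` as the group key.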

## 🎉 Analysis Complete!

### ✅ What You've Accomplished:

1. **Generated realistic healthcare feedback data** (1000 samples)
2. **Performed comprehensive sentiment analysis** using VADER, clustering, and topic modeling
3. **Analyzed service combinations** to understand cross-service patterns
4. **Created rich visualizations** showing insights across multiple dimensions
5. **Built a simple keyword-based AI assistant** to interpret and explain results

### 🚀 Ready for Google Colab Sharing!

This notebook is completely self-contained and ready to be shared. Simply:

1. **Upload to Google Colab**
2. **Run all cells** (Runtime → Run All)
3. **Ask questions** using the AI assistant
4. **Share with colleagues** for collaboration

### 💡 Next Steps:

- Replace synthetic data with your own healthcare feedback
- Adjust analysis parameters for your specific needs
- Export results for further analysis
- Integrate with your existing healthcare systems
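
For the first of these steps, the sketch below reads feedback from CSV and checks that it has the columns the analyzer and plots read. The column names are taken from the generated data; `REQUIRED_COLUMNS`, `load_feedback_rows`, and the in-memory sample standing in for a real file are assumptions for illustration.

```python
import csv
import io

# Columns read by the analysis, combination search, and plots in this notebook
REQUIRED_COLUMNS = {
    "id", "service_type", "feedback_text", "rating", "date",
    "age_group", "gender", "insurance_type",
}


def load_feedback_rows(handle):
    """Read feedback rows from an open CSV handle, validating the header first."""
    reader = csv.DictReader(handle)
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"Missing columns: {sorted(missing)}")
    rows = []
    for row in reader:
        row["rating"] = int(row["rating"])  # ratings are integers 1-5
        rows.append(row)
    return rows


# In-memory stand-in for a real file opened with open(path, newline="")
sample_csv = io.StringIO(
    "id,service_type,feedback_text,rating,date,age_group,gender,insurance_type\n"
    "FB_0001,Urgent Care,The Urgent Care was okay.,3,2024-01-15,26-35,Female,Private\n"
)
rows = load_feedback_rows(sample_csv)
print(rows[0]["service_type"], rows[0]["rating"])
```

The rows can then be wrapped in `pd.DataFrame(rows)` and assigned to `analyzer.data` in place of calling `generate_sample_data`; the `date` column would first need converting with `pd.to_datetime`, since the monthly trend plot relies on the `.dt` accessor.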

**🔒 Note: This uses synthetic data for demonstration. For real healthcare data, ensure HIPAA compliance.**