# Exploratory Data Analysis for SDOH Domains using Sweetviz

This notebook uses Sweetviz to perform exploratory data analysis (EDA) on the social determinants of health (SDOH) domains and items listed in the domain map. Cohorts are defined by the 'survey' column of the combined dataset.

In [None]:
import pandas as pd
import sweetviz as sv

# Load domain map
domain_map = pd.read_csv('reference/domain_map.tsv', sep='\t')
print("Domain Map:")
display(domain_map.head())

# Load combined data
combined_data = pd.read_csv('data/combined.tsv', sep='\t')
print("\nCombined Data:")
display(combined_data.head())

# Get unique cohorts, ignoring rows with a missing 'survey' value
cohorts = combined_data['survey'].dropna().unique()
print(f"\nUnique cohorts: {cohorts}")
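The domain-specific analysis later in this notebook selects columns of `combined_data` by the names in the domain map's 'Column Name' column, so it may help to check early which items are actually present. The following is a minimal sketch using plain Python lists so it runs without the data files; the helper `find_missing_items` and the sample names are hypothetical and not part of the original notebook.

```python
def find_missing_items(map_items, data_columns):
    """Return domain-map items that are absent from the data, in map order."""
    available = set(data_columns)
    seen = set()
    missing = []
    for item in map_items:
        # Report each missing item once, even if the map lists it repeatedly
        if item not in available and item not in seen:
            missing.append(item)
            seen.add(item)
    return missing


# Hypothetical item and column names, for illustration only.
map_items = ["food_security", "housing_status", "transport_access"]
data_columns = ["survey", "food_security", "transport_access"]

print(find_missing_items(map_items, data_columns))
```

In the notebook itself, the call would presumably be `find_missing_items(domain_map['Column Name'], combined_data.columns)`, since both objects can be iterated like lists.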

## 1. Generate Overall Sweetviz Report

In [None]:
# Generate Sweetviz report for the entire dataset
overall_report = sv.analyze(combined_data, pairwise_analysis='off')
overall_report.show_html("sdoh_sweetviz_report.html")
print("Overall Sweetviz report generated and saved as 'sdoh_sweetviz_report.html'")

## 2. Generate Cohort-specific Sweetviz Reports

In [None]:
for cohort in cohorts:
    # Restrict the data to the rows belonging to this cohort
    cohort_data = combined_data[combined_data['survey'] == cohort]
    cohort_report = sv.analyze(cohort_data, pairwise_analysis='off')
    # str() guards against non-string cohort labels (e.g. numeric survey codes)
    report_path = f"{str(cohort).lower()}_sweetviz_report.html"
    cohort_report.show_html(report_path)
    print(f"Sweetviz report for the {cohort} cohort generated and saved as '{report_path}'")
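The report file names above are built directly from the cohort label, which can produce awkward or invalid paths if a label contains spaces, slashes, or other punctuation. One possible way to normalise labels is sketched below; `report_filename` is a hypothetical helper and the example labels are invented for illustration.

```python
import re


def report_filename(label, suffix="sweetviz_report"):
    """Build a lowercase, filesystem-friendly HTML file name from a label."""
    # Collapse every run of characters other than a-z and 0-9 into one underscore
    slug = re.sub(r"[^a-z0-9]+", "_", str(label).lower()).strip("_")
    # Fall back to a fixed name if nothing usable remains
    return f"{slug or 'unnamed'}_{suffix}.html"


# Hypothetical cohort labels, for illustration only.
print(report_filename("NHANES 2017/18"))
print(report_filename(2020))
```

The same helper could be reused for domain names by passing the domain as `label`, or for comparison reports by passing `suffix="comparison_report"`.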

## 3. Compare Cohorts using Sweetviz

In [None]:
# Compare the first two cohorts as an example (requires at least two cohorts)
if len(cohorts) < 2:
    raise ValueError("Need at least two cohorts in the 'survey' column to compare")
cohort1, cohort2 = cohorts[0], cohorts[1]

cohort1_data = combined_data[combined_data['survey'] == cohort1]
cohort2_data = combined_data[combined_data['survey'] == cohort2]

# Each argument to sv.compare is a [DataFrame, display name] pair
comparison_report = sv.compare([cohort1_data, str(cohort1)], [cohort2_data, str(cohort2)], pairwise_analysis='off')
comparison_path = f"{str(cohort1).lower()}_{str(cohort2).lower()}_comparison_report.html"
comparison_report.show_html(comparison_path)
print(f"Comparison report between {cohort1} and {cohort2} generated and saved as '{comparison_path}'")
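The cell above compares only the first two cohorts. If the dataset contains more than two, every pair could be enumerated with `itertools.combinations` from the standard library. This is a minimal sketch of that enumeration only; `cohort_pairs` is a hypothetical helper and the labels are placeholders.

```python
from itertools import combinations


def cohort_pairs(cohorts):
    """Return every unordered pair of distinct cohort labels, in input order."""
    return list(combinations(cohorts, 2))


# Hypothetical cohort labels, for illustration only.
for first, second in cohort_pairs(["A", "B", "C"]):
    print(first, second)
```

Each resulting pair could then be filtered and passed to `sv.compare` in the same way as `cohort1` and `cohort2` above. The number of reports grows quadratically with the number of cohorts, so this is likely practical only for a handful of surveys.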

## 4. Domain-specific Analysis

In [None]:
for domain, items in domain_map.groupby('Domain')['Column Name']:
    # Keep only items present in the data; exclude 'survey' so it is not selected twice
    columns = [c for c in items.tolist() if c in combined_data.columns and c != 'survey']
    domain_data = combined_data[columns + ['survey']]
    domain_report = sv.analyze(domain_data, pairwise_analysis='off')
    report_path = f"{str(domain).lower().replace(' ', '_')}_sweetviz_report.html"
    domain_report.show_html(report_path)
    print(f"Sweetviz report for the {domain} domain generated and saved as '{report_path}'")

## 5. Summary and Conclusions

Based on the Sweetviz reports generated above, we can draw the following conclusions:

1. [Add your conclusions here based on the Sweetviz report results]
2. [Highlight key findings and differences between cohorts]
3. [Discuss any patterns or trends observed in the data]
4. [Suggest areas for further investigation or analysis]

For detailed insights, please refer to the generated HTML reports:
- Overall Sweetviz report: 'sdoh_sweetviz_report.html'
- Cohort-specific reports: '[cohort_name]_sweetviz_report.html'
- Cohort comparison report: '[cohort1]_[cohort2]_comparison_report.html'
- Domain-specific reports: '[domain_name]_sweetviz_report.html'