# Gene Expression Association Analysis Project - Student Handout

## Project Overview

In this project, you will perform a **complete association and correlation analysis** of the gene expression dataset you explored in Project 1. Now that you understand the data structure, missing values, and distributions, you will investigate **relationships between variables**.

### Dataset Reminder:
- **Two genes** (`gene1`, `gene2`) with expression values
- **Three categorical variables** (`cat1`, `cat2`, `cat3`) representing sample annotations
- **Hidden population structure** (`population`) that may explain patterns

### Analysis Objectives:
1. **Gene-Gene Correlation**: Measure the relationship between gene1 and gene2 using Pearson and Spearman
2. **Conditional Correlation**: Examine if the gene correlation differs across categories (Simpson's Paradox?)
3. **Categorical Associations**: Use contingency tables, chi-square tests, and Cramér's V
4. **Expression vs Category**: Analyze if gene expression levels associate with categorical variables
5. **Population Structure Discovery**: Uncover how the hidden population affects relationships

### Tools You Will Use:
- **pandas**: Data manipulation and cross-tabulation
- **scipy.stats**: Pearson, Spearman, chi-square tests
- **seaborn**: Visualization (scatter plots, heatmaps, regression plots)
- **matplotlib**: Plot customization

### Instructions:
- Complete each code cell marked with `# YOUR CODE HERE`
- Answer the reflection questions in the markdown cells
- Build on insights from Project 1

## 1. Setup and Data Loading

### Your Task:
- Import the required libraries
- Load the dataset from `gene_expression_data.csv`
- Preview the data and recall the structure from Project 1

In [1]:
# YOUR CODE HERE: Import libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats

# Set style
sns.set_theme(style="whitegrid")  # set_theme is the current name; sns.set is an older alias
plt.rcParams['figure.figsize'] = [10, 6]

# YOUR CODE HERE: Load the dataset
# df = pd.read_csv(...)


# Preview data
df.head()


In [None]:
# Check available columns and their types
df.info()

---
## 2. Gene-Gene Correlation Analysis

### Background
From Project 1, you observed the distributions of gene1 and gene2. Now we will quantify their relationship.

**Questions to answer:**
- Are gene1 and gene2 correlated?
- Is the relationship linear (Pearson) or monotonic (Spearman)?
- How strong is the correlation?

### Your Tasks:
1. **Task 2a**: Calculate Pearson correlation between gene1 and gene2
2. **Task 2b**: Calculate Spearman correlation (more robust to outliers and zeros)
3. **Task 2c**: Create a scatter plot with regression line
4. **Task 2d**: Discuss the difference between Pearson and Spearman results
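
To make the Pearson/Spearman distinction concrete, here is a minimal sketch on synthetic data (not the project dataset). The arrays `x` and `y` are invented for illustration: `y` grows exponentially with `x`, so the relationship is perfectly monotonic but far from a straight line.

```python
import numpy as np
from scipy import stats

# Synthetic data (an assumption for illustration, not the project dataset):
# y increases strictly with x, but along a curve rather than a straight line.
x = np.linspace(0, 5, 50)
y = np.exp(x)

pearson_r, _ = stats.pearsonr(x, y)      # strength of the linear association
spearman_rho, _ = stats.spearmanr(x, y)  # strength of the monotonic (rank) association

print(f"Pearson r    = {pearson_r:.3f}")
print(f"Spearman rho = {spearman_rho:.3f}")
```

Spearman works on ranks, and here the ranks of `x` and `y` agree exactly, so it reaches 1. Pearson comes out lower because the points do not lie on a line. A gap between the two coefficients in your own results is a hint to look at the scatter plot for curvature, outliers, or many tied values such as zeros.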

In [None]:
# Task 2a: Pearson correlation
# Drop NaN values for correlation calculation
df_clean = df[['gene1', 'gene2']].dropna()

# YOUR CODE HERE: Calculate Pearson correlation using pandas
# r_pearson = df_clean['gene1'].corr(df_clean['gene2'])

print(f"Pearson correlation: r = {r_pearson:.4f}")

In [None]:
# Task 2b: Spearman correlation
# YOUR CODE HERE: Calculate Spearman correlation using pandas
# r_spearman = df_clean['gene1'].corr(df_clean['gene2'], method='spearman')

print(f"Spearman correlation: ρ = {r_spearman:.4f}")

print(f"\nDifference (Pearson - Spearman): {r_pearson - r_spearman:.4f}")

In [None]:
# Task 2c: Scatter plot with regression line
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: Basic scatter plot
# YOUR CODE HERE: Create scatter plot of gene1 vs gene2
# sns.scatterplot(...)
axes[0].set_title('Gene1 vs Gene2 - Scatter Plot')

# Right: Regression plot with confidence interval
# YOUR CODE HERE: Create regression plot
# sns.regplot(...)
axes[1].set_title(f'Gene1 vs Gene2 - Regression (r = {r_pearson:.3f})')

plt.tight_layout()
plt.show()

### Reflection Question 2:
**After running the cells above, answer:**

1. Why might Pearson and Spearman differ? (Think about zeros/dropouts)
2. Based on the scatter plot, does a straight-line (linear) relationship seem appropriate for these data?

*Your answer here:*



---
## 3. Conditional Correlation: By Categories

### Simpson's Paradox Warning!
The overall correlation can hide different patterns within subgroups. A positive overall correlation may be driven mainly by differences between group means, while the correlations within each group are weaker, absent, or even reversed.
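
The effect is easy to reproduce on a tiny hand-made table. The sketch below uses invented values and hypothetical column names (`group`, `x`, `y`), not the project data: within each group `y` falls as `x` rises, but group B sits higher on both axes than group A.

```python
import pandas as pd

# Hand-made toy data (hypothetical, for illustration only)
toy = pd.DataFrame({
    "group": ["A"] * 4 + ["B"] * 4,
    "x": [1, 2, 3, 4, 6, 7, 8, 9],
    "y": [4, 3, 2, 1, 9, 8, 7, 6],
})

# Pearson correlation ignoring the groups
overall_r = toy["x"].corr(toy["y"])

# Pearson correlation computed separately inside each group
within_r = {name: g["x"].corr(g["y"]) for name, g in toy.groupby("group")}

print(f"Overall r = {overall_r:.3f}")
for name, r in within_r.items():
    print(f"Group {name}: r = {r:.3f}")
```

The pooled correlation is positive (about 0.67) even though it is -1 inside both groups. The sign flips because the gap between the group means dominates the pooled calculation.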

### Your Tasks:
1. **Task 3a**: Calculate gene1-gene2 correlation separately for each level of cat1
2. **Task 3b**: Calculate gene1-gene2 correlation separately for each level of cat2
3. **Task 3c**: Calculate gene1-gene2 correlation for each population (reveal the hidden structure!)
4. **Task 3d**: Visualize with colored scatter plots and per-group regression lines (lmplot)

In [None]:
# Task 3a: Conditional correlation by cat1
print("=== Correlation by cat1 ===")
print("-" * 50)

for category in df['cat1'].dropna().unique():
    subset = df[df['cat1'] == category][['gene1', 'gene2']].dropna()
    if len(subset) > 2:
        # YOUR CODE HERE: Calculate Pearson correlation for this subset using pandas
        # r = subset['gene1'].corr(subset['gene2'])
        print(f"{category}: r = {r:.4f} (n={len(subset)})")

In [None]:
# Task 3b: Conditional correlation by cat2
print("=== Correlation by cat2 ===")
print("-" * 50)

for category in df['cat2'].dropna().unique():
    subset = df[df['cat2'] == category][['gene1', 'gene2']].dropna()
    if len(subset) > 2:
        # YOUR CODE HERE: Calculate Pearson correlation for this subset using pandas
        # r = subset['gene1'].corr(subset['gene2'])
        print(f"{category}: r = {r:.4f} (n={len(subset)})")

In [None]:
# Task 3c: Conditional correlation by POPULATION (hidden structure!)
print("=== Correlation by population ===")
print("-" * 50)

for pop in sorted(df['population'].dropna().unique()):
    subset = df[df['population'] == pop][['gene1', 'gene2']].dropna()
    if len(subset) > 2:
        # YOUR CODE HERE: Calculate Pearson correlation for this subset using pandas
        # r = subset['gene1'].corr(subset['gene2'])
        print(f"{pop}: r = {r:.4f} (n={len(subset)})")

In [None]:
# Task 3d: Visualize conditional correlations
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

# By cat1
# YOUR CODE HERE: Create scatter plot colored by cat1
# sns.scatterplot(data=df, x='gene1', y='gene2', hue='cat1', ax=axes[0], alpha=0.5)
axes[0].set_title('Gene Correlation by cat1')

# By cat2
# YOUR CODE HERE: Create scatter plot colored by cat2
# sns.scatterplot(...)
axes[1].set_title('Gene Correlation by cat2')

# By population
# YOUR CODE HERE: Create scatter plot colored by population
# sns.scatterplot(...)
axes[2].set_title('Gene Correlation by Population')

plt.tight_layout()
plt.show()

In [None]:
# Regression plots by category (shows individual regression lines)
# YOUR CODE HERE: Create lmplot by population
# g = sns.lmplot(data=df, x='gene1', y='gene2', hue='population', 
#                height=5, aspect=1.5, scatter_kws={'alpha': 0.4})

plt.title('Gene1 vs Gene2 with Regression Lines by Population')
plt.show()

### Reflection Question 3:
**After running the cells above, answer:**

1. Does the gene1-gene2 correlation differ across categories? Which grouping variable shows the biggest differences?
2. Is there evidence of Simpson's Paradox? (Does the overall correlation differ from the within-group correlations?)
3. What does the population structure reveal about the data?

*Your answer here:*



---
## 4. Categorical Variable Associations

### Background
Now we examine relationships **between the categorical variables themselves**:
- Are cat1 and cat2 independent?
- Are cat1 and cat3 associated?
- How does population relate to other categories?

### Your Tasks:
1. **Task 4a**: Create contingency tables between categorical variables
2. **Task 4b**: Perform chi-square tests of independence
3. **Task 4c**: Calculate Cramér's V to measure association strength
4. **Task 4d**: Visualize associations with heatmaps
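
As a reference point for reading Cramér's V, the sketch below builds a toy 2x2 table in which two invented annotations (`a` and `b`, not columns of the project dataset) are perfectly associated. The helper `toy_cramers_v` is a hypothetical name used only here. It also shows one detail worth knowing: `scipy.stats.chi2_contingency` applies Yates' continuity correction by default when the table is 2x2, which lowers χ² and therefore V.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Toy annotations (hypothetical): every 'x' sample is 'p', every 'y' sample is 'q'
toy = pd.DataFrame({
    "a": ["x"] * 10 + ["y"] * 10,
    "b": ["p"] * 10 + ["q"] * 10,
})
table = pd.crosstab(toy["a"], toy["b"])  # counts: [[10, 0], [0, 10]]

def toy_cramers_v(table, correction):
    """Cramér's V = sqrt(chi2 / (n * min(rows - 1, cols - 1)))."""
    chi2, p_value, dof, expected = stats.chi2_contingency(table, correction=correction)
    n = table.to_numpy().sum()
    k = min(table.shape) - 1
    return np.sqrt(chi2 / (n * k))

v_plain = toy_cramers_v(table, correction=False)  # no continuity correction
v_yates = toy_cramers_v(table, correction=True)   # scipy's default for 2x2 tables

print(f"V without correction: {v_plain:.3f}")
print(f"V with Yates' correction: {v_yates:.3f}")
```

Without the correction a perfect association gives V = 1. With the default correction the same table gives V = 0.9, so values for 2x2 tables can sit slightly below what the uncorrected formula would produce. Tables larger than 2x2 are not affected.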

In [None]:
# Define Cramér's V function
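def cramers_v(contingency_table):
    """Calculate Cramér's V from a contingency table (a pandas DataFrame of counts).

    V = sqrt(chi2 / (n * min(r - 1, c - 1))), ranging from 0 (no association) to 1.
    """
    chi2 = stats.chi2_contingency(contingency_table)[0]
    n = contingency_table.sum().sum()  # total number of observations
    r, c = contingency_table.shape
    k = min(r - 1, c - 1)
    if k == 0:
        return np.nan  # V is undefined when a variable has only one level
    return np.sqrt(chi2 / (n * k))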

In [None]:
# Task 4a: Contingency table - cat1 vs cat2
# YOUR CODE HERE: Create crosstab between cat1 and cat2
# contingency_cat1_cat2 = pd.crosstab(...)

print("=== Contingency Table: cat1 vs cat2 ===")
contingency_cat1_cat2

In [None]:
# Task 4b: Chi-square test - cat1 vs cat2
# YOUR CODE HERE: Perform chi-square test (keep the p-value to judge significance)
# chi2, p_value, dof, expected = stats.chi2_contingency(contingency_cat1_cat2)

print("Chi-square test (cat1 vs cat2):")
print(f"  χ² = {chi2:.4f}")
print(f"  Degrees of freedom = {dof}")
print(f"  p-value = {p_value:.4g}")

In [None]:
# Task 4c: Cramér's V for cat1 vs cat2
# YOUR CODE HERE: Calculate Cramér's V
# v = cramers_v(contingency_cat1_cat2)

print(f"Cramér's V (cat1 vs cat2): {v:.4f}")

In [None]:
# Analyze all pairs of categorical variables
categorical_vars = ['cat1', 'cat2', 'cat3', 'population']

print("=== Association Analysis: All Categorical Pairs ===")
print("-" * 60)
print(f"{'Pair':<25} {'χ²':>10} {'p-value':>10} {'Cramér V':>10}")
print("-" * 60)

for i, var1 in enumerate(categorical_vars):
    for var2 in categorical_vars[i+1:]:
        # YOUR CODE HERE: Create contingency table and calculate chi-square, p-value, and Cramér's V
        # ct = pd.crosstab(...)
        # chi2, p_value, dof, _ = stats.chi2_contingency(...)
        # v = cramers_v(...)
        pair = f"{var1} vs {var2}"
        print(f"{pair:<25} {chi2:>10.2f} {p_value:>10.3g} {v:>10.4f}")

In [None]:
# Task 4d: Visualize key associations
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# cat1 vs cat2
ct1 = pd.crosstab(df['cat1'], df['cat2'])
sns.heatmap(ct1, annot=True, fmt='d', cmap='Blues', ax=axes[0])
axes[0].set_title('cat1 vs cat2')

# cat1 vs population
ct2 = pd.crosstab(df['cat1'], df['population'])
sns.heatmap(ct2, annot=True, fmt='d', cmap='Greens', ax=axes[1])
axes[1].set_title('cat1 vs population')

# cat2 vs population
ct3 = pd.crosstab(df['cat2'], df['population'])
sns.heatmap(ct3, annot=True, fmt='d', cmap='Oranges', ax=axes[2])
axes[2].set_title('cat2 vs population')

plt.tight_layout()
plt.show()

### Reflection Question 4:
**After running the cells above, answer:**

1. Which pairs of categorical variables show significant associations?
2. Which pair has the strongest association (highest Cramér's V)?
3. What does this tell you about the structure of the experiment/data?

*Your answer here:*



---
## 5. Gene Expression by Categories: Quantifying Differences

### Background
In Project 1, you visualized gene expression distributions by categories (boxplots). Now we will quantify whether expression levels are associated with categorical variables.

### Approach
For continuous vs categorical associations, we can:
1. Calculate mean expression per category
2. Perform ANOVA or Kruskal-Wallis tests (beyond this course)
3. Create expression-level categories and use chi-square/Cramér's V
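
Approaches 1 and 3 can be sketched on a small invented table (hypothetical columns `group` and `expr`, not the project data). The sketch uses `pd.qcut` as one possible way to bin values into three equal-frequency levels. It places the cut points at exactly 1/3 and 2/3, so its bins may differ slightly from a hand-written rule based on the 0.33 and 0.67 quantiles.

```python
import numpy as np
import pandas as pd

# Invented toy data (for illustration only); one value is missing on purpose
toy = pd.DataFrame({
    "group": ["A", "A", "A", "B", "B", "B", "C", "C", "C", "C"],
    "expr": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, np.nan],
})

# Approach 1: mean expression per category (NaN values are skipped)
mean_by_group = toy.groupby("group")["expr"].mean()

# Approach 3: bin expression into three equal-frequency levels.
# Missing values stay missing instead of being assigned to a level.
toy["expr_level"] = pd.qcut(toy["expr"], q=3, labels=["Low", "Medium", "High"])

# Contingency table of category vs expression level, ready for a chi-square test
level_by_group = pd.crosstab(toy["group"], toy["expr_level"])

print(mean_by_group)
print(level_by_group)
```

In this toy table each group falls entirely into one expression level, which is the pattern a high Cramér's V between a category and an expression level would reflect.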

### Your Tasks:
1. **Task 5a**: Categorize gene expression levels (Low, Medium, High)
2. **Task 5b**: Create contingency table between expression level and categories
3. **Task 5c**: Calculate Cramér's V to quantify association
4. **Task 5d**: Visualize with heatmaps

In [None]:
# Task 5a: Categorize gene1 expression into Low, Medium, High
# Use tertiles (33rd and 67th percentiles)
q33 = df['gene1'].quantile(0.33)
q67 = df['gene1'].quantile(0.67)

def categorize_expression(value):
    if pd.isna(value):
        return np.nan
    elif value <= q33:
        return 'Low'
    elif value <= q67:
        return 'Medium'
    else:
        return 'High'

# YOUR CODE HERE: Apply the categorization function
# df['gene1_level'] = df['gene1'].apply(...)

df['gene1_level'] = pd.Categorical(df['gene1_level'], categories=['Low', 'Medium', 'High'], ordered=True)

# Check the distribution
print("Gene1 expression levels:")
print(df['gene1_level'].value_counts())

In [None]:
# Similarly for gene2
q33_g2 = df['gene2'].quantile(0.33)
q67_g2 = df['gene2'].quantile(0.67)

def categorize_gene2(value):
    if pd.isna(value):
        return np.nan
    elif value <= q33_g2:
        return 'Low'
    elif value <= q67_g2:
        return 'Medium'
    else:
        return 'High'

# YOUR CODE HERE: Apply the categorization
# df['gene2_level'] = ...

df['gene2_level'] = pd.Categorical(df['gene2_level'], categories=['Low', 'Medium', 'High'], ordered=True)

print("Gene2 expression levels:")
print(df['gene2_level'].value_counts())

In [None]:
# Task 5b & 5c: Association between gene expression levels and categorical variables
print("=== Gene Expression Level Associations ===")
print("-" * 60)

expression_vars = ['gene1_level', 'gene2_level']
category_vars = ['cat1', 'cat2', 'cat3', 'population']

for exp_var in expression_vars:
    print(f"\n{exp_var}:")
    for cat_var in category_vars:
        # YOUR CODE HERE: Create contingency table and calculate association
        # ct = pd.crosstab(...)
        # chi2, _, dof, _ = stats.chi2_contingency(...)
        # v = cramers_v(...)
        print(f"  vs {cat_var:<12}: χ² = {chi2:>8.2f}, V = {v:.4f}")

In [None]:
# Task 5d: Visualize gene expression levels by population
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Gene1 level vs population
ct1 = pd.crosstab(df['population'], df['gene1_level'], normalize='index')
ct1.plot(kind='bar', stacked=True, ax=axes[0], colormap='YlOrRd')
axes[0].set_title('Gene1 Expression Level Distribution by Population')
axes[0].set_ylabel('Proportion')
axes[0].legend(title='Gene1 Level')
axes[0].set_xticklabels(axes[0].get_xticklabels(), rotation=0)

# Gene2 level vs population
ct2 = pd.crosstab(df['population'], df['gene2_level'], normalize='index')
ct2.plot(kind='bar', stacked=True, ax=axes[1], colormap='YlGnBu')
axes[1].set_title('Gene2 Expression Level Distribution by Population')
axes[1].set_ylabel('Proportion')
axes[1].legend(title='Gene2 Level')
axes[1].set_xticklabels(axes[1].get_xticklabels(), rotation=0)

plt.tight_layout()
plt.show()

### Reflection Question 5:
**After running the cells above, answer:**

1. Which categorical variable is most strongly associated with gene1 expression levels?
2. Which categorical variable is most strongly associated with gene2 expression levels?
3. Does this suggest that some categories have systematically different gene expression?

*Your answer here:*



---
## 6. Summary: Correlation Matrix Heatmap

### Your Task:
Create a comprehensive correlation analysis summary:
1. Calculate correlation matrix for all numeric variables
2. Compare overall vs within-population correlations
3. Summarize all association findings

In [None]:
# Correlation matrix for numeric variables
numeric_cols = ['gene1', 'gene2']
corr_matrix = df[numeric_cols].corr()

print("=== Overall Correlation Matrix ===")
corr_matrix

In [None]:
# Within-population correlation comparison
print("=== Correlation Comparison: Overall vs By Population ===")
print("-" * 60)

# Overall
overall_clean = df[['gene1', 'gene2']].dropna()
r_overall = overall_clean['gene1'].corr(overall_clean['gene2'])
print(f"{'Overall':<12}: r = {r_overall:.4f}")
print()

# By population
for pop in sorted(df['population'].dropna().unique()):  # dropna: NaN cannot be sorted with labels
    subset = df[df['population'] == pop][['gene1', 'gene2']].dropna()
    r = subset['gene1'].corr(subset['gene2'])
    print(f"{pop:<12}: r = {r:.4f}")

In [None]:
# Final summary visualization
fig, axes = plt.subplots(2, 2, figsize=(14, 12))

# 1. Overall scatter with population coloring
sns.scatterplot(data=df, x='gene1', y='gene2', hue='population', 
                alpha=0.5, ax=axes[0, 0])
axes[0, 0].set_title(f'Gene1 vs Gene2 by Population (Overall r = {r_overall:.3f})')

# 2. Categorical association heatmap (Cramér's V matrix)
cat_vars = ['cat1', 'cat2', 'cat3', 'population']
v_matrix = pd.DataFrame(index=cat_vars, columns=cat_vars, dtype=float)
for i, v1 in enumerate(cat_vars):
    for j, v2 in enumerate(cat_vars):
        if i == j:
            v_matrix.loc[v1, v2] = 1.0
        else:
            ct = pd.crosstab(df[v1], df[v2])
            v_matrix.loc[v1, v2] = cramers_v(ct)

sns.heatmap(v_matrix.astype(float), annot=True, cmap='Purples', ax=axes[0, 1],
            vmin=0, vmax=1, fmt='.3f')
axes[0, 1].set_title("Cramér's V: Categorical Variable Associations")

# 3. Gene expression level vs population
ct = pd.crosstab(df['population'], df['gene1_level'], normalize='index')
ct.plot(kind='bar', stacked=True, ax=axes[1, 0], colormap='RdYlGn')
axes[1, 0].set_title('Gene1 Expression by Population')
axes[1, 0].set_ylabel('Proportion')
axes[1, 0].set_xticklabels(axes[1, 0].get_xticklabels(), rotation=0)

# 4. Conditional correlation summary
pops = sorted(df['population'].dropna().unique())  # dropna: NaN cannot be sorted with labels
correlations = []
for pop in pops:
    subset = df[df['population'] == pop][['gene1', 'gene2']].dropna()
    r = subset['gene1'].corr(subset['gene2'])
    correlations.append(r)

axes[1, 1].barh(pops, correlations, color='steelblue', edgecolor='black')
axes[1, 1].axvline(x=r_overall, color='red', linestyle='--', label=f'Overall r = {r_overall:.3f}')
axes[1, 1].set_xlabel('Pearson Correlation (gene1 vs gene2)')
axes[1, 1].set_title('Conditional Correlations by Population')
axes[1, 1].legend()
axes[1, 1].set_xlim(-1, 1)  # full range of r, so negative within-group correlations are not clipped

plt.tight_layout()
plt.show()

---
## 7. Final Conclusions

### Your Task:
Based on all your analyses, write a summary of your findings.

**Questions to address:**

1. **Gene-Gene Correlation**: What is the overall relationship between gene1 and gene2? Is it consistent across all subgroups?

2. **Simpson's Paradox**: Did you observe any cases where within-group relationships differed from overall relationships?

3. **Categorical Associations**: Which categorical variables are most strongly associated with each other? What does this suggest about the experimental design?

4. **Gene Expression Patterns**: Are gene expression levels systematically different across categories or populations?

5. **Key Insight**: What is the most important finding from this analysis?

### Your Conclusions:

**1. Gene-Gene Correlation:**
*Your answer here...*

**2. Simpson's Paradox Observations:**
*Your answer here...*

**3. Categorical Associations:**
*Your answer here...*

**4. Gene Expression Patterns:**
*Your answer here...*

**5. Key Insight:**
*Your answer here...*