# Gene Expression Data Analysis Project - Student Handout

## Project Overview

In this project, you will perform a **complete Exploratory Data Analysis (EDA)** of a gene expression dataset. The dataset simulates real-world gene expression data with common challenges.

### Dataset Characteristics:
- **Two genes** (`gene1`, `gene2`) with expression values that may contain:
  - **Dropout events** (zeros) - common in single-cell RNA-seq data
  - **Missing values (NaN)** - due to technical failures or quality filtering
  - **Outliers** - extreme expression values
- **Three categorical variables** (`cat1`, `cat2`, `cat3`) representing different sample annotations
- **Hidden population structure** to discover through analysis

### EDA Objectives:
1. **Data Loading & Inspection**: Understand the structure and dimensions of the data
2. **Missing Value Analysis**: Identify and quantify NaN values overall and by categories
3. **Summary Statistics**: Compute central tendency, dispersion, and distribution shape metrics
4. **Distribution Visualization**: Histograms to understand gene expression distributions
5. **Categorical Analysis**: Barplots to examine sample distribution across categories
6. **Comparative Analysis**: Boxplots and violin plots to compare gene expression across groups
7. **Relationship Exploration**: Scatter plots to visualize gene-gene correlations by categories

### Tools You Will Use:
- **pandas**: Data manipulation and summary statistics
- **seaborn**: Statistical visualization (primary plotting library)
- **matplotlib**: Plot customization
- **scipy.stats**: Statistical metrics (skewness, kurtosis)

### Instructions:
- Complete each code cell marked with `# YOUR CODE HERE`
- Answer the reflection questions in the markdown cells
- Run all cells to verify your code works correctly

## 1. Setup and Data Loading

### Your Task:
- Import the required libraries: `pandas`, `numpy`, `seaborn`, `matplotlib.pyplot`, and `scipy.stats`
- Set seaborn style to "whitegrid"
- Set figure size to [10, 6]
- Load the dataset from "gene_expression_data.csv" into a DataFrame called `df_analysis`
- Print the shape of the dataset

### Libraries to import:
- `pandas` as `pd`: DataFrame operations and CSV reading
- `numpy` as `np`: Numerical operations
- `seaborn` as `sns`: Statistical visualization
- `matplotlib.pyplot` as `plt`: Plot customization
- `stats` from `scipy`: Statistical functions

In [None]:
# YOUR CODE HERE: Import libraries
# import pandas as ...
# import numpy as ...
# import seaborn as ...
# import matplotlib.pyplot as ...
# from scipy import ...




# YOUR CODE HERE: Set seaborn style and figure size
# sns.set_theme(style=...)  # sns.set(...) is an older alias for the same call
# plt.rcParams['figure.figsize'] = ...



# YOUR CODE HERE: Load the dataset
# df_analysis = ...



# YOUR CODE HERE: Print the shape
# print(f"Shape: ...")


## 2. Missing Value Analysis (NaN Detection)

### Why check for missing values?
Missing data is common in gene expression datasets due to:
- **Technical dropouts**: Low-abundance transcripts may not be detected (in this dataset dropouts appear as zeros rather than NaN, so count the two separately)
- **Quality filtering**: Poor-quality measurements are often set to NaN
- **Sample processing issues**: Some samples may have incomplete data

### Your Tasks:
1. **Task 2a**: Check total NaN values per column using `.isna().sum()`
2. **Task 2b**: Calculate percentage of NaN values per column
3. **Task 2c**: Check NaN values grouped by each categorical variable (cat1, cat2, cat3)

### Hints:
- Use `df.isna().sum()` to count NaN per column
- Use `df.groupby('column')[['gene1', 'gene2']].apply(lambda x: x.isna().sum())` for grouped counts
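
As a rough illustration of these hints, the sketch below counts NaN values on a small made-up DataFrame. The name `toy`, its `group` column, and all values are invented for this example and are not part of the project dataset. It also counts zeros separately, because a zero is a recorded value while NaN is a missing one.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "gene1": [1.2, np.nan, 0.0, 3.4, np.nan, 2.2],
    "gene2": [0.5, 0.7, np.nan, 1.1, 0.9, 0.0],
    "group": ["A", "A", "B", "B", "B", "A"],
})
gene_cols = ["gene1", "gene2"]

nan_counts = toy[gene_cols].isna().sum()          # NaN count per column
nan_percent = toy[gene_cols].isna().mean() * 100  # share of rows that are NaN, in %
zero_counts = (toy[gene_cols] == 0).sum()         # zeros are not NaN; count them separately

# NaN count per gene within each group
nan_by_group = toy[gene_cols].isna().groupby(toy["group"]).sum()

print(nan_counts)
print(nan_percent.round(1))
print(zero_counts)
print(nan_by_group)
```

Taking the mean of the boolean mask gives the fraction of missing rows directly, which avoids dividing by `len(df)` by hand.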

In [None]:
# Tasks 2a and 2b: Check for NaN values overall (counts and percentages)
print("=== Missing Values (NaN) ===")
print("\nTotal NaN per column:")
# YOUR CODE HERE: Print NaN count per column



print("\nTotal NaN in dataset:")
# YOUR CODE HERE: Print total NaN count



print("\nPercentage of NaN per column:")
# YOUR CODE HERE: Calculate and print percentage of NaN per column


In [None]:
# Task 2c: Check for NaN values by categories
print("=== NaN by cat1 ===")
# YOUR CODE HERE: Group by cat1 and count NaN in gene1 and gene2



print("\n=== NaN by cat2 ===")
# YOUR CODE HERE: Group by cat2 and count NaN in gene1 and gene2



print("\n=== NaN by cat3 ===")
# YOUR CODE HERE: Group by cat3 and count NaN in gene1 and gene2


### Reflection Question 2:
**After running the cells above, answer the following:**

1. Which column has the most missing values?
2. Are NaN values evenly distributed across categories, or concentrated in specific groups?
3. What might cause this pattern of missing data?

*Your answer here:*



## 3. Comprehensive Summary Statistics

### Central Tendency Measures:
- **Mean**: Average expression value - sensitive to outliers
- **Median**: Middle value - robust to outliers, useful for skewed distributions
- **Mode**: Most frequent value - indicates common expression levels

### Dispersion Measures:
- **Standard Deviation (Std)**: Spread around the mean
- **Variance**: Squared spread (Std²)
- **Interquartile Range (IQR)**: Range between 25th and 75th percentiles

### Distribution Shape:
- **Skewness**: Asymmetry of distribution (positive = right tail, negative = left tail)
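- **Kurtosis**: "Tailedness" of distribution; `scipy.stats.kurtosis` returns excess kurtosis by default (0 for a normal distribution, positive = heavier tails than normal)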
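
The shape metrics can be tried out on a handful of numbers before applying them to the genes. The sketch below uses invented values with one large observation on the right; it is only meant to show the sign of the skewness, the two kurtosis conventions in `scipy.stats`, and what happens when a NaN is left in the input.

```python
import numpy as np
from scipy.stats import kurtosis, skew

# Toy values with one large observation on the right (invented for illustration)
values = np.array([1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 4.0, 20.0])

sample_skew = skew(values)                     # positive here: long right tail
excess_kurt = kurtosis(values)                 # default fisher=True: normal distribution gives 0
pearson_kurt = kurtosis(values, fisher=False)  # Pearson definition: normal distribution gives 3

print(f"mean={values.mean():.2f}, median={np.median(values):.2f}")
print(f"skewness={sample_skew:.2f}")
print(f"excess kurtosis={excess_kurt:.2f}, Pearson kurtosis={pearson_kurt:.2f}")

# With the default nan_policy='propagate', a single NaN makes the result NaN
skew_with_nan = skew(np.append(values, np.nan))
print(skew_with_nan)
```

The last line is the reason for dropping NaN values before calling the `scipy.stats` functions.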

### Your Task:
Create a summary statistics DataFrame containing: Mean, Median, Mode, Std, Variance, Min, Q1, Q2, Q3, Max, IQR, Skewness, Kurtosis for both genes.

### Hints:
- Import `kurtosis` and `skew` from `scipy.stats`
- Use `.dropna()` before calculating statistics
- Use `.mean()`, `.median()`, `.mode()`, `.std()`, `.var()`, `.min()`, `.max()`, `.quantile()`
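
A minimal sketch of these pandas methods on a toy Series follows (the values are invented, and the variable names are only for this example). pandas reductions skip NaN by default, but calling `.dropna()` first keeps the behaviour consistent with the `scipy.stats` functions, which do not.

```python
import numpy as np
import pandas as pd

# Toy series with one NaN (values invented for illustration)
s = pd.Series([1.0, 2.0, 2.0, 3.0, 4.0, np.nan, 50.0])
clean = s.dropna()

q1 = clean.quantile(0.25)
q3 = clean.quantile(0.75)
summary = pd.Series({
    "Mean": clean.mean(),
    "Median": clean.median(),
    "Mode": clean.mode().iloc[0],  # first mode if several values tie
    "Std": clean.std(),            # sample standard deviation (ddof=1)
    "Variance": clean.var(),
    "Q1 (25%)": q1,
    "Q3 (75%)": q3,
    "IQR": q3 - q1,
})
print(summary.round(3))
```

In this toy case the single value of 50 pulls the mean well above the median, while the median and IQR barely notice it.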

In [None]:
from scipy.stats import kurtosis, skew

# YOUR CODE HERE: Create summary statistics for gene1 and gene2
genes = ['gene1', 'gene2']
summary_stats = pd.DataFrame()

for gene in genes:
    data = df_analysis[gene].dropna()
    # Wrapping the dict in pd.Series makes the statistic names the row index
    summary_stats[gene] = pd.Series({
        'Mean': None,  # YOUR CODE: Calculate mean
        'Median': None,  # YOUR CODE: Calculate median
        'Mode': None,  # YOUR CODE: Calculate mode (use .mode().iloc[0])
        'Std': None,  # YOUR CODE: Calculate standard deviation
        'Variance': None,  # YOUR CODE: Calculate variance
        'Min': None,  # YOUR CODE: Calculate minimum
        'Q1 (25%)': None,  # YOUR CODE: Calculate 25th percentile
        'Q2 (50%)': None,  # YOUR CODE: Calculate 50th percentile
        'Q3 (75%)': None,  # YOUR CODE: Calculate 75th percentile
        'Max': None,  # YOUR CODE: Calculate maximum
        'IQR': None,  # YOUR CODE: Calculate IQR (Q3 - Q1)
        'Skewness': None,  # YOUR CODE: Calculate skewness using skew()
        'Kurtosis': None  # YOUR CODE: Calculate kurtosis using kurtosis()
    })

print("=== Summary Statistics for Gene Expression ===")
summary_stats.T

### Reflection Question 3:
**After running the cell above, answer the following:**

1. Is the mean larger or smaller than the median for each gene? What does this indicate about the distribution?
2. What does the skewness value tell you about the data?
3. Are there potential outliers based on the Min/Max compared to the IQR?

*Your answer here:*



## 4. Histograms: Gene Expression Distributions

### Purpose:
Histograms reveal the **shape of data distribution** by grouping values into bins.

### What to look for:
- **Unimodal vs Multimodal**: Single peak vs multiple peaks
- **Skewness**: Is the distribution symmetric or skewed?
- **Zero inflation**: Many zeros (dropout)
- **Outliers**: Extreme values isolated from the main distribution

### Your Task:
Create histograms for gene1 and gene2 using `sns.histplot()`:
- Use 50 bins and enable KDE (kernel density estimation)
- Add vertical lines for mean (red, dashed) and median (green, dashed)
- Add proper labels, titles, and legends

### Hints:
- Use `sns.histplot(data=df, x='column', bins=50, kde=True, ax=ax)`
- Use `ax.axvline(value, color='red', linestyle='--', label='label')`

In [None]:
# YOUR CODE HERE: Create histograms for gene1 and gene2
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Gene1 histogram with KDE
# YOUR CODE HERE: Create histogram for gene1 with sns.histplot()


# YOUR CODE HERE: Add mean line (red, dashed)


# YOUR CODE HERE: Add median line (green, dashed)


# YOUR CODE HERE: Set labels, title, and legend for gene1



# Gene2 histogram with KDE
# YOUR CODE HERE: Create histogram for gene2 with sns.histplot()


# YOUR CODE HERE: Add mean line (red, dashed)


# YOUR CODE HERE: Add median line (green, dashed)


# YOUR CODE HERE: Set labels, title, and legend for gene2



plt.tight_layout()
plt.show()

### Reflection Question 4:
**After running the cell above, answer the following:**

1. Is the distribution unimodal or multimodal? What might multiple peaks indicate?
2. Where is the mean relative to the median? What type of skew does this suggest?
3. Do you see any visible outliers or zero inflation?

*Your answer here:*



## 5. Barplots: Categorical Variable Distributions

### Purpose:
Barplots display **counts** for each category, helping understand:
- **Sample distribution**: Are categories balanced or imbalanced?
- **Class sizes**: Important for statistical power in group comparisons

### Your Task:
Create barplots for cat1, cat2, and cat3 using `sns.countplot()`:
- Use different color palettes for each (e.g., 'Set2', 'Set3', 'pastel')
- Add count labels on top of each bar using `ax.bar_label()`
- Add proper labels and titles

### Hints:
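- Use `sns.countplot(data=df, x='column', ax=ax, palette='Set2')`; in seaborn 0.13 and later, also pass `hue='column', legend=False` to avoid the deprecation warning for `palette` without `hue`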
- Use `for container in ax.containers: ax.bar_label(container)`
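
The sketch below shows the count-label pattern on a toy column. The column name `label` and its values are invented for illustration, and the palette is left out so the example behaves the same across seaborn versions.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy categorical column (labels and counts invented for illustration)
toy = pd.DataFrame({"label": ["A", "A", "A", "B", "B", "C"]})

counts = toy["label"].value_counts()  # the numbers the bars will show
print(counts)

fig, ax = plt.subplots(figsize=(4, 3))
sns.countplot(data=toy, x="label", ax=ax)

# Each container holds a group of bars; bar_label writes the bar height above each bar
for container in ax.containers:
    ax.bar_label(container)

ax.set_xlabel("label")
ax.set_ylabel("Count")
ax.set_title("Toy countplot with count labels")
fig.tight_layout()
```

Printing `value_counts()` next to the plot is a quick way to check that the labels on the bars match the table.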

In [None]:
# YOUR CODE HERE: Create barplots for categorical variables
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# cat1 barplot
# YOUR CODE HERE: Create countplot for cat1


# YOUR CODE HERE: Add labels and title


# YOUR CODE HERE: Add count labels on bars



# cat2 barplot
# YOUR CODE HERE: Create countplot for cat2


# YOUR CODE HERE: Add labels and title


# YOUR CODE HERE: Add count labels on bars



# cat3 barplot
# YOUR CODE HERE: Create countplot for cat3


# YOUR CODE HERE: Add labels and title


# YOUR CODE HERE: Add count labels on bars



plt.tight_layout()
plt.show()

### Reflection Question 5:
**After running the cell above, answer the following:**

1. Are the categories balanced (similar sample sizes)?
2. Which category in cat3 has the smallest sample size?
3. Could imbalanced categories affect your analysis? How?

*Your answer here:*



## 6. Boxplots: Comparing Gene Expression Across Categories

### Purpose:
Boxplots provide a **5-number summary** for each group:
- **Box**: Interquartile range (IQR) - middle 50% of data
- **Line inside box**: Median (Q2)
- **Whiskers**: Extend to the most extreme data points that lie within 1.5×IQR of the box edges
- **Points beyond whiskers**: Potential outliers
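
The elements listed above can be reproduced by hand. The sketch below applies the 1.5×IQR rule to a few invented values, using NumPy's default linear-interpolation percentiles; it is meant as a check on the definitions rather than a replacement for `sns.boxplot()`.

```python
import numpy as np

# Toy values with one extreme observation (invented for illustration)
values = np.array([2.0, 3.0, 3.5, 4.0, 4.5, 5.0, 6.0, 15.0])

q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Fences: the furthest a whisker is allowed to reach
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme data points inside the fences
inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Anything outside the fences is drawn as an individual point
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"fences=({lower_fence}, {upper_fence})")
print(f"whiskers=({whisker_low}, {whisker_high}), outliers={outliers}")
```

Here the upper fence is 8.0625, but the upper whisker ends at 6.0 because that is the largest observation below the fence; the value 15.0 is the only point flagged as a potential outlier.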

### Your Task:
Create boxplots comparing gene expression by each categorical variable:
- Create 2 subplots for each category (one for gene1, one for gene2)
- Use `sns.boxplot()` with appropriate palettes
- Add proper labels and titles

### Hints:
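- Use `sns.boxplot(data=df, x='category', y='gene', ax=ax, palette='Set2')`; in seaborn 0.13 and later, also pass `hue='category', legend=False` to avoid the deprecation warning for `palette` without `hue`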

In [None]:
# YOUR CODE HERE: Boxplots by cat1
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# YOUR CODE HERE: Boxplot gene1 by cat1


# YOUR CODE HERE: Boxplot gene2 by cat1



plt.tight_layout()
plt.show()

In [None]:
# YOUR CODE HERE: Boxplots by cat2
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# YOUR CODE HERE: Boxplot gene1 by cat2


# YOUR CODE HERE: Boxplot gene2 by cat2



plt.tight_layout()
plt.show()

In [None]:
# YOUR CODE HERE: Boxplots by cat3
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# YOUR CODE HERE: Boxplot gene1 by cat3


# YOUR CODE HERE: Boxplot gene2 by cat3



plt.tight_layout()
plt.show()

### Reflection Question 6:
**After running the cells above, answer the following:**

1. Which categorical variable shows the clearest separation in gene expression?
2. Are there visible outliers? In which groups?
3. Do the boxes overlap between categories? What does this suggest about group differences?

*Your answer here:*



## 7. Violin Plots: Full Distribution Visualization by Categories

### Purpose:
Violin plots combine **boxplots** with **kernel density estimation** to show:
- The full **shape of the distribution**
- **Multimodality**: Multiple peaks indicate subpopulations
- **Density**: Wider sections = more observations at that value

### Advantages over Boxplots:
- Reveal distribution shapes that boxplots hide
- Show bimodal or multimodal distributions clearly
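
To make the point about hidden shapes concrete, the sketch below builds a synthetic two-peak sample and compares its quartiles with a kernel density estimate. `scipy.stats.gaussian_kde` is used here as a stand-in for the density a violin plot draws; the sample, the seed, and the default bandwidth are assumptions of this example, and seaborn's own settings may differ.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Synthetic bimodal sample: two well-separated subpopulations (invented)
rng = np.random.default_rng(1)
sample = np.concatenate([
    rng.normal(loc=0.0, scale=0.5, size=500),
    rng.normal(loc=5.0, scale=0.5, size=500),
])

# What a boxplot would summarize
q1, median, q3 = np.percentile(sample, [25, 50, 75])

# What a violin plot adds: an estimate of the density along the value axis
kde = gaussian_kde(sample)
density_at_modes = kde([0.0, 5.0])
density_at_median = kde([median])[0]

print(f"Q1={q1:.2f}, median={median:.2f}, Q3={q3:.2f}")
print(f"density near the two peaks: {density_at_modes.round(3)}")
print(f"density at the median:      {density_at_median:.3f}")
```

The median falls in the valley between the two peaks, where very few observations lie. A boxplot would still draw its centre line there, while the violin would narrow at that point and widen around each peak.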

### Your Task:
Create violin plots for gene expression by each categorical variable:
- Use `sns.violinplot()` with `inner='box'` to draw a miniature boxplot (median and quartiles) inside each violin
- Use different palettes for each category

### Hints:
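- Use `sns.violinplot(data=df, x='category', y='gene', ax=ax, palette='Set2', inner='box')`; in seaborn 0.13 and later, also pass `hue='category', legend=False` to avoid the deprecation warning for `palette` without `hue`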

In [None]:
# YOUR CODE HERE: Violin plots by cat1
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# YOUR CODE HERE: Violin plot gene1 by cat1


# YOUR CODE HERE: Violin plot gene2 by cat1



plt.tight_layout()
plt.show()

In [None]:
# YOUR CODE HERE: Violin plots by cat2
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# YOUR CODE HERE: Violin plot gene1 by cat2


# YOUR CODE HERE: Violin plot gene2 by cat2



plt.tight_layout()
plt.show()

In [None]:
# YOUR CODE HERE: Violin plots by cat3
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# YOUR CODE HERE: Violin plot gene1 by cat3


# YOUR CODE HERE: Violin plot gene2 by cat3



plt.tight_layout()
plt.show()

### Reflection Question 7:
**After running the cells above, answer the following:**

1. Do any violins show multiple peaks (multimodality)? What might this indicate?
2. How do the violin shapes differ between categories?
3. What additional information do violin plots provide compared to boxplots?

*Your answer here:*



## 8. Scatter Plots: Gene-Gene Relationships by Categories

### Purpose:
Scatter plots reveal the **relationship between two continuous variables** while using color to encode categorical information.

### What to look for:
- **Correlation**: Positive slope = genes tend to be co-expressed; negative slope = one gene tends to be high when the other is low
- **Clusters**: Separate point clouds may indicate distinct populations
- **Color separation**: If colors cluster together, the category explains gene expression patterns
- **Outliers**: Isolated points far from the main cloud
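
What the eye picks up in a colored scatter plot can be backed by numbers. The sketch below computes the Pearson correlation overall and within each group on a deliberately contrived toy table (the `group` column and all values are invented), where the two groups have opposite trends.

```python
import pandas as pd

# Contrived toy data: group A rises, group B falls (invented for illustration)
toy = pd.DataFrame({
    "gene1": [1, 2, 3, 4, 1, 2, 3, 4],
    "gene2": [2, 4, 6, 8, 8, 6, 4, 2],
    "group": ["A"] * 4 + ["B"] * 4,
})

# Pearson correlation ignoring the grouping
overall_r = toy["gene1"].corr(toy["gene2"])

# Pearson correlation within each group
r_by_group = {
    name: part["gene1"].corr(part["gene2"])
    for name, part in toy.groupby("group")
}

print(f"overall r = {overall_r:.2f}")
for name, r in r_by_group.items():
    print(f"group {name}: r = {r:.2f}")
```

In this toy case the overall correlation is zero even though each group is perfectly correlated, which is why it is worth checking the relationship per category as well as overall.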

### Your Task:
Create scatter plots of gene1 vs gene2, colored by each categorical variable:
- Use `sns.scatterplot()` with `hue` parameter
- Use different palettes for each category
- Set `alpha=0.6` for transparency

### Hints:
- Use `sns.scatterplot(data=df, x='gene1', y='gene2', hue='category', palette='Set1', alpha=0.6)`

In [None]:
# YOUR CODE HERE: Scatter plot colored by cat1
plt.figure(figsize=(10, 8))

# YOUR CODE HERE: Create scatter plot with hue='cat1'


plt.title('Gene Expression: gene1 vs gene2 (colored by cat1)')
plt.xlabel('Gene 1')
plt.ylabel('Gene 2')
plt.legend(title='cat1')
plt.show()

In [None]:
# YOUR CODE HERE: Scatter plot colored by cat2
plt.figure(figsize=(10, 8))

# YOUR CODE HERE: Create scatter plot with hue='cat2'


plt.title('Gene Expression: gene1 vs gene2 (colored by cat2)')
plt.xlabel('Gene 1')
plt.ylabel('Gene 2')
plt.legend(title='cat2')
plt.show()

In [None]:
# YOUR CODE HERE: Scatter plot colored by cat3
plt.figure(figsize=(10, 8))

# YOUR CODE HERE: Create scatter plot with hue='cat3'


plt.title('Gene Expression: gene1 vs gene2 (colored by cat3)')
plt.xlabel('Gene 1')
plt.ylabel('Gene 2')
plt.legend(title='cat3')
plt.show()

### Pairplot: All Relationships at Once

### Your Task:
Create a pairplot using `sns.pairplot()`:
- Select only gene1, gene2, and cat1 columns
- Use `hue='cat1'` to color by category
- Set `diag_kind='kde'` for density on diagonal

In [None]:
# YOUR CODE HERE: Create pairplot colored by cat1
# Hint: sns.pairplot(df[['gene1', 'gene2', 'cat1']].dropna(), hue='cat1', ...)



### Reflection Question 8:
**After running the cells above, answer the following:**

1. Are gene1 and gene2 positively or negatively correlated?
2. Which categorical variable best separates the data into distinct clusters?
3. Are there any outliers visible in the scatter plots? Describe them.

*Your answer here:*



## 9. Heatmaps: Relationships Between Categorical Variables

### Purpose:
Heatmaps visualize **cross-tabulations (contingency tables)** between categorical variables, revealing:
- **Co-occurrence patterns**: How categories from different variables relate to each other
- **Conditional distributions**: Given one category, which other categories are most common?
- **Hidden structure**: Relationships that might indicate underlying population structure

### Creating Cross-tabulations:
- **`pd.crosstab()`**: Creates a frequency table counting observations for each combination
- **Normalization options**:
  - `normalize='index'` → row proportions (each row sums to 1)
  - `normalize='columns'` → column proportions (each column sums to 1)
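
The difference between the two normalizations is easiest to see on a tiny table. The sketch below uses invented category levels (`A`, `B`, `x`, `y`, `z`) that are not taken from the project dataset.

```python
import pandas as pd

# Toy annotations (levels invented for illustration)
toy = pd.DataFrame({
    "cat1": ["A", "A", "A", "B", "B", "B", "B", "A"],
    "cat3": ["x", "x", "y", "y", "y", "z", "z", "x"],
})

counts = pd.crosstab(toy["cat1"], toy["cat3"])                         # raw counts
row_norm = pd.crosstab(toy["cat1"], toy["cat3"], normalize="index")    # each row sums to 1
col_norm = pd.crosstab(toy["cat1"], toy["cat3"], normalize="columns")  # each column sums to 1

print(counts)
print(row_norm.round(2))
print(col_norm.round(2))
```

Reading the row-normalized table: 75% of the `A` samples are `x`. Reading the column-normalized table: 100% of the `x` samples are `A`. Because the normalized tables hold floats, a heatmap of them needs a float format such as `fmt='.2%'`; `fmt='d'` only works for the integer count table.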

### Your Tasks:
1. Create a crosstab between cat1 and cat2
2. Create a crosstab between cat1 and cat3
3. Create a crosstab between cat2 and cat3
4. Create normalized heatmaps to show proportions

### Hints:
- Use `pd.crosstab(df['cat1'], df['cat2'])` to create contingency table
- Use `sns.heatmap(crosstab, annot=True, fmt='d', cmap='Blues')` for count heatmap
- For percentages use `fmt='.2%'` instead of `fmt='d'`

In [None]:
# Task 9a: Cross-tabulation heatmap - cat1 vs cat2
# YOUR CODE HERE: Create crosstab between cat1 and cat2
# crosstab_1_2 = pd.crosstab(...)


plt.figure(figsize=(8, 6))
# YOUR CODE HERE: Create heatmap with sns.heatmap()
# Use annot=True, fmt='d', cmap='Blues'


plt.title('Crosstab Heatmap: cat1 vs cat2 (Raw Counts)', fontsize=14)
plt.xlabel('cat2', fontsize=12)
plt.ylabel('cat1', fontsize=12)
plt.tight_layout()
plt.show()

In [None]:
# Task 9b: Cross-tabulation heatmap - cat1 vs cat3
# YOUR CODE HERE: Create crosstab between cat1 and cat3


plt.figure(figsize=(10, 6))
# YOUR CODE HERE: Create heatmap with cmap='Greens'


plt.title('Crosstab Heatmap: cat1 vs cat3 (Raw Counts)', fontsize=14)
plt.xlabel('cat3', fontsize=12)
plt.ylabel('cat1', fontsize=12)
plt.tight_layout()
plt.show()

In [None]:
# Task 9c: Cross-tabulation heatmap - cat2 vs cat3
# YOUR CODE HERE: Create crosstab between cat2 and cat3


plt.figure(figsize=(10, 6))
# YOUR CODE HERE: Create heatmap with cmap='Oranges'


plt.title('Crosstab Heatmap: cat2 vs cat3 (Raw Counts)', fontsize=14)
plt.xlabel('cat3', fontsize=12)
plt.ylabel('cat2', fontsize=12)
plt.tight_layout()
plt.show()

### Task 9d: Normalized Heatmaps (Proportions)

Normalizing helps interpret **conditional probabilities**:
- **Row-normalized (`normalize='index'`)**: "Given cat1=TypeA, what proportion are in each cat3 category?"
- **Column-normalized (`normalize='columns'`)**: "Given cat3=SubtypeA, what proportion are TypeA vs TypeB?"

In [None]:
# Task 9d: Normalized crosstab heatmaps
fig, axes = plt.subplots(1, 2, figsize=(16, 5))

# Row-normalized: cat1 vs cat3
# YOUR CODE HERE: Create crosstab with normalize='index'
# crosstab_row_norm = pd.crosstab(df_analysis['cat1'], df_analysis['cat3'], normalize='index')


# YOUR CODE HERE: Create heatmap on axes[0]
# Use fmt='.2%' for percentage format, cmap='YlGnBu'


axes[0].set_title('cat1 vs cat3 (Row-normalized: % within each cat1)', fontsize=12)
axes[0].set_xlabel('cat3', fontsize=11)
axes[0].set_ylabel('cat1', fontsize=11)

# Column-normalized: cat1 vs cat3
# YOUR CODE HERE: Create crosstab with normalize='columns'
# crosstab_col_norm = pd.crosstab(df_analysis['cat1'], df_analysis['cat3'], normalize='columns')


# YOUR CODE HERE: Create heatmap on axes[1]
# Use fmt='.2%' for percentage format, cmap='YlOrRd'


axes[1].set_title('cat1 vs cat3 (Column-normalized: % within each cat3)', fontsize=12)
axes[1].set_xlabel('cat3', fontsize=11)
axes[1].set_ylabel('cat1', fontsize=11)

plt.tight_layout()
plt.show()

### Reflection Question 9:
**After running the cells above, answer the following:**

1. Which pairs of categorical variables show the strongest association (non-uniform distribution)?
2. Looking at cat1 vs cat3: Is there a clear mapping between cat1 categories and cat3 categories?
3. What does the normalized heatmap reveal that the raw count heatmap doesn't?
4. Based on these heatmaps, do you think the categorical variables might be related to hidden population structure?

*Your answer here:*



## 10. EDA Summary - Your Conclusions

### Your Task:
Based on all the analyses you performed, write a summary of your key findings.

**Data Quality Assessment:**
- What did you find about missing values (NaN)?
- Were there outliers? How severe?
- Was there zero inflation in the data?

*Your answer here:*



**Distribution Characteristics:**
- How would you describe the distribution of gene1 and gene2?
- Were the distributions skewed? In which direction?
- Was there evidence of multimodality (multiple peaks)?

*Your answer here:*



**Categorical Patterns:**
- Which categorical variable (cat1, cat2, or cat3) shows the clearest relationship with gene expression?
- Which categories have the highest/lowest gene expression?
- Were there any surprising patterns?

*Your answer here:*



**Categorical Relationships (from Heatmaps):**
- Which pairs of categorical variables are most strongly associated?
- Do the heatmaps suggest any hidden structure in the data?
- How might these relationships help explain the gene expression patterns?

*Your answer here:*



**Gene Correlation:**
- Are gene1 and gene2 correlated? Positive or negative?
- Does the correlation differ by category?
- What might this correlation mean biologically?

*Your answer here:*



**Recommended Next Steps:**
- What further analyses would you recommend?
- Should outliers be removed or kept?
- Would statistical tests help confirm your observations?

*Your answer here:*

