# 01 - Exploratory Data Analysis (EDA)

**AI-Powered Code Review Assistant**  
**CS 5590 - Final Project**

---

## Objectives

This notebook performs comprehensive exploratory data analysis on the **CodeSearchNet** dataset to:

1. Understand the **structure** and **distribution** of code samples
2. Analyze **data quality** and identify potential issues
3. Visualize **label distributions** and class imbalance
4. Determine appropriate **preprocessing strategies**
5. Inform **model architecture** and **training decisions**

---

## CRISP-DM Phase: Data Understanding

This notebook corresponds to **Phase 2** of the CRISP-DM methodology.

## 1. Setup and Imports

In [None]:
# Check if running in Colab
try:
    import google.colab
    IN_COLAB = True
    print("Running in Google Colab")
    
    # Clone repository
    !git clone https://github.com/darshlukkad/Code-Review-Assistant.git
    %cd Code-Review-Assistant
    
except ImportError:
    IN_COLAB = False
    print("Running locally")

In [None]:
# Install required packages
!pip install -q datasets transformers pandas matplotlib seaborn plotly tqdm

In [None]:
# Import libraries
import sys
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from collections import Counter
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')

# Configure plotting
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
%matplotlib inline

print("✓ All libraries imported successfully")

## 2. Load Dataset

We'll use the **CodeSearchNet** dataset from Hugging Face, focusing on **Python** and **JavaScript** for our initial analysis.

**Dataset Details:**
- Source: GitHub repositories
- Languages: Python, JavaScript, Java, Go, PHP, Ruby
- Size: ~2M code samples
- Format: Function code + documentation

In [None]:
from datasets import load_dataset

# Load a small subset first (for quick EDA)
# For the full training split, use split="train"; omitting split returns a DatasetDict with all splits
SUBSET_SIZE = 5000  # Adjust based on your memory

print("Loading Python dataset...")
dataset_python = load_dataset(
    "code_search_net",
    "python",
    split=f"train[:{SUBSET_SIZE}]"
)

print(f"✓ Loaded {len(dataset_python)} Python samples")
print(f"\nDataset features: {list(dataset_python.features.keys())}")

## 3. Initial Data Inspection

In [None]:
# Display first few samples
print("=" * 80)
print("SAMPLE CODE SNIPPET")
print("=" * 80)

sample = dataset_python[0]
print(f"Repository: {sample['repository_name']}")
print(f"Function: {sample['func_name']}")
print(f"\nCode:\n{sample['func_code_string'][:500]}...")
print(f"\nDocstring:\n{sample['func_documentation_string'][:200]}...")

In [None]:
# Convert to pandas for easier analysis
df = pd.DataFrame(dataset_python)

print("Dataset Shape:", df.shape)
print("\nColumn Names:")
print(df.columns.tolist())
print("\nData Types:")
print(df.dtypes)
print("\nMissing Values:")
print(df.isnull().sum())
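
Missing values are only one data-quality signal; duplicated functions are another. The sketch below is a dependency-free illustration, not part of the original pipeline: `normalized_hash` is a hypothetical helper and the toy `snippets` list stands in for `df['func_code_string']`.

```python
import hashlib
from collections import Counter


def normalized_hash(code: str) -> str:
    """Hash a snippet after dropping blank lines and trailing whitespace."""
    lines = [line.rstrip() for line in code.splitlines() if line.strip()]
    return hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()


# Toy snippets standing in for df['func_code_string'] (assumption).
snippets = [
    "def add(a, b):\n    return a + b",
    "def add(a, b):   \n\n    return a + b\n",  # same code, different whitespace
    "def sub(a, b):\n    return a - b",
]

# Every occurrence beyond the first of a given hash is a duplicate.
hash_counts = Counter(normalized_hash(s) for s in snippets)
num_duplicates = sum(count - 1 for count in hash_counts.values())
print(f"Duplicate snippets: {num_duplicates} of {len(snippets)}")
```

This only catches near-exact copies; renamed variables or reordered statements would need a more tolerant comparison.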

## 4. Code Length Analysis

Understanding code length distribution helps us:
- Set appropriate **max_length** for tokenization
- Identify **outliers** (very long/short functions)
- Assess **memory requirements** for training

In [None]:
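# Calculate code statistics
df['code_length'] = df['func_code_string'].str.len()
# Newline count + 1 gives the line count when the snippet has no trailing newline
df['num_lines'] = df['func_code_string'].str.count('\n') + 1
# Whitespace-split tokens: a rough proxy that undercounts subword tokenizer tokens
df['num_tokens'] = df['func_code_string'].str.split().str.len()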

print("CODE LENGTH STATISTICS")
print("=" * 80)
print(df[['code_length', 'num_lines', 'num_tokens']].describe())

In [None]:
# Visualize distributions
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Code length histogram
axes[0, 0].hist(df['code_length'], bins=50, edgecolor='black', alpha=0.7)
axes[0, 0].set_xlabel('Code Length (characters)')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].set_title('Distribution of Code Length')
axes[0, 0].axvline(df['code_length'].median(), color='red', linestyle='--', label='Median')
axes[0, 0].legend()

# Number of lines
axes[0, 1].hist(df['num_lines'], bins=50, edgecolor='black', alpha=0.7, color='green')
axes[0, 1].set_xlabel('Number of Lines')
axes[0, 1].set_ylabel('Frequency')
axes[0, 1].set_title('Distribution of Number of Lines')
axes[0, 1].axvline(df['num_lines'].median(), color='red', linestyle='--', label='Median')
axes[0, 1].legend()

# Number of tokens
axes[1, 0].hist(df['num_tokens'], bins=50, edgecolor='black', alpha=0.7, color='orange')
axes[1, 0].set_xlabel('Number of Tokens')
axes[1, 0].set_ylabel('Frequency')
axes[1, 0].set_title('Distribution of Number of Tokens')
axes[1, 0].axvline(df['num_tokens'].median(), color='red', linestyle='--', label='Median')
axes[1, 0].legend()

# Box plot for outlier detection
axes[1, 1].boxplot([df['num_lines']], labels=['Lines'])
axes[1, 1].set_ylabel('Number of Lines')
axes[1, 1].set_title('Box Plot - Outlier Detection')

plt.tight_layout()
plt.savefig('eda_code_length_distribution.png', dpi=300, bbox_inches='tight')
plt.show()

print("\n✓ Saved: eda_code_length_distribution.png")
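
The box plot marks outliers visually. To get an explicit cutoff, the usual 1.5 × IQR rule can be applied directly. This is a minimal sketch on made-up line counts; `iqr_upper_fence` is a hypothetical helper, and `method="inclusive"` is chosen because it matches the linear interpolation NumPy uses by default.

```python
import statistics


def iqr_upper_fence(values, k=1.5):
    """Return Q3 + k * IQR, the upper whisker limit of a standard box plot."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 + k * (q3 - q1)


# Made-up line counts standing in for df['num_lines'] (assumption).
line_counts = [3, 5, 6, 8, 9, 10, 12, 14, 15, 120]

fence = iqr_upper_fence(line_counts)
outliers = [v for v in line_counts if v > fence]
print(f"Upper fence: {fence:.1f} lines, outliers: {outliers}")
```

Whether such functions should be truncated or dropped is a preprocessing decision; the fence only tells us how many samples are affected.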

**Key Insights from Code Length Analysis:**

Based on the distributions above, we can determine:
- Most functions are **X-Y lines** (will update after running)
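- CodeBERT accepts at most **512 tokens**, so **max_length=512** is the natural cap; compare an upper percentile of token counts (not only the median) against it to see how many samples would be truncated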
- Outliers (very long functions) may need special handling or truncation
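
To judge a candidate `max_length`, it helps to compute the fraction of samples that fit under it. The sketch below uses hypothetical token counts and a hypothetical helper `coverage_at`; in the notebook the counts would ideally come from the actual tokenizer, since whitespace-split counts undercount subword tokens.

```python
def coverage_at(lengths, max_length):
    """Fraction of samples whose token count is <= max_length."""
    if not lengths:
        raise ValueError("lengths must be non-empty")
    return sum(1 for n in lengths if n <= max_length) / len(lengths)


# Hypothetical token counts standing in for df['num_tokens'] (assumption).
token_counts = [40, 85, 120, 300, 510, 512, 700, 1500]

for cap in (128, 256, 512):
    share = coverage_at(token_counts, cap)
    print(f"max_length={cap}: {share:.1%} of samples fit without truncation")
```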

## 5. Create Synthetic Labels

Since CodeSearchNet doesn't have quality labels, we create them using **heuristic rules**.

**Label Categories:**
1. **Bug:** Error handling, try/except patterns
2. **Security:** Authentication, passwords, secrets
3. **Code Smell:** Long functions, high complexity
4. **Style:** Missing docstrings, naming conventions  
5. **Performance:** Nested loops, inefficient patterns

In [None]:
def label_code_quality(code_str, docstring):
    """
    Apply heuristic rules to create quality labels.
    
    Returns:
        dict: Binary labels for each issue type
    """
    labels = {
        'bug': 0,
        'security': 0,
        'code_smell': 0,
        'style': 0,
        'performance': 0
    }
    
    code_lower = code_str.lower()
    
    # Bug: Contains error handling
    if 'except' in code_lower or 'error' in code_lower:
        labels['bug'] = 1
    
    # Security: Contains sensitive keywords
    security_keywords = ['password', 'token', 'secret', 'key', 'auth']
    if any(kw in code_lower for kw in security_keywords):
        labels['security'] = 1
    
    # Code smell: Long function (>50 lines)
    if code_str.count('\n') > 50:
        labels['code_smell'] = 1
    
    # Style: Missing or short docstring
    if not docstring or len(docstring) < 20:
        labels['style'] = 1
    
    # Performance: two or more loops (a rough proxy; actual nesting is not checked)
    if code_str.count('for ') >= 2 or code_str.count('while ') >= 2:
        labels['performance'] = 1
    
    return labels

# Apply labeling
print("Creating synthetic labels...")
labels_list = []
for _, row in tqdm(df.iterrows(), total=len(df)):
    labels = label_code_quality(
        row['func_code_string'],
        row.get('func_documentation_string', '')
    )
    labels_list.append(labels)

# Add labels to dataframe
labels_df = pd.DataFrame(labels_list)
df = pd.concat([df, labels_df], axis=1)

print("\n✓ Labels created successfully")
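
The performance rule above counts loop keywords, so two sequential loops are flagged the same way as a nested pair. If true nesting matters, one option is to measure loop depth with Python's `ast` module. This is an alternative sketch, not the labeling used in this notebook; `max_loop_depth` is a hypothetical helper, comprehensions are not counted, and snippets that do not parse (for example Python 2 code) return -1.

```python
import ast

LOOP_NODES = (ast.For, ast.While, ast.AsyncFor)


def max_loop_depth(code: str) -> int:
    """Return the deepest loop nesting level, or -1 if the code does not parse."""
    try:
        tree = ast.parse(code)
    except (SyntaxError, ValueError):
        return -1

    def depth(node, current):
        # Entering a loop node increases the nesting level for its children.
        if isinstance(node, LOOP_NODES):
            current += 1
        deepest = current
        for child in ast.iter_child_nodes(node):
            deepest = max(deepest, depth(child, current))
        return deepest

    return depth(tree, 0)


sequential = "def f(xs):\n    for x in xs:\n        print(x)\n    for x in xs:\n        print(x)\n"
nested = "def g(m):\n    for row in m:\n        for v in row:\n            print(v)\n"
legacy = "print 'hi'"  # Python 2 syntax, does not parse under Python 3

for name, src in [("sequential", sequential), ("nested", nested), ("legacy", legacy)]:
    print(f"{name:10} -> loop depth {max_loop_depth(src)}")
```

A depth of 2 or more would correspond to the "nested loops" category; the share of unparseable snippets is itself worth reporting.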

## 6. Label Distribution Analysis

The label distribution reveals **class imbalance**, which affects:
- Loss function weighting
- Evaluation metrics choice
- Sampling strategies

In [None]:
# Calculate label statistics
label_cols = ['bug', 'security', 'code_smell', 'style', 'performance']
label_counts = df[label_cols].sum()

print("LABEL DISTRIBUTION")
print("=" * 80)
for label, count in label_counts.items():
    percentage = (count / len(df)) * 100
    print(f"{label.capitalize():15} : {count:6,} ({percentage:5.2f}%)")

# Samples with no issues
no_issues = (df[label_cols].sum(axis=1) == 0).sum()
print(f"\nNo Issues       : {no_issues:6,} ({(no_issues/len(df))*100:5.2f}%)")

# Samples with multiple issues
multi_issues = (df[label_cols].sum(axis=1) > 1).sum()
print(f"Multiple Issues : {multi_issues:6,} ({(multi_issues/len(df))*100:5.2f}%)")
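
These counts translate directly into loss weights. PyTorch's `BCEWithLogitsLoss` accepts a per-label `pos_weight`, commonly set to the ratio of negative to positive samples. The sketch below computes that ratio in plain Python on toy rows; `positive_class_weights` is a hypothetical helper, and the resulting values would still need to be converted to a tensor in label order before training.

```python
def positive_class_weights(label_rows, label_names):
    """Return {label: negatives / positives}; labels without positives map to None."""
    total = len(label_rows)
    weights = {}
    for name in label_names:
        positives = sum(row[name] for row in label_rows)
        weights[name] = (total - positives) / positives if positives else None
    return weights


# Toy binary labels standing in for df[label_cols] (assumption).
rows = [
    {"bug": 1, "security": 0},
    {"bug": 1, "security": 0},
    {"bug": 0, "security": 1},
    {"bug": 1, "security": 0},
]

weights = positive_class_weights(rows, ["bug", "security"])
for name, weight in weights.items():
    print(f"{name:10} pos_weight = {weight:.2f}")
```

Rare labels receive weights above 1, frequent labels below 1. Very large weights can destabilize training, so clipping them is a reasonable option.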

In [None]:
# Visualize label distribution
fig, axes = plt.subplots(1, 2, figsize=(15, 6))

# Bar chart
axes[0].bar(label_counts.index, label_counts.values, edgecolor='black', alpha=0.8)
axes[0].set_xlabel('Issue Type')
axes[0].set_ylabel('Count')
axes[0].set_title('Label Distribution (Bar Chart)')
axes[0].tick_params(axis='x', rotation=45)
axes[0].grid(axis='y', alpha=0.3)

# Add count labels on bars
for i, (label, count) in enumerate(label_counts.items()):
    axes[0].text(i, count + 50, f'{count:,}', ha='center', fontweight='bold')

# Pie chart: shares of all label assignments (labels overlap, so slices are not shares of samples)
colors = plt.cm.Set3(range(len(label_counts)))
axes[1].pie(
    label_counts.values,
    labels=label_counts.index,
    autopct='%1.1f%%',
    colors=colors,
    startangle=90
)
axes[1].set_title('Share of Label Assignments (Pie Chart)')

plt.tight_layout()
plt.savefig('eda_label_distribution.png', dpi=300, bbox_inches='tight')
plt.show()

print("\n✓ Saved: eda_label_distribution.png")

## 7. Correlation Analysis

Check if certain issues tend to co-occur - helps understand code quality patterns.

In [None]:
# Correlation matrix
correlation_matrix = df[label_cols].corr()

plt.figure(figsize=(10, 8))
sns.heatmap(
    correlation_matrix,
    annot=True,
    fmt='.2f',
    cmap='coolwarm',
    center=0,
    square=True,
    linewidths=1,
    cbar_kws={"shrink": 0.8}
)
plt.title('Label Correlation Heatmap', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig('eda_label_correlation.png', dpi=300, bbox_inches='tight')
plt.show()

print("\n✓ Saved: eda_label_correlation.png")
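
Correlation coefficients can be hard to read for rare binary labels, so raw co-occurrence counts are a useful companion. The following is a small sketch on toy rows; `cooccurrence_counts` is a hypothetical helper, and each pair is keyed in the order given by `label_names`.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(label_rows, label_names):
    """Count how often each pair of labels is active on the same sample."""
    counts = Counter()
    for row in label_rows:
        active = [name for name in label_names if row[name]]
        counts.update(combinations(active, 2))
    return counts


# Toy binary labels standing in for df[label_cols] (assumption).
rows = [
    {"bug": 1, "security": 1, "style": 0},
    {"bug": 1, "security": 0, "style": 1},
    {"bug": 1, "security": 1, "style": 1},
    {"bug": 0, "security": 0, "style": 0},
]

pairs = cooccurrence_counts(rows, ["bug", "security", "style"])
for (first, second), count in pairs.most_common():
    print(f"{first} + {second}: {count}")
```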

## 8. Code Complexity Analysis

Additional metrics to understand code characteristics.

In [None]:
# Calculate complexity metrics
def calculate_complexity(code):
    """Rough complexity proxies from substring counts (strings and comments are not excluded)."""
    metrics = {}
    metrics['cyclomatic'] = code.count('if ') + code.count('for ') + code.count('while ') + 1
    metrics['num_functions'] = code.count('def ')
    metrics['num_comments'] = code.count('#')
    return metrics

complexity_list = [calculate_complexity(code) for code in df['func_code_string']]
complexity_df = pd.DataFrame(complexity_list)
df = pd.concat([df, complexity_df], axis=1)

print("COMPLEXITY STATISTICS")
print("=" * 80)
print(df[['cyclomatic', 'num_functions', 'num_comments']].describe())
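
Substring counts also match keywords inside string literals and comments. Where the code parses, a syntax-tree count avoids that. The sketch below is an approximation of cyclomatic complexity rather than a full implementation; `ast_complexity` and the choice of `BRANCH_NODES` are assumptions made for illustration.

```python
import ast

# Node types treated as decision points (an assumption; tools differ on details).
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.AsyncFor, ast.ExceptHandler, ast.IfExp)


def ast_complexity(code):
    """Approximate cyclomatic complexity from the syntax tree; None if unparseable."""
    try:
        tree = ast.parse(code)
    except (SyntaxError, ValueError):
        return None
    nodes = list(ast.walk(tree))
    branches = sum(isinstance(n, BRANCH_NODES) for n in nodes)
    # Each extra operand of `and` / `or` adds one more path.
    bool_ops = sum(len(n.values) - 1 for n in nodes if isinstance(n, ast.BoolOp))
    functions = sum(isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef)) for n in nodes)
    return {"cyclomatic": 1 + branches + bool_ops, "num_functions": functions}


example = '''def f(x):
    msg = "if for while"  # keywords inside a string are ignored
    if x > 0 and x < 10:
        return msg
    return None
'''

result = ast_complexity(example)
print(result)
```

Comparing this value with the substring-based `cyclomatic` column on a sample of functions would show how much noise the simpler metric carries.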

## 9. Summary and Recommendations

Based on our EDA, we can make the following recommendations for the next steps:

In [None]:
print("="*80)
print("EDA SUMMARY AND RECOMMENDATIONS")
print("="*80)

print("\n1. DATASET CHARACTERISTICS:")
print(f"   - Total samples: {len(df):,}")
print(f"   - Median code length: {df['code_length'].median():.0f} characters")
print(f"   - Median lines: {df['num_lines'].median():.0f}")
print(f"   - Median tokens: {df['num_tokens'].median():.0f}")

print("\n2. LABEL DISTRIBUTION:")
print(f"   - Most common issue: {label_counts.idxmax()} ({label_counts.max():,} samples)")
print(f"   - Least common issue: {label_counts.idxmin()} ({label_counts.min():,} samples)")
print(f"   - Imbalance ratio: {label_counts.max() / label_counts.min():.2f}:1")

print("\n3. RECOMMENDATIONS:")
print("   ✓ Use max_length=512 for tokenization (CodeBERT's limit; confirm coverage with the tokenizer)")
print("   ✓ Apply data augmentation to improve robustness")
print("   ✓ Use BCEWithLogitsLoss for multi-label classification")
print("   ✓ Consider class weights to handle imbalance")
print("   ✓ Use stratified splits for train/val/test")
print("   ✓ Monitor per-class metrics during training")

print("\n" + "="*80)

## 10. Save Processed Data

Save the labeled dataset for use in subsequent notebooks.

In [None]:
# Select columns to keep
columns_to_save = [
    'func_code_string',
    'func_name',
    'func_documentation_string',
    'bug', 'security', 'code_smell', 'style', 'performance'
]

df_clean = df[columns_to_save].copy()

# Save to CSV
output_file = 'labeled_code_samples.csv'
df_clean.to_csv(output_file, index=False)

print(f"✓ Saved {len(df_clean)} samples to {output_file}")
print(f"  File size: {os.path.getsize(output_file) / 1024 / 1024:.2f} MB")
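
Code strings contain newlines, quotes, and commas, so it is worth confirming that they survive a CSV round trip. `DataFrame.to_csv` quotes such fields by default; the sketch below shows the same behavior with the standard `csv` module on an in-memory buffer. The toy row is an assumption standing in for `df_clean`.

```python
import csv
import io

# One toy row standing in for df_clean (assumption).
rows = [{"func_code_string": 'def greet(name):\n    return "Hi, " + name', "bug": 0}]

# newline="" leaves line endings untouched, as the csv module expects.
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["func_code_string", "bug"])
writer.writeheader()
writer.writerows(rows)

# Read the data back and compare with the original snippet.
buffer.seek(0)
restored = list(csv.DictReader(buffer))
round_trip_ok = restored[0]["func_code_string"] == rows[0]["func_code_string"]
print(f"Code preserved after round trip: {round_trip_ok}")
```

Note that the `csv` module returns every value as a string, so label columns need casting back to integers; `pd.read_csv` infers numeric types on its own.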

## 🎯 Next Step: Preprocessing (02-preprocessing.ipynb)

Now that we've thoroughly analyzed the data, we can proceed to:
- Tokenize code using CodeBERT
- Apply data augmentation
- Create train/val/test splits
- Prepare data loaders