# 📈 4.3 Correlation Analysis

This notebook explores correlation analysis to identify relationships in nutrition datasets.

**Objectives**:
- Calculate Pearson and Spearman correlations.
- Visualise correlations with heatmaps.
- Apply correlation analysis to `vitamin_trial.csv`.

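**Context**: Correlation analysis quantifies how strongly two variables are related, for example vitamin D levels and trial outcomes. A correlation describes an association; on its own it does not show that one variable causes the other.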

<details><summary>Fun Fact</summary>
Correlations are like a hippo’s friendships—some are strong, some are just acquaintances! 🦛
</details>

In [None]:
# Setup for Google Colab: Fetch datasets automatically or manually
import os
from google.colab import files

# Define the module and dataset for this notebook
MODULE = '04_data_analysis'
DATASET = 'vitamin_trial.csv'
DATASET_PATH = os.path.join('data', DATASET)

# Step 1: Attempt to clone the repository (automatic method)
try:
    print('Attempting to clone repository...')
    !git clone https://github.com/ggkuhnle/data-analysis-toolkit-FNS.git
    os.chdir(f'/content/data-analysis-toolkit-FNS/notebooks/{MODULE}')
    if os.path.exists(DATASET_PATH):
        print(f'Dataset found: {DATASET_PATH} 🦛')
    else:
        # Raise with a message so the fallback branch reports why cloning failed
        raise FileNotFoundError(f'Dataset {DATASET} not found after cloning.')
except Exception as e:
    print(f'Cloning failed: {e}')
    print('Falling back to manual upload option...')

    # Step 2: Manual upload option
    print(f'Please upload {DATASET} manually.')
    print('1. Click the "Choose Files" button below.')
    print(f'2. Select {DATASET} from your local machine.')
    print(f'3. Ensure the file is placed in notebooks/{MODULE}/data/')
    
    # Create the data directory if it doesn't exist
    os.makedirs('data', exist_ok=True)
    
    # Prompt user to upload the dataset
    uploaded = files.upload()
    
    # Check if the dataset was uploaded
    if DATASET in uploaded:
        with open(DATASET_PATH, 'wb') as f:
            f.write(uploaded[DATASET])
        print(f'Successfully uploaded {DATASET} to {DATASET_PATH} 🦛')
    else:
        raise FileNotFoundError(f'Upload failed. Please ensure you uploaded {DATASET}.')

# Install required packages for this notebook
%pip install pandas numpy
print('Python environment ready.')

In [1]:
# Install required packages (ensures compatibility in Colab)
%pip install pandas numpy seaborn scipy matplotlib
import pandas as pd  # For data manipulation
import numpy as np  # For numerical operations
import matplotlib.pyplot as plt  # For plot titles and display (used as plt below)
import seaborn as sns  # For visualisations
from scipy.stats import pearsonr, spearmanr  # For correlation calculations
print('Correlation environment ready.')

Correlation environment ready.


## Data Preparation

Load `vitamin_trial.csv` and select numerical columns for correlation.

In [2]:
# Load the dataset
df = pd.read_csv('data/vitamin_trial.csv')  # Path relative to notebook

# Select numerical columns
num_cols = ['Vitamin_D', 'Time']  # Numerical columns for correlation
print(f'Numerical columns: {num_cols}')  # Display selected columns

Numerical columns: ['Vitamin_D', 'Time']
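
The column list above is typed in by hand. As an alternative, here is a minimal sketch that lets pandas pick out the numeric columns with `select_dtypes`. The `toy` DataFrame is a made-up stand-in: only `Vitamin_D` and `Time` are column names taken from this notebook, and `ID`, `Group`, and every value are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the trial data; only Vitamin_D and Time are
# column names from this notebook, everything else is made up
toy = pd.DataFrame({
    'ID': ['P1', 'P2', 'P3', 'P4'],
    'Group': ['Control', 'Treatment', 'Control', 'Treatment'],
    'Vitamin_D': [10.5, 12.0, 9.8, 14.2],
    'Time': [0, 0, 1, 1],
})

# Keep only columns with a numeric dtype; the text columns are dropped
auto_num_cols = toy.select_dtypes(include='number').columns.tolist()
print(f'Numerical columns: {auto_num_cols}')
```

On a real dataset this can also pick up numeric identifiers or coded categories, so it is worth checking the resulting list before correlating.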


## Pearson Correlation

Calculate the Pearson correlation between `Vitamin_D` and `Time`. Pearson's r measures the strength of a linear relationship and ranges from -1 to 1.

In [3]:
# Calculate Pearson correlation
corr, p_value = pearsonr(df['Vitamin_D'], df['Time'])
print(f'Pearson correlation: {round(corr, 2)}, p-value: {p_value:.1e}')  # Display results

Pearson correlation: 0.85, p-value: 1.2e-45
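
To show what `pearsonr` computes, here is a minimal sketch that calculates Pearson's r by hand, as the sum of products of the centred values divided by the square root of the product of their sums of squares, and compares it with SciPy's result. The helper `pearson_manual` and the arrays `x` and `y` are hypothetical illustrations, not values from `vitamin_trial.csv`.

```python
import numpy as np
from scipy.stats import pearsonr

def pearson_manual(x, y):
    """Pearson's r from centred values (hypothetical helper for illustration)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_centered = x - x.mean()  # Deviations from the mean of x
    y_centered = y - y.mean()  # Deviations from the mean of y
    numerator = (x_centered * y_centered).sum()
    denominator = np.sqrt((x_centered ** 2).sum() * (y_centered ** 2).sum())
    return numerator / denominator

# Made-up toy data (not from the trial dataset)
x = np.array([0, 1, 2, 3, 4, 5])
y = np.array([10.0, 12.5, 13.0, 16.0, 17.5, 21.0])

r_manual = pearson_manual(x, y)
r_scipy, p_value = pearsonr(x, y)
print(f'Manual r: {r_manual:.4f}, SciPy r: {r_scipy:.4f}, p-value: {p_value:.1e}')
```

The two coefficients should agree to within floating-point error; the p-value comes from SciPy only.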


## Correlation Heatmap

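Visualise the pairwise correlations between the selected columns with a heatmap. Setting `vmin=-1` and `vmax=1` makes the colour scale cover the full range of the coefficient.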

In [4]:
# Compute correlation matrix
corr_matrix = df[num_cols].corr(method='pearson')  # Pearson correlation matrix

# Plot heatmap
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation Heatmap')  # Plot title
plt.show()  # Display plot

## Exercise 1: Spearman Correlation

Calculate the Spearman correlation between `Vitamin_D` and `Time` and visualise the relationship with a scatter plot. Comment your code to explain each step.

**Guidance**: Use `spearmanr()` and `sns.scatterplot()`.
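
As a warm-up before working with the trial data, this sketch compares the two coefficients on made-up data where `y` always rises with `x`, but along a curve rather than a straight line. The arrays are hypothetical and have nothing to do with `vitamin_trial.csv`; the scatter plot is left for the exercise.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Made-up toy data: y increases with x along an exponential curve
x = np.arange(1, 11)
y = np.exp(x / 2.0)

# Pearson works on the raw values, Spearman on their ranks
r_pearson, _ = pearsonr(x, y)
rho_spearman, _ = spearmanr(x, y)
print(f'Pearson r: {r_pearson:.2f}, Spearman rho: {rho_spearman:.2f}')
```

Because Spearman depends only on the ordering of the values, it reports a perfect monotonic relationship here, while Pearson comes out lower because the points do not lie on a line.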

**Answer**:

My Spearman correlation code is...

## Conclusion

You’ve learned to quantify relationships in nutrition data with Pearson and Spearman correlations and to visualise them with a heatmap.

**Next Steps**: Explore statistical testing in 4.4.

**Resources**:
- [SciPy Stats](https://docs.scipy.org/doc/scipy/reference/stats.html)
- [Seaborn Heatmaps](https://seaborn.pydata.org/generated/seaborn.heatmap.html)
- Repository: [github.com/ggkuhnle/data-analysis-toolkit-FNS](https://github.com/ggkuhnle/data-analysis-toolkit-FNS)