# Pearson Correlation Analysis (Parametric Methods)

In this notebook, we explore **parametric correlation analysis**, focusing specifically on the **Pearson correlation coefficient**.

Correlation analysis allows us to **quantify the strength and direction of a linear relationship** between two continuous numeric variables.

⚠️ **Important Concept**: Correlation does **not** imply causation.

For example, a strong correlation may exist between grocery store size and regional obesity levels, but this does not mean that store size *causes* obesity. Correlation simply indicates that two variables move together in some way.

---
## What is Pearson Correlation?

The **Pearson correlation coefficient (R)** measures the strength and direction of the **linear relationship** between two variables. It always lies between -1 and +1.

- **R ≈ +1** → Strong positive linear relationship
- **R ≈ -1** → Strong negative linear relationship
- **R ≈ 0** → No linear relationship
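
To make the scale above concrete, here is a minimal sketch that computes R directly from its definition (the covariance of the two variables divided by the product of their standard deviations). The helper name `pearson_r` and the demo arrays are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson R from its definition: co-deviation over the product of deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()          # deviations of x from its mean
    dy = y - y.mean()          # deviations of y from its mean
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

x_demo = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
r_up = pearson_r(x_demo, 2 * x_demo + 1)       # points on an increasing line
r_down = pearson_r(x_demo, -3 * x_demo + 10)   # points on a decreasing line
print('increasing line: %0.3f, decreasing line: %0.3f' % (r_up, r_down))
```

Points that fall exactly on a straight line give R of +1 or -1 depending on the slope's sign; real data falls somewhere in between.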

### Key Assumptions of Pearson Correlation

Before applying Pearson correlation, the following assumptions should reasonably hold:

1. Variables are **continuous numeric** values
2. Variables are **linearly related**
3. Data is **approximately normally distributed**
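
One possible way to check the normality assumption numerically is a Shapiro-Wilk test from `scipy.stats`. The sketch below uses synthetic samples (an assumption for illustration, not the notebook's data) to show how a bell-shaped and a right-skewed variable compare. Treat it as a rough screening aid alongside visual inspection, not a definitive rule.

```python
import numpy as np
from scipy.stats import shapiro

# Synthetic samples for illustration only
rng = np.random.default_rng(0)
bell_shaped = rng.normal(loc=20, scale=5, size=200)
right_skewed = rng.exponential(scale=5, size=200)

for name, sample in [('bell_shaped', bell_shaped), ('right_skewed', right_skewed)]:
    statistic, p_value = shapiro(sample)
    # A small p-value is evidence against the sample being normally distributed
    print('%s: W = %0.3f, p-value = %0.4g' % (name, statistic, p_value))
```

A W statistic close to 1 is consistent with normality; a clearly lower W with a small p-value points to a departure from it.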

⚠️ Pearson correlation detects **linear** relationships only. An R close to 0 therefore **does not rule out a non-linear relationship**.
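
A small synthetic example illustrates this limitation. Here `y_sym` is completely determined by `x_sym` (a parabola), yet Pearson R is essentially zero because the relationship is not linear. The variable names are assumptions made for this sketch.

```python
import numpy as np
from scipy.stats import pearsonr

x_sym = np.linspace(-3, 3, 61)   # values symmetric around 0
y_sym = x_sym ** 2               # perfect, but non-linear, dependence

r_quadratic, _ = pearsonr(x_sym, y_sym)
print('Pearson R for y = x**2: %0.3f' % r_quadratic)
```

This is why a scatter plot should accompany the coefficient: the plot reveals the curve that the number hides.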


## Importing Required Libraries

We will use the following libraries:

- **pandas** and **numpy** for data handling
- **matplotlib** and **seaborn** for visualization
- **scipy.stats** for statistical calculations (Pearson R)


In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib import rcParams  # import rcParams from matplotlib directly; pylab is discouraged

import scipy
from scipy.stats import pearsonr

### Plotting Configuration

We configure matplotlib and seaborn settings to ensure consistent and readable plots throughout the notebook.

In [None]:
%matplotlib inline
rcParams['figure.figsize'] = (8, 4)
sns.set_style("whitegrid")

## Loading the Cars Dataset

We use the classic **mtcars** dataset, which contains fuel consumption and design characteristics of automobiles.

Each row represents a car model. Apart from the model name, each column is a numeric attribute, such as:

- `mpg` → miles per gallon
- `hp` → horsepower
- `wt` → weight (in 1000 lbs)
- `qsec` → quarter-mile time (in seconds)


In [None]:
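# Path to the CSV file; adjust it to wherever mtcars.csv lives in your environment
address = '/workspaces/python-for-data-science-and-machine-learning-essential-training-part-1-3006708/data/mtcars.csv'

cars = pd.read_csv(address)
# Assign readable column names (the first column holds the car model name)
cars.columns = ['car_names','mpg','cyl','disp', 'hp', 'drat', 'wt', 'qsec', 'vs', 'am', 'gear', 'carb']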

## Exploring Relationships Using Pair Plots

A **pair plot** visualizes pairwise relationships between variables.

This helps us visually assess:
- Linearity
- Direction of relationships
- Potential outliers
- Approximate normality of distributions

Below, we first generate a pair plot using **all numeric variables**.

In [None]:
sns.pairplot(cars)

### Reduced Pair Plot for Pearson Analysis

Since the full dataset contains many variables, several of which are discrete codes (`cyl`, `vs`, `am`, `gear`, `carb`), we focus on four continuous variables that are better suited to Pearson correlation:

- `mpg`
- `hp`
- `qsec`
- `wt`

This makes it easier to visually inspect linear relationships.

In [None]:
x = cars[['mpg','hp','qsec','wt']]
sns.pairplot(x)

## Calculating Pearson Correlation Using SciPy

The **scipy.stats.pearsonr** function returns:

- The **Pearson correlation coefficient (R)**
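- The **p-value** for the null hypothesis that the two variables are uncorrelated; a small p-value (commonly below 0.05) suggests the observed correlation is unlikely to be due to chance alone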

We compute Pearson R between `mpg` and other variables to quantify their linear relationships.
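
As a sketch of how the two returned values are typically read together, the example below runs `pearsonr` on synthetic data: one pair of variables built to be negatively related and one pair generated independently. The data, the variable names, and the 0.05 threshold `alpha` are assumptions chosen for illustration.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(42)
n = 32  # small sample size, chosen for illustration
signal = rng.normal(size=n)
related = -0.8 * signal + rng.normal(scale=0.5, size=n)   # built to move opposite to signal
unrelated = rng.normal(size=n)                            # generated independently of signal

alpha = 0.05  # conventional significance threshold (an assumption)
results = {}
for label, other in [('related', related), ('unrelated', unrelated)]:
    r, p = pearsonr(signal, other)
    results[label] = (r, p)
    verdict = 'reject' if p < alpha else 'fail to reject'
    print('%s: R = %0.3f, p-value = %0.4g -> %s H0 (no linear correlation)'
          % (label, r, p, verdict))
```

The sign and size of R describe the relationship; the p-value only says how surprising that R would be if there were no linear correlation at all.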

In [None]:
mpg = cars['mpg']
hp = cars['hp']
qsec = cars['qsec']
wt = cars['wt']

pearsonr_coefficient, p_value = pearsonr(mpg, hp)
print('mpg vs hp: PearsonR Correlation Coefficient %0.3f (p-value %0.4g)' % (pearsonr_coefficient, p_value))

In [None]:
pearsonr_coefficient, p_value = pearsonr(mpg, qsec)
print('mpg vs qsec: PearsonR Correlation Coefficient %0.3f (p-value %0.4g)' % (pearsonr_coefficient, p_value))

In [None]:
pearsonr_coefficient, p_value = pearsonr(mpg, wt)
print('mpg vs wt: PearsonR Correlation Coefficient %0.3f (p-value %0.4g)' % (pearsonr_coefficient, p_value))
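
The three cells above repeat the same call, so one option is to wrap it in a small helper that collects R and the p-value for several columns at once. The function `pearson_against` and the toy DataFrame below are hypothetical, introduced only for this sketch.

```python
import pandas as pd
from scipy.stats import pearsonr

def pearson_against(df, target, others):
    """Return Pearson R and p-value between `target` and each column in `others`."""
    rows = []
    for col in others:
        r, p = pearsonr(df[target], df[col])
        rows.append({'variable': col, 'pearson_r': r, 'p_value': p})
    return pd.DataFrame(rows).set_index('variable')

# Toy data, made up for illustration (not taken from mtcars)
toy = pd.DataFrame({
    'y':     [1, 2, 3, 4, 5],
    'up':    [2, 4, 6, 8, 10],   # rises exactly with y
    'down':  [5, 4, 3, 2, 1],    # falls exactly with y
    'noisy': [2, 1, 4, 3, 5],    # rises with y, but not perfectly
})
summary = pearson_against(toy, 'y', ['up', 'down', 'noisy'])
print(summary)
```

With the notebook's data, the equivalent call would be `pearson_against(cars, 'mpg', ['hp', 'qsec', 'wt'])`.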

## Calculating Pearson Correlation Using Pandas

Pandas provides a convenient `.corr()` method that computes **pairwise correlations** between the numeric columns of a DataFrame. Its default method is Pearson, so the values match those returned by `pearsonr`.
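
The sketch below checks, on a toy DataFrame (an assumption for illustration), that `.corr()` agrees with `scipy.stats.pearsonr`. It selects the numeric columns explicitly, because recent pandas versions may raise an error if non-numeric columns such as a name column are passed to `.corr()`.

```python
import pandas as pd
from scipy.stats import pearsonr

# Toy data, made up for illustration
toy = pd.DataFrame({
    'label': ['a', 'b', 'c', 'd', 'e'],   # non-numeric column, excluded below
    'u': [1.0, 2.0, 3.0, 4.0, 5.0],
    'v': [2.0, 1.0, 4.0, 3.0, 5.0],
})

toy_corr = toy[['u', 'v']].corr(method='pearson')   # 'pearson' is the default method
r_scipy, _ = pearsonr(toy['u'], toy['v'])
print('pandas: %0.3f, scipy: %0.3f' % (toy_corr.loc['u', 'v'], r_scipy))
```

The resulting matrix is symmetric with 1.0 on the diagonal, since every variable is perfectly correlated with itself.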

In [None]:
corr = x.corr()
corr

## Visualizing Correlation with a Heatmap

A **heatmap** provides an intuitive visual summary of correlation strength and direction.

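- Color encodes the value of the coefficient: with seaborn's default colormap, the darkest cells are the most negative correlations and the lightest cells are the most positive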
- Values close to +1 or -1 indicate strong linear relationships
- Values near 0 indicate weak or no linear relationship


In [None]:
sns.heatmap(corr, xticklabels=corr.columns.values, yticklabels=corr.columns.values, annot=True)
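
For correlation matrices, a diverging colormap fixed to the range -1 to +1 is often easier to read, because color intensity then tracks strength and hue tracks sign. The following is a sketch on synthetic data; the `demo` DataFrame and the choice of the `coolwarm` colormap are assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data for illustration: one positively and one negatively related column
rng = np.random.default_rng(1)
base = rng.normal(size=50)
demo = pd.DataFrame({
    'base': base,
    'positive': base + rng.normal(scale=0.3, size=50),
    'negative': -base + rng.normal(scale=0.3, size=50),
})
demo_corr = demo.corr()

# vmin/vmax pin the color scale to the full range of R, so 0 sits at the neutral midpoint
ax = sns.heatmap(demo_corr, vmin=-1, vmax=1, cmap='coolwarm', annot=True, fmt='.2f')
```

The same keyword arguments can be passed to the notebook's own call, for example `sns.heatmap(corr, vmin=-1, vmax=1, cmap='coolwarm', annot=True)`.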