
# Regression and Correlation – Module 5

## Big Question:
Can we predict a student's exam score using these two factors?
- Study hours
- Sleep hours

### Today you will learn:
1. What correlation is
2. What regression is
3. How to use regression to make predictions


## Topic Definitions

### Correlation
Correlation measures how strongly two variables are related and
whether they move in the same direction or in opposite directions.
The correlation coefficient (r) ranges from -1 to +1.

### Regression
Regression is a method used to understand relationships
and predict an outcome using one or more variables.

### Simple Linear Regression
Predicts one value using one variable.
Formula: y = mx + b

### Multiple Regression
Predicts one value using two or more variables.
Formula: y = b0 + b1*x1 + b2*x2


## What is Correlation?

Correlation answers:
"When one value changes, does the other change too?"

Types:
- Positive: as one increases, the other tends to increase
- Negative: as one increases, the other tends to decrease
- Zero: no clear linear pattern

Important:
Correlation does NOT mean causation.
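
To make the three types concrete, here is a minimal sketch that computes the Pearson correlation coefficient by hand with NumPy. It uses the study-hours and test-score values from this notebook's dataset. The helper name `pearson_r` is made up for illustration, and flipping the sign of the scores is just a quick way to fake a negative relationship.

```python
import numpy as np

# Values copied from this notebook's dataset
study_hours = np.array([2, 4, 6, 3, 7, 8, 5, 1, 9, 4])
test_score = np.array([60, 70, 85, 65, 90, 95, 78, 50, 92, 72])

def pearson_r(x, y):
    """Pearson correlation: co-variation of x and y, scaled to [-1, 1]."""
    x_dev = x - x.mean()  # distance of each x from its mean
    y_dev = y - y.mean()  # distance of each y from its mean
    return (x_dev * y_dev).sum() / np.sqrt((x_dev**2).sum() * (y_dev**2).sum())

# Scores rise with study hours, so r is positive
r_positive = pearson_r(study_hours, test_score)

# Negating the scores reverses the direction, so r changes sign
r_negative = pearson_r(study_hours, -test_score)

print("Positive example:", r_positive)
print("Negative example:", r_negative)
```

A value near +1 suggests a strong positive relationship, a value near -1 a strong negative one, and a value near 0 no clear linear pattern.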


## What is Regression?

Regression finds the best line or equation
that describes the relationship between variables.

It helps us:
- Understand patterns
- Make predictions
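
One common way to define the "best" line is least squares: choose the line with the smallest sum of squared prediction errors. The sketch below assumes that definition and compares a hand-picked guess with the line returned by `np.polyfit`. The guess (slope 5, intercept 50) and the helper name `sum_squared_error` are illustrative only.

```python
import numpy as np

# Values copied from this notebook's dataset
study_hours = np.array([2, 4, 6, 3, 7, 8, 5, 1, 9, 4])
test_score = np.array([60, 70, 85, 65, 90, 95, 78, 50, 92, 72])

def sum_squared_error(slope, intercept, x, y):
    """Total squared gap between the actual y values and the line's predictions."""
    predicted = slope * x + intercept
    return ((y - predicted) ** 2).sum()

# A line chosen by eye (hypothetical guess)
guess_sse = sum_squared_error(5.0, 50.0, study_hours, test_score)

# np.polyfit with deg=1 returns [slope, intercept] of the least-squares line
best_slope, best_intercept = np.polyfit(study_hours, test_score, deg=1)
best_sse = sum_squared_error(best_slope, best_intercept, study_hours, test_score)

print("Guess error:", guess_sse)
print("Least-squares error:", best_sse)
```

No other straight line can produce a smaller squared error on this data than the least-squares line.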


In [None]:
# Import Tools

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

## Our Dataset

Each row is one student.
Each column is a factor that may affect exam scores.


In [None]:
data = {
    "StudyHours": [2, 4, 6, 3, 7, 8, 5, 1, 9, 4],
    "SleepHours": [5, 6, 7, 4, 8, 9, 6, 3, 8, 5],
    "TestScore": [60, 70, 85, 65, 90, 95, 78, 50, 92, 72]
}

df = pd.DataFrame(data)
df

In [None]:
# Correlation Table
df.corr()

In [None]:
# Scatter Plot
plt.scatter(df["StudyHours"], df["TestScore"])
plt.xlabel("Study Hours")
plt.ylabel("Test Score")
plt.title("Study Hours vs Test Score")
plt.show()


## Simple Linear Regression

We draw the best straight line through the data.

Formula:
y = mx + b

m = slope → predicted change in score for each additional study hour
b = intercept → predicted score when study hours = 0
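
For a single predictor, the least-squares slope and intercept have a closed form: m is the sum of the products of the x and y deviations divided by the sum of the squared x deviations, and b = mean(y) - m * mean(x). The following is a minimal sketch of that calculation with NumPy, using this notebook's data. The variable names are illustrative.

```python
import numpy as np

# Values copied from this notebook's dataset
study_hours = np.array([2, 4, 6, 3, 7, 8, 5, 1, 9, 4])
test_score = np.array([60, 70, 85, 65, 90, 95, 78, 50, 92, 72])

# Deviations of each value from its mean
x_dev = study_hours - study_hours.mean()
y_dev = test_score - test_score.mean()

# Closed-form least-squares estimates for y = m*x + b
slope = (x_dev * y_dev).sum() / (x_dev**2).sum()
intercept = test_score.mean() - slope * study_hours.mean()

# Use the line to predict the score for a student who studies 5 hours
predicted_score = slope * 5 + intercept

print("Slope (m):", slope)
print("Intercept (b):", intercept)
print("Predicted score for 5 study hours:", predicted_score)
```

These values should match the slope and intercept that an ordinary least-squares fit reports for the same data.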


In [None]:
# Train Simple Model
X = df[["StudyHours"]]
y = df["TestScore"]

model = LinearRegression()
model.fit(X, y)

print("Slope:", model.coef_[0])
print("Intercept:", model.intercept_)

In [None]:
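# Regression Line
plt.scatter(X, y, label="Students")
# Draw the fitted line in a different color so it stands out from the points
plt.plot(X, model.predict(X), color="red", label="Fitted line")
plt.xlabel("Study Hours")
plt.ylabel("Test Score")
plt.title("Regression Line")
plt.legend()
plt.show()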

## Multiple Regression

We now use TWO predictors:
- Study Hours
- Sleep Hours

Formula:
y = b0 + b1(Study) + b2(Sleep)

Each coefficient shows the predicted change in score for a
one-hour increase in that variable while the other is held constant.
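
To see what "held constant" means in practice, here is a small sketch that fits the two-predictor equation with NumPy's least-squares solver and then changes one input at a time. This swaps in `np.linalg.lstsq` in place of scikit-learn so the arithmetic is visible; it is an ordinary least-squares fit on the same data. The helper name `predict_score` and the example inputs (5 study hours, 7 sleep hours) are assumptions for illustration.

```python
import numpy as np

# Values copied from this notebook's dataset
study = np.array([2, 4, 6, 3, 7, 8, 5, 1, 9, 4], dtype=float)
sleep = np.array([5, 6, 7, 4, 8, 9, 6, 3, 8, 5], dtype=float)
score = np.array([60, 70, 85, 65, 90, 95, 78, 50, 92, 72], dtype=float)

# Design matrix: a column of ones (for b0), then the two predictors
design = np.column_stack([np.ones_like(study), study, sleep])

# Solve for b0, b1, b2 by ordinary least squares
(b0, b1, b2), *_ = np.linalg.lstsq(design, score, rcond=None)

def predict_score(study_hours, sleep_hours):
    """Apply y = b0 + b1(Study) + b2(Sleep)."""
    return b0 + b1 * study_hours + b2 * sleep_hours

base = predict_score(5, 7)
more_study = predict_score(6, 7)  # one more study hour, same sleep
more_sleep = predict_score(5, 8)  # one more sleep hour, same study

print("Change from +1 study hour:", more_study - base, "(b1 =", b1, ")")
print("Change from +1 sleep hour:", more_sleep - base, "(b2 =", b2, ")")
```

In each case the change in the prediction equals the coefficient of the variable that moved, which is exactly what the coefficient describes.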


In [None]:
# Train Multiple Model
X_multi = df[["StudyHours", "SleepHours"]]

model_multi = LinearRegression()
model_multi.fit(X_multi, y)

print("Intercept:", model_multi.intercept_)
print("Coefficients:", model_multi.coef_)

In [None]:
# Make Prediction
new_data = pd.DataFrame({"StudyHours":[5], "SleepHours":[7]})
model_multi.predict(new_data)

## Reflection

1. Which affects the score more: study or sleep?
2. What does the slope mean?
3. Why is multiple regression more realistic?
4. Can correlation prove causation?
