## 1. Import Required Libraries (Lectures 50-52)

**Description:** Import essential Python libraries for data manipulation, linear regression modeling, and evaluation.

**Libraries Used:**
- **NumPy**: For numerical computations
- **Pandas**: For data manipulation and DataFrame operations
- **Scikit-learn**: For machine learning algorithms and metrics

**Purpose:** Setting up the environment for implementing Linear Regression analysis.

In [None]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

## 2. Create Dataset

**Description:** Create a sample dataset with CGPA (independent variable) and Package (dependent variable) to demonstrate the relationship between academic performance and job placement package.

**Dataset Structure:**
- **Features (X)**: CGPA - Student's Cumulative Grade Point Average
- **Target (y)**: Package - Job placement package in LPA (Lakhs Per Annum)
- **Size**: 16 observations

**Purpose:** Generate training data to build a predictive model for estimating placement packages based on CGPA.

In [None]:
# Create dataset with CGPA and Package
data = {
    'CGPA': [6.5, 7.0, 7.5, 8.0, 8.5, 9.0, 9.5, 6.0, 7.2, 7.8, 8.3, 8.7, 9.2, 6.8, 7.6, 8.9],
    'Package': [3.5, 4.0, 4.5, 5.0, 6.0, 7.0, 8.5, 3.0, 4.2, 4.8, 5.5, 6.5, 7.5, 3.8, 4.6, 7.2]
}

df = pd.DataFrame(data)
df

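| | CGPA | Package |
|---|---|---|
| 0 | 6.5 | 3.5 |
| 1 | 7.0 | 4.0 |
| 2 | 7.5 | 4.5 |
| 3 | 8.0 | 5.0 |
| 4 | 8.5 | 6.0 |
| 5 | 9.0 | 7.0 |
| 6 | 9.5 | 8.5 |
| 7 | 6.0 | 3.0 |
| 8 | 7.2 | 4.2 |
| 9 | 7.8 | 4.8 |

(first 10 of 16 rows shown)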


## 3. Train-Test Split

**Description:** Split the dataset into training and testing sets to evaluate model performance on unseen data.

**Mathematical Concept:**
- **Training Set**: Used to train the model (80% of data)
- **Testing Set**: Used to evaluate model performance (20% of data)

**Formula:**
$$\text{Train Size} = n \times (1 - \text{test\_size})$$
$$\text{Test Size} = n \times \text{test\_size}$$

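Where $n = 16$ observations and `test_size = 0.2`. The formulas give 12.8 and 3.2, which are not whole numbers; `train_test_split` rounds the test count up, so the actual split here is 12 training rows and 4 test rows.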

**Purpose:** Detect overfitting by measuring performance on data the model has not seen during training, which gives an estimate of how well it generalizes to new data.

In [None]:
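X = df[['CGPA']]   # 2-D feature matrix (DataFrame), the shape scikit-learn expects
y = df['Package']  # 1-D target (Series)
# random_state fixes the shuffle so the split is reproducible
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)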
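
The size formulas above give fractional counts for this dataset, so some rounding has to happen. The following is a minimal sketch of that arithmetic, assuming scikit-learn's behaviour of rounding a fractional test count up and giving the remaining rows to training. The variable names are illustrative only.

```python
import math

n = 16           # observations in the dataset
test_size = 0.2  # fraction requested for the test set

# Assumption: a fractional test count is rounded up,
# and the training set receives whatever rows are left.
n_test = math.ceil(n * test_size)
n_train = n - n_test

print(n_train, n_test)  # 12 4
```

With only 16 rows the effective split is therefore 75% / 25% rather than exactly 80% / 20%. Checking `x_train.shape` and `x_test.shape` in the notebook should confirm the counts.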

## 4. Train Linear Regression Model

**Description:** Build and train a Linear Regression model to find the best-fit line that describes the relationship between CGPA and Package.

**Linear Regression Formula:**
$$y = \beta_0 + \beta_1 x + \epsilon$$

Where:
- $y$ = Package (dependent variable)
- $x$ = CGPA (independent variable)
- $\beta_0$ = Intercept (y-intercept of the line)
- $\beta_1$ = Slope (change in Package per one-point increase in CGPA)
- $\epsilon$ = Error term

**Optimization:**
The model minimizes the **Sum of Squared Errors (SSE)**:
$$\text{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

**Purpose:** Learn the linear relationship between CGPA and placement package.

In [None]:
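# Fit ordinary least squares on the training split
pred_model = LinearRegression()
pred_model.fit(x_train, y_train)
# Predict packages for the held-out test rows (evaluated in section 6)
y_pred = pred_model.predict(x_test)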
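
For a single feature, the SSE minimizer has a closed form: $\beta_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$ and $\beta_0 = \bar{y} - \beta_1 \bar{x}$. The sketch below applies it in plain Python to illustrate what the fit computes. `fit_simple_ols` is a hypothetical helper, not part of scikit-learn, and it is fitted on all 16 rows, so its coefficients will differ somewhat from those of `pred_model`, which sees only the training split.

```python
def fit_simple_ols(xs, ys):
    """Return (intercept, slope) that minimize the sum of squared errors."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Co-variation of x and y, and variation of x, around their means
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_mean - slope * x_mean
    return intercept, slope


# Same values as the notebook's dataset
cgpa = [6.5, 7.0, 7.5, 8.0, 8.5, 9.0, 9.5, 6.0,
        7.2, 7.8, 8.3, 8.7, 9.2, 6.8, 7.6, 8.9]
package = [3.5, 4.0, 4.5, 5.0, 6.0, 7.0, 8.5, 3.0,
           4.2, 4.8, 5.5, 6.5, 7.5, 3.8, 4.6, 7.2]

beta_0, beta_1 = fit_simple_ols(cgpa, package)

# Residuals and SSE of the fitted line
residuals = [y - (beta_0 + beta_1 * x) for x, y in zip(cgpa, package)]
sse = sum(r ** 2 for r in residuals)

print(f"intercept={beta_0:.3f}, slope={beta_1:.3f}, SSE={sse:.3f}")
```

A useful sanity check on any least-squares line with an intercept is that its residuals sum to zero, up to floating-point error.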

## 5. Make Predictions

**Description:** Use the trained model to predict placement package for a given CGPA value.

**Prediction Formula:**
$$\hat{y} = \beta_0 + \beta_1 x_{\text{new}}$$

Where:
- $\hat{y}$ = Predicted package
- $x_{\text{new}}$ = Input CGPA value
- $\beta_0, \beta_1$ = Learned parameters from training

**Purpose:** Demonstrate how the model can predict outcomes for new, unseen CGPA values.

In [None]:
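# Generate prediction for a given CGPA
input_cgpa = 8.2  # You can change this value
# Pass a DataFrame with the training column name; with a bare [[8.2]] list,
# recent scikit-learn versions warn that X lacks valid feature names
predicted_package = pred_model.predict(pd.DataFrame({'CGPA': [input_cgpa]}))
print(f"For CGPA {input_cgpa}, the predicted package is: {predicted_package[0]:.2f} LPA")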

For CGPA 8.2, the predicted package is: 5.80 LPA
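
The prediction formula can be checked against the fitted estimator directly. The sketch below rebuilds the notebook's dataset and split, then compares the hand-computed $\beta_0 + \beta_1 x_{\text{new}}$ with the result of `predict()`. `intercept_` and `coef_` are the standard attributes of a fitted `LinearRegression`; the names `model`, `manual`, and `from_model` are chosen here for illustration.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    'CGPA': [6.5, 7.0, 7.5, 8.0, 8.5, 9.0, 9.5, 6.0,
             7.2, 7.8, 8.3, 8.7, 9.2, 6.8, 7.6, 8.9],
    'Package': [3.5, 4.0, 4.5, 5.0, 6.0, 7.0, 8.5, 3.0,
                4.2, 4.8, 5.5, 6.5, 7.5, 3.8, 4.6, 7.2],
})
X = df[['CGPA']]
y = df['Package']
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(x_train, y_train)

beta_0 = model.intercept_  # learned intercept
beta_1 = model.coef_[0]    # learned slope for the single CGPA feature

x_new = 8.2
manual = beta_0 + beta_1 * x_new  # prediction formula applied by hand
from_model = model.predict(pd.DataFrame({'CGPA': [x_new]}))[0]

print(f"manual={manual:.4f}, predict()={from_model:.4f}")
```

The two values should agree to floating-point precision, since `predict()` evaluates the same linear expression.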




## 6. Model Evaluation Metrics

**Description:** Evaluate the model's performance using various statistical metrics to assess prediction accuracy and goodness of fit.

**Evaluation Metrics:**

### 1. Mean Absolute Error (MAE)
$$\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$$
- Measures average absolute difference between actual and predicted values
- **Lower is better** (0 = perfect predictions)

### 2. Mean Squared Error (MSE)
$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
- Measures average squared difference (penalizes larger errors more)
- **Lower is better** (0 = perfect predictions)

### 3. R² Score (Coefficient of Determination)
$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$
- Represents proportion of variance explained by the model
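- **Range**: at most 1 (**higher is better**, 1 = perfect fit); 0 means the model does no better than always predicting the mean $\bar{y}$, and the score can be negative on test data when the model does worse than that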

### 4. Adjusted R² Score
$$\text{Adjusted } R^2 = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}$$

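Where:
- $n$ = number of observations the score is computed on (the code below uses the test set, `len(y_test)`, which is 4 here, not the full 16)
- $p$ = number of features (1)

Adjusted R² penalizes extra predictors: adding a feature raises the score only if it improves the fit by more than chance alone would.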

**Purpose:** Quantify model performance and reliability of predictions.

In [None]:
print("MAE", mean_absolute_error(y_test, y_pred))
print("MSE", mean_squared_error(y_test, y_pred))
print("R2 Score", r2_score(y_test, y_pred))
print("shape of X", X.shape)
# Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)
# n = len(y_test) = 4 test rows, p = X.shape[1] = 1 feature, R2 ≈ 0.97
print("adjusted R2 Score", 1 - (1 - r2_score(y_test, y_pred)) * (len(y_test) - 1) / (len(y_test) - X.shape[1] - 1))

MAE 0.1741942384483739
MSE 0.046929626019643976
R2 Score 0.973955115632525
shape of X (16, 1)
adjusted R2 Score 0.9609326734487875
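
To make the four formulas concrete, the sketch below computes them by hand in plain Python on a small made-up example. `regression_metrics` is a hypothetical helper and the four value pairs are invented for illustration; they are not the notebook's test set.

```python
def regression_metrics(y_true, y_pred, p):
    """Compute MAE, MSE, R2 and adjusted R2 for p predictors."""
    n = len(y_true)
    y_mean = sum(y_true) / n
    mae = sum(abs(t - q) for t, q in zip(y_true, y_pred)) / n
    ss_res = sum((t - q) ** 2 for t, q in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - y_mean) ** 2 for t in y_true)             # total sum of squares
    mse = ss_res / n
    r2 = 1 - ss_res / ss_tot
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    return {"mae": mae, "mse": mse, "r2": r2, "adj_r2": adj_r2}


# Made-up actual and predicted values (illustration only)
y_true = [3.0, 5.0, 7.0, 9.0]
y_hat = [2.5, 5.0, 7.5, 9.0]

metrics = regression_metrics(y_true, y_hat, p=1)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

For these values the results are MAE = 0.25, MSE = 0.125, R² = 0.975, and adjusted R² = 0.9625. With only four observations the adjustment factor $(n - 1)/(n - p - 1)$ is 1.5, which is why adjusted R² drops noticeably below R² on a test set this small.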
