<a href="https://colab.research.google.com/github/Vaibhav9369755717/AI-ML-2-internship-/blob/main/janday5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# DAY 5 – Improving the ML Project (Train–Test Split & Multiple Features)

## Objective of Day 5
In this session, we move from a basic ML demo to a more realistic workflow.

By the end of this class, students will understand:
- Why Train–Test Split is required
- How to work with multiple input features
- How to train a model on training data
- How to evaluate the model on unseen test data
- Why test accuracy is a better guide to real-world performance than training accuracy


## Context from Previous Class (Day 4)

On Day 4, we:
- Built our first Machine Learning model
- Trained it on the full dataset
- Checked accuracy on the same data

Today, we improve the process so that it looks like a real industry workflow.


## Why Train–Test Split is Required

If we train and test on the same data, the model can score well simply by memorizing the examples it has already seen.
The resulting accuracy can look very high while telling us little about performance on new data.

Train–Test Split solves this problem by:
- Training the model on one part of the data
- Testing the model on unseen data

This helps us understand how the model will perform on new students.
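
The memorization problem can be seen in a small experiment. The sketch below is an illustration only: it uses synthetic random data and a `DecisionTreeClassifier` (neither is part of this notebook's project) because an unconstrained tree can memorize its training rows. The labels are coin flips, so there is nothing real to learn.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumption for illustration): 200 samples, 3 random
# features, and labels that are pure coin flips unrelated to the features.
rng = np.random.default_rng(0)
X_noise = rng.normal(size=(200, 3))
y_noise = rng.integers(0, 2, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_noise, y_noise, test_size=0.25, random_state=0
)

# An unconstrained decision tree keeps splitting until it fits every training row.
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_tr, y_tr)

train_acc = tree.score(X_tr, y_tr)
test_acc = tree.score(X_te, y_te)
print("Accuracy on data the model has seen:", train_acc)
print("Accuracy on unseen data:", test_acc)
```

The score on seen data is perfect even though the labels are random, while the score on unseen data is much lower. Only the second number says anything about new students.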


In [None]:
# Import required libraries
import numpy as np
import pandas as pd


## Creating a More Realistic Dataset

Instead of using only CGPA, we now use multiple features.

Features used:
- CGPA
- Number of internships
- Coding skill level

Target:
- Placement status (0 = Not Placed, 1 = Placed)


In [None]:
# Creating dataset using NumPy
# Each row is one student: [CGPA, number of internships, coding skill level]
X = np.array([
    [7.5, 1, 3],
    [6.2, 0, 2],
    [8.1, 2, 4],
    [7.0, 1, 3],
    [8.5, 3, 5],
    [5.9, 0, 1],
    [7.8, 2, 4],
    [6.8, 1, 2]
])

y = np.array([1, 0, 1, 1, 1, 0, 1, 0])

print("Features (X):")
print(X)
print("\nTarget (y):")
print(y)

Features (X):
[[7.5 1.  3. ]
 [6.2 0.  2. ]
 [8.1 2.  4. ]
 [7.  1.  3. ]
 [8.5 3.  5. ]
 [5.9 0.  1. ]
 [7.8 2.  4. ]
 [6.8 1.  2. ]]

Target (y):
[1 0 1 1 1 0 1 0]
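
Since pandas is already imported, the same data can also be viewed as a labelled table. This is an optional sketch: the column names `cgpa`, `internships`, `coding_skill` and `placed` are illustrative choices, not names used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

X = np.array([
    [7.5, 1, 3], [6.2, 0, 2], [8.1, 2, 4], [7.0, 1, 3],
    [8.5, 3, 5], [5.9, 0, 1], [7.8, 2, 4], [6.8, 1, 2],
])
y = np.array([1, 0, 1, 1, 1, 0, 1, 0])

# Column names are illustrative labels (assumption), matching the feature list above.
df = pd.DataFrame(X, columns=["cgpa", "internships", "coding_skill"])
df["placed"] = y

print(df)

# Average feature values for each placement status (0 = Not Placed, 1 = Placed).
print(df.groupby("placed").mean())
```

Named columns make it easier to see which number is which, and the group averages give a quick feel for how the features differ between placed and not placed students.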


## Understanding the Shape of Data

Rows represent students.
Columns represent features.

This 2-D layout (rows = samples, columns = features) is the input format that scikit-learn models expect. The target `y` is a 1-D array with one label per student.


In [None]:
print("Shape of X:", X.shape)
print("Shape of y:", y.shape)

Shape of X: (8, 3)
Shape of y: (8,)


## Train–Test Split

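We now split the data into training and testing parts. With `test_size=0.25`, 2 of the 8 students are held out for testing, and `random_state=42` fixes the shuffle so the split is reproducible.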

- Training data is used to teach the model
- Testing data is used to evaluate the model


In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

print("Training data shape:", X_train.shape)
print("Testing data shape:", X_test.shape)

Training data shape: (6, 3)
Testing data shape: (2, 3)
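
With so few rows, a random split can put only one class into the test set. The sketch below checks the labels produced by the split above and then repeats it with the `stratify` parameter of `train_test_split`, which asks for class proportions to be preserved in both parts. The `_s` suffixed variable names are only for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.array([
    [7.5, 1, 3], [6.2, 0, 2], [8.1, 2, 4], [7.0, 1, 3],
    [8.5, 3, 5], [5.9, 0, 1], [7.8, 2, 4], [6.8, 1, 2],
])
y = np.array([1, 0, 1, 1, 1, 0, 1, 0])

# Same call as in the notebook: look at which labels landed in the test set.
_, _, _, y_test_plain = train_test_split(X, y, test_size=0.25, random_state=42)
print("Test labels without stratify:", y_test_plain)

# stratify=y keeps the Placed / Not Placed ratio similar in train and test.
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
print("Test labels with stratify:", y_test_s)
print("Train class counts [not placed, placed]:", np.bincount(y_train_s))
```

A stratified split is a common choice for small or imbalanced classification datasets, because it guarantees the test set is checked against both outcomes.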


## Importing the Machine Learning Model

We continue using Logistic Regression because:
- The problem is binary (Placed / Not Placed)
- The model is simple and interpretable


In [None]:
from sklearn.linear_model import LogisticRegression


## Training the Model on Training Data

The model learns patterns only from the training dataset.


In [None]:
model = LogisticRegression()
model.fit(X_train, y_train)

print("Model training completed")

Model training completed
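
One reason Logistic Regression is considered interpretable is that the fitted model exposes one weight per feature. The sketch below rebuilds the same data and split, then prints `coef_` and `intercept_`. The `feature_names` list is an illustrative label set, and with only 6 training rows the exact values should not be over-interpreted.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X = np.array([
    [7.5, 1, 3], [6.2, 0, 2], [8.1, 2, 4], [7.0, 1, 3],
    [8.5, 3, 5], [5.9, 0, 1], [7.8, 2, 4], [6.8, 1, 2],
])
y = np.array([1, 0, 1, 1, 1, 0, 1, 0])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

model = LogisticRegression()
model.fit(X_train, y_train)

# Illustrative names (assumption) for the three columns of X.
feature_names = ["cgpa", "internships", "coding_skill"]

# A positive weight pushes the prediction towards class 1 (Placed),
# a negative weight towards class 0 (Not Placed).
for name, weight in zip(feature_names, model.coef_[0]):
    print(f"{name}: {weight:+.3f}")
print(f"intercept: {model.intercept_[0]:+.3f}")
```

Because the features are on different scales (CGPA around 6 to 8.5, internships 0 to 3), the weights are not directly comparable to each other without scaling.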


## Making Predictions on Test Data

The test data has not been seen by the model during training.
This simulates real-world prediction.


In [None]:
y_pred = model.predict(X_test)
print("Predictions:", y_pred)
print("Actual values:", y_test)

Predictions: [0 0]
Actual values: [0 0]
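
Besides hard 0/1 predictions, the model can also report how confident it is through `predict_proba`. The sketch below shows this for the test rows and for one made-up student. The `new_student` values are hypothetical and exist only for this example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X = np.array([
    [7.5, 1, 3], [6.2, 0, 2], [8.1, 2, 4], [7.0, 1, 3],
    [8.5, 3, 5], [5.9, 0, 1], [7.8, 2, 4], [6.8, 1, 2],
])
y = np.array([1, 0, 1, 1, 1, 0, 1, 0])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
model = LogisticRegression().fit(X_train, y_train)

# Each row holds [P(Not Placed), P(Placed)] for one test student.
proba = model.predict_proba(X_test)
print("Class probabilities for the test set:")
print(proba)

# Hypothetical new student: CGPA 7.2, 1 internship, coding skill 3.
new_student = np.array([[7.2, 1, 3]])
new_pred = model.predict(new_student)
print("Prediction for the new student:", new_pred[0])
print("Probability of placement:", model.predict_proba(new_student)[0, 1])
```

The input to `predict` must keep the same column order as the training data, and it must be 2-D even for a single student.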


## Evaluating Model Accuracy on Test Data

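Test accuracy tells us how well the model generalizes to new data. Here the test set holds only 2 students, both Not Placed, so the score can only be 0.0, 0.5, or 1.0 and should be read as a rough indication rather than a precise estimate.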


In [None]:
test_accuracy = model.score(X_test, y_test)
print("Test Accuracy:", test_accuracy)

Test Accuracy: 1.0
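
For a classifier, `model.score` returns accuracy: the fraction of predictions that match the true labels. The sketch below recomputes that value by hand and also prints the training accuracy next to it, so the two can be compared. The names `train_accuracy` and `manual_accuracy` are introduced only for this example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X = np.array([
    [7.5, 1, 3], [6.2, 0, 2], [8.1, 2, 4], [7.0, 1, 3],
    [8.5, 3, 5], [5.9, 0, 1], [7.8, 2, 4], [6.8, 1, 2],
])
y = np.array([1, 0, 1, 1, 1, 0, 1, 0])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
model = LogisticRegression().fit(X_train, y_train)

# Accuracy = (number of correct predictions) / (number of predictions).
y_pred = model.predict(X_test)
manual_accuracy = np.mean(y_pred == y_test)

train_accuracy = model.score(X_train, y_train)
test_accuracy = model.score(X_test, y_test)

print("Training accuracy:", train_accuracy)
print("Test accuracy (model.score):", test_accuracy)
print("Test accuracy (by hand):", manual_accuracy)
```

A large gap between training and test accuracy is a typical warning sign that the model has memorized the training data.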


## Important Learning from Day 5

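- Training accuracy can be misleading because the model has already seen that data
- Test accuracy is a better estimate of real performance, although on a very small test set it is still noisy
- Using several relevant features gives the model more information to base its decision on
- Evaluating on held-out data, as Train–Test Split does, is standard practice in real ML projects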
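
Because a 2-row test set gives a very coarse score, one common complement is to repeat the split several times and average the results. The sketch below uses `cross_val_score` with a 3-fold `StratifiedKFold`. This goes slightly beyond what the class covered and is shown only as an optional illustration; 3 folds is chosen because the smaller class has just 3 students.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X = np.array([
    [7.5, 1, 3], [6.2, 0, 2], [8.1, 2, 4], [7.0, 1, 3],
    [8.5, 3, 5], [5.9, 0, 1], [7.8, 2, 4], [6.8, 1, 2],
])
y = np.array([1, 0, 1, 1, 1, 0, 1, 0])

# Each fold takes a turn as the test set; the model is retrained each time.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
cv_scores = cross_val_score(LogisticRegression(), X, y, cv=cv)

print("Accuracy per fold:", cv_scores)
print("Mean accuracy:", cv_scores.mean())
```

Every student is used for testing exactly once, so the mean is based on all 8 rows instead of 2.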


## What Comes Next

In the next class, we will:
- Improve the model further
- Discuss overfitting and underfitting
- Introduce feature scaling
- Convert this into a more complete project
