# Day 50 - Logistic Regression on Unseen Data

## Introduction
In the previous notebook (Day49), I learned how to build and evaluate a Logistic Regression model on training and testing data.  

But in real-world scenarios, the ultimate goal is not just to test on a portion of the dataset. We want to deploy the model so it can **predict on unseen or future data**.  

This notebook focuses on using the trained Logistic Regression model to make predictions on a new dataset (unseen data).

---
## Why Predict on Unseen Data?

- A train/test split of one dataset only tells us how the model performs on data collected together with the training samples.
- In practice, new data keeps coming (future customers, patients, transactions, etc.).
- We need to check if the model can generalize well to **unseen samples**.

This step is very close to deployment.  
If the model performs well on unseen data, that is good evidence it is ready for real-world use.

---
## Workflow for Using Logistic Regression on Unseen Data

1. **Train Logistic Regression Model** on the training dataset (same as Day49).  
2. **Save the Model** (optional, using joblib/pickle).  
3. **Load New Dataset** (unseen data).  
4. **Preprocess New Data** (must apply the same preprocessing steps as training data: scaling, encoding, etc.).  
5. **Make Predictions** on unseen data.  
6. **Evaluate or Interpret** the predictions.
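
The workflow above can be sketched end to end with a scikit-learn pipeline, which bundles the scaler and the classifier so that new rows always pass through the scaler fitted on the training data. This is only an illustrative sketch: the data is synthetic, and the names `model`, `X_new`, and `new_labels` are assumptions, not part of the Day49 notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data (hypothetical values, not the Day49 dataset):
# two features shaped like age and salary, with a label that depends on both.
rng = np.random.default_rng(0)
age = rng.uniform(18, 60, size=400)
salary = rng.uniform(15_000, 150_000, size=400)
X = np.column_stack([age, salary])
y = ((age - 18) / 42 + (salary - 15_000) / 135_000 > 1.0).astype(int)

# Step 1: train. The pipeline fits the scaler on the training rows only.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

# Steps 3-5: new rows go through the same fitted scaler inside the pipeline.
X_new = np.array([[45, 60_000], [23, 78_000], [58, 140_000]])
new_labels = model.predict(X_new)

# Step 6: with no labels for X_new, only the held-out test split can be scored.
test_accuracy = model.score(X_test, y_test)
print(new_labels, round(test_accuracy, 3))
```

Saving the whole pipeline (step 2) would then persist the scaler and the model as a single object.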

---

## Important: Preprocessing Consistency

When predicting on unseen data, it is crucial that the **same preprocessing steps** are applied:  
- If StandardScaler was used, the new data must also be scaled using the same scaler fitted on training data.  
- If encoding was applied, the same encoding scheme must be followed.  

Otherwise, predictions may be incorrect or inconsistent.
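
To see why this matters, here is a small standard-library sketch comparing the z-scores a new batch gets from the training statistics with the z-scores it gets when the statistics are recomputed from the batch itself. The training ages are hypothetical and the helper `standardize` is only for illustration.

```python
from statistics import mean, pstdev

# Hypothetical training ages (illustrative only; not taken from the Day49 dataset).
train_ages = [19, 26, 27, 32, 35, 41, 47, 52, 58, 60]
# Ages of a few rows in a new batch.
new_ages = [45, 79, 23, 34, 29]

def standardize(values, center, scale):
    """Apply z = (x - center) / scale, the per-column formula StandardScaler uses."""
    return [(v - center) / scale for v in values]

# Consistent: reuse the statistics learned from the training data.
train_mean, train_std = mean(train_ages), pstdev(train_ages)
z_consistent = standardize(new_ages, train_mean, train_std)

# Inconsistent: recompute the statistics from the new batch itself.
z_refit = standardize(new_ages, mean(new_ages), pstdev(new_ages))

for age, a, b in zip(new_ages, z_consistent, z_refit):
    print(f"age={age:>3}  training stats: {a:+.2f}   refit on new batch: {b:+.2f}")
```

The same age maps to a different input value in the two cases, so the model would be scoring different points than the ones it was trained to separate.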

---

## Import Libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
import pickle

## Load dataset

In [2]:
dataset = pd.read_csv(r"C:\Users\Arman\Downloads\dataset\logit classification.csv")

## Prepare the data
### Split into features (X) and target (y)

In [3]:
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, -1].values

## Splitting the dataset into the Training set and Test set

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)

## Feature Scaling

In [5]:
# Fit the scaler on the training split only, then reuse it for the test split
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

## Train the Logistic Regression Model

In [6]:
classifier = LogisticRegression()
classifier.fit(X_train, y_train)

## Predicting the Test set results

In [7]:
y_pred = classifier.predict(X_test)

## Confusion Matrix

In [8]:
cm = confusion_matrix(y_test, y_pred)
print(cm)

[[57  1]
 [ 5 17]]


## Accuracy of the model

In [9]:
ac = accuracy_score(y_test, y_pred)
print(ac)

0.925
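
The accuracy can be re-derived by hand from the confusion matrix printed earlier, which also gives precision and recall for class 1. This sketch assumes scikit-learn's layout for `confusion_matrix` (rows are true classes, columns are predicted classes).

```python
# Confusion matrix from the test set: rows are true classes, columns are predicted classes.
cm = [[57, 1],
      [5, 17]]
tn, fp = cm[0]
fn, tp = cm[1]

total = tn + fp + fn + tp
accuracy = (tn + tp) / total   # 74 correct out of 80
precision = tp / (tp + fp)     # of the predicted 1s, how many were right
recall = tp / (tp + fn)        # of the actual 1s, how many were found

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
# prints: accuracy=0.925 precision=0.944 recall=0.773
```

The lower recall shows that most of the errors are class 1 samples predicted as class 0.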


## Training Accuracy

In [10]:
# Accuracy on the training split
train_accuracy = classifier.score(X_train, y_train)
print(train_accuracy)

0.821875


## Testing Accuracy

In [11]:
# Accuracy on the held-out test split
test_accuracy = classifier.score(X_test, y_test)
print(test_accuracy)

0.925


## Now Predict Future Data

## Load Future Data

In [12]:
dataset1 = pd.read_csv(r"C:\Users\Arman\Downloads\dataset\final1.csv")

In [13]:
d2 = dataset1.copy()

In [14]:
d2

Unnamed: 0.1,Unnamed: 0,User ID,Gender,Age,EstimatedSalary
0,0,15724611,Male,45,60000
1,1,15725621,Female,79,64000
2,2,15725622,Male,23,78000
3,3,15720611,Female,34,45000
4,4,15588044,Male,29,76000
5,5,15746039,Female,70,89000
6,6,15704887,Male,86,120000
7,7,15746009,Female,46,23000
8,8,15876009,Male,32,70000
9,9,15886009,Female,100,90000


### Dataset Description – Unseen Data
This dataset contains new records with the same feature columns as the training dataset.  
- Features: `Age`, `EstimatedSalary`  
- Other columns (`User ID`, `Gender`, and the two `Unnamed` index columns) are not used by the model.  
- Target: Not available (since this is future data).  

Goal: Predict class labels for these unseen samples using the trained Logistic Regression model.  

## Select Required Columns

In [15]:
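# Select the two features by name, so the result does not depend on
# how many leftover index columns the file carries
dataset1 = dataset1[["Age", "EstimatedSalary"]].values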

## Feature Scaling

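In [16]:
# Reuse the scaler fitted on the training data; refitting on the new rows
# would standardize them against their own mean and standard deviation.
# Note: the predictions recorded below came from a run that refit the scaler,
# so rerunning this cell may change individual labels.
M = sc.transform(dataset1)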

## Prediction

In [17]:
# Attach the predicted class labels to the copy of the unseen data
d2['y_pred1'] = classifier.predict(M)

In [19]:
d2

Unnamed: 0.1,Unnamed: 0,User ID,Gender,Age,EstimatedSalary,y_pred1
0,0,15724611,Male,45,60000,0
1,1,15725621,Female,79,64000,1
2,2,15725622,Male,23,78000,0
3,3,15720611,Female,34,45000,0
4,4,15588044,Male,29,76000,0
5,5,15746039,Female,70,89000,1
6,6,15704887,Male,86,120000,1
7,7,15746009,Female,46,23000,0
8,8,15876009,Male,32,70000,0
9,9,15886009,Female,100,90000,1


### Interpretation of Predictions
- Each predicted value (0 or 1) is the class label assigned by the Logistic Regression model.  
- This shows how the trained model can be applied to records outside the original dataset. Because these records have no target labels, the accuracy of the predictions cannot be measured here.  
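
Behind each hard label is a probability: logistic regression passes a weighted sum of the (scaled) features through the sigmoid function and assigns class 1 when the result reaches 0.5. A minimal sketch of that rule follows; the weights and intercept are hypothetical and are NOT the coefficients of the classifier trained above.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical fitted parameters for two standardized features (age, salary).
weights = [2.0, 1.1]
intercept = -1.0

def predict_with_probability(features, threshold=0.5):
    """Return (probability of class 1, hard label) for one standardized sample."""
    score = intercept + sum(w * x for w, x in zip(weights, features))
    probability = sigmoid(score)
    return probability, int(probability >= threshold)

results = []
for sample in ([-1.0, 0.2], [0.45, 0.0], [1.5, 0.8]):
    p, label = predict_with_probability(sample)
    results.append((p, label))
    print(f"features={sample}  p(class 1)={p:.2f}  label={label}")
```

For the fitted scikit-learn model, `classifier.predict_proba(M)` returns these probabilities directly, which helps to spot predictions that sit close to the 0.5 boundary.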

## Save Result to CSV

In [20]:
# index=False avoids writing the row index as an extra unnamed column
d2.to_csv('final2.csv', index=False)
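
The `Unnamed: 0` and `Unnamed: 0.1` columns in the tables above are typical of a DataFrame whose index was written to CSV and read back more than once. The sketch below reproduces that effect in memory on a tiny made-up frame; the helper `round_trip` is hypothetical.

```python
import io
import pandas as pd

df = pd.DataFrame({"Age": [45, 79], "EstimatedSalary": [60000, 64000]})

def round_trip(frame, **to_csv_kwargs):
    """Write a frame to CSV text and read it back, as a save/load cycle would."""
    return pd.read_csv(io.StringIO(frame.to_csv(**to_csv_kwargs)))

# Default to_csv writes the index; each save/load cycle adds one "Unnamed" column.
once = round_trip(df)
twice = round_trip(once)

# index=False keeps the column set stable across cycles.
clean = round_trip(round_trip(df, index=False), index=False)

print(list(once.columns))
print(list(twice.columns))
print(list(clean.columns))
```

Passing `index=False` to `to_csv` when saving results keeps later loads free of these extra columns.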

## Save Model and Scaler for Streamlit App

In [21]:
filename = 'logistic_regression_model.pkl'
with open(filename, 'wb') as file:
    pickle.dump(classifier, file)
print(f"Model has been pickled and saved as {filename}")

# sc must be the scaler that was fitted on X_train, not one refit on the new data
filename1 = 'logistic_regression_scaler.pkl'
with open(filename1, 'wb') as file1:
    pickle.dump(sc, file1)
print(f"Scaler has been pickled and saved as {filename1}")

Model has been pickled and saved as logistic_regression_model.pkl
Scaler has been pickled and saved as logistic_regression_scaler.pkl
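
On the consuming side (for example, inside the app), both pickles are loaded back and the scaler is applied with `transform` only. The following is a self-contained sketch of that round trip under stated assumptions: the data is synthetic, the file names are hypothetical, and files are written to a temporary directory rather than to the paths used above.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data (hypothetical values).
rng = np.random.default_rng(0)
X_train = np.column_stack([rng.uniform(18, 60, 200), rng.uniform(15_000, 150_000, 200)])
y_train = (X_train[:, 0] > 40).astype(int)

scaler = StandardScaler().fit(X_train)  # fitted on training data only
model = LogisticRegression().fit(scaler.transform(X_train), y_train)

with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "model.pkl")    # hypothetical file names
    scaler_path = os.path.join(tmp, "scaler.pkl")
    with open(model_path, "wb") as f:
        pickle.dump(model, f)
    with open(scaler_path, "wb") as f:
        pickle.dump(scaler, f)

    # Later, e.g. inside the app: load both objects back.
    with open(model_path, "rb") as f:
        loaded_model = pickle.load(f)
    with open(scaler_path, "rb") as f:
        loaded_scaler = pickle.load(f)

X_new = np.array([[25, 60_000], [55, 60_000]])
# transform only: the loaded scaler already carries the training mean and std.
labels = loaded_model.predict(loaded_scaler.transform(X_new))
print(labels)
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that produced them.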


---

## Summary and Key Takeaways
- Extended Logistic Regression from Day49 to work with **unseen/future data**.
- Reinforced the importance of **consistent preprocessing** between training and unseen data.
- Demonstrated how to generate predictions on a new dataset using the trained model.
- Highlighted that evaluation on unseen data depends on whether target labels are available.
- This step represents a bridge towards **real-world deployment**, where models are applied to new incoming data.