# Week 6 Lab Part 2
## Model Training, Testing, and Validation

### Load Saved Model

In [9]:
from joblib import load
from sklearn.metrics import accuracy_score

# Define the file path for the saved model
model_file_path = "Week6_model.joblib"

# Load the saved model
loaded_model = load(model_file_path)

# Load the validation features and labels saved by the training notebook
X_val = load("X_val.joblib")
y_val = load("y_val.joblib")

In [11]:
# Make predictions on the validation set
y_val_pred = loaded_model.predict(X_val)

In [13]:
y_val_pred

array([1, 1, 1, ..., 1, 1, 1])

In [15]:
accuracy_score(y_val, y_val_pred)

1.0

# **Week 6 Lab Summary: Model Training, Testing, and Validation**

## **1. Data Loading and Feature Engineering**
- I connected to a **PostgreSQL** database and pulled in the `sales_transaction` dataset.
- I engineered new features:
  - **Total Purchase Frequency** (number of purchases per customer).
  - **Total Revenue Per Customer** (sum of `Price * Quantity` for each customer).
  - **Repeat Customer** label (`1` if `Total_Purchase_Frequency > 1`, else `0`).
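
The feature engineering above could be sketched with pandas roughly as follows. This is an illustration on toy rows, not the lab's actual query: the column names `CustomerNo` and `TransactionNo` are assumptions, and counting distinct transaction IDs is just one reasonable reading of "number of purchases per customer".

```python
import pandas as pd

# Toy stand-in for the `sales_transaction` table; the column names
# CustomerNo and TransactionNo are assumptions made for this sketch.
sales = pd.DataFrame({
    "CustomerNo":    [1, 1, 2, 3, 3, 3],
    "TransactionNo": ["A1", "A2", "B1", "C1", "C2", "C2"],
    "Price":         [10.0, 5.0, 20.0, 2.0, 3.0, 4.0],
    "Quantity":      [1, 2, 1, 5, 1, 2],
})

# Revenue of each line item
sales["Revenue"] = sales["Price"] * sales["Quantity"]

# One row per customer: distinct purchases and total revenue
features = sales.groupby("CustomerNo").agg(
    Total_Purchase_Frequency=("TransactionNo", "nunique"),
    Total_Revenue=("Revenue", "sum"),
).reset_index()

# Label: 1 if the customer bought more than once, else 0
features["Repeat_Customer"] = (
    features["Total_Purchase_Frequency"] > 1
).astype(int)

print(features)
```

Note that in this construction the label is computed directly from `Total_Purchase_Frequency`, which matters for the leakage discussion later in this summary.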

## **2. Data Splitting**
- I split the dataset into:
  - **Train (70%)**
  - **Validation (15%)**
  - **Test (15%)**
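
A 70/15/15 split can be produced with two calls to `train_test_split`. The sketch below uses synthetic data; `stratify` and `random_state=42` are assumptions for illustration, since the summary does not say how the split was actually done.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the lab uses the engineered customer features.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = rng.integers(0, 2, size=200)

# Step 1: hold out 30% of the rows for validation + test
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)

# Step 2: split the held-out 30% in half -> 15% validation, 15% test
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.50, random_state=42, stratify=y_tmp
)

print(len(X_train), len(X_val), len(X_test))
```

Stratifying keeps the class proportions similar across the three sets, which helps when one class is much more common than the other.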

## **3. Model Training**
- I trained a **Random Forest Classifier** (`n_estimators=100, random_state=42`) using the training data.
- After training, I saved both the model and the validation dataset using **joblib**.
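
A minimal sketch of the train-and-save step, using synthetic data in place of the real splits. The file names match the ones loaded in the validation notebook; the temporary directory is only there to keep the example self-contained.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data for the training and validation splits
rng = np.random.default_rng(42)
X_train = rng.normal(size=(140, 2))
y_train = (X_train[:, 0] > 0).astype(int)
X_val = rng.normal(size=(30, 2))
y_val = (X_val[:, 0] > 0).astype(int)

# Same hyperparameters as in the lab
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Save the model and the validation split so another notebook can load them
out_dir = tempfile.mkdtemp()
model_path = os.path.join(out_dir, "Week6_model.joblib")
dump(model, model_path)
dump(X_val, os.path.join(out_dir, "X_val.joblib"))
dump(y_val, os.path.join(out_dir, "y_val.joblib"))

# Round-trip check: the reloaded model should give identical predictions
reloaded = load(model_path)
same = np.array_equal(model.predict(X_val), reloaded.predict(X_val))
print(same)
```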

## **4. Model Validation in Another Notebook**
- I loaded the **saved model**, **X_val**, and **y_val** in a separate notebook.
- I made predictions on **X_val**, stored them in **y_val_pred**, and calculated the accuracy score against **y_val**.
- The accuracy came out to **1.0**, which is way too high and definitely suspicious.
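
Before trusting (or distrusting) a perfect score, it helps to compare it with a trivial baseline and to look at the confusion matrix. The labels and predictions below are made-up stand-ins for `y_val` and `y_val_pred`; the point of the sketch is the checks, not the numbers.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels and predictions standing in for y_val and y_val_pred
y_val = np.array([1, 1, 1, 0, 1, 1, 0, 1, 1, 1])
y_val_pred = np.array([1, 1, 1, 0, 1, 1, 0, 1, 1, 1])

# How many samples fall into each class?
classes, counts = np.unique(y_val, return_counts=True)
print(dict(zip(classes.tolist(), counts.tolist())))

# Accuracy of always predicting the most common class
majority_class = classes[np.argmax(counts)]
baseline_acc = accuracy_score(y_val, np.full_like(y_val, majority_class))

model_acc = accuracy_score(y_val, y_val_pred)
print(f"baseline={baseline_acc:.2f}, model={model_acc:.2f}")

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_val, y_val_pred, labels=[0, 1])
print(cm)
```

If the validation set is dominated by one class, the baseline alone can be very high, so accuracy should be read alongside the class counts.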

---

## **Possible Issues: Why Is My Model Too Perfect?**
1. **Data Leakage?**  
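   - If `Total_Purchase_Frequency` was calculated using all data (instead of just training data), my model might be "cheating" by learning from information it shouldn't have.
   - The **Repeat Customer** label is defined directly from `Total_Purchase_Frequency`, so if that column is also one of the model's input features, a single threshold on it reproduces the label exactly.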

2. **Overfitting?**  
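   - A **Random Forest** with fully grown trees can memorize the training data, especially if the dataset is small or if there's a dominant feature. Overfitting normally shows up as a gap between training and validation accuracy, though, so on its own it doesn't explain a perfect validation score.
   - If I want to constrain the model, limiting `max_depth` is more useful than reducing `n_estimators`, since adding trees doesn't by itself cause overfitting.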

3. **Target Leakage?**  
   - If `Total_Revenue` includes future transactions, the model has unfair access to future outcomes, making predictions too easy.
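
One way to rule out items 1 and 3 is to compute features only from transactions before a cutoff date and to define the label from what happens afterwards. The sketch below shows that idea on toy rows. It is a different framing from the label used in this lab, and the `CustomerNo` and `Date` column names, the `Past_*` feature names, and the cutoff itself are all assumptions.

```python
import pandas as pd

# Toy transactions; CustomerNo and Date are assumed column names.
tx = pd.DataFrame({
    "CustomerNo": [1, 1, 2, 3, 3, 3],
    "Date": pd.to_datetime([
        "2024-01-05", "2024-03-10", "2024-01-20",
        "2024-01-02", "2024-01-15", "2024-02-20",
    ]),
    "Price": [10.0, 5.0, 20.0, 2.0, 3.0, 4.0],
    "Quantity": [1, 2, 1, 5, 1, 2],
})
tx["Revenue"] = tx["Price"] * tx["Quantity"]

cutoff = pd.Timestamp("2024-02-01")  # arbitrary cutoff for illustration
past = tx[tx["Date"] < cutoff]
future = tx[tx["Date"] >= cutoff]

# Features use only transactions before the cutoff
features = past.groupby("CustomerNo").agg(
    Past_Purchase_Frequency=("Date", "count"),
    Past_Revenue=("Revenue", "sum"),
).reset_index()

# Label uses only transactions on or after the cutoff: did the customer return?
returned = future["CustomerNo"].unique()
features["Repeat_Customer"] = features["CustomerNo"].isin(returned).astype(int)

print(features)
```

With this separation, no feature can contain information from the period the label describes.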

---

## **Next Steps**
- **Check for data leakage**: I need to make sure my computed features don’t use future data when making predictions.
- **Try a simpler model**: Running a **Logistic Regression** or a shallow decision tree would help see if the issue persists.
- **Look at feature importance**: Using `.feature_importances_` on my Random Forest model could reveal if a single feature is dominating the predictions.
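
The last two checks can be sketched together. The data below is synthetic, with the label derived from `Total_Purchase_Frequency` as in the summary; whether the real model was trained on that column is something I still need to confirm. A depth-1 decision tree (a single split) on each feature shows whether one column alone already determines the label.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic customers: the label is derived from Total_Purchase_Frequency,
# mirroring the label definition in the summary. All values are made up.
rng = np.random.default_rng(42)
X = pd.DataFrame({
    "Total_Purchase_Frequency": rng.integers(1, 6, size=300),
    "Total_Revenue": rng.uniform(5, 500, size=300),
})
y = (X["Total_Purchase_Frequency"] > 1).astype(int)

# Which features does the forest rely on?
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)
importances = pd.Series(forest.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))

# Can a one-split tree on a single feature already reproduce the label?
stump_scores = {}
for col in X.columns:
    stump = DecisionTreeClassifier(max_depth=1, random_state=42)
    stump_scores[col] = stump.fit(X[[col]], y).score(X[[col]], y)
print(stump_scores)
```

A single feature whose one-split tree scores perfectly is a strong hint that the label is encoded in that feature, and dropping it from the inputs is the natural next experiment.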

This accuracy score is **too good to be true**, so I need to dig deeper! 🚀
