# 📘 Preprocessing Notebook – Tutas Recommender

This notebook handles the preprocessing pipeline **after feature engineering in Google Cloud BigQuery**.  
The feature-engineered dataset (`tutas_recommender_training_features.csv`) is exported from BigQuery and used here as input.

---

## 🎯 Objective
- Load the feature-engineered dataset from the BigQuery export.
- Clean and prepare the data for machine learning.
- Split into training and testing sets (`X_train`, `X_test`, `y_train`, `y_test`).
- Save processed files under `dataset/processed_2/` for model training.

---

## 🔄 Workflow
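1. **Import dependencies** → pandas and scikit-learn (`StandardScaler`, `train_test_split`).  
2. **Load dataset** → `tutas_recommender_training_features.csv` from `dataset/processed_1/`.  
3. **Identify numerical features** → list the numeric columns and review their statistics.  
4. **Encode boolean features** → convert `True` / `False` to 1 / 0.  
5. **Feature scaling** → separate predictors (`X`) from the target label (`y`) and standardize the numerical predictors.  
6. **Train-test split and save** → create `X_train`, `X_test`, `y_train`, `y_test` and store them in `dataset/processed_2/`.  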

---

📌 **Note**:  
- The dataset is **synthetic** and was created for demonstration purposes.  
- No real user data is involved.  


## 1. Import Dependencies

We start by importing the required libraries for preprocessing:

- **pandas** → to handle and manipulate tabular datasets.  
- **sklearn.preprocessing.StandardScaler** → to standardize numerical features (zero mean, unit variance).  
- **sklearn.model_selection.train_test_split** → to split the dataset into training and testing subsets.

These libraries will be used throughout the notebook to clean, transform, and prepare the data for model training.


In [4]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split


## 2. Load the Feature-Engineered Dataset

We load the dataset exported from **BigQuery** (`tutas_recommender_training_features.csv`).  
This dataset contains feature-engineered records representing student–tutor interactions.

After loading:
- Display the first few rows with `head()` to get an overview.  
- Use `info()` to check data types and non-null counts.  
- Use `isnull().sum()` to identify any missing values in each column.


In [None]:
# Load the feature-engineered dataset exported from BigQuery
df = pd.read_csv('../data/dataset/processed_1/tutas_recommender_training_features.csv')

# Preview the first 5 rows to understand dataset structure
print(df.head())

# Display data types, column info, and non-null counts
df.info()

# Check for missing values in each column
df.isnull().sum()


  id_murid id_tutor  topik_match  gaya_match  metode_match  time_overlap  \
0    M0566    T0008         True        True          True         False   
1    M0379    T0013         True        True          True         False   
2    M0964    T0013         True        True          True         False   
3    M0100    T0017         True        True          True         False   
4    M0416    T0018         True        True          True         False   

   murid_flexible  tutor_flexible  feedback_score  label  \
0           False           False               0      0   
1           False           False               0      0   
2           False           False               0      0   
3           False           False               0      0   
4           False           False               0      0   

   average_rating_tutor  total_jam_ngajar_tutor  pernah_gagal  
0                  3.71                      72          True  
1                  4.60                     323       

id_murid                  0
id_tutor                  0
topik_match               0
gaya_match                0
metode_match              0
time_overlap              0
murid_flexible            0
tutor_flexible            0
feedback_score            0
label                     0
average_rating_tutor      0
total_jam_ngajar_tutor    0
pernah_gagal              0
dtype: int64
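
The checks above can also be turned into an automated guard. The following is a minimal sketch, assuming the column names shown in the preview output; `validate_features` and `EXPECTED_COLUMNS` are hypothetical names introduced for illustration and are not part of the original notebook.

```python
import pandas as pd

# Column names copied from the preview output above
EXPECTED_COLUMNS = [
    "id_murid", "id_tutor", "topik_match", "gaya_match", "metode_match",
    "time_overlap", "murid_flexible", "tutor_flexible", "feedback_score",
    "label", "average_rating_tutor", "total_jam_ngajar_tutor", "pernah_gagal",
]


def validate_features(frame: pd.DataFrame, expected=EXPECTED_COLUMNS) -> None:
    """Hypothetical helper: raise ValueError on missing columns or missing values."""
    missing_cols = [c for c in expected if c not in frame.columns]
    if missing_cols:
        raise ValueError(f"Missing columns: {missing_cols}")

    null_counts = frame[expected].isnull().sum()
    with_nulls = null_counts[null_counts > 0]
    if not with_nulls.empty:
        raise ValueError(f"Columns with missing values: {with_nulls.to_dict()}")


# Tiny made-up frame, used only to exercise the helper
demo = pd.DataFrame({name: [0, 1] for name in EXPECTED_COLUMNS})
validate_features(demo)  # passes silently when nothing is wrong
```

In the notebook itself, the call would be `validate_features(df)` right after `pd.read_csv(...)`, so that a changed export fails early instead of producing a silently different training set.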

## 3. Identify Numerical Features

We select all columns with numeric data types (`int64`, `float64`) for further preprocessing.  
The numerical features will later be **standardized** using `StandardScaler`, which helps models that are sensitive to feature scale. Note that this selection also picks up the target column `label`; it is excluded before scaling in Section 5.

After selecting:
- Print the list of numerical features.  
- Use `describe()` to view basic statistics (mean, std, min, max, quartiles) for these features.


In [11]:
# Select columns with numeric data types (int64, float64)
numerical_cols = df.select_dtypes(include=['int64', 'float64']).columns

# Print the list of numerical feature names
print("Numerical features:", list(numerical_cols))

# Show descriptive statistics for numerical features
print(df[numerical_cols].describe())


Numerical features: ['feedback_score', 'label', 'average_rating_tutor', 'total_jam_ngajar_tutor']
       feedback_score        label  average_rating_tutor  \
count     2012.000000  2012.000000           2012.000000   
mean         3.519384     0.736581              3.973335   
std          1.920116     0.440597              0.574820   
min          0.000000     0.000000              3.000000   
25%          3.000000     0.000000              3.490000   
50%          4.000000     1.000000              3.960000   
75%          5.000000     1.000000              4.452500   
max          5.000000     1.000000              5.000000   

       total_jam_ngajar_tutor  
count             2012.000000  
mean               196.456759  
std                109.466772  
min                 11.000000  
25%                106.000000  
50%                193.000000  
75%                287.000000  
max                400.000000  
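
The list printed above includes the target column `label`, which should not be treated as a feature. A possible way to select numeric predictors only is sketched below on a small made-up frame that reuses the dataset's column names; the `TARGET` constant and the sample values are assumptions for illustration.

```python
import pandas as pd

TARGET = "label"  # target column name, taken from the dataset preview

# Small made-up frame with the same column names as the dataset
demo = pd.DataFrame({
    "feedback_score": [0, 4, 5],
    "label": [0, 1, 1],
    "average_rating_tutor": [3.71, 4.60, 4.10],
    "total_jam_ngajar_tutor": [72, 323, 150],
    "pernah_gagal": [True, False, False],
})

# include="number" matches every numeric dtype (not only int64/float64)
# and leaves boolean columns out
numeric_cols = demo.select_dtypes(include="number").columns

# Drop the target so only predictors remain
feature_cols = [c for c in numeric_cols if c != TARGET]
print(feature_cols)
# ['feedback_score', 'average_rating_tutor', 'total_jam_ngajar_tutor']
```

This yields the same three columns that Section 5 lists by hand, without hard-coding them.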


## 4. Encode Boolean Features

Some columns contain boolean values (`True` / `False`).  
Since most machine learning algorithms expect numeric inputs, we convert these boolean features into integers:
- `True` → 1  
- `False` → 0  

This ensures all features are numeric and ready for preprocessing or scaling.


In [None]:
# Convert all boolean columns into integer values (True=1, False=0)
for col in df.select_dtypes(include=['bool']).columns:
    df[col] = df[col].astype(int)
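
As a small worked example of the conversion, the sketch below encodes all boolean columns in one `astype` call instead of a loop. The frame is made up and only borrows column names from the dataset; using `"int64"` explicitly is an assumption made to keep the resulting dtype the same on every platform.

```python
import pandas as pd

# Made-up frame with two boolean columns and one integer column
demo = pd.DataFrame({
    "topik_match": [True, False, True],
    "pernah_gagal": [False, False, True],
    "feedback_score": [0, 4, 5],
})

# Find the boolean columns, then cast them all at once
bool_cols = demo.select_dtypes(include="bool").columns
encoded = demo.astype({col: "int64" for col in bool_cols})

print(encoded["topik_match"].tolist())   # [1, 0, 1]
print(encoded["pernah_gagal"].tolist())  # [0, 0, 1]
```

Non-boolean columns such as `feedback_score` are left untouched because they are not part of the dtype mapping.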


## 5. Feature Scaling

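We apply **StandardScaler** to standardize the numerical features.  
- StandardScaler standardizes features by removing the mean and scaling to unit variance.  
- It is a reasonable choice here because the `describe()` output in Section 3 shows bounded ranges with no extreme outliers (ratings from 3 to 5, teaching hours from 11 to 400, feedback scores from 0 to 5).  
- If strong outliers were present, a **RobustScaler** would be more appropriate.  
- The scaler in this section is fit on the full dataset before the split in Section 6, so test-set statistics influence the scaling. Fitting it on the training split only would avoid this leakage.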
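
To make "zero mean, unit variance" concrete, the sketch below compares `StandardScaler` with the manual formula `z = (x - mean) / std` on a few made-up values. One detail worth knowing is that `StandardScaler` uses the population standard deviation (`ddof=0`), whereas pandas `describe()` reports the sample standard deviation (`ddof=1`), so the two differ slightly.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values standing in for one numerical column
hours = np.array([[72.0], [323.0], [150.0], [200.0]])

# Scale with scikit-learn
scaled = StandardScaler().fit_transform(hours)

# Same computation by hand, using the population standard deviation
manual = (hours - hours.mean(axis=0)) / hours.std(axis=0, ddof=0)

print(np.allclose(scaled, manual))  # True
```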

Steps:
1. Separate features (`X`) and target (`y`).  
2. Select numerical columns to be scaled.  
3. Apply `StandardScaler` only to the selected numerical features.


In [None]:
# Separate features (X) and target label (y)
X = df.drop(columns=['label', 'id_murid', 'id_tutor'])  # remove label and ID columns
y = df['label']

# Define numerical columns to be standardized
numerical_cols = ['feedback_score', 'average_rating_tutor', 'total_jam_ngajar_tutor']

# Initialize StandardScaler
scaler = StandardScaler()

# Apply scaling to the selected numerical features
X[numerical_cols] = scaler.fit_transform(X[numerical_cols])
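
The cell above fits the scaler on every row, including rows that later end up in the test set. An alternative ordering, shown here as a sketch on made-up data rather than as the notebook's method, splits first and fits the scaler on the training rows only. The helper name `apply_scaler` and the sample values are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Made-up rows that reuse the dataset's column names
demo = pd.DataFrame({
    "feedback_score": [0, 4, 5, 3, 5, 2, 4, 1, 5, 3],
    "average_rating_tutor": [3.7, 4.6, 4.1, 3.2, 4.9, 3.5, 4.0, 3.9, 4.4, 3.1],
    "total_jam_ngajar_tutor": [72, 323, 150, 40, 390, 110, 200, 95, 280, 60],
    "pernah_gagal": [1, 0, 0, 1, 0, 1, 0, 1, 0, 1],
    "label": [0, 1, 1, 0, 1, 0, 1, 0, 1, 1],
})
num_cols = ["feedback_score", "average_rating_tutor", "total_jam_ngajar_tutor"]

# Cast the columns to float up front so scaled values fit their dtype
X = demo.drop(columns=["label"]).astype({c: "float64" for c in num_cols})
y = demo["label"]

# Split first ...
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# ... then learn mean and variance from the training rows only
scaler = StandardScaler().fit(X_train[num_cols])


def apply_scaler(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with the numerical columns scaled by the fitted scaler."""
    scaled = frame.copy()
    scaled[num_cols] = scaler.transform(frame[num_cols])
    return scaled


X_train_scaled = apply_scaler(X_train)
X_test_scaled = apply_scaler(X_test)
```

With this ordering the test rows are transformed using training statistics only, which mirrors how unseen data would be handled at prediction time.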


## 6. Train–Test Split

We now split the dataset into **training** and **testing** subsets:

- **Training set (80%)** → used to train the machine learning model.  
- **Testing set (20%)** → held out for evaluation, to check generalization performance.  

After splitting, we save each subset as CSV files under `dataset/processed_2/`:
- `X_train.csv`, `X_test.csv`
- `y_train.csv`, `y_test.csv`


In [None]:
from pathlib import Path

# Split dataset into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Create the output folder first; to_csv does not create missing directories
output_dir = Path('../data/dataset/processed_2')
output_dir.mkdir(parents=True, exist_ok=True)

# Save the split datasets into the processed_2 folder
X_train.to_csv(output_dir / 'X_train.csv', index=False)
X_test.to_csv(output_dir / 'X_test.csv', index=False)
y_train.to_csv(output_dir / 'y_train.csv', index=False)
y_test.to_csv(output_dir / 'y_test.csv', index=False)
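
The `describe()` output in Section 3 reports a `label` mean of about 0.74, so positive examples clearly outnumber negative ones. One optional variation, not used in this notebook, is to pass `stratify=y` so that both subsets keep the same class ratio. The sketch below uses made-up data with 75% positive labels to show the effect.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 75 positive and 25 negative labels
demo = pd.DataFrame({
    "feature": range(100),
    "label": [1] * 75 + [0] * 25,
})
X = demo[["feature"]]
y = demo["label"]

# stratify=y keeps the class proportions equal in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Both subsets keep the 75% share of positive labels
print(y_train.mean(), y_test.mean())  # 0.75 0.75
```

Without `stratify`, the class ratio in a 20% test set can drift from the overall ratio purely by chance, which makes evaluation metrics noisier.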