## Project Workflow Overview

### 1. **Import Libraries**
   - Load all necessary libraries, including MLflow, data processing tools, and model frameworks.

### 2. **Set MLflow Tracking URI**
   - Point MLflow at the tracking server or local directory where experiment runs are recorded.
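
As a rough illustration of the "server or local directory" choice, the sketch below builds a tracking URI for either case. The helper name `pick_tracking_uri` and the default folder `mlruns` are hypothetical and are not part of this project; the server address is the one used later in this notebook.

```python
from pathlib import Path


def pick_tracking_uri(server_url=None, local_dir="mlruns"):
    """Return the server URL if one is given, else a file:// URI for a local folder.

    Hypothetical helper for illustration only.
    """
    if server_url:
        return server_url
    # as_uri() needs an absolute path, so resolve the folder first
    return Path(local_dir).resolve().as_uri()


print(pick_tracking_uri("http://localhost:6969"))  # remote tracking server
print(pick_tracking_uri())                         # local directory fallback
```

Either result is intended to be passed to `mlflow.set_tracking_uri(...)`.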

### 3. **Set MLflow Experiment**
   - Define or create the MLflow experiment where all runs will be logged.

### 4. **Define Utility Functions**
   - Write reusable functions, such as:
     - Writing dataset feature names and data types to a YAML file that is later logged to MLflow.
     - Cleaning up files or performing other utility tasks.

### 5. **Import Dataset**
   - Load the dataset into a DataFrame or a similar structure.
   - Confirm that the dataset is ready for analysis (e.g., check data types and null values).
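
One possible way to run the type and null checks is sketched below. The function name `summarize_columns` and the tiny sample frame (with made-up `age` and `income` columns) are assumptions for illustration; they do not come from the project's dataset.

```python
import pandas as pd


def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with its dtype, null count, and null percentage."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_null": df.isna().sum(),
        "pct_null": (df.isna().mean() * 100).round(2),
    })


# Made-up frame standing in for the real dataset
sample = pd.DataFrame({
    "age": [25, None, 41],
    "income": ["50000", "62000", None],
})
summary = summarize_columns(sample)
print(summary)
```

Columns that hold numbers but are stored as text (like `income` here) will not report a numeric dtype in the summary, which is often a hint that cleaning is needed.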

### 6. **Log Original Dataset**
   - Log the original/raw dataset to MLflow for traceability and auditing.
   - Store this dataset as an artifact for future reference.
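
A minimal sketch of this step, assuming the raw data lives in a single file: it stores the file as an artifact and tags the run with a SHA-256 checksum so the exact file can be verified later. The function names, the `raw_data` artifact folder, and the `raw_data_sha256` tag are hypothetical; `mlflow.log_artifact` and `mlflow.set_tag` are the only MLflow calls used.

```python
import hashlib
import tempfile
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def log_raw_dataset(path, artifact_dir="raw_data"):
    """Attach the raw file to the active MLflow run and tag the run with its checksum."""
    # Imported here so the hashing helper above also works without MLflow installed
    import mlflow

    mlflow.log_artifact(str(path), artifact_path=artifact_dir)
    mlflow.set_tag("raw_data_sha256", file_sha256(path))


# Demonstrate the checksum on a throwaway file (placeholder content, not project data)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.csv"
    demo_path.write_bytes(b"id,label\n1,a\n")
    demo_digest = file_sha256(demo_path)
print(demo_digest)
```

Inside a run, calling `log_raw_dataset('./data/train.csv')` would attach the file this notebook loads.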

### 7. **Exploratory Data Analysis (EDA)**
   - Perform an EDA to understand data patterns, distributions, and potential outliers.
   - Use visualizations and descriptive statistics to explore relationships between features.

### 8. **Data Cleaning and Preprocessing**
   - Clean and preprocess the dataset:
     - Handle missing values, categorical encoding, scaling, etc.
     - Prepare the dataset for model training.
   - Keep track of every preprocessing step so that the pipeline is reproducible.
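
One lightweight way to keep such a record is sketched below: each step is appended to an ordered list that can be saved as JSON. The class name `PreprocessingLog`, the output file name, and the example step names are all hypothetical and only illustrate the idea.

```python
import json


class PreprocessingLog:
    """Collects an ordered record of preprocessing steps and their parameters."""

    def __init__(self):
        self.steps = []

    def record(self, name, **params):
        """Append a step with its position in the pipeline and its parameters."""
        self.steps.append({
            "order": len(self.steps) + 1,
            "step": name,
            "params": params,
        })

    def save(self, path="preprocessing_steps.json"):
        """Write the recorded steps to a JSON file and return the path."""
        with open(path, "w") as fh:
            json.dump(self.steps, fh, indent=2)
        return path


# Example steps are illustrative, not the project's actual pipeline
log = PreprocessingLog()
log.record("impute_missing", strategy="median", columns=["age"])
log.record("scale", method="standard")
print(json.dumps(log.steps, indent=2))
```

The saved file could then be attached to a run with `mlflow.log_artifact`, in the same way this notebook attaches `features.yaml`.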

### 9. **Log Transformed Dataset**
   - Log the preprocessed/transformed dataset as a separate artifact in MLflow.
   - This allows comparison between raw and processed datasets.

### 10. **Model Training Pipeline**
   - #### **Import Libraries**
     - Import all necessary libraries for model training, such as machine learning frameworks (e.g., `scikit-learn`, `TensorFlow`).
   
   - #### **Define and Log Hyperparameters**
     - Set model hyperparameters and log them to MLflow for tracking.

   - #### **Train the Model**
     - Train the machine learning model using the preprocessed data.
    
   - #### **Log Metrics**
     - Log the training metrics to MLflow.

   - #### **Log the Model to MLflow**
     - Save the trained model to MLflow, ensuring it is accessible for future use or deployment.
   
   - #### **Log Feature Importance**
     - Compute and log feature importance (if applicable) as a visualization (e.g., `.jpg`).
     - Store it as an artifact in MLflow for model interpretability.
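
The hyperparameter and metric logging steps above could look roughly like the sketch below. It assumes hyperparameters are kept in a nested dictionary, which is flattened into dotted keys before being passed to `mlflow.log_params`. The helper names and the `training` grouping are hypothetical, and the project's real logging lives in `train.py`, which is not shown here. The example values mirror the arguments passed to `main_train` in this notebook.

```python
def flatten_params(params, prefix=""):
    """Flatten a nested dict into a single level with dotted keys."""
    flat = {}
    for key, value in params.items():
        name = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten_params(value, name))
        else:
            flat[name] = value
    return flat


def log_params_and_metrics(params, metrics):
    """Log flattened hyperparameters and a dict of metrics to the active MLflow run."""
    # Imported here so flatten_params can be used without MLflow installed
    import mlflow

    mlflow.log_params(flatten_params(params))
    mlflow.log_metrics(metrics)


flat_example = flatten_params(
    {"training": {"tuning_epochs": 10, "final_epochs": 50, "batch_size": 32}}
)
print(flat_example)
```

Flattening keeps each hyperparameter as its own entry in the MLflow UI, instead of one stringified dictionary.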


In [12]:
import mlflow
import yaml
import os
import pandas as pd

from train import main_train
from clean import clean_n_transform
from preprocess import preprocess_data

### Initialize MLflow

In [13]:
mlflow.set_tracking_uri("http://localhost:6969")

In [None]:
mlflow.set_experiment("credit_score_classification_testing")

In [15]:
# =============== User-Defined-Function ==========================
# Function to log features into features.yaml
def log_features_to_yaml(data, dataset_name='Dataset', file_name='features.yaml', description=''):
    """
    Writes the feature names and data types of a DataFrame to a YAML file
    (features.yaml by default), stored under the key given by dataset_name.

    Parameters:
    - data: The DataFrame containing the features to log.
    - dataset_name: A string specifying the name/type of the dataset (e.g., "Original" or "Processed").
    - file_name: The yaml file name.
    - description: The description of the set of features (optional).
    """
    # Extract feature names and data types
    feature_data_types = data.dtypes.apply(str).to_dict()
    
    # Prepare the feature info for YAML
    feature_info = {
        dataset_name: {
            'dataset_name': dataset_name,
            'features': feature_data_types,
            'num_features': len(data.columns),
            'description': description
        }
    }
    
    # Load existing YAML data if features.yaml already exists
    try:
        with open(file_name, 'r') as file:
            existing_data = yaml.safe_load(file) or {}
    except FileNotFoundError:
        existing_data = {}

    # Merge the new entry into the existing data; an existing entry with the
    # same dataset_name is overwritten
    existing_data.update(feature_info)

    # Write updated data back to YAML file
    with open(file_name, 'w') as file:
        yaml.dump(existing_data, file, default_flow_style=False)

    print(f"Feature information for {dataset_name} saved to {file_name}")

# Function to delete features.yaml from the current directory
def clean_yaml(file_name='features.yaml'):
    """Delete the given YAML file if it exists; otherwise report that it is missing."""
    if os.path.exists(file_name):
        os.remove(file_name)
        print(f"{file_name} has been deleted.")
    else:
        print(f"{file_name} does not exist.")

### Logging Initial Features

In [None]:
# Load dataset
data_path = './data/train.csv'
df = pd.read_csv(data_path)

In [None]:
# Log initial features to yaml
log_features_to_yaml(df, "Initial Features")

### Process and Engineer New Features

In [18]:
df_cleaned, clean_data_path = clean_n_transform(df)

In [None]:
# Log engineered features to yaml
log_features_to_yaml(df_cleaned, "Cleaned Features")

### Preprocess Cleaned Dataset

In [20]:
X_train, X_test, Y_train, Y_test = preprocess_data(df_cleaned)

In [None]:
# Combine and log the preprocessed data used for training
train_df = pd.concat([X_train, Y_train], axis=1)
log_features_to_yaml(train_df, "Training Features")

### Train and Log Model and Parameters through `train.py`

In [None]:
# End any active MLflow runs
if mlflow.active_run() is not None:
    mlflow.end_run()

with mlflow.start_run(run_name="main-run") as main_run:
    # Log features used within the runs
    mlflow.log_artifact("features.yaml")
    
    # Log cleaning, preprocessing, and training code
    mlflow.log_artifact("clean.py")
    mlflow.log_artifact("preprocess.py")
    mlflow.log_artifact("train.py")
    mlflow.log_artifact("main.ipynb")

    main_train(X_train, Y_train, X_test, Y_test, tuning_epochs=10, final_epochs=50, batch_size=32)