# **COMPRESS FILES NOTEBOOK**

## Objectives

* Compress the project's .csv and .pkl files so that they are small enough to push to GitHub and to deploy the app on Render.

## Inputs

* Android_Malware.csv
* Android_Malware_converted.csv
* X_test_scaled.csv
* best_model_fe_results.csv
* best_model_final_results.csv
* model_training_results.csv
* model_tuning_results.csv
* default_best_model.pkl

## Outputs

* Android_Malware.csv.gz
* Android_Malware_converted.csv.gz
* X_test_scaled.csv.gz
* best_model_fe_results.csv.gz
* best_model_final_results.csv.gz
* model_training_results.csv.gz
* model_tuning_results.csv.gz
* default_best_model.pkl.gz

---

# Set Project Root Directory

Centralise the base path in `project_root` so that every file path in this notebook is built from it.

In [None]:
from pathlib import Path

# Resolve the project root
project_root = Path.cwd()
if project_root.name == "jupyter_notebooks":
    project_root = project_root.parent

# Compress CSV Files

In [None]:
import pandas as pd

# List of CSV files to compress
csv_files = [
    project_root / "inputs/datasets/raw/Android_Malware.csv",
    project_root / "outputs/data/Android_Malware_converted.csv",
    project_root / "outputs/data/X_test_scaled.csv",
    project_root / "outputs/evaluation/best_model_fe_results.csv",
    project_root / "outputs/evaluation/best_model_final_results.csv",
    project_root / "outputs/evaluation/model_training_results.csv",
    project_root / "outputs/evaluation/model_tuning_results.csv",
]

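# Compress each file with gzip, writing <name>.csv.gz next to the original.
# Note: the data is re-serialised through pandas, so formatting details
# (for example float representation) can differ from the source file.
for file_path in csv_files:
    if file_path.exists():
        print(f"Compressing: {file_path}")
        df = pd.read_csv(file_path)
        # Append ".gz" to the full file name, e.g. data.csv -> data.csv.gz
        compressed_path = file_path.with_name(file_path.name + ".gz")
        df.to_csv(compressed_path, index=False, compression="gzip")
        print(f"Saved compressed file: {compressed_path}")
    else:
        print(f"File not found: {file_path}")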
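The loop above reads each CSV into a DataFrame and writes it back out, which parses and re-serialises the data. Where an exact byte-for-byte copy of the original file is preferred, the file can instead be streamed straight through gzip. The following is a minimal sketch of that alternative, not the approach used in this notebook; the helper name `gzip_file` and the sample file are hypothetical and only serve the demonstration.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def gzip_file(src: Path) -> Path:
    """Write a gzip copy of `src` next to it and return the new path."""
    dest = src.with_name(src.name + ".gz")
    # Stream the bytes in chunks so large files never sit fully in memory.
    with open(src, "rb") as f_in, gzip.open(dest, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    return dest


# Demonstration on a throwaway file (hypothetical data).
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.csv"
    sample.write_text("a,b\n" + "1,2\n" * 1000)

    compressed = gzip_file(sample)

    # Decompress again and compare with the original bytes.
    with gzip.open(compressed, "rb") as f:
        restored = f.read()
    round_trip_ok = restored == sample.read_bytes()
    is_smaller = compressed.stat().st_size < sample.stat().st_size
```

Files produced this way are ordinary gzip files, so `pd.read_csv` can read them directly; by default pandas infers the compression from the `.gz` extension.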

# Compress PKL Files

In [None]:
import joblib
import compress_pickle as cpickle

# Load model
model_path = project_root / "outputs" / "ml_pipeline" / "default_best_model.pkl"
default_best_model = joblib.load(model_path)

# Path to save compressed model
compressed_model_path = project_root / "outputs" / "ml_pipeline" / "default_best_model.pkl.gz"

# Save with gzip compression
cpickle.dump(default_best_model, compressed_model_path, compression="gzip")

print(f"Saved compressed file: {compressed_model_path}")
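Before relying on the compressed model in the app, it is worth confirming that it loads back correctly. With `compress_pickle`, the counterpart to `dump` is `load`, called with the same path and `compression="gzip"`. The sketch below illustrates the same save-and-restore round trip using only the standard library (`gzip` plus `pickle`), so it runs without extra dependencies. The helper names and the stand-in object are hypothetical; a real check would use the trained model instead.

```python
import gzip
import pickle
import tempfile
from pathlib import Path


def save_gzip_pickle(obj, path: Path) -> None:
    """Pickle `obj` into a gzip-compressed file at `path`."""
    with gzip.open(path, "wb") as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_gzip_pickle(path: Path):
    """Load an object from a gzip-compressed pickle file."""
    with gzip.open(path, "rb") as f:
        return pickle.load(f)


# Stand-in for the trained model (hypothetical placeholder object).
stand_in_model = {"name": "placeholder", "weights": [0.1, 0.2, 0.3]}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "stand_in_model.pkl.gz"
    save_gzip_pickle(stand_in_model, path)
    restored_model = load_gzip_pickle(path)
```

For a real model, comparing predictions from the original and the restored object on a few rows is a more meaningful check than object equality.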

---

# Conclusion & Notes

* The required .csv files were compressed so that they could be pushed to GitHub.
* The required .pkl file was compressed so that it could be pushed to GitHub.
* The [compress_pickle](https://lucianopaz.github.io/compress_pickle/html/) library was used to compress the .pkl file.
* Additionally, [Git Large File Storage](https://git-lfs.com/) was used so that larger files could be pushed to GitHub.
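To decide which files still need Git LFS after compression, their sizes can be checked against GitHub's per-file limit. The sketch below assumes a limit of 100 MiB for regular (non-LFS) files, which is the documented figure at the time of writing and should be verified against current GitHub documentation. The helper `oversized_files` and the sample files are hypothetical.

```python
import tempfile
from pathlib import Path

# Assumed GitHub per-file limit for regular (non-LFS) files: 100 MiB.
GITHUB_LIMIT_BYTES = 100 * 1024 * 1024


def oversized_files(paths, limit=GITHUB_LIMIT_BYTES):
    """Return the existing paths whose size exceeds `limit` bytes."""
    return [p for p in paths if p.exists() and p.stat().st_size > limit]


# Demonstration with tiny throwaway files and a tiny limit (hypothetical data).
with tempfile.TemporaryDirectory() as tmp:
    small = Path(tmp) / "small.csv.gz"
    small.write_bytes(b"x" * 10)
    big = Path(tmp) / "big.csv.gz"
    big.write_bytes(b"x" * 100)

    flagged = oversized_files([small, big], limit=50)
    flagged_names = [p.name for p in flagged]
```

In the notebook, the same function could be called with the list of `.csv.gz` and `.pkl.gz` paths; any file it returns is a candidate for Git LFS tracking.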