# **DATA CLEANING NOTEBOOK**

## Objectives

* Using the results of the previous EDA, clean the data to resolve the issues found and create a consistent, more balanced dataset for model training.

## Inputs

* Android_Malware_converted.csv

## Outputs

* Android_Malware_cleaned.csv


---

# Set Project Root Directory

Centralise the base path using project_root

In [None]:
import os
from pathlib import Path

# Resolve the project root
project_root = Path.cwd()
if project_root.name == "jupyter_notebooks":
    project_root = project_root.parent

# Import Libraries

In this section, all necessary standard libraries are imported.

Import Libraries with necessary Settings

In [None]:
# Standard libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Settings
%matplotlib inline
sns.set(style="whitegrid")

---

# Load Converted Dataset

In this section, the converted dataset is loaded so that the prepared data can be accessed.

In [None]:
# Build the path from project_root so the cell works from either the project root or jupyter_notebooks/
df_converted = pd.read_csv(project_root / "outputs" / "data" / "Android_Malware_converted.csv")

---

# Data Cleaning

In this section, the data is cleaned based on the issues found in the EDA. Additional evaluations of the data are performed to gain deeper insight into how to handle these issues.

## Copy Dataframe

Make a copy to preserve the original converted dataset.

In [None]:
df_clean = df_converted.copy()

## Evaluate Missing Data

* The evaluation of missing data is done via a custom EvaluateMissingData function. This has been adapted from a Code Institute Walkthrough.
* The function was adapted to be more sensitive and to avoid dropping variables that only look "clean" due to rounding.

Use custom function to evaluate missing data

In [None]:
# Include minimum percentage of missing data required for a column to be included in the output
def EvaluateMissingData(df_clean, min_threshold=0.0):

    # Total missing values per column
    missing_data_absolute = df_clean.isnull().sum()
    
    # Unrounded percentage calculation
    missing_data_percentage = (missing_data_absolute / len(df_clean)) * 100

    # Create the summary dataframe
    df_missing_data = (
        pd.DataFrame({
            "RowsWithMissingData": missing_data_absolute,
            "PercentageOfDataset": missing_data_percentage,
            "DataType": df_clean.dtypes
        })
        .sort_values(by="PercentageOfDataset", ascending=False)
        .query(f"PercentageOfDataset > {min_threshold}")
    )

    # Round only for display
    df_missing_data["PercentageOfDataset"] = df_missing_data["PercentageOfDataset"].round(2)

    return df_missing_data

# Display summary dataframe
df_missing_data = EvaluateMissingData(df_clean, min_threshold=0.001)
display(df_missing_data)
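
The point about rounding can be illustrated with a small sketch. It uses a synthetic frame (hypothetical data, not the Android malware dataset) with a very low missing rate, and compares filtering on the rounded percentage with filtering on the raw one. The names `df_demo`, `flagged_rounded` and `flagged_raw` are only for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic frame: 100,000 rows, exactly 3 missing values in column "a"
df_demo = pd.DataFrame({"a": np.ones(100_000), "b": np.ones(100_000)})
df_demo.loc[[0, 1, 2], "a"] = np.nan

# Raw percentage of missing values per column: a = 0.003, b = 0.0
pct = df_demo.isnull().sum() / len(df_demo) * 100

# Filtering after rounding to 2 decimals hides column "a"
rounded = pct.round(2)
flagged_rounded = rounded[rounded > 0].index.tolist()

# Filtering on the raw value keeps it
flagged_raw = pct[pct > 0].index.tolist()

print(flagged_rounded)  # []
print(flagged_raw)      # ['a']
```

Rounding only after the filter, as `EvaluateMissingData` does, keeps such columns visible in the summary.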

## Drop Features

* The previous EDA showed that some features contain only zeros (and therefore have no outliers)
* These features can be dropped
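
As a sanity check before dropping, one could confirm that a column really is all-zero. The following is a minimal sketch with a hypothetical helper `find_all_zero_columns` and a toy frame; the `Flow Duration` and `Label` columns and their values are assumptions made for the example.

```python
import pandas as pd

def find_all_zero_columns(df: pd.DataFrame) -> list:
    """Return the names of numeric columns in which every value equals 0."""
    numeric = df.select_dtypes(include="number")
    # A column containing NaN is not reported, because NaN == 0 is False
    return [col for col in numeric.columns if (numeric[col] == 0).all()]

# Toy data (hypothetical values)
df_demo = pd.DataFrame({
    "Fwd Avg Bytes/Bulk": [0, 0, 0],          # constant zero
    "Flow Duration": [10, 0, 25],             # contains non-zero values
    "Label": ["Benign", "Adware", "Benign"],  # non-numeric, ignored
})

print(find_all_zero_columns(df_demo))  # ['Fwd Avg Bytes/Bulk']
```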

Drop selected features

In [None]:
from feature_engine.selection import DropFeatures

features_to_drop = [
    'Fwd Avg Bytes/Bulk', 'Fwd Avg Packets/Bulk', 'Fwd Avg Bulk Rate',
    'Bwd Avg Bytes/Bulk', 'Bwd Avg Packets/Bulk', 'Bwd Avg Bulk Rate'
]

dropper = DropFeatures(features_to_drop=features_to_drop)
dropper.fit(df_clean)
df_clean = dropper.transform(df_clean)

Rerun EvaluateMissingData function to check improvement

In [None]:
EvaluateMissingData(df_clean)

## Handle Missing Values

* From previous EDA and evaluating missing data, the amount and type of missing data is known.
* Because the affected features are numerical and continuous, median imputation is a suitable way to handle the missing data.
* The median is also robust to outliers and skew
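
The robustness argument can be seen on a tiny example. This sketch uses plain pandas `fillna` rather than the feature_engine imputer, and the numbers are made up for illustration.

```python
import numpy as np
import pandas as pd

# Four ordinary values, one extreme outlier, one missing entry
s = pd.Series([1.0, 2.0, 3.0, 4.0, 1000.0, np.nan])

mean_filled = s.fillna(s.mean())      # mean of the observed values is 202.0
median_filled = s.fillna(s.median())  # median of the observed values is 3.0

print(mean_filled.iloc[-1])    # 202.0
print(median_filled.iloc[-1])  # 3.0
```

The mean is pulled towards the single outlier, while the median stays with the bulk of the data.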

Impute missing values

In [None]:
from feature_engine.imputation import MeanMedianImputer

# Numeric variables that contain missing values
numeric_vars = df_clean.select_dtypes(include=['float64', 'int64']).columns
numeric_with_na = [col for col in numeric_vars if df_clean[col].isna().sum() > 0]

# Impute only when at least one numeric column has missing values
if numeric_with_na:
    median_imputer = MeanMedianImputer(imputation_method='median', variables=numeric_with_na)
    median_imputer.fit(df_clean)
    df_clean = median_imputer.transform(df_clean)

## Handle Skewed Features

* In the previous EDA, many features with very high skewness were found.

Use np.log1p() for skewed positive variables and store transformed column names for reference

In [None]:
import numpy as np
from scipy.stats import skew

# Find highly skewed numeric features
skewed_features = df_clean[numeric_vars].apply(lambda x: skew(x.dropna())).sort_values(ascending=False)
high_skew = skewed_features[abs(skewed_features) > 2].index.tolist()

# Apply log1p transformation; negative values are clipped to 0 first,
# because log1p is undefined for x < -1 and returns -inf at x = -1
def safe_log1p(x):
    return np.log1p(x.clip(lower=0))

df_clean[high_skew] = df_clean[high_skew].apply(safe_log1p)
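
The effect of the transformation can be checked on synthetic data. This sketch draws a log-normal sample (an assumption chosen only because it is strongly right-skewed) and compares skewness before and after `np.log1p`. It uses pandas' `Series.skew()`, which applies a bias correction and so differs slightly from `scipy.stats.skew` with default settings.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Strongly right-skewed synthetic sample (hypothetical data)
s = pd.Series(rng.lognormal(mean=0.0, sigma=1.5, size=10_000))

skew_before = s.skew()
skew_after = np.log1p(s.clip(lower=0)).skew()

print(f"skew before: {skew_before:.2f}, after log1p: {skew_after:.2f}")
```

Rerunning the skewness calculation on `df_clean[high_skew]` after the transformation gives the same kind of before/after comparison for the real data.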

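* With negative values clipped to 0, log1p itself does not produce infinite values, but any infinite values already present in the data pass through unchanged. These need to be replaced with NaN and imputed.

Replace infinite values with NaN and impute again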

In [None]:
# Replace any remaining infinite values with NaN
df_clean.replace([np.inf, -np.inf], np.nan, inplace=True)

# Impute again, restricted to numeric columns (median imputation does not apply to other dtypes)
numeric_with_na = [col for col in numeric_vars if df_clean[col].isna().sum() > 0]
if numeric_with_na:
    median_imputer = MeanMedianImputer(imputation_method='median', variables=numeric_with_na)
    median_imputer.fit(df_clean)
    df_clean = median_imputer.transform(df_clean)
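
The following sketch walks through the same sequence on a single toy column: transform, replace infinities with NaN, then fill with the median. The column name `rate` and its values are hypothetical, and pandas `fillna` stands in for the feature_engine imputer.

```python
import numpy as np
import pandas as pd

# Toy column containing one infinite value (hypothetical data)
df_demo = pd.DataFrame({"rate": [0.0, 9.0, 99.0, np.inf]})

# log1p leaves the infinite entry infinite
transformed = np.log1p(df_demo["rate"].clip(lower=0))
n_inf = int(np.isinf(transformed).sum())

# Replace infinities with NaN, then fill with the median of the finite values
cleaned = transformed.replace([np.inf, -np.inf], np.nan)
cleaned = cleaned.fillna(cleaned.median())

print(n_inf, bool(np.isfinite(cleaned).all()))  # 1 True
```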

## Final Check

Run final check for uncleaned data with custom EvaluateMissingData function

In [None]:
# Display the summary explicitly; a notebook cell only auto-displays its last expression
display(EvaluateMissingData(df_clean))

# Check for remaining skew, outlier ranges
df_clean.describe().T

---

# Save Files

Save cleaned dataframe as csv file for better access

In [None]:
# Save cleaned dataframe as csv file for easier access of cleaned data
data_path = project_root / "outputs" / "data"
os.makedirs(data_path, exist_ok=True)
df_clean.to_csv(data_path / "Android_Malware_cleaned.csv", index=False)

print("✅ Saved cleaned dataframe to outputs/data/")
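
A quick way to gain confidence in the saved file is to read it back and compare it with the dataframe in memory. The sketch below does this for a toy frame in a temporary directory; the file name `demo_cleaned.csv` and the data are placeholders, not the project's actual output.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame standing in for the cleaned data (hypothetical values)
df_demo = pd.DataFrame({"x": [1.5, 2.5], "y": [0, 1]})

with tempfile.TemporaryDirectory() as tmp:
    out_file = Path(tmp) / "demo_cleaned.csv"
    df_demo.to_csv(out_file, index=False)
    df_back = pd.read_csv(out_file)

# The reloaded frame should have the same shape and no missing values
same_shape = df_back.shape == df_demo.shape
no_missing = not df_back.isnull().values.any()

print(same_shape, no_missing)  # True True
```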

---

# Conclusion and Next Steps

* The data is now cleaned and ready to be used in feature engineering for training the model.