## Computational Feasibility Assessment  

In this section, I evaluate the dataset in terms of size, feature types, and overall feasibility for computational processing. Key aspects include:  
- **Dataset size**: Number of rows and columns  
- **Feature types**: Categorical vs. numerical features  
- **Memory usage**: Estimated RAM required to process the data  
- **Missing values**: Percentage of missing data  
- **Duplicate records**: Checking redundancy  

In [1]:
import pandas as pd

In [7]:
# Load the train, test, and validation datasets
base_path = "../data/raw"
train_df = pd.read_csv(f"{base_path}/hiv_train.csv")
test_df = pd.read_csv(f"{base_path}/hiv_test.csv")
val_df = pd.read_csv(f"{base_path}/hiv_valid.csv")

# Store datasets in a dictionary for easier processing
datasets = {"Train": train_df, "Test": test_df, "Validation": val_df}

In [10]:
def assess_dataset(name:str, df:pd.DataFrame) -> None:
    print(f"\n===== {name} Dataset =====")
    print(f"Shape: {df.shape[0]:,} rows × {df.shape[1]:,} columns")

    # Memory usage in MB
    memory_usage = df.memory_usage(deep=True).sum() / (1024**2)
    print(f"Memory Usage: {memory_usage:.2f} MB")

    # Count categorical and numerical features
    # ("number" matches every integer and float width, regardless of platform)
    categorical_cols = df.select_dtypes(include=["object", "category"]).columns
    numerical_cols = df.select_dtypes(include="number").columns
    print(f"Categorical Features: {len(categorical_cols)} | Numerical Features: {len(numerical_cols)}")

    # Missing values
    missing_percentage = df.isnull().sum().sum() / (df.shape[0] * df.shape[1]) * 100
    print(f"Missing Data: {missing_percentage:.2f}%")

    # Duplicates
    duplicates = df.duplicated().sum()
    print(f"Duplicate Records: {duplicates}")

    print("-" * 40)

In [9]:
# Run assessment for all datasets
for name, df in datasets.items():
    assess_dataset(name, df)


===== Train Dataset =====
Shape: 28,789 rows × 3 columns
Memory Usage: 4.86 MB
Categorical Features: 2 | Numerical Features: 1
Missing Data: 0.00%
Duplicate Records: 0
----------------------------------------

===== Test Dataset =====
Shape: 8,225 rows × 3 columns
Memory Usage: 1.39 MB
Categorical Features: 2 | Numerical Features: 1
Missing Data: 0.00%
Duplicate Records: 0
----------------------------------------

===== Validation Dataset =====
Shape: 4,113 rows × 3 columns
Memory Usage: 0.69 MB
Categorical Features: 2 | Numerical Features: 1
Missing Data: 0.00%
Duplicate Records: 0
----------------------------------------


## Summary

After evaluating the dataset's size, feature types, memory usage, missing values, and duplicate records to determine its computational feasibility, I observed the following:

- The dataset is small and well-structured, making local processing efficient.
- Size: Train (28,789), Test (8,225), Validation (4,113) rows, each with 3 columns.
- Feature Types: 2 categorical, 1 numerical.
- Memory Usage: Minimal (~6.94 MB in total), ensuring low computational overhead.
- Missing Data: 0.00%, no need for imputation.
- Duplicates: None within any of the three splits.
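
The duplicate check in `assess_dataset` calls `df.duplicated()` on one DataFrame at a time, so it does not compare records across splits. A minimal sketch of a cross-split check follows. It assumes the molecule column is named `smiles`; the column names are not shown in this notebook, so that name is a guess. The helper `cross_split_overlap` and the toy frames are illustrative and not part of the original analysis.

```python
from itertools import combinations

import pandas as pd


def cross_split_overlap(splits: dict, key: str) -> dict:
    """Count distinct `key` values shared by each pair of splits."""
    overlaps = {}
    for (name_a, df_a), (name_b, df_b) in combinations(splits.items(), 2):
        shared = set(df_a[key]) & set(df_b[key])
        overlaps[(name_a, name_b)] = len(shared)
    return overlaps


# Toy frames standing in for the real splits; "smiles" is an assumed column name.
toy_splits = {
    "Train": pd.DataFrame({"smiles": ["CCO", "CCN", "c1ccccc1"]}),
    "Test": pd.DataFrame({"smiles": ["CCO", "CCC"]}),
    "Validation": pd.DataFrame({"smiles": ["CCCl"]}),
}

overlap_counts = cross_split_overlap(toy_splits, key="smiles")
for pair, count in overlap_counts.items():
    print(f"{pair[0]} vs {pair[1]}: {count} shared records")
```

On the real data this could be called as `cross_split_overlap(datasets, key="smiles")`, again assuming that column name. A count of zero for every pair would support the claim that the splits do not leak records into one another.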

Given the small size and low memory footprint, processing on a local machine is feasible without performance concerns. The featurization stage, however, requires loading and running another ML model to create descriptors for each SMILES record in the dataset, which may add computational overhead. A script optimized to preprocess the data with that model will help keep this cost manageable.
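
One way to keep the featurization cost predictable is to feed the SMILES strings to the model in fixed-size batches, so the model is loaded once and memory use stays bounded. The sketch below shows only the batching logic. The function `featurize_in_batches`, the `batch_size` default, and `dummy_featurizer` are hypothetical; the dummy stands in for the actual descriptor model, which is not specified in this notebook.

```python
from typing import Callable, Iterator, List, Sequence

import pandas as pd


def iter_batches(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def featurize_in_batches(
    smiles: Sequence[str],
    featurize_batch: Callable[[List[str]], List[List[float]]],
    batch_size: int = 256,
) -> pd.DataFrame:
    """Apply `featurize_batch` to each batch and stack the descriptor rows."""
    rows: List[List[float]] = []
    for batch in iter_batches(smiles, batch_size):
        rows.extend(featurize_batch(batch))
    return pd.DataFrame(rows)


def dummy_featurizer(batch: List[str]) -> List[List[float]]:
    """Placeholder for the real model: string length and count of 'C'."""
    return [[float(len(s)), float(s.count("C"))] for s in batch]


features = featurize_in_batches(
    ["CCO", "CCN", "c1ccccc1"], dummy_featurizer, batch_size=2
)
print(features.shape)
```

In practice the placeholder would be replaced by a call into the descriptor model, and the resulting DataFrame could be written to disk once so that later stages do not need to rerun the model.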