NanoToxicity Prediction Dataset Exploration

Life Cycle of Machine Learning Project

- Understanding the Problem Statement
- Data Collection
- Data Checks to perform
- Exploratory data analysis
- Data Pre-Processing
- Model Training
- Choose best model

1.- Problem Statement
- This project aims to predict the toxicity of various nanomaterials using machine learning techniques. Accurate prediction of nanotoxicity is crucial for ensuring the safety of nanomaterials in various applications, including medicine, electronics, and environmental science.
- The goal is to provide a reliable in silico tool to assess nanomaterial safety, reducing the reliance on costly and time-consuming in vitro or in vivo experiments.

2.- Data Collection
- The dataset used in this project combines two sources:
- NanoPharos database: https://db.nanopharos.eu/ (part of the S2NANO project database: https://www.s2nano.org/)
- NanoTox/ToxicityModel: https://doi.org/10.5281/zenodo.4055281 (part of the S2NANO project database: https://www.s2nano.org/)


2.1 Import Data and Libraries

In [1]:
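import dataclasses
import os
import sys
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt  # pyplot is the plotting interface; plain "matplotlib" has no plot functions
import seaborn as sns

warnings.filterwarnings('ignore')  # hide library warnings in the notebook output
%matplotlib inline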

2.2 Dataset Information
- Material type : metal oxide (MeOx) composition of the nanoparticle (e.g., TiO2)
- Core size (nm) : size of the nanoparticle core in nanometers
- Method core size : method used to determine the core size (e.g., TEM, XRD)
- Hydro size (nm) : hydrodynamic size of the nanoparticle in nanometers
- Method hydro size : method used to determine the hydrodynamic size (e.g., DLS)
- Surface charge (mV) : surface charge of the nanoparticle in millivolts
- Method surface charge : method used to determine the surface charge (e.g., Zeta Potential)
- Surface area (m2/g) : surface area of the nanoparticle in square meters per gram
- Method surface area : method used to determine the surface area (e.g., BET)
- _Hsf (eV) : Standard Heat of Formation. The change in enthalpy during the formation of the metal oxide from its elements. It indicates the thermodynamic stability of the crystal structure (more negative values imply higher stability).
- Ec (eV) : Conduction Band Energy. The lowest energy level of the conduction band. It determines the reduction potential of the nanomaterial and its ability to transfer electrons to biological molecules (critical for ROS generation).
- Ev (eV) : Valence Band Energy. The highest energy level of the valence band. It dictates the oxidation potential and the hole ($h^+$) availability for oxidative reactions.
- _MeO (eV) : Metal-Oxygen Bond Enthalpy. The bond dissociation energy between the metal cation and the oxygen anion. It reflects the lattice strength and predicts the likelihood of metal ion release (dissolution) into the cellular environment.
- Assay : Cytotoxicity Assay Type. The specific biochemical technique used to measure cell viability (e.g., MTT, WST-1, LDH, Alamar Blue). Different assays measure different metabolic endpoints.
- Cell name : Cell Line Identity. The specific immortalized cell line or primary culture used in the experiment (e.g., A549, HeLa, HepG2).
- Cell species : Organism of Origin. The biological species from which the cells were derived (e.g., Human, Mouse, Hamster). Crucial for assessing inter-species sensitivity.
- Exposure time : Exposure Duration. The length of time the cells were in contact with the nanomaterials, measured in hours.
- Exposure dose (ug/mL) : Concentration. The amount of nanomaterial applied to the cell culture per unit of volume, typically measured in micrograms per milliliter.
- Viability (%) : Cell Viability. The percentage of living cells remaining after exposure compared to the unexposed control group. This is the target variable for the regression model.
- Toxicity : Toxicity Classification. A categorical label indicating whether the nanomaterial is considered 'Toxic' or 'Nontoxic' based on a predefined viability threshold (e.g., <70% viability = Toxic).
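
The relationship between the viability column and the toxicity label can be illustrated with a minimal sketch. It assumes the 70% cutoff given as an example above; the helper name `label_toxicity`, the constant `VIABILITY_THRESHOLD`, and the toy values are hypothetical and are not taken from the dataset, whose real labelling rule may differ.

```python
import pandas as pd

# Assumed cutoff, copied from the "<70% viability = Toxic" example in the text.
VIABILITY_THRESHOLD = 70.0


def label_toxicity(viability: pd.Series, threshold: float = VIABILITY_THRESHOLD) -> pd.Series:
    """Return 'Toxic' where viability is strictly below the threshold, else 'Nontoxic'."""
    labels = viability.lt(threshold).map({True: "Toxic", False: "Nontoxic"})
    return labels.rename("Toxicity")


# Toy values for illustration only.
toy = pd.DataFrame({"Viability (%)": [100.0, 69.9, 70.0, 12.5]})
toy["Toxicity"] = label_toxicity(toy["Viability (%)"])
print(toy)
```

Because the label is a direct function of viability in this sketch, keeping both columns as model inputs would leak the target.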

3.- Data Checks to perform
- Check missing values
- Check duplicates
- Check data types
- Check the number of unique values in each column
- Check statistics of the dataset
- Check the categories present in each categorical column
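
The checks listed above could be gathered into a single helper, as in the sketch below. The function name `run_basic_checks` and the toy frame are hypothetical; only standard pandas calls are used.

```python
import pandas as pd


def run_basic_checks(df: pd.DataFrame) -> dict:
    """Collect the basic data checks into one dictionary."""
    categorical = df.select_dtypes(exclude="number").columns
    return {
        "missing_per_column": df.isna().sum(),
        "duplicate_rows": int(df.duplicated().sum()),
        "dtypes": df.dtypes,
        "unique_per_column": df.nunique(),
        "statistics": df.describe(),
        # Distinct non-missing values of every non-numeric column.
        "categories": {col: df[col].dropna().unique().tolist() for col in categorical},
    }


# Toy frame for illustration only; rows 0 and 2 are identical on purpose.
toy = pd.DataFrame({
    "Material type": ["TiO2", "ZnO", "TiO2", "TiO2"],
    "Assay": ["MTT", None, "MTT", "MTT"],
    "Viability (%)": [100.0, 55.0, 100.0, 80.0],
})
report = run_basic_checks(toy)
print(report["missing_per_column"])
print("duplicate rows:", report["duplicate_rows"])
```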

In [7]:
data = pd.read_csv('artifacts/20260202020758/data_ingestion/feature_store/MeOx_data.csv')
data.isna().sum()




There are missing values in the method columns and in the Assay column.

Check Duplicates

In [17]:
data.duplicated().sum()

np.int64(6)

In [24]:
# Show the duplicated rows so they can be reviewed
display(data[data.duplicated()])


Unnamed: 0,Material type,Core size (nm),Method core size,Hydro size (nm),Method hydro size,Surface charge (mV),Method surface charge,Surface area (m2/g),Method surface area,_Hsf (eV),Ec (eV),Ev (eV),_MeO (eV),Assay,Cell name,Cell species,Exposure time,Exposure dose (ug/mL),Viability (%),Toxicity
518,TiO2,25.0,,504.5,,-10.7,,210.0,,-9.779,-4.16,-7.49,5.77,,SHSY5Y,Cancer,24,1e-05,100.0,Nontoxic
526,TiO2,25.0,,504.5,,-10.7,,210.0,,-9.779,-4.16,-7.49,5.77,,SHSY5Y,Cancer,6,1e-05,100.0,Nontoxic
534,TiO2,25.0,,504.5,,-10.7,,210.0,,-9.779,-4.16,-7.49,5.77,,SHSY5Y,Cancer,3,1e-05,100.0,Nontoxic
566,TiO2,25.0,,228.3,,-10.7,,40.0,,-9.779,-4.16,-7.49,5.77,,SHSY5Y,Cancer,24,1e-05,100.0,Nontoxic
574,TiO2,25.0,,228.3,,-10.7,,40.0,,-9.779,-4.16,-7.49,5.77,,SHSY5Y,Cancer,6,1e-05,100.0,Nontoxic
582,TiO2,25.0,,228.3,,-10.7,,40.0,,-9.779,-4.16,-7.49,5.77,,SHSY5Y,Cancer,3,1e-05,100.0,Nontoxic
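
A minimal sketch of how duplicated rows might be inspected and removed follows. Whether these rows should be dropped at all is a judgement call (they could be genuine replicate measurements), so the step is shown on a toy frame with hypothetical values.

```python
import pandas as pd

# Toy frame for illustration only; the first two rows are identical.
toy = pd.DataFrame({
    "Material type": ["TiO2", "TiO2", "ZnO"],
    "Exposure time": ["24", "24", "6"],
    "Viability (%)": [100.0, 100.0, 42.0],
})

# keep=False marks every copy, so each original appears next to its repeats.
all_copies = toy[toy.duplicated(keep=False)]

# Keep the first occurrence of each repeated row and renumber the index.
deduplicated = toy.drop_duplicates(keep="first").reset_index(drop=True)

print(all_copies)
print(len(toy), "->", len(deduplicated), "rows")
```

Note that `data.duplicated()` with its default `keep='first'` lists only the later copies, which is why the table above shows the repeats without the rows they duplicate.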


3.1 Check Data Types

In [27]:
data.info()

<class 'pandas.DataFrame'>
RangeIndex: 976 entries, 0 to 975
Data columns (total 20 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Material type          976 non-null    str    
 1   Core size (nm)         976 non-null    float64
 2   Method core size       494 non-null    str    
 3   Hydro size (nm)        976 non-null    float64
 4   Method hydro size      494 non-null    str    
 5   Surface charge (mV)    976 non-null    float64
 6   Method surface charge  494 non-null    str    
 7   Surface area (m2/g)    976 non-null    float64
 8   Method surface area    494 non-null    str    
 9   _Hsf (eV)              976 non-null    float64
 10  Ec (eV)                976 non-null    float64
 11  Ev (eV)                976 non-null    float64
 12  _MeO (eV)              976 non-null    float64
 13  Assay                  494 non-null    str    
 14  Cell name              976 non-null    str    
 15  Cell species     

3.2 Check Unique Values in Each Column

In [28]:
data.nunique()

Material type             15
Core size (nm)            35
Method core size           1
Hydro size (nm)           57
Method hydro size          1
Surface charge (mV)       37
Method surface charge      1
Surface area (m2/g)       36
Method surface area        2
_Hsf (eV)                 15
Ec (eV)                   15
Ev (eV)                   14
_MeO (eV)                 13
Assay                      2
Cell name                 13
Cell species               4
Exposure time              7
Exposure dose (ug/mL)     30
Viability (%)            556
Toxicity                   3
dtype: int64
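
Columns with a single distinct value, such as the method columns in the output above, carry no information for a model. The sketch below shows one way to list them; the helper name `low_information_columns` and the toy values are hypothetical.

```python
import pandas as pd


def low_information_columns(df: pd.DataFrame, max_unique: int = 1) -> list:
    """Return columns whose number of distinct non-missing values is <= max_unique."""
    counts = df.nunique()  # missing values are not counted as a distinct value
    return counts[counts <= max_unique].index.tolist()


# Toy frame for illustration only.
toy = pd.DataFrame({
    "Method core size": ["TEM", None, "TEM"],
    "Material type": ["TiO2", "ZnO", "CuO"],
    "Core size (nm)": [25.0, 25.0, 40.0],
})
constant_cols = low_information_columns(toy)
print(constant_cols)
```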

3.3 Check Statistics of Dataset

In [29]:
data.describe()

Unnamed: 0,Core size (nm),Hydro size (nm),Surface charge (mV),Surface area (m2/g),_Hsf (eV),Ec (eV),Ev (eV),_MeO (eV),Exposure dose (ug/mL),Viability (%)
count,976.0,976.0,976.0,976.0,976.0,976.0,976.0,976.0,976.0,976.0
mean,39.071824,367.288422,4.397234,76.526148,-7.926569,-3.721957,-7.695266,5.721998,38.924353,81.187055
std,28.6749,318.137239,28.380784,122.212704,5.144564,0.891319,0.977241,0.15844,54.714503,26.875556
min,7.5,46.4,-46.1,7.0,-18.82,-5.17,-11.12,5.38,1e-05,-3.8739
25%,18.3,208.3,-11.7,21.8,-9.779,-4.16,-8.1,5.67,1.6,80.0
50%,28.4,267.0,0.0,40.0,-8.512,-3.89,-7.45,5.67,12.5,89.0
75%,51.5,313.8,27.6,74.2,-3.608,-3.615,-7.2,5.77,50.0,96.2409
max,125.0,1843.0,61.9,640.0,-1.17,-1.51,-6.51,6.19,300.0,151.1111
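
The statistics above show viability values below 0 (minimum -3.8739) and above 100 (maximum 151.1111). A minimal sketch for flagging such rows follows; the 0 to 100 bounds and the helper name `flag_out_of_range` are assumptions, and whether flagged rows should be clipped, dropped, or kept is left open here.

```python
import pandas as pd


def flag_out_of_range(values: pd.Series, lower: float = 0.0, upper: float = 100.0) -> pd.Series:
    """Return True where a value lies outside the closed interval [lower, upper]."""
    # Series.between is inclusive on both ends by default.
    return ~values.between(lower, upper)


# Toy frame reusing the extreme values reported by describe().
toy = pd.DataFrame({"Viability (%)": [-3.8739, 89.0, 151.1111, 100.0]})
toy["out_of_range"] = flag_out_of_range(toy["Viability (%)"])
print(toy)
```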


4.- Data Visualization


- Material type: str
- Core size (nm): float64
- Method core size: str
- Hydro size (nm): float64
- Method hydro size: str
- Surface charge (mV): float64
- Method surface charge: str
- Surface area (m2/g): float64
- Method surface area: str
- _Hsf (eV): float64
- Ec (eV): float64
- Ev (eV): float64
- _MeO (eV): float64
- Assay: str
- Cell name: str
- Cell species: str
- Exposure time: str
- Exposure dose (ug/mL): float64
- Viability (%): float64
- Toxicity: str

In [30]:
data['Toxicity'].nunique()

We have 10 numerical features : ['Core size (nm)', 'Hydro size (nm)', 'Surface charge (mV)', 'Surface area (m2/g)', '_Hsf (eV)', 'Ec (eV)', 'Ev (eV)', '_MeO (eV)', 'Exposure dose (ug/mL)', 'Viability (%)']

We have 10 categorical features : ['Material type', 'Method core size', 'Method hydro size', 'Method surface charge', 'Method surface area', 'Assay', 'Cell name', 'Cell species', 'Exposure time', 'Toxicity']
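
One way to separate numerical from categorical columns is by dtype, using `select_dtypes`. The sketch below is an assumption about how such a split could be written, not the notebook's original cell, and it runs on a toy frame with hypothetical values.

```python
import pandas as pd

# Toy frame for illustration only; "Exposure time" is stored as text, as in the dtype list above.
toy = pd.DataFrame({
    "Material type": ["TiO2", "ZnO"],
    "Core size (nm)": [25.0, 40.0],
    "Exposure time": ["24", "6"],
    "Viability (%)": [100.0, 42.0],
})

# Numeric dtypes go to one list, everything else to the other.
numeric_features = toy.select_dtypes(include="number").columns.tolist()
categorical_features = toy.select_dtypes(exclude="number").columns.tolist()

print(f"We have {len(numeric_features)} numerical features : {numeric_features}")
print(f"We have {len(categorical_features)} categorical features : {categorical_features}")
```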


5.- Data Cleaning and Preprocessing

Here we handle missing values, drop columns that are not useful for modelling, encode categorical variables, and scale numerical features as needed.
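
The steps described above can be sketched with scikit-learn's `ColumnTransformer`. This is an illustration under stated assumptions, not the project's actual pipeline: the median imputation, the choice of `StandardScaler`, the dropped columns, and the toy values are all hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame for illustration only.
toy = pd.DataFrame({
    "Material type": ["TiO2", "ZnO", "CuO", "TiO2"],
    "Method core size": ["TEM", None, "TEM", None],
    "Core size (nm)": [25.0, 40.0, None, 25.0],
    "Exposure dose (ug/mL)": [1.6, 12.5, 50.0, 300.0],
})

# Drop descriptive columns; errors="ignore" skips names that are absent.
features = toy.drop(columns=["Method core size", "ERM ID"], errors="ignore")

numeric_cols = features.select_dtypes(include="number").columns.tolist()
categorical_cols = features.select_dtypes(exclude="number").columns.tolist()

preprocessor = ColumnTransformer(
    transformers=[
        # Numeric columns: fill gaps with the median, then standardise.
        ("num", Pipeline([
            ("impute", SimpleImputer(strategy="median")),
            ("scale", StandardScaler()),
        ]), numeric_cols),
        # Categorical columns: one-hot encode, ignoring unseen categories later.
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

X = preprocessor.fit_transform(features)
print(X.shape)
```

Fitting the transformer on training data only and reusing it on test data keeps the imputation and scaling statistics from leaking across the split.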

In [None]:
import pandas as pd
import unicodedata
from sklearn.preprocessing import OneHotEncoder


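# Descriptive/metadata columns to drop before modelling.
DESCRIPTIVE_COLS_TO_DROP = [
    "ERM ID", "Method core size", "Method hydro size", "Method surface charge",
    "Method surface area", "Assay", "Cell name", "Cell species",
    "Cell origin", "Cell type"
]

# Columns treated as a proxy for the target label, also dropped.
PROXY_COLS_TO_DROP = ["Viability (%)"]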