## **Merging the Datasets**

The goal of this code is to unify two distinct datasets for analysis in the project:

- [**``2017_MeOx_IIIB_HYU.xlsx``**](https://zenodo.org/records/15300193)
- [**``HA3B.csv``**](https://zenodo.org/records/15385143)

To do this, the columns of the first dataset had to be renamed to match the column names of the second. **``pd.concat()``** was then used to stack the two datasets with ``join='inner'``, so that only columns whose names appear in both datasets are kept in the final one. Any column present in only one of them is dropped.
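The effect of ``join='inner'`` can be seen on a small example. This is a minimal sketch, not the notebook's own code: the frames `left` and `right`, the shortened `rename_map`, and all numeric values are hypothetical, and only the column names echo the real datasets. It lists the columns that exist in just one frame before concatenating, which is a quick way to see what the inner join will discard.

```python
import pandas as pd

# Toy frames with made-up values; only the column names mirror the real data.
left = pd.DataFrame({
    "Material type": ["Al2O3", "ZnO"],
    "Core size (nm)": [39.7, 20.0],
    "Viability (%)": [98.0, 45.0],   # no counterpart in `right`
})
right = pd.DataFrame({
    "Material_type": ["TiO2"],
    "Core_size": [25.0],
    "Toxicity": ["Nontoxic"],        # no counterpart in `left`
})

# Hypothetical, shortened version of the rename map.
rename_map = {
    "Material type": "Material_type",
    "Core size (nm)": "Core_size",
    "Viability (%)": "Viability_percent",
}
left_renamed = left.rename(columns=rename_map)

# Columns present in only one frame: these are what the inner join drops.
dropped = sorted(set(left_renamed.columns) ^ set(right.columns))

# Stack the rows, keeping only the columns shared by both frames.
combined = pd.concat([left_renamed, right], ignore_index=True, join="inner")

print("Dropped columns:", dropped)
print("Kept columns:", list(combined.columns))
print("Rows:", len(combined))
```

Running the same symmetric-difference check on `df1_renomeado` and `df2` before the real concatenation would show which columns are missing from the combined file.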

In [10]:
import pandas as pd

df1 = pd.read_excel("datasets/2017_MeOx_IIIB_HYU.xlsx")
df2 = pd.read_csv("datasets/HA3B.csv")

output_file = "datasets/dataset_nanotoxicologia_combinado.csv"

# Map the column names of the first dataset onto the names used in the second.
column_rename_map = {
    'Material type': 'Material_type',
    'Core size (nm)': 'Core_size',
    'Hydro size (nm)': 'Hydro_size',
    'Surface charge (mV)': 'Surface_charge',
    'Surface area (m2/g)': 'Surface_area',
    'ΔHsf (eV)': 'Formation_enthalpy',
    'Ec (eV)': 'Conduction_band',
    'Ev (eV)': 'Valence_band',
    'χMeO (eV)': 'Electronegativity',
    'Assay': 'Assay',
    'Cell name': 'Cell_name',
    'Cell species': 'Cell_species',
    'Cell origin': 'Cell_origin',
    'Cell type': 'Cell_type',
    'Exposure time': 'Exposure_time',
    'Mass dose (ug/mL)': 'Exposure_dose',
    'Viability (%)': 'Viability_percent',
    'Toxicity': 'Toxicity',
}

df1_renomeado = df1.rename(columns=column_rename_map)

# Stack the rows of both datasets; join='inner' keeps only the columns they share.
df_combinado = pd.concat([df1_renomeado, df2], ignore_index=True, sort=False, join='inner')

df_combinado.to_csv(output_file, index=False)

display(df_combinado.head())
# info() prints its summary itself and returns None, so it is not wrapped in print().
df_combinado.info()

|   | Material_type | Core_size | Hydro_size | Surface_charge | Surface_area | Formation_enthalpy | Conduction_band | Valence_band | Electronegativity | Assay | Cell_name | Cell_species | Cell_origin | Cell_type | Exposure_time | Exposure_dose | Toxicity |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Al2O3 | 39.7 | 267.0 | 36.3 | 64.7 | -17.345 | -1.51 | -9.81 | 5.67 | MTT | HCMEC | Human | Blood | Normal | 24 | 0.001 | Nontoxic |
| 1 | Al2O3 | 39.7 | 267.0 | 36.3 | 64.7 | -17.345 | -1.51 | -9.81 | 5.67 | MTT | HCMEC | Human | Blood | Normal | 24 | 0.01 | Nontoxic |
| 2 | Al2O3 | 39.7 | 267.0 | 36.3 | 64.7 | -17.345 | -1.51 | -9.81 | 5.67 | MTT | HCMEC | Human | Blood | Normal | 24 | 0.1 | Nontoxic |
| 3 | Al2O3 | 39.7 | 267.0 | 36.3 | 64.7 | -17.345 | -1.51 | -9.81 | 5.67 | MTT | HCMEC | Human | Blood | Normal | 24 | 1.0 | Nontoxic |
| 4 | Al2O3 | 39.7 | 267.0 | 36.3 | 64.7 | -17.345 | -1.51 | -9.81 | 5.67 | MTT | HCMEC | Human | Blood | Normal | 24 | 5.0 | Nontoxic |


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1332 entries, 0 to 1331
Data columns (total 17 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Material_type       1332 non-null   object 
 1   Core_size           1332 non-null   float64
 2   Hydro_size          1332 non-null   float64
 3   Surface_charge      1332 non-null   float64
 4   Surface_area        1332 non-null   float64
 5   Formation_enthalpy  1332 non-null   float64
 6   Conduction_band     1332 non-null   float64
 7   Valence_band        1332 non-null   float64
 8   Electronegativity   1332 non-null   float64
 9   Assay               1332 non-null   object 
 10  Cell_name           1332 non-null   object 
 11  Cell_species        1332 non-null   object 
 12  Cell_origin         1332 non-null   object 
 13  Cell_type           1332 non-null   object 
 14  Exposure_time       1332 non-null   int64  
 15  Exposure_dose       1332 non-null   float64
 16  Toxici