# Data Preparation for Wild Meat Nutritional Composition Analysis

This notebook details the data preparation steps for the study titled "Comprehensive Nutritional Composition of Wild Meat: A Systematic Review Using Data Imputation with Artificial Intelligence". The process involves loading the raw data, cleaning and transforming features, handling inconsistencies, and preparing a dataset suitable for subsequent analysis and imputation.

The primary goals of this notebook are:
1. Load the initial dataset.
2. Clean and standardize column names.
3. Remove irrelevant or redundant columns.
4. Standardize categorical variables such as animal part (`parte`) and species (`taxon`).
5. Extract new features like `genus`.
6. Convert data types to appropriate formats.
7. Perform initial handling of missing or inconsistent values.
8. Save the processed dataset.

## 1. Setup and Library Imports

Import necessary Python libraries for data manipulation and set pandas display options for better visualization of DataFrames.

In [1]:
import pandas as pd
import numpy as np

# Set pandas display options to show more columns and rows
pd.set_option("display.max_columns", 500)
pd.set_option("display.max_rows", 500)

## 2. Data Loading

Load the input dataset from a CSV file into a pandas DataFrame.

In [2]:
df = pd.read_csv("data/input_data.csv")

## 3. Initial Data Exploration and Column Cleaning

Inspect the column names and identify columns to be removed. These may include columns related to data dispersion, units, or other metadata not directly used in the nutritional analysis.

In [3]:
# Display all column names to understand the dataset structure
df.columns

Index(['cod_fonte', 'taxon', 'nom_pt', 'nom_art', 'obs', 'classificação',
       'n_amostra', 'idade', 'un_idade', 'local_coleta', 'pais_coleta',
       'parte', 'peso (seco/úmido)', 'ag ', 'disp_sd_ag', 'al ', 'disp_se_al',
       'cd ', 'disp_se_cd', 'disp_sd_cd', 'disp_sem_cd', 'co ', 'disp_se_co',
       'disp_sd_co', 'cr', 'disp_sd_cr', 'cu', 'disp_se_cu', 'disp_sem_cu',
       'disp_sd_cu', 'fe', 'disp_se_fe', 'disp_sem_fe', 'disp_sd_sd', 'mn',
       'disp_sem_mn', 'disp_se_mn', 'disp_sd_mn', 'pb', 'disp_sem_pb',
       'disp_sd_pb', 'se', 'disp_sd_se', 'disp_se_se', 'zn', 'disp_se_zn',
       'disp_sem_zn', 'disp_sd_zn', 'ca ', 'disp_se_ca', 'disp_sem_ca', 'k ',
       'disp_se_k', 'disp_sem_k', 'mg', 'disp_se_mg', 'disp_sem_mg', 'na',
       'disp_se_na', 'disp_sem_na', 'p', 'disp_se_p', 'disp_sem_p', 'ba',
       'disp_se_ba', 'disp_sd_ba', 'mo', 'disp_se_mo', 'disp_sd_mo', 's',
       'disp_se_s', 'ptn', 'disp_rmse_lip', 'disp_sem_ptn', 'disp_se_ptn',
       'disp_sd_ptn', '

### 3.1. Identify and Remove Unnecessary Columns
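Three groups of columns are dropped to streamline the dataset: dispersion measures (columns prefixed `disp_`, such as `disp_sd_*`, `disp_se_*`, and `disp_sem_*`), unit columns (prefixed `un_` or suffixed `_un`, since values are assumed to share standardized units), and other identifiers or metadata (`cod_fonte`, `n_amostra`, `local_coleta`, and any `um_`-prefixed columns). The filter also accepts the misspelled prefix `dips_`; a column whose name matches neither spelling, such as `dis_sem_PUFA` in the output below, is not dropped at this step.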

In [4]:
# Identify columns to remove based on prefixes or specific names
# cols_dispersao: Columns related to dispersion measures (e.g., standard deviation)
cols_dispersao = [i for i in df.columns if i.strip().startswith("disp_") or i.strip().startswith("dips_")]
# cols_unidade: Columns specifying units, assuming data will be standardized or units are implicit
cols_unidade = [i for i in df.columns if i.strip().startswith("un_") or i.strip().endswith("_un")]
# cols_outras: Other miscellaneous columns to be removed
cols_outras = [i for i in df.columns if i.strip().startswith("um_")] + ["cod_fonte", "n_amostra", "local_coleta"]

# Combine all columns to be removed
cols_remover = cols_dispersao + cols_unidade + cols_outras

# Drop the identified columns from the DataFrame
df.drop(cols_remover, axis=1, inplace=True)

# Display the DataFrame to verify column removal (optional, can be large)
df

Unnamed: 0,taxon,nom_pt,nom_art,obs,classificação,idade,pais_coleta,parte,peso (seco/úmido),ag,al,cd,co,cr,cu,fe,mn,pb,se,zn,ca,k,mg,na,p,ba,mo,s,ptn,lip,PUFA,dis_sem_PUFA,w3,w6,Ti
0,Sus scrofa,javali-selvagem,boar,Houve discrdância nas buacas: um aparece com p...,mamifero,2-5,Itália,Carne (utilizado média),úmido,,,7e-06,,0.0000123,,,,1.2e-05,,,,,,,,,,,,,,,,,
1,Sus scrofa,javali-selvagem,boar,idem,mamifero,2-5,Itália,Figado (utilizado média),úmido,,,6.7e-06,,0.0000125,,,,3.2e-05,,,,,,,,,,,,,,,,,
2,Sus scrofa,javali-selvagem,wild boar,Os dados estão apresentados em 8 colunas disti...,mamifero,< 1 e 1 Média,Alemanha,M. longissimus,úmido,,,,,,0.00017,0.0019,,,1.3e-05,0.0024,,,,,,,,,22.5,2.1,0.6489,,0.0777,0.5712,
3,Capreolus capreolus,corça-selvavem,feral roe deer,idem,mamifero,< 1 e 1 Média,Alemanha,M. longissimus,úmido,,,,,,0.00028,0.00321,,,4e-06,0.00235,,,,,,,,,23.5,1.0,0.307,,0.11,0.266,
4,Equus quagga burchellii,zebra,zebra,,mamifero,,África do Sul,Longissimus lumborum muscle,úmido,,,,,,,,,,,,,,,,,,,,22.29,1.47,0.000586,,3.2e-05,5e-06,
5,Cervus elaphus,veado-vermelho,red deer,Utilizado os dados de média. Houve discrdância...,mamifero,,Polônia,Musculo,úmido,1.65e-08,,7.26e-06,5.61e-07,0.00001716,0.000363,,7.6e-05,6e-06,5e-06,0.00495,,,,,,5e-06,6e-06,,,,,,,,8.58e-07
6,Cervus elaphus,veado-vermelho,red deer,idem,mamifero,,Polônia,Figado,úmido,2.904e-06,,1.287e-05,5.94e-06,0.00001815,0.001947,,0.000396,6e-06,7e-06,0.0033,,,,,,4e-06,0.000102,,,,,,,,1.551e-06
7,Cervus elaphus,veado-vermelho,red deer,idem,mamifero,,Polônia,Rim,úmido,1.98e-08,,0.00264,2.937e-06,0.0000462,0.000693,,0.000218,1e-05,0.000132,0.00429,,,,,,1.5e-05,4.9e-05,,,,,,,,9.24e-06
8,Cervus elaphus nannodes,alces-de-tule,tule elk,,mamifero,,EUA,Figado,úmido,,,,,,0.000867,0.015131,0.000214,,,0.00204,0.004968,8.2e-08,0.015988,,0.003788,,0.000113,0.000218,,,,,,,
9,Cervus elaphus nannodes,alces-de-tule,tule elk,,mamifero,,EUA,Figado,úmido,,,,,,0.004799,0.013225,0.000211,,,0.001924,0.003905,1.4e-08,0.016113,,0.003571,,8.2e-05,0.000204,,,,,,,
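
The prefix-based selection above can be tried on a toy frame. The following is a minimal sketch with made-up column names and a hypothetical helper, `select_columns_to_drop`, which is not part of the notebook. It illustrates why a column with a misspelled prefix such as `dis_` survives the filter.

```python
import pandas as pd

def select_columns_to_drop(columns, prefixes=("disp_", "dips_", "un_", "um_"),
                           suffixes=("_un",), extra=()):
    """Return columns whose stripped name matches a prefix/suffix, plus any listed in `extra`."""
    matched = [c for c in columns
               if c.strip().startswith(tuple(prefixes)) or c.strip().endswith(tuple(suffixes))]
    return matched + [c for c in extra if c in columns and c not in matched]

# Toy frame with illustrative column names (not the real dataset)
toy = pd.DataFrame(columns=["taxon", "fe", "disp_sd_fe", "un_idade", "dis_sem_PUFA", "cod_fonte"])

to_drop = select_columns_to_drop(toy.columns, extra=["cod_fonte"])
kept = toy.drop(columns=to_drop).columns.tolist()

print(to_drop)  # ['disp_sd_fe', 'un_idade', 'cod_fonte']
print(kept)     # ['taxon', 'fe', 'dis_sem_PUFA']
```

Listing the surviving columns after the drop, as done here with `kept`, is a cheap way to spot names that slipped past the prefixes.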


### 3.2. Explore Key Categorical Columns
Examine the unique values in `parte` (anatomical part) and `taxon` (species name) to understand the diversity and identify necessary cleaning or standardization steps.

In [5]:
# Explore unique values in the 'parte' column (anatomical part of the animal)
df.parte.unique()

array(['Carne (utilizado média)', 'Figado  (utilizado média)',
       'M. longissimus', 'Longissimus lumborum muscle', 'Musculo',
       'Figado', 'Rim', 'Musculo - Carne de peito ', 'Perna inteira',
       'Coxa - Musculus semimembranosus', 'M. Longissimus lombar',
       'Longissimus thoracis et lumborum', 'músculo do peito', 'coxa',
       'músculo estriado', 'Fígado', 'M. logissimus lumborum',
       'Posteriores', 'Longissimus thoracis (LT)', 'músculo',
       'músculo (pata traseira)', 'músculo (pata dianteira)',
       'músculo (filé)', 'viscera',
       'músculo do dorso (occipito-cervicalismedialis)',
       'músculo da calda (músculo ílio-ischiocaudalis)'], dtype=object)

In [6]:
# Explore unique values in the 'taxon' column (species name)
df.taxon.unique()

array(['Sus scrofa', 'Capreolus capreolus ', 'Equus quagga burchellii',
       'Cervus elaphus', 'Cervus elaphus nannodes', 'Scolopax rusticola ',
       'Phacochoerus africanus', 'Columba palumbus', 'Turdus philomelos',
       'Streptopelia turtur', 'Alces alces', 'Branta canadensis',
       'Anas platyrhynchos', 'Syncerus caffer', 'Odocoileus virginianus',
       'Odocoileus hemionus', 'Sus scrofa ', 'Capreolus capreolus',
       'Tayassu tajacu', 'Chelonoidis denticulata', 'Cuniculus paca',
       'Mazama americana', 'Tayassu pecari', 'Pecari tajacu',
       'Podocnemis expansa', 'Peltocephalus dumerilianus ',
       'Dasyprocta leporina', 'Podocnemis sextuberculata Cornalia',
       'Caiman crocodilus', 'Agouti paca ', 'Podocnemis unifilis',
       'Manzama americana ', 'Caiman yacare'], dtype=object)

### 3.3. Clean 'taxon' Column
Remove leading/trailing whitespace from the 'taxon' column for consistency.

In [7]:
# Strip whitespace from 'taxon' entries
df["taxon"] = df["taxon"].str.strip()

## 4. Data Transformation and Feature Engineering

Standardize categorical features by mapping them to consistent values. This includes anatomical parts and species names. New features like 'genus' are also derived.

### 4.1. Define Mapping Dictionaries
Dictionaries are created to map various representations of anatomical parts and species names to standardized categories or codes. This is crucial for consistent analysis, especially when dealing with data aggregated from multiple sources.

In [8]:
# Dictionary to standardize anatomical part names
# e.g., 'Carne (utilizado média)' is mapped to 'Musculo'
dict_partes = {
    "Carne (utilizado média)": "Musculo",
    "Figado  (utilizado média)": "Visceras",
    "M. longissimus": "Musculo",
    "Longissimus lumborum muscle": "Musculo",
    "Musculo": "Musculo",
    "Figado": "Visceras",
    "Rim": "Visceras",
    "Musculo - Carne de peito ": "Musculo",
    "Perna inteira": "Musculo",
    "Coxa - Musculus semimembranosus": "Musculo",
    "M. Longissimus lombar": "Musculo",
    "Longissimus thoracis et lumborum": "Musculo",
    "músculo do peito": "Musculo",
    "coxa": "Musculo",
    "músculo estriado": "Musculo",
    "Fígado": "Visceras",
    "M. logissimus lumborum": "Musculo",
    "Posteriores": "Musculo",
    "Longissimus thoracis (LT)": "Musculo",
    "músculo": "Musculo",
    "músculo (pata traseira)": "Musculo",
    "músculo (pata dianteira)": "Musculo",
    "músculo (filé)": "Musculo",
    "viscera": "Visceras",
    "músculo do dorso (occipito-cervicalismedialis)": "Musculo",
    "músculo da calda (músculo ílio-ischiocaudalis)": "Musculo",
}


# Dictionary to map species names to standardized codes
# This helps in grouping species and can be useful for analysis and visualization
dict_especies = {
    "Agouti paca": "AP1",
    "Alces alces": "AA",
    "Anas platyrhynchos": "AP2",
    "Branta canadensis": "BC",
    "Capreolus capreolus": "CC1",
    "Capreolus capreolus ": "CC1",  # Trailing-space variant; never matched after the .str.strip() in section 3.3, kept as a safeguard
    "Cervus elaphus": "CE",
    "Cervus elaphus nannodes": "CE",  # Subspecies grouped with Cervus elaphus
    "Chelonoidis denticulatus": "CD",
    "Chelonoidis denticulata": "CD",  # Spelling variant of Chelonoidis denticulatus
    "Columba palumbus": "CP1",
    "Cuniculus paca": "CP2",
    "Equus quagga burchellii": "EQ",
    "Geochelone denticulaa": "GD",  # Misspelled key that matches no taxon in the current data; it has its own code (GD), not CD
    "Mazama americana": "MA",
    "Odocoileus hemionus": "OH",
    "Odocoileus virginianus": "OV",
    "Phacochoerus africanus": "PA",
    "Scolopax rusticola": "SR",
    "Streptopelia turtur": "ST",
    "Sus scrofa": "SS",
    "Syncerus caffer": "SC",
    "Tayassu pecari": "TP1",
    "Tayassu tajacu": "TT",
    "Turdus philomelos": "TP2",
    "Pecari tajacu": "PQ",  # Tayassu tajacu (TT) and Pecari tajacu (PQ) get separate codes; verify whether they are synonyms to be merged
    "Podocnemis expansa": "PE",
    "Peltocephalus dumerilianus": "PD",
    "Dasyprocta leporina": "DL",
    "Podocnemis sextuberculata Cornalia": "PS",
    "Caiman crocodilus": "CC2",
    "Podocnemis unifilis": "PU",
    "Manzama americana": "MA",  # Misspelling of Mazama americana, mapped to the same code
    "Caiman yacare": "CY",
}

Verify the species mapping.

In [9]:
# Print species codes and their corresponding names for verification
for k, v in dict_especies.items():
    print(v, k)

AP1 Agouti paca
AA Alces alces
AP2 Anas platyrhynchos
BC Branta canadensis
CC1 Capreolus capreolus
CC1 Capreolus capreolus 
CE Cervus elaphus
CE Cervus elaphus nannodes
CD Chelonoidis denticulatus
CD Chelonoidis denticulata
CP1 Columba palumbus
CP2 Cuniculus paca
EQ Equus quagga burchellii
GD Geochelone denticulaa
MA Mazama americana
OH Odocoileus hemionus
OV Odocoileus virginianus
PA Phacochoerus africanus
SR Scolopax rusticola
ST Streptopelia turtur
SS Sus scrofa
SC Syncerus caffer
TP1 Tayassu pecari
TT Tayassu tajacu
TP2 Turdus philomelos
PQ Pecari tajacu
PE Podocnemis expansa
PD Peltocephalus dumerilianus
DL Dasyprocta leporina
PS Podocnemis sextuberculata Cornalia
CC2 Caiman crocodilus
PU Podocnemis unifilis
MA Manzama americana
CY Caiman yacare
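
Scanning the printed list by eye works for a few dozen entries, but shared codes are easier to spot by inverting the dictionary. The sketch below uses a small illustrative subset (`toy_codes` is an assumption, not the full `dict_especies`) and groups names by code, so intentional merges and accidental collisions both become visible.

```python
from collections import defaultdict

# Illustrative subset of a name -> code mapping
toy_codes = {
    "Mazama americana": "MA",
    "Manzama americana": "MA",
    "Sus scrofa": "SS",
}

# Invert the mapping: code -> list of names that share it
names_by_code = defaultdict(list)
for name, code in toy_codes.items():
    names_by_code[code].append(name)

# Keep only the codes assigned to more than one name
shared = {code: names for code, names in names_by_code.items() if len(names) > 1}
print(shared)  # {'MA': ['Mazama americana', 'Manzama americana']}
```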


### 4.2. Apply Mappings and Derive New Features
Create new columns `taxon_tfmed` (transformed taxon code) and `genus`.

In [10]:
# Apply species mapping to create 'taxon_tfmed' column
df["taxon_tfmed"] = df["taxon"].replace(dict_especies)

# Extract genus from the 'taxon' column (first word, lowercased and stripped)
df["genus"] = df["taxon"].str.split().str[0].str.strip().str.lower()

Explore unique values of the newly created transformed columns.

In [11]:
# Check unique values in the transformed taxon column
df["taxon_tfmed"].unique()

array(['SS', 'CC1', 'EQ', 'CE', 'SR', 'PA', 'CP1', 'TP2', 'ST', 'AA',
       'BC', 'AP2', 'SC', 'OV', 'OH', 'TT', 'CD', 'CP2', 'MA', 'TP1',
       'PQ', 'PE', 'PD', 'DL', 'PS', 'CC2', 'AP1', 'PU', 'CY'],
      dtype=object)
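
`Series.replace` leaves any value without a dictionary entry unchanged, so an unmapped species name would pass silently into `taxon_tfmed` as a full name rather than a code. A minimal sketch of a guard follows. It uses `Series.map`, which yields `NaN` for missing keys, on toy data; the small dictionary `toy_codes` is an illustrative assumption.

```python
import pandas as pd

toy_codes = {"Sus scrofa": "SS", "Alces alces": "AA"}  # illustrative subset
taxon = pd.Series(["Sus scrofa", "Alces alces ", "Caiman yacare"]).str.strip()

# map() returns NaN wherever the dictionary has no matching key
mapped = taxon.map(toy_codes)
unmapped = sorted(taxon[mapped.isna()].unique())

print(unmapped)  # ['Caiman yacare']
```

An empty `unmapped` list confirms that every name in the column has a code.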

In [12]:
# Check unique values in the new 'genus' column
df["genus"].unique()

array(['sus', 'capreolus', 'equus', 'cervus', 'scolopax', 'phacochoerus',
       'columba', 'turdus', 'streptopelia', 'alces', 'branta', 'anas',
       'syncerus', 'odocoileus', 'tayassu', 'chelonoidis', 'cuniculus',
       'mazama', 'pecari', 'podocnemis', 'peltocephalus', 'dasyprocta',
       'caiman', 'agouti', 'manzama'], dtype=object)
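
The output lists both `mazama` and `manzama`. The species dictionary maps the misspelled `Manzama americana` to the same code as `Mazama americana`, but `genus` is derived from the raw `taxon` string and therefore keeps the misspelling. One possible fix is a small correction table applied after extraction, sketched here on toy data. `genus_fixes` is a hypothetical name, and which entries belong in it (for example, whether `agouti` should be merged with `cuniculus`) is a taxonomic decision for the analyst.

```python
import pandas as pd

# Hypothetical correction table: misspelled genus -> intended genus
genus_fixes = {"manzama": "mazama"}

genus = pd.Series(["mazama", "manzama", "sus"])
genus_clean = genus.replace(genus_fixes)

print(genus_clean.unique().tolist())  # ['mazama', 'sus']
```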

Apply mapping for anatomical parts.

In [13]:
# Apply anatomical part mapping to create 'partes_tfmed' column
df["partes_tfmed"] = df.parte.replace(dict_partes)

# Display DataFrame to see new columns (optional)
df

Unnamed: 0,taxon,nom_pt,nom_art,obs,classificação,idade,pais_coleta,parte,peso (seco/úmido),ag,al,cd,co,cr,cu,fe,mn,pb,se,zn,ca,k,mg,na,p,ba,mo,s,ptn,lip,PUFA,dis_sem_PUFA,w3,w6,Ti,taxon_tfmed,genus,partes_tfmed
0,Sus scrofa,javali-selvagem,boar,Houve discrdância nas buacas: um aparece com p...,mamifero,2-5,Itália,Carne (utilizado média),úmido,,,7e-06,,0.0000123,,,,1.2e-05,,,,,,,,,,,,,,,,,,SS,sus,Musculo
1,Sus scrofa,javali-selvagem,boar,idem,mamifero,2-5,Itália,Figado (utilizado média),úmido,,,6.7e-06,,0.0000125,,,,3.2e-05,,,,,,,,,,,,,,,,,,SS,sus,Visceras
2,Sus scrofa,javali-selvagem,wild boar,Os dados estão apresentados em 8 colunas disti...,mamifero,< 1 e 1 Média,Alemanha,M. longissimus,úmido,,,,,,0.00017,0.0019,,,1.3e-05,0.0024,,,,,,,,,22.5,2.1,0.6489,,0.0777,0.5712,,SS,sus,Musculo
3,Capreolus capreolus,corça-selvavem,feral roe deer,idem,mamifero,< 1 e 1 Média,Alemanha,M. longissimus,úmido,,,,,,0.00028,0.00321,,,4e-06,0.00235,,,,,,,,,23.5,1.0,0.307,,0.11,0.266,,CC1,capreolus,Musculo
4,Equus quagga burchellii,zebra,zebra,,mamifero,,África do Sul,Longissimus lumborum muscle,úmido,,,,,,,,,,,,,,,,,,,,22.29,1.47,0.000586,,3.2e-05,5e-06,,EQ,equus,Musculo
5,Cervus elaphus,veado-vermelho,red deer,Utilizado os dados de média. Houve discrdância...,mamifero,,Polônia,Musculo,úmido,1.65e-08,,7.26e-06,5.61e-07,0.00001716,0.000363,,7.6e-05,6e-06,5e-06,0.00495,,,,,,5e-06,6e-06,,,,,,,,8.58e-07,CE,cervus,Musculo
6,Cervus elaphus,veado-vermelho,red deer,idem,mamifero,,Polônia,Figado,úmido,2.904e-06,,1.287e-05,5.94e-06,0.00001815,0.001947,,0.000396,6e-06,7e-06,0.0033,,,,,,4e-06,0.000102,,,,,,,,1.551e-06,CE,cervus,Visceras
7,Cervus elaphus,veado-vermelho,red deer,idem,mamifero,,Polônia,Rim,úmido,1.98e-08,,0.00264,2.937e-06,0.0000462,0.000693,,0.000218,1e-05,0.000132,0.00429,,,,,,1.5e-05,4.9e-05,,,,,,,,9.24e-06,CE,cervus,Visceras
8,Cervus elaphus nannodes,alces-de-tule,tule elk,,mamifero,,EUA,Figado,úmido,,,,,,0.000867,0.015131,0.000214,,,0.00204,0.004968,8.2e-08,0.015988,,0.003788,,0.000113,0.000218,,,,,,,,CE,cervus,Visceras
9,Cervus elaphus nannodes,alces-de-tule,tule elk,,mamifero,,EUA,Figado,úmido,,,,,,0.004799,0.013225,0.000211,,,0.001924,0.003905,1.4e-08,0.016113,,0.003571,,8.2e-05,0.000204,,,,,,,,CE,cervus,Visceras


In [14]:
# Check unique values in the transformed 'partes_tfmed' column
df.partes_tfmed.unique()

array(['Musculo', 'Visceras'], dtype=object)

## 5. Feature Selection and DataFrame Restructuring

Select the columns relevant for the nutritional analysis and create a new DataFrame. The selected columns include transformed categorical features and nutrient values.

In [15]:
# Select relevant columns for the final dataset
# These include identifiers, categorical features, and nutrient values
# Note: 'k ' has a trailing space, which will be handled later by stripping column names.
data = df[["taxon_tfmed", "genus", "classificação", "pais_coleta", "fe", "mn", "se", "zn", "k ", "mg", "na", "ptn", "lip", "w3", "w6", "partes_tfmed"]].copy()

# Display the new DataFrame (optional)
data

Unnamed: 0,taxon_tfmed,genus,classificação,pais_coleta,fe,mn,se,zn,k,mg,na,ptn,lip,w3,w6,partes_tfmed
0,SS,sus,mamifero,Itália,,,,,,,,,,,,Musculo
1,SS,sus,mamifero,Itália,,,,,,,,,,,,Visceras
2,SS,sus,mamifero,Alemanha,0.0019,,1.3e-05,0.0024,,,,22.5,2.1,0.0777,0.5712,Musculo
3,CC1,capreolus,mamifero,Alemanha,0.00321,,4e-06,0.00235,,,,23.5,1.0,0.11,0.266,Musculo
4,EQ,equus,mamifero,África do Sul,,,,,,,,22.29,1.47,3.2e-05,5e-06,Musculo
5,CE,cervus,mamifero,Polônia,,7.6e-05,5e-06,0.00495,,,,,,,,Musculo
6,CE,cervus,mamifero,Polônia,,0.000396,7e-06,0.0033,,,,,,,,Visceras
7,CE,cervus,mamifero,Polônia,,0.000218,0.000132,0.00429,,,,,,,,Visceras
8,CE,cervus,mamifero,EUA,0.015131,0.000214,,0.00204,8.2e-08,0.015988,,,,,,Visceras
9,CE,cervus,mamifero,EUA,0.013225,0.000211,,0.001924,1.4e-08,0.016113,,,,,,Visceras


Explore the distribution of `genus` and its relationship with `classificação` (taxonomic class).

In [16]:
# Value counts for 'genus', sorted by index
data.genus.value_counts().sort_index()

genus
agouti            1
alces             2
anas              1
branta            1
caiman            3
capreolus         3
cervus           12
chelonoidis       2
columba           2
cuniculus         2
dasyprocta        1
equus             1
manzama           1
mazama            1
odocoileus        4
pecari            2
peltocephalus     1
phacochoerus      2
podocnemis        7
scolopax          2
streptopelia      2
sus               7
syncerus          4
tayassu          12
turdus            2
Name: count, dtype: int64

In [17]:
# Number of unique 'classificação' (taxonomic class) for each 'genus'
# This helps to check if a genus is consistently assigned to a single class
data.groupby("genus")["classificação"].nunique()

genus
agouti           1
alces            1
anas             1
branta           1
caiman           1
capreolus        1
cervus           1
chelonoidis      1
columba          1
cuniculus        1
dasyprocta       1
equus            1
manzama          1
mazama           1
odocoileus       1
pecari           1
peltocephalus    1
phacochoerus     1
podocnemis       1
scolopax         1
streptopelia     1
sus              1
syncerus         1
tayassu          1
turdus           1
Name: classificação, dtype: int64
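
Reading a column of ones works at this size, but the same consistency check can be made explicit. The sketch below, on toy data, collects any genus assigned to more than one class and fails loudly if the list is not empty.

```python
import pandas as pd

toy = pd.DataFrame({
    "genus": ["sus", "sus", "caiman", "anas"],
    "classificação": ["mamifero", "mamifero", "reptil", "ave"],
})

classes_per_genus = toy.groupby("genus")["classificação"].nunique()
inconsistent = classes_per_genus[classes_per_genus > 1].index.tolist()

# Stop early if any genus maps to more than one taxonomic class
assert not inconsistent, f"Genera with more than one class: {inconsistent}"
```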

## 6. Data Type Conversion and Categorical Encoding

Convert categorical columns to the `category` data type for memory efficiency and to enable easy numerical encoding. Numerical representations of these categories are created for potential use in machine learning models.

In [18]:
# Convert categorical string columns to pandas 'category' dtype
data["classificação"] = data["classificação"].astype("category")
data["pais_coleta"] = data["pais_coleta"].astype("category")
data["partes_tfmed"] = data["partes_tfmed"].astype("category")
data["taxon_tfmed"] = data["taxon_tfmed"].astype("category")
data["genus"] = data["genus"].astype("category")

# Create new columns with numerical codes for these categories
data["classificação_cat"] = data["classificação"].cat.codes
data["pais_coleta_cat"] = data["pais_coleta"].cat.codes
data["partes_tfmed_cat"] = data["partes_tfmed"].cat.codes
data["taxon_tfmed_cat"] = data["taxon_tfmed"].cat.codes
data["genus_cat"] = data["genus"].cat.codes
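
As a small illustration of what `.cat.codes` produces, the sketch below uses toy values. For string data, pandas orders the categories lexically, the codes are arbitrary integer labels following that order, and missing values receive the code `-1`. Keeping a code-to-label lookup (here the assumed name `code_to_label`) makes the encoding reversible.

```python
import pandas as pd

s = pd.Series(["mamifero", "ave", "reptil", "ave", None]).astype("category")

print(list(s.cat.categories))  # ['ave', 'mamifero', 'reptil']
print(s.cat.codes.tolist())    # [1, 0, 2, 0, -1]

# Lookup table to translate codes back into labels
code_to_label = dict(enumerate(s.cat.categories))
```

Because the codes carry no numeric meaning, models that treat them as ordered quantities may read structure into them that is not there.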

Attempt to convert all columns to numeric types where appropriate. Columns that cannot be converted (category columns, or strings that are not purely numeric) are returned unchanged because of `errors="ignore"`. Note that this option is deprecated as of pandas 2.2 and emits a `FutureWarning`.

In [19]:
# Apply pd.to_numeric to the DataFrame, ignoring errors for non-convertible columns
data = data.apply(pd.to_numeric, errors="ignore")
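
A possible way to get the same effect without the `errors="ignore"` option is an explicit `try`/`except` around `pd.to_numeric`. This is a sketch on toy data, not the notebook's own code, and the helper name `to_numeric_if_possible` is an assumption.

```python
import pandas as pd

def to_numeric_if_possible(col):
    """Return the column converted to numeric, or unchanged if conversion fails."""
    try:
        return pd.to_numeric(col)
    except (ValueError, TypeError):
        return col

toy = pd.DataFrame({
    "fe": ["0.0019", "0.00321"],    # numeric strings: converted to float
    "genus": ["sus", "capreolus"],  # plain text: left as-is
})

converted = toy.apply(to_numeric_if_possible)
print(converted.dtypes)
```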



## 7. Missing Value Checks and Further Cleaning

Inspect for missing values and perform further data cleaning, such as standardizing column names and correcting data entry issues (e.g., commas as decimal separators).

In [20]:
# Check if any column consists entirely of null values
data.isnull().all()

taxon_tfmed          False
genus                False
classificação        False
pais_coleta          False
fe                   False
mn                   False
se                   False
zn                   False
k                    False
mg                   False
na                   False
ptn                  False
lip                  False
w3                   False
w6                   False
partes_tfmed         False
classificação_cat    False
pais_coleta_cat      False
partes_tfmed_cat     False
taxon_tfmed_cat      False
genus_cat            False
dtype: bool

In [21]:
# Display DataFrame info to check data types and non-null counts per column
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 78 entries, 0 to 77
Data columns (total 21 columns):
 #   Column             Non-Null Count  Dtype   
---  ------             --------------  -----   
 0   taxon_tfmed        78 non-null     category
 1   genus              78 non-null     category
 2   classificação      78 non-null     category
 3   pais_coleta        78 non-null     category
 4   fe                 25 non-null     float64 
 5   mn                 22 non-null     float64 
 6   se                 13 non-null     float64 
 7   zn                 30 non-null     float64 
 8   k                  15 non-null     float64 
 9   mg                 15 non-null     float64 
 10  na                 13 non-null     float64 
 11  ptn                54 non-null     float64 
 12  lip                50 non-null     float64 
 13  w3                 19 non-null     float64 
 14  w6                 19 non-null     float64 
 15  partes_tfmed       78 non-null     category
 16  classifica

### 7.1. Standardize Column Names
Remove leading/trailing whitespace from column names (for example, `'k '` becomes `'k'`). Lowercasing and accent removal follow in section 7.3.

In [22]:
# Strip whitespace from all column names (e.g., 'k ' becomes 'k')
data.columns = [i.strip() for i in data.columns]

### 7.2. Handle Rows with All Null Nutrient Values
Identify rows where all key nutrient values are missing. These rows provide no nutritional information and can be removed before imputation.

In [23]:
# Define the list of nutrient columns that will be subject to imputation later
cols_impute = ["fe", "mn", "se", "zn", "k", "mg", "na", "ptn", "lip", "w3", "w6"]

In [24]:
# Count how many rows have all null values across the 'cols_impute' set
data[cols_impute].isnull().all(axis=1).sum()

np.int64(2)

In [25]:
# Display boolean Series indicating rows where all 'cols_impute' are null
data[cols_impute].isnull().all(axis=1)

0      True
1      True
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
15    False
16    False
17    False
18    False
19    False
20    False
21    False
22    False
23    False
24    False
25    False
26    False
27    False
28    False
29    False
30    False
31    False
32    False
33    False
34    False
35    False
36    False
37    False
38    False
39    False
40    False
41    False
42    False
43    False
44    False
45    False
46    False
47    False
48    False
49    False
50    False
51    False
52    False
53    False
54    False
55    False
56    False
57    False
58    False
59    False
60    False
61    False
62    False
63    False
64    False
65    False
66    False
67    False
68    False
69    False
70    False
71    False
72    False
73    False
74    False
75    False
76    False
77    False
dtype: bool

In [26]:
# Display the first few rows to see the current state of the data
data.head(2)

Unnamed: 0,taxon_tfmed,genus,classificação,pais_coleta,fe,mn,se,zn,k,mg,na,ptn,lip,w3,w6,partes_tfmed,classificação_cat,pais_coleta_cat,partes_tfmed_cat,taxon_tfmed_cat,genus_cat
0,SS,sus,mamifero,Itália,,,,,,,,,,,,Musculo,1,9,0,24,21
1,SS,sus,mamifero,Itália,,,,,,,,,,,,Visceras,1,9,1,24,21


In [27]:
# Remove rows where all specified nutrient columns ('cols_impute') are null
print("Shape before removing all-null nutrient rows:", data.shape)
data = data[~data[cols_impute].isnull().all(axis=1)]
data.reset_index(drop=True, inplace=True)
print("Shape after removing all-null nutrient rows:", data.shape)

Shape before removing all-null nutrient rows: (78, 21)
Shape after removing all-null nutrient rows: (76, 21)
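
The boolean-mask filter used above has a built-in equivalent, `DataFrame.dropna` with `subset` and `how="all"`. The sketch below, on toy data with an illustrative two-nutrient list, shows that both forms keep the same rows.

```python
import numpy as np
import pandas as pd

nutrients = ["fe", "zn"]  # illustrative subset of the nutrient columns
toy = pd.DataFrame({
    "genus": ["sus", "cervus", "equus"],
    "fe": [np.nan, 0.0019, np.nan],
    "zn": [np.nan, np.nan, 0.0024],
})

# Mask-based filter, as in the notebook
mask_version = toy[~toy[nutrients].isnull().all(axis=1)].reset_index(drop=True)
# Equivalent built-in: drop rows only when every column in `subset` is null
dropna_version = toy.dropna(subset=nutrients, how="all").reset_index(drop=True)

print(mask_version.equals(dropna_version))  # True
```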


### 7.3. Further Column Name Standardization and Character Encoding
Convert column names to lowercase and remove accents/diacritics using `unidecode` for maximum compatibility and consistency.

In [28]:
import unidecode  # For removing accents from column names

# Convert column names to lowercase and strip whitespace again (redundant if done before, but safe)
data.columns = [i.lower().strip() for i in data.columns]
# Apply unidecode to remove accents/diacritics from column names
data.columns = [unidecode.unidecode(i) for i in data.columns]

# Display the DataFrame with fully standardized column names (optional)
data

Unnamed: 0,taxon_tfmed,genus,classificacao,pais_coleta,fe,mn,se,zn,k,mg,na,ptn,lip,w3,w6,partes_tfmed,classificacao_cat,pais_coleta_cat,partes_tfmed_cat,taxon_tfmed_cat,genus_cat
0,SS,sus,mamifero,Alemanha,0.0019,,1.3e-05,0.0024,,,,22.5,2.1,0.0777,0.5712,Musculo,1,1,0,24,21
1,CC1,capreolus,mamifero,Alemanha,0.00321,,4e-06,0.00235,,,,23.5,1.0,0.11,0.266,Musculo,1,1,0,4,5
2,EQ,equus,mamifero,África do Sul,,,,,,,,22.29,1.47,3.2e-05,5e-06,Musculo,1,14,0,12,11
3,CE,cervus,mamifero,Polônia,,7.6e-05,5e-06,0.00495,,,,,,,,Musculo,1,13,0,7,6
4,CE,cervus,mamifero,Polônia,,0.000396,7e-06,0.0033,,,,,,,,Visceras,1,13,1,7,6
5,CE,cervus,mamifero,Polônia,,0.000218,0.000132,0.00429,,,,,,,,Visceras,1,13,1,7,6
6,CE,cervus,mamifero,EUA,0.015131,0.000214,,0.00204,8.2e-08,0.015988,,,,,,Visceras,1,4,1,7,6
7,CE,cervus,mamifero,EUA,0.013225,0.000211,,0.001924,1.4e-08,0.016113,,,,,,Visceras,1,4,1,7,6
8,SR,scolopax,ave,Italia,,,,,,,,24.0,2.75,0.000197,1.57,Musculo,0,7,0,23,19
9,SR,scolopax,ave,Italia,,,,,,,,21.05,2.6,0.000175,1.04,Musculo,0,7,0,23,19


### 7.4. Correct Decimal Separators in Numeric Columns
Some numeric columns might use commas as decimal separators instead of periods. This section identifies such cases in object-type columns and corrects them, then converts these columns to float.

In [29]:
# Scan every object-dtype column for values containing a comma,
# which may indicate a comma used as the decimal separator
for col in data.columns:
    if data[col].dtype == "object":
        for row in data.index:  # Check each cell individually
            if "," in str(data.loc[row, col]):
                print(f"Comma found in row {row}, column '{col}': {data.loc[row, col]}")

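The check above is meant to identify columns that need comma-to-period replacement; here those are `se` and `zn`. In this run both columns were already parsed as `float64` (see the `data.info()` output in section 7), so the guarded loop below changes nothing and acts as a safeguard for inputs where these columns arrive as strings.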

In [30]:
# Correct comma to period for decimal separation in 'se' and 'zn' columns.
# This is done cell by cell, which can be slow for large datasets.
# A vectorized approach (e.g., Series.str.replace) would be faster when applied to the whole column.
for col in ["se", "zn"]:
    # Check if column exists and is of object type before attempting replacement
    if col in data.columns and data[col].dtype == "object":
        # Using .loc for setting values to avoid SettingWithCopyWarning
        for row in data.index:
            if isinstance(data.loc[row, col], str) and "," in data.loc[row, col]:
                data.loc[row, col] = data.loc[row, col].replace(",", ".")

# Convert these columns to float type after correction
if "se" in data.columns:
    data["se"] = data["se"].astype(float)
if "zn" in data.columns:
    data["zn"] = data["zn"].astype(float)
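
The vectorized alternative mentioned in the comments could look like the following sketch, shown on toy data with made-up values. It replaces commas across the whole column with `Series.str.replace` and converts with `pd.to_numeric`, skipping columns that are already numeric.

```python
import pandas as pd

toy = pd.DataFrame({
    "se": ["0,013", "0.004", None],  # strings with a comma decimal separator
    "zn": [2.4, 2.35, 4.95],         # already numeric: left untouched
})

for col in ["se", "zn"]:
    if not pd.api.types.is_numeric_dtype(toy[col]):
        # Replace the literal comma (no regex) and convert the column in one pass
        toy[col] = pd.to_numeric(toy[col].str.replace(",", ".", regex=False))

print(toy["se"].tolist())  # [0.013, 0.004, nan]
```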

## 8. Save Processed Data

Save the cleaned and processed DataFrame to a new CSV file. This file will serve as the input for subsequent analysis, including data imputation as mentioned in the research paper.

In [31]:
data.to_csv("data/input_data_processed.csv", index=False)
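
CSV stores only text, so the `category` dtypes set in section 6 are not preserved in the saved file. A downstream notebook that needs them can request them when reading, as in this sketch on toy data (an in-memory buffer stands in for the file).

```python
import io
import pandas as pd

toy = pd.DataFrame({"genus": pd.Categorical(["sus", "cervus"]), "fe": [0.0019, 0.0032]})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)

buffer.seek(0)
plain = pd.read_csv(buffer)  # 'genus' comes back as plain strings

buffer.seek(0)
restored = pd.read_csv(buffer, dtype={"genus": "category"})  # dtype requested explicitly

print(plain["genus"].dtype, restored["genus"].dtype)
```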

The data preparation is now complete. The `input_data_processed.csv` file contains the cleaned dataset ready for the next stages of the research, such as exploratory data analysis, statistical testing, and data imputation.