# Import libraries and scripts
The functions used at each stage of data preprocessing are imported from the preprocess_data module. They cover shuffling the data, separating it into training, test and validation sets, cleaning it, and handling nulls and outliers. This notebook prepares the data before any machine learning model is applied.

In [1]:
import sys
sys.path.append(r'C:\Users\di_estebannn\Desktop\universidad\austria\applied_machine_and_deep_learning\project\src\scripts')
from preprocess_data import (
    shuffle_data,
    separate_data_frame,
    separate_and_clean_data,
    fill_null_values,
    remove_outliers,
    save_data,
)

import pandas as pd
import os

# Import data
Data is loaded from two CSV files containing the normalized training and test sets. The two sets are combined with shuffle_data() and then passed to separate_and_clean_data(), which is designed to convert all values to numeric (or to null when a value cannot be converted) and to drop any row with a null output; in this cell it is called with already_cleaned=True. The resulting inputs and outputs are saved in a new CSV file named "normalized_total_data.csv". Finally, separate_data_frame() divides the combined data into training, testing and validation sets.

In [2]:
general_path = 'C:/Users/di_estebannn/Desktop/universidad/austria/applied_machine_and_deep_learning/project'
processed_data_path = os.path.join(general_path, 'data', 'processed')
file_path_train = general_path + '/data/raw/normalized_train_data.csv'
file_path_test = general_path + '/data/raw/normalized_test_data.csv'

df_train_data = pd.read_csv(file_path_train)
df_test_data = pd.read_csv(file_path_test)

df_total_data = shuffle_data(df_train_data, df_test_data)
X, y = separate_and_clean_data('Total data set', df_total_data, already_cleaned=True)
save_data(general_path + '/data/raw/', 'normalized_total_data.csv', X, y)

df_train, df_test, df_validation = separate_data_frame(df_total_data)

Sizes of the Total data set: X = (10979, 21), y = (10979,).

The normalized_total_data.csv has been saved in C:/Users/di_estebannn/Desktop/universidad/austria/applied_machine_and_deep_learning/project/data/raw/.
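The combine, shuffle and split steps above can be sketched as follows. This is only an illustration, not the code in preprocess_data: the function name `shuffle_and_split`, the `seed` parameter and the 75/15/10 proportions are assumptions, the latter inferred from the set sizes printed later in this notebook (8234, 1646 and 1099 rows out of 10979).

```python
import pandas as pd


def shuffle_and_split(df_a, df_b, train_frac=0.75, test_frac=0.15, seed=0):
    """Hypothetical stand-in for shuffle_data() followed by separate_data_frame()."""
    # Stack both frames and shuffle the rows reproducibly.
    total = pd.concat([df_a, df_b], ignore_index=True)
    total = total.sample(frac=1.0, random_state=seed).reset_index(drop=True)

    # Assumed proportions; the validation set takes whatever rows remain.
    n_train = int(len(total) * train_frac)
    n_test = int(len(total) * test_frac)

    train = total.iloc[:n_train]
    test = total.iloc[n_train:n_train + n_test]
    validation = total.iloc[n_train + n_test:]
    return total, train, test, validation
```

Fixing the random seed makes the split reproducible between runs, which helps when comparing models trained on the saved files.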



# Access and prepare the data for Machine Learning tasks
## Preprocess the data (data cleaning)
### Separate and clean data
The separate_and_clean_data function performs two main tasks. First, if the data is not pre-cleaned, it converts all values in the data set to a numeric type, assigning null to any value that cannot be converted. It then removes rows with a null value in the 'output' column and removes duplicate rows. Second, it separates the data set into inputs (X) and outputs (y) and returns both parts. It also prints the sizes of the resulting arrays, which shows the structure of the clean, separated data. This step is required before the data can be used to train machine and deep learning models.

In [3]:
X_train, y_train = separate_and_clean_data('Training set', df_train)
X_test, y_test = separate_and_clean_data('Testing set', df_test)
X_validation, y_validation = separate_and_clean_data('Validation set', df_validation)
X, y = separate_and_clean_data('Total data set', df_total_data)

Sizes of the Training set: X = (8234, 21), y = (8234,).

Sizes of the Testing set: X = (1646, 21), y = (1646,).

Sizes of the Validation set: X = (1099, 21), y = (1099,).

Sizes of the Total data set: X = (10979, 21), y = (10979,).
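The cleaning and separation described in this section could look roughly like the sketch below. It is not the actual implementation in preprocess_data: the name `separate_and_clean_sketch`, the `target` parameter, and the choice to skip every cleaning step when `already_cleaned` is True are assumptions made for illustration.

```python
import pandas as pd


def separate_and_clean_sketch(name, df, already_cleaned=False, target="output"):
    """Hypothetical approximation of separate_and_clean_data()."""
    if not already_cleaned:
        # Non-numeric entries become NaN instead of raising an error.
        df = df.apply(pd.to_numeric, errors="coerce")
        # Rows without a usable output cannot be used for supervised learning.
        df = df.dropna(subset=[target])
        df = df.drop_duplicates()

    # Split into inputs (X) and outputs (y).
    X = df.drop(columns=[target])
    y = df[target]
    print(f"Sizes of the {name}: X = {X.shape}, y = {y.shape}.")
    return X, y
```

Note that nulls in the input columns survive this step on purpose; they are handled later by fill_null_values.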



### Fill null values
The fill_null_values function handles null values in the data set. First, it concatenates the inputs (X) and outputs (y) into a single DataFrame (df). Next, it checks whether a set of means is provided; if not, it calculates the means of the input columns (X). It then replaces the null values in X with those means. A set of means is provided only for the testing set: the means of the training set are passed in, so that the testing set is filled using values derived from the training set. This is not done for the validation set, so as not to influence the later validation of the trained model's effectiveness.
The function prints descriptive statistics of the data set, reporting the total amount of data in all columns and whether null values are present. It also shows the means of the 21 inputs as a 7 × 3 array.


In [4]:
X_train, means = fill_null_values(X_train, y_train, 'Training set')
X_test, _ = fill_null_values(X_test, y_test, 'Testing set', means)
X_validation, _ = fill_null_values(X_validation, y_validation, 'Validation set')
X, _ = fill_null_values(X, y, 'Total data set')


The total amount of data for the Training set is equal to 8234.0 in all columns.
It is False that there are null values, and the means of the inputs in the Training set is
[[0.59527569 0.80519772 0.73970471]
 [0.43339199 0.58410562 0.39352117]
 [0.41149113 0.67946258 0.51885498]
 [0.11483995 0.06679987 0.53295611]
 [0.2470088  0.26888976 0.19573641]
 [0.07356736 0.21582806 0.06572546]
 [0.0656609  0.4675519  0.46131896]]

The total amount of data for the Testing set is equal to 1646.0 in all columns.
It is False that there are null values, and the means of the inputs in the Testing set is
[[0.60069866 0.80587528 0.72327828]
 [0.42720193 0.58348522 0.39022489]
 [0.42019706 0.67909619 0.51219709]
 [0.12160283 0.06804536 0.53233452]
 [0.24670042 0.26710013 0.19554625]
 [0.07365155 0.21500092 0.0681207 ]
 [0.0641415  0.46674492 0.45929988]]

The total amount of data for the Validation set is equal to 1099.0 in all columns.
It is False that there are null values, and the means of the input
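A minimal sketch of the mean-imputation logic described in this section is given below. The name `fill_nulls_sketch` is hypothetical, and the sketch omits the printed statistics and the concatenation with y that the real function performs.

```python
import pandas as pd


def fill_nulls_sketch(X, means=None):
    """Hypothetical approximation of the imputation in fill_null_values()."""
    if means is None:
        # No reference means supplied: use this set's own column means.
        means = X.mean()
    # fillna with a Series fills each column with the matching mean.
    return X.fillna(means), means


# The training means are reused for the testing set, as in the cell above.
X_train_demo = pd.DataFrame({"input1": [1.0, None, 3.0]})
X_test_demo = pd.DataFrame({"input1": [None, 10.0]})
X_train_filled, train_means = fill_nulls_sketch(X_train_demo)
X_test_filled, _ = fill_nulls_sketch(X_test_demo, train_means)
```

Here the missing test value is filled with the training mean rather than the test mean, so no statistic computed on the testing set feeds into its own imputation.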

### Remove outliers
The remove_outliers function identifies and removes outliers in the data set. First, it concatenates the outputs (y) and inputs (X) into a single DataFrame (df). It then iterates over the input columns and uses the interquartile range (IQR) method to determine the lower and upper bounds, with a default threshold of 3.75. Values that fall outside these bounds are considered outliers, and the rows containing them are removed. The procedure is repeated 5 times, because the bounds are updated after each pass and further values may then fall outside the new range. The function prints the number of rows removed and the percentage of the original data they represent. Importantly, outliers are not removed from the validation set (X_validation, y_validation), so as not to influence the later validation of the trained model's effectiveness. This process helps improve the robustness of the model by removing data that could introduce noise or bias during training.

In [5]:
X_train, y_train = remove_outliers(X_train, y_train,'Training set')
X_test, y_test = remove_outliers(X_test, y_test, 'Testing set')
# Outliers are deliberately kept in the validation set, so this call stays disabled:
# X_validation, y_validation = remove_outliers(X_validation, y_validation, 'Validation set')
X, y = remove_outliers(X, y, 'Total data set')

806 rows have been removed from Training set, which is the 9.788681078455186% percent of the original data.
116 rows have been removed from Testing set, which is the 7.047387606318348% percent of the original data.
1093 rows have been removed from Total data set, which is the 9.955369341470078% percent of the original data.
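The repeated IQR filtering described in this section can be sketched as follows. This is an illustration under assumptions, not the code in preprocess_data: the name `remove_outliers_sketch` and the `n_passes` parameter are hypothetical, and the threshold is assumed to multiply the IQR in the usual way (bounds at Q1 - threshold × IQR and Q3 + threshold × IQR).

```python
import pandas as pd


def remove_outliers_sketch(X, y, threshold=3.75, n_passes=5):
    """Hypothetical approximation of remove_outliers()."""
    df = pd.concat([X, y.rename("output")], axis=1)
    n_before = len(df)

    for _ in range(n_passes):
        for col in X.columns:
            # Quartiles are recomputed on the rows that are still present.
            q1 = df[col].quantile(0.25)
            q3 = df[col].quantile(0.75)
            iqr = q3 - q1
            lower = q1 - threshold * iqr
            upper = q3 + threshold * iqr
            # Keep only rows inside the bounds (inclusive on both ends).
            df = df[df[col].between(lower, upper)]

    removed = n_before - len(df)
    percent = 100 * removed / max(n_before, 1)
    print(f"{removed} rows removed ({percent:.2f}% of the original data).")
    return df[X.columns], df["output"]
```

Because every removal shifts the quartiles, a single pass can leave values that count as outliers under the updated bounds, which is why the sketch loops several times.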


### Save processed data
The save_data function takes a data set (X and y) and saves it to a CSV file with the specified name. If the data are not instances of pd.Series or pd.DataFrame, it converts them to those types. It then concatenates y and X into a single DataFrame (combined_df), sorts the columns, and renames the input columns as "input1", "input2", ..., "input21" and the output column as "output". Finally, it saves the DataFrame to a CSV file at the specified path (data_path) with the given file name (name_file). The function prints a message confirming that the file has been saved, or an error message if a problem occurs during the process. Here, the function is used to save the processed data sets (training, testing, validation, and total) to separate CSV files.

In [6]:
save_data(processed_data_path, 'processed_training_data.csv', X_train, y_train)
save_data(processed_data_path, 'processed_testing_data.csv', X_test, y_test)
save_data(processed_data_path, 'processed_validation_data.csv', X_validation, y_validation)
save_data(processed_data_path, 'processed_total_data.csv', X, y)

The processed_training_data.csv has been saved in C:/Users/di_estebannn/Desktop/universidad/austria/applied_machine_and_deep_learning/project\data\processed.

The processed_testing_data.csv has been saved in C:/Users/di_estebannn/Desktop/universidad/austria/applied_machine_and_deep_learning/project\data\processed.

The processed_validation_data.csv has been saved in C:/Users/di_estebannn/Desktop/universidad/austria/applied_machine_and_deep_learning/project\data\processed.

The processed_total_data.csv has been saved in C:/Users/di_estebannn/Desktop/universidad/austria/applied_machine_and_deep_learning/project\data\processed.
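The saving logic described in this section might be sketched as below. The name `save_data_sketch` is hypothetical, and several details are assumptions: inputs are placed before the output column, the index is not written to the file, the target folder is created if missing, and the combined DataFrame is returned for convenience.

```python
import os

import pandas as pd


def save_data_sketch(data_path, name_file, X, y):
    """Hypothetical approximation of save_data()."""
    # Accept lists or arrays as well as pandas objects; align rows by position.
    X = pd.DataFrame(X).reset_index(drop=True)
    y = pd.Series(y).reset_index(drop=True).rename("output")

    # Rename inputs to input1, input2, ..., inputN.
    X.columns = [f"input{i}" for i in range(1, X.shape[1] + 1)]
    combined_df = pd.concat([X, y], axis=1)

    try:
        os.makedirs(data_path, exist_ok=True)
        combined_df.to_csv(os.path.join(data_path, name_file), index=False)
        print(f"The {name_file} has been saved in {data_path}.")
    except OSError as error:
        print(f"The {name_file} could not be saved: {error}")
    return combined_df
```

Resetting the index before concatenating matters after outlier removal, since X and y may then carry non-contiguous row labels.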

