### Data Processing 
This notebook:
- Pulls the Kaggle dataset of saints
- Scrapes wiki descriptions
- Applies some basic preprocessing

In [47]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler
import sweetviz as sv

**Let's subset our data to canonized saints only**

In [48]:
data = pd.read_csv('saints_data.csv')

In [49]:
data['canonized_year'].isna().sum()

7394

In [50]:
data = data.dropna(subset=['canonized_year'])

In [51]:
print(f'Number of saints: {data.shape[0]}')

Number of saints: 1618


**Dropping duplicates**

In [52]:
data = data.drop_duplicates()

In [53]:
print(f'Number of unique saints: {data.shape[0]}')

Number of unique saints: 759
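
Since more than half of the rows disappear in this step, it can be worth inspecting which rows are exact copies before discarding them. The following is only a sketch on a made-up frame: `toy` and its values are hypothetical and not taken from the saints dataset.

```python
import pandas as pd

# Hypothetical toy frame; names and years are illustrative only.
toy = pd.DataFrame({
    'name': ['A', 'A', 'B', 'C', 'C'],
    'canonized_year': [1200, 1200, 1300, 1400, 1401],
})

# keep=False flags every member of a duplicated group, not just the later copies.
dupe_mask = toy.duplicated(keep=False)

# With the default keep='first', this counts the rows drop_duplicates() would remove.
n_extra = toy.duplicated().sum()

deduped = toy.drop_duplicates()
print(toy[dupe_mask])
print(f'Rows removed: {n_extra}, rows kept: {deduped.shape[0]}')
```

Note that `drop_duplicates()` only removes rows that match on every column, so two rows for the same saint that differ in a single field (like the two `C` rows above) are both kept.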


**Trimming features**

In [54]:
# Drop this feature; it is inconsistent and messy
data = data.drop(columns=['Memorial'])

In [55]:
# Generate a Sweetviz report with summary plots of the data
report = sv.analyze([data, 'SaintData'], pairwise_analysis='off')



In [56]:
report.show_html('SaintData.html')

Report SaintData.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS saved in your notebook/colab files.


### Prepping data for distance calculations
- One-hot encode categorical features
- Impute NA values (missing years are filled with 0)
- Scale numerical features to [0, 1] (min-max scaling) so that no single feature dominates the distance
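
To see why the scaling step matters, consider a small made-up example. The rows and the `dist` helper below are hypothetical; the sketch only assumes Euclidean distance, which may not be the metric used downstream.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical rows: [birth_year, gender_is_female (one-hot)]
X = np.array([
    [1200.0, 0.0],
    [1500.0, 0.0],
    [1210.0, 1.0],
])

def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(a - b))

# Unscaled: the 300-year gap swamps the one-hot difference.
raw_year_gap = dist(X[0], X[1])     # rows differ only in year
raw_gender_gap = dist(X[0], X[2])   # rows differ by 10 years and by gender

# After min-max scaling, every column lies in [0, 1].
X_scaled = MinMaxScaler().fit_transform(X)
scaled_year_gap = dist(X_scaled[0], X_scaled[1])
scaled_gender_gap = dist(X_scaled[0], X_scaled[2])

print(f'raw:    year gap={raw_year_gap:.2f}, gender gap={raw_gender_gap:.2f}')
print(f'scaled: year gap={scaled_year_gap:.2f}, gender gap={scaled_gender_gap:.2f}')
```

Before scaling, the year difference is roughly 30 times larger than the gender difference; after scaling, the two are on a comparable footing.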

In [36]:
# Separate categorical and numerical features
categorical_columns = ['gender', 'place_of_birth', 'country', 'death_place', 'death_country',
                       'vocation', 'Venerated_by', 'beatified_by', 'canonized_by']
numerical_columns = ['birth_year', 'death_year', 'veneration_year', 'beatified_year', 'canonized_year']

# One-hot encode categorical features
encoder = OneHotEncoder(sparse_output=False)
encoded_cats = encoder.fit_transform(data[categorical_columns])

# Convert to DataFrame
encoded_df = pd.DataFrame(encoded_cats, columns=encoder.get_feature_names_out(categorical_columns))

# Impute missing years with 0, then convert numerical columns to integers
data[numerical_columns] = data[numerical_columns].fillna(0).astype(int)

# Drop original categorical columns and concatenate
processed_data = data.drop(columns=categorical_columns).reset_index(drop=True)
processed_data = pd.concat([processed_data, encoded_df], axis=1)

# Scale numerical features to the [0, 1] range
scaler = MinMaxScaler()
processed_data[numerical_columns] = scaler.fit_transform(processed_data[numerical_columns])
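
One caveat with filling missing years with 0: after min-max scaling, those rows sit at the extreme low end of the column, far from every real year. A possible alternative, which is not what this notebook does, is median imputation plus an indicator column. The sketch below uses a hypothetical `toy` frame and a hypothetical `birth_year_missing` column name.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing year.
toy = pd.DataFrame({'birth_year': [1181.0, np.nan, 1515.0, 1347.0]})

# Record which values were missing before they are filled in.
toy['birth_year_missing'] = toy['birth_year'].isna().astype(int)

# Fill gaps with the median of the observed years, then cast to int.
median_year = toy['birth_year'].median()
toy['birth_year'] = toy['birth_year'].fillna(median_year).astype(int)

print(toy)
```

The indicator keeps the fact that a value was missing available to the distance calculation, while the imputed year stays within the range of the observed data.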

**Save data**

In [46]:
processed_data.to_csv('CleanSaintData.csv', index=False)
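
As a rough illustration of how the saved features could feed a distance calculation, the sketch below computes pairwise Euclidean distances on a small stand-in frame. The `features` frame, its column names, and the choice of Euclidean distance are assumptions for illustration; the real file may also contain non-numeric columns that would need to be dropped first.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import pairwise_distances

# Hypothetical stand-in for the processed data: one scaled year plus one-hot columns.
features = pd.DataFrame(
    {
        'canonized_year': [0.0, 0.5, 1.0],
        'gender_Female': [1.0, 0.0, 1.0],
        'gender_Male': [0.0, 1.0, 0.0],
    },
    index=['saint_a', 'saint_b', 'saint_c'],
)

# Square matrix of Euclidean distances between all rows.
dist_matrix = pd.DataFrame(
    pairwise_distances(features.to_numpy(), metric='euclidean'),
    index=features.index,
    columns=features.index,
)

# Ignore self-distances, then pick the closest other row for each saint.
d = dist_matrix.to_numpy().copy()
np.fill_diagonal(d, np.inf)
nearest = pd.Series(features.index[d.argmin(axis=1)], index=features.index)

print(dist_matrix.round(2))
print(nearest)
```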