## 2. Data loading

### Introduction to the Datasets

- From UCI Machine Learning Repository
  - Maternal Health Risk -> https://archive.ics.uci.edu/dataset/863/maternal+health+risk
  - Autistic Spectrum Disorder Screening Data for Children -> https://archive.ics.uci.edu/dataset/419/autistic+spectrum+disorder+screening+data+for+children
- From Kaggle
  - 184.702 TU ML 2025S - Reviews -> https://www.kaggle.com/competitions/184-702-tu-ml-2025-s-reviews/data
  - 184.702 TU ML 2025S - Congressional Voting -> https://www.kaggle.com/competitions/184-702-tu-ml-2025-s-congressional-voting/data

### Download and save Kaggle Datasets

In [34]:
import pandas as pd
from logger import Logger
from ucimlrepo import fetch_ucirepo

logger = Logger(__name__)

In [35]:
# Clean the data directory
!rm -rf ../data/

In [36]:
# Prepare the folder structure
!mkdir -p ../data/archiv
!mkdir -p ../data/raw/kaggle/reviews
!mkdir -p ../data/raw/kaggle/congress
!mkdir -p ../data/raw/uci/mental_health_risk
!mkdir -p ../data/raw/uci/autistic_spectrum
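
The shell commands above depend on a Unix-like environment. As a hedged alternative, the same layout could be created from Python with `pathlib`. This is a minimal sketch: `DATA_SUBDIRS` and `prepare_data_dirs` are hypothetical names introduced here for illustration and are not part of the original notebook.

```python
from pathlib import Path

# Sub-directories mirroring the mkdir commands above (names copied as-is).
DATA_SUBDIRS = [
    "archiv",
    "raw/kaggle/reviews",
    "raw/kaggle/congress",
    "raw/uci/mental_health_risk",
    "raw/uci/autistic_spectrum",
]


def prepare_data_dirs(base_dir):
    """Create the data folder layout under base_dir and return the paths."""
    base = Path(base_dir)
    created = []
    for sub in DATA_SUBDIRS:
        target = base / sub
        # parents=True and exist_ok=True together behave like `mkdir -p`.
        target.mkdir(parents=True, exist_ok=True)
        created.append(target)
    return created
```

Calling `prepare_data_dirs("../data")` would then stand in for the five `mkdir -p` lines.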

In [38]:
# Save the Reviews dataset to ../data/raw/kaggle/reviews
!kaggle competitions download -c 184-702-tu-ml-2025-s-reviews -p ../data/archiv/kaggle/
!unzip -qo ../data/archiv/kaggle/184-702-tu-ml-2025-s-reviews.zip -d ../data/raw/kaggle/reviews/
logger.info("Reviews dataset downloaded and extracted.")

# Save the Congressional Voting Records dataset to ../data/raw/kaggle/congress
!kaggle competitions download -c 184-702-tu-ml-2025-s-congressional-voting -p ../data/archiv/kaggle/
!unzip -qo ../data/archiv/kaggle/184-702-tu-ml-2025-s-congressional-voting.zip -d ../data/raw/kaggle/congress/
logger.info("Congressional Voting Records dataset downloaded and extracted.")

# Load the datasets into pandas DataFrames
df_reviews = pd.read_csv("../data/raw/kaggle/reviews/amazon_review_ID.shuf.lrn.csv")
df_congress = pd.read_csv("../data/raw/kaggle/congress/CongressionalVotingID.shuf.lrn.csv")


184-702-tu-ml-2025-s-reviews.zip: Skipping, found more recently modified local copy (use --force to force download)


2025-04-26 16:23:04,310 - __main__ - INFO - Reviews dataset downloaded and extracted.


184-702-tu-ml-2025-s-congressional-voting.zip: Skipping, found more recently modified local copy (use --force to force download)


2025-04-26 16:23:05,606 - __main__ - INFO - Congressional Voting Records dataset downloaded and extracted.
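
The extraction step above calls the `unzip` command-line tool. If that tool is not available, a rough equivalent can be written with Python's standard `zipfile` module. The helper name `extract_archive` is an assumption made for this sketch; existing files in the destination are overwritten, similar to `unzip -o`.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir):
    """Extract every member of zip_path into dest_dir and return the member names."""
    dest = Path(dest_dir)
    # Create the destination folder if it does not exist yet.
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        # extractall overwrites files that already exist in dest.
        archive.extractall(dest)
        return archive.namelist()
```

For example, `extract_archive("../data/archiv/kaggle/184-702-tu-ml-2025-s-reviews.zip", "../data/raw/kaggle/reviews/")` would target the same paths as the Reviews cell above.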


### Download and save UCI Datasets

In [None]:
# ------- Maternal Health Risk ------- #
# Fetch the Maternal Health Risk dataset (UCI id 863)
maternal_health_risk = fetch_ucirepo(id=863)

# Features and targets (as pandas DataFrames)
X_maternal = maternal_health_risk.data.features
y_maternal = maternal_health_risk.data.targets
df_maternal = pd.concat([X_maternal, y_maternal], axis=1)

# Save the dataset to CSV (the target folder is named mental_health_risk, as created above)
maternal_path = '../data/raw/uci/mental_health_risk/maternal_health_risk.csv'
df_maternal.to_csv(maternal_path, index=False)
logger.info(f"Maternal Health Risk dataset successfully saved at {maternal_path}")




# ------- Autistic Spectrum ------- #
# Fetch the Autistic Spectrum Disorder screening dataset (UCI id 419)
asd_data = fetch_ucirepo(id=419)

# Features and targets (as pandas DataFrames)
X_asd = asd_data.data.features
y_asd = asd_data.data.targets
df_asd = pd.concat([X_asd, y_asd], axis=1)
  
# Save the dataset to CSV
asd_path = '../data/raw/uci/autistic_spectrum/asd_screening.csv'
df_asd.to_csv(asd_path, index=False)
logger.info(f"Autistic Spectrum dataset successfully saved at {asd_path}")


2025-04-26 10:20:30,833 - __main__ - INFO - Maternal Health Risk dataset successfully saved at ../data/uci/mental_health_risk/maternal_health_risk.csv
2025-04-26 10:20:32,614 - __main__ - INFO - Autistic Spectrum dataset successfully saved at ../data/uci/autistic_spectrum/asd_screening.csv
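
Both UCI blocks above repeat the same concatenate-and-save steps. One way to factor them out is sketched below. `save_features_and_targets` is a hypothetical helper, not part of the original notebook. It takes the feature and target DataFrames directly, so it does not depend on `ucimlrepo` itself.

```python
from pathlib import Path

import pandas as pd


def save_features_and_targets(features, targets, csv_path):
    """Join features and targets column-wise and write the result to csv_path."""
    df = pd.concat([features, targets], axis=1)
    out = Path(csv_path)
    # Make sure the target folder exists before writing.
    out.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(out, index=False)
    return df
```

With the objects from the cell above, the call would look like `save_features_and_targets(asd_data.data.features, asd_data.data.targets, asd_path)`.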
