# Data Processing for Medical Appointment Dataset

The medical appointment dataset contains over 100,000 data entries.

To simplify the project, five days' worth of data were pulled from the dataset, and each day was saved as its own CSV file.

Each CSV file contains the following columns, which are used by the Weighted Interval Scheduling (WIS) algorithm:

- ***appointment_id***
  - Unique identifier for each appointment (such as *0000138*)
- ***start_time*** and ***end_time***
  - Start and end times for each appointment in *HH:MM:SS* format
- ***start_minutes*** and ***end_minutes***
  - Start and end times converted to number of minutes since midnight, which simplifies comparing times as numbers within the WIS algorithm
- ***priority***
  - Prioritizes appointments by patient age, as older patients experience increased health risks
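
The minutes-since-midnight conversion described above can be sketched as follows. This is a minimal illustration, not the code in `python.data_processing`; the helper name `to_minutes_since_midnight` is hypothetical, and dropping the seconds is an assumption inferred from the sample rows (for example, *13:37:57* maps to 817).

```python
def to_minutes_since_midnight(time_str: str) -> int:
    """Convert an 'HH:MM:SS' string to whole minutes since midnight.

    Hypothetical helper: the seconds component is discarded, which is
    an assumption based on the sample output rows in this notebook.
    """
    hours, minutes, _seconds = (int(part) for part in time_str.split(":"))
    return hours * 60 + minutes


# Values taken from the first appointment shown in the printed dataset
start = to_minutes_since_midnight("13:37:57")
end = to_minutes_since_midnight("13:43:09")
print(start, end)  # 817 823
```

With both times stored as integers, checking whether two appointments overlap reduces to plain numeric comparisons on `start_minutes` and `end_minutes`.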
  

In [1]:
# Import standard-library modules
import os
import sys
from pathlib import Path

# Add the project root to the import path so the local package can be found
sys.path.append(str(Path.cwd().parent))

import python.data_processing as process

In [2]:
# Import data
csv_file = "../data/original-datasets/medical-appointment-scheduling-system/appointments.csv"
medical_data = process.load_medical_data(csv_file)

print(medical_data)

        appointment_id appointment_date start_time  end_time  start_minutes  \
2                   21       2015-01-01   13:37:57  13:43:09            817   
3                  233       2015-01-01   14:00:40  14:29:34            840   
5                  180       2015-01-01   14:30:38  14:38:20            870   
6                  197       2015-01-01   14:39:14  14:43:26            879   
7                  191       2015-01-01   15:00:08  15:27:14            900   
...                ...              ...        ...       ...            ...   
111312          111312       2024-11-29   08:28:54  08:44:48            508   
111315          111245       2024-11-29   11:48:01  11:56:43            708   
111317          111311       2024-11-29   11:57:44  12:02:32            717   
111319          111318       2024-11-29   12:03:38  12:21:20            723   
111321          110579       2024-11-29   11:30:58  11:46:52            690   

        end_minutes  priority  
2               823

In [3]:
# Split appointments by day
week = process.get_week_by_start_date(medical_data, "2015-01-01")

for day, df_day in week.items():
    print(f"{day}: {len(df_day)} appointments")

Friday: 101 appointments
Monday: 103 appointments
Thursday: 101 appointments
Tuesday: 98 appointments
Wednesday: 105 appointments
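
For readers without access to the project module, the day-splitting step can be approximated as shown below. This is a sketch under stated assumptions, not the actual `process.get_week_by_start_date`: the function name `split_week_by_day`, the `"date"` key, the `num_days` parameter, and the use of plain dictionaries instead of a DataFrame are all hypothetical.

```python
from collections import defaultdict
from datetime import date, timedelta

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def split_week_by_day(records, start_date, num_days=7):
    """Group records that fall within num_days of start_date by weekday name.

    Hypothetical stand-in: each record is assumed to be a dict with an
    ISO-formatted "date" key (YYYY-MM-DD).
    """
    start = date.fromisoformat(start_date)
    end = start + timedelta(days=num_days)
    week = defaultdict(list)
    for record in records:
        day = date.fromisoformat(record["date"])
        if start <= day < end:  # keep only records inside the window
            week[DAY_NAMES[day.weekday()]].append(record)
    return dict(week)


# Made-up sample records for illustration only
sample = [
    {"id": 1, "date": "2024-01-01"},
    {"id": 2, "date": "2024-01-02"},
    {"id": 3, "date": "2024-01-01"},
    {"id": 4, "date": "2024-01-08"},  # outside the seven-day window
]
week_sketch = split_week_by_day(sample, "2024-01-01")
for day_name, day_records in week_sketch.items():
    print(f"{day_name}: {len(day_records)} records")
```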


In [4]:
# Save to csv files
path = "../data/processed-datasets/medical-appointment-scheduling-system"
os.makedirs(path, exist_ok=True)

for day, df_day in week.items():
    filename = f"{path}/medical-appointments-{day}.csv"
    df_day.to_csv(filename, index=False)
    print(f"Saved {filename}")

Saved ../data/processed-datasets/medical-appointment-scheduling-system/medical-appointments-Friday.csv
Saved ../data/processed-datasets/medical-appointment-scheduling-system/medical-appointments-Monday.csv
Saved ../data/processed-datasets/medical-appointment-scheduling-system/medical-appointments-Thursday.csv
Saved ../data/processed-datasets/medical-appointment-scheduling-system/medical-appointments-Tuesday.csv
Saved ../data/processed-datasets/medical-appointment-scheduling-system/medical-appointments-Wednesday.csv


# Data Processing for Cloud Workload Dataset

The cloud workload dataset contains over 3,500 data entries.

To simplify the project, seven days' worth of data were pulled from the dataset, and each day was saved as its own CSV file.

Each CSV file contains the following columns, which are used by the Weighted Interval Scheduling (WIS) algorithm:

- ***job_interval***
  - Unique identifier for each job (such as *JOB_00001*)
- ***start_time*** and ***end_time***
  - Start and end timestamps for each job in *YYYY-MM-DD HH:MM:SS* format
- ***start_minutes*** and ***end_minutes***
  - Start and end times converted to number of minutes since midnight, which simplifies comparing times as numbers within the WIS algorithm
- ***priority***
  - Original dataset ranks priority using 'low', 'medium' and 'high'
  - These values were converted to 1, 2, and 3, which simplifies comparing priority levels within the WIS algorithm
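
The priority conversion described above amounts to a small lookup table. The sketch below is illustrative only and is not taken from `python.data_processing`; the names `PRIORITY_LEVELS` and `encode_priority` are hypothetical, and normalizing case and whitespace before the lookup is an added assumption.

```python
# Mapping stated in the text: 'low', 'medium', 'high' become 1, 2, 3
PRIORITY_LEVELS = {"low": 1, "medium": 2, "high": 3}


def encode_priority(label: str) -> int:
    """Map a textual priority label to its numeric level.

    Hypothetical helper: labels are lowercased and stripped first
    (an assumption), and unknown labels raise ValueError instead of
    being silently mapped to a default.
    """
    key = label.strip().lower()
    if key not in PRIORITY_LEVELS:
        raise ValueError(f"Unknown priority label: {label!r}")
    return PRIORITY_LEVELS[key]


labels = ["high", "low", "medium"]
encoded = [encode_priority(label) for label in labels]
print(encoded)  # [3, 1, 2]
```

Numeric levels can be used directly as interval weights, so a higher value means the job contributes more to the total weight of a schedule.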

In [5]:
# Import data
csv_file = "../data/original-datasets/cloud-workload-job-traces/cloud_workload_dataset.csv"
cloud_data = process.load_cloud_data(csv_file)

print(cloud_data)

     job_interval    job_date          start_time            end_time  \
0       JOB_00001  2024-02-28 2024-02-28 04:57:34 2024-02-28 05:24:07   
1       JOB_00002  2024-01-11 2024-01-11 03:21:15 2024-01-11 03:28:29   
2       JOB_00003  2024-01-03 2024-01-03 06:46:10 2024-01-03 06:50:28   
3       JOB_00004  2024-03-08 2024-03-08 12:00:26 2024-03-08 12:14:15   
4       JOB_00005  2024-01-26 2024-01-26 00:49:34 2024-01-26 01:08:23   
...           ...         ...                 ...                 ...   
3557    JOB_03558  2024-02-29 2024-02-29 20:38:10 2024-02-29 20:50:18   
3558    JOB_03559  2024-03-01 2024-03-01 12:04:24 2024-03-01 12:40:07   
3559    JOB_03560  2024-02-02 2024-02-02 07:37:18 2024-02-02 08:03:21   
3560    JOB_03561  2024-01-16 2024-01-16 03:26:12 2024-01-16 03:41:50   
3561    JOB_03562  2024-02-28 2024-02-28 21:28:59 2024-02-28 21:55:53   

      start_minutes  end_minutes  priority  
0               297          324         3  
1               201          208 

In [6]:
# Split jobs by day
week = process.get_week_by_start_date(cloud_data, "2024-01-01")

for day, df_day in week.items():
    print(f"{day}: {len(df_day)} jobs scheduled")

Monday: 37 jobs scheduled
Tuesday: 40 jobs scheduled
Wednesday: 46 jobs scheduled
Thursday: 45 jobs scheduled
Friday: 53 jobs scheduled
Saturday: 48 jobs scheduled
Sunday: 60 jobs scheduled


In [7]:
# Save to csv files
path = "../data/processed-datasets/cloud-workload-job-traces"
os.makedirs(path, exist_ok=True)

for day, df_day in week.items():
    filename = f"{path}/cloud-workload-{day}.csv"
    df_day.to_csv(filename, index=False)
    print(f"Saved {filename}")

Saved ../data/processed-datasets/cloud-workload-job-traces/cloud-workload-Monday.csv
Saved ../data/processed-datasets/cloud-workload-job-traces/cloud-workload-Tuesday.csv
Saved ../data/processed-datasets/cloud-workload-job-traces/cloud-workload-Wednesday.csv
Saved ../data/processed-datasets/cloud-workload-job-traces/cloud-workload-Thursday.csv
Saved ../data/processed-datasets/cloud-workload-job-traces/cloud-workload-Friday.csv
Saved ../data/processed-datasets/cloud-workload-job-traces/cloud-workload-Saturday.csv
Saved ../data/processed-datasets/cloud-workload-job-traces/cloud-workload-Sunday.csv
