# **Weekday vs Weekend Walking Distance Analysis**
## **Data Preparation Notebook**

This is the **first notebook** in the project.  
Here, we will:
1. Parse the XML file to extract walking distance data.
2. Process the data for further analysis.
3. Save the prepared data in a structured format (CSV).

---

### **Outputs**
- A processed dataset saved as `processed_walking_data.csv`, with the columns `Date`, `WalkingDistance_km`, `Weekday`, and `IsWeekend`.

### Importing Libraries
We need the following libraries for parsing and processing:
1. `xml.etree.ElementTree`: To parse the XML file.
2. `pandas`: To process and structure the extracted data.
3. `os`: To check file existence and handle paths.

In [None]:
# Import necessary libraries
import os
import pandas as pd
import xml.etree.ElementTree as ET

# Check current working directory
print("Current Working Directory:", os.getcwd())

!pip show pandas
!pip show numpy

Current Working Directory: /Users/egetas/Desktop/DSA210PROJECT
Name: pandas
Version: 2.0.3
Summary: Powerful data structures for data analysis, time series, and statistics
Home-page: 
Author: 
Author-email: The Pandas Development Team <pandas-dev@python.org>
License: BSD 3-Clause License


### Verifying File Existence
We will verify whether the `export.xml` file is in the expected location.  
If the file is not found, an appropriate message will be displayed.

In [None]:
# Specify the file path for the XML file
file_path = 'export.xml' 

# Check if the file exists
if not os.path.exists(file_path):
    print(f"File not found at: {file_path}. Please check the path.")
else:
    print(f"File found at: {file_path}")

File found at: export.xml
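The check above only prints a message, so a missing file would surface later as an error from `ET.parse`. A minimal sketch of a fail-fast variant is shown here; `require_file` is a hypothetical helper name, not part of the original notebook.

```python
from pathlib import Path


def require_file(path):
    """Return `path` as a resolved Path, or raise if it is not a file.

    Hypothetical helper for illustration only.
    """
    resolved = Path(path).expanduser().resolve()
    if not resolved.is_file():
        raise FileNotFoundError(f"Expected an export file at: {resolved}")
    return resolved
```

Calling `require_file('export.xml')` at the top of the notebook would stop execution immediately when the export is missing, instead of continuing to the parsing cell.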


### Parsing the XML File
Using `ElementTree`, we will parse the XML file to explore its structure and prepare for data extraction.

In [None]:
# Parse the XML file and inspect its root element
tree = ET.parse(file_path)
root = tree.getroot()

print("Root tag:", root.tag)
print("Attributes:", root.attrib)

Root tag: HealthData
Attributes: {'locale': 'en_AU@rg=trzzzz'}
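`ET.parse` loads the whole document into memory, which can be slow for large exports. As an alternative, the sketch below streams the file with `ET.iterparse` and yields only the walking/running records. The helper name `iter_walking_records` and the inline sample XML are assumptions made for illustration; the record type string is the one used later in this notebook.

```python
import io
import xml.etree.ElementTree as ET

WALKING_TYPE = "HKQuantityTypeIdentifierDistanceWalkingRunning"


def iter_walking_records(source):
    """Yield (startDate, value) string pairs for walking/running records.

    `source` may be a file path or a binary file-like object.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "Record" and elem.get("type") == WALKING_TYPE:
            yield elem.get("startDate"), elem.get("value")
        # Drop the element's attributes and children once it has been handled
        elem.clear()


# Made-up sample data that mimics the structure of the export
sample = b"""<HealthData locale="en_AU">
  <Record type="HKQuantityTypeIdentifierDistanceWalkingRunning" unit="km" startDate="2019-10-28 08:00:00 +0300" value="0.02648"/>
  <Record type="HKQuantityTypeIdentifierStepCount" unit="count" startDate="2019-10-28 08:00:00 +0300" value="42"/>
</HealthData>"""

rows = list(iter_walking_records(io.BytesIO(sample)))
print(rows)
```

This trades the convenience of `root.findall` for lower memory use; for a file that already parses comfortably, the original approach is simpler.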


### Extracting Walking Distance Data
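From each `Record` element of type `HKQuantityTypeIdentifierDistanceWalkingRunning`, we will extract:
- `Date`: The calendar date on which the sample started (the date part of `startDate`).
- `WalkingDistance_km`: The distance of that single sample, which covers both walking and running. One date usually has many samples, and the column name assumes the export reports distances in kilometres.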

In [5]:
# Initialize an empty list to store records
data = []

# Loop through the XML to extract walking distance data
for record in root.findall('.//Record'):
    if record.get('type') == 'HKQuantityTypeIdentifierDistanceWalkingRunning':
        # Extract relevant information
        date = record.get('startDate').split(' ')[0]  # Extract the date
        distance = float(record.get('value'))  # Extract the distance
        
        # Append to the data list
        data.append({'Date': date, 'WalkingDistance_km': distance})

# Convert the data into a DataFrame
df = pd.DataFrame(data)

# Display the first few rows of the DataFrame
print("Sample Data:")
print(df.head())

Sample Data:
         Date  WalkingDistance_km
0  2019-10-28             0.02648
1  2019-10-28             0.00547
2  2019-11-29             0.09277
3  2019-11-29             0.05989
4  2019-11-29             0.07366
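The extraction above reads `value` without checking its unit. The sketch below shows one way to verify the unit before trusting the `_km` suffix. It assumes each `Record` carries a `unit` attribute such as `km` or `mi`; `KM_PER_UNIT` and `distance_km` are hypothetical names introduced for illustration.

```python
import xml.etree.ElementTree as ET

# Assumed conversion factors to kilometres; extend if other units appear
KM_PER_UNIT = {"km": 1.0, "mi": 1.609344, "m": 0.001}


def distance_km(record):
    """Return the record's value in kilometres, or raise on an unknown unit."""
    unit = record.get("unit")
    if unit not in KM_PER_UNIT:
        raise ValueError(f"Unexpected distance unit: {unit!r}")
    return float(record.get("value")) * KM_PER_UNIT[unit]


# Made-up records for demonstration
record_km = ET.fromstring('<Record unit="km" value="0.02648"/>')
record_mi = ET.fromstring('<Record unit="mi" value="1.0"/>')
print(distance_km(record_km))
print(distance_km(record_mi))
```

Raising on an unknown unit is a deliberate choice here: a silent fallback could mix miles and kilometres in one column.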


### Processing the Data
We will enhance the dataset by:
1. Converting `Date` from a string to a pandas datetime.
2. Adding a `Weekday` column with the day of the week as an integer (0 = Monday, 6 = Sunday).
3. Adding a boolean `IsWeekend` column that is `True` for Saturday and Sunday.

In [6]:
# Convert 'Date' column to datetime format
df['Date'] = pd.to_datetime(df['Date'])

# Add a column for the day of the week (0 = Monday, 6 = Sunday)
df['Weekday'] = df['Date'].dt.dayofweek

# Add a column to indicate weekends (True for Saturday and Sunday)
df['IsWeekend'] = df['Weekday'] >= 5

# Display the updated DataFrame
print("Processed Data:")
print(df.head())

Processed Data:
        Date  WalkingDistance_km  Weekday  IsWeekend
0 2019-10-28             0.02648        0      False
1 2019-10-28             0.00547        0      False
2 2019-11-29             0.09277        4      False
3 2019-11-29             0.05989        4      False
4 2019-11-29             0.07366        4      False
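Each row in the processed data is a single sample, and several samples share one date. A weekday versus weekend comparison will likely need one total per day, so the sketch below aggregates a small made-up frame with `groupby`. The column name `DailyDistance_km` is an assumption, not something the original notebook defines.

```python
import pandas as pd

# Made-up sample rows in the same shape as the extracted records
records = pd.DataFrame({
    "Date": pd.to_datetime(["2019-10-28", "2019-10-28", "2019-11-30"]),
    "WalkingDistance_km": [0.02648, 0.00547, 0.5],
})

# Sum all samples that fall on the same calendar date
daily = (
    records.groupby("Date", as_index=False)["WalkingDistance_km"]
    .sum()
    .rename(columns={"WalkingDistance_km": "DailyDistance_km"})
)

# Recompute the calendar columns on the aggregated frame
daily["Weekday"] = daily["Date"].dt.dayofweek
daily["IsWeekend"] = daily["Weekday"] >= 5
print(daily)
```

Note that days with no recorded samples do not appear in `daily` at all; whether to treat them as zero-distance days is a separate decision.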


### Exploring the Data
Here, we perform a quick exploration of the dataset:
1. Basic summary statistics for the individual distance records (these describe single samples, not daily totals).
2. The range of dates covered in the dataset.

In [7]:
# Summary statistics of the walking distance
print("Summary Statistics:")
print(df['WalkingDistance_km'].describe())

# Check the range of dates in the dataset
print("\nDate Range:")
print(f"Start Date: {df['Date'].min()}, End Date: {df['Date'].max()}")

Summary Statistics:
count    68691.000000
mean         0.116509
std          0.147725
min          0.000410
25%          0.018320
50%          0.055670
75%          0.155215
max          1.247270
Name: WalkingDistance_km, dtype: float64

Date Range:
Start Date: 2019-10-28 00:00:00, End Date: 2025-01-04 00:00:00


### Saving the Processed Data
We will save the cleaned and processed data as a CSV file for use in subsequent notebooks.

In [8]:
# Save the processed data to a CSV file for further analysis
output_file = 'processed_walking_data.csv'
df.to_csv(output_file, index=False)

print(f"Processed data saved to: {output_file}")

Processed data saved to: processed_walking_data.csv
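CSV files do not store column types, so `Date` comes back as plain text unless the reader asks for it to be parsed. The sketch below round-trips a small made-up frame through an in-memory buffer to show how a later notebook might reload the file; with the real file, the first argument to `pd.read_csv` would be `'processed_walking_data.csv'`.

```python
import io

import pandas as pd

# Made-up rows with the same columns as the processed dataset
original = pd.DataFrame({
    "Date": pd.to_datetime(["2019-10-28", "2019-11-30"]),
    "WalkingDistance_km": [0.02648, 0.5],
    "Weekday": [0, 5],
    "IsWeekend": [False, True],
})

# Write to an in-memory buffer instead of disk for this demonstration
buffer = io.StringIO()
original.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime dtype that CSV cannot represent
restored = pd.read_csv(buffer, parse_dates=["Date"])
print(restored.dtypes)
```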
