## Data Combination Process

### Step 1: Data Extraction
Before we can combine the data, we need to first extract and prepare our raw data files. This is explained in detail in another notebook called **Data Extraction**. In that notebook, we will:

1. Download and organize the raw CSV files into the `data/raw` directory.
2. Ensure that all the CSV files are correctly formatted and contain the required data.

Make sure to run the steps in the **Data Extraction** notebook before proceeding with the next step.

### Step 2: Running the Data Combination Script
Once the raw data files are available in the `data/raw` folder, you can use the following Python script to combine them into one single CSV file. This script:

1. Reads all CSV files in the `data/raw` folder.
2. Combines them into a single DataFrame.
3. Saves the combined DataFrame as `combined_ais_data.csv` in a new folder called `processed`.

If the `processed` folder does not exist, the script will create it automatically.

Run the script to generate the combined data that can be used for further analysis or processing.

### Example Workflow:
1. **Run the Data Extraction notebook** to prepare the `data/raw` folder.
2. **Run the Python script** to combine the data into `combined_ais_data.csv` in the `processed` folder.


## Python Script to Combine the Data

The Python script provided combines multiple CSV files from the `data/raw` folder into one consolidated CSV file. Here's a breakdown of how the script works:

In [1]:
import os
import pandas as pd
import glob

# Define the project root directory name
project_root = 'DSCI-532_2025_5_vessel-vision'


current_directory = os.getcwd()


while project_root not in os.path.basename(current_directory) and current_directory != os.path.dirname(current_directory):
    current_directory = os.path.dirname(current_directory)

if project_root not in os.path.basename(current_directory):
    raise ValueError(f"Project root '{project_root}' not found.")


os.chdir(current_directory)
print(f"Current working directory: {os.getcwd()}")  


raw_data_folder = os.path.join('data', 'raw')
processed_folder_path = os.path.join('data', 'processed') 
raw_data_path = os.path.join(raw_data_folder, '*.csv')
combined_csv_path = os.path.join(processed_folder_path, 'combined_ais_data.csv')

# Ensure the raw data folder exists
if not os.path.exists(raw_data_folder):
    raise FileNotFoundError(
        f"Data folder not found: {raw_data_folder}\n"
        "Please execute the data extraction file first to generate the required data."
    )

# Get all CSV files in the 'data/raw' folder
csv_files = glob.glob(raw_data_path)

# Check if any CSV files are found
if not csv_files:
    raise ValueError(
        "No CSV files found in 'data/raw/'.\n"
        "Please execute the data extraction file first to generate the required data."
    )

# Ensure the processed folder exists
os.makedirs(processed_folder_path, exist_ok=True)

# Read and combine all CSV files
df_list = [pd.read_csv(file) for file in csv_files]
combined_df = pd.concat(df_list, ignore_index=True)

# Save the combined DataFrame to 'data/processed'
combined_df.to_csv(combined_csv_path, index=False)

print(f'✅ Combined CSV file saved at: {combined_csv_path}')


Current working directory: c:\Users\Azin\Desktop\Azin files\Azin's Document\UBC\block 5\532_viz-2\DSCI-532_2025_5_vessel-vision
✅ Combined CSV file saved at: data\processed\combined_ais_data.csv


## Filtering AIS Data for the West Coast of North America

To only include the **West Coast of North America**, you need to filter based on **longitude and latitude**.

### Steps:
1. **Identify geographic boundaries** of the West Coast of North America.
   - Covers **Canada, USA, and Mexico’s Pacific coasts**.
   
2. **Approximate latitude/longitude range**:
   - **Latitude**: 20°N to 60°N  
   - **Longitude**: -140°W to -110°W  


In [2]:
import os
import pandas as pd

# Define the project root directory name
project_root = 'DSCI-532_2025_5_vessel-vision'

# Get the absolute path of the current working directory
current_directory = os.getcwd()

# Find the root directory path
while project_root not in os.path.basename(current_directory) and current_directory != os.path.dirname(current_directory):
    current_directory = os.path.dirname(current_directory)

if project_root not in os.path.basename(current_directory):
    raise ValueError(f"Project root '{project_root}' not found.")

# Change to the root directory
os.chdir(current_directory)
print(f"Current working directory: {os.getcwd()}")  # Debugging statement

# Define paths
preprocessed_folder = os.path.join('data', 'processed')
combined_csv_path = os.path.join(preprocessed_folder, 'combined_ais_data.csv')
filtered_csv_path = os.path.join(preprocessed_folder, 'ais_west_coast.csv')

# Ensure the preprocessed data folder exists
if not os.path.exists(preprocessed_folder):
    raise FileNotFoundError(
        f"Preprocessed data folder not found: {preprocessed_folder}\n"
        "Please execute the data extraction and combination steps first."
    )

# Load the combined AIS dataset
df = pd.read_csv(combined_csv_path)

# Define latitude and longitude boundaries for the West Coast
lat_min, lat_max = 20, 60   # Covers from Mexico to Alaska
lon_min, lon_max = -140, -110  # Covers the Pacific coast range

# Filter data for West Coast
west_coast_df = df[
    (df['LAT'].between(lat_min, lat_max)) &
    (df['LON'].between(lon_min, lon_max))
]

# Save the filtered dataset
west_coast_df.to_csv(filtered_csv_path, index=False)

print(f'✅ Filtered dataset saved at: {filtered_csv_path} with {len(west_coast_df)} records.')

Current working directory: c:\Users\Azin\Desktop\Azin files\Azin's Document\UBC\block 5\532_viz-2\DSCI-532_2025_5_vessel-vision
✅ Filtered dataset saved at: data\processed\ais_west_coast.csv with 3716130 records.


## Split filtered data into multiple CSV files
This script splits a large dataset (ais_west_coast.csv) into smaller, approximately 50MB-sized CSV files and stores them in a split-data folder. Before processing the data, the script ensures that it is running from the root directory of the project (DSCI-532_2025_5_vessel-vision). The dataset is read in chunks of 100,000 rows at a time, and the size of each chunk is calculated to ensure it doesn’t exceed 50MB. Once a chunk reaches the target size, it is written to a CSV file with sequential names like ais_chunk_0.csv, ais_chunk_1.csv, etc. If there is leftover data after reading the entire dataset, it is saved to a final chunk file.

In [5]:

import pandas as pd
import os

# Define parameters
data_path = "data/processed/ais_west_coast.csv"
output_folder = "data/split-data/"  # Specify that 'split-data' should be inside the 'data' folder
project_root = "DSCI-532_2025_5_vessel-vision"
chunk_size_in_mb = 50  # Target chunk size in MB

# Ensure the script is running from the root directory
current_directory = os.getcwd()
if os.path.basename(current_directory) != project_root:
    root_directory = os.path.abspath(os.path.join(current_directory, '..', project_root))
    os.chdir(root_directory)
    print(f"Changed directory to project root: {root_directory}")
else:
    print(f"Already in the root directory: {current_directory}")

# Create the split-data folder inside the 'data' folder if it doesn't exist
os.makedirs(output_folder, exist_ok=True)

# Get the size of the file in bytes
file_size = os.path.getsize(data_path)  # In bytes
target_chunk_size = chunk_size_in_mb * 1024 * 1024  # Convert to bytes

# Initialize variables
chunk_index = 0
current_chunk = []

# Read the file in chunks based on the file size (not rows)
for chunk in pd.read_csv(data_path, chunksize=100000):  # Read in manageable chunks
    current_chunk.append(chunk)
    current_chunk_size = sum([len(str(row)) for row in chunk.values.flatten()])  # Calculate approximate size of chunk
    
    # If the current chunk is larger than the target size, write it to a CSV file
    if current_chunk_size >= target_chunk_size:
        chunk_filename = f"ais_chunk_{chunk_index}.csv"
        chunk_path = os.path.join(output_folder, chunk_filename)
        pd.concat(current_chunk).to_csv(chunk_path, index=False)  # Combine chunk and write
        print(f"Written chunk {chunk_index} to {chunk_path}")
        
        # Reset for next chunk
        current_chunk = []
        chunk_index += 1

# If there is any leftover data in the current_chunk list, write it out
if current_chunk:
    chunk_filename = f"ais_chunk_{chunk_index}.csv"
    chunk_path = os.path.join(output_folder, chunk_filename)
    pd.concat(current_chunk).to_csv(chunk_path, index=False)
    print(f"Written final chunk {chunk_index} to {chunk_path}")

print("Data splitting completed.")


Already in the root directory: c:\Users\Azin\Desktop\Azin files\Azin's Document\UBC\block 5\532_viz-2\DSCI-532_2025_5_vessel-vision
Written final chunk 0 to data/split-data/ais_chunk_0.csv
Data splitting completed.
