## Data Combination Process

### Step 1: Data Extraction
Before combining the data, we first need to extract and prepare the raw data files. This is explained in detail in a separate notebook called **Data Extraction**. In that notebook, we:

1. Download and organize the raw CSV files into the `data/raw` directory.
2. Ensure that all the CSV files are correctly formatted and contain the required data.

Make sure to run the steps in the **Data Extraction** notebook before proceeding with the next step.
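
The format check in item 2 can be sketched as a small header check. This is a minimal sketch, not the check used in the **Data Extraction** notebook: the helper name `missing_columns` and the `REQUIRED_COLUMNS` set are assumptions, chosen because `LAT`, `LON`, and `VesselType` are the only columns the later cells in this notebook read.

```python
import glob
import os

import pandas as pd

# Assumption: these are the only columns the later cells in this notebook rely on.
REQUIRED_COLUMNS = {"LAT", "LON", "VesselType"}


def missing_columns(csv_path, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from the CSV header."""
    header = pd.read_csv(csv_path, nrows=0)  # nrows=0 parses the header only
    return sorted(set(required) - set(header.columns))


# Report every raw file that lacks a required column (prints nothing if all pass)
for path in sorted(glob.glob(os.path.join("data", "raw", "*.csv"))):
    missing = missing_columns(path)
    if missing:
        print(f"{path}: missing columns {missing}")
```

Reading only the header keeps the check fast even when the raw files are large.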

### Step 2: Running the Data Combination Script
Once the raw data files are available in the `data/raw` folder, you can use the following Python script to combine them into one single CSV file. This script:

1. Reads all CSV files in the `data/raw` folder.
2. Combines them into a single DataFrame.
3. Saves the combined DataFrame as `combined_ais_data.csv` in the `data/processed` folder.

If the `data/processed` folder does not exist, the script creates it automatically.

Run the script to generate the combined data that can be used for further analysis or processing.

### Example Workflow:
1. **Run the Data Extraction notebook** to prepare the `data/raw` folder.
2. **Run the Python script** to combine the data into `data/processed/combined_ais_data.csv`.


## Python Script to Combine the Data

The Python script provided combines multiple CSV files from the `data/raw` folder into one consolidated CSV file:

In [5]:
import os
import pandas as pd
import glob

# Define the project root directory name
project_root = 'DSCI-532_2025_5_vessel-vision'

# Get the absolute path of the current working directory
current_directory = os.getcwd()

# Walk up the directory tree until the project root is found
while project_root not in os.path.basename(current_directory) and current_directory != os.path.dirname(current_directory):
    current_directory = os.path.dirname(current_directory)

if project_root not in os.path.basename(current_directory):
    raise ValueError(f"Project root '{project_root}' not found.")

# Change to the root directory so the relative paths below resolve correctly
os.chdir(current_directory)
print(f"Current working directory: {os.getcwd()}")

# Define input and output paths relative to the project root
raw_data_folder = os.path.join('data', 'raw')
processed_folder_path = os.path.join('data', 'processed')
raw_data_path = os.path.join(raw_data_folder, '*.csv')
combined_csv_path = os.path.join(processed_folder_path, 'combined_ais_data.csv')

# Ensure the raw data folder exists
if not os.path.exists(raw_data_folder):
    raise FileNotFoundError(
        f"Data folder not found: {raw_data_folder}\n"
        "Please execute the data extraction file first to generate the required data."
    )

# Get all CSV files in the 'data/raw' folder
csv_files = glob.glob(raw_data_path)

# Check if any CSV files are found
if not csv_files:
    raise ValueError(
        "No CSV files found in 'data/raw/'.\n"
        "Please execute the data extraction file first to generate the required data."
    )

# Ensure the processed folder exists
os.makedirs(processed_folder_path, exist_ok=True)

# Read and combine all CSV files
df_list = [pd.read_csv(file) for file in csv_files]
combined_df = pd.concat(df_list, ignore_index=True)

# Save the combined DataFrame to 'data/processed'
combined_df.to_csv(combined_csv_path, index=False)

print(f'✅ Combined CSV file saved at: {combined_csv_path}')


Current working directory: c:\Users\Azin\Desktop\Azin files\Azin's Document\UBC\block 5\532_viz-2\DSCI-532_2025_5_vessel-vision
✅ Combined CSV file saved at: data\processed\combined_ais_data.csv
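
As a quick sanity check on the combination step, the row count of the combined file can be compared with the sum of the row counts of the source files. The sketch below is illustrative only; the helper names `rows_per_file` and `combined_matches_sources` are hypothetical and not part of the original script.

```python
import pandas as pd


def rows_per_file(csv_paths):
    """Map each CSV path to its number of data rows (header excluded)."""
    return {path: len(pd.read_csv(path)) for path in csv_paths}


def combined_matches_sources(combined_path, csv_paths):
    """Return True if the combined CSV has as many rows as all sources together."""
    expected = sum(rows_per_file(csv_paths).values())
    actual = len(pd.read_csv(combined_path))
    return expected == actual
```

In the cell above, this could be called as `combined_matches_sources(combined_csv_path, csv_files)` once the combined file has been written.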


## Filtering AIS Data for the West Coast of North America

To keep only vessels off the **West Coast of North America**, filter the records by **latitude and longitude**.

### Steps:
1. **Identify geographic boundaries** of the West Coast of North America.
   - Covers **Canada, USA, and Mexico’s Pacific coasts**.

2. **Approximate latitude/longitude range**:
   - **Latitude**: 20°N to 60°N (`LAT` between 20 and 60)
   - **Longitude**: 140°W to 110°W (`LON` between -140 and -110)

3. **Keep only Passenger and Cargo vessels**, identified by `VesselType` codes 60 to 69 and 70 to 79 respectively.


In [6]:
import os
import pandas as pd

# Define the project root directory name
project_root = 'DSCI-532_2025_5_vessel-vision'

# Get the absolute path of the current working directory
current_directory = os.getcwd()

# Find the root directory path
while project_root not in os.path.basename(current_directory) and current_directory != os.path.dirname(current_directory):
    current_directory = os.path.dirname(current_directory)

if project_root not in os.path.basename(current_directory):
    raise ValueError(f"Project root '{project_root}' not found.")

# Change to the root directory
os.chdir(current_directory)
print(f"Current working directory: {os.getcwd()}")  # Debugging statement

# Define paths
preprocessed_folder = os.path.join('data', 'processed')
combined_csv_path = os.path.join(preprocessed_folder, 'combined_ais_data.csv')
filtered_csv_path = os.path.join(preprocessed_folder, 'ais_west_coast.csv')

# Ensure the preprocessed data folder exists
if not os.path.exists(preprocessed_folder):
    raise FileNotFoundError(
        f"Preprocessed data folder not found: {preprocessed_folder}\n"
        "Please execute the data extraction and combination steps first."
    )

# Load the combined AIS dataset
df = pd.read_csv(combined_csv_path)

# Define latitude and longitude boundaries for the West Coast
lat_min, lat_max = 20, 60   # Covers from Mexico to Alaska
lon_min, lon_max = -140, -110  # Covers the Pacific coast range

# Filter data for West Coast; .copy() makes an independent DataFrame so that
# adding a column below does not raise pandas' SettingWithCopyWarning
west_coast_df = df[
    (df['LAT'].between(lat_min, lat_max)) &
    (df['LON'].between(lon_min, lon_max))
].copy()

# Add vessel type name column (AIS codes 60-69: Passenger, 70-79: Cargo)
west_coast_df['Vessel Type Name'] = west_coast_df['VesselType'].apply(
    lambda x: 'Passenger' if 60 <= x <= 69 else ('Cargo' if 70 <= x <= 79 else None)
)

# Filter only Passenger and Cargo vessels
filtered_west_coast_df = west_coast_df[west_coast_df['Vessel Type Name'].notna()]

# Save the filtered dataset
filtered_west_coast_df.to_csv(filtered_csv_path, index=False)

print(f'✅ Filtered dataset saved at: {filtered_csv_path} with {len(filtered_west_coast_df)} records.')


Current working directory: c:\Users\Azin\Desktop\Azin files\Azin's Document\UBC\block 5\532_viz-2\DSCI-532_2025_5_vessel-vision


✅ Filtered dataset saved at: data\processed\ais_west_coast.csv with 579340 records.
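
The vessel type labelling can also be written without a row-by-row `apply`. The following is a minimal sketch of a vectorized alternative using boolean masks; the function name `vessel_type_name` is hypothetical, and the code ranges (60 to 69 for Passenger, 70 to 79 for Cargo) are taken from the cell above.

```python
import pandas as pd


def vessel_type_name(codes):
    """Label vessel type codes: 60-69 -> 'Passenger', 70-79 -> 'Cargo', else missing."""
    codes = pd.to_numeric(codes, errors="coerce")  # non-numeric codes become NaN
    names = pd.Series(None, index=codes.index, dtype="object")
    names.loc[codes.between(60, 69)] = "Passenger"  # between() is inclusive
    names.loc[codes.between(70, 79)] = "Cargo"
    return names


# Small illustration with made-up codes
example = pd.DataFrame({"VesselType": [60, 75, 30, None]})
example["Vessel Type Name"] = vessel_type_name(example["VesselType"])
print(example[example["Vessel Type Name"].notna()])
```

Codes outside both ranges, as well as missing codes, stay unlabelled and are dropped by the same `notna()` filter used above.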


## Split filtered data into multiple CSV files
This script splits the filtered dataset (`ais_west_coast.csv`) into smaller CSV files and stores them in the `data/split-data` folder, so that each file stays under the 100 MB maximum size allowed by GitHub. Before processing the data, the script makes sure it is running from the root directory of the project (`DSCI-532_2025_5_vessel-vision`) and recreates the output folder from scratch. The dataset is read in chunks of 1,000,000 rows at a time, and the in-memory size of each chunk is estimated with `memory_usage(deep=True)`. While a chunk exceeds the 100 MB target, its first half is written to a file named like `ais_chunk_0_part1.csv` and the second half is checked again. Whatever remains is written to a file named like `ais_chunk_2.csv`. The chunk index increases by one for every file written.

In [16]:
import pandas as pd
import os
import shutil

# Define parameters
data_path = "data/processed/ais_west_coast.csv"
output_folder = "data/split-data/"
project_root = "DSCI-532_2025_5_vessel-vision"
chunk_size_in_mb = 100  # Target chunk size in MB (maximum size allowed by GitHub)

# Find the project root by walking up from the current working directory
# (avoids a hard-coded absolute path that only exists on one machine)
current_directory = os.getcwd()
root_directory = current_directory
while project_root not in os.path.basename(root_directory) and root_directory != os.path.dirname(root_directory):
    root_directory = os.path.dirname(root_directory)

if project_root not in os.path.basename(root_directory):
    raise ValueError(f"Project root '{project_root}' not found.")

# Change to the root directory if we are not already there
if current_directory != root_directory:
    os.chdir(root_directory)
    print(f"Changed directory to project root: {root_directory}")
else:
    print(f"Already in the root directory: {current_directory}")

# Remove the split-data folder if it exists and recreate it
if os.path.exists(output_folder):
    shutil.rmtree(output_folder)  # Remove the existing folder
    print(f"Removed existing folder: {output_folder}")

# Create the split-data folder again
os.makedirs(output_folder, exist_ok=True)
print(f"Created new folder: {output_folder}")

# Get the size of the file in bytes
file_size = os.path.getsize(data_path)  # In bytes
target_chunk_size = chunk_size_in_mb * 1024 * 1024  # Convert to bytes

# Read the dataset in chunks
chunk_index = 0
for chunk in pd.read_csv(data_path, chunksize=1000000):  # Read in manageable chunks
    # Calculate approximate size of the chunk (in bytes)
    chunk_size = chunk.memory_usage(deep=True).sum()

    # If the chunk size exceeds the target size, split the chunk
    while chunk_size > target_chunk_size:
        # Split into smaller chunks (split by rows)
        half_chunk = chunk.iloc[:len(chunk) // 2]
        chunk = chunk.iloc[len(chunk) // 2:]

        # Save the first half
        half_chunk_filename = f"ais_chunk_{chunk_index}_part1.csv"
        half_chunk.to_csv(os.path.join(output_folder, half_chunk_filename), index=False)
        print(f"Written {half_chunk_filename}")
        chunk_index += 1
        
        # Recalculate chunk size for the remaining part
        chunk_size = chunk.memory_usage(deep=True).sum()

    # Save the remaining chunk
    chunk_filename = f"ais_chunk_{chunk_index}.csv"
    chunk.to_csv(os.path.join(output_folder, chunk_filename), index=False)
    print(f"Written {chunk_filename}")
    chunk_index += 1

print("Data splitting completed.")




Changed directory to project root: c:\Users\Azin\Desktop\Azin files\Azin's Document\UBC\block 5\532_viz-2\DSCI-532_2025_5_vessel-vision
Created new folder: data/split-data/
Written ais_chunk_0_part1.csv
Written ais_chunk_1_part1.csv
Written ais_chunk_2.csv
Data splitting completed.
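
Because `memory_usage(deep=True)` measures the in-memory size of a chunk rather than the size of the CSV written to disk, it can be worth checking the resulting files directly. The sketch below is a hedged illustration: the helper name `oversized_files` and the constant `GITHUB_LIMIT_BYTES` are assumptions, with the limit set to the same 100 MB value used for `chunk_size_in_mb` above.

```python
import glob
import os

# Assumption: same 100 MB target as chunk_size_in_mb in the cell above
GITHUB_LIMIT_BYTES = 100 * 1024 * 1024


def oversized_files(folder, limit_bytes=GITHUB_LIMIT_BYTES):
    """Return (path, size_in_bytes) for each CSV in folder larger than limit_bytes."""
    paths = sorted(glob.glob(os.path.join(folder, "*.csv")))
    sizes = [(path, os.path.getsize(path)) for path in paths]
    return [(path, size) for path, size in sizes if size > limit_bytes]


# Prints nothing when every split file is within the limit
for path, size in oversized_files(os.path.join("data", "split-data")):
    print(f"{path} is {size / (1024 * 1024):.1f} MB and exceeds the limit")
```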


In [4]:
import os

# Print the current working directory
current_directory = os.getcwd()
print(f"Current working directory: {current_directory}")

# Define the project root name (modify this as needed for your project)
project_root = "DSCI-532_2025_5_vessel-vision"

# This notebook runs from the `notebooks` folder, so the project root is its parent
root_directory = os.path.abspath(os.path.join(current_directory, '..'))
if os.path.basename(root_directory) != project_root:
    print(f"Warning: expected the project root to be named '{project_root}'.")

# Print the root directory (absolute path)
print(f"Project root directory: {root_directory}")


Current working directory: c:\Users\Azin\Desktop\Azin files\Azin's Document\UBC\block 5\532_viz-2\DSCI-532_2025_5_vessel-vision\notebooks
Project root directory: c:\Users\Azin\Desktop\Azin files\Azin's Document\UBC\block 5\532_viz-2\DSCI-532_2025_5_vessel-vision
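
Since several cells in this notebook repeat the same search for the project root, that logic could be collected in one helper. The following is a minimal sketch using `pathlib`; the function name `find_project_root` is hypothetical, and it keeps the same substring match on the folder name that the earlier cells use.

```python
from pathlib import Path


def find_project_root(name, start=None):
    """Return the closest directory at or above start whose name contains name."""
    start = Path(start).resolve() if start is not None else Path.cwd()
    for candidate in (start, *start.parents):
        if name in candidate.name:
            return candidate
    raise ValueError(f"Project root '{name}' not found.")
```

A cell could then call `os.chdir(find_project_root(project_root))` instead of repeating the `while` loop, and the same call would work whether the notebook is started from the project root or from the `notebooks` folder.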
