## Data Combination Process

### Step 1: Data Extraction
Before combining the data, we need to extract and prepare the raw data files. A separate notebook, **Data Extraction**, covers this in detail. In that notebook, we:

1. Download and organize the raw CSV files into the `data/raw` directory.
2. Ensure that all the CSV files are correctly formatted and contain the required data.

Make sure to run the steps in the **Data Extraction** notebook before proceeding with the next step.
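As a quick sanity check on the extraction output, the sketch below reports which raw files lack expected columns. It is only an illustration: `find_missing_columns` is a hypothetical helper, not part of the project, and the required column names are whatever you pass in. It reads just the header row of each file, so it stays fast even for large CSVs.

```python
import csv
import glob
import os


def find_missing_columns(raw_dir, required_columns):
    """Map each CSV file in raw_dir to the required columns its header lacks.

    Hypothetical helper for illustration; files with no missing columns
    are left out of the result.
    """
    problems = {}
    for path in sorted(glob.glob(os.path.join(raw_dir, "*.csv"))):
        # Read only the header row so large files are not loaded into memory
        with open(path, newline="", encoding="utf-8") as handle:
            header = next(csv.reader(handle), [])
        missing = [col for col in required_columns if col not in header]
        if missing:
            problems[path] = missing
    return problems
```

For example, `find_missing_columns("data/raw", ["MMSI", "LAT", "LON"])` would return an empty dictionary if every file contains those columns. The column names here are assumed for illustration; substitute the ones your analysis actually needs.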

### Step 2: Running the Data Combination Script
Once the raw data files are available in the `data/raw` folder, use the Python script shown below to combine them into a single CSV file. This script:

1. Reads all CSV files in the `data/raw` folder.
2. Combines them into a single DataFrame.
3. Saves the combined DataFrame as `combined_ais_data.csv` in a `processed` folder at the project root.

If the `processed` folder does not exist, the script will create it automatically.

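Run the script from the project root directory (`DSCI-532_2025_5_vessel-vision`); it raises an error if started from anywhere else. The combined data can then be used for further analysis or processing.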

### Example Workflow:
1. **Run the Data Extraction notebook** to prepare the `data/raw` folder.
2. **Run the Python script** to combine the data into `combined_ais_data.csv` in the `processed` folder.
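After both steps have run, it can be worth confirming that no rows were lost. The sketch below compares the number of data rows in the raw files with the number in the combined file. The function names are hypothetical, and the check assumes every raw file has a single header row. Blank lines are ignored, mirroring the default behaviour of `pd.read_csv`.

```python
import csv
import glob
import os


def count_data_rows(path):
    """Count the non-blank rows in a CSV file, excluding the header row."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header row
        return sum(1 for row in reader if row)


def row_counts_match(raw_dir, combined_path):
    """Return True if the combined file has as many rows as all raw files together."""
    raw_total = sum(
        count_data_rows(path)
        for path in glob.glob(os.path.join(raw_dir, "*.csv"))
    )
    return raw_total == count_data_rows(combined_path)
```

From the project root, `row_counts_match("data/raw", "processed/combined_ais_data.csv")` should return `True` after a successful run.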


## Explanation of the Python Script to Combine the Data

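The script below combines all CSV files from the `data/raw` folder into one consolidated CSV file. It first checks that it is being run from the project root, then creates the `processed` folder if needed, reads each CSV file into a pandas DataFrame, concatenates the DataFrames, and writes the result to disk.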

```python
import glob
import os

import pandas as pd

# Name of the project's root directory; the script must be run from there
project_root = 'DSCI-532_2025_5_vessel-vision'

# Name of the current working directory (last path component only)
current_directory = os.path.basename(os.getcwd())

if current_directory != project_root:
    raise ValueError(f"You are not in the root directory. Please navigate to the root directory: '{project_root}'.")

# Define the paths
raw_data_path = 'data/raw/*.csv'
processed_folder_path = 'processed'
combined_csv_path = os.path.join(processed_folder_path, 'combined_ais_data.csv')

# Create the processed folder if it does not already exist
os.makedirs(processed_folder_path, exist_ok=True)

# Get all CSV files in the 'data/raw' folder, sorted for a reproducible row order
csv_files = sorted(glob.glob(raw_data_path))

# pd.concat fails on an empty list, so stop early with a clearer message
if not csv_files:
    raise FileNotFoundError(f"No CSV files found matching '{raw_data_path}'.")

# Read each file into its own pandas DataFrame
df_list = [pd.read_csv(file) for file in csv_files]

# Concatenate all DataFrames into one, renumbering the index from 0
combined_df = pd.concat(df_list, ignore_index=True)

# Save the combined DataFrame to the 'processed' folder
combined_df.to_csv(combined_csv_path, index=False)

print(f'Combined CSV file saved at: {combined_csv_path}')
```


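Note: the script requires pandas. If it fails with `ModuleNotFoundError: No module named 'pandas'`, install pandas in the active environment (for example, `pip install pandas`) and run it again.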
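The root-directory check in the script only passes when the working directory is the project root itself, which can be inconvenient when the notebook lives in a subfolder. One possible alternative, sketched here with a hypothetical helper `find_project_root`, is to walk upward from the current directory until a folder with the project's name is found, then build paths relative to it. This is an illustration, not the project's actual approach.

```python
from pathlib import Path


def find_project_root(start, root_name):
    """Return the nearest directory at or above `start` named `root_name`.

    Hypothetical helper for illustration; raises ValueError if no such
    directory exists on the path from `start` up to the filesystem root.
    """
    start = Path(start).resolve()
    for candidate in (start, *start.parents):
        if candidate.name == root_name:
            return candidate
    raise ValueError(f"No directory named '{root_name}' found at or above '{start}'.")
```

With this in place, `root = find_project_root(Path.cwd(), 'DSCI-532_2025_5_vessel-vision')` would give a base path, and the glob pattern could become `str(root / 'data' / 'raw' / '*.csv')`, so the script would behave the same from any folder inside the project.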