The goal of this file is to download and filter a year's worth of data (or all the data that is available as single-day files on the AIS webpage).
Loop over all files and do:
1) Download
2) Unzip
3) Filter to the Oresund region
4) Sample at 5-minute intervals
5) Do additional filtering steps
6) Save so that each ship is stored in its own file
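
Steps 3, 4 and 6 can be sketched roughly as below. This is only a minimal sketch under assumptions, not the final filtering code: the column names (`Timestamp`, `MMSI`, `Latitude`, `Longitude`) and the bounding box values are placeholders chosen for illustration, and the real CSV header and region limits may differ.

```python
from pathlib import Path
import tempfile

import pandas as pd

# Hypothetical bounding box for the Oresund region (assumed values, adjust as needed)
LAT_MIN, LAT_MAX = 55.3, 56.3
LON_MIN, LON_MAX = 12.3, 13.2


def filter_region(df):
    """Step 3: keep only rows whose position lies inside the bounding box."""
    inside = df["Latitude"].between(LAT_MIN, LAT_MAX) & df["Longitude"].between(LON_MIN, LON_MAX)
    return df[inside]


def sample_5min(df):
    """Step 4: keep the first message per ship in each 5-minute window."""
    df = df.sort_values("Timestamp")
    window = df["Timestamp"].dt.floor("5min")
    keep = ~df.assign(window=window).duplicated(subset=["MMSI", "window"])
    return df[keep]


def save_per_ship(df, out_dir):
    """Step 6: append each ship's rows to its own file, named after the ship ID."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for mmsi, track in df.groupby("MMSI"):
        path = out_dir / f"{mmsi}.csv"
        # Write the header only when the file is created, so later days can be appended
        track.to_csv(path, mode="a", header=not path.exists(), index=False)


# Tiny made-up example: ship 333 lies outside the box, ship 111 sends two
# messages within the same 5-minute window.
demo = pd.DataFrame({
    "Timestamp": pd.to_datetime([
        "2024-04-01 00:00:10", "2024-04-01 00:02:00", "2024-04-01 00:06:00",
        "2024-04-01 00:01:00", "2024-04-01 00:03:00",
    ]),
    "MMSI": [111, 111, 111, 222, 333],
    "Latitude": [55.7, 55.7, 55.8, 55.9, 57.5],
    "Longitude": [12.7, 12.7, 12.8, 12.9, 10.0],
})
filtered = sample_5min(filter_region(demo))
out_dir = tempfile.mkdtemp()
save_per_ship(filtered, out_dir)
```

Appending per ship keeps the loop over days simple, at the cost of reopening many small files for every day.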

In [29]:
import pandas as pd
import requests
from tqdm import tqdm
import os

import concurrent.futures

import zipfile

In [None]:
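# Function to download a specific chunk (byte range) of the file
def download_chunk(start, end, chunk_idx, progress_bar, url, fname):
    """Download bytes start..end (inclusive) of url into '<fname>.part<chunk_idx>'."""
    headers = {"Range": f"bytes={start}-{end}"}
    chunk_fname = f"{fname}.part{chunk_idx}"

    # stream=True downloads in pieces and avoids loading the whole range into memory
    with requests.get(url, headers=headers, stream=True) as r, open(chunk_fname, "wb") as f:
        r.raise_for_status()  # Fail early on HTTP errors instead of saving an error page
        if r.status_code != 206:  # 206 = Partial Content; 200 means the Range header was ignored
            raise RuntimeError(f"Server did not honor the Range request for {url}")
        for chunk in r.iter_content(chunk_size=1024 * 1024 * 10):  # 10 MB pieces
            f.write(chunk)
            progress_bar.update(len(chunk))

    return chunk_fname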

In [None]:
url_home = "https://web.ais.dk/aisdata/" #Initializing URL 
date_0 = pd.Timestamp('2024-04-01') #Start date of download
date_end = pd.Timestamp('2025-03-28') #End date of download
data_home = "../Data/25_3_data_getting/data_generation/" #Local directory to save data

#If the download is interrupted or needs to be resumed/extended, make a set of existing ships
#Only files whose name starts with an integer ID are counted; leftover zip/csv files are skipped
existing_ships = set()
for ship in os.listdir(data_home):
    stem = ship.split('.')[0]
    if stem.isdigit():
        existing_ships.add(int(stem))

num_chunks = 4  # Number of parallel chunks while downloading (adjust based on network speed)

while date_0 < date_end: #Download until the end date
    url = url_home + "aisdk-" + date_0.strftime("%Y-%m-%d") + ".zip" #URL for the specific date
    fname = data_home + "aisdk-" + date_0.strftime("%Y-%m-%d") + ".zip" #Local file name
    response = requests.head(url) #Get information about the file
    response.raise_for_status() #Stop here if the file for this date is not available
    file_size = int(response.headers.get("Content-Length", 0)) #File size in bytes
    chunk_size = file_size // num_chunks #Size of each chunk
    print(f"Downloading {url} ({file_size/10**6} MB) in {num_chunks} chunks of size {chunk_size/10**6} MB each.")

    chunk_ranges = [(i * chunk_size, (i + 1) * chunk_size - 1, i) for i in range(num_chunks)] #Creating ranges for each chunk (start, end, chunk_idx)
    chunk_ranges[-1] = (chunk_ranges[-1][0], file_size - 1, num_chunks - 1)  # Ensure last chunk gets remaining bytes

    #Downloading the file in parallel using ThreadPoolExecutor
    #Using tqdm to show progress bar for the entire download
    with tqdm(total=file_size, unit="B", unit_scale=True, desc="Downloading") as progress_bar:
        with concurrent.futures.ThreadPoolExecutor(max_workers=num_chunks) as executor:
            chunk_files = list(executor.map(lambda args: download_chunk(*args, progress_bar, url, fname), chunk_ranges))

    #Merging the downloaded chunks, in order, into a single file
    #Each temporary chunk file is deleted as soon as it has been appended
    with open(fname, "wb") as output_file:
        for chunk_file in chunk_files:
            with open(chunk_file, "rb") as part:
                output_file.write(part.read())
            os.remove(chunk_file)


    with zipfile.ZipFile(fname, "r") as zip_ref:
        zip_ref.extractall(data_home)
        print("Files extracted")
    os.remove(fname)  # Delete the zip file after extraction

    #Now onto the csv filtering!
    fname = data_home + "aisdk-" + date_0.strftime("%Y-%m-%d") + ".csv" #Local file name for the extracted CSV file

    date_0 += pd.DateOffset(days=1) #Increment date by one day
    break

Downloading https://web.ais.dk/aisdata/aisdk-2024-04-01.zip (552.159067 MB) in 4 chunks of size 138.039766 MB each.
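
The byte-range arithmetic used in the loop above can be checked in isolation. The helper name `byte_ranges` below is hypothetical and introduced only for this check; it repeats the same computation as the loop (inclusive end offsets, with the last chunk absorbing the remainder of the integer division).

```python
def byte_ranges(file_size, num_chunks):
    """Return (start, end, chunk_idx) tuples with inclusive ends covering 0..file_size-1."""
    chunk_size = file_size // num_chunks
    ranges = [(i * chunk_size, (i + 1) * chunk_size - 1, i) for i in range(num_chunks)]
    # The last chunk also takes the bytes left over by the integer division
    start, _, idx = ranges[-1]
    ranges[-1] = (start, file_size - 1, idx)
    return ranges


small = byte_ranges(10, 4)
# File size taken from the printed output above
real = byte_ranges(552159067, 4)
total_bytes = sum(end - start + 1 for start, end, _ in real)
```

For a 10-byte file split four ways the ranges are 0-1, 2-3, 4-5 and 6-9, so the last chunk is larger than the others whenever the size is not divisible by `num_chunks`.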


TO-DO:
Think hard about how best to store the data (for real use later):
Is it one single large file?
Is it ship-wise?
Is it date-wise?
Is it both ship-wise and date-wise?

I will do many computations for a single ship, but in the end I want to, e.g., plot by day.

Take inspiration from what we did in the ML project?

Find a more efficient way to merge parquet or CSV files (inspiration from the chunked download).

The data will still be very large for a whole year (gigabytes).
Find out how to separate the ships' static data out, and maybe store it separately (pretty easy to postpone if I save as parquet, I think).
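
One possible answer to the "both ship-wise and date-wise" question is a directory per ship with one file per day, so that a single ship and a single day can each be read without touching the rest. The sketch below is only an illustration of that idea, not a decision: the layout `mmsi=<id>/<date>.csv`, the helper names and the column names `Timestamp` and `MMSI` are all assumptions, and CSV is used only to keep the example dependency-free (the same layout would work with parquet files).

```python
from pathlib import Path
import tempfile

import pandas as pd


def write_partitioned(df, root):
    """Write one file per (ship, day) under root/mmsi=<id>/<YYYY-MM-DD>.csv."""
    root = Path(root)
    days = df["Timestamp"].dt.strftime("%Y-%m-%d").rename("day")
    for (mmsi, day), part in df.groupby([df["MMSI"], days]):
        ship_dir = root / f"mmsi={mmsi}"
        ship_dir.mkdir(parents=True, exist_ok=True)
        part.to_csv(ship_dir / f"{day}.csv", index=False)


def read_ship(root, mmsi):
    """Load all days of a single ship (for the per-ship computations)."""
    files = sorted((Path(root) / f"mmsi={mmsi}").glob("*.csv"))
    return pd.concat([pd.read_csv(f, parse_dates=["Timestamp"]) for f in files], ignore_index=True)


def read_day(root, day):
    """Load all ships for a single day (for the per-day plots)."""
    files = sorted(Path(root).glob(f"mmsi=*/{day}.csv"))
    return pd.concat([pd.read_csv(f, parse_dates=["Timestamp"]) for f in files], ignore_index=True)


# Made-up data: ship 111 on two days, ship 222 on one day
demo = pd.DataFrame({
    "Timestamp": pd.to_datetime([
        "2024-04-01 00:00:00", "2024-04-01 00:05:00",
        "2024-04-02 00:00:00", "2024-04-01 00:00:00",
    ]),
    "MMSI": [111, 111, 111, 222],
})
root = tempfile.mkdtemp()
write_partitioned(demo, root)
ship_111 = read_ship(root, 111)
day_1 = read_day(root, "2024-04-01")
```

The trade-off is a large number of small files for a whole year, which may or may not be acceptable.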

In [None]:
# with open("file1.csv", "a") as f1, open("file2.csv", "r") as f2:
#     next(f2)  # Skip header in second file
#     for line in f2:
#         f1.write(line)


# import dask.dataframe as dd

# df1 = dd.read_csv("file1.csv")
# df2 = dd.read_csv("file2.csv")

# merged = dd.concat([df1, df2])  # Efficient, lazy merge
# merged.to_csv("merged_dask.csv", single_file=True, index=False)  # Write back as a single CSV
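
A runnable variant of the first commented-out idea, as a minimal sketch: it appends one CSV to another in fixed-size blocks instead of line by line, in the same spirit as the chunk merge in the download loop. The helper name `append_csv` is hypothetical, and the sketch assumes both files share the same header and that every file ends with a newline.

```python
import shutil
import tempfile
from pathlib import Path


def append_csv(dst, src, block_size=1024 * 1024):
    """Append the rows of src to dst, writing the header only if dst is new or empty."""
    dst = Path(dst)
    write_header = not dst.exists() or dst.stat().st_size == 0
    with open(src, "rb") as fin, open(dst, "ab") as fout:
        header = fin.readline()  # Consume the header line of the source file
        if write_header:
            fout.write(header)
        # Copy the remaining bytes in blocks without parsing them
        shutil.copyfileobj(fin, fout, block_size)


tmp = Path(tempfile.mkdtemp())
(tmp / "file1.csv").write_bytes(b"a,b\n1,2\n")
(tmp / "file2.csv").write_bytes(b"a,b\n3,4\n")
merged = tmp / "merged.csv"
append_csv(merged, tmp / "file1.csv")
append_csv(merged, tmp / "file2.csv")
```

Because the rows are copied as raw bytes, nothing is validated; files with differing column orders would be merged silently.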


In [13]:
t = pd.Timestamp('2024-04-01')
print(t)
t += pd.DateOffset(days=29)
print(t)
t += pd.DateOffset(days=1)
print(t.strftime("%Y-%m-%d"))

2024-04-01 00:00:00
2024-04-30 00:00:00
2024-05-01
