# **The HuMob Challenge 2024 - Predicting the Future Mobility of City B Inhabitants**

### Presented by Thomas Phillips, Carmen Robles, and Mohammad Ali Mohammadzadeh

# **Overview and Library Importing**

We chose the 2024 HuMob Challenge and focused specifically on City B. We expect our predictions to be less precise than the winning submissions, since those drew on more of the available data. We also did not fully complete the challenge for City B, so this notebook walks through the steps we took, explains what went wrong, and lists the questions we need to answer moving forward.

The first step was to import the libraries that the rest of the notebook relies on.

In [None]:
import torch
import pandas as pd
import numpy as np
from transformers import DataCollatorForLanguageModeling, DistilBertTokenizerFast, DistilBertForMaskedLM, Trainer, TrainingArguments
from torch.utils.data import Dataset
from pathlib import Path
import json
import os
import folium

The rest of this notebook is organized as follows. The next code blocks define each function that we use to process the data and make predictions, and a single execution block at the very end combines them. The notebook has three subsections: preprocessing and training, testing and predicting, and results. In the results section, we highlight the errors in our code mentioned above and the questions we need to answer in order to fix them.

# **Preprocessing and Training**

The first function defined is `load_data()`, which handles loading and preprocessing the raw data for City B. It begins by reading the compressed CSV file into a pandas DataFrame, explicitly setting compact integer data types for efficiency. To focus on a manageable subset, the function keeps only the 500 lowest user IDs. For each row, it generates a `grid_token` from the x and y coordinates (`grid_class = y * 200 + x`) and a `time_token` from the day and time interval, capturing spatiotemporal information. The `time_token` is built from a `global_time` column (`d * 48 + t`, with 48 half-hour slots per day), which identifies the temporal position of each record. The data is then split into a training set (covering days 1 to 60) and a test set (days 61 to 75). Debugging print statements are included to confirm the structure and size of the processed data.

In [None]:
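# --- Load and Preprocess Data ---
def load_data():
    df = pd.read_csv('cityB_challengedata.csv.gz', dtype={'x': np.int16, 'y': np.int16, 'd': np.int8, 't': np.int8})

    # Keep the 500 lowest user IDs; .copy() avoids pandas' SettingWithCopyWarning below
    unique_uids = sorted(df['uid'].unique())[:500]
    df = df[df['uid'].isin(unique_uids)].copy()

    # Widen to int32 before multiplying: y * 200 can exceed the int16 range,
    # and d * 48 + t exceeds the int8 range (which produced tokens like [T_-128])
    df['grid_class'] = df['y'].astype(np.int32) * 200 + df['x']
    df['grid_token'] = df['grid_class'].apply(lambda g: f'[G_{g}]')
    df['global_time'] = df['d'].astype(np.int32) * 48 + df['t']
    df['time_token'] = df['global_time'].apply(lambda t: f'[T_{t}]')

    print("Data columns:", df.columns)
    print("First few rows:\n", df.head())
    print(f"✅ Using {len(unique_uids)} users.")

    # d is zero-based in the file (the first rows have d = 0), so days 1 to 60
    # are d = 0..59 and days 61 to 75 are d = 60..74
    df_train = df[df['d'] < 60]
    df_test = df[df['d'] >= 60]

    return df_train, df_test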
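
The token construction described above can be tried on a few made-up rows. This is a minimal sketch, not the notebook's own code: the values are invented, and widening to `int32` before multiplying is an assumption made here so that products such as `180 * 200` (too large for `int16`) and `74 * 48` (too large for `int8`) stay in range.

```python
import numpy as np
import pandas as pd

# Toy rows using the challenge's column names (the values are made up)
toy = pd.DataFrame({
    "uid": [0, 0, 1],
    "d": np.array([0, 3, 74], dtype=np.int8),
    "t": np.array([20, 47, 0], dtype=np.int8),
    "x": np.array([80, 81, 10], dtype=np.int16),
    "y": np.array([99, 97, 180], dtype=np.int16),
})

# Widen before multiplying so the products fit (assumption of this sketch)
toy["grid_class"] = toy["y"].astype(np.int32) * 200 + toy["x"]
toy["global_time"] = toy["d"].astype(np.int32) * 48 + toy["t"]

# Build the bracketed token strings with vectorized string concatenation
toy["grid_token"] = "[G_" + toy["grid_class"].astype(str) + "]"
toy["time_token"] = "[T_" + toy["global_time"].astype(str) + "]"

grid_tokens = toy["grid_token"].tolist()
time_tokens = toy["time_token"].tolist()
print(grid_tokens)
print(time_tokens)
```

The last row is the interesting one: its grid class is 36010 and its global time is 3552, both outside the range of the narrow column types.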

This block defines the `CustomDataset` class, a subclass of PyTorch's Dataset, tailored for preparing text sequences for a Masked Language Modeling (MLM) task. Initially, we considered using the St-MoE-BERT model, but due to compatibility issues with its bash-based training scripts on Google Colab, we opted for a simpler and more Colab-friendly approach using Hugging Face’s DistilBERT. This dataset class remains conceptually inspired by the St-MoE-BERT workflow.

The class constructor takes a list of token sequences (one string per user), a tokenizer, and an optional maximum sequence length (defaulting to 512). The `__len__` method returns the number of sequences, while the `__getitem__` method processes each sequence into a format suitable for MLM training. Each sequence is tokenized with padding and truncation, producing input IDs and attention masks. For MLM, approximately 15% of the tokens (excluding padding) are randomly selected and replaced with the `[MASK]` token. The labels hold the original IDs at the masked positions and `-100` everywhere else, so the loss is computed only on the tokens the model has to recover. The method returns a dictionary containing the `input_ids`, `attention_mask`, and `labels`, structured for use with Hugging Face's Trainer.

In [None]:
# --- Custom Dataset Class ---
class CustomDataset(Dataset):
    def __init__(self, sequences, tokenizer, max_length=512):
        self.sequences = sequences
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, idx):
        sequence = self.sequences[idx]

        # Tokenize to a fixed length; anything past max_length is truncated
        inputs = self.tokenizer(
            sequence,
            truncation=True,
            padding='max_length',
            max_length=self.max_length,
            return_tensors='pt'
        )

        input_ids = inputs['input_ids'].squeeze(0)
        attention_mask = inputs['attention_mask'].squeeze(0)

        # Copy the original IDs before any of them are replaced by [MASK]
        labels = input_ids.detach().clone()

        # Select ~15% of the real (non-padding) tokens for masking
        rand = torch.rand(input_ids.shape)
        mask_arr = (rand < 0.15) & (input_ids != self.tokenizer.pad_token_id)
        input_ids[mask_arr] = self.tokenizer.mask_token_id

        # -100 is ignored by the loss, so only masked positions are scored
        labels[~mask_arr] = -100

        return {
            'input_ids': input_ids,
            'attention_mask': attention_mask,
            'labels': labels
        }
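
The masking step described above can be sketched in isolation on a small tensor. This is an illustration under assumptions rather than the dataset class itself: `mask_tokens` is a hypothetical helper, the token IDs are made up, and the masking probability is raised to 0.5 so that a short example is likely to mask something. Setting unselected labels to `-100` follows the default `ignore_index` of PyTorch's cross-entropy loss.

```python
import torch

def mask_tokens(input_ids, pad_id, mask_id, prob=0.15, generator=None):
    """Return (masked_ids, labels) for masked language modeling."""
    rand = torch.rand(input_ids.shape, generator=generator)
    # Only real tokens are eligible; padding is never masked
    selected = (rand < prob) & (input_ids != pad_id)

    masked = input_ids.clone()
    masked[selected] = mask_id

    # Keep the original ID only where a token was masked
    labels = input_ids.clone()
    labels[~selected] = -100
    return masked, labels

# Made-up IDs: 0 plays the role of padding, 103 the role of [MASK]
gen = torch.Generator().manual_seed(0)
ids = torch.tensor([101, 5, 6, 7, 8, 102, 0, 0])
masked, labels = mask_tokens(ids, pad_id=0, mask_id=103, prob=0.5, generator=gen)
print(masked)
print(labels)
```

Because the random draw happens on every call, the same sequence is masked differently each time it is fetched.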

The `prepare_data` function prepares the training and testing datasets for a Masked Language Modeling (MLM) task. It starts by calling the `load_data()` function to retrieve preprocessed training and testing subsets. The function then loads a DistilBERT tokenizer from Hugging Face and augments it with custom special tokens that represent spatiotemporal information—specifically, the unique grid tokens and time tokens extracted from the training data.

Next, the function constructs sequences by sorting each user’s trajectory data by global time and concatenating the corresponding grid and time tokens into a single string. This is done separately for both training and testing datasets, resulting in a list of tokenized sequences for each user.

A few samples from each set are printed for inspection, and the number of users represented is displayed. Finally, the function wraps these sequences into `CustomDataset` objects, making them compatible with PyTorch and Hugging Face training workflows. It returns the training and testing datasets, the updated tokenizer, and the raw sequences for potential further analysis.



In [None]:
# --- Prepare Data for Training and Prediction ---
def prepare_data():
    df_train, df_test = load_data()
    tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

    grid_tokens = df_train['grid_token'].unique().tolist()
    time_tokens = df_train['time_token'].unique().tolist()
    special_tokens = grid_tokens + time_tokens

    tokenizer.add_special_tokens({'additional_special_tokens': special_tokens})
    print(f"Added {len(special_tokens)} special tokens.")
    print(f"Tokenizer size: {len(tokenizer)}")

    train_sequences = (
        df_train
        .sort_values(['uid', 'global_time'])
        .groupby('uid')[['grid_token', 'time_token']]
        .apply(lambda df: ' '.join(df.to_numpy().flatten()))
        .tolist()
    )

    test_sequences = (
        df_test
        .sort_values(['uid', 'global_time'])
        .groupby('uid')[['grid_token', 'time_token']]
        .apply(lambda df: ' '.join(df.to_numpy().flatten()))
        .tolist()
    )

    print(f"Train users: {len(train_sequences)}, Test users: {len(test_sequences)}")
    print("Sample train sequence:", train_sequences[0])
    print("Sample test sequence:", test_sequences[0])

    train_dataset = CustomDataset(train_sequences, tokenizer)
    test_dataset = CustomDataset(test_sequences, tokenizer)

    return train_dataset, test_dataset, tokenizer, train_sequences
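
The sequence construction can be illustrated on a toy DataFrame. The sketch below is a hedged example, not part of the pipeline: `build_sequences` and `window` are hypothetical helpers and the tokens are made up. The windowing step is an assumption added for illustration: since the tokenizer truncates at 512 tokens, a long trajectory keeps only its beginning, and splitting it into fixed-size chunks is one possible way to use the rest.

```python
import pandas as pd

# Toy records, deliberately out of order (values are made up)
toy = pd.DataFrame({
    "uid": [1, 0, 0, 1],
    "global_time": [5, 21, 20, 4],
    "grid_token": ["[G_3]", "[G_2]", "[G_1]", "[G_4]"],
    "time_token": ["[T_5]", "[T_21]", "[T_20]", "[T_4]"],
})

def build_sequences(df):
    """Map each uid to its time-ordered list of interleaved grid/time tokens."""
    ordered = df.sort_values(["uid", "global_time"])
    return {
        int(uid): [tok for pair in zip(g["grid_token"], g["time_token"]) for tok in pair]
        for uid, g in ordered.groupby("uid")
    }

def window(tokens, pairs_per_window):
    """Split a token list into strings of at most pairs_per_window (grid, time) pairs."""
    step = 2 * pairs_per_window
    return [" ".join(tokens[i:i + step]) for i in range(0, len(tokens), step)]

sequences = build_sequences(toy)
chunks = window(sequences[0], pairs_per_window=1)
print(sequences)
print(chunks)
```

With a 512-token limit and two tokens reserved for `[CLS]` and `[SEP]`, a window of about 255 pairs would fit, assuming each added token maps to a single ID.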

The `train_model` function is responsible for training a DistilBERT model on a Masked Language Modeling (MLM) task. It begins by calling `prepare_data()` to load and tokenize the training and testing datasets and to initialize the tokenizer with custom special tokens.

A pre-trained `DistilBertForMaskedLM` model is then loaded from Hugging Face’s repository (`distilbert-base-uncased`) and resized to accommodate the additional tokens introduced during preprocessing. If the tokenizer lacks a defined padding token, it is set to either the end-of-sequence or unknown token as a fallback.

Training parameters are configured using the `TrainingArguments` class. These include setting the output directory to `./results`, disabling checkpoint saving (`save_strategy="no"`), using a batch size of 16 for both training and evaluation, training for one epoch, enabling mixed-precision training (`fp16`) when a GPU is available, and logging every 10 steps. With 500 user sequences and a batch size of 16 on a single device, one epoch is only 32 steps, so a 500-step logging interval would never report a loss.

No `DataCollatorForLanguageModeling` is passed to the trainer. `CustomDataset` already masks approximately 15% of the input tokens, and that collator would mask the inputs a second time and replace the labels the dataset provides.

The Hugging Face Trainer class is then used to manage the training loop, bringing together the model, datasets, and training configuration. After training, the model and tokenizer are saved to the `models/distilbert_cityB_v1` directory, and a confirmation message is printed.

In [None]:
# --- Train the DistilBERT Model ---
def train_model():
    train_dataset, test_dataset, tokenizer, train_sequences = prepare_data()

    model = DistilBertForMaskedLM.from_pretrained('distilbert-base-uncased')
    # Grow the embedding matrix to cover the added grid and time tokens
    model.resize_token_embeddings(len(tokenizer))

    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token or tokenizer.unk_token
        print(f"ℹ️ Setting pad token to: {tokenizer.pad_token}")

    training_args = TrainingArguments(
        output_dir="./results",
        save_strategy="no",
        per_device_train_batch_size=16,
        per_device_eval_batch_size=16,
        num_train_epochs=1,
        logging_steps=10,                 # one epoch is only ~32 steps
        fp16=torch.cuda.is_available(),   # mixed precision needs a GPU
    )

    # No MLM data collator: CustomDataset already masks tokens and builds labels,
    # so the Trainer's default collator only has to stack the tensors
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=test_dataset,
    )

    trainer.train()

    # Save model and tokenizer together so the added tokens travel with the model
    model_dir = 'models/distilbert_cityB_v1'
    model.save_pretrained(model_dir)
    tokenizer.save_pretrained(model_dir)

    print(f"✅ Training complete! Model and tokenizer saved to: {model_dir}")

# **Testing and Predicting**

The `predict_sequence` function is designed to make predictions on a given input sequence using a pre-trained masked language model like DistilBERT. The function begins by tokenizing the input sequence using the provided tokenizer, ensuring proper padding and truncation to a maximum length of 512 tokens.

Next, the function passes the tokenized inputs through the model in evaluation mode, ensuring no gradients are computed by wrapping the inference process in a `torch.no_grad()` context. The model's output is captured as logits, which represent the unnormalized prediction scores for each token in the vocabulary.

To generate predictions, the function selects the token with the highest score for each position by using `torch.argmax` on the logits. These predicted token indices are then converted back into the corresponding token strings using the tokenizer's `convert_ids_to_tokens` method.

Finally, the function returns the list of predicted tokens, which can be used for further analysis or interpretation of the sequence's predictions.

In [None]:
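# --- Make Predictions ---
def predict_sequence(sequence, tokenizer, model):
    # Tokenize, truncating to the model's 512-token limit
    inputs = tokenizer(sequence, return_tensors="pt", padding=True, truncation=True, max_length=512)

    # Move the inputs to whichever device the model is on
    inputs = {key: value.to(model.device) for key, value in inputs.items()}

    # Inference only: disable dropout and gradient tracking
    model.eval()
    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits

    # Highest-scoring vocabulary entry at every position
    predicted_indices = torch.argmax(logits, dim=-1)

    predicted_tokens = tokenizer.convert_ids_to_tokens(predicted_indices[0].tolist())

    return predicted_tokens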
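
The argmax step described above can be seen on hand-written logits, without loading a model. This is a small sketch with a hypothetical four-entry vocabulary; the real model's logits have one score per entry of the extended tokenizer vocabulary.

```python
import torch

# Hypothetical vocabulary and made-up scores, shaped (batch, sequence, vocab)
vocab = ["[PAD]", "[G_1]", "[G_2]", "[T_20]"]
logits = torch.tensor([[[0.1, 2.0, 0.3, 0.0],
                        [0.0, 0.1, 0.2, 3.0],
                        [0.0, 0.5, 1.5, 0.2]]])

# Pick the highest-scoring vocabulary index at each position
predicted_indices = torch.argmax(logits, dim=-1)

# Map indices back to token strings for the first (only) sequence in the batch
predicted_tokens = [vocab[i] for i in predicted_indices[0].tolist()]
print(predicted_tokens)
```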

The `decode_predictions` function takes a list of predicted tokens (in the form of strings) and decodes them into corresponding (x, y) grid coordinates.

First, the function filters the predicted tokens to extract only those that represent grid positions, which are identified by tokens starting with `[G_`. These tokens represent grid locations on a 200x200 grid.

Next, the function parses the grid class values by stripping the `[G_` prefix and `]` suffix from each token, converting them to integers. These grid class values uniquely identify each grid position.

To convert these grid class values into (x, y) coordinates, the function uses the modulus operation (`% 200`) to extract the x-coordinate and integer division (`// 200`) to determine the y-coordinate. The x-coordinate represents the horizontal position, while the y-coordinate corresponds to the vertical position.

Finally, the function returns a list of tuples containing the (x, y) coordinates for each predicted grid position. This output can be used to visualize or interpret the predicted locations on the grid.

In [None]:
# --- Decode the Predictions into (x, y) Coordinates ---
def decode_predictions(predicted_tokens):
    grid_predictions = [token for token in predicted_tokens if token.startswith('[G_')]

    if not grid_predictions:
        print("No valid grid predictions found.")
        return []

    grid_classes = [int(token.replace('[G_', '').replace(']', '')) for token in grid_predictions]

    x_coords = [g % 200 for g in grid_classes]
    y_coords = [g // 200 for g in grid_classes]

    return list(zip(x_coords, y_coords))
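
A quick round trip shows how the encoding `y * 200 + x` and the decoding with `% 200` and `// 200` fit together. The sketch below uses hypothetical helper names. It also covers an edge case that is worth checking against the data: the round trip is exact only for x values from 0 to 199. If the coordinates turn out to run from 1 to 200 (an assumption to verify, not something established here), x = 200 decodes to the wrong cell, and subtracting 1 before encoding avoids that.

```python
GRID = 200  # cells per side

def encode(x, y):
    """Grid class as used in the notebook: y * 200 + x."""
    return y * GRID + x

def decode(g):
    """Inverse of encode for x in 0..199."""
    return g % GRID, g // GRID

def encode_1based(x, y):
    """Variant for coordinates in 1..200 (assumption): shift to 0..199 first."""
    return (y - 1) * GRID + (x - 1)

def decode_1based(g):
    """Inverse of encode_1based, returning coordinates in 1..200."""
    return g % GRID + 1, g // GRID + 1

print(decode(encode(80, 99)))                 # exact round trip
print(decode(encode(200, 5)))                 # x = 200 wraps into the next row
print(decode_1based(encode_1based(200, 5)))   # shifted variant keeps (200, 5)
```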

This function, `predict_grid_locations`, generates location predictions for user trajectories from days 61 to 75 of the test dataset. It begins by calling `load_data` to obtain the preprocessed test data. Each user's trajectory is then sorted by time and flattened into a single string sequence of interleaved grid and time tokens. The function receives the previously trained DistilBERT model along with its tokenizer, both fine-tuned for masked language modeling on grid prediction tasks. It then iterates through each test sequence and passes it to `predict_sequence`, which returns the model's most likely token at every position. These predicted tokens are translated back into spatial coordinates (x, y) via the `decode_predictions` function. The result is a list of predicted location coordinates for each user. Note that the input sequences contain the actual test-period grid tokens and no `[MASK]` tokens, and each one is truncated to 512 tokens, so the output reconstructs the start of the observed trajectory rather than forecasting unseen locations.

In [None]:
# --- Predict the Grid Locations for Test Data (Days 61-75) ---
def predict_grid_locations(model, tokenizer):
    # Only the test split is needed here
    _, df_test = load_data()

    # One string per user: time-ordered, interleaved grid and time tokens
    test_sequences = (
        df_test
        .sort_values(['uid', 'global_time'])
        .groupby('uid')[['grid_token', 'time_token']]
        .apply(lambda df: ' '.join(df.to_numpy().flatten()))
        .tolist()
    )

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)

    predictions = []

    print(f"Number of test sequences: {len(test_sequences)}")

    for seq in test_sequences:
        # Sequences are long, so print only a short preview
        print(f"Predicting for sequence: {seq[:200]}...")

        # predict_sequence tokenizes the text itself
        predicted_tokens = predict_sequence(seq, tokenizer, model)
        predicted_locations = decode_predictions(predicted_tokens)

        print(f"Predicted {len(predicted_locations)} locations, first few: {predicted_locations[:5]}")

        predictions.append(predicted_locations)

    return predictions
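
One way to pose forecasting as masked language modeling is to append the future time tokens to a user's history and put `[MASK]` in each grid slot, so that the model has to fill in the locations. The sketch below only builds such an input string. It is a hypothetical illustration: `build_forecast_input` is not part of the notebook, the tokens are made up, and whether this input format suits the trained model is untested.

```python
def build_forecast_input(history_tokens, future_times, mask_token="[MASK]"):
    """Append one (mask, time token) pair per future time step to the history."""
    future = []
    for t in future_times:
        future += [mask_token, f"[T_{t}]"]
    return " ".join(history_tokens + future)

# Made-up history ending just before global time 2880
history = ["[G_19880]", "[T_2878]", "[G_19481]", "[T_2879]"]
text = build_forecast_input(history, future_times=[2880, 2881])
print(text)
```

With an input like this, only the predictions at the `[MASK]` positions would be read as forecast locations. The 512-token limit still applies, so the history would have to be cut to leave room for the future slots.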

# **Results and Map of Predicted Points**

This main execution block acts as the program's entry point. It first calls `train_model()`, which fine-tunes DistilBERT on the user trajectory data and saves the model and tokenizer to `models/distilbert_cityB_v1`; this call can be commented out once a trained model exists. It then loads the model and its tokenizer from that same directory, so the tokenizer includes the grid and time tokens added during training. The script then uses the model to generate location predictions for each user in the test set, corresponding to time steps from days 61 to 75. To manage output efficiently, it ensures a `predictions/` directory exists (creating it if needed), and writes the predicted grid locations, formatted as (x, y) coordinate pairs, into a JSON file named `predicted_locations.json`. A confirmation message is printed upon successful completion.

In [None]:
# --- Main Execution ---
if __name__ == "__main__":
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Directory that train_model() saves the model and tokenizer to
    model_dir = 'models/distilbert_cityB_v1'

    # Comment this call out to reuse an already trained model
    train_model()

    # Load the model and tokenizer from the same directory so they stay in sync
    model = DistilBertForMaskedLM.from_pretrained(model_dir).to(device)
    tokenizer = DistilBertTokenizerFast.from_pretrained(model_dir)

    predicted_locations = predict_grid_locations(model, tokenizer)
    os.makedirs('predictions', exist_ok=True)
    with open('predictions/predicted_locations.json', 'w') as f:
        json.dump(predicted_locations, f)

    print("✅ Predictions saved to predictions/predicted_locations.json")
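
One detail of the JSON step is easy to overlook: JSON has no tuple type, so the (x, y) tuples are written as two-element lists and come back as lists when the file is read. The sketch below shows the round trip in memory with made-up predictions.

```python
import json

# Made-up predictions: one list of (x, y) tuples per user, the second user empty
predictions = [[(80, 99), (81, 97)], []]

# Serialize and parse again, as writing and re-reading the file would
restored = json.loads(json.dumps(predictions))

# Convert the inner lists back to tuples if tuple semantics are needed
as_tuples = [[tuple(point) for point in user] for user in restored]

print(restored)
print(as_tuples)
```

Code that only indexes into each pair works with either form.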


Data columns: Index(['uid', 'd', 't', 'x', 'y', 'grid_class', 'grid_token', 'global_time',
       'time_token'],
      dtype='object')
First few rows:
    uid  d   t   x    y  grid_class grid_token  global_time time_token
0    0  0  20  80   99       19880  [G_19880]           20     [T_20]
1    0  0  21  81   97       19481  [G_19481]           21     [T_21]
2    0  0  25  83  102       20483  [G_20483]           25     [T_25]
3    0  0  26  80  101       20280  [G_20280]           26     [T_26]
4    0  0  27  80  101       20280  [G_20280]           27     [T_27]
✅ Using 500 users.
Sample train sequence: [G_20482] [T_-128] [G_21690] [T_-128] [G_21891] [T_-128] [G_20885] [T_-127] [G_21690] [T_-127] [G_20285] [T_-127] [G_21891] [T_-127] [G_19678] [T_-126] [G_21690] [T_-126] [G_20485] [T_-126] [G_21891] [T_-126] [G_20280] [T_-126] [G_19277] [T_-125] [G_21690] [T_-125] [G_21690] [T_-125] [G_20285] [T_-125] [G_21690] [T_-125] [G_19680] [T_-125] [G_19878] [T_-124] [G_21690] [T_-124] [G_202

Step,Training Loss


✅ Training complete! Model and tokenizer saved.
Data columns: Index(['uid', 'd', 't', 'x', 'y', 'grid_class', 'grid_token', 'global_time',
       'time_token'],
      dtype='object')
First few rows:
    uid  d   t   x    y  grid_class grid_token  global_time time_token
0    0  0  20  80   99       19880  [G_19880]           20     [T_20]
1    0  0  21  81   97       19481  [G_19481]           21     [T_21]
2    0  0  25  83  102       20483  [G_20483]           25     [T_25]
3    0  0  26  80  101       20280  [G_20280]           26     [T_26]
4    0  0  27  80  101       20280  [G_20280]           27     [T_27]
✅ Using 500 users.
Number of test sequences: 500
Predicting for sequence: [G_21690] [T_-128] [G_21690] [T_-127] [G_21690] [T_-126] [G_21690] [T_-125] [G_26469] [T_-125] [G_19880] [T_-124] [G_21890] [T_-123] [G_20482] [T_-122] [G_21690] [T_-121] [G_20477] [T_-121] [G_21690] [T_-120] [G_20280] [T_-120] [G_21690] [T_-119] [G_20280] [T_-119] [G_21690] [T_-118] [G_21690] [T_-117] [G

KeyboardInterrupt: 

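Our code did not run as we intended. There is an issue with our tokenizer: the tokens we add are not being saved for the model to use when predicting the next tokens. The recorded output also shows negative time tokens such as `[T_-128]`, which means `global_time` wrapped around the 8-bit integer range of the `d` and `t` columns in that run. (The `KeyboardInterrupt` shown in the output is there because we stopped the full iteration after seeing the first few results.)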

Had this initial code worked, we would proceed to visualize the results. The script below reads the predictions from `predictions/predicted_locations.json`, which would contain one list of (x, y) grid coordinates per user, and flattens them into a single list of points on the 200x200 grid. It initializes a Folium map and adds a small blue circle marker for every predicted point, then saves the map as an HTML file (`predicted_locations_map.html`) for viewing in a web browser. In this version the grid indices are passed to Folium directly as latitude and longitude, and the map is centered at (0, 0). A fuller version would translate each grid cell into approximate geographic coordinates using a simple linear mapping, center the map on the mapped grid, draw a color-coded polyline for each user's predicted path, and place a circle marker at the final predicted location to highlight the endpoint. This approach would provide a spatial visualization of model predictions for user movement over time.

In [None]:
with open('predictions/predicted_locations.json', 'r') as f:
    predictions = json.load(f)

all_coords = [coord for sublist in predictions for coord in sublist]
m = folium.Map(location=[0, 0], zoom_start=10)

for coord in all_coords:
    folium.CircleMarker(
        location=[coord[1], coord[0]],
        radius=2,
        color='blue',
        fill=True,
        fill_opacity=0.5
    ).add_to(m)

m.save('predicted_locations_map.html')
m
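
The linear mapping from grid cells to latitude and longitude mentioned above could look like the sketch below. Everything specific in it is an assumption: `grid_to_latlon` is a hypothetical helper, the bounding box values are placeholders rather than the real extent of City B, and y is assumed to increase northward.

```python
def grid_to_latlon(x, y, south, west, north, east, size=200):
    """Map a grid cell to the latitude/longitude of its center.

    Assumes x and y are zero-based cell indices, y grows northward,
    and the grid covers the bounding box evenly.
    """
    lat = south + (y + 0.5) / size * (north - south)
    lon = west + (x + 0.5) / size * (east - west)
    return lat, lon

# Placeholder bounding box, NOT the real location of City B
BOUNDS = dict(south=10.0, west=20.0, north=11.0, east=21.0)

lat, lon = grid_to_latlon(0, 0, **BOUNDS)
center = grid_to_latlon(99.5, 99.5, **BOUNDS)
print(lat, lon)
print(center)
```

The resulting pairs could be passed as `location=[lat, lon]` to the markers, and `center` to the map, in place of the raw grid indices.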

# **Sources (Works Cited)**

All of these sources refer to either the main libraries we used for this project, the St-MoE-BERT submission, or the HuMob Challenge 2024 website.

--------------------------------------------------------------------------------

Bray, Tim. The JavaScript Object Notation (JSON) Data Interchange Format. IETF, 2014, https://tools.ietf.org/html/rfc7159. Accessed 23 Apr. 2025.

ChatGPT

“Folium: Python Data. Leaflet.js Maps.” *Folium Documentation*, Python Software Foundation, https://python-visualization.github.io/folium/.

Harris, Charles R., et al. Array Programming with NumPy. Nature, vol. 585, no. 7825, 2020, pp. 357–362, https://numpy.org.

He, Haoyu, et al. St-MoE-BERT: A Spatial-Temporal Mixture-of-Experts Framework for Long-Term Cross-City Mobility Prediction. 18 Oct. 2024, https://arxiv.org/pdf/2410.14099.

HuMob Challenge. HuMob Challenge 2024 Dataset (City A, B, C, D). Zenodo, 2024. https://zenodo.org/record/14219563.

McKinney, Wes. Data Structures for Statistical Computing in Python. Proceedings of the 9th Python in Science Conference, 2010, https://pandas.pydata.org.

Paszke, Adam, et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. Proceedings of the 33rd Conference on Neural Information Processing Systems, 2019, https://pytorch.org.

Wolf, Thomas, et al. Transformers: State-of-the-Art Natural Language Processing. Hugging Face, 2020, https://huggingface.co/transformers.