## CalculateEmbeddings

This script:
- Loads a list of sample points, each with 4 associated image files
- Calculates an image embedding for each image file
- Calculates text embeddings from a list of prompts
- Calculates a similarity score between each image embedding and each text embedding
- Saves a pickle file containing each sample point, the list of image files, the list of embeddings and the list of similarity scores

In [2]:
import os
os.environ["HF_HOME"] = "/nfs/a319/gy17m2a/scratch/hf_cache"
import pickle
import geopandas as gpd
import torch
from PIL import Image
from tqdm import tqdm
import clip

data_dir = os.path.join("../../../../data/embeddings/")

### Load list of sample points
This contains points sampled along the road network in 1-SampleStreetNetwork.ipynb  
Each point has a latitude, a longitude, and 4 image files associated with it (these are sampled in each of the 4 cardinal directions from the sample point)  
It also contains an 'embedding' slot, which this script will fill with a list of embeddings for each of the 4 images
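As a rough illustration of the record layout described above, the sketch below builds one record and round-trips it through pickle. The `image_files` and `embedding` keys are the ones used later in this notebook; the `latitude` and `longitude` key names, the coordinates and the file names are made up for the example and may not match the real cache.

```python
import pickle

# Hypothetical record layout: "image_files" and "embedding" appear in this
# notebook; the "latitude"/"longitude" key names, the coordinate values and
# the file names are assumptions made for illustration only.
example_record = {
    "latitude": 53.48,
    "longitude": -2.24,
    "image_files": [f"point_0_heading_{h}.jpg" for h in (0, 90, 180, 270)],
    "embedding": [],  # filled later with one vector per image
}

# Round-trip a list of records through pickle, as the cache file does
restored = pickle.loads(pickle.dumps([example_record]))
print(len(restored), len(restored[0]["image_files"]))
```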

In [3]:
points_data_cache = data_dir + "sample_points_cache/points_data_cache.pkl"
with open(points_data_cache, "rb") as f:
    point_records = pickle.load(f)
print(f"Cache currently has {len(point_records)} points.")

Cache currently has 18897 points.


# Compute the Embeddings

In [3]:
# Define model
device = "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

## Create text embedding for categories we want to match image embeddings to

- For each headline category, define several different prompts
- Convert each of the subprompts into a text embedding
- For each headline category, find the mean text embedding
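The averaging step can be sketched without CLIP, using toy 3-dimensional vectors in place of the 512-dimensional text embeddings. This is a minimal, dependency-free illustration; the helper names `normalise` and `category_embedding` are hypothetical and are not part of this notebook or of the `clip` package.

```python
import math

def normalise(vec):
    """Scale a vector to unit (L2) length."""
    length = math.sqrt(sum(x * x for x in vec))
    return [x / length for x in vec]

def category_embedding(prompt_embeddings):
    """Average unit-length prompt embeddings, then renormalise the mean."""
    unit = [normalise(v) for v in prompt_embeddings]
    dim = len(unit[0])
    mean = [sum(v[i] for v in unit) / len(unit) for i in range(dim)]
    return normalise(mean)

# Toy stand-ins for the text embeddings of two prompts in one category
toy_prompts = [[3.0, 4.0, 0.0], [0.0, 4.0, 3.0]]
cat_vec = category_embedding(toy_prompts)
print(cat_vec)
```

The mean of several unit vectors is generally shorter than unit length unless they all point the same way, which is why the mean is normalised a second time before being used for cosine similarity.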

### Define the prompts

In [4]:
# Option 1
# These prompts are based on the planning use classes: https://www.planningportal.co.uk/permission/common-projects/change-of-use/use-classes 
multi_prompts = {
    "C – Accommodation": [
        "a photo of a house or home",
        "an apartment building on a street",
        "houses in a residential neighborhood",
        "front view of a suburban house",
        "a cozy home exterior with a garden"],
    "B – Industrial / Storage": [
        "a warehouse or big industrial building",
        "a factory with chimneys or machinery",
        "storage containers outside a building",
        "a logistics yard with trucks and crates",
        "industrial buildings in an urban area"],
    "E – Commercial / Business / Service": [
        "a shop or cafe on the street",
        "a busy high street with stores",
        "a restaurant or small business front",
        "office building in the city",
        "people outside a retail store or service"],
    "F – Local Community / Learning": [
        "a school or university building",
        "library or community centre",
        "children playing at a sports field",
        "outdoor playground or swimming pool",
        "museum, gallery or exhibition space"]}

# Option 2
# These prompts are based on trial and error refinement of what seem to be the most common scene types
multi_prompts = {
# 1. Indoor / interior scenes (to exclude non-street content)
    "indoor": [
        "an indoor photo inside a building",
        "an interior room scene",
        "a photo taken indoors under artificial lighting",
        "inside a residential home",
        "inside an office or workspace",
        "an indoor hallway or corridor",
        "inside a car interior",
        "a room with walls and furniture visible",
        "an indoor photo with no outdoor scenery",
        "an interior living space"],

    # 2. Single residential house (any detached, semi-detached, terraced)
    "single_house": [
        "a single residential house",
        "a standalone house on a street",
        "a terraced or semi-detached home",
        "a suburban residential house",
        "a UK-style house with a front door and windows",
        "a residential house with a garden or driveway",
        "a home with a pitched roof",
        "a brick residential house",
        "a typical British house exterior",
        "a single-family home on a quiet street"],

    # 3. Residential street (multiple houses visible)
    "residential_street": [
        "a residential street with multiple houses",
        "a row of houses along a street",
        "a street lined with terraced homes",
        "a suburban neighbourhood street view",
        "a street with parked cars and houses",
        "a residential road with houses on both sides",
        "a quiet housing neighbourhood",
        "a UK residential area with multiple homes visible",
        "a typical street of British housing",
        "a suburban housing street scene"],

    # 4. Shops and cafes (retail frontage)
    "shops_and_cafes": [
        "a shop or retail storefront",
        "a cafe or coffee shop entrance",
        "a commercial high street with shops",
        "a small business storefront",
        "a row of shops on a street",
        "a trendy cafe or coffee shop",
        "a street with commercial signage",
        "a food takeaway or retail frontage",
        "a boutique store or independent shop",
        "a street with restaurants or cafes"],

    # 5. Road / street without strong residential context
    "road": [
        "a road with cars and traffic",
        "a street with no houses visible",
        "an asphalt road outdoors",
        "a roadway with vehicles",
        "a simple road scene with pavement",
        "a street-level view of a road",
        "a road with buildings far away",
        "a road with painted lane markings",
        "an intersection or junction",
        "a wide street with traffic flow"],

    # 6. Highways / motorways (new category)
    "highway": [
        "a large highway with multiple lanes",
        "a motorway with fast-moving traffic",
        "a dual carriageway",
        "a high-speed road with no houses nearby",
        "a major road with overpasses or slip roads",
        "a highway with barriers and signage",
        "a wide road with multiple lanes of traffic",
        "a motorway with cars travelling at speed",
        "a road with green verges and no buildings",
        "a large transportation corridor" ],

    # 7. Car-dominated / close-up vehicles
    "car": [
        "a photo dominated by a car exterior",
        "a close-up of a vehicle",
        "a parked car in the foreground",
        "a photo where a car fills most of the frame",
        "a car on the street close-up",
        "a vehicle photographed from the side",
        "a car front or headlight close-up",
        "a parked vehicle dominating the scene",
        "a photo focused mainly on a car body",
        "a street scene with a car extremely close"],

    # 8. Industrial buildings / warehouses
    "industrial": [
        "an industrial building such as a warehouse",
        "a factory or manufacturing facility",
        "an industrial estate building",
        "a warehouse with metal siding",
        "a large industrial structure",
        "a factory exterior with chimneys",
        "an industrial unit with shutters",
        "a building used for manufacturing or storage",
        "an industrial complex",
        "a photo of a warehouse yard"],

    # 9. Wasteland / derelict land
    "wasteland": [
        "a derelict vacant lot",
        "an empty or abandoned outdoor area",
        "unused land with no buildings",
        "a vacant plot with rubble or overgrowth",
        "an open space with signs of neglect",
        "a barren ground area",
        "an outdoor area with debris and no structures",
        "a neglected or rundown open space",
        "a disused land area",
        "a wasteland with weeds or dirt"],

    # 10. Greenspace / parks
    "greenspace": [
        "a public park with grass and trees",
        "an urban greenspace or recreation area",
        "a landscaped public park",
        "a community park with greenery",
        "a garden or park with plants",
        "a green open space with trees",
        "a park pathway with vegetation",
        "a natural outdoor green area",
        "a public lawn or green field",
        "an urban park with trees and grass"]}

### Calculate the embeddings from the text categories

In [5]:
# List of averaged text embeddings, one per headline category
final_text_features = []
category_names = []  

# This line tells PyTorch that we're not training CLIP (just using to calculate embeddings), so don't need to compute gradients
# This makes the computation faster
with torch.no_grad():
    # Loop through each category and its list of text prompts
    for cat, prompts in multi_prompts.items():

        # Add the category name to your list (used later for plotting or indexing)
        category_names.append(cat)

        # Convert all textual prompts into CLIP token IDs 
        # Token IDs are numerical codes that represent words or sub-words
        tokenized = clip.tokenize(prompts).to(device)

        # Encode all the token IDs into CLIP text embeddings
        txt_feats = model.encode_text(tokenized)

        # Normalise each prompt embedding to unit length
        # (CLIP uses cosine similarity, so normalisation matters)
        txt_feats = txt_feats / txt_feats.norm(dim=-1, keepdim=True)

        # Compute the mean embedding across all prompts for this category
        # This creates a single "category embedding" representing all its prompts
        avg_feat = txt_feats.mean(dim=0)

        # Normalise the averaged embedding again
        # This ensures it remains a proper CLIP embedding for cosine similarity
        avg_feat = avg_feat / avg_feat.norm()

        # Save this averaged category embedding
        final_text_features.append(avg_feat.cpu())

# Convert to tensor of shape (num_categories, 512)
# A tensor is a multi-dimensional array, and is the format expected by PyTorch
final_text_features = torch.stack(final_text_features)
print("Built improved category text embeddings:", final_text_features.shape)

Built improved category text embeddings: torch.Size([10, 512])


## Create embedding for each image and find similarity to categories 
- Create embedding for image
- Find similarity score to text embedding for each category
- Convert similarity score to a "probability-like number" using softmax
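To make the softmax step concrete, here is a small dependency-free sketch using made-up cosine similarities. It compares softmax applied directly to the similarities (as this notebook does) with softmax applied after multiplying by 100, which is the scaling used in the zero-shot example in the OpenAI CLIP README. The `softmax` helper and the toy numbers are assumptions for illustration.

```python
import math

def softmax(scores, scale=1.0):
    """Softmax over a list of scores, optionally multiplied by `scale` first."""
    scaled = [s * scale for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up cosine similarities between one image and three category embeddings
toy_similarities = [0.30, 0.25, 0.20]

plain = softmax(toy_similarities)                      # no scaling, as in this notebook
scaled_probs = softmax(toy_similarities, scale=100.0)  # CLIP README-style scaling

print([round(p, 3) for p in plain])
print([round(p, 3) for p in scaled_probs])
```

Without scaling the three values stay close to one third each, while the scaled version concentrates almost all of the mass on the best match. The ranking of categories is the same either way, so the choice mainly affects how the "probability-like" numbers should be read.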

In [53]:
def embed_and_score_clip(image_path):
    """
    Loads an image, computes its CLIP embedding, 
    and calculates similarity-based category probabilities.

    Returns:
        image_embedding (np.ndarray): 512-dim CLIP image embedding
        similarity_scores (np.ndarray): Cosine similarity to each category
        category_probabilities (np.ndarray): Softmax of the similarity scores
    """

    # -----------------------------------------------------------
    # 1. LOAD AND PREPROCESS THE IMAGE
    # -----------------------------------------------------------
    # Load image using PIL and convert to 3-channel RGB
    pil_image = Image.open(image_path).convert("RGB")

    # Apply CLIP preprocessing:
    # - resize/crop to 224x224
    # - convert to torch tensor
    # - normalise pixels with CLIP's mean/std
    # This produces a tensor of shape (3, 224, 224)
    image_tensor = preprocess(pil_image)

    # Add a batch dimension → (1, 3, 224, 224)
    # Required because CLIP expects a batch
    image_tensor = image_tensor.unsqueeze(0)

    # Move tensor to CPU or GPU depending on device
    image_tensor = image_tensor.to(device)

    # -----------------------------------------------------------
    # 2. RUN CLIP TO GET IMAGE EMBEDDING
    # -----------------------------------------------------------
    # Disable gradient tracking 
    with torch.no_grad():

        # Encode the image → produces a 512-dim CLIP embedding
        raw_image_embedding = model.encode_image(image_tensor)

        # Normalise embedding to unit length (important for cosine similarity)
        image_embedding = raw_image_embedding / raw_image_embedding.norm(
            dim=-1, keepdim=True)

        # -----------------------------------------------------------
        # 3. COMPUTE SIMILARITIES TO TEXT CATEGORY EMBEDDINGS
        # -----------------------------------------------------------
        # Returns similarity of the 1 image embedding to N text embeddings
        # These are dot products, representing how close the image is to each category in embedding space
        similarity_scores = (image_embedding @ final_text_features.to(device).T)

        ######## Convert raw similarities to probability-like values
        # Softmax turns a set of numbers into a distribution that sums to 1
        # However, the numbers do not represent true probabilities
        # e.g. illustrative scores of Indoor: 0.75, Greenery: 0.18, Terraced house: 0.04, Road: 0.02, Shop: 0.01
        # mean that the "indoor" text embedding was much closer to the image embedding than the others,
        # and NOT that the true probability that the scene is indoors is 75%.
        # Note: the cosine similarities are passed to softmax unscaled here, so the real values
        # tend to be much flatter than the illustrative ones above.
        category_probabilities = similarity_scores.softmax(dim=-1)

    # -----------------------------------------------------------
    # 4. RETURN CLEAN CPU NUMPY ARRAYS
    # -----------------------------------------------------------
    return (image_embedding.cpu().numpy()[0],       # shape (512,)
        similarity_scores.cpu().numpy()[0],
        category_probabilities.cpu().numpy()[0])

# ------------------------------
# 5. Embed all images
# ------------------------------
for rec in tqdm(point_records, desc="Embedding points", unit="point"):
# for rec in point_records[200:300]:    

    rec["embedding"] = []
    rec["category_scores"] = []
    rec["category_probs"] = []
    
    for img_path in rec["image_files"]:

        try:
            embedding, scores, probabilities = embed_and_score_clip(img_path)
            
            rec["embedding"].append(embedding)
            rec["category_scores"].append(scores)
            rec["category_probs"].append(probabilities)
            
            # Testing
#             img = Image.open(img_path)
#             fig,ax=plt.subplots(figsize=(2,2))
#             ax.axis("off")
#             plt.imshow(img)
#             plt.show()
#             argmax = scores.argmax()
#             print(list(multi_prompts.keys())[argmax])
            
        except Exception as e:
            tqdm.write(f"⚠️ Error: {e}")

## Save outputs to pickle file

This contains points sampled along the road network in 1-SampleStreetNetwork.ipynb.  
Each point has a latitude, a longitude, and 4 image files associated with it.  
It also contains an 'embedding' slot, with a list of embeddings for each of the 4 images, plus 'category_scores' and 'category_probs' slots holding the per-image similarity scores and softmax values.  
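A downstream step might turn the stored scores into a label per image along the following lines. This is only a sketch: the record is a toy one with made-up scores, only three of the category names are shown, and `top_category` is a hypothetical helper rather than something defined in this notebook.

```python
# Category order follows the keys of multi_prompts; only three are shown here
category_names = ["indoor", "single_house", "residential_street"]

# Hypothetical record with made-up scores for two images (real records hold
# one score per category for each successfully embedded image)
toy_record = {
    "image_files": ["img_a.jpg", "img_b.jpg"],
    "category_scores": [[0.21, 0.27, 0.25], [0.29, 0.18, 0.20]],
}

def top_category(scores, names):
    """Return the name whose score is highest."""
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return names[best_index]

labels = [top_category(s, category_names) for s in toy_record["category_scores"]]
print(dict(zip(toy_record["image_files"], labels)))
```

Images that raise an error in the embedding loop are skipped rather than given a placeholder, so a record can hold fewer score lists than image files, and the two lists may then not line up index for index.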

In [7]:
output_file = (data_dir + "sample_points_cache/points_data_cache_with_CLIP_embeddings_and_scores_userdefinedclasses.pkl")

with open(output_file, "wb") as f:
    pickle.dump(point_records, f)

print(f"\n💾 Saved embeddings + category scores for {len(point_records)} points.")


💾 Saved embeddings + category scores for 18897 points.


## Testing how CLIP works

In [8]:
# img_path = point_records[2]["image_files"][3]
# img_path = img_path.replace("airbnb-manchester/", "embeddings/").replace("../", "../../../")
# img = Image.open(img_path)
# plt.imshow(img)

In [9]:
# from sentence_transformers import SentenceTransformer, util
# from PIL import Image

# #Load CLIP model
# model = SentenceTransformer('clip-ViT-B-32')

In [10]:
#Encode text descriptions
# text_emb = model.encode(['Two dogs in the snow', "a pizza", 'A cat on a table', 'A picture of a road, with cars and trees'])
# text_emb = model.encode(["a cucumber", "semi-detached house", "a highway with few cars and grass embankments", "a car", "a view down a road", "a park"])

In [11]:
# #Encode an image:
# img_emb = model.encode(Image.open(img_path))

# #Compute cosine similarities 
# cos_scores = util.cos_sim(img_emb, text_emb)
# print(cos_scores)

In [12]:
# text_emb = model.encode(["a house", "a shop", "a car", "a road", "a park"])

# #Compute cosine similarities 
# cos_scores = util.cos_sim(img_emb, text_emb)
# print(cos_scores)