<a href="https://colab.research.google.com/github/WilliamShengYangHuang/RC18_GenAI/blob/main/AI_Powered_Single_Image_to_3D_Scene_Reconstruction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **AI-Powered Single Image to 3D Scene Reconstruction** (Beta Version)

William Huang (william.huang@ucl.ac.uk)

December 2025





**Project Overview**

This project uses pretrained AI models to turn a single two-dimensional photograph into a three-dimensional scene. By combining monocular depth estimates from the Depth Anything V2 model with semantic segmentation from SegFormer, the system reconstructs dense point clouds and textured meshes. To improve architectural accuracy, the tool includes a pre-processing module for perspective correction, so users can rectify geometric distortions before 3D generation. The system is designed for resilience: if the semantic model is unavailable, it automatically reverts to a geometry-only mode.

The software supports multi-format export, automatically generating .obj (Textured Mesh), .ply (Point Cloud), and .glb (Web Preview) files. Users can inspect these outputs via an integrated dual-tab viewer that renders both structural point clouds and solid meshes directly within the web browser.
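As a rough illustration of how depth and semantic labels can work together when the mesh is built, the sketch below walks a small pixel grid and keeps a triangle only when its three corners are close in normalised depth and, if labels are available, share one label. It is a simplified variant of the idea, not the notebook's exact test: the function name `gated_triangles`, the toy grids, and the 0.05 threshold used here are illustrative assumptions. Passing `labels=None` mimics the geometry-only fallback.

```python
def gated_triangles(depth, labels=None, threshold=0.05):
    """Return triangles (vertex-index triples) for a row-major pixel grid.

    depth  : list of rows of normalised depth values in [0, 1]
    labels : optional list of rows of integer class labels
             (None means geometry-only mode)
    """
    h, w = len(depth), len(depth[0])
    faces = []
    for r in range(h - 1):
        for c in range(w - 1):
            # Vertex indices of the four pixels in this 2x2 block
            i00, i01 = r * w + c, r * w + c + 1
            i10, i11 = (r + 1) * w + c, (r + 1) * w + c + 1
            # Each block yields an upper-left and a lower-right triangle
            for tri in ((i00, i10, i01), (i01, i11, i10)):
                ds = [depth[i // w][i % w] for i in tri]
                if max(ds) - min(ds) >= threshold:
                    continue  # depth jump: probably an object boundary
                if labels is not None:
                    classes = {labels[i // w][i % w] for i in tri}
                    if len(classes) != 1:
                        continue  # corners belong to different classes
                faces.append(tri)
    return faces


# Toy 2x3 grid: the right-hand column is much deeper than the rest
depth = [[0.10, 0.11, 0.90],
         [0.10, 0.12, 0.91]]
labels = [[1, 1, 2],
          [1, 2, 2]]

print(gated_triangles(depth))          # geometry-only mode
print(gated_triangles(depth, labels))  # depth and semantic gating
```

Gating on both signals means a triangle is dropped if either model indicates a boundary, which tends to avoid stretched "rubber sheet" faces between foreground and background.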

**Operation Manual**
1. **Initialisation.** Begin by executing the code cell within the Google Colab environment. Once the dependencies are installed and the AI models have loaded, a public URL (e.g., https://xxxx.gradio.live) will appear. Click this link to access the web interface.

2. **Upload and Pre-processing.** Drag and drop your source photograph into the 'Input Image' area. If the image exhibits perspective distortion, such as buildings appearing to lean backwards, expand the 'Keystone Correction' panel. Adjust the Vertical slider to rectify converging vertical lines or the Horizontal slider to correct side-angle distortion. Verify that lines appear straight and parallel in the 'Warped Preview' window before proceeding.

3. **Parameter Configuration.** Configure the scene depth using the Depth Scale (Z-Scale) slider. Higher values are ideal for deep scenes like corridors or streets, whilst lower values suit flatter objects. You may also adjust the Preview Density to balance visual detail against browser rendering performance.

4. **Generation and Export.** Click the 'Generate 3D Model' button to commence processing. Upon completion, navigate the tabs on the right-hand side to inspect the results: the 'Cloud Preview' displays the raw depth structure, whilst the 'Mesh Preview' offers an interactive view of the solid, textured model. To save your work, use the dedicated buttons below the preview window to download the .obj file for use in 3D software (such as Blender or Unity) or the .ply file for point cloud applications.
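To make the sliders in steps 2 and 3 more concrete, the sketch below reproduces the arithmetic this notebook applies to them. The keystone sliders shift the four destination corners of the perspective warp by up to 20% of the image size, and the Depth Scale multiplies a normalised depth value before the pixel is back-projected through a simple pinhole camera with a focal length of 1.1 times the image width. The helper names `keystone_corners` and `back_project` are illustrative and do not appear in the notebook.

```python
def keystone_corners(w, h, vertical, horizontal):
    """Destination corners (top-left, top-right, bottom-left, bottom-right)
    of the perspective warp for slider values in [-1, 1]."""
    vc = vertical * (w * 0.2)    # vertical keystone offset in pixels
    hc = horizontal * (h * 0.2)  # horizontal keystone offset in pixels
    return [
        (0 - vc - hc, 0),
        (w + vc + hc, 0),
        (0 + vc - hc, h),
        (w - vc + hc, h),
    ]


def back_project(x, y, depth_norm, w, h, depth_scale=2.0):
    """Map pixel (x, y) with normalised depth in [0, 1] to a 3D point."""
    z = depth_norm * 200.0 * depth_scale  # Depth Scale stretches the Z axis
    f = w * 1.1                           # focal length used for both axes
    cx, cy = w / 2, h / 2                 # principal point at the image centre
    # Y and Z are negated so the scene is upright and extends along -Z
    return ((x - cx) * z / f, -(y - cy) * z / f, -z)


# A positive Vertical value widens the top edge and narrows the bottom edge
for corner in keystone_corners(512, 512, vertical=0.5, horizontal=0.0):
    print(corner)

# Doubling the Depth Scale doubles every coordinate of the projected point
print(back_project(512, 256, 0.5, 512, 512, depth_scale=1.0))
print(back_project(512, 256, 0.5, 512, 512, depth_scale=2.0))
```

Because the X and Y coordinates are multiplied by the same depth value as Z, raising the Depth Scale enlarges the whole scene along each viewing ray rather than only pushing points backwards.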


In [None]:
# @title 🚀 AI 3D Suite
# Key fixes:
# 1. The preview window now loads a .glb file (a web-friendly format), which fixes the mesh preview not displaying
# 2. The download buttons still provide .obj (general-purpose format) and .ply (point cloud)
# 3. Includes the earlier perspective correction and all other features

import os
import sys
from pathlib import Path

# ===========================
# 1. Environment setup
# ===========================
print("Optimising the runtime environment...")
!pip install -q gradio open3d numpy opencv-python pillow matplotlib scikit-image tqdm plotly accelerate trimesh timm transformers

import numpy as np
import torch
import cv2
import open3d as o3d
import gradio as gr
from PIL import Image, ImageOps
import plotly.graph_objects as go
from transformers import pipeline

# Device configuration
device_str = "cuda" if torch.cuda.is_available() else "cpu"
device_id = 0 if device_str == "cuda" else -1
print(f"🟢 Running on device: {device_str}")

OUTPUT_DIR = Path("/content/3dgs_output_fixed_preview")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

# ===========================
# 2. Load the AI models
# ===========================
print("🔵 Loading AI models...")

# Depth model
try:
    print("  -> Loading depth model...")
    depth_estimator = pipeline(task="depth-estimation", model="depth-anything/Depth-Anything-V2-Small-hf", device=device_id)
except Exception as e:
    depth_estimator = None
    print(f"  ❌ Failed to load depth model: {e}")

# Semantic model
semantic_segmenter = None
try:
    print("  -> Loading semantic model...")
    semantic_segmenter = pipeline(task="image-segmentation", model="nvidia/segformer-b0-finetuned-ade20k-512-512", device=device_id)
except Exception as e:
    print("  ⚠️ Failed to load semantic model; switching to geometry-only mode.")
    # Report the cause so the fallback can be diagnosed
    print(f"     Reason: {e}")
    semantic_segmenter = None

# ===========================
# 3. Image perspective correction
# ===========================
def warp_image(image_pil, vertical_correction, horizontal_correction, zoom):
    # Skip only when nothing changes; a zoom on its own must still be applied
    if vertical_correction == 0 and horizontal_correction == 0 and zoom == 1.0:
        return image_pil

    w, h = image_pil.size
    img_np = np.array(image_pil)
    src_pts = np.float32([[0, 0], [w, 0], [0, h], [w, h]])

    vc = vertical_correction * (w * 0.2)
    hc = horizontal_correction * (h * 0.2)

    dst_pts = np.float32([
        [0 - vc - hc, 0],
        [w + vc + hc, 0],
        [0 + vc - hc, h],
        [w - vc + hc, h]
    ])

    matrix = cv2.getPerspectiveTransform(src_pts, dst_pts)
    warped_np = cv2.warpPerspective(img_np, matrix, (w, h), flags=cv2.INTER_LANCZOS4, borderMode=cv2.BORDER_REPLICATE)

    if zoom != 1.0:
        h, w = warped_np.shape[:2]
        center = (w / 2, h / 2)
        M = cv2.getRotationMatrix2D(center, 0, zoom)
        warped_np = cv2.warpAffine(warped_np, M, (w, h), flags=cv2.INTER_LANCZOS4, borderMode=cv2.BORDER_REPLICATE)

    return Image.fromarray(warped_np)

# ===========================
# 4. Core generation logic
# ===========================
def process_3d_data(image_np, depth_map, seg_map, depth_scale, depth_gamma):
    h, w, _ = image_np.shape

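    # Normalise depth to [0, 1]; guard against a constant depth map (zero range)
    depth_map = depth_map.astype(np.float32)
    depth_range = float(depth_map.max() - depth_map.min())
    depth_norm = (depth_map - depth_map.min()) / max(depth_range, 1e-8)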
    if depth_gamma != 1.0:
        depth_norm = np.power(depth_norm, depth_gamma)

    z_flat = depth_norm.flatten() * 200.0 * depth_scale
    z_grid = z_flat.reshape(h, w)

    fx, fy = w * 1.1, w * 1.1
    cx, cy = w / 2, h / 2
    x_grid, y_grid = np.meshgrid(np.arange(w), np.arange(h))

    x3d = (x_grid - cx) * z_grid / fx
    y3d = (y_grid - cy) * z_grid / fy
    z3d = z_grid * -1.0

    vertices = np.stack([x3d.flatten(), -y3d.flatten(), z3d.flatten()], axis=1)
    colors = image_np.reshape(-1, 3) / 255.0

    # Build the mesh
    faces = []
    depth_threshold = 0.05
    use_semantic = (seg_map is not None)

    for r in range(h - 1):
        for c in range(w - 1):
            idx00 = r * w + c
            idx01 = r * w + (c + 1)
            idx10 = (r + 1) * w + c
            idx11 = (r + 1) * w + (c + 1)

            d00, d01 = depth_norm[r, c], depth_norm[r, c+1]
            d10, d11 = depth_norm[r+1, c], depth_norm[r+1, c+1]

            depth_ok_1 = abs(d00 - d01) < depth_threshold and abs(d00 - d10) < depth_threshold
            depth_ok_2 = abs(d01 - d10) < depth_threshold and abs(d01 - d11) < depth_threshold

            sem_ok_1 = sem_ok_2 = True
            if use_semantic:
                s00 = seg_map[r, c]
                sem_ok_1 = (s00 == seg_map[r+1, c] == seg_map[r, c+1])
                sem_ok_2 = (seg_map[r, c+1] == seg_map[r+1, c] == seg_map[r+1, c+1])

            if depth_ok_1 and sem_ok_1: faces.append([idx00, idx10, idx01])
            if depth_ok_2 and sem_ok_2: faces.append([idx01, idx11, idx10])

    mesh = o3d.geometry.TriangleMesh()
    mesh.vertices = o3d.utility.Vector3dVector(vertices)
    mesh.vertex_colors = o3d.utility.Vector3dVector(colors)
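    # reshape keeps an (N, 3) array even when no face passed the filters
    mesh.triangles = o3d.utility.Vector3iVector(np.array(faces, dtype=np.int32).reshape(-1, 3))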
    mesh.remove_unreferenced_vertices()
    mesh.compute_vertex_normals()

    pcd = o3d.geometry.PointCloud()
    pcd.points = o3d.utility.Vector3dVector(vertices)
    pcd.colors = o3d.utility.Vector3dVector(colors)

    return mesh, pcd, vertices, colors

# ===========================
# 5. Pipeline execution
# ===========================
def run_pipeline(input_pil, v_corr, h_corr, zoom, depth_scale, depth_gamma, point_density):
    if input_pil is None: return None, None, None, None, None, "❌ Please upload an image"
    if depth_estimator is None: return None, None, None, None, None, "❌ Depth model not loaded"

    try:
        # 1. Perspective correction (warp_image returns the input unchanged when no adjustment is set)
        processed_pil = warp_image(input_pil, v_corr, h_corr, zoom)

        # 2. Resize
        process_size = (512, 512)
        img_resized = processed_pil.resize(process_size, Image.LANCZOS)
        image_np = np.array(img_resized)

        # 3. AI inference
        depth_map = np.array(depth_estimator(img_resized)["depth"])

        seg_map = None
        seg_viz = None
        if semantic_segmenter is not None:
            try:
                seg_results = semantic_segmenter(img_resized)
                seg_map = np.zeros((process_size[1], process_size[0]), dtype=np.int32)
                for i, res in enumerate(seg_results):
                    mask = np.array(res['mask'].resize(process_size, Image.NEAREST))
                    seg_map[mask > 0] = i + 1
                seg_viz = Image.fromarray((seg_map * (255.0 / (len(seg_results)+1))).astype(np.uint8)).convert("L")
                seg_viz = seg_viz.resize(processed_pil.size, Image.NEAREST)
            except Exception:
                # Segmentation failed at inference time: fall back to geometry only
                seg_map = None

        if seg_viz is None: seg_viz = Image.new("L", processed_pil.size, 128)

        # 4. Generate the 3D data
        mesh, pcd, all_points, all_colors = process_3d_data(image_np, depth_map, seg_map, depth_scale, depth_gamma)

        # 5. Save files (key fix: also save a GLB for the preview)

        # A. Save .obj (for download)
        obj_path = OUTPUT_DIR / "model.obj"
        o3d.io.write_triangle_mesh(str(obj_path), mesh, write_vertex_normals=True, print_progress=False)

        # B. Save .ply (for download)
        ply_path = OUTPUT_DIR / "model.ply"
        o3d.io.write_point_cloud(str(ply_path), pcd, print_progress=False)

        # C. [New] Save .glb (for the web preview; fixes the blank viewer)
        glb_path = OUTPUT_DIR / "preview.glb"
        o3d.io.write_triangle_mesh(str(glb_path), mesh, print_progress=False)

        # 6. Point cloud preview
        total = len(all_points)
        sample = min(total, int(point_density))
        if sample > 0:
            indices = np.random.choice(total, sample, replace=False)
            web_points = all_points[indices]
            web_colors = all_colors[indices]
            color_strings = [f'rgb({int(c[0]*255)},{int(c[1]*255)},{int(c[2]*255)})' for c in web_colors]
            z_range = web_points[:,2].max() - web_points[:,2].min()
            visual_z_ratio = (z_range / max(process_size)) * 1.5 if max(process_size) > 0 else 1.0
        else:
            web_points = np.zeros((1, 3))
            color_strings = ['rgb(0,0,0)']
            visual_z_ratio = 1.0

        fig = go.Figure(data=[go.Scatter3d(
            x=web_points[:,0], y=web_points[:,1], z=web_points[:,2],
            mode='markers', marker=dict(size=2, color=color_strings, opacity=1.0)
        )])

        fig.update_layout(
            margin=dict(l=0,r=0,b=0,t=0),
            scene=dict(
                xaxis=dict(visible=False), yaxis=dict(visible=False), zaxis=dict(visible=True, title="Depth"),
                aspectmode='manual', aspectratio=dict(x=1, y=1, z=visual_z_ratio),
                camera=dict(eye=dict(x=1.5, y=0.2, z=0.3), center=dict(x=0,y=0,z=0), up=dict(x=0,y=1,z=0))
            )
        )

        return fig, processed_pil, str(glb_path), str(obj_path), str(ply_path), "✅ Generation complete (GLB preview enabled)"

    except Exception as e:
        import traceback
        traceback.print_exc()
        return None, None, None, None, None, f"❌ Error: {e}"

# ===========================
# 6. Interface
# ===========================
with gr.Blocks(title="AI 3D Suite (GLB Fix)") as demo:
    gr.Markdown("# 🚀 AI 3D Suite (Preview Fix)")

    with gr.Row():
        with gr.Column(scale=2):
            input_img = gr.Image(label="Input Image", type="pil", height=250)

            with gr.Accordion("📐 Keystone Correction", open=False):
                v_corr = gr.Slider(-1.0, 1.0, value=0, step=0.05, label="Vertical")
                h_corr = gr.Slider(-1.0, 1.0, value=0, step=0.05, label="Horizontal")
                zoom = gr.Slider(0.5, 1.5, value=1.0, step=0.05, label="Zoom")

            with gr.Group():
                depth_scale = gr.Slider(0.1, 5.0, value=2.0, label="Depth Scale (Z-Scale)")
                density = gr.Slider(5000, 30000, value=15000, label="Preview Density")

            run_btn = gr.Button("🚀 Generate 3D Model", variant="primary")
            status = gr.Textbox(label="Status")

        with gr.Column(scale=4):
            warped_view = gr.Image(label="Warped Preview", type="pil", height=200)

            with gr.Tabs():
                with gr.Tab("☁️ Cloud Preview"):
                    preview_plot = gr.Plot(label="Structure preview")
                with gr.Tab("🧊 Mesh Preview"):
                    # Fix: this component receives the .glb file
                    preview_mesh = gr.Model3D(label="Solid surface", clear_color=[0,0,0,0], display_mode="solid")

            with gr.Row():
                dl_obj = gr.File(label="📥 Download .obj (general-purpose format)")
                dl_ply = gr.File(label="☁️ Download .ply (point cloud format)")

    run_btn.click(
        fn=run_pipeline,
        inputs=[input_img, v_corr, h_corr, zoom, depth_scale, gr.Number(value=1.0, visible=False), density],
        outputs=[preview_plot, warped_view, preview_mesh, dl_obj, dl_ply, status]
    )

print("Launching...")
demo.launch(share=True, debug=True)

Optimising the runtime environment...

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
Device set to use cuda:0


  -> Loading semantic model...
  ⚠️ Failed to load semantic model; switching to geometry-only mode.
Launching...
Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
* Running on public URL: https://b9082c5ef051085678.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


