# Literature

> Important Papers I have read on the topic

In [None]:
#| default_exp core

In [None]:
#| hide
from nbdev.showdoc import *

## SceneNet: an Annotated Model Generator for Indoor Scene Understanding

SceneNet

* Framework for generating high-quality annotated 3D scenes

Goal:

* Aid Indoor Scene Understanding
* Flexible use for supervised training, 3D reconstruction benchmarks, rendered annotated videos / image sequences.

Method:

* Uses manually-annotated datasets of real-world scenes (e.g. NYUv2)
* learn statistics about object co-occurrences, spatial relationships.
* Hierarchical simulated annealing optimisation
* unlimited number of new annotated scenes
* Objects and Textures taken from existing databases
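
As a rough illustration of what "learning co-occurrence statistics" can mean, the sketch below counts how often two object categories appear in the same annotated scene. It is not the paper's actual procedure; the function name `cooccurrence_counts` and the toy scene lists are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(scenes):
    """Count how often each unordered pair of categories shares a scene."""
    counts = Counter()
    for labels in scenes:
        # Use the set of categories so duplicates within a scene count once.
        for a, b in combinations(sorted(set(labels)), 2):
            counts[(a, b)] += 1
    return counts


# Toy annotations (hypothetical, not taken from NYUv2).
scenes = [
    ["bed", "lamp", "nightstand"],
    ["bed", "nightstand"],
    ["desk", "chair", "lamp"],
]
counts = cooccurrence_counts(scenes)
print(counts.most_common(3))
```

Pairs with high counts would be candidates for a pairwise distance constraint during scene generation.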

Contributions:

* Dataset with 57 scenes, 5 scene categories
* Created by human designers and manually annotated at an object instance level.
* Method to automatically generate new physically realistic scenes.
* Scene generation formulated as an optimisation task
* Object relationships, co-occurrence and spatial arrangement learned from base scenes and NYUv2 dataset.
* Introduction of scene variety by sampling objects and textures from libraries.


*  What is [NYUv2](https://cs.nyu.edu/~fergus/datasets/nyu_depth_v2.html)?
    * Video Sequences from indoor scenes
    * Recorded in RGBD using Microsoft Kinect
    * Partially labeled dense multi-class labels
    * Indoor Segmentation and Support Inference from RGBD Images
    * Interpret the major surfaces, objects and support relations of an indoor scene from an RGBD image.
    * typical, messy, indoor scenes
    * floor, walls, supporting surfaces, object regions, recover support relationships
    * how do 3D cues inform a structured 3D interpretation?
    * Provides dataset of 464 diverse indoor scenes with detailed annotations.
    * Improved object segmentation by being able to infer support structures

Different Scene Categories: (roughly 10 scenes per category, 15-250 objects per scene)

* Bedrooms
* Office Scenes
* Kitchens
* Living Rooms
* Bathrooms


Automated Scene Generation

* Simulated Annealing
* Man-made scenes as base (SN-BS, NYUv2)
* Extract meaningful statistics that allow new configurations of objects to be generated
* Replace objects in the generated scene with objects of the same category, which are sampled from databases (ModelNet, Archive3D)
* Scene generation inspired by automatic furniture placement methods formulating the problem as an energy optimisation problem.
* A weighted sum of **constraints** is minimised via simulated annealing:
    * Bounding box intersection: Object bounding boxes should not intersect. Penalise deviation from constraint.
    * Pairwise distance: Pair together objects that are more likely to co-occur. (Maximum recommended distance $M$ is a metric in these pairwise constraints)
    * Visibility: ensure that one object is fully visible from the other. (Why is this necessary? Probably so that objects are spread evenly around the room and not clumped in one corner, just as in normal interior design.)
    * Distance to wall: model how likely an object is to be positioned against a wall, expressed as a distance
    * Angle to wall: ...and likewise its orientation relative to the wall
* Plug all these constraints into an overall energy function, a weighted sum of the individual constraint terms.
* The algorithm proposes configurations and then optimises them in an annealing process in which the perturbations in orientation and position are decreased in each iteration according to the annealing schedule.
    * Initialize with all objects centered at the origin.
    * In each iteration, the variables of randomly selected objects are locally perturbed until a maximum number of iterations is reached; this counts as one epoch.
    * After the epoch, check the bounding box and visibility constraints, and continue with the next epoch until a feasible configuration is found (1-3 epochs).
    * Once this object placement is finished, we get a realistically cluttered and laid-out scene.
* To get even better results, object groups are defined and moved together as part of the optimisation process. This allows each grouped arrangement to be more complex and realistic than would be possible with a purely global optimisation.
* There is no limit to the complexity and combination of the layers of groups in the scene that can be generated.
* One problem with the 3D objects is ensuring that they all share the same relative scale; this is done using an approach by Savva et al.
* Each object comes untextured and is assigned an appropriate texture from a texture library. It is UV-mapped automatically in Blender. This doesn't necessarily produce realistic textures, partly because only whole objects rather than individual subparts are textured, but its main purpose of providing some visual appearance features is still fulfilled.
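
To make the annealing loop more concrete, here is a heavily simplified sketch under several assumptions: objects are 2D axis-aligned boxes, only the bounding box intersection and pairwise distance terms are modelled, and the Metropolis acceptance rule, the linear schedule and all weights are my own choices rather than values from the paper. All names (`overlap_area`, `energy`, `anneal`) are hypothetical.

```python
import math
import random


def overlap_area(a, b):
    """Overlap area of two axis-aligned boxes given as (cx, cy, width, depth)."""
    dx = min(a[0] + a[2] / 2, b[0] + b[2] / 2) - max(a[0] - a[2] / 2, b[0] - b[2] / 2)
    dy = min(a[1] + a[3] / 2, b[1] + b[3] / 2) - max(a[1] - a[3] / 2, b[1] - b[3] / 2)
    return max(dx, 0.0) * max(dy, 0.0)


def energy(boxes, pairs, max_dist, w_overlap=10.0, w_pair=1.0):
    """Weighted sum of two penalty terms (weights are assumptions)."""
    total = 0.0
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            # Bounding box intersection: penalise any overlap.
            total += w_overlap * overlap_area(boxes[i], boxes[j])
    for i, j in pairs:
        # Pairwise distance: penalise co-occurring objects further apart than max_dist.
        d = math.dist(boxes[i][:2], boxes[j][:2])
        total += w_pair * max(d - max_dist, 0.0) ** 2
    return total


def anneal(boxes, pairs, max_dist, iterations=5000, t0=1.0, step0=1.0, seed=0):
    rng = random.Random(seed)
    current = [tuple(b) for b in boxes]
    e = energy(current, pairs, max_dist)
    best, best_e = list(current), e
    for k in range(iterations):
        frac = 1.0 - k / iterations        # linear schedule (assumption)
        temp = max(t0 * frac, 1e-9)
        step = step0 * frac                # perturbations shrink over time
        i = rng.randrange(len(current))    # perturb one randomly selected object
        cx, cy, w, d = current[i]
        candidate = list(current)
        candidate[i] = (cx + rng.uniform(-step, step), cy + rng.uniform(-step, step), w, d)
        e_new = energy(candidate, pairs, max_dist)
        # Metropolis rule: always accept improvements, sometimes accept worse states.
        if e_new <= e or rng.random() < math.exp((e - e_new) / temp):
            current, e = candidate, e_new
            if e < best_e:
                best, best_e = list(current), e
    return best, best_e


# All objects start centred at the origin, as in the notes above.
start = [(0.0, 0.0, 1.0, 1.0), (0.0, 0.0, 1.0, 1.0), (0.0, 0.0, 2.0, 1.0)]
start_e = energy(start, pairs=[(0, 1)], max_dist=2.0)
layout, final_e = anneal(start, pairs=[(0, 1)], max_dist=2.0)
print(start_e, final_e)
```

The visibility and wall terms, the orientation variables and the hierarchical grouping are left out here; each would be a further term in `energy` or a further perturbation type.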

## SceneNet RGB-D: Can 5M Synthetic Images Beat Generic ImageNet Pre-training on Indoor Segmentation?

SceneNet RGB-D

* Dataset
* Provides pixel-perfect ground truth for scene understanding problems (semantic segmentation, instance segmentation, object detection)
* camera poses, depth data (allows optical flow, camera pose estimation, 3d scene labeling)
* random scene layouts, physically simulated object configurations.

Goal:

* Comparison of semantic segmentation performance of SceneNet vs. VGG-16 ImageNet.
* Both fine tuned on the same SUN RGB-D and NYUv2 Datasets
* With Depth data included the performance is even better.
* Large-scale synthetic datasets with task-specific labels > real-world generic pre-training


What is interesting for me here?

* There were some open questions in the original SceneNet paper that could be answered here.
* More realistic rendering with raytracing
* Addition of Physics engine for object placement instead of just the annealing process.

The problem of Getting good data

* A core need in developing automated methods for scene understanding is having good labeled data with as much information as possible.
* ImageNet was a first step in this direction.
* Obtaining more data, such as RGB-D data, is very hard if done manually.
* One step in this direction has been taken by SceneNN and ScanNet, which reconstruct the scene geometry from RGB-D scans and manually annotate the resulting 3D scenes.
* Getting other and more reliable data from the real world is even more complicated and requires additional equipment or is not possible at all, e.g. when thinking about dynamic scenes.
* Generating high-quality synthetic data with realistic object placement, rendering, human-like camera poses and visual effects has the potential to solve many of these problems effectively

Contributions:

* Very large dataset with high-quality ray-traced RGB-D images, with lighting effects, motion blur, ground truth labels
* Dataset Generation pipeline relying on fully automatic randomised methods wherever possible.
* Proposition of an algorithm to generate camera trajectories
* Comparison of a Pretrained RGB-CNN from synthetic data with one that is trained on real-world data.

What is actually interesting here is that they use this randomized physics-based object placement approach and completely left out the object placement approach that we saw in the original SceneNet.



## BlenderProc SceneNet main.py

Parser:

* We read one scene file using the `scene_net_obj_path`, which also references the associated `.mtl` material file.
* The `scene_texture_path` defines the folder in which all the textures are stored. These are used to map them to the individual objects corresponding to the object types.
* The `output_dir` is just the path to where the generated hdf5 files will be saved

Label Mapping:

* I don't really understand how the label mappings work. We use a nyu_idset.csv file, which is an internal file from the BlenderProc utilities.
* My guess is that the `load_scenenet` method can somehow extract an identifier from the obj or material file, which this object-type mapping then turns into the object label stored in the custom property `category_id`.
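
To illustrate the guess above, here is one way such a name-to-id mapping could work in plain Python. This is not BlenderProc's implementation: the CSV excerpt, its `id,name` columns and the substring matching in `category_id_for` are all assumptions made for the sake of the example.

```python
import csv
import io

# Hypothetical excerpt; the real nyu_idset.csv contents are not reproduced here.
CSV_TEXT = """id,name
0,void
1,wall
2,floor
"""


def load_label_mapping(text):
    """Read a label-name to numeric-id mapping from CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["name"]: int(row["id"]) for row in reader}


def category_id_for(object_name, mapping, default="void"):
    """Pick the first label that occurs in the object name, else the default."""
    lowered = object_name.lower()
    for label, idx in mapping.items():
        if label != default and label in lowered:
            return idx
    return mapping[default]


mapping = load_label_mapping(CSV_TEXT)
print(category_id_for("Wall_01", mapping), category_id_for("teapot", mapping))
```

Whether the loader really matches on object or material names is something I would still need to check in the loader source.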

Objects:

* We use the special `load_scenenet` method, which loads the obj file and maps the textures from the folder to the objects using the previously loaded label mapping.

Handle Floors and Walls:

* Look for all walls by filtering the loaded objects by the custom property `category_id` and looking for the id `wall`.
* From these wall objects we extract floors using a builtin BlenderProc method and rename these newly generated objects as floor.
* We do the same with the ceilings, with the same builtin method but now looking for the inverse up-vector.
* The newly created floors and ceilings get the custom property "floor" or "ceiling" respectively.
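
The idea behind separating floors and ceilings by up-vector can be sketched without Blender: a face whose normal points along the up-vector is treated as floor, one whose normal points the opposite way as ceiling. The function names and the 7.5 degree tolerance below are assumptions for illustration, not the BlenderProc API.

```python
import math


def angle_to(normal, up):
    """Angle in radians between two non-zero 3D vectors."""
    dot = sum(n * u for n, u in zip(normal, up))
    norm = math.sqrt(sum(n * n for n in normal)) * math.sqrt(sum(u * u for u in up))
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.acos(max(-1.0, min(1.0, dot / norm)))


def classify_face(normal, up=(0.0, 0.0, 1.0), tolerance_deg=7.5):
    """Label a face as floor, ceiling or wall from its normal direction."""
    tol = math.radians(tolerance_deg)
    if angle_to(normal, up) <= tol:
        return "floor"
    if angle_to(normal, tuple(-c for c in up)) <= tol:
        return "ceiling"  # same test, but against the inverse up-vector
    return "wall"


print(classify_face((0.0, 0.0, 1.0)), classify_face((0.0, 0.0, -1.0)), classify_face((1.0, 0.0, 0.0)))
```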

Handle Lighting

* Lamps should emit light, so we find them by name using a regex and add a light surface with a relatively high emission strength, using the emission color of the material defined for the lamp.
* Also the ceilings emit a small bit of light for lighting up the whole room, simulating maybe some light coming in from the windows.

From the objects we create a bounding volume hierarchy.

Finding a camera location:

* We try up to 10000 times to find 5 poses
* We sample a location above floor level, at a height between 1.5 and 1.8 meters.
* We check that the camera is not standing on an object
* We find some random orientation for the camera
* We check that there are no objects directly in front of the camera (1.0 meter)
* We check that we have a good coverage of the scene with the objects that we are looking at. For doing that we use the bounding volume hierarchy of the scene.
* Once a pose has passed all these checks, we add it to the valid camera poses and try again until we have found 5 of them.
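
The sampling procedure above is a rejection sampling loop. The sketch below shows only that control flow in plain Python; `is_valid` is a hypothetical callback standing in for the obstacle and scene coverage checks, and the rectangular `floor_bounds` is a simplifying assumption.

```python
import math
import random


def sample_camera_poses(is_valid, floor_bounds, n_poses=5, max_tries=10000,
                        min_height=1.5, max_height=1.8, seed=0):
    """Draw random poses and keep those that pass the validity check."""
    rng = random.Random(seed)
    (x_min, x_max), (y_min, y_max) = floor_bounds
    poses, tries = [], 0
    while tries < max_tries and len(poses) < n_poses:
        tries += 1
        location = (
            rng.uniform(x_min, x_max),
            rng.uniform(y_min, y_max),
            rng.uniform(min_height, max_height),  # height above floor level
        )
        yaw = rng.uniform(0.0, 2.0 * math.pi)     # random orientation
        if is_valid(location, yaw):
            poses.append((location, yaw))
    return poses, tries


# Toy check: pretend an obstacle occupies the strip |x| <= 0.5.
poses, tries = sample_camera_poses(
    is_valid=lambda loc, yaw: abs(loc[0]) > 0.5,
    floor_bounds=((-2.0, 2.0), (-2.0, 2.0)),
)
print(len(poses), tries)
```

If no valid pose exists, the loop stops after `max_tries` attempts and returns fewer than `n_poses` poses, so the caller should check the length of the result.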

Now that we have loaded the objects, separated floors and ceilings, and defined good camera poses, we render the normal maps, depth maps and segmentation maps using the custom property `category_id`.

Then we just render the scene and write the results in the hdf5 format to the output dir.



## BlenderProc SceneNetLoader.py




## My takeaways

Large scale labelled datasets are important for supervised learning algorithms:

* We don't even want to do supervised learning in any form, but we do want labelled datasets of 3D scenes that we can render into RGB-D, semantic segmentation and instance segmentation data.
* Instead of loading arbitrary scenes and inferring object semantics, why not just generate arbitrary scenes that are still realistic?
* Or at least use pre-generated scenes, which already contain all the information we need, for first simple tasks.
* This full control over a scene and its generation also gives us the power to apply this to scene abstraction.
* The concept of rules and constraints on how a realistic scene is composed and laid out can be reused when abstracting a scene.
* The control over the scene generation allows us to replace individual object instances with abstracted object instances and to create a fully abstracted scene from a composition of abstracted objects.



## Key Information

* What is abstract?
    * Mathematical Abstraction (bounding with intervals)
    * Conceptual Abstraction (represent a family of possible renderings rather than one fixed image)

In [None]:
#| export
def foo(): pass

In [None]:
#| hide
import nbdev; nbdev.nbdev_export()