# Example notebook for processing point cloud data for PointNet

For this example I downloaded the training split of the "Oakland" dataset (http://www.cs.cmu.edu/~vmr/datasets/oakland_3d/cvpr09/doc/) and converted it to multiple LAZ files for demonstration purposes.

In [1]:
from pathlib import Path


import numpy as np
import pandas as pd
import dask.dataframe as dd
from dask.delayed import delayed
from laspy.file import File  # laspy 1.x API (laspy 2.x uses laspy.read instead)
import morton
# also requires fastparquet

#%load_ext line_profiler

## Reading LAS (or LAZ)

In [2]:
#lasfile_dir = Path('data/training')
lasfile_dir = Path('/root/data/test')
lasfiles = sorted(list(lasfile_dir.glob('*.las')))
lasfiles

[PosixPath('/root/data/test/notw_grid00_00-00.las')]

In [3]:
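# Columns that are not used as features further on.
#dropped_columns = ['flag_byte', 'scan_angle_rank', 'user_data', 'pt_src_id']
dropped_columns = ['flag_byte', 'scan_angle_rank', 'user_data', 'pt_src_id','gps_time']

# Empty frame describing the columns and dtypes that load() returns.
# load() casts every column to float, so meta declares float throughout.
meta = pd.DataFrame(np.empty(0, dtype=[('X',float),('Y',float),('Z',float),
                                       ('intensity',float),('raw_classification',float)]))

@delayed
def load(file):
    """Lazily read one LAS file into a pandas DataFrame."""
    with File(file.as_posix(), mode='r') as las_data:
        las_df = pd.DataFrame(las_data.points['point'], dtype=float).drop(dropped_columns, axis=1)
        # Raw X/Y/Z are stored as scaled integers; 0.01 is the scale factor
        # assumed for these files (check the LAS header if yours differ).
        las_df.X = las_df.X*0.01
        las_df.Y = las_df.Y*0.01
        las_df.Z = las_df.Z*0.01
        return las_df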

In [4]:
dfs = [load(file) for file in lasfiles]
df = dd.from_delayed(dfs, meta=meta)
df = df.repartition(npartitions=20)

I often write intermediate steps to Parquet storage so that I can experiment freely with the dataframe. I believe loading Parquet is not (or not much) faster than loading LAS.

In [6]:
#df.to_parquet('/root/data/test/oakland', compression='GZIP')

## Spatial partitioning

### Translate origin

In [7]:
#df = dd.read_parquet('/home/tom/vision/data/training/oakland')

In [5]:
df['X'] = df.X - df.X.min()
df['Y'] = df.Y - df.Y.min()

In [9]:
#%time df.to_parquet('/home/tom/vision/data/training/oakland_trans', compression='GZIP')

### Compute grid cell identifier

In [10]:
#df = dd.read_parquet('/home/tom/vision/data/training/oakland_trans')

In [6]:
grid_size = 10.0 #feet
m = morton.Morton(dimensions=2, bits=64)

In [7]:
def get_hash(point, grid_size=grid_size):
    return m.pack(int(point.X // grid_size), int(point.Y // grid_size))

In [8]:
df['hash'] = df[['X', 'Y']].apply(get_hash, grid_size=grid_size, meta=('hash', int), axis=1)
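
The `hash` column is a 2D Morton (Z-order) code of the grid cell that a point falls in. As a rough, pure-Python sketch of what such a code looks like, the snippet below interleaves the bits of the two cell indices. The helper names `interleave2` and `cell_hash` are hypothetical, and the bit order (X in the even bits, Y in the odd bits) is an assumption about the `morton` package; it does agree with the hash value 82 that appears for points around X=120, Y=17 in the outputs of this notebook.

```python
def interleave2(x, y, bits=32):
    """Interleave the bits of two non-negative ints (x in even bits, y in odd bits)."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)       # bit i of x goes to position 2i
        code |= ((y >> i) & 1) << (2 * i + 1)   # bit i of y goes to position 2i + 1
    return code


def cell_hash(x, y, grid_size=10.0):
    """Map a translated (x, y) coordinate to the Morton code of its grid cell."""
    return interleave2(int(x // grid_size), int(y // grid_size))


example = cell_hash(120.14, 17.66)  # grid cell (12, 1)
print(example)
```

Nearby cells tend to get nearby codes, which is why a Morton code is a convenient single-column identifier for spatial partitioning.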

In [9]:
%time df.to_parquet('/root/data/test/00_hash', compression='GZIP')

CPU times: user 4min 20s, sys: 2.5 s, total: 4min 22s
Wall time: 4min 10s


## Normalization
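
Normalisation in this section is done per grid cell: within each `hash` group, X and Y are centred on the group mean and divided by the group's range (max minus min). The following is a minimal, pandas-only sketch of that idea on made-up numbers. The toy frame `points` is purely illustrative, and `groupby(...).transform` is swapped in here for the `apply` call used in this notebook.

```python
import pandas as pd

# Toy data: two grid cells (hash 82 and hash 102) with a single coordinate.
points = pd.DataFrame({
    'hash': [82, 82, 82, 102, 102],
    'X': [120.0, 125.0, 130.0, 100.0, 104.0],
})

grouped = points.groupby('hash')['X']
# transform() returns one value per row, aligned with the original index.
span = grouped.transform('max') - grouped.transform('min')
points['XN'] = (points['X'] - grouped.transform('mean')) / span

print(points)
```

Because the result stays aligned with the original rows, no extra index level is introduced. Note that a cell whose points all share the same X has a range of zero, so the division yields NaN for that cell.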

In [15]:
#df = dd.read_parquet('/home/tom/vision/data/training/oakland_hash')

In [9]:
meta = pd.DataFrame(np.empty(0, dtype=list(zip(list(df.columns), list(df.dtypes))) + \
                             list(zip(['XN', 'YN'], [np.dtype('float64')]*2))))
print(meta)
def normalise(df):
    df = df.copy()
    df['XN'] = (df.X - df.X.mean()) / (df.X.max() - df.X.min())
    df['YN'] = (df.Y - df.Y.mean()) / (df.Y.max() - df.Y.min())
    df['raw_classification'] = df['raw_classification'].replace(8.0,1.0)
    return df
df = df.groupby('hash').apply(normalise, meta=meta).reset_index(drop=True)
df = df.groupby('hash').apply(normalise, meta=meta)
#df.groupby('hash')
#df = df.rename(columns={'hash': 'hash30f'})
print(df.head())

df['ZN'] = (df.Z - df.Z.mean()) / (df.Z.max() - df.Z.min())
#new_hashes = df['raw_classification'].unique().compute()
#print(new_hashes)
%time df.to_parquet('/root/data/test/00_norm2', compression='GZIP')


Empty DataFrame
Columns: [X, Y, Z, intensity, raw_classification, hash, XN, YN]
Index: []
                 X      Y       Z  intensity  raw_classification  hash  \
hash                                                                     
82   13361  120.14  17.66  612.91       83.0                 1.0    82   
     13362  121.64  17.16  632.87       79.0                 1.0    82   
     13409  124.21  18.32  632.40       81.0                 1.0    82   
     13410  124.73  15.15  632.57       92.0                 1.0    82   
     13411  126.64  15.41  632.60       94.0                 1.0    82   

                  XN        YN  
hash                            
82   13361 -0.457499  0.265800  
     13362 -0.301736  0.214359  
     13409 -0.034861  0.333701  
     13410  0.019137  0.007569  
     13411  0.217475  0.034318  


ValueError: cannot insert hash, already exists
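
This error most likely comes from `hash` being present both as an index level and as a regular column after the second `groupby('hash').apply(...)`: writing the index out requires turning it back into a column, which collides with the existing one. The pandas-only sketch below reproduces the collision on a hypothetical toy frame (`frame` is not the notebook's data) and shows that `reset_index(drop=True)` avoids it.

```python
import pandas as pd

# Toy frame with 'hash' as a column AND as the name of the index.
frame = pd.DataFrame({'hash': [82, 82, 102], 'X': [1.0, 2.0, 3.0]})
frame.index = pd.Index(frame['hash'].values, name='hash')

try:
    frame.reset_index()          # tries to insert a second 'hash' column
    raised = False
except ValueError as err:
    raised = True
    message = str(err)
    print(message)

# Dropping the index instead of re-inserting it keeps a single 'hash' column.
flat = frame.reset_index(drop=True)
print(flat)
```

In the cell above, dropping the duplicated `groupby('hash').apply(...)` line, or calling `reset_index(drop=True)` after it, should have the same effect.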

In [16]:
#df.index.name = None
#new_hashes = df['hash'].unique().compute()
#print(new_hashes)
#df2 = df.drop(columns=['hash'])
#df2.sort_index(inplace=True)
#print(df2.head())
%time df2.to_parquet('/root/data/test/00_norm', compression='GZIP')

KeyError: "None of ['index'] are in the columns"

That's it for data preparation. The final normalized dataset is the one you will use for training.

## Split dataset

Before training and testing the model you should split this dataset into `train`, `test` and `validation`. The split is made on grid cell hashes rather than on individual points. I also implemented this code in `train_custom.py`.

In [2]:
#df = dd.read_parquet('/home/tom/vision/data/training/oakland_norm')
#hashes = df.index.unique()
#new_hashes = df['raw_classification'].unique().compute()
#print(hashes)
df = dd.read_parquet('/root/data/test/00_norm')
print(df.head())

            X      Y       Z  intensity  raw_classification  hash        XN  \
index                                                                         
0      120.14  17.66  612.91       83.0                 1.0    82 -0.457499   
1      121.64  17.16  632.87       79.0                 1.0    82 -0.301736   
2      124.21  18.32  632.40       81.0                 1.0    82 -0.034861   
3      124.73  15.15  632.57       92.0                 1.0    82  0.019137   
4      126.64  15.41  632.60       94.0                 1.0    82  0.217475   

             YN        ZN  
index                      
0      0.265800 -0.006580  
1      0.214359  0.001008  
2      0.333701  0.000829  
3      0.007569  0.000894  
4      0.034318  0.000905  


In [8]:
##df = df.compute()
df2 = df.set_index('hash',drop=False)
df2.index.name = 'index'
#print(df2.head())
#df2 = df2.groupby('hash').filter(lambda x: len(x) > 100)
#print(df2.head())

hashes = df2.index.unique().values
# first split in train/test
train_test_msk = np.random.rand(len(hashes))
train_val_hashes = hashes[train_test_msk < 0.8]
test_hashes = hashes[~(train_test_msk < 0.8)]
# then split train again in train/val
#train_val_msk = np.random.rand(len(train_val_hashes))
#train_hashes = train_val_hashes[train_val_msk < 0.8]
#validation_hashes = train_val_hashes[~(train_val_msk < 0.8)]


In [10]:
import json
hashes = df['hash'].unique().compute().values
print(hashes)
train_test_msk = np.random.rand(len(hashes))
train_val_hashes = hashes[train_test_msk < 0.8]
test_hashes = hashes[~(train_test_msk < 0.8)]

train_val_msk = np.random.rand(len(train_val_hashes))
train_hashes = train_val_hashes[train_val_msk < 0.8]
validation_hashes = train_val_hashes[~(train_val_msk < 0.8)]

with open('/root/data/test/data_split2.json', 'w') as data_split:
    json.dump({'train': train_hashes.tolist(), 'validation': validation_hashes.tolist(), 'test': test_hashes.tolist()}, data_split)

[    82    102    111 ... 673292 673347 673608]
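
The split above draws from `np.random.rand` without a seed, so the three subsets change on every run. Below is a small sketch of the same two-stage 80/20 split with a seeded generator, so that the result can be reproduced. The function name `split_hashes` and its parameters `train_frac` and `seed` are hypothetical and not part of `train_custom.py`.

```python
import numpy as np


def split_hashes(hashes, train_frac=0.8, seed=0):
    """Split grid cell hashes into train/validation/test, reproducibly."""
    rng = np.random.default_rng(seed)
    hashes = np.asarray(hashes)
    # First split: train+validation versus test.
    in_train_val = rng.random(len(hashes)) < train_frac
    train_val, test = hashes[in_train_val], hashes[~in_train_val]
    # Second split: train versus validation.
    in_train = rng.random(len(train_val)) < train_frac
    return {'train': train_val[in_train],
            'validation': train_val[~in_train],
            'test': test}


split = split_hashes(np.arange(1000))
sizes = {name: len(values) for name, values in split.items()}
print(sizes)
```

Since whole grid cells are assigned to a subset, all points of a cell end up together and no cell is shared between training and testing.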


In [5]:
print(len(df))
print(len(df2))

6101474
999249


## Generator

To feed the data to the deep learning network you need a generator that serves fixed-size batches. I also implemented this code in `train_custom.py`.

In [11]:
from random import shuffle

def generator(df, hashes, BATCH_SIZE, NUM_POINT, N_AUGMENTATIONS, shuffled=True):
    """
    Generator function to serve the data to the algorithm.
    
    IN: df (the entire dataframe, indexed by grid cell hash), hashes (the indices to serve), 
        BATCH_SIZE and NUM_POINT (to set output shape),
        N_AUGMENTATIONS (the number of "augmentations" or iterations of sampling)
    OUT: data, label (batch of data and corresponding labels)
    """
    data_channels = ['X', 'Y', 'Z', 'XN', 'YN','intensity']
    
    # One (seed, hash) pair per augmentation of every grid cell.
    seed_hash = [(seed, h) for seed in range(N_AUGMENTATIONS) for h in hashes]
    shuffle(seed_hash)
    
    batches = [seed_hash[i:i+BATCH_SIZE] for i in range(0,len(seed_hash),BATCH_SIZE)]
    # Drop an incomplete last batch so that every batch has the same shape.
    if batches and len(batches[-1]) < BATCH_SIZE:
        batches = batches[:-1]
    if shuffled:
        for batch in batches:
            shuffle(batch)
        
    def random_sample_block(group, seed):
        """
        Sample entirely random for the entire grid cell
        IN: group (all points in a grid cell), seed (random state value)
        OUT: data_group (a subset of the points in the grid cell; a training sample)
        """
        if len(group) > NUM_POINT:
            data_group = group.sample(n=NUM_POINT, replace=False, random_state=seed)
        else:
            # Too few points in this cell: sample with replacement to reach NUM_POINT.
            data_group = group.sample(n=NUM_POINT, replace=True, random_state=seed)
        return data_group

    for batch in batches:
        df_batch = [random_sample_block(df.loc[h], s) for s,h in batch]
        data = np.stack([b[data_channels].values for b in df_batch])
        label = np.stack([l.raw_classification.values for l in df_batch])
        yield data, label
BATCH_SIZE=12
NUM_POINT=400
N_AUGMENTATIONS=1
num_samples = 0  # counts samples served (batches * BATCH_SIZE), not batches
for batch_data, batch_label in generator(df2, test_hashes, BATCH_SIZE, NUM_POINT, N_AUGMENTATIONS):
    num_samples += BATCH_SIZE
    if num_samples % 10 == 0:
        print('Current batch num: {0}'.format(num_samples))

Current batch num: 60
Current batch num: 120
Current batch num: 180
Current batch num: 240
Current batch num: 300
Current batch num: 360
Current batch num: 420
Current batch num: 480
Current batch num: 540
Current batch num: 600
Current batch num: 660
Current batch num: 720
Current batch num: 780
Current batch num: 840
Current batch num: 900
Current batch num: 960
Current batch num: 1020
Current batch num: 1080
Current batch num: 1140
Current batch num: 1200
Current batch num: 1260
Current batch num: 1320
Current batch num: 1380
Current batch num: 1440
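
The part of the generator that fixes the output shape is the per-cell sampling: every grid cell is reduced or padded to the same number of points so that the cells can be stacked into one array. The sketch below illustrates this on two hypothetical toy cells. The helper `sample_fixed` and the frames `small` and `large` are made up for the example, and the single channel `X` stands in for the six data channels used above.

```python
import numpy as np
import pandas as pd


def sample_fixed(group, num_point, seed):
    """Return exactly num_point rows, sampling with replacement only if needed."""
    replace = len(group) < num_point
    return group.sample(n=num_point, replace=replace, random_state=seed)


small = pd.DataFrame({'X': np.arange(5.0), 'label': [1, 1, 2, 2, 2]})   # 5 points
large = pd.DataFrame({'X': np.arange(50.0), 'label': [1] * 50})         # 50 points

batch = [sample_fixed(cell, 8, seed=0) for cell in (small, large)]
data = np.stack([b[['X']].values for b in batch])      # (cells, points, channels)
label = np.stack([b['label'].values for b in batch])   # (cells, points)

print(data.shape, label.shape)
```

Cells with fewer points than requested contain repeated points after sampling, while larger cells are subsampled without repeats.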


## Adapting `train.py`

Check out `train_custom.py` for my adaptations to the `train.py` from the original PointNet codebase. It implements the data splitting and the generator described earlier.

## Train PointNet

In [25]:
import os
os.chdir('/root/pointnet/sem_seg/') 
print(os.getcwd())
%run train_custom.py --log_dir=log --max_epoch=50 --num_point=4096 --batch_size=12 --grid_size=30 --xyz --intensity --n_augmentations=1

/root/pointnet/sem_seg


AttributeError: module 'tensorflow' has no attribute 'placeholder'

In [2]:
df = dd.read_parquet('/root/data/test/00_norm')
print(df.head())

                 X          Y       Z  intensity  raw_classification  \
index                                                                  
0      13418154.94  322151.85  612.60       83.0                 2.0   
1      13418155.03  322169.09  612.70       71.0                 2.0   
2      13418155.54  322165.97  612.52       81.0                 2.0   
3      13418156.05  322162.81  612.66       79.0                 2.0   
4      13418156.58  322159.62  612.58       83.0                 2.0   

              hash        XN        YN        ZN  
index                                             
0      87401410077 -0.491372 -0.130226 -0.006698  
1      87401410077 -0.473336  0.476603 -0.006660  
2      87401410077 -0.371131  0.366782 -0.006729  
3      87401410077 -0.268927  0.255554 -0.006675  
4      87401410077 -0.162714  0.143269 -0.006706  


In [9]:
print(df.index.max().compute())

626276
