Tools to handle raster and vector data and split it into small, equal-sized pieces for machine learning datasets
- Install Docker (macOS, Windows, or Linux): https://docs.docker.com/engine/install/
- Clone the repository:
  git clone https://github.com/Youssef-Harby/split-rs-data.git
- Go to the project directory:
  cd split-rs-data
- Copy and paste your raster (.tif) and vector (.shp) files into separate folders:

./split-rs-data/DataSet/  # Dataset root directory
|--raster  # Original raster data
| |--xxx1.tif (xx1.png)
| |--...
| └--...
|
|--vector  # All shapefile parts in the same place (.shp, .shx, etc.)
| |--xxx1.shp
| |--xxx1.shx / .prj / .cpg / .dbf ...
| └--xxx2.shp
- Build the Docker image:
  docker compose up --build
- Go to http://127.0.0.1:8888/
- You will find your Jupyter token in the container's CLI output.
- Open Tutorial.ipynb to learn the workflow.
- Or define your vector and raster folders in the multi_raster_vector.py file and run it inside the Docker container: open a CLI and type:
  python multi_raster_vector.py
- Creating a Docker image for the development environment.
- Splitting raster data into equal pieces (512×512) with rasterio, thanks to @geoyee.
- Splitting raster data into equal pieces with GDAL (https://gdal.org/).
- Rasterizing a shapefile to raster with the same pixel size and projection as the satellite image.
- Converting 24- or 16-bit rasters to 8-bit.
- Exporting as JPG (for raster) and PNG (for the rasterized shapefile) with GDAL.
- Validation of training and testing datasets for PaddlePaddle.
- GUI
- QGIS Plugin ➡️ Deep Learning Datasets Maker
All these tools were made to prepare data for PaddlePaddle.
from osgeo import gdal, ogr
- fn_ras = input raster data (GTiff)
- fn_vec = input vector data (Shapefile)
fn_ras = 'DataSet/raster/01/01.tif'
fn_vec = 'DataSet/vector/01/01.shp'
output = 'DataSet/results/lab_all_values.tif'
Get the OGR driver "ESRI Shapefile" to open the shapefile
driver = ogr.GetDriverByName("ESRI Shapefile")
Open the raster and shapefile datasets. The second argument of driver.Open sets the access mode:
- (shapefile, 1): open the shapefile for reading and writing
- (shapefile, 0): open the shapefile read-only
ras_ds = gdal.Open(fn_ras)
vec_ds = driver.Open(fn_vec, 1)
Get the :
- GetLayer (shapefiles have only one layer; other formats may have multiple layers) #VECTOR
- GetGeoTransform #FROM RASTER
- GetProjection #FROM RASTER
lyr = vec_ds.GetLayer()
geot = ras_ds.GetGeoTransform()
proj = ras_ds.GetProjection() # Get the projection from original tiff (fn_ras)
geot
(342940.8074133941,
0.18114600000000536,
0.0,
3325329.401211367,
0.0,
-0.1811459999999247)
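The six numbers above are a GDAL geotransform: (origin_x, pixel_width, row_rotation, origin_y, col_rotation, pixel_height). As a minimal sketch (pure Python, using the values printed above, no GDAL needed), a pixel/line index can be mapped to map coordinates like this:

```python
# GDAL geotransform: (origin_x, pixel_w, row_rot, origin_y, col_rot, pixel_h)
geot = (342940.8074133941, 0.18114600000000536, 0.0,
        3325329.401211367, 0.0, -0.1811459999999247)

def pixel_to_map(col, row, gt):
    """Map a (col, row) pixel index to map coordinates (the pixel's upper-left corner)."""
    x = gt[0] + col * gt[1] + row * gt[2]
    y = gt[3] + col * gt[4] + row * gt[5]
    return x, y

# Pixel (0, 0) is the raster's upper-left corner; pixel_height (gt[5]) is
# negative because row indices grow downward in a north-up image.
print(pixel_to_map(0, 0, geot))
```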
Get the layer definition and create a feature from it so we can work with its fields
layerdefinition = lyr.GetLayerDefn()
feature = ogr.Feature(layerdefinition)
feature.GetFieldIndex lets you look up the index of a specific field name that you want to read, edit, or delete.
- You can also list all fields in the shapefile with:
schema = []
for n in range(layerdefinition.GetFieldCount()):
    fdefn = layerdefinition.GetFieldDefn(n)
    schema.append(fdefn.name)
- Then delete the field called "MLDS" (a field name I chose) if it already exists
yy = feature.GetFieldIndex("MLDS")
if yy < 0:
    print("MLDS field not found; we will create one for you and set all values to 1")
else:
    lyr.DeleteField(yy)
Add a new field to the shapefile with a default value of 1, and don't forget to close the feature after the edits
new_field = ogr.FieldDefn("MLDS", ogr.OFTInteger)
lyr.CreateField(new_field)
for feature in lyr:
    feature.SetField("MLDS", 1)
    lyr.SetFeature(feature)
feature = None
Set the projection from original tiff (fn_ras) to the rasterized tiff
drv_tiff = gdal.GetDriverByName("GTiff")
chn_ras_ds = drv_tiff.Create(
    output, ras_ds.RasterXSize, ras_ds.RasterYSize, 1, gdal.GDT_Byte)
chn_ras_ds.SetGeoTransform(geot)
chn_ras_ds.SetProjection(proj)
chn_ras_ds.FlushCache()
# With ATTRIBUTE set, the burn value is taken from the MLDS field of each feature
gdal.RasterizeLayer(chn_ras_ds, [1], lyr, burn_values=[1], options=["ATTRIBUTE=MLDS"])
chn_ras_ds = None
vec_ds = None
DONE
ds = gdal.Open(fn_ras)
gt = ds.GetGeoTransform()
get coordinates of upper left corner
xmin = gt[0]
ymax = gt[3]
resx = gt[1]
res_y = gt[5]
resy = abs(res_y)
import math
import os.path as osp
the tile size we want (maybe 256×256 for a smaller memory footprint)
needed_out_x = 512
needed_out_y = 512
round up to the nearest int
xnotround = ds.RasterXSize / needed_out_x
xround = math.ceil(xnotround)
ynotround = ds.RasterYSize / needed_out_y
yround = math.ceil(ynotround)
print(xnotround)
print(xround)
print(ynotround)
print(yround)
9.30078125
10
5.689453125
6
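From the outputs above, the raster is 4762×2913 pixels (9.30078125 × 512 and 5.689453125 × 512). The ceiling division can also be checked with plain integer arithmetic (a pure-Python sketch using those derived sizes):

```python
import math

raster_x, raster_y, tile = 4762, 2913, 512  # derived from the ratios printed above

# math.ceil on the float ratio...
xround = math.ceil(raster_x / tile)
yround = math.ceil(raster_y / tile)

# ...is equivalent to ceiling division with integers:
assert xround == (raster_x + tile - 1) // tile
assert yround == (raster_y + tile - 1) // tile

print(xround, yround)
```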
pixel to meter - 512×10×0.18
pixtomX = needed_out_x * xround * resx
pixtomy = needed_out_y * yround * resy
print(pixtomX)
print(pixtomy)
927.4675200000274
556.4805119997686
size of a single tile in map units
xsize = pixtomX / xround
ysize = pixtomy / yround
print(xsize)
print(ysize)
92.74675200000274
92.74675199996143
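As a sanity check, the tile width in map units should equal the tile width in pixels times the pixel resolution (a pure-Python sketch using the values from the geotransform above):

```python
resx = 0.18114600000000536   # pixel width from the geotransform
needed_out_x = 512           # tile width in pixels
xsize = needed_out_x * resx  # tile width in map units
print(xsize)
```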
create lists of x and y coordinates
xsteps = [xmin + xsize * i for i in range(xround + 1)]
ysteps = [ymax - ysize * i for i in range(yround + 1)]
xsteps
[342940.8074133941,
343033.5541653941,
343126.3009173941,
343219.0476693941,
343311.7944213941,
343404.54117339413,
343497.28792539414,
343590.03467739414,
343682.78142939415,
343775.5281813941,
343868.2749333941]
set the output path
cdpath = "DataSet/image/"
loop over min and max x and y coordinates
for i in range(xround):
    for j in range(yround):
        xmin = xsteps[i]
        xmax = xsteps[i + 1]
        ymax = ysteps[j]
        ymin = ysteps[j + 1]
        # gdal.Translate subsets the input raster to this tile's window
        gdal.Translate(
            osp.join(cdpath, "01-" + str(j) + "-" + str(i) + ".jpg"),
            ds,
            projWin=(xmin, ymax, xmax, ymin),  # ulx, uly, lrx, lry
            xRes=resx,
            yRes=-resy,
            outputType=gdal.gdalconst.GDT_Byte,
            format="JPEG")
ds = None
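The loop above writes xround × yround tiles named <scene>-<row>-<col>.jpg; the naming scheme alone can be sketched in pure Python (xround and yround taken from the outputs above):

```python
xround, yround = 10, 6  # values computed above

# One name per tile: outer loop over columns (i), inner over rows (j),
# mirroring the gdal.Translate loop's "01-<j>-<i>.jpg" pattern.
names = ["01-" + str(j) + "-" + str(i) + ".jpg"
         for i in range(xround) for j in range(yround)]

print(len(names))
print(names[0])
```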
For data that has not yet been divided into training, validation, and test sets, PaddleSeg provides a script to split the data and generate the file lists.
The data file structure is as follows:
./DataSet/ # Dataset root directory
|--image # Original image catalog
| |--xxx1.jpg (xx1.png)
| |--...
| └--...
|
|--label # Annotated image catalog
| |--xxx1.png
| |--...
| └--...
The corresponding file names can be defined according to your needs.
The command used is as follows; specific functions can be enabled through different flags:
python tools/split_dataset_list.py <dataset_root> <images_dir_name> <labels_dir_name> ${FLAGS}
Parameters:
- dataset_root: Dataset root directory
- images_dir_name: Original image catalog
- labels_dir_name: Annotated image catalog
FLAGS:
FLAG | Meaning | Default | Number of parameters |
---|---|---|---|
--split | Dataset split ratio | 0.7 0.3 0 | 3 |
--separator | File list separator | "\|" | 1 |
--format | Data format of images and label sets | "jpg" "png" | 2 |
--label_class | Label classes | '__background__' '__foreground__' | several |
--postfix | Filter images and label sets by whether the file stem (name without extension) ends with the given suffix | "" "" (2 empty strings) | 2 |
After running, train.txt, val.txt, test.txt, and labels.txt will be generated in the root directory of the dataset.
Note: requirements for generating the file list: either the number of original images equals the number of annotated images, or there are only original images without annotations. If the dataset lacks annotated images, a file list without separators and annotated-image paths will be generated.
python tools/split_dataset_list.py <dataset_root> images annotations --split 0.6 0.2 0.2 --format jpg png
- If you need to use a custom dataset for training, it is recommended to organize it into the following structure:

custom_dataset
|--images
| |--image1.jpg
| |--image2.jpg
| |--...
|
|--labels
| |--label1.png
| |--label2.png
| |--...
|
|--train.txt
|--val.txt
|--test.txt
The contents of train.txt and val.txt are as follows:
image/image1.jpg label/label1.png
image/image2.jpg label/label2.png
...
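A file list in this format can be read back into (image, label) pairs with a few lines of Python (a sketch; it assumes the default whitespace separator, and the paths are the hypothetical ones shown above):

```python
from io import StringIO

# Simulated train.txt contents (hypothetical paths, as in the example above)
train_txt = StringIO(
    "image/image1.jpg label/label1.png\n"
    "image/image2.jpg label/label2.png\n"
)

pairs = []
for line in train_txt:
    parts = line.split()
    if len(parts) == 2:        # image path + label path
        pairs.append((parts[0], parts[1]))
    elif len(parts) == 1:      # image only (dataset without annotations)
        pairs.append((parts[0], None))

print(pairs)
```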
Full Docs : https://github.com/PaddlePaddle/PaddleSeg/blob/release/2.3/docs/data/custom/data_prepare.md
import sys
import subprocess
theproc = subprocess.Popen([
"python",
r"C:\Users\Youss\Documents\pp\New folder\split-rs-data\split_dataset_list.py",  # split-list Python script
r"C:\Users\Youss\Documents\pp\New folder\split-rs-data\DataSet",  # dataset root path
r"C:\Users\Youss\Documents\pp\New folder\split-rs-data\DataSet\image",  # images path
r"C:\Users\Youss\Documents\pp\New folder\split-rs-data\DataSet\label",  # labels path
# "--split",
# "0.6", # 60% training
# "0.2", # 20% validating
# "0.2", # 20% testing
"--format",
"jpg",
"png"])
theproc.communicate()
(None, None)
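The Popen/communicate pair above can also be written with subprocess.run, which waits for the child process and returns a CompletedProcess whose exit code you can check. A minimal runnable sketch (it invokes the Python interpreter itself as a stand-in, since the split_dataset_list.py path above is machine-specific):

```python
import subprocess
import sys

# Stand-in command: ask the interpreter for its version. In the real call,
# replace the argument list with the script path, dataset paths, and flags
# shown above.
result = subprocess.run([sys.executable, "--version"],
                        capture_output=True, text=True)
print(result.returncode)  # 0 means the process exited successfully
```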