# Step 2 - Load the shapefiles into PostGIS

In this section we will use the GDAL command-line utility [ogr2ogr](https://gdal.org/programs/ogr2ogr.html), a powerful tool that converts between a wide range of vector data formats. We will use it to load the shapefiles into PostGIS. The following commands might look intimidating at first because of their many parameters, but each parameter is explained right after the first command.

**Your turn:**
- Similar to the previous step, open a terminal and navigate to the folder of this datastory (the same folder where this notebook is located).
- Replace DATABASE_NAME, HOST, PORT, USERNAME and PASSWORD in the commands below with the connection information of the PostGIS sandbox component.
- Run both commands below to load the data into the database.

Load the road network data by running this command in the terminal:
```shell
ogr2ogr \
-f "PostgreSQL" \
-progress \
-nln "zh_roads" \
-nlt PROMOTE_TO_MULTI \
-lco FID=fid \
-lco GEOMETRY_NAME=geom \
--config OGR_TRUNCATE YES \
PG:"dbname='DATABASE_NAME' host='HOST' port='PORT' user='USERNAME' password='PASSWORD'" \
"./data/20220405_veloFusswegnetzZurich/taz_mm.tbl_routennetz.shp"
```

***
Let's have a look at the parameters:
- `-f "PostgreSQL"` - Specify the target format to be a PostgreSQL (PostGIS) table.
- `-progress` - Display a progress bar when loading the data.
- `-nln "zh_roads"` - The name of the new database table should be zh_roads.
- `-nlt PROMOTE_TO_MULTI` - If single and multi geometries are mixed, promote all to multi to have uniform geometries.
- `-lco FID=fid` - Create a feature id column named fid.
- `-lco GEOMETRY_NAME=geom` - Set the name of the geometry column to geom.
- `--config OGR_TRUNCATE YES` - Delete all rows before loading data if a table with that name already exists. This allows overwriting existing data without destroying views on the data.
- `PG:"dbname='DATABASE_NAME' host='HOST' port='PORT' user='USERNAME' password='PASSWORD'"` - A connection string holds all necessary data to establish a connection to the database. Replace DATABASE_NAME, HOST, PORT, USERNAME and PASSWORD with the connection information of the PostGIS sandbox component. 
- `"./data/20220405_veloFusswegnetzZurich/taz_mm.tbl_routennetz.shp"` - Path to the file to load.
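
The parameters above can also be assembled programmatically, which can help when several shapefiles are loaded with the same settings. The following is a minimal sketch, not part of the original workflow: the function name `build_ogr2ogr_command` is hypothetical, and the connection values are the same placeholders used in the command above.

```python
import shlex


def build_ogr2ogr_command(shapefile_path, table_name, dbname, host, port, user, password):
    """Assemble the ogr2ogr argument list used in this section (hypothetical helper)."""
    # Connection string in the same PG:"..." form as in the shell command.
    connection = (
        f"PG:dbname='{dbname}' host='{host}' port='{port}' "
        f"user='{user}' password='{password}'"
    )
    return [
        "ogr2ogr",
        "-f", "PostgreSQL",
        "-progress",
        "-nln", table_name,
        "-nlt", "PROMOTE_TO_MULTI",
        "-lco", "FID=fid",
        "-lco", "GEOMETRY_NAME=geom",
        "--config", "OGR_TRUNCATE", "YES",
        connection,       # destination: the PostGIS database
        shapefile_path,   # source: the shapefile to load
    ]


command = build_ogr2ogr_command(
    "./data/20220405_veloFusswegnetzZurich/taz_mm.tbl_routennetz.shp",
    "zh_roads",
    "DATABASE_NAME", "HOST", "PORT", "USERNAME", "PASSWORD",  # placeholders
)

# Print the command in a shell-safe form so it can be inspected before running.
print(shlex.join(command))
```

To actually run the command from Python, the list could be passed to `subprocess.run(command, check=True)`, assuming `ogr2ogr` is available on the PATH.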


***
Now load the districts data:

```shell
ogr2ogr \
-f "PostgreSQL" \
-progress \
-nln "zh_districts" \
-nlt PROMOTE_TO_MULTI \
-lco FID=fid \
-lco GEOMETRY_NAME=geom \
--config OGR_TRUNCATE YES \
PG:"dbname='DATABASE_NAME' host='HOST' port='PORT' user='USERNAME' password='PASSWORD'" \
"./data/20220405_statistischeQuartiereZurich/stzh.adm_statzonen_v.shp"
```

**That's it, congratulations!** You now have the data of the two shapefiles available as their own tables in PostGIS. Your coworkers can now access the data from a central place by connecting to the database, for example via QGIS.

***
# (OPTIONAL) Use pgAdmin to check the new tables in the database

**Your turn:**
- Use pgAdmin to connect to the database and check if you see the new tables.

![check tables](./story_images/check_tables.gif)
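
If pgAdmin is not at hand, a similar check could be sketched in Python. This is only an illustration under some assumptions: the helper names `list_tables` and `missing_tables` are hypothetical, SQLAlchemy and a PostgreSQL driver such as psycopg2 are assumed to be installed, and the tables are assumed to live in the default schema.

```python
def list_tables(connection_string):
    """Return the table names in the default schema of the database."""
    # Imported here so that missing_tables() below works without SQLAlchemy.
    from sqlalchemy import create_engine, inspect

    engine = create_engine(connection_string)
    return inspect(engine).get_table_names()


def missing_tables(existing_tables, expected_tables=("zh_roads", "zh_districts")):
    """Return the expected tables that are not among the existing ones."""
    existing = set(existing_tables)
    return [name for name in expected_tables if name not in existing]


# Example with a hard-coded list instead of a live database:
print(missing_tables(["spatial_ref_sys", "zh_roads"]))
```

Against the sandbox database, the call would look like `missing_tables(list_tables('postgresql://USERNAME:PASSWORD@HOST:PORT/DATABASE_NAME'))`, with the placeholders replaced as before. An empty list means both tables were found.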

***
# (OPTIONAL) Explore data & load to PostGIS using Python GeoPandas
It is also possible to use the Python ecosystem to explore spatial data and load it into PostGIS. This section is optional, as the result is equivalent to using `ogrinfo` and `ogr2ogr`; it is simply a different set of tools for the same goal. Depending on your preferences and your existing technology stack, you might prefer one way over the other. Here Python is used to explore and load the districts dataset into PostGIS.

We will make use of [GeoPandas](https://geopandas.org/en/stable/), which is built on the widely used Python package pandas. GeoPandas interfaces with many other specialized packages of the Python geo-ecosystem, so reading, exploring, plotting and writing spatial data all work from a single DataFrame-like object.

The following sections will only consider the districts data. The procedure would be identical for the road network data.

### Explore district data
Run the following cells to read and explore the data.

In [None]:
import geopandas

# Reading data is straightforward with GeoPandas. Nice to know: under the hood GeoPandas relies on a
# specialized I/O package for reading and writing data (Fiona, or pyogrio in newer GeoPandas versions).
districts_data = geopandas.read_file("./data/20220405_statistischeQuartiereZurich/stzh.adm_statzonen_v.shp")

In [None]:
# Using .head(N) we can display the first N rows of data.
districts_data.head(3)

In [None]:
# GeoPandas makes it easy to obtain all kinds of information about the data we loaded.
print(f'Nr of features: {len(districts_data)}')
print(f'Coordinate reference system: {districts_data.crs}')
print(f'Nr of attribute columns: {len(districts_data.columns)}')
print(40*'-')
print('Column names:')
for column in districts_data.columns:
    print(column)

In [None]:
# Using .plot() generates static visualizations. 
# It uses the famous matplotlib package under the hood. 
districts_data.plot()

In [None]:
# There is even the possibility to visualize data in an interactive way using .explore().
# This is possible thanks to GeoPandas making use of the folium Python package.
districts_data.explore(column='stzname', legend=False)

### Load the data into PostGIS
Under the hood, GeoPandas uses the packages [GeoAlchemy2](https://geoalchemy-2.readthedocs.io/en/latest/) and [SQLAlchemy](https://www.sqlalchemy.org/), which specialize in interacting with databases. The first step is to create a _connection string_, a short text that follows a particular convention and contains all the information needed to connect to the database. This connection string is then used to establish a connection to the database (called `engine` below), which GeoPandas uses to load the data into PostGIS.

**Your turn:**
- Replace DATABASE_NAME, HOST, PORT, USERNAME and PASSWORD in the cell below with the connection information of the PostGIS sandbox component. Make sure to keep the quotes (') so that Python reads the connection information as strings (text).
- Run both cells below to load the data into the database.
- Once again you can use pgAdmin to check the newly created table in the database.

In [None]:
user = 'USERNAME'
password = 'PASSWORD'
host = 'HOST'
port = 'PORT'
database_name = 'DATABASE_NAME'

connection_string = f'postgresql://{user}:{password}@{host}:{port}/{database_name}'
print(f'{connection_string=}')
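
One caveat with building the connection string by hand: characters such as `@` or `/` in a password have a special meaning in a URL and would break the string. A minimal sketch of one way to handle this is to URL-encode the user name and password with the standard library. The helper name `build_connection_string`, the password `p@ss/word` and the port `5432` are made up for illustration only.

```python
from urllib.parse import quote_plus


def build_connection_string(user, password, host, port, database_name):
    """Build a PostgreSQL URL with URL-encoded credentials (hypothetical helper)."""
    return (
        f"postgresql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database_name}"
    )


# Illustrative values only; the password contains '@' and '/' on purpose.
example = build_connection_string("USERNAME", "p@ss/word", "HOST", "5432", "DATABASE_NAME")
print(example)
# prints: postgresql://USERNAME:p%40ss%2Fword@HOST:5432/DATABASE_NAME
```

If the credentials contain only letters and digits, the encoded string is identical to the plain f-string version used in the cell above.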

In [None]:
from sqlalchemy import create_engine

table_name = 'zh_districts_from_geopandas' 
print(f'Start loading to PostGIS table with name {table_name}...')
engine = create_engine(connection_string)
districts_data.to_postgis(table_name, engine, if_exists='replace', index=False)
print('Successfully loaded')