# Alpine Peaks from OpenStreetMap
## How to combine a bunch of Shapefiles (or QGIS vector layers) into a single one without duplicates
I've just downloaded a lot of points, all the peaks of the Alps, from [OpenStreetMap](https://www.openstreetmap.org) using the [QuickOSM](https://plugins.qgis.org/plugins/QuickOSM/) plugin in [QGIS](https://www.qgis.org) (search key: natural, value: peak). To avoid connection timeouts, I had to narrow down my area of interest for each request. Now I want to combine the resulting QGIS layers (shapefiles) into one layer, without the duplicates. This would be tricky in QGIS, but Python comes to the rescue.

In the second section of this notebook, I explain [how to check if a feature is within a polygon (e.g. within a country)](#section2).

Jupyter Notebook by [Florian Neukirchen](https://www.riannek.de/).

Sorry, this notebook comes without the data. It should be easy to adapt the code to your own project.

![QGIS with 8 layers of peak data downloaded from openstreetmap](https://raw.githubusercontent.com/florianneukirchen/jupyter-notebooks/main/alpinepeaks.png)

In [1]:
import pandas as pd
import geopandas as gpd
import os

folder = 'alpinepeaks'

Create a list of all shapefiles

In [2]:
filelist = []
for file in os.listdir(folder):
    if file.endswith('.shp'):
        filelist.append(file)

filelist

FileNotFoundError: [Errno 2] No such file or directory: 'alpinepeaks'
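
The `FileNotFoundError` above only reflects that the notebook ships without the data folder. As a minimal sketch, the same list could be built a bit more defensively with `pathlib`; the helper name `list_shapefiles` is my own invention, and matching the suffix case-insensitively is an assumption about how the files might be named.

```python
from pathlib import Path


def list_shapefiles(folder):
    """Return the sorted names of all .shp files in folder.

    Hypothetical helper: returns an empty list if the folder does not exist.
    """
    path = Path(folder)
    if not path.is_dir():
        return []
    # Compare the suffix in lower case so that 'peaks.SHP' is found as well
    return sorted(p.name for p in path.iterdir() if p.suffix.lower() == ".shp")


filelist = list_shapefiles("alpinepeaks")
print(filelist)
```

With the data in place, this should give the same names as the loop above, only in sorted order.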

Read all shapefiles into a single geopandas GeoDataFrame

In [122]:
gdf = gpd.read_file(os.path.join(folder, filelist[0]))
gdf.set_index('full_id', inplace=True)

for file in filelist[1:]:
    newgdf = gpd.read_file(os.path.join(folder, file))
    newgdf.set_index('full_id', inplace=True)
    gdf = pd.concat([gdf, newgdf], sort=True) # Sort columns
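
Growing `gdf` inside the loop works, but each `pd.concat` call copies everything read so far. A common pandas idiom is to collect the frames in a list and concatenate once. The sketch below uses small made-up DataFrames as stand-ins for the layers read with `gpd.read_file`; the column values are illustrative only.

```python
import pandas as pd

# Stand-ins for the per-area layers (made-up values);
# in the notebook these would come from gpd.read_file().
frames = [
    pd.DataFrame({"full_id": ["n1", "n2"], "name": ["A", "B"]}),
    pd.DataFrame({"full_id": ["n2", "n3"], "ele": ["1200", "2400"]}),
]

# Set the index on each frame, then concatenate all of them in one call.
# sort=True sorts the columns, as in the loop above.
combined = pd.concat([f.set_index("full_id") for f in frames], sort=True)
print(combined)
```

The duplicated ID `n2` is still present after this step; removing duplicates is a separate operation.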

In [123]:
# remove duplicates
gdf = gdf[~gdf.index.duplicated(keep='first')]
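
To see what `index.duplicated(keep='first')` does, here is a tiny stand-alone example. The two IDs and names also appear in the outputs further down; the duplicated row and its label are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["Boč", "Boč (second download)", "Göller"]},
    index=pd.Index(["n26862596", "n26862596", "n26863052"], name="full_id"),
)

# True marks every repeated index value except its first occurrence
mask = df.index.duplicated(keep="first")
deduped = df[~mask]
print(deduped)
```

Only the first row for each `full_id` survives, so the order in which the files were read decides which copy is kept.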

In [124]:
gdf.head()

| full_id | osm_id | osm_type | natural | note_de | source_add | source_not | peak | operator | checkpoint | checkpoi_1 | ... | ele_AT | ele_nn_1 | cross_star | image_pano | ele_müm | name_de_DE | name_de_AT | cross_mate | name_fa | nat_name |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| n26862532 | 26862532 | node | peak |  |  |  |  |  |  |  | ... |  |  |  |  |  |  |  |  |  |  |
| n26862596 | 26862596 | node | peak |  |  |  |  |  |  |  | ... |  |  |  |  |  |  |  |  |  |  |
| n26862749 | 26862749 | node | peak |  |  |  |  |  |  |  | ... |  |  |  |  |  |  |  |  |  |  |
| n26862811 | 26862811 | node | peak |  |  |  |  |  |  |  | ... |  |  |  |  |  |  |  |  |  |  |
| n26863052 | 26863052 | node | peak |  |  |  |  |  |  |  | ... |  |  |  |  |  |  |  |  |  |  |


In [125]:
gdf.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
Index: 53440 entries, n26862532 to n10594918978
Columns: 299 entries, osm_id to nat_name
dtypes: geometry(1), object(298)
memory usage: 124.3+ MB


### Get rid of uninteresting data
Chances are we have a lot of NaN values. First I remove the rows without a name (I don't need them).

In [126]:
gdf = gdf[gdf['name'].notna()]

We have a lot of useless columns (mostly NaN):

In [127]:
len(list(gdf.columns))

299

In [128]:
sorted(list(gdf.columns))

['KT HPO',
 'PDOP_2',
 'TOP_NOMAI',
 'access',
 'addr_city',
 'addr_hamle',
 'addr_place',
 'addr_stree',
 'aerialway',
 'aerialway_',
 'alias',
 'alt_NHN',
 'alt_WGS84',
 'alt_loc_na',
 'alt_müm',
 'alt_name',
 'alt_name2',
 'alt_name_1',
 'alt_name_2',
 'alt_name_d',
 'alt_name_e',
 'alt_name_f',
 'alt_name_h',
 'alt_name_i',
 'alt_name_l',
 'alt_name_r',
 'alt_name_s',
 'alt_name_v',
 'amenity',
 'archaeolog',
 'artist_nam',
 'artwork_ty',
 'backrest',
 'bench',
 'board_type',
 'boundary',
 'c2c_id',
 'castle_t_1',
 'castle_typ',
 'check_date',
 'checkpoi_1',
 'checkpoint',
 'climbing',
 'climbing_g',
 'climbing_s',
 'colour',
 'comment',
 'communic_1',
 'communicat',
 'constructi',
 'cross',
 'cross_colo',
 'cross_heig',
 'cross_mate',
 'cross_name',
 'cross_star',
 'denominati',
 'descript_1',
 'descript_2',
 'descriptio',
 'designatio',
 'direction',
 'dispute_wi',
 'dsou',
 'ele',
 'ele_AT',
 'ele_NN',
 'ele_barome',
 'ele_de',
 'ele_ft',
 'ele_müa',
 'ele_müm',
 'ele_nn_1',
 'e

To check for e.g. 'outdoor':

In [129]:
gdf[gdf['outdoor'].notna()].dropna(axis=1) # Only two rows

| full_id | osm_id | osm_type | natural | name | ele | geometry | sport | outdoor | climbing_s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| n2315712582 | 2315712582 | node | peak | Gitschenwand | 1527 | POINT (13.25964 47.63427) | climbing | yes | yes |
| n3608997834 | 3608997834 | node | peak | Wilhelmswand | 1527 | POINT (13.23998 47.63271) | climbing | yes | yes |
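
Instead of checking columns one by one, it may help to count the non-missing values per column and keep only the well-filled ones. This is a sketch on a made-up frame; the threshold `min_count` is a hypothetical parameter, not something used in the notebook.

```python
import pandas as pd

# Toy frame standing in for gdf (all values are made up)
df = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "ele": ["1200", None, "2400", "1800"],
    "outdoor": [None, None, "yes", None],
})

# Number of non-missing values per column, best-filled columns first
counts = df.notna().sum().sort_values(ascending=False)
print(counts)

min_count = 2  # hypothetical threshold
keep = counts[counts >= min_count].index.tolist()
print(keep)
```

A column that is rarely filled can still be interesting, so a list like this is only a starting point for a manual selection.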


Only keep useful columns


In [130]:
useful = ['name', 'prominence', 'ele', 'geometry', 'name_en', 'name_fr', 'name_it', 'name_de', 'alias', 'alt_name', 'sport', 'name_1', 'name_de_AT',
 'name_de_DE', 'name_ch', 'name_sl']

In [131]:
gdf = gdf[useful]

Add lat and lon columns

In [None]:
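# Note: .x and .y are only defined for Point geometries. If the layer still
# contains MultiPoints (see the last section), explode them first.
gdf['lon'] = gdf.geometry.x
gdf['lat'] = gdf.geometry.y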

### Save
I save the result as GeoJSON.

In [132]:
gdf.to_file(os.path.join(folder, "peaks.geojson"), driver='GeoJSON')

<a id='section2'></a>

## How to check if a feature is within a polygon (e.g. within a country)
To my peak data I want to add the corresponding countries and mountain areas. A peak can be on a border and belong to more than one country, but it should only be within one mountain area. This means we need different approaches. 

### Add countries
For this, I downloaded the countries of the area with QuickOSM (key: admin_level, value: 2). 

In [133]:
# Import if you haven't already
import pandas as pd
import geopandas as gpd
import os

folder = 'alpinepeaks'

# Optionally load again
gdf = gpd.read_file(os.path.join(folder, "peaks.geojson"))
gdf.set_index('full_id', inplace=True)

In [134]:
countries = gpd.read_file(os.path.join(folder, 'countries.gpkg'))

In [135]:
# Get rid of the double Monaco
countries.replace('Monaco (territorial waters)', 'Monaco', inplace=True)
countries = countries.dissolve(by='name:en')

The last line also sets 'name:en' as index. That means we only need index + geometry

In [136]:
countries = countries[['geometry']]

Add a buffer of 50 m to countries (first reproject from WGS84 to UTM) to make sure that peaks on the border really get both countries.

In [137]:
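# UTM zone 32N (EPSG:32632, 6°E to 12°E) covers the central Alps;
# in this CRS the buffer distance is given in metres.
countries = countries.to_crs("EPSG:32632")
countries.geometry = countries.geometry.buffer(50)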
# Back to WGS84
countries = countries.to_crs("EPSG:4326")
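
The detour through UTM is needed because `buffer` works in the units of the layer's CRS, and in WGS84 those units are degrees. A rough back-of-the-envelope calculation (spherical Earth with a mean radius of 6371 km, which is an approximation) shows that 50 m corresponds to a different number of degrees in latitude and in longitude at the latitude of the Alps:

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius, spherical approximation


def metres_per_degree(lat_deg):
    """Approximate length in metres of one degree of latitude and of longitude."""
    per_deg_lat = math.pi * EARTH_RADIUS_M / 180
    per_deg_lon = per_deg_lat * math.cos(math.radians(lat_deg))
    return per_deg_lat, per_deg_lon


lat_m, lon_m = metres_per_degree(46.5)  # roughly the latitude of the Alps
print(f"1 degree of latitude  is about {lat_m:.0f} m")
print(f"1 degree of longitude is about {lon_m:.0f} m")
print(f"50 m = {50 / lat_m:.6f} deg (north-south), {50 / lon_m:.6f} deg (east-west)")
```

A buffer given in degrees would therefore be stretched in one direction, which is why the buffer is applied in a projected CRS.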

I create one column per country with True or False. 

**The next cell is slow.**

In [138]:
for c in countries.index:
    print('processing', c)
    gdf[c] = gdf['geometry'].within(countries.loc[c].geometry) 

processing Austria
processing France
processing Germany
processing Italy
processing Liechtenstein
processing Monaco
processing Slovenia
processing Switzerland
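
Why the buffer matters: `within` is strict about boundaries, so a point lying exactly on a polygon's edge is not within it. The following sketch uses two made-up square "countries" in metre coordinates (not real data) to show that a peak on the shared border is only assigned to both after buffering.

```python
from shapely.geometry import Point, box

# Two made-up "countries" sharing the border x = 1000 (coordinates in metres)
west = box(0, 0, 1000, 1000)
east = box(1000, 0, 2000, 1000)
peak = Point(1000, 500)  # exactly on the shared border

# Without a buffer the border point is within neither polygon
unbuffered = (peak.within(west), peak.within(east))

# With a 50 m buffer it is within both
buffered = (peak.within(west.buffer(50)), peak.within(east.buffer(50)))

print(unbuffered, buffered)
```

Real borders digitised twice rarely coincide exactly, so the buffer also absorbs small mismatches between neighbouring polygons.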


In [139]:
gdf.head()

| full_id | name | prominence | ele | name_en | name_fr | name_it | alias | alt_name | sport | name_1 | ... | name_de_DE | geometry | Austria | France | Germany | Italy | Liechtenstein | Monaco | Slovenia | Switzerland |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| n26862532 | Basališče | 606 | 1272 |  |  |  |  |  |  |  | ... |  | POINT (15.24009 46.39020) | False | False | False | False | False | False | True | False |
| n26862596 | Boč | 657 | 978 |  |  |  |  |  |  |  | ... |  | POINT (15.59975 46.28959) | False | False | False | False | False | False | True | False |
| n26862749 | Črni vrh | 945 | 1543 |  |  |  |  |  |  |  | ... |  | POINT (15.23320 46.48562) | False | False | False | False | False | False | True | False |
| n26862811 | Dürrenstein | 809 | 1878 |  |  |  |  |  |  |  | ... |  | POINT (15.05943 47.78573) | True | False | False | False | False | False | False | False |
| n26863052 | Göller | 632 | 1766 |  |  |  |  |  |  |  | ... |  | POINT (15.49189 47.79369) | True | False | False | False | False | False | False | False |


Save again

In [140]:
gdf.to_file(os.path.join(folder, "peaks2.geojson"), driver='GeoJSON')

### Add mountain area
From OpenStreetMap I downloaded the relations tagged region:type=mountain_area. I assume that a peak lies in only one mountain area, so I add a single column.

In [141]:
area = gpd.read_file(os.path.join(folder, 'mountain_region.gpkg'))

In [142]:
area.head()

|  | full_id | osm_id | osm_type | name:th | name:pnb | name:li | name:kk | name:he | name:bh | name:ast | ... | name:fr | name:en | name:de | name:cs | name | boundary | TMC:cid_58:tabcd_1:LocationCode | TMC:cid_58:tabcd_1:LCLversion | TMC:cid_58:tabcd_1:Class | geometry |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | r2110285 | 2110285 | relation |  |  |  |  |  |  |  | ... | Alpes de Berchtesgaden | Berchtesgaden Alps | Berchtesgadener Alpen |  | Berchtesgadener Alpen |  | 6003.0 | 9.0 | Area | MULTIPOLYGON (((13.16332 47.59015, 13.16348 47... |
| 1 | r2110286 | 2110286 | relation |  |  |  |  |  |  |  | ... |  |  | Göllstock |  | Göllstock |  |  |  |  | MULTIPOLYGON (((13.07942 47.71139, 13.07956 47... |
| 2 | r2110287 | 2110287 | relation |  |  |  |  |  |  |  | ... |  |  | Hagengebirge |  | Hagengebirge |  |  |  |  | MULTIPOLYGON (((13.16822 47.58268, 13.16680 47... |
| 3 | r2110288 | 2110288 | relation |  |  |  |  |  |  |  | ... | Hochkönig | Hochkönig | Hochkönigstock |  | Hochkönigstock |  |  |  |  | MULTIPOLYGON (((13.21158 47.44340, 13.21271 47... |
| 4 | r2110289 | 2110289 | relation |  |  |  |  |  |  |  | ... |  |  | Untersberg |  | Untersberg |  |  |  |  | MULTIPOLYGON (((13.04564 47.71541, 13.04552 47... |


I define a function so I don't need to loop over the data frame.

In [143]:
def mountain_area(point):
    for a in area.index:
        if point.within(area.loc[a].geometry):
            return area.loc[a]['name']
    return None
    

In [144]:
gdf['mountain_area'] = gdf['geometry'].map(mountain_area)
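
Mapping a Python function over every point tests each peak against every area in turn. A possible alternative, assuming a geopandas version that supports the `predicate` argument of `sjoin` (0.10 or later), is a spatial join, which uses a spatial index internally. The layers below are made-up stand-ins for `gdf` and `area`. Note that if areas overlap, the match that survives the de-duplication is not guaranteed to be the one the function above would return.

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Made-up stand-ins for the peak layer and the mountain-area layer
peaks = gpd.GeoDataFrame(
    {"name": ["P1", "P2", "P3"]},
    geometry=[Point(1, 1), Point(5, 5), Point(20, 20)],
    crs="EPSG:4326",
)
areas = gpd.GeoDataFrame(
    {"name": ["Area A", "Area B"]},
    geometry=[box(0, 0, 3, 3), box(4, 4, 8, 8)],
    crs="EPSG:4326",
)

# Rename first so the area name does not clash with the peak name
areas = areas.rename(columns={"name": "mountain_area"})

# Left join keeps peaks that fall in no area (their mountain_area is NaN)
joined = gpd.sjoin(peaks, areas, how="left", predicate="within")

# Keep a single row per peak in case several areas match
joined = joined[~joined.index.duplicated(keep="first")]
peaks["mountain_area"] = joined["mountain_area"]
print(peaks[["name", "mountain_area"]])
```

The same pattern, without the de-duplication step, could also replace the per-country loop if one row per peak and country is acceptable.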

In [145]:
gdf.head()

| full_id | name | prominence | ele | name_en | name_fr | name_it | alias | alt_name | sport | name_1 | ... | geometry | Austria | France | Germany | Italy | Liechtenstein | Monaco | Slovenia | Switzerland | mountain_area |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| n26862532 | Basališče | 606 | 1272 |  |  |  |  |  |  |  | ... | POINT (15.24009 46.39020) | False | False | False | False | False | False | True | False | Karavanke |
| n26862596 | Boč | 657 | 978 |  |  |  |  |  |  |  | ... | POINT (15.59975 46.28959) | False | False | False | False | False | False | True | False |  |
| n26862749 | Črni vrh | 945 | 1543 |  |  |  |  |  |  |  | ... | POINT (15.23320 46.48562) | False | False | False | False | False | False | True | False | Karawanken und Bachergebirge |
| n26862811 | Dürrenstein | 809 | 1878 |  |  |  |  |  |  |  | ... | POINT (15.05943 47.78573) | True | False | False | False | False | False | False | False | Ybbstaler Alpen |
| n26863052 | Göller | 632 | 1766 |  |  |  |  |  |  |  | ... | POINT (15.49189 47.79369) | True | False | False | False | False | False | False | False | Mürzsteger Alpen |


Save again

In [146]:
gdf.to_file(os.path.join(folder, "peaks2.geojson"), driver='GeoJSON')

## Get rid of MultiPoints and other problems
These were crashing my QGIS workflow.

In [197]:
# This ele can't be converted to float; the original value was '1843.41;1843.69'
gdf.at['n670140790', 'ele']

'1843.41'

In [198]:
gdf.at['n670140790', 'ele'] = '1843.41'
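
If more than one `ele` value is malformed, fixing them by hand does not scale. A sketch of a general clean-up with pandas, on a made-up Series (only the value `'1843.41;1843.69'` comes from the data above): keep the part before the first semicolon and let `pd.to_numeric` turn anything unparseable into NaN.

```python
import pandas as pd

# Made-up examples of ele values as they might come from OSM tags
ele = pd.Series(["1272", "1843.41;1843.69", None, "approx. 2000"], name="ele")

# Keep only the part before the first semicolon
first_value = ele.str.split(";").str[0]

# Convert to float; entries that cannot be parsed become NaN
ele_numeric = pd.to_numeric(first_value, errors="coerce")
print(ele_numeric)
```

Whether taking the first of several values is the right choice depends on the data; it simply mirrors the manual fix above.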

In [202]:
from shapely.geometry import Point
gdf[gdf.geometry.geom_type != 'Point']

| full_id | name | prominence | ele | name_en | name_fr | name_it | alias | alt_name | sport | name_1 | ... | geometry | Austria | France | Germany | Italy | Liechtenstein | Monaco | Slovenia | Switzerland | mountain_area |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| n26862538 | Becco di Filadonna |  | 2150 |  |  |  |  |  |  |  | ... | MULTIPOINT (11.19347 45.96363) | False | False | False | True | False | False | False | False | Prealpi Vicentine |
| n26862539 | Col Becchei Dessora |  | 2794 |  |  |  |  |  |  |  | ... | MULTIPOINT (12.04457 46.60775) | False | False | False | True | False | False | False | False | Dolomiti |
| n26862546 | Benediktenwand |  | 1800 |  |  |  |  |  |  |  | ... | MULTIPOINT (11.46554 47.65317) | False | False | True | False | False | False | False | False | Bayerische Voralpen |
| n26862641 | Buchstein |  | 1701 |  |  |  |  |  |  |  | ... | MULTIPOINT (11.67937 47.63307) | False | False | True | False | False | False | False | False | Bayerische Voralpen |
| n26862661 | Kesselkogel - Catinaccio d'Antermoia |  | 3004 |  |  | Catinaccio d'Antermoia |  |  |  |  | ... | MULTIPOINT (11.64383 46.47409) | False | False | False | True | False | False | False | False | Dolomiti |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| n10588808919 | Monte del Riposo |  | 2109 |  |  |  |  |  |  |  | ... | MULTIPOINT (11.43760 46.61288) | False | False | False | True | False | False | False | False | Sarntaler Alpen |
| n10588859449 | Colle Sella |  | 1059 |  |  |  |  |  |  |  | ... | MULTIPOINT (11.35257 46.53738) | False | False | False | True | False | False | False | False | Sarntaler Alpen |
| n10589170841 | Monte Fondoli |  | 1534 |  |  |  |  |  |  |  | ... | MULTIPOINT (11.53246 46.65265) | False | False | False | True | False | False | False | False | Sarntaler Alpen |
| n10589263330 | Cornetto Bianco |  | 2383 |  |  |  |  |  |  |  | ... | MULTIPOINT (11.46531 46.69760) | False | False | False | True | False | False | False | False | Sarntaler Alpen |


In [206]:
gdf = gdf.explode(index_parts=False)
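
`explode` splits every multi-part geometry into one row per part. For intuition, this is what happens to a single one-part MultiPoint, sketched with plain shapely; the helper `to_point` is hypothetical and not part of the notebook.

```python
from shapely.geometry import MultiPoint, Point


def to_point(geom):
    """Return the Point inside a one-part MultiPoint; leave other geometries unchanged."""
    if geom.geom_type == "MultiPoint" and len(geom.geoms) == 1:
        return geom.geoms[0]
    return geom


mp = MultiPoint([(11.19347, 45.96363)])  # Becco di Filadonna, from the table above
p = to_point(mp)
print(mp.geom_type, "->", p.geom_type, p.x, p.y)
```

Unlike `explode`, this helper leaves a MultiPoint with several parts untouched, so no index value would be repeated.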

In [207]:
gdf[gdf.geometry.geom_type != 'Point']

| full_id | name | prominence | ele | name_en | name_fr | name_it | alias | alt_name | sport | name_1 | ... | Austria | France | Germany | Italy | Liechtenstein | Monaco | Slovenia | Switzerland | mountain_area | geometry |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |


In [208]:
gdf.to_file(os.path.join(folder, "peaks2.geojson"), driver='GeoJSON')