# Creating two unified databases, with only the necessary information

We are going to create a database containing only information on a select few cities in every country. As a reference, we chose to consider one city per 5 million inhabitants of the country.

We will, as such, consider the largest cities of each country, one for every 5 million inhabitants, until the country's entire population is accounted for.
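As a rough illustration of this rule, the number of cities kept for a country can be computed by dividing its population by 5 million and rounding up. This is only a sketch: `citiesToKeep` is a hypothetical helper, the population figures are approximate, and `countriesAndCities.listCities` may apply the rule differently.

```python
import math

def citiesToKeep(population, inhabitantsPerCity=5_000_000):
    """Number of cities to keep: one per 5 million inhabitants, rounded up."""
    return max(1, math.ceil(population / inhabitantsPerCity))

# Approximate populations, used only for illustration
print(citiesToKeep(83_000_000))  # Germany -> 17
print(citiesToKeep(9_000_000))   # Austria -> 2
```

Rounding up (rather than down) is an assumption, chosen because it yields at least one city for small countries.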

We'll use a web scraping method to collect the information on a country, and save the list of cities we'll consider in a dictionary.

In [1]:
import pandas as pd
import geopandas as gpd
import countriesAndCities

In [3]:
countries = dict() #a dictionary containing the cities we will work on during this project

# 1. Germany

We'll start by getting the different relevant cities for Germany, and then working on the different geojson files to create a relevant database

In [4]:
countries['Germany'] = countriesAndCities.listCities('Germany')

In [5]:
print(countries)

{'Germany': ['Berlin', 'Hamburg', 'Munich', 'Köln', 'Frankfurt am Main', 'Essen', 'Stuttgart', 'Dortmund', 'Düsseldorf', 'Bremen', 'Hannover', 'Leipzig', 'Duisburg', 'Nürnberg', 'Dresden', 'Wandsbek', 'Bochum']}


In this dictionary, the city of Munich uses its anglicised name. However, in the actual database, the name used is the German one: München.

In [None]:
countries['Germany'][2] = 'München'

The city of Frankfurt am Main is referred to as Frankfurt in the database.

In [None]:
countries['Germany'][4] = 'Frankfurt'

In [None]:
countries

# 1.1. Stations

We'll start by creating a database of all stations in the selected cities

In [None]:
stations = 'Germany/railwayStationNodes.geojson'

deutscheBahnStations = gpd.read_file(stations)

In [None]:
deutscheBahnStations.columns

We can start by dropping the column containing the nature of the node, and the index of the node

In [None]:
deutscheBahnStations = deutscheBahnStations.drop('formOfNode', axis = 1)
deutscheBahnStations = deutscheBahnStations.drop('id', axis = 1)

In [None]:
deutscheBahnStations

We check, for every single row, whether the name of the station contains the name of one of the selected cities. The name of the station is the third value (index 2) of each row, stored in the `geographicalName` column.
Furthermore, we require a space after the name of each city, to avoid matching names that are merely derived from a city, such as street names (as in Berlin -> Berliner).

We are going to select every row containing data on a station in one of the cities, and then concatenate these separate dataframes.

In [None]:
dfListStations = []
for city in countries['Germany']:
    tempFrame = deutscheBahnStations.loc[deutscheBahnStations['geographicalName'].str.contains(city + ' ')]
    dfListStations.append(tempFrame)
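The trailing-space check is simple, but it has a blind spot: a station whose name is exactly the city name, or ends with it, is not matched. One possible alternative, sketched here with made-up station names, is a regular expression with word boundaries. `matchesCity` is a hypothetical helper, not part of this notebook.

```python
import re

def matchesCity(stationName, city):
    """True if `city` appears in `stationName` as a whole word."""
    pattern = r'\b' + re.escape(city) + r'\b'
    return re.search(pattern, stationName) is not None

# Made-up station names, for illustration only
names = ['Berlin Hbf', 'Hamburg Berliner Tor', 'Berlin']

for name in names:
    # trailing-space check used above vs. word-boundary check
    print(name, '|', 'Berlin ' in name, '|', matchesCity(name, 'Berlin'))
```

Since `str.contains` interprets its argument as a regular expression by default, the same pattern could be passed to it directly.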

We define the geodataframe with the chosen coordinate system, EPSG:4258 (documentation available at https://www.geoportal.de/Metadata/55134453-193d-47ea-9b20-0f7016702c91, in German)

In [None]:
workFrameStations = gpd.GeoDataFrame(pd.concat(dfListStations, ignore_index=True), crs=4258)

In [None]:
workFrameStations

Certain nodes belong to the same station. We will keep a single occurrence of every station, based on the railwayStationCode variable

In [None]:
workFrameStations = workFrameStations.drop_duplicates(subset='railwayStationCode')
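To illustrate what `drop_duplicates` does here, the sketch below uses a small made-up table: when several rows share a `railwayStationCode`, only the first one is kept (the default `keep='first'`). The names and codes are invented for the example.

```python
import pandas as pd

# Made-up nodes: the first two rows describe the same station
nodes = pd.DataFrame({
    'geographicalName': ['Berlin Hbf', 'Berlin Hbf (tief)', 'Hamburg Hbf'],
    'railwayStationCode': ['X1', 'X1', 'X2'],
})

# Keep the first row for every distinct station code
uniqueNodes = nodes.drop_duplicates(subset='railwayStationCode')
print(uniqueNodes['geographicalName'].tolist())
```

Note that the row that survives depends on the order of the table, so the retained name and geometry are those of the first node encountered.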

In [None]:
workFrameStations

Working on the different railway stations, we can convert the geometry to GPS coordinates (WGS 84)

In [None]:
workFrameStations.crs

In [None]:
type(workFrameStations.crs)

In [None]:
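# EPSG:4326 is WGS 84, the system used by GPS. to_crs returns a new GeoDataFrame:
# workFrameStations itself stays in EPSG:4258 unless the result is assigned back.
workFrameStations.to_crs(crs = 'epsg:4326')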
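Note that `to_crs` returns a new GeoDataFrame and leaves the original untouched unless the result is assigned back (or `inplace=True` is passed). A minimal sketch with one made-up point, assuming the target is WGS 84 (EPSG:4326), the system used by GPS:

```python
import geopandas as gpd
from shapely.geometry import Point

# One made-up point (longitude, latitude), declared in EPSG:4258
demoFrame = gpd.GeoDataFrame({'name': ['demo']}, geometry=[Point(13.4, 52.5)], crs='EPSG:4258')

convertedFrame = demoFrame.to_crs('EPSG:4326')  # returns a new GeoDataFrame

print(demoFrame.crs.to_epsg())       # the original keeps its CRS
print(convertedFrame.crs.to_epsg())  # the copy carries the new CRS
```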

Finally, we can add a column, indicating that every value in this geodataframe is located in Germany

In [None]:
workFrameStations.insert(0, 'country', ['Germany']*len(workFrameStations))

In [None]:
workFrameStations

# 1.2. Lines

We can do the exact same thing with the dataframe of the different train lines

In [None]:
deutscheBahnLines = gpd.read_file('Germany/railwayLines.geojson')

In [None]:
deutscheBahnLines

We can drop any railway line that doesn't start or end in one of our selected cities, i.e. create a new dataframe, with the lines that start or end in one of these cities

In [None]:
dfListLines = []
for city in countries['Germany']:
    tempFrameLines = deutscheBahnLines.loc[deutscheBahnLines['geographicalName'].str.contains(city + ' ')]
    dfListLines.append(tempFrameLines)
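Since the same filtering loop is used for stations and lines, it could be factored into a small function. The sketch below is one way to do it; `filterByCities` is a hypothetical helper and the rows are made up. It matches the city name literally (`regex=False`) and treats missing names as non-matching (`na=False`).

```python
import pandas as pd

def filterByCities(frame, column, cities):
    """Rows of `frame` whose `column` contains a city name followed by a space."""
    parts = [
        frame.loc[frame[column].str.contains(city + ' ', regex=False, na=False)]
        for city in cities
    ]
    return pd.concat(parts, ignore_index=True)

# Made-up rows, for illustration only
demoLines = pd.DataFrame(
    {'geographicalName': ['Berlin Hbf', 'Hamburg Berliner Tor', 'Kassel Hbf', None]}
)
selected = filterByCities(demoLines, 'geographicalName', ['Berlin', 'Hamburg'])
print(selected['geographicalName'].tolist())
```

As in the cells above, a row whose name matches several cities would appear once per match, which is why duplicates are dropped afterwards.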

In [None]:
workFrameLines = gpd.GeoDataFrame(pd.concat(dfListLines, ignore_index = True), crs = 4258)

In [None]:
workFrameLines

And drop the duplicate lines

In [None]:
workFrameLines = workFrameLines.drop_duplicates(subset='railwayLineCode')

In [None]:
workFrameLines

Furthermore, we can drop the line id column

In [None]:
workFrameLines = workFrameLines.drop('id', axis=1)

In [None]:
workFrameLines

Add a column indicating that these lines are located in Germany

In [None]:
workFrameLines.insert(0, 'country', ['Germany']*len(workFrameLines))

In [None]:
workFrameLines

# 2. Austria

We now have a database with the different stations and lines in Germany. We will now add the values for Austria to this database.

In [None]:
countries['Austria'] = countriesAndCities.listCities('Austria')

In [None]:
countries

The database uses the German name for Vienna, Wien

In [None]:
countries['Austria'][0] = 'Wien'

# 2.1. Stations

In [None]:
stationsAustria = 'Austria/GIP_Betriebsstellen_DelEUV_JSON.json'
stationsAustriaFrame = gpd.read_file(stationsAustria)

In [None]:
stationsAustriaFrame

In [None]:
stationsAustriaFrame.columns

Quite a few columns here are not needed for this project, so we can remove them

In [None]:
columnsToRemove = ['BSTS_ID', 'DB640_CODE', 'OBJECTID', 'GIP_OBID', 'EXTERNALID', 'REGIONALCO', 'VALIDFROM', 'VALIDTO', 'OWNER_NAME', 'PV_EVA_NR', 'ANZ_AUFZUG', 'ANZ_FAHRTR', 'ANZ_UHREN',
                  'ANZ_AKUSTI','ANZ_OPTISC', 'INFOPOINT', 'MUEZ', 'MUEZ_KURZ', 'HILFE_MOBI', 'ANZ_ROLLST', 'ANZ_E_LADE', 'RUD_PARKPL', 'VERIFIZIER',
                  'PUBL_WLAN', 'MUEZ_LANG', 'BEMERKUNG']

In [None]:
# Drop all unneeded columns in a single call
stationsAustriaFrame = stationsAustriaFrame.drop(columns=columnsToRemove)

In [None]:
stationsAustriaFrame

We can now focus on retrieving the rows with information on the two cities of interest

In [None]:
dfStationsAustria = []
for city in countries['Austria']:
    tempFrame = stationsAustriaFrame.loc[stationsAustriaFrame['NAME_FPL'].str.contains(city + ' ')]
    dfStationsAustria.append(tempFrame)

In [None]:
dfStationsAustria

In [None]:
workFrameAustria = gpd.GeoDataFrame(pd.concat(dfStationsAustria), crs = 31287)

In [None]:
workFrameAustria

Add the country to the work dataframe

In [None]:
workFrameAustria.insert(0, 'country', ['Austria']*len(workFrameAustria))

In [None]:
workFrameAustria

Convert the coordinates to the European standard Coordinate Reference System

In [None]:
# to_crs returns a new GeoDataFrame, so the result is assigned back
workFrameAustria = workFrameAustria.to_crs(4258)
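Once both station tables share the same CRS, they could be merged into a single table. The sketch below shows one possible way, using plain pandas and made-up rows: each table's name column (`geographicalName` for Germany, `NAME_FPL` for Austria) is renamed to a shared, hypothetical `name` column before concatenation. The actual merge used in this project may differ.

```python
import pandas as pd

# Made-up single-row stand-ins for the two station tables
germanyDemo = pd.DataFrame({'country': ['Germany'], 'geographicalName': ['Berlin Hbf']})
austriaDemo = pd.DataFrame({'country': ['Austria'], 'NAME_FPL': ['Wien Hbf']})

# Harmonise the column names, then stack the rows
unifiedDemo = pd.concat(
    [
        germanyDemo.rename(columns={'geographicalName': 'name'}),
        austriaDemo.rename(columns={'NAME_FPL': 'name'}),
    ],
    ignore_index=True,
)
print(unifiedDemo)
```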

# 2.2. Lines

We can do the exact same thing with the train lines database

In [2]:
linesAustria = 'Austria/GIP_Strecken_MLA.json'
linesAustriaFrame = gpd.read_file(linesAustria)

In [None]:
linesAustriaFrame

In [None]:
linesAustriaFrame.columns

Once again, quite a few columns are useless, and we can get rid of them

In [None]:
uselessColumns = ['GIP_OBID', 'BST_ID', 'FOW_NAME', 'FRC_NAME', 'REGION', 'VALIDFROM', 'VALIDTO', 'CROSSSECT', 'CROSS_NAME', 
                  'ELEKTRI', 'EXPDATE']

# Drop all unneeded columns in a single call
linesAustriaFrame = linesAustriaFrame.drop(columns=uselessColumns)

In [None]:
linesAustriaFrame

We are left with an id for the line, the name of the line, the geographical region in which the line lies (between 'NODEFROM' and 'NODETO'), and the geometry of the lines.

We can't get rid of any further rows, as each row contains unique geometric information