# Creating two unified databases, with only the necessary information

We are going to create a database with the largest train stations in each country of interest. 

These countries are France, Belgium, Switzerland, Germany and Austria. We can also look at the trains in other parts of Europe as well.

We'll use a web scraping method to collect the information on a country, and save the list of cities we'll consider in a dictionary.

In [None]:
!pip install openpyxl
!pip install pandas fiona shapely pyproj rtree 
!pip install geopandas
!pip install folium
!pip install contextily matplotlib

# Library imports
import geopandas as gpd
import contextily as ctx
import matplotlib.pyplot as plt
import pandas as pd

In [None]:
import pandas as pd
import geopandas as gpd
import countriesAndCities
import dataGathering

We will create a dictionary for Austria and Germany, containing each country's largest train stations, based on these two links.

In [None]:
largestStations = dict()

urlGermany = 'https://bahnauskunft.info/bahnhoefe-deutschland/'
urlAustria = 'https://www.omio.at/bahnhoefe'

At the same time, we will have to change certain dictionary keys. As such, we can create a function that does exactly this.

In [None]:
def changeKeys(country, valueToChange, newValue):
    '''A method that takes the keys for a country in the largestStations dictionary, and replaces certain values
     @param country: the country with a value to change, of type string
     @param valueToChange: the value in the key to change
     @param newValue: the new value in the key
     @return largestStations: a dictionary with the information, of type dict'''
    # Collect the matching keys first, so the dictionary is not modified while we iterate over it
    oldKeys = [station for station in largestStations[country] if valueToChange in station]

    for oldKey in oldKeys:
        # Remove the old text and append the new one, e.g. 'X Hauptbahnhof' -> 'X Hbf'
        newKey = oldKey.replace(valueToChange, '') + newValue
        largestStations[country][newKey] = largestStations[country].pop(oldKey)

    return largestStations
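
To see what this renaming does, here is a minimal sketch on a small dictionary. The helper `rename_station_keys` and the sample values are hypothetical (the real values come from `dataGathering.gather`); it applies the same remove-and-append rule as `changeKeys`, but returns a new dictionary instead of modifying the global one.

```python
def rename_station_keys(stations, old_text, new_text):
    """Return a copy of `stations` in which every key containing `old_text`
    has that text removed and `new_text` appended (same rule as changeKeys)."""
    renamed = {}
    for name, value in stations.items():
        if old_text in name:
            name = name.replace(old_text, '') + new_text
        renamed[name] = value
    return renamed


# Hypothetical sample data, for illustration only.
sample = {
    'Berlin Hauptbahnhof': 1,
    'Hamburg Hauptbahnhof': 2,
    'Berlin Ostbahnhof': 3,
}

renamed = rename_station_keys(sample, 'Hauptbahnhof', 'Hbf')
print(renamed)
```

Keys that do not contain the text to change, such as 'Berlin Ostbahnhof' here, are left untouched.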

# 1. Germany

We'll start by getting the relevant cities for Germany, and then work on the GeoJSON files to build the corresponding database.

In [None]:
largestStations['Germany'] = dataGathering.gather(urlGermany)

In [None]:
largestStations

Deutsche Bahn's database uses 'Hbf' instead of 'Hauptbahnhof', so we must change the keys.

In [None]:
largestStations = changeKeys('Germany', 'Hauptbahnhof', 'Hbf')

# 1.1. Stations

We'll start by creating a database of all stations in the selected cities

In [None]:
stations = 'Germany/railwayStationNodes.geojson'

deutscheBahnStations = gpd.read_file(stations)

In [None]:
deutscheBahnStations.columns

We can start by dropping the column containing the nature of the node, and the index of the node

In [None]:
deutscheBahnStations = deutscheBahnStations.drop('formOfNode', axis = 1)
deutscheBahnStations = deutscheBahnStations.drop('id', axis = 1)

In [None]:
deutscheBahnStations

We check, for every single row, whether the name of the station (the 'geographicalName' column) is one of the stations selected above. The comparison is an exact match on the full station name, which avoids picking up names that merely contain a city name (as in Berlin -> Berliner).

We are going to select every row containing data on one of these stations, and concatenate these separate dataframes.

In [None]:
dfListStations = []
for station in (list(largestStations['Germany'].keys())):
    tempFrame = deutscheBahnStations.loc[deutscheBahnStations['geographicalName'] == station]
    dfListStations.append(tempFrame)
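
The difference between a substring match and an exact match can be sketched on a small table. The rows below are hypothetical stand-ins for `deutscheBahnStations`; only the column name 'geographicalName' comes from the notebook. The sketch also uses `isin`, which is an alternative to the loop above and not the method used in this notebook.

```python
import pandas as pd

# Hypothetical rows standing in for deutscheBahnStations.
stations = pd.DataFrame({
    'geographicalName': ['Berlin Hbf', 'Berliner Straße', 'Hamburg Hbf', 'Kiel Hbf'],
})
wanted = ['Berlin Hbf', 'Hamburg Hbf']

# A substring match on the city name also catches 'Berliner Straße'.
loose = stations[stations['geographicalName'].str.contains('Berlin', regex=False)]

# An exact match against the selected names keeps only the listed stations.
exact = stations[stations['geographicalName'].isin(wanted)]

print(loose['geographicalName'].tolist())
print(exact['geographicalName'].tolist())
```

Note that `isin` keeps the rows in the order of the original table, whereas the loop orders them by dictionary key.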

We define the geodataframe with the chosen coordinate system, EPSG:4258 (documentation available at https://www.geoportal.de/Metadata/55134453-193d-47ea-9b20-0f7016702c91, in German).

In [None]:
workFrameStations = gpd.GeoDataFrame(pd.concat(dfListStations, ignore_index=True), crs=4258)

In [None]:
workFrameStations

Certain nodes are the same station. We will keep a single occurrence of every station, based on the railwayStationCode variable.

In [None]:
workFrameStations = workFrameStations.drop_duplicates(subset='railwayStationCode')
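
As a small illustration of this step, the sketch below removes duplicate nodes from a toy table. The station codes are hypothetical placeholders, not values taken from the Deutsche Bahn file.

```python
import pandas as pd

# Two nodes share the same (hypothetical) station code.
nodes = pd.DataFrame({
    'geographicalName': ['Berlin Hbf', 'Berlin Hbf', 'Hamburg Hbf'],
    'railwayStationCode': ['CODE_A', 'CODE_A', 'CODE_B'],
})

# By default the first row of each group of duplicates is kept.
unique_stations = nodes.drop_duplicates(subset='railwayStationCode')

print(len(nodes), len(unique_stations))
```

The surviving rows keep their original index labels, so a later `reset_index(drop=True)` may be useful if a contiguous index is needed.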

In [None]:
workFrameStations

Finally, we can add a column, indicating that every value in this geodataframe is located in Germany

In [None]:
workFrameStations.insert(0, 'country', ['Germany']*len(workFrameStations))

In [None]:
workFrameStations

# 1.2. Lines

We can do the exact same thing with the dataframe of the different train lines. However, the information is segmented, and as such we cannot remove any row in the dataframe.

In [None]:
deutscheBahnLines = gpd.read_file('Germany/railwayLines.geojson')

In [None]:
deutscheBahnLines.head()

# 2. Austria

We now have a database with the different stations and lines in Germany. We will now add the values for Austria to this database.

In [None]:
largestStations['Austria'] = dataGathering.gather(urlAustria)

In [None]:
largestStations

This database does the exact opposite of the German database: it uses 'Hauptbahnhof', whereas the keys we collected use 'Hbf', so we change the keys accordingly.

In [None]:
largestStations = changeKeys('Austria', 'Hbf', 'Hauptbahnhof')

# 2.1. Stations

In [None]:
stationsAustria = 'Austria/GIP_Betriebsstellen_DelEUV_JSON.json'
stationsAustriaFrame = gpd.read_file(stationsAustria)

In [None]:
stationsAustriaFrame

In [None]:
stationsAustriaFrame.columns

Quite a few columns here are useless. We can remove these columns

In [None]:
columnsToRemove = ['BSTS_ID', 'DB640_CODE', 'OBJECTID', 'GIP_OBID', 'EXTERNALID', 'REGIONALCO', 'VALIDFROM', 'VALIDTO', 'OWNER_NAME', 'PV_EVA_NR', 'ANZ_AUFZUG', 'ANZ_FAHRTR', 'ANZ_UHREN',
                  'ANZ_AKUSTI','ANZ_OPTISC', 'INFOPOINT', 'MUEZ', 'MUEZ_KURZ', 'HILFE_MOBI', 'ANZ_ROLLST', 'ANZ_E_LADE', 'RUD_PARKPL', 'VERIFIZIER',
                  'PUBL_WLAN', 'MUEZ_LANG', 'BEMERKUNG']

In [None]:
# Drop all the unneeded columns in a single call
stationsAustriaFrame = stationsAustriaFrame.drop(columns=columnsToRemove)

In [None]:
stationsAustriaFrame

We can now focus on retrieving the rows with information on the two cities of interest

In [None]:
dfStationsAustria = []
for station in largestStations['Austria']:
    tempFrame = stationsAustriaFrame.loc[stationsAustriaFrame['NAME_FPL'] == station]
    dfStationsAustria.append(tempFrame)

In [None]:
dfStationsAustria

In [None]:
workFrameAustria = gpd.GeoDataFrame(pd.concat(dfStationsAustria), crs = 31287)

In [None]:
workFrameAustria

Add the country to the work dataframe

In [None]:
workFrameAustria.insert(0, 'country', ['Austria']*len(workFrameAustria))

In [None]:
workFrameAustria
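
One possible way to bring the German and Austrian station tables together is sketched below, assuming plain pandas tables with one row each. The shared column name 'name' is a hypothetical choice for this sketch, and the sample rows are made up. The real frames also carry geometries in different coordinate systems (EPSG:4258 and EPSG:31287 above), so they would first need to be reprojected to a common one, for instance with `GeoDataFrame.to_crs`.

```python
import pandas as pd

# Hypothetical one-row tables using the name columns of each source.
germany = pd.DataFrame({'country': ['Germany'], 'geographicalName': ['Berlin Hbf']})
austria = pd.DataFrame({'country': ['Austria'], 'NAME_FPL': ['Wien Hauptbahnhof']})

# Rename the source-specific name columns to a shared (hypothetical) one,
# then stack the two tables.
unified = pd.concat(
    [
        germany.rename(columns={'geographicalName': 'name'}),
        austria.rename(columns={'NAME_FPL': 'name'}),
    ],
    ignore_index=True,
)

print(unified)
```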

# 2.2. Lines

We can do the exact same with the train lines database

In [None]:
linesAustria = 'Austria/GIP_Strecken_MLA.json'
linesAustriaFrame = gpd.read_file(linesAustria)

In [None]:
linesAustriaFrame

In [None]:
linesAustriaFrame.columns

Once again, quite a few columns are useless, and we can get rid of them

In [None]:
uselessColumns = ['GIP_OBID', 'BST_ID', 'FOW_NAME', 'FRC_NAME', 'REGION', 'VALIDFROM', 'VALIDTO', 'CROSSSECT', 'CROSS_NAME', 
                  'ELEKTRI', 'EXPDATE']

# Drop all the unneeded columns in a single call
linesAustriaFrame = linesAustriaFrame.drop(columns=uselessColumns)

In [None]:
linesAustriaFrame

# 3. Switzerland

# 3.1. Stations

I use the BAV_List_future_timetable.xlsx file from https://opentransportdata.swiss/fr/dataset/bav_liste, saved locally as suissedata1.xlsx.


In [None]:

df=pd.read_excel('suissedata1.xlsx')
df=df.drop(columns=['Remarque','Statut','Localité','N° commune','Ct.','Carte','Carte.1','N° sv.85','py','N° sv.','Cc','PE','PT','N° ET','Sigle ET','N° GO','Sigle GO','Nom long','Sigle sv.','PC','PP','ST'])
df.head()


# We only keep the rows where the means of transport is equal to 'Zug' (train)

df1=df.copy()

df1=df1[df1['Moyen de transport']=='Zug']
df1.head()


df2=df1[df1['Longueur']>20]
df3=df2.drop(columns=['Longueur', 'Moyen de transport','Altitude','Commune'])
df3.head()


final_df=df3.assign(Pays="Suisse")
final_df.head()
final_df.to_csv(r'stations.csv', index = False)
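
The filtering steps above can also be written as one chain. This is a minimal sketch on a toy table: the column names come from the cell above, but the rows are hypothetical, and the filter on 'Longueur' is left out.

```python
import pandas as pd

# Hypothetical rows; only the column names come from the notebook.
raw = pd.DataFrame({
    'Commune': ['Bern', 'Bern', 'Zürich'],
    'Moyen de transport': ['Zug', 'Bus', 'Zug'],
    'Altitude': [540, 540, 408],
})

trains = (
    raw[raw['Moyen de transport'] == 'Zug']         # keep train stops only
    .drop(columns=['Moyen de transport', 'Altitude'])  # drop unneeded columns
    .assign(Pays='Suisse')                           # tag every row with the country
)

print(trains)
```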

# 3.2. Lines

I use a GeoJSON file that you can find at https://data.sbb.ch/explore/dataset/linie-mit-polygon/export/?fbclid=IwAR3vTCN6GkY4UXZRrm4RNjTRIn726lOGLZmni_K_bi5s-XjerqQ9eCemsrk

In [None]:
import folium
print(folium.__version__)

lines_suisse=gpd.read_file('linie-mit-polygon.geojson')




We are left with an id for the line, the name of the line, the geographical region in which the line lies (between 'NODEFROM' and 'NODETO'), and the geometry of the lines.

We can't remove any further rows, as each row contains unique geometric information.