# Creating two unified databases, with only the necessary information

We are going to create a database containing only information on a select few cities in every country. As a reference, we chose to consider one city per 5 million inhabitants of the country.

We will therefore take the largest cities of each country, one per 5 million inhabitants, until the country's entire population is accounted for.
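
As a minimal sketch of this rule, we can compute how many cities to keep for a given population. The helper `numberOfCities`, the choice to round up, and the rounded population figures are assumptions made for illustration; they are not part of the scraped data.

```python
import math

INHABITANTS_PER_CITY = 5_000_000  # the reference ratio stated above


def numberOfCities(population):
    '''Return how many of the largest cities to keep for a country (hypothetical helper)'''
    # Assumption: round up, so that even a small country keeps at least one city
    return max(1, math.ceil(population / INHABITANTS_PER_CITY))


# Rounded, illustrative population figures (assumptions, not scraped values)
examplePopulations = {'Germany': 83_000_000, 'Austria': 9_000_000}
citiesPerCountry = {country: numberOfCities(population)
                    for country, population in examplePopulations.items()}
print(citiesPerCountry)
```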

We'll use web scraping to collect the information on a country, and save the list of cities we'll consider in a dictionary.

In [None]:
import pandas as pd
import geopandas as gpd
import countriesAndCities
import dataGathering

# 1 Stations

In [None]:
largestStations = dict()

urlGermany = 'https://bahnauskunft.info/bahnhoefe-deutschland/'
urlAustria = 'https://www.omio.at/bahnhoefe'

In [None]:
largestStations['Germany'] = dataGathering.gather(urlGermany)
largestStations['Austria'] = dataGathering.gather(urlAustria, start=1)

In [None]:
largestStations

For both countries, we need to adjust some of the station names used as keys in our largestStations dictionary.

Since the two operations are strikingly similar, we can use a single function.

In [None]:
def changeKeys(country, valueToChange, newValue):
    '''A function that takes the keys for a country in the largestStations dictionary, and replaces certain values.
     Every key containing valueToChange has that substring removed and newValue appended at the end.
     @param country: the country with a value to change, of type string
     @param valueToChange: the value in the key to change, of type string
     @param newValue: the new value in the key, of type string
     @return largestStations: a dictionary with the updated information, of type dict'''
    stationsOfCountry = largestStations[country]
    # Iterate over a copy of the keys, since the dictionary is modified inside the loop
    for oldKey in list(stationsOfCountry.keys()):
        if valueToChange in oldKey:
            newKey = oldKey.replace(valueToChange, '') + newValue
            stationsOfCountry[newKey] = stationsOfCountry.pop(oldKey)

    return largestStations
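
To see what this renaming does, here is a small self-contained sketch on a toy dictionary. The function `renameKeys` is a hypothetical variant that returns a new dictionary instead of modifying largestStations, and the entries are made up; the real values come from dataGathering.gather.

```python
def renameKeys(stationsOfCountry, valueToChange, newValue):
    '''Return a copy of the dictionary in which valueToChange is removed
    from each key that contains it, and newValue is appended (hypothetical helper)'''
    renamed = {}
    for station, info in stationsOfCountry.items():
        if valueToChange in station:
            station = station.replace(valueToChange, '') + newValue
        renamed[station] = info
    return renamed


# Made-up entries, for illustration only
toyStations = {'Berlin Hauptbahnhof': 1, 'Hamburg-Altona': 2}
print(renameKeys(toyStations, 'Hauptbahnhof', 'Hbf'))
```

Since the new value is appended at the end of the key, this approach only gives a sensible name when the value to change is the last word of the station name.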

For Germany, we must replace 'Hauptbahnhof' with 'Hbf'.

For Austria, we must do the exact opposite: replace 'Hbf' with 'Hauptbahnhof'.

In [None]:
largestStations = changeKeys('Germany', 'Hauptbahnhof', 'Hbf')
largestStations = changeKeys('Austria', 'Hbf', 'Hauptbahnhof')

In [None]:
largestStations

# 1.1 Stations in Germany

In [None]:
stations = 'Germany/railwayStationNodes.geojson'

deutscheBahnStations = gpd.read_file(stations)

In [None]:
deutscheBahnStations.head()

In [None]:
deutscheBahnStations.columns

We can start by dropping the columns containing the nature of the node ('formOfNode') and the index of the node ('id').

In [None]:
deutscheBahnStations = deutscheBahnStations.drop(columns=['formOfNode', 'id'])

In [None]:
deutscheBahnStations

In [None]:
deutscheBahnStations.loc[deutscheBahnStations['geographicalName'].str.contains('Frankfurt')]

We can now select, from the entire database, the stations listed in largestStations['Germany'].

In [None]:
dfListStations = []
for station in largestStations['Germany']:
    tempFrame = deutscheBahnStations.loc[deutscheBahnStations['geographicalName'] == station]
    dfListStations.append(tempFrame)
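
As an alternative sketch, the same selection can be written as a single filter with `isin`, which also makes it easy to list the names that matched nothing. The frame and the station names below are made up for illustration; they only stand in for deutscheBahnStations and the keys of largestStations['Germany'].

```python
import pandas as pd

# Toy stand-in for deutscheBahnStations (values are made up)
toyFrame = pd.DataFrame({
    'geographicalName': ['Berlin Hbf', 'Kall', 'Hamburg Hbf', 'Berlin Hbf'],
    'railwayStationCode': ['BLS', 'KKAL', 'AH', 'BLS'],
})
wantedStations = ['Berlin Hbf', 'Hamburg Hbf', 'Atlantis Hbf']

# Keep every row whose name is in the list (one filter instead of a loop)
selected = toyFrame.loc[toyFrame['geographicalName'].isin(wantedStations)]

# Names that matched no row, useful for spotting spelling mismatches
missing = sorted(set(wantedStations) - set(toyFrame['geographicalName']))
print(selected)
print(missing)
```

Unlike the loop, this keeps the rows in the order of the original frame rather than in the order of the list of stations.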

In [None]:
dfListStations

We then combine the results into a single GeoDataFrame, using the ETRS89 geographic coordinate system (EPSG:4258).

In [None]:
workFrameStations = gpd.GeoDataFrame(pd.concat(dfListStations, ignore_index=True), crs=4258)

In [None]:
workFrameStations

Some nodes are duplicated; we can drop them.

In [None]:
workFrameStations = workFrameStations.drop_duplicates(subset='railwayStationCode')
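
A small sketch of what `drop_duplicates` with `subset` does: rows are compared on the station code only, and the first occurrence of each code is kept. The rows below are made up for illustration.

```python
import pandas as pd

# Made-up rows: the first two share the same station code
toyStations = pd.DataFrame({
    'geographicalName': ['Berlin Hbf', 'Berlin Hbf', 'Hamburg Hbf'],
    'railwayStationCode': ['BLS', 'BLS', 'AH'],
})

# True for every row whose station code already appeared in an earlier row
isRepeat = toyStations.duplicated(subset='railwayStationCode')

# Keeps the first row for each station code
deduplicated = toyStations.drop_duplicates(subset='railwayStationCode')
print(isRepeat.sum(), len(deduplicated))
```

Counting the repeats first is a cheap way to check how many rows the drop will remove.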

In [None]:
workFrameStations

# 1.2 Stations in Austria

We can do the same for Austria.

In [None]:
stationsAustria = 'Austria/GIP_Betriebsstellen_DelEUV_JSON.json'
stationsAustriaFrame = gpd.read_file(stationsAustria)

In [None]:
stationsAustriaFrame

In [None]:
stationsAustriaFrame.columns

Quite a few of these columns are not needed here, so we can remove them.

In [None]:
columnsToRemove = ['BSTS_ID', 'DB640_CODE', 'OBJECTID', 'GIP_OBID', 'EXTERNALID', 'REGIONALCO', 'VALIDFROM', 'VALIDTO', 'OWNER_NAME', 'PV_EVA_NR', 'ANZ_AUFZUG', 'ANZ_FAHRTR', 'ANZ_UHREN',
                  'ANZ_AKUSTI','ANZ_OPTISC', 'INFOPOINT', 'MUEZ', 'MUEZ_KURZ', 'HILFE_MOBI', 'ANZ_ROLLST', 'ANZ_E_LADE', 'RUD_PARKPL', 'VERIFIZIER',
                  'PUBL_WLAN', 'MUEZ_LANG', 'BEMERKUNG']

In [None]:
stationsAustriaFrame = stationsAustriaFrame.drop(columns=columnsToRemove)
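
As a hedged sketch, all the columns can also be dropped in one call, and `errors='ignore'` makes the cell safe to run again once the columns are already gone. The frame below is a made-up stand-in for stationsAustriaFrame, and 'NOT_PRESENT' is a deliberately missing column.

```python
import pandas as pd

# Made-up stand-in for stationsAustriaFrame
toyFrame = pd.DataFrame({
    'NAME_FPL': ['Wien Hauptbahnhof'],
    'OBJECTID': [1],
    'INFOPOINT': ['J'],
})
toyColumnsToRemove = ['OBJECTID', 'INFOPOINT', 'NOT_PRESENT']

# errors='ignore' skips the names that are not columns of the frame
cleaned = toyFrame.drop(columns=toyColumnsToRemove, errors='ignore')
print(list(cleaned.columns))
```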

We can now focus on retrieving the useful stations

In [None]:
dfStationsAustria = []
for station in largestStations['Austria']:
    tempFrame = stationsAustriaFrame.loc[stationsAustriaFrame['NAME_FPL'] == station]
    dfStationsAustria.append(tempFrame)

We then combine the results into a single GeoDataFrame, this time using EPSG:31287 (MGI / Austria Lambert).

In [None]:
workFrameAustria = gpd.GeoDataFrame(pd.concat(dfStationsAustria), crs=31287)

In [None]:
workFrameAustria

# 2 Train lines

We can also work on the databases containing train lines.

However, there is less work in that case.

We cannot get rid of any row, as each row contains specific information that isn't available in any other row.
Should we get rid of one row, we would lose information that cannot be recovered.

# 2.1 Train lines in Germany

In [28]:
deutscheBahnLines = gpd.read_file('Germany/railwayLines.geojson')

In [29]:
deutscheBahnLines.head()

| | id | geographicalName | railwayLineCode | geometry |
|---|---|---|---|---|
| 0 | Line-1078435 | Grötzingen - Eppingen | 4201 | MULTILINESTRING ((8.49314 49.00583, 8.49356 49... |
| 1 | Line-1078629 | Berlin Ostbf - Berlin-Spandau | 6109 | MULTILINESTRING ((13.23049 52.52917, 13.23024 ... |
| 2 | Line-1078434 | Karlsruhe Gbf - West | 4215 | MULTILINESTRING ((8.39203 48.98890, 8.39183 48... |
| 3 | Line-1078437 | Marnheim - Monsheim | 3561 | MULTILINESTRING ((8.20148 49.63515, 8.20170 49... |
| 4 | Line-1078436 | Kall - Hellenthal | 2635 | MULTILINESTRING ((6.55488 50.53503, 6.55517 50... |


We can drop the id column

In [30]:
deutscheBahnLines = deutscheBahnLines.drop('id', axis=1)

| | geographicalName | railwayLineCode | geometry |
|---|---|---|---|
| 0 | Grötzingen - Eppingen | 4201 | MULTILINESTRING ((8.49314 49.00583, 8.49356 49... |
| 1 | Berlin Ostbf - Berlin-Spandau | 6109 | MULTILINESTRING ((13.23049 52.52917, 13.23024 ... |
| 2 | Karlsruhe Gbf - West | 4215 | MULTILINESTRING ((8.39203 48.98890, 8.39183 48... |
| 3 | Marnheim - Monsheim | 3561 | MULTILINESTRING ((8.20148 49.63515, 8.20170 49... |
| 4 | Kall - Hellenthal | 2635 | MULTILINESTRING ((6.55488 50.53503, 6.55517 50... |
| ... | ... | ... | ... |
| 1489 | Leipzig Bayer Bf - Gaschwitz | 6377 | MULTILINESTRING ((12.38109 51.33009, 12.38157 ... |
| 1490 | Abzw Rainweg - HH-Eidelstedt | 1232 | MULTILINESTRING ((9.94184 53.56297, 9.94179 53... |
| 1491 | HH Oberhafen - Hamburg Hbf | 1250 | MULTILINESTRING ((10.02458 53.53495, 10.02476 ... |
| 1492 | Lauda - Wertheim | 4920 | MULTILINESTRING ((9.71127 49.56368, 9.71075 49... |


# 2.2 Train lines in Austria

In [31]:
linesAustria = 'Austria/GIP_Strecken_MLA.json'
linesAustriaFrame = gpd.read_file(linesAustria)

In [32]:
linesAustriaFrame.columns

Index(['GIP_OBID', 'KMSYS_CODE', 'BST_ID', 'FOW_NAME', 'FRC_NAME', 'REGION',
       'MAINNAME', 'VALIDFROM', 'VALIDTO', 'NODEFROM', 'NODETO', 'CROSSSECT',
       'CROSS_NAME', 'ELEKTRI', 'EXPDATE', 'geometry'],
      dtype='object')

Once again, quite a few columns are not needed here, and we can get rid of them.

In [33]:
uselessColumns = ['GIP_OBID', 'BST_ID', 'FOW_NAME', 'FRC_NAME', 'REGION', 'VALIDFROM', 'VALIDTO', 'CROSSSECT', 'CROSS_NAME',
                  'ELEKTRI', 'EXPDATE']

linesAustriaFrame = linesAustriaFrame.drop(columns=uselessColumns)

We are left with an id for the line, the name of the line, the geographical region in which the line lies (between 'NODEFROM' and 'NODETO'), and the geometry of the lines.

As for Germany, we can't get rid of any rows, as each row contains unique geometric information.