# Shinkansen Stations in Japan

This project aims to analyze and visualize the Shinkansen stations in Japan.

The first part is a short data preparation in Python; the second part continues with the visualization in Tableau Public.

The data comes from Kaggle and was provided by Kaito. It was last updated on 2023-05-09. Link to the dataset: https://www.kaggle.com/datasets/japandata509/shinkansen-stations-in-japan/data

## Data Understanding

- Station_Name: the name of the station (English)
- Shinkansen_Line: the line(s) that the station belongs to
- Year: the year the station opened
- Prefecture: the prefecture where the station is located
- Distance from Tokyo st: the distance (km) from Tokyo station to the station
- Company: the company that operates the station

## Data Preparation

In [1]:
# load needed libraries

import pandas as pd 
from geopy.geocoders import Bing # to get the coordinates of the stations using the geocoder Bing
import folium # for a quick visualization

In [2]:
# load data

df_station = pd.read_csv('Shinkansen_stations_inJapan.csv',
                        sep=',')
df_station.head() # look at first rows

,Station_Name,Shinkansen_Line,Year,Prefecture,Distance from Tokyo st,Company
0,Tokyo,Tokaido_Shinkansen,1964,Tokyo,0.0,JR_Central
1,Shinagawa,Tokaido_Shinkansen,2003,Tokyo,6.8,JR_Central
2,Shin-Yokohama,Tokaido_Shinkansen,1964,Kanagawa,28.8,JR_Central
3,Odawara,Tokaido_Shinkansen,1964,Kanagawa,83.9,JR_Central
4,Atami,Tokaido_Shinkansen,1964,Shizuoka,104.6,JR_Central


In [3]:
# check data types

df_station.dtypes 

Station_Name               object
Shinkansen_Line            object
Year                        int64
Prefecture                 object
Distance from Tokyo st    float64
Company                    object
dtype: object

In [4]:
# check column names

df_station.columns 

Index(['Station_Name', 'Shinkansen_Line', 'Year', 'Prefecture',
       'Distance from Tokyo st', 'Company'],
      dtype='object')

In [5]:
# check the distinct values of each column to get to know the data and to spot placeholders or unexpected symbols

df_station.loc[:, 'Station_Name'].unique()

array(['Tokyo', 'Shinagawa', 'Shin-Yokohama', 'Odawara', 'Atami',
       'Mishima', 'Shin-Fuji', 'Shizuoka', 'Kakegawa', 'Hamamatsu',
       'Toyohashi', 'Mikawa-Anjo', 'Nagoya', 'Gifu-Hashima', 'Maibara',
       'Kyoto', 'Shin-Osaka', 'Shin-Kobe', 'Nishi-Akashi', 'Himeji',
       'Aioi', 'Okayama', 'Shin-Kurashiki', 'Fukuyama', 'Shin-Onomichi',
       'Mihara', 'Higashi-Hiroshima', 'Hiroshima', 'Shin-Iwakuni',
       'Tokuyama', 'Shin-Yamaguchi', 'Asa', 'Shin-Shimonoseki', 'Kokura',
       'Hakata', 'Ueno', 'Omiya', 'Oyama', 'Utsunomiya', 'Nasushiobara',
       'Shin-Shirakawa', 'Koriyama', 'Fukushima', 'Shiroishi-Zao',
       'Sendai', 'Furukawa', 'Kurikoma-Kogen', 'Ichinoseki',
       'Mizusawa-Esashi', 'Kitakami', 'Shin-Hanamaki', 'Morioka',
       'iwate-Numakunai', 'Ninohe', 'Hachinohe', 'Shichinohe-Towada',
       'Shin-Aomori', 'Kumagaya', 'Honjo-Waseda', 'Takasaki',
       'Jomo-Kogen', 'Echigo-Yuzawa', 'Urasa', 'Nagaoka', 'Tsubame-Sanjo',
       'Niigata', 'Yonezawa', 'Takaha

In [6]:
df_station.loc[:, 'Shinkansen_Line'].unique()

array(['Tokaido_Shinkansen', 'Tokaido_Shinkansen,Sanyo_Shinkansen',
       'Sanyo_Shinkansen', 'Sanyo_Shinkansen,Kyushu-Shinknsen',
       'Tohoku_Shinkansen', 'Tohoku_Shinkansen,Joetsu_Shinkansen',
       'Tohoku_Shinkansen,Yamagata_Shinkansen',
       'Tohoku_Shinkansen,Akita_Shinkansen',
       'Tohoku_Shinkansen,Hokkaido_Shinkansen', 'Joetsu_Shinkansen',
       'Joetsu_Shinkansen,Hokuriku_Shinkansen', 'Yamagata_Shinkansen',
       'Akita_Shinkansen', 'Hokuriku_Shinkansen', 'Kyushu_Shinkansen',
       'Hokkaido_Shinkansen', 'Nishi_Kyushu_Shinkansen'], dtype=object)

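-> Some stations belong to two Shinkansen lines; these are stored as a single comma-separated string. One value, 'Sanyo_Shinkansen,Kyushu-Shinknsen', is spelled inconsistently with the 'Kyushu_Shinkansen' entries.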
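Because the combined values are comma-separated strings, a station on two lines could be split into one row per line. The following is only a minimal sketch on a small hand-made sample (the station-to-line assignments are illustrative, not read from the dataset). Treating `Kyushu-Shinknsen` as a misspelling of `Kyushu_Shinkansen` is an assumption based on the other values in the output above.

```python
import pandas as pd

# small hand-made sample that mirrors the column layout of the dataset
sample = pd.DataFrame({
    'Station_Name': ['Tokyo', 'Shin-Osaka', 'Hakata'],
    'Shinkansen_Line': ['Tokaido_Shinkansen',
                        'Tokaido_Shinkansen,Sanyo_Shinkansen',
                        'Sanyo_Shinkansen,Kyushu-Shinknsen'],
})

# assumption: 'Kyushu-Shinknsen' is a misspelling of 'Kyushu_Shinkansen'
sample['Shinkansen_Line'] = sample['Shinkansen_Line'].str.replace(
    'Kyushu-Shinknsen', 'Kyushu_Shinkansen', regex=False)

# split each string into a list and create one row per (station, line) pair
lines_long = (sample
              .assign(Shinkansen_Line=sample['Shinkansen_Line'].str.split(','))
              .explode('Shinkansen_Line')
              .reset_index(drop=True))

print(lines_long)
```

A long table like this may be easier to filter by line in Tableau than the combined strings.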

In [7]:
df_station.loc[:, 'Year'].unique()

array([1964, 2003, 1969, 1988, 1972, 1975, 1999, 1991, 1985, 1982, 1990,
       2002, 2010, 2004, 1992, 1997, 2015, 2011, 2016, 2022], dtype=int64)

In [8]:
df_station.loc[:, 'Prefecture'].unique()

array(['Tokyo', 'Kanagawa', 'Shizuoka', 'Aichi', 'Gifu', 'Shiga', 'Kyoto',
       'Osaka', 'Hyogo', 'Okayama', 'Hiroshima', 'Yamaguchi', 'Fukuoka',
       'Saitama', 'Tochigi', 'Fukushima', 'Miyagi', 'Iwate', 'Aomori',
       'Gunma', 'Niigata', 'Yamagata', 'Akita', 'Nagano', 'Toyama',
       'Ishikawa', 'Saga', 'Kumamoto', 'Kagoshima', 'Hokkaido',
       'Nagasaki'], dtype=object)

In [9]:
df_station.loc[:, 'Distance from Tokyo st'].unique()

array([   0. ,    6.8,   28.8,   83.9,  104.6,  120.7,  146.2,  180.2,
        229.3,  257.1,  293.6,  336.3,  336. ,  396.3,  445.9,  513.6,
        552.6,  589.5,  612.3,  644.3,  665. ,  732.9,  758.1,  791.2,
        811.3,  822.8,  862.4,  894.2,  935.6,  982.7, 1027. , 1062.1,
       1088.7, 1107.7, 1174.9,    3.6,   30.3,   80.6,  109.5,  157.8,
        185.4,  226.7,  272.8,  306.8,  351.8,  395. ,  416.2,  445.1,
        470.1,  487.5,  500. ,  535.3,  566.4,  601. ,  631.9,  668. ,
        713.7,   64.7,   86. ,  105. ,  151.6,  199.2,  228.9,  270.6,
        293.8,  333.9,  312.9,  322.7,  328.9,  347.8,  359.9,  373.2,
        380.9,  386.3,  399.7,  421.4,  551.3,  575.4,  594.1,  610.9,
        662.6,  123.5,  146.8,  164.4,  189.2,  222.4,  252.3,  281.9,
        318.9,  358.1,  391.9,  410.8,  450.5, 1203.5, 1210.6, 1226.4,
       1244.2, 1265.3, 1293.3, 1326.2, 1369. , 1385. , 1417.7, 1463.8,
        752.2,  827. ,  862.5, 1253.9, 1264.8, 1286.1, 1298.6, 1319.9])

In [10]:
df_station.loc[:, 'Company'].unique()

array(['JR_Central', 'JR_West', 'JR_East', 'JR_Kyushu', 'JR_Hokkaido'],
      dtype=object)
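The per-column checks above could also be written as a single loop. This is a minimal sketch on a three-row sample taken from the first rows shown earlier; the name `unique_values` is only illustrative.

```python
import pandas as pd

# three rows copied from the head of the dataset
sample = pd.DataFrame({
    'Station_Name': ['Tokyo', 'Shinagawa', 'Shin-Yokohama'],
    'Prefecture': ['Tokyo', 'Tokyo', 'Kanagawa'],
    'Company': ['JR_Central', 'JR_Central', 'JR_Central'],
})

# collect the distinct values of every column in one pass
unique_values = {col: sample[col].unique().tolist() for col in sample.columns}

for col, values in unique_values.items():
    print(f'{col} ({len(values)} distinct): {values}')
```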

In [11]:
# check for missing values

df_station.isna().sum()

Station_Name              0
Shinkansen_Line           0
Year                      0
Prefecture                0
Distance from Tokyo st    0
Company                   0
dtype: int64

-> There are no missing values.

## Create New Columns for the Latitude and Longitude of Each Station

To visualize the stations on a map, geographic data is needed. To get the coordinates (latitude and longitude) of each station, geopy, a geocoding library for Python, will be used.

Documentation: https://geopy.readthedocs.io/

In [12]:
# testing geopy with the example of Berlin

import os

# initialize the Bing geocoder; the API key is read from an environment variable
# rather than hard-coded (the variable name BING_MAPS_API_KEY is an assumption)
geocoder = Bing(os.environ['BING_MAPS_API_KEY'])

location = geocoder.geocode('Berlin')

print("The latitude of the location is: ", location.latitude)
print("The longitude of the location is: ", location.longitude)

The latitude of the location is:  52.52342987
The longitude of the location is:  13.41143608


In [13]:
# getting the coordinates for our dataset using geopy

# creating two lists for latitude and longitude
lat = []
long = []

# initialize the Bing geocoder; the API key is read from an environment variable
# rather than hard-coded (the variable name BING_MAPS_API_KEY is an assumption)
import os
geocoder = Bing(os.environ['BING_MAPS_API_KEY'])

# get the coordinates for each station
for station_name in df_station['Station_Name']:

    # hard code the stations that the geocoder didn't get right (source: Google Maps)
    # Asa
    if station_name == 'Asa':
        lat.append(34.05413788939898) # append latitude
        long.append(131.16059736441778) # append longitude

    # Urasa
    elif station_name == 'Urasa':
        lat.append(37.16824450976684) # append latitude
        long.append(138.92284915942884) # append longitude

    # for all other stations that are found correctly
    else:
        location = geocoder.geocode(station_name + ' Station') # query '<name> Station', as this is how the stations are named
        lat.append(location.latitude) # append the station's latitude
        long.append(location.longitude) # append the station's longitude

print('Done!')

Done!
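The loop above fails with an `AttributeError` if the geocoder finds nothing, because geopy's `geocode` returns `None` in that case. A more defensive variant might look like the sketch below. `FakeGeocoder`, `geocode_stations` and the query format with prefecture and country are hypothetical and only illustrate the idea; the fake class stands in for the Bing geocoder so that the example runs without an API key.

```python
from collections import namedtuple

Location = namedtuple('Location', ['latitude', 'longitude'])


class FakeGeocoder:
    """Hypothetical stand-in for a geopy geocoder; returns None for unknown queries."""

    def __init__(self, known):
        self.known = known

    def geocode(self, query):
        return self.known.get(query)


# manually researched coordinates (values taken from the cell above)
MANUAL_COORDS = {
    'Asa': (34.05413788939898, 131.16059736441778),
    'Urasa': (37.16824450976684, 138.92284915942884),
}


def geocode_stations(stations, geocoder, manual=MANUAL_COORDS):
    """Return (lat, long, not_found) for an iterable of (name, prefecture) pairs."""
    lat, long, not_found = [], [], []
    for name, prefecture in stations:
        if name in manual:
            coords = manual[name]
        else:
            # assumption: adding prefecture and country makes the query less ambiguous
            location = geocoder.geocode(f'{name} Station, {prefecture}, Japan')
            if location is None:
                not_found.append(name)  # remember the station for manual research
                coords = (None, None)
            else:
                coords = (location.latitude, location.longitude)
        lat.append(coords[0])
        long.append(coords[1])
    return lat, long, not_found


geocoder = FakeGeocoder({'Tokyo Station, Tokyo, Japan': Location(35.680832, 139.766937)})
lat, long, not_found = geocode_stations(
    [('Tokyo', 'Tokyo'), ('Asa', 'Yamaguchi'), ('Nowhere', 'Tokyo')], geocoder)
print(lat, long, not_found)
```

Stations collected in `not_found` can then be researched by hand and added to `MANUAL_COORDS`, in the same way as Asa and Urasa.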


In [14]:
# create columns for latitude and longitude and put them into the dataset

df_station['Latitude'] = lat
df_station['Longitude'] = long

In [15]:
# look at the data
df_station

,Station_Name,Shinkansen_Line,Year,Prefecture,Distance from Tokyo st,Company,Latitude,Longitude
0,Tokyo,Tokaido_Shinkansen,1964,Tokyo,0.0,JR_Central,35.680832,139.766937
1,Shinagawa,Tokaido_Shinkansen,2003,Tokyo,6.8,JR_Central,35.628159,139.739105
2,Shin-Yokohama,Tokaido_Shinkansen,1964,Kanagawa,28.8,JR_Central,35.465443,139.622833
3,Odawara,Tokaido_Shinkansen,1964,Kanagawa,83.9,JR_Central,35.264565,139.152161
4,Atami,Tokaido_Shinkansen,1964,Shizuoka,104.6,JR_Central,35.095997,139.071533
...,...,...,...,...,...,...,...,...
108,Takeo-Onsen,Nishi_Kyushu_Shinkansen,2022,Saga,1253.9,JR_Kyushu,33.196289,130.022903
109,Ureshino-Onsen,Nishi_Kyushu_Shinkansen,2022,Saga,1264.8,JR_Kyushu,33.106602,129.998962
110,Shin-Omura,Nishi_Kyushu_Shinkansen,2022,Nagasaki,1286.1,JR_Kyushu,32.933048,129.957138
111,Isahaya,Nishi_Kyushu_Shinkansen,2022,Nagasaki,1298.6,JR_Kyushu,32.843426,130.053085


In [16]:
# define the approximate min. and max. coordinates of Japan and check whether all station coordinates fall within that range
# source: https://en.wikipedia.org/wiki/Geography_of_Japan

min_lat_jp = 20.00000
max_lat_jp = 45.00000

min_long_jp = 122.00000
max_long_jp = 153.00000

# code for latitude
min_lat = df_station['Latitude'].min()
max_lat = df_station['Latitude'].max()
min_lat_range = (min_lat >= min_lat_jp) & (min_lat < max_lat_jp)
max_lat_range = (max_lat > min_lat_jp) & (max_lat <= max_lat_jp)

print('The min. and max. latitude:', min_lat, ', ', max_lat, 'are in the range of:', min_lat_jp, ', ', max_lat_jp, ': ', min_lat_range,
      ' ', max_lat_range, '\n')

# code for longitude
min_long = df_station['Longitude'].min()
max_long = df_station['Longitude'].max()
min_long_range = (min_long >= min_long_jp) & (min_long < max_long_jp)
max_long_range = (max_long > min_long_jp) & (max_long <= max_long_jp)

print('The min. and max. longitude:', min_long, ', ', max_long, 'are in the range of:', min_long_jp, ', ', max_long_jp, ': ', min_long_range,
      ' ', max_long_range)

The min. and max. latitude: 31.5837841 ,  41.90464401 are in the range of: 20.0 ,  45.0 :  True   True 

The min. and max. longitude: 129.87982178 ,  141.48840332 are in the range of: 122.0 ,  153.0 :  True   True
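The same range check could be applied to every row at once, which also shows which stations fall outside the box. This is a minimal sketch using `Series.between` on a two-row sample; the Berlin row reuses the test coordinates from the geopy example and serves as a deliberately wrong entry.

```python
import pandas as pd

# bounding box used in the cell above
LAT_RANGE = (20.0, 45.0)
LONG_RANGE = (122.0, 153.0)

sample = pd.DataFrame({
    'Station_Name': ['Tokyo', 'Berlin'],
    'Latitude': [35.680832, 52.52342987],
    'Longitude': [139.766937, 13.41143608],
})

# True where both coordinates fall inside the box (bounds are inclusive)
in_box = (sample['Latitude'].between(*LAT_RANGE)
          & sample['Longitude'].between(*LONG_RANGE))

# rows that need a manual check
outliers = sample.loc[~in_box]
print(outliers)
```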


In [17]:
# list any stations whose latitude exceeds the maximum latitude of Japan
df_station.loc[df_station.loc[:, 'Latitude'] > max_lat_jp]

,Station_Name,Shinkansen_Line,Year,Prefecture,Distance from Tokyo st,Company,Latitude,Longitude


-> Now all coordinates are in the defined range and therefore seem to be correct. A quick map visualization will show if this is really the case.
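A bounding box only catches coordinates outside Japan. One further plausibility check, sketched below, compares the great-circle distance from Tokyo with the `Distance from Tokyo st` column. It assumes that this column is a route distance, which can never be shorter than the straight-line distance. The function name `haversine_km` is illustrative, and the sample values are copied from the table above.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


tokyo = (35.680832, 139.766937)

# (name, latitude, longitude, distance from Tokyo st in km)
stations = [
    ('Shinagawa', 35.628159, 139.739105, 6.8),
    ('Atami', 35.095997, 139.071533, 104.6),
]

# a station is suspicious if it is farther away in a straight line than by rail
suspicious = [name for name, lat, lon, rail_km in stations
              if haversine_km(*tokyo, lat, lon) > rail_km]
print(suspicious)
```

An empty list means none of the sampled stations contradicts its listed distance. It does not prove that the coordinates are exact.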

In [18]:
# create a quick map using folium (initial code by ChatGPT)

# create a base map centered on the mean latitude and longitude of the stations
map_center = [df_station['Latitude'].mean(), df_station['Longitude'].mean()]
my_map = folium.Map(location=map_center, zoom_start=10)

# add markers for each location in the DataFrame
for index, row in df_station.iterrows():
    folium.Marker(
        location=[row['Latitude'], row['Longitude']],
        popup=row['Station_Name']
    ).add_to(my_map)

# display map
my_map

-> Zooming out of the map shows that all coordinates are located in Japan.
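Instead of zooming out manually, the map view could be fitted to the stations. The sketch below only computes the south-west and north-east corners from the minimum and maximum coordinates printed earlier. The helper `bounding_box` is hypothetical; its result has the format expected by folium's `fit_bounds` method.

```python
def bounding_box(latitudes, longitudes):
    """Return [[south, west], [north, east]] for the given coordinates."""
    return [[min(latitudes), min(longitudes)],
            [max(latitudes), max(longitudes)]]


# min. and max. values taken from the range check above
bounds = bounding_box([31.5837841, 41.90464401], [129.87982178, 141.48840332])
print(bounds)

# with the map object from the cell above, the view could then be set with:
# my_map.fit_bounds(bounds)
```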

In [19]:
# save dataframe as csv-file

df_station.to_csv('Shinkansen_stations_inJapan_geo.csv', index=False)

## Data Visualization

This part will continue in Tableau.