Pandas is a Python package for working with tabular data, including large tables. It provides many functions for computing summary statistics of the data held in its tables, and for visualization.

NumPy is a numerical computing package. It provides array classes and mathematical functions for manipulating data, and Pandas uses its numeric types to store column values.

os is a standard-library module for working with files and directory structures. Building file paths with it, rather than typing them by hand as strings, avoids typos and works across different operating systems.

Here, we import these three packages for use.

In [1]:
import pandas as pd
import numpy as np
import os

First, we open the CSV file containing the latitude and longitude (latlon) coordinates of the points between which we would like to measure the distance.

In [8]:
file = os.path.join(os.pardir, "data", "distancedummy.csv")
df = pd.read_csv(file)
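
To see what os.path.join actually builds, here is a minimal sketch that only constructs the path and does not open the file. It assumes nothing beyond the folder and file names already used in the cell above.

```python
import os

# os.pardir is the parent-directory marker ("..") and os.sep is the
# path separator of the current operating system.
path = os.path.join(os.pardir, "data", "distancedummy.csv")

# The same pieces joined by hand with the OS separator give the same string,
# which is why the path works unchanged on different operating systems.
by_hand = os.sep.join([os.pardir, "data", "distancedummy.csv"])

print(path)
print(path == by_hand)
```

The printed path uses forward slashes or backslashes depending on the operating system the notebook runs on.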

We can print the dataframe, here named 'df', to see its contents.
We can also select a specific cell and examine its data type. As we can see here, Pandas has automatically parsed the numeric columns from the text in the CSV file into numpy.float64, a numeric type.

In [37]:
print(df)
print()
print(str(df['homelat'][1]) + ': ' + str(type(df['homelat'][1])))

     id   homelat     homelon        mrtname    mrtlat      mrtlon  directdist
0   bob  1.271684  103.807672  Telok Blangah  1.270575  103.809731    0.259835
1   tom  1.350142  103.935288       Tampines  1.270575  103.809731   16.515159
2  lars  1.425065  103.834088         Yishun  1.429334  103.834966    0.484316
3   ron  1.336155  103.698430        Pioneer  1.337614  103.697152    0.215510

1.350142: <class 'numpy.float64'>
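
Rather than inspecting one cell at a time, the types of all columns can be listed at once through the dataframe's dtypes attribute. The sketch below is self-contained: it parses a small CSV string (two rows copied from the table above) instead of the real file, and the names csv_text and sample are illustrative only.

```python
import io

import pandas as pd

# Two rows copied from the table above, written as CSV text.
csv_text = """id,homelat,homelon
bob,1.271684,103.807672
tom,1.350142,103.935288
"""

# io.StringIO lets read_csv parse a string as if it were a file.
sample = pd.read_csv(io.StringIO(csv_text))

# One dtype per column: the numeric columns are float64, while the text
# column is reported as object or a string dtype, depending on the
# pandas version.
print(sample.dtypes)
```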


To calculate the distance between two latlon points, we use the haversine formula, which treats the Earth as a sphere, here with a radius of 6367 km (close to the Earth's mean radius of about 6371 km). The distance returned is also in km.
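
For reference, the haversine formula can be written out as follows. The symbols are notation chosen for this sketch, not taken from the notebook: φ is latitude and λ is longitude, both in radians, R is the sphere radius, and d is the resulting distance.

```latex
\begin{aligned}
a &= \sin^2\!\left(\frac{\varphi_2 - \varphi_1}{2}\right)
     + \cos\varphi_1 \, \cos\varphi_2 \,
       \sin^2\!\left(\frac{\lambda_2 - \lambda_1}{2}\right) \\
d &= 2R \, \arcsin\!\left(\sqrt{a}\right)
\end{aligned}
```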

In [4]:
from math import radians, cos, sin, asin, sqrt

def haversine(lon1, lat1, lon2, lat2):
    """
    Calculate the great-circle distance in km between two points
    on the earth (specified in decimal degrees)
    """
    # convert decimal degrees to radians
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    # haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a))  # central angle between the points, in radians
    km = 6367 * c  # arc length on a sphere of radius 6367 km
    return km
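
Before applying the function to the whole table, it can be worth checking it on a few inputs with known answers. The sketch below repeats the function so that it runs on its own; the check values (identical points, one degree of longitude on the equator) are illustrative choices, and bob's coordinates are copied from the table above.

```python
from math import radians, cos, sin, asin, sqrt, pi

def haversine(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between two points given in decimal degrees."""
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    return 6367 * 2 * asin(sqrt(a))

# Identical points are zero distance apart.
same_point = haversine(103.8, 1.3, 103.8, 1.3)

# One degree of longitude along the equator is an arc of pi/180 radians,
# so the distance should equal radius * pi / 180.
one_degree = haversine(0.0, 0.0, 1.0, 0.0)
expected_one_degree = 6367 * pi / 180

# bob's home and MRT coordinates from the table above; the result should
# be close to the directdist value of 0.259835 shown there.
bob = haversine(103.807672, 1.271684, 103.809731, 1.270575)

print(same_point, one_degree, expected_one_degree, bob)
```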

Next, we apply the haversine function to each row, to calculate the direct distance from each person's home latlon to their nearest MRT station.
Here we use a lambda function to call haversine with the columns of each row, and specify axis=1 to apply it to each row (the default if not specified is axis=0, which applies it to each column).
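
The difference between the two axis settings is easiest to see on a toy table. The dataframe below (toy, with columns a and b) is made up for this illustration and is not part of the notebook's data.

```python
import pandas as pd

toy = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# axis=0 (the default): the function receives each column as a Series,
# so the result has one value per column.
col_sums = toy.apply(lambda col: col.sum())

# axis=1: the function receives each row as a Series, so individual
# values can be looked up by column name, and the result has one value per row.
row_sums = toy.apply(lambda row: row["a"] + row["b"], axis=1)

print(col_sums)
print(row_sums)
```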

In [16]:
df['directdist'] = df.apply(
    lambda person: haversine(person['homelon'], person['homelat'],
                             person['mrtlon'], person['mrtlat']),
    axis=1)
print(df)

     id   homelat     homelon        mrtname    mrtlat      mrtlon  directdist
0   bob  1.271684  103.807672  Telok Blangah  1.270575  103.809731    0.259835
1   tom  1.350142  103.935288       Tampines  1.270575  103.809731   16.515159
2  lars  1.425065  103.834088         Yishun  1.429334  103.834966    0.484316
3   ron  1.336155  103.698430        Pioneer  1.337614  103.697152    0.215510
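
For larger tables, a row-by-row apply can become slow. One possible alternative, sketched here as an assumption rather than as part of the original notebook, is to rewrite the formula with NumPy functions so that it operates on whole columns at once. The name haversine_np, its radius_km parameter, and the small people dataframe (two rows copied from the table above) are illustrative.

```python
import numpy as np
import pandas as pd

def haversine_np(lon1, lat1, lon2, lat2, radius_km=6367):
    """Vectorized great-circle distance in km; inputs are in decimal degrees."""
    # np.radians works element-wise on whole columns (Series or arrays).
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * radius_km * np.arcsin(np.sqrt(a))

# Rows for bob and lars, copied from the table above.
people = pd.DataFrame({
    "homelat": [1.271684, 1.425065],
    "homelon": [103.807672, 103.834088],
    "mrtlat": [1.270575, 1.429334],
    "mrtlon": [103.809731, 103.834966],
})

# Whole columns are passed in, so no lambda and no axis argument are needed.
people["directdist"] = haversine_np(
    people["homelon"], people["homelat"], people["mrtlon"], people["mrtlat"]
)
print(people)
```

The resulting distances should agree with the apply-based values shown above for the same rows.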
