# Project 2 report

# Part 3 investigation

The description and visualisation above give a clear overall picture of bike usage in Scotland, but the data can also answer more specific questions, and this part investigates two of them, chosen after group discussion. First, crossing the border between Scotland and England is generally regarded as long-distance travel, so bikes are widely believed to be rarely used for it. We investigate bike usage near the border to test this belief. Second, we are curious about bike usage around key places (hospitals, universities, supermarkets, etc.) in Scotland. To keep the workload manageable, we only compare bike usage near hospitals with bike usage near supermarkets, using a pooled t-test. For each question we build a methodology in Python that combines data science techniques with statistical analysis, and we present the final results with a discussion.

Two functions are used in both questions, so we define them first, in the spirit of modular programming. `LoadData` reads the original dataset and performs some format conversions with pandas. `GetDistance` uses geopy to calculate the distance between two points given by their coordinates, taking the curvature of the Earth's surface into account.

In [1]:
import pandas as pd
import numpy as np
import seaborn
import matplotlib.pyplot as plt
from geopy.distance import geodesic
import time
import os
import scipy as sc

def GetDistance(keyPlace, roadpoint):
    '''
    This mini-script will be used for calculating the geodesic distance between
    two places, a key place and the road point, using the geodesic function of 
    geopy. This is an approximation of a distance.

    Inputs:
    keyPlace: Specific spot (market or hospital) to calculate the distance from.
              The expected type is a tuple containing the coordinates (lat, long)
    
    roadpoint: a row of the roadpoints dataframe (pandas Series) with 'latitude'
               and 'longitude' entries.

    Output:
    calculatedDistance = Geodesic distance (in km) rounded to 2 decimals

    '''
    
    # Create the tuple containing the roadpoint coordinates
    roadPoint = (roadpoint['latitude'], roadpoint['longitude'])

    # Calculate the distance in km and round to 2 decimals
    calculatedDistance = round(geodesic(keyPlace, roadPoint).kilometers,2)

    return calculatedDistance

def LoadData():

    '''
    This function loads the DFT dataset csv file as a pandas dataframe. It will
    process the dataframe and convert columns to a suitable format for data analysis

    Input: None
    Output: rawData (DFT dataframe) 
    '''

    # Import the csv, parsing the count dates and reading the junction road
    # name columns as strings
    rawData = pd.read_csv("dft_rawcount_region_id_3.csv", parse_dates = ["count_date"], 
                          dtype={'start_junction_road_name': str,
                                 'end_junction_road_name': str})

    # Make sure count_date is a datetime column even if parse_dates could not
    # convert it (infer_datetime_format is deprecated in recent pandas versions)
    rawData["count_date"] = pd.to_datetime(rawData["count_date"], yearfirst = True)

    return rawData



## Question 1: Bike usage at Scotland-England border

The basic assumption is that bike usage at road points within a given interval of the border (the interval differs between methods) represents the bike usage of people crossing the border. The overall idea is to use coordinates and intervals to find the target road points, and then to use the bike counts at those road points to represent bike usage on the border. We refined the program in three steps, each of which locates the border more accurately and tightens the interval. The code for these methods is given below.
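The interval idea can be illustrated with a minimal sketch in plain Python. The road points below are hypothetical and the helper name `in_band` is our own, chosen for illustration; only the crossing latitude and the 0.005 degree interval come from this report. The point to note is that the lower bound of a band must always be the smaller number, whichever side of the crossing the band lies on.

```python
# Hypothetical road points (id, latitude, longitude); not taken from the DfT data
road_points = [
    {"id": "A", "latitude": 55.3570, "longitude": -2.4900},
    {"id": "B", "latitude": 55.4000, "longitude": -2.4800},
    {"id": "C", "latitude": 55.3530, "longitude": -2.4790},
]

def in_band(value, edge, interval, side="above"):
    """Return True if value lies in a band of width `interval` next to `edge`.

    side="above" gives the band [edge, edge + interval];
    side="below" gives the band [edge - interval, edge].
    """
    if side == "above":
        low, high = edge, edge + interval
    elif side == "below":
        low, high = edge - interval, edge
    else:
        raise ValueError("side must be 'above' or 'below'")
    return low <= value <= high

crossing_lat = 55.354486   # latitude of one crossing point used in this report
interval = 0.005           # degrees of latitude, roughly 555 m

# Keep only the road points whose latitude falls in the band above the crossing
near_border = [p["id"] for p in road_points
               if in_band(p["latitude"], crossing_lat, interval, side="above")]
print(near_border)
```

The same helper works for longitude bands; only the coordinate passed in changes.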

In [2]:
import pandas as pd
import math
from geopy.distance import geodesic
from geopy.point import Point as Point
import os
import border

def CalculateMidpoint():

    '''
    This midpoint formula will calculate the midpoint of the Scottish border
    using the equation found at http://www.movable-type.co.uk/scripts/latlong.html
    This function uses the Point function from geopy

    Input: None

    Output: 
    midpoint: a geopy Point holding the midpoint of the border. 
    '''

    borderWest = (54.995769, -3.052872)
    borderEast = (55.806881, -2.042987)

    # Convert the coordinates to radians
    borderWestLat, borderWestLon = math.radians(borderWest[0]), math.radians(borderWest[1])
    borderEastLat, borderEastLon = math.radians(borderEast[0]), math.radians(borderEast[1])

    # Follow the equations found at http://www.movable-type.co.uk/scripts/latlong.html
    # Constants
    deltaLon = borderEastLon - borderWestLon
    bx = math.cos(borderEastLat) * math.cos(deltaLon)
    by = math.cos(borderEastLat) * math.sin(deltaLon)
    
    # Coordinates of the midpoint
    midLat = math.atan2(math.sin(borderWestLat) + math.sin(borderEastLat),
        math.sqrt(((math.cos(borderWestLat) + bx) ** 2 + by ** 2)))
    midLon = borderWestLon + math.atan2(by, math.cos(borderWestLat) + bx)
    
    # Normalise the longitude to a −180 … +180 range
    midLon = (midLon + 3 * math.pi) % (2 * math.pi) - math.pi

    # Convert back to degrees
    midLat = math.degrees(midLat)
    midLon = math.degrees(midLon)

    # Calculate the midpoint using the Point geopy function
    midpoint = Point(latitude = midLat, longitude=midLon)

    return midpoint

def LocateBikesBorders():

    '''
    This function will locate the bicycles counted within 10 km of the midpoint
    of the Anglo-Scottish border. The midpoint is calculated by using the far
    east and far west coordinates of the border.

    Input: None

    Output
    bikesOnBorders: a pandas dataframe with the roadpoint, coordinates, distance
    to the border and number of bikes found across the whole DFT dataset
    '''

    #Load the DFT data
    rawData = border.LoadData()

    # Select data only from the Scottish borders county and only with pedal_cycles
    borderData = rawData.loc[(rawData["local_authority_name"] == "Scottish Borders")
                             & (rawData["pedal_cycles"] > 0)]

    # The maximum distance to find a roadpoint from the midpoint is 10 km
    distance = 10

    # Get the midpoint and cast the coordinates in tuple form
    midpoint = CalculateMidpoint()
    midpointCoordinates = (midpoint.latitude, midpoint.longitude)

    # Prepare a copy of the original dataframe for calculating the distances
    bikesOnBorders = borderData.copy()

    # Find the geodesic distance to the midpoint
    bikesOnBorders["distance_to_border"] = borderData.apply(lambda row: border.GetDistance(midpointCoordinates, row), axis=1)

    # Create a new dataframe by selecting only the required information from the original dataframe
    bikesOnBorders = bikesOnBorders[["year", "count_point_id", "road_type","road_name", 
                                  "latitude", "longitude", "pedal_cycles", "distance_to_border"]]

    # Select only the roadpoints within the allowed distance and sort by year
    bikesOnBorders = bikesOnBorders.loc[(bikesOnBorders["distance_to_border"] <= distance)]
    bikesOnBorders = bikesOnBorders.sort_values(by=["year"], ascending = True)

    return bikesOnBorders

#show the result
a=LocateBikesBorders()
print(a)

        year  count_point_id road_type road_name   latitude  longitude  \
272864  2000           30737     Major       A68  55.367704  -2.492211   
233085  2004          996524     Minor     B6358  55.470236  -2.635329   
233089  2004          996524     Minor     B6358  55.470236  -2.635329   
185442  2007          996524     Minor     B6358  55.470236  -2.635329   
185432  2007          996524     Minor     B6358  55.470236  -2.635329   
185427  2007          996524     Minor     B6358  55.470236  -2.635329   
173409  2007           30737     Major       A68  55.367704  -2.492211   
173393  2007           30737     Major       A68  55.367704  -2.492211   
169758  2008          996524     Minor     B6358  55.470236  -2.635329   
136406  2009           10731     Major       A68  55.355528  -2.481019   
135177  2009             729     Major       A68  55.465386  -2.553529   
135172  2009             729     Major       A68  55.465386  -2.553529   
135163  2009             729     Major

In [8]:
import pandas as pd
import math
from geopy.distance import geodesic
from geopy.point import Point as Point
import os



def LocateBikesBordersInterval():

    '''
    This function will locate the bicycles found on a 1.1 km distance to the Anglo
    Scottish border using an interval approach done on the latitude or the longitude
    from every major crossing point. 

    Input: None

    Output
    bikesOnBorders: a pandas dataframe with the roadpoint, coordinates, distance
    to the border and number of bikes found across the whole DFT dataset
    '''

    #Load the DFT data
    rawData = LoadData()

    # Select data only from the Scottish borders county and only with pedal_cycles
    borderData = rawData.loc[(rawData["local_authority_name"] == "Scottish Borders")
                             & (rawData["pedal_cycles"] > 0)]

    # Border crossing coordinates
    borderWest = [54.995769, -3.052872]
    borderEast = [55.806881, -2.042987]
    jedburghCross = [55.354486, -2.478119]
    carlisleCross = [55.049714, -2.960520]
    coldstreamCross = [55.654783, -2.242233]
    ladykirkCross = [55.718895, -2.177002]
    deadwaterCross = [55.355528, -2.481019]
    kelsoCross = [55.566299, -2.263140]

    # Set an interval of 0.005 degrees (equivalent to 555 meters)
    interval = 0.005

    # Filter the roadpoints within the border area. The band is either
    # [border, border + interval] or, where it lies on the other side,
    # [border - interval, border], so the lower bound is always the smaller value
    bikesOnBorders = borderData.loc[((borderData["latitude"] >= borderWest[0]) &
                                     (borderData["latitude"] <= borderWest[0] + interval)) | 
                                     ((borderData["latitude"] >= jedburghCross[0]) &
                                     (borderData["latitude"] <= jedburghCross[0] + interval)) |
                                     ((borderData["longitude"] >= jedburghCross[1] - interval) &
                                     (borderData["longitude"] <= jedburghCross[1])) |
                                     ((borderData["latitude"] >= carlisleCross[0]) &
                                     (borderData["latitude"] <= carlisleCross[0] + interval))|
                                     ((borderData["latitude"] >= kelsoCross[0] - interval) &
                                     (borderData["latitude"] <= kelsoCross[0])) |
                                     ((borderData["longitude"] >= coldstreamCross[1] - interval) &
                                     (borderData["longitude"] <= coldstreamCross[1]))|
                                     ((borderData["latitude"] >= deadwaterCross[0]) &
                                     (borderData["latitude"] <= deadwaterCross[0] + interval))|
                                     ((borderData["latitude"] >= borderEast[0]) &
                                     (borderData["latitude"] <= borderEast[0] + interval))]

    # Create a new dataframe by selecting only the required information from the original dataframe
    bikesOnBorders = bikesOnBorders[["year", "count_point_id", "road_type","road_name", 
                                  "latitude", "longitude", "pedal_cycles"]]
    bikesOnBorders = bikesOnBorders.sort_values(by=["year"], ascending = True)

    return bikesOnBorders

### Display the result

The final dataframe is displayed below.

In [10]:
#Display the Dataframe of final outcome
print(bikesOnBorders)

        year  count_point_id road_type road_name   latitude  longitude  \
272740  2000           30709     Major        A1  55.807170  -2.043051   
204840  2005           30709     Major        A1  55.807170  -2.043051   
204859  2005           30709     Major        A1  55.807170  -2.043051   
204855  2005           30709     Major        A1  55.807170  -2.043051   
204847  2005           30709     Major        A1  55.807170  -2.043051   
136406  2009           10731     Major       A68  55.355528  -2.481019   
73621   2015          996729     Minor     B6355  55.815054  -2.151394   
73623   2015          996729     Minor     B6355  55.815054  -2.151394   
73620   2015          996729     Minor     B6355  55.815054  -2.151394   
73635   2015          996729     Minor     B6355  55.815054  -2.151394   
73640   2015          996729     Minor     B6355  55.815054  -2.151394   
73643   2015          996729     Minor     B6355  55.815054  -2.151394   
73631   2015          996729     Minor

In [7]:
import border
import folium
import pandas as pd

# Import the data 
bikesOnBorders = border.FindBikesInterval()

# Create a map of the Scottish Borders
borderLat = 55.5486
borderLong = -2.786

# Instantiate a feature group for the stations in the dataframe
bikes = folium.map.FeatureGroup()

# Add each roadpoint where bikes are found to the feature group
for lat, lng, in zip(bikesOnBorders.latitude, bikesOnBorders.longitude):
    bikes.add_child(
        folium.CircleMarker([lat, lng], radius=5, color='black', fill=True, fill_color='blue', fill_opacity=0.4))

# Add roadpoints to map
borderMap = folium.Map(location=[borderLat, borderLong], zoom_start=8)
borderMap.add_child(bikes)

# Draw the border line
borderWest = (54.995769, -3.052872)
borderEast = (55.806881, -2.042987)
borderCarham = (55.63214657346817, -2.336614595368269)
borderCheviot = (55.47085898747139, -2.1707897097633344)

border1 = [(borderCarham[0], borderCarham[1]), (borderEast[0], borderEast[1])]
border2 = [(borderCheviot[0], borderCheviot[1]),(borderCarham[0], borderCarham[1])]
border3 = [(borderWest[0], borderWest[1]), (borderCheviot[0], borderCheviot[1])]

folium.PolyLine(border1, color='green', weight=4, opacity=0.8).add_to(borderMap)
folium.PolyLine(border2, color='green', weight=4, opacity=0.8).add_to(borderMap)
folium.PolyLine(border3, color='green', weight=4, opacity=0.8).add_to(borderMap)

# Display the map
borderMap

### Discussion 

All three methods, even the original one, which is the least accurate, produce short result dataframes, and the bike counts in them are always single figures, indicating that bikes are seldom used near the border. This supports the majority belief that bikes are an unusual way to cross the border. The final dataframe (the third method's result) contains only two count points, on two different roads, with only 1 bike recorded. Moreover, those counts were recorded about 10 years ago. In summary, bikes are rarely used to cross the Scotland-England border nowadays.


## Question 2: The difference between bike usage heading to hospitals and supermarkets in Scotland 

The basic assumption is that bike usage at the road points closest to hospitals or supermarkets represents bike usage on the way to those hospitals or supermarkets. Comparing the sum or the average of the bike counts at those nearest road points would be a reasonable first approach, but it is not robust: each road point's data comes from a sample day only (the sample average is not the population average), and outliers can strongly influence a sum or an average. We therefore use a statistical test, the pooled t-test, because it is impossible to conduct a “census” to calculate the population variances (the population is defined below) and because we consider all its required assumptions to be fulfilled. Firstly, the two samples (also defined below) are independent, because bike usage near hospitals has nothing to do with bike usage near supermarkets.

Secondly, the intermediate results show that each sample contains over 100 observations, so by the central limit theorem the sample means are approximately normal and the normality assumption is fulfilled. As for the details of this hypothesis test, the population is the bike usage at the nearest road points at any time, on any day, in any year; the samples are the bike usage at the nearest road points recorded in the csv dataset for the top 5 cities of Scotland: Glasgow, Edinburgh, Perth, Aberdeen and Dundee.

Since our aim is to find out whether bike usage at hospitals is greater than that at supermarkets, the hypotheses are

$H_0: \mu_{\text{hospital}} - \mu_{\text{supermarket}} = 0$;

$H_1: \mu_{\text{hospital}} - \mu_{\text{supermarket}} > 0$.
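
For reference, the pooled t-test combines the two sample variances $s_1^2$ and $s_2^2$ into one estimate and compares the difference of the sample means with its standard error:

$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}, \qquad t = \frac{\bar{x}_1 - \bar{x}_2}{s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$$

with $n_1 + n_2 - 2$ degrees of freedom. The sketch below computes this statistic using only the Python standard library. The function names and the sample values are made up for illustration and are not the counts used in this report. The p-value helper uses a standard normal approximation, which is only reasonable for large samples such as the ones described above.

```python
import math
import statistics

def pooled_t(sample_1, sample_2):
    """Return (t statistic, degrees of freedom) for a pooled two-sample t-test."""
    n1, n2 = len(sample_1), len(sample_2)
    mean_1, mean_2 = statistics.mean(sample_1), statistics.mean(sample_2)
    # statistics.variance is the sample variance (n - 1 in the denominator)
    var_1, var_2 = statistics.variance(sample_1), statistics.variance(sample_2)
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * var_1 + (n2 - 1) * var_2) / df
    std_err = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean_1 - mean_2) / std_err, df

def one_sided_p_normal(t_stat):
    """Approximate P(T > t) with the standard normal (assumes large df)."""
    return 1 - statistics.NormalDist().cdf(t_stat)

# Made-up bike counts, for illustration only
hospital = [1, 2, 3, 4, 5]
market = [2, 3, 4, 5, 6]

t_stat, df = pooled_t(hospital, market)
print(t_stat, df, one_sided_p_normal(t_stat))
```

If SciPy is available, `scipy.stats.ttest_ind(hospital, market, equal_var=True, alternative="greater")` should give the same statistic together with a p-value based on the t distribution rather than the normal approximation.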

As in Question 1, coordinates are needed to calculate distances. We add two new csv files containing the coordinates of the main hospitals and supermarkets in the top 5 cities of Scotland. In a nutshell, the procedure is to use the coordinates to find the road point nearest to each main hospital or supermarket, and then to use the bike counts at those nearest road points as the samples for the hypothesis test.
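
The nearest road point search can be sketched as follows. To keep the example free of dependencies it uses the haversine formula on a spherical Earth (radius assumed to be 6371 km) instead of geopy's `geodesic`, so distances may differ slightly from those in the report's own code. The function names, the coordinates and the count point ids are hypothetical.

```python
import math

def haversine_km(point_a, point_b, radius_km=6371.0):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1 = map(math.radians, point_a)
    lat2, lon2 = map(math.radians, point_b)
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))

def nearest_road_point(place, road_points):
    """Return (count_point_id, distance_km) of the road point closest to place."""
    best_id, best_coords = min(road_points.items(),
                               key=lambda item: haversine_km(place, item[1]))
    return best_id, round(haversine_km(place, best_coords), 2)

# Hypothetical count point ids and coordinates, for illustration only
road_points = {
    101: (55.9500, -3.2000),
    102: (55.9400, -3.1800),
    103: (55.8600, -4.2500),
}
hospital = (55.9410, -3.1830)

closest_id, closest_km = nearest_road_point(hospital, road_points)
print(closest_id, closest_km)
```

Each hospital or supermarket is then matched with the bike counts of its closest count point, which form the two samples for the test.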

In [6]:
import pandas as pd
import numpy as np
import seaborn
import matplotlib.pyplot as plt
from geopy.distance import geodesic
import time
import os
import scipy as sc

def Hospitals():

    '''
    This function loads the Hospital.csv file as a pandas dataframe.
    It will convert the latitude and longitude coordinates to float types to
    keep the distance calculations consistent

    Input: None
    Output: hospitalData (Hospital dataframe) 
    '''

    # Get the current directory and import the csv 
    curDir = os.getcwd()
    hospitalCsv = os.path.join(curDir,"Hospital.csv")

    # Get the dataframe and convert the coordinates columns into floats
    hospitalData = pd.read_csv(hospitalCsv, dtype={"latitude": float, "longitude": float}, 
                            usecols=["city","latitude","longitude", "name"])
    
    return hospitalData

def Markets():

    '''
    This function loads the Main_markets.csv file as a pandas dataframe.
    It will convert the latitude and longitude coordinates to float types to
    keep the distance calculations consistent

    Input: None

    Output: 
    marketData (Market dataframe) 
    '''

    #Get the current directory and import the csv 
    curDir = os.getcwd()
    marketCsv = os.path.join(curDir,"Main_markets.csv")

    #Import the csv and convert the coordinates columns into float
    marketData = pd.read_csv(marketCsv, dtype={"latitude": float, "longitude": float}, 
                            usecols=["city","latitude","longitude", "name"])

    return marketData


def CityData(rawData):

    '''
    This function will call the DFT dataframe and will process only the top 5 cities
    in Scotland: Edinburgh, Glasgow, Perth, Aberdeen and Dundee

    Input
    rawData: DFT dataframe

    Output
    cityRoads: Filtered dataset containing the data for the top 5 cities
    
    '''

    # Select data considering the top 5 cities
    cityRoads = rawData.loc[(rawData["local_authority_name"] == "City of Edinburgh") | 
                            (rawData["local_authority_name"] == "Glasgow City") | 
                           (rawData["local_authority_name"] == "Perth & Kinross") |
                           (rawData["local_authority_name"] == "Aberdeen City") |
                           (rawData["local_authority_name"] == "Dundee City")]
         
    # Group the roads
    cityRoads = pd.DataFrame({'count': cityRoads.groupby(["count_point_id", "year",
                           "latitude", "longitude","pedal_cycles"]).size()}).reset_index()

    return cityRoads

def BikesOnPopularSpots(option):
    
    '''
    This function will calculate the closest roadpoint to each hospital or market 
    listed in the input Hospital.csv or Main_markets.csv files, and will output a 
    dataframe containing the original input csv file and the closest roadpoint, 
    and the number of bicycles that passed by it.

    Input 
    option: The user selection for finding the bikes passing around markets or 
            hospitals. This argument is a string containing the selected option. 
            The possible options are: "markets" and "hospitals". 
            e.g. BikesOnPopularSpots("markets") will calculate the closest point
            to each market and find the number of bicycles that passed by it.

    Output
    cycles: Pandas dataframe containing the hospital or market information, closest
            roadpoint and the number of cycles found in the whole dataset.
    '''

    # Load the top 5 cities data
    rawData = LoadData()
    cityRoads = CityData(rawData)

    # Load hospitals or markets data
    if (option == "hospitals"):

        # Hospital's data
        spotInfo = Hospitals()

    elif (option == "markets"):
        
        # Market's data
        spotInfo = Markets()
    
    else:
        
        # Neither hospitals nor markets
        print("Invalid option. The only valid options are markets or hospitals. Try again")
        return
        
    # Select only the coordinates found in the markets/hospitals dataframe
    spotCoordinates = spotInfo[["latitude", "longitude"]]

    # Empty lists to be filled with the distances and roadpoints
    closest_roadpoint = []
    distance_roadpoint = []

    # Find the geodesic distance between a place and a roadpoint    
    for i in range(0, len(spotInfo)):
        coordinates = (spotCoordinates.iloc[i,0], spotCoordinates.iloc[i,1])
        cityRoads[f"market_{i}"] = cityRoads.apply(lambda row: GetDistance(coordinates, row), axis=1)

        # Find the minimum distance and the index location
        minDistance = cityRoads[f"market_{i}"].min()
        idx = cityRoads[f"market_{i}"].idxmin()

        # Get the roadpoint id by label (idxmin returns an index label) and append to a list
        roadpoint = cityRoads.loc[idx, "count_point_id"]
        closest_roadpoint.append(roadpoint)
        distance_roadpoint.append(minDistance)

    # Add the filled lists into the dataframe as columns
    spotInfo["count_point_id"] = closest_roadpoint
    spotInfo["distance"] = distance_roadpoint

    # Select only the pedal_cycles, year and road point id columns from the DFT dataframe
    bikesOnRoads = cityRoads[["count_point_id","pedal_cycles", "year"]]

    # Merge the datasets using inner join
    cycles = pd.merge(spotInfo, bikesOnRoads, how = "inner", on = "count_point_id")

    return cycles

