# Glastonbury Journey Planner
This project shows how web scraping can gather data for a data science project. I scraped train, bus and tube data for specific locations to help plan a route to Glastonbury Festival. The idea came from discussing with friends which Glastonbury ticket to buy in the first release of tickets. So where does data come in? During the first release window, you can only buy a Glastonbury ticket if you also buy a coach ticket with a pre-defined departure point. I therefore analysed the price and duration of each journey to work out the quickest and cheapest routes, and to discard any unrealistic journeys.

I produced a data table containing all the scraped information and sorted it by the `Total_Time` column from smallest to largest. The output shows that if the London ticket sells out, we should buy the Bristol, Bath or Leicester ticket, and so on down the list.


#### Idea: 
__Find the fastest or cheapest option from your home to Glastonbury via coach or train__
- This identifies the next best options if the London coach ticket has sold out

- 1) Calculate the distance from each coach station city to Glastonbury (using the geocode function)

- 2) Calculate the time and distance from each coach station city to Glastonbury (via scraping checkmybus.co.uk)

- 3) Calculate the time and cost for a train from London to the coach station city (via scraping nationalrail.co.uk)

- 4) Calculate the time from your home to the London train station that gets you to the right coach station (via the TfL open API)


#### The final output involves calling a function that takes the argument ‘City’ and outputs your planned route. 
- E.g. Bristol
- Step 1: Make your way to Paddington train station. This takes 34 minutes.
- Step 2: Get the train to Bristol train station. This takes 103 minutes.
- Step 3: Get the coach to Glastonbury. This takes 12 minutes.
- Summary. Total time: 149. Total price: £83.1
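
The totals in the summary line are plain sums of the three legs. A minimal sketch using the Bristol figures from the results table further down; the variable names are illustrative and not part of the notebook:

```python
# Bristol example: figures copied from the results table in this notebook.
time_to_station = 34   # minutes, home to London Paddington
train_time = 103       # minutes, Paddington to Bristol
coach_time = 12        # minutes, Bristol to Glastonbury

coach_price = 33.0     # GBP, festival coach ticket from Bristol
train_price = 50.1     # GBP, train fare from London to Bristol

total_time = time_to_station + train_time + coach_time
total_price = round(coach_price + train_price, 2)

print(f"Total time: {total_time} minutes. Total price: £{total_price}")
```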


In [1]:
from bs4 import BeautifulSoup
import ssl
import json
from time import ctime
import time
import os
import requests
import urllib.request
import urllib.parse
import urllib.error
import re
import pandas as pd
import numpy as np
from urllib.request import Request, urlopen
import nltk
from nltk.corpus import stopwords
import unicodedata
nltk.download('stopwords')
from scrapy import Selector
from scrapy.http import TextResponse
import datetime as dt
from geopy.distance import great_circle
from geopy.geocoders import Nominatim

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\robert.lowe\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
df = pd.read_csv('coach locations.csv')

In [3]:
df

Unnamed: 0,City,Price
0,Bristol,33.0
1,London,52.5
2,Bath,31.0
3,Birmingham,49.0
4,Brighton,52.0
5,Cambridge,68.5
6,Cardiff,39.0
7,Leeds,64.0
8,Leicester,50.5
9,Lincoln,63.0


In [4]:
geolocator = Nominatim(user_agent="specify_your_app_name_here",
                       format_string="%s, london, UK")
address, (latitude, longitude) = geolocator.geocode("london")
print(latitude, longitude)

51.5073219 -0.1276474


In [5]:
def coord(x):
    geolocator = Nominatim(user_agent="specify_your_app_name_here", format_string="%s, {}, UK".format(x))
    address, (latitude, longitude) = geolocator.geocode(x)
    return latitude, longitude
    

In [6]:
coord('Leeds')

(53.7948592, -1.54802881274469)

In [7]:
df['coordinates'] = df['City'].apply(coord)

In [8]:
df.head()

Unnamed: 0,City,Price,coordinates
0,Bristol,33.0,"(51.4538022, -2.5972985)"
1,London,52.5,"(51.5073219, -0.1276474)"
2,Bath,31.0,"(51.3813864, -2.3596963)"
3,Birmingham,49.0,"(52.4632175, -1.86001916063752)"
4,Brighton,52.0,"(50.8220399, -0.1374061)"


In [10]:
def distance(x):
    Glasto = (51.1459537, -2.7045787586418)
    City = x
    return great_circle(Glasto, City).km

In [11]:
distance((53.4794892, -2.2451148))

261.34921467816133
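
For readers without geopy, the same kind of great-circle distance can be approximated with the haversine formula. This is a sketch, not what `great_circle` does internally; the Earth radius constant and the name `haversine_km` are assumptions made for the example. The result should land close to the 261 km shown above.

```python
import math

EARTH_RADIUS_KM = 6371.009  # mean Earth radius; an assumed constant for this sketch


def haversine_km(a, b):
    """Great-circle distance in km between two (latitude, longitude) pairs in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


glasto = (51.1459537, -2.7045787586418)
print(haversine_km(glasto, (53.4794892, -2.2451148)))
```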

In [12]:
df['Distance'] = df['coordinates'].apply(distance)

In [13]:
df.sort_values(by='Distance')

Unnamed: 0,City,Price,coordinates,Distance
0,Bristol,33.0,"(51.4538022, -2.5972985)",35.034371
2,Bath,31.0,"(51.3813864, -2.3596963)",35.512908
6,Cardiff,39.0,"(51.4816546, -3.1791934)",49.81497
20,Southampton,52.0,"(50.9025349, -1.404189)",94.890911
15,Oxford,47.0,"(51.7534512, -1.2699542)",120.189911
18,Reading,48.0,"(51.456659, -0.9696512)",125.462542
16,Plymouth,41.0,"(50.3712659, -4.1425658)",132.854514
3,Birmingham,49.0,"(52.4632175, -1.86001916063752)",157.561292
4,Brighton,52.0,"(50.8220399, -0.1374061)",183.269534
1,London,52.5,"(51.5073219, -0.1276474)",183.497103


In [14]:
df['Train_station_name'] = df['City'].str.replace('Stoke', 'Stoke-on-Trent')
df['Train_station_name'] = df['Train_station_name'].str.replace('Preston', 'Preston (Lancs)')
df['Train_station_name'] = df['Train_station_name'].str.replace('Bath', 'Bath Spa')
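
Because `str.replace` substitutes substrings, any city whose name merely contains "Bath", "Stoke" or "Preston" would be altered too. A possible alternative is an exact-match lookup, sketched here in plain Python; `STATION_NAMES` and `station_name` are hypothetical names for illustration:

```python
# Exact-match overrides; every other city keeps its own name.
STATION_NAMES = {
    "Stoke": "Stoke-on-Trent",
    "Preston": "Preston (Lancs)",
    "Bath": "Bath Spa",
}


def station_name(city):
    """Return the rail station name for a coach city (hypothetical helper)."""
    return STATION_NAMES.get(city, city)


print([station_name(c) for c in ["Bath", "Bristol", "Stoke"]])
```

With a DataFrame, the same helper could be applied with `df['City'].map(station_name)`.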

In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24 entries, 0 to 23
Data columns (total 5 columns):
City                  24 non-null object
Price                 24 non-null float64
coordinates           24 non-null object
Distance              24 non-null float64
Train_station_name    24 non-null object
dtypes: float64(2), object(3)
memory usage: 1.0+ KB


In [16]:
df

Unnamed: 0,City,Price,coordinates,Distance,Train_station_name
0,Bristol,33.0,"(51.4538022, -2.5972985)",35.034371,Bristol
1,London,52.5,"(51.5073219, -0.1276474)",183.497103,London
2,Bath,31.0,"(51.3813864, -2.3596963)",35.512908,Bath Spa
3,Birmingham,49.0,"(52.4632175, -1.86001916063752)",157.561292,Birmingham
4,Brighton,52.0,"(50.8220399, -0.1374061)",183.269534,Brighton
5,Cambridge,68.5,"(52.194144, 0.1375027)",228.00867,Cambridge
6,Cardiff,39.0,"(51.4816546, -3.1791934)",49.81497,Cardiff
7,Leeds,64.0,"(53.7948592, -1.54802881274469)",304.774424,Leeds
8,Leicester,50.5,"(52.6361398, -1.1330789)",197.694248,Leicester
9,Lincoln,63.0,"(51.6467723, -0.0653782)",191.37072,Lincoln


## Coach Distance and Time (Break down of code at the bottom of the page)
- 1) Coach Time 2) Coach Distance
- Use the website checkmybus.co.uk, then input the city and scrape the time and distance

In [17]:
def coach_time(city):
    link = 'https://www.checkmybus.co.uk/glastonbury/{}'.format(city)
    try:
        
        
        #Standard beautiful soup to scrape website
        req = Request(link, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36'})
        webpage = urlopen(req).read()
        soup = BeautifulSoup(webpage, "html.parser")
        time = soup.findAll('table', class_= "route-summary")
        
        # Find text and clean
        time = [x.text for x in time]
        time = [x.replace("Cheapest Bus", "") for x in time]
        time = [x.replace("Fastest Bus", ",") for x in time]
        time = [x.replace("Distance", ",") for x in time]
        time = [x.replace("Coach CompaniesNational Express", "") for x in time]
        
        #convert list to string
        time = ", ".join(time)
        lst = time.split(",")
        time = lst[1]
        time = time.replace("m", '')
        time = time.split("h")
        
        #Convert time to minutes
        time_in_min = (int(time[0])*60) + int(time[1])
        return time_in_min
    
    except:
        
        resp = requests.get(link)
        resp = TextResponse(body=resp.content, url=link)
        
        #For routes with no summary coach data, subtract the departure time from the arrival time
        initial_time = resp.xpath('//*[@id="connections-main"]/div[3]/div/div/div/ul/li/div[1]/div[1]/div/div/div/div[2]/div/div[2]/div[1]/text()[2]').extract()
        initial_time = ", ".join(initial_time)
        initial_time = initial_time.replace(' ', '')
        final_time = resp.xpath('//*[@id="connections-main"]/div[3]/div/div/div/ul/li/div[1]/div[1]/div/div/div/div[2]/div/div[2]/div[2]/text()[2]').extract()
        final_time = ", ".join(final_time)
        final_time = final_time.replace(' ', '')

        try: 
            #Convert time to minutes
            start_dt = dt.datetime.strptime(initial_time, '%H:%M')
            end_dt = dt.datetime.strptime( final_time, '%H:%M')
            diff = (end_dt - start_dt) 
            return diff.seconds/60 
        except:
            return 'No coaches available'
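
The first branch of `coach_time` turns a duration into minutes by stripping the `m` and splitting on `h`. A regular expression can do the same in one step. This sketch assumes the page shows durations in a form like `4h 37m`, which is inferred from the cleaning steps rather than confirmed; `duration_to_minutes` is a hypothetical helper:

```python
import re


def duration_to_minutes(text):
    """Convert a duration like '4h 37m' to minutes; return None if it does not match."""
    match = re.search(r"(\d+)\s*h\s*(\d+)\s*m", text)
    if match is None:
        return None
    hours, minutes = map(int, match.groups())
    return hours * 60 + minutes


print(duration_to_minutes("4h 37m"))
print(duration_to_minutes("No coaches available"))
```

Returning `None` for unparseable text lets the caller decide what to do, instead of relying on a bare `except`.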

In [18]:
import requests
def coach_distance(city):
    link = 'https://www.checkmybus.co.uk/glastonbury/{}'.format(city)
    req = requests.get(link, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36'})
    html = req.content
    #webpage = urlopen(req).read()
    soup = BeautifulSoup(html, "html.parser")
    try:
        # Same approach as coach_time, but the text needs more cleaning
        time = soup.findAll('table', class_= "route-summary")
        time = [x.text for x in time]
        time = [x.replace("Cheapest Bus", "") for x in time]
        time = [x.replace("Fastest Bus", ",") for x in time]
        time = [x.replace("Distance", ",") for x in time]
        time = [x.replace("Coach CompaniesNational Express", "") for x in time]
        time = [x.replace("ALSA", "") for x in time]
        time = [x.replace("megabus", "") for x in time]
        time = [x.replace("Greyhound US", "") for x in time]
        time = [x.replace(" National Express", "") for x in time]
        time = [x.replace("Coach CompaniesAir Decker", "") for x in time]
        time = [x.replace("Greyhound Australia", "") for x in time]
        time = [x.replace("Rede Expressos", "") for x in time]
        time = [x.replace("Air Decker", "") for x in time]
        time = [x.replace("Crucero del Norte", "") for x in time]
        time = [x.replace("Citi Express", "") for x in time]
        time = [x.replace("El Práctico", "") for x in time]
        time = [x.replace("CitiExpress", "") for x in time]
        time = [x.replace(" ", "") for x in time]

        #Convert list to string
        time = ", ".join(time)
        lst = time.split(",")
        time = lst[1]
        
        #A string with 'km' will be the distance, therefore use 'in' to find
        lst2 = ['k' in x for x in lst]
        return lst[lst2.index(True)]
    except:
        try:
            time = soup.findAll('span', class_= "km")
            time = [x.text for x in time]
            return max(time)
        except:
            return 'No coaches available'
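
Note that the fallback branch calls `max` on strings, which compares them character by character rather than by value. A sketch of pulling the number out first, assuming distance strings look like `'49 km'` as in the breakdown at the end of this notebook; `parse_km` and the sample list are illustrative:

```python
import re


def parse_km(text):
    """Extract the number from a string such as '49 km'; return None if absent."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*km", text)
    return float(match.group(1)) if match else None


spans = ["49 km", "184.4 km", "no distance"]   # illustrative values
distances = [d for d in map(parse_km, spans) if d is not None]

print(max(distances))                 # numeric maximum
print(max(["49 km", "184.4 km"]))     # string comparison picks '49 km'
```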


## Train Time and Price (Break down of code at the bottom of the page)
- 1) Time 2) Price
- Use the website nationalrail.co.uk, then input the city and scrape the time and price

#### Process:
- Input city and time (which is calculated using the ctime() function)
- Scrape Time and Price from nationalrail website

In [19]:
def time(city):
    
    try:
        # Find the current time - ctime() outputs day, month, date, time, year
        x = ctime()
        y = x.split()

        # refer to only the time
        y = y[-2]

        # Split by : and only use hours [0] and minutes [1]
        time = y.split(':')
        current_time = time[0] + time[1]

        # Website for national rail from London
        # Use format to add in the city to travel to and current time to search for that time
        link = 'http://ojp.nationalrail.co.uk/service/timesandfares/London/{}/today/{}/dep'.format(city, current_time)

        #Standard beautiful soup to scrape website
        req = Request(link, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36'})
        webpage = urlopen(req).read()
        soup = BeautifulSoup(webpage, "html.parser")

        #Search for durationHours - a dictionary of important information is the output
        the_word = 'durationHours'
        price = soup.find(text=lambda text: text and the_word in text)

        ###########################  Clean up string to give the dictionary #######################
        price = price.replace('\n\t\t\t', '')
        price = price.replace('\n\t\t', '')
        string = price.replace('{"jsonJourneyBreakdown":', '')

        #Now removing { and }
        s = string.replace("{" ,"")
        finalstring = s.replace("}" , "")

        #Splitting the string based on , we get key value pairs
        list = finalstring.split(",")

        dictionary ={}
        for i in list:
            #Get Key Value pairs separately to store in dictionary
            keyvalue = i.split(":")

            #Replacing the single quotes in the leading.
            m= keyvalue[0].strip('\'')
            m = m.replace("\"", "")
            dictionary[m] = keyvalue[1].strip('"\'')
        ##########################################################################################

        return (int(dictionary.get('durationHours')) *60) + int(dictionary.get('durationMinutes'))
    except:
        return 0
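
The `HHMM` string that goes into the URL can also be produced directly with `strftime`, without splitting the `ctime()` text. A small sketch; `departure_time` is a hypothetical helper and the date in the example is arbitrary:

```python
from datetime import datetime


def departure_time(now=None):
    """Return the time as an 'HHMM' string for the journey planner URL."""
    now = now or datetime.now()
    return now.strftime("%H%M")


print(departure_time(datetime(2019, 1, 15, 12, 45)))
```

Passing the time in as an argument also makes it possible to query a fixed departure time rather than "now".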

In [20]:
def price(city):
    
    try:
        # Find the current time - ctime() outputs day, month, date, time, year
        x = ctime()
        y = x.split()

        # refer to only the time
        y = y[-2]

        # Split by : and only use hours [0] and minutes [1]
        time = y.split(':')
        current_time = time[0] + time[1]

        # Website for national rail from London
        # Use format to add in the city to travel to and current time to search for that time
        link = 'http://ojp.nationalrail.co.uk/service/timesandfares/London/{}/today/{}/dep'.format(city, current_time)

        #Standard beautiful soup to scrape website
        req = Request(link, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36'})
        webpage = urlopen(req).read()
        soup = BeautifulSoup(webpage, "html.parser")

        #Search for durationHours - a dictionary of important information is the output
        the_word = 'durationHours'
        price = soup.find(text=lambda text: text and the_word in text)

        ###########################  Clean up string to give the dictionary #######################
        price = price.replace('\n\t\t\t', '')
        price = price.replace('\n\t\t', '')
        string = price.replace('{"jsonJourneyBreakdown":', '')

        #Now removing { and }
        s = string.replace("{" ,"")
        finalstring = s.replace("}" , "")

        #Splitting the string based on , we get key value pairs
        list = finalstring.split(",")

        dictionary ={}
        for i in list:
            #Get Key Value pairs separately to store in dictionary
            keyvalue = i.split(":")

            #Replacing the single quotes in the leading.
            m= keyvalue[0].strip('\'')
            m = m.replace("\"", "")
            dictionary[m] = keyvalue[1].strip('"\'')
        #############################################################################################

        return dictionary.get('fullFarePrice')
    except:
        return 0
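
Splitting the text on `,` and `:` cuts any value that itself contains those characters; the Truro dictionary near the end of this notebook shows `'departureTime': '13'`, for instance. If the embedded text is valid JSON (an assumption, since the raw page text is not shown here), `json.loads` keeps such values intact. The sample string below is hypothetical and shortened:

```python
import json

# Hypothetical, shortened sample of the text found on the page.
raw = '''
{"jsonJourneyBreakdown": {"departureStationName": "London Paddington",
 "departureTime": "13:03", "durationHours": 4, "durationMinutes": 59}}
'''

breakdown = json.loads(raw)["jsonJourneyBreakdown"]
minutes = int(breakdown["durationHours"]) * 60 + int(breakdown["durationMinutes"])

print(breakdown["departureTime"])   # the colon inside the value survives
print(minutes)
```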

#### Apply functions to dataframe

In [21]:
df['Train_Price'] = df['Train_station_name'].apply(price)
df['Train_Time'] = df['Train_station_name'].apply(time)

In [22]:
df['Coach_distance'] = df['City'].apply(coach_distance)

In [23]:
df['Coach_time'] = df['City'].apply(coach_time)

In [24]:
df['Coach_distance'] = df['Coach_distance'].str.replace('km','')

In [25]:
# Remove locations where no coaches are available
df = df[df['Coach_distance'] != "No coaches available"]

In [26]:
df['Train_Time'] =df['Train_Time'].astype(float)
df['Coach_distance'] = df['Coach_distance'].astype(float)
df['Coach_time'] =df['Coach_time'].astype(float)
df['Train_Price']=df['Train_Price'].astype(float)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.
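
The warning above is pandas pointing out that `df` was produced by filtering another frame (cell 25) and is then assigned to. One common way to avoid it is to take an explicit `.copy()` when filtering. The sketch below uses a small made-up frame, and `pd.to_numeric` with `errors="coerce"` is shown as one option for the type conversion, not as the notebook's method:

```python
import pandas as pd

raw = pd.DataFrame({
    "City": ["Bristol", "London", "Nowhere"],                    # 'Nowhere' is made up
    "Coach_distance": ["35.4", "184.4", "No coaches available"],
})

# .copy() makes the filtered frame independent, so later assignments
# do not target a slice of the original.
clean = raw[raw["Coach_distance"] != "No coaches available"].copy()
clean["Coach_distance"] = pd.to_numeric(clean["Coach_distance"], errors="coerce")

print(clean["Coach_distance"].tolist())
```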

In [27]:
df

Unnamed: 0,City,Price,coordinates,Distance,Train_station_name,Train_Price,Train_Time,Coach_distance,Coach_time
0,Bristol,33.0,"(51.4538022, -2.5972985)",35.034371,Bristol,50.1,103.0,35.4,12.0
1,London,52.5,"(51.5073219, -0.1276474)",183.497103,London,0.0,0.0,184.4,277.0
2,Bath,31.0,"(51.3813864, -2.3596963)",35.512908,Bath Spa,50.1,88.0,35.9,29.0
3,Birmingham,49.0,"(52.4632175, -1.86001916063752)",157.561292,Birmingham,55.7,81.0,159.5,258.0
4,Brighton,52.0,"(50.8220399, -0.1374061)",183.269534,Brighton,18.6,82.0,184.2,406.0
5,Cambridge,68.5,"(52.194144, 0.1375027)",228.00867,Cambridge,26.1,85.0,228.4,130.0
6,Cardiff,39.0,"(51.4816546, -3.1791934)",49.81497,Cardiff,47.3,126.0,49.0,80.0
7,Leeds,64.0,"(53.7948592, -1.54802881274469)",304.774424,Leeds,112.5,133.0,305.6,400.0
8,Leicester,50.5,"(52.6361398, -1.1330789)",197.694248,Leicester,89.5,62.0,198.0,115.0
9,Lincoln,63.0,"(51.6467723, -0.0653782)",191.37072,Lincoln,76.7,126.0,275.2,210.0


## London Station
- Scrape the London station that trains to each destination depart from

In [28]:
def London_station(city):
    '''The National Rail page scraped above also names the London station the train departs from'''
    try:
        x = ctime()
        y = x.split()
        y = y[-2]
        time = y.split(':')
        current_time = time[0] + time[1]

        link = 'http://ojp.nationalrail.co.uk/service/timesandfares/London/{}/today/{}/dep'.format(city, current_time)
        req = Request(link, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36'})
        webpage = urlopen(req).read()
        soup = BeautifulSoup(webpage, "html.parser")
        the_word = 'durationHours'
        price = soup.find(text=lambda text: text and the_word in text)

        price = price.replace('\n\t\t\t', '')
        price = price.replace('\n\t\t', '')
        string = price.replace('{"jsonJourneyBreakdown":', '')

        #Now removing { and }
        s = string.replace("{" ,"")
        finalstring = s.replace("}" , "")

        #Splitting the string based on , we get key value pairs
        list = finalstring.split(",")

        dictionary ={}
        for i in list:
            #Get Key Value pairs separately to store in dictionary
            keyvalue = i.split(":")

            #Replacing the single quotes in the leading.
            m= keyvalue[0].strip('\'')
            m = m.replace("\"", "")
            dictionary[m] = keyvalue[1].strip('"\'')

        return dictionary.get('departureStationName')
    except:
        return None  # no departure station could be scraped

In [29]:
London_station('Manchester')

'London Euston'
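
`time`, `price` and `London_station` each repeat the same request and parsing, so every city is downloaded three times. One way to reduce this is to parse the page once and read all three fields from the resulting dictionary. A sketch of the second half of that idea; `journey_summary` is a hypothetical helper, and the sample values are copied from the Truro dictionary shown near the end of this notebook:

```python
def journey_summary(breakdown):
    """Pull duration (minutes), fare and departure station out of one parsed dictionary."""
    return {
        "minutes": int(breakdown["durationHours"]) * 60 + int(breakdown["durationMinutes"]),
        "price": float(breakdown["fullFarePrice"]),
        "station": breakdown["departureStationName"],
    }


# Values copied from the Truro dictionary shown near the end of this notebook.
truro = {
    "departureStationName": "London Paddington",
    "durationHours": "4",
    "durationMinutes": "59",
    "fullFarePrice": "68.2",
}
print(journey_summary(truro))
```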

In [30]:
df['London_train_station'] = df['Train_station_name'].apply(London_station)



## Postcode
- Find the London station postcode using the geocode function

In [31]:
def postcode(x):
    try:
        geolocator = Nominatim(user_agent="specify_your_app_name_here", format_string="%s, {} Station,London".format(x))
        address, (latitude, longitude) = geolocator.geocode('{} Station'.format(x))
        
        #regex code to find postcodes in a string
        postcode = re.findall(r'\b[A-Z]{1,2}[0-9][A-Z0-9]? [0-9][ABD-HJLNP-UW-Z]{2}\b', address)
        postcode = ''.join(postcode)
        return postcode
    except:
        return 'none'
    

In [32]:
postcode('Marylebone')

'NW1 5QE'
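
The regular expression can be tried on its own, without a geocoding request. The address below is a made-up string in the comma-separated style a geocoder typically returns, and `first_postcode` is a hypothetical helper. It returns only the first match, whereas `''.join(...)` in the function above would run several matches together if an address contained more than one:

```python
import re

POSTCODE = re.compile(r"\b[A-Z]{1,2}[0-9][A-Z0-9]? [0-9][ABD-HJLNP-UW-Z]{2}\b")


def first_postcode(address):
    """Return the first UK-style postcode in an address string, or None."""
    match = POSTCODE.search(address)
    return match.group(0) if match else None


# Made-up address string for illustration.
address = "Marylebone Station, Melcombe Place, London, NW1 5QE, United Kingdom"
print(first_postcode(address))
```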

In [33]:
df['London_train_station'] = df['London_train_station'].str.replace('London', '')

#St Pancras has the same postcode as Kings Cross
df['London_train_station'] = df['London_train_station'].str.replace('St Pancras International', 'Kings Cross')



In [34]:
df

Unnamed: 0,City,Price,coordinates,Distance,Train_station_name,Train_Price,Train_Time,Coach_distance,Coach_time,London_train_station
0,Bristol,33.0,"(51.4538022, -2.5972985)",35.034371,Bristol,50.1,103.0,35.4,12.0,Paddington
1,London,52.5,"(51.5073219, -0.1276474)",183.497103,London,0.0,0.0,184.4,277.0,
2,Bath,31.0,"(51.3813864, -2.3596963)",35.512908,Bath Spa,50.1,88.0,35.9,29.0,Paddington
3,Birmingham,49.0,"(52.4632175, -1.86001916063752)",157.561292,Birmingham,55.7,81.0,159.5,258.0,Euston
4,Brighton,52.0,"(50.8220399, -0.1374061)",183.269534,Brighton,18.6,82.0,184.2,406.0,Kings Cross
5,Cambridge,68.5,"(52.194144, 0.1375027)",228.00867,Cambridge,26.1,85.0,228.4,130.0,Kings Cross
6,Cardiff,39.0,"(51.4816546, -3.1791934)",49.81497,Cardiff,47.3,126.0,49.0,80.0,Paddington
7,Leeds,64.0,"(53.7948592, -1.54802881274469)",304.774424,Leeds,112.5,133.0,305.6,400.0,Kings Cross
8,Leicester,50.5,"(52.6361398, -1.1330789)",197.694248,Leicester,89.5,62.0,198.0,115.0,Kings Cross
9,Lincoln,63.0,"(51.6467723, -0.0653782)",191.37072,Lincoln,76.7,126.0,275.2,210.0,Kings Cross


In [35]:
df['LDN_STATION_POSTCODE'] = df['London_train_station'].apply(postcode)



### TfL API
- Use the API to calculate the travel time from the work postcode to each London station

In [36]:
def time_to_station(coord):
    response = requests.get('https://api.tfl.gov.uk/Journey/JourneyResults/se10lr/to/{}'.format(coord))

    pass_times = response.json()
    pass_times = pass_times.get('journeys')
    try:
        info = pass_times[0]
        return info.get('duration')
    except:
        return 0
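
The station postcodes contain a space, so it can help to percent-encode them before building the URL, and to separate URL building from response handling so each part can be checked without a network call. A sketch under the assumption that the response has the `journeys` list with a `duration` field used above; `journey_url` and `first_duration` are hypothetical helpers:

```python
from urllib.parse import quote

BASE = "https://api.tfl.gov.uk/Journey/JourneyResults"


def journey_url(origin, destination):
    """Build the request URL, percent-encoding spaces in postcodes."""
    return f"{BASE}/{quote(origin)}/to/{quote(destination)}"


def first_duration(payload):
    """Return the duration (minutes) of the first journey in a decoded response, or None."""
    journeys = payload.get("journeys") or []
    return journeys[0].get("duration") if journeys else None


print(journey_url("se10lr", "W2 1RL"))
# Hand-written stand-in for response.json()
print(first_duration({"journeys": [{"duration": 34}]}))
```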

In [37]:
df['Time_to_station'] = df['LDN_STATION_POSTCODE'].apply(time_to_station)



In [38]:
df

Unnamed: 0,City,Price,coordinates,Distance,Train_station_name,Train_Price,Train_Time,Coach_distance,Coach_time,London_train_station,LDN_STATION_POSTCODE,Time_to_station
0,Bristol,33.0,"(51.4538022, -2.5972985)",35.034371,Bristol,50.1,103.0,35.4,12.0,Paddington,W2 1RL,34
1,London,52.5,"(51.5073219, -0.1276474)",183.497103,London,0.0,0.0,184.4,277.0,,none,0
2,Bath,31.0,"(51.3813864, -2.3596963)",35.512908,Bath Spa,50.1,88.0,35.9,29.0,Paddington,W2 1RL,34
3,Birmingham,49.0,"(52.4632175, -1.86001916063752)",157.561292,Birmingham,55.7,81.0,159.5,258.0,Euston,NW1 2DN,35
4,Brighton,52.0,"(50.8220399, -0.1374061)",183.269534,Brighton,18.6,82.0,184.2,406.0,Kings Cross,N1 9SJ,38
5,Cambridge,68.5,"(52.194144, 0.1375027)",228.00867,Cambridge,26.1,85.0,228.4,130.0,Kings Cross,N1 9SJ,38
6,Cardiff,39.0,"(51.4816546, -3.1791934)",49.81497,Cardiff,47.3,126.0,49.0,80.0,Paddington,W2 1RL,34
7,Leeds,64.0,"(53.7948592, -1.54802881274469)",304.774424,Leeds,112.5,133.0,305.6,400.0,Kings Cross,N1 9SJ,38
8,Leicester,50.5,"(52.6361398, -1.1330789)",197.694248,Leicester,89.5,62.0,198.0,115.0,Kings Cross,N1 9SJ,38
9,Lincoln,63.0,"(51.6467723, -0.0653782)",191.37072,Lincoln,76.7,126.0,275.2,210.0,Kings Cross,N1 9SJ,38


In [39]:
df['Total_Time'] = df['Train_Time'] + df['Coach_time'] + df['Time_to_station']
df['Total_price'] = df['Price'] + df['Train_Price']




In [40]:
df.sort_values(by='Total_Time')

Unnamed: 0,City,Price,coordinates,Distance,Train_station_name,Train_Price,Train_Time,Coach_distance,Coach_time,London_train_station,LDN_STATION_POSTCODE,Time_to_station,Total_Time,Total_price
0,Bristol,33.0,"(51.4538022, -2.5972985)",35.034371,Bristol,50.1,103.0,35.4,12.0,Paddington,W2 1RL,34,149.0,83.1
2,Bath,31.0,"(51.3813864, -2.3596963)",35.512908,Bath Spa,50.1,88.0,35.9,29.0,Paddington,W2 1RL,34,151.0,81.1
8,Leicester,50.5,"(52.6361398, -1.1330789)",197.694248,Leicester,89.5,62.0,198.0,115.0,Kings Cross,N1 9SJ,38,215.0,140.0
6,Cardiff,39.0,"(51.4816546, -3.1791934)",49.81497,Cardiff,47.3,126.0,49.0,80.0,Paddington,W2 1RL,34,240.0,86.3
5,Cambridge,68.5,"(52.194144, 0.1375027)",228.00867,Cambridge,26.1,85.0,228.4,130.0,Kings Cross,N1 9SJ,38,253.0,94.6
14,Nottingham,61.0,"(52.9470734, -1.1470758)",226.824813,Nottingham,65.0,120.0,227.6,115.0,Kings Cross,N1 9SJ,38,273.0,126.0
18,Reading,48.0,"(51.456659, -0.9696512)",125.462542,Reading,20.6,25.0,126.3,217.0,Paddington,W2 1RL,34,276.0,68.6
1,London,52.5,"(51.5073219, -0.1276474)",183.497103,London,0.0,0.0,184.4,277.0,,none,0,277.0,52.5
16,Plymouth,41.0,"(50.3712659, -4.1425658)",132.854514,Plymouth,82.3,183.0,132.0,120.0,Paddington,W2 1RL,34,337.0,123.3
15,Oxford,47.0,"(51.7534512, -1.2699542)",120.189911,Oxford,27.3,63.0,121.6,258.0,Paddington,W2 1RL,34,355.0,74.3


# Summary

In [62]:
def my_route(City, df):
    df = df[df['City'] == City]
    
    print('Step 1: Make your way to ' + str(df.London_train_station.iloc[0])+  ' train station. This takes '+  str(df.Time_to_station.iloc[0]) +  ' minutes.')
    print('Step 2: Get the train to '+  str(df.Train_station_name.iloc[0]) +  ' train station. This takes ' + str(df.Train_Time.iloc[0]) +  ' minutes.')
    print('Step 3: Get the coach to Glastonbury. This takes '+  str(df.Coach_time.iloc[0])+  ' minutes.')
    print('Summary. Total time: ' + str(df.Total_Time.iloc[0]) + '. Total price: £' + str(df.Total_price.iloc[0])) 
    

In [63]:
my_route('Bristol', df)

Step 1: Make your way to Paddington train station. This takes 34 minutes.
Step 2: Get the train to Bristol train station. This takes 103 minutes.
Step 3: Get the coach to Glastonbury. This takes 12 minutes.
Summary. Total time: 149. Total price: £83.1
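
A variant that returns the text instead of printing it can be easier to reuse and to check. This sketch takes a single row as a dictionary (a DataFrame row indexed by column name would work the same way); `route_text` is a hypothetical helper and the values are the Bristol figures from the table above:

```python
def route_text(row):
    """Format one journey (a dict or DataFrame row) as the four-line summary."""
    return "\n".join([
        f"Step 1: Make your way to {row['London_train_station']} train station. "
        f"This takes {row['Time_to_station']} minutes.",
        f"Step 2: Get the train to {row['Train_station_name']} train station. "
        f"This takes {row['Train_Time']} minutes.",
        f"Step 3: Get the coach to Glastonbury. This takes {row['Coach_time']} minutes.",
        f"Summary. Total time: {row['Total_Time']}. Total price: £{row['Total_price']}",
    ])


bristol = {
    "London_train_station": "Paddington", "Time_to_station": 34,
    "Train_station_name": "Bristol", "Train_Time": 103, "Coach_time": 12,
    "Total_Time": 149, "Total_price": 83.1,
}
print(route_text(bristol))
```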


____________________________________________________________________________________________________




#### Breakdown of the code, showing some of the working

In [41]:
city = 'Cardiff'
link = 'https://www.checkmybus.co.uk/glastonbury/{}'.format(city)
#Standard beautiful soup to scrape website
req = Request(link, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36'})
webpage = urlopen(req).read()
soup = BeautifulSoup(webpage, "html.parser")
#the_word = 'journey'
#price = soup.find(text=lambda text: text and the_word in text)
time = soup.findAll('span', class_= "km")
time = [x.text for x in time]
time[0]

'49 km'

In [5]:
from scrapy.http import TextResponse
city = 'Cardiff'
link = 'https://www.checkmybus.co.uk/glastonbury/{}'.format(city)
resp = requests.get(link)
resp = TextResponse(body=resp.content, url=link)
initial_time = resp.xpath('//*[@id="connections-main"]/div[3]/div/div/div/ul/li/div[1]/div[1]/div/div/div/div[2]/div/div[2]/div[1]/text()[2]').extract()
initial_time = ", ".join(initial_time)
initial_time = initial_time.replace(' ', '')
initial_time

'12:30'

In [6]:
final_time = resp.xpath('//*[@id="connections-main"]/div[3]/div/div/div/ul/li/div[1]/div[1]/div/div/div/div[2]/div/div[2]/div[2]/text()[2]').extract()
final_time = ", ".join(final_time)
final_time = final_time.replace(' ', '')
final_time

'13:50'

In [13]:
import datetime as dt
start_dt = dt.datetime.strptime(initial_time, '%H:%M')
end_dt = dt.datetime.strptime( final_time, '%H:%M')
diff = (end_dt - start_dt) 
diff.seconds/60 

80.0
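
`diff.seconds` also gives a positive value when the arrival time is past midnight, because a negative `timedelta` is stored as minus one day plus a positive number of seconds. The same behaviour can be written out explicitly. A sketch assuming every trip lasts under 24 hours; `minutes_between` is a hypothetical helper:

```python
from datetime import datetime


def minutes_between(start, end):
    """Minutes from start to end ('HH:MM' strings), assuming the trip lasts under 24 hours."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    # Modulo one day turns a negative difference (arrival after midnight) into the real duration.
    return (delta.total_seconds() / 60) % (24 * 60)


print(minutes_between("12:30", "13:50"))
print(minutes_between("23:30", "00:45"))
```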

In [267]:
#link = 'http://ojp.nationalrail.co.uk/service/timesandfares/London/{}/today/{}/dep'.format(city, current_time)
link = 'https://www.checkmybus.co.uk/glastonbury/london'
#Standard beautiful soup to scrape website
req = Request(link, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36'})
webpage = urlopen(req).read()
soup = BeautifulSoup(webpage, "html.parser")
#the_word = 'journey'
#price = soup.find(text=lambda text: text and the_word in text)
time = soup.findAll('table', class_= "route-summary")
time = [x.text for x in time]
time = [x.replace("Cheapest Bus", "") for x in time]
time = [x.replace("Fastest Bus", ",") for x in time]
time = [x.replace("Distance", ",") for x in time]
time = [x.replace("Coach CompaniesNational Express", "") for x in time]
time = ", ".join(time)
lst = time.split(",")
time = lst[1]
distance = lst[-1]
time = time.replace("m", '')
time = time.split("h")

time_in_min = (int(time[0])*60) + int(time[1])
time_in_min


277

In [160]:
def time(city):
    link = 'http://ojp.nationalrail.co.uk/service/timesandfares/London/{}/today/1245/dep'.format(city)
    req = Request(link, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36'})
    webpage = urlopen(req).read()
    soup = BeautifulSoup(webpage, "html.parser")
    the_word = 'durationHours'
    price = soup.find(text=lambda text: text and the_word in text)

    price = price.replace('\n\t\t\t', '')
    price = price.replace('\n\t\t', '')
    string = price.replace('{"jsonJourneyBreakdown":', '')

    #Now removing { and }
    s = string.replace("{" ,"")
    finalstring = s.replace("}" , "")

    #Splitting the string based on , we get key value pairs
    list = finalstring.split(",")

    dictionary ={}
    for i in list:
        #Get Key Value pairs separately to store in dictionary
        keyvalue = i.split(":")

        #Replacing the single quotes in the leading.
        m= keyvalue[0].strip('\'')
        m = m.replace("\"", "")
        dictionary[m] = keyvalue[1].strip('"\'')

    return dictionary

In [161]:
time('Truro')

{'departureStationName': 'London Paddington',
 'departureStationCRS': 'PAD',
 'arrivalStationName': 'Truro',
 'arrivalStationCRS': 'TRU',
 'statusMessage': 'on time',
 'departureTime': '13',
 'arrivalTime': '18',
 'durationHours': '4',
 'durationMinutes': '59',
 'changes': '1',
 'journeyId': '1',
 'responseId': '4',
 'statusIcon': 'GREEN_TICK',
 'hoverInformation': 'null',
 'singleJsonFareBreakdowns': '["breakdownType',
 'fareTicketType': 'Super Off-Peak Single',
 'ticketRestriction': 'YX',
 'fareRouteDescription': 'Travel is allowed via any permitted route.',
 'fareRouteName': 'ANY PERMITTED',
 'passengerType': 'Adult',
 'railcardName': '',
 'ticketType': 'Super Off-Peak Single',
 'ticketTypeCode': 'SSS',
 'fareSetter': 'GWR',
 'fareProvider': 'Great Western Railway',
 'tocName': 'Great Western Railway',
 'tocProvider': 'Great Western Railway',
 'fareId': '7',
 'numberOfTickets': '1',
 'fullFarePrice': '68.2',
 'discount': '0',
 'ticketPrice': '68.2',
 'cheapestFirstClassFare': '116.3

In [109]:
link = 'http://ojp.nationalrail.co.uk/service/timesandfares/London/BPW/today/1245/dep'
req = Request(link, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36'})
webpage = urlopen(req).read()
soup = BeautifulSoup(webpage, "html.parser")
price = soup.findAll("label", class_="opsingle")
prices = [x.text for x in price]
prices = [x.replace('\n\n\t\t\t\t\t\t\t\t\t', '') for x in prices]
prices = [x.replace('\n\t\t\t\t\t\t\t\t', '') for x in prices]
min(prices)

'£35.40'

In [158]:
def city(city):
    link = 'http://ojp.nationalrail.co.uk/service/timesandfares/London/{}/today/1245/dep'.format(city)
    req = Request(link, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36'})
    webpage = urlopen(req).read()
    soup = BeautifulSoup(webpage, "html.parser")
    price = soup.findAll("label", class_="opsingle")
    prices = [x.text for x in price]
    prices = [x.replace('\n\n\t\t\t\t\t\t\t\t\t', '') for x in prices]
    prices = [x.replace('\n\t\t\t\t\t\t\t\t', '') for x in prices]
    return min(prices)

In [159]:
city('Truro')

'£143.50'
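
One caveat with `min(prices)`: the fares are strings, so they are compared character by character and `'£143.50'` sorts before `'£35.40'`. It is possible that the Truro result above is affected by this, although the full list of fares is not shown. A sketch of a numeric comparison; `cheapest` is a hypothetical helper and the list simply reuses two fares seen above:

```python
def cheapest(prices):
    """Return the lowest fare from strings such as '£35.40', compared numerically."""
    return min(prices, key=lambda p: float(p.replace("£", "").replace(",", "")))


fares = ["£35.40", "£143.50"]   # illustrative list reusing two fares seen above
print(min(fares))        # string comparison
print(cheapest(fares))   # numeric comparison
```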