# Minilab 9 - SF Taxi trip durations

In this minilab we will begin to use [linear regression](https://en.wikipedia.org/wiki/Linear_regression) to further look at trends in datasets. We will explore how we can use linear regression to make predictions of taxi trip durations.

In [None]:
from datascience import *
import datetime
import matplotlib.pyplot as plt
import numpy as np
import warnings
%matplotlib inline
warnings.filterwarnings("ignore")

In [None]:
trips = Table.read_table('Taxi_Toy.csv')
trips

## Adding meaningful columns to our data table
The data table is rich, but the raw columns are not as useful as they could be. For example, we have date/time strings for departure and arrival, but we would rather have the trip duration than a list of start and end times.

### Distance of trips

Trip distance is likely one of the most important variables that determine the fare. Compute an approximate trip distance from the coordinates of the start and end points of each trip.

Hint: use the *distance_on_sphere(lat1, lon1, lat2, lon2)* function from minilab2.
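
If the minilab2 function is not at hand, a great-circle distance can be sketched with the haversine formula. The version below is only an illustration: the name `distance_on_sphere_sketch`, the Earth radius constant, and the choice of miles as the unit are assumptions, and the minilab2 implementation may differ.

```python
import numpy as np

EARTH_RADIUS_MILES = 3958.8  # assumed mean Earth radius, in miles

def distance_on_sphere_sketch(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between points given in degrees (haversine formula)."""
    # Convert all coordinates from degrees to radians
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    # Haversine of the central angle between the two points
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * np.arcsin(np.sqrt(a))

# One degree of latitude along a meridian, as a quick sanity check
one_degree = distance_on_sphere_sketch(37.0, -122.0, 38.0, -122.0)
```

Because it is written with NumPy functions, the sketch accepts whole columns (arrays) of coordinates as well as single values.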

In [None]:
# Your code here:


### Computing Trip Duration
We can write our own functions to help process the data. For example, we can create a function to compute trip duration: first convert the departure and arrival times into timestamps, then compute the difference, and finally convert it to minutes. You can use the get_duration() function below.

Use the [.apply](http://data8.org/datascience/_autosummary/datascience.tables.Table.apply.html#datascience.tables.Table.apply) method to create a 'duration' column.
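
The three steps (parse, subtract, convert to minutes) can be tried on a single pair of strings before applying them to the whole table. This is a minimal sketch: the function name `duration_minutes` and the two timestamps are made up for illustration, and the format string is the one used by the helper functions in this minilab.

```python
import datetime

TIME_FORMAT = "%m/%d/%y %H:%M"  # same format string as the minilab helpers

def duration_minutes(start, finish):
    """Minutes elapsed between two date/time strings."""
    start_time = datetime.datetime.strptime(start, TIME_FORMAT)
    end_time = datetime.datetime.strptime(finish, TIME_FORMAT)
    # Subtracting two datetimes gives a timedelta; convert it to minutes
    return (end_time - start_time).total_seconds() / 60.0

# Made-up trip that crosses midnight
example = duration_minutes("9/1/12 23:50", "9/2/12 0:15")  # 25.0
```

With the table, the same idea would look roughly like `trips.with_column('duration', trips.apply(get_duration, 'start', 'end'))`, where `'start'` and `'end'` are hypothetical labels standing in for the actual departure and arrival columns.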


In [None]:
def get_hour(s): 
    return datetime.datetime.strptime(s, "%m/%d/%y %H:%M").hour

def get_date(s): 
    return datetime.datetime.strptime(s, "%m/%d/%y %H:%M").date()

def get_weekday(s):  # 0 = Monday, ..., 6 = Sunday
    return datetime.datetime.strptime(s, "%m/%d/%y %H:%M").date().weekday()

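def get_duration(start, finish):
    "Return the trip duration in minutes."
    start_time = datetime.datetime.strptime(start, "%m/%d/%y %H:%M")
    end_time = datetime.datetime.strptime(finish, "%m/%d/%y %H:%M")
    # total_seconds() covers the whole interval; .seconds drops the days part
    return (end_time - start_time).total_seconds() / 60.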

In [None]:
# Your code here:


## Trip duration exploration

Plot a scatterplot of trip distances and durations.

In [None]:
plt.figure(figsize=(12, 8))
plt.plot(trips['distance'], trips['duration'], '.')
plt.xlabel("Trip distance (miles)")
plt.ylabel("Trip duration (minutes)")

## Linear regression:
The functions below are a straightforward implementation of linear regression as introduced in Data 8. See other examples of *fit_line()* usage in the http://data8.org/fa18/ labs. Feel free to use whichever implementation you are most familiar with.

In [None]:
def standard_units(any_numbers):
    "Convert any array of numbers to standard units."
    return (any_numbers - np.mean(any_numbers))/np.std(any_numbers)  

def correlation(t, x, y):
    "Compute r."
    return np.mean(standard_units(t.column(x))*standard_units(t.column(y)))

def slope(table, x, y):
    r = correlation(table, x, y)
    return r * np.std(table.column(y))/np.std(table.column(x))

def intercept(table, x, y):
    a = slope(table, x, y)
    return np.mean(table.column(y)) - a * np.mean(table.column(x))

def predict(x, slope, intercept):
    y = x*slope+intercept
    return y
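
As a sanity check on the formulas above, the slope and intercept computed from the correlation coefficient should agree with an ordinary least-squares fit. The sketch below uses made-up data and plain NumPy arrays instead of a table; `np.polyfit` is swapped in here only as an independent reference and is not part of the Data 8 implementation.

```python
import numpy as np

def standard_units(values):
    """Convert an array of numbers to standard units."""
    return (values - np.mean(values)) / np.std(values)

# Made-up data: a noisy linear relationship
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 3.0 * x + 5.0 + rng.normal(0, 2, 200)

# Regression line from the correlation coefficient
r = np.mean(standard_units(x) * standard_units(y))
slope_r = r * np.std(y) / np.std(x)
intercept_r = np.mean(y) - slope_r * np.mean(x)

# Reference: degree-1 least-squares fit
ls_slope, ls_intercept = np.polyfit(x, y, 1)

print(np.isclose(slope_r, ls_slope), np.isclose(intercept_r, ls_intercept))
```

The two approaches agree because r times the ratio of standard deviations equals the covariance of x and y divided by the variance of x, which is the least-squares slope.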

In the cell below, we again use a scatter plot of the travel distance vs. duration, and overlay the best-fit regression line on the plot.

In [None]:
# For convenience, store the two columns in their own variables
distance, duration = trips.column("distance"), trips.column("duration")

plt.figure(figsize = (8,6))
plt.scatter(distance, duration, color='blue', alpha = .05, label='Distance vs duration')

dist_slope = slope(trips,"distance","duration")
dist_intercept = intercept(trips,"distance","duration")



predicted_duration = predict(distance, dist_slope, dist_intercept)

plt.plot(distance, predicted_duration, color='red',
         label='Duration = %.2f * distance + %.2f'%(dist_slope, dist_intercept))

plt.xlabel("Trip distance (miles)")
plt.ylabel("Trip duration (minutes)")
plt.legend()

## Trips shorter than 5 miles

Repeat the prediction and create a visualization for trips shorter than 5 miles.
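
One possible way to restrict the data is a boolean mask on the distance values. The sketch below uses a handful of made-up distances and durations in place of the real columns, and `np.polyfit` in place of the regression functions defined earlier, so treat it as an outline of the approach and not as the expected solution.

```python
import numpy as np

# Made-up values standing in for the 'distance' and 'duration' columns
distance = np.array([0.8, 2.5, 4.9, 5.0, 7.3, 12.1])
duration = np.array([5.0, 11.0, 19.0, 18.0, 24.0, 31.0])

# Keep only trips strictly shorter than 5 miles
short = distance < 5
short_distance, short_duration = distance[short], duration[short]

# Fit one line to the short trips and one to all trips, for comparison
short_slope, short_intercept = np.polyfit(short_distance, short_duration, 1)
all_slope, all_intercept = np.polyfit(distance, duration, 1)
```

With the table itself, the same filter can be written as `trips.where('distance', are.below(5))`, after which the regression and plotting code from the previous section can be reused on the smaller table.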

In [None]:
# Your code here:


What is the difference between the two plots? Do you think linear regression is an appropriate model for all trip durations, only for the subset of short trips, or not appropriate for this task at all?