# Minilab 6 - SF Taxi data

In this minilab we will begin to use [linear regression](https://en.wikipedia.org/wiki/Linear_regression) to further look at trends in datasets. We will explore how we can use linear regression to make predictions of future 


The dataset is San Francisco taxi data from 9/1/2012 to 9/17/2012. It consists of 50,000 taxi trips taken in the Bay Area during that period. For each trip we are given the departure time, arrival time, passenger fare, departure lat/lon coordinates, arrival lat/lon coordinates, departure TAZ (traffic analysis zone), and arrival TAZ.

In [None]:
from datascience import *
import datetime
import matplotlib.pyplot as plt
import numpy as np
import warnings
warnings.filterwarnings("ignore")

In [None]:
sf_taxi = Table.read_table('data/SF_taxi_data.csv')
sf_taxi

## Step one - Adding meaningful columns to our data table
The data table is rich, but the inputs are not as useful as they could be. For example we have date/time strings for arrival and departure. We may like to have information on trip duration, rather than a list of start and end times.

### Computing Trip Duration
We can write our own functions to help process the data. For example, we can create a function to compute trip duration: first convert the departure time and arrival time into timestamps, then compute the difference, and finally convert it to minutes. I have written the `get_dur()` function below.

**Task 1** Use the [.apply](http://data8.org/datascience/_autosummary/datascience.tables.Table.apply.html#datascience.tables.Table.apply) method to create a 'duration (min)' column in the `sf_taxi` table. Later cells refer to the column by this label.
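
The three steps described above (parse both strings, subtract, convert to minutes) can be tried on a single pair of timestamps first. This is only a minimal sketch: the sample strings are made up, the `%m/%d/%y %H:%M` layout is assumed to match the data, and `minutes_between` is a hypothetical name rather than one of the lab's helpers.

```python
from datetime import datetime

FMT = "%m/%d/%y %H:%M"  # assumed layout of the departure/arrival strings

def minutes_between(start, finish):
    """Return the elapsed time between two timestamp strings, in minutes."""
    start_time = datetime.strptime(start, FMT)   # step 1: parse both strings
    end_time = datetime.strptime(finish, FMT)
    elapsed = end_time - start_time              # step 2: subtract -> timedelta
    return elapsed.total_seconds() / 60.0        # step 3: seconds -> minutes

# Made-up example: a trip that starts late on 9/1 and ends after midnight.
print(minutes_between("9/1/12 23:50", "9/2/12 0:15"))  # 25.0
```

With the datascience library, a two-argument function like this is applied row by row by passing the function to `Table.apply` followed by the two column labels.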


In [None]:
def get_dur(start,finish): 
    start_time = datetime.datetime.strptime(start, "%m/%d/%y %H:%M")
    end_time = datetime.datetime.strptime(finish, "%m/%d/%y %H:%M")
    # total_seconds() covers the whole timedelta; .seconds would drop whole days
    return (end_time - start_time).total_seconds() / 60.

In [None]:
# Your code here


## Scatter Plots of trip duration vs. distance

**Task 2** Create a scatter plot of travel time vs. distance. Make sure `fit_line` is enabled.

In [None]:
# Your code here

### Making inferences
**Question 1** Use the linear regression result to predict the time to travel 25 miles in the Bay Area. Do you think this is a good prediction? Why or why not?
What is the predicted travel time for a 5 mile trip? What about a 35 mile trip? Do you think these are good predictions?
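
To see how a fitted line turns into a prediction, here is a small sketch that uses made-up (distance, duration) pairs rather than the taxi data. It computes the ordinary least-squares slope and intercept by hand, which should match the line `np.polyfit(x, y, 1)` returns, and then evaluates that line at one distance inside the observed range and one far outside it. The data and the names `least_squares_line` and `predict` are hypothetical.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(slope, intercept, x):
    """Evaluate the fitted line at x."""
    return slope * x + intercept

# Made-up trips: distance in miles, duration in minutes.
miles = [1.0, 2.0, 3.0, 4.0, 5.0]
minutes = [5.0, 7.0, 9.0, 11.0, 13.0]  # exactly 2 min/mile plus 3 min

slope, intercept = least_squares_line(miles, minutes)
print(slope, intercept)                 # 2.0 3.0
print(predict(slope, intercept, 3.5))   # inside the observed 1-5 mile range
print(predict(slope, intercept, 25.0))  # extrapolation far beyond that range
```

The 25-mile value is only as trustworthy as the assumption that the same linear trend continues well outside the distances that were actually observed.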


In [None]:
# Your answer here


## Segmenting the data.
The best fit line above includes all 50,000 trips from a 2+ week period in 2012. This may not be the best representation of the relationship for all trips. The relationship between trip duration and distance traveled on a weekday may be different than that on a weekend, for example. In the cell below I have given you some helper functions that may be useful in segmenting your data. 

**Task 3** Add a 'start hour' column and a 'day of week' column to the `sf_taxi` table using the helper functions below.
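
As a quick check of what this kind of helper returns, the sketch below parses three made-up departure strings in the same `%m/%d/%y %H:%M` layout and reports the hour, the weekday number, and whether the day is a weekend. September 1, 2012 was a Saturday, so `weekday()` returns 5 for it. The sample strings and the `is_weekend` name are illustrative assumptions.

```python
from datetime import datetime

FMT = "%m/%d/%y %H:%M"  # assumed layout of the departure strings

def is_weekend(s):
    """True when the timestamp falls on Saturday (5) or Sunday (6)."""
    return datetime.strptime(s, FMT).date().weekday() >= 5

samples = ["9/1/12 14:05", "9/3/12 18:30", "9/9/12 2:45"]  # made-up departure times
for s in samples:
    parsed = datetime.strptime(s, FMT)
    print(s, "hour:", parsed.hour, "weekday:", parsed.weekday(), "weekend:", is_weekend(s))
```

The same comparison (`>= 5` for weekends, `< 5` for weekdays) can be used as a filter once the 'day of week' column exists.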

In [None]:
def get_hour(s): return datetime.datetime.strptime(s, "%m/%d/%y %H:%M").hour
def get_date(s): return datetime.datetime.strptime(s, "%m/%d/%y %H:%M").date()
def get_weekday(s): return datetime.datetime.strptime(s, "%m/%d/%y %H:%M").date().weekday()
# weekday() returns 0 = Monday through 6 = Sunday

def fit(x,y): return np.poly1d(np.polyfit(x,y,1))(x)

In [None]:
# Your code here


## Scatter plot of trip distance vs. duration. 
**Task 4**
- Create a table that contains only weekend trips. Create another that contains only weekday trips.
- Create a scatter plot of trip distance vs. duration. Plot weekend trips in blue and weekday trips in red.
- Add a fit line to the graph. I created a helper function called `fit()` to help you with this; see the example below.
- Make sure your plots have appropriate labels, titles, and legends.

**Question 2** How do the weekend and weekday travel time trends compare?

In [None]:
weekend = sf_taxi.where(sf_taxi.column('day of week')>=5)

x, y = weekend.column("dist (miles)"), weekend.column("duration (min)")
plt.scatter(x, y, color='blue', alpha=.2)
plt.plot(x, fit(x, y), color='blue', label='weekend')


# Your code here 



plt.xlabel('distance (miles)')
plt.ylabel('duration (min)')
plt.legend()

In [None]:
# Your answers here

## Create a scatter plot of midday (11am-2pm) weekday trips vs. evening (5-8pm) weekday trips
Color evening trips in blue, midday trips in red. Add fit lines, label, and legends as above.

In [None]:
# Your code here

**Question 3** What do you notice about the two trends? How does the midday travel time per mile compare to the evening travel time per mile? Use these linear regression results to predict the time to travel 25 miles in the Bay Area at midday and at 6pm. By how much do the answers differ?

**Question 4** Notice the two clusters of trips: one around a 14-mile trip distance, the other from 1-5 miles. Why do you think these two clusters emerge?

In [None]:
# Your answers here

## Find the most popular origin TAZ
**Task 5**
- Find the most popular taxi trip origin TAZ.
- Create a scatter plot of all SF trips in grey. Overlay it with a scatter plot of only the trips that originate at the most popular TAZ.
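
Finding the most frequent value in a column amounts to counting occurrences and taking the largest count. Here is a minimal sketch with the standard library, using made-up zone IDs; the real TAZ values and column label are not assumed here.

```python
from collections import Counter

origin_zones = [101, 205, 101, 330, 101, 205]  # hypothetical origin TAZ IDs

counts = Counter(origin_zones)
top_zone, top_count = counts.most_common(1)[0]
print(top_zone, top_count)  # 101 3

# Keep only the positions of trips that start in the most popular zone.
from_top = [i for i, z in enumerate(origin_zones) if z == top_zone]
print(from_top)  # [0, 2, 4]
```

On a datascience table, `Table.group` on the origin column followed by a descending `sort` on the resulting count column should give the same answer.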

**Question 5** What patterns do you notice? Do trips from the most popular TAZ look similar to trips in the rest of the area? Do you have a guess as to what the most popular taxi origin might be?

In [None]:
# Your code here



In [None]:
# Your answers here