## Introduction to Maps

Welcome to Lab 12! Today we will explore data from the Bay Area Bike Share service to practice using the `map_table` function.

As usual, run the cell below to prepare everything!

In [None]:
# Run this cell to set up the notebook, but please don't change it.

# These lines import the NumPy, math, and datascience modules.
import numpy as np
import math
from datascience import *

# These lines do some fancy plotting magic.
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import warnings
warnings.simplefilter('ignore', FutureWarning)
Table.interactive_plots()

## Mapping the Bay Area Bike Share


In class, you were introduced to the Bay Area Bike Share dataset and `map_table`, a powerful visualization tool. In this exercise, you will get to practice your new skills.

The defunct Bay Area Bike Share service (now known as Bay Wheels, operated by Lyft) published a dataset describing every bicycle rental in its system from September 2014 to August 2015. There were 354,152 rentals in all. The columns are:

- An ID for the rental
- Duration of the rental, in seconds
- Start date
- Name of the Start Station and code for Start Terminal
- Name of the End Station and code for End Terminal
- A serial number for the bike
- Subscriber type and zip code

**Question 1.** The dataset is in a file called `trip.csv`.  Load it into a table named `trips`:

In [None]:
trips = ...
trips

**Question 2.** Create a new table `starts` that has two columns: `Start Station` that has the name of every station, and `Number of Trips` that has the number of rides started at the corresponding station.

You should also create a new table called `durations`, that has two columns: `Start Station` that has the name of every station and `Average Duration` that has the average duration of rides from the corresponding station.

*Hint:* `tbl.group("column", function)` groups the rows of a table by the unique values in a column. By default, if you do not specify the function argument (i.e., `tbl.group("column")`), it returns the number of rows in each unique group. If you do specify the function argument, it applies that function to the grouped values of every other column in the table. This [video](https://www.youtube.com/watch?v=HLoYTCUP0fc) explains grouping in more detail.
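
If the two behaviors of `group` feel abstract, the plain-Python sketch below mimics them on a handful of made-up rows. It deliberately does not use the `datascience` library, and the station names and durations are invented for illustration; it only shows what "count per group" and "apply a function per group" mean.

```python
from collections import defaultdict
from statistics import mean

# Made-up rows: (start station, duration in seconds). Not real trip data.
toy_trips = [
    ("A St", 300),
    ("B Ave", 900),
    ("A St", 500),
    ("C Blvd", 1200),
    ("B Ave", 300),
]

# Collect the durations that share a start station.
durations_by_station = defaultdict(list)
for station, duration in toy_trips:
    durations_by_station[station].append(duration)

# No function argument: one count per unique station.
counts = {station: len(values) for station, values in durations_by_station.items()}

# With a function argument (here, the mean): one aggregated value per station.
averages = {station: mean(values) for station, values in durations_by_station.items()}

print(counts)
print(averages)
```

In the lab itself, `tbl.group` does this bookkeeping for you and returns a table with one row per station.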

In [None]:
starts = ...
starts

In [None]:
durations = ...
durations

**Question 3.** Produce a bar graph using the `durations` table that compares each station by the average duration of rides from that station, with the bars sorted in descending order.

In [None]:
...

Our other dataset, `station.csv`, contains geographical information about each bike station, including latitude, longitude, and a "landmark", which is the name of the city where the station is located.

**Question 4.** Load `station.csv` into a table named `stations`:

In [None]:
stations = ...
stations

**Question 5.** Draw a map of where these `stations` are located, using `Marker.map_table(tbl)`. This function takes in one argument: a table in a specific format. The table must have at least two columns, in the following order: latitude, longitude, and (as a third column) an optional identifier for each point. 


In [None]:
...

The map is created using [OpenStreetMap](http://www.openstreetmap.org/#map=5/51.500/-0.100), which is an open online mapping system that you can use just as you would use Google Maps or any other online map. Zoom in to San Francisco to see how the stations are distributed. Click on a marker to see which station it is, assuming you included an identifier column.

You can also represent points on a map by colored circles using `Circle.map_table`. This works the same as `Marker.map_table`, taking in a table with columns in a specific order, but it will use colored circles instead of the marker bubbles. In this case, we also will use some optional arguments, `color` and `radius`, that will allow us to change the circles that represent each station. 

**Question 6.** You may have noticed that `stations` contains multiple cities. Let's narrow our analysis. Make a table called `sf_map_data` with the `lat`, `long`, and `name` columns of **just** the San Francisco bike stations.

In [None]:
sf_stations = ...
sf_map_data = ...

Circle.map_table(sf_map_data, color='green', radius=200)

In this map, we set `radius=200`, which tells Python how big to make the circles for our points. Wouldn't it be nice if we could combine information from our `starts` table and set the size of each circle according to how many trips originated from that station? Let's write some functions that let us do that.

**Question 6.** Define a function `find_trip_count` that takes a station name and returns the number of trips that started at that station. Then, define a function `find_average_duration` that takes a station name and returns the average duration for trips that started at that station.

*Hint:* It may be useful to use the tables we defined at the start of the lab: `starts` and `durations`.

In [None]:
def find_trip_count(station_name):
    ...

def find_average_duration(station_name):
    ...

Unfortunately, some of the stations in the `stations` dataset are not present in the `trips` dataset. We must filter them out before applying `find_trip_count` to the remainder.


**Question 7.** Create a new table called `count_by_station` that contains only the rows of `stations` whose stations appear in the `trips` table. It should have the same columns as `stations`, plus two added columns: `Number of Trips`, which contains the number of trips from that station, and `Average Trip Duration`, which contains the average trip duration from that station. To create the arrays for these columns, you should use `tbl.apply`.

*Hint:* What table method allows us to filter a table for specific values? Then, what predicate allows us to check if a value is contained in an array? We recommend using that and saving the filtered table in the variable `stations_in_trips` before making `count_by_station`.
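
To see why the filtering step matters, here is the underlying idea sketched in plain Python, without the `datascience` library. The station names are made up for illustration and are not taken from `station.csv` or `trip.csv`.

```python
# Made-up station names; not taken from station.csv or trip.csv.
station_names = ["A St", "B Ave", "C Blvd", "D Way"]
trip_start_stations = ["A St", "C Blvd", "A St", "B Ave"]

# A set makes each membership check fast and ignores repeated names.
stations_with_trips = set(trip_start_stations)

# Keep only the stations that appear at least once as a trip's start station.
kept = [name for name in station_names if name in stations_with_trips]
print(kept)  # prints ['A St', 'B Ave', 'C Blvd']
```

Here "D Way" is dropped because no trip starts there, so a lookup of its trip count would have nothing to return. Your table-based solution should have the same effect on `stations`.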

In [None]:
stations_in_trips = ...
count_by_station = ...
    
    
count_by_station

Finally, let's make our map! Below we have defined a function to help you with the colors.

In [None]:
## Just run this cell. 

def duration_to_color(average_duration):
    """Converts an average trip duration to a string describing a color.
    
    Longer durations will be closer to bright red, and shorter durations
    will be closer to black.
    
    Args:
      average_duration (float): The average trip duration for one
        station.
    
    Returns:
      (string): A string describing a color based on the given average
        trip duration.  The string is in 6-digit hexadecimal format,
        which is a common way to describe colors."""
    max_duration_color = 255
    color_bits = 8
    rescaled_duration = min(max_duration_color, int(256 * average_duration / 5000))
    red_amount = 2**(2*color_bits) * rescaled_duration
    color = '#{:06X}'.format(red_amount)
    return color

**Question 8.** Finally, to map the data, we need to create a new table called `starts_map_data` that has five columns in the specific order that `Circle.map_table` can use: `lat`, `long`, `name`, `colors` and `area`.

`lat` should refer to a station's latitude, `long` a station's longitude, `name` a station's name, `colors` a color based on results of calling the function `duration_to_color` on that station's duration, and `area` a number based on the results of calling `find_trip_count` on a station.

In this case, our functions will allow us to compare the average duration of rides from a station graphically (through color) and the number of rides from a station through the size of the circle. By using a table with columns for `colors` and `area`, we don't need to specify the values by using the optional `color = ` and `radius = ` arguments.  
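
To get a feel for the strings that `duration_to_color` returns, the sketch below traces its arithmetic by hand for a single made-up average duration of 2500 seconds. It repeats the steps from the function body rather than calling the function, so it runs on its own.

```python
# Trace duration_to_color for one made-up average duration (in seconds).
average_duration = 2500

# Rescale so that 5000 seconds or more saturates at 255, the largest 8-bit value.
rescaled_duration = min(255, int(256 * average_duration / 5000))

# Shift the value into the red byte of a 24-bit RGB number (red is the top 8 bits).
red_amount = 2 ** 16 * rescaled_duration

# Format as '#' followed by six uppercase hexadecimal digits: RRGGBB.
color = '#{:06X}'.format(red_amount)
print(color)  # prints #800000
```

A duration of 0 gives `#000000` (black), and any duration of 5000 seconds or more gives `#FF0000` (bright red), which matches the docstring's description.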

In [None]:
starts_map_data = ...


# The code below will show some of your table as well as graph it.
starts_map_data.show(3)
Circle.map_table(starts_map_data)

### Conclusions
It seems that the locations with long trip durations are mostly in Palo Alto and Redwood City, with one exception in San Jose.  These are the least urban bike stations on the map.  The data are therefore compatible with our hypothesis.

Until now, we have not proposed a causal mechanism for the association.  Here are a few that are plausible:

* Palo Alto and Redwood City are close to long bike routes in the hills to the southwest.  Perhaps people take long recreational biking trips through the hills.
* Perhaps Stanford students rent bicycles to get around campus for days at a time.
* Perhaps some people who live or work in the long suburban peninsula between San Francisco and San Jose commute for long distances by bicycle.

**Question 9.** The `trips` dataset includes the date and time of day for the start and end of each trip.  How might we use this information to test some of the proposals above? Write your answer below and discuss with a classmate or TA.

*Write your answer here, replacing this text.*

## Submission

You're done with this lab!

To submit this notebook, please download your notebook as a .ipynb file and submit to Gradescope. You can do so by navigating to the toolbar at the top of this page, clicking File > Download as... > Notebook (.ipynb). Then, go to our class's Gradescope page [here](https://www.gradescope.com/courses/136698) and upload your file under "Lab 12." 

To check your work for all autograded questions, run the cell below. 

It's fine to submit multiple times, but we will only grade the final notebook you submit for each assignment. Make sure you pass all tests to receive credit.

In [None]:
## There are no autograder tests in this assignment! Just make sure you submit.