## Problem 1 - Basic EDA with pandas and matplotlib

Back in HW2, we used R's dplyr package to analyze data from Washington DC's bike share system. Now you'll
see what a similar analysis looks like in Python.
For this assignment, we'll use the Cycle Share Dataset from Kaggle Datasets.
Go to [https://www.kaggle.com/pronto/cycle-share-dataset](https://www.kaggle.com/pronto/cycle-share-dataset)
to read about the data. We will be using the `trip.csv` and
`station.csv` files. I have made them available to you as part of the
assignment. Make sure you use the versions I gave you, as I've made a few modifications to make things easier.

In this first problem, you are taking on the role of an analyst
who is doing some exploration of this raw trip data.

### Step 1 - get ready to analyze trip.csv file

If you're reading this, you've downloaded **hw4_files_w24.zip** and extracted it.

Rename this Jupyter notebook as `hw4_cycleshare_[your last name].ipynb`. Mine would be called `hw4_cycleshare_isken.ipynb`.

Browse the files using a text editor or the shell to get a sense of their structure.

Visit the Kaggle site listed above for more about the file contents. You
will mostly be using the `trip.csv` data file but I've included the 
`station.csv` file as we'll need it too.

### Step 2 - Read data and explore rows

Now you'll need to complete the following tasks in Python. Just like we did in class, you should
use a combination of markdown text (be concise, no need to write tons of text) to explain 
what you are doing and Python code cells to actually do it.

First let's load the libraries we'll need. Ignore warnings about the version of numpy.

In [None]:
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
%matplotlib inline

In [None]:
# Create variables to store filenames to read.
file_trip_data = './data/trip.csv'
file_station_data = './data/station.csv'

**QUESTION 1.1** Read in the data files and check structure. Make sure that `starttime` and `stoptime` get read in as datetimes.
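
If datetime parsing in `read_csv` is new to you, here is a minimal sketch on a tiny made-up CSV. The column names (`ride_id`, `begin`, `end`) are invented for illustration and are not the ones in the assignment files.

```python
import io
import pandas as pd

# Tiny made-up CSV standing in for a real file; column names are hypothetical.
csv_text = """ride_id,begin,end
1,2015-01-01 08:00:00,2015-01-01 08:15:00
2,2015-01-02 09:30:00,2015-01-02 09:50:00
"""

# parse_dates lists the columns pandas should convert to datetimes.
rides = pd.read_csv(io.StringIO(csv_text), parse_dates=['begin', 'end'])

# info() reports each column's dtype, a quick way to confirm the conversion.
rides.info()
```

If the listed columns show up as `object` instead of a `datetime64` dtype, the parsing did not happen.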

In [None]:
# Read the trip data
#trip_df = pd.read_csv(???, ???=???)

# Read the station data
#station_df = pd.read_csv(???)

In [None]:
# Check out the structure of the resulting DataFrames


**QUESTION 1.2** In two separate code cells, do the following:

- List the first fifteen rows (all the columns).
- List the rows with **index values** 500 through 510 (inclusive), but only the `starttime`, `stoptime`, `from_station_id`, and `to_station_id` columns.
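
As a reminder of how label-based selection behaves, here is a small sketch on a made-up DataFrame (the columns `a`, `b`, and `c` are hypothetical). Note that `.loc` slices by index label and includes both endpoints, unlike ordinary Python slicing.

```python
import pandas as pd

# Made-up frame with a default integer index 0..19; column names are hypothetical.
demo = pd.DataFrame({
    'a': range(20),
    'b': range(100, 120),
    'c': list('abcdefghijklmnopqrst'),
})

# head(n) returns the first n rows with all columns.
first_rows = demo.head(3)

# .loc slices by index *label* and includes both endpoints,
# so 5:8 returns the four rows labeled 5, 6, 7 and 8.
subset = demo.loc[5:8, ['a', 'c']]
print(subset)
```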

In [None]:
# First 15 rows


In [None]:
# Rows 500-510 with select columns


**QUESTION 1.3** List all trips in which the `from_station_id` is 'WF-03'.

**QUESTION 1.4** This question builds on several ideas from above and then extends things. So, we want to find all the trips
in which:

- the `from_station_id` is 'WF-03' and the `bikeid` is 'SEA00082'. Use boolean indexing like we did in the ORSchedLeadTime notebook when we were selecting records from certain surgical services.
- and only return columns `bikeid`, `tripduration`, `starttime`, `stoptime`, `from_station_id`,  and `to_station_id`.

Capture the result in a new variable named `WF03_df`. Then print out the value of `WF03_df`. 
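
For reference, combining two conditions with boolean indexing generally looks like the sketch below. It uses a made-up `pets` DataFrame, not the trip data, so you will need to adapt the column names and values.

```python
import pandas as pd

# Made-up data for illustration only.
pets = pd.DataFrame({
    'species': ['cat', 'dog', 'cat', 'dog'],
    'name': ['Tom', 'Rex', 'Kit', 'Tom'],
    'age': [3, 5, 2, 7],
})

# Each condition needs its own parentheses; & means element-wise "and".
mask = (pets['species'] == 'dog') & (pets['name'] == 'Tom')

# .loc takes the boolean mask for rows and a list of labels for columns.
result = pets.loc[mask, ['name', 'age']]
print(result)
```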

**HACKER EXTRA** 

The use of boolean indexing like you just did can get tedious, ugly and hard to read. Fans of the R package, dplyr, point out that you can use the pipe to write more readable code. To be fair, you can actually do something similar in pandas using "method chaining". On the course web page for pandas, down in the [Explore section](http://www.sba.oakland.edu/faculty/isken/courses/mis5470_f23/eda_python.html#explore-optional), you can find this [link to a blog post showing how you can do dplyr-like coding in pandas](https://stmorse.github.io/journal/tidyverse-style-pandas.html). So, now try to answer Question 1.4 by using pandas method chaining as shown in that blog post. HINT: The `query` and `filter` methods will be useful.
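
To give a feel for the chained style, here is a minimal sketch on a made-up `pets` DataFrame (hypothetical columns, not the trip data). `query` picks rows from a string expression and `filter` picks columns by name.

```python
import pandas as pd

# Made-up data for illustration only.
pets = pd.DataFrame({
    'species': ['cat', 'dog', 'cat', 'dog'],
    'name': ['Tom', 'Rex', 'Kit', 'Tom'],
    'age': [3, 5, 2, 7],
})

# Wrapping the chain in parentheses lets each step sit on its own line.
result = (
    pets
    .query("species == 'dog' and name == 'Tom'")  # row selection
    .filter(items=['name', 'age'])                # column selection
)
print(result)
```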

### Step 3 - trip counts

**QUESTION 1.5** We are interested in the number of trips into and out of each station. In fact, the difference between trips out and trips in for a given station gives a sense of the station *balance*. Start by counting the number of trips out of each station. Store the result in a variable called `station_trips_out`.
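
One common way to count occurrences of each value in a column is `value_counts`; a `groupby` followed by `size` is another. A small sketch on made-up data:

```python
import pandas as pd

# Made-up values for illustration only.
fruit = pd.Series(['apple', 'pear', 'apple', 'fig', 'apple', 'pear'])

# value_counts tallies each distinct value, most frequent first.
counts = fruit.value_counts()
print(counts)
print(type(counts))
```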

In [None]:
#station_trips_out = ???

In [None]:
# Use the type function to see what station_trips_out is.


Now do a similar thing for trips in by station.

In [None]:
#station_trips_in = ???

Now, create a new Series called `station_balance` that is the difference between the number of trips in and the number of trips out. A negative value should correspond to more trips out than in. Then display the `station_balance` Series sorted in ascending order. Write a few sentences discussing the implications of these results from the perspective of managing bike share systems. What are the implications of having large station imbalance levels?
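
Keep in mind that arithmetic between two Series aligns on the index labels. The sketch below uses made-up counts to show what happens when a label is missing from one side. Whether that situation occurs in the trip data is something to check for yourself.

```python
import pandas as pd

# Made-up counts; label 'C' is missing from departures and 'D' from arrivals.
arrivals = pd.Series({'A': 10, 'B': 4, 'C': 7})
departures = pd.Series({'A': 6, 'B': 9, 'D': 2})

# Plain subtraction aligns on the index; unmatched labels become NaN.
diff_nan = arrivals - departures

# sub with fill_value=0 treats a missing count as zero instead.
diff = arrivals.sub(departures, fill_value=0).sort_values()
print(diff)
```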

In [None]:
#station_balance = ???
#station_balance.???

***Write your discussion paragraph on station balance here*** Just double click me and edit the markdown cell.




### Hacker extra - merge with station data
Combine the `station_trips_in`, `station_trips_out`, and `station_balance` series into a DataFrame called `station_trips` (use appropriate column names). Then figure out how to use merge to add the `current_dock_count` field into your `station_trips` DataFrame.
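
Here is one possible sketch of the general pattern on made-up data: `pd.concat` with `axis=1` lines up Series by index, and `merge` can match the left frame's index against a column in the right frame. All names below (`n_in`, `n_out`, `net`, `code`, `capacity`) are hypothetical.

```python
import pandas as pd

# Made-up Series sharing the same index labels.
ins = pd.Series({'A': 10, 'B': 4}, name='n_in')
outs = pd.Series({'A': 6, 'B': 9}, name='n_out')

# Side-by-side concat uses each Series' name as its column name.
summary = pd.concat([ins, outs], axis=1)
summary['net'] = summary['n_in'] - summary['n_out']

# Made-up lookup table keyed by a regular column.
lookup = pd.DataFrame({'code': ['A', 'B'], 'capacity': [16, 20]})

# Match summary's index against lookup's 'code' column.
merged = summary.merge(lookup, left_index=True, right_on='code', how='left')
print(merged)
```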

### Step 4 - Aggregate analysis

**QUESTION 1.6** Let's do some simple group by analysis to get a sense of the trip duration statistics by `to_station_id`. Compute the mean trip duration grouped by `to_station_id` and sort the results in descending order by mean trip duration.
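
As a quick refresher, a grouped mean followed by a sort might look like this sketch on a made-up `sales` DataFrame (hypothetical columns):

```python
import pandas as pd

# Made-up data for illustration only.
sales = pd.DataFrame({
    'store': ['x', 'y', 'x', 'y', 'z'],
    'amount': [10, 40, 30, 20, 5],
})

# Group rows by store, average the amount column, then sort high to low.
mean_by_store = (
    sales.groupby('store')['amount']
    .mean()
    .sort_values(ascending=False)
)
print(mean_by_store)
```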

In [None]:
# Compute the mean trip duration


Now compute summary statistics for `tripduration` by `to_station_id` using the pandas `describe` method.

In [None]:
# Compute summary statistics for trip duration



Repeat the above query but sort the results in descending order by the **median** trip duration.
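
One thing that may help: calling `describe` on a grouped column returns a DataFrame, and the median appears in the column labeled `'50%'`. A small sketch on made-up data (hypothetical columns):

```python
import pandas as pd

# Made-up data for illustration only.
sales = pd.DataFrame({
    'store': ['x', 'y', 'x', 'y', 'z'],
    'amount': [10, 40, 30, 20, 5],
})

# One row per group, one column per summary statistic.
stats = sales.groupby('store')['amount'].describe()

# The median is the 50th percentile, so sort on the '50%' column.
stats_sorted = stats.sort_values('50%', ascending=False)
print(stats_sorted)
```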

In [None]:
# Sorted version



**QUESTION 1.7** Now, find the 10 most popular trips (i.e. the `from_station_id` and `to_station_id` with the 
largest number of trips from one to the other). 
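
Counting combinations of two columns can be done by grouping on both at once. The sketch below uses a made-up `flights` DataFrame with hypothetical `origin` and `dest` columns; `nlargest` is just one of several ways to pull out the top entries.

```python
import pandas as pd

# Made-up data for illustration only.
flights = pd.DataFrame({
    'origin': ['A', 'A', 'B', 'A', 'B', 'C'],
    'dest':   ['B', 'B', 'A', 'C', 'A', 'A'],
})

# size() counts the rows in each (origin, dest) group.
pair_counts = flights.groupby(['origin', 'dest']).size()

# Keep the two most frequent pairs.
top2 = pair_counts.nlargest(2)
print(top2)
```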

In [None]:
# Your code for finding the most popular trips



### Step 5 - plots
Now you'll create a few plots to help visualize this dataset. You can use pandas, matplotlib, or seaborn or any combination of them.

**QUESTION 1.8**

Create a histogram of `tripduration`. For this histogram, use 50 bins, change the bar color to blue and set the transparency level (alpha) to 0.80. Label the x-axis with 'Seconds' and the y-axis with 'Number of rides'. Title the plot with 'Histogram of Trip Duration'.
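
If you go the matplotlib route, the general shape of a histogram call is sketched below on random made-up numbers. The bin count, color, alpha, and labels here are deliberately different from what the question asks for.

```python
import matplotlib.pyplot as plt
import numpy as np

# Made-up, right-skewed values for illustration only.
rng = np.random.default_rng(0)
values = rng.exponential(scale=5.0, size=500)

fig, ax = plt.subplots()

# hist returns the bin counts, the bin edges, and the drawn bar objects.
counts, edges, patches = ax.hist(values, bins=20, color='green', alpha=0.5)

ax.set_xlabel('Value')
ax.set_ylabel('Frequency')
ax.set_title('Histogram of made-up values')
```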

In [None]:
# Put your code here to create the plot



**QUESTION 1.9** Now create a bar plot based on the counts of trips out of each station. Only include the 25 highest-volume stations and sort the bars from highest (on the left side) to lowest. HINT: You can do this pretty easily with the built-in pandas `plot` method. There's a bit of work to get things sorted and to keep only the top 25, but it's pretty straightforward.

Make sure both axes are properly labelled and the plot has an appropriate title.
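
A rough sketch of the sort, trim, and plot pattern on a made-up Series of counts follows (the labels and the top 3 cutoff are placeholders). The pandas `plot` method returns a matplotlib Axes, which you can then use to set labels and a title.

```python
import pandas as pd

# Made-up counts for illustration only.
counts = pd.Series({'a': 5, 'b': 12, 'c': 3, 'd': 9, 'e': 7})

# Sort high to low, then keep only the first few entries.
top3 = counts.sort_values(ascending=False).head(3)

# Bars are drawn in the order of the Series, so the largest lands on the left.
ax = top3.plot(kind='bar')
ax.set_xlabel('Category')
ax.set_ylabel('Count')
ax.set_title('Top 3 made-up categories')
```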

In [None]:
# Put your code here to create the plot

**QUESTION 1.10** Now create a line plot showing the number of rides per date. A good general hint for creating plots is to first figure out what Series or DataFrame would make the plot easy. Then make that first and base the plot off of it. In this case, you first need to figure out how to create a Series with the number of rides per date. Do **NOT** use the pandas `plot` command. Instead, I've given you skeleton matplotlib code below. Notice that it uses the *object oriented* style of matplotlib usage. You must use this approach. In particular:

- use things like `set_xlabel` to set the x and y axis labels. Similarly, use `set_title` to create a plot title.
- add gridlines


In [None]:
# Use this matplotlib skeleton code for the plot
fig, ax = plt.subplots()
ax.???
ax.???
...???

**ANSWER 1.10**

In [None]:
# Create a Series of trip counts indexed by date (assumes starttime was parsed as a datetime)
trips_by_date = trip_df.groupby(trip_df['starttime'].dt.date).size()
trips_by_date.index = pd.to_datetime(trips_by_date.index)
trips_by_date


In [None]:
# Plot trips per date using the object-oriented matplotlib interface
fig, ax = plt.subplots()
ax.plot(trips_by_date)
ax.set_xlabel('Date')
ax.set_ylabel('Num Trips')
ax.set_title('Number of Trips by Date')
ax.grid(True)

**HACKER EXTRA** Create faceted plots of number of rides by month and year, faceted by `from_station_name`. Seaborn can do faceted plots.

This is tricky. My general strategy is to think about what data structure might make the plot easier to create. In this case, I'd like a count of trips by station by date and then turn that into trips by station by month. What ends up being tricky is doing the resampling of the date when it's part of a `MultiIndex`. The following two resources are helpful:

https://pandas.pydata.org/pandas-docs/stable/user_guide/cookbook.html#resampling

https://stackoverflow.com/questions/15799162/resampling-within-a-pandas-multiindex
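
One approach that sidesteps resampling inside a `MultiIndex` is to group by the category and a `pd.Grouper` on the datetime column at the same time. This is a sketch on made-up data with hypothetical `when` and `site` columns, not necessarily the approach used in the links above.

```python
import pandas as pd

# Made-up events for illustration only.
events = pd.DataFrame({
    'when': pd.to_datetime([
        '2015-01-03', '2015-01-20', '2015-02-11',
        '2015-01-05', '2015-02-14', '2015-02-28',
    ]),
    'site': ['A', 'A', 'A', 'B', 'B', 'B'],
})

# pd.Grouper bins the datetime column by month ('MS' = month start)
# while 'site' is grouped as an ordinary key.
monthly = (
    events
    .groupby(['site', pd.Grouper(key='when', freq='MS')])
    .size()
    .reset_index(name='n')
)
print(monthly)
```

The result is a long ("tidy") DataFrame with one row per site and month, which is the shape that seaborn's faceting tools generally expect.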

**IMPORTANT NOTE ABOUT DELIVERABLES**
Make sure you run all of your code cells and then save your notebook before submitting.