# Pandas Essentials:  Selecting Subsets of Data, Part I

The notebook exercises below provide practice in loading and grokking your data.  We focus on three months of data from the [Bay Area Bike Share](http://www.bayareabikeshare.com/open-data) program.

In [20]:
# pandas must be imported before its display options can be set
import pandas as pd

# Set max display columns and rows (for a more compact view)
pd.options.display.max_columns = 10
pd.options.display.max_rows = 6

# Imports

In the cell below, import the pandas library.

In [21]:
import pandas as pd

# Loading Data

In the cell below, load the file "babs_trips_april_thru_june_2016.csv" into a variable called `trips_df`. This file contains all rides from April through June 2016.

In [22]:
trips_df = pd.read_csv("../data/babs_trips_april_thru_june_2016.csv")
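
For readers without the bike share file at hand, the same loading step can be sketched against a small in-memory CSV. The three rows and the `sample_df` name are hypothetical and only stand in for the real data.

```python
import io

import pandas as pd

# Hypothetical three-row sample; the values are made up for illustration.
csv_text = """Trip ID,Duration,Start Terminal,End Terminal
1,600,70,69
2,900,69,50
3,1200,70,50
"""

# read_csv accepts a file path or any file-like object, so StringIO works here.
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.shape)          # (number of rows, number of columns)
print(list(sample_df.columns))  # column labels parsed from the header row
```

Swapping the `StringIO` object for a path string gives the same call used in the cell above.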

In the cell below, peek at the first few lines of the `trips_df` dataframe.

In [23]:
trips_df.head()

|  | Trip ID | Duration | Start Date | Start Station | Start Terminal | ... | End Station | End Terminal | Bike # | Subscriber Type | Zip Code |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1145294 | 991 | 2016-04-01 00:30:00 | Embarcadero at Sansome | 60 | ... | Townsend at 7th | 65 | 547 | Subscriber | 94109 |
| 1 | 1145295 | 1164 | 2016-04-01 04:49:00 | Temporary Transbay Terminal (Howard at Beale) | 55 | ... | 2nd at Townsend | 61 | 504 | Subscriber | 95113 |
| 2 | 1145296 | 729 | 2016-04-01 05:00:00 | Market at 10th | 67 | ... | Washington at Kearny | 46 | 418 | Subscriber | 94102 |
| 3 | 1145297 | 367 | 2016-04-01 05:15:00 | Steuart at Market | 74 | ... | Embarcadero at Sansome | 60 | 443 | Subscriber | 94015 |
| 4 | 1145298 | 366 | 2016-04-01 05:17:00 | Market at 10th | 67 | ... | Townsend at 7th | 65 | 372 | Subscriber | 94102 |
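
Beyond `head()`, a few other calls help when getting a first feel for a data frame. The sketch below uses a hypothetical seven-row `toy_df` with invented values, not the bike share data.

```python
import pandas as pd

# Hypothetical toy frame; values are invented for illustration.
toy_df = pd.DataFrame({
    "Trip ID": range(1, 8),
    "Duration": [600, 900, 1200, 300, 450, 750, 1050],
})

first_five = toy_df.head()   # with no argument, head() returns the first 5 rows
first_two = toy_df.head(2)   # pass n to change the number of rows
last_two = toy_df.tail(2)    # tail() peeks at the end instead

print(first_two)
print(toy_df.shape)    # (number of rows, number of columns)
print(toy_df.dtypes)   # one dtype per column
```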


# Selecting Columns

In the cell below, write one line of code to select the Duration column via bracket notation.

In [11]:
trips_df["Duration"]

0         991
1        1164
2         729
         ... 
83534     267
83535     682
83536     658
Name: Duration, dtype: int64

In the cell below, write one line of code to select the Duration column via dot notation.

In [12]:
trips_df.Duration

0         991
1        1164
2         729
         ... 
83534     267
83535     682
83536     658
Name: Duration, dtype: int64
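
The two notations are interchangeable for a simple name like Duration, but dot notation has limits. The sketch below illustrates two of them on a hypothetical `toy_df`; the `count` column is invented purely to show a name clash with a DataFrame method.

```python
import pandas as pd

# Hypothetical toy frame; values are invented for illustration.
toy_df = pd.DataFrame({
    "Duration": [600, 900, 1200],
    "Start Terminal": [70, 69, 70],
    "count": [1, 1, 1],  # column name that collides with DataFrame.count
})

# For a simple column name, both notations return the same Series.
same = toy_df["Duration"].equals(toy_df.Duration)

# "Start Terminal" contains a space, so `toy_df.Start Terminal` would be a
# syntax error; bracket notation is the way to reach this column.
terminals = toy_df["Start Terminal"]

# `toy_df.count` resolves to the DataFrame.count method, not the column,
# so brackets are needed here as well.
count_is_method = callable(toy_df.count)
count_column = toy_df["count"]

print(same, count_is_method)
```

Because bracket notation works for every column name, it is the safer default; dot notation is a convenience for short, identifier-like names.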

In the cell below, write one line of code to determine the average duration of all trips (in minutes). Note that the Duration column stores the trip duration in seconds. [You should get approximately 13.88 minutes.]

In [15]:
trips_df.Duration.mean() / 60

13.876648072111758
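
The seconds-to-minutes arithmetic is easy to check by hand on a small example. The three durations below are hypothetical; the final line rounds the value printed above to two decimal places.

```python
import pandas as pd

# Hypothetical durations in seconds, chosen so the arithmetic is easy to follow.
durations_sec = pd.Series([600, 900, 1200], name="Duration")

mean_sec = durations_sec.mean()  # (600 + 900 + 1200) / 3 = 900.0 seconds
mean_min = mean_sec / 60         # 900.0 / 60 = 15.0 minutes

print(mean_min)

# Rounding the notebook's result to two decimal places.
print(round(13.876648072111758, 2))
```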

In the cell below, write one line of code to determine the most popular Start Terminals.

In [33]:
trips_df["Start Terminal"].value_counts()

70    6437
69    6017
50    4497
      ... 
24      11
88       8
21       2
Name: Start Terminal, dtype: int64

In the cell below, write one line of code to determine the most popular End Terminals.

In [34]:
trips_df["End Terminal"].value_counts()

70    7205
69    6473
50    4629
      ... 
24      12
89       9
21       1
Name: End Terminal, dtype: int64
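
`value_counts()` sorts its result from most to least frequent, which is why the busiest terminals appear first. A minimal sketch on a hypothetical six-trip Series shows a few related calls.

```python
import pandas as pd

# Hypothetical start terminals for six trips.
starts = pd.Series([70, 69, 70, 50, 70, 69], name="Start Terminal")

counts = starts.value_counts()   # sorted by count, descending
top_two = counts.head(2)         # the two most frequent terminals
busiest = counts.idxmax()        # label of the single most frequent terminal

# normalize=True returns fractions of the total instead of raw counts.
shares = starts.value_counts(normalize=True)

print(top_two)
print(busiest, shares.loc[70])
```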

# Selecting Rows, Single Conditions

In the cell below, select all trips that start at Terminal 70 and store these rows in a new data frame called `trips_70`.

In [35]:
trips_70 = trips_df[trips_df["Start Terminal"] == 70]

In the cell below, write one line of code to determine the number of rows in the `trips_70` data frame.  [You should get 6437]

In [39]:
trips_70.shape[0]

6437
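
The selection above works in two steps: the comparison builds a boolean Series (a mask), and indexing with that mask keeps only the rows marked True. The sketch below separates the steps on a hypothetical four-row `toy_df` and shows three equivalent ways to count the matches.

```python
import pandas as pd

# Hypothetical toy frame; values are invented for illustration.
toy_df = pd.DataFrame({
    "Start Terminal": [70, 69, 70, 50],
    "Duration": [600, 900, 1200, 300],
})

# Step 1: the comparison yields one True/False value per row.
mask = toy_df["Start Terminal"] == 70

# Step 2: indexing with the mask keeps only the True rows.
toy_70 = toy_df[mask]

n_from_shape = toy_70.shape[0]  # row count from the (rows, columns) tuple
n_from_len = len(toy_70)        # len() of a data frame is its row count
n_from_mask = int(mask.sum())   # True counts as 1, so the sum is the match count

print(n_from_shape, n_from_len, n_from_mask)
```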

In the cell below, write one line of code to determine the average Duration (in minutes) of all trips starting at Terminal 70. [You should get 12.47]

In [42]:
trips_70.Duration.mean() / 60

12.470980270312257

# Selecting Rows, Multiple Conditions

In the cell below, select all trips that start at Terminal 69 and end at Terminal 50.  Assign these records to a new data frame, called `trips_69_to_50`.

In [53]:
trips_69_to_50 = trips_df[(trips_df["Start Terminal"] == 69) & (trips_df["End Terminal"] == 50)]

In the cell below, determine the number of trips in the `trips_69_to_50` data frame.

In [56]:
trips_69_to_50.shape[0]

420

What is the average duration (in minutes) of these trips?

In [59]:
trips_69_to_50.Duration.mean() / 60

11.213412698412698
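
When combining conditions, each comparison needs its own parentheses because `&` binds more tightly than `==` in Python, and the keywords `and`/`or` raise an error on a Series. The sketch below, on a hypothetical `toy_df`, shows `&` (and), `|` (or), and `~` (not).

```python
import pandas as pd

# Hypothetical toy frame; values are invented for illustration.
toy_df = pd.DataFrame({
    "Start Terminal": [69, 69, 70, 50],
    "End Terminal": [50, 70, 50, 69],
})

# Naming each mask avoids the parentheses problem altogether.
starts_69 = toy_df["Start Terminal"] == 69
ends_50 = toy_df["End Terminal"] == 50

both = toy_df[starts_69 & ends_50]        # start at 69 AND end at 50
either = toy_df[starts_69 | ends_50]      # start at 69 OR end at 50
neither = toy_df[~(starts_69 | ends_50)]  # NOT (start at 69 OR end at 50)

print(len(both), len(either), len(neither))
```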

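# Selecting Rows, Multiple Conditions (Using `isin`)

In the cell below, select all trips that start at Terminal 50, 24, or 89. Assign these records to a new data frame called `trips_subset`.

In [62]:
trips_subset = trips_df[trips_df["Start Terminal"].isin([50, 24, 89])]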

In the cell below, determine the number of trips in the `trips_subset` data frame.

In [63]:
trips_subset.shape[0]

4519
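
An `isin` filter like the one above is shorthand for a chain of equality tests joined with `|`. The sketch below checks that equivalence on a hypothetical `toy_df` and shows how `~` inverts the selection.

```python
import pandas as pd

# Hypothetical toy frame; values are invented for illustration.
toy_df = pd.DataFrame({
    "Start Terminal": [50, 24, 89, 70, 69, 50],
    "Duration": [600, 900, 1200, 300, 450, 750],
})

wanted = [50, 24, 89]

# isin() marks rows whose value appears anywhere in the list.
via_isin = toy_df[toy_df["Start Terminal"].isin(wanted)]

# The same selection written as chained OR conditions.
via_or = toy_df[
    (toy_df["Start Terminal"] == 50)
    | (toy_df["Start Terminal"] == 24)
    | (toy_df["Start Terminal"] == 89)
]

same = via_isin.equals(via_or)

# ~ flips the mask, keeping every terminal that is not in the list.
excluded = toy_df[~toy_df["Start Terminal"].isin(wanted)]

print(same, len(via_isin), len(excluded))
```

The `isin` form stays one line long no matter how many terminals are in the list, which is the main reason to prefer it over chained comparisons.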

What is the average duration (in minutes) of these trips?

In [64]:
trips_subset.Duration.mean() / 60

18.028992402448921