# DS3 Kaggle Workshop - Advanced Practices in Pandas

Welcome to our Advanced Practices in Pandas Jupyter Notebook. With our interactive problems, we hope to guide you in your learning process. Here, you can practice useful pandas functions for DataFrame manipulation and analysis. Have fun!

The dataset we will be using is “Uber and Lyft Cab Prices” from Kaggle. Here is the link to the dataset: https://www.kaggle.com/ravi72munde/uber-lyft-cab-prices?select=weather.csv. For your convenience, we have downloaded it into the same repository as this Jupyter Notebook.

**Note:** The slideshow presentation will be published after the workshop so that you can look back at the material and review any concepts we did not have time to cover.

## Importing Libraries and the Dataset

In [149]:
import pandas as pd
import numpy as np
import time

In [150]:
cab_rides = pd.read_csv('cab_rides.csv')
cab_rides

Unnamed: 0,distance,cab_type,time_stamp,destination,source,price,surge_multiplier,id,product_id,name
0,0.44,Lyft,1544952607890,North Station,Haymarket Square,5.0,1.0,424553bb-7174-41ea-aeb4-fe06d4f4b9d7,lyft_line,Shared
1,0.44,Lyft,1543284023677,North Station,Haymarket Square,11.0,1.0,4bd23055-6827-41c6-b23b-3c491f24e74d,lyft_premier,Lux
2,0.44,Lyft,1543366822198,North Station,Haymarket Square,7.0,1.0,981a3613-77af-4620-a42a-0c0866077d1e,lyft,Lyft
3,0.44,Lyft,1543553582749,North Station,Haymarket Square,26.0,1.0,c2d88af2-d278-4bfd-a8d0-29ca77cc5512,lyft_luxsuv,Lux Black XL
4,0.44,Lyft,1543463360223,North Station,Haymarket Square,9.0,1.0,e0126e1f-8ca9-4f2e-82b3-50505a09db9a,lyft_plus,Lyft XL
...,...,...,...,...,...,...,...,...,...,...
693066,1.00,Uber,1543708385534,North End,West End,13.0,1.0,616d3611-1820-450a-9845-a9ff304a4842,6f72dfc5-27f1-42e8-84db-ccc7a75f6969,UberXL
693067,1.00,Uber,1543708385534,North End,West End,9.5,1.0,633a3fc3-1f86-4b9e-9d48-2b7132112341,55c66225-fbe7-4fd5-9072-eab1ece5e23e,UberX
693068,1.00,Uber,1543708385534,North End,West End,,1.0,64d451d0-639f-47a4-9b7c-6fd92fbd264f,8cf7e821-f0d3-49c6-8eba-e679c0ebcf6a,Taxi
693069,1.00,Uber,1543708385534,North End,West End,27.0,1.0,727e5f07-a96b-4ad1-a2c7-9abc3ad55b4e,6d318bcc-22a3-4af6-bddd-b409bfce1546,Black SUV


In [151]:
weather = pd.read_csv('weather.csv')
weather

Unnamed: 0,temp,location,clouds,pressure,rain,time_stamp,humidity,wind
0,42.42,Back Bay,1.00,1012.14,0.1228,1545003901,0.77,11.25
1,42.43,Beacon Hill,1.00,1012.15,0.1846,1545003901,0.76,11.32
2,42.50,Boston University,1.00,1012.15,0.1089,1545003901,0.76,11.07
3,42.11,Fenway,1.00,1012.13,0.0969,1545003901,0.77,11.09
4,43.13,Financial District,1.00,1012.14,0.1786,1545003901,0.75,11.49
...,...,...,...,...,...,...,...,...
6271,44.72,North Station,0.89,1000.69,,1543819974,0.96,1.52
6272,44.85,Northeastern University,0.88,1000.71,,1543819974,0.96,1.54
6273,44.82,South Station,0.89,1000.70,,1543819974,0.96,1.54
6274,44.78,Theatre District,0.89,1000.70,,1543819974,0.96,1.54


## Concatenation
[`pd.concat()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html) function: appends, or concatenates, two or more dataframes.
* Dataframes can be concatenated vertically (one atop another), which is the default (axis = 0)
* Dataframes can also be concatenated horizontally (side-by-side) (axis = 1)
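
Before trying this on the real data, here is a minimal sketch of both directions on two tiny, made-up dataframes. The names `morning` and `evening` and their values are hypothetical and are not part of the Kaggle dataset.

```python
import pandas as pd

# Two small, made-up frames with the same columns (hypothetical data)
morning = pd.DataFrame({'cab_type': ['Lyft', 'Uber'], 'price': [5.0, 9.5]})
evening = pd.DataFrame({'cab_type': ['Uber', 'Lyft'], 'price': [13.0, 7.0]})

# axis=0 (the default): stack the rows; ignore_index=True renumbers them 0..n-1
stacked = pd.concat([morning, evening], ignore_index=True)

# axis=1: place the frames side by side, pairing rows by index label
side_by_side = pd.concat([morning, evening], axis=1)

print(stacked.shape)       # (4, 2)
print(side_by_side.shape)  # (2, 4)
```

Vertical concatenation adds rows and keeps the two columns, while horizontal concatenation keeps the two rows and adds columns.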

In [152]:
cab_rides.columns

Index(['distance', 'cab_type', 'time_stamp', 'destination', 'source', 'price',
       'surge_multiplier', 'id', 'product_id', 'name'],
      dtype='object')

In [153]:
weather.columns

Index(['temp', 'location', 'clouds', 'pressure', 'rain', 'time_stamp',
       'humidity', 'wind'],
      dtype='object')

The argument to `pd.concat()` is an iterable (for example, a list) whose elements are DataFrames.

In [154]:
# axis = 1 concatenates horizontally (side-by-side)
pd.concat([cab_rides, weather], axis = 1)

Unnamed: 0,distance,cab_type,time_stamp,destination,source,price,surge_multiplier,id,product_id,name,temp,location,clouds,pressure,rain,time_stamp.1,humidity,wind
0,0.44,Lyft,1544952607890,North Station,Haymarket Square,5.0,1.0,424553bb-7174-41ea-aeb4-fe06d4f4b9d7,lyft_line,Shared,42.42,Back Bay,1.0,1012.14,0.1228,1.545004e+09,0.77,11.25
1,0.44,Lyft,1543284023677,North Station,Haymarket Square,11.0,1.0,4bd23055-6827-41c6-b23b-3c491f24e74d,lyft_premier,Lux,42.43,Beacon Hill,1.0,1012.15,0.1846,1.545004e+09,0.76,11.32
2,0.44,Lyft,1543366822198,North Station,Haymarket Square,7.0,1.0,981a3613-77af-4620-a42a-0c0866077d1e,lyft,Lyft,42.50,Boston University,1.0,1012.15,0.1089,1.545004e+09,0.76,11.07
3,0.44,Lyft,1543553582749,North Station,Haymarket Square,26.0,1.0,c2d88af2-d278-4bfd-a8d0-29ca77cc5512,lyft_luxsuv,Lux Black XL,42.11,Fenway,1.0,1012.13,0.0969,1.545004e+09,0.77,11.09
4,0.44,Lyft,1543463360223,North Station,Haymarket Square,9.0,1.0,e0126e1f-8ca9-4f2e-82b3-50505a09db9a,lyft_plus,Lyft XL,43.13,Financial District,1.0,1012.14,0.1786,1.545004e+09,0.75,11.49
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
693066,1.00,Uber,1543708385534,North End,West End,13.0,1.0,616d3611-1820-450a-9845-a9ff304a4842,6f72dfc5-27f1-42e8-84db-ccc7a75f6969,UberXL,,,,,,,,
693067,1.00,Uber,1543708385534,North End,West End,9.5,1.0,633a3fc3-1f86-4b9e-9d48-2b7132112341,55c66225-fbe7-4fd5-9072-eab1ece5e23e,UberX,,,,,,,,
693068,1.00,Uber,1543708385534,North End,West End,,1.0,64d451d0-639f-47a4-9b7c-6fd92fbd264f,8cf7e821-f0d3-49c6-8eba-e679c0ebcf6a,Taxi,,,,,,,,
693069,1.00,Uber,1543708385534,North End,West End,27.0,1.0,727e5f07-a96b-4ad1-a2c7-9abc3ad55b4e,6d318bcc-22a3-4af6-bddd-b409bfce1546,Black SUV,,,,,,,,


We need to do a little bit of work before `pd.concat()` becomes useful. Simply putting one table atop another, or side-by-side, does not help us analyze our data. More specifically, the previous cell runs, but it only places each row of `cab_rides` next to the row of `weather` that has the same index, so the location and time of a ride do not correspond to the weather reading beside it. There is a mismatch!
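
To make the mismatch concrete, here is a small sketch on made-up dataframes (the names `rides` and `temps` and their values are hypothetical). It shows that horizontal concatenation pairs rows purely by index label and pads the shorter dataframe with NaN.

```python
import pandas as pd

# Made-up frames of different lengths (hypothetical data)
rides = pd.DataFrame({'source': ['Fenway', 'Back Bay', 'North End']})
temps = pd.DataFrame({'location': ['Back Bay', 'Fenway'], 'temp': [42.4, 42.1]})

# Rows are paired by index label (0 with 0, 1 with 1), not by location
combined = pd.concat([rides, temps], axis=1)
print(combined)

# Row 0 pairs a Fenway ride with the Back Bay reading,
# and row 2 has no weather reading at all, so it is filled with NaN
print(combined.loc[0, 'source'], '|', combined.loc[0, 'location'])
print(combined.loc[2, 'temp'])
```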

Does it make more sense to concatenate vertically or horizontally with these datasets?
* Vertically (axis = 0)
* Horizontally (axis = 1)

To concatenate the datasets in a meaningful way, the corresponding rows of the two dataframes need to have something in common. Moreover, the number of rows should be equal; otherwise, the shorter dataframe is padded with NaN values after concatenating, as in the output above. Which columns from the two datasets would it make sense to line up?

In [155]:
cab_rides.columns

Index(['distance', 'cab_type', 'time_stamp', 'destination', 'source', 'price',
       'surge_multiplier', 'id', 'product_id', 'name'],
      dtype='object')

In [156]:
weather.columns

Index(['temp', 'location', 'clouds', 'pressure', 'rain', 'time_stamp',
       'humidity', 'wind'],
      dtype='object')

In [157]:
def convert_unix_epoch_to_EST(epoch_time_sec):
    epoch_time_sec = epoch_time_sec - 5 * 60 * 60   # subtract 5 hours because EST is GMT/UTC -5
    return time.strftime('%Y-%m-%d %H:%M:%S', time.gmtime(epoch_time_sec))

In [158]:
cab_rides['time_stamp'] = cab_rides['time_stamp'] / 1000.0 # cab_rides timestamps were in ms, so convert to sec first
cab_rides['time_stamp'] = cab_rides['time_stamp'].apply(convert_unix_epoch_to_EST) # apply function along column
weather['time_stamp'] = weather['time_stamp'].apply(convert_unix_epoch_to_EST) # weather timestamps already in sec
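
The same conversion can also be sketched with pandas' vectorized datetime tools instead of `.apply()`. This is an alternative, not what the cell above does. The names `ride_ms`, `weather_s`, and `est_offset` are illustrative, and the sample values are raw timestamps taken from the tables shown earlier. Like the function above, the sketch assumes a fixed UTC-5 offset and ignores daylight saving time.

```python
import pandas as pd

# Raw timestamps copied from the first rows of cab_rides (ms) and weather (s)
ride_ms = pd.Series([1544952607890, 1543284023677])
weather_s = pd.Series([1545003901])

# Fixed offset: EST is UTC-5 (assumption: no daylight saving adjustment)
est_offset = pd.Timedelta(hours=5)

# unit= tells pandas whether the integers are milliseconds or seconds
ride_est = (pd.to_datetime(ride_ms, unit='ms') - est_offset).dt.strftime('%Y-%m-%d %H:%M:%S')
weather_est = (pd.to_datetime(weather_s, unit='s') - est_offset).dt.strftime('%Y-%m-%d %H:%M:%S')

print(ride_est.tolist())     # ['2018-12-16 04:30:07', '2018-11-26 21:00:23']
print(weather_est.tolist())  # ['2018-12-16 18:45:01']
```

On a column with hundreds of thousands of rows, a vectorized call like this is usually faster than applying a Python function to each value.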

In [159]:
cab_rides.head(3)

Unnamed: 0,distance,cab_type,time_stamp,destination,source,price,surge_multiplier,id,product_id,name
0,0.44,Lyft,2018-12-16 04:30:07,North Station,Haymarket Square,5.0,1.0,424553bb-7174-41ea-aeb4-fe06d4f4b9d7,lyft_line,Shared
1,0.44,Lyft,2018-11-26 21:00:23,North Station,Haymarket Square,11.0,1.0,4bd23055-6827-41c6-b23b-3c491f24e74d,lyft_premier,Lux
2,0.44,Lyft,2018-11-27 20:00:22,North Station,Haymarket Square,7.0,1.0,981a3613-77af-4620-a42a-0c0866077d1e,lyft,Lyft


In [160]:
weather.head(3)

Unnamed: 0,temp,location,clouds,pressure,rain,time_stamp,humidity,wind
0,42.42,Back Bay,1.0,1012.14,0.1228,2018-12-16 18:45:01,0.77,11.25
1,42.43,Beacon Hill,1.0,1012.15,0.1846,2018-12-16 18:45:01,0.76,11.32
2,42.5,Boston University,1.0,1012.15,0.1089,2018-12-16 18:45:01,0.76,11.07


In [164]:
cab_rides.get('time_stamp').min()

'2018-11-25 22:40:46'

In [165]:
cab_rides.get('time_stamp').max()

'2018-12-18 14:15:10'

In [169]:
len(cab_rides.get('time_stamp'))

693071

In [166]:
weather.get('time_stamp').min()

'2018-11-25 22:40:44'

In [167]:
weather.get('time_stamp').max()

'2018-12-18 13:45:02'

In [170]:
len(weather.get('time_stamp'))

6276

We can foresee a problem if we choose to concatenate the dataframes. The cab_rides and weather timestamps will not match, and this will hinder our analysis. We can try concatenating and observe this.

In [198]:
cab_rides = cab_rides.sort_values(by = 'time_stamp').reset_index(drop = True)
cab_rides.head()

Unnamed: 0,distance,cab_type,time_stamp,destination,source,price,surge_multiplier,id,product_id,name
0,0.56,Uber,2018-11-25 22:40:46,Haymarket Square,North Station,7.0,1.0,a7b50600-c6c5-4e6c-bea9-4487344196d4,55c66225-fbe7-4fd5-9072-eab1ece5e23e,UberX
1,1.57,Uber,2018-11-25 22:40:46,North End,Theatre District,,1.0,9962f244-8fce-4ae9-a583-139d5d7522e1,8cf7e821-f0d3-49c6-8eba-e679c0ebcf6a,Taxi
2,1.23,Lyft,2018-11-25 22:40:46,West End,North End,7.0,1.0,4aa68a5d-abc0-4fdf-a47f-0003617afbae,lyft,Lyft
3,1.23,Lyft,2018-11-25 22:40:46,West End,North End,5.0,1.0,89f35ef7-7129-483d-b3e6-d89afdf6946d,lyft_line,Shared
4,1.23,Lyft,2018-11-25 22:40:46,West End,North End,13.5,1.0,9e6a67e6-9628-4fb1-94e5-bf426f61b038,lyft_premier,Lux


In [199]:
weather = weather.sort_values(by = 'time_stamp').reset_index(drop = True)
weather.head()

Unnamed: 0,temp,location,clouds,pressure,rain,time_stamp,humidity,wind
0,40.84,West End,0.87,1014.4,,2018-11-25 22:40:44,0.93,1.52
1,40.98,Haymarket Square,0.87,1014.4,,2018-11-25 22:40:44,0.92,1.57
2,40.86,South Station,0.87,1014.39,,2018-11-25 22:40:44,0.93,1.6
3,40.81,Northeastern University,0.89,1014.35,,2018-11-25 22:40:44,0.93,1.36
4,41.04,Back Bay,0.87,1014.39,,2018-11-25 22:40:45,0.92,1.46


In [221]:
pd.concat([cab_rides, weather], axis = 1).head()

Unnamed: 0,distance,cab_type,time_stamp,destination,source,price,surge_multiplier,id,product_id,name,temp,location,clouds,pressure,rain,time_stamp.1,humidity,wind
0,0.56,Uber,2018-11-25 22:40:46,Haymarket Square,North Station,7.0,1.0,a7b50600-c6c5-4e6c-bea9-4487344196d4,55c66225-fbe7-4fd5-9072-eab1ece5e23e,UberX,40.84,West End,0.87,1014.4,,2018-11-25 22:40:44,0.93,1.52
1,1.57,Uber,2018-11-25 22:40:46,North End,Theatre District,,1.0,9962f244-8fce-4ae9-a583-139d5d7522e1,8cf7e821-f0d3-49c6-8eba-e679c0ebcf6a,Taxi,40.98,Haymarket Square,0.87,1014.4,,2018-11-25 22:40:44,0.92,1.57
2,1.23,Lyft,2018-11-25 22:40:46,West End,North End,7.0,1.0,4aa68a5d-abc0-4fdf-a47f-0003617afbae,lyft,Lyft,40.86,South Station,0.87,1014.39,,2018-11-25 22:40:44,0.93,1.6
3,1.23,Lyft,2018-11-25 22:40:46,West End,North End,5.0,1.0,89f35ef7-7129-483d-b3e6-d89afdf6946d,lyft_line,Shared,40.81,Northeastern University,0.89,1014.35,,2018-11-25 22:40:44,0.93,1.36
4,1.23,Lyft,2018-11-25 22:40:46,West End,North End,13.5,1.0,9e6a67e6-9628-4fb1-94e5-bf426f61b038,lyft_premier,Lux,41.04,Back Bay,0.87,1014.39,,2018-11-25 22:40:45,0.92,1.46


This is the problem we predicted earlier: the `time_stamp` values for `cab_rides` and `weather` do not match row by row, and the locations do not match either! Let's try another approach, since `pd.concat()` is not effective here.

## Merging Dataframes
[`pd.merge()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html) method: combines two dataframes by matching rows whose values agree in one or more key columns.
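
As a minimal sketch on made-up data, the cell below pairs each ride with the weather reading for its pickup location. The names `rides` and `temps` and their values are hypothetical. Because the key column has a different name in each dataframe, we pass `left_on` and `right_on` rather than `on`.

```python
import pandas as pd

# Made-up rides and weather readings (hypothetical data)
rides = pd.DataFrame({
    'source': ['Fenway', 'Back Bay', 'Fenway'],
    'price': [11.0, 10.5, 7.0],
})
temps = pd.DataFrame({
    'location': ['Fenway', 'Back Bay'],
    'temp': [42.11, 42.42],
})

# Match rows where rides['source'] equals temps['location'] (inner join by default)
merged = pd.merge(rides, temps, left_on='source', right_on='location')
print(merged)
```

Unlike `pd.concat()`, the row order of `temps` does not matter here: every ride is matched to the reading with the same location value.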

In [215]:
pd.merge(cab_rides, weather, on = ['time_stamp']).head()

Unnamed: 0,distance,cab_type,time_stamp,destination,source,price,surge_multiplier,id,product_id,name,temp,location,clouds,pressure,rain,humidity,wind
0,0.56,Uber,2018-11-25 22:40:46,Haymarket Square,North Station,7.0,1.0,a7b50600-c6c5-4e6c-bea9-4487344196d4,55c66225-fbe7-4fd5-9072-eab1ece5e23e,UberX,40.84,Fenway,0.88,1014.35,,0.93,1.31
1,0.56,Uber,2018-11-25 22:40:46,Haymarket Square,North Station,7.0,1.0,a7b50600-c6c5-4e6c-bea9-4487344196d4,55c66225-fbe7-4fd5-9072-eab1ece5e23e,UberX,41.02,North End,0.87,1014.4,,0.92,1.59
2,1.57,Uber,2018-11-25 22:40:46,North End,Theatre District,,1.0,9962f244-8fce-4ae9-a583-139d5d7522e1,8cf7e821-f0d3-49c6-8eba-e679c0ebcf6a,Taxi,40.84,Fenway,0.88,1014.35,,0.93,1.31
3,1.57,Uber,2018-11-25 22:40:46,North End,Theatre District,,1.0,9962f244-8fce-4ae9-a583-139d5d7522e1,8cf7e821-f0d3-49c6-8eba-e679c0ebcf6a,Taxi,41.02,North End,0.87,1014.4,,0.92,1.59
4,1.23,Lyft,2018-11-25 22:40:46,West End,North End,7.0,1.0,4aa68a5d-abc0-4fdf-a47f-0003617afbae,lyft,Lyft,40.84,Fenway,0.88,1014.35,,0.93,1.31
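
Notice that the same ride appears more than once in the output above, paired with weather readings from different locations. Merging on `time_stamp` alone matches a ride with every weather reading that shares its timestamp. A small sketch on made-up data (hypothetical names and values) shows how adding the location to the key removes these unwanted pairs:

```python
import pandas as pd

# Two rides and two weather readings that all share one timestamp (made-up data)
rides = pd.DataFrame({
    'time_stamp': ['22:40:46', '22:40:46'],
    'source': ['North Station', 'North End'],
})
readings = pd.DataFrame({
    'time_stamp': ['22:40:46', '22:40:46'],
    'location': ['Fenway', 'North End'],
})

# Key = time only: every ride pairs with every reading (2 x 2 = 4 rows)
by_time = pd.merge(rides, readings, on='time_stamp')

# Key = time and place: only the North End ride finds a matching reading
by_time_and_place = pd.merge(
    rides, readings,
    left_on=['time_stamp', 'source'],
    right_on=['time_stamp', 'location'],
)

print(len(by_time), len(by_time_and_place))  # 4 1
```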


In [223]:
rideshare_and_weather = pd.merge(cab_rides, weather, left_on = ['time_stamp', 'source'], right_on = ['time_stamp', 'location']) # keep the full merged dataframe
rideshare_and_weather.head() # display only the first five rows

Unnamed: 0,distance,cab_type,time_stamp,destination,source,price,surge_multiplier,id,product_id,name,temp,location,clouds,pressure,rain,humidity,wind
0,1.23,Lyft,2018-11-25 22:40:46,West End,North End,7.0,1.0,4aa68a5d-abc0-4fdf-a47f-0003617afbae,lyft,Lyft,41.02,North End,0.87,1014.4,,0.92,1.59
1,1.23,Lyft,2018-11-25 22:40:46,West End,North End,5.0,1.0,89f35ef7-7129-483d-b3e6-d89afdf6946d,lyft_line,Shared,41.02,North End,0.87,1014.4,,0.92,1.59
2,1.23,Lyft,2018-11-25 22:40:46,West End,North End,13.5,1.0,9e6a67e6-9628-4fb1-94e5-bf426f61b038,lyft_premier,Lux,41.02,North End,0.87,1014.4,,0.92,1.59
3,1.23,Lyft,2018-11-25 22:40:46,West End,North End,19.5,1.0,a8b37ec2-b380-47da-8269-590dfaaffdbf,lyft_lux,Lux Black,41.02,North End,0.87,1014.4,,0.92,1.59
4,2.96,Lyft,2018-11-25 22:40:46,Theatre District,Fenway,11.0,1.0,52d51d09-725b-4f57-b866-382c152cdb92,lyft,Lyft,40.84,Fenway,0.88,1014.35,,0.93,1.31


**TODO**: explain the types of joins

In [219]:
pd.merge(cab_rides, weather, left_on = ['time_stamp', 'source'], right_on = ['time_stamp', 'location'], how = 'outer', indicator = True).head()

Unnamed: 0,distance,cab_type,time_stamp,destination,source,price,surge_multiplier,id,product_id,name,temp,location,clouds,pressure,rain,humidity,wind,_merge
0,0.56,Uber,2018-11-25 22:40:46,Haymarket Square,North Station,7.0,1.0,a7b50600-c6c5-4e6c-bea9-4487344196d4,55c66225-fbe7-4fd5-9072-eab1ece5e23e,UberX,,,,,,,,left_only
1,2.01,Lyft,2018-11-25 22:40:46,South Station,North Station,16.5,1.0,65e9134a-dff4-45b9-81e3-90ba4cae702f,lyft_premier,Lux,,,,,,,,left_only
2,3.05,Uber,2018-11-25 22:40:46,Fenway,North Station,10.5,1.0,f67b0a6b-08f9-43bb-b47d-efad7310d4c7,9a0e7b09-b92b-4c41-9779-2ad22b4d779d,WAV,,,,,,,,left_only
3,1.57,Uber,2018-11-25 22:40:46,North End,Theatre District,,1.0,9962f244-8fce-4ae9-a583-139d5d7522e1,8cf7e821-f0d3-49c6-8eba-e679c0ebcf6a,Taxi,,,,,,,,left_only
4,2.66,Uber,2018-11-25 22:40:46,Fenway,Theatre District,16.0,1.0,99f3cf40-809b-4962-9d9e-acaea3afc9d6,6f72dfc5-27f1-42e8-84db-ccc7a75f6969,UberXL,,,,,,,,left_only
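
The `how` argument controls which rows survive a merge, and `indicator = True` adds a `_merge` column recording where each row came from. The sketch below, on made-up data with hypothetical names, counts the rows produced by each of the four common join types.

```python
import pandas as pd

# One location appears in both frames, one only on the left, one only on the right
rides = pd.DataFrame({'source': ['Fenway', 'North Station'], 'price': [11.0, 7.0]})
temps = pd.DataFrame({'location': ['Fenway', 'Back Bay'], 'temp': [40.84, 41.04]})

row_counts = {}
for how in ['inner', 'left', 'right', 'outer']:
    result = pd.merge(rides, temps, left_on='source', right_on='location', how=how)
    row_counts[how] = len(result)

print(row_counts)  # {'inner': 1, 'left': 2, 'right': 2, 'outer': 3}

# indicator=True labels each row as 'both', 'left_only', or 'right_only'
outer = pd.merge(rides, temps, left_on='source', right_on='location',
                 how='outer', indicator=True)
print(outer['_merge'].value_counts())
```

An inner join keeps only matched rows, a left or right join keeps every row of that side, and an outer join keeps everything, filling the missing side with NaN.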


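In general, `pd.concat()` is more convenient than `pd.merge()` whenever you want to consolidate more than two dataframes, because it accepts a list of any number of dataframes. `pd.merge()` combines only two dataframes per call, so to include a third you must merge it into the result of the first merge, which often gets complicated. Keep in mind that the two are not interchangeable: `pd.concat()` lines rows up by index, while `pd.merge()` matches rows on key columns.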
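
If you do need to merge several dataframes on the same key, one common pattern is to chain the merges with `functools.reduce`. This is a sketch on made-up data; the frames `temp`, `wind`, and `rain` are hypothetical.

```python
from functools import reduce

import pandas as pd

# Three made-up frames that share a 'location' key (hypothetical data)
temp = pd.DataFrame({'location': ['Fenway', 'Back Bay'], 'temp': [42.11, 42.42]})
wind = pd.DataFrame({'location': ['Fenway', 'Back Bay'], 'wind': [11.09, 11.25]})
rain = pd.DataFrame({'location': ['Back Bay', 'Fenway'], 'rain': [0.1228, 0.0969]})

# reduce merges the first two frames, then merges the result with the third
combined = reduce(
    lambda left, right: pd.merge(left, right, on='location'),
    [temp, wind, rain],
)
print(combined)
```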

**TODO**: explain the difference between merge and join
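
As a starting point for this comparison, here is a hedged sketch on made-up data (hypothetical names and values). `DataFrame.join()` matches against the other dataframe's index and defaults to a left join, while `pd.merge()` matches on columns and defaults to an inner join. The two calls below are written to produce the same pairing.

```python
import pandas as pd

# Rides keep 'source' as a column; temps use the location as their index
rides = pd.DataFrame({'source': ['Fenway', 'Back Bay'], 'price': [11.0, 10.5]})
temps = pd.DataFrame({'temp': [42.11, 42.42]}, index=['Fenway', 'Back Bay'])

# join: match rides['source'] against temps' index (left join by default)
via_join = rides.join(temps, on='source')

# merge: the same pairing, spelled out explicitly
via_merge = pd.merge(rides, temps, left_on='source', right_index=True, how='left')

print(via_join)
print(via_merge)
```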

**TODO**: for Blake: it would be interesting to have a practice problem for the students using the merged dataframe I made. A problem along the lines of:
- How does the **number** of rides seem to be affected by the type of weather? (something related to a groupby count)
- Is there a relation between the surge multiplier of the ride and the weather quality?

## Transpose

In [4]:
ten_cabs = cab_rides.head(10)
ten_cabs

Unnamed: 0,distance,cab_type,time_stamp,destination,source,price,surge_multiplier,id,product_id,name
0,0.44,Lyft,1544952607890,North Station,Haymarket Square,5.0,1.0,424553bb-7174-41ea-aeb4-fe06d4f4b9d7,lyft_line,Shared
1,0.44,Lyft,1543284023677,North Station,Haymarket Square,11.0,1.0,4bd23055-6827-41c6-b23b-3c491f24e74d,lyft_premier,Lux
2,0.44,Lyft,1543366822198,North Station,Haymarket Square,7.0,1.0,981a3613-77af-4620-a42a-0c0866077d1e,lyft,Lyft
3,0.44,Lyft,1543553582749,North Station,Haymarket Square,26.0,1.0,c2d88af2-d278-4bfd-a8d0-29ca77cc5512,lyft_luxsuv,Lux Black XL
4,0.44,Lyft,1543463360223,North Station,Haymarket Square,9.0,1.0,e0126e1f-8ca9-4f2e-82b3-50505a09db9a,lyft_plus,Lyft XL
5,0.44,Lyft,1545071112138,North Station,Haymarket Square,16.5,1.0,f6f6d7e4-3e18-4922-a5f5-181cdd3fa6f2,lyft_lux,Lux Black
6,1.08,Lyft,1543208580200,Northeastern University,Back Bay,10.5,1.0,462816a3-820d-408b-8549-0b39e82f65ac,lyft_plus,Lyft XL
7,1.08,Lyft,1543780384677,Northeastern University,Back Bay,16.5,1.0,474d6376-bc59-4ec9-bf57-4e6d6faeb165,lyft_lux,Lux Black
8,1.08,Lyft,1543818482645,Northeastern University,Back Bay,3.0,1.0,4f9fee41-fde3-4767-bbf1-a00e108701fb,lyft_line,Shared
9,1.08,Lyft,1543315522249,Northeastern University,Back Bay,27.5,1.0,8612d909-98b8-4454-a093-30bd48de0cb3,lyft_luxsuv,Lux Black XL


In [5]:
ten_cabs_transpose = ten_cabs.transpose() # same as .T
ten_cabs_transpose

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
distance,0.44,0.44,0.44,0.44,0.44,0.44,1.08,1.08,1.08,1.08
cab_type,Lyft,Lyft,Lyft,Lyft,Lyft,Lyft,Lyft,Lyft,Lyft,Lyft
time_stamp,1544952607890,1543284023677,1543366822198,1543553582749,1543463360223,1545071112138,1543208580200,1543780384677,1543818482645,1543315522249
destination,North Station,North Station,North Station,North Station,North Station,North Station,Northeastern University,Northeastern University,Northeastern University,Northeastern University
source,Haymarket Square,Haymarket Square,Haymarket Square,Haymarket Square,Haymarket Square,Haymarket Square,Back Bay,Back Bay,Back Bay,Back Bay
price,5,11,7,26,9,16.5,10.5,16.5,3,27.5
surge_multiplier,1,1,1,1,1,1,1,1,1,1
id,424553bb-7174-41ea-aeb4-fe06d4f4b9d7,4bd23055-6827-41c6-b23b-3c491f24e74d,981a3613-77af-4620-a42a-0c0866077d1e,c2d88af2-d278-4bfd-a8d0-29ca77cc5512,e0126e1f-8ca9-4f2e-82b3-50505a09db9a,f6f6d7e4-3e18-4922-a5f5-181cdd3fa6f2,462816a3-820d-408b-8549-0b39e82f65ac,474d6376-bc59-4ec9-bf57-4e6d6faeb165,4f9fee41-fde3-4767-bbf1-a00e108701fb,8612d909-98b8-4454-a093-30bd48de0cb3
product_id,lyft_line,lyft_premier,lyft,lyft_luxsuv,lyft_plus,lyft_lux,lyft_plus,lyft_lux,lyft_line,lyft_luxsuv
name,Shared,Lux,Lyft,Lux Black XL,Lyft XL,Lux Black,Lyft XL,Lux Black,Shared,Lux Black XL


In [6]:
weather_transpose = weather.T # same as .transpose()
weather_transpose

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,6266,6267,6268,6269,6270,6271,6272,6273,6274,6275
temp,42.42,42.43,42.5,42.11,43.13,42.34,42.36,42.21,42.07,43.05,...,44.85,44.83,44.8,44.75,44.77,44.72,44.85,44.82,44.78,44.69
location,Back Bay,Beacon Hill,Boston University,Fenway,Financial District,Haymarket Square,North End,North Station,Northeastern University,South Station,...,Boston University,Fenway,Financial District,Haymarket Square,North End,North Station,Northeastern University,South Station,Theatre District,West End
clouds,1,1,1,1,1,1,1,1,1,1,...,0.89,0.88,0.89,0.89,0.89,0.89,0.88,0.89,0.89,0.89
pressure,1012.14,1012.15,1012.15,1012.13,1012.14,1012.15,1012.15,1012.16,1012.12,1012.12,...,1000.7,1000.7,1000.7,1000.69,1000.69,1000.69,1000.71,1000.7,1000.7,1000.7
rain,0.1228,0.1846,0.1089,0.0969,0.1786,0.2068,0.2088,0.2069,0.102,0.1547,...,,,,,,,,,,
time_stamp,1545003901,1545003901,1545003901,1545003901,1545003901,1545003901,1545003901,1545003901,1545003901,1545003901,...,1543819974,1543819974,1543819974,1543819974,1543819974,1543819974,1543819974,1543819974,1543819974,1543819974
humidity,0.77,0.76,0.76,0.77,0.75,0.77,0.77,0.77,0.78,0.75,...,0.95,0.96,0.96,0.96,0.96,0.96,0.96,0.96,0.96,0.96
wind,11.25,11.32,11.07,11.09,11.49,11.49,11.46,11.37,11.28,11.58,...,1.52,1.53,1.53,1.53,1.53,1.52,1.54,1.54,1.54,1.52
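
To summarize what the cells above show, here is a tiny sketch on a made-up dataframe (the name `small` and its values are hypothetical). Transposing swaps the row labels with the column labels, so the shape flips from (rows, columns) to (columns, rows).

```python
import pandas as pd

# Three rows and two columns of made-up data
small = pd.DataFrame({
    'distance': [0.44, 1.08, 1.00],
    'cab_type': ['Lyft', 'Lyft', 'Uber'],
})

flipped = small.T  # same as small.transpose()

print(small.shape)          # (3, 2)
print(flipped.shape)        # (2, 3)
print(list(flipped.index))  # ['distance', 'cab_type']
```

After transposing, each new column holds one original row, which mixes numbers and text. That is why numeric values such as prices may be displayed differently in the transposed output.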


In [7]:
# practice problem on merged or joined dataset

## Groupby

In [8]:
# total distance traveled (in miles), grouped by cab_type
sum_distance = cab_rides.groupby('cab_type')[['distance']].sum()
sum_distance

cab_type,distance
Lyft,672293.79
Uber,845136.48


In [9]:
# total amount spent, grouped by cab_type
sum_price = cab_rides.groupby('cab_type')[['price']].sum()
sum_price

cab_type,price
Lyft,5333957.98
Uber,5221435.0


In [10]:
# average temperature, grouped by location
weather.groupby('location')[['temp']].mean()

location,temp
Back Bay,39.082122
Beacon Hill,39.047285
Boston University,39.047744
Fenway,38.964379
Financial District,39.410822
Haymarket Square,39.067897
North End,39.090841
North Station,39.035315
Northeastern University,38.975086
South Station,39.394092


In [11]:
# average of every numeric column, grouped by location
weather.groupby('location').mean(numeric_only = True) # numeric_only skips any text columns

location,temp,clouds,pressure,rain,time_stamp,humidity,wind
Back Bay,39.082122,0.678432,1008.44782,0.056012,1543857000.0,0.764073,6.778528
Beacon Hill,39.047285,0.677801,1008.448356,0.057097,1543857000.0,0.765048,6.810325
Boston University,39.047744,0.679235,1008.459254,0.054688,1543857000.0,0.763786,6.69218
Fenway,38.964379,0.679866,1008.453289,0.054863,1543857000.0,0.767266,6.711721
Financial District,39.410822,0.67673,1008.435793,0.061352,1543857000.0,0.754837,6.860019
Haymarket Square,39.067897,0.676711,1008.445239,0.059593,1543857000.0,0.764837,6.843193
North End,39.090841,0.67673,1008.441912,0.058712,1543857000.0,0.764054,6.853117
North Station,39.035315,0.676998,1008.442811,0.056542,1543857000.0,0.765545,6.835755
Northeastern University,38.975086,0.678317,1008.444168,0.054197,1543857000.0,0.767648,6.749426
South Station,39.394092,0.677495,1008.438031,0.059537,1543857000.0,0.755468,6.848948
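
Besides `.sum()` and `.mean()`, counting rows per group is a common groupby task. The sketch below uses a made-up dataframe (hypothetical values) that includes one missing price, like the Taxi rows in `cab_rides`. It contrasts `.size()`, which counts every row, with `.count()`, which skips NaN values.

```python
import pandas as pd

# Made-up rides; one Uber ride has no price (hypothetical data)
rides = pd.DataFrame({
    'cab_type': ['Lyft', 'Lyft', 'Uber', 'Uber', 'Uber'],
    'distance': [0.44, 1.08, 1.00, 1.57, 0.56],
    'price': [5.0, 10.5, 13.0, None, 7.0],
})

ride_counts = rides.groupby('cab_type').size()              # all rows per group
priced_counts = rides.groupby('cab_type')['price'].count()  # non-missing prices only
total_distance = rides.groupby('cab_type')['distance'].sum()

print(ride_counts['Uber'], priced_counts['Uber'])  # 3 2
```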


In [12]:
# practice problem on merged or joined dataset

## Aggregate
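
As a minimal sketch on made-up data (hypothetical values, not the workshop's final material), `.agg()` lets us compute a different summary for each column in a single groupby. The output column names `mean_temp` and `max_wind` are our own choices.

```python
import pandas as pd

# Made-up weather readings (hypothetical data)
readings = pd.DataFrame({
    'location': ['Fenway', 'Fenway', 'Back Bay', 'Back Bay'],
    'temp': [42.11, 40.84, 42.42, 41.04],
    'wind': [11.09, 1.31, 11.25, 1.46],
})

# Named aggregation: new_column = (source_column, function)
summary = readings.groupby('location').agg(
    mean_temp=('temp', 'mean'),
    max_wind=('wind', 'max'),
)
print(summary)
```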

## Apply
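
We already used `.apply()` on a single column when converting timestamps. As a hedged sketch on made-up data, `.apply()` with `axis = 1` instead passes each row to the function, so the function can combine several columns. The helper `price_per_mile` is hypothetical.

```python
import pandas as pd

def price_per_mile(row):
    """Return the price of one ride divided by its distance."""
    return row['price'] / row['distance']

# Made-up rides (hypothetical data)
rides = pd.DataFrame({'distance': [0.5, 2.0], 'price': [5.0, 13.0]})

# axis=1 calls the function once per row
rides['price_per_mile'] = rides.apply(price_per_mile, axis=1)
print(rides['price_per_mile'].tolist())  # [10.0, 6.5]
```

For simple arithmetic like this, `rides['price'] / rides['distance']` gives the same result and is usually faster; `.apply()` is most useful when the logic cannot be written as a column operation.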

## Lambda
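
A lambda is a small, unnamed function written inline, which pairs naturally with `.apply()` when the logic fits on one line. This is a minimal sketch on made-up data; the column name `is_surge` is our own choice.

```python
import pandas as pd

# Made-up surge multipliers (hypothetical data)
rides = pd.DataFrame({'surge_multiplier': [1.0, 1.5, 2.0]})

# The lambda receives one value at a time and returns True for surge pricing
rides['is_surge'] = rides['surge_multiplier'].apply(lambda m: m > 1.0)
print(rides['is_surge'].tolist())  # [False, True, True]
```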