# Uber Movement

### Ingest Data from Uber Dataset

In [5]:
dataset_id = "uber_staging"
!bq --location=US mk --dataset {dataset_id}  #Note: This will not work if you already have a dataset with this name

BigQuery error in mk operation: Dataset 'responsive-cab-267123:uber_staging'
already exists.
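
The error above is what `bq mk` reports when the dataset is already there, so re-running the cell fails. One way to make this step repeatable is to create the dataset from Python instead. The sketch below assumes the `google-cloud-bigquery` client library is installed and authenticated. `ensure_dataset` is a hypothetical helper name; the only library call it relies on is `Client.create_dataset` with `exists_ok=True`.

```python
def ensure_dataset(client, dataset_id):
    """Create `dataset_id` if it is missing; leave it alone if it exists.

    `client` is expected to be a google.cloud.bigquery.Client. With
    exists_ok=True, create_dataset does not raise when the dataset is
    already present.
    """
    return client.create_dataset(dataset_id, exists_ok=True)


# Usage in the notebook (assumes credentials and the library are available):
#   from google.cloud import bigquery
#   client = bigquery.Client(location="US")
#   ensure_dataset(client, "uber_staging")
```

With this in place the cell can be run any number of times without the "already exists" error.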


### Creating and Displaying Tables

In [6]:
#Note: the "bmease" bucket name contains no hyphens. Cloud Storage bucket names may include hyphens; it is BigQuery dataset names that are limited to letters, digits, and underscores.
#Format: !bq --location=US load --autodetect --skip_leading_rows=1 --source_format=CSV {dataset_id}.<tableName> 'gs://<bucketName>/<dataSourceFileName>'
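
The four quarterly loads in this notebook all follow the format noted above and differ only in year and quarter. As a rough sketch, the commands could be generated in a loop rather than typed out one by one. The helper name `build_load_command` and the constants are illustrative assumptions; the table and file naming pattern is taken from the cells in this notebook.

```python
import shlex

DATASET_ID = "uber_staging"
BUCKET = "bmease"
# (year, quarter) pairs for the four files loaded in this notebook.
QUARTERS = [(2018, 1), (2018, 3), (2019, 1), (2019, 3)]


def build_load_command(dataset_id, bucket, year, quarter):
    """Return the bq load command for one quarter as an argument list."""
    table = f"{dataset_id}.Quarter{quarter}_{year}"
    uri = f"gs://{bucket}/travel_times_LA_{year}_quarter{quarter}.csv"
    return [
        "bq", "--location=US", "load",
        "--autodetect", "--skip_leading_rows=1", "--source_format=CSV",
        table, uri,
    ]


commands = [build_load_command(DATASET_ID, BUCKET, y, q) for y, q in QUARTERS]
for cmd in commands:
    # Print only; pass cmd to subprocess.run(cmd, check=True) to execute it.
    print(shlex.join(cmd))
```

Printing the commands first makes it easy to check the table names and URIs before any load job is started.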

#### Quarter 1 of 2018

In [10]:
!bq --location=US load --autodetect --skip_leading_rows=1 --source_format=CSV {dataset_id}.Quarter1_2018 'gs://bmease/travel_times_LA_2018_quarter1.csv'

Waiting on bqjob_r4473cf5ede977e8a_000001702c5ca3dc_1 ... (48s) Current status: DONE   


In [25]:
%%bigquery
select * from uber_staging.Quarter1_2018 limit 5


Unnamed: 0,sourceid,dstid,month,mean_travel_time,standard_deviation_travel_time,geometric_mean_travel_time,geometric_standard_deviation_travel_time
0,1157,1235,1,172.14,139.66,140.33,1.84
1,1142,1385,1,594.16,346.87,517.94,1.67
2,1080,1379,1,952.01,522.56,836.84,1.64
3,1095,1229,1,1287.43,651.52,1142.63,1.63
4,1165,1155,1,470.87,339.44,397.53,1.74


#### Quarter 3 of 2018

In [11]:
!bq --location=US load --autodetect --skip_leading_rows=1 --source_format=CSV {dataset_id}.Quarter3_2018 'gs://bmease/travel_times_LA_2018_quarter3.csv'

Waiting on bqjob_r1c770c8d8d2c8f59_000001702c5d6964_1 ... (65s) Current status: DONE   


In [26]:
%%bigquery
select * from uber_staging.Quarter3_2018 limit 5

Unnamed: 0,sourceid,dstid,month,mean_travel_time,standard_deviation_travel_time,geometric_mean_travel_time,geometric_standard_deviation_travel_time
0,1989,309,7,705.54,870.35,564.58,1.74
1,1970,499,7,839.3,503.97,737.15,1.62
2,1545,1518,7,745.48,439.5,660.61,1.61
3,1988,319,7,191.32,109.36,166.78,1.7
4,174,39,7,465.89,250.88,418.39,1.6


#### Quarter 1 of 2019

In [12]:
!bq --location=US load --autodetect --skip_leading_rows=1 --source_format=CSV {dataset_id}.Quarter1_2019 'gs://bmease/travel_times_LA_2019_quarter1.csv'

Waiting on bqjob_r22f9dc31869efa79_000001702c5e7192_1 ... (48s) Current status: DONE   


In [27]:
%%bigquery
select * from uber_staging.Quarter1_2019 limit 5

Unnamed: 0,sourceid,dstid,month,mean_travel_time,standard_deviation_travel_time,geometric_mean_travel_time,geometric_standard_deviation_travel_time
0,1038,902,1,426.75,296.49,370.71,1.69
1,662,2403,1,544.05,357.81,473.43,1.63
2,1001,1523,1,1693.61,382.04,1655.94,1.23
3,2318,2584,1,1215.42,230.08,1194.1,1.21
4,473,2228,1,276.66,209.22,207.47,2.19


#### Quarter 3 of 2019

In [13]:
!bq --location=US load --autodetect --skip_leading_rows=1 --source_format=CSV {dataset_id}.Quarter3_2019 'gs://bmease/travel_times_LA_2019_quarter3.csv'

Waiting on bqjob_r4793a1f29648aa48_000001702c5f36d2_1 ... (48s) Current status: DONE   


In [28]:
%%bigquery
select * from uber_staging.Quarter3_2019 limit 5

Unnamed: 0,sourceid,dstid,month,mean_travel_time,standard_deviation_travel_time,geometric_mean_travel_time,geometric_standard_deviation_travel_time
0,1365,1361,7,106.81,91.88,85.54,1.97
1,1345,1561,7,510.42,299.45,449.56,1.6
2,1379,1221,7,1171.94,745.51,1005.05,1.75
3,1387,1141,7,1084.61,487.81,978.11,1.59
4,253,1529,7,929.68,464.51,817.14,1.78


### Interesting Queries

This query lists the ten source locations (`sourceid`) that appear most often in the January 2018 data, together with the average travel time from each. Note that the results are ordered by frequency, not by travel time. A source location that combines a high frequency with a long average travel time may indicate congestion around that location.

In [42]:
%%bigquery
select sourceid, count(*) as frequency, avg(mean_travel_time) as travel_time
from uber_staging.Quarter1_2018
where month=1
group by sourceid
order by frequency desc
limit 10

Unnamed: 0,sourceid,frequency,travel_time
0,1697,2334,2259.453967
1,2423,2262,1761.910172
2,325,2238,1849.38164
3,1230,2158,1555.766881
4,1220,2133,1535.848022
5,412,2081,1783.533763
6,1501,2064,1520.239855
7,1222,2056,1473.174115
8,1233,2055,1634.864832
9,1170,2055,1372.086618
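
Because the query orders by frequency, the row order above does not rank locations by travel time. The sketch below re-sorts the ten rows shown in the output by average travel time. It only re-ranks these ten rows, not the whole table; ranking the whole table would mean changing `order by frequency desc` to `order by travel_time desc` in the query itself. The conversion to minutes assumes `mean_travel_time` is recorded in seconds.

```python
# (sourceid, frequency, travel_time) rows copied from the output above.
rows = [
    (1697, 2334, 2259.453967),
    (2423, 2262, 1761.910172),
    (325, 2238, 1849.38164),
    (1230, 2158, 1555.766881),
    (1220, 2133, 1535.848022),
    (412, 2081, 1783.533763),
    (1501, 2064, 1520.239855),
    (1222, 2056, 1473.174115),
    (1233, 2055, 1634.864832),
    (1170, 2055, 1372.086618),
]

# Sort by the third field (average travel time), longest first.
by_travel_time = sorted(rows, key=lambda r: r[2], reverse=True)

for sourceid, frequency, travel_time in by_travel_time[:3]:
    # Assumption: travel_time is in seconds, so divide by 60 for minutes.
    print(sourceid, frequency, round(travel_time / 60, 1), "min")
```

Among these ten rows, source 1697 leads on both frequency and travel time, while the rest of the ordering changes.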


This query lists the most frequent destinations (`dstid`) in a given month, July 2018 here, and the average travel time to each. The most frequent destination, 1697, also has the longest average travel time of the ten listed. This could also be due to congestion.

In [43]:
%%bigquery
select dstid, count(*) as frequency, avg(mean_travel_time) as travel_time
from uber_staging.Quarter3_2018
where month=7
group by dstid
order by frequency desc
limit 10

Unnamed: 0,dstid,frequency,travel_time
0,1697,2374,2372.723665
1,2423,2277,1876.421089
2,1170,2239,1704.393394
3,1382,2222,1657.083519
4,1222,2166,1573.046925
5,1220,2165,1655.919737
6,1501,2150,1692.084288
7,1235,2128,1730.659159
8,1157,2126,1747.345235
9,1159,2124,1774.285965
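
To put a number on the "at a glance" comparison, the sketch below compares the average travel time of the top destination with the mean of the other nine rows in the output above. It uses only the ten rows shown, so it says nothing about destinations outside this top ten.

```python
from statistics import mean

# (dstid, frequency, travel_time) rows copied from the output above.
rows = [
    (1697, 2374, 2372.723665),
    (2423, 2277, 1876.421089),
    (1170, 2239, 1704.393394),
    (1382, 2222, 1657.083519),
    (1222, 2166, 1573.046925),
    (1220, 2165, 1655.919737),
    (1501, 2150, 1692.084288),
    (1235, 2128, 1730.659159),
    (1157, 2126, 1747.345235),
    (1159, 2124, 1774.285965),
]

top_dstid, _, top_time = rows[0]
# Mean travel time of the remaining nine destinations.
others_mean = mean(t for _, _, t in rows[1:])
gap = top_time - others_mean

print(f"dstid {top_dstid}: {top_time:.0f} vs {others_mean:.0f} for the other nine")
print(f"gap: {gap:.0f}")
```

The gap is in the same unit as `mean_travel_time`; if that unit is seconds, as the 1200-second threshold used later in this notebook suggests, it comes to roughly eleven minutes.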


This query shows the popularity of Uber rides by month. It can be used to find the busiest and the slowest months for Uber drivers.

In [17]:
%%bigquery
select month, count(*) as frequency from uber_staging.Quarter1_2019 group by month order by frequency desc
#Here, we see that the best month in quarter one is March, and the worst month is February

Unnamed: 0,month,frequency
0,3,1872932
1,1,1597882
2,2,1564968


In [18]:
%%bigquery
select month, count(*) as frequency from uber_staging.Quarter3_2019 group by month order by frequency desc
#Here, we see that the best month in quarter three is August, and the worst month is July

Unnamed: 0,month,frequency
0,8,1827148
1,9,1804011
2,7,1780612
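
The raw counts are easier to compare as shares of each quarter. The sketch below computes each month's percentage of its quarter's total from the two outputs above; `monthly_shares` is a hypothetical helper name.

```python
# month -> frequency, copied from the two outputs above.
q1_2019 = {3: 1872932, 1: 1597882, 2: 1564968}
q3_2019 = {8: 1827148, 9: 1804011, 7: 1780612}


def monthly_shares(counts):
    """Return each month's share of the quarter total, in percent."""
    total = sum(counts.values())
    return {month: 100 * n / total for month, n in counts.items()}


for label, counts in (("Q1 2019", q1_2019), ("Q3 2019", q3_2019)):
    shares = monthly_shares(counts)
    # Difference between the busiest and slowest month, in percentage points.
    spread = max(shares.values()) - min(shares.values())
    rounded = {m: round(s, 1) for m, s in sorted(shares.items())}
    print(label, rounded, f"spread {spread:.1f} points")
```

In quarter one the busiest and slowest months differ by about six percentage points, whereas in quarter three the three months are within one point of each other, so the "best" and "worst" labels carry much less weight there.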


This query could help an Uber driver who wants to know where the best places to pick up rides are (during the late summer months, in this case). Assuming the driver had the context to know which `sourceid` represents which area of Los Angeles, he or she could pinpoint the best place to find clients.

In [16]:
%%bigquery
select sourceid, count(*) as frequency from uber_staging.Quarter3_2019 group by sourceid order by frequency desc
limit 10

Unnamed: 0,sourceid,frequency
0,1697,7300
1,2423,7118
2,325,7030
3,412,6771
4,1230,6701
5,1220,6534
6,1235,6508
7,1501,6487
8,1170,6472
9,1222,6420
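
It can be useful to check whether these top pickup locations are specific to the summer. As a small sketch, the code below compares the ten source IDs above with the ten source IDs from the January 2018 query earlier in this notebook, using plain set operations on the values shown in the two outputs.

```python
# Top-10 sourceid values copied from the two query outputs in this notebook.
jan_2018_top = [1697, 2423, 325, 1230, 1220, 412, 1501, 1222, 1233, 1170]
q3_2019_top = [1697, 2423, 325, 412, 1230, 1220, 1235, 1501, 1170, 1222]

common = set(jan_2018_top) & set(q3_2019_top)      # in both lists
only_2018 = set(jan_2018_top) - set(q3_2019_top)   # dropped out by Q3 2019
only_2019 = set(q3_2019_top) - set(jan_2018_top)   # new in Q3 2019

print(len(common), "locations appear in both top-10 lists")
print("only in January 2018:", sorted(only_2018))
print("only in Q3 2019:", sorted(only_2019))
```

Nine of the ten locations appear in both lists, which suggests the busiest pickup areas in this data are fairly stable across periods.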


Say a driver has just started working for Uber and wants the new driver bonus (give at least 50 rides in 30 days) as quickly as possible. The month is February, so he is a bit concerned that he will have trouble getting rides. The following query shows which source locations have the most short rides (mean travel time under 20 minutes, that is, 1200 seconds) in February. He can then go to those `sourceid` locations and get his 50-100 rides in.

In [49]:
%%bigquery
select sourceid, count(*) as frequency, avg(mean_travel_time) as travel_time
from uber_staging.Quarter1_2019
where month=2 and mean_travel_time < 1200
group by sourceid
order by frequency desc
limit 10

Unnamed: 0,sourceid,frequency,travel_time
0,1170,686,813.372609
1,1710,685,827.250599
2,1699,683,842.738243
3,1382,637,817.895447
4,1235,637,819.387221
5,1796,635,821.05148
6,264,632,832.313244
7,1158,631,829.411537
8,1709,616,824.288377
9,1804,613,832.821501
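
The 1200 in the query is the 20-minute cutoff expressed in seconds. To try other months or cutoffs without redoing that conversion by hand, the query text can be generated from a small helper. This is a sketch only: `short_ride_query` is a hypothetical name, it assumes `mean_travel_time` is stored in seconds, and it builds the SQL with string formatting, which is acceptable only for trusted values typed into the notebook.

```python
def short_ride_query(table, month, max_minutes, limit=10):
    """Build the short-ride query for a given month and cutoff in minutes."""
    # Assumption: mean_travel_time is in seconds, so convert the cutoff.
    max_seconds = int(max_minutes * 60)
    return (
        "select sourceid, count(*) as frequency, "
        "avg(mean_travel_time) as travel_time\n"
        f"from {table}\n"
        f"where month={int(month)} and mean_travel_time < {max_seconds}\n"
        "group by sourceid\n"
        "order by frequency desc\n"
        f"limit {int(limit)}"
    )


query = short_ride_query("uber_staging.Quarter1_2019", month=2, max_minutes=20)
print(query)
```

The printed text can be pasted into a `%%bigquery` cell; with `month=2` and `max_minutes=20` it reproduces the query above.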
