# Solutions for the Scoot Data Set

In [1]:
import pandas as pd
import numpy as np
from pandasql import sqldf
import matplotlib.pyplot as plt


#Read in the data frame
scoot_df = pd.read_csv("Data/Scoot_Data/Ride_Customer_Info.csv")

scoot_df.head()

Unnamed: 0,ride_id,trip_time,trip_dist,plan_id,hr_start,dow,time_slot,batt_perc,dist_h,price,start_location_id,lat,lng,end_location_id,lat.1,lng.1
0,211870,11.336693,3.0,Pro,0,1,0,0.361143,0.287168,2.0,104,37.7947,-122.41096,56,37.771578,-122.442181
1,211871,36.61688,2.0,Pro,0,1,0,0.979784,0.080456,4.0,21,37.775606,-122.416976,42,37.74981,-122.43179
2,211885,17.900736,3.6,Go,4,1,0,0.5,0.019225,3.0,90,37.784174,-122.48046,9,37.791758,-122.404929
3,211890,15.330679,3.2,Other,5,1,1,0.897714,0.119214,3.0,97,37.804907,-122.441988,26,37.791638,-122.391761
4,211894,138.113334,5.3,Other,6,1,1,0.848,0.030645,8.0,55,37.77744,-122.46494,87,37.795342,-122.398735


Use SQL to create a data frame df_travel that has three columns: trip_time, trip_dist, avg_speed (= trip_dist/trip_time).

In [2]:
#Get the three columns
df_travel = sqldf("SELECT trip_time, trip_dist, trip_dist/trip_time AS avg_speed FROM scoot_df")
df_travel.head()

Unnamed: 0,trip_time,trip_dist,avg_speed
0,11.336693,3.0,0.264627
1,36.61688,2.0,0.05462
2,17.900736,3.6,0.201109
3,15.330679,3.2,0.208732
4,138.113334,5.3,0.038374
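
Since the later 30-minute question treats trip_time as minutes, avg_speed here is distance per minute. A minimal sketch of converting it to distance per hour, using the first row of the output above. The helper name `speed_per_hour` is hypothetical, and the distance unit is not stated in the data, so it is left generic.

```python
def speed_per_hour(trip_dist, trip_time_min):
    """Convert a distance and a duration in minutes to distance units per hour."""
    return trip_dist / trip_time_min * 60


# First row of df_travel: trip_dist = 3.0, trip_time = 11.336693 minutes
first_row_speed = speed_per_hour(3.0, 11.336693)
print(round(first_row_speed, 2))
```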


The dow column has the following interpretation:
- 0: Monday
- 1: Tuesday
- 2: Wednesday
- 3: Thursday
- 4: Friday
- 5: Saturday
- 6: Sunday

What was the average price of rides that took place on the weekend and started at or after noon (hr_start >= 12)?

In [3]:
sqldf("SELECT AVG(price) FROM scoot_df WHERE (dow = 5 OR dow = 6) AND hr_start >= 12")

Unnamed: 0,AVG(price)
0,3.876614
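
The parentheses around the weekend condition matter, because SQL evaluates AND before OR. The sketch below illustrates this on a small made-up table using Python's built-in sqlite3 module (pandasql runs its queries on SQLite by default). The table name `rides`, the helper `avg_price`, and all values are hypothetical.

```python
import sqlite3

# Toy table with made-up rides: (dow, hr_start, price)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rides (dow INTEGER, hr_start INTEGER, price REAL)")
conn.executemany(
    "INSERT INTO rides VALUES (?, ?, ?)",
    [(5, 9, 10.0), (5, 14, 4.0), (6, 15, 6.0), (2, 13, 2.0)],
)


def avg_price(where_clause):
    """Return AVG(price) for the rows matching the given WHERE clause."""
    query = "SELECT AVG(price) FROM rides WHERE " + where_clause
    return conn.execute(query).fetchone()[0]


# Weekend rides at or after noon: the two forms below are equivalent
with_parens = avg_price("(dow = 5 OR dow = 6) AND hr_start >= 12")
with_in = avg_price("dow IN (5, 6) AND hr_start >= 12")

# Without parentheses this reads as: dow = 5 OR (dow = 6 AND hr_start >= 12),
# so the Saturday morning ride is wrongly included
without_parens = avg_price("dow = 5 OR dow = 6 AND hr_start >= 12")

print(with_parens, with_in, without_parens)
```

Writing the weekend test as `dow IN (5, 6)` avoids the precedence question altogether.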


How many trips lasted longer than 30 minutes?


In [4]:
#Find the number of trips that were longer than 30 minutes (strict inequality)
sqldf("SELECT COUNT(*) AS countTrips FROM scoot_df WHERE trip_time > 30")

Unnamed: 0,countTrips
0,4238
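
"Longer than 30 minutes" reads as a strict inequality. The choice between > and >= only changes the count when a trip lasts exactly 30 minutes. A small pandas sketch of the same count on made-up trip times (the frame `trips` is hypothetical):

```python
import pandas as pd

# Made-up trip times in minutes, including one trip of exactly 30 minutes
trips = pd.DataFrame({"trip_time": [11.3, 30.0, 36.6, 138.1]})

# A boolean mask sums to the number of rows where the condition holds
strictly_longer = int((trips["trip_time"] > 30).sum())
at_least = int((trips["trip_time"] >= 30).sum())

print(strictly_longer, at_least)
```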


Write a query to get the number of rides starting in each hour.


In [5]:
#Get the number of rides starting in each hour
Count_TOD = sqldf("SELECT hr_start,COUNT(*) num_rides FROM scoot_df GROUP BY hr_start")
Count_TOD

Unnamed: 0,hr_start,num_rides
0,0,113
1,1,55
2,2,43
3,3,13
4,4,27
5,5,126
6,6,355
7,7,1101
8,8,1104
9,9,663
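
GROUP BY only returns hours that have at least one ride. If a plot should cover every hour from 0 to 23, missing hours can be filled in with zero rides. A minimal sketch on made-up counts, where the frame `counts` stands in for Count_TOD:

```python
import pandas as pd

# Made-up hourly counts with hour 2 (and hours 4 to 23) missing
counts = pd.DataFrame({"hr_start": [0, 1, 3], "num_rides": [5, 2, 1]})

# Reindex on the full range of hours, filling absent hours with 0 rides
all_hours = pd.RangeIndex(24, name="hr_start")
full_counts = (
    counts.set_index("hr_start")
    .reindex(all_hours, fill_value=0)
    .reset_index()
)

print(full_counts.head())
```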


Create a line plot where the x-axis is the hour of the day and the y-axis is the number of rides. Give the plot an appropriate title and axis labels.

In [6]:
%matplotlib notebook

fig, ax = plt.subplots()

Count_TOD.plot(kind="line", x = "hr_start", y = "num_rides", ax=ax)

ax.set(title = "Demand over the Day", xlabel = "Hour", ylabel = "# of Rides")

[Figure: line plot "Demand over the Day"; x-axis: Hour, y-axis: # of Rides]

Write a query to get the number of rides starting in each hour by plan.


In [11]:
Count_TOD_plan = sqldf("SELECT plan_id,hr_start,COUNT(*) num_rides FROM scoot_df GROUP BY hr_start, plan_id")
Count_TOD_plan.head(10)

Unnamed: 0,plan_id,hr_start,num_rides
0,Go,0,25
1,Intro_Pro,0,29
2,Other,0,27
3,Pro,0,32
4,Go,1,16
5,Intro_Pro,1,17
6,Other,1,8
7,Pro,1,14
8,Go,2,2
9,Intro_Pro,2,7
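
To compare plans hour by hour, the long result (one row per hour and plan) can be reshaped so that each plan becomes its own column. A minimal sketch on made-up counts, where the frame `long_counts` stands in for Count_TOD_plan:

```python
import pandas as pd

# Made-up counts in the same long layout: one row per (hr_start, plan_id)
long_counts = pd.DataFrame({
    "plan_id": ["Go", "Pro", "Go", "Pro", "Pro"],
    "hr_start": [0, 0, 1, 1, 2],
    "num_rides": [25, 32, 16, 14, 7],
})

# One row per hour, one column per plan; missing combinations become 0
wide_counts = (
    long_counts.pivot(index="hr_start", columns="plan_id", values="num_rides")
    .fillna(0)
    .astype(int)
)

print(wide_counts)
```

Calling `wide_counts.plot(kind="line")` on a frame shaped like this draws one line per plan.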


Write a query to find, for each hour of the day, the average price of rides that started with a battery charge above 50$\%$. Compare this with rides whose initial charge was 50$\%$ or below. This helps answer the question: are scooters with less charge less valuable?


In [8]:
#Average price by starting hour for well-charged scooters (>50% charge)
charged_prices = sqldf("SELECT hr_start,AVG(price) AS avg_price_charged FROM scoot_df WHERE batt_perc > 0.5 GROUP BY hr_start")

#Average price by starting hour for less-charged scooters (<=50% charge)
uncharged_prices = sqldf("SELECT hr_start,AVG(price) AS avg_price_uncharged FROM scoot_df WHERE batt_perc <= 0.5 GROUP BY hr_start")

#Join on hr_start so each hour is matched with the same hour, not by row position
charged_prices = charged_prices.merge(uncharged_prices, on="hr_start")

#Make a new column that is the difference
charged_prices['diff'] = charged_prices['avg_price_charged'] - charged_prices['avg_price_uncharged']

#Find the average difference across hours
np.mean(charged_prices['diff'])




0.7006648202905144
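
When two per-hour tables are combined, matching rows on hr_start is safer than relying on row order, because an hour with no rides in one group shifts every later row. A minimal sketch with made-up averages. The frames `charged` and `uncharged` are hypothetical stand-ins for the two query results.

```python
import pandas as pd

# Made-up hourly averages; the uncharged group has no rides in hour 1
charged = pd.DataFrame({"hr_start": [0, 1, 2], "avg_price_charged": [3.0, 4.0, 5.0]})
uncharged = pd.DataFrame({"hr_start": [0, 2], "avg_price_uncharged": [2.0, 4.5]})

# Assigning a column aligns on the row index (0, 1, ...), not on hr_start,
# so hour 1 is paired with the uncharged price that belongs to hour 2
by_position = charged.copy()
by_position["avg_price_uncharged"] = uncharged["avg_price_uncharged"]

# Merging on hr_start keeps only hours present in both groups
merged = charged.merge(uncharged, on="hr_start", how="inner")
merged["diff"] = merged["avg_price_charged"] - merged["avg_price_uncharged"]
mean_diff = merged["diff"].mean()

print(by_position)
print(merged)
print(mean_diff)
```

Note that averaging the hourly differences gives every hour equal weight, regardless of how many rides started in that hour.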