Note: Before you proceed with the next steps, make sure that the dataset is available in your Databricks environment. This quickstart uses a dataset from Uber.

##### This quickstart is divided into 2 parts.
- Part 1: Analyze the Uber data using PySpark (covered in this notebook)
- Part 2: Convert the PySpark code written in Part 1 into Snowpark code and perform the same transformations and analytics

#####Data Description
The dataset used in this quickstart is a real-world dataset from Uber that covers a span of two weeks.

Column Information:
- Date: A date
- Time (Local): The hour of the day, in local time
- Eyeballs: Number of people who opened the Uber app
- Zeroes: Number of people who did not see any cars
- Requests: Number of people who requested a car
- Completed Trips: Number of completed trips out of the number of requests
- Unique Drivers: Number of unique drivers available

In [0]:
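# Import the required PySpark libraries
from pyspark.sql.functions import to_date
from pyspark.sql.types import DateType
# The wildcard import shadows Python built-ins such as sum(), min() and max()
# with their Spark column equivalents; the cells below rely on this.
from pyspark.sql.functions import *
from pyspark.sql.window import Window
import pandas as pd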



In [0]:
#Read the dataset and load it into a dataframe
df = spark.read.format("csv").option("inferschema", True)\
                .option("header", True)\
                .option("sep",",")\
                .load("dbfs:/FileStore/tables/uber/uber_dataset_1.csv")
display(df)

Date,Time (Local),Eyeballs,Zeroes,Completed Trips,Requests,Unique Drivers
10-Sep-12,7,5,0,2,2,9
10-Sep-12,8,6,0,2,2,14
10-Sep-12,9,8,3,0,0,14
10-Sep-12,10,9,2,0,1,14
10-Sep-12,11,11,1,4,4,11
10-Sep-12,12,12,0,2,2,11
10-Sep-12,13,9,1,0,0,9
10-Sep-12,14,12,1,0,0,9
10-Sep-12,15,11,2,1,2,7
10-Sep-12,16,11,2,3,4,6


Add a new column 'Datetime', built from the Date and Time columns, for analysis purposes.

In [0]:
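# Rename the columns positionally so that "Time (Local)" becomes "Time"
df = df.toDF('Date', 'Time', 'Eyeballs', 'Zeroes', 'Completed Trips', 'Requests', 'Unique Drivers')
# Build the timestamp in pandas, then convert back to a Spark DataFrame
df = df.toPandas()
df['Datetime'] = df['Date'] + ' ' + df['Time'].astype('str')
df['Datetime'] = pd.to_datetime(df['Datetime'], format='%d-%b-%y %H')
df = spark.createDataFrame(df)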
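
The format string `'%d-%b-%y %H'` used above can be checked on a single value with the standard library alone. This is a minimal sketch, not part of the notebook: the helper name `parse_uber_datetime` is hypothetical, and it assumes an English locale so that `%b` matches month abbreviations such as `Sep`.

```python
from datetime import datetime


def parse_uber_datetime(date_str: str, hour: int) -> datetime:
    """Combine a 'dd-Mon-yy' date string and an hour (0-23) into a datetime.

    Hypothetical helper that mirrors the pandas format used in the notebook.
    """
    return datetime.strptime(f"{date_str} {hour}", "%d-%b-%y %H")


# First row of the displayed sample: 10-Sep-12, hour 7
print(parse_uber_datetime("10-Sep-12", 7))  # 2012-09-10 07:00:00
```

The two-digit year `12` is read as 2012, which agrees with the dates shown in the later outputs.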

In [0]:
#print schema of the dataset
df.printSchema()

root
 |-- Date: string (nullable = true)
 |-- Time: integer (nullable = true)
 |-- Eyeballs: integer (nullable = true)
 |-- Zeroes: integer (nullable = true)
 |-- Completed Trips: integer (nullable = true)
 |-- Requests: integer (nullable = true)
 |-- Unique Drivers: integer (nullable = true)
 |-- Datetime: timestamp (nullable = true)



Perform some transformations such as column renaming and datatype conversion.
- rename the column "Time" (originally "Time (Local)") to "Time_Local"
- rename the column "Completed Trips" to "Completed_Trips"
- rename the column "Unique Drivers" to "Unique_Drivers"
- convert the datatype of the "Date" column from string to date

In [0]:
df = df.withColumnRenamed("Time", "Time_Local")\
       .withColumnRenamed("Completed Trips", "Completed_Trips")\
       .withColumnRenamed("Unique Drivers", "Unique_Drivers")\
       .withColumn("Date",to_date("Date","dd-MMM-yy").cast(DateType()))

In [0]:
df.printSchema()

root
 |-- Date: date (nullable = true)
 |-- Time_Local: integer (nullable = true)
 |-- Eyeballs: integer (nullable = true)
 |-- Zeroes: integer (nullable = true)
 |-- Completed_Trips: integer (nullable = true)
 |-- Requests: integer (nullable = true)
 |-- Unique_Drivers: integer (nullable = true)
 |-- Datetime: timestamp (nullable = true)



#####1. Which date had the most completed trips during the two-week period?

In [0]:
#Group the data by date and calculate sum of completed trips
trips_per_date = df.groupBy('Date').agg(sum("Completed_Trips").alias('Total_Completed_Trips'))\
                   .orderBy("Total_Completed_Trips", ascending=False)
trips_per_date.show(5)

+----------+---------------------+
|      Date|Total_Completed_Trips|
+----------+---------------------+
|2012-09-22|                  248|
|2012-09-15|                  199|
|2012-09-21|                  190|
|2012-09-23|                  111|
|2012-09-14|                  108|
+----------+---------------------+
only showing top 5 rows



In [0]:
trips_per_date.select('Date').show(1)

+----------+
|      Date|
+----------+
|2012-09-22|
+----------+
only showing top 1 row



#####2. What was the highest number of completed trips within a 24-hour period?

In [0]:
# Group the data by 24-hour window and sum the completed trips
completed_trips_24hrs = df.groupBy(window("Datetime", "24 hours"))\
                          .agg(sum("Completed_Trips").alias("sum_completed_trips"))\
                          .orderBy("sum_completed_trips", ascending=False)

In [0]:
# Get the highest number of completed trips within a 24-hour period
highest_completed_trips_in_24_hours = completed_trips_24hrs \
                                       .select("sum_completed_trips") \
                                        .first()["sum_completed_trips"]
print(highest_completed_trips_in_24_hours)

248


#####3. Which hour of the day had the most requests during the two-week period?

In [0]:
df_hour = df.groupBy("Time_Local").agg(sum('Requests').alias('Total_Requests')).orderBy("Total_Requests", ascending=False)
df_hour.show(5)

+----------+--------------+
|Time_Local|Total_Requests|
+----------+--------------+
|        23|           184|
|        22|           174|
|        19|           156|
|         0|           142|
|        18|           119|
+----------+--------------+
only showing top 5 rows



In [0]:
df_hour.select("Time_Local").first()[0]

Out[67]: 23

#####4. What percentage of all zeroes during the two-week period occurred on the weekend (Friday at 5 pm to Sunday at 3 am)? Tip: The local time value is the start of the hour (e.g. 15 is the hour from 3:00 pm - 4:00 pm)

To answer this question, we filter the data to the rows that fall within the weekend time range and sum their zeroes, sum the zeroes over the whole period, and then divide the first sum by the second to get the percentage of zeroes that occurred on weekends.

In [0]:
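# Sum the zeroes that occurred on weekends.
# dayofweek() returns 1 for Sunday through 7 for Saturday, so 6 and 7 are Friday and Saturday.
# Note: this filter keeps only the Friday and Saturday rows with an hour >= 17 or < 3,
# which approximates the Friday 5 pm to Sunday 3 am window rather than matching it exactly.
zeroes_weekend = df.filter((df.Time_Local >= 17) | (df.Time_Local < 3)).filter((dayofweek("Date") == 6) | (dayofweek("Date") == 7))\
                   .agg(sum('Zeroes').alias('zeroes_weekend')).collect()[0]['zeroes_weekend']

# Total number of zeroes over the whole period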
total_zeroes = df.agg(sum("Zeroes").alias('total_zeroes')).collect()[0]['total_zeroes']

print(zeroes_weekend/total_zeroes *100)

29.111266620014
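
The question defines the weekend as Friday 5 pm to Sunday 3 am. The plain-Python sketch below spells out that window as a predicate, assuming Spark's `dayofweek()` numbering (1 = Sunday through 7 = Saturday). The helper name `is_weekend_hour` is hypothetical, and the percentage printed above was not recomputed with it.

```python
def is_weekend_hour(day_of_week: int, hour: int) -> bool:
    """Return True if the hour falls between Friday 17:00 and Sunday 03:00.

    day_of_week follows Spark's dayofweek(): 1 = Sunday, ..., 7 = Saturday.
    hour is the start of the hour (0-23), as in the Time_Local column.
    """
    SUNDAY, FRIDAY, SATURDAY = 1, 6, 7
    if day_of_week == FRIDAY:
        return hour >= 17   # Friday evening only
    if day_of_week == SATURDAY:
        return True         # all of Saturday
    if day_of_week == SUNDAY:
        return hour < 3     # hours 0, 1 and 2 on Sunday
    return False


print(is_weekend_hour(6, 16), is_weekend_hour(6, 17))  # False True
```

One way to express the same predicate as a Spark filter, under the same assumption, would be `((dayofweek("Date") == 6) & (col("Time_Local") >= 17)) | (dayofweek("Date") == 7) | ((dayofweek("Date") == 1) & (col("Time_Local") < 3))`.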


#####5. What is the weighted average ratio of completed trips per driver during the two-week period? Tip: “Weighted average” means your answer should account for the total trip volume in each hour to determine the most accurate number in the whole period.

To answer this question, we need to calculate the ratio of completed trips to unique drivers for each hour, multiply the ratio by the total number of completed trips for that hour, and then sum the results. We can then divide this sum by the total number of completed trips for the entire period.

In [0]:
# Weight each hour's trips-per-driver ratio by that hour's completed trips
df.withColumn("trips_per_driver", df.Completed_Trips/df.Unique_Drivers)\
  .groupBy('Date', 'Time_Local')\
  .agg(avg("trips_per_driver").alias("avg_trips_per_driver"), sum("Completed_Trips").alias("total_completed_trips"))\
  .withColumn("weighted_ratio", col("avg_trips_per_driver")*col("total_completed_trips"))\
  .agg(sum("weighted_ratio") / sum("total_completed_trips")).show()

+--------------------------------------------------+
|(sum(weighted_ratio) / sum(total_completed_trips))|
+--------------------------------------------------+
|                                0.8276707747535552|
+--------------------------------------------------+
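
The weighted average described above is sum(ratio × trips) / sum(trips), where ratio is completed trips divided by unique drivers for each hour. The sketch below works through that arithmetic in plain Python on three rows taken from the displayed sample (hours 7, 8 and 11 of 10-Sep-12). The helper name is hypothetical, and skipping hours with zero drivers is an assumption made here to avoid dividing by zero.

```python
def weighted_trips_per_driver(rows):
    """Weighted average of trips per driver, weighted by completed trips.

    rows: iterable of (completed_trips, unique_drivers) pairs, one per hour.
    Returns None when there are no completed trips at all.
    """
    weighted_sum = 0.0
    total_trips = 0
    for trips, drivers in rows:
        total_trips += trips
        if drivers == 0:
            continue  # assumption: the ratio is undefined, so the hour adds nothing
        weighted_sum += (trips / drivers) * trips
    if total_trips == 0:
        return None
    return weighted_sum / total_trips


# (Completed Trips, Unique Drivers) for hours 7, 8 and 11 in the sample output
sample = [(2, 9), (2, 14), (4, 11)]
print(weighted_trips_per_driver(sample))
```

Hours with more completed trips pull the result toward their own ratio, which is why the hour with 4 trips dominates this small sample.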



#####6. In drafting a driver schedule in terms of 8-hour shifts, when are the busiest 8 consecutive hours over the two-week period in terms of unique requests? A new shift starts every 8 hours. Assume that a driver will work the same shift each day.

To solve this, we can first calculate the number of unique requests for each hour of the day, and then slide a window of 8 hours across the hours to find the 8 consecutive hours with the highest number of unique requests.

In [0]:
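# Count the distinct values of Requests observed for each hour of the day
unique_requests_per_hour = df.groupBy('Time_Local')\
                             .agg(countDistinct('Requests').alias('unique_requests'))

# Note: this window is ordered by unique_requests (descending), not by Time_Local,
# so each sum covers a row and the 7 rows ranked after it rather than 8 consecutive clock hours
window_8hr = Window.orderBy(col('unique_requests').desc()).rowsBetween(0,7)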

busiest_8_hrs = unique_requests_per_hour.select('*', sum('unique_requests').over(window_8hr).alias("sum_8_hrs"))\
                                        .orderBy(col("sum_8_hrs").desc())\
                                        .limit(1)
busiest_8_hrs.show()

+----------+---------------+---------+
|Time_Local|unique_requests|sum_8_hrs|
+----------+---------------+---------+
|        20|             12|       80|
+----------+---------------+---------+



In [0]:
# Hour 20 has the highest sum_8_hrs in the output above
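
Sliding an 8-hour window across the hours of the day can also be sketched in plain Python. The example below assumes one total per hour (index 0 to 23) and lets the window wrap past midnight. The function name `busiest_window` and the demo totals are hypothetical and are not taken from the dataset.

```python
def busiest_window(hourly_totals, width=8):
    """Return (start_hour, total) for the busiest run of consecutive hours.

    hourly_totals: one number per hour of the day, indexed by hour.
    The window wraps around midnight; ties go to the earliest start hour.
    """
    n = len(hourly_totals)
    best_start, best_total = 0, None
    for start in range(n):
        # Sum `width` consecutive hours, wrapping with the modulo operator
        total = sum(hourly_totals[(start + k) % n] for k in range(width))
        if best_total is None or total > best_total:
            best_start, best_total = start, total
    return best_start, best_total


# Hypothetical totals: busy from 17:00 through the end of hour 0, quiet otherwise
demo = [1] * 24
for h in (17, 18, 19, 20, 21, 22, 23, 0):
    demo[h] = 5

print(busiest_window(demo))  # (17, 40)
```

In a notebook, the input list could come from the per-hour request totals collected to the driver; that step is left out here.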

#####7. In which 72-hour period is the ratio of Zeroes to Eyeballs the highest?

In [0]:
# Group the data by 72-hour periods and calculate the ratio of zeroes to eyeballs for each period
period_ratios = (df.groupBy(((col("Date").cast("timestamp").cast("long") / (72*3600)).cast("int")).alias("period"))\
                   .agg(sum("Zeroes").alias("zeroes"), sum("Eyeballs").alias("eyeballs"))\
                   .withColumn("ratio", col("zeroes") / col("eyeballs"))
)

# Find the period with the highest ratio
highest_ratio_period = period_ratios.orderBy(col("ratio").desc()).limit(1)

# Print the result
highest_ratio_period.show()


+------+------+--------+-------------------+
|period|zeroes|eyeballs|              ratio|
+------+------+--------+-------------------+
|  5199|   443|    1763|0.25127623369256946|
+------+------+--------+-------------------+
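
The `period` value is an index of 72-hour buckets counted from the Unix epoch, so it can be turned back into a date range. The sketch below does this with the standard library, assuming the Spark session time zone is UTC (other time zones would shift the boundaries). The helper name `period_bounds` is hypothetical.

```python
from datetime import datetime, timedelta, timezone

PERIOD_SECONDS = 72 * 3600  # same bucket size as in the groupBy expression


def period_bounds(period: int):
    """Return the (start, end) datetimes in UTC of a 72-hour bucket index."""
    start = datetime.fromtimestamp(period * PERIOD_SECONDS, tz=timezone.utc)
    return start, start + timedelta(seconds=PERIOD_SECONDS)


start, end = period_bounds(5199)
print(start.date(), "to", end.date())  # 2012-09-14 to 2012-09-17
```

Under that assumption, period 5199 starts at 2012-09-14 00:00 and ends just before 2012-09-17 00:00.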



#####8. If you could add 5 drivers to any single hour of every day during the two-week period, which hour should you add them to? Hint: Consider both rider eyeballs and driver supply when choosing.

To determine which hour to add 5 drivers to, we want to look for an hour where there are a high number of rider eyeballs and a low number of unique drivers. One way to approach this is to calculate the ratio of requests to unique drivers for each hour and then choose the hour with the highest ratio. The idea here is that adding more drivers to an hour with a high ratio will result in more completed trips.

In [0]:
# Calculate requests per unique driver for each hour
# Note: countDistinct('Unique_Drivers') counts the distinct driver-count values seen in each hour
requests_per_driver = (df.groupBy('Time_Local')\
                         .agg((sum('Requests') / countDistinct('Unique_Drivers')).alias('requests_per_driver'))
)

# Show the hour with the highest ratio
requests_per_driver.orderBy(desc('requests_per_driver')).show(1)

+----------+-------------------+
|Time_Local|requests_per_driver|
+----------+-------------------+
|         2|               20.0|
+----------+-------------------+
only showing top 1 row



#####9. Looking at the data from all two weeks, which time might make the most sense to consider a true “end day” instead of midnight? (i.e., when are supply and demand both at their natural minimums?)

Solution: Calculate the average completed trips and the average unique drivers for each hour, and show the hour with the lowest averages.

In [0]:
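# Note: the 'avg_requests' alias holds the average of Completed_Trips.
# Rows are sorted by avg_requests first; avg_unique_drivers only breaks ties.
df.groupBy('Time_Local').agg(avg('Completed_Trips').alias('avg_requests'), avg('Unique_Drivers').alias('avg_unique_drivers'))\
  .orderBy('avg_requests', 'avg_unique_drivers').show(1)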

+----------+-------------------+------------------+
|Time_Local|       avg_requests|avg_unique_drivers|
+----------+-------------------+------------------+
|         4|0.14285714285714285|0.6428571428571429|
+----------+-------------------+------------------+
only showing top 1 row

