## Questions


__Use Chipotle July 2021__

1. Return the highest bucketed dwell time for each location/month.  
  A. Convert the bucketed dwell time to a number using the lowest value in the bucket string label. For less than 5 (<5), convert it to 2.   
  B. Display the placekey, date_range_start, raw_visitor_counts, bucket value, visitor counts for that bucket, and a year column.   
  C. Create a visualization that helps us understand the change in dwell times over the years.   
2. Calculate the rolling 7-day totals using `visits_by_day` and `date_range_start`.
  A. Your reported numbers should be correctly scaled to the state estimates.   
  B. Use `F.posexplode()` as one way to build your calculations.   
  C. Sort the dataframe by the 7-day totals column in descending order and display it.   
3. Now build a new dataframe with the top store for each state and create a plot showing the rolling 7-day performance of those stores.
  A. I did this with a `groupBy()`, `partitionBy()`, and `join()` to filter to the best placekey for each state.   
  B. You should create a line chart that shows the rolling 1-week sum for the top store in each state.   
4. Explain what the following `pandas_udf()` is doing in your own words.

```python
from pyspark.sql.functions import pandas_udf, PandasUDFType
import pandas as pd
####
# describe this portion
@pandas_udf('int', PandasUDFType.SCALAR)
def day_with_max_visits(series):
  cols = [x for x in range(1,32)]
  temp = pd.DataFrame(series.to_list(), columns=cols)
  return temp.idxmax(axis=1)
#####

display(df.withColumn('day with max', day_with_max_visits(df.visits_by_day)))
```

In [0]:
# build data in parquet format for 
## Have them read it in and then do the following
import pyspark.sql.functions as F
from pyspark.sql.window import Window
from plotnine import *
import pandas as pd

dat = spark.read.parquet("dbfs:/FileStore/dat/owhekdudn.parquet")

display(dat)

In [0]:
dat.printSchema()

## Q1: Bucket

**A. Convert the bucketed dwell time to a number using the lowest value in the bucket string label. For less than 5 (<5), convert it to 2.**

In [0]:
from pyspark.sql.functions import udf

# A UDF lets this Python function run on a PySpark column and return a string value. It returns the key (the lowest value in the bucket label) of the bucket with the highest visitor count.

@udf("string")
def return_highest_dwell_time(data):
    dict_var = {}
    dict_var['2'] = data['<5']
    dict_var['5'] = data['5-10']
    dict_var['11'] = data['11-20']
    dict_var['21'] = data['21-60']
    dict_var['61'] = data['61-120']
    dict_var['121'] = data['121-240']
    dict_var['240'] = data['>240']
    
    highest_value = max(dict_var, key=dict_var.get)
    
    return highest_value

dat_data = dat.withColumn("bucket_value", return_highest_dwell_time(dat.bucketed_dwell_times))

display(dat_data.select("bucket_value"))
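
The label-to-number rule from part A can also be derived from the label text instead of being typed out bucket by bucket. The following is a minimal sketch in plain Python, assuming the bucket labels are exactly the seven strings used above. `bucket_floor` and `highest_bucket` are hypothetical helper names, and the sample counts are made up for illustration.

```python
import re

def bucket_floor(label):
    """Return the lowest value in a dwell-time bucket label as an int."""
    if label.startswith("<"):
        return 2  # the assignment maps "<5" to 2
    # First run of digits in the label: "5-10" -> 5, ">240" -> 240
    return int(re.search(r"\d+", label).group())

def highest_bucket(buckets):
    """Return (bucket floor, visitor count) for the bucket with the most visitors."""
    label = max(buckets, key=buckets.get)
    return bucket_floor(label), buckets[label]

# Made-up counts, not taken from the Chipotle data
sample = {"<5": 3, "5-10": 40, "11-20": 12, "21-60": 9,
          "61-120": 2, "121-240": 1, ">240": 0}
print(highest_bucket(sample))
```

The same two helpers could be wrapped in a `udf` if the per-bucket dictionary assignments above become hard to maintain.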

**B. Display the placekey, date_range_start, raw_visitor_counts, bucket value, visitor counts for that bucket, and a year column.**

In [0]:
@udf("integer")
def return_visitor_counts(data):
    dict_var = {}
    dict_var['2'] = data['<5']
    dict_var['5'] = data['5-10']
    dict_var['11'] = data['11-20']
    dict_var['21'] = data['21-60']
    dict_var['61'] = data['61-120']
    dict_var['121'] = data['121-240']
    dict_var['240'] = data['>240']
    
    # Like the previous function, but return the highest visitor count rather than its bucket key.
    visitor_count = max(dict_var.values())
    
    return visitor_count

dat_data = dat_data.withColumn("visitor_counts_by_bucket", return_visitor_counts(dat_data.bucketed_dwell_times))

# Keep the columns requested in part B, plus the ones needed later for the rolling totals.
new_dat_data = dat_data.select(
    "placekey",
    "date_range_start",
    "region",
    "visits_by_day",
    "raw_visit_counts",
    "bucket_value",
    "visitor_counts_by_bucket",
    F.year("date_range_start").alias("Year"),
    "normalized_visits_by_state_scaling",
)

display(new_dat_data)

**C. Create a visualization that helps us understand the change in dwell times over the years.**

In [0]:
pd_new_dat_data = new_dat_data.toPandas()

(ggplot(pd_new_dat_data)
 + geom_col(aes(x='Year', y ='visitor_counts_by_bucket', fill = 'bucket_value'), width = 0.50)
 + theme(figure_size=(10, 6))
 + labs(x='Year', y='Visitor Counts By Bucket', title="Change in Dwell Times over the Years")
)

## Q2: Rolling window

**Calculate the rolling 7-day totals using visits_by_day and date_range_start.**<br>
**A. Your reported numbers should be correctly scaled to the state estimates.**<br>
**B. Use F.posexplode() as one way to build your calculations.**

In [0]:
# posexplode gives the index (pos) and value (col) of each daily visit count in the visits_by_day column. Adding the index to date_range_start gives the calendar date for that count. The daily values are then scaled to the state estimates.

posexploded_data = new_dat_data.select("*", F.posexplode(new_dat_data.visits_by_day))
posexploded_data = posexploded_data.withColumn("date", F.date_format(F.col("date_range_start"), "yyyy-MM-dd"))
posexploded_data = posexploded_data.withColumn("month", F.month(posexploded_data.date_range_start))
posexploded_data = posexploded_data.withColumn("col", (posexploded_data.col / posexploded_data.raw_visit_counts) * posexploded_data.normalized_visits_by_state_scaling)
posexploded_data = posexploded_data.withColumn("date", F.expr("date_add(date, pos)"))

# Rolling 7-day average: within each location and month, ordered by day, average the current value and the prior 6 (7 days in total).

rolling7Days_window_partition = Window.partitionBy("placekey", "date_range_start").orderBy("pos").rowsBetween(-6, Window.currentRow)
posexploded_data = posexploded_data.withColumn("rolling_avg", F.avg(posexploded_data.col).over(rolling7Days_window_partition))
posexploded_data = posexploded_data.filter(posexploded_data.pos >= 6)
   
rolling_sum_window_partition = Window().partitionBy("placekey", "Year", "month")
rolling7_days_data = posexploded_data.withColumn("_7_day_total", F.sum(posexploded_data.rolling_avg).over(rolling_sum_window_partition))
rolling7_days_data = rolling7_days_data.filter(rolling7_days_data.pos == 7).drop("row_number", "pos", "col", "rolling_avg", "normalized_visits_by_state_scaling", "month", "date")

display(rolling7_days_data)
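
One way to sanity-check the window logic is to compute a rolling total for a single location by hand. The following is a rough sketch in plain Python, not the Spark code above. It assumes the same scaling (each day's share of `raw_visit_counts` multiplied by `normalized_visits_by_state_scaling`) and sums each 7-day window rather than averaging it. `rolling_7_day_totals` is a hypothetical helper and all numbers are made up.

```python
from datetime import date, timedelta

def rolling_7_day_totals(visits_by_day, start, raw_visit_counts, state_scaled_visits):
    """Return (window end date, scaled 7-day total) pairs for one location/month."""
    # Scale each day's visits to the state estimate, mirroring the Spark cell above
    scaled = [v / raw_visit_counts * state_scaled_visits for v in visits_by_day]
    totals = []
    for pos in range(6, len(scaled)):       # the first full window ends at index 6
        window = scaled[pos - 6 : pos + 1]  # current day plus the prior 6
        totals.append((start + timedelta(days=pos), sum(window)))
    return totals

visits = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]    # made-up daily visits
result = rolling_7_day_totals(visits, date(2021, 7, 1), sum(visits), 110.0)
for day, total in result:
    print(day, round(total, 2))
```

Ten days of data give four complete windows, which is the same reason the Spark version filters to `pos >= 6`.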

**C. Sort the dataframe by the 7-day totals column in descending order and display it.**

In [0]:
rolling7_days_data = rolling7_days_data.sort(rolling7_days_data._7_day_total.desc())
display(rolling7_days_data)

## Q3: groupby and join

**Now build a new dataframe with the top store for each state and create a plot showing the rolling 7-day performance of those stores.**<br>
**A. I did this with a groupBy(), partitionBy(), and join() to filter to the best placekey for each state.**

In [0]:
# I only needed a window partition, orderBy, and a rank function to get the top store for each state. According to the description of question 3, the solution can also be achieved with a groupBy() and a join(). I used my own approach because the question does not say that groupBy() and join() are required.

partition_by_top_store = Window.partitionBy("region").orderBy(F.col("_7_day_total").desc())
ranked_top_store = rolling7_days_data.withColumn("rank", F.rank().over(partition_by_top_store))
top_store_by_state = ranked_top_store.filter(F.col("rank") <= 1).drop("row", "rank")
display(top_store_by_state)
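
The rank-and-filter step amounts to "keep the row with the largest total in each region". A minimal sketch of that idea in plain Python, using made-up rows and hypothetical placekeys, is shown here. Unlike `F.rank()`, which keeps every store tied for first place, this version keeps only the first one it sees.

```python
def top_store_by_region(rows):
    """Keep the row with the largest 7-day total in each region."""
    best = {}
    for row in rows:
        current = best.get(row["region"])
        # Strict comparison: on a tie, the earlier row is kept
        if current is None or row["_7_day_total"] > current["_7_day_total"]:
            best[row["region"]] = row
    return best

# Made-up rows; the placekeys are placeholders, not real SafeGraph keys
rows = [
    {"placekey": "store-a", "region": "FL", "_7_day_total": 120.0},
    {"placekey": "store-b", "region": "FL", "_7_day_total": 150.0},
    {"placekey": "store-c", "region": "NJ", "_7_day_total": 140.0},
    {"placekey": "store-d", "region": "NJ", "_7_day_total": 90.0},
]
top = top_store_by_region(rows)
for region, row in sorted(top.items()):
    print(region, row["placekey"], row["_7_day_total"])
```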

**B. You should create a line chart that shows the rolling 1-week sum for the top store in each state.**

In [0]:
# Florida and New Jersey have the largest rolling sums.

pd_top_store_by_state = top_store_by_state.toPandas()

(ggplot(pd_top_store_by_state, aes(x='region', y='_7_day_total'))
 + geom_line(aes(group=1), color = 'blue')
 + theme(figure_size=(16, 8))
 + labs(x='Top Store by State', y='Rolling 1-Week Sum', title="7-day Performance of Stores")
)

## Q4: Describe code

**Explain what the following pandas_udf() is doing in your own words.**

```python
from pyspark.sql.functions import pandas_udf, PandasUDFType
import pandas as pd
####
# describe this portion
@pandas_udf('int', PandasUDFType.SCALAR)
def day_with_max_visits(series):
  cols = [x for x in range(1,32)]
  temp = pd.DataFrame(series.to_list(), columns=cols)
  return temp.idxmax(axis=1)
#####

display(df.withColumn('day with max', day_with_max_visits(df.visits_by_day)))
```

Explanation: This code uses a scalar pandas UDF to run pandas code on a PySpark column. We start with a PySpark DataFrame called `df` that has a `visits_by_day` column, and we use `withColumn` to add a new column, 'day with max', to it. The function `day_with_max_visits` does the computation on `visits_by_day`, but it expects a pandas Series as its parameter. We could convert the whole PySpark DataFrame into a pandas DataFrame, but that would need more code. Instead, the `pandas_udf` decorator wraps the function so that Spark passes the column to it as a pandas Series and converts the pandas Series it returns back into an integer PySpark column.

Inside the function, `cols` holds the numbers 1 to 31, which become the column names of a DataFrame. `temp` is that DataFrame: each row is one `visits_by_day` list spread across the 31 columns, so column 1 holds the visits on day 1, column 2 the visits on day 2, and so on. This assumes every list has 31 entries, as it does for July. As the last step, `temp.idxmax(axis=1)` returns a pandas Series that holds, for each row, the label of the column with the maximum value, which is the day of the month with the most visits.
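
To make the result concrete, here is a plain-Python sketch of what the UDF computes for a single row. `day_with_max_visits_py` is a hypothetical name and the visit counts are made up. Like `idxmax`, it reports the first day when several days tie for the maximum.

```python
def day_with_max_visits_py(visits_by_day):
    """Return the 1-based day of the month with the most visits (first one on ties)."""
    # list.index returns the first position of the maximum; add 1 because days start at 1
    return visits_by_day.index(max(visits_by_day)) + 1

july = [0] * 31   # made-up month with no visits...
july[3] = 9       # ...except 9 visits on day 4
july[19] = 9      # and 9 visits on day 20, tying with day 4
print(day_with_max_visits_py(july))
```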