# 01/2022 Questions

This notebook will track progress on practice questions that are sent by the InterviewQs website. 

In [8]:
# THIS FUNCTION TAKES A CSV FILE FROM A GITHUB URL AND READS IT INTO A PANDAS DATA FRAME
import pandas as pd

def read_file(url):

    """
    Takes GitHub url as an argument,
    pulls CSV file located @ github URL.

    """

    url = url + "?raw=true"
    df = pd.read_csv(url)
    return df


# READ FILE FROM GITHUB REPO
url_2022_01_17_01 = "https://github.com/akstl1/InterviewQs-Data-Science-Questions/blob/main/Data/2022.01.17-ad_table.csv"
url_2022_01_17_02 = "https://github.com/akstl1/InterviewQs-Data-Science-Questions/blob/main/Data/2022.01.17-spend_table.csv"
url_2022_01_14 = "https://github.com/akstl1/InterviewQs-Data-Science-Questions/blob/main/Data/2022.01.14_data.csv"

ad_table_data= read_file(url_2022_01_17_01)
spend_table_data = read_file(url_2022_01_17_02)
channel_data = read_file(url_2022_01_14)

<hr style="border:1px solid black"> </hr>

## 1/17/2022 Question

Given the following dataset, can you write a SQL query that returns the top 3 performing ad groups each day?
    
Here we'll define performance as the ratio between revenue and spend (i.e. revenue / spend). In other words, the higher the ratio, the better the performance. The output of the query will be the date and an array of the ad groups.

### Approach

To solve the problem, I will write a query that calculates the desired ratio, ranks ad groups within each date, and displays the result as follows:
- join the two datasets shown below on date and ad_group
- within a subquery
    - calculate the revenue/spend ratio as a decimal
    - use RANK to rank rows within each date by revenue/spend ratio, highest first
    - select the ratio, rank, ad_group and date within the subquery
- after the subquery
    - keep only rows where rank is less than or equal to 3
    - order results by date and rank for readability
    - select date, ad_group, revenue/spend ratio, and rank in the final result
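The same steps can also be sketched in pandas. This is a minimal illustration rather than the notebook's solution: the `ads` and `spend` frames below are hypothetical toy rows with made-up values, shaped like `ad_table` and `spend_table`, and `method="min"` is used because it matches how SQL `RANK()` treats ties.

```python
import pandas as pd

# Hypothetical toy rows shaped like ad_table and spend_table (values are made up).
groups = ["ad_group_1", "ad_group_2", "ad_group_3", "ad_group_4"]
ads = pd.DataFrame({
    "date": ["10/1/15"] * 4 + ["10/2/15"] * 4,
    "ad": groups * 2,
    "total_revenue": [100.0, 300.0, 200.0, 50.0, 80.0, 40.0, 120.0, 160.0],
})
spend = pd.DataFrame({
    "date": ["10/1/15"] * 4 + ["10/2/15"] * 4,
    "ad": groups * 2,
    "total_spend": [100.0] * 4 + [40.0] * 4,
})

# Join on date and ad group, then compute the revenue/spend ratio.
merged = ads.merge(spend, on=["date", "ad"])
merged["revenue_spend_ratio"] = (
    merged["total_revenue"] / merged["total_spend"]
).round(2)

# Rank within each date, highest ratio first.
merged["ratio_rank"] = (
    merged.groupby("date")["revenue_spend_ratio"]
    .rank(method="min", ascending=False)
    .astype(int)
)

# Keep the top 3 per date and order for readability.
top3 = (
    merged[merged["ratio_rank"] <= 3]
    .sort_values(["date", "ratio_rank"])
    [["date", "ad", "revenue_spend_ratio", "ratio_rank"]]
    .reset_index(drop=True)
)
print(top3)
```

Because ties share a rank, a date can return more than three rows when several ad groups have the same rounded ratio.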

In [133]:
import pandas as pd

In [4]:
ad_table = ad_table_data
ad_table.head()

Unnamed: 0,date,shown,clicked,converted,avg_cost_per_click,total_revenue,ad
0,10/1/15,65877,2339,43,0.9,641.62,ad_group_1
1,10/2/15,65100,2498,38,0.94,756.37,ad_group_1
2,10/3/15,70658,2313,49,0.86,970.9,ad_group_1
3,10/4/15,69809,2833,51,1.01,907.39,ad_group_1
4,10/5/15,68186,2696,41,1.0,879.45,ad_group_1


In [6]:
spend_table = spend_table_data
spend_table.head()

Unnamed: 0,date,ad,total_spend
0,10/1/15,ad_group_1,2105.1
1,10/2/15,ad_group_1,2348.12
2,10/3/15,ad_group_1,1989.18
3,10/4/15,ad_group_1,2861.33
4,10/5/15,ad_group_1,2696.0


SELECT date, ad_group, revenue_spend_ratio, ratio_rank
FROM (
    SELECT
        a.date AS date,
        a.ad AS ad_group,
        ROUND(a.total_revenue / b.total_spend, 2) AS revenue_spend_ratio,
        RANK() OVER (
            PARTITION BY a.date
            ORDER BY ROUND(a.total_revenue / b.total_spend, 2) DESC
        ) AS ratio_rank
    FROM ad_table a
    JOIN spend_table b
        ON a.ad = b.ad AND a.date = b.date
) ranked
WHERE ratio_rank <= 3
ORDER BY date, ratio_rank
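The question asks for the date and an array of ad groups, while the query returns one row per ad group. One way to collapse those rows into a list per date is sketched below in pandas. The `ranked` frame is a hypothetical stand-in for the query result, with made-up values, and `top_ad_groups` is an illustrative column name.

```python
import pandas as pd

# Hypothetical rows in the shape the query returns (values are made up).
ranked = pd.DataFrame({
    "date": ["10/1/15", "10/1/15", "10/1/15", "10/2/15", "10/2/15", "10/2/15"],
    "ad_group": ["ad_group_2", "ad_group_3", "ad_group_1",
                 "ad_group_4", "ad_group_3", "ad_group_1"],
    "ratio_rank": [1, 2, 3, 1, 2, 3],
})

# Sort so each list is ordered best-first, then gather ad groups per date.
top_groups_by_date = (
    ranked.sort_values(["date", "ratio_rank"])
    .groupby("date")["ad_group"]
    .agg(list)
    .reset_index(name="top_ad_groups")
)
print(top_groups_by_date)
```

In SQL the equivalent step would depend on the dialect's array or string aggregation function, which the question does not specify.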

<hr style="border:1px solid black"> </hr>

## 1/14/2022 Question

Given the table below, called 'orders', write code to show the average revenue by month by channel:
    

|order_id |	channel |	date |	month |	revenue|
|---|---|---|---|---|
|1|	online |	2018-09-01 |	09 |	100|
|2|	online |	2018-09-03|	09 |	125|
|3|	in_store |	2018-10-11 |	10 |	200|
|4|	in_store |	2018-08-21 | 	08 |	80|
|5|	online |	2018-08-13 |	08 |	200|

Your result should return the following in a structured table:

 Month | Channel | Avg. Revenue 

### Approach

I will solve this problem two ways, based on the wording:

1. I will solve assuming that the problem asks for revenue to be totalled for each month of each year, and then averaged across years for each calendar month. For example, I would calculate total revenue for January 2021 and average it with the total revenue for January 2022, and so on.

To solve the above problem I will:
- import data and pandas
- drop order_id col since it isn't needed in the final result
- group the table by month, year and channel
- aggregate the revenue col to show total revenue by month and year
- group the table by month and channel, and aggregate so an average per month can be calculated

2. I will solve assuming that the problem asks for revenue to be averaged per transaction within each month. For example, the values of all online orders in September would be averaged to give the final average revenue for that month and channel.

To solve the above problem I will:
- import data and pandas
- drop order_id and date cols since they aren't needed in the final result
- group the table by month and channel
- aggregate the revenue col to show average revenue by transaction
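The two interpretations only diverge when a month has more than one order or appears in more than one year. The sketch below illustrates this with a hypothetical `orders` frame whose values are made up for the example. It is not the dataset used in this notebook.

```python
import pandas as pd

# Hypothetical orders spanning two years (values are made up for illustration).
orders = pd.DataFrame({
    "channel": ["online", "online", "online"],
    "date": ["2018-09-01", "2018-09-03", "2019-09-10"],
    "month": [9, 9, 9],
    "revenue": [100, 200, 500],
})
orders["year"] = pd.to_datetime(orders["date"]).dt.year

# Interpretation 1: total per (month, year, channel), then average the totals.
monthly_totals = orders.groupby(["month", "year", "channel"])["revenue"].sum()
avg_of_totals = monthly_totals.groupby(level=["month", "channel"]).mean()

# Interpretation 2: average revenue per order within (month, channel).
avg_per_order = orders.groupby(["month", "channel"])["revenue"].mean()

print(avg_of_totals)
print(avg_per_order)
```

Here the September online totals are 300 (2018) and 500 (2019), so the first interpretation gives 400, while the per-order average is 800 / 3.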

In [8]:
import pandas as pd

#### Approach 1

In [9]:
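# work on a copy so the in-place drops below do not modify channel_data, which Approach 2 reuses
channel_data_2022_01_01_approach_1 = channel_data.copy()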

In [108]:
channel_data_2022_01_01_approach_1.head()

Unnamed: 0,order_id,channel,date,month,revenue
0,1,online,2018-09-01,9,100
1,2,online,2018-09-03,9,125
2,3,in_store,2018-10-11,10,200
3,4,in_store,2018-08-21,8,80
4,5,online,2018-08-13,8,200


In [109]:
channel_data_2022_01_01_approach_1.drop("order_id",axis=1, inplace=True)

In [110]:
channel_data_2022_01_01_approach_1["year"]=pd.DatetimeIndex(channel_data_2022_01_01_approach_1['date']).year

In [111]:
# select only the revenue column so the non-numeric date column is not aggregated
revenue_by_month_and_year = channel_data_2022_01_01_approach_1.groupby(["month","year","channel"])[["revenue"]].agg("sum")

In [112]:
revenue_by_month_and_year

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,revenue
month,year,channel,Unnamed: 3_level_1
8,2018,in_store,80
8,2018,online,200
9,2018,online,225
10,2018,in_store,200


In [126]:
average_revenue_by_month = revenue_by_month_and_year.groupby(["month", "channel"]).agg("mean").reset_index()

In [128]:
average_revenue_by_month.rename(columns={"month":"Month","channel":"Channel","revenue":"Avg. revenue"}, inplace=True)

In [129]:
average_revenue_by_month

Unnamed: 0,Month,Channel,Avg. revenue
0,8,in_store,80
1,8,online,200
2,9,online,225
3,10,in_store,200


#### Approach 2

In [10]:
# work on a copy so the in-place drops below do not modify channel_data
channel_data_2022_01_01_approach_2 = channel_data.copy()

In [101]:
channel_data_2022_01_01_approach_2.drop("order_id",axis=1, inplace=True)
channel_data_2022_01_01_approach_2.drop("date",axis=1, inplace=True)

In [130]:
revenue_by_transaction_2 = channel_data_2022_01_01_approach_2.groupby(["month","channel"]).agg("mean").reset_index()

In [131]:
revenue_by_transaction_2.rename(columns={"month":"Month","channel":"Channel","revenue":"Avg. revenue"}, inplace=True)

In [132]:
revenue_by_transaction_2

Unnamed: 0,Month,Channel,Avg. revenue
0,8,in_store,80.0
1,8,online,200.0
2,9,online,112.5
3,10,in_store,200.0


<hr style="border:1px solid black"> </hr>

## 1/12/2022 Question

Given the information below, if you had a good first interview, what is the probability you will receive a second interview?

50% of all people who received a first interview received a second interview

95% of people that received a second interview had a good first interview

75% of people that did not receive a second interview had a good first interview

### Approach

The above statistical question can be solved using Bayes' Theorem. Bayes' Theorem states that:

P(A|B) = (P(B|A) * P(A)) / P(B)

In this case:
- A is the event of getting a second interview, so P(A) is its probability
- B is the event of having a good first interview, so P(B) is its probability
- P(B|A) is the probability of having a good first interview given that you received a second interview

From the above definition, P(A) is simply the chance of getting a second interview, which the first statement tells us is 50%.

P(B) is the probability of having a good first interview. Based on the information given, we can calculate it as the sum of the following:
- the share of people who didn't proceed to the second round (50%) but had a good first interview (75%) = 50% x 75%
- the share of people who got into the second round (50%) and had a good first interview (95%) = 50% x 95%

P(B|A) is simply the 95% statistic in the second statement.

Combining this information:

In [1]:
A = .5

In [2]:
B = .5*.75 + .5*.95

In [5]:
B_given_A = .95

In [6]:
P_A_given_B = B_given_A*A / B

In [7]:
P_A_given_B

0.5588235294117647

__According to the above, there is a ~0.5588 probability that you will receive a second interview given that you had a good first interview.__
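As a cross-check, the same calculation can be repeated with exact fractions, which avoids floating-point rounding. This is a small sketch and the variable names are illustrative only.

```python
from fractions import Fraction

# Probabilities taken from the three statements in the question.
p_second = Fraction(1, 2)                  # P(A): received a second interview
p_good_given_second = Fraction(95, 100)    # P(B|A)
p_good_given_no_second = Fraction(75, 100) # P(B|not A)

# Law of total probability for P(B), then Bayes' Theorem for P(A|B).
p_good = (p_good_given_second * p_second
          + p_good_given_no_second * (1 - p_second))
p_second_given_good = p_good_given_second * p_second / p_good

print(p_second_given_good, float(p_second_given_good))
```

The exact value is 19/34, which agrees with the decimal result above.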

<hr style="border:1px solid black"> </hr>

## 01/07 Question

You are given a dataset with information around messages sent between users in a P2P messaging application. Below is the dataset's schema:

Given this, write code to find the fraction of messages that are sent between the same sender and receiver within five minutes (i.e. the fraction of messages that receive a response within 5 minutes).

### Approach

To solve this problem, I will go through each row. For each row, I will:
- record current time, sender, and receiver id's
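- determine if a table entry exists where sender and receiver are swapped, and its time falls within the response window after the original message (the code below uses 360 seconds, which is slightly wider than the 300 seconds in 5 minutes)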
- if there is such a table entry, the count of messages within 5 minutes is incremented by 1
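The row-by-row search can also be expressed as a self-merge, which pairs each message with every message sent in the opposite direction. The sketch below is an alternative illustration, not the code used in this notebook. It assumes a strict 300-second window, and the `messages` frame holds hypothetical, made-up rows.

```python
import pandas as pd

# Hypothetical messages (timestamps in seconds; values are made up).
messages = pd.DataFrame({
    "timestamp": [0, 100, 1000, 1500, 2000],
    "sender_id": [1, 2, 1, 3, 2],
    "receiver_id": [2, 1, 3, 1, 3],
})
window_seconds = 300  # assumption: 5 minutes expressed in seconds

# Pair each message with every message going in the opposite direction.
pairs = messages.reset_index().merge(
    messages,
    left_on=["sender_id", "receiver_id"],
    right_on=["receiver_id", "sender_id"],
    suffixes=("", "_reply"),
)

# A reply counts if it comes strictly after the message and inside the window.
delta = pairs["timestamp_reply"] - pairs["timestamp"]
answered = pairs.loc[(delta > 0) & (delta < window_seconds), "index"].nunique()

fraction = answered / len(messages)
print(answered, fraction)
```

Counting unique original rows with `nunique()` ensures a message with several quick replies is only counted once, matching the loop's behaviour.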

### Result
As shown in the code below, there are 86 total rows, 23 of which matched the above criteria. <strong>This leads to a fraction of .267</strong>.

In [1]:
#import pandas for data processing
import pandas as pd

In [93]:
#import data
message_data = pd.read_csv("https://raw.githubusercontent.com/erood/interviewqs.com_code_snippets/master/Datasets/sample_message_dataset.csv")

In [94]:
#get a quick look at data and structure
message_data.head()

Unnamed: 0,date,timestamp,sender_id,receiver_id
0,2018-03-01,1519923378,1,5
1,2018-03-01,1519942810,1,4
2,2018-03-01,1519918950,1,5
3,2018-03-01,1519930114,1,2
4,2018-03-01,1519920410,1,2


In [95]:
#count total messages in the DataFrame
total_messages = len(message_data)
total_messages

86

In [98]:
#set counter to track messages that receive a reply (sender/receiver swapped) within the time window
counter=0

#for each message, search the DataFrame for a later message with sender and receiver swapped, and increment the counter if one exists
#NOTE: the window below is 360 seconds (6 minutes); a strict 5-minute window would use 300
for message_number in range(total_messages):
    time=message_data.iloc[message_number]["timestamp"]
    new_sender=message_data.iloc[message_number]["receiver_id"]
    new_receiver=message_data.iloc[message_number]["sender_id"]
    if len(message_data[(message_data["timestamp"]>time) & (message_data["timestamp"]<(time+360)) & (message_data["sender_id"]==new_sender) & (message_data["receiver_id"]==new_receiver)])>0:
        counter+=1

In [99]:
#divide messages that met criteria by total messages to get final answer
answer = counter/total_messages
answer

0.26744186046511625

<strong>Final result is .267</strong>

## To Solve

Suppose you're running a second-price auction. In this auction, the highest bidder will win, but will pay the auctioneer (you) the value of the second-highest bid. Assuming there are two bidders bidding on one item, and the bidder knows his own valuation but sees the valuation of the rival as uncertain and distributed uniformly in the unit interval, calculate the expected revenue when the reserve price is 1/2.
