### Challenge
Find the most visited URL per country per day during a week of your choice (e.g. 2018-04-01 until 2018-04-08).

### Expected output
As part of this challenge, we would like you to create a more detailed diagram of the whole system, with every resource reflected on it. Include an explanation of what it does and of your approach to solving the challenge.

Feel free to include as well any code or screenshots that are part of your solution if relevant!

### Approach
#### Athena
1. Understand the data using Athena, e.g. what the column names mean and what kinds of values and data types they hold
2. Choose which columns would be useful, i.e. country_code, url_visited, date
3. Test queries step by step, trying out the following:
    * COUNT(date) to count how many URLs were visited per country per day
    * Filter the dates: BETWEEN '2018-04-01' AND '2018-04-08'
    * Use GROUP BY on country_code, date
    * Do some research to find out how to get the most visited URL from the url column
4. Repeat testing until the query performs as expected and gives the output I am looking for

Below is the shape of the table I expect to produce after the transformation and load into my local PostgreSQL (psql) table

| country_code | date       | most_visited_url |
|--------------|------------|------------------|
| US           | 2018-04-01 | www.netflix.com  |
| CA           | 2018-04-01 | www.netflix.com  |
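
One way to get from raw click rows to a table of this shape is to count visits per country, day and URL, then keep the top-ranked URL in each country/day group with a window function. The sketch below is only an illustration, not the final Athena query: it runs the SQL against an in-memory SQLite table (window functions need SQLite 3.25 or newer) filled with made-up sample rows, and it assumes the column names `server_request_country_code`, `dt` and `event_url` from `vod_clickstream`. Ties are broken alphabetically by URL, which is an arbitrary choice.

```python
import sqlite3

# Hypothetical sample rows standing in for vod_clickstream (values are made up).
rows = [
    ("US", "2018-04-02", "https://www.netflix.com/browse"),
    ("US", "2018-04-02", "https://www.netflix.com/browse"),
    ("US", "2018-04-02", "https://www.netflix.com/title/1"),
    ("CA", "2018-04-02", "https://www.netflix.com/title/1"),
    ("CA", "2018-04-03", "https://www.netflix.com/browse"),
    ("CA", "2018-04-03", "https://www.netflix.com/title/1"),
    ("CA", "2018-04-03", "https://www.netflix.com/title/1"),
]

QUERY = """
WITH url_counts AS (
    -- Count visits for every (country, day, url) combination.
    SELECT server_request_country_code AS country_code,
           dt,
           event_url,
           COUNT(*) AS visits
    FROM vod_clickstream
    WHERE dt BETWEEN '2018-04-02' AND '2018-04-08'
    GROUP BY server_request_country_code, dt, event_url
),
ranked AS (
    -- Rank the URLs inside each (country, day) group, most visited first.
    SELECT country_code, dt, event_url,
           ROW_NUMBER() OVER (
               PARTITION BY country_code, dt
               ORDER BY visits DESC, event_url
           ) AS rn
    FROM url_counts
)
SELECT country_code, dt, event_url AS most_visited_url
FROM ranked
WHERE rn = 1
ORDER BY country_code, dt;
"""


def most_visited(sample_rows):
    """Load the sample rows into SQLite and return the top URL per country per day."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE vod_clickstream ("
        "server_request_country_code TEXT, dt TEXT, event_url TEXT)"
    )
    conn.executemany("INSERT INTO vod_clickstream VALUES (?, ?, ?)", sample_rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


result = most_visited(rows)
for row in result:
    print(row)
```

The same count-then-rank pattern should carry over to Athena, since `ROW_NUMBER() OVER (PARTITION BY ...)` is standard SQL, but the query would still need to be checked against the real table there.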

### Documenting the SQL queries I did in Athena


1. Get the first five rows to understand what kind of values each column contains
```
SELECT *
FROM vod_clickstream
LIMIT 5;
```
* After executing this query, I found that
    * `server_request_country_code` shows the country code
    * `dt` shows the full date, e.g. 2016-06-09
    * `event_url` shows the URL visited, e.g. https://www.netflix.com/browse

2. Based on the above findings, I decided to narrow down and check out these three columns
```
SELECT 
server_request_country_code, 
dt, 
event_url
FROM 
vod_clickstream
LIMIT 5;
```
Below is the output I got from Athena. It looked good: it has the info I need to find out the most visited URL for each country per day for a week of my choice
![Image showing the output from Athena after narrowing down to the 3 columns](./athena_outputs/narrow_query.png)

3. After this I decided to try and filter the dates using the BETWEEN clause
```
SELECT 
server_request_country_code, 
dt, 
event_url
FROM 
vod_clickstream
WHERE dt BETWEEN '2018-04-01' AND '2018-04-08'
LIMIT 5;
```
![Image showing the output from Athena after filtering on the dates to the week of my choice](./athena_outputs/filter_dates_query.png)

4. Next I decided to group by country code and date to check whether the result includes all the dates I need. I chose to filter on the country code for Hong Kong (`HK`) so the output is easier to read without all the other countries. In the process I was reminded that, with GROUP BY, every selected column that is not in the GROUP BY list has to be wrapped in an aggregate function, so to keep it simple I used COUNT on `event_url`
```
SELECT 
server_request_country_code,  
dt, 
COUNT(event_url) AS total_url_visited
FROM 
vod_clickstream
WHERE dt BETWEEN '2018-04-01' AND '2018-04-08'
AND server_request_country_code = 'HK'
GROUP BY server_request_country_code, dt;
```
![Image showing the output of the GROUP BY query filtering on country being Hong Kong](./athena_outputs/group_by_check_dates_query.png)

This looks great, except that the result above contains 8 days, so the extra day needs to be filtered out. After further checking, 2018-04-01 is a Sunday and I want the week to begin on a Monday, so this date will now be excluded.
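
The day count and the weekday check can be reproduced with a few lines of standard-library Python. This is just a quick sanity check, not part of the pipeline: it lists every date covered by an inclusive range from 2018-04-01 to 2018-04-08, which is how BETWEEN treats its bounds, together with its weekday name.

```python
from datetime import date, timedelta

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

start, end = date(2018, 4, 1), date(2018, 4, 8)

# BETWEEN includes both boundary dates, hence the "+ 1".
days = [start + timedelta(days=i) for i in range((end - start).days + 1)]

# Map each ISO date string to its weekday name (date.weekday(): Monday is 0).
day_names = {d.isoformat(): WEEKDAYS[d.weekday()] for d in days}

for iso_date, name in day_names.items():
    print(iso_date, name)
```

The range covers 8 dates, starting and ending on a Sunday, so dropping 2018-04-01 leaves a Monday-to-Sunday week.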

5. Adjusted the BETWEEN clause so 2018-04-01 is not included
```
SELECT 
server_request_country_code,  
dt, 
COUNT(event_url) AS total_url_visited
FROM 
vod_clickstream
WHERE dt BETWEEN '2018-04-02' AND '2018-04-08'
AND server_request_country_code = 'HK'
GROUP BY server_request_country_code, dt;
```
![Image showing 2018-04-01 not included in the results anymore](./athena_outputs/optimised_between_clause_query.png)

In [1]:
import boto3
import time
import psycopg2
from dotenv import load_dotenv
import os