# Exercise #2 - Popular POI
You are given two datasets:
* Pickup/Dropoff locations of taxi rides in New York - `rides_small_ds.csv`
* Points of interest (POI) in New York City - `poi_small_ds.csv`

Your goal is to find the most popular places people visit during morning and evening hours. Morning is defined as 6 AM to 11 AM, and evening as 5 PM to 11 PM (17:00 to 23:00).

We assume that a place was visited if it is located within 300 meters of a pickup / dropoff location.

The result should be a dataframe listing each place, its type, and its number of visits, sorted by the number of visits in descending order.
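The morning and evening windows can be checked on a few timestamps in plain Python before writing the Spark filter. This is only a sketch: `time_window` is a hypothetical helper, the sample timestamps are made up, and it assumes both windows are half-open (11:00 and 23:00 themselves are excluded), which the exercise text does not specify.

```python
from datetime import datetime

# Assumption: windows are half-open, [6, 11) and [17, 23).
MORNING = (6, 11)
EVENING = (17, 23)

def time_window(ts):
    """Return 'morning', 'evening', or None for a datetime (hypothetical helper)."""
    hour = ts.hour
    if MORNING[0] <= hour < MORNING[1]:
        return "morning"
    if EVENING[0] <= hour < EVENING[1]:
        return "evening"
    return None

# Made-up timestamps, not taken from the dataset.
print(time_window(datetime(2015, 1, 1, 6, 0)))    # morning
print(time_window(datetime(2015, 1, 1, 11, 0)))   # None
print(time_window(datetime(2015, 1, 1, 22, 59)))  # evening
```

In PySpark the same condition can be expressed on a timestamp column with `F.hour` and ordinary comparisons. Adjust the boundaries if you read the ranges as inclusive.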

### Haversine Formula
The haversine formula determines the great-circle distance between two points on a sphere given their longitudes and latitudes. You should use it to find the distance in meters between two geographic locations.

A function that implements it is already defined below. You can wrap it in a UDF, or study the algorithm and reimplement it with native PySpark functions.

In [None]:
import os
import sys
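# Raw string so the backslashes in the Windows path are not treated as escapes.
# Adjust the path to your local JDK installation.
os.environ["JAVA_HOME"] = r"C:\Program Files\Java\jdk1.8.0_212"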
os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable

In [None]:
import pyspark
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

In [None]:
spark = SparkSession.builder \
    .appName('poi_exercise') \
    .master('local[*]') \
    .config('spark.sql.execution.arrow.pyspark.enabled', True) \
    .config('spark.driver.memory','10G') \
    .config('spark.sql.repl.eagerEval.enabled', True) \
    .config('spark.network.timeout', '400s') \
    .config('spark.storage.blockManagerSlaveTimeoutMs', '400s') \
    .config('spark.executor.heartbeatInterval', '300s') \
    .getOrCreate()

In [None]:
import math

def haversine_formula(lat1, lon1, lat2, lon2):
    """
    Return the distance between two coordinates, in meters.
    """
    lat1 = math.pi / 180.0 * lat1
    lon1 = math.pi / 180.0 * lon1
    lat2 = math.pi / 180.0 * lat2
    lon2 = math.pi / 180.0 * lon2
    radius_km = 6371  # mean Earth radius, km
    meters_per_km = 1000

    # Use the haversine formula:
    dlon = lon2 - lon1
    dlat = lat2 - lat1

    a = math.pow(math.sin(dlat/2),2) + math.cos(lat1) * math.cos(lat2) * math.pow(math.sin(dlon/2),2)
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1-a))
    # Convert the great-circle distance from kilometers to meters.
    return radius_km * c * meters_per_km
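A quick way to gain confidence in a reimplementation is to compare it against values that can be derived by hand. The sketch below restates the formula compactly in plain Python (`haversine_m` and the two constants are hypothetical names, and the coordinates are made up) and checks that one degree of latitude comes out at roughly 111 km, then applies the 300 meter threshold to two nearby points.

```python
import math

EARTH_RADIUS_M = 6371 * 1000  # same mean radius as the function above, in meters
MAX_DISTANCE_M = 300          # "visited" threshold from the exercise text

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.atan2(math.sqrt(a), math.sqrt(1 - a))

# One degree of latitude along a meridian is R * pi / 180.
one_degree = haversine_m(40.0, -74.0, 41.0, -74.0)
print(round(one_degree))  # 111195

# Points about 289 m and about 311 m apart along the same meridian.
print(haversine_m(40.7500, -73.99, 40.7526, -73.99) <= MAX_DISTANCE_M)  # True
print(haversine_m(40.7500, -73.99, 40.7528, -73.99) <= MAX_DISTANCE_M)  # False
```

If your PySpark version of the distance produces noticeably different numbers for the same inputs, a common cause is forgetting to convert degrees to radians.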

### Steps:

1. Load the data from the files and, if needed, change column names and / or types
2. Filter the data as described in the exercise
3. Join the data (in this case, a cross join should do it)
4. Calculate the distance from each place to the pickup / dropoff location (tip: add `.cache()` after your calculations to prevent them from being recomputed on every action)
5. Keep only the rows that are considered close (as described above)
6. Count visits per place and sort by the number of visits in descending order, so the most visited place comes first
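Steps 3 to 6 can be rehearsed on a handful of rows in plain Python, which gives a small reference result to compare against your Spark output. This is a sketch under stated assumptions, not the intended solution: the toy rows, the POI names, and the tuple layout are invented, and the real column names in the CSV files may differ.

```python
import math
from collections import Counter
from itertools import product

EARTH_RADIUS_M = 6371 * 1000
MAX_DISTANCE_M = 300

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.atan2(math.sqrt(a), math.sqrt(1 - a))

# Hypothetical toy data: (lat, lon) per ride location, already time-filtered.
rides = [
    (40.7500, -73.9900),
    (40.7501, -73.9901),
    (40.8000, -73.9500),
]
# Hypothetical toy data: (name, type, lat, lon) per place.
pois = [
    ("cafe_a", "cafe", 40.7502, -73.9900),
    ("park_b", "park", 40.8001, -73.9500),
    ("museum_c", "museum", 40.7000, -74.0100),
]

visits = Counter()
# Step 3: cross join. Steps 4 and 5: compute the distance and keep close pairs.
for (r_lat, r_lon), (name, kind, p_lat, p_lon) in product(rides, pois):
    if haversine_m(r_lat, r_lon, p_lat, p_lon) <= MAX_DISTANCE_M:
        visits[(name, kind)] += 1

# Step 6: sort by visit count, most visited first.
ranked = sorted(visits.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)  # [(('cafe_a', 'cafe'), 2), (('park_b', 'park'), 1)]
```

Note that places with no nearby ride (here `museum_c`) drop out of the result in this sketch. The nested loop also shows why the cross join is expensive: the number of pairs is the product of the two row counts.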

Good Luck!

If you are not sure how to do something, Google it first.