# Evaluation criteria

The goal of this assignment is to get a view on your hands-on "data engineering" skills.  
At our company, our data scientists and engineers collaborate on projects.  
Your main focus will be creating performant & robust data flows.  
For a take-home-assignment, we cannot grant you access to our infrastructure.  
The assignment below measures your proficiency in general programming, data science & engineering tasks using Python.  
Completion should not take more than half a day.

**We expect you to be proficient in:**
 * SQL queries (Sybase IQ system)
 * ETL flows (In collaboration with existing teams)
 * General python to glue it all together
 * Python data science ecosystem (Pandas + SKlearn)
 
**In this exercise we expect you to demonstrate your ability to / knowledge of:**
 * Building a data science runtime
 * PEP8 / Google python styleguide
 * Efficiently getting the job done
 * Choosing meaningful names for variables & functions
 * Writing maintainable code (yes, you might need to document some steps)
 * Helping a data scientist present interactive results
 * Offering predictions via a REST API

# Setting-up a data science workspace

We allow you full freedom in setting up a data science runtime.  
The main objective is having a runtime where you can run this notebook and the code you will develop.  
You can choose for a local setup on your pc, or even a cloud setup if you're up for it.   

**In your environment, you will need things for:**
 * HTTPS requests
 * python3 (not python2 !!)
 * (geo)pandas
 * interactive maps (e.g. folium, altair, ...)
 * REST apis
 
**Deliverables we expect**:
 * notebook with the completed assignment
 * list of packages for your runtime (e.g. yml or txt file)
 * evidence of a working API endpoint

# Importing packages

We would like you to put all your import statements here, together in 1 place.  
Before submitting, please make sure you remove any unused imports :-)  

In [29]:
## your imports go here.  You get pandas for free.

import pandas as pd
import json
from urllib.request import urlopen
import unittest
import ssl

# Data ingestion exercises

## Getting store location data from an API

**Goal:** Obtain a pandas dataframe  
**Hint:** You will need to normalise/flatten the json, because it contains multiple levels  
**API call:** https://ecgplacesmw.colruytgroup.com/ecgplacesmw/v3/nl/places/filter/clp-places  

In [30]:
## your code goes here
def get_clp_places(url: str) -> pd.DataFrame:
    """Fetch JSON from the given URL and flatten it into a dataframe.

    Args:
        url (str): URL of the JSON endpoint.

    Returns:
        pd.DataFrame: Normalized (flattened) data from the URL.
    """
    # create_default_context() returns a new context with secure default settings.
    # Setting the ciphers explicitly was needed after an Anaconda upgrade,
    # when an SSLv3 handshake failure started to appear.
    ctx = ssl.create_default_context()
    ctx.set_ciphers('DEFAULT')
    with urlopen(url, context=ctx) as response:
        data = json.load(response)
    return pd.json_normalize(data)
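
To illustrate the flattening hinted at above, here is a minimal sketch of what `pd.json_normalize` does with nested records. The records are hypothetical and only shaped like a nested API response; they are not real data from the endpoint.

```python
import pandas as pd

# Hypothetical records shaped like a nested API response (not real data).
records = [
    {"placeId": 1,
     "address": {"cityName": "CITY A", "postalcode": "1000"},
     "geoCoordinates": {"latitude": 50.93, "longitude": 4.05}},
    {"placeId": 2,
     "address": {"cityName": "CITY B", "postalcode": "2000"},
     "geoCoordinates": {"latitude": 51.08, "longitude": 3.45}},
]

# Nested keys become dotted column names such as "address.cityName".
flat = pd.json_normalize(records)
print(sorted(flat.columns))
```

Lists nested inside a record (such as opening hours) are left as Python lists in a single column; flattening those further would need the `record_path` argument or a separate step.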


In [31]:
df_clp = get_clp_places("https://ecgplacesmw.colruytgroup.com/ecgplacesmw/v3/nl/places/filter/clp-places")

print(df_clp.head(2))

   placeId    commercialName branchId sourceStatus      sellingPartners  \
0      902   AALST (COLRUYT)     4156           AC  [QUALITY, 3RDPARTY]   
1      946  AALTER (COLRUYT)     4218           AC  [QUALITY, 3RDPARTY]   

                   handoverServices  \
0  [CSOP_ORDERABLE, PREPAID_PARCEL]   
1  [CSOP_ORDERABLE, PREPAID_PARCEL]   

                                         moreInfoUrl  \
0  https://www.colruyt.be/nl/colruyt-openingsuren...   
1  https://www.colruyt.be/nl/colruyt-openingsuren...   

                                            routeUrl  isActive  \
0  https://maps.apple.com/?daddr=50.933074,4.0538972      True   
1  https://maps.apple.com/?daddr=51.0784761,3.450...      True   

                             placeSearchOpeningHours  ...  placeType.id  \
0  [{'date': '16-05-2022', 'opens': 830, 'closes'...  ...             1   
1  [{'date': '16-05-2022', 'opens': 830, 'closes'...  ...             1   

  placeType.longName  placeType.placeTypeDescription geoCoordi

### Quality checks

We would like you to add several checks on this data based on these constraints:  
 * records > 200
 * latitude between 49 and 52
 * longitude between 2 and 7
 
We don't want you to create a full-blown test suite here; we're just gonna use 'asserts' from unittest.

In [32]:
## Testcase execution below
tc = unittest.TestCase('__init__')
tc.assertGreater(len(df_clp.index), 200, 'Record count is not greater than 200')
tc.assertTrue(
    df_clp['geoCoordinates.latitude'].min() > 49
    and df_clp['geoCoordinates.latitude'].max() < 52,
    'Latitude is not within 49 and 52')
tc.assertTrue(
    df_clp['geoCoordinates.longitude'].min() > 2
    and df_clp['geoCoordinates.longitude'].max() < 7,
    'Longitude is not within 2 and 7')
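
As an alternative sketch of the same constraints, the checks can be collected in a small helper that reports every violation at once. The function name `check_places` and the sample data are hypothetical; only the column names and thresholds come from the exercise. Note that `Series.between` is inclusive and treats missing values as violations, whereas `min()`/`max()` silently skip them.

```python
import pandas as pd

def check_places(df, min_records=200, lat_range=(49, 52), lon_range=(2, 7)):
    """Return a list of human-readable constraint violations (empty if none)."""
    problems = []
    if len(df) <= min_records:
        problems.append(f"expected more than {min_records} records, got {len(df)}")
    # between() is inclusive on both ends and yields False for NaN.
    lat_ok = df["geoCoordinates.latitude"].between(*lat_range)
    lon_ok = df["geoCoordinates.longitude"].between(*lon_range)
    if not lat_ok.all():
        problems.append(f"{(~lat_ok).sum()} latitude value(s) outside {lat_range}")
    if not lon_ok.all():
        problems.append(f"{(~lon_ok).sum()} longitude value(s) outside {lon_range}")
    return problems

# Hypothetical sample with one latitude outside the allowed range.
sample = pd.DataFrame({
    "geoCoordinates.latitude": [50.9, 51.1, 48.0],
    "geoCoordinates.longitude": [4.05, 3.45, 4.40],
})
print(check_places(sample, min_records=2))
```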


### Feature creation

Create a new column "antwerpen" which is 1 for all stores in Antwerpen (province) and 0 for all others 

In [33]:
## Add a new column "antwerpen": 1 for stores in Antwerpen, 0 for all others.
## Note: this matches on the city name, which only approximates the province.

df_clp['antwerpen'] = (
    df_clp['address.cityName'].str.upper().str.contains('ANTWERPEN', na=False).astype(int)
)

df_clp["antwerpen"].value_counts()

0    247
1      5
Name: antwerpen, dtype: int64
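
Matching on the city name only finds stores whose city contains "Antwerpen", not every store in the province. A possible refinement, sketched below, relies on the fact that Belgian postal codes 2000 to 2999 belong to the province of Antwerpen. The column name `address.postalcode` and the sample rows are assumptions for illustration; the actual field name in the API response may differ.

```python
import pandas as pd

# Hypothetical sample; the postal-code column name is an assumption.
stores = pd.DataFrame({
    "address.cityName": ["ANTWERPEN", "MECHELEN", "AALST", "TURNHOUT"],
    "address.postalcode": ["2000", "2800", "9300", "2300"],
})

# Convert to numbers; unparseable codes become NaN and are flagged as 0.
postal = pd.to_numeric(stores["address.postalcode"], errors="coerce")
stores["antwerpen"] = postal.between(2000, 2999).astype(int)
print(stores["antwerpen"].tolist())
```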

## Predict used car value

A data scientist in our team made a basic model to predict car prices.  
The model was saved to disk ('lgbr_cars.model') using joblib's dump functionality.  
Documentation states the model is a LightGBM Regressor, trained using the scikit-learn API.  

**As an engineer, your task is to expose this model as a REST API.** 

First, retrieve the model via the function below.  
Change the path according to your setup.  

In [2]:
import requests
import json

In [6]:
# Sample input for the functional test (expected prediction: 14026.35)
model_test_input = [[3,1,190,-1,125000,5,3,1]]

In [7]:
# Sample input for the API test (expected prediction: 13920.70)
model_test_input = [[-1,1,0,118,150000,0,1,38]]

In [9]:
#Calling the API for car price prediction
url = "http://127.0.0.1:2000/car_price_prediction"
headers = {"Content-Type": "application/json"}

querystring={"model_test_input" : model_test_input[0]}
response = requests.post(url,headers=headers,json=querystring)
response.json()

{'predicted': 13920.704635637961}

Now that you have your trained model, let's do a functional test based on the parameters below.  
You have to present the parameters in this order.  

* vehicleType: coupe
* gearbox: manuell
* powerPS: 190
* model: NaN
* kilometer: 125000
* monthOfRegistration: 5 
* fuelType: diesel
* brand: audi

Based on these parameters, you should get a predicted value of 14026.35068804.  
However, the model doesn't accept string inputs; see the integer encoding below:

In [3]:
# The input is defined in the cells above; the API is defined in car_price_prediction.py

Now that you have this model up and running, we want you to **expose it as a REST API.**  
We don't expect you to set up any authentication.  
We're not looking for beautiful inputs, just make it work.  
**Building this endpoint should NOT be done in a notebook, but in proper .py file(s)**

Once it's up and running, use it to predict the following input:
* [-1,1,0,118,150000,0,1,38] ==> prediction should be 13920.70
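
The contents of `car_price_prediction.py` are not shown in this notebook, so the following is only a minimal sketch of what such an endpoint could look like, assuming Flask is used. The route, the `model_test_input` key and the `predicted` response key mirror the client call shown earlier. `StubModel` is a hypothetical stand-in so the sketch runs without the model file; in a real setup it would presumably be replaced by `joblib.load('lgbr_cars.model')`.

```python
from flask import Flask, jsonify, request

class StubModel:
    """Hypothetical stand-in for the real regressor (not the actual model)."""

    def predict(self, rows):
        # Placeholder logic: the real model would return a price estimate.
        return [float(sum(row)) for row in rows]

app = Flask(__name__)
model = StubModel()  # assumption: replace with joblib.load('lgbr_cars.model')

@app.route("/car_price_prediction", methods=["POST"])
def car_price_prediction():
    """Predict a price for one car described by 8 encoded features."""
    payload = request.get_json(silent=True) or {}
    features = payload.get("model_test_input")
    if not isinstance(features, list) or len(features) != 8:
        return jsonify({"error": "model_test_input must be a list of 8 numbers"}), 400
    prediction = model.predict([features])[0]
    return jsonify({"predicted": float(prediction)})
```

Saved as `car_price_prediction.py`, this could be served locally with `flask --app car_price_prediction run --port 2000`, which matches the URL used in the client cell.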

## Geospatial data exercise
The goal of this exercise is to read in some data from a shape file and visualize it on a map
- The map should be dynamic. I want to zoom in and out to see more interesting aspects of the map
- We want you to visualize the statistical sectors within a distance of 2 km of your home location.

Specific steps to take:
- Read in the shape file
- Transform to WGS coordinates
- Create a distance function (Haversine)
- Create variables for home_lat, home_lon and perimeter_distance
- Calculate centroid for each nis district
- Calculate the distance to home for each nis district centroid 
- Figure out which nis districts are near your home
- Create dynamic zoomable map
- Visualize the nis districts near you (centroid <2km away), on the map


In [2]:
# Some imports to help you along the way
import geopandas as gpd
from math import radians
import math
import folium
from IPython.display import display


In [3]:
# part 1: Reading in the data
# fetched this file from https://statbel.fgov.be/sites/default/files/files/opendata/Statistische%20sectoren/sh_statbel_statistical_sectors_20200101.shp.zip 
df = gpd.read_file(r'C:\Users\dmh8w17\Desktop\DSE\sh_statbel_statistical_sectors_20200101.shp\sh_statbel_statistical_sectors_20200101.shp')
df = df.to_crs(epsg=4326)  # reproject to WGS84; the {'init': ...} syntax is deprecated


In [4]:
# One of the data scientists discovered stackoverflow ;-) and copypasted something from https://gis.stackexchange.com/questions/166820/geopandas-return-lat-and-long-of-a-centroid-point
# A data science engineer should be able to speed this next code up
'''
for i in range(0, len(df)):
    df.loc[i,'centroid_lon'] = df.geometry.centroid.x.iloc[i]
    df.loc[i,'centroid_lat'] = df.geometry.centroid.y.iloc[i]
'''

# To speed up the centroid calculation, use the vectorized .centroid property
# instead of looping over every row; this saves a lot of time.
# Extraction of latitude and longitude is moved to the haversine function.
# Note: geopandas warns that centroids computed in a geographic CRS can be slightly off.
df['centroid'] = df.centroid


In [7]:
# Let's create some variables to indicate the location of your interest 
home_lat = float(input('Enter home latitude: '))   # e.g. 50.85045
home_lon = float(input('Enter home longitude: '))  # e.g. 4.34878
perimeter_distance = 2  # km

In [10]:
# At some point we will need a distance function (google the Haversine formula, and implement it)
def haversine(home_lat: float, home_lon: float, centroid) -> float:
    """Calculate the great-circle distance between home and a centroid.

    Args:
        home_lat (float): Latitude of the home location, in degrees.
        home_lon (float): Longitude of the home location, in degrees.
        centroid: Point with x (longitude) and y (latitude), in degrees.

    Returns:
        float: Distance between the two locations, in km.
    """
    earth_radius_km = 6371.0
    r_home_lat = radians(home_lat)
    r_home_lon = radians(home_lon)
    r_centroid_lat = radians(centroid.y)
    r_centroid_lon = radians(centroid.x)
    dlon = r_home_lon - r_centroid_lon
    dlat = r_home_lat - r_centroid_lat
    a = (math.sin(dlat / 2)**2
         + math.cos(r_home_lat) * math.cos(r_centroid_lat) * math.sin(dlon / 2)**2)
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    return earth_radius_km * c


Next, implement some sanity checks for your distance function 
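
A few sanity checks for a Haversine implementation could look like the sketch below. To keep it self-contained it defines its own `haversine_km`, a hypothetical variant that takes plain coordinates instead of a centroid object; the reference coordinates for Brussels and Antwerp are approximate.

```python
import math

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2)**2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2)**2
    a = min(1.0, a)  # guard against rounding slightly above 1
    return 2 * EARTH_RADIUS_KM * math.atan2(math.sqrt(a), math.sqrt(1 - a))

# 1. The distance from a point to itself is zero.
assert haversine_km(50.85, 4.35, 50.85, 4.35) == 0.0

# 2. The distance is symmetric.
assert math.isclose(haversine_km(50.85, 4.35, 51.22, 4.40),
                    haversine_km(51.22, 4.40, 50.85, 4.35))

# 3. One degree of latitude along a meridian equals R * pi / 180 (about 111.19 km).
assert math.isclose(haversine_km(50.0, 4.0, 51.0, 4.0),
                    EARTH_RADIUS_KM * math.pi / 180, rel_tol=1e-9)

# 4. Brussels to Antwerp (approximate coordinates) is roughly 41 km.
assert 40 < haversine_km(50.8503, 4.3517, 51.2194, 4.4025) < 43
```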

In [14]:
# Apply the distance function to every centroid; home_lat and home_lon are
# constants, so there is no need to store them as dataframe columns.
df['distance'] = df['centroid'].apply(lambda c: haversine(home_lat, home_lon, c))
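
The `apply` call above still invokes a Python function once per row. If that becomes a bottleneck, the same formula can be evaluated on whole arrays with NumPy. This is a sketch under the assumption that latitudes and longitudes are available as arrays; `haversine_km_vec` is a hypothetical name, and it uses the equivalent `arcsin` form of the formula.

```python
import numpy as np

def haversine_km_vec(home_lat, home_lon, lats, lons, radius_km=6371.0):
    """Vectorized great-circle distance in km from home to many points (degrees)."""
    lat1, lon1 = np.radians(home_lat), np.radians(home_lon)
    lat2 = np.radians(np.asarray(lats, dtype=float))
    lon2 = np.radians(np.asarray(lons, dtype=float))
    a = np.sin((lat2 - lat1) / 2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2)**2
    # Clip to [0, 1] to protect arcsin against floating-point overshoot.
    return 2 * radius_km * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))

# Example: home itself, and a point one degree of latitude further north.
distances = haversine_km_vec(50.85045, 4.34878,
                             [50.85045, 51.85045], [4.34878, 4.34878])
print(distances)
```

With the geodataframe used here, the inputs could plausibly be taken from `df['centroid'].y` and `df['centroid'].x`.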

Now, create a dynamic map 

In [15]:
# Create a new dataframe with only the sectors within our perimeter distance
df_poi = df.loc[df['distance'] < perimeter_distance, :]

In [16]:
#Creating a map with focus on home location
m=folium.Map(location=[home_lat,home_lon], zoom_start=13)

In [17]:
#adding a crimson circle to highlight home location
folium.Circle(
    radius=100,
    location=[home_lat,home_lon],
    popup="Home",
    color="crimson",
    fill=False,
).add_to(m)

<folium.vector_layers.Circle at 0x2aa1c001ab0>

In [18]:
#Highlighting NIS in map which have centroid less than 2 km from home location 
for _, r in df_poi.iterrows():
    sim_geo = gpd.GeoSeries(r['geometry']).simplify(tolerance=0.001)
    geo_j = sim_geo.to_json()
    geo_j = folium.GeoJson(data=geo_j,
                           style_function=lambda x: {'fillColor': 'orange'})
    folium.Popup(r['T_SEC_NL']).add_to(geo_j)
    geo_j.add_to(m)

In [19]:
display(m)