# NYC Taxi demo Notebooks and KQL

_**<u>Source: [https://github.com/Azure/kusto-adx-cse](https://github.com/Azure/kusto-adx-cse)</u>**_

This notebook uses publicly available Python packages.

**<u>Goal:</u>**   
Query a publicly available dataset ([NYC taxi](https://learn.microsoft.com/en-us/azure/open-datasets/dataset-taxi-yellow?tabs=azureml-opendatasets)) and use a basic clustering ML model to detect the busiest taxi pickup hot spots in New York City.

Note that, as a prerequisite, your Kusto database must already have this data ingested. Reach out to your buddy for help with ingestion.

## High level notebook workflow

- Load up our dependencies using import commands
- Load up the [KQL magic](https://pypi.org/project/Kqlmagic/) package to allow connectivity to Kusto
- Authenticate to the Kusto database
- Demo a few KQL commands to showcase KQL interactivity through a Jupyter notebook
- Train a model on a fraction of the data
- Display a graphical rendering of the clustering results on New York City taxi pickup locations

So let's start by loading the NumPy and pandas packages we need for the renders later...

In [None]:
import numpy as np
import pandas as pd

**_Missing module from GitHub_**

Install the required scikit-learn module via pip:

In [None]:
!pip install scikit-learn

Then we load the matplotlib packages for the graphs.

In [None]:
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

Now it's time to install the KQL magic package, which enables the Kusto connectivity.

In [None]:
!pip install Kqlmagic --no-cache-dir --upgrade

We load the package into memory.

In [None]:
%reload_ext Kqlmagic

Now we connect to your cluster URL. You can find it on the Kusto database landing page, in the database details section.

This uses the Device Code authentication flow: it gives you a code to enter and then asks you to authenticate with your AAD credentials. Single sign-on is difficult because the notebook runs on separate compute, so the device code flow is much easier. Be aware that some organizations block this flow with AAD conditional access. Talk to your administrators if you run into authentication issues, and ask us for help as well.

In [None]:
%kql kusto://code;cluster='https://trd25r7f2pc3u6y5rj890s.z9.kusto.data.microsoft.com';database='KustoDemo'

This step returns the number of rows in the "trips2" table that fall in 2014 and have pickup coordinates.

In [None]:
%%kql trips2
| extend 
  pickup_datetime= tpepPickupDateTime
, pickup_latitude = startLat
, pickup_longitude = startLon
| where pickup_datetime between (datetime(2014-01-01)..datetime(2014-12-31))
| where isnotempty(pickup_latitude) and isnotempty(pickup_longitude)
| count

This cell shows that the render command is also available through KQL magic. Note that here the chart comes from the KQL render operator, not from Python plotting code.

In [None]:
%%kql      // Note the %% magic syntax to send full cell contents to ADX (including comment marker //)
trips2
| extend 
  pickup_datetime= tpepPickupDateTime
, pickup_latitude = startLat
, pickup_longitude = startLon
| where pickup_datetime  between (datetime(2014-01-01)..datetime(2014-12-31))
| summarize count() by bin_at(pickup_datetime, 7d, datetime(2014-01-01))
| render timechart with(title='NYC 2014 Taxi Rides count per week')
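
To make the weekly grouping concrete, here is a minimal sketch of what `bin_at(pickup_datetime, 7d, datetime(2014-01-01))` computes, written in plain Python so it runs without a cluster. The function name `bin_at` mirrors the KQL function, but the implementation and the sample timestamps are our own illustration, not part of Kusto or Kqlmagic.

```python
from collections import Counter
from datetime import datetime, timedelta


def bin_at(value, bin_size, fixed_point):
    """Round `value` down to the start of its bin, with bins aligned to `fixed_point`."""
    # timedelta // timedelta floors toward negative infinity, so values
    # before the fixed point also land on the start of their bin.
    steps = (value - fixed_point) // bin_size
    return fixed_point + steps * bin_size


# Hypothetical pickup timestamps, for illustration only
pickups = [
    datetime(2014, 1, 1, 0, 5),
    datetime(2014, 1, 7, 23, 59),
    datetime(2014, 1, 8, 0, 0),
    datetime(2014, 1, 9, 13, 0),
]

week = timedelta(days=7)
origin = datetime(2014, 1, 1)

# Equivalent of: summarize count() by bin_at(pickup_datetime, 7d, origin)
rides_per_week = Counter(bin_at(t, week, origin) for t in pickups)
for start, n in sorted(rides_per_week.items()):
    print(start.date(), n)
```

Each week starts on the same weekday as the fixed point, so the first bin covers January 1 to January 7 inclusive.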

Now we start working on a more detailed query that puts each trip in context using its geographic coordinates.

In [None]:
%%kql trips2 
| extend 
  pickup_datetime= tpepPickupDateTime
, dropoff_datetime = tpepDropoffDateTime
, pickup_latitude = startLat
, pickup_longitude = startLon
, dropoff_longitude = endLon
, dropoff_latitude = endLat
, vendor_id=vendorID
| where isnotempty(pickup_latitude) and isnotempty(pickup_longitude)
| project vendor_id, pickup_datetime, dropoff_datetime,pickup_longitude, pickup_latitude, dropoff_longitude,dropoff_latitude
| take 3

Here we set the geographic boundaries of the NYC area, which we use later to filter pickups.

In [None]:
# Define NYC area limits

south=40.61
north=40.91
west=-74.06
east=-73.77

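1. Specify the KQL query
2. Implement a simple result cache in a local binary (pickle) file, keyed on a hash of the KQL query string

NOTE: Python's built-in hash() is randomized per process for strings, and PYTHONHASHSEED only takes effect when it is set before the kernel starts, so setting it from a notebook cell does not make the file names stable. The cache key below uses a hashlib digest instead.

In [None]:
import hashlib

def adx_query(q):
    # Stable cache file name derived from the query text
    fn = "df" + hashlib.sha256(q.encode("utf-8")).hexdigest()[:16] + ".pkl"
    try:
        df = pd.read_pickle(fn)
        print("Load df from " + fn)
        return df
    except FileNotFoundError:
        # Cache miss: run the query through Kqlmagic
        print("Execute query...")
        %kql res << -query q
        try:
            df = res.to_dataframe()
            print("Save df to " + fn)
            df.to_pickle(fn)
            print("\n", df.shape, "\n", df.columns)
            return df
        except Exception as ex:
            print(ex)
            return None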

The `q` variable holds our main KQL code. It aggregates all pickups within our geographic boundary, grouped by coordinates rounded to 4 decimal places.

In [None]:
q = '''
set notruncation;
let South=south; let North=north; let West=west; let East=east; // copy Python variables to ADX
trips2
| extend 
  pickup_datetime= tpepPickupDateTime
, dropoff_datetime = tpepDropoffDateTime
, pickup_latitude = startLat
, pickup_longitude = startLon
, dropoff_longitude = endLon
, dropoff_latitude = endLat
, vendor_id=vendorID
| where pickup_datetime between (datetime(2014-01-01)..datetime(2014-12-31))
| where isnotempty(pickup_latitude) and isnotempty(pickup_longitude)
| extend Lat=round(pickup_latitude, 4), Long=round(pickup_longitude, 4)
| where Lat between(South..North) and Long between(West..East)
| summarize num_pickups=count() by Lat, Long
'''

aggr_pickups = adx_query(q)
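
The aggregation in this query can be tried offline. The sketch below mirrors the `round(..., 4)`, `between` and `summarize count() by Lat, Long` steps in plain Python; the helper name `aggregate_pickups` and the sample coordinates are hypothetical. Rounding to 4 decimal places groups pickups into cells of roughly 11 m in latitude (somewhat less in longitude at New York's latitude). Python's `round` may treat exact ties differently from KQL's, which does not matter for this illustration.

```python
from collections import Counter

# Same bounding box as the notebook's NYC area limits
south, north, west, east = 40.61, 40.91, -74.06, -73.77


def aggregate_pickups(points, decimals=4):
    """Count pickups per rounded (Lat, Long) cell inside the bounding box."""
    counts = Counter()
    for lat, lon in points:
        cell = (round(lat, decimals), round(lon, decimals))
        # KQL `between` is inclusive on both ends
        if south <= cell[0] <= north and west <= cell[1] <= east:
            counts[cell] += 1
    return counts


# Hypothetical coordinates, for illustration only
points = [
    (40.75012, -73.99321),  # midtown
    (40.75008, -73.99318),  # a few metres away: same cell
    (40.64131, -73.77812),  # near the south-east corner of the box
    (41.20000, -73.90000),  # north of the box: dropped
]

cells = aggregate_pickups(points)
for (lat, lon), n in sorted(cells.items()):
    print(lat, lon, n)
```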

We show the last 4 rows of the dataframe...

In [None]:
print(aggr_pickups[-4:])

Initialize graphics for heatmap

In [None]:
new_style = {'grid':False}
matplotlib.rc('axes', **new_style)
from matplotlib import rcParams
rcParams['figure.figsize'] = [15, 15]

We draw a map by plotting the aggregated pickup locations as a scatter plot. Does this look familiar?

In [None]:
plt.style.use('dark_background')
p = aggr_pickups.plot(kind='scatter', x='Long', y='Lat', color='white', xlim=(west, east), ylim=(south, north), s=0.02, alpha=0.6)

We grab a small subset of the data (0.1%) for training.

In [None]:
q = '''
set notruncation;
let South=south; let North=north; let West=west; let East=east; // copy Python variables to ADX
let sf=0.001; // Extract 0.1% of the raw data
trips2
| extend 
  pickup_datetime= tpepPickupDateTime
, dropoff_datetime = tpepDropoffDateTime
, pickup_latitude = startLat
, pickup_longitude = startLon
, dropoff_longitude = endLon
, dropoff_latitude = endLat
, vendor_id=vendorID
| where pickup_datetime between (datetime(2014-01-01)..datetime(2014-12-31))
| where pickup_latitude between(South..North) and pickup_longitude between(West..East)
| project pickup_datetime, pickup_latitude, pickup_longitude
| where rand() < sf'''

df = adx_query(q)
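
The `where rand() < sf` filter keeps each row independently with probability `sf`, so the sample size is close to, but usually not exactly, 0.1% of the rows. A minimal sketch of the same idea in plain Python follows; the helper `bernoulli_sample` and its `seed` parameter are our own additions for reproducibility. The KQL query above is not seeded, so a fresh run would return a different sample, although the pickle cache keeps returning the first result.

```python
import random


def bernoulli_sample(rows, fraction, seed=None):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)
    # rng.random() is uniform on [0, 1), like KQL rand() with no arguments
    return [row for row in rows if rng.random() < fraction]


rows = range(100_000)  # stand-in for table rows
sample = bernoulli_sample(rows, 0.001, seed=0)

# The size is near 100 (0.1% of 100,000) but varies from seed to seed
print(len(sample))
```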

We define the clustering function

In [None]:
def KMeans_clustering(k, features):
    from sklearn.cluster import KMeans, MiniBatchKMeans
    km = MiniBatchKMeans(n_clusters=k) if features.shape[0] > 1000 else KMeans(n_clusters=k)
    km.fit(features)
    centroids = pd.DataFrame(km.cluster_centers_, columns=features.columns)
    centroids.insert(features.shape[1], "num", pd.DataFrame(km.labels_, columns=["n"]).groupby("n").size())
    centroids.insert(features.shape[1], "cluster_id", range(k))
    return centroids, km.labels_
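
For readers who want to see what the clustering step does, here is a minimal sketch of plain Lloyd's k-means in pure Python. It is not scikit-learn's implementation (which defaults to k-means++ initialization, and MiniBatchKMeans updates centroids from mini-batches); the function `lloyd_kmeans`, its parameters and the sample points are hypothetical. Like the notebook function, it yields centroids, a label per point, and the number of points per cluster.

```python
import random
from collections import Counter


def lloyd_kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm on 2-D points; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct input points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        for i, (x, y) in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: (x - centroids[c][0]) ** 2 + (y - centroids[c][1]) ** 2,
            )
        # Update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return centroids, labels


# Two well-separated hypothetical groups of points
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (10.0, 10.0), (10.1, 10.0), (10.0, 10.1)]

centroids, labels = lloyd_kmeans(points, k=2)
sizes = Counter(labels)  # plays the role of the "num" column
for cluster_id, (cx, cy) in enumerate(centroids):
    print(cluster_id, round(cx, 3), round(cy, 3), sizes[cluster_id])
```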

We run the clustering with 8 clusters, then mark the centroids on the map with stars sized by each cluster's share of the sampled pickups.

In [None]:
pickup_hub_loc, pickup_cluster = KMeans_clustering(8, df[['pickup_latitude', 'pickup_longitude']])
pickup_hub_loc

In [None]:
plt.scatter(x=aggr_pickups['Long'], y=aggr_pickups['Lat'], color='white', s=0.02, alpha=0.6)
plt.scatter(x=pickup_hub_loc['pickup_longitude'], y=pickup_hub_loc['pickup_latitude'], color='#ff00a0', marker='*', s=pickup_hub_loc['num']/len(df)*8000, alpha=0.6)
plt.show()
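
The star sizes come from the expression `pickup_hub_loc['num']/len(df)*8000`: each cluster's share of the sampled pickups, multiplied by a constant. In matplotlib's `scatter`, `s` is the marker area in points squared. A small sketch of the same arithmetic, with a hypothetical helper `marker_sizes` and made-up cluster counts:

```python
def marker_sizes(cluster_counts, scale=8000):
    """Marker area per cluster: share of sampled pickups times `scale`."""
    # Every sampled row gets exactly one label, so the counts sum to len(df)
    total = sum(cluster_counts)
    return [n / total * scale for n in cluster_counts]


# Hypothetical cluster sizes, for illustration only
sizes = marker_sizes([500, 250, 250])
print(sizes)  # [4000.0, 2000.0, 2000.0]
```

Because the shares sum to 1, the total marker area is always `scale`, whatever the sample size.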