# World Data League 2021
## Notebook Template

This notebook is one of the mandatory deliverables when you submit your solution (alongside the video pitch). Its structure follows the WDL evaluation criteria and it has dedicated cells where you can add descriptions. Make sure your code is readable as it will be the only technical support the jury will have to evaluate your work.

The notebook must:

*   💻 have all the code that you want the jury to evaluate
*   🧱 follow the predefined structure
*   📄 have markdown descriptions where you find necessary
*   👀 be saved with all the output that you want the jury to see
*   🏃‍♂️ be runnable


## Introduction
Describe how you framed the challenge by telling us what problem you are trying to solve and how your solution solves it.

## Development

In [1]:
#Import Libraries
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot
from datetime import datetime
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
import requests

sns.set_theme(style="whitegrid")
sns.set_color_codes("pastel")

In [3]:
def get_stage(text):
    if 'STAGE 1' in text:
        return 1
    elif 'STAGE 2' in text:
        return 2
    else:
        return 0

def get_parking_tickets_data():
    # The file is too large for GitHub, so it is read from a local path
    df = pd.read_csv('/home/ana/Downloads/parking-tickets-2017-2019_WDL.csv', sep=';', index_col=0,
                    parse_dates=['EntryDate'])
    # Some infractions are identical except for a trailing dot. Remove that dot so
    # they are not counted as distinct infractions
    df['InfractionText'] = df['InfractionText'].str.rstrip('.')
    
    # There are repeated infractions with different stages (or without a stage)
    df['Infraction_Stage'] = df['InfractionText'].apply(get_stage)
    # Remove only the literal " - STAGE 1" / " - STAGE 2" suffix. str.rstrip would
    # strip a set of characters and could eat trailing letters of the text itself
    df['InfractionText'] = df['InfractionText'].str.replace(r' - STAGE [12]$', '', regex=True)
    return df

df = get_parking_tickets_data()
df.head()


|   | Block | Street | EntryDate | Bylaw | Section | Status | InfractionText | Year | HBLOCK | Infraction_Stage |
|---|-------|--------|-----------|-------|---------|--------|----------------|------|--------|------------------|
| 0 | 1400 | Kingsway | 2017-08-23 | 2849 | 17.1 | IS | STOP AT A PLACE WHERE A TRAFFIC SIGN PROHIBITS... | 2017 | 1400 KINGSWAY | 0 |
| 1 | 2100 | 13th Ave E. | 2017-08-26 | 2849 | 19.1(H) | IS | STOP ON EITHER SIDE OF A LANE WHICH ABUTS COMM... | 2017 | 2100 13TH AVE E | 0 |
| 2 | 2800 | Trinity St. | 2017-08-26 | 2849 | 17.6(B) | VA | PARK ON A STREET WHERE A TRAFFIC SIGN RESTRICT... | 2017 | 2800 TRINITY ST | 0 |
| 3 | 200 | 17th Ave W. | 2017-08-27 | 2849 | 17.5(B) | IS | STOP WITHIN 6 METRES OF THE NEAREST EDGE OF TH... | 2017 | 200 17TH AVE W | 0 |
| 4 | 1900 | 4th Ave W. | 2017-08-19 | 2952 | 5(4)(a)(ii) | IS | PARK IN A METERED SPACE IF THE PARKING METER H... | 2017 | 1900 4TH AVE W | 0 |
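
A note on cleaning the stage suffix: `str.rstrip` treats its argument as a set of characters, not as a literal suffix, so it can also remove trailing letters that belong to the infraction text. The sketch below contrasts it with an anchored regular expression. The strings in `toy` are made up for illustration (they are not rows from the dataset), and the pattern assumes the suffix always has the form ` - STAGE 1` or ` - STAGE 2`.

```python
import pandas as pd

# Made-up examples, not rows from the dataset
toy = pd.Series([
    "PARK IN A METERED SPACE - STAGE 1",
    "STOP WHERE A SIGN PROHIBITS",
])

# rstrip(" - STAGE 1") strips any trailing run of the characters
# ' ', '-', 'S', 'T', 'A', 'G', 'E', '1', not the literal suffix
stripped = toy.str.rstrip(" - STAGE 1")

# An anchored regex removes only a trailing " - STAGE 1" or " - STAGE 2"
cleaned = toy.str.replace(r" - STAGE [12]$", "", regex=True)

print(stripped.tolist())
print(cleaned.tolist())
```

In this toy case the first call also trims letters from `SPACE` and `PROHIBITS`, while the regex leaves the second string untouched.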


In [4]:
def eda_to_df(df):
    header="+" + ("-"*52) + "+"
    form = "+{:^16s}|{:^16s}|{:^10s}|{:^7s}|"
    print(header)
    print(form.format("Column", "Type", "Uniques", "NaN?"))
    print(header)
    for col in df.columns:
        print(form.format(str(col), str(df[col].dtypes), str(len(df[col].unique())), 
                          str(df[col].isnull().values.any()) ))
    print(header)
    
eda_to_df(df)

+----------------------------------------------------+
+     Column     |      Type      | Uniques  | NaN?  |
+----------------------------------------------------+
+     Block      |     int64      |   129    | False |
+     Street     |     object     |   1785   | False |
+   EntryDate    | datetime64[ns] |   1089   | False |
+     Bylaw      |     int64      |    5     | False |
+    Section     |     object     |    98    | False |
+     Status     |     object     |    5     | False |
+ InfractionText |     object     |    92    | False |
+      Year      |     int64      |    3     | False |
+     HBLOCK     |     object     |  15219   | False |
+Infraction_Stage|     int64      |    3     | False |
+----------------------------------------------------+
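
The same summary can also be kept as a DataFrame, which is easier to sort or filter than printed text. This is a minimal sketch on a made-up frame (`toy` and its values are assumptions, not the parking-ticket data). `nunique(dropna=False)` counts a missing value as one more distinct value, which matches `len(df[col].unique())` above.

```python
import pandas as pd

# Made-up frame standing in for the real data
toy = pd.DataFrame({
    "Street": ["Kingsway", "Main St.", None],
    "Year": [2017, 2017, 2018],
})

# One row per column of `toy`: dtype, number of distinct values, any NaN
summary = pd.DataFrame({
    "Type": toy.dtypes.astype(str),
    "Uniques": toy.nunique(dropna=False),
    "NaN?": toy.isnull().any(),
})

print(summary)
```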


## Categorical features

In [None]:
def categorical_feature_study(_df, feature, horizontal=False, treshould=0, plot=True):
    """Count the occurrences of each value of `feature` and optionally plot them.

    Values with fewer than `treshould` occurrences are dropped when `treshould` > 0.
    """
    # Name the counts explicitly so the result does not depend on the column
    # names that value_counts() produces in a given pandas version
    counts = _df[feature].value_counts().rename("Count").rename_axis(feature)

    if treshould > 0:
        counts = counts[counts >= treshould]

    df = counts.reset_index()

    if plot:
        fig, ax = pyplot.subplots()

        if horizontal:
            sns.barplot(x="Count", y=feature, data=df, ax=ax)
        else:
            sns.barplot(x=feature, y="Count", data=df, ax=ax)

    return df

In [None]:
categorical_feature_study(df, "InfractionText", horizontal=True, treshould=10000).head()

In [None]:
categorical_feature_study(df, "Street", horizontal=True, treshould=10000).head()

In [None]:
categorical_feature_study(df, "Year", horizontal=False)

In [None]:
categorical_feature_study(df, "Bylaw", horizontal=False)

In [None]:
categorical_feature_study(df, "HBLOCK", horizontal=True, treshould=5000).head()

## Encode categorical features

In [None]:
enc = OrdinalEncoder()
df["InfractionText"] = enc.fit_transform(df[["InfractionText"]]).astype(int)

df_infraction = categorical_feature_study(df, "InfractionText", horizontal=True, treshould=10000, plot=False)
df_infraction.head()

In [None]:
enc = OrdinalEncoder()
df["Street"] = enc.fit_transform(df[["Street"]]).astype(int)

df_streets = categorical_feature_study(df, "Street", horizontal=True, treshould=10000, plot=False)
df_streets.head()
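
After ordinal encoding, streets and infractions appear only as integer codes, and a single `enc` variable reused for both columns keeps only the last fitted mapping. One way to keep the codes interpretable, sketched here on a made-up frame, is to hold one encoder per column and look labels up through `categories_` or `inverse_transform`. The names `toy`, `street_enc` and `code_to_street` are illustrative assumptions.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frame standing in for the parking-ticket data (values are made up)
toy = pd.DataFrame({"Street": ["Kingsway", "Main St.", "Kingsway", "Abbott St."]})

# A dedicated encoder for this column, so its mapping is not overwritten
street_enc = OrdinalEncoder()
toy["Street_code"] = street_enc.fit_transform(toy[["Street"]]).astype(int).ravel()

# categories_[0] holds the labels in code order: code i -> categories_[0][i]
code_to_street = dict(enumerate(street_enc.categories_[0]))

# inverse_transform performs the same lookup for a whole column of codes
decoded = street_enc.inverse_transform(toy[["Street_code"]].to_numpy()).ravel()

print(code_to_street)
print(decoded.tolist())
```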

## Number of infractions per day

In [None]:
nInfractionsPerDay = df.groupby(['EntryDate']) \
                        .count() \
                        .rename(columns={'Block':'Count'})[['Count']] \
                        .reset_index()
nInfractionsPerDay

In [None]:
fig, ax = pyplot.subplots(figsize=(25,20))
sns.lineplot(data=nInfractionsPerDay, x="EntryDate", y="Count", ax=ax)
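
The grouped counts only contain dates on which at least one ticket was recorded. Forecasting models usually expect a regular daily index, so a possible preparation step is sketched below on made-up rows. Filling the gaps with 0 assumes that a missing date means no tickets were issued, which may not hold if data for that day is simply absent.

```python
import pandas as pd

# Made-up rows standing in for the parking-ticket data
toy = pd.DataFrame({
    "EntryDate": pd.to_datetime(["2017-08-19", "2017-08-19", "2017-08-21"]),
    "Block": [1900, 200, 1400],
})

# size() counts rows per date without depending on a particular column
daily = toy.groupby("EntryDate").size().rename("Count")

# asfreq("D") inserts the missing calendar days; they are filled with 0 tickets
daily = daily.asfreq("D", fill_value=0)

print(daily)
```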

## Number of infractions per day per type of infraction

In [None]:
df.head()

In [None]:
popular_infractions = df_infraction.InfractionText.tolist()

# number of Infractions per Day and per Type
nIDT = df.copy()
nIDT = nIDT[nIDT["InfractionText"].isin(popular_infractions)]
nIDT = nIDT.groupby(['EntryDate', 'InfractionText']) \
                                .count() \
                                .rename(columns={'Block':'Count'})[['Count']]\
                                .reset_index()
nIDT

In [None]:
fig, ax = pyplot.subplots(figsize=(25,20))
sns.lineplot(data=nIDT, x="EntryDate", y="Count", hue="InfractionText", ax=ax)

## Number of Infractions per Day per Type per Street

In [None]:
popular_streets = df_streets.Street.tolist()

# number of Infractions per Day per Type per Street
nIDTS = df.copy()
nIDTS = nIDTS[(nIDTS["Street"].isin(popular_streets)) & (nIDTS["InfractionText"].isin(popular_infractions))]
nIDTS = nIDTS.groupby(['EntryDate', 'InfractionText', 'Street']) \
                                .count() \
                                .rename(columns={'Block':'Count'})[['Count']]\
                                .reset_index()

nIDTS['Street_str'] = nIDTS['Street'].astype(str)
nIDTS['InfractionText_str'] = nIDTS['InfractionText'].astype(str)
nIDTS['Street_&_Infraction'] = nIDTS['Street_str'] + "___" + nIDTS['InfractionText_str']
nIDTS

In [None]:
fig, ax = pyplot.subplots(figsize=(25,20))
sns.lineplot(data=nIDTS, x="EntryDate", y="Count", hue="Street_&_Infraction", ax=ax)

In [None]:
nIDTS

In [None]:
nIDTS_filter = nIDTS[nIDTS['Street']==87]
fig, ax = pyplot.subplots(figsize=(25,20))
sns.lineplot(data=nIDTS_filter, x="EntryDate", y="Count", hue="Street_&_Infraction", ax=ax)

In [None]:
nIDTS_filter[['Count']].head(50)

## Join weather data + holidays + weekends + number of people who cycle / use public transport...
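
This step is not implemented yet. As a rough sketch of what it could look like, the example below adds weekend and holiday flags to a daily count table and left-joins a weather table by date. Everything in it is an assumption for illustration: the counts, the holiday list, and the `weather` table with its `date` and `rain_mm` columns are made up, and real sources would have to be chosen.

```python
import pandas as pd

# Toy daily counts (made-up numbers)
daily = pd.DataFrame({
    "EntryDate": pd.date_range("2017-12-23", periods=4, freq="D"),
    "Count": [120, 80, 15, 60],
})

# Weekend flag: dayofweek is 0 for Monday through 6 for Sunday
daily["is_weekend"] = daily["EntryDate"].dt.dayofweek >= 5

# Hypothetical holiday list; a real one would come from an official calendar
holidays = pd.to_datetime(["2017-12-25"])
daily["is_holiday"] = daily["EntryDate"].isin(holidays)

# Hypothetical weather table keyed by date (column names are assumptions)
weather = pd.DataFrame({
    "date": pd.to_datetime(["2017-12-23", "2017-12-24", "2017-12-25"]),
    "rain_mm": [0.0, 12.5, 3.1],
})

# A left join keeps every day of the ticket series, even without weather data
features = daily.merge(
    weather, how="left", left_on="EntryDate", right_on="date"
).drop(columns="date")

print(features)
```

Days without a matching weather row end up with `NaN` in `rain_mm`, so they would need imputation or removal before modelling.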

# Using Oracle Cloud Infrastructure to perform forecasting

## ARIMA, SARIMA, ETS
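
None of these models is implemented in the notebook yet. As a starting baseline, the sketch below shows simple exponential smoothing, the most basic member of the ETS family, written with pandas only. It has no trend or seasonal component, and ARIMA or SARIMA would need a dedicated time-series library, which is not shown here. The series `y`, the value of `alpha` and the `horizon` are arbitrary choices for illustration.

```python
import pandas as pd

# Toy daily counts (made-up numbers)
y = pd.Series(
    [10.0, 20.0, 30.0, 20.0],
    index=pd.date_range("2019-01-01", periods=4, freq="D"),
)

alpha = 0.5  # smoothing factor, an arbitrary choice for illustration

# Simple exponential smoothing:
#   level_t = alpha * y_t + (1 - alpha) * level_{t-1}, with level_0 = y_0
# which is what ewm(adjust=False).mean() computes
level = y.ewm(alpha=alpha, adjust=False).mean()

# The forecast is flat: every future day gets the last smoothed level
horizon = 3
future_index = pd.date_range(
    y.index[-1] + pd.Timedelta(days=1), periods=horizon, freq="D"
)
forecast = pd.Series(level.iloc[-1], index=future_index)

print(forecast)
```

A flat forecast like this is mainly useful as a reference point against which seasonal models can be compared.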

# ??

## Conclusions

### Scalability and Impact
Tell us how applicable and scalable your solution is if you were to implement it in a city. Identify possible limitations and measure the potential social impact of your solution.

### Future Work
Now picture the following scenario: imagine you could have access to any type of data that could help you solve this challenge even better. What would that data be and how would it improve your solution? 🚀