## Data cleaning benchmark from split_annotations

1. For the data source, we can either load the 311 service requests dataset into a dataframe, or generate synthetic data using the split_annotations benchmark code. The generated data currently looks like `["1234567", ..., "1234567"]`, with as many entries as the `size` parameter specifies.
2. We clean the data using `pandas`. The cleaned result contains only unique entries. We should also be able to replace NULL, broken, or missing values with NaN.

- Code: https://github.com/weld-project/split-annotations/tree/master/python/benchmarks/data_cleaning
- Data: https://github.com/jvns/pandas-cookbook/blob/master/cookbook/data/311-service-requests.csv
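
As a minimal sketch of the cleaning steps described above, assuming the input is a pandas Series of zip-code-like strings. The sample values and the helper name `clean_zip_series` are made up for illustration and are not taken from the benchmark:

```python
import numpy as np
import pandas as pd

def clean_zip_series(zips):
    """Hypothetical helper: normalise zip-code strings and return the unique values."""
    # Treat placeholder strings as missing values.
    zips = zips.replace(["NULL", "NaN", "", " "], np.nan)
    # Keep only the first five characters (drops extra digits and suffixes).
    zips = zips.str.slice(0, 5)
    # "00000" is not a usable zip code, so mask it out as well.
    zips = zips.mask(zips == "00000", np.nan)
    # Deduplicate; a missing value is kept once if present.
    return zips.unique()

sample = pd.Series(["10001-1234", "10001", "00000", "NULL", "1234567"])
cleaned = clean_zip_series(sample)
print(cleaned)
```

The placeholder replacement runs first here so that strings such as `"NULL"` are not truncated before they can be recognised.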
   

In [8]:
# Install and import all dependencies.
!pip install pandas

import numpy as np
import time
import argparse
import sys
import pandas as pd



## NOTES: 
- Most of the code and structure (modified to use a larger dataframe and split into cells) comes directly from the split_annotations benchmarks
- Helper functions are at the top and the main execution, run(), is called at the bottom (not very top-down)
    - This makes it tedious to edit a function and then scroll down to re-run the main cell
- Cells are currently divided by helper function
- The main workload is an in-place filter
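
To illustrate what "in place" means for this workload, here is a small sketch in plain pandas on a made-up two-column frame (the column names and values are hypothetical). The default `replace` returns a cleaned copy and leaves the original untouched, while `inplace=True` mutates the existing object:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"zip": ["10001", "NULL", " "], "agency": ["NYPD", "", "DOHMH"]})
placeholders = ["NULL", "NaN", "", " "]

# Default form: a new frame is returned, df itself is unchanged.
cleaned_copy = df.replace(placeholders, np.nan)
missing_before = int(df.isna().sum().sum())

# In-place form: df itself is modified.
df.replace(placeholders, np.nan, inplace=True)
missing_after = int(df.isna().sum().sum())

print("missing in copy:", int(cleaned_copy.isna().sum().sum()))
print("missing in df before/after in-place call:", missing_before, missing_after)
```

One consequence for benchmarking is that an in-place workload changes its own input, so a second run on the same object does not measure the same work.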

In [12]:
# Data generation cell. 
def gen_data(size):
    values = ["1234567" for  _ in range(size)]
    return pd.Series(data=values)

In [13]:
## !! This is the naive version, using pandas only.
def datacleaning_pandas(requests):
    try:
        # Fix requests with extra digits
        requests = requests.str.slice(0, 5)
        # Fix requests with 00000 zipcodes
        zero_zips = requests == "00000"
        requests = requests.mask(zero_zips, np.nan)
        requests = requests.unique()
        return requests
    except AttributeError:
        # A DataFrame has no .str accessor, so imported data ends up here.
        print("We will replace any broken data with NaN")
        requests.replace(["NULL", "NaN", "", " "], np.nan, inplace=True)
        return requests

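## !! This is the split_annotations (composer) version.
## NOTE: series_str_slice, equal and evaluate are not part of stock pandas; this
## assumes `pd` is bound to the annotated pandas wrapper from split_annotations.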
def datacleaning_composer(requests, threads):
    # Fix requests with extra digits
    requests = pd.series_str_slice(requests, 0, 5)
    requests.dontsend = True

    # Fix requests with 00000 zipcodes
    zero_zips = pd.equal(requests, "00000")
    zero_zips.dontsend = True
    requests = pd.mask(requests, zero_zips, np.nan)
    requests.dontsend = True
    requests = pd.unique(requests)
    pd.evaluate(workers=threads)
    requests = requests.value
    return requests
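
The `piece_size` and `threads` parameters suggest that the composer version processes the input in pieces. The following is only a conceptual sketch of that idea in plain pandas (split the input, clean each piece, merge the partial results). It does not use the split_annotations API, and the names `clean_piece` and `clean_in_pieces` are hypothetical:

```python
import numpy as np
import pandas as pd

def clean_piece(piece):
    """Apply the per-element cleaning steps to one piece of the input."""
    piece = piece.str.slice(0, 5)
    return piece.mask(piece == "00000", np.nan)

def clean_in_pieces(requests, piece_size):
    """Clean `requests` piece by piece and merge the unique values."""
    partial = []
    for start in range(0, len(requests), piece_size):
        piece = requests.iloc[start:start + piece_size]
        partial.append(pd.Series(clean_piece(piece).unique()))
    if not partial:
        return pd.Series([], dtype=object).unique()
    # Pieces can share values, so deduplicate once more after merging.
    return pd.concat(partial, ignore_index=True).unique()

data = pd.Series(["1234567", "00000", "1234567", "9876543", "00000", "98765"])
result = clean_in_pieces(data, piece_size=2)
print(result)
```

Slicing and masking work element by element, so they can be applied to each piece independently; `unique` needs a final merge step because duplicates may span pieces.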

In [14]:
def run(size: int, piece_size: int, threads: int, loglevel: str, mode: str):
    assert mode == "composer" or mode == "naive"
    assert threads >= 1
    print("Size:", size)
    print("Piece Size:", piece_size)
    print("Threads:", threads)
    print("Log Level", loglevel)
    print("Mode:", mode)

    print("Generating data...")
    inputs = gen_data(size)
    print("done.")

    start = time.time()
    if mode == "composer":
        result = datacleaning_composer(inputs, threads)
    elif mode == "naive":
        result = datacleaning_pandas(inputs)
    end = time.time()
    print(end - start, "seconds")
    print("after cleaning: ", result)

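def run_import(path: str):
    # NOTE: relies on the notebook-level `mode` and `threads` variables set in the next cell.
    print("We are using this dataset: " + path)
    try:
        # Read column 8 (Incident Zip) as strings.
        inputs = pd.read_csv(path, dtype={8: str})
    except Exception:
        print("Cannot parse " + path)
        return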
    print("We have loaded our data: ")
    
    start = time.time()
    if mode == "composer":
        result = datacleaning_composer(inputs, threads)
    elif mode == "naive":
        result = datacleaning_pandas(inputs)
    end = time.time()
    print(end - start, "seconds")
    print("after cleaning: ", inputs.head())
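
The timings above come from a single `time.time()` measurement. As a hedged alternative, the sketch below uses `time.perf_counter()`, a monotonic high-resolution clock, and keeps the best of several runs. The helper name `time_call` and the `repeats` parameter are hypothetical and not part of the benchmark:

```python
import time

def time_call(fn, *args, repeats=3):
    """Run fn(*args) `repeats` times; return (best_seconds, last_result)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn(*args)
        best = min(best, time.perf_counter() - start)
    return best, result

# Stand-in workload; in the notebook this could be datacleaning_pandas(inputs).
best, total = time_call(sum, range(1_000_000))
print(f"{best:.4f} seconds (best of 3)")
```

For in-place workloads, each repetition would need a fresh copy of the input, since the first run already modifies it.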

In [15]:
# Change parameters here to run with generated data. 
size = 5000000 # Size of each array. 
piece_size = 50006 # Size of each piece. 
threads = 1 # Number of threads. 
loglevel = 'none' # Log level. debug|info|warning|error|critical|none 
mode = "naive" # Mode (required). composer|naive
run(size, piece_size, threads, loglevel, mode)

# Change file name here to run with imported data. 
path = 'data/311-service-requests.csv' 
run_import(path)

Size: 5000000
Piece Size: 50006
Threads: 1
Log Level none
Mode: naive
Generating data...
done.
1.5939052104949951 seconds
after cleaning:  ['12345']
We are using this dataset: data/311-service-requests.csv
We have loaded our data: 
We will replace any broken data with NaN
0.5974400043487549 seconds
after cleaning:     Unique Key            Created Date             Closed Date Agency  \
0    26589651  10/31/2013 02:08:41 AM                     NaN   NYPD   
1    26593698  10/31/2013 02:01:04 AM                     NaN   NYPD   
2    26594139  10/31/2013 02:00:24 AM  10/31/2013 02:40:32 AM   NYPD   
3    26595721  10/31/2013 01:56:23 AM  10/31/2013 02:21:48 AM   NYPD   
4    26590930  10/31/2013 01:53:44 AM                     NaN  DOHMH   

                               Agency Name           Complaint Type  \
0          New York City Police Department  Noise - Street/Sidewalk   
1          New York City Police Department          Illegal Parking   
2          New York City Police Depar

## NOTES
- This is the same workload, but with the cells organized differently. Here, I try to follow a notebook's top-down flow.

In [9]:
path = 'data/311-service-requests.csv' # We can change paths here.

In [10]:
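# Load the dataset.
print("We are using this dataset: " + path)
try:
    # Read column 8 (Incident Zip) as strings.
    inputs = pd.read_csv(path, dtype={8: str})
except Exception:
    print("Cannot parse " + path)
    raise  # Stop here so later cells do not run with `inputs` undefined.
print("We have loaded our data: ", inputs.head())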

We are using this dataset: data/311-service-requests.csv
We have loaded our data:     Unique Key            Created Date             Closed Date Agency  \
0    26589651  10/31/2013 02:08:41 AM                     NaN   NYPD   
1    26593698  10/31/2013 02:01:04 AM                     NaN   NYPD   
2    26594139  10/31/2013 02:00:24 AM  10/31/2013 02:40:32 AM   NYPD   
3    26595721  10/31/2013 01:56:23 AM  10/31/2013 02:21:48 AM   NYPD   
4    26590930  10/31/2013 01:53:44 AM                     NaN  DOHMH   

                               Agency Name           Complaint Type  \
0          New York City Police Department  Noise - Street/Sidewalk   
1          New York City Police Department          Illegal Parking   
2          New York City Police Department       Noise - Commercial   
3          New York City Police Department          Noise - Vehicle   
4  Department of Health and Mental Hygiene                   Rodent   

                     Descriptor        Location Type Inci

In [16]:
# Clean data. 
start = time.time()
inputs.replace(["NULL", "NaN", "", " "], np.nan, inplace=True)
end = time.time()
print(end - start, "seconds")
print("after cleaning: ", inputs.head())

0.6360399723052979 seconds
after cleaning:     Unique Key            Created Date             Closed Date Agency  \
0    26589651  10/31/2013 02:08:41 AM                     NaN   NYPD   
1    26593698  10/31/2013 02:01:04 AM                     NaN   NYPD   
2    26594139  10/31/2013 02:00:24 AM  10/31/2013 02:40:32 AM   NYPD   
3    26595721  10/31/2013 01:56:23 AM  10/31/2013 02:21:48 AM   NYPD   
4    26590930  10/31/2013 01:53:44 AM                     NaN  DOHMH   

                               Agency Name           Complaint Type  \
0          New York City Police Department  Noise - Street/Sidewalk   
1          New York City Police Department          Illegal Parking   
2          New York City Police Department       Noise - Commercial   
3          New York City Police Department          Noise - Vehicle   
4  Department of Health and Mental Hygiene                   Rodent   

                     Descriptor        Location Type Incident Zip  \
0                  Loud Tal