1. Restart Kernel, Run All, Activate Rerun 
2. Modify C5 (0 Rerun Cells)
3. Modify C2 (3 Rerun Cells)

## Data cleaning benchmark from split_annotations (Generated Data)

1. For the data source, we can either load the 311 service requests dataset (`311-service-requests.csv`) into a dataframe, or generate data with the split_annotations benchmark code. Currently, the generated data looks like `["1234567", ..., "1234567"]`, with as many entries as the `size` parameter specifies.
2. We clean the data using `pandas`. The cleaned data contains only unique entries. We should also be able to replace NULL, broken, or missing values with NaN.

- Code: https://github.com/weld-project/split-annotations/tree/master/python/benchmarks/data_cleaning
- Data: https://github.com/jvns/pandas-cookbook/blob/master/cookbook/data/311-service-requests.csv
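
The cleaning steps described above can be sketched on a handful of values. This is a minimal illustration, not the benchmark code itself: the sample values are made up (they are not taken from the 311 dataset), and dropping NaN before `unique()` is an extra step added here to keep the printed result short.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values, not taken from the 311 dataset.
sample = pd.Series(["12345-6789", "00000", "12345", "NULL", "10027", ""])

# Treat placeholder strings as missing values.
cleaned = sample.replace(["NULL", "NaN", "", " "], np.nan)
# Keep only the first five characters of each entry.
cleaned = cleaned.str.slice(0, 5)
# Mask all-zero zip codes as missing.
cleaned = cleaned.mask(cleaned == "00000", np.nan)
# Drop missing values (extra step for this sketch) and collapse duplicates.
unique_zips = cleaned.dropna().unique()
print(unique_zips)
```

Entries with extra digits collapse onto their five-character prefix, so `"12345-6789"` and `"12345"` end up as a single value.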
   

In [1]:
# Install and import all dependencies.
import numpy as np
import time
import argparse
import sys
import pandas as pd

## NOTES: 
- Most of the code and structure is taken directly from the split_annotations benchmarks, modified to use a larger dataframe and divided into cells.
- Helper functions are at the top and the main execution, run(), is called at the bottom (not very top-down).
    - This makes it hard to edit a function and then scroll down to re-run the main cell.
- Cells are currently divided by helper function.
- The main workload is an in-place filter.

In [7]:
# Data generation cell. 
def gen_data(size):
    values = ["1234567" for _ in range(size)] # Modify(3) Change to 1234567890
    return pd.Series(data=values)

In [8]:
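## !! This is the naive version, using pandas only.
def datacleaning_pandas(requests):
    try:
        # Truncate entries with extra digits to five characters.
        requests = requests.str.slice(0, 5)
        # Replace "00000" zip codes with NaN.
        zero_zips = requests == "00000"
        requests = requests.mask(zero_zips, np.nan)
        # Keep only the unique values.
        requests = requests.unique()
        return requests
    except Exception:
        print("We will replace any broken data with NaN")
        # Return the result so the caller does not receive None.
        return requests.replace(["NULL", "NaN", "", " "], np.nan)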

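## !! This is the split_annotations version.
# Note: stock pandas has no series_str_slice, equal, or evaluate functions, so
# `pd` must refer to the annotated pandas wrapper from split_annotations here.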
def datacleaning_composer(requests, threads):
    # Fix requests with extra digits
    requests = pd.series_str_slice(requests, 0, 5)
    requests.dontsend = True

    # Fix requests with 00000 zipcodes
    zero_zips = pd.equal(requests, "00000")
    zero_zips.dontsend = True
    requests = pd.mask(requests, zero_zips, np.nan)
    requests.dontsend = True
    requests = pd.unique(requests)
    pd.evaluate(workers=threads)
    requests = requests.value
    return requests

In [9]:
def run(size: int, piece_size: int, threads: int, loglevel: str, mode: str):
    assert mode == "composer" or mode == "naive"
    assert threads >= 1
    print("Size:", size)
    print("Piece Size:", piece_size)
    print("Threads:", threads)
    print("Log Level", loglevel)
    print("Mode:", mode)

    print("Generating data...")
    inputs = gen_data(size)
    print("done.")

    start = time.time()
    if mode == "composer":
        result = datacleaning_composer(inputs, threads)
    elif mode == "naive":
        result = datacleaning_pandas(inputs)
    end = time.time()
    print(end - start, "seconds")
    print("after cleaning: ", result)

In [10]:
# Change parameters here to run with generated data. 
size = 5000000 # Size of each array. 
piece_size = 50006 # Size of each piece. 
threads = 1 # Number of threads. 
loglevel = 'none' # Log level. debug|info|warning|error|critical|none 
mode = "naive" # composer | naive mode => required
# print('size: ', size, 'piece_size: ', piece_size, "threads: ", threads) # Modify(2) Uncomment this
run(size, piece_size, threads, loglevel, mode)

size:  5000000 piece_size:  50006 threads:  1
Size: 5000000
Piece Size: 50006
Threads: 1
Log Level none
Mode: naive
Generating data...
done.
1.4349901676177979 seconds
after cleaning:  ['12345']
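
The single-element result follows from the generated data: every entry is identical, so slicing and `unique()` leave one value. The sketch below reproduces this at a small size and also checks the Modify(3) change. The `value` parameter on `gen_data` and the `clean` helper are hypothetical additions for this illustration, not part of the benchmark.

```python
import pandas as pd

def gen_data(size, value="1234567"):
    # `value` is a hypothetical parameter added for this illustration.
    return pd.Series(data=[value for _ in range(size)])

def clean(requests):
    # Same slice / mask / unique sequence as the naive pandas path.
    requests = requests.str.slice(0, 5)
    requests = requests.mask(requests == "00000")
    return requests.unique()

before = clean(gen_data(10))
after = clean(gen_data(10, value="1234567890"))
print(before, after)
```

Because only the first five characters survive the slice, changing the generated value from `1234567` to `1234567890` leaves the cleaned output unchanged.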
