# Generate Synthetic Data using Sparkov Data Generation technique

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/bacalhau-project/examples/blob/main/workload-onboarding/Sparkov-Data-Generation/index.ipynb)
[![Open In Binder](https://mybinder.org/badge.svg)](https://mybinder.org/v2/gh/bacalhau-project/examples/HEAD?labpath=workload-onboarding/Sparkov-Data-Generation/index.ipynb)

## Introduction

A synthetic dataset is generated by algorithms or simulations and has characteristics similar to real-world data. Collecting real-world data, especially data that contains sensitive user information such as credit card details, is often not possible because of security and privacy concerns. A data scientist who needs to train a model to detect credit card fraud can use synthetically generated data instead of real data, without compromising users' privacy.

The advantage of using Bacalhau is that you can generate terabytes of synthetic data without having to install any dependencies or store the data locally.

In this example, we will generate synthetic credit card transaction data using the Sparkov program and store the results in IPFS.

## Running Locally


Installing dependencies


In [None]:
%%bash
git clone https://github.com/js-ts/Sparkov_Data_Generation/
pip3 install -r Sparkov_Data_Generation/requirements.txt

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting Faker==13.12.0
  Downloading Faker-13.12.0-py3-none-any.whl (1.6 MB)
Installing collected packages: Faker
Successfully installed Faker-13.12.0


Cloning into 'Sparkov_Data_Generation'...


In [None]:
%cd Sparkov_Data_Generation

/content/Sparkov_Data_Generation


Creating a directory to store the outputs

In [None]:
%%bash
mkdir ../outputs

Running the script

Parameters:

`-n`: number of customers to generate

`-o`: path to store the outputs

Start date: "01-01-2022"

End date: "10-01-2022"

To see the full list of options, use:

In [None]:
%%bash
python datagen.py -h

usage: datagen.py [-h] [-n NB_CUSTOMERS] [-seed [SEED]] [-config [CONFIG]]
                  [-c CUSTOMER_FILE] [-o OUTPUT]
                  start_date end_date

Customer Generator

positional arguments:
  start_date            Transactions start date
  end_date              Transactions start date

optional arguments:
  -h, --help            show this help message and exit
  -n NB_CUSTOMERS, --nb_customers NB_CUSTOMERS
                        Number of customers to generate
  -seed [SEED]          Random generator seed
  -config [CONFIG]      Profile config file (typically
                        profiles/main_config.json")
  -c CUSTOMER_FILE, --customer_file CUSTOMER_FILE
                        Customer file generated with the datagen_customer
                        script
  -o OUTPUT, --output OUTPUT
                        Output Folder path
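
If you prefer to drive the generator from a Python script rather than a notebook cell, the options listed in the help output can be assembled into a command programmatically. The following is a minimal sketch, not part of Sparkov itself: `build_datagen_command` and `run_datagen` are hypothetical helper names, and the sketch assumes it is run from inside the `Sparkov_Data_Generation` directory so that `datagen.py` resolves.

```python
import subprocess
import sys
from typing import List, Optional


def build_datagen_command(
    nb_customers: int,
    output_dir: str,
    start_date: str,
    end_date: str,
    seed: Optional[int] = None,
) -> List[str]:
    """Assemble the argv list for datagen.py (hypothetical helper)."""
    cmd = [sys.executable, "datagen.py", "-n", str(nb_customers), "-o", output_dir]
    if seed is not None:
        # -seed appears in the help output as the random generator seed.
        cmd += ["-seed", str(seed)]
    # The two positional arguments come last: start_date, then end_date.
    cmd += [start_date, end_date]
    return cmd


def run_datagen(cmd: List[str]) -> None:
    """Run the assembled command and raise if datagen.py exits with an error."""
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    command = build_datagen_command(1000, "../outputs", "01-01-2022", "10-01-2022")
    # Print the command instead of running it, so the sketch has no side effects.
    print(" ".join(command))
```

Passing the arguments as a list avoids shell quoting issues with the date strings. Call `run_datagen(command)` once the printed command looks right.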


In [None]:
%%bash
python3 datagen.py -n 1000 -o ../outputs "01-01-2022" "10-01-2022"

Num CPUs: 2
profile: adults_50up_male_urban.json, chunk size: 200,                 chunk: 0-199
profile: adults_50up_male_urban.json, chunk size: 200,                 chunk: 200-399
profile: adults_50up_male_urban.json, chunk size: 200,                 chunk: 400-599
profile: adults_50up_male_urban.json, chunk size: 200,                 chunk: 600-799
profile: adults_50up_male_urban.json, chunk size: 200,                 chunk: 800-999
profile: adults_50up_female_urban.json, chunk size: 200,                 chunk: 0-199
profile: adults_50up_female_urban.json, chunk size: 200,                 chunk: 200-399
profile: adults_50up_female_urban.json, chunk size: 200,                 chunk: 400-599
profile: adults_50up_female_urban.json, chunk size: 200,                 chunk: 600-799
profile: adults_50up_female_urban.json, chunk size: 200,                 chunk: 800-999
profile: adults_50up_male_rural.json, chunk size: 200,                 chunk: 0-199
profile: adults_50up_male_rural.json, 
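
The log shows each profile being processed in chunks of 200 customers (0-199, 200-399, and so on up to 999). The arithmetic behind those ranges can be sketched as follows. This is an illustration only, not the logic in `datagen.py`: `chunk_ranges` and `chunk_filename` are hypothetical helpers, and the three-digit zero padding is an assumption based on the output filenames produced for this 1000-customer run.

```python
from typing import List, Tuple


def chunk_ranges(nb_customers: int, chunk_size: int) -> List[Tuple[int, int]]:
    """Split customer indices 0..nb_customers-1 into inclusive (start, end) ranges."""
    if nb_customers < 0 or chunk_size <= 0:
        raise ValueError("nb_customers must be >= 0 and chunk_size must be > 0")
    return [
        (start, min(start + chunk_size, nb_customers) - 1)
        for start in range(0, nb_customers, chunk_size)
    ]


def chunk_filename(profile: str, start: int, end: int, width: int = 3) -> str:
    """Build a name like adults_50up_male_urban_000-199.csv (padding width is assumed)."""
    return f"{profile}_{start:0{width}d}-{end:0{width}d}.csv"


if __name__ == "__main__":
    for start, end in chunk_ranges(1000, 200):
        print(chunk_filename("adults_50up_male_urban", start, end))
```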

### Building a Docker container (Optional)
Note: you can skip this section entirely and go directly to running on Bacalhau.

To use Bacalhau, you need to package your code in an appropriate format. The developers have already pushed a container for you to use, but if you want to build your own, you can follow the steps below. You can view a [dedicated container example](../custom-containers/index.md) in the documentation.

### Dockerfile

In this step, you will create a `Dockerfile` to create an image. The `Dockerfile` is a text document that contains the commands used to assemble the image. First, create the `Dockerfile`.

```
FROM python:3.8

RUN apt-get update && apt-get install -y git

RUN git clone https://github.com/js-ts/Sparkov_Data_Generation/

WORKDIR /Sparkov_Data_Generation/

RUN pip3 install -r requirements.txt
```

To build the Docker container, run the `docker build` command:

```
docker build -t <hub-user>/<repo-name>:<tag> .
```

Please replace the placeholders as follows:

`<hub-user>`: your Docker Hub username. If you don't have a Docker Hub account, create one first and use the username of the account you created.

`<repo-name>`: the name of the container. You can name it anything you want.

`<tag>`: optional. You can use the `latest` tag.

After you have built the container, the next step is to test it locally and then push it to Docker Hub.

You can now push this repository to the registry designated by its name and tag.

```
 docker push <hub-user>/<repo-name>:<tag>
```


After the image has been pushed to Docker Hub, you can use the container to run the job on Bacalhau.

## Running on Bacalhau

First, install the Bacalhau client. The `bacalhau docker run` command that follows is similar to the one we ran locally.

In [None]:
%%bash
curl -sL https://get.bacalhau.org/install.sh | bash

Your system is linux_amd64
No BACALHAU detected. Installing fresh BACALHAU CLI...
Getting the latest BACALHAU CLI...
Installing v0.3.11 BACALHAU CLI...
Downloading https://github.com/filecoin-project/bacalhau/releases/download/v0.3.11/bacalhau_v0.3.11_linux_amd64.tar.gz ...
Downloading sig file https://github.com/filecoin-project/bacalhau/releases/download/v0.3.11/bacalhau_v0.3.11_linux_amd64.tar.gz.signature.sha256 ...
Verified OK
Extracting tarball ...
NOT verifying Bin
bacalhau installed into /usr/local/bin successfully.
Client Version: v0.3.11
Server Version: v0.3.11


In [None]:
%%bash --out job_id
bacalhau docker run \
--id-only \
--wait \
jsacex/sparkov-data-generation \
--  python3 datagen.py -n 1000 -o ../outputs "01-01-2022" "10-01-2022"

In [None]:
%env JOB_ID={job_id}

env: JOB_ID=d986b432-9af6-4463-93d2-362dbccb8379
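
Outside a notebook, the same submission can be scripted. The sketch below wraps the `bacalhau docker run` invocation shown above, using only the flags that appear in it (`--id-only`, `--wait`), and checks that the captured output looks like a UUID before using it as a job ID. The helper names `build_bacalhau_command`, `is_job_id`, and `submit_job` are hypothetical, and the sketch assumes the `bacalhau` CLI is on your `PATH`.

```python
import subprocess
import uuid
from typing import List


def build_bacalhau_command(image: str, job_args: List[str]) -> List[str]:
    """Assemble the argv list for `bacalhau docker run` with the flags used above."""
    return ["bacalhau", "docker", "run", "--id-only", "--wait", image, "--", *job_args]


def is_job_id(text: str) -> bool:
    """Return True if text is a canonical UUID such as d986b432-9af6-4463-93d2-362dbccb8379."""
    candidate = text.strip()
    try:
        return str(uuid.UUID(candidate)) == candidate.lower()
    except ValueError:
        return False


def submit_job(image: str, job_args: List[str]) -> str:
    """Submit the job and return its ID; raises if the CLI fails or prints something else."""
    result = subprocess.run(
        build_bacalhau_command(image, job_args),
        capture_output=True,
        text=True,
        check=True,
    )
    job_id = result.stdout.strip()
    if not is_job_id(job_id):
        raise RuntimeError(f"unexpected output from bacalhau: {job_id!r}")
    return job_id


if __name__ == "__main__":
    args = ["python3", "datagen.py", "-n", "1000", "-o", "../outputs", "01-01-2022", "10-01-2022"]
    # Print the command only; call submit_job(...) to actually create a job.
    print(" ".join(build_bacalhau_command("jsacex/sparkov-data-generation", args)))
```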


Running the commands will output a UUID that represents the job that was created. You can check the status of the job with the following command:

In [None]:
%%bash
bacalhau list --id-filter ${JOB_ID}

 CREATED   ID        JOB                      STATE      VERIFIED  PUBLISHED
 12:03:03  d986b432  Docker jsacex/sparko...  Completed            /ipfs/QmQSfVLAZGoy8K...



When the `STATE` column says `Completed`, the job is done and we can get the results.

To find out more information about your job, run the following command:

In [None]:
%%bash
bacalhau describe ${JOB_ID}

If you see that the job has completed and there are no errors, then you can download the results with the following command:

In [None]:
%%bash
rm -rf results && mkdir -p results
bacalhau get $JOB_ID --output-dir results

Fetching results of job 'd986b432-9af6-4463-93d2-362dbccb8379'...
Results for job 'd986b432-9af6-4463-93d2-362dbccb8379' have been written to...
results


2022/11/12 12:05:30 failed to sufficiently increase receive buffer size (was: 208 kiB, wanted: 2048 kiB, got: 416 kiB). See https://github.com/lucas-clemente/quic-go/wiki/UDP-Receive-Buffer-Size for details.
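
To check that every profile produced all of its chunks, the downloaded files can be grouped by profile name. The following is a minimal sketch under two assumptions: the CSVs live in `results/combined_results/outputs`, and their names follow the `<profile>_<start>-<end>.csv` pattern (for example `adults_2550_female_rural_000-199.csv`). `parse_chunk_filename` and `group_by_profile` are hypothetical helpers, not part of Sparkov or Bacalhau.

```python
import re
from collections import defaultdict
from pathlib import Path
from typing import Dict, Iterable, List, Optional, Tuple

# Assumed naming scheme: <profile>_<start>-<end>.csv
FILENAME_PATTERN = re.compile(r"^(?P<profile>.+)_(?P<start>\d+)-(?P<end>\d+)\.csv$")


def parse_chunk_filename(name: str) -> Optional[Tuple[str, int, int]]:
    """Return (profile, start, end) for a chunk file name, or None if it does not match."""
    match = FILENAME_PATTERN.match(name)
    if match is None:
        return None
    return match["profile"], int(match["start"]), int(match["end"])


def group_by_profile(names: Iterable[str]) -> Dict[str, List[Tuple[int, int]]]:
    """Map each profile to the sorted list of customer ranges found for it."""
    groups: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
    for name in names:
        parsed = parse_chunk_filename(name)
        if parsed is not None:
            profile, start, end = parsed
            groups[profile].append((start, end))
    return {profile: sorted(ranges) for profile, ranges in groups.items()}


if __name__ == "__main__":
    output_dir = Path("results/combined_results/outputs")
    if output_dir.is_dir():
        names = [path.name for path in output_dir.glob("*.csv")]
        for profile, ranges in sorted(group_by_profile(names).items()):
            print(profile, ranges)
```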


After the download has finished, you should see the following contents in the results directory:

In [None]:
%%bash
ls results/combined_results/outputs

adults_2550_female_rural_000-199.csv
adults_2550_female_rural_200-399.csv
adults_2550_female_rural_400-599.csv
adults_2550_female_rural_600-799.csv
adults_2550_female_rural_800-999.csv
adults_2550_female_urban_000-199.csv
adults_2550_female_urban_200-399.csv
adults_2550_female_urban_400-599.csv
adults_2550_female_urban_600-799.csv
adults_2550_female_urban_800-999.csv
adults_2550_male_rural_000-199.csv
adults_2550_male_rural_200-399.csv
adults_2550_male_rural_400-599.csv
adults_2550_male_rural_600-799.csv
adults_2550_male_rural_800-999.csv
adults_2550_male_urban_000-199.csv
adults_2550_male_urban_200-399.csv
adults_2550_male_urban_400-599.csv
adults_2550_male_urban_600-799.csv
adults_2550_male_urban_800-999.csv
adults_50up_female_rural_000-199.csv
adults_50up_female_rural_200-399.csv
adults_50up_female_rural_400-599.csv
adults_50up_female_rural_600-799.csv
adults_50up_female_rural_800-999.csv
adults_50up_female_urban_000-199.csv
adults_50up_female_urban_200-399.csv
adults_50up_female_ur