Readme

damodhar918 · Oct 12, 2023 · 971480f
1 parent f975691
commit 971480f
Showing 6 changed files with 189 additions and 165 deletions.
diff --git a/.github/workflows/integration.yml b/.github/workflows/integration.yml
diff --git a/.github/workflows/unittest.yml b/.github/workflows/unittest.yml
@@ -26,7 +26,8 @@ jobs:
           python -m pip install invoke .[test]
           pip install flake8 pytest colorama pandas pyarrow exrex pytest-cov
     - name: Run unit tests
-      run: invoke unit
+      run: |
+        pytest --cov=sdgp --cov-report=xml tests/
     - if: matrix.os == 'ubuntu-latest' && matrix.python-version == 3.8
       name: Upload codecov report
       uses: codecov/codecov-action@v2
diff --git a/README.md b/README.md
@@ -2,40 +2,71 @@
 
 [![Python application](https://github.com/damodhar918/sdgp/actions/workflows/python-app.yml/badge.svg)](https://github.com/damodhar918/sdgp/actions/workflows/python-app.yml) [![codecov](https://codecov.io/github/damodhar918/sdgp/graph/badge.svg?token=MHZTS92Y4I)](https://codecov.io/github/damodhar918/sdgp)
 
-For questions on this package contact the package Developer Damodhar Jangam at damodhar918@outlook.com
+For questions about this package, contact the package developer, Damodhar Jangam, at <damodhar918@outlook.com>
+
 ## Overview
+
 This project, Synthetic data generator plus (`sdgp`), is a Python script that generates mock data from a given configuration. It can also edit and scale existing data to produce high-volume data, which is useful for testing and prototyping.
+
 ## Features
+
 - Generate mock data for different types of configuration items
 - Edit the mock data and generate mock data for different types of configuration items
 - Configuration rules include generating unique indices, fixed or random dates/times, categorical values, float values within a range, integer values within a range, or constant values.
 - Generate high volume data
 - Save a DataFrame in CSV and Parquet file formats
+
 ## Package Installation
+
 ### Install on a Local Machine (optional)
+
 Go through the following sequence:
+
 - Clone repo
 - Create a virtual environment and install the package
+
 ```bash
 PS > python -m venv .venv
 PS > .\.venv\Scripts\activate
 PS > pip install -r requirements.txt
+PS > python setup.py install
+# the package is now installed and available as the sdgp command
+PS > sdgp -h
+# review the help output or the Usage section
+# then generate data by calling sdgp
+PS > sdgp -c m 1000000 csv test test_conf.csv
+PS > sdgp -c e 1000000 parquet test test_conf.csv
 PS > deactivate # when you need to exit
 ```
+
 ### Install on an edge node (optional)
+
 Go through the following sequence:
+
 - Clone repo
 - Create a virtual environment and install the package
+
 ```bash
 $ python3 -m venv .venv
 $ source .venv/bin/activate
 $ pip install -r requirements.txt
+$ python setup.py install
+# the package is now installed and available as the sdgp command
+$ sdgp -h
+# review the help output or the Usage section
+# then generate data by calling sdgp
+$ sdgp -c m 1000000 csv test test_conf.csv
+$ sdgp -c e 1000000 parquet test test_conf.csv
 $ deactivate # when you need to exit
 ```
+
 At that point you're good to go, and the package `Synthetic data generator plus` and its module will be
 available for use in your virtual environment.
+
 ## Usage
+
 To run the script, you need to provide some arguments:
+
 - `-c` or `--choice`: The type of function to select. `m` for mock data, `e` for edit mock data, `g` for generate high volume data.
 - `volume`: The size. An integer value that specifies how many rows to generate mock data. Recommended minimum value is more than volume size or more than 1000.
 - `format`: The type of format to save the mock data. `csv` for CSV format, `parquet` for Parquet format.
@@ -52,6 +83,8 @@ Example configuration file:
 | incometime2   | dateRange          | 2021-10-10 \| 2022-10-26 \|%Y-%m-%d %H:%M:%S            |
 | outcometime3  | dependentDateRange | incometime2\|1D\|3W\|%Y-%m-%d %H:%M:%S                  |
 | model1        | category           | Customers\|Lending\|Web_Lending                         |
+| model         | category           | Customers\|Lending\|Web_Lending \|                      |
+| gender1       | category           | 0\|1\|~0.4\|0.5\|0.1                                    |
 | probability1  | floatRange         | 0.001\|1\|3                                             |
 | float1        | floatRange         | 0.001\|0.3\|5                                           |
 | number1       | intRange           | 10\|25                                                  |
@@ -64,13 +97,15 @@ Example configuration file:
 
 ```bash
 name,type,values
-id1,uniqueIndex,800000000
+id1,uniqueIndex,800000000000000000000000000000
 date1,date,2022-10-26|%Y-%m-%d
 time1,time,00:00:00|23:59:59|%H:%M:%S
 dateRange1,dateRange,2021-10-10 | 2022-10-26|%Y-%m-%d
 incometime2,dateRange,2021-10-10 | 2022-10-26|%Y-%m-%d %H:%M:%S
 outcometime3,dependentDateRange,incometime2|1D|3W|%Y-%m-%d %H:%M:%S
-model1,category,Customers|Lending|Web_Lending||
+model1,category,Customers|Lending|Web_Lending
+model,category,Customers|Lending|Web_Lending|
+gender1,category,0|1|~0.4|0.5|0.1
 probability1,floatRange,0.001|1|3
 float1,floatRange,0.001|0.3|5
 number1,intRange,10|25
@@ -81,17 +116,22 @@ zip_code,regexPattern,([4-9]{5})
 email_address,regexPattern,"([a-zA-Z0-9]{1,10})\@[a-z]{1,5}\.(com|net|org|in)"
 compositeKey1,composite,dateRange1|model1|number1|phone_number|zip_code
 ```
+
 Explanation of above file:
-- `uniqueIndex`: This indicates that the `id1` column should contain unique and sequential values, starting from `800000000`.
+
+- `uniqueIndex`: This indicates that the `id1` column should contain unique and sequential values, here starting from `800000000000000000000000000000`.
 - `date`: This indicates that the `date1` column should contain a fixed date value (`2022-10-26`) for all rows. `%Y-%m-%d` format is used.
-- `time`: This indicates that the `time1` column should contain random time values between `00:00:00` and `23:59:59`.
-- `dateRange`: This indicates that the `dateRange1` and `incometime2` columns should contain random date values within the range from `2021-10-10` to `2022-10-26`. The format of the dates in `incometime2` also includes`%Y-%m-%d %H:%M:%S`. format reference given below.
+- `time`: This indicates that the `time1` column should contain random time values between `00:00:00` and `23:59:59`. You can pass the required format, such as `%H:%M:%S`.
+- `dateRange`: This indicates that the `dateRange1` and `incometime2` columns should contain random date values within the range from `2021-10-10` to `2022-10-26`. The format for `incometime2` also includes the time: `%Y-%m-%d %H:%M:%S`. Other format codes are listed below.
 - `dependentDateRange`: This indicates that the `outcometime3` column should contain the `incometime2` value plus a random duration between `1D` and `3W`. Here `1D` means 1 day and `3W` means 3 weeks. Other compatible inputs are `10S` for 10 seconds, `5m` for 5 minutes, `2h` for 2 hours, `3d` for 3 days, and `4W` for 4 weeks. The format for `outcometime3` is `%Y-%m-%d %H:%M:%S`. Format codes are listed below.
-- `category`: This indicates that the `model` column should contain random categorical values chosen from the options "Customers", "Lending", and "Web_Lending".
+- `category`: This indicates that the `model1` column should contain random categorical values chosen from the options "Customers", "Lending", and "Web_Lending".
+**Note 1**: To allow an empty value in the column, add `|` at the end of the values, as in `model`.
+**Note 2**: To assign probabilities to categorical values, add `~` after the values, followed by one probability per value, as in the `gender1` input `0|1|~0.4|0.5|0.1`. Here `~` is the separator between the categorical values and the probabilities: ["0", "1", ""] ~ ["0.4", "0.5", "0.1"].
 - `floatRange`: This indicates that the `probability1` and `float1` columns should contain random float values within a given range. The range for `probability1` is from `0.001` to `1`, with a precision of 3 decimal places. The range for `float1` is from `0.001` to `0.3`, with a precision of 5 decimal places.
 - `intRange`: This indicates that the `number1` column should contain random integer values within the range from 10 to 25.
 - `constant`: This indicates that the `test1` column should contain a constant value (`Done`) for all rows.
 - `regexPattern`: This indicates that the `name1` column should contain values matching the pattern `([a-z]{3,10})\, ([a-z]{3,10})` for all records. The `phone_number` column should contain phone numbers matching `(\+[4-9]{2,3})\-([1-9]{5})\-([1-9]{5})`. The `zip_code` column should contain fixed-length zip codes matching `([4-9]{5})`. The `email_address` column should contain email addresses matching `([a-zA-Z0-9]{1,10})\@[a-z]{1,5}\.(com|net|org|in)`. For more regex patterns, check [here](https://docs.python.org/3/howto/regex.html#simple-patterns) and experiment with them.
+**Note:** `regexPattern` takes a long time to generate data.
 - `composite`: This indicates that the `compositeKey1` column should contain sha256 hashed value from these combinations: `dateRange1|model1|number1|phone_number|zip_code`
 Each row in this CSV file defines a rule for generating or handling data in a specific column of another dataset. The rules include generating unique indices, fixed or random dates/times, categorical values, float values within a range, integer values within a range, or constant values.
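
To make the `category` notes concrete, here is a minimal Python sketch of how a value string such as `0|1|~0.4|0.5|0.1` could be split into options and weights and then sampled. The function name `parse_category` is hypothetical, and this is not sdgp's actual implementation; it only illustrates the syntax described above using the standard library.

```python
import random


def parse_category(spec):
    """Split a category spec into (options, weights).

    Hypothetical helper, not part of sdgp. A trailing `|` yields an
    empty option; text after `~` is read as one weight per option.
    """
    values_part, sep, weights_part = spec.partition("~")
    options = [v.strip() for v in values_part.split("|")]
    if not sep:
        return options, None  # no weights given: sample uniformly
    weights = [float(w) for w in weights_part.split("|")]
    if len(weights) != len(options):
        raise ValueError("need exactly one weight per option")
    return options, weights


options, weights = parse_category("0|1|~0.4|0.5|0.1")
# random.choices treats weights=None as a uniform distribution.
sample = random.choices(options, weights=weights, k=5)
print(options, weights, sample)
```

Note that the trailing `|` before `~` is what produces the third, empty option that the weight `0.1` applies to.
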
 Datetime formats you can use in the script:
@@ -109,41 +149,54 @@ datetime formats you can use in the script:
 - `%p`: Locale’s equivalent of either AM or PM. Example: AM
 - `%M`: Minute as a zero-padded decimal number. Example: 06
 - `%S`: Second as a zero-padded decimal number. Example: 05
+
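
As an illustration of how these format codes are applied, the sketch below formats a timestamp with Python's `datetime.strftime` and derives a dependent timestamp by adding a random offset between `1D` and `3W`, as the `dependentDateRange` rule describes. The helpers `parse_duration` and `random_offset` and the unit table are assumptions made for this example, not sdgp internals.

```python
import random
import re
from datetime import datetime, timedelta

# Unit letters as listed in the README; the mapping itself is an assumption.
UNITS = {"S": "seconds", "m": "minutes", "h": "hours",
         "d": "days", "D": "days", "W": "weeks"}


def parse_duration(text):
    """Turn a string such as '1D' or '3W' into a timedelta."""
    match = re.fullmatch(r"(\d+)([SmhdDW])", text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    amount, unit = match.groups()
    return timedelta(**{UNITS[unit]: int(amount)})


def random_offset(low, high, rng=random):
    """Pick a random whole-second offset between two durations."""
    lo = int(parse_duration(low).total_seconds())
    hi = int(parse_duration(high).total_seconds())
    return timedelta(seconds=rng.randint(lo, hi))


fmt = "%Y-%m-%d %H:%M:%S"
income = datetime.strptime("2022-06-28 21:33:32", fmt)
outcome = income + random_offset("1D", "3W")
print(income.strftime(fmt), "->", outcome.strftime(fmt))
```

The same `fmt` string is used for both parsing and formatting, which mirrors how one format value in the configuration governs a whole column.
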
 To run the script, use the following command:
+
+```text
+sdgp -c <choice> <volume> <format> <csv_file> <conf_csv_file>
+
+positional arguments:
+  volume                The size. An integer value that specifies how many rows to generate mock data. Recommended
+                        minimum value is more than volume size or more than 1000.
+  {csv,parquet}         The type of format to save the mock data. csv for CSV format, parquet for Parquet format.
+  csv_file              The CSV file name. A string value that specifies the name of the CSV file to read or write.
+  conf_csv_file         The configuration CSV file name. A string value that specifies the name of the configuration
+                        CSV file to read. This argument is required if mode is e or g.
+options:
+  -h, --help            show this help message and exit
+  -c {m,e,g}, --choice {m,e,g}
+                        The type of function to select. m for mock data, e for edit mock data, g for generate high
+                        volume data.
 ```
-# python main.py -c <choice> <volume> <format> <csv_file> <conf_csv_file>
-#
-# positional arguments:
-#   volume                The size. An integer value that specifies how many rows to generate mock data. Recommended
-#                         minimum value is more than volume size or more than 1000.
-#   {csv,parquet}         The type of format to save the mock data. csv for CSV format, parquet for Parquet format.
-#   csv_file              The CSV file name. A string value that specifies the name of the CSV file to read or write.
-#   conf_csv_file         The configuration CSV file name. A string value that specifies the name of the configuration
-#                         CSV file to read. This argument is required if mode is e or g.
-# options:
-#   -h, --help            show this help message and exit
-#   -c {m,e,g}, --choice {m,e,g}
-#                         The type of function to select. m for mock data, e for edit mock data, g for generate high
-#                         volume data.
-```
+
 For example:
+
 ```bash
-python main.py -c m 50000 csv mock_table conf.csv # Generate 50000 rows of mock data and save as mock_table_50000.csv
-python main.py -c e 100000 parquet edit_table.csv conf.csv # Along with given data can edit with conf.csv, generate 100000 recrds and save as edit_table_100000.parquet\n
-python main.py -c g 1000000 csv scale.csv # Generate 1000000 rows of mock data by scaling existing data and save as scale_1000000.csv
+sdgp -c m 50000 csv mock_table conf.csv # Generate 50000 rows of mock data and save as mock_table_50000.csv
+sdgp -c e 100000 parquet edit_table.csv conf.csv # Edit the given data using conf.csv, generate 100000 records and save as edit_table_100000.parquet
+sdgp -c g 1000000 csv scale.csv # Generate 1000000 rows of mock data by scaling existing data and save as scale_1000000.csv
 ```
-Sample output for `python .\main.py -c m 1000000 csv test .\test_conf.csv `:
-![image.png](./confluence/145434.png)
+
+Sample output for `sdgp -c m 1000000 csv test .\test_conf.csv`:
+
+![image.png](./docs/235759.jpg)
+
 ```bash
-id1,date1,model1,probability1,float1,number1,test1,time1,dateRange1,incometime2,outcometime3,name1,phone_number,zip_code,email_address,compositeKey1
-800000004,2022-10-26,,0.792,0.14948,12,Done,11:34:20,2022-04-07,2022-06-28 21:33:32,2022-07-03 09:41:10,"gkxtawx, pfuf",+65-67845-69497,65957,ji8et6@u.net,c05b0a767331f3176ec3cdf3dee852759a858e30
-800000001,2022-10-26,Lending,0.442,0.11305,24,Done,06:01:02,2022-06-18,2022-07-04 01:51:18,2022-07-20 04:31:45,"ttjwjy, zesc",+48-89997-49658,78754,YYChHbaJD@oid.com,ac8759aac34e718dad0ef46c62edb5bff07cb003
-800000009,2022-10-26,Lending,0.267,0.17349,17,Done,08:43:08,2022-01-31,2021-12-11 02:33:20,2021-12-19 22:29:15,"vlflyewer, ilj",+564-44495-77467,98785,3mjDBVliLT@ydbpg.com,c068d7d1a8d5e1c6527f84246d8b9dc911b52884
-800000003,2022-10-26,,0.565,0.20937,11,Done,02:52:08,2022-04-27,2022-10-25 22:21:19,2022-11-15 16:22:14,"orkilkzh, xozrfwwrtq",+88-95566-65789,68677,Ulq@u.org,d51540c711301c6badc2aad051bb048fd175201b
+id1,date1,model1,model,gender1,probability1,float1,number1,test1,time1,dateRange1,incometime2,outcometime3,name1,phone_number,zip_code,email_address,compositeKey1
+800000000000000000000000022022,2022-10-26,Lending,,1,0.526,0.01349,18,Done,09:17:27,2021-12-21,2021-10-11 19:45:49,2021-10-29 15:22:51,"xfdqsmj, pfnzgqd",+69-68479-47968,45568,S@euly.in,5291fed2490313181144993e6f9d0e478a774cbe
+800000000000000000000000069854,2022-10-26,Lending,Web_Lending,1,0.466,0.13702,23,Done,13:06:09,2022-06-16,2022-05-06 05:33:04,2022-05-15 23:17:00,"spjcnaumo, fmxd",+769-58564-59786,74648,6HzAItG@rcfsb.com,553d8e0a445569f8c329c4f2cad5bd0a217e2cf8
+800000000000000000000000052417,2022-10-26,Lending,Customers,1,0.474,0.07092,15,Done,13:27:22,2022-05-12,2022-06-09 00:01:55,2022-06-20 04:30:03,"wnzebd, xuhqai",+99-88586-45856,49977,c0@u.in,65d22a12c4c95d2d14615f9d5b4c6582cd60c45f
+800000000000000000000000068698,2022-10-26,Customers,Web_Lending,0,0.012,0.12498,23,Done,00:19:00,2022-09-15,2022-04-25 08:29:39,2022-05-14 14:29:06,"kccxqzujf, aqitzbuj",+47-86496-46488,75598,1h4xIF@dx.in,f2c1cd5e87cc1a5ed0cf800bcb7228c3c4f621cb
 ```
-![image.png](./confluence/232516.png)
+
+![Output image](./docs/000610.jpg)
+[Sample output](./tests/test_assect/test_data.csv)
+
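
One way to sanity-check generated output against the configuration is to parse it and assert the documented ranges. The sketch below does this for the first sample row shown above, using only the Python standard library; the `check_row` name and the particular checks chosen are illustrative assumptions, not part of sdgp.

```python
import csv
import io
import re
from datetime import datetime

HEADER = ("id1,date1,model1,model,gender1,probability1,float1,number1,test1,"
          "time1,dateRange1,incometime2,outcometime3,name1,phone_number,"
          "zip_code,email_address,compositeKey1")
ROW = ('800000000000000000000000022022,2022-10-26,Lending,,1,0.526,0.01349,'
       '18,Done,09:17:27,2021-12-21,2021-10-11 19:45:49,2021-10-29 15:22:51,'
       '"xfdqsmj, pfnzgqd",+69-68479-47968,45568,S@euly.in,'
       '5291fed2490313181144993e6f9d0e478a774cbe')


def check_row(row):
    """Return a list of rule violations for one generated row."""
    problems = []
    if not 10 <= int(row["number1"]) <= 25:
        problems.append("number1 outside intRange 10|25")
    if not 0.001 <= float(row["probability1"]) <= 1:
        problems.append("probability1 outside floatRange 0.001|1")
    if row["model"] not in {"Customers", "Lending", "Web_Lending", ""}:
        problems.append("model is not an allowed category")
    if re.fullmatch(r"[4-9]{5}", row["zip_code"]) is None:
        problems.append("zip_code does not match its regexPattern")
    fmt = "%Y-%m-%d %H:%M:%S"
    gap = (datetime.strptime(row["outcometime3"], fmt)
           - datetime.strptime(row["incometime2"], fmt))
    if not 1 <= gap.total_seconds() / 86400 <= 21:
        problems.append("outcometime3 is not 1D to 3W after incometime2")
    return problems


# csv.DictReader handles the quoted name1 field, which contains a comma.
reader = csv.DictReader(io.StringIO(HEADER + "\n" + ROW + "\n"))
for row in reader:
    print(row["id1"], check_row(row) or "ok")
```
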
 ## License
+
 This project is licensed under the MIT License - see the [LICENSE](./LICENSE) file for details.
+
 ## Acknowledgments
-If you have any questions, feedback, or suggestions, please feel free to contact me at damodhar918@outlook.com. You can also open an issue or submit a pull request on GitHub if you want to contribute to this project.
+
+If you have any questions, feedback, or suggestions, please feel free to contact me at <damodhar918@outlook.com>. You can also open an issue or submit a pull request on GitHub if you want to contribute to this project.
 I hope you find this project useful and interesting. Thank you for reading! 😊
diff --git a/docs/000610.jpg b/docs/000610.jpg
diff --git a/docs/235759.jpg b/docs/235759.jpg