VerifyData

VerifyData is a Python package for running data quality checks on PySpark DataFrames. It checks for null values, uniqueness, value ranges, allowed values, and schema conformance, so that problems are caught before the data is processed further. The package is aimed at ETL workflows, data pipelines, and any other setting where downstream steps depend on clean data.

Features

Null Value Check: Ensures that specified columns do not contain null values.
Uniqueness Check: Verifies that the values in a specified column are unique.
Range Check: Checks if values in a column fall within a specified range.
Valid Values Check: Confirms that values in a column belong to a list of predefined valid values.
Schema Validation: Validates the presence of required columns and their data types.
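
To make the meaning of each check concrete, here is a minimal plain-Python sketch over rows held as dictionaries. It is not VerifyData's implementation (the package works on PySpark DataFrames), every function name below is hypothetical, and the treatment of nulls in the range and valid-values checks is an assumption.

```python
# Illustration only: hypothetical helpers, not part of the VerifyData API.
rows = [
    {"id": 1, "name": "John", "age": 25, "gender": "M"},
    {"id": 2, "name": "Jane", "age": None, "gender": "F"},
    {"id": 3, "name": "Alice", "age": 35, "gender": None},
]


def count_nulls(rows, column):
    """Null value check: number of rows where the column is None."""
    return sum(1 for r in rows if r[column] is None)


def is_unique(rows, column):
    """Uniqueness check: True if no value in the column repeats."""
    values = [r[column] for r in rows]
    return len(values) == len(set(values))


def count_out_of_range(rows, column, low, high):
    """Range check: non-null values outside [low, high].
    Assumption: nulls are skipped rather than counted as out of range."""
    return sum(
        1 for r in rows
        if r[column] is not None and not (low <= r[column] <= high)
    )


def count_invalid(rows, column, valid_values):
    """Valid values check: values not in the allowed list.
    Assumption: None counts as invalid unless it is listed explicitly."""
    return sum(1 for r in rows if r[column] not in valid_values)


def missing_columns(rows, expected_columns):
    """Schema check (presence): expected columns absent from the data."""
    present = set(rows[0]) if rows else set()
    return sorted(set(expected_columns) - present)


def type_mismatches(rows, expected_types):
    """Schema check (types): columns holding a non-null value of the wrong type."""
    return sorted(
        col for col, typ in expected_types.items()
        if any(r.get(col) is not None and not isinstance(r[col], typ) for r in rows)
    )


print(count_nulls(rows, "age"))                   # null check on age
print(is_unique(rows, "id"))                      # uniqueness check on id
print(count_out_of_range(rows, "age", 20, 40))    # range check on age
print(count_invalid(rows, "gender", ["M", "F"]))  # valid values check on gender
```

Whether a null should fail a range or valid-values check is a design decision; the sketch picks one convention for each, and the package may differ.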

Installation

To install VerifyData, clone the repository and install the dependencies:

pip install -r requirements.txt

Usage

Here’s how to use VerifyData:

Initialize SparkSession:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DataQualityCheckerExample").getOrCreate()

Define DataFrame and Expected Schema:

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

data = [(1, "John", 25, "M"), (2, "Jane", None, "F"), (3, "Alice", 35, None)]
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("gender", StringType(), True)
])
df = spark.createDataFrame(data, schema)

expected_schema = {
    "id": IntegerType(),
    "name": StringType(),
    "age": IntegerType(),
    "gender": StringType()
}

Initialize VerifyData:

from verify_data.verify import VerifyData

dq_checker = VerifyData(df, expected_schema)

Run Checks:

dq_checker.check_null_values()
dq_checker.check_uniqueness("id")
dq_checker.check_value_range("age", 20, 40)
dq_checker.check_valid_values("gender", ["M", "F"])
dq_checker.check_column_presence()
dq_checker.check_column_data_types()
 
results_df = dq_checker.run_checks()
results_df.show(truncate=False)

Example Output

The output will be a DataFrame containing a summary of each check, with details on whether the check passed or failed:

Check                                  Passed  Details
Null check on column age               False   1 null value found
Uniqueness check on column id          True    All values are unique
Range check for column age             True    All values within range
Valid values check for column gender   False   1 invalid value found
Schema column presence check           True    All necessary columns are present
Data type check for column age         True    Data type matches
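
A pipeline will often want to stop when any check fails. The sketch below shows one way to do that over the result rows. It assumes the results have columns named Check, Passed, and Details, as in the example output, and that Passed is a boolean; the helper names `failed_checks` and `assert_quality` are hypothetical and not part of VerifyData.

```python
# Sample rows mirroring the example output; assumed column names and types.
sample_rows = [
    {"Check": "Null check on column age", "Passed": False,
     "Details": "1 null value found"},
    {"Check": "Uniqueness check on column id", "Passed": True,
     "Details": "All values are unique"},
    {"Check": "Valid values check for column gender", "Passed": False,
     "Details": "1 invalid value found"},
]


def failed_checks(result_rows):
    """Return the rows whose Passed flag is false."""
    return [row for row in result_rows if not row["Passed"]]


def assert_quality(result_rows):
    """Raise ValueError listing every failed check; do nothing if all passed."""
    failures = failed_checks(result_rows)
    if failures:
        summary = "; ".join(f"{r['Check']}: {r['Details']}" for r in failures)
        raise ValueError(f"{len(failures)} data quality check(s) failed: {summary}")


# With PySpark, the rows could be taken from the results DataFrame, e.g.:
# result_rows = [row.asDict() for row in results_df.collect()]
for row in failed_checks(sample_rows):
    print(row["Check"], "->", row["Details"])
```

Collecting the results to the driver is reasonable here because the results DataFrame holds one row per check, not one row per data record.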

Running Tests

Run the unit tests with unittest to verify that VerifyData behaves as expected:

python -m unittest discover
