
# Section 6: Python example - compliance checks for data projects

Compliance with legal frameworks is critical for maintaining data privacy, security, and trust in data projects. This section walks through a Python example of a basic compliance check that flags potential issues in how data is stored, processed, and handled. The script audits a dataset against two GDPR principles: data minimization and retention.

1. Setting Up the Environment:

The checks rely on pandas for data manipulation and on Python's built-in logging and datetime modules. Only pandas needs to be installed; logging and datetime ship with the standard library and should not be installed with pip:

In [None]:
pip install pandas

2. Importing Required Libraries:

Import the necessary libraries. We will use pandas for data manipulation, logging for recording the outcome of our compliance checks, and datetime for date arithmetic.

In [None]:
import pandas as pd
import logging
import datetime


# Set up logging; force=True (Python 3.8+) replaces handlers that a notebook environment may have installed already

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s', force=True)
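
Messages logged to a notebook's output are lost when the session ends, so an audit trail is often kept in a file as well. The following is a minimal sketch of that idea using the standard library's `logging.FileHandler`. The logger name `compliance_audit`, the helper `build_audit_logger`, and the temporary log path are all hypothetical choices made for illustration.

```python
import logging
import os
import tempfile


def build_audit_logger(log_path):
    """Return a logger that appends compliance messages to log_path."""
    logger = logging.getLogger("compliance_audit")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    # Avoid attaching duplicate handlers when a notebook cell is re-run
    if not logger.handlers:
        handler = logging.FileHandler(log_path, encoding="utf-8")
        handler.setFormatter(
            logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")
        )
        logger.addHandler(handler)
    return logger


# Write to a temporary directory here; a real project would choose a durable location
log_path = os.path.join(tempfile.mkdtemp(), "compliance_audit.log")
audit_logger = build_audit_logger(log_path)
audit_logger.warning("Example finding: unnecessary column 'email'")

# Read the file back to confirm the message was recorded
with open(log_path, encoding="utf-8") as log_file:
    log_text = log_file.read()
print(log_text)
```

The compliance functions could then call `audit_logger.warning(...)` instead of `logging.warning(...)` if a persistent record is wanted.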

3. Loading and Preparing Data:

For this example, let's assume we have a dataset that includes personal data collected from users. We'll create a sample DataFrame to simulate this:

In [None]:
# Create a sample DataFrame
data = pd.DataFrame({
    'user_id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'email': ['alice@example.com', 'bob@example.com', 'charlie@example.com', 'david@example.com'],
    'data_collected_date': ['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04'],
    'last_login_date': ['2023-07-01', '2023-07-02', '2023-06-15', '2023-07-01'],
})
# Convert date columns to datetime format
data['data_collected_date'] = pd.to_datetime(data['data_collected_date'])
data['last_login_date'] = pd.to_datetime(data['last_login_date'])
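
Real datasets often contain malformed dates, which would make `pd.to_datetime` raise an error before any check runs. As a minimal sketch, assuming you would rather flag bad values than stop, `errors="coerce"` turns unparseable entries into `NaT` so they can be counted and logged. The `raw` DataFrame and the `parse_dates_with_report` helper are hypothetical and exist only for illustration.

```python
import logging

import pandas as pd


def parse_dates_with_report(df, date_columns):
    """Return a copy of df with date_columns parsed, plus a count of bad values.

    Hypothetical helper for illustration; not part of the original script.
    """
    parsed = df.copy()
    bad_counts = {}
    for column in date_columns:
        # errors="coerce" converts values that cannot be parsed into NaT
        converted = pd.to_datetime(parsed[column], errors="coerce")
        # Count values that were present in the input but failed to parse
        n_bad = int((converted.isna() & parsed[column].notna()).sum())
        if n_bad:
            logging.warning("%d unparseable value(s) in column %r", n_bad, column)
        bad_counts[column] = n_bad
        parsed[column] = converted
    return parsed, bad_counts


# Illustrative input with one malformed date
raw = pd.DataFrame({
    "user_id": [1, 2, 3],
    "data_collected_date": ["2021-01-01", "not a date", "2021-01-03"],
})
parsed, bad_counts = parse_dates_with_report(raw, ["data_collected_date"])
print(bad_counts)
```

Rows with `NaT` in a date column cannot be evaluated by a retention check, so they are worth reviewing separately rather than being treated as compliant.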

4. Defining Compliance Rules:

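We need to define some GDPR compliance rules. Two principles from Article 5 of the GDPR are used here: data minimization (Article 5(1)(c)), which limits personal data to what is necessary for its purpose, and storage limitation (Article 5(1)(e)), which requires that personal data not be retained longer than necessary.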

In [None]:
def check_data_minimization(data):
    """
    Check for unnecessary columns that should not be stored.
    """
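    # Illustrative allow-list of columns the project has a documented purpose for
    necessary_columns = {'user_id', 'name', 'last_login_date'}
    extra_columns = set(data.columns) - necessary_columns

    if extra_columns:
        # Sort so the log message is stable between runs
        logging.warning(f'Unnecessary data columns found: {sorted(extra_columns)}')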
    else:
        logging.info("Data minimization check passed.")

def check_data_retention(data, retention_period_years=2):
    """
    Check that no data is older than the retention period.
    """
    # Use a calendar-aware offset so leap years do not shift the cutoff
    cutoff_date = pd.Timestamp(datetime.datetime.now()) - pd.DateOffset(years=retention_period_years)
    old_data = data[data['data_collected_date'] < cutoff_date]

    if not old_data.empty:
        logging.warning(f"Data older than retention period found: {len(old_data)} record(s).")
    else:
        logging.info("Data retention check passed.")
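
The two functions above only log a message, which makes a finding hard to act on. A possible variant, sketched here on the assumption that callers want the offending columns and user IDs back, returns them as plain Python values. `find_extra_columns` and `find_expired_user_ids` are hypothetical names. The `as_of` parameter is also an addition; it fixes the reference date so results are repeatable.

```python
import pandas as pd


def find_extra_columns(df, necessary_columns):
    """Return the columns of df that are not in the allow-list, sorted."""
    return sorted(set(df.columns) - set(necessary_columns))


def find_expired_user_ids(df, retention_period_years=2, as_of=None):
    """Return user_ids whose data_collected_date is before the retention cutoff.

    as_of is a hypothetical parameter that stands in for "today".
    """
    as_of = pd.Timestamp.now() if as_of is None else pd.Timestamp(as_of)
    cutoff_date = as_of - pd.DateOffset(years=retention_period_years)
    expired = df.loc[df["data_collected_date"] < cutoff_date, "user_id"]
    return expired.tolist()


# Illustrative data: one record outside and one inside a two-year window
sample = pd.DataFrame({
    "user_id": [1, 2],
    "email": ["a@example.com", "b@example.com"],
    "data_collected_date": pd.to_datetime(["2021-01-01", "2023-06-01"]),
})

extra = find_extra_columns(sample, {"user_id", "data_collected_date"})
expired = find_expired_user_ids(sample, retention_period_years=2, as_of="2024-01-01")
print(extra, expired)
```

Returning values rather than only logging also makes the checks straightforward to cover with unit tests.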


5. Performing Compliance Checks:

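Now run both checks against the sample dataset. With the data above, each check logs a warning: the email and data_collected_date columns are outside the allow-list, and every record was collected in 2021, which is more than two years ago: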

In [None]:
check_data_minimization(data)
check_data_retention(data)
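
When the checks log warnings, one possible follow-up is to produce a reduced copy of the data. The sketch below assumes that dropping columns outside the allow-list and filtering out expired rows is an acceptable remedy, which is a policy decision the checks cannot make for you. `apply_minimization_and_retention` and its `as_of` parameter are hypothetical.

```python
import pandas as pd


def apply_minimization_and_retention(df, necessary_columns, retention_period_years=2, as_of=None):
    """Return a reduced copy of df: expired rows removed, extra columns dropped.

    Hypothetical helper; as_of stands in for "today" so the result is repeatable.
    """
    as_of = pd.Timestamp.now() if as_of is None else pd.Timestamp(as_of)
    cutoff_date = as_of - pd.DateOffset(years=retention_period_years)
    # Filter rows first, because the collection date may be dropped afterwards
    within_retention = df[df["data_collected_date"] >= cutoff_date]
    # Keep allow-listed columns in their original order
    keep = [column for column in df.columns if column in necessary_columns]
    return within_retention[keep].reset_index(drop=True)


# Illustrative records: the first is outside a two-year window, the second inside
records = pd.DataFrame({
    "user_id": [1, 2],
    "name": ["Alice", "Bob"],
    "email": ["alice@example.com", "bob@example.com"],
    "data_collected_date": pd.to_datetime(["2021-01-01", "2023-06-01"]),
    "last_login_date": pd.to_datetime(["2023-07-01", "2023-07-02"]),
})
cleaned = apply_minimization_and_retention(
    records, {"user_id", "name", "last_login_date"}, as_of="2024-01-01"
)
print(cleaned)
```

The function leaves the input DataFrame untouched, so the original data remains available for review until deletion is confirmed.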

6. Conclusion:

This Python script provides a basic framework for performing compliance checks in data projects. Automating such checks helps organizations spot data handling practices that conflict with legal standards like GDPR, although passing them does not by itself establish compliance. The checks can be expanded to cover specific organizational or regulatory requirements.
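
As one example of extending the checks, the sample data already contains a `last_login_date` column that neither function uses. The sketch below flags accounts with no login inside a chosen window so they can be reviewed. The 12-month threshold, the `as_of` parameter, and the `find_inactive_user_ids` name are assumptions for illustration; GDPR does not prescribe a specific period.

```python
import pandas as pd


def find_inactive_user_ids(df, inactivity_months=12, as_of=None):
    """Return user_ids whose last_login_date is older than the inactivity window.

    Hypothetical check; the threshold is a project-level policy choice.
    """
    as_of = pd.Timestamp.now() if as_of is None else pd.Timestamp(as_of)
    cutoff_date = as_of - pd.DateOffset(months=inactivity_months)
    inactive = df.loc[df["last_login_date"] < cutoff_date, "user_id"]
    return inactive.tolist()


# Illustrative data: only the third user last logged in before the cutoff
users = pd.DataFrame({
    "user_id": [1, 2, 3],
    "last_login_date": pd.to_datetime(["2023-07-01", "2023-06-15", "2022-03-10"]),
})
inactive = find_inactive_user_ids(users, inactivity_months=12, as_of="2023-08-01")
print(inactive)
```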