# How to Use

1. Run everything in the **Setup** section. 
    - Make sure to change the working directory to **your** working directory. The code for this is already there.
    - Make sure the Excel document for logging the scores also exists in your working directory, and that the file name is correct.

2. Decide *whether each test needs to be run*, based on a good understanding of what the test does.
    - Please refer to this document [here](https://086gc.sharepoint.com/:x:/r/sites/PacificSalmonTeam/Shared%20Documents/General/02%20-%20PSSI%20Secretariat%20Teams/04%20-%20Strategic%20Salmon%20Data%20Policy%20and%20Analytics/02%20-%20Data%20Governance/00%20-%20Projects/10%20-%20Data%20Quality/Presentation/DQP%20Demo.xlsx?d=wc15abe6743954df980a05f09fe99a560&csf=1&web=1&e=CJeb6h)

3. Some requirements for the datasets:
    - The data must be on the **first sheet** in the Excel document.
    - The **first row** must be the column names. 
    - The test won't run if the Excel file is open, so close it first.

4. After running all the tests, the Excel document for logging the scores can be uploaded to SharePoint using the `copy_log_file` function in the "Saving the file to sharepoint" section.

Note: The Output Reports are for cases where a data steward asks why their dataset received a certain score. If a metric is not covered in Output Reports, running the test itself generates output that can be put into a report.
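
Because an open workbook blocks the tests (step 3), it can help to check for one before running them. The sketch below is not part of the notebook's utilities. It assumes that Excel places a hidden owner file named `~$<file name>` next to a workbook while it is open, and the helper name `excel_file_appears_open` is hypothetical.

```python
import os


def excel_file_appears_open(path):
    """Return True if an Excel owner file (~$<name>) sits next to `path`.

    Assumption: Excel creates this hidden file while the workbook is open.
    """
    folder, file_name = os.path.split(os.path.abspath(path))
    return os.path.exists(os.path.join(folder, "~$" + file_name))


# Example (hypothetical file name):
# if excel_file_appears_open("DQS_Log_Test.xlsx"):
#     print("Close the workbook in Excel before running the tests.")
```

A leftover owner file from a crashed Excel session would give a false positive, so treat the result as a hint rather than a guarantee.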

# Setup

Please run everything in this section, and double-check the working directory so that the data can be read from it.

All of these functions are used in the process of calculating data quality. 

In [129]:
from IPython import get_ipython

# Clear memory
get_ipython().run_line_magic('reset', '-sf')


In [134]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import os
import re
from difflib import SequenceMatcher
from datetime import datetime
import nbformat
import gc
import traceback
import nbimporter
import utils
import sys

In [136]:
gc.collect()

0

Support functions to allow running cells from other notebooks.

In [137]:
def run_selected_cells(notebook_path, cell_indices):
    """Execute the code cells at the given indices, using this notebook's globals."""
    # Load the notebook
    with open(notebook_path, encoding="utf-8") as f:
        nb = nbformat.read(f, as_version=4)

    # Get the cells to run
    selected_cells = [nb.cells[i] for i in cell_indices]

    # Execute the selected cells, skipping markdown and raw cells
    for cell in selected_cells:
        if cell.cell_type == 'code':
            exec(cell.source, globals())


def run_selected_cells_from_util(util_folder, notebook_name, cell_indices):
    """Run selected cells from a notebook stored in the utilities folder."""
    notebook_path = os.path.join(util_folder, notebook_name)
    run_selected_cells(notebook_path, cell_indices)
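
The cell indices passed to these functions count every cell in the notebook, including markdown cells, so it is easy to pick the wrong one. The following sketch lists the index and first line of each code cell so the indices can be checked. It is an illustration only: the helper name `list_code_cells` is hypothetical, and it assumes the notebook is stored in the nbformat 4 JSON layout (a top-level `cells` list).

```python
import json


def list_code_cells(notebook_path):
    """Return (index, first line) for every code cell in a notebook file."""
    with open(notebook_path, encoding="utf-8") as f:
        nb = json.load(f)

    summary = []
    for index, cell in enumerate(nb.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        # The source is stored either as one string or as a list of lines
        text = "".join(source) if isinstance(source, list) else source
        stripped = text.strip()
        first_line = stripped.splitlines()[0] if stripped else ""
        summary.append((index, first_line))
    return summary


# Example (path as used elsewhere in this notebook):
# for index, first_line in list_code_cells("utils/dq_utils.ipynb"):
#     print(index, first_line)
```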

Make sure to set the correct working directory.

In [138]:
# Change working directory to the same place where you saved the test datasets
# os.chdir('C:/Users/luos/OneDrive - DFO-MPO/Python') #change directory
os.getcwd()  # check where the directory is (and whether the change was successful or not)
GLOBAL_USER = "LuoS"
GLOBAL_DATASET = "11- NuSEDS All_Areas" # the folder that stores all the files related to this data
GLOBAL_DATASET_PATH = "ForPortal_20250318" # the actual folder the data is in
GLOBAL_DATAFILE = "All Areas NuSEDS.csv" # dataset file name

Load the following functions from dq_utils.ipynb:
* read_data(dataset_path) - Function to read either csv or xlsx data
* log_score(test_name, dataset_name, score, selected_columns, threshold) - Function to log the scores into an xlsx file (already created, existing)
* get_dataset_name(dataset_path) - Function to extract dataset name from a path
* Global variables for console output colours

In [139]:
run_selected_cells_from_util('utils', 'dq_utils.ipynb', [2, 4, 6, 8])

# Data Quality Tests

### Consistency

#### Consistency Type 1 (C1)

Calculate the consistency score of a dataset.

This code is best run on CSV data where the column names are in the first row. It also accepts xlsx files, but if the workbook has more than one sheet, it only reads data from the first one.

Limitations: It will not detect differences in capitalization of the same word, since all words are converted to lower case before the similarity score is calculated.
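
The limitation can be seen in a small sketch. It does not reproduce `process_and_calculate_similarity`; it only assumes, as stated above, that values are lower-cased before comparison, and it uses `difflib.SequenceMatcher` as a stand-in similarity measure. The `similarity` helper is hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b, lowercase=True):
    """Similarity ratio between two strings, optionally ignoring case."""
    if lowercase:
        a, b = a.lower(), b.lower()
    return SequenceMatcher(None, a, b).ratio()


# Without lower-casing, the capitalization difference lowers the ratio
raw = similarity("Chinook Salmon", "chinook salmon", lowercase=False)

# With lower-casing, the two spellings become indistinguishable
lowered = similarity("Chinook Salmon", "chinook salmon")

print(raw < 1.0, lowered == 1.0)
```

If inconsistent capitalization matters for a dataset, it needs to be checked separately from this test.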

In [142]:
# Run utils for C1 
run_selected_cells_from_util('utils', 'consistancy_utils.ipynb', [2])    

##### Test the dataset by changing the path

In [None]:
try: 
    datafilepath = f"C:/Users/{GLOBAL_USER}/OneDrive - DFO-MPO/04 - Strategic Salmon Data Policy and Analytics/09 - Pacific Salmon Data Portal/14_Official_Launch/03_Datasets/New Datasets/{GLOBAL_DATASET}/{GLOBAL_DATASET_PATH}/{GLOBAL_DATAFILE}"
    # Test Consistency Calculations

    processed_df = process_and_calculate_similarity(
        dataset_path=datafilepath,
        column_names=["project_name"],
        threshold=0.91
    )

    # processed_df['Overall Consistency Score'].min()
    print(processed_df)
except MemoryError as e:
    print(f'{RED}Dataset is too large for this test, out of memory!{RESET}')
    print(f'Error: {e}')
except KeyError as e:
    print(f'{RED}Issue with column names, are you sure you entered them correctly?{RESET}')
    print(f'Column name that fails: {e}')
    print(f'List of all detected column names: {list(read_data(datafilepath).columns)}')
except FileNotFoundError as e:
    print(f'{RED}Did not find dataset, make sure you have provided the correct name.{RESET}')
    print(f'Error: {e}')


#### Consistency Type 2 (C2)

Calculate the consistency score of a dataset against a reference list.

The values in the compared columns should match the reference list exactly: any value whose best similarity to a reference entry falls below the threshold counts against the score.
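
To illustrate how strict the comparison becomes at `threshold=1`, here is a dependency-free sketch. It is a stand-in for the `CountVectorizer` and `cosine_similarity` calls used in the notebook code, not a copy of them: the tokenizer only approximates `CountVectorizer`'s defaults, the function names are hypothetical, and the sample values are made up.

```python
import math
import re
from collections import Counter


def token_counts(text):
    # Approximates CountVectorizer defaults: lower-case, tokens of 2+ word characters
    return Counter(re.findall(r"\b\w\w+\b", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values()) * sum(v * v for v in b.values()))
    # Rounded to two decimals, as the notebook code does
    return round(dot / norm, 2) if norm else 0.0


def share_matching_reference(values, reference, threshold=0.91):
    """Share of values whose best match in the reference list reaches the threshold."""
    ref_counts = [token_counts(r) for r in reference]
    hits = 0
    for value in values:
        counts = token_counts(value)
        best = max((cosine(counts, r) for r in ref_counts), default=0.0)
        if best >= threshold:
            hits += 1
    return hits / len(values)


reference = ["Fraser River Sockeye", "Skeena River Chinook"]  # made-up reference entries
values = ["Fraser River Sockeye", "Fraser Sockeye", "Skeena River Chinook"]
score = share_matching_reference(values, reference, threshold=1)
print(score)  # two of the three values match a reference entry exactly
```

Note that a token-count comparison ignores word order, case, and punctuation, so a similarity of 1 means "same words", not necessarily "same string".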

In [162]:
# Run utils for C2
run_selected_cells_from_util('utils', 'consistancy_utils.ipynb', [5]) 

##### Test the dataset by changing the path

In [165]:
# Function 1: Get names used for a single column
def get_names_used_for_column(df, column_name):
    unique_observations = pd.unique(df[column_name].dropna().values.ravel())
    return unique_observations


# Function 2: Calculate Cosine Similarity
def calculate_cosine_similarity(text_list, ref_list, Stop_Words):
    count_vectorizer = CountVectorizer(stop_words=Stop_Words)
    ref_vec = count_vectorizer.fit_transform(ref_list).todense()
    ref_vec_array = np.array(ref_vec)
    text_vec = count_vectorizer.transform(text_list).todense()
    text_vec_array = np.array(text_vec)
    cosine_sim = np.round((cosine_similarity(text_vec_array, ref_vec_array)), 2)
    return cosine_sim


# Function 3: Average Consistency Score
def average_consistency_score(cosine_sim_df, threshold=0.91):
    num_rows, num_columns = cosine_sim_df.shape
    total_count = 0  # This will count all values above or equal to the threshold

    for i in range(num_rows):
        if np.max(cosine_sim_df[i]) >= threshold:  # Include all comparisons
            total_count += 1
    total_observations = num_rows  # Total number of observations
    average_consistency_score = total_count / total_observations
    return average_consistency_score


def process_and_calculate_similarity_ref(
    dataset_path,
    column_mapping,
    ref_dataset_path=None,
    threshold=0.91,
    Stop_Words=["activity"],  # CountVectorizer expects a list, "english", or None
):
    
    # Variables that prepare for output reports
    errors = None
    test_fail_comment = None
    all_consistency_scores = None  # Ensure it exists even if errors occur
    
    try:
        # Read the data file
        df = read_data(dataset_path)

        # Initialize ref_df if a ref dataset is provided
        if ref_dataset_path:
            df_ref = read_data(ref_dataset_path)
            ref_data = True  # Flag to indicate we are using a ref dataset
        else:
            ref_data = False  # No ref dataset, compare within the same dataset

        all_consistency_scores = []

        for selected_column, m_selected_column in column_mapping.items():
            if ref_data:
                # Compare to ref dataset
                unique_observations = get_names_used_for_column(df_ref, m_selected_column)
            else:
                # Use own column for comparison
                unique_observations = get_names_used_for_column(df, selected_column)

            cosine_sim_matrix = calculate_cosine_similarity(
                df[selected_column].dropna(), unique_observations, Stop_Words=Stop_Words
            )
            column_consistency_score = average_consistency_score(
                cosine_sim_matrix, threshold
            )
            all_consistency_scores.append(column_consistency_score)

    except Exception:
        # Capture any unexpected error
        errors = traceback.format_exc().strip()
        test_fail_comment = errors # not sure how to make it say the first part of the error like with the memory issue.
    
    # Calculate the average of all consistency scores
    overall_avg_consistency = (
        sum(all_consistency_scores) / len(all_consistency_scores)
        if all_consistency_scores
        else None
    )

    # log the results
    log_score(
        test_name="Consistency (C2)",
        dataset_name=get_dataset_name(dataset_path),
        selected_columns=column_mapping,
        threshold=threshold,
        score=overall_avg_consistency,
    )

    # output report of results
    output_log_score(
        test_name = "C2", 
        dataset_name = get_dataset_name(dataset_path), 
        score = overall_avg_consistency, 
        selected_columns=column_mapping, 
        new_or_existing = "existing", 
        test_fail_comment = test_fail_comment, 
        errors = errors, 
        dimension = "Consistency", 
        threshold= threshold)
    

    return overall_avg_consistency

In [169]:
# The pattern for comparison is 'dataset column': 'reference column'
column_mapping = {
    "FULL_CU_IN": "Full Conservation Unit Index",
}

datafilepath = f"C:/Users/{GLOBAL_USER}/OneDrive - DFO-MPO/04 - Strategic Salmon Data Policy and Analytics/09 - Pacific Salmon Data Portal/14_Official_Launch/03_Datasets/New Datasets/{GLOBAL_DATASET}/{GLOBAL_DATASET_PATH}/{GLOBAL_DATAFILE}"

ref_datafilepath = f"C:/Users/{GLOBAL_USER}/OneDrive - DFO-MPO/04 - Strategic Salmon Data Policy and Analytics/09 - Pacific Salmon Data Portal/14_Official_Launch/03_Datasets/New Datasets/01 - Pacific Salmon Unit Crosswalk/CrossWalkData_2025-03-14.csv"

c2_score = process_and_calculate_similarity_ref(
    dataset_path=datafilepath,
    column_mapping=column_mapping,
    ref_dataset_path=ref_datafilepath,
    threshold=1,
    Stop_Words=[""],
)

print(c2_score)



None


### Accuracy

#### Accuracy Type 1 (A1, Mixed Data Types, Symbols in Numerics) 

Test whether there are symbols in numerics
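
As an illustration of what "symbols in numerics" can mean, the sketch below scores a column by the share of non-blank values that are plain numbers. This is one plausible reading, not the logic of `accuracy_score` in `accuracy_utils.ipynb`; the pattern, the helper name `share_clean_numeric`, and the sample values are all assumptions.

```python
import re

# Optional sign, digits, optional decimal part; anything else counts as "not clean"
NUMERIC_PATTERN = re.compile(r"^[+-]?\d+(\.\d+)?$")


def share_clean_numeric(values):
    """Fraction of non-blank values that contain no symbols or letters."""
    non_blank = [str(v).strip() for v in values if v is not None and str(v).strip() != ""]
    if not non_blank:
        return None  # nothing to score
    clean = sum(1 for v in non_blank if NUMERIC_PATTERN.match(v))
    return clean / len(non_blank)


costs = ["1200", "350.50", "$4,000", "n/a", ""]  # made-up sample values
print(share_clean_numeric(costs))  # 0.5: two clean values out of four non-blank ones
```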

In [47]:
# Run utils for A1
run_selected_cells_from_util('utils', 'accuracy_utils.ipynb', [2]) 

##### Test the dataset by changing the path

In [48]:
try:
    datafilepath = f"C:/Users/{GLOBAL_USER}/OneDrive - DFO-MPO/04 - Strategic Salmon Data Policy and Analytics/09 - Pacific Salmon Data Portal/14_Official_Launch/03_Datasets/New Datasets/{GLOBAL_DATASET}/{GLOBAL_DATASET_PATH}/{GLOBAL_DATAFILE}"
    score = accuracy_score(
        dataset_path=datafilepath,
        selected_columns=["project_total_cost", "outcome_value"],
    )
    print(score)
except KeyError as e:
    print(f'{RED}Issue with column names, are you sure you entered them correctly?{RESET}')
    print(f'Column name that fails: {e}')
    print(f'List of all detected column names: {list(read_data(datafilepath).columns)}')
except Exception as e:
    print(f'{RED}Test failed!{RESET}')
    print(f'Error: {e}')

1.0


#### Accuracy Type 2 (A2 Outliers)
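
The name `find_outliers_iqr` suggests the interquartile range (IQR) rule. The sketch below shows that rule in its common textbook form, with fences at 1.5 times the IQR beyond the quartiles. It is an assumption about the approach, not the code in `accuracy_utils.ipynb`, and it does not model the `groupby_column`, `threshold`, or `minimum_score` parameters. The helper name and the sample values are hypothetical.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


costs = [10, 12, 11, 13, 12, 11, 100]  # made-up sample values
outliers = iqr_outliers(costs)
share_within_fences = 1 - len(outliers) / len(costs)
print(outliers, round(share_within_fences, 2))
```

Quartile conventions differ between libraries, so results near the fences may not match a pandas or NumPy calculation exactly.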

In [92]:
# Run utils for A2
run_selected_cells_from_util('utils', 'accuracy_utils.ipynb', [4]) 

Tests

In [93]:
try:    
    datafilepath = f"C:/Users/{GLOBAL_USER}/OneDrive - DFO-MPO/04 - Strategic Salmon Data Policy and Analytics/09 - Pacific Salmon Data Portal/14_Official_Launch/03_Datasets/New Datasets/{GLOBAL_DATASET}/{GLOBAL_DATASET_PATH}/{GLOBAL_DATAFILE}"
    outliers = find_outliers_iqr(
        dataset_path=datafilepath,
        selected_columns = ["project_total_cost"],
        groupby_column = ["target_species"], 
        threshold=0.90,
        minimum_score=0.85,
    )
    print(outliers)
except KeyError as e:
    print(f'{RED}Issue with column names, are you sure you entered them correctly?{RESET}')
    print(f'Column name that fails: {e}')
    print(f'List of all detected column names: {list(read_data(datafilepath).columns)}')
except Exception as e:
    print(f'{RED}Test failed!{RESET}')
    print(f'Error: {e}')

({'project_total_cost': target_species
Bull Trout, Chinook Salmon, Coho Salmon, Rainbow Trout                                                                                     1.0
Bull Trout, Chinook Salmon, Coho Salmon, Rainbow Trout, Steelhead Trout                                                                    1.0
Bull Trout, Chinook Salmon, Coho Salmon, Sockeye Salmon, Steelhead Trout                                                                   1.0
Bull Trout, Chinook Salmon, Coho Salmon, Steelhead Trout                                                                                   1.0
Bull Trout, Steelhead Trout, Chinook Salmon, Coho Salmon, Sockeye Salmon, White Sturgeon, Mountain Sucker, Rocky Mountain Ridged Mussel    1.0
                                                                                                                                          ... 
Umatilla Dace, White Sturgeon, Columbia Sculpin, Shorthead Sculpin                                     

### Uniqueness

#### Uniqueness Type 1 (U1, formerly A3 Duplicates)

In [70]:
# Run utils for U1 (formerly A3)
run_selected_cells_from_util('utils', 'accuracy_utils.ipynb', [6]) 

Test

In [71]:
try:
    datafilepath = f"C:/Users/{GLOBAL_USER}/OneDrive - DFO-MPO/04 - Strategic Salmon Data Policy and Analytics/09 - Pacific Salmon Data Portal/14_Official_Launch/03_Datasets/New Datasets/{GLOBAL_DATASET}/{GLOBAL_DATASET_PATH}/{GLOBAL_DATAFILE}"
    find_duplicates_and_percentage(
        dataset_path=datafilepath
    )

except Exception as e:
    print(f'{RED}Test failed!{RESET}')
    print(f'Error: {e}')

Duplicate Rows:
                     activity_id  \
31   GC-2016-00017_FY18-19_S1_A1   
32   GC-2016-00017_FY18-19_S1_A1   
33   GC-2016-00017_FY18-19_S2_A1   
34   GC-2016-00017_FY18-19_S2_A1   
775  GC-2016-00017_FY18-19_S1_A2   
776  GC-2016-00017_FY18-19_S1_A2   
777  GC-2016-00017_FY18-19_S2_A2   
778  GC-2016-00017_FY18-19_S2_A2   

                                          project_name  \
31   Activity 1: Elaho River Rock Obstruction Resto...   
32   Activity 1: Elaho River Rock Obstruction Resto...   
33   Activity 1: Elaho River Rock Obstruction Resto...   
34   Activity 1: Elaho River Rock Obstruction Resto...   
775  Activity 1: Elaho River Rock Obstruction Resto...   
776  Activity 1: Elaho River Rock Obstruction Resto...   
777  Activity 1: Elaho River Rock Obstruction Resto...   
778  Activity 1: Elaho River Rock Obstruction Resto...   

                                   project_description  \
31   The Elaho River watershed is a tributary of Sq...   
32   The Elaho River

### Completeness (P)

Columns whose percentage of blanks meets the threshold are removed.
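
A minimal sketch of that filtering step, using plain Python structures rather than the notebook's `completeness_test`. It assumes "meets the threshold" means a blank share greater than or equal to the threshold; the helper names and the sample table are hypothetical.

```python
def blank_share(column_values):
    """Fraction of values in a column that are missing or empty."""
    blanks = sum(1 for v in column_values if v is None or str(v).strip() == "")
    return blanks / len(column_values)


def columns_kept(table, threshold=0.75):
    """Keep only the columns whose share of blanks is below the threshold."""
    return [name for name, values in table.items() if blank_share(values) < threshold]


table = {
    "project_name": ["A", "B", "C", "D"],        # no blanks
    "site_latitude": [None, "", None, "49.2"],   # 75% blank, so it is removed
}
kept = columns_kept(table)
print(kept)
```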

In [None]:
run_selected_cells_from_util('utils', 'dq_utils.ipynb', [10])

Test

In [97]:
try:
    datafilepath = f"C:/Users/{GLOBAL_USER}/OneDrive - DFO-MPO/04 - Strategic Salmon Data Policy and Analytics/09 - Pacific Salmon Data Portal/14_Official_Launch/03_Datasets/New Datasets/{GLOBAL_DATASET}/{GLOBAL_DATASET_PATH}/{GLOBAL_DATAFILE}"
    completeness_test(
        dataset_path = datafilepath,
        exclude_columns=["project_partners", "project_total_cost", "site_latitude", "site_longitude", "ecosystem_types", "target_species", "other_benefitting_species", "restoration_activity", "restoration_outcome", "outcome_value"], 
        threshold=0.75,
    )
except Exception as e:
    print(f'{RED}Test failed!{RESET}')
    print(f'Error: {e}')

### Timeliness

In [177]:
from datetime import datetime


def calc_timeliness(refresh_date, cycle_day):
    """Score from 0 to 100 based on how overdue a dataset is relative to its refresh cycle (in days)."""
    refresh_date = pd.to_datetime(refresh_date)
    # Number of refresh cycles missed beyond the first one (never negative)
    unupdate_cycle = np.max([((datetime.now() - refresh_date).days / cycle_day) - 1, 0])

    # Lose a third of the score per missed cycle, bottoming out at 0
    return np.max([0, 100 - (unupdate_cycle * (100 / 3))])

In [178]:
calc_timeliness("2022-12-01", cycle_day=365)

66.21004566210046
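
Because `calc_timeliness` uses `datetime.now()`, its result changes from day to day. The sketch below restates the same formula with the current date passed in explicitly, so that the arithmetic can be followed and repeated. The function name `timeliness` and the `now` parameter are hypothetical additions for illustration.

```python
from datetime import datetime


def timeliness(refresh_date, cycle_days, now):
    """Same formula as calc_timeliness, with the reference date passed in."""
    elapsed_cycles = (now - refresh_date).days / cycle_days
    overdue_cycles = max(elapsed_cycles - 1, 0)  # the first cycle is free
    return max(0, 100 - overdue_cycles * (100 / 3))


# 735 days elapsed: 735 / 365 - 1 is about 1.014 overdue cycles, each costing 100/3 points
score = timeliness(datetime(2022, 12, 1), 365, now=datetime(2024, 12, 5))
print(round(score, 2))  # 66.21
```

With a reference date of 5 December 2024, this reproduces the score shown above. The score stays at 100 during the first cycle after a refresh and reaches 0 once four full cycles have passed.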

# Output Reports
Run all the functions above before running this section.

Note that the following data quality tests generate output that can be used as a report when they are run:
- Consistency type 1
- Accuracy type 2
- Accuracy type 3 (now Uniqueness type 1)
- Completeness

The Completeness test does not require an output report (just find the blanks in the dataset). Output reports for the remaining tests can be found below.

### Consistency Type 2

In [179]:
run_selected_cells_from_util('utils', 'dq_utils.ipynb', [13])

In [180]:
try:
    column_mapping = {
        "STOCK_CU_NAME": "CU_Display",
        "STOCK_CU_INDEX": "FULL_CU_IN",
    }  # the pattern for comparison is 'dataset column' : 'reference column'
    compare_datasets(
        dataset_path="data/test/Salmonid_Enhancement_Program_Releases.xlsx",
        column_mapping=column_mapping,
        ref_dataset_path="data/Pacific Salmon Population Unit Crosswalk_Final_20240513.xlsx",
    )
except Exception as e:
    print(f'{RED}Test failed!{RESET}')
    print(f'Error: {e}')

Test failed!
Error: [Errno 2] No such file or directory: 'data/test/Salmonid_Enhancement_Program_Releases.xlsx'


### Accuracy Type 1

In [181]:
run_selected_cells_from_util('utils', 'dq_utils.ipynb', [15])

Test

In [182]:
try:
    add_only_numbers_columns(
        dataset_path="data/test/SEP Facilities.xlsx", selected_columns=["LicNo", "FRN"]
    )
except Exception as e:
    print(f'{RED}Test failed!{RESET}')
    print(f'Error: {e}')

Test failed!
Error: [Errno 2] No such file or directory: 'data/test/SEP Facilities.xlsx'


# Score Log

In [183]:
from datetime import datetime

current_date = datetime.now().strftime("%Y-%m-%d %H:%M:%S")

current_date

'2024-12-05 13:53:16'

In [184]:
def get_dataset_name(dataset_path):
    # Extract the file name from the path (e.g., 'Dataset_A.csv')
    file_name = os.path.basename(dataset_path)
    # Split the file name to remove the extension (e.g., 'Dataset_A')
    dataset_name = os.path.splitext(file_name)[0]
    return dataset_name

More up-to-date code for the score log can be found in the "Setup" section; the code here is treated more as a testing space.

In [185]:
# Function to log a new row into the DQS_Log_Test.xlsx file
def log_score(test_name, dataset_name, score, threshold=None):
    # Convert score to a percentage
    percentage_score = score * 100

    # Name of the Excel log file (expected in the working directory)
    log_file = "DQS_Log_Test.xlsx"

    # Set threshold to "no threshold" if it is not provided
    if threshold is None:
        threshold_value = "no threshold"
    else:
        threshold_value = threshold
    # Try loading the existing Excel file
    try:
        df = read_data(log_file)
    except FileNotFoundError:
        # Create an empty DataFrame if file doesn't exist (shouldn't be the case if you already created it)
        df = pd.DataFrame(
            columns=["Dataset", "Test", "Threshold", "Date_Calculated", "Score"]
        )

    # Prepare the new row as a DataFrame
    new_row = pd.DataFrame(
        {
            "Dataset": [dataset_name],
            "Test": [test_name],
            "Date_Calculated": [datetime.now().strftime("%Y-%m-%d %H:%M:%S")],
            "Threshold": [threshold_value],
            "Score": [percentage_score],
        }
    )

    # Append the new row to the DataFrame
    df = pd.concat([df, new_row], ignore_index=True)

    # Save the updated DataFrame back to the Excel file
    df.to_excel(log_file, index=False)

### Saving the file to sharepoint

In [186]:
import shutil
import os


# Function to copy the log file to another folder
def copy_log_file(destination_folder):
    # Define the name of the file and the current working directory
    log_file = "DQS_Log_Test.xlsx"

    # Get the current working directory (if needed)
    current_directory = os.getcwd()

    # Define the source path (current working directory + file)
    source_path = os.path.join(current_directory, log_file)

    # Define the destination path (destination folder + file)
    destination_path = os.path.join(destination_folder, log_file)

    # Copy the file to the destination folder
    shutil.copy(source_path, destination_path)

    print(f"File copied to {destination_path}")

Run this function to save the Excel document from the working directory to SharePoint:

In [188]:
try:
    # Specify the destination folder where you want to copy the file
    destination_folder = "C:/Users/EwertM/Documents/Portal/DataQuality"

    copy_log_file(destination_folder)
except Exception as e:
    print(f'{RED}Test failed!{RESET}')
    print(f'Error: {e}')

Test failed!
Error: [Errno 2] No such file or directory: 'c:\\Users\\onakd\\Documents\\Data Quality Tests\\DataQuality\\DQS_Log_Test.xlsx'
