# How to Use

1. Run everything in the **Setup** section. 
    - Make sure to change the working directory to **your** working directory. The code for this is already there.
    - Make sure the Excel document for logging the scores also exists in your working directory, and that the file name is correct.

2. Determine *if the test needs to be run* by having a good understanding of what each test is doing. 
    - Please refer to this document [here](https://086gc.sharepoint.com/:x:/r/sites/PacificSalmonTeam/Shared%20Documents/General/02%20-%20PSSI%20Secretariat%20Teams/04%20-%20Strategic%20Salmon%20Data%20Policy%20and%20Analytics/02%20-%20Data%20Governance/00%20-%20Projects/10%20-%20Data%20Quality/Presentation/DQP%20Demo.xlsx?d=wc15abe6743954df980a05f09fe99a560&csf=1&web=1&e=CJeb6h)

3. Some requirements for the datasets:
    - The data must be on the **first sheet** in the Excel document.
    - The **first row** must be the column names. 
    - The test won't run if the Excel file is open.

4. After running all the tests, upload the Excel document for logging the scores to SharePoint using the `copy_log_file` function in the "Saving the file to sharepoint" section.

Note: The Output Reports section is for cases where a data steward asks why their dataset received a certain score. If a metric is not in Output Reports, running the test itself generates output that can be put into a report.
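Some of the dataset requirements in step 3 can be checked before a test is run. The following is a minimal sketch, not part of the notebook's own utilities: `check_dataset_file` is a hypothetical helper, and the assumption that an open workbook raises `PermissionError` reflects typical Windows behaviour rather than anything guaranteed. It does not check the first-sheet or first-row requirements.

```python
from pathlib import Path


def check_dataset_file(path):
    """Return a list of problems found with the dataset file (empty if none)."""
    path = Path(path)
    problems = []

    if not path.is_file():
        # Nothing else can be checked if the file is missing
        return [f"File not found: {path}"]

    if path.suffix.lower() not in {".csv", ".xlsx"}:
        problems.append(f"Unexpected file type: {path.suffix}")

    try:
        # On Windows, a workbook that is open in Excel is usually locked,
        # so opening it here tends to raise PermissionError (assumption)
        with open(path, "rb"):
            pass
    except PermissionError:
        problems.append("File is locked; close it in Excel and try again")

    return problems
```

Calling `check_dataset_file(DATA_FILE_PATH)` after the Setup section has run should return an empty list when the file is ready to be tested.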

# Setup

Run everything in this section, and double-check the working directory so that the data can be read from it.

All of these functions are used in the process of calculating data quality. 

In [1]:
from IPython import get_ipython

# Clear memory
get_ipython().run_line_magic('reset', '-sf')


In [2]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import os
import re
from difflib import SequenceMatcher
from datetime import datetime
import nbformat
import gc

# Import dimensions
from dimensions.consistancy import Consistency
from dimensions.accuracy import Accuracy
from dimensions.completeness import Completeness

In [3]:
gc.collect()

21

Make sure to set the correct working directory.

In [4]:
# Change working directory to the same place where you saved the test datasets
# os.chdir('C:/Users/luos/OneDrive - DFO-MPO/Python') #change directory
os.getcwd()  # check where the directory is (and whether the change was successful or not)
GLOBAL_USER = "OnakD"
GLOBAL_DATASET = "CU Sampling Sites"
GLOBAL_DATAFILE = "Conservation_Unit_Data_20220902.csv"
DATA_FILE_PATH = f"C:/Users/{GLOBAL_USER}/OneDrive - DFO-MPO/04 - Strategic Salmon Data Policy and Analytics/07 - Data Products & Data/21 - Transitory Files/{GLOBAL_DATASET}/{GLOBAL_DATAFILE}"

# Data Quality Tests

### Consistency

#### Consistency Type 1 (C1)

Calculate consistency score of a dataset

This code works best on CSV data with the column names in the first row. It also accepts xlsx files, but it only reads the first sheet if the workbook contains more than one.

Limitations: It does not detect differences in capitalization of the same word, since all words are converted to lower case before the similarity score is calculated.
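The capitalization limitation can be illustrated with a small sketch. It uses `difflib.SequenceMatcher` as a stand-in similarity measure; this is an assumption for illustration, and the `Consistency` class may compute similarity differently. The `similarity` helper and the sample strings are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b, lowercase=True):
    """Similarity ratio between two strings, optionally ignoring case."""
    if lowercase:
        a, b = a.lower(), b.lower()
    return SequenceMatcher(None, a, b).ratio()


# Illustrative values only, not taken from a real dataset
raw = similarity("Chinook", "CHINOOK", lowercase=False)
lowered = similarity("Chinook", "CHINOOK", lowercase=True)

print(raw < 1.0)       # True: the raw strings differ in case
print(lowered == 1.0)  # True: after lowercasing they are identical
```

Because the two spellings score as identical once lowercased, a test that lowercases first cannot flag them as inconsistent.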

#### Consistency Type 2 (C2)

Calculate consistency score of datasets with a reference list

The compared columns must match the reference list exactly; otherwise they are penalized more harshly.
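As a rough sketch of what comparing against a reference list can look like, the snippet below scores a column by the share of its values that appear exactly in the reference column. The function `reference_match_score`, the case-sensitive exact matching, and the sample values are all assumptions for illustration; the actual C2 scoring in `Consistency` may weight mismatches differently.

```python
def reference_match_score(values, reference_values):
    """Share of values that appear exactly in the reference list."""
    reference = set(reference_values)
    if not values:
        return 1.0  # nothing to check
    matches = sum(1 for value in values if value in reference)
    return matches / len(values)


# Illustrative values only
dataset_column = ["Fraser River", "fraser river", "Skeena"]
reference_column = ["Fraser River", "Skeena"]

score = reference_match_score(dataset_column, reference_column)
print(round(score, 3))  # 0.667: the lower-case variant is not an exact match
```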

##### Test the dataset by changing the path

In [None]:
column_mapping = {
    "STOCK_CU_NAME": "CU_Display",
    "STOCK_CU_INDEX": "FULL_CU_IN",
}  # the pattern for comparison is 'dataset column' : 'reference column'

# Test Consistency Calculations
# Using default thresholds and stop words for both metrics
consistency_tests = Consistency(
    dataset_path=DATA_FILE_PATH,
    c1_column_names=["DESCR"],
    c2_column_mapping=column_mapping,
    # ref_dataset_path="data/Pacific Salmon Population Unit Crosswalk_Final_20240513.xlsx"
)

print(consistency_tests.run_metrics())


  df = pd.read_csv(dataset_path, encoding="utf-8-sig")


Dataset is too large for this test, out of memory!


  df = pd.read_excel(dataset_path)
  df = pd.read_csv(dataset_path, encoding="utf-8-sig")


Issue with column names, are you sure you entered them correctly?
Column name that fails: 'STOCK_CU_NAME'


  df = pd.read_csv(dataset_path, encoding="utf-8-sig")


List of all detected column names: ['ACT_ID', 'DESCR', 'ANALYSIS_YR', 'STREAM_ID', 'AREA', 'SPECIES', 'SEN_STATUS', 'ESTIMATE_CLASSIFICATION', 'ESTIMATE_METHOD', 'WATERSHED_CDE', 'ESTIMATE_STAGE', 'SPL_ID', 'SEN_PRESENCE_ADULT', 'SEN_PRESENCE_JACK', 'NATURAL_ADULT_SPAWNERS', 'NATURAL_JACK_SPAWNERS', 'NATURAL_SPAWNERS_TOTAL', 'ADULT_BROODSTOCK_REMOVALS', 'JACK_BROODSTOCK_REMOVALS', 'TOTAL_BROODSTOCK_REMOVALS', 'OTHER_REMOVALS', 'TOTAL_RETURN_TO_RIVER', 'UNSPECIFIED_RETURN', 'NO_INSPECTIONS_USED', 'POPULATION', 'MAX_ESTIMATE', 'RUN_TYPE', 'SEN_NUSEDS1_ENUM_METHOD1', 'SEN_NUSEDS1_ENUM_METHOD2', 'SEN_NUSEDS1_ENUM_METHOD3', 'SEN_NUSEDS1_ENUM_METHOD4', 'SEN_NUSEDS1_ENUM_METHOD5', 'SEN_NUSEDS1_ENUM_METHOD6', 'EFFECTIVE_FEMALES', 'WEIGHTED_PCT_SPAWN', 'OTHER_ADULT_REMOVALS', 'OTHER_JACK_REMOVALS', 'TOT_ADULT_RET_RIVER', 'TOT_JACK_RET_RIVER', 'JUV_PRES_TYP', 'GEOGRAPHICAL_EXTNT_OF_ESTIMATE', 'POP_ID', 'CU_NAME', 'CU_INDEX', 'CU_TYPE', 'SPECIES_QUALIFIED', 'SBJ_ID']
[]


  df = pd.read_excel(dataset_path)


### Accuracy

#### Accuracy Type 1 (A1, Mixed Data Types, Symbols in Numerics) 

Tests whether numeric columns contain symbols.
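A minimal sketch of this kind of check is shown below. The pattern, the `numeric_share` helper, and the sample values are assumptions for illustration and are not taken from the `Accuracy` class; in particular, which characters count as symbols (for example thousands separators) is a choice the real test may make differently.

```python
import re

# Optional sign, digits, optional decimal part; anything else counts as a symbol
NUMERIC_PATTERN = re.compile(r"-?\d+(\.\d+)?")


def numeric_share(values):
    """Share of non-blank values that are plain numbers."""
    checked = [str(v).strip() for v in values if str(v).strip() != ""]
    if not checked:
        return 1.0
    clean = sum(1 for v in checked if NUMERIC_PATTERN.fullmatch(v))
    return clean / len(checked)


# Illustrative values only
print(numeric_share(["120", "45.5", "~300", "1,200"]))  # 0.5
```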

#### Accuracy Type 2 (A2 Outliers)

Tests the selected columns for outliers. The threshold, group-by, and minimum score settings all have defaults.

#### Accuracy Type 3 (A3 Duplicates)

Tests the dataset for duplicate rows and reports a duplication score.
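One way a duplicate check could be scored is sketched below: the score is the share of rows that are not repeats of an earlier row, so a dataset with no duplicates scores 1.0 (100%). The `duplication_score` helper, this scoring rule, and the sample rows are assumptions for illustration; the `Accuracy` class may define the score differently.

```python
from collections import Counter


def duplication_score(rows):
    """Share of rows that are not repeats of an earlier row."""
    if not rows:
        return 1.0
    counts = Counter(tuple(row) for row in rows)
    # Every occurrence after the first counts as a duplicate
    duplicates = sum(count - 1 for count in counts.values())
    return 1 - duplicates / len(rows)


# Illustrative rows only
rows = [(1, "A"), (2, "B"), (1, "A"), (3, "C")]
print(duplication_score(rows))  # 0.75
```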

##### Test the dataset by changing the path

In [5]:
# Test Accuracy Calculations
# Using default threshold, group by, and min score for A2 metric 
accuracy_tests = Accuracy(
    dataset_path=DATA_FILE_PATH,
    # selected_columns=[" Egg Target ", " Release/ Transfer Target ", " Coded Wire Tag Target ", " Fin Clip Target ", " Thermal Mark Target ", " Parentage-based Tag Target ", " PIT Tag Target "]
    selected_columns=["ACT_ID","ANALYSIS_YR","STREAM_ID","SPL_ID","NATURAL_ADULT_SPAWNERS","NATURAL_JACK_SPAWNERS","NATURAL_SPAWNERS_TOTAL"]
)

print(accuracy_tests.run_metrics())

  df = pd.read_csv(dataset_path, encoding="utf-8-sig")
  df = pd.read_csv(dataset_path, encoding="utf-8-sig")
  df = pd.read_csv(dataset_path, encoding="utf-8-sig")


Duplicate Rows:
Empty DataFrame
Columns: [ACT_ID, DESCR, ANALYSIS_YR, STREAM_ID, AREA, SPECIES, SEN_STATUS, ESTIMATE_CLASSIFICATION, ESTIMATE_METHOD, WATERSHED_CDE, ESTIMATE_STAGE, SPL_ID, SEN_PRESENCE_ADULT, SEN_PRESENCE_JACK, NATURAL_ADULT_SPAWNERS, NATURAL_JACK_SPAWNERS, NATURAL_SPAWNERS_TOTAL, ADULT_BROODSTOCK_REMOVALS, JACK_BROODSTOCK_REMOVALS, TOTAL_BROODSTOCK_REMOVALS, OTHER_REMOVALS, TOTAL_RETURN_TO_RIVER, UNSPECIFIED_RETURN, NO_INSPECTIONS_USED, POPULATION, MAX_ESTIMATE, RUN_TYPE, SEN_NUSEDS1_ENUM_METHOD1, SEN_NUSEDS1_ENUM_METHOD2, SEN_NUSEDS1_ENUM_METHOD3, SEN_NUSEDS1_ENUM_METHOD4, SEN_NUSEDS1_ENUM_METHOD5, SEN_NUSEDS1_ENUM_METHOD6, EFFECTIVE_FEMALES, WEIGHTED_PCT_SPAWN, OTHER_ADULT_REMOVALS, OTHER_JACK_REMOVALS, TOT_ADULT_RET_RIVER, TOT_JACK_RET_RIVER, JUV_PRES_TYP, GEOGRAPHICAL_EXTNT_OF_ESTIMATE, POP_ID, CU_NAME, CU_INDEX, CU_TYPE, SPECIES_QUALIFIED, SBJ_ID]
Index: []

[0 rows x 47 columns]

Duplication Score: 100.0%
[1.0, {'ACT_ID': 0.966184040873678, 'ANALYSIS_YR': 1.0, '

### Completeness (P)

The threshold is the percentage of blanks at which a column is removed: any column whose share of blanks meets the threshold is dropped.
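The effect of the threshold can be sketched as follows. This is an illustration under stated assumptions, not the `Completeness` implementation: the helper name `completeness_score`, the default threshold of 0.75, the treatment of `None` and empty strings as blanks, and the final score as the share of non-blank cells in the remaining columns are all hypothetical.

```python
def is_blank(value):
    """Treat None and empty strings as blanks (assumption)."""
    return value is None or value == ""


def completeness_score(columns, threshold=0.75):
    """Share of non-blank cells after dropping columns that meet the threshold.

    `columns` maps a column name to its list of values.
    """
    kept = []
    for values in columns.values():
        if not values:
            continue
        blanks = sum(1 for v in values if is_blank(v))
        if blanks / len(values) >= threshold:
            continue  # column meets the blank threshold, so it is dropped
        kept.append((blanks, len(values)))

    total_cells = sum(n for _, n in kept)
    if total_cells == 0:
        return 1.0
    return 1 - sum(b for b, _ in kept) / total_cells


# Illustrative values only
columns = {
    "a": [1, 2, 3, 4],              # no blanks, kept
    "b": [1, None, 3, ""],          # 50% blank, kept
    "c": [None, None, None, ""],    # 100% blank, dropped
}
print(completeness_score(columns))  # 0.75
```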

Test

In [5]:
completeness_tests = Completeness(
    dataset_path=DATA_FILE_PATH
)

print(completeness_tests.run_metrics())

  df = pd.read_csv(dataset_path, encoding="utf-8-sig")


[0.9240346358763629]


### Timeliness

In [177]:
from datetime import datetime


def calc_timeliness(refresh_date, cycle_day):
    """Return a timeliness score from 0 to 100 for a dataset last refreshed on refresh_date."""
    refresh_date = pd.to_datetime(refresh_date)
    # Number of refresh cycles overdue (0 until one full cycle has passed)
    unupdate_cycle = np.max([((datetime.now() - refresh_date).days / cycle_day) - 1, 0])

    # Deduct a third of the score per overdue cycle, with a floor of 0
    return np.max([0, 100 - (unupdate_cycle * (100 / 3))])

In [178]:
calc_timeliness("2022-12-01", cycle_day=365)

66.21004566210046
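The score above depends on the date the cell is run, because `calc_timeliness` calls `datetime.now()`. With d days since the refresh and a cycle of c days, the function returns max(0, 100 - max(d/c - 1, 0) * 100/3). The sketch below applies the same formula with an explicit `now` so the result is reproducible. `timeliness_at` is a hypothetical variant, and the `now` value is the timestamp shown in the Score Log output of this notebook.

```python
from datetime import datetime


def timeliness_at(refresh_date, cycle_day, now):
    """Timeliness score (0 to 100) as of an explicit `now`."""
    days_since_refresh = (now - refresh_date).days
    # Cycles overdue: zero until one full cycle has passed
    overdue_cycles = max(days_since_refresh / cycle_day - 1, 0)
    # Lose a third of the score per overdue cycle, bottoming out at 0
    return max(0, 100 - overdue_cycles * (100 / 3))


score = timeliness_at(
    refresh_date=datetime(2022, 12, 1),
    cycle_day=365,
    now=datetime(2024, 12, 5, 13, 53, 16),
)
print(round(score, 2))  # 66.21
```

Here 735 days have passed, which is about 2.01 cycles, so the dataset is about 1.01 cycles overdue and loses roughly a third of its score.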

# Output Reports
Run all the functions above before running this section.

#### Note that the following data quality tests generate their own output when run:
- Consistency Type 1
- Accuracy Type 2
- Accuracy Type 3
- Completeness

Note: The Completeness test does not require an output report (just find the blanks in the dataset). Reports for the remaining tests can be found below.

### Consistency Type 2

In [179]:
run_selected_cells_from_util('utils', 'dq_utils.ipynb', [13])

In [180]:
try:
    column_mapping = {
        "STOCK_CU_NAME": "CU_Display",
        "STOCK_CU_INDEX": "FULL_CU_IN",
    }  # the pattern for comparison is 'dataset column' : 'reference column'
    compare_datasets(
        dataset_path="data/test/Salmonid_Enhancement_Program_Releases.xlsx",
        column_mapping=column_mapping,
        ref_dataset_path="data/Pacific Salmon Population Unit Crosswalk_Final_20240513.xlsx",
    )
except Exception as e:
    print(f'{RED}Test failed!{RESET}')
    print(f'Error: {e}')

Test failed!
Error: [Errno 2] No such file or directory: 'data/test/Salmonid_Enhancement_Program_Releases.xlsx'


### Accuracy Type 1

In [181]:
run_selected_cells_from_util('utils', 'dq_utils.ipynb', [15])

Test

In [182]:
try:
    add_only_numbers_columns(
        dataset_path="data/test/SEP Facilities.xlsx", selected_columns=["LicNo", "FRN"]
    )
except Exception as e:
    print(f'{RED}Test failed!{RESET}')
    print(f'Error: {e}')

Test failed!
Error: [Errno 2] No such file or directory: 'data/test/SEP Facilities.xlsx'


# Score Log

In [183]:
from datetime import datetime

current_date = datetime.now().strftime("%Y-%m-%d %H:%M:%S")

current_date

'2024-12-05 13:53:16'

In [184]:
def get_dataset_name(dataset_path):
    # Extract the file name from the path (e.g., 'Dataset_A.csv')
    file_name = os.path.basename(dataset_path)
    # Split the file name to remove the extension (e.g., 'Dataset_A')
    dataset_name = os.path.splitext(file_name)[0]
    return dataset_name

More up-to-date code for the score log can be found in the "Setup" section; the code here is more of a testing space.

In [185]:
# Function to log a new row into the DQS_Log.xlsx file
def log_score(test_name, dataset_name, score, threshold=None):
    # Convert score to a percentage
    percentage_score = score * 100

    # Name of the Excel log file (expected in the working directory)
    log_file = "DQS_Log_Test.xlsx"

    # Record "no threshold" if no threshold is provided
    if threshold is None:
        threshold_value = "no threshold"
    else:
        threshold_value = threshold
    # Try loading the existing Excel file
    try:
        df = read_data(log_file)
    except FileNotFoundError:
        # Create an empty DataFrame if file doesn't exist (shouldn't be the case if you already created it)
        df = pd.DataFrame(
            columns=["Dataset", "Test", "Threshold", "Date_Calculated", "Score"]
        )

    # Prepare the new row as a DataFrame
    new_row = pd.DataFrame(
        {
            "Dataset": [dataset_name],
            "Test": [test_name],
            "Date_Calculated": [datetime.now().strftime("%Y-%m-%d %H:%M:%S")],
            "Threshold": [threshold_value],
            "Score": [percentage_score],
        }
    )

    # Append the new row to the DataFrame
    df = pd.concat([df, new_row], ignore_index=True)

    # Save the updated DataFrame back to the Excel file
    df.to_excel(log_file, index=False)

### Saving the file to sharepoint

In [186]:
import shutil
import os


# Function to copy the log file to another folder
def copy_log_file(destination_folder):
    # Define the name of the file and the current working directory
    log_file = "DQS_Log_Test.xlsx"

    # Get the current working directory (if needed)
    current_directory = os.getcwd()

    # Define the source path (current working directory + file)
    source_path = os.path.join(current_directory, log_file)

    # Define the destination path (destination folder + file)
    destination_path = os.path.join(destination_folder, log_file)

    # Copy the file to the destination folder
    shutil.copy(source_path, destination_path)

    print(f"File copied to {destination_path}")

Run this function to copy the Excel document from the working directory to SharePoint:

In [188]:
try:
    # Specify the destination folder where you want to copy the file
    destination_folder = "C:/Users/EwertM/Documents/Portal/DataQuality"

    copy_log_file(destination_folder)
except Exception as e:
    print(f'{RED}Test failed!{RESET}')
    print(f'Error: {e}')

Test failed!
Error: [Errno 2] No such file or directory: 'c:\\Users\\onakd\\Documents\\Data Quality Tests\\DataQuality\\DQS_Log_Test.xlsx'
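The failure above occurs because the log file is not in the current working directory. A minimal sketch of a more defensive copy is shown below; `copy_if_present` is a hypothetical helper rather than part of the notebook, and creating the destination folder when it is missing is an assumption that may not suit a synced SharePoint folder.

```python
import shutil
from pathlib import Path


def copy_if_present(source_path, destination_folder):
    """Copy a file into a folder; return the new path, or None if the source is missing."""
    source = Path(source_path)
    if not source.is_file():
        print(f"Nothing copied: {source} does not exist")
        return None

    destination = Path(destination_folder)
    # Create the destination folder if needed (assumption: this is acceptable)
    destination.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy(source, destination / source.name))
```

Checking for the source first gives a clearer message than the raw `FileNotFoundError` and makes it obvious that the working directory needs to be corrected.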
