# How to Use

1. Run everything in the **Setup** section. 
    - Make sure to change the working directory to **your** working directory. The code for this is already there.
    - Make sure the Excel document for logging the scores also exists in your working directory, and that the file name is correct.

2. Determine *if the test needs to be run* by having a good understanding of what each test is doing. 
    - Please refer to this document [here](https://086gc.sharepoint.com/:x:/r/sites/PacificSalmonTeam/Shared%20Documents/General/02%20-%20PSSI%20Secretariat%20Teams/04%20-%20Strategic%20Salmon%20Data%20Policy%20and%20Analytics/02%20-%20Data%20Governance/00%20-%20Projects/10%20-%20Data%20Quality/Presentation/DQP%20Demo.xlsx?d=wc15abe6743954df980a05f09fe99a560&csf=1&web=1&e=CJeb6h)

3. Some requirements for the datasets:
    - The data must be on the **first sheet** in the Excel document.
    - The **first row** must be the column names. 
    - The tests will not run while the Excel file is open.

4. After running all the tests, upload the Excel document for logging the scores to SharePoint using the function "Saving the file to sharepoint".

Note: The Output Reports are for cases where a data steward asks why their dataset received a certain score. If a metric is not in Output Reports, running the test itself generates output that can be put into a report.
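
The file requirements in step 3 can be checked before any test is run. The following is a minimal sketch and not part of the notebook's own code: the helper name `preflight_check` is hypothetical, and it only verifies that the file exists, has an expected extension, and can be opened for reading and writing (which usually fails on Windows while the workbook is open in Excel). It does not inspect sheets or column names.

```python
import os

def preflight_check(path):
    """Return a list of human-readable problems found with the dataset file."""
    problems = []
    if not os.path.isfile(path):
        problems.append(f"File not found: {path}")
        return problems
    if not path.lower().endswith((".xlsx", ".csv")):
        problems.append("Expected an .xlsx or .csv file")
    try:
        # Opening in read/write mode without writing leaves the file unchanged.
        with open(path, "r+b"):
            pass
    except PermissionError:
        problems.append("File is locked; close it in Excel and try again")
    return problems

print(preflight_check("does_not_exist.xlsx"))
# ['File not found: does_not_exist.xlsx']
```

An empty list means none of these basic problems were detected.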

# Setup

Please run everything in Setup, and double-check the working directory so that the data can be read from that directory.

All of these functions are used in the process of calculating data quality. 

In [1]:
from IPython import get_ipython

# Clear memory
get_ipython().run_line_magic('reset', '-sf')  # replaces the deprecated .magic() call



In [2]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import os
import re
from difflib import SequenceMatcher
from datetime import datetime
import nbformat
import gc

# Import dimensions
from dimensions.consistency import Consistency
from dimensions.accuracy import Accuracy
from dimensions.completeness import Completeness
from dimensions.uniqueness import Uniqueness
from dimensions.utils import calculate_dimension_score, calculate_DQ_grade

In [3]:
gc.collect()

28

Make sure to set the correct working directory.

In [4]:
# Change working directory to the same place where you saved the test datasets
# os.chdir('C:/Users/luos/OneDrive - DFO-MPO/Python') #change directory
os.getcwd()  # check where the directory is (and whether the change was successful or not)
LOGGING_PATH = "/metric_output_logs/"
GLOBAL_USER = "OnakD"
GLOBAL_DATASET = "NuSEDS Escapement"
GLOBAL_DATAFILE = "Johnstone Strait and Strait of Georgia NuSEDS_20241004.xlsx"
DATA_FILE_PATH = f"C:/Users/{GLOBAL_USER}/OneDrive - DFO-MPO/04 - Strategic Salmon Data Policy and Analytics/07 - Data Products & Data/21 - Transitory Files/{GLOBAL_DATASET}/{GLOBAL_DATAFILE}"
DIMENSION_SCORES = []

# Data Quality Tests

### Consistency

#### Consistency Type 1 (C1)

Calculate consistency score of a dataset

This code works best on CSV data with the column names in the first row. It also accepts .xlsx files, but it reads only the first sheet if the workbook contains more than one.

Limitations: Differences in capitalization of the same word are not detected, because all words are converted to lower case before the similarity score is calculated.
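
To illustrate this limitation, the sketch below compares strings after lowercasing them. It uses `difflib.SequenceMatcher` from the standard library as a stand-in; the actual C1 metric lives in `dimensions.consistency` and may compute similarity differently. The helper name `similarity` and the sample strings are hypothetical.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Similarity ratio in [0, 1] between two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Case differences vanish after lowercasing, so these count as identical.
print(similarity("Fraser River", "FRASER RIVER"))  # 1.0

# A spelling variation still lowers the score below 1.0.
print(similarity("Fraser River", "Frasr River") < 1.0)  # True
```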

#### Consistency Type 2 (C2)

Calculate consistency score of datasets with a reference list

Values in the compared columns must match the reference list exactly; otherwise they are penalized more harshly.
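
As a rough illustration of comparing a column against a reference list, the sketch below computes the share of values that appear verbatim in the reference. It is an assumption-laden simplification: the function name `share_in_reference` and the sample values are made up, and the real C2 metric may score near-matches rather than requiring exact membership.

```python
def share_in_reference(values, reference):
    """Fraction of non-empty values that appear verbatim in the reference list."""
    reference_set = set(reference)
    checked = [v for v in values if v not in (None, "")]
    if not checked:
        return 1.0  # assumption: nothing to check counts as fully consistent
    matches = sum(1 for v in checked if v in reference_set)
    return matches / len(checked)

# Made-up values for illustration only.
dataset_values = ["CK-01", "CK-02", "ck-02", "CK-99"]
reference_values = ["CK-01", "CK-02", "CK-03"]
print(share_in_reference(dataset_values, reference_values))  # 0.5
```

Here `"ck-02"` fails because exact matching is case-sensitive, and `"CK-99"` fails because it is absent from the reference.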

In [None]:
column_mapping = {
    "STOCK_CU_NAME": "CU_Display",
    "STOCK_CU_INDEX": "FULL_CU_IN",
}  # the pattern for comparison is 'dataset column' : 'reference column'

# Test Consistency Calculations
# Using default thresholds and stop words for both metrics
consistency_tests = Consistency(
    dataset_path=DATA_FILE_PATH,
    c1_column_names=["POPULATION", "ESTIMATE_CLASSIFICATION", "ESTIMATE_METHOD"],
    c2_column_mapping=column_mapping,
    # ref_dataset_path="data/Pacific Salmon Population Unit Crosswalk_Final_20240513.xlsx"
    return_type='dataset'
    # logging_path='metric_output_logs/'
)

consistency_score = calculate_dimension_score("Consistency", scores=consistency_tests.run_metrics(), weights={})
DIMENSION_SCORES.append(consistency_score)
print(consistency_score)


Issue with column names, are you sure you entered them correctly?
Column name that fails: 'STOCK_CU_NAME'
List of all detected column names: ['AREA', 'WATERBODY', 'GAZETTED_NAME', 'LOCAL_NAME_1', 'LOCAL_NAME_2', 'ANALYSIS_YR', 'SPECIES', 'NATURAL_ADULT_SPAWNERS', 'NATURAL_JACK_SPAWNERS', 'NATURAL_SPAWNERS_TOTAL', 'ADULT_BROODSTOCK_REMOVALS', 'JACK_BROODSTOCK_REMOVALS', 'TOTAL_BROODSTOCK_REMOVALS', 'OTHER_REMOVALS', 'TOTAL_RETURN_TO_RIVER', 'ENUMERATION_METHODS', 'ADULT_PRESENCE', 'JACK_PRESENCE', 'START_DTT', 'END_DTT', 'NATURAL_ADULT_FEMALES', 'NATURAL_ADULT_MALES', 'EFFECTIVE_FEMALES', 'WEIGHTED_PCT_SPAWN', 'WATERSHED_CDE', 'WATERBODY_ID', 'POPULATION', 'RUN_TYPE', 'STREAM_ARRIVAL_DT_FROM', 'STREAM_ARRIVAL_DT_TO', 'START_SPAWN_DT_FROM', 'START_SPAWN_DT_TO', 'PEAK_SPAWN_DT_FROM', 'PEAK_SPAWN_DT_TO', 'END_SPAWN_DT_FROM', 'END_SPAWN_DT_TO', 'ACCURACY', 'PRECISION', 'INDEX_YN', 'RELIABILITY', 'ESTIMATE_STAGE', 'ESTIMATE_CLASSIFICATION', 'NO_INSPECTIONS_USED', 'ESTIMATE_METHOD', 

### Accuracy

#### Accuracy Type 1 (A1, Mixed Data Types, Symbols in Numerics) 

Test whether numeric columns contain symbols.
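
One way to picture this check is a pattern match on each entry. The sketch below is illustrative only: the names `is_clean_numeric` and `clean_numeric_share`, the pattern, and the sample column are hypothetical, and the `Accuracy` class may apply different rules (for example, about thousands separators).

```python
import re

# Accepts an optional sign, digits, and an optional decimal part (e.g. "-12", "3.5").
NUMERIC_PATTERN = re.compile(r"[+-]?\d+(\.\d+)?")

def is_clean_numeric(value):
    """True if the value is a number or a string that reads as a plain number."""
    if isinstance(value, bool):
        return False
    if isinstance(value, (int, float)):
        return True
    return isinstance(value, str) and NUMERIC_PATTERN.fullmatch(value.strip()) is not None

def clean_numeric_share(values):
    """Fraction of non-blank entries that are clean numerics."""
    checked = [v for v in values if v is not None and v != ""]
    if not checked:
        return 1.0  # assumption: an empty column has nothing to penalize
    return sum(is_clean_numeric(v) for v in checked) / len(checked)

# Made-up column: "1,200", "~30" and "N/A" contain symbols or text.
column = [120, "45", "1,200", "~30", "17.5", "N/A"]
print(clean_numeric_share(column))  # 0.5
```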

#### Accuracy Type 2 (A2, Outliers)

Find outliers: values that lie more than 1.5 times the interquartile range (or another chosen threshold) away from the quartiles.
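
The interquartile-range rule can be sketched with the standard library as follows. This assumes the common convention of flagging values below Q1 - threshold × IQR or above Q3 + threshold × IQR; the function name `iqr_outliers` and the sample counts are hypothetical, and the A2 metric's grouping and minimum-score options are not modelled here.

```python
from statistics import quantiles

def iqr_outliers(values, threshold=1.5):
    """Return the values lying more than threshold * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    lower = q1 - threshold * iqr
    upper = q3 + threshold * iqr
    return [v for v in values if v < lower or v > upper]

# Made-up counts with one obviously extreme value.
counts = [10, 12, 11, 13, 12, 14, 11, 300]
print(iqr_outliers(counts))  # [300]
```

Raising `threshold` makes the test more tolerant, so fewer values are flagged.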

In [None]:
# Test Accuracy Calculations
# Using default threshold, group by, and min score for A2 metric 
accuracy_tests = Accuracy(
    dataset_path=DATA_FILE_PATH,
    # selected_columns=[" Egg Target ", " Release/ Transfer Target ", " Coded Wire Tag Target ", " Fin Clip Target ", " Thermal Mark Target ", " Parentage-based Tag Target ", " PIT Tag Target "]
    selected_columns=["AREA", "ANALYSIS_YR", "NATURAL_ADULT_SPAWNERS", "NATURAL_JACK_SPAWNERS", "NATURAL_SPAWNERS_TOTAL", "ADULT_BROODSTOCK_REMOVALS", "JACK_BROODSTOCK_REMOVALS", "TOTAL_BROODSTOCK_REMOVALS", "OTHER_REMOVALS", "TOTAL_RETURN_TO_RIVER", "NATURAL_ADULT_MALES", "EFFECTIVE_FEMALES", "WEIGHTED_PCT_SPAWN", "NO_INSPECTIONS_USED", "ACT_ID", "POP_ID", "GFE_ID"],
    return_type='dataset'
    # logging_path='metric_output_logs/'
)

accuracy_score = calculate_dimension_score("Accuracy", scores=accuracy_tests.run_metrics(), weights={})
DIMENSION_SCORES.append(accuracy_score)
print(accuracy_score)



True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
When trying to create one line summary for A2, the following error occurred: can only concatenate str (not "list") to str
{'dimension': 'Accuracy', 'score': 0.7941176470588236}


### Completeness (P)

#### Completeness Type 1 (P1)

The threshold is the percentage of blanks at which a column is removed.
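
The sketch below shows one plausible reading of this rule: columns whose share of blanks reaches the threshold are dropped, and completeness is the share of filled cells in the remaining columns. The function name `completeness`, the default threshold of 0.75, the scoring formula, and the sample rows are all assumptions for illustration; the `Completeness` class may behave differently.

```python
def completeness(rows, threshold=0.75):
    """rows: list of dicts with identical keys. Returns (score, dropped_columns)."""
    def is_blank(v):
        return v is None or (isinstance(v, str) and v.strip() == "")

    columns = list(rows[0].keys())
    blank_share = {c: sum(is_blank(r[c]) for r in rows) / len(rows) for c in columns}
    # Drop columns that are blank at least `threshold` of the time.
    dropped = [c for c in columns if blank_share[c] >= threshold]
    kept = [c for c in columns if c not in dropped]
    if not kept:
        return 0.0, dropped
    total = len(kept) * len(rows)
    filled = sum(not is_blank(r[c]) for r in rows for c in kept)
    return filled / total, dropped

# Made-up rows: LOCAL_NAME_2 is blank in 3 of 4 rows, so it is dropped.
rows = [
    {"AREA": "12", "SPECIES": "Chum", "LOCAL_NAME_2": ""},
    {"AREA": "13", "SPECIES": "", "LOCAL_NAME_2": ""},
    {"AREA": "13", "SPECIES": "Coho", "LOCAL_NAME_2": ""},
    {"AREA": "", "SPECIES": "Pink", "LOCAL_NAME_2": "Side channel"},
]
print(completeness(rows))  # (0.75, ['LOCAL_NAME_2'])
```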

In [None]:
completeness_tests = Completeness(
    dataset_path=DATA_FILE_PATH,
    return_type='dataset'
    # logging_path='metric_output_logs/'
)

completeness_score = calculate_dimension_score("Completeness", completeness_tests.run_metrics(), weights={})
DIMENSION_SCORES.append(completeness_score)
print(completeness_score)

{'dimension': 'Completeness', 'score': 0.8482207305966877}


### Uniqueness (U)

#### Uniqueness Type 1 (U1)

Find duplicated rows.
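
A minimal sketch of duplicate detection, assuming each row is compared across all of its columns and the score is the share of rows that are not extra copies. The function name `uniqueness_score`, the scoring formula, and the sample rows are hypothetical; the `Uniqueness` class may define the score differently.

```python
from collections import Counter

def uniqueness_score(rows):
    """rows: list of tuples. Returns (score, duplicated_rows)."""
    counts = Counter(rows)
    duplicated = [row for row, n in counts.items() if n > 1]
    # Every occurrence beyond the first counts as an extra copy.
    extra_copies = sum(n - 1 for n in counts.values())
    score = 1 - extra_copies / len(rows) if rows else 1.0
    return score, duplicated

# Made-up rows: the first two are identical.
rows = [
    ("12", "Chum", 2001),
    ("12", "Chum", 2001),
    ("13", "Coho", 2001),
    ("13", "Pink", 2002),
]
print(uniqueness_score(rows))  # (0.75, [('12', 'Chum', 2001)])
```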

In [None]:
uniqueness_tests = Uniqueness(
    dataset_path=DATA_FILE_PATH,
    return_type='dataset'
    # logging_path='metric_output_logs/'
)

uniqueness_score = calculate_dimension_score("Uniqueness", uniqueness_tests.run_metrics(), weights={})
DIMENSION_SCORES.append(uniqueness_score)
print(uniqueness_score)

Duplicate Rows:
Empty DataFrame
Columns: [AREA, WATERBODY, GAZETTED_NAME, LOCAL_NAME_1, LOCAL_NAME_2, ANALYSIS_YR, SPECIES, NATURAL_ADULT_SPAWNERS, NATURAL_JACK_SPAWNERS, NATURAL_SPAWNERS_TOTAL, ADULT_BROODSTOCK_REMOVALS, JACK_BROODSTOCK_REMOVALS, TOTAL_BROODSTOCK_REMOVALS, OTHER_REMOVALS, TOTAL_RETURN_TO_RIVER, ENUMERATION_METHODS, ADULT_PRESENCE, JACK_PRESENCE, START_DTT, END_DTT, NATURAL_ADULT_FEMALES, NATURAL_ADULT_MALES, EFFECTIVE_FEMALES, WEIGHTED_PCT_SPAWN, WATERSHED_CDE, WATERBODY_ID, POPULATION, RUN_TYPE, STREAM_ARRIVAL_DT_FROM, STREAM_ARRIVAL_DT_TO, START_SPAWN_DT_FROM, START_SPAWN_DT_TO, PEAK_SPAWN_DT_FROM, PEAK_SPAWN_DT_TO, END_SPAWN_DT_FROM, END_SPAWN_DT_TO, ACCURACY, PRECISION, INDEX_YN, RELIABILITY, ESTIMATE_STAGE, ESTIMATE_CLASSIFICATION, NO_INSPECTIONS_USED, ESTIMATE_METHOD, CREATED_DTT, UPDATED_DTT, ACT_ID, POP_ID, GFE_ID]
Index: []

[0 rows x 49 columns]

Duplication Score: 100.0%
When trying to create one line summary for U1, the following error occurred: '>' not su

## Determine Overall Data Quality Grade

In [9]:
# Call grade calculation here
print(f'DQ grade for this dataset is: {calculate_DQ_grade(DIMENSION_SCORES)}')

DQ grade for this dataset is: C
