# How to Use

1. Run everything in the **Setup** section. 
    - Change the working directory to **your** working directory. The code for this is already there (the commented-out `os.chdir` line).
    - Make sure the Excel document for logging the scores also exists in your working directory, and that its file name matches `log_file` in `log_score`.

2. Determine *whether each test needs to be run* by understanding what the test does. 
    - For details, refer to [this document](https://086gc.sharepoint.com/:x:/r/sites/PacificSalmonTeam/Shared%20Documents/General/02%20-%20PSSI%20Secretariat%20Teams/04%20-%20Strategic%20Salmon%20Data%20Policy%20and%20Analytics/02%20-%20Data%20Governance/00%20-%20Projects/10%20-%20Data%20Quality/Presentation/DQP%20Demo.xlsx?d=wc15abe6743954df980a05f09fe99a560&csf=1&web=1&e=CJeb6h).

3. Some requirements for the datasets:
    - The data must be on the **first sheet** in the Excel document.
    - The **first row** must be the column names. 
    - The tests won't run while the Excel file is open.

4. After running all the tests, upload the Excel document for logging the scores to SharePoint using the `copy_log_file` function in the "Saving the file to sharepoint" section. 

Note: The Output Reports are for cases where a data steward asks why their dataset received a certain score. If a metric is not covered in Output Reports, running the test itself generates output that can be put into a report.  
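
The file requirements in steps 1 and 3 can be checked before running any test. The sketch below is one possible pre-flight check. `check_inputs` is a hypothetical helper that is not part of this notebook; it only verifies that the files exist and that the dataset has an extension the notebook's reader accepts (`.csv` or `.xlsx`).

```python
import os


def check_inputs(dataset_path, log_file="DQS_Log_Beta.xlsx"):
    """Return a list of problems found; an empty list means the inputs look usable.

    Hypothetical helper, for illustration only.
    """
    problems = []
    if not os.path.isfile(log_file):
        problems.append(f"Log file not found: {log_file}")
    if not os.path.isfile(dataset_path):
        problems.append(f"Dataset not found: {dataset_path}")
    extension = os.path.splitext(dataset_path)[1]
    if extension not in (".csv", ".xlsx"):
        problems.append(f"Unsupported file type: {extension or '(none)'}")
    return problems


# Example call with made-up paths that do not exist
for problem in check_inputs("missing_dataset.txt", log_file="missing_log.xlsx"):
    print(problem)
```

It does not check whether a file is currently open in Excel, so that requirement still has to be verified by hand.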

# Setup

Run everything in this section, and double-check the working directory so that the data can be read from it.

All of these functions are used in the process of calculating data quality. 

In [106]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import os
import re
from difflib import SequenceMatcher
from datetime import datetime

Make sure to set the correct working directory.

In [107]:
# Change working directory to the same place where you saved the test datasets
# os.chdir('C:/Users/luos/OneDrive - DFO-MPO/Python') #change directory
os.getcwd()  # check where the directory is (and whether the change was successful or not)
# Windows user name: used to build the dataset paths below and recorded in the score log
GLOBAL_USER = "EwertM"

Function to read either csv or xlsx data 

In [108]:
# Function 0: Reading the dataset file
def read_data(dataset_path):
    _, file_extension = os.path.splitext(dataset_path)
    if file_extension == ".csv":
        # cp1252 is specified because reading some csv files fails with the default encoding
        df = pd.read_csv(dataset_path, encoding="cp1252")
    elif file_extension == ".xlsx":
        df = pd.read_excel(dataset_path)
    else:
        print("Unsupported file type")
        df = None
    return df
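
To illustrate why `read_data` passes `encoding="cp1252"` for csv files, the sketch below encodes a bilingual column header (taken from the test dataset used later in this notebook) as cp1252 bytes and shows that decoding those bytes as UTF-8 fails. It assumes the source file was saved with the Windows-1252 encoding, which is an assumption about the files rather than something the notebook states.

```python
header = "DEPOT NAME / NOM DU DÉPÔT"
raw_bytes = header.encode("cp1252")  # how a cp1252-encoded csv stores the text

# Try to decode the same bytes as UTF-8
try:
    raw_bytes.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

decoded = raw_bytes.decode("cp1252")
print(utf8_ok)            # False
print(decoded == header)  # True
```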

Function to log the scores into an existing xlsx file (create the file beforehand)

In [109]:
# Function to log a new row into the DQS_Log_XX.xlsx file
def log_score(test_name, dataset_name, score, selected_columns, threshold=None):
    # The score is logged as provided (no conversion to a percentage)
    percentage_score = score

    # Name of the Excel log file in the working directory
    log_file = "DQS_Log_Beta.xlsx"

    # Set threshold to "no threshold" if it is not provided
    if threshold is None:
        threshold_value = "no threshold"
    else:
        threshold_value = threshold

    # If selected_columns is None, assume "All" was tested
    if selected_columns is None:
        columns_tested = "All columns"
    else:
        # Convert selected_columns list to a string if specific columns are provided
        columns_tested = ", ".join(selected_columns)

    # Try loading the existing Excel file
    try:
        df = read_data(log_file)
    except FileNotFoundError:
        # Create an empty DataFrame if file doesn't exist (shouldn't be the case if you already created it)
        df = pd.DataFrame(
            columns=[
                "Dataset",
                "Test",
                "Threshold",
                "Date_Calculated",
                "Score",
                "Columns_Tested",
                "User",
            ]
        )

    # Prepare the new row as a DataFrame
    new_row = pd.DataFrame(
        {
            "Dataset": [dataset_name],
            "Columns_Tested": [columns_tested],  # Add the list of columns tested
            "Test": [test_name],
            "Date_Calculated": [datetime.now().strftime("%Y-%m-%d %H:%M:%S")],
            "Threshold": [threshold_value],
            "Score": [percentage_score],
            "User": [GLOBAL_USER],  # Who ran the test
        }
    )

    # Append the new row to the DataFrame
    df = pd.concat([df, new_row], ignore_index=True)

    # Save the updated DataFrame back to the Excel file
    df.to_excel(log_file, index=False)

Function to extract dataset name from a path

In [110]:
def get_dataset_name(dataset_path):
    # Extract the file name from the path (e.g., 'Dataset_A.csv')
    file_name = os.path.basename(dataset_path)
    # Split the file name to remove the extension (e.g., 'Dataset_A')
    dataset_name = os.path.splitext(file_name)[0]
    return dataset_name

# Data Quality Tests

### Consistency

#### Consistency Type 1 (C1)

Calculate consistency score of a dataset

This code works best on CSV data with the column names in the first row. It also accepts xlsx files, but if the workbook has more than one sheet, only the first sheet is read.

Limitations: It will not check for differences in capitalization of the same word (since all the words will be changed to lower case before the similarity score is calculated)
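
The normalization step behind this limitation can be tried on its own. The sketch below is a trimmed copy of the notebook's `normalize_text` with only two provinces in the dictionary; `normalize_text_sketch` and `abbreviations` are illustrative names. It shows that capitalization and punctuation differences disappear, and also that the English word "on" is expanded because abbreviations are matched against lowercase text.

```python
import re

# Subset of the province dictionary, for illustration only
abbreviations = {"BC": "British Columbia", "ON": "Ontario"}


def normalize_text_sketch(text):
    """Lowercase, expand province abbreviations, drop punctuation, collapse spaces."""
    text = str(text).lower().strip()
    for abbr, full in abbreviations.items():
        text = re.sub(r"\b" + abbr.lower() + r"\b", full.lower(), text)
    text = "".join(ch for ch in text if ch.isalnum() or ch.isspace())
    return " ".join(text.split())


print(normalize_text_sketch("Victoria, BC"))    # victoria british columbia
print(normalize_text_sketch("VICTORIA,  bc "))  # victoria british columbia
print(normalize_text_sketch("Depot on Main"))   # depot ontario main
```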

In [111]:
# Consistency Type 1 (C1) function

# Dictionary mapping Canadian province abbreviations to their full names
province_abbreviations = {
    "BC": "British Columbia",
    "ON": "Ontario",
    "QC": "Quebec",
    "AB": "Alberta",
    "MB": "Manitoba",
    "SK": "Saskatchewan",
    "NS": "Nova Scotia",
    "NB": "New Brunswick",
    "NL": "Newfoundland and Labrador",
    "PE": "Prince Edward Island",
    "NT": "Northwest Territories",
    "YT": "Yukon",
    "NU": "Nunavut",
}


def normalize_text(text, remove_numbers=False):
    """
    Normalize input text by converting to lowercase, stripping whitespace,
    replacing province abbreviations with full names, and removing non-alphanumeric characters.
    Optionally remove numbers based on the flag.
    """
    text = str(text).lower().strip()
    for abbr, full in province_abbreviations.items():
        text = re.sub(r"\b" + abbr.lower() + r"\b", full.lower(), text)
    if remove_numbers:
        text = re.sub(r"\d+", "", text)
    text = "".join(char for char in text if char.isalnum() or char.isspace())
    return " ".join(text.split())


def extract_numbers(text):
    """
    Extract all numbers from the input text and return them as a list of strings.
    """
    return re.findall(r"\d+", text)


def remove_short_numbers(text):
    """
    Remove numbers with 1 to 4 digits from the input text.
    """
    return re.sub(r"\b\d{1,4}\b", "", text)


def numeric_similarity(num1_list, num2_list):
    """
    Calculate the similarity between two lists of numbers by comparing each digit.
    Return the proportion of matching digits.
    """
    num1, num2 = " ".join(num1_list), " ".join(num2_list)
    matches = sum(1 for a, b in zip(num1, num2) if a == b)
    max_length = max(len(num1), len(num2))
    return matches / max_length if max_length > 0 else 0


def string_similarity(str1, str2):
    """
    Calculate the similarity between two strings using the SequenceMatcher from difflib.
    Return the similarity ratio.
    """
    return SequenceMatcher(None, str1, str2).ratio()


def calculate_cosine_similarity(text_list, ref_list, Stop_Words):
    """
    Calculate the cosine similarity between lists of texts using TF-IDF vectorization.
    """
    vectorizer = TfidfVectorizer(
        stop_words=Stop_Words, analyzer="word", ngram_range=(1, 2)
    )
    ref_vec = vectorizer.fit_transform(ref_list)
    text_vec = vectorizer.transform(text_list)
    return cosine_similarity(text_vec, ref_vec)


def contains_short_number(num_list):
    """
    Check if any number in the list has 4 digits or fewer.
    """
    return any(len(num) <= 4 for num in num_list)


def numbers_match(num_list1, num_list2):
    """
    Check if any number in the first list is present in the second list.
    """
    return any(num in num_list2 for num in num_list1)


def calculate_combined_similarity(df, unique_observations, text_similarity_matrix):
    """
    Combine text and numeric similarities into a single similarity matrix.
    """
    # Make a copy of the text similarity matrix to modify it
    combined_sim_matrix = np.copy(text_similarity_matrix)

    # Extract numeric parts from each unique observation
    numeric_parts = [extract_numbers(obs) for obs in unique_observations]

    # Iterate over each pair of unique observations to calculate numeric similarity
    for i, num_i in enumerate(numeric_parts):
        for j, num_j in enumerate(numeric_parts):
            if i != j:
                # Calculate the numeric similarity for the current pair
                num_sim = numeric_similarity(num_i, num_j)

                # Update the combined similarity matrix with the maximum value between text and numeric similarity
                combined_sim_matrix[i, j] = max(combined_sim_matrix[i, j], num_sim)

    # Iterate over each pair of unique observations to calculate string similarity
    for i, obs_i in enumerate(unique_observations):
        for j, obs_j in enumerate(unique_observations):
            if i != j:
                # Calculate the string similarity for the current pair
                seq_sim = string_similarity(obs_i, obs_j)

                # Update the combined similarity matrix with the maximum value between existing and sequence matcher
                combined_sim_matrix[i, j] = max(combined_sim_matrix[i, j], seq_sim)

    return combined_sim_matrix


def average_consistency_score(cosine_sim_matrix, threshold):
    """
    Calculate the average consistency score based on the cosine similarity matrix and a given threshold.
    """
    num_rows, num_columns = cosine_sim_matrix.shape
    inconsistency = 0

    for i in range(num_rows):
        if np.any(
            (cosine_sim_matrix[i] > threshold) & (cosine_sim_matrix[i] <= 1.0000000)
        ):
            inconsistency += 1

    return (num_rows - inconsistency) / num_rows


def process_and_calculate_similarity(
    dataset_path, column_names, threshold, Stop_Words=["the", "and"]
):
    """
    Process the dataset, normalize the text, and calculate the similarity scores for multiple columns.
    """
    # Read the dataset (csv or xlsx) from the provided path
    df = read_data(dataset_path)
    overall_consistency_scores = []

    # Iterate over each specified column
    for column_name in column_names:
        # Normalize the text in the specified column and store the results in a new column
        df[f"Normalized {column_name}"] = df[column_name].apply(normalize_text)

        # Get unique normalized observations by removing duplicates and NaN values
        unique_observations = pd.unique(
            df[f"Normalized {column_name}"].dropna().values.ravel()
        )

        # Calculate the cosine similarity matrix for the unique normalized observations
        text_sim_matrix = calculate_cosine_similarity(
            unique_observations.tolist(), unique_observations.tolist(), Stop_Words
        )

        # Set the diagonal of the similarity matrix to 0 to ignore self-similarity
        np.fill_diagonal(text_sim_matrix, 0)

        # Combine text similarity with numeric similarity to get a final similarity matrix
        combined_sim_matrix = calculate_combined_similarity(
            df, unique_observations, text_sim_matrix
        )

        # Initialize columns in the dataframe to store the recommended organization matches and all matches
        df[f"Recommended {column_name}"] = None
        df[f"All Matches {column_name}"] = None

        # Iterate over each normalized organization in the dataframe
        for i, norm_org in enumerate(df[f"Normalized {column_name}"]):
            # Find the index of the current normalized organization in the unique observations
            try:
                current_index = np.where(unique_observations == norm_org)[0][0]
            except IndexError:
                df.at[i, f"Recommended {column_name}"] = "No significant match"
                df.at[i, f"All Matches {column_name}"] = []
                continue

            # Get the similarities for the current organization from the combined similarity matrix
            similarities = combined_sim_matrix[current_index]

            # Find the indices and values of all matching organizations
            matched_indices = np.where(similarities >= threshold)[0]
            all_matches = [unique_observations[idx] for idx in matched_indices]
            all_match_scores = [similarities[idx] for idx in matched_indices]

            best_score = 0
            best_match = "No significant match"

            # Extract numbers from the current organization
            num_list_current = extract_numbers(norm_org)

            for idx in matched_indices:
                candidate_match = unique_observations[idx]
                num_list_candidate = extract_numbers(candidate_match)

                if contains_short_number(num_list_current) or contains_short_number(
                    num_list_candidate
                ):
                    # If short numbers are present, ensure they match; otherwise, skip this match
                    if not numbers_match(num_list_current, num_list_candidate):
                        continue
                    # Recalculate similarity excluding short numbers
                    norm_org_no_nums = remove_short_numbers(norm_org)
                    candidate_no_nums = remove_short_numbers(candidate_match)
                    recalculated_similarity = string_similarity(
                        norm_org_no_nums, candidate_no_nums
                    )
                    if recalculated_similarity > best_score:
                        best_score = recalculated_similarity
                        best_match = candidate_match
                else:
                    if similarities[idx] > best_score:
                        best_score = similarities[idx]
                        best_match = candidate_match

            # Assign the best match to the dataframe
            if best_score > threshold:
                df.at[i, f"Recommended {column_name}"] = (
                    f"{best_match} ({best_score:.2f})"
                )
            else:
                df.at[i, f"Recommended {column_name}"] = "No significant match"

            # Store all matches
            df.at[i, f"All Matches {column_name}"] = ", ".join(
                [
                    f"{match} ({score:.2f})"
                    for match, score in zip(all_matches, all_match_scores)
                    if score > threshold
                ]
            )

        # Calculate the overall consistency score for the current column
        consistency_score = average_consistency_score(text_sim_matrix, threshold)
        overall_consistency_scores.append(consistency_score)

    # Calculate the overall consistency score as the average of individual consistency scores
    overall_consistency_score = np.mean(overall_consistency_scores)
    df["Overall Consistency Score"] = overall_consistency_score

    # log the results
    log_score(
        test_name="Consistency (C1)",
        dataset_name=get_dataset_name(dataset_path),
        selected_columns=column_names,
        threshold=threshold,
        score=overall_consistency_score,
    )

    return overall_consistency_score  # to return the score
    # return df #to return the dataset
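
The counting rule in `average_consistency_score` treats a unique value as inconsistent when at least one other unique value is more similar to it than the threshold, and the score is the share of unique values that are not flagged. The sketch below applies that rule to a made-up list of names. It is a simplification: it measures similarity with `SequenceMatcher` only, whereas the notebook's score is computed from TF-IDF cosine similarity, so the numbers are not comparable. `consistency_score_sketch` is a hypothetical name.

```python
from difflib import SequenceMatcher


def consistency_score_sketch(values, threshold):
    """Share of unique values that have no near-duplicate above the threshold."""
    unique_values = sorted(set(values))
    flagged = 0
    for i, value in enumerate(unique_values):
        has_near_duplicate = any(
            SequenceMatcher(None, value, other).ratio() > threshold
            for j, other in enumerate(unique_values)
            if i != j
        )
        if has_near_duplicate:
            flagged += 1
    return (len(unique_values) - flagged) / len(unique_values)


# Made-up values: the first two differ by one letter
names = ["campbell river", "campbel river", "port hardy", "nanaimo"]
print(consistency_score_sketch(names, threshold=0.91))  # 0.5
```

Both spellings of the first name are flagged, so two of the four unique values count as consistent.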

##### Test the dataset by changing the path

In [112]:
dataset = "Salmon Head Depot"
datafile = "Pacific-Recreational-Fishery-Salmon-Head-Depots.csv"
datafilepath = f"C:/Users/{GLOBAL_USER}/OneDrive - DFO-MPO/04 - Strategic Salmon Data Policy and Analytics/07 - Data Products & Data/21 - Transitory Files/{dataset}/{datafile}"
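# Test Consistency Calculations
# (the function returns the overall score; switch to the commented-out `return df` to get the DataFrame instead)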

processed_df = process_and_calculate_similarity(
    dataset_path=datafilepath,
    column_names=[
        "DEPOT NAME / NOM DU DÉPÔT",
        "AREA / LA RÉGION",
        "MUNICIPALITY / MUNICIPALITÉ",
        "ADDRESS / ADRESSE",
        "STORAGE INFORMATION / DÉTAILS DE STOCKAGE",
    ],
    threshold=0.91,
)

# processed_df['Overall Consistency Score'].min()
processed_df

np.float64(1.0)

#### Consistency Type 2 (C2)

Calculate consistency score of datasets with a reference list

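For the best score, values in the compared columns should be identical to the entries in the reference list. A value counts toward the score only if its cosine similarity to at least one reference entry meets the threshold.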
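
The sketch below illustrates how a value that is close to a reference entry, but not identical, can fall below the threshold. It is a pure-Python stand-in for word-count cosine similarity, not the notebook's `CountVectorizer` code, and the names are made-up examples; `word_counts`, `cosine`, and `results` are illustrative.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Lowercased word counts (tokens of two or more word characters)."""
    return Counter(re.findall(r"\b\w\w+\b", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


reference = ["Fraser River Sockeye", "Skeena River Chinook"]  # made-up reference names
observed = ["Fraser River Sockeye", "Fraser Sockeye"]  # made-up dataset values

# Best similarity of each observed value against the reference list
results = {
    value: max(cosine(word_counts(value), word_counts(ref)) for ref in reference)
    for value in observed
}
for value, best in results.items():
    print(value, round(best, 2), best >= 0.91)
```

With a threshold of 0.91, the exact match passes and the shortened name (similarity about 0.82) does not.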

In [113]:
# Function 1: Get names used for a single column
def get_names_used_for_column(df, column_name):
    unique_observations = pd.unique(df[column_name].dropna().values.ravel())
    return unique_observations


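# Function 2: Calculate Cosine Similarity
# Note: this redefines the C1 function of the same name (CountVectorizer instead of TF-IDF);
# re-run the C1 cell before running the C1 test again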
def calculate_cosine_similarity(text_list, ref_list, Stop_Words):
    count_vectorizer = CountVectorizer(stop_words=Stop_Words)
    ref_vec = count_vectorizer.fit_transform(ref_list).todense()
    ref_vec_array = np.array(ref_vec)
    text_vec = count_vectorizer.transform(text_list).todense()
    text_vec_array = np.array(text_vec)
    cosine_sim = np.round((cosine_similarity(text_vec_array, ref_vec_array)), 2)
    return cosine_sim


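# Function 3: Average Consistency Score
# Note: this also redefines the C1 function of the same name;
# re-run the C1 cell before running the C1 test again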
def average_consistency_score(cosine_sim_df, threshold=0.91):
    num_rows, num_columns = cosine_sim_df.shape
    total_count = 0  # This will count all values above or equal to the threshold

    for i in range(num_rows):
        if np.max(cosine_sim_df[i]) >= threshold:  # Include all comparisons
            total_count += 1
    total_observations = num_rows  # Total number of observations
    average_consistency_score = total_count / total_observations
    return average_consistency_score


def process_and_calculate_similarity_ref(
    dataset_path,
    column_mapping,
    ref_dataset_path=None,
    threshold=0.91,
    Stop_Words=["activity"],  # CountVectorizer expects a list of words, "english", or None
):
    # Read the data file
    df = read_data(dataset_path)

    # Initialize ref_df if a ref dataset is provided
    if ref_dataset_path:
        df_ref = read_data(ref_dataset_path)
        ref_data = True  # Flag to indicate we are using a ref dataset
    else:
        ref_data = False  # No ref dataset, compare within the same dataset

    all_consistency_scores = []

    for selected_column, m_selected_column in column_mapping.items():
        if ref_data:
            # Compare to ref dataset
            unique_observations = get_names_used_for_column(df_ref, m_selected_column)
        else:
            # Use own column for comparison
            unique_observations = get_names_used_for_column(df, selected_column)

        cosine_sim_matrix = calculate_cosine_similarity(
            df[selected_column].dropna(), unique_observations, Stop_Words=Stop_Words
        )
        column_consistency_score = average_consistency_score(
            cosine_sim_matrix, threshold
        )
        all_consistency_scores.append(column_consistency_score)

    # Calculate the average of all consistency scores
    overall_avg_consistency = (
        sum(all_consistency_scores) / len(all_consistency_scores)
        if all_consistency_scores
        else None
    )

    # log the results
    log_score(
        test_name="Consistency (C2)",
        dataset_name=get_dataset_name(dataset_path),
        selected_columns=column_mapping,
        threshold=threshold,
        score=overall_avg_consistency,
    )

    print(f"overall_avg_consistency = {overall_avg_consistency}")
    return overall_avg_consistency

##### Test the dataset by changing the path

In [114]:
# column_mapping = {
#     "STOCK_CU_NAME": "CU_Display",
#     "STOCK_CU_INDEX": "FULL_CU_IN",
# }  # the pattern for comparison is 'dataset column' : 'reference column'
# process_and_calculate_similarity_ref(
#     dataset_path="data/test/2024-03-28 1_qryThermal_NatEmerg.xlsx",
#     column_mapping=column_mapping,
#     ref_dataset_path="data/Pacific Salmon Population Unit Crosswalk_Final_20240513.xlsx",
#     threshold=1,
#     Stop_Words=[""],
# )

### Accuracy

#### Accuracy Type 1 (A1, Mixed Data Types, Symbols in Numerics) 

Test whether there are symbols in numerics
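
The check used by this test flags every character that is not a digit or a period. The sketch below repeats that rule next to an alternative based on `float()`, which the notebook does not use, to show where the two disagree: a minus sign is flagged as a symbol even though the value is a valid number (this affects columns such as longitudes west of the prime meridian), while a value with two periods passes. `is_number` is a hypothetical helper.

```python
def find_non_digits(s):
    # Same rule as the notebook: anything other than a digit or "." is flagged
    return [char for char in str(s) if not (char.isdigit() or char == ".")]


def is_number(s):
    # Alternative check (not used by the notebook): let float() decide
    try:
        float(str(s))
        return True
    except ValueError:
        return False


# Made-up values: flagged characters, then whether float() accepts the value
for value in ["49.25", "-123.10", "1,200", "1.2.3"]:
    print(value, find_non_digits(value), is_number(value))
```

Note that `float()` also accepts scientific notation and strings such as "nan", so it is not a drop-in replacement.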

In [115]:
# Function 1: Using isdigit to find non-numerical entries
def find_non_digits(s):
    # Ensure the value is treated as a string
    s = str(s)
    return [char for char in s if not (char.isdigit() or char == ".")]


# Function 2 : Calculate the score
def accuracy_score(dataset_path, selected_columns):
    adf = read_data(dataset_path)
    selected_columns = [col for col in adf.columns if col in selected_columns]

    all_accuracy_scores = []

    for column_name in selected_columns:
        # Drop NA, null, or blank values from column
        column_data = adf[column_name].dropna()

        total_rows = len(column_data)

        if total_rows > 0:  # to avoid division by zero
            non_digit_chars_per_row = column_data.apply(find_non_digits)
            non_numerical_count = non_digit_chars_per_row.apply(
                lambda x: len(x) > 0
            ).sum()
            accuracy_score = (total_rows - non_numerical_count) / total_rows
            all_accuracy_scores.append(accuracy_score)

    overall_accuracy_score = (
        sum(all_accuracy_scores) / len(all_accuracy_scores)
        if all_accuracy_scores
        else None
    )

    # log the results
    log_score(
        test_name="Accuracy (A1)",
        dataset_name=get_dataset_name(dataset_path),
        selected_columns=selected_columns,
        threshold=None,
        score=overall_accuracy_score,
    )
    print(f"overall_accuracy_score = {overall_accuracy_score}")
    return overall_accuracy_score

##### Test the dataset by changing the path

In [116]:
# dataset = "NuSEDS Escapement"
# datafile = "Yukon and Transboundary NuSEDS_20241004.xlsx"
# datafilepath = f"C:/Users/{GLOBAL_USER}/OneDrive - DFO-MPO/04 - Strategic Salmon Data Policy and Analytics/07 - Data Products & Data/21 - Transitory Files/{dataset}/{datafile}"
accuracy_score(
    dataset_path=datafilepath,
    selected_columns=["LATITUDE / LATITUDE", "LONGITUDE / LONGITUDE"],
)

overall_accuracy_score = 0.5


np.float64(0.5)

#### Accuracy Type 2 (A2 Outliers)

In [117]:
def find_outliers_iqr(
    dataset_path,
    selected_columns,
    groupby_column=None,
    threshold=1.5,
    minimum_score=0.85,
):
    df = read_data(dataset_path)

    outliers_dict = {}

    # If a groupby column is specified, perform the IQR calculation within each group

    if groupby_column:
        grouped = df.groupby(groupby_column)
        for column in selected_columns:
            # Apply the outlier detection for each group
            outliers = grouped[column].apply(
                lambda x: (
                    (
                        x
                        < x.quantile(0.25)
                        - threshold * (x.quantile(0.75) - x.quantile(0.25))
                    )
                    | (
                        x
                        > x.quantile(0.75)
                        + threshold * (x.quantile(0.75) - x.quantile(0.25))
                    )
                )
            )
            # For each group, store the share of rows that are not outliers (a Series indexed by group)
            outliers_dict[column] = 1 - outliers.groupby(groupby_column).mean()
    else:
        # Perform the IQR calculation on the whole column if no groupby column is specified
        for column in selected_columns:
            Q1 = df[column].quantile(0.25)
            Q3 = df[column].quantile(0.75)
            IQR = Q3 - Q1
            lower_bound = Q1 - threshold * IQR
            upper_bound = Q3 + threshold * IQR
            outliers = (df[column] < lower_bound) | (df[column] > upper_bound)
            outliers_dict[column] = 1 - outliers.mean()

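    # Compute the final score: the share of columns (or column/group pairs when
    # groupby_column is set) whose non-outlier proportion exceeds minimum_score
    scores = [s for value in outliers_dict.values() for s in np.atleast_1d(value)]
    total_groups = len(scores)
    groups_above = sum(1 for score in scores if score > minimum_score)
    final_score = groups_above / total_groups if total_groups > 0 else 0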

    #final_score = {}

    # for key in outliers_dict.keys():
    #     print(outliers_dict[key])
    #     arr = outliers_dict[key].values
    #     value_out = np.sum(arr > minimum_score) / len(arr)
    #     final_score[key] = value_out
    
    # for key, value in outliers_dict.items():  
    #     print(key, value)
    #     # Check if the proportion of non-outliers is greater than the minimum score  
    #     value_out = value > minimum_score  
    #     # Store the result (True or False) in the final_score dictionary  
    #     final_score[key] = value_out  

    # log the results

    log_score(
        test_name="Accuracy (A2)",
        dataset_name=get_dataset_name(dataset_path),
        selected_columns=selected_columns,
        threshold=threshold,
        score=final_score,
    )

    return outliers_dict, final_score
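
The IQR rule used above can be followed by hand on a small sample. The sketch below uses only the standard library; `non_outlier_share` is a hypothetical helper and the sample values are made up. The `inclusive` quantile method is chosen because it should match the linear interpolation that pandas uses by default.

```python
from statistics import quantiles


def non_outlier_share(values, threshold=1.5):
    """Share of values inside [Q1 - threshold*IQR, Q3 + threshold*IQR]."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - threshold * iqr, q3 + threshold * iqr
    inside = [v for v in values if lower <= v <= upper]
    return len(inside) / len(values)


sample = [1, 2, 3, 4, 100]  # made-up values; 100 is the obvious outlier
share = non_outlier_share(sample)
print(share)         # 0.8
print(share > 0.85)  # False: this column would not count toward the final score
```

Here Q1 is 2 and Q3 is 4, so the accepted range is -1 to 7 and only the value 100 falls outside it.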

Tests

In [118]:
# dataset = "NuSEDS Escapement"
# datafile = "Johnstone Strait and Strait of Georgia NuSEDS_20241004.xlsx"
# datafilepath = f"C:/Users/{GLOBAL_USER}/OneDrive - DFO-MPO/04 - Strategic Salmon Data Policy and Analytics/07 - Data Products & Data/21 - Transitory Files/{dataset}/{datafile}"
find_outliers_iqr(
    dataset_path=datafilepath,
    selected_columns=["LATITUDE / LATITUDE", "LONGITUDE / LONGITUDE"],
    threshold=1.5,
    minimum_score=0.85,
)

({'LATITUDE / LATITUDE': np.float64(0.8693693693693694),
  'LONGITUDE / LONGITUDE': np.float64(0.9819819819819819)},
 1.0)

#### Accuracy Type 3 (A3 Duplicates)

In [119]:
# function 1: finding duplicates
def find_duplicates_and_percentage(dataset_path):

    df = read_data(dataset_path)

    # Find duplicate rows
    duplicate_rows = df[df.duplicated(keep=False)]

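    # Calculate the duplication score: the share of rows that are not duplicated
    # (keep=False above means every copy of a repeated row is counted as a duplicate)
    total_rows = len(df)
    total_duplicate_rows = len(duplicate_rows)
    percentage_duplicate = 1 - (total_duplicate_rows / total_rows)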

    # Print duplicate rows
    print("Duplicate Rows:")
    print(duplicate_rows)

    # log the results
    log_score(
        test_name="Accuracy (A3)",
        dataset_name=get_dataset_name(dataset_path),
        selected_columns=None,
        threshold=None,
        score=percentage_duplicate,
    )

    # Print the duplication score as a percentage (100% means no duplicate rows)
    print(f"\nDuplication Score: {percentage_duplicate*100}%")
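
As a worked example of the score, the sketch below counts duplicates in a made-up list of rows with the standard library. It mirrors the `keep=False` behaviour used above, where every copy of a repeated row is counted, and is an illustration rather than a replacement for the pandas code.

```python
from collections import Counter

rows = [  # made-up rows
    ("Depot A", "Victoria"),
    ("Depot A", "Victoria"),  # exact copy of the first row
    ("Depot B", "Nanaimo"),
    ("Depot C", "Comox"),
]

counts = Counter(rows)
# Count every copy of a repeated row, as df.duplicated(keep=False) does
duplicate_rows = sum(n for n in counts.values() if n > 1)
duplication_score = 1 - duplicate_rows / len(rows)
print(duplicate_rows, duplication_score)  # 2 0.5
```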

Test

In [120]:
# dataset = "NuSEDS Escapement"
# datafile = "Yukon and Transboundary NuSEDS_20241004.xlsx"
# datafilepath = f"C:/Users/{GLOBAL_USER}/OneDrive - DFO-MPO/04 - Strategic Salmon Data Policy and Analytics/07 - Data Products & Data/21 - Transitory Files/{dataset}/{datafile}"
find_duplicates_and_percentage(
    dataset_path=datafilepath
)

Duplicate Rows:
Empty DataFrame
Columns: [DEPOT NAME / NOM DU DÉPÔT, AREA / LA RÉGION, MUNICIPALITY / MUNICIPALITÉ, ADDRESS / ADRESSE, PHONE NUMBER / NUMÉRO DE TÉLÉPHONE, ACCESSIBILITY / ACCESSIBILITÉ, STORAGE INFORMATION / DÉTAILS DE STOCKAGE, LATITUDE / LATITUDE, LONGITUDE / LONGITUDE]
Index: []

Duplication Score: 100.0%


### Completeness (P)

The threshold is the maximum share of blank values a column may have: columns whose share of blanks exceeds the threshold are removed before the completeness score is calculated.
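
The two steps (drop mostly blank columns, then score what remains) can be sketched on a small made-up table without pandas. The column names and values below are illustrative only.

```python
columns = {  # made-up data; None marks a blank cell
    "name": ["a", "b", "c", "d"],
    "area": ["x", None, "y", "z"],
    "notes": [None, None, None, None],
}
threshold = 0.75

# Step 1: keep columns whose share of blanks is at most the threshold
kept = {
    name: values
    for name, values in columns.items()
    if sum(v is None for v in values) / len(values) <= threshold
}

# Step 2: score the share of non-blank cells in the remaining columns
filled = sum(v is not None for values in kept.values() for v in values)
total = sum(len(values) for values in kept.values())
completeness = filled / total
print(sorted(kept), completeness)  # ['area', 'name'] 0.875
```

The fully blank `notes` column is excluded, so the single blank in `area` is the only cell that lowers the score.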

In [121]:
def completeness_test(dataset_path, exclude_columns=[], threshold=0.75):
    dataset = read_data(dataset_path)

    # Exclude the 'Comment' column if it exists in the dataset
    if "Comment" in dataset.columns:
        dataset = dataset.drop(columns=["Comment"])

    # Exclude columns in exclude_columns if they exist in the dataset
    dataset = dataset.drop(
        columns=[col for col in exclude_columns if col in dataset.columns]
    )

    # Calculate the share of null (missing) values in each column
    is_null_percentage = dataset.isna().mean()

    # Identify columns whose share of nulls is less than or equal to the threshold
    columns_to_keep = is_null_percentage[is_null_percentage <= threshold].index

    # Keep only those columns
    dataset2 = dataset[columns_to_keep]

    # Calculate the actual percentage of non-missing values in the dataset
    total_non_missing = dataset2.notna().sum().sum()
    total_obs = dataset2.shape[0] * dataset2.shape[1]
    completeness_score = total_non_missing / total_obs

    # log the results
    log_score(
        test_name="Completeness (P)",
        dataset_name=get_dataset_name(dataset_path),
        selected_columns=None,
        threshold=threshold,
        score=completeness_score,
    )

    return completeness_score

Test

In [122]:
# "North and Central Coast NuSEDS_20241004.xlsx"
# "West Coast Vancouver Island NuSEDS_20241004.xlsx"
# "Yukon and Transboundary NuSEDS_20241004.xlsx"
# dataset = "NuSEDS Escapement"
# datafile = "Johnstone Strait and Strait of Georgia NuSEDS_20241004.xlsx"
# datafilepath = f"C:/Users/{GLOBAL_USER}/OneDrive - DFO-MPO/04 - Strategic Salmon Data Policy and Analytics/07 - Data Products & Data/21 - Transitory Files/{dataset}/{datafile}"
completeness_test(
    datafilepath,
    threshold=0.75,
)

np.float64(0.9764764764764765)

### Timeliness

In [123]:
from datetime import datetime


def calc_timeliness(refresh_date, cycle_day):
    refresh_date = pd.to_datetime(refresh_date)
    unupdate_cycle = np.max([((datetime.now() - refresh_date).days / cycle_day) - 1, 0])

    # unupdate_cycle = np.floor((datetime.now() - refresh_date).days/cycle_day)
    # print((datetime.now() - refresh_date).days/cycle_day)
    return np.max([0, 100 - (unupdate_cycle * (100 / 3))])

In [124]:
calc_timeliness("2022-12-01", cycle_day=365)

np.float64(66.48401826484017)
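
The formula gives full marks during the first update cycle and then subtracts one third of 100 for each further cycle that passes, reaching 0 once four cycles have elapsed since the refresh date. The sketch below restates it with the current time passed in as a parameter so the results are reproducible; `timeliness_at` is a hypothetical variant of `calc_timeliness` and the dates are chosen for illustration.

```python
from datetime import datetime


def timeliness_at(refresh_date, cycle_day, now):
    """Same formula as calc_timeliness, with the current time passed in."""
    overdue_cycles = max((now - refresh_date).days / cycle_day - 1, 0)
    return max(0, 100 - overdue_cycles * (100 / 3))


now = datetime(2024, 12, 2)  # fixed date chosen for the example
print(timeliness_at(datetime(2024, 6, 1), 365, now))   # 100.0: still within the first cycle
print(timeliness_at(datetime(2022, 12, 1), 365, now))  # about 66.48: roughly one cycle overdue
print(timeliness_at(datetime(2020, 1, 1), 365, now))   # 0: more than four cycles old
```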

# Output Reports
Run all the functions above before running this section.

Note that output reports can be generated by running the data quality tests for:

- Consistency type 1
- Accuracy type 2
- Accuracy type 3
- Completeness (this test does not require an output report; just find the blanks in the dataset)

Output reports for the remaining tests can be found below.

### Consistency Type 2

In [125]:
def compare_datasets(dataset_path, column_mapping, ref_dataset_path=None):
    # Read the data file
    df = read_data(dataset_path)

    # Initialize ref_df if a ref dataset is provided
    if ref_dataset_path:
        df_ref = read_data(ref_dataset_path)
        ref_data = True  # Flag to indicate we are using a ref dataset
    else:
        ref_data = False  # No ref dataset, compare within the same dataset

    for selected_column, m_selected_column in column_mapping.items():
        if ref_data:
            # Compare to ref dataset
            unique_observations = get_names_used_for_column(df_ref, m_selected_column)
        else:
            # Use own column for comparison
            unique_observations = get_names_used_for_column(df, selected_column)

        # Iterate over each row in the selected column
        column_results = []
        for value in df[selected_column]:
            # Check if the value exists in unique_observations and append the result to column_results
            if pd.isnull(value):
                column_results.append(
                    False
                )  # or True, depending on how you want to handle NaN values
            else:
                column_results.append(value in unique_observations)

        # Add the results as a new column in the DataFrame
        df[selected_column + "_comparison"] = column_results

    return df

In [126]:
column_mapping = {
    "STOCK_CU_NAME": "CU_Display",
    "STOCK_CU_INDEX": "FULL_CU_IN",
}  # the pattern for comparison is 'dataset column' : 'reference column'
compare_datasets(
    dataset_path="data/test/Salmonid_Enhancement_Program_Releases.xlsx",
    column_mapping=column_mapping,
    ref_dataset_path="data/Pacific Salmon Population Unit Crosswalk_Final_20240513.xlsx",
)

FileNotFoundError: [Errno 2] No such file or directory: 'data/test/Salmonid_Enhancement_Program_Releases.xlsx'

### Accuracy Type 1

In [None]:
# Function 1: Using isdigit to find non-numerical entries
def find_non_digits(s):
    # Ensure the value is treated as a string
    s = str(s)
    return [char for char in s if not (char.isdigit() or char == ".")]


# Function 2 : Check if each row has only numbers in each selected column and add results as new columns
def add_only_numbers_columns(dataset_path, selected_columns):
    adf = read_data(dataset_path)
    selected_columns = [col for col in adf.columns if col in selected_columns]

    for column_name in selected_columns:
        adf[column_name + "_Only_Numbers"] = adf[column_name].apply(
            lambda x: len(find_non_digits(x)) == 0
        )

    return adf

Test

In [None]:
add_only_numbers_columns(
    dataset_path="data/test/SEP Facilities.xlsx", selected_columns=["LicNo", "FRN"]
)

# Score Log

In [None]:
from datetime import datetime

current_date = datetime.now().strftime("%Y-%m-%d %H:%M:%S")

current_date

'2024-11-28 16:53:40'

In [None]:
def get_dataset_name(dataset_path):
    # Extract the file name from the path (e.g., 'Dataset_A.csv')
    file_name = os.path.basename(dataset_path)
    # Split the file name to remove the extension (e.g., 'Dataset_A')
    dataset_name = os.path.splitext(file_name)[0]
    return dataset_name

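More up-to-date code for the score log can be found in the "Setup" section; the code here is a testing space. Running the cell below redefines `log_score` without the `selected_columns` parameter, so re-run the Setup cell before running any data quality test.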

In [None]:
# Function to log a new row into the DQS_Log.xlsx file
def log_score(test_name, dataset_name, score, threshold=None):
    # Convert score to a percentage
    percentage_score = score * 100

    # Load the Excel file into a DataFrame
    log_file = "DQS_Log_Test.xlsx"

    # Set threshold to "No threshold" if it is not provided
    if threshold is None:
        threshold_value = "no threshold"
    else:
        threshold_value = threshold
    # Try loading the existing Excel file
    try:
        df = read_data(log_file)
    except FileNotFoundError:
        # Create an empty DataFrame if file doesn't exist (shouldn't be the case if you already created it)
        df = pd.DataFrame(
            columns=["Dataset", "Test", "Threshold", "Date_Calculated", "Score"]
        )

    # Prepare the new row as a DataFrame
    new_row = pd.DataFrame(
        {
            "Dataset": [dataset_name],
            "Test": [test_name],
            "Date_Calculated": [datetime.now().strftime("%Y-%m-%d %H:%M:%S")],
            "Threshold": [threshold_value],
            "Score": [percentage_score],
        }
    )

    # Append the new row to the DataFrame
    df = pd.concat([df, new_row], ignore_index=True)

    # Save the updated DataFrame back to the Excel file
    df.to_excel(log_file, index=False)

### Saving the file to sharepoint

In [None]:
import shutil
import os


# Function to copy the log file to another folder
def copy_log_file(destination_folder):
    # Name of the log file to copy from the current working directory
    # (the Setup version of log_score writes to "DQS_Log_Beta.xlsx"; adjust the name if needed)
    log_file = "DQS_Log_Test.xlsx"

    # Get the current working directory (if needed)
    current_directory = os.getcwd()

    # Define the source path (current working directory + file)
    source_path = os.path.join(current_directory, log_file)

    # Define the destination path (destination folder + file)
    destination_path = os.path.join(destination_folder, log_file)

    # Copy the file to the destination folder
    shutil.copy(source_path, destination_path)

    print(f"File copied to {destination_path}")

Run this function to save the Excel document from the working directory to SharePoint:

In [None]:
# Specify the destination folder where you want to copy the file
destination_folder = "C:/Users/EwertM/Documents/Portal/DataQuality"

copy_log_file(destination_folder)