Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merge APS investigation outcomes into DETECT data #39

Open
2 of 11 tasks
mbcann01 opened this issue Mar 25, 2024 · 6 comments
Open
2 of 11 tasks

Merge APS investigation outcomes into DETECT data #39

mbcann01 opened this issue Mar 25, 2024 · 6 comments
Assignees
Labels
data wrangling A data wrangling task

Comments

@mbcann01
Copy link
Member

mbcann01 commented Mar 25, 2024

Overview

On 2024-03-21, Catherine sent us a new batch of APS data. We need to merge the APS outcomes with our DETECT screenings for a publication.

We want to link APS investigation outcomes to DETECT screenings completed by MedStar during the R01 phase of DETECT. The DETECT data we want to use for linking is participant_import.rds.

Links

Tasks

  • Review 'medstar_id' as a potential unique subject ID in the MedStar data set
  • Review APS 'client id' as a potential unique subject ID in the APS data set
  • Clean/Prep APS identifier data for fuzzy-matching algorithm
  • Add unique person ID to the APS data set (if it doesn't already exist)
  • Clean/Prep MedStar identifier data for fuzzy-matching algorithm
  • Add unique person ID to the MedStar data set (if it doesn't already exist)
  • Create a dictionary (data frame) that allows linkage of the MedStar subject IDs to APS subject IDs
  • Create a dictionary (data frame) that allows linkage of the MedStar screenings to APS cases/investigations and their outcomes
  • Create a dictionary (data frame) that allows linkage of the MedStar screenings to a matched APS intake
  • Upload any new data (dictionaries) to SharePoint
  • Document any new data (dicsionaries) on the Wiki
@mbcann01 mbcann01 added the data wrangling A data wrangling task label Mar 25, 2024
@corvidfox
Copy link
Collaborator

corvidfox commented May 10, 2024

Initial review of the APS subject identifier data appeared to indicate that the variable (client_id) would be readily valid, with a few "failed matches" noted due to multiple (client_id) values associated with certain (case_id) values (each case should only belong to a single subject). Data largely appeared to be clean and ready to use.

However, further examination has found significant typographical and other errors in the identifier data fields and a much larger degree of "failed matches" for subject-id values. As such, the data requires significant cleaning in preparation for fuzzy-matching algorithm application and a within-set APS subject ID would need to be created.

Cleaning of APS data is underway.

Name fields:

  • "Unknown" or equivalent values ("none", "not applicable", "don't know", etc)
  • Numbers in name fields
  • Nicknames or comments in name fields
  • Titles in name fields
  • Suffixes in name fields
  • Ensuring consistency in the use of white space (single white spaces, no leading/trailing white space)
  • Verifying uses of hyphens, single quotes, double quotes
  • Unexpected/non-standard characters

Potentially valuable information (such as "female" if a name was given as "unknown female" or a suffix trimmed from a name value) is being shifted to a comment field, so it is available in manual review of fuzzy-match pairs.

Additionally, some exploration of address values has been completed.

  • ZIP Code-State Validation (using publicly available USPS data indicating which States are covered by each ZIP Code)

@corvidfox
Copy link
Collaborator

corvidfox commented May 17, 2024

As of today, further progress has been made in cleaning/standardizing the APS Client data

Name fields are clean! 🎉

  • Titles in name fields
  • Unexpected/non-standard characters

Address fields are pending only street address cleaning/validation:

  • ZIP code formatting (5 digits in strings, no unexpected or invalid characters)
  • Validation of ZIP code vs USPS mapping of ZIP to STATE in coverage areas
  • Verify that all COUNTIES listed are valid counties in Texas (does not include verifying if the given address and/or ZIP code is part of said county)
  • Clean/standardize STATE values (valid 2 letter abbreviations)
  • Clean/standardize CITY values to avoid invalid characters (non letters, non-white space)
  • Clean/standardize CITY values to remove "unknown" or equivalents
  • Clean/standardize CITY value prefixes ("Mount", "Fort", "Point", "Port", and cardinal directions)
  • Clean/standardize CITY values known to have several variations within the study coverage area ("Fort Worth", "North Richland Hills")
  • Clean/standardize STREET ADDRESS values to remove PO BOX addresses into a separate field
  • Clean/standardize STREET ADDRESS values to catch and remove apartment, building, or other suffixes
  • Clean/standardize STREET ADDRESS values to consolidate common terms that may appear in multiple forms ("County Road", "Highway", etc.) to reduce variance in string distances
  • Clean/standardize STREET ADDRESS values to remove "unknown" or equivalent values
  • Clean/standardize STREET ADDRESS common suffixes ("street", "avenue", etc.)
  • Clean/standardize STREET ADDRESS unanticipated characters (non letters, non-white space)

@corvidfox
Copy link
Collaborator

As of today, further progress has been made in cleaning/standardizing the APS Client data.

Address fields are pending only street address cleaning/validation completion. Street Addresses are a bear.

  • ZIP code formatting (5 digits in strings, no unexpected or invalid characters)
  • Validation of ZIP code vs USPS mapping of ZIP to STATE in coverage areas
  • Verify that all COUNTIES listed are valid counties in Texas (does not include verifying if the given address and/or ZIP code is part of said county)
  • Clean/standardize STATE values (valid 2 letter abbreviations)
  • Clean/standardize CITY values to avoid invalid characters (non letters, non-white space)
  • Clean/standardize CITY values to remove "unknown" or equivalents
  • Clean/standardize CITY value prefixes ("Mount", "Fort", "Point", "Port", and cardinal directions)
  • Clean/standardize CITY values known to have several variations within the study coverage area ("Fort Worth", "North Richland Hills")
  • Clean/standardize STREET ADDRESS values to remove PO BOX addresses into a separate field
  • Clean/standardize STREET ADDRESS values to remove "unknown" or equivalent values
  • Clean/standardize STREET ADDRESS values with unexpected characters:
    • Commas
    • Parentheses
    • Slashes
    • Hyphens
    • Anything that is not an alphanumeric or white space
  • Clean/standardize STREET ADDRESS values to catch and remove apartment, building, or other suffixes
  • Clean/standardize STREET ADDRESS values to consolidate common terms that may appear in multiple forms ("County Road", "Highway", etc.) to reduce variance in string distances
  • Clean/standardize STREET ADDRESS common suffixes ("street", "avenue", etc.)
  • Check for other unexpected values
    • No letters
    • No numbers
    • Start without numbers
    • "hotel", "motel", and other common business name components (to catch things like "motel 6")

@corvidfox
Copy link
Collaborator

As of today, further progress has been made in cleaning/standardizing the APS Client data

Address fields are pending only street address and street unit cleaning/validation completion. Street Addresses are a bear.

  • ZIP code formatting (5 digits in strings, no unexpected or invalid characters)
  • Validation of ZIP code vs USPS mapping of ZIP to STATE in coverage areas
  • Verify that all COUNTIES listed are valid counties in Texas (does not include verifying if the given address and/or ZIP code is part of said county)
  • Clean/standardize STATE values (valid 2 letter abbreviations)
  • Clean/standardize CITY values to avoid invalid characters (non letters, non-white space)
  • Clean/standardize CITY values to remove "unknown" or equivalents
  • Clean/standardize CITY value prefixes ("Mount", "Fort", "Point", "Port", and cardinal directions)
  • Clean/standardize CITY values known to have several variations within the study coverage area ("Fort Worth", "North Richland Hills")
  • Clean/standardize STREET ADDRESS values to remove PO BOX addresses into a separate field
  • Clean/standardize STREET ADDRESS values to remove "unknown" or equivalent values
  • Clean/standardize STREET ADDRESS values with unexpected characters:
    • Commas
    • Parentheses
    • Slashes
    • Hyphens
    • Pound symbol (#)
    • Ampersand (&) or word ( and )
    • Anything else
  • Clean/standardize STREET ADDRESS values to catch and remove apartment, building, or other suffixes
    • QC checks of these pulled values
  • Clean/standardize STREET ADDRESS values to consolidate common terms that may appear in multiple forms ("County Road", "Highway", etc.) to reduce variance in string distances
  • Clean/standardize STREET ADDRESS common suffixes ("street", "avenue", etc.)
  • Check for other unexpected values (QC)
    • No letters
    • No numbers
    • Start without numbers
    • "hotel", "motel", and other common business name components (to catch things like "motel 6")

@corvidfox
Copy link
Collaborator

As of today, further progress has been made in cleaning/standardizing the APS Client data

Address fields are pending only street address and street unit cleaning/validation completion. Street Addresses are a bear.

  • Clean/standardize STREET ADDRESS values with unexpected characters:
    • Commas
    • Parentheses
    • Slashes
    • Hyphens
    • Pound symbol (#)
    • Ampersand (&) or word ( and )
    • Anything else
  • Clean/standardize STREET ADDRESS values to catch and remove apartment, building, or other suffixes. Office, basement, lower, department, front, hanger, floor, lobby, penthouse, pier, side, and slip did not appear to have any potential matches during exploration and testing of regular expressions.
    • Apartment
    • Building
    • Room
    • Suite
    • Lot
    • Space
    • Trailer
    • Rear
    • Stop
    • Unit
    • QC checks of these pulled values
  • Clean/standardize STREET ADDRESS values to consolidate common terms that may appear in multiple forms ("County Road", "Highway", etc.) to reduce variance in string distances
  • Clean/standardize STREET ADDRESS common suffixes ("street", "avenue", etc.)
  • Check for other unexpected values (QC)
    • No letters
    • No numbers
    • Start without numbers
    • "hotel", "motel", and other common business name components (to catch things like "motel 6")

@corvidfox
Copy link
Collaborator

As of today, further progress has been made in cleaning/standardizing the APS Client data. Separation of secondary address values should be complete (within reason) at this time, though QC checks are designed to help catch other potential remaining values.

Address fields are pending only street address and street unit cleaning/validation completion. Street Addresses are a bear.

  • Clean/standardize STREET ADDRESS values with unexpected characters:
    • Commas
    • Parentheses
    • Slashes
    • Hyphens
    • Pound symbol (#)
    • Ampersand (&) or word ( and )
    • Anything else
  • Clean/standardize STREET ADDRESS values to catch and remove apartment, building, or other suffixes. Office, basement, lower, department, front, hanger, floor, lobby, penthouse, pier, side, and slip did not appear to have any potential matches during exploration and testing of regular expressions.
    • Apartment
    • Building
    • Room
    • Suite
    • Lot
    • Space
    • Trailer
    • Rear
    • Stop
    • Unit
    • Box
    • House
    • Number
    • Duplex
  • QC checks of separated secondary address field values
  • Clean/standardize STREET ADDRESS values to consolidate common terms that may appear in multiple forms ("County Road", "Highway", etc.) to reduce variance in string distances
  • Clean/standardize STREET ADDRESS common suffixes ("street", "avenue", etc.)
  • Check for other unexpected values (QC)
    • No letters
    • No numbers
    • Start without numbers
    • "hotel", "motel", and other common business name components (to catch things like "motel 6")

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
data wrangling A data wrangling task
Projects
Status: In Progress
Development

No branches or pull requests

2 participants