# Population-level trends for Client Outcome Measures

## Comparing average scores for a particular set of ATOM Survey questions: first stage vs. any other stage

## Steps:

1. Extract the data from the database or from a pre-prepared parquet file.
2. Process the data (only if not pulling from pre-cleaned data): clean and transform (incl. categorize).
    - Clean: remove rows where PDCSubstanceOrGambling is missing.
    - Transform: expand PDC, determine Program, categorize fields, drop Notes/Comments fields, rename PartitionKey to SLK.
    - Limit the data to the period of interest, i.e. keep only clients who completed at least one survey during that period.
    - Keep only clients who have completed the survey at least three times (min-stage: 3).
3. Calculate the average score for each client for each stage and for each of the questions of interest.
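The cleaning and limiting steps above can be sketched in pandas as follows. This is a minimal illustration on made-up data, not the code inside `utils.df_xtrct_prep`: the function name `clean_and_limit`, the toy rows, and the assumption that the raw extract already carries an `AssessmentDate` column are all hypothetical. The PDC expansion, Program and categorization steps are omitted.

```python
import pandas as pd


def clean_and_limit(raw: pd.DataFrame, start: str, end: str, min_surveys: int) -> pd.DataFrame:
    """Hypothetical sketch of the clean/limit steps; not the project's actual helper."""
    # Clean: drop rows where PDCSubstanceOrGambling is missing.
    df = raw.dropna(subset=["PDCSubstanceOrGambling"])
    # Transform: rename the client key and parse the assessment dates.
    df = df.rename(columns={"PartitionKey": "SLK"})
    df = df.assign(AssessmentDate=pd.to_datetime(df["AssessmentDate"]))
    # Keep clients with at least one survey inside the period of interest.
    in_period = df["AssessmentDate"].between(pd.Timestamp(start), pd.Timestamp(end))
    active_slks = df.loc[in_period, "SLK"].unique()
    df = df[df["SLK"].isin(active_slks)]
    # Keep clients with at least `min_surveys` surveys overall.
    n_surveys = df.groupby("SLK")["AssessmentDate"].transform("count")
    return df[n_surveys >= min_surveys].reset_index(drop=True)


# Toy data: A is active with 3 surveys, B is active but has too few valid rows,
# C has 3 surveys but none inside the period of interest.
raw = pd.DataFrame({
    "PartitionKey": ["A", "A", "A", "B", "B", "C", "C", "C"],
    "AssessmentDate": ["2022-06-01", "2022-08-01", "2022-12-01",
                       "2022-09-01", "2022-10-01",
                       "2021-01-01", "2021-03-01", "2021-06-01"],
    "PDCSubstanceOrGambling": ["Alcohol", "Alcohol", "Alcohol",
                               "Cannabis", None,
                               "Alcohol", "Alcohol", "Alcohol"],
})
result = clean_and_limit(raw, "2022-07-01", "2023-06-30", min_surveys=3)
print(result)
```

Note that in this sketch an active client keeps all of their surveys, including those dated before the period of interest, so that earlier stages remain available for comparison. Whether the real pipeline does the same is an assumption.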


In [1]:
# Step 0: Importing the libraries
from utils.df_xtrct_prep import extract_prep_data

In [2]:
# Global variables
extract_start_date = 20200101
extract_end_date = 20240101

fname = f"{extract_start_date}_{extract_end_date}"

active_clients_start_date = '2022-07-01'
active_clients_end_date = '2023-06-30'

# MIN_NUM_ATOMS_PER_CLIENT = 3
# MIN_NUM_COL_VALUES = 3

### Step 1 & 2: Extract & Process

In [3]:
# Steps 1 & 2: Extract and process the dataset
processed_df = extract_prep_data(
    extract_start_date,
    extract_end_date,
    active_clients_start_date,
    active_clients_end_date,
    fname,
)

In [4]:
len(processed_df)

2434

### Step 3: Calculate the average score for each client for each stage and for each of the questions of interest

In [5]:
from utils.group_utils import getrecs_w_min_numvals_forcol, chrono_rank_within_clientgroup
from statsutil.funcs import (
    get_mean_xcontribs_of_nth_assessment_for_question,
    get_df_forclients_with_atleast_n_surveys,
)
# Not currently used: get_nth_survey_values_for_question, getSLKs_with_change_in_question


In [6]:
# Chronologically Rank the Assessments for each client
col_df = chrono_rank_within_clientgroup(processed_df)  # adds 'survey_rank' column
# g = col_df.groupby('SLK')
# col_df.loc[:,'survey_rank'] = g['AssessmentDate'].rank(method='min')
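The commented-out lines above suggest how the ranking works: within each client (SLK), assessments are ranked by date. A minimal sketch of that idea on toy data follows. `rank_surveys` is a hypothetical stand-in, and `chrono_rank_within_clientgroup` may differ in its details.

```python
import pandas as pd


def rank_surveys(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative stand-in: rank each client's assessments chronologically."""
    ranked = df.copy()
    # method="min" gives tied dates the same (lowest) rank.
    ranked["survey_rank"] = ranked.groupby("SLK")["AssessmentDate"].rank(method="min")
    return ranked


toy = pd.DataFrame({
    "SLK": ["A", "A", "A", "B", "B"],
    "AssessmentDate": pd.to_datetime(
        ["2022-11-11", "2022-10-04", "2022-11-22", "2021-12-21", "2021-09-17"]
    ),
})
ranked = rank_surveys(toy)
print(ranked.sort_values(["SLK", "survey_rank"]))
```

With `method="min"`, two assessments on the same date for the same client share a rank and the next rank is skipped, so ranks are only guaranteed to be consecutive when each client's assessment dates are unique.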

In [13]:
# MIN_ASSESSMENTS = 6

In [12]:
# df_six = get_df_forclients_with_atleast_n_surveys(col_df, MIN_ASSESSMENTS)
# len(df_six)

1051

In [16]:

# col_df = getrecs_w_min_numvals_forcol(df_six, col, min_num_vals=MIN_ASSESSMENTS)
# len(col_df), len(df_six), min(df_six.AssessmentDate), max(df_six.AssessmentDate)

(931, 1051, Timestamp('2020-01-07 00:00:00'), Timestamp('2023-06-29 00:00:00'))
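The two filters used in this notebook (a minimum number of surveys per client, then a minimum number of values for the chosen column) might look roughly like the sketch below. Both function names and the toy data are hypothetical, and the exact behaviour of `get_df_forclients_with_atleast_n_surveys` and `getrecs_w_min_numvals_forcol` is assumed rather than confirmed. In particular, this sketch keeps every row of a qualifying client, including rows where the column is missing.

```python
import pandas as pd


def clients_with_at_least_n_surveys(df: pd.DataFrame, n: int) -> pd.DataFrame:
    """Keep rows of clients that have at least n survey rows (assumed behaviour)."""
    surveys_per_client = df["SLK"].map(df["SLK"].value_counts())
    return df[surveys_per_client >= n]


def records_with_min_values_for_col(df: pd.DataFrame, col: str, min_num_vals: int) -> pd.DataFrame:
    """Keep rows of clients with at least min_num_vals non-null values in col (assumed behaviour)."""
    non_null_per_client = df.groupby("SLK")[col].transform("count")
    return df[non_null_per_client >= min_num_vals]


toy = pd.DataFrame({
    "SLK": ["A", "A", "A", "B", "B", "C", "C", "C"],
    "SDS_Score": [8.0, 7.0, None, 5.0, 4.0, 9.0, 8.0, 6.0],
})
df_three = clients_with_at_least_n_surveys(toy, 3)
col_three = records_with_min_values_for_col(df_three, "SDS_Score", min_num_vals=3)
print(len(df_three), len(col_three))
```

Here client A has three surveys but only two SDS_Score values, so A passes the first filter and is dropped by the second, which is one way the per-column record count can end up smaller than the total.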

In [21]:
# not necessary ? the survey_ranks will be consecutive
# col_df = col_df.sort_values(by=['SLK', 'survey_rank'])

# col_df[['SLK', 'survey_rank', 'AssessmentDate']] #.head(30)

| | SLK | survey_rank | AssessmentDate |
|---|---|---|---|
| 54 | ACOAC250520011 | 2.0 | 2022-10-04 |
| 57 | ACOAC250520011 | 3.0 | 2022-11-11 |
| 51 | ACOAC250520011 | 4.0 | 2022-11-22 |
| 55 | ACOAC250520011 | 5.0 | 2022-12-22 |
| 52 | ACOAC250520011 | 6.0 | 2023-03-22 |
| ... | ... | ... | ... |
| 7324 | YL2IA310319901 | 3.0 | 2021-09-17 |
| 7325 | YL2IA310319901 | 4.0 | 2021-12-21 |
| 7326 | YL2IA310319901 | 5.0 | 2022-03-04 |
| 7321 | YL2IA310319901 | 6.0 | 2022-10-20 |


In [7]:
def get_nmeans_ncontribs(chosen_surveys, df, field_name: str):
    """Return the mean of field_name and the number of contributing
    assessments for each survey rank in chosen_surveys."""
    averages = []
    nth_assessment_contribs = []

    for s_no in chosen_surveys:
        average, n_contribs = get_mean_xcontribs_of_nth_assessment_for_question(df, s_no, field_name)
        averages.append(average)
        nth_assessment_contribs.append(n_contribs)
        print(f"Survey {s_no} has {n_contribs} contributing assessments for {field_name}, with mean {average}")
    return averages, nth_assessment_contribs

In [8]:

MIN_ASSESSMENTS = 6
col = "SDS_Score"  # alternative: 'Past4WkHowOftenPhysicalHealthCausedProblems'
chosen_surveys = [1, 3, 6]  # each survey rank must be at most MIN_ASSESSMENTS

df_six = get_df_forclients_with_atleast_n_surveys(col_df, MIN_ASSESSMENTS)

col_df = getrecs_w_min_numvals_forcol(df_six, col, min_num_vals=MIN_ASSESSMENTS)

print(f"NRecords For Col({col}): {len(col_df)}, Total:{len(df_six)}, {min(df_six.AssessmentDate)}, {max(df_six.AssessmentDate)}")

averages, nth_assessment_contribs = get_nmeans_ncontribs(chosen_surveys, col_df, col)
averages, nth_assessment_contribs, chosen_surveys

# TODO: Clients with no change: treat as outliers and remove?
# TODO: Clients with only zeros?

NRecords For Col(SDS_Score): 337, Total:1051, 2020-01-07 00:00:00, 2023-06-29 00:00:00
Survey 1 has 40 contributing assessments for SDS_Score, with mean 8.52
Survey 3 has 44 contributing assessments for SDS_Score, with mean 7.48
Survey 6 has 35 contributing assessments for SDS_Score, with mean 7.0


([8.52, 7.48, 7.0], [40, 44, 35], [1, 3, 6])
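To make the per-stage figures above concrete, here is a minimal sketch of how a mean and a contributor count for the n-th assessment could be computed. `mean_and_contribs` is a hypothetical stand-in for `get_mean_xcontribs_of_nth_assessment_for_question`. The rounding to two decimals and the exclusion of missing values are assumptions, and the toy scores are made up.

```python
import pandas as pd


def mean_and_contribs(df: pd.DataFrame, survey_rank: int, col: str):
    """Mean of col over assessments with the given rank, plus how many contributed."""
    values = df.loc[df["survey_rank"] == survey_rank, col].dropna()
    return round(values.mean(), 2), len(values)


toy = pd.DataFrame({
    "SLK": ["A", "A", "A", "B", "B", "B", "C", "C"],
    "survey_rank": [1, 2, 3, 1, 2, 3, 1, 2],
    "SDS_Score": [10.0, 8.0, 6.0, 7.0, None, 5.0, 9.0, 6.0],
})

summary = {rank: mean_and_contribs(toy, rank, "SDS_Score") for rank in [1, 2, 3]}
for rank, (mean, n) in summary.items():
    print(f"Survey {rank}: mean {mean} from {n} assessments")
```

As in the notebook output, the number of contributing assessments can differ between stages (40, 44 and 35 above), so the stage means are not necessarily computed over the same set of clients. That is worth keeping in mind when reading a drop in the mean as an improvement.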

In [9]:
# client_groups_forcol = col_df.groupby('SLK')
from graphing import get_chart_for_means

question_list = [col]
assessment_tags = chosen_surveys
means = averages

# contribs = [first_assess_contribs,fourth_assess_contribs, seventh_assess_contribs ]
chart = get_chart_for_means(question_list, assessment_tags, means, nth_assessment_contribs)
chart