# Population-level trends for Client Outcome Measures

## Changes in average (over clients) scores for ATOM Assessment Survey questions.

## Steps:

1. Extract the data - from the database or from a pre-prepared parquet file
2. Process the data (skipped when loading pre-cleaned data): clean & transform (incl. categorize) it
    - Clean data - remove rows where PDCSubstanceOrGambling is missing
    - Transform data - expand PDC, determine Program, categorize fields, drop Notes/Comments fields, rename PartitionKey to SLK.
    - Limit the data to the period of interest - i.e. only clients who have completed at least one survey during the period of interest.
    - Limit the data to clients who have completed the survey at least three times (min-stage: 3)
3. Calculate the average score for each client for each stage and for each of the questions of interest.
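
The two "limit" steps above can be sketched in pandas as follows. This is a minimal sketch, not the `extract_prep_data` implementation: the helper name `limit_clients` and the toy data are hypothetical, and it assumes `AssessmentDate` is a datetime column and `SLK` identifies the client. It also assumes that all surveys of an active client are kept, including those outside the period.

```python
import pandas as pd


def limit_clients(df: pd.DataFrame, start: str, end: str,
                  min_surveys: int = 3) -> pd.DataFrame:
    """Keep clients with a survey in [start, end] and at least `min_surveys` surveys.

    Hypothetical helper; the column names SLK and AssessmentDate are assumed.
    """
    # Clients with at least one survey inside the period of interest.
    in_period = df["AssessmentDate"].between(pd.Timestamp(start), pd.Timestamp(end))
    active_slks = df.loc[in_period, "SLK"].unique()
    active = df[df["SLK"].isin(active_slks)]
    # Count surveys per client and keep only clients that reach the minimum.
    n_surveys = active.groupby("SLK")["AssessmentDate"].transform("count")
    return active[n_surveys >= min_surveys]


toy = pd.DataFrame({
    "SLK": ["A", "A", "A", "B", "B", "C", "C", "C"],
    "AssessmentDate": pd.to_datetime([
        "2022-01-10", "2022-08-01", "2023-01-15",  # A: 3 surveys, active in period
        "2022-09-01", "2023-02-01",                # B: active, but only 2 surveys
        "2020-01-01", "2020-06-01", "2021-01-01",  # C: 3 surveys, none in period
    ]),
})
limited = limit_clients(toy, "2022-07-01", "2023-06-30")
print(sorted(limited["SLK"].unique()))
```

Only client `A` passes both conditions in the toy data.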


In [8]:

# Step 0: Importing the libraries
from utils.df_xtrct_prep import extract_prep_data

from statsutil.funcs import get_all_results
from utils.io import write_results_to_files
from graphing import get_chart_for_qna_list

In [10]:
# Global variables
extract_start_date = 20200101
extract_end_date = 20240101

fname = f"{extract_start_date}_{extract_end_date}_1"

active_clients_start_date = '2022-07-01'
active_clients_end_date = '2023-06-30'

results_folder = "./data/out/"


# MIN_NUM_ATOMS_PER_CLIENT = 3
# MIN_NUM_COL_VALUES = 3

### Steps 1 & 2: Extract & Process

#### Extract the data - from the database or from a pre-prepared parquet file

1. *Processed data*:
  - if the processed parquet file is present, load the data from it.
  - if it is not present, *get the raw data*, process it, and cache the result in the processed parquet file.
  
2. *Raw data*:
  - if it exists as a parquet file in the data/in/ folder, load it from that file.
  - otherwise, load it from the database (Azure).
 
 (cache=True => try to load from a parquet file; if it is not present, load from the database and cache the result in a parquet file)
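
The load-or-fetch logic described above can be sketched generically. This is an illustration under assumptions, not the code inside `extract_prep_data`: `load_with_cache` and its `fetch`/`read`/`write` parameters are hypothetical names, and the demo uses a JSON file in a temporary folder in place of parquet and Azure.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, TypeVar

T = TypeVar("T")


def load_with_cache(path: Path,
                    fetch: Callable[[], T],
                    read: Callable[[Path], T],
                    write: Callable[[T, Path], object],
                    cache: bool = True) -> T:
    """Return the cached data if `path` exists; otherwise fetch it and cache it."""
    if cache and path.exists():
        return read(path)
    data = fetch()
    if cache:
        path.parent.mkdir(parents=True, exist_ok=True)
        write(data, path)
    return data


calls = []  # records how often the (fake) database is hit


def fake_fetch():
    calls.append("db")
    return {"rows": 3}


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "in" / "raw.json"
    read_json = lambda p: json.loads(p.read_text())
    write_json = lambda d, p: p.write_text(json.dumps(d))
    first = load_with_cache(cache_file, fake_fetch, read_json, write_json)   # fetches and caches
    second = load_with_cache(cache_file, fake_fetch, read_json, write_json)  # reads the cache
```

With pandas, `read` could be `pd.read_parquet` and `write` could be `lambda df, p: df.to_parquet(p)`. The two-level behaviour (processed file, then raw file, then database) follows from passing a raw-data `load_with_cache` call plus the processing step as the `fetch` of the processed-data call.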

In [11]:
# Extract & Process
processed_df = extract_prep_data(extract_start_date, extract_end_date
                                 , active_clients_start_date
                                 , active_clients_end_date
                                 , fname)

In [4]:
len(processed_df)

2478

### Step 3: Calculate the average score for each client for each stage and for each of the questions of interest.

In [28]:
# Chronologically Rank the Assessments for each client
# df_q = chrono_rank_within_clientgroup(processed_df)  # adds 'survey_rank' column
# g = col_df.groupby('SLK')
# col_df.loc[:,'survey_rank'] = g['AssessmentDate'].rank(method='min')
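
The commented-out lines above rank each client's assessments chronologically. A minimal sketch of that ranking on toy data, followed by a per-stage mean, could look like this. The toy values are made up for illustration, and the sketch does not reproduce `get_all_results`, whose internals are not shown here.

```python
import pandas as pd

toy = pd.DataFrame({
    "SLK": ["A", "A", "A", "B", "B", "B"],
    "AssessmentDate": pd.to_datetime([
        "2022-03-01", "2022-01-01", "2022-02-01",
        "2022-05-01", "2022-06-01", "2022-07-01",
    ]),
    "Past4WkPhysicalHealth": [6, 4, 5, 2, 3, 7],
})

# Rank each client's assessments by date: 1 = earliest assessment.
toy["survey_rank"] = (
    toy.groupby("SLK")["AssessmentDate"].rank(method="min").astype(int)
)

# Average the score over clients at each chosen stage.
chosen_surveys = [1, 3]
stage_means = (
    toy[toy["survey_rank"].isin(chosen_surveys)]
    .groupby("survey_rank")["Past4WkPhysicalHealth"]
    .mean()
)
print(stage_means)
```

`method="min"` gives tied dates the same (lowest) rank, so two assessments on the same day for one client would both count as the earlier stage.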

In [12]:
chosen_surveys = [1, 3, 6]

In [5]:
from filters import get_filters, apply_filters #, get_outfilename_for_filters

In [13]:
orig_filter1 = {
    'FunderName': 'Coordinaire'
}
orig_filter2 = {
   'Program':['EUROPATH']
}


orig_filter = orig_filter1



filters = get_filters(orig_filter)
# print ("before :" , processed_df.Program.value_counts())

new_df = apply_filters(processed_df, filters)
# print ("After :" , new_df.Program.value_counts())

all_results = get_all_results(new_df, chosen_surveys, filters)

# outfile_name = get_outfilename_for_filters(filters)

write_results_to_files(all_results, f"{results_folder}{fname}.csv")#, orig_filter)

# for results in all_results:
#   title_for_file = results['title'].replace(" ", "_")
#   results_filepath = f"{results_folder}{fname}_{title_for_file}.csv"
#   df = results['data']
#   write_df_to_csv(df, results_filepath)

Wellbeing measures
NRecords For Col(Past4WkPhysicalHealth): 829)#, Total:2478, 2020-01-23 00:00:00, 2023-07-06 00:00:00
NRecords For Col(Past4WkMentalHealth): 829)#, Total:2478, 2020-01-23 00:00:00, 2023-07-06 00:00:00
NRecords For Col(Past4WkQualityOfLifeScore): 823)#, Total:2478, 2020-01-23 00:00:00, 2023-07-06 00:00:00
Substance Use
NRecords For Col(PDCHowMuchPerOccasion): 739)#, Total:2478, 2020-01-31 00:00:00, 2023-07-06 00:00:00
NRecords For Col(PDCDaysInLast28): 1041)#, Total:2478, 2020-01-07 00:00:00, 2023-07-06 00:00:00
Problems in Life Domains
NRecords For Col(Past4WkDailyLivingImpacted): 933)#, Total:2478, 2020-01-07 00:00:00, 2023-07-06 00:00:00
NRecords For Col(Past4WkHowOftenPhysicalHealthCausedProblems): 934)#, Total:2478, 2020-01-07 00:00:00, 2023-07-06 00:00:00
NRecords For Col(Past4WkHowOftenMentalHealthCausedProblems): 934)#, Total:2478, 2020-01-07 00:00:00, 2023-07-06 00:00:00
NRecords For Col(Past4WkUseLedToProblemsWithFamilyFriend): 934)#, Total:2478, 2020-01-07 0
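
The filter dictionaries above mix a scalar value (`'FunderName': 'Coordinaire'`) with a list value (`'Program': ['EUROPATH']`). The `filters` module is not shown here, so the following is only a guess at how such a dictionary could be applied: `apply_dict_filter` is a hypothetical stand-in, not the real `apply_filters`, and the toy rows (including the `OTHER` program) are invented.

```python
import pandas as pd


def apply_dict_filter(df: pd.DataFrame, criteria: dict) -> pd.DataFrame:
    """Hypothetical stand-in for `apply_filters`: AND together one test per column."""
    mask = pd.Series(True, index=df.index)
    for column, wanted in criteria.items():
        if isinstance(wanted, (list, tuple, set)):
            # A collection means "any of these values".
            mask &= df[column].isin(list(wanted))
        else:
            # A scalar means "exactly this value".
            mask &= df[column] == wanted
    return df[mask]


toy = pd.DataFrame({
    "FunderName": ["Coordinaire", "ACT Health", "Coordinaire"],
    "Program": ["EUROPATH", "EUROPATH", "OTHER"],
})
by_funder = apply_dict_filter(toy, {"FunderName": "Coordinaire"})
by_both = apply_dict_filter(
    toy, {"FunderName": "Coordinaire", "Program": ["EUROPATH"]}
)
```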

In [6]:
def do_all(df, chosen_surveys, orig_filter, filename: str = ""):
  """Filter df, compute the results for the chosen surveys, and write them to a CSV in results_folder."""
  filters = get_filters(orig_filter)
  new_df = apply_filters(df, filters)
  all_results = get_all_results(new_df, chosen_surveys, filters)
  # write_results_to_files(all_results, f"{results_folder}{fname_prepend}_{fname}.csv")#, orig_filter)
  write_results_to_files(all_results, f"{results_folder}{filename}.csv")#, orig_filter)

  return all_results


In [15]:

# orig_filter1 = {'FunderName': 'Coordinaire'}
# all_results = do_all(processed_df, chosen_surveys, orig_filter1, filename=orig_filter1['FunderName'])

# orig_filter1 = {'FunderName': 'NSW Ministry of Health'}
# all_results = do_all(processed_df, chosen_surveys, orig_filter1, filename=orig_filter1['FunderName'])


# orig_filter1 = {'FunderName': 'Murrumbidgee PHN'}
# all_results = do_all(processed_df, chosen_surveys, orig_filter1, filename=orig_filter1['FunderName'])

# orig_filter1 = {'FunderName': 'ACT Health'}
# all_results = do_all(processed_df, chosen_surveys, orig_filter1, filename=orig_filter1['FunderName'])

Wellbeing measures
NRecords For Col(Past4WkPhysicalHealth): 829)#, Total:2478, 2020-01-23 00:00:00, 2023-07-06 00:00:00
NRecords For Col(Past4WkMentalHealth): 829)#, Total:2478, 2020-01-23 00:00:00, 2023-07-06 00:00:00
NRecords For Col(Past4WkQualityOfLifeScore): 823)#, Total:2478, 2020-01-23 00:00:00, 2023-07-06 00:00:00
Substance Use
NRecords For Col(PDCHowMuchPerOccasion): 739)#, Total:2478, 2020-01-31 00:00:00, 2023-07-06 00:00:00
NRecords For Col(PDCDaysInLast28): 1041)#, Total:2478, 2020-01-07 00:00:00, 2023-07-06 00:00:00
Problems in Life Domains
NRecords For Col(Past4WkDailyLivingImpacted): 933)#, Total:2478, 2020-01-07 00:00:00, 2023-07-06 00:00:00
NRecords For Col(Past4WkHowOftenPhysicalHealthCausedProblems): 934)#, Total:2478, 2020-01-07 00:00:00, 2023-07-06 00:00:00
NRecords For Col(Past4WkHowOftenMentalHealthCausedProblems): 934)#, Total:2478, 2020-01-07 00:00:00, 2023-07-06 00:00:00
NRecords For Col(Past4WkUseLedToProblemsWithFamilyFriend): 934)#, Total:2478, 2020-01-07 0

In [10]:


chart , points = get_chart_for_qna_list(question_list, answers_df, title)

chart + points

#### Write results to CSV

In [18]:
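from datetime import datetime
from pathlib import Path
title_for_file = title.replace(" ", "_")
results_filepath = f"{results_folder}{fname}_{title_for_file}.csv"
# write_df_to_csv(answers_df, f"{results_folder}{fname}_{title_for_file}.csv")
#f"./data/out/results_{fname}.csv"
# Stamp every row so that appended runs can be told apart
answers_df['ResultsTimestamp'] = datetime.now().replace(microsecond=0)
# Append to the results file; write the header row only when the file is new
write_header = not Path(results_filepath).exists()
answers_df.to_csv(results_filepath, index=False, mode='a', header=write_header)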


In [8]:

# chosen_surveys = [1, 3 ,6] 
# answer_list = get_nmeans_for_questions( question_list, processed_df, chosen_surveys)

NRecords For Col(Past4WkHowOftenPhysicalHealthCausedProblems): 931, Total:2434, 2020-01-07 00:00:00, 2023-06-29 00:00:00
Past4WkHowOftenPhysicalHealthCausedProblems,1,125,1.38
Past4WkHowOftenPhysicalHealthCausedProblems,3,125,1.31
Past4WkHowOftenPhysicalHealthCausedProblems,6,125,1.43
NRecords For Col(Past4WkHowOftenMentalHealthCausedProblems): 931, Total:2434, 2020-01-07 00:00:00, 2023-06-29 00:00:00
Past4WkHowOftenMentalHealthCausedProblems,1,125,2.04
Past4WkHowOftenMentalHealthCausedProblems,3,125,1.66
Past4WkHowOftenMentalHealthCausedProblems,6,125,1.72
NRecords For Col(Past4WkUseLedToProblemsWithFamilyFriend): 931, Total:2434, 2020-01-07 00:00:00, 2023-06-29 00:00:00
Past4WkUseLedToProblemsWithFamilyFriend,1,125,0.74
Past4WkUseLedToProblemsWithFamilyFriend,3,125,0.61
Past4WkUseLedToProblemsWithFamilyFriend,6,125,0.44
NRecords For Col(Past4WkDifficultyFindingHousing): 915, Total:2434, 2020-01-07 00:00:00, 2023-06-29 00:00:00
Past4WkDifficultyFindingHousing,1,123,0.24
Past4WkDifficu

In [9]:

title = "Problems in Life Domains"
# 'Changes in average scores for "Past 4 weeks: Use led to problems in various Life domains"'
chart, points = get_chart_for_qna_list(question_list, answer_list, chosen_surveys, title)

chart + points

NameError: name 'get_chart_for_qna_list' is not defined

In [25]:
# col_df1[col_df1['survey_rank'] == 1].Past4WkPhysicalHealth.count() #.value_counts(dropna=False)
# len(col_df1[col_df1['survey_rank'] == 1].SLK.unique() )
# len(df_q[df_q['survey_rank'] == 1].SLK.unique() )


528

In [26]:
# col_df1[col_df1['survey_rank'] == 6].Past4WkPhysicalHealth.count()

# len(col_df1[col_df1['survey_rank'] == 6].SLK.unique() )
# len(df_q[df_q['survey_rank'] == 6].SLK.unique() )



138

In [12]:
## client_groups_forcol = col_df.groupby('SLK')
# from graphing import get_chart_for_means

# question_list = [question]
# assessment_tags= chosen_surveys
# means = averages

# # contribs = [first_assess_contribs,fourth_assess_contribs, seventh_assess_contribs ]
# chart = get_chart_for_means(question_list, assessment_tags, means, nth_assessment_contribs)
# chart