# Building blocks
Exploring potential datasets to use for the solution.

## Components of solution
1. **Target audience**: Senior citizens (65+)
2. **Core issues**: Access to care and addressing mental health challenges
3. **Primary points to address**: Make mental health treatment approachable for senior citizens, and destigmatize starting mental health care as an older individual (through transparency and encouragement)

## Proposed solution
Create an application that recommends mental health resources, along with information about each resource (e.g. cost of care, type of service), based on provided medical information, self-reported information (if given), location, and patient demographics.

### Why is this "unique"?
The app tailors its recommendations to each user's information and responses, and is trained on specialized data (e.g. PACE reports, mental health responses from previous cases) so that its suggestions fit the user. It also aims to make it easier for older individuals to find and begin mental health care: starting the journey to mental wellness can be intimidating, especially when it means sifting through large amounts of unfamiliar information. Although the app is not intended to diagnose anyone, it lets users reflect on their answers and begin receiving the care they may need.

## Potential workflow
1. **Gather possible datasets**: We should look for datasets that describe resources for older individuals within CA (e.g. PACE reporting), as well as mental health datasets that contain text with sentiment analysis/classification labels. We may need to do some web scraping for CA resources other than PACE.
* [Kanakmi/mental-disorders](https://huggingface.co/datasets/Kanakmi/mental-disorders)
* [PACE Rates Calendar Year 2022](https://data.chhs.ca.gov/dataset/9705522d-898f-44df-a79d-64128005372c/resource/144c5a90-d65d-4876-bcc8-e2ce81d97153/download/pace-rates-calendar-year-2022.csv)
* [Mental Disorder Classification](https://www.kaggle.com/datasets/cid007/mental-disorder-classification)
* [Mental Health Dataset](https://www.kaggle.com/datasets/bhavikjikadara/mental-health-dataset)
* [Sentiment Analysis for Mental Health](https://www.kaggle.com/datasets/suchintikasarkar/sentiment-analysis-for-mental-health)
2. **Explore the data**: Primarily look at the values and their distributions (visualizing them where that helps).
3. **Choose data to build a model with**: To create this recommendation system, we need a model that can estimate the likelihood that an individual has condition `x`, `y`, `z`, etc., so we need to choose which dataset to train the classifier on. We might want to use a text-based dataset for this classification.
4. **Build the model**: Given whatever dataset we utilize, we build a classification model to use for our recommender system.
5. **Incorporating model with resource recommender**: We will most likely need to assign certain resources based on scoring, but this may change.
6. **Creating the app**: Build the app with the recommender system built in. I suggest using [Streamlit](https://streamlit.io/), which lets us create the front end in Python.
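
Step 5 could be sketched roughly as follows. This is only an illustration, assuming the classifier returns a probability per label and that each resource is tagged with the labels it serves. The `Resource` class, the `recommend` function, the resource names, and the `min_score` threshold are all hypothetical placeholders, not real providers or a settled design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    name: str          # hypothetical resource name
    county: str        # county where the resource is offered
    serves: frozenset  # condition labels the resource addresses


def recommend(probabilities, resources, county, min_score=0.2, top_n=3):
    """Rank resources in `county` by the summed probability of the labels they serve."""
    scored = []
    for res in resources:
        if res.county != county:
            continue
        score = sum(probabilities.get(label, 0.0) for label in res.serves)
        if score >= min_score:
            scored.append((score, res.name))
    # Highest score first; ties broken alphabetically for stable output
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:top_n]]


# Placeholder data for illustration only
resources = [
    Resource("Example Counseling Center", "Los Angeles", frozenset({"depression", "Anxiety"})),
    Resource("Example Peer Support Group", "Los Angeles", frozenset({"bipolar"})),
    Resource("Example Senior Clinic", "Orange", frozenset({"depression"})),
]
probabilities = {"depression": 0.55, "Anxiety": 0.30, "bipolar": 0.10}

print(recommend(probabilities, resources, county="Los Angeles"))
```

Summing label probabilities is just one possible scoring rule; as noted in step 5, the way resources are assigned may change.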

# Coding section
## Importing packages

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## `Kanakmi/mental-disorders` dataset

In [3]:
splits = {'train': 'data/train-00000-of-00001.parquet', 'test': 'data/test-00000-of-00001.parquet', 'val': 'data/val-00000-of-00001.parquet'}
kanakami = pd.read_parquet("hf://datasets/Kanakmi/mental-disorders/" + splits["train"])

# Labels:
# 0: 'BPD'
# 1: 'bipolar'
# 2: 'depression'
# 3: 'Anxiety'
# 4: 'schizophrenia'
# 5: 'mentalillness'

In [4]:
kanakami.head()

Unnamed: 0,text,label
0,My father - all of my life - has shifted betwe...,0
1,I have health anxiety where I go to the doctor...,3
2,I was thinking about the differences between B...,0
3,Let me preface this by saying that I promise I...,2
4,"I've been exploring this forum for awhile, and...",0
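
Since the `label` column holds integer codes, it may help to attach readable names while exploring. The sketch below assumes the label list documented for this dataset; `LABEL_NAMES`, `sample`, and the `label_name` column are illustrative names, and the tiny stand-in frame is there only so the example runs without downloading the data.

```python
import pandas as pd

# Label ids and names as documented for the dataset
LABEL_NAMES = {
    0: "BPD",
    1: "bipolar",
    2: "depression",
    3: "Anxiety",
    4: "schizophrenia",
    5: "mentalillness",
}

# Tiny stand-in frame so the example runs without downloading the dataset
sample = pd.DataFrame({"text": ["post a", "post b", "post c"], "label": [0, 3, 2]})

# Map each integer code to its readable name
sample["label_name"] = sample["label"].map(LABEL_NAMES)
print(sample)
```

The same `.map(LABEL_NAMES)` call should work on the full `kanakami` frame, provided its `label` column uses these integer codes.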


In [5]:
kanakami['label'].value_counts()

label,count
0,170272
3,129350
2,96982
5,30521
1,28551
4,9375
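
The counts above are quite imbalanced (label 0 has roughly 18 times as many rows as label 4), which is worth keeping in mind when building the classifier. As a rough sketch, the snippet below computes each label's share and a "balanced" class weight of `total / (n_classes * count)`, which to my understanding is the same heuristic scikit-learn applies for `class_weight="balanced"`. Whether we weight, resample, or do neither is still an open choice.

```python
# Class counts copied from the value_counts() output above
counts = {0: 170272, 3: 129350, 2: 96982, 5: 30521, 1: 28551, 4: 9375}

total = sum(counts.values())
n_classes = len(counts)

# Share of the training split held by each label
shares = {label: count / total for label, count in counts.items()}

# "Balanced" weighting heuristic: total / (n_classes * count),
# so rarer labels receive proportionally larger weights
weights = {label: total / (n_classes * count) for label, count in counts.items()}

for label in sorted(counts):
    print(f"label {label}: share={shares[label]:.3f}, weight={weights[label]:.2f}")
```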


## PACE 2022 Dataset

In [23]:
pace = pd.read_csv("PACE_2022.csv")
pace.head()

Unnamed: 0,_id,Rating Period,Calendar Year,Model,County,PACE Organization,Category of Aid,Lower Bound,Midpoint,Upper Bound,AWOP
0,1,01/2022-6/2022,2022,PACE,Los Angeles,AltaMed Senior Care,Full-Dual,"$4,197.13","$4,397.54","$4,615.28","$6,422.25"
1,2,01/2022-6/2022,2022,PACE,Los Angeles,AltaMed Senior Care,Non-Dual,"$6,339.12","$6,649.39","$6,986.54","$9,417.40"
2,3,01/2022-6/2022,2022,PACE,Los Angeles,Brandman Centers For Senior Care,Full-Dual,"$4,397.93","$4,612.08","$4,844.77","$6,422.25"
3,4,01/2022-6/2022,2022,PACE,Los Angeles,Brandman Centers For Senior Care,Non-Dual,"$7,049.46","$7,401.18","$7,783.42","$9,417.40"
4,5,01/2022-6/2022,2022,PACE,Orange,CalOptima,Full-Dual,"$3,991.62","$4,189.19","$4,403.89","$5,891.84"


In [21]:
pace['County'].value_counts()

County,count
Los Angeles,18
San Diego,16
San Joaquin,12
Orange,12
San Francisco,8
Tulare,8
Stanislaus,8
Sacramento,8
Riverside,8
San Bernardino,8


In [24]:
# Define a function to strip the dollar sign from a value
def clean_dollar(value):
    return value.replace('$', '').strip()

# Apply the function to every value in the rate columns
# (Series.map is used per column because DataFrame.applymap is deprecated)
columns_to_clean = ['Lower Bound', 'Midpoint', 'Upper Bound', 'AWOP']
pace[columns_to_clean] = pace[columns_to_clean].apply(lambda col: col.map(clean_dollar))


In [25]:
pace.head()

Unnamed: 0,_id,Rating Period,Calendar Year,Model,County,PACE Organization,Category of Aid,Lower Bound,Midpoint,Upper Bound,AWOP
0,1,01/2022-6/2022,2022,PACE,Los Angeles,AltaMed Senior Care,Full-Dual,4197.13,4397.54,4615.28,6422.25
1,2,01/2022-6/2022,2022,PACE,Los Angeles,AltaMed Senior Care,Non-Dual,6339.12,6649.39,6986.54,9417.4
2,3,01/2022-6/2022,2022,PACE,Los Angeles,Brandman Centers For Senior Care,Full-Dual,4397.93,4612.08,4844.77,6422.25
3,4,01/2022-6/2022,2022,PACE,Los Angeles,Brandman Centers For Senior Care,Non-Dual,7049.46,7401.18,7783.42,9417.4
4,5,01/2022-6/2022,2022,PACE,Orange,CalOptima,Full-Dual,3991.62,4189.19,4403.89,5891.84


In [26]:
# Strip thousands separators and convert each rate column to float
for col in columns_to_clean:
    pace[col] = pace[col].str.replace(',', '', regex=False).astype(float)
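
As an alternative, the dollar-sign and comma cleanup could probably be done in a single pass. The sketch below is one way to do it, using a small stand-in frame with values copied from the `pace.head()` output; `raw`, `cleaned`, and `money_columns` are illustrative names only.

```python
import pandas as pd

# Stand-in rows copied from the pace.head() output
raw = pd.DataFrame({
    "Lower Bound": ["$4,197.13", "$6,339.12"],
    "AWOP": ["$6,422.25", "$9,417.40"],
})

money_columns = ["Lower Bound", "AWOP"]
cleaned = raw.copy()
for col in money_columns:
    # Drop "$" and "," in one pass, then convert to a numeric dtype;
    # errors="raise" makes any unexpected value fail loudly
    stripped = cleaned[col].str.replace(r"[$,]", "", regex=True).str.strip()
    cleaned[col] = pd.to_numeric(stripped, errors="raise")

print(cleaned.dtypes)
```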

In [27]:
pace.head()

Unnamed: 0,_id,Rating Period,Calendar Year,Model,County,PACE Organization,Category of Aid,Lower Bound,Midpoint,Upper Bound,AWOP
0,1,01/2022-6/2022,2022,PACE,Los Angeles,AltaMed Senior Care,Full-Dual,4197.13,4397.54,4615.28,6422.25
1,2,01/2022-6/2022,2022,PACE,Los Angeles,AltaMed Senior Care,Non-Dual,6339.12,6649.39,6986.54,9417.4
2,3,01/2022-6/2022,2022,PACE,Los Angeles,Brandman Centers For Senior Care,Full-Dual,4397.93,4612.08,4844.77,6422.25
3,4,01/2022-6/2022,2022,PACE,Los Angeles,Brandman Centers For Senior Care,Non-Dual,7049.46,7401.18,7783.42,9417.4
4,5,01/2022-6/2022,2022,PACE,Orange,CalOptima,Full-Dual,3991.62,4189.19,4403.89,5891.84


In [31]:
# Checking for null values
nan_count = pace.isnull().sum().sum()
print('Number of NaN values:', nan_count)

Number of NaN values: 0


In [34]:
# Checking to make sure types are good
pace.dtypes

column,dtype
_id,int64
Rating Period,object
Calendar Year,int64
Model,object
County,object
PACE Organization,object
Category of Aid,object
Lower Bound,float64
Midpoint,float64
Upper Bound,float64


In [32]:
print("Average for lower bound care: " + str(round(pace['Lower Bound'].mean(), 2)))
print("Average for upper bound care: " + str(round(pace['Upper Bound'].mean(), 2)))
print("Average for midpoint care: " + str(round(pace['Midpoint'].mean(), 2)))
print("Average for AWOP: " + str(round(pace['AWOP'].mean(), 2)))

Average for lower bound care: 6320.61
Average for upper bound care: 6973.01
Average for midpoint care: 6633.26
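
Computing all of the averages from one list of column names keeps each label tied to its own column, and makes it easy to split the result by category of aid. The sketch below assumes only the five rows shown in the `pace.head()` output, so its numbers will not match the full-table averages; `sample`, `rate_columns`, `overall`, and `by_category` are illustrative names.

```python
import pandas as pd

# First five rows of the cleaned table, copied from the pace.head() output
sample = pd.DataFrame({
    "Category of Aid": ["Full-Dual", "Non-Dual", "Full-Dual", "Non-Dual", "Full-Dual"],
    "Lower Bound": [4197.13, 6339.12, 4397.93, 7049.46, 3991.62],
    "Midpoint": [4397.54, 6649.39, 4612.08, 7401.18, 4189.19],
    "Upper Bound": [4615.28, 6986.54, 4844.77, 7783.42, 4403.89],
    "AWOP": [6422.25, 9417.40, 6422.25, 9417.40, 5891.84],
})

rate_columns = ["Lower Bound", "Midpoint", "Upper Bound", "AWOP"]

# Mean of every rate column in one call
overall = sample[rate_columns].mean().round(2)

# The same means, split by category of aid
by_category = sample.groupby("Category of Aid")[rate_columns].mean().round(2)

print(overall)
print(by_category)
```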


In [35]:
# Saving the cleaned .csv file
pace.to_csv("PACE_2022_cleaned.csv", index=False)
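
Once the cleaned file exists, the recommender could look up PACE organizations by the user's location. The sketch below is a guess at what that lookup might look like. It reads a small in-memory stand-in for `PACE_2022_cleaned.csv` (rows taken from the `pace.head()` output) and uses a hypothetical helper, `pace_options`. Whether the midpoint rate is the right value to sort on is an assumption to revisit.

```python
import io

import pandas as pd

# Stand-in for PACE_2022_cleaned.csv, using rows from the pace.head() output
csv_text = """County,PACE Organization,Category of Aid,Midpoint
Los Angeles,AltaMed Senior Care,Full-Dual,4397.54
Los Angeles,Brandman Centers For Senior Care,Full-Dual,4612.08
Los Angeles,AltaMed Senior Care,Non-Dual,6649.39
Orange,CalOptima,Full-Dual,4189.19
"""
cleaned = pd.read_csv(io.StringIO(csv_text))


def pace_options(df, county, category_of_aid):
    """Return PACE organizations for a county and aid category, lowest midpoint rate first."""
    mask = (df["County"] == county) & (df["Category of Aid"] == category_of_aid)
    columns = ["PACE Organization", "Midpoint"]
    return df.loc[mask, columns].sort_values("Midpoint").reset_index(drop=True)


print(pace_options(cleaned, "Los Angeles", "Full-Dual"))
```

With the real file, `cleaned` would instead come from `pd.read_csv("PACE_2022_cleaned.csv")`.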