## UW Pharmacy Student Self-Care Analysis (winter1_2)

This notebook is a draft that uses the "winter1_2" page of the student comment data from the 2022-2023 school year at the UW School of Pharmacy (SoP). The analysis takes a rough look at the determinants of the self-care categories students chose this quarter, using VADER sentiment analysis, NLTK and spaCy text processing, and pandas table manipulation.

Note: This notebook expects a single CSV sheet for one quarter, with two additional columns, "Category 1" and "Category 2", that categorize each comment according to the 8 facets of self-care.
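The input layout described in the note can be reshaped into one row per (comment, category) pair. The following is a minimal sketch, assuming a toy table whose comment column is named `Win 1_2` as in this quarter's sheet. The sample rows are invented for illustration, and `to_long` is a hypothetical helper that is not part of the notebook.

```python
import pandas as pd

# Toy stand-in for the uploaded sheet (rows invented for illustration).
raw = pd.DataFrame({
    "Win 1_2": ["I went for a walk.", "I cooked dinner with a friend.", "I took a nap."],
    "Category 1": ["Physical", "Physical", "Emotional"],
    "Category 2": [None, "Community", None],
})

def to_long(df, text_col):
    """Return one row per (comment, category) pair, dropping empty categories."""
    long = df.melt(
        id_vars=[text_col],
        value_vars=["Category 1", "Category 2"],
        value_name="Category",
    )
    return long.dropna(subset=["Category"])[[text_col, "Category"]].reset_index(drop=True)

long_data = to_long(raw, "Win 1_2")
print(long_data)
```

A comment with two categories appears twice in the result, once under each category, which matches how the notebook counts reflections.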

### Setup

In [1]:
#import python libraries
import numpy as np
import pandas as pd

!pip install vaderSentiment
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

from textblob import TextBlob
import nltk
nltk.download('brown')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

import spacy

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!

In [2]:
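# Change the CSV file name here
file_csv = "win1_2final.csv"

data = pd.read_csv(file_csv)

# Despite its name, first_column is the second column of the sheet (index 1),
# which holds the comment text for this quarter.
first_column = data.columns[1]
data = data[[first_column, "Category 1", "Category 2"]]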

In [3]:
# Collect the comments that have a second category so they can be counted
# once under each category. DataFrame.append was removed in pandas 2.0, so
# the rows are selected with a boolean mask instead of appended in a loop.
has_second = data["Category 2"].notna()
all_dups = data.loc[has_second, [first_column, "Category 2"]]
all_dups = all_dups.rename(columns={"Category 2": "Category"})

### Working data

In [4]:
data = data.rename(columns={"Category 1": "Category"})
data = data[[first_column, "Category"]]
data = pd.concat([data, all_dups])  # DataFrame.append was removed in pandas 2.0
data

Unnamed: 0,Win 1_2,Category
0,Took the time to do some meal prep last night ...,Physical
1,1. Select one of the activities that resonates...,Emotional
2,"Today, I cooked myself lunch and it was delici...",Physical
3,I worked on my Paint by Diamond piece.,Mental
4,I took a 20 minutes walk from the train statio...,Physical
...,...,...
100,I spent 40 mins cleaning my study space. I als...,Environmental
101,I try to take a time off from everything relat...,Spiritual
102,I took a nap for an hour while listening to so...,Emotional
105,I played fetch with my dogs and gave both of t...,Emotional


### Category frequencies within student comments

The pie chart below shows how the comments are distributed across the 8 self-care categories, which include physical, mental, community, emotional, environmental, spiritual, and occupational.

In [5]:
import plotly.express as px

dfg = data.groupby("Category").count().sort_values(by=first_column, ascending=False)
print("All reflections processed (includes duplicate categories): ", np.sum(dfg[first_column]))

dfg_pie = px.pie(dfg.reset_index(), values=first_column, names='Category', title='Category Frequencies')
dfg_pie

All reflections processed (includes duplicate categories):  168


### Sentiment analysis (positivity polarity score for each comment)

In this section, I score each comment with the VADER sentiment analyzer. The function sentiment_scores(sentence) returns the compound polarity score for a comment, a normalized value between -1 and 1, with -1 being the most negative and 1 the most positive. Note that VADER is a general-purpose, lexicon- and rule-based model that has not been trained on this data, so the scores are a very rough interpretation of the comments: they reflect the general tone of each comment, not an understanding of what self-care is.
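Compound scores are often bucketed into coarse labels before they are summarized. The following is a minimal sketch, assuming the ±0.05 cutoffs that are commonly recommended for VADER. `label_compound` is a hypothetical helper that is not part of the notebook, and the three example scores are taken from the output tables in this notebook.

```python
def label_compound(score, threshold=0.05):
    """Map a compound score in [-1, 1] to a coarse sentiment label."""
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"

# Example compound scores that appear in the tables of this notebook.
for s in (0.6114, 0.0, -0.4754):
    print(s, label_compound(s))
```

Many of the self-care comments score exactly 0.0, so a neutral bucket may be worth reporting separately instead of letting it pull category averages toward zero.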

In [6]:
# Return the VADER compound sentiment score of a sentence.
# Adapted from GeeksforGeeks.
def sentiment_scores(sentence):
 
    # Create a SentimentIntensityAnalyzer object.
    sid_obj = SentimentIntensityAnalyzer()
 
    # polarity_scores method of SentimentIntensityAnalyzer
    # object gives a sentiment dictionary.
    # which contains pos, neg, neu, and compound scores.
    sentiment_dict = sid_obj.polarity_scores(sentence)

    return sentiment_dict['compound']

In [7]:
# Score every comment. The table is built in one step because
# DataFrame.append was removed in pandas 2.0.
scores = pd.DataFrame({
    first_column: data[first_column].to_numpy(),
    "Score": [sentiment_scores(text) for text in data[first_column]],
})

In [8]:
scores_categories = data.merge(scores, on=first_column, how="right")
scores_categories

Unnamed: 0,Win 1_2,Category,Score
0,Took the time to do some meal prep last night ...,Physical,0.0000
1,Took the time to do some meal prep last night ...,Mental,0.0000
2,1. Select one of the activities that resonates...,Emotional,0.4939
3,"Today, I cooked myself lunch and it was delici...",Physical,0.6114
4,I worked on my Paint by Diamond piece.,Mental,0.3400
...,...,...,...
283,I took a nap for an hour while listening to so...,Emotional,0.0000
284,I played fetch with my dogs and gave both of t...,Physical,0.9153
285,I played fetch with my dogs and gave both of t...,Emotional,0.9153
286,I took a walk to get coffee and enjoyed the co...,Physical,0.7579
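Comments with two categories appear twice in both `data` and `scores`, so merging on the comment text pairs every copy with every copy. That would explain why the merged table's index runs to 286 although only 168 reflections were counted earlier. The following is a minimal sketch of the effect and of one way around it, using invented rows and a placeholder scoring function (`fake_score` is hypothetical and only stands in for `sentiment_scores`).

```python
import pandas as pd

# Toy data: the second comment has two categories, so its text appears twice.
data = pd.DataFrame({
    "Win 1_2": ["I took a walk.", "I took a nap.", "I took a nap."],
    "Category": ["Physical", "Physical", "Emotional"],
})

def fake_score(text):
    # Placeholder for sentiment_scores(); returns a dummy number.
    return float(len(text))

# Approach used in the notebook: build a separate score table, then merge on text.
scores = pd.DataFrame({
    "Win 1_2": data["Win 1_2"],
    "Score": data["Win 1_2"].map(fake_score),
})
merged = data.merge(scores, on="Win 1_2", how="right")

# Alternative: attach the score to each row directly, so no rows are multiplied.
scored = data.assign(Score=data["Win 1_2"].map(fake_score))

print(len(merged), len(scored))
```

In the merged table, the twice-categorized comment contributes four rows instead of two, which weights it more heavily in any per-category average.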


In [9]:
mean_scores = scores_categories.groupby('Category').mean(numeric_only=True)

In [10]:
fig = px.scatter(scores_categories, y="Score", x="Category", title="Polarity Scores by Category")
mean_scores = scores_categories.groupby('Category').mean(numeric_only=True)
for c in scores_categories['Category'].unique():
    fig.add_scatter(x=[c],
                    y=[mean_scores.loc[c]['Score']],
                    marker=dict(
                        color='red',
                        size=10
                    ),
                name=f'{c} mean')

fig.show()

In [11]:
px.bar(mean_scores.reset_index().sort_values(by="Score", ascending=False), x="Category", y="Score", title="Average Scores per Category")
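
Some categories contain only a handful of comments, so a mean on its own can be misleading. The following is a minimal sketch of a per-category summary that reports the count and median next to the mean. It uses a small invented table with the same `Category` and `Score` column names as `scores_categories`; the summary layout is an assumption and is not part of the notebook.

```python
import pandas as pd

# Toy stand-in for scores_categories (values invented for illustration).
scores_categories = pd.DataFrame({
    "Category": ["Physical", "Physical", "Physical", "Mental", "Mental"],
    "Score": [0.0, 0.6114, -0.4754, 0.34, 0.0],
})

# Count, mean, and median of the compound score for each category.
summary = (
    scores_categories.groupby("Category")["Score"]
    .agg(["count", "mean", "median"])
    .sort_values("mean", ascending=False)
)
print(summary)
```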

### Scores of each category 

In [12]:
print('Physical category table')
scores_categories_physical = scores_categories[scores_categories["Category"] == "Physical"]
scores_categories_physical

Physical category table


Unnamed: 0,Win 1_2,Category,Score
0,Took the time to do some meal prep last night ...,Physical,0.0000
3,"Today, I cooked myself lunch and it was delici...",Physical,0.6114
5,I took a 20 minutes walk from the train statio...,Physical,0.0000
7,"Currently, I have been way too stressed to foc...",Physical,-0.4754
8,"Since we started school completely virtually, ...",Physical,0.0346
...,...,...,...
274,I took a walk in my neighborhood for an hour.,Physical,0.0000
276,Working out helps me to recharge mentally and ...,Physical,0.3818
282,I took a nap for an hour while listening to so...,Physical,0.0000
284,I played fetch with my dogs and gave both of t...,Physical,0.9153


In [13]:
print('Mental category table')
scores_categories_mental = scores_categories[scores_categories["Category"] == "Mental"]
scores_categories_mental

Mental category table


Unnamed: 0,Win 1_2,Category,Score
1,Took the time to do some meal prep last night ...,Mental,0.0
4,I worked on my Paint by Diamond piece.,Mental,0.34
23,"I took a workout class to better my health, st...",Mental,0.8402
26,I listened to a podcast that wasn't about the ...,Mental,0.3595
35,I spent my time looking at the cook book for m...,Mental,0.0
48,I did embroidery for about 30 min while watchi...,Mental,0.0
51,I recently got a food delivery subscription fr...,Mental,0.7003
55,I spent an hour writing my novel.,Mental,0.3182
58,I closed my eyes for 10 minutes and exercised ...,Mental,0.0
85,I spent my time listening to relaxing music wh...,Mental,0.6808


In [14]:
print('Community category table')
scores_categories_community = scores_categories[scores_categories["Category"] == "Community"]
scores_categories_community

Community category table


Unnamed: 0,Win 1_2,Category,Score
9,"Since we started school completely virtually, ...",Community,0.0346
10,I was having a really hard week and I met up w...,Community,0.6605
12,I worked out at my local gym with two of my fr...,Community,0.4767
14,"For my self-care, I decided to FaceTime a frie...",Community,0.4939
28,I choose to focus on nurture connection. I rea...,Community,0.8658
43,I went out to lunch with a friend who I haven'...,Community,0.4939
50,I took a walk with my husband and our dog in o...,Community,0.0
54,I made a yummy Thai dish with my wife yesterda...,Community,0.5267
62,I went to the gym and did 40 minutes of cardio...,Community,0.7184
67,Social Connectedness - I had lunch with a frie...,Community,0.6486


In [15]:
print('Emotional category table')
scores_categories_emotional = scores_categories[scores_categories["Category"] == "Emotional"]
scores_categories_emotional

Emotional category table


Unnamed: 0,Win 1_2,Category,Score
2,1. Select one of the activities that resonates...,Emotional,0.4939
16,I connected with my environment by taking a wa...,Emotional,0.9423
19,I listened to some relaxing music today. It wa...,Emotional,0.7345
27,I spent 30 minutes listening to calm/relaxing ...,Emotional,0.0
38,"I took a walk in my neighborhood, then wrote a...",Emotional,0.6486
46,Happy New Year 2022! My goal in this year is ...,Emotional,0.9925
59,"For my activity this week, I have chosen to li...",Emotional,0.8316
66,I went to my apartment building gym and worked...,Emotional,0.5777
73,I cooked a meal while listening to some relaxi...,Emotional,0.765
84,I spent my time listening to relaxing music wh...,Emotional,0.6808


In [16]:
print('Spiritual category table')
scores_categories_spiritual = scores_categories[scores_categories["Category"] == "Spiritual"]
scores_categories_spiritual

Spiritual category table


Unnamed: 0,Win 1_2,Category,Score
39,I burned some sage to clear my energy as well ...,Spiritual,0.9399
78,I woke up early this morning and went to yoga ...,Spiritual,0.4588
111,I did 30 minutes of yoga after I got home from...,Spiritual,0.0
122,I went to the gym and worked out. Then I stret...,Spiritual,0.0
132,I took a nap and later attended Daily Mass at ...,Spiritual,0.0
142,To relax and clear my mind from upcoming quizz...,Spiritual,0.8402
158,I try to take a time off from everything relat...,Spiritual,0.8381
220,I woke up early this morning and went to yoga ...,Spiritual,0.4588
242,I did 30 minutes of yoga after I got home from...,Spiritual,0.0
249,I went to the gym and worked out. Then I stret...,Spiritual,0.0


In [17]:
print('Occupational category table')
scores_categories_occupational = scores_categories[scores_categories["Category"] == "Occupational"]
scores_categories_occupational

Occupational category table


Unnamed: 0,Win 1_2,Category,Score
68,Social Connectedness - I had lunch with a frie...,Occupational,0.6486
94,Today I worked on my occupational wellness and...,Occupational,0.6994
112,I did 30 minutes of yoga after I got home from...,Occupational,0.0
140,I reached out to a friend planning how we can ...,Occupational,0.743
213,Social Connectedness - I had lunch with a frie...,Occupational,0.6486
230,Today I worked on my occupational wellness and...,Occupational,0.6994
243,I did 30 minutes of yoga after I got home from...,Occupational,0.0
263,I reached out to a friend planning how we can ...,Occupational,0.743
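
The per-category tables above are built with one nearly identical cell per category. The following is a minimal sketch of building them all at once with `groupby`, assuming a small invented table with the same column names as `scores_categories`. The `tables` dictionary is a hypothetical name that is not used elsewhere in the notebook.

```python
import pandas as pd

# Toy stand-in for scores_categories (rows invented for illustration).
scores_categories = pd.DataFrame({
    "Win 1_2": ["I took a walk.", "I cooked lunch.", "I wrote for an hour.", "I went to the gym."],
    "Category": ["Physical", "Physical", "Mental", "Physical"],
    "Score": [0.0, 0.6114, 0.3182, 0.0],
})

# One sub-table per category, keyed by the category name.
tables = {category: tbl for category, tbl in scores_categories.groupby("Category")}

print(tables["Physical"])
```

With this layout, a new category in the data gets its own table without adding another cell.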


### Frequency of nouns and verbs 

In [18]:
!pip install spacy -q
!python -m spacy download en_core_web_sm -q

from collections import Counter
import en_core_web_sm


Note: For the sake of experimentation, I built this part around the comments categorized as "Physical". It can easily be broadened to other categories by passing a different category table to the function.

This section takes the 100 most frequent words in a particular category, along with each word's count within the comments. Again, note that the counts may include duplicates, because a comment that was categorized twice can be repeated. A helper function then extracts the nouns and verbs among those words. This could help show which actions students tended to choose within each category.
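The word-counting step can be sketched on its own. The notebook splits comments on whitespace only, so "walk", "walk." and "walk," are counted as different words. The following is a minimal variation, not the notebook's method, that lowercases the text and strips punctuation before counting. `top_words` is a hypothetical helper and the sample comments are invented.

```python
import re
from collections import Counter

# Invented sample comments for illustration.
comments = [
    "I took a walk. The walk was relaxing!",
    "Took a long walk, then cooked dinner.",
]

def top_words(texts, n=100):
    """Count lowercase word tokens, ignoring punctuation, and return the n most common."""
    tokens = []
    for text in texts:
        tokens.extend(re.findall(r"[a-z']+", text.lower()))
    return Counter(tokens).most_common(n)

print(top_words(comments, n=3))
```

Normalizing first keeps the most frequent words from being split across several spellings, which matters when only the top 100 are kept.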

In [19]:
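def fig_frequencies(tbl, keyword):
    # Count whitespace-separated words across all comments in this category
    # and keep the 100 most frequent as (word, count) pairs.
    counter = Counter(" ".join(tbl[first_column]).split()).most_common(100)
    words = [word for word, _ in counter]

    # Join the words into one string so the taggers below can process them.
    sentence = ','.join(words)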

    def extract_nouns_verbs(sentence):
        text = nltk.word_tokenize(sentence)
        pos_tagged = nltk.pos_tag(text)
        nouns_verbs = filter(lambda x:x[1]=='NN' or x[1] == 'VB',pos_tagged)
        return list(nouns_verbs)

    nouns_verbs = pd.DataFrame(extract_nouns_verbs(sentence), columns=['Word', 'Word type'])
    counted_words = pd.DataFrame(counter, columns=['Word', 'Word frequency'])

    noun_verb_frequency = nouns_verbs.merge(counted_words, on='Word', how='left').sort_values(by='Word frequency', ascending=False)

    nlp = en_core_web_sm.load()
    def all_nouns(sentence):
        doc = nlp(sentence)
        nouns = [(token.lemma_, "NN") for token in doc if token.pos_ == "NOUN"]
        return nouns

    def all_verbs(sentence):
        doc = nlp(sentence)
        verbs = [(token.lemma_, "VB") for token in doc if token.pos_ == "VERB"]
        return verbs

    nouns_verbs_spacy = all_nouns(sentence) + all_verbs(sentence)

    nouns_verbs_sp = pd.DataFrame(nouns_verbs_spacy, columns=['Word', 'Word type'])
    counted_words = pd.DataFrame(counter, columns=['Word', 'Word frequency'])

    noun_verb_frequency_sp = nouns_verbs_sp.merge(counted_words, on='Word', how='inner').sort_values(by='Word frequency', ascending=False)
    noun_verb_frequency_sp = noun_verb_frequency_sp.drop_duplicates(subset='Word', keep="last")

    fig = px.bar(noun_verb_frequency_sp, x='Word', y='Word frequency', color='Word type', title='Noun/verb frequencies for Category: ' + keyword)
    fig.update_layout(xaxis_categoryorder = 'total descending')

    fig.show()

In [20]:
fig_frequencies(scores_categories_physical, 'Physical');





In [21]:
fig_frequencies(scores_categories_mental, 'Mental');

In [22]:
fig_frequencies(scores_categories_emotional, 'Emotional');

In [23]:
fig_frequencies(scores_categories_community, 'Community');

In [24]:
fig_frequencies(scores_categories_spiritual, 'Spiritual');

In [25]:
fig_frequencies(scores_categories_occupational, 'Occupational');
