This part applies stemming. I chose LancasterStemmer because it seems to be the most aggressive of the NLTK stemmers at reducing words to their roots. That reduces the number of distinct tokens, which should make the final clustering step easier.

In [4]:
import pandas as pd
import re
from nltk.stem import LancasterStemmer

# Read the CSV file
df = pd.read_csv('patient_notes.csv')

# Create an instance of LancasterStemmer
stemmer = LancasterStemmer()

# Apply LancasterStemmer to the pn_history column
df['pn_history'] = df['pn_history'].apply(lambda x: ' '.join([stemmer.stem(word) for word in re.findall(r'\w+', x)]))

# Print the updated dataframe
print(df)


       pn_num  case_num                                         pn_history
0           0         0  17 year old mal has com to the stud heal clin ...
1           1         0  17 yo mal with recur palpit for the past 3 mo ...
2           2         0  dillon cleveland is a 17 y o mal paty with no ...
3           3         0  a 17 yo m c o palpit start 3 mos ago noth impr...
4           4         0  17yo mal with no pmh her for evalu of palpit s...
...       ...       ...                                                ...
42141   95330         9  ms mad is a 20 yo fem pres w the worst ha of h...
42142   95331         9  a 20 yo f cam complain a dul 8 10 headach that...
42143   95332         9  ms mad is a 20yo fem who pres with a headach o...
42144   95333         9  stephany mad is a 20 year old wom complain of ...
42145   95334         9  paty is a 20 yo f who pres with a headach she ...

[42146 rows x 3 columns]
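
To check how aggressive each stemmer is, the three common NLTK stemmers can be run side by side on a few words. This is only a minimal sketch: the sample words are hypothetical and were picked for illustration, not taken from the dataset.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

# Hypothetical sample words, chosen only for illustration.
sample_words = ["palpitations", "complaining", "headaches", "evaluation"]

stemmers = {
    "porter": PorterStemmer(),
    "snowball": SnowballStemmer("english"),
    "lancaster": LancasterStemmer(),
}

# Map each stemmer name to the list of stems it produces.
stems = {
    name: [stemmer.stem(word) for word in sample_words]
    for name, stemmer in stemmers.items()
}

for name, result in stems.items():
    print(f"{name:>10}: {result}")
```

Shorter stems mean fewer distinct tokens, but also a higher chance that unrelated words collapse onto the same root, so it is worth looking at the printed stems before settling on one stemmer.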


Apply the WordNet lemmatizer to the stemmed text.

In [11]:
import nltk

nltk.download('wordnet')

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
df['pn_history'] = df['pn_history'].apply(lambda x: ' '.join([lemmatizer.lemmatize(word) for word in re.findall(r'\w+', x)]))
print(df)

[nltk_data] Downloading package wordnet to /Users/caoyun/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


       pn_num  case_num                                         pn_history
0           0         0  17 year old mal ha com stud heal clin complain...
1           1         0  17 yo mal recur palpit past 3 mo last 3 4 min ...
2           2         0  dillon cleveland 17 mal paty sign pmh pres com...
3           3         0  17 yo c palpit start 3 mo ago noth improv exac...
4           4         0  17yo mal pmh evalu palpit stat last 3 4mo ha f...
...       ...       ...                                                ...
42141   95330         9  mad 20 yo fem pres w worst ha lif unlik anyth ...
42142   95331         9  20 yo f cam complain dul 8 10 headach assocy n...
42143   95332         9  mad 20yo fem pres headach 1 day dur wok yester...
42144   95333         9  stephany mad 20 year old wom complain headach ...
42145   95334         9  paty 20 yo f pres headach said ha nev somethig...

[42146 rows x 3 columns]


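Filter out stop words using NLTK's English stop word list. Note that the output is the same as in the previous cell, so the stop words appear to have been removed already when that output was captured.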

In [12]:
from nltk.corpus import stopwords

# Download the stopwords if not already present
nltk.download('stopwords')

# Get the list of stopwords
stop_words = set(stopwords.words('english'))

# Apply stop word filtering to the pn_history column
df['pn_history'] = df['pn_history'].apply(lambda x: ' '.join([word for word in re.findall(r'\w+', x) if word not in stop_words]))

# Print the updated dataframe
print(df)


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/caoyun/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


       pn_num  case_num                                         pn_history
0           0         0  17 year old mal ha com stud heal clin complain...
1           1         0  17 yo mal recur palpit past 3 mo last 3 4 min ...
2           2         0  dillon cleveland 17 mal paty sign pmh pres com...
3           3         0  17 yo c palpit start 3 mo ago noth improv exac...
4           4         0  17yo mal pmh evalu palpit stat last 3 4mo ha f...
...       ...       ...                                                ...
42141   95330         9  mad 20 yo fem pres w worst ha lif unlik anyth ...
42142   95331         9  20 yo f cam complain dul 8 10 headach assocy n...
42143   95332         9  mad 20yo fem pres headach 1 day dur wok yester...
42144   95333         9  stephany mad 20 year old wom complain headach ...
42145   95334         9  paty 20 yo f pres headach said ha nev somethig...

[42146 rows x 3 columns]
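
The order of the steps matters. Once a word has been stemmed or lemmatized it may no longer match the stop word list (the output above contains "ha", which appears to come from "has"). A possible alternative, sketched below, is to filter stop words first and stem afterwards in a single pass. The function name `preprocess` and the small `DEMO_STOP_WORDS` set are assumptions made for this example. In the notebook the full `stop_words` set would be passed instead.

```python
import re

from nltk.stem import LancasterStemmer

# Illustrative subset only; the notebook uses stopwords.words('english').
DEMO_STOP_WORDS = {"the", "a", "has", "to", "of", "with"}


def preprocess(text, stop_words, stemmer):
    """Lowercase, tokenize, drop stop words, then stem what remains."""
    if not isinstance(text, str):  # e.g. NaN from a missing CSV cell
        return ""
    tokens = re.findall(r"\w+", text.lower())
    kept = [tok for tok in tokens if tok not in stop_words]
    return " ".join(stemmer.stem(tok) for tok in kept)


stemmer = LancasterStemmer()
cleaned = preprocess("The patient has a headache", DEMO_STOP_WORDS, stemmer)
print(cleaned)
```

On the real data this could be applied as `df['pn_history'].apply(lambda x: preprocess(x, stop_words, stemmer))`, starting from the unmodified CSV rather than the already stemmed column.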
