<a href="https://colab.research.google.com/github/dornercr/INFO371/blob/main/INFO371_week6_7_Text_Representation_allMarkdown_medical_example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# INFO 371: Data Mining Applications

## Week 6-7: Text Representation
### Prof. Charles Dorner, EdD (Candidate)
### College of Computing and Informatics, Drexel University

# Import Libraries
- spaCy: spaCy is a free open-source library for Natural Language Processing in Python. It features NER, POS tagging, dependency parsing, word vectors and more.
- pandas: Used for data manipulation and analysis
- sklearn's CountVectorizer: Convert a collection of text documents to a matrix of token counts
- sklearn's TfidfVectorizer: Convert a collection of raw documents to a matrix of TF-IDF features.

```
import pandas as pd
import numpy as np
from google.colab import files
import matplotlib.pyplot as plt
import spacy
```

In [None]:


import pandas as pd
import numpy as np
from google.colab import files
import matplotlib.pyplot as plt
import spacy



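# Generate the text data

An earlier version of this notebook read the SMS spam collection from `spam.csv`. This version instead generates a synthetic dataset of 500 patient messages from the templates below: 250 appointment requests and 250 medical concerns.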

In [28]:


import pandas as pd
import random

# Define templates for synthetic data generation
appointment_templates = [
    "I'd like to schedule a check-up on {}.",
    "Can I get an appointment with Dr. {} next {}?",
    "I need to reschedule my appointment to {}.",
    "Do you have availability for a physical exam on {}?",
    "Please book me for a consultation on {} morning.",
    "Is Dr. {} free for a wellness visit this {}?",
    "I want to set up a follow-up appointment for {}.",
    "Can I come in for a flu shot on {}?",
    "Looking to book a routine check-up next {}.",
    "Any openings for a dental cleaning on {}?"
]

medical_templates = [
    "I've been feeling {} for the past {} days.",
    "I'm experiencing {} and would like to speak to a doctor.",
    "There's been a lot of {} recently and it's getting worse.",
    "My {} is {} and I’m not sure what to do.",
    "Dealing with severe {} and mild {}.",
    "I have a {} that hasn’t gone away since last {}.",
    "Noticing unusual {} and mild {} symptoms.",
    "My child has had {} since yesterday evening.",
    "Concerned about {} and would like medical advice.",
    "Pain in my {} is affecting my sleep."
]

# Sample fill-in options
days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"]
doctor_names = ["Lee", "Patel", "Smith", "Johnson", "Chen"]
conditions = ["dizziness", "nausea", "chest pain", "headaches", "fever", "fatigue", "rash", "soreness", "shortness of breath", "inflammation"]
body_parts = ["knee", "back", "neck", "abdomen", "shoulder", "foot"]

# Generate more samples
data = {
    "message": [],
    "label": []
}

for _ in range(250):  # 250 appointment messages
    template = random.choice(appointment_templates)
    filled = template.format(
        random.choice(days),
        random.choice(doctor_names)
    )
    data["message"].append(filled)
    data["label"].append("appointment")

for _ in range(250):  # 250 medical messages
    template = random.choice(medical_templates)
    filled = template.format(
        random.choice(conditions),
        random.choice(days),
        random.choice(body_parts),
        random.choice(["hurting", "swollen", "tingling", "painful"]),
        random.choice(conditions),
        random.choice(conditions)
    )[:200]  # Truncate just in case
    data["message"].append(filled)
    data["label"].append("medical")

# Convert to DataFrame and encode label
sms = pd.DataFrame(data)
sms['label'] = sms['label'].map({'appointment': 0, 'medical': 1})
sms.head(100)



| | message | label |
|---|---|---|
| 0 | Do you have availability for a physical exam o... | 0 |
| 1 | I'd like to schedule a check-up on Monday. | 0 |
| 2 | Do you have availability for a physical exam o... | 0 |
| 3 | Any openings for a dental cleaning on Monday? | 0 |
| 4 | Can I get an appointment with Dr. Monday next ... | 0 |
| ... | ... | ... |
| 95 | Can I get an appointment with Dr. Monday next ... | 0 |
| 96 | Please book me for a consultation on Saturday ... | 0 |
| 97 | Any openings for a dental cleaning on Thursday? | 0 |
| 98 | I want to set up a follow-up appointment for T... | 0 |
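The generator above fills every template positionally with the same argument order, so a template such as "Can I get an appointment with Dr. {} next {}?" receives a weekday where the doctor's name belongs. That is why "Dr. Monday" appears in the output. One possible remedy is to key each slot by name, as sketched below. The rewritten templates, the `n_days` slot, and the `fill` helper are illustrative assumptions, not part of the original notebook.

```python
import random

# Hypothetical rewrites of a few original templates, using named placeholders
templates = [
    "Can I get an appointment with Dr. {doctor} next {day}?",
    "I'd like to schedule a check-up on {day}.",
    "My {body_part} is {state} and I'm not sure what to do.",
    "I've been feeling {condition} for the past {n_days} days.",
]

days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"]
doctor_names = ["Lee", "Patel", "Smith", "Johnson", "Chen"]
conditions = ["dizziness", "nausea", "chest pain", "fever", "fatigue"]
body_parts = ["knee", "back", "neck", "abdomen", "shoulder", "foot"]
states = ["hurting", "swollen", "tingling", "painful"]


def fill(template, rng):
    """Fill one template; str.format ignores keyword arguments it does not need."""
    return template.format(
        doctor=rng.choice(doctor_names),
        day=rng.choice(days),
        body_part=rng.choice(body_parts),
        state=rng.choice(states),
        condition=rng.choice(conditions),
        n_days=rng.randint(2, 10),
    )


rng = random.Random(42)  # seeded so the generated messages are reproducible
messages = [fill(rng.choice(templates), rng) for _ in range(5)]
for message in messages:
    print(message)
```

Seeding the generator also keeps `sms.loc[0]` stable between runs, so printed examples do not drift from the recorded outputs.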


```
sms.shape
```

In [29]:
sms.shape

(500, 2)

```
sms.head()
```

In [30]:
sms.head()

| | message | label |
|---|---|---|
| 0 | Do you have availability for a physical exam o... | 0 |
| 1 | I'd like to schedule a check-up on Monday. | 0 |
| 2 | Do you have availability for a physical exam o... | 0 |
| 3 | Any openings for a dental cleaning on Monday? | 0 |
| 4 | Can I get an appointment with Dr. Monday next ... | 0 |


```
sms.loc[0].message
```

In [31]:
sms.loc[0].message

'Do you have availability for a physical exam on Thursday?'

```
sms.loc[2].message
```

In [32]:
sms.loc[2].message

'Do you have availability for a physical exam on Saturday?'

# Understanding the Data
- It has two columns: message and label.
- The message column contains the text.
- The label column denotes the class of the text: 0 for an appointment request and 1 for a medical concern.

# The label class distribution

```
sms.label.value_counts()
```

In [33]:
sms.label.value_counts()

| label | count |
|---|---|
| 0 | 250 |
| 1 | 250 |


```
sms.label.value_counts() / len(sms)
```

In [34]:
sms.label.value_counts() / len(sms)

| label | count |
|---|---|
| 0 | 0.5 |
| 1 | 0.5 |


# spaCy Tokenizer
- We will use the spaCy library for word tokenization
- We will load spaCy's English language model
- We will remove stop words and punctuation
- We will extract lemmas

```
nlp = spacy.load("en_core_web_sm")
```

In [35]:
import spacy
nlp = spacy.load("en_core_web_sm")

```
doc = nlp(sms.loc[0].message)
```

In [36]:
doc = nlp(sms.loc[0].message)

```
sms.loc[0].message
```

In [37]:
sms.loc[0].message

'Do you have availability for a physical exam on Thursday?'

```
tokens_info = []
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_)
```

In [16]:


tokens_info = []
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_)



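(Output recorded in an earlier run, when `sms.loc[0]` held a different randomly generated message.)

Any any DET DT det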
openings opening NOUN NNS ROOT
for for ADP IN prep
a a DET DT det
dental dental ADJ JJ amod
cleaning cleaning NOUN NN pobj
on on ADP IN prep
Friday Friday PROPN NNP pobj
? ? PUNCT . punct


```
tokens_info = []
for token in doc:
    tokens_info.append([token.text, token.lemma_, token.pos_, token.tag_, token.dep_, \
            token.shape_, token.is_alpha, token.is_stop])
tokens_df = pd.DataFrame(tokens_info, columns=['Token', 'Lemma', 'POS', 'TAG', 'DEP', 'Shape', 'Is_Alpha', 'Is_Stop'])
tokens_df
```

In [38]:
tokens_info = []
for token in doc:
    tokens_info.append([token.text, token.lemma_, token.pos_, token.tag_, token.dep_, \
            token.shape_, token.is_alpha, token.is_stop])
tokens_df = pd.DataFrame(tokens_info, columns=['Token', 'Lemma', 'POS', 'TAG', 'DEP', 'Shape', 'Is_Alpha', 'Is_Stop'])
tokens_df


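| | Token | Lemma | POS | TAG | DEP | Shape | Is_Alpha | Is_Stop |
|---|---|---|---|---|---|---|---|---|
| 0 | Do | do | AUX | VBP | aux | Xx | True | True |
| 1 | you | you | PRON | PRP | nsubj | xxx | True | True |
| 2 | have | have | VERB | VB | ROOT | xxxx | True | True |
| 3 | availability | availability | NOUN | NN | dobj | xxxx | True | False |
| 4 | for | for | ADP | IN | prep | xxx | True | True |
| 5 | a | a | DET | DT | det | x | True | True |
| 6 | physical | physical | ADJ | JJ | amod | xxxx | True | False |
| 7 | exam | exam | NOUN | NN | pobj | xxxx | True | False |
| 8 | on | on | ADP | IN | prep | xx | True | True |
| 9 | Thursday | Thursday | PROPN | NNP | pobj | Xxxxx | True | False |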


# Create a tokenizer using spaCy

```
nlp = spacy.load("en_core_web_sm")

# Create our tokenizer function
def spacy_tokenizer(sentence):
    """Accept a sentence and return its tokens after removing stop words and
    punctuation, lemmatizing, and lowercasing."""

    # Parse the sentence into a spaCy Doc with linguistic annotations
    doc = nlp(sentence)

    # Remove stop words and punctuation
    mytokens = [word for word in doc if not word.is_stop and word.pos_ != 'PUNCT']

    # Lemmatize and lowercase each token (pronouns keep their surface form)
    mytokens = [word.lemma_.lower().strip() if word.pos_ != "PRON" else word.text.lower() for word in mytokens ]

    # Return the preprocessed list of tokens
    return mytokens
```

In [39]:
nlp = spacy.load("en_core_web_sm")

# Create our tokenizer function
def spacy_tokenizer(sentence):
    """Accept a sentence and return its tokens after removing stop words and
    punctuation, lemmatizing, and lowercasing."""

    # Parse the sentence into a spaCy Doc with linguistic annotations
    doc = nlp(sentence)

    # Remove stop words and punctuation
    mytokens = [word for word in doc if not word.is_stop and word.pos_ != 'PUNCT']

    # Lemmatize and lowercase each token (pronouns keep their surface form)
    mytokens = [word.lemma_.lower().strip() if word.pos_ != "PRON" else word.text.lower() for word in mytokens ]

    # Return the preprocessed list of tokens
    return mytokens



```
spacy_tokenizer(sms.loc[345].message)
```

In [40]:
spacy_tokenizer(sms.loc[345].message)

['feel', 'chest', 'pain', 'past', 'monday', 'day']

## Retrieval practice on text preprocessing

# Feature Engineering
The objective is to predict whether a message is an appointment request or a medical concern. For a classification model to work with text, we must first convert it into a numeric format.

## Vectorization
- Labels are encoded as 1 or 0, such that medical=1 and appointment=0.
- We are going to use Bag of Words (BoW) to convert text into a numeric format.
- BoW converts text into a matrix that records which words occur in each document. With `binary=True` each entry marks whether a word occurred in the document; with `binary=False` it counts how many times the word occurred. The result is called the BoW matrix or document-term matrix.
- We are going to use sklearn's CountVectorizer to generate the BoW matrix.
- In CountVectorizer we will use the custom tokenizer `spacy_tokenizer` and the ngram range, which defines the combinations of adjacent words to use. A unigram is a single word and a bigram is a sequence of 2 consecutive words.
- Likewise, an n-gram is a sequence of n consecutive words.
- In this example we are going to use unigrams, so both the lower and upper bound of the ngram range will be 1, i.e. (1,1).

```
from sklearn.feature_extraction.text import CountVectorizer
```

In [41]:
from sklearn.feature_extraction.text import CountVectorizer

## First, test binary vectorization

```
bow_vector = CountVectorizer(tokenizer = spacy_tokenizer, ngram_range = (1,1), binary=True)
```

In [42]:
bow_vector = CountVectorizer(tokenizer = spacy_tokenizer, ngram_range = (1,1), binary=True)

```
sms.loc[0].message
```

In [43]:
sms.loc[0].message

'Do you have availability for a physical exam on Thursday?'

```
bow_vector.fit_transform(sms.loc[0:5].message).todense()
```

In [44]:
bow_vector.fit_transform(sms.loc[0:5].message).todense()



matrix([[0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1],
        [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0],
        [0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0],
        [0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0],
        [1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0],
        [0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0]])

```
# Convert all text into vectors
X = bow_vector.fit_transform(sms.message)
```

In [45]:
# Convert all text into vectors
X = bow_vector.fit_transform(sms.message)

```
X.shape
```

In [46]:
X.shape

(500, 78)

```
# Convert class label to integer 1 or 0
print(sms['label'].unique())
y = sms['label'].astype(int)
```

In [56]:
print(sms['label'].unique())
y = sms['label'].astype(int)


['0' '1']


# Split data into training and test sets
- We will use sklearn train_test_split to create training and test sets
- We will use 80% of the data as the training set and the remaining 20% as the test set
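The split used in this notebook passes `stratify=y`, which keeps the class proportions the same in both parts. A minimal sketch on toy data (the arrays `X_toy` and `y_toy` are assumptions for illustration) shows the effect:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 samples, perfectly balanced between class 0 and class 1
X_toy = np.arange(20).reshape(-1, 1)
y_toy = np.array([0] * 10 + [1] * 10)

# 80/20 split that preserves the 50/50 class balance in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, stratify=y_toy, test_size=0.2, random_state=42
)

print(np.bincount(y_tr))  # class counts in the training set
print(np.bincount(y_te))  # class counts in the test set
```

With the balanced 500-message dataset, the same call yields 400 training and 100 test messages with equal numbers of each class.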

```
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)
```

In [57]:


from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)



# Let us build a KNN classifier

```
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
cls = KNeighborsClassifier()
```

In [58]:


from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
cls = KNeighborsClassifier()



```
scores = cross_val_score(cls, X_train, y_train, scoring='accuracy')
```

In [59]:


scores = cross_val_score(cls, X_train, y_train, scoring='accuracy')



```
scores
```

In [60]:
scores

array([1., 1., 1., 1., 1.])
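The five scores come from scikit-learn's defaults: `cross_val_score` uses 5-fold cross-validation when `cv` is not given, and `KNeighborsClassifier()` uses 5 neighbors. The sketch below spells those defaults out on synthetic data from `make_classification`, which is an assumption standing in for the message vectors.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the document-term matrix and labels
X_toy, y_toy = make_classification(n_samples=100, n_features=5, random_state=0)

# Same model and evaluation as above, with the defaults made explicit
knn = KNeighborsClassifier(n_neighbors=5)
toy_scores = cross_val_score(knn, X_toy, y_toy, cv=5, scoring="accuracy")

print(toy_scores)         # one accuracy value per fold
print(toy_scores.mean())  # average accuracy across the 5 folds
```

Writing `n_neighbors` and `cv` explicitly makes it easier to vary them later, for example when comparing several values of k.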

```
np.mean(scores)
```

In [62]:
import numpy as np
np.mean(scores)

np.float64(1.0)

# Test the classifier

```
cls.fit(X_train, y_train)
```

In [63]:
cls.fit(X_train, y_train)

```
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score
```

In [64]:


from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score



```
preds = cls.predict(X_test)
```

In [65]:
preds = cls.predict(X_test)

```
preds.shape
```

In [66]:
preds.shape

(100,)

```
accuracy_score(y_test, preds)
```

In [67]:
accuracy_score(y_test, preds)

1.0

```
precision_score(y_test, preds)
```

In [68]:
precision_score(y_test, preds)

1.0

```
recall_score(y_test, preds)
```

In [69]:
recall_score(y_test, preds)

1.0

```
# sklearn metrics take the true labels first, then the predictions
print("Precision: {}".format(precision_score(y_test, preds)))
print("Recall: {}".format(recall_score(y_test, preds)))
print("F1-Measure: {}".format(f1_score(y_test, preds)))
print("Accuracy: {}".format(accuracy_score(y_test, preds)))
```

In [70]:


# sklearn metrics take the true labels first, then the predictions
print("Precision: {}".format(precision_score(y_test, preds)))
print("Recall: {}".format(recall_score(y_test, preds)))
print("F1-Measure: {}".format(f1_score(y_test, preds)))
print("Accuracy: {}".format(accuracy_score(y_test, preds)))



Precision: 1.0
Recall: 1.0
F1-Measure: 1.0
Accuracy: 1.0
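scikit-learn's metric functions expect the true labels first and the predictions second, i.e. `precision_score(y_true, y_pred)`. With perfect predictions the order makes no difference, but as soon as the classifier makes mistakes, swapping the arguments swaps precision and recall. A small sketch with made-up labels illustrates this:

```python
from sklearn.metrics import precision_score, recall_score

# Made-up labels: the classifier finds only one of the three positives
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 0, 0, 0, 0]

# Correct order: (y_true, y_pred)
correct_precision = precision_score(y_true, y_pred)
correct_recall = recall_score(y_true, y_pred)

# Swapped order: precision and recall trade places
swapped_precision = precision_score(y_pred, y_true)
swapped_recall = recall_score(y_pred, y_true)

print(correct_precision, correct_recall)
print(swapped_precision, swapped_recall)
```

Accuracy and F1 are symmetric in their two arguments for binary labels, so only precision and recall are affected.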



## Second, test count vectorization

```
bow_vector_tf = CountVectorizer(tokenizer = spacy_tokenizer, ngram_range = (1,1), binary=False)
```

In [71]:


bow_vector_tf = CountVectorizer(tokenizer = spacy_tokenizer, ngram_range = (1,1), binary=False)



```
# Convert all text into vectors
X = bow_vector_tf.fit_transform(sms.message)
```

In [72]:


# Convert all text into vectors
X = bow_vector_tf.fit_transform(sms.message)





```
X.shape
```

In [73]:
X.shape

(500, 78)

```
X[0].todense()
```

In [74]:
X[0].todense()

matrix([[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]])

# Split data into training and test sets
- We will use sklearn train_test_split to create training and test sets
- We will use 80% of the data as the training set and the remaining 20% as the test set

```
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)
```

In [75]:


X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)



# Let us build a KNN classifier

```
cls = KNeighborsClassifier()
```

In [76]:
cls = KNeighborsClassifier()

```
scores = cross_val_score(cls, X_train, y_train, scoring='accuracy')
```

In [77]:


scores = cross_val_score(cls, X_train, y_train, scoring='accuracy')



```
scores
```

In [78]:
scores

array([1., 1., 1., 1., 1.])

```
np.mean(scores)
```

In [79]:
np.mean(scores)

np.float64(1.0)

# Test the classifier

```
cls.fit(X_train, y_train)
```

In [80]:
cls.fit(X_train, y_train)

```
preds = cls.predict(X_test)
print("Precision: {}".format(precision_score(y_test, preds)))
print("Recall: {}".format(recall_score(y_test, preds)))
print("F1-Measure: {}".format(f1_score(y_test, preds)))
print("Accuracy: {}".format(accuracy_score(y_test, preds)))
```

In [81]:


preds = cls.predict(X_test)
print("Precision: {}".format(precision_score(y_test, preds)))
print("Recall: {}".format(recall_score(y_test, preds)))
print("F1-Measure: {}".format(f1_score(y_test, preds)))
print("Accuracy: {}".format(accuracy_score(y_test, preds)))



Precision: 1.0
Recall: 1.0
F1-Measure: 1.0
Accuracy: 1.0


## Retrieval practice on binary and count vectorization

```

```

## Test TFIDF vectorization

```
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vector = TfidfVectorizer(tokenizer = spacy_tokenizer)
```

In [82]:


from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vector = TfidfVectorizer(tokenizer = spacy_tokenizer)
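TF-IDF reweights raw counts so that words appearing in many documents count for less. With its default settings, `TfidfVectorizer` multiplies each term count by $\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1$, where $n$ is the number of documents and $\mathrm{df}(t)$ the number of documents containing term $t$, and then scales every row to unit L2 norm. The sketch below reproduces that computation by hand on a toy corpus (an assumption for illustration) and compares it with the vectorizer's output.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus (illustrative only)
docs = ["knee pain knee", "book exam monday", "knee exam"]

# TF-IDF as computed by scikit-learn with default settings
X_tfidf = TfidfVectorizer().fit_transform(docs).toarray()

# Manual computation: raw term counts ...
tf = CountVectorizer().fit_transform(docs).toarray().astype(float)
n_docs = tf.shape[0]
df = (tf > 0).sum(axis=0)                   # document frequency of each term

# ... times the smoothed inverse document frequency ...
idf = np.log((1 + n_docs) / (1 + df)) + 1
manual = tf * idf

# ... followed by L2 normalization of each row
manual /= np.linalg.norm(manual, axis=1, keepdims=True)

print(np.allclose(X_tfidf, manual))
```

Because each row has unit length, Euclidean distances between TF-IDF rows depend only on the angle between them, which suits the distance-based KNN classifier used here.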



```
# Convert all text into vectors
X = tfidf_vector.fit_transform(sms.message)
```

In [83]:


# Convert all text into vectors
X = tfidf_vector.fit_transform(sms.message)





```
X.shape
```

In [84]:
X.shape

(500, 78)

```
(X[100].toarray() != 0).sum()  # row index must be between 0 and 499
```

In [88]:
(X[100].toarray() != 0).sum()  # Use a valid row index like 0 to 499


np.int64(5)

# Split data into training and test sets
- We will use sklearn train_test_split to create training and test sets
- We will use 80% of the data as the training set and the remaining 20% as the test set

```
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)
```

In [89]:


X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)



# Let us build a KNN classifier

```
cls = KNeighborsClassifier()
```

In [90]:
cls = KNeighborsClassifier()

```
scores = cross_val_score(cls, X_train, y_train, scoring='accuracy')
```

In [91]:


scores = cross_val_score(cls, X_train, y_train, scoring='accuracy')



```
scores
```

In [92]:
scores

array([1., 1., 1., 1., 1.])

```
np.mean(scores)
```

In [93]:
np.mean(scores)

np.float64(1.0)

# Test the classifier

```
cls.fit(X_train, y_train)
```

In [94]:
cls.fit(X_train, y_train)

```
preds = cls.predict(X_test)
print("Precision: {}".format(precision_score(y_test, preds)))
print("Recall: {}".format(recall_score(y_test, preds)))
print("F1-Measure: {}".format(f1_score(y_test, preds)))
print("Accuracy: {}".format(accuracy_score(y_test, preds)))
```

In [95]:


preds = cls.predict(X_test)
print("Precision: {}".format(precision_score(y_test, preds)))
print("Recall: {}".format(recall_score(y_test, preds)))
print("F1-Measure: {}".format(f1_score(y_test, preds)))
print("Accuracy: {}".format(accuracy_score(y_test, preds)))



Precision: 1.0
Recall: 1.0
F1-Measure: 1.0
Accuracy: 1.0



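## Test Word Embeddings
- Use pretrained word vectors to embed each word in a message as a vector. The code below uses the 300-dimensional vectors that ship with spaCy's `en_core_web_md` model in place of word2vec.
- Use the mean of all word vectors in a message as the message embedding.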
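The averaging step can be sketched with plain NumPy. The three-dimensional `toy_vectors` table and the `mean_embedding` helper below are hypothetical stand-ins for a real pretrained embedding table, which would have hundreds of dimensions.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors (real models use e.g. 300 dimensions)
toy_vectors = {
    "knee": np.array([1.0, 0.0, 2.0]),
    "pain": np.array([3.0, 2.0, 0.0]),
    "severe": np.array([0.0, 4.0, 4.0]),
}


def mean_embedding(tokens, vectors, dim=3):
    """Average the vectors of the known tokens; return zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)


# "unknownword" has no vector and is skipped in this sketch
embedding = mean_embedding(["knee", "pain", "unknownword"], toy_vectors)
print(embedding)
```

Every message maps to a vector of the same length regardless of how many words it contains, which is what lets the embeddings feed directly into the classifier. Skipping unknown words is a choice made in this sketch; other implementations may treat them differently.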

In [96]:
!pip install spacy
!python -m spacy download en_core_web_md


import spacy
import numpy as np
from tqdm import tqdm

# Load the medium model (has 300d word vectors)
nlp = spacy.load("en_core_web_md")  # 300-dim like word2vec-google-news-300

# Function to get mean word embedding
def get_embedding(text):
    doc = nlp(text)
    return doc.vector  # Automatically averages word vectors

# Embed each message
message_embeddings = []
for message in tqdm(sms['message']):
    message_embeddings.append(get_embedding(message))

X = np.array(message_embeddings)
print(X.shape)


100%|██████████| 500/500 [00:04<00:00, 109.56it/s]

(500, 300)





# Split data into training and test sets
- We will use sklearn train_test_split to create training and test sets
- We will use 80% of the data as the training set and the remaining 20% as the test set

```
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)
```

In [97]:


X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)



# Let us build a KNN classifier

```
cls = KNeighborsClassifier()
```

In [98]:


cls = KNeighborsClassifier()



```
scores = cross_val_score(cls, X_train, y_train, scoring='accuracy')
```

In [99]:


scores = cross_val_score(cls, X_train, y_train, scoring='accuracy')



```
scores
```

In [100]:
scores

array([1., 1., 1., 1., 1.])

```
np.mean(scores)
```

In [101]:
np.mean(scores)

np.float64(1.0)

# Test the classifier

```
cls.fit(X_train, y_train)
```

In [102]:
cls.fit(X_train, y_train)

```
preds = cls.predict(X_test)
print("Precision: {}".format(precision_score(y_test, preds)))
print("Recall: {}".format(recall_score(y_test, preds)))
print("F1-Measure: {}".format(f1_score(y_test, preds)))
print("Accuracy: {}".format(accuracy_score(y_test, preds)))
```

In [103]:


preds = cls.predict(X_test)
print("Precision: {}".format(precision_score(y_test, preds)))
print("Recall: {}".format(recall_score(y_test, preds)))
print("F1-Measure: {}".format(f1_score(y_test, preds)))
print("Accuracy: {}".format(accuracy_score(y_test, preds)))



Precision: 1.0
Recall: 1.0
F1-Measure: 1.0
Accuracy: 1.0
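All four representations reach perfect scores, which may say more about the synthetic data than about the representations: the two classes are generated from separate template sets, so they probably share very few content words. A rough way to check this is to compare the vocabularies of the two classes. The sketch below does so on a handful of messages in the style of the templates, using simple whitespace tokenization as a simplifying assumption rather than `spacy_tokenizer`.

```python
# A few messages in the style of each template set (illustrative sample)
appointment_msgs = [
    "I'd like to schedule a check-up on Monday.",
    "Any openings for a dental cleaning on Friday?",
]
medical_msgs = [
    "Pain in my knee is affecting my sleep.",
    "Concerned about fever and would like medical advice.",
]


def vocab(messages):
    """Lowercased words of all messages, with surrounding punctuation stripped."""
    words = set()
    for msg in messages:
        for word in msg.lower().split():
            word = word.strip(".,?!")
            if word:
                words.add(word)
    return words


# Words that occur in both classes
shared = vocab(appointment_msgs) & vocab(medical_msgs)
print(sorted(shared))
```

When the shared vocabulary is this small, almost any representation separates the classes, so differences between binary, count, TF-IDF, and embedding features are more likely to show up on real, noisier text.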


## Retrieval practice on tfidf and embeddings

```

```