# <span style="color:Blue">Assignment-5 of COSC5806: Data Analysis with Python</span>

# <span style="color:Blue">Due date: Friday, March 28, 2025, @11:59 PM</span>
# <span style="color:Blue">Cut-off date: Sunday, March 30, 2025, @11:59 PM</span>

## <span style="color:Purple">You are allowed to use core Python built-in modules/packages/libraries, NumPy, Pandas, scikit-learn, matplotlib, Seaborn, NLTK, Gensim, etc. Please read the instructions carefully and do not hesitate to contact me if you have any questions.</span>

### <span style="color:Red">Examples and Resources for this assignment:</span>
<ul>
    <li><span style="color:Red">Chapter 7 from <a href="https://github.com/amueller/introduction_to_ml_with_python/tree/main">Working with Text Data</a></span></li>
</ul>

### <span style="color:Green">Context</span>
CSSRS-Suicide: This Reddit dataset is comprised of 2181 user data from the timeframe between 2005 and 2016 and from 15 mental health-related subreddits. Four practicing psychiatrists followed the guidelines outlined in the Columbia Suicide Severity Rating Scale (C-SSRS) and annotated 500 users' data on suicide risks in five levels: Supportive, Indicator, Ideation, Behavior, and Attempt.

The following <a href="https://scholarcommons.sc.edu/cgi/viewcontent.cgi?params=/context/aii_fac_pub/article/1002/&path_info=knowledge_aware_assessment_of_severity_of_suicide_risk_for_early_intervention.pdf">link</a> might be useful to know more about the dataset.

# <span style="color:Green">P1: Load the dataset.</span>

In [1]:
#Codes of P1 here
import pandas as pd
import numpy as np
df_reddit = pd.read_csv('500_Reddit_users_posts_labels.csv')
print(df_reddit.head(3)) # pd_reddit.head(3)
print(df_reddit['Label'].unique()) # pd_reddit['Label'].unique()
print(df_reddit.any()) 

     User                                               Post       Label
0  user-0  ['Its not a viable option, and youll be leavin...  Supportive
1  user-1  ['It can be hard to appreciate the notion that...    Ideation
2  user-2  ['Hi, so last night i was sitting on the ledge...    Behavior
['Supportive' 'Ideation' 'Behavior' 'Attempt' 'Indicator']
User     True
Post     True
Label    True
dtype: bool


# <span style="color:Green">P2: Print the number of posts and percentages for each label.</span>

In [2]:
#Codes of P2 here
# print the number of posts
print("The number of posts is", len(df_reddit['Post']))
#  Calculate percentage using value_counts/len() and then apply a lambda function to format the output
total_labels = len(df_reddit['Label'])
serie_labels = (df_reddit['Label'].value_counts() / total_labels).apply((lambda x: f"{x:.2%}"))
print(serie_labels)

The number of posts is 500
Label
Ideation      34.20%
Supportive    21.60%
Indicator     19.80%
Behavior      15.40%
Attempt        9.00%
Name: count, dtype: object


# <span style="color:Green">P3: Convert to lowercase. Remove contractions and punctuations. Remove leading and trailing whitespaces. </span>

In [3]:
#Codes of P3 here
# Convert to lowercase
df_reddit['Post'] = df_reddit['Post'].str.lower()

# Remove contractions and punctuations.
pattern = r'[\W+]'
text = df_reddit['Post'][0]
df_reddit['Post'] = df_reddit['Post'].str.replace(pattern, ' ', regex=True)
# Remove leading and trailing whitespace
df_reddit['Post'] = df_reddit['Post'].str.strip()

df_reddit['Post'].head(3) # df_reddit['Post'].head(3)

0    its not a viable option  and youll be leaving ...
1    it can be hard to appreciate the notion that y...
2    hi  so last night i was sitting on the ledge o...
Name: Post, dtype: object

# <span style="color:Green">P4: Tokenize the text based on whitespaces and remove single-character tokens like 'b', 'c', 'd', etc.</span>

In [4]:
#Codes of P4 here
# Remove single character words except for a and i
df_reddit['Post'] = df_reddit['Post'].str.replace(r'\b[B-HJ-Zb-hj-z0-9]\b', '', regex=True) # Remove words with 2 or fewer characters
# Tokenize the df_reddit['Post'] column
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
vect.fit(df_reddit['Post'])
print(f"Vocabulary size: {vect.vocabulary_}")
print(f"Vocabulary content:\n {vect.vocabulary_}")


Vocabulary content:


# <span style="color:Green">P5: Remove stopwords.</span>

In [21]:
#Codes of P5 here
# Remove stop words
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
vect = CountVectorizer(min_df=5, stop_words="english").fit(df_reddit['Post'])
print(f"Vocabulary size: {len(vect.vocabulary_)}")
print(f"Vocabulary content:\n {vect.vocabulary_}")



Vocabulary size: 4264
Vocabulary content:


# <span style="color:Green">P6: Use POS tagging to lemmatize each token. Use <a href="https://www.nltk.org/api/nltk.stem.WordNetLemmatizer.html?highlight=wordnet">WordNetLemmatizer</a>.</span>

In [22]:
#Codes of P6 here
# Lemmatization, nltk data is required
import nltk
from nltk.stem import WordNetLemmatizer
wnl = WordNetLemmatizer()
# print(wnl.lemmatize('responsibilities'))

# define a custom lemmatizer function
def my_lemmatizer(text):
    tokens = nltk.word_tokenize(text)
    lemmas = [wnl.lemmatize(token) for token in tokens]
    return lemmas
# Tokenize with leematization
lemma_vect = CountVectorizer(min_df=5, stop_words="english", tokenizer=my_lemmatizer).fit(df_reddit['Post'])

print(f"Vocabulary size:\n {len(vect.vocabulary_)}")
print(f"Vocabulary content:\n {lemma_vect.vocabulary_}")




Vocabulary size:
 4264
Vocabulary content:


# <span style="color:Green">P7: Perform label-wise topic modelling using Latent Dirichlet Allocation (LDA). Find an optimal number of topics using a coherence score. </span>

In [None]:
#Codes of P7 here
from sklearn.decomposition import LatentDirichletAllocation
# install gensim which need numpy 1.26.4, so you need to downgrade numpy first
import gensim
from gensim import corpora
from gensim.models.coherencemodel import CoherenceModel

# transform the data to a document-term matrix
X_train = lemma_vect.transform(df_reddit['Post'])
# get the feature names
feature_names = lemma_vect.get_feature_names_out()
# get the vocabulary in a list
texts = [my_lemmatizer(doc) for doc in df_reddit['Post']]
# create a gensim dictionary for the texts
id2word = corpora.Dictionary(texts)
coherence_values = []
model_list = []

# set topic range from 2 to 10, then test and compare them
topic_range = range(2, 11) 
no_top_words = 10

# Loop through the topic range and train LDA models
for num_topics in topic_range:
    # Train LDA model using num_topics
    lda = LatentDirichletAllocation(n_components=num_topics, random_state=42, max_iter=10)
    lda.fit(X_train)
    model_list.append(lda)
    
    topics = []
    # attend topic words
    for topic in lda.components_:
        top_indices = topic.argsort()[:-no_top_words - 1:-1]
        topic_words = [feature_names[i] for i in top_indices]
        topics.append(topic_words)
    
    # calculate coherence score for each model using different num_topics
    coherence_model = CoherenceModel(
        topics=topics,
        texts=texts,
        dictionary=id2word,
        coherence='c_v'   # 'c_v' is a common coherence setting
    )
    coherence_score = coherence_model.get_coherence()
    coherence_values.append(coherence_score)
    print(f"Num Topics = {num_topics} has Coherence Value of {coherence_score:.4f}")
optimal_index = np.argmax(coherence_values)
optimal_num_topics = topic_range[optimal_index]
print(f"Optimal number of topics: {optimal_num_topics}")



Num Topics = 2 has Coherence Value of 0.3555
Num Topics = 3 has Coherence Value of 0.3587
Num Topics = 4 has Coherence Value of 0.3603
Num Topics = 5 has Coherence Value of 0.3535
Num Topics = 6 has Coherence Value of 0.3593
Num Topics = 7 has Coherence Value of 0.3568
Num Topics = 8 has Coherence Value of 0.3619
Num Topics = 9 has Coherence Value of 0.3572
Num Topics = 10 has Coherence Value of 0.3591
Optimal number of topics: 8


# <span style="color:Green">P8: Analyze the topics extracted in the previous steps. See the example from pages 349-350 of this <a href="https://www.nrigroupindia.com/e-book/Introduction%20to%20Machine%20Learning%20with%20Python%20(%20PDFDrive.com%20)-min.pdf">link</a>.</span>

In [24]:
#Codes of P8 here
def display_topics(model, feature_names, no_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print(f"Topic {topic_idx}: ", " ".join([feature_names[i]
                                                 for i in topic.argsort()[:-no_top_words - 1:-1]]))

display_topics(model_list[optimal_index], feature_names, no_top_words)

Topic 0:  life thought people just wa dont u person help problem
Topic 1:  dont behavior hyperactive just know im thing life people feel
Topic 2:  youre like help im dont thing feel know people think
Topic 3:  xe2 x80 wa people life x99t behavior thing hyperactive im
Topic 4:  dont life want people just wa thing know like live
Topic 5:  like dont just people know im think youre thing want
Topic 6:  im just wa dont know ive feel really want time
Topic 7:  behavior hyperactive people thing time dont mental life just im


### <span style="color:Red">Please submit only your complete Jupyter notebook (.ipynb) file. Do not submit compressed files, entire projects, or any other types of files. Comment your program carefully so that it can be read and understood. If your program is not properly commented, you may lose marks. See \textbf{marking scheme} for details.</span>

### <span style="color:Red">Please note that the submitted work will be considered as your own work and you confirm that you have not received any unauthorized assistance including Large Language Models (LLMs) in preparing for or doing this lab/assignment/examination. You confirm knowing that a mark of 0 may be assigned for entire work.</span>

### <span style="color:Red">You will receive full credits for the working code, otherwise zero. No partial credits!</span>