In [1]:
from IPython.display import Image, display; display(Image(url="https://www.google.com/url?sa=i&url=https%3A%2F%2Fwww.sprinklr.com%2Fblog%2Fchatbot-examples%2F&psig=AOvVaw3GjLwPVFaNAUG6e4xKJYH2&ust=1705391165437000&source=images&cd=vfe&opi=89978449&ved=0CBMQjRxqFwoTCJDLi8yZ34MDFQAAAAAdAAAAABAI"))



## <div style="color:white;display:fill;border-radius:8px;background-color:#800080;font-size:150%; letter-spacing:1.0px"><p style="padding: 15px;color:white;"><b><b><span style='color:white'><span style='color:#F1A424'> | </span> </span></b>Defining the Question</b></p></div>

## <b><span style='color:#F1A424'>|</span> Executive Summary:</b> 

**Mental health, fundamentally a state of well-being, is crucial for individuals to realize their abilities, manage life's normal stresses, work productively, and contribute to their communities. Despite the rising global prevalence of mental health issues, including a 13% increase over the last decade noted by the WHO, access to effective treatments remains uneven, particularly among urban youths who face distinct challenges and stressors.
 Saidika, a burgeoning mental health service provider for urban youth, has encountered challenges due to the growing demand for mental health services. The volume of clients has impeded the prompt allocation of therapy resources, particularly for urgent cases, prompting the need for innovative solutions to enhance the efficiency and effectiveness of mental health care delivery. By leveraging the capabilities of AI and advancements in NLP, the project aims to bridge the gap between the growing demand for mental health services and the current limitations in supply and accessibility.**


## <b><span style='color:#F1A424'>|</span> Problem Statement:</b> 

**Saidika's platform is currently unable to efficiently handle the increasing influx of clients seeking mental health services. The inability to quickly triage and prioritize client needs is leading to potential delays in addressing urgent cases, which could have severe consequences on the well-being of individuals in need.**

## <b><span style='color:#F1A424'>|</span> Proposed Solution:</b> 

**The main objective is to integrate an advanced AI-powered mental health chatbot into Saidika's existing platform
to optimize client management processes, ensuring timely and appropriate allocation of therapy resources to those in need.**


## <b><span style='color:#F1A424'>|</span> Specific Objectives:</b> 
- **Client Categorization: To develop a chatbot that can accurately categorize clients based on their responses, distinguishing between varying levels of care requirements and scheduling clients based on their assessed needs and therapists' availability, optimizing the use of Saidika's resources.**
- **Urgency Escalation: To ensure the chatbot is capable of rapidly identifying and escalating urgent cases to therapists, facilitating prompt intervention.**
- **Service Accessibility: To broaden access to mental health care by providing a 24/7 chatbot service that will offer real-time interaction to clients who require immediate attention or a platform to express their concerns, bridging the gap until a professional is available.**
- **Resource Optimization: To aid therapists in managing their workload more effectively by allowing the chatbot to handle routine inquiries and non-urgent interactions.**
- **Data Collection and Analysis: To gather and analyze interaction data to continually improve the chatbot’s performance and the platform’s services.**
- **User Experience Enhancement: To create a user-friendly chatbot interface that provides a supportive environment for clients to express their concerns.**
- **Integration and Compatibility: To seamlessly integrate the chatbot into both web and mobile applications, ensuring functionality across various devices.**


## <b><span style='color:#F1A424'>|</span> Project Impact:</b> 

**The successful implementation of the mental health chatbot is expected to significantly improve the scalability of Saidika's services, enabling them to handle a greater volume of clients without sacrificing the quality of care. This technological solution aims not only to streamline operations but also to provide a critical early support system for individuals seeking mental health assistance. The chatbot's ability to analyze data will also furnish Saidika with valuable insights, driving policy and decision-making to better serve the community's mental health needs. Ultimately, the project endeavors to foster a more resilient urban youth population, better equipped to contribute positively to their communities.**

## DATA PERTINENCE AND ATTRIBUTION


**The business aims to gain valuable insights into mental health trends, sentiments, and urgency levels by leveraging a diverse dataset acquired from public domain resources and Saidika's private, anonymized user data with proper consent and privacy law adherence. The data primarily consists of information gathered from health forums, Reddit, a dedicated mental health forum, and Beyond Blue.**

**Data Preparation:**

**Data Sources: Public domain resources and private Saidika user data.**

**Variable Types:**

- **Categorical variables: Representing various types of mental health issues.**

- **Binary variables: Indicating urgency levels.**
- **Continuous variables: Expressing sentiment scores associated with mental health discussions.**

**Preprocessing Steps:**

- **Text data cleaning: Removal of identifiable information.**

- **Tokenization: Breaking down text into tokens.**

- **Lemmatization: Reducing words to their base or root form.**

- **Vectorization: Converting text into numerical vectors suitable for Natural Language Processing (NLP) tasks.**
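
The tokenization and vectorization steps listed above can be sketched with the standard library alone. This is a minimal illustration, not the pipeline used later in the notebook: the helper names `tokenize` and `bag_of_words` and the two sample sentences are hypothetical, and lemmatization is left out because it needs a language resource such as the ones shipped with NLTK or spaCy.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())


def bag_of_words(docs):
    """Return a sorted vocabulary and one count vector per document."""
    tokenized = [tokenize(doc) for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        # Counter returns 0 for words that do not occur in this document
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors


# Hypothetical sample sentences, used only for illustration
docs = ["I feel tired today", "Tired, so tired of today"]
vocab, vectors = bag_of_words(docs)
print(vocab)
print(vectors)
```

Each document becomes a fixed-length list of counts aligned with the shared vocabulary, which is the kind of numeric input a classifier such as logistic regression expects.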

**Libraries Used:**

- **BeautifulSoup: Utilized for parsing and extracting data from HTML content.**

- **Python Libraries (NLTK, spaCy): Applied for NLP tasks such as tokenization, lemmatization, and other text processing operations.**

**Algorithms:**

- **Logistic Regression: Employed for analyzing categorical and binary variables, predicting urgency levels based on mental health issues.**

- **LSTM (Long Short-Term Memory): Utilized for sequence modeling in NLP, capturing dependencies in sentiment scores over the course of discussions.**

- **BERT (Bidirectional Encoder Representations from Transformers): Implemented for advanced contextualized embeddings, enhancing understanding of the nuanced context within mental health discourse.**

- **GPT (Generative Pre-trained Transformer): Employed for generating human-like text responses and comprehending the context of mental health discussions.**

**Overall, the objective is to extract meaningful insights, patterns, and correlations from this rich dataset, contributing to a deeper understanding of mental health issues, sentiments, and urgency levels, ultimately informing strategies for better mental health support and intervention.**








## <div style="color:white;display:fill;border-radius:8px;background-color:#800080;font-size:150%; letter-spacing:1.0px"><p style="padding: 12px;color:white;"><b><b><span style='color:white'><span style='color:#F1A424'>1 |</span></span></b>Data Loading & Preparation</b></p></div>

## <b>1.1 <span style='color:#F1A424'>|</span> Importing Necessary Libraries</b> 

In [2]:
import re
import string
import numpy as np
import random
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns  #plotting statistical graphs
%matplotlib inline
from plotly import graph_objs as go
import plotly.express as px
import plotly.figure_factory as ff
# import squarify
from collections import Counter

# Load the Text Cleaning Package
import neattext.functions as nfx

from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator  # word clouds: a visualization of text data
# in which the size of each word indicates its frequency

from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.metrics import confusion_matrix,roc_auc_score,classification_report
from sklearn.compose import ColumnTransformer

from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier,AdaBoostClassifier,GradientBoostingClassifier,ExtraTreesClassifier
from sklearn.linear_model import RidgeClassifier,SGDClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB


import nltk  # toolkit for working with human language data
from nltk.corpus import stopwords

from tqdm import tqdm  # progress bars for long-running loops
import os
#import spacy #for training the NER model tokenize words
#import random
#from spacy.util import compounding
#from spacy.util import minibatch


pd.set_option('display.max_colwidth', 400)
pd.set_option('display.html.use_mathjax', False)


import warnings
warnings.filterwarnings("ignore")

## <b>1.2 <span style='color:#F1A424'>|</span>Loading in our Data</b> 

In [3]:
# load the dataset -> feature extraction -> data visualization -> data cleaning -> train test split
# -> model building -> model training -> model evaluation -> model saving -> streamlit application deploy

# load the full dataset; the columns we need are combined in the next cell
df = pd.read_csv('../data/Aggregated_Data_Final.csv')

df

Unnamed: 0,Subreddit,Reddit Post,Unnamed: 2
0,CPTSD,Feeling like I was made to be unlovable,"I don't know if it was the emotional neglect, the psychological abuse, medical abuse, bullying, the CSA, whatever. I'm a mess right now. I feel like a horrible monster that somewhat a lot of people see as attractive, but under the facade I'm still a monster. As if I was someone who was built for being unlovable and despised, physically and emotionally, since I was born. I keep working and wor..."
1,CPTSD,DAE not know what to do with themselves when they have time?,"See title.\n\nI used to be the person full of hobbies (biking, drawing, reading, writing, walking, gaming) who really disliked people who never knew what to do with their free time and would be clingy. Now I am one of them.\n\nThrough years of hard depression and su.c.dal.ty thanks to cptsd I have stopped all my hobbies. I entrench myself in work and by now also meeting people and sometimes ob..."
2,CPTSD,Yoga triggers me- anyone else?,"I was doing yoga for years as a tool to help me back into my body when I was feeling rough as a form of reconnection. I even went as far as becoming trained in teaching, doing a 200hr training. As my trauma symptoms peaked however yoga would actually start having the reverse effect and would dissociate me. (In retrospect I wonder if I was in fact being dissociated the whole time.)\n\nStarted a..."
3,CPTSD,Did anyone else have a parent who said - you can make the choice - do you want ho listen to the sweet loving voice of tell it in or should I beat you?,The child me thought I made the right choice by listening to him. And he said as much. That I had finally done something right . Especially when they kept blaming me for all the things I did wrong . Anyone else ?
4,CPTSD,"Women: What is the real situation of misogyny, patriarchy, sexual abuse and harassment in your country?",
...,...,...,...
27673,suicidewatch,"But I want help doing it Tired of everyone saying no don't.\n\n*Get up so I can punch you again*\n\nI'm exhausted I want to clock out and be done everything just keeps getting worse, everything keeps losing value including myself- as if any of that even mattered.\n\nFuck the helpline I want real help I want help out",
27674,suicidewatch,Nothing to live for The ONLY reason I am alive right now is because of my sweet cat Pippin. Yesterday was the anniversary of adopting him 2 years ago. \nI've been really depressed and haven't been able to play with him as much so hes been meowing and being a little naughty as a result. I got so mad yesterday and yelled at him. All I can think about now is how I should give him to someone healt...,
27675,suicidewatch,Iâ€™m going to fucking kill myself 18 years too long. I think Iâ€™m going to go,
27676,suicidewatch,Iâ€™m going to pieces All Iâ€™ve done for about a month has been lay in bed. I donâ€™t enjoy anything. Canâ€™t focus on anything. I am terrified of the future. I donâ€™t want to be alive. I am in so much emotional pain. The only reason I am alive is my father because I donâ€™t want to hurt him. That and I canâ€™t decide on a method. I think not wanting to hurt him keeps me from choosing a meth...,


In [4]:
# Combine 'Reddit Post' and 'Unnamed: 2' into a new column named "reddit_post":
# use 'Unnamed: 2' where it is present, otherwise fall back to 'Reddit Post'
df['reddit_post'] = df['Unnamed: 2'].fillna(df['Reddit Post'])

# Drop the original columns, which are no longer needed
df.drop(['Reddit Post', 'Unnamed: 2'], axis=1, inplace=True)
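
To make the column merge above easier to follow, here is a small sketch on a toy frame. The two rows are invented for illustration; only the column names come from the dataset. `fillna` with another Series fills each missing value from the row with the same index.

```python
import pandas as pd

# Toy rows (hypothetical): one post split across two columns, one stored in a single column
toy = pd.DataFrame({
    "Reddit Post": ["a title", "title and body in one field"],
    "Unnamed: 2": ["the post body", None],
})

# Prefer 'Unnamed: 2'; where it is missing, take the value from 'Reddit Post'
toy["reddit_post"] = toy["Unnamed: 2"].fillna(toy["Reddit Post"])
toy = toy.drop(columns=["Reddit Post", "Unnamed: 2"])
print(toy)
```

One consequence of this choice is that, for rows where both columns are populated, the text in 'Reddit Post' is discarded rather than concatenated.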


In [5]:
df

Unnamed: 0,Subreddit,reddit_post
0,CPTSD,"I don't know if it was the emotional neglect, the psychological abuse, medical abuse, bullying, the CSA, whatever. I'm a mess right now. I feel like a horrible monster that somewhat a lot of people see as attractive, but under the facade I'm still a monster. As if I was someone who was built for being unlovable and despised, physically and emotionally, since I was born. I keep working and wor..."
1,CPTSD,"See title.\n\nI used to be the person full of hobbies (biking, drawing, reading, writing, walking, gaming) who really disliked people who never knew what to do with their free time and would be clingy. Now I am one of them.\n\nThrough years of hard depression and su.c.dal.ty thanks to cptsd I have stopped all my hobbies. I entrench myself in work and by now also meeting people and sometimes ob..."
2,CPTSD,"I was doing yoga for years as a tool to help me back into my body when I was feeling rough as a form of reconnection. I even went as far as becoming trained in teaching, doing a 200hr training. As my trauma symptoms peaked however yoga would actually start having the reverse effect and would dissociate me. (In retrospect I wonder if I was in fact being dissociated the whole time.)\n\nStarted a..."
3,CPTSD,The child me thought I made the right choice by listening to him. And he said as much. That I had finally done something right . Especially when they kept blaming me for all the things I did wrong . Anyone else ?
4,CPTSD,"Women: What is the real situation of misogyny, patriarchy, sexual abuse and harassment in your country?"
...,...,...
27673,suicidewatch,"But I want help doing it Tired of everyone saying no don't.\n\n*Get up so I can punch you again*\n\nI'm exhausted I want to clock out and be done everything just keeps getting worse, everything keeps losing value including myself- as if any of that even mattered.\n\nFuck the helpline I want real help I want help out"
27674,suicidewatch,Nothing to live for The ONLY reason I am alive right now is because of my sweet cat Pippin. Yesterday was the anniversary of adopting him 2 years ago. \nI've been really depressed and haven't been able to play with him as much so hes been meowing and being a little naughty as a result. I got so mad yesterday and yelled at him. All I can think about now is how I should give him to someone healt...
27675,suicidewatch,Iâ€™m going to fucking kill myself 18 years too long. I think Iâ€™m going to go
27676,suicidewatch,Iâ€™m going to pieces All Iâ€™ve done for about a month has been lay in bed. I donâ€™t enjoy anything. Canâ€™t focus on anything. I am terrified of the future. I donâ€™t want to be alive. I am in so much emotional pain. The only reason I am alive is my father because I donâ€™t want to hurt him. That and I canâ€™t decide on a method. I think not wanting to hurt him keeps me from choosing a meth...


In [6]:
#summary of our DataFrame
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27678 entries, 0 to 27677
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Subreddit    27678 non-null  object
 1   reddit_post  27678 non-null  object
dtypes: object(2)
memory usage: 432.6+ KB


Finding the number of unique classes (subreddits) in our data

In [7]:
#obtain the unique values in the 'Subreddit' column
df.Subreddit.unique()

array(['CPTSD', 'diagnosedPTSD', 'alcoholism', 'socialanxiety',
       'suicidewatch'], dtype=object)
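
`unique()` lists the class labels; if the count itself or the class balance is of interest, `nunique()` and `value_counts()` are the usual companions. A minimal sketch on a toy column (the label names are taken from the output above, the row counts are made up):

```python
import pandas as pd

# Toy column: real subreddit names, hypothetical frequencies
toy = pd.DataFrame({"Subreddit": ["CPTSD", "alcoholism", "CPTSD", "suicidewatch"]})

n_classes = toy["Subreddit"].nunique()         # number of distinct labels
class_counts = toy["Subreddit"].value_counts()  # rows per label, most frequent first
print(n_classes)
print(class_counts)
```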

Below, we count the number of characters in each post.

In [8]:
#character count of each reddit post
df['reddit_post'].apply(str).apply(len)

0         443
1        1789
2         522
3         212
4         103
         ... 
27673     311
27674    1147
27675      79
27676     607
27677     526
Name: reddit_post, Length: 27678, dtype: int64

We find the number of null values in each column

In [9]:
#df.isna().sum()

In [10]:
#drop all NAN values in our dataframe
#df.dropna(inplace=True)


In [11]:
#check the number of null values 
#df.isna().sum()


Next, we find the number of words in each post

In [13]:
# word count in each reddit post (splitting on single spaces)
df['reddit_post'].dropna().apply(lambda x: len(x.split(" ")))

0         81
1        315
2         96
3         43
4         16
        ... 
27673     56
27674    225
27675     16
27676    125
27677     98
Name: reddit_post, Length: 27678, dtype: int64
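
One caveat on the word counts above: splitting on a single space treats newlines as part of a word and counts empty strings between repeated spaces. The sketch below, on two made-up strings, shows how `split(" ")` and the argument-free `split()` can disagree; which one is preferable here is a judgment call.

```python
# Hypothetical strings chosen to expose the difference
with_newlines = "one\n\ntwo three"
with_double_space = "a  b"

# split(" ") only breaks on single spaces
naive_newlines = len(with_newlines.split(" "))     # "one\n\ntwo" stays in one piece
naive_double = len(with_double_space.split(" "))   # an empty string is counted

# split() breaks on any run of whitespace
robust_newlines = len(with_newlines.split())
robust_double = len(with_double_space.split())

print(naive_newlines, robust_newlines)
print(naive_double, robust_double)
```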

## <div style="color:white;display:fill;border-radius:8px;background-color:#800080;font-size:150%; letter-spacing:1.0px"><p style="padding: 12px;color:white;"><b><b><span style='color:white'><span style='color:#F1A424'>2 |</span></span></b> Data Quality Checks</b></p></div>
   
- **Another crucial step in any project involves ensuring the quality of your data. Remember that your model’s performance is directly tied to the data it processes. Therefore, take the time to remove duplicates and handle missing values appropriately.**

- **Here we check for missing values and duplicates and remove any unnecessary variables/features/columns. Because the data is free text, conventional numeric outlier checks do not apply.**

## <b>2.1 <span style='color:#F1A424'>|</span> Checking for NaN Values</b> 

In [14]:
#check for the sum NaN values in our dataframe
df.isna().sum()

Subreddit      0
reddit_post    0
dtype: int64

In [15]:
# print the count of NaN values for each column
print(df.isna().sum())
print("*"*40)

Subreddit      0
reddit_post    0
dtype: int64
****************************************


**As noted, we have no missing values in our dataframe.**

## <b>2.2 <span style='color:#F1A424'>|</span> Checking for Sentence Length Consistency</b> 

In [16]:
df['reddit_post'].apply(len)

0         443
1        1789
2         522
3         212
4         103
         ... 
27673     311
27674    1147
27675      79
27676     607
27677     526
Name: reddit_post, Length: 27678, dtype: int64

**This gives an overview of the number of characters per post (`len` on a string counts characters, not words). Very short posts, for example those of five characters or fewer, would contribute little to a predictive model, so we check for them next.**

In [17]:
sum(df['reddit_post'].apply(len) > 5) , sum(df['reddit_post'].apply(len) <= 5)

(27678, 0)

**Every post is longer than five characters.**

## <b>2.3 <span style='color:#F1A424'>|</span> Checking for Duplicates</b> 

In [18]:
#check and print the number of duplicates
print(df.duplicated().sum())
print("*"*40)

12
****************************************


**We notice that we have 12 duplicate rows.**
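
Note that `df.duplicated()` with no arguments only flags rows where every column matches. The same text posted under two different subreddits would not be counted. The sketch below, on invented rows, contrasts full-row duplicates with duplicates on the text column alone; whether text-only duplicates should also be dropped is left as an open choice.

```python
import pandas as pd

# Toy rows (hypothetical): the same text appears three times, twice under the same label
toy = pd.DataFrame({
    "Subreddit": ["CPTSD", "CPTSD", "socialanxiety", "alcoholism"],
    "reddit_post": ["same text", "same text", "same text", "other text"],
})

full_row_dupes = int(toy.duplicated().sum())                       # all columns must match
text_dupes = int(toy.duplicated(subset=["reddit_post"]).sum())     # only the text must match

# Keep the first occurrence of each text and rebuild a gap-free index
deduped = toy.drop_duplicates(subset=["reddit_post"]).reset_index(drop=True)
print(full_row_dupes, text_dupes, len(deduped))
```

Resetting the index after dropping rows avoids gaps in the labels, which matters if later cells look rows up by position-like labels.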

In [19]:
# inspect a sample of posts whose text appears more than once
df[df.duplicated(subset=['reddit_post'],keep=False)].sort_values(by='reddit_post').sample(10)

Unnamed: 0,Subreddit,reddit_post
20448,suicidewatch,"If you're going to commit suicide, don't go quietly. Take a psychiatrist or two to Hell with you! Better yet, there's going to be a psychiatrist convention on April 25-29 in Philadelphia. Go there, bring a gun, and open fire!\n\n[https://www.psychiatry.org/psychiatrists/meetings/annual-meeting](https://www.psychiatry.org/psychiatrists/meetings/annual-meeting)"
4206,socialanxiety,Thereâ€™s a party going on in my house and iâ€™m stuck in my room. I hate that iâ€™m like this but donâ€™t know any different.
1755,alcoholism,How yâ€™all get so much booze Iâ€™m 17 and working towards it but I always hit a speedbump whether itâ€™s money or finding somewhere to get some I find myself getting high off household appliances whenever I canâ€™t find any and thatâ€™s not good for me so I just want some ideas on finding a steady constant source
26120,suicidewatch,I canâ€™t stop crying WHY WONT ANYONE HELP ME???
20423,suicidewatch,Bye guys I think I made my choice. \nI realized things are not going to change.\nNo one cares about me.\nIâ€™m very traumatized about something that happened to me.\nI canâ€™t bear it anymore.\nIâ€™m just going to hang myself.\nIâ€™m so tired.\nSo alone. \nBye
4837,socialanxiety,"I need help My grandma says that I have to go to a ""school"" but Im suspicious, I can't find anything about it online and my grandma is being weird about it. I have a feeling they are gonna drive me to a pysch ward as a punishment. They did it before and no one listened to me. I'm almost 100 percent sure that's what they are gonna do, I'm not depressed or anything. I got kicked out of school la..."
832,CPTSD,#NAME?
16952,suicidewatch,"I'm going to go shoot up an elementary school as soon as I run out of money. I can't work, and I can't get a job. I have some money, about $40,000. Inheritance. As soon as my money runs out, I'm going to go shoot up an elementary school and spend the rest of my life in prison.\n\nI'll do it because prison is better than being homeless, and because I want to ""give back"" to this fucked-up world ..."
4590,socialanxiety,Question about benzodiazepines/social exposure therapy It's been a while since I've seen a doctor about some of my social issues and I wanted some perspective before seeking any specific medication. Have benzodiazepines ever facilitated for you any kind of exposure therapy? Does the ease benzodiazepines bring in social scenarios leave any lasting effects that go beyond the timeframe the drug i...
14521,suicidewatch,Am I strange for being aroused by a man's looks? Or does it make me dirty that I like looking at men? Is it common? I ask because Ive been reading a lot on reddit about dating &amp; attraction lately. A lot of men keep telling me that most women are able to be aroused by ugly men so long as those men say &amp; do the right things. I am not this way though. I just like beauty. My question is: i...


In [20]:
df = df.drop_duplicates()

print(df.duplicated().sum())
print("*"*40)

0
****************************************


In [21]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 27666 entries, 0 to 27677
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Subreddit    27666 non-null  object
 1   reddit_post  27666 non-null  object
dtypes: object(2)
memory usage: 648.4+ KB


## <div style="color:white;display:fill;border-radius:8px;background-color:#800080;font-size:150%; letter-spacing:1.0px"><p style="padding: 12px;color:white;"><b><b><span style='color:white'><span style='color:#F1A424'>3 |</span></span></b> Data Preprocessing</b></p></div>




## <b>3.1 <span style='color:#F1A424'>|</span> Cleaning Textual Data</b> 

We will clean and preprocess the textual data in the dataset to enhance its quality and consistency:
- Remove unnecessary characters.
- Convert text to lowercase for uniformity.
- Tokenization: Tokenize the text data to break it into individual words or tokens. This step is crucial for further analysis of the textual content.
- Normalization: Apply normalization techniques, such as stemming or lemmatization, to reduce words to their base or root forms. This aids in standardizing the text.
- Stop Word Removal: Eliminate common stop words from the text to focus on meaningful content. Stop words often do not contribute significantly to the analysis.
- Entity Recognition: Identify and recognize entities within the text. This step is particularly useful when dealing with named entities or specific information entities.
- Syntax Parsing: Perform syntax parsing to analyze the grammatical structure of sentences. This can provide insights into relationships between words.
- Text Transformation: Implement additional text transformations as needed for your specific analysis or modeling requirements.

**We will utilize the NeatText Library for text cleaning, a straightforward NLP package designed for cleaning and preprocessing textual data. This library simplifies the process of cleaning unstructured text by handling tasks such as removing special characters and stopwords, thereby reducing noise in the data.**
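
For readers without NeatText installed, the first few cleaning steps (lowercasing, URL removal, special-character removal, whitespace collapsing) can be approximated with regular expressions. This is a rough stand-in under simplifying assumptions, not NeatText's implementation: the function `basic_clean`, the patterns, and the sample string are all hypothetical.

```python
import re

URL_RE = re.compile(r"https?://\S+")        # simple http/https URL pattern
NON_WORD_RE = re.compile(r"[^a-z0-9\s']")   # anything except letters, digits, whitespace, apostrophes
SPACES_RE = re.compile(r"\s+")              # any run of whitespace, including newlines


def basic_clean(text):
    """Lowercase, drop URLs and special characters, then collapse whitespace."""
    text = text.lower()
    text = URL_RE.sub(" ", text)       # URLs go first, while '://' is still intact
    text = NON_WORD_RE.sub(" ", text)
    return SPACES_RE.sub(" ", text).strip()


sample = "See title.\n\nVisit https://example.com NOW!!"
cleaned = basic_clean(sample)
print(cleaned)
```

Apostrophes are kept on purpose so that contractions such as "don't" remain recognizable for a later expansion step.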

In [22]:
# load the text cleaning packages

import neattext as nt
import neattext.functions as nfx

# Functions, classes and constants exposed by the neattext module
dir(nt)

['AUTOMATED_READ_INDEX',
 'BTC_ADDRESS_REGEX',
 'CONTRACTIONS_DICT',
 'CURRENCY_REGEX',
 'CURRENCY_SYMB_REGEX',
 'Callable',
 'Counter',
 'CreditCard_REGEX',
 'DATE_REGEX',
 'EMAIL_REGEX',
 'EMOJI_REGEX',
 'FUNCTORS_WORDLIST',
 'HASTAG_REGEX',
 'HTML_TAGS_REGEX',
 'List',
 'MASTERCard_REGEX',
 'MD5_SHA_REGEX',
 'MOST_COMMON_PUNCT_REGEX',
 'NUMBERS_REGEX',
 'PHONE_REGEX',
 'PUNCT_REGEX',
 'PoBOX_REGEX',
 'SPECIAL_CHARACTERS_REGEX',
 'STOPWORDS',
 'STOPWORDS_de',
 'STOPWORDS_en',
 'STOPWORDS_es',
 'STOPWORDS_fr',
 'STOPWORDS_ru',
 'STOPWORDS_yo',
 'STREET_ADDRESS_REGEX',
 'TextCleaner',
 'TextExtractor',
 'TextFrame',
 'TextMetrics',
 'TextPipeline',
 'Tuple',
 'URL_PATTERN',
 'USER_HANDLES_REGEX',
 'VISACard_REGEX',
 'ZIP_REGEX',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 '__version__',
 'clean_text',
 'defaultdict',
 'digit2words',
 'emoji_explainer',
 'emojify',
 'explainer',
 'extract_btc_address',
 

### <b>3.1.1 <span style='color:#F1A424'>|</span> Mentions / User Handles</b> 

In [23]:
# Noise scan
df['reddit_post'].apply(lambda x: nt.TextFrame(x).noise_scan()['text_noise'])

0        13.769752
1        11.906093
2        12.835249
3        14.150943
4        10.679612
           ...    
27673    12.540193
27674    14.734089
27675     8.860759
27676    14.332784
27677    14.068441
Name: reddit_post, Length: 27666, dtype: float64

In [24]:
# Ensure all entries in reddit_post column are strings
df['reddit_post'] = df['reddit_post'].astype(str)

# Now apply the clean_text function
df['clean_post'] = df['reddit_post'].apply(lambda x: nfx.clean_text(x, puncts=False, stopwords=False))

In [27]:
df

Unnamed: 0,Subreddit,reddit_post,clean_post
0,CPTSD,"I don't know if it was the emotional neglect, the psychological abuse, medical abuse, bullying, the CSA, whatever. I'm a mess right now. I feel like a horrible monster that somewhat a lot of people see as attractive, but under the facade I'm still a monster. As if I was someone who was built for being unlovable and despised, physically and emotionally, since I was born. I keep working and wor...","i don't know if it was the emotional neglect, the psychological abuse, medical abuse, bullying, the csa, whatever. i'm a mess right now. i feel like a horrible monster that somewhat a lot of people see as attractive, but under the facade i'm still a monster. as if i was someone who was built for being unlovable and despised, physically and emotionally, since i was born. i keep working and work..."
1,CPTSD,"See title.\n\nI used to be the person full of hobbies (biking, drawing, reading, writing, walking, gaming) who really disliked people who never knew what to do with their free time and would be clingy. Now I am one of them.\n\nThrough years of hard depression and su.c.dal.ty thanks to cptsd I have stopped all my hobbies. I entrench myself in work and by now also meeting people and sometimes ob...","see title. i used to be the person full of hobbies (biking, drawing, reading, writing, walking, gaming) who really disliked people who never knew what to do with their free time and would be clingy. now i am one of them. through years of hard depression and su.c.dal.ty thanks to cptsd i have stopped all my hobbies. i entrench myself in work and by now also meeting people and sometimes obligato..."
2,CPTSD,"I was doing yoga for years as a tool to help me back into my body when I was feeling rough as a form of reconnection. I even went as far as becoming trained in teaching, doing a 200hr training. As my trauma symptoms peaked however yoga would actually start having the reverse effect and would dissociate me. (In retrospect I wonder if I was in fact being dissociated the whole time.)\n\nStarted a...","i was doing yoga for years as a tool to help me back into my body when i was feeling rough as a form of reconnection. i even went as far as becoming trained in teaching, doing a 200hr training. as my trauma symptoms peaked however yoga would actually start having the reverse effect and would dissociate me. (in retrospect i wonder if i was in fact being dissociated the whole time.) started agai..."
3,CPTSD,The child me thought I made the right choice by listening to him. And he said as much. That I had finally done something right . Especially when they kept blaming me for all the things I did wrong . Anyone else ?,the child me thought i made the right choice by listening to him. and he said as much. that i had finally done something right . especially when they kept blaming me for all the things i did wrong . anyone else ?
4,CPTSD,"Women: What is the real situation of misogyny, patriarchy, sexual abuse and harassment in your country?","women: what is the real situation of misogyny, patriarchy, sexual abuse and harassment in your country?"
...,...,...,...
27673,suicidewatch,"But I want help doing it Tired of everyone saying no don't.\n\n*Get up so I can punch you again*\n\nI'm exhausted I want to clock out and be done everything just keeps getting worse, everything keeps losing value including myself- as if any of that even mattered.\n\nFuck the helpline I want real help I want help out","but i want help doing it tired of everyone saying no don't. *get up so i can punch you again* i'm exhausted i want to clock out and be done everything just keeps getting worse, everything keeps losing value including myself- as if any of that even mattered. fuck the helpline i want real help i want help out"
27674,suicidewatch,Nothing to live for The ONLY reason I am alive right now is because of my sweet cat Pippin. Yesterday was the anniversary of adopting him 2 years ago. \nI've been really depressed and haven't been able to play with him as much so hes been meowing and being a little naughty as a result. I got so mad yesterday and yelled at him. All I can think about now is how I should give him to someone healt...,nothing to live for the only reason i am alive right now is because of my sweet cat pippin. yesterday was the anniversary of adopting him 2 years ago. i've been really depressed and haven't been able to play with him as much so hes been meowing and being a little naughty as a result. i got so mad yesterday and yelled at him. all i can think about now is how i should give him to someone healthi...
27675,suicidewatch,Iâ€™m going to fucking kill myself 18 years too long. I think Iâ€™m going to go,iâ€™m going to fucking kill myself 18 years too long. i think iâ€™m going to go
27676,suicidewatch,Iâ€™m going to pieces All Iâ€™ve done for about a month has been lay in bed. I donâ€™t enjoy anything. Canâ€™t focus on anything. I am terrified of the future. I donâ€™t want to be alive. I am in so much emotional pain. The only reason I am alive is my father because I donâ€™t want to hurt him. That and I canâ€™t decide on a method. I think not wanting to hurt him keeps me from choosing a meth...,iâ€™m going to pieces all iâ€™ve done for about a month has been lay in bed. i donâ€™t enjoy anything. canâ€™t focus on anything. i am terrified of the future. i donâ€™t want to be alive. i am in so much emotional pain. the only reason i am alive is my father because i donâ€™t want to hurt him. that and i canâ€™t decide on a method. i think not wanting to hurt him keeps me from choosing a meth...


### <b>3.1.2 <span style='color:#F1A424'>|</span> Hashtags</b> 

In [31]:
# Extract hashtags into another column before removing them
df['hashtags'] = df['clean_post'].apply(nfx.extract_hashtags)

df[['reddit_post', 'clean_post', 'hashtags']].head()

In [None]:
# Remove hashtags
df['clean_post'] = df['clean_post'].apply(nfx.remove_hashtags)

df[['reddit_post', 'clean_post', 'hashtags']].head()
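
The extract-then-remove order used above can be sketched with the standard library alone. This is a minimal approximation, assuming a hashtag is `#` followed by word characters; `HASHTAG_PATTERN` and both helper functions are hypothetical stand-ins, and neattext's own pattern may behave differently on edge cases.

```python
import re

# Assumed, simplified pattern: '#' followed by one or more word characters.
# This is NOT neattext's pattern, only an illustration of the same idea.
HASHTAG_PATTERN = re.compile(r"#\w+")

def extract_hashtags(text):
    """Return every hashtag in the text, in order of appearance."""
    return HASHTAG_PATTERN.findall(text)

def remove_hashtags(text):
    """Delete hashtags from the text, leaving the surrounding words."""
    return HASHTAG_PATTERN.sub("", text)

sample = "feeling low again #depression #anxiety today"
tags = extract_hashtags(sample)    # capture the hashtags first ...
cleaned = remove_hashtags(sample)  # ... then strip them from the text

print(tags)
print(cleaned.split())
```

Removing a hashtag leaves extra spaces behind, which is why the multiple-whitespace step comes later in the cleaning sequence.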

### <b>3.1.3 <span style='color:#F1A424'>|</span> URLs</b> 

In [None]:
# Extract URLs into another column before removing them.
# This must happen before special characters are removed: once characters such as '//' are stripped, the function is unable to detect the URLs.
df['urls'] = df['clean_post'].apply(nfx.extract_urls)

df[['reddit_post', 'clean_post', 'urls']].sample(5)

In [None]:
df[['reddit_post', 'clean_post', 'urls']].loc[15]

In [None]:
df[['reddit_post', 'clean_post', 'urls']].loc[16515]

In [None]:
df[['reddit_post', 'clean_post', 'urls']].loc[12827]

In [None]:
# Remove URLs
df['clean_post'] = df['clean_post'].apply(nfx.remove_urls)

In [None]:
df[['reddit_post', 'clean_post', 'urls']].loc[15]
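
Why the URL step has to run before special characters are removed can be shown with a small standard-library sketch. Both regular expressions here are simplified assumptions for illustration, not the patterns neattext uses, and the sample sentence is made up.

```python
import re

# Assumed, simplified patterns (not neattext's own).
URL_PATTERN = re.compile(r"https?://\S+")
SPECIAL_PATTERN = re.compile(r"[^A-Za-z0-9\s]")

sample = "this helped me https://example.com/help please read"

# Intended order: extract and remove URLs while '://' and '.' are intact,
# then strip the remaining special characters.
urls = URL_PATTERN.findall(sample)
urls_first = SPECIAL_PATTERN.sub("", URL_PATTERN.sub("", sample))

# Reversed order: stripping punctuation first mangles the URL ...
specials_first = SPECIAL_PATTERN.sub("", sample)
# ... so the URL pattern no longer matches and the debris stays in the text.
urls_after = URL_PATTERN.findall(specials_first)

print(urls)
print(specials_first)
print(urls_after)
```

In the reversed order the URL collapses into a single meaningless token that would later be tokenized as if it were a word.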

### <b>3.1.4 <span style='color:#F1A424'>|</span> Special Characters</b> 

In [None]:
# Remove special characters

df['clean_post'] = df['clean_post'].apply(nfx.remove_special_characters)

df[['reddit_post', 'clean_post']].sample(5)

### <b>3.1.5 <span style='color:#F1A424'>|</span> Multiple Whitespaces</b> 

In [None]:
# Collapse multiple whitespaces
df['clean_post'] = df['clean_post'].apply(nfx.remove_multiple_spaces)

df[['reddit_post', 'clean_post']].head()

### <b>3.1.6 <span style='color:#F1A424'>|</span> Emojis</b> 

In [None]:
# Remove emojis
df['clean_post'] = df['clean_post'].apply(nfx.remove_emojis)

df[['reddit_post', 'clean_post']].sample(5)

### <b>3.1.7 <span style='color:#F1A424'>|</span> Contractions</b> 

In [None]:
import contractions

# Apply the contractions.fix function to the clean_post column
df['clean_post'] = df['clean_post'].apply(contractions.fix)

df[['reddit_post', 'clean_post']].head()

### <b>3.1.8 <span style='color:#F1A424'>|</span> Stopwords</b> 

In [None]:
# Extract stopwords
df['clean_post'].apply(lambda x: nt.TextExtractor(x).extract_stopwords())

In [None]:
# Remove the stop words

df['clean_post'] = df['clean_post'].apply(nfx.remove_stopwords)

df[['reddit_post', 'clean_post']].head()

In [None]:
# Noise scan after cleaning the text
df['clean_post'].apply(lambda x: nt.TextFrame(x).noise_scan()['text_noise'])

## <b>3.2 <span style='color:#F1A424'>|</span> Linguistic Processing (Clean Text)</b> 

+ Tokenization
+ Stemming / Lemmatization
+ Parts of Speech Tagging
+ Calculating Sentiment Based on Polarity & Subjectivity

### <b>3.2.1 <span style='color:#F1A424'>|</span> Tokenization</b> 

In [None]:
test_sample = df['clean_post'].loc[12827]

test_sample

In [None]:
from nltk.tokenize import RegexpTokenizer

basic_token_pattern = r"(?u)\b\w\w+\b"

tokenizer = RegexpTokenizer(basic_token_pattern)

tokenizer.tokenize(test_sample)

In [None]:
# Tokenise the clean_post column
df['preprocessed_tweet'] = df['clean_post'].apply(tokenizer.tokenize)

# df.iloc[100]["preprocessed_tweet"][:20]

In [None]:
df[['clean_post', 'preprocessed_tweet']].iloc[100]
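
To see what `basic_token_pattern` keeps and drops, the same pattern can be applied with `re.findall`. As far as I understand, `RegexpTokenizer` with its default settings matches tokens in much the same way. The sample sentence is made up for illustration.

```python
import re

# Same pattern as above: a token is two or more consecutive word characters.
basic_token_pattern = r"(?u)\b\w\w+\b"

sample = "i am so tired of this and i can not sleep at 3 am"
tokens = re.findall(basic_token_pattern, sample)

# Single-character tokens such as "i" and "3" do not match the pattern.
dropped = [word for word in sample.split() if word not in tokens]

print(tokens)
print(dropped)
```

The pattern silently discards one-character tokens, including the pronoun "i", which may matter for first-person posts like the ones in this dataset.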

### <b>3.2.2 <span style='color:#F1A424'>|</span> Lemmatization</b> 

In [None]:
# Define a function to lemmatise the tokens
def lemmatise_tokens(tokens):
    lemmatizer = nltk.stem.WordNetLemmatizer()
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]
    return lemmatized_tokens

# Lemmatise the tokens
df['lemma_preprocessed_tweet'] = df['preprocessed_tweet'].apply(lambda x: lemmatise_tokens(x))

# df.iloc[100]["preprocessed_tweet"][:20]
    

In [None]:
df[['clean_post', 'lemma_preprocessed_tweet']].iloc[260]

In [None]:
# Define a function to stem the tokens
def stem_tokens(tokens):
    stemmer = nltk.stem.PorterStemmer()
    stemmed_tokens = [stemmer.stem(token) for token in tokens]
    return stemmed_tokens

# Stem the tokens
df['stemma_preprocessed_tweet'] = df['preprocessed_tweet'].apply(lambda x: stem_tokens(x))

# df.iloc[100]["preprocessed_tweet"][:20]

In [None]:
df[['clean_post', 'stemma_preprocessed_tweet']].iloc[200]

### <b>3.2.3 <span style='color:#F1A424'>|</span> Calculating Sentiment Based on Polarity & Subjectivity</b>

TextBlob is a Python library for processing textual data, including sentiment analysis, and it builds on the Natural Language Toolkit (NLTK). The `sentiment` property of a `TextBlob` object returns two scores: polarity and subjectivity. Polarity is a float in the range [-1, 1], where -1 indicates the most negative sentiment and 1 the most positive. Subjectivity is a float in the range [0, 1], where 0 is very objective and 1 is very subjective.
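
To make the two score ranges concrete, here is a toy lexicon-based scorer. It is only a sketch of the general idea of averaging word-level scores: the lexicon entries and the `toy_sentiment` function are invented for illustration and are not taken from TextBlob, whose real analyzer also accounts for negation and intensifiers.

```python
# Toy lexicon: word -> (polarity, subjectivity).
# These values are made up for illustration; they are NOT TextBlob's.
TOY_LEXICON = {
    "great": (0.8, 0.75),
    "happy": (0.8, 1.0),
    "terrible": (-1.0, 1.0),
    "tired": (-0.4, 0.7),
}

def toy_sentiment(text):
    """Average the scores of known words; unknown words are ignored."""
    scores = [TOY_LEXICON[w] for w in text.lower().split() if w in TOY_LEXICON]
    if not scores:
        # No opinion words found: neutral polarity, fully objective.
        return 0.0, 0.0
    polarity = sum(p for p, _ in scores) / len(scores)
    subjectivity = sum(s for _, s in scores) / len(scores)
    return polarity, subjectivity

print(toy_sentiment("i am tired and feel terrible"))
print(toy_sentiment("the bus arrived"))
```

Text containing no lexicon words scores exactly 0.0 on polarity, which is the case a check such as `score == 0` picks up as neutral.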

In [None]:
from textblob import TextBlob

# Create a function to get the subjectivity
def getSubjectivity(text):
  return TextBlob(text).sentiment.subjectivity

# Create a function to get the polarity
def getPolarity(text):
  return TextBlob(text).sentiment.polarity

# Create two new columns 'Subjectivity' & 'Polarity'
df['Subjectivity'] = df['clean_post'].apply(getSubjectivity)
df['Polarity'] = df['clean_post'].apply(getPolarity)

# Show the new dataframe with columns 'Subjectivity' & 'Polarity'
df[['clean_post','Subjectivity','Polarity']].head()

In [None]:
# Create a function to label a polarity score as negative, neutral or positive
def getAnalysis(score):
  if score < 0:
    return 'Negative'
  elif score == 0:
    return 'Neutral'
  else:
    return 'Positive'

df['sentiment'] = df['Polarity'].apply(getAnalysis)

# Show the dataframe
df[['clean_post','Subjectivity','Polarity','sentiment']].head()

In [None]:
df['sentiment'].value_counts()

In [None]:
# # using VADER
# from nltk.sentiment.vader import SentimentIntensityAnalyzer
# analyser = SentimentIntensityAnalyzer()

# # Create a function to get the sentiment scores
# def sentiment_analyzer_scores(text):
#     score = analyser.polarity_scores(text)
#     return score

# # Get the compound sentiment scores
# df['compound_sentiment'] = df['clean_post'].apply(lambda x: sentiment_analyzer_scores(x)['compound'])

# # Get the sentiment scores whereby there is positive, negative and neutral sentiment
# df['sentiment'] = df['compound_sentiment'].apply(lambda x: 'positive' if x >= 0.05 else ('negative' if x <= -0.05 else 'neutral'))

# df[['clean_post', 'compound_sentiment', 'sentiment']].head()

In [None]:
# df['sentiment'].value_counts()

In [None]:
df['preprocessed_tweet']

In [None]:
df['lemma_preprocessed_tweet'] = df['lemma_preprocessed_tweet'].apply(lambda x: ' '.join(x))

In [None]:
df['stemma_preprocessed_tweet'] = df['stemma_preprocessed_tweet'].apply(lambda x: ' '.join(x))

df['preprocessed_tweet'] = df['preprocessed_tweet'].apply(lambda x: ' '.join(x))

In [None]:
df['preprocessed_tweet']

In [None]:
df.info()

In [None]:
# Save the dataframe to the data folder as 'interim_data.csv'
# df.to_csv('interim_data.csv', index=False)