# Text Classification for Detecting Depression and Anxiety in Twitter (X) Users with Classical Models

#### About Dataset
This dataset is a curated collection of statements, each tagged with a mental health status. It combines raw data from multiple sources, cleaned and compiled into a single resource for developing chatbots and performing sentiment analysis.

**Data Source**:
The dataset integrates information from the following Kaggle datasets:

- 3k Conversations Dataset for Chatbot
- Depression Reddit Cleaned
- Human Stress Prediction
- Predicting Anxiety in Mental Health Data
- Mental Health Dataset Bipolar
- Reddit Mental Health Data
- Students Anxiety and Depression 
- Suicidal Mental Health Dataset
- Suicidal Tweet Detection Dataset

**Data Overview**:
The dataset consists of statements tagged with one of the following seven mental health statuses:

- Normal
- Depression
- Suicidal
- Anxiety
- Stress
- Bipolar
- Personality Disorder

**Data Collection**:
The data is sourced from several platforms, including Reddit posts, Twitter posts, and other social media. Each entry is tagged with a specific mental health status, which makes the dataset useful for:

- Developing intelligent mental health chatbots.
- Performing in-depth sentiment analysis.
- Research and studies related to mental health trends.

**Features**:

- unique_id: A unique identifier for each entry.
- Statement: The textual data or post.
- Mental Health Status: The tagged mental health status of the statement.

**Usage**:
This dataset is ideal for training machine learning models aimed at understanding and predicting mental health conditions based on textual data. It can be used in various applications such as:

- ~~Chatbot development for mental health support.~~
- Sentiment analysis to gauge mental health trends.
- Academic research on mental health patterns.

**Acknowledgments**:
This dataset was created by aggregating and cleaning data from various publicly available datasets on Kaggle. Special thanks to the original dataset creators for their contributions.

Source: [Kaggle](https://www.kaggle.com/datasets/suchintikasarkar/sentiment-analysis-for-mental-health)

## Data Collection

In [1]:
import pandas as pd

In [2]:
# Raw string so the backslashes in the Windows path are not treated as escape sequences
df = pd.read_csv(r'D:\Portfolio project\Mental Health Sentiment Analysis in Twitter\Data\Cleaned Combined Data.csv')
df.head()

Unnamed: 0,status,cleaned_statements
0,Anxiety,oh my gosh
1,Anxiety,"trouble sleeping, confused mind, restless hear..."
2,Anxiety,"All wrong, back off dear, forward doubt. Stay ..."
3,Anxiety,I've shifted my focus to something else but I'...
4,Anxiety,"I'm restless and restless, it's been a month n..."


In [3]:
df.shape

(52681, 2)

In [4]:
df.value_counts('status')

status
Normal                  16343
Depression              15404
Suicidal                10652
Anxiety                  3841
Bipolar                  2777
Stress                   2587
Personality disorder     1077
Name: count, dtype: int64

### Check Data

In [5]:
df.duplicated().sum()

np.int64(1606)

In [6]:
df.drop_duplicates()  # returns a new DataFrame; df itself is not modified (its shape stays the same below)

Unnamed: 0,status,cleaned_statements
0,Anxiety,oh my gosh
1,Anxiety,"trouble sleeping, confused mind, restless hear..."
2,Anxiety,"All wrong, back off dear, forward doubt. Stay ..."
3,Anxiety,I've shifted my focus to something else but I'...
4,Anxiety,"I'm restless and restless, it's been a month n..."
...,...,...
52478,Anxiety,Anxiety cause faintness when standing up ? As ...
52479,Anxiety,anxiety heart symptom does anyone else have th...
52480,Anxiety,Travel Anxiety Hi all! Long time anxiety suffe...
52481,Anxiety,fomo from things i’m not involved in does anyo...
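
`drop_duplicates()` returns a new DataFrame rather than changing the original in place, so a bare call only displays the de-duplicated rows. The sketch below illustrates this on a hypothetical toy frame (`toy` and `deduped` are illustrative names, not part of the notebook):

```python
import pandas as pd

# Hypothetical toy data: the second and third rows are exact duplicates.
toy = pd.DataFrame({
    "status": ["Anxiety", "Normal", "Normal"],
    "cleaned_statements": ["oh my gosh", "what do you mean?", "what do you mean?"],
})

toy.drop_duplicates()            # result is discarded, toy still has 3 rows
deduped = toy.drop_duplicates()  # keep the result to actually drop the duplicate

print(len(toy), len(deduped))
```

Whether duplicates should be removed before training is a modelling decision. The rest of the notebook continues with the full 52,681 rows.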


In [7]:
df.isnull().sum()

status                0
cleaned_statements    0
dtype: int64

### Cleaning Data

In [8]:
df.shape

(52681, 2)

In [9]:
df.describe()

Unnamed: 0,status,cleaned_statements
count,52681,52681
unique,7,51053
top,Normal,what do you mean?
freq,16343,22


### Exploratory Data Analysis
EDA helps us understand the dataset before training a model, identify patterns and problems, and guide preprocessing.

In [10]:
import matplotlib.pyplot as plt

In [11]:
df.value_counts('status')

status
Normal                  16343
Depression              15404
Suicidal                10652
Anxiety                  3841
Bipolar                  2777
Stress                   2587
Personality disorder     1077
Name: count, dtype: int64
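
To make the class imbalance in these counts easier to read, the sketch below converts them to shares of the dataset. The counts are copied from the output above; the variable names are illustrative.

```python
# Class counts copied from the value_counts output above.
counts = {
    "Normal": 16343,
    "Depression": 15404,
    "Suicidal": 10652,
    "Anxiety": 3841,
    "Bipolar": 2777,
    "Stress": 2587,
    "Personality disorder": 1077,
}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}
imbalance_ratio = max(counts.values()) / min(counts.values())

for label, share in shares.items():
    print(f"{label:<22}{share:6.1%}")
print(f"Largest/smallest class ratio: {imbalance_ratio:.1f}")
```

The largest class (Normal) is roughly 15 times the size of the smallest (Personality disorder), which is worth keeping in mind when reading accuracy figures later on.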

### With Classical Models (TF-IDF + Decision Tree/Random Forest/Naive Bayes/SVM/Logistic Regression)

### Text Preprocessing

In [12]:
df_classic = df.copy()

In [13]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\dinar\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\dinar\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [14]:
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    text = text.lower()  # Lowercase
    text = re.sub(r'http\S+|www\S+', '', text)  # Remove URLs
    text = re.sub(r'@\w+|#', '', text)  # Remove mentions and the '#' symbol (the hashtag text is kept)
    text = re.sub(r'[^a-zA-Z]', ' ', text)  # Replace special characters and numbers with spaces
    tokens = text.split()  # Tokenize on whitespace
    tokens = [word for word in tokens if word not in stop_words]  # Remove stopwords
    tokens = [lemmatizer.lemmatize(word) for word in tokens]  # Lemmatize
    return ' '.join(tokens)

df_classic['cleaned_statements'] = df_classic['cleaned_statements'].apply(preprocess_text)
df_classic[['cleaned_statements']].head()

Unnamed: 0,cleaned_statements
0,oh gosh
1,trouble sleeping confused mind restless heart ...
2,wrong back dear forward doubt stay restless re...
3,shifted focus something else still worried
4,restless restless month boy mean


In [15]:
df_classic.head()

Unnamed: 0,status,cleaned_statements
0,Anxiety,oh gosh
1,Anxiety,trouble sleeping confused mind restless heart ...
2,Anxiety,wrong back dear forward doubt stay restless re...
3,Anxiety,shifted focus something else still worried
4,Anxiety,restless restless month boy mean


#### Feature extraction with TfidfVectorizer

**For this case, additional feature engineering is not needed.**

In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer
#from sklearn.pipeline import FeatureUnion, Pipeline
#from sklearn.svm import LinearSVC
#from sklearn.preprocessing import FunctionTransformer

def extract_features(texts):
    """Fit TF-IDF on a Series of texts; return a dense feature DataFrame and the fitted vectorizer."""
    # Initialize the TF-IDF vectorizer
    tfidf = TfidfVectorizer(
        max_features=5000,   # Keep only the 5000 most frequent terms
        min_df=5,            # Ignore terms that appear in fewer than 5 documents
        max_df=0.95,         # Ignore terms that appear in more than 95% of documents
        stop_words='english',
        ngram_range=(1, 2)   # Include both unigrams and bigrams
    )

    # Fit and transform the text data
    features = tfidf.fit_transform(texts)

    # Convert to a dense DataFrame for easier inspection
    feature_names = tfidf.get_feature_names_out()
    feature_matrix = pd.DataFrame(
        features.toarray(),
        columns=feature_names
    )

    return feature_matrix, tfidf

In [17]:
# Extract features
feature_matrix, tfidf = extract_features(df_classic['cleaned_statements'])

# Now feature_matrix can be used for machine learning models
print(f"Shape of feature matrix: {feature_matrix.shape}")
print(f"Number of unique terms: {len(tfidf.get_feature_names_out())}")

Shape of feature matrix: (52681, 5000)
Number of unique terms: 5000


In [18]:
# Split the dataset into training and testing sets
from sklearn.model_selection import train_test_split

X = df_classic['cleaned_statements']
y = df_classic['status']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
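
The split above is a plain random split. With classes as imbalanced as these, one option (not used in this notebook, and one that would change the reported scores) is to pass `stratify=y` so that the test set keeps the same class proportions as the full data. A minimal sketch on hypothetical labels:

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 90 "Normal" and 10 "Stress".
texts = [f"doc {i}" for i in range(100)]
labels = ["Normal"] * 90 + ["Stress"] * 10

_, _, _, y_test_strat = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

# The 20-item test set keeps the 9:1 class ratio.
print(Counter(y_test_strat))
```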

**Using Logistic Regression/Naive Bayes/Random Forest models**

In [19]:
from sklearn.base import clone
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, f1_score
from sklearn.pipeline import make_pipeline
import joblib
import os

models = (
    RandomForestClassifier(),
    MultinomialNB(),
    LogisticRegression(max_iter=200),
)
models_names = ['Random Forest', 'Multinomial Naive Bayes', 'Logistic Regression']

best_score = -1.0
best_pipeline = None
best_name = None

for model, name in zip(models, models_names):
    # tfidf, X_train, and y_train were defined in earlier cells. clone() gives each
    # pipeline its own unfitted copy of the vectorizer, which is then fit on X_train only.
    pipeline = make_pipeline(clone(tfidf), model)

    # Train the model
    pipeline.fit(X_train, y_train)

    # Make predictions
    y_pred = pipeline.predict(X_test)

    # Evaluate the model
    acc = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred, average='weighted')

    print(f"Model: {name}")
    print("Accuracy:", acc)
    print("F1 Score:", f1)
    print("Classification Report:\n", classification_report(y_test, y_pred))
    print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
    print("-" * 50)

    # Check whether this model is the best so far (based on weighted F1)
    if f1 > best_score:
        best_score = f1
        best_pipeline = pipeline
        best_name = name

print(f"Best model: {best_name} with F1 Score = {best_score:.4f}")

# ====== SAVE THE BEST MODEL ======
os.makedirs("../saved_models", exist_ok=True)

# Build the file name from the model name
filename = best_name.lower().replace(" ", "_") + "_pipeline.joblib"
filepath = os.path.join("../saved_models", filename)

joblib.dump(best_pipeline, filepath)
print(f"Best model saved to: {filepath}")


Model: Random Forest
Accuracy: 0.7120622568093385
F1 Score: 0.7007663991753732
Classification Report:
                       precision    recall  f1-score   support

             Anxiety       0.84      0.64      0.73       755
             Bipolar       0.94      0.51      0.66       527
          Depression       0.56      0.78      0.66      3016
              Normal       0.84      0.94      0.89      3308
Personality disorder       0.99      0.38      0.55       237
              Stress       0.94      0.27      0.42       536
            Suicidal       0.68      0.48      0.56      2158

            accuracy                           0.71     10537
           macro avg       0.83      0.57      0.64     10537
        weighted avg       0.74      0.71      0.70     10537

Confusion Matrix:
 [[ 482    0  185   66    0    5   17]
 [   4  271  215   20    0    0   17]
 [  45    8 2363  206    0    2  392]
 [  15    2  145 3107    0    1   38]
 [   3    0  125   13   90    1    5]
 [ 
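
The `macro avg` and `weighted avg` rows in the report can be reproduced from the per-class rows, which helps explain why the two differ. The sketch below uses the rounded F1 scores and supports from the Random Forest report above, so it matches the report only to two decimals.

```python
# (f1-score, support) per class, copied from the Random Forest report above.
report = {
    "Anxiety": (0.73, 755),
    "Bipolar": (0.66, 527),
    "Depression": (0.66, 3016),
    "Normal": (0.89, 3308),
    "Personality disorder": (0.55, 237),
    "Stress": (0.42, 536),
    "Suicidal": (0.56, 2158),
}

total_support = sum(n for _, n in report.values())

# Macro average: every class counts equally, regardless of size.
macro_f1 = sum(f1 for f1, _ in report.values()) / len(report)
# Weighted average: each class counts in proportion to its support.
weighted_f1 = sum(f1 * n for f1, n in report.values()) / total_support

print(f"macro F1    = {macro_f1:.2f}")
print(f"weighted F1 = {weighted_f1:.2f}")
```

The weighted score is higher because the large Normal class, which the model handles well, dominates the average. The macro score gives small classes such as Stress the same weight as Normal. The model selection loop uses the weighted variant.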

In [20]:
pipeline = joblib.load("../saved_models/logistic_regression_pipeline.joblib")

text_baru = ["I feel very anxious and lonely"]

pred_label = pipeline.predict(text_baru)[0]

if hasattr(pipeline, "predict_proba"):
    probs = pipeline.predict_proba(text_baru)[0]
    classes = pipeline.classes_
    print("Prediction:", pred_label)
    for cls, p in zip(classes, probs):
        print(f"{cls}: {p:.3f}")
else:
    print("Prediction:", pred_label)

Prediction: Anxiety
Anxiety: 0.687
Bipolar: 0.012
Depression: 0.173
Normal: 0.040
Personality disorder: 0.011
Stress: 0.028
Suicidal: 0.049
