# I. Data Loading and Preprocessing

The data comes from `Dreaddit: A Reddit Dataset for Stress Analysis in Social Media`.

The following tasks are undertaken:
* Column Selection
* Feature Transformation
* Handling Missing Values
* Column Encoding

## Categorical Data Encoding, Feature Selection and Training...

In [1]:
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split

%matplotlib inline
%load_ext lab_black

# Silence all warnings to keep the notebook output readable.
warnings.filterwarnings("ignore")

# Data import
dreaddit_df = pd.read_csv("../data/dreaddit/dreaddit-train.csv")

In [3]:
dreaddit_df.head()

| | subreddit | post_id | sentence_range | text | id | label | confidence | social_timestamp | social_karma | syntax_ari | ... | lex_dal_min_pleasantness | lex_dal_min_activation | lex_dal_min_imagery | lex_dal_avg_activation | lex_dal_avg_imagery | lex_dal_avg_pleasantness | social_upvote_ratio | social_num_comments | syntax_fk_grade | sentiment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ptsd | 8601tu | (15, 20) | He said he had not felt that way before, sugge... | 33181 | 1 | 0.8 | 1521614353 | 5 | 1.806818 | ... | 1.0 | 1.125 | 1.0 | 1.77 | 1.52211 | 1.89556 | 0.86 | 1 | 3.253573 | -0.002742 |
| 1 | assistance | 8lbrx9 | (0, 5) | Hey there r/assistance, Not sure if this is th... | 2606 | 0 | 1.0 | 1527009817 | 4 | 9.429737 | ... | 1.125 | 1.0 | 1.0 | 1.69586 | 1.62045 | 1.88919 | 0.65 | 2 | 8.828316 | 0.292857 |
| 2 | ptsd | 9ch1zh | (15, 20) | My mom then hit me with the newspaper and it s... | 38816 | 1 | 0.8 | 1535935605 | 2 | 7.769821 | ... | 1.0 | 1.1429 | 1.0 | 1.83088 | 1.58108 | 1.85828 | 0.67 | 0 | 7.841667 | 0.011894 |
| 3 | relationships | 7rorpp | [5, 10] | until i met my new boyfriend, he is amazing, h... | 239 | 1 | 0.6 | 1516429555 | 0 | 2.667798 | ... | 1.0 | 1.125 | 1.0 | 1.75356 | 1.52114 | 1.98848 | 0.5 | 5 | 4.104027 | 0.141671 |
| 4 | survivorsofabuse | 9p2gbc | [0, 5] | October is Domestic Violence Awareness Month a... | 1421 | 1 | 0.8 | 1539809005 | 24 | 7.554238 | ... | 1.0 | 1.125 | 1.0 | 1.77644 | 1.64872 | 1.81456 | 1.0 | 1 | 7.910952 | -0.204167 |


In [7]:
print(dreaddit_df.text[0])

He said he had not felt that way before, suggeted I go rest and so ..TRIGGER AHEAD IF YOUI'RE A HYPOCONDRIAC LIKE ME: i decide to look up "feelings of doom" in hopes of maybe getting sucked into some rabbit hole of ludicrous conspiracy, a stupid "are you psychic" test or new age b.s., something I could even laugh at down the road. No, I ended up reading that this sense of doom can be indicative of various health ailments; one of which I am prone to.. So on top of my "doom" to my gloom..I am now f'n worried about my heart. I do happen to have a physical in 48 hours.


In [9]:
cols = dreaddit_df.columns
list(cols)

['subreddit',
 'post_id',
 'sentence_range',
 'text',
 'id',
 'label',
 'confidence',
 'social_timestamp',
 'social_karma',
 'syntax_ari',
 'lex_liwc_WC',
 'lex_liwc_Analytic',
 'lex_liwc_Clout',
 'lex_liwc_Authentic',
 'lex_liwc_Tone',
 'lex_liwc_WPS',
 'lex_liwc_Sixltr',
 'lex_liwc_Dic',
 'lex_liwc_function',
 'lex_liwc_pronoun',
 'lex_liwc_ppron',
 'lex_liwc_i',
 'lex_liwc_we',
 'lex_liwc_you',
 'lex_liwc_shehe',
 'lex_liwc_they',
 'lex_liwc_ipron',
 'lex_liwc_article',
 'lex_liwc_prep',
 'lex_liwc_auxverb',
 'lex_liwc_adverb',
 'lex_liwc_conj',
 'lex_liwc_negate',
 'lex_liwc_verb',
 'lex_liwc_adj',
 'lex_liwc_compare',
 'lex_liwc_interrog',
 'lex_liwc_number',
 'lex_liwc_quant',
 'lex_liwc_affect',
 'lex_liwc_posemo',
 'lex_liwc_negemo',
 'lex_liwc_anx',
 'lex_liwc_anger',
 'lex_liwc_sad',
 'lex_liwc_social',
 'lex_liwc_family',
 'lex_liwc_friend',
 'lex_liwc_female',
 'lex_liwc_male',
 'lex_liwc_cogproc',
 'lex_liwc_insight',
 'lex_liwc_cause',
 'lex_liwc_discrep',
 'lex_liwc_tent

In [10]:
new_cols = [
    "subreddit",
    "post_id",
    "sentence_range",
    "text",
    "id",
    "label",
    "confidence",
    "social_timestamp",
    "social_karma",
    "syntax_ari",
    "social_upvote_ratio",
    "social_num_comments",
    "syntax_fk_grade",
    "sentiment",
]

In [11]:
# Take an explicit copy so later column assignments modify train_df only.
train_df = dreaddit_df[new_cols].copy()
train_df.head()

| | subreddit | post_id | sentence_range | text | id | label | confidence | social_timestamp | social_karma | syntax_ari | social_upvote_ratio | social_num_comments | syntax_fk_grade | sentiment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ptsd | 8601tu | (15, 20) | He said he had not felt that way before, sugge... | 33181 | 1 | 0.8 | 1521614353 | 5 | 1.806818 | 0.86 | 1 | 3.253573 | -0.002742 |
| 1 | assistance | 8lbrx9 | (0, 5) | Hey there r/assistance, Not sure if this is th... | 2606 | 0 | 1.0 | 1527009817 | 4 | 9.429737 | 0.65 | 2 | 8.828316 | 0.292857 |
| 2 | ptsd | 9ch1zh | (15, 20) | My mom then hit me with the newspaper and it s... | 38816 | 1 | 0.8 | 1535935605 | 2 | 7.769821 | 0.67 | 0 | 7.841667 | 0.011894 |
| 3 | relationships | 7rorpp | [5, 10] | until i met my new boyfriend, he is amazing, h... | 239 | 1 | 0.6 | 1516429555 | 0 | 2.667798 | 0.5 | 5 | 4.104027 | 0.141671 |
| 4 | survivorsofabuse | 9p2gbc | [0, 5] | October is Domestic Violence Awareness Month a... | 1421 | 1 | 0.8 | 1539809005 | 24 | 7.554238 | 1.0 | 1 | 7.910952 | -0.204167 |
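
The preview shows two columns that are not yet in a model-friendly form: `sentence_range` is a string that appears with both parentheses and square brackets, and `social_timestamp` looks like Unix epoch seconds. A minimal sketch of one possible feature transformation follows, run on a toy frame copied from the rows above. The helper `parse_range` and the new column names `sent_start`, `sent_end` and `posted_at` are hypothetical, and treating the timestamp as seconds in UTC is an assumption.

```python
import ast

import pandas as pd

# Toy frame with values copied from the preview above.
toy = pd.DataFrame(
    {
        "sentence_range": ["(15, 20)", "(0, 5)", "[5, 10]", "[0, 5]"],
        "social_timestamp": [1521614353, 1527009817, 1516429555, 1539809005],
    }
)


def parse_range(value: str) -> tuple:
    """Parse '(15, 20)' or '[5, 10]' into a (start, end) pair of ints."""
    start, end = ast.literal_eval(value)
    return int(start), int(end)


# Split the string range into two integer columns (hypothetical names).
bounds = toy["sentence_range"].map(parse_range)
toy["sent_start"] = [b[0] for b in bounds]
toy["sent_end"] = [b[1] for b in bounds]

# Assumption: social_timestamp holds Unix epoch seconds in UTC.
toy["posted_at"] = pd.to_datetime(toy["social_timestamp"], unit="s", utc=True)

print(toy[["sent_start", "sent_end", "posted_at"]])
```

`ast.literal_eval` accepts both the tuple and the list spelling, so the two formats do not need separate handling.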


### Linguistic Inquiry and Word Count (LIWC)
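LIWC is a lexicon-based tool that scores text on psychologically relevant categories, such as sadness or cognitive processes, which serve as a proxy for topic prevalence and expression variety. Two measures are computed: the percentage of tokens per domain that are included in a specific LIWC word list, and the percentage of words in a specific LIWC word list that appear in that domain. The column list below also collects the `lex_dal_*` features (pleasantness, activation, imagery), which come from the Dictionary of Affect in Language (DAL) rather than LIWC.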
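
To make the two percentages concrete, here is a minimal sketch under stated assumptions. The word list `SAD_WORDS`, the tokenizer and the function `lexicon_scores` are all hypothetical; the real LIWC dictionaries are proprietary and are not reproduced here. The scores are computed for a single text for simplicity, whereas the description above aggregates per domain.

```python
import re

# Hypothetical mini word list standing in for one LIWC category.
SAD_WORDS = {"sad", "gloom", "doom", "cry", "lonely"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens (a simplifying assumption)."""
    return re.findall(r"[a-z']+", text.lower())


def lexicon_scores(text, lexicon):
    """Return (percent of tokens found in the lexicon,
    percent of lexicon entries that occur in the text)."""
    tokens = tokenize(text)
    if not tokens or not lexicon:
        return 0.0, 0.0
    hits = [t for t in tokens if t in lexicon]
    token_pct = 100.0 * len(hits) / len(tokens)
    coverage_pct = 100.0 * len(set(hits)) / len(lexicon)
    return token_pct, coverage_pct


text = "So on top of my doom to my gloom I am now worried"
token_pct, coverage_pct = lexicon_scores(text, SAD_WORDS)
print(f"tokens in list: {token_pct:.1f}%, list coverage: {coverage_pct:.1f}%")
```

The first number answers "how much of the text is about this category", the second "how much of the category's vocabulary is used", which is why the two can differ widely for the same text.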

In [12]:
liwc_cols = [
    "lex_liwc_WC",
    "lex_liwc_Analytic",
    "lex_liwc_Clout",
    "lex_liwc_Authentic",
    "lex_liwc_Tone",
    "lex_liwc_WPS",
    "lex_liwc_Sixltr",
    "lex_liwc_Dic",
    "lex_liwc_function",
    "lex_liwc_pronoun",
    "lex_liwc_ppron",
    "lex_liwc_i",
    "lex_liwc_we",
    "lex_liwc_you",
    "lex_liwc_shehe",
    "lex_liwc_they",
    "lex_liwc_ipron",
    "lex_liwc_article",
    "lex_liwc_prep",
    "lex_liwc_auxverb",
    "lex_liwc_adverb",
    "lex_liwc_conj",
    "lex_liwc_negate",
    "lex_liwc_verb",
    "lex_liwc_adj",
    "lex_liwc_compare",
    "lex_liwc_interrog",
    "lex_liwc_number",
    "lex_liwc_quant",
    "lex_liwc_affect",
    "lex_liwc_posemo",
    "lex_liwc_negemo",
    "lex_liwc_anx",
    "lex_liwc_anger",
    "lex_liwc_sad",
    "lex_liwc_social",
    "lex_liwc_family",
    "lex_liwc_friend",
    "lex_liwc_female",
    "lex_liwc_male",
    "lex_liwc_cogproc",
    "lex_liwc_insight",
    "lex_liwc_cause",
    "lex_liwc_discrep",
    "lex_liwc_tentat",
    "lex_liwc_certain",
    "lex_liwc_differ",
    "lex_liwc_percept",
    "lex_liwc_see",
    "lex_liwc_hear",
    "lex_liwc_feel",
    "lex_liwc_bio",
    "lex_liwc_body",
    "lex_liwc_health",
    "lex_liwc_sexual",
    "lex_liwc_ingest",
    "lex_liwc_drives",
    "lex_liwc_affiliation",
    "lex_liwc_achieve",
    "lex_liwc_power",
    "lex_liwc_reward",
    "lex_liwc_risk",
    "lex_liwc_focuspast",
    "lex_liwc_focuspresent",
    "lex_liwc_focusfuture",
    "lex_liwc_relativ",
    "lex_liwc_motion",
    "lex_liwc_space",
    "lex_liwc_time",
    "lex_liwc_work",
    "lex_liwc_leisure",
    "lex_liwc_home",
    "lex_liwc_money",
    "lex_liwc_relig",
    "lex_liwc_death",
    "lex_liwc_informal",
    "lex_liwc_swear",
    "lex_liwc_netspeak",
    "lex_liwc_assent",
    "lex_liwc_nonflu",
    "lex_liwc_filler",
    "lex_liwc_AllPunc",
    "lex_liwc_Period",
    "lex_liwc_Comma",
    "lex_liwc_Colon",
    "lex_liwc_SemiC",
    "lex_liwc_QMark",
    "lex_liwc_Exclam",
    "lex_liwc_Dash",
    "lex_liwc_Quote",
    "lex_liwc_Apostro",
    "lex_liwc_Parenth",
    "lex_liwc_OtherP",
    "lex_dal_max_pleasantness",
    "lex_dal_max_activation",
    "lex_dal_max_imagery",
    "lex_dal_min_pleasantness",
    "lex_dal_min_activation",
    "lex_dal_min_imagery",
    "lex_dal_avg_activation",
    "lex_dal_avg_imagery",
    "lex_dal_avg_pleasantness",
]

In [13]:
liwc_df = dreaddit_df[liwc_cols]

In [15]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2838 entries, 0 to 2837
Data columns (total 14 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   subreddit            2838 non-null   object 
 1   post_id              2838 non-null   object 
 2   sentence_range       2838 non-null   object 
 3   text                 2838 non-null   object 
 4   id                   2838 non-null   int64  
 5   label                2838 non-null   int64  
 6   confidence           2838 non-null   float64
 7   social_timestamp     2838 non-null   int64  
 8   social_karma         2838 non-null   int64  
 9   syntax_ari           2838 non-null   float64
 10  social_upvote_ratio  2838 non-null   float64
 11  social_num_comments  2838 non-null   int64  
 12  syntax_fk_grade      2838 non-null   float64
 13  sentiment            2838 non-null   float64
dtypes: float64(5), int64(5), object(4)
memory usage: 310.5+ KB
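
The summary above reports 2838 non-null values in every selected column, so no imputation is needed for this subset. If missing values did appear, the check and a simple fix could look like the sketch below. The toy frame and the median-fill strategy are assumptions for illustration, not a step taken in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps introduced on purpose; the real train_df has none.
toy = pd.DataFrame(
    {
        "social_karma": [5, 4, np.nan, 0],
        "sentiment": [-0.002742, 0.292857, 0.011894, np.nan],
        "label": [1, 0, 1, 1],
    }
)

# Count missing values per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# One simple option: fill numeric gaps with the column median.
filled = toy.fillna(toy.median(numeric_only=True))
print(filled)
```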


In [19]:
train_df.subreddit.value_counts()

ptsd                584
relationships       552
anxiety             503
domesticviolence    316
assistance          289
survivorsofabuse    245
homeless            168
almosthomeless       80
stress               64
food_pantry          37
Name: subreddit, dtype: int64
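
`subreddit` is a string column with ten distinct values, so it has to be encoded before most models can use it. A minimal one-hot sketch with `pd.get_dummies` follows, on a toy frame copied from the preview rows. The `sub` prefix is a hypothetical naming choice, and one-hot encoding is only one of several reasonable options.

```python
import pandas as pd

# Toy frame with subreddit and label values from the first five rows.
toy = pd.DataFrame(
    {
        "subreddit": [
            "ptsd",
            "assistance",
            "ptsd",
            "relationships",
            "survivorsofabuse",
        ],
        "label": [1, 0, 1, 1, 1],
    }
)

# One indicator column per subreddit; the original string column is dropped.
encoded = pd.get_dummies(toy, columns=["subreddit"], prefix="sub", dtype=int)
print(encoded)
```

On the full data this would add ten indicator columns. The smaller communities, such as `food_pantry` with 37 posts, would produce very sparse columns, which may be worth keeping in mind during feature selection.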

In [20]:
train_df.label.value_counts()

1    1488
0    1350
Name: label, dtype: int64
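
The two classes are close to balanced: 1488 of 2838 posts (about 52%) are labelled 1. A stratified split keeps that ratio in both partitions. The sketch below rebuilds only the label counts shown above and uses placeholder features; `test_size=0.2` and `random_state=42` are assumptions, not values taken from this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Labels rebuilt from the counts above: 1488 ones and 1350 zeros.
y = np.array([1] * 1488 + [0] * 1350)
# Placeholder features (row indices); the real features would go here.
X = np.arange(len(y)).reshape(-1, 1)

# stratify=y keeps the class ratio the same in both partitions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(f"overall positive rate: {y.mean():.3f}")
print(f"train positive rate:   {y_train.mean():.3f}")
print(f"test positive rate:    {y_test.mean():.3f}")
```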