# Data Preprocessing
--------

This notebook contains:
* [Preprocessing](#Preprocessing)
* [Word Processing](#Word-Processing)

In [2]:
import pandas as pd 
import numpy as np

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.feature_extraction import text
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

import re
import emoji

In [3]:
# Read in data set of merged data and sentiment analysis
df = pd.read_csv('../data/full_subreddits_w_sentiment.csv')
df.head()

Unnamed: 0,post_id,title,self_text,subreddit,comment,comment_id,pos_score,neg_score,neu_score,compound_score
0,1319b5m,25F Rash on back of knee,"25F, I've had a rash on the back of my right k...",AskDocs,,,0.033,0.15,0.817,-0.9559
1,1319avc,idk if me or my gf have hpv or smthing why is ...,im a 20 year old girl (with a peepee) and i ha...,AskDocs,,,0.016,0.179,0.805,-0.9865
2,13199bu,bruise after blood test?,20f uk \n\njust concerned cause it’s been over...,AskDocs,,,0.051,0.102,0.847,-0.8848
3,13199a7,Heart palpitations while pushing during a bowe...,Hi i (19m 186cm 89kg) have been having bad day...,AskDocs,,,0.039,0.07,0.891,-0.3182
4,13197g1,Left side of my body between rib cage and hip ...,"So as the title says, I [22M] have been feelin...",AskDocs,,,0.1,0.086,0.813,-0.0123


## Preprocessing

For the preprocessing portion, I will conduct an analysis of the text for the `self_text` and `title` features. 

I want to check if there is any use of emojis in the dataset, this will have to be removed as to remove any unwanted characters in the vectorization. 

In [4]:
def emoji_finder(text:list):
    """
    Counts the number of strings that have an emoji in a list of strings
    """
    count = 0
    for i in text:
        if not (i != i):
            if emoji.emoji_count(i) > 0:
                count +=1
    return count

In [5]:
print(f'Number of entries with emojis in Title: {emoji_finder(df["title"])}')
print(f'Number of entries with emojis in Self Text: {emoji_finder(df["self_text"])}')

Number of entries with emojis in Title: 10
Number of entries with emojis in Self Text: 137


There are a couple of emojis used in the titles and selt_text, so I will add a feature in the preprocesser to remove the emojis. 

In [6]:
def url_finder(text:list):
    """
    Counts the number of strings that have a url in a list of strings
    """
    count = 0
    for i in text:
        if not (i != i):
            if re.findall(r"\S*https?:\S*", i):
                count +=1
    return count

In [7]:
print(f'Number of entries with urls in Title: {url_finder(df["title"])}')
print(f'Number of entries with urls in Self Text: {url_finder(df["self_text"])}')

Number of entries with urls in Title: 0
Number of entries with urls in Self Text: 1630


Urls are seen frequently in the `selt_text` of the 

In [8]:
def reddit_talk (text:list):
    """
    Counts the number of strings that have an instance of reddit lingo in a list of strings
    """
    scount = 0
    count = 0
    for i in text:
        if not (i != i):
            if re.findall('[MF][0-9][0-9]', i):
                count +=1
            if re.findall('/[a-z]\S', i):
                count += 1
    return count

In [9]:
print(f'Number of entries with urls in Title: {reddit_talk(df["title"])}')
print(f'Number of entries with urls in Self Text: {reddit_talk(df["self_text"])}')

Number of entries with urls in Title: 363
Number of entries with urls in Self Text: 3215


## Word Processing

In [19]:
def my_preprocessor(text):
    """
    Custom preporcessor to be used for TfidfVectorizer. Removal and change of word features based on preporcessing such as emoji removal, 
    url removal, and reddit lingo adaption. 
    """
    text  = text.lower()
    text = re.sub('\\n','', text)
    text = re.sub('\[removed\]','', text)
    text = emoji.replace_emoji(text, replace="")
    text = re.sub(r"\S*https?:\S*", "", text)
    text = re.sub('[MF][0-9][0-9]', '', text)
    text = re.sub('/[a-z]\S', ' or ', text)
        
    return text

In [13]:
cv = TfidfVectorizer(
    stop_words=stopwords.words('english'),
    ngram_range=(2, 5),
    max_df= 0.3,
    min_df= 5,
    preprocessor=my_preprocessor,
    max_features = 250
    )

df_t  = cv.fit_transform(df['self_text'])

In [14]:
df_tt = pd.DataFrame(df_t.todense(), columns = cv.get_feature_names_out())

In [15]:
df_tt.sum().sort_values(ascending=False).tail(20)

photo relevant         20.190235
coming back            20.183432
primary care           20.120734
get worse              20.072575
around time            20.020446
go see                 19.963300
30 minutes             19.021204
swollen lymph          18.679661
left ear               18.457302
weeks later            18.431558
right ear              17.969385
started feel           17.956120
feel better            17.705412
back negative          17.501683
blood work done        17.243456
went urgent care       16.871956
went urgent            16.871956
high blood             15.640944
high blood pressure    15.271110
reference range        10.386774
dtype: float64

In [20]:
cv1 = TfidfVectorizer(
    stop_words=stopwords.words('english'),
    ngram_range=(1, 1),
    max_df= 0.25,
    min_df= 10,
    preprocessor=my_preprocessor,
    max_features = 250
    )

df_t1  = cv1.fit_transform(df['self_text'])

In [21]:
df_t1 = pd.DataFrame(df_t1.todense(), columns = cv1.get_feature_names_out())

In [22]:
df_t1.sum().sort_values(ascending=False).head(20)

pain         321.618365
one          268.742667
back         248.973951
know         247.249632
could        246.637184
house        237.676566
time         237.082537
right        231.272741
water        222.764809
years        214.343511
need         212.773866
something    209.640740
really       207.819965
days         205.977448
since        205.583923
day          204.064163
want         203.797690
ago          201.728339
normal       199.962246
go           199.649868
dtype: float64