# Notebook for preprocessing Reddit data

This notebook cleans the title, body, and comments columns (`title`, `selftext`, `comments`) of the master table.
The cleaning consists of:

    - Normalizing case (converting text to lowercase)

    - Expanding contractions

    - Removing URLs

    - Removing emojis

    - Collapsing multiple spaces

    - Removing unnecessary symbols

    - Lemmatizing and tokenizing with spaCy.


Finally, the notebook writes a table named reddit_ibd_preprocessed.csv that includes the cleaned columns.

In [2]:
import re
import emoji
import contractions
import pandas as pd
import spacy
from tqdm import tqdm

In [3]:
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

In [4]:
# Regex patterns
URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')
SYMBOL_PATTERN = re.compile(r'[^a-zA-Z0-9\s]')

def normalize_text(text, remove_emoji=True):
    """Lowercase, expand contractions, remove URLs and symbols."""
    if not isinstance(text, str):
        return ""
    
    # Lowercase
    text = text.lower()

    # Expand contractions (e.g. "don't" -> "do not")
    text = contractions.fix(text)

    # Remove URLs
    text = URL_PATTERN.sub('', text)

    # Optionally remove emojis
    if remove_emoji:
        text = emoji.replace_emoji(text, replace='')

    # Remove unnecessary symbols (keep alphanumeric and whitespace)
    text = SYMBOL_PATTERN.sub(' ', text)

    # Remove multiple spaces
    text = re.sub(r'\s+', ' ', text).strip()
    return text


def lemmatize_text(text):
    """Tokenize and lemmatize using spaCy."""
    if not isinstance(text, str):
        return ""
    
    doc = nlp(text)
    lemmas = [token.lemma_ for token in doc if not token.is_stop and not token.is_punct]
    return " ".join(lemmas)
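
The regex-based steps of `normalize_text` can be traced on one of the titles in this dataset. The sketch below is a minimal, standard-library-only illustration: `regex_stages` is a hypothetical helper that skips the `contractions` and `emoji` calls, the URL in the sample is made up, and `POSSESSIVE_PATTERN` is an assumed extra pattern that is not part of the original pipeline. It shows why a possessive such as "Crohn's" ends up as the two tokens `crohn s`, and one possible way to avoid that.

```python
import re

# Same two patterns as in the notebook cell.
URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')
SYMBOL_PATTERN = re.compile(r'[^a-zA-Z0-9\s]')
# Assumption: an extra pattern (not in the original notebook) that drops a
# trailing possessive 's before the symbol filter runs.
POSSESSIVE_PATTERN = re.compile(r"(?<=[a-z0-9])['’]s\b")


def regex_stages(text, drop_possessive=False):
    """Hypothetical helper: apply only the regex steps of normalize_text."""
    text = text.lower()
    # URLs must go first; otherwise the symbol filter would split them
    # into fragments such as "https example com".
    text = URL_PATTERN.sub('', text)
    if drop_possessive:
        text = POSSESSIVE_PATTERN.sub('', text)
    # Every non-alphanumeric character becomes a space.
    text = SYMBOL_PATTERN.sub(' ', text)
    # Collapse the runs of spaces created by the previous steps.
    return re.sub(r'\s+', ' ', text).strip()


sample = "Mayo Clinic article on Crohn's Disease: https://example.com/x?id=1"
print(regex_stages(sample))
# → mayo clinic article on crohn s disease
print(regex_stages(sample, drop_possessive=True))
# → mayo clinic article on crohn disease
```

Whether the stray `s` matters depends on the downstream model; with the default spaCy stop-word filter it may survive into the lemma column, as the preprocessed output suggests.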

In [5]:
def preprocess_dataframe(df, text_columns):
    """Apply normalization and lemmatization to specified text columns."""
    tqdm.pandas()

    for col in text_columns:
        print(f"→ Preprocessing column: {col}")
        df[f"{col}_clean"] = df[col].progress_apply(normalize_text)
        df[f"{col}_lemma"] = df[f"{col}_clean"].progress_apply(lemmatize_text)
    
    return df

In [6]:
df = pd.read_csv('../data/interim/IBD_estructured_text.csv')
df.head()

Unnamed: 0,id,subreddit,author,title,selftext,created_utc,comments,cuerpo
0,dflwn,CrohnsDisease,zakool21,Don't be afraid of diagnostic procedures....,"I'm not likely to frontpage this subreddit, bu...",2010-09-18 11:59:11,"['I have to agree, the worse part of a Colonos...",TÍTULO:\nDon't be afraid of diagnostic procedu...
1,dfyy1,CrohnsDisease,sphinctersayzwha,Mayo Clinic article on Crohn's Disease.,Mayo Clinic article on Crohn's Disease.,2010-09-19 15:11:21,[],TÍTULO:\nMayo Clinic article on Crohn's Diseas...
2,dfz1m,CrohnsDisease,WeDeserveDessert,Has anyone else here taken Remicade? What did...,Has anyone else here taken Remicade? What did...,2010-09-19 15:19:54,"[""Remicade (and a bypass) changed my life. Wh...",TÍTULO:\nHas anyone else here taken Remicade? ...
3,dh6zd,CrohnsDisease,RosenTurd,"I was Diagnosed With Crohns Disease at age 9 ,...",Like the title Says AMA,2010-09-22 04:50:57,"[""My son was diagnosed at ten. It's hell for h...",TÍTULO:\nI was Diagnosed With Crohns Disease a...
4,dhgot,CrohnsDisease,unknownpleasures,Has anyone here had to have bowel surgery more...,I had a small bowel resection and appendectomy...,2010-09-22 18:31:36,"[""I have a story:\n\nI had a small bowel resec...",TÍTULO:\nHas anyone here had to have bowel sur...


In [None]:
text_cols = ["title", "selftext", "comments"]
processed_df = preprocess_dataframe(df, text_cols)

In [8]:
processed_df.to_csv("../data/processed/reddit_ibd_preprocessed.csv", index=False)
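
Posts without comments store `[]`, which `normalize_text` reduces to an empty string. With its default settings, `pd.read_csv` reads empty fields back as `NaN`, which can surprise later string operations on the cleaned columns. The sketch below reproduces the round trip with an in-memory buffer and two made-up rows standing in for the real file; `keep_default_na=False` is one option, and calling `fillna("")` on the text columns after loading is another.

```python
import io

import pandas as pd

# Made-up two-row frame standing in for the processed table.
frame = pd.DataFrame(
    {
        "id": ["dfyy1", "dfz1m"],
        "comments_clean": ["", "remicade and a bypass changed my life"],
    }
)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
frame.to_csv(buffer, index=False)

# Default behaviour: the empty string comes back as NaN.
buffer.seek(0)
default_read = pd.read_csv(buffer)
print(default_read["comments_clean"].isna().tolist())
# → [True, False]

# keep_default_na=False keeps empty fields as empty strings.
buffer.seek(0)
text_read = pd.read_csv(buffer, keep_default_na=False)
print(text_read["comments_clean"].tolist())
# → ['', 'remicade and a bypass changed my life']
```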

In [1]:
# View the preprocessed dataset
import pandas as pd
df = pd.read_csv('/Users/fjosesala/Documents/GitHub/IBD-NLP-RiskPrediction/data/processed/reddit_ibd_preprocessed.csv')
df.head()

Unnamed: 0,id,subreddit,author,title,selftext,created_utc,comments,cuerpo,title_clean,title_lemma,selftext_clean,selftext_lemma,comments_clean,comments_lemma
0,dflwn,CrohnsDisease,zakool21,Don't be afraid of diagnostic procedures....,"I'm not likely to frontpage this subreddit, bu...",2010-09-18 11:59:11,"['I have to agree, the worse part of a Colonos...",TÍTULO:\nDon't be afraid of diagnostic procedu...,do not be afraid of diagnostic procedures,afraid diagnostic procedure,i am not likely to frontpage this subreddit bu...,likely frontpage subreddit glad exist want rea...,i have to agree the worse part of a colonoscop...,agree bad colonoscopy prep glad ok case thank ...
1,dfyy1,CrohnsDisease,sphinctersayzwha,Mayo Clinic article on Crohn's Disease.,Mayo Clinic article on Crohn's Disease.,2010-09-19 15:11:21,[],TÍTULO:\nMayo Clinic article on Crohn's Diseas...,mayo clinic article on crohn s disease,mayo clinic article crohn s disease,mayo clinic article on crohn s disease,mayo clinic article crohn s disease,,
2,dfz1m,CrohnsDisease,WeDeserveDessert,Has anyone else here taken Remicade? What did...,Has anyone else here taken Remicade? What did...,2010-09-19 15:19:54,"[""Remicade (and a bypass) changed my life. Wh...",TÍTULO:\nHas anyone else here taken Remicade? ...,has anyone else here taken remicade what did y...,take remicade think help,has anyone else here taken remicade what did y...,take remicade think help,remicade and a bypass changed my life while it...,remicade bypass change life eventually decreas...
3,dh6zd,CrohnsDisease,RosenTurd,"I was Diagnosed With Crohns Disease at age 9 ,...",Like the title Says AMA,2010-09-22 04:50:57,"[""My son was diagnosed at ten. It's hell for h...",TÍTULO:\nI was Diagnosed With Crohns Disease a...,i was diagnosed with crohns disease at age 9 n...,diagnose crohns disease age 9 28yo ama,like the title says ama,like title say ama,my son was diagnosed at ten it is hell for him...,son diagnose hell colonoscopy hospital visit d...
4,dhgot,CrohnsDisease,unknownpleasures,Has anyone here had to have bowel surgery more...,I had a small bowel resection and appendectomy...,2010-09-22 18:31:36,"[""I have a story:\n\nI had a small bowel resec...",TÍTULO:\nHas anyone here had to have bowel sur...,has anyone here had to have bowel surgery more...,bowel surgery,i had a small bowel resection and appendectomy...,small bowel resection appendectomy 6 year ago ...,i have a story n ni had a small bowel resectio...,story n ni small bowel resection appendectomy ...
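
The `comments` column holds a stringified Python list, so escaped newlines reach `normalize_text` as a literal backslash followed by `n`. The symbol filter then replaces the backslash with a space, which leaves stray tokens such as `n ni` in the last row above. One possible fix, sketched under the assumption that each cell is a valid Python list literal, is to parse the cell with `ast.literal_eval` before cleaning. `parse_comment_list` is a hypothetical helper, not part of the original notebook, and the sample string is shortened from the data.

```python
import ast


def parse_comment_list(raw):
    """Hypothetical helper: turn a stringified list of comments into plain text."""
    if not isinstance(raw, str):
        return ""
    try:
        parsed = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        # Not a valid literal: fall back to the raw text unchanged.
        return raw
    if isinstance(parsed, (list, tuple)):
        # Join the individual comments into a single string.
        return " ".join(str(item) for item in parsed)
    return str(parsed)


# The cell as stored in the CSV: backslash-n is two literal characters here.
raw = '["I have a story:\\n\\nI had a small bowel resection", "Thanks"]'
print(repr(parse_comment_list(raw)))
# → 'I have a story:\n\nI had a small bowel resection Thanks'
```

After parsing, the newlines are real whitespace characters, so the existing `\s+` step collapses them. The helper could be applied with `df["comments"].apply(parse_comment_list)` before calling `preprocess_dataframe`.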
