### Normalization

Implement a function to clean and normalize the text of the posts 
(the text in the "post" column). The normalized version will be stored in a new column, 
"clean_post". Cleaning and normalization should be driven by the 
characteristics you observe in the texts. Describe in detail each step followed during 
preprocessing and normalization, and add the function preprocess_post(text: str) to the file 
core.py with the final implementation of this module. 

In [1]:
import pandas as pd 
pd.options.display.max_columns = None

In [2]:
path = 'reddit_database_sentiment.csv'

In [3]:
df = pd.read_csv(
    path,
    delimiter=';', 
    on_bad_lines='skip',  # skip malformed rows (too many fields) instead of raising an error
    header=0, 
    quotechar='"',  # fields containing the delimiter or line breaks are wrapped in double quotes
    encoding='utf-8',
    low_memory=False
)
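
Because `on_bad_lines='skip'` drops malformed rows without reporting them, it can be useful to estimate how many rows are lost. The following is a minimal sketch using only the standard library; `count_overlong_rows` is a hypothetical helper (not part of the notebook), and the `csv` module may not agree with the pandas parser on every edge case, so treat the count as approximate.

```python
import csv
import io


def count_overlong_rows(handle, delimiter=";", quotechar='"'):
    """Return (data_rows, rows_with_more_fields_than_the_header)."""
    reader = csv.reader(handle, delimiter=delimiter, quotechar=quotechar)
    header = next(reader)
    total = 0
    too_long = 0
    for row in reader:
        total += 1
        # A row with extra fields is what the parser would treat as a bad line.
        if len(row) > len(header):
            too_long += 1
    return total, too_long


# Tiny in-memory example: the second data row has an unquoted delimiter.
demo = io.StringIO(
    'id;post;sentiment\n'
    '0;"a post; with a quoted delimiter";NEGATIVE\n'
    '1;an unquoted; delimiter;POSITIVE\n'
)
print(count_overlong_rows(demo))  # (2, 1)
```

On the real file, the same helper could be called inside `with open(path, newline='', encoding='utf-8') as f:`.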

In [5]:
df.head()

Unnamed: 0,created_date,created_timestamp,subreddit,title,author,author_created_utc,full_link,score,num_comments,num_crossposts,subreddit_subscribers,post,sentiment
0,2010-02-11 19:47:22,1265910442.0,analytics,So what do you guys all do related to analytic...,xtom,1227476000.0,https://www.reddit.com/r/analytics/comments/b0...,7.0,4.0,0.0,,There's a lot of reasons to want to know all t...,NEGATIVE
1,2010-03-04 20:17:26,1267726646.0,analytics,"Google's Invasive, non-Anonymized Ad Targeting...",xtom,1227476000.0,https://www.reddit.com/r/analytics/comments/b9...,2.0,1.0,0.0,,"I'm cross posting this from /r/cyberlaw, hopef...",NEGATIVE
2,2011-01-06 04:51:18,1294282278.0,analytics,"DotCed - Functional Web Analytics - Tagging, R...",dotced,1294282000.0,https://www.reddit.com/r/analytics/comments/ew...,1.0,1.0,,,"DotCed,a Functional Analytics Consultant, offe...",NEGATIVE
3,2011-01-19 11:45:30,1295430330.0,analytics,Program Details - Data Analytics Course,iqrconsulting,1288245000.0,https://www.reddit.com/r/analytics/comments/f5...,0.0,0.0,,,Here is the program details of the data analyt...,NEGATIVE
4,2011-01-19 21:52:28,1295466748.0,analytics,potential job in web analytics... need to anal...,therewontberiots,1278672000.0,https://www.reddit.com/r/analytics/comments/f5...,2.0,4.0,,,i decided grad school (physics) was not for me...,POSITIVE


In [8]:
df.columns


Index(['created_date', 'created_timestamp', 'subreddit', 'title', 'author',
       'author_created_utc', 'full_link', 'score', 'num_comments',
       'num_crossposts', 'subreddit_subscribers', 'post', 'sentiment'],
      dtype='object')

In [4]:
%pip install emoji

Collecting emoji
  Using cached emoji-2.14.0-py3-none-any.whl.metadata (5.7 kB)
Using cached emoji-2.14.0-py3-none-any.whl (586 kB)
Installing collected packages: emoji
Successfully installed emoji-2.14.0
Note: you may need to restart the kernel to use updated packages.


In [7]:
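import re
import pandas as pd
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import emoji

# The NLTK calls used below need the 'stopwords', 'wordnet' and 'punkt' resources
# ('punkt_tab' on recent NLTK releases); fetch them once with nltk.download(...).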

In [9]:
def preprocess_post(text: str, lem: bool = True) -> str: 
    '''
    Clean and normalize the text of a post.

    input:
    - text (str): input text to be processed.
    - lem (bool): if True, apply lemmatization; if False, skip it.

    steps:
    1. convert the text to lowercase to standardize case.
    2. remove URLs, user mentions (@) and hashtags (#).
    3. strip HTML tags.
    4. remove emojis from the text.
    5. remove punctuation marks.
    6. collapse extra whitespace.
    7. tokenize the text into individual words.
    8. remove stopwords to focus on meaningful words.
    9. optionally lemmatize to reduce words to their base forms (no stemming is applied).

    output:
    - str: cleaned and normalized version of the input text.
    '''

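    # Coerce non-string input to str so the string methods below do not fail
    text = str(text)

    # 1. Lowercase
    text = text.lower()

    # 2. Remove URLs, then @mentions and #hashtags
    text = re.sub(r'http\S+|www\S+', '', text)
    text = re.sub(r'@\w+|#\w+', '', text)

    # 3. Strip HTML tags
    text = re.sub(r'<.*?>', '', text)

    # 4. Remove emojis
    text = emoji.replace_emoji(text, replace='')

    # 5. Delete punctuation (characters are removed, not replaced by spaces,
    #    so "dotced,a" becomes "dotceda")
    text = text.translate(str.maketrans("", "", string.punctuation))

    # 6. Collapse runs of whitespace
    text = re.sub(r'\s+', ' ', text).strip()

    # 7. Tokenize
    words = word_tokenize(text)

    # 8. Drop English stopwords
    stop_words = set(stopwords.words('english'))
    words = [word for word in words if word not in stop_words]

    # 9. Optionally lemmatize; PorterStemmer could be used instead, but stemming
    #    is cruder and tends to be less accurate for sentiment analysis
    if lem:
        lemmatizer = WordNetLemmatizer()
        words = [lemmatizer.lemmatize(word) for word in words]

    # Rebuild the text from the remaining tokens
    text = ' '.join(words)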
    
    return text
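
A side effect of step 5 is visible in the outputs of this notebook: punctuation is deleted outright, so "DotCed,a" ends up as the single token "dotceda" and "/r/cyberlaw" as "rcyberlaw". A possible alternative, sketched here as an assumption rather than as the implementation used above, is to map punctuation to spaces and then collapse whitespace. `strip_punctuation_keep_boundaries` is a hypothetical helper name.

```python
import re
import string

# Translation table mapping every ASCII punctuation character to a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def strip_punctuation_keep_boundaries(text: str) -> str:
    """Replace punctuation with spaces so neighbouring words are not glued together."""
    text = text.translate(_PUNCT_TO_SPACE)
    return re.sub(r"\s+", " ", text).strip()


sample = "dotced,a functional analytics consultant"

# Behaviour of the current step 5: punctuation is simply deleted.
deleted = sample.translate(str.maketrans("", "", string.punctuation))
print(deleted)  # dotceda functional analytics consultant

# Alternative: punctuation becomes a word boundary.
print(strip_punctuation_keep_boundaries(sample))  # dotced a functional analytics consultant
```

The trade-off is that contractions are split as well ("i'm" becomes "i m"), so which variant works better depends on the downstream model.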


In [10]:
df['post'].head()

0    There's a lot of reasons to want to know all t...
1    I'm cross posting this from /r/cyberlaw, hopef...
2    DotCed,a Functional Analytics Consultant, offe...
3    Here is the program details of the data analyt...
4    i decided grad school (physics) was not for me...
Name: post, dtype: object

In [11]:
df['clean_post'] = df['post'].apply(preprocess_post)

df[['post', 'clean_post']].head()


Unnamed: 0,post,clean_post
0,There's a lot of reasons to want to know all t...,there lot reason want know stuff figured id ge...
1,"I'm cross posting this from /r/cyberlaw, hopef...",im cross posting rcyberlaw hopefully guy find ...
2,"DotCed,a Functional Analytics Consultant, offe...",dotceda functional analytics consultant offeri...
3,Here is the program details of the data analyt...,program detail data analytics certification co...
4,i decided grad school (physics) was not for me...,decided grad school physic branching job marke...
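
One thing to watch when applying the function to the whole column: `preprocess_post` starts with `str(text)`, so a missing post (NaN) would be turned into the literal string "nan" instead of an empty string. Whether this dataset contains missing posts is not shown here, so the following is only a hedged sketch; `safe_preprocess` is a hypothetical wrapper, and `str.lower` stands in for the real cleaner to keep the example self-contained.

```python
import math


def safe_preprocess(text, cleaner):
    """Return '' for missing values, otherwise delegate to `cleaner`."""
    if text is None or (isinstance(text, float) and math.isnan(text)):
        return ""
    return cleaner(text)


# What the plain str() coercion does with a missing value.
print(str(float("nan")).lower())  # nan

# The wrapper maps missing values to an empty string instead.
print(repr(safe_preprocess(float("nan"), str.lower)))  # ''
print(safe_preprocess("Some Post", str.lower))  # some post
```

In the notebook this could be used as `df['post'].apply(lambda x: safe_preprocess(x, preprocess_post))`.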


In [14]:
# Overwrite clean_post with the non-lemmatized variant; this is the version saved below
df['clean_post'] = df['post'].apply(lambda x: preprocess_post(x, lem=False))


In [15]:
# Save to a new CSV file (the original file is left untouched) so the clean_post column can be reused in other modules
df.to_csv("processed_dataset.csv", index=False, sep=',', quotechar='"')

In [16]:
df_check = pd.read_csv("processed_dataset.csv", sep=',')
df_check.head()



Unnamed: 0,created_date,created_timestamp,subreddit,title,author,author_created_utc,full_link,score,num_comments,num_crossposts,subreddit_subscribers,post,sentiment,clean_post
0,2010-02-11 19:47:22,1265910442.0,analytics,So what do you guys all do related to analytic...,xtom,1227476000.0,https://www.reddit.com/r/analytics/comments/b0...,7.0,4.0,0.0,,There's a lot of reasons to want to know all t...,NEGATIVE,theres lot reasons want know stuff figured id ...
1,2010-03-04 20:17:26,1267726646.0,analytics,"Google's Invasive, non-Anonymized Ad Targeting...",xtom,1227476000.0,https://www.reddit.com/r/analytics/comments/b9...,2.0,1.0,0.0,,"I'm cross posting this from /r/cyberlaw, hopef...",NEGATIVE,im cross posting rcyberlaw hopefully guys find...
2,2011-01-06 04:51:18,1294282278.0,analytics,"DotCed - Functional Web Analytics - Tagging, R...",dotced,1294282000.0,https://www.reddit.com/r/analytics/comments/ew...,1.0,1.0,,,"DotCed,a Functional Analytics Consultant, offe...",NEGATIVE,dotceda functional analytics consultant offeri...
3,2011-01-19 11:45:30,1295430330.0,analytics,Program Details - Data Analytics Course,iqrconsulting,1288245000.0,https://www.reddit.com/r/analytics/comments/f5...,0.0,0.0,,,Here is the program details of the data analyt...,NEGATIVE,program details data analytics certification c...
4,2011-01-19 21:52:28,1295466748.0,analytics,potential job in web analytics... need to anal...,therewontberiots,1278672000.0,https://www.reddit.com/r/analytics/comments/f5...,2.0,4.0,,,i decided grad school (physics) was not for me...,POSITIVE,decided grad school physics branching job mark...
