# 01 Text Preprocessing

This notebook prepares text data for sentiment analysis. The following cleaning and preprocessing steps put the text into a consistent format for analysis:

- <b>Convert to Lowercase:</b> All text is converted to lowercase for uniformity.

- <b>Remove URLs:</b> Any URLs present in the text are removed to avoid noise.

- <b>Remove Usernames:</b> Usernames starting with '@' are removed to protect privacy.

- <b>Remove Hashtags:</b> Hashtags starting with '#' are removed to simplify the text.

- <b>Remove Non-Alphabetic Characters:</b> Non-alphabetic characters, such as digits, punctuation, and special symbols, are replaced with spaces so that only alphabetic words remain.

- <b>Remove One-Character Words:</b> Single-character words are removed to filter out irrelevant terms.

- <b>Remove Stopwords:</b> Common stopwords, such as "the" and "is," are removed to reduce noise.

- <b>Lemmatize Words:</b> The remaining words are lemmatized to their base form with NLTK's WordNetLemmatizer, which treats each word as a noun by default (for example, "awards" becomes "award").

- <b>Remove Duplicate Words:</b> Duplicate words are removed while preserving the order of words.
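The steps above can be sketched on a single string using only the standard library. This is a minimal illustration, not the notebook's implementation: `TINY_STOPWORDS` and `sketch_clean` are hypothetical names, the stopword set is a small stand-in for NLTK's English list, and lemmatization is left out because it needs NLTK's WordNet data.

```python
import re

# Hypothetical stand-in for NLTK's English stopword list (assumption for illustration).
TINY_STOPWORDS = {"the", "is", "a", "to", "and", "of"}

def sketch_clean(text):
    """Apply the cleaning steps listed above, except lemmatization."""
    text = text.lower()                                  # lowercase
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # URLs
    text = re.sub(r"@\w+", " ", text)                    # usernames
    text = re.sub(r"#\w+", " ", text)                    # hashtags
    text = re.sub(r"[^a-z]", " ", text)                  # non-alphabetic characters
    words = [w for w in text.split()
             if len(w) > 1 and w not in TINY_STOPWORDS]  # one-character words, stopwords
    return " ".join(dict.fromkeys(words))                # duplicates, first occurrence kept

example = "@user The movie is GREAT great fun!! See https://example.com #film"
cleaned = sketch_clean(example)
print(cleaned)
```

`dict.fromkeys` keeps insertion order, which is what lets duplicates be dropped without reshuffling the remaining words.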

In [1]:
import pandas as pd
pd.set_option('display.max_colwidth', None)

In [3]:
# Load "combined_data.csv" and display a random sample of 10 rows

combined_df = pd.read_csv('../dataset/combined_data.csv')
combined_df.sample(10)

Unnamed: 0,corpus_name,raw_sentence
1105812,sentiment140,@djELITE hahaha.. I'm sure I could find something to wear that I would NOT look good in!
900352,sentiment140,@yaOHya Thanks for your support on my road to 1000
239872,sentiment140,@milecyrus Why u a little
939017,sentiment140,"Just tlked to a 3rd grader for baptism. I grilled her hard on sin, Jesus, knowing vs believing it...she had it down! Shes been redeemed!"
1390682,sentiment140,@Anniejunieee heyyy i saw u on icarly ! you rocked xoxoxo
1264544,sentiment140,Wow...what a great weekend we have had. Brilliant weather! One of many more to come I hope
73837,sentiment140,getting ready to go to work....hate working on a Sunday
1337309,sentiment140,"@Zoddies Eh, got up at 4:55...technically only got to the gym at 5:10 The gym, fortunately, is only 5mins up the road..."
20069,large_movie_review,"Never viewed this 1971 film and was greatly entertained by this great production created by the Walt Disney Studios and great animation creations. Angela Lansbury, (Eglantine Price) played an outstanding role as a woman who had taken a course in witch craft and was an apprentice who was beginning to fly on a broomstick and had quite a few difficulties taking off. Eglantine discovered many tricks and was able to make a bed travel to different parts of the world. However, Eglantine missed her final exams to becoming an accomplished witch. Mr. Emerlius Browne, (David Tomlinson) was the person who sold Eglantine this course in witchcraft and he tries to help her in every way possible to find her solution. Eglantine has a purpose to her madness and that is to stop the Nazi's from evading England. Great family entertainment and we need more films like this today."
653413,sentiment140,"School today.This is the last full thursday of my elementry school life.. I graduate in 4 days,I'm gonna bawl Harper's Island on 2nite!"


In [7]:
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

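# Build the stopword set and lemmatizer once, outside the function, for better performance
# (requires the NLTK 'stopwords' and 'wordnet' corpora, e.g. via nltk.download)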
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

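def clean_text(text, stop_words):
    """Return a cleaned, de-duplicated string; empty or non-string input gives ''."""
    if text and isinstance(text, str):
        # Lowercase, then replace URLs, @usernames, #hashtags, and non-letters with spaces
        text = re.sub(r'https?://\S+|www\.\S+|@\w+|#\w+|[^a-zA-Z]', ' ', text.lower())
        # Drop one-character words and stopwords, then lemmatize what remains
        text = ' '.join([lemmatizer.lemmatize(word) for word in text.split() if len(word) > 1 and word not in stop_words])
        # Remove duplicate words, keeping the first occurrence of each
        text = ' '.join(dict.fromkeys(text.split()))
    else:
        text = ''
    return text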

combined_df['clean_text'] = combined_df['raw_sentence'].apply(clean_text, stop_words=stop_words)
combined_df[['raw_sentence', 'clean_text']].sample(10)

Unnamed: 0,raw_sentence,clean_text
1189253,@stephaniepratt the black dress is the cutest dress out of all 3! Wear that one to the MTV awards,black dress cutest wear one mtv award
638865,"In a ghetto ohio hotel with my mom&amp;lilsis, annoyed as fuq! Wishin i was at tays w/ everyone. I miss em! Kwow i need conversation.",ghetto ohio hotel mom amp lilsis annoyed fuq wishin tay everyone miss em kwow need conversation
1462473,@souljaboytellem in my home town!,home town
1595434,Strange song. Strange radio. HELL YEA!,strange song radio hell yea
383180,too much work and class woww,much work class woww
1244872,"Loving toddla t and bashy, peggle also is taking over my life",loving toddla bashy peggle also taking life
1551536,Guess the hacker tryna say Ima Beast Stat outta my shit tho!,guess hacker tryna say ima beast stat outta shit tho
279635,t wanna sleep in my own bed a little nauseas still...boo!,wanna sleep bed little nausea still boo
744117,"Tweet Tweet Tweet, bored heaps",tweet bored heap
386972,@salandnat lmao! Prob... Or one if the dogs.,lmao prob one dog
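
One row in the sample above keeps the token `amp`, a leftover of the HTML entity `&amp;` in the raw tweet. If that residue is unwanted, one option is to decode entities with the standard library's `html.unescape` before the regex runs. The sketch below shows only that effect; `strip_non_alpha` is a hypothetical helper that mirrors the `[^a-zA-Z]` step and is not the notebook's `clean_text`.

```python
import html
import re

def strip_non_alpha(text):
    # Mirrors the notebook's non-alphabetic replacement, then collapses whitespace.
    return " ".join(re.sub(r"[^a-zA-Z]", " ", text.lower()).split())

raw = "with my mom&amp;lilsis"

without_decode = strip_non_alpha(raw)
with_decode = strip_non_alpha(html.unescape(raw))

print(without_decode)  # the entity name survives as a word
print(with_decode)     # the entity is decoded to "&" first, then dropped
```

Whether to decode entities is a judgment call for this dataset; the results shown in this notebook were produced without it.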


# Sentiment Labeling with TextBlob Analysis

In this section, we use the TextBlob library to label the preprocessed text. get_sentiment(text) returns TextBlob's polarity score, a float between -1.0 (most negative) and 1.0 (most positive), which we round to two decimals. categorize_sentiment(score) then maps the score to one of five labels: "Positive" (0.5 or higher), "Moderately Positive" (0.05 up to, but not including, 0.5), "Neutral" (strictly between -0.05 and 0.05), "Moderately Negative" (above -0.5 up to and including -0.05), and "Negative" (-0.5 or lower). Using five labels instead of three separates strong sentiment from mild sentiment, which gives a finer-grained view of the dataset's sentiment distribution.

In [8]:
from textblob import TextBlob

def get_sentiment(text):
    return TextBlob(text).sentiment.polarity

combined_df['textblob_polarity'] = combined_df['clean_text'].apply(get_sentiment).round(2)

def categorize_sentiment(score):
    # Checks run from highest to lowest, so each branch only needs its lower bound
    if score >= 0.5:
        return 'Positive'
    elif score >= 0.05:
        return 'Moderately Positive'
    elif score > -0.05:
        return 'Neutral'
    elif score > -0.5:
        return 'Moderately Negative'
    else:
        return 'Negative'

combined_df['sentiment_textblob'] = combined_df['textblob_polarity'].apply(categorize_sentiment)
combined_df[['clean_text', 'textblob_polarity', 'sentiment_textblob']].sample(10)

Unnamed: 0,clean_text,textblob_polarity,sentiment_textblob
523098,terrible hope find something interweb listen,-1.0,Negative
1450919,like want cake icecream prefer healthy food lol,0.65,Positive
1466996,get follower day using add everyone train pay vip,0.0,Neutral
541551,california law hunch company would rather move another state slap one,0.0,Neutral
1444830,fun day jetskiing boy couple hour sleep making soup dinner,0.3,Moderately Positive
810803,jealous dumb class th grade,-0.38,Moderately Negative
1218187,nothing annoying someone xd,-0.8,Negative
792113,weekend success dont want leave lexington today,0.3,Moderately Positive
986952,going,0.0,Neutral
954591,well look fab even say,0.0,Neutral
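
The label cutoffs are easiest to check at their edges. The sketch below restates the same thresholds in a self-contained function (`label_for` is a hypothetical name, not part of the notebook) and evaluates it at boundary values, so it is clear which side each cutoff falls on.

```python
def label_for(score):
    # Same cutoffs as categorize_sentiment above, restated so the block runs on its own.
    if score >= 0.5:
        return 'Positive'
    elif score >= 0.05:
        return 'Moderately Positive'
    elif score > -0.05:
        return 'Neutral'
    elif score > -0.5:
        return 'Moderately Negative'
    else:
        return 'Negative'

boundary_scores = [1.0, 0.5, 0.49, 0.05, 0.04, 0.0, -0.04, -0.05, -0.49, -0.5, -1.0]
labels = {score: label_for(score) for score in boundary_scores}

for score, label in labels.items():
    print(f"{score:>5}: {label}")
```

Because the notebook rounds polarity to two decimals before labeling, a raw score just under a cutoff can be rounded onto it and take the label on the other side.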


In [9]:
# Save sentiment analysis results with text preprocessing and TextBlob labeling to CSV
combined_df.to_csv('../dataset/sentiment_results.csv', index=False)