<div style="border-bottom: 2px solid black; padding: 10px;">
    <h1 style="margin-bottom: 3px;">Machine Learning and business analytics</h1>
    <h3 style="margin-top: 0px; color: gray;">Sentiment analysis</h3>
    <p>Date: September 2, 2024</p>
</div>

- **Group:** Team Delta

## Sentiment analysis - Part 2 - Indexing

### Requirements

- Python 3.10.12
- pandas
- numpy
- matplotlib
- scikit-learn
- langdetect/langid
- nltk

### Environment

#### Libraries

In [16]:
import pandas as pd
import numpy as np
import warnings
import csv
import matplotlib.pyplot as plt
from scipy.interpolate import interp1d
import re
import math
import langid
import nltk
from nltk.corpus import stopwords

warnings.filterwarnings("ignore", category=pd.errors.DtypeWarning)

#### Constants and environment variables

In [17]:
training_file_path = '../dataset/preprocessed/df_training_preprocessed.csv'
test_file_path = '../dataset/preprocessed/df_test_preprocessed.csv'

TITLE_FONT_SIZE = 24
LABEL_FONT_SIZE = 20
AXIS_FONT_SIZE = 15
FIGURE_MAX_WIDTH = 8
FIGURE_MAX_HEIGHT = FIGURE_MAX_WIDTH*3/4

CLEANR = re.compile('<.*?>') #remove html tags

SPECIAL_CHARACTERS_REGEX = r'[-!@#$%^&*()_+={}\[\]:;"\'|<,>.?/~`]' # remove special character
SYMBOLS_REGEX = re.compile("["
                     u"\U0001F600-\U0001F64F"  # Emoticons
                     u"\U0001F300-\U0001F5FF"  # Symbols & pictographs
                     u"\U0001F680-\U0001F6FF"  # Transport & map symbols
                     u"\U0001F1E0-\U0001F1FF"  # Flags (iOS)
                     u"\U00002500-\U00002BEF"  # Box drawing, geometric shapes, dingbats, arrows
                     u"\U00002702-\U000027B0"  # Dingbats
                     u"\U000024C2-\U0001F251"  # Wide range: enclosed characters up to U+1F251, including CJK
                     u"\U0001f926-\U0001f937"
                     u"\U00010000-\U0010ffff"
                     u"\u2640-\u2642"
                     u"\u2600-\u2B55"
                     u"\u200d"
                     u"\u23cf"
                     u"\u23e9"
                     u"\u231a"
                     u"\ufe0f"  # Variation selector-16 (emoji presentation)
                     u"\u3030"
                     "]+", re.UNICODE)

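To illustrate how a character-class pattern such as `SYMBOLS_REGEX` behaves, here is a minimal sketch with a reduced pattern. The name `EMOJI_SKETCH`, the helper `strip_emoji`, and the sample string are made up for this example, and only three of the ranges above are covered.

```python
import re

# Reduced, illustrative subset of the ranges in SYMBOLS_REGEX (assumption:
# only emoticons, pictographs, and transport symbols are covered here).
EMOJI_SKETCH = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # Emoticons
    "\U0001F300-\U0001F5FF"  # Symbols & pictographs
    "\U0001F680-\U0001F6FF"  # Transport & map symbols
    "]+"
)

def strip_emoji(text):
    """Delete every run of characters matched by EMOJI_SKETCH."""
    return EMOJI_SKETCH.sub("", text)

sample = "happy bday \U0001F600\U0001F389"  # grinning face + party popper
cleaned = strip_emoji(sample)
print(repr(cleaned))
```

Because the match is replaced with an empty string, the space that preceded the emoji is left in place; a later whitespace-collapsing step takes care of it.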
#### Stopword domain

We use the English stopword list from the NLTK library:

In [18]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Vaso\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [19]:
STOP_WORDS = set(stopwords.words('english'))
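The NLTK English list contains negations such as `not` and `no`, which can carry sentiment. The sketch below shows one possible way to keep them; it is an optional variation, not part of this notebook's pipeline. `SAMPLE_STOP_WORDS` is a hypothetical hand-written stand-in for a few NLTK entries so the example runs without downloading the corpus, and `NEGATIONS` and `filter_tokens` are likewise illustrative names.

```python
# Hypothetical stand-in for a few entries of the NLTK English list.
SAMPLE_STOP_WORDS = {"i", "do", "not", "this", "the", "no", "is"}

# Assumption: these negations are worth keeping for sentiment analysis.
NEGATIONS = {"not", "no", "nor"}

# Set difference removes the negations from the stopword set.
SENTIMENT_STOP_WORDS = SAMPLE_STOP_WORDS - NEGATIONS

def filter_tokens(text, stop_words):
    """Drop every whitespace-separated token found in stop_words."""
    return " ".join(w for w in text.split() if w.lower() not in stop_words)

default_result = filter_tokens("I do not like this", SAMPLE_STOP_WORDS)
negation_result = filter_tokens("I do not like this", SENTIMENT_STOP_WORDS)
print(default_result)
print(negation_result)
```

With the real list the same idea would be `STOP_WORDS - NEGATIONS`.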

### Helper functions

In [20]:
def is_english(text):
    try:
        lang, _ = langid.classify(text)
        return lang == 'en'
    except Exception: # Treat any classification failure as non-English
        return False

def remove_stopwords(text):
    words = text.split()
    filtered_words = [word for word in words if word.lower() not in STOP_WORDS]
    return ' '.join(filtered_words)

def cleanUp(field):
    # Treat missing values (None, NaN, pd.NA) as empty text
    if pd.isna(field):
        return ""
    field = field.lower()
    field = re.sub(CLEANR, ' ', field) # Replace HTML tags with a space
    field = re.sub(r'[^\w\s]', ' ', field) # Remove punctuation
    field = re.sub(r'\d+', ' ', field) # Remove digits
    field = re.sub(r'\s+', ' ', field).strip() # Collapse extra whitespace
    field = re.sub(r'\b\w\b', ' ', field) # Remove single-character words
    field = re.sub(r'\b\w{2}\b', ' ', field) # Remove two-letter words
    field = re.sub(SYMBOLS_REGEX, '', field) # Remove emoji and symbols
    field = re.sub(SPECIAL_CHARACTERS_REGEX, '', field) # Remove remaining special characters
    field = remove_stopwords(field) # Drop stopwords; split/join also normalizes spacing
    return field
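The following self-contained sketch walks one tweet through the main clean-up steps. It is a simplified mirror of `cleanUp`, not a replacement: `clean_sketch` and `MINI_STOP_WORDS` are hypothetical names, the stopword set is a tiny hand-written stand-in for the NLTK list, and the single-letter and two-letter steps are merged into one pattern.

```python
import re

# Hypothetical mini stopword set standing in for the NLTK English list.
MINI_STOP_WORDS = {"the", "what", "but", "very"}

def clean_sketch(text):
    """Mirror the main steps of cleanUp on a single string."""
    text = text.lower()
    text = re.sub(r"<.*?>", " ", text)        # HTML tags
    text = re.sub(r"[^\w\s]", " ", text)      # punctuation (emoji too)
    text = re.sub(r"\d+", " ", text)          # digits
    text = re.sub(r"\b\w{1,2}\b", " ", text)  # one- and two-letter words
    # split() discards the leftover runs of spaces before re-joining
    return " ".join(w for w in text.split() if w not in MINI_STOP_WORDS)

result = clean_sketch("Last session of the day http://twitpic.com/67ezh")
print(result)
```

The link is not dropped; it is broken into the fragments `http`, `twitpic`, `com`, and `ezh`, because the punctuation and digit steps only remove the characters between them.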

### Loading the dataset

We load the preprocessed files.

In [21]:
df_training_filtered = pd.read_csv(training_file_path, encoding='ISO-8859-1')
df_test_filtered = pd.read_csv(test_file_path, encoding='ISO-8859-1')

Check for attribute types:

In [22]:
df_training_filtered.dtypes

textID       object
text         object
sentiment    object
dtype: object

In [23]:
df_test_filtered.dtypes

textID       object
text         object
sentiment    object
dtype: object

### Indexing

We index the `text` column: each sentence is lowercased, stripped of symbols, and reduced to a space-separated sequence of tokens stored in a new `index` column.

Create index attribute:

In [24]:
df_training_indexed = df_training_filtered.copy()

df_test_indexed = df_test_filtered.copy()


for dataset in [df_training_indexed, df_test_indexed]:
    dataset.loc[:,'index'] = dataset.loc[:,'text'].str.lower()
    dataset.loc[:,'index'] = dataset.loc[:,'index'].astype('string')

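Remove punctuation, digits, extra whitespace, single letters, two-letter words, emoticons, symbols, and stopwords. Links are not dropped as a whole: only their punctuation and digits are stripped, so fragments such as `http twitpic com` remain in the output shown after this step.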

In [25]:

for dataset in [df_training_indexed, df_test_indexed]:
    # Apply the clean-up to every value of the column at once
    dataset['index'] = dataset['index'].apply(cleanUp)
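The clean-up step splits links into word fragments and does not drop them. If whole links should be removed instead, one option is to strip them before the punctuation step. The sketch below shows that idea; `URL_REGEX` and `remove_urls` are hypothetical names, and the pattern is an assumption about what counts as a link, not part of the original pipeline.

```python
import re

# Assumed pattern: anything starting with http://, https://, or www. up to
# the next whitespace counts as a link.
URL_REGEX = re.compile(r"(?:https?://|www\.)\S+")

def remove_urls(text):
    """Replace each link with a space, then collapse repeated whitespace."""
    without_links = URL_REGEX.sub(" ", text)
    return re.sub(r"\s+", " ", without_links).strip()

first = remove_urls("Last session of the day http://twitpic.com/67ezh")
second = remove_urls("http://twitpic.com/4w75p - I like it!!")
print(first)
print(second)
```

This would have to run on the raw text, since once punctuation is removed the `://` marker that identifies a link is gone.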

In [26]:
df_test_indexed

| | textID | text | sentiment | index |
|---|---|---|---|---|
| 0 | f87dea47db | Last session of the day http://twitpic.com/67ezh | neutral | last session day http twitpic com ezh |
| 1 | 96d74cb729 | Shanghai is also really exciting (precisely -... | positive | shanghai also really exciting precisely skyscr... |
| 2 | eee518ae67 | Recession hit Veronique Branquinho, she has to... | negative | recession hit veronique branquinho quit compan... |
| 3 | 01082688c6 | happy bday! | positive | happy bday |
| 4 | 33987a8ee5 | http://twitpic.com/4w75p - I like it!! | positive | http twitpic com like |
| ... | ... | ... | ... | ... |
| 3370 | 0bb5b4d19c | Friday evening......what to do, what to do. I... | neutral | friday evening idea |
| 3371 | e5f0e6ef4b | its at 3 am, im very tired but i can`t sleep ... | negative | tired sleep try |
| 3372 | 416863ce47 | All alone in this old house again. Thanks for... | positive | alone old house thanks net keeps alive kicking... |
| 3373 | 6332da480c | I know what you mean. My little dog is sinkin... | negative | know mean little dog sinking depression wants ... |


### Save the indexed datasets to disk

In [27]:
df_training_indexed.to_csv('../dataset/index/df_training_indexed.csv', index=False)

df_test_indexed.to_csv('../dataset/index/df_test_indexed.csv', index=False)
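One thing to watch when these files are read back: a row whose `index` became an empty string after cleaning is written as an empty CSV field, and `pd.read_csv` turns empty fields into `NaN` by default. The sketch below reproduces this with a small in-memory frame. The two sample rows are made up, and filling with `""` is only one possible way to handle it.

```python
import io
import pandas as pd

# Made-up sample: the second row has an empty index after cleaning.
df = pd.DataFrame({"textID": ["a1", "b2"], "index": ["happy bday", ""]})

# Write to an in-memory buffer, not to disk, to keep the sketch
# self-contained.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)

reloaded = pd.read_csv(buffer)
n_missing = int(reloaded["index"].isna().sum())  # empty string came back as NaN

# Restore empty strings so downstream string operations do not fail.
reloaded["index"] = reloaded["index"].fillna("")
print(n_missing, reloaded["index"].tolist())
```

Passing `keep_default_na=False` to `pd.read_csv` is an alternative, at the cost of disabling missing-value detection for every column.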