## <div align="center"> MIDTERM EXAM IF540-L MACHINE LEARNING </div>
## <div align="center"> Odd Semester 2024/2025 </div>
## <div align="center"> Applying a Recurrent Neural Network to Sentiment Analysis on Social Media X </div>

---
### Group 7

#### Group Members:
1. Leonardo Tyoes Huibu - 00000065503
2. Dylan William - 00000067644
3. Eduardus Farrel Tirtawinata - 00000069931
4. Emanuel Bernandhika Dwi Friskola - 00000077703


---

### Dataset used for the project:

1. [Sentiment140 dataset with 1.6 million tweets](https://www.kaggle.com/datasets/kazanova/sentiment140) (source: Kaggle)

### Results

#### Column Names
##### 1. target: the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
##### 2. text: the text of the tweet

#### 1. Load Data

In [4]:
import pandas as pd
import numpy as np

# Read data
data = pd.read_csv('tweets.csv', encoding='latin-1', header=None)

# Rename columns
data.columns = ['target', 'id', 'date', 'flag', 'user', 'text']

# Map sentiment labels to binary (0 = negative, 1 = positive)
# by replacing the positive label 4 with 1
data['target'] = data['target'].replace(4, 1)

# Drop unneeded columns/features, keeping only the label and the tweet text
data = data[['target', 'text']]

data.head()

|   | target | text |
|---|--------|------|
| 0 | 0 | @switchfoot http://twitpic.com/2y1zl - Awww, t... |
| 1 | 0 | is upset that he can't update his Facebook by ... |
| 2 | 0 | @Kenichan I dived many times for the ball. Man... |
| 3 | 0 | my whole body feels itchy and like its on fire |
| 4 | 0 | @nationwideclass no, it's not behaving at all.... |
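
As a quick sanity check on the label remapping above, the following sketch applies the same 4 to 1 substitution to a small hand-made list of labels and counts each class. The toy labels and the helper name `remap_labels` are assumptions for illustration only; on the real DataFrame the equivalent check would be `data['target'].value_counts()`.

```python
from collections import Counter


def remap_labels(labels):
    """Map Sentiment140 polarity 4 (positive) to 1; leave other values unchanged."""
    return [1 if label == 4 else label for label in labels]


toy_labels = [0, 4, 4, 0, 4]  # hypothetical sample, not taken from the dataset
binary_labels = remap_labels(toy_labels)
class_counts = Counter(binary_labels)

print(binary_labels)
print(dict(class_counts))
```

Any rows labelled 2 (neutral) would pass through this mapping unchanged, so it is worth confirming that only 0 and 1 remain before training a binary classifier.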


#### 2. Data Preprocessing (Tokenizing, cleaning text, etc.)

In [14]:
import re
from nltk.corpus import stopwords
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split

# Download the NLTK stopword list (imported here, but not applied by clean_tweet below)
import nltk
nltk.download('stopwords')

# Function to clean text
def clean_tweet(tweet):
    tweet = re.sub(r'http\S+', '', tweet)  # Remove URLs
    tweet = re.sub(r'@\w+', '', tweet)     # Remove mentions
    tweet = re.sub(r'#', '', tweet)        # Remove the '#' symbol (the hashtag word is kept)
    tweet = re.sub(r'[^\w\s]', '', tweet)  # Remove punctuation
    tweet = tweet.lower()                  # Convert to lowercase
    return tweet

# Apply the cleaning function to the tweets
data['cleaned_text'] = data['text'].apply(clean_tweet)

# Tokenization and padding
max_words = 20000  # Maximum number of words kept in the vocabulary
max_len = 100      # Sequence length in tokens (shorter tweets are padded, longer ones truncated)

tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(data['cleaned_text'])

# Convert tweets to sequences
sequences = tokenizer.texts_to_sequences(data['cleaned_text'])

# Pad sequences to ensure uniform length
X = pad_sequences(sequences, maxlen=max_len)

# Target variable
y = data['target'].values

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\dylan\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
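
To make the tokenization and padding step more concrete, here is a simplified pure-Python sketch of what it does conceptually. It assumes that word indices are assigned by descending frequency starting at 1, that only words with an index below `num_words` are kept, and that padding and truncation happen at the front of each sequence (the Keras defaults). The helper names `build_word_index`, `texts_to_sequences`, and `pad_pre` are hypothetical and are not the Keras implementation.

```python
from collections import Counter


def build_word_index(texts):
    """Rank words by frequency (most frequent gets index 1); index 0 is left for padding."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def texts_to_sequences(texts, word_index, num_words):
    """Replace each word by its index, dropping words outside the top num_words - 1."""
    return [
        [word_index[w] for w in text.split() if w in word_index and word_index[w] < num_words]
        for text in texts
    ]


def pad_pre(sequence, maxlen):
    """Keep the last maxlen tokens and left-pad with zeros up to maxlen."""
    sequence = sequence[-maxlen:]
    return [0] * (maxlen - len(sequence)) + sequence


toy_texts = ["need a hug", "not the whole crew", "a hug a day"]  # toy cleaned tweets
word_index = build_word_index(toy_texts)
sequences = texts_to_sequences(toy_texts, word_index, num_words=5)
padded = [pad_pre(seq, maxlen=4) for seq in sequences]

print(word_index)
print(padded)
```

With `max_len = 100` and tweets that are mostly far shorter, the bulk of each row in `X` will be leading zeros, which is why index 0 is reserved for padding.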


#### Compare Data To See Cleaned Data

In [7]:
comparison_df = pd.DataFrame({
    'Uncleaned_Tweet': data['text'],        # Original uncleaned tweets
    'Cleaned_Tweet': data['cleaned_text']   # Cleaned tweets after preprocessing
})

# Display the first 10 rows to compare
comparison_df.head(10)

|   | Uncleaned_Tweet | Cleaned_Tweet |
|---|-----------------|---------------|
| 0 | @switchfoot http://twitpic.com/2y1zl - Awww, t... | awww thats a bummer you shoulda got david ... |
| 1 | is upset that he can't update his Facebook by ... | is upset that he cant update his facebook by t... |
| 2 | @Kenichan I dived many times for the ball. Man... | i dived many times for the ball managed to sa... |
| 3 | my whole body feels itchy and like its on fire | my whole body feels itchy and like its on fire |
| 4 | @nationwideclass no, it's not behaving at all.... | no its not behaving at all im mad why am i he... |
| 5 | @Kwesidei not the whole crew | not the whole crew |
| 6 | Need a hug | need a hug |
| 7 | @LOLTrish hey long time no see! Yes.. Rains a... | hey long time no see yes rains a bit only a ... |
| 8 | @Tatiana_K nope they didn't have it | nope they didnt have it |
| 9 | @twittera que me muera ? | que me muera |
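
The comparison shows that common function words such as "is", "that", and "the" survive cleaning: the stopword list is downloaded, but `clean_tweet` never uses it. If stopword filtering is wanted, a minimal sketch could look like the following. The small `STOPWORDS` set is a hand-written stand-in (an assumption) for `stopwords.words('english')`, and `remove_stopwords` is a hypothetical helper that is not part of the original notebook.

```python
import re

# Hypothetical stand-in for nltk's English stopword list, kept tiny for illustration
STOPWORDS = {"is", "that", "he", "his", "by", "a", "the", "to"}


def clean_tweet(tweet):
    """Same cleaning steps as the notebook cell: URLs, mentions, '#', punctuation, lowercase."""
    tweet = re.sub(r'http\S+', '', tweet)
    tweet = re.sub(r'@\w+', '', tweet)
    tweet = re.sub(r'#', '', tweet)
    tweet = re.sub(r'[^\w\s]', '', tweet)
    return tweet.lower()


def remove_stopwords(text, stopwords=STOPWORDS):
    """Drop every whitespace-separated token that appears in the stopword set."""
    return " ".join(word for word in text.split() if word not in stopwords)


example = remove_stopwords(clean_tweet("is upset that he can't update his Facebook"))
print(example)
```

Whether to filter at all is a design choice: negations such as "not" and "no" carry sentiment, so a list that removes them may hurt a sentiment classifier.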


#### 3. Split Data

In [9]:
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("Split Done")

Split Done
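
To illustrate what an 80/20 split means in terms of row indices, here is a small standard-library sketch. It shuffles indices with a fixed seed and reserves the first 20% for testing. It does not reproduce scikit-learn's exact shuffling, and `split_indices` is a hypothetical helper used only for illustration.

```python
import math
import random


def split_indices(n_samples, test_size=0.2, seed=42):
    """Shuffle row indices reproducibly and split them into train and test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n_samples * test_size)  # number of rows held out for testing
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(10)
print(len(train_idx), len(test_idx))
```

In the notebook itself, passing `stratify=y` to `train_test_split` is one option for keeping the negative/positive ratio identical in both parts.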


#### 4. Create Embeddings using GloVe

In [11]:
import os
import numpy as np

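# Load GloVe embeddings into a dict that maps each word to its vector.
# Note: embedding_dim is accepted for reference but is not used while parsing.
def load_glove_embeddings(glove_file_path, embedding_dim):
    embeddings_index = {}
    with open(glove_file_path, 'r', encoding='utf8') as f:
        for line in f:
            values = line.split()
            word = values[0]                                 # First field is the token
            coefs = np.asarray(values[1:], dtype='float32')  # Remaining fields are the vector
            embeddings_index[word] = coefs
    return embeddings_index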

glove_file_path = 'glove.twitter.27B.200d.txt' 
embedding_dim = 200 

# Load GloVe embeddings
embeddings_index = load_glove_embeddings(glove_file_path, embedding_dim)
print(f'Loaded {len(embeddings_index)} word vectors.')

Loaded 1193514 word vectors.
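
The GloVe text format stores one token per line followed by its vector components, separated by spaces. The sketch below parses a tiny in-memory sample in that format and, unlike the loader above, uses `embedding_dim` to skip lines whose vector length does not match. The sample vectors and the helper name `parse_glove` are made up for illustration and plain Python lists stand in for NumPy arrays.

```python
import io


def parse_glove(lines, embedding_dim):
    """Parse 'word v1 v2 ... vd' lines into a dict; skip lines with the wrong length."""
    index = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        if len(values) != embedding_dim:
            continue  # malformed line: wrong number of components
        index[word] = [float(v) for v in values]
    return index


# Hypothetical 3-dimensional sample; real GloVe Twitter vectors here have 200 dimensions
toy_file = io.StringIO("good 0.1 0.2 0.3\nbad -0.1 -0.2 -0.3\nbroken 0.5\n")
toy_embeddings = parse_glove(toy_file, embedding_dim=3)

print(len(toy_embeddings))
print(toy_embeddings["bad"])
```

A length check like this is optional, but it guards the later `embedding_matrix[i] = embedding_vector` assignment against shape mismatches.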


In [15]:
# Create an embedding matrix
word_index = tokenizer.word_index
num_words = min(max_words, len(word_index) + 1)  # +1 because index 0 is reserved for padding
embedding_matrix = np.zeros((num_words, embedding_dim))

for word, i in word_index.items():
    if i < num_words:  # Only consider the top words
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:  # Check if the word is in GloVe
            embedding_matrix[i] = embedding_vector  # Add GloVe vector to matrix

# Save the matrix so that it can be reloaded later without rebuilding it
np.save('embedding_matrix.npy', embedding_matrix)
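
Words that are missing from GloVe keep an all-zero row in the matrix, so it can be useful to measure how much of the vocabulary is actually covered. The following sketch computes that ratio on toy dictionaries. The helper name `glove_coverage` and the toy data are assumptions; in the notebook, `tokenizer.word_index` and `embeddings_index` would take their place.

```python
def glove_coverage(word_index, embeddings_index, num_words):
    """Return the share of kept vocabulary words that have a pretrained vector,
    together with the list of kept words that do not."""
    kept = [word for word, i in word_index.items() if i < num_words]
    missing = [word for word in kept if word not in embeddings_index]
    ratio = (len(kept) - len(missing)) / len(kept) if kept else 0.0
    return ratio, missing


# Hypothetical vocabulary and embeddings, for illustration only
toy_word_index = {"good": 1, "bad": 2, "lol": 3, "xyzzy": 4}
toy_embeddings = {"good": [0.1], "bad": [-0.1], "lol": [0.3]}

ratio, missing = glove_coverage(toy_word_index, toy_embeddings, num_words=5)
print(ratio, missing)
```

A low ratio would suggest that the cleaning step produces tokens GloVe does not know, for example words merged by punctuation removal such as "cant" or "didnt".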

In [17]:
import numpy as np

# Load the embedding matrix
embedding_matrix = np.load('embedding_matrix.npy')

# Print the shape of the matrix
print(f'Loaded embedding matrix with shape: {embedding_matrix.shape}')

print(embedding_matrix[:5])

Loaded embedding matrix with shape: (20000, 200)
[[ 0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00 ...
(remaining output truncated)

In [None]:
!jupyter nbconvert --to html "./UTS2024_IF540L_KelasE_Kelompok_7.ipynb" --output-dir="./"

### Next step:
* convert the generated html file to PDF using the online tool: https://www.sejda.com/html-to-pdf
* choose the following settings:
    * Page size: One long page
    * Page Orientation: auto
    * Use print stylesheet
* Submit your ipython notebook and PDF files

Markdown basics https://markdown-guide.readthedocs.io/en/latest/basics.html#