# **Import Packages**

In [1]:
import pandas as pd
import re
import ast
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk import FreqDist
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# **Read Data**

In [2]:
df = pd.read_csv('https://raw.githubusercontent.com/rizalespe/Dataset-Sentimen-Analisis-Bahasa-Indonesia/refs/heads/master/dataset_tweet_sentimen_tayangan_tv.csv')
df.head()

|   | Id | Sentiment | Acara TV | Jumlah Retweet | Text Tweet |
|---|----|-----------|----------|----------------|------------|
| 0 | 1 | positive | HitamPutihTransTV | 12 | Undang @N_ShaniJKT48 ke hitamputih, pemenang S... |
| 1 | 2 | positive | HitamPutihTransTV | 6 | Selamat berbuka puasa Semoga amal ibadah hari ... |
| 2 | 3 | positive | HitamPutihTransTV | 9 | Ada nih di trans7 hitam putih, dia dpt penghar... |
| 3 | 4 | positive | HitamPutihTransTV | 2 | selamat ya mas @adietaufan masuk hitamputih |
| 4 | 5 | positive | HitamPutihTransTV | 1 | Asiknya nonton Hitam Putih Trans7 |


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Id              400 non-null    int64 
 1   Sentiment       400 non-null    object
 2   Acara TV        400 non-null    object
 3   Jumlah Retweet  400 non-null    int64 
 4   Text Tweet      400 non-null    object
dtypes: int64(2), object(3)
memory usage: 15.8+ KB


In [4]:
# Take an explicit copy so later column assignments do not trigger SettingWithCopyWarning
df = df[['Sentiment', 'Text Tweet']].copy()
df.shape

(400, 2)

## A. Case Folding

In [5]:
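def convert_text(text):
    """Lowercase a tweet; also accepts a stringified list of tokens."""
    try:
        teks = ast.literal_eval(text)
    except (ValueError, SyntaxError, TypeError):
        teks = None  # Ordinary tweets are not valid Python literals
    if isinstance(teks, (list, tuple)):
        # Stringified token list: lowercase each token and re-join
        return " ".join(str(word).lower() for word in teks)
    if isinstance(text, str):
        # Plain string: lowercase every whitespace-separated word
        return " ".join(word.lower() for word in text.split())
    return text  # Return non-string values unchanged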

df['Text Tweet'] = df['Text Tweet'].apply(convert_text)


In [6]:
df['Text Tweet'].tolist()

['undang @n_shanijkt48 ke hitamputih, pemenang ssk jkt48 harusnya mjkt48 ini lebih layak di undang karena prestasinya',
 'selamat berbuka puasa semoga amal ibadah hari ni diterima allah #hitamputih',
 'ada nih di trans7 hitam putih, dia dpt penghargaan juga di norwegia #hitamputih',
 'selamat ya mas @adietaufan masuk hitamputih',
 'asiknya nonton hitam putih trans7',
 '@trans7 acara paling komplit dan menarik apalagi ada hitam putih',
 'hitam putih t7 inspiratif banget',
 'suka banget dengan acara hitam putih',
 'keren lu bro #hitamputihtrans7',
 'tadi ada yg liat hitam putih di trans7 ga, ada sanggu ganteng',
 'cinta mengikat silaturahmi di hati ... #lunamaya #hitamputihtrans7 .... https://www.instagram.com/p/btqszj3jo9a/',
 'terima kasih pak.... sudah mau membantu kami untuk menyekolahkan adik saya.... #hitamputihtrans7',
 'semoga lancar hitamputihtrans7',
 'trans7 hitam putih terbaik https://www.instagram.com/p/btyytxmgvkd/',
 'acara hitam putih paling bagus buat di lihat',
 '@trans
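
The case-folding step above can be illustrated on two sample inputs. This is a minimal sketch, not the notebook's own cell: `fold_case` is a hypothetical stand-in for `convert_text`, and the second input (a stringified token list) is an assumed example of the case that the `ast.literal_eval` branch appears to be written for.

```python
import ast


def fold_case(text):
    """Lowercase a tweet; hypothetical stand-in for the notebook's convert_text."""
    try:
        parsed = ast.literal_eval(text)
    except (ValueError, SyntaxError, TypeError):
        parsed = None  # ordinary tweets are not valid Python literals
    if isinstance(parsed, (list, tuple)):
        # Stringified token list: lowercase each token and re-join
        return " ".join(str(token).lower() for token in parsed)
    if isinstance(text, str):
        # Plain string: lowercase and collapse runs of whitespace
        return " ".join(text.lower().split())
    return text  # non-string values pass through unchanged


print(fold_case("Asiknya nonton Hitam Putih Trans7"))
print(fold_case("['Hitam', 'Putih', 'Trans7']"))
```

Both branches produce a single lowercase, space-separated string, so the later cleaning functions can treat every row the same way.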

## B. Cleaning

In [7]:
# Compile each pattern once instead of on every function call
URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # flags (iOS)
    "\U00002702-\U000027B0"  # dingbats
    "\U000024C2-\U0001F251"  # broad range: enclosed characters, but also CJK and other scripts
    "]+"
)
CHARACTER_PATTERN = re.compile(r'[^a-zA-Z0-9\s]')

def remove_urls(text):
    return URL_PATTERN.sub('', text)

def remove_emojis(text):
    return EMOJI_PATTERN.sub('', text)

def remove_character(text):
    # Drops everything except ASCII letters, digits, and whitespace (including '@', '#', '_')
    return CHARACTER_PATTERN.sub('', text)

df['Text Tweet'] = df['Text Tweet'].apply(remove_urls)
df['Text Tweet'] = df['Text Tweet'].apply(remove_emojis)
df['Text Tweet'] = df['Text Tweet'].apply(remove_character)


In [8]:
df['Text Tweet'].tolist()

['undang nshanijkt48 ke hitamputih pemenang ssk jkt48 harusnya mjkt48 ini lebih layak di undang karena prestasinya',
 'selamat berbuka puasa semoga amal ibadah hari ni diterima allah hitamputih',
 'ada nih di trans7 hitam putih dia dpt penghargaan juga di norwegia hitamputih',
 'selamat ya mas adietaufan masuk hitamputih',
 'asiknya nonton hitam putih trans7',
 'trans7 acara paling komplit dan menarik apalagi ada hitam putih',
 'hitam putih t7 inspiratif banget',
 'suka banget dengan acara hitam putih',
 'keren lu bro hitamputihtrans7',
 'tadi ada yg liat hitam putih di trans7 ga ada sanggu ganteng',
 'cinta mengikat silaturahmi di hati  lunamaya hitamputihtrans7  ',
 'terima kasih pak sudah mau membantu kami untuk menyekolahkan adik saya hitamputihtrans7',
 'semoga lancar hitamputihtrans7',
 'trans7 hitam putih terbaik ',
 'acara hitam putih paling bagus buat di lihat',
 'trans7 undang da3rafly di acara hitam putih yadia jebolan dangdut academi pinter nyanyi lagu india suaranya keren'
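
The cleaned output above still contains double and trailing spaces, because URLs and symbols are deleted rather than replaced. One possible refinement is sketched below. It is an assumption, not part of the original notebook: `clean_tweet` is a hypothetical helper that substitutes a space for each removed span and then collapses whitespace.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
NON_ALNUM_PATTERN = re.compile(r"[^a-zA-Z0-9\s]")


def clean_tweet(text):
    """Hypothetical variant of the cleaning step that keeps word boundaries."""
    text = URL_PATTERN.sub(" ", text)        # drop links
    text = NON_ALNUM_PATTERN.sub(" ", text)  # punctuation, emojis, '@', '#' become spaces
    return " ".join(text.split())            # collapse repeated and trailing whitespace


sample = ("cinta mengikat silaturahmi di hati ... #lunamaya "
          "#hitamputihtrans7 .... https://www.instagram.com/p/btqszj3jo9a/")
print(clean_tweet(sample))
```

The trade-off is that handles containing underscores are split: `@n_shanijkt48` would become `n shanijkt48` rather than `nshanijkt48`. Whether that is preferable depends on how mentions should be treated downstream.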

In [10]:
nltk.download('stopwords')
stop_words = set(stopwords.words('indonesian'))
# Split each cleaned tweet on whitespace, then drop Indonesian stopwords
filtered_tokens = [[word for word in text.split() if word not in stop_words]
                   for text in df['Text Tweet']]
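
The stopword-removal step in the cell above can be sketched without downloading any corpus. Everything here is illustrative: `SAMPLE_STOPWORDS` is a small hand-picked subset standing in for NLTK's Indonesian list, and `collections.Counter` is used for counting (NLTK's `FreqDist`, imported at the top, is a `Counter` subclass).

```python
from collections import Counter

# Illustrative stopword subset (an assumption); the notebook's cell would use
# stopwords.words('indonesian') from NLTK instead.
SAMPLE_STOPWORDS = {"di", "ke", "dan", "yang", "ini", "dengan", "ada"}

# Three cleaned tweets taken from the output shown earlier
cleaned_tweets = [
    "asiknya nonton hitam putih trans7",
    "suka banget dengan acara hitam putih",
    "trans7 acara paling komplit dan menarik apalagi ada hitam putih",
]

# Whitespace tokenization is enough here because punctuation is already removed
filtered_tokens = [
    [word for word in tweet.split() if word not in SAMPLE_STOPWORDS]
    for tweet in cleaned_tweets
]

# Count how often each remaining token occurs across all tweets
token_counts = Counter(token for tokens in filtered_tokens for token in tokens)
print(token_counts.most_common(3))
```

Keeping one token list per tweet, rather than one flat list, preserves the row alignment with the `Sentiment` column for any later vectorization.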