Before predicting from or analyzing data, a preprocessing stage is needed to make sure the data is of good quality, so that a model built on it can reach good accuracy and classification performance.


---


The dataset used for text preprocessing is the Kaggle collection of roughly 10,000 tweets, divided into two classes: tweets that describe a real disaster (target = 1) and all other tweets (target = 0).

Source: https://www.kaggle.com/competitions/nlp-getting-started/data

### **LIBRARY**

In [1]:
!pip install pyspellchecker

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [2]:
from google.colab import drive
from collections import Counter
from nltk.corpus import stopwords
from spellchecker import SpellChecker
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from sklearn.preprocessing import LabelEncoder

import re
import nltk
import string
import pandas as pd

In [3]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [4]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [5]:
nltk.download('omw-1.4')

[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

### **DATASET PREPARATION**

In [6]:
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [7]:
# Import data
# Assumption: test.csv sits in the same Drive folder as train.csv

dTrain = pd.read_csv("/content/drive/My Drive/Datasets/Tweets/train.csv")
dTest = pd.read_csv("/content/drive/My Drive/Datasets/Tweets/test.csv")

In [8]:
# Display the last rows of the training data

dTrain.tail()

|  | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 7608 | 10869 | NaN | NaN | Two giant cranes holding a bridge collapse int... | 1 |
| 7609 | 10870 | NaN | NaN | @aria_ahrary @TheTawniest The out of control w... | 1 |
| 7610 | 10871 | NaN | NaN | M1.94 [01:04 UTC]?5km S of Volcano Hawaii. htt... | 1 |
| 7611 | 10872 | NaN | NaN | Police investigating after an e-bike collided ... | 1 |
| 7612 | 10873 | NaN | NaN | The Latest: More Homes Razed by Northern Calif... | 1 |


In [9]:
# Count the rows in the training data
amount = len(dTrain)

# Get the (rows, columns) shape of the training data
length = dTrain.shape

print("Total number of rows :", amount)
print("Rows and columns     :", length)

Total number of rows : 7613
Rows and columns     : (7613, 5)


In [10]:
# Split the data into features (x) and target (y)
x = dTrain.iloc[:, :-1].copy() # Feature columns; copied so later column assignments do not touch dTrain
y = dTrain.iloc[:, -1] # Target column

### **DATASET PREPROCESSING**

In [11]:
# Encode the target column with LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)

print("1 for True")
print("0 for False")
print("Encode Result : ", y)

1 for True
0 for False
Encode Result :  [1 1 1 ... 1 1 1]

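As a rough illustration of what label encoding does, the sketch below maps each distinct label to an integer in sorted order, which is how `LabelEncoder` assigns its codes. The function name and the `"yes"`/`"no"` labels are hypothetical; the notebook's target column already holds 0 and 1, so encoding leaves it unchanged.

```python
def fit_label_mapping(labels):
    """Map each distinct label to an integer, following sorted label order."""
    classes = sorted(set(labels))
    return {label: index for index, label in enumerate(classes)}

# Hypothetical string labels, used only for illustration
labels = ["yes", "no", "yes", "no", "no"]
mapping = fit_label_mapping(labels)
encoded = [mapping[label] for label in labels]

print(mapping)  # {'no': 0, 'yes': 1}
print(encoded)  # [1, 0, 1, 0, 0]
```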

#### **LOWER TEXT**

In [12]:
# Convert the selected text columns to lowercase
class LowerText:
  def process(self, text):
    return text.str.lower()

lower_text = LowerText()
x['text'] = lower_text.process(x['text'])
x['location'] = lower_text.process(x['location'])
x['keyword'] = lower_text.process(x['keyword'])

In [13]:
x.tail()

|  | id | keyword | location | text |
|---|---|---|---|---|
| 7608 | 10869 | NaN | NaN | two giant cranes holding a bridge collapse int... |
| 7609 | 10870 | NaN | NaN | @aria_ahrary @thetawniest the out of control w... |
| 7610 | 10871 | NaN | NaN | m1.94 [01:04 utc]?5km s of volcano hawaii. htt... |
| 7611 | 10872 | NaN | NaN | police investigating after an e-bike collided ... |
| 7612 | 10873 | NaN | NaN | the latest: more homes razed by northern calif... |


#### **MISSING VALUE**

In [14]:
# Check each column for missing values
x.isnull().sum()

id             0
keyword       61
location    2533
text           0
dtype: int64

In [15]:
# Handle missing values by filling them with placeholder strings
x['keyword'] = x['keyword'].fillna('No Keyword')
x['location'] = x['location'].fillna('No Location')

# Recheck each column for missing values
x.isnull().sum()

id          0
keyword     0
location    0
text        0
dtype: int64

#### **REMOVAL OF URLs**

In [16]:
def remove_urls(text):
  # Match http(s):// links and bare www. links
  url_pattern = re.compile(r'https?://\S+|www\.\S+')
  return url_pattern.sub(r'', text)

x['text'] = x['text'].apply(remove_urls)

In [17]:
x.tail()

|  | id | keyword | location | text |
|---|---|---|---|---|
| 7608 | 10869 | No Keyword | No Location | two giant cranes holding a bridge collapse int... |
| 7609 | 10870 | No Keyword | No Location | @aria_ahrary @thetawniest the out of control w... |
| 7610 | 10871 | No Keyword | No Location | m1.94 [01:04 utc]?5km s of volcano hawaii. |
| 7611 | 10872 | No Keyword | No Location | police investigating after an e-bike collided ... |
| 7612 | 10873 | No Keyword | No Location | the latest: more homes razed by northern calif... |

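To see URL removal in isolation, here is a minimal standalone sketch. It assumes a pattern with two alternatives, `https?://\S+` for links with a scheme and `www\.\S+` for bare `www.` links; the sample tweet and its URLs are made up for illustration.

```python
import re

# Two alternatives: links that start with http:// or https://, and bare www. links
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")

def remove_urls(text):
    """Delete every URL-like token from the text."""
    return URL_PATTERN.sub("", text)

# Hypothetical sample; the URLs are placeholders
sample = "volcano hawaii. http://t.co/abc123 via www.example.com"
cleaned = remove_urls(sample)
print(repr(cleaned))  # 'volcano hawaii.  via '
```

The substitution leaves the surrounding whitespace in place, so a later step that splits and rejoins on whitespace (as the stopword removal does) tidies up the doubled spaces.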

#### **CHAT WORDS CONVERSION** 

In [18]:
chat_word = """
AFAIK=As Far As I Know
AFK=Away From Keyboard
ASAP=As Soon As Possible
ATK=At The Keyboard
ATM=At The Moment
A3=Anytime, Anywhere, Anyplace
BAK=Back At Keyboard
BBL=Be Back Later
BBS=Be Back Soon
BFN=Bye For Now
B4N=Bye For Now
BRB=Be Right Back
BRT=Be Right There
BTW=By The Way
B4=Before
CU=See You
CUL8R=See You Later
CYA=See You
FAQ=Frequently Asked Questions
FC=Fingers Crossed
FWIW=For What It's Worth
FYI=For Your Information
GAL=Get A Life
GG=Good Game
GN=Good Night
GMTA=Great Minds Think Alike
GR8=Great!
G9=Genius
IC=I See
ICQ=I Seek you (also a chat program)
ILU=I Love You
IMHO=In My Honest/Humble Opinion
IMO=In My Opinion
IOW=In Other Words
IRL=In Real Life
KISS=Keep It Simple, Stupid
LDR=Long Distance Relationship
LMAO=Laugh My A.. Off
LOL=Laughing Out Loud
LTNS=Long Time No See
L8R=Later
MTE=My Thoughts Exactly
M8=Mate
NRN=No Reply Necessary
OIC=Oh I See
PITA=Pain In The A..
PRT=Party
PRW=Parents Are Watching
ROFL=Rolling On The Floor Laughing
ROFLOL=Rolling On The Floor Laughing Out Loud
ROTFLMAO=Rolling On The Floor Laughing My A.. Off
SK8=Skate
STATS=Your sex and age
ASL=Age, Sex, Location
THX=Thank You
TTFN=Ta-Ta For Now!
TTYL=Talk To You Later
U=You
U2=You Too
U4E=Yours For Ever
WB=Welcome Back
WTF=What The F...
WTG=Way To Go!
WUF=Where Are You From?
W8=Wait...
7K=Sick:-D Laugher
"""

In [19]:
# Parse each "ABBR=Expansion" line into a lookup dictionary
chat_words_map_dict = {}
for line in chat_word.split("\n"):
  if line != "":
    # Split on the first "=" only, so the expansion is kept whole
    chat, chat_expanded = line.split("=", 1)
    chat_words_map_dict[chat] = chat_expanded

# Set of abbreviations for fast membership tests
chat_words_list = set(chat_words_map_dict)

In [20]:
def chat_words_conversion(text):
  """Expand whitespace-separated tokens that match a known chat abbreviation."""
  new_text = []
  for w in text.split():
    if w.upper() in chat_words_list:
      new_text.append(chat_words_map_dict[w.upper()])
    else:
      new_text.append(w)
  return " ".join(new_text)

x['text'] = x['text'].apply(chat_words_conversion)

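The parse-then-expand idea can be tried on a small table. The sketch below is a simplified variant, not the notebook's exact code: `CHAT_TABLE`, `parse_chat_table`, and `expand_chat_words` are hypothetical names, and the table holds only three entries.

```python
# Hypothetical three-entry table in the same ABBR=Expansion format
CHAT_TABLE = """
BRB=Be Right Back
IMO=In My Opinion
FWIW=For What It's Worth
"""

def parse_chat_table(table):
    """Build {abbreviation: expansion}, skipping blank lines."""
    mapping = {}
    for line in table.splitlines():
        line = line.strip()
        if not line:
            continue
        abbreviation, expansion = line.split("=", 1)  # split on the first "=" only
        mapping[abbreviation] = expansion
    return mapping

def expand_chat_words(text, mapping):
    """Replace tokens that match an abbreviation, ignoring case."""
    return " ".join(mapping.get(token.upper(), token) for token in text.split())

chat_map = parse_chat_table(CHAT_TABLE)
print(expand_chat_words("brb imo that was close", chat_map))
# Be Right Back In My Opinion that was close
```

Two things are worth noticing: a token with punctuation attached, such as `brb,`, does not match, and the expansions bring capital letters back into text that was lowercased earlier.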
#### **REMOVAL OF PUNCTUATIONS**

In [21]:
PUNCT_TO_REMOVE = string.punctuation
def remove_punctuation(text):
  """custom function to remove the punctuation"""
  return text.translate(str.maketrans('','', PUNCT_TO_REMOVE))

x['text'] = x['text'].apply(lambda text: remove_punctuation(text))
x['location'] = x['location'].apply(lambda text: remove_punctuation(text))
x['keyword'] = x['keyword'].apply(lambda text: remove_punctuation(text))

In [22]:
x.tail()

|  | id | keyword | location | text |
|---|---|---|---|---|
| 7608 | 10869 | No Keyword | No Location | two giant cranes holding a bridge collapse int... |
| 7609 | 10870 | No Keyword | No Location | ariaahrary thetawniest the out of control wild... |
| 7610 | 10871 | No Keyword | No Location | m194 0104 utc5km s of volcano hawaii |
| 7611 | 10872 | No Keyword | No Location | police investigating after an ebike collided w... |
| 7612 | 10873 | No Keyword | No Location | the latest more homes razed by northern califo... |

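Deleting punctuation outright glues together whatever stood on either side of it, which is why `utc]?5km` becomes `utc5km`. The sketch below contrasts that with one possible alternative, replacing each punctuation mark with a space. The alternative and both function names are assumptions for illustration, not part of the notebook.

```python
import string

# Deletes every punctuation character
DELETE_TABLE = str.maketrans("", "", string.punctuation)
# Alternative (an assumption): turn every punctuation character into a space
SPACE_TABLE = str.maketrans({ch: " " for ch in string.punctuation})

def delete_punctuation(text):
    return text.translate(DELETE_TABLE)

def space_out_punctuation(text):
    # Split and rejoin to collapse the runs of spaces left behind
    return " ".join(text.translate(SPACE_TABLE).split())

sample = "m1.94 [01:04 utc]?5km s of volcano hawaii."
print(delete_punctuation(sample))     # m194 0104 utc5km s of volcano hawaii
print(space_out_punctuation(sample))  # m1 94 01 04 utc 5km s of volcano hawaii
```

Which behavior is preferable depends on the data: deleting keeps `e-bike` as the single token `ebike`, while spacing out splits it into `e` and `bike`.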

#### **REMOVAL OF STOPWORDS**

In [23]:
STOPWORDS = set(stopwords.words('english'))
def remove_stopwords(text):
  """custom function to remove the stopwords"""
  return " ".join([word for word in str(text).split() if word not in STOPWORDS])

x['text'] = x['text'].apply(lambda text: remove_stopwords(text))

In [24]:
x.tail()

|  | id | keyword | location | text |
|---|---|---|---|---|
| 7608 | 10869 | No Keyword | No Location | two giant cranes holding bridge collapse nearb... |
| 7609 | 10870 | No Keyword | No Location | ariaahrary thetawniest control wild fires cali... |
| 7610 | 10871 | No Keyword | No Location | m194 0104 utc5km volcano hawaii |
| 7611 | 10872 | No Keyword | No Location | police investigating ebike collided car little... |
| 7612 | 10873 | No Keyword | No Location | latest homes razed northern california wildfir... |

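Stopword filtering is a plain membership test on whitespace-separated tokens. The sketch below uses a tiny hypothetical stopword set in place of NLTK's English list, so it runs without any downloads.

```python
# Hypothetical mini stopword set; the notebook uses NLTK's English list instead
STOPWORDS = {"a", "an", "the", "of", "after", "by", "more", "out", "with"}

def remove_stopwords(text, stopwords=STOPWORDS):
    """Keep only the tokens that are not in the stopword set."""
    return " ".join(word for word in text.split() if word not in stopwords)

print(remove_stopwords("the latest more homes razed by northern california wildfire"))
# latest homes razed northern california wildfire

print(remove_stopwords("The latest"))
# The latest
```

The second call shows that matching is case-sensitive: `The` survives because only `the` is in the set. That is why lowercasing has to happen before this step, and why capitalized text introduced afterwards slips through.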

#### **REMOVAL OF FREQUENT WORDS**

In [25]:
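# Count word frequencies in the 'keyword' column
# Note: the frequent words found here are later removed from the 'text' column
count = Counter()
for text in x['keyword'].values:
  for word in text.split():
    count[word] += 1

count.most_common(15)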

[('No', 61),
 ('Keyword', 61),
 ('fatalities', 45),
 ('armageddon', 42),
 ('deluge', 42),
 ('body20bags', 41),
 ('damage', 41),
 ('harm', 41),
 ('sinking', 41),
 ('collided', 40),
 ('evacuate', 40),
 ('fear', 40),
 ('outbreak', 40),
 ('siren', 40),
 ('twister', 40)]

In [26]:
FREQWORDS = set([w for (w, wc) in count.most_common(15)])
def remove_freqwords(text):
  """custom function to remove the frequent words"""
  return " ".join([word for word in str(text).split() if word not in FREQWORDS])

x['text'] = x['text'].apply(lambda text: remove_freqwords(text))

In [27]:
x.tail()

|  | id | keyword | location | text |
|---|---|---|---|---|
| 7608 | 10869 | No Keyword | No Location | two giant cranes holding bridge collapse nearb... |
| 7609 | 10870 | No Keyword | No Location | ariaahrary thetawniest control wild fires cali... |
| 7610 | 10871 | No Keyword | No Location | m194 0104 utc5km volcano hawaii |
| 7611 | 10872 | No Keyword | No Location | police investigating ebike car little portugal... |
| 7612 | 10873 | No Keyword | No Location | latest homes razed northern california wildfir... |

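The notebook builds its frequency table from the `keyword` column. As a point of comparison, the sketch below counts tokens in the same texts it then filters, which is the more common form of frequent-word removal. The sample texts and function names are hypothetical.

```python
from collections import Counter

def most_frequent_words(texts, n):
    """Return the n most common whitespace-separated tokens across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(text.split())
    return {word for word, _ in counts.most_common(n)}

def drop_words(text, words):
    """Remove every token that appears in the given set."""
    return " ".join(word for word in text.split() if word not in words)

# Hypothetical stand-in for an already cleaned text column
texts = [
    "fire near forest fire station",
    "fire crews battle forest blaze",
    "storm warning issued",
]
frequent = most_frequent_words(texts, n=2)
print(sorted(frequent))                         # ['fire', 'forest']
print([drop_words(t, frequent) for t in texts])
# ['near station', 'crews battle blaze', 'storm warning issued']
```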

#### **REMOVAL OF RARE WORDS**

In [28]:
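n_rare_words = 10
# Note: with a step of 1, this slice keeps every entry except the last n_rare_words + 1,
# so most of the counted keyword-column words end up in RAREWORDS.
# Selecting only the n rarest entries would use a step of -1:
# count.most_common()[:-n_rare_words-1:-1]
RAREWORDS = set([w for (w, wc) in count.most_common()[:-n_rare_words-1 : 1]])

def remove_rarewords(text):
  """Remove every token that appears in RAREWORDS."""
  return " ".join([word for word in str(text).split() if word not in RAREWORDS])

x['text'] = x['text'].apply(lambda text: remove_rarewords(text))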

In [29]:
x.tail()

|  | id | keyword | location | text |
|---|---|---|---|---|
| 7608 | 10869 | No Keyword | No Location | two giant cranes holding bridge nearby homes |
| 7609 | 10870 | No Keyword | No Location | ariaahrary thetawniest control wild fires cali... |
| 7610 | 10871 | No Keyword | No Location | m194 0104 utc5km volcano hawaii |
| 7611 | 10872 | No Keyword | No Location | investigating ebike car little portugal ebike ... |
| 7612 | 10873 | No Keyword | No Location | latest homes northern california abc news |

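The step of the slice applied to `most_common()` decides which end of the ranking is selected. The small example below, with made-up counts, shows both variants side by side.

```python
from collections import Counter

# Hypothetical counts, distinct so the ranking is unambiguous
counts = Counter({"fire": 5, "flood": 4, "storm": 3, "quake": 2, "siren": 1})
ranked = counts.most_common()  # most frequent first
n_rare_words = 2

# Step -1 walks backwards from the end: the n rarest entries
rarest = [w for w, _ in ranked[:-n_rare_words - 1:-1]]
# Step 1 walks forwards and stops before the last n + 1 entries
all_but_tail = [w for w, _ in ranked[:-n_rare_words - 1:1]]

print(rarest)        # ['siren', 'quake']
print(all_but_tail)  # ['fire', 'flood']
```

With a step of 1 the result holds the most frequent words rather than the rarest ones, so the choice of step changes which words a rare-word filter removes.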

#### **STEMMING**

In [30]:
stemmer = PorterStemmer()
def stem_words(text):
  """Reduce each token to its Porter stem (e.g. 'cranes' -> 'crane')."""
  return " ".join([stemmer.stem(word) for word in text.split()])

x['text'] = x['text'].apply(lambda text: stem_words(text))

In [31]:
x.tail()

|  | id | keyword | location | text |
|---|---|---|---|---|
| 7608 | 10869 | No Keyword | No Location | two giant crane hold bridg nearbi home |
| 7609 | 10870 | No Keyword | No Location | ariaahrari thetawniest control wild fire calif... |
| 7610 | 10871 | No Keyword | No Location | m194 0104 utc5km volcano hawaii |
| 7611 | 10872 | No Keyword | No Location | investig ebik car littl portug ebik rider suff... |
| 7612 | 10873 | No Keyword | No Location | latest home northern california abc news |


#### **LEMMATIZATION**

In [32]:
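lemmatizer = WordNetLemmatizer()
def lemmatize_words(text):
  """Lemmatize each token with WordNet (default part of speech: noun)."""
  # The input here is already stemmed, so most tokens pass through unchanged
  return " ".join([lemmatizer.lemmatize(word) for word in text.split()])

x['text'] = x['text'].apply(lambda text: lemmatize_words(text))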

In [33]:
x.tail()

|  | id | keyword | location | text |
|---|---|---|---|---|
| 7608 | 10869 | No Keyword | No Location | two giant crane hold bridg nearbi home |
| 7609 | 10870 | No Keyword | No Location | ariaahrari thetawniest control wild fire calif... |
| 7610 | 10871 | No Keyword | No Location | m194 0104 utc5km volcano hawaii |
| 7611 | 10872 | No Keyword | No Location | investig ebik car littl portug ebik rider suff... |
| 7612 | 10873 | No Keyword | No Location | latest home northern california abc news |
