# Tweet Intent Classification with Twitter Dataset
<hr>

We will build an intent classification model for tweets using a GRU network. The dataset was scraped with 'Twint'. Since there is no standard train/test split for this dataset, we will use 10-fold cross-validation (CV).
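
To make the evaluation scheme concrete, here is a minimal standard-library sketch of how 10-fold cross-validation partitions sample indices into contiguous, unshuffled folds. The helper name `k_fold_indices` and the toy size of 25 samples are illustrative assumptions; the notebook itself imports scikit-learn's `KFold` for the real split.

```python
def k_fold_indices(n_samples, n_splits=10):
    """Yield (train_idx, test_idx) pairs for contiguous, unshuffled folds."""
    # The first (n_samples % n_splits) folds get one extra sample,
    # so every sample lands in exactly one test fold.
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


# Hypothetical toy size; the real split runs over the labelled tweets.
folds = list(k_fold_indices(n_samples=25, n_splits=10))
for i, (train_idx, test_idx) in enumerate(folds):
    print(f"fold {i}: train={len(train_idx)} test={len(test_idx)}")
```

Each sample is used for testing exactly once and for training in the other nine folds, so the reported score would be the average over the ten held-out folds.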

<hr>

## Load the libraries

In [1]:
import tensorflow as tf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
import nltk
import random
from sklearn.model_selection import KFold

In [18]:
import csv

import torch
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from tqdm import tqdm, trange
from transformers import GPT2LMHeadModel, GPT2Tokenizer, AdamW, get_linear_schedule_with_warmup

ModuleNotFoundError: No module named 'torch'

In [2]:
# Test GPU availability

In [3]:
tf.config.list_physical_devices()

[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'),
 PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

## Load the dataset

In [4]:
df = pd.read_csv('dataset/tweetlabels1000_labeled.xlsx - Sheet2.csv')

In [5]:
print(df.shape)
df

(1003, 4)


Unnamed: 0,No,Label,Username,Tweet
0,0,none,chyrisalys,wifi watcha pasti indihome
1,1,indirect complaint,woiidal,indihome ada masalah apasih??!!
2,2,remark,ranieaw,sore sore hujan rebahan bareng bocil nonton nu...
3,3,negative remark,daeguv_,"indihome plis untuk tanggal 10,13 jangan kesur..."
4,4,indirect complaint,fiorincha,INDIHOME NGAPASIIII
...,...,...,...,...
998,998,campaign,sangwarior,Bisa sharing bareng sobat indihome gini kan as...
999,999,campaign,sangwarior,Sekarang nyantai dulu bentar sambil scroll sos...
1000,1000,negative remark,untextend,@JefriHandri Sini indihome down dr semalem jam...
1001,1001,campaign,sangwarior,Ini barusan slesei sob... Lumayan buat nyari k...


In [6]:
# drop rows with label = campaign, own tweet, & incomplete
df = df.loc[~df['Label'].isin(['campaign','own tweet','incomplete'])].copy()
df

Unnamed: 0,No,Label,Username,Tweet
0,0,none,chyrisalys,wifi watcha pasti indihome
1,1,indirect complaint,woiidal,indihome ada masalah apasih??!!
2,2,remark,ranieaw,sore sore hujan rebahan bareng bocil nonton nu...
3,3,negative remark,daeguv_,"indihome plis untuk tanggal 10,13 jangan kesur..."
4,4,indirect complaint,fiorincha,INDIHOME NGAPASIIII
...,...,...,...,...
994,994,inquiry,pecintamochi,"@IndiHomeCare Min, kalo jatuh tempo pembayaran..."
995,995,inquiry,tetehaisyah51,"@IndiHomeCare Min, 1 IndiHome TV bisa gak berl..."
997,997,direct complaint,untextend,@IndiHomeCare @fauzindrianto Dari semalem down...
1000,1000,negative remark,untextend,@JefriHandri Sini indihome down dr semalem jam...


In [7]:
# Label encoding & drop unused columns
label_map = {'indirect complaint': 0, 'remark': 1, 'negative remark': 2, 'direct compliment': 3,
             'direct complaint': 4, 'none': 5, 'inquiry': 6}
df['label'] = df['Label'].map(label_map)
df.drop(['No', 'Label', 'Username'], axis=1, inplace=True)
df

Unnamed: 0,Tweet,label
0,wifi watcha pasti indihome,5
1,indihome ada masalah apasih??!!,0
2,sore sore hujan rebahan bareng bocil nonton nu...,1
3,"indihome plis untuk tanggal 10,13 jangan kesur...",2
4,INDIHOME NGAPASIIII,0
...,...,...
994,"@IndiHomeCare Min, kalo jatuh tempo pembayaran...",6
995,"@IndiHomeCare Min, 1 IndiHome TV bisa gak berl...",6
997,@IndiHomeCare @fauzindrianto Dari semalem down...,4
1000,@JefriHandri Sini indihome down dr semalem jam...,2


In [8]:
df.info()
df.groupby(by='label').count()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 722 entries, 0 to 1002
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Tweet   722 non-null    object
 1   label   722 non-null    int64 
dtypes: int64(1), object(1)
memory usage: 16.9+ KB


Unnamed: 0_level_0,Tweet
label,Unnamed: 1_level_1
0,68
1,153
2,128
3,64
4,110
5,123
6,76
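
The counts above show that the classes are not balanced. As a quick sanity check, the following sketch turns those counts into proportions and computes the accuracy of always predicting the most frequent class, which any trained model should beat. The counts and label names are copied from the outputs and the encoding dictionary above; the variable names are illustrative.

```python
# Counts copied from the groupby output above (encoded label -> tweets).
class_counts = {0: 68, 1: 153, 2: 128, 3: 64, 4: 110, 5: 123, 6: 76}

# Inverse of the encoding dictionary used in the label-encoding cell.
label_names = {
    0: "indirect complaint", 1: "remark", 2: "negative remark",
    3: "direct compliment", 4: "direct complaint", 5: "none", 6: "inquiry",
}

total = sum(class_counts.values())
for label, count in sorted(class_counts.items(), key=lambda kv: -kv[1]):
    print(f"{label_names[label]:<20} {count:>4} {count / total:6.1%}")

# Accuracy of always predicting the most frequent class.
majority_baseline = max(class_counts.values()) / total
print(f"total={total} majority baseline={majority_baseline:.3f}")
```

The total of 722 agrees with the row count reported by `df.info()`.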


In [9]:
# Separate the tweet and label
tweets, labels = list(df.Tweet), list(df.label)

In [10]:
tweets[123]

'@IndiHomeCare Halo min, wifi indihome LOS dari tadi siang! Ini gimna?? Urgent nih gw mau make!!  https://t.co/WRXvsHaAUJ'

## Data Preprocessing
<hr>

For data preprocessing we will do:
- Lowercasing
- Cleansing (remove non-ASCII characters, digits, punctuation, extra whitespace, URLs)
- Normalization of Indonesian words
- Stemming of affixed words ('kata berimbuhan')
- Tokenization
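
The normalization step can be pictured as a dictionary lookup that maps informal spellings to their standard forms. The sketch below is only an illustration: the helper `normalize_words` and the tiny `SLANG_MAP` are hypothetical, and the lexicon actually used for this dataset is not shown here. Stemming would need a dedicated Indonesian stemmer and is not sketched.

```python
# Hypothetical mini-lexicon; a real one would hold far more entries.
SLANG_MAP = {
    "gak": "tidak",
    "dr": "dari",
    "kalo": "kalau",
    "masi": "masih",
    "semalem": "semalam",
}


def normalize_words(text, lexicon=SLANG_MAP):
    """Replace each whitespace-separated token that appears in the lexicon."""
    return " ".join(lexicon.get(token, token) for token in text.split())


sample = "xxxuserxxx sini indihome down dr semalem"
normalized = normalize_words(sample)
print(normalized)
```

Because the lookup is on whole tokens, a word with punctuation attached (for example `mana.`) would not match unless punctuation is stripped or split off first.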

Tokenization will be handled by the __Tokenizer__ class in TensorFlow.

<b>For padding sequences, one way to choose the maximum sequence length is to pick the length of the longest sentence in the training set.</b>
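
To illustrate the idea, here is a standard-library sketch that maps words to integer ids and pads every sequence to the length of the longest one. The helper names (`build_word_index`, `texts_to_sequences`, `pad_to_longest`) are hypothetical, and the sketch does not reproduce TensorFlow's `Tokenizer` (no punctuation filtering or out-of-vocabulary token). It pads on the right, whereas Keras's `pad_sequences` pads on the left by default.

```python
from collections import Counter


def build_word_index(texts):
    """Map each word to an integer id, most frequent first; 0 is reserved for padding."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


def texts_to_sequences(texts, word_index):
    """Convert texts to lists of ids, skipping words missing from the index."""
    return [[word_index[w] for w in text.split() if w in word_index] for text in texts]


def pad_to_longest(sequences, pad_id=0):
    """Right-pad every sequence with pad_id up to the longest sequence length."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences], max_len


train_texts = [
    "wifi watcha pasti indihome",
    "indihome ngapasiiii",
    "xxxindihomexxx xxxindihomexxx mana. masi lemot",
]
word_index = build_word_index(train_texts)
padded, max_len = pad_to_longest(texts_to_sequences(train_texts, word_index))
print(max_len)
for row in padded:
    print(row)
```

Under cross-validation, the maximum length and the word index would be derived from the training folds only, so a held-out tweet longer than that maximum has to be truncated.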

In [11]:
# to lowercase
df['tweet_lower'] = df['Tweet'].str.lower()
df

Unnamed: 0,Tweet,label,tweet_lower
0,wifi watcha pasti indihome,5,wifi watcha pasti indihome
1,indihome ada masalah apasih??!!,0,indihome ada masalah apasih??!!
2,sore sore hujan rebahan bareng bocil nonton nu...,1,sore sore hujan rebahan bareng bocil nonton nu...
3,"indihome plis untuk tanggal 10,13 jangan kesur...",2,"indihome plis untuk tanggal 10,13 jangan kesur..."
4,INDIHOME NGAPASIIII,0,indihome ngapasiiii
...,...,...,...
994,"@IndiHomeCare Min, kalo jatuh tempo pembayaran...",6,"@indihomecare min, kalo jatuh tempo pembayaran..."
995,"@IndiHomeCare Min, 1 IndiHome TV bisa gak berl...",6,"@indihomecare min, 1 indihome tv bisa gak berl..."
997,@IndiHomeCare @fauzindrianto Dari semalem down...,4,@indihomecare @fauzindrianto dari semalem down...
1000,@JefriHandri Sini indihome down dr semalem jam...,2,@jefrihandri sini indihome down dr semalem jam...


In [12]:
# Regex Manipulation for text cleansing
import re
import string

In [13]:
def text_cleansing(text):
    """Cleanse a lowercased tweet and mask @mentions."""
    # replace non-ASCII characters (emoji, Chinese characters, etc.) with '?'
    text = text.encode('ascii', 'replace').decode('ascii')
    
    # remove digits (regex substitution)
    text = re.sub(r'\d+', '', text)
    
    # remove punctuation (currently disabled), reference: https://stackoverflow.com/a/34294398
    # text = text.translate(str.maketrans('', '', string.punctuation))
    
    # remove whitespace at the beginning and end of the tweet
    text = text.strip()
    
    # collapse runs of whitespace inside the tweet into a single space
    text = re.sub(r'\s+', ' ', text)
    
    # remove URLs
    text = re.sub(r"https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)", "", text)
    
    # mask IndiHome account mentions
    text = re.sub(r"@indihome", "xxxindihomexxx", text)
    
    # collapse handle suffixes (e.g. "xxxindihomexxxcare") into the bare mask
    text = re.sub(r"xxxindihomexxx\w+", "xxxindihomexxx", text)
    
    # mask all other user mentions
    text = re.sub(r"@\w+", "xxxuserxxx", text)
    
    return text

In [14]:
# apply text_cleansing() to every lowercased tweet
df['tweet_clean'] = df['tweet_lower'].apply(text_cleansing)
df

Unnamed: 0,Tweet,label,tweet_lower,tweet_clean
0,wifi watcha pasti indihome,5,wifi watcha pasti indihome,wifi watcha pasti indihome
1,indihome ada masalah apasih??!!,0,indihome ada masalah apasih??!!,indihome ada masalah apasih??!!
2,sore sore hujan rebahan bareng bocil nonton nu...,1,sore sore hujan rebahan bareng bocil nonton nu...,sore sore hujan rebahan bareng bocil nonton nu...
3,"indihome plis untuk tanggal 10,13 jangan kesur...",2,"indihome plis untuk tanggal 10,13 jangan kesur...","indihome plis untuk tanggal , jangan kesurupan?"
4,INDIHOME NGAPASIIII,0,indihome ngapasiiii,indihome ngapasiiii
...,...,...,...,...
994,"@IndiHomeCare Min, kalo jatuh tempo pembayaran...",6,"@indihomecare min, kalo jatuh tempo pembayaran...","xxxindihomexxx min, kalo jatuh tempo pembayara..."
995,"@IndiHomeCare Min, 1 IndiHome TV bisa gak berl...",6,"@indihomecare min, 1 indihome tv bisa gak berl...","xxxindihomexxx min, indihome tv bisa gak berla..."
997,@IndiHomeCare @fauzindrianto Dari semalem down...,4,@indihomecare @fauzindrianto dari semalem down...,xxxindihomexxx xxxuserxxx dari semalem down .....
1000,@JefriHandri Sini indihome down dr semalem jam...,2,@jefrihandri sini indihome down dr semalem jam...,xxxuserxxx sini indihome down dr semalem jam a...


In [15]:
print(df['Tweet'][69])
print(df['Tweet'][63])

@bherlindayl @IndiHome  https://t.co/C07nnk5jkk
@IndiHomeCare @IndiHome Mana. Masi lemot


In [16]:
print(df['tweet_clean'][69])
print(df['tweet_clean'][63])

xxxuserxxx xxxindihomexxx 
xxxindihomexxx xxxindihomexxx mana. masi lemot
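
The second output shows why the IndiHome mask runs before the generic user mask: both `@indihomecare` and `@indihome` end up as the same token. The sketch below isolates just the three masking substitutions from `text_cleansing`; the function name `mask_handles` is illustrative.

```python
import re


def mask_handles(text):
    """Mask IndiHome accounts first, then every remaining @mention."""
    text = re.sub(r"@indihome", "xxxindihomexxx", text)
    # Fold handle suffixes such as "care" into the same mask token.
    text = re.sub(r"xxxindihomexxx\w+", "xxxindihomexxx", text)
    text = re.sub(r"@\w+", "xxxuserxxx", text)
    return text


masked = mask_handles("@indihomecare @indihome mana. masi lemot")
print(masked)
```

If the generic `@\w+` substitution ran first, every IndiHome handle would be turned into `xxxuserxxx` and the distinction between the provider's accounts and ordinary users would be lost.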
