<a href="https://colab.research.google.com/github/daniahmad92/NLP-Klasifikasi-TEXT-IMDB/blob/main/NLP_Klasifikasi_Teks_Ulasan_Film_IMDB.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Personal Information**

Name: Dadan Ahmad Dani

Email: email@dadanahmaddani.com


**Submission**


Learning Machine Learning Development (Belajar Pengembangan Machine Learning)

First Project: Building an NLP Model with TensorFlow

IMDB Movie Review Text Classification

**Libraries**

In [1]:
# Libraries
import tensorflow as tf
import pandas as pd
import numpy as np
import os
print(tf.__version__)

2.8.2


**Dataset**

In [3]:
# Load the IMDB reviews (assumes IMDB-Dataset.csv is in the working directory)
df = pd.read_csv('IMDB-Dataset.csv')
df.head()

                                              review  sentiment
0  One of the other reviewers has mentioned that ...   positive
1  A wonderful little production. <br /><br />The...   positive
2  I thought this was a wonderful way to spend ti...   positive
3  Basically there's a family where a little boy ...   negative
4  Petter Mattei's "Love in the Time of Money" is...   positive


**Exploring the Dataset Size**

In [6]:
df.shape

(50000, 2)

In [7]:
df['sentiment'].value_counts()

positive    25000
negative    25000
Name: sentiment, dtype: int64

**Preparing the Training Data**

In [8]:
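# Encode the labels as integers: positive -> 0, negative -> 1
df['sentiment'] = df['sentiment'].map({'positive': 0, 'negative': 1})

# Random 80/20 split; test_df holds every row not sampled into train_df
train_df = df.sample(frac=0.8, random_state=100)
test_df = df.drop(train_df.index)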

print(f"Train data shape: {train_df.shape}")
print(f"Test  data shape: {test_df.shape}")

Train data shape: (40000, 2)
Test  data shape: (10000, 2)


**Tokenizing the Data**

In [9]:
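# Only words with an index below 8000 are kept when texts are converted to sequences
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=8000)
# Note: the vocabulary is fitted on the train and test reviews together
tokenizer.fit_on_texts(np.append(train_df['review'].values, test_df['review'].values))

word_index = tokenizer.word_index
# Size of the full vocabulary (+1 for padding index 0); not limited by num_words
nb_words = len(word_index) + 1

train_seq = tokenizer.texts_to_sequences(train_df["review"])
test_seq = tokenizer.texts_to_sequences(test_df["review"])

# Pad or truncate every review to exactly 100 tokens
train_data = tf.keras.preprocessing.sequence.pad_sequences(train_seq, maxlen=100)
test_data = tf.keras.preprocessing.sequence.pad_sequences(test_seq, maxlen=100)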

print(f"Train data shape: {train_data.shape}")
print(f"Test  data shape: {test_data.shape}")

Train data shape: (40000, 100)
Test  data shape: (10000, 100)
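
To make the effect of `num_words` and `maxlen` easier to see, here is a minimal, library-free sketch of the same three steps: ranking words by frequency, dropping words at or above the `num_words` cutoff, and left-padding with zeros. The helper functions and the toy reviews are hypothetical and only mimic the Keras behavior under simplifying assumptions (whitespace splitting, alphabetical tie-breaking); they are not the Keras implementation.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1 (0 is reserved for padding)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Ties are broken alphabetically here, which is an assumption of this sketch.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: rank for rank, word in enumerate(ranked, start=1)}

def texts_to_sequences(texts, word_index, num_words):
    """Keep only words whose index is below num_words; every other word is dropped."""
    return [[word_index[w] for w in text.lower().split()
             if w in word_index and word_index[w] < num_words]
            for text in texts]

def pad_sequences(sequences, maxlen):
    """Keep the last maxlen tokens and left-pad with 0 ('pre' truncation and padding)."""
    padded = []
    for seq in sequences:
        seq = seq[-maxlen:]
        padded.append([0] * (maxlen - len(seq)) + seq)
    return padded

# Toy data, not taken from the IMDB dataset.
train_texts = ["the movie was good", "the movie was bad", "the plot was thin"]
word_index = build_word_index(train_texts)
sequences = texts_to_sequences(["the movie was good and fun"], word_index, num_words=4)

print(word_index)
print(pad_sequences(sequences, maxlen=5))  # [[0, 0, 1, 3, 2]]
```

With `num_words=4` only indices 1 to 3 survive, so "good" (index 5) is silently dropped along with the unseen words "and" and "fun". The same thing happens in the notebook to every word ranked 8000 or lower.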


In [11]:
word_index

{'the': 1,
 'and': 2,
 'a': 3,
 'of': 4,
 'to': 5,
 'is': 6,
 'br': 7,
 'in': 8,
 'it': 9,
 'i': 10,
 'this': 11,
 'that': 12,
 'was': 13,
 'as': 14,
 'for': 15,
 'with': 16,
 'movie': 17,
 'but': 18,
 'film': 19,
 'on': 20,
 'not': 21,
 'you': 22,
 'are': 23,
 'his': 24,
 'have': 25,
 'be': 26,
 'one': 27,
 'he': 28,
 'all': 29,
 'at': 30,
 'by': 31,
 'an': 32,
 'they': 33,
 'so': 34,
 'who': 35,
 'from': 36,
 'like': 37,
 'or': 38,
 'just': 39,
 'her': 40,
 'out': 41,
 'about': 42,
 'if': 43,
 "it's": 44,
 'has': 45,
 'there': 46,
 'some': 47,
 'what': 48,
 'good': 49,
 'when': 50,
 'more': 51,
 'very': 52,
 'up': 53,
 'no': 54,
 'time': 55,
 'my': 56,
 'even': 57,
 'would': 58,
 'she': 59,
 'which': 60,
 'only': 61,
 'really': 62,
 'see': 63,
 'story': 64,
 'their': 65,
 'had': 66,
 'can': 67,
 'me': 68,
 'well': 69,
 'were': 70,
 'than': 71,
 'much': 72,
 'we': 73,
 'bad': 74,
 'been': 75,
 'get': 76,
 'do': 77,
 'great': 78,
 'other': 79,
 'will': 80,
 'also': 81,
 'into': 82,
 ...

In [12]:
# Extract the integer labels as NumPy arrays
train_label = train_df['sentiment'].values
test_label = test_df['sentiment'].values

**Modeling**

In [13]:
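def create_model():
    # Embedding -> LSTM -> Dense (ReLU) -> single output unit
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(nb_words, 128),
        tf.keras.layers.LSTM(64),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(1)])  # no activation: the output is a raw logit

    # from_logits=True applies the sigmoid inside the loss
    model.compile(optimizer='adam',
                  loss=tf.losses.BinaryCrossentropy(from_logits=True),
                  metrics=['accuracy'])

    return model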

model = create_model()
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, None, 128)         15904384  
                                                                 
 lstm (LSTM)                 (None, 64)                49408     
                                                                 
 dense (Dense)               (None, 64)                4160      
                                                                 
 dense_1 (Dense)             (None, 1)                 65        
                                                                 
Total params: 15,958,017
Trainable params: 15,958,017
Non-trainable params: 0
_________________________________________________________________
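
The parameter counts in the summary can be reproduced by hand. The sketch below uses the standard formulas for embedding, LSTM, and dense layers; the helper function names are made up for this example. It also shows that the embedding row implies a vocabulary of 124,253 entries, even though the tokenizer only emits indices below 8000.

```python
def embedding_params(vocab_size, dim):
    # One dim-sized vector per vocabulary entry
    return vocab_size * dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit
    return input_dim * units + units

# Vocabulary size implied by the embedding row of the summary (15,904,384 / 128)
vocab_size = 15_904_384 // 128

counts = [
    embedding_params(vocab_size, 128),  # embedding
    lstm_params(128, 64),               # lstm
    dense_params(64, 64),               # dense
    dense_params(64, 1),                # dense_1
]
print(vocab_size)   # 124253
print(counts)       # [15904384, 49408, 4160, 65]
print(sum(counts))  # 15958017

# Hypothetical alternative: size the embedding to the 8000-word cutoff instead
print(embedding_params(8000, 128))  # 1024000
```

Almost all of the model's parameters sit in the embedding, and the rows for indices 8000 and above are never looked up. Sizing the embedding to 8000 would be one possible way to shrink the model, though that is a suggestion rather than what this notebook does, and it would change the summary above.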


**Training the Model**

In [14]:
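# Stop once val_accuracy has not improved for 2 epochs, then restore the best weights
call_back = [tf.keras.callbacks.EarlyStopping(monitor="val_accuracy", patience=2,
                                              verbose=1, restore_best_weights=True)]

# Note: the test split doubles as the validation set here
model.fit(train_data, train_label, epochs=10, batch_size=32,
          validation_data=(test_data, test_label),
          callbacks=call_back)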

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 4: early stopping


<keras.callbacks.History at 0x7fe6fcc2c1d0>
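
The per-epoch metrics were not preserved in the output above, but the stopping point can still be reasoned about. The sketch below is a simplified model of the patience rule, with made-up validation accuracies. With `patience=2`, stopping at epoch 4 is consistent with the best `val_accuracy` having occurred at epoch 2, which is the epoch `restore_best_weights=True` would roll back to.

```python
def early_stopping_epoch(val_accuracies, patience):
    """Return (stopped_epoch, best_epoch) under a simplified patience rule.

    Epochs are 1-based. stopped_epoch is None if training runs to the end.
    """
    best = float("-inf")
    best_epoch = None
    wait = 0
    for epoch, accuracy in enumerate(val_accuracies, start=1):
        wait += 1
        if accuracy > best:
            # Improvement: remember this epoch and reset the counter
            best, best_epoch, wait = accuracy, epoch, 0
        if wait >= patience:
            return epoch, best_epoch
    return None, best_epoch

# Hypothetical validation accuracies, NOT the values from the run above
history = [0.85, 0.87, 0.86, 0.86]
stopped, best = early_stopping_epoch(history, patience=2)
print(stopped, best)  # 4 2
```

This ignores options such as `min_delta` and `baseline`, so treat it as an illustration of the rule rather than a replacement for the Keras callback.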

**Evaluation**

In [15]:
# Evaluate on the test split (the same data used for validation during training)
loss, accuracy = model.evaluate(test_data, test_label)

print(f"Accuracy : {accuracy}")
print(f"Loss     : {loss}")

Accuracy : 0.8693000078201294
Loss     : 0.29917556047439575
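
Because the final layer has no activation, the model's output is a raw logit, not a probability. The sketch below shows one way to turn a logit into a sentiment name, assuming the label mapping from the data preparation step (0 = positive, 1 = negative). The logit values and the `decode` helper are hypothetical.

```python
import math

def sigmoid(logit):
    return 1.0 / (1.0 + math.exp(-logit))

# Inverse of the mapping used in data preparation: positive -> 0, negative -> 1
LABEL_NAMES = {0: "positive", 1: "negative"}

def decode(logit, threshold=0.5):
    """Convert a raw logit to (sentiment name, probability of the 'negative' class)."""
    probability = sigmoid(logit)  # P(label == 1), i.e. P(negative)
    label = 1 if probability >= threshold else 0
    return LABEL_NAMES[label], probability

# Hypothetical logits, not outputs of the trained model
for logit in (-2.0, 0.3, 2.0):
    name, probability = decode(logit)
    print(f"logit={logit:+.1f}  P(negative)={probability:.3f}  ->  {name}")
```

A probability threshold of 0.5 corresponds to a logit threshold of 0, so a logit of 0.3 decodes to "negative". It may be worth checking whether the `'accuracy'` metric compares the raw logit against 0.5 instead; if so, `tf.keras.metrics.BinaryAccuracy(threshold=0.0)` would be one way to make the cutoff explicit, and the reported accuracy could shift slightly.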
