## **Here I will use a CNN for text classification**

## Download the dataset

**Here I will use the Sentiment140 dataset**<br>
**Source: [Sentiment140 on Kaggle](https://www.kaggle.com/kazanova/sentiment140)**

In [None]:
from google.colab import files
files.upload()

In [None]:
# Create the ~/.kaggle directory (-p avoids an error if it already exists) and copy kaggle.json there.

! mkdir -p ~/.kaggle
! cp kaggle.json ~/.kaggle/

In [None]:
# Change the permissions of the file.
! chmod 600 ~/.kaggle/kaggle.json

In [None]:
# download the dataset
!kaggle datasets download -d kazanova/sentiment140

Downloading sentiment140.zip to /content
 90% 73.0M/80.9M [00:03<00:00, 13.8MB/s]
100% 80.9M/80.9M [00:03<00:00, 21.9MB/s]


In [None]:
# unzip the dataset
!unzip sentiment140.zip

Archive:  sentiment140.zip
  inflating: training.1600000.processed.noemoticon.csv  


In [None]:
#rename file
import os
os.rename('training.1600000.processed.noemoticon.csv', 'twitter_sentiment_analysis.csv')

# Import

In [None]:
from bs4 import BeautifulSoup
from sklearn.model_selection import train_test_split
from tensorflow.keras import models
import pandas as pd
import re
import numpy as np
import tensorflow_datasets as tfds
import tensorflow as tf


# Explore dataset



In [None]:
# The default encoding raises a UnicodeDecodeError, so read the file with encoding='ISO-8859-1'
# (solution from https://stackoverflow.com/a/18172249).
# The column names are listed on the dataset page.

data = pd.read_csv("/content/twitter_sentiment_analysis.csv",encoding='ISO-8859-1',header=None,names=['target','id','date','flag','user','text'])

In [None]:
data.head()

Unnamed: 0,target,id,date,flag,user,text
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."


In [None]:
data['target'].value_counts()

4    800000
0    800000
Name: target, dtype: int64


**The CSV file contains the following fields**

* target: the polarity of the tweet (0 = negative, 4 = positive)

* id: the id of the tweet

* date: the date of the tweet (Sat May 16 23:58:44 UTC 2009)

* flag: the query (lyx). If there is no query, then this value is NO_QUERY.

* user: the user that tweeted (robotickilldozr)

* text: the text of the tweet (Lyx is cool)


**For our purpose we need only two fields: text and target**

In [None]:
#extract two fields text and target
data = data[['text','target']]
data.head()

Unnamed: 0,text,target
0,"@switchfoot http://twitpic.com/2y1zl - Awww, t...",0
1,is upset that he can't update his Facebook by ...,0
2,@Kenichan I dived many times for the ball. Man...,0
3,my whole body feels itchy and like its on fire,0
4,"@nationwideclass no, it's not behaving at all....",0


In [None]:
data.shape

(1600000, 2)

* **Here we have 1,600,000 tweets, each with a target label.** <br>
* **The text contains special symbols such as @ and may contain HTML tags, which we have to remove. The target column holds 0 and 4, which we will convert to 0 and 1.**<br>
* **Let's pre-process this text.**

# Pre-Processing

In [None]:
def clean_tweets(tweet):
  tweet = BeautifulSoup(tweet).get_text() # remove all HTML tags
  tweet = re.sub(r'@[A-Za-z0-9]+',' ',tweet) # replace each mention starting with @ (example: @Kenichan) with a space; underscores are not matched
  tweet = re.sub(r'https?://[A-Za-z0-9./]+',' ',tweet) # replace URLs with a space
  tweet = re.sub(r"[^A-Za-z.!?']",' ',tweet) # replace everything except letters and . ! ? ' with a space
  tweet = re.sub(r" +",' ',tweet) # collapse multiple spaces into a single space
  return tweet
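
To make the effect of each regex step concrete, here is a minimal sketch that applies only the four `re.sub` calls from `clean_tweets` to two sample strings. The helper name `clean_tweet_regex_only` and the second sample (`@foo_bar hi`) are assumptions made for illustration, and the HTML-stripping step is deliberately left out so the block needs only the standard library.

```python
import re

def clean_tweet_regex_only(tweet):
    """Hypothetical helper: the four regex steps of clean_tweets, without HTML stripping."""
    tweet = re.sub(r"@[A-Za-z0-9]+", " ", tweet)            # mentions -> space
    tweet = re.sub(r"https?://[A-Za-z0-9./]+", " ", tweet)  # URLs -> space
    tweet = re.sub(r"[^A-Za-z.!?']", " ", tweet)            # keep letters and . ! ? '
    tweet = re.sub(r" +", " ", tweet)                       # collapse runs of spaces
    return tweet

sample = "@Kenichan I dived many times for the ball. http://twitpic.com/2y1zl 100%!"
cleaned = clean_tweet_regex_only(sample)
print(repr(cleaned))          # ' I dived many times for the ball. !'

# The mention pattern stops at an underscore, so part of the handle survives.
underscore_case = clean_tweet_regex_only("@foo_bar hi")
print(repr(underscore_case))  # ' bar hi'
```

Note the leading space left where the mention used to be, and that digits and `%` are removed by the third step. If handles with underscores matter, a broader pattern such as `@\w+` may be worth trying.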


In [None]:
# convert_labels = {0:0,2:1,4:2}
data['text'] = data['text'].apply(clean_tweets)
# data['target'] = data['target'].apply(lambda t : convert_labels[t]) # convert target columns into 0,1 and 2

In [None]:
# assign the result back instead of using inplace=True on a column selection
data['target'] = data['target'].replace(4, 1)

In [None]:
data.head()

Unnamed: 0,text,target
0,Awww that's a bummer. You shoulda got David C...,0
1,is upset that he can't update his Facebook by ...,0
2,I dived many times for the ball. Managed to s...,0
3,my whole body feels itchy and like its on fire,0
4,no it's not behaving at all. i'm mad. why am ...,0


**Here we have cleaned the text and changed the values of the target field**

In [None]:
# requires Google Drive to be mounted at /content/drive
data.to_csv('/content/drive/My Drive/Data/sentiment140/twitter_sentiment_analysis.csv',index=False)

## Tokenizing each tweet

In [None]:
data = pd.read_csv('/content/drive/My Drive/Data/sentiment140/twitter_sentiment_analysis.csv')

In [None]:
data['target'].value_counts()

1    800000
0    800000
Name: target, dtype: int64

In [None]:
# get both columns as lists
clean_text = data.text.to_list()
labels = data.target.to_list()

In [None]:
from collections import Counter
Counter(labels)

Counter({0: 800000, 1: 800000})

In [None]:
# Build the tokenizer.
# Here we use SubwordTextEncoder, which builds a vocabulary of subwords from the corpus and maps each one to an integer id.
# At evaluation time, a word that is not in the vocabulary is split into smaller subwords or single characters.
# https://stackoverflow.com/a/58123024
# Note: newer tensorflow_datasets releases expose this class as tfds.deprecated.text.SubwordTextEncoder.
tokenizer = tfds.features.text.SubwordTextEncoder.build_from_corpus(clean_text,target_vocab_size=2**16)
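
The idea behind subword encoding can be illustrated with a toy greedy longest-match encoder. This is only a sketch of the general technique, not the algorithm `SubwordTextEncoder` actually implements; the vocabulary pieces and the function name `encode` are assumptions chosen for the example. Id 0 is left unused so that it can serve as the padding value.

```python
import string

# Hypothetical toy vocabulary: a few subwords plus single-character fallbacks.
pieces = ["happy", "ness", "un", " "] + list(string.ascii_lowercase)
vocab = {piece: idx + 1 for idx, piece in enumerate(pieces)}  # ids start at 1; 0 is kept for padding

def encode(text, vocab):
    """Greedy longest-match encoding: at each position take the longest known piece."""
    ids = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest candidate first
            piece = text[i:j]
            if piece in vocab:
                ids.append(vocab[piece])
                i = j
                break
        else:
            raise ValueError(f"no vocabulary piece matches at position {i}")
    return ids

print(encode("unhappy", vocab))      # [3, 1]
print(encode("unhappiness", vocab))  # [3, 12, 5, 20, 20, 13, 2]
```

The second word is not covered by whole subwords, so the encoder falls back to single characters for `h`, `a`, `p`, `p`, `i` before matching `ness`. This is why an unseen word never produces an out-of-vocabulary error.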

In [None]:
# we can save tokenizer
tokenizer.save_to_file('/content/drive/My Drive/Data/sentiment140/tokenizer')

In [None]:
# load tokenizer
tokenizer = tfds.features.text.SubwordTextEncoder.load_from_file('/content/drive/My Drive/Data/sentiment140/tokenizer')


In [None]:
text_input = [tokenizer.encode(sentence) for sentence in clean_text]

In [None]:
print(clean_text[0])
print(text_input[0])

 Awww that's a bummer. You shoulda got David Carr of Third Day to do it. D
[65316, 1570, 113, 65323, 10, 6, 3553, 1, 135, 5262, 50, 1484, 38165, 16, 13337, 606, 2, 49, 33, 1, 65352]


In [None]:
MAX_LEN = max([len(tokenize_sentence) for tokenize_sentence in text_input])
MAX_LEN

73

In [None]:
# Pad each tokenized sentence so that all sequences have the same length (MAX_LEN).
# We pad with 0 at the end of each sequence (padding='post').
padded_text_input = tf.keras.preprocessing.sequence.pad_sequences(text_input,maxlen=MAX_LEN,value=0,padding='post')
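
What post-padding does can be sketched in plain Python. The function `pad_post` below is a hypothetical stand-in, not the Keras implementation. It assumes a positive `max_len` and mirrors the Keras default of truncating over-long sequences from the front (`truncating='pre'`).

```python
def pad_post(sequences, max_len, value=0):
    """Pad each sequence at the end with `value`; keep the last max_len tokens if too long."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-max_len:]  # drop tokens from the front when the sequence is too long
        padded.append(seq + [value] * (max_len - len(seq)))
    return padded

toy_batch = [[5, 3], [7, 8, 9, 4]]
padded_batch = pad_post(toy_batch, max_len=3)
print(padded_batch)  # [[5, 3, 0], [8, 9, 4]]
```

Because `MAX_LEN` here is the longest sequence in the training data, nothing is truncated during training; truncation only matters for longer inputs at prediction time.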

In [None]:
print("Tokenizer Sequence : \n")
print(text_input[0])

print("\n\nPadded Sequence : \n")
print(padded_text_input[0])

Tokenizer Sequence : 

[65316, 1570, 113, 65323, 10, 6, 3553, 1, 135, 5262, 50, 1484, 38165, 16, 13337, 606, 2, 49, 33, 1, 65352]


Padded Sequence : 

[65316  1570   113 65323    10     6  3553     1   135  5262    50  1484
 38165    16 13337   606     2    49    33     1 65352     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0]


# Train Test Split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(padded_text_input,labels,test_size=0.2)

In [None]:
print("Training data consists of {} examples".format(len(X_train)))
print("Testing data consists of {} examples".format(len(X_test)))


Training data consists of 1280000 examples
Testing data consists of 320000 examples
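
The split above is random, so the exact rows in each set change from run to run unless a seed is fixed (in scikit-learn, via the `random_state` argument of `train_test_split`). The following is a minimal standard-library sketch of a seeded 80/20 shuffle split; `shuffled_split` is a hypothetical helper and does not reproduce scikit-learn's exact shuffling.

```python
import random

def shuffled_split(items, test_size=0.2, seed=0):
    """Shuffle indices with a fixed seed, then cut off the first test_size fraction as the test set."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(items) * test_size)
    test = [items[i] for i in indices[:n_test]]
    train = [items[i] for i in indices[n_test:]]
    return train, test

train, test = shuffled_split(list(range(10)), test_size=0.2, seed=42)
print(len(train), len(test))  # 8 2
```

With 1,600,000 tweets the same proportions give the 1,280,000 / 320,000 split shown above.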


In [None]:
X_train, X_test, y_train, y_test = np.asarray(X_train), np.asarray(X_test), np.asarray(y_train), np.asarray(y_test)

# Model Creation

In [None]:
class SentimentCNN(tf.keras.Model):

  def __init__(self,vocab_size,embedding_dim=128,num_filters=50,FFN=512,num_classes=2,
               dropout_rate=0.1,name='sentiment_cnn'):
    super(SentimentCNN,self).__init__(name=name)

    # define our layers
    self.embeddings = tf.keras.layers.Embedding(vocab_size,embedding_dim)

    # Conv1D is used because the filters slide along one axis only (the token axis)
    self.bigram = tf.keras.layers.Conv1D(filters=num_filters,kernel_size=2,
                                         padding='valid',activation='relu')
    self.pool_1  = tf.keras.layers.GlobalAvgPool1D()

    self.trigram = tf.keras.layers.Conv1D(filters=num_filters,kernel_size=3,
                                          padding='valid',activation='relu')
    self.pool_2  = tf.keras.layers.GlobalAvgPool1D()


    self.fourgram = tf.keras.layers.Conv1D(filters=num_filters,kernel_size=4,
                                           padding='valid',activation='relu')
    self.pool_3  = tf.keras.layers.GlobalAvgPool1D()

    self.dense_1 = tf.keras.layers.Dense(FFN,activation='relu')
    self.dropout = tf.keras.layers.Dropout(rate=dropout_rate)

    # a single sigmoid unit for binary classification, softmax otherwise
    if num_classes == 2:
      self.output_layer = tf.keras.layers.Dense(1,activation='sigmoid')
    else:
      self.output_layer = tf.keras.layers.Dense(num_classes,activation='softmax')

  def call(self,inputs,training=False):

    embeddings = self.embeddings(inputs) # (batch_size, seq_len, embedding_dim)

    bigram = self.bigram(embeddings)
    bigram_pooled = self.pool_1(bigram) # (batch_size, num_filters)

    trigram = self.trigram(embeddings)
    trigram_pooled = self.pool_2(trigram)

    fourgram = self.fourgram(embeddings)
    fourgram_pooled = self.pool_3(fourgram)

    # (batch_size, 3*num_filters): 3 because we concatenate the bigram, trigram and fourgram features
    merged = tf.concat([bigram_pooled,trigram_pooled,fourgram_pooled],axis=-1)
    merged = self.dense_1(merged)
    merged = self.dropout(merged,training=training) # dropout is only active during training
    merged = self.output_layer(merged)

    return merged
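
The tensor shapes inside the model can be checked with a little arithmetic. The sketch below assumes the values used elsewhere in this notebook (sequence length 73, 64 filters per convolution); the helper name `conv1d_valid_length` is made up for the example. With `padding='valid'` and stride 1, a kernel of size k over a sequence of length L yields L - k + 1 positions, and global average pooling then reduces each branch to one value per filter.

```python
def conv1d_valid_length(seq_len, kernel_size, stride=1):
    """Output length of a 1D convolution with 'valid' padding."""
    return (seq_len - kernel_size) // stride + 1

SEQ_LEN = 73      # MAX_LEN computed from the tokenized tweets
NUM_FILTERS = 64  # filters per convolution branch

conv_lengths = {k: conv1d_valid_length(SEQ_LEN, k) for k in (2, 3, 4)}
merged_width = NUM_FILTERS * len(conv_lengths)  # one pooled vector per branch, concatenated

print(conv_lengths)   # {2: 72, 3: 71, 4: 70}
print(merged_width)   # 192
```

So the dense layer receives a vector of width 192 per example, regardless of the sequence length, because the pooling removes the time axis.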





# Training

In [None]:
VOCAB_SIZE = tokenizer.vocab_size
EMBEDDING_SIZE = 256
NUM_FILTERS = 64
FFN = 512
NUM_CLASSES = 2
DROPOUT_RATE = 0.2
BATCH_SIZE = 64
NUM_EPOCH = 5


In [None]:
SentimentCnn = SentimentCNN(VOCAB_SIZE,EMBEDDING_SIZE,NUM_FILTERS,FFN,NUM_CLASSES,DROPOUT_RATE)

In [None]:
SentimentCnn.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])

In [None]:
callback = tf.keras.callbacks.EarlyStopping(monitor='loss', patience=3)
checkpoint_filepath = '/content/drive/My Drive/Data/sentiment140/checkPoints/'
model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_filepath,
    save_best_only=True)

In [None]:
SentimentCnn.fit(X_train,y_train,validation_data=(X_test,y_test),batch_size=BATCH_SIZE,epochs=NUM_EPOCH,callbacks=[callback,model_checkpoint_callback])

# Prediction

In [None]:
SentimentCnn.save('/content/drive/My Drive/Data/sentiment140/model')

Instructions for updating:
If using Keras pass *_constraint arguments to layers.


INFO:tensorflow:Assets written to: /content/drive/My Drive/Data/sentiment140/model/assets



In [None]:
model = models.load_model('/content/drive/My Drive/Data/sentiment140/model')

In [None]:
index_to_sentiment = {0:'negative',1:'positive'}
def predict_sentiment(text):
  text = clean_tweets(text)
  token_ids = tokenizer.encode(text) # list of subword ids
  token_ids = np.expand_dims(token_ids,axis=0) # add a batch dimension: (1, seq_len)
  padded_ids = tf.keras.preprocessing.sequence.pad_sequences(token_ids,maxlen=MAX_LEN,value=0,padding='post')
  prediction = model.predict(padded_ids) # shape (1, 1): sigmoid probability of the positive class

  # the sigmoid output p is P(positive), so P(negative) = 1 - p
  response = {index_to_sentiment[0]:1-prediction[0][0],index_to_sentiment[1]:prediction[0][0]}
  return response
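
Since the model has a single sigmoid output, the two class scores are derived from one number. A minimal sketch of that step, including a hard label, is shown below. The helper `to_response` and the 0.5 threshold are assumptions for illustration; the notebook itself only returns the two probabilities.

```python
def to_response(p_positive, threshold=0.5):
    """Turn one sigmoid probability into per-class scores and a hard label."""
    scores = {"negative": 1.0 - p_positive, "positive": p_positive}
    label = "positive" if p_positive >= threshold else "negative"
    return label, scores

label, scores = to_response(0.0671911)
print(label)  # negative
```

A different threshold can be chosen if one kind of error is more costly than the other.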

In [None]:
predict_sentiment('I really like the new design of your website!')

{'negative': 0.0001634359359741211, 'positive': 0.99983656}

In [None]:
predict_sentiment('The new design is awful!')

{'negative': 0.9328088983893394, 'positive': 0.0671911}

In [None]:
predict_sentiment('impossible to reach customer service')

{'negative': 0.8121347725391388, 'positive': 0.18786523}