Hi! If you are a beginner in TensorFlow and would like to grasp the essentials of Natural Language Processing (like me), please upvote <3

In [1]:
# importing the libraries
import pandas as pd
import tensorflow as tf
import numpy as np

In [2]:
# importing the Deep Learning Libraries
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, GRU, Dense, Dropout

In [3]:
# loading the training data
training_data = pd.read_csv('/kaggle/input/quora-insincere-questions-classification/train.csv')

In [4]:
training_data.head()

Unnamed: 0,qid,question_text,target
0,00002165364db923c7e6,How did Quebec nationalists see their province...,0
1,000032939017120e6e44,"Do you have an adopted dog, how would you enco...",0
2,0000412ca6e4628ce2cf,Why does velocity affect time? Does velocity a...,0
3,000042bf85aa498cd78e,How did Otto von Guericke used the Magdeburg h...,0
4,0000455dfa3e01eae3af,Can I convert montra helicon D to a mountain b...,0


1. We don't need the qid to train the model.
2. The question_text is the text input the model will be trained on, with target as the label.
3. The target has 2 classes (0 and 1).

In [5]:
# dropping the qid
training_data = training_data.drop(['qid'], axis = 1)

In [6]:
# creating a feature 'length' that holds the number of characters in each question
training_data['length'] = training_data['question_text'].apply(lambda s: len(s))
# len(s) counts characters, not words; a simple lambda is used here for readability.

In [7]:
training_data['length']

0          72
1          81
2          67
3          57
4          77
           ..
1306117    93
1306118    91
1306119    25
1306120    71
1306121    52
Name: length, Length: 1306122, dtype: int64

In [8]:
# checking the min, max and mean character length; the mean is reused later when tokenizing and padding.
min(training_data['length']), max(training_data['length']), round(sum(training_data['length'])/len(training_data['length']))

(1, 1017, 71)

A minimum length of 1? That looks like an outlier: how can a question be just a single character? Let us do some preprocessing.
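
Because `length` is built with `len(s)`, it counts characters rather than words. A minimal sketch of the difference, using one of the sample questions from the test set (`text`, `n_chars` and `n_words` are illustrative names, not part of the notebook):

```python
# `text` is an illustrative example, not a row selected by the notebook
text = "Who are entrepreneurs?"

n_chars = len(text)          # what the 'length' column stores
n_words = len(text.split())  # whitespace-separated word count

print(n_chars, n_words)
```

If a word count is what you want to filter on, the second expression is the one to apply to `question_text`.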

In [9]:
training_data[training_data['length'] <= 9]

Unnamed: 0,question_text,target,length
61968,Quora:,1,7
78445,Is,1,3
126166,In Islam?,0,9
155026,Dowry:,1,7
230024,I 12?,0,5
299304,If,1,3
356798,To Quora:,1,9
365554,Sexism:,1,8
367936,Hungary:,1,9
369692,History:,1,9


oops...**Are these even complete questions ???**

In [10]:
training_data = training_data.drop(training_data[training_data['length'] <= 9].index, axis = 0)
min(training_data['length']), max(training_data['length']), round(sum(training_data['length'])/len(training_data['length']))

(10, 1017, 71)

Looks sensible. Now let's check for missing values (if any).

In [11]:
training_data.isnull().sum()

question_text    0
target           0
length           0
dtype: int64

No Missing Values !

Let us start the Deep Learning part now !

In [12]:
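# Tokenizing the text - converting each word into an integer index.
# Note: num_words is the vocabulary size (only the num_words - 1 most frequent words
# are kept), not a sequence length. The mean character length (71) is reused for it
# here, so the vocabulary is very small; a much larger value is more typical.
max_length = round(sum(training_data['length'])/len(training_data['length']))
tokenizer = Tokenizer(num_words = max_length,
                      filters = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
                      lower = True,
                      split = ' ')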

In [13]:
tokenizer.fit_on_texts(training_data['question_text'])

In [14]:
# Actual Conversion takes place here.
X = tokenizer.texts_to_sequences(training_data['question_text'])

In [15]:
print(len(X), len(X[0]), len(X[1]), len(X[2]))

1306099 7 11 4
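
The first question is 72 characters long but yields only 7 tokens, partly because the tokenizer works on words and partly because `num_words` discards every word outside the most frequent ones. The following is a simplified pure-Python sketch of that idea, not Keras's actual implementation: `fit_word_index` and `to_sequence` are hypothetical helpers, the corpus is made up, and punctuation filtering is ignored.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for t in texts for word in t.lower().split())
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequence(text, word_index, num_words):
    """Keep only known words whose index is below num_words; drop the rest."""
    ids = (word_index.get(w) for w in text.lower().split())
    return [i for i in ids if i is not None and i < num_words]

# toy corpus, made up for illustration
texts = ["the cat sat", "the dog sat", "the bird flew"]
word_index = fit_word_index(texts)

full = to_sequence("the dog flew", word_index, num_words=100)
capped = to_sequence("the dog flew", word_index, num_words=3)
print(full, capped)
```

With a generous `num_words` all three words survive; with a small cap only the most frequent ones do, which is the effect a value of 71 has on the real data.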


As you can see, the lengths are not the same, so the sequences are padded. pad_sequences adds a specific value, usually 0, before or after each sequence to make them all equal in length; sequences longer than maxlen are truncated.

In [16]:
X = pad_sequences(sequences = X, padding = 'pre', maxlen = max_length)
print(len(X), len(X[0]), len(X[1]), len(X[2]))

1306099 71 71 71
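
To make the padding step concrete, here is a small pure-Python sketch of what `padding = 'pre'` amounts to for a single sequence. `pad_pre` is a hypothetical helper written for illustration; it assumes `maxlen > 0` and mirrors the default behaviour of truncating from the front when a sequence is too long.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items if too long."""
    seq = list(seq)[-maxlen:]          # truncate from the front (assumes maxlen > 0)
    return [value] * (maxlen - len(seq)) + seq

short = pad_pre([5, 9, 2], maxlen=5)
long_ = pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short, long_)
```

Short sequences gain leading zeros, while long ones lose their earliest tokens, so every row ends up with exactly `maxlen` entries.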


In [17]:
y = training_data['target'].values
y.shape

(1306099,)

Now the data is ready to be fed into the neural network, so let us construct it.

I will create a neural network with minimum layers so that beginners like me can understand without complexity.

In [18]:
# LSTM Neural Network
lstm = Sequential()
lstm.add(Embedding(input_dim = max_length, output_dim = 120))  # input_dim must cover the tokenizer's num_words
lstm.add(LSTM(units = 120, recurrent_dropout = 0.2))
lstm.add(Dropout(rate = 0.2))
lstm.add(Dense(units = 120, activation = 'relu'))
lstm.add(Dropout(rate = 0.1))
lstm.add(Dense(units = 2, activation = 'softmax'))

lstm.compile(optimizer = 'adam', loss = 'sparse_categorical_crossentropy', metrics = ['accuracy'])

In [19]:
lstm.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, None, 120)         8520      
_________________________________________________________________
lstm (LSTM)                  (None, 120)               115680    
_________________________________________________________________
dropout (Dropout)            (None, 120)               0         
_________________________________________________________________
dense (Dense)                (None, 120)               14520     
_________________________________________________________________
dropout_1 (Dropout)          (None, 120)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 2)                 242       
Total params: 138,962
Trainable params: 138,962
Non-trainable params: 0
__________________________________________________
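
The parameter counts in the summary can be reproduced by hand, which is a useful check that the layers are wired as intended. The sketch below uses the standard formulas for embedding, LSTM and dense layers; the variable names are illustrative, and the sizes are taken from the model definition above.

```python
vocab_size = 71      # input_dim of the Embedding layer (max_length in the notebook)
embed_dim = 120      # output_dim of the Embedding layer
units = 120          # LSTM units
dense_units = 120    # hidden Dense layer
n_classes = 2        # output Dense layer

# one embed_dim-sized vector per vocabulary index
embedding_params = vocab_size * embed_dim
# 4 gates, each with input weights, recurrent weights and a bias
lstm_params = 4 * (units * (embed_dim + units) + units)
# weights plus biases
dense_params = units * dense_units + dense_units
output_params = dense_units * n_classes + n_classes

total = embedding_params + lstm_params + dense_params + output_params
print(embedding_params, lstm_params, dense_params, output_params, total)
```

Most of the parameters sit in the LSTM layer, so changing `units` has the largest effect on model size.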

In [20]:
lstm_fitted = lstm.fit(X, y, epochs = 1)



Note: you can tune the hyperparameters (for example the number of epochs, the LSTM units or the dropout rates) to improve the accuracy.
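
The cell above trains on every row, so the accuracy Keras reports is measured on the training data itself. One way to compare hyperparameter settings more fairly is to hold out part of the data. The sketch below only shows the index split; `holdout_indices`, `val_fraction` and `seed` are hypothetical names, and the 10% fraction is an assumption rather than anything the notebook prescribes. (Keras's `fit` also accepts a `validation_split` argument for a similar purpose.)

```python
import random

def holdout_indices(n_rows, val_fraction=0.1, seed=42):
    """Shuffle row indices and split them into train and validation parts."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)      # seeded so the split is repeatable
    n_val = int(n_rows * val_fraction)
    return idx[n_val:], idx[:n_val]

train_idx, val_idx = holdout_indices(1000)
print(len(train_idx), len(val_idx))
```

The two index lists could then be used to slice `X` and `y` before fitting and evaluating.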

In [21]:
# importing the testing data
testing_data = pd.read_csv('/kaggle/input/quora-insincere-questions-classification/test.csv')

In [22]:
testing_data.head()

Unnamed: 0,qid,question_text
0,0000163e3ea7c7a74cd7,Why do so many women become so rude and arroga...
1,00002bd4fb5d505b9161,When should I apply for RV college of engineer...
2,00007756b4a147d2b0b3,What is it really like to be a nurse practitio...
3,000086e4b7e1c7146103,Who are entrepreneurs?
4,0000c4c3fbe8785a3090,Is education really making good people nowadays?


In [23]:
# converting the data into tokens
X_test = tokenizer.texts_to_sequences(testing_data['question_text'])

In [24]:
print(len(X_test), len(X_test[0]), len(X_test[1]), len(X_test[2]))

375806 12 15 7


In [25]:
# padding the sequences
X_test = pad_sequences(X_test, maxlen = max_length, padding = 'pre')
print(len(X_test), len(X_test[0]), len(X_test[1]), len(X_test[2]))

375806 71 71 71


We are good to go !!

In [26]:
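# predicting the test set
# predict_classes was removed in newer TensorFlow releases; taking the argmax of the
# predicted class probabilities gives the same labels.
lstm_prediction = np.argmax(lstm.predict(X_test), axis = -1)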
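
Since the output layer is a 2-unit softmax, the model produces one probability per class for each question, and the predicted label is the index of the larger one. A pure-Python sketch of that step, with `to_class_labels` as a hypothetical helper and made-up probabilities:

```python
def to_class_labels(probabilities):
    """Return, for each row, the index of the highest probability."""
    return [max(range(len(row)), key=row.__getitem__) for row in probabilities]

# made-up softmax outputs for three questions: [P(class 0), P(class 1)]
probs = [[0.9, 0.1], [0.3, 0.7], [0.55, 0.45]]
labels = to_class_labels(probs)
print(labels)
```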

In [27]:
# creating a dataframe for submitting
submission = pd.DataFrame({'qid': testing_data['qid'], 'prediction': lstm_prediction})

In [28]:
submission.head()

Unnamed: 0,qid,prediction
0,0000163e3ea7c7a74cd7,1
1,00002bd4fb5d505b9161,0
2,00007756b4a147d2b0b3,0
3,000086e4b7e1c7146103,0
4,0000c4c3fbe8785a3090,0


In [29]:
submission.to_csv('submission.csv', index = False)

Thank you for viewing my kernel. Please comment if you have any creative ideas for approaching this with traditional methods.

*Let's Learn ! Let's Learn !* 