## Loading modules

In [0]:
import numpy as np 
import pandas as pd 
from tqdm import tqdm
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle
import tensorflow as tf
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import LSTM, GRU,SimpleRNN
from tensorflow.keras.layers import Dense, Activation, Dropout
from tensorflow.keras.layers import Embedding, BatchNormalization
from sklearn import preprocessing, decomposition, model_selection, metrics, pipeline
from tensorflow.keras.layers import GlobalMaxPooling1D, Conv1D, Conv2D, MaxPool1D, Concatenate,MaxPool2D, Flatten, Bidirectional, SpatialDropout1D, Reshape, Input, Dropout, Dense
from tensorflow.keras.preprocessing import sequence, text
from tensorflow.keras.callbacks import EarlyStopping
import datetime
from sklearn.linear_model import LogisticRegression

In [None]:
tf.__version__

In [20]:
# Default distribution strategy for one GPU or CPUs 
strategy = tf.distribute.get_strategy()

print("REPLICAS: ", strategy.num_replicas_in_sync)

REPLICAS:  1


In [0]:
# !unzip glove.840B.300d.zip

# Data loading

Loading of training set we developed from preprocess notebook

In [23]:
train_df = pd.read_csv('training_data_dl.csv')
train_df.head()

Unnamed: 0,id,document,gender
0,d7d392835f50664fc079f0f388e147a0,youch good things to know is that sort of stuf...,male
1,d7d392835f50664fc079f0f388e147a0,succumbed to fomo and bought gnr tickets . rem...,male
2,d7d392835f50664fc079f0f388e147a0,brown eye broom a cool number then to the resc...,male
3,d7d392835f50664fc079f0f388e147a0,shout out to auckland tennis fans who get to s...,male
4,d7d392835f50664fc079f0f388e147a0,someone had some balls to come up with that,male


In [0]:
train_df['gender'] = train_df['gender'].apply(lambda x: 1 if x=='male' else 0)

In [0]:
xml_df = train_df[['id','gender']].drop_duplicates()

In [0]:
Xml_train, Xml_test, y_train, y_test = train_test_split(xml_df['id'].values, xml_df['gender'].values,
                                                        random_state=123,
                                                        shuffle=True, 
                                                        test_size=0.2,
                                                        stratify=xml_df['gender'].values)

In [0]:
train_df.dropna(subset=['document'], inplace=True)

In [28]:
# get the maximum size of document to get an estimate of padding at a later stage
train_df['document'].apply(lambda x:len(str(x).split())).max()

45

In [0]:
xtrain = train_df.loc[train_df['id'].isin(Xml_train),'document'].values
ytrain = train_df.loc[train_df['id'].isin(Xml_train),'gender'].values

xvalid = train_df.loc[~train_df['id'].isin(Xml_train),'document'].values
yvalid = train_df.loc[~train_df['id'].isin(Xml_train),'gender'].values

In [30]:
xtrain[0]

'donald the menace thanks comey'

In [0]:
# using keras tokenizer here
token = text.Tokenizer(num_words=None)
max_len = 45

token.fit_on_texts(list(xtrain) + list(xvalid))
xtrain_seq = token.texts_to_sequences(xtrain)
xvalid_seq = token.texts_to_sequences(xvalid)

#zero pad the sequences
xtrain_pad = sequence.pad_sequences(xtrain_seq, maxlen=max_len)
xvalid_pad = sequence.pad_sequences(xvalid_seq, maxlen=max_len)

word_index = token.word_index

 We represent every word as one hot vectors of dimensions : Numbers of words in Vocab +1.
What keras Tokenizer does is , it takes all the unique words in the corpus,forms a dictionary with words as keys and their number of occurences as values,it then sorts the dictionary in descending order of counts. It then assigns the first value 1 , second value 2 and so on. So let's suppose word 'the' occured the most in the corpus then it will assigned index 1 and vector representing 'the' would be a one-hot vector with value 1 at position 1 and rest zereos.
Try printing first 2 elements of xtrain_seq you will see every word is represented as a digit now

https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer

In [0]:
type(word_index)

dict

In [0]:
xtrain_seq[0]

[51, 2, 520, 135, 1413, 20, 58, 481]

Following piece of code is for reference. It was used to perform experiments using GloVe Embeddings. Unfortunately due to gender bias in the learned embeddings, models performed did not work well they were biased towards female class [Read more](http://papers.nips.cc/paper/6228-man-is-to-computer-programmer-as-woman-is-to-homemaker-debiasing-word-embeddings.pdf)

Some other literature if you are interested:

1) [Stanford paper](https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1184/reports/6835575.pdf)

2) [ACL paper](https://www.aclweb.org/anthology/P19-1160.pdf)

3) [University of Toronto](https://arxiv.org/pdf/1810.03611.pdf)

```python
# GloVe vectors loading into dictionary:

# downloaded from  http://www-nlp.stanford.edu/data/glove.840B.300d.zip
embeddings_index = {}
f = open(r'glove.840B.300d.txt','r',encoding='utf-8')
for line in tqdm(f):
    values = line.split(' ')
    word = values[0]
    coefs = np.asarray([float(vala for val in values[1:]])
    embeddings_index[word] = coefs
f.close()

print('Found %s word vectors.' % len(embeddings_index))

embedding_matrix = np.zeros((len(word_index) + 1, 300))
for word, i in tqdm(word_index.items()):
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
```

Following confusion matrix was obtained on models developed using GLoVe:

1) 1DCNN:
```
<tf.Tensor: shape=(2, 2), dtype=int32, numpy=
array([[241,  11],
       [180, 68]], dtype=int32)>
```
2) BiLSTM:
```
<tf.Tensor: shape=(2, 2), dtype=int32, numpy=
array([[244,  8],
       [175, 73]], dtype=int32)>
```
2) GRU:
```
<tf.Tensor: shape=(2, 2), dtype=int32, numpy=
array([[235,  17],
       [165, 83]], dtype=int32)>
```

## Vocabulary Building

In [None]:
def build_vocab(sentences, verbose =  True):
    """
    :param sentences: list of list of words
    :return: dictionary of words and their count
    """
    vocab = {}
    for sentence in tqdm(sentences, disable = (not verbose)):
        for word in sentence:
            try:
                vocab[word] += 1
            except KeyError:
                vocab[word] = 1
    return vocab

In [None]:
sentences = train_df["document"].progress_apply(lambda x: x.split()).values
vocab = build_vocab(sentences)
print({k: vocab[k] for k in list(vocab)[:5]})

100%|██████████████████████████████████████████████████████████████████████| 309990/309990 [00:00<00:00, 398982.41it/s]<br>
100%|██████████████████████████████████████████████████████████████████████| 309990/309990 [00:00<00:00, 397466.71it/s]

## Coverage experiment

Here we will find how much percentage of words are covered by GloVe dictionary

In [None]:
# vocabulary (oov) words that we can use to improve our preprocessing

import operator 

def check_coverage(vocab,embeddings_index):
    a = {}
    oov = {}
    k = 0
    i = 0
    for word in tqdm(vocab):
        try:
            a[word] = embeddings_index[word]
            k += vocab[word]
        except:

            oov[word] = vocab[word]
            i += vocab[word]
            pass

    print('Found embeddings for {:.2%} of vocab'.format(len(a) / len(vocab)))
    print('Found embeddings for  {:.2%} of all text'.format(k / (k + i)))
    sorted_x = sorted(oov.items(), key=operator.itemgetter(1))[::-1]

    return sorted_x

In [None]:
oov = check_coverage(vocab,embeddings_index)
oov[:25]

Found embeddings for 76.24% of vocab<br>
Found embeddings for  99.13% of all text

All three models develped below uses learned embeddings. Glove can be used in the below models by passing `embedding_matrix` and `trainable=False` parameter in the first layer of all below models but as mentioned earlier, it won't give good results.

# BiLSTM Model

Bidirectional LSTM based architecture was chosen because this type of layers, can look through in both the direction: forward and backward. Hence, it is very useful for text classfication where importance of word is determined by the words around it in both direction.

The hyperparameter for these type of architectures were tuned manually, as I didn't have high computation power. 

In [0]:
tf.keras.backend.clear_session()

In [33]:
%%time
with strategy.scope():
    
    # A LSTM with custom embeddings
    model = Sequential()
    model.add(Embedding(len(word_index) + 1,
                     300,
                     input_length=max_len))

    model.add(Bidirectional(LSTM(128, dropout=0.2, return_sequences=True)))
    model.add(Bidirectional(LSTM(64, dropout=0.2)))
    model.add(Dense(32, activation='relu'))
    model.add(Dropout(0.4))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
          optimizer=tf.keras.optimizers.Adam(0.001),
          metrics=['accuracy'])
    
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 45, 300)           27567900  
_________________________________________________________________
bidirectional (Bidirectional (None, 45, 256)           439296    
_________________________________________________________________
bidirectional_1 (Bidirection (None, 128)               164352    
_________________________________________________________________
dense (Dense)                (None, 32)                4128      
_________________________________________________________________
dropout (Dropout)            (None, 32)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 33        
Total params: 28,175,709
Trainable params: 28,175,709
Non-trainable params: 0
____________________________________________

Experiments were carried with different values of batch size and epoch till it converge or till the time when validation loss stops improving/fluctuates.

In [None]:
history = model.fit(xtrain_pad,
                    ytrain,
                    epochs=5,
                    batch_size=128,
                    validation_data=(xvalid_pad, yvalid),
                    verbose=1)
                    # callbacks=[tensorboard_callback])

As this didn't give better performance, it was neglected

# 1DCNN model 

Other approach was taken from [Paragraph classification](http://cs229.stanford.edu/proj2016/report/NhoNg-ParagraphTopicClassification-report.pdf) where multilayered Convolutional layers with maxpooling operation are used.

Convolution layers applies filterss through the rows of representation/embeddings for sentences, and learns important information. Size of this filter indicates how far to look in a window for aggregration using Maxpooling.

In [0]:
tf.keras.backend.clear_session()

In [0]:
embed_size = 300
filter_sizes = [1,3,5]
num_filters = 50

def get_model():
    inp = Input(shape=(max_len, ))
    x = Embedding(len(word_index) + 1,
                  embed_size)(inp)
   
    conv_0 = Conv1D(num_filters, kernel_size=(filter_sizes[0]),
                                 kernel_initializer='he_normal', activation='tanh')(x)
    conv_1 = Conv1D(num_filters, kernel_size=(filter_sizes[1]),
                                 kernel_initializer='he_normal', activation='tanh')(x)
    conv_2 = Conv1D(num_filters, kernel_size=(filter_sizes[2]), 
                                 kernel_initializer='he_normal', activation='tanh')(x)
    
    maxpool_0 = MaxPool1D(pool_size=(max_len - filter_sizes[0] + 1))(conv_0)
    maxpool_1 = MaxPool1D(pool_size=(max_len - filter_sizes[1] + 1))(conv_1)
    maxpool_2 = MaxPool1D(pool_size=(max_len - filter_sizes[2] + 1))(conv_2)
        
    z = Concatenate(axis=1)([maxpool_0, maxpool_1, maxpool_2])   
    z = Flatten()(z)
    z = Dropout(0.2)(z) 
    z = Dense(32, activation = 'relu')(z)
    z = Dropout(0.4)(z)
    outp = Dense(1, activation="sigmoid")(z)
    
    model = Model(inputs=inp, outputs=outp)
    model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              optimizer=tf.keras.optimizers.Adam(0.001),
              metrics=['accuracy'])
    
    return model

model = get_model()

In [0]:
model.summary()

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            [(None, 45)]         0                                            
__________________________________________________________________________________________________
embedding (Embedding)           (None, 45, 300)      27567900    input_1[0][0]                    
__________________________________________________________________________________________________
conv1d (Conv1D)                 (None, 45, 50)       15050       embedding[0][0]                  
__________________________________________________________________________________________________
conv1d_1 (Conv1D)               (None, 43, 50)       45050       embedding[0][0]                  
______________________________________________________________________________________________

In [0]:
history = model.fit(xtrain_pad,
                    ytrain,
                    epochs=3, # optimize this value 
                    batch_size=128,
                    validation_data=(xvalid_pad, yvalid),
                    verbose=1)

Epoch 1/3
Epoch 2/3
Epoch 3/3


Only 3 epochs are logs are provided here. 

# 2D CNN

Finally, inspired from [Gender Classification approach](https://www.mdpi.com/2076-3417/9/6/1249) architecture, 2 dimentional filters were used to get global meaning from a sentence learning from emebedding of words in a sentence.

In [0]:
tf.keras.backend.clear_session()

In [0]:
filter_sizes = [1,2,3,5]
num_filters = 42
embed_size = 300

def get_model():    
    inp = Input(shape=(max_len, ))
    x = Embedding(len(word_index) + 1, embed_size)(inp)
    x = SpatialDropout1D(0.2)(x)
    x = Reshape((max_len, embed_size, 1))(x)
    
    conv_0 = Conv2D(num_filters, kernel_size=(filter_sizes[0], embed_size), kernel_initializer='normal',
                                                                                    activation='elu')(x)
    conv_1 = Conv2D(num_filters, kernel_size=(filter_sizes[1], embed_size), kernel_initializer='normal',
                                                                                    activation='elu')(x)
    conv_2 = Conv2D(num_filters, kernel_size=(filter_sizes[2], embed_size), kernel_initializer='normal',
                                                                                    activation='elu')(x)
    conv_3 = Conv2D(num_filters, kernel_size=(filter_sizes[3], embed_size), kernel_initializer='normal',
                                                                                    activation='elu')(x)
    
    maxpool_0 = MaxPool2D(pool_size=(max_len - filter_sizes[0] + 1, 1))(conv_0)
    maxpool_1 = MaxPool2D(pool_size=(max_len - filter_sizes[1] + 1, 1))(conv_1)
    maxpool_2 = MaxPool2D(pool_size=(max_len - filter_sizes[2] + 1, 1))(conv_2)
    maxpool_3 = MaxPool2D(pool_size=(max_len - filter_sizes[3] + 1, 1))(conv_3)
        
    z = Concatenate(axis=1)([maxpool_0, maxpool_1, maxpool_2, maxpool_3])   
    z = Flatten()(z)
    z = Dropout(0.1)(z)
        
    outp = Dense(1, activation="sigmoid")(z)
    
    model = Model(inputs=inp, outputs=outp)
    model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              optimizer=tf.keras.optimizers.Adam(0.001),
              metrics=['accuracy'])

    return model

model = get_model()


In [0]:
model.summary()

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            [(None, 45)]         0                                            
__________________________________________________________________________________________________
embedding (Embedding)           (None, 45, 300)      27567900    input_1[0][0]                    
__________________________________________________________________________________________________
spatial_dropout1d (SpatialDropo (None, 45, 300)      0           embedding[0][0]                  
__________________________________________________________________________________________________
reshape (Reshape)               (None, 45, 300, 1)   0           spatial_dropout1d[0][0]          
______________________________________________________________________________________________

In [0]:
history = model.fit(xtrain_pad,
                    ytrain,
                    epochs=10, # should be optimized 
                    batch_size=256, # should be optimized
                    validation_data=(xvalid_pad, yvalid),
                    verbose=1)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


### Normalized prob testing

Probability for each xml was determined by averaging prediction probability of all documents to get normalized value.

Lets load the testing data

In [None]:
result_df = pd.read_csv('testing_data_dl.csv')

We will remove duplicates and convert target row into numbers, with ```male:0 and female:1```

In [0]:
result_df['gender'] = result_df['gender'].apply(lambda x: 1 if x=='male' else 0) # number conversion of target
result_df.dropna(subset=['document'], inplace=True) # drop na
result_ids = result_df['id'].unique() # get ids for all test data

In [None]:
pred_result = []
target = []
success = 0
for id in result_ids:
    xtest = result_df.loc[result_df['id']==id,'document'].values
    label = result_df.loc[result_df['id']==id,'gender'].iloc[0]
    xtest_seq = token.texts_to_sequences(xtest)
    xtest_pad = sequence.pad_sequences(xtest_seq, maxlen=max_len)
    test_prob = np.mean(model.predict(xtest_pad))
    if (test_prob>0.5 and label==1) or (test_prob<0.5 and label==0):
        success+=1
    if test_prob>0.5:
        pred_result.append(1)
    else:
        pred_result.append(0)
    target.append(label)
print(success)

In [None]:
# For 1D CNN architecture
# accuracy was: 68%
tf.math.confusion_matrix(labels=target, predictions=pred_result)

In [None]:
# For 2D CNN architecture
# accuracy was: 68.6%
tf.math.confusion_matrix(labels=target, predictions=pred_result)

Some other experiments were performed based on Universal Sentence Encoder(USE), I have used this encoder in many text classification taks and they seem to be very powerful, they are based on a famous attention based mechanism [Attention is All you need](https://arxiv.org/abs/1706.03762) 

However, it didn't performed very well. 

In [None]:
input_text = tf.keras.Input((), dtype = tf.string, name = 'input_text')
# load embedding module
embed_use = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
# convert it into a tensorflow layer
embedding_layer = hub.KerasLayer(embed_use, input_shape = [],
                           dtype = tf.string,
                           trainable = False)(input_text)

Now a feeed forward network is build on top of it

In [None]:
model = tf.keras.Sequential()
model.add(embedding_layer)
model.add(tf.keras.layers.Dense(256, activation='relu'))
model.add(tf.keras.layers.Dropout(0.2))
model.add(tf.keras.layers.Dense(128, activation='relu'))
model.add(tf.keras.layers.Dropout(0.2))
model.add(tf.keras.layers.Dense(64, activation='relu'))
model.add(tf.keras.layers.Dropout(0.5))
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))

model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              optimizer=tf.keras.optimizers.Adam(0.005),
              metrics=['accuracy'])

In [None]:
history = model.fit(x_train,
                    y_train,
                    epochs=5,
                    batch_size=128,
                    validation_data=(x_val, y_val),
                    verbose=1)

Train on 247987 samples, validate on 61993 samples<br>
Epoch 1/5<br>
247987/247987 [==============================] - 35s 142us/sample - loss: 0.6686 - accuracy: 0.5854 - val_loss: 0.6737 - <br>val_accuracy: 0.6253<br>
Epoch 2/5<br>
247987/247987 [==============================] - 32s 128us/sample - loss: 0.6625 - accuracy: 0.6060 - val_loss: 0.6746 - <br>val_accuracy: 0.6314<br>
Epoch 3/5<br>
247987/247987 [==============================] - 32s 128us/sample - loss: 0.6590 - accuracy: 0.6179 - val_loss: 0.6734 - <br>val_accuracy: 0.6290<br>
Epoch 4/5<br>
247987/247987 [==============================] - 32s 128us/sample - loss: 0.6567 - accuracy: 0.6258 - val_loss: 0.6746 - <br>val_accuracy: 0.6231<br>
Epoch 5/5<br>
247987/247987 [==============================] - 32s 129us/sample - loss: 0.6536 - accuracy: 0.6345 - val_loss: 0.6752 - <br>val_accuracy: 0.6276<br>

As we can se val_accuracy decrease and fluctuates even after tuning hyperparamter for layer size and dropout values

After some research, found that more extensive research and sophisticated architecture needs to be developed to solve particulary this type of classification problem by avoiding gender bias issue in the learned embeddings.

The final version of the model can be found in `./scripts/train.py` which can be deployed on any cloud platform.

Here, we will deploy it on **AWS SageMaker** with custom container. Please read `./deploy.ipynb` file for instructions on deploying this TF model.

Following experiments will be considered for future work.

# ToDo:

1. Spell checker and analyze mispelled ones in Text preprocessing
2. Hyperparameter tuning with LSTM different cells
3. Try an ensemble kind of structure for CNN and LSTM
4. Roberta implementation 
5. Attention with BiLSTM 

### Code formatting:
1. tensorflow data API for train, valid batch sets like https://www.tensorflow.org/hub/tutorials/bangla_article_classifier
2. Custom call with class for model if possible https://www.tensorflow.org/hub/tutorials/text_classification_with_tf_hub_on_kaggle