# Superhero (and Supervillain) Name Generator

---

[Superhero Names Dataset](https://github.com/am1tyadav/superhero)

## Task 1

1. Import the data
2. Create a tokenizer
3. Build char-to-index and index-to-char dictionaries

In [2]:
!git clone https://github.com/am1tyadav/superhero

Cloning into 'superhero'...
remote: Enumerating objects: 8, done.
remote: Counting objects: 100% (8/8), done.
remote: Compressing objects: 100% (7/7), done.
remote: Total 8 (delta 0), reused 4 (delta 0), pack-reused 0
Unpacking objects: 100% (8/8), done.


In [3]:
with open('superhero/superheroes.txt', 'r') as file:
  data = file.read()
data[:100]

'jumpa\t\ndoctor fate\t\nstarlight\t\nisildur\t\nlasher\t\nvarvara\t\nthe target\t\naxel\t\nbattra\t\nchangeling\t\npyrrh'

In [4]:
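import tensorflow as tf

# fit_on_texts (next cell) is given one long string, so it iterates over it
# character by character and the vocabulary ends up character-level.
# '\t' is not in `filters`, so it survives as the end-of-name marker.
tokenizer = tf.keras.preprocessing.text.Tokenizer(
    filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~',
    split='\n',
)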

In [5]:
tokenizer.fit_on_texts(data)

In [6]:
char_to_idx = tokenizer.word_index
idx_to_char = {v: k for k, v in char_to_idx.items()}
print(idx_to_char)

{1: '\t', 2: 'a', 3: 'e', 4: 'r', 5: 'o', 6: 'n', 7: 'i', 8: ' ', 9: 't', 10: 's', 11: 'l', 12: 'm', 13: 'h', 14: 'd', 15: 'c', 16: 'u', 17: 'g', 18: 'k', 19: 'b', 20: 'p', 21: 'y', 22: 'w', 23: 'f', 24: 'v', 25: 'j', 26: 'z', 27: 'x', 28: 'q'}
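
To make the mapping above less of a black box, here is a minimal standard-library sketch of the same idea: count characters, drop the newline separator, and number the rest from 1 in order of decreasing frequency. The function name `build_char_vocab` and the two-name sample are illustrative assumptions, not part of the notebook, and the sketch ignores the punctuation `filters` passed to the Keras tokenizer.

```python
from collections import Counter

def build_char_vocab(text, skip="\n"):
    """Map each character to an index starting at 1, most frequent first."""
    counts = Counter(ch for ch in text.lower() if ch not in skip)
    # Index 0 is left unused so it can serve as the padding value later.
    char_to_idx = {ch: i for i, (ch, _) in enumerate(counts.most_common(), start=1)}
    idx_to_char = {i: ch for ch, i in char_to_idx.items()}
    return char_to_idx, idx_to_char

# Tiny made-up sample in the same "name<TAB><NEWLINE>" layout as the dataset.
sample = "axel\t\nbattra\t\n"
char_to_idx, idx_to_char = build_char_vocab(sample)
print(char_to_idx)
```

On the full dataset the tab comes out as index 1 because every name ends with one, which is consistent with the dictionary printed above.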


## Task 2

1. Converting between names and sequences

In [7]:
names = data.split('\n')
names[:10]

['jumpa\t',
 'doctor fate\t',
 'starlight\t',
 'isildur\t',
 'lasher\t',
 'varvara\t',
 'the target\t',
 'axel\t',
 'battra\t',
 'changeling\t']

In [8]:
tokenizer.texts_to_sequences(names[0])

[[25], [16], [12], [20], [2], [1]]

In [9]:
def name_to_seq(name):
  # Encode one character at a time; assumes every character is in the vocabulary.
  return [tokenizer.texts_to_sequences(c)[0][0] for c in name]

In [10]:
name_to_seq(names[0])

[25, 16, 12, 20, 2, 1]

In [11]:
def seq_to_name(seq):
  # Index 0 is padding, so it is skipped when decoding.
  return ''.join(idx_to_char[i] for i in seq if i != 0)


In [12]:
seq_to_name(name_to_seq(names[0]))

'jumpa\t'
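
`name_to_seq` above goes through the tokenizer for every character, and it appears to raise an `IndexError` for any character the tokenizer filtered out or never saw (a digit or a hyphen in a seed, for example). A possible alternative, sketched here under the assumption that silently dropping unknown characters is acceptable, is a plain dictionary lookup. The helper names `name_to_seq_lookup` and `seq_to_name_lookup` are hypothetical, and the small dictionary is just the subset of the Task 1 mapping needed for `'jumpa\t'`.

```python
# Subset of the mapping printed in Task 1 (only the characters of 'jumpa\t').
char_to_idx = {"\t": 1, "a": 2, "m": 12, "u": 16, "p": 20, "j": 25}
idx_to_char = {i: ch for ch, i in char_to_idx.items()}

def name_to_seq_lookup(name):
    """Encode a name by dictionary lookup, dropping unknown characters."""
    return [char_to_idx[ch] for ch in name.lower() if ch in char_to_idx]

def seq_to_name_lookup(seq):
    """Decode indices back to text, ignoring the padding index 0."""
    return "".join(idx_to_char[i] for i in seq if i != 0)

encoded = name_to_seq_lookup("jumpa\t")
print(encoded)                                      # [25, 16, 12, 20, 2, 1]
print(repr(seq_to_name_lookup([0, 0] + encoded)))   # 'jumpa\t'
print(name_to_seq_lookup("jump-4"))                 # [25, 16, 12, 20]
```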

## Task 3

1. Creating sequences
2. Padding all sequences

In [13]:
sequences = []
for name in names:
  seq = name_to_seq(name)
  if len(seq) >= 2:
    # Every prefix of length >= 2 becomes one training example.
    sequences += [seq[:i] for i in range(2, len(seq) + 1)]


In [14]:
sequences[:10]

[[25, 16],
 [25, 16, 12],
 [25, 16, 12, 20],
 [25, 16, 12, 20, 2],
 [25, 16, 12, 20, 2, 1],
 [14, 5],
 [14, 5, 15],
 [14, 5, 15, 9],
 [14, 5, 15, 9, 5],
 [14, 5, 15, 9, 5, 4]]

In [15]:
max_len = max(len(x) for x in sequences)
print(max_len)

33


In [16]:
padded_sequences = tf.keras.preprocessing.sequence.pad_sequences(
    sequences, padding='pre', maxlen=max_len
)
print(padded_sequences[0])

[ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0 25 16]


In [17]:
padded_sequences.shape

(88279, 33)
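
The two steps of this task, prefix generation and left-padding, can be sketched without TensorFlow as follows. `make_prefixes` and `pad_pre` are hypothetical helper names, and the target length of 8 is chosen only to keep the printed rows short; the notebook itself pads to `max_len` (33).

```python
def make_prefixes(seq):
    """All prefixes of length >= 2, so each has at least one input and one target."""
    return [seq[:i] for i in range(2, len(seq) + 1)]

def pad_pre(seq, maxlen, value=0):
    """Left-pad with `value`; if the sequence is too long, keep the last `maxlen` items."""
    seq = seq[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq

jumpa = [25, 16, 12, 20, 2, 1]   # 'jumpa\t' from Task 2
prefixes = make_prefixes(jumpa)
padded = [pad_pre(p, maxlen=8) for p in prefixes]
for row in padded:
    print(row)
```

Padding on the left keeps the most recent characters, and the character to predict, at the right-hand edge of every row, which is what the slicing in the next task relies on.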

## Task 4: Creating Training and Validation Sets

1. Creating training and validation sets

In [18]:
# Inputs are all but the last column; the target is the last character.
x, y = padded_sequences[:, :-1], padded_sequences[:, -1]
print(x.shape, y.shape)

(88279, 32) (88279,)
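
To see what the slicing means in terms of characters, the sketch below decodes a few hand-written padded rows into (context, next character) pairs. The rows use a width of 8 rather than 33 and the dictionary is a subset of the Task 1 mapping; both are simplifications for illustration.

```python
padded_rows = [
    [0, 0, 0, 0, 0, 0, 25, 16],
    [0, 0, 0, 0, 0, 25, 16, 12],
    [0, 0, 25, 16, 12, 20, 2, 1],
]
idx_to_char = {1: "\t", 2: "a", 12: "m", 16: "u", 20: "p", 25: "j"}

def decode(seq):
    """Turn indices back into text, skipping the padding index 0."""
    return "".join(idx_to_char[i] for i in seq if i != 0)

# Everything but the last column is the input; the last column is the target.
pairs = [(decode(row[:-1]), idx_to_char[row[-1]]) for row in padded_rows]
for context, target in pairs:
    print(repr(context), "->", repr(target))
```

The last pair shows why the trailing tab matters: the model is also trained to predict where a name ends.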


In [19]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y)
print(x_train.shape, y_train.shape, x_test.shape, y_test.shape)

(66209, 32) (66209,) (22070, 32) (22070,)


In [20]:
num_char = len(char_to_idx) + 1  # +1 because index 0 is reserved for padding
print(num_char)

29


## Task 5: Creating the Model

In [35]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Conv1D, MaxPool1D, LSTM, Dense

model = Sequential([
    Embedding(num_char, 8, input_length=max_len - 1),
    Conv1D(64, 5, strides=1, activation='tanh', padding='causal'),
    MaxPool1D(2),
    LSTM(32),
    Dense(num_char, activation='softmax'),
])
model.compile(
    loss='sparse_categorical_crossentropy',
    optimizer='adam',
    metrics=['accuracy'],
)
model.summary()

Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_4 (Embedding)      (None, 32, 8)             232       
_________________________________________________________________
conv1d_4 (Conv1D)            (None, 32, 64)            2624      
_________________________________________________________________
max_pooling1d_4 (MaxPooling1 (None, 16, 64)            0         
_________________________________________________________________
lstm_6 (LSTM)                (None, 32)                12416     
_________________________________________________________________
dense_4 (Dense)              (None, 29)                957       
Total params: 16,229
Trainable params: 16,229
Non-trainable params: 0
_________________________________________________________________
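
The parameter counts in the summary can be checked by hand. The sketch below redoes the arithmetic from the layer sizes used above, using the standard formulas for each layer type; the variable names are mine, not the notebook's.

```python
num_char, embed_dim = 29, 8
conv_filters, kernel_size = 64, 5
lstm_units = 32

embedding = num_char * embed_dim                                # one vector per index
conv1d = kernel_size * embed_dim * conv_filters + conv_filters  # weights + biases
# An LSTM has 4 gates, each with input weights, recurrent weights and a bias.
lstm = 4 * (lstm_units * (conv_filters + lstm_units) + lstm_units)
dense = lstm_units * num_char + num_char                        # weights + biases

total = embedding + conv1d + lstm + dense
print(embedding, conv1d, lstm, dense, total)  # 232 2624 12416 957 16229
```

`MaxPool1D` has no weights, which is why it shows 0 parameters; it only halves the sequence length from 32 to 16.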


## Task 6: Training the Model

In [36]:
model.fit(
    x_train, y_train,
    validation_data=(x_test, y_test),
    epochs=50, verbose=2,
    callbacks=[tf.keras.callbacks.EarlyStopping(monitor='val_accuracy', patience=3)],
)

Epoch 1/50
2070/2070 - 12s - loss: 2.7436 - accuracy: 0.1905 - val_loss: 2.5787 - val_accuracy: 0.2248
Epoch 2/50
2070/2070 - 9s - loss: 2.5304 - accuracy: 0.2383 - val_loss: 2.5002 - val_accuracy: 0.2472
Epoch 3/50
2070/2070 - 9s - loss: 2.4625 - accuracy: 0.2552 - val_loss: 2.4457 - val_accuracy: 0.2642
Epoch 4/50
2070/2070 - 9s - loss: 2.4164 - accuracy: 0.2674 - val_loss: 2.4161 - val_accuracy: 0.2717
Epoch 5/50
2070/2070 - 10s - loss: 2.3833 - accuracy: 0.2762 - val_loss: 2.3866 - val_accuracy: 0.2786
Epoch 6/50
2070/2070 - 10s - loss: 2.3556 - accuracy: 0.2824 - val_loss: 2.3632 - val_accuracy: 0.2824
Epoch 7/50
2070/2070 - 9s - loss: 2.3300 - accuracy: 0.2891 - val_loss: 2.3457 - val_accuracy: 0.2903
Epoch 8/50
2070/2070 - 9s - loss: 2.3070 - accuracy: 0.2969 - val_loss: 2.3299 - val_accuracy: 0.2927
Epoch 9/50
2070/2070 - 9s - loss: 2.2857 - accuracy: 0.3032 - val_loss: 2.3155 - val_accuracy: 0.2977
Epoch 10/50
2070/2070 - 9s - loss: 2.2659 - accuracy: 0.3094 - val_loss: 2.2996

<tensorflow.python.keras.callbacks.History at 0x7f6de70f4710>
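
The log above is cut off, so it does not show where this run ended. As a rough illustration of what `patience=3` means, here is a simplified sketch of the stopping rule: training halts once three consecutive epochs fail to beat the best validation accuracy seen so far. `early_stop_epoch` is a hypothetical helper, the accuracy values are made up, and the real Keras callback has further options (such as `min_delta`) that this sketch leaves out.

```python
def early_stop_epoch(history, patience=3):
    """Return the 1-based epoch at which training would stop, or None if it never does."""
    best = float("-inf")
    wait = 0
    for epoch, value in enumerate(history, start=1):
        if value > best:
            best, wait = value, 0   # improvement: reset the counter
        else:
            wait += 1               # no improvement this epoch
            if wait >= patience:
                return epoch
    return None

# Hypothetical validation accuracies, not taken from the run above.
made_up = [0.22, 0.25, 0.27, 0.26, 0.27, 0.265, 0.28]
print(early_stop_epoch(made_up))  # 6
```

In this made-up series, training stops at epoch 6 and never reaches the better value at epoch 7, which is the usual trade-off of a small patience.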

## Task 7: Generate Names!

In [37]:
def generate_names(seed):
  # Greedily append the most likely next character, up to 40 times,
  # stopping early once the model emits the end-of-name tab.
  for i in range(40):
    seq = name_to_seq(seed)
    padded = tf.keras.preprocessing.sequence.pad_sequences(
        [seq], padding='pre', maxlen=max_len - 1, truncating='pre'
    )
    pred = model.predict(padded)[0]
    pred_char = idx_to_char[tf.argmax(pred).numpy()]
    seed += pred_char
    if pred_char == '\t':
      break
  print(seed)

In [41]:
generate_names('yash')

yashian strange	
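
The control flow of `generate_names` can be exercised without a trained model. The sketch below keeps the same greedy loop but swaps the network for a stand-in function that returns a probability per character; `greedy_generate` and `toy_predictor` are hypothetical names, and the toy predictor simply spells out a fixed suffix.

```python
def greedy_generate(seed, predict_next, end_char="\t", max_steps=40):
    """Append the most likely next character until `end_char` or `max_steps`."""
    for _ in range(max_steps):
        probs = predict_next(seed)             # dict: char -> probability
        next_char = max(probs, key=probs.get)  # greedy choice, like argmax
        seed += next_char
        if next_char == end_char:
            break
    return seed

# Hypothetical stand-in for the trained model: it spells out a fixed suffix.
def toy_predictor(text, suffix="man\t"):
    step = len(text) - len("bat")
    target = suffix[min(step, len(suffix) - 1)]
    return {ch: (0.9 if ch == target else 0.1 / 3) for ch in "man\t"}

print(repr(greedy_generate("bat", toy_predictor)))  # 'batman\t'
```

Because the choice is always the single most likely character, the same seed always yields the same name; that holds for the notebook's `generate_names` as well.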
