# Generating TV show names with a recurrent neural network (RNN)
### 1. First we wrangle some Wikipedia data to train the RNN!
The __[wikipedia Python library](https://pypi.org/project/wikipedia/)__ by Jonathan Goldsmith looks like the simplest option. Everything we want to scrape is neatly organized on one page: https://en.wikipedia.org/wiki/List_of_American_television_programs.

In [1]:
import wikipedia

wikipedia.set_lang("en")
tv = wikipedia.page("List_of_American_television_programs")
print(tv.summary)

The following lists television programs made for audiences in the United States, not counting its territories.  A separate list contains television programs made for Puerto Rico.


In [2]:
RNN_training_data = []
last_valid_ele = ""

for ele in tv.links:
    # Skip links to other "List of ..." pages.
    if "List" in ele:
        continue

    # Drop a parenthesized suffix such as "(19xx TV series)".
    # The space before "(" is kept, so these names end with a trailing space.
    if "(" in ele:
        base = ele.split("(")[0]
        if base != last_valid_ele:
            RNN_training_data.append(base)
            last_valid_ele = base
        continue

    # Skip consecutive duplicate names.
    if ele != last_valid_ele:
        RNN_training_data.append(ele)
        last_valid_ele = ele

print(RNN_training_data, "\n\nTotal TV show names:", len(RNN_training_data))

['$h*! My Dad Says', '100 Deeds for Eddie McDowd', '100 Questions', '106 & Park', '10 Things I Hate About You ', '12 Monkeys ', '13 Reasons Why', '1600 Penn', '1 vs. 100 ', '1st & Ten ', '20/20 ', '21 Jump Street', '227 ', '24: Conspiracy', '24: Live Another Day', '24 ', '2 Broke Girls', '3-2-1 Contact', '30 Days ', '30 Rock', '3AM ', '3 South', '3 lbs', '3rd Rock from the Sun', '48 Hours ', '4th and Long', '5ive Days to Midnight', '60 Minutes', '666 Park Avenue', '704 Hauser', '77 Sunset Strip', '7th Heaven ', '8 Simple Rules', '8th & Ocean', '90210 ', 'A.N.T. Farm', 'ABC Afterschool Special', 'ALF ', 'A Current Affair ', 'A Different World ', 'A Man Called Hawk', 'A Man Called Shenandoah', 'A Man Called Sloane', 'A Nero Wolfe Mystery', 'A New Kind of Family', 'A Touch of Grace', 'Aaron Stone', 'Abby ', 'About a Boy ', 'Access Hollywood', 'Accidental Family', 'According to Jim', 'Ace Crawford, Private Eye', 'Ace Ventura: Pet Detective ', 'Action ', 'Action League Now!', 'Adam-12', 'Ad
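
The filtering above can also be tried offline on a handful of titles. The following is a minimal sketch, not the notebook's exact code: the function name `clean_titles` and the sample list are made up for illustration, and unlike the cell above it also strips the trailing space left behind when a parenthesized suffix is removed.

```python
def clean_titles(links):
    """Reduce Wikipedia link titles to bare show names (illustrative helper)."""
    cleaned = []
    last = None
    for title in links:
        # Skip links to other "List of ..." pages.
        if "List" in title:
            continue
        # Drop a parenthesized suffix and the space in front of it.
        name = title.split("(")[0].strip()
        # Skip empty results and consecutive duplicates.
        if name and name != last:
            cleaned.append(name)
            last = name
    return cleaned


# Made-up sample of link titles; the suffixes are placeholders.
sample_links = [
    "List of American television programs",
    "12 Monkeys (TV series)",
    "30 Rock",
    "30 Rock",
    "Abby (TV series)",
    "Abby (another TV series)",
]

print(clean_titles(sample_links))
```

Stripping the whitespace is a design choice: it keeps names such as "12 Monkeys" free of a trailing space, at the cost of no longer matching the output shown above character for character.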

### 2. Now we move on to training the RNN!
We use __[textgenrnn](https://github.com/minimaxir/textgenrnn)__, a char-rnn module by Max Woolf, but at the word level rather than the character level, because TV show names generally need to consist of recognizable words.
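
To make the word-level versus character-level distinction concrete, here is a rough sketch of the two ways a title can be split into tokens. The helpers `char_tokens` and `word_tokens` are hypothetical and only approximate the idea; textgenrnn's own tokenizer may differ in its details.

```python
import re


def char_tokens(title):
    """Character-level view: every character is one token."""
    return list(title.lower())


def word_tokens(title):
    """Word-level view: runs of word characters, plus single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", title.lower())


title = "Ace Ventura: Pet Detective"
print(char_tokens(title))
print(word_tokens(title))
```

With word tokens the model can only ever emit words it has seen in the training data, which is why the generated names stay readable even when they make no sense.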

In [30]:
from textgenrnn import textgenrnn
textgenrnn = textgenrnn()
textgenrnn.train_on_texts(RNN_training_data,                          
                          new_model = True, 
                          rnn_layers = 2,
                          rnn_size = 64,
                          rnn_bidirectional = True, 
                          max_length = 20,
                          dim_embeddings = 32,
                          word_level = True, 
                          num_epochs = 2, 
                          gen_epochs = 1, 
                          batch_size = 128, 
                          train_size = 0.9,
                          dropout = 0.1, 
                          context = False)    

Training new model w/ 2-layer, 64-cell Bidirectional LSTMs
Training on 8,489 word sequences.
Epoch 1/2
####################
Temperature: 0.2
####################
the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the th

In [34]:
textgenrnn.generate_samples(n=2, temperatures=[0.5, 0.7, 0.9, 1.0, 1.2, 1.5])

####################
Temperature: 0.5
####################
the the the next

the the the the ted

####################
Temperature: 0.7
####################
spider

the the the hunt

####################
Temperature: 0.9
####################
the split neighborhood

wild cop gotham

####################
Temperature: 1.0
####################
with how

young giants & hannibal !

####################
Temperature: 1.2
####################
double agents s 101

force men

####################
Temperature: 1.5
####################
carter hillbillies saddles i force

genius salad endings the roswell ' t addams our none kipper kids manimal thief barney gracie recess met homicide 35
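
The pattern in these samples (repetitive at low temperature, chaotic at high temperature) is what standard temperature scaling would predict. The sketch below assumes the sampler divides log-probabilities by the temperature and renormalizes; that is the usual technique, but whether textgenrnn does exactly this is an assumption here, and the three-word distribution is invented for illustration.

```python
import math


def apply_temperature(probs, temperature):
    """Rescale a probability distribution by a sampling temperature."""
    logits = [math.log(p) / temperature for p in probs]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical next-word probabilities, not taken from the trained model.
next_word = {"the": 0.6, "show": 0.3, "gotham": 0.1}

for t in (0.2, 1.0, 1.5):
    scaled = apply_temperature(list(next_word.values()), t)
    print(t, dict(zip(next_word, (round(p, 3) for p in scaled))))
```

A temperature below 1 sharpens the distribution toward the most likely word, while a temperature above 1 flattens it so that rare words are picked more often.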



### 3. How about we remove the leading "The " from our training dataset?
The RNN appears to overweight "the" (too much for our enjoyment, anyway). This is reflected in the rank of "the" in textgenrnn_vocab.json!

{"the": 1, ".": 2, "'": 3, "of": 4, "show": 5 ...
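
A frequency ranking of this kind can be approximated directly from a list of titles. This is a sketch under assumptions: `word_ranks` is a hypothetical helper, the tokenization is a simple regex rather than textgenrnn's own, and the sample titles are placeholders rather than real training data.

```python
import re
from collections import Counter


def word_ranks(titles):
    """Map each token to its frequency rank (1 = most frequent)."""
    counts = Counter()
    for title in titles:
        counts.update(re.findall(r"\w+|[^\w\s]", title.lower()))
    ordered = counts.most_common()
    return {word: rank for rank, (word, _) in enumerate(ordered, start=1)}


# Placeholder titles, invented for illustration.
sample_titles = ["The Alpha Show", "The Beta Show", "Gamma of the Delta"]

ranks = word_ranks(sample_titles)
print(ranks["the"], ranks["show"])
```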

In [5]:
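import re

# Remove every case-sensitive "The " that is followed by another word,
# wherever it occurs in a title (not only at the start).
new_RNN_training_data = [re.sub(r"\bThe \b", "", ele) for ele in RNN_training_data]
print(new_RNN_training_data, "\n\nTotal TV show names:", len(new_RNN_training_data))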

[' $ h * ! My Dad Says', '100 Deeds for Eddie McDowd', '100 Questions', '106 & Park', '10 Things I Hate About You ', '12 Monkeys ', '13 Reasons Why', '1600 Penn', '1 vs . 100 ', '1st & Ten ', '20 / 20 ', '21 Jump Street', '227 ', '24 : Conspiracy', '24 : Live Another Day', '24 ', '2 Broke Girls', '3 - 2 - 1 Contact', '30 Days ', '30 Rock', '3AM ', '3 South', '3 lbs', '3rd Rock from the Sun', '48 Hours ', '4th and Long', '5ive Days to Midnight', '60 Minutes', '666 Park Avenue', '704 Hauser', '77 Sunset Strip', '7th Heaven ', '8 Simple Rules', '8th & Ocean', '90210 ', 'A . N . T . Farm', 'ABC Afterschool Special', 'ALF ', 'A Current Affair ', 'A Different World ', 'A Man Called Hawk', 'A Man Called Shenandoah', 'A Man Called Sloane', 'A Nero Wolfe Mystery', 'A New Kind of Family', 'A Touch of Grace', 'Aaron Stone', 'Abby ', 'About a Boy ', 'Access Hollywood', 'Accidental Family', 'According to Jim', 'Ace Crawford , Private Eye', 'Ace Ventura : Pet Detective ', 'Action ', 'Action League N
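
The pattern `\bThe \b` is not anchored, so it also removes a capitalized "The " in the middle of a title. If only the leading article should go, an anchored pattern is one option. The sketch below compares the two; the helper names and example titles are made up for illustration.

```python
import re


def strip_any_the(title):
    """Remove "The " anywhere it precedes another word."""
    return re.sub(r"\bThe \b", "", title)


def strip_leading_the(title):
    """Remove "The " only at the start of the title (ignoring leading spaces)."""
    return re.sub(r"^\s*The\s+", "", title)


# Invented titles, not taken from the training data.
for title in ["The Alpha Show", "Beta and The Gamma"]:
    print(repr(strip_any_the(title)), repr(strip_leading_the(title)))
```

Either choice is defensible: the unanchored version removes more occurrences of "The", while the anchored version leaves mid-title wording untouched.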

In [36]:
from textgenrnn import textgenrnn
textgenrnn = textgenrnn()
textgenrnn.train_on_texts(new_RNN_training_data,                          
                          new_model = True, 
                          rnn_layers = 2,
                          rnn_size = 64,
                          rnn_bidirectional = True, 
                          max_length = 20,
                          dim_embeddings = 32,
                          word_level = True, 
                          num_epochs = 2, 
                          gen_epochs = 1, 
                          batch_size = 128, 
                          train_size = 0.9,
                          dropout = 0.1, 
                          context = False)   

Training new model w/ 2-layer, 64-cell Bidirectional LSTMs
Training on 7,983 word sequences.
Epoch 1/2
####################
Temperature: 0.2
####################






####################
Temperature: 0.5
####################
the



s

####################
Temperature: 1.0
####################
hearts

always new

dance show

Epoch 2/2
####################
Temperature: 0.2
####################
new new new new the rangers

'



####################
Temperature: 0.5
####################
electric

knight

big law

####################
Temperature: 1.0
####################
of , hogan , style

nfl of real & rack t

in new loves madeline



In [40]:
textgenrnn.generate_samples(n=2, temperatures=[0.5, 0.7, 0.9, 1.0, 1.2, 1.5])

####################
Temperature: 0.5
####################
territories

my alias

####################
Temperature: 0.7
####################
amazing rose

game

####################
Temperature: 0.9
####################
head

zoey family

####################
Temperature: 1.0
####################
outsourced of luck

days

####################
Temperature: 1.2
####################
recreation broke

randy patty ? of leftovers edition

####################
Temperature: 1.5
####################
go bunch charlie hazel house - how states show s baltimore eye longmire show

affair progress franklin me tin marvels mind america
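
Some generated samples come back empty, and others may simply repeat a title from the training data. One way to tidy a batch is sketched below; `keep_novel` is a hypothetical helper that works on plain lists of strings, so it makes no assumptions about how the samples were obtained from textgenrnn.

```python
def keep_novel(samples, training_titles):
    """Keep non-empty, de-duplicated samples that are not training titles."""

    def normalize(text):
        # Lowercase and collapse runs of whitespace.
        return " ".join(text.lower().split())

    known = {normalize(t) for t in training_titles}
    kept = []
    for sample in samples:
        name = normalize(sample)
        if name and name not in known and name not in kept:
            kept.append(name)
    return kept


generated = ["", "my alias", "30 rock", "my alias", "amazing rose"]
training = ["30 Rock", "ALF "]
print(keep_novel(generated, training))
```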



This project was inspired by learning about the glamor of Web 3.0 in a platform management class at __[CUHK MBA](https://twitter.com/cuhk_mba)__ and by discussions with Dr. Tsai of Fudan University on technology's steady march forward. Finally, heartfelt thanks to my brother for his encouragement.