# Wait Wait, Don't Analyze Me!

![NPR logo](https://media.npr.org/branding/programs/wait-wait-dont-tell-me/branding_main-c5920a167d6a5d445ce86fac30b90454223b6b57.png "One nerd's attempt to learn everything there is to know about NPR's greatest quiz show.")


# Introduction
[Wait Wait, Don't Tell Me!](https://www.npr.org/programs/wait-wait-dont-tell-me/) is NPR's longest-running news quiz show. Contestants call in to answer questions about the week's news, and a rotating cast of three panelists makes jokes and parodies newsworthy (and not-so-newsworthy) current events. Listening to "Wait Wait" has been a highlight of my week since I was a kid, and it remains one of NPR's most popular segments. So what better way to show my appreciation than to take it apart and see what makes it tick?

For this project, I have pulled text transcripts of each episode of "Wait Wait" and stored them in a MySQL database. I have two goals:
1. Understand and predict jokes in the program.
2. Create a "Wait Wait" transcript generator, so that I don't have to wait a whole week between episodes!

In this section, I will create a transcript generator.

# Table of Contents
* 0 Data Processing
    * 0.1 [Loading data](#data-loading)
    * 0.2 [Example transcript](#data-example)
    * 0.3 [Encoding transcripts](#data-encoding)
    * 0.4 [Building a training set](#data-train)
* 1 Modeling
    * 1.1 [Model Architecture](#model-initialize)

# Section 0: Initial data processing

## 0.1 Loading the data <a name="data-loading"></a>
Before I can analyze the data, I must first load and process it. To accomplish this, I wrote a simple function that pulls transcripts from the MySQL database.

In [1]:
# Importing the libraries I'll be using
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import mysql.connector
import re
from sklearn.preprocessing import QuantileTransformer
import time
import matplotlib.cm as cm

%matplotlib inline

# change the default font size in figures to be larger
font = {'size'   : 15}
plt.rc('font', **font)

In [2]:
# connect to the database of wait wait don't tell me transcripts
cnx = mysql.connector.connect(database='wait_wait',
                              user='root')

In [3]:
# function to pull some transcripts from the database
def pull_transcript(n=5):
    # instantiate a cursor to select data from the database
    curs = cnx.cursor()
    curs.execute(f'select * from transcripts limit {n}')
    
    # pull the data and convert to a pandas dataframe
    df = pd.DataFrame(data = np.array(curs.fetchmany(n)),columns=curs.column_names)
    df = df.set_index('id')
    
    # close the cursor
    curs.close()
    return df
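
The query above formats `n` directly into the SQL string. A minimal sketch of the same pull with a bound parameter is shown below. It uses the standard library's `sqlite3` and a hypothetical in-memory `transcripts` table as a stand-in for the real database, and `pull_rows` is an illustrative name, not part of the notebook. Note that `sqlite3` uses `?` placeholders, while MySQL Connector/Python uses `%s`.

```python
import sqlite3

def pull_rows(conn, n=5):
    """Return (column_names, rows) for the first n transcripts."""
    curs = conn.cursor()
    try:
        # bind n as a parameter instead of formatting it into the SQL string
        curs.execute('select * from transcripts order by id limit ?', (int(n),))
        column_names = [col[0] for col in curs.description]
        rows = curs.fetchall()
    finally:
        # close the cursor even if the query fails
        curs.close()
    return column_names, rows

# hypothetical in-memory table standing in for the real database
demo_cnx = sqlite3.connect(':memory:')
demo_cnx.execute(
    'create table transcripts (id integer primary key, segment text, transcript text)')
demo_cnx.executemany(
    'insert into transcripts (segment, transcript) values (?, ?)',
    [('who', 'first'), ('panel', 'second'), ('bluff', 'third')])

demo_columns, demo_rows = pull_rows(demo_cnx, n=2)
print(demo_columns)
print(demo_rows)
```

The column names and rows returned here are the same two pieces that `pd.DataFrame` receives in the function above.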

Let's go ahead and pull all of the transcripts from the database - this dataset happens to be small enough that I can load it all at once.

I also divide the transcripts randomly into testing, training, and validation sets. This will ensure that when I perform analyses, I don't build models that over-fit the data.

In [4]:
num_transcripts = 4131
transcript_df = pull_transcript(n=num_transcripts)

# split the tables into testing and training sets, so that we don't over-fit. 
np.random.seed(42) # Ensures that the split is the same each round
transcript_df['train'] = np.random.rand(num_transcripts)>.2
transcript_df['test'] = transcript_df['train']==False

# Further separate the training dataset into a training and validation set
transcript_df['val'] = (np.random.rand(num_transcripts)>.8) & (transcript_df['train'])

# ensure that the training and validation sets don't overlap
transcript_df['train'] = transcript_df['train'] & (transcript_df['val']==False)
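
As a rough check on what this scheme produces: about 80% of rows are first marked as training, and about 20% of those are then moved to validation, so the expected proportions are roughly 64% train, 16% validation, and 20% test. The sketch below simulates the same logic with the standard library's `random` module on a hypothetical 10,000 rows (not the real dataset), so the exact fractions will differ from those in `transcript_df`.

```python
import random

random.seed(42)        # fixed seed so the sketch is repeatable
n_rows = 10000         # hypothetical row count, chosen only for illustration

labels = []
for _ in range(n_rows):
    in_train = random.random() > 0.2              # roughly 80% of rows
    in_val = in_train and random.random() > 0.8   # roughly 20% of those rows
    if not in_train:
        labels.append('test')
    elif in_val:
        labels.append('val')
    else:
        labels.append('train')

# every row receives exactly one label, so the three sets cannot overlap
fractions = {name: labels.count(name) / n_rows for name in ('train', 'val', 'test')}
print(fractions)
```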

In [5]:
transcript_df.head(10)

id,episode_id,aired_at,url,segment,transcript,train,test,val
1,1,2019-05-04,https://www.npr.org/templates/transcript/trans...,who,\n \n \n \n\n BILL KURTIS: Fro...,True,False,False
2,1,2019-05-04,https://www.npr.org/templates/transcript/trans...,panel,"\n \n \n \n\n PETER SAGAL, HOS...",True,False,False
3,1,2019-05-04,https://www.npr.org/templates/transcript/trans...,bluff,\n \n \n \n\n BILL KURTIS: Fro...,True,False,False
4,1,2019-05-04,https://www.npr.org/templates/transcript/trans...,job,"\n \n \n \n\n PETER SAGAL, HOS...",True,False,False
5,1,2019-05-04,https://www.npr.org/templates/transcript/trans...,panel,"\n \n \n \n\n PETER SAGAL, HOS...",False,True,False
6,1,2019-05-04,https://www.npr.org/templates/transcript/trans...,limerick,"\n \n \n \n\n PETER SAGAL, HOS...",False,True,False
7,1,2019-05-04,https://www.npr.org/templates/transcript/trans...,lightning,"\n \n \n \n\n PETER SAGAL, HOS...",False,True,False
8,1,2019-05-04,https://www.npr.org/templates/transcript/trans...,predictions,"\n \n \n \n\n PETER SAGAL, HOS...",True,False,False
9,2,2019-04-27,https://www.npr.org/templates/transcript/trans...,who,\n \n \n \n\n BILL KURTIS: Fro...,True,False,False
10,2,2019-04-27,https://www.npr.org/templates/transcript/trans...,panel,"\n \n \n \n\n PETER SAGAL, HOS...",True,False,False


In [6]:
# For simplicity, I'll set all letters to lower-case
transcript_df.loc[:,'transcript'] = transcript_df.loc[:,'transcript'].str.lower()

## 0.2 Example transcript<a name="data-example"></a>

To understand the data, it helps to first see what the raw data looks like. Let's print the beginning of the first transcript.

In [7]:
print(transcript_df.loc[1,'transcript'][:500])


    
        
    

    bill kurtis: from npr and wbez chicago, this is wait wait... don't tell me, the npr news quiz. hey, arthur miller - step into this cruci-bill (ph).
    (laughter)
    kurtis: i'm bill kurtis. and here's your host at the chase bank auditorium in downtown chicago, peter sagal.
    peter sagal, host: 
    thank you, bill. thank you, everybody.
    (cheering)
    sagal: thank you so much. we have a very interesting show for you today. later on, we're going to be talking to m


In [8]:
transcript_df.loc[1,'transcript'][:1000]

"\n    \n        \n    \n\n    bill kurtis: from npr and wbez chicago, this is wait wait... don't tell me, the npr news quiz. hey, arthur miller - step into this cruci-bill (ph).\n    (laughter)\n    kurtis: i'm bill kurtis. and here's your host at the chase bank auditorium in downtown chicago, peter sagal.\n    peter sagal, host: \n    thank you, bill. thank you, everybody.\n    (cheering)\n    sagal: thank you so much. we have a very interesting show for you today. later on, we're going to be talking to microsoft co-founder steve ballmer. he is, we believe, the richest guest we've ever had. but, of course, your true wealth is measured in your friends. and this just in - he has more friends, too.\n    (laughter)\n    sagal: but first, as many of you know, the npr podcast feeds got all screwed up last week. people who tried to download our show got, for example, how i built this instead, for which i apologize. and the people who wanted how i built this got us, for which i apologize eve

A few features stand out. First, audience responses are marked with tags such as '(LAUGHTER)' and '(APPLAUSE)' in the raw transcripts; after lower-casing, they appear as '(laughter)' and '(cheering)' in the excerpt above. This will prove very useful, as it gives us an automatic metric for the "funniness" of the preceding text.

In the raw transcripts, speakers' names are in all caps (lower-cased in the excerpt above) and are followed by a colon. Each speaker turn also starts on a new line with leading indentation, which could be used to segment the text into phrases by speaker.
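
The two observations above can be sketched as a small parser. This is only a heuristic illustration, not the approach used later in the notebook: the `sample` string is a shortened copy of the excerpt above, and both regular expressions are assumptions about the layout (a line such as "note: ..." inside someone's speech would be misread as a speaker label).

```python
import re

sample = (
    "\n    bill kurtis: from npr and wbez chicago, this is wait wait... don't tell me.\n"
    "    (laughter)\n"
    "    kurtis: i'm bill kurtis.\n"
    "    peter sagal, host: \n"
    "    thank you, bill. thank you, everybody.\n"
    "    (cheering)\n"
    "    sagal: thank you so much.\n"
)

# audience reactions sit on their own line, wrapped in parentheses
reaction_pattern = re.compile(r'^\s*\((laughter|applause|cheering)\)\s*$')
# a speaker label is a short run of letters and light punctuation, then a colon
speaker_pattern = re.compile(r"^\s*([a-z][a-z .,'\-]{0,40}):\s*(.*)$")

turns = []  # each entry is [speaker, text, reactions]
for line in sample.splitlines():
    if not line.strip():
        continue
    reaction = reaction_pattern.match(line)
    speaker = speaker_pattern.match(line)
    if reaction and turns:
        # attach the reaction to the turn that preceded it
        turns[-1][2].append(reaction.group(1))
    elif speaker:
        turns.append([speaker.group(1), speaker.group(2), []])
    elif turns:
        # continuation line: text that belongs to the current speaker
        turns[-1][1] = (turns[-1][1] + ' ' + line.strip()).strip()

for speaker_name, text, reactions in turns:
    print(speaker_name, '|', text, '|', reactions)
```

Each turn ends up paired with the audience reactions that followed it, which is the raw material for a "did this line get a laugh" label.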

## 0.3 Encoding the text <a name='data-encoding'></a>



I will be building a letter-based (character-level) generator for now, so I need a way to encode both letters and punctuation as integers. These integers will later be converted to a one-hot encoding before being passed to the model.

In [9]:
from sklearn.preprocessing import OneHotEncoder

In [10]:
all_tokens = set(transcript_df.loc[transcript_df.train,'transcript'].str.cat())
print(f'The dataset includes {len(all_tokens)} unique tokens')

The dataset includes 77 unique tokens


In [11]:
# Make a dictionary converting letters/punctuation to integers
conversion_dict = {}
for i, token in enumerate(all_tokens):
    conversion_dict[token] = i
    
# Make a second dictionary to go in the other direction
reversion_dict = dict( (v,k) for k, v in conversion_dict.items() )

In [12]:
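# simple functions to encode and decode a transcript
def encode_transcript(transcript):
    # characters not seen in the training set share one "unknown" index
    return [conversion_dict.get(n,len(all_tokens)) for n in transcript]
def decode_transcript(transcript):
    # the unknown index has no character of its own, so it decodes to an empty string
    return [reversion_dict.get(n,'') for n in transcript]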
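
To make the round trip concrete, here is a minimal sketch on a toy vocabulary. The names `train_text`, `vocab`, `encode`, and `decode` are hypothetical and chosen for illustration. One difference from the cells above is an assumption worth flagging: the vocabulary is sorted, because the iteration order of a `set` of strings can change between interpreter sessions, which would silently change the integer assigned to each character.

```python
train_text = "hello, world"        # hypothetical stand-in for the training transcripts
vocab = sorted(set(train_text))    # sorting fixes the order across interpreter sessions
unknown_index = len(vocab)         # shared index for characters outside the vocabulary

char_to_int = {ch: i for i, ch in enumerate(vocab)}
int_to_char = {i: ch for ch, i in char_to_int.items()}

def encode(text):
    # map each character to its integer, or to the unknown index
    return [char_to_int.get(ch, unknown_index) for ch in text]

def decode(indices, placeholder='?'):
    # join back into a string; the unknown index becomes a visible placeholder
    return ''.join(int_to_char.get(i, placeholder) for i in indices)

encoded_hello = encode("hello")
round_trip = decode(encoded_hello)
with_unknown = decode(encode("help!"))

print(encoded_hello)   # [4, 3, 5, 5, 6]
print(round_trip)      # hello
print(with_unknown)    # hel??
```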

In [13]:
# Encode each of the transcripts by converting letters and punctuation to integers
transcript_df['encoded'] = transcript_df.loc[:,'transcript'].apply(encode_transcript)

In [14]:
transcript_df.head()

id,episode_id,aired_at,url,segment,transcript,train,test,val,encoded
1,1,2019-05-04,https://www.npr.org/templates/transcript/trans...,who,\n \n \n \n\n bill kurtis: fro...,True,False,False,"[5, 26, 26, 26, 26, 5, 26, 26, 26, 26, 26, 26,..."
2,1,2019-05-04,https://www.npr.org/templates/transcript/trans...,panel,"\n \n \n \n\n peter sagal, hos...",True,False,False,"[5, 26, 26, 26, 26, 5, 26, 26, 26, 26, 26, 26,..."
3,1,2019-05-04,https://www.npr.org/templates/transcript/trans...,bluff,\n \n \n \n\n bill kurtis: fro...,True,False,False,"[5, 26, 26, 26, 26, 5, 26, 26, 26, 26, 26, 26,..."
4,1,2019-05-04,https://www.npr.org/templates/transcript/trans...,job,"\n \n \n \n\n peter sagal, hos...",True,False,False,"[5, 26, 26, 26, 26, 5, 26, 26, 26, 26, 26, 26,..."
5,1,2019-05-04,https://www.npr.org/templates/transcript/trans...,panel,"\n \n \n \n\n peter sagal, hos...",False,True,False,"[5, 26, 26, 26, 26, 5, 26, 26, 26, 26, 26, 26,..."


In [15]:
# Generate a one-hot encoder to finally yield data in a one-hot version
onehotencoder = OneHotEncoder(categories='auto',sparse=False)
onehotencoder.fit(np.concatenate(transcript_df.encoded.values).reshape(-1,1))

OneHotEncoder(categorical_features=None, categories='auto', drop=None,
              dtype=<class 'numpy.float64'>, handle_unknown='error',
              n_values=None, sparse=False)

In [16]:
# Define a function to apply the one hot scheme to the integer-encoded values
def one_hot_transcript(transcript):
    integer_transcript = np.array(encode_transcript(transcript)).reshape(-1,1)
    return onehotencoder.transform(integer_transcript)

In [17]:
# Convert the encoded transcript values to the one-hot encodings
transcript_df['encoded'] = transcript_df.loc[:,'transcript'].apply(one_hot_transcript)
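
For readers unfamiliar with one-hot encoding, the sketch below shows what the transformation does to a short list of integers, using plain Python lists rather than the scikit-learn encoder. The helper names `one_hot` and `argmax_decode` are hypothetical, and the four-class example is made up for illustration.

```python
def one_hot(indices, n_classes):
    """Return one row per index, with a single 1.0 in the column for that index."""
    rows = []
    for index in indices:
        row = [0.0] * n_classes
        row[index] = 1.0
        rows.append(row)
    return rows

def argmax_decode(rows):
    """Recover the integer indices from one-hot rows."""
    return [row.index(max(row)) for row in rows]

toy_indices = [2, 0, 3]                       # hypothetical integer-encoded characters
toy_one_hot = one_hot(toy_indices, n_classes=4)

for row in toy_one_hot:
    print(row)
print(argmax_decode(toy_one_hot))             # [2, 0, 3]
```

Each transcript therefore becomes a 2-D array with one row per character and one column per token, which is the shape the windowing code in the next section slices.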

## 0.4 Building a training set <a name='data-train'></a>

To build the training set, we take each training transcript and break it into fixed-size windows. Each "x" value is a window of `n_times` consecutive one-hot encoded characters, and the matching "y" value is the one-hot encoding of the character that immediately follows the window. A new window starts every `step_size` characters, so consecutive windows overlap. The goal of the model is to predict the next letter (or punctuation mark), given the previous letters.
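
A toy version of this windowing, applied to a plain string instead of one-hot arrays, may make the idea easier to see. The function name `make_windows` and the small parameter values are assumptions for illustration. The loop bound here uses every window that still has a following character, which is slightly less conservative than the bound used in the notebook's own function.

```python
def make_windows(sequence, n_times, step_size):
    """Split a sequence into (input window, next item) pairs."""
    xs, ys = [], []
    # stop while there is still one item left after the window to use as the target
    for start in range(0, len(sequence) - n_times, step_size):
        xs.append(sequence[start:start + n_times])
        ys.append(sequence[start + n_times])
    return xs, ys

toy_x, toy_y = make_windows("wait wait", n_times=4, step_size=2)

for window, target in zip(toy_x, toy_y):
    print(repr(window), '->', repr(target))
# 'wait' -> ' '
# 'it w' -> 'a'
# ' wai' -> 't'
```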

In [19]:
# parameters
n_times = 100
n_components = len(all_tokens)+1
step_size = 50
batch_size = 32

In [20]:
def generate_model_data(transcript):
    # makes calculating size easier to pre-build the iterator
    iterator = range(0,transcript.shape[0]-step_size-n_times,step_size) 
    
    # calculate the size of data we will be generating
    n_examples = len(iterator)

    # initialize x and y values
    x = np.zeros([n_examples,n_times,n_components])
    y = np.zeros([n_examples,n_components])

    # fill in the values for each split
    for step,startpos in enumerate(iterator):
        x[step] = transcript[startpos:startpos+n_times,:]
        y[step] = transcript[startpos+n_times,:]
    
    # return x and y
    return x,y

In [43]:
# For each element of the training transcript set, generate training data
def combine_model_data(transcript_df):
    x = list()
    y = list()
    for i,transcript in enumerate(transcript_df.encoded):
        x_,y_ = generate_model_data(transcript)
        x.append(x_)
        y.append(y_)

        # report progress
        if i%50==0:
            print(i)
    
    # combine all sets into arrays
    x = np.concatenate(x,axis=0)
    y = np.concatenate(y,axis=0)
        
    return x,y

In [55]:
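# keeping things small for now, to ensure my code base works before going in with everything
# select rows by their split flag so that the three sets respect the split made earlier
x_train,y_train = combine_model_data(transcript_df[transcript_df.train][:10])
x_val,y_val = combine_model_data(transcript_df[transcript_df.val][:5])
x_test,y_test = combine_model_data(transcript_df[transcript_df.test][:5])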

0
0
0


# 1 Building and Training the model

## 1.0 Specifying model architecture <a name='model-initialize'></a>