In [22]:
import pandas as pd

data = pd.read_csv("desert-island-discs-episodes.csv", sep="\t", on_bad_lines='warn')
df=data[['Castaway','Job','Book','Luxury','Artist 1','Song 1','Artist 2','Song 2','Artist 3','Song 3']]
# Before I set up the processing pipeline, let's do a quick check on the nulls in the dataset
def print_nulls(column_name):
    print(df[column_name].isnull().value_counts())

[print_nulls(i) for i in df.columns]

Castaway
False    3388
Name: count, dtype: int64
Job
True     3234
False     154
Name: count, dtype: int64
Book
False    2942
True      446
Name: count, dtype: int64
Luxury
False    3180
True      208
Name: count, dtype: int64
Artist 1
False    3192
True      196
Name: count, dtype: int64
Song 1
False    3331
True       57
Name: count, dtype: int64
Artist 2
False    3185
True      203
Name: count, dtype: int64
Song 2
False    3331
True       57
Name: count, dtype: int64
Artist 3
False    3195
True      193
Name: count, dtype: int64
Song 3
False    3331
True       57
Name: count, dtype: int64


Skipping line 35: expected 26 fields, saw 28
Skipping line 42: expected 26 fields, saw 28
Skipping line 45: expected 26 fields, saw 42
Skipping line 59: expected 26 fields, saw 42
Skipping line 2235: expected 26 fields, saw 40
Skipping line 3131: expected 26 fields, saw 28
Skipping line 3161: expected 26 fields, saw 28
Skipping line 3166: expected 26 fields, saw 30



[None, None, None, None, None, None, None, None, None, None]

First we want to clean the data and then vectorize the fields of interest. There's lots of nulls in the jobs column but this is only a nice-to-have anyway.
We'll leave the nulls in but for data cleanliness we'll call them not-specified.

In [23]:
# We have so little data I want to retain as much as possible so I'll replace nans in job with 'not-specified'
df['Job']= df['Job'].fillna('not-specified')
#We can drop the classics since they are duplicates
df= df.drop(df.loc[df['Castaway'].str.contains('Classic')].index, axis=0)
#The first row has loaded the csv strangely so we can throw it out
df.drop(0,axis=0,inplace=True)
# Drop nans now
df.dropna(axis=0, inplace=True)
df.reset_index(drop=True, inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['Job']= df['Job'].fillna('not-specified')


Now it's time to start classifying the podcasts:
1. Vectorize the songs and artists
2. Vectorize their professions, books and luxury items
3. Concatenate my two vectors
4. Query my vector space using a podcast embedded with the same model

In [24]:
# Prepare the data for vectorizing
# Group columns
df['musicChoice']= df['Artist 1'] + ' ' + df['Song 1'] + ' ' + df['Artist 2'] + ' ' + df['Song 2'] + ' '+ df['Artist 3'] + ' ' + df['Song 3']
df['extraChoices']= df['Job'] + ' ' + df['Book'] + ' ' + df['Luxury']

# Initialize vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
tf = TfidfVectorizer(analyzer = 'word', ngram_range = (1, 3), min_df = 0, stop_words = "english")

# Create vector embeddings for each choice
tf_idf1 = tf.fit_transform(df['extraChoices'])
tf_idf2 = tf.fit_transform(df['musicChoice'])
combi_tf_idf= tf.fit_transform(df['musicChoice']+df['extraChoices'])
cosine_sim= linear_kernel(combi_tf_idf,combi_tf_idf)


Now we have the first pass at some vectors let's take a look at recommended podcasts based on similarity scores. I want to see the top three podcasts based on music choice, extras choices and then I'll combine the vectors and see the result. 

In [25]:
# Define function for getting recommendations
indices = pd.Series(df.index, index=df['Castaway']).drop_duplicates() 

def get_recommendations(title, cosine_sim=cosine_sim):
    # Get the index of the Castaway chosen
    idx = indices[title]

    # Get the pairwsie similarity scores of all Castaways with that Castaway
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the podcasts based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 3 most similar Castaways - ignoring the 1st which will be the Castaway in question
    sim_scores = sim_scores[1:4]

    # Get the indices
    pod_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar podcasts
    return df.iloc[pod_indices]

Time to try it out...

In [26]:
get_recommendations('Helen Mirren')

Unnamed: 0,Castaway,Job,Book,Luxury,Artist 1,Song 1,Artist 2,Song 2,Artist 3,Song 3,musicChoice,extraChoices
1568,Delia Smith,not-specified,Autobiography by Saint Therese of Lisieux,Writing materials,Simon & Garfunkel,The Sound Of Silence,Benjamin Britten,Missa brevis in D major,Tomaso Albinoni & Remo Giazotto,Adagio in G Minor for Organ & Strings,Simon & Garfunkel The Sound Of Silence Benjami...,not-specified Autobiography by Saint Therese o...
1595,Malcolm Muggeridge,not-specified,Writings by Dr Samuel Johnson,Beehive,Wolfgang Amadeus Mozart,"Exsultate, jubilate",Tomaso Albinoni & Remo Giazotto,Adagio in G Minor for Organ & Strings,Antonín Dvořák,Cello Concerto in B Minor,"Wolfgang Amadeus Mozart Exsultate, jubilate To...",not-specified Writings by Dr Samuel Johnson Be...
1740,Anna Raeburn,not-specified,Daughter of the Earth by Agnes Smedley,Indian tea,Tomaso Albinoni & Remo Giazotto,Adagio in G Minor for Organ & Strings,Paul Robeson,Ballad for Americans,Erik Satie,Trois Gymnopédies No.1,Tomaso Albinoni & Remo Giazotto Adagio in G Mi...,not-specified Daughter of the Earth by Agnes S...


These look okay - let's compare with some others.

In [30]:
get_recommendations('Michael Holding')

Unnamed: 0,Castaway,Job,Book,Luxury,Artist 1,Song 1,Artist 2,Song 2,Artist 3,Song 3,musicChoice,extraChoices
245,Levi Roots,not-specified,Long Walk To Freedom by Nelson Mandela,A guitar,Stevie Wonder,Master Blaster (Jammin'),Bob Marley & The Wailers,War,Dionne Warwick,Walk On By,Stevie Wonder Master Blaster (Jammin') Bob Mar...,not-specified Long Walk To Freedom by Nelson M...
609,Lord Joffe,not-specified,A Long Walk to Freedom by Nelson Mandela,Wind-up radio,Sontonga,Nkosi sikelel' iAfrika,Bashir Am Israelim,Hava Nagila,Ludwig van Beethoven,Violin Concerto in D major - 3rd movement,Sontonga Nkosi sikelel' iAfrika Bashir Am Isra...,not-specified A Long Walk to Freedom by Nelson...
105,Dame Sue Campbell,Director Women's Football at the FA,The Long Walk to Freedom by Nelson Mandela,A photo album,John Farnham & Olivia Newton‐John,Dare to Dream,Gloria Estefan And *N SYNC,Music Of My Heart,Edward Elgar,Land Of Hope And Glory,John Farnham & Olivia Newton‐John Dare to Drea...,Director Women's Football at the FA The Long W...


In [31]:
get_recommendations('Robert Macfarlane')

Unnamed: 0,Castaway,Job,Book,Luxury,Artist 1,Song 1,Artist 2,Song 2,Artist 3,Song 3,musicChoice,extraChoices
1494,Stubby Kaye,not-specified,Diary,Hard hat,Nat King Cole,Nature Boy,Shirley Ross,Thanks for the Memory,David Rose and His Orchestra,The Stripper,Nat King Cole Nature Boy Shirley Ross Thanks f...,not-specified Diary Hard hat
803,Professor Baruch Blumberg,not-specified,Ulysses by James Joyce,A flat water kayak suitable for rough water,David Bowie,Space Oddity,The Mamas & the Papas,California Dreamin',Willie Nelson,City of New Orleans,David Bowie Space Oddity The Mamas & the Papas...,not-specified Ulysses by James Joyce A flat wa...
783,Mark Tully,not-specified,Major works by Gerard Manley Hopkins,Modern mini brewery,James Galway & The Chieftains,Londonderry Air (Danny Boy),Ustad Vilayat Khan & Ustad Bismillah Khan,Ragini Yamani/God Be With You,Johann Sebastian Bach,St Matthew Passion - O Haupt voll Blut,James Galway & The Chieftains Londonderry Air ...,not-specified Major works by Gerard Manley Hop...


The recommendations are good but not great. The suggestions appear strong based on the Castaway's choices but maybe I need more information about the interviewee for the model to really work well... I'll try and source a new dataset with more information. I know BBC include a summary of the Castaway on each podcast page.

## V2 Using scraped synopsis data ##

In [18]:
import pandas as pd
# Initialize vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
tf = TfidfVectorizer(analyzer = 'word', ngram_range = (1, 3), min_df = 0, stop_words = "english")
tf_idf3 = tf.fit_transform(df['synopsis'])
cosine_sim= linear_kernel(tf_idf3,tf_idf3)

df= pd.read_csv('synopsis_data.csv')
df=df[['synopsis','castaway']][:1587]
df['synopsis']=[df['synopsis'][x].replace("\n"," ").replace("Show less","").replace("[Taken from the original programme material for this archive edition of Desert Island Discs]","").strip().replace("   "," ").replace("  ","") for x in range(len(df['synopsis']))]
df.loc[df["castaway"].str.contains("Macfarlane")]

Unnamed: 0,synopsis,castaway
601,Robert Macfarlane is a writer whose books abou...,"Robert Macfarlane, writer"


In [19]:
# Define function for getting recommendations
indices = pd.Series(df.index, index=df['castaway'])

def get_recommendations(title, cosine_sim=cosine_sim):
    # Get the index of the Castaway chosen
    idx = indices[title]

    # Get the pairwsie similarity scores of all Castaways with that Castaway
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the podcasts based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 3 most similar Castaways - ignoring the 1st which will be the Castaway in question
    sim_scores = sim_scores[0:4]

    # Get the indices
    pod_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar podcasts
    out= df.iloc[pod_indices]
    out['scores']=sim_scores
    return out

In [20]:
get_recommendations('Robert Macfarlane, writer')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  out['scores']=sim_scores


Unnamed: 0,synopsis,castaway,scores
601,Robert Macfarlane is a writer whose books abou...,"Robert Macfarlane, writer","(601, 1.0000000000000058)"
611,Joanne Harris is a writer who is best known fo...,"Joanne Harris, writer","(611, 0.02948454029969765)"
652,"Richard E Grant was born in Swaziland, now Esw...","Richard E Grant, actor","(652, 0.028792092327949152)"
668,Robert Webb first reached a wide audience as t...,"Robert Webb, comedian","(668, 0.026510746331912513)"


## V3 new word embeddings ##
Clearly the new recommender isn't picking up much information from the castaway synopses. A TF-IDF vector embedding would match castaways who have similar words while ignoring context so I think this is probably why the recommendations are so poor. Let's try a word2vec embedding.

In [33]:
import pandas as pd
import numpy as np
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from sklearn.metrics.pairwise import linear_kernel
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.tokenize import RegexpTokenizer
import re
import string
import random
from PIL import Image
import requests
from io import BytesIO
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.metrics.pairwise import cosine_similarity
from gensim.models import Word2Vec
from gensim.models.phrases import Phrases, Phraser
from matplotlib import pyplot
from gensim.models import KeyedVectors

def _removeNonAscii(s):
    return "".join(i for i in s if  ord(i)<128)

def make_lower_case(text):
    return text.lower()

def remove_stop_words(text):
    text = text.split()
    stops = set(stopwords.words("english"))
    text = [w for w in text if not w in stops]
    text = " ".join(text)
    return text

def remove_html(text):
    html_pattern = re.compile('<.*?>')
    return html_pattern.sub(r'', text)

def remove_punctuation(text):
    tokenizer = RegexpTokenizer(r'\w+')
    text = tokenizer.tokenize(text)
    text = " ".join(text)
    return text

df= pd.read_csv('synopsis_data.csv')
df=df[['synopsis','castaway']][:1587]
df['synopsis']=[df['synopsis'][x].replace("\n"," ").replace("Show less","").replace("[Taken from the original programme material for this archive edition of Desert Island Discs]","").strip().replace("   "," ").replace("  ","") for x in range(len(df['synopsis']))]

df['cleaned'] = df['synopsis'].apply(_removeNonAscii)
df['cleaned'] = df.cleaned.apply(func = make_lower_case)
df['cleaned'] = df.cleaned.apply(func = remove_stop_words)
df['cleaned'] = df.cleaned.apply(func=remove_punctuation)
df['cleaned'] = df.cleaned.apply(func=remove_html)


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/douglasbudge/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [34]:
#splitting the description into words
corpus = []
for words in df['cleaned']:
    corpus.append(words.split())

# Load in Google model
EMBEDDING_FILE = 'GoogleNews-vectors-negative300.bin.gz'
google_word2vec = KeyedVectors.load_word2vec_format(EMBEDDING_FILE, binary=True)

# Training our corpus with Google Pretrained Model

google_model = Word2Vec(vector_size = 300, window=5, min_count = 2, workers = -1)
google_model.build_vocab(corpus)

#model.intersect_word2vec_format('./word2vec/GoogleNews-vectors-negative300.bin', lockf=1.0, binary=True)
#google_model.intersect_word2vec_format(EMBEDDING_FILE, lockf=1.0, binary=True)

google_model.train(corpus, total_examples=google_model.corpus_count, epochs = 5)

(0, 0)

In [35]:
# Generate the average word2vec for the each book description
def vectors(x):
    
    # Creating a list for storing the vectors (description into vectors)
    global word_embeddings
    word_embeddings = []

    # Reading the each book description 
    for line in df['cleaned']:
        avgword2vec = None
        count = 0
        for word in line.split():
            if word in google_model.wv.index_to_key:
                count += 1
                if avgword2vec is None:
                    avgword2vec = google_model.wv[word]
                else:
                    avgword2vec = avgword2vec + google_model.wv[word]
                
        if avgword2vec is not None:
            avgword2vec = avgword2vec / count
        
            word_embeddings.append(avgword2vec)


# Recommending the Top 5 similar books

def recommendations(title):
    
    # Calling the function vectors

    vectors(df)
    
    # finding cosine similarity for the vectors

    cosine_similarities = cosine_similarity(word_embeddings, word_embeddings)

    # taking the title and book image link and store in new data frame called books
    pods = df[['castaway', 'synopsis']]
    #Reverse mapping of the index
    indices = pd.Series(df.index, index = df['castaway']).drop_duplicates()
         
    idx = indices[title]
    sim_scores = list(enumerate(cosine_similarities[idx]))
    sim_scores = sorted(sim_scores, key = lambda x: x[1], reverse = True)
    sim_scores = sim_scores[1:6]
    book_indices = [i[0] for i in sim_scores]
    recommend = pods.iloc[book_indices]
    return recommend

In [36]:
recommendations('Claudia Rankine, poet')

Unnamed: 0,castaway,synopsis
563,"Bernardine Evaristo, writer",Bernardine Evaristo won the Booker Prize in 20...
614,"Neil Gaiman, writer",Neil Gaiman is a writer whose list of titles s...
611,"Joanne Harris, writer",Joanne Harris is a writer who is best known fo...
547,"Brian Cox, actor",Brian Cox CBE is a Scottish actor whose career...
612,"Dame Jo da Silva, engineer",Dame Jo da Silva is a structural engineer and ...


Much better, the attraction of the podcast is the castaways' stories. The tf-idf is effective at finding similar music and luxuries since it handles limited terms well but to pick up the context and meaning of the synopses we needed a more sophisticated vector model. I want the podcasts recommended based on the castaways so a word2vec model fine-tuned with my corpus gives pretty good suggestions.