In [1]:
import pandas as pd

data = pd.read_csv("desert-island-discs-episodes.csv", sep="\t", on_bad_lines='warn')
df=data[['Castaway','Job','Book','Luxury','Artist 1','Song 1','Artist 2','Song 2','Artist 3','Song 3']]
# Before I set up the processing pipeline, let's do a quick check on the nulls in the dataset
def print_nulls(column_name):
    print(df[column_name].isnull().value_counts())

[print_nulls(i) for i in df.columns]

Castaway
False    3388
Name: count, dtype: int64
Job
True     3234
False     154
Name: count, dtype: int64
Book
False    2942
True      446
Name: count, dtype: int64
Luxury
False    3180
True      208
Name: count, dtype: int64
Artist 1
False    3192
True      196
Name: count, dtype: int64
Song 1
False    3331
True       57
Name: count, dtype: int64
Artist 2
False    3185
True      203
Name: count, dtype: int64
Song 2
False    3331
True       57
Name: count, dtype: int64
Artist 3
False    3195
True      193
Name: count, dtype: int64
Song 3
False    3331
True       57
Name: count, dtype: int64


Skipping line 35: expected 26 fields, saw 28
Skipping line 42: expected 26 fields, saw 28
Skipping line 45: expected 26 fields, saw 42
Skipping line 59: expected 26 fields, saw 42
Skipping line 2235: expected 26 fields, saw 40
Skipping line 3131: expected 26 fields, saw 28
Skipping line 3161: expected 26 fields, saw 28
Skipping line 3166: expected 26 fields, saw 30



[None, None, None, None, None, None, None, None, None, None]

First we want to clean the data and then vectorize the fields of interest. There are lots of nulls in the Job column, but that field is only a nice-to-have anyway.
We'll keep those rows and, for data cleanliness, label the missing jobs 'not-specified'.
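
As an aside, the per-column null check can be collapsed into a single call. The sketch below runs on a made-up toy frame (the values are placeholders, not rows from the dataset) and only assumes the standard pandas `isnull().sum()` idiom:

```python
import pandas as pd

# Toy stand-in for the episodes table; the values are invented for illustration
toy = pd.DataFrame({
    "Castaway": ["A", "B", "C"],
    "Job": [None, "chef", None],
    "Book": ["X", None, "Z"],
})

# One Series with the number of nulls per column
null_counts = toy.isnull().sum()
print(null_counts)
```

This gives one row per column instead of a True/False breakdown for each, which is easier to scan when there are many columns.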

In [2]:
# We have so little data that I want to retain as much as possible, so I'll replace NaNs in Job with 'not-specified'.
# df is a slice of data, so this assignment triggers the SettingWithCopyWarning shown below;
# building df with data[[...]].copy() would avoid it.
df['Job'] = df['Job'].fillna('not-specified')
# We can drop the 'Classic' entries since they are duplicates
df = df.drop(df.loc[df['Castaway'].str.contains('Classic')].index, axis=0)
# The first row of the CSV loaded strangely, so we can throw it out
df.drop(0, axis=0, inplace=True)
# Drop any rows that still contain NaNs
df.dropna(axis=0, inplace=True)

df.reset_index(drop=True, inplace=True)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['Job']= df['Job'].fillna('not-specified')
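
The warning appears because `df` was selected from `data` and then modified. A minimal sketch of one way to avoid it, assuming an explicit `.copy()` at selection time is acceptable; the toy frame and the `subset` name are hypothetical:

```python
import pandas as pd

# Toy stand-in for the loaded CSV; the values are invented for illustration
data = pd.DataFrame({
    "Castaway": ["A", "B", "C"],
    "Job": ["actor", None, None],
    "Other": [1, 2, 3],
})

# .copy() makes the column subset an independent DataFrame, so the later
# assignment is unambiguous and pandas has no reason to warn
subset = data[["Castaway", "Job"]].copy()
subset["Job"] = subset["Job"].fillna("not-specified")

print(subset["Job"].tolist())
```

The original `data` frame is left untouched, which also makes it easy to re-run the cleaning cell from scratch.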


Now it's time to start classifying the podcasts. To do this, I plan to:
1. Vectorize the songs and artists
2. Vectorize their professions, books and luxury items
3. Concatenate my two vectors
4. Query my vector space using a podcast embedded with the same model

NOTE: If this gives poor recommendations, I'll have to add more context to the features. I could map synopses of the books, genres for the songs, and perhaps embed the Wikipedia article of each castaway.
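
To make step 3 concrete, here is a rough sketch of concatenating two separately fitted TF-IDF matrices with `scipy.sparse.hstack`. It is only one possible reading of the plan: the toy rows are invented, and using a separate `TfidfVectorizer` per field is an assumption on my part.

```python
import pandas as pd
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy rows standing in for the grouped text columns (made-up values)
toy = pd.DataFrame({
    "musicChoice": [
        "Mozart Requiem Bach Toccata",
        "Mozart Requiem Beatles Yesterday",
        "Beatles Yesterday Queen Bohemian Rhapsody",
    ],
    "extraChoices": [
        "actor War and Peace piano",
        "chef War and Peace teapot",
        "singer Ulysses guitar",
    ],
})

# Fit one vectorizer per field so each keeps its own vocabulary
music_vec = TfidfVectorizer().fit_transform(toy["musicChoice"])
extras_vec = TfidfVectorizer().fit_transform(toy["extraChoices"])

# Stack the two matrices side by side: one row per castaway,
# music columns followed by extras columns
combined = hstack([music_vec, extras_vec]).tocsr()

# Pairwise cosine similarity over the combined feature space
sim = cosine_similarity(combined)
print(sim.round(2))
```

Keeping the two blocks separate would also make it possible to weight music against extras later, for example by scaling one matrix before stacking.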

In [3]:
# Prepare the data for vectorizing

# Group columns
df['musicChoice']= df['Artist 1'] + ' ' + df['Song 1'] + ' ' + df['Artist 2'] + ' ' + df['Song 2'] + ' '+ df['Artist 3'] + ' ' + df['Song 3']
df['extraChoices']= df['Job'] + ' ' + df['Book'] + ' ' + df['Luxury']

In [4]:
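# Initialize vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
# min_df=1 (the default) keeps every term; recent scikit-learn versions reject an integer min_df of 0
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=1, stop_words='english')

# Create vector embeddings for each choice
# Note: each fit_transform call refits tf, so its vocabulary ends up matching only the last call
tf_idf1 = tf.fit_transform(df['extraChoices'])
tf_idf2 = tf.fit_transform(df['musicChoice'])
# Join the two text fields with a space so the last music word and the first extras word stay separate tokens
combi_tf_idf = tf.fit_transform(df['musicChoice'] + ' ' + df['extraChoices'])
# TF-IDF rows are L2-normalized by default, so the linear kernel equals cosine similarity
cosine_sim = linear_kernel(combi_tf_idf, combi_tf_idf)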


Now that we have a first pass at some vectors, let's look at recommended podcasts based on similarity scores. I want to see the top three podcasts based on music choice and on extra choices, and then combine the vectors and see the result.

In [5]:
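# Define function for getting recommendations
# Map each Castaway name to its row position; keep the first entry when a name appears more than once
indices = pd.Series(df.index, index=df['Castaway'])
indices = indices[~indices.index.duplicated(keep='first')]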

def get_recommendations(title, cosine_sim=cosine_sim):
    # Get the index of the Castaway chosen
    idx = indices[title]

    # Get the pairwise similarity scores of all Castaways with that Castaway
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the podcasts based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 3 most similar Castaways - ignoring the 1st which will be the Castaway in question
    sim_scores = sim_scores[1:4]

    # Get the indices
    pod_indices = [i[0] for i in sim_scores]

    # Return the 3 most similar podcasts
    return df.iloc[pod_indices]
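
One caveat with skipping the first sorted entry: it assumes the chosen Castaway always sorts first. If another row has identical choices (similarity 1.0), the tie can put that row first instead. Below is a small sketch of an alternative that excludes the query row by index; `top_k_similar` is a hypothetical helper and the matrix is made up.

```python
import numpy as np

def top_k_similar(sim_matrix, idx, k=3):
    """Return the positions of the k rows most similar to row idx, excluding idx itself."""
    scores = np.asarray(sim_matrix[idx], dtype=float).copy()
    # Exclude the query row explicitly instead of assuming it sorts first
    scores[idx] = -np.inf
    # Stable sort on the negated scores gives descending order with ties kept in row order
    order = np.argsort(-scores, kind="stable")
    return order[:k].tolist()

# Made-up similarity matrix; rows 2 and 3 are exact duplicates of each other
sim = np.array([
    [1.0, 0.2, 0.9, 0.5],
    [0.2, 1.0, 0.1, 0.3],
    [0.9, 0.1, 1.0, 1.0],
    [0.5, 0.3, 1.0, 1.0],
])

print(top_k_similar(sim, 0, k=2))
print(top_k_similar(sim, 3, k=1))
```

For row 3, the sorted-and-slice approach would skip row 2 and return row 3 itself, whereas the masked version returns row 2.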

Time to try it out...

In [6]:
get_recommendations('Helen Mirren')

Unnamed: 0,Castaway,Job,Book,Luxury,Artist 1,Song 1,Artist 2,Song 2,Artist 3,Song 3,musicChoice,extraChoices
1568,Delia Smith,not-specified,Autobiography by Saint Therese of Lisieux,Writing materials,Simon & Garfunkel,The Sound Of Silence,Benjamin Britten,Missa brevis in D major,Tomaso Albinoni & Remo Giazotto,Adagio in G Minor for Organ & Strings,Simon & Garfunkel The Sound Of Silence Benjami...,not-specified Autobiography by Saint Therese o...
1595,Malcolm Muggeridge,not-specified,Writings by Dr Samuel Johnson,Beehive,Wolfgang Amadeus Mozart,"Exsultate, jubilate",Tomaso Albinoni & Remo Giazotto,Adagio in G Minor for Organ & Strings,Antonín Dvořák,Cello Concerto in B Minor,"Wolfgang Amadeus Mozart Exsultate, jubilate To...",not-specified Writings by Dr Samuel Johnson Be...
1740,Anna Raeburn,not-specified,Daughter of the Earth by Agnes Smedley,Indian tea,Tomaso Albinoni & Remo Giazotto,Adagio in G Minor for Organ & Strings,Paul Robeson,Ballad for Americans,Erik Satie,Trois Gymnopédies No.1,Tomaso Albinoni & Remo Giazotto Adagio in G Mi...,not-specified Daughter of the Earth by Agnes S...


These look pretty close. Let's compare them with Helen Mirren's own choices.

In [7]:
df.loc[df['Castaway']=='Helen Mirren']

Unnamed: 0,Castaway,Job,Book,Luxury,Artist 1,Song 1,Artist 2,Song 2,Artist 3,Song 3,musicChoice,extraChoices
1546,Helen Mirren,not-specified,The Bhagavad-Gita,Silk underwear,Max Bruch,Violin Concerto No. 1 in G Minor,Musical Youth,Pass The Dutchie,Tomaso Albinoni & Remo Giazotto,Adagio in G Minor for Organ & Strings,Max Bruch Violin Concerto No. 1 in G Minor Mus...,not-specified The Bhagavad-Gita Silk underwear


The recommendations are good but not great. The suggestions appear strong based on the Castaway's choices, but maybe I need more information about the interviewee for the model to work really well. I'll try to source a new dataset with more information; I know the BBC includes a summary of the Castaway on each podcast page.
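
To see why a particular match was suggested, it can help to list the terms two castaways have in common. This is a rough sketch rather than part of the pipeline: it uses the music choices from the Helen Mirren and Delia Smith rows above, looks only at single words (not the 1-3 word n-grams used earlier), and `shared_terms` is a hypothetical helper.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Music choices copied from the Helen Mirren and Delia Smith rows shown above
helen = ("Max Bruch Violin Concerto No. 1 in G Minor Musical Youth Pass The Dutchie "
         "Tomaso Albinoni & Remo Giazotto Adagio in G Minor for Organ & Strings")
delia = ("Simon & Garfunkel The Sound Of Silence Benjamin Britten Missa brevis in D major "
         "Tomaso Albinoni & Remo Giazotto Adagio in G Minor for Organ & Strings")

# build_analyzer() returns the tokenizing function (lowercasing, stop-word removal)
# without needing a fitted vocabulary
analyzer = TfidfVectorizer(analyzer="word", stop_words="english").build_analyzer()

def shared_terms(text_a, text_b):
    """Return the sorted set of tokens that appear in both texts."""
    return sorted(set(analyzer(text_a)) & set(analyzer(text_b)))

shared = shared_terms(helen, delia)
print(shared)
```

If the overlap comes down to a single shared piece of music, that supports the idea that richer context about each castaway is needed.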