# Group assignment (Podcast dataset)
#### Minor: Communication in the Digital Society
#### Course: CCS 2
#### Tutorial group 1
#### Tutorial teacher: Isa van Leeuwen
#### Group members: Ada Shi (13558846), Elise Serra (13649078), Kyra Bernard (13990284)

## Task 1: Explore and preprocess the data
### a. Explore the data

In [1]:
# Load packages & the podcast dataset to the jupyter notebook

import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
import spacy
nlp = spacy.load("en_core_web_sm")
import sklearn
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
!pip install pyLDAvis
import gensim
from gensim import corpora
from gensim import models
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis
import math

df_podcast = pd.read_csv("poddf.csv")
df_podcast.head()



| | index | Name | Rating_Volume | Rating | Genre | Description |
|---|---|---|---|---|---|---|
| 0 | 0 | Fresh Air | 10188 | 4.46133 | Arts | Fresh Air from WHYY, the Peabody Award-winning... |
| 1 | 0 | The Moth | 10154 | 4.69982 | Performing | Since its launch in 1997, The Moth has present... |
| 2 | 0 | 99% Invisible | 12303 | 4.8693 | Design | Design is everywhere in our lives, perhaps mos... |
| 3 | 0 | iFanboy.com Comic Book Podcast | 1335 | 4.79551 | Visual | The iFanboy.com Comic Book Podcast is a weekly... |
| 4 | 0 | Myths and Legends | 11128 | 4.88282 | Literature | Jason Weiser tells stories from myths, legends... |


In [2]:
# Check number of rows and columns

df_podcast.shape # there are 13632 rows and 6 columns

(13632, 6)

In [3]:
# Check columns

df_podcast.columns # there are 6 columns (index, Name, Rating_Volume, Rating, Genre, Description) in the original dataset

Index(['index', 'Name', 'Rating_Volume', 'Rating', 'Genre', 'Description'], dtype='object')

We will develop a knowledge-based recommender system that recommends podcasts mainly on the basis of the "Genre" column.

In [4]:
# Check all genres in the dataset

df_podcast["Genre"].unique() # some genres overlap with each other, we will adjust in the pre-processing part

array(['Arts', 'Performing', 'Design', 'Visual', 'Literature', 'Food',
       'Fashion & Beauty', 'Society & Culture', 'Gadgets', 'Business',
       'Investing', 'Management & Marketing', 'Business News', 'Careers',
       'Shopping', 'Self-Help', 'Alternative Health', 'Comedy',
       'Sports & Recreation', 'Education', 'K–12', 'Language Courses',
       'Higher Education', 'Educational Technology', 'Training',
       'Tech News', 'Automotive', 'Video Games', 'Hobbies',
       'Games & Hobbies', 'Other Games', 'Aviation', 'Places & Travel',
       'Government & Organizations', 'Non-Profit', 'National', 'Regional',
       'Local', 'News & Politics', 'Kinds & Family', 'Social Sciences',
       'Health', 'Sexuality', 'Philosophy', 'Fitness & Nutrition',
       'Music', 'TV & Film', 'Science & Medicine ', 'History',
       'Christianity', 'Spirituality', 'Buddhism',
       'Religion & Spirituality', 'Other', 'Hinduism', 'Judaism', 'Islam',
       'Natural Sciences', 'Medicine', 'Personal 

In [5]:
# Count each genre in the column

df_podcast["Genre"].value_counts() # there are 68 types of genres

Genre
Business News              249
Investing                  245
Comedy                     244
Tech News                  243
Places & Travel            242
                          ... 
Arts                        35
Games & Hobbies             23
Health                      16
Religion & Spirituality     11
Not Found                    1
Name: count, Length: 68, dtype: int64

In [6]:
# Check datatypes for all columns

df_podcast.dtypes # except for the column "index", the datatypes of the rest of the columns are all object 

#the datatypes of columns "Rating_Volume" and "Rating" are wrong ("Rating_Volume" should be int64 and "Rating" should be float64)

index             int64
Name             object
Rating_Volume    object
Rating           object
Genre            object
Description      object
dtype: object

### b. Pre-processing and feature engineering

Although the "Rating_Volume" and "Rating" columns will not be used as main input for our recommender system, we will sort the recommended podcasts following the order of the rating. Therefore, we will first correct the datatypes for these columns.

In [7]:
# Correction

## Replace "Not Found" values with NaN
df_podcast["Rating_Volume"] = df_podcast["Rating_Volume"].replace("Not Found", np.nan)
df_podcast["Rating"] = df_podcast["Rating"].replace("Not Found", np.nan)

## Change datatypes
df_podcast["Rating"] = df_podcast["Rating"].astype(float)
df_podcast["Rating_Volume"] = df_podcast["Rating_Volume"].astype("Int64") # convert NaN to nullable integer

##### Modification on "Rating_Volume" column
* We first tried to convert the datatype of "Rating_Volume" to int64 directly with the code below:
* df_podcast["Rating_Volume"] = df_podcast["Rating_Volume"].astype(int)
* However, it produced the error "ValueError: invalid literal for int() with base 10: 'Not Found'", so we replaced "Not Found" with NaN before converting

##### Modification on "Rating" column
* Similarly, we first tried to convert the datatype of "Rating" to float64 directly with the code below:
* df_podcast["Rating"] = df_podcast["Rating"].astype(float)
* The output showed the error "ValueError: could not convert string to float: 'Not Found'", which the same replacement resolves

In [8]:
df_podcast.dtypes # now the datatypes changed to [Rating_Volume] - "Int64" and [Rating] - "float64"

index              int64
Name              object
Rating_Volume      Int64
Rating           float64
Genre             object
Description       object
dtype: object

##### Explanation for the code that converts NaN to nullable integer
* After replacing "Not Found" with NaN, we wanted to convert "Rating_Volume" to int64 via the code below:
* df_podcast["Rating_Volume"] = df_podcast["Rating_Volume"].astype(int)
* However, it produced an error "ValueError: cannot convert NA to integer" because NaN is a float
* To solve this problem, instead of converting the values in the "Rating_Volume" column to int64, we converted them to the nullable integer type ("Int64"), a pandas datatype that can store both regular integers and missing values
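The behaviour described above can be reproduced on a small, made-up Series (the values below are illustrative, not a sample of the real column). This sketch uses `pd.to_numeric(..., errors="coerce")` as an alternative way to turn "Not Found" into a missing value; it is not the exact route taken in the notebook.

```python
import pandas as pd

# Toy stand-in for the "Rating_Volume" column (values are illustrative)
raw = pd.Series(["10188", "Not Found", "1335"])

# Coerce anything non-numeric to NaN; the result is float64 because NaN is a float
as_float = pd.to_numeric(raw, errors="coerce")

# A plain integer dtype cannot hold NaN, so this conversion fails
try:
    as_float.astype(int)
    plain_int_failed = False
except (ValueError, TypeError):
    plain_int_failed = True

# The nullable "Int64" dtype stores the missing entry as <NA> and keeps the rest as integers
nullable = as_float.astype("Int64")
print(nullable)
```

Note the capital "I": "Int64" is the nullable pandas extension type, while "int64" is the plain NumPy integer type that cannot represent missing values.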

In [9]:
# Check for missing values

df_podcast.isna().sum() # "Rating_Volume" and "Rating" each have 1887 missing values (the former "Not Found" entries, now NaN)

index               0
Name                0
Rating_Volume    1887
Rating           1887
Genre               0
Description         0
dtype: int64

Because podcasts differ widely in the number of ratings they received, the average rating alone is not a reliable basis for comparing them. We therefore improve the sorting by adapting the IMDB weighted-rating formula to our data.

In [10]:
# Compute weighted rating

C = df_podcast["Rating"].mean() # calculate the mean of all average rating values
m = df_podcast["Rating_Volume"].quantile(0.90) # m = minimum number of ratings required (90th percentile of "Rating_Volume")

def weighted_rating (x, m=m, C=C):
    v = x["Rating_Volume"] # v = number of ratings submitted
    R = x["Rating"] # R = average rating value
    return (v/(v+m)*R) + (m/(m+v)*C) # IMDB formula

df_podcast["Score"] = df_podcast.apply(weighted_rating, axis = 1)
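To see what the weighted rating does, here is a small worked example in plain Python. The values chosen for `C` and `m` are assumptions for illustration only, not the values computed from the dataset. The same formula is applied to two hypothetical podcasts that both have a perfect average rating but very different numbers of ratings.

```python
def weighted_rating_example(v, R, m, C):
    """IMDB-style weighted rating: blend a podcast's own average R with the global mean C.

    v = number of ratings, m = minimum-ratings threshold.
    """
    return (v / (v + m)) * R + (m / (v + m)) * C

C = 4.5  # assumed global mean rating (illustrative)
m = 900  # assumed threshold (illustrative)

few_votes = weighted_rating_example(v=10, R=5.0, m=m, C=C)      # pulled strongly towards C
many_votes = weighted_rating_example(v=10000, R=5.0, m=m, C=C)  # stays close to its own R

print(round(few_votes, 3), round(many_votes, 3))
```

A podcast with only a handful of ratings is pulled towards the global mean, while one with many ratings keeps a score close to its own average. This is why the score is a fairer sorting key than the raw rating.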

In [11]:
df_podcast.dtypes # the datatype for the new created column is incorrect, it should be "float64"

index              int64
Name              object
Rating_Volume      Int64
Rating           float64
Genre             object
Description       object
Score             object
dtype: object

In [12]:
# Correction

df_podcast["Score"] = pd.to_numeric(df_podcast["Score"], errors = "coerce") # replace non-numeric values with NaN and correct the datatype
df_podcast.dtypes

index              int64
Name              object
Rating_Volume      Int64
Rating           float64
Genre             object
Description       object
Score            float64
dtype: object

#### Pre-process the "Genre" column
We will first clean and standardize the genres.

In [13]:
genre_list = df_podcast["Genre"].tolist() # transform text-column into a list

genre_list_lower = [text.lower() for text in genre_list] # lowercasing

mystopwords = stopwords.words("english") # removing stopwords
genre_without_stopwords = [" ".join([w for w in text.split() if w not in mystopwords]) for text in genre_list_lower]

nlp = spacy.load("en_core_web_sm") 
lemmatized_genre = [" ".join([w.lemma_ for w in nlp(text)]) for text in genre_without_stopwords] # lemmatization

tokenizer = RegexpTokenizer(r'\w+')
genre_without_punctuations = [tokenizer.tokenize(text) for text in lemmatized_genre] # remove punctuations

#### Topic modelling of the "Genre" column
We'd like to combine similar genres with the help of topic modelling. 

Justification for setting k = 20 topics:
* Because topic modelling is non-deterministic, the choice of the number of topics is subjective and sensitive
* We combine a quantitative criterion (the topic-word matrix) with a qualitative one (researcher observation) to balance precision and recall
* First, we set k = 20 and let the algorithm propose candidate topics
* Second, we manually define at most 20 genres that will be used in our recommender system

In [14]:
# LDA implementation with raw bag-of-words counts (1)

raw_m1 = genre_without_punctuations
id2word_m1 = corpora.Dictionary(raw_m1)   # assign a token_id to each word
ldacorpus_m1 = [id2word_m1.doc2bow(text) for text in raw_m1] # bag-of-words corpus: a list of (token_id, count) tuples per document

lda_m1 = models.LdaModel(ldacorpus_m1, id2word=id2word_m1, num_topics=20) # fit the LDA model on the count corpus
lda_m1.print_topics()

# LDA implementation with tf-idf weights (2)

raw_m2 = genre_without_punctuations
id2word_m2 = corpora.Dictionary(raw_m2)   # assign a token_id to each word
ldacorpus_m2 = [id2word_m2.doc2bow(text) for text in raw_m2] # bag-of-words corpus: a list of (token_id, count) tuples per document

tfidfcorpus_m2 = models.TfidfModel(ldacorpus_m2) # train the tf-idf model on the corpus

lda_m2 = models.ldamodel.LdaModel(corpus=tfidfcorpus_m2[ldacorpus_m2],id2word=id2word_m2,num_topics=20) # fit the LDA model on the tf-idf weighted corpus
lda_m2.print_topics(num_words=5)

[(0,
  '0.486*"perform" + 0.024*"career" + 0.020*"business" + 0.017*"news" + 0.012*"fashion"'),
 (1,
  '0.227*"art" + 0.204*"perform" + 0.071*"business" + 0.061*"visual" + 0.050*"news"'),
 (2,
  '0.163*"literature" + 0.153*"perform" + 0.114*"marketing" + 0.114*"management" + 0.082*"career"'),
 (3,
  '0.985*"news" + 0.005*"business" + 0.001*"perform" + 0.001*"marketing" + 0.001*"management"'),
 (4,
  '0.491*"society" + 0.491*"culture" + 0.001*"food" + 0.001*"perform" + 0.001*"business"'),
 (5,
  '0.391*"business" + 0.254*"literature" + 0.144*"news" + 0.052*"food" + 0.048*"career"'),
 (6,
  '0.453*"management" + 0.453*"marketing" + 0.020*"perform" + 0.002*"beauty" + 0.002*"fashion"'),
 (7,
  '0.351*"visual" + 0.221*"fashion" + 0.221*"beauty" + 0.023*"perform" + 0.012*"invest"'),
 (8,
  '0.484*"help" + 0.484*"self" + 0.008*"perform" + 0.005*"art" + 0.001*"business"'),
 (9,
  '0.152*"fashion" + 0.152*"beauty" + 0.122*"design" + 0.093*"business" + 0.074*"career"'),
 (10,
  '0.663*"career" +

In [15]:
# Model evaluation

cm1 = models.CoherenceModel(model=lda_m1, corpus=ldacorpus_m1 , dictionary=id2word_m1, coherence='u_mass')  
ch1 = cm1.get_coherence()
cm2 = models.CoherenceModel(model=lda_m2, corpus=ldacorpus_m2, dictionary= id2word_m2, coherence='u_mass')  
ch2 = cm2.get_coherence()

print(f"Coherence of naive model = {ch1}\nCoherence of tfidf model = {ch2}")

## Based on the findings, we use the tf-idf model because its u_mass coherence is slightly higher (closer to 0), although the difference between the two models is very small

Coherence of naive model = -23.086067938776385
Coherence of tfidf model = -23.071693967925533


In [16]:
lda_m2.top_topics(tfidfcorpus_m2[ldacorpus_m2])

[([(0.48432368, 'help'),
   (0.48432297, 'self'),
   (0.008123085, 'perform'),
   (0.0047647874, 'art'),
   (0.0010685693, 'business'),
   (0.00086257956, 'news'),
   (0.0004010741, 'design'),
   (0.00020951968, 'visual'),
   (0.00020951967, 'literature'),
   (0.00020951967, 'food'),
   (0.00020951963, 'career'),
   (0.00020951961, 'fashion'),
   (0.00020951961, 'beauty'),
   (0.00020951958, 'invest'),
   (0.00020951957, 'management'),
   (0.00020951957, 'marketing'),
   (0.00020951955, 'tv'),
   (0.00020951955, 'music'),
   (0.00020951955, 'medicine'),
   (0.00020951955, 'film')],
  -22.9021808176295),
 ([(0.98509544, 'news'),
   (0.0051441547, 'business'),
   (0.0005595381, 'perform'),
   (0.00053274765, 'marketing'),
   (0.0005327472, 'management'),
   (0.000311862, 'fashion'),
   (0.0003118607, 'beauty'),
   (0.00024716486, 'invest'),
   (0.00024716475, 'visual'),
   (0.0002471637, 'literature'),
   (0.00011283144, 'politic'),
   (0.00010718496, 'tech'),
   (9.097737e-05, 'food'),


In [17]:
# Visualization

vis_data = gensimvis.prepare(lda_m2,ldacorpus_m2,id2word_m2)
pyLDAvis.display(vis_data)

According to the figure shown above, many topics overlap with each other. Because the topic modelling is based solely on the "Genre" column, there is not enough text for the model to learn from. The result is therefore not informative, and we decided to define the broader genres manually.

In [18]:
# Replace the column with clean strings

df_podcast["genre"] = [", ".join(tokens) for tokens in genre_without_punctuations] # join each token list into one clean string, e.g. "fashion, beauty"

In [42]:
# Manually compute broader categories

def categorize(string):
    if any(word in string for word in ["news", "national", "regional", "local", "politic"]):
        return "News"
    elif any(word in string for word in ["game"]):
        return "Game"
    elif any(word in string for word in ["science", "philosophy"]):
        return "Science"
    elif any(word in string for word in ["education", "course", "school", "language", "train"]):
        return "Education"
    elif any(word in string for word in ["spirituality", "religion", "buddhism", "hinduism", "judaism", "christianity", "islam"]):
        return "Religion"
    elif any(word in string for word in ["aviation", "tech", "technology", "automotive", "software"]):
        return "Technology"
    elif any(word in string for word in ["business", "management", "professional", "marketing", "invest", "career"]):
        return "Business"
    elif any(word in string for word in ["health", "help", "food", "sport", "outdoor", "medicine", "nutrition"]):
        return "Health"
    elif any(word in string for word in ["art", "perform", "design", "visual", "comedy", "fashion", "music", "film", "literature"]):
        return "Art"
    elif any(word in string for word in ["place", "sexuality", "society", "social", "family", "history"]):
        return "Society"
    else:
        return "Others"
    
df_podcast["Interest"] = df_podcast["genre"].apply(categorize)

In [43]:
df_podcast["Interest"].unique() # there are 11 large categories

array(['Art', 'Health', 'Society', 'Others', 'Business', 'News',
       'Education', 'Technology', 'Game', 'Science', 'Religion'],
      dtype=object)
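The keyword rules in `categorize` can also be written as an ordered table of (category, keywords) pairs, which makes the rule order explicit and easier to change. The sketch below is an equivalent restatement of the same rules under a hypothetical name, `categorize_by_rules`; it is not a different classification method. The order matters because the first matching rule wins (for example, "tech news" becomes "News", not "Technology").

```python
# Ordered rules: the first category whose keyword occurs in the string wins
RULES = [
    ("News", ["news", "national", "regional", "local", "politic"]),
    ("Game", ["game"]),
    ("Science", ["science", "philosophy"]),
    ("Education", ["education", "course", "school", "language", "train"]),
    ("Religion", ["spirituality", "religion", "buddhism", "hinduism", "judaism", "christianity", "islam"]),
    ("Technology", ["aviation", "tech", "technology", "automotive", "software"]),
    ("Business", ["business", "management", "professional", "marketing", "invest", "career"]),
    ("Health", ["health", "help", "food", "sport", "outdoor", "medicine", "nutrition"]),
    ("Art", ["art", "perform", "design", "visual", "comedy", "fashion", "music", "film", "literature"]),
    ("Society", ["place", "sexuality", "society", "social", "family", "history"]),
]

def categorize_by_rules(string, rules=RULES, default="Others"):
    """Return the first category with a keyword that is a substring of the genre string."""
    for category, keywords in rules:
        if any(word in string for word in keywords):
            return category
    return default

print(categorize_by_rules("tech news"))        # News
print(categorize_by_rules("social science"))   # Science
print(categorize_by_rules("fashion, beauty"))  # Art
```

Because the matching is substring-based, a short keyword such as "art" also matches longer words that contain it, so the resulting "Interest" values are worth spot-checking per genre.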

## Task 2.   Create a knowledge-based recommender system

In [63]:
def knowledge_based_recommender(df_podcast):

    df_podcast = df_podcast[df_podcast['Interest'].notna()].copy() # filter missing values (the copy avoids a SettingWithCopyWarning)
    df_podcast['Interest'] = df_podcast['Interest'].str.lower() # lowercasing

    print(f"What type of content are you interested in? \n\nYou can choose from the following:\n\n{set(df_podcast['Interest'])}")
    interest = input().lower()
    
    df_podcast = df_podcast[df_podcast['Genre'].notna()].copy() # filter missing values
    df_podcast['Genre'] = df_podcast['Genre'].str.lower() # lowercasing
    
    # every interest is handled in the same way, so a single branch covers all categories
    if interest not in set(df_podcast['Interest']):
        return None # unknown interest

    print(f"What specific genre do you like?\n\nYou can choose from the following:\n\n{df_podcast.loc[df_podcast['Interest']==interest, 'Genre'].unique()}")
    genre = input().lower()
    genres = df_podcast[(df_podcast['Genre'] == genre)]
    recommend_podcasts = genres.sort_values('Score', ascending=False) # highest weighted score first
    return recommend_podcasts[['Name', 'Genre', 'Score']].head(5)
    

Our knowledge-based recommender system has two levels:
1. We ask for the personal interest of the user ("Interest" column)
2. Under each category of personal interest, we filter the relevant genres and ask for the user's favorite one ("Genre" column)
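The two levels above can also be sketched as a non-interactive helper, which is easier to test than a function that calls `input()`. The function name `recommend`, its parameters and the toy data are assumptions made for illustration. Unlike the interactive version, this sketch also requires the chosen genre to belong to the chosen interest.

```python
import pandas as pd

def recommend(df, interest, genre, n=5):
    """Return the top-n podcasts for an (interest, genre) pair, highest Score first."""
    interest, genre = interest.strip().lower(), genre.strip().lower()
    # Level 1: keep only rows in the chosen interest category
    in_interest = df["Interest"].str.lower() == interest
    # Level 2: within that category, keep only the chosen genre
    in_genre = df["Genre"].str.lower() == genre
    matches = df[in_interest & in_genre]
    # NaN scores are sorted last by default, so unrated podcasts end up at the bottom
    ranked = matches.sort_values("Score", ascending=False)
    return ranked[["Name", "Genre", "Score"]].head(n)

# Toy data, made up for illustration (not taken from poddf.csv)
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Genre": ["Comedy", "Comedy", "Music", "Comedy"],
    "Interest": ["Art", "Art", "Art", "Art"],
    "Score": [4.2, 4.8, 4.9, float("nan")],
})

top = recommend(toy, "art", "comedy", n=2)
print(top)
```

An interest or genre that does not occur in the data simply yields an empty DataFrame, so the caller can check `.empty` instead of handling `None`.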

In [None]:
knowledge_based_recommender(df_podcast)

What type of content are you interested in? 

You can choose from the following:

{'health', 'society', 'science', 'technology', 'news', 'art', 'game', 'business', 'education', 'religion', 'others'}
