# Group assignment (Podcast dataset)
#### Minor: Communication in the Digital Society
#### Course: CCS 2
#### Tutorial group 1
#### Tutorial teacher: Isa van Leeuwen
#### Group members: Ada Shi (13558846), Elise Serra (13649078), Kyra Bernard (13990284)

## Code for exploratory data analysis
#### In this part, we will explore the dataset by:
1. checking all columns and their datatypes and checking for missing values
2. following the pre-processing steps learned in the first week of this course (e.g. lowercasing, tokenization, stop word removal/pruning, lemmatization, and N-grams)
3. conducting an inductive analysis by adopting techniques such as CountVectorizer/TfidfVectorizer and topic modelling

#### 1. Data exploration (learned in CCS 1):

In [1]:
# Load packages and the podcast dataset into the Jupyter notebook

!pip install pyLDAvis

import random
from glob import glob
from string import punctuation

import numpy as np
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
import spacy

import sklearn
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

import gensim
from gensim import corpora
from gensim import models
import gensim.downloader as api
from gensim.utils import simple_preprocess
from gensim.models import TfidfModel
from gensim.similarities import SparseTermSimilarityMatrix
from gensim.similarities import WordEmbeddingSimilarityIndex

import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

from dateutil import parser
from tqdm import tqdm

# Fix the random seeds for reproducibility
random.seed(2022)
np.random.seed(2022)

df_podcast = pd.read_csv("poddf.csv")



In [2]:
# Check number of rows and columns
df_podcast.shape # there are 13632 rows and 6 columns

(13632, 6)

In [3]:
# Check columns

df_podcast.columns # there are 6 columns (index, Name, Rating_Volume, Rating, Genre, Description) in the original dataset

Index(['index', 'Name', 'Rating_Volume', 'Rating', 'Genre', 'Description'], dtype='object')

In [4]:
# Check datatypes for all columns

df_podcast.dtypes # except for the column "index", the datatypes of the rest of the columns are all object 

#the datatypes of columns "Rating_Volume" and "Rating" are wrong ("Rating_Volume" should be int64 and "Rating" should be float64)

index             int64
Name             object
Rating_Volume    object
Rating           object
Genre            object
Description      object
dtype: object

#### (This part is not important for this assignment but good for personal understanding)

##### Modification on "Rating_Volume" column
* We wanted to convert the datatype of "Rating_Volume" to int64 via the code as shown below:
* df_podcast["Rating_Volume"] = df_podcast["Rating_Volume"].astype(int)
* However, it produced an error "ValueError: invalid literal for int() with base 10: 'Not Found'"

##### Modification on "Rating" column
* Similarly, we wanted to convert the datatype of "Rating" to float64 via the code as shown below:
* df_podcast["Rating"] = df_podcast["Rating"].astype(float)
* the output showed an error "ValueError: could not convert string to float: 'Not Found'"

In [5]:
# Correction

## Replace "Not Found" values with NaN
df_podcast["Rating_Volume"] = df_podcast["Rating_Volume"].replace("Not Found", np.nan)
df_podcast["Rating"] = df_podcast["Rating"].replace("Not Found", np.nan)

## Change datatypes
df_podcast["Rating"] = df_podcast["Rating"].astype(float)
df_podcast["Rating_Volume"] = df_podcast["Rating_Volume"].astype("Int64") # nullable integer datatype, which can hold missing values


In [6]:
df_podcast.dtypes # now the datatypes changed to [Rating_Volume] - "Int64" and [Rating] - "float64"

index              int64
Name              object
Rating_Volume      Int64
Rating           float64
Genre             object
Description       object
dtype: object

#### (This part is not important for this assignment but good for personal understanding)

##### Explanation for the code that converts NaN to nullable integer
* After replacing "Not Found" with NaN, we wanted to convert "Rating_Volume" to int64 via the code below:
* df_podcast["Rating_Volume"] = df_podcast["Rating_Volume"].astype(int)
* However, it produced an error "ValueError: cannot convert NA to integer" because NaN is a float and the regular int64 datatype cannot store missing values
* To solve this problem, instead of converting the column "Rating_Volume" to int64, we converted it to the nullable integer type ("Int64"), a pandas datatype that can store both regular integers and missing values
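The two-step conversion described above (replace "Not Found", then cast) can also be written with `pd.to_numeric`. The following is a minimal sketch on a made-up three-row series, not the notebook's data; the variable names are hypothetical.

```python
import pandas as pd

# Toy stand-in for the "Rating_Volume" column (values are made up)
raw = pd.Series(["120", "Not Found", "45"], name="Rating_Volume")

# errors="coerce" turns anything that cannot be parsed (here "Not Found") into NaN
as_float = pd.to_numeric(raw, errors="coerce")

# "Int64" (capital I) is the pandas nullable integer datatype; NaN becomes <NA>
as_nullable_int = as_float.astype("Int64")

print(as_nullable_int.dtype)
print(as_nullable_int.isna().sum())
```

This should behave like the `replace` + `astype` approach for this dataset, with the difference that any other non-numeric string would also be turned into a missing value instead of raising an error.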

In [7]:
# Check for missing values

df_podcast.isna().sum() # there are 1887 missing values in each of the columns "Rating_Volume" and "Rating" (the former "Not Found" values, now NaN)

index               0
Name                0
Rating_Volume    1887
Rating           1887
Genre               0
Description         0
dtype: int64

In [8]:
df_podcast["Description"] # there are 13632 descriptions (index 0 to 13631), and they are all non-numeric

0        Fresh Air from WHYY, the Peabody Award-winning...
1        Since its launch in 1997, The Moth has present...
2        Design is everywhere in our lives, perhaps mos...
3        The iFanboy.com Comic Book Podcast is a weekly...
4        Jason Weiser tells stories from myths, legends...
                               ...                        
13627    Puromac es una conversación sobre todo el mund...
13628    AVexcel is your guide to the best in home thea...
13629    Stay current with IT news on vendor moves, new...
13630    AVexcel is your guide to the best in home thea...
13631    Stay current with IT news on vendor moves, new...
Name: Description, Length: 13632, dtype: object

#### Specification:
* For this assignment, we are most interested in the last column, "Description", which has the datatype object (nothing needs to change at this stage because the datatype is correct and there are no missing values)
* To develop our own recommender system, we first transform this column into a list and then pre-process the data with the techniques learned in week 1 (e.g. lowercasing, stopword removal, lemmatization/stemming, tokenization)


#### 2. Pre-processing (learned in CCS 2, week 1):

In [9]:
# Transform text-column into a list

text_list = df_podcast["Description"].tolist()

In [10]:
# Lowercasing

text_list_lower = [text.lower() for text in text_list]

In [11]:
# Removing stopwords

mystopwords = stopwords.words("english")
text_without_stopwords = [" ".join([w for w in text.split() if w not in mystopwords]) for text in text_list_lower]

In [12]:
# Lemmatization

nlp = spacy.load("en_core_web_sm")
lemmatized_text = [" ".join([w.lemma_ for w in nlp(text)]) for text in text_without_stopwords]

In [13]:
# Tokenize and remove punctuation

tokenizer = RegexpTokenizer(r'\w+')
text_without_punctuations = [tokenizer.tokenize(text) for text in lemmatized_text]

In [14]:
# Remove "s"

tokens_list = [([w for w in text if w != "s"]) for text in text_without_punctuations]

#### Specification:
* The pre-processing follows the order shown above because we want to end up with standard tokens without punctuation
* After lowercasing, we remove stopwords, which are not informative
* Lemmatization then transforms each token into its dictionary form (lemma)
* After lemmatization, contractions appear in split form (e.g. it's -> it 's), so we separate the pruning process into 2 steps (removing stopwords and removing punctuation) and remove punctuation after lemmatization
* In the final pre-processing step, we remove the token "s" because we observed many occurrences of "s" in the output, which are not informative
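To illustrate why the order of the steps matters, here is a simplified, dependency-free sketch of the pipeline. It assumes a tiny made-up stopword set instead of the NLTK list and skips lemmatization entirely, so it is not the notebook's pipeline; `preprocess` and `TOY_STOPWORDS` are hypothetical names.

```python
import re

# Tiny illustrative stopword set (the notebook uses NLTK's full English list)
TOY_STOPWORDS = {"the", "a"}

def preprocess(text, stopwords=TOY_STOPWORDS):
    lowered = text.lower()
    # Step 1: whitespace split + stopword removal; "show's" is kept as one unit
    kept = [w for w in lowered.split() if w not in stopwords]
    # Step 2: \w+ tokenization drops punctuation and splits "show's" into "show" and "s"
    tokens = re.findall(r"\w+", " ".join(kept))
    # Step 3: drop the leftover "s" tokens
    return [t for t in tokens if t != "s"]

print(preprocess("The show's host tells a story."))
```

The stray "s" only appears once punctuation is stripped, which is why it has to be removed as a separate, final step.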

#### 3. Inductive analysis:
* Inspect the similarity of words between podcasts
* Topic modelling

##### Part 1: Checking the similarity between podcasts

In [15]:
# Preparation for TfidfVectorizer in a sparse format

flattened_tokens = [token for sublist in tokens_list for token in sublist] # convert the list of lists into one flat list of tokens
Vec = TfidfVectorizer(max_df = .75, min_df = 2) # initialize the vectorizer (terms that occur in more than 75% of documents or in fewer than 2 documents are removed)
Vec_fit = Vec.fit_transform(flattened_tokens) # fit the vectorizer and transform in one go; note that each token in the flat list is treated as its own document


In [16]:
# Inspect the shape of the sparse matrix

Vec_fit.shape # (rows = 484494 individual tokens, each treated as a document; columns = 17638 counted terms)

(484494, 17638)
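Because the flat token list is passed to the vectorizer, the matrix above has one row per token rather than one row per podcast. A minimal sketch of a document-level alternative, assuming made-up toy data in place of `tokens_list`, joins each podcast's tokens back into a single string first:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in for tokens_list: one list of tokens per podcast (made-up data)
toy_tokens_list = [
    ["history", "war", "story"],
    ["history", "design", "story"],
    ["tech", "news", "design"],
]

# Join each podcast's tokens into one string so that one row = one podcast
toy_docs = [" ".join(tokens) for tokens in toy_tokens_list]

toy_vec = TfidfVectorizer(min_df=2)  # keep terms that occur in at least 2 documents
toy_matrix = toy_vec.fit_transform(toy_docs)

print(toy_matrix.shape)              # rows = podcasts, columns = kept terms
print(sorted(toy_vec.vocabulary_))
```

With this layout the number of rows equals the number of podcasts, so the sparsity describes how vocabulary is spread across descriptions.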

In [17]:
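# Descriptions of the sparse matrix

# Vec_fit.sum() adds up the tf-idf weights; it equals the number of non-zero elements here only because
# every row holds at most one term with weight 1.0 (Vec_fit.nnz would give the count directly)
print("Number of non-zero elements:", Vec_fit.sum())
print("Total number of elements:", Vec_fit.shape[0] * Vec_fit.shape[1])

# compute the sparsity of the matrix: the proportion of zero elements in the matrix
print("Sparsity:", 1 - Vec_fit.sum() / (Vec_fit.shape[0] * Vec_fit.shape[1])) # the sparsity is very high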

Number of non-zero elements: 456622.0
Total number of elements: 8545505172
Sparsity: 0.9999465658272028
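In general, the number of stored non-zero entries of a SciPy sparse matrix is available as its `nnz` attribute, whereas `sum()` adds up the values. The sketch below uses a small made-up matrix to show the difference; it is an illustration, not a recomputation of the figures above.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Made-up 3x3 matrix with three non-zero weights
toy = csr_matrix(np.array([
    [0.0, 0.5, 0.0],
    [0.7, 0.0, 0.7],
    [0.0, 0.0, 0.0],
]))

n_nonzero = toy.nnz                       # count of stored non-zero entries
n_total = toy.shape[0] * toy.shape[1]     # total number of cells
sparsity = 1 - n_nonzero / n_total        # proportion of zero cells

print(n_nonzero, n_total, sparsity)
print(toy.sum())                          # sum of the weights, not a count
```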


In [20]:
# Apply soft-cosine similarity

fasttext_model300 = api.load('fasttext-wiki-news-subwords-300') # load an embedding model

pod1 = ' '.join(tokens_list[0])
pod2 = ' '.join(tokens_list[1])
pod3 = ' '.join(tokens_list[2])
dictionary = corpora.Dictionary([simple_preprocess(pod) for pod in [pod1, pod2, pod3]]) #initialize a Dictionary. This step assigns a token_id to each word
bag_of_words_vectors = [ dictionary.doc2bow(simple_preprocess(pod)) for pod in [pod1, pod2, pod3]] # represent each podcast by (token_id, token_count) tuples

similarity_index = WordEmbeddingSimilarityIndex(fasttext_model300)
similarity_matrix = SparseTermSimilarityMatrix(similarity_index, dictionary) # Build a term similarity matrix and compute the Soft Cosine Measure

#between pod1 and pod2
scm_pod1_pod2 = similarity_matrix.inner_product(bag_of_words_vectors[0], bag_of_words_vectors[1], normalized=(True, True))

#between pod1 and pod3
scm_pod1_pod3 = similarity_matrix.inner_product(bag_of_words_vectors[0], bag_of_words_vectors[2], normalized=(True, True))

#between pod2 and pod3
scm_pod2_pod3 = similarity_matrix.inner_product(bag_of_words_vectors[1], bag_of_words_vectors[2], normalized=(True, True))

print(f"SCM between:\npod1 <-> pod2: {scm_pod1_pod2:.2f}\npod1 <-> pod3: {scm_pod1_pod3:.2f}\npod2 <-> pod3: {scm_pod2_pod3:.2f}")

100%|███████████████████████████████████████████| 93/93 [00:07<00:00, 12.69it/s]

SCM between:
pod1 <-> pod2: 0.08
pod1 <-> pod3: 0.12
pod2 <-> pod3: 0.16





##### Analysis
* according to the result, the sparsity of the matrix is about 0.99995, i.e. almost all cells are zero
* note that the matrix was built from the flattened token list, so each row is a single token rather than a full podcast description; the high sparsity is therefore only indirect evidence that podcasts use very different words
* given the large vocabulary (17638 terms), we still expect (many) different topics
* we also used soft-cosine similarity to spot-check the similarity among the first three podcasts (pod1, pod2, and pod3)
* the scores are low (0.08, 0.12, and 0.16), which suggests that these three podcasts have little in common, with pod2 and pod3 being the most similar pair
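As background for the soft-cosine scores, the measure is commonly defined as a^T S b / (sqrt(a^T S a) * sqrt(b^T S b)), where a and b are bag-of-words vectors and S holds term-to-term similarities. The NumPy sketch below uses a made-up three-word vocabulary and a made-up similarity matrix; it illustrates the formula and is not gensim's implementation.

```python
import numpy as np

def soft_cosine(a, b, S):
    """Soft cosine between vectors a and b given a term-similarity matrix S."""
    numerator = a @ S @ b
    denominator = np.sqrt(a @ S @ a) * np.sqrt(b @ S @ b)
    return float(numerator / denominator)

# Hypothetical vocabulary: ["play", "game", "news"]
doc_a = np.array([1.0, 0.0, 0.0])   # contains only "play"
doc_b = np.array([0.0, 1.0, 0.0])   # contains only "game"

S_identity = np.eye(3)              # no term is similar to any other term
S_soft = np.array([                 # made-up values: "play" and "game" are related
    [1.0, 0.8, 0.0],
    [0.8, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])

print(soft_cosine(doc_a, doc_b, S_identity))  # ordinary cosine: no shared words
print(soft_cosine(doc_a, doc_b, S_soft))      # related words raise the score
```

With the identity matrix the measure reduces to the ordinary cosine similarity, so two documents without shared words score 0; with word-embedding similarities they can still score above 0.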

##### Part 2: Topic modelling

In [21]:
# LDA implementation with raw term counts (gensim bag-of-words, the counterpart of CountVectorizer) (1)

raw_m1 = tokens_list
id2word_m1 = corpora.Dictionary(raw_m1)   # assign a token_id to each word
ldacorpus_m1 = [id2word_m1.doc2bow(text) for text in raw_m1] # represent each document as a list of (token_id, count) tuples

lda_m1 = models.LdaModel(ldacorpus_m1, id2word=id2word_m1, num_topics=50) # train an LDA model with 50 topics on the count corpus
lda_m1.print_topics()

[(9,
  '0.055*"war" + 0.037*"00" + 0.025*"star" + 0.020*"religious" + 0.019*"ghost" + 0.017*"history" + 0.017*"fight" + 0.015*"strange" + 0.014*"crime" + 0.011*"story"'),
 (41,
  '0.063*"de" + 0.027*"la" + 0.024*"podcast" + 0.019*"en" + 0.018*"ben" + 0.018*"amateur" + 0.016*"que" + 0.014*"el" + 0.012*"scott" + 0.012*"al"'),
 (26,
  '0.082*"rick" + 0.047*"00pm" + 0.022*"gun" + 0.021*"digital" + 0.016*"george" + 0.015*"mostly" + 0.011*"safety" + 0.010*"trek" + 0.009*"court" + 0.009*"off"'),
 (0,
  '0.024*"life" + 0.019*"spiritual" + 0.010*"podcast" + 0.009*"travel" + 0.009*"people" + 0.008*"experience" + 0.008*"live" + 0.007*"address" + 0.007*"animal" + 0.007*"change"'),
 (36,
  '0.015*"adam" + 0.013*"especially" + 0.013*"life" + 0.013*"big" + 0.012*"day" + 0.012*"o" + 0.012*"podcast" + 0.012*"wild" + 0.011*"doug" + 0.011*"surround"'),
 (17,
  '0.020*"lose" + 0.019*"mystery" + 0.019*"half" + 0.018*"usa" + 0.017*"radio" + 0.017*"version" + 0.016*"can" + 0.013*"not" + 0.013*"over" + 0.012*
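The corpus passed to the LDA model is a bag-of-words representation: each document becomes a list of (token_id, count) tuples. The plain-Python sketch below mimics that idea on made-up data; the ids gensim assigns may differ, and `token2id` and `to_bow` are hypothetical names.

```python
from collections import Counter

# Made-up tokenized documents
toy_docs = [["war", "history", "war"], ["history", "design"]]

# Assign an integer id to every distinct token, in order of first appearance
token2id = {}
for doc in toy_docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(doc):
    """Return sorted (token_id, count) tuples for one document."""
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())

toy_corpus = [to_bow(doc) for doc in toy_docs]
print(token2id)
print(toy_corpus)
```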

In [22]:
# LDA implementation with tf-idf weighting (gensim TfidfModel, the counterpart of TfidfVectorizer) (2)

raw_m2 = tokens_list
id2word_m2 = corpora.Dictionary(raw_m2)   # assign a token_id to each word
ldacorpus_m2 = [id2word_m2.doc2bow(text) for text in raw_m2] # represent each document as a list of (token_id, count) tuples

tfidfcorpus_m = models.TfidfModel(ldacorpus_m2) # train the tf-idf model on the count corpus

lda_m2 = models.ldamodel.LdaModel(corpus=tfidfcorpus_m[ldacorpus_m2],id2word=id2word_m2,num_topics=50) # train an LDA model with 50 topics on the tf-idf weighted corpus
lda_m2.print_topics(num_words=5)

[(15,
  '0.022*"debate" + 0.018*"athlete" + 0.011*"mental" + 0.011*"fm" + 0.009*"round"'),
 (35,
  '0.012*"fish" + 0.008*"president" + 0.008*"theater" + 0.007*"joke" + 0.006*"artificial"'),
 (19,
  '0.023*"apple" + 0.008*"twit" + 0.008*"com" + 0.007*"hunt" + 0.007*"watch"'),
 (6,
  '0.010*"faith" + 0.009*"desert" + 0.007*"eight" + 0.006*"prx" + 0.006*"radiotopia"'),
 (3,
  '0.008*"radio" + 0.007*"news" + 0.007*"technology" + 0.007*"tech" + 0.006*"live"'),
 (12, '0.021*"de" + 0.010*"la" + 0.010*"consist" + 0.007*"en" + 0.007*"el"'),
 (18,
  '0.018*"buddhist" + 0.012*"doctor" + 0.008*"feedback" + 0.007*"patreon" + 0.007*"nick"'),
 (5,
  '0.012*"nature" + 0.007*"philosophical" + 0.006*"formerly" + 0.005*"progressive" + 0.004*"european"'),
 (44,
  '0.007*"park" + 0.007*"interview" + 0.007*"podcast" + 0.006*"lecture" + 0.006*"audio"'),
 (4,
  '0.009*"record" + 0.008*"car" + 0.006*"star" + 0.006*"3" + 0.006*"adventure"'),
 (27,
  '0.014*"weekday" + 0.007*"lie" + 0.006*"pain" + 0.005*"ahead" 

In [23]:
# LDA implementation with N-grams (3)

clean_list = [' '.join(tokens) for tokens in tokens_list] # re-join the tokens into strings; this looks repetitive, but it keeps the tokens consistent with the pre-processing above
pods_bigrams = [["_".join(tup) for tup in nltk.ngrams(text.split(),2)] for text in clean_list] # creates bigrams

pods_uniandbigrams = []
for a,b in zip([text.split() for text in clean_list],pods_bigrams):
    pods_uniandbigrams.append(a + b) # we want both unigrams and bigrams in the feature set

id2word_m3 = corpora.Dictionary(pods_uniandbigrams)
id2word_m3.filter_extremes(no_below=5, no_above=0.5)

ldacorpus_m3 = [id2word_m3.doc2bow(text) for text in pods_uniandbigrams]
tfidfcorpus_m3 = models.TfidfModel(ldacorpus_m3)

lda_m3 = models.ldamodel.LdaModel(corpus=tfidfcorpus_m3[ldacorpus_m3],id2word=id2word_m3,num_topics=50)
lda_m3.print_topics(num_words=5)

[(38,
  '0.015*"chicago" + 0.014*"teaching" + 0.010*"addiction" + 0.010*"high_quality" + 0.009*"quality"'),
 (18,
  '0.009*"medical" + 0.009*"health" + 0.008*"podcast" + 0.007*"software" + 0.007*"coach"'),
 (40,
  '0.022*"movie" + 0.015*"podcaster" + 0.011*"smith" + 0.010*"tv" + 0.010*"podcast_cover"'),
 (34,
  '0.018*"mountain" + 0.013*"humanity" + 0.013*"train" + 0.011*"california" + 0.011*"attorney"'),
 (45,
  '0.008*"journal" + 0.006*"race" + 0.006*"exploration" + 0.006*"history" + 0.006*"book"'),
 (29,
  '0.013*"war" + 0.011*"zen" + 0.010*"star_war" + 0.009*"star" + 0.008*"lose"'),
 (43,
  '0.017*"die" + 0.014*"happening" + 0.013*"united" + 0.012*"rob" + 0.011*"kyle"'),
 (21,
  '0.016*"religion" + 0.016*"faith" + 0.014*"social" + 0.013*"researcher" + 0.012*"college"'),
 (39,
  '0.011*"english" + 0.010*"hbo" + 0.009*"a_m" + 0.009*"academy" + 0.009*"ancient"'),
 (20,
  '0.021*"fm" + 0.012*"revolution" + 0.010*"stephen" + 0.010*"radio" + 0.010*"explore_world"'),
 (33,
  '0.019*"compu
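The combined unigram and bigram feature set can be illustrated without NLTK. The sketch below builds the same kind of underscore-joined bigrams with `zip`; `unigrams_and_bigrams` is a hypothetical helper name and the tokens are made up.

```python
def unigrams_and_bigrams(tokens):
    """Return the tokens followed by underscore-joined adjacent pairs."""
    bigrams = ["_".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams

print(unigrams_and_bigrams(["star", "war", "podcast"]))
```

This is how features such as "star_war" end up next to "star" and "war" in the topics above.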

##### Justification for setting (k=) 50 topics
* Because LDA is non-deterministic and there is no single correct number of topics, the choice of k is partly subjective
* We combined quantitative (the matrix) and qualitative (researcher observation) methods to balance precision and recall
* k = 50 (we consider that this number strikes a reasonable balance between precision and recall)

In [24]:
# Model evaluation

cm1 = models.CoherenceModel(model=lda_m1, corpus=ldacorpus_m1, dictionary=id2word_m1, coherence='u_mass')
ch1 = cm1.get_coherence()
cm2 = models.CoherenceModel(model=lda_m2, corpus=ldacorpus_m2, dictionary=id2word_m2, coherence='u_mass')
ch2 = cm2.get_coherence()
cm3 = models.CoherenceModel(model=lda_m3, corpus=tfidfcorpus_m3[ldacorpus_m3], dictionary=id2word_m3, coherence='u_mass')
ch3 = cm3.get_coherence()

print(f"Coherence of naive model = {ch1}\nCoherence of tfidf model = {ch2}\nCoherence of bigram and unigram model = {ch3}")

## u_mass coherence: values closer to 0 indicate more coherent topics, so the naive (count) model scores best here
## we nevertheless continue with the tfidf model in the visualization below

Coherence of naive model = -6.851227441836759
Coherence of tfidf model = -14.389097066215436
Coherence of bigram and unigram model = -12.919242687658983


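##### Justification for using tfidf model
* the u_mass coherence of the naive model is about -6.851, that of the tfidf model about -14.389, and that of the bigram and unigram model about -12.919
* u_mass scores are negative here, and a higher score (closer to 0) indicates better coherence
* by this measure the naive model is the most coherent and the tfidf model the least, so the coherence scores alone do not justify choosing the tfidf model
* we continue with the tfidf model for the visualization, but this choice would need support from another criterion, such as qualitative inspection of the topics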
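To make the direction of the score concrete: u_mass-style coherence rewards topics whose top words tend to occur in the same documents, and a higher value (closer to 0 or above) means more coherent. The sketch below is a simplified version with +1 smoothing on made-up documents; it is an assumption-laden illustration and not gensim's exact computation, and `umass_like` is a hypothetical name.

```python
import math
from itertools import combinations

def umass_like(topic_words, docs):
    """Average smoothed log co-occurrence ratio over ordered word pairs."""
    doc_sets = [set(doc) for doc in docs]

    def doc_freq(*words):
        # Number of documents that contain all the given words
        return sum(all(w in d for w in words) for d in doc_sets)

    scores = [
        math.log((doc_freq(later, earlier) + 1) / doc_freq(earlier))
        for earlier, later in combinations(topic_words, 2)
    ]
    return sum(scores) / len(scores)

# Made-up corpus
toy_docs = [
    ["war", "history", "battle"],
    ["war", "history"],
    ["tech", "news"],
    ["tech", "news", "apple"],
    ["war", "battle"],
]

score_related = umass_like(["war", "history", "battle"], toy_docs)
score_mixed = umass_like(["war", "news", "apple"], toy_docs)
print(score_related, score_mixed)
```

The topic whose words co-occur receives the higher score, while the mixed topic receives a more negative one.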

In [25]:
# Visualization

vis_data = gensimvis.prepare(lda_m2,ldacorpus_m2,id2word_m2)
pyLDAvis.display(vis_data)