# Group assignment (Podcast dataset)
#### Minor: Communication in the Digital Society
#### Course: CCS 2
#### Tutorial group 1
#### Tutorial teacher: Isa van Leeuwen
#### Group members: Ada Shi (13558846), 

## Code for exploratory data analysis
#### In this part, we will explore the dataset by:
* checking all columns and their datatypes, and checking for missing values
* following the pre-processing steps learned in the first week of this course (e.g. lowercasing, tokenization, stop word removal/pruning, lemmatization, and N-grams)

#### Data exploration (learned in CCS 1):
##### (this part is not important for this assignment but good for personal understanding)

In [1]:
# Load packages and the podcast dataset into the Jupyter notebook

import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
import spacy

import sklearn
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

df_podcast = pd.read_csv("poddf.csv")

In [2]:
# Check columns

df_podcast.columns # there are 6 columns in the original dataset

Index(['index', 'Name', 'Rating_Volume', 'Rating', 'Genre', 'Description'], dtype='object')

In [3]:
# Check datatypes for all columns

df_podcast.dtypes # every column except "index" has the datatype object

# the datatypes of "Rating_Volume" and "Rating" are wrong ("Rating_Volume" should be int64 and "Rating" should be float64)

index             int64
Name             object
Rating_Volume    object
Rating           object
Genre            object
Description      object
dtype: object

##### Explanations

##### Modification on "Rating_Volume" column
* We wanted to convert the datatype of "Rating_Volume" to int64 with the following code:
* `df_podcast["Rating_Volume"] = df_podcast["Rating_Volume"].astype(int)`
* However, this raised the error "ValueError: invalid literal for int() with base 10: 'Not Found'", because some rows contain the string "Not Found" instead of a number

##### Modification on "Rating" column
* Similarly, we wanted to convert the datatype of "Rating" to float64 with the following code:
* `df_podcast["Rating"] = df_podcast["Rating"].astype(float)`
* This raised the error "ValueError: could not convert string to float: 'Not Found'", again because of the string "Not Found"
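
Before correcting the columns, it can help to list exactly which entries block the conversion. The sketch below is one possible way to do that, not part of the original notebook: the miniature DataFrame `sample` and the helper `non_numeric_values` are hypothetical, and the values are invented for illustration. It relies on `pd.to_numeric(..., errors="coerce")`, which turns unparseable entries into NaN.

```python
import pandas as pd

# Hypothetical miniature of the two rating columns (values invented for illustration)
sample = pd.DataFrame({
    "Rating_Volume": ["120", "Not Found", "35"],
    "Rating": ["4.5", "Not Found", "3.9"],
})

def non_numeric_values(series):
    """Return the distinct entries of a column that cannot be parsed as numbers."""
    parsed = pd.to_numeric(series, errors="coerce")
    # An entry is a problem when it is present but parsing turned it into NaN
    return sorted(series[parsed.isna() & series.notna()].unique())

bad_volume = non_numeric_values(sample["Rating_Volume"])
bad_rating = non_numeric_values(sample["Rating"])
print(bad_volume)
print(bad_rating)
```

Applied to the real columns, the same helper would show whether "Not Found" is the only placeholder or whether other non-numeric strings also need handling.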

In [4]:
# Correction

## Replace "Not Found" values with NaN
df_podcast["Rating_Volume"] = df_podcast["Rating_Volume"].replace("Not Found", np.nan)
df_podcast["Rating"] = df_podcast["Rating"].replace("Not Found", np.nan)

## Change datatypes
df_podcast["Rating"] = df_podcast["Rating"].astype(float)
df_podcast["Rating_Volume"] = df_podcast["Rating_Volume"].astype("Int64") # nullable integer type, which keeps the missing values

##### Explanation for the code that converts NaN to nullable integer
* After replacing "Not Found" with NaN, we tried to convert "Rating_Volume" to int64 with the following code:
* `df_podcast["Rating_Volume"] = df_podcast["Rating_Volume"].astype(int)`
* However, this raised the error "ValueError: cannot convert NA to integer", because NaN is a float and has no representation in the regular int64 datatype
* To solve this, we converted the column to the nullable integer type ("Int64") instead, a datatype that can store both regular integers and missing values
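
The difference between the two integer types can be reproduced on a small example. This is a minimal sketch with invented values; the Series `volumes` is hypothetical and stands in for the "Rating_Volume" column after "Not Found" has been replaced.

```python
import numpy as np
import pandas as pd

# Hypothetical values, invented for illustration
volumes = pd.Series([120.0, np.nan, 35.0])

# Regular int64 cannot hold NaN, so this conversion is expected to fail
try:
    volumes.astype(int)
    plain_int_failed = False
except (ValueError, TypeError):
    plain_int_failed = True

# The nullable "Int64" type keeps the integers and marks the gap as missing
nullable = volumes.astype("Int64")

print(plain_int_failed)
print(nullable.dtype)
print(nullable.isna().sum())
```

Note the capital "I": `"Int64"` is the pandas nullable extension type, while `"int64"` is the regular NumPy integer type.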

In [5]:
df_podcast.dtypes # "Rating_Volume" is now Int64 and "Rating" is now float64

index              int64
Name              object
Rating_Volume      Int64
Rating           float64
Genre             object
Description       object
dtype: object

In [6]:
# Check for missing values

df_podcast.isna().sum() # "Rating_Volume" and "Rating" each have 1887 missing values (stored as NaN)

index               0
Name                0
Rating_Volume    1887
Rating           1887
Genre               0
Description         0
dtype: int64

#### Specification:
* For this assignment, we are most interested in the last object-typed column, "Description"
* To develop our own recommender system, we will first transform this column into a list and pre-process the text with the techniques learned in week 1 (e.g. lowercasing, stop word removal, lemmatization/stemming, tokenization)
* Second, we will adopt the bottom-up (inductive) approach to text analysis learned in week 3 (e.g. CountVectorizer/TfidfVectorizer, cosine similarity/soft cosine similarity)
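
To make the idea behind cosine similarity concrete, the sketch below computes it by hand on raw token counts, using only the standard library. It is an illustration under assumptions, not the method used later in this notebook: the function `cosine_similarity_counts` and the three toy token lists are hypothetical.

```python
import math
from collections import Counter

def cosine_similarity_counts(tokens_a, tokens_b):
    """Cosine similarity between two token lists, based on raw term counts."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    # Dot product over the tokens the two documents share
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)

# Hypothetical pre-processed descriptions (invented for illustration)
doc_crime = ["true", "crime", "story"]
doc_crime_too = ["crime", "story", "podcast"]
doc_cooking = ["cooking", "recipe"]

sim_related = cosine_similarity_counts(doc_crime, doc_crime_too)
sim_unrelated = cosine_similarity_counts(doc_crime, doc_cooking)
print(sim_related, sim_unrelated)
```

Descriptions that share more tokens get a score closer to 1, and descriptions with no tokens in common get 0, which is the property a recommender can use to rank similar podcasts.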

#### 1. Pre-processing (learned in CCS 2, week 1):

In [7]:
# Transform the text column into a list

text_list = df_podcast["Description"].tolist()

In [8]:
# Lowercasing

text_list_lower = [text.lower() for text in text_list]

In [9]:
# Removing stopwords

mystopwords = set(stopwords.words("english")) # a set makes the membership test faster than a list
text_without_stopwords = [" ".join(w for w in text.split() if w not in mystopwords) for text in text_list_lower]

In [10]:
# Lemmatization

nlp = spacy.load("en_core_web_sm")
lemmatized_text = [" ".join([w.lemma_ for w in nlp(text)]) for text in text_without_stopwords]

In [20]:
# Tokenize and remove punctuation

tokenizer = RegexpTokenizer(r'\w+')
text_without_punctuations = [tokenizer.tokenize(text) for text in lemmatized_text]

In [18]:
# Remove the leftover token "s"

tokens = [[w for w in text if w != "s"] for text in text_without_punctuations]

#### Specification:
* The pre-processing follows the order shown above because we want to end up with standard tokens without any punctuation
* After lowercasing, we remove stop words, which are not informative
* Lemmatization then reduces each token to its dictionary form
* After lemmatization, some contractions appear in split form (e.g. it's -> it 's), so we separate the pruning process into two steps (removing stop words and removing punctuation) and remove punctuation only after lemmatization
* In the final pre-processing step, we remove the token "s": while inspecting the output of the previous steps we observed many "s" tokens, which are left over from the split contractions and are not informative
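
The origin of the stray "s" tokens can be reproduced on a single string. The sketch below assumes what a lemmatized description might look like (the string `lemmatized_example` is invented for illustration) and uses `re.findall` with the same `\w+` pattern that is passed to `RegexpTokenizer` above, which should behave the same way for this pattern.

```python
import re

# Assumed shape of one lemmatized description (invented for illustration):
# the possessive "host's" has been split into "host" and "'s"
lemmatized_example = "host 's guest tell true story !"

# \w+ keeps runs of letters, digits and underscores, so punctuation is dropped
word_tokens = re.findall(r"\w+", lemmatized_example)

# Dropping the apostrophe leaves a bare "s" behind, which is filtered out here
clean_tokens = [w for w in word_tokens if w != "s"]

print(word_tokens)
print(clean_tokens)
```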

#### 2. Inductive analysis (learned in CCS 2, week 3):
#### Part 1: Vectorization
* In this part, we will use the TfidfVectorizer, which returns its output in a sparse format
* The aim is to analyze the data bottom-up, deriving features from the texts themselves
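
As a rough illustration of what "sparse format" means here, the sketch below vectorizes three toy token lists. The lists in `toy_tokens` are hypothetical stand-ins for `tokens`, and passing a pass-through `analyzer` is one possible way to feed already tokenized texts to `TfidfVectorizer`; it is an assumption, not necessarily the setup used in the rest of this notebook.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical token lists standing in for `tokens` (invented for illustration)
toy_tokens = [
    ["true", "crime", "story"],
    ["crime", "story", "podcast"],
    ["cooking", "recipe", "podcast"],
]

# The texts are already tokenized, so the analyzer passes each list through unchanged
vectorizer = TfidfVectorizer(analyzer=lambda doc: doc)
tfidf_matrix = vectorizer.fit_transform(toy_tokens)

print(issparse(tfidf_matrix))        # whether the result is a SciPy sparse matrix
print(tfidf_matrix.shape)            # (number of documents, number of distinct tokens)
print(tfidf_matrix.nnz)              # number of non-zero weights actually stored
print(sorted(vectorizer.vocabulary_))
```

Only the non-zero weights are stored, which keeps memory use manageable when thousands of descriptions are spread over a large vocabulary.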