# EDA Notebook

Author: Finian O'Neill
Purpose: Perform exploratory data analysis on each dataset provided by Kaggle for the "Predict Podcast Listening Time" competition.

### Setup

In [1]:
import pandas as pd

In [14]:
def value_counts_with_percentage(df, column_name):
    """Return a DataFrame of counts and percentage shares for each value in a column.

    Missing values (NaN) are excluded, matching the default of value_counts().
    """
    # get value counts
    counts = df[column_name].value_counts()

    # calculate percentages
    percentages = df[column_name].value_counts(normalize=True) * 100

    # combine into a DataFrame
    result = pd.DataFrame({
        'Count': counts,
        'Percentage': percentages.round(2)
    })

    # add % symbol to percentage column
    result['Percentage'] = result['Percentage'].astype(str) + '%'

    return result

### Data Import

In [2]:
# load train dataset and inspect the first rows
train_df = pd.read_csv('data/train.csv')
train_df.head()

Unnamed: 0,id,Podcast_Name,Episode_Title,Episode_Length_minutes,Genre,Host_Popularity_percentage,Publication_Day,Publication_Time,Guest_Popularity_percentage,Number_of_Ads,Episode_Sentiment,Listening_Time_minutes
0,0,Mystery Matters,Episode 98,,True Crime,74.81,Thursday,Night,,0.0,Positive,31.41998
1,1,Joke Junction,Episode 26,119.8,Comedy,66.95,Saturday,Afternoon,75.95,2.0,Negative,88.01241
2,2,Study Sessions,Episode 16,73.9,Education,69.97,Tuesday,Evening,8.97,0.0,Negative,44.92531
3,3,Digital Digest,Episode 45,67.17,Technology,57.22,Monday,Morning,78.7,2.0,Positive,46.27824
4,4,Mind & Body,Episode 86,110.51,Health,80.07,Monday,Afternoon,58.68,3.0,Neutral,75.61031


In [3]:
# print summary statistics for the numerical columns
train_df.describe()

Unnamed: 0,id,Episode_Length_minutes,Host_Popularity_percentage,Guest_Popularity_percentage,Number_of_Ads,Listening_Time_minutes
count,750000.0,662907.0,750000.0,603970.0,749999.0,750000.0
mean,374999.5,64.504738,59.859901,52.236449,1.348855,45.437406
std,216506.495284,32.969603,22.873098,28.451241,1.15113,27.138306
min,0.0,0.0,1.3,0.0,0.0,0.0
25%,187499.75,35.73,39.41,28.38,0.0,23.17835
50%,374999.5,63.84,60.05,53.58,1.0,43.37946
75%,562499.25,94.07,79.53,76.6,2.0,64.81158
max,749999.0,325.24,119.46,119.91,103.91,119.97


In [8]:
# define categorical and numerical columns
numerical_columns = ['Episode_Length_minutes', 'Host_Popularity_percentage', 
                    'Guest_Popularity_percentage', 'Number_of_Ads', 
                    'Listening_Time_minutes']

exclude_columns = ['id']

# get categorical columns by filtering out numerical columns
categorical_columns = [col for col in train_df.columns if (col not in numerical_columns) and (col not in exclude_columns)]

In [15]:
# print values in categorical columns
for col in categorical_columns:
    print('Currently displaying the values in the column {col}...'.format(col=col))
    print('There are a total of {n} unique categories in this column...'.format(n=train_df[col].nunique()))
    print(value_counts_with_percentage(df=train_df, column_name=col))
    print('-' * 50)

Currently displaying the values in the column Podcast_Name...
There are a total of 48 unique categories in this column...
                     Count Percentage
Podcast_Name                         
Tech Talks           22847      3.05%
Sports Weekly        20053      2.67%
Funny Folks          19635      2.62%
Tech Trends          19549      2.61%
Fitness First        19488       2.6%
Business Insights    19480       2.6%
Style Guide          19364      2.58%
Game Day             19272      2.57%
Melody Mix           18889      2.52%
Criminal Minds       17735      2.36%
Finance Focus        17628      2.35%
Detective Diaries    17452      2.33%
Crime Chronicles     17374      2.32%
Athlete's Arena      17327      2.31%
Fashion Forward      17280       2.3%
Tune Time            17254       2.3%
Business Briefs      17012      2.27%
Lifestyle Lounge     16661      2.22%
True Crime Stories   16373      2.18%
Sports Central       16191      2.16%
Digital Digest       16171      2.16%
Humo

#### Observations Regarding Categorical Columns:

1) Podcast_Name --> This feature has 48 distinct categories, and the most frequent one (Tech Talks) accounts for only ~3% of the observations, so one-hot encoding would add roughly 48 sparse columns. It should likely be encoded with either frequency encoding or by clustering semantically similar categories to reduce the dimensionality.
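
As a rough illustration of the frequency-encoding option for Podcast_Name, the sketch below maps each category to its share of the training rows. The helper names `fit_frequency_map` and `apply_frequency_map`, the toy series, and the choice of 0.0 for unseen categories are all assumptions made for this example; they are not part of the notebook or of pandas.

```python
import pandas as pd


def fit_frequency_map(series):
    # Hypothetical helper: relative frequency of each category,
    # learned from the training data only.
    return series.value_counts(normalize=True)


def apply_frequency_map(series, freq_map, default=0.0):
    # Hypothetical helper: replace each category with its learned frequency.
    # Categories not seen during fitting fall back to `default` (an assumption).
    return series.map(freq_map).fillna(default)


# Toy data standing in for train_df['Podcast_Name'].
toy = pd.Series(['Tech Talks', 'Tech Talks', 'Game Day', 'Melody Mix'],
                name='Podcast_Name')

freq_map = fit_frequency_map(toy)
encoded = apply_frequency_map(toy, freq_map)

# A category that never appeared in the toy training data.
unseen = apply_frequency_map(pd.Series(['Unknown Show']), freq_map)

print(encoded.tolist())
print(unseen.tolist())
```

Fitting the map on the training set and reusing it on the test set keeps the encoding consistent between the two. One caveat of this approach is that categories with near-identical counts receive near-identical codes, which may or may not matter for the downstream model.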