Categorical Encoding:
One-hot encode Genre, Publication_Day, and Episode_Sentiment.

Label encode or target encode Podcast_Name and Episode_Title (if used directly) or extract features from them (see below).
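
Target encoding computed from every row lets each row's own target influence its encoded value. The sketch below shows one common way to avoid that, out-of-fold encoding, using only pandas and NumPy. The helper name `oof_target_encode`, the fold count, and the toy rows are assumptions for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

def oof_target_encode(df, cat_col, target_col, n_splits=5, seed=0):
    """Out-of-fold mean target encoding (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Randomly assign every row to one of n_splits folds
    fold_ids = rng.integers(0, n_splits, size=len(df))
    encoded = pd.Series(np.nan, index=df.index, dtype=float)
    for fold in range(n_splits):
        held_out = fold_ids == fold
        if not held_out.any():
            continue
        rest = df.loc[~held_out]
        # Category means come only from rows outside the held-out fold
        means = rest.groupby(cat_col)[target_col].mean()
        values = df.loc[held_out, cat_col].map(means)
        # Categories unseen outside the fold fall back to the out-of-fold mean
        values = values.fillna(rest[target_col].mean())
        encoded.loc[held_out] = values.to_numpy()
    return encoded

# Toy rows, made up for illustration
toy = pd.DataFrame({
    'Podcast_Name': ['a', 'b'] * 50,
    'Listening_Time_minutes': np.arange(100, dtype=float),
})
toy['Podcast_Name_Encoded'] = oof_target_encode(
    toy, 'Podcast_Name', 'Listening_Time_minutes')
print(toy.head())
```

Because each row is encoded from the other folds only, changing a row's own target leaves its encoded value unchanged.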

Text Features:
Extract keyword features from Episode_Title using NLP tools (e.g., TF-IDF, word embeddings), or compute sentiment scores.

Derive title length or keyword presence (e.g., "exclusive," "interview") as features.
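
A minimal sketch of simple title features, assuming titles are short strings such as "episode 98". The keyword list, the `Episode_Number` column, and the last two toy rows are assumptions added for illustration.

```python
import pandas as pd

titles = pd.DataFrame(
    {'Episode_Title': ['episode 98', 'episode 26', 'exclusive interview', None]})

# Word count; missing titles count as zero words
titles['Title_Length'] = titles['Episode_Title'].fillna('').str.split().str.len()

# Keyword presence flags (illustrative keyword list)
for kw in ['exclusive', 'interview']:
    titles[f'Title_Has_{kw}'] = (
        titles['Episode_Title']
        .str.contains(kw, case=False, na=False, regex=False)
        .astype(int))

# First run of digits in the title, e.g. 98 from "episode 98"; NaN if absent
titles['Episode_Number'] = pd.to_numeric(
    titles['Episode_Title'].str.extract(r'(\d+)', expand=False),
    errors='coerce')

print(titles)
```

If every title follows the "episode N" pattern, the word count is constant and carries no signal, whereas the parsed number may still vary.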

Time-Based Features:
Publication_Time already holds time-of-day bins (Morning, Afternoon, Evening, Night), so encode it directly. If raw timestamps are available, extract hour of day or AM/PM instead.

Create a binary feature for weekend vs. weekday from Publication_Day.
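
The time-based features can be sketched directly on the raw columns, assuming Publication_Time contains the four labels Morning, Afternoon, Evening, and Night. The ordinal ordering and the column name `Publication_Time_Ordinal` are assumptions for illustration; one-hot encoding is an equally valid choice.

```python
import pandas as pd

# Toy rows, made up for illustration
days = pd.DataFrame({
    'Publication_Day': ['Thursday', 'Saturday', 'Sunday'],
    'Publication_Time': ['Night', 'Afternoon', 'Morning'],
})

# 1 for Saturday or Sunday, 0 otherwise
days['Is_Weekend'] = days['Publication_Day'].isin(['Saturday', 'Sunday']).astype(int)

# Assumed chronological ordering of the time-of-day bins
time_order = {'Morning': 0, 'Afternoon': 1, 'Evening': 2, 'Night': 3}
days['Publication_Time_Ordinal'] = days['Publication_Time'].map(time_order)

print(days)
```

Deriving the weekend flag from the raw day name avoids depending on the names of one-hot encoded columns.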

Interaction Features:
Combine Host_Popularity_percentage and Guest_Popularity_percentage (e.g., average or product) to capture combined influence.

Create a ratio of Number_of_Ads to Episode_Length_minutes to measure ad density.
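
A minimal sketch of both interaction features. The third row (a zero-length episode) and the `Popularity_Product` column are made up to show the edge case and the product variant.

```python
import numpy as np
import pandas as pd

eps = pd.DataFrame({
    'Host_Popularity_percentage': [74.81, 66.95, 50.0],
    'Guest_Popularity_percentage': [53.58, 75.95, 30.0],
    'Number_of_Ads': [0, 2, 1],
    'Episode_Length_minutes': [63.84, 119.80, 0.0],  # last row: made-up edge case
})

pop_cols = ['Host_Popularity_percentage', 'Guest_Popularity_percentage']

# Average of host and guest popularity (missing values are skipped by default)
eps['Combined_Popularity'] = eps[pop_cols].mean(axis=1)

# Product variant, rescaled to stay on a 0-100 scale
eps['Popularity_Product'] = eps[pop_cols].prod(axis=1) / 100

# Ads per minute; zero-length episodes become NaN instead of dividing by zero
length = eps['Episode_Length_minutes'].replace(0, np.nan)
eps['Ad_Density'] = eps['Number_of_Ads'] / length

print(eps)
```

Leaving the ratio as NaN for zero-length episodes keeps the problem visible, at the cost of requiring an explicit imputation step before modeling.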

Scaling:
Standardize numerical features (Episode_Length_minutes, Host_Popularity_percentage, etc.) using StandardScaler or MinMaxScaler.
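
When the data will later be split for evaluation, the scaler is usually fitted on the training rows only and then applied to the held-out rows. A minimal sketch with scikit-learn, using synthetic values in place of the real columns (the random data and the 75/25 split are assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for two numerical columns
rng = np.random.default_rng(0)
X = pd.DataFrame({
    'Episode_Length_minutes': rng.uniform(5, 120, size=200),
    'Host_Popularity_percentage': rng.uniform(0, 100, size=200),
})

X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

scaler = StandardScaler()
# Learn mean and standard deviation from the training rows only
X_train_scaled = pd.DataFrame(
    scaler.fit_transform(X_train), columns=X.columns, index=X_train.index)
# Reuse the training statistics on the held-out rows
X_test_scaled = pd.DataFrame(
    scaler.transform(X_test), columns=X.columns, index=X_test.index)

print(X_train_scaled.describe().loc[['mean', 'std']])
```

The held-out rows will not have exactly zero mean and unit variance, which is expected because their statistics were never used.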



In [6]:
df = pd.read_csv('preprocessed_podcast_data.csv')
df

Unnamed: 0,id,Podcast_Name,Episode_Title,Episode_Length_minutes,Genre,Host_Popularity_percentage,Publication_Day,Publication_Time,Guest_Popularity_percentage,Number_of_Ads,Episode_Sentiment,Listening_Time_minutes
0,0,mystery matters,episode 98,63.84,True Crime,74.81,Thursday,Night,53.58,0,Positive,31.41998
1,1,joke junction,episode 26,119.80,Comedy,66.95,Saturday,Afternoon,75.95,2,Negative,88.01241
2,2,study sessions,episode 16,73.90,Education,69.97,Tuesday,Evening,8.97,0,Negative,44.92531
3,3,digital digest,episode 45,67.17,Technology,57.22,Monday,Morning,78.70,2,Positive,46.27824
4,4,mind body,episode 86,110.51,Health,80.07,Monday,Afternoon,58.68,3,Neutral,75.61031
...,...,...,...,...,...,...,...,...,...,...,...,...
749995,749995,learning lab,episode 25,75.66,Education,69.36,Saturday,Morning,53.58,0,Negative,56.87058
749996,749996,business briefs,episode 21,75.75,Business,35.21,Saturday,Night,53.58,2,Neutral,45.46242
749997,749997,lifestyle lounge,episode 51,30.98,Lifestyle,78.58,Thursday,Morning,84.89,0,Negative,15.26000
749998,749998,style guide,episode 47,108.98,Lifestyle,45.39,Thursday,Morning,93.27,0,Negative,100.72939


In [7]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.feature_extraction.text import CountVectorizer

# Load preprocessed dataset
df = pd.read_csv('preprocessed_podcast_data.csv')
print("Columns in preprocessed dataset:", df.columns.tolist())
print("Publication_Time unique values:", df['Publication_Time'].unique())

# 1. Categorical Encoding
# Encode all categorical columns, including Publication_Time
cat_cols = [col for col in ['Genre', 'Publication_Day', 'Episode_Sentiment', 'Publication_Time'] 
            if col in df.columns]
if cat_cols:
    ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
    ohe_features = ohe.fit_transform(df[cat_cols])
    ohe_feature_names = ohe.get_feature_names_out(cat_cols)
    df_ohe = pd.DataFrame(ohe_features, columns=ohe_feature_names, index=df.index)
    df = pd.concat([df, df_ohe], axis=1)
    df = df.drop(cat_cols, axis=1)

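# Target encode Podcast_Name with the mean listening time per podcast.
# Caution: the means use every row, so the target leaks into this feature;
# compute them on training data only before evaluating a model.
if 'Podcast_Name' in df.columns:
    podcast_mean_listening = df.groupby('Podcast_Name')['Listening_Time_minutes'].mean()
    df['Podcast_Name_Encoded'] = df['Podcast_Name'].map(podcast_mean_listening)
    df = df.drop('Podcast_Name', axis=1)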

# 2. Text Features from Episode_Title
if 'Episode_Title' in df.columns:
    # Title length measured in words (not characters)
    df['Title_Length'] = df['Episode_Title'].str.split().str.len()
    keywords = ['interview', 'exclusive', 'special', 'guest']
    for kw in keywords:
        df[f'Title_Has_{kw}'] = df['Episode_Title'].str.contains(kw, case=False, na=False).astype(int)
    vectorizer = CountVectorizer(max_features=50, stop_words='english')
    title_bow = vectorizer.fit_transform(df['Episode_Title'])
    bow_df = pd.DataFrame(title_bow.toarray(), 
                          columns=[f'Title_BOW_{f}' for f in vectorizer.get_feature_names_out()],
                          index=df.index)
    df = pd.concat([df, bow_df], axis=1)
    df = df.drop('Episode_Title', axis=1)

# 3. Create Is_Weekend from encoded Publication_Day columns
weekend_cols = [col for col in ('Publication_Day_Saturday', 'Publication_Day_Sunday')
                if col in df.columns]
# Summing over an empty column list yields 0 for every row
df['Is_Weekend'] = df[weekend_cols].sum(axis=1).astype(int)

# 4. Interaction Features
df['Combined_Popularity'] = (df['Host_Popularity_percentage'] + df['Guest_Popularity_percentage']) / 2
# Ads per minute; zero-length episodes use a denominator of 1 to avoid division by zero
df['Ad_Density'] = df['Number_of_Ads'] / df['Episode_Length_minutes'].replace(0, 1)

# 5. Scaling Numerical Features
num_cols = [col for col in ['Episode_Length_minutes', 'Host_Popularity_percentage', 
                            'Guest_Popularity_percentage', 'Number_of_Ads', 
                            'Podcast_Name_Encoded', 'Title_Length', 
                            'Combined_Popularity', 'Ad_Density'] 
            if col in df.columns]
if num_cols:
    scaler = StandardScaler()
    df[num_cols] = scaler.fit_transform(df[num_cols])

# 6. Feature Selection
if 'id' in df.columns:
    df = df.drop('id', axis=1)

# Verify no non-numeric columns
print("Data types after feature engineering:\n", df.dtypes)
non_numeric_cols = df.select_dtypes(include=['object', 'category']).columns
if len(non_numeric_cols) > 0:
    print("Warning: Non-numeric columns found:", non_numeric_cols.tolist())
else:
    print("All columns are numeric or encoded.")

# Save engineered dataset
df.to_csv('engineered_podcast_data.csv', index=False)
print("Engineered dataset saved as 'engineered_podcast_data.csv'")
print("Final columns:", df.columns.tolist())

Columns in preprocessed dataset: ['id', 'Podcast_Name', 'Episode_Title', 'Episode_Length_minutes', 'Genre', 'Host_Popularity_percentage', 'Publication_Day', 'Publication_Time', 'Guest_Popularity_percentage', 'Number_of_Ads', 'Episode_Sentiment', 'Listening_Time_minutes']
Publication_Time unique values: ['Night' 'Afternoon' 'Evening' 'Morning']
Data types after feature engineering:
 Episode_Length_minutes         float64
Host_Popularity_percentage     float64
Guest_Popularity_percentage    float64
Number_of_Ads                  float64
Listening_Time_minutes         float64
                                ...   
Title_BOW_99                     int64
Title_BOW_episode                int64
Is_Weekend                       int64
Combined_Popularity            float64
Ad_Density                     float64
Length: 88, dtype: object
All columns are numeric or encoded.
Engineered dataset saved as 'engineered_podcast_data.csv'
Final columns: ['Episode_Length_minutes', 'Host_Popularity_percent

In [9]:
df['Listening_Time_minutes'].describe()

count    750000.000000
mean         45.437406
std          27.138306
min           0.000000
25%          23.178350
50%          43.379460
75%          64.811580
max         119.970000
Name: Listening_Time_minutes, dtype: float64