Handle Missing Values: Check for missing data in all columns. Impute numerical columns (e.g., Host_Popularity_percentage, Guest_Popularity_percentage, Listening_Time_minutes) with median or mean. For categorical columns (e.g., Podcast_Name, Genre), use mode or a placeholder like "Unknown."

Remove Duplicates: Drop duplicate rows based on id to ensure unique episodes.

Data Type Conversion: Ensure correct data types (e.g., Episode_Length_minutes as float, Number_of_Ads as integer, Publication_Day and Publication_Time as categorical).

Outlier Detection: Identify and cap outliers in numerical columns (e.g., Listening_Time_minutes, Number_of_Ads) using IQR or z-score methods.

Text Cleaning: Standardize text in Podcast_Name and Episode_Title (lowercase, remove special characters) for consistency.
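The outlier step names z-scores as an alternative to IQR. A minimal sketch of z-score capping is shown here on made-up data. The helper name `cap_by_zscore`, the threshold of 3 standard deviations, and the toy values are all assumptions for illustration and are not taken from the dataset.

```python
import pandas as pd


def cap_by_zscore(series: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Clip values lying more than `threshold` standard deviations from the mean.

    Hypothetical helper: the threshold of 3.0 is a common convention,
    not a value prescribed by this notebook.
    """
    mean = series.mean()
    std = series.std(ddof=0)  # population standard deviation
    if pd.isna(std) or std == 0:
        # Constant or empty column: nothing to cap
        return series.copy()
    lower = mean - threshold * std
    upper = mean + threshold * std
    return series.clip(lower=lower, upper=upper)


# Toy data (made up): nineteen typical values and one extreme value
toy = pd.Series([10.0] * 19 + [1000.0])
capped = cap_by_zscore(toy)

print("Max before capping:", toy.max())
print("Max after capping: ", capped.max())
```

Because a single extreme value inflates both the mean and the standard deviation, z-score capping tends to be more lenient than IQR capping on heavily skewed columns, so it may be worth comparing both on columns such as Listening_Time_minutes before choosing one.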


In [19]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer

# 1. Load the Dataset
# Replace 'train.csv' with your file path
df = pd.read_csv('train.csv')

# Display basic info to understand structure
# df.info() prints its report directly and returns None, so call it on its own
print("Dataset info:")
df.info()
print("Publication_Time sample values:\n", df['Publication_Time'].head(10))
print("Publication_Time unique values:\n", df['Publication_Time'].unique())

# 2. Handle Missing Values
# Numerical columns
num_cols = ['Episode_Length_minutes', 'Host_Popularity_percentage', 
            'Guest_Popularity_percentage', 'Number_of_Ads', 'Listening_Time_minutes']
imputer_num = SimpleImputer(strategy='median')
df[num_cols] = imputer_num.fit_transform(df[num_cols])

# Categorical columns (Publication_Time is treated as a categorical label)
cat_cols = ['Podcast_Name', 'Episode_Title', 'Genre', 'Publication_Day', 
            'Episode_Sentiment', 'Publication_Time']
imputer_cat = SimpleImputer(strategy='constant', fill_value='Unknown')
df[cat_cols] = imputer_cat.fit_transform(df[cat_cols])

# Verify no missing values remain
print("Missing values after imputation:\n", df.isnull().sum())

# 3. Remove Duplicates
# Drop duplicates based on 'id'
df = df.drop_duplicates(subset='id', keep='first')
print("Shape after removing duplicates:", df.shape)

# 4. Data Type Conversion
# Numerical columns
df['Episode_Length_minutes'] = df['Episode_Length_minutes'].astype(float)
df['Host_Popularity_percentage'] = df['Host_Popularity_percentage'].astype(float)
df['Guest_Popularity_percentage'] = df['Guest_Popularity_percentage'].astype(float)
df['Number_of_Ads'] = df['Number_of_Ads'].astype(int)
df['Listening_Time_minutes'] = df['Listening_Time_minutes'].astype(float)

# Categorical columns
df['Publication_Day'] = df['Publication_Day'].astype('category')
df['Genre'] = df['Genre'].astype('category')
df['Episode_Sentiment'] = df['Episode_Sentiment'].astype('category')
df['Publication_Time'] = df['Publication_Time'].astype('category')

# 5. Outlier Detection
# Cap outliers using IQR for numerical columns
for col in num_cols:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    df[col] = df[col].clip(lower=lower_bound, upper=upper_bound)

# Verify outlier handling
print("Stats after outlier capping:\n", df[num_cols].describe())

# 6. Text Cleaning
# Standardize text columns: lowercase, remove special characters
df['Podcast_Name'] = df['Podcast_Name'].str.lower().str.replace(r'[^a-z0-9\s]', '', regex=True)
df['Episode_Title'] = df['Episode_Title'].str.lower().str.replace(r'[^a-z0-9\s]', '', regex=True)

# Save preprocessed dataset
df.to_csv('preprocessed_podcast_data.csv', index=False)
print("Preprocessed data saved as 'preprocessed_podcast_data.csv'")
print("Final columns:", df.columns.tolist())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 750000 entries, 0 to 749999
Data columns (total 12 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   id                           750000 non-null  int64  
 1   Podcast_Name                 750000 non-null  object 
 2   Episode_Title                750000 non-null  object 
 3   Episode_Length_minutes       662907 non-null  float64
 4   Genre                        750000 non-null  object 
 5   Host_Popularity_percentage   750000 non-null  float64
 6   Publication_Day              750000 non-null  object 
 7   Publication_Time             750000 non-null  object 
 8   Guest_Popularity_percentage  603970 non-null  float64
 9   Number_of_Ads                749999 non-null  float64
 10  Episode_Sentiment            750000 non-null  object 
 11  Listening_Time_minutes       750000 non-null  float64
dtypes: float64(5), int64(1), object(6)
memory usage: 68.7+ MB
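
The `df.info()` output above shows that only three columns have gaps: Episode_Length_minutes (87,093 of 750,000 rows missing, about 11.6%), Guest_Popularity_percentage (146,030 missing, about 19.5%), and Number_of_Ads (1 missing). A small report like the following sketch makes those shares explicit before an imputation strategy is chosen. The helper name `missing_report` and the toy DataFrame are assumptions for illustration; only the column names come from the dataset.

```python
import numpy as np
import pandas as pd


def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column.

    Hypothetical helper: only columns with at least one missing value
    are listed, sorted with the most incomplete column first.
    """
    counts = df.isna().sum()
    report = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    return report[report["missing"] > 0].sort_values("missing", ascending=False)


# Toy frame (made-up values) reusing three of the dataset's column names
toy = pd.DataFrame({
    "Episode_Length_minutes": [30.0, np.nan, 45.0, 60.0],
    "Guest_Popularity_percentage": [np.nan, np.nan, 50.0, 75.0],
    "Genre": ["Comedy", "News", "Comedy", "News"],
})

report = missing_report(toy)
print(report)
```

On the real data, calling `missing_report(df)` right after loading (and before step 2) would list the three incomplete columns. A column missing roughly a fifth of its values, such as Guest_Popularity_percentage, may warrant an added missing-value indicator rather than median imputation alone.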
