In [30]:
import pandas as pd

# 1. Load the DataFrame
df = pd.read_excel(
    "data/WF 3 F1-R12 - Great Britain.xlsx",
    sheet_name="Worksheet",
    header=5
)

# 2. Define the columns to check and sort by
subset_cols = [
    'TV-Channel', 'Channel ID', 'Start', 'End', 'Region', 'Market',
    'Duration', 'Combined', 'Broadcaster', 'Program Description',
    'Program Title', 'TVR% 3+', 'Aud Metered (000s) 3+', 'Start (UTC)',
]

# --- CRITICAL FIX: CONVERT TIME COLUMNS TO STRING (OBJECT) ---
# This resolves the 'datetime.datetime' vs 'datetime.time' comparison error.
for col in ['Start', 'End']:
    # .apply(str) converts every cell, whatever its type, to its string form (NaN becomes 'nan').
    print(f"Standardizing column '{col}' to string type using .apply(str).")
    df[col] = df[col].apply(str)

# --- Secondary Fix (Addressing the Deprecation Warning and other 'category' issues) ---
# Ensure other non-time columns are not problematic 'category' dtypes
for col in subset_cols:
    # Using the modern isinstance check to address the DeprecationWarning
    if isinstance(df[col].dtype, pd.CategoricalDtype):
        print(f"Converting column '{col}' from 'category' to 'object' for sorting.")
        df[col] = df[col].astype('object')

# 3. Find all duplicate rows
duplicate_mask = df.duplicated(subset=subset_cols, keep=False)
duplicate_rows = df[duplicate_mask].copy()

# 4. Sort the DataFrame (This should now work reliably)
duplicate_rows = duplicate_rows.sort_values(by=subset_cols)

# Display the results
print(f"\nTotal duplicate rows found: {duplicate_rows.shape[0]}")
print("\nFirst 10 duplicate rows (based on TV-Channel, Channel ID, Start, End):")

# Columns for output display ('Program Title' is already in subset_cols, so only 'Date' is added)
display_cols = subset_cols + ['Date']
print(duplicate_rows[display_cols].head(10))

Standardizing column 'Start' to string type using .apply(str).
Standardizing column 'End' to string type using .apply(str).

Total duplicate rows found: 116

First 10 duplicate rows (based on TV-Channel, Channel ID, Start, End):
                      TV-Channel  Channel ID     Start       End  Region  \
1674  Arenasport 1 (Pan Balkans)      5105.0  13:30:00  14:30:00  Europe   
1676  Arenasport 1 (Pan Balkans)      5105.0  13:30:00  14:30:00  Europe   
1219           Art Sport 6 (KOS)     39130.0  07:00:00  10:45:00  Europe   
1225           Art Sport 6 (KOS)     39130.0  07:00:00  10:45:00  Europe   
1815                DAZN 5 (PRT)     23292.0  21:30:00  22:00:00  Europe   
1821                DAZN 5 (PRT)     23292.0  21:30:00  22:00:00  Europe   
2046          El Garage TV (ESP)      6655.0  10:00:00  10:30:00  Europe   
2050          El Garage TV (ESP)      6655.0  10:00:00  10:30:00  Europe   
2054          El Garage TV (ESP)      6655.0  10:00:00  10:30:00  Europe   
2058       

In [34]:
import pandas as pd
from pandas.api.types import CategoricalDtype

# 1. Load the DataFrame
df = pd.read_excel(
    "data/WF 3 F1-R12 - Great Britain.xlsx",
    sheet_name="Worksheet",
    header=5
)

# 2. Define the columns to check and sort by
# The 14 columns from the previous cell, plus 'End (UTC)' and 'Day'
subset_cols = [
    'TV-Channel', 'Channel ID', 'Start', 'End', 'Region', 'Market',
    'Duration', 'Combined', 'Broadcaster', 'Program Description',
    'Program Title', 'TVR% 3+', 'Aud Metered (000s) 3+', 'Start (UTC)',
    'End (UTC)', 'Day',  # <-- new in this cell
]

# --- CRITICAL FIX: CONVERT ALL POTENTIAL TIME COLUMNS TO STRING (OBJECT) ---
# We now include 'Start (UTC)' and 'End (UTC)' in the time standardization list.
time_cols_to_standardize = ['Start', 'End', 'Start (UTC)', 'End (UTC)']

for col in time_cols_to_standardize:
    # .apply(str) converts every cell, whatever its type, to its string form (NaN becomes 'nan').
    print(f"Standardizing column '{col}' to string type using .apply(str).")
    df[col] = df[col].apply(str)

# --- Secondary Fix (Addressing 'category' issues) ---
# Ensure other columns are not problematic 'category' dtypes
for col in subset_cols:
    # Using the modern isinstance check to address the DeprecationWarning
    if isinstance(df[col].dtype, CategoricalDtype):
        print(f"Converting column '{col}' from 'category' to 'object' for sorting.")
        df[col] = df[col].astype('object')

# 3. Find all duplicate rows
duplicate_mask = df.duplicated(subset=subset_cols, keep=False)
duplicate_rows = df[duplicate_mask].copy()

# 4. Sort the DataFrame (This should now work reliably)
duplicate_rows = duplicate_rows.sort_values(by=subset_cols)

# Display the results
print(f"\nTotal duplicate rows found: {duplicate_rows.shape[0]}")
print(f"Duplicates based on {len(subset_cols)} criteria.")

# Columns for output display
display_cols_for_head = [col for col in subset_cols if col not in ['Program Description', 'Program Title']] + ['Date']
print(f"\nFirst 10 duplicate rows based on {len(subset_cols)} criteria:")
print(duplicate_rows[display_cols_for_head].head(10))

Standardizing column 'Start' to string type using .apply(str).
Standardizing column 'End' to string type using .apply(str).
Standardizing column 'Start (UTC)' to string type using .apply(str).
Standardizing column 'End (UTC)' to string type using .apply(str).

Total duplicate rows found: 31
Duplicates based on 16 criteria.

First 10 duplicate rows based on 16 criteria:
     TV-Channel  Channel ID Start  End  Region       Market Duration Combined  \
2973     Canal+         NaN   nan  nan  Global  Switzerland      NaN      NaN   
2974     Canal+         NaN   nan  nan  Global  Switzerland      NaN      NaN   
2975     Canal+         NaN   nan  nan  Global  Switzerland      NaN      NaN   
2976     Canal+         NaN   nan  nan  Global  Switzerland      NaN      NaN   
2977     Canal+         NaN   nan  nan  Global  Switzerland      NaN      NaN   
2978     Canal+         NaN   nan  nan  Global  Switzerland      NaN      NaN   
2979     Canal+         NaN   nan  nan  Global  Switzerland  

Problem 1: The Initial AttributeError
Error: AttributeError: 'DataFrame' object has no attribute 'dtype'

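Cause: You used .dtype (singular) on the DataFrame. Only a Series (a single column) has .dtype; a DataFrame exposes .dtypes (plural). Both are attributes, not methods, so neither takes parentheses.

Solution: We switched to df.dtypes to list the type of every column, and to df[col].dtype when checking a single column.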
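To make the singular/plural distinction concrete, here is a minimal sketch. The two-row frame is a made-up stand-in for the loaded worksheet, not the real data.

```python
import pandas as pd

# Hypothetical two-column frame standing in for the loaded worksheet.
df = pd.DataFrame({"TV-Channel": ["A", "B"], "Channel ID": [5105.0, 39130.0]})

# DataFrame: one dtype per column, so the attribute is plural and returns a Series.
column_dtypes = df.dtypes

# Series (a single column): the attribute is singular.
channel_id_dtype = df["Channel ID"].dtype

# The singular form does not exist on a DataFrame, which is what raised the error.
has_singular = hasattr(df, "dtype")

print(column_dtypes)
print(channel_id_dtype)
print(has_singular)
```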

Problem 2: The Cascading TypeError (Unordered Categories)
Error: TypeError: 'values' is not ordered, please explicitly specify the categories order...

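Cause: This message comes from the pandas Categorical constructor, which raises it when asked to build an ordered categorical from values it cannot sort. In the pandas versions that produce it during a sort, a multi-column sort_values() builds such an ordered categorical from each sort key internally. The error therefore means that at least one sorting column (like TV-Channel or Start) held values that could not be put in order. At this stage we suspected an unordered category dtype.

Solution: We added a step that checks every column in subset_cols for a category dtype and converts it to the general object type. This rules out categorical columns as the cause. It only helps when the values inside the column can be compared with each other, though: an object column that mixes incompatible types still cannot be sorted, which is what Problem 3 turned out to be.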
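The check-and-convert step can be tried in isolation with the sketch below. The frame and its values are invented for illustration. If the `converted` list comes back empty on the real data (the runs above print no "Converting column" line), then no column was categorical and the cause lies elsewhere.

```python
import pandas as pd

# Hypothetical frame: one categorical column and one plain text column.
df = pd.DataFrame({
    "TV-Channel": pd.Categorical(["DAZN 5 (PRT)", "Art Sport 6 (KOS)"]),
    "Region": ["Europe", "Europe"],
})

# Convert every categorical column to plain object dtype and remember which ones changed.
converted = []
for col in df.columns:
    if isinstance(df[col].dtype, pd.CategoricalDtype):
        df[col] = df[col].astype("object")
        converted.append(col)

print(converted)

# With plain strings in every key, a multi-column sort works.
sorted_df = df.sort_values(by=["TV-Channel", "Region"])
print(sorted_df)
```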

Problem 3: The Root Cause TypeError (Mixed Datetime Types)
Error: TypeError: '<' not supported between instances of 'datetime.datetime' and 'datetime.time'

Cause: This was the most critical issue, which was hidden by the Problem 2 error. It revealed that the Start and/or End columns contained a mix of data:

Full timestamp objects (datetime.datetime)

Time-only objects (datetime.time)

Potentially NaNs or strings.

Python cannot compare these mixed types directly, causing the sort operation to fail at its most fundamental level.
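One way to confirm such a mix before sorting is to count the Python type of every cell. The sketch below uses an invented Start column, not the real worksheet, and then reproduces the failing comparison directly.

```python
import datetime as dt
import pandas as pd

# Hypothetical 'Start' column mixing the value types described above.
start = pd.Series(
    [dt.datetime(2025, 7, 4, 13, 30), dt.time(13, 30), float("nan"), "07:00:00"],
    dtype="object",
)

# Count the Python type of every cell to see what the column really holds.
type_counts = start.map(lambda v: type(v).__name__).value_counts()
print(type_counts)

# Comparing a full timestamp with a time-only value raises the same TypeError as the sort.
try:
    dt.datetime(2025, 7, 4, 13, 30) < dt.time(13, 30)
    comparison_failed = False
except TypeError as err:
    comparison_failed = True
    print(err)
```

On the real data, the same `map(...).value_counts()` line applied to `df['Start']` or `df['End']` would show how many cells of each type the column holds.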

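Solution (Final Fix): We forced every value to a string so that all cells share one comparable type:

df[col] = df[col].apply(str)

This line converts every value in the Start and End columns into a string (e.g., 2025-07-04 00:00:00 or 12:30:00), which allows sort_values() to complete the operation using string comparison. Two side effects are worth knowing. The resulting order is lexicographic rather than chronological, so full timestamps and time-only values do not interleave by time of day. Missing values also become the literal string 'nan', as the Start and End columns in the second output above show.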
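If the date part of a full timestamp is not needed, one possible refinement is to reduce every value to an HH:MM:SS string and keep missing values missing. This is only a sketch: `to_time_string` is a hypothetical helper, the sample column is invented, and it assumes the time of day is the only part of Start and End that matters for the duplicate check.

```python
import datetime as dt
import pandas as pd

def to_time_string(value):
    """Hypothetical helper: return 'HH:MM:SS', or None for missing values."""
    if pd.isna(value):                      # NaN, None and NaT stay missing
        return None
    if isinstance(value, dt.datetime):      # full timestamp: keep only the time of day
        return value.time().isoformat(timespec="seconds")
    if isinstance(value, dt.time):          # time-only value
        return value.isoformat(timespec="seconds")
    return str(value)                       # anything else, e.g. text already in the cell

# Invented sample column with the same kinds of values as the real one.
start = pd.Series(
    [dt.datetime(2025, 7, 4, 13, 30), dt.time(7, 0), float("nan"), "21:30:00"],
    dtype="object",
)

normalized = start.map(to_time_string)
print(normalized)

# Missing values are placed last instead of being sorted as the text 'nan'.
ordered = normalized.sort_values(na_position="last")
print(ordered)
```

Because every string now has the same zero-padded HH:MM:SS shape, string order and time-of-day order agree.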