fix: Multiple fixes for handling different data types in pandas columns analysis #19
Conversation
📝 Walkthrough

Replaces prior numeric-only type checks with a refined numeric/temporal taxonomy in the pandas utilities: adds safe_convert_to_string, is_type_datetime_or_timedelta, is_numeric_or_temporal, and is_pure_numeric; removes _is_type_number. Analyzer logic now uses these helpers for histogram, min/max, categories, and color-scale decisions, with additional guards and exception handling. cast_objects_to_string uses safe_convert_to_string. PySpark records serialization switches BinaryType handling to slicing a Column and converting the sliced bytes to a Python string representation via a local UDF. Tests and fixtures are updated to include binary data and many edge cases; no public APIs changed.

Sequence Diagram(s)

```mermaid
sequenceDiagram
autonumber
participant DF as DataFrame
participant Analyzer as analyze_columns
participant Utils as pandas.utils
participant Binner as histogram/categories
participant Output as Result
DF->>Analyzer: provide dataframe
Analyzer->>Utils: query dtype (is_numeric_or_temporal / is_type_datetime_or_timedelta / is_pure_numeric)
note right of Utils: safe_convert_to_string used for element casting on errors
Utils-->>Analyzer: dtype classification
alt numeric or temporal
Analyzer->>Binner: compute histogram / min / max (datetime-aware)
Binner-->>Analyzer: histogram / min / max
else non-numeric/object
Analyzer->>Utils: cast elements to string (safe_convert_to_string)
Analyzer->>Binner: compute categories / counts
Binner-->>Analyzer: categories
end
Analyzer->>Output: assemble per-column stats (color-scale, budgets, guards)
Output-->>DF: return analysis result
```
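For context, here is a minimal sketch of what these dtype helpers could look like on top of pandas.api.types. The names match the walkthrough, but the bodies are assumptions, not the PR's actual code:

```python
import pandas.api.types as ptypes


def is_type_datetime_or_timedelta(dtype) -> bool:
    # Temporal columns need datetime-aware min/max and histogram binning.
    return ptypes.is_datetime64_any_dtype(dtype) or ptypes.is_timedelta64_dtype(dtype)


def is_numeric_or_temporal(dtype) -> bool:
    # Broad check: does a histogram / min / max make sense for this column?
    return ptypes.is_numeric_dtype(dtype) or is_type_datetime_or_timedelta(dtype)


def is_pure_numeric(dtype) -> bool:
    # Excludes complex dtypes, which pandas counts as numeric but which
    # break histogram and color-scale arithmetic.
    return ptypes.is_numeric_dtype(dtype) and not ptypes.is_complex_dtype(dtype)


def safe_convert_to_string(value) -> str:
    # Element-wise fallback so one odd value (bytes, complex, ...) never
    # aborts the whole column's analysis.
    try:
        return str(value)
    except Exception:
        return repr(value)
```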
```mermaid
sequenceDiagram
autonumber
participant Schema as PySpark StructField
participant Selector as to_records
participant SparkFn as F
participant UDF as binary_to_string_repr
participant Output as Records
Schema->>Selector: encounter BinaryType field
Selector->>SparkFn: F.substring(F.col(field.name), 1, max_binary_bytes)
SparkFn-->>Selector: Column expression (sliced bytes)
Selector->>UDF: apply binary_to_string_repr(sliced_col)
UDF-->>Selector: Python string representation (e.g., b'hello')
Selector->>Output: include binary string in records
```
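A hypothetical sketch of that BinaryType path follows; binary_to_string_repr and max_binary_bytes are named in the walkthrough, while the wrapper function and the 32-byte budget are illustrative assumptions (the real code is in deepnote_toolkit/ocelots/pyspark/implementation.py):

```python
from pyspark.sql import functions as F
from pyspark.sql.types import StringType


def binary_to_string_repr(raw):
    # Render the (already sliced) bytes as their Python literal form,
    # e.g. b'hello' -> "b'hello'".
    return repr(bytes(raw)) if raw is not None else None


binary_udf = F.udf(binary_to_string_repr, StringType())


def binary_column_as_string(field_name: str, max_binary_bytes: int = 32):
    # Slice on the JVM side first so large blobs never cross the UDF
    # boundary, then convert the sliced bytes to a string representation.
    sliced = F.substring(F.col(field_name), 1, max_binary_bytes)
    return binary_udf(sliced)
```

Slicing with F.substring before the UDF keeps the per-row payload bounded, which is why the diagram shows the Column expression being built before binary_to_string_repr is applied.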
Suggested reviewers
Pre-merge checks
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
📦 Python package built successfully!
Codecov Report

❌ Patch coverage is

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main      #19      +/-   ##
==========================================
- Coverage   76.67%   75.16%   -1.52%
==========================================
  Files          99       99
  Lines        5488     5620     +132
  Branches      751      783      +32
==========================================
+ Hits         4208     4224      +16
- Misses       1280     1396     +116
```

☔ View full report in Codecov by Sentry.
🚀 Review App Deployment Started
Can be tested with a DataFrame like this; the toolkit should be able to calculate stats for it (i.e. you'll see a normal data table with column stats in the RA):

```python
import pandas as pd
import numpy as np

binary_data = {
    'Feature1': np.random.randint(0, 2, 200),
    'Feature2': np.random.randint(0, 2, 200),
    'BinaryColumn': np.random.choice([b'\x80\x81\x82', b'\x83\x84\x85', b'\x86\x87\x88', b'\x89\x8A\x8B', b'\x8C\x8D\x8E'], 200),
    'ComplexColumn': [complex(np.random.randint(0, 10), np.random.randint(0, 10)) for _ in range(200)]
}
test_df = pd.DataFrame(binary_data)
test_df
```
Actionable comments posted: 3
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
- deepnote_toolkit/ocelots/pyspark/implementation.py (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
- GitHub Check: Test - Python 3.9
- GitHub Check: Test - Python 3.10
- GitHub Check: Test - Python 3.13
- GitHub Check: Test - Python 3.11
- GitHub Check: Test - Python 3.12
- GitHub Check: Build and push artifacts for Python 3.10
- GitHub Check: Build and push artifacts for Python 3.13
- GitHub Check: Build and push artifacts for Python 3.12
- GitHub Check: Build and push artifacts for Python 3.11
- GitHub Check: Build and push artifacts for Python 3.9
Previously, trying to analyze a DataFrame (for the stats displayed in data table headers) that contained complex numbers or binary data caused an unhandled exception, which made Deepnote fall back to the default pandas output. This PR fixes that by handling both binary data and complex numbers, and adds more tests for the column analysis functions.
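To make the fix concrete, the element-wise guard can be pictured like this (a minimal sketch; the real safe_convert_to_string lives in the PR's pandas utilities and may differ):

```python
import pandas as pd


def safe_convert_to_string(value):
    # Sketch: fall back to repr() if str() fails for an exotic element.
    try:
        return str(value)
    except Exception:
        return repr(value)


df = pd.DataFrame({
    "BinaryColumn": [b"\x80\x81\x82", b"\x83\x84\x85"],
    "ComplexColumn": [complex(1, 2), complex(3, 4)],
})

# Both columns become plain strings, so category counts can be computed
# instead of the analysis raising and Deepnote falling back to pandas output.
print(df["BinaryColumn"].map(safe_convert_to_string).tolist())
# ["b'\\x80\\x81\\x82'", "b'\\x83\\x84\\x85'"]
print(df["ComplexColumn"].map(safe_convert_to_string).tolist())
# ['(1+2j)', '(3+4j)']
```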
Summary by CodeRabbit
Bug Fixes
Improvements
Tests
Documentation