Skip to content

Conversation

@OlegWock
Copy link
Member

@OlegWock OlegWock commented Nov 6, 2025

Previously trying to analyze DataFrame (for stats displayed in data table headers) which contained complex numbers or binary data caused unhandled exception which made Deepnote fallback to use default Pandas output. This PR fixes that by handling both binary and complex numbers, as well as introduces more tests for column analysis functions

Summary by CodeRabbit

  • Bug Fixes

    • More reliable handling of binary fields; safer min/max and histogram behavior for mixed, non-comparable or empty columns.
  • Improvements

    • Datetime and timedelta are now treated alongside numeric types where appropriate.
    • Safer string conversion for unusual or binary values and clearer binary string representation.
  • Tests

    • Added extensive unit tests covering column analysis, histograms, categories, binary data, and many edge cases.
  • Documentation

    • Added illustrative test-run examples.

@linear
Copy link

linear bot commented Nov 6, 2025

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Nov 6, 2025

📝 Walkthrough

Walkthrough

Replaces prior numeric-only type checks with a refined numeric/temporal taxonomy in pandas utilities: adds safe_convert_to_string, is_type_datetime_or_timedelta, is_numeric_or_temporal, is_pure_numeric; removes _is_type_number. Analyzer logic now uses these helpers for histogram, min/max, categories, and color-scale decisions, with additional guards and exception handling. cast_objects_to_string uses safe_convert_to_string. PySpark records serialization switches BinaryType handling to slicing a Column and converting sliced bytes to a Python string representation via a local UDF. Tests and fixtures updated to include binary data and many edge cases; no public APIs changed.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant DF as DataFrame
    participant Analyzer as analyze_columns
    participant Utils as pandas.utils
    participant Binner as histogram/categories
    participant Output as Result

    DF->>Analyzer: provide dataframe
    Analyzer->>Utils: query dtype (is_numeric_or_temporal / is_type_datetime_or_timedelta / is_pure_numeric)
    note right of Utils `#EFEFEF`: safe_convert_to_string used for element casting on errors
    Utils-->>Analyzer: dtype classification
    alt numeric or temporal
        Analyzer->>Binner: compute histogram / min / max (datetime-aware)
        Binner-->>Analyzer: histogram / min / max
    else non-numeric/object
        Analyzer->>Utils: cast elements to string (safe_convert_to_string)
        Analyzer->>Binner: compute categories / counts
        Binner-->>Analyzer: categories
    end
    Analyzer->>Output: assemble per-column stats (color-scale, budgets, guards)
    Output-->>DF: return analysis result
Loading
sequenceDiagram
    autonumber
    participant Schema as PySpark StructField
    participant Selector as to_records
    participant SparkFn as F
    participant UDF as binary_to_string_repr
    participant Output as Records

    Schema->>Selector: encounter BinaryType field
    Selector->>SparkFn: F.substring(F.col(field.name), 1, max_binary_bytes)
    SparkFn-->>Selector: Column expression (sliced bytes)
    Selector->>UDF: apply binary_to_string_repr(sliced_col)
    UDF-->>Selector: Python string representation (e.g., b'hello')
    Selector->>Output: include binary string in records
Loading

Suggested reviewers

  • mfranczel
  • Artmann

Pre-merge checks

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 17.81% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main change: fixing handling of multiple data types (binary, complex) in pandas column analysis.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between ca22a35 and 771cc2b.

📒 Files selected for processing (1)
  • deepnote_toolkit/ocelots/pyspark/implementation.py (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
  • GitHub Check: Test - Python 3.13
  • GitHub Check: Test - Python 3.11
  • GitHub Check: Test - Python 3.9
  • GitHub Check: Test - Python 3.10
  • GitHub Check: Build and push artifacts for Python 3.13
  • GitHub Check: Build and push artifacts for Python 3.11
  • GitHub Check: Build and push artifacts for Python 3.9
  • GitHub Check: Build and push artifacts for Python 3.12
  • GitHub Check: Build and push artifacts for Python 3.10
🔇 Additional comments (2)
deepnote_toolkit/ocelots/pyspark/implementation.py (2)

235-249: LGTM – Past type-hint concern resolved.

The function correctly handles both bytes and bytearray (addressing PySpark version differences), normalizes via bytes(), and produces Python's string repr. The UDF is created once before select_column, avoiding redundant instantiation per column.


259-264: LGTM – Binary slice length now correct.

The calculation accounts for worst-case string repr expansion (4 chars per byte) and the fix from the previous review is properly applied. Slicing before UDF conversion keeps output under MAX_STRING_CELL_LENGTH.


Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Copy link

github-actions bot commented Nov 6, 2025

📦 Python package built successfully!

  • Version: 1.1.0.dev7+e9e527c
  • Wheel: deepnote_toolkit-1.1.0.dev7+e9e527c-py3-none-any.whl
  • Install:
    pip install "deepnote-toolkit @ https://deepnote-staging-runtime-artifactory.s3.amazonaws.com/deepnote-toolkit-packages/1.1.0.dev7%2Be9e527c/deepnote_toolkit-1.1.0.dev7%2Be9e527c-py3-none-any.whl"

@codecov
Copy link

codecov bot commented Nov 6, 2025

Codecov Report

❌ Patch coverage is 80.00000% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 75.16%. Comparing base (67a97b4) to head (771cc2b).
⚠️ Report is 1 commits behind head on main.
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
deepnote_toolkit/ocelots/pandas/analyze.py 76.92% 3 Missing ⚠️
deepnote_toolkit/ocelots/pyspark/implementation.py 62.50% 3 Missing ⚠️
deepnote_toolkit/ocelots/pandas/utils.py 89.47% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #19      +/-   ##
==========================================
- Coverage   76.67%   75.16%   -1.52%     
==========================================
  Files          99       99              
  Lines        5488     5620     +132     
  Branches      751      783      +32     
==========================================
+ Hits         4208     4224      +16     
- Misses       1280     1396     +116     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

coderabbitai[bot]
coderabbitai bot previously approved these changes Nov 6, 2025
@deepnote-bot
Copy link

deepnote-bot commented Nov 6, 2025

🚀 Review App Deployment Started

📝 Description 🌐 Link / Info
🌍 Review application ra-19
🔑 Sign-in URL Click to sign-in
📊 Application logs View logs
🔄 Actions Click to redeploy
🚀 ArgoCD deployment View deployment
Last deployed 2025-11-06 22:41:56 (UTC)
📜 Deployed commit fee1f1ccc465d293b4175a83bab16c18cc5fd6ab
🛠️ Toolkit version e9e527c

coderabbitai[bot]
coderabbitai bot previously approved these changes Nov 6, 2025
@OlegWock OlegWock marked this pull request as ready for review November 6, 2025 13:39
@OlegWock OlegWock requested a review from a team as a code owner November 6, 2025 13:39
@OlegWock
Copy link
Member Author

OlegWock commented Nov 6, 2025

Can be tested with DataFrame like this — toolkit should be able to calculate stats for it (i.e. you'll see normal data table with column stats in the RA)

import pandas as pd
import numpy as np

binary_data = {
    'Feature1': np.random.randint(0, 2, 200),
    'Feature2': np.random.randint(0, 2, 200),
    'BinaryColumn': np.random.choice([b'\x80\x81\x82', b'\x83\x84\x85', b'\x86\x87\x88', b'\x89\x8A\x8B', b'\x8C\x8D\x8E'], 200),
    'ComplexColumn': [complex(np.random.randint(0, 10), np.random.randint(0, 10)) for _ in range(200)]
}

test_df = pd.DataFrame(binary_data)

test_df

m1so
m1so previously approved these changes Nov 6, 2025
@m1so m1so dismissed stale reviews from coderabbitai[bot] and themself via 4f53ec6 November 6, 2025 21:38
coderabbitai[bot]
coderabbitai bot previously approved these changes Nov 6, 2025
coderabbitai[bot]
coderabbitai bot previously approved these changes Nov 6, 2025
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 3dba348 and ca22a35.

📒 Files selected for processing (1)
  • deepnote_toolkit/ocelots/pyspark/implementation.py (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
  • GitHub Check: Test - Python 3.9
  • GitHub Check: Test - Python 3.10
  • GitHub Check: Test - Python 3.13
  • GitHub Check: Test - Python 3.11
  • GitHub Check: Test - Python 3.12
  • GitHub Check: Build and push artifacts for Python 3.10
  • GitHub Check: Build and push artifacts for Python 3.13
  • GitHub Check: Build and push artifacts for Python 3.12
  • GitHub Check: Build and push artifacts for Python 3.11
  • GitHub Check: Build and push artifacts for Python 3.9

@m1so m1so self-requested a review November 6, 2025 22:39
@m1so m1so merged commit 5bf9888 into main Nov 6, 2025
33 of 34 checks passed
@m1so m1so deleted the oleh/blu-5125-and-blu-4744 branch November 6, 2025 22:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants