fix: Multiple fixes for handling different data types in pandas columns analysis #19
Conversation
📝 Walkthrough

Replaces prior numeric-only type checks with a refined numeric/temporal taxonomy in the pandas utilities: adds safe_convert_to_string, is_type_datetime_or_timedelta, is_numeric_or_temporal, and is_pure_numeric; removes _is_type_number. Analyzer logic now uses these helpers for histogram, min/max, categories, and color-scale decisions, with additional guards and exception handling. cast_objects_to_string uses safe_convert_to_string. PySpark records serialization switches BinaryType handling to slicing a Column and converting the sliced bytes to a Python string representation via a local UDF. Tests and fixtures are updated to include binary data and many edge cases; no public APIs changed.

Sequence Diagram(s)

```mermaid
sequenceDiagram
autonumber
participant DF as DataFrame
participant Analyzer as analyze_columns
participant Utils as pandas.utils
participant Binner as histogram/categories
participant Output as Result
DF->>Analyzer: provide dataframe
Analyzer->>Utils: query dtype (is_numeric_or_temporal / is_type_datetime_or_timedelta / is_pure_numeric)
note right of Utils: safe_convert_to_string used for element casting on errors
Utils-->>Analyzer: dtype classification
alt numeric or temporal
Analyzer->>Binner: compute histogram / min / max (datetime-aware)
Binner-->>Analyzer: histogram / min / max
else non-numeric/object
Analyzer->>Utils: cast elements to string (safe_convert_to_string)
Analyzer->>Binner: compute categories / counts
Binner-->>Analyzer: categories
end
Analyzer->>Output: assemble per-column stats (color-scale, budgets, guards)
Output-->>DF: return analysis result
```
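For context, here is a minimal sketch of what these dtype helpers could look like on top of pandas.api.types. The names match the walkthrough, but the bodies are assumptions, not the PR's actual code:

```python
import pandas.api.types as ptypes


def is_type_datetime_or_timedelta(dtype) -> bool:
    # Temporal columns need datetime-aware min/max and histogram binning.
    return ptypes.is_datetime64_any_dtype(dtype) or ptypes.is_timedelta64_dtype(dtype)


def is_numeric_or_temporal(dtype) -> bool:
    # Broad check: does a histogram / min / max make sense for this column?
    return ptypes.is_numeric_dtype(dtype) or is_type_datetime_or_timedelta(dtype)


def is_pure_numeric(dtype) -> bool:
    # Excludes complex dtypes, which pandas counts as numeric but which
    # break histogram and color-scale arithmetic.
    return ptypes.is_numeric_dtype(dtype) and not ptypes.is_complex_dtype(dtype)


def safe_convert_to_string(value) -> str:
    # Element-wise fallback so one odd value (bytes, complex, ...) never
    # aborts the whole column's analysis.
    try:
        return str(value)
    except Exception:
        return repr(value)
```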
```mermaid
sequenceDiagram
autonumber
participant Schema as PySpark StructField
participant Selector as to_records
participant SparkFn as F
participant UDF as binary_to_string_repr
participant Output as Records
Schema->>Selector: encounter BinaryType field
Selector->>SparkFn: F.substring(F.col(field.name), 1, max_binary_bytes)
SparkFn-->>Selector: Column expression (sliced bytes)
Selector->>UDF: apply binary_to_string_repr(sliced_col)
UDF-->>Selector: Python string representation (e.g., b'hello')
Selector->>Output: include binary string in records
```
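A hypothetical sketch of that BinaryType path follows; binary_to_string_repr and max_binary_bytes are named in the walkthrough, while the wrapper function and the 32-byte budget are illustrative assumptions (the real code is in deepnote_toolkit/ocelots/pyspark/implementation.py):

```python
from pyspark.sql import functions as F
from pyspark.sql.types import StringType


def binary_to_string_repr(raw):
    # Render the (already sliced) bytes as their Python literal form,
    # e.g. b'hello' -> "b'hello'".
    return repr(bytes(raw)) if raw is not None else None


binary_udf = F.udf(binary_to_string_repr, StringType())


def binary_column_as_string(field_name: str, max_binary_bytes: int = 32):
    # Slice on the JVM side first so large blobs never cross the UDF
    # boundary, then convert the sliced bytes to a string representation.
    sliced = F.substring(F.col(field_name), 1, max_binary_bytes)
    return binary_udf(sliced)
```

Slicing with F.substring before the UDF keeps the per-row payload bounded, which is why the diagram shows the Column expression being built before binary_to_string_repr is applied.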
Suggested reviewers
Pre-merge checks
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
📦 Python package built successfully!
Codecov Report

❌ Patch coverage is

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main      #19      +/-   ##
==========================================
- Coverage   76.67%   75.16%   -1.52%
==========================================
  Files          99       99
  Lines        5488     5620     +132
  Branches      751      783      +32
==========================================
+ Hits         4208     4224      +16
- Misses       1280     1396     +116
```

☔ View full report in Codecov by Sentry.
🚀 Review App Deployment Started
Can be tested with a DataFrame like this; the toolkit should be able to calculate stats for it (i.e. you'll see a normal data table with column stats in the RA):

```python
import pandas as pd
import numpy as np

binary_data = {
    'Feature1': np.random.randint(0, 2, 200),
    'Feature2': np.random.randint(0, 2, 200),
    'BinaryColumn': np.random.choice([b'\x80\x81\x82', b'\x83\x84\x85', b'\x86\x87\x88', b'\x89\x8A\x8B', b'\x8C\x8D\x8E'], 200),
    'ComplexColumn': [complex(np.random.randint(0, 10), np.random.randint(0, 10)) for _ in range(200)]
}
test_df = pd.DataFrame(binary_data)
test_df
```
Actionable comments posted: 3
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
- deepnote_toolkit/ocelots/pyspark/implementation.py (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
- GitHub Check: Test - Python 3.9
- GitHub Check: Test - Python 3.10
- GitHub Check: Test - Python 3.13
- GitHub Check: Test - Python 3.11
- GitHub Check: Test - Python 3.12
- GitHub Check: Build and push artifacts for Python 3.10
- GitHub Check: Build and push artifacts for Python 3.13
- GitHub Check: Build and push artifacts for Python 3.12
- GitHub Check: Build and push artifacts for Python 3.11
- GitHub Check: Build and push artifacts for Python 3.9
Previously, trying to analyze a DataFrame (for the stats displayed in data table headers) that contained complex numbers or binary data caused an unhandled exception, which made Deepnote fall back to the default pandas output. This PR fixes that by handling both binary data and complex numbers, and adds more tests for the column analysis functions.
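To make the fix concrete, the element-wise guard can be pictured like this (a minimal sketch; the real safe_convert_to_string lives in the PR's pandas utilities and may differ):

```python
import pandas as pd


def safe_convert_to_string(value):
    # Sketch: fall back to repr() if str() fails for an exotic element.
    try:
        return str(value)
    except Exception:
        return repr(value)


df = pd.DataFrame({
    "BinaryColumn": [b"\x80\x81\x82", b"\x83\x84\x85"],
    "ComplexColumn": [complex(1, 2), complex(3, 4)],
})

# Both columns become plain strings, so category counts can be computed
# instead of the analysis raising and Deepnote falling back to pandas output.
print(df["BinaryColumn"].map(safe_convert_to_string).tolist())
# ["b'\\x80\\x81\\x82'", "b'\\x83\\x84\\x85'"]
print(df["ComplexColumn"].map(safe_convert_to_string).tolist())
# ['(1+2j)', '(3+4j)']
```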
Summary by CodeRabbit
Bug Fixes
Improvements
Tests
Documentation