
Conversation

@KristijanArmeni
Collaborator

This is a WIP PR adding tests for the hashtag analyzer, following the recipe outlined in test_example_base.py.

It's currently failing (see below). Input and help welcome while I troubleshoot this (@soul-codes ).

Things added

  • test_hashtag_analyzer.py
  • test_data/
    • hashtag_test_input.csv
    • hashtag_test_output.json

Failing

Right now, running pytest locally in the root folder on this branch, test_hashtag_analyzer.py fails with the following error stack, originating in test_primary_analyzer:

Error stack
testing/testers.py:40: in test_primary_analyzer
    input.convert_to_parquet(input_path)
testing/testdata.py:23: in convert_to_parquet
    return self._transform(lf).sink_parquet(target_path)
../../code/venvs/mango/lib/python3.12/site-packages/polars/_utils/unstable.py:58: in wrapper
    return function(*args, **kwargs)
E       polars.exceptions.ComputeError: found more fields than defined in 'Schema'
E       
E       Consider setting 'truncate_ragged_lines=True'.

../../code/venvs/mango/lib/python3.12/site-packages/polars/lazyframe/frame.py:2385: ComputeError

If I set a breakpoint at l. 39 in test_primary_analyzer, like so:

Show breakpoint
(Pdb) l
 36             actual_output_dir = exit_stack.enter_context(TemporaryDirectory(delete=True))
 37             actual_input_dir = exit_stack.enter_context(TemporaryDirectory(delete=True))
 38  
 39             breakpoint()
 40  
 41  ->         input_path = os.path.join(actual_input_dir, "input.parquet")
 42             input.convert_to_parquet(input_path)
 43  
 44             context = TestPrimaryAnalyzerContext(
 45                 temp_dir=temp_dir,
 46                 input_parquet_path=input_path,
(Pdb) n

I see that in test_primary_analyzer, input evaluates to this:

(Pdb) input
CsvTestData(semantics={}, filepath='/Users/kriarm/project/mango-tango-cli/analyzers/hashtags/test_data/hashtag_test_input.csv', csv_config=CsvConfig(has_header=True, quote_char='"', encoding='utf8', separator=','))

But input_path evaluates to this:

(Pdb) input_path
'/var/folders/c3/d9x17dm16c3ddq8sthrvlzv80000gn/T/tmpvvnjcav8/input.parquet'

@KristijanArmeni KristijanArmeni added help wanted We need some extra helping hands or expertise in order to resolve this! domain: core Affects the app's core architecture domain: datasci Affects the analyzer's data science logic labels Mar 20, 2025
@KristijanArmeni KristijanArmeni self-assigned this Mar 20, 2025
@soul-codes
Collaborator

soul-codes commented Mar 22, 2025

Hi @KristijanArmeni I walked the debugger and found this issue:

Exception has occurred: ComputeError
found more fields than defined in 'Schema'

Consider setting 'truncate_ragged_lines=True'.
  File "/home/tar/code/mango-tango-cli/testing/testdata.py", line 23, in convert_to_parquet
    return self._transform(lf).sink_parquet(target_path)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tar/code/mango-tango-cli/testing/testers.py", line 40, in test_primary_analyzer
    input.convert_to_parquet(input_path)
  File "/home/tar/code/mango-tango-cli/analyzers/hashtags/test_hashtag_analyzer.py", line 15, in test_hashtag_analyzer
    test_primary_analyzer(
polars.exceptions.ComputeError: found more fields than defined in 'Schema'

Consider setting 'truncate_ragged_lines=True'.

The error found more fields than defined in 'Schema' typically means that the rows of the CSV file don't have a consistent number of columns, so polars doesn't know how to parse it.

It looks like your input files are tab-separated, but the CsvTestData implementation uses polars' read_csv/scan_csv under the hood, which default the separator to ,.

We should certainly note this in the test documentation. In the meantime, we have a few options here:

  • Expose the separator argument to CsvTestData that we pass down to polars.
  • Insist that contributors use polars-standard CSV files, that is, " for quoting and , for the separator.

Edit: we did in fact already express an intention to allow CsvTestData to be customizable. There is a class CsvConfig that's not exposed in testing, nor is it wired properly in the CsvTestData implementation. Let me PR a fix to that.

In any case though, your input data in analyzers/hashtags/test_data/hashtag_test_input.csv is missing a column that your test requires:

screen_name	created_at	text

The column hashtags is missing.

@KristijanArmeni
Collaborator Author

Thank you @soul-codes, good catch; I would have missed that. That's an easy fix then, but: does the input schema have to be fixed?

In #106 I was trying to implement an update such that the hashtag analyzer could work with two input data schemas (see below). That's because I encountered the two schemas in the datasets I've been analyzing.

It would be good to be flexible to some degree, though I'm sure input data will inevitably come in lots of different schemas. If we have to choose, then I'd think 1) would be more common.

Two assumed input data schemas:

  1. Hashtags are part of the text column and the analyzer extracts them via regex:

     screen_name   created_at   text
     user1         00:00:00     "Some text #hashtag1 #hashtag2"

  2. Hashtags come in a separate hashtags column of the input data:

     screen_name   created_at   text          hashtags
     user1         00:00:00     "Some text"   "[hashtag1, hashtag2]"

@soul-codes
Collaborator

soul-codes commented Mar 30, 2025

> In #106 I was trying to implement an update such that the hashtag analyzer could work with two input data schemas (see below). That's because I encountered the two schemas in the datasets I've been analyzing.

We can do this, but it would require us to break from the existing "contracts" between analyzers and the application. [Edit:] to be clear, we cannot do this right now.

A "proper" way to do this would be to have a pre-processing step where we can perform the destructive conversion from Some text #hashtag1 #hashtag2 to [hashtag1, hashtag2]. But it is difficult to come up with analysis-agnostic pre-processing. I think it would need to be part of each analyzer unless it's super generic (like stripping spaces, changing capitalization, etc.), which isn't relevant to what you're trying to accomplish here.

I would recommend that we reduce the scope of this PR by only allowing the more prevalent kind of data (i.e. choose one). I know that sucks, but for the PR's scope (having data for testing), it's more important that the existing behavior is tested than that we introduce new behavior (handling a different schema).

Let's keep the pre-processing in the back of our mind though. Whatever cannot be accomplished right now, feel free to open an issue and we can think about how to make it work as a generic contract for all analyzers.

@KristijanArmeni
Collaborator Author

KristijanArmeni commented Mar 30, 2025

Thanks @soul-codes! I agree, I was coming to the same conclusion myself, i.e. choose the more prevalent schema and expect users to conform when their data differs.

I think Schema 1 above is a reasonable default expectation (especially if it is consistent with the ngrams test).

I won't get to it today, but I can give it a shot over the week, after Tuesday or so. (If you feel like ironing it out, you're welcome to!)

@DeanEby DeanEby closed this Apr 11, 2025
@KristijanArmeni KristijanArmeni deleted the hashtags-test branch April 30, 2025 15:14