
Conversation

@KristijanArmeni
Collaborator

This is a WIP PR adding tests for the hashtag analyzer, following the recipe outlined in test_example_base.py.

It's currently failing (see below). Input and help welcome while I troubleshoot this (@soul-codes ).

Things added

  • test_hashtag_analyzer.py
  • test_data/
    • hashtag_test_input.csv
    • hashtag_test_output.json

Failing

Right now, running pytest locally in the root folder on this branch, test_hashtag_analyzer.py fails with the following error stack, originating in test_primary_analyzer:

Error stack
testing/testers.py:40: in test_primary_analyzer
    input.convert_to_parquet(input_path)
testing/testdata.py:23: in convert_to_parquet
    return self._transform(lf).sink_parquet(target_path)
../../code/venvs/mango/lib/python3.12/site-packages/polars/_utils/unstable.py:58: in wrapper
    return function(*args, **kwargs)
E       polars.exceptions.ComputeError: found more fields than defined in 'Schema'
E       
E       Consider setting 'truncate_ragged_lines=True'.

../../code/venvs/mango/lib/python3.12/site-packages/polars/lazyframe/frame.py:2385: ComputeError

If I set a breakpoint at l. 39 in test_primary_analyzer, like so:

Show breakpoint
(Pdb) l
 36             actual_output_dir = exit_stack.enter_context(TemporaryDirectory(delete=True))
 37             actual_input_dir = exit_stack.enter_context(TemporaryDirectory(delete=True))
 38  
 39             breakpoint()
 40  
 41  ->         input_path = os.path.join(actual_input_dir, "input.parquet")
 42             input.convert_to_parquet(input_path)
 43  
 44             context = TestPrimaryAnalyzerContext(
 45                 temp_dir=temp_dir,
 46                 input_parquet_path=input_path,
(Pdb) n

I see that in test_primary_analyzer, input evaluates to this:

(Pdb) input
CsvTestData(semantics={}, filepath='/Users/kriarm/project/mango-tango-cli/analyzers/hashtags/test_data/hashtag_test_input.csv', csv_config=CsvConfig(has_header=True, quote_char='"', encoding='utf8', separator=','))

But input_path evaluates to this:

(Pdb) input_path
'/var/folders/c3/d9x17dm16c3ddq8sthrvlzv80000gn/T/tmpvvnjcav8/input.parquet'

@KristijanArmeni KristijanArmeni added help wanted We need some extra helping hands or expertise in order to resolve this! domain: core Affects the app's core architecture domain: datasci Affects the analyzer's data science logic labels Mar 20, 2025
@KristijanArmeni KristijanArmeni self-assigned this Mar 20, 2025
@soul-codes
Collaborator

soul-codes commented Mar 22, 2025

Hi @KristijanArmeni I walked the debugger and found this issue:

Exception has occurred: ComputeError
found more fields than defined in 'Schema'

Consider setting 'truncate_ragged_lines=True'.
  File "/home/tar/code/mango-tango-cli/testing/testdata.py", line 23, in convert_to_parquet
    return self._transform(lf).sink_parquet(target_path)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tar/code/mango-tango-cli/testing/testers.py", line 40, in test_primary_analyzer
    input.convert_to_parquet(input_path)
  File "/home/tar/code/mango-tango-cli/analyzers/hashtags/test_hashtag_analyzer.py", line 15, in test_hashtag_analyzer
    test_primary_analyzer(
polars.exceptions.ComputeError: found more fields than defined in 'Schema'

Consider setting 'truncate_ragged_lines=True'.

The error found more fields than defined in 'Schema' typically means that the rows of the CSV file don't have a consistent number of columns, so polars doesn't know how to parse it.

It looks like your input files are tab-separated, but the CsvTestData implementation uses polars' read_csv/scan_csv under the hood, which default the separator to ,.

We should certainly note this in the test documentation. In the meantime, we have a few options here:

  • Expose the separator argument to CsvTestData that we pass down to polars.
  • Insist that contributors use polars-standard CSV files, that is, " for quoting and , for the separator.

Edit: we did in fact already express an intention to allow CsvTestData to be customizable. There is a class CsvConfig that's not exposed in testing, nor is it wired properly in the CsvTestData implementation. Let me PR a fix to that.

In any case though, your input data in analyzers/hashtags/test_data/hashtag_test_input.csv is missing a column that your test requires:

screen_name	created_at	text

The column hashtags is missing.

@KristijanArmeni
Collaborator Author

Thank you @soul-codes, good catch; I would have missed that. That's an easy fix then, but: does the input schema have to be fixed?

In #106 I was trying to implement an update such that the hashtag analyzer could work with two input data schemas (see below). That's because I encountered the two schemas in the datasets I've been analyzing.

It would be good to be flexible to some degree, though I'm sure input data will inevitably come in lots of different schemas. If we have to choose, then I'd think 1) would be more common.

Two assumed input data schemas:

  1. Hashtags are part of the text column and the analyzer extracts them via regex:

     screen_name   created_at   text
     user1         00:00:00     "Some text #hashtag1 #hashtag2"

  2. Hashtags come in a separate hashtags column of the input data:

     screen_name   created_at   text          hashtags
     user1         00:00:00     "Some text"   "[hashtag1, hashtag2]"

@soul-codes
Collaborator

soul-codes commented Mar 30, 2025

> In #106 I was trying to implement an update such that the hashtag analyzer could work with two input data schemas (see below). That's because I encountered the two schemas in the datasets I've been analyzing.

We can do this, but it would require us to break from the existing "contracts" between analyzers and the application. [Edit:] to be clear, we cannot do this right now.

A "proper" way to do this would be to have a pre-processing step where we can perform the destructive conversion from Some text #hashtag1 #hashtag2 to [hashtag1, hashtag2]. But it is difficult to come up with analysis-agnostic pre-processing. I think it would need to be part of each analyzer unless it's super generic (like stripping spaces, changing capitalization, etc.), which isn't relevant to what you're trying to accomplish here.

I would recommend that we reduce the scope of this PR by only allowing the more prevalent kind of data (i.e. choose one). I know that sucks, but for the PR's scope (having data for testing), it's more important that the existing behavior is tested than that we introduce new behavior (handling a different schema).

Let's keep the pre-processing in the back of our mind though. Whatever cannot be accomplished right now, feel free to open an issue and we can think about how to make it work as a generic contract for all analyzers.

@KristijanArmeni
Collaborator Author

KristijanArmeni commented Mar 30, 2025

Thanks @soul-codes! I agree, I was coming to the same conclusion myself, i.e. choose the more prevalent schema and expect users to conform when their data differs.

I think Schema 1 above is a reasonable default expectation (especially if it is consistent with the ngrams test).

I won't get to it today, but I can give it a shot over the week, after Tuesday or so. (If you feel like ironing it out, you're welcome to!)

@DeanEby DeanEby closed this Apr 11, 2025
@KristijanArmeni KristijanArmeni deleted the hashtags-test branch April 30, 2025 15:14