Explorer revamp #428

Open: wants to merge 43 commits into base: master
@sal-uva (Collaborator) commented Apr 23, 2024

This PR revamps how the Explorer works and looks. It specifically does the following:

  • Adds a new OPTION_DATASOURCES_TABLE user input that creates a table with dynamic columns for each enabled dataset. Input fields per row can be text, dropdown, or checkbox.
  • Uses this table to create a new Settings page where the Explorer can be enabled per data source (more table options can be added later).
  • Simplifies how custom data source templates for the Explorer are handled: they are now composed of CSS files (in static/css/explorer/) and Jinja2 templates (in webtool/templates/explorer/datasource-templates/) instead of CSS and JSON files in the data source folders that need to be verified and parsed.
  • Integrates the Explorer with the UI of 4CAT.
  • Makes the Explorer use iterate_items.
  • Re-integrates sorting so that all dataset columns can be used for sorting in the Explorer.
  • Enables new annotation columns for sorting, filtering, and other features.
  • Adds Explorer templates for Twitter and Instagram (other data sources will soon follow).
  • Deletes much unnecessary code.

Note that some unused code is still present for future updates with respect to 4CAT scrapers and database-accessible data sources generally.

…hodsinitiative/4cat into explorer-improvements

# Conflicts:
#	common/lib/config_definition.py
@sal-uva requested a review from dale-wahl April 23, 2024 14:59
@dale-wahl (Member) left a comment

I looked over the backend code and only noticed one real issue (in an edge case). I ran this version and tested out the Explorer on a number of datasets (instagram, custom, telegram, tumblr, tiktok, youtube, reddit). It looks good! Sort works well. Reddit was missing the "subject" field (it's probably the only dataset that uses subject anymore). Telegram has an issue which I will post separately.

I tested saving annotations and writing them to datasets. This worked for me (and broke one with my edge case 😬; see comment). I did notice that the new fields show up in the Dataset preview view, but the values saved to the database do not show up in preview. The values do show up after you have run "write annotations".

Deactivating and activating settings seems to work fine. There is an explorerflask settings group that could probably be merged with the Explorer group.

If you want to merge now, I would deactivate Telegram as a default (till addressed) and consider how to address my comment re: field names for annotations.

@@ -418,7 +418,7 @@ def add_field_to_parent(self, field_name, new_data, which_parent=source_dataset,
 parent_path = which_parent.get_results_path()

 if len(new_data) != which_parent.num_rows:
-    raise ProcessorException('Must have new data point for each record: parent dataset: %i, new data points: %i' % (which_parent.num_rows, len(new_data)))
+    self.dataset.update_status('The amount of new data points and existing records don\'t match; data may be misaligned (parent dataset: %i, new data points: %i)' % (which_parent.num_rows, len(new_data)))
Member

If we neither return nor raise here, the code goes on to add data to the original dataset. That may be the intent: if the list always starts at the first item, the result is fine. But if you feed in a list that starts somewhere else, records will be updated with the wrong values.

Collaborator Author

This is intended; this method could not be used with different data lengths before, but we do need this now because num_rows does not take into account when map_item() fails and creates a shorter CSV than an NDJSON. Didn't seem to cause any problems when making it a warning instead of exception!
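
To make the trade-off in this thread concrete, here is a minimal, self-contained sketch. It is not 4CAT's `add_field_to_parent`: the helper name `add_field`, the padding behaviour, and the sample rows are assumptions for illustration only. It logs a warning instead of raising on a length mismatch, and shows that positional matching is only safe when the new values start at the first record.

```python
import logging

logger = logging.getLogger(__name__)


def add_field(rows, field_name, new_data, pad_value=""):
    """Return copies of `rows` with `field_name` filled from `new_data`.

    Hypothetical helper: values are matched to rows by position only.
    On a length mismatch it warns rather than raises, and rows without
    a matching value receive `pad_value`.
    """
    if len(new_data) != len(rows):
        logger.warning(
            "Row count and new data points don't match; data may be misaligned "
            "(rows: %i, new data points: %i)", len(rows), len(new_data)
        )

    updated = []
    for index, row in enumerate(rows):
        value = new_data[index] if index < len(new_data) else pad_value
        updated.append({**row, field_name: value})
    return updated


rows = [{"id": "a"}, {"id": "b"}, {"id": "c"}]

# The values start at the first record, so "a" and "b" are labelled correctly.
aligned = add_field(rows, "label", ["x", "y"])

# The values were meant for "b" and "c", but positional matching
# silently attaches them to "a" and "b".
misaligned = add_field(rows, "label", ["y", "z"])
```

A warning keeps the processor running when a mapped CSV is shorter than its NDJSON source, but the second call shows the case the reviewer describes: nothing fails, and the values land on the wrong records.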

@@ -8,6 +8,7 @@

 from backend.lib.search import Search
 from common.lib.item_mapping import MappedItem
+from common.lib.helpers import UserInput
Member

We're importing UserInput in a few datasources unnecessarily. Probably an oversight and otherwise has no effect.

Collaborator Author

True, we should do some cleanup.

@@ -101,7 +102,7 @@ def process(self):

 # Write to top dataset
 for label, values in new_data.items():
-    self.add_field_to_parent("annotation_" + label, values, which_parent=self.source_dataset, update_existing=True)
+    self.add_field_to_parent(label, values, which_parent=self.source_dataset, update_existing=True)
Member

The add_field_to_parent function does not check for existing fields. If a user creates a field called "username", they will overwrite an existing field with the same name. If I recall correctly, I could not figure out how to check whether an existing column already had that name, because add_field_to_parent needs to be able to update existing annotation fields. This is a bit dangerous.

Member

Tested on a dataset by creating a field called "author", adding some values, and writing to the dataset. I was able to overwrite the original "author" field (which in my case was actually a dictionary of author-related data, so the overwrite caused map_item to break). I recommend reverting this change. We could even use a prefix like 4CAT_annotation_ so that it would be virtually impossible for raw data to contain that field name.

Collaborator Author

This is indeed an oversight for now, though I would like to have the option for annotation fields to have a 'clean' name; long names are quickly unreadable in spreadsheet software. This can be resolved by initially checking whether an annotation field key already exists in the dataset columns or, when annotated datasets are filtered and create a new dataset, if it is not a field registered in the annotations table for the parent dataset.

This is a sort-of edge case for now, but I'll try to resolve this next week!
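
As a rough sketch of the check described here (not the fix that was eventually made), the function below keeps the clean name when it is safe and falls back to a prefixed name when the label would clobber a source column. The function name, its parameters, and the `annotation_` fallback prefix are assumptions for illustration.

```python
def safe_annotation_field(label, existing_columns, annotation_fields, prefix="annotation_"):
    """Return a column name for an annotation that cannot overwrite source data.

    Hypothetical helper. `existing_columns` holds the dataset's current
    columns; `annotation_fields` holds labels already registered as
    annotations, which may be updated in place.
    """
    # Known annotation fields may be overwritten; unused names are free to take.
    if label in annotation_fields or label not in existing_columns:
        return label

    # The label collides with a source column: fall back to a prefixed name,
    # adding a numeric suffix if the prefixed name is also a source column.
    candidate = prefix + label
    suffix = 2
    while candidate in existing_columns and candidate not in annotation_fields:
        candidate = "%s%s_%i" % (prefix, label, suffix)
        suffix += 1
    return candidate


columns = ["id", "author", "body", "sentiment"]
annotations = {"sentiment"}

clean_name = safe_annotation_field("topic", columns, annotations)
updated_name = safe_annotation_field("sentiment", columns, annotations)
guarded_name = safe_annotation_field("author", columns, annotations)
```

With these inputs, "topic" and "sentiment" keep their readable names, while "author" becomes "annotation_author" and the source column is left alone.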

Collaborator Author

Fixed this with a back-end and front-end check

@dale-wahl
Member

And this is the issue with a Telegram dataset I ran into:

File "/opt/venv/lib/python3.8/site-packages/flask/templating.py", line 151, in render_template
2024-04-23 23:46:05     return _render(app, template, context)
2024-04-23 23:46:05   File "/opt/venv/lib/python3.8/site-packages/flask/templating.py", line 132, in _render
2024-04-23 23:46:05     rv = template.render(context)
2024-04-23 23:46:05   File "/opt/venv/lib/python3.8/site-packages/jinja2/environment.py", line 1301, in render
2024-04-23 23:46:05     self.environment.handle_exception()
2024-04-23 23:46:05   File "/opt/venv/lib/python3.8/site-packages/jinja2/environment.py", line 936, in handle_exception
2024-04-23 23:46:05     raise rewrite_traceback_stack(source=source)
2024-04-23 23:46:05   File "/usr/src/app/webtool/templates/explorer/explorer.html", line 1, in top-level template code
2024-04-23 23:46:05     {% extends "layout.html" %}
2024-04-23 23:46:05   File "/usr/src/app/webtool/templates/layout.html", line 71, in top-level template code
2024-04-23 23:46:05     {% block body %}
2024-04-23 23:46:05   File "/usr/src/app/webtool/templates/explorer/explorer.html", line 40, in block 'body'
2024-04-23 23:46:05     {% include "explorer/post.html" %}
2024-04-23 23:46:05   File "/usr/src/app/webtool/templates/explorer/post.html", line 12, in top-level template code
2024-04-23 23:46:05     {% include "explorer/datasource-templates/generic.html" %}
2024-04-23 23:46:05   File "/usr/src/app/webtool/templates/explorer/datasource-templates/generic.html", line 122, in top-level template code
2024-04-23 23:46:05     <i class="fa-solid fa-comment"></i> {{ fields.comments | commafy }}
2024-04-23 23:46:05   File "/usr/src/app/webtool/lib/template_filters.py", line 64, in _jinja2_filter_commafy
2024-04-23 23:46:05     number = int(number)
2024-04-23 23:46:05 ValueError: invalid literal for int() with base 10: '👍👍👍👍👍❤🏆🆒'

Looks like perhaps the emojis are killing the template. Telegram, I think, is the only datasource using them.
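
One possible way to keep the template from crashing, sketched under assumptions: the traceback only shows that the filter calls `int(number)`, so the function below is a stand-in and not the code in `template_filters.py`. It formats integer-like values and passes everything else through unchanged. How it would be registered with the template environment is left out.

```python
def commafy(value):
    """Format integer-like values with thousands separators.

    Hypothetical replacement filter: anything that int() cannot parse
    (emoji reactions, None, free text) is returned unchanged instead of
    raising inside the template.
    """
    try:
        return "{:,}".format(int(value))
    except (TypeError, ValueError):
        return value


formatted = commafy("1234567")
reactions = commafy("👍👍👍👍👍❤🏆🆒")
missing = commafy(None)
```

Passing the value through keeps the page rendering. A data-source-specific template for Telegram could still choose to display reactions differently.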

@sal-uva
Collaborator Author

sal-uva commented Apr 29, 2024

Since merging to master is no longer of immediate concern, I will address the above and make further improvements; will notify when it's time for a review!
