Update the parse function to accept an entity id #189

Rosencrantz · 2021-09-07T08:40:46Z

This includes a tweak to the parse function so that it generates an entity id before creating an entity. There are two ways in which this can occur:

1. Supply a list of xpath values that will be concatenated together and hashed in order to generate a unique key.
2. Do nothing and allow the parse function to automatically generate a key based on the url of the page that is being parsed.

sunu · 2021-09-07T10:46:52Z

docs/buildingcrawler.md

@@ -371,8 +371,13 @@ parse:
      author: .//meta[@name="author"]/@content
      publishedAt: .//*[@class="date"]/text()
      description: .//meta[@property="og:description"]/@content
+    keys:


I would suggest we use syntax similar to ftm mappings for consistency. Something like https://github.com/alephdata/aleph/blob/main/mappings/md_companies.yml#L15-L17

So the keys section will look like:

    keys:
      - title
      - author

The keys section needs to be updated now I think?

sunu · 2021-09-07T10:49:40Z

memorious/operations/parse.py

+            hashlib.md5(key_string.encode("utf-8")).hexdigest()
+            != hashlib.md5("".encode("utf-8")).hexdigest()
+        ):
+            entity_id = hashlib.md5(key_string.encode("utf-8")).hexdigest()


We can use make_id from memorious.helpers.key instead of making the key using hashlib.
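
Whichever helper ends up producing the digest, the comparison against the hash of an empty string in the diff above amounts in practice to checking that key_string is non-empty, and the digest only needs to be computed once. A minimal sketch using only hashlib; the function name entity_id_from_key_string is hypothetical and not part of memorious:

```python
import hashlib
from typing import Optional


def entity_id_from_key_string(key_string: str) -> Optional[str]:
    """Return an md5 hex digest for a non-empty key string, else None.

    Hypothetical helper for illustration; not part of memorious.
    """
    if not key_string:
        # Same outcome as comparing against the md5 of "", without hashing twice.
        return None
    return hashlib.md5(key_string.encode("utf-8")).hexdigest()


print(entity_id_from_key_string("Some title" + "Some author"))
print(entity_id_from_key_string(""))  # None
```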

sunu · 2021-09-07T10:58:38Z

memorious/operations/parse.py

    for key, value in properties.items():
        properties_dict[key] = html.xpath(value)

+    data["entity_id"] = hashlib.md5(data["url"].encode("utf-8")).hexdigest()


It's probably better and more consistent to make keys out of the supplied keys only instead of falling back on a default. We should instead raise an error in case of null keys so that the user can notice and change their keying strategy.
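
One way the "raise instead of falling back" behaviour could look. This is a sketch under assumptions: properties is taken to be a dict mapping property names to lists of extracted xpath values, as in the loop above; generate_entity_id and MissingKeyError are hypothetical names; and sha1 via hashlib stands in for whatever hashing helper the project settles on.

```python
import hashlib
from typing import Dict, List


class MissingKeyError(ValueError):
    """Hypothetical error raised when no usable key values were extracted."""


def generate_entity_id(properties: Dict[str, List[str]], keys: List[str]) -> str:
    """Build an entity id from the values of the user-supplied key properties."""
    parts: List[str] = []
    for key in keys:
        # Skip empty or whitespace-only values extracted by the xpath.
        parts.extend(v.strip() for v in properties.get(key, []) if v and v.strip())
    if not parts:
        # No silent fallback to the page url: make the problem visible.
        raise MissingKeyError(f"No values found for keys {keys!r}")
    digest = hashlib.sha1()
    for part in parts:
        digest.update(part.encode("utf-8"))
    return digest.hexdigest()


props = {"title": ["An article"], "author": ["Jane Doe"], "description": []}
print(generate_entity_id(props, ["title", "author"]))
try:
    generate_entity_id(props, ["description"])
except MissingKeyError as exc:
    print("error:", exc)
```

With this shape, a crawler whose xpaths stop matching fails loudly instead of quietly keying every entity on the page url.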

sunu · 2021-09-21T07:03:22Z

memorious/operations/aleph.py

+    countries: list[str] = list(context.params.get("countries", []))
+    mime_type: str = context.params.get("mime_type", "")
+
+    context.log.warn(languages)


Stray warning

sunu · 2021-09-21T07:10:56Z

memorious/operations/aleph.py

+        published_at=data.get("published_at"),
+        headers=data.get("headers", {}),
+        keywords=data.get("keywords", []),
+    )

    if data.get("aleph_folder_id"):
        meta["parent"] = {"id": data.get("aleph_folder_id")}


We are using 2 different styles here. We are treating meta as an object in the line above and as a dict here. I think we should be consistent and use a single style throughout the function.

And imo we can merge _create_meta_object and _create_document_metadata together into just one function and only set whatever metadata is available.
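
A sketch of what the merged function might look like if meta is handled as a plain dict throughout. The name create_meta and the set of fields are assumptions for illustration, drawn from the keys visible in this diff; the real function may need more.

```python
from typing import Any, Dict

# Field names taken from the diff above; the list is illustrative, not complete.
OPTIONAL_FIELDS = ("published_at", "headers", "keywords")


def create_meta(data: Dict[str, Any]) -> Dict[str, Any]:
    """Build the metadata dict, setting only the values that are available."""
    meta: Dict[str, Any] = {}
    for field in OPTIONAL_FIELDS:
        value = data.get(field)
        if value:  # skip None, empty strings, empty lists and empty dicts
            meta[field] = value
    if data.get("aleph_folder_id"):
        meta["parent"] = {"id": data["aleph_folder_id"]}
    return meta


print(create_meta({"published_at": "2021-09-21", "keywords": [], "aleph_folder_id": "42"}))
# {'published_at': '2021-09-21', 'parent': {'id': '42'}}
```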

sunu · 2021-09-21T07:20:34Z

memorious/logic/meta.py

@@ -0,0 +1,24 @@
+# from typing_extensions import TypedDict


IMO this is not generic enough to be its own module. We should put it in the aleph operation module.

But it's not a module. It's the definition of a type. I would keep type definitions separate from modules that actually do stuff.

Putting the type definition near the relevant code is the convention we're already using in other places. See for example https://github.com/alephdata/followthemoney/blob/master/followthemoney/schema.py#L24

Fair enough. I'll move the meta file into the operations module.

I actually meant the Meta type definition should live in operations/aleph.py and not in a separate file since it's very specific to the aleph_emit operations and unlikely to be reused anywhere else.

sunu · 2021-09-21T07:20:58Z

memorious/operations/parse.py

@@ -88,22 +92,41 @@ def parse_for_metadata(context, data, html):
                if value is not None:
                    data[key] = value
                break
+    meta_paths.update(data)


Why is this necessary?

sunu · 2021-09-21T07:22:31Z

memorious/operations/parse.py

+            temp_key = "".join(properties[key])
+
+        if not temp_key == "":
+            return make_id(temp_key)


make_id can take multiple arguments. You don't need to join them beforehand.
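
To illustrate the call shape being suggested, here is a stand-in with a variadic signature. make_id_sketch is a hypothetical local function written for this example; it is not the implementation in memorious.helpers.key, whose hashing details may differ. The assumption that a call with no usable parts yields None is also part of the sketch, not something confirmed here.

```python
import hashlib
from typing import Optional


def make_id_sketch(*parts: Optional[str]) -> Optional[str]:
    """Hash any number of parts into one id; None if nothing usable was passed."""
    cleaned = [p for p in parts if p]  # drop None and empty strings
    if not cleaned:
        return None
    digest = hashlib.sha1()
    for part in cleaned:
        digest.update(part.encode("utf-8"))
    return digest.hexdigest()


values = ["An article", "Jane Doe"]
# The values are unpacked as separate arguments, with no join or empty-string check.
entity_id = make_id_sketch(*values)
print(entity_id)
print(make_id_sketch())  # None
```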

sunu · 2021-09-21T07:25:25Z

docs/buildingcrawler.md

@@ -371,8 +371,13 @@ parse:
      author: .//meta[@name="author"]/@content
      publishedAt: .//*[@class="date"]/text()
      description: .//meta[@property="og:description"]/@content
+    keys:


The keys section needs to be updated now I think?

sunu · 2021-09-21T07:27:58Z

example/config/simple_article_scraper.yml

@@ -45,6 +45,10 @@ pipeline:
        author: .//meta[@name="author"]/@content
        publishedAt: .//*[@class="date"]/text()
        description: .//meta[@property="og:description"]/@content
+      keys:


Maybe the method needs to be changed to use the built-in parse method instead of a custom Python method to match the documentation example?

Sure. The problem with doing that though is that the scraper won't be able to extract the body of the article, which is why the custom script exists. I guess, as it's an example it doesn't really matter too much, but that is why we have a difference.

Personally, I don't have an issue with the documentation not matching the example in the repo.

sunu · 2021-10-19T08:08:19Z

memorious/operations/meta.py

+from typing import Optional, TypedDict
+
+
+class MetaBase(TypedDict):


The MetaBase vs Meta distinction is no longer used, so I guess it's safe to merge the two together now?
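
If no key has to be required, the two classes could collapse into a single TypedDict declared with total=False. A sketch only: the fields listed are assumptions based on names appearing elsewhere in this PR, not the actual definition.

```python
from typing import Dict, List, TypedDict


class Meta(TypedDict, total=False):
    """Single metadata type; total=False makes every key optional."""

    published_at: str
    headers: Dict[str, str]
    keywords: List[str]
    languages: List[str]
    countries: List[str]
    mime_type: str
    parent: Dict[str, str]


meta: Meta = {"mime_type": "text/html"}
meta["parent"] = {"id": "42"}
print(meta)
# {'mime_type': 'text/html', 'parent': {'id': '42'}}
```

At runtime a TypedDict is an ordinary dict, so this also fits the "single dict style" suggested for the aleph operation; the type only matters to a static checker.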

sunu · 2021-10-19T08:09:31Z

Could you fix the merge conflict too? I think it's from the changes Simon made to fix a couple of issues in the aleph_emit operation.

sunu

Looks like the merge conflict wasn't fixed the way it should be, so the code is left in a broken state. A test for the aleph operations should have caught this, but we don't have any tests for it. Maybe it's worth adding one now since we are making changes to the aleph_emit operations.

sunu · 2021-10-21T05:14:20Z

memorious/operations/aleph.py

+from memorious.logic.context import Context
+
+
+class Meta(MetaBase, total=False):


MetaBase is undefined

sunu · 2021-10-21T05:15:12Z

memorious/operations/aleph.py

+        make_key(collection_id, foreign_id, content_hash)
+    )
+
+    if document_id:
        context.log.info("Skip aleph upload: %s", foreign_id)
        data["aleph_id"] = document["id"]


document is undefined

Rosencrantz · 2022-11-18T08:25:17Z

This died

Update the parse function to accept an entity id, or to generate one if none is supplied

e7d6fde

Rosencrantz added this to the 2.4.2 milestone Sep 7, 2021

Rosencrantz requested a review from sunu September 7, 2021 08:40

Rosencrantz self-assigned this Sep 7, 2021

Rosencrantz linked an issue Sep 7, 2021 that may be closed by this pull request

Using the standard parse function for creating entities does not generate an entity_id #187

Open

sunu reviewed Sep 7, 2021

Dealing with review feedback. Don't autocreate entity_id, allow user defined

b9e2b60

Rosencrantz requested a review from sunu September 8, 2021 10:04

Rosencrantz added 6 commits September 9, 2021 10:32

Add some typing to make things more understandable maybe

e1cf239

Remove comments

d2baf9d

More typing changes

9c46a11

Tweaking the simple_article_scraper

95ff207

Add missing file

eab0424

Get some tests passing, remove the Data thing, wasn't a good idea after all

33c2182

sunu reviewed Sep 21, 2021

Rosencrantz added 5 commits October 1, 2021 10:36

Addressing feedback from Sunu

b592e8b

Further review feedback and stuff

f378a14

Update documentation to match changes to key generation

880d1ef

More documentation tweaks

a56e43a

Remove commented out code

3203a51

Rosencrantz requested a review from sunu October 6, 2021 06:28

Move meta into the operations module

1196764

sunu reviewed Oct 19, 2021

Rosencrantz added 2 commits October 20, 2021 16:26

All the review feedback

5acee7f

Merge branch 'master' into rosencrantz/parse-entity-id

b130665

Rosencrantz requested a review from sunu October 20, 2021 14:41

sunu suggested changes Oct 21, 2021

Added some more tests, fixed errors caused by a bad merge

dc45471

Rosencrantz added 3 commits October 26, 2021 12:56

Not addressing remote test failure

8a88907

Remove failing remote test

4d570c3

Remove failing remote test

0d7e6ec

Rosencrantz closed this Nov 18, 2022
