Some stuff not getting parsed #1

p-acharya · 2024-06-07T02:08:13Z

Hello, thank you very much for the very useful code.

Here is my somewhat messy Python notebook.

I am running into some issues:

Issue #1

It seems the script is not recognizing compounds properly.
Here is the original Wiktionary page:

===Etymology===
From {{af|es|litografía|-ico}}.

I would like to see the full etymological path like so:
litográfico -> litografía -> lito- -> λίθος (Ancient Greek)
litográfico -> -ico -> -icus (Latin)

But as you can see in the image it seems it can't find the parent.
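
For illustration, here is a minimal sketch of how the parts of an affix template could be pulled out of the wikitext before looking up each part's own etymology. It is not taken from the repository's extraction scripts: the helper name parse_affix_template and the regular expression are hypothetical, and it only assumes that {{af|...}} is the short form of {{affix|...}} with the language code as the first positional argument.

```python
import re

# Hypothetical helper, not part of the repository's extraction scripts.
# Matches {{af|...}} or {{affix|...}} templates that contain no nested templates.
AFFIX_RE = re.compile(r"\{\{(?:af|affix)\|([^{}]*)\}\}")


def parse_affix_template(wikitext):
    """Return (lang_code, parts) for the first affix template, or None."""
    match = AFFIX_RE.search(wikitext)
    if match is None:
        return None
    fields = match.group(1).split("|")
    lang_code = fields[0]
    # Drop empty fields and named parameters such as t1=... or pos2=...
    parts = [field for field in fields[1:] if field and "=" not in field]
    return lang_code, parts


print(parse_affix_template("From {{af|es|litografía|-ico}}."))
# ('es', ['litografía', '-ico'])
```

Each returned part (litografía, -ico) could then be treated as a parent of litográfico and followed recursively, which is the behaviour I was hoping for.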

Issue #2

Similarly, for the English word copious, I can't find its etymological path, even though other cognates of copious seem to have parsed successfully :(


clefourrier · 2024-06-07T08:24:57Z

Hi! Yes, derivative or compound words are managed differently - their root will be linked to the parent, but themselves won't.
It's a design choice which was made at the time.

p-acharya · 2024-06-07T14:57:03Z

Hi! Yes, derivative or compound words are managed differently - their root will be linked to the parent, but themselves won't. It's a design choice which was made at the time.

I see, thank you for your response, I really appreciate it 🙂.
What about the English word copious?

Question 1

Also, I'm having similar issues for English words whose immediate ancestor is Middle English or Old French.

Case in point: the word example (see Wiktionary).

It doesn't seem to go all the way back to the Latin root which is what I need.

Question 2

I had to deaccent the Latin forms to get some etymological paths working properly (see the notebook I originally linked in the post).

Here I have collapsed all subcategories of Latin into a single value, "Latin", in a column called classification_lang, because Wiktionary doesn't specify "Late Latin" in the language headers of the etymology section.

See the word plethora for example.

Before deaccenting

After deaccenting

Now it correctly finds the etymological paths:

Is there something that can be done about this? I feel like my solution is very brittle. 😅
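
For reference, the deaccenting step described above can be sketched with the standard library alone. This is a simplified illustration rather than the exact notebook code; the function name deaccent is my own. It strips every combining mark, so it should probably only be applied to Latin forms (it would also remove the accents from Ancient Greek words, for example).

```python
import unicodedata


def deaccent(form):
    """Strip combining marks (macrons, breves, ...) from a word form."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", form)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose so the result is in a single, consistent normal form.
    return unicodedata.normalize("NFC", stripped)


print(deaccent("plēthōra"))
# plethora
```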

clefourrier · 2024-06-07T16:27:05Z

I think it's because copious is a borrowing, but it's probably a bug. For the fix you mentioned, we probably accidentally had an issue with diacritics. If you think your fix works we could add it to the extraction scripts :)

p-acharya · 2024-06-07T18:21:30Z

I wish I knew how to write and debug Perl code, but I'm still learning. Is there any chance you could help fix the copious bug and the example bug? I've been trying to learn Perl and debug it myself for the past week, but I haven't had much luck so far 😅.

Normalized forms

One possible way to generate a "normalized form" column is by extracting it from the link i.e.

The plēthōra hyperlink points to the URL https://en.wiktionary.org/wiki/plethora#Latin. We can grab the normalized form from the link itself and add it as a column. We would also be able to get the classification_lang this way.
I.e., all Wiktionary links have the following format: https://en.wiktionary.org/wiki/{normalized_form}#{classification_lang}

Would it be possible to add this logic to the extraction scripts?

Therefore, plēthōra_la and plethora_la would need to map to the same ID.
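
A rough sketch of the link-based approach, assuming the link format given above. The function name parse_wiktionary_link is hypothetical and nothing here comes from the existing extraction scripts; it only uses urllib.parse from the standard library.

```python
from urllib.parse import unquote, urlsplit


def parse_wiktionary_link(url):
    """Split a Wiktionary link into (normalized_form, classification_lang)."""
    parts = urlsplit(url)
    prefix = "/wiki/"
    if not parts.path.startswith(prefix):
        raise ValueError(f"not a Wiktionary entry link: {url}")
    # Page titles and section anchors are percent-encoded and use
    # underscores in place of spaces.
    normalized_form = unquote(parts.path[len(prefix):]).replace("_", " ")
    classification_lang = unquote(parts.fragment).replace("_", " ")
    return normalized_form, classification_lang


print(parse_wiktionary_link("https://en.wiktionary.org/wiki/plethora#Latin"))
# ('plethora', 'Latin')
```

With this, the displayed form plēthōra and the link target plethora would both resolve to the same (normalized_form, classification_lang) pair, which could serve as the shared ID.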

clefourrier · 2024-06-07T18:31:32Z

I have not done any Perl since, and tbh I ended up re-writing the entire code base in Python a while ago XD (it was as fast as the Perl parsing).
At the moment, I don't have the time to redo it at all (and I no longer have the files as they were deleted at the end of my PhD), but it could be a faster way to process. I could give you some pointers if needed :)

p-acharya · 2024-06-07T22:01:44Z

I have not done any Perl since, and tbh I ended up re-writing the entire code base in Python a while ago XD (it was as fast as the Perl parsing).

I have been attempting to rewrite all the Perl code in Python for the past week 😅.

I no longer have the files as they were deleted at the end of my PhD

I.e., you no longer have the Python files?

I could give you some pointers if needed.

This would be great. I would be happy to debug / contribute to the code base if I knew how lol. I could try and rewrite it in Python with some help.
