Some stuff not getting parsed #1

Open
p-acharya opened this issue Jun 7, 2024 · 6 comments

@p-acharya

p-acharya commented Jun 7, 2024

Hello, thank you very much for the very useful code.

Here is my somewhat messy Python notebook.

I am running into some issues:

Issue #1

[image: CleanShot 2024-06-06 at 19 09 26@2x]

It seems the script is not recognizing compounds properly.
Here is the etymology section from the original Wiktionary page:

===Etymology===
From {{af|es|litografía|-ico}}.

I would like to see the full etymological path like so:
litográfico -> litografía -> lito- -> λίθος (Ancient Greek)
litográfico -> -ico -> -icus (Latin)

But as you can see in the image it seems it can't find the parent.
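For reference, here is a minimal Python sketch of how the components of an `af` template could be turned into parent links for the compound itself. It is not the repository's Perl logic; the function names `parse_affix` and `affix_edges` are hypothetical, and the `word_lang` ID format is assumed from the `plethora_la` style used later in this thread. It only handles the simple positional case shown above.

```python
import re

# Matches {{af|...}} or {{affix|...}} and captures the pipe-separated arguments.
AFFIX_RE = re.compile(r"\{\{(?:af|affix)\|([^{}]*)\}\}")


def parse_affix(wikitext):
    """Return (lang, parts) for the first affix template found, or None."""
    match = AFFIX_RE.search(wikitext)
    if match is None:
        return None
    fields = match.group(1).split("|")
    lang = fields[0]
    # Keep positional components only; named parameters contain "=".
    parts = [f for f in fields[1:] if f and "=" not in f]
    return lang, parts


def affix_edges(word, wikitext):
    """Build (child_id, parent_id) pairs, one per component of the compound."""
    parsed = parse_affix(wikitext)
    if parsed is None:
        return []
    lang, parts = parsed
    return [(f"{word}_{lang}", f"{part}_{lang}") for part in parts]


edges = affix_edges("litográfico", "From {{af|es|litografía|-ico}}.")
for child, parent in edges:
    print(child, "->", parent)
```

Each component then becomes a starting point for its own lookup, which is what would be needed to continue towards lito- and -icus.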


Issue #2

[image] [image]

Similarly, for the English word copious, I can't find its etymological path, even though other cognates of copious seem to have been parsed successfully :(

@clefourrier
Owner

Hi! Yes, derivative or compound words are handled differently: their root will be linked to the parent, but they themselves won't be.
It's a design choice that was made at the time.

@p-acharya
Author

p-acharya commented Jun 7, 2024

> Hi! Yes, derivative or compound words are managed differently - their root will be linked to the parent, but themselves won't. It's a design choice which was made at the time.

I see, thank you for your response, I really appreciate it 🙂.
What about the English word copious?

Question 1

Also, I'm having similar issues with English words whose immediate ancestor is Middle English or Old French.

Case in point: the word example (see Wiktionary).

[image] [image]

It doesn't seem to go all the way back to the Latin root which is what I need.
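To make the expected behaviour concrete, here is a small Python sketch of following parent links until no further ancestor is known. The `parents` dictionary is a hand-entered toy mapping for illustration only (it is not output from the extraction scripts), the single-parent layout is a simplifying assumption, and `etymological_path` is a hypothetical helper name. The codes `enm`, `fro` and `la` are the Wiktionary language codes for Middle English, Old French and Latin.

```python
def etymological_path(word_id, parents):
    """Follow child -> parent links from word_id until no parent is recorded."""
    path = [word_id]
    seen = {word_id}
    while path[-1] in parents:
        parent = parents[path[-1]]
        if parent in seen:  # stop if the data contains a cycle
            break
        path.append(parent)
        seen.add(parent)
    return path


# Toy data: each word ID maps to its immediate ancestor.
parents = {
    "example_en": "exaumple_enm",
    "exaumple_enm": "essample_fro",
    "essample_fro": "exemplum_la",
}

print(" -> ".join(etymological_path("example_en", parents)))
```

If any intermediate link (for example the Middle English or Old French step) is missing from the extracted data, the walk stops there, which would look like the truncated path in the screenshots.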


Question 2

I had to deaccent the Latin forms to get some etymological paths working properly (see the notebook I originally linked in the post).

Here I have collapsed all subcategories of Latin into a single value, "Latin", in a column called classification_lang, because Wiktionary doesn't specify "Late Latin" in the language headers of the etymology section.

[image]

See the word plethora, for example.

Before deaccenting

[image]

After deaccenting

[image]

Now it correctly finds the etymological paths:
[image]

Is there something that can be done about this? I feel like my solution is very brittle. 😅
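For anyone following along, the deaccenting step can be sketched in Python with the standard `unicodedata` module. This is not the code from the linked notebook; `deaccent` and `normalize_form` are hypothetical names, and restricting the stripping to Latin (`la`) is an assumption made here, since in languages such as Spanish the accents are part of the spelling.

```python
import unicodedata

# Language codes whose forms should have diacritics removed (assumption: Latin only).
STRIP_DIACRITICS = {"la"}


def deaccent(form):
    """Remove combining marks such as macrons: 'plēthōra' -> 'plethora'."""
    decomposed = unicodedata.normalize("NFD", form)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


def normalize_form(form, lang):
    """Deaccent only for languages listed in STRIP_DIACRITICS."""
    return deaccent(form) if lang in STRIP_DIACRITICS else form


print(normalize_form("plēthōra", "la"))
print(normalize_form("litografía", "es"))
```

Stripping every combining mark is a blunt tool, so it is probably safest applied per language rather than to the whole table.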

@clefourrier
Owner

I think it's because copious is a borrowing, but it's probably a bug. As for the fix you mentioned, we probably accidentally introduced an issue with diacritics. If you think your fix works, we could add it to the extraction scripts :)

@p-acharya
Author

I wish I knew how to write and debug Perl code, but I'm still learning. Is there any chance you could help fix the copious bug and the example bug? I've been trying to learn Perl and debug it myself for the past week, but I haven't had much luck so far 😅.


Normalized forms

One possible way to generate a "normalized form" column is to extract it from the link, i.e.
[image]
The plēthōra hyperlink points to the URL https://en.wiktionary.org/wiki/plethora#Latin. We can grab the normalized form from the link itself and add it as a column. We would also be able to get the classification_lang this way.
That is, all Wiktionary links follow this format: https://en.wiktionary.org/wiki/{normalized_form}#{classification_lang}

Would it be possible to add this logic to the extraction scripts?

[image]

Therefore, plēthōra_la and plethora_la would need to map to the same ID.
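A rough Python sketch of the link-based idea, using only the standard library. It assumes the hyperlinks have already been pulled out of the page, and `parse_wiktionary_link` is a hypothetical helper, not something that exists in the extraction scripts.

```python
from urllib.parse import unquote, urlsplit


def parse_wiktionary_link(url):
    """Split a Wiktionary link into (normalized_form, classification_lang).

    Returns None when the URL does not look like a /wiki/ page link.
    classification_lang is None when the link has no #fragment.
    """
    parts = urlsplit(url)
    prefix = "/wiki/"
    if not parts.path.startswith(prefix):
        return None
    # Page titles and section anchors use underscores in place of spaces.
    normalized_form = unquote(parts.path[len(prefix):]).replace("_", " ")
    classification_lang = unquote(parts.fragment).replace("_", " ") or None
    return normalized_form, classification_lang


print(parse_wiktionary_link("https://en.wiktionary.org/wiki/plethora#Latin"))
```

Keying word IDs on the returned pair instead of on the displayed text would make plēthōra and plethora resolve to the same entry, since both are displayed forms of the same link target.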

@clefourrier
Owner

I have not done any Perl since, and tbh I ended up re-writing the entire code base in Python a while ago XD (it was as fast as the Perl parsing).
At the moment, I don't have the time to redo it at all (and I no longer have the files as they were deleted at the end of my PhD), but it could be a faster way to process. I could give you some pointers if needed :)

@p-acharya
Author

p-acharya commented Jun 7, 2024

> I have not done any Perl since, and tbh I ended up re-writing the entire code base in Python a while ago XD (it was as fast as the Perl parsing).

I have been attempting to rewrite all the Perl code in Python for the past week 😅.

> I no longer have the files as they were deleted at the end of my PhD

I.e., you no longer have the Python files?

> I could give you some pointers if needed.

This would be great. I would be happy to debug / contribute to the code base if I knew how lol. I could try and rewrite it in Python with some help.
