- Removed instances of sqlalchemy from main. May add it back later, but not in the main program. - Removed some unnecessary sections from the parser. - Updated the data writer to handle the new entries list. Not 100% done; still running into encoding issues, either when writing to the db or in the Windows shell.
It may be from some other libraries I installed, but on Windows 7 running "from parser import Parser" led to a name collision (likely with the stdlib's own `parser` module). Renaming actually makes sense in general, though. Also added a .gz ignore because I downloaded the JMdict file locally here.
Also required updating the parser to return a list of Entry objects rather than the combination of items from before. Probably a better decision overall, but it means the database insert no longer works. Also created a Gloss class due to the annoyance of dealing with tuples. Finally, moved the JMdict file to the main directory, as that tends to be easier to type when testing, and added it to the ignore file.
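A Gloss/Entry pair replacing the old tuples might look like this sketch (the field names and `parse` shape here are guesses, not the project's actual attributes):

```python
from dataclasses import dataclass, field

@dataclass
class Gloss:
    # Named attributes instead of positional tuple slots.
    text: str
    lang: str = "eng"
    pos: list = field(default_factory=list)

@dataclass
class Entry:
    entry_id: int
    kanji: list = field(default_factory=list)
    kana: list = field(default_factory=list)
    glosses: list = field(default_factory=list)

def parse_stub():
    # Stand-in for the real parser: returns a list of Entry objects.
    g = Gloss(text="to eat", pos=["v1"])
    return [Entry(entry_id=1, kanji=["食べる"], kana=["たべる"], glosses=[g])]
```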
Basically a simple collection of all of the data, denormalized but faster to search and display. The sad fact of the matter is that it's probably exactly what I want, aside from the glosses not containing all of their information. I do still want a normalized table to run arbitrary queries against, but for the other immediate tools I want to build this will probably work perfectly (i.e. it's fast).
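As a sketch, the denormalized table described above could be as simple as one row per kanji/kana/gloss combination (column names here are assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE lookup (
        entry_id INTEGER,
        kanji    TEXT,
        kana     TEXT,
        gloss    TEXT
    )
""")
conn.execute("INSERT INTO lookup VALUES (1, '食べる', 'たべる', 'to eat')")
# One flat table means a search is a single scan with no joins:
row = conn.execute(
    "SELECT gloss FROM lookup WHERE kana = 'たべる'"
).fetchone()
```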
Also added a `__str__` method to entries, for testing purposes. It doesn't handle the glosses very well yet.
Using those instead to track joins. Not sure it's really helping. Things seem to be missing from the final listing, but I can't quite track them down. I wonder if it might be easier to keep a list of joins and read the ids back out from the table? Then again, the big problems seem to be a missing kana/kanji link and the speed of reading the list back out, more than anything else.
Incrementing my own ids as a solution. I don't like how I figured out how to do this, and I certainly don't like that it's valid sqlite syntax... but such is life. The subselects weren't working, so I guess for now this works. Then again, I was trying to cut out everything I could to increase the speed of the conversion.
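A sketch of the manual-id approach, assuming the counter lives in the conversion loop rather than in a SQL subselect:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entry (id INTEGER PRIMARY KEY, headword TEXT)")

next_id = 1  # tracked in Python, so no per-insert subselect is needed
rows = []
for headword in ["食べる", "飲む"]:
    rows.append((next_id, headword))
    next_id += 1
conn.executemany("INSERT INTO entry VALUES (?, ?)", rows)
```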
Thinking about it, I'm not entirely sure that the sense element was entirely necessary. I couldn't find a unique way to identify them aside from the entire element, so it made sense to just move the part of speech into the gloss and run with that. Who knows, I may be completely wrong about it. It is becoming increasingly obvious that I need to reference ids somehow; this join table is noticeably slow, so figuring that out will be the first step toward fixing it.
I really need a way to get ids at this point. Querying the table as it is right now seems noticeably slower than before. Or maybe it's just my imagination. Working on the pos now. Simplified the error checking first, thinking it should be a separate commit.
Rather than using intermediary files. Ran into an odd issue where the cursor seemed to iterate over the characters in the text rather than taking them as a single item. Actually, that may be the correct way of handling these lists... either way, I just ran a list comprehension on them to get everything into the correct values.
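This is what the character-iteration symptom looks like: `executemany` treats each item of the sequence as one row's parameters, so a bare string gets unpacked element by element. Wrapping each value in a one-element tuple, as described above, fixes it:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pos (tag TEXT)")

tags = ["n", "v5r", "adj-i"]
# Wrong: conn.executemany("INSERT INTO pos VALUES (?)", tags)
# (each string would be treated as a sequence of its characters)
conn.executemany("INSERT INTO pos VALUES (?)", [(t,) for t in tags])
```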
The files (while not yet quite complete) insert all of this data in seconds rather than hours. Perhaps there are things that can be done with sqlalchemy to speed it up, though I doubt it will ever match a bulk insert. That's not a huge priority; I started using it mostly to make querying easier, not necessarily insertions.
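The seconds-versus-hours difference usually comes down to batching: one transaction around an `executemany` instead of a commit per row. A minimal sketch of the pattern (table and column names assumed):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gloss (entry_id INTEGER, text TEXT)")

rows = [(1, "to eat"), (1, "to live on"), (2, "to drink")]
# `with conn:` wraps the whole batch in a single transaction, so there is
# one commit for thousands of rows instead of one commit each.
with conn:
    conn.executemany("INSERT INTO gloss VALUES (?, ?)", rows)
```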
Hardcoded it into the parser because I don't think it's possible to easily read it out of the xml file (at least not with lxml). I doubt these will change much at this point, but you never know. The next step is to alter the parser so this information can be linked to the sense element.
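The hardcoded mapping presumably looks something like this two-entry excerpt (JMdict's DTD defines dozens of these abbreviation entities; the `expand` helper is an assumption):

```python
# JMdict entity abbreviations -> expansions, as they appear in the DTD.
ENTITIES = {
    "n": "noun (common) (futsuumeishi)",
    "adj-i": "adjective (keiyoushi)",
}

def expand(tag):
    # Fall back to the raw tag if we don't know the expansion.
    return ENTITIES.get(tag, tag)
```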
It's being grabbed in a roundabout way simply because I didn't want to hardcode the whole namespace definition. Yeah, that's a little lazy, but for now it works. I also don't like how I'm displaying these, but the `__str__` method was only ever meant for debugging purposes. I'll have to write more robust print code eventually anyway, so this can wait for now.
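One roundabout route (whether or not it's the one used here) is parsing with `resolve_entities=False`, which leaves lxml `_Entity` nodes in the tree whose `.name` is the short abbreviation, so nothing has to be hardcoded to recover it:

```python
from lxml import etree

# A tiny stand-in document with an internal DTD, mimicking JMdict's setup:
xml = b"""<!DOCTYPE jmdict [<!ENTITY n "noun (common) (futsuumeishi)">]>
<jmdict><pos>&n;</pos></jmdict>"""

parser = etree.XMLParser(resolve_entities=False)
root = etree.fromstring(xml, parser)
entity = root[0][0]       # the unresolved &n; reference inside <pos>
short_name = entity.name  # "n", without hardcoding the expansion
```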
That's been bugging me for a while. It probably shouldn't need to be called explicitly, although the option is nice. Also, I'm not 100% sure this is the best way to handle setting up the sqlalchemy objects, but the globals weren't being set correctly the way I was importing and initializing them before.
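One common way around the globals problem (a sketch, not necessarily how this project does it) is to build the engine and session factory inside an init function and have it call `create_all` itself, keeping the explicit option behind a flag:

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class Entry(Base):
    __tablename__ = "entry"
    id = Column(Integer, primary_key=True)
    headword = Column(String)

def init_db(url="sqlite:///:memory:", create=True):
    # create_all happens here, so callers never have to remember it;
    # the `create` flag keeps the explicit option available.
    engine = create_engine(url)
    if create:
        Base.metadata.create_all(engine)
    return sessionmaker(bind=engine)
```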
I wanted to move the ids into the entry, although this still seems like it should be many-to-many. Also renamed the kana/kanji elements, as k_ele and r_ele were getting annoying to remember. TODO: - Still need to rename the columns in the database to something easier to understand.
Still incredibly slow, and there seems to be an issue with duplicate entries. Still, that might be fine if all I care about is filling the db once and only using it afterwards. The one test I tried showed that I could get results fast enough.
The commits still take forever. The reb element isn't complete either; if it's inside a r_elem(?) element then it has a slightly different meaning which should be preserved. Also, this will just blindly add the elements to the database; it should be deleted/dropped first.
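The delete/drop step could be as simple as recreating the table at the top of the load (table and column names assumed):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Recreate from scratch so rerunning the load can't blindly duplicate rows:
conn.execute("DROP TABLE IF EXISTS reb")
conn.execute("CREATE TABLE reb (entry_id INTEGER, reading TEXT)")
conn.execute("INSERT INTO reb VALUES (1, 'たべる')")
```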
Instead of trying it again on the end tag as well.
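For context, iterparse can fire both "start" and "end" events for the same element, so handling an element on one event and again on the other processes it twice. A minimal sketch using the stdlib ElementTree (lxml's iterparse mirrors this API):

```python
import io
import xml.etree.ElementTree as ET

xml = b"<root><entry><keb>A</keb></entry><entry><keb>B</keb></entry></root>"

kebs = []
# Subscribing to a single event type means each element is handled exactly
# once; asking for both "start" and "end" would visit every element twice.
for event, elem in ET.iterparse(io.BytesIO(xml), events=("end",)):
    if elem.tag == "entry":
        kebs.append(elem.findtext("keb"))
```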
TODO: * The names could use some clarification in order to make the objects easier to use. * It takes quite a while to read the file right now; the code should be profiled, but the item lookup is a likely cause. Maybe it would be possible to sacrifice memory for speed?
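Profiling the read doesn't require much setup; `cProfile` from the stdlib will point at the hot spots (the parse function here is just a stand-in workload):

```python
import cProfile
import pstats

def parse_file():
    # Stand-in for the real file-reading code.
    return sum(i * i for i in range(10_000))

profiler = cProfile.Profile()
profiler.enable()
parse_file()
profiler.disable()

stats = pstats.Stats(profiler)
# stats.sort_stats("cumulative").print_stats(10)  # ten slowest call paths
```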