Missing numbers in output #189
Comments
I guess I need to add a parser function here?
I've also noticed that in many cases with this parser function, the argument list is empty, which seems to be related to source text like this. Edit: This specific template is described here (in Swedish). Is this problem just a matter of including this template definition?
I am seeing missing numbers in both Hungarian and English WP outputs. In Hungarian, the problem shows up with a particular template; it seems that when the code tries to expand it, the number is lost. I attached two logs of what is happening. They are full of random stuff, for which I apologize, but they give an idea of what is happening.
Thanks for responding. I didn't have time to look further into this, but do you know if this script should be able to properly parse and look up values if given the correct input files? I've learnt that a Wikipedia dump consists of a lot of files, some of which are SQL files to recreate databases, which I'm guessing are needed to fill in certain values, like maybe my previous example.
As I understand it, this script takes as input a single bz2 file, so I just gave it the full pages-articles bz2 dump. That one includes the template pages, and they are extracted into the templates file.
I have noticed the same, especially on newer machine-generated articles, where it costs nothing to burden the text with a lot of markup, necessary or not. Regarding numbers, they are described as formatting here: https://www.mediawiki.org/wiki/Help:Magic_words There are a lot of possibilities for user-generated formatting. They could be regarded as too much work for this tool, unless it impacts the uses of the text. Some projects recommend the JSON output, which may be the way to go if there are fewer errors; it may be easier to just remove keywords from the JSON than to interpret a dump to text. Has anyone fixed this problem either way, i.e. fixed the parsing or cleaning functions? Or are there better dumps or tools?
Same here. Would love to find a way to fix that.
I did a quick fix that may work. Since template expansion seems to be outdated, why not use already expanded text: the CirrusSearch dumps. It seems to work, and there are many mirrors to download from. I had to do a minor code update, and added a text-only option. Please feel free to copy: https://github.com/HjalmarrSv/wikiextractor/blob/master/cirrus-extract.py It may be possible to integrate the cirrus reader into WikiExtractor, if there is such a need.
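For reference, a minimal sketch of reading a CirrusSearch content dump, assuming the usual format of a gzipped file with alternating index-action and document JSON lines; the field names ("title", "text") and the file name are assumptions based on current dumps and may differ:

```python
import gzip
import json

# Sketch: iterate over pages in a CirrusSearch content dump
# (e.g. *-cirrussearch-content.json.gz) and yield their plain text.
def iter_cirrus_pages(path):
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            if "index" in record:  # index-action line, skip it
                continue
            yield record.get("title", ""), record.get("text", "")

if __name__ == "__main__":
    # Hypothetical file name, just for illustration.
    for title, text in iter_cirrus_pages("svwiki-cirrussearch-content.json.gz"):
        print(title)
        print(text[:200])  # first 200 characters as a sanity check
        break
```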
Thanks for sharing! I tried the same thing; it works well to get the text with numbers, but I also get all the references mixed with the text. I'm still not sure how I can filter those out properly.
With two spaces before the caret, the references were still there. Changing to one space removes all of them:
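A hedged sketch of what is meant, assuming the references in the cirrus text appear as lines starting with a single space and a caret (the exact pattern may differ between dumps):

```python
import re

# Sketch: drop reference lines of the form " ^ Some citation ..." from cirrus text.
# One leading space before the caret is assumed; two spaces did not match.
def strip_caret_references(text):
    return re.sub(r"^ \^.*$", "", text, flags=re.MULTILINE)
```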
This would only remove a tiny fraction of references. Also, I'm not sure what you want to do with your wiki corpus, but in the eventuality that you'd like to train some language model, here is what you will get from the cirrus dump:
As opposed to the same article extracted from the standard wiki dump (using wikiextractor):
It seems to me that working with the cirrus dump makes it really difficult to filter out annoying features such as formulas, references, ... I personally would feel more confident training on a corpus missing a few numbers here and there than on a corpus containing a significant fraction of noisy references and markup.
Just to chime in with another alternative: I ended up using the Kiwix Wikipedia dumps. The dump consists of ZIM archives, which contain the pages in HTML format. This makes parsing the infoboxes and similar templates all but impossible, but for those of us who only need the text, it is actually easier to process than the abomination that MediaWiki markup is. For those interested, I have created a repo for parsing these ZIM dumps into a very limited form of HTML.
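Since the ZIM route comes up a few times in this thread, here is a minimal sketch of getting plain text once the HTML pages are out of the archive. It uses BeautifulSoup; the list of tags to drop and the file name are assumptions, so adjust them to the HTML you actually get:

```python
from bs4 import BeautifulSoup

# Sketch: turn one HTML page (already extracted from the ZIM archive) into plain text.
def html_to_text(html):
    soup = BeautifulSoup(html, "html.parser")
    # Drop elements that usually hold non-prose content; adjust as needed.
    for tag in soup(["script", "style", "table", "sup"]):
        tag.decompose()
    return soup.get_text(separator="\n", strip=True)

with open("Sweden.html", encoding="utf-8") as f:  # hypothetical file name
    print(html_to_text(f.read())[:500])
```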
Great with alternatives! ZIM is new to me. I do not like math. :-) Maybe filter out math articles... I have not thought about them. Will now!
As johnPertoft pointed out above, there is a place where formatnum should be handled. These lines can be added there (a sketch is below). As far as I know nothing improves but code readability. If there is a function that just drops the {{ }} tags when the function is unknown, then knowing the function may ruin that. Otherwise this is a step forward. formatnum in itself would be simple to parse, but there is the template connection. I would also like to put 'as of' here, but I find no support for this assumption. I proposed another temporary solution in the 'As of' thread.
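Concretely, the lines in question look roughly like this. The sketch assumes the parser-function dictionary in WikiExtractor.py as the location; the existing 'int' entry (quoted later in this thread) is shown only for context, and the exact spot may differ between versions:

```python
# Sketch of the assumed location: the parser-function dictionary in WikiExtractor.py.
# Only the 'formatnum' line is new; 'int' is shown for context because it already
# appears in the existing code.
text_type = str  # the alias used in the script for Python 2/3 compatibility

parserFunctions = {
    'int': lambda extr, string, *rest: text_type(int(string)),

    # from https://en.wikipedia.org/wiki/Help:Magic_words
    # New entry: keep the raw number instead of dropping the whole magic word.
    'formatnum': lambda extr, string, *rest: string,
}
```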
This bug seems to be the root cause of most invalid sentences on http://voice.mozilla.org.
In the thread above, {{date de naissance-|17|octobre|1973}} is mentioned as not handled by WikiExtractor, which I guess is a {{dateformat}} or {{formatdate}} parser tag. There is a proposal for making a centralised function for handling parser tags in multiple languages. The proposal also lists a number of problems that the lack of a central function is causing. Two of the problems identified are relevant to this thread: "For people editing in different languages, templates make translation harder. When translating a page, templates are much harder to handle than the article text (“prose”), whether the translation is done manually or with Content Translation. Users often have to skip the template or to correct it after the article was published. This also causes abandoning translations in progress, because template translation looks intimidating." "Content Translation has a template adaptation feature, which automates some parts of this process, but it works only if a corresponding template exists in both languages, and if all the parameters were meticulously mapped by the template maintainers. This must be done for each template in each language separately and manually, and continuously maintained when the source template changes. This happens even though the templates’ function across the languages is the same most of the time." https://www.mediawiki.org/wiki/Global_templates/Proposed_specification [This page was last edited on 19 January 2020, at 12:03] Some problems that have been identified must reasonably be solved at the level where they are created. Automatic template expansion and parsing, primarily with tag names in English, may be the furthest this project can possibly handle. That could mean support for the idea of pre-parsing the .xml code to translate it to non-local tags. Edit: Ideally the "pre-parsing" would be done by the template maintainers in the templates themselves. Otherwise, pending a central function, anyone interested in a specific language would need to write a translator.
'formatnum': lambda extr, string, *rest: string, seems to work for me (disambig also works suddenly, so it may also depend on something else I did or reverted). This should work for other parser tags where keeping the value is enough.
If someone else can replicate the result, we could call this problem closed. In the same way as with 'int': lambda extr, string, *rest: text_type(int(string)), it is possible to do an operation on the string. With the decimal module there is not even a need to convert from the string in order to round up (to avoid zero values) if decimal markers are to be avoided. If formatnum fails now, it is probably because a template was not expanded before formatnum was parsed, in which case the template is possibly still in the text. I have not tried it, but if you are not interested in numbers, replace the string argument with "number" or "nummer" or your choice; that could work. A function could be called that localizes the decimal point (. or ,) to the user locale or the article language. I see no reason why it should not work to copy the formatnum entry for other languages as well, thus not getting stuck on the "template expansion and translation" problem above. I will try dateformat and formatdate with re.sub, or equivalent. You can of course use the formatnum example and get the unprocessed argument (with pipe (|) symbols). Also here you can have any language as the parser tag, to be able to process wikis in other languages.
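As an illustration of "doing an operation on the string", a hedged sketch using decimal.Decimal to parse the argument and round it up, falling back to the raw string when the argument is not a plain number (for example because an unexpanded template is still in the text). The function and its name are mine, not part of WikiExtractor:

```python
from decimal import Decimal, ROUND_UP, InvalidOperation

# Sketch: a formatnum handler that rounds the value up (to avoid zero values)
# and keeps the raw argument when it is not a parseable number.
def format_number(extr, string, *rest):
    try:
        value = Decimal(string.strip())
    except (InvalidOperation, AttributeError):
        return string  # not a plain number, keep as-is
    return str(value.quantize(Decimal('1'), rounding=ROUND_UP))

# Registered the same way as the other parser functions:
# 'formatnum': format_number,
```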
If you want a comma, then this works for me:
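The code block appears to have been lost from this comment; judging from the rubbish example in the next paragraph, it was presumably a dot-to-comma replacement along these lines (a reconstruction, not the original snippet):

```python
# Presumed reconstruction: keep the number but replace the decimal dot with a
# comma, which matches the breakage described below for numbers that already
# contain thousands separators.
def format_number_comma(extr, string, *rest):
    return string.replace('.', ',')

# 'formatnum': format_number_comma,
```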
A pre-formatted number may turn to rubbish: 1.000.000,00 would become 1,000,000,00, and 1,000,000.00 would also become 1,000,000,00. The entire list looks like this (a reconstruction is sketched below). Note that tag, dateformat and formatdate are experimental; I have not found them in my .xml.
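The list itself seems to have been dropped from the comment; a reconstruction from the rest of this thread would look roughly like this (only formatnum is confirmed to help, and treating tag, dateformat and formatdate the same way is an assumption):

```python
# Rough reconstruction of "the entire list": parser-function entries that keep
# their argument instead of discarding it. Only 'formatnum' is confirmed here;
# 'tag', 'dateformat' and 'formatdate' are experimental.
extra_parser_functions = {
    'formatnum':  lambda extr, string, *rest: string,
    'dateformat': lambda extr, string, *rest: string,
    'formatdate': lambda extr, string, *rest: string,
    'tag':        lambda extr, string, *rest: string,
}
# These would be merged into the parserFunctions dict in WikiExtractor.py.
```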
For tags in other languages: copy and add, in theory. Test! The foreign-language name may actually be a template tag and may not work.
So in the end you managed to recover the missing numbers? I personally went with the solution of @DavidNemeskey: I get the HTML files from the ZIM archive using his code, then I have my own code to extract the text from the HTML. I end up with relatively clean content, including all numbers. The HTML markup allows a lot of room to preprocess the data in the way you want.
I'll check those out. Luckily disk space is affordable; only time is lacking. Yes, formatnum works. The idea of using formatnum is kind of silly: if my locale is US and I read a Swedish text, why would I want a US dot decimal separator when Swedes use a comma as the decimal separator? And if it is only markup for auto-translation, then do you really want translation that needs all kinds of markup to work, especially when most text lacks all but basic markup? Not counting about a million articles of robot text, with Latin titles or foreign-language hill names, full of markup and with the same sentence structure, not so useful for language studies. I made a pull request with the code, since I am not sure everyone checks out the forks. The formatnum fix consists of three parts: the function (optional), the registration of the function in the parser tag list (a mandatory lambda), and the command line option (optional) if you want to switch between comma and dot when calling WikiExtractor.
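To make the three parts concrete, a sketch under the assumption that the command-line flag is called --decimal-comma (the flag name and function name are mine; the actual pull request may differ):

```python
import argparse

# Part 1 (optional): the function itself.
DECIMAL_COMMA = False

def format_number(extr, string, *rest):
    return string.replace('.', ',') if DECIMAL_COMMA else string

# Part 2 (mandatory): registering it in the parser-function table, e.g.
# 'formatnum': format_number,

# Part 3 (optional): a command-line switch for comma vs. dot.
parser = argparse.ArgumentParser()
parser.add_argument('--decimal-comma', action='store_true',
                    help='render formatnum values with a comma as decimal separator')
args = parser.parse_args()
DECIMAL_COMMA = args.decimal_comma
```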
Hi, I also came across this problem. Is there any way to fix it? Example in wiki: output:
Yes! Try adding the following. Depending on the language, you must decide on comma or dot as the decimal separator. If the problem persists, then it is not formatnum that is the problem. It may be that someone has defined their own template for numbers, and the template translation does not catch it because of errors in the template. As explained above, you can choose ZIM or cirrus archives as an alternative; they have expanded the templates already, not always correctly, but good enough I guess. You need to use another parser in that case. Hope this works for you!
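The snippet seems to have gone missing here; presumably it was the same formatnum entry as earlier in the thread, in both variants (keep the dot, or swap it for a comma). A reconstruction:

```python
# Presumed reconstruction: pick one of these entries for the parser-function table,
# depending on whether you want a dot or a comma as the decimal separator.
keep_dot  = {'formatnum': lambda extr, string, *rest: string}
use_comma = {'formatnum': lambda extr, string, *rest: string.replace('.', ',')}
```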
Thanks for your reply. I found that the reason for my problem is that all these numbers are contained in a {{convert|xx|..}} template, which has been filtered out.
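The thread above does not cover {{convert}}; as a rough workaround (my own assumption, not part of WikiExtractor), a pre-processing regex can flatten the simple two-argument form of the template into "value unit" before extraction:

```python
import re

# Rough workaround sketch: turn the simple form {{convert|10|km}} into "10 km"
# before the text is handed to the extractor. More elaborate uses of the
# template (ranges, adjectives, output options) are not covered.
CONVERT_RE = re.compile(r"\{\{convert\|([^|{}]+)\|([^|{}]+)(?:\|[^{}]*)?\}\}",
                        re.IGNORECASE)

def flatten_convert(wikitext):
    return CONVERT_RE.sub(r"\1 \2", wikitext)

print(flatten_convert("The road is {{convert|10|km|mi}} long."))
# -> The road is 10 km long.
```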
Any update on this issue?
I tried to use this for a dump of Swedish Wikipedia, but I noticed that in a lot of cases numbers are missing from the output files. After some cross-referencing between the output JSON files and the source XML file, it seems to be related to
{{formatnum}}
which is a magic word. (Parts of an) example article in the XML file with this problem:
Corresponding article in the JSON output file:
In the articles without this problem, numbers are (as far as I've seen) written as plain text without any magic words.
Is there any way to avoid this and other similar missing words? This issue is probably related to #151 and #153.