Extracting links apparently broken #112
Comments
I suspect you had an error running the extractor, which failed to collect the definition for macro {{w}}. |
OK. Does the latest commit fix the broken links as well? |
It does not fix the problem with the first sentence either, since the macro {{w}} uses the unsupported parser function #ifexist. |
Try now. |
Hello, I re-ran the commands I listed in the first post and the result is exactly the same. To re-initialize the script, I just ran:
rm -r wikiextractor
git clone https://github.com/attardi/wikiextractor.git
Is that enough, or do I have to delete other files? Please note that the extracted files were in the |
FYI, thanks to your suggestion about the broken {{w}} macro, I just ran:
sed 's/{{w|\([^}]*\)}}/[[\1]]/g' enwikinews-latest-pages-meta-current.xml > filtered-enwikinews-latest-pages-meta-current.xml
This way, every {{w|...}} template in the dump becomes a plain [[...]] wiki link.
So, I ran the extractor on the resulting file:
./WikiExtractor.py -o extractedWithLinks -l filtered-enwikinews-latest-pages-meta-current.xml
The resulting XML is well formed:
<doc id="1637" url="https://en.wikinews.org/wiki?curid=1637" title="Nobel Peace Prize awarded to Kenyan environmental activist">
Nobel Peace Prize awarded to Kenyan environmental activist
<a href="Oslo">OSLO</a> — The 2004 <a href="Nobel%20Peace%20Prize">Nobel Peace Prize</a> was awarded today to <a href="Wangari%20Maathai">Dr Wangari Maathai</a> from <a href="Kenya">Kenya</a>. She is the first <a href="Africa">African</a> woman to win the Peace prize, and the 12th woman to win the prize since its inception in 1901. The Nobel committee cited "her contribution to sustainable development, democracy and peace" as the reasons for awarding the prize. It is the first Peace prize awarded to an environmentalist.
Dr Maathai is a member of parliament in Kenya, the country's deputy environmental minister, and holds a <a href="Ph.D.">Ph.D.</a> in <a href="anatomy">anatomy</a> from the University of Nairobi. For seven years she was the director of the <a href="Red%20Cross">Red Cross</a> in Kenya, and is most known for founding the <a href="Green%20Belt%20Movement">Green Belt Movement</a> — a non-governmental organization dedicated to environmental conservation and protecting forests. Since its founding in 1997, the organization claims to have planted over 30 million trees, in the process employing thousands of women — offering them empowerment, education and even family planning.
...
</doc> It's a very, very, very dirty solution, but it seems to work. |
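The sed one-liner above can also be sketched in Python. This is only an illustration, not part of WikiExtractor; `rewrite_w_templates` is a hypothetical helper name, and like the sed expression it assumes a {{w|...}} invocation never contains a closing brace:

```python
import re

# Mirror of the sed expression s/{{w|\([^}]*\)}}/[[\1]]/g:
# rewrite every {{w|target}} template into a plain [[target]] wiki link.
W_TEMPLATE = re.compile(r"\{\{w\|([^}]*)\}\}")

def rewrite_w_templates(wikitext: str) -> str:
    """Replace {{w|...}} templates with ordinary [[...]] links."""
    return W_TEMPLATE.sub(r"[[\1]]", wikitext)

if __name__ == "__main__":
    sample = "The 2004 {{w|Nobel Peace Prize}} was awarded to {{w|Wangari Maathai}}."
    print(rewrite_w_templates(sample))
```

Note that a two-argument template such as {{w|Oslo|OSLO}} would come out as [[Oslo|OSLO]], which is still valid link syntax.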
Sorry, the {{w}} issue has been solved by fixing the loading of templates. |
So which files should I remove? Isn't removing the WikiExtractor folder and re-cloning the repo enough? |
You should remove the file that was given as argument for the --templates option. |
As I wrote, I called the script with just the |
The template {{w}} is indeed defined in enwikinews-latest-pages-articles.xml.bz2 |
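That claim can be sanity-checked independently. A minimal sketch (not part of WikiExtractor; `template_defined` and `open_dump` are hypothetical helper names), relying only on the fact that template definitions in MediaWiki dumps are ordinary pages whose title carries the Template: prefix:

```python
import bz2
import re

def open_dump(path):
    """Open a MediaWiki XML dump, transparently handling .bz2 files."""
    opener = bz2.open if path.endswith(".bz2") else open
    return opener(path, "rt", encoding="utf-8", errors="replace")

def template_defined(lines, name):
    """Return True if the dump contains a page titled Template:<Name>.
    MediaWiki stores titles with the first letter capitalized,
    so {{w}} is defined by a page titled Template:W."""
    needle = re.compile(r"<title>Template:%s</title>" % re.escape(name.capitalize()))
    return any(needle.search(line) for line in lines)

# Hypothetical usage:
# with open_dump("enwikinews-latest-pages-articles.xml.bz2") as dump:
#     print(template_defined(dump, "w"))
```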
Hello, sorry to bother you again, but I had time to dig deeper into the project, and I still have problems. Following your suggestions, I ran:
$ python WikiExtractor.py -o extractedWithLinks --templates ../enwikinews-lastest-pages-articles.xml.bz2 ../enwikinews-latest-pages-articles.xml.bz2
The result is still not what I expected. Take for example this page. The output of the extractor is:
<doc id="817" url="https://en.wikinews.org/wiki?curid=817" title="Pope John Paul II meets Iraq's Ambassador">
Pope John Paul II meets Iraq's Ambassador
</doc> ...so, basically, the extractor wipes away all the content of the page. What could be the problem? |
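Empty documents like the one above can be spotted programmatically. A rough sketch, assuming only the <doc ...>...</doc> output format shown in this thread (`empty_docs` is a hypothetical name, and the title-stripping heuristic is crude):

```python
import re

# Match each extracted document: header line, body, closing tag.
DOC_RE = re.compile(r'<doc id="(\d+)"[^>]*title="([^"]*)">\n(.*?)\n</doc>', re.DOTALL)

def empty_docs(extracted_text: str):
    """Yield (id, title) for every <doc> whose body contains nothing
    beyond the repeated title line, i.e. pages the extractor wiped."""
    for doc_id, title, body in DOC_RE.findall(extracted_text):
        content = body.replace(title, "").strip()
        if not content:
            yield doc_id, title
```

Running it over the extracted files would list every page that came out as bare as the "Pope John Paul II meets Iraq's Ambassador" example.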
I had a similar issue on Wikinews myself; the way I solved it might help you (in my case I completely remove the links, but the way I find them may help you modify them instead). The problem was that I had:
{"id": "736", "revid": "70202", "url": "https://en.wikinews.org/wiki?curid=736", "title": "President of China lunches with Brazilian President", "text": ", the of the People's Republic of China had lunch today with the of Brazil, , at the "Granja do Torto", the President's country residence in the . Lunch was a traditional Brazilian with different kinds of meat. \nSome Brazilian ministers were present at the event: (Economy), (), (Agriculture), (Development), (), (Mines and Energy). Also present were ( company president) and Eduardo Dutra (, government oil company, president).\nThis meeting is part of a new agreement between Brazil and China where Brazil has recognized mainland China's status, and China has promised to buy more ."}
instead of:
{"id": "736", "revid": "70202", "url": "https://en.wikinews.org/wiki?curid=736", "title": "President of China lunches with Brazilian President", "text": "Hu Jintao, the President of the People's Republic of China had lunch today with the President of Brazil, Luiz In\u00e1cio Lula da Silva, at the "Granja do Torto", the President's country residence in the Brazilian Federal District. Lunch was a traditional Brazilian barbecue with different kinds of meat. \nSome Brazilian ministers were present at the event: Antonio Palocci (Economy), Eduardo Campos (Science and Technology), Roberto Rodrigues (Agriculture), Luiz Fernando Furlan (Development), Celso Amorim (Exterior Relations), Dilma Rousseff (Mines and Energy). Also present were Roger Agnelli (Vale do Rio Doce company president) and Eduardo Dutra (Petrobras, government oil company, president).\nThis meeting is part of a new political economy agreement between Brazil and China where Brazil has recognized mainland China's market economy status, and China has promised to buy more Brazilian products."}
|
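The residue visible in the first JSON above (empty parentheses like "()" and determiner collisions like "the of" left behind where a name was deleted) suggests a crude detector for articles whose links were dropped rather than flattened to plain text. A heuristic sketch only, with a hypothetical name:

```python
import re

# Residue patterns left when the extractor deletes a link's text
# entirely: empty parentheses "()" and "the of" collisions where a
# proper name used to sit between the two words.
RESIDUE = re.compile(r"\(\)|\bthe of\b")

def links_were_stripped(text: str) -> bool:
    """Rough check for extracted articles whose link targets were
    deleted rather than converted to plain text."""
    return bool(RESIDUE.search(text))
```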
Hello,
I'm using WikiExtractor for an academic project and I need to extract the pages from WikiNews while keeping the links. My problem is that the script, when called with the -l option, removes links instead of preserving them. Take as an example this news article, titled Nobel Peace Prize awarded to Kenyan environmental activist. I downloaded the latest dump, then I ran the script as follows:
~/wikiextractor$ ./WikiExtractor.py -o extractedWithLinks -l enwikinews-latest-pages-meta-current.xml
I look for the file containing the text of the page:
If I look at the XML extracted by WikiExtractor it looks like this:
As you can see, the first sentence of the page is missing:
And some of the links in the following sentences are missing as well. The extracted text is:
While the original text reads (the missing links are in bold):
So: am I missing something in the configuration of WikiExtractor? Is it a bug? Are WikiNews dumps for some reason not supported, even if they should be identical in structure to the usual Wikipedia ones?