NLP work on Brazilian Portuguese newswire text
You can browse the dataset online and view the annotations on Drive.
We have x newswire articles collected between 1994 and 2016. Because the articles are in HTML format, during preprocessing we first strip the tags and rename each file, for example:
folca/data/2005/01/01/19.html --> folca/parsed-data/2005_01_01_19.html
and collect them all in one folder.
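
The preprocessing described above could be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the project's actual script: the function names (`strip_tags`, `flat_name`, `preprocess`), the UTF-8 encoding, and the choice to skip `<script>` and `<style>` contents are all assumptions.

```python
from html.parser import HTMLParser
from pathlib import Path


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join text nodes with spaces and collapse runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def strip_tags(html):
    """Return the visible text of an HTML string (hypothetical helper)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def flat_name(path, root="folca/data"):
    """Map <root>/2005/01/01/19.html to 2005_01_01_19.html."""
    relative = Path(path).relative_to(root)
    return "_".join(relative.parts)


def preprocess(src_root="folca/data", dst_root="folca/parsed-data"):
    """Strip tags from every .html file and write it to one flat folder."""
    src, dst = Path(src_root), Path(dst_root)
    dst.mkdir(parents=True, exist_ok=True)
    for html_file in sorted(src.rglob("*.html")):
        # Assumption: files are UTF-8; undecodable bytes are replaced.
        raw = html_file.read_text(encoding="utf-8", errors="replace")
        (dst / flat_name(html_file, src)).write_text(strip_tags(raw), encoding="utf-8")
```

Calling `preprocess()` with the default arguments would read from `folca/data` and write the cleaned files to `folca/parsed-data`. Joining text nodes with a space can split a word that is interrupted by inline markup, so a real pipeline may need a more careful rule there.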