Split DELAS into DELAS and DELAS-Pr #2

ppKrauss · 2018-01-14T19:09:34Z

There are many "pure named entities" (proper nouns) that are not real "dictionary words".

Examples: abel,N004+Pr, abelson,N004+Pr, abélson,N004+Pr, abigail,N104+Pr, abília,N104+Pr, abílio,N004+Pr, abraão,N004+Pr, abraham,N004+Pr, abrantes,N306+Pr, abrão,N004+Pr, zico,N004+Pr, zilda,N104+Pr, zimbábue,N304+Pr, zingarelli,N306+Pr, zoroastro,N004+Pr, zucolotto,N306+Pr, zurique,N104+Pr

Many are common human given names (modern, such as zico and zilda, or classical, such as zoroastro) or surnames (zingarelli, zucolotto). Others are common toponyms, such as country names, city names (abrantes, zurique), etc.

So DELAS-Pr must include a column indicating the type of entity the name usually denotes (e.g. Italy is a country name, but in Brazil it is also a female given name).
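One way such a column could be filled in is sketched below. Everything here is an assumption for illustration: the function name add_entity_type, the lookup file of "name,type" lines, and the type labels are hypothetical, and the "|" separator is just one way to keep several types for the same name.

```shell
# add_entity_type TYPES_FILE PR_FILE
# TYPES_FILE: hypothetical "name,type" lines; a name may appear more than once.
# PR_FILE: "lemma,graph_id" lines such as abel,N004+Pr.
# Prints PR_FILE with a third column listing every known type, joined by "|".
# Names with no known type get an empty third column.
add_entity_type() {
  awk -F, '
    FILENAME == ARGV[1] {            # first file: collect the types per name
      if ($1 in types) types[$1] = types[$1] "|" $2
      else types[$1] = $2
      next
    }
    {                                # second file: append the type column
      if ($1 in types) print $0 "," types[$1]
      else print $0 ","
    }
  ' "$1" "$2"
}
```

It would be called as add_entity_type types.csv DELAS_Pr.csv > DELAS_Pr_typed.csv, where both file names are placeholders.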

There are other sources of names and their usage statistics: see datasets-br/prenomes or datasets-br/city-codes for confirmed Brazilian names, and world-cities, etc. for international ones.

ppKrauss · 2018-01-14T19:19:41Z

From examples:

graph_id	classifications
N004	:ms
N104	:fs
N304	:ms:fs
N306	:ms:mp:fs:fp
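The table above can be turned into a small lookup, for example to annotate the split file later. This is only a sketch covering the four graph IDs listed here: the function name classify_graph is hypothetical, every value is written with a leading colon, and reading m/f as masculine/feminine and s/p as singular/plural is an assumption.

```shell
# classify_graph GRAPH_ID
# Prints the classification string for the graph IDs in the table above.
# Anything after the first "+" (e.g. the "+Pr" tag) is ignored.
classify_graph() {
  case ${1%%+*} in
    N004) echo ':ms' ;;
    N104) echo ':fs' ;;
    N304) echo ':ms:fs' ;;
    N306) echo ':ms:mp:fs:fp' ;;
    *)    echo 'unknown'; return 1 ;;   # not one of the four IDs listed
  esac
}
```

For instance, classify_graph N104+Pr prints :fs, and an ID outside the table prints unknown with a non-zero exit status.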

If all is OK... let's split!

Splitting with grep:

check wc -l DELAS.csv (75106)
grep -wE "(Pr|SIGL)" DELAS.csv > DELAS_Pr.csv
check 4898
prepend... DELAS_Pr.csv
grep -vwE "(Pr|SIGL)" DELAS.csv > /tmp/DELAS_noPr.csv; mv /tmp/DELAS_noPr.csv DELAS.csv
check 70208
check 70208+4898=75106?
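The grep steps above can be wrapped in a function that runs the line-count check automatically. This is a sketch, not the script actually used: the name split_delas and its output-file arguments are illustrative, -w is applied to both filters so the two outputs are exact complements, and neither the prepend step nor the final mv over DELAS.csv is included.

```shell
# split_delas SRC PR_OUT REST_OUT
# Writes lines carrying the whole-word tag Pr (or SIGL) to PR_OUT and all
# other lines to REST_OUT, then checks that the two counts add up to SRC.
split_delas() {
  src=$1; pr_out=$2; rest_out=$3
  # grep exits with status 1 when nothing matches; that is not an error here.
  grep -wE '(Pr|SIGL)' "$src" > "$pr_out" || true
  grep -vwE '(Pr|SIGL)' "$src" > "$rest_out" || true
  # $(( ... )) also strips the padding some wc implementations print.
  total=$(( $(wc -l < "$src") ))
  n_pr=$(( $(wc -l < "$pr_out") ))
  n_rest=$(( $(wc -l < "$rest_out") ))
  if [ $((n_pr + n_rest)) -ne "$total" ]; then
    echo "line count mismatch: $n_pr + $n_rest != $total" >&2
    return 1
  fi
  echo "$n_pr + $n_rest = $total"
}
```

Called as split_delas DELAS.csv DELAS_Pr.csv /tmp/DELAS_noPr.csv, it prints the three counts on one line when the check passes; the source file itself is left untouched.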

...

Or by SQL...

Check with SELECT DISTINCT regexp_replace(graph_id,'^[^\+]+\+','') FROM dataset.vw2_delas WHERE graph_id like '%+%' shows:
Dem, PosXTra, Art+Ind, Ind, Pr, Pos, Art+Def, Num, Tra, Pes, DemXInd, Int, Rel.
No SIGL.
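The same check can be approximated on the CSV file without a database. The sketch below assumes the "lemma,graph_id" layout seen in the examples (comma-separated, graph_id in column 2); the function name list_suffixes is hypothetical.

```shell
# list_suffixes FILE
# Prints each distinct tag string that follows the first "+" in column 2,
# mirroring the regexp_replace(graph_id,'^[^\+]+\+','') query above.
list_suffixes() {
  cut -d, -f2 "$1" |          # keep only the graph_id column
    grep -F '+' |             # like: WHERE graph_id LIKE '%+%'
    sed 's/^[^+][^+]*+//' |   # drop everything up to the first "+"
    sort -u                   # like: SELECT DISTINCT
}
```

Running list_suffixes DELAS.csv before the split should list the same tags as the SQL query if the view and the file hold the same data.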
