Split DELAS into DELAS and DELAS-Pr #2

Open
ppKrauss opened this issue Jan 14, 2018 · 1 comment

Comments

@ppKrauss
Contributor

There are many "pure named entities" tagged as proper nouns (Pr) that are not real "dictionary words".

Examples: abel,N004+Pr, abelson,N004+Pr, abélson,N004+Pr, abigail,N104+Pr, abília,N104+Pr, abílio,N004+Pr, abraão,N004+Pr, abraham,N004+Pr, abrantes,N306+Pr, abrão,N004+Pr, zico,N004+Pr, zilda,N104+Pr, zimbábue,N304+Pr, zingarelli,N306+Pr, zoroastro,N004+Pr, zucolotto,N306+Pr, zurique,N104+Pr

Many are common human given names (modern, such as zico and zilda, or classic, such as zoroastro) or surnames (zingarelli, zucolotto). Others are common toponyms, such as country names and city names (abrantes, zurique).

So DELAS-Pr must include a column indicating the type of entity the name usually denotes (e.g. Italy is a country name, but in Brazil it is also a female given name).

There are other sources of names and their usage statistics: see datasets-br/prenomes or datasets-br/city-codes for confirmed Brazilian names, and world-cities, etc. for international ones.
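
A minimal sketch of how the proposed entity-type column could be filled by looking each lemma up in reference name lists. The type labels, the `entity_types` helper, and the tiny in-memory sets are all assumptions for illustration; only the example names come from the lists above, and real sets would be loaded from sources such as datasets-br/prenomes or datasets-br/city-codes.

```python
# Hypothetical reference sets, seeded only with the examples quoted in this issue.
# In practice they would be loaded from external name and city datasets.
REFERENCE_SETS = {
    "given-name": {"zico", "zilda", "zoroastro"},
    "surname": {"zingarelli", "zucolotto"},
    "city": {"abrantes", "zurique"},
}


def entity_types(lemma):
    """Return every (hypothetical) entity type whose reference set contains the lemma.

    A list is returned because one name can belong to several types,
    for example a country name that is also used as a given name.
    """
    return [label for label, names in REFERENCE_SETS.items() if lemma in names]


if __name__ == "__main__":
    for lemma in ("zilda", "zingarelli", "abrantes"):
        print(lemma, entity_types(lemma))
```

An empty list would mean the name is not confirmed by any reference source and still needs manual review.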

@ppKrauss
Contributor Author

ppKrauss commented Jan 14, 2018

From the examples:

| graph_id | classifications |
|----------|-----------------|
| N004     | :ms             |
| N104     | :fs             |
| N304     | :ms:fs          |
| N306     | :ms:mp:fs:fp    |
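
The mapping above can be sketched as a small lookup, which makes it easy to check what each example entry allows. This assumes the usual DELA reading of the codes (m/f for gender, s/p for number), which the thread does not spell out; N004 is written with a leading colon for uniformity, and the helper names `expand` and `classify` are hypothetical.

```python
# graph_id -> inflection codes, as listed in the table above.
GRAPH_CLASSIFICATIONS = {
    "N004": ":ms",
    "N104": ":fs",
    "N304": ":ms:fs",
    "N306": ":ms:mp:fs:fp",
}

# Assumed meaning of the code letters (standard DELA convention).
GENDER = {"m": "masculine", "f": "feminine"}
NUMBER = {"s": "singular", "p": "plural"}


def expand(codes):
    """Turn ':ms:fs' into [('masculine', 'singular'), ('feminine', 'singular')]."""
    return [(GENDER[c[0]], NUMBER[c[1]]) for c in codes.split(":") if c]


def classify(entry):
    """Parse an entry such as 'abel,N004+Pr' into (lemma, graph code, forms)."""
    lemma, graph_id = entry.split(",", 1)
    code = graph_id.split("+", 1)[0]  # drop the '+Pr' suffix
    return lemma, code, expand(GRAPH_CLASSIFICATIONS[code])


if __name__ == "__main__":
    for entry in ("abel,N004+Pr", "zilda,N104+Pr", "zingarelli,N306+Pr"):
        print(classify(entry))
```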

If all is OK... let's split!

Splitting with grep:

  1. check `wc -l DELAS.csv` (75106)
  2. `grep -wE "(Pr|SIGL)" DELAS.csv > DELAS_Pr.csv`
    check: 4898 lines
  3. prepend... DELAS_Pr.csv
  4. `grep -vwE "(Pr|SIGL)" DELAS.csv > /tmp/DELAS_noPr.csv; mv /tmp/DELAS_noPr.csv DELAS.csv`
    check: 70208 lines
  5. check 70208+4898=75106?
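
For anyone who prefers a single script, the grep steps above could be sketched in Python roughly as follows. This is only an illustration, not the procedure actually used: `\b` approximates grep's `-w` whole-word match, UTF-8 encoding is assumed, and the function names and output paths are hypothetical. It does not handle the header step (3).

```python
import re

# Same alternation as the grep calls; \b approximates grep's -w (whole-word) match.
PATTERN = re.compile(r"\b(Pr|SIGL)\b")


def split_lines(lines):
    """Return (proper, rest): lines matching PATTERN and lines that do not."""
    proper, rest = [], []
    for line in lines:
        (proper if PATTERN.search(line) else rest).append(line)
    return proper, rest


def split_file(src, dst_proper, dst_rest):
    """Split src into two files and return (total, proper, rest) line counts."""
    with open(src, encoding="utf-8") as f:  # encoding is an assumption
        lines = f.readlines()
    proper, rest = split_lines(lines)
    # Same sanity check as step 5: nothing lost, nothing duplicated.
    assert len(proper) + len(rest) == len(lines)
    with open(dst_proper, "w", encoding="utf-8") as f:
        f.writelines(proper)
    with open(dst_rest, "w", encoding="utf-8") as f:
        f.writelines(rest)
    return len(lines), len(proper), len(rest)
```

Calling something like `split_file("DELAS.csv", "DELAS_Pr.csv", "DELAS_noPr.csv")` would then return the three counts to compare against the numbers in the steps above.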

...

Or by SQL...

  • Checking with `SELECT DISTINCT regexp_replace(graph_id,'^[^\+]+\+','') FROM dataset.vw2_delas WHERE graph_id LIKE '%+%'` shows these suffixes:
    Dem, PosXTra, Art+Ind, Ind, Pr, Pos, Art+Def, Num, Tra, Pes, DemXInd, Int, Rel.
    No SIGL.
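
To make clear what that query computes, here is a rough Python equivalent of the `regexp_replace` step: it drops everything up to and including the first `+` and collects the distinct remainders. The helper names are hypothetical, and the prefix `X001` in the usage lines is invented for illustration, not a graph_id from the dataset.

```python
import re


def suffix(graph_id):
    # Drop everything up to and including the first '+',
    # so 'N004+Pr' becomes 'Pr'.
    return re.sub(r"^[^+]+\+", "", graph_id)


def distinct_suffixes(graph_ids):
    # Equivalent of SELECT DISTINCT ... WHERE graph_id LIKE '%+%'.
    return sorted({suffix(g) for g in graph_ids if "+" in g})


if __name__ == "__main__":
    # 'X001' is a made-up prefix used only to show a compound suffix.
    print(distinct_suffixes(["N004+Pr", "N104+Pr", "X001+Art+Ind", "N001"]))
```

Because only the first `+` is consumed, compound suffixes such as Art+Ind and Art+Def stay intact, which matches the list above.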
