Unreliable Entities #43

Closed
x-fran opened this issue Sep 1, 2015 · 26 comments
@x-fran
Member

x-fran commented Sep 1, 2015

I'm testing e-Entity to find a way to get the most reliable entities from the content as fast as possible.

Let's take this well-known piece of text as an example.

$content = "
        Madrid (/məˈdrɪd/, Spanish: [maˈðɾið], locally: [maˈðɾiθ, -ˈðɾi]) is a south-western European city and the
        capital and largest municipality of Spain. The population of the city is almost 3.2 million[4] and that of
        the Madrid metropolitan area, around 7 million. It is the third-largest city in the European Union, after
        London and Berlin, and its metropolitan area is the third-largest in the European Union after Paris and
        London.[5][6][7][8] The city spans a total of 604.3 km2 (233.3 sq mi).[9]
        The city is located on the Manzanares River in the centre of both the country and the Community of Madrid
        (which comprises the city of Madrid, its conurbation and extended suburbs and villages); this community
        is bordered by the autonomous communities of Castile and León and Castile-La Mancha. As the capital city of
        Spain, seat of government, and residence of the Spanish monarch, Madrid is also the political, economic and
        cultural centre of Spain.[10] The current mayor is Manuela Carmena from Ahora Madrid.
        The Madrid urban agglomeration has the third-largest GDP[11] in the European Union and its influences
        in politics, education, entertainment, environment, media, fashion, science, culture, and the arts all
        contribute to its status as one of the world's major global cities.[12][13] Due to its economic output,
        high standard of living, and market size, Madrid is considered the major financial centre of Southern
        Europe[14][15] and the Iberian Peninsula; it hosts the head offices of the vast majority of the major
        Spanish companies, such as Telefónica, Iberia and Repsol. Madrid is the 17th most livable city in the
        world according to Monocle magazine, in its 2014 index.[16][17]
        Madrid houses the headquarters of the World Tourism Organization (WTO), belonging to the United Nations
        Organization (UN), the SEGIB, the Organization of Ibero-American States (OEI), and the Public Interest
        Oversight Board (PIOB). It also hosts major international regulators of Spanish: the Standing Committee
        of the Association of Spanish Language Academies, headquarters of the Royal Spanish Academy (RAE), the
        Cervantes Institute and the Foundation of Urgent Spanish (Fundéu BBVA). Madrid organizes fairs such as
        FITUR,[18] ARCO,[19] SIMO TCI[20] and the Cibeles Madrid Fashion Week.[21]
        While Madrid possesses a modern infrastructure, it has preserved the look and feel of many of its historic
        neighbourhoods and streets. Its landmarks include the Royal Palace of Madrid; the Royal Theatre with its
        restored 1850 Opera House; the Buen Retiro Park, founded in 1631; the 19th-century National Library building
        (founded in 1712) containing some of Spain's historical archives; a large number of national museums,[22]
        and the Golden Triangle of Art, located along the Paseo del Prado and comprising three art museums:
        Prado Museum, the Reina Sofía Museum, a museum of modern art, and the Thyssen-Bornemisza Museum, which
        completes the shortcomings of the other two museums.[23] Cibeles Palace and Fountain have become the
        monument symbol of the city.[24][25][26]
        Madrid is home to two world-famous football clubs, Real Madrid and Atlético de Madrid.
        ";

Sending this text as is, the response shows the issue we discussed in #41, and we get 48 entities back:

array (size=48)
  0 => string 'Spain' (length=5)
  1 => string 'Madrid' (length=6)
  2 => string 'European Union' (length=14)
  3 => string 'Southern
        Europe' (length=23)
  4 => string 'Iberian Peninsula' (length=17)
  5 => string 'Spanish' (length=7)
  6 => string 'Telefónica' (length=11)
  7 => string 'Iberia' (length=6)
  8 => string 'Repsol' (length=6)
  9 => string 'Monocle' (length=7)
  10 => string 'World Tourism Organization' (length=26)
  11 => string 'WTO' (length=3)
  12 => string 'United Nations
        Organization' (length=35)
  13 => string '(' (length=1)
  14 => string 'UN' (length=2)
  15 => string 'Organization of Ibero-American States' (length=37)
  16 => string 'Public Interest
        Oversight Board' (length=39)
  17 => string 'Royal Spanish Academy' (length=21)
  18 => string 'Cervantes Institute' (length=19)
  19 => string 'Foundation of Urgent Spanish' (length=28)
  20 => string 'Fundéu BBVA' (length=12)
  21 => string 'FITUR' (length=5)
  22 => string 'ARCO' (length=4)
  23 => string 'SIMO' (length=4)
  24 => string 'Madrid metropolitan area' (length=24)
  25 => string 'Cibeles Madrid Fashion Week' (length=27)
  26 => string 'Royal Palace of Madrid' (length=22)
  27 => string 'Royal Theatre' (length=13)
  28 => string 'Opera House' (length=11)
  29 => string 'Buen Retiro Park' (length=16)
  30 => string 'National Library' (length=16)
  31 => string 'Golden Triangle of Art' (length=22)
  32 => string 'Paseo del Prado' (length=15)
  33 => string 'Prado Museum' (length=12)
  34 => string 'Reina Sofía Museum' (length=19)
  35 => string 'Thyssen-Bornemisza Museum' (length=25)
  36 => string 'Cibeles Palace' (length=14)
  37 => string 'Fountain' (length=8)
  38 => string 'Real Madrid' (length=11)
  39 => string 'Atlético de Madrid' (length=19)
  40 => string 'London' (length=6)
  41 => string 'Berlin' (length=6)
  42 => string 'Paris' (length=5)
  43 => string 'Manzanares River' (length=16)
  44 => string 'Community of Madrid' (length=19)
  45 => string 'Castile and León' (length=17)
  46 => string 'Castile-La Mancha' (length=17)
  47 => string 'European' (length=8)

A lot of entities, right? But we also get a lot of things that we don't need, e.g.:

...
 13 => string '(' (length=1)
...
 20 => string 'Fundéu BBVA' (length=12)
...

What I did was clean up the content:

        $charsToRemoveFromContent = [
            "\n", "\r", "(", ")", "{", "}", "[", "]", "!", "?", "¡", "¿", ".", ",", '"', ":", ";", "=", "*", "\\", "#", "+", "/",
        ];
        // Clean up html tags, non-alphanumeric chars and blank spaces
        $content = preg_replace('/\s+/', ' ', strip_tags(str_replace($charsToRemoveFromContent, " ", htmlspecialchars($content))));
        // Remove non-ascii chars non-printables
        $content = preg_replace('/[[:^print:]]/', '', $content);
        // Remove numbers from string
        $content = preg_replace("/[0-9]/", "", $content);
        // Remove invalid UTF-8 chars
        $content = iconv("UTF-8","UTF-8//IGNORE",$content);

Note: I'm not proud of this code but hey I'm just playing around. :)

Now the content I send to FREME NER looks like this:

Madrid mdrd Spanish mai locally mai -i is a south-western European city and the capital and largest municipality of Spain The population of the city is almost million and that of the Madrid metropolitan area around million It is the third-largest city in the European Union after London and Berlin and its metropolitan area is the third-largest in the European Union after Paris and London The city spans a total of km sq mi The city is located on the Manzanares River in the centre of both the country and the Community of Madrid which comprises the city of Madrid its conurbation and extended suburbs and villages this community is bordered by the autonomous communities of Castile and Len and Castile-La Mancha As the capital city of Spain seat of government and residence of the Spanish monarch Madrid is also the political economic and cultural centre of Spain The current mayor is Manuela Carmena from Ahora Madrid The Madrid urban agglomeration has the third-largest GDP in the European Union and its influences in politics education entertainment environment media fashion science culture and the arts all contribute to its status as one of the world's major global cities Due to its economic output high standard of living and market size Madrid is considered the major financial centre of Southern Europe and the Iberian Peninsula it hosts the head offices of the vast majority of the major Spanish companies such as Telefnica Iberia and Repsol Madrid is the th most livable city in the world according to Monocle magazine in its index Madrid houses the headquarters of the World Tourism Organization WTO belonging to the United Nations Organization UN the SEGIB the Organization of Ibero-American States OEI and the Public Interest Oversight Board PIOB It also hosts major international regulators of Spanish the Standing Committee of the Association of Spanish Language Academies headquarters of the Royal Spanish Academy RAE the Cervantes Institute and the Foundation of Urgent Spanish 
Fundu BBVA Madrid organizes fairs such as FITUR ARCO SIMO TCI and the Cibeles Madrid Fashion Week While Madrid possesses a modern infrastructure it has preserved the look and feel of many of its historic neighbourhoods and streets Its landmarks include the Royal Palace of Madrid the Royal Theatre with its restored Opera House the Buen Retiro Park founded in the th-century National Library building founded in containing some of Spain's historical archives a large number of national museums and the Golden Triangle of Art located along the Paseo del Prado and comprising three art museums Prado Museum the Reina Sofa Museum a museum of modern art and the Thyssen-Bornemisza Museum which completes the shortcomings of the other two museums Cibeles Palace and Fountain have become the monument symbol of the city Madrid is home to two world-famous football clubs Real Madrid and Atltico de Madrid

The response from FREME NER:

array (size=33)
  0 => string 'European Union' (length=14)
  1 => string 'Due' (length=3)
  2 => string 'Madrid' (length=6)
  3 => string 'Spanish' (length=7)
  4 => string 'Southern Europe' (length=15)
  5 => string 'Iberian Peninsula' (length=17)
  6 => string 'Monocle' (length=7)
  7 => string 'Cervantes Institute' (length=19)
  8 => string 'FITUR' (length=5)
  9 => string 'ARCO' (length=4)
  10 => string 'SIMO TCI' (length=8)
  11 => string 'Opera House' (length=11)
  12 => string 'Buen Retiro Park' (length=16)
  13 => string 'National Library' (length=16)
  14 => string 'Spain' (length=5)
  15 => string 'Golden Triangle of Art' (length=22)
  16 => string 'Paseo del Prado' (length=15)
  17 => string 'Prado Museum' (length=12)
  18 => string 'Thyssen-Bornemisza Museum' (length=25)
  19 => string 'Cibeles Palace' (length=14)
  20 => string 'Fountain' (length=8)
  21 => string 'London' (length=6)
  22 => string 'Real Madrid' (length=11)
  23 => string 'Berlin' (length=6)
  24 => string 'Paris' (length=5)
  25 => string 'The city' (length=8)
  26 => string 'Manzanares River' (length=16)
  27 => string 'Community of Madrid' (length=19)
  28 => string 'European' (length=8)
  29 => string 'Castile' (length=7)
  30 => string 'Len' (length=3)
  31 => string 'Castile-La Mancha' (length=17)
  32 => string 'Spain  The' (length=10)

An array of 33 items instead of 48, containing only clean and more or less reliable entities.
This will also be much faster for FREME NER to process and, for the end users, will take less storage space if needed.

Imagine that I want to use "Fundéu" or ")" to dynamically build a URL,
e.g. "example.com/Fundéu?param=)".

This may also be a security issue.
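A minimal sketch of the URL concern, assuming the entity string ends up as a path segment and as a query value. `buildEntityUrl` and the `example.com` host are hypothetical; `rawurlencode` and `http_build_query` are standard PHP functions. Percent-encoding the entity keeps characters such as `)` or `/` from changing the structure of the URL:

```php
<?php
// Hypothetical helper: build a URL from an entity name returned by the NER service.
function buildEntityUrl(string $entity, string $param): string
{
    // rawurlencode() percent-encodes everything except letters, digits and "-_.~"
    $path = rawurlencode($entity);
    // http_build_query() encodes the query value
    $query = http_build_query(['param' => $param]);

    return 'https://example.com/' . $path . '?' . $query;
}

echo buildEntityUrl("Fund\u{E9}u", ')'), "\n";
// prints: https://example.com/Fund%C3%A9u?param=%29
```

Encoding at the point where the URL is built means the entity text itself does not have to be altered before it is stored.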

@x-fran x-fran added the bug label Sep 1, 2015
@m1ci
Contributor

m1ci commented Sep 7, 2015

  1. FREME NER and other services process the data that clients send. If we did data cleansing, we might break other tools that expect the output text to have the same length as the input text.
    In fact, Fundéu is incorrectly encoded on the client side, so we can't do anything about it.

  2. As for the "reliable entities": FREME NER does not perform entity ranking at the moment. It only performs entity spotting, linking and classification.

@jnehring
Member

jnehring commented Sep 8, 2015

All text sent to FREME should be UTF-8 encoded. I created an issue to put that in the documentation: freme-project/freme-project.github.io#55
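On the client side, the UTF-8 requirement could be checked before each request. This is only a sketch: `ensureUtf8` is a hypothetical helper, and the ISO-8859-1 fallback is an assumption about what a legacy source might use, not something stated in this thread.

```php
<?php
// Hypothetical helper: make sure a string is valid UTF-8 before sending it.
// $fallbackEncoding is an assumption about the encoding of legacy input.
function ensureUtf8(string $text, string $fallbackEncoding = 'ISO-8859-1'): string
{
    // preg_match() with the /u modifier fails on invalid UTF-8 input
    if (preg_match('//u', $text) === 1) {
        return $text;
    }
    // Not valid UTF-8: reinterpret the bytes using the assumed legacy encoding
    $converted = iconv($fallbackEncoding, 'UTF-8', $text);
    if ($converted === false) {
        throw new InvalidArgumentException('Input could not be converted to UTF-8');
    }

    return $converted;
}

$latin1 = "Fund\xE9u BBVA"; // "é" stored as a single ISO-8859-1 byte
echo ensureUtf8($latin1), "\n";
```

Text that is already valid UTF-8 passes through unchanged, so the check is safe to apply to every request.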

@koidl

koidl commented Sep 11, 2015

Hi

We have a problem with the e-entity service.

At the moment ')' shows up in the dashboard; see the attached screenshot.

Do we know why that is?

kevin
screen shot 2015-09-11 at 10 28 21

@m1ci
Contributor

m1ci commented Sep 11, 2015

> Do we know why that is?

Because ")" was spotted as entity.

@koidl

koidl commented Sep 11, 2015

Is it one?

@m1ci
Contributor

m1ci commented Sep 11, 2015

No, it is not; it's a mistake. Please provide an example text so we can track down and address the issue.

@koidl

koidl commented Sep 11, 2015

Thanks - we are working on it. We will send examples shortly

@koidl

koidl commented Sep 11, 2015

One example:

http://spooool.ie/news/take-two/11958-take-two-sam-smiths-bond-theme-room-trailer

[{"tag":"Third Man Records","score":1},{"tag":"Jack White","score":1},{"tag":"Room","score":1},{"tag":"Brie Larson","score":1},{"tag":"Lenny Abrahamson","score":1},{"tag":"Emma Donoghue","score":1},{"tag":"Radiohead","score":1},{"tag":"James Bond","score":1},{"tag":"Spectre","score":1},{"tag":"Sam Smith","score":1},{"tag":""Writing","score":1},{"tag":"On The Wall","score":1}]

The problematic one here is

{"tag":""Writing","score":1}

Do you get that too?

@m1ci
Contributor

m1ci commented Sep 11, 2015

please send us just the text - preferably in a doc. Thanks!

@koidl

koidl commented Sep 11, 2015

Unfortunately we don't store the text in the DB, only in SOLR, which is super hard to pull out. It comes from the WP plugin, which only sends the text in the body tag, plus the title of the page. In any case, is FREME not also using URLs now, which should bring the same problem? Shall I mail you the body and title text from the examples we find? I'm also not sure this will fix it. We get '/' in some cases, then '(' in others... should we not think of some kind of filter for special characters? Also, SQL query injection might be possible?

@m1ci
Contributor

m1ci commented Sep 11, 2015

> Unfortunately we don't store the text in the DB, only in SOLR, which is super hard to pull out.

I don't know your schema, but via the SOLR admin interface you can query the exact document. A query will look something like: url:"http://spooool.ie/news/take-two/11958-take-two-sam-smiths-bond-theme-room-trailer"

> In any case, is FREME not also using URLs now, which should bring the same problem?

FREME NER processes only plain text. Markup is not welcome and might influence the entity spotting phase.
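For a client that starts from HTML, such as a page body, the markup can be reduced to plain text before the request is made. A minimal sketch using only standard PHP functions; `htmlToPlainText` is a hypothetical name, and real pages may need a proper HTML parser instead:

```php
<?php
// Hypothetical helper: reduce an HTML fragment to plain text.
function htmlToPlainText(string $html): string
{
    // Remove tags first, while they are still recognisable as tags
    $text = strip_tags($html);
    // Turn entities such as &amp; or &oacute; back into characters
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // Collapse runs of whitespace (the /u modifier keeps UTF-8 intact)
    $collapsed = preg_replace('/\s+/u', ' ', $text);

    return trim($collapsed ?? $text);
}

echo htmlToPlainText('<p>Telef&oacute;nica, Iberia &amp; <b>Repsol</b></p>'), "\n";
// prints: Telefónica, Iberia & Repsol
```

Note that `strip_tags` joins the text of adjacent elements without inserting a space, so block-level markup may need extra handling.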

> Shall I mail you the body and title text from the examples we find?

FREME NER, as well as (I think) e-Terminology from Tilde, expects plain text. So please, just send us the text.

> We get '/' in some cases, then '(' in others... should we not think of some kind of filter for special characters?

Let's first find such cases.

> Also, SQL query injection might be possible?

On which side? I don't understand.

@jnehring
Member

> Also, SQL query injection might be possible?

No user-submitted data reaches the MySQL database. Right now we use SQL only for user access tokens, so it is almost impossible for FREME to be vulnerable to SQL injection via text data sent to FREME NER.

@Xfran Maybe you mean SOLR query injections instead of SQL injections? And did you find a (potential) security issue or are you just asking a general question?

@koidl

koidl commented Sep 11, 2015

I will try to extract some pages - SOLR is messy but I will do my best

SQL query injection would be on the FREME NER side. For example, can a user inject a SQL query that deletes a SOLR core? Reading this: http://www.matrixgroup.net/snackoclock/2013/01/getting-the-most-out-of-solr/#sthash.SieaWK9f.dpuf SOLR is not affected by SQL query injection.

@jnehring
Member

That would be a SOLR query injection and not a SQL query injection. @nilesh-c I hope you do proper escaping of all data sent to SOLR to avoid such vulnerabilities?

@koidl

koidl commented Sep 11, 2015

Yes @jnehring, that's right, SOLR specific... as long as there is no DB or anything else picking up the sent content?

@m1ci
Contributor

m1ci commented Sep 11, 2015

> That would be a SOLR query injection and not a SQL query injection. @nilesh-c I hope you do proper escaping of all data sent to SOLR to avoid such vulnerabilities?

We use only the entity surface forms when querying Solr; matching a surface form that contains dangerous code is, IMO, nearly impossible.
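The thread does not show how FREME NER actually builds its Solr queries. Purely as an illustration of the escaping discussed here, a surface form could be escaped before it is placed into a query string. `escapeSolrTerm` is a hypothetical helper; the character list follows the special characters of the Solr standard query parser.

```php
<?php
// Hypothetical helper: backslash-escape characters that have a special meaning
// in the Lucene/Solr standard query syntax.
function escapeSolrTerm(string $term): string
{
    // "&" and "|" are included because of the "&&" and "||" operators
    $special = ['\\', '+', '-', '&', '|', '!', '(', ')', '{', '}',
                '[', ']', '^', '"', '~', '*', '?', ':', '/'];

    $map = [];
    foreach ($special as $char) {
        $map[$char] = '\\' . $char;
    }

    // strtr() with an array replaces in a single pass, so the backslashes
    // it adds are never escaped a second time
    return strtr($term, $map);
}

echo escapeSolrTerm('Castile-La Mancha'), "\n"; // prints: Castile\-La Mancha
echo escapeSolrTerm('('), "\n";                 // prints: \(
```

Where a client library offers parameter binding or its own escaping helper, using that is usually preferable to a hand-written function.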

@johnmcauley

Hey all,

I am getting a lot of this spotting also. It's on Insider Monkey, which is an absolute nightmare.

Here are two examples; I can give you about 100,000 though. Strangely, I don't get this from http://api.freme-project.eu/doc/0.2/#!/e-Entity/execute_0, only on the dev API.


Example 1

[
http://www.insidermonkey.com/blog/mondelez-international-inc-mdlz-kraft-foods-group-inc-krft-hershey-co-hsy-1-huge-reason-to-diversify-and-buy-this-global-giant-97297/2/,See
All

The confectionery category is typically less threatened by private-label
competition because loyal consumers are willing to pay up for their
favorite sweet treats. Higher-margin confectionery also enjoys faster
growth rates. Mondelez primarily competes with big-branded leaders Hershey
and Switzerland-based Nestle (OTCBB: NSRGY ) in this segment.

While Nestle boasts a great deal of presence internationally and presents a
big threat to Mondelez?s European business, Hershey Co (NYSE:HSY)?doesn?t
even come close to its geographic diversity. The more than century-old
candymaker derives only 16% of its revenues internationally. But Hershey
has recently ramped up spending to boost its international presence. The
maker of Kit Kat and Reese?s enjoyed a very successful 2012, with sales up
more than 9%. It did so by raising prices and suffering a very small hit to
volumes.

On the other hand, Mondelez International Inc (NASDAQ:MDLZ)?s cookie and
cracker brands, which include Nabisco and Oreo, are more susceptible to
private-label competition, particularly within Europe, where consumer
acceptance of private labels is particularly high. Aside from private-label
threats, Kellogg Company (NYSE: K ) is a major competitor in these
divisions with its Famous Amos, Keebler, and Cheez-It brands. Even though
Kellogg derives only one-third of its sales internationally, look for the
company to experience continued growth in its established Latin American,
European, and Asian markets, while likely pursuing acquisitions in other
emerging markets.

Foolish bottom line

Without a doubt, Mondelez faces challenges. But its global diversification,
ample international growth opportunities, and desirable product mix offer
it plenty of opportunities. And give its competitors a lot to chew on.

Fool contributor Nicole Seghetti owns shares of Mondelez International. The
Motley Fool recommends Coca-Cola and H.J. Heinz.

Copyright ? 1995 ? 2013 The Motley Fool, LLC. All rights reserved. The
Motley Fool has a disclosure policy .

]



Example 2

[
http://www.insidermonkey.com/blog/hedge-funds-are-betting-on-wesbanco-inc-wsbc-171429/?singlepage=1,By
Asma UL Husna in News

Published: June 14, 2013 at 1:23 pm

Is WesBanco, Inc. (NASDAQ: WSBC ) a buy right now? Prominent investors are
getting more optimistic. The number of long hedge fund positions moved up
by 1 in recent months.

In the financial world, there are dozens of indicators market participants
can use to analyze Mr. Market. A pair of the best are hedge fund and
insider trading sentiment. At Insider Monkey, our research analyses have
shown that, historically, those who follow the top picks of the best fund
managers can beat their index-focused peers by a very impressive amount (
see just how much ).

Just as important, optimistic insider trading activity is another way to
break down the investments you?re interested in. There are lots of reasons
for a bullish insider to sell shares of his or her company, but only one,
very obvious reason why they would behave bullishly. Plenty of empirical
studies have demonstrated the valuable potential of this strategy if
investors know what to do ( learn more here ).

With all of this in mind, we?re going to take a peek at the recent action
regarding WesBanco, Inc. (NASDAQ: WSBC ).

How are hedge funds trading WesBanco, Inc. (NASDAQ:WSBC)?

At Q1?s end, a total of 9 of the hedge funds we track were long in this
stock, a change of 13% from the previous quarter.?As one would reasonably
expect, some big names have been driving this bullishness. Citadel
Investment Group , managed by Ken Griffin, initiated the largest position
in WesBanco, Inc. (NASDAQ:WSBC). Citadel Investment Group had 0.6 million
invested in the company at the end of the quarter.

What do corporate executives and insiders think about WesBanco, Inc.
(NASDAQ:WSBC)?

Bullish insider trading is particularly usable when the primary stock in
question has experienced transactions within the past six months. Over the
latest 180-day time period, WesBanco, Inc. (NASDAQ:WSBC) has experienced
zero unique insiders buying, and 2 insider sales ( see the details of
insider trades here ).

Let?s go over hedge fund and insider activity in other stocks similar to
WesBanco, Inc. (NASDAQ:WSBC). These stocks are Eagle Bancorp, Inc.
(NASDAQ: EGBN ), The Bancorp, Inc. (NASDAQ: TBBK ), SCBT Financial
Corporation (NASDAQ: SCBT ), City Holding Company (NASDAQ: CHCO ), and
United Community Banks Inc (NASDAQ: UCBI ). This group of stocks are the
members of the regional ? mid-atlantic banks industry and their market
caps match WSBC?s market cap.

Company Name

]


John McAuley

@m1ci
Contributor

m1ci commented Sep 11, 2015

Can you please create a .txt file for each example so we can reproduce the problem?

@johnmcauley

Will do, it will be later on.

j


@koidl

koidl commented Sep 11, 2015

Can't get access to SOLR from here; it will be early next week. Just wondering if a special-character filter might make more sense? Not sure we will be able to find every faulty character. Using the categories might also reduce this problem a lot.

@jnehring
Member

@Xfran I tried your example using the API documentation. You mentioned two problems:

13 => string '(' (length=1)

This seems to be a mistake in e-Entity. Maybe it's better to generally ignore named entities of length 1 than to delete tokens from the text. E.g. the character "." might be used by named entity recognition.
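Ignoring length-1 entities can be done on the client as a post-filter on the response, leaving the input text untouched. A minimal sketch, assuming the entities arrive as a flat array of strings like the dumps in this thread; `filterShortEntities` and its `$minLength` parameter are hypothetical:

```php
<?php
// Hypothetical post-filter: drop entities shorter than $minLength characters.
function filterShortEntities(array $entities, int $minLength = 2): array
{
    $kept = array_filter($entities, function (string $entity) use ($minLength): bool {
        // iconv_strlen() counts characters rather than bytes for UTF-8 input;
        // it returns false for invalid input, which is also dropped here
        $length = iconv_strlen(trim($entity), 'UTF-8');

        return $length !== false && $length >= $minLength;
    });

    return array_values($kept); // re-index 0..n-1
}

print_r(filterShortEntities(['WTO', '(', 'UN', 'Fountain']));
// keeps 'WTO', 'UN' and 'Fountain'
```

Two-letter entities such as "UN" survive with the default threshold, which is why the cut-off is a parameter rather than a fixed value.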

...
20 => string 'Fundéu BBVA' (length=12)
...

The special characters look good in the output of the API tester. Maybe the special characters get broken on the client side?
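The dump itself gives a hint here. The `length=` value in such dumps is a byte count, and `length=12` is what a correctly UTF-8 encoded "Fundéu BBVA" would have (11 characters, with "é" taking two bytes). That is consistent with the bytes being fine and only being displayed with the wrong charset. A short sketch using standard functions only:

```php
<?php
$entity = "Fund\u{E9}u BBVA"; // correctly encoded UTF-8, "é" is two bytes

echo strlen($entity), "\n";                 // byte count, prints: 12
echo iconv_strlen($entity, 'UTF-8'), "\n";  // character count, prints: 11

// Misreading the UTF-8 bytes as ISO-8859-1 turns "é" into the two characters "é"
$misread = iconv('ISO-8859-1', 'UTF-8', $entity);
echo $misread, "\n";                        // prints: Fundéu BBVA
```

If the page that renders the dump declares UTF-8 (for example via its `Content-Type` header), the same bytes should display as "Fundéu BBVA".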

@koidl

koidl commented Sep 11, 2015

I will have to check when I get to SOLR

Ignoring length 1 is a good idea

Special characters happen a lot in some pages. We need to filter it somehow I guess.

@jnehring
Member

I think this issue can be divided into two parts:

  1. Wrongly detected entities like (. I suggest moving this into a new issue.
  2. Broken special chars in the response of FREME NER. I could not reproduce this bug, so I assume there is a bug in your client software (see my last comment). @Xfran can you please look into that?

Then we should close this issue.

@jnehring jnehring assigned x-fran and unassigned m1ci Sep 15, 2015
@m1ci
Contributor

m1ci commented Sep 15, 2015

> I suggest moving this into a new issue.

+1

> Broken special chars in the response of FREME NER. I could not reproduce this bug, so I assume there is a bug in your client software (see my last comment). @Xfran can you please look into that?

Without concrete data we can't help.

@x-fran
Member Author

x-fran commented Sep 16, 2015

The content I used for testing is actually a copy/paste from Wikipedia. Just put Madrid in the search field and you will have exactly the same data/content.

We now clean the content before sending it to FREME NER, and we also clean up and get rid of any "strange" chars that come back in the entity name before using it.
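The thread does not include the final cleanup code. As an illustration only, a Unicode-aware cleanup of returned entity names could look like the sketch below; `cleanEntityName` is a hypothetical name. Unlike a byte-oriented `[[:^print:]]` pattern, the `/u` modifier combined with `\p{...}` classes keeps accented letters such as "é" and "ó":

```php
<?php
// Hypothetical helper: tidy an entity name returned by the NER service.
function cleanEntityName(string $name): string
{
    // Collapse whitespace, including line breaks inside multi-line matches
    $name = preg_replace('/\s+/u', ' ', $name) ?? $name;
    // Strip leading and trailing characters that are neither letters nor digits
    $name = preg_replace('/^[^\p{L}\p{N}]+|[^\p{L}\p{N}]+$/u', '', $name) ?? $name;

    return $name;
}

$raw = ["Southern\n        Europe", '(', "Fund\u{E9}u BBVA", '"Writing'];

// Clean every name, then drop the ones that ended up empty
$clean = array_values(array_filter(
    array_map('cleanEntityName', $raw),
    static fn (string $name): bool => $name !== ''
));

print_r($clean);
// keeps 'Southern Europe', 'Fundéu BBVA' and 'Writing'
```

Because this runs on the response rather than on the input text, it does not change the offsets of the text that other tools may rely on.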

We can close the issue.

@x-fran x-fran closed this as completed Sep 16, 2015
@jnehring
Member

I created #48 because of the wrongly spotted entity (

This issue was closed.