Unreliable Entities #43
All text sent to FREME should be UTF-8 encoded. I created an issue to put that in the documentation: freme-project/freme-project.github.io#55 |
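For anyone hitting the encoding side of this, here is a minimal sketch of re-encoding text as UTF-8 before sending it. The helper name `reencode_as_utf8` is hypothetical, and the source encoding is assumed to be known on the client side; nothing here is part of FREME itself.

```python
def reencode_as_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode bytes from a known legacy encoding and re-encode them as UTF-8.

    Raises UnicodeDecodeError if the bytes are not valid in source_encoding,
    which is preferable to silently sending mojibake to the service.
    """
    return raw.decode(source_encoding).encode("utf-8")


# Example: a Latin-1 encoded string becomes a UTF-8 request body.
legacy_bytes = "Fundéu".encode("latin-1")
payload = reencode_as_utf8(legacy_bytes, "latin-1")
```

If the source encoding is unknown, guessing it is error-prone, so it is probably safer to fix the encoding where the text is first read.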
Because ")" was spotted as an entity. |
Is it one? |
No, it is not; it's a mistake. Please provide an example text so we can track and address the issue. |
Thanks - we are working on it. We will send examples shortly |
One example: http://spooool.ie/news/take-two/11958-take-two-sam-smiths-bond-theme-room-trailer [{"tag":"Third Man Records","score":1},{"tag":"Jack White","score":1},{"tag":"Room","score":1},{"tag":"Brie Larson","score":1},{"tag":"Lenny Abrahamson","score":1},{"tag":"Emma Donoghue","score":1},{"tag":"Radiohead","score":1},{"tag":"James Bond","score":1},{"tag":"Spectre","score":1},{"tag":"Sam Smith","score":1},{"tag":""Writing","score":1},{"tag":"On The Wall","score":1}] The problematic one here is {"tag":""Writing","score":1}. Do you get that too? |
Please send us just the text, preferably in a doc. Thanks! |
Unfortunately we don't store the text in the DB, only in Solr, which is super hard to pull out. It's from the WP plugin, which only sends the text in the body tag, plus the title of the page. In any case, is FREME not also using URLs now, which should bring the same problem? Shall I mail you the body and title text from the examples we find? Also, I'm not sure this will fix it. We get '/' in some cases, then '(' in others... should we not think of some kind of filter for special characters? Also, might SQL query injection be possible? |
I don't know your schema, but via the SOLR admin interface you can query the exact document. A query will look something like:
FREME NER processes only plain text. Markup is not welcome and might influence the entity spotting phase.
FREME NER and, I think, e-Terminology from Tilde as well, expect pure text. So please, just send us the text.
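Since the services expect pure text, one option on the client side is to strip markup before sending. A minimal sketch using only the Python standard library follows; `html_to_text` is a hypothetical helper, not something the WP plugin or FREME provides.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and ignores the contents of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Join chunks with spaces, then collapse runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(markup: str) -> str:
    """Return the visible text of an HTML fragment as a single line."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    return parser.text()


plain = html_to_text("<p>Jack <b>White</b> plays.</p><script>x()</script>")
```

Joining chunks with a space keeps words from adjacent block elements apart, at the cost of splitting a word that is only partly wrapped in a tag.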
Let's first find such cases.
On which side? Don't understand. |
No user-submitted data reaches the MySQL database. Right now we use SQL only for user access tokens, so it is almost impossible for FREME to be vulnerable to SQL injection from text data sent to FREME NER. @Xfran Maybe you mean SOLR query injections instead of SQL injections? And did you find a (potential) security issue, or are you just asking a general question? |
I will try to extract some pages. SOLR is messy but I will do my best. SQL query injection would be on the FREME NER side. For example, can a user inject a SQL query that deletes a SOLR core? According to this: http://www.matrixgroup.net/snackoclock/2013/01/getting-the-most-out-of-solr/#sthash.SieaWK9f.dpuf SOLR is not affected by SQL query injection. |
That would be a SOLR query injection and not a SQL query injection. @nilesh-c I hope you do proper escaping of all data sent to SOLR to avoid such vulnerabilities? |
Yes @jnehring, that's right, SOLR-specific... as long as there is no DB or anything else picking up the sent content? |
We use only the entity surface forms when querying Solr; matching a surface form that contains dangerous code is IMO nearly impossible. |
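To illustrate what escaping data sent to Solr could look like, here is a small sketch that backslash-escapes the characters the standard Lucene query parser treats as syntax. The function `escape_solr_term` is hypothetical, and this is not how FREME NER builds its queries; that code is not shown in this thread.

```python
# Characters with special meaning in the standard Lucene/Solr query syntax.
# '&' and '|' are escaped individually, which also covers '&&' and '||'.
_SOLR_SPECIAL = set('+-&|!(){}[]^"~*?:\\/')


def escape_solr_term(term: str) -> str:
    """Backslash-escape query syntax characters in a single search term."""
    return "".join("\\" + ch if ch in _SOLR_SPECIAL else ch for ch in term)


# A surface form such as ")" or '"Writing' is then treated as literal text.
escaped_paren = escape_solr_term(")")
escaped_quote = escape_solr_term('"Writing')
```

Where the client library supports it, passing user text as a separate query parameter is generally a more robust choice than string escaping.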
Hey all, I am getting a lot of this spotting also. It's on insider monkey which is Here are two examples: I can give you about 100,000 though. Strangely I don't get this from Example 1 [ The confectionery category is typically less threatened by private-label While Nestle boasts a great deal of presence internationally and presents a On the other hand, Mondelez International Inc (NASDAQ:MDLZ)?s cookie and Foolish bottom line Without a doubt, Mondelez faces challenges. But its global diversification, Fool contributor Nicole Seghetti owns shares of Mondelez International. The Copyright ? 1995 ? 2013 The Motley Fool, LLC. All rights reserved. The ] Example 2 [ Published: June 14, 2013 at 1:23 pm Is WesBanco, Inc. (NASDAQ: WSBC ) a buy right now? Prominent investors are In the financial world, there are dozens of indicators market participants Just as important, optimistic insider trading activity is another way to With all of this in mind, we?re going to take a peek at the recent action How are hedge funds trading WesBanco, Inc. (NASDAQ:WSBC)? At Q1?s end, a total of 9 of the hedge funds we track were long in this What do corporate executives and insiders think about WesBanco, Inc. Bullish insider trading is particularly usable when the primary stock in Let?s go over hedge fund and insider activity in other stocks similar to Company Name ] On 11 September 2015 at 10:57, Milan Dojčinovski notifications@github.com
John McAuley |
Can you please create a .txt file for each example so we can reproduce the problem? |
Will do, it will be later on. j
|
Can't get access to SOLR from here. It will be early next week. Just wondering whether a special-characters filter might make more sense? I'm not sure we will be able to find every faulty character. Using the categories might also reduce this problem a lot. |
@Xfran I tried your example using API documentation. You mentioned two problems:
This seems to be a mistake in e-Entity. Maybe it's better to generally ignore named entities of length 1 than to delete tokens from the text. E.g. the character ... The special characters look good in the output of the API tester. Maybe the special characters get broken on the client side? |
I will have to check when I get to SOLR. Ignoring length 1 is a good idea. Special characters happen a lot in some pages. We need to filter them somehow, I guess. |
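The "ignore length 1" idea could look roughly like this on the client side. This is a sketch that assumes the `{"tag": ..., "score": ...}` list format from the example earlier in the thread; the helper names and the extra "at least one letter or digit" rule are my own assumptions.

```python
def is_plausible_entity(tag: str, min_length: int = 2) -> bool:
    """Reject very short tags and tags with no letter or digit at all."""
    stripped = tag.strip()
    return len(stripped) >= min_length and any(ch.isalnum() for ch in stripped)


def filter_entities(entities):
    """Keep only the entries whose 'tag' passes is_plausible_entity."""
    return [e for e in entities if is_plausible_entity(e.get("tag", ""))]


spotted = [
    {"tag": ")", "score": 1},
    {"tag": "Jack White", "score": 1},
    {"tag": "/", "score": 1},
]
kept = filter_entities(spotted)
```

Filtering the response this way leaves the input text untouched, so character offsets reported by the service would still line up with what was sent.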
I think this issue can be divided into two parts:
Then we should close this issue. |
+1
Without concrete data we can't help. |
The content I used for testing actually is a copy/paste from wikipedia. Just put Madrid in search field. We now clean the content before sending it to FREME NER and we also clean up and get rid of any "strange" chars that we can get back in the entity name before using it. We can close the issue. |
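For reference, cleaning "strange" characters off a returned entity name might be sketched as below. This is not the actual cleanup used by the commenter (that code is not shown here); `clean_entity_name` is a hypothetical helper that trims punctuation, symbols, control characters and whitespace from both ends only.

```python
import unicodedata


def _is_edge_junk(ch: str) -> bool:
    """True for whitespace and for punctuation (P*), symbol (S*) or control (C*) characters."""
    return ch.isspace() or unicodedata.category(ch)[0] in ("P", "S", "C")


def clean_entity_name(name: str) -> str:
    """Trim junk characters from both ends, leaving inner characters alone."""
    start, end = 0, len(name)
    while start < end and _is_edge_junk(name[start]):
        start += 1
    while end > start and _is_edge_junk(name[end - 1]):
        end -= 1
    return name[start:end]


cleaned = clean_entity_name('"Writing')
```

One caveat: names that legitimately end in punctuation lose it, e.g. "WesBanco, Inc." becomes "WesBanco, Inc", so this may be too aggressive for some data.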
I created #48 because of the wrongly spotted entity |
I'm testing e-Entity to find a way to get the most reliable entities from the content as fast as possible.
Let's take as an example this well-known piece of text.
Sending this text as it is, the response shows the issue that we've discussed in #41, and we get 48 entities back.
A lot of entities, right? But we get a lot of things that we don't need, e.g.:
What I did is clean up the content.
Note: I'm not proud of this code but hey I'm just playing around. :)
Now the content I send to FREME NER is looking like this:
The response from FREME NER:
A 33-item array instead of 48, containing only clean and more or less reliable entities.
This will also be much faster to process, both for FREME NER and for the end users, and it needs less storage space if the results are stored.
Imagine that I want to use "Fundéu" or ")" to dynamically build a URL.
E.g. "example.com/Fundéu?param=)"
This may be a security issue also.
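On the URL point, percent-encoding the entity before it goes into a URL avoids most of the trouble. A minimal sketch with Python's standard `urllib.parse` follows; the base URL and the `param` name come from the example above, and `build_entity_url` is a hypothetical helper.

```python
from urllib.parse import quote, urlencode


def build_entity_url(base: str, entity: str, param_value: str) -> str:
    """Build base/<entity>?param=<value> with both parts percent-encoded."""
    path_segment = quote(entity, safe="")          # also encodes '/'
    query = urlencode({"param": param_value})      # encodes the query value
    return f"{base}/{path_segment}?{query}"


url = build_entity_url("https://example.com", "Fundéu", ")")
# url == "https://example.com/Fund%C3%A9u?param=%29"
```

Encoding makes the URL well-formed, but it does not make a wrongly spotted entity such as ")" meaningful, so filtering the entities is still needed.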