Dealing with stop words and NER in multilingual texts #114

Open
mialondon opened this Issue Aug 5, 2013 · 30 comments

6 participants
Contributor

mialondon commented Aug 5, 2013

Is the current workflow: 'detect language, apply appropriate stopwords' or 'apply generic multilingual stopwords'? If it's the former, can we detect multiple languages and apply the appropriate lists of stopwords?

As this conversation hints (https://twitter.com/wilkohardenberg/status/363677752391516161), many scholars work in two or more languages, so ideally we could cope with returning entities and tokens for at least two languages, and also apply stop words.

The trickiness of dealing with this might also be a call for more randomness in the way query terms are mixed, so people can refresh the results and see different terms applied.


rlskoeser commented Aug 5, 2013

Current workflow is to detect language using python guess-language and then select appropriate stopwords if it's a language nltk has stopwords for. I hadn't thought about mixed languages, though. Might be helpful to have some sample mixed-language text so we can see what guess-language thinks of it, and write some tests.
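That detect-then-filter workflow can be sketched in miniature. This is only an illustration: tiny hand-rolled stopword lists stand in for NLTK's, a crude word-overlap heuristic stands in for the guess-language package, and `guess_lang` / `content_words` are hypothetical names, not the project's code.

```python
# Sketch of the detect-then-filter workflow: guess the language,
# then drop that language's stopwords. Toy lists only.
STOPWORDS = {
    "en": {"the", "and", "of", "to", "a", "in", "it", "is"},
    "it": {"il", "la", "di", "e", "che", "per", "una", "deve"},
}

def guess_lang(text):
    """Pick the language whose stopword list overlaps the text most."""
    words = set(text.lower().split())
    return max(STOPWORDS, key=lambda lang: len(words & STOPWORDS[lang]))

def content_words(text):
    """Drop stopwords for the detected language, keep the rest."""
    stop = STOPWORDS[guess_lang(text)]
    return [w for w in text.lower().split() if w not in stop]

print(content_words("the park and the zoo"))  # ['park', 'zoo']
```

A single pass like this is exactly what breaks on mixed-language input: one winner-takes-all guess picks one stopword list for the whole text.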

wilkohardenberg commented Aug 5, 2013

Here is some text that hugely confuses the guess-language function:

Later, pressure increased to focus less on animal conservation and more on the welfare of urban-dwellers and tourism promotion. As from 1930 hunting permits were sold and in 1932 the journal of the Italian Alpine Club published an article proposing to transform the Gran Paradiso into a sort of huge open-air zoological garden, with all the features of an urban park. In the same years the Aostan autonomist politician Emile Chanoux lamented that until then the park had stressed too much its scientific aims, forgetting to respond to what it called its “social function”:
"Ma il Parco non deve essere fine a se stesso; deve avere oltre che una funzione scientifica, anche una funzione sociale, deve essere un richiamo per le folle per una vita sana e naturale, deve essere una sorgente di vita per le popolazioni delle montagne sui cui è costituito, deve essere anche (e perché no?) la grande riserva di caccia della Nazionale, poiché anche questo sport della caccia ha motivo di sussistere per le sue utilità sociali."

[English translation: "But the Park must not be an end in itself; beyond its scientific function it must also have a social function: it must draw the crowds to a healthy and natural life, it must be a source of life for the people of the mountains on which it stands, it must also be (and why not?) the Nation's great hunting reserve, since even the sport of hunting has reason to exist for its social benefits."]

Contributor

mialondon commented Aug 5, 2013

After discussing it with my friendly local multilingual historian and thinking over Wilko's issue, I wonder if there are two parts to the problem: the first is dealing with stop words in the appropriate languages, the second is NER (entity recognition) in other languages. Does dbpedia automatically query Wikipedia content from all languages or just English? If not, can we use the current language detection to query the appropriate instances as well as applying different sets of stop words? Thoughts @moltude ?

Also thanks @wilkohardenberg for your input and earlier comments!

Contributor

moltude commented Aug 5, 2013

I'm still thinking about this but I have a couple of thoughts so far:

  • Yes, it does look like DBpedia supports NER in multiple languages, with a specific REST URL for each language: https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki/User%27s-manual
  • If we are able to parse and identify non-English named entities, then our search of the aggregators would require separate queries for each of the languages found (dp.la and Europeana support specifying languages in the query), and I'm not sure how those additional queries will affect query times (there may be more efficient ways of doing this).
  • We might consider whether to use the 'en' form of a non-'en' named entity (I think this can be gotten from the Spotlight service) and then pass both the original-language form and the 'en' form to the query (or just the 'en' form). I'm not sure how this will actually play out, but it seems possible. There is still going to be trouble distinguishing between non-English entities and English stopwords (English 'it' is 'den' in Swedish, and we wouldn't want every query that includes 'it' to also search for 'den').

I'm still chewing on this so any additional thoughts would be appreciated.

Contributor

mialondon commented Aug 6, 2013

Useful points, thanks! We could possibly assume that any non-English text is more pertinent and prioritise those queries - but do we actually need to run separate queries against the search APIs or do we just add non-English terms into the mix?

Contributor

briancroxall commented Aug 7, 2013

Perhaps in the meantime we can make it clear that Serendip-o-matic only supports English-language text in 1.0?

Contributor

mbwolff commented Aug 8, 2013

Hi everyone. I sent the pull request for FR stop words and was referred to this discussion (thanks Mia!). One way to solve this problem might be to break a text up into chunks and run guess-language on each chunk, aggregating the results to build a list of search terms. Chunks could be separated by punctuation and line breaks. This should work for Wilko's text above. For single words and short phrases from one language inserted into a text written mainly in another, it may be too much trouble to determine the different languages.
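A rough sketch of that chunking idea, splitting on blank lines and guessing a language per chunk. The same kind of toy stopword-overlap heuristic stands in for guess-language here; all names are illustrative, not the project's code.

```python
import re

# Toy per-language stopword lists; a real version would use NLTK's.
STOPWORDS = {
    "en": {"the", "and", "of", "to", "in", "it"},
    "it": {"il", "la", "di", "e", "che", "deve", "una"},
}

def guess_lang(chunk):
    """Crude stand-in for guess-language: best stopword overlap wins."""
    words = set(re.findall(r"\w+", chunk.lower()))
    return max(STOPWORDS, key=lambda lang: len(words & STOPWORDS[lang]))

def chunk_languages(text):
    """Split on blank lines and guess a language for each paragraph."""
    chunks = [c for c in re.split(r"\n\s*\n", text) if c.strip()]
    return [(guess_lang(c), c) for c in chunks]

sample = "The park opened in the Alps.\n\nIl parco deve avere una funzione sociale."
```

Each chunk then carries its own language tag through tokenisation, stopword filtering and NER, instead of one guess for the whole document.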

Contributor

mialondon commented Aug 8, 2013

I was thinking paragraphs, as detected by the various forms of line breaks (assuming they still differ slightly between OSs). How does that sound?

wilkohardenberg commented Aug 8, 2013

If feasible it sounds good to me. Single words or short sentences should not be too much of a problem in most cases. I wonder however how this should work on a Zotero library: separate language guessing for each entry?

Contributor

mbwolff commented Aug 8, 2013

Paragraphs are natural chunks, so that works for me.

Contributor

mialondon commented Aug 9, 2013

Can #78 be resolved at the same time?

We'll also have XML markup in various forms if people try copying other reference library formats; @moltude and @amrys came up with a good example of that.

rlskoeser commented Aug 11, 2013

Working by paragraph sounds like a feasible solution, although I worry about how that will scale to larger texts (although I suppose there are probably lots of parts of the code where larger texts may cause issues). I also wonder if I could adapt the guess-language code to give back multiple languages if there are multiple languages with very high scores - it looks like it might be possible from glancing at the code, but I would need to experiment some. Is there likely to be a problem with combining stop words from all the languages detected? Although that doesn't help as much for knowing which dbpedia spotlight endpoint to use, I guess.

As for #78 - we probably need some simple input type detection first - plain text, html/xml, csv, etc. - and then do some pre-processing based on the input format before generating search terms.
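A first pass at that input type detection could be as crude as sniffing for markup and delimiters. This is only an illustrative sketch using the stdlib `csv.Sniffer`; `sniff_format` is a hypothetical helper, not the project's code.

```python
import csv

def sniff_format(text):
    """Very rough input-type guess: html/xml, csv, or plain text."""
    if text.lstrip().startswith("<"):
        return "html/xml"
    try:
        # Sniffer raises csv.Error when no consistent delimiter is found
        csv.Sniffer().sniff(text, delimiters=",;\t")
        return "csv"
    except csv.Error:
        return "text"
```

The detected format would then pick a pre-processing step (strip tags, parse columns, or pass through) before term extraction.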

Contributor

mbwolff commented Aug 12, 2013

Hi everyone. Combining stop words from different languages will create problems, e.g. "den" is an article in German and a noun in English.
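The collision is easy to demonstrate by merging two tiny hand-rolled lists (illustrative only, not NLTK's actual lists):

```python
# "den" is an article in German but a noun in English, so a merged
# stopword list would silently strip it from English queries too.
GERMAN_STOPWORDS = {"der", "die", "das", "den", "und", "in"}
ENGLISH_STOPWORDS = {"the", "a", "and", "in", "it", "of"}

merged = GERMAN_STOPWORDS | ENGLISH_STOPWORDS

query = "fox in the den"
kept = [w for w in query.split() if w not in merged]
print(kept)  # ['fox'] -- "den" vanished along with the real stopwords
```

Keeping one stopword list per detected language, rather than a union, avoids this.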


Contributor

mialondon commented Aug 12, 2013

We don't need to keep the paragraph structure, just pass things into a bucket for the appropriate language and then push each bucket through the appropriate tokenisation, stop word and entity recognition steps... Though we might want to adjust the mix of query terms according to the proportional amount of each language - too fussy?

(At some future point we may want to use the languages detected to query for objects from particular cultures or in particular languages, but that'd need to be considered carefully in relation to 'serendipity' and any future 'hint' function)

Contributor

mialondon commented Aug 13, 2013

Just a note that it might be easiest to work out and document design decisions on the wiki then return here to finish integrating them https://github.com/chnm/serendipomatic/wiki/Serendipomatic-architecture

Contributor

mialondon commented Oct 8, 2013

Do we need a chat to decide on the best solution? If so, who's interested?

Contributor

mbwolff commented Oct 8, 2013

I'm interested.

Contributor

moltude commented Oct 9, 2013

I'm also ready to dig back in on this.


rlskoeser commented Oct 10, 2013

I'm interested too.

Contributor

mialondon commented Oct 13, 2013

Cool! Is there an asynchronous way we can talk through the options or should we try for a chat? (I'm complicating things slightly by being in a completely different timezone).

Contributor

moltude commented Oct 14, 2013

I can make time 9-5 M-F for a chat if that makes the timezone problem easier (Mia, are you GMT?). Other than a chat, I think the best way is to post to the GitHub issue tracker. Other ideas?

Thursday or Friday would be the best day for me this week if we wanted to set up a chat.


Contributor

mbwolff commented Oct 14, 2013

This Friday afternoon (10/18), US East Coast time, would work for me. Could we videoconference?


Contributor

mialondon commented Oct 14, 2013

I'm GMT+11, the other East Coast Time (I'm in Australia). I could just about do 7am here, though I'd make more sense at 8am! http://www.timeanddate.com/worldclock/meetingtime.html?iso=20131018&p1=240&p2=179

Contributor

mbwolff commented Oct 15, 2013

I can meet Friday 8:00 AM Mia's time (Thursday 5:00 PM my time).


Contributor

mialondon commented Oct 15, 2013

Skype? I don't have a camera on the dinosaur laptop I'm travelling with so it's voice-only for me at the best of times.

Contributor

moltude commented Oct 16, 2013

Thursday 5:00 EST on Skype would work for me.


rlskoeser commented Oct 16, 2013

I'm available Thursday 5pm EST too. Is Skype audio conference calling free? How do we exchange Skype account names (prefer not to post them publicly, obviously)? When the OWOT team did a video/audio chat last week it was kind of laggy and a bit difficult to communicate at times, which makes me wonder if a text chat might be more useful - but I guess Skype has a chat tool built in that we can use if the audio is too laggy, right? Alternatively we could try a Google+ hangout if we want to do video for those who have cameras.

Contributor

mialondon commented Oct 17, 2013

The document for collecting sample text for testing is 'Help us collect multilingual text for testing Serendip-o-matic' https://docs.google.com/document/d/100UygYyACS7tgU70FYpc4d00NTwoXaDzDmSUCu3naJE/edit#

Contributor

mialondon commented Oct 17, 2013

Here's a record of the decisions reached during our chat:

a) set up analytics to keep track of word count, languages
b) hint function still useful future functionality, add language as an option
c) start with sentence level, most common language determines which is used
d) collect multilingual test samples for testing (inc poetry, TEI, whatever)
e) check whether dbpedia is multilingual (I think the answer was yes?)
f) these changes drive need for parallelisation
g) help text on formatting text input (e.g. how to prepare BibTeX, TEI etc formatted text for inclusion)
h) html/xml/whatever detection and graceful management
i) check language options in source APIs
j) refactor so Zotero input arrives at detection process looking like any other text

Of those, a, f, g will be new issues, b adds weight to #11, h is related to #78 and c, d, e, i and j are related to the original issue.
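Decision (c), sentence-level detection with the most common language winning, might be sketched like this. The stopword-overlap heuristic again stands in for guess-language, and all names are illustrative.

```python
import re
from collections import Counter

# Toy stopword lists; the overlap heuristic stands in for guess-language.
STOPWORDS = {
    "en": {"the", "and", "of", "in", "it", "was"},
    "it": {"il", "la", "di", "e", "che", "deve", "una"},
}

def guess_lang(sentence):
    words = set(re.findall(r"\w+", sentence.lower()))
    return max(STOPWORDS, key=lambda lang: len(words & STOPWORDS[lang]))

def dominant_language(text):
    """Guess a language per sentence; the most common guess wins."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    votes = Counter(guess_lang(s) for s in sentences)
    return votes.most_common(1)[0][0]
```

The per-sentence votes could also be kept around to drive decision (i): one aggregator query per language that gets a meaningful share of the votes.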

Contributor

mialondon commented Oct 18, 2013

Slightly off-topic, but this article on NER might be worth a look: 'Exploring Entity Recognition and Disambiguation for Cultural Heritage Collections' http://freeyourmetadata.org/publications/named-entity-recognition.pdf
