Picky server dependencies #76

Closed
BadMinus opened this Issue Apr 23, 2012 · 19 comments

4 participants

@BadMinus

I just installed Picky and tried to run it, but I'm having problems.

After generating a server, I try to start it with rake start or unicorn -c unicorn.rb, and it fails silently.
With rake start, the logs show: can't activate rack-1.3.6, already activated rack-1.4.1
With bundle exec rake start, I see: cannot load such file -- sinatra/base
So I need to update the Gemfile and then run bundle exec rake start.

And a few questions, so I don't have to create many issues.
Does Picky support only Latin characters? I want to use Russian, but I don't know how.

Another issue I found: when I search for Social Issues, I get 6 results when I type Social I and 0 when I type Social Is.

@floere
Owner
floere commented Apr 23, 2012

Hi Mitya,

Sad to hear you're having trouble (I will look into the silently failing). Glad it worked in the end. If you generate a server, it should tell you to update the Gemfile.
I just noticed this is missing: https://github.com/floere/picky/blob/master/generators/lib/picky-generators/generators/server/sinatra.rb#L28-32
So thanks for telling me!

To your questions:
Picky supports UTF-8 and works well on Japanese, for example: http://wadoku.eu/search/index?utf8=%E2%9C%93&search=%E5%92%8C

However, the generated server does not play well with Russian. This line:
https://github.com/floere/picky/blob/master/generators/prototypes/server/sinatra/app.rb#L34
removes all characters matching /[^a-z0-9\s\/\-\_\:\"\&\.]/i, so only alphanumeric characters and some special chars survive. I guess adding \p{IsCyrillic} helps?

Social Issues:
By default, Picky indexes partial words only down to the third-to-last character, so issues would be findable as issues, issue, and issu. Using the category option partial: Partial::Substring.new(from: 1) makes it searchable even when just typing i.
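The effect of the partial setting can be sketched in plain Ruby. Note that partials_for is a hypothetical helper written for this illustration, not part of Picky's API; it only mimics which prefixes of a word end up searchable under the two settings mentioned above.

```ruby
# Hypothetical helper (not Picky's API): list the prefixes of a word that
# would be searchable when prefixes are kept from the full word down to the
# one ending at position `from` (a negative value counts from the end).
def partials_for(word, from: -3)
  shortest = from < 0 ? word.length + from + 1 : from
  shortest = 1 if shortest < 1 # never go below a one-character prefix
  (shortest..word.length).map { |len| word[0, len] }.reverse
end

default_partials = partials_for("issues")          # issues, issue, issu
full_partials    = partials_for("issues", from: 1) # every prefix down to "i"

puts default_partials.inspect
puts full_partials.inspect
```

Under the default, is is not among the indexed prefixes, which fits the empty result for Social Is.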

If you're still having trouble, please do not hesitate to put your app.rb file in a gist so we can look at it. Thanks!

@BadMinus

Thanks for quick answer.

I changed the regexp to /[^\p{Cyrillic}a-z0-9\s\/\-\_\:\"\&\.]/i and it works: I see Cyrillic titles in prepared_title_index.prepared.txt and the client finds items. However, there is one small problem and I don't know where to dig. Capital and lowercase letters seem to be treated as different letters. When I type Ak it finds the item, but with ak it doesn't.
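For anyone following along, here is a small sketch of what the two character classes from this thread do to a mixed string. Applying them with String#gsub is an assumption made for illustration; the constant names and the sample text are made up.

```ruby
# Character class quoted earlier in the thread: everything NOT listed is removed.
ASCII_ONLY = /[^a-z0-9\s\/\-\_\:\"\&\.]/i
# The same class with the Cyrillic script added to the allowed characters.
WITH_CYRILLIC = /[^\p{Cyrillic}a-z0-9\s\/\-\_\:\"\&\.]/i

text = "Привет, world!" # made-up sample input

ascii_result    = text.gsub(ASCII_ONLY, "")    # Cyrillic letters are stripped
cyrillic_result = text.gsub(WITH_CYRILLIC, "") # Cyrillic letters survive

puts ascii_result.inspect
puts cyrillic_result.inspect
```

In both cases the comma and the exclamation mark are removed, since neither is in the allowed list.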

btw, great product, thank you.

@floere
Owner
floere commented Apr 23, 2012

My pleasure.

Great to hear! What Picky does is use downcase!: https://github.com/floere/picky/blob/master/server/lib/picky/tokenizer.rb#L225
I assume this does not work for Russian characters. Can you confirm this? (By "Ak" you mean with Cyrillic letters, right?)

Thanks, very glad you like it :)

@floere
Owner
floere commented Apr 23, 2012

Actually, I can confirm it myself. Idea: Implement String#downcase! such that it works for Cyrillic characters and retry. Would that be possible?

@Manfred
Manfred commented Apr 23, 2012

Florian, there is no way to upcase or downcase Unicode strings without the Unicode database. This database isn't included with Ruby, and Matz doesn't want to include it either. He believes that all (natural) language-specific operations are library material.

If I understand properly, the real goal here is to create a normalized version of the word so you can build a reverse index? In that case you will have to do a bit more than just downcase Unicode strings (Google for Unicode normalization).

Two simple solutions would be to either use the implementation from ActiveSupport or a small wrapper for glib2 Unicode functions I wrote a while back. The wrapper is called Unichars (https://github.com/Manfred/unichars).

A complex solution would be to use ICU: http://site.icu-project.org/.
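To illustrate why normalization matters for an index, here is a minimal sketch. It assumes a newer Ruby than the one discussed in this thread (String#unicode_normalize arrived in Ruby 2.2, Unicode-aware downcase in 2.4), and normalize_token is a hypothetical helper, not something Picky, ActiveSupport, or Unichars provides.

```ruby
# Two ways to write the same visible word: a precomposed "é" (one code point)
# versus "e" followed by a combining acute accent (two code points).
composed   = "\u00E9cole"
decomposed = "e\u0301cole"

# Hypothetical helper: bring a token into one canonical form before indexing.
# Assumes Ruby 2.4+ for Unicode-aware downcase.
def normalize_token(token)
  token.unicode_normalize(:nfc).downcase
end

puts composed == decomposed                                   # raw strings differ
puts normalize_token(composed) == normalize_token(decomposed) # normalized forms match
```

Without this step, the two spellings would end up as different index keys even though they look identical to a user.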

@floere
Owner
floere commented Apr 23, 2012

Thanks @Manfred for the detailed info! I am unsure whether it is a good idea to include this by default since we also have to consider performance. For the issue case, a solution working for the Cyrillic alphabet is sufficient (I assume).

I should, however, definitely include information on how to handle this (via your lib or otherwise).

P.S: I agree with Matz. However, this is one reason why Python has won over the NLP crowd, I assume.

@BadMinus

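Yes, it is possible, and it works if I write this in the server's app.rb:

require 'unicode' # the unicode gem

class String
  # Replace the built-in downcase with the Unicode-aware version from the gem.
  def downcase
    Unicode::downcase(self)
  end

  # In-place variant, built on the method above.
  def downcase!
    self.replace downcase
  end
end

Hope that helps someone, even if it doesn't fit into the project.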
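A note for readers on newer Rubies: starting with Ruby 2.4, String#downcase applies Unicode case mapping itself, so a patch like the one above may not be needed there. The following sketch assumes Ruby 2.4 or later; the sample word is made up.

```ruby
# Assumes Ruby 2.4+, where String#downcase applies Unicode case mapping.
word = "Акция" # made-up sample word starting with a Cyrillic capital

lowered    = word.downcase         # Cyrillic capital "А" becomes "а"
ascii_only = word.downcase(:ascii) # only A-Z are mapped, Cyrillic stays as-is

puts lowered
puts ascii_only
```

The :ascii option roughly reproduces the older behaviour that caused the problem in this issue.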

@floere
Owner
floere commented Apr 23, 2012

So, @BadMinus, can I close the issue? :)

(Thanks for the code example!)

@BadMinus

Yes you can :)

@BadMinus BadMinus closed this Apr 23, 2012
@rogerbraun
Contributor

This is really interesting and should probably be in the Wiki.

@floere
Owner
floere commented Apr 23, 2012

@BadMinus Thanks! (I opened issue #77 for documentation regarding this) Glad we resolved this so quickly.

@rogerbraun I agree, and also in the README. However, it's bedtime for this old geezer in Australia. Feel free of course to start a PickyWiki entry :)

@rogerbraun
Contributor

I think 'a-z0-9' in the generator should be replaced with '\p{L}\p{N}', as this actually means letters (in any language) and numbers (in any language). If you actually want to throw out such a large portion of unicode space as 'a-z0-9' does, you should explicitly do so.
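A quick sketch of what that replacement would keep, assuming the rest of the character class stays as quoted earlier in this thread. The constant name and the sample strings are made up, and using String#gsub is just for illustration.

```ruby
# Suggested character class: keep letters (\p{L}) and numbers (\p{N}) of any
# script, plus the special characters from the original class.
ANY_SCRIPT = /[^\p{L}\p{N}\s\/\-\_\:\"\&\.]/i

# Made-up sample inputs in several scripts.
samples = ["Social Issues", "Акция 2012", "和独辞典", "na\u00EFve!?"]

cleaned = samples.map { |s| s.gsub(ANY_SCRIPT, "") }

cleaned.each { |s| puts s }
```

Only the punctuation outside the allowed list (here ! and ?) is dropped; Latin, Cyrillic, and Japanese text all pass through.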

@floere
Owner
floere commented Apr 24, 2012

@rogerbraun I think you are right. I didn't want to do it because I was worried people would not understand specific Unicode character classes (but they do understand the obvious a-z and 0-9). However, Picky is also about teaching (a bit more advanced) Ruby to its users, so I guess this is a good example where people can learn about regexps. Also, the Regexp class help page is much better than I remembered: http://www.ruby-doc.org/core-1.9.3/Regexp.html

@rogerbraun If you wish to issue a pull request on the generators (for honor and glory :) ), please tell me. If not, please tell me too, so I can do it.

@floere
Owner
floere commented Apr 24, 2012

@Manfred @BadMinus @rogerbraun I added a Unicode page to the PickyWiki: https://github.com/floere/picky/wiki/Handling-Unicode If you have the time, I'm glad for some proofreading or link checking. Give it to me straight, please. We want this to be good and helpful.

@Manfred Specifically, could you please look at the example using your lib and tell me if it is correct? Thanks!

@Manfred
Manfred commented Apr 24, 2012

@floere I added some notes about Unicode equivalence and added normalization to the Unichars example.

@floere
Owner
floere commented Apr 24, 2012

@Manfred Thank you very much for your feedback, much appreciated! :)

@rogerbraun
Contributor

@floere I'll do it tomorrow.

@floere
Owner
floere commented Apr 25, 2012

@rogerbraun Thanks!

@floere floere was assigned Dec 6, 2012