# Itsy

A threaded web spider, written in Clojure.

## Usage

In your project.clj:

```clojure
[itsy "0.1.1"]
```

In your project:

```clojure
(ns myns.foo
  (:require [itsy.core :refer :all]))

(defn my-handler [{:keys [url body]}]
  (println url "has a count of" (count body)))

(def c (crawl {;; initial URL to start crawling at (required)
               :url "http://aoeu.com"
               ;; handler to use for each page crawled (required)
               :handler my-handler
               ;; number of threads to use for crawling (optional,
               ;; defaults to 5)
               :workers 10
               ;; number of URLs to spider before crawling stops; note
               ;; that workers must still be stopped after crawling
               ;; stops. May be set to -1 for no limit.
               ;; (optional, defaults to 100)
               :url-limit 100
               ;; function used to extract URLs from a page; takes one
               ;; argument, the body of a page.
               ;; (optional, defaults to itsy's extract-all)
               :url-extractor extract-all
               ;; http options for clj-http (optional, defaults to
               ;; {:socket-timeout 10000 :conn-timeout 10000 :insecure? true})
               :http-opts {}
               ;; whether to limit crawling to a single domain. If
               ;; false, does not limit domain; if true, limits to the
               ;; same domain as the original :url; if set to a string,
               ;; limits crawling to the hostname of the given url
               :host-limit false
               ;; polite crawlers obey robots.txt directives;
               ;; by default this crawler is polite
               :polite? true}))

;; ... crawling ensues ...

(thread-status c)
;; returns a map of thread-id to Thread.State:
{33 #<State RUNNABLE>, 34 #<State RUNNABLE>, 35 #<State RUNNABLE>,
 36 #<State RUNNABLE>, 37 #<State RUNNABLE>, 38 #<State RUNNABLE>,
 39 #<State RUNNABLE>, 40 #<State RUNNABLE>, 41 #<State RUNNABLE>,
 42 #<State RUNNABLE>}

(add-worker c)
;; adds an additional thread worker to the pool

(remove-worker c)
;; removes a worker from the pool

(stop-workers c)
;; stop-workers will return a collection of all threads it failed to
;; stop (it should be able to stop all threads unless something goes
;; very wrong)
```

Upon completion, `c` will contain state that allows you to see what
happened:

```clojure
(clojure.pprint/pprint (:state c))
;; URLs still in the queue
{:url-queue #<LinkedBlockingQueue []>,
 ;; URLs that were seen/queued
 :url-count #<Atom@67d6b87e: 2>,
 ;; running worker threads (will contain thread objects while crawling)
 :running-workers #<Ref@decdc7b: []>,
 ;; canaries for running worker threads
 :worker-canaries #<Ref@397f1661: {}>,
 ;; a map of URL to times seen/extracted from the body of a page
 :seen-urls
 #<Atom@469657c4:
   {"http://www.phpbb.com" 1,
    "http://pagead2.googlesyndication.com/pagead/show_ads.js" 2,
    "http://www.subBlue.com/" 1,
    "http://www.phpbb.com/" 1,
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd" 1,
    "http://www.w3.org/1999/xhtml" 1,
    "http://forums.asdf.com" 1,
    "http://www.google.com/images/poweredby_transparent/poweredby_000000.gif" 1,
    "http://asdf.com" 1,
    "http://www.google.com/cse/api/branding.css" 1,
    "http://www.google.com/cse" 1}>}
```

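The state above is plain Clojure and Java data, so you can poll it directly: `:url-queue` is a `java.util.concurrent.LinkedBlockingQueue` and `:url-count` is an atom. A minimal sketch (the `crawl-done?` name is ours, not part of Itsy's API):

```clojure
(defn crawl-done?
  "True once the crawler's URL queue has been drained."
  [c]
  (.isEmpty ^java.util.concurrent.LinkedBlockingQueue
            (-> c :state :url-queue)))

;; e.g. poll until the queue is empty, then stop the workers:
;; (when (crawl-done? c) (stop-workers c))
```

Note that an empty queue only means no URLs are waiting; workers still need to be stopped explicitly, as described above.
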
## Features

- Multithreaded, with the ability to add and remove workers as needed
- No global state, so multiple crawlers can run at once, each with multiple threads
- Pre-written handlers for text files and ElasticSearch
- Skips URLs that have been seen before
- Domain limiting to crawl only pages belonging to a certain domain

## Included handlers

Itsy includes handlers for common actions, either to be used directly
or as examples for writing your own.

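Since a handler is just a function of one map with `:url` and `:body` keys (as shown in the usage example above), writing your own is straightforward. A hypothetical sketch, not part of Itsy itself, that tallies `href=` occurrences per page:

```clojure
;; Results live in an atom outside the crawler; Itsy just calls the
;; handler once per crawled page.
(def link-counts (atom {}))

(defn link-count-handler
  "Record how many href attributes appear in each crawled page body."
  [{:keys [url body]}]
  (swap! link-counts assoc url
         (count (re-seq #"href=" (or body "")))))
```

Any such function can be passed as the `:handler` option to `crawl`.
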
### Text file handler

The text file handler stores web pages in text files. It uses the
`html->str` function in `itsy.extract`, which in turn uses
[Tika](http://tika.apache.org), to convert HTML documents to plain
text.

Usage:

```clojure
(ns bar
  (:require [itsy.core :refer :all]
            [itsy.handlers.textfiles :refer :all]))

;; The directory will be created when the handler is created if it
;; doesn't already exist
(def txt-handler (make-textfile-handler {:directory "/mnt/data" :extension ".txt"}))

(def c (crawl {:url "http://example.com" :handler txt-handler}))

;; then look in the /mnt/data directory
```

### [ElasticSearch](http://elasticsearch.org) handler

The ElasticSearch handler stores documents with the following mapping:

```clojure
{:id {:type "string"
      :index "not_analyzed"
      :store "yes"}
 :url {:type "string"
       :index "not_analyzed"
       :store "yes"}
 :body {:type "string"
        :store "yes"}}
```

Usage:

```clojure
(ns foo
  (:require [itsy.core :refer :all]
            [itsy.handlers.elasticsearch :refer :all]))

;; These are the default settings
(def index-settings {:settings
                     {:index
                      {:number_of_shards 2
                       :number_of_replicas 0}}})

;; If the ES index doesn't exist, make-es-handler will create it when called.
(def es-handler (make-es-handler {:es-url "http://localhost:9200/"
                                  :es-index "crawl"
                                  :es-type "page"
                                  :es-index-settings index-settings
                                  :http-opts {}}))

(def c (crawl {:url "http://example.com" :handler es-handler}))

;; ... crawling and indexing ensues ...
```

## Todo

- <del>Relative URL extraction/crawling</del>
- Always better URL extraction
- Handlers for common body actions
  - <del>elasticsearch</del>
  - <del>text files</del>
  - other?
- <del>Helpers for dynamically raising/lowering thread count</del>
- Timed crawling; have threads clean themselves up after a limit
- <del>Have threads auto-clean when url-limit is hit</del>
- <del>Use Tika for HTML extraction</del>
- Write tests

## License

Copyright © 2012 Lee Hinman

Distributed under the Eclipse Public License, the same as Clojure.