Chris Mattmann edited this page Oct 13, 2015 · 3 revisions

Welcome to the nutch-python wiki!

Getting started with Nutch-Python

The API is still evolving rapidly, so details may change, but the code below should get you running.

Start the Nutch Server

Download and build the latest version of Nutch trunk, then start the Nutch server. Do this in a separate terminal so the server can keep running while you work on your script.

  1. git clone https://github.com/apache/nutch.git
  2. cd nutch
  3. ant runtime
  4. cd runtime/local
  5. ./bin/nutch startserver
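
Before running a script against the server, it can help to confirm that something is actually listening. The following is a minimal sketch using only the Python standard library; `wait_for_port` is a hypothetical helper, not part of nutch-python or Nutch, and it only checks that a TCP port accepts connections, not that the Nutch REST API is healthy.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True once a TCP connection to host:port succeeds, False on timeout.

    Hypothetical helper for illustration; not part of nutch-python.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means some process is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: retry until the deadline passes.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

For example, `wait_for_port('localhost', 8081)` could be called before creating the `Server` object, assuming the server listens on port 8081 as in the script on this page.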

Get your Nutch-Python script going

    from nutch.nutch import Nutch, SeedClient, Server, JobClient

    # Connect to the Nutch server started above
    sv = Server('http://localhost:8081')

    # Register a named seed list with the server
    sc = SeedClient(sv)
    seed_urls = ('http://espn.go.com', 'http://www.espn.com')
    sd = sc.create('espn-seed', seed_urls)

    # Create the crawl from the seed list, seed client, and job client
    nt = Nutch('default')
    jc = JobClient(sv, 'test', 'default')
    cc = nt.Crawl(sd, sc, jc)

    # Drive the crawl until no jobs remain
    while True:
        # gets the current job if no progress, else iterates and makes progress
        job = cc.progress()
        if job is None:
            break
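
The polling loop above can be wrapped in a small function that pauses between polls and optionally gives up after a fixed number of calls. This is only a sketch: `run_crawl` and `FakeCrawl` are hypothetical names introduced here, not part of nutch-python, and the pause assumes that `progress()` may return quickly while a job is still running. The only behaviour relied on is the one shown above, namely that `progress()` returns `None` when the crawl is finished.

```python
import time


def run_crawl(crawl, poll_interval=1.0, max_polls=None):
    """Call crawl.progress() until it returns None; return the number of polls made.

    Hypothetical helper; `crawl` is any object with a progress() method.
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        job = crawl.progress()
        polls += 1
        if job is None:
            break
        # Pause so the loop does not hammer the server between status checks.
        time.sleep(poll_interval)
    return polls


class FakeCrawl:
    """Stand-in for a crawl object: reports a job `rounds` times, then None."""

    def __init__(self, rounds):
        self.remaining = rounds

    def progress(self):
        if self.remaining == 0:
            return None
        self.remaining -= 1
        return "job"
```

With the objects from the script above, this would be called as `run_crawl(cc)`. The `FakeCrawl` class lets the loop logic be exercised without a running Nutch server.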