Install

Dependencies

To run the scrapers you need the following dependencies: Scrapy and Selenium.

The simplest way to install them is with pip, the easy-to-use tool for installing Python packages. Once you have pip installed, just run the following in your shell of choice:

pip install scrapy
pip install selenium

Getting the code

To get the code from this git repository, you first need to install git, the version control system. Once you have git, run the following to fetch the code:

git clone git@github.com:climatewarrior/elc-scrapers.git

Running this downloads the scraper code into the current working directory of your shell.

How to run

To run the scrapers, change into the root directory of elc-scrapers in your shell, then run the following:

scrapy crawl <crawler name>

For example, to run the YouTube crawler:

scrapy crawl youtube

The following section lists all the available crawlers with their Scrapy names in parentheses. The scraped data will be located in a directory called:

../data/<scraper name>/

As indicated, this directory sits one level above the directory where you ran scrapy. The data is split into files, with each file representing a thread. For scrapers that also scrape user profiles, there is additionally one file per user profile.
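For example, after running the youtube crawler, the output might look like this (the filenames are illustrative; the exact naming depends on the scraper):

../data/youtube/
    thread_1.txt
    thread_2.txt
    user_someuser.txt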

Crawlers

Some crawlers aren’t capable of scraping the entire forums because forum searches are required to get very old posts. A workaround is to perform searches for the terms you are interested in on the respective forums and then set the result URLs as the start_urls for the spiders (see the sketch below). The crawlers that need this workaround are marked with a “*”.
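As a minimal sketch, the workaround amounts to overriding start_urls in the affected spider module. Both URLs below are purely illustrative; use the actual result URLs from the forum’s own search:

# In the spider module: point the crawl at search result pages
# instead of the forum index. URLs below are illustrative.
start_urls = [
    "http://forums.example.com/search?q=copyright",
    "http://forums.example.com/search?q=trademark",
]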

deviantART (deviantart)*

Notes

Scraping normal forums looks to be working fine.

Next items

Add support for threaded comments

Notes

Looks like it’s working just fine.

Next items

Notes

Parsing threads and normal forum posts seems to be ok.

Next items

Remix64 (remix64)

Notes

Practically DONE

Next items

YouTube Forums (youtube)

Notes

In theory it is complete, but it needs more testing.

Next items

MMORPG Forum (mmorpg)

Notes

Seems to be in working order so far.

Next items

HPFanFic Forums (hpfanfic)

Notes

Appears to be done, needs more testing.

Next items

Notes

It seems like it works alright.

Next items

NaNoWriMo (nanowrimo)

Notes

100% Ready

Next items

Etsy (etsy)

Notes

Looks pretty easy to scrape.

Next items

Nothing, until it’s clear more work is needed.

Notes

It keeps returning 403s whenever I try to scrape it. I have tried changing the user agent, removing cookies, and other techniques, but I keep hitting the same problem.

Next items

Look for ways to fix the 403 problem.

Keywords we are looking for in posts

  • copyright
  • legal
  • illegal
  • permission
  • trademark
  • stealing / steal / stole
  • license
  • rights
  • attorney
  • infringement
  • copy / copying
  • plagiarism

How to add a new forum scraper

Introduction

Most web forums are very similar. They contain multi-page threads, are organized into sub-forums, and their posts share common attributes, e.g. date posted and author. To simplify the development of new forum scrapers I have created a Python class that abstracts the common parts away so you only have to worry about the differences. Most of the code also leverages Scrapy’s CrawlSpider class, which helps implement Scrapy crawlers. In the following sections I will explain how to use these classes to write a new scraper. If you are not familiar with the basics of Scrapy, please go through the Scrapy documentation first, especially the beginner tutorial.

Example classes

The best way to start is to copy the following files:

./research_scrapers/spiders/example_spider.py
./research_scrapers/spiders/spider_helpers/ExampleHelper.py

and give the copies the name of your new scraper. For example, if you were creating a scraper for foo.com, you would name the copies as follows:

./research_scrapers/spiders/foo_spider.py
./research_scrapers/spiders/spider_helpers/FooHelper.py

After you have made these copies you can start working on your brand new forum scraper.

Creating your first forum Spider

Open the file foo_spider.py in your editor and take a look. First you will notice that the class inherits from CrawlSpider and from ExampleHelper. CrawlSpider makes it easy to crawl entire sites; ExampleHelper is explained in the next section. The job of our Spider class is to go through the entire site, find forum threads, and call the “parse_thread” method on them. The “parse_thread” method, as the name implies, takes care of parsing threads.

What you have to do here is pretty simple; a complete sketch follows the list below.

  • Update the class name from ExampleSpider to FooSpider.
  • Change the name field from “example” to “foo”.
  • Update the allowed_domains list, which, as the name implies, is just a list of allowed domains. This keeps the Spider from wandering off to other websites.
  • Update the start_urls list. This is usually just the link to the forums you want to scrape, but you can specify more URLs where the Spider should start crawling.
  • Update the rules. This is the most important part. Here you specify regular expressions that the Spider should follow or call methods on. You can look at the existing spiders for examples. The key is to make sure you have expressions that follow through all sub-forums and that all thread URLs are detected. For the thread rules you should specify:
    # We want to call parse_thread on our thread urls
    callback="parse_thread"
        

    and

    follow = False
        

    We don’t want to follow thread URLs: if we follow them, page links will confuse the spider and the threads won’t be parsed correctly unless you take special precautions. “parse_thread” already knows how to go through threads with multiple pages. For the details, see the code in helper_base.py.
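Putting it all together, a minimal sketch of foo_spider.py might look like the following. The link-extractor patterns, the FooHelper import path, and the URLs are assumptions for illustration (and assume the older Scrapy API this project was written against); adapt them to the actual link layout of the site you are scraping:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from research_scrapers.spiders.spider_helpers.FooHelper import FooHelper

class FooSpider(CrawlSpider, FooHelper):
    name = "foo"
    allowed_domains = ["foo.com"]
    start_urls = ["http://foo.com/forums/"]

    rules = (
        # Follow links into sub-forums (pattern is hypothetical).
        Rule(SgmlLinkExtractor(allow=(r"viewforum\.php",))),
        # Parse threads, but don't follow them; parse_thread
        # handles multi-page threads on its own.
        Rule(SgmlLinkExtractor(allow=(r"viewtopic\.php",)),
             callback="parse_thread", follow=False),
    )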

Creating your Helper class

ExampleHelper is a dedicated helper for the spider, and it in turn inherits from HelperBase. HelperBase contains the logic that is consistent across forum sites, while ExampleHelper encapsulates the logic that is specific to the forum being scraped. Here all you have to do is finish implementing the methods. Docstrings describing what each method is responsible for can be found in helper_base.py. You can also look at the existing scrapers for examples. The core task is to fill the data fields with the required data using XPath selectors. For example, in the method

load_first_page(self, ft)

you can add the page title the following way:

ft['title'] = self.hxs.select("//div[@id='page-body']/h2/a/text()").extract()[0]
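Extending this, a minimal sketch of a completed load_first_page might look like the following. Every field name and XPath expression besides title is an assumption for illustration; the real fields and paths depend on your item definition and the forum’s markup:

def load_first_page(self, ft):
    # Fill the thread item from the first page of the thread.
    # All XPaths below are illustrative.
    ft['title'] = self.hxs.select(
        "//div[@id='page-body']/h2/a/text()").extract()[0]
    ft['author'] = self.hxs.select(
        "//p[@class='author']/strong/a/text()").extract()
    ft['date'] = self.hxs.select(
        "//p[@class='author']/text()").extract()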

To create these XPath selectors I highly recommend using Scrapy interactively with the command:

scrapy shell <foo.com/forums/thread_url>

This allows for rapid XPath testing and development in an interactive, Scrapy-enabled Python shell.
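A typical session might look like this (the URL and output are illustrative; in the older Scrapy versions this project targets, the shell provides an hxs selector object):

scrapy shell http://foo.com/forums/viewtopic.php?t=123
>>> hxs.select("//div[@id='page-body']/h2/a/text()").extract()
[u'Some thread title']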

Advanced usage

Modifying the data output

In the file pipelines.py there is a Scrapy Item Pipeline called TextFileExportPipeline. To modify the data output you must modify that class.
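For orientation, a Scrapy item pipeline has the following shape. This is only a sketch of the interface; the actual body of TextFileExportPipeline in pipelines.py will differ:

import os

class TextFileExportPipeline(object):
    def process_item(self, item, spider):
        # Called once per scraped item. The output layout below
        # is illustrative, not the project's actual format.
        out_dir = os.path.join('..', 'data', spider.name)
        if not os.path.exists(out_dir):
            os.makedirs(out_dir)
        with open(os.path.join(out_dir, 'output.txt'), 'a') as f:
            f.write(repr(dict(item)) + '\n')
        return item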

Copying

This is Free and Open Source Software provided under the terms of the GNU GPL v3.0. For the full license text, read the COPYING file.