READ ME FIRST

This is a modified version of a basic setup for the popular Python Scrapy framework that includes rotation of proxies and user agents (UA). Modifications include:

* Ability to scrape Amazon
* Increased spider coverage for crawling categories
* Pipeline creation for JSON constructs

Note:
Please install Twisted 14.6.1 only, as it is required for OpenSSL.

Modified By: Carl Laneave 2017

# PythonScrapyBasicSetup

Basic setup with random user agents and proxy addresses for [Python Scrapy Framework](http://scrapy.org/).

## Setup

1. Install Scrapy Framework

pip install Scrapy

Detailed installation guide

2. Install Beautiful Soup 4 for proxy middleware based on hidemyass lists

pip install beautifulsoup4

Detailed installation guide

3. Install Tor, Stem (controller library for Tor), and Privoxy (HTTP proxy server).

apt-get install tor python-stem privoxy

Hash a password with Tor:

tor --hash-password secretPassword

Then copy the hashed password and add it, together with the control port, to /etc/tor/torrc:

ControlPort 9051
HashedControlPassword 16:72C8ADB0E34F8DA1606BB154886604F708236C0D0A54557A07B00CAB73

Restart Tor:

sudo /etc/init.d/tor restart

Enable Privoxy forwarding by adding the following line to /etc/privoxy/config:

forward-socks5 / localhost:9050 .

Restart Privoxy:

sudo /etc/init.d/privoxy restart
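
Once Privoxy has restarted, traffic sent to it should be forwarded to Tor's SOCKS port. The sketch below shows one way to route a request through Privoxy from plain Python, using only the standard library. It assumes Privoxy listens on its default address `127.0.0.1:8118` (an assumption, since this README does not set a listen address), and the helper names are hypothetical.

```python
import urllib.request

# Assumption: Privoxy's default listen address; adjust if your config differs.
PRIVOXY_URL = "http://127.0.0.1:8118"


def build_privoxy_opener(proxy_url=PRIVOXY_URL):
    """Return a urllib opener that sends HTTP and HTTPS traffic via Privoxy."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def fetch_via_privoxy(url, opener=None, timeout=30.0):
    """Fetch a URL through the proxy and return the raw response body."""
    opener = opener or build_privoxy_opener()
    with opener.open(url, timeout=timeout) as response:
        return response.read()
```

Fetching any page that echoes the caller's IP address with `fetch_via_privoxy(...)` and comparing it with a direct request is a quick way to see whether the Privoxy and Tor chain is in effect.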

Both Tor and Privoxy should now be up and running (check with netstat -l). If you used a different password or control port, update settings.py accordingly.
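
As an alternative to reading netstat output, the same check can be scripted. This is a small sketch under stated assumptions: ports 9050 and 9051 are the ones used in the steps above, while 8118 is assumed to be Privoxy's default listen port; the names `port_is_open` and `check_services` are hypothetical.

```python
import socket

# 9050 and 9051 come from the setup steps; 8118 is an assumed Privoxy default.
SERVICES = {"Tor SOCKS": 9050, "Tor control": 9051, "Privoxy": 8118}


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(host="127.0.0.1", services=None):
    """Map each service name to whether its port accepts connections."""
    services = SERVICES if services is None else services
    return {name: port_is_open(host, port) for name, port in services.items()}
```

Calling `check_services()` returns a dictionary of service names to booleans, so a `False` entry points at the daemon that still needs attention.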

If you get errors related to pyOpenSSL (check this issue), try downgrading the Twisted engine:

pip install Twisted==16.4.1

## Usage

To see what it does, just run:

python run.py

The project contains three middleware classes in the middlewares directory. ProxyMiddleware downloads proxy IP addresses and randomly chooses one before every request is processed. TorMiddleware has a similar purpose, but it relies on the Tor network. RandomUserAgentMiddleware downloads user agent strings and saves them into the 'USER_AGENT_LIST' settings list; it also randomly selects one UA before every request is processed. Middlewares are activated in the settings.py file.

This project also contains two spiders just for testing purposes, spiders/iptester.py and spiders/uatester.py. You can run them individually:

scrapy crawl UAtester
scrapy crawl IPtester
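
To make the middleware behaviour described above more concrete, here is a simplified sketch of the rotation idea. It is not the code from the middlewares directory: the class names are hypothetical, the lists are assumed to be already populated (nothing is downloaded), and it relies only on the Scrapy conventions that a downloader middleware exposes `process_request(request, spider)`, that the User-Agent lives in `request.headers`, and that a proxy is selected through `request.meta['proxy']`.

```python
import random


class RotatingUserAgentMiddleware:
    """Sketch: pick a random User-Agent for every outgoing request."""

    def __init__(self, user_agents):
        if not user_agents:
            raise ValueError("at least one user agent string is required")
        self.user_agents = list(user_agents)

    @classmethod
    def from_crawler(cls, crawler):
        # Assumes the 'USER_AGENT_LIST' setting has already been filled in.
        return cls(crawler.settings.getlist("USER_AGENT_LIST"))

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(self.user_agents)
        return None  # None tells Scrapy to continue processing the request


class RotatingProxyMiddleware:
    """Sketch: route every outgoing request through a random proxy."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("at least one proxy URL is required")
        self.proxies = list(proxies)

    def process_request(self, request, spider):
        # Scrapy's built-in proxy support reads the proxy URL from this key.
        request.meta["proxy"] = random.choice(self.proxies)
        return None
```

Classes like these would be switched on through the `DOWNLOADER_MIDDLEWARES` dictionary in settings.py, which maps a middleware's import path to its priority.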

The run.py file is also a good example of how to include and run your spiders sequentially from one script.
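
The contents of run.py are not reproduced here. As a rough illustration of running spiders one after another from a single script, the sketch below takes a different and simpler route: it shells out to the same `scrapy crawl` commands shown above, one at a time, and collects their exit codes. The function names and the injectable `runner` parameter are hypothetical and exist only to keep the example easy to test.

```python
import subprocess

# Spider names taken from the commands shown above.
SPIDERS = ["UAtester", "IPtester"]


def crawl_command(spider_name):
    """Build the command line that runs a single spider."""
    return ["scrapy", "crawl", spider_name]


def run_spiders(spider_names, runner=subprocess.run):
    """Run each spider in order, waiting for one to finish before the next.

    Returns a dict mapping spider name to the process exit code.
    """
    exit_codes = {}
    for name in spider_names:
        completed = runner(crawl_command(name))
        exit_codes[name] = completed.returncode
    return exit_codes
```

From the project root, `run_spiders(SPIDERS)` would start UAtester, wait for it to exit, and then start IPtester; a non-zero value in the result marks a spider whose process failed.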

If you have any questions or problems, feel free to create a new issue. Scrape responsibly!
