Twitter Hashtag crawler

A fast, unofficial Twitter crawler that collects tweets via hashtag search.

Notice: This crawler is meant to be used for collecting data for academic and research purposes only. I am not responsible for any legal issues that might arise from unintended use of this crawler.


This is written in Python using Scrapy. The logic is straightforward: we send GET requests to the mobile version of Twitter (mobile.twitter.com) to collect the list of tweets for a hashtag, and then send GET requests to the web version to parse each tweet's details.
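As a rough sketch of that idea (this is not the project's actual spider, which lives in the TwitterHashTagCrawler directory; the URLs and CSS selectors below are illustrative assumptions):

import scrapy

class HashtagSketchSpider(scrapy.Spider):
    # Illustrative sketch only; the real spider is in TwitterHashTagCrawler.
    name = "hashtag_sketch"

    def start_requests(self):
        for tag in ["python", "webscraping"]:  # example hashtags
            # GET the mobile site first: its markup is lighter and easier to parse
            yield scrapy.Request(
                f"https://mobile.twitter.com/hashtag/{tag}",
                callback=self.parse_tweet_list,
            )

    def parse_tweet_list(self, response):
        # Collect tweet links from the search results (selector is an assumption)
        for href in response.css("a[href*='/status/']::attr(href)").getall():
            # Follow each tweet on the web version to parse its details
            yield response.follow("https://twitter.com" + href, callback=self.parse_tweet)

    def parse_tweet(self, response):
        # The real spider extracts fields like username, tweet_text, counts, etc.
        yield {"url": response.url}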

Installation

OS X & Linux:

  1. Download the project
git clone https://github.com/amitupreti/Hands-on-WebScraping

cd Hands-on-WebScraping/project1_twitter_hashtag_crawler
  2. Install the dependencies
pip install -r requirements.txt --user
  3. Verify the crawler spider exists
scrapy list

If you see twittercrawler in the output, then you are all set.

Windows:

  1. Install python3 if you haven't already
  2. Download the project. https://github.com/amitupreti/Hands-on-WebScraping/archive/master.zip
  3. Extract the project
  4. Open cmd and navigate inside the project directory
cd Hands-on-WebScraping/project1_twitter_hashtag_crawler
  5. Follow steps 2 and 3 from the OS X & Linux installation

Usage example

  1. Put the hashtags in a CSV file, one hashtag per line. For example, I have included myhashtags.csv as a sample; an illustrative layout is shown below.

Hashtags file
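An illustrative hashtags file might look like this (the tags are made up; see the bundled myhashtags.csv for the exact format):

     python
     webscraping
     machinelearning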

  2. Run the crawler with your hashtag file and the desired output format (JSON, JSON lines, CSV, or XML; see https://docs.scrapy.org/en/latest/topics/feed-exports.html).
  • For CSV

     scrapy crawl twittercrawler -a filename=myhashtags.csv -o mydata.csv
    
  • For JSON

     scrapy crawl twittercrawler -a filename=myhashtags.csv -o mydata.json
    
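  • For JSON lines (assuming Scrapy's default feed exporters, which infer the format from the .jl file extension)

     scrapy crawl twittercrawler -a filename=myhashtags.csv -o mydata.jl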

Sample images

Speeding up the crawls

If you feel the crawler is a little slow, find the hashtag.py file in the project and edit its custom_settings.

custom_settings = {
    'USER_AGENT': 'Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/2.1; +http://www.google.com/bot.html) Safari/537.36',
    'CONCURRENT_REQUESTS': 2,  # URLs processed in parallel
    'DOWNLOAD_DELAY': 1,       # seconds to wait between requests
    'LOG_LEVEL': 'INFO',
}

Here CONCURRENT_REQUESTS is the number of URLs that will be processed in parallel, and DOWNLOAD_DELAY is the wait (in seconds) between requests. To speed up the crawl, increase CONCURRENT_REQUESTS and decrease DOWNLOAD_DELAY (the minimum download delay is 0).
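For example, a faster (and less polite) configuration might look like the following; the values are illustrative, and higher concurrency raises the chance of being rate-limited or blocked:

custom_settings = {
    'USER_AGENT': 'Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/2.1; +http://www.google.com/bot.html) Safari/537.36',
    'CONCURRENT_REQUESTS': 16,  # more URLs in flight at once
    'DOWNLOAD_DELAY': 0,        # no wait between requests
    'LOG_LEVEL': 'INFO',
}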

Data Columns

  • username
  • full_name
  • twitter_url
  • tweet_text
  • tweet_time
  • number_of_likes
  • no_of_retweets
  • no_of_replies
  • mentions
  • no_of_mentions
  • hashtags
  • no_of_hashtags
  • call_to_action
  • image_url
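Once a crawl finishes, the output can be inspected with ordinary CSV tooling. A small sketch using pandas (assuming the mydata.csv produced in the usage example above):

import pandas as pd

# Load the crawl output and peek at a few of the columns listed above
df = pd.read_csv("mydata.csv")
print(df[["username", "tweet_text", "number_of_likes", "hashtags"]].head())

# Quick sanity check on how many tweets were collected
print(len(df), "tweets collected")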

Release History

  • 1.0.0
    • First release: crawl by hashtags

Meta

Amit Upreti – @amitupreti

Distributed under the MIT license. See LICENSE for more information.

Contributing

  1. Fork it (https://github.com/amitupreti/Hands-on-WebScraping/fork)
  2. Create your feature branch (git checkout -b feature/fooBar)
  3. Commit your changes (git commit -am 'Add some fooBar')
  4. Push to the branch (git push origin feature/fooBar)
  5. Create a new Pull Request