This is a light-weight agent-based framework to help you schedule a workflow of scraping tasks, mainly focused on Twitter.
-
Fill in config.txt with your own values.
-
Next run the agent:
python agent.py
This agent runs in the backround and does things based on the tasks that are
assigned to it. You can schedule tasks using another command line took, monitor.py
.
To sample from the twitter stream
python monitor.py --tasks add StartStream
To pull down tasks for a @edwardbenson
python monitor.py --tasks add ScrapeUser 0 0 @edwardbenson
To pull down tasks for a @edwardbenson
, refreshing every day
python monitor.py --tasks add ScrapeUser 1 86400 @edwardbenson
To sample from the stream for tweets that match the query "Red Sox" Redsox
and print these out every 10 seconds
python monitor.py --tasks add StartFilterStream "Red Sox" Redsox
python monitor.py --tasks add DumpTweets 1 10
To sample from the stream, and every hour pull down the last 20 tweets from 10 random users
python monitor.py --tasks add StartStream
python monitor.py --tasks add PullRandomUsers 1 3600 10 20
To shutdown the agent: python monitor.py --tasks add Die
To add a task:
python monitor.py --tasks add <TaskName> <Repeat> <Delta>
For adding a task, <Repeat>
is either 0 or 1, and <Delta>
is the time
between repetitions. Set it to 0 if <Repeat>
is also 0.
To list tasks:
python monitor.py --tasks list
Pulls tweets from the provided user, such as @edwardbenson
python monitor.py --tasks add ScrapeUser <repeat> <delta> <username>
Starts pulling form the twitter stream
python monitor.py --tasks add StartStream 0 0
Stops pulling form the twitter stream
python monitor.py --tasks add StopStream 0 0
Starts pulling form the filtered twitter stream with the provided args
python monitor.py --tasks add StartFilterStream 0 0 <arg1> .. <argN>
Stops pulling form the filtered twitter stream
python monitor.py --tasks add StartFilterStream 0 0
Queries for N random users in the database with only 1 tweet recorded in the DB and pulls their latest M tweets.
python monitor.py --tasks add PullRandomUsers 0 0 <N> <M>
Implementing your own tasks is easy. Just dig around in the tasks/tasks.py
file to see examples.