#### Updates

* 2/18/2017: Updated the list of datasets that I have. Please note the Twitter policy on sharing raw datasets (see below).
* 7/2/2015: If you need to geocode Twitter users (e.g., figure out where a user is from based on the `location` string in their profile), take a look at this [TwitterUserGeocoder](https://github.com/bianjiang/twitter-user-geocoder).
* 7/2/2015: I've developed another set of scripts (without using `redis`) for different but similar use cases. Specifically, my use case requires me to 1) track a set of keywords; and 2) track a set of users. The new scripts, for example, will keep pulling in new tweets if you provide a set of seed user ids. They are not in a releasable state yet, but email me if you are interested.
* Older: I haven't been updating this for a while, but I just set up an EC2 instance and tested it. Most things still work fine. I have some new research needs myself, so I might update this more frequently in the next few months. In general, I am happy to take requests to add specific functionality, merge pull requests, and even field requests for specific datasets. Just make a ticket ;)

tweetf0rm
=========

A crawler that helps you collect data from Twitter for research. Most of the heavy lifting is already done by [Twython](https://github.com/ryanmcgrath/twython); ``tweetf0rm`` is a collection of Python scripts that handle parallelization, proxies, and errors such as connection failures. In most cases, it will auto-restart when an exception occurs. And when a crawler exceeds the Twitter API's [rate limit](https://dev.twitter.com/docs/rate-limiting/1.1/limits), it will pause itself and auto-restart later.

To work around Twitter's rate limits, ``tweetf0rm`` can spawn multiple crawlers, each with a different proxy and Twitter dev account, on a single computer (or on multiple computers), to collaboratively ``farm`` user tweets and Twitter relationship networks (i.e., friends and followers). The communication channel for coordinating multiple crawlers is built on top of [redis](http://redis.io/) -- a high-performance in-memory key-value store. It has its own scheduler that balances the load across workers (or worker machines).

It's quite stable for the things that I want to do. I have collected billions of tweets from **2.6 million** Twitter users in about 2 weeks with a single machine.

Datasets
------------
**The Twitter license (or at least the company's position on it) does not allow me to redistribute the crawled data (e.g., someone asked this question a while back: https://dev.twitter.com/discussions/8232).** If you want to get your hands on these datasets (e.g., through collaboration), contact me at <ji0ng.bi0n@gmail.com>. Here is what I have:

* **Random sample since 2014**: I have been crawling tweets using [GET statuses/sample](https://dev.twitter.com/streaming/reference/get/statuses/sample) since 2014, nonstop... except for a few days when the server went down...
* **Tweets within the US by state**: Using [POST statuses/filter](https://dev.twitter.com/streaming/reference/post/statuses/filter) with a `locations` filter by US state, since 10/16/2016.
* **Tweets related to HPV**: HPV-related tweets collected using keywords such as "Human Papillomavirus", "HPV", "Gardasil", and "Cervarix" with the Twitter [Search API](https://dev.twitter.com/rest/public/search), since 02/2016 (as of today, 2/18/2017, it is still running). I do have a similar dataset from 11/2/2015 until 02/2016, but that's from a friend.
* **Tweets related to transgender**: Tweets collected using keywords related to transgender (e.g., trans*, transmale, etc.) between 01/17/2015 and 05/12/2015, plus the user timelines of those who self-identified as trans. This is published here: *"Hicks A, Hogan WR, Rutherford M, Malin B, Xie M, Fellbaum C, Yin Z, Fabbri D, Hanna J, Bian J. Mining Twitter as a First Step toward Assessing the Adequacy of Gender Identification Terms on Intake Forms. AMIA Annu Symp Proc. 2015;2015:611-620. PMID: [26958196](https://www.ncbi.nlm.nih.gov/pubmed/26958196)."*


Installation
------------

None; just clone this repository and start using it. It's not complicated enough yet to warrant a setup.py...

    git clone git://github.com/bianjiang/tweetf0rm.git

Dependencies
------------
To run this, you will need:
- [Twython](https://github.com/ryanmcgrath/twython)
- [futures](https://pypi.python.org/pypi/futures) if you are on Python 2.7
- [redis server](http://redis.io/) and the [redis python library](https://pypi.python.org/pypi/redis)
- [requests](http://www.python-requests.org/en/latest/)
- (optional) [lxml](http://lxml.de/) if you want to use the ``crawl_proxies.py`` script to get a list of free proxies from http://spys.ru/en/http-proxy-list/

##### I haven't tested Python 3 yet...

Features
------------

- Supports running multiple crawler processes (through Python ``multiprocessing``) with different proxies on a single node;
- Supports a cluster of nodes that collaboratively ``f0rm`` tweets.


How to use
------------

First, log in to the Twitter developer site and create an application at https://dev.twitter.com/apps to get access to the Twitter API.

After you register, create an access token and grab your application's ``Consumer Key``, ``Consumer Secret``, ``Access token``, and ``Access token secret`` from the OAuth tool tab. Put this information into a ``config.json`` under ``apikeys`` (see the example below).

You also have to set up a redis server ([redis quick start](http://redis.io/topics/quickstart)). Note that if you want to run multiple nodes, you only need one redis instance, and that instance has to be reachable from the other nodes. The ``redis_config`` needs to be specified in ``config.json`` as well.

Even if you only want to run on one node with multiple crawler processes, you will still need a local redis server for coordinating the tasks.

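Before starting anything, it can save some head-scratching to confirm that the redis instance in your ``redis_config`` is actually reachable. A minimal sketch using only the standard library (this is just a plain TCP check, not part of ``tweetf0rm`` itself; the host and port match the sample config below):

```python
import socket

def redis_reachable(host, port, timeout=2.0):
    """Return True if a plain TCP connection to the redis port succeeds."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
        sock.close()
        return True
    except socket.error:
        return False

# Host/port taken from the sample redis_config below; adjust to your setup.
print(redis_reachable("localhost", 6379))
```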
    {
        "apikeys": {
            "i0mf0rmer01": {
                "app_key": "CONSUMER_KEY",
                "app_secret": "CONSUMER_SECRET",
                "oauth_token": "ACCESS_TOKEN",
                "oauth_token_secret": "ACCESS_TOKEN_SECRET"
            },
            "i0mf0rmer02": {
                "app_key": "CONSUMER_KEY",
                "app_secret": "CONSUMER_SECRET",
                "oauth_token": "ACCESS_TOKEN",
                "oauth_token_secret": "ACCESS_TOKEN_SECRET"
            },
            "i0mf0rmer03": {
                "app_key": "CONSUMER_KEY",
                "app_secret": "CONSUMER_SECRET",
                "oauth_token": "ACCESS_TOKEN",
                "oauth_token_secret": "ACCESS_TOKEN_SECRET"
            }
        },
        "redis_config": {
            "host": "localhost",
            "port": 6379,
            "db": 0,
            "password": "PASSWORD"
        },
        "verbose": "True",
        "output": "./data",
        "archive_output": "./data"
    }

Most of these options are straightforward. ``output`` defines where the crawled data will be stored; ``archive_output`` defines where the gzipped files will be stored (without compression, raw tweets take a lot of space; roughly 100G per 100,000 users' tweets).

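The compression step itself is simple; a minimal standard-library sketch of gzipping a raw output file into ``archive_output`` (illustrative only, not ``tweetf0rm``'s actual archiving code):

```python
import gzip
import os
import shutil

def archive_file(src_path, archive_dir):
    """Gzip src_path into archive_dir and return the new file's path.

    Illustrative sketch only; not tweetf0rm's actual archiver.
    """
    if not os.path.isdir(archive_dir):
        os.makedirs(archive_dir)
    dst_path = os.path.join(archive_dir, os.path.basename(src_path) + ".gz")
    # Stream-copy so large raw tweet files don't need to fit in memory.
    with open(src_path, "rb") as src, gzip.open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return dst_path
```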
The proxies need to be listed in a ``proxies.json`` file like:

    {
        "proxies": [{"66.35.68.146:8089": "http"}, {"69.197.132.80:7808": "http"}, {"198.56.208.37:8089": "http"}]
    }

Each proxy is verified at bootstrap, and only the valid ones are kept and used (currently the crawler does not switch to a different proxy when a proxy server goes down, but that will be added soon). There are a lot of free proxy servers available.

Remember that Twitter's rate limit is per account as well as per IP. So, you should have at least one Twitter API account per proxy. Ideally, you should have more proxies than Twitter accounts, so that ``tweetf0rm`` can switch to a different proxy if one fails (not implemented yet, but high on the list).
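The bootstrap-time verification can be approximated like this; a sketch using `requests`, where the test URL and timeout are my own assumptions, not necessarily ``tweetf0rm``'s actual validation logic:

```python
import requests

def proxy_works(proxy_addr, scheme="http", timeout=5):
    """Return True if a simple GET through the proxy succeeds.

    Illustrative only: the probe URL and timeout are assumptions,
    not tweetf0rm's actual proxy-validation logic.
    """
    proxies = {scheme: "%s://%s" % (scheme, proxy_addr)}
    try:
        resp = requests.get("http://www.twitter.com",
                            proxies=proxies, timeout=timeout)
        return resp.status_code == 200
    except requests.RequestException:
        # Covers refused connections, timeouts, and bad proxy replies.
        return False

if __name__ == "__main__":
    # Keep only proxies that respond, mirroring the bootstrap step.
    candidates = {"66.35.68.146:8089": "http", "69.197.132.80:7808": "http"}
    valid = [p for p, s in candidates.items() if proxy_works(p, s, timeout=2)]
    print(valid)
```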

To start the ``f0rm``, you can simply run:

    $ ./bootstrap.sh -c config.json -p proxies.json

To issue a command to the ``f0rm``, you are basically pushing commands to redis. This is what the commands look like, e.g.:

    cmd = {
        "cmd": "CRAWL_FRIENDS",
        "user_id": 1948122342,
        "data_type": "ids",
        "depth": 2,
        "bucket": "friend_ids"
    }

    cmd = {
        "cmd": "CRAWL_FRIENDS",
        "user_id": 1948122342,
        "data_type": "users",
        "depth": 2,
        "bucket": "friends"
    }

    cmd = {
        "cmd": "CRAWL_USER_TIMELINE",
        "user_id": 53039176,
        "bucket": "timelines"
    }

``bucket`` determines where the results will be saved within the ``output`` folder (specified in ``config.json``). All Twitter data are JSON-encoded strings, and output files are normally named with the Twitter user id, e.g., if you are crawling a user's timeline, all of his/her tweets will be stored in the ``timelines`` sub-folder under his/her Twitter id (the numerical and unique identifier for each Twitter user).

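Pushing such a command by hand looks roughly like this; a sketch assuming the queue is a plain redis list, where the queue key name is a guess rather than the key ``tweetf0rm`` actually uses:

```python
import json

def push_cmd(r, queue, cmd):
    """Serialize a command dict and LPUSH it onto a redis list queue."""
    return r.lpush(queue, json.dumps(cmd))

# Example command, taken from the README above.
cmd = {
    "cmd": "CRAWL_USER_TIMELINE",
    "user_id": 53039176,
    "bucket": "timelines",
}

if __name__ == "__main__":
    import redis  # listed under Dependencies
    # The queue key name below is hypothetical; check client.py for the real one.
    r = redis.StrictRedis(host="localhost", port=6379, db=0)
    push_cmd(r, "node_queue:localhost", cmd)
```

In practice you would use ``client.py`` for this rather than pushing raw commands yourself.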
There is a ``client.py`` script that helps you generate these commands and push them to the local redis node queue, e.g.:

    $ client.sh -c tests/config.json -cmd CRAWL_FRIENDS -d 1 -dt "ids" -uid 1948122342

This means you want to crawl all friends of ``uid=1948122342`` with a ``depth`` of ``1``, and the results are just the Twitter user ids of his/her friends. There are also commands you can use to crawl a list of users, e.g.:

    $ client.sh -c tests/config.json -cmd BATCH_CRAWL_FRIENDS -d 1 -dt "ids" -j user_ids.json

Instead of providing a specific ``user_id``, you provide a ``json`` file that contains a list of ``user_ids``.

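The exact schema of ``user_ids.json`` isn't shown here; a plausible layout, assuming the file simply carries a list of numeric ids under a ``user_ids`` key (check ``client.py`` for the actual field name):

```json
{
    "user_ids": [1948122342, 53039176]
}
```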
MISC
------------

There is also a script (``scripts/crawl_proxies.py``) for crawling the proxy server list from spys.ru. It crawls the http proxies listed on spys.ru, tests each one, and produces a ``proxies.json`` with the valid proxies.

Note that if you don't use proxies, you can only have one crawler active, since Twitter's rate limit applies to the account as well as the IP.


License
------------

The MIT License (MIT)
Copyright (c) 2013 Jiang Bian (ji0ng.bi0n@gmail.com)

Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
the Software, and to permit persons to whom the Software is furnished to do so,
subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.


[![Bitdeli Badge](https://d2weczhvl823v0.cloudfront.net/bianjiang/tweetf0rm/trend.png)](https://bitdeli.com/free "Bitdeli Badge")