A crawler that collects data from Sina Weibo using unofficial APIs.
pip install -r requirements.txt
# Install and configure MongoDB
# Necessary
DEFAULT_REQUEST_HEADERS = {
'cookie': 'your cookie'
}
MONGO_URI = 'YOUR_IP:YOUR_PORT'
MONGO_DATABASE = 'YOUR_DBNAME'
MONGO_USER = 'YOUR_USERNAME'
MONGO_PASSWORD = 'YOUR_PASSWORD'
# Speed parameters
CONCURRENT_REQUESTS = 32 # Maximum number of concurrent requests performed by Scrapy
DOWNLOAD_DELAY = 0.5 # Delay (in seconds) between requests to the same website
DOWNLOAD_TIMEOUT = 3 # Timeout (in seconds) before a download is abandoned
RETRY_TIMES = 1 # Maximum number of retries for a failed request
# Crawl Target
## weibo.com
KEYWORD_LIST = ['#日常#'] # Keywords used to search for tweets
WEIBO_TYPE = 1 # Weibo type: 0 for all, 1 for original posts only
CONTAIN_TYPE = 0 # Required content: 0 for all, 1 for images, 2 for video
REGION = ['全部'] # Region where the tweet was posted: '全部' for all, '北京' for Beijing
START_DATE = '2014-01-01' # Lower bound of the posting date range (yyyy-mm-dd)
END_DATE = '2020-12-31' # Upper bound of the posting date range (yyyy-mm-dd)
START_TIME = '0' # Lower bound of the posting time range (0-24)
END_TIME = '6' # Upper bound of the posting time range (0-24)
PROXY_POOL_URL = 'YOUR_PROXY_POOL_URL' # URL of your proxy pool (you need to obtain one yourself)
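
The MongoDB settings above could be combined into a single connection string roughly as follows. This is a minimal sketch, not the crawler's actual pipeline code: the helper name `build_mongo_uri` is hypothetical, and it assumes that `MONGO_URI` holds a plain `host:port` pair and that the user authenticates against `MONGO_DATABASE`.

```python
from urllib.parse import quote_plus


def build_mongo_uri(host_port, database, user, password):
    """Build a standard mongodb:// connection string (hypothetical helper)."""
    # Percent-encode the credentials so characters such as '@' or ':'
    # do not break the structure of the URI.
    credentials = f"{quote_plus(user)}:{quote_plus(password)}"
    return f"mongodb://{credentials}@{host_port}/{database}"


# Example values only; substitute your own settings.
uri = build_mongo_uri("127.0.0.1:27017", "weibo", "crawler", "p@ss:word")
print(uri)
```

A string in this form can then be passed to `pymongo.MongoClient`.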
Ind | Field Name in Response | Field Name in Code | Description |
---|---|---|---|
1 | id | _id | User id |
2 | screen_name | screen_name | User nickname |
3 | statuses_count | tweet_count | The number of all tweets |
4 | profile_image_url | avatar | Link of user avatar image |
5 | cover_image_phone | cover_image | Link of user cover image |
6 | description | description | User self-introduction |
7 | follow_count | follow_count | Number of accounts the user follows |
8 | followers_count | followers_count | Number of followers of the user |
9 | gender | gender | User gender (f for female, m for male) |
10 | verified | verified | Whether the user has passed Sina Weibo verification |
11 | verified_type | verified_type | User verification type, e.g. Individual, Enterprise, Government |
12 | verified_reason | verified_reason | Reason the user passed Sina Weibo verification |
13 | urank | urank | Activity level of the user on Sina Weibo (a larger number indicates more activity) |
14 | mbrank | mbrank | VIP rank of the user on Sina Weibo |
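
The renaming described in the table above can be sketched as a simple dictionary mapping. This is an illustration only: `USER_FIELD_MAP` and `to_user_item` are hypothetical names, and the sketch assumes the user object from the response is already available as a flat `dict`.

```python
# Response field -> field name used in code, copied from the table above.
USER_FIELD_MAP = {
    "id": "_id",
    "screen_name": "screen_name",
    "statuses_count": "tweet_count",
    "profile_image_url": "avatar",
    "cover_image_phone": "cover_image",
    "description": "description",
    "follow_count": "follow_count",
    "followers_count": "followers_count",
    "gender": "gender",
    "verified": "verified",
    "verified_type": "verified_type",
    "verified_reason": "verified_reason",
    "urank": "urank",
    "mbrank": "mbrank",
}


def to_user_item(raw):
    """Rename response fields to their in-code names, skipping absent ones."""
    return {code: raw[resp] for resp, code in USER_FIELD_MAP.items() if resp in raw}


# Made-up sample data; fields not listed in the table are dropped.
sample = {"id": 123, "screen_name": "demo", "statuses_count": 42, "unused": True}
print(to_user_item(sample))
```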
Ind | Field Name in Code | Field Name in Response | Description |
---|---|---|---|
1 | id | id | Tweet id (primary key) |
2 | user_id | user_id | User id (foreign key) |
3 | text | text | HTML of the tweet |
4 | pics | pics | Links to pictures in the tweet (comma-delimited) |
5 | video_url | page_info | Link to the video in the tweet |
6 | created_at | created_at | Posting time of the tweet (weekday month day hour:minute:second timezone year) |
7 | source | source | Device used to post the tweet, or the super topic under which it was posted |
8 | attitudes_count | attitudes_count | Number of likes on the tweet |
9 | comments_count | comments_count | Number of comments |
10 | reposts_count | reposts_count | Number of reposts of the tweet |
11 | location | text | Location of the user when posting the tweet |
12 | topics | text | Topics of the tweet (classified by Sina Weibo, comma-delimited) |
13 | at_users | text | Sina Weibo users mentioned in the tweet (comma-delimited) |
14 | article_url | text | Link to the headline article in the tweet |
15 | retweet | retweeted_status | Information about the retweeted tweet |
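
Two of the fields above lend themselves to a short illustration: `created_at`, whose layout is given in the table, and the `topics` and `at_users` values derived from `text`. The following is a rough sketch rather than the crawler's own parsing logic. The function names are hypothetical, and the regular expressions assume that topics appear as `#topic#` and mentions as `@nickname` once the HTML tags are stripped.

```python
import re
from datetime import datetime


def parse_created_at(value):
    """Parse 'weekday month day hour:minute:second timezone year'."""
    # %a and %b expect English abbreviations under the default C locale.
    return datetime.strptime(value, "%a %b %d %H:%M:%S %z %Y")


def extract_topics_and_mentions(html):
    """Return (topics, at_users) as comma-delimited strings."""
    text = re.sub(r"<[^>]+>", "", html)         # drop HTML tags
    topics = re.findall(r"#([^#]+)#", text)     # assumed form: #topic#
    at_users = re.findall(r"@([\w-]+)", text)   # assumed form: @nickname
    return ",".join(topics), ",".join(at_users)


# Made-up sample values for illustration.
posted = parse_created_at("Tue Dec 01 12:30:45 +0800 2020")
print(posted.isoformat())

sample_html = '<a href="/search">#日常#</a> hello <a href="/n/demo_user">@demo_user</a>'
print(extract_topics_and_mentions(sample_html))
```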
Ind | Field Name | Description |
---|---|---|
1 | _id | User id |
2 | screen_name | User nickname |
3 | orient_type | Type of user to be collected |
4 | keyword | Search keyword to filter content of tweets |
# API for all user information
https://m.weibo.cn/api/container/getIndex?containerid=100505{uid}
# API for the user profile
https://m.weibo.cn/api/container/getIndex?containerid=230283{uid}_-_INFO
# API for crawling a user's tweets
https://m.weibo.cn/api/container/getIndex?containerid=107603{uid}&page={page}
# API for detailed tweet information (for long tweets or tweets containing more than 9 pictures)
https://m.weibo.cn/statuses/extend?id={tid}
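
The URL templates above can be filled in with small helper functions, sketched below. The function names are hypothetical and not part of the crawler. The sketch also assumes that the tweets endpoint takes a user id after the `107603` prefix and a 1-based `page` number.

```python
BASE = "https://m.weibo.cn"


def user_info_url(uid):
    """URL for all information of a user."""
    return f"{BASE}/api/container/getIndex?containerid=100505{uid}"


def user_profile_url(uid):
    """URL for the profile of a user."""
    return f"{BASE}/api/container/getIndex?containerid=230283{uid}_-_INFO"


def user_tweets_url(uid, page=1):
    """URL for one page of a user's tweets (uid and paging are assumptions)."""
    return f"{BASE}/api/container/getIndex?containerid=107603{uid}&page={page}"


def tweet_detail_url(tid):
    """URL for the detailed information of a single tweet."""
    return f"{BASE}/statuses/extend?id={tid}"


# Example ids are made up for illustration.
print(user_tweets_url("1234567890", page=2))
print(tweet_detail_url("4567890123456789"))
```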