# Twitter

Is it possible to use the DomainTools list to look for tweets that are potentially spreading misinformation? Fortunately, Twitter's search does allow searching by domain name: try [www.cdc.gov](https://twitter.com/search?q=www.cdc.gov&f=live) for example. However, the Search API only returns results from the last week. With that caveat in mind, we can still take a look.
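
The link above is just a search query with the domain as the search term. As a minimal sketch, the same kind of URL can be built for any domain; the helper name `live_search_url` is made up for illustration, and the `q` and `f=live` parameters are simply copied from the example link.

```python
from urllib.parse import urlencode

def live_search_url(domain):
    """Build a twitter.com live-search URL for a domain (hypothetical helper)."""
    # The parameter names mirror the www.cdc.gov example link above.
    return "https://twitter.com/search?" + urlencode({"q": domain, "f": "live"})

print(live_search_url("www.cdc.gov"))
```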

## DomainTools

First we need to load the DomainTools data:

In [35]:
import pandas

df = pandas.read_csv('data/domaintools/2020-04-05.csv.gz',
    parse_dates=['created'], 
    sep='\t',
    names=['domain', 'created', 'risk']
)
df

Unnamed: 0,domain,created,risk
0,covid19fashions.com,2020-04-04,99
1,coronaviruscardgame.com,2020-04-04,99
2,covidonline.me,2020-04-04,99
3,covid-19contained.com,2020-04-04,99
4,covidcrime-chicago.com,2020-04-04,99
...,...,...,...
116706,wuhanbj.cn,2020-01-23,70
116707,wuhanpneumonia.org,2020-01-19,70
116708,wuhanfengtai.com,2020-01-13,70
116709,ecovida5dmalaga.com,2020-01-07,70


## Collect Data

[twarc](https://github.com/docnow/twarc) is a useful Python library for collecting tweet JSON data while observing Twitter's API rate limits. We have 116,710 domains to search for, so being able to observe the rate limits is important.

[Twitter's Rate Limits](https://developer.twitter.com/en/docs/basics/rate-limits) show that the search API has a limit of 180 requests / 15 minutes for User authentication and 450 requests / 15 minutes for App authentication. The difference between these is a bit complicated to go into here, but basically User auth is for when you want to do something on behalf of a user (say, send a tweet), and App auth is for when you are acting as a software application. We aren't doing these searches *for* a particular user, and App authentication will let us perform a lot more queries.
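
To make the difference between the two limits concrete, here is a rough sketch that converts a number of requests into hours for each authentication type. The helper name `hours_needed` is made up for illustration, and it assumes one request per search. A domain with many matching tweets needs several requests because results are paged, so treat the result as a lower bound.

```python
# Search API limits quoted above, in requests per 15-minute window.
LIMITS = {"user": 180, "app": 450}

def hours_needed(n_requests, auth="app"):
    """Lower bound on the hours needed to issue n_requests (hypothetical helper)."""
    per_hour = LIMITS[auth] * 4  # four 15-minute windows per hour
    return n_requests / per_hour

# 1800 requests is an arbitrary example figure.
for auth in LIMITS:
    print(auth, hours_needed(1800, auth))
```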

How long will it take?

In [36]:
reqs_per_hour = 450 * 4
116711 / reqs_per_hour

64.83944444444444

It will take almost 65 hours! How about if we only look at the domains with the highest risk score?

In [37]:
riskiest = df[df.risk >= 99.0]
len(riskiest) / reqs_per_hour

47.946666666666665

That is still a long time, so we had better get started. Let's just look for the domains that have been ranked the riskiest. It's probably for the best anyway, since we want to avoid false positives.

If you would rather use the data we have stored here, you can skip to the next section.
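
To see how the choice of risk threshold trades coverage against collection time, the same arithmetic can be repeated for several cutoffs. This is only a sketch: `hours_at_threshold` is a made-up helper and `toy_risks` holds invented scores, not the DomainTools data. With the real data you could pass `df.risk` instead.

```python
REQS_PER_HOUR = 450 * 4  # App auth: 450 requests per 15 minutes

def hours_at_threshold(risks, threshold):
    """Hours needed to search every domain whose risk is >= threshold."""
    n = sum(1 for r in risks if r >= threshold)
    return n / REQS_PER_HOUR

# Made-up risk scores for illustration only.
toy_risks = [99] * 900 + [90] * 450 + [70] * 450

for threshold in (70, 90, 99):
    print(threshold, hours_at_threshold(toy_risks, threshold))
```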

In [38]:
import twarc

You can skip this next cell if you've used twarc before. Otherwise you need to run it to tell twarc your API keys.

In [None]:
t = twarc.Twarc(validate_keys=False)
t.configure()

If you didn't run the cell above you will now need to create your twarc client.

In [39]:
t = twarc.Twarc(app_auth=True)

Now we can search for each domain and write the full tweets as line-oriented JSON, where each line is a tweet serialized as a JSON object.

In [None]:
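import json

# NOTE: this output path is an assumption; the original cell never defined `out`.
with open('data/tweets.jsonl', 'w') as out:
    for domain in riskiest.domain:
        for tweet in t.search(domain):
            # write one JSON object per line
            out.write(json.dumps(tweet) + '\n')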
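
Line-oriented JSON is convenient because it can be read back one record at a time without loading the whole file. Below is a minimal sketch of the round trip, using made-up records and a temporary file rather than real tweets; `read_jsonl` is a hypothetical helper, not part of twarc.

```python
import json
import os
import tempfile

# Stand-in records; real tweet objects have many more fields.
sample = [{"id": 1, "text": "first"}, {"id": 2, "text": "second"}]

path = os.path.join(tempfile.mkdtemp(), "tweets.jsonl")

# Write one JSON object per line.
with open(path, "w") as out:
    for record in sample:
        out.write(json.dumps(record) + "\n")

def read_jsonl(path):
    """Yield one parsed JSON object per non-empty line."""
    with open(path) as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

loaded = list(read_jsonl(path))
print(len(loaded), loaded[0]["text"])
```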