# Twitter

Is it possible to use the DomainTools list to look for tweets that are potentially spreading misinformation? Fortunately, Twitter's Search does allow searching by domain name: try [www.cdc.gov](https://twitter.com/search?q=www.cdc.gov&f=live) for example. However, the Search API only returns results from the last week. Given this caveat we can still take a look.

## DomainTools

First we need to load the DomainTools data:

In [107]:
import pandas

df = pandas.read_csv('data/domaintools/2020-04-05.csv.gz',
    parse_dates=['created'], 
    sep='\t',
    names=['domain', 'created', 'risk']
)
df

|  | domain | created | risk |
|---|---|---|---|
| 0 | covid19fashions.com | 2020-04-04 | 99 |
| 1 | coronaviruscardgame.com | 2020-04-04 | 99 |
| 2 | covidonline.me | 2020-04-04 | 99 |
| 3 | covid-19contained.com | 2020-04-04 | 99 |
| 4 | covidcrime-chicago.com | 2020-04-04 | 99 |
| ... | ... | ... | ... |
| 116706 | wuhanbj.cn | 2020-01-23 | 70 |
| 116707 | wuhanpneumonia.org | 2020-01-19 | 70 |
| 116708 | wuhanfengtai.com | 2020-01-13 | 70 |
| 116709 | ecovida5dmalaga.com | 2020-01-07 | 70 |


## Collect Tweets

[twarc](https://github.com/docnow/twarc) is a useful Python library for collecting tweet JSON data while observing Twitter's API rate limits. We have 116,710 domains to search for, so being able to observe the rate limits is important.

[Twitter's Rate Limits](https://developer.twitter.com/en/docs/basics/rate-limits) show that the search API has a limit of 180 requests / 15 minutes for User authentication and 450 requests / 15 minutes for App authentication. The difference between these is a bit complicated to go into here, but basically User auth is for when you want to do something on behalf of a user (say, send a tweet), and App auth is for when you do something as a software application. We aren't doing these searches *for* a particular user, and App authentication will let us perform a lot more queries.
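To make the difference concrete, the two limits can be converted into requests per hour. This is only a rough sketch: the helper name `requests_per_hour` is made up for illustration, and it assumes the 15 minute windows quoted above.

```python
# Hypothetical helper: convert a "requests per window" limit into requests per hour.
def requests_per_hour(requests_per_window, window_minutes=15):
    windows_per_hour = 60 / window_minutes
    return requests_per_window * windows_per_hour

user_auth = requests_per_hour(180)  # limit quoted above for User authentication
app_auth = requests_per_hour(450)   # limit quoted above for App authentication

print(user_auth, app_auth)
```

Under these assumptions App authentication allows two and a half times as many searches per hour as User authentication.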

But how long will it take?

In [36]:
reqs_per_hour = 450 * 4
116711 / reqs_per_hour

64.83944444444444

It will take almost 65 hours! How about if we only look at the domains with the highest risk score?

In [42]:
len(df[df.risk >= 99.0]) / reqs_per_hour

47.946666666666665

Ok, that's still a long time to wait. Since Twitter's API only lets us search tweets for the past 7 days maybe we could get an interesting view by looking for domains that were created during that time? We also might stand a better chance of seeing them before they are deleted by Twitter.

So how long will this take?

In [55]:
import datetime

a_week_ago = datetime.datetime.now() - datetime.timedelta(days=7)

len(df[(df.risk >= 99.0) & (df.created >= a_week_ago)]) / reqs_per_hour

4.051111111111111

4 hours seems manageable! Let's start by looking at those.

In [57]:
latest_domains = df[(df.risk >= 99.0) & (df.created >= a_week_ago)]
latest_domains

|  | domain | created | risk |
|---|---|---|---|
| 0 | covid19fashions.com | 2020-04-04 | 99 |
| 1 | coronaviruscardgame.com | 2020-04-04 | 99 |
| 2 | covidonline.me | 2020-04-04 | 99 |
| 3 | covid-19contained.com | 2020-04-04 | 99 |
| 4 | covidcrime-chicago.com | 2020-04-04 | 99 |
| ... | ... | ... | ... |
| 7287 | covid19costume.com | 2020-03-31 | 99 |
| 7288 | covid19testinglocations.co.nz | 2020-03-31 | 99 |
| 7289 | covidegypt.cf | 2020-03-31 | 99 |
| 7290 | goawaycovid19.org | 2020-03-31 | 99 |


In [58]:
import twarc

You can skip this next cell if you've used twarc before. Otherwise you need to run it to tell twarc your API keys.

In [None]:
t = twarc.Twarc(validate_keys=False)
t.configure()

Whether or not you ran the cell above, you now need to create a twarc client that uses App authentication.

In [59]:
t = twarc.Twarc(app_auth=True)

Now we can search for each domain and write the full tweets as line-oriented JSON, where each line is a tweet serialized as a JSON object. As we go we'll keep track of the domains that had some tweet matches.

Note that we use the `url:` search operator, which is documented in [this extensive but unofficial guide](https://github.com/igorbrigadir/twitter-advanced-search) to Twitter's advanced search. But just to be on the safe side we will also look for a hostname match in each tweet's URLs. The one accommodation we make is to also accept the domain with a *www* subdomain in front of it.
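The hostname check can be sketched on its own as a small function. The name `matches_domain` is hypothetical and only used here for illustration. It compares the full network location, so a URL that includes a port or credentials would not match.

```python
from urllib.parse import urlparse

# Hypothetical helper mirroring the hostname check described above.
def matches_domain(expanded_url, domain):
    """Return True if the URL's host is the domain or its www subdomain."""
    host = urlparse(expanded_url).netloc.lower()
    domain = domain.lower()
    return host == domain or host == 'www.' + domain

print(matches_domain('https://www.cdc.gov/coronavirus', 'cdc.gov'))
print(matches_domain('https://cdc.gov.example.com/', 'cdc.gov'))
```

The second call shows why an exact comparison is used rather than a substring test: a lookalike host that merely contains the domain is rejected.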

In [105]:
import json

from urllib.parse import urlparse
from collections import Counter

# write the tweets as line-oriented JSON (one tweet per line)
today = datetime.date.today().strftime('%Y-%m-%d')
out = open('data/twitter/week-{}.jsonl'.format(today), 'w')

counter = Counter()
for domain in latest_domains.domain:
    domain = domain.lower()

    for tweet in t.search('url:{}'.format(domain)):
        for url in tweet['entities']['urls']:
            found_domain = urlparse(url['expanded_url']).netloc.lower()
            if found_domain == domain or found_domain == 'www.' + domain:
                # one JSON object per line
                out.write(json.dumps(tweet) + '\n')
                counter[domain] += 1

    if counter[domain] > 0:
        print('{} [{}]'.format(domain, counter[domain]))
        # save the running counts each time a domain has matches
        with open('data/twitter/week-{}-counts.json'.format(today), 'w') as fh:
            json.dump(counter, fh, indent=2)

out.close()


covidcrime-chicago.com [6]
faeccadizcovid19.es [1]
covid19speed.com [7]




wrenthamcovid19.com [5]
coronavirus-inform.info [1]
covid19drc.com [26]
coronavirustracker2020.com [1]




coronafeirws.cymru [10]
coronavirusstats.eu [2]
covidaz.help [20]
covidfocus.org [7]
corona-rikon.com [9]
coronavirusespana.app [1]
covid19sikkim.org [5]




coronavirus-per-country.com [1]
does5gcausecovid19.com [2]
healthequitycovid19.org [2]
freakycoronamerch.de [1]
latestcovidnews.com [1]
coughagainstcovid.org [4]
oguncovid19.org [4]
coronavirus-website.ru [45]
heroescovid19.site [5]
covid-calc.org [9]




covid19-now.info [2]
meditateagainstcovid19.org [12]
covid19assam.info [5]
coronamw.com [5]
coronavirusrevolution.com [1]
covid19actioncrm.com [2]




coronadoiscalling.com [1]
corona-sogo.info [2]
covid19rehber.com [39]
finedelcovid19.com [1]
covid19sci.org [22]
coronavirusindia.tech [1]
covidchugchallenge.com [1]
covid19-interventions.com [2]
coronavirusaccessories.com [10]
coronavirusthegame.com [2]
covid19-times.com [3]




covid-near-me.com [4]
covid19intervinents.cat [1]
covidtracker.world [14]
coronaviruschildrensbook.com [2]
covid19masr.com [1]
covid19rois.com [1]
covidnigeria.com.ng [2]
covidsafetynetwork.com [2]
covid-coronavirus.fr [2]
coronatasks.com [1]
covid19insweden.com [5]




coronavirusconspiracy.xyz [1]
newswithoutcoronavirus.com [5]
actions-fondations-covid19.org [3]
coronavirus-covid19.app [1]
covidfactcheck.in [1]
ugcovid19.org [3]
coronavirushelp.com.ng [1]


ERROR:twarc:caught connection error ('Connection aborted.', ConnectionResetError(54, 'Connection reset by peer')) on 1 try


covid-19-radar.com [1]
covid19newshub.com [3]
covid19switchboard.org [5]
coronavirusdiary.site [2]
covid19rehberi.com [28]
zamcovid19.tech [2]




ending-covid19.com [1]
chineseamericansfightcovid19.com [1]
covid19tracker.io [2]
cryovscovid19.com [2]
hopecovidph.ga [1]
covid19-arab.com [1]
covid19indiaorg.com [3]
covidsafepaths.org [7]
covid19mtl.ca [3]
coronafirws.cymru [10]
covid19impactsurvey.org [188]
indiastopcorona.in [1]
usacovidmasks.com [6]
covid19copingstudy.com [1]
covidpreprints.com [36]


ERROR:twarc:caught connection error ('Connection aborted.', ConnectionResetError(54, 'Connection reset by peer')) on 1 try


covid19colombia.app [20]
covid19rsa.com [6]
covid19retailpulselive.com [1]
save-covid19.online [1]
covidvirus.me [4]
covid19engr.com [46]
sitsi-covid19.com [2]
corona45.org [1]
insidecovid19.es [10]
covid19vuk.org [3]


ERROR:twarc:caught connection error ('Connection aborted.', ConnectionResetError(54, 'Connection reset by peer')) on 1 try


ayolawancovid19.id [10]
covid19resources.in [1]
mycoronaimpact.com [3]
covid19misinformation.com [3]
fastestcovidtest.com [1]
bifmacovid19.org [1]
covid-19-paspoort.nl [1]


ERROR:twarc:caught connection error ('Connection aborted.', ConnectionResetError(54, 'Connection reset by peer')) on 1 try


pacificcountycovid19.com [3]
ahefcovid19test.com [1]
zcovid19.org [4]
covid19usa.io [1]
covidhelp.pk [2]
finducovid19.com [15]
covid19reviews.org [8]
mycovidindia.com [3]
aacoronavirus.org [12]
covid19-dashboard.ch [1]
covid19-tracker-ca.org [3]
stopcovid19-sites.site [1]
coronaaware.app [1]


ERROR:twarc:caught connection error ('Connection aborted.', ConnectionResetError(54, 'Connection reset by peer')) on 1 try


rgvcovid19info.com [5]
simptom-coronavirusa.ru [2]
beritacovid19.com [1]
laboratore-proti-koronaviru.cz [3]
mbasfightcovid19.com [2]
mycovidpin.com [2]
salone-covid19.website [2]
coronavs.life [7]
corona-eindaemmen.de [1]
covid19sydney.com [35]
nmcovid19.org [7]
holacoronavirus.com.mx [2]
infocovid-entreprise95.fr [2]
coronavirus-business-support.co.uk [1]
argentinalibredecoronavirus.com [1]
covid19jc.com [22]


ERROR:twarc:caught connection error ('Connection aborted.', ConnectionResetError(54, 'Connection reset by peer')) on 1 try


covid19-updates.info [1]
covid19cuisines.com [1]
covid19abatement.ca [1]
covidgo.in [1]
coronavirus-tables.ru [10]
covid19-pk.live [2]
covidcrashcourse.com [1]
wmasscovid.com [8]
coronvirus.site [1]
covid19stats.ph [156]
hawaiifightscovid.org [10]
hanzecoronameldpunt.nl [1]
challengecovid.org [1]
aev-covid19.org [2]
coronavirussessions.org [1]
champaigncountycovid19.org [3]
covidtrackingus.org [12]
covidtruth.site [4]




hackcovid19.in [6]
spiritualiteitvoorcoronatijd.org [17]
covidmoney.info [4]
covid19mv.live [4]
australiacovid19.com [2]
bloquercovid19.com [1]
covid-19-act.jp [232]
txcovid19erp.org [7]
coronabonanza.com [2]




thecovidate.com [1]
maparealdelcoronavirus.com [15]
putnamcovidresponse.org [1]
ocoronavirus.org [1]
signupforthecovid19update.com [1]
covidchange.com [1]
covid19.co.it [1]
covid19app.site [2]
covidahaarhelp.in [5]
us-covid-tracker.com [32]
macovid19relieffund.org [68]
kickoffcorona.com [8]




covid-alerts.com [1]
yakimavscovid19.org [2]
covid19medicin.dk [1]
covid19jobs.io [6]
covidhaber.com [1]
covidtrends.in [2]
coronavirus-heroes.org [1]
covid19cd.com [19]
coronaviruscommission.com [3]




covid19stories.in [1]
coronavirusmap.me [11]
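Because the output file is meant to hold one tweet per line, it can be read back a line at a time without loading everything into memory. The following is a minimal sketch that assumes each tweet really is on its own line. It uses an in-memory `io.StringIO` with two made-up tweets in place of the real file.

```python
import io
import json
from collections import Counter
from urllib.parse import urlparse

# Two made-up tweets standing in for the real file contents.
sample = io.StringIO(
    json.dumps({'id': 1, 'entities': {'urls': [{'expanded_url': 'https://example.com/a'}]}}) + '\n' +
    json.dumps({'id': 2, 'entities': {'urls': [{'expanded_url': 'https://www.example.org/b'}]}}) + '\n'
)

# Parse one tweet per line and tally the hostnames found in its URLs.
hosts = Counter()
for line in sample:
    tweet = json.loads(line)
    for url in tweet['entities']['urls']:
        hosts[urlparse(url['expanded_url']).netloc.lower()] += 1

print(hosts)
```

With the real data, `sample` would be replaced by the opened `.jsonl` file.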


## Analyze

Let's fold our counts back into the dataset.

In [108]:
tweets = pandas.Series(counter)
df = df.set_index('domain')
df['tweets'] = tweets
df

| domain | created | risk | tweets |
|---|---|---|---|
| covid19fashions.com | 2020-04-04 | 99 | NaN |
| coronaviruscardgame.com | 2020-04-04 | 99 | NaN |
| covidonline.me | 2020-04-04 | 99 | NaN |
| covid-19contained.com | 2020-04-04 | 99 | NaN |
| covidcrime-chicago.com | 2020-04-04 | 99 | 6.0 |
| ... | ... | ... | ... |
| wuhanbj.cn | 2020-01-23 | 70 | NaN |
| wuhanpneumonia.org | 2020-01-19 | 70 | NaN |
| wuhanfengtai.com | 2020-01-13 | 70 | NaN |
| ecovida5dmalaga.com | 2020-01-07 | 70 | NaN |
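The missing values and the floating point counts both come from how pandas aligns a Series with the DataFrame index on assignment. Here is a small sketch with made-up domains and counts (the names `toy` and `toy_counts` are for illustration only):

```python
import pandas
from collections import Counter

# Made-up domains and counts, for illustration only.
toy = pandas.DataFrame({'risk': [99, 99, 70]}, index=['a.com', 'b.com', 'c.com'])
toy_counts = Counter({'b.com': 6})

# Assignment aligns on the index: domains absent from the counter become NaN,
# which forces the whole column to a float dtype.
toy['tweets'] = pandas.Series(toy_counts)
print(toy)
```

Only `b.com` appears in the counter, so it is the only row that gets a value, and that value is stored as `6.0`.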


We can replace those NaN values with zero:

In [109]:
df['tweets'] = df['tweets'].fillna(0)
df

| domain | created | risk | tweets |
|---|---|---|---|
| covid19fashions.com | 2020-04-04 | 99 | 0.0 |
| coronaviruscardgame.com | 2020-04-04 | 99 | 0.0 |
| covidonline.me | 2020-04-04 | 99 | 0.0 |
| covid-19contained.com | 2020-04-04 | 99 | 0.0 |
| covidcrime-chicago.com | 2020-04-04 | 99 | 6.0 |
| ... | ... | ... | ... |
| wuhanbj.cn | 2020-01-23 | 70 | 0.0 |
| wuhanpneumonia.org | 2020-01-19 | 70 | 0.0 |
| wuhanfengtai.com | 2020-01-13 | 70 | 0.0 |
| ecovida5dmalaga.com | 2020-01-07 | 70 | 0.0 |


We can sort the domains by the highest number of tweets:

In [110]:
df = df.sort_values('tweets', ascending=False)
df.head(25)

| domain | created | risk | tweets |
|---|---|---|---|
| covid-19-act.jp | 2020-03-31 | 99 | 232.0 |
| covid19impactsurvey.org | 2020-04-01 | 99 | 188.0 |
| covid19stats.ph | 2020-03-31 | 99 | 156.0 |
| macovid19relieffund.org | 2020-03-31 | 99 | 68.0 |
| covid19engr.com | 2020-04-01 | 99 | 46.0 |
| coronavirus-website.ru | 2020-04-03 | 99 | 45.0 |
| covid19rehber.com | 2020-04-02 | 99 | 39.0 |
| covidpreprints.com | 2020-04-01 | 99 | 36.0 |
| covid19sydney.com | 2020-04-01 | 99 | 35.0 |
| us-covid-tracker.com | 2020-03-31 | 99 | 32.0 |


So that's somewhat interesting, but we don't see tons of sharing going on. What if we look at the Twitter data itself to get a sense of the number of retweets for some of these?