# Twitter Data Collection with Twint

## Overview

Before I could perform any sentiment analysis on tweets, my first step was to collect and aggregate those tweets and their metadata in one place. Thanks to the work of [Josh Szymanowski](https://github.com/p-szymo) and [Eric Blander](https://github.com/EricB10) on their [project](https://github.com/p-szymo/twitter-sentiment-analysis), and Josh's guidance on my own work, I wrote a function using the specialized Python package [Twint](https://github.com/twintproject/twint). The function `twint_scrape` scrapes the data from Twitter and is used by the function `search_loop` to iterate over the specified time period. Both functions are found in the `functions.py` file stored in the `src` folder.

In the end, after running the scraper a number of times, I was able to collect 80,730 tweets from February to December of 2020 using the search term 'covid vaccine'. The next step after scraping the data is cleaning it, which I do in the next notebook.
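
The month-by-month iteration that `search_loop` performs can be pictured roughly as follows. This is only a minimal sketch, not the code in `functions.py`: the helper names `month_windows` and `run_monthly`, and the `scrape(search, since, until, limit)` call signature, are assumptions made for illustration.

```python
from datetime import date


def month_windows(year, first_month=1, last_month=12):
    """Yield (since, until) ISO date strings, one pair per calendar month."""
    for month in range(first_month, last_month + 1):
        start = date(year, month, 1)
        # Each window ends on the first day of the following month.
        end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
        yield start.isoformat(), end.isoformat()


def run_monthly(scrape, search, limit, year=2020):
    """Call a (hypothetical) scrape function once per month and pool the results."""
    results = []
    for since, until in month_windows(year):
        # `scrape` stands in for whatever actually fetches the tweets.
        results.extend(scrape(search, since, until, limit))
        print(f"{since} saved!")
    return results
```

Passing the scraper in as an argument keeps the date logic separate from the network calls, so the loop can be checked with a stand-in function before running a real scrape.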

In [1]:
# import pandas for viewing and saving to .csv
import pandas as pd

# import my functions and autoreload any updates
from src.functions import * 

%load_ext autoreload
%autoreload 2

In [8]:
# my initial test scrape
search = 'covid vaccine'
filename = 'covid_vaccine_tweets'
limit = 3000

In [13]:
%time df = search_loop(search, filename, limit)

[!] No more data! Scraping will stop now.
found 0 deleted tweets in this search.
2020-01-01 saved!
2020-02-01 saved!
2020-03-01 saved!
2020-04-01 saved!
2020-05-01 saved!
2020-06-01 saved!
2020-07-01 saved!
2020-08-01 saved!
2020-09-01 saved!
2020-10-01 saved!
2020-11-01 saved!
2020-12-01 saved!
CPU times: user 18.9 s, sys: 795 ms, total: 19.7 s
Wall time: 7min 4s


In [14]:
df = pd.read_csv('./data/covid_vaccine_tweets.csv')

In [15]:
df.shape

(33639, 39)

Roughly 33,600 tweets is a good start, but not enough to support any strong generalizations or recommendations.

-----------------------------------------------------------------------------------

## Limit of 7,500 per Month

I eventually found the upper limit of my Twint scrapes to be around 7,500 tweets per month; testing 8,000 or more per month led to errors in the scraping process and unusable data. In the future I plan to come back and tweak the function to scrape more data, ideally from specific locations. The function saves the data to a CSV file with the specified `filename` and stores it in the `data` folder in the repository. Below, I've saved the data I'll be analyzing under the filename 'covid_vaccine_tweets.csv'.

In [13]:
search = 'covid vaccine'
filename = 'covid_vaccine_tweets'
limit = 7500

In [14]:
%time df = search_loop(search, filename, limit)

[!] No more data! Scraping will stop now.
found 0 deleted tweets in this search.
2020-01-01 saved!
[!] No more data! Scraping will stop now.
found 0 deleted tweets in this search.
2020-02-01 saved!


ERROR:asyncio:Fatal read error on socket transport
protocol: <asyncio.sslproto.SSLProtocol object at 0x7fef0cbd86a0>
transport: <_SelectorSocketTransport fd=62 read=polling write=<idle, bufsize=0>>
Traceback (most recent call last):
  File "/Users/davidbruce/opt/anaconda3/envs/learn-env/lib/python3.6/asyncio/selector_events.py", line 714, in _read_ready
    data = self._sock.recv(self.max_size)
TimeoutError: [Errno 60] Operation timed out


2020-03-01 saved!
2020-04-01 saved!
2020-05-01 saved!
2020-06-01 saved!
2020-07-01 saved!
2020-08-01 saved!
2020-09-01 saved!
2020-10-01 saved!
2020-11-01 saved!
2020-12-01 saved!
CPU times: user 45.1 s, sys: 1.94 s, total: 47 s
Wall time: 13min 34s
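
The socket timeout in the output above suggests one possible tweak for a future version: retrying a month's scrape when a network error occurs. The sketch below is an assumption about how that could look, not part of `functions.py`; `with_retries` is a hypothetical helper. It only helps when the error actually propagates to the caller, whereas the timeout above was logged by asyncio while the loop carried on.

```python
import time


def with_retries(task, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Run task(); on a timeout or connection error, wait and try again."""
    for attempt in range(1, attempts + 1):
        try:
            return task()
        except (TimeoutError, ConnectionError):
            if attempt == attempts:
                # Out of attempts: let the caller see the error.
                raise
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

A call might look like `with_retries(lambda: search_loop(search, filename, limit))`, although retrying one month at a time would avoid repeating months that were already saved.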


In [19]:
df_3 = pd.read_csv('./data/covid_vaccine_tweets.csv')

In [20]:
df_3.shape

(80730, 39)

# Twint Scrape with Location

In [22]:
search = 'covid vaccine'
filename = 'ny_tweets'
city = 'New York'
limit = 7500

In [23]:
%time df = search_loc_loop(search, city, filename, limit)

[!] No more data! Scraping will stop now.
found 0 deleted tweets in this search.
2020-01-01 saved!
[!] No more data! Scraping will stop now.
found 0 deleted tweets in this search.
2020-02-01 saved!
[!] No more data! Scraping will stop now.
found 0 deleted tweets in this search.
2020-03-01 saved!
[!] No more data! Scraping will stop now.
found 0 deleted tweets in this search.
2020-04-01 saved!
[!] No more data! Scraping will stop now.
found 0 deleted tweets in this search.
2020-05-01 saved!
[!] No more data! Scraping will stop now.
found 0 deleted tweets in this search.
2020-06-01 saved!
[!] No more data! Scraping will stop now.
found 0 deleted tweets in this search.
2020-07-01 saved!
[!] No more data! Scraping will stop now.
found 0 deleted tweets in this search.
2020-08-01 saved!
[!] No more data! Scraping will stop now.
found 0 deleted tweets in this search.
2020-09-01 saved!
[!] No more data! Scraping will stop now.
found 0 deleted tweets in this search.
2020-10-01 saved!
[!] No mor

In [24]:
df = pd.read_csv('./data/ny_tweets.csv')

In [26]:
df.shape

(16847, 39)

In [27]:
df.head()

Unnamed: 0.1,Unnamed: 0,id,conversation_id,created_at,date,timezone,place,tweet,language,hashtags,...,geo,source,user_rt_id,user_rt,retweet_id,reply_to,retweet_date,translate,trans_src,trans_dest
0,0,1230529563098304513,1230529563098304513,1582220000000.0,2020-02-20 11:28:04,-500,,"During a live conference regarding COVID-19, @...",en,['covid2019'],...,,,,,,[],,,,
1,1,1244475262835851264,1244475262835851264,1585540000000.0,2020-03-30 00:03:18,-500,,if Shuri from Black Panther was real we would’...,en,[],...,,,,,,[],,,,
2,2,1244177043761356801,1244170766976978945,1585470000000.0,2020-03-29 04:18:17,-500,,"@MalcolmNance @RealCandaceO How about this, Ca...",en,[],...,,,,,,"[{'screen_name': 'MalcolmNance', 'name': 'Malc...",,,,
3,3,1243921301347143682,1243921301347143682,1585410000000.0,2020-03-28 11:22:03,-500,,@realDonaldTrump please don’t allow relaxed EP...,en,[],...,,,,,,[],,,,
4,4,1243158889211797504,1243158889211797504,1585230000000.0,2020-03-26 08:52:30,-500,,Let’s go #Pittsburgh! — Researchers in Pittsbu...,en,['pittsburgh'],...,,,,,,[],,,,


In [28]:
df.tail()

Unnamed: 0.1,Unnamed: 0,id,conversation_id,created_at,date,timezone,place,tweet,language,hashtags,...,geo,source,user_rt_id,user_rt,retweet_id,reply_to,retweet_date,translate,trans_src,trans_dest
16842,8773,1336546754943209474,1335035983327326208,1607492552000.0,2020-12-09 00:42:32,-500,,@EricaKaplan15 @RealCandaceO Alright for real ...,en,[],...,,,,,,"[{'screen_name': 'EricaKaplan15', 'name': 'Eri...",,,,
16843,8774,1336546571123642368,1336546571123642368,1607492509000.0,2020-12-09 00:41:49,-500,,𝗗𝗲𝗰𝗶𝘀𝗶𝗼𝗻 😧 #Decision #BusinessIndustrial The ...,en,"['decision', 'businessindustrial']",...,,,,,,[],,,,
16844,8775,1336545739120455680,1335237814552649728,1607492310000.0,2020-12-09 00:38:30,-500,,@jorient @YouTube I am so happy to see an inte...,en,[],...,,,,,,"[{'screen_name': 'jorient', 'name': 'Jane Orie...",,,,
16845,8776,1336545725673508864,1336411251078025217,1607492307000.0,2020-12-09 00:38:27,-500,,@ZaidZamanHamid MMR’s vaccine which was tested...,en,[],...,,,,,,"[{'screen_name': 'ZaidZamanHamid', 'name': 'Za...",,,,
16846,8777,1336545412220596230,1336417759970865159,1607492232000.0,2020-12-09 00:37:12,-500,,@GenFlynn @realDonaldTrump The final phase was...,en,[],...,,,,,,"[{'screen_name': 'GenFlynn', 'name': 'General ...",,,,
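
The `Unnamed: 0.1` and `Unnamed: 0` columns in the output above look like index columns that were written to the CSV and read back as data. One likely cause (an assumption, since the saving code in `functions.py` is not shown here) is calling `to_csv` with its default `index=True` and then `read_csv` without `index_col`. A small sketch with toy data shows the effect and one way to avoid it:

```python
import io

import pandas as pd

# Toy stand-in for the scraped tweets.
df = pd.DataFrame({"id": [1, 2], "tweet": ["a", "b"]})

# Default to_csv writes the index; a plain read_csv then turns it
# into an extra "Unnamed: 0" column.
buf = io.StringIO()
df.to_csv(buf)
buf.seek(0)
with_extra = pd.read_csv(buf)

# Writing with index=False keeps the round trip clean.
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
clean = pd.read_csv(buf)

print(list(with_extra.columns))
print(list(clean.columns))
```

Saving a frame that already carries an `Unnamed: 0` column with the index on again would add a second such column, which pandas names `Unnamed: 0.1`.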


## Location-Based Tweets
Below I explore scraping location-based tweets, but I have not yet been able to scrape as many tweets as I would like from any one locale.

In [52]:
search = 'covid vaccine'
city = 'New York'
filename = 'no_limit_ny'

In [53]:
search_loc_loop(search, city, filename)

[!] No more data! Scraping will stop now.
found 0 deleted tweets in this search.
2020-01-01 saved!
[!] No more data! Scraping will stop now.
found 0 deleted tweets in this search.
2020-02-01 saved!
[!] No more data! Scraping will stop now.
found 0 deleted tweets in this search.
2020-03-01 saved!
[!] No more data! Scraping will stop now.
found 0 deleted tweets in this search.
2020-04-01 saved!
[!] No more data! Scraping will stop now.
found 0 deleted tweets in this search.
2020-05-01 saved!
[!] No more data! Scraping will stop now.
found 0 deleted tweets in this search.
2020-06-01 saved!
[!] No more data! Scraping will stop now.
found 0 deleted tweets in this search.
2020-07-01 saved!
[!] No more data! Scraping will stop now.
found 0 deleted tweets in this search.
2020-08-01 saved!
[!] No more data! Scraping will stop now.
found 0 deleted tweets in this search.
2020-09-01 saved!
[!] No more data! Scraping will stop now.
found 0 deleted tweets in this search.
2020-10-01 saved!
[!] No mor

Unnamed: 0,id,conversation_id,created_at,date,timezone,place,tweet,language,hashtags,cashtags,...,geo,source,user_rt_id,user_rt,retweet_id,reply_to,retweet_date,translate,trans_src,trans_dest
0,1230529563098304513,1230529563098304513,1.582216e+12,2020-02-20 11:28:04,-0500,,"During a live conference regarding COVID-19, @...",en,[covid2019],[],...,,,,,,[],,,,
1,1244475262835851264,1244475262835851264,1.585541e+12,2020-03-30 00:03:18,-0500,,if Shuri from Black Panther was real we would’...,en,[],[],...,,,,,,[],,,,
2,1244177043761356801,1244170766976978945,1.585470e+12,2020-03-29 04:18:17,-0500,,"@MalcolmNance @RealCandaceO How about this, Ca...",en,[],[],...,,,,,,"[{'screen_name': 'MalcolmNance', 'name': 'Malc...",,,,
3,1243921301347143682,1243921301347143682,1.585409e+12,2020-03-28 11:22:03,-0500,,@realDonaldTrump please don’t allow relaxed EP...,en,[],[],...,,,,,,[],,,,
4,1243158889211797504,1243158889211797504,1.585227e+12,2020-03-26 08:52:30,-0500,,Let’s go #Pittsburgh! — Researchers in Pittsbu...,en,[pittsburgh],[],...,,,,,,[],,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12871,1333600238989430786,1333600238989430786,1.606790e+12,2020-11-30 21:34:08,-0500,,My mother is discouraging me from getting a co...,en,[],[],...,,,,,,[],,,,
12872,1333584794698076160,1333584794698076160,1.606786e+12,2020-11-30 20:32:46,-0500,,Really apprehensive about the first round of #...,en,[covidvaccine],[],...,,,,,,[],,,,
12873,1333583283007926272,1333583283007926272,1.606786e+12,2020-11-30 20:26:46,-0500,,no COVID vaccine for whoever discontinued thes...,en,[],[],...,,,,,,[],,,,
12874,1333577970762526722,1333577970762526722,1.606785e+12,2020-11-30 20:05:39,-0500,,The impressive thing about the “COVID vaccine ...,en,[],[],...,,,,,,[],,,,


In [54]:
search = 'covid vaccine'
city = 'Los Angeles'
filename = 'no_limit_la'

In [55]:
%time df = search_loc_loop(search, city, filename)

[!] No more data! Scraping will stop now.
found 0 deleted tweets in this search.
2020-01-01 saved!
[!] No more data! Scraping will stop now.
found 0 deleted tweets in this search.
2020-02-01 saved!
[!] No more data! Scraping will stop now.
found 0 deleted tweets in this search.
2020-03-01 saved!
[!] No more data! Scraping will stop now.
found 0 deleted tweets in this search.
2020-04-01 saved!
[!] No more data! Scraping will stop now.
found 0 deleted tweets in this search.
2020-05-01 saved!
[!] No more data! Scraping will stop now.
found 0 deleted tweets in this search.
2020-06-01 saved!
[!] No more data! Scraping will stop now.
found 0 deleted tweets in this search.
2020-07-01 saved!
[!] No more data! Scraping will stop now.
found 0 deleted tweets in this search.
2020-08-01 saved!
[!] No more data! Scraping will stop now.
found 0 deleted tweets in this search.
2020-09-01 saved!
[!] No more data! Scraping will stop now.
found 0 deleted tweets in this search.
2020-10-01 saved!
[!] No mor

In [56]:
df = pd.read_csv('./data/no_limit_la.csv')

In [59]:
df.shape

(5672, 39)

In [60]:
search = 'covid vaccine'
city = 'Chicago'
filename = 'no_limit_chi'

In [61]:
%time df = search_loc_loop(search, city, filename)

[!] No more data! Scraping will stop now.
found 0 deleted tweets in this search.
2020-01-01 saved!
[!] No more data! Scraping will stop now.
found 0 deleted tweets in this search.
2020-02-01 saved!
[!] No more data! Scraping will stop now.
found 0 deleted tweets in this search.
2020-03-01 saved!
[!] No more data! Scraping will stop now.
found 0 deleted tweets in this search.
2020-04-01 saved!
[!] No more data! Scraping will stop now.
found 0 deleted tweets in this search.
2020-05-01 saved!
[!] No more data! Scraping will stop now.
found 0 deleted tweets in this search.
2020-06-01 saved!
[!] No more data! Scraping will stop now.
found 0 deleted tweets in this search.
2020-07-01 saved!
[!] No more data! Scraping will stop now.
found 0 deleted tweets in this search.
2020-08-01 saved!
[!] No more data! Scraping will stop now.
found 0 deleted tweets in this search.
2020-09-01 saved!
[!] No more data! Scraping will stop now.
found 0 deleted tweets in this search.
2020-10-01 saved!
[!] No mor

In [62]:
df = pd.read_csv('./data/no_limit_chi.csv')

In [63]:
df.shape

(3600, 39)

In [65]:
search = 'covid vaccine'
city = 'San Francisco'
filename = 'lim_5_sf'
limit = 5000

In [66]:
%time df = search_loc_loop(search, city, filename, limit)

[!] No more data! Scraping will stop now.
found 0 deleted tweets in this search.
2020-01-01 saved!
[!] No more data! Scraping will stop now.
found 0 deleted tweets in this search.
2020-02-01 saved!
[!] No more data! Scraping will stop now.
found 0 deleted tweets in this search.
2020-03-01 saved!
[!] No more data! Scraping will stop now.
found 0 deleted tweets in this search.
2020-04-01 saved!
[!] No more data! Scraping will stop now.
found 0 deleted tweets in this search.
2020-05-01 saved!
[!] No more data! Scraping will stop now.
found 0 deleted tweets in this search.
2020-06-01 saved!
[!] No more data! Scraping will stop now.
found 0 deleted tweets in this search.
2020-07-01 saved!
[!] No more data! Scraping will stop now.
found 0 deleted tweets in this search.
2020-08-01 saved!
[!] No more data! Scraping will stop now.
found 0 deleted tweets in this search.
2020-09-01 saved!
[!] No more data! Scraping will stop now.
found 0 deleted tweets in this search.
2020-10-01 saved!
[!] No mor

In [67]:
df = pd.read_csv('./data/lim_5_sf.csv')

In [68]:
df.shape

(2701, 39)

In [69]:
search = 'covid vaccine'
city = 'Phoenix'
filename = 'lim_5_phx'
limit = 5000

In [70]:
%time df = search_loc_loop(search, city, filename, limit)

[!] No more data! Scraping will stop now.
found 0 deleted tweets in this search.
2020-01-01 saved!
[!] No more data! Scraping will stop now.
found 0 deleted tweets in this search.
2020-02-01 saved!
[!] No more data! Scraping will stop now.
found 0 deleted tweets in this search.
2020-03-01 saved!
[!] No more data! Scraping will stop now.
found 0 deleted tweets in this search.
2020-04-01 saved!
[!] No more data! Scraping will stop now.
found 0 deleted tweets in this search.
2020-05-01 saved!
[!] No more data! Scraping will stop now.
found 0 deleted tweets in this search.
2020-06-01 saved!
[!] No more data! Scraping will stop now.
found 0 deleted tweets in this search.
2020-07-01 saved!
[!] No more data! Scraping will stop now.
found 0 deleted tweets in this search.
2020-08-01 saved!
[!] No more data! Scraping will stop now.
found 0 deleted tweets in this search.
2020-09-01 saved!
[!] No more data! Scraping will stop now.
found 0 deleted tweets in this search.
2020-10-01 saved!
[!] No mor

In [71]:
df = pd.read_csv('./data/lim_5_phx.csv')

In [72]:
df.shape

(1330, 39)
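
To compare the location-based runs at a glance, the row counts reported by `df.shape` above can be tallied in a few lines. The numbers are copied from the outputs in this notebook; the dictionary itself is just an illustrative summary, not something produced by `functions.py`.

```python
# Row counts taken from the df.shape outputs above.
city_rows = {
    "New York": 16847,       # ny_tweets.csv
    "Los Angeles": 5672,     # no_limit_la.csv
    "Chicago": 3600,         # no_limit_chi.csv
    "San Francisco": 2701,   # lim_5_sf.csv
    "Phoenix": 1330,         # lim_5_phx.csv
}

total = sum(city_rows.values())

# Print each city with its row count and share of the combined total.
for city, rows in sorted(city_rows.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{city:<14}{rows:>7,}{rows / total:>8.1%}")
print(f"{'Total':<14}{total:>7,}")
```

Even combined, the five city files hold far fewer rows than the 80,730 collected by the search without a location filter.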