# Twitter with twarc
A UCSB original Carpentry workshop

In [1]:
# we made this file for you
# ! twarc2 search "#gasprices" > raw_data/hashtag_gasprices.jsonl

! wc raw_data/hashtag_gasprices.jsonl

! twarc2 flatten raw_data/hashtag_gasprices.jsonl > output_data/hashtag_gasprices_flat.jsonl

! wc output_data/hashtag_gasprices_flat.jsonl

     108  3346644 36403969 raw_data/hashtag_gasprices.jsonl
   10787  5007559 67146087 output_data/hashtag_gasprices_flat.jsonl


In [2]:
# administrivia
# upon re-start we need to install twarc2 extensions
! pip install twarc-csv
! pip install emoji



# Episode 2
You should have a hashtag_gasprices.jsonl file

In [3]:
# BASH commands start with a BANG!
!twarc2 --help

Usage: twarc2 [OPTIONS] COMMAND [ARGS]...

  Collect data from the Twitter V2 API.

Options:
  --consumer-key TEXT         Twitter app consumer key (aka "App Key")
  --consumer-secret TEXT      Twitter app consumer secret (aka "App Secret")
  --access-token TEXT         Twitter app access token for user
                              authentication.
  --access-token-secret TEXT  Twitter app access token secret for user
                              authentication.
  --bearer-token TEXT         Twitter app access bearer token.
  --app-auth / --user-auth    Use application authentication or user
                              authentication. Some rate limits are higher with
                              user authentication, but not all endpoints are
                              supported.  [default: app-auth]
  -l, --log TEXT
  --verbose
  --metadata / --no-metadata  Include/don't include metadata about when and
                              how data was collected.  [default: metadata]
  

In [4]:
#  what libraries will we need to be loading in our notebook?
#  we need to always distinguish between 
#  running BASH vs. running a line of python.

import pandas
import twarc_csv
import textblob
import nltk
import os
import emoji

  from .autonotebook import tqdm as notebook_tqdm


In [5]:
# this comes into play for ep 8
!python -m textblob.download_corpora
nltk.download('stopwords')

[nltk_data] Downloading package brown to /home/jovyan/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /home/jovyan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package conll2000 to /home/jovyan/nltk_data...
[nltk_data]   Package conll2000 is already up-to-date!
[nltk_data] Downloading package movie_reviews to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!
Finished.


[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [6]:
# and of course, it's important to know where we are working
# I can send a BASH command from my notebook with a !:
!pwd

/home/jovyan/twarc_run


In [7]:
# you can also do this with Python
os.getcwd()

'/home/jovyan/twarc_run'

In [8]:
# we can change if we need
# os.chdir(".....")

In [9]:
os.getcwd()

'/home/jovyan/twarc_run'

## Running twarc
Let's get the timeline of one of twarc's creators.

In [10]:
# !twarc2 timeline BergisJules > raw_data/bjules.jsonl

### Challenge 1
- Can you find the file called “bjules_flat.jsonl”?
- How many tweets did you get from Bergis? (we can't tell without flattening or looking at the output)
- Download a timeline for a person of your choice. How many tweets did you get?
- What’s the oldest one?

In [11]:
# !twarc2 timeline ecodatasci > raw_data/ecodatasci.jsonl
! twarc2 flatten raw_data/ecodatasci.jsonl > output_data/ecodatasci_flat.jsonl
! wc output_data/ecodatasci_flat.jsonl

    473  205041 2876068 output_data/ecodatasci_flat.jsonl


A straight harvest using search also comes back in pages: wc counts 108 lines in
the raw gas prices file above but 10787 in the flattened one. Flatten before
using wc to count tweets.

## To flatten or not flatten

### Make your jsonl 1 tweet per line
Flattening will let you do our most basic unix-y analysis, turn
timelines into countable lists, and enable you to run twarc1
utilities later on in the workshop
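
To make the "1 tweet per line" idea concrete, here is a minimal Python sketch that counts tweets whether or not a file has been flattened. It assumes each unflattened line is one page of API results with its tweets listed under a "data" key, and that a flattened line is a single tweet object. The function name `count_tweets` and the sample records are made up for illustration. It is not how twarc2 flatten is implemented.

```python
import json

def count_tweets(jsonl_lines):
    """Count tweets in twarc2-style JSONL lines, flattened or not."""
    total = 0
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Assumption: a page keeps its tweets in a list under "data";
        # a flattened line has no such list and counts as one tweet.
        data = record.get("data", record)
        total += len(data) if isinstance(data, list) else 1
    return total

# Made-up sample: one page holding two tweets, then one page holding one.
sample = [
    json.dumps({"data": [{"id": "1", "text": "a"}, {"id": "2", "text": "b"}], "meta": {}}),
    json.dumps({"data": [{"id": "3", "text": "c"}], "meta": {}}),
]
print(count_tweets(sample))  # 3 tweets on 2 lines
```

On a real file you could pass an open file handle, for example `count_tweets(open("raw_data/bjules.jsonl"))`, and compare the result with what wc reports.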

In [12]:
# timeline objects need to be flattened in order to be analyzed as tweets
!twarc2 flatten raw_data/bjules.jsonl output_data/bjules_flat.jsonl

100%|██████████████| Processed 8.96M/8.96M of input file [00:00<00:00, 10.5MB/s]


## Convert to csv

In [13]:
!twarc2 csv raw_data/bjules.jsonl output_data/bjules.csv

100%|██████████████| Processed 8.96M/8.96M of input file [00:05<00:00, 1.65MB/s]

ℹ️
Parsed 3143 tweets objects from 33 lines in the input file.
Wrote 3143 rows and output 74 columns in the CSV.



## When we look at bjules, we really do need to flatten it.

In [14]:
! wc raw_data/bjules.jsonl

     33  845463 9393221 raw_data/bjules.jsonl


33 lines doesn't mean 33 tweets. I suspected there was more there because
I got an error message about hitting a limit of 3200. 

And below, the csv converter tells us there are 3143 tweets.

In [15]:
# convert
!twarc2 csv raw_data/bjules.jsonl output_data/bjules.csv

100%|██████████████| Processed 8.96M/8.96M of input file [00:02<00:00, 3.62MB/s]

ℹ️
Parsed 3143 tweets objects from 33 lines in the input file.
Wrote 3143 rows and output 74 columns in the CSV.



In [16]:
# once I flatten it, my wc will show the correct number
! wc output_data/bjules_flat.jsonl

    3143  1718186 23273101 output_data/bjules_flat.jsonl


In [17]:
# When I did this, I got 3143 tweets (as opposed to the 33 lines in the original file)
! wc output_data/bjules_flat.jsonl
! wc output_data/bjules.csv

    3143  1718186 23273101 output_data/bjules_flat.jsonl
    3144   579239 11547625 output_data/bjules.csv


The csv is 1 line longer because it has column headers.
twarc2 csv takes flat or unflattened Twitter data files.

### Challenge 2

In [18]:
# commented line is a solution to challenge 1
# !twarc2 timeline ecodatasci > raw_data/ecodatasci.jsonl

!twarc2 flatten raw_data/ecodatasci.jsonl > output_data/ecodatasci_flat.jsonl
!twarc2 csv output_data/ecodatasci_flat.jsonl > output_data/ecodatasci.csv 
ecodatasci_df = pandas.read_csv("output_data/ecodatasci.csv")


# Episode 3: examining tweets
What comes along with a tweet
- Look at one_tweet in Jupyter viewer
- Look at one_tweet with nano
- Look at tweet as csv
- Look at all the entities of a tweet

In [19]:
### Let's look at a single tweet as a csv:
!twarc2 flatten raw_data/one_tweet.jsonl output_data/one_tweet_flat.jsonl
!twarc2 csv output_data/one_tweet_flat.jsonl output_data/one_tweet.csv




100%|██████████████| Processed 4.63k/4.63k of input file [00:00<00:00, 13.7MB/s]
100%|███████████████| Processed 7.09k/7.09k of input file [00:00<00:00, 427kB/s]

ℹ️
Parsed 1 tweets objects from 1 lines in the input file.
Wrote 1 rows and output 74 columns in the CSV.



In [20]:
!head -n 2 'output_data/hashtag_gasprices_flat.jsonl' > 'output_data/4_tweets.jsonl'
!tail -n 2 'output_data/hashtag_gasprices_flat.jsonl' >> 'output_data/4_tweets.jsonl'

In [None]:
! cat output_data/4_tweets.jsonl
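
The output of cat is dense. A gentler way to see what comes along with a tweet is to list its top-level fields from Python. The sketch below is illustrative only: `describe_tweet` is a made-up helper and the sample line is a tiny stand-in for a real tweet, which carries many more fields.

```python
import json

def describe_tweet(line):
    """Map each top-level field of one flattened tweet to its JSON type."""
    tweet = json.loads(line)
    return {key: type(value).__name__ for key, value in tweet.items()}

# Made-up stand-in for one line of 4_tweets.jsonl.
sample_line = json.dumps({
    "id": "1",
    "text": "gas is expensive #gasprices",
    "public_metrics": {"retweet_count": 0, "like_count": 2},
    "entities": {"hashtags": [{"tag": "gasprices"}]},
})

for field, kind in describe_tweet(sample_line).items():
    print(f"{field}: {kind}")
```

To try it on the real data, read the first line of output_data/4_tweets.jsonl and pass it to `describe_tweet`.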

## Next harvest
Next we'll get just Bergis' original content. 
In other words, only the tweets that he wrote, not
any retweets or replies to other people's tweets.

Can we go back further on his timeline by looking
only for Bergis's original content?

Not really. It looks like the 3200 limit applies to the tweets searched,
not to the results returned.


In [22]:
!twarc2 timeline BergisJules --exclude-retweets --exclude-replies > raw_data/bjules_original.jsonl

API limit of 3200 reached:   0%|             | 47/17680 [00:00<03:44, 78.72it/s]


In [23]:
!twarc2 flatten raw_data/bjules_original.jsonl output_data/bjules_original_flat.jsonl


100%|████████████████| Processed 106k/106k of input file [00:00<00:00, 23.8MB/s]


But this does tell us that Jules is a prolific re-tweeter and/or replier. 

In [24]:
! wc output_data/bjules_original_flat.jsonl
! wc output_data/bjules_flat.jsonl

    47  16116 237268 output_data/bjules_original_flat.jsonl
    3143  1718186 23273101 output_data/bjules_flat.jsonl


In [25]:
# save it as a csv so we can easily see the original writings of Jules
!twarc2 csv output_data/bjules_original_flat.jsonl output_data/bjules_original.csv

100%|████████████████| Processed 232k/232k of input file [00:00<00:00, 2.80MB/s]

ℹ️
Parsed 47 tweets objects from 47 lines in the input file.
Wrote 47 rows and output 74 columns in the CSV.



# Episode 4

In [28]:
# fishing around for good searches
# you can count without harvesting.
# kittens is an evergreen search. you should always see at least
# dozens of mentions per hour
!twarc2 counts --text "kittens"

2022-05-16T02:58:30.000Z - 2022-05-16T03:00:00.000Z: 7
2022-05-16T03:00:00.000Z - 2022-05-16T04:00:00.000Z: 390
2022-05-16T04:00:00.000Z - 2022-05-16T05:00:00.000Z: 359
2022-05-16T05:00:00.000Z - 2022-05-16T06:00:00.000Z: 299
2022-05-16T06:00:00.000Z - 2022-05-16T07:00:00.000Z: 359
2022-05-16T07:00:00.000Z - 2022-05-16T08:00:00.000Z: 311
2022-05-16T08:00:00.000Z - 2022-05-16T09:00:00.000Z: 252
2022-05-16T09:00:00.000Z - 2022-05-16T10:00:00.000Z: 264
2022-05-16T10:00:00.000Z - 2022-05-16T11:00:00.000Z: 282
2022-05-16T11:00:00.000Z - 2022-05-16T12:00:00.000Z: 351
2022-05-16T12:00:00.000Z - 2022-05-16T13:00:00.000Z: 379
2022-05-16T13:00:00.000Z - 2022-05-16T14:00:00.000Z: 371
2022-05-16T14:00:00.000Z - 2022-05-16T15:00:00.000Z: 376
2022-05-16T15:00:00.000Z - 2022-05-16T16:00:00.000Z: 445
2022-05-16T16:00:00.000Z - 2022-05-16T17:00:00.000Z: 592
2022-05-16T17:00:00.000Z - 2022-05-16T18:00:00.000Z: 523
2022-05-16T18:00:00.000Z - 2022-05-16T19:00:00.000Z: 435
2022-05-16T19:00:00.000Z - 2022-0

In [29]:
# recent search with granularity.
!twarc2 counts --granularity "day" --text "(#UCSBLibrary OR UCSBLibrary OR ucsblibrary OR #ucsblibrary OR davidsonlibrary OR #davidsonlibrary)"

2022-05-16T02:58:32.000Z - 2022-05-17T00:00:00.000Z: 1
2022-05-17T00:00:00.000Z - 2022-05-18T00:00:00.000Z: 3
2022-05-18T00:00:00.000Z - 2022-05-19T00:00:00.000Z: 1
2022-05-19T00:00:00.000Z - 2022-05-20T00:00:00.000Z: 2
2022-05-20T00:00:00.000Z - 2022-05-21T00:00:00.000Z: 5
2022-05-21T00:00:00.000Z - 2022-05-22T00:00:00.000Z: 0
2022-05-22T00:00:00.000Z - 2022-05-23T00:00:00.000Z: 0
2022-05-23T00:00:00.000Z - 2022-05-23T02:58:32.000Z: 2
Total Tweets: 14


In [30]:
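# phrase searching: the first two commands fail.
# the inner double quotes close the outer ones, so the shell splits the
# query into several arguments; single quotes around the whole query should avoid that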
!twarc2 counts --granularity "day" --text "(#UCSB OR UCSB OR ucsb OR ("UC Santa Barbara"))"
!twarc2 counts --granularity "day" --text "("uc santa barbara")"
!twarc2 counts --granularity "day" --text "(#ucsb)"
!twarc2 counts --granularity "day" --text "(UCSB)"

Usage: twarc2 counts [OPTIONS] QUERY [OUTFILE]
Try 'twarc2 counts --help' for help.

Error: Got unexpected extra argument (Barbara)))
Usage: twarc2 counts [OPTIONS] QUERY [OUTFILE]
Try 'twarc2 counts --help' for help.

Error: Got unexpected extra argument (barbara))
2022-05-16T02:58:37.000Z - 2022-05-17T00:00:00.000Z: 10
2022-05-17T00:00:00.000Z - 2022-05-18T00:00:00.000Z: 14
2022-05-18T00:00:00.000Z - 2022-05-19T00:00:00.000Z: 16
2022-05-19T00:00:00.000Z - 2022-05-20T00:00:00.000Z: 11
2022-05-20T00:00:00.000Z - 2022-05-21T00:00:00.000Z: 14
2022-05-21T00:00:00.000Z - 2022-05-22T00:00:00.000Z: 7
2022-05-22T00:00:00.000Z - 2022-05-23T00:00:00.000Z: 3
2022-05-23T00:00:00.000Z - 2022-05-23T02:58:37.000Z: 0
Total Tweets: 75
2022-05-16T02:58:39.000Z - 2022-05-17T00:00:00.000Z: 226
2022-05-17T00:00:00.000Z - 2022-05-18T00:00:00.000Z: 261
2022-05-18T00:00:00.000Z - 2022-05-19T00:00:00.000Z: 386
2022-05-19T00:00:00.000Z - 2022-05-20T00:00:00.000Z: 232
2022-05-20T00:00:00.000Z - 2022-
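
The "unexpected extra argument" errors come from shell quoting rather than from twarc2 itself. As a rough illustration, Python's `shlex.split` follows POSIX shell splitting rules, so it can approximate how the shell behind `!` breaks up the command. The `fixed` string is an assumed alternative that wraps the query in single quotes. Whether the API then accepts the query is a separate question that this sketch does not test.

```python
import shlex

# The first command from the failing cell, exactly as typed.
broken = 'twarc2 counts --granularity "day" --text "(#UCSB OR UCSB OR ucsb OR ("UC Santa Barbara"))"'

# Assumed alternative: single quotes outside, double quotes around the phrase.
fixed = "twarc2 counts --granularity \"day\" --text '(#UCSB OR UCSB OR ucsb OR (\"UC Santa Barbara\"))'"

broken_args = shlex.split(broken)
fixed_args = shlex.split(fixed)

# Everything after --text: the broken form yields three pieces, the fixed form one.
print(broken_args[5:])
print(fixed_args[5:])
```

The last piece of the broken form is `Barbara))`, which is the argument named in the error message above.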

In [31]:
# search for hashtags when you really want hashtags. 
# a search for a plain string returns both text and hashtags (an OR)
# searches are NOT case sensitive
!twarc2 counts --granularity "day" --text "(#UCSB OR UCSB OR ucsb)"
!twarc2 counts --granularity "day" --text "(#ucsb)"
!twarc2 counts --granularity "day" --text "(UCSB)"

2022-05-16T02:58:41.000Z - 2022-05-17T00:00:00.000Z: 226
2022-05-17T00:00:00.000Z - 2022-05-18T00:00:00.000Z: 261
2022-05-18T00:00:00.000Z - 2022-05-19T00:00:00.000Z: 386
2022-05-19T00:00:00.000Z - 2022-05-20T00:00:00.000Z: 232
2022-05-20T00:00:00.000Z - 2022-05-21T00:00:00.000Z: 345
2022-05-21T00:00:00.000Z - 2022-05-22T00:00:00.000Z: 390
2022-05-22T00:00:00.000Z - 2022-05-23T00:00:00.000Z: 355
2022-05-23T00:00:00.000Z - 2022-05-23T02:58:41.000Z: 37
Total Tweets: 2,232
2022-05-16T02:58:42.000Z - 2022-05-17T00:00:00.000Z: 10
2022-05-17T00:00:00.000Z - 2022-05-18T00:00:00.000Z: 14
2022-05-18T00:00:00.000Z - 2022-05-19T00:00:00.000Z: 16
2022-05-19T00:00:00.000Z - 2022-05-20T00:00:00.000Z: 11
2022-05-20T00:00:00.000Z - 2022-05-21T00:00:00.000Z: 14
2022-05-21T00:00:00.000Z - 2022-05-22T00:00:00.000Z: 7
2022-05-22T00:00:00.000Z - 2022-05-23T00:00:00.000Z: 3
2022-05-23T00:00:00.000Z - 2022-05-23T02:58:42.000Z: 0
Total Tweets: 75
2022-05-16T02:58:44.000Z - 2022-05-17T00:

In [32]:
## Endpoints: counts
!twarc2 counts --text "(Poker OR poker)" --granularity "day"
!twarc2 counts --text "(Golf OR golf)" --granularity "day"
!twarc2 counts --text "(Basketball OR basketball)" --granularity "day"
!twarc2 counts --text "(Baseball OR baseball)" --granularity "day"
!twarc2 counts --text "(Football OR football)" --granularity "day"

2022-05-16T02:58:45.000Z - 2022-05-17T00:00:00.000Z: 12,720
2022-05-17T00:00:00.000Z - 2022-05-18T00:00:00.000Z: 12,368
2022-05-18T00:00:00.000Z - 2022-05-19T00:00:00.000Z: 14,896
2022-05-19T00:00:00.000Z - 2022-05-20T00:00:00.000Z: 19,216
2022-05-20T00:00:00.000Z - 2022-05-21T00:00:00.000Z: 16,261
2022-05-21T00:00:00.000Z - 2022-05-22T00:00:00.000Z: 17,750
2022-05-22T00:00:00.000Z - 2022-05-23T00:00:00.000Z: 13,033
2022-05-23T00:00:00.000Z - 2022-05-23T02:58:45.000Z: 1,242
Total Tweets: 107,486
2022-05-16T02:58:47.000Z - 2022-05-17T00:00:00.000Z: 39,730
2022-05-17T00:00:00.000Z - 2022-05-18T00:00:00.000Z: 66,141
2022-05-18T00:00:00.000Z - 2022-05-19T00:00:00.000Z: 70,817
2022-05-19T00:00:00.000Z - 2022-05-20T00:00:00.000Z: 56,048
2022-05-20T00:00:00.000Z - 2022-05-21T00:00:00.000Z: 49,114
2022-05-21T00:00:00.000Z - 2022-05-22T00:00:00.000Z: 43,242
2022-05-22T00:00:00.000Z - 2022-05-23T00:00:00.000Z: 47,119
2022-05-23T00:00:00.000Z - 2022-05-23T02:58:47.000Z: 6,314
Tot

In [33]:
## What's a lot?
!twarc2 counts --text "dog" --granularity "day"
!twarc2 counts --text "cat" --granularity "day"
!twarc2 counts --text "amazon" --granularity "day"
!twarc2 counts --text "right" --granularity "day"
!twarc2 counts --text "good" --granularity "day"


2022-05-16T02:58:54.000Z - 2022-05-17T00:00:00.000Z: 205,109
2022-05-17T00:00:00.000Z - 2022-05-18T00:00:00.000Z: 227,772
2022-05-18T00:00:00.000Z - 2022-05-19T00:00:00.000Z: 264,141
2022-05-19T00:00:00.000Z - 2022-05-20T00:00:00.000Z: 227,478
2022-05-20T00:00:00.000Z - 2022-05-21T00:00:00.000Z: 222,664
2022-05-21T00:00:00.000Z - 2022-05-22T00:00:00.000Z: 223,036
2022-05-22T00:00:00.000Z - 2022-05-23T00:00:00.000Z: 231,865
2022-05-23T00:00:00.000Z - 2022-05-23T02:58:54.000Z: 24,475
Total Tweets: 1,626,540
2022-05-16T02:58:56.000Z - 2022-05-17T00:00:00.000Z: 307,279
2022-05-17T00:00:00.000Z - 2022-05-18T00:00:00.000Z: 318,109
2022-05-18T00:00:00.000Z - 2022-05-19T00:00:00.000Z: 319,860
2022-05-19T00:00:00.000Z - 2022-05-20T00:00:00.000Z: 357,590
2022-05-20T00:00:00.000Z - 2022-05-21T00:00:00.000Z: 318,376
2022-05-21T00:00:00.000Z - 2022-05-22T00:00:00.000Z: 294,558
2022-05-22T00:00:00.000Z - 2022-05-23T00:00:00.000Z: 308,585
2022-05-23T00:00:00.000Z - 2022-05-23T02:58:56.000Z
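
To compare searches side by side, the text that counts prints can be totalled in Python. This is a minimal sketch that assumes every count line looks like `start - end: number`, as in the output above. `total_from_counts` is a made-up helper, and the sample is copied from the UCSB Library search earlier in this episode.

```python
import re

# One count line: "<start> - <end>: <number, possibly with commas>"
COUNT_LINE = re.compile(r"^(\S+) - (\S+): ([\d,]+)$")

def total_from_counts(output_text):
    """Add up the per-period counts in text printed by twarc2 counts."""
    total = 0
    for line in output_text.splitlines():
        match = COUNT_LINE.match(line.strip())
        if match:  # summary lines such as "Total Tweets: 14" are skipped
            total += int(match.group(3).replace(",", ""))
    return total

library_counts = """\
2022-05-16T02:58:32.000Z - 2022-05-17T00:00:00.000Z: 1
2022-05-17T00:00:00.000Z - 2022-05-18T00:00:00.000Z: 3
2022-05-18T00:00:00.000Z - 2022-05-19T00:00:00.000Z: 1
2022-05-19T00:00:00.000Z - 2022-05-20T00:00:00.000Z: 2
2022-05-20T00:00:00.000Z - 2022-05-21T00:00:00.000Z: 5
2022-05-21T00:00:00.000Z - 2022-05-22T00:00:00.000Z: 0
2022-05-22T00:00:00.000Z - 2022-05-23T00:00:00.000Z: 0
2022-05-23T00:00:00.000Z - 2022-05-23T02:58:32.000Z: 2
Total Tweets: 14
"""
print(total_from_counts(library_counts))  # 14, matching the printed total
```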

In [38]:
## a SFW timeline
# !twarc2 timeline ucsblibrary raw_data/library_timeline.jsonl
!twarc2 flatten raw_data/library_timeline.jsonl output_data/library_timeline_flat.jsonl
!twarc2 csv output_data/library_timeline_flat.jsonl output_data/library_timeline_flat.csv
library_timeline_df = pandas.read_csv("output_data/library_timeline_flat.csv")

100%|██████████████| Processed 5.31M/5.31M of input file [00:00<00:00, 11.4MB/s]
100%|██████████████| Processed 12.9M/12.9M of input file [00:02<00:00, 4.90MB/s]

ℹ️
Parsed 3226 tweets objects from 3226 lines in the input file.
Wrote 3226 rows and output 74 columns in the CSV.



In [39]:
# confirm the dataframe's existence
len(library_timeline_df)

3226

In [41]:
# and view all column headers
list(library_timeline_df.columns)

['id',
 'conversation_id',
 'referenced_tweets.replied_to.id',
 'referenced_tweets.retweeted.id',
 'referenced_tweets.quoted.id',
 'author_id',
 'in_reply_to_user_id',
 'retweeted_user_id',
 'quoted_user_id',
 'created_at',
 'text',
 'lang',
 'source',
 'public_metrics.like_count',
 'public_metrics.quote_count',
 'public_metrics.reply_count',
 'public_metrics.retweet_count',
 'reply_settings',
 'possibly_sensitive',
 'withheld.scope',
 'withheld.copyright',
 'withheld.country_codes',
 'entities.annotations',
 'entities.cashtags',
 'entities.hashtags',
 'entities.mentions',
 'entities.urls',
 'context_annotations',
 'attachments.media',
 'attachments.media_keys',
 'attachments.poll.duration_minutes',
 'attachments.poll.end_datetime',
 'attachments.poll.id',
 'attachments.poll.options',
 'attachments.poll.voting_status',
 'attachments.poll_ids',
 'author.id',
 'author.created_at',
 'author.username',
 'author.name',
 'author.description',
 'author.entities.description.cashtags',
 'author

### Converting to csv and dataframes

In [42]:
# this line is not running
# flatten it first?
# !twarc2 csv output_data/taxday_flat.jsonl output_data/taxday_flat.csv


## final challenge: Cats of Instagram
Let’s make a bigger datafile. Harvest 5000 tweets that use the hashtag “catsofinstagram” and put the dataset through the pipeline to answer the following questions:

- Did you get exactly 5000?
- How far back in time did you get?
- What is the most re-tweeted recent tweet on #catsofinstagram?
- Which person has the most number of followers in your dataset?
- Is it really a person?

In [43]:
# !twarc2 search --limit 5000 "#catsofinstagram" raw_data/hashtagcats.jsonl
!twarc2 flatten raw_data/hashtagcats.jsonl output_data/hashtagcats_flat.jsonl
!twarc2 csv raw_data/hashtagcats.jsonl > output_data/hashtagcats.csv
hashtagcats_df = pandas.read_csv("output_data/hashtagcats.csv")
! wc output_data/hashtagcats.csv
hashtagcats_df["created_at"].head()

100%|██████████████| Processed 16.0M/16.0M of input file [00:01<00:00, 12.0MB/s]
    5091  1103819 19588376 output_data/hashtagcats.csv


0    2022-05-20T23:01:16.000Z
1    2022-05-20T22:59:00.000Z
2    2022-05-20T22:58:14.000Z
3    2022-05-20T22:58:09.000Z
4    2022-05-20T22:57:09.000Z
Name: created_at, dtype: object

In [44]:
hashtagcats_df["created_at"].tail()

5085    2022-05-18T04:55:17.000Z
5086    2022-05-18T04:55:03.000Z
5087    2022-05-18T04:51:51.000Z
5088    2022-05-18T04:51:48.000Z
5089    2022-05-18T04:51:30.000Z
Name: created_at, dtype: object
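
To answer "How far back in time did you get?", the newest and oldest created_at values shown in the head and tail output can be subtracted. This sketch uses only the standard library. The format string is an assumption based on how the timestamps appear in the output, and `harvest_span` is a made-up helper name.

```python
from datetime import datetime

# Assumed layout of created_at, e.g. 2022-05-20T23:01:16.000Z
TWITTER_TIME = "%Y-%m-%dT%H:%M:%S.%fZ"

def harvest_span(newest, oldest):
    """Return the time between the newest and oldest tweet as a timedelta."""
    return datetime.strptime(newest, TWITTER_TIME) - datetime.strptime(oldest, TWITTER_TIME)

# First value from .head() and last value from .tail() above.
span = harvest_span("2022-05-20T23:01:16.000Z", "2022-05-18T04:51:30.000Z")
print(span)  # 2 days, 18:09:46
```

So this harvest of roughly 5000 tweets reaches back a little under three days.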

In [45]:
hashtagcats_df

Unnamed: 0,id,conversation_id,referenced_tweets.replied_to.id,referenced_tweets.retweeted.id,referenced_tweets.quoted.id,author_id,in_reply_to_user_id,retweeted_user_id,quoted_user_id,created_at,...,geo.geo.bbox,geo.geo.type,geo.id,geo.name,geo.place_id,geo.place_type,__twarc.retrieved_at,__twarc.url,__twarc.version,Unnamed: 73
0,1527786564398501895,1527786564398501895,,1.527394e+18,,225211845,,3.541330e+08,,2022-05-20T23:01:16.000Z,...,,,,,,,2022-05-20T23:02:13+00:00,https://api.twitter.com/2/tweets/search/recent...,2.10.4,
1,1527785992945422336,1527785992945422336,,1.527749e+18,,1508010079194214400,,3.541330e+08,,2022-05-20T22:59:00.000Z,...,,,,,,,2022-05-20T23:02:13+00:00,https://api.twitter.com/2/tweets/search/recent...,2.10.4,
2,1527785799588139008,1527785799588139008,,1.527700e+18,,1125800648191217667,,1.458759e+18,,2022-05-20T22:58:14.000Z,...,,,,,,,2022-05-20T23:02:13+00:00,https://api.twitter.com/2/tweets/search/recent...,2.10.4,
3,1527785778931281922,1527785778931281922,,,,947647325589131264,,,,2022-05-20T22:58:09.000Z,...,,,,,,,2022-05-20T23:02:13+00:00,https://api.twitter.com/2/tweets/search/recent...,2.10.4,
4,1527785528002695168,1527785528002695168,,1.527688e+18,,1508010079194214400,,3.541330e+08,,2022-05-20T22:57:09.000Z,...,,,,,,,2022-05-20T23:02:13+00:00,https://api.twitter.com/2/tweets/search/recent...,2.10.4,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5085,1526788490536669186,1526788490536669186,,1.526651e+18,,750594529397473280,,1.459073e+18,,2022-05-18T04:55:17.000Z,...,,,,,,,2022-05-20T23:02:57+00:00,https://api.twitter.com/2/tweets/search/recent...,2.10.4,
5086,1526788431589675008,1526788426330099712,1.526788e+18,,,1180818595,1.180819e+09,,,2022-05-18T04:55:03.000Z,...,"[-124.482003, 32.528832, -114.131212, 42.009519]",Feature,fbd6d2f5a4e4a15e,California,fbd6d2f5a4e4a15e,admin,2022-05-20T23:02:57+00:00,https://api.twitter.com/2/tweets/search/recent...,2.10.4,
5087,1526787626174496769,1526787626174496769,,1.523627e+18,,1724613062,,3.958303e+09,,2022-05-18T04:51:51.000Z,...,,,,,,,2022-05-20T23:02:57+00:00,https://api.twitter.com/2/tweets/search/recent...,2.10.4,
5088,1526787617068666880,1526787617068666880,,1.524457e+18,,2489060538,,1.495445e+18,,2022-05-18T04:51:48.000Z,...,,,,,,,2022-05-20T23:02:57+00:00,https://api.twitter.com/2/tweets/search/recent...,2.10.4,


In [46]:
list(hashtagcats_df.columns)

['id',
 'conversation_id',
 'referenced_tweets.replied_to.id',
 'referenced_tweets.retweeted.id',
 'referenced_tweets.quoted.id',
 'author_id',
 'in_reply_to_user_id',
 'retweeted_user_id',
 'quoted_user_id',
 'created_at',
 'text',
 'lang',
 'source',
 'public_metrics.like_count',
 'public_metrics.quote_count',
 'public_metrics.reply_count',
 'public_metrics.retweet_count',
 'reply_settings',
 'possibly_sensitive',
 'withheld.scope',
 'withheld.copyright',
 'withheld.country_codes',
 'entities.annotations',
 'entities.cashtags',
 'entities.hashtags',
 'entities.mentions',
 'entities.urls',
 'context_annotations',
 'attachments.media',
 'attachments.media_keys',
 'attachments.poll.duration_minutes',
 'attachments.poll.end_datetime',
 'attachments.poll.id',
 'attachments.poll.options',
 'attachments.poll.voting_status',
 'attachments.poll_ids',
 'author.id',
 'author.created_at',
 'author.username',
 'author.name',
 'author.description',
 'author.entities.description.cashtags',
 'author

In [47]:
# what dataframes do we have at this point?
%whos DataFrame

Variable              Type         Data/Info
--------------------------------------------
ecodatasci_df         DataFrame                          id <...>\n[473 rows x 74 columns]
hashtagcats_df        DataFrame                           id<...>n[5090 rows x 74 columns]
library_timeline_df   DataFrame                           id<...>n[3226 rows x 74 columns]


# Episode 5: Ethics & Twitter

In [49]:
# our first full-text analysis
# a list of words with TextBlob

# first we need to munge the data. remember from:
# list(library_timeline_df.columns)
# the tweet is library_timeline_df['text']

# TextBlob has its own data format.

# break the tweets' text column into a list, 
# then .join into one long string 
library_string = ' '.join(library_timeline_df['text'].tolist())
# turn the string into a blob
library_blob = textblob.TextBlob(library_string)


This produces a mess. 
Let's count the words and sort by their frequency of use:


In [50]:
library_freq = library_blob.word_counts
library_sorted_freq = sorted(library_freq.items(), 
                             key = lambda kv: kv[1], 
                             reverse = True)
# note: the slice starts at index 1, so the single most frequent word (index 0) is left out
library_sorted_freq[1:25]

[('https', 2282),
 ('to', 1646),
 ('of', 1523),
 ('and', 1244),
 ('ucsb', 1148),
 ('http', 1117),
 ('in', 1107),
 ('library', 1071),
 ('a', 1047),
 ('for', 1012),
 ('s', 745),
 ('on', 671),
 ('at', 634),
 ('you', 605),
 ('our', 547),
 ('we', 532),
 ('is', 499),
 ('from', 475),
 ('this', 453),
 ('ucsblibrary', 415),
 ('with', 391),
 ('by', 391),
 ('amp', 331),
 ('more', 327)]

We can at least take the English stopwords out, though this doesn't really produce anything much cleaner:

In [51]:
# load the stopwords to use
from nltk.corpus import stopwords
sw_nltk = stopwords.words('english')

In [52]:
# create a new text list that does
# NOT contain stopwords
library_str_stopped = [word for word in library_string.split() 
                       if word.lower() not in sw_nltk]
library_words_stopped = " ".join(library_str_stopped)

In [53]:
library_blob_stopped = textblob.TextBlob(library_words_stopped)
library_blob_stopped_freq = library_blob_stopped.word_counts
library_blob_stopped_sorted_freq = sorted(library_blob_stopped_freq.items(), 
                             key = lambda kv: kv[1], 
                             reverse = True)
library_blob_stopped_sorted_freq[1:50]

[('ucsb', 1148),
 ('http', 1117),
 ('library', 1071),
 ('s', 679),
 ('ucsblibrary', 415),
 ('amp', 331),
 ('new', 288),
 ('today', 250),
 ('us', 232),
 ('’', 228),
 ('book', 218),
 ('research', 212),
 ('here', 210),
 ('students', 195),
 ('reads', 189),
 ('open', 187),
 ('we', 178),
 ('check', 177),
 ('join', 164),
 ('week', 161),
 ('floor', 145),
 ('day', 143),
 ('free', 139),
 ('collections', 138),
 ('access', 136),
 ('art', 127),
 ('re', 126),
 ('special', 120),
 ('study', 119),
 ('learn', 116),
 ('collection', 112),
 ('first', 108),
 ('campus', 106),
 ('one', 103),
 ('info', 102),
 ('books', 102),
 ('available', 102),
 ('see', 100),
 ('uc', 99),
 ('come', 99),
 ('librarian', 98),
 ('community', 97),
 ('student', 97),
 ('talk', 96),
 ('exhibit', 95),
 ('hours', 93),
 ('more', 92),
 ('read', 90),
 ('science', 89)]

In [54]:
# a more meaningful segment
library_blob_stopped_sorted_freq[7:57]

[('new', 288),
 ('today', 250),
 ('us', 232),
 ('’', 228),
 ('book', 218),
 ('research', 212),
 ('here', 210),
 ('students', 195),
 ('reads', 189),
 ('open', 187),
 ('we', 178),
 ('check', 177),
 ('join', 164),
 ('week', 161),
 ('floor', 145),
 ('day', 143),
 ('free', 139),
 ('collections', 138),
 ('access', 136),
 ('art', 127),
 ('re', 126),
 ('special', 120),
 ('study', 119),
 ('learn', 116),
 ('collection', 112),
 ('first', 108),
 ('campus', 106),
 ('one', 103),
 ('info', 102),
 ('books', 102),
 ('available', 102),
 ('see', 100),
 ('uc', 99),
 ('come', 99),
 ('librarian', 98),
 ('community', 97),
 ('student', 97),
 ('talk', 96),
 ('exhibit', 95),
 ('hours', 93),
 ('more', 92),
 ('read', 90),
 ('science', 89),
 ('“', 87),
 ('”', 86),
 ('time', 86),
 ('barbara', 85),
 ('tbt', 84),
 ('santa', 83),
 ('year', 82)]

In [55]:
type(library_blob_stopped)

textblob.blob.TextBlob

In [56]:
# Challenge: for the Python whizzes. #FIXME
# do that in a tidy way?
# what do pandas pipes look like?
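
One possible tidier take on this challenge, offered as a sketch rather than the workshop's answer: `collections.Counter` does the counting and sorting in one step, and dropping URLs before counting keeps "https" and "http" out of the list. The `STOPWORDS` set here is a tiny hypothetical stand-in (the notebook uses NLTK's English list), and `top_words` and the sample tweets are made up.

```python
import re
from collections import Counter

# Hypothetical stand-in for stopwords.words('english') from NLTK.
STOPWORDS = {"the", "to", "of", "and", "in", "a", "for", "on", "at"}

def top_words(texts, stopwords=STOPWORDS, n=5):
    """Return the n most frequent non-stopword words across all texts."""
    counts = Counter()
    for text in texts:
        # Lowercase, then drop URLs so link fragments are not counted as words.
        text = re.sub(r"https?://\S+", " ", text.lower())
        words = re.findall(r"[a-z0-9']+", text)
        counts.update(word for word in words if word not in stopwords)
    return counts.most_common(n)

tweets = [
    "New exhibit at the Library https://t.co/abc",
    "The library is open today",
    "Join us at the library for a talk",
]
print(top_words(tweets, n=1))  # [('library', 3)]
```

With the real data, `top_words(library_timeline_df['text'], stopwords=set(sw_nltk), n=50)` would be the equivalent call.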

In [57]:
# broke broke broke
# we made the utils folder during setup
# https://github.com/DocNow/twarc/tree/main/utils

!python utils/wall.py output_data/hashtagcats_flat.jsonl > output_data/hashtagcats.html

Traceback (most recent call last):
  File "/home/jovyan/twarc_run/utils/wall.py", line 147, in <module>
    url = tweet["user"]["profile_image_url"]
KeyError: 'user'


In [58]:
!python utils/wall.py output_data/ecodatasci_flat.jsonl > output_data/ecodatasci_flat.html

Traceback (most recent call last):
  File "/home/jovyan/twarc_run/utils/wall.py", line 147, in <module>
    url = tweet["user"]["profile_image_url"]
KeyError: 'user'


In [59]:
# you have to flatten it first
!twarc2 flatten raw_data/hashtagcats.jsonl > output_data/hashtagcats_flat.jsonl
# !python utils/wall.py output_data/hashtagcats_flat.jsonl > output_data/hashtagcats.html

# or maybe not flatten.
# either way, I can't get it to run: the traceback shows wall.py
# looking for a "user" key that these tweets don't have
! python utils/wall.py raw_data/hashtagcats.jsonl > output_data/cat_wall.html

Traceback (most recent call last):
  File "/home/jovyan/twarc_run/utils/wall.py", line 147, in <module>
    url = tweet["user"]["profile_image_url"]
KeyError: 'user'


## Challenge: Insta-rrectionists

In [60]:
# this takes a very long time.
# !twarc2 hydrate raw_data/dehydratedCapitolRiotTweets.txt output_data/riots.jsonl

In [61]:
# how long is this?
riots_dehydrated_df = pandas.read_csv("raw_data/dehydratedCapitolRiotTweets.txt")
len(riots_dehydrated_df)

82308

# Episode 6: Search and Filter

In [66]:
# use Twitter advanced search syntax (everything in quotes!)
# to get tailored results
# !twarc2 search --limit 800 "(cute OR fluffy OR haircut) (#catsofinstagram) lang:en" raw_data/kittens.jsonl
!twarc2 csv raw_data/kittens.jsonl output_data/kittens.csv

100%|██████████████| Processed 2.59M/2.59M of input file [00:00<00:00, 3.47MB/s]

ℹ️
Parsed 899 tweets objects from 9 lines in the input file.
Wrote 899 rows and output 74 columns in the CSV.



In [67]:
kittens_df = pandas.read_csv("output_data/kittens.csv")

In [None]:
kittens_df

In [None]:
list(kittens_df.columns)

# Filter

In [68]:
!twarc2 stream-rules add "#WorldGothDay"

💣  DuplicateRule see: https://api.twitter.com/2/problems/duplicate-rules


In [69]:
!twarc2 stream-rules add "gothcats"

💣  DuplicateRule see: https://api.twitter.com/2/problems/duplicate-rules


In [None]:
# press the square to interrupt this!

In [None]:
!twarc2 stream > "raw_data/streamed_goth.jsonl"

Started a stream with rules:
☑  #WorldGothDay
☑  gothcats
Writing to <stdout>
CTRL+C to stop...


In [None]:
!wc raw_data/streamed_goth.jsonl

In [None]:
!twarc2 stream-rules delete "caturday"

# Retweets vs. tweets
How much original content is there?
Do this for both library timeline and catsofinstagrams

In [None]:
# via pandas and plotting
retweet_count = hashtagcats_df["referenced_tweets.retweeted.id"].value_counts()
sum(retweet_count)


In [None]:
(sum(retweet_count) / len(hashtagcats_df))

In [None]:
# 78% of the tweets that used #catsofinstagram were retweets.
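
The same share can be worked out without pandas, which makes the logic easy to see: a row is a retweet when its referenced_tweets.retweeted.id cell is not empty. The sketch below uses the standard csv module on a small made-up CSV. `retweet_share` is a hypothetical helper, and the column name is taken from the column list earlier in the notebook.

```python
import csv
import io

def retweet_share(csv_text, column="referenced_tweets.retweeted.id"):
    """Fraction of rows whose retweeted-id cell is filled in."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        return 0.0
    retweets = sum(1 for row in rows if row[column].strip())
    return retweets / len(rows)

# Made-up sample: three retweets and one original tweet.
sample_csv = """id,referenced_tweets.retweeted.id,text
1,100,RT one
2,,an original tweet
3,101,RT two
4,102,RT three
"""
print(retweet_share(sample_csv))  # 0.75
```

In pandas, `hashtagcats_df["referenced_tweets.retweeted.id"].notna().mean()` should give the same kind of figure in one line.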

In [None]:
# so our pipeline on a stream would look like:


# Episode 7: twarc plug-ins

In [None]:
# this reminds you what DataFrames you have in memory
%who DataFrame

In [None]:
!pip install twarc-hashtags

In [None]:
!pip install twarc-network

In [None]:
!twarc2 hashtags raw_data/hashtagcats.jsonl output_data/hashtagcats_hashtags.csv

In [None]:
# to print a file's contents in the cell, open and read it:
with open("output_data/hashtagcats_hashtags.csv") as f:
    print(f.read())

In [None]:
!twarc2 network raw_data/hashtagcats.jsonl output_data/hashtagcats_network.html

In [None]:
# ! twarc2 mentions ucsblibrary raw_data/ucsblibrary_mentions.jsonl
! twarc2 csv raw_data/ucsblibrary_mentions.jsonl output_data/ucsblibrary_mentions.csv 
ucsb_library_mentions_df = pandas.read_csv('output_data/ucsblibrary_mentions.csv')

In [None]:
# emojis for each of our datasets so far
# !python utils/emojis.py raw_data/hashtagcats.jsonl > output_data/hashtagcats_emojis.csv
!python utils/emojis.py raw_data/ucsblibrary_mentions.jsonl > output_data/ucsblibrary_mentions_emojis.csv

In [None]:
# new beginning
# can I get a fresh filter and run emojis?
! twarc2 search "masked OR #masked" raw_data/masked.jsonl

In [None]:
! twarc2 flatten raw_data/masked.jsonl output_data/masked_flat.jsonl 

In [None]:
!python utils/emojis.py output_data/masked_flat.jsonl > output_data/masked_emojis.csv

# Episode 8: Python text analysis

In [None]:
# which dataframes are still here?
%whos DataFrame
#hashtagcats_df.columns

### Sentiment Analysis
To do this, we need to do a little Python

TextBlob is a text processing library that does sentiment analysis. 
The sentiment property returns a namedtuple of the form Sentiment(polarity, subjectivity). The polarity score is a float within the range [-1.0, 1.0]. The subjectivity is a float within the range [0.0, 1.0] where 0.0 is very objective and 1.0 is very subjective.

Before we use TextBlob for sentiment analysis, we need to download
datasets of words and their associated weights. These are called *corpora*.

In [None]:
# commented out because I put it up in ep 2
# !python -m textblob.download_corpora

In [None]:
# TextBlob needs a string, so this won't work.
textblob.TextBlob(hashtagcats_df).sentiment

In [None]:
# even calling the column won't work:
textblob.TextBlob(hashtagcats_df['text']).sentiment

In [None]:
# break the tweets' text column into a list, then .join into one long string 
hashtagcats_list = ' '.join(hashtagcats_df['text'].tolist())
# turn the string into a blob
hashtagcats_blob = textblob.TextBlob(hashtagcats_list)
# get the sentiment
hashtagcats_blob.sentiment

The overall sentiment of the language of kitty twitter is rather positive.
And the tweets tend to be subjective.

In [None]:
# What do you think the sentiment of tax day might be?
# get the overall sentiment and see if it matches your prediction.

# Episode 9: Data Management

# Episode 10: Don't Map Twitter