# Twitter with twarc
A UCSB original Carpentry workshop

You should have a hashtag_gasprices.jsonl file. We made it for you using the code below.

In [1]:
# we made this file for you
# ! twarc2 search "#gasprices" > raw_data/hashtag_gasprices.jsonl




In [2]:
# administrivia
# upon re-start we need to install twarc2 and other extensions
! pip install twarc-csv
! pip install emoji



# Episode 2


In [3]:
# BASH commands start with a BANG!
!twarc2 --help

Usage: twarc2 [OPTIONS] COMMAND [ARGS]...

  Collect data from the Twitter V2 API.

Options:
  --consumer-key TEXT         Twitter app consumer key (aka "App Key")
  --consumer-secret TEXT      Twitter app consumer secret (aka "App Secret")
  --access-token TEXT         Twitter app access token for user
                              authentication.
  --access-token-secret TEXT  Twitter app access token secret for user
                              authentication.
  --bearer-token TEXT         Twitter app access bearer token.
  --app-auth / --user-auth    Use application authentication or user
                              authentication. Some rate limits are higher with
                              user authentication, but not all endpoints are
                              supported.  [default: app-auth]
  -l, --log TEXT
  --verbose
  --metadata / --no-metadata  Include/don't include metadata about when and
                              how data was collected.  [default: metadata]
  

In [4]:
#  what libraries will we need to be loading in our notebook?
#  we need to always distinguish between 
#  running BASH vs. running a line of python.

import pandas
import twarc_csv
import textblob
import nltk
import os
import emoji



In [5]:
# this comes into play for ep 8
!python -m textblob.download_corpora
nltk.download('stopwords')

[nltk_data] Downloading package brown to /home/jovyan/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /home/jovyan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package conll2000 to /home/jovyan/nltk_data...
[nltk_data]   Package conll2000 is already up-to-date!
[nltk_data] Downloading package movie_reviews to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!
Finished.


[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [6]:
# and of course, it's important to know where we are working
# I can send a BASH command from my notebook with a !:
!pwd

/home/jovyan/twarc_run


In [7]:
# you can also do this with Python
os.getcwd()

'/home/jovyan/twarc_run'

In [8]:
# we can change if we need
# os.chdir(".....")

In [9]:
os.getcwd()

'/home/jovyan/twarc_run'

## Running twarc
Let's get the timeline of one of twarc's creators.

In [10]:
# !twarc2 timeline BergisJules > raw_data/bjules.jsonl

### Challenge 1
- Can you find the file called “bjules_flat.jsonl”?
- How many tweets did you get from Bergis? (we can't tell without flattening or looking at the output)
- Download a timeline for a person of your choice. How many tweets did you get? 
- What’s the oldest one?

In [11]:
# !twarc2 timeline ecodatasci > raw_data/ecodatasci.jsonl
! twarc2 flatten raw_data/ecodatasci.jsonl > output_data/ecodatasci_flat.jsonl
! wc output_data/ecodatasci_flat.jsonl

    473  205041 2876068 output_data/ecodatasci_flat.jsonl


A straight harvest using search is not one tweet per line either, so it
also needs to be flattened before our most basic analysis: wc. The gas
prices file shows this in Episode 3 (108 lines raw, 10787 flattened).

## To flatten or not flatten

### Make your jsonl 1 tweet per line
Flattening will let you do our most basic unix-y analysis, turn
timelines into countable lists, and enable you to run twarc1
utilities later on in the workshop
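
To see why line counts and tweet counts differ, here is a minimal sketch of the idea behind flattening. It assumes each line of an unflattened file is one API response whose `data` key holds a list of tweets. The two sample lines are made up, and the helper names are hypothetical. The real `twarc2 flatten` does more than this, for example folding the expanded author and referenced-tweet details into every tweet.

```python
import json

# Two made-up "pages" shaped like lines of an unflattened file.
# Assumption: each line is one API response with a "data" list of tweets.
raw_lines = [
    json.dumps({"data": [{"id": "1", "text": "first"}, {"id": "2", "text": "second"}]}),
    json.dumps({"data": [{"id": "3", "text": "third"}]}),
]


def count_tweets(lines):
    """Count the tweets inside response pages without rewriting anything."""
    total = 0
    for line in lines:
        page = json.loads(line)
        total += len(page.get("data", []))
    return total


def flatten_lines(lines):
    """Yield one JSON string per tweet (a simplified stand-in for flatten)."""
    for line in lines:
        for tweet in json.loads(line).get("data", []):
            yield json.dumps(tweet)


flat_lines = list(flatten_lines(raw_lines))
print(len(raw_lines), "lines in the raw file")
print(count_tweets(raw_lines), "tweets inside those lines")
print(len(flat_lines), "lines after flattening")
```

Once every tweet sits on its own line, `wc` and the tweet count agree.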

In [12]:
# timeline objects need to be flattened in order to be analyzed as tweets
!twarc2 flatten raw_data/bjules.jsonl output_data/bjules_flat.jsonl

100%|██████████████| Processed 8.96M/8.96M of input file [00:00<00:00, 10.4MB/s]


## Convert to csv

In [13]:
!twarc2 csv raw_data/bjules.jsonl output_data/bjules.csv

100%|██████████████| Processed 8.96M/8.96M of input file [00:05<00:00, 1.70MB/s]

ℹ️
Parsed 3143 tweets objects from 33 lines in the input file.
Wrote 3143 rows and output 74 columns in the CSV.



## When we look at bjules, we really do need to flatten it.

In [14]:
! wc raw_data/bjules.jsonl

     33  845463 9393221 raw_data/bjules.jsonl


33 lines doesn't mean 33 tweets. I suspected there were more because
I got an error message about hitting a limit of 3200. 

And below, the csv converter tells us there are 3143 tweets.

In [15]:
# convert
!twarc2 csv raw_data/bjules.jsonl output_data/bjules.csv

100%|██████████████| Processed 8.96M/8.96M of input file [00:02<00:00, 3.73MB/s]

ℹ️
Parsed 3143 tweets objects from 33 lines in the input file.
Wrote 3143 rows and output 74 columns in the CSV.



In [16]:
# once I flatten it, my wc will show the correct number
! wc output_data/bjules_flat.jsonl

    3143  1718186 23273101 output_data/bjules_flat.jsonl


In [17]:
# After flattening there are 3143 tweets (as opposed to the 33 lines in the original file)
! wc output_data/bjules_flat.jsonl
! wc output_data/bjules.csv

    3143  1718186 23273101 output_data/bjules_flat.jsonl
    3144   579239 11547625 output_data/bjules.csv


The csv is 1 line longer because it has column headers.
twarc2 csv takes flat or unflattened Twitter data files.
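
A small sketch of the header effect, using a made-up three-row CSV held in memory rather than the workshop files: counting lines the way `wc` does gives one more than counting data rows.

```python
import csv
import io

# Made-up CSV text: one header line plus three data rows.
csv_text = "id,text\n1,first tweet\n2,second tweet\n3,third tweet\n"

# Count lines, as wc -l would.
line_count = len(csv_text.splitlines())

# Count data rows; DictReader consumes the first line as the header.
row_count = sum(1 for _ in csv.DictReader(io.StringIO(csv_text)))

print(line_count, "lines,", row_count, "rows")
```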

### Challenge 2

In [18]:
# commented line is a solution to challenge 1
# !twarc2 timeline ecodatasci > raw_data/ecodatasci.jsonl

!twarc2 flatten raw_data/ecodatasci.jsonl > output_data/ecodatasci_flat.jsonl
!twarc2 csv output_data/ecodatasci_flat.jsonl > output_data/ecodatasci.csv 
ecodatasci_df = pandas.read_csv("output_data/ecodatasci.csv")


# Episode 3: examining tweets
What comes along with a tweet
- Look at one_tweet in Jupyter viewer
- Look at one_tweet with nano
- Look at tweet as csv
- Look at all the entities of a tweet

In [19]:
! wc raw_data/hashtag_gasprices.jsonl

! twarc2 flatten raw_data/hashtag_gasprices.jsonl > output_data/hashtag_gasprices_flat.jsonl

! wc output_data/hashtag_gasprices_flat.jsonl

     108  3346644 36403969 raw_data/hashtag_gasprices.jsonl
   10787  5007559 67146087 output_data/hashtag_gasprices_flat.jsonl


In [20]:
### Let's look at a single tweet as a csv:
!twarc2 flatten raw_data/one_tweet.jsonl output_data/one_tweet_flat.jsonl
!twarc2 csv output_data/one_tweet_flat.jsonl output_data/one_tweet.csv




100%|███████████████| Processed 4.63k/4.63k of input file [00:00<00:00, 412kB/s]
100%|███████████████| Processed 7.09k/7.09k of input file [00:00<00:00, 417kB/s]

ℹ️
Parsed 1 tweets objects from 1 lines in the input file.
Wrote 1 rows and output 74 columns in the CSV.



In [21]:
!head -n 2 'output_data/hashtag_gasprices_flat.jsonl' > 'output_data/4_tweets.jsonl'
!tail -n 2 'output_data/hashtag_gasprices_flat.jsonl' >> 'output_data/4_tweets.jsonl'

In [22]:
! cat output_data/4_tweets.jsonl

{"public_metrics": {"retweet_count": 8, "reply_count": 0, "like_count": 0, "quote_count": 0}, "possibly_sensitive": false, "lang": "en", "created_at": "2022-05-22T23:51:56.000Z", "source": "Twitter for Android", "id": "1528524091250163712", "conversation_id": "1528524091250163712", "text": "RT @Chromosome_XY_: Biden admits here in South Korea that the gas price hike were planned to get puerile to go to electric! Not #Putin Not\u2026", "reply_settings": "everyone", "referenced_tweets": [{"type": "retweeted", "id": "1528510638745702400", "public_metrics": {"retweet_count": 8, "reply_count": 0, "like_count": 9, "quote_count": 0}, "possibly_sensitive": false, "lang": "en", "created_at": "2022-05-22T22:58:29.000Z", "source": "Twitter for Android", "entities": {"urls": [{"start": 201, "end": 224, "url": "https://t.co/40cuYQG6yK", "expanded_url": "https://twitter.com/The_FJC/status/1528475088676302848", "display_url": "twitter.com/The_FJC/status\u2026"}], "hashtags": [{"start": 108, "end": 11

## Next harvest
Next we'll get just Bergis's original content. 
In other words, only the tweets that he wrote, not
any retweets or replies to other people's tweets.

Can we go back further on his timeline by looking
only for Bergis's original content?

Not really. It looks like the limit applies to the search,
not the results. 


In [23]:
# !twarc2 timeline BergisJules --exclude-retweets --exclude-replies > raw_data/bjules_original.jsonl

In [24]:
!twarc2 flatten raw_data/bjules_original.jsonl output_data/bjules_original_flat.jsonl


100%|████████████████| Processed 106k/106k of input file [00:00<00:00, 2.29MB/s]


But this does tell us that Jules is a prolific re-tweeter and/or replier. 

In [25]:
! wc output_data/bjules_original_flat.jsonl
! wc output_data/bjules_flat.jsonl

    46  15967 234593 output_data/bjules_original_flat.jsonl
    3143  1718186 23273101 output_data/bjules_flat.jsonl


In [26]:
# save it as a csv so we can easily see the original writings of Jules
!twarc2 csv output_data/bjules_original_flat.jsonl output_data/bjules_original.csv

100%|████████████████| Processed 229k/229k of input file [00:00<00:00, 7.60MB/s]

ℹ️
Parsed 46 tweets objects from 46 lines in the input file.
Wrote 46 rows and output 74 columns in the CSV.



# Episode 4

In [27]:
# fishing around for good searches
# you can count without harvesting.
# kittens is an evergreen search. you should always see at least
# dozens of mentions per hour
!twarc2 counts --text "kittens"

2022-05-18T00:00:31.000Z - 2022-05-18T01:00:00.000Z: 397
2022-05-18T01:00:00.000Z - 2022-05-18T02:00:00.000Z: 375
2022-05-18T02:00:00.000Z - 2022-05-18T03:00:00.000Z: 410
2022-05-18T03:00:00.000Z - 2022-05-18T04:00:00.000Z: 535
2022-05-18T04:00:00.000Z - 2022-05-18T05:00:00.000Z: 341
2022-05-18T05:00:00.000Z - 2022-05-18T06:00:00.000Z: 295
2022-05-18T06:00:00.000Z - 2022-05-18T07:00:00.000Z: 286
2022-05-18T07:00:00.000Z - 2022-05-18T08:00:00.000Z: 242
2022-05-18T08:00:00.000Z - 2022-05-18T09:00:00.000Z: 263
2022-05-18T09:00:00.000Z - 2022-05-18T10:00:00.000Z: 282
2022-05-18T10:00:00.000Z - 2022-05-18T11:00:00.000Z: 260
2022-05-18T11:00:00.000Z - 2022-05-18T12:00:00.000Z: 573
2022-05-18T12:00:00.000Z - 2022-05-18T13:00:00.000Z: 466
2022-05-18T13:00:00.000Z - 2022-05-18T14:00:00.000Z: 605
2022-05-18T14:00:00.000Z - 2022-05-18T15:00:00.000Z: 451
2022-05-18T15:00:00.000Z - 2022-05-18T16:00:00.000Z: 536
2022-05-18T16:00:00.000Z - 2022-05-18T17:00:00.000Z: 546
2022-05-18T17:00:00.000Z - 2022
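
The text that `twarc2 counts --text` prints is regular enough to total or rank in Python. This is a rough sketch: `parse_counts` is a hypothetical helper, and the sample is just the first three hourly lines shown above.

```python
# First three hourly lines copied from the output above.
sample_output = """\
2022-05-18T00:00:31.000Z - 2022-05-18T01:00:00.000Z: 397
2022-05-18T01:00:00.000Z - 2022-05-18T02:00:00.000Z: 375
2022-05-18T02:00:00.000Z - 2022-05-18T03:00:00.000Z: 410
"""


def parse_counts(text):
    """Return (start, end, count) tuples from 'start - end: count' lines."""
    rows = []
    for line in text.splitlines():
        # Skip anything that is not a time-window line (e.g. the total).
        if " - " not in line or ": " not in line:
            continue
        window, _, count = line.rpartition(": ")
        start, _, end = window.partition(" - ")
        # Larger counts are printed with thousands separators.
        rows.append((start, end, int(count.replace(",", ""))))
    return rows


rows = parse_counts(sample_output)
total = sum(count for _, _, count in rows)
busiest = max(rows, key=lambda row: row[2])

print("total:", total)
print("busiest window starts at", busiest[0], "with", busiest[2])
```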

In [28]:
# recent search with granularity.
!twarc2 counts --granularity "day" --text "(#UCSBLibrary OR UCSBLibrary OR ucsblibrary OR #ucsblibrary OR davidsonlibrary OR #davidsonlibrary)"

2022-05-18T00:00:33.000Z - 2022-05-19T00:00:00.000Z: 1
2022-05-19T00:00:00.000Z - 2022-05-20T00:00:00.000Z: 2
2022-05-20T00:00:00.000Z - 2022-05-21T00:00:00.000Z: 5
2022-05-21T00:00:00.000Z - 2022-05-22T00:00:00.000Z: 0
2022-05-22T00:00:00.000Z - 2022-05-23T00:00:00.000Z: 0
2022-05-23T00:00:00.000Z - 2022-05-24T00:00:00.000Z: 3
2022-05-24T00:00:00.000Z - 2022-05-25T00:00:00.000Z: 3
2022-05-25T00:00:00.000Z - 2022-05-25T00:00:33.000Z: 0
Total Tweets: 14


In [29]:
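# phrase searching
# the first two commands fail: the inner double quotes close the
# outer ones, so the shell splits the phrase into extra arguments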
!twarc2 counts --granularity "day" --text "(#UCSB OR UCSB OR ucsb OR ("UC Santa Barbara"))"
!twarc2 counts --granularity "day" --text "("uc santa barbara")"
!twarc2 counts --granularity "day" --text "(#ucsb)"
!twarc2 counts --granularity "day" --text "(UCSB)"

Usage: twarc2 counts [OPTIONS] QUERY [OUTFILE]
Try 'twarc2 counts --help' for help.

Error: Got unexpected extra argument (Barbara)))
Usage: twarc2 counts [OPTIONS] QUERY [OUTFILE]
Try 'twarc2 counts --help' for help.

Error: Got unexpected extra argument (barbara))
2022-05-18T00:00:37.000Z - 2022-05-19T00:00:00.000Z: 16
2022-05-19T00:00:00.000Z - 2022-05-20T00:00:00.000Z: 11
2022-05-20T00:00:00.000Z - 2022-05-21T00:00:00.000Z: 14
2022-05-21T00:00:00.000Z - 2022-05-22T00:00:00.000Z: 7
2022-05-22T00:00:00.000Z - 2022-05-23T00:00:00.000Z: 3
2022-05-23T00:00:00.000Z - 2022-05-24T00:00:00.000Z: 14
2022-05-24T00:00:00.000Z - 2022-05-25T00:00:00.000Z: 16
2022-05-25T00:00:00.000Z - 2022-05-25T00:00:37.000Z: 0
Total Tweets: 81
2022-05-18T00:00:39.000Z - 2022-05-19T00:00:00.000Z: 386
2022-05-19T00:00:00.000Z - 2022-05-20T00:00:00.000Z: 232
2022-05-20T00:00:00.000Z - 2022-05-21T00:00:00.000Z: 342
2022-05-21T00:00:00.000Z - 2022-05-22T00:00:00.000Z: 385
2022-05-22T00:00:00.000Z - 2022-
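
The "unexpected extra argument" errors come from shell quoting, not from twarc2. Python's `shlex.split` approximates how a POSIX shell splits a command line into arguments, so it can show what twarc2 actually received. The single-quoted variant is only one possible fix, and this sketch checks the splitting alone, not what the API returns for the query.

```python
import shlex

# The failing command: double quotes nested inside double quotes.
broken = '''twarc2 counts --granularity "day" --text "(#UCSB OR UCSB OR ucsb OR ("UC Santa Barbara"))"'''

# One possible fix: single quotes outside, double quotes around the phrase.
fixed = """twarc2 counts --granularity "day" --text '(#UCSB OR UCSB OR ucsb OR ("UC Santa Barbara"))'"""

broken_tokens = shlex.split(broken)
fixed_tokens = shlex.split(fixed)

# The query is cut at the inner quote; "Santa" and "Barbara))" become
# separate arguments, which matches the error message above.
print(broken_tokens[5:])

# The whole query, inner quotes included, arrives as one argument.
print(fixed_tokens[5:])
```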

In [30]:
# search for hashtags when you really want hashtags. 
# searching for a plain string matches both text and hashtags (an OR)
# NOT case sensitive
!twarc2 counts --granularity "day" --text "(#UCSB OR UCSB OR ucsb)"
!twarc2 counts --granularity "day" --text "(#ucsb)"
!twarc2 counts --granularity "day" --text "(UCSB)"

2022-05-18T00:00:41.000Z - 2022-05-19T00:00:00.000Z: 386
2022-05-19T00:00:00.000Z - 2022-05-20T00:00:00.000Z: 232
2022-05-20T00:00:00.000Z - 2022-05-21T00:00:00.000Z: 342
2022-05-21T00:00:00.000Z - 2022-05-22T00:00:00.000Z: 385
2022-05-22T00:00:00.000Z - 2022-05-23T00:00:00.000Z: 352
2022-05-23T00:00:00.000Z - 2022-05-24T00:00:00.000Z: 308
2022-05-24T00:00:00.000Z - 2022-05-25T00:00:00.000Z: 308
2022-05-25T00:00:00.000Z - 2022-05-25T00:00:41.000Z: 0
Total Tweets: 2,313
2022-05-18T00:00:42.000Z - 2022-05-19T00:00:00.000Z: 16
2022-05-19T00:00:00.000Z - 2022-05-20T00:00:00.000Z: 11
2022-05-20T00:00:00.000Z - 2022-05-21T00:00:00.000Z: 14
2022-05-21T00:00:00.000Z - 2022-05-22T00:00:00.000Z: 7
2022-05-22T00:00:00.000Z - 2022-05-23T00:00:00.000Z: 3
2022-05-23T00:00:00.000Z - 2022-05-24T00:00:00.000Z: 14
2022-05-24T00:00:00.000Z - 2022-05-25T00:00:00.000Z: 16
2022-05-25T00:00:00.000Z - 2022-05-25T00:00:42.000Z: 0
Total Tweets: 81
2022-05-18T00:00:44.000Z - 2022-05-19T00:0

In [31]:
## Endpoints: counts
!twarc2 counts --text "(Poker OR poker)" --granularity "day"
!twarc2 counts --text "(Golf OR golf)" --granularity "day"
!twarc2 counts --text "(Basketball OR basketball)" --granularity "day"
!twarc2 counts --text "(Baseball OR baseball)" --granularity "day"
!twarc2 counts --text "(Football OR football)" --granularity "day"

2022-05-18T00:00:45.000Z - 2022-05-19T00:00:00.000Z: 14,802
2022-05-19T00:00:00.000Z - 2022-05-20T00:00:00.000Z: 19,012
2022-05-20T00:00:00.000Z - 2022-05-21T00:00:00.000Z: 16,089
2022-05-21T00:00:00.000Z - 2022-05-22T00:00:00.000Z: 17,538
2022-05-22T00:00:00.000Z - 2022-05-23T00:00:00.000Z: 12,667
2022-05-23T00:00:00.000Z - 2022-05-24T00:00:00.000Z: 12,455
2022-05-24T00:00:00.000Z - 2022-05-25T00:00:00.000Z: 18,928
2022-05-25T00:00:00.000Z - 2022-05-25T00:00:45.000Z: 12
Total Tweets: 111,503
2022-05-18T00:00:47.000Z - 2022-05-19T00:00:00.000Z: 70,618
2022-05-19T00:00:00.000Z - 2022-05-20T00:00:00.000Z: 55,946
2022-05-20T00:00:00.000Z - 2022-05-21T00:00:00.000Z: 48,882
2022-05-21T00:00:00.000Z - 2022-05-22T00:00:00.000Z: 43,073
2022-05-22T00:00:00.000Z - 2022-05-23T00:00:00.000Z: 46,642
2022-05-23T00:00:00.000Z - 2022-05-24T00:00:00.000Z: 45,359
2022-05-24T00:00:00.000Z - 2022-05-25T00:00:00.000Z: 83,227
2022-05-25T00:00:00.000Z - 2022-05-25T00:00:47.000Z: 48
Total Twe

In [32]:
## What's a lot?
!twarc2 counts --text "dog" --granularity "day"
!twarc2 counts --text "cat" --granularity "day"
!twarc2 counts --text "amazon" --granularity "day"
!twarc2 counts --text "right" --granularity "day"
!twarc2 counts --text "good" --granularity "day"


2022-05-18T00:00:53.000Z - 2022-05-19T00:00:00.000Z: 263,194
2022-05-19T00:00:00.000Z - 2022-05-20T00:00:00.000Z: 226,446
2022-05-20T00:00:00.000Z - 2022-05-21T00:00:00.000Z: 217,322
2022-05-21T00:00:00.000Z - 2022-05-22T00:00:00.000Z: 220,597
2022-05-22T00:00:00.000Z - 2022-05-23T00:00:00.000Z: 227,639
2022-05-23T00:00:00.000Z - 2022-05-24T00:00:00.000Z: 212,016
2022-05-24T00:00:00.000Z - 2022-05-25T00:00:00.000Z: 220,806
2022-05-25T00:00:00.000Z - 2022-05-25T00:00:53.000Z: 138
Total Tweets: 1,588,158
2022-05-18T00:00:55.000Z - 2022-05-19T00:00:00.000Z: 318,374
2022-05-19T00:00:00.000Z - 2022-05-20T00:00:00.000Z: 355,955
2022-05-20T00:00:00.000Z - 2022-05-21T00:00:00.000Z: 315,459
2022-05-21T00:00:00.000Z - 2022-05-22T00:00:00.000Z: 288,102
2022-05-22T00:00:00.000Z - 2022-05-23T00:00:00.000Z: 299,991
2022-05-23T00:00:00.000Z - 2022-05-24T00:00:00.000Z: 306,746
2022-05-24T00:00:00.000Z - 2022-05-25T00:00:00.000Z: 333,709
2022-05-25T00:00:00.000Z - 2022-05-25T00:00:55.000Z: 1

In [33]:
## a SFW timeline
# !twarc2 timeline ucsblibrary raw_data/library_timeline.jsonl
!twarc2 flatten raw_data/library_timeline.jsonl output_data/library_timeline_flat.jsonl
!twarc2 csv output_data/library_timeline_flat.jsonl output_data/library_timeline_flat.csv
library_timeline_df = pandas.read_csv("output_data/library_timeline_flat.csv")

100%|██████████████| Processed 5.31M/5.31M of input file [00:00<00:00, 9.40MB/s]
100%|██████████████| Processed 12.9M/12.9M of input file [00:02<00:00, 4.82MB/s]

ℹ️
Parsed 3226 tweets objects from 3226 lines in the input file.
Wrote 3226 rows and output 74 columns in the CSV.



In [34]:
# confirm the dataframe's existence
len(library_timeline_df)

3226

In [35]:
# and view all column headers
list(library_timeline_df.columns)

['id',
 'conversation_id',
 'referenced_tweets.replied_to.id',
 'referenced_tweets.retweeted.id',
 'referenced_tweets.quoted.id',
 'author_id',
 'in_reply_to_user_id',
 'retweeted_user_id',
 'quoted_user_id',
 'created_at',
 'text',
 'lang',
 'source',
 'public_metrics.like_count',
 'public_metrics.quote_count',
 'public_metrics.reply_count',
 'public_metrics.retweet_count',
 'reply_settings',
 'possibly_sensitive',
 'withheld.scope',
 'withheld.copyright',
 'withheld.country_codes',
 'entities.annotations',
 'entities.cashtags',
 'entities.hashtags',
 'entities.mentions',
 'entities.urls',
 'context_annotations',
 'attachments.media',
 'attachments.media_keys',
 'attachments.poll.duration_minutes',
 'attachments.poll.end_datetime',
 'attachments.poll.id',
 'attachments.poll.options',
 'attachments.poll.voting_status',
 'attachments.poll_ids',
 'author.id',
 'author.created_at',
 'author.username',
 'author.name',
 'author.description',
 'author.entities.description.cashtags',
 'author

## final challenge: Cats of Instagram
Let’s make a bigger datafile. Harvest 5000 tweets that use the hashtag “catsofinstagram” and put the dataset through the pipeline to answer the following questions:

- Did you get exactly 5000?
- How far back in time did you get?
- What is the most re-tweeted recent tweet on #catsofinstagram?
- Which person has the most followers in your dataset?
- Is it really a person?

In [36]:
# !twarc2 search --limit 500 "#catsofinstagram" raw_data/hashtagcats.jsonl
!twarc2 flatten raw_data/hashtagcats.jsonl output_data/hashtagcats_flat.jsonl
!twarc2 csv raw_data/hashtagcats.jsonl > output_data/hashtagcats.csv
hashtagcats_df = pandas.read_csv("output_data/hashtagcats.csv")
! wc output_data/hashtagcats.csv
hashtagcats_df["created_at"].head()

100%|██████████████| Processed 1.84M/1.84M of input file [00:00<00:00, 8.78MB/s]
    600  118388 2181953 output_data/hashtagcats.csv


0    2022-05-24T00:17:07.000Z
1    2022-05-24T00:14:06.000Z
2    2022-05-24T00:14:06.000Z
3    2022-05-24T00:12:02.000Z
4    2022-05-24T00:11:58.000Z
Name: created_at, dtype: object

In [37]:
hashtagcats_df["created_at"].tail()

594    2022-05-23T15:27:12.000Z
595    2022-05-23T15:24:35.000Z
596    2022-05-23T15:22:56.000Z
597    2022-05-23T15:21:13.000Z
598    2022-05-23T15:20:22.000Z
Name: created_at, dtype: object
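
One way to answer "How far back in time did you get?" is to subtract the oldest timestamp from the newest. This sketch uses only the ten `created_at` values printed by `head()` and `tail()` above. Taking `min` and `max` avoids assuming the rows are sorted.

```python
from datetime import datetime

# created_at values copied from the head() and tail() output above.
created_at = [
    "2022-05-24T00:17:07.000Z",
    "2022-05-24T00:14:06.000Z",
    "2022-05-24T00:14:06.000Z",
    "2022-05-24T00:12:02.000Z",
    "2022-05-24T00:11:58.000Z",
    "2022-05-23T15:27:12.000Z",
    "2022-05-23T15:24:35.000Z",
    "2022-05-23T15:22:56.000Z",
    "2022-05-23T15:21:13.000Z",
    "2022-05-23T15:20:22.000Z",
]

# Parse the timestamp format used in the created_at column.
stamps = [datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ") for value in created_at]

newest = max(stamps)
oldest = min(stamps)
span = newest - oldest

print("newest:", newest)
print("oldest:", oldest)
print("span:  ", span)
```

For this harvest, the 599 rows cover a little under nine hours.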

In [38]:
hashtagcats_df

Unnamed: 0,id,conversation_id,referenced_tweets.replied_to.id,referenced_tweets.retweeted.id,referenced_tweets.quoted.id,author_id,in_reply_to_user_id,retweeted_user_id,quoted_user_id,created_at,...,geo.geo.bbox,geo.geo.type,geo.id,geo.name,geo.place_id,geo.place_type,__twarc.retrieved_at,__twarc.url,__twarc.version,Unnamed: 73
0,1528892818105815040,1528892818105815040,,1.517710e+18,,1128200401,,3.958303e+09,,2022-05-24T00:17:07.000Z,...,,,,,,,2022-05-24T00:20:37+00:00,https://api.twitter.com/2/tweets/search/recent...,2.10.4,
1,1528892058274091012,1528892058274091012,,,,749661007,,,,2022-05-24T00:14:06.000Z,...,,,,,,,2022-05-24T00:20:37+00:00,https://api.twitter.com/2/tweets/search/recent...,2.10.4,
2,1528892056524972033,1528892056524972033,,1.528889e+18,,1088162497725648908,,9.486312e+17,,2022-05-24T00:14:06.000Z,...,,,,,,,2022-05-24T00:20:37+00:00,https://api.twitter.com/2/tweets/search/recent...,2.10.4,
3,1528891537995796481,1528891537995796481,,,,444109960,,,,2022-05-24T00:12:02.000Z,...,,,,,,,2022-05-24T00:20:37+00:00,https://api.twitter.com/2/tweets/search/recent...,2.10.4,
4,1528891522271150081,1528891522271150081,,1.528845e+18,,1145235467601756161,,1.350456e+18,,2022-05-24T00:11:58.000Z,...,,,,,,,2022-05-24T00:20:37+00:00,https://api.twitter.com/2/tweets/search/recent...,2.10.4,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
594,1528759456552693760,1528759456552693760,,1.528738e+18,,774243018304057345,,1.479934e+18,,2022-05-23T15:27:12.000Z,...,,,,,,,2022-05-24T00:20:41+00:00,https://api.twitter.com/2/tweets/search/recent...,2.10.4,
595,1528758800408121344,1528758800408121344,,1.528635e+18,,911449457824108545,,1.251569e+18,,2022-05-23T15:24:35.000Z,...,,,,,,,2022-05-24T00:20:41+00:00,https://api.twitter.com/2/tweets/search/recent...,2.10.4,
596,1528758385394733059,1528758385394733059,,1.528529e+18,,1425297926981488643,,1.524979e+18,,2022-05-23T15:22:56.000Z,...,,,,,,,2022-05-24T00:20:41+00:00,https://api.twitter.com/2/tweets/search/recent...,2.10.4,
597,1528757951221096448,1528757951221096448,,,,722125004054388736,,,,2022-05-23T15:21:13.000Z,...,,,,,,,2022-05-24T00:20:41+00:00,https://api.twitter.com/2/tweets/search/recent...,2.10.4,


In [39]:
list(hashtagcats_df.columns)

['id',
 'conversation_id',
 'referenced_tweets.replied_to.id',
 'referenced_tweets.retweeted.id',
 'referenced_tweets.quoted.id',
 'author_id',
 'in_reply_to_user_id',
 'retweeted_user_id',
 'quoted_user_id',
 'created_at',
 'text',
 'lang',
 'source',
 'public_metrics.like_count',
 'public_metrics.quote_count',
 'public_metrics.reply_count',
 'public_metrics.retweet_count',
 'reply_settings',
 'possibly_sensitive',
 'withheld.scope',
 'withheld.copyright',
 'withheld.country_codes',
 'entities.annotations',
 'entities.cashtags',
 'entities.hashtags',
 'entities.mentions',
 'entities.urls',
 'context_annotations',
 'attachments.media',
 'attachments.media_keys',
 'attachments.poll.duration_minutes',
 'attachments.poll.end_datetime',
 'attachments.poll.id',
 'attachments.poll.options',
 'attachments.poll.voting_status',
 'attachments.poll_ids',
 'author.id',
 'author.created_at',
 'author.username',
 'author.name',
 'author.description',
 'author.entities.description.cashtags',
 'author

In [40]:
# what dataframes do we have at this point?
%whos DataFrame

Variable              Type         Data/Info
--------------------------------------------
ecodatasci_df         DataFrame                          id <...>\n[473 rows x 74 columns]
hashtagcats_df        DataFrame                          id <...>\n[599 rows x 74 columns]
library_timeline_df   DataFrame                           id<...>n[3226 rows x 74 columns]


# Episode 5: Ethics & Twitter

In [41]:
# what dataframes do we have?
%who DataFrame
# which column is the text?

ecodatasci_df	 hashtagcats_df	 library_timeline_df	 


In [42]:
# our first full-text analysis
# a list of words with TextBlob

# first we need to munge the data. remember from:
# list(library_timeline_df.columns)
# the tweet is library_timeline_df['text']

# TextBlob has its own data format.

# break the tweets' text column into a list, 
# then .join into one long string 
library_string = ' '.join(library_timeline_df['text'].tolist())
# turn the string into a blob
library_blob = textblob.TextBlob(library_string)


In [43]:
#This produces a mess when we output it. 
(library_blob)

TextBlob("We are pleased to announce University of California’s transformative open access agreement with the American Chemical Society (ACS).\nThrough this agreement, authors publishing will receive support for open access publication in ACS’ portfolio. \nLearn more:\nhttps://t.co/G5XMqLit1e UCSB Library and the Writing Program invite you to join us today, May 19th, at 4pm, for student-created pandemic stories in the Instruction &amp; Training room. \nHear stories about mental health struggles, finding new communities, and sensory deprivation.\nhttps://t.co/Ca9An4AMgg https://t.co/RlRkyhDAs0 Join us today, May 18th, in the Special Research Collections for The Cambodian Vintage Music Archive Project at 3:30pm. \nHear about CVMA's work to restore pre-Khmer Rouge era music and surviving families' rights.\nhttps://t.co/ije7hxDKBQ https://t.co/w3IvwBqAHG Join us today, May 10th, at 7:30pm in UCSB Campbell Hall for a free community talk by Ted Chiang, UCSB Reads 2022 author of Exhalation: S

In [44]:
# Let's count the words and sort by their frequency of use.
# note: the [1:25] slice below starts at index 1, so it skips the top entry
library_freq = library_blob.word_counts
library_sorted_freq = sorted(library_freq.items(), 
                             key = lambda kv: kv[1], 
                             reverse = True)
library_sorted_freq[1:25]

[('https', 2282),
 ('to', 1646),
 ('of', 1523),
 ('and', 1244),
 ('ucsb', 1148),
 ('http', 1117),
 ('in', 1107),
 ('library', 1071),
 ('a', 1047),
 ('for', 1012),
 ('s', 745),
 ('on', 671),
 ('at', 634),
 ('you', 605),
 ('our', 547),
 ('we', 532),
 ('is', 499),
 ('from', 475),
 ('this', 453),
 ('ucsblibrary', 415),
 ('with', 391),
 ('by', 391),
 ('amp', 331),
 ('more', 327)]

We can at least get the English stopwords out, but this didn't really produce anything cleaner:

In [45]:
# load the stopwords to use
from nltk.corpus import stopwords
sw_nltk = stopwords.words('english')

In [46]:
# create a new text list that does
# NOT contain stopwords
library_str_stopped = [word for word in library_string.split() 
                       if word.lower() not in sw_nltk]
library_words_stopped = " ".join(library_str_stopped)

In [47]:
library_blob_stopped = textblob.TextBlob(library_words_stopped)
library_blob_stopped_freq = library_blob_stopped.word_counts
library_blob_stopped_sorted_freq = sorted(library_blob_stopped_freq.items(), 
                             key = lambda kv: kv[1], 
                             reverse = True)
library_blob_stopped_sorted_freq[1:50]

[('ucsb', 1148),
 ('http', 1117),
 ('library', 1071),
 ('s', 679),
 ('ucsblibrary', 415),
 ('amp', 331),
 ('new', 288),
 ('today', 250),
 ('us', 232),
 ('’', 228),
 ('book', 218),
 ('research', 212),
 ('here', 210),
 ('students', 195),
 ('reads', 189),
 ('open', 187),
 ('we', 178),
 ('check', 177),
 ('join', 164),
 ('week', 161),
 ('floor', 145),
 ('day', 143),
 ('free', 139),
 ('collections', 138),
 ('access', 136),
 ('art', 127),
 ('re', 126),
 ('special', 120),
 ('study', 119),
 ('learn', 116),
 ('collection', 112),
 ('first', 108),
 ('campus', 106),
 ('one', 103),
 ('info', 102),
 ('books', 102),
 ('available', 102),
 ('see', 100),
 ('uc', 99),
 ('come', 99),
 ('librarian', 98),
 ('community', 97),
 ('student', 97),
 ('talk', 96),
 ('exhibit', 95),
 ('hours', 93),
 ('more', 92),
 ('read', 90),
 ('science', 89)]

In [48]:
# a more meaningful segment
library_blob_stopped_sorted_freq[7:57]

[('new', 288),
 ('today', 250),
 ('us', 232),
 ('’', 228),
 ('book', 218),
 ('research', 212),
 ('here', 210),
 ('students', 195),
 ('reads', 189),
 ('open', 187),
 ('we', 178),
 ('check', 177),
 ('join', 164),
 ('week', 161),
 ('floor', 145),
 ('day', 143),
 ('free', 139),
 ('collections', 138),
 ('access', 136),
 ('art', 127),
 ('re', 126),
 ('special', 120),
 ('study', 119),
 ('learn', 116),
 ('collection', 112),
 ('first', 108),
 ('campus', 106),
 ('one', 103),
 ('info', 102),
 ('books', 102),
 ('available', 102),
 ('see', 100),
 ('uc', 99),
 ('come', 99),
 ('librarian', 98),
 ('community', 97),
 ('student', 97),
 ('talk', 96),
 ('exhibit', 95),
 ('hours', 93),
 ('more', 92),
 ('read', 90),
 ('science', 89),
 ('“', 87),
 ('”', 86),
 ('time', 86),
 ('barbara', 85),
 ('tbt', 84),
 ('santa', 83),
 ('year', 82)]

In [49]:
# Challenge: for the Python wizzes. #FIXME
# do that in a tidy way?
# what do pandas pipes look like?
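
One possible answer to the challenge, as a sketch rather than the workshop's solution: chain small functions with `.pipe` so tokenizing, stopword removal, and counting read top to bottom. The three sample tweets, the tiny `STOPWORDS` set, and the helper names are all made up for illustration. In the notebook you would start from `library_timeline_df` and the NLTK stopword list instead.

```python
import pandas as pd

# Made-up sample standing in for the 'text' column of a timeline dataframe.
tweets_df = pd.DataFrame({"text": [
    "Join us today at the UCSB Library",
    "The library is open today",
    "Check out our new book exhibit at the library",
]})

# Tiny illustrative stopword set (the notebook uses NLTK's English list).
STOPWORDS = {"the", "at", "is", "us", "our", "out"}


def tokenize(df, column="text"):
    """Lowercase the text, split it into words, and return one word per row."""
    words = df[column].str.lower().str.findall(r"[a-z']+").explode()
    return words.rename("word")


def drop_stopwords(words, stopwords):
    """Keep only the words that are not in the stopword set."""
    return words[~words.isin(stopwords)]


word_counts = (
    tweets_df
    .pipe(tokenize)
    .pipe(drop_stopwords, stopwords=STOPWORDS)
    .value_counts()
)

print(word_counts.head())
```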

# Just the words, hold the gore

## Challenge: Insta-rrectionists

In [50]:
# how long is this?
riots_dehydrated_df = pandas.read_csv("raw_data/dehydratedCapitolRiotTweets.txt")
len(riots_dehydrated_df)

82308

In [51]:
# this takes a very long time.
# !twarc2 hydrate raw_data/dehydratedCapitolRiotTweets.txt raw_data/riots.jsonl

In [52]:
# regardless of how you slice this, it's about 
# 80 % of the content that is still there.

# these are very slow, so they are commented out
# ! twarc2 flatten raw_data/riots.jsonl output_data/riots_flat.jsonl
# ! twarc2 csv output_data/riots_flat.jsonl output_data/riots_flat.csv

! wc output_data/riots_flat.jsonl
! wc output_data/riots_flat.csv


    65554  31279893 553455111 output_data/riots_flat.jsonl
    65322  12010592 360330969 output_data/riots_flat.csv
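
The "about 80 %" figure can be checked from the two numbers already shown: the length of the dehydrated ID list and the line count of the flattened file. One caveat, stated as an assumption: if the ID file has no header row, `pandas.read_csv` treats the first ID as the header, so the true number of IDs would be one higher.

```python
# Numbers copied from the outputs above.
dehydrated_ids = 82308   # len(riots_dehydrated_df)
hydrated_tweets = 65554  # lines in output_data/riots_flat.jsonl

# Share of the original IDs that could still be hydrated.
retained_pct = 100 * hydrated_tweets / dehydrated_ids
missing = dehydrated_ids - hydrated_tweets

print(f"{retained_pct:.1f}% of the tweets are still available")
print(f"{missing} tweets could not be hydrated")
```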


In [53]:
! head -n 10000 output_data/riots_flat.jsonl > output_data/riots10k.jsonl
! twarc2 csv output_data/riots10k.jsonl > output_data/riots10k.csv

In [55]:
riots_df = pandas.read_csv("output_data/riots10k.csv", low_memory=False)

In [56]:
riots_df.shape

(9997, 74)

In [57]:
riots_df.columns

Index(['id', 'conversation_id', 'referenced_tweets.replied_to.id',
       'referenced_tweets.retweeted.id', 'referenced_tweets.quoted.id',
       'author_id', 'in_reply_to_user_id', 'retweeted_user_id',
       'quoted_user_id', 'created_at', 'text', 'lang', 'source',
       'public_metrics.like_count', 'public_metrics.quote_count',
       'public_metrics.reply_count', 'public_metrics.retweet_count',
       'reply_settings', 'possibly_sensitive', 'withheld.scope',
       'withheld.copyright', 'withheld.country_codes', 'entities.annotations',
       'entities.cashtags', 'entities.hashtags', 'entities.mentions',
       'entities.urls', 'context_annotations', 'attachments.media',
       'attachments.media_keys', 'attachments.poll.duration_minutes',
       'attachments.poll.end_datetime', 'attachments.poll.id',
       'attachments.poll.options', 'attachments.poll.voting_status',
       'attachments.poll_ids', 'author.id', 'author.created_at',
       'author.username', 'author.name', 'author

In [58]:
# count the users
users_df = riots_df.author_id.unique()
(users_df.shape)

(9160,)

In [59]:
users_quoted_df = riots_df.quoted_user_id.unique()
(users_quoted_df.shape)

(230,)
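
A caveat when counting this way: `.unique()` returns missing values as an entry of their own, while `.nunique()` leaves them out by default. If `quoted_user_id` is empty for tweets that quote nobody (an assumption for the riots data, though the other dataframes in this notebook show blanks in that column), the 230 above includes one slot for "missing". The toy dataframe below is made up to show the difference.

```python
import pandas as pd

# Made-up data: two of the four tweets do not quote anyone.
toy_df = pd.DataFrame({
    "author_id": [1, 2, 2, 3],
    "quoted_user_id": [10, None, None, 11],
})

# unique() keeps the missing value (NaN) as one of the results.
with_missing = len(toy_df["quoted_user_id"].unique())

# nunique() drops missing values by default.
without_missing = toy_df["quoted_user_id"].nunique()

print(with_missing, "distinct values counting NaN")
print(without_missing, "distinct quoted users")
```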

# Episode 6: Search and Filter

In [60]:
# use Twitter advanced search syntax (everything in quotes!)
# to get tailored results
# !twarc2 search --limit 800 "(cute OR fluffy OR haircut) (#catsofinstagram) lang:en" raw_data/kittens.jsonl
!twarc2 csv raw_data/kittens.jsonl output_data/kittens.csv

100%|██████████████| Processed 2.59M/2.59M of input file [00:00<00:00, 3.11MB/s]

ℹ️
Parsed 899 tweets objects from 9 lines in the input file.
Wrote 899 rows and output 74 columns in the CSV.



In [61]:
kittens_df = pandas.read_csv("output_data/kittens.csv")

In [62]:
kittens_df

Unnamed: 0,id,conversation_id,referenced_tweets.replied_to.id,referenced_tweets.retweeted.id,referenced_tweets.quoted.id,author_id,in_reply_to_user_id,retweeted_user_id,quoted_user_id,created_at,...,geo.geo.bbox,geo.geo.type,geo.id,geo.name,geo.place_id,geo.place_type,__twarc.retrieved_at,__twarc.url,__twarc.version,Unnamed: 73
0,1528574826507554816,1528574826507554816,,1.526889e+18,,1339747029451419648,,3.958303e+09,,2022-05-23T03:13:32.000Z,...,,,,,,,2022-05-23T03:17:48+00:00,https://api.twitter.com/2/tweets/search/recent...,2.10.4,
1,1528570565065379840,1528570565065379840,,1.526889e+18,,1052848710,,3.958303e+09,,2022-05-23T02:56:36.000Z,...,,,,,,,2022-05-23T03:17:48+00:00,https://api.twitter.com/2/tweets/search/recent...,2.10.4,
2,1528558400346259456,1528558400346259456,,1.528045e+18,,616608467,,1.212552e+18,,2022-05-23T02:08:16.000Z,...,,,,,,,2022-05-23T03:17:48+00:00,https://api.twitter.com/2/tweets/search/recent...,2.10.4,
3,1528547008151101440,1528547008151101440,,1.528454e+18,,1373558094693683204,,2.926437e+09,,2022-05-23T01:23:00.000Z,...,,,,,,,2022-05-23T03:17:48+00:00,https://api.twitter.com/2/tweets/search/recent...,2.10.4,
4,1528544995975757827,1528544995975757827,,1.528153e+18,,264383293,,2.826013e+09,,2022-05-23T01:15:00.000Z,...,,,,,,,2022-05-23T03:17:48+00:00,https://api.twitter.com/2/tweets/search/recent...,2.10.4,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
894,1526927292382969858,1526927292382969858,,1.526889e+18,,1507717680421298181,,3.958303e+09,,2022-05-18T14:06:50.000Z,...,,,,,,,2022-05-23T03:17:55+00:00,https://api.twitter.com/2/tweets/search/recent...,2.10.4,
895,1526927183905693696,1526927183905693696,,1.526889e+18,,862826271016931331,,3.958303e+09,,2022-05-18T14:06:24.000Z,...,,,,,,,2022-05-23T03:17:55+00:00,https://api.twitter.com/2/tweets/search/recent...,2.10.4,
896,1526926332516515841,1526926332516515841,,1.526889e+18,,373612360,,3.958303e+09,,2022-05-18T14:03:01.000Z,...,,,,,,,2022-05-23T03:17:55+00:00,https://api.twitter.com/2/tweets/search/recent...,2.10.4,
897,1526926256259973121,1526926256259973121,,1.526889e+18,,1232696931408924677,,3.958303e+09,,2022-05-18T14:02:43.000Z,...,,,,,,,2022-05-23T03:17:55+00:00,https://api.twitter.com/2/tweets/search/recent...,2.10.4,


In [63]:
list(kittens_df.columns)

['id',
 'conversation_id',
 'referenced_tweets.replied_to.id',
 'referenced_tweets.retweeted.id',
 'referenced_tweets.quoted.id',
 'author_id',
 'in_reply_to_user_id',
 'retweeted_user_id',
 'quoted_user_id',
 'created_at',
 'text',
 'lang',
 'source',
 'public_metrics.like_count',
 'public_metrics.quote_count',
 'public_metrics.reply_count',
 'public_metrics.retweet_count',
 'reply_settings',
 'possibly_sensitive',
 'withheld.scope',
 'withheld.copyright',
 'withheld.country_codes',
 'entities.annotations',
 'entities.cashtags',
 'entities.hashtags',
 'entities.mentions',
 'entities.urls',
 'context_annotations',
 'attachments.media',
 'attachments.media_keys',
 'attachments.poll.duration_minutes',
 'attachments.poll.end_datetime',
 'attachments.poll.id',
 'attachments.poll.options',
 'attachments.poll.voting_status',
 'attachments.poll_ids',
 'author.id',
 'author.created_at',
 'author.username',
 'author.name',
 'author.description',
 'author.entities.description.cashtags',
 'author

# Search

In [64]:
# ! twarc2 search "(to:tinycarebot)" raw_data/tinycarebot_mentions.jsonl

# Stream

In [65]:
!twarc2 stream-rules add "#WorldGothDay"

💣  DuplicateRule see: https://api.twitter.com/2/problems/duplicate-rules


In [66]:
!twarc2 stream-rules add "gothcats"

💣  DuplicateRule see: https://api.twitter.com/2/problems/duplicate-rules


In [67]:
# press the square (stop) button to interrupt this!

In [68]:
# !twarc2 stream > "raw_data/streamed_goth.jsonl"

In [69]:
! wc raw_data/streamed_goth.jsonl
! twarc2 flatten raw_data/streamed_goth.jsonl > output_data/streamed_goth_flat.jsonl
! wc output_data/streamed_goth_flat.jsonl 

    4  1277 20136 raw_data/streamed_goth.jsonl
    4  1582 24554 output_data/streamed_goth_flat.jsonl


In [70]:
!twarc2 stream-rules delete "caturday"

🙃  No rule could be found for "caturday"


# Retweets vs. tweets
How much original content is there?
Do this for both the library timeline and #catsofinstagram.

In [71]:
# via pandas and plotting
retweet_count = hashtagcats_df["referenced_tweets.retweeted.id"].value_counts()
sum(retweet_count)


413

In [72]:
(sum(retweet_count) / len(hashtagcats_df))

0.6894824707846411

About 69% of the tweets that used #catsofinstagram were retweets (413 of 599).
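
The ratio above works because `value_counts()` only counts rows where the retweeted id is filled in. A minimal sketch of a more direct route on made-up data (`toy_df` is hypothetical): flag the rows that have a retweeted id, then take the mean of the flags.

```python
import pandas

# Made-up data: five tweets, three of which are retweets.
toy_df = pandas.DataFrame({
    "referenced_tweets.retweeted.id": [1.5e18, None, 1.5e18, 1.6e18, None],
})

# True where the tweet references a retweeted tweet, False otherwise.
is_retweet = toy_df["referenced_tweets.retweeted.id"].notna()

retweets = int(is_retweet.sum())   # number of retweets
share = float(is_retweet.mean())   # fraction of all tweets that are retweets

print(retweets, round(share, 2))
```

The same two lines should work on any DataFrame made with `twarc2 csv`, since they only rely on that column being empty for original tweets.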

In [73]:
# so our pipeline on a stream would look like:
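
One possible ordering, as a sketch: the helper below only builds the command strings used earlier in this episode and does not run anything. The function name `stream_pipeline` and the file naming pattern are assumptions for illustration.

```python
def stream_pipeline(rule, name):
    """Return the notebook commands for one streaming run, in order."""
    raw = f"raw_data/{name}.jsonl"
    flat = f"output_data/{name}_flat.jsonl"
    return [
        f'!twarc2 stream-rules add "{rule}"',     # 1. tell Twitter what to send
        f"!twarc2 stream > {raw}",                # 2. collect (stop it by hand)
        f"!twarc2 flatten {raw} > {flat}",        # 3. one tweet per line
        f'!twarc2 stream-rules delete "{rule}"',  # 4. clean up the rule
    ]

steps = stream_pipeline("#WorldGothDay", "streamed_goth")
for step in steps:
    print(step)
```

Paste each printed line into its own cell to run it.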


# Episode 7: twarc plug-ins

In [74]:
# this reminds you what DataFrames you have in memory
%who DataFrame

ecodatasci_df	 hashtagcats_df	 kittens_df	 library_timeline_df	 riots_dehydrated_df	 riots_df	 


In [75]:
!pip install twarc-hashtags

Collecting twarc-hashtags
  Using cached twarc_hashtags-0.0.5-py3-none-any.whl
Installing collected packages: twarc-hashtags
Successfully installed twarc-hashtags-0.0.5


In [76]:
!pip install twarc-network

Collecting twarc-network
  Using cached twarc_network-0.1.1-py3-none-any.whl (8.2 kB)
Collecting networkx
  Using cached networkx-2.8.2-py3-none-any.whl (2.0 MB)
Collecting pydot
  Using cached pydot-1.4.2-py2.py3-none-any.whl (21 kB)
Installing collected packages: pydot, networkx, twarc-network
Successfully installed networkx-2.8.2 pydot-1.4.2 twarc-network-0.1.1


In [77]:
!twarc2 retweeted-by 1522543998996414464 > 'raw_data/tinycarebot_rtby.jsonl'
!twarc2 flatten raw_data/tinycarebot_rtby.jsonl > output_data/tinycarebot_rtby_flat.jsonl

In [78]:
!twarc2 hashtags raw_data/hashtagcats.jsonl output_data/hashtagcats_hashtags.csv

100%|██████████████| Processed 1.84M/1.84M of input file [00:00<00:00, 32.0MB/s]


In [79]:
# print the file to the cell (uncomment to run)
# print(open("output_data/hashtagcats_hashtags.csv").read())

In [80]:
!twarc2 hashtags raw_data/ucsblibrary_mentions.jsonl output_data/ucsblibrary_mentions_hashtags.csv

100%|██████████████| Processed 2.41M/2.41M of input file [00:00<00:00, 15.1MB/s]


In [81]:
!twarc2 network raw_data/hashtagcats.jsonl output_data/hashtagcats_network.html

In [82]:
# !twarc2 followers --limit 1 tinycarebot >  'raw_data/tcb_followers.jsonl'

In [83]:
!twarc2 flatten raw_data/tcb_followers.jsonl > output_data/tcb_followers_flat.jsonl

In [84]:
! wc output_data/tcb_followers_flat.jsonl

   1000   54050 1212647 output_data/tcb_followers_flat.jsonl


In [85]:
# tiny challenge
# ! twarc2 csv output_data/tcb_followers_flat.jsonl > output_data/tcb_followers.csv
# csv doesn't work on profiles.

In [86]:
# ! twarc2 mentions ucsblibrary raw_data/ucsblibrary_mentions.jsonl
! twarc2 csv raw_data/ucsblibrary_mentions.jsonl output_data/ucsblibrary_mentions.csv 
ucsb_library_mentions_df = pandas.read_csv('output_data/ucsblibrary_mentions.csv')

100%|██████████████| Processed 2.41M/2.41M of input file [00:00<00:00, 3.17MB/s]

ℹ️
Parsed 809 tweets objects from 9 lines in the input file.
Wrote 809 rows and output 74 columns in the CSV.



In [87]:
!twarc2 network raw_data/hashtag_gasprices.jsonl output_data/hashtag_gasprices_network.html

# Episode 8: Python text analysis

### Sentiment Analysis
Sentiment analysis requires a little Python.

TextBlob is a text processing library that does sentiment analysis.
The sentiment property returns a namedtuple of the form Sentiment(polarity, subjectivity). The polarity score is a float within the range [-1.0, 1.0], where -1.0 is most negative and 1.0 is most positive. The subjectivity is a float within the range [0.0, 1.0], where 0.0 is very objective and 1.0 is very subjective.

Before we use TextBlob for sentiment analysis, we need to download
datasets of words and their associated weights. These are called *corpora*.
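
To picture what "words and their associated weights" means, here is a toy sketch. It is a simplification for illustration, not TextBlob's actual algorithm or data: the lexicon, its weights, and the function `toy_sentiment` are all made up.

```python
# Toy lexicon: word -> (polarity, subjectivity).
# These weights are invented and are NOT TextBlob's real values.
TOY_LEXICON = {
    "cute": (0.5, 1.0),
    "fluffy": (0.25, 0.5),
    "expensive": (-0.5, 0.75),
}

def toy_sentiment(text):
    """Average the weights of the words that appear in the toy lexicon."""
    words = text.lower().split()
    hits = [TOY_LEXICON[word] for word in words if word in TOY_LEXICON]
    if not hits:
        return (0.0, 0.0)  # no known words: treat as neutral and objective
    polarity = sum(p for p, _ in hits) / len(hits)
    subjectivity = sum(s for _, s in hits) / len(hits)
    return (polarity, subjectivity)

cats = toy_sentiment("such a cute fluffy cat")
gas = toy_sentiment("gas is expensive today")
print(cats)
print(gas)
```

Words missing from the lexicon contribute nothing, which is one reason slang, hashtags, and emoji can be scored poorly by lexicon-based tools.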

In [88]:
# commented out because I put it up in ep 2
# !python -m textblob.download_corpora

In [89]:
# TextBlob needs a string, so this won't work.
# textblob.TextBlob(hashtagcats_df).sentiment

In [90]:
# even calling the column won't work:
# textblob.TextBlob(hashtagcats_df['text']).sentiment

In [91]:
# break the tweets' text column into a list, then .join into one long string
hashtagcats_list = ' '.join(hashtagcats_df['text'].tolist())
# turn the string into a blob
hashtagcats_blob = textblob.TextBlob(hashtagcats_list)
# get the sentiment
hashtagcats_blob.sentiment

Sentiment(polarity=0.17576840733247795, subjectivity=0.5069781628198716)
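
One caveat with the join in the cell above: `' '.join(...)` raises a TypeError if any tweet's text is missing. A minimal sketch on made-up data (`toy_df` is hypothetical) that drops missing values before joining:

```python
import pandas

# Made-up data: the second tweet has no text.
toy_df = pandas.DataFrame({"text": ["cute cat", None, "fluffy cat"]})

# Drop missing values and make sure everything is a string before joining.
texts = toy_df["text"].dropna().astype(str).tolist()
joined = " ".join(texts)

print(joined)
```

The resulting string can be passed to `textblob.TextBlob` in the same way as before.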

In [92]:
# what dataframes are still here?
%whos DataFrame


Variable                   Type         Data/Info
-------------------------------------------------
ecodatasci_df              DataFrame                          id <...>\n[473 rows x 74 columns]
hashtagcats_df             DataFrame                          id <...>\n[599 rows x 74 columns]
kittens_df                 DataFrame                          id <...>\n[899 rows x 74 columns]
library_timeline_df        DataFrame                           id<...>n[3226 rows x 74 columns]
riots_dehydrated_df        DataFrame           134686307243517952<...>n[82308 rows x 1 columns]
riots_df                   DataFrame                           id<...>n[9997 rows x 74 columns]
ucsb_library_mentions_df   DataFrame                          id <...>\n[809 rows x 74 columns]


The overall sentiment of kitty Twitter is mildly positive (polarity of about 0.18 on a scale from -1.0 to 1.0).
The tweets are also moderately subjective (subjectivity of about 0.51 on a scale from 0.0 to 1.0).

In [93]:
# What do you think the sentiment of gasprices might be?
# get the overall sentiment and see if it matches your prediction.

In [94]:
! twarc2 csv output_data/hashtag_gasprices_flat.jsonl > output_data/hashtag_gasprices_flat.csv
hashtag_gasprices_df = pandas.read_csv("output_data/hashtag_gasprices_flat.csv", low_memory=False)

In [95]:
gasprices_list = ' '.join(hashtag_gasprices_df['text'].tolist())
gasprices_blob = textblob.TextBlob(gasprices_list)
print("Hashtag Gas Prices: ") 
gasprices_blob.sentiment

Hashtag Gas Prices: 


Sentiment(polarity=0.0783091422742766, subjectivity=0.44952337845641754)

In [96]:
hashtagcats_list = ' '.join(hashtagcats_df['text'].tolist())
hashtagcats_blob = textblob.TextBlob(hashtagcats_list)
print("Hashtag Cats of Instagram: ") 
hashtagcats_blob.sentiment

Hashtag Cats of Instagram: 


Sentiment(polarity=0.17576840733247795, subjectivity=0.5069781628198716)

# Episode 9: Data Management

In [97]:
!twarc2 counts --granularity "day" --text "amazon"

2022-05-18T00:03:20.000Z - 2022-05-19T00:00:00.000Z: 994,210
2022-05-19T00:00:00.000Z - 2022-05-20T00:00:00.000Z: 904,386
2022-05-20T00:00:00.000Z - 2022-05-21T00:00:00.000Z: 1,000,931
2022-05-21T00:00:00.000Z - 2022-05-22T00:00:00.000Z: 891,999
2022-05-22T00:00:00.000Z - 2022-05-23T00:00:00.000Z: 884,822
2022-05-23T00:00:00.000Z - 2022-05-24T00:00:00.000Z: 971,489
2022-05-24T00:00:00.000Z - 2022-05-25T00:00:00.000Z: 1,068,666
2022-05-25T00:00:00.000Z - 2022-05-25T00:03:20.000Z: 2,546

Total Tweets: 6,719,049

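
The text output above is easy to pull into Python if you want to work with the numbers. A minimal sketch, assuming the output keeps the `start - end: count` layout shown here. The string `counts_output` is pasted from the cell above.

```python
# Daily windows pasted from the twarc2 counts output above.
counts_output = """\
2022-05-18T00:03:20.000Z - 2022-05-19T00:00:00.000Z: 994,210
2022-05-19T00:00:00.000Z - 2022-05-20T00:00:00.000Z: 904,386
2022-05-20T00:00:00.000Z - 2022-05-21T00:00:00.000Z: 1,000,931
2022-05-21T00:00:00.000Z - 2022-05-22T00:00:00.000Z: 891,999
2022-05-22T00:00:00.000Z - 2022-05-23T00:00:00.000Z: 884,822
2022-05-23T00:00:00.000Z - 2022-05-24T00:00:00.000Z: 971,489
2022-05-24T00:00:00.000Z - 2022-05-25T00:00:00.000Z: 1,068,666
2022-05-25T00:00:00.000Z - 2022-05-25T00:03:20.000Z: 2,546
"""

daily = {}
for line in counts_output.splitlines():
    # Split off the count at the last ": ", then keep the window's start time.
    window, count = line.rsplit(": ", 1)
    start = window.split(" - ")[0]
    daily[start] = int(count.replace(",", ""))

total = sum(daily.values())
busiest = max(daily, key=daily.get)

print(f"Total: {total:,}")
print(f"Busiest window starts: {busiest}")
```

The total matches the "Total Tweets" line, which is a quick check that the parsing did not drop a row.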