# Twitter with twarc
A UCSB Carpentry workshop

## Episode 2
Before you start, you should have a taxday.jsonl file.

In [1]:
# BASH commands start with a BANG!
!twarc2 --help

Usage: twarc2 [OPTIONS] COMMAND [ARGS]...

  Collect data from the Twitter V2 API.

Options:
  --consumer-key TEXT         Twitter app consumer key (aka "App Key")
  --consumer-secret TEXT      Twitter app consumer secret (aka "App Secret")
  --access-token TEXT         Twitter app access token for user
                              authentication.
  --access-token-secret TEXT  Twitter app access token secret for user
                              authentication.
  --bearer-token TEXT         Twitter app access bearer token.
  --app-auth / --user-auth    Use application authentication or user
                              authentication. Some rate limits are higher with
                              user authentication, but not all endpoints are
                              supported.  [default: app-auth]
  -l, --log TEXT
  --verbose
  --metadata / --no-metadata  Include/don't include metadata about when and
                              how data was collected.  [default: metadata]
  

In [2]:
#  what libraries will we need to be loading in our notebook?
#  we need to always distinguish between 
#  running BASH vs. running a line of python.

import pandas
import twarc_csv
from textblob import TextBlob



In [3]:
# and of course, it's important to know where we are working
# I can send a BASH command from my notebook with a !:
!pwd

/home/jovyan


## Running twarc
Let's get the timeline of one of twarc's creators.

In [3]:
!twarc2 timeline BergisJules > raw_data/bjules.jsonl

API limit of 3200 reached:  18%|█▉         | 3142/17697 [00:36<02:50, 85.17it/s]


### Challenge
- Can you find the file called “bjules.jsonl”?
- What's the timestamp on the first tweet? On the last one?

- How many tweets did you get from Bergis? (We can't tell without flattening.)
- Download a timeline for a person of your choice. How many tweets did you get? 
- What’s the oldest one?

# Episode 3: examining tweets
What comes along with a tweet

### Make your jsonl 1 tweet per line
This will let you do our most basic unix-y analysis

In [4]:
!wc raw_data/bjules.jsonl

     33  849333 9417257 raw_data/bjules.jsonl


## Try again
33 lines doesn't mean 33 tweets. I suspected there was more there because
the output said the API limit of 3200 had been reached.

We need to either flatten our tweets or convert them
to a csv.
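
A minimal Python sketch of why the line count and the tweet count differ. It assumes each line of the raw file is one page of API results that keeps its tweets in a `data` list. The sample pages and the `count_tweets` helper are made up for illustration. The real `twarc2 flatten` also folds user and media details into each tweet, which this sketch does not attempt.

```python
import io
import json

def count_tweets(lines):
    """Return (number of lines, number of tweets) for unflattened JSONL."""
    n_lines = 0
    n_tweets = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        page = json.loads(line)
        n_lines += 1
        # assumption: the tweets on a page live in its "data" list
        n_tweets += len(page.get("data", []))
    return n_lines, n_tweets

# Two made-up pages standing in for raw_data/bjules.jsonl
sample = io.StringIO(
    json.dumps({"data": [{"id": "1"}, {"id": "2"}, {"id": "3"}]}) + "\n"
    + json.dumps({"data": [{"id": "4"}, {"id": "5"}]}) + "\n"
)
n_lines, n_tweets = count_tweets(sample)
print(n_lines, "lines hold", n_tweets, "tweets")
```

To try it on the real file, pass `open("raw_data/bjules.jsonl")` in place of `sample`.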


In [5]:
# flatten
!twarc2 flatten raw_data/bjules.jsonl output_data/bjules_flattened.jsonl

100%|██████████████| Processed 8.98M/8.98M of input file [00:00<00:00, 9.50MB/s]


In [6]:
# convert
!twarc2 csv raw_data/bjules.jsonl output_data/bjules.csv

100%|██████████████| Processed 8.98M/8.98M of input file [00:05<00:00, 1.58MB/s]

ℹ️
Parsed 3166 tweets objects from 33 lines in the input file.
Wrote 3166 rows and output 74 columns in the CSV.



In [7]:
# When I did this, I got 3166 tweets (as opposed to the 33 lines in the original file)
! wc output_data/bjules_flattened.jsonl
! wc output_data/bjules.csv

    3166  1742043 23565579 output_data/bjules_flattened.jsonl
    3167   586043 11658052 output_data/bjules.csv


The csv is 1 line longer because it has column headers.

Can we go back further on his timeline by looking
only for Bergis's original content?

No--the same limit applies.

In [8]:
!twarc2 timeline BergisJules --exclude-retweets --exclude-replies > raw_data/bjules_original.jsonl

API limit of 3200 reached:   0%|             | 47/17696 [00:00<03:20, 88.13it/s]


Let's look at our taxday.jsonl file that we prepared for you on April 18.

In [None]:
!head -n 1 raw_data/taxday.jsonl

In [9]:
!twarc2 flatten raw_data/bjules_original.jsonl output_data/bjules_original_flat.jsonl


100%|████████████████| Processed 111k/111k of input file [00:00<00:00, 25.3MB/s]


In [None]:
# A lot comes along! 
!tail -n 1 raw_data/taxday.jsonl

## Challenge: tax day Tweets

In [None]:
# we harvested 3 hours' worth of tweets for you on tax day.
# how many tweets?
!wc raw_data/taxday.jsonl

## Next harvest
Next we'll get just Bergis's original content. 
In other words, only the tweets that he wrote, not
any retweets or replies to other people's tweets.

In [None]:
# maybe the last challenge for ep. 3 is examining this shorter file?

# Episode 4

In [36]:
## Endpoints: counts
!twarc2 counts --text "(Poker OR poker)" --granularity "day"
!twarc2 counts --text "(Golf OR golf)" --granularity "day"
!twarc2 counts --text "(Basketball OR basketball)" --granularity "day"
!twarc2 counts --text "(Baseball OR baseball)" --granularity "day"
!twarc2 counts --text "(Football OR football)" --granularity "day"

2022-05-04T23:09:38.000Z - 2022-05-05T00:00:00.000Z: 520
2022-05-05T00:00:00.000Z - 2022-05-06T00:00:00.000Z: 14,928
2022-05-06T00:00:00.000Z - 2022-05-07T00:00:00.000Z: 14,851
2022-05-07T00:00:00.000Z - 2022-05-08T00:00:00.000Z: 13,038
2022-05-08T00:00:00.000Z - 2022-05-09T00:00:00.000Z: 14,368
2022-05-09T00:00:00.000Z - 2022-05-10T00:00:00.000Z: 13,196
2022-05-10T00:00:00.000Z - 2022-05-11T00:00:00.000Z: 14,321
2022-05-11T00:00:00.000Z - 2022-05-11T23:09:38.000Z: 17,018
Total Tweets: 102,240

2022-05-04T23:09:39.000Z - 2022-05-05T00:00:00.000Z: 1,864
2022-05-05T00:00:00.000Z - 2022-05-06T00:00:00.000Z: 49,995
2022-05-06T00:00:00.000Z - 2022-05-07T00:00:00.000Z: 46,706
2022-05-07T00:00:00.000Z - 2022-05-08T00:00:00.000Z: 42,001
2022-05-08T00:00:00.000Z - 2022-05-09T00:00:00.000Z: 41,864
2022-05-09T00:00:00.000Z - 2022-05-10T00:00:00.000Z: 48,382
2022-05-10T00:00:00.000Z - 2022-05-11T00:00:00.000Z: 51,410
2022-05-11T00:00:00.000Z - 2022-05-11T23:09:39.000Z: 44,767
Total
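
The daily counts can be totaled in Python to check them against the summary line. This is a sketch that pastes the output of the first query (poker) into a string. The regular expression and the variable names are illustrative assumptions, not part of twarc.

```python
import re

# Output of the first counts query, copied from the cell above
counts_output = """\
2022-05-04T23:09:38.000Z - 2022-05-05T00:00:00.000Z: 520
2022-05-05T00:00:00.000Z - 2022-05-06T00:00:00.000Z: 14,928
2022-05-06T00:00:00.000Z - 2022-05-07T00:00:00.000Z: 14,851
2022-05-07T00:00:00.000Z - 2022-05-08T00:00:00.000Z: 13,038
2022-05-08T00:00:00.000Z - 2022-05-09T00:00:00.000Z: 14,368
2022-05-09T00:00:00.000Z - 2022-05-10T00:00:00.000Z: 13,196
2022-05-10T00:00:00.000Z - 2022-05-11T00:00:00.000Z: 14,321
2022-05-11T00:00:00.000Z - 2022-05-11T23:09:38.000Z: 17,018
"""

# Each line reads "<start> - <end>: <count>", and the count may contain commas
pattern = re.compile(r"^(\S+) - (\S+): ([\d,]+)$")

daily = []
for line in counts_output.splitlines():
    match = pattern.match(line)
    if match:
        start, end, count = match.groups()
        daily.append((start, int(count.replace(",", ""))))

total = sum(count for _, count in daily)
print(f"{len(daily)} intervals, {total:,} tweets in total")
# prints "8 intervals, 102,240 tweets in total"
```

The first and last intervals are partial days, so their counts are not directly comparable with the six full days in between.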

In [None]:
## convert to ucsblibrary. Cats will be ep. 6

### Converting to csv and dataframes

In [None]:
!twarc2 csv raw_data/taxday.jsonl output_data/taxday.csv

## final challenge: Cats of Instagram
Let’s make a bigger datafile. Harvest 5000 tweets that use the hashtag “catsofinstagram” and put the dataset through the pipeline to answer the following questions:

- Did you get exactly 5000?
- How far back in time did you get?
- What is the most retweeted recent tweet on #catsofinstagram?
- Which person in your dataset has the largest number of followers?
- Is it really a person?
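
One way to approach the last three questions once the harvest has been converted with `twarc2 csv` is sketched below with Python's built-in `csv` module. The column names `public_metrics.retweet_count`, `author.username`, and `author.public_metrics.followers_count` are assumptions about the CSV layout, so check them against your own header row. The three sample rows are invented.

```python
import csv
import io

# Invented rows standing in for a converted CSV file
sample_csv = io.StringIO(
    "id,author.username,author.public_metrics.followers_count,"
    "public_metrics.retweet_count\n"
    "101,cat_fan,250,3\n"
    "102,pet_store_bot,98000,12\n"
    "103,tabby_tales,4100,57\n"
)

rows = list(csv.DictReader(sample_csv))

# csv hands back strings, so convert the counts to integers before comparing
most_retweeted = max(rows, key=lambda r: int(r["public_metrics.retweet_count"]))
most_followed = max(
    rows, key=lambda r: int(r["author.public_metrics.followers_count"])
)

print("rows:", len(rows))
print("most retweeted tweet id:", most_retweeted["id"])
print("most followed account:", most_followed["author.username"])
```

Real files can contain empty cells, which would need handling before the `int()` conversion. The same comparison also works on a pandas dataframe loaded with `pandas.read_csv`.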

In [None]:
!twarc2 search --limit 500 "#catsofinstagram" hashtagcats.jsonl

In [1]:
!twarc2 search --limit 5000 "#catsofinstagram" raw_data/hashtagcats.jsonl

Set --limit of 5000 reached:  41%|▍| Processed 2 days/6 days [00:45<01:04, 5085 


In [None]:
list(kittens_df.columns)

# Episode 5: Ethics & Twitter

In [12]:
# columns is an attribute of the dataframe, not a function
kittens_df.columns

# Episode 6: Working with our Tweets

1. followers
1. retweeted-by
1. wall
1. retweets
1. emojis

In [None]:
# Episode 6
# make the utils folder and get them from the repo
# we will make folder for learners:
# https://github.com/DocNow/twarc/tree/main/utils

!python utils/wall.py raw_data/taxday.jsonl > output_data/taxday_wall.html

In [None]:
# what does this do?
# it lists the tweet IDs of the retweeted tweets in your dataset, and how many times each was retweeted
!python utils/retweets.py raw_data/taxday.jsonl > output_data/taxday_retweets.csv

In [None]:
# how much is that?
tax_retweets_df = pandas.read_csv("output_data/taxday_retweets.csv", names=["tweetid", "retweets"])
tax_retweets_df.head()
tax_retweets_df["retweets"].plot(kind = "hist")


In [None]:
retweet_total = sum(tax_retweets_df["retweets"])


In [None]:
tax_retweets_df["retweets"].plot(kind = "hist", loglog=True)

In [None]:
# TextBlob needs a string, so this won't work.
TextBlob(kittens_df).sentiment

# Episode 7: Search and Filter

In [None]:
# use Twitter advanced search syntax (everything in quotes!)
# to get tailored results
!twarc2 search --limit 800 "(cute OR fluffy OR haircut) (#catsofinstagram) lang:en" kittens.jsonl

In [6]:
kittens_df = pandas.read_csv("output_data/kittens.csv")

In [None]:
kittens_df

In [None]:
list(kittens_df.columns)

# Episode 8: Analysis Tools

In [None]:
# built-in should come first

### Sentiment Analysis
To do this, we need to write a little Python.

TextBlob is a text processing library that does sentiment analysis. 
The sentiment property returns a namedtuple of the form Sentiment(polarity, subjectivity). The polarity score is a float within the range [-1.0, 1.0]. The subjectivity is a float within the range [0.0, 1.0] where 0.0 is very objective and 1.0 is very subjective.

Before we use TextBlob for sentiment analysis, we need to download
datasets of words and their associated weights. These are called *corpora*.

In [None]:
!python -m textblob.download_corpora

In [42]:
# break the tweets' text column into a list, then .join into one long string 
kittens_string = ' '.join(kittens_df['text'].tolist())
# turn the string into a blob
kittens_blob = TextBlob(kittens_string)
# get the sentiment
kittens_blob.sentiment

Sentiment(polarity=0.4313034129204853, subjectivity=0.7558201654919634)

The overall sentiment of the language of our kittens tweets is rather positive.
And the tweets tend to be subjective.
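
Reading the two numbers off the result can be sketched in plain Python. The `Sentiment` namedtuple below is a stand-in that mirrors the shape of TextBlob's result, so the example runs without TextBlob installed. The `describe` helper and its cut-off values (0 for polarity, 0.5 for subjectivity) are illustrative assumptions, not a TextBlob convention.

```python
from collections import namedtuple

# Stand-in with the same fields as the result shown above
Sentiment = namedtuple("Sentiment", ["polarity", "subjectivity"])
kittens_sentiment = Sentiment(polarity=0.4313034129204853,
                              subjectivity=0.7558201654919634)

def describe(sentiment):
    """Turn the two scores into a short plain-language summary."""
    if sentiment.polarity > 0:
        tone = "positive"
    elif sentiment.polarity < 0:
        tone = "negative"
    else:
        tone = "neutral"
    # assumption: 0.5 is used here as the midpoint of the [0.0, 1.0] range
    leaning = "subjective" if sentiment.subjectivity > 0.5 else "objective"
    return f"{tone}, leaning {leaning}"

# Fields can be read by name or unpacked like an ordinary tuple
polarity, subjectivity = kittens_sentiment
print(round(polarity, 2), round(subjectivity, 2), describe(kittens_sentiment))
```

With a real blob, `kittens_blob.sentiment` can be passed to `describe` in the same way.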

In [None]:
# What do you think the sentiment of tax day might be?
# get the overall sentiment and see if it matches your prediction.

In [None]:
# Episode 9: Data Management

In [None]:
# Episode 10: Don't Map Twitter