# Twitter with twarc
A UCSB Carpentry workshop

## Episode 2
You should already have a taxday.jsonl file.

In [1]:
# In a notebook, Bash commands start with a bang (!)
!twarc2 --help

Usage: twarc2 [OPTIONS] COMMAND [ARGS]...

  Collect data from the Twitter V2 API.

Options:
  --consumer-key TEXT         Twitter app consumer key (aka "App Key")
  --consumer-secret TEXT      Twitter app consumer secret (aka "App Secret")
  --access-token TEXT         Twitter app access token for user
                              authentication.
  --access-token-secret TEXT  Twitter app access token secret for user
                              authentication.
  --bearer-token TEXT         Twitter app access bearer token.
  --app-auth / --user-auth    Use application authentication or user
                              authentication. Some rate limits are higher with
                              user authentication, but not all endpoints are
                              supported.  [default: app-auth]
  -l, --log TEXT
  --verbose
  --metadata / --no-metadata  Include/don't include metadata about when and
                              how data was collected.  [default: metadata]
  

In [1]:
# Which libraries do we need to load in our notebook?
# Always keep track of whether a line runs in Bash (it starts with !)
# or is a line of Python.

import pandas
import twarc_csv
# from textblob import TextBlob



In [2]:
# and of course, it's important to know where we are working:
!pwd

/home/jovyan
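The commands in the rest of this notebook write into raw_data/ and output_data/, and a shell redirect fails if the target folder does not exist. Here is a minimal sketch for creating both folders from Python, assuming they sit directly under the working directory. The `ensure_dirs` helper is a hypothetical name, not part of twarc, and the demonstration runs in a throwaway folder.

```python
import tempfile
from pathlib import Path

def ensure_dirs(base, names=("raw_data", "output_data")):
    """Create the working folders under `base` if they are missing."""
    created = []
    for name in names:
        folder = Path(base) / name
        folder.mkdir(parents=True, exist_ok=True)  # no error if it already exists
        created.append(folder)
    return created

# Demonstrate in a temporary folder so nothing is left behind
with tempfile.TemporaryDirectory() as tmp:
    made = ensure_dirs(tmp)
    made_names = [p.name for p in made]
    all_exist = all(p.is_dir() for p in made)
    print(made_names, all_exist)
```

In the notebook itself, calling `ensure_dirs(Path.cwd())` would create the folders next to the notebook; `Path.cwd()` is the Python counterpart of `!pwd`.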


## Running twarc
Let's get the timeline of one of twarc's creators.

In [3]:
!twarc2 timeline BergisJules > raw_data/bjules.jsonl

API limit of 3200 reached:  18%|█▉         | 3142/17697 [00:36<02:50, 85.17it/s]


### Challenge
- Can you find the file called “bjules.jsonl”?
- What's the timestamp on the first tweet? On the last one?

- How many tweets did you get from Bergis? (we can't tell without flattening)
- Download a timeline for a person of your choice. How many tweets did you get? 
- What’s the oldest one?

# Episode 3: examining tweets
What comes along with a tweet

### Make your jsonl 1 tweet per line
This will let you do our most basic unix-y analysis

In [3]:
!wc raw_data/bjules.jsonl

     33  849333 9417257 raw_data/bjules.jsonl


## Try again
33 lines doesn't mean 33 tweets. I suspected there was more in the file because
I got a message about hitting a limit of 3200.

To count the tweets, we need to either flatten the file or convert it
to a csv.
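To see why the line count and the tweet count differ: each line written by twarc2 is one page of API results, and the tweets sit in a list under that page's `data` key. Here is a minimal sketch of counting them by hand, using an invented two-page sample in place of raw_data/bjules.jsonl. The IDs and text are made up, and `count_pages_and_tweets` is a hypothetical helper name.

```python
import io
import json

# Stand-in for a twarc2 JSONL file: one API response page per line
sample = io.StringIO(
    '{"data": [{"id": "1", "text": "a"}, {"id": "2", "text": "b"}], "meta": {"result_count": 2}}\n'
    '{"data": [{"id": "3", "text": "c"}], "meta": {"result_count": 1}}\n'
)

def count_pages_and_tweets(lines):
    """Return (number of pages, number of tweets) for twarc2-style JSONL."""
    pages = 0
    tweets = 0
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        page = json.loads(line)
        pages += 1
        tweets += len(page.get("data", []))  # tweets live in the page's "data" list
    return pages, tweets

pages, tweets = count_pages_and_tweets(sample)
print(f"{pages} lines, {tweets} tweets")
```

For a real file, the same function can be handed an open file object instead of the sample. This only counts tweets; it does not replace flattening.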


In [5]:
# flatten
!twarc2 flatten raw_data/bjules.jsonl output_data/bjules_flattened.jsonl

100%|██████████████| Processed 8.98M/8.98M of input file [00:01<00:00, 6.00MB/s]


In [6]:
# convert
!twarc2 csv raw_data/bjules.jsonl output_data/bjules.csv

100%|██████████████| Processed 8.98M/8.98M of input file [00:02<00:00, 3.57MB/s]

ℹ️
Parsed 3166 tweets objects from 33 lines in the input file.
Wrote 3166 rows and output 74 columns in the CSV.



In [7]:
# When I did this, I got 3166 tweets (as opposed to the 33 lines that the original file was)
! wc output_data/bjules_flattened.jsonl
! wc output_data/bjules.csv

    3166  1742043 23565579 output_data/bjules_flattened.jsonl
    3167   586043 11658052 output_data/bjules.csv


The csv is 1 line longer because it has column headers.
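The header effect is easy to reproduce on a tiny file. This sketch uses an invented three-row sample rather than output_data/bjules.csv, and compares the raw line count (the first number `wc` reports) with the number of data rows a CSV reader returns.

```python
import csv
import io

# Invented three-row sample standing in for a twarc-generated CSV
sample_csv = io.StringIO(
    "id,text\n"
    '1,"first tweet"\n'
    '2,"second tweet"\n'
    '3,"third tweet"\n'
)

# Raw line count, header included
line_count = len(sample_csv.getvalue().splitlines())

# DictReader consumes the header line and yields only data rows
sample_csv.seek(0)
row_count = sum(1 for _ in csv.DictReader(sample_csv))

print(line_count, row_count)
```

In general, a quoted CSV field may contain line breaks, so counting rows with a CSV reader is a safer check than counting lines.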

Can we go back further on his timeline by looking
only for Bergis's original content?

No--the same limit applies.

In [8]:
!twarc2 timeline BergisJules --exclude-retweets --exclude-replies > raw_data/bjules_original.jsonl

API limit of 3200 reached:   0%|             | 47/17696 [00:00<03:20, 88.13it/s]


In [None]:
!head -n 1 taxday.jsonl

In [9]:
!twarc2 flatten raw_data/bjules_original.jsonl output_data/bjules_original_flat.jsonl


100%|████████████████| Processed 111k/111k of input file [00:00<00:00, 25.3MB/s]
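Once a file is flattened to one tweet per line, a question such as "what is the oldest tweet?" becomes a short loop. Here is a minimal sketch, assuming each tweet carries a `created_at` string in the 2022-05-03T22:48:36.000Z form that twarc2 output uses. The three sample tweets are invented, and `oldest_and_newest` is a hypothetical helper name.

```python
import io
import json
from datetime import datetime

# Invented stand-in for a flattened file: one tweet per line
flattened = io.StringIO(
    '{"id": "3", "created_at": "2022-05-03T22:48:36.000Z"}\n'
    '{"id": "2", "created_at": "2022-04-27T01:41:12.000Z"}\n'
    '{"id": "1", "created_at": "2021-12-31T23:59:59.000Z"}\n'
)

def oldest_and_newest(lines):
    """Return the earliest and latest created_at timestamps in the file."""
    stamps = []
    for line in lines:
        if line.strip():
            tweet = json.loads(line)
            stamps.append(
                datetime.strptime(tweet["created_at"], "%Y-%m-%dT%H:%M:%S.%fZ")
            )
    return min(stamps), max(stamps)

oldest, newest = oldest_and_newest(flattened)
print("oldest:", oldest.isoformat())
print("newest:", newest.isoformat())
```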


In [None]:
# A lot comes along! 
!tail -n 1 taxday.jsonl
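A single line of tweet JSON is hard to read by eye. One way to get an overview is to list the name of every field, using dotted paths for nested ones (the same notation the CSV column names use). This sketch runs on an invented, heavily shortened tweet; `dotted_keys` is a hypothetical helper name.

```python
import json

def dotted_keys(obj, prefix=""):
    """Yield a dotted path for every non-dict value in a nested dict."""
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            yield from dotted_keys(value, path)  # descend into nested objects
        else:
            yield path

# Invented, shortened tweet; a real one has many more fields
sample_line = (
    '{"id_str": "1", "text": "taxes are due", '
    '"user": {"id_str": "42", "screen_name": "example"}, '
    '"entities": {"hashtags": []}}'
)

tweet = json.loads(sample_line)
keys = sorted(dotted_keys(tweet))
for key in keys:
    print(key)
```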

## Challenge: tax day Tweets

In [None]:
# we harvested 3 hours worth of tweets for you on tax day.
# how many tweets?
!wc taxday.jsonl

## Next harvest
Next we'll get just Bergis's original content.
In other words, only the tweets that he wrote himself, not
retweets or replies to other people's tweets.

In [None]:
# maybe the last challenge for ep. 3 is examining this shorter file?

# Episode 4

In [None]:
!twarc2 search --limit 500 "#catsofinstagram" hashtagcats.jsonl

In [None]:
!twarc2 search --limit 5000 "#catsofinstagram" hashtagcats.jsonl

In [None]:
# use Twitter advanced search syntax (everything in quotes!)
# to get tailored results
!twarc2 search --limit 800 "(cute OR fluffy OR haircut) (#catsofinstagram) lang:en" raw_data/kittens.jsonl
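The query above has three parts: a group of keywords joined with OR, a required hashtag, and a language filter. Parts separated by a space must all match. Here is a minimal sketch of assembling such a query string in Python before pasting it into the command; `build_query` is a hypothetical helper, not a twarc function.

```python
def build_query(any_of, hashtag, lang=None):
    """Combine OR-ed keywords, a required hashtag, and an optional language filter."""
    parts = ["(" + " OR ".join(any_of) + ")", f"({hashtag})"]
    if lang:
        parts.append(f"lang:{lang}")
    # A space between parts means every part must match
    return " ".join(parts)

query = build_query(["cute", "fluffy", "haircut"], "#catsofinstagram", lang="en")
print(query)
```

On the command line the whole query still goes inside one pair of double quotes, so the shell passes it to twarc2 as a single argument.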

### Converting to csv and dataframes

In [10]:
!twarc2 csv raw_data/taxday.jsonl output_data/taxday.csv

 92%|████████████▊ | Processed 87.4M/95.4M of input file [00:13<00:02, 3.47MB/s]
💔 ERROR: 314 Unexpected items in data!
Are you sure you specified the correct --input-data-type?
If the object type is correct, add extra columns with:
--extra-input-columns "quoted_status.user.id,quoted_status.user.profile_background_image_url,quoted_status.in_reply_to_status_id,retweeted_status.user.created_at,quoted_status.user.entities.description.urls,retweeted_status.quoted_status.possibly_sensitive,user.contributors_enabled,retweeted_status.quoted_status.lang,quoted_status.user.location,retweeted_status.quoted_status.source,quoted_status.user.follow_request_sent,retweeted_status.quoted_status.display_text_range,in_reply_to_screen_name,user.id_str,retweeted_status.quoted_status.user.profile_image_url_https,quoted_status.in_reply_to_screen_name,quoted_status.user.listed_count,retweeted_status.quoted_status.user.entities.description.urls,quoted_status.user.profile_sidebar_border_color,quoted_statu

In [13]:
!twarc2 csv raw_data/kittens.jsonl output_data/kittens.csv

100%|██████████████| Processed 1.93M/1.93M of input file [00:00<00:00, 3.43MB/s]

ℹ️
Parsed 664 tweets objects from 7 lines in the input file.
Wrote 664 rows and output 74 columns in the CSV.



In [14]:
kittens_df = pandas.read_csv("output_data/kittens.csv")

In [15]:
kittens_df

Unnamed: 0,id,conversation_id,referenced_tweets.replied_to.id,referenced_tweets.retweeted.id,referenced_tweets.quoted.id,author_id,in_reply_to_user_id,retweeted_user_id,quoted_user_id,created_at,...,geo.geo.bbox,geo.geo.type,geo.id,geo.name,geo.place_id,geo.place_type,__twarc.retrieved_at,__twarc.url,__twarc.version,Unnamed: 73
0,1521622781858770944,1521622781858770944,,,,2304335737,,,,2022-05-03T22:48:36.000Z,...,,,,,,,2022-05-03T23:06:33+00:00,https://api.twitter.com/2/tweets/search/recent...,2.10.3,
1,1521603504963559424,1521603504963559424,,,,18323954,,,,2022-05-03T21:32:00.000Z,...,,,,,,,2022-05-03T23:06:33+00:00,https://api.twitter.com/2/tweets/search/recent...,2.10.3,
2,1521597792069857282,1521597792069857282,,1.520388e+18,,3254809073,,8.854780e+08,,2022-05-03T21:09:18.000Z,...,,,,,,,2022-05-03T23:06:33+00:00,https://api.twitter.com/2/tweets/search/recent...,2.10.3,
3,1521585693792817155,1521585693792817155,,1.521583e+18,,1076129354432819200,,1.832395e+07,,2022-05-03T20:21:13.000Z,...,,,,,,,2022-05-03T23:06:33+00:00,https://api.twitter.com/2/tweets/search/recent...,2.10.3,
4,1521585079520305152,1521585079520305152,,1.521583e+18,,286942658,,1.832395e+07,,2022-05-03T20:18:47.000Z,...,,,,,,,2022-05-03T23:06:33+00:00,https://api.twitter.com/2/tweets/search/recent...,2.10.3,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
659,1519129505016729602,1519129505016729602,,1.519039e+18,,20392809,,1.517269e+18,,2022-04-27T01:41:12.000Z,...,,,,,,,2022-05-03T23:06:38+00:00,https://api.twitter.com/2/tweets/search/recent...,2.10.3,
660,1519126846100709376,1519126846100709376,,1.515818e+18,,1183728633606758400,,3.958303e+09,,2022-04-27T01:30:38.000Z,...,,,,,,,2022-05-03T23:06:38+00:00,https://api.twitter.com/2/tweets/search/recent...,2.10.3,
661,1519118524526080006,1519118524526080006,,,,15702359,,,,2022-04-27T00:57:34.000Z,...,,,,,,,2022-05-03T23:06:38+00:00,https://api.twitter.com/2/tweets/search/recent...,2.10.3,
662,1519108303275663362,1519108303275663362,,1.515818e+18,,1375922717539569664,,3.958303e+09,,2022-04-27T00:16:57.000Z,...,,,,,,,2022-05-03T23:06:38+00:00,https://api.twitter.com/2/tweets/search/recent...,2.10.3,
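Some ID columns in the table above display as values like 1.520388e+18. That happens when a column with missing values is read as floating point, and a 64-bit float cannot hold a 19-digit tweet ID exactly. Here is a minimal sketch of the effect, using one ID from the table. A common workaround, offered as a suggestion rather than something this lesson prescribes, is to read ID columns as text, for example through the `dtype` parameter of `pandas.read_csv`.

```python
import csv
import io

tweet_id = "1521597792069857282"  # an id value shown in the table above

as_float = float(tweet_id)    # what a float64 column would store
round_trip = int(as_float)    # converting back does not recover the id

print(tweet_id)
print(round_trip)
print("exact match:", round_trip == int(tweet_id))

# Reading the value as text keeps every digit
sample = io.StringIO("id\n1521597792069857282\n")
row = next(csv.DictReader(sample))
print("text match:", row["id"] == tweet_id)
```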


# Episode 5: Ethics & Twitter

# Episode 6: Working with our Tweets

In [None]:
# Episode 6
# Create a utils folder and copy the helper scripts into it from the twarc repo.
# (We will provide this folder for learners.)
# https://github.com/DocNow/twarc/tree/main/utils

!python utils/wall.py raw_data/taxday.jsonl > output_data/taxday_wall.html

In [None]:
# What does this do?
# It lists the tweet IDs of the retweeted tweets in your dataset
# and how many times each one was retweeted.
!python utils/retweets.py raw_data/taxday.jsonl > output_data/taxday_retweets.csv
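To make the idea concrete, here is a rough sketch of how retweet counts could be gathered by hand. This is not the code of utils/retweets.py. It assumes v1.1-style tweets, one per line, where a retweet embeds the original tweet under `retweeted_status`. The sample data is invented and `top_retweets` is a hypothetical helper name.

```python
import io
import json

# Invented v1.1-style sample: three retweets and one original tweet
sample = io.StringIO(
    '{"id_str": "10", "retweeted_status": {"id_str": "1", "retweet_count": 5}}\n'
    '{"id_str": "11", "retweeted_status": {"id_str": "1", "retweet_count": 7}}\n'
    '{"id_str": "12", "retweeted_status": {"id_str": "2", "retweet_count": 3}}\n'
    '{"id_str": "13", "text": "an original tweet"}\n'
)

def top_retweets(lines):
    """Return (original tweet id, highest retweet count seen), most retweeted first."""
    counts = {}
    for line in lines:
        if not line.strip():
            continue
        tweet = json.loads(line)
        original = tweet.get("retweeted_status")
        if original is None:
            continue  # not a retweet
        tweet_id = original["id_str"]
        # Keep the largest count observed for each original tweet
        counts[tweet_id] = max(counts.get(tweet_id, 0), original["retweet_count"])
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)

ranking = top_retweets(sample)
for tweet_id, retweets in ranking:
    print(f"{tweet_id},{retweets}")
```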

In [None]:
# how much is that?
tax_retweets_df = pandas.read_csv("output_data/taxday_retweets.csv", names=["tweetid", "retweets"])
tax_retweets_df.head()
tax_retweets_df["retweets"].plot(kind = "hist")


In [None]:
# total number of retweets across all rows
retweet_total = tax_retweets_df["retweets"].sum()


In [None]:
tax_retweets_df["retweets"].plot(kind = "hist", loglog=True)
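Retweet counts tend to be heavily skewed: many tweets with a few retweets and a few tweets with very many, which is why log axes can make the histogram easier to read. The same picture can be sketched without plotting by grouping counts into powers of ten. The numbers below are invented, the two-column layout mirrors the `tweetid` and `retweets` names used above, and `power_of_ten_buckets` is a hypothetical helper name.

```python
import csv
import io
from collections import Counter

# Invented sample in the same two-column layout: tweetid, retweets
sample = io.StringIO("101,1\n102,3\n103,8\n104,12\n105,950\n106,40000\n")

def power_of_ten_buckets(rows):
    """Return (total retweets, {exponent: number of tweets in 10**exponent range})."""
    buckets = Counter()
    total = 0
    for _tweet_id, retweets in rows:
        n = int(retweets)
        total += n
        exponent = len(str(n)) - 1  # number of digits minus one, e.g. 950 -> 2
        buckets[exponent] += 1
    return total, dict(sorted(buckets.items()))

total, buckets = power_of_ten_buckets(csv.reader(sample))
print("total retweets:", total)
for exponent, tweet_count in buckets.items():
    print(f"{10 ** exponent} to {10 ** (exponent + 1) - 1}: {tweet_count} tweets")
```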

# Episode 7: Search and Filter

# Episode 8: Analysis Tools

In [None]:
# Episode 9: Data Management

In [None]:
# Episode 10: Don't Map Twitter