# Twitter with twarc
A UCSB Carpentry workshop

## Episode 2
You should already have a taxday.jsonl file.

In [1]:
# In a notebook, Bash commands start with a bang (!)
!twarc2 --help

Usage: twarc2 [OPTIONS] COMMAND [ARGS]...

  Collect data from the Twitter V2 API.

Options:
  --consumer-key TEXT         Twitter app consumer key (aka "App Key")
  --consumer-secret TEXT      Twitter app consumer secret (aka "App Secret")
  --access-token TEXT         Twitter app access token for user
                              authentication.
  --access-token-secret TEXT  Twitter app access token secret for user
                              authentication.
  --bearer-token TEXT         Twitter app access bearer token.
  --app-auth / --user-auth    Use application authentication or user
                              authentication. Some rate limits are higher with
                              user authentication, but not all endpoints are
                              supported.  [default: app-auth]
  -l, --log TEXT
  --verbose
  --metadata / --no-metadata  Include/don't include metadata about when and
                              how data was collected.  [default: metadata]
  

In [2]:
# Which libraries do we need to load in our notebook?
# We always need to distinguish between running Bash
# (lines that start with !) and running a line of Python.

import pandas
import twarc_csv



In [5]:
# and of course, it's important to know where we are working:
!pwd

/home/jovyan


## Running twarc
Let's get the timeline of one of twarc's creators.

In [3]:
!twarc2 timeline BergisJules > source_data/bjules.jsonl

API limit of 3200 reached:  18%|█▉         | 3142/17697 [00:36<02:50, 85.17it/s]


### Challenge
- Can you find the file called “bjules.jsonl”?
- What's the timestamp on the first tweet? On the last one?

- How many tweets did you get from Bergis? (We can't tell without flattening.)
- Download a timeline for a person of your choice. How many tweets did you get?
- What’s the oldest one?

# Episode 3: Examining Tweets
What comes along with a tweet?

### Make your jsonl one tweet per line
This will let us do our most basic Unix-style analysis, like counting lines.

In [7]:
!wc source_data/bjules.jsonl

     33  849333 9417257 source_data/bjules.jsonl


## Try again
33 lines doesn't mean 33 tweets. I suspected there were more because
twarc reported hitting the API limit of 3200. Each line of the raw file
holds a whole page of API results, not a single tweet.

We need to either flatten our tweets or convert them
to a csv.
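
If you want to check the count without flattening, a short Python sketch can add up the tweets on each line. It assumes each line of the raw file is one API response with its tweets listed under a "data" key. The sample records and the helper name `count_tweets` are made up for illustration.

```python
import json
import tempfile
from pathlib import Path


def count_tweets(path):
    """Count lines and tweets in a twarc2 JSONL file without flattening it."""
    lines = 0
    tweets = 0
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # skip blank lines
            lines += 1
            record = json.loads(line)
            data = record.get("data", record)
            # A page stores a list of tweets under "data"; a flattened
            # line (or a single-tweet response) holds one tweet object.
            tweets += len(data) if isinstance(data, list) else 1
    return lines, tweets


# Made-up sample standing in for source_data/bjules.jsonl: two pages, five tweets.
sample_pages = [
    {"data": [{"id": "1"}, {"id": "2"}, {"id": "3"}], "meta": {"result_count": 3}},
    {"data": [{"id": "4"}, {"id": "5"}], "meta": {"result_count": 2}},
]
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample.jsonl"
    sample_path.write_text(
        "\n".join(json.dumps(page) for page in sample_pages) + "\n",
        encoding="utf-8",
    )
    line_count, tweet_count = count_tweets(sample_path)

print(f"{line_count} lines, {tweet_count} tweets")
```

On a real harvest you would call `count_tweets("source_data/bjules.jsonl")` instead of building the sample file.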


In [8]:
# flatten
!twarc2 flatten source_data/bjules.jsonl output_data/bjules_flattened.jsonl

100%|██████████████| Processed 8.98M/8.98M of input file [00:00<00:00, 11.4MB/s]


In [10]:
# convert
!twarc2 csv source_data/bjules.jsonl output_data/bjules.csv

100%|██████████████| Processed 8.98M/8.98M of input file [00:05<00:00, 1.62MB/s]

ℹ️
Parsed 3166 tweets objects from 33 lines in the input file.
Wrote 3166 rows and output 74 columns in the CSV.



In [12]:
# When I did this, I got 3166 tweets (as opposed to the 33 lines in the original file)
! wc output_data/bjules_flattened.jsonl
! wc output_data/bjules.csv

    3166  1742043 23565579 output_data/bjules_flattened.jsonl
    3167   586043 11658052 output_data/bjules.csv


The csv is 1 line longer because it has column headers.
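
The header-line difference can be seen in a tiny Python sketch. The csv text here is invented and only stands in for output_data/bjules.csv. Note that a tweet with a line break inside its text could push the line count higher still, so counting parsed rows is the safer check.

```python
import csv
import io

# Made-up stand-in for output_data/bjules.csv: a header plus three tweets.
sample_csv = "id,text\n1,first tweet\n2,second tweet\n3,third tweet\n"

# Number of lines in the file, which is what `wc` reports in its first column.
line_count = len(sample_csv.splitlines())

# DictReader consumes the first line as column names, so it is not a row.
rows = list(csv.DictReader(io.StringIO(sample_csv)))

print(line_count, len(rows))
```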

Can we go back further on his timeline by looking
only for Bergis's original content?

No, the same limit applies.

In [13]:
!twarc2 timeline BergisJules --exclude-retweets --exclude-replies > source_data/bjules_original

API limit of 3200 reached:   0%|             | 47/17697 [00:00<03:38, 80.86it/s]


In [None]:
!head -n 1 taxday.jsonl

In [16]:
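# This fails: the harvest above wrote to source_data/bjules_original,
# without the .jsonl extension, so the file named here does not exist.
!twarc2 flatten source_data/bjules_original.jsonl output_data/bjules_original_flat.jsonl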


Usage: twarc2 flatten [OPTIONS] [INFILE] [OUTFILE]
Try 'twarc2 flatten --help' for help.

Error: Invalid value for '[INFILE]': 'source_data/bjules_original.jsonl': No such file or directory


In [None]:
# A lot comes along! 
!tail -n 1 taxday.jsonl
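
Raw JSON is hard to read in the terminal, so one option is to list just the field names in Python. This is a sketch on an invented sample line. It assumes a line is either a page of results (tweets under "data") or a single flattened tweet. The field names follow the Twitter V2 tweet object, but your own file may carry more of them.

```python
import json

# Invented sample line, shaped like one page of a twarc2 response.
sample_line = json.dumps({
    "data": [{
        "id": "1",
        "text": "Filed my taxes!",
        "created_at": "2022-04-18T15:00:00.000Z",
        "author_id": "42",
        "public_metrics": {
            "retweet_count": 0,
            "reply_count": 0,
            "like_count": 1,
            "quote_count": 0,
        },
    }],
    "includes": {"users": [{"id": "42", "username": "example_user"}]},
    "meta": {"result_count": 1},
})

record = json.loads(sample_line)
page_keys = sorted(record)

# Take the first tweet if this is a page, otherwise treat the record as a tweet.
data = record.get("data", record)
first_tweet = data[0] if isinstance(data, list) else data
tweet_keys = sorted(first_tweet)

print("line-level keys: ", page_keys)
print("tweet-level keys:", tweet_keys)
```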

## Challenge: tax day Tweets

In [None]:
# We harvested three hours' worth of tweets for you on tax day.
# How many tweets is that?
!wc taxday.jsonl

## Next harvest
Next we'll get just Bergis's original content.
In other words, only the tweets that he wrote, not
any retweets or replies to other people's tweets.
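
The same idea can be applied to tweets you have already collected. The sketch below assumes flattened tweets whose "referenced_tweets" entries carry a "type" of "retweeted", "replied_to", or "quoted", as in the Twitter V2 tweet object. The helper name `is_original` and the sample tweets are invented. Keeping quote tweets is a choice made in this sketch, not necessarily what the twarc2 flags do.

```python
def is_original(tweet):
    """Return True when a tweet is neither a retweet nor a reply."""
    ref_types = {ref["type"] for ref in tweet.get("referenced_tweets", [])}
    return not ref_types & {"retweeted", "replied_to"}


# Invented flattened tweets for illustration.
sample_tweets = [
    {"id": "1", "text": "my own thought"},
    {"id": "2", "text": "RT @someone: hello",
     "referenced_tweets": [{"type": "retweeted", "id": "9"}]},
    {"id": "3", "text": "@someone agreed",
     "referenced_tweets": [{"type": "replied_to", "id": "8"}]},
    {"id": "4", "text": "worth reading",
     "referenced_tweets": [{"type": "quoted", "id": "7"}]},
]

original_ids = [tweet["id"] for tweet in sample_tweets if is_original(tweet)]
print(original_ids)
```

Filtering locally like this only narrows what was already harvested; it cannot reach further back in the timeline.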

In [None]:
# maybe the last challenge for ep. 3 is examining this shorter file?

# Episode 4

In [None]:
!twarc2 search --limit 500 "#catsofinstagram" hashtagcats.jsonl

In [None]:
!twarc2 search --limit 5000 "#catsofinstagram" hashtagcats.jsonl

In [None]:
# Use Twitter advanced search syntax (everything in quotes!)
# to get tailored results
!twarc2 search --limit 800 "(cute OR fluffy OR haircut) (#catsofinstagram) lang:en" kittens.jsonl
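
Longer queries are easy to mistype, so one option is to assemble the string in Python and paste it into the command. This is only a sketch: `build_query` is a hypothetical helper, not part of twarc. It relies on the query syntax used above, where OR joins alternatives and a space means AND.

```python
def build_query(any_of, hashtags, lang=None):
    """Combine keywords, hashtags, and a language code into one query string."""
    parts = []
    if any_of:
        # Any one of these words may match.
        parts.append("(" + " OR ".join(any_of) + ")")
    if hashtags:
        # Every hashtag must be present; add the "#" if it is missing.
        parts.append("(" + " ".join("#" + tag.lstrip("#") for tag in hashtags) + ")")
    if lang:
        parts.append(f"lang:{lang}")
    return " ".join(parts)


query = build_query(["cute", "fluffy", "haircut"], ["catsofinstagram"], lang="en")
print(query)

# Wrap the whole query in double quotes when passing it to the shell.
print(f'twarc2 search --limit 800 "{query}" kittens.jsonl')
```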

### Converting to csv and dataframes

In [None]:
!twarc2 csv source_data/taxday.jsonl output_data/taxday.csv

In [None]:
!twarc2 csv kittens.jsonl output_data/kittens.csv

In [None]:
kittens_df = pandas.read_csv("output_data/kittens.csv")

In [None]:
kittens_df

# Episode 5: Ethics & Twitter

# Episode 6: Working with our Tweets

In [None]:
# Episode 6
# Make a utils folder and download the scripts from the repo
# (we will make this folder for learners):
# https://github.com/DocNow/twarc/tree/main/utils

!python utils/wall.py source_data/taxday.jsonl > output_data/taxday_wall.html

In [None]:
# What does this do?
# It shows the tweet IDs of the retweets in your dataset, and how many times each was retweeted
!python utils/retweets.py source_data/taxday.jsonl > output_data/taxday_retweets.csv

In [None]:
# how much is that?
tax_retweets_df = pandas.read_csv("output_data/taxday_retweets.csv", names=["tweetid", "retweets"])
tax_retweets_df.head()
tax_retweets_df["retweets"].plot(kind = "hist")


In [None]:
retweet_total = sum(tax_retweets_df["retweets"])


In [None]:
tax_retweets_df["retweets"].plot(kind = "hist", loglog=True)
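
Retweet counts often span several orders of magnitude, which is why a log scale helps. As a rough, plot-free sketch of the same idea, the counts can be grouped by order of magnitude. The numbers here are invented; real values would come from the "retweets" column.

```python
from collections import Counter

# Invented retweet counts for illustration.
retweet_counts = [1, 1, 2, 3, 5, 8, 12, 40, 150, 2300]

retweet_total = sum(retweet_counts)

# Bin by number of digits: 1-9 -> 0, 10-99 -> 1, 100-999 -> 2, and so on.
magnitudes = Counter(len(str(count)) - 1 for count in retweet_counts if count > 0)

for power in sorted(magnitudes):
    low, high = 10 ** power, 10 ** (power + 1) - 1
    print(f"{low:>5}-{high:<5} {'#' * magnitudes[power]}")
print("total retweets:", retweet_total)
```

Most tweets land in the lowest bin while a few account for most of the total, which is the shape a linear histogram tends to hide.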

# Episode 7: Search and Filter

# Episode 8: Analysis Tools

In [None]:
# Episode 9: Data Management

In [None]:
# Episode 10: Don't Map Twitter