# Twitter with twarc
A UCSB original Carpentry workshop

You should have a hashtag_gasprices.jsonl file. We made it for you using the code below.

In [None]:
# we made this file for you
# ! twarc2 search "#gasprices" > raw_data/hashtag_gasprices.jsonl




In [None]:
# administrivia
# upon re-start we need to install twarc2 and other extensions
! pip install twarc-csv
! pip install emoji

# Episode 2


In [None]:
# BASH commands start with a BANG!
!twarc2 --help

In [None]:
#  what libraries will we need to be loading in our notebook?
#  we need to always distinguish between 
#  running BASH vs. running a line of python.

import pandas
import twarc_csv
import textblob
import nltk
import os
import emoji

In [None]:
# this comes into play for ep 8
!python -m textblob.download_corpora
nltk.download('stopwords')

In [None]:
# and of course, it's important to know where we are working
# I can send a BASH command from my notebook with a !:
!pwd

In [None]:
# you can also do this with Python
os.getcwd()

In [None]:
# we can change if we need
# os.chdir(".....")

In [None]:
os.getcwd()

## Running twarc
Let's get the timeline of one of twarc's creators.

In [None]:
# !twarc2 timeline BergisJules > raw_data/bjules.jsonl

### Challenge 1
- Can you find the file called “bjules_flat.jsonl”?
- How many tweets did you get from Bergis? (we can't tell without flattening or looking at the output)
- Download a timeline for a person of your choice. How many tweets did you get? 
- What’s the oldest one?

In [None]:
# !twarc2 timeline ecodatasci > raw_data/ecodatasci.jsonl
! twarc2 flatten raw_data/ecodatasci.jsonl > output_data/ecodatasci_flat.jsonl
! wc output_data/ecodatasci_flat.jsonl

A straight harvest using search or stream doesn't need to be flattened 
to do our most basic analysis: wc. Do gas prices here?

## To flatten or not flatten

### Make your jsonl 1 tweet per line
Flattening will let you do our most basic unix-y analysis, turn
timelines into countable lists, and enable you to run twarc1
utilities later on in the workshop.
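
To see why line counts mislead before flattening, here is a minimal sketch in plain Python. It assumes that each line of an unflattened file is one API response page whose `data` key holds the tweets; the sample pages and the function name `count_tweets` are made up for illustration.

```python
import json

def count_tweets(lines):
    """Count tweets in unflattened output, one response page per line."""
    total = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        page = json.loads(line)
        data = page.get("data", [])
        # Some endpoints return a single object rather than a list.
        total += len(data) if isinstance(data, list) else 1
    return total

# Two made-up pages: 2 lines on disk, but 5 tweets inside.
sample_pages = [
    json.dumps({"data": [{"id": "1"}, {"id": "2"}, {"id": "3"}], "meta": {"result_count": 3}}),
    json.dumps({"data": [{"id": "4"}, {"id": "5"}], "meta": {"result_count": 2}}),
]

print(len(sample_pages), "lines,", count_tweets(sample_pages), "tweets")
```

With a real file, passing an open file handle (for example `open("raw_data/bjules.jsonl")`) in place of `sample_pages` should work the same way, since the function only iterates over lines.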

In [None]:
# timeline objects need to be flattened in order to be analyzed as tweets
!twarc2 flatten raw_data/bjules.jsonl output_data/bjules_flat.jsonl

## Convert to csv

In [None]:
!twarc2 csv raw_data/bjules.jsonl output_data/bjules.csv

## When we look at bjules, we really do need to flatten it.

In [None]:
! wc raw_data/bjules.jsonl

33 lines doesn't mean 33 tweets. I suspected there was more there because
I got an error message about hitting a limit of 3200. 

And below, the csv converter tells us there are 3143 tweets.

In [None]:
# convert
!twarc2 csv raw_data/bjules.jsonl output_data/bjules.csv

In [None]:
# once I flatten it, my wc will show the correct number
! wc output_data/bjules_flat.jsonl

In [None]:
# When I did this, I got 3166 tweets (as opposed to the 33 lines that the original file was)
! wc output_data/bjules_flat.jsonl
! wc output_data/bjules.csv

The csv is 1 line longer because it has column headers.
twarc2 csv takes flat or unflattened Twitter data files.
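
A small sketch of the header arithmetic, using only Python's `csv` module. The CSV text is invented, and the embedded line break is an assumption added to show a general CSV caveat: if a text field contains a newline, `wc` counts more lines than there are tweets, while a CSV parser still counts records correctly. Whether your own file contains such fields depends on how it was produced.

```python
import csv
import io

# Made-up CSV text standing in for a file such as output_data/bjules.csv.
csv_text = 'id,text\n1,"first tweet"\n2,"a tweet with\na line break"\n'

rows = list(csv.reader(io.StringIO(csv_text)))
header, records = rows[0], rows[1:]
line_count = csv_text.count("\n")  # what `wc -l` would report

print("lines:", line_count)
print("tweets:", len(records))
```

In the notebook itself, `len(pandas.read_csv(...))` gives the same record count, because pandas also treats the first row as the header.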

### Challenge 2

In [None]:
# commented line is a solution to challenge 1
# !twarc2 timeline ecodatasci > raw_data/ecodatasci.jsonl

!twarc2 flatten raw_data/ecodatasci.jsonl > output_data/ecodatasci_flat.jsonl
!twarc2 csv output_data/ecodatasci_flat.jsonl > output_data/ecodatasci.csv 
ecodatasci_df = pandas.read_csv("output_data/ecodatasci.csv")


# Episode 3: examining tweets
What comes along with a tweet
- Look at one_tweet in Jupyter viewer
- Look at one_tweet with nano
- Look at tweet as csv
- Look at all the entities of a tweet
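
For a quick look at what comes along with a tweet without leaving Python, the sketch below parses one JSON line and lists its fields. The record is made up and heavily trimmed; it only imitates the general shape of a flattened Twitter API v2 tweet, so expect many more fields in real data.

```python
import json

# A made-up, heavily trimmed tweet in the shape of a flattened API v2 record.
one_line = json.dumps({
    "id": "1",
    "text": "Filling up today #gasprices",
    "author_id": "42",
    "created_at": "2022-05-01T12:00:00.000Z",
    "entities": {"hashtags": [{"start": 17, "end": 27, "tag": "gasprices"}]},
    "public_metrics": {"retweet_count": 3, "like_count": 7},
})

tweet = json.loads(one_line)
top_level_fields = sorted(tweet.keys())
# Entities are optional, so fall back to empty containers.
hashtags = [h["tag"] for h in tweet.get("entities", {}).get("hashtags", [])]

print(top_level_fields)
print(hashtags)
```

To try it on workshop data, read the first line of a flattened file such as `output_data/one_tweet_flat.jsonl` instead of building `one_line` by hand.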

In [None]:
! wc raw_data/hashtag_gasprices.jsonl

! twarc2 flatten raw_data/hashtag_gasprices.jsonl > output_data/hashtag_gasprices_flat.jsonl

! wc output_data/hashtag_gasprices_flat.jsonl

In [None]:
### Let's look at a single tweet as a csv:
!twarc2 flatten raw_data/one_tweet.jsonl output_data/one_tweet_flat.jsonl
!twarc2 csv output_data/one_tweet_flat.jsonl output_data/one_tweet.csv




In [None]:
!head -n 2 'output_data/hashtag_gasprices_flat.jsonl' > 'output_data/4_tweets.jsonl'
!tail -n 2 'output_data/hashtag_gasprices_flat.jsonl' >> 'output_data/4_tweets.jsonl'

In [None]:
! cat output_data/4_tweets.jsonl

## Next harvest
Next we'll get just Bergis' original content. 
In other words, only the tweets that he wrote, not
any retweets or replies to other people's tweets.

Can we go back further on his timeline by looking
only for Bergis's original content?

Not really--it looks like the limit applies to the search,
not the results. 


In [None]:
# !twarc2 timeline BergisJules --exclude-retweets --exclude-replies > raw_data/bjules_original.jsonl

In [None]:
!twarc2 flatten raw_data/bjules_original.jsonl output_data/bjules_original_flat.jsonl


But this does tell us that Jules is a prolific re-tweeter and/or replier. 
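
The same comparison can be sketched locally on flattened data. This assumes each flattened tweet may carry a `referenced_tweets` list whose entries have a `type` of `retweeted`, `replied_to`, or `quoted`, as in Twitter API v2; the records and the helper `is_original` are invented for illustration. Quote tweets are counted as original here, which is a choice, not a rule.

```python
import json

def is_original(tweet):
    """True when a tweet is neither a retweet nor a reply."""
    kinds = {ref.get("type") for ref in tweet.get("referenced_tweets", [])}
    return not ({"retweeted", "replied_to"} & kinds)

# Made-up flattened records, one JSON object per line.
flat_lines = [
    json.dumps({"id": "1", "text": "my own words"}),
    json.dumps({"id": "2", "text": "RT @someone: ...",
                "referenced_tweets": [{"type": "retweeted", "id": "10"}]}),
    json.dumps({"id": "3", "text": "@someone agreed",
                "referenced_tweets": [{"type": "replied_to", "id": "11"}]}),
    json.dumps({"id": "4", "text": "worth reading",
                "referenced_tweets": [{"type": "quoted", "id": "12"}]}),
]

tweets = [json.loads(line) for line in flat_lines]
originals = [t for t in tweets if is_original(t)]
print(f"{len(originals)} of {len(tweets)} tweets are original")
```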

In [None]:
! wc output_data/bjules_original_flat.jsonl
! wc output_data/bjules_flat.jsonl

In [None]:
# save it as a csv so we can easily see the original writings of Jules
!twarc2 csv output_data/bjules_original_flat.jsonl output_data/bjules_original.csv

# Episode 4

In [None]:
# fishing around for good searches
# you can count without harvesting.
# kittens is an evergreen search. you should always see at least
# dozens of mentions per hour
!twarc2 counts --text "kittens"

In [None]:
# recent search with granularity.
!twarc2 counts --granularity "day" --text "(#UCSBLibrary OR UCSBLibrary OR ucsblibrary OR #ucsblibrary OR davidsonlibrary OR #davidsonlibrary)"

In [None]:
# phrase searching: wrap the whole query in single quotes so the shell
# keeps the inner double quotes around the phrase
!twarc2 counts --granularity "day" --text '(#UCSB OR UCSB OR ucsb OR "UC Santa Barbara")'
!twarc2 counts --granularity "day" --text '"uc santa barbara"'
!twarc2 counts --granularity "day" --text "(#ucsb)"
!twarc2 counts --granularity "day" --text "(UCSB)"

In [None]:
# search for hashtags when you really want hashtags. 
# searching for a string returns both plain text and hashtags (an OR)
# NOT case sensitive
!twarc2 counts --granularity "day" --text "(#UCSB OR UCSB OR ucsb)"
!twarc2 counts --granularity "day" --text "(#ucsb)"
!twarc2 counts --granularity "day" --text "(UCSB)"

In [None]:
## Endpoints: counts
!twarc2 counts --text "(Poker OR poker)" --granularity "day"
!twarc2 counts --text "(Golf OR golf)" --granularity "day"
!twarc2 counts --text "(Basketball OR basketball)" --granularity "day"
!twarc2 counts --text "(Baseball OR baseball)" --granularity "day"
!twarc2 counts --text "(Football OR football)" --granularity "day"

In [None]:
## What's a lot?
!twarc2 counts --text "dog" --granularity "day"
!twarc2 counts --text "cat" --granularity "day"
!twarc2 counts --text "amazon" --granularity "day"
!twarc2 counts --text "right" --granularity "day"
!twarc2 counts --text "good" --granularity "day"


In [None]:
## a SFW timeline
# !twarc2 timeline ucsblibrary raw_data/library_timeline.jsonl
!twarc2 flatten raw_data/library_timeline.jsonl output_data/library_timeline_flat.jsonl
!twarc2 csv output_data/library_timeline_flat.jsonl output_data/library_timeline_flat.csv
library_timeline_df = pandas.read_csv("output_data/library_timeline_flat.csv")

In [None]:
# confirm the dataframe's existence
len(library_timeline_df)

In [None]:
# and view all column headers
list(library_timeline_df.columns)

## final challenge: Cats of Instagram
Let’s make a bigger datafile. Harvest 5000 tweets that use the hashtag “catsofinstagram” and put the dataset through the pipeline to answer the following questions:

- Did you get exactly 5000?
- How far back in time did you get?
- What is the most re-tweeted recent tweet on #catsofinstagram?
- Which person has the most number of followers in your dataset?
- Is it really a person?

In [None]:
# !twarc2 search --limit 5000 "#catsofinstagram" raw_data/hashtagcats.jsonl
!twarc2 flatten raw_data/hashtagcats.jsonl output_data/hashtagcats_flat.jsonl
!twarc2 csv raw_data/hashtagcats.jsonl > output_data/hashtagcats.csv
hashtagcats_df = pandas.read_csv("output_data/hashtagcats.csv")
! wc output_data/hashtagcats.csv
hashtagcats_df["created_at"].head()

In [None]:
hashtagcats_df["created_at"].tail()

In [None]:
hashtagcats_df

In [None]:
list(hashtagcats_df.columns)

In [None]:
# what dataframes do we have at this point?
%whos DataFrame

# Episode 5: Ethics & Twitter

In [None]:
# what dataframes do we have?
%who DataFrame
# which column is the text?

In [None]:
# our first full-text analysis
# a list of words with TextBlob

# first we need to munge the data. remember from:
# list(library_timeline_df.columns)
# the tweet is library_timeline_df['text']

# TextBlob has its own data format.

# break the tweets' text column into a list, 
# then .join into one long string 
library_string = ' '.join(library_timeline_df['text'].tolist())
# turn the string into a blob
library_blob = textblob.TextBlob(library_string)


In [None]:
#This produces a mess when we output it. 
(library_blob)

In [None]:
# Let's count the words and sort by their frequency of use:
library_freq = library_blob.word_counts
library_sorted_freq = sorted(library_freq.items(), 
                             key = lambda kv: kv[1], 
                             reverse = True)
# slices start at 0, so [:25] keeps the most frequent word too
library_sorted_freq[:25]

We can at least get the English stopwords out, but this didn't really produce anything cleaner:
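
As a library-free sketch of the same idea, the snippet below counts words with `collections.Counter` and drops stopwords. The stopword set is a tiny hand-picked stand-in, not the NLTK list, and the regular expression is one possible tokenizer that keeps `#` and `@` so hashtags and handles survive; both are assumptions for illustration.

```python
import re
from collections import Counter

# A tiny hand-picked stopword set, for illustration only.
STOPWORDS = {"the", "a", "to", "and", "is", "at", "of", "in", "on", "our"}

def top_words(text, n=3):
    """Return the n most common lowercase words that are not stopwords."""
    words = re.findall(r"[a-z0-9#@']+", text.lower())
    kept = [w for w in words if w not in STOPWORDS]
    return Counter(kept).most_common(n)

sample = "The library is open. Visit the library and the archive at the Library!"
print(top_words(sample))
```

Lowercasing and stripping punctuation before counting means "Library!" and "library" land in the same bucket, which a plain `split()` on whitespace does not do.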

In [None]:
# load the stopwords to use
from nltk.corpus import stopwords
sw_nltk = stopwords.words('english')

In [None]:
# create a new text list that does
# NOT contain stopwords
library_str_stopped = [word for word in library_string.split() 
                       if word.lower() not in sw_nltk]
library_words_stopped = " ".join(library_str_stopped)

In [None]:
library_blob_stopped = textblob.TextBlob(library_words_stopped)
library_blob_stopped_freq = library_blob_stopped.word_counts
library_blob_stopped_sorted_freq = sorted(library_blob_stopped_freq.items(), 
                             key = lambda kv: kv[1], 
                             reverse = True)
library_blob_stopped_sorted_freq[:50]

In [None]:
# a more meaningful segment
library_blob_stopped_sorted_freq[7:57]

In [None]:
# Challenge: for the Python whizzes. #FIXME
# do that in a tidy way?
# what do pandas pipes look like?

# Just the words, hold the gore

## Challenge: Insta-rrectionists

In [None]:
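# how long is this?
# assuming the file is a bare list of tweet IDs with no header row,
# header=None keeps the first ID from being read as a column name
riots_dehydrated_df = pandas.read_csv("raw_data/dehydratedCapitolRiotTweets.txt", header=None)
len(riots_dehydrated_df)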

In [None]:
# this takes a very long time.
# !twarc2 hydrate raw_data/dehydratedCapitolRiotTweets.txt raw_data/riots.jsonl

In [None]:
# regardless of how you slice this, it's about 
# 80 % of the content that is still there.

# these are very slow, so they are commented out
# ! twarc2 flatten raw_data/riots.jsonl output_data/riots_flat.jsonl
# ! twarc2 csv output_data/riots_flat.jsonl output_data/riots_flat.csv

! wc output_data/riots_flat.jsonl
! wc output_data/riots_flat.csv


In [None]:
! head -n 10000 output_data/riots_flat.jsonl > output_data/riots10k.jsonl
! twarc2 csv output_data/riots10k.jsonl > output_data/riots10k.csv

In [None]:
riots_df = pandas.read_csv("output_data/riots10k.csv", low_memory=False)

In [None]:
riots_df.shape

In [None]:
riots_df.columns

In [None]:
# count the users
users_df = riots_df.author_id.unique()
(users_df.shape)

In [None]:
users_quoted_df = riots_df.quoted_user_id.unique()
(users_quoted_df.shape)

# Episode 6: Search and Filter

In [None]:
# use Twitter advanced search syntax (everything in quotes!)
# to get tailored results
# !twarc2 search --limit 800 "(cute OR fluffy OR haircut) (#catsofinstagram) lang:en" raw_data/kittens.jsonl
!twarc2 csv raw_data/kittens.jsonl output_data/kittens.csv

In [None]:
kittens_df = pandas.read_csv("output_data/kittens.csv")

In [None]:
kittens_df

In [None]:
list(kittens_df.columns)

# Search
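
A query that contains a quoted phrase, like the "UC Santa Barbara" counts in Episode 4, is easy to break on its way through the shell. The sketch below builds such a query in Python and lets `shlex.quote` handle the outer quoting. The helper `build_query` is hypothetical, and the printed command is only a string; nothing is sent to Twitter.

```python
import shlex

def build_query(terms):
    """Join search terms with OR, wrapping multi-word phrases in double quotes."""
    parts = [f'"{t}"' if " " in t else t for t in terms]
    return "(" + " OR ".join(parts) + ")"

query = build_query(["#UCSB", "UCSB", "UC Santa Barbara"])
# shlex.quote wraps the whole query in single quotes so the shell
# passes the inner double quotes through untouched.
command = "twarc2 counts --granularity day --text " + shlex.quote(query)
print(command)
```

Splitting the command back apart with `shlex.split` should return the query as a single argument, which is a cheap way to check the quoting before running anything.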

In [None]:
# ! twarc2 search "(to:tinycarebot)" raw_data/tinycarebot_mentions.jsonl

# Stream

In [None]:
!twarc2 stream-rules add "#WorldGothDay"

In [None]:
!twarc2 stream-rules add "gothcats"

In [None]:
# press the square to interrupt this!

In [None]:
# !twarc2 stream > "raw_data/streamed_goth.jsonl"

In [None]:
! wc raw_data/streamed_goth.jsonl
! twarc2 flatten raw_data/streamed_goth.jsonl > output_data/streamed_goth_flat.jsonl
! wc output_data/streamed_goth_flat.jsonl 

In [None]:
!twarc2 stream-rules delete "#WorldGothDay"
!twarc2 stream-rules delete "gothcats"

# Retweets vs. tweets
How much original content is there?
Do this for both the library timeline and #catsofinstagram.

In [None]:
# via pandas and plotting
retweet_count = hashtagcats_df["referenced_tweets.retweeted.id"].value_counts()
sum(retweet_count)


In [None]:
(sum(retweet_count) / len(hashtagcats_df))

78% of the tweets that used #catsofinstagram were retweets.
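
The share is simply the number of rows with a value in the `referenced_tweets.retweeted.id` column divided by the total number of rows. Here is a minimal sketch of that arithmetic with the standard `csv` module; the rows are invented, so the resulting percentage is not the workshop's figure.

```python
import csv
import io

RETWEET_COLUMN = "referenced_tweets.retweeted.id"

# Made-up rows standing in for output_data/hashtagcats.csv.
csv_text = (
    "id,referenced_tweets.retweeted.id\n"
    "1,100\n"
    "2,\n"
    "3,100\n"
    "4,101\n"
)

rows = list(csv.DictReader(io.StringIO(csv_text)))
# An empty cell means the tweet is not a retweet.
retweets = sum(1 for row in rows if row[RETWEET_COLUMN])
share = retweets / len(rows)
print(f"{retweets} of {len(rows)} rows are retweets ({share:.0%})")
```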

In [None]:
# so our pipeline on a stream would look like:
# stream -> flatten -> csv -> pandas.read_csv, as in the cells above


# Episode 7: twarc plug-ins

In [None]:
# this reminds you what DataFrames you have in memory
%who DataFrame

In [None]:
!pip install twarc-hashtags

In [None]:
!pip install twarc-network

In [None]:
!twarc2 retweeted-by 1522543998996414464 > 'raw_data/tinycarebot_rtby.jsonl'
!twarc2 flatten raw_data/tinycarebot_rtby.jsonl > output_data/tinycarebot_rtby_flat.jsonl

In [None]:
!twarc2 hashtags raw_data/hashtagcats.jsonl output_data/hashtagcats_hashtags.csv

In [None]:
# print the file's contents in the cell
print(open("output_data/hashtagcats_hashtags.csv").read())

In [None]:
!twarc2 hashtags raw_data/ucsblibrary_mentions.jsonl output_data/ucsblibrary_mentions_hashtags.csv

In [None]:
!twarc2 network raw_data/hashtagcats.jsonl output_data/hashtagcats_network.html

In [None]:
# this is slow and uses quota
# !twarc2 followers --limit 1 tinycarebot >  'raw_data/tcb_followers.jsonl'

In [None]:
!twarc2 flatten raw_data/tcb_followers.jsonl > output_data/tcb_followers_flat.jsonl

In [None]:
! wc output_data/tcb_followers_flat.jsonl

In [None]:
# tiny challenge
# look at the help!
! twarc2 csv --input-data-type users output_data/tcb_followers_flat.jsonl > output_data/tcb_followers.csv
# csv doesn't work on profiles.

In [None]:
# ! twarc2 mentions ucsblibrary raw_data/ucsblibrary_mentions.jsonl
# what comes next now that users csv is fixed!????? #FIXME
! twarc2 csv raw_data/ucsblibrary_mentions.jsonl output_data/ucsblibrary_mentions.csv 
ucsb_library_mentions_df = pandas.read_csv('output_data/ucsblibrary_mentions.csv')

In [None]:
!twarc2 network raw_data/hashtag_gasprices.jsonl output_data/hashtag_gasprices_network.html

# Episode 8: Python text analysis

### Sentiment Analysis
To do this, we need to do a little Python

TextBlob is a text processing library that does sentiment analysis. 
The sentiment property returns a namedtuple of the form Sentiment(polarity, subjectivity). The polarity score is a float within the range [-1.0, 1.0]. The subjectivity is a float within the range [0.0, 1.0] where 0.0 is very objective and 1.0 is very subjective.
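
To make the two numbers concrete, here is a toy sketch of a lexicon-based scorer that returns the same kind of namedtuple. It is not TextBlob's algorithm: the four-word lexicon and its scores are invented, and TextBlob's default analyzer draws on a much larger lexicon and is more involved than a plain average.

```python
from collections import namedtuple

Sentiment = namedtuple("Sentiment", ["polarity", "subjectivity"])

# Toy lexicon: word -> (polarity, subjectivity). Values are invented.
LEXICON = {
    "cute": (0.5, 1.0),
    "great": (0.8, 0.75),
    "awful": (-1.0, 1.0),
    "expensive": (-0.5, 0.7),
}

def toy_sentiment(text):
    """Average the scores of known words; unknown words are ignored."""
    scores = [LEXICON[w] for w in text.lower().split() if w in LEXICON]
    if not scores:
        return Sentiment(0.0, 0.0)
    polarity = sum(p for p, _ in scores) / len(scores)
    subjectivity = sum(s for _, s in scores) / len(scores)
    return Sentiment(polarity, subjectivity)

print(toy_sentiment("what a cute cat"))
print(toy_sentiment("gas is awful and expensive"))
```

Because the scores are averaged, one long string built from thousands of tweets yields a single blended value; scoring tweets one at a time and then summarizing is a different measurement.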

Before we use TextBlob for sentiment analysis, we need to download
datasets of words and their associated weights. These are called *corpora*.

In [None]:
# commented out because I put it up in ep 2
# !python -m textblob.download_corpora

In [None]:
# TextBlob needs a string, so this won't work.
# textblob.TextBlob(hashtagcats_df).sentiment

In [None]:
# even calling the column won't work:
# textblob.TextBlob(hashtagcats_df['text']).sentiment

In [None]:
# break the tweets' text column into a list, then .join into one long string 
hashtagcats_list = ' '.join(hashtagcats_df['text'].tolist())
# turn the string into a blob
hashtagcats_blob = textblob.TextBlob(hashtagcats_list)
# get the sentiment
hashtagcats_blob.sentiment

In [None]:
# what dataframes are still here?
%whos DataFrame


The overall sentiment of the language of kitty twitter is rather positive.
And the tweets tend to be subjective.

In [None]:
# What do you think the sentiment of gasprices might be?
# get the overall sentiment and see if it matches your prediction.

In [None]:
! twarc2 csv output_data/hashtag_gasprices_flat.jsonl > output_data/hashtag_gasprices_flat.csv
hashtag_gasprices_df = pandas.read_csv("output_data/hashtag_gasprices_flat.csv", low_memory=False)

In [None]:
gasprices_list = ' '.join(hashtag_gasprices_df['text'].tolist())
gasprices_blob = textblob.TextBlob(gasprices_list)
print("Hashtag Gas Prices: ") 
gasprices_blob.sentiment

In [None]:
hashtagcats_list = ' '.join(hashtagcats_df['text'].tolist())
hashtagcats_blob = textblob.TextBlob(hashtagcats_list)
print("Hashtag Cats of Instagram: ") 
hashtagcats_blob.sentiment

# Episode 9: Data Management

In [None]:
!twarc2 counts --granularity "day" --text "amazon"