<img src="https://raw.githubusercontent.com/computationaljournalism/columbia2020/master/images/cl.jpeg" width=800>

<br>
<br>

**From last time - looking at the tweets**

Using the Premium API we pulled tweets per day containing the hashtag `#MayorCheat`. Most of the action took place on the 4th of February, so let's start there. We have a file of just over 100k tweets from that day, each one containing the term #MayorCheat. We have put them up on [Dropbox](https://www.dropbox.com/s/x1alcns5mxga60c/mayorcheat_202002040000_202002050000.json?dl=0). Download the file and put it in the same folder as this notebook. 

Recall that each line in the file is a JSON string representing a tweet from February 4 containing the hashtag `#MayorCheat`. Let's read the data into a list, one tweet-string per entry.

In [None]:
# read every line (one tweet-string each) and close the file when done
with open("mayorcheat_202002040000_202002050000.json") as f:
    day1 = f.readlines()

In [None]:
len(day1)
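
To make the "one JSON string per tweet" idea concrete, here is a minimal sketch that parses a single line. The string below is made up and heavily trimmed (real tweets carry many more fields), and the screen names `example_a` and `example_b` are hypothetical.

```python
from json import loads

# A made-up, heavily trimmed tweet string; real tweets have many more fields
line = ('{"created_at": "Tue Feb 04 07:12:33 +0000 2020", '
        '"text": "RT @example_b: hello #MayorCheat", '
        '"user": {"screen_name": "example_a", "followers_count": 10}, '
        '"retweeted_status": {"text": "hello #MayorCheat", '
        '"user": {"screen_name": "example_b"}}}')

tweet = loads(line)                      # JSON string -> Python dictionary
author = tweet["user"]["screen_name"]    # nested dictionaries are indexed step by step

# Only retweets carry a "retweeted_status" entry
if "retweeted_status" in tweet:
    original_author = tweet["retweeted_status"]["user"]["screen_name"]
else:
    original_author = ""

print(author, "retweeted", original_author)
```

The same lookups, applied to every string in `day1`, are what the flattening step relies on.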

We tried to make some of the information in the file easier to work with by flattening it into a DataFrame. We created a CSV [located here](https://github.com/computationaljournalism/columbia2020/raw/master/data/mc/mayorcheat_all_04.csv.gz) so you don't have to do the steps below. But you should see what we did and think about why it works.

In [None]:
# don't need to execute this one

#from json import loads

#build = []

#for tweet_str in day1:

#    tweet = loads(tweet_str)

#    who_rt = ""
#    text_rt = ""
    
#    if "retweeted_status" in tweet:
#        who_rt = tweet["retweeted_status"]["user"]["screen_name"]
#        text_rt = tweet["retweeted_status"]["text"]
        
#    newdata = {"created_at":tweet["created_at"],
#               "screen_name":tweet["user"]["screen_name"],
#               "text":tweet["text"],
#               "followers_count":tweet["user"]["followers_count"],
#               "friends_count":tweet["user"]["friends_count"],
#               "retweeted_user":who_rt,
#               "retweeted_text":text_rt}
    
#    build.append(newdata)
               
#from pandas import DataFrame

# build a dataframe and output a CSV
#day1_df = DataFrame(build)
#day1_df.to_csv("mayorcheat_all_04.csv")

In [None]:
from pandas import read_csv

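# the linked file is gzipped; if you did not unzip it,
# read_csv("mayorcheat_all_04.csv.gz") works too (compression is inferred from the extension)
day1 = read_csv("mayorcheat_all_04.csv")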
day1.head()
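
Later cells use an `eastern` column, which the flattening code above does not create. If your copy of the data lacks it, one way to derive it is sketched below. This assumes `created_at` holds Twitter's usual UTC timestamp layout, and it uses a toy DataFrame in place of `day1`; it is not necessarily how the column in the shared CSV was made.

```python
import pandas as pd

# Toy rows standing in for day1; created_at is assumed to be in Twitter's UTC layout
toy = pd.DataFrame({"created_at": ["Tue Feb 04 07:12:33 +0000 2020",
                                   "Tue Feb 04 13:45:00 +0000 2020"]})

# Parse the text into timezone-aware (UTC) timestamps
utc = pd.to_datetime(toy["created_at"], format="%a %b %d %H:%M:%S %z %Y")

# Convert to New York time, then store as text so that comparisons against
# strings like "2020-02-04 03:00:00" behave as expected
toy["eastern"] = (utc.dt.tz_convert("America/New_York")
                     .dt.strftime("%Y-%m-%d %H:%M:%S"))

print(toy)
```

Storing the result as text in year-month-day order means that ordinary string comparison sorts the values chronologically.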

We didn't cover this last time, but the `.str` accessor on a DataFrame column lets us apply string methods to an entire column at once. Here we test which tweets have text containing `"Cernovich"`.

In [None]:
from pandas import set_option
# show the full tweet text; on pandas 1.0 and later use None (-1 is deprecated)
set_option('display.max_colwidth', None)

day1[day1["text"].str.contains("Cernovich")]
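
As a small illustration of the `.str` idea, the sketch below runs `str.contains` on a toy column. The rows are made up; `case` and `na` are real parameters of `str.contains`, used here to ignore capitalization and to treat missing text as a non-match.

```python
import pandas as pd

# Made-up rows, including one with missing text
toy = pd.DataFrame({"text": ["RT @someone: #MayorCheat",
                             "nothing to see",
                             None,
                             "mayorcheat again"]})

# case=False ignores capitalization; na=False counts missing text as "no match"
mask = toy["text"].str.contains("mayorcheat", case=False, na=False)

# A True/False column can be used to pick out rows
matches = toy[mask]
print(matches)
```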

Now, let's look at the entries in our DataFrame that represent retweets and pull them into a separate structure. We use `.copy(deep=True)` to create an entirely independent copy of the data, so whatever changes we make stay with the copy and leave `day1` untouched.

In [None]:
retweets = day1[~day1["retweeted_user"].isnull()].copy(deep=True)
retweets.shape
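
Here is a toy version of the same subset-and-copy step. The data is made up. The point is that a column added to the copy does not appear in the original; in many pandas versions the explicit copy also avoids a `SettingWithCopyWarning` when you add that column.

```python
import pandas as pd

# Made-up rows: the middle one is not a retweet
toy = pd.DataFrame({"screen_name": ["a", "b", "c"],
                    "retweeted_user": ["x", None, "y"]})

# Keep only rows with a retweeted user, as an independent copy
subset = toy[~toy["retweeted_user"].isnull()].copy(deep=True)
subset["fromto"] = subset["screen_name"] + " " + subset["retweeted_user"]

print(list(toy.columns))      # the original has no "fromto" column
print(list(subset.columns))   # the copy does
```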

In [None]:
retweets["retweeted_user"].value_counts()

We can now look for well-worn retweet patterns. Pasting two columns together with a space in between gives us a string that holds the person retweeting, a space, and then the person being retweeted.

In [None]:
retweets["fromto"] = retweets["screen_name"]+" "+retweets["retweeted_user"]
retweets["fromto"].value_counts()

In [None]:
from plotly.express import histogram

fig = histogram(retweets, x="eastern",nbins=200)
fig.show()

This data frame of retweets is one way to represent the activity taking place around the conversation. We can think of users as nodes in a network, with an arrow running from one user to another if the first was retweeted by the second. To build these networks, let's break time up into chunks. Here's "Hour 1", the first hour in the life of the hashtag. Simple subsetting gives us all the `retweets` rows that occurred before 3am EST.

In [None]:
hour1 = retweets[retweets["eastern"]< "2020-02-04 03:00:00"].copy(deep=True)
hour1.shape

We then looked at [graphcommons.com](http://graphcommons.com), a site for making shared network graphs. I love this tool. To upload data we need a CSV with the columns FromType, FromName, Edge, ToType, ToName, and Weight. We'll start on that below, making three new columns (FromType, ToType, and Edge) and then renaming two existing columns to FromName and ToName.

In [None]:
# FromType, FromName, Edge, ToType, ToName

hour1["FromType"] = "User"
hour1["ToType"] = "User"
hour1["Edge"] = "Retweeted by"
hour1 = hour1.rename(columns={"retweeted_user":"FromName","screen_name":"ToName"})
hour1.head()

The next bit of code is slightly advanced, but we'll narrate it and come back to it later. It takes repeated retweet events (say, someone retweets the same person 10 times) and replaces the 10 entries with a single entry having a Weight of 10.

In [None]:
hour1_weights = hour1[["FromType","FromName","Edge","ToType","ToName"]].groupby(["FromType","FromName","Edge","ToType","ToName"]).size().reset_index().rename(columns={0:'Weight'})
hour1_weights.head()
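
To see what the `groupby(...).size()` chain is doing, here is the same pattern on a tiny made-up table with only two of the columns. The names `b`, `a`, `c`, and `d` are placeholders.

```python
import pandas as pd

# Made-up edges: "a" retweeted "b" twice, "c" retweeted "b" and "d" once each
toy = pd.DataFrame({"FromName": ["b", "b", "b", "d"],
                    "ToName":   ["a", "a", "c", "c"]})

# One group per distinct (FromName, ToName) pair; size() counts the rows in each group
counts = toy.groupby(["FromName", "ToName"]).size()

# reset_index turns the group labels back into ordinary columns; the count
# column is named 0 by default, hence the rename to "Weight"
weights = counts.reset_index().rename(columns={0: "Weight"})
print(weights)
```

The two identical `b`, `a` rows collapse into one row with a Weight of 2, which is the effect we want on the real retweet data.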

We then write the result to `hour1.csv`, which we can read into Graph Commons.

In [None]:
hour1_weights.to_csv("hour1.csv",index=False)

We have done this for the first seven hours of the event, which takes us roughly up to the first peak. You can either download the data in CSV format and upload it to Graph Commons, or use the graphs I have created, linked here. The code for each CSV is in a separate cell, although wouldn't a loop be better?!?!


* [Hour 1](https://graphcommons.com/graphs/57a029ac-3eea-4ff9-ada4-1d4b3b0fd171)
* [Hour 2](https://graphcommons.com/graphs/9cb5b4a8-06f2-4ba1-88d4-a44f5b6d9351)
* [Hour 3](https://graphcommons.com/graphs/f93f4c4c-23f5-411a-8bd1-52f9428b7499)
* [Hour 4](https://graphcommons.com/graphs/58fd0297-a720-4567-b54f-f3a066e4d80c)
* [Hour 5](https://graphcommons.com/graphs/79fbf890-030b-4440-bd50-091a8e929680)
* [Hour 6](https://graphcommons.com/graphs/cac35b27-8289-49f9-9178-6698ad681e27)
* [Hour 7](https://graphcommons.com/graphs/c028797d-f692-4394-b9c1-7dff0dc9cb31)

We'll talk about what to do with Graph Commons next.

In [None]:
# >= on the lower bound so tweets stamped exactly 03:00:00 are not dropped
hour2 = retweets[(retweets["eastern"]< "2020-02-04 04:00:00") &
                 (retweets["eastern"]>= "2020-02-04 03:00:00")].copy(deep=True)

hour2["FromType"] = "User"
hour2["ToType"] = "User"
hour2["Edge"] = "Retweeted by"

hour2 = hour2.rename(columns={"retweeted_user":"FromName","screen_name":"ToName"})
hour2_weights = hour2[["FromType","FromName","Edge","ToType","ToName"]].groupby(["FromType","FromName","Edge","ToType","ToName"]).size().reset_index().rename(columns={0:'Weight'})
hour2_weights.to_csv("hour2.csv",index=False)

In [None]:
# >= on the lower bound so tweets stamped exactly 04:00:00 are not dropped
hour3 = retweets[(retweets["eastern"]< "2020-02-04 05:00:00") &
                 (retweets["eastern"]>= "2020-02-04 04:00:00")].copy(deep=True)

hour3["FromType"] = "User"
hour3["ToType"] = "User"
hour3["Edge"] = "Retweeted by"

hour3 = hour3.rename(columns={"retweeted_user":"FromName","screen_name":"ToName"})
hour3_weights = hour3[["FromType","FromName","Edge","ToType","ToName"]].groupby(["FromType","FromName","Edge","ToType","ToName"]).size().reset_index().rename(columns={0:'Weight'})
hour3_weights.to_csv("hour3.csv",index=False)

In [None]:
# >= on the lower bound so tweets stamped exactly 05:00:00 are not dropped
hour4 = retweets[(retweets["eastern"]< "2020-02-04 06:00:00") &
                 (retweets["eastern"]>= "2020-02-04 05:00:00")].copy(deep=True)

hour4["FromType"] = "User"
hour4["ToType"] = "User"
hour4["Edge"] = "Retweeted by"

hour4 = hour4.rename(columns={"retweeted_user":"FromName","screen_name":"ToName"})
hour4_weights = hour4[["FromType","FromName","Edge","ToType","ToName"]].groupby(["FromType","FromName","Edge","ToType","ToName"]).size().reset_index().rename(columns={0:'Weight'})
hour4_weights.to_csv("hour4.csv",index=False)

In [None]:
# >= on the lower bound so tweets stamped exactly 06:00:00 are not dropped
hour5 = retweets[(retweets["eastern"]< "2020-02-04 07:00:00") &
                 (retweets["eastern"]>= "2020-02-04 06:00:00")].copy(deep=True)

hour5["FromType"] = "User"
hour5["ToType"] = "User"
hour5["Edge"] = "Retweeted by"

hour5 = hour5.rename(columns={"retweeted_user":"FromName","screen_name":"ToName"})
hour5_weights = hour5[["FromType","FromName","Edge","ToType","ToName"]].groupby(["FromType","FromName","Edge","ToType","ToName"]).size().reset_index().rename(columns={0:'Weight'})
hour5_weights.to_csv("hour5.csv",index=False)

In [None]:
# hour 6 covers 07:00 to 08:00 (the window must differ from hour 5's)
hour6 = retweets[(retweets["eastern"]< "2020-02-04 08:00:00") &
                 (retweets["eastern"]>= "2020-02-04 07:00:00")].copy(deep=True)

hour6["FromType"] = "User"
hour6["ToType"] = "User"
hour6["Edge"] = "Retweeted by"

hour6 = hour6.rename(columns={"retweeted_user":"FromName","screen_name":"ToName"})
hour6_weights = hour6[["FromType","FromName","Edge","ToType","ToName"]].groupby(["FromType","FromName","Edge","ToType","ToName"]).size().reset_index().rename(columns={0:'Weight'})
hour6_weights.to_csv("hour6.csv",index=False)

In [None]:
# hour 7 covers 08:00 to 09:00 (the window must differ from hour 5's)
hour7 = retweets[(retweets["eastern"]< "2020-02-04 09:00:00") &
                 (retweets["eastern"]>= "2020-02-04 08:00:00")].copy(deep=True)

hour7["FromType"] = "User"
hour7["ToType"] = "User"
hour7["Edge"] = "Retweeted by"

hour7 = hour7.rename(columns={"retweeted_user":"FromName","screen_name":"ToName"})
hour7_weights = hour7[["FromType","FromName","Edge","ToType","ToName"]].groupby(["FromType","FromName","Edge","ToType","ToName"]).size().reset_index().rename(columns={0:'Weight'})
hour7_weights.to_csv("hour7.csv",index=False)
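
Since the hourly cells repeat the same steps, here is one possible way to fold them into a loop. This is a sketch, not the code used to make the linked graphs. The function name `hourly_edge_lists` and its parameters are made up for illustration, and it assumes `eastern` holds text timestamps like `"2020-02-04 03:00:00"`. It runs on a small made-up table so you can check the output by eye.

```python
import pandas as pd

def hourly_edge_lists(retweets, first_end="2020-02-04 03:00:00", n_hours=7):
    """Return {hour number: weighted edge list} for consecutive one-hour windows.

    Hour 1 is everything before first_end; hour k (k > 1) is the hour ending
    k - 1 hours after first_end, with the start of the window included.
    """
    fmt = "%Y-%m-%d %H:%M:%S"
    first_end = pd.Timestamp(first_end)
    edge_cols = ["FromType", "FromName", "Edge", "ToType", "ToName"]
    results = {}
    for k in range(1, n_hours + 1):
        upper = (first_end + pd.Timedelta(hours=k - 1)).strftime(fmt)
        in_window = retweets["eastern"] < upper
        if k > 1:
            lower = (first_end + pd.Timedelta(hours=k - 2)).strftime(fmt)
            in_window &= retweets["eastern"] >= lower
        hour = retweets[in_window].rename(
            columns={"retweeted_user": "FromName", "screen_name": "ToName"})
        hour = hour.assign(FromType="User", ToType="User", Edge="Retweeted by")
        # Collapse repeated edges into one row with a Weight
        results[k] = hour.groupby(edge_cols).size().reset_index(name="Weight")
    return results

# Made-up retweets: "a" retweets "b" twice in hour 1; "c" retweets "b" and "d" in hour 2
toy = pd.DataFrame({
    "eastern": ["2020-02-04 02:10:00", "2020-02-04 02:20:00",
                "2020-02-04 03:00:00", "2020-02-04 03:30:00"],
    "screen_name":    ["a", "a", "c", "c"],
    "retweeted_user": ["b", "b", "b", "d"],
})

edges = hourly_edge_lists(toy, n_hours=2)
print(edges[1])
print(edges[2])
```

With the real `retweets` data you would call `hourly_edge_lists(retweets)` and then write each table out, for example with `edges[k].to_csv(f"hour{k}.csv", index=False)` inside a loop over `edges`.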

This code is here to help you skim through the users as you identify them, going back to the original data. You can also skim through the hourly files you built your network graphs from.

In [None]:
day1[day1["screen_name"]=="jonrocks69"]