# Exercise W7D1: Review and Putting it All Together

This exercise draws together the topics we have covered in the _Base Camp_ portion of the Digital Methods class. At the end of the exercise, you should have a `DataFrame` in which each row contains information on one Twitter account, including its tweets, friends, followers, hashtags, and mentions, as well as some descriptive statistics.

You will be able to reuse and modify this code in the second half of Digital Methods to download and analyze tweets for your projects. This exercise should therefore give you a solid review of what we have learned and help you for the rest of the course.

In [21]:
import tweepy
print(tweepy.__version__)

from AppCred import CONSUMER_KEY, CONSUMER_SECRET
from AppCred import ACCESS_TOKEN, ACCESS_TOKEN_SECRET

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)

auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)

api = tweepy.API(auth)

3.8.0


**Exercise 1. Identify a topic, authenticate, and get data.** First, identify a topic of interest to you and think about a keyword or hashtag capturing the topic. Possible topics could be Corona or Climate, but you are welcome to choose something else. Then load the `tweepy` module and use the built-in functionality to `search` Twitter for your keyword or hashtag. Create a variable that contains the data returned by your search.

See [here](http://docs.tweepy.org/en/latest/api.html#help-methods) for more information about the `search` method.

In [22]:
demdeb_tweets = api.search(q = "DemDebate", since="2020-03-15")



Now we have an object containing a number of tweets pertaining to our topic of interest. As you might remember, by default the Twitter API returns the data to us in JSON format. Now that we know about the elegance and beauty of `DataFrames`, we would prefer to work with that format of data rather than a dictionary-style JSON. 

**Exercise 2. Turn raw API data into a DataFrame.** Your search returned a set of tweets about your chosen topic. From your Twitter search object, construct a `DataFrame` of the people who are tweeting about that topic. At minimum, it should contain the unique `screen_name`, `followers_count`, `friends_count`, and `statuses_count` values returned by your search.

There are a number of ways to do this so you might want to review how to construct `DataFrames` (W6D1-Demo). You may also want to review navigating JSON objects (W4D2-Exercise_solutions). Also, your returned data might include the same account multiple times, so you will want to make sure that you are listing the account only once in your `DataFrame`.
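
One possible route is sketched below on hand-written dictionaries that only mimic the `user` part of a status's JSON (the account names and counts are made up, and real statuses carry many more fields). It collects one record per tweet, builds the `DataFrame`, and then drops repeated accounts with `drop_duplicates`.

```python
import pandas as pd

# Made-up stand-ins for `tweet._json`; "alice" appears twice on purpose
fake_statuses = [
    {"user": {"screen_name": "alice", "followers_count": 10,
              "friends_count": 5, "statuses_count": 100}},
    {"user": {"screen_name": "bob", "followers_count": 20,
              "friends_count": 8, "statuses_count": 50}},
    {"user": {"screen_name": "alice", "followers_count": 10,
              "friends_count": 5, "statuses_count": 100}},
]

fields = ["screen_name", "followers_count", "friends_count", "statuses_count"]

# One row per tweet, keeping only the account-level fields we need
rows = [{field: status["user"][field] for field in fields}
        for status in fake_statuses]

# Keep the first row for each account and renumber the index from 0
accounts = (pd.DataFrame(rows)
            .drop_duplicates(subset="screen_name")
            .reset_index(drop=True))

print(accounts)
```

Resetting the index after dropping duplicates keeps the row labels consecutive, which matters later when frames are joined side by side.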

In [23]:
usernames = []
tweet_text = []
truncated = []
post_date = []
followers = []

# Pull the fields of interest out of each tweet's raw JSON
for tweet in demdeb_tweets:
    usernames.append(tweet._json["user"]["screen_name"])
    tweet_text.append(tweet._json["text"])
    truncated.append(tweet._json["truncated"])
    post_date.append(tweet._json["created_at"])
    followers.append(tweet._json["user"]["followers_count"])

# Column name -> list of values, ready to be passed to pd.DataFrame
demdeb_mar15 = {"username": usernames,
                "text": tweet_text,
                "truncated?": truncated,
                "date posted": post_date,
                "follower count": followers}


In [24]:
import pandas as pd

demdeb_mar = pd.DataFrame(demdeb_mar15)

demdeb_mar

| | username | text | truncated? | date posted | follower count |
|---|---|---|---|---|---|
| 0 | sonyaliloquy | RT @GottaBernNow: To be clear:\nWe are not emp... | False | Tue Mar 17 13:13:19 +0000 2020 | 434 |
| 1 | Electricknight5 | RT @MarkDice: The Bernie vs Biden #DemDebate i... | False | Tue Mar 17 13:13:08 +0000 2020 | 879 |
| 2 | DireChange2 | RT @RBReich: If Biden gets the nomination, not... | False | Tue Mar 17 13:13:03 +0000 2020 | 18 |
| 3 | FuqdInTheUSA | RT @sunrisemvmt: .@JoeBiden just repeated a sm... | False | Tue Mar 17 13:12:51 +0000 2020 | 0 |
| 4 | Gary04337542 | RT @TeamTrump: While Sleepy Joe and Crazy Bern... | False | Tue Mar 17 13:12:47 +0000 2020 | 24 |
| 5 | KittyCat11231 | RT @NaomiAKlein: Let this sink in. Understand ... | False | Tue Mar 17 13:12:45 +0000 2020 | 50 |
| 6 | DeploraBIL | RT @HOLYSMKES: #DemDebate recap https://t.co/Y... | False | Tue Mar 17 13:12:44 +0000 2020 | 6914 |
| 7 | MoveOnTC | RT @OurRevolution: Joe - LYING to the America... | False | Tue Mar 17 13:12:33 +0000 2020 | 249 |
| 8 | ColejrJs | RT @RBReich: Your #DemDebate reminder that Med... | False | Tue Mar 17 13:12:22 +0000 2020 | 276 |
| 9 | MelindaTanner16 | RT @SethAbramson: RETWEET if you too would rat... | False | Tue Mar 17 13:12:21 +0000 2020 | 11 |


With our neat `DataFrame` we can now easily find out details about the data we collected from Twitter.

**Exercise 3. Get information about our data.** Use the `print` function and string operations to make Python tell you in plain language: a) how many unique accounts there are in your data, b) what the name of the _last_ account in your data is, and c) what the sum of followers is for all accounts in your data. That is, make Python print out full sentences with the relevant information.

In [27]:
# nunique() counts each account once; iloc[-1] returns the last name, not its position
print("There are " + str(demdeb_mar.username.nunique()) +
      " unique accounts in my dataset. The last account is " + str(demdeb_mar.username.iloc[-1]) +
      ". In total, the accounts have " + str(demdeb_mar["follower count"].sum()) + " followers.")


**Exercise 4. Adding data to our DataFrame.** Loop through the indices of your `DataFrame`, collect the timeline for each account using the `user_timeline` method from tweepy, and store them in a new list "timelines". Note that you will want to build in some `sleep` time to avoid running into rate limits. You can find the syntax for how to do this on page 155 and the logic and examples on pages 209-12 in Brooker (2020).
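
The pausing logic can be tried out without touching the API at all. The sketch below is only an illustration: `collect_with_pause` and `fake_fetch` are hypothetical helpers, not part of tweepy, and `fake_fetch` stands in for a rate-limited call such as `api.user_timeline`.

```python
import time

def collect_with_pause(fetch, names, pause_seconds=5):
    """Call fetch(name) for every name, pausing between consecutive calls."""
    results = []
    for position, name in enumerate(names):
        results.append(fetch(name))
        # No need to wait after the final request
        if position < len(names) - 1:
            time.sleep(pause_seconds)
    return results

# Hypothetical stand-in for a rate-limited API call
def fake_fetch(name):
    return ["status from " + name]

demo = collect_with_pause(fake_fetch, ["alice", "bob"], pause_seconds=0.1)
print(demo)
```

As an alternative to sleeping by hand, tweepy's `API` constructor also accepts `wait_on_rate_limit=True`, which makes the client wait on its own once a limit is reached.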

In [28]:
import time

timelines = []

for n in demdeb_mar.index:
    a = api.user_timeline(demdeb_mar.username[n])
    timelines.append(a)
    
    time.sleep(5)
    print("loop")

loop
loop
loop
loop
loop
loop
loop
loop
loop
loop
loop
loop
loop


In [81]:
len(timelines)

13

Add your list "timelines" to your current `DataFrame`. To do this, we first need to turn our list into a new `DataFrame` with one column labeled `timelines` and then join our two `DataFrames` horizontally, i.e. along `axis = 1`.

In [39]:
timelines_ = pd.DataFrame({"timelines": timelines})
demdeb_mar_tl = pd.concat([demdeb_mar, timelines_], axis=1)
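
One detail worth knowing about horizontal joins: `pd.concat` with `axis=1` lines rows up by index label, not by position. The toy frames below (made-up data) show what happens when the labels match and when they do not.

```python
import pandas as pd

left = pd.DataFrame({"username": ["alice", "bob"]})
right = pd.DataFrame({"timelines": [["t1", "t2"], ["t3"]]})

# Both frames use the default index 0, 1, so rows line up one to one
combined = pd.concat([left, right], axis=1)
print(combined)

# Give the second frame different labels to see the alignment at work
shifted = right.copy()
shifted.index = [1, 2]

# pandas takes the union of the labels and fills the gaps with NaN
misaligned = pd.concat([left, shifted], axis=1)
print(misaligned)
```

If a frame has been filtered or deduplicated beforehand, calling `reset_index(drop=True)` on it before the join is one way to avoid such gaps.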

### Take a deep breath. This was a major piece of coding. 

Now you have the timeline, that is, the statuses, for each of your accounts in the `DataFrame`. But these are still in the raw format that the Twitter API returns, so we need to transform them into a format that lets us work with them more easily. In the end, we want to get at the topics and persons our accounts tweet about.

**Exercise 5. Getting the tweet texts from the timeline.** Extract the text from the tweets in each account's timeline, combine them into a list, turn the list of lists into a `DataFrame`, and join the new and old `DataFrames`. One way to do this is to 1) create an empty list 'tweets', 2) loop through the indices in your `DataFrame`, 3) for each index/row, loop through the timeline, 4) create a temporary list, append the text of each timeline element to that list, then append the temporary list to 'tweets', 5) turn the list into a `DataFrame`, and 6) merge the two `DataFrames` horizontally.

In [83]:
timeline_tweets = []

# For each account, collect the text of every status in its timeline
for i in demdeb_mar_tl.index:
    temp = []
    for status in demdeb_mar_tl.timelines[i]:
        temp.append(status._json["text"])

    timeline_tweets.append(temp)

# One row per account, each holding a list of tweet texts
tl_tweets = pd.DataFrame({"tweets": timeline_tweets})

demdeb_mar_tl = pd.concat([demdeb_mar_tl, tl_tweets], axis=1)

**Exercise 6. Turning our list of tweet texts into a long string.** To get a sense of what our accounts usually tweet about, it might be useful to have their tweets in one long string that allows us to easily count the words they use. Create a list that holds the long string of tweets for each user. We can concatenate our list of tweets/strings using the [join](https://docs.python.org/2/library/string.html#string.join) command for which you can find a usage example [here](https://stackoverflow.com/a/493842).

Turn the list into a `DataFrame` and merge it with your main `DataFrame` horizontally. 

In [79]:
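long_tweets = []

# Join each account's list of tweet texts into one long string
for i in demdeb_mar_tl.index:
    long_tweets.append(" ".join(demdeb_mar_tl["tweets"][i]))

# The column is called "all_tweets" so it does not clash with the existing "text" column
long_tweets = pd.DataFrame({"all_tweets": long_tweets})

demdeb_mar_tl = pd.concat([demdeb_mar_tl, long_tweets], axis=1)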

**Exercise 7. Finding hashtags and mentions.** Now that we have all the tweets for each account in one long string, we can start looking at the topics the accounts are tweeting about and who they are interacting with. To do so, you can use the [`findall`](https://docs.python.org/3/library/re.html#re.findall) function from the `re` package to extract all hashtags (starting with a "#") and mentions (starting with an "@"). Add one column for hashtags and mentions respectively to your `DataFrame`.
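
A minimal sketch of the pattern matching, using a made-up tweet string. The helper names `find_hashtags` and `find_mentions` are hypothetical, and the patterns are a simplification: `\w+` matches letters, digits, and underscores, so `@\w+` would also pick up the domain part of an e-mail address.

```python
import re

# A "#" or "@" followed by one or more word characters
HASHTAG_PATTERN = re.compile(r"#\w+")
MENTION_PATTERN = re.compile(r"@\w+")

def find_hashtags(text):
    """Return every hashtag in text, in order of appearance."""
    return HASHTAG_PATTERN.findall(text)

def find_mentions(text):
    """Return every mention in text, in order of appearance."""
    return MENTION_PATTERN.findall(text)

# Made-up example text
sample = "RT @someone: watching the #DemDebate with @a_friend #politics"
print(find_hashtags(sample))
print(find_mentions(sample))
```

To fill a `DataFrame` column, functions like these can be passed to `Series.apply` on the column that holds the long strings.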

**Exercise 8. Writing your insights to a file.** You have just generated some really awesome insights about the accounts you identified earlier. To share your insights, that is, the topics/hashtags your accounts tweet about, you should now write the hashtags to a text file. If you want to remind yourself how, we covered this in week 3 day 1. Can you make it so the text file first lists the account name and then the hashtags the account uses?
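
One way the file writing could look, sketched with made-up accounts and hashtags in plain lists. The file name `hashtags.txt` and the "name: tags" layout are assumptions chosen for illustration.

```python
import os
import tempfile

# Made-up accounts and hashtags standing in for two DataFrame columns
usernames = ["alice", "bob"]
hashtags = [["#DemDebate", "#politics"], []]

# A temporary folder keeps the sketch from cluttering the working directory
out_path = os.path.join(tempfile.mkdtemp(), "hashtags.txt")

with open(out_path, "w", encoding="utf-8") as outfile:
    for name, tags in zip(usernames, hashtags):
        # One line per account: the name, a colon, then its hashtags
        outfile.write(name + ": " + " ".join(tags) + "\n")

# Read the file back to check what was written
with open(out_path, encoding="utf-8") as infile:
    written = infile.read()
print(written)
```

Opening the file in a `with` block closes it automatically, even if an error occurs partway through the loop.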

**Exercise 9. Descriptive statistics about your accounts.** We closed last week by talking about descriptive statistics. For the accounts you gathered, there are at least three variables that you might be interested to know more about. What are the minimum, maximum, and mean for the number of followers, friends, and posted statuses in your data?
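
A short sketch of one way to get all nine numbers at once, using `DataFrame.agg` on made-up counts. The column names follow the Twitter field names and are an assumption about how your own `DataFrame` is labeled.

```python
import pandas as pd

# Made-up counts for three accounts
accounts = pd.DataFrame({
    "followers_count": [10, 20, 60],
    "friends_count": [5, 8, 2],
    "statuses_count": [100, 50, 300],
})

# One row per statistic, one column per variable
summary = accounts.agg(["min", "max", "mean"])
print(summary)
```

`accounts.describe()` reports the same three statistics along with the count, standard deviation, and quartiles.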

**Exercise 10. Visualizing influence.** To round off this exercise, let's plot some data from the accounts you collected. Make a bar plot to show which of the accounts has the most influence on Twitter. _Hint:_ You might want to look at `followers_count`.
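
A possible starting point for the bar plot, assuming `matplotlib` is installed and using made-up follower counts. Sorting the accounts first puts the most followed account on the left.

```python
import matplotlib.pyplot as plt

# Made-up accounts and follower counts
names = ["alice", "bob", "carol"]
followers = [10, 60, 20]

# Sort accounts from most to fewest followers
ranked = sorted(zip(names, followers), key=lambda pair: pair[1], reverse=True)
ranked_names = [name for name, _ in ranked]
ranked_counts = [count for _, count in ranked]

fig, ax = plt.subplots()
bars = ax.bar(ranked_names, ranked_counts)
ax.set_xlabel("Account")
ax.set_ylabel("followers_count")
ax.set_title("Followers per account")
```

In a notebook the figure is displayed when the cell finishes; in a script, `plt.show()` or `fig.savefig(...)` would be needed.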

**Exercise 11. Understanding influence.** Now that you know who is most influential among your accounts, try to see if the data you get from Twitter allows you to explore what might explain that influence. Look into your data and plot the follower count against another variable. Is there a pattern?
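
A scatter plot is one way to look for such a pattern. The sketch below assumes `matplotlib` is installed and plots made-up follower counts against made-up status counts; any other numeric variable could take the place of `statuses_count`.

```python
import matplotlib.pyplot as plt

# Made-up counts for five accounts
followers = [10, 60, 20, 500, 35]
statuses = [100, 900, 150, 4000, 300]

fig, ax = plt.subplots()
points = ax.scatter(statuses, followers)

# Log scales help when a few accounts are far larger than the rest
ax.set_xscale("log")
ax.set_yscale("log")
ax.set_xlabel("statuses_count")
ax.set_ylabel("followers_count")
```

Note that an account with a count of zero cannot be drawn on a log scale, so a linear scale may be the better choice if your data contains zeros.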

**THERE IS ALWAYS MORE.** If you got all the way through this exercise and are still hungry for more, here are some suggestions for other things you could do:

1. To get an even better sense of what your accounts tweet about than just using hashtags, you could count the most used words. Create a list that, for each account, has a dictionary of the frequency of each word, with stop words removed. Remember, you can reuse your code from W3D1. You can get a list of stop words from [here](http://ir.dcs.gla.ac.uk/resources/linguistic_utils/stop_words). These are also stored in `stop_words.txt`. Add a column to your `DataFrame` for the most used words.
2. Extract the number of favorites and retweets from the timelines you gathered. Is there any relationship between the number of followers and these figures? How about between these figures and the number of friends?
3. Researchers often use Twitter because we can do respondent-driven sampling, i.e. we start with a few accounts and then collect the accounts that follow these accounts to get a broader picture of the network. Start exploring the networks of the accounts you collected using the [`followers`](https://tweepy.readthedocs.io/en/latest/api.html#API.followers) command.
4. Given that the accounts you collected are similar in that they tweet about your topic of choice, it might be interesting to know if there are issues that distinguish the accounts. Researchers often use term frequency-inverse document frequency to study such differences. [Here](https://www.freecodecamp.org/news/how-to-process-textual-data-using-tf-idf-in-python-cd2bbc0a94a3/) is a primer on the concept and a tutorial on how to implement it in Python. Can you find what distinguishes your accounts from one another?
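
The first suggestion above, counting words with stop words removed, can be sketched with `collections.Counter`. The tiny stop word set, the helper name `word_frequencies`, and the sample sentence are all made up for illustration; in practice the stop words would be read from `stop_words.txt`.

```python
import re
from collections import Counter

# A tiny hand-picked stop word set; the linked list is much longer
STOP_WORDS = {"the", "a", "is", "to", "and", "of", "rt"}

def word_frequencies(text, stop_words=STOP_WORDS):
    """Count how often each non-stop word occurs in text."""
    # Lowercase, then keep runs of word characters and apostrophes
    words = re.findall(r"[\w']+", text.lower())
    return Counter(word for word in words if word not in stop_words)

# Made-up example text
sample = "The debate is on. The debate is long and the night is young."
counts = word_frequencies(sample)
print(counts.most_common(1))
```

A `Counter` behaves like a dictionary, so one `Counter` per account fits the "list of dictionaries" structure the suggestion describes.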