# Overview

This is the last week and our goal today is to use what we have learned so far to analyze new data: the tweets of the members of the House of Representatives. 

Recall that the Wikipedia network showed a nuanced and complex picture of the various kinds of connections between politicians that arise from the political system, and that our text analysis also showed evidence of the many small cases and issues that fill the days of real-world American politicians.

Well, as we all know *nuance is sometimes a little bit boring*, so today, we're going to a place without much nuance. **Twitter**. Twitter has become one of the main communication channels between politicians and the electorate, and we hope you'll find that the things that are going on on Twitter will fit your prejudices better ... it will have a lot less nuance. 

We will also learn about _sentiment analysis_, a topic which is pretty useless when it comes to Wikipedia (because all Wikipedia text is designed to be neutral), but which is highly useful to analyze Twitter data (and many other things).

The overview for today is:

* (Optional) Download the Twitter data.
* Analyze and visualize the retweets network of the members of the House.
* Learn about sentiment analysis.
* Analyze the text of the tweets (using TF-IDF and sentiment analysis).

# Part 0 - (Optional) Download the Twitter data

In the first exercise, we will use the Twitter API to download the 200 most recent tweets of the members of the House of Representatives. 

Twitter data could be useful for your future data projects, so we encourage you to go through this exercise. However, ***note that this exercise is optional,*** meaning it will not be included in your second assignment. 

* If you want to *skip the Twitter data download part*, jump to Part 1 of the exercises and use the data provided by us. You will find the Twitter handles of the members of the house [here](https://github.com/suneman/socialgraphs2018/blob/master/files/data_twitter/H115_tw.csv), and the 200 most recent tweets for each member [here](https://github.com/suneman/socialgraphs2018/tree/master/files/data_twitter/tweets.zip). Each file is named as the Twitter handle of a member, and each line of a file contains a tweet (from the most recent to the oldest). By _handle_ we mean a user's id on Twitter.

If, instead, you want to learn how to download Twitter data, follow the steps below for success.

_Exercise_ 1: Download Twitter Data.
> To get access to the Twitter API, you will need to create an app. You can follow these steps:
>
> * Create a Twitter account (you can use your own if you already have one).
> * Apply for a Twitter developer account [here](https://developer.twitter.com/en/apply-for-access.html)
> * Create an app that interacts with the Twitter API. Go to https://developer.twitter.com/en/apps and click _"Create an app"_.
> * Fill out the form, agree to the terms, and click _“Create”_. **Note: you can use the link to your Twitter page as "_Website URL_" (e.g. `https://twitter.com/my_twitter_handle`)**.
> * In the next page, click on _Keys and Access Tokens_ tab, and copy your _API key_ and _API secret_. Scroll down and click _Create my access token_, and copy your _Access token_ and _Access token secret_.

> We are almost set to use the Twitter API! We will use a Python library called [python-twitter](https://github.com/bear/python-twitter) to connect to Twitter API and download data. There are [many other libraries](https://developer.twitter.com/en/docs/developer-utilities/twitter-libraries.html) that let you use Twitter API. We chose python-twitter because it is simple to use (and it fully supports the Twitter API).
> * Install python-twitter using one of the following: 
>    * `conda install -c jacksongs python-twitter` 
>    * `pip install python-twitter`
> * Check out python-twitter [documentation](https://python-twitter.readthedocs.io/en/latest/getting_started.html) and [examples](https://github.com/bear/python-twitter/tree/master/examples) to get started with the API. Use the API keys and tokens for the app you created above to create an instance of the [`twitter.Api`](https://python-twitter.readthedocs.io/en/latest/twitter.html#twitter.api.Api) class.
> * Download the twitter handles of the members of the list [_u-s-representatives_](https://twitter.com/cspan/lists/u-s-representatives/members?lang=en). This list contains the handles of the current members of the house of representatives. _Hint:_ Use the method [`twitter.api.Api.GetListMembers`](https://python-twitter.readthedocs.io/en/latest/twitter.html?highlight=getlistmembers#twitter.api.Api.GetListMembers). 
> * Retrieve the _name_ associated with each Twitter handle (the one displayed in bold on a user's Twitter page, under the profile picture). _Hint:_ Use the method [`twitter.api.Api.UsersLookup`](https://python-twitter.readthedocs.io/en/latest/twitter.html?highlight=getlistmembers#twitter.api.Api.UsersLookup).
> * Tricky bit! Find the party associated with each Twitter handle using the [list of the house of representatives members on Wikipedia](https://github.com/suneman/socialgraphs2018/blob/master/files/data_US_congress/H115.csv). What you need to do is to match the _names_ with the Wikipedia page names. Be creative to find a solution! _Note:_ Some members don't have a Twitter account, but others have two. In the latter case, prefer the account that is related to the house of representatives (e.g. prefer https://twitter.com/RepRoKhanna over https://twitter.com/rokhanna). Create a `pandas.DataFrame` with the Twitter handles and corresponding parties and save it.
> * Download the 200 most recent Tweets for each member of the house. Save the tweets of each member in a different file. _Hint:_ Use the method [`twitter.api.Api.GetUserTimeline`](https://python-twitter.readthedocs.io/en/latest/twitter.html?highlight=getlistmembers#twitter.api.Api.GetUserTimeline).
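
The last step above can be sketched roughly as follows. This is a minimal sketch, not a reference solution: the helper name `save_timelines` and the output layout (one file per handle, one tweet per line) are our own choices, and `api` is assumed to behave like an authenticated `twitter.Api` instance. The commented usage at the bottom assumes you have installed python-twitter and filled in your own keys.

```python
import os


def save_timelines(api, handles, out_dir, count=200):
    """Save the most recent tweets of each handle, one tweet per line.

    `api` is assumed to behave like an authenticated `twitter.Api` instance.
    """
    os.makedirs(out_dir, exist_ok=True)
    for handle in handles:
        statuses = api.GetUserTimeline(screen_name=handle, count=count)
        # Each file is named after the handle (hypothetical layout).
        path = os.path.join(out_dir, handle)
        with open(path, "w", encoding="utf-8") as f:
            for status in statuses:
                # Collapse line breaks so that each line holds exactly one tweet.
                f.write(" ".join(status.text.split()) + "\n")


# Hypothetical usage (requires python-twitter and your own credentials):
# import twitter
# api = twitter.Api(consumer_key="...", consumer_secret="...",
#                   access_token_key="...", access_token_secret="...")
# members = api.GetListMembers(slug="u-s-representatives",
#                              owner_screen_name="cspan")
# save_timelines(api, [m.screen_name for m in members], "tweets")
```

Passing the `api` object in as an argument keeps the saving logic separate from the authentication step, so you can try it out with a small stand-in object before spending any of your API rate limit.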

# Part 1 - The network of retweets.

A retweet is a re-posting of a tweet, usually one originally written by another user. Retweets often indicate trust in the original author and agreement with the content of the original message ([as also found by scientific studies](http://www.aaai.org/ocs/index.php/ICWSM/ICWSM15/paper/download/10555/10467)). This suggests that, by looking at how representatives retweet each other, we can understand something about the relations between them.

_Exercise_ 2: Build the network of retweets.
We will now build a network whose nodes are the Twitter handles of the members of the house, with a directed edge from node A to node B if A has retweeted content posted by B. The network is weighted: the weight of an edge is equal to the number of retweets. You can build the network following the steps below (and you should be able to reuse many of the functions you wrote in previous weeks):

> * Consider the 200 most recent tweets written by each member of the house (click "Download" to get the zip file [here](https://github.com/suneman/socialgraphs2018/tree/master/files/data_twitter/tweets.zip), or use the ones you produced in Part 0). For each file, use a regular expression to find retweets and to extract the Twitter handle of the user whose content was retweeted. All retweets begin with "_RT @originalAuthor:_", where "_originalAuthor_" is the handle of the user whose content was retweeted (and the part of the text you want to extract).
> * For each retweet, check if the handle retweeted is the one of a member of the house. If yes, keep it. If no, discard it.
> * Use a NetworkX [`DiGraph`](https://networkx.github.io/documentation/development/reference/classes.digraph.html) to store the network. Use weighted edges to account for multiple retweets. Also store the party of each member as a node attribute (use the data in [this file](https://github.com/suneman/socialgraphs2018/blob/master/files/data_twitter/H115_tw.csv), or the data you downloaded in Part 0). Remove self-loops (edges that connect a node with itself).
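
The steps above could be sketched like this. It is one possible approach under a few assumptions, not the intended solution: the function names are ours, the tweets are assumed to be already loaded into a dict mapping each handle to a list of tweet strings, and handles are compared case-insensitively. `networkx` is imported inside `to_digraph` only so that the counting part can be tried without it.

```python
import re
from collections import Counter

# A retweet starts with "RT @handle:"; the capture group is the handle.
RETWEET_RE = re.compile(r"^RT @(\w+):")


def retweet_weights(tweets_by_member, house_handles):
    """Count retweets between House members.

    tweets_by_member: dict mapping a handle to a list of tweet strings.
    house_handles: iterable of the handles that count as nodes.
    Returns a Counter mapping (retweeter, original_author) to a count.
    """
    # Compare lower-cased handles, but report them as spelled in house_handles.
    canonical = {h.lower(): h for h in house_handles}
    weights = Counter()
    for member, tweets in tweets_by_member.items():
        source = canonical.get(member.lower())
        if source is None:
            continue
        for tweet in tweets:
            match = RETWEET_RE.match(tweet)
            if match is None:
                continue  # not a retweet
            target = canonical.get(match.group(1).lower())
            # Keep only retweets of other House members (no self-loops).
            if target is not None and target != source:
                weights[(source, target)] += 1
    return weights


def to_digraph(weights, party_by_handle):
    """Store the counts in a weighted DiGraph with a 'party' node attribute."""
    import networkx as nx  # imported here so the counting above needs no extras

    graph = nx.DiGraph()
    for handle, party in party_by_handle.items():
        graph.add_node(handle, party=party)
    for (source, target), weight in weights.items():
        graph.add_edge(source, target, weight=weight)
    return graph
```

Because self-retweets are skipped while counting, no separate self-loop removal is needed afterwards in this sketch.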


 _Exercise_ 3: Visualize the network of retweets and investigate differences between the parties.
> * Visualize the network using the [NetworkX draw function](https://networkx.github.io/documentation/stable/reference/generated/networkx.drawing.nx_pylab.draw.html#networkx.drawing.nx_pylab.draw), with node coordinates from the force atlas algorithm (see Week 5, Exercise 2). _Hint: use the undirected version of the graph to find the node positions for better results, but stick to the directed version for all measurements._ Color the nodes according to their party (e.g. 'red' for republicans and 'blue' for democrats) and set the node size proportional to total degree.
>   * Compare the network of Retweets with the network of Wikipedia pages (Week 5, exercise 2). Do you observe any difference? How do you explain them?
> * Now set the nodes' size proportional to their betweenness centrality. What do you observe?
> * Repeat the point above using eigenvector centrality instead. Is there any difference? Can you explain why?
> * Who are the three nodes with highest degree within each party? And eigenvector centrality? And betweenness centrality?
> * Plot on the same figure the distribution of outgoing strength (i.e. the sum of the weights on outgoing links) for the republican and democratic nodes. Which party is more active in retweeting other members of the house?
> * Find the 3 members of the republican party that most often retweeted tweets from democratic members. Repeat the measure for the democratic members. Can you explain your results by looking at the Wikipedia pages of these members of the house?
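
For the last two points, a minimal sketch of the bookkeeping is given below. It assumes the weighted edges are available as a dict mapping `(retweeter, original_author)` pairs to retweet counts, and the parties as a dict mapping handles to party labels; the function names are hypothetical. With a NetworkX `DiGraph`, `G.out_degree(weight="weight")` should give the same outgoing strengths.

```python
from collections import defaultdict


def out_strength(weights):
    """Sum the weights of the outgoing edges of every source node."""
    strength = defaultdict(int)
    for (source, _target), weight in weights.items():
        strength[source] += weight
    return dict(strength)


def cross_party_retweets(weights, party_by_handle):
    """Count, for each member, the retweets of members of a different party."""
    counts = defaultdict(int)
    for (source, target), weight in weights.items():
        if party_by_handle[source] != party_by_handle[target]:
            counts[source] += weight
    return dict(counts)


def top_n(counts, n=3):
    """Return the n (handle, count) pairs with the largest counts."""
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)[:n]
```

Note that members who never retweeted anyone do not appear in the returned dicts; if you plot a distribution, remember to add them back with a strength of zero.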

 _Exercise_ 4: Community detection.
> * Use [the Python Louvain-algorithm implementation](http://perso.crans.org/aynaud/communities/) to find communities in the full house of representatives network. Report the value of modularity found by the algorithm. Is it higher or lower than what you found for the Wikipedia network? Comment on your result.
>   * \[**Note**: This implementation is now available as an Anaconda package. Install with `conda` as explained [here](https://anaconda.org/auto/python-louvain)\].
>   * You can also try the *Infomap* algorithm instead if you're curious. Go to [this page](http://www.mapequation.org/code.html) and search for 'python'. It's harder to install, but a better community detection algorithm.
> * Visualize the network, using the Force Atlas algorithm (see Lecture 5, exercise 2). This time assign each node a different color based on their _community_. Describe the structure you observe.
> * Compare the communities found by your algorithm with the parties by creating a matrix $\mathbf{D}$ with dimension $B \times C$, where $B$ is the number of parties and $C$ is the number of communities. We set entry $D(i,j)$ to be the number of nodes that party $i$ has in common with community $j$. The matrix $\mathbf{D}$ is what we call a [**confusion matrix**](https://en.wikipedia.org/wiki/Confusion_matrix).
> * [Plot the confusion matrix](https://scipython.com/book/chapter-7-matplotlib/examples/visualizing-a-matrix-with-imshow/) and explain how well the communities you've detected correspond to the parties. Consider the following questions
>   * Are there any republicans grouped with democrats (and vice versa)?
>   * Does the community detection algorithm sub-divide the parties? Do you know anything about American politics that could explain such sub-divisions? Answer in your own words.
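
The matrix $\mathbf{D}$ can be built with a few lines of plain Python. The sketch below assumes two dicts, one mapping each node to its party and one mapping each node to its community id (as far as we know, this is the shape returned by `best_partition` in python-louvain, which expects an undirected graph). The function name and the return format are our own choices.

```python
def confusion_matrix(party_by_node, community_by_node):
    """Build D, where D[i][j] counts the nodes of party i in community j.

    Returns (parties, communities, matrix) so that rows and columns
    can be labelled when plotting.
    """
    parties = sorted(set(party_by_node.values()))
    communities = sorted(set(community_by_node.values()))
    row = {party: i for i, party in enumerate(parties)}
    col = {community: j for j, community in enumerate(communities)}
    # B x C matrix of zeros, as a list of rows.
    matrix = [[0] * len(communities) for _ in parties]
    for node, community in community_by_node.items():
        matrix[row[party_by_node[node]]][col[community]] += 1
    return parties, communities, matrix
```

The nested list can be passed directly to `matplotlib.pyplot.imshow`, using `parties` and `communities` as tick labels.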

# Part 2 - What do republican and democratic members tweet about?

We will now put to use all we have learned on language processing to find out the content of the tweets of democratic and republican members. 

_Exercise_ 5: TF-IDF of the republican and democratic tweets.
> We will create two documents, one containing the words extracted from tweets of republican members, and the other for Democratic members. We will then use TF-IDF to compare the content of these two documents and create a word-cloud. The procedure you should use is exactly the same you used in exercise 2 of week 7. The main steps are summarized below: 
> * Create two large documents, one for the democratic and one for the republican party. Tokenize the tweets, and combine the tokens into one long list including all the tweets of the members of the same party.
>   * Exclude all twitter handles.
>   * Exclude punctuation.
>   * Exclude stop words (if you don't know what stop words are, go back and read NLPP1e again).
>   * Exclude numbers (since they're difficult to interpret in the word cloud).
>   * Set everything to lower case.
>   * Compute the TF-IDF for each document.
> * Now, create a word-cloud for each party. Are these topics less "boring" than the Wikipedia topics? Why? Comment on the results.
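
As a reminder of the mechanics, here is a minimal sketch of the pre-processing and of one common TF-IDF variant (term frequency normalized by document length, times $\log(N/\mathrm{df})$). The variant you used in week 7 may differ, so treat this as an illustration only. The stop-word set is a tiny made-up placeholder, and stripping URLs is an extra step that is not in the list above.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word set; a real run would use a full list.
STOP_WORDS = {"the", "a", "an", "and", "to", "of", "in", "for", "on", "is", "rt"}


def tokenize(tweet):
    """Lower-case a tweet; drop handles, URLs, punctuation, numbers, stop words."""
    text = tweet.lower()
    text = re.sub(r"@\w+", " ", text)          # Twitter handles
    text = re.sub(r"https?://\S+", " ", text)  # links (extra assumption)
    # Keeping only runs of letters removes digits and punctuation
    # (it also splits contractions such as "don't").
    tokens = re.findall(r"[a-z]+", text)
    return [t for t in tokens if t not in STOP_WORDS]


def tf_idf(documents):
    """documents: dict name -> list of tokens. Returns name -> {term: score}."""
    n_docs = len(documents)
    counts = {name: Counter(tokens) for name, tokens in documents.items()}
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for term_counts in counts.values():
        doc_freq.update(term_counts.keys())
    scores = {}
    for name, term_counts in counts.items():
        total = sum(term_counts.values())
        scores[name] = {
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in term_counts.items()
        }
    return scores
```

With only two documents, this variant gives a score of zero to every word used by both parties, so only party-specific words survive. The resulting `{term: score}` dicts can be used as word frequencies for the word cloud, for instance with `generate_from_frequencies` from the `wordcloud` package.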

# Part 3 - Sentiment analysis

Sentiment analysis is another highly useful technique which we'll use to make sense of the Twitter data. Experience shows that it might well be very useful when you get to the project stage of the class.



> **Video Lecture**: Uncle Sune talks about sentiment and his own youthful adventures.



```python
from IPython.display import YouTubeVideo

YouTubeVideo("JuYcaYYlfrI", width=800, height=450)
```

> Reading: [Temporal Patterns of Happiness and Information in a Global Social Network: Hedonometrics and Twitter](http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0026752) 


> _Exercise_ 6: Sentiment over the Twitter data.
> 
> * Download the LabMT wordlist. It's available as supplementary material from [Temporal Patterns of Happiness and Information in a Global Social Network: Hedonometrics and Twitter](http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0026752) (Data Set S1). Describe briefly how the list was generated.
> * Based on the LabMT word list, write a function that calculates sentiment given a list of tokens (the tokens should be lower case, etc).
> * Create two lists: one including the tokenized tweets written by democratic members, and the other including the tokenized tweets written by republican members. Calculate the sentiment of each tweet and plot the distribution of tweet sentiment for each of the two lists. Are there significant differences between the two? Which party posts more positive tweets?
> * Compute the average $m$ and standard deviation $\sigma$ of the tweet sentiment (considering tweets by both republicans and democrats).
> * Now consider only tweets with sentiment lower than $m - 2\sigma$. We will refer to them as _negative_ tweets. Build a list containing the _negative_ tweets written by democrats, and one for republicans. Compute the TF-IDF for these two lists (use the same pre-processing steps as in Exercise 5). Create a word-cloud for each of them. Are there differences between the negative contents posted by republicans and democrats?
> * Repeat the point above, but considering _positive_ tweets instead (i.e. with sentiment larger than $m + 2\sigma$). Comment on your results.
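
The sentiment function and the $m \pm 2\sigma$ thresholds could be sketched as follows. We assume the LabMT list has already been loaded into a dict mapping each word to its average happiness score; how you parse the file is up to you. Tweets without any word in the list get no score and are left out, which is a choice of this sketch rather than a prescription, as is the use of the population standard deviation.

```python
from statistics import mean, pstdev


def sentiment(tokens, happiness):
    """Average happiness of the tokens that appear in the word list.

    Returns None when no token is in the list.
    """
    scores = [happiness[t] for t in tokens if t in happiness]
    return mean(scores) if scores else None


def split_by_sentiment(tokenized_tweets, happiness, n_sigma=2):
    """Return (m, sigma, negative, positive) for a list of tokenized tweets.

    negative: tweets with sentiment below m - n_sigma * sigma.
    positive: tweets with sentiment above m + n_sigma * sigma.
    """
    scored = [(tweet, sentiment(tweet, happiness)) for tweet in tokenized_tweets]
    # Drop tweets that could not be scored.
    scored = [(tweet, s) for tweet, s in scored if s is not None]
    values = [s for _tweet, s in scored]
    m, sigma = mean(values), pstdev(values)
    negative = [tweet for tweet, s in scored if s < m - n_sigma * sigma]
    positive = [tweet for tweet, s in scored if s > m + n_sigma * sigma]
    return m, sigma, negative, positive
```

To follow the exercise, compute $m$ and $\sigma$ once over the tweets of both parties together, and then apply the same thresholds to each party's list separately.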
