# Formalia:

Please read the [assignment overview page](https://github.com/suneman/socialgraphs2018/wiki/Assignments) carefully before proceeding. The page explains the required hand-in format, group sizes, and many other aspects of handing in the assignment.

_If you fail to follow these simple instructions, it will negatively impact your grade!_

**Due date and time**: The assignment is due on Tuesday November 6th at 23:55. Hand in your IPython notebook file (with extension `.ipynb`) via http://peergrade.io/.

# Part 1: Twitter Network Analysis

_Exercise_ 1: Build the network of retweets.
We will now build a network whose nodes are the Twitter handles of the members of the house, with a directed edge from node A to node B if A has retweeted content posted by B. The network is weighted: the weight of an edge equals the number of retweets. You can build the network by following the steps below (you should be able to reuse many of the functions you wrote in previous weeks):

* Consider the 200 most recent tweets written by each member of the house (use the files [here](https://github.com/suneman/socialgraphs2018/tree/master/files/data_twitter/tweets.zip), or the ones you produced in Part 1). For each file, use a regular expression to find retweets and to extract the Twitter handle of the user whose content was retweeted. All retweets begin with "_RT @originalAuthor:_", where "_originalAuthor_" is the handle of the user whose content was retweeted (and the part of the text you want to extract).
* For each retweet, check whether the retweeted handle belongs to a member of the house. If it does, keep it; otherwise discard it.
* Use a NetworkX [`DiGraph`](https://networkx.github.io/documentation/development/reference/classes.digraph.html) to store the network. Use weighted edges to account for multiple retweets. Also store the party of each member as a node attribute (use the data in [this file](https://github.com/suneman/socialgraphs2018/blob/master/files/data_twitter/H115_tw.csv), or the data you downloaded in Part 1). Remove self-loops (edges that connect a node with itself).
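
The steps above can be sketched roughly as follows. This is only an illustration written for Python 3 with the standard library: the function name `retweet_edges`, the input layout (a dict from handle to a list of tweets) and the toy data are assumptions, not part of the assignment material.

```python
import re
from collections import Counter

# "RT @handle:" at the very start of a tweet marks a retweet
RETWEET_PATTERN = re.compile(r'^RT @(\w+):')


def retweet_edges(tweets_by_member):
    """Count retweets between members.

    tweets_by_member maps a handle to the list of that member's tweets
    (hypothetical layout). Returns a Counter that maps
    (retweeter, original_author) to the number of retweets.
    """
    members = set(tweets_by_member)
    edges = Counter()
    for retweeter, tweets in tweets_by_member.items():
        for tweet in tweets:
            match = RETWEET_PATTERN.match(tweet)
            if match is None:
                continue  # not a retweet
            author = match.group(1)
            # keep only retweets of other house members (this also drops self-loops)
            if author in members and author != retweeter:
                edges[(retweeter, author)] += 1
    return edges


# toy data, made up for illustration
toy_tweets = {
    "alice": ["RT @bob: hello", "RT @bob: again", "RT @outsider: hi", "plain tweet"],
    "bob": ["RT @alice: thanks", "RT @bob: my own tweet"],
}
edges = retweet_edges(toy_tweets)
print(edges)
```

The resulting counts could then be loaded into a NetworkX `DiGraph` with `add_weighted_edges_from`, passing `(retweeter, author, weight)` triples. Note that the comparison of handles here is case-sensitive, which may or may not match how the tweet files are named.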

In [1]:
# Exercise 1

import re
import io   # read files as unicode text
import os
import csv

# handles of the house members (each tweet file is named after a handle)
house_members = os.listdir("tweets/")

# read the party of each handle
parties = []
party_handles = []
with open('parties.csv', 'rb') as csvfile:
    next(csvfile, None)  # skip the header line (if any)
    for row in csv.reader(csvfile):
        parties.append(row[0])        # column 0: party
        party_handles.append(row[1])  # column 1: Twitter handle

# one record per house member
class Member(object):
    def __init__(self, handle, party, retweet_handles):
        self.handle = handle
        self.party = party
        self.retweet_handles = retweet_handles

members = []

for filename in house_members:
    handles = []
    with io.open("tweets/" + filename, 'r', encoding='utf-8') as tweet_file:
        for line in tweet_file:
            # retweets start with "RT @originalAuthor:"
            match = re.match(r'^RT @(\w+):', line)
            if match is None:
                continue
            handle = match.group(1)
            # keep each retweeted house member once; discard everyone else
            # (note: repeated retweets are collapsed here, so the edge
            # weights still have to be counted separately)
            if handle in house_members and handle not in handles:
                handles.append(handle)

    # look up the member's party (None if the handle is not in parties.csv)
    handle_party = None
    if filename in party_handles:
        handle_party = parties[party_handles.index(filename)]

    members.append(Member(filename, handle_party, handles))

for member in members:
    print "Member handle: ", member.handle
    print "Member party: ", member.party
    print "Retweet handles: ", member.retweet_handles
    print "\n"
    


Member handle:  AustinScottGA08
Member party:  Republican
Retweet handles:  [u'AustinScottGA08']


Member handle:  BennieGThompson
Member party:  Democratic
Retweet handles:  [u'RepCummings', u'RepJeffries', u'RepDebDingell', u'RepJohnYarmuth', u'RepBonnie', u'RepCheri', u'RepVeasey', u'FrankPallone', u'RepBarbaraLee', u'RepRaskin', u'BobbyScott', u'RepStephMurphy', u'RepGwenMoore', u'RepDanKildee', u'RepDennyHeck', u'RepJerryNadler', u'GKButterfield', u'RepAnnieKuster', u'RepKarenBass', u'RepBeatty', u'RepJuanVargas', u'WhipHoyer', u'RepWilson', u'NancyPelosi', u'RepJayapal', u'RepEspaillat', u'RepTedLieu', u'RepTimWalz', u'RepKClark', u'RepMarciaFudge', u'RepMcEachin', u'RepAlGreen', u'RepRichmond', u'RepEsty']


Member handle:  BettyMcCollum04
Member party:  Democratic
Retweet handles:  [u'RepBeatty', u'RepCummings', u'RepJeffries', u'RepDebDingell', u'RepJohnYarmuth', u'repmarkpocan', u'RepRaulGrijalva', u'RepRichardNeal', u'USRepMikeDoyle', u'FrankPallone', u'RepDelBene']


[...]

Retweet handles:  [u'SteveScalise', u'cathymcmorris', u'RepChuck']


Member handle:  RepDavidRouzer
Member party:  Republican
Retweet handles:  [u'SteveScalise', u'SpeakerRyan']


Member handle:  repdavidscott
Member party:  Democratic
Retweet handles:  [u'RepHankJohnson', u'BobbyScott', u'RepLawrence']


Member handle:  RepDavidValadao
Member party:  Republican
Retweet handles:  [u'RepSeanDuffy', u'RepJimmyPanetta']


Member handle:  RepDavidYoung
Member party:  Republican
Retweet handles:  [u'chelliepingree', u'RepStefanik']


Member handle:  RepDebDingell
Member party:  Democratic
Retweet handles:  [u'RepMcGovern', u'RepPaulMitchell', u'RepWalberg', u'JacksonLeeTX18', u'RepJayapal', u'RepAdams', u'RepSpeier', u'RepRaulGrijalva', u'RepHankJohnson', u'RepFredUpton']


Member handle:  RepDelBene
Member party:  Democratic
Retweet handles:  [u'RepBonamici', u'RepTerriSewell']


Member handle:  RepDennisRoss
Member party:  Republican
Retweet handles:  [u'SteveScalise', u'RepMikeRogersAL', [...]

Member party:  Democratic
Retweet handles:  [u'RepBeatty', u'RepJeffries', u'RepJoeKennedy', u'RepDebDingell', u'RepBrady', u'RepJohnYarmuth', u'WhipHoyer', u'RepDanDonovan', u'SusanWBrooks', u'NitaLowey', u'USRepMikeDoyle', u'RepKarenBass', u'RepSarbanes', u'chelliepingree', u'RepJayapal', u'RepTerriSewell', u'RepCheri', u'JuliaBrownley26', u'RepMikeQuigley', u'USRepKCastor', u'RepLawrence', u'RepSpeier', u'NancyPelosi', u'RepRaulGrijalva', u'RepGwenMoore', u'RepRichardNeal']


Member handle:  RepLouBarletta
Member party:  Republican
Retweet handles:  [u'SpeakerRyan', u'RepPeteKing']


Member handle:  RepLouCorrea
Member party:  Democratic
Retweet handles:  [u'RepMikeCapuano', u'USRepKCastor', u'repsandylevin', u'RepAnthonyBrown', u'BillPascrell', u'repjoecrowley', u'RepTimWalz', u'RepSchneider', u'WhipHoyer', u'RepMattGaetz']


Member handle:  RepLoudermilk
Member party:  Republican
Retweet handles:  [u'RepTomGraves']


Member handle:  replouiegohmert
Member party:  Republican
[...]

Retweet handles:  [u'SteveScalise', u'RepKinzinger', u'RosLehtinen']


Member handle:  RepTerriSewell
Member party:  Democratic
Retweet handles:  [u'RepSarbanes', u'RepTerriSewell', u'GKButterfield', u'RepVeasey', u'RepRoybalAllard']


Member handle:  RepThomasMassie
Member party:  Republican
Retweet handles:  [u'justinamash', u'Jim_Jordan', u'RepMarkMeadows']


Member handle:  RepThompson
Member party:  Democratic
Retweet handles:  []


Member handle:  RepTimRyan
Member party:  Democratic
Retweet handles:  [u'RepBarbaraLee', u'RepMarcyKaptur']


Member handle:  RepTimWalz
Member party:  Democratic
Retweet handles:  [u'RepTimWalz', u'RepDelBene', u'RepScottPeters']


Member handle:  RepTipton
Member party:  Republican
Retweet handles:  [u'SpeakerRyan', u'RepTomGraves', u'cathymcmorris']


Member handle:  RepTomEmmer
Member party:  Republican
Retweet handles:  [u'SpeakerRyan', u'PatrickMcHenry']


Member handle:  RepTomGarrett
Member party:  Republican
Retweet handles:  [u'RepMattGaetz' [...]

_Exercise_ 2: Visualize the network of retweets and investigate differences between the parties.
* Visualize the network using the [NetworkX draw function](https://networkx.github.io/documentation/stable/reference/generated/networkx.drawing.nx_pylab.draw.html#networkx.drawing.nx_pylab.draw), with node coordinates from the Force Atlas algorithm (see Week 5, Exercise 2). _Hint: use the undirected version of the graph to find the node positions for better results, but stick to the directed version for all measurements._ Color the nodes according to their party (e.g. 'red' for republicans and 'blue' for democrats) and set the node size proportional to the total degree.
  * Compare the network of retweets with the network of Wikipedia pages (Week 5, Exercise 2). Do you observe any differences? How do you explain them?
* Now set the node size proportional to betweenness centrality. What do you observe?
* Repeat the point above using eigenvector centrality instead. Is there any difference? Can you explain why?
* Which three nodes have the highest degree within each party? The highest eigenvector centrality? The highest betweenness centrality?
* Plot on the same figure the distribution of outgoing strength for the republican and democratic nodes (i.e. the sum of the weights on outgoing links). Which party is more active in retweeting other members of the house?
* Find the 3 members of the republican party who most often retweeted tweets from democratic members. Repeat the measurement for the democratic members. Can you explain your results by looking at the Wikipedia pages of these members of the house?
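
The last two points can be illustrated without any graph library. The sketch below is written for Python 3 and assumes the weighted edges are available as a dict from `(source, target)` to weight and the parties as a dict from handle to party name; the helper names and the toy data are hypothetical.

```python
from collections import defaultdict


def out_strength(edges):
    """Sum of the weights on the outgoing links of each node."""
    strength = defaultdict(int)
    for (source, _target), weight in edges.items():
        strength[source] += weight
    return dict(strength)


def cross_party_retweets(edges, party):
    """Number of retweets each node gave to members of the other party."""
    crossing = defaultdict(int)
    for (source, target), weight in edges.items():
        if party[source] != party[target]:
            crossing[source] += weight
    return dict(crossing)


def top(scores, n=3):
    """The n highest-scoring (node, score) pairs, ties broken alphabetically."""
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:n]


# toy data, made up for illustration
toy_edges = {("a", "b"): 3, ("a", "c"): 1, ("b", "c"): 2, ("c", "a"): 4}
toy_party = {"a": "Republican", "b": "Republican", "c": "Democratic"}

strength = out_strength(toy_edges)
crossing = cross_party_retweets(toy_edges, toy_party)
print(top(crossing))
```

To answer the question per party, the `crossing` scores would first be filtered by `toy_party` before calling `top`. With a NetworkX `DiGraph`, `G.out_degree(weight='weight')` should give the same outgoing strength.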

_Exercise_ 3: Community detection.

* Use your favorite method of community detection to find communities in the full house of representatives network. Report the value of modularity found by the algorithm. Is it higher or lower than what you found for the Wikipedia network? Comment on your result.
* Visualize the network, using the Force Atlas algorithm (see Lecture 5, exercise 2). This time assign each node a different color based on their _community_. Describe the structure you observe.
* Compare the communities found by your algorithm with the parties by creating a matrix $\mathbf{D}$ with dimension $(B \times C)$, where $B$ is the number of parties and $C$ is the number of communities. We set entry $D(i,j)$ to be the number of nodes that party $i$ has in common with community $j$. The matrix $\mathbf{D}$ is what we call a [**confusion matrix**](https://en.wikipedia.org/wiki/Confusion_matrix). 
*  [Plot the confusion matrix](https://scipython.com/book/chapter-7-matplotlib/examples/visualizing-a-matrix-with-imshow/) and explain how well the communities you've detected correspond to the parties. Consider the following questions:
  * Are there any republicans grouped with democrats (and vice versa)?
  * Does the community detection algorithm sub-divide the parties? Do you know anything about American politics that could explain such sub-divisions? Answer in your own words.
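
As a minimal sketch of how the matrix $\mathbf{D}$ could be filled, assume one dict that maps each node to its party and another that maps each node to its community label (both layouts, the function name and the toy data are assumptions for illustration).

```python
def confusion_matrix(party, community):
    """Build D, where D[i][j] counts the nodes of party i in community j.

    party and community both map a node to a label (hypothetical layout).
    Returns the row labels, the column labels and the matrix as nested lists.
    """
    parties = sorted(set(party.values()))
    communities = sorted(set(community.values()))
    row = {name: i for i, name in enumerate(parties)}
    col = {label: j for j, label in enumerate(communities)}
    D = [[0] * len(communities) for _ in parties]
    for node, node_party in party.items():
        D[row[node_party]][col[community[node]]] += 1
    return parties, communities, D


# toy data, made up for illustration
toy_party = {"a": "Republican", "b": "Republican", "c": "Democratic", "d": "Democratic"}
toy_community = {"a": 0, "b": 1, "c": 1, "d": 1}

parties_found, communities_found, D = confusion_matrix(toy_party, toy_community)
print(parties_found, communities_found, D)
```

The nested list `D` can be passed directly to matplotlib's `imshow` for plotting. In the toy example, one republican node ends up in the community that otherwise contains democrats, which shows up as an off-diagonal count.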

# Part 2: What do republican and democratic members tweet about?

_Exercise_ 4: TF-IDF of the republican and democratic tweets.

We will create two documents, one containing the words extracted from the tweets of republican members and the other from the tweets of democratic members. We will then use TF-IDF to compare the content of these two documents and create a word cloud for each. The procedure is exactly the same as the one you used in Exercise 2 of Week 7. The main steps are summarized below:
* Create two large documents, one for the democratic and one for the republican party. Tokenize the tweets, and combine the tokens into one long list that includes all the tweets of the members of the same party.
  * Exclude all twitter handles.
  * Exclude punctuation.
  * Exclude stop words (if you don't know what stop words are, go back and read NLPP1e again).
  * Exclude numbers (since they're difficult to interpret in the word cloud).
  * Set everything to lower case.
  * Compute the TF-IDF for each document.
* Now, create a word cloud for each party. Are these topics less "boring" than the Wikipedia topics from a few weeks ago? Why? Comment on the results.
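
The cleaning steps listed above can also be applied in a single pass with regular expressions. The following Python 3 sketch is one possible way to do it, not the method prescribed by the exercise: the tiny `STOP_WORDS` set is a stand-in for NLTK's English stop word list, and treating "rt" as a stop word is an assumption.

```python
import re

# tiny stand-in stop word list (an assumption; the exercise uses NLTK's English list)
STOP_WORDS = {"the", "a", "an", "and", "to", "of", "in", "is", "for", "on", "rt"}

HANDLE = re.compile(r'@\w+')   # a Twitter handle such as @SpeakerRyan
WORD = re.compile(r'[a-z]+')   # a run of lower-case ASCII letters


def preprocess(tweet, stop_words=STOP_WORDS):
    """Return the cleaned, lower-case tokens of one tweet."""
    text = HANDLE.sub(' ', tweet)   # exclude Twitter handles
    text = text.lower()             # set everything to lower case
    tokens = WORD.findall(text)     # letters only: drops punctuation and numbers
    return [token for token in tokens if token not in stop_words]


tokens = preprocess("RT @SpeakerRyan: The tax bill passed in 2017!")
print(tokens)
```

This shortcut has limits: URLs are split into fragments such as "https" and "co", contractions are split at the apostrophe, and non-ASCII letters are dropped, so a proper tokenizer may still be preferable.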

In [2]:
import nltk

dem_tokens = [];
dem_tweets = "";
rep_tokens = [];
rep_tweets = "";

# find all files in folder
for filename in os.listdir("tweets/"):
    # read all tweets of this member
    filepath = "tweets/" + filename
    with io.open(filepath, 'r', encoding='utf-8') as tweet_file:
        tweets = tweet_file.read()

    # find member party and add the tweets to that party's document
    for index, handle in enumerate(party_handles):
        if filename == handle:
            if parties[index] == "Democratic":
                dem_tokens += nltk.word_tokenize(tweets)
                dem_tweets += tweets
            else:
                rep_tokens += nltk.word_tokenize(tweets)
                rep_tweets += tweets
            break

In [3]:
# Exclude twitter handles.
def remove_handles(tokens):
    # word_tokenize splits "@handle" into "@" and "handle",
    # so drop each "@" together with the token that directly follows it
    kept = []
    skip_next = False
    for token in tokens:
        if skip_next:
            skip_next = False
        elif token == "@":
            skip_next = True
        else:
            kept.append(token)
    return kept

dem_tokens = remove_handles(dem_tokens)
rep_tokens = remove_handles(rep_tokens)

In [4]:
#remove punctuation and lowercase
dem_tokens = [word.lower() for word in dem_tokens if word.isalpha()]
rep_tokens = [word.lower() for word in rep_tokens if word.isalpha()]

In [5]:
#remove stopwords
nltk.download('stopwords');
from nltk.corpus import stopwords #remove of stopwords

stop = set(stopwords.words('english'))  # build the set once; membership tests are fast
dem_tokens = [word for word in dem_tokens if word not in stop]
rep_tokens = [word for word in rep_tokens if word not in stop]

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\fpegios\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [6]:
# exclude numbers (a safeguard: the isalpha() filter above already drops purely numeric tokens)
dem_tokens = [item for item in dem_tokens if not item.isdigit()]
rep_tokens = [item for item in rep_tokens if not item.isdigit()]

In [7]:
# print tokens to text files, one token per line
with io.open('dem_tweets.txt', 'w', encoding='utf-8') as dem_tweets_file:
    for word in dem_tokens:
        dem_tweets_file.write(word + u"\n")

with io.open('rep_tweets.txt', 'w', encoding='utf-8') as rep_tweets_file:
    for word in rep_tokens:
        rep_tweets_file.write(word + u"\n")

In [8]:
# calculate the TF
from collections import Counter

def tf_table(tokens):
    # relative frequency of every distinct token in one document
    counts = Counter(tokens)
    total = float(len(tokens))
    return {word: count / total for word, count in counts.items()}

dem_tf = tf_table(dem_tokens)
rep_tf = tf_table(rep_tokens)

In [None]:
# calculate the IDF and the TF-IDF over the two party documents
from collections import Counter
from math import log

documents = {"Democratic": dem_tokens, "Republican": rep_tokens}
vocabularies = {party: set(tokens) for party, tokens in documents.items()}

def idf(term):
    # log of (number of documents / number of documents containing the term);
    # 0.0 is returned if the term does not appear in any document.
    # With only two documents, idf is log(2) for terms used by one party
    # and 0.0 for terms used by both.
    matches = sum(1 for vocabulary in vocabularies.values() if term in vocabulary)
    return log(len(documents) / float(matches)) if matches else 0.0

def tf_idf(tokens):
    # TF-IDF score of every distinct token in one document
    counts = Counter(tokens)
    total = float(len(tokens))
    return {term: (count / total) * idf(term) for term, count in counts.items()}

dem_tf_idf = tf_idf(dem_tokens)
rep_tf_idf = tf_idf(rep_tokens)

# Part 3: Sentiment analysis

In [15]:
import nltk
nltk.download('gutenberg')
nltk.download('genesis')
nltk.download('inaugural')
nltk.download('nps_chat')
nltk.download('webtext')
nltk.download('treebank')

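from nltk.book import * # load the example texts of the NLTK book

from nltk.corpus import * # expose all NLTK corpus readers
from string import punctuation
# note: this rebinds the name `stopwords` from the corpus reader to a plain word list
stopwords = nltk.corpus.stopwords.words('english')

from itertools import izip # Python 2 only
import matplotlib.pyplot as plt # plot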

[nltk_data] Downloading package gutenberg to
[nltk_data]     C:\Users\fpegios\AppData\Roaming\nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!
[nltk_data] Downloading package genesis to
[nltk_data]     C:\Users\fpegios\AppData\Roaming\nltk_data...
[nltk_data]   Package genesis is already up-to-date!
[nltk_data] Downloading package inaugural to
[nltk_data]     C:\Users\fpegios\AppData\Roaming\nltk_data...
[nltk_data]   Package inaugural is already up-to-date!
[nltk_data] Downloading package nps_chat to
[nltk_data]     C:\Users\fpegios\AppData\Roaming\nltk_data...
[nltk_data]   Package nps_chat is already up-to-date!
[nltk_data] Downloading package webtext to
[nltk_data]     C:\Users\fpegios\AppData\Roaming\nltk_data...
[nltk_data]   Package webtext is already up-to-date!
[nltk_data] Downloading package treebank to
[nltk_data]     C:\Users\fpegios\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\treebank.zip.
*** Introductory Examples for the NLTK Book ***


_Exercise_ 5: Sentiment over the Twitter data.

* Download the LabMT wordlist. It's available as supplementary material from [Temporal Patterns of Happiness and Information in a Global Social Network: Hedonometrics and Twitter](http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0026752) (Data Set S1). Describe briefly how the list was generated.

The list was built by merging the 5,000 most frequent words from each of four corpora (Twitter, Google Books, the New York Times, and music lyrics), which gives 10,222 unique words. Each word was rated for happiness on a scale from 1 to 9 by 50 Amazon Mechanical Turk users. The data set is ordered by average happiness and has eight columns: (1) word, (2) rank, (3) average happiness (over the 50 user evaluations), (4) standard deviation of happiness, (5) Twitter rank, (6) Google Books rank, (7) New York Times rank, (8) music lyrics rank.

In [20]:
def dictionary(fileName):
    rows = {}
    columns = {};
    for num, line in enumerate(fileName): # reading the file
        
        if num == 0: # Get the headers line
            headings = line.split("\t") ## split it with the char "\t"
            k = 0
            for heading in headings:
                heading = heading.strip() # remove unwanted characters like \n
                rows[heading] = []
                columns[k] = heading
                k += 1
                
        else:
            cells = line.split("\t") # split it with the char "\t"
            k = 0
            for cell in cells:
                cell = cell.strip() # remove unwanted characters like \n
                rows[columns[k]] += [cell]
                k += 1
                
    return rows, columns

# the file name must be a string; the with-block closes the file automatically
with open(os.path.join(os.getcwd(), 'dataset.txt'), 'r') as f:
    rows, columns = dictionary(f) # create a dictionary for easy reading of the wordlist

* Based on the LabMT word list, write a function that calculates sentiment given a list of tokens (the tokens should be lower case, etc).

* Create two lists: one including the tweets written by democratic members, and the other including the tweets written by republican members (in the text files, tweets are separated by newlines). 

* Calculate the sentiment of each tweet and plot the distribution of sentiment for each of the two lists. Are there significant differences between the two? Which party posts more positive tweets?

* Compute the average _m_ and standard deviation $\sigma$ of the tweet sentiment (considering tweets by both republicans and democrats).

* Now consider only tweets with sentiment lower than $m - 2\sigma$. We will refer to them as _negative_ tweets. Build a list containing the _negative_ tweets written by democrats, and one for republicans. Compute the TF-IDF for these two lists (use the same pre-processing steps as in Exercise 4). Create a word cloud for each of them. Comment on the differences between the _negative_ contents posted by republicans and democrats.

* Repeat the point above, but considering _positive_ tweets instead (i.e. with sentiment larger than $m + 2\sigma$). Comment on your results.
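
The sentiment function and the two thresholds can be sketched as below. This is a minimal Python 3 illustration under several assumptions: the LabMT list is already loaded into a dict from word to average happiness, a tweet's sentiment is the plain mean over the tokens found in that dict, and the population standard deviation is used for $\sigma$. The function names and toy values are made up.

```python
from statistics import mean, pstdev


def sentiment(tokens, happiness):
    """Average happiness of the tokens found in the word list.

    happiness maps a lower-case word to its average happiness score.
    Returns None if no token is in the list.
    """
    scores = [happiness[token] for token in tokens if token in happiness]
    return mean(scores) if scores else None


def split_by_sentiment(scored_tweets, k=2):
    """Split (tweet, score) pairs into the negative and positive tails.

    A tweet is negative if its score is below m - k*sigma and positive if
    it is above m + k*sigma, where m and sigma are computed over all scores.
    """
    scores = [score for _tweet, score in scored_tweets]
    m, sigma = mean(scores), pstdev(scores)
    negative = [tweet for tweet, score in scored_tweets if score < m - k * sigma]
    positive = [tweet for tweet, score in scored_tweets if score > m + k * sigma]
    return m, sigma, negative, positive


# toy values, made up for illustration (not real LabMT scores)
toy_happiness = {"happy": 8.0, "sad": 2.0, "day": 6.0}
example_score = sentiment(["happy", "day", "xyz"], toy_happiness)

# nine neutral tweets and one clearly low-scoring tweet
toy_scored = [("t%d" % i, 5.0) for i in range(9)] + [("t9", 1.0)]
m, sigma, negative, positive = split_by_sentiment(toy_scored)
print(example_score, m, sigma, negative, positive)
```

Tweets for which `sentiment` returns `None` would have to be dropped before calling `split_by_sentiment`. Whether to use the population (`pstdev`) or sample (`stdev`) standard deviation is a design choice that matters little for a large number of tweets.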