### GESIS Fall Seminar in Computational Social Science 2022
### Introduction to Computational Social Science with Python
# Day 5-3: Analysis of Non-Rectangular Data

## Overview

* Network analysis with NetworkX
* Text analysis with NLTK

## Network analysis with [NetworkX](https://networkx.org/documentation/stable/)
* NetworkX is a Python package for network analysis.
* We can use it to study social networks!
    - Networks can be explicitly social, e.g., who is friends with whom.
    - We can also study networks created through social processes, e.g., hyperlink networks on the WWW, retweet networks, or citation networks.

### Network basics
* **Graph**
    - Mathematical term for a network.
* **Node** (or **Vertex**)
    - The 'points' or individuals in the network.
* **Edge**
    - The connections between nodes.
* **Directed / Undirected**
    - Edges can have a direction (e.g., @Dave2008 follows @BarackObama on Twitter), or not have a direction (e.g., Dustin and Steve are friends).
* **Weighted / Unweighted**
    - Edges can have an associated weight or be unweighted (e.g., if we study a network of emails within an organisation, we can choose to include the number of emails exchanged as the edge weight).
    
    
![network_diagram](figs/network_diagram.svg "network_diagram")
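
The directed/undirected and weighted/unweighted distinctions above can be sketched in NetworkX as follows. This is a minimal illustration: the email counts are made-up numbers, and the names are reused from the examples in the list.

```python
import networkx as nx

# Directed graph: the edge runs from the follower to the followed account
follows = nx.DiGraph()
follows.add_edge("Dave2008", "BarackObama")
print(follows.has_edge("Dave2008", "BarackObama"))  # True
print(follows.has_edge("BarackObama", "Dave2008"))  # False

# Undirected graph: friendship has no direction
friends = nx.Graph()
friends.add_edge("Dustin", "Steve")
print(friends.has_edge("Steve", "Dustin"))  # True

# Weighted graph: store the (made-up) number of emails as an edge attribute
emails = nx.Graph()
emails.add_edge("Milena", "Patrick", weight=12)
emails.add_edge("Patrick", "Julius", weight=3)
print(emails["Milena"]["Patrick"]["weight"])  # 12

# Weighted degree sums the weights of a node's edges
weighted_degree = dict(emails.degree(weight="weight"))
print(weighted_degree)  # {'Milena': 12, 'Patrick': 15, 'Julius': 3}
```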

In [None]:
import networkx as nx
import matplotlib.pyplot as plt

# We can manually create a network (graph)
g = nx.Graph() # By default an unweighted, undirected graph

g.add_nodes_from(["Milena", "Patrick", "Julius", "Erica", "Helen", "Barack"])
print(g.nodes()) # we have nodes, but no edges
print(g.edges())

In [None]:
# Let's add some edges
edgelist = [("Milena", "Patrick"),
           ("Patrick", "Julius"),
           ("Erica", "Helen"),
           ("Patrick", "Helen"),
           ("Erica", "Milena"),
           ("Erica", "Helen")]

g.add_edges_from(edgelist) # add edges in the edgelist to our network
# (we could skip the add_nodes_from stage, but would miss Barack from the network)

nx.draw(g, with_labels=True) # plot the network (this is just a matplotlib axes!)
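
The edge list above contains the pair ("Erica", "Helen") twice. A quick check, rebuilding the same small graph so the block runs on its own, suggests that a plain `nx.Graph` stores a repeated pair only once and that Barack stays in the network as an isolated node.

```python
import networkx as nx

g = nx.Graph()
g.add_nodes_from(["Milena", "Patrick", "Julius", "Erica", "Helen", "Barack"])
g.add_edges_from([("Milena", "Patrick"),
                  ("Patrick", "Julius"),
                  ("Erica", "Helen"),
                  ("Patrick", "Helen"),
                  ("Erica", "Milena"),
                  ("Erica", "Helen")])  # repeated pair

print(g.number_of_nodes())  # 6
print(g.number_of_edges())  # 5, the repeated pair is counted once

isolated = list(nx.isolates(g))  # nodes without any edge
print(isolated)  # ['Barack']

patrick_friends = sorted(g.neighbors("Patrick"))
print(patrick_friends)  # ['Helen', 'Julius', 'Milena']
```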

In [None]:
# We can also import data
# This is a school social network where students name who they are friends with
# A directed edge goes from the naming to the named student

dixon_g = nx.read_edgelist("data/dixon_edgelist.txt", create_using=nx.DiGraph()) # Read as a directed graph
print(dixon_g) # summary: graph type, number of nodes and edges (nx.info() was removed in NetworkX 3.0)

plt.figure(figsize=(16,10))
nx.draw(dixon_g, node_size=20, width=.2) # draw the graph (custom node size and edge width)
plt.show()
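
To see what an edge list looks like without the data file, here is a small sketch using `nx.parse_edgelist` on a few hand-written lines. It assumes the simplest layout (one whitespace-separated "source target" pair per line); the node names are invented, and the real `dixon_edgelist.txt` may differ.

```python
import networkx as nx

# Hypothetical edge list: each line is "naming_student named_student"
lines = ["n1 n2", "n2 n3", "n3 n1", "n1 n3"]

toy = nx.parse_edgelist(lines, create_using=nx.DiGraph)

print(toy.number_of_nodes())  # 3
print(toy.number_of_edges())  # 4
print(toy.has_edge("n1", "n2"))  # True: n1 names n2
print(toy.has_edge("n2", "n1"))  # False: n2 does not name n1
```

Reading the same lines into an undirected `nx.Graph` would merge "n3 n1" and "n1 n3" into a single edge, which is why the direction has to be requested explicitly.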

In [None]:
# More information, such as node labels and attributes, can be imported if we use another file type
# Several different graph file types can be read by networkx (GraphML, Pajek, GML, JSON, Pickle...)

dixon_g = nx.read_graphml('data/dixon_network.graphml') # read the GraphML file
print(dixon_g) # summary: graph type, number of nodes and edges (nx.info() was removed in NetworkX 3.0)


In [None]:
# We can look at the full list of nodes
print(dixon_g.nodes(data=True))

# or the attributes of a specific node
print(dixon_g.nodes['n100'])

# or an attribute for every node
print(dixon_g.nodes(data='race'))
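
The same access patterns can be tried on a toy graph. In this sketch the `club` attribute and the node names are hypothetical and only serve to show how attributes are stored, read, and used to filter nodes.

```python
import networkx as nx

toy = nx.Graph()
toy.add_node("n1", club="chess")  # keyword arguments become node attributes
toy.add_node("n2", club="drama")
toy.add_node("n3", club="chess")
toy.add_edge("n1", "n2")

print(toy.nodes["n1"])  # {'club': 'chess'}

clubs = dict(toy.nodes(data="club"))  # one attribute for every node
print(clubs)  # {'n1': 'chess', 'n2': 'drama', 'n3': 'chess'}

# Select the nodes that share an attribute value
chess = [n for n, club in toy.nodes(data="club") if club == "chess"]
print(chess)  # ['n1', 'n3']
```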


### Centrality
* Which are the most important nodes in the network?
* How do we measure the importance of nodes?

In [None]:
# Degree is the number of edges attached to a node
# In-degree is the number of edges pointing into a node (how many students name this student as a friend)
# Out-degree is the number of edges pointing out of a node (how many friends this student names)

deg = dixon_g.degree() # total degree (in-degree + out-degree in a directed graph)
indeg = dixon_g.in_degree()
outdeg = dixon_g.out_degree()
print(deg)

plt.hist(dict(deg).values(), bins=20) # plot distribution
plt.ylabel('Count')
plt.xlabel('Degree')
plt.title('Degree distribution of students at Dixon High')
plt.show()

print('Max degree =', max(dict(deg).values()))

In [None]:
# PageRank is another centrality measure, related to in-degree
# PageRank takes into account not just the number of neighbours, but the importance of those neighbours too
# If popular people name you as their friend, you have higher PageRank centrality than if unpopular people name you

pr = nx.pagerank(dixon_g) # get PageRank of nodes

plt.hist(pr.values(), bins=20) # plot distribution
plt.ylabel('Count')
plt.xlabel('PageRank')
plt.title('PageRank distribution of students at Dixon High')
plt.show()
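
The idea that "who names you" matters can be illustrated on an invented toy network. Here A and B are each named by exactly one person, but A is named by the popular node P while B is named by U, whom nobody names. This is only a sketch using the default `nx.pagerank` settings.

```python
import networkx as nx

toy = nx.DiGraph()
toy.add_edges_from([("X1", "P"), ("X2", "P"), ("X3", "P"),  # P is popular
                    ("P", "A"),                             # P names A
                    ("U", "B")])                            # U (named by nobody) names B

pr = nx.pagerank(toy)

# Same in-degree, different PageRank
print(toy.in_degree("A"), toy.in_degree("B"))
print(round(pr["A"], 3), round(pr["B"], 3))
print(pr["A"] > pr["B"])  # True
```

In-degree alone cannot tell A and B apart, whereas PageRank ranks A higher because its single nomination comes from a well-connected node.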

In [None]:
# Let's find the mean PageRank for white vs black students
race = dict(dixon_g.nodes(data='race')) # look up the race attribute once for all nodes
pr_w = [v for k, v in pr.items() if race[k]=='W']
pr_b = [v for k, v in pr.items() if race[k]=='B']

print('White student mean PageRank', sum(pr_w)/len(pr_w))
print('Black student mean PageRank', sum(pr_b)/len(pr_b))

### Much more to networkx
* Community detection, shortest paths, graph similarity, connectivity, k-core, ...
* Other Python network packages: graph-tool, igraph
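
As a brief taste of some of these tools, the sketch below runs shortest paths, connected components, and k-core on the small friendship graph built earlier (rebuilt here so the block is self-contained). It is an illustration of the function calls, not a full analysis.

```python
import networkx as nx

g = nx.Graph()
g.add_nodes_from(["Milena", "Patrick", "Julius", "Erica", "Helen", "Barack"])
g.add_edges_from([("Milena", "Patrick"), ("Patrick", "Julius"), ("Erica", "Helen"),
                  ("Patrick", "Helen"), ("Erica", "Milena")])

# Shortest paths: how many steps separate Julius and Erica?
steps = nx.shortest_path_length(g, "Julius", "Erica")
print(steps)  # 3

# Connectivity: Barack forms a component of his own
n_components = nx.number_connected_components(g)
print(n_components)  # 2

# k-core: the subgraph in which every node has at least k neighbours
core = sorted(nx.k_core(g, k=2).nodes())
print(core)  # ['Erica', 'Helen', 'Milena', 'Patrick']
```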

## 🏋️‍♀️ PRACTICE

In [None]:
# Q1: Plot a scatter graph of indegree vs outdegree for students at Dixon High


In [None]:
# Q2: Print the mean PageRank centrality for students at Dixon High in each grade 7-12


## Text analysis with [NLTK](https://www.nltk.org/)
* The Natural Language Toolkit (NLTK) is a suite of libraries for Natural Language Processing (NLP) in Python.
* NLTK can perform part-of-speech tagging, named entity recognition, sentiment analysis, word embeddings, etc.
* It integrates well with many other NLP packages.

In [None]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

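# nltk.download() # NLTK relies on many datasets (popular texts, stopwords, languages); this opens the interactive downloader for all of them
nltk.download("stopwords") # stopword lists used below
nltk.download("punkt") # tokenizer models needed by word_tokenize (newer NLTK releases may ask for "punkt_tab" instead)
nltk.download("wordnet") # WordNet data needed by WordNetLemmatizer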

# Read Biden speech
with open('data/biden_inauguration_millercenter.txt', 'r') as f:
    bidenspeech = f.read()


In [None]:
# Let's tokenize the speech, then remove punctuation and stopwords

words = word_tokenize(bidenspeech) # tokenize the speech
print(words[:20])
words = [x for x in words if x.isalpha()] # keep purely alphabetic tokens (drops punctuation, numbers, and tokens like "'s")
print(words[:20])
stop_set = set(stopwords.words("english")) # build the stopword set once instead of reloading the list for every token
words = [x for x in words if x.lower() not in stop_set] # remove stopwords
print(words[:20])
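
The two filtering steps can be followed on a short example without any NLTK data. In this sketch the token list is written by hand and `toy_stopwords` is a hypothetical three-word stand-in for `stopwords.words("english")`, so the output only illustrates the logic.

```python
# Hand-written tokens, roughly what a tokenizer might return for a short sentence
tokens = ["My", "fellow", "Americans", ",", "this", "is", "democracy", "'s", "day", "."]

# Hypothetical stand-in for the NLTK English stopword list
toy_stopwords = {"my", "this", "is"}

# Step 1: keep purely alphabetic tokens (drops ",", "." and "'s")
alpha = [t for t in tokens if t.isalpha()]
print(alpha)  # ['My', 'fellow', 'Americans', 'this', 'is', 'democracy', 'day']

# Step 2: drop stopwords, comparing in lower case so "My" matches "my"
content = [t for t in alpha if t.lower() not in toy_stopwords]
print(content)  # ['fellow', 'Americans', 'democracy', 'day']
```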

In [None]:
# We can also 'stem' the words (strip suffixes to approximate the root word)
stemmer = PorterStemmer() # create the stemmer once and reuse it
stemmed = [stemmer.stem(w) for w in words]
print(stemmed[:20])
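
A few hand-picked words show what the Porter stemmer does in practice. This small sketch needs no downloaded data; the example words are chosen for illustration and are not taken from the speech.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
examples = ["running", "studies", "Running"]

stems = {w: stemmer.stem(w) for w in examples}
print(stems)  # {'running': 'run', 'studies': 'studi', 'Running': 'run'}
```

Two quirks are visible: a stem such as "studi" is not a dictionary word, and the stemmer lower-cases its input, which is one reason capitalised names can look odd in the stemmed list.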

In [None]:
# And lemmatize the words (map each word to its dictionary form, so no chopped-off endings)
# Note that names are handled better here
lemmatizer = WordNetLemmatizer() # create the lemmatizer once and reuse it
lemmed = [lemmatizer.lemmatize(w) for w in words]
print(lemmed[:20])

In [None]:
# Let's count how often each word is used

import pandas as pd
wordcounts = pd.Series(lemmed).value_counts()
wordcounts.head(10)
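
One thing to keep in mind when counting is that "America" and "america" are different strings. The sketch below uses `collections.Counter` from the standard library (a swapped-in alternative to `pd.Series.value_counts`) on an invented word list to show how lower-casing before counting changes the result.

```python
from collections import Counter

# Invented word list with mixed capitalisation
toy_words = ["America", "america", "unity", "Unity", "unity", "day"]

raw_counts = Counter(toy_words)                        # case-sensitive
folded_counts = Counter(w.lower() for w in toy_words)  # case-insensitive

print(raw_counts.most_common(1))     # [('unity', 2)]
print(folded_counts.most_common(2))  # [('unity', 3), ('america', 2)]
```

Whether to fold case depends on the question: it merges sentence-initial words with their lower-case form, but it also loses the distinction between a name and a common word.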

## 🏋️‍♀️ PRACTICE

In [None]:
# Q3: Repeat the tasks above in NLTK with the Trump speech (data/trump_inauguration_millercenter.txt)
# Who used more unique words? Compare the top words used.
# There are some mistakes/quirks in the stemmed & lemmed word lists. Can you explain any of them?
