This notebook is part of the project. The template is course material for CS-E4730 Computational Social Science at Aalto University and is available for download from A+ at plus.cs.aalto.fi. The notebook also contains functions that fetch data from Wikipedia and parse some of the source data.

For the Wikipedia-API package, see the documentation here:
https://pypi.org/project/Wikipedia-API/

In [1]:
import wikipediaapi, time, json,requests,os
#from tqdm import tqdm # you can import this for progress bar instead if you are not using notebooks
from tqdm.notebook import tqdm

def ensure_person_data():
    """Ensures the existence of the person-data.tsv file.
    
    For downloading the file 'person-data.tsv', please go to https://search.gesis.org/research_data/SDN-10.7802-1515

    Raises:
        Exception: If the person-data.tsv file is not found in the current directory.
    """
    if not os.path.isfile("person-data.tsv"):
        raise Exception("For downloading the file 'person-data.tsv', please go to https://search.gesis.org/research_data/SDN-10.7802-1515")
    
def ensure_gender_data():
    """Ensures the existence of the gender data file and downloads it from a remote URL if it is not found.
    
    The file is downloaded from http://www.cs.cmu.edu/~ark/bio/data/wiki.genders.txt
    """
    if not os.path.isfile("wiki.genders.txt"):
        print("Downloading the gender data file...")
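        response = requests.get("http://www.cs.cmu.edu/~ark/bio/data/wiki.genders.txt", allow_redirects=True)
        response.raise_for_status()  # do not save an error page as the data file
        with open('wiki.genders.txt', 'wb') as gfile:  # the file is closed even if writing fails
            gfile.write(response.content)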
    
    
def filter_persons_by(occupation=None,birth_less=None,birth_more=None,nationality=None):
    """
    Filters persons from the person-data.tsv file based on specified criteria.

    Args:
        occupation (str, optional): The occupation of the person. Defaults to None.
        birth_less (int, optional): The upper bound of the birth year of the person. Defaults to None.
        birth_more (int, optional): The lower bound of the birth year of the person. Defaults to None.
        nationality (str, optional): The nationality of the person. Defaults to None.

    Returns:
        dict: A dictionary of persons that match the specified criteria. The keys are the person names and the values are 
        dictionaries containing the person's attributes.
    """
    ensure_person_data()
    persons={}
    # Use a context manager so the file is closed after reading
    with open("person-data.tsv",'r') as pfile:
        titles=pfile.readline().strip().split("\t")
        for line in pfile:
            person=dict(zip(titles,line.strip().split("\t")))
            if person["birthDate"]=='NA':
                birthYear=None
            else:
                birthDate=person["birthDate"].strip("[]\t' ")
                birthYear=int(birthDate.strip("-").split("-")[0])
                if birthDate[0]=="-":
                    birthYear=-birthYear

            occupation_ok=occupation is None or occupation in person["occupation"]
            nationality_ok=nationality is None or nationality in person["nationality"]
            birth_less_ok=birth_less is None or (birthYear is not None and birthYear<birth_less)
            birth_more_ok=birth_more is None or (birthYear is not None and birthYear>birth_more)

            if occupation_ok and nationality_ok and birth_less_ok and birth_more_ok:
                name=person["WikiURL"][len("http://en.wikipedia.org/wiki/"):]
                persons[name]=person
    return persons

def get_genderdata():
    """Reads a tab-separated file containing Wikipedia article information and returns a dictionary of gender data.

    The function reads a file named "wiki.genders.txt" and extracts the gender data for each name in the file, using the first letter of the gender field. The gender data is then stored in a dictionary with the name as the key and the gender abbreviation as the value.

    Returns:
        A dictionary containing gender data for each name in the file.

    Raises:
        FileNotFoundError: If the input file cannot be found or opened.

    Example:
        >>> gender_data = get_genderdata()
        >>> gender_data['Albert_Einstein']
        'M'
    """
    ensure_gender_data()
    genderdata={}
    with open("wiki.genders.txt", "r") as inputfile:
        inputfile.readline()
        for line in inputfile:
            wid,gender,name=line.strip().split("\t")
            name=name.replace(" ","_")
            genderdata[name]=gender[:1]
    return genderdata

def fill_in_genders(persons):
    """
    Fills in the gender information of persons in a dictionary.

    Args:
        persons (dict): A dictionary containing information about persons.

    Returns:
        None. The function modifies the input dictionary in place.

    Examples:
        >>> persons = {'Alice': {'age': 25}, 'Bob': {'age': 30}}
        >>> fill_in_genders(persons)
        >>> persons
        {'Alice': {'age': 25, 'gender': 'F'}, 'Bob': {'age': 30, 'gender': 'M'}}

    """
    genderdata=get_genderdata()
    for person in list(persons.keys()):
        if person in genderdata:
            gender=genderdata[person]
        else:
            gender="NA"
        persons[person]["gender"]=gender
        
def fetch_links(people,batch_size=None,lang='en'):
    """Uses the Wikipedia API to fetch Wikipedia links between the given people.
    
    The links are filled into the people dictionary in place.
    
    Note that only links between the people are saved, and if you want to inspect other links
    you should write your own fetching function.

    Args:
        people (dict): A dictionary containing names of people as keys and attributes as values.
        batch_size (int, optional): The maximum number of people to fetch links for in a single batch. Defaults to None, which means there is no maximum.
        lang (str, optional): The language in which to fetch Wikipedia links. Defaults to 'en'.

    Returns:
        bool: True if the links were not fetched for every person due to the batch size, False otherwise.
    """
    wiki = wikipediaapi.Wikipedia(lang)
    i=0
    print('Fetching link data from Wikipedia')
    pbar=tqdm(total=len(people))
    for name,attributes in people.items():
        pbar.update(1)
        if "links" not in attributes:
            page=wiki.page(name)
            links=list(map(lambda x:x.replace(" ","_"),page.links.keys()))
            plinks=list(filter(lambda x:x in people,links))
            #print(name,plinks)
            people[name]["links"]=plinks
            i+=1
            time.sleep(0.1)
        if i==batch_size:
            return True
    return False

def fetch_langs(people,batch_size=None,lang='en'):
    """Uses the Wikipedia API to fetch list of Wikipedia language editions where each person in the people 
    dictionary appears.
    
    The language editions are filled into the people dictionary in place.

    Args:
        people (dict): A dictionary containing names of people as keys and attributes as values.
        batch_size (int, optional): The maximum number of people to fetch language editions for in a single batch. Defaults to None, which means there is no maximum.
        lang (str, optional): The language edition of Wikipedia to query. Defaults to 'en'.

    Returns:
        bool: True if the language editions were not fetched for every person due to the batch size, False otherwise.
    """
    wiki = wikipediaapi.Wikipedia(lang)
    i=0
    print('Fetching language editions data from Wikipedia')
    pbar=tqdm(total=len(people))
    for name,attributes in people.items():
        pbar.update(1)
        if "langs" not in attributes:
            page=wiki.page(name)
            langs=list(page.langlinks.keys())
            #print(name,langs)
            people[name]["langs"]=langs
            i+=1
            time.sleep(0.1)
        if i==batch_size:
            return True
    return False

def fetch_summaries(people,batch_size=None,lang='en'):
    """Uses the Wikipedia API to fetch summary texts for each person in the people dictionary.
    
    The summary texts are filled into the people dictionary in place.

    Args:
        people (dict): A dictionary containing names of people as keys and attributes as values.
        batch_size (int, optional): The maximum number of people to fetch summaries for in a single batch. Defaults to None, which means there is no maximum.
        lang (str, optional): The language edition of Wikipedia to query. Defaults to 'en'.

    Returns:
        bool: True if the summaries were not fetched for every person due to the batch size, False otherwise.
    """

    wiki = wikipediaapi.Wikipedia(lang)
    i=0
    print('Fetching summary text data from Wikipedia')
    pbar=tqdm(total=len(people))
    for name,attributes in people.items():
        pbar.update(1)
        if "summary" not in attributes:
            page=wiki.page(name)
            summary=page.summary
            #print(name,summary)
            people[name]["summary"]=summary
            i+=1
            time.sleep(0.1)
        if i==batch_size:
            return True
    return False

def save_people_json(people,filename):
    with open(filename, "w") as pfile: json.dump(people,pfile)
        
def load_people_json(filename):
    with open(filename, "r") as pfile: 
        return json.load(pfile)

The next cell loads the politician data into a dictionary from the JSON file that you can download through A+. The commented-out code was used to parse and fetch the data. You can inspect how the data was created using that code and the functions in the previous cell.

In [2]:
filename="politicians.json"
if not os.path.isfile(filename):
    politicians=filter_persons_by(occupation="politician")
    fill_in_genders(politicians)
    save_people_json(politicians,filename)
    
politicians=load_people_json(filename)

## The code below fills in summaries, language editions and links from wikipedia.
## The fetching takes place in batches of 1000 queries after which the data is saved to disk.
#while fetch_summaries(politicians,batch_size=1000): save_people_json(politicians,filename)
#while fetch_langs(politicians,batch_size=1000): save_people_json(politicians,filename)
#while fetch_links(politicians,batch_size=1000): save_people_json(politicians,filename)
#save_people_json(politicians,filename)
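
The commented-out loops rely on each fetch function returning True while work remains. The pattern can be sketched offline as follows. Here `fake_fetch` is a hypothetical stand-in for `fetch_summaries` that makes no network calls, and the `checkpoints` list stands in for the calls to `save_people_json`; both are illustrative assumptions and not part of the course template.

```python
def fake_fetch(people, batch_size=None):
    """Hypothetical stand-in for fetch_summaries: fills a 'summary' key in batches."""
    fetched = 0
    for name, attributes in people.items():
        if "summary" not in attributes:
            attributes["summary"] = "summary of " + name  # placeholder, no network access
            fetched += 1
        if fetched == batch_size:
            return True   # batch is full: the caller should save and call again
    return False          # a full pass ended without filling a batch


people = {name: {} for name in ["A", "B", "C", "D", "E"]}
checkpoints = []  # stands in for save_people_json calls

while fake_fetch(people, batch_size=2):
    # Record how many entries were complete at each simulated save
    checkpoints.append(sum("summary" in p for p in people.values()))

print(checkpoints)
```

The last pass fills the remaining entry without triggering a save inside the loop, which is why the commented-out code ends with one more `save_people_json` call.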

Use the next cell to inspect what the data looks like for a single politician.

In [3]:
politicians['Benedict_Calvert,_4th_Baron_Baltimore']

{'#DBpURL': 'http://dbpedia.org/resource/Benedict_Calvert,_4th_Baron_Baltimore',
 'ID': '21',
 'WikiURL': 'http://en.wikipedia.org/wiki/Benedict_Calvert,_4th_Baron_Baltimore',
 'gender': 'M',
 'name': "[' (the right honourable) ', ' the lord baltimore ']",
 'birthDate': "[' 1679-03-21 ']",
 'deathDate': "[' 1715-04-16 ']",
 'occupation': "[' politician ']",
 'nationality': 'NA',
 'party': 'NA',
 'summary': "Benedict Leonard Calvert, 4th Baron Baltimore (21 March 1679 – 16 April 1715) was an English nobleman and politician. He was the second son of Charles Calvert, 3rd Baron Baltimore (1637–1715) by Jane Lowe, and became his father's heir upon the death of his elder brother Cecil in 1681. The 3rd Lord Baltimore was a devout Roman Catholic, and had lost his title to the Province of Maryland shortly after the events of the Glorious Revolution in 1688, when the Protestant monarchs William III and Mary II acceded to the British throne. Benedict Calvert made strenuous attempts to have his fa
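
Several fields in the record are stored as stringified lists, for example `"[' 1679-03-21 ']"` for the birth date. The sketch below shows one way to extract the birth year from such a field. It mirrors the parsing in `filter_persons_by`; the helper name `parse_birth_year` is an assumption introduced here for illustration.

```python
def parse_birth_year(raw):
    """Return the year from a raw birthDate field, or None for 'NA'."""
    if raw == "NA":
        return None
    date = raw.strip("[]\t' ")                  # drop brackets, quotes and padding
    year = int(date.strip("-").split("-")[0])   # first dash-separated part is the year
    # A leading '-' is treated as a negative year, as in filter_persons_by
    return -year if date.startswith("-") else year


print(parse_birth_year("[' 1679-03-21 ']"))
print(parse_birth_year("NA"))
```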

**Revision of prior analysis**

From the dictionary of politicians created above, we build a network by looping over the politicians and creating an edge from each politician to every politician listed in their links.

In [4]:
import math
import networkx as nx

def construct_network():
    """
    Constructs a social network from the global `politicians` dictionary.

    Returns: net (nx.Graph) - A networkx graph object representing the social network.
    """
    net = nx.Graph()
    
    for person, data in politicians.items():
        net.add_node(person)
        neighbors = data['links']
        for neighbor in neighbors:
            if neighbor != person:
                net.add_edge(person, neighbor)

    return net
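
To make the construction concrete, here is a minimal sketch on a toy dictionary that uses plain Python sets instead of networkx. The names in `toy` are made up. It illustrates three properties of the function above: reciprocal links collapse into one undirected edge, self-links are skipped, and people without links still count as nodes.

```python
toy = {
    "A": {"links": ["B", "A"]},   # links to B and to itself
    "B": {"links": ["A", "C"]},   # reciprocates A's link
    "C": {"links": []},
    "D": {"links": []},           # isolated, but still a node
}

nodes = set(toy)
edges = set()
for person, data in toy.items():
    for neighbor in data["links"]:
        if neighbor != person:                        # skip self-links
            edges.add(frozenset((person, neighbor)))  # unordered pair: A-B equals B-A

print(len(nodes), len(edges))
print(2 * len(edges) / len(nodes))  # average degree = 2E / N
```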

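Here we compute basic statistics of the network: the number of nodes, the number of edges, the average degree, the average clustering coefficient, and the average shortest path length of the largest connected component. Zero-degree nodes are kept in the network, so they are also included when the degrees of male and female nodes are calculated.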

In [5]:
from matplotlib import pyplot as plt
import numpy as np

net = construct_network()

# Print out some basic statistics of the network
print("The network has:")
print(len(net), "nodes")
print(net.number_of_edges(), "edges")
print(2*net.number_of_edges()/len(net), "average degree")
print(nx.average_clustering(net), "average clustering coefficient")
print(nx.average_shortest_path_length(net.subgraph(max(nx.connected_components(net), key=len))), "average shortest path length")

# Plot the network
# plt.figure()
# positions = nx.spring_layout(net)
# nx.draw(net, positions, node_size=1)

The network has:
6880 nodes
10294 edges
2.9924418604651164 average degree
0.14363897936984674 average clustering coefficient
7.290627875646906 average shortest path length


The next cell calculates the average degree of male nodes and of female nodes. The result shows that female nodes have a higher average degree than male nodes, which suggests higher notability for the women included on Wikipedia. The average degree inequality is described as:

\begin{equation}
\frac{1}{N_F}\sum_{i} d_{i}^F \ge \frac{1}{N_M}\sum_{i} d_{i}^M
\end{equation}

In [6]:
male_sum = 0       # Total degree of male nodes
male_count = 0     # Number of male nodes
female_sum = 0     # Total degree of female nodes
female_count = 0   # Number of female nodes

# Loop through the network to calculate the average degree for each gender.
for node in net:
    value = politicians[node]
    if value['gender'] == 'M':
        male_sum += net.degree(node)
        male_count += 1
    elif value['gender'] == 'F':
        female_sum += net.degree(node)
        female_count += 1

print("The average degree of males in the network is:", male_sum / male_count)
print("The average degree of females in the network is:", female_sum / female_count)

The average degree of males in the network is: 4.087830687830688
The average degree of females in the network is: 7.176165803108808
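
The same per-gender average can be sketched on a toy edge list with the standard library only. The node names, genders and edges below are invented for illustration. Nodes with gender 'NA' contribute to the degrees of their neighbours but are left out of both averages, as in the cell above.

```python
from collections import defaultdict

gender = {"A": "F", "B": "M", "C": "M", "D": "NA"}
edges = [("A", "B"), ("A", "C"), ("B", "D")]

# Degree of each node: number of edges it appears in
degree = defaultdict(int)
for u, v in edges:
    degree[u] += 1
    degree[v] += 1

# gender -> [degree sum, node count]
totals = defaultdict(lambda: [0, 0])
for node, g in gender.items():
    totals[g][0] += degree[node]
    totals[g][1] += 1

averages = {g: s / n for g, (s, n) in totals.items() if g in ("F", "M")}
print(averages)
```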


**Gender homophily**

Next we analyse the gender homophily of our data. To keep the model simple, the function below creates a new network that contains only nodes of known gender (male or female), with no self-edges and no zero-degree nodes.

In [7]:
def binary_network():
    """
    This function constructs a new social network from the data of politicians, consisting of only two genders and no zero-degree nodes.

    Returns: net (nx.Graph) - A networkx graph object representing the social network.
    """
    net = nx.Graph()
    
    for person, data in politicians.items():
        if data['gender'] != 'NA':         # Only consider politicians that are either male or female
            neighbors = data['links']
            for neighbor in neighbors:
                nei_gender = politicians[neighbor]['gender']
                if neighbor != person and nei_gender != 'NA':
                    net.add_edge(person, neighbor)

    return net

The code below creates the new network defined above.

In [8]:
new_net = binary_network()

print("The new network has:")
print(len(new_net), "nodes")
print(new_net.number_of_edges(), "edges")
print(2*new_net.number_of_edges()/len(new_net), "average degree")
print(nx.average_clustering(new_net), "average clustering coefficient")
print(nx.average_shortest_path_length(new_net.subgraph(max(nx.connected_components(new_net), key=len))), "average shortest path length")

# Plot the network
# plt.figure()
# positions = nx.spring_layout(new_net)
# nx.draw(new_net, positions, node_size=1)

The new network has:
1573 nodes
5736 edges
7.293070565797839 average degree
0.339796557267688 average clustering coefficient
6.817727056393223 average shortest path length


First we check, by evaluating inequality (1), that the notability difference between male and female nodes still holds in the new network.

In [9]:
new_male_sum = 0
new_male_count = 0
new_female_sum = 0
new_female_count = 0

for node in new_net:
    value = politicians[node]
    if value['gender'] == 'M':
        new_male_sum += new_net.degree(node)
        new_male_count += 1
    elif value['gender'] == 'F':
        new_female_sum += new_net.degree(node)
        new_female_count += 1

print("The average degree of males in the new network is:", new_male_sum / new_male_count)
print("The average degree of females in the new network is:", new_female_sum / new_female_count)

The average degree of males in the new network is: 6.66908037653874
The average degree of females in the new network is: 11.78125


Next we analyse the homophily of the whole network and of each gender. For the full explanation, please refer to the report.

In [10]:
fm_count = 0
mm_count = 0
ff_count = 0

# Empirical network data between men and women
for edge in list(new_net.edges):                 # Loop through every edge (connection) between the politicians
    name_1, name_2 = edge                    # Get the name of the two politicians in the edge
    gender_1, gender_2 = politicians[name_1]['gender'], politicians[name_2]['gender'] # Get the gender of the politicians
    concat = gender_1 + gender_2             # Create a concatenation of the two genders
    if concat == 'FM' or concat == 'MF':     # A connection between different genders can only be a man-woman connection and vice-versa
        fm_count += 1
    elif concat == 'MM':
        mm_count += 1
    elif concat == 'FF':
        ff_count += 1
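
As an aside, the same tally can be written more compactly with `collections.Counter`. This is a sketch on invented toy data, not the code used for the results in this notebook. Sorting the two gender labels maps both 'FM' and 'MF' onto the single key 'FM'.

```python
from collections import Counter

gender = {"A": "F", "B": "M", "C": "M", "D": "F"}
edges = [("A", "B"), ("B", "C"), ("A", "D"), ("C", "D")]

# Sorting the concatenated labels makes the edge type independent of node order
counts = Counter("".join(sorted(gender[u] + gender[v])) for u, v in edges)
print(counts["FM"], counts["MM"], counts["FF"])
```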

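The condition for homophily in the whole network compares the observed number of female-male edges $\bar{e}_{FM}$ (`fm_count`) with the reference value $e_{FM} = E/3$, where $E$ is the number of edges in the new network: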

\begin{equation}
e_{FM} - \bar{e}_{FM} = \frac{E}{3} - \bar{e}_{FM} > 0.1e_{FM} = 191.2
\end{equation}

In [11]:
new_net.number_of_edges()/3 - fm_count

290.0
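
The arithmetic behind this check can be replayed from the numbers printed in this notebook. The sketch below hard-codes the edge count of the new network and the output of the cell above. The variable names are chosen here for illustration.

```python
E = 5736                            # edges in the new network (printed earlier)
expected_fm = E / 3                 # reference value e_FM in the inequality
observed_fm = expected_fm - 290.0   # the cell above printed E/3 - fm_count = 290.0

threshold = 0.1 * expected_fm       # right-hand side of the inequality, about 191.2
print(expected_fm, observed_fm)
print(expected_fm - observed_fm > threshold)
```

Since the difference of 290 exceeds the threshold of about 191.2, the inequality holds for this network.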

The condition for male homophily is:

\begin{equation}
\frac{\bar{e}_{FM}}{\bar{e}_{MM} + \bar{e}_{FM}} < 0.45
\end{equation}

In [12]:
fm_count/(fm_count + mm_count)

0.2994830132939439

Similarly, the condition for female homophily is:

\begin{equation}
\frac{\bar{e}_{FM}}{\bar{e}_{FF} + \bar{e}_{FM}} < 0.45
\end{equation}

In [13]:
fm_count/(fm_count + ff_count)

0.835221421215242
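
As a consistency check, the three edge counts can be recovered from the printed outputs of cells 11 to 13 and compared with the total number of edges. This is a sketch that assumes only the values printed in this notebook. Rounding is used because the ratios are printed as floating-point numbers.

```python
E = 5736
fm = E / 3 - 290.0                          # cell 11: E/3 - fm_count = 290.0
mm = round(fm / 0.2994830132939439 - fm)    # cell 12: fm / (fm + mm)
ff = round(fm / 0.835221421215242 - fm)     # cell 13: fm / (fm + ff)

print(fm, mm, ff, fm + mm + ff == E)

# Evaluate both conditions against the 0.45 threshold
print("male condition holds:", fm / (fm + mm) < 0.45)
print("female condition holds:", fm / (fm + ff) < 0.45)
```

With these numbers the male ratio is below 0.45 while the female ratio is above it, so only the male condition is satisfied.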