<img style='float: left;' src='https://www.gesis.org/fileadmin/styles/img/gs_home_logo_en.svg'>

### ``compsoc`` – *Notebooks for Computational Sociology* (alpha)

# Social Network Science (1916-2012): Collaboration and language use in a scholarly field
Authors: [Haiko Lietz](https://www.gesis.org/person/haiko.lietz)

Version: 0.91 (14.09.2020)

Please cite as: Lietz, Haiko (2020). Social Network Science (1916-2012): Collaboration and language use in a scholarly field. Version 0.91 (14.09.2020). *compsoc – Notebooks for Computational Sociology*. GESIS. url:[github.com/gesiscss/compsoc](https://github.com/gesiscss/compsoc)

<div class="alert alert-info">
<big><b>Significance</b></big>

This data collection is a delineation of the multidisciplinary and very heterogeneous Social Network Science field using the Web of Science database. It has been produced to enable studies of network stability and change in both social and cultural dimensions. The field consists of 25,760 publications and has a historical dimension (1916–2012). Data is clean and disambiguated.
</div>

## Introduction
Bibliographic data is an early case of behavioral data as it consists of traces of behavior that are collected by a database provider. Behavioral traces typically take the form of co-authorship, citation, and language use.

**In this notebook**, the Social Network Science collection is introduced. It has been carefully retrieved from the Web of Science for the purpose of studying its historical socio-cultural evolution ([Lietz, 2020](https://doi.org/10.1007/s11192-020-03527-0)). The data is publicly available under a Creative Commons license ([Lietz, 2019](https://doi.org/10.7802/1.1954)). The dataset is normalized and fully mapped to `compsoc`'s unified data model. As such, it is an ideal-typical case of mapping quantifiable things to the model: publications map to transactions, and authors, cited references, or words map to facts.

## Dependencies and Settings

In [None]:
import compsoc as cs
import networkx as nx
import pandas as pd

## Unified data structure

In [None]:
publications = pd.read_csv('data/sns/publications.txt', sep='\t', encoding='utf-8')
subfields = pd.read_csv('data/sns/subfields.txt', sep='\t', encoding='utf-8')
authors = pd.read_csv('data/sns/authors.txt', sep='\t', encoding='utf-8')
authorships = pd.read_csv('data/sns/authorships.txt', sep='\t', encoding='utf-8')
words = pd.read_csv('data/sns/words.txt', sep='\t', encoding='utf-8')
usages = pd.read_csv('data/sns/usages.txt', sep='\t', encoding='utf-8')

The dataset is fully normalized. Tables with primary keys contain entities. The relationships among entities are specified in tables that consist only of foreign keys.

|<img src='images/data_model_sns.png' style='float: none; width: 640px'>|
|:--|
|<em style='float: center'>**Figure 1**: Entity-relationship model for the Social Network Science collection</em>|
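
As a minimal sketch of what this normalization implies, the following checks a primary key and a foreign key on two toy tables. The toy tables are made up for illustration and are not part of the collection; only the column names `publication_id`, `author_id`, and `author` are taken from this notebook.

```python
import pandas as pd

# Toy stand-ins for an entity table and a relationship table (not the real data).
toy_authors = pd.DataFrame({'author_id': [0, 1, 2], 'author': ['A', 'B', 'C']})
toy_authorships = pd.DataFrame({'publication_id': [0, 0, 1], 'author_id': [0, 1, 2]})

# Primary key: every identifier in the entity table occurs exactly once.
primary_key_ok = bool(toy_authors['author_id'].is_unique)

# Foreign key: every identifier in the relationship table exists in the entity table.
foreign_key_ok = bool(toy_authorships['author_id'].isin(toy_authors['author_id']).all())

print(primary_key_ok, foreign_key_ok)
```

The same two checks could be run on the loaded tables to confirm that a relationship table only refers to existing entities.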

Transactions as elementary pieces of communication:

In [None]:
publications.head()

They contain important variables like the year the publication was produced or the subfield the publication belongs to. There are five subfields:

In [None]:
subfields.head()

### Authorship
The unified data model states that "transactions select facts". The first translation of this modeling principle is that "publications are authored by authors". Authors are the senders of communications to an unspecified set of receivers. The `authors` entity table simply lists which author has which identifier, where the identifier is an integer between $0$ and $N-1$. In the case of an author network, $N$ is the number of nodes.

In [None]:
authors.head()

Which publication is authored by which author is stored in the `authorships` relationship table. The beauty of such relationship tables is that they can be used directly as edge lists for network construction:

In [None]:
authorships.head()
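
To illustrate the edge-list idea, here is a minimal sketch that turns a toy relationship table into a bipartite publication-author graph with `networkx`. The toy data and the `p`/`a` node prefixes are assumptions made for illustration; they are not part of `compsoc`.

```python
import networkx as nx
import pandas as pd

# Toy relationship table with the same two foreign keys as `authorships`.
toy_authorships = pd.DataFrame({
    'publication_id': [0, 0, 1, 1],
    'author_id': [0, 1, 1, 2],
})

# Prefix the integer identifiers so that publication 0 and author 0 become distinct nodes.
edges = [
    (f'p{p}', f'a{a}')
    for p, a in zip(toy_authorships['publication_id'], toy_authorships['author_id'])
]

# Build the two-mode graph and mark which node set each node belongs to.
B = nx.Graph()
B.add_nodes_from({u for u, _ in edges}, bipartite='publication')
B.add_nodes_from({v for _, v in edges}, bipartite='author')
B.add_edges_from(edges)

print(B.number_of_nodes(), B.number_of_edges())
```

The prefixes matter because both identifier columns are integers that start at zero; without them, a publication and an author with the same identifier would collapse into one node.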

Authorship information is used to study the social dimension of identity dynamics.

### Word usage
The second translation of "transactions select facts" is that "publications use words". Words resemble concepts that, as part of emergent patterns, influence future transactions and give the field a direction. Entities in `words` are cultural facts:

In [None]:
words.head()

The `usages` table records which linguistic concept is used in which publication:

In [None]:
usages.head()
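
As a hedged sketch of how the two word tables work together, the following counts the number of publications that use each word. The toy data is made up, and the column names `word_id` and `word` are assumptions that mirror the naming of the author tables; the actual column names should be checked in the output above.

```python
import pandas as pd

# Toy tables; the column names 'word_id' and 'word' are assumptions for illustration.
toy_words = pd.DataFrame({'word_id': [0, 1, 2], 'word': ['network', 'social', 'tie']})
toy_usages = pd.DataFrame({'publication_id': [0, 0, 1, 2], 'word_id': [0, 1, 0, 0]})

# Count the distinct publications that use each word.
counts = (
    toy_usages.groupby('word_id')['publication_id']
    .nunique()
    .rename('publications')
    .reset_index()
)

# Attach the word strings; words that are never used get a count of 0.
word_counts = toy_words.merge(counts, on='word_id', how='left')
word_counts['publications'] = word_counts['publications'].fillna(0).astype(int)
word_counts = word_counts.sort_values('publications', ascending=False)

print(word_counts)
```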

Linguistic information is used to study the cultural dimension of identity dynamics.

Unfortunately, the `references` and `citations` tables cannot be shared.

## Function
This function loads all data in one step:

In [None]:
def sns_collection(path='data/sns/'):
    '''
    Description: Loads the normalized Social Network Science data collection.
    
    Input: Path to the directory that contains the six tab-separated text files.
    
    Output: Six dataframes in this order: publications, subfields, authors, authorships, 
        words, usages
    '''
    
    import os
    import pandas as pd
    
    # build every file name from `path` so that the argument is actually used
    publications = pd.read_csv(os.path.join(path, 'publications.txt'), sep='\t', encoding='utf-8')
    subfields = pd.read_csv(os.path.join(path, 'subfields.txt'), sep='\t', encoding='utf-8')
    authors = pd.read_csv(os.path.join(path, 'authors.txt'), sep='\t', encoding='utf-8')
    authorships = pd.read_csv(os.path.join(path, 'authorships.txt'), sep='\t', encoding='utf-8')
    words = pd.read_csv(os.path.join(path, 'words.txt'), sep='\t', encoding='utf-8')
    usages = pd.read_csv(os.path.join(path, 'usages.txt'), sep='\t', encoding='utf-8')
    
    return publications, subfields, authors, authorships, words, usages

## Example analysis
This is a standard workflow from loading the collection to drawing a network:

In [None]:
publications, subfields, authors, authorships, words, usages = cs.sns_collection()

Select the identifiers of all publications published from 2010 to 2012:

In [None]:
publications_2010 = publications[publications['time'].between(2010, 2012)]['publication_id']
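
Note that `Series.between()` includes both endpoints by default, so publications from 2010 and from 2012 are kept. A small sketch on toy data (the toy table is made up; only the column names `publication_id` and `time` come from this notebook):

```python
import pandas as pd

# Toy publications table with one publication per year.
toy_publications = pd.DataFrame({
    'publication_id': [0, 1, 2, 3, 4],
    'time': [2009, 2010, 2011, 2012, 2013],
})

# between() includes both endpoints by default, so 2010 and 2012 are kept.
mask = toy_publications['time'].between(2010, 2012)
selected_ids = toy_publications.loc[mask, 'publication_id']

print(selected_ids.tolist())
```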

Select all authorships that belong to those publications:

In [None]:
authorships_2010 = authorships[authorships['publication_id'].isin(publications_2010)].copy()

Assign a unit weight:

In [None]:
authorships_2010.loc[:, 'weight'] = 1

Project the `authorships_2010` selection matrix to the `co_authorships_2010` fact matrix using `compsoc`'s `meaning_structures()` function:

In [None]:
_, authors, co_authorships_2010, _ = cs.meaning_structures(
    selections=authorships_2010, 
    transaction_id='publication_id', 
    fact_id='author_id', 
    multiplex=True, 
    transactions=publications, 
    domain_id='subfield_id', 
    facts=authors
)
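
The idea behind such a projection can be sketched with plain `numpy`. The example below is a simple unweighted one-mode projection on a toy matrix, shown only to convey the principle; it is not `compsoc`'s implementation, and `meaning_structures()` may weight or normalize co-authorships differently.

```python
import numpy as np

# Toy selection matrix S: rows are publications, columns are authors.
# S[p, a] = 1 if author a is an author of publication p.
S = np.array([
    [1, 1, 0],  # publication 0 by authors 0 and 1
    [0, 1, 1],  # publication 1 by authors 1 and 2
    [1, 1, 0],  # publication 2 by authors 0 and 1
])

# One-mode projection: C[i, j] counts the publications shared by authors i and j.
C = S.T @ S

# The diagonal holds each author's number of publications; zero it to keep only ties.
publications_per_author = np.diag(C).copy()
np.fill_diagonal(C, 0)

print(C)
print(publications_per_author)
```

In this toy case, authors 0 and 1 share two publications, authors 1 and 2 share one, and authors 0 and 2 share none.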

Construct an undirected multiplex graph:

In [None]:
G = cs.construct_graph(
    directed=False, 
    multiplex=True, 
    graph_name='co_authorships_2010', 
    node_list=authors, 
    node_size='degree', 
    edge_list=co_authorships_2010[['author_id_from', 'author_id_to', 'weight', 'subfield_id']], 
    node_label='author'
)

Extract the graph's largest connected component:

In [None]:
G_lcc = G.subgraph(max(nx.connected_components(G), key=len))
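
The same pattern is shown below on a small toy graph, as a sketch independent of the collection. It also illustrates that `subgraph()` returns a read-only view of the original graph, so `copy()` is needed if the component is to be modified afterwards.

```python
import networkx as nx

# Toy graph with three components: {0, 1, 2}, {3, 4}, and the isolate {5}.
G_toy = nx.Graph()
G_toy.add_edges_from([(0, 1), (1, 2), (3, 4)])
G_toy.add_node(5)

# connected_components() yields one set of nodes per component.
components = list(nx.connected_components(G_toy))
largest = max(components, key=len)

# subgraph() returns a read-only view; copy() yields an independent, editable graph.
G_toy_lcc = G_toy.subgraph(largest).copy()

print(sorted(G_toy_lcc.nodes()))
```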

Draw the graph:

In [None]:
cs.draw_graph(
    G_lcc, 
    node_size_factor=5, 
    edge_width_factor=5, 
    edge_transparency=.5, 
    figsize='large'
)

Rank authors by the number of publications:

In [None]:
authors.sort_values('weight', ascending=False)[:20]