
<img style='float: left;' src='https://www.gesis.org/typo3conf/ext/gesis_web_ext/Resources/Public/webpack/dist/img/logo_gesis_en.svg' width='150'>

### ``compsoc`` – Computational Social Methods in Python

# Social Network Science (1916-2012): Collaboration and language use in a research domain

**Author**: [Haiko Lietz](https://www.gesis.org/person/haiko.lietz)

**Affiliation**: [GESIS - Leibniz Institute for the Social Sciences](https://www.gesis.org/), Cologne, Germany

**Publication date**: XX.XX.XXXX (version 1.0)

***

## Introduction

Bibliographic data is digital behavioral data in the broader sense because it represents records of events, and it is voluminous, dynamic, and rich. The networks typically constructed from it are co-authorship, citation, and word co-occurrence networks.

**In this notebook**, the Social Network Science collection is introduced. It has been carefully retrieved from the Web of Science for the purpose of studying its historical socio-cultural evolution ([Lietz, 2020](https://doi.org/10.1007/s11192-020-03527-0)). Except for citations, the data is publicly available under a Creative Commons license ([Lietz, 2019](https://doi.org/10.7802/1.1954)). The dataset is normalized and fully compatible with `compsoc`'s routines.

## Dependencies and Settings

In [1]:
import compsoc as cs
import graph_tool.all as gt
import os
import pandas as pd

In [2]:
path = 'data/sns/'

## Unified data structure

In [3]:
publications = pd.read_csv(os.path.join(path, 'publications.txt'), sep='\t')
subfields = pd.read_csv(os.path.join(path, 'subfields.txt'), sep='\t')
authors = pd.read_csv(os.path.join(path, 'authors.txt'), sep='\t')
authorships = pd.read_csv(os.path.join(path, 'authorships.txt'), sep='\t')
words = pd.read_csv(os.path.join(path, 'words.txt'), sep='\t')
usages = pd.read_csv(os.path.join(path, 'usages.txt'), sep='\t')

The dataset is fully normalized. Tables with primary keys contain entities. The relationships between entities are stored in separate tables that consist solely of foreign keys.

|<img src='images/data_model_sns.png' style='float: none; width: 640px'>|
|:--|
|<em style='float: center'>**Figure 1**: Entity-relationship model for the Social Network Science collection</em>|
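
The normalization described above can be checked mechanically. The following is a minimal sketch, not part of `compsoc`: it uses small stand-in tables whose values are copied from the table heads shown in this notebook, and `check_foreign_key` is a hypothetical helper introduced only for illustration. It confirms that primary keys are unique and that every foreign key in a relationship table points to an existing entity.

```python
import pandas as pd

# Stand-in tables; values copied from the heads displayed in this notebook
publications = pd.DataFrame({
    'publication_id': [0, 1, 2, 3, 4],
    'publication': ['HANIFAN_1916_A_130', 'YULE_1925_P_21', 'KERMACK_1927_P_700',
                    'ECKART_1936_P_211', 'COASE_1937_E_386'],
})
authors = pd.DataFrame({
    'author_id': [0, 1, 2, 3, 4],
    'author': ['HANIFAN,_L_J', 'YULE,_G_U', 'KERMACK,_W_O',
               'MCKENDRICK,_A_G', 'ECKART,_CARL'],
})
authorships = pd.DataFrame({
    'publication_id': [0, 1, 2, 2, 3],
    'author_id': [0, 1, 2, 3, 4],
})

def check_foreign_key(relation, entity, key):
    '''Hypothetical helper: values of `key` in `relation` without a match in `entity`.'''
    return relation.loc[~relation[key].isin(entity[key]), key].unique()

# Primary keys must be unique within their entity table
assert publications['publication_id'].is_unique
assert authors['author_id'].is_unique

# Foreign keys in the relationship table must refer to existing entities
orphans_pub = check_foreign_key(authorships, publications, 'publication_id')
orphans_aut = check_foreign_key(authorships, authors, 'author_id')
print(len(orphans_pub), len(orphans_aut))
```

With the full tables loaded from disk, the same two calls should work unchanged, assuming the column names match those shown in the figure.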

Publications are the elementary events of communication:

In [4]:
publications.head()

Unnamed: 0,publication_id,publication,time,type,subfield_id
0,0,HANIFAN_1916_A_130,1916,ARTICLE,0
1,1,YULE_1925_P_21,1925,ARTICLE,1
2,2,KERMACK_1927_P_700,1927,ARTICLE,1
3,3,ECKART_1936_P_211,1936,ARTICLE,2
4,4,COASE_1937_E_386,1937,ARTICLE,1


They contain important variables like the year the publication was produced or the subfield the publication belongs to. There are five subfields inferred from community detection:

In [5]:
subfields.head()

Unnamed: 0,subfield_id,subfield,subfield_name
0,0,CSS,Computational Social Science
1,1,ES,Economic Sociology
2,2,NS,Network Science
3,3,SNA,Social Network Analysis
4,4,SPE,Social Psychology & Epidemiology


### Authorship

Publications are authored by authors. Authors are the senders of communications to an unspecified set of receivers. The `authors` entity table simply maps each author to an identifier, an integer between $0$ and $N-1$, where $N$ is the number of authors.

In [6]:
authors.head()

Unnamed: 0,author_id,author
0,0,"HANIFAN,_L_J"
1,1,"YULE,_G_U"
2,2,"KERMACK,_W_O"
3,3,"MCKENDRICK,_A_G"
4,4,"ECKART,_CARL"
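
Identifiers that run from $0$ to $N-1$ without gaps can serve directly as row or vertex indices. A minimal sketch of how this property could be verified follows; `ids_are_contiguous` is a hypothetical helper, and the table is a stand-in built from the five rows shown above.

```python
import pandas as pd

# Stand-in for the authors table (rows copied from the head shown above)
authors = pd.DataFrame({
    'author_id': [0, 1, 2, 3, 4],
    'author': ['HANIFAN,_L_J', 'YULE,_G_U', 'KERMACK,_W_O',
               'MCKENDRICK,_A_G', 'ECKART,_CARL'],
})

def ids_are_contiguous(ids):
    '''Hypothetical helper: True if ids are exactly 0, 1, ..., N-1 in some order.'''
    return sorted(ids.tolist()) == list(range(len(ids)))

print(ids_are_contiguous(authors['author_id']))
```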


The information about which publication is authored by which author is stored in the `authorships` relationship table. The beauty of these tables is that they can be used directly as bipartite edge lists for constructing co-occurrence networks:

In [7]:
authorships.head()

Unnamed: 0,publication_id,author_id
0,0,0
1,1,1
2,2,2
3,2,3
4,3,4


Authorship information can be used to study the social dimension of socio-cultural structures and dynamics.
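
To illustrate how a bipartite edge list turns into a co-occurrence network, here is a minimal sketch in plain `pandas`. It is not `compsoc`'s own projection routine, and it assumes the simplest possible weighting, namely the number of joint publications per author pair. The stand-in table holds the five rows shown above.

```python
import pandas as pd

# Stand-in for the authorships table (rows copied from the head shown above)
authorships = pd.DataFrame({
    'publication_id': [0, 1, 2, 2, 3],
    'author_id': [0, 1, 2, 3, 4],
})

# Self-join on the publication: all pairs of authors that share a publication
pairs = pd.merge(authorships, authorships, on='publication_id', suffixes=('_i', '_j'))

# Keep each unordered pair once and drop pairs of an author with itself
pairs = pairs[pairs['author_id_i'] < pairs['author_id_j']]

# Edge weight (an assumption of this sketch): number of joint publications
coauthorships = (
    pairs.groupby(['author_id_i', 'author_id_j'])
    .size()
    .reset_index(name='weight')
)
print(coauthorships)
```

In these five rows, only publication 2 has more than one author, so the result is a single edge between authors 2 and 3. The self-join grows quadratically with the number of authors per publication, which may matter for publications with very long author lists.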

### Word usage

Publications, or rather their authors, also use words. These can be $n$-grams, linguistic concepts that consist of $n$ tokens (for example, 'RURAL_SCHOOL' is a 2-gram):

In [8]:
words.head()

Unnamed: 0,word_id,word
0,0,COMMUNITY
1,1,RURAL_SCHOOL
2,2,WAY_OF_LIFE
3,3,GROUP_STRUCTURE
4,4,THEORY_OF_COMMUNICATION


The `usages` table tells which publication uses which word:

In [9]:
usages.head()

Unnamed: 0,publication_id,word_id
0,0,0
1,0,1
2,7,2
3,8,0
4,12,3


Linguistic information can be used to study the cultural dimension of socio-cultural structures and dynamics.
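
As a sketch of how the `usages` table could feed a word co-occurrence network, the bipartite edge list can be cast into a publication-by-word incidence matrix $B$, and the product $B^\top B$ then counts how many publications use each pair of words together. This is an illustration in plain `pandas`, not `compsoc`'s method. It assumes that each publication-word pair occurs at most once, and the stand-in tables contain only the rows shown above.

```python
import pandas as pd

# Stand-ins for the words and usages tables (rows copied from the heads shown above)
words = pd.DataFrame({
    'word_id': [0, 1, 2, 3, 4],
    'word': ['COMMUNITY', 'RURAL_SCHOOL', 'WAY_OF_LIFE',
             'GROUP_STRUCTURE', 'THEORY_OF_COMMUNICATION'],
})
usages = pd.DataFrame({
    'publication_id': [0, 0, 7, 8, 12],
    'word_id': [0, 1, 2, 0, 3],
})

# Incidence matrix B: one row per publication, one column per word
B = pd.crosstab(usages['publication_id'], usages['word_id'])

# Co-occurrence matrix: the diagonal holds the number of publications per word,
# off-diagonal cells hold the number of publications using both words
C = B.T.dot(B)

# Replace word identifiers by readable labels
labels = words.set_index('word_id')['word'].to_dict()
C = C.rename(index=labels, columns=labels)
print(C)
```

A dense matrix like this is only practical for small vocabularies. For the full collection, a sparse representation would likely be the better choice.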

Unfortunately, the `references` and `citations` tables cannot be shared.

## Function

This function loads all data in one step:

In [10]:
def sns_collection(
    path = 'data/sns/'
):
    '''
    Description: Loads the normalized Social Network Science data collection.
    
    Input:
        path: relative directory where the data is; set to 'data/sns/' by default.
    
    Output: Six dataframes in this order: publications, subfields, authors, authorships, 
        words, usages
    '''
    import os
    import pandas as pd
    
    publications = pd.read_csv(os.path.join(path, 'publications.txt'), sep='\t')
    subfields = pd.read_csv(os.path.join(path, 'subfields.txt'), sep='\t')
    authors = pd.read_csv(os.path.join(path, 'authors.txt'), sep='\t')
    authorships = pd.read_csv(os.path.join(path, 'authorships.txt'), sep='\t')
    words = pd.read_csv(os.path.join(path, 'words.txt'), sep='\t')
    usages = pd.read_csv(os.path.join(path, 'usages.txt'), sep='\t')
    
    return publications, subfields, authors, authorships, words, usages

## Data exploration

Apply the function:

In [11]:
publications, subfields, authors, authorships, words, usages = cs.sns_collection()

These are the ten most productive authors:

In [21]:
pd.merge(left=authorships, right=authors, on='author_id').value_counts('author').head(10)

author
LATKIN,_CARL            74
CARLEY,_KATHLEEN        49
BARABASI,_ALBERT        48
NEWMAN,_M_E_J           46
BERKMAN,_LISA           44
VALENTE,_THOMAS         39
KAZIENKO,_PRZEMYSLAW    38
WELLMAN,_BARRY          38
LEYDESDORFF,_LOET       38
DUNBAR,_ROBIN_I_M       38
Name: count, dtype: int64

These are the ten most used words:

In [22]:
pd.merge(left=usages, right=words, on='word_id').value_counts('word').head(10)

word
COMMUNITY                  3091
USER                       2806
SOCIAL_NETWORK_ANALYSIS    2239
SOCIAL_CAPITAL             1681
FRIEND                     1598
SOCIAL_SUPPORT             1452
INTERNET                   1310
OPPORTUNITY                1185
TRUST                      1170
WEB                         993
Name: count, dtype: int64

***

## About this notebook

**License**: CC BY 4.0. Distribute, remix, adapt, and build upon ``compsoc``, even commercially, as long as you credit us for the original creation.

**Suggested citation**: Lietz, H. (2025). Social Network Science (1916-2012): Collaboration and language use in a research domain. Version 1.0 (XX.XX.XXXX). *compsoc – Computational Social Methods in Python*. Cologne: GESIS – Leibniz Institute for the Social Sciences. https://github.com/gesiscss/compsoc