
<img style='float: left;' src='https://www.gesis.org/typo3conf/ext/gesis_web_ext/Resources/Public/webpack/dist/img/logo_gesis_en.svg' width='150'>

### ``compsoc`` – Computational Social Methods in Python

# Web browsing: Sequential URL visits over a month

**Author**: [Haiko Lietz](https://www.gesis.org/person/haiko.lietz)

**Affiliation**: [GESIS - Leibniz Institute for the Social Sciences](https://www.gesis.org/), Cologne, Germany

**Publication date**: XX.XX.XXXX (version 1.0)

***

## Introduction

...

**In this notebook**, ...

https://zenodo.org/records/4757574

## Dependencies and Settings

In [1]:
import compsoc as cs
import os
import pandas as pd

In [2]:
path = 'data/web_browsing/'

## Unified data structure

In [3]:
browsing = pd.read_csv(os.path.join(path, 'browsing_with_gap.csv.gz'))
browsing['category_names_top'] = browsing['category_names_top'].str.split(',')

In [4]:
browsing.head()

Unnamed: 0,id,prev_id,panelist_id,used_at,left_at,active_seconds,gap_seconds,top_level_domain,category_names_top,sub_level_domain,subdomain,category_names_sub,domain,category_names
0,1009076504,0,0,2018-10-01 00:03:29,2018-10-01 00:03:31,2,0,youtube.com,"[entertainment, streaming-media]",,,,youtube.com,"entertainment,streaming-media"
1,1009076508,1009076504,0,2018-10-01 00:03:31,2018-10-01 00:03:37,6,0,youtube.com,"[entertainment, streaming-media]",,,,youtube.com,"entertainment,streaming-media"
2,1009076512,1009076508,0,2018-10-01 00:03:37,2018-10-01 00:03:43,6,0,youtube.com,"[entertainment, streaming-media]",,,,youtube.com,"entertainment,streaming-media"
3,1009076516,1009076512,0,2018-10-01 00:03:43,2018-10-01 00:03:49,6,0,youtube.com,"[entertainment, streaming-media]",,,,youtube.com,"entertainment,streaming-media"
4,1009076520,1009076516,0,2018-10-01 00:03:49,2018-10-01 00:03:53,4,0,youtube.com,"[entertainment, streaming-media]",,,,youtube.com,"entertainment,streaming-media"
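In the rows above, each `prev_id` matches the `id` of the row before it, which suggests that visits can be chained into sequences. The following is a minimal sketch of counting domain-to-domain transitions with a self-join. It runs on a made-up toy table and assumes that `prev_id` references the `id` of the preceding visit and that `0` means "no predecessor". The names `visits`, `previous`, `pairs`, and `transitions` are illustrative only.

```python
import pandas as pd

# Toy stand-in for the browsing table; ids and domains are made up.
visits = pd.DataFrame({
    'id': [10, 11, 12, 13],
    'prev_id': [0, 10, 11, 12],  # assumption: 0 marks a visit without predecessor
    'panelist_id': [0, 0, 0, 0],
    'top_level_domain': ['a.com', 'a.com', 'b.com', 'a.com'],
})

# Lookup of each visit's domain, keyed so that it can be matched on prev_id.
previous = visits[['id', 'top_level_domain']].rename(
    columns={'id': 'prev_id', 'top_level_domain': 'prev_domain'}
)

# Self-join: attach to every visit the domain of the visit it points back to.
# The inner join drops visits whose prev_id has no matching id.
pairs = visits.merge(previous, on='prev_id', how='inner')

# Count how often each (previous domain, current domain) pair occurs.
transitions = (
    pairs.groupby(['prev_domain', 'top_level_domain'])
    .size()
    .reset_index(name='n_transitions')
)
print(transitions)
```

Whether transitions should be counted across long gaps (see `gap_seconds`) is a modelling choice that this sketch does not address.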


In [5]:
browsing['used_at'].min()

'2018-10-01 00:00:00'

In [6]:
browsing['left_at'].max()

'2018-11-01 00:14:15'
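The two values above are strings. Taking the minimum and maximum still gives the chronological extremes because timestamps in the `YYYY-MM-DD HH:MM:SS` layout sort lexicographically in time order. For arithmetic on time, the columns have to be parsed first. The following sketch does this on a two-row toy table copied from the first rows shown earlier. The names `toy`, `window`, and `duration` are illustrative.

```python
import pandas as pd

# Two visits copied from the head of the browsing table shown above.
toy = pd.DataFrame({
    'used_at': ['2018-10-01 00:03:29', '2018-10-01 00:03:31'],
    'left_at': ['2018-10-01 00:03:31', '2018-10-01 00:03:37'],
})

# Parse the strings into datetime64 values using the explicit layout.
for col in ['used_at', 'left_at']:
    toy[col] = pd.to_datetime(toy[col], format='%Y-%m-%d %H:%M:%S')

# Length of the observation window covered by the toy rows.
window = toy['left_at'].max() - toy['used_at'].min()

# Duration of each visit in seconds, derived from the two timestamps.
duration = (toy['left_at'] - toy['used_at']).dt.total_seconds()

print(window)
print(duration.tolist())
```

For these two rows, the derived durations agree with the `active_seconds` values in the table (2 and 6).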

### Node lists

In [7]:
panelists = pd.DataFrame(browsing['panelist_id'].drop_duplicates()).reset_index(drop=True)

In [8]:
browsing['top_level_domain'] = browsing['top_level_domain'].astype('category')

In [9]:
top_level_domains = pd.DataFrame(browsing['top_level_domain'].cat.categories).reset_index()
top_level_domains.columns = ['top_level_domain_id', 'top_level_domain']

In [10]:
browsing['top_level_domain'] = browsing['top_level_domain'].cat.codes
browsing.rename(columns={'top_level_domain': 'top_level_domain_id'}, inplace=True)
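The three cells above replace domain names by integer codes and keep the names in a separate node list. The following sketch repeats the pattern on a toy column and shows how the names can be recovered with a merge. The data and the names `toy`, `lookup`, `encoded`, and `decoded` are made up for illustration.

```python
import pandas as pd

# Toy column of domain names (made up, apart from youtube.com).
toy = pd.DataFrame({'top_level_domain': ['youtube.com', 'example.org', 'youtube.com']})

# Convert to a categorical; categories are the sorted unique values.
toy['top_level_domain'] = toy['top_level_domain'].astype('category')

# Node list: one row per category, the position serves as the identifier.
lookup = pd.DataFrame(toy['top_level_domain'].cat.categories).reset_index()
lookup.columns = ['top_level_domain_id', 'top_level_domain']

# Edge-side table: keep only the integer codes.
toy['top_level_domain_id'] = toy['top_level_domain'].cat.codes
encoded = toy[['top_level_domain_id']]

# Recover the names by merging the codes with the node list.
decoded = encoded.merge(lookup, on='top_level_domain_id', how='left')
print(decoded)
```

Note that pandas assigns the code `-1` to missing values, so rows without a domain would have no match in the node list.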

### Aggregates

In [11]:
browsing_sum = browsing[['panelist_id', 'top_level_domain_id', 'active_seconds']].groupby(['panelist_id', 'top_level_domain_id']).sum().reset_index()

In [12]:
browsing_sum.head()

Unnamed: 0,panelist_id,top_level_domain_id,active_seconds
0,0,154,558
1,0,325,64
2,0,349,22
3,0,429,108
4,0,997,58
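The aggregate above gives the total active time per panelist and domain. As a possible next step, the following sketch turns such totals into each panelist's share of time per domain. It uses made-up numbers, and the `share` column is an illustrative addition that is not part of the notebook's data structure.

```python
import pandas as pd

# Toy visits: panelist 0 visits domains 5 and 7, panelist 1 visits domain 5.
toy = pd.DataFrame({
    'panelist_id': [0, 0, 0, 1],
    'top_level_domain_id': [5, 5, 7, 5],
    'active_seconds': [2, 6, 4, 10],
})

# Total active seconds per panelist and domain.
toy_sum = toy.groupby(
    ['panelist_id', 'top_level_domain_id'], as_index=False
)['active_seconds'].sum()

# Divide by each panelist's overall total to obtain shares that sum to 1.
panelist_total = toy_sum.groupby('panelist_id')['active_seconds'].transform('sum')
toy_sum['share'] = toy_sum['active_seconds'] / panelist_total
print(toy_sum)
```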


### Categories

In [13]:
browsing_cat = browsing.copy()

In [14]:
browsing_cat = browsing_cat.explode('category_names_top').reset_index(drop=True)
browsing_cat.rename(columns={'category_names_top': 'category_name_top'}, inplace=True)

In [15]:
browsing_cat.head()

Unnamed: 0,id,prev_id,panelist_id,used_at,left_at,active_seconds,gap_seconds,top_level_domain_id,category_name_top,sub_level_domain,subdomain,category_names_sub,domain,category_names
0,1009076504,0,0,2018-10-01 00:03:29,2018-10-01 00:03:31,2,0,33042,entertainment,,,,youtube.com,"entertainment,streaming-media"
1,1009076504,0,0,2018-10-01 00:03:29,2018-10-01 00:03:31,2,0,33042,streaming-media,,,,youtube.com,"entertainment,streaming-media"
2,1009076508,1009076504,0,2018-10-01 00:03:31,2018-10-01 00:03:37,6,0,33042,entertainment,,,,youtube.com,"entertainment,streaming-media"
3,1009076508,1009076504,0,2018-10-01 00:03:31,2018-10-01 00:03:37,6,0,33042,streaming-media,,,,youtube.com,"entertainment,streaming-media"
4,1009076512,1009076508,0,2018-10-01 00:03:37,2018-10-01 00:03:43,6,0,33042,entertainment,,,,youtube.com,"entertainment,streaming-media"
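As the output above shows, `explode` repeats a visit once per category, including its `active_seconds`. Summing `active_seconds` over the exploded table therefore counts a visit with two categories twice. The following sketch illustrates the effect on a toy table and shows one possible remedy, fractional weighting. The weighting is an assumption made for illustration and not a step of this notebook.

```python
import pandas as pd

# Toy visits: the first has two categories, the second has one.
toy = pd.DataFrame({
    'id': [1, 2],
    'active_seconds': [2, 6],
    'category_names_top': ['entertainment,streaming-media', 'news'],
})
toy['category_names_top'] = toy['category_names_top'].str.split(',')

# One row per visit and category.
toy_cat = toy.explode('category_names_top').reset_index(drop=True)

# Naive sum over the exploded table counts visit 1 twice.
naive_total = toy_cat['active_seconds'].sum()

# Sum over unique visits gives the actual total.
true_total = toy_cat.drop_duplicates('id')['active_seconds'].sum()

# Hypothetical remedy: weight each row by 1 / number of categories of the visit.
toy_cat['weight'] = 1 / toy_cat.groupby('id')['id'].transform('size')
weighted_total = (toy_cat['active_seconds'] * toy_cat['weight']).sum()

print(naive_total, true_total, weighted_total)
```

Per-category sums remain meaningful as "time spent on domains tagged with this category", but they should not be added up across categories without such a correction.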


In [16]:
browsing_cat['category_name_top'] = browsing_cat['category_name_top'].astype('category')

In [17]:
category_names_top = pd.DataFrame(browsing_cat['category_name_top'].cat.categories).reset_index()
category_names_top.columns = ['category_name_top_id', 'category_name_top']

In [18]:
category_names_top.head()

Unnamed: 0,category_name_top_id,category_name_top
0,0,adult
1,1,advertising
2,2,alcohol and tobacco
3,3,black-list
4,4,blogs and personal


In [19]:
browsing_cat['category_name_top'] = browsing_cat['category_name_top'].cat.codes
browsing_cat.rename(columns={'category_name_top': 'category_name_top_id'}, inplace=True)

In [20]:
browsing_cat.head()

Unnamed: 0,id,prev_id,panelist_id,used_at,left_at,active_seconds,gap_seconds,top_level_domain_id,category_name_top_id,sub_level_domain,subdomain,category_names_sub,domain,category_names
0,1009076504,0,0,2018-10-01 00:03:29,2018-10-01 00:03:31,2,0,33042,13,,,,youtube.com,"entertainment,streaming-media"
1,1009076504,0,0,2018-10-01 00:03:29,2018-10-01 00:03:31,2,0,33042,34,,,,youtube.com,"entertainment,streaming-media"
2,1009076508,1009076504,0,2018-10-01 00:03:31,2018-10-01 00:03:37,6,0,33042,13,,,,youtube.com,"entertainment,streaming-media"
3,1009076508,1009076504,0,2018-10-01 00:03:31,2018-10-01 00:03:37,6,0,33042,34,,,,youtube.com,"entertainment,streaming-media"
4,1009076512,1009076508,0,2018-10-01 00:03:37,2018-10-01 00:03:43,6,0,33042,13,,,,youtube.com,"entertainment,streaming-media"


In [21]:
browsing_cat_sum = browsing_cat[['panelist_id', 'top_level_domain_id', 'category_name_top_id', 'active_seconds']].groupby(['panelist_id', 'top_level_domain_id', 'category_name_top_id']).sum().reset_index()

In [22]:
browsing_cat_sum.head()

Unnamed: 0,panelist_id,top_level_domain_id,category_name_top_id,active_seconds
0,0,154,13,558
1,0,325,0,64
2,0,349,5,22
3,0,429,13,108
4,0,997,5,58


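## Function

The function below bundles the steps above into a single call. It returns the panelist list, the browsing table with one row per visit and top-level category, and the lookup tables for top-level domains and top-level categories. The two aggregation steps are commented out and their results are not returned: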

In [23]:
def web_browsing_collection(
    path = 'data/web_browsing/'
):
    '''
    Description: ...
    
    Input:
        path: relative directory where the data is; set to 'data/web_browsing/' by default.
    
    Output: ...
    '''
    import os
    import pandas as pd
    
    # Load the visits and split the comma-separated top-level categories into lists
    browsing = pd.read_csv(os.path.join(path, 'browsing_with_gap.csv.gz'))
    browsing['category_names_top'] = browsing['category_names_top'].str.split(',')
    
    # Node list of panelists
    panelists = pd.DataFrame(browsing['panelist_id'].drop_duplicates()).reset_index(drop=True)
    
    # Node list of top-level domains; replace domain names by integer identifiers
    browsing['top_level_domain'] = browsing['top_level_domain'].astype('category')
    top_level_domains = pd.DataFrame(browsing['top_level_domain'].cat.categories).reset_index()
    top_level_domains.columns = ['top_level_domain_id', 'top_level_domain']
    browsing['top_level_domain'] = browsing['top_level_domain'].cat.codes
    browsing.rename(columns={'top_level_domain': 'top_level_domain_id'}, inplace=True)
    #browsing_sum = browsing[['panelist_id', 'top_level_domain_id', 'active_seconds']].groupby(['panelist_id', 'top_level_domain_id']).sum().reset_index()
    
    # One row per visit and top-level category
    browsing_cat = browsing.copy()
    browsing_cat = browsing_cat.explode('category_names_top').reset_index(drop=True)
    browsing_cat.rename(columns={'category_names_top': 'category_name_top'}, inplace=True)
    
    # Node list of top-level categories; replace category names by integer identifiers
    browsing_cat['category_name_top'] = browsing_cat['category_name_top'].astype('category')
    category_names_top = pd.DataFrame(browsing_cat['category_name_top'].cat.categories).reset_index()
    category_names_top.columns = ['category_name_top_id', 'category_name_top']
    browsing_cat['category_name_top'] = browsing_cat['category_name_top'].cat.codes
    browsing_cat.rename(columns={'category_name_top': 'category_name_top_id'}, inplace=True)
    #browsing_cat_sum = browsing_cat[['panelist_id', 'top_level_domain_id', 'category_name_top_id', 'active_seconds']].groupby(['panelist_id', 'top_level_domain_id', 'category_name_top_id']).sum().reset_index()
    
    return panelists, browsing_cat, top_level_domains, category_names_top

In [24]:
panelists, browsing, top_level_domains, category_names_top = cs.web_browsing_collection()

***

## About this notebook

**License**: CC BY 4.0. Distribute, remix, adapt, and build upon ``compsoc``, even commercially, as long as you credit us for the original creation.

**Suggested citation**: Lietz, H. (2025). Web browsing: Sequential URL visits over a month. Version 1.0 (XX.XX.XXXX). *compsoc – Computational Social Methods in Python*. Cologne: GESIS – Leibniz Institute for the Social Sciences. https://github.com/gesiscss/compsoc