# Aggregate

We've done some work in other notebooks to collect URLs related to COVID-19 from social bookmarking sites and projects. Let's use this notebook to aggregate them into a single dataset.

In [69]:
import pandas

reddit = pandas.read_csv('data/reddit.csv')
pinboard = pandas.read_csv('data/pinboard.csv')
ncovmem = pandas.read_csv('data/ncovmem.csv')
iipc = pandas.read_csv('data/iipc.csv')

While some of the details of these datasets are different, they all contain columns for `url`, `title` and `created`. In the case of ncovmem the created time is stored in a column called `updated`, so let's rename that.

In [70]:
ncovmem = ncovmem.rename(columns={'updated': 'created'})

Next, let's add a column to each dataframe that indicates the source, so that when we combine them we will know where the data came from.

In [71]:
reddit['source'] = 'reddit'
pinboard['source'] = 'pinboard'
ncovmem['source'] = 'ncovmem'
iipc['source'] = 'iipc'

In [72]:
def prune(df):
    # keep only the columns that every dataset shares
    keep = ['url', 'title', 'created', 'source']
    return df.drop(columns=[col for col in df.columns if col not in keep])

reddit = prune(reddit)
ncovmem = prune(ncovmem)
pinboard = prune(pinboard)
iipc = prune(iipc)

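Before combining, it can be worth confirming that every frame really ended up with the same four columns, since a missed rename would otherwise show up later as a column full of missing values. The following is only a minimal sketch: `missing_columns` and the toy frames are hypothetical names and data made up for illustration, not part of the notebooks above.

```python
import pandas

EXPECTED = ['url', 'title', 'created', 'source']

def missing_columns(df, expected=EXPECTED):
    """Return the expected column names that are absent from df."""
    return [col for col in expected if col not in df.columns]

# Toy frames standing in for the real datasets (made-up values).
complete = pandas.DataFrame({
    'url': ['http://example.com/'],
    'title': ['An example'],
    'created': ['2020-03-25 11:53:00'],
    'source': ['reddit'],
})
incomplete = complete.drop(columns='created')

print(missing_columns(complete))
print(missing_columns(incomplete))
```

Running the same function over `reddit`, `pinboard`, `ncovmem` and `iipc` should give an empty list for each if the rename and pruning steps worked as intended.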
Now we are ready to combine them together!

In [73]:
seeds = pandas.concat([iipc, ncovmem, pinboard, reddit], ignore_index=True)
seeds

,url,created,title,source
0,http://coronavirus.fr/,2020-02-21T03:43:18.662353Z,Epicorem. Ecoépidémiologie,iipc
1,http://english.whiov.cas.cn/,2020-02-21T03:43:18.706571Z,"Wuhan Institute of Virulogy, official page in ...",iipc
2,http://www.china-embassy.or.jp/chn/,2020-02-21T03:43:18.739126Z,中华人民共和国驻日本大使馆,iipc
3,http://www.china-embassy.or.jp/jpn/,2020-02-21T03:43:18.766308Z,中華人民共和国駐日本国大使館,iipc
4,https://cadenaser.com/tag/ncov/a/,2020-02-21T03:43:18.791716Z,Coronavirus de Wuhan,iipc
...,...,...,...,...
139924,https://www.google.co.uk/amp/s/www.liverpoolec...,2020-03-25 11:53:00,21 year old dies with no existing condition,reddit
139925,https://www.gearbest.com/braces---supports/pp_...,2020-03-25 11:53:20,"In case you still have time, here are some N95...",reddit
139926,https://thefederalist.com/2020/03/25/how-medic...,2020-03-25 11:53:36,The Federalist - It is time to think outside t...,reddit
139927,https://www.chron.com/local/komo/article/King-...,2020-03-25 11:53:53,Seattle - King County preparing to release hun...,reddit

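The combined `created` column mixes two timestamp styles: ISO 8601 strings with a `Z` suffix and plain `YYYY-MM-DD HH:MM:SS` strings with no offset. If we want to sort or compare across sources, one option is to normalize them to UTC. This is a sketch rather than a definitive approach: `to_utc` is a hypothetical helper, and it assumes that timestamps without an offset are already in UTC, which the source datasets do not confirm.

```python
import pandas

def to_utc(value):
    """Parse one timestamp string and return a timezone-aware UTC Timestamp."""
    if pandas.isna(value):
        return pandas.NaT
    ts = pandas.Timestamp(value)
    # ASSUMPTION: values without an offset are treated as already being UTC.
    return ts.tz_localize('UTC') if ts.tzinfo is None else ts.tz_convert('UTC')

# Sample values in the two styles that appear in the combined data,
# plus a missing value.
sample = pandas.DataFrame({
    'created': ['2020-02-21T03:43:18.662353Z', '2020-03-25 11:53:00', None],
})
sample['created_utc'] = sample['created'].map(to_utc)
print(sample)
```

Parsing one value at a time is slower than a vectorized conversion, but it sidesteps differences in how pandas versions handle mixed formats in a single column.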

## Massage

One thing about the Reddit dataset is that many of its URLs point to posts on Reddit itself, rather than to external links.

In [77]:
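#seeds[seeds.url.str.match('https://reddit')]
# na=False treats rows with a missing url as non-matches; without it the mask
# contains NaN and pandas raises "ValueError: Cannot mask with non-boolean array containing NA / NaN values"
seeds[seeds.url.str.startswith('https', na=False)]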

Whether to keep these is up to whoever is doing the archiving, but since we are mostly interested in archiving web content outside of these social bookmarking tools, we can remove them fairly easily.
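
One way to do the removal is to look at each URL's hostname and drop rows that point at Reddit. This is a minimal sketch under a few assumptions: `is_reddit` is a hypothetical helper, Reddit posts are assumed to live on `reddit.com` or one of its subdomains (short links on other domains are not covered), and the sample URLs are made up for illustration.

```python
import pandas
from urllib.parse import urlparse

def is_reddit(url):
    """True when the URL's host is reddit.com or one of its subdomains."""
    if not isinstance(url, str):
        return False  # missing URLs (NaN/None) are not counted as Reddit posts
    host = (urlparse(url).hostname or '').lower()
    return host == 'reddit.com' or host.endswith('.reddit.com')

# Made-up sample rows: one Reddit post, one external link, one missing URL.
sample = pandas.DataFrame({'url': [
    'https://www.reddit.com/r/Coronavirus/comments/example/',
    'http://coronavirus.fr/',
    None,
]})

# Keep only the rows that do not point at Reddit.
external = sample[~sample['url'].map(is_reddit).astype(bool)]
print(external)
```

Note that rows with a missing URL survive this filter; they could be dropped separately with `dropna(subset=['url'])` if they are not wanted. Applied to the real data, the same mask would be built from `seeds['url']`.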