# Aggregate

We've done some work in other notebooks to collect URLs related to COVID-19 from social bookmarking sites and projects. Let's use this notebook to aggregate them into a single dataset.

In [143]:
import pandas

reddit = pandas.read_csv('data/reddit.csv')
pinboard = pandas.read_csv('data/pinboard.csv')
ncovmem = pandas.read_csv('data/ncovmem.csv')
iipc = pandas.read_csv('data/iipc.csv')

While some of the details of these datasets differ, they all contain columns for `url`, `title` and `created`. In the case of ncovmem the created time is stored in a column called `updated`, so let's rename it.

In [144]:
# Rename the `updated` column so ncovmem matches the other datasets.
ncovmem = ncovmem.rename(columns={'updated': 'created'})
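
As a quick sanity check on the rename, here is a minimal sketch using a made-up two-row frame. The `toy` data is invented for illustration and is not taken from `data/ncovmem.csv`, and the sketch assumes the column to rename is called `updated`, as the text above says.

```python
import pandas

# Hypothetical stand-in for the ncovmem data; the values are invented.
toy = pandas.DataFrame({
    'url': ['http://example.com/a', 'http://example.com/b'],
    'title': ['A', 'B'],
    'updated': ['2020-02-01', '2020-02-02'],
})

# rename() maps old column names to new ones and leaves other columns alone.
toy = toy.rename(columns={'updated': 'created'})

# Confirm that the three shared columns are now all present.
expected = {'url', 'title', 'created'}
missing = expected - set(toy.columns)
print(missing)
```

Running the same `expected - set(df.columns)` check against each real dataframe would show right away whether a column name was misspelled.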

Next let's add a column to each dataframe that indicates the source, so that when we combine them we will know where the data came from.

In [145]:
reddit['source'] = 'reddit'
pinboard['source'] = 'pinboard'
ncovmem['source'] = 'ncovmem'
iipc['source'] = 'iipc'

In [146]:
def prune(df):
    # Keep only the columns shared by every source and drop the rest.
    keep = ['url', 'title', 'created', 'source']
    return df.drop(columns=[col for col in df.columns if col not in keep])

reddit = prune(reddit)
ncovmem = prune(ncovmem)
pinboard = prune(pinboard)
iipc = prune(iipc)
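
A possible variation on the pruning step, sketched here on an invented frame: `DataFrame.reindex(columns=...)` keeps only the listed columns, puts them in a fixed order, and fills any that are absent with `NaN`. This is a different technique from the `prune` function above, and the names `KEEP`, `toy` and `pruned` are hypothetical.

```python
import pandas

KEEP = ['url', 'title', 'created', 'source']

# Hypothetical frame with one extra column ('score') and no 'title' column.
toy = pandas.DataFrame({
    'score': [10, 3],
    'url': ['http://example.com/a', 'http://example.com/b'],
    'created': ['2020-02-01', '2020-02-02'],
    'source': ['toy', 'toy'],
})

# reindex drops 'score', adds 'title' as an all-NaN column, and orders the result.
pruned = toy.reindex(columns=KEEP)
print(list(pruned.columns))
print(pruned['title'].isna().all())
```

The trade-off is that a missing column shows up as empty values rather than being silently absent, which may or may not be what you want.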

Now we are ready to combine them!

In [147]:
seeds = pandas.concat([iipc, ncovmem, pinboard, reddit], ignore_index=True)
seeds

| | url | created | title | source |
|---|---|---|---|---|
| 0 | http://coronavirus.fr/ | 2020-02-21T03:43:18.662353Z | Epicorem. Ecoépidémiologie | iipc |
| 1 | http://english.whiov.cas.cn/ | 2020-02-21T03:43:18.706571Z | Wuhan Institute of Virulogy, official page in ... | iipc |
| 2 | http://www.china-embassy.or.jp/chn/ | 2020-02-21T03:43:18.739126Z | 中华人民共和国驻日本大使馆 | iipc |
| 3 | http://www.china-embassy.or.jp/jpn/ | 2020-02-21T03:43:18.766308Z | 中華人民共和国駐日本国大使館 | iipc |
| 4 | https://cadenaser.com/tag/ncov/a/ | 2020-02-21T03:43:18.791716Z | Coronavirus de Wuhan | iipc |
| ... | ... | ... | ... | ... |
| 143330 | https://twitter.com/DarrenPlymouth/status/1220... | 2020-01-23 16:48:54 | Can anyone confirm if this is real? | reddit |
| 143331 | https://www.reddit.com/r/Coronavirus/comments/... | 2020-01-23 17:08:53 | Doctor at Wuhan hospital states “ the virus is... | reddit |
| 143332 | https://www.nature.com/news/inside-the-chinese... | 2020-01-23 17:18:46 | This raises a question to me as to the true or... | reddit |
| 143333 | https://www.reddit.com/r/Coronavirus/comments/... | 2020-01-23 17:26:39 | Would the flu shot provide any protection agai... | reddit |
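
The `created` values in the combined output come in at least two formats: ISO 8601 with a trailing `Z` for iipc, and a plain `YYYY-MM-DD HH:MM:SS` string for reddit. If they need to be compared or sorted, one option is to parse each value separately. The sketch below uses a hypothetical helper, `to_utc`, and assumes that timestamps without an offset are already in UTC, which the source data does not confirm.

```python
import pandas

def to_utc(value):
    """Parse one timestamp string and return a UTC-aware Timestamp."""
    ts = pandas.Timestamp(value)
    # ASSUMPTION: strings without an offset are treated as UTC.
    if ts.tzinfo is None:
        return ts.tz_localize('UTC')
    return ts.tz_convert('UTC')

# One sample of each format, copied from the output above.
samples = pandas.Series([
    '2020-02-21T03:43:18.662353Z',
    '2020-01-23 16:48:54',
])

normalized = samples.map(to_utc)
print(normalized.min())
```

Parsing value by value is slower than a vectorized call, but it avoids depending on how a particular pandas version handles mixed formats.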


## Reddit Posts

Many of the Reddit posts don't link out to the web: they are just questions and comments whose URL points back to reddit.com.

In [148]:
len(seeds[seeds.url.str.contains('reddit.com')])

18730

In [149]:
seeds = seeds[~seeds.url.str.contains('reddit.com')]
len(seeds)

124605
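
Note that `str.contains('reddit.com')` treats its argument as a regular expression and matches anywhere in the string, so it would also catch a URL that merely mentions reddit.com in its path or query string. A stricter alternative, sketched here with a hypothetical helper named `is_reddit_url`, compares the hostname instead. The example URLs are invented.

```python
from urllib.parse import urlsplit

def is_reddit_url(url):
    """Return True when the URL's host is reddit.com or a subdomain of it."""
    host = (urlsplit(url).hostname or '').lower()
    return host == 'reddit.com' or host.endswith('.reddit.com')

# Invented examples showing where a hostname check and a substring match differ.
examples = [
    'https://www.reddit.com/r/Coronavirus/comments/abc/',
    'https://example.com/share?u=https://www.reddit.com/r/x',
    'https://notreddit.com/',
]
print([is_reddit_url(u) for u in examples])
```

Applied to the dataframe this would look like `seeds[~seeds.url.map(is_reddit_url)]`. The resulting counts could differ slightly from the ones shown above.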