Github Research

Wang Cheng-Jun edited this page Dec 19, 2016 · 1 revision

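Computational communication is an important branch of computational social science. It focuses on the computable foundations of human communication behavior. Using communication network analysis, text mining of communication content, and data science as its main analytical tools, it collects and analyzes human communication behavior data at large scale (in a non-intrusive way), uncovers the patterns and laws behind that behavior, and analyzes the generative mechanisms and basic principles behind those patterns. It can be widely applied in settings such as data journalism and computational advertising, and it emphasizes programming training, mathematical modeling, and computational thinking.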


Team Science

http://wiki.swarma.net/index.php/%E7%82%B9%E5%87%BB%E6%B5%81%E7%BD%91%E7%BB%9C%E7%9A%84%E8%80%97%E6%95%A3%E4%B8%8E%E5%BC%82%E9%80%9F%E5%A2%9E%E9%95%BF


GithubArchive Data


Data source

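GitHub data from February 12, 2011 to the present can be downloaded from http://www.githubarchive.org/ . The data is provided as gzip-compressed JSON files (as of 2014-07-16, the data occupied 56 GB on disk). There is one data file per hour; for example, the file for 12:00 on March 11, 2012 is http://data.githubarchive.org/2012-03-11-12.json.gz .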

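Data download

Given this naming scheme, all download links can be constructed as a function of time and the corresponding data fetched. The githubarchive website provides a Ruby example for downloading the data. Google BigQuery also offers a fast way to query this data, but it is limited in computation volume and in the kinds of analysis it supports, so it suits descriptive analysis rather than in-depth data mining. Researchers familiar with Python are advised to download the data with a Python script; see the Python code written by Mazieres at https://github.com/mazieres/github_archive . Once downloaded, the data can be split and processed fairly freely.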
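Because there is one file per hour, the list of download links can be generated from a start and an end time. The sketch below is one possible way to do this in Python. The helper name `hourly_urls` is hypothetical (it is not part of Mazieres's code), and the sketch assumes that the hour in the file name is not zero-padded; adjust the format string if the archive uses a different convention.

```python
from datetime import datetime, timedelta

BASE_URL = "http://data.githubarchive.org/"  # host taken from the example URL

def hourly_urls(start, end):
    """Yield one archive URL per hour from start (inclusive) to end (exclusive)."""
    current = start.replace(minute=0, second=0, microsecond=0)
    while current < end:
        # Assumption: the hour is written without zero padding (e.g. ...-5.json.gz).
        yield "{}{:%Y-%m-%d}-{}.json.gz".format(BASE_URL, current, current.hour)
        current += timedelta(hours=1)

if __name__ == "__main__":
    # Three consecutive hours around the example file mentioned in the text.
    for url in hourly_urls(datetime(2012, 3, 11, 11), datetime(2012, 3, 11, 14)):
        print(url)
```

Each generated URL can then be passed to any HTTP client to fetch the corresponding file.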

Data storytelling using the github archive https://www.oreilly.com/learning/data-storytelling-using-the-github-archive

Githut http://githut.info/

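Data preprocessing

Split the data by event type (types) and, for each event, extract the actor, the corresponding repo, and the time. Each event type is stored in its own folder, with one data file per day.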

import glob
import gzip
import json
import os

path = 'D:/chengjun/githubArchive/'

def saveData(ad):
    # Read one hourly archive; each line is expected to hold one JSON event.
    with gzip.open(ad, 'rt', encoding='utf-8') as f:
        lines = f.read().split('\n')
    for line in lines:
        try:
            line = json.loads(line)
            types = line['type']
            if 'repo' in line:
                repo = line['repo']['name']
                actor = line['actor']['login']
            else:
                repo = line['repository']
                repo = repo['owner'] + '/' + repo['name']
                actor = line['actor']
            time = line['created_at']
            date = time[0:10]
            ts = time[11:19].split(':')
            ts = ts[0] + ts[1] + ts[2]
            record = actor + "\t" + repo + "\t" + ts
            # One folder per event type, one file per day.
            newpath = path + 'days/' + types + '/'
            if not os.path.exists(newpath):
                os.makedirs(newpath)
            with open(newpath + date, 'a') as p:
                p.write(record + "\n")
        except Exception:
            # Skip empty or malformed lines and events missing expected fields.
            pass

ads = glob.glob(path + "*")
ads = [f for f in ads if f[-2:] == 'gz']
# Keep July to December 2012, read from the YYYY-MM-DD-H file name
# rather than from fixed offsets into the full path.
ads = [f for f in ads if os.path.basename(f)[0:4] == '2012'
       and int(os.path.basename(f)[5:7]) > 6]
for n, ad in enumerate(ads):
    print(n, ad)
    saveData(ad)
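The per-day files written above hold tab-separated lines of actor, repo, and time (HHMMSS). As a minimal sketch of how they might be read back, the snippet below counts records per repo. The helper name `count_events_per_repo` and the sample records are hypothetical, used only for illustration.

```python
from collections import Counter

def count_events_per_repo(lines):
    """Count records per repo from 'actor<TAB>repo<TAB>HHMMSS' lines."""
    counts = Counter()
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # skip lines that do not match the three-column layout
        actor, repo, ts = parts
        counts[repo] += 1
    return counts

# Hypothetical sample records in the same layout that saveData writes.
sample = [
    "alice\towner/repo1\t120001\n",
    "bob\towner/repo1\t120502\n",
    "alice\towner/repo2\t130000\n",
]
print(count_events_per_repo(sample).most_common(1))
```

The function accepts any iterable of lines, so an open file handle for one of the per-day files can be passed in directly.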

Reading the data

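import gzip, json, re

def readData(gz_path):
    # Open in text mode so each line is a str that json.loads and re.split accept.
    with gzip.open(gz_path, 'rt', encoding='utf-8') as f:
        files = f.readlines()
    length = len(files)
    if length > 1:
        # Normal case: one JSON event per line.
        acts = []
        for subs in files:
            if subs.strip():
                acts.append(json.loads(subs))
    else:
        # The whole file is a single line of back-to-back JSON objects.
        f2 = files[0]
        r = re.split(r'(\{.*?\})(?= *\{)', f2)
        r = [i for i in r if i]  # delete the empty strings
        accumulator = ''
        acts = []
        for subs in r:
            accumulator += subs
            try:
                acts.append(json.loads(accumulator))
                accumulator = ''
            except Exception as e:
                # Fragment is not yet a complete object; keep accumulating.
                print(e)
    return acts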
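For files in which the JSON objects are written back to back without newlines, an alternative to the regex split is the standard library's `json.JSONDecoder.raw_decode`, which parses one object and reports where it ended. The sketch below is a swapped-in technique, not the approach used in the code above; the function name `split_concatenated_json` is hypothetical.

```python
import json

def split_concatenated_json(text):
    """Parse JSON objects written back to back with no newline between them."""
    decoder = json.JSONDecoder()
    acts = []
    pos = 0
    while pos < len(text):
        # Skip whitespace between objects; raw_decode rejects leading whitespace.
        if text[pos].isspace():
            pos += 1
            continue
        # raw_decode returns the parsed object and the index where it ended.
        obj, pos = decoder.raw_decode(text, pos)
        acts.append(obj)
    return acts

# Hypothetical single-line input with nested braces and no newlines.
line = '{"type": "WatchEvent", "repo": {"name": "a/b"}}{"type": "ForkEvent"} {"type": "PushEvent"}'
print([act["type"] for act in split_concatenated_json(line)])
```

This handles nested braces without trial-and-error parsing, but it raises a `ValueError` on malformed input, so a caller may still want to wrap it in a try/except.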

Code for the initial cleaning and extraction of the data: File:github_lingfei_chengjun.pdf

def get_member(act):
    if 'repo' in act:  # old version data before 2013
        date = str(act['created_at'].split('T')[0])
        try:
            author = act['payload']['member']['login']
        except (TypeError, KeyError):
            # member may be a plain value rather than a dict with a login
            author = act['payload']['member']
        repo_id = int(act['repo']['id'])
        repo_name = act['repo']['name']
    elif 'repository' in act:  # new version data after 2013
        date = str(act['created_at'].split('T')[0])
        author = act['payload']['member']['login']
        repo_id = int(act['repository']['id'])
        repo_name = act['repository']['owner'] + '/' + act['repository']['name']
    else:
        # Neither layout matched, so there is nothing to return.
        raise KeyError("event has neither 'repo' nor 'repository'")
    return date, author, repo_id, repo_name

After re-extraction, the largest team has 570 members.
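One way team sizes might be tallied from the tuples returned by `get_member` is to collect the distinct members added to each repo. The sketch below assumes that team size means the number of distinct added members (the founder is not counted unless added explicitly); the helper name `team_sizes` and the sample records are hypothetical, and the sketch is not claimed to reproduce the figure above.

```python
from collections import defaultdict

def team_sizes(member_records):
    """Map repo_id to the number of distinct members added to that repo."""
    members = defaultdict(set)
    for date, author, repo_id, repo_name in member_records:
        # A set ensures a member added more than once is counted once.
        members[repo_id].add(author)
    return {repo_id: len(authors) for repo_id, authors in members.items()}

# Hypothetical records in the (date, author, repo_id, repo_name) layout.
records = [
    ("2012-07-01", "alice", 1, "owner/repo1"),
    ("2012-07-02", "bob", 1, "owner/repo1"),
    ("2012-07-03", "alice", 1, "owner/repo1"),  # repeated add, counted once
    ("2012-07-01", "carol", 2, "other/repo2"),
]
sizes = team_sizes(records)
print(max(sizes.values()))
```

Keying on `repo_id` rather than `repo_name` is a design choice: it stays stable if a repo is renamed, assuming the id itself does not change.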

Toward a metabolic theory of ecology

C. Cattuto, V. Loreto and V. D. P. Servedio. A Yule-Simon process with memory. Europhysics Letters PREPRINT[1]

Brown, J. H., Gillooly, J. F., Allen, A. P., Savage, V. M., & West, G. B. (2004). Toward a metabolic theory of ecology. Ecology, 85(7), 1771-1789.[2]

"In short, the metabolic rate was limited by the efficiency with which the organism could distribute resources to the cells." John Holland, Complexity: A Very Short Introduction, p. 17

Theory

Team Science

Machine Science

Findings

The first dataset processed here is GitHub's WatchEvent data. Because the 2012 JSON data lacks newline characters, only the 2011 data is used here.


Findings

On GitHub’s Programming Languages

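Use the GitHub data to examine various universal patterns, with reference to "Toward a metabolic theory of ecology". It would be best to start from a modeling breakthrough: ideally, the various patterns, such as the growth life-cycle curve of open-source projects, could all be derived from a microscopic model of the preferential return kind.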

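Clean the MemberEvent data and extract the records of members being added to each repo. Note that many teams never added a member and consist only of the founder; they are not included in this count, which is why the head of the distribution is relatively flat.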

The team size distribution in the figure below is wrong!


Klug M, Bagrow JP. 2016 Understanding the group dynamics and success of teams. R. Soc. open sci. 3: 160007. http://dx.doi.org/10.1098/rsos.160007[3]

- Science of science & Team science -- Why can small teams make breakthroughs?


Wikipedia Clickstream data

Treat the editors of a page as a group; pages can cite each other.

https://figshare.com/articles/Wikipedia_Clickstream/1305770

References