Experimental GItHub project analysis based on GHTorrent
Switch branches/tags
Nothing to show
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Failed to load latest commit information.
images
README.md
all_languages.csv
ght-restore-mysql-projects
ght_commits2es.py
ght_projects2es.py

README.md

ghtorrent

Experimental GitHub project analysis based on GHTorrent.

The goal of this project is to explore the use of GHTorrent to understand better the life of Open Source software projects.

It is in the initial conception phase so probably only useful to exchange ideas and early prototypes.

Feedback

Don't hesitate to open GitHub issues if you have any kind of problem or you want to feedback or exchange thoughts.

Requirements

It is needed a machine with 16GB of RAM, 4 cores and 1TB of hard disk to execute all without issues.

You need also a running Elasticsearch and Kibana.

And you will need around 12h of compute time (<1h human time) to execute all processes.

Initial Step

The first step is to download GHTorrent data. The testing has been done with: mysql-2018-09-01.tar.gz.

Now, there are two analysis. The first open loading the projects table and the second one loading projects, users and commits.

Projects (Repositories) analysis

To load the GitHub projects data in Elasticsearch:

  • Import in MySQL the projects table using ght-restore-mysql-projects
  • Import in Elasticsearch the projects data using ght_projects2es.py
ght_projects2es.py -e <elastic_url> -i ghtprojects --db-name ghtorrent_projects
  • Use Kibana to visualize the projects data and start analyzing it

You will need around 4h to complete the above steps.

Evolution in time of new projects

Evolution in time of non fork new projects

Evolution in time of new projects top 10 languages

The language detection is based on https://github.com/github/linguist

Ranking with all languages

CSV data

Compare with:

Commits

  • Import in MySQL the commits, projects and users tables: it has been tested teh loading of commits only for 2018 (220MM). For doing that:
grep '"2018-' commits.csv > commits-2018.csv

You must load this commits-2018.csv 22GB file instead of the commits.csv which is (100GB).

  • Import in Elasticsearch the projects data using ght_commits2es.py
ght_commits2es.py -e <elastic_url> -i ghtcommits --db-name ghtorrent_commits
  • Use Kibana to visualize the projects data and start analyzing it

You will need around 12h to complete the above steps.

Evolution in time of commits for software projects

Evolution in time of commits for Grimoirelab ecosystem