mirror - Tools for software project analysis

Setup

Prepare a Python environment and install the package.
For development, use pip install -r requirements.dev.txt
Copy sample.env to dev.env, fill in the required variables, and source it:

export GITHUB_TOKEN="<your GitHub token>"
export LANGUAGES_DIR="<directory with cloned languages repos>"
export MIRROR_CRAWL_INTERVAL_SECONDS=1
export MIRROR_CRAWL_MIN_RATE_LIMIT=500  # for search, 5 is a better value
export MIRROR_CRAWL_BATCH_SIZE="<how often to save data>"
export MIRROR_CRAWL_DIR="<where to save crawled data>"
export MIRROR_LANGUAGES_FILE="<json file with languages>"
export SNIPPETS_DIR="<dir for snippets dataset>"
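
Before running any command it can help to confirm that dev.env was actually sourced. The following is a minimal sketch, not part of mirror: the helper name missing_variables is hypothetical, and it assumes every variable listed above is needed, which may not hold for every command.

```python
import os

# Variable names copied from the export block above.
REQUIRED_VARIABLES = (
    "GITHUB_TOKEN",
    "LANGUAGES_DIR",
    "MIRROR_CRAWL_INTERVAL_SECONDS",
    "MIRROR_CRAWL_MIN_RATE_LIMIT",
    "MIRROR_CRAWL_BATCH_SIZE",
    "MIRROR_CRAWL_DIR",
    "MIRROR_LANGUAGES_FILE",
    "SNIPPETS_DIR",
)


def missing_variables(env=None):
    """Return required variable names that are unset or empty (hypothetical helper)."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARIABLES if not env.get(name)]


if __name__ == "__main__":
    missing = missing_variables()
    if missing:
        print("Missing variables: " + ", ".join(missing))
    else:
        print("All variables are set.")
```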

To avoid being blocked by GitHub, set up a rate limit watcher:

watch -d -n 5 'curl https://api.github.com/rate_limit -s -H "Authorization: Bearer $GITHUB_TOKEN" -H "Accept: application/vnd.github.v3+json"'
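
The same check can be scripted. The sketch below queries the same rate_limit endpoint and compares the remaining quota with a threshold such as MIRROR_CRAWL_MIN_RATE_LIMIT. The function names are hypothetical, and how mirror itself applies the threshold is an assumption; this only illustrates the idea. The SAMPLE payload is made-up data in the shape the endpoint returns, so the logic can be tried offline.

```python
import json
import urllib.request

RATE_LIMIT_URL = "https://api.github.com/rate_limit"


def fetch_rate_limit(token):
    """Fetch the current rate limit status from GitHub (needs network access)."""
    request = urllib.request.Request(
        RATE_LIMIT_URL,
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github.v3+json",
        },
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def remaining(payload, resource="core"):
    """Read the remaining request count for one resource ("core" or "search")."""
    return int(payload["resources"][resource]["remaining"])


def should_pause(payload, min_rate_limit, resource="core"):
    """Assumed policy: pause when the remaining quota drops below the threshold."""
    return remaining(payload, resource) < min_rate_limit


# Made-up payload for offline experiments, not real API output.
SAMPLE = {
    "resources": {
        "core": {"limit": 5000, "remaining": 420, "reset": 1700000000},
        "search": {"limit": 30, "remaining": 12, "reset": 1700000000},
    }
}
```

With the sample data, should_pause(SAMPLE, 500) is true for the core resource, while should_pause(SAMPLE, 5, "search") is false, which matches the advice to use a much lower threshold for search.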

Module commands

python -m mirror.cli --help

  clone              Clone repos from search api to output dir.
  commits            Read repos json file and upload all commits for that...
  crawl              Processes arguments as parsed from the command line
                     and...

  generate_snippets  Create snippets dataset from cloned repos
  nextid             Prints ID of most recent repository crawled and
                     written...

  sample             Writes repositories sampled from a crawl directory to...
  search             Crawl via search api.
  validate           Prints ID of most recent repository crawled and
                     written...

Extract all repos metadata

Run the crawl command to extract metadata for all repositories and save it in a .json file.

python -m mirror.cli crawl \
  --crawldir $MIRROR_CRAWL_DIR \
  --interval $MIRROR_CRAWL_INTERVAL_SECONDS \
  --min-rate-limit $MIRROR_CRAWL_MIN_RATE_LIMIT \
  --batch-size $MIRROR_CRAWL_BATCH_SIZE

Extract repos metadata via search api

If you only need a small pool of repositories for analysis, use the search command to set more precise criteria.

python -m mirror.cli search --crawldir "$MIRROR_CRAWL_DIR/search" -L "python" -s ">500" -l 5

Clone repos to local machine for analysis

The clone command reads repositories from search or crawl results and clones them to the local machine using the standard git clone.

Clone from search

python -m mirror.cli clone -d $LANGUAGES_DIR -r "$MIRROR_CRAWL_DIR/search"

Clone from crawl

python -m mirror.cli clone -d $LANGUAGES_DIR -r "$MIRROR_CRAWL_DIR"

Structure of $LANGUAGES_DIR directory:

> $LANGUAGES_DIR
  > language 1
    > repo 1
    > repo 2
    ...
  > language 2
    > repo 1
    > repo 2
    ...
  ...
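
To check what ended up on disk after cloning, the layout above can be walked with a few lines of Python. This is a minimal sketch that assumes exactly the two-level language/repo structure shown; repos_by_language is a hypothetical helper, not part of mirror.

```python
from pathlib import Path


def repos_by_language(languages_dir):
    """Map each language directory name to the sorted names of the repo directories inside it."""
    root = Path(languages_dir)
    result = {}
    for language_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        result[language_dir.name] = sorted(
            p.name for p in language_dir.iterdir() if p.is_dir()
        )
    return result
```

For example, calling repos_by_language(os.environ["LANGUAGES_DIR"]) after importing os returns a dictionary keyed by language directory.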

It is also possible to download popular repositories with Python code. See the example in ex_clone.py

Create commits from repo search

The commits command extracts all commits from each repository and saves .json files with the commits for each repository.

python -m mirror.cli commits -d "$MIRROR_CRAWL_DIR/commits" -l 5 -r "$MIRROR_CRAWL_DIR/search"

Convert json data to csv for analysis

This creates a .csv file in which the nested JSON structure is flattened.

python -m mirror.github.utils --json-files-folder "$MIRROR_CRAWL_DIR" --output-csv "$MIRROR_CRAWL_DIR/output.csv" --command commits
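
To illustrate what a flat JSON structure means here, the sketch below turns nested keys into dotted column names and writes the rows with the standard csv module. It is an illustration only: the dotted naming, the handling of lists, and the function names are assumptions and may differ from what mirror.github.utils does.

```python
import csv


def flatten(obj, prefix="", sep="."):
    """Flatten nested dicts into a single-level dict with joined key names."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}{sep}{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        else:
            # Lists and scalars are kept as they are in this sketch.
            flat[name] = value
    return flat


def write_csv(records, out_path):
    """Write flattened records to a CSV file; missing columns are left empty."""
    rows = [flatten(record) for record in records]
    fieldnames = sorted({key for row in rows for key in row})
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```

A record such as {"sha": "abc", "commit": {"author": {"name": "x"}}} becomes a row with the columns commit.author.name and sha.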

Generate snippets dataset from downloaded repo

python -m mirror.github.generate_snippets -r "$OUTPUT_DIR" -f "examples/languages.json" -L "$LANGUAGES_DIR"

Workflow for generating a snippet dataset from a prepared file with languages and their extensions

Create search result

python -m mirror.cli search -d "$MIRROR_CRAWL_DIR/search" -f $MIRROR_LANGUAGES_FILE -s ">500" -l 5

Clone repos from the search result. This takes time, so it may be a good idea to keep the git clone output out of the terminal.

python -m mirror.cli clone -d $LANGUAGES_DIR -r "$MIRROR_CRAWL_DIR/search"

Generate snippets

python -m mirror.cli generate_snippets  -d $SNIPPETS_DIR -r $LANGUAGES_DIR

It returns a SQLite database with snippets and their metadata.
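
The schema of the generated database is not described here, so a safe first step is to list its tables and row counts instead of assuming table names. A minimal sketch using the standard sqlite3 module; table_row_counts is a hypothetical helper, and the database file name inside $SNIPPETS_DIR has to be looked up, since this README does not state it.

```python
import sqlite3
from contextlib import closing


def table_row_counts(db_path):
    """Return {table name: row count} for every table in a SQLite database."""
    with closing(sqlite3.connect(db_path)) as conn:
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        return {
            name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names
        }
```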

To work across an allrepos result, the clone and commits commands accept the optional arguments

 --start-id --end-id

Both parameters must be set together. They make it possible to process only part of the repositories from an allrepos result.
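
The effect of an ID window can be pictured as a filter over the crawled repository records. The sketch below is an assumption about the semantics, not mirror's code: it treats both bounds as inclusive, reads the ID from an "id" field as in GitHub repository metadata, and uses the hypothetical helper name select_id_range.

```python
def select_id_range(repos, start_id=None, end_id=None):
    """Keep repositories whose id lies in [start_id, end_id] (inclusive bounds assumed)."""
    if (start_id is None) != (end_id is None):
        raise ValueError("start_id and end_id must be set together")
    if start_id is None:
        return list(repos)
    return [repo for repo in repos if start_id <= repo["id"] <= end_id]


# Made-up records for illustration.
repos = [{"id": 1}, {"id": 26}, {"id": 27}, {"id": 43}]
selected = select_id_range(repos, start_id=20, end_id=30)
```

Here selected holds the records with IDs 26 and 27, and passing only one of the two bounds raises ValueError.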
