
Mining Data Science Repositories

Research on mining Data Science repositories: a large-scale comparative analysis of Coding Standard conformance in Open-Source Data Science projects.

Steps to Run

Figshare: Extract the contents of results.tar.gz into the output directory, then jump to the Analyse results (in Jupyter) section.
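A minimal sketch of the Figshare path, assuming results.tar.gz has been downloaded to the repository root and its contents belong directly under output:

    # Extract the Figshare archive into the output directory
    mkdir -p output
    tar -xzf results.tar.gz -C output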

From scratch: Clone this repository, then follow the steps below to identify, clone, and analyse the repositories.

Install Docker

If Docker is not present, install it by following the official installation guide: https://docs.docker.com/get-docker/
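A quick check that Docker is installed and the daemon is reachable:

    # Verify the Docker CLI and daemon are available
    docker --version
    docker info > /dev/null && echo "Docker daemon is running"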

Datasets

We have four directories: data, input_drive, input, and output:

  • The data folder holds project metadata fetched from GitHub (97 MB, committed to this Git repo for convenience)

  • The input_drive folder is for the cloned repositories (4.6 TB in total, so we suggest using a network storage drive)

  • The input folder is populated by the symlink_input task, which creates symlinks pointing to the repositories in input_drive.

  • The output directory holds metrics and the final analysis results (2 GB when compressed, shared on Figshare).

GitHub Access Token

Go to https://github.com/settings/tokens/new to generate a new token with the public_repo and read:packages permissions, and update mining_nlp_repositories/github.py with your ACCESS_TOKEN.
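If github.py defines a module-level ACCESS_TOKEN variable (an assumption based on the instruction above; the token value below is a placeholder), the update can be scripted:

    # Replace the ACCESS_TOKEN assignment with your own token (GNU sed; on macOS use: sed -i '')
    sed -i 's/^ACCESS_TOKEN = .*/ACCESS_TOKEN = "ghp_your_token_here"/' mining_nlp_repositories/github.py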

Tasks

  • List all tasks

    surround run list

  • Build the docker image

    surround run build

  • Fetch project meta-data from GitHub (requires GitHub Access Token)

    surround run fetch_data_science_projects
    surround run fetch_non_data_science_projects
    
  • Clone projects from GitHub

    surround run clone_data_science_projects
    surround run clone_non_data_science_projects
    

    Move data/boa/cloned-repos to input_drive/cloned-repos/boa

    Move data/non-data-science/cloned-repos to input_drive/cloned-repos/non-data-science

    Manually create input_drive/cloned-repos/boa-zip-download and extract any unclonable DS repos here

    Manually create input_drive/cloned-repos/non-data-science-zip-download and extract any unclonable non-DS repos here
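    The manual moves above can be scripted; a minimal sketch using the paths exactly as listed:

    # Move cloned repositories onto the storage drive
    mkdir -p input_drive/cloned-repos
    mv data/boa/cloned-repos input_drive/cloned-repos/boa
    mv data/non-data-science/cloned-repos input_drive/cloned-repos/non-data-science
    # Create directories for repos that could not be cloned and were downloaded as zip archives instead
    mkdir -p input_drive/cloned-repos/boa-zip-download
    mkdir -p input_drive/cloned-repos/non-data-science-zip-download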

  • Specify list of repositories to extract metrics for

    Run the notebooks/create-lists-to-extract.ipynb notebook and move the results from data/selected to input_drive/selected

    Manually modify the lists as needed. For example, repo_ids_ds_chunk_000801-001552_filt.csv excludes repo 858127 because it contains a file that causes Pylint to hang indefinitely.
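    A sketch of this step, assuming the notebook can be executed non-interactively with jupyter nbconvert:

    # Execute the selection notebook in place, then move the generated lists to the storage drive
    jupyter nbconvert --to notebook --execute --inplace notebooks/create-lists-to-extract.ipynb
    mv data/selected input_drive/selected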

  • Populate input directory with symlinks (requires repos in input_drive directory)

    surround run symlink_input

  • Extract metrics (requires input directory to be populated)

    surround run analyse_imports
    surround run analyse_2to3
    surround run analyse_pylint
    surround run analyse_radon_cc
    surround run analyse_loc
    surround run analyse_git
    

    Each of the analyse tasks supports an optional argument that limits the list of repositories analysed, e.g. surround run analyse_pylint input/repos-ids.csv (useful for splitting up large jobs). If the argument is not provided, all repos are analysed.

    The exact commands used are listed below. Due to a limitation of Surround (Issue #230), it was necessary to call doit directly in order to run multiple Surround tasks simultaneously:

    mkdir -p output/ds-t1; nohup time doit --backend sqlite3 analyse_2to3 --args "input_drive/selected/repo_ids_full_ds.csv output/ds-t1" > output/ds-t1/nohup.out &
    mkdir -p output/ds-t2; nohup time doit --backend sqlite3 analyse_imports --args "input_drive/selected/repo_ids_full_ds.csv output/ds-t2" > output/ds-t2/nohup.out &
    mkdir -p output/ds-t3; nohup time doit --backend sqlite3 analyse_radon_cc --args "input_drive/selected/repo_ids_full_ds.csv output/ds-t3" > output/ds-t3/nohup.out &
    # Skipped: Takes 302 hours:
    # mkdir -p output/ds-t4; nohup time doit --backend sqlite3 analyse_radon_raw --args "input_drive/selected/repo_ids_full_ds.csv output/ds-t4" > output/ds-t4/nohup.out &
    mkdir -p output/ds-t5; nohup time doit --backend sqlite3 analyse_version --args "input_drive/selected/repo_ids_full_ds.csv output/ds-t5" > output/ds-t5/nohup.out &
    mkdir -p output/ds-t6; nohup time doit --backend sqlite3 analyse_loc --args "input_drive/selected/repo_ids_full_ds.csv output/ds-t6" > output/ds-t6/nohup.out &
    mkdir -p output/ds-t7; nohup time doit --backend sqlite3 analyse_git --args "input_drive/selected/repo_ids_full_ds.csv output/ds-t7" > output/ds-t7/nohup.out &
    
    mkdir -p output/nonds-t1; nohup time doit --backend sqlite3 analyse_2to3 --args "input_drive/selected/repo_ids_full_nonds.csv output/nonds-t1" > output/nonds-t1/nohup.out &
    mkdir -p output/nonds-t2; nohup time doit --backend sqlite3 analyse_imports --args "input_drive/selected/repo_ids_full_nonds.csv output/nonds-t2" > output/nonds-t2/nohup.out &
    mkdir -p output/nonds-t3; nohup time doit --backend sqlite3 analyse_radon_cc --args "input_drive/selected/repo_ids_full_nonds.csv output/nonds-t3" > output/nonds-t3/nohup.out &
    # Skipped: Hangs indefinitely on repo 67065438:
    # mkdir -p output/nonds-t4; nohup time doit --backend sqlite3 analyse_radon_raw --args "input_drive/selected/repo_ids_full_nonds.csv output/nonds-t4" > output/nonds-t4/nohup.out &
    mkdir -p output/nonds-t5; nohup time doit --backend sqlite3 analyse_version --args "input_drive/selected/repo_ids_full_nonds.csv output/nonds-t5" > output/nonds-t5/nohup.out & 
    mkdir -p output/nonds-t6; nohup time doit --backend sqlite3 analyse_loc --args "input_drive/selected/repo_ids_full_nonds.csv output/nonds-t6" > output/nonds-t6/nohup.out &
    mkdir -p output/nonds-t7; nohup time doit --backend sqlite3 analyse_git --args "input_drive/selected/repo_ids_full_nonds.csv output/nonds-t7" > output/nonds-t7/nohup.out &
    
    mkdir -p output/ds-chunk11; nohup time doit --backend sqlite3 analyse_pylint --args "input_drive/selected/repo_ids_ds_chunk_000001-000800.csv output/ds-chunk11" > output/ds-chunk11/nohup.out &
    # Revised: Hangs indefinitely on repo 858127:
    # mkdir -p output/ds-chunk2; nohup time doit --backend sqlite3 analyse_pylint --args "input_drive/selected/repo_ids_ds_chunk_000801-001552.csv output/ds-chunk2" > output/ds-chunk2/nohup.out &
    mkdir -p output/ds-chunk13; nohup time doit --backend sqlite3 analyse_pylint --args "input_drive/selected/repo_ids_ds_chunk_000801-001552_filt.csv output/ds-chunk13" > output/ds-chunk13/nohup.out &
    
    mkdir -p output/nonds-chunk11; nohup time doit --backend sqlite3 analyse_pylint --args "input_drive/selected/repo_ids_nonds_chunk_000001-000800.csv output/nonds-chunk11" > output/nonds-chunk11/nohup.out &
    mkdir -p output/nonds-chunk12; nohup time doit --backend sqlite3 analyse_pylint --args "input_drive/selected/repo_ids_nonds_chunk_000801-001600.csv output/nonds-chunk12" > output/nonds-chunk12/nohup.out &
    mkdir -p output/nonds-chunk13; nohup time doit --backend sqlite3 analyse_pylint --args "input_drive/selected/repo_ids_nonds_chunk_001601-002400.csv output/nonds-chunk13" > output/nonds-chunk13/nohup.out &
    mkdir -p output/nonds-chunk14; nohup time doit --backend sqlite3 analyse_pylint --args "input_drive/selected/repo_ids_nonds_chunk_002401-002511.csv output/nonds-chunk14" > output/nonds-chunk14/nohup.out &
    
    

    Each command takes between 1 hour (LOC over the DS repos) and 52 hours (Pylint over a chunk of 800 repos) and may consume up to 8 GB of memory. (We assigned ~4 concurrent tasks to each node.)

Analyse results (in Jupyter)

  • Merge the chunks back together (results will be written to output/merged):

    merge_chunks-cc.ipynb
    merge_chunks-imports.ipynb
    merge_chunks.ipynb
    merge_chunks-loc.ipynb
    merge_chunks-version.ipynb
    merge_chunks-git.ipynb
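
    The merge notebooks can also be run in sequence from the shell; a sketch assuming they live in the notebooks directory and execute non-interactively:

    # Execute each merge notebook in turn (results land in output/merged)
    for nb in merge_chunks-cc merge_chunks-imports merge_chunks merge_chunks-loc merge_chunks-version merge_chunks-git; do
        jupyter nbconvert --to notebook --execute --inplace "notebooks/${nb}.ipynb"
    done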
    
  • Analyse project imports and Python version (intermediate results will be written to output/notebooks_out):

    analyse_imports.ipynb
    analyse_py_ver.ipynb
    
  • Refine the final selection of DS and non-DS repos to control for the distribution of stars, age, etc.:

    distributions-sel.ipynb
    

  • Analyse differences between the final selection of DS versus non-DS repos:

    ml-distribution.ipynb
    
  • Tables and figures for the paper will be exported to output/notebooks_out

  • Remove the docker image

    surround run remove

Known Bugs

  • The GitHub API paginates results, so the number of contributors reported is capped at 30 and should be interpreted as 30+. This does not affect the figure in the paper, as its axis is limited to 30.
  • The old project name was mining_nlp_repositories, as we initially trialled the analysis on a corpus of NLP projects. The new name, Mining Data Science Repositories, reflects the project's broader scope covering all types of DS repositories (the source code still contains references to the old name).
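The 30-contributor cap in the first item reflects the GitHub REST API's default page size of 30; a quick illustration (OWNER and REPO are placeholders):

    # Without pagination handling, a single request returns at most 30 contributors
    curl -s "https://api.github.com/repos/OWNER/REPO/contributors" | grep -c '"login"'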
