DS Mining is a project that mines data from public Data Science repositories to identify patterns and behaviors. It is developed in Python 3.8, uses SQLAlchemy to manage a SQLite database, and uses pytest as its testing framework.
The project consists of four steps that go from data collection to analysis.
The corpus of the project was built by searching GitHub's GraphQL API for the terms "Data Science", "Ciência de Dados", "Ciencia de los Datos", and "Science des Données" (i.e., "Data Science" in Portuguese, Spanish, and French). After the collection, we established the following requirements a repository must meet to be analyzed:
- At least 1 language, 1 commit and 1 contributor
- Is not a course project
Repositories that did not meet these requirements were discarded in step 2, the filtering.
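For illustration, here is a minimal sketch of how such a search can be issued against GitHub's GraphQL API with `requests`. The query string, pagination, and token handling are assumptions for the sketch, not the exact query used by the collection script (s1_collect.py, below):

```python
import os
import requests

GITHUB_GRAPHQL = "https://api.github.com/graphql"

# Hypothetical, simplified version of the search performed by s1_collect.py.
QUERY = """
query ($search: String!) {
  search(query: $search, type: REPOSITORY, first: 100) {
    nodes {
      ... on Repository { nameWithOwner primaryLanguage { name } }
    }
  }
}
"""

def search_repositories(term, token):
    response = requests.post(
        GITHUB_GRAPHQL,
        json={"query": QUERY, "variables": {"search": term}},
        headers={"Authorization": f"bearer {token}"},
    )
    response.raise_for_status()
    return response.json()["data"]["search"]["nodes"]

token = os.environ["GITHUB_TOKEN"]  # a personal access token is required
for term in ["Data Science", "Ciência de Dados", "Ciencia de los Datos", "Science des Données"]:
    print(term, len(search_repositories(term, token)))
```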
Script | Description | Input Table | Output Table |
---|---|---|---|
s1_collect.py | Queries projects' metadata from the GitHub API | None | Queries |
s2_filter.ipynb | Filters and selects repositories for further extractions | Queries | Repositories |
s3_extract.py | Extracts data from selected repositories | Repositories | Commits, Notebooks, Cells, Python Files, Requirement Files, and other tables derived from them |
The extraction is broken down into the following scripts:

Script | Description | Input Table | Output Table |
---|---|---|---|
e1_download.py | Downloads selected repositories from GitHub | Repositories | Repositories |
e2_notebooks_and_cells.py | Extracts Notebooks and Cells from repositories | Repositories | Notebooks, Cells |
e3_python_files.py | Extracts Python Files from repositories | Repositories | Python Files |
e4_requirement_files.py | Extracts Requirement Files from repositories | Repositories | Requirement Files |
e5_markdown_cells.py | Extracts features from markdown cells | Cells with type "markdown" | Cell Markdown Features |
e6_code_cells.py | Extracts features from code cells | Cells with type "code" | Cell Modules, Cell Data IOs |
e7_python_features.py | Extracts features from Python Files | Python Files | Python Modules, Python Data IOs |
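As a hint of what the feature extraction does, here is a hedged sketch of pulling imported modules out of a code cell or Python file with the standard `ast` module. `extract_modules` is a hypothetical helper; the real e6/e7 extractors are more involved:

```python
import ast

def extract_modules(source):
    """Return the names of modules imported by a piece of Python source."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            modules.add(node.module)
    return sorted(modules)

print(extract_modules("import pandas as pd\nfrom sklearn.linear_model import LinearRegression"))
# ['pandas', 'sklearn.linear_model']
```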
After extraction, two notebooks aggregate the data:

Script | Description | Input Table | Output Table |
---|---|---|---|
ag1_notebook_aggregate.ipynb | Aggregates some of the data related to Notebooks and their Cells for easier analysis | Cell Markdown Features, Cell Modules, Cell Data IOs | Notebook Markdowns, Modules, Data IOs |
ag2_python_aggregate.ipynb | Aggregates some of the data related to Python Files for easier analysis | Python Modules, Python Data IOs | Modules, Data IOs |
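Conceptually, the aggregation rolls cell-level rows up to one row per notebook. A minimal sketch of the idea with pandas, using illustrative column names rather than the actual schema:

```python
import pandas as pd

# Illustrative cell-level module usage; the real rows come from the database.
cell_modules = pd.DataFrame({
    "notebook_id": [1, 1, 2],
    "module": ["pandas", "numpy", "pandas"],
})

# Roll up to one row per notebook with the distinct modules it imports.
notebook_modules = (
    cell_modules.groupby("notebook_id")["module"]
    .apply(lambda modules: sorted(set(modules)))
    .reset_index()
)
print(notebook_modules)
```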
After we extract all the data from the selected repositories, we use Jupyter Notebooks to analyze it, draw conclusions, and generate graphical outputs.
Notebook | Description |
---|---|
a1_collected.ipynb | Analyzes collected repositories' features |
a2_filtered.ipynb | Analyzes language-related features |
a3_selected.ipynb | Analyzes selected repositories' features |
a4_modules.ipynb | Analyzes modules extracted |
a5_code_and_data.ipynb | Analyzes code features and data inputs/outputs |
The resulting database is available here (~28GB).
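Once downloaded, the database can be explored like any other SQLite file. A minimal sketch with SQLAlchemy and pandas; the file name and table name here are assumptions:

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical file name; point this at the downloaded database.
engine = create_engine("sqlite:///dsmining.sqlite")

# Load one of the result tables into pandas for ad hoc analysis.
repositories = pd.read_sql_table("repositories", engine)
print(repositories.head())
```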
The project primarily uses Python 3.8 as its interpreter, but it also needs Python 2.7 and 3.5 to extract features from the Abstract Syntax Trees of code written for those versions (see the sketch below). To manage the multiple interpreters we use Conda; instructions to install it on Linux can be found here.
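The older interpreters are needed because `ast.parse` only accepts the grammar of the interpreter running it, so Python 2 sources have to be parsed under a matching interpreter. A small illustration:

```python
import ast

# A Python 2 print statement: valid under the 2.7 interpreter,
# but a SyntaxError when parsed by Python 3.8's ast module.
try:
    ast.parse('print "hello"')
except SyntaxError as err:
    print("Python 3 cannot parse Python 2 source:", err)
```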
After downloading and installing Conda, you might need to add `export PATH="~/anaconda3/bin":$PATH` to your `.bashrc` file. Then run `conda init` to initialize Conda.
We also use several Python modules, listed in `requirements.txt`. You can follow the instructions below to set up the Conda environments and install the modules in each of them.

Install the nltk stopwords used in the markdown cell extraction by running `python -c "import nltk; nltk.download('stopwords')"`.
```bash
# Python 2.7 environment (also needs astunparse)
conda create -n dsm27 python=2.7 -y
conda activate dsm27
pip install --upgrade pip
pip install -r requirements.txt
pip install astunparse

# Python 3.5 environment
conda create -n dsm35 python=3.5 -y
conda activate dsm35
pip install --upgrade pip
pip install -r requirements.txt

# Python 3.8 environment (primary interpreter)
conda create -n dsm38 python=3.8 -y
conda activate dsm38
pip install --upgrade pip
pip install -r requirements.txt
```
To run the project, run scripts s1, s2, and s3, then the aggregation notebooks ag1 and ag2, and finally each analysis notebook.
To run the tests, point pytest at a file or directory, e.g. `pytest file.py` or `pytest directory/`. To run the full suite, use `python -m pytest tests`.