DS Mining is a project that mines data from public Data Science repositories to identify patterns and behaviors. It is developed in Python 3.8, uses SQLAlchemy to manage a SQLite database, and uses pytest as its testing framework.
The project consists of four steps that go from data collection to analysis.
The corpus of the project was built by searching GitHub's GraphQL API for the terms "Data Science", "Ciência de Dados", "Ciencia de los Datos", and "Science des Données" (i.e., "Data Science" in Portuguese, Spanish, and French). After the collection, we established the following requirements a repository must meet to be analyzed:
- At least 1 language, 1 commit and 1 contributor
- Is not a course project
Repositories that did not meet these requirements were discarded in step 2, the filtering.
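For illustration, here is a minimal sketch of how such a search can be issued against GitHub's GraphQL API with `requests`. The query string, pagination, and token handling are assumptions for the sketch, not the exact query used by the collection script (s1_collect.py, below):

```python
import os
import requests

GITHUB_GRAPHQL = "https://api.github.com/graphql"

# Hypothetical, simplified version of the search performed by s1_collect.py.
QUERY = """
query ($search: String!) {
  search(query: $search, type: REPOSITORY, first: 100) {
    nodes {
      ... on Repository { nameWithOwner primaryLanguage { name } }
    }
  }
}
"""

def search_repositories(term, token):
    response = requests.post(
        GITHUB_GRAPHQL,
        json={"query": QUERY, "variables": {"search": term}},
        headers={"Authorization": f"bearer {token}"},
    )
    response.raise_for_status()
    return response.json()["data"]["search"]["nodes"]

token = os.environ["GITHUB_TOKEN"]  # a personal access token is required
for term in ["Data Science", "Ciência de Dados", "Ciencia de los Datos", "Science des Données"]:
    print(term, len(search_repositories(term, token)))
```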
Script | Description | Input Table | Output Table |
---|---|---|---|
s1_collect.py | Queries projects' metadata from the GitHub API | None | Queries |
s2_filter.ipynb | Filters and selects repositories for further extractions | Queries | Repositories |
s3_extract.py | Extracts data from selected repositories | Repositories | Commits, Notebooks, Cells, Python Files, Requirement Files, and other tables derived from them |
The extraction is broken down into the following scripts:

Script | Description | Input Table | Output Table |
---|---|---|---|
e1_download.py | Downloads selected repositories from GitHub | Repositories | Repositories |
e2_notebooks_and_cells.py | Extracts Notebooks and Cells from repositories | Repositories | Notebooks, Cells |
e3_python_files.py | Extracts Python Files from repositories | Repositories | Python Files |
e4_requirement_files.py | Extracts Requirement Files from repositories | Repositories | Requirement Files |
e5_markdown_cells.py | Extracts features from markdown cells | Cells with type "markdown" | Cell Markdown Features |
e6_code_cells.py | Extracts features from code cells | Cells with type "code" | Cell Modules, Cell Data IOs |
e7_python_features.py | Extracts features from Python Files | Python Files | Python Modules, Python Data IOs |
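As a hint of what the feature extraction does, here is a hedged sketch of pulling imported modules out of a code cell or Python file with the standard `ast` module. `extract_modules` is a hypothetical helper; the real e6/e7 extractors are more involved:

```python
import ast

def extract_modules(source):
    """Return the names of modules imported by a piece of Python source."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            modules.add(node.module)
    return sorted(modules)

print(extract_modules("import pandas as pd\nfrom sklearn.linear_model import LinearRegression"))
# ['pandas', 'sklearn.linear_model']
```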
After extraction, two notebooks aggregate the data:

Script | Description | Input Table | Output Table |
---|---|---|---|
ag1_notebook_aggregate.ipynb | Aggregates some of the data related to Notebooks and their Cells for easier analysis | Cell Markdown Features, Cell Modules, Cell Data IOs | Notebook Markdowns, Modules, Data IOs |
ag2_python_aggregate.ipynb | Aggregates some of the data related to Python Files for easier analysis | Python Modules, Python Data IOs | Modules, Data IOs |
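Conceptually, the aggregation rolls cell-level rows up to one row per notebook. A minimal sketch of the idea with pandas, using illustrative column names rather than the actual schema:

```python
import pandas as pd

# Illustrative cell-level module usage; the real rows come from the database.
cell_modules = pd.DataFrame({
    "notebook_id": [1, 1, 2],
    "module": ["pandas", "numpy", "pandas"],
})

# Roll up to one row per notebook with the distinct modules it imports.
notebook_modules = (
    cell_modules.groupby("notebook_id")["module"]
    .apply(lambda modules: sorted(set(modules)))
    .reset_index()
)
print(notebook_modules)
```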
After we extract all the data from the selected repositories, we use Jupyter Notebooks to analyze it, draw conclusions, and generate graphical outputs.
Notebook | Description |
---|---|
a1_collected.ipynb | Analyzes collected repositories' features |
a2_filtered.ipynb | Analyzes language-related features |
a3_selected.ipynb | Analyzes selected repositories' features |
a4_modules.ipynb | Analyzes modules extracted |
a5_code_and_data.ipynb | Analyzes code features and data inputs/outputs |
The resulting database is available here (~28GB).
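Once downloaded, the database can be explored like any other SQLite file. A minimal sketch with SQLAlchemy and pandas; the file name and table name here are assumptions:

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical file name; point this at the downloaded database.
engine = create_engine("sqlite:///dsmining.sqlite")

# Load one of the result tables into pandas for ad hoc analysis.
repositories = pd.read_sql_table("repositories", engine)
print(repositories.head())
```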
The project primarily uses Python 3.8 as its interpreter, but it also needs Python 2.7 and 3.5 to extract features from the Abstract Syntax Trees of code written for those versions (see the sketch below). To manage the multiple interpreters we use Conda; instructions to install it on Linux can be found here.
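The older interpreters are needed because `ast.parse` only accepts the grammar of the interpreter running it, so Python 2 sources have to be parsed under a matching interpreter. A small illustration:

```python
import ast

# A Python 2 print statement: valid under the 2.7 interpreter,
# but a SyntaxError when parsed by Python 3.8's ast module.
try:
    ast.parse('print "hello"')
except SyntaxError as err:
    print("Python 3 cannot parse Python 2 source:", err)
```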
After downloading and installing Conda, you might need to add `export PATH="~/anaconda3/bin":$PATH` to your `.bashrc` file. Then run `conda init` to initialize Conda.
We also use several Python modules, listed in `requirements.txt`. You can follow the instructions below to set up the Conda environments and install the modules in each of them.

Install the nltk stopwords used in the markdown cell extraction by running `python -c "import nltk; nltk.download('stopwords')"`.
```bash
# Python 2.7 environment (also needs astunparse)
conda create -n dsm27 python=2.7 -y
conda activate dsm27
pip install --upgrade pip
pip install -r requirements.txt
pip install astunparse

# Python 3.5 environment
conda create -n dsm35 python=3.5 -y
conda activate dsm35
pip install --upgrade pip
pip install -r requirements.txt

# Python 3.8 environment (primary interpreter)
conda create -n dsm38 python=3.8 -y
conda activate dsm38
pip install --upgrade pip
pip install -r requirements.txt
```
To run the project, run scripts s1, s2, and s3, then the aggregation notebooks ag1 and ag2, and finally each analysis notebook.
To run the tests, point pytest at a file or directory, e.g. `pytest file.py` or `pytest directory/`. To run the full suite, use `python -m pytest tests`.