# Analyse Imports

This notebook:
- Analyses the most popular imports.
- Categorises each file path in a repo as ml0 (directly imports a DS library), no-ml-dependency (does not directly import a DS library), or test (unit test).

Outputs:
- `proj_labels.csv`

In [1]:
import pandas as pd
import numpy as np
from os.path import join
import pathlib
import altair as alt # Python wrapper for Vega-Lite visualisation grammar

In [2]:
DATA_DIR = "../output/"
NB_OUT = join(DATA_DIR, "notebooks_out")
MERGED_DIR = join(DATA_DIR, "merged")
pathlib.Path(NB_OUT).mkdir(parents=True, exist_ok=True)

In [3]:
proj_imports3 = pd.read_csv(join(MERGED_DIR, "results_imports_python3.csv"))
proj_imports3

Unnamed: 0,repo,path,module_name,import_name,parse_error
0,190000321,190000321/setup.py,setup,codecs,False
1,190000321,190000321/setup.py,setup,os.path,False
2,190000321,190000321/setup.py,setup,setuptools,False
3,190000321,190000321/test/acceptance_tests/from_notebooks...,test_trialwise_decoding,braindecode.datautil.iterators.get_balanced_ba...,False
4,190000321,190000321/test/acceptance_tests/from_notebooks...,test_trialwise_decoding,braindecode.datautil.signal_target.SignalAndTa...,False
...,...,...,...,...,...
1308074,38355280,38355280/lib/core/core.py,,,True
1308075,38347270,38347270/main.py,,,True
1308076,38347270,38347270/.buildozer/android/app/main.py,,,True
1308077,38347270,38347270/.buildozer/android/app/sitecustomize.py,sitecustomize,os.path,False


In [4]:
proj_imports3.nunique()[["repo", "path", "module_name", "import_name"]]

repo             3932
path           236262
module_name    181758
import_name    210302
dtype: int64

In [5]:
len(proj_imports3[proj_imports3["parse_error"]])

17591

In [6]:
len(proj_imports3[pd.isnull(proj_imports3["module_name"])])

17591

## Identify Popular Libraries/Frameworks

To make it easier to identify popular imports (libraries/frameworks), we define "import_short_name" as the top-level name of an import, e.g. "sklearn.model_selection.train_test_split" => "sklearn". Rows without an import get an empty string.
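As a minimal sketch of this mapping, the top-level name can also be derived with pandas' vectorised string methods. The sample values and variable names below are made up for illustration and are not taken from the dataset:

```python
import pandas as pd

# Hypothetical sample of import names; None stands for a row with no import.
sample_imports = pd.Series(
    ["sklearn.model_selection.train_test_split", "os.path", "numpy", None]
)

# Keep only the text before the first dot; missing values become "".
short_names = sample_imports.str.split(".").str[0].fillna("")
print(short_names.tolist())
```

This should give the same result as the `apply` with a lambda used in the notebook, while leaving the handling of missing values to pandas.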

In [7]:
proj_imports_all = proj_imports3

In [8]:
proj_imports_all["import_short_name"] = proj_imports_all["import_name"].apply(lambda x: x.split(".")[0] if not pd.isnull(x) else "")

In [9]:
proj_imports_all

Unnamed: 0,repo,path,module_name,import_name,parse_error,import_short_name
0,190000321,190000321/setup.py,setup,codecs,False,codecs
1,190000321,190000321/setup.py,setup,os.path,False,os
2,190000321,190000321/setup.py,setup,setuptools,False,setuptools
3,190000321,190000321/test/acceptance_tests/from_notebooks...,test_trialwise_decoding,braindecode.datautil.iterators.get_balanced_ba...,False,braindecode
4,190000321,190000321/test/acceptance_tests/from_notebooks...,test_trialwise_decoding,braindecode.datautil.signal_target.SignalAndTa...,False,braindecode
...,...,...,...,...,...,...
1308074,38355280,38355280/lib/core/core.py,,,True,
1308075,38347270,38347270/main.py,,,True,
1308076,38347270,38347270/.buildozer/android/app/main.py,,,True,
1308077,38347270,38347270/.buildozer/android/app/sitecustomize.py,sitecustomize,os.path,False,os


In [10]:
# popularity of imports (based on number of *repos* that use them)
popular_imports = proj_imports_all.groupby(["import_short_name"])["repo"].nunique().sort_values(ascending=False)

In [11]:
popular_imports

import_short_name
os                    3097
                      2964
sys                   2689
numpy                 2029
time                  1981
                      ... 
multi_armed_bandit       1
multi                    1
mult_time                1
mulpyplexer              1
zzz                      1
Name: repo, Length: 16138, dtype: int64

In [12]:
popular_imports.to_csv(join(NB_OUT, "popular.csv"), header=True)

Note: The high-ranking entry for '' (empty string) stands for rows with no import (e.g. a file with no imports, or a file that could not be parsed). It can be ignored.
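If the empty-string entry is unwanted in downstream tables or charts, it can be dropped before further processing. A minimal sketch on a toy series follows; the counts are copied from the output above, and the variable names are illustrative only:

```python
import pandas as pd

# Toy stand-in for popular_imports, indexed by import_short_name.
toy_popular = pd.Series({"os": 3097, "": 2964, "sys": 2689, "numpy": 2029}, name="repo")

# Drop the '' entry; errors="ignore" keeps this safe if the entry is absent.
toy_popular_named = toy_popular.drop("", errors="ignore")
print(toy_popular_named)
```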

## Annotated Libs

In [13]:
# https://github.com/boalang/MSR19-DataShowcase/blob/master/info.txt

ml_libs = [
    "theano",
    #"pytroch", # boa contains a typo, we manually correct this
    "pytorch", # caveat: PyTorch is imported as "torch", so this entry cannot match an import_short_name
    "caffe", "keras", "tensorflow", "sklearn",
    "numpy", "scipy", "pandas", "statsmodels",
    "matplotlib", "seaborn", "plotly", "bokeh", "pydot",
    "xgboost", "catboost", "lightgbm", "eli5",
    "elephas", "spark", "nltk", "cntk", "scrapy", "gensim",
    "pybrain", "lightning", "spacy", "pylearn2",
    "nupic", "pattern", "imblearn", "pyenv"
]

In [14]:
popular_imports_df = pd.DataFrame({"lib": popular_imports.index, "cnt": popular_imports.values})
popular_imports_df["lib"] = popular_imports_df["lib"].replace('', 'None')
popular_imports_df

Unnamed: 0,lib,cnt
0,os,3097
1,,2964
2,sys,2689
3,numpy,2029
4,time,1981
...,...,...
16133,multi_armed_bandit,1
16134,multi,1
16135,mult_time,1
16136,mulpyplexer,1


In [15]:
popular_imports_df_ml = popular_imports_df[popular_imports_df["lib"].isin(ml_libs)]
popular_imports_df_ml

Unnamed: 0,lib,cnt
3,numpy,2029
16,tensorflow,941
20,matplotlib,871
22,scipy,813
33,sklearn,602
39,pandas,531
53,keras,427
97,theano,174
98,nltk,173
160,caffe,80


In [16]:
num_repos = proj_imports_all["repo"].nunique()
num_repos

3932

Note that in order to keep the figure size reasonable, we only show ML libs, rather than all imports:

In [17]:
chart = alt.Chart(popular_imports_df_ml).mark_bar().encode(
    x = alt.X('cnt', type='quantitative', title="Number of Python repos (out of %s) that import Lib" % num_repos, scale=alt.Scale(domain=(0,int(num_repos)))),
    y = alt.Y('lib', type='nominal', title="Lib (only ML libs shown)", sort=alt.EncodingSortField(order="descending")),
)
chart

In [18]:
#chart.save("imports.svg")

## Which ML libraries are used together in projects?

Some projects use a combination of libraries. Inspired by https://blog.bitergia.com/2018/04/02/a-preliminary-analysis-on-the-use-of-python-notebooks/, we examine which libraries are used together. (For now we analyse only combinations of ML libraries, since including all imports produces too many combinations to make sense of.)

In [19]:
import_sets = []
proj_imports_grouped = proj_imports_all.groupby("repo")
for repo,repo_df in proj_imports_grouped:
    libs_set = tuple(sorted(set(repo_df["import_short_name"]) & set(ml_libs))) # empty tuple if none
    import_sets.append([repo, libs_set])

import_sets_df = pd.DataFrame.from_records(import_sets, columns=["repo", "importset"])

The ML-related libraries used by each repo:

In [20]:
import_sets_df

Unnamed: 0,repo,importset
0,118130,"(matplotlib, nltk, numpy, scipy, sklearn, tens..."
1,192904,"(matplotlib, numpy, pandas, scipy, theano)"
2,329033,()
3,379988,"(numpy, scipy)"
4,462713,"(numpy, scipy, theano)"
...,...,...
3927,220350524,()
3928,222271895,()
3929,234515221,"(numpy,)"
3930,236706700,"(numpy,)"


Which combinations are most popular (note that a single import anywhere in the repo will count as usage):

In [21]:
import_sets_df.groupby("importset").count().rename(columns={"repo": "num_repos"}).sort_values("num_repos", ascending=False)

importset,num_repos
(),1722
"(numpy, tensorflow)",194
"(numpy,)",169
"(numpy, scipy, tensorflow)",84
"(matplotlib, numpy)",77
...,...
"(lightning, matplotlib, numpy, pandas, scipy, seaborn, sklearn)",1
"(lightning, matplotlib, numpy, scipy, sklearn)",1
"(cntk, numpy, pandas, tensorflow)",1
"(matplotlib, nltk)",1


The above shows that many repos mix ML libraries/frameworks.
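Exact combinations fragment quickly, so a complementary view is to count how often each pair of libraries appears in the same repo. The sketch below does this on a handful of made-up import sets written in the same sorted-tuple form as the "importset" column; the data and names are illustrative assumptions, not results from this analysis:

```python
from collections import Counter
from itertools import combinations

# Hypothetical per-repo import sets, in the same sorted-tuple form as "importset".
toy_import_sets = [
    ("numpy", "tensorflow"),
    ("numpy", "scipy", "tensorflow"),
    ("numpy",),
    (),
    ("matplotlib", "numpy"),
]

# Count every unordered pair of libraries that appears in the same repo.
pair_counts = Counter()
for libs in toy_import_sets:
    pair_counts.update(combinations(libs, 2))

for pair, count in pair_counts.most_common(3):
    print(pair, count)
```

Because each tuple is already sorted, every pair is emitted in a consistent order and needs no further normalisation.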

A limitation of this analysis is that it does not distinguish between a library used in example code and one used in a core part of the project. For example, SpaCy contains example code in `21467110/examples/` that imports keras and tensorflow, but these are not really part of SpaCy.
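One possible mitigation, which this notebook does not apply, is to skip files that sit under an example directory before computing the import sets. The sketch below assumes such directories are named `examples` or `example`; both that list and the helper `is_example_path` are hypothetical, as are the file names in the usage lines:

```python
from pathlib import PurePosixPath

# Assumed names of directories that hold example code (a guess, not part of the original analysis).
EXAMPLE_DIRS = {"examples", "example"}

def is_example_path(path):
    """Return True if a directory component below the repo id is an example directory."""
    parts = PurePosixPath(path).parts
    # parts[0] is the repo id and parts[-1] is the file name, so inspect only what lies between.
    return any(part.lower() in EXAMPLE_DIRS for part in parts[1:-1])

# Hypothetical file names under the repo id mentioned above.
print(is_example_path("21467110/examples/some_script.py"))
print(is_example_path("21467110/spacy/some_module.py"))
```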

## Identification of test code

One way to detect test code is to search the file path and the import name for the substring "test".

In [22]:
proj_imports_test = proj_imports_all[proj_imports_all["path"].str.contains("test") | proj_imports_all["import_name"].str.contains("test")]

In [23]:
proj_imports_test

Unnamed: 0,repo,path,module_name,import_name,parse_error,import_short_name
3,190000321,190000321/test/acceptance_tests/from_notebooks...,test_trialwise_decoding,braindecode.datautil.iterators.get_balanced_ba...,False,braindecode
4,190000321,190000321/test/acceptance_tests/from_notebooks...,test_trialwise_decoding,braindecode.datautil.signal_target.SignalAndTa...,False,braindecode
5,190000321,190000321/test/acceptance_tests/from_notebooks...,test_trialwise_decoding,braindecode.models.shallow_fbcsp.ShallowFBCSPNet,False,braindecode
6,190000321,190000321/test/acceptance_tests/from_notebooks...,test_trialwise_decoding,braindecode.torch_ext.util.np_to_var,False,braindecode
7,190000321,190000321/test/acceptance_tests/from_notebooks...,test_trialwise_decoding,braindecode.torch_ext.util.set_random_seeds,False,braindecode
...,...,...,...,...,...,...
1307868,38386331,38386331/testing/test_version.py,test_version,pytest,False,pytest
1307869,38386331,38386331/testing/test_version.py,test_version,setuptools_scm.config.Configuration,False,setuptools_scm
1307870,38386331,38386331/testing/test_version.py,test_version,setuptools_scm.version.meta,False,setuptools_scm
1307871,38386331,38386331/testing/test_version.py,test_version,setuptools_scm.version.simplified_semver_version,False,setuptools_scm


In [24]:
pd.Series(proj_imports_test["path"].unique())

0        190000321/test/acceptance_tests/from_notebooks...
1        190000321/test/acceptance_tests/from_notebooks...
2        190000321/test/acceptance_tests/from_notebooks...
3        190000321/test/unit_tests/datautil/test_trial_...
4         160251929/tensorflow_ranking/python/data_test.py
                               ...                        
61229                 38386331/testing/test_file_finder.py
61230                   38386331/testing/test_functions.py
61231                         38386331/testing/conftest.py
61232                   38386331/testing/test_basic_api.py
61233                     38386331/testing/test_version.py
Length: 61234, dtype: object

However, this approach may also flag code that relates to model testing (in the ML sense of a train/test split) in addition to unit tests (in the software engineering sense).
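The over-matching can be illustrated with a few made-up paths. In the sketch below, the paths and the hand-assigned `is_software_test` labels are hypothetical; the substring check mirrors the one applied to the "path" column above:

```python
import pandas as pd

# Hypothetical paths with hand-assigned labels: two software tests, two ML train/test files.
toy = pd.DataFrame({
    "path": [
        "1/tests/test_utils.py",
        "1/pkg/data_test.py",
        "1/model/train_test_split.py",
        "1/scripts/test_set_loader.py",
    ],
    "is_software_test": [True, True, False, False],
})

# Plain substring match on "test".
toy["flagged"] = toy["path"].str.contains("test", regex=False)

# Files that are flagged although they are not software tests.
false_positives = toy[toy["flagged"] & ~toy["is_software_test"]]
print(false_positives["path"].tolist())
```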

A more conservative approach, which aims to detect only software tests, is to flag code that imports a unit testing framework. Initially, we searched only for code that imported the Python `unittest` module, but this failed to detect the testing code in SpaCy, which uses `pytest` instead.

A list of common Python unittesting frameworks was taken from:
https://docs.python-guide.org/writing/tests/

In [25]:
test_frameworks = ["unittest", "pytest", "unittest2", "mock"]

In [26]:
unittests = proj_imports_all[proj_imports_all["import_short_name"].isin(test_frameworks)]
unittests

Unnamed: 0,repo,path,module_name,import_name,parse_error,import_short_name
63,190000321,190000321/test/unit_tests/datautil/test_trial_...,test_trial_segment,pytest,False,pytest
436,159175746,159175746/tests/utils/test_metrics.py,test_metrics,unittest,False,unittest
440,159175746,159175746/tests/utils/test_utils.py,test_utils,unittest,False,unittest
441,159175746,159175746/tests/utils/test_utils.py,test_utils,unittest.mock,False,unittest
445,159175746,159175746/tests/utils/test_datahandler.py,test_datahandler,unittest,False,unittest
...,...,...,...,...,...,...
1307837,38386331,38386331/testing/test_file_finder.py,test_file_finder,pytest,False,pytest
1307841,38386331,38386331/testing/test_functions.py,test_functions,pytest,False,pytest
1307855,38386331,38386331/testing/conftest.py,conftest,pytest,False,pytest
1307862,38386331,38386331/testing/test_basic_api.py,test_basic_api,pytest,False,pytest


In [27]:
paths_unittests = unittests.path.unique()
pd.Series(paths_unittests)

0        190000321/test/unit_tests/datautil/test_trial_...
1                    159175746/tests/utils/test_metrics.py
2                      159175746/tests/utils/test_utils.py
3                159175746/tests/utils/test_datahandler.py
4             159175746/tests/utils/test_trainer_helper.py
                               ...                        
26505                 38386331/testing/test_file_finder.py
26506                   38386331/testing/test_functions.py
26507                         38386331/testing/conftest.py
26508                   38386331/testing/test_basic_api.py
26509                     38386331/testing/test_version.py
Length: 26510, dtype: object

Importing a unit testing framework is a strong indication that the code is a software test. However, this check cannot identify all tests.

In particular, the `pytest` framework only requires test code to follow naming conventions and use the `assert` statement. This means some unit tests do not import `pytest` (or any other unit testing framework) at all and so go undetected (e.g. `21467110/spacy/tests/regression/test_issue3531.py`).
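A name-based check could recover some of these files. The sketch below matches file names against pytest's default discovery patterns (`test_*.py` and `*_test.py`); projects can override these defaults in their pytest configuration, so this is an approximation rather than a definitive detector. The helper name `looks_like_pytest_file` is hypothetical, and the notebook itself does not apply this check:

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath

# pytest's default file-name patterns for test discovery.
PYTEST_FILE_PATTERNS = ("test_*.py", "*_test.py")

def looks_like_pytest_file(path):
    """Return True if the file name matches pytest's default discovery patterns."""
    file_name = PurePosixPath(path).name
    return any(fnmatch(file_name, pattern) for pattern in PYTEST_FILE_PATTERNS)

# Both paths appear elsewhere in this notebook.
print(looks_like_pytest_file("21467110/spacy/tests/regression/test_issue3531.py"))
print(looks_like_pytest_file("38386331/testing/conftest.py"))
```

Matching on names trades precision for recall: it would catch the SpaCy regression test, but it can also flag ML files whose names happen to start with `test_`.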

## Categorisation of each Python code file

In [28]:
ml0 = proj_imports_all[proj_imports_all["import_short_name"].isin(ml_libs)].path.unique()

In [29]:
ml0

array(['190000321/test/acceptance_tests/from_notebooks/test_trialwise_decoding.py',
       '190000321/test/acceptance_tests/from_notebooks/test_cropped_decoding.py',
       '190000321/test/acceptance_tests/from_notebooks/test_experiment_class.py',
       ..., '38377985/tessellate_numpy.py',
       '38377985/colors_groups_exchanger.py',
       '38377985/numba_functions.py'], dtype=object)

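We categorise each Python file as exactly one of the following. The labels are applied in the order no-ml-dependency, ML0, test, so a file that imports both an ML library and a unit testing framework ends up labelled as a test: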
- (unit) test
- ML0
- no-ml-dependency
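The precedence between the three labels can be sketched on toy data with `np.select`, which picks the first matching condition. The column names `imports_ml_lib` and `imports_test_framework` and the paths are hypothetical stand-ins for the membership checks used in this notebook:

```python
import numpy as np
import pandas as pd

# Toy files with hypothetical flags for the two kinds of import.
toy_labels = pd.DataFrame({
    "path": ["a/setup.py", "a/model.py", "a/tests/test_model.py"],
    "imports_ml_lib": [False, True, True],
    "imports_test_framework": [False, False, True],
})

# Listing the test condition first gives "test" priority over "ml0".
toy_labels["cat"] = np.select(
    [toy_labels["imports_test_framework"], toy_labels["imports_ml_lib"]],
    ["test", "ml0"],
    default="no-ml-dependency",
)
print(toy_labels[["path", "cat"]])
```

The notebook reaches the same precedence by assigning the labels one after another, with the test label written last.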

In [30]:
test = paths_unittests

In [31]:
proj_labels = proj_imports_all[["repo", "path"]].copy()
proj_labels = proj_labels.drop_duplicates()

In [32]:
proj_labels["cat"] = "no-ml-dependency"

In [33]:
proj_labels.loc[proj_labels["path"].isin(ml0), "cat"] = "ml0"

In [34]:
proj_labels.loc[proj_labels["path"].isin(test), "cat"] = "test"

In [35]:
proj_labels

Unnamed: 0,repo,path,cat
0,190000321,190000321/setup.py,no-ml-dependency
3,190000321,190000321/test/acceptance_tests/from_notebooks...,ml0
16,190000321,190000321/test/acceptance_tests/from_notebooks...,ml0
32,190000321,190000321/test/acceptance_tests/from_notebooks...,ml0
57,190000321,190000321/test/unit_tests/datautil/test_trial_...,test
...,...,...,...
1308073,38355280,38355280/lib/core/exceptions.py,no-ml-dependency
1308074,38355280,38355280/lib/core/core.py,no-ml-dependency
1308075,38347270,38347270/main.py,no-ml-dependency
1308076,38347270,38347270/.buildozer/android/app/main.py,no-ml-dependency


In [36]:
proj_labels["cat"].value_counts()

no-ml-dependency    158496
ml0                  51256
test                 26510
Name: cat, dtype: int64

In [37]:
proj_labels.to_csv(join(NB_OUT, "proj_labels.csv"), index=False)