# Odyssey Tutorial

Welcome to Odyssey! Odyssey is a Python package that analyzes Python library usage on GitHub through Google BigQuery. This tutorial gives you a high-level idea of how to use it. Let's begin!

## Part 1: Work with the GithubPython object

We start by introducing a central piece of Odyssey: the GithubPython object. This is the object that connects to GitHub data using BigQuery. It takes care of the BigQuery connection, SQL query building, result polling, etc. for you.

Let's start by creating a default GithubPython object. Because we don't specify any package, the information we get covers all data in the BigQuery GitHub database.

In [1]:
from odyssey.core.bigquery.GithubPython import GithubPython
gp = GithubPython()

Let's see how many Python files are in our BigQuery GitHub database.

You may think: "Wow, that's far fewer than I expected. Does it mean there are only ~5.9 million Python files on GitHub?" The answer is no. The main reason is that Google BigQuery only has access to open-source repos on GitHub (those that carry certain licenses), so it covers just a small subset of GitHub as a whole.

That's why, if you search for *.py files using the GitHub web GUI, the number you get won't be comparable to the number you get here.

In [2]:
print(gp.get_count())

5995653


Now let's create another GithubPython object, but this time, specify that the package we are interested in is sklearn.

Also, Odyssey allows you to exclude forks of the package, by explicitly providing a list of keywords that shouldn't appear in the repo name or file path. In this case, scikit-learn is the one we should avoid counting.

In [3]:
gp_sklearn = GithubPython(package="sklearn", exclude_forks=["scikit-learn"])
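To make the exclusion rule concrete, here is a minimal sketch of keyword filtering on repo names and file paths. It assumes each entry is a plain `(repo_name, file_path)` pair; the sample entries and the helper `is_excluded` are made up for illustration and are not part of Odyssey, which may apply the rule differently (for example inside the SQL query).

```python
# Hypothetical (repo_name, file_path) pairs; these are not real query results.
entries = [
    ("alice/ml-demo", "models/train.py"),
    ("bob/scikit-learn", "sklearn/tree/tree.py"),
    ("carol/tools", "vendor/scikit-learn/sklearn/base.py"),
]


def is_excluded(repo_name, file_path, keywords):
    """Return True if any keyword occurs in the repo name or the file path."""
    return any(kw in repo_name or kw in file_path for kw in keywords)


# Keep only entries whose repo name and path are free of the keyword.
kept = [e for e in entries if not is_excluded(e[0], e[1], ["scikit-learn"])]
print(kept)  # [('alice/ml-demo', 'models/train.py')]
```

Note that the third entry is dropped because of its path, even though the repo name itself looks unrelated. That is how vendored copies of a package can be caught.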

Now let's count how many files contain "sklearn". **Caveat: this is simple string matching, so a file is counted even if sklearn only appears in a comment or as a variable name!**

In [4]:
print(gp_sklearn.get_count())

37262
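To see why the caveat matters, the sketch below contrasts plain substring matching with a check for real import statements using Python's standard `ast` module. Both helper functions are hypothetical and only illustrate the difference; Odyssey's count above is based on string matching, as stated. Also note that `ast.parse` only accepts code that is valid for the running Python version, so this check would fail on Python 2 sources.

```python
import ast


def mentions(source, word):
    """Plain substring match: True if the word appears anywhere in the text."""
    return word in source


def imports_package(source, package):
    """True only if the source has an import statement for the package."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            if any(alias.name.split(".")[0] == package for alias in node.names):
                return True
        elif isinstance(node, ast.ImportFrom):
            # level == 0 means an absolute import (not "from . import x").
            if node.level == 0 and node.module and node.module.split(".")[0] == package:
                return True
    return False


comment_only = "# TODO: try sklearn later\nx = 1\n"
real_import = "from sklearn.svm import SVC\n"

print(mentions(comment_only, "sklearn"), imports_package(comment_only, "sklearn"))  # True False
print(mentions(real_import, "sklearn"), imports_package(real_import, "sklearn"))    # True True
```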


If you want to see exactly which 37262 files contain the word "sklearn", you can use get_all() to retrieve all the entries. The result is a list of BigQueryGithubEntry objects, a wrapper that provides handy utility functions such as get_url().

In [5]:
data = gp_sklearn.get_all()

In [6]:
print(type(data[0]))

<class 'odyssey.core.bigquery.BigQueryGithubEntry.BigQueryGithubEntry'>


In [7]:
from odyssey.utils.output import pprint_ipynb

In [8]:
pprint_ipynb(data[0])

In [9]:
data[0].get_url() # a link to the file on GitHub

'https://github.com/haradatm/nlp/tree/master/classify/train_spp3-w2v.py'

## Part 2: Use filter to refine search

Sometimes we are interested in searching for code snippets that use a specific class. In other cases, the criteria are a little more complicated, such as having both function "X" and function "Y" in one file, or having "Z" alone. Filters are built to support those needs. Let's now use them to refine the search!

In [10]:
from odyssey.core.bigquery.filter import Contains, And, Or

In [11]:
# Let's define a filter that asks for either RandomForestClassifier or RandomForestRegressor
rf_classifier_or_regressor = Or(Contains('RandomForestRegressor'), Contains('RandomForestClassifier'))

In [12]:
# Then another filter that asks for an occurrence of SVC
svc = Contains('SVC')

In [13]:
# Connect the two using And,
# so we are interested in files that contain both SVC and at least one of the
# two RandomForest models (RandomForestClassifier or RandomForestRegressor).
f = And(rf_classifier_or_regressor, svc)
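If you are curious how such composable filters can work, here is a toy sketch that evaluates the same logic on in-memory strings. The classes `ToyContains`, `ToyAnd`, and `ToyOr` are hypothetical stand-ins written for this example; they are not Odyssey's implementation, whose filters are handed to the BigQuery search rather than run on local strings.

```python
class ToyContains:
    """Matches when the given text occurs anywhere in the source."""

    def __init__(self, text):
        self.text = text

    def matches(self, source):
        return self.text in source


class ToyAnd:
    """Matches when every sub-filter matches."""

    def __init__(self, *filters):
        self.filters = filters

    def matches(self, source):
        return all(f.matches(source) for f in self.filters)


class ToyOr:
    """Matches when at least one sub-filter matches."""

    def __init__(self, *filters):
        self.filters = filters

    def matches(self, source):
        return any(f.matches(source) for f in self.filters)


# Same shape as the filter built above: (Regressor OR Classifier) AND SVC.
toy_filter = ToyAnd(
    ToyOr(ToyContains("RandomForestRegressor"), ToyContains("RandomForestClassifier")),
    ToyContains("SVC"),
)

print(toy_filter.matches("clf = RandomForestClassifier(); svm = SVC()"))  # True
print(toy_filter.matches("svm = SVC()"))                                   # False
```

Because every filter exposes the same `matches` method, filters can be nested to any depth without the caller needing to know what is inside.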

In [14]:
rf_and_svc = gp_sklearn.get_all(f)

In [15]:
print(len(rf_and_svc))

1070


In [16]:
# Verify the occurrence. We indeed have both!
pprint_ipynb(rf_and_svc[0])

## Part 3: Repos with top imports

One common question Python library writers (or even users) ask is: who is using this library? Odyssey supports querying the repos that import your package of interest the most. You can get the answer in one line!

**Note: The first time running this will be very slow!**

In [17]:
top20_imports = gp_sklearn.get_top_import_repo(n=20) # top imports by file count

0
1000
2000
... (output continues in steps of 1000) ...
36000
37000


In [18]:
print(top20_imports)

[('ngoix/OCRF', 291), ('automl/auto-sklearn', 195), ('hmendozap/auto-sklearn', 186), ('florian-f/sklearn', 146), ('seckcoder/lang-learn', 141), ('GbalsaC/bitnamiP', 119), ('automl/paramsklearn', 100), ('chaluemwut/fbserver', 99), ('magic2du/contact_matrix', 96), ('nok/sklearn-porter', 95), ('jpzk/evopy', 87), ('B3AU/waveTree', 77), ('sinhrks/expandas', 64), ('chkoar/imbalanced-learn', 61), ('liyu1990/sklearn', 61), ('KennyCandy/HAR', 54), ('sinhrks/pandas-ml', 54), ('RecipeML/Recipe', 52), ('dvro/imbalanced-learn', 51), ('Tjorriemorrie/trading', 51)]


In [22]:
# Verify that the count matches
print(len(top20_imports)) # 20

20
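The ranking "by file count" can be pictured with a few lines of standard-library Python. This is only a sketch under the assumption that we already have one repo name per matching file; the repo names below are made up, and `top_repos` is a hypothetical helper, not an Odyssey function.

```python
from collections import Counter

# Hypothetical data: one repo name per file that imports the package.
file_repos = ["a/x", "b/y", "a/x", "c/z", "a/x", "b/y"]


def top_repos(repo_names, n):
    """Return the n repos with the most matching files as (repo, count) pairs."""
    return Counter(repo_names).most_common(n)


print(top_repos(file_repos, 2))  # [('a/x', 3), ('b/y', 2)]
```

The result has the same shape as the list printed above: `(repo, file_count)` tuples sorted from most to least files.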


## Part 4: Most imported class/submodule/function

Another common question is how often a certain class/submodule/function is imported. Odyssey can answer that too.

In [23]:
top20_models = gp_sklearn.get_most_imported_class(n=20)

In [24]:
print(top20_models)

[('RandomForestClassifier', 2534), ('LogisticRegression', 2152), ('SVC', 1998), ('StandardScaler', 1783), ('PCA', 1732), ('Pipeline', 1519), ('GridSearchCV', 1511), ('KMeans', 1451), ('TfidfVectorizer', 1314), ('CountVectorizer', 1294), ('KNeighborsClassifier', 1188), ('LinearSVC', 1116), ('DecisionTreeClassifier', 1047), ('LinearRegression', 861), ('GaussianNB', 817), ('LabelEncoder', 728), ('MultinomialNB', 723), ('RandomForestRegressor', 681), ('AdaBoostClassifier', 673), ('SGDClassifier', 642)]
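For intuition, here is one way such a tally could be computed on local source strings with the standard `ast` module. It is a sketch, not Odyssey's method: the helper `imported_names` is hypothetical, the three sample files are made up, and only `from package... import Name` statements are counted.

```python
import ast
from collections import Counter


def imported_names(source, package):
    """List the names imported via 'from <package>... import <name>'."""
    names = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.ImportFrom) and node.level == 0
                and node.module and node.module.split(".")[0] == package):
            names.extend(alias.name for alias in node.names)
    return names


# Made-up file contents for illustration.
files = [
    "from sklearn.ensemble import RandomForestClassifier\nfrom sklearn.svm import SVC\n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "import numpy as np\n",
]

counts = Counter(name for src in files for name in imported_names(src, "sklearn"))
print(counts.most_common(2))  # [('RandomForestClassifier', 2), ('SVC', 1)]
```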


To see the entries themselves, call get_import_source().

In [25]:
sources = gp_sklearn.get_import_source("RandomForestClassifier")

In [26]:
pprint_ipynb(sources[0])

## Part 5: Instantiation

For classes, Odyssey can give you insight into how they are instantiated, which argument values people pass, and so on.

**Note: All the arguments in the returned dictionary are in string format (even for integer values). This may be changed later.**
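Until that changes, you can convert the string values yourself. The sketch below uses `ast.literal_eval` from the standard library and falls back to the original string when the text is not a plain literal (for example a variable name). The helper `parse_value` is hypothetical, and the `raw` dictionary is made-up data that only imitates the argument-to-value-count layout of the result.

```python
import ast


def parse_value(text):
    """Turn a string such as '100' or 'True' into a Python value when possible."""
    if text is None:
        return None
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError):
        # Not a literal (e.g. a variable name or an expression): keep the string.
        return text


# Made-up counts: argument name -> {value as string: number of occurrences}.
raw = {"n_estimators": {"100": 12, "n_trees": 3}, "n_jobs": {"-1": 7}}

parsed = {
    arg: [(parse_value(value), count) for value, count in values.items()]
    for arg, values in raw.items()
}
print(parsed)  # {'n_estimators': [(100, 12), ('n_trees', 3)], 'n_jobs': [(-1, 7)]}
```

The values are collected as `(value, count)` pairs rather than dictionary keys, because a parsed literal such as a list would not be hashable.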

In [27]:
rfc_instantiation = gp_sklearn.get_instantiation("RandomForestClassifier")

In [28]:
rfc_instantiation

defaultdict(<function odyssey.core.analyzer.InstantiationAnalyzer.InstantiationAnalyzer.__init__.<locals>.<lambda>>,
            {'*': defaultdict(int, {None: 4}),
             '**': defaultdict(int, {None: 24}),
             '**args': defaultdict(int, {None: 2}),
             '**classi_params': defaultdict(int, {None: 1}),
             '**classif_base.get_params()': defaultdict(int, {None: 1}),
             '**classifier_pram_dic[rf_name]': defaultdict(int, {None: 1}),
             "**clf.get('config')": defaultdict(int, {None: 2}),
             '**clf_args_': defaultdict(int, {None: 2}),
             '**clf_params': defaultdict(int, {None: 1}),
             '**cls_kwargs': defaultdict(int, {None: 1}),
             '**config_clf': defaultdict(int, {None: 1}),
             '**estimator_params': defaultdict(int, {None: 4}),
             '**forest_parms': defaultdict(int, {None: 2}),
             '**gs.best_params_': defaultdict(int, {None: 1}),
             '**job': defaultdict(int, {No