To open notebook in Colab please click below:
<a href="https://colab.research.google.com/github/bwolfson2/dsclass2022/blob/main/Module_1_Dealing_with_Data/A%20taste%20of%20things%20to%20come%20-%20RMS.ipynb" target="_parent"> <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab" /> </a>

In [None]:
#If opening in colab run this cell
#!git clone https://github.com/bwolfson2/dsclass2022
#%cd dsclass2022/Module_1_Dealing_with_Data/

<span style="font-family: Palatino; font-size: 40px; color: purple">
             A sampling of some data science
</span>


Spring 2022 - Instructors: Roger M. Stein and Ben Wolfson

Teaching Assistant: Ben Wolfson

***

# <span style="font-family: Palatino; font-size: 30px; color: purple"> Set-up and housekeeping</span>

## <span style="font-family: Palatino; font-size: 25px; color: purple">Some general imports</span>

Python has a ton of packages that make doing complicated stuff very easy. We won't discuss how to install packages, or give a detailed list of what packages exist, but we will give a brief description of how they are used. 

An easy way to think of why packages are useful is: "**Python packages give us access to MANY functions**".

Packages contain pre-defined functions that make our lives easier!  We've seen pre-defined functions before: for example, the built-in function `str()` that we used to convert numbers into strings in the Python Basics notebook.

In this class we will use six packages very frequently: `pandas`, `sklearn`, `statsmodels`, `matplotlib`, `seaborn`, and `numpy`:

- **`pandas`** is a data manipulation package. It lets us store data in data frames. More on this soon.
- **`sklearn`** is a machine learning and data science package. It lets us do fairly complicated machine learning tasks, such as running regressions and building classification models with only a few lines of code. (Nice!)
- **`statsmodels.api`** is a statistical and econometrics package that is newer than, but more "statistical" than, `sklearn`. It has two main advantages over `sklearn`: (a) a more intuitive set of reporting and analysis tools that include a number of advanced statistical and econometric techniques for time series, regime-switching and other applications; and (b) a formula-based (vs. matrix/vector) representation of models (as in R), which makes experimenting with data transformations much easier. It is Roger's first choice for anything involving time series, generalized linear models or complex model designs in Python.  You should note that it uses a different set of conventions for model representation (but it is easy to write an `sklearn` wrapper if you prefer - see the `patsy` package). It lets us do fairly complicated statistical and econometric tasks, such as building statistical models, with only a few lines of code. (Nice!)
- **`matplotlib`** lets you make plots and graphs directly from your code.  This can be a secret weapon when combined with notebooks, as you can very easily rerun analyses on different data or with slightly different code, and the graphs can just appear magically.  (Ok, always easier said than done, but you get the idea.)
- **`seaborn`** is an extension to `matplotlib` that really helps make your plots look more appealing.
- **`numpy`** (pronounced num-pie) is used for doing "math stuff", such as complex mathematical operations (e.g., square roots, exponents, logs), operations on matrices, and more. 

As we use these through the semester, their usefulness will become increasingly apparent.

$~$

To make the contents of a package available, you need to `import` it.  It is possible to import all or only a certain subset of a package's contents, depending on what we need.  It is also generally useful to create a shorthand label for packages when we import them (this will save you a lot of typing and make your code easier to read).

In [None]:
# It is usually preferable to arrange imports by broad function
# (e.g., all graphics imports in the same code block, all DS library
#  imports in the same block, etc.), but for expository purposes they are
# organized here by import type.

import os                        # full package imports
import math
import warnings
import sys

import numpy as np               # full package imports with aliases 
import pandas as pd
import copy as cp   
import seaborn as sns

from icecream import ic          # partial package imports (third-party `icecream` must be installed)
import matplotlib.pyplot as plt  # partial package imports with aliases 

# this trick is sometimes required to get plots to display inline with the rest of your notebook,
# not in a separate window
%matplotlib inline

sns.set(style='ticks', palette='Set2')

We can now use package-specific things. For example, `numpy` has a function called `sqrt()` that returns the square root of a number. Since it is part of `numpy`, we need to tell Python where to find it by prefixing it with the package alias and a dot (e.g., `np.sqrt()`).

In the following cell you can also see how to write **comments** in your code. Take my advice: write comments as you go.  It's helpful when you want to collaborate, because you don't have to figure out what you did in order to explain it to your collaborator.  But even more: often you need to come back to an analysis weeks, months, or even years later, and you will thank yourself for explaining what you did!

In [None]:
some_list = [0,0,1,2,3,3,4.5,7.6]
print(some_list)
some_dictionary = {'student1': '(929)-000-0000', 'student2': '(917)-000-0000', 'student3': '(470)-000-0000'}
print(some_dictionary)
some_set = set( [1,2,4,4,5,5] )
print(some_set)


# In this part of the code I am using numpy (np) functions

print ("Square root: " + str ( np.sqrt(25) ))
print ("Maximum element of our previous list: " + str( np.max(some_list) ))

# In this part of the code I am using python functions

print ("Number of elements in our previous list: " + str( len(some_list) ))
print ("Sum of elements in our previous list: " + str( sum(some_list) ))
print ("Range of 5 numbers (remember we start with 0): " + str( list(range(5)) ))

# #Bonus LIST COMPREHENSIONS
# some_list_squared = [i*i for i in some_list]
# #This is the same as:
# some_list_squared_too = []
# for i in some_list:
#     some_list_squared_too.append(i*i)

# print(f"some_list_squared: {some_list_squared} = some_list_squared_too:{some_list_squared_too}")

## <span style="font-family: Palatino; font-size: 25px; color: purple">  Custom (user-defined) helper functions </span>

Sometimes there isn't a package to do exactly what we want, or we might need to repeat the same sequence of operations many times.  In both cases, it is useful to write functions for this.  In order to avoid cluttering up our notebooks, and to keep us focused on the main ideas of each lesson, we will define functions that are largely convenience or helper functions at the top of our notebooks.

### `relabel(data, from_lab, to_lab, n)` to change label names

In [None]:
def relabel(data, from_lab, to_lab, n):
    label_map = dict(zip(from_lab, to_lab))
    new_labs = [None] * n
    for i in range(n):
        new_labs[i] = label_map.get(data[i])
        
    return new_labs
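
To see what `relabel` does, here is a minimal sketch on made-up responses (the toy `responses` Series is an assumption, not survey data). It repeats the helper's logic so that it runs on its own, and compares it with the pandas `Series.map` idiom, which does the same lookup. Note that any value missing from the label map comes back as `None` (or `NaN` with `map`).

```python
import pandas as pd

def relabel(data, from_lab, to_lab, n):
    # same logic as the helper above: look each value up in a long -> short map
    label_map = dict(zip(from_lab, to_lab))
    return [label_map.get(data[i]) for i in range(n)]

long_labels  = ["No prior coursework", "One or more advanced Masters/PhD courses"]
short_labels = ["no courses", "adv_grad"]

# toy responses (made up for illustration), including one missing answer
responses = pd.Series(["No prior coursework", None, "One or more advanced Masters/PhD courses"])

relabeled = relabel(responses, long_labels, short_labels, len(responses))
print(relabeled)

# the pandas idiom: map every value through the same dictionary
mapped = responses.map(dict(zip(long_labels, short_labels)))
print(mapped.tolist())
```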

### `count_words(s)` to count number of words in a text string

In [None]:
def count_words(s):
    word_list = s.split()
    number_of_words = len(word_list)
    return(number_of_words)
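
A quick check of `count_words` on a made-up sentence. The second function, `count_words_safe`, is a hypothetical variant (not part of the notebook) that treats a missing response as zero words, which may be handy if some survey answers are blank.

```python
def count_words(s):
    # split() with no arguments splits on any run of whitespace
    return len(s.split())

def count_words_safe(s):
    # hypothetical variant: anything that is not a string (e.g. NaN) counts as 0 words
    return len(s.split()) if isinstance(s, str) else 0

n_words = count_words("Segment the   customers first")
n_missing = count_words_safe(float("nan"))
print(n_words, n_missing)
```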

# <span style="font-family: Palatino; font-size: 30px; color: purple">  Import the raw survey data and do quick-check of quality/structure</span>

OK.  Let's see how to use the data science packages to do...Data Science!

We can use `pandas` to import some data and keep it in a DS-friendly structure called a `DataFrame`.

In this example, we will be looking at the responses from the pre-class survey.

In [None]:
path=""
fn = "responses.csv"
full_path= path+fn
df = pd.read_csv(full_path)

Let's take a look at the data to see the variables we are dealing with.

In [None]:
df.columns

Ugh!  Right out of the gate we can see that the variable names are really long and cumbersome, and have spaces and other special characters, which makes them difficult to use in computation.

Before we dive in, it would be useful to know what kind of data we have loaded.

`pandas` makes it easy to get more detail on a DataFrame.

In [None]:
df.info()

Ugh$^2$ !  Not only are the names cumbersome, but they make the output almost unreadable. 

Even so, we can see that some of the columns may contain personal identifying information, so we should drop those, along with any columns that are unlikely to be directly useful at this point.

Let's start by dropping the variables we "don't need."

In [None]:
df=df.drop(columns=['#', '*What is your first name?*', '*What is your NetID?*',
                    '*Please use this space to provide any comments:*', 'Start Date (UTC)','Network ID'
                   ]
          )

Now we can rename the columns that we do want, to make things a bit easier.

In [None]:
old_col_names = df.columns
new_col_names = ["degree",     "other_deg",
              "cw_mvstats", "cw_python",  "cw_ML",        "cw_finance",  "cw_mgt",     "cw_strat",
              "work_yrs",
              "exp_stats",  "exp_python",  "exp_ML",      "exp_val",     "exp_deploy", "exp_prd_dev", "exp_prd_strat", 
              "did_OLS",    "did_RDB",     "did_cloud",   "did_API", 
              "comf_viz",   "comf_python", "comf_math",   "comf_stats",  "comf_ML",    "comf_strat",
              "comf_term",  "comf_comm",  "comf_wrangle", 
              "intr_segm",  "intr_fraud",  "intr_invest", "intr_invnt",  "intr_QC", 
              "intr_recm",  "intr_risk",   "intr_supply", "intr_target", "intr_insur", "intr_other",
              "obj_acc",    "obj_jobsrch", "obj_proj",    "obj_gen",     "obj_mgt_ds", "obj_startup", "obj_other",
              "mcostly_sw", "warmup",
              "sub_date"
             ]
df.columns = new_col_names
df.columns
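
Assigning a list to `df.columns` relies on the new names being in exactly the same order as the old ones. As an alternative sketch, `DataFrame.rename` takes an explicit old-to-new mapping, so each rename is spelled out. The toy column names below are invented for illustration and are not the survey's actual headers.

```python
import pandas as pd

# toy frame with survey-style headers (hypothetical names)
toy = pd.DataFrame({
    "*Which degree are you pursuing?*": ["MBA", "MS"],
    "*How many years have you worked?*": ["0-2", "3-5"],
})

name_map = {
    "*Which degree are you pursuing?*": "degree",
    "*How many years have you worked?*": "work_yrs",
}

# errors="raise" makes pandas complain if a key in name_map is not an existing column
renamed = toy.rename(columns=name_map, errors="raise")
print(renamed.columns.tolist())
```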

Let's compare the old and new variable names to see how much of a difference this makes for us...

In [None]:
pd.DataFrame(zip(old_col_names, new_col_names), columns=["old","new"])

That does seem a lot more manageable.  Look at rows `0`, `47`, and `48`, for example.

Let's try to get some detail again, now that our variable names are pretty.

In [None]:
df.info()

Well that's certainly better! 

Next we can take a look at what is actually *in* each variable:

Let's get a sense of the class makeup by degree type.

In [None]:
df.degree.hist()

That's odd, there were 20 respondents but this histogram only shows 9.  We will need to sort that out.
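
One plausible explanation (an assumption until we inspect the column) is missing values: `hist()` and `value_counts()` both leave out `NaN` by default. A minimal sketch on made-up data shows how to count the missing entries explicitly.

```python
import numpy as np
import pandas as pd

# toy degree column (made up): 5 respondents, 2 of whom left the field blank
degree = pd.Series(["MBA", "MBA", np.nan, "MS", np.nan])

counted = degree.value_counts()                  # NaN is excluded by default
counted_all = degree.value_counts(dropna=False)  # NaN gets its own row
n_missing = degree.isna().sum()

print(counted_all)
print("missing:", n_missing)
```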

Let's take a look at respondents' interest in finance applications of DS.

In [None]:
df.intr_invest.hist(color='red')

That's odd: we only get a count for one category and many respondents are not represented.

Let's take a look at that variable in more detail.

In [None]:
df.intr_invest.head()


Ugh again! It would appear that for choice questions like this one, the survey software just repeats the text of a category label in its export of the survey data.  For example, the question we just examined, `intr_invest`, asked respondents to indicate whether they had interest in investment-related applications of DS.  In this case, the software export contains the label of that response option ("Investment Methods") in the field if the respondent had indicated interest, but it puts a `NaN` in that field for 'No', rather than simply creating a single variable indicating, e.g., a `0` or `1`.
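
As a small sketch of the recoding this calls for, the toy column below (made-up values in the same label-or-`NaN` layout) is turned into a `0`/`1` indicator with `notna()`.

```python
import numpy as np
import pandas as pd

# toy export: the option label if ticked, NaN otherwise
intr_invest = pd.Series(["Investment Methods", np.nan, "Investment Methods", np.nan])

# notna() gives True/False; astype(int) turns that into 1/0
flag = intr_invest.notna().astype(int)
print(flag.tolist())
```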

Let's take a look at what kind of data is in some of the other variables:

In [None]:
df.head(2)  # show the first 2 records of the DataFrame

It looks like we are not really ready yet to analyze this data set.

We will need to do some work to get this data into shape first.

# <span style="font-family: Palatino; font-size: 30px; color: purple">Rudimentary data cleaning and recoding </span>

## <span style="font-family: Palatino; font-size: 25px; color: purple"> Relabel cumbersome (long) response labels from survey software</span>

We will start by trying to relabel some of the very long data labels that the survey software provides.  Note that these are not variable names; they are the  values that a categorical variable can take in our data.

In [None]:
# Note definition of `relabel` in the Helper Functions section, above!

n_resp = df.shape[0]

# --- relabel coursework cols
coursework_labels_long  = ["No prior coursework",
                           "An undergraduate or graduate introductory course",
                           "One or more advanced undergraduate courses",
                           "One or more advanced Masters/PhD courses"
                          ]

coursework_labels_short = ["no courses",
                           "intro",
                           "adv_undergrad",
                           "adv_grad"
                          ]

# could be automated using prefix of col name
coursework_columns = ["cw_mvstats", "cw_python", "cw_ML", "cw_finance", "cw_mgt", "cw_strat"]

for col in coursework_columns:
    df[col] = cp.copy(relabel(df[col], coursework_labels_long, coursework_labels_short, n_resp))
    
# -- relabel comfort cols
 
comfort_labels_long  = ["I've had no exposure", "I've had limited exposure", "I am somewhat comfortable", "I'm pretty good", "I'm very good"]
comfort_labels_short = ["no exposure", "limited exposure", "somewhat comfortable", "pretty good", "very good"]

comfort_columns =      ["comf_viz",   "comf_python", "comf_math",   "comf_stats",  "comf_ML",   
                        "comf_strat", "comf_term",  "comf_comm",  "comf_wrangle"
]

for col in comfort_columns:
    df[col] = cp.copy(relabel(df[col], comfort_labels_long, comfort_labels_short, n_resp))

experience_labels_long  = ["No prior projects", "One small project", "A few projects", "A large project", "A number of large projects"]
experience_labels_short = ["NONE", "SMALL", "MED", "LARGE", "LARGE+"]

experience_columns =      ["exp_stats", "exp_python", "exp_ML", 
                           "exp_val",   "exp_deploy", "exp_prd_dev", 
                           "exp_prd_strat"
                          ]
for col in experience_columns:
    df[col] = cp.copy(relabel(df[col], experience_labels_long, experience_labels_short, n_resp))

    
# clean up survey software's long-text for "Yes" convention   


YES = 1
NO  = 0

yesno_cols = ["intr_segm",   "intr_fraud", "intr_invest",
              "intr_invnt",  "intr_QC",    "intr_recm",  "intr_risk",   "intr_supply",
              "intr_target", "intr_insur", "intr_other", "obj_acc",     "obj_jobsrch",
              "obj_proj",    "obj_gen",    "obj_mgt_ds", "obj_startup", "obj_other"
             ]
# yesno_cols = [ "obj_startup", "obj_other"]

for col in yesno_cols:
    df[col] = [YES if isinstance(v, str) else NO for v in df[col]]


## <span style="font-family: Palatino; font-size: 25px; color: purple"> Transform  </span>  <span style="font-size: 20px"> `sub_date` </span>  <span style="font-family: Palatino; font-size: 25px; color: purple"> and </span><span style="font-size: 20px">  `warmup`</span>  <span style="font-family: Palatino; font-size: 25px; color: purple">to make them more useful </span>

`sub_date`, the date on which the response was submitted, is listed as a date, which is technically OK for our learning algorithm, but it is not really set up for answering questions.  We can fix that by transforming it into a more useful quantity, for example, the number of hours before the due date that the response was submitted.

In [None]:
import datetime as dt

HW_due_date   = dt.date(2022, 2, 7)

spare_time = [ HW_due_date - dt.datetime.date(dt.datetime.strptime(d, '%Y-%m-%d %H:%M:%S')) for d in df.sub_date]
hours_early = [d.total_seconds() / 3600 for d in spare_time]
df = df.assign(hours_early=hours_early)
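
The cell above drops the time of day before subtracting, so its results come out in whole days (multiples of 24 hours). As a sketch of an alternative, `pd.to_datetime` keeps the full timestamp. The two submission times below are made up, and treating the deadline as midnight at the start of the due date is an assumption.

```python
import pandas as pd

# toy submission timestamps in the same string format as `sub_date`
sub_date = pd.Series(["2022-02-05 12:00:00", "2022-02-06 18:30:00"])
deadline = pd.Timestamp("2022-02-07")  # assumption: midnight at the start of the due date

submitted = pd.to_datetime(sub_date, format="%Y-%m-%d %H:%M:%S")

# timestamp-precise version
hours_early = (deadline - submitted).dt.total_seconds() / 3600

# date-only version (matches the approach in the cell above)
whole_day_hours = (deadline - submitted.dt.normalize()).dt.days * 24

print(hours_early.tolist())
print(whole_day_hours.tolist())
```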

`warmup`,  the respondent's submission for the question regarding how to approach the hardware store chain's problem, could potentially be a fruitful source of text mining or even NLP insights, but that is a bit beyond our scope right now.  (In addition, for the algorithm we are going to demonstrate here, text values are not acceptable.)  For our purposes, we can transform that variable's values into something that is perhaps much cruder, but which might give us something easier to work with at this point, such as the number of words in the response.

In [None]:
# warmup_len = [len(s) for s in df.warmup]
warmup_len = [count_words(s) for s in df.warmup]
df = df.assign(warmup_len = warmup_len)
long_warmup = [1 if s > np.median(warmup_len) else 0 for s in warmup_len]
df = df.assign(long_warmup = long_warmup)
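
The same two derived columns can be sketched with vectorized pandas string methods instead of list comprehensions. The toy responses are made up. Note that with a strict `>` comparison, a response whose length equals the median is coded `0`.

```python
import pandas as pd

# toy free-text responses
warmup = pd.Series(["a b c", "one", "x y z w v"])

# split each response on whitespace, then count the pieces
warmup_len = warmup.str.split().str.len()

# 1 if strictly longer than the median response, else 0
long_warmup = (warmup_len > warmup_len.median()).astype(int)

print(warmup_len.tolist())
print(long_warmup.tolist())
```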

And, finally,  at least for this pass, we can drop the original raw columns since we will be using the transformed, rather than raw forms of the variables.

In [None]:
df = df.drop(columns=['sub_date', 'warmup'])

## <span style="font-family: Palatino; font-size: 25px; color: purple"> Clean up category names and coding </span>

Next let's work on the categorical variables.  We can start with the columns dealing with the degrees that respondents are pursuing.  There are two of these:  `degree` and `other_deg`.  Let's look at them:

In [None]:
pd.DataFrame(zip(df.degree, df.other_deg))

In [None]:
df.degree =  df.degree.fillna(df.other_deg)
df = df.drop(columns=["other_deg"])
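
What the cell above does, sketched on a made-up frame: wherever `degree` is missing, `fillna` takes the value from the same row of `other_deg`, after which the second column is no longer needed.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "degree":    ["MBA", np.nan, "MS"],
    "other_deg": [np.nan, "PhD", np.nan],
})

# fill gaps in `degree` row by row from `other_deg`, then drop the helper column
toy["degree"] = toy["degree"].fillna(toy["other_deg"])
toy = toy.drop(columns=["other_deg"])
print(toy)
```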

In [None]:
#convert other categorical cols into categories and relabel for compactness
cat_cols = ["work_yrs", 
              "exp_stats", "exp_python",  "exp_ML",      "exp_val",      "exp_deploy", "exp_prd_dev", "exp_prd_strat", 
              "did_OLS",   "did_RDB",     "did_cloud",    "did_API", 
              "comf_viz",  "comf_python", "comf_math",    "comf_stats",  "comf_ML",    "comf_strat",
              "comf_term", "comf_comm",   "comf_wrangle"
           ]

for col in cat_cols:
    df[col] = df[col].astype("category")
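
A plain `"category"` dtype does not know that, say, `SMALL` comes before `LARGE`. As an optional sketch (not something the notebook does), an ordered `CategoricalDtype` records that order, so sorting and comparisons follow the scale rather than the alphabet. The toy values are made up; the level names are the short experience labels defined earlier.

```python
import pandas as pd

levels = ["NONE", "SMALL", "MED", "LARGE", "LARGE+"]
exp_dtype = pd.CategoricalDtype(categories=levels, ordered=True)

# toy experience column
exp_python = pd.Series(["MED", "NONE", "LARGE+", "SMALL"]).astype(exp_dtype)

sorted_levels = exp_python.sort_values().tolist()  # follows `levels`, not the alphabet
at_least_med = (exp_python >= "MED").tolist()      # comparisons respect the order
codes = exp_python.cat.codes.tolist()              # integer position of each level

print(sorted_levels)
print(at_least_med)
print(codes)
```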

In [None]:
df.head()

# <span style="font-family: Palatino; font-size: 30px; color: purple">  (Finally) Begin analysis: EDA </span>

## <span style="font-family: Palatino; font-size: 25px; color: purple">Basic EDA </span>

We first need to look at the data to get a sense of how we want to explore it...

In [None]:
warnings.filterwarnings('ignore')

rows, cols = 3, 3
variables  = ['degree',     'work_yrs',   'hours_early', 
              'exp_stats',  'exp_python', 'exp_ML',
              'comf_python','comf_math', 'comf_stats' 
             ]
colors     = ['purple',    'gray',        'green', 
              'lightblue', 'lightblue',   'lightblue',
              'lightgreen', 'lightgreen', 'lightgreen'
             ]
nvars      = len(variables)

fig, axs = plt.subplots(ncols=cols, nrows=rows, figsize=(7*cols, 7*rows))
axs      = axs.flatten()
params   = {'axes.titlesize':'28', 'xtick.labelsize':'20', 'ytick.labelsize':'20'}
plt.rcParams.update(params)


for i in range(nvars):
    v = variables[i]
    df[v].hist(ax = axs[i], color = colors[i])
    axs[i].set_title(v)
plt.show()    
    



## <span style="font-family: Palatino; font-size: 25px; color: purple">Intermediate EDA </span>

Later on, we'll learn about some tools to make this easier in Python.  For example....

In [None]:
def create_explorer(eda_env, using_colab = False):
    # if using Python 3.10 or higher, this can be done more elegantly with a `match` statement
    
    # -----------------------------------------
    def clear_previous_explorers(eda_modules):
        for m in eda_modules:
            # forget any previously imported explorer; do nothing if it was never loaded
            sys.modules.pop(m, None)
    # -----------------------------------------          
         
        
    eda_modules = ['dataprep', 'pandas_profiling', 'bamboolib']
    clear_previous_explorers(eda_modules)
    
    global explore
    
    if eda_env == "dataprep":
        from dataprep.eda import create_report        
        def explore(df, show = True):
            report  = create_report(df)
            if show:
                report.show()
        return explore
    elif eda_env == "pandas_profiling":
        from pandas_profiling import ProfileReport
        
        def explore(df, show = True):
            report = ProfileReport(df) 
            if show:
                if using_colab:
                    report.to_notebook_iframe()
                else:
                    report.to_notebook_iframe()
                    # report.to_widgets()
        return explore
    elif eda_env == "bamboo":
        
        if using_colab:
            print("WARNING: Cannot use bamboolib in Colab!  Null function returned.")
            explore = None
        else:
            import bamboolib
            def explore(df, show = True):
                if show:
                    return df
        return explore
    else:
        print("WARNING: Unrecognized eda_env: " + str(eda_env) + "  Null function returned.")
        explore = None
        return explore


In [None]:
warnings.filterwarnings('ignore')

EDA_ENV = "pandas_profiling"
USING_COLAB = False

explore = create_explorer(EDA_ENV, USING_COLAB)
explore(df)

In [None]:
import bamboolib
df

There is a lot to see here and even a cursory review of the report suggests that we would need to do some further _preprocessing_ of the data before modeling it.  (We will learn about this in the next lecture.)

But since this is just a fly-by introduction, we will move on to building our first data-driven model.

# <span style="font-family: Palatino; font-size: 30px; color: purple">Fitting a tree-based model</span>

We will use a form of *recursive partitioning* algorithm to fit a tree model to the survey data.  To get our feet a little wet, let's see if we can figure out the features that are associated with a respondent having a longer vs. shorter response to the warmup question on the questionnaire...

We can fit a model to use some of the other variables to try to explain `long_warmup`.

In [None]:
# define which variables we will be trying to "predict"
target_col     = ['long_warmup']

# select some variables to enter into the model to predict it
predictor_cols = ["comf_math",   
                  "comf_stats", 
                  "hours_early", 
                  "degree"
                 ]

The implementation of the algorithm we will be using from `sklearn` does not accommodate text or label data, so we first need to massage the data to get it into a form that meets the algorithm's requirements.  (Yup.  More data wrangling.)

In [None]:
X = pd.get_dummies(df[predictor_cols], drop_first=False)
X.head(5)
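
To make the effect of `get_dummies` concrete, here is a sketch on a made-up frame: numeric columns pass through unchanged, while each text column is replaced by one indicator column per distinct value.

```python
import pandas as pd

toy = pd.DataFrame({
    "hours_early": [48.0, 24.0, 72.0],
    "degree":      ["MBA", "MS", "MBA"],
})

X_toy = pd.get_dummies(toy, drop_first=False)
print(X_toy)

# each row has exactly one indicator switched on per original text column
indicators_per_row = X_toy[["degree_MBA", "degree_MS"]].sum(axis=1).tolist()
```

With `drop_first=True` one indicator per variable would be dropped, since it is implied by the others.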

In [None]:
y = df[target_col]
y.head(5)

*Now we are ready to go..!*

##  <span style="font-family: Palatino; font-size: 25px; color: purple">Estimate a tree model using </span>  <span style="font-size: 20px"> `sklearn` </span>

We will estimate our model in two steps.  These steps are, in general, the same ones we will use for each estimation algorithm we study:
1. **define a model object, with the structure and parameters we wish to use;** and
2. **use the algorithm to estimate (fit) the parameters of the model, using a particular data set.**

For example:

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
np.random.seed(42) # so that we all get the same results


decision_tree = DecisionTreeClassifier(max_depth=3, criterion="entropy") 
decision_tree.fit(X, y)

The fitting algorithm returns an object that contains the model we estimated, along with a number of members and methods that we may find useful:

In [None]:
decision_tree.get_n_leaves()

In [None]:
decision_tree.feature_importances_
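
A bare array of importances is hard to read because it carries no names. One way to make it more readable, sketched here on a tiny made-up data set rather than the survey, is to pair the values with the column names in a `pandas` Series. The toy tree is also printed as text with `export_text`.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# toy data: `hours_early` separates the classes perfectly, `noise` does not
X_toy = pd.DataFrame({
    "hours_early": [10, 20, 30, 40, 50, 60],
    "noise":       [1, 0, 1, 0, 1, 0],
})
y_toy = pd.Series([0, 0, 0, 1, 1, 1], name="long_warmup")

toy_tree = DecisionTreeClassifier(max_depth=3, criterion="entropy", random_state=42)
toy_tree.fit(X_toy, y_toy)

# attach feature names to the importances and sort from most to least important
importances = pd.Series(toy_tree.feature_importances_, index=X_toy.columns)
importances = importances.sort_values(ascending=False)
print(importances)

# a plain-text view of the fitted splits
print(export_text(toy_tree, feature_names=list(X_toy.columns)))
```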

Hmm... Not as revealing as we may have hoped.  It's sometimes easier to get a sense of the model by examining it graphically.

In [None]:
fig = plt.figure(figsize=(15,10))
_ = tree.plot_tree(decision_tree, 
                   feature_names = list(X.columns),   # a plain list works across sklearn versions
                   class_names   = ['shorter','longer'],
                   impurity = False, proportion = True, precision = 1, 
                   rounded = False, fontsize = 16, filled = True
                  )

Interesting!  Now we can get a better sense of what the model is doing.

The very first node seems to suggest that `hours_early`, one of the derived variables we calculated during the import and preprocessing steps, is pretty predictive of warmup response length.

Let's explore this in a bit more detail.

Let's do a bit more data wrangling and then plot the average word count of a response based on the value of `hours_early`.  The tree model suggested that values of `hours_early` that are greater than about `35` were more likely to be associated with longer responses.  

Let's go to the data.

In [None]:
# set up the data to make plotting a barplot easier

df_plot_dat = pd.DataFrame()

Now let's take a look...

In [None]:
# -----------------------------------------
def cut(x, cutoff, below_lab, above_lab):
    # label each value of x by whether it falls below the cutoff
    v  = [below_lab if h < cutoff else above_lab for h in x]
    return pd.Series(v, index = x.index)
# -----------------------------------------

params   = {'axes.titlesize':'28', 'xtick.labelsize':'18', 'ytick.labelsize':'18'}
plt.rcParams.update(params)

rows, cols = 2, 2
fig, ax  = plt.subplots(ncols = cols, nrows = rows, figsize = (9*cols, 9*rows))
axs      = ax.ravel()

cutoffs = [24, 36, 72, 96]
df_plot_dat = df.copy()

difs = np.zeros(len(cutoffs))

for i, c in enumerate(cutoffs):
    below_lab = "rushed"      # fewer than c hours left before the deadline
    above_lab = "leisurely"   # c or more hours left
    pace = cut(df_plot_dat.hours_early, c, below_lab, above_lab)
    df_plot_dat = df_plot_dat.assign(pace = pace)

    title_txt = "With more (less) than " + str(c) + " hrs. left"

    # subset the data using the `pandas` `groupby` method (and create the raw plot)
    summary = df_plot_dat.groupby(by = ["pace"])['warmup_len'].mean()
    p = summary.plot(kind="bar", ax = axs[i],
                     color=['green', 'red'],
                     title = title_txt, ylabel = "word count",
                     ylim = (0, 300),
                     xlabel = "", rot = 0, 
                     grid = True
                    )
    p.set_ylabel("word count",fontdict={'fontsize':20})
    # look groups up by label rather than by position
    difs[i] = summary["leisurely"] - summary["rushed"]
plt.show();



dat = pd.DataFrame(data = {"cutoffs": cutoffs, "difs": difs})
dat.plot(x = 'cutoffs', y = 'difs', figsize=(12, 5), color = "red",  linewidth=3)

plt.xlabel("Hours before HW due", fontsize = 20, fontweight = 'bold')
plt.ylabel("Dif. in mean words", fontsize = 20, fontweight = 'bold')
plt.title("\nAs the threshold is moved closer to the deadline, \nboth lengths and differences get shorter", fontsize = 30, fontweight = 'bold')
plt.text(0, max(difs), "Leisurely are longer",    fontsize = 15, fontstyle = "italic", ma = "right")
plt.text(0, min(difs)-10, "Rushed are longer", fontsize = 15, fontstyle = "italic", ma = "right")

plt.show();
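
The threshold-then-average step can also be sketched with `pd.cut` and `groupby`. The numbers below are made up, and the single cutoff of 36 hours is just one of the values tried above. With `right=False`, the bins are `[-inf, 36)` and `[36, inf)`, so anything under 36 hours is labelled "rushed".

```python
import numpy as np
import pandas as pd

# toy data (not the survey): hours before the deadline and response length in words
toy = pd.DataFrame({
    "hours_early": [0, 24, 48, 72, 96],
    "warmup_len":  [50, 80, 120, 150, 200],
})

cutoff = 36
toy["pace"] = pd.cut(toy["hours_early"],
                     bins=[-np.inf, cutoff, np.inf],
                     labels=["rushed", "leisurely"],
                     right=False)

# mean word count per group, and the gap between the groups
summary = toy.groupby("pace", observed=True)["warmup_len"].mean()
gap = summary["leisurely"] - summary["rushed"]
print(summary)
print(gap)
```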



`dtreeviz` is a really cool new package for visualizing decision trees that saves us a bit of effort in the first pass.

In [None]:
from dtreeviz.trees import dtreeviz # remember to load the package

class_colors = [None, # 0 classes
                None, # 1 class
                ["#ff6347","#00ff80"], # 2 classes
]

vis = dtreeviz(decision_tree,
               X.iloc[:,:], y.iloc[:,0], # painful notation since dtreeviz
               target_name=target_col[0], # expects numpy data structures but
               feature_names=X.columns,  # our data is in a pandas data frame
               class_names=['shorter','longer'],
               colors={'classes': class_colors}
               
              )

vis.scale = 2.5
vis

In [None]:
vis.__dir__()