<br>

## EDA

---

In this notebook I will walk through a simple exploration of the Python files contained within The Stack dataset.

<br>

### IMPORTS

---



In [12]:
import tensorflow as tf
import pandas as pd
import numpy as np
import params
import wandb; wandb.login()
import sys
import os

# add the cllm-data-curation library to the path
sys.path.insert(0, "/home/paperspace/home/cllm-data-curation")

# We will show these functions when required
from cllm_data_curation.thestack_curation.curation_utils import make_meta_df
from cllm_data_curation.thestack_curation.curation_utils import filter_meta_languages
from cllm_data_curation.thestack_curation.curation_utils import open_pq_as_df
from cllm_data_curation.thestack_curation.curation_utils import flatten_l_o_l
from cllm_data_curation.thestack_curation.curation_utils import print_ln
from cllm_data_curation.thestack_curation.curation_utils import read_json_file
from cllm_data_curation.thestack_curation.curation_utils import glob_pq_paths
from cllm_data_curation.thestack_curation.curation_utils import get_dir_size
from cllm_data_curation.thestack_curation.curation_utils import replace_byte_encoded_string
from cllm_data_curation.thestack_curation.curation_utils import contains_repeating_substring
from cllm_data_curation.thestack_curation.general_utils import get_optimal_worker_count

[34m[1mwandb[0m: Currently logged in as: [33mds08tf[0m ([33mds-ml[0m). Use [1m`wandb login --relogin`[0m to force relogin


<br>

### **`ConfigBuilder`**

---

A small helper class for keeping track of configuration values over the course of the exploration.

In [18]:
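class ConfigBuilder:
    """Lightweight attribute container for a named group of settings."""

    def __init__(self, name, **kwargs):
        self.name = name
        for key, value in kwargs.items():
            setattr(self, key, value)

    def add_attr(self, key, value):
        """Add a new attribute (or overwrite an existing one)."""
        setattr(self, key, value)

    def update(self, **kwargs):
        """Update existing attributes only; unknown keys raise AttributeError."""
        for key, value in kwargs.items():
            if hasattr(self, key):
                setattr(self, key, value)
            else:
                raise AttributeError(f"{key} not found in Config")

    def to_dict(self):
        """Return a shallow copy of the attributes, including `name`."""
        return dict(vars(self))

    def __repr__(self):
        header = f"{'**** ' + self.name.upper() + ' CONFIG ATTRIBUTES ****':^40}"
        rows = [f"--- {repr(k):<20} --> {repr(v)}" for k, v in vars(self).items()]
        return "\n".join([header, *rows])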


ConfigBuilder("demo", **{"height":6, "width":100, "style":"cat"})

    **** DEMO CONFIG ATTRIBUTES ****    
--- 'name'               --> 'demo'
--- 'height'             --> 6
--- 'width'              --> 100
--- 'style'              --> 'cat'
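
The `update` method only changes attributes that already exist, while `add_attr` is what introduces new ones. A minimal sketch of that difference, using a condensed copy of the relevant methods so the snippet runs on its own (the `demo` values are the same illustrative ones as above):

```python
class ConfigBuilder:
    """Condensed copy of the class above, repeated so this snippet is self-contained."""

    def __init__(self, name, **kwargs):
        self.name = name
        for key, value in kwargs.items():
            setattr(self, key, value)

    def add_attr(self, key, value):
        setattr(self, key, value)

    def update(self, **kwargs):
        for key, value in kwargs.items():
            if hasattr(self, key):
                setattr(self, key, value)
            else:
                raise AttributeError(f"{key} not found in Config")


demo = ConfigBuilder("demo", height=6, width=100, style="cat")

# update() changes an attribute that already exists
demo.update(height=12)

# update() refuses keys that were never added
try:
    demo.update(depth=3)
    update_error = None
except AttributeError as err:
    update_error = str(err)

# add_attr() is the way to introduce a new key
demo.add_attr("depth", 3)

print(vars(demo))
print(update_error)
```

Raising on unknown keys in `update` means a mistyped attribute name surfaces immediately instead of silently creating a new setting.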

<br>

### WANDB CONSTANTS

---

Here we initialize the constants we will use with **`WANDB`**.

In [19]:
wandb_config = ConfigBuilder("wandb")

WANDB_PROJECT = "pystack"
wandb_config.add_attr("project", WANDB_PROJECT)

ENTITY = None # set this to team name if working in a team
wandb_config.add_attr("entity", ENTITY)

wandb_config

   **** WANDB CONFIG ATTRIBUTES ****    
--- 'name'               --> 'wandb'
--- 'project'            --> 'pystack'
--- 'entity'             --> None
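
Because `to_dict()` returns every attribute, unpacking this config into `wandb.init` also passes `name`, which `wandb.init` treats as the run's display name. If that is not intended, the keyword arguments can be filtered first. The sketch below is only one possible approach: the helper `to_init_kwargs` is hypothetical (it is not part of this notebook or of `wandb`), and dropping `None` values is an assumption about what is wanted here.

```python
def to_init_kwargs(config_dict, drop=("name",)):
    """Hypothetical helper: keep only the set values that should reach wandb.init."""
    return {
        key: value
        for key, value in config_dict.items()
        if key not in drop and value is not None
    }


# Same values as the wandb config shown above
wandb_settings = {"name": "wandb", "project": "pystack", "entity": None}
init_kwargs = to_init_kwargs(wandb_settings)
print(init_kwargs)
```

The result could then be unpacked as `wandb.init(**init_kwargs, job_type="upload")`.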

<br>

### NOTEBOOK CONSTANTS

---

These are stored in a dedicated config object that is used throughout the notebook.

In [20]:
nb_config = ConfigBuilder("notebook")

# 1. Add a flag for debugging purposes
#   - setting this flag to True will use only a small subset of the data
DEBUG = True 
nb_config.add_attr("debug", DEBUG)

# 2. Add file and path information 
nb_config.add_attr("project_dir", "/home/paperspace/home/python-stack")
nb_config.add_attr("data_dir", os.path.join(nb_config.project_dir, "data"))
nb_config.add_attr("working_dir", os.getcwd())

nb_config

  **** NOTEBOOK CONFIG ATTRIBUTES ****  
--- 'name'               --> 'notebook'
--- 'debug'              --> True
--- 'project_dir'        --> '/home/paperspace/home/python-stack'
--- 'data_dir'           --> '/home/paperspace/home/python-stack/data'
--- 'working_dir'        --> '/home/paperspace/home/python-stack/notebooks'
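
The `debug` flag is meant to restrict processing to a small subset of the data. One way this could look is sketched below, with a hypothetical helper `select_paths` and made-up file names (neither appears in this notebook):

```python
def select_paths(paths, debug, debug_limit=2):
    """Hypothetical helper: return only the first few paths when debugging."""
    ordered = sorted(paths)
    return ordered[:debug_limit] if debug else ordered


# Made-up parquet file names, for illustration only
all_paths = [f"part-{i:03d}.parquet" for i in range(5)]

debug_paths = select_paths(all_paths, debug=True)
full_paths = select_paths(all_paths, debug=False)
print(debug_paths)
```

Sorting before slicing keeps the debug subset the same from run to run, which makes results easier to compare.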

In [11]:
run = wandb.init(**wandb_config.to_dict(), job_type="upload")
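# Artifact names may only contain alphanumerics, dashes, underscores, and dots,
# so use the directory's base name rather than the full path
raw_data = wandb.Artifact(os.path.basename(nb_config.data_dir), type="raw_data")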

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
[34m[1mwandb[0m: Paste an API key from your profile and hit enter, or press ctrl+c to quit:

  ········


[34m[1mwandb[0m: [32m[41mERROR[0m API key must be 40 characters long, yours was 9
[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
[34m[1mwandb[0m: Paste an API key from your profile and hit enter, or press ctrl+c to quit:

  ········


[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /home/paperspace/.netrc


True

In [14]:
params.WANDB_PROJECT

AttributeError: module 'params' has no attribute 'WANDB_PROJECT'