# Cleaning the Raw MetaKaggle Dataset (.ipynb to .json)

## Purpose
This notebook runs a preprocessing script that strips unnecessary data (e.g. metadata) from the notebooks and produces a cleaned dataset. Each cleaned .json file is a list of `{cell_type, source}` objects, where `cell_type` is either "markdown" or "code" and `source` is a single string of markdown or code, with newlines denoted by `\n`.

We also use `nbformat`, an official Jupyter library, to standardize the format of all notebooks in the dataset. For example, an unknown number of notebooks store the `source` attribute of each cell as a list of strings rather than as a single string. This preprocessing step standardizes the format so that all notebooks in the dataset are consistent.
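The target format can be illustrated with a minimal sketch. The helper names `normalize_source` and `clean_cells` and the sample cells are hypothetical, introduced here only for illustration; they are not part of `nbformat` or of the dataset.

```python
def normalize_source(source):
    """Return a cell's source as one string (hypothetical helper)."""
    if source is None:
        # Some cells have no source at all
        return ""
    if isinstance(source, str):
        # Already a single string
        return source
    # A list of lines, each already ending in "\n" where needed
    return "".join(source)


def clean_cells(cells):
    """Keep only cell_type and source for every cell (hypothetical helper)."""
    return [
        {"cell_type": cell["cell_type"],
         "source": normalize_source(cell.get("source"))}
        for cell in cells
    ]


# Made-up sample cells showing both source layouts found in raw notebooks
raw_cells = [
    {"cell_type": "code",
     "source": ["import os\n", "print(os.getcwd())"],
     "metadata": {}, "outputs": []},
    {"cell_type": "markdown", "source": "# Title", "metadata": {}},
]
cleaned = clean_cells(raw_cells)
```

Joining with an empty separator matters here: the lines already carry their own `\n`, so any other separator would change the code's indentation.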

## Note: The raw .ipynb files were first renamed to .json
The raw dataset originally consists of .ipynb files. Because an .ipynb file is a JSON document, it can simply be renamed to .json.
All .ipynb files were therefore renamed to .json using the shell command:

```
find . -name "*.ipynb" -exec rename 's/\.ipynb$/.json/' '{}' +
```
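The same renaming step can also be sketched in Python, which avoids depending on the Perl-style `rename` utility that the command above assumes. The function name `rename_ipynb_to_json` is hypothetical, and skipping files whose target already exists is a design choice of this sketch, not something the original command does.

```python
from pathlib import Path


def rename_ipynb_to_json(root):
    """Rename every *.ipynb file under root to *.json and return the new paths."""
    renamed = []
    # Materialize the listing first so that renaming does not disturb the walk
    for path in sorted(Path(root).rglob("*.ipynb")):
        target = path.with_suffix(".json")
        if target.exists():
            # Assumption of this sketch: never overwrite an existing file
            continue
        path.rename(target)
        renamed.append(target)
    return renamed
```

Calling `rename_ipynb_to_json(".")` from the dataset root would mirror the shell command.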



# Import Modules

In [78]:
import pandas as pd
import matplotlib.pyplot as plt
import nbformat
import json
import glob
import os
import errno

In [68]:
# Retrieves a list of filepaths of all ipynb files (pre-converted into json by simply renaming their file extensions from .ipynb to .json)
all_file_paths = glob.glob('data/notebooks-full-json/**/*.json',recursive=True)

In [81]:

# Mkdir function
# Taken from https://stackoverflow.com/a/600612/119527
def mkdir_p(path):
    """ 
    Creates a directory at the given path, if it does not exist
    If the directory already exists, do nothing
    """
    try:
        os.makedirs(path)
    except OSError as exc: # Python >2.5
        if exc.errno == errno.EEXIST and os.path.isdir(path):
            pass
        else: raise

# Loop over each raw file in data/notebooks-full-json.
for file_path in all_file_paths:
    print(file_path)
    try:
        data = nbformat.read(file_path, as_version=4)
    # Some .json files are invalid, i.e. contain zero bytes
    except ValueError:
        print("Invalid file", file_path)

        continue
    
    jsondata = json.loads(nbformat.writes(data, version=4))
    jsondata = jsondata['cells']

    keep_keys = ['cell_type', 'source']
    newData = []
    for obj in jsondata:
        obj = { theKey: obj[theKey] for theKey in keep_keys }

        # Edge case: some cells have a null source
        if obj['source'] is None:
            obj['source'] = ""
        # Otherwise it is a list of strings, one per line, each already
        # ending in "\n"; join without a separator so that no stray spaces
        # are inserted at the start of each line
        elif not isinstance(obj['source'], str):
            obj['source'] = "".join(obj['source'])
        newData.append(obj)
    
    # Create the output directory if it does not exist, i.e.
    # data/clean-notebooks-full-json/[competitionname]
    relPath = os.path.relpath(file_path, 'data')
    newFilePath = os.path.join('data', 'clean-' + relPath)
    dirName = os.path.dirname(newFilePath)
    mkdir_p(dirName)
    with open(newFilePath, 'w') as f:
        json.dump(newData, f)

notebooks-full-json/favorita-grocery-sales-forecasting/1806927.json
notebooks-full-json/favorita-grocery-sales-forecasting/2153907.json
notebooks-full-json/favorita-grocery-sales-forecasting/1707222.json
notebooks-full-json/favorita-grocery-sales-forecasting/2287641.json
notebooks-full-json/favorita-grocery-sales-forecasting/1803957.json
notebooks-full-json/favorita-grocery-sales-forecasting/2104193.json
notebooks-full-json/favorita-grocery-sales-forecasting/1998847.json
notebooks-full-json/favorita-grocery-sales-forecasting/2159421.json
notebooks-full-json/favorita-grocery-sales-forecasting/1805310.json
notebooks-full-json/favorita-grocery-sales-forecasting/1997487.json
notebooks-full-json/favorita-grocery-sales-forecasting/1753449.json
notebooks-full-json/favorita-grocery-sales-forecasting/1930002.json
notebooks-full-json/favorita-grocery-sales-forecasting/2144207.json
notebooks-full-json/favorita-grocery-sales-forecasting/1685247.json
notebooks-full-json/favorita-grocery-sales-forec