# Prep 100 
This is the first prep notebook for my project 'Data Science in Practice: An Analysis of Jupyter Notebooks'.

## Purpose
* In this notebook I begin parsing the notebook files of the smaller dataset of about 6500 notebooks, downloaded from https://library.ucsd.edu/dc/object/bb2733859v as 'sample notebook data'.
* I will first look at the cell data in the notebooks.

    * First I downloaded the data: a CSV file containing notebook IDs and about 6500 individual ipynb files.
    * I created functions that parse the individual notebooks by their notebook ID in the CSV.
    * I created a dataframe with the information I needed, including:
        * Cell type
        * Number of words in markdown cells
        * Lines of code
    * I saved the dataframe as a CSV file in /15487912_FYP/data/CSV_files/ for analysis in notebook Analysis_100.

In [1]:
#importing relevant libraries
import os
import json
import numpy
import pandas as pd
import re

In [2]:
df_nb = pd.read_csv('../data/Dataset1/csv/notebooks_sample.csv')

print('%s notebooks' % df_nb.shape[0])

6529 notebooks


The CSV file, which lists 6529 notebooks, is read in as a dataframe.

In [3]:
df_nb.head()

Unnamed: 0,nb_id,html_url,max_filesize,min_filesize,name,path,query_page,repo_id
0,294,https://github.com/j3hempsey/dotfiles/blob/f3e...,10,0,jupyter.ipynb,atom/packages/file-icons/examples/jupyter.ipynb,3,33378844
1,329,https://github.com/cagdasyetkin/raspberryPi3/b...,10,0,IoTdemo.ipynb,IoTdemo.ipynb,4,89953189
2,580,https://github.com/Ruijin1988/Material-Design-...,10,0,6_Artistic_style_transfer_with_a_repurposed_VG...,vanilla_vae/6_Artistic_style_transfer_with_a_r...,6,90590288
3,594,https://github.com/psygrammer/dprl/blob/c4bdd2...,1000000000,100000000,Design_1.ipynb,part2/OT/3.Result/Design_1.ipynb,1,40293216
4,921,https://github.com/ArtemKupriyanov/MOOC-COURSE...,10025,10005,sklearn.datasets-2.ipynb,ML-MIPT-Yandex-spec/2.supervised_learning/Trai...,4,74897953


Here we see the first 5 rows of the dataframe. The nb_id column is particularly important: it is used to iterate over the individual notebook files, whose file names contain the corresponding ID.
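
The lookup from an nb_id to its notebook file can be sketched as follows. This is a minimal illustration, assuming the `nb_<id>.ipynb` naming and the `../data/Dataset1/notebooks` folder used by the parsing function later in this notebook. The helper names `notebook_path` and `find_missing` are hypothetical and are not part of the original code.

```python
import os

def notebook_path(nb_id, root='../data/Dataset1/notebooks'):
    """Build the expected path of a notebook file from its ID (hypothetical helper)."""
    return os.path.join(root, 'nb_%s.ipynb' % nb_id)

def find_missing(nb_ids, root='../data/Dataset1/notebooks'):
    """Return the IDs that have no notebook file under root (hypothetical helper)."""
    return [nb_id for nb_id in nb_ids
            if not os.path.isfile(notebook_path(nb_id, root))]
```

Calling `find_missing(df_nb['nb_id'])` before parsing would list any IDs in the CSV that have no matching file on disk.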

### Parsing notebooks
* Here I have created functions to parse the 6500 notebooks.
* I have extracted information about:
    * cell type
    * number of words in markdown cells
    * lines of code
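
The parsing has to cope with two notebook layouts: nbformat 4.x keeps its cells in a top-level `cells` list, while nbformat 2.x / 3.x nests them inside `worksheets`. The sketch below shows one way to walk both layouts. The generator `iter_cells` and the two toy notebooks `v4` and `v3` are made up for illustration and are not taken from the dataset.

```python
def iter_cells(nb):
    """Yield (worksheet_index, cell_index, cell) for either notebook layout."""
    if not isinstance(nb, dict):
        return
    if 'cells' in nb:
        # nbformat 4.x: cells sit at the top level, so there is no worksheet index
        for i, cell in enumerate(nb['cells']):
            yield None, i, cell
    elif 'worksheets' in nb:
        # nbformat 2.x / 3.x: cells are grouped into worksheets
        for j, sheet in enumerate(nb['worksheets']):
            if isinstance(sheet, dict):
                for k, cell in enumerate(sheet.get('cells', [])):
                    yield j, k, cell

# Toy notebooks (hypothetical), one per layout
v4 = {'cells': [{'cell_type': 'markdown', 'source': ['# Title']}]}
v3 = {'worksheets': [{'cells': [{'cell_type': 'code', 'input': ['x = 1']}]}]}

for nb in (v4, v3):
    for worksheet_index, cell_index, cell in iter_cells(nb):
        print(worksheet_index, cell_index, cell['cell_type'])
```

Note that in the older layout a code cell stores its text under `input` rather than `source`, which is why the cell-level function checks both keys.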

In [4]:
def get_all_cell_data(df):
    
    all_cells = []      

    for index, row in df.iterrows():

        f = '../data/Dataset1/notebooks/nb_%s.ipynb' % row['nb_id']
        with open(f, encoding='utf-8') as nb_file:  # notebook files are UTF-8 encoded JSON

            try:
                data = json.load(nb_file)
            except ValueError:
                # skip files that cannot be decoded or parsed as JSON
                continue

            if isinstance(data, dict): 
                keys = data.keys()
            else:
                keys = []
                
            # get the language
            nb_language = None
            # only read metadata keys when the metadata entry itself is a dict
            if 'metadata' in keys and isinstance(data['metadata'], dict):
                metadata_keys = data['metadata'].keys()
            else:
                metadata_keys = []
            if 'language_info' in metadata_keys:
                if isinstance(data['metadata']['language_info'], dict):
                    lang_keys = data['metadata']['language_info'].keys()
                else:
                    lang_keys = []  # empty list so the membership test below cannot raise
                if 'name' in lang_keys:
                    nb_language = data['metadata']['language_info']['name']
            elif 'language' in metadata_keys:
                nb_language = data['metadata']['language']

            # get data for each cell, nbformat v 4.x
            if 'cells' in keys:
                for i, c in enumerate(data['cells']):
                    cell_data = get_cell_data(c, i, row['nb_id'], nb_language)
                    all_cells.append(cell_data)
            
            # get data for each cell, nbformat v 2.x / 3.x
            elif 'worksheets' in keys:
                for j, w in enumerate(data['worksheets']):
                    if isinstance(w, dict): 
                        worksheet_keys = w.keys()
                    else:
                        worksheet_keys = []  # malformed worksheet: nothing to parse
                    if 'cells' in worksheet_keys:
                        for k, c in enumerate(w['cells']):
                            cell_data = get_cell_data(c, k, row['nb_id'], nb_language, j)
                            all_cells.append(cell_data)
                
    return all_cells

In [5]:
def get_cell_data(cell, index, nb_id, nb_language, worksheet_index = None):
    nbformat_3_mimes = ['text', 'latex', 'png', 'jpeg', 'svg', 'html', 'javascript', 'json', 'pdf', 'metadata']
    
    if isinstance(cell, dict): 
        cell_keys = cell.keys()
    else:
        cell_keys = [] 
    
    if 'cell_type' in cell_keys:
        cell_type = cell['cell_type']
    else:
        cell_type = None
    
    
    if cell_type in ['raw', 'markdown', 'heading']:  # text cells: count whitespace-separated words
        num_words = 0
        if 'source' in cell_keys:
            if isinstance(cell['source'], list):
                for line in cell['source']:
                    num_words += len(line.split())
            elif isinstance(cell['source'], str):
                num_words += len(cell['source'].split())
            else:
                num_words = None
    else:
        num_words = None
    
    if cell_type == 'code':
        lines_of_code = 0
        if 'source' in cell_keys:
            if isinstance(cell['source'], list):
                lines_of_code = len(cell['source'])
            elif isinstance(cell['source'], str):
                lines_of_code = len(cell['source'].splitlines())
            else:
                lines_of_code = None
            
        elif 'input' in cell_keys:
            if isinstance(cell['input'], list):
                lines_of_code = len(cell['input'])
            elif isinstance(cell['input'], str):
                lines_of_code = len(cell['input'].splitlines())
            else:
                lines_of_code = None
    else:
        lines_of_code = None
    
    return [nb_id, nb_language, worksheet_index, index, cell_type, num_words, lines_of_code]
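
The counting rules used above can be checked on small inputs. The sketch below restates them as two standalone helpers, `count_words` and `count_lines`. These names are hypothetical and the helpers are an illustration of the same logic, not a replacement for `get_cell_data`. A cell's text may be stored either as a list of lines or as a single string, so both cases are handled.

```python
def count_words(source):
    """Count whitespace-separated words in a cell's text (list of lines or one string)."""
    if isinstance(source, list):
        return sum(len(line.split()) for line in source)
    if isinstance(source, str):
        return len(source.split())
    return None  # unexpected type: no count available

def count_lines(source):
    """Count lines in a cell's text (list of lines or one string)."""
    if isinstance(source, list):
        return len(source)
    if isinstance(source, str):
        return len(source.splitlines())
    return None  # unexpected type: no count available

print(count_words(['Hello world\n', 'again']))  # 3
print(count_lines('x = 1\ny = 2\n'))            # 2
```

With these rules, blank lines and comment lines inside a code cell are counted as lines of code.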

The cell below takes about 10 minutes to run.

In [6]:
# passing the notebook dataframe through function to iterate over each notebook and extract information
cell_data = get_all_cell_data(df_nb)

In [7]:
df_cells = pd.DataFrame(cell_data) # creating a dataframe with extracted information
print(df_cells.shape) # getting shape of dataframe
df_cells.columns = ['nb_id','nb_language','workbook_index','cell_index','cell_type','num_words','lines_of_code'] # renaming the columns
df_cells.head(5) # printing first 5 rows of dataframe

(152204, 7)


Unnamed: 0,nb_id,nb_language,workbook_index,cell_index,cell_type,num_words,lines_of_code
0,1122,python,,0,code,,7.0
1,1122,python,,1,code,,4.0
2,1122,python,,2,code,,12.0
3,1122,python,,3,code,,1.0
4,1122,python,,4,code,,6.0


### Saving this dataframe to a CSV file

In [8]:
df_cells.to_csv('../data/CSV_files/Cells_info.csv', index=False, encoding='utf-8')

Analysis of this information is in notebook Analysis_100 in the analysis folder.