# Analysis_100
This notebook is an analysis of the data prepared in Prep_100.

## Purpose 
* This notebook reads in the small dataset from Prep_100 and analyses it for RQ1
* What will be looked at:
    * Notebook language
    * Markdown vs code cells

In [1]:
#importing relevant libraries
import os
import json
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re

%matplotlib inline

In [2]:
df_cell = pd.read_csv('../data/CSV_files/Cells_info.csv') # reading in csv 

In [3]:
df_cell.head()

Unnamed: 0,nb_id,nb_language,workbook_index,cell_index,cell_type,num_words,lines_of_code
0,1122,python,,0,code,,7.0
1,1122,python,,1,code,,4.0
2,1122,python,,2,code,,12.0
3,1122,python,,3,code,,1.0
4,1122,python,,4,code,,6.0


## Notebook Language

In [4]:
df_cell.groupby(['nb_id', 'nb_language']).size().groupby('nb_language').size().sort_values(ascending = False)

nb_language
python      4476
julia         52
R             49
c++           12
lua            9
Julia          7
octave         2
c              2
ruby           1
http           1
elixir         1
ansible        1
Any text       1
dtype: int64

Unsurprisingly, Python is by far the most popular language among these Jupyter notebooks, with 4476 notebooks. A few other languages appear, but none comes close: the next most common, Julia and R, have around 50 notebooks each.
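
The output lists `julia` and `Julia` as separate languages, which suggests the two labels differ only in capitalisation. Below is a minimal sketch of normalising case before counting. The frame `toy` and its values are made up for illustration and are not taken from `Cells_info.csv`. Whether lowercasing is the right normalisation for the real labels is an assumption.

```python
import pandas as pd

# Hypothetical toy data: one row per cell, mirroring the nb_id / nb_language columns.
toy = pd.DataFrame({
    "nb_id": [1, 1, 2, 3, 3, 4],
    "nb_language": ["python", "python", "julia", "Julia", "Julia", "R"],
})

# Keep one row per notebook, lowercase the label, then count notebooks per language.
per_notebook = toy.drop_duplicates("nb_id")
lang_counts = per_notebook["nb_language"].str.lower().value_counts()
print(lang_counts)
```

In this toy frame the two Julia notebooks end up under a single `julia` label. Note that lowercasing also turns `R` into `r`.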

## Markdown vs Code

In [5]:
len(df_cell[df_cell['cell_type'] == 'code'])

99988

There are 99988 code cells in the 6529 notebooks. 

In [6]:
len(df_cell[df_cell['cell_type'] == 'markdown'])

50107

There are 50107 markdown cells altogether in the 6529 notebooks. This is roughly a 2:1 ratio of code to markdown cells. This level of markdown use is promising for the advancement of literate programming.
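
The ratio can be checked directly from the two counts printed above. This is plain arithmetic on those outputs, and the variable names are only illustrative.

```python
# Counts reported above for the two cell types.
code_cells_total = 99988
markdown_cells_total = 50107

ratio = code_cells_total / markdown_cells_total
markdown_share = markdown_cells_total / (code_cells_total + markdown_cells_total)

print(f"code : markdown = {ratio:.2f} : 1")
print(f"markdown share of code + markdown cells = {markdown_share:.1%}")
```

The share is computed over code and markdown cells only. Any other cell types in the data are left out of the denominator.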

## Average number of cells per notebook

* Grouping by nb_id and counting cell_index gives the number of cells in each notebook
* I then sort these counts and take the mean number of cells across the roughly 6500 notebooks

In [7]:
df_cell.groupby('nb_id')['cell_index'].count().sort_values(ascending=False).mean()

26.187887130075705

We can see here that there are on average about 26 cells per notebook.
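
As a small illustration of the grouping step, the sketch below counts cells per notebook on a made-up frame (`toy` and its ids are hypothetical). It also shows that sorting the counts does not change their mean, so the sort only matters for later inspection of the largest notebooks.

```python
import pandas as pd

# Hypothetical toy data: notebook 10 has 3 cells, notebook 20 has 1, notebook 30 has 2.
toy = pd.DataFrame({
    "nb_id": [10, 10, 10, 20, 30, 30],
    "cell_index": [0, 1, 2, 0, 0, 1],
})

# Number of cells in each notebook.
cells_per_nb = toy.groupby("nb_id")["cell_index"].count()

# The mean is the same with or without sorting.
mean_unsorted = cells_per_nb.mean()
mean_sorted = cells_per_nb.sort_values(ascending=False).mean()

print(cells_per_nb.to_dict())
print(mean_unsorted, mean_sorted)
```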

In [8]:
cells = df_cell.groupby('nb_id')['cell_index'].count().sort_values(ascending=False)
cells.head(5)

nb_id
757289    514
643702    469
189246    383
689876    323
266880    315
Name: cell_index, dtype: int64

Here are the 5 notebooks with the most cells.
<br> 
They range from 315 to 514 cells. That is a lot of cells per notebook. This may be because users use their notebooks for messy analysis and do not clean up their cells, or because some users prefer to keep their whole analysis in one large notebook rather than break it up.

In [9]:
(cells > 26).value_counts()

False    3955
True     1857
Name: cell_index, dtype: int64

There are 1857 notebooks above the average cell count of 26 and 3955 at or below it. 
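
The comparison above hard-codes the rounded mean of 26. One alternative, sketched here on made-up counts (`toy_cells` is hypothetical), is to derive the threshold from the data itself so the two stay in sync if the dataset changes.

```python
import pandas as pd

# Hypothetical per-notebook cell counts, indexed by made-up notebook ids.
toy_cells = pd.Series([2, 3, 4, 5, 36], index=[1, 2, 3, 4, 5], name="cell_index")

# Use the computed mean as the threshold instead of a hard-coded number.
threshold = toy_cells.mean()
above = int((toy_cells > threshold).sum())
at_or_below = int((toy_cells <= threshold).sum())

print(threshold, above, at_or_below)
```

With a strict `>` comparison, the `False` group includes any notebook whose count equals the threshold exactly.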

## Average number of code cells per notebook

In [10]:
df_cells_code = df_cell.loc[df_cell['cell_type'] == 'code']

In [11]:
df_cells_code.groupby('nb_id')['cell_index'].count().sort_values(ascending=False).mean()

17.523308797756748

Here, we can see that the average number of code cells in a notebook is 17.5.
This is about 2/3 of the overall average of 26 cells.

In [12]:
code_cells = df_cells_code.groupby('nb_id')['cell_index'].count().sort_values(ascending=False)
code_cells.head(5)

nb_id
643702    384
189246    326
757289    309
689876    270
103154    224
Name: cell_index, dtype: int64

Here are the 5 notebooks with the most code cells.
<br>
Each of them has more than 200 code cells. Notebook 757289, which has the most cells of any notebook in this dataset, contains 309 code cells. 

In [13]:
(code_cells > 17.5).value_counts()

False    3915
True     1791
Name: cell_index, dtype: int64

There are 1791 notebooks with more code cells than the average of 17.5 and 3915 notebooks with fewer.

## Average number of markdown cells per notebook

In [14]:
df_cells_markd = df_cell.loc[df_cell['cell_type'] == 'markdown']

In [15]:
df_cells_markd.groupby('nb_id')['cell_index'].count().sort_values(ascending=False).mean()

12.831498079385403

Here, we can see there are 12.8 (13 if rounded) markdown cells per notebook.
<br>
This is about half of the overall average of 26 cells, although the mean here is taken only over notebooks that contain at least one markdown cell. It suggests people are explaining their code, as they are actively using markdown cells to explain or comment on their analysis.
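
The per-notebook mean depends on which notebooks are in the denominator. Grouping the markdown-only frame drops notebooks with no markdown cells, so the mean of 12.8 is taken over the notebooks that have at least one (2656 + 1249 in the threshold counts for this section). It is not taken over all notebooks in the per-notebook cell counts (3955 + 1857). A rough check using only figures printed in this notebook, with illustrative variable names:

```python
markdown_cells_total = 50107      # total markdown cells
nbs_with_markdown = 2656 + 1249   # notebooks with at least one markdown cell
nbs_with_cells = 3955 + 1857      # notebooks in the per-notebook cell counts

# Mean over notebooks that contain markdown vs. mean over all notebooks with cells.
mean_given_markdown = markdown_cells_total / nbs_with_markdown
mean_over_all = markdown_cells_total / nbs_with_cells

print(round(mean_given_markdown, 2))
print(round(mean_over_all, 2))
```

The first figure reproduces the mean reported above. The second is lower because notebooks without markdown cells count as zero.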

In [16]:
md_cells = df_cells_markd.groupby('nb_id')['cell_index'].count().sort_values(ascending=False)
md_cells.head(5)

nb_id
44838      214
266880     207
757289     168
1240165    157
1240005    157
Name: cell_index, dtype: int64

Here are the 5 notebooks with the most markdown cells, ranging from 157 to 214. This is a large number of markdown cells for a notebook to have, which is promising for the advancement of literate programming. Again, we see notebook 757289, this time with 168 markdown cells.

In [17]:
(md_cells > 12.8).value_counts()

False    2656
True     1249
Name: cell_index, dtype: int64

Here we can see there are 1249 notebooks with more markdown cells than the average of 12.8 and 2656 notebooks with fewer. 

## Average lines of code per notebook

In [18]:
#dropping NaN values in lines of code columns as will be null for markdown cells
code_lines = df_cell.dropna(subset=['lines_of_code']) 

In [19]:
code_lines.groupby('nb_id')['lines_of_code'].sum().mean()

136.28338590956886

There are on average 136 lines of code per notebook, counting code cells only.

In [20]:
lines_code = code_lines.groupby('nb_id')['lines_of_code'].sum().sort_values(ascending=False)
lines_code.head(5)

nb_id
1163984    3211.0
1163985    3211.0
1163983    3211.0
1124656    1776.0
282393     1576.0
Name: lines_of_code, dtype: float64

Here we can see the 5 notebooks with the most lines of code, ranging from 1576 to 3211. Interestingly, on examination I found that the top 3 notebooks, which have the same number of lines of code, actually contain the same content. This may be due to a GitHub user uploading the same code as different notebooks. 
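
Ties like these can be flagged automatically as candidates for manual inspection. The sketch below uses made-up totals (`toy_lines` and its ids are hypothetical) and pandas' `Series.duplicated`. An equal line total is only a hint, so the notebook contents still need to be compared to confirm a duplicate.

```python
import pandas as pd

# Hypothetical per-notebook line totals; ids and values are made up.
toy_lines = pd.Series(
    [300.0, 300.0, 300.0, 120.0, 80.0],
    index=pd.Index([101, 102, 103, 104, 105], name="nb_id"),
    name="lines_of_code",
)

# keep=False marks every member of a tied group, not just the later repeats.
tied = toy_lines[toy_lines.duplicated(keep=False)]
print(tied.index.tolist())
```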

In [21]:
(lines_code > 136).value_counts()

False    3733
True     1973
Name: lines_of_code, dtype: int64

There are 1973 notebooks with more than the average lines of code per notebook and 3733 notebooks with below the average. 

## Average number of words per notebook

In [22]:
# dropping NaN values as those are not markdown cells
words_markd = df_cell.dropna(subset=['num_words'])

In [23]:
words_markd.groupby('nb_id')['num_words'].sum().mean()

551.530839231547

There are 551.5 markdown words on average per notebook. As a rough guide, 500 words is about one A4 page of writing. This again suggests that notebook users are explaining and commenting on their analysis.

In [24]:
markd_words = words_markd.groupby('nb_id')['num_words'].sum().sort_values(ascending=False)
markd_words.head()

nb_id
1198190    8794.0
42835      8563.0
1060294    7139.0
1240005    6575.0
1240165    6575.0
Name: num_words, dtype: float64

Here are the 5 notebooks with the most markdown words. 

In [25]:
(markd_words > 551).value_counts()

False    2831
True     1125
Name: num_words, dtype: int64

There are 1125 notebooks with more markdown words than the average and 2831 notebooks with fewer.

The results of this analysis are in notebook Results_100