# Assert Context Analysis

This notebook analyses the circumstances that lead to writing an `assert` statement.

Defining context is not trivial. We start with the simplest case, where the context is only the Python statement immediately above the `assert` statement.
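As a minimal sketch of this definition, the context of a top-level assert can be read off the module body with the standard `ast` module. The helper name `preceding_statement` and the sample cell are hypothetical and not part of the notebook's pipeline.

```python
import ast
from typing import Optional


def preceding_statement(source: str) -> Optional[str]:
    """Return the source text of the top-level statement directly above the
    first top-level assert, or None if there is no such statement.

    Hypothetical helper for illustration only.
    """
    body = ast.parse(source).body
    for index, node in enumerate(body):
        if isinstance(node, ast.Assert):
            if index == 0:
                return None  # the assert is the first statement in the cell
            return ast.get_source_segment(source, body[index - 1])
    return None  # no top-level assert in this cell


cell = "numbers = [1, 2, 3]\nassert len(numbers) == 3\nprint('ok')"
print(preceding_statement(cell))  # numbers = [1, 2, 3]
```

Comments are not part of the AST, so under this definition a comment sitting between the statement and the assert is ignored.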

We only consider cells that contain assert statements at the top-most level. This excludes assertions nested inside, say, function definitions, control-flow statements and loops, because in these cases our definition of context becomes invalid.
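To illustrate the distinction, the sketch below contrasts top-level assertions (direct children of the module body) with nested ones found by a full traversal of the tree. The function name `count_asserts` and the sample cell are illustrative assumptions.

```python
import ast
from typing import Tuple


def count_asserts(source: str) -> Tuple[int, int]:
    """Return (top_level, nested) counts of assert statements in a cell."""
    tree = ast.parse(source)
    # Direct children of the module are the top-level statements.
    top_level = sum(isinstance(node, ast.Assert) for node in tree.body)
    # ast.walk visits every node, including those nested in other statements.
    total = sum(isinstance(node, ast.Assert) for node in ast.walk(tree))
    return top_level, total - top_level


cell = """
def check(x):
    assert x > 0      # nested inside a function definition

for i in range(3):
    assert i >= 0     # nested inside a loop

assert check(1) is None  # top level
"""
print(count_asserts(cell))  # (1, 2)
```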

We also restrict the analysis to cells that have a single assert statement (to ensure that the context applies only to that particular assertion) and whose AST contains more than the `Assert` node (so we exclude cells that contain nothing but an assertion).
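The two conditions can be sketched as a single predicate over a cell's top-level statements. The name `keep_cell` and the toy cells are hypothetical and chosen only to show which kinds of cell survive the filter.

```python
import ast


def keep_cell(source: str) -> bool:
    """True if the cell has exactly one top-level assert plus at least one
    other top-level statement."""
    body = ast.parse(source).body
    n_asserts = sum(isinstance(node, ast.Assert) for node in body)
    return n_asserts == 1 and len(body) > n_asserts


examples = {
    "x = 1\nassert x == 1": True,             # one assert with context
    "assert True": False,                     # assert only, no context
    "x = 1\nassert x\nassert x == 1": False,  # two asserts
    "# assert x": False,                      # fully commented out
}
for source, expected in examples.items():
    print(repr(source), keep_cell(source) == expected)
```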

**TODO**: ensure that our sample size is sufficiently large and representative of the entire population after applying all those filters.

**TODO**: I know that we will run into the same problem as we faced with the visualisation: how do you know if the context statement is related to the assert statement?

In [1]:
import pandas as pd
import ast
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

sns.set_theme(context="talk", style="whitegrid", palette="colorblind")

In [15]:
asserts = pd.read_csv(
    "data/shome2023notebook/quaranta2021kgtorrent-assert-content.csv",
    header=None,
    names=["cell_type", "source", "notebook"],
)
asserts

Unnamed: 0,cell_type,source,notebook
26,code,"def make_submission(test, submission):\n pr...",data/quaranta2021kgtorrent/KT_dataset/akashsup...
85,code,# Lets check with assert statement\n# Assert ...,data/quaranta2021kgtorrent/KT_dataset/elifnkar...
86,code,"# In order to run all code, we need to make th...",data/quaranta2021kgtorrent/KT_dataset/elifnkar...
87,code,assert data['Type 2'].notnull().all() # retur...,data/quaranta2021kgtorrent/KT_dataset/elifnkar...
89,code,assert data['Type 2'].notnull().all() # retur...,data/quaranta2021kgtorrent/KT_dataset/elifnkar...
...,...,...,...
10,code,def get_news_dropList():\n return ['sourceT...,data/quaranta2021kgtorrent/KT_dataset/enders12...
27,code,"def main_block(x, filters, n, strides, dropout...",data/quaranta2021kgtorrent/KT_dataset/varanr_e...
16,code,"assert len(predictions) == len(test_data), 'Nu...",data/quaranta2021kgtorrent/KT_dataset/fbusche_...
12,code,"def get_couples(structure):\n """"""\n For ...",data/quaranta2021kgtorrent/KT_dataset/jamescha...


# Pre-processing

+ Remove cells that don't have a valid AST
+ Remove cells with no top-level nodes (all commented out)
+ Remove cells that do not have exactly one top-level `Assert` node
+ Remove cells with no top-level nodes other than the `Assert` node

In [4]:
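from typing import Optional


def get_ast(source: str) -> Optional[ast.Module]:
    """Parse a cell's source; return None if it cannot be parsed."""
    try:
        return ast.parse(source)
    # SyntaxError: invalid Python; ValueError: e.g. null bytes in the source;
    # TypeError: non-string source such as NaN; RecursionError: deep nesting.
    except (SyntaxError, ValueError, TypeError, RecursionError):
        return None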

In [16]:
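asserts.loc[:, "ast"] = asserts["source"].apply(get_ast)
asserts = asserts.loc[asserts["ast"].notna()]
# ast.walk() always yields the Module node itself, so test the module body
# to detect cells with no top-level statements (e.g. all commented out).
asserts = asserts.loc[asserts["ast"].map(lambda tree: bool(tree.body))]
asserts.shape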



(13441, 4)

In [19]:
def has_one_top_level_assert(nodes: list) -> bool:
    """True if exactly one of the top-level nodes is an `Assert`."""
    return sum(isinstance(node, ast.Assert) for node in nodes) == 1


def has_other_top_level_nodes(nodes: list) -> bool:
    """True if at least one top-level node is not an `Assert`."""
    return any(not isinstance(node, ast.Assert) for node in nodes)


asserts.loc[:, "top_nodes"] = asserts.loc[:, "ast"].map(lambda x: x.body)
asserts = asserts.loc[asserts["top_nodes"].map(has_one_top_level_assert)]
asserts = asserts.loc[asserts["top_nodes"].map(has_other_top_level_nodes)]
asserts.shape

(2144, 5)

In [20]:
# NOTE: random sample of cells
for _, source in asserts["source"].sample(5).items():
    print("==========")
    print(source)

==========
# Exercise 4
# Create a variable named numbers and assign it a list of numbers, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
assert numbers == [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], "Ensure the variable contains the numbers 1-10 in order."
print("Exercise 4 is correct.")
==========
# Exercise 95
# Write a function called lowest_priced_book that takes in the above defined list of dictionaries "books" and returns the dictionary containing the title, price, and author of the book with the lowest priced book.
def lowest_price_book(x):
    #return min([i['price'] for i in x if 'price' in i])
    min_price = min([i['price'] for i in x if 'price' in i])
    if min_price in x:
        return x

assert lowest_price_book(books) == {
    "title": "Weapons of Math Destruction",
    "author": "Cathy O'Neil",
    "price": 17.44
}
print("Exercise 95 is complete.")
==========
# Exercise 93
# Write a function named get_average_book_price that takes in a list of dictionaries and returns the average book 