# Minor EDA on Partially Annotated Data

This is a very minimal exploratory data analysis of the dataset constructed for testing papers-without-code. The purpose is to quantify "what proportion of papers likely use code but do not link to it in any way?"

You can read about the dataset construction in the [data README](./README.md).

In [None]:
# Do some basic data reading and prep
import pandas as pd
import seaborn as sns
sns.set_theme(style="whitegrid")

df = pd.read_csv("annotated.csv")

# We have only annotated 25 so far
df = df[:25]
df.head()

## "How many papers likely use code?"

Out of the 25 papers annotated so far, 20 "likely used code" as part of the work behind the paper. The other 5 are mostly math or theory papers that introduce an algorithm via mathematical notation.

In [None]:
sns.countplot(x=df.best_guess_paper_used_code)
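
The plot shows the split visually; the same counts can also be read off numerically. The following is a minimal sketch, assuming the column holds plain string labels. The helper name `label_shares` and the `toy` frame are hypothetical, and the toy rows are made up for illustration rather than taken from the annotations.

```python
import pandas as pd


def label_shares(frame: pd.DataFrame, column: str) -> pd.DataFrame:
    """Return the count and the share of each label found in `column`."""
    counts = frame[column].value_counts()
    shares = frame[column].value_counts(normalize=True)
    return pd.DataFrame({"count": counts, "share": shares})


# Toy stand-in for the annotated data (made-up rows, not the real annotations)
toy = pd.DataFrame({"best_guess_paper_used_code": ["yes", "yes", "yes", "no"]})
summary = label_shares(toy, "best_guess_paper_used_code")
print(summary)
```

In the notebook itself, the equivalent call would be `label_shares(df, "best_guess_paper_used_code")`.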

In [None]:
# Filter out the papers where we don't think code was used
# Usually math or theory papers
# .copy() so that adding columns later does not trigger SettingWithCopyWarning
df = df[df.best_guess_paper_used_code == "yes"].copy()
len(df)

## "Of the 20, how many papers can we find repositories for?"

Out of the remaining 20 papers, we can find related repositories for 13. Where we can't find a repository, either it was not discoverable within the ~10 minutes I spent searching for each one or, in one case, I assume the code is private because both authors are from private industry.

In [None]:
df["code_found"] = df.code_repository_link.notna()
sns.countplot(x=df.code_found)
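
Because the found flag is boolean, its sum and mean give the count and the share of papers with a repository recorded. A small sketch of that arithmetic follows, using a made-up `toy` frame with placeholder `example.com` links (both hypothetical, not real annotations).

```python
import pandas as pd

# Toy stand-in: a missing link means no repository was recorded for that paper
toy = pd.DataFrame(
    {
        "code_repository_link": [
            "https://example.com/repo-a",
            None,
            "https://example.com/repo-b",
            None,
        ]
    }
)

code_found = toy["code_repository_link"].notna()
n_found = int(code_found.sum())   # True counts as 1, False as 0
share_found = code_found.mean()   # fraction of rows with a link
print(f"{n_found} of {len(toy)} papers ({share_found:.0%}) have a repository recorded")
```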

## "How do the papers break down by whether code was found and whether it had to be found manually (it _wasn't_ linked in the paper)?"

Of the 13 papers where related code was found, 8 linked directly to the code, and for the other 5 I had to search for the repository manually.

A note on the odd case of "no code was found but a code repository _was_ linked in the paper": the linked code has since been deleted (or was never published). However, I found a similar repository by one of the authors that I feel would be useful to serve back to users.

In [None]:
sns.countplot(x="code_found", hue="code_repository_linked_in_paper", data=df)
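
The grouped bars correspond to a two-way contingency table, which `pd.crosstab` can print directly. The sketch below assumes `code_repository_linked_in_paper` takes the values "yes" and "no" (only "yes" appears in this notebook, so "no" is an assumption), and it runs on a made-up `toy` frame rather than the real annotations. The last line computes the share of rows with no link in the paper, which is the quantity the question at the top of this notebook asks about.

```python
import pandas as pd

# Toy stand-in for the filtered data (made-up rows, not the real annotations)
toy = pd.DataFrame(
    {
        "code_found": [True, True, True, False, False],
        "code_repository_linked_in_paper": ["yes", "yes", "no", "no", "yes"],
    }
)

# Readable row labels instead of True/False
found_label = toy["code_found"].map({True: "found", False: "not found"})

table = pd.crosstab(
    found_label,
    toy["code_repository_linked_in_paper"],
    margins=True,  # adds the "All" row and column totals
)
print(table)

# Share of papers that did not link a repository in the paper
share_unlinked = (toy["code_repository_linked_in_paper"] == "no").mean()
print(f"{share_unlinked:.0%} of papers did not link to code")
```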

In [None]:
df[~df.code_found & (df.code_repository_linked_in_paper == "yes")].iloc[0].comments