# Pandas Analysis: Summary of Findings

## Motivation and Methodology

To demonstrate that our Pandas on Ray library works as a functional drop-in replacement for Pandas, we wanted to calculate some notion of coverage. In other words: of the Pandas functionality that most users rely on, how much do we offer an implementation for?

To estimate this value, we used the public kernels available on Kaggle. First, we wrote a scraper that collects the top ~1900 Python files (IPython notebooks or Python scripts), sorted by votes, and downloads their raw source code for analysis. Then, using string matching, we tabulated which Pandas methods (e.g. `pd.*`) and DataFrame methods (e.g. `df.*`) were used in each file and how many times. Finally, we compiled those numbers into a single aggregate count of method frequency over a total of 1762 scripts.
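
The string-matching step can be sketched roughly as follows. This is a minimal illustration, not the scraper's actual code: the regular expression, the function names `count_methods` and `aggregate_counts`, and the sample sources are all assumptions made for the example. It also assumes that Pandas is imported as `pd` and that DataFrames are named `df`, so calls on differently named variables would be missed.

```python
import re
from collections import Counter

# Assumed pattern: a `pd.` or `df.` prefix followed by a Python identifier.
METHOD_PATTERN = re.compile(r"\b(pd|df)\.([A-Za-z_]\w*)")


def count_methods(source: str) -> Counter:
    """Count occurrences of `pd.<name>` and `df.<name>` in one source string."""
    return Counter(
        f"{prefix}.{name}" for prefix, name in METHOD_PATTERN.findall(source)
    )


def aggregate_counts(sources) -> Counter:
    """Sum the per-file counts into a single frequency table."""
    total = Counter()
    for source in sources:
        total.update(count_methods(source))
    return total


# Hypothetical stand-ins for two downloaded kernels.
sample_sources = [
    "import pandas as pd\ndf = pd.read_csv('a.csv')\ndf.head()\n",
    "df = pd.DataFrame(data)\nprint(df.mean())\ndf.head()\n",
]
totals = aggregate_counts(sample_sources)
print(totals.most_common())
```

A plain text match like this also counts attribute accesses and occurrences inside comments or strings, which is one reason to treat the resulting frequencies as approximate.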

## Findings

In [7]:
import pandas as pd
results = pd.read_csv('results/results.csv')
results.head()

| Unnamed: 0 | Method | Count |
|---|---|---|
| 0 | pd.read_csv | 1422 |
| 1 | pd.DataFrame | 886 |
| 2 | df.append | 792 |
| 3 | df.mean | 783 |
| 4 | df.head | 783 |


In [8]:
annot = pd.read_csv('results/annotations.csv')
annot.head()

| Unnamed: 0 | Method | Count | Finished | Notes |
|---|---|---|---|---|
| 0 | pd.read_csv | 756 | 0.0 | I/O |
| 1 | pd.DataFrame | 469 | 1.0 | |
| 2 | df.head | 438 | 1.0 | |
| 3 | df.append | 436 | 0.0 | |
| 4 | df.mean | 429 | 1.0 | |


In [18]:
results['Finished'], results['Notes'] = None, None
results.head()

| Unnamed: 0 | Method | Count | Finished | Notes |
|---|---|---|---|---|
| 0 | pd.read_csv | 1422 | | |
| 1 | pd.DataFrame | 886 | | |
| 2 | df.append | 792 | | |
| 3 | df.mean | 783 | | |
| 4 | df.head | 783 | | |


In [26]:
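def updateRow(row):
    """Copy `Finished` and `Notes` from the matching row of `annot`, if any."""
    # Methods absent from the annotations keep their None placeholders.
    if row['Method'] in annot['Method'].values:
        annotRow = annot[annot['Method'] == row['Method']]
        # Take the first match in case a method is annotated more than once.
        row['Finished'] = annotRow['Finished'].values[0]
        row['Notes'] = annotRow['Notes'].values[0]

    return row

# Apply row by row (axis=1) so each method is looked up in `annot`.
results = results.apply(updateRow, axis=1)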
results.head()

| Unnamed: 0 | Method | Count | Finished | Notes |
|---|---|---|---|---|
| 0 | pd.read_csv | 1422 | 0.0 | I/O |
| 1 | pd.DataFrame | 886 | 1.0 | |
| 2 | df.append | 792 | 0.0 | |
| 3 | df.mean | 783 | 1.0 | |
| 4 | df.head | 783 | 1.0 | |
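
The row-wise lookup above could also be written as a left join, and the joined table gives one possible way to turn the annotations into a coverage figure. The sketch below is an assumption-laden illustration rather than the notebook's own method: it rebuilds only the five rows shown in the outputs instead of reading the CSV files, and the usage-weighted definition of coverage (the share of counted calls whose method is marked `Finished`) is a choice made for this example.

```python
import pandas as pd

# Only the five rows displayed above, rebuilt by hand for illustration.
results = pd.DataFrame({
    "Method": ["pd.read_csv", "pd.DataFrame", "df.append", "df.mean", "df.head"],
    "Count": [1422, 886, 792, 783, 783],
})
annot = pd.DataFrame({
    "Method": ["pd.read_csv", "pd.DataFrame", "df.head", "df.append", "df.mean"],
    "Finished": [0.0, 1.0, 1.0, 0.0, 1.0],
    "Notes": ["I/O", None, None, None, None],
})

# Keep the first annotation per method, mirroring `.values[0]` in updateRow.
annot_unique = annot.drop_duplicates("Method")[["Method", "Finished", "Notes"]]

# A left join keeps every row of `results`, in its original order;
# methods without an annotation get NaN in the new columns.
merged = results.merge(annot_unique, on="Method", how="left")

# Assumed definition: usage-weighted coverage, treating unannotated
# methods as not finished.
finished = merged["Finished"].fillna(0.0)
coverage = (merged["Count"] * finished).sum() / merged["Count"].sum()
print(merged)
print(f"weighted coverage over these rows: {coverage:.3f}")
```

Because this uses only the first five rows, the printed figure describes those rows alone and says nothing about coverage over the full results file.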
