EDA is a small miscellany of tools to facilitate exploratory data analysis and other common data science tasks. It is released under the MIT license.
To install with pip:

    pip install git+https://github.com/alexklapheke/eda.git

To install manually, place the module in a directory such as `~/bin/python`, then add the following line to your `~/.bashrc`:

    export PYTHONPATH="${PYTHONPATH}:$HOME/bin/python"

You should then be able to `import eda` in Python applications.
This module adds methods to Pandas data frames for exploring datasets. It is therefore only usable if you `import pandas`. It includes the following methods:

- `df.summary()`: Print a summary of the data frame `df`, including missing data and histograms of numeric columns. A good jumping-off point for exploring a new dataset.
- `df.missing()`: Show a bar plot of missing data by column.
- `df.missing_by(col)`: Like `missing`, but grouped by column `col`.
- `df.missing_map()`: Show a heatmap of missing data to uncover patterns.
- `df.misordered(col1, col2, ...)`: Show rows whose columns are in the wrong order; e.g., `df.misordered("start", "end")` will show rows in which the end date precedes the start date.
- `benford(iterable)`: Given an iterable of numerics, return the proportion of each first digit, to check conformity to Benford's Law. You can feed the output into `sparkline` (be sure to set `width=9`). The result should look like `█▅▃▂▂▂▁▁▁`.
- `benford_plot(iterable)`: Show a bar plot comparing the proportions of first digits to those predicted by Benford's Law.
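For intuition, a first-digit check boils down to tallying leading digits and comparing the tallies with the logarithmic distribution Benford's Law predicts. A minimal sketch (the function name and return type here are illustrative, not the module's actual API):

```python
import math
from collections import Counter

def first_digit_proportions(numbers):
    """Proportion of each leading digit (1-9) in a collection of numbers."""
    # Strip sign, leading zeros, and the decimal point to find the first digit
    digits = [int(str(abs(n)).lstrip("0.")[0]) for n in numbers if n]
    counts = Counter(digits)
    return [counts[d] / len(digits) for d in range(1, 10)]

# Benford's Law predicts P(d) = log10(1 + 1/d) for d = 1..9,
# i.e. roughly 30.1% of values lead with 1, 17.6% with 2, and so on.
benford_expected = [math.log10(1 + 1 / d) for d in range(1, 10)]
```

A dataset conforming to the law should yield proportions close to `benford_expected`, which is what the descending `█▅▃▂▂▂▁▁▁` sparkline shape reflects.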
This module contains standalone functions for evaluating models.

- `accuracy_metrics(y_true, y_pred)`: Return a labeled confusion matrix, along with measures such as sensitivity and specificity.
- `multiaccuracy(y_true, y_pred)`: Given the results of a multiclass classification, show a pivot table of class predictions.
- `multiaccuracy_heatmap(y_true, y_pred)`: Like `multiaccuracy`, but show a heatmap.
- `fuzzy_accuracy(y_true, y_pred, tolerance)`: For a multiclass classification of ordinal data, show the percent of results that were within `tolerance` of the true class.
- `cohens_kappa(y_pred1, y_pred2)`: Given the results of two models, calculate the degree to which they agree using Cohen's kappa, from 0 (no agreement) to 1 (perfect agreement).
- `test_LINE(y_true, y_pred)`: Show plots to help test the "LINE" assumptions (linearity, independence, normality, equal variance) of a linear regression.
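For reference, Cohen's kappa compares the observed agreement between two sets of predictions with the agreement expected by chance from each model's label frequencies. A minimal sketch of the computation (independent of the module's implementation):

```python
def cohens_kappa(y_pred1, y_pred2):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    n = len(y_pred1)
    labels = set(y_pred1) | set(y_pred2)
    # Observed proportion of agreement
    p_o = sum(a == b for a, b in zip(y_pred1, y_pred2)) / n
    # Agreement expected by chance, from each rater's marginal frequencies
    p_e = sum(
        (list(y_pred1).count(lab) / n) * (list(y_pred2).count(lab) / n)
        for lab in labels
    )
    return (p_o - p_e) / (1 - p_e)
```

Identical predictions give kappa 1; predictions that agree no more often than chance give kappa 0.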
This module is for machine learning.

- `DBSCAN()`: An implementation of the DBSCAN clustering algorithm that avoids the high memory overhead of scikit-learn's implementation (sklearn computes a distance matrix, which is O(n²) in space in the number of data points and can easily use several GB of memory). Uses sklearn's `.fit()`/`.predict()` convention and can be used in pipelines.
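The memory saving comes from querying neighbours one point at a time instead of materialising the full pairwise distance matrix, keeping extra memory O(n) at the cost of recomputing distances. A bare-bones sketch of that idea (not the module's actual implementation, which wraps it in the `.fit()`/`.predict()` convention):

```python
import numpy as np

def dbscan(X, eps=0.5, min_samples=5):
    """Label rows of X by DBSCAN; returns -1 for noise points."""
    n = len(X)
    labels = np.full(n, -1)
    visited = np.zeros(n, dtype=bool)
    cluster = 0

    def neighbours(i):
        # Compute only one row of the distance matrix at a time
        d = np.linalg.norm(X - X[i], axis=1)
        return np.flatnonzero(d <= eps)

    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        seeds = list(neighbours(i))
        if len(seeds) < min_samples:
            continue  # noise (may still be claimed as a border point later)
        labels[i] = cluster
        while seeds:  # expand the cluster from core points
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster
            if not visited[j]:
                visited[j] = True
                nbrs = neighbours(j)
                if len(nbrs) >= min_samples:
                    seeds.extend(nbrs)
        cluster += 1
    return labels
```

Spatial indexes (KD-trees, ball trees) can make the per-point neighbour query sublinear, but even this naive version never holds more than one row of distances in memory.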
This module provides some convenience functions for dealing with natural language.

- `tf_idf`: Compute tf-idf scores.
- `logodds_dirichlet`: Compute the log-odds ratio with an uninformative Dirichlet prior.
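As background, tf-idf weights a term by how often it appears in a document (tf) and how rare it is across the corpus (idf). Many weighting variants exist; this sketch uses raw term frequency and the classic idf log(N / df), which may differ from the module's exact formula:

```python
import math
from collections import Counter

def tf_idf(docs):
    """tf-idf for a list of tokenised documents (lists of words).
    Returns one {term: score} dict per document."""
    n = len(docs)
    # Document frequency: number of documents containing each term
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc)
        scores.append({
            term: (count / total) * math.log(n / df[term])
            for term, count in tf.items()
        })
    return scores
```

Terms appearing in every document get an idf of log(1) = 0, so only distinctive terms receive positive weight.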
Like `summary`, this module adds methods to Pandas data frames.

- `sparkline(iterable)`: Produce a sparkline given an iterable of numerics or date/time objects. For example, `sparkline(range(8))` produces `▁▂▃▄▅▆▇█`.
- `series.sparkline()`, `df.sparkline(col)`: Produce a sparkline of the given series or column of the data frame.
- `df.data_dictionary()`: Return a data dictionary in GitHub-flavored markdown, suitable for inclusion in a GitHub README (this is a wrapper for `summary.summary()`).
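For intuition, a numeric-only sparkline can be produced by scaling each value onto the eight Unicode block characters. A minimal sketch (the module's own `sparkline` also handles date/time objects and options such as `width`, which this does not):

```python
def sparkline(values):
    """Render an iterable of numbers as a one-line Unicode sparkline."""
    blocks = "▁▂▃▄▅▆▇█"
    values = list(values)
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1  # constant input: avoid dividing by zero
    return "".join(
        blocks[round((v - lo) * (len(blocks) - 1) / span)] for v in values
    )

sparkline(range(8))  # → "▁▂▃▄▅▆▇█"
```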