Initial alpha commit
fbertrand27 committed May 26, 2020
1 parent b6978c7 commit e3bd4b7
Showing 53 changed files with 5,098 additions and 1 deletion.
9 changes: 9 additions & 0 deletions .gitignore
@@ -1,3 +1,11 @@
# CUSTOM SWEETVIZ DEV
internal_tests/
.idea/
MainTest.py
sweetviz-temp
Notes.txt
Override.ini

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
@@ -27,6 +35,7 @@ share/python-wheels/
*.egg
MANIFEST


# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
6 changes: 6 additions & 0 deletions MANIFEST.in
@@ -0,0 +1,6 @@
include LICENSE
include README.md
recursive-include sweetviz/fonts *.*
recursive-include sweetviz/mpl_styles *.*
recursive-include sweetviz/templates *.*
include sweetviz/sweetviz_defaults.ini
111 changes: 110 additions & 1 deletion README.md
@@ -1 +1,110 @@
# sweetviz
![Sweetviz Logo](./docs/images/logo.png)

Sweetviz is an open source Python library that generates beautiful, high-density visualizations to kickstart EDA (Exploratory Data Analysis) with a single line of code. Output is a fully self-contained HTML application.

The system is built around quickly **visualizing target values** and **comparing datasets**. Its goal is to help quick analysis of target characteristics, training vs testing data, and other such data characterization tasks.

**Note: Sweetviz is in the ALPHA TESTING PHASE.** Core functionality is complete; please let me know if you run into any data, compatibility or install issues! Thank you for [reporting any BUGS in the issue tracking system here](https://github.com/fbdesignpro/sweetviz/issues), and I welcome your feedback and questions on usage/features [in our Discourse server (you should be able to log in with your GitHub account!)](https://sweetviz.fbdesignpro.com).

# Features
![Features](./docs/images/features.png)
- Target analysis
    - How target values (boolean or numerical) relate to other features
- Visualize and compare
    - Distinct datasets (e.g. training vs test data)
    - Intra-set characteristics (e.g. male versus female)
- Mixed-type associations
    - Sweetviz integrates associations for numerical (Pearson's correlation), categorical (uncertainty coefficient) and categorical-numerical (correlation ratio) datatypes seamlessly, to provide maximum information for all data types.
- Type inference: automatically detects numerical, categorical and text features, with optional manual overrides
- Summary information:
    - Type, unique values, missing values, duplicate rows, most frequent values
- Numerical analysis:
    - min/max/range, quartiles, mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
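
To make the categorical-numerical association mentioned above more concrete, here is a minimal pure-Python sketch of the correlation ratio (eta). It is an illustration of the standard definition only and is not taken from the Sweetviz source, whose implementation may differ; the function name `correlation_ratio` is an assumption made for this example.

```python
import math
from collections import defaultdict


def correlation_ratio(categories, values):
    """Return eta in [0, 1]: the share of the spread in `values`
    that is explained by grouping them by `categories`.

    Illustrative sketch only, not Sweetviz's actual code.
    """
    # Group the numerical values by their category label.
    groups = defaultdict(list)
    for category, value in zip(categories, values):
        groups[category].append(value)

    overall_mean = sum(values) / len(values)

    # Weighted squared distance of each group mean from the overall mean.
    ss_between = sum(
        len(group) * (sum(group) / len(group) - overall_mean) ** 2
        for group in groups.values()
    )
    # Total squared distance of every value from the overall mean.
    ss_total = sum((value - overall_mean) ** 2 for value in values)

    if ss_total == 0:
        # All values are identical, so there is nothing to explain.
        return 0.0
    return math.sqrt(ss_between / ss_total)


# The category fully determines the value.
perfect = correlation_ratio(["a", "a", "b", "b"], [1.0, 1.0, 5.0, 5.0])
# Both categories have the same mean, so the category explains nothing.
unrelated = correlation_ratio(["a", "a", "b", "b"], [1.0, 5.0, 1.0, 5.0])
```

A value near 1 means that knowing the category almost pins down the number; a value near 0 means the category carries little information about it.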

# Installation
## Using pip
Currently, the best way to install sweetviz (other than from source) is to use pip:
```
pip install sweetviz
```
# Basic Usage
Create a `DataframeReport` object, then use a `show_xxx` function to render the report.

**Note: Currently the only rendering supported is to a standalone HTML file, using a "widescreen" aspect ratio (i.e. 1080p resolution or wider).** Please let me know of formats/resolutions you would like to be supported in our Discourse Forum.

There are 3 main functions for creating reports:
- analyze(...)
- compare(...)
- compare_intra(...)

## Analyzing a single dataframe (and its optional target feature)
To analyze a single dataframe, simply use the `analyze(...)` function, then the `show_html(...)` function:
```
import sweetviz as sv
my_report = sv.analyze(my_dataframe)
my_report.show_html()  # Default arguments will generate to "SWEETVIZ_REPORT.html"
```
When run, this will output a 1080p widescreen HTML app in your default browser:
![Widescreen demo](./docs/images/demo_wide.png)
### Optional arguments
The `analyze()` function can take multiple other arguments:
```
analyze(source: Union[pd.DataFrame, Tuple[pd.DataFrame, str]],
        target_feat: str = None,
        feat_cfg: FeatureConfig = None,
        pairwise_analysis: str = 'auto'):
```
- **source:** Either the data frame (as in the example) or a tuple containing the data frame and a name to show in the report.
e.g. `my_df` or `[my_df, "Training"]`
- **target_feat:** A string representing the name of the feature to be marked as "target". *Only BOOLEAN and NUMERICAL features can be targets for now.*
- **feat_cfg:** A FeatureConfig object representing features to be skipped, or to be forced a certain type in the analysis. The arguments can either be a single string or list of strings. Parameters are `skip`, `force_cat`, `force_num` and `force_text`. The "force_" arguments override the built-in type detection. They can be constructed as follows:
```
feature_config = sv.FeatureConfig(skip="PassengerId", force_text=["Age"])
```
- **pairwise_analysis:** Correlations and other associations can take quadratic time (n^2) to complete. The default setting ("auto") will run without warning until a data set contains "association_auto_threshold" features. Past that threshold, you need to explicitly pass the parameter `pairwise_analysis="on"` (or `="off"`) since processing that many features would take a long time. This parameter also covers the generation of the association graphs (based on Drazen Zaric's concept):
![Pairwise sample](./docs/images/pairwise.png)
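
To give a feel for why pairwise analysis gets expensive as features are added, the sketch below counts the unordered feature pairs for a few dataset widths. It is an illustration only: the helper name `count_feature_pairs` is hypothetical and this is not how Sweetviz schedules its work. Note that the uncertainty coefficient is asymmetric, so a tool may need to evaluate ordered pairs, roughly doubling these counts.

```python
from itertools import combinations


def count_feature_pairs(feature_names):
    """Count the unordered pairs of features, i.e. n * (n - 1) / 2."""
    return sum(1 for _ in combinations(feature_names, 2))


# Doubling the number of features roughly quadruples the number of pairs.
for n in (10, 20, 40):
    names = [f"feature_{i}" for i in range(n)]
    print(n, "features ->", count_feature_pairs(names), "pairs")
```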

## Comparing two dataframes (e.g. Test vs Training sets)
To compare two data sets, simply use the `compare()` function. Its parameters are the same as `analyze()`, except with an inserted second parameter to cover the comparison dataframe. It is recommended to use the [dataframe, "name"] format of parameters to better differentiate between the base and compared dataframes. (e.g. `[my_df, "Train"]` vs `my_df`)
```
my_report = sv.compare([my_dataframe, "Training Data"], [test_df, "Test Data"], "Survived", feature_config)
```
## Comparing two subsets of the same dataframe (e.g. Male vs Female)
Another way to get great insights is to use the comparison functionality to split your dataset into 2 sub-populations.

Support for this is built in through the `compare_intra()` function. This function takes a boolean series as one of the arguments, as well as an explicit "name" tuple for naming the (true, false) resulting datasets. Note that internally, this creates 2 separate dataframes to represent each resulting group. As such, it is more of a shorthand function for doing such processing manually.
```
my_report = sv.compare_intra(my_dataframe, my_dataframe["Sex"] == "male", ["Male", "Female"], feature_config)
```
# Config file
The package contains an INI file for configuration. You can override any setting by providing your own INI file and loading it after import:
```
sv.config_parser.read("Override.ini")
```
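
As a rough illustration of how such an override layers on top of the packaged defaults, the sketch below uses the standard library's `configparser` directly. The section names (`Processing`, `Output`), the `hypothetical_title` key and all values are assumptions made for this example; only the `association_auto_threshold` setting name appears in this README, and the real `sweetviz_defaults.ini` layout may differ.

```python
import configparser

# Stand-in for the packaged defaults (section names and values are hypothetical).
DEFAULTS = """
[Processing]
association_auto_threshold = 15

[Output]
hypothetical_title = Report
"""

# Stand-in for a user's Override.ini: only the settings to change are listed.
OVERRIDES = """
[Processing]
association_auto_threshold = 50
"""

config = configparser.ConfigParser()
config.read_string(DEFAULTS)   # loaded first, like the packaged INI file
config.read_string(OVERRIDES)  # later reads replace matching keys only

threshold = config.getint("Processing", "association_auto_threshold")
title = config.get("Output", "hypothetical_title")
```

Keys that the override does not mention keep their default values. Also note that `ConfigParser.read()` silently skips files it cannot find, so it can be worth checking its return value (the list of files actually read) when an override appears to have no effect.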
# Contribute
This is my first open-source project! I built it to be the most useful tool possible and help as many people as possible with their data science work. If it is useful to you, your contribution is more than welcome and can take many forms:
### 1. Spread the word!
A STAR here on GitHub, and a Twitter or Instagram post are the easiest contribution and can potentially help grow this project tremendously! If you find this project useful, these quick actions from you would mean a lot and could go a long way.

Kaggle notebooks/posts, Medium articles, YouTube video tutorials and other content take more time but will help all the more!

### 2. Report bugs & issues
I expect there to be many quirks once the project is used by more and more people with a variety of new (& "unclean") data. If you found a bug, please [open a new issue here](https://github.com/fbdesignpro/sweetviz/issues).

### 3. Suggest and discuss usage/features
To make Sweetviz as useful as possible we need to hear what you would like it to do, or what it could do better! [Head on to our Discourse server and post your suggestions there; no login required!](https://sweetviz.fbdesignpro.com).

### 4. Contribute to the development
I definitely welcome the help I can get on this project, simply get in touch on the issue tracker and/or our Discourse forum.

Please note that after a hectic development period, the code itself right now needs a bit of cleanup. :)

# Special thanks & related materials
I want Sweetviz to be a hub of the best of what's out there, a way to get the most valuable information and visualization, without reinventing the wheel.

As such, I want to point out some of the great resources that inspired and were integrated into Sweetviz:
- [Pandas-Profiling](https://github.com/pandas-profiling/pandas-profiling) was the original inspiration for this project. Some of its type-detection code was included in Sweetviz.
- [Shaked Zychlinski: The Search for Categorical Correlation](https://towardsdatascience.com/the-search-for-categorical-correlation-a1cf7f1888c9) is a great article about different types of variable interactions that was the basis of that analysis in Sweetviz.
- [Drazen Zaric: Better Heatmaps and Correlation Matrix Plots in Python](https://towardsdatascience.com/better-heatmaps-and-correlation-matrix-plots-in-python-41445d0f2bec) was the basis for our association graphs.

Binary file added docs/images/demo_wide.png
Binary file added docs/images/features.png
Binary file added docs/images/logo.png
Binary file added docs/images/pairwise.png
42 changes: 42 additions & 0 deletions setup.py
@@ -0,0 +1,42 @@
import setuptools

with open("README.md", "r") as fh:
    long_description_from_file = fh.read()

setuptools.setup(
    name="sweetviz",
    version="1.0alpha3",
    author="Francois Bertrand",
    author_email="fb@fbdesignpro.com",
    description="A pandas-based library to visualize and compare datasets.",
    long_description=long_description_from_file,
    long_description_content_type="text/markdown",
    url="https://github.com/fbdesignpro/sweetviz",
    packages=setuptools.find_packages(),
    license="MIT",
    include_package_data=True,
    classifiers=[
        "Programming Language :: Python :: 3",
        "Programming Language :: Python :: 3.6",
        "Programming Language :: Python :: 3.7",
        "Programming Language :: Python :: 3.8",
        "Intended Audience :: Developers",
        "Intended Audience :: Science/Research",
        "License :: OSI Approved :: MIT License",
        "Operating System :: OS Independent",
        "Development Status :: 3 - Alpha",
        "Topic :: Scientific/Engineering :: Visualization",
        "Topic :: Software Development :: Libraries :: Python Modules",
    ],
    keywords="pandas data-science data-analysis python eda",
    python_requires='>=3.6',
    install_requires=[
        'pandas>=0.25.3,!=1.0.0,!=1.0.1,!=1.0.2',
        'numpy>=1.16.0',
        'matplotlib>=3.1.3',
        'tqdm>=4.43.0',
        'scipy>=1.3.2',
        'jinja2>=2.11.1',
        'importlib_resources>=1.2.0'
    ]
)
12 changes: 12 additions & 0 deletions sweetviz/__init__.py
@@ -0,0 +1,12 @@
# sweetviz public interface
# -----------------------------------------------------------------------------------
# These are the main API functions
from sweetviz.sv_public import analyze, compare, compare_intra
from sweetviz.feature_config import FeatureConfig

# This is the main report class; holds the report data
# and is used to output the final report
from sweetviz.dataframe_report import DataframeReport

# This is the config_parser, use to customize settings
from sweetviz.config import config as config_parser
18 changes: 18 additions & 0 deletions sweetviz/config.py
@@ -0,0 +1,18 @@
import configparser
import os


try:
    import importlib.resources as pkg_resources
except ImportError:
    # Try backported to PY<37 `importlib_resources`.
    import importlib_resources as pkg_resources


config = configparser.ConfigParser()
# print("Config: " + os.path.abspath('sweetviz_defaults.ini'))
the_open = pkg_resources.open_text("sweetviz", 'sweetviz_defaults.ini')
config.read_file(the_open)
the_open.close()
# config.read_file(open('sweetviz_defaults.ini'))
