In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/diabetes-data-set/diabetes.csv


<a id="table"></a>
<h1 style="color:white; background-color:#b45f06;font-family:charter;font-size:350%;text-align:center;border-radius: 25px 50px;">Exploratory Data Analysis (EDA) Using Pandas Profiling</h1>

<p style="color:#b45f06;">Explore pandas profiling and see the benefit of this magical single line of code.</p>
<p style="color:#b45f06;">Pandas profiling is a Python library that performs an automated Exploratory Data Analysis. It automatically generates a dataset profile report that gives valuable insights.</p>

<p style="border-width:1px;border-style:solid;border-color:#da8b38;">pandas-profiling generates profile reports from a pandas DataFrame. The pandas df.describe() function is handy yet a little basic for exploratory data analysis. pandas-profiling extends pandas DataFrame with df.profile_report(), which automatically generates a standardized univariate and multivariate report for data understanding.</p>

<p style="border:1px solid #da8b38;">For each column, the following information (whenever relevant for the column type) is presented in an interactive HTML report:</p>

* Type inference: detect the types of columns in a DataFrame
* Essentials: type, unique values, indication of missing values
* Quantile statistics: minimum value, Q1, median, Q3, maximum, range, interquartile range
* Descriptive statistics: mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
* Most frequent and extreme values
* Histograms: categorical and numerical
* Correlations: high correlation warnings, based on different correlation metrics (Spearman, Pearson, Kendall, Cramér’s V, Phik)
* Missing values: through counts, matrix and heatmap
* Duplicate rows: list of the most common duplicated rows
* Text analysis: most common categories (uppercase, lowercase, separator), scripts (Latin, Cyrillic) and blocks (ASCII, Cyrillic)
* File and Image analysis: file sizes, creation dates, dimensions, indication of truncated images and existence of EXIF metadata.
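
To make a few of these statistics concrete, the sketch below computes some of them by hand with plain pandas for a single numeric column. It only illustrates what the quantile and descriptive statistics mean and is not how pandas-profiling computes them internally. The helper name `column_summary` is made up for this example, and the values are the five Glucose readings that appear in the `data.head()` output later in this notebook.

```python
import pandas as pd


def column_summary(series: pd.Series) -> dict:
    """Hand-computed subset of the per-column statistics listed above."""
    q1, median, q3 = series.quantile([0.25, 0.5, 0.75])
    mean = series.mean()
    return {
        "min": series.min(),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": series.max(),
        "range": series.max() - series.min(),
        "iqr": q3 - q1,                           # interquartile range
        "mean": mean,
        "std": series.std(),
        "mad": (series - median).abs().median(),  # median absolute deviation
        "cv": series.std() / mean,                # coefficient of variation
        "skewness": series.skew(),
        "n_missing": int(series.isna().sum()),
        "n_unique": int(series.nunique()),
    }


glucose = pd.Series([148, 85, 183, 89, 137], name="Glucose")
summary = column_summary(glucose)
for name, value in summary.items():
    print(f"{name:>10}: {value}")
```

The profile report gathers numbers like these for every column at once, which is the main convenience over calling `df.describe()` and filling in the rest manually.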

***Let’s get into practice.***

# 1. Installation 🛠️

## 1.1. Using pip

You can install using the pip package manager by running:

In [2]:
pip install -U pandas-profiling

Collecting pandas-profiling
  Downloading pandas_profiling-3.5.0-py2.py3-none-any.whl (325 kB)
Collecting visions[type_image_path]==0.7.5
  Downloading visions-0.7.5-py3-none-any.whl (102 kB)
Installing collected packages: visions, pandas-profiling
  Attempting uninstall: visions
    Found existing installation: visions 0.7.4
    Uninstalling visions-0.7.4:
      Successfully uninstalled visions-0.7.4
  Attempting uninstall: pandas-profiling
    Found existing installation: pandas-profiling 3.1.0
    Uninstalling pandas-profiling-3.1.0:
      Successfully uninstalled pandas-profiling-3.1.0
Successfully installed pandas-profiling-3.5.0 visions-0.7.5
Note: you may need to restart the kernel to use updated packages.


## 1.2. Using conda
    
You can install using the conda package manager by running:

In [3]:
conda install -c conda-forge pandas-profiling

Collecting package metadata (current_repodata.json):
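
After either install, it can be worth confirming which version the kernel actually sees, since the pip log above notes that a kernel restart may be needed. Below is a minimal sketch using only the standard library; `installed_version` is a helper name made up for this example.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Check the two packages mentioned in the pip output above.
for name in ("pandas-profiling", "visions"):
    found = installed_version(name)
    print(f"{name}: {found if found else 'not installed'}")
```

If the printed version does not match what pip reported, restarting the kernel is the first thing to try.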

# 2. Import Libraries 📚

In [4]:
import numpy as np
import pandas as pd

***Let’s import the pandas profiling library:***

In [5]:
from pandas_profiling import ProfileReport

In [6]:
data = pd.read_csv("/kaggle/input/diabetes-data-set/diabetes.csv")

In [7]:
data.head()

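|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |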


# 3. Generate report

***To generate the standard profiling report, simply run:***

In [8]:
profile = ProfileReport(data, title="Pandas Profiling Report")

In [9]:
data.profile_report()
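
Note that `ProfileReport(data, ...)` and `data.profile_report()` each create their own report object, so the object stored in `profile` can also be displayed directly instead of building a second one. The sketch below shows one way to do that, assuming pandas-profiling 3.x is installed; `build_report` is a hypothetical helper name, and the import is guarded so the cell still runs where the package is missing.

```python
import pandas as pd

try:
    from pandas_profiling import ProfileReport
except ImportError:  # package not installed in this environment
    ProfileReport = None


def build_report(df: pd.DataFrame, title: str = "Pandas Profiling Report"):
    """Create the report object once so it can be both displayed and exported."""
    if ProfileReport is None:
        raise RuntimeError("pandas-profiling is not installed")
    return ProfileReport(df, title=title)


# Usage inside a notebook (uncomment to run):
# profile = build_report(data)
# profile.to_notebook_iframe()   # embed the HTML report in the output cell
# profile.to_widgets()           # or render it with Jupyter widgets
```

Reusing one object avoids recomputing the statistics when the same report is later written to a file.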


# 4. Exporting the report to a file

***To generate an HTML report file, assign the ProfileReport to a variable and call its to_file() method:***

In [10]:
profile.to_file("your_report.html")

*Alternatively, the report’s data can be obtained as JSON, either as a string or as a file:*

In [11]:
# As a JSON string
json_data = profile.to_json()

In [12]:
# As a file
profile.to_file("your_report.json")
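
The JSON form is handy when the statistics need to be read by other code rather than by a person. Below is a small sketch of inspecting such a string with the standard library. The helper `top_level_sections` and the stand-in `example_json` are made up for this illustration, and the keys in the stand-in are placeholders rather than a statement about the real report layout.

```python
import json


def top_level_sections(report_json: str) -> list:
    """Parse a JSON report string and return its top-level keys, sorted."""
    parsed = json.loads(report_json)
    return sorted(parsed)


# Stand-in JSON so the example runs without generating a report;
# with a real report, pass the `json_data` string produced above instead.
example_json = '{"table": {"n": 5}, "variables": {"Glucose": {"type": "Numeric"}}}'
print(top_level_sections(example_json))
```

Listing the top-level keys first is a quick way to find out which parts of a real report are worth digging into.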

The main drawback of pandas profiling is its performance on large datasets: the time needed to generate the report grows quickly with the size of the data.

One workaround is to generate the report from a random sample of the data rather than from all of it.

In [13]:
profile = ProfileReport(data.sample(n=100))
profile.to_file(output_file='output.html')
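
One caveat with `data.sample(n=100)` is that pandas raises an error when the DataFrame has fewer than 100 rows, and that the sample changes on every run. The sketch below guards against both; `profiling_sample`, `max_rows` and `seed` are names chosen for this example. pandas-profiling also accepts `minimal=True`, which skips the more expensive computations and can be combined with sampling.

```python
import pandas as pd


def profiling_sample(df: pd.DataFrame, max_rows: int = 100, seed: int = 0) -> pd.DataFrame:
    """Return at most `max_rows` randomly chosen rows of `df`."""
    if len(df) <= max_rows:
        return df  # small enough already, profile everything
    return df.sample(n=max_rows, random_state=seed)  # reproducible sample


big = pd.DataFrame({"x": range(1_000)})
small = pd.DataFrame({"x": range(50)})
print(len(profiling_sample(big)), len(profiling_sample(small)))

# With pandas-profiling installed, the sample can then be profiled, e.g.:
# ProfileReport(profiling_sample(data), minimal=True).to_file("output.html")
```

Keep in mind that statistics from a sample only approximate those of the full dataset, so rare values and duplicates may be under-reported.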

The generated report is organised into the following sections:

* **Overview:** Global details about the dataset (number of records, number of variables, overall missingness and duplicates, memory footprint).
* **Variables:** A detailed analysis of each variable (column/feature). The statistics shown depend on the variable's type, such as numeric, string or boolean.
* **Interactions:** Bivariate/multivariate analysis of pairs of variables.
* **Correlations:** A common tool for describing simple relationships without making a statement about cause and effect. The report offers five correlation coefficients: Pearson’s r, Spearman’s ρ, Kendall’s τ, Phik (φk), and Cramér’s V (φc).
* **Missing values:** A detailed analysis of missing values, shown in four types of plot, including count and matrix.
* **Sample:** The first and last 10 rows of the dataset.
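
A few of these sections can be approximated by hand with plain pandas, which helps when reading the report. The sketch below uses a small made-up DataFrame (not the diabetes data) to show why the report lists more than one correlation coefficient: a monotonic but non-linear relationship gets a perfect Spearman score while Pearson stays below 1. It also counts missing values and prints the first and last rows.

```python
import pandas as pd

df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [1, 8, 27, 64, 125],          # y = x**3: monotonic but not linear
    "z": [2.0, None, 1.0, None, 3.0],  # two missing values
})

# Correlations: linear (Pearson) versus rank-based (Spearman).
pearson = df[["x", "y"]].corr(method="pearson").loc["x", "y"]
spearman = df[["x", "y"]].corr(method="spearman").loc["x", "y"]

# Missing values: count per column.
missing_counts = df.isna().sum()

print(f"Pearson  r   = {pearson:.3f}")
print(f"Spearman rho = {spearman:.3f}")
print(missing_counts)

# Sample: first and last rows, as the report's Sample section does.
print(df.head(2))
print(df.tail(2))
```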

# Conclusion

Pandas Profiling is an excellent Python package for exploratory data analysis (EDA). It extends pandas with statistical summaries that cover correlations, missing values, distributions (quantiles), and descriptive statistics.

I definitely recommend trying it!

# References

1. https://pandas-profiling.ydata.ai/docs/master/index.html
2. https://www.numpyninja.com/post/eda-using-pandas-profiling