## Pandas Profiling: NASA Meteorites example
Source of data: https://data.nasa.gov/Space-Science/Meteorite-Landings/gh4g-9sfh

The `autoreload` extension reloads modules automatically before each cell executes, which is helpful after the package upgrade below.

In [1]:
%load_ext autoreload
%autoreload 2

Make sure that the latest version of ydata-profiling (the current name of the pandas-profiling package) is installed.

In [2]:
import sys

!{sys.executable} -m pip install -U ydata-profiling[notebook]
!{sys.executable} -m pip install jupyter-contrib-nbextensions
!jupyter nbextension enable --py widgetsnbextension


You might want to restart the kernel now.
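
After the restart, it can be worth confirming which version the kernel actually sees. The sketch below relies only on the standard library's `importlib.metadata`; the helper name `installed_version` is made up for this example, and it returns `None` when the distribution is not installed.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        # The distribution is not installed in the environment of this kernel
        return None


print("ydata-profiling:", installed_version("ydata-profiling"))
```

If this prints `None` or an older version than expected, the install most likely went to a different environment than the one the kernel runs in.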

### Import libraries

In [3]:
from pathlib import Path

import numpy as np
import pandas as pd
import requests

import ydata_profiling
from ydata_profiling.utils.cache import cache_file

### Load and prepare example dataset
We add some synthetic variables to illustrate the profiling capabilities.

In [4]:
file_name = cache_file(
    "meteorites.csv",
    "https://data.nasa.gov/api/views/gh4g-9sfh/rows.csv?accessType=DOWNLOAD",
)

df = pd.read_csv(file_name)

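# Note: with errors="coerce", values that cannot be parsed, or that fall outside the range of
# nanosecond-resolution timestamps (roughly 1677 to 2262), become NaT and are ignored in this analysis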
df["year"] = pd.to_datetime(df["year"], errors="coerce")

# Example: Constant variable
df["source"] = "NASA"

# Example: Boolean variable
df["boolean"] = np.random.choice([True, False], df.shape[0])

# Example: Mixed with base types
df["mixed"] = np.random.choice([1, "A"], df.shape[0])

# Example: Highly correlated variables
df["reclat_city"] = df["reclat"] + np.random.normal(scale=5, size=(len(df)))

# Example: Duplicate observations
duplicates_to_add = df.iloc[0:10].copy()
duplicates_to_add["name"] = duplicates_to_add["name"] + " copy"

df = pd.concat([df, duplicates_to_add], ignore_index=True)
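
Because the appended rows get a `" copy"` suffix on `name`, they repeat every field except the name. A minimal sketch on a toy frame (three rows with illustrative values; the variable names are not part of the notebook) shows how plain pandas distinguishes exact duplicates from rows that match on all other columns:

```python
import pandas as pd

# Toy stand-in for the meteorite table (illustrative values only)
toy = pd.DataFrame(
    {
        "name": ["Aachen", "Aarhus", "Abee"],
        "id": [1, 2, 6],
        "mass (g)": [21.0, 720.0, 107000.0],
    }
)

# Same recipe as in the cell above: copy the first rows and rename them
copies = toy.iloc[0:2].copy()
copies["name"] = copies["name"] + " copy"
toy = pd.concat([toy, copies], ignore_index=True)

# Exact duplicates compare every column, including the renamed "name"
exact = toy.duplicated(keep=False)

# Near duplicates ignore "name" and compare the remaining columns
other_columns = [column for column in toy.columns if column != "name"]
near = toy.duplicated(subset=other_columns, keep=False)

print("exact duplicate rows:", int(exact.sum()))
print("rows matching on all columns except name:", int(near.sum()))
```

Dropping the rename step would turn the copies into exact duplicates, which is the case `DataFrame.duplicated()` flags without a `subset`.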

### Inline report without saving object

In [5]:
report = df.profile_report(
    sort=None, html={"style": {"full_width": True}}, progress_bar=False
)
report



### Save report to file

In [6]:
profile_report = df.profile_report(html={"style": {"full_width": True}})
profile_report.to_file("/tmp/example.html")

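
The hard-coded `/tmp` directory exists only on Unix-like systems. A small sketch of a more portable output location, using just the standard library, could look like this; the final call is left as a comment and assumes `to_file` accepts a `pathlib.Path` as well as a string:

```python
import tempfile
from pathlib import Path

# Resolve the temporary directory of the current platform
output_path = Path(tempfile.gettempdir()) / "example.html"
print(output_path)

# With the report object from the cell above, the path could then be passed on:
# profile_report.to_file(output_path)
```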

### More analysis (Unicode) and printing an existing ProfileReport object inline

In [7]:
profile_report = df.profile_report(
    explorative=True, html={"style": {"full_width": True}}
)
profile_report




### Notebook Widgets

In [8]:
profile_report.to_widgets()


IOPub message rate exceeded.
The Jupyter server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--ServerApp.iopub_msg_rate_limit`.

Current values:
ServerApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
ServerApp.rate_limit_window=3.0 (secs)



| | boolean | fall | id | mass (g) | mixed | nametype | reclat | reclat_city | reclong |
|---|---|---|---|---|---|---|---|---|---|
| boolean | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.008119 |
| fall | 0.0 | 1.0 | 0.126388 | 0.012114 | 0.0 | 0.0 | 0.449923 | 0.4004 | 0.194729 |
| id | 0.0 | 0.126388 | 1.0 | -0.141548 | 0.0 | 0.129949 | 0.26121 | 0.218883 | -0.316046 |
| mass (g) | 0.0 | 0.012114 | -0.141548 | 1.0 | 0.0 | 0.0 | 0.408746 | 0.421334 | -0.281372 |
| mixed | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.004395 | 0.004599 | 0.010152 |
| nametype | 0.0 | 0.0 | 0.129949 | 0.0 | 0.0 | 1.0 | 0.34934 | 0.335825 | 0.043965 |
| reclat | 0.0 | 0.449923 | 0.26121 | 0.408746 | 0.004395 | 0.34934 | 1.0 | 0.94289 | -0.650308 |
| reclat_city | 0.0 | 0.4004 | 0.218883 | 0.421334 | 0.004599 | 0.335825 | 0.94289 | 1.0 | -0.616445 |
| reclong | 0.008119 | 0.194729 | -0.316046 | -0.281372 | 0.010152 | 0.043965 | -0.650308 | -0.616445 | 1.0 |
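
The strong `reclat`/`reclat_city` entry (0.94289) follows from how `reclat_city` was built: the original latitude plus Gaussian noise with scale 5. A rough sketch with synthetic latitudes reproduces the effect using NumPy only. Two assumptions are made here: the latitudes are drawn uniformly from [-90, 90], which is not the real distribution, and Pearson correlation is used for simplicity, which may differ from the measure the report applies.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Synthetic latitudes (assumption: uniform, unlike the real meteorite data)
latitude = rng.uniform(-90, 90, size=10_000)

# Same construction as reclat_city: original value plus noise with scale 5
noisy = latitude + rng.normal(scale=5, size=latitude.shape)

# Pearson correlation between the original and the noisy copy
pearson = np.corrcoef(latitude, noisy)[0, 1]
print(round(float(pearson), 3))
```

The noise is small relative to the spread of the latitudes, so the two columns stay almost perfectly aligned; a larger `scale` would pull the coefficient down.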


| | name | id | nametype | recclass | mass (g) | fall | year | reclat | reclong | GeoLocation | source | boolean | mixed | reclat_city |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Aachen | 1 | Valid | L5 | 21.0 | Fell | 1970-01-01 00:00:00.000001880 | 50.775 | 6.08333 | (50.775, 6.08333) | NASA | False | 1 | 42.183688 |
| 1 | Aarhus | 2 | Valid | H6 | 720.0 | Fell | 1970-01-01 00:00:00.000001951 | 56.18333 | 10.23333 | (56.18333, 10.23333) | NASA | False | A | 52.796155 |
| 2 | Abee | 6 | Valid | EH4 | 107000.0 | Fell | 1970-01-01 00:00:00.000001952 | 54.21667 | -113.0 | (54.21667, -113.0) | NASA | False | 1 | 45.815963 |
| 3 | Acapulco | 10 | Valid | Acapulcoite | 1914.0 | Fell | 1970-01-01 00:00:00.000001976 | 16.88333 | -99.9 | (16.88333, -99.9) | NASA | True | A | 6.787124 |
| 4 | Achiras | 370 | Valid | L6 | 780.0 | Fell | 1970-01-01 00:00:00.000001902 | -33.16667 | -64.95 | (-33.16667, -64.95) | NASA | True | 1 | -38.757018 |
| 5 | Adhi Kot | 379 | Valid | EH4 | 4239.0 | Fell | 1970-01-01 00:00:00.000001919 | 32.1 | 71.8 | (32.1, 71.8) | NASA | True | A | 32.429366 |
| 6 | Adzhi-Bogdo (stone) | 390 | Valid | LL3-6 | 910.0 | Fell | 1970-01-01 00:00:00.000001949 | 44.83333 | 95.16667 | (44.83333, 95.16667) | NASA | False | A | 55.180639 |
| 7 | Agen | 392 | Valid | H5 | 30000.0 | Fell | 1970-01-01 00:00:00.000001814 | 44.21667 | 0.61667 | (44.21667, 0.61667) | NASA | True | A | 41.814493 |
| 8 | Aguada | 398 | Valid | L6 | 1620.0 | Fell | 1970-01-01 00:00:00.000001930 | -31.6 | -65.23333 | (-31.6, -65.23333) | NASA | False | A | -33.94882 |
| 9 | Aguila Blanca | 417 | Valid | L | 1440.0 | Fell | 1970-01-01 00:00:00.000001920 | -30.86667 | -64.55 | (-30.86667, -64.55) | NASA | False | 1 | -26.273491 |
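
The `year` values above all land on 1970-01-01, with the actual year showing up in the nanosecond digits. That is what `pd.to_datetime` produces for bare numbers, which it reads as nanoseconds since the Unix epoch by default. A minimal sketch, assuming the column holds plain integer years (the variable names are illustrative and not from the notebook), contrasts that with parsing through an explicit `%Y` format:

```python
import pandas as pd

# Integer years, as they would appear if the CSV stores only the year
years = pd.Series([1880, 1951, 1976])

# Default behaviour: integers are treated as nanoseconds since 1970-01-01
naive = pd.to_datetime(years, errors="coerce")

# Explicit format: convert to strings and parse them as four-digit years
parsed = pd.to_datetime(years.astype(str), format="%Y", errors="coerce")

print(naive.dt.year.tolist())
print(parsed.dt.year.tolist())
```

A real column with missing values would likely be stored as floats, so the string conversion would need extra care (for example, dropping the trailing `.0`) before a `%Y` format could match.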


| | name | id | nametype | recclass | mass (g) | fall | year | reclat | reclong | GeoLocation | source | boolean | mixed | reclat_city |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 45716 | Aachen copy | 1 | Valid | L5 | 21.0 | Fell | 1970-01-01 00:00:00.000001880 | 50.775 | 6.08333 | (50.775, 6.08333) | NASA | False | 1 | 42.183688 |
| 45717 | Aarhus copy | 2 | Valid | H6 | 720.0 | Fell | 1970-01-01 00:00:00.000001951 | 56.18333 | 10.23333 | (56.18333, 10.23333) | NASA | False | A | 52.796155 |
| 45718 | Abee copy | 6 | Valid | EH4 | 107000.0 | Fell | 1970-01-01 00:00:00.000001952 | 54.21667 | -113.0 | (54.21667, -113.0) | NASA | False | 1 | 45.815963 |
| 45719 | Acapulco copy | 10 | Valid | Acapulcoite | 1914.0 | Fell | 1970-01-01 00:00:00.000001976 | 16.88333 | -99.9 | (16.88333, -99.9) | NASA | True | A | 6.787124 |
| 45720 | Achiras copy | 370 | Valid | L6 | 780.0 | Fell | 1970-01-01 00:00:00.000001902 | -33.16667 | -64.95 | (-33.16667, -64.95) | NASA | True | 1 | -38.757018 |
| 45721 | Adhi Kot copy | 379 | Valid | EH4 | 4239.0 | Fell | 1970-01-01 00:00:00.000001919 | 32.1 | 71.8 | (32.1, 71.8) | NASA | True | A | 32.429366 |
| 45722 | Adzhi-Bogdo (stone) copy | 390 | Valid | LL3-6 | 910.0 | Fell | 1970-01-01 00:00:00.000001949 | 44.83333 | 95.16667 | (44.83333, 95.16667) | NASA | False | A | 55.180639 |
| 45723 | Agen copy | 392 | Valid | H5 | 30000.0 | Fell | 1970-01-01 00:00:00.000001814 | 44.21667 | 0.61667 | (44.21667, 0.61667) | NASA | True | A | 41.814493 |
| 45724 | Aguada copy | 398 | Valid | L6 | 1620.0 | Fell | 1970-01-01 00:00:00.000001930 | -31.6 | -65.23333 | (-31.6, -65.23333) | NASA | False | A | -33.94882 |
| 45725 | Aguila Blanca copy | 417 | Valid | L | 1440.0 | Fell | 1970-01-01 00:00:00.000001920 | -30.86667 | -64.55 | (-30.86667, -64.55) | NASA | False | 1 | -26.273491 |

