## YData Profiling: NASA Meteorites example
Source of data: https://data.nasa.gov/docs/legacy/meteorite_landings/Meteorite_Landings.csv

The autoreload extension reloads modules automatically before code is executed, which helps when the package is upgraded in the next step.

In [1]:
%load_ext autoreload
%autoreload 2

Make sure that we have the latest version of ydata-profiling (formerly pandas-profiling).

In [None]:
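import sys

# Install into the interpreter that runs this kernel; the quotes keep shells from expanding the brackets
!{sys.executable} -m pip install -U "ydata-profiling[notebook]"
!{sys.executable} -m pip install jupyter-contrib-nbextensions
!jupyter nbextension enable --py widgetsnbextension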


You might want to restart the kernel now so that the freshly installed version is the one that gets imported.
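After the restart, it can be worth confirming which version the kernel actually sees. The following is a minimal sketch using only the standard library; the helper name `installed_version` is an illustration and not part of ydata-profiling.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


# Prints None instead of raising if the package is not installed in this environment.
print("ydata-profiling:", installed_version("ydata-profiling"))
```

If this prints `None`, the install most likely went into a different environment than the one the kernel is running in.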

### Import libraries

In [1]:
from pathlib import Path

import numpy as np
import pandas as pd
import requests

# Importing ydata_profiling attaches the profile_report() method to pandas DataFrames
import ydata_profiling
from ydata_profiling.utils.cache import cache_file

### Load and prepare example dataset
We add some synthetic variables to illustrate ydata-profiling's capabilities.
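The download in the next cell depends on data.nasa.gov being reachable, and a request without an explicit timeout can block for a long time on a stalled connection. As a hedged alternative, the sketch below downloads a file with a timeout and reuses a local copy when one exists. The function name `fetch_with_timeout`, the `.part` suffix, and the 30-second default are assumptions made for illustration; this is not how `cache_file` is implemented.

```python
import shutil
import urllib.request
from pathlib import Path


def fetch_with_timeout(url: str, destination: Path, timeout: float = 30.0) -> Path:
    """Download url to destination unless a local copy already exists."""
    if destination.exists():
        return destination  # reuse the local copy, no network access needed

    destination.parent.mkdir(parents=True, exist_ok=True)
    partial = destination.with_name(destination.name + ".part")

    # urlopen raises an exception (typically urllib.error.URLError) when the
    # server cannot be reached within the timeout, instead of blocking.
    with urllib.request.urlopen(url, timeout=timeout) as response:
        with partial.open("wb") as out:
            shutil.copyfileobj(response, out)

    partial.replace(destination)  # publish the file only after a complete download
    return destination
```

The returned path can be passed to `pd.read_csv` in the same way as the path returned by `cache_file`. If the server is unreachable, a copy of the CSV placed at `destination` by hand is picked up without any network access.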

In [3]:
file_name = cache_file(
    "meteorites.csv",
    "https://data.nasa.gov/docs/legacy/meteorite_landings/Meteorite_Landings.csv",
)

df = pd.read_csv(file_name)

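# Note: nanosecond-resolution timestamps cannot represent dates before 1677-09-21;
# errors="coerce" turns out-of-range or unparseable values into NaT, so they are ignored in this analysis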
df["year"] = pd.to_datetime(df["year"], errors="coerce")

# Example: Constant variable
df["source"] = "NASA"

# Example: Boolean variable
df["boolean"] = np.random.choice([True, False], df.shape[0])

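# Example: Mixed with base types
# (an object array keeps 1 as an int; a plain list would be converted to strings first)
df["mixed"] = np.random.choice(np.array([1, "A"], dtype=object), df.shape[0])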

# Example: Highly correlated variables
df["reclat_city"] = df["reclat"] + np.random.normal(scale=5, size=(len(df)))

# Example: Duplicate observations
# Work on an explicit copy so the original rows are not modified
duplicates_to_add = df.iloc[0:10].copy()
duplicates_to_add["name"] = duplicates_to_add["name"] + " copy"

df = pd.concat([df, duplicates_to_add], ignore_index=True)
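A quick way to confirm that a synthetic column such as `mixed` really holds more than one Python type is to count the types of its values. The sketch below uses a hypothetical helper, `type_counts`, on small stand-in Series rather than the meteorite data. It relies on the fact that NumPy converts a plain list such as `[1, "A"]` to a string array before sampling, whereas an object array keeps the original values.

```python
import numpy as np
import pandas as pd


def type_counts(series: pd.Series) -> dict:
    """Count the Python types of the values stored in a Series."""
    return series.map(lambda value: type(value).__name__).value_counts().to_dict()


rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable

# A plain list is converted to a NumPy string array first, so 1 becomes "1".
from_list = pd.Series(rng.choice([1, "A"], 100))

# An object array keeps the original Python objects, so ints and strs both survive.
from_objects = pd.Series(rng.choice(np.array([1, "A"], dtype=object), 100))

print(type_counts(from_list))
print(type_counts(from_objects))
```

The same helper can be applied to `df["mixed"]` to check what the report will actually see.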


### Inline report without saving object

In [None]:
report = df.profile_report(
    sort=None, html={"style": {"full_width": True}}, progress_bar=False
)
report

### Save report to file

In [None]:
profile_report = df.profile_report(html={"style": {"full_width": True}})
profile_report.to_file("/tmp/example.html")
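The hard-coded `/tmp` path assumes a Unix-like system. A portable variant could build the output path from the platform's temporary directory, as in this sketch; the helper `report_path` and the folder name `ydata_profiling_reports` are hypothetical and chosen only for illustration.

```python
import tempfile
from pathlib import Path


def report_path(file_name: str) -> Path:
    """Return a path for file_name inside a folder in the system temp directory."""
    # Hypothetical folder name; any writable directory would do.
    directory = Path(tempfile.gettempdir()) / "ydata_profiling_reports"
    directory.mkdir(parents=True, exist_ok=True)
    return directory / file_name


output_file = report_path("example.html")
print(output_file)
```

The resulting path can then be passed to the report, for example `profile_report.to_file(output_file)`.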

### More analysis (Unicode) and displaying an existing ProfileReport object inline

In [None]:
profile_report = df.profile_report(
    explorative=True, html={"style": {"full_width": True}}
)
profile_report

### Notebook Widgets

In [None]:
profile_report.to_widgets()