<center>
<table>
  <tr>
    <td><img src="https://portal.nccs.nasa.gov/datashare/astg/training/python/logos/nasa-logo.svg" width="100"/> </td>
     <td><img src="https://portal.nccs.nasa.gov/datashare/astg/training/python/logos/ASTG_logo.png?raw=true" width="80"/> </td>
     <td> <img src="https://www.nccs.nasa.gov/sites/default/files/NCCS_Logo_0.png" width="130"/> </td>
    </tr>
</table>
</center>

        
<center>
<h1><font color= "blue" size="+3">ASTG Python Courses</font></h1>
</center>

---

<center><h1><font color="red" size="+3">Data Profiling with Pandas</font></h1></center>

# <font color="red">Objectives</font>
In this presentation, we will cover the following topics:

- Define the concept of data profiling
- Explain what data profiling involves
- Introduce a couple of data profiling tools
- Show how data profiling principles are used with real datasets.

# <font color="red">Useful References</font>

- [What is data profiling?](https://www.ibm.com/think/topics/data-profiling) from IBM
- R. A. Ruddle, J. Cheshire and S. J. Fernstad, [Tasks and Visualizations Used for Data Profiling: A Survey and Interview Study](https://ieeexplore.ieee.org/document/10008084), in IEEE Transactions on Visualization and Computer Graphics, vol. 30, no. 7, pp. 3400-3412, July 2024, doi: 10.1109/TVCG.2023.3234337.

# <font color="red">What is Data Profiling?</font>

>Data profiling is the practice of looking closely at a dataset to understand its overall structure and quality. This means reviewing things such as the types of data it contains, how the data is distributed, whether any information is missing, and whether everything is consistent.

- It involves reviewing, analyzing, and cleansing data to understand its structure, characteristics, integrity, and quality. 
- It helps you gain insights into your data, identify potential issues, and make informed decisions about how to use it.
- The main purpose of data profiling is to make sure the data is correct, well-organized, and ready to be used for analysis or important decisions.
   - It helps in choosing the right algorithms by providing an initial high-level understanding of the dataset.
- It is a crucial preliminary step before using the data in a model.

## <font color="blue">Purpose of data profiling</font>

The purpose of data profiling is to:

- Identify and correct issues with the quality of the data
- Explore and understand how the data was generated (i.e., from which source)
- Identify missing values in the data
- Identify duplicate records in the dataset
- Identify how frequently each attribute occurs
- Identify how unique the values of each attribute are
- Identify outliers in the dataset

Data profiling serves as a data hygiene process that produces a collection of information about the data (also known as metadata) and its overall health. This could include:

- Data types: Are the values in a column numbers, text, dates, etc.?
- Value ranges: What are the minimum and maximum values a field can hold?
- Missing values: How many data points are missing in a specific column?
- Data distributions: How are the values distributed across a column?
- Data relationships: Are there any connections between different data points or columns?
- Data quality issues: Are there any inconsistencies, anomalies, duplicates, or errors present?
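
As a minimal sketch of how such metadata could be collected with Pandas, the block below builds one summary row per column. The DataFrame `toy_df`, its column names, and its values are invented for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset: one missing value and one duplicated row.
toy_df = pd.DataFrame({
    "station": ["A", "B", "B", "C"],
    "temperature": [21.5, 18.2, 18.2, np.nan],
    "count": [3, 7, 7, 1],
})

# One row of metadata per column: type, missing values, unique values, range.
metadata = pd.DataFrame({
    "dtype": toy_df.dtypes.astype(str),
    "n_missing": toy_df.isna().sum(),
    "n_unique": toy_df.nunique(),
    "min": toy_df.min(numeric_only=True),
    "max": toy_df.max(numeric_only=True),
})

# Count the rows that repeat an earlier row exactly.
n_duplicates = int(toy_df.duplicated().sum())

print(metadata)
print(f"Duplicate rows: {n_duplicates}")
```

The `min` and `max` entries are left empty (`NaN`) for the text column because they are computed with `numeric_only=True`.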

## <font color="blue">Common techniques for data profiling</font>

Data profiling relies on a set of activities, including discovery and analytical techniques to collect statistics or informative summaries about the data. 

The most common methods used are:

- __Column profiling__: Analyze individual columns to understand their data type, distribution, unique values, and missing values.
- __Data type profiling__: Identify the data type of each column and check for inconsistencies.
- __Pattern profiling__: Identify patterns and anomalies in the data.
- __Relationship profiling__: Examine the relationships between different columns and tables.
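
Pattern profiling can be illustrated with a regular expression applied to a text column. This is only a sketch: the Series `raw_dates` and the expected `YYYY-MM-DD` format are assumptions made for the example.

```python
import pandas as pd

# Hypothetical column that is expected to hold dates written as YYYY-MM-DD.
raw_dates = pd.Series(["2021-01-15", "2021-02-30x", "15/03/2021", "2021-04-01"])

# True where the whole string follows the expected pattern.
matches = raw_dates.str.fullmatch(r"\d{4}-\d{2}-\d{2}")

# Entries that break the pattern are candidates for closer inspection.
anomalies = raw_dates[~matches]

print(f"{matches.mean():.0%} of the values match the pattern")
print(anomalies.tolist())
```

The regular expression only checks the shape of the string; it does not verify that the value is a valid calendar date.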

## <font color="blue">Information generated by data profiling</font>

The analysis performed by data profiling tools exposes:

- Data rules (for instance, dependencies that represent business rules embedded within the data).
- Anomalies that exist within the datasets.

Data profiling tools can generate information about: 

- Data patterns
- Numeric statistics
- Data domains
- Dependencies
- Relationships, and
- Anomalies.

At the column level, reports focus on statistical measures and metadata to provide insights into the distribution and quality of the data. These reports include information on:

- Minimum and maximum values: indicate the range of the data. Limiting a value set range also helps to determine outliers. 
- Mean and mode: reveal the average and most common values.
- Percentiles: show the data distribution across intervals.
- Standard deviation: indicates data variability.
- Frequency: highlights the repetition of specific values.
- Variation: shows data diversity and the aggregated sum of values within a column.
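
The column-level measures listed above can be computed directly with Pandas. The sketch below uses a small invented Series (`values`); the dictionary layout of `column_report` is an illustrative choice, not the output format of any profiling tool.

```python
import pandas as pd

# Hypothetical measurements for a single column.
values = pd.Series([2, 4, 4, 5, 7, 9, 4], name="hypothetical_measurements")

column_report = {
    "min": values.min(),                                   # lower end of the range
    "max": values.max(),                                   # upper end of the range
    "mean": values.mean(),                                 # average value
    "mode": values.mode().tolist(),                        # most common value(s)
    "percentiles": values.quantile([0.25, 0.5, 0.75]).to_dict(),
    "std": values.std(),                                   # sample standard deviation
    "frequency": values.value_counts().to_dict(),          # count of each value
}

for key, item in column_report.items():
    print(f"{key}: {item}")
```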

## <font color="blue"> Benefits of data profiling</font>

- Identify inaccuracies and inconsistencies, resulting in cleaner datasets and reduced errors.
- Improve data credibility and quality.
- Minimize the risk of data errors or inaccurate results.
- Make better sense of the relationships between different data sets and sources
- Improve users' understanding of data.
- Can help quickly identify and address problems, often before they arise.

__The result of data profiling is a constructive process of information inference to prepare a data set for later integration, analysis and modeling__.

# <font color="red">Some data profiling tools</font>

We will rely on two profiling tools in this presentation.

## <font color="blue">[Data Profiler](https://github.com/capitalone/DataProfiler)</font>

- Created by the financial company Capital One.
- Designed to make data analysis, monitoring and sensitive data detection easy.
- It is mainly meant to analyze datasets and detect if any of the information contained within is sensitive data, such as bank account numbers, credit card information, or social security numbers.
- It can also be used to generate basic statistical reports on science-related datasets.

## <font color="blue">[ydata-profiling](https://docs.profiling.ydata.ai/)</font>

- Provides a one-line Exploratory Data Analysis (EDA) experience in a consistent and fast solution. 
- Automates and standardizes the generation of detailed reports, complete with statistics and visualizations.
- Delivers an extended analysis of a DataFrame while allowing the data analysis to be exported in different formats such as __html__ and __json__.

# <font color="red">Packages Used</font>

We will use the following packages:

- `Matplotlib`: for visualization
- `Seaborn`: for visualization 
- `Plotly`: for interactive visualization
- `NumPy`: for array creation.
- `Pandas`: for creating and manipulating Series and DataFrames, and for visualization.
- `Skimpy`: for descriptive statistics
- `DataProfiler`: data profiling tool mainly for financial data
- `ydata-profiling`: data profiling tool
- `great_tables`: for displaying tables.

In addition, we will use the `datetime` module to manipulate dates and times.

In [None]:
import warnings
warnings.filterwarnings("ignore")

In [None]:
import json

In [None]:
#import skimpy

In [None]:
from ydata_profiling import ProfileReport

In [None]:
import dataprofiler

In [None]:
from great_tables import GT

In [None]:
import datetime
import numpy as np
import pandas as pd

In [None]:
print(f'Using Numpy version:  {np.__version__}')
print(f'Using Pandas version: {pd.__version__}')

#### Notebook settings

At most 4 rows of data will be displayed:

In [None]:
pd.set_option('display.max_rows', 4)

Print floating point numbers using fixed point notation:

In [None]:
np.set_printoptions(suppress=True)

#### Graphics

In [None]:
import matplotlib.pyplot as plt

In [None]:
import seaborn as sns

In [None]:
import plotly.express as px

- There are five preset Seaborn themes: `darkgrid` (default), `whitegrid`, `dark`, `white`, and `ticks`.
- They are each suited to different applications and personal preferences. 

In [None]:
sns.set_style("whitegrid")

- The four preset contexts, in order of relative size, are `paper`, `notebook` (default), `talk`, and `poster`.

In [None]:
sns.set_context("paper")

Remove spine:

In [None]:
sns.despine();

# <font color="red">Perform data profiling in real applications</font>


## <font color="blue">Arctic Oscillation and North Atlantic Oscillation  Datasets</font>

__[Arctic oscillation](https://en.wikipedia.org/wiki/Arctic_oscillation)__

- The Arctic oscillation (AO) or Northern Annular Mode/Northern Hemisphere Annular Mode (NAM) is a weather phenomenon at the Arctic pole north of 20 degrees latitude.
- It describes how pressure patterns are distributed over the Arctic region and the middle latitudes of the Northern Hemisphere.
- It is an important mode of climate variability for the Northern Hemisphere.
- A negative AO index causes the jet stream to weaken and dip into the mid-latitudes, allowing Arctic air to spill out and creating cold-air outbreaks.
- A positive AO index causes the jet stream to move farther north than normal.


__[North Atlantic Oscillation](https://en.wikipedia.org/wiki/North_Atlantic_oscillation)__

- The North Atlantic Oscillation (NAO) is a weather phenomenon in the North Atlantic Ocean of fluctuations in the difference of atmospheric pressure at sea level (SLP) between the Icelandic Low and the Azores High.
- The NAO determines the speed and direction of the westerly winds across the North Atlantic, as well as winter sea surface temperature.
   - A negative NAO leads to a weaker westerly flow into Western Europe; the more negative, the more this is true.
   - When the NAO index is positive, enhanced westerly flow across the North Atlantic during winter moves relatively warm (and moist) maritime air over much of Europe and far downstream across Asia, while stronger northerlies over Greenland and northeastern Canada carry cold air southward and decrease land temperatures and SST over the northwest Atlantic.

__Read the North Atlantic Oscillation (NAO) data__

In [None]:
nao_url = "http://www.cpc.ncep.noaa.gov/products/precip/CWlink/pna/norm.nao.monthly.b5001.current.ascii"

In [None]:
nao_df = pd.read_table(nao_url, sep=r'\s+',  # raw string avoids an invalid-escape warning
                       parse_dates={'dates':[0, 1]}, 
                       header=None)
nao_df.columns = ["dates", "NAO"]
nao_df

__Read the Arctic Oscillation (AO) data__

In [None]:
ao_url = "http://www.cpc.ncep.noaa.gov/products/precip/CWlink/daily_ao_index/monthly.ao.index.b50.current.ascii"

In [None]:
ao_df = pd.read_table(ao_url, sep=r'\s+',  # raw string avoids an invalid-escape warning
                      parse_dates={'dates':[0, 1]}, 
                      header=None)
ao_df.columns = ["dates", "AO"]
ao_df

__Create a Pandas DataFrame by combining the two Pandas objects__

The `ao_df` and `nao_df` DataFrames have the same number of rows and share the `dates` column. We can use the `merge()` method to combine them; when no `on` argument is given, the join is done on the columns the two DataFrames have in common.

In [None]:
aonao_df = ao_df.merge(nao_df, how='outer')
aonao_df
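
When profiling a merged dataset, it can help to make the join key explicit and to check where each row came from. The sketch below uses two small invented DataFrames (`ao_demo` and `nao_demo`, with made-up index values) rather than the downloaded data; `validate` and `indicator` are standard `merge()` parameters.

```python
import pandas as pd

# Hypothetical stand-ins for ao_df and nao_df; the values are invented.
ao_demo = pd.DataFrame({
    "dates": pd.to_datetime(["2020-01-01", "2020-02-01", "2020-03-01"]),
    "AO": [0.1, -0.2, 0.3],
})
nao_demo = pd.DataFrame({
    "dates": pd.to_datetime(["2020-02-01", "2020-03-01", "2020-04-01"]),
    "NAO": [0.5, -0.6, 0.7],
})

merged = ao_demo.merge(
    nao_demo,
    on="dates",             # join on the shared date column explicitly
    how="outer",            # keep dates present in either table
    validate="one_to_one",  # raise an error if a date is repeated in either table
    indicator=True,         # add a "_merge" column: left_only, right_only, or both
)

print(merged)
print(merged["_merge"].value_counts())
```

Rows flagged `left_only` or `right_only` are dates that appear in only one of the two sources, which shows up as missing values after an outer merge.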

In [None]:
#aonao_df = pd.concat([ao_df, nao_df["NAO"]], axis=1)
#aonao_df

__We can use `great_tables` to better display the content of the DataFrame__

In [None]:
GT(aonao_df).tab_header(
    title="Combined AO and NAO datasets"
)

### <font color="green">Obtain basic information on the columns</font>

In [None]:
aonao_df.info()

__Quick observations__

- There appear to be no missing values.
- There is a consistency in the data type of each column.
- We are dealing with time series data.

### <font color="green">Descriptive statistics</font>

In [None]:
aonao_df.describe().T

In [None]:
#skimpy.skim(aonao_df)

### <font color="green">Basic plots</font>

__Time series plots of AO and NAO__

In [None]:
fig, ax = plt.subplots(figsize=(9,4))
sns.lineplot(data=aonao_df, x="dates", y="AO", ax=ax)
sns.lineplot(data=aonao_df, x="dates", y="NAO", ax=ax)

In [None]:
fig = px.line(aonao_df, x="dates", y=["AO", "NAO"],
              hover_data={"dates": "|%b %Y"},
              title='AO and NAO')
fig.show()

__Time series plot of NAO only__

In [None]:
fig, ax = plt.subplots(figsize=(9,4))
sns.lineplot(data=aonao_df[aonao_df["NAO"]>=0], x="dates", y="NAO", ax=ax)
sns.lineplot(data=aonao_df[aonao_df["NAO"]<0], x="dates", y="NAO", ax=ax)

In [None]:
fig = px.line(aonao_df, x='dates', y="NAO", 
              color=aonao_df["NAO"]>=0, 
              color_discrete_map={True: "red", False: "blue"},
              range_y=[-4, 4])
# Set white background
fig.update_layout(
    plot_bgcolor="white",
    showlegend=False
)

# Change grid color and axis colors
fig.update_xaxes(showline=True, linewidth=2, linecolor='black', gridcolor='lightgrey')
fig.update_yaxes(showline=True, linewidth=2, linecolor='black', gridcolor='lightgrey')

fig.show()

__Time series plot of AO only__

In [None]:
fig, ax = plt.subplots(figsize=(9,4))
sns.lineplot(data=aonao_df[aonao_df["AO"]>=0], x="dates", y="AO", ax=ax)
sns.lineplot(data=aonao_df[aonao_df["AO"]<0], x="dates", y="AO", ax=ax)

In [None]:
fig = px.line(aonao_df, x='dates', y="AO", 
              color=aonao_df["AO"]>=0, 
              color_discrete_map={True: "red", False: "blue"},
              range_y=[-4, 4])
# Set white background
fig.update_layout(
    plot_bgcolor="white",
    showlegend=False
)

# Change grid color and axis colors
fig.update_xaxes(showline=True, linewidth=2, linecolor='black', gridcolor='lightgrey')
fig.update_yaxes(showline=True, linewidth=2, linecolor='black', gridcolor='lightgrey')

fig.show()

### <font color="green">Histograms</font>

In [None]:
sns.histplot(data=aonao_df, x='NAO')

In [None]:
sns.histplot(data=aonao_df, x='AO')

In [None]:
fig = px.histogram(aonao_df, x="NAO", color=aonao_df["NAO"]>0)
fig.update_layout(
    plot_bgcolor="white",
    showlegend=False
)
fig.show()

In [None]:
fig = px.histogram(aonao_df, x="AO", color=aonao_df["AO"]>0)
fig.update_layout(
    plot_bgcolor="white",
    showlegend=False
)
fig.show()

### <font color="green">Finding outliers</font>

We use a __boxplot__:

- Pictorial representation of distribution of data which shows extreme values, median and quartiles.
- Shows robust measures of location and spread as well as providing information about symmetry and outliers.
   - The range of the data provides a measure of spread and is equal to the difference between the largest data point (max) and the smallest one (min).
   - The interquartile range (`IQR`), which is the range covered by the middle 50% of the data.
   - `IQR=Q3-Q1`, the difference between the third and first quartiles.
      - The first quartile (`Q1`) is the value such that one quarter (25%) of the data points fall below it, or the median of the bottom half of the data.
      - The third quartile (`Q3`) is the value such that three quarters (75%) of the data points fall below it, or the median of the top half of the data.
   - The `IQR` can be used to detect outliers using the 1.5(IQR) criterion. Outliers are observations that fall below `Q1-1.5(IQR)` or above `Q3+1.5(IQR)`.
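
The 1.5(IQR) criterion can be worked through on a small array. This is a sketch with invented numbers (`data`), using NumPy's default linear interpolation to compute the quartiles.

```python
import numpy as np

# Hypothetical sample with one value far from the rest.
data = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 30.0])

# First and third quartiles, then the interquartile range.
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Fences of the 1.5(IQR) criterion.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Observations outside the fences are flagged as outliers.
outliers = data[(data < lower) | (data > upper)]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"Fences: [{lower}, {upper}]")
print(f"Outliers: {outliers}")
```

Quartile conventions differ slightly between tools: the "median of each half" definition would give `Q1=2.5` and `Q3=7.5` for this sample, whereas NumPy's default interpolation gives `Q1=3.0` and `Q3=7.0`. The flagged outlier is the same in both cases.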

In [None]:
def create_boxplots(mydf):
    """
    Create a boxplot for each column of the DataFrame.
    """
    # Get the column names
    column_names = mydf.columns
    fig, axes = plt.subplots(ncols=len(column_names), figsize=(14,5))
    
    # Create the boxplots with Seaborn
    for name, axis in zip(column_names, axes):
        sns.boxplot(data=mydf[name], ax=axis) 
        axis.set_xlabel(name, rotation=45)
        axis.set(xticklabels=[], xticks=[], ylabel='')

    # Show the plot
    plt.tight_layout()

In [None]:
create_boxplots(aonao_df)

In [None]:
aonao_df

__Dealing with outliers using IQR__

In [None]:
Q1 = aonao_df.quantile(0.25)
Q3 = aonao_df.quantile(0.75)
IQR = Q3 - Q1

In [None]:
df_outlier_IQR = aonao_df[~((aonao_df < (Q1-1.5*IQR)) | (aonao_df > (Q3+1.5*IQR))).any(axis=1)]
df_outlier_IQR.shape

In [None]:
create_boxplots(df_outlier_IQR)

### <font color="green">violinplot</font>

- Draw a combination of boxplot and kernel density estimate.
- It shows the distribution of quantitative data across several levels of one (or more) categorical variables such that those distributions can be compared.
- This can be an effective and attractive way to show multiple distributions of data at once.

In [None]:
fig,  ax = plt.subplots(1, 2, figsize=(8,6))

for i, name in enumerate(['AO', 'NAO']):
    sns.violinplot(aonao_df[name], ax=ax[i]);

In [None]:
fig,  ax = plt.subplots(1, 2, figsize=(8,6))

for i, name in enumerate(['AO', 'NAO']):
    sns.violinplot(df_outlier_IQR[name], ax=ax[i]);

### <font color="green">Heatmap</font>

- Represent the individual values that are contained in a matrix as colors.
- Create a correlation matrix that measures the linear relationships between the variables.
- Pairs that are highly correlated represent the same variance of the dataset, so we can analyze them further to understand which attribute in each pair is most significant for building the model.
- A number on the map close to -1, 0, or 1 indicates a strong inverse relationship, no linear relationship, or a strong direct relationship, respectively.
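
To make the meaning of the correlation values concrete, here is a small sketch with constructed columns (all names are invented): one column rises with `x`, one falls with `x`, and one depends on `x` symmetrically so that its linear correlation with `x` vanishes.

```python
import numpy as np
import pandas as pd

x = np.arange(10, dtype=float)

demo = pd.DataFrame({
    "x": x,
    "direct": 2 * x + 1,            # increases linearly with x
    "inverse": -3 * x + 5,          # decreases linearly with x
    "symmetric": (x - 4.5) ** 2,    # depends on x, but not linearly
})

# Pearson correlation matrix (the default method of DataFrame.corr).
corr = demo.corr()
print(corr.round(2))
```

The `symmetric` column shows that a correlation near 0 only rules out a linear relationship: the column is fully determined by `x`, yet its Pearson correlation with `x` is essentially zero.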

In [None]:
correlation_matrix = aonao_df[['AO', 'NAO']].corr()  

In [None]:
sns.heatmap(correlation_matrix);

#### Heatmap by year/month

In [None]:
aonao_df["Year"] = aonao_df.dates.apply(lambda x: x.year)
aonao_df["Month"] = aonao_df.dates.apply(lambda x: x.strftime("%b"))
aonao_df

In [None]:
nao_pt = aonao_df.pivot_table(index="Year", columns="Month", values="NAO")
fig, ax = plt.subplots(figsize=(9, 8))
sns.heatmap(nao_pt, 
            annot=True,
            fmt=".2f",
            annot_kws={"size": 6},
            linewidths=.5,
            ax=ax);

In [None]:
ao_pt = aonao_df.pivot_table(index="Year", columns="Month", values="AO")
fig, ax = plt.subplots(figsize=(9, 8))
sns.heatmap(ao_pt, 
            annot=True,
            fmt=".2f",
            annot_kws={"size": 6},
            linewidths=.5,
            ax=ax);
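
One detail to watch with month-name columns: a pivot table sorts plain string labels alphabetically (`Apr`, `Aug`, `Dec`, ...), not in calendar order. A possible workaround, sketched here on an invented DataFrame (`demo`), is to store the month as an ordered categorical before pivoting.

```python
import calendar
import pandas as pd

# Hypothetical monthly series covering two full years.
dates = pd.date_range("2020-01-01", periods=24, freq="MS")
demo = pd.DataFrame({"dates": dates, "value": range(24)})

demo["Year"] = demo["dates"].dt.year

# Abbreviated month names in calendar order (first entry of month_abbr is empty).
month_order = list(calendar.month_abbr)[1:]

# An ordered categorical keeps the calendar order when used as pivot columns.
demo["Month"] = pd.Categorical(
    demo["dates"].dt.strftime("%b"),
    categories=month_order,
    ordered=True,
)

table = demo.pivot_table(index="Year", columns="Month", values="value", observed=False)
print(table)
```

The resulting `table` can be passed to `sns.heatmap` in the same way as the pivot tables above, with columns running from January to December.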

### <font color="green">Using `Data Profiler`</font>

In [None]:
aonao_profile1 = dataprofiler.Profiler(aonao_df)

In [None]:
# print the report using json to prettify.
aonao_report = aonao_profile1.report(report_options={"output_format":"pretty"})
print(json.dumps(aonao_report, indent=4))

In [None]:
# read a specified column, in this case it is labeled 0:
print(json.dumps(aonao_report["data_stats"][0], indent=4))

In [None]:
# read a specified column, in this case it is labeled 1:
print(json.dumps(aonao_report["data_stats"][1], indent=4))

### <font color="green"> Using `ydata-profiling`</font>

In [None]:
aonao_profile2 = ProfileReport(aonao_df, title="Profiling Report")

In [None]:
aonao_profile2.to_notebook_iframe()

## <font color="blue">AERONET Observations at Goddard</font>

- [AERONET](https://aeronet.gsfc.nasa.gov/) (AErosol RObotic NETwork) is a globally distributed network of identical robotically controlled ground-based sun/sky scanning radiometers. 
- Each instrument measures the intensity of sun and sky light throughout daylight hours from the ultraviolet through the near-infrared. 
- The program provides a long-term, continuous, and accessible public domain database of aerosol optical, microphysical, and radiative properties for aerosol research, including aerosol characterization, validation of satellite retrievals and model predictions, and synergism with other databases.
- Here are some Science benefits of AERONET:
     - AERONET measurements are used to validate and advance algorithm development of satellite retrievals of aerosols.
     - Aerosol transport models use aerosol data from AERONET to validate and improve model algorithms.
     - Aerosol assimilation models as well as weather prediction models use real time AERONET data to improve predictions.
     - Long-term commitment to AERONET sites worldwide provides assessment of the regional climatological impact of aerosols (e.g., aerosol amount, size, and heating or cooling effects).
- Over 840 stations worldwide.
- Here, we analyze the measurements (Aerosol Optical Depth (AOD)) at the [NASA GSFC](https://aeronet.gsfc.nasa.gov/new_web/photo_db_v3/GSFC.html) site.

In [None]:
url = "https://portal.nccs.nasa.gov/datashare/astg/training/python/pandas/aeronet/"

filename = url+"19930101_20210102_GSFC.lev20"

In [None]:
dateparse = lambda x: datetime.datetime.strptime(x, '%d:%m:%Y %H:%M:%S')
aeronet_df = pd.read_csv(filename, skiprows=6, na_values=-999,
                 parse_dates={'datetime': [0, 1]}, 
                 #date_parser=dateparse, # date_parser=dateparse
                 index_col=0) 
                 #squeeze=True)

In [None]:
aeronet_df

__Basic information on each column__

In [None]:
aeronet_df.info()

#### Number of unique values in each column

In [None]:
unique_counts = aeronet_df.nunique()
print(unique_counts.to_string())

__Quick observations__

- There are 6835 data points.
- Many columns only have `NaN` and need to be deleted.
- Many columns only have one value and need to be removed.

#### Initial data cleaning

__Identify and delete columns with all `NaN` values__

In [None]:
var = aeronet_df.isnull().sum()
print(var.to_string())

In [None]:
nan_columns = aeronet_df.columns[aeronet_df.isna().all()].tolist()
nan_columns

In [None]:
aeronet_df.dropna(axis=1, how='all', inplace=True)

In [None]:
aeronet_df

__Identify and delete columns with only one unique value__

In [None]:
unique_value_columns = [col for col in aeronet_df.columns if aeronet_df[col].nunique() == 1]

The list comprehension above is equivalent to the following explicit loop:

```python
unique_value_columns = list()
for col in aeronet_df.columns:
    if aeronet_df[col].nunique() == 1:
        unique_value_columns.append(col)
```

In [None]:
unique_value_columns

In [None]:
aeronet_df.drop(unique_value_columns, axis=1, inplace=True)
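
The two clean-up steps (all-`NaN` columns and single-valued columns) could be wrapped in one reusable helper. The function name `drop_uninformative_columns` and the toy data are hypothetical. Note one deliberate difference from the cells above: this sketch counts `NaN` as a distinct value (`nunique(dropna=False)`), so a column holding one value plus some gaps is kept rather than dropped.

```python
import numpy as np
import pandas as pd

def drop_uninformative_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df without all-NaN columns and without constant columns."""
    # Step 1: remove columns in which every entry is missing.
    cleaned = df.dropna(axis=1, how="all")
    # Step 2: remove columns with a single distinct value (NaN counted as a value).
    constant = [col for col in cleaned.columns
                if cleaned[col].nunique(dropna=False) == 1]
    return cleaned.drop(columns=constant)

# Hypothetical data covering each case.
toy = pd.DataFrame({
    "empty": [np.nan, np.nan, np.nan],
    "constant": [1, 1, 1],
    "mostly_constant": [5.0, np.nan, 5.0],
    "useful": [0.1, 0.2, 0.3],
})

result = drop_uninformative_columns(toy)
print(result.columns.tolist())
```

With the default `nunique()` used in the notebook, `mostly_constant` would also be removed; which behavior is preferable depends on whether the pattern of missing values is itself informative.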

In [None]:
aeronet_df

__We can use `great_tables` to better display the content of the DataFrame__

In [None]:
GT(aeronet_df)

In [None]:
aeronet_df.info()

__Quick observations__

- We initially started with 79 columns but we now have 30.
- We still have missing values in some of the columns.
- More than half of the values of the column `AOD_1640nm` are missing values. We can remove the column.

In [None]:
aeronet_df.drop(['AOD_1640nm'], axis=1, inplace=True)
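
Rather than reading the missing-value counts by eye, the fraction of missing values per column can be computed and compared with a threshold. The sketch below uses invented data and a threshold of 0.5 to mirror the "more than half" rule; both the threshold and the column names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical columns with different amounts of missing data.
toy = pd.DataFrame({
    "sparse": [np.nan, np.nan, np.nan, 1.0],
    "dense": [1.0, 2.0, np.nan, 4.0],
    "full": [1.0, 2.0, 3.0, 4.0],
})

# Fraction of missing entries per column (mean of a boolean mask).
missing_fraction = toy.isna().mean()

# Columns where more than half of the values are missing.
threshold = 0.5
to_drop = missing_fraction[missing_fraction > threshold].index.tolist()

reduced = toy.drop(columns=to_drop)
print(missing_fraction)
print(f"Dropped: {to_drop}")
```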

#### Descriptive statistics

In [None]:
aeronet_df.describe().T

In [None]:
#skimpy.skim(aeronet_df)

#### Boxplot

In [None]:
create_boxplots(aeronet_df)

####  Scatterplot matrix

In [None]:
sns.pairplot(aeronet_df)

#### Create the heatmap

In [None]:
aeronet_corr = aeronet_df.corr()

In [None]:
sns.heatmap(aeronet_corr);

In [None]:
mat = px.imshow(aeronet_corr, x=aeronet_df.columns, 
                 y=aeronet_df.columns, title="Correlation matrix", width=900, height=900)
mat.show()

__Quick observations__

- Based on the scatterplot matrix and the heatmap, there appear to be three groups of fields:
  1. Group 1 that has `AOD_1640nm` to `AOD_340nm`. The fields are strongly correlated to each other. They all have float as data type.
  2. Group 2 that has `N[AOD_1640nm]` to `N[340-440_Angstrom_Exponent]`. The fields are strongly correlated to each other. They all have int as data type.
  3. Group 3 with the remaining fields not including `Day_of_Year`, `Precipitable_Water(cm)`, and `AERONET_Instrument_Number`. The fields here have moderate to strong correlation with each other.
- `Precipitable_Water(cm)` has a moderate positive correlation with fields in Group 1.

#### Using `Data Profiler`

In [None]:
aeronet_profile1 = dataprofiler.Profiler(aeronet_df)

In [None]:
# print the report using json to prettify.
aeronet_report = aeronet_profile1.report(report_options={"output_format":"pretty"})
print(json.dumps(aeronet_report, indent=4))

In [None]:
# read a specified column, in this case it is labeled 1:
print(json.dumps(aeronet_report["data_stats"][1], indent=4))

#### Using `ydata-profiling`

In [None]:
aeronet_profile2 = ProfileReport(aeronet_df, title="Profiling Report")

In [None]:
aeronet_profile2.to_notebook_iframe()

---

## <font color="blue">Analyzing CME activities</font>

We read a file (in JSON format) that contains a collection of coronal mass ejection (CME) activities. For each recorded CME, we want to extract the following parameters:

- `start_time`
- `speed`
- `longitude`
- `latitude`
- `halfAngle`
- The list of instruments

### Get the remote CME json file

In [None]:
cme_filename = "all_cmes.json"
cme_url = f"https://raw.githubusercontent.com/barbarajthompson/TCMM_Maps/refs/heads/main/{cme_filename}"

In [None]:
import urllib.request
urllib.request.urlretrieve(cme_url, cme_filename)

### Read the file

In [None]:
with open(cme_filename, "r") as fid: 
     cme_activities = json.load(fid)

### Inspect the file content

In [None]:
type(cme_activities)

In [None]:
len(cme_activities)

__Sample record with CME activity__

- Each CME comes as a dictionary.
- There is actual activity if the key `'activeRegionNum'` has a value (different than `None`).

In [None]:
cme_activities[0]

__Sample record without CME activity__

- There was no CME activity because the value associated with the key `'activeRegionNum'` is `None`.
- In our analysis, we exclude such a record.

In [None]:
cme_activities[1]

### Write functions to read all records and create a DataFrame

In [None]:
def get_CME_parameters(cme_activity: dict):
    """
    From a dictionary containing CME activity data on a specific date/time,
    extract the following parameters:

    - starting date/time
    - latitude
    - longitude
    - halfAngle
    - speed
    """

    latitude = cme_activity['cmeAnalyses'][0]['latitude']
    longitude = cme_activity['cmeAnalyses'][0]['longitude']
    half_angle = cme_activity['cmeAnalyses'][0]['halfAngle']
    speed = cme_activity['cmeAnalyses'][0]['speed']
    start_time = cme_activity['startTime']
    list_instruments = list()
    for item in cme_activity['instruments']:
        list_instruments.append(item['displayName'])
    return start_time, speed, latitude, longitude, half_angle, tuple(sorted(list_instruments))

In [None]:
def create_df(cme_activities: list):
    """
    Using a list of CME activities, create a Pandas DataFrame
    with columns:
    
    - start_time
    - latitude
    - longitude
    - half_angle
    - speed
    - instruments
    """
    columns = ["start_time", "speed", "latitude", "longitude", "half_angle", "instruments"]
    df = pd.DataFrame(columns=columns)

    # Loop over the CME events
    for cme_activity in cme_activities:
        # Only process when a CME activity was recorded.
        if cme_activity['cmeAnalyses']:   # This means that an activity was recorded
            df.loc[len(df)] = get_CME_parameters(cme_activity)
    return df  
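
An alternative to appending rows one at a time with `df.loc[len(df)]` is to collect all rows in a list and build the DataFrame once, which tends to scale better for long lists. The sketch below is self-contained: `sample_activities` is invented data that mimics only the keys used above, and `extract_row` is a hypothetical helper playing the role of `get_CME_parameters`.

```python
import pandas as pd

# Hypothetical records with only the keys used in this notebook; values are invented.
sample_activities = [
    {
        "startTime": "2020-01-01T00:00Z",
        "instruments": [{"displayName": "B"}, {"displayName": "A"}],
        "cmeAnalyses": [{"latitude": 1.0, "longitude": 2.0,
                         "halfAngle": 10.0, "speed": 300.0}],
    },
    {
        "startTime": "2020-01-02T00:00Z",
        "instruments": [{"displayName": "A"}],
        "cmeAnalyses": None,   # no analysis recorded: this record is skipped
    },
]

def extract_row(cme_activity: dict) -> tuple:
    """Extract one row of parameters from the first analysis of a record."""
    analysis = cme_activity["cmeAnalyses"][0]
    instruments = tuple(sorted(item["displayName"]
                               for item in cme_activity["instruments"]))
    return (cme_activity["startTime"], analysis["speed"], analysis["latitude"],
            analysis["longitude"], analysis["halfAngle"], instruments)

columns = ["start_time", "speed", "latitude", "longitude", "half_angle", "instruments"]

# Collect the rows first, then create the DataFrame in a single call.
rows = [extract_row(activity) for activity in sample_activities
        if activity["cmeAnalyses"]]
sample_df = pd.DataFrame(rows, columns=columns)
print(sample_df)
```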

### Create a DataFrame of CME activities

In [None]:
cme_df = create_df(cme_activities)

In [None]:
cme_df['start_time'] = pd.to_datetime(cme_df['start_time'])

In [None]:
cme_df

__We can use `great_tables` to better display the content of the DataFrame__

In [None]:
GT(cme_df)

In [None]:
cme_df.info()

### Descriptive statistics

In [None]:
cme_df.describe()

In [None]:
#skimpy.skim(cme_df[["speed", "latitude", "longitude", "half_angle"]])

### Pairplot

In [None]:
sns.pairplot(cme_df[["speed", "latitude", "longitude", "half_angle"]])

### Boxplot

In [None]:
new_cme_df = cme_df[["speed", "latitude", "longitude", "half_angle"]]

In [None]:
create_boxplots(new_cme_df)

### Heatmap

In [None]:
mat = px.imshow(new_cme_df.corr(), x=new_cme_df.columns, 
                y=new_cme_df.columns, 
                title="Correlation matrix", 
                width=500, height=500)
mat.show()