# Example usage

Here we will demonstrate how to use the `datpro` package to summarize data, detect anomalies, and create visualizations for a dataset.

## Imports

In [1]:
import datpro as dp
import pandas as pd
import numpy as np
import altair as alt
from itertools import combinations

## Load example dataset
We'll use a sample dataset to demonstrate the functionality of the `datpro` package. The dataset contains demographic and transactional data, and the goal is to predict `Income` from the other features: `Age`, `Gender`, `Spending_Score`, and `Region`.

In [2]:
df = pd.read_csv('../data/example_data.csv')
df

| | Age | Income | Spending_Score | Gender | Region |
|---|---|---|---|---|---|
| 0 | 66 | NaN | 26.373678 | Male | South |
| 1 | 65 | 66369.651809 | 20.906870 | Female | South |
| 2 | 59 | 70764.092278 | 47.990597 | Male | West |
| 3 | 64 | 41432.315153 | 31.120625 | Female | North |
| 4 | 53 | 52963.994070 | 12.016596 | Female | East |
| ... | ... | ... | ... | ... | ... |
| 1005 | 18 | 62455.037248 | 22.795113 | Female | North |
| 1006 | 24 | 35361.901205 | 18.846863 | Male | South |
| 1007 | 51 | 56554.072546 | 17.076530 | Female | South |
| 1008 | 63 | 52799.136847 | 42.219961 | Male | East |


In this dataset:

- `Age` is the age of the individual.

- `Income` is the annual income (our target variable for prediction).

- `Spending_Score` quantifies spending behavior.

- `Gender` specifies the gender of the individual.

- `Region` indicates the geographical region.

If you'd like to follow along with the same dataset, you can download our example CSV file [here](https://github.com/UBC-MDS/dataprofiler_group-30/blob/main/data/example_data.csv).

## Summarize data

To summarize the numeric columns in our dataset, use `summarize_data()`. It reports the minimum, 25th percentile (Q1), median (50th percentile), 75th percentile (Q3), and maximum of each numeric column.

In [3]:
dp.summarize_data(df)

| | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|
| Age | 18.0 | 31.0 | 44.0 | 56.0 | 69.0 |
| Income | 6556.169327 | 40915.394217 | 51146.204619 | 60893.485307 | 443001.985244 |
| Spending_Score | 0.536808 | 16.880278 | 26.670824 | 38.786205 | 75.010095 |
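
For readers curious about what such a summary involves, the same five statistics can be reproduced with plain pandas. This is a minimal sketch on a small made-up frame, not `datpro`'s actual implementation; the helper name `five_number_summary` and the toy values are assumptions for illustration only.

```python
import pandas as pd


def five_number_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return min, quartiles, and max per numeric column (hypothetical helper)."""
    numeric = frame.select_dtypes(include="number")
    # quantile() skips NaN values, so missing entries do not affect the result
    summary = numeric.quantile([0, 0.25, 0.5, 0.75, 1]).T
    summary.columns = ["min", "25%", "50%", "75%", "max"]
    return summary


# Made-up data: one complete numeric column, one with a gap, one categorical
toy = pd.DataFrame({
    "Age": [20, 30, 40, 50, 60],
    "Income": [10.0, 20.0, None, 40.0, 50.0],
    "Region": ["North", "South", "East", "West", "North"],
})
summary = five_number_summary(toy)
print(summary)
```

The categorical `Region` column is dropped before the quantiles are computed, which mirrors the output above, where only numeric columns appear.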


## Detect Anomalies
To detect missing values, outliers, and duplicates, use the `detect_anomalies()` function. This function allows you to analyze a dataset and identify potential issues that may impact data quality and analysis results. By specifying a particular anomaly type, you can focus on specific data integrity concerns.

#### Detect All Anomalies

In [4]:
dp.detect_anomalies(df)

{'missing_values': {'Income': {'missing_count': 50,
   'missing_percentage': np.float64(4.95)},
  'Spending_Score': {'missing_count': 30,
   'missing_percentage': np.float64(2.97)}},
 'outliers': {'Income': {'outlier_count': 24, 'outlier_percentage': 2.38},
  'Spending_Score': {'outlier_count': 4, 'outlier_percentage': 0.4}},
 'duplicates': {'duplicate_count': np.int64(10),
  'duplicate_percentage': np.float64(0.99)}}
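
To make the three categories concrete, here is a rough pandas sketch that builds a report with a similar shape. It is not `datpro`'s code: the helper name `basic_anomaly_report` is hypothetical, the data is made up, and the outlier rule (values more than 1.5 × IQR beyond the quartiles) is an assumption, since this page does not state which rule `detect_anomalies()` applies.

```python
import pandas as pd


def basic_anomaly_report(frame: pd.DataFrame) -> dict:
    """Count missing values, IQR outliers, and duplicate rows (hypothetical helper)."""
    n_rows = len(frame)
    report = {"missing_values": {}, "outliers": {}, "duplicates": {}}

    # Missing values: only columns with at least one NaN are reported
    for col, count in frame.isna().sum().items():
        if count > 0:
            report["missing_values"][col] = {
                "missing_count": int(count),
                "missing_percentage": round(float(100 * count / n_rows), 2),
            }

    # Outliers: ASSUMPTION, values beyond 1.5 * IQR from the quartiles
    for col in frame.select_dtypes(include="number").columns:
        q1, q3 = frame[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        mask = (frame[col] < q1 - 1.5 * iqr) | (frame[col] > q3 + 1.5 * iqr)
        if mask.any():
            report["outliers"][col] = {
                "outlier_count": int(mask.sum()),
                "outlier_percentage": round(float(100 * mask.sum() / n_rows), 2),
            }

    # Duplicates: rows that repeat an earlier row exactly
    dup_count = int(frame.duplicated().sum())
    report["duplicates"] = {
        "duplicate_count": dup_count,
        "duplicate_percentage": round(100 * dup_count / n_rows, 2),
    }
    return report


# Made-up data: one extreme value, one missing value, one repeated row
toy = pd.DataFrame({
    "Income": [50.0, 52.0, 51.0, 49.0, 500.0, None, 50.0],
    "Region": ["N", "S", "E", "W", "N", "S", "N"],
})
report = basic_anomaly_report(toy)
print(report)
```

Percentages here are computed against the total row count, which is consistent with the figures in the output above (for example, 10 duplicates reported as 0.99%).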

#### Detect Specific Anomalies
You can specify an anomaly type to check only for particular data issues:

- **Missing Values:** `dp.detect_anomalies(df, anomaly_type='missing_values')`
- **Outliers:** `dp.detect_anomalies(df, anomaly_type='outliers')`
- **Duplicates:** `dp.detect_anomalies(df, anomaly_type='duplicates')`

For example, if you only want to check for missing values:

In [5]:
# Detect only missing values
dp.detect_anomalies(df, anomaly_type='missing_values')

{'missing_values': {'Income': {'missing_count': 50,
   'missing_percentage': np.float64(4.95)},
  'Spending_Score': {'missing_count': 30,
   'missing_percentage': np.float64(2.97)}}}
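
Because the result is a nested dictionary, it can be convenient to flatten one section into a table for reporting. The sketch below hard-codes the missing-values result shown above rather than calling `datpro`, and it assumes the returned structure always has this shape.

```python
import pandas as pd

# Result copied from the output above (np.float64 values written as plain floats)
result = {
    "missing_values": {
        "Income": {"missing_count": 50, "missing_percentage": 4.95},
        "Spending_Score": {"missing_count": 30, "missing_percentage": 2.97},
    }
}

# One row per affected column, one column per reported statistic
missing_table = pd.DataFrame.from_dict(result["missing_values"], orient="index")

# List the columns with the most missing data first
missing_table = missing_table.sort_values("missing_count", ascending=False)
print(missing_table)
```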

The results from `detect_anomalies()` complement those of `summarize_data()` by identifying specific quality issues that require attention. For instance, the missing-value counts can guide imputation strategies, while outliers and duplicates may reduce model accuracy if left unaddressed. Handling these issues early helps make downstream analysis and modeling more reliable.
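
As one possible follow-up, duplicates can be dropped and missing numeric values imputed with plain pandas. Median imputation is only one of several reasonable strategies and is an assumption here, not a recommendation from `datpro`; the toy frame is made up.

```python
import pandas as pd

# Made-up data with one exact duplicate row and one missing income
toy = pd.DataFrame({
    "Age": [25, 25, 40, 31],
    "Income": [30000.0, 30000.0, None, 52000.0],
})

# Remove exact duplicate rows, keeping the first occurrence
cleaned = toy.drop_duplicates().reset_index(drop=True)

# ASSUMPTION: fill missing incomes with the column median
median_income = cleaned["Income"].median()
cleaned["Income"] = cleaned["Income"].fillna(median_income)
print(cleaned)
```

Dropping duplicates before computing the median keeps repeated rows from pulling the imputed value toward themselves.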

## Plotify

#### Default Usage

To visualize the entire dataset with all available plot types:

In [6]:
dp.plotify(df)

Visualizing numeric column: Age

Visualizing numeric column: Income

Visualizing numeric column: Spending_Score

Visualizing categorical column: Gender

Visualizing categorical column: Region

Visualizing numeric vs numeric: Age vs Income

Visualizing numeric vs numeric: Age vs Spending_Score

Visualizing numeric vs numeric: Income vs Spending_Score

Visualizing correlation heatmap

Visualizing numeric vs categorical: Age vs Gender

Visualizing numeric vs categorical: Age vs Region

Visualizing numeric vs categorical: Income vs Gender

Visualizing numeric vs categorical: Income vs Region

Visualizing numeric vs categorical: Spending_Score vs Gender

Visualizing numeric vs categorical: Spending_Score vs Region

Visualizing categorical vs categorical: Gender vs Region

This generates:

- Histograms and density plots for numeric columns like Age, Income, and Spending_Score.
- Bar charts for categorical columns like Gender and Region.
- Scatter plots for pairwise numeric columns.
- A correlation heatmap for numeric columns.
- Box plots comparing numeric columns with categorical columns.
- Stacked bar charts for pairwise categorical columns.
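
The pairwise plots listed in the printed messages above follow from enumerating column pairs by dtype. The sketch below reproduces that enumeration with `itertools.combinations` and `itertools.product` on a made-up frame; it illustrates the idea only and is not taken from `plotify()`'s source.

```python
from itertools import combinations, product

import pandas as pd

# Made-up rows using the same column names as the example dataset
toy = pd.DataFrame({
    "Age": [66, 65, 59],
    "Income": [41000.0, 66000.0, 70000.0],
    "Spending_Score": [26.4, 20.9, 48.0],
    "Gender": ["Male", "Female", "Male"],
    "Region": ["South", "South", "West"],
})

# Split the columns by dtype
numeric_cols = list(toy.select_dtypes(include="number").columns)
categorical_cols = list(toy.select_dtypes(exclude="number").columns)

# Unordered pairs within each group, and every numeric/categorical combination
numeric_pairs = list(combinations(numeric_cols, 2))
mixed_pairs = list(product(numeric_cols, categorical_cols))
categorical_pairs = list(combinations(categorical_cols, 2))

for a, b in numeric_pairs:
    print(f"numeric vs numeric: {a} vs {b}")
```

With three numeric and two categorical columns this yields three numeric pairs, six mixed pairs, and one categorical pair, matching the number of pairwise messages printed above.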

#### Specific Plot Types

To generate only specific plot types, pass them in the `plot_types` parameter. For example:

In [7]:
# Generate only the 'density' and 'bar' plot types
dp.plotify(df, plot_types=['density', 'bar'])

# Generate only the correlation heatmap
dp.plotify(df, plot_types=['correlation'])

Visualizing numeric column: Age

Visualizing numeric column: Income

Visualizing numeric column: Spending_Score

Visualizing categorical column: Gender

Visualizing categorical column: Region

Visualizing correlation heatmap


This generates:

- Histograms for numeric columns like Age, Income, and Spending_Score.
- Bar charts for categorical columns like Gender and Region.
- A correlation heatmap for numeric columns.

`plotify()` automatically handles missing values by ignoring them in the visualizations. For instance, density plots and histograms will exclude NaN values. Outliers are included in the visualizations, offering insights into their impact on the data distribution.
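
To see what excluding NaN values while keeping outliers means for a histogram, the binning can be reproduced with numpy after dropping the missing entries. This is a sketch with made-up values and an arbitrary bin count; how `plotify()` filters data internally is not shown on this page.

```python
import numpy as np
import pandas as pd

# Made-up scores with two missing entries and one extreme value
scores = pd.Series([10.0, 12.0, np.nan, 15.0, 11.0, np.nan, 75.0])

# Drop NaN before binning; the extreme value 75.0 is kept
valid = scores.dropna()
counts, edges = np.histogram(valid, bins=5)
print(counts, edges)
```

The single extreme value stretches the bin range, so most observations fall into the first bin. This is the kind of effect on the distribution that leaving outliers in the plots makes visible.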

## Conclusion
The `datpro` package provides a modular way to explore and profile your dataset. This walkthrough covered the three functions shown here: `summarize_data()`, `detect_anomalies()`, and `plotify()`. Depending on your analysis goals, you may still need additional cleaning steps, such as handling missing values or outliers.

Feel free to replace the example dataset with your own data and adjust the function calls as needed.