# Data Analysis and Visualization

## What is Data Analysis?

Data analysis involves examining datasets to understand:
- The distribution of various statistics across the dataset
- Potential data issues
- Patterns and anomalies in the data
- Correlations between different metrics
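
The four goals above can be sketched on a toy table with pandas. This is only an illustration: the column names and values are made up and unrelated to the demo dataset or to how the analyzer computes its statistics.

```python
import pandas as pd

# Toy per-sample statistics; names and values are illustrative only.
toy = pd.DataFrame({
    "text_len": [120, 95, 4000, 110, None],
    "num_words": [20, 16, 650, 18, None],
})

# Distribution of each statistic (count, mean, std, quartiles, ...)
summary = toy.describe()

# Potential data issues: missing values per column
missing = toy.isna().sum()

# Anomalies: values more than 1.5 * IQR above the third quartile
q1, q3 = toy["text_len"].quantile([0.25, 0.75])
outliers = toy[toy["text_len"] > q3 + 1.5 * (q3 - q1)]

# Correlation between metrics (Pearson by default)
corr = toy.corr()

print(summary, missing, outliers, corr, sep="\n\n")
```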

## In This Notebook

1. Setting up the analyzer
2. Running data analysis
3. Understanding analysis results
4. Visualizing data distributions
5. Interpreting correlation analysis
6. Using analysis for data processing decisions
7. Running auto analysis mode

## Setup

We will use a demo analysis recipe that computes various statistics for the demo dataset.

In [None]:
# Copy the demo analyzer config and the demo dataset from the repository
%mkdir -p configs/demo
%mkdir -p demos/data
%cp ../configs/demo/analyzer.yaml configs/demo/
%cp ../demos/data/demo-dataset.jsonl demos/data/

print("Configuration files copied successfully!")

In [None]:
# Let's examine the existing demo analyzer configuration
print("Demo analyzer config (configs/demo/analyzer.yaml):")
!cat configs/demo/analyzer.yaml

## Running Data Analysis

Now let's run the data analysis using the recipe we copied above. The Analyzer computes statistics for each filter operator in the recipe and generates visualizations.

In [None]:
print("Running analysis with demo config...")
!dj-analyze --config configs/demo/analyzer.yaml

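You can also run the analysis directly from Python. The cell below is commented out because the command above has already produced the results; uncomment it to try the Python route.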

In [None]:
# from data_juicer.config import init_configs
# from data_juicer.core import Analyzer

# cfg = init_configs(args=['--config', 'configs/demo/analyzer.yaml'])
# analyzer = Analyzer(cfg)
# analyzer.run()

## Understanding Analysis Results

The analyzer generates several types of analysis results:
1. Overall statistics
2. Column-wise distributions
3. Correlation analysis

Let's examine these results:

In [None]:
import os
import pandas as pd

# Load the overall analysis results
overall_df = pd.read_csv('./outputs/demo-analyzer/analysis/overall.csv', index_col=0)
print("Overall Statistics:")
print(overall_df)

# Show what files were generated
print("\nGenerated analysis files:")
for root, dirs, files in os.walk('./outputs/demo-analyzer/analysis'):
    for file in files:
        print(f"  {os.path.join(root, file)}")
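
One way to read a summary table like this is to compare the spread of each statistic. The sketch below assumes a table with one column per statistic and one row per aggregate (`mean`, `std`, and so on), which is what `DataFrame.describe()` produces. It builds a stand-in table from made-up values so it does not depend on any generated files; `stat_a` and `stat_b` are hypothetical names.

```python
import pandas as pd

# Stand-in for a summary table: one column per statistic, one row per
# aggregate. Built with describe() so the sketch needs no files on disk.
samples = pd.DataFrame({
    "stat_a": [0.90, 0.95, 0.92, 0.91],     # tightly clustered
    "stat_b": [10.0, 500.0, 30.0, 2000.0],  # widely spread
})
summary = samples.describe()

# Coefficient of variation (std / mean): a rough, unit-free measure of spread
cv = summary.loc["std"] / summary.loc["mean"]

# Statistics with a wide spread are often the ones worth a closer look
wide = cv[cv > 1.0].index.tolist()
print(cv.round(3))
print("Widely spread statistics:", wide)
```

The cutoff of 1.0 is arbitrary; it only serves to separate the two toy columns.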

## Visualizing Data Distributions

The analyzer automatically generates visualizations of the data distributions. Let's display some of these charts:

In [None]:
import matplotlib.pyplot as plt
from PIL import Image

# Display the combined stats visualization if it exists
combined_stats_path = './outputs/demo-analyzer/analysis/all-stats.png'
if os.path.exists(combined_stats_path):
    img = Image.open(combined_stats_path)
    plt.figure(figsize=(15, 10))
    plt.imshow(img)
    plt.axis('off')
    plt.title('Combined Statistics Visualization')
    plt.show()
else:
    print(f"Combined stats visualization not found at {combined_stats_path}")
    
# Display up to three individual histograms, if any were generated
analysis_dir = './outputs/demo-analyzer/analysis/'
if os.path.isdir(analysis_dir):
    histogram_files = sorted(f for f in os.listdir(analysis_dir) if f.endswith('-hist.png'))
else:
    histogram_files = []
    print(f"Analysis directory not found at {analysis_dir}")

for hist_file in histogram_files[:3]:
    img = Image.open(os.path.join(analysis_dir, hist_file))
    plt.figure(figsize=(10, 6))
    plt.imshow(img)
    plt.axis('off')
    plt.title(f'Histogram: {hist_file}')
    plt.show()
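
The file lookups in these cells follow one pattern: list a directory and keep the files whose names match. A small helper can make that pattern tolerant of a missing output directory. This is a minimal sketch; `find_charts` is a hypothetical name and not part of Data-Juicer.

```python
from pathlib import Path

def find_charts(directory, suffix=".png", contains=""):
    """Return sorted chart paths in `directory`, or [] if it does not exist."""
    root = Path(directory)
    if not root.is_dir():
        return []
    return sorted(
        p for p in root.iterdir()
        if p.is_file() and p.name.endswith(suffix) and contains in p.name
    )

# Usage with the output directory used in this notebook
for chart in find_charts("./outputs/demo-analyzer/analysis", suffix="-hist.png"):
    print(chart.name)
```

Returning an empty list instead of raising keeps the display loops simple when the analysis has not been run yet.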

## Correlation Analysis

The analyzer also computes correlations between the different statistics, which helps identify metrics that move together:

In [None]:
# Display correlation heatmaps, if any were generated
# (reuses analysis_dir from the previous cell)
correlation_files = sorted(f for f in os.listdir(analysis_dir) if 'corr' in f and f.endswith('.png'))

for corr_file in correlation_files:
    img = Image.open(os.path.join(analysis_dir, corr_file))
    plt.figure(figsize=(10, 8))
    plt.imshow(img)
    plt.axis('off')
    plt.title(f'Correlation Analysis: {corr_file}')
    plt.show()
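
To make the idea behind a correlation heatmap concrete, the sketch below computes a Pearson correlation matrix with pandas on made-up values and lists the strongly related pairs. The column names, the values, and the 0.8 cutoff are assumptions for illustration; they do not reflect the method or the output of the analyzer.

```python
import pandas as pd

# Toy statistics; names and values are illustrative only.
stats = pd.DataFrame({
    "text_len":  [100, 200, 300, 400, 500],
    "num_words": [18, 41, 59, 82, 99],
    "score":     [0.9, 0.2, 0.8, 0.3, 0.7],
})

corr = stats.corr()  # Pearson correlation matrix

# List each pair once (upper triangle) whose |r| exceeds a chosen cutoff
cutoff = 0.8
cols = corr.columns
strong = [
    (cols[i], cols[j], round(corr.iloc[i, j], 3))
    for i in range(len(cols))
    for j in range(i + 1, len(cols))
    if abs(corr.iloc[i, j]) > cutoff
]
print(strong)
```

Strongly correlated statistics carry overlapping information, so filtering on both may add little over filtering on one.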

## Using Analysis for Data Processing Decisions

The analysis results can guide decisions about data processing. For example, we can use the statistics to refine our data recipe:

In [None]:
# Let's examine the detailed statistics to make informed decisions
print("Key statistics from our analysis:")

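# Print each statistic only if the recipe actually produced it
for label, column in [("Language score", "lang_score"), ("Perplexity", "perplexity")]:
    if column in overall_df.columns:
        print(f"\n- {label} statistics:")
        print(overall_df[column])
    else:
        print(f"\n- {label}: column '{column}' not found in overall.csv")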

# Based on these statistics, we might adjust our filtering parameters
print("\nBased on this analysis, we might consider adjusting our recipe parameters:")
print("- Adjust language_id_score_filter min_score based on lang_score distribution")
print("- Adjust perplexity_filter max_ppl based on perplexity distribution")
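
One possible way to turn a distribution into candidate thresholds is to take quantiles, so that a chosen fraction of samples falls outside the bound. The sketch below is an assumption-laden illustration: `suggest_bounds` is a hypothetical helper, the values are made up rather than taken from the demo dataset, and the 10% and 90% quantiles are arbitrary choices.

```python
import pandas as pd

def suggest_bounds(values, lower_q=0.05, upper_q=0.95):
    """Return (lower, upper) quantile bounds for a numeric statistic."""
    series = pd.Series(values, dtype="float64").dropna()
    return float(series.quantile(lower_q)), float(series.quantile(upper_q))

# Made-up per-sample values, not taken from the demo dataset
lang_scores = [0.35, 0.82, 0.91, 0.95, 0.97, 0.98, 0.99]
perplexities = [220, 310, 450, 520, 800, 1500, 9000]

# Keep roughly the top 90% by language score and the bottom 90% by perplexity
min_score, _ = suggest_bounds(lang_scores, lower_q=0.10)
_, max_ppl = suggest_bounds(perplexities, upper_q=0.90)
print(f"candidate min_score ~ {min_score:.2f}, candidate max_ppl ~ {max_ppl:.0f}")
```

Treat such values as starting points; inspect the samples near each bound before committing them to the recipe.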

## Auto Analysis Mode

Data-Juicer also provides an auto analysis mode that automatically analyzes your dataset with all filters that produce statistics:

In [None]:
# Run auto analysis on our dataset
!dj-analyze --auto --dataset_path ./demos/data/demo-dataset.jsonl --auto_num 10

## Next Steps

Continue with the next notebook to learn how to process multimodal data including images, videos, and audio with Data-Juicer.