# Explore Binance BTCUSDT Dataset

This notebook explores the `/tmp/BTCUSDT-1s-2024-05.csv` dataset used in the Airflow pipeline. It loads the data, then shows its structure, summary statistics, and a few plots.

## 1. Load Dataset

Load the Binance BTCUSDT 1s kline CSV file from `/tmp/BTCUSDT-1s-2024-05.csv`.

In [None]:
import pandas as pd
import os

csv_path = '/tmp/BTCUSDT-1s-2024-05.csv'
if os.path.exists(csv_path):
    # The file has no header row, so the column names are supplied here.
    columns = [
        "open_time", "open", "high", "low", "close", "volume",
        "close_time", "quote_asset_volume", "number_of_trades",
        "taker_buy_base_asset_volume", "taker_buy_quote_asset_volume", "ignore"
    ]
    df = pd.read_csv(csv_path, header=None, names=columns)
    print('Loaded dataset with shape:', df.shape)
    display(df.head())
else:
    print('Dataset not found at', csv_path)
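
The `open_time` and `close_time` columns are loaded as plain integers. The sketch below shows one way to turn them into timestamps. It assumes the values are Unix timestamps in milliseconds, which this notebook does not state, and it uses a small hypothetical `sample` frame instead of the real file.

```python
import pandas as pd

# Hypothetical two-row sample using two of the CSV's columns.
sample = pd.DataFrame({
    "open_time": [1714521600000, 1714521601000],
    "close": [60000.0, 60001.5],
})

# Assumption: open_time is a Unix timestamp in milliseconds.
sample["open_dt"] = pd.to_datetime(sample["open_time"], unit="ms", utc=True)
print(sample[["open_dt", "close"]])
```

If the conversion yields dates far outside May 2024, the unit assumption is probably wrong for this file.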

## 2. Explore Dataset Structure

View columns, data types, and a sample of the data.

In [None]:
if 'df' in globals():
    print('Columns:', df.columns.tolist())
    print('Data types:')
    print(df.dtypes)
    display(df.sample(5))
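
Beyond column names and types, a few basic integrity checks can be useful at this stage. The following is a minimal sketch on a hypothetical `sample` frame, not on the real dataset. It counts missing values, counts repeated `open_time` values, and checks that rows are ordered by `open_time`.

```python
import pandas as pd

# Hypothetical sample with one duplicated timestamp and one missing value.
sample = pd.DataFrame({
    "open_time": [1000, 2000, 2000, 4000],
    "close": [1.0, 2.0, 2.0, None],
})

# Number of missing values in each column.
missing_per_column = sample.isna().sum()
# Rows whose open_time already appeared earlier in the frame.
duplicate_rows = int(sample.duplicated(subset="open_time").sum())
# True when open_time never decreases from one row to the next.
is_sorted = sample["open_time"].is_monotonic_increasing

print(missing_per_column)
print("Duplicate open_time rows:", duplicate_rows)
print("Sorted by open_time:", is_sorted)
```

The same three expressions should work unchanged on `df` once it is loaded.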

## 3. Display Summary Statistics

Show summary statistics for numeric columns.

In [None]:
if 'df' in globals():
    display(df.describe())

## 4. Visualize Dataset Features

Plot price and volume trends.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

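if 'df' in globals():
    # Assumes open_time is a Unix timestamp in milliseconds.
    open_dt = pd.to_datetime(df['open_time'], unit='ms')

    # Close price against open time rather than the row index.
    plt.figure(figsize=(12, 5))
    plt.plot(open_dt, df['close'].astype(float))
    plt.title('BTCUSDT Close Price Over Time')
    plt.xlabel('Open time (UTC)')
    plt.ylabel('Close Price (USDT)')
    plt.show()

    # Base-asset volume against open time.
    plt.figure(figsize=(12, 5))
    plt.plot(open_dt, df['volume'].astype(float))
    plt.title('BTCUSDT Volume Over Time')
    plt.xlabel('Open time (UTC)')
    plt.ylabel('Volume (BTC)')
    plt.show()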
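
A month of 1-second rows is a lot of points for a single line plot, so it may help to aggregate to coarser bars before plotting. The sketch below is one possible approach, shown on a hypothetical 180-row `sample` frame with a datetime index. It keeps the last close and the summed volume for each minute.

```python
import numpy as np
import pandas as pd

# Hypothetical 1-second series: 180 rows starting at 2024-05-01 00:00:00 UTC.
idx = pd.date_range("2024-05-01", periods=180, freq=pd.Timedelta(seconds=1), tz="UTC")
sample = pd.DataFrame(
    {"close": np.linspace(60000.0, 60179.0, 180), "volume": np.ones(180)},
    index=idx,
)

# One row per minute: last close in the minute, total volume in the minute.
per_minute = sample.resample("1min").agg({"close": "last", "volume": "sum"})
print(per_minute)
```

To apply this to `df`, the frame would first need a datetime index, for example one built from `open_time`.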

## 5. Additional Analysis

Plot a histogram of each numeric column and a scatter plot of the first two numeric columns. You can extend this cell with more advanced analytics or visualizations on the BTCUSDT dataset.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

if 'df' in globals():
    # Compute the numeric columns once and reuse them below.
    numeric_cols = df.select_dtypes(include=['number']).columns
    # One histogram (with a KDE overlay) per numeric column.
    for col in numeric_cols:
        plt.figure(figsize=(6, 4))
        sns.histplot(df[col], kde=True)
        plt.title(f'Histogram of {col}')
        plt.show()
    # Scatter plot of the first two numeric columns, if there are at least two.
    if len(numeric_cols) >= 2:
        x_col, y_col = numeric_cols[:2]
        plt.figure(figsize=(6, 4))
        sns.scatterplot(x=df[x_col], y=df[y_col])
        plt.title(f'Scatter plot: {x_col} vs {y_col}')
        plt.show()
else:
    print('No data available for visualization.')