<a href="https://www.kaggle.com/code/develuse/numerai-crypto-h2o-sw-automl2?scriptVersionId=233140655" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

In [2]:
!pip install -q numerapi requests pyarrow fastparquet pydrive2
# Install Java (required for H2O and Spark)
!apt-get update -qq
!apt-get install -y default-jre > /dev/null
!java -version

# Install Spark and PySpark
!pip install -q pyspark==3.1.2

# Install H2O Sparkling Water
!pip install -q h2o-pysparkling-3.1

# Install the other required packages
# Use scikit-learn 1.0.2 for compatibility, without showing warnings
!pip install -q numerapi pandas h2o cloudpickle==2.2.1 scikit-learn==1.0.2 scipy==1.10.1 matplotlib xgboost==1.6.2 --no-deps


W: Skipping acquire of configured file 'main/source/Sources' as repository 'https://r2u.stat.illinois.edu/ubuntu jammy InRelease' does not seem to provide it (sources.list entry misspelt?)
openjdk version "11.0.26" 2025-01-21
OpenJDK Runtime Environment (build 11.0.26+4-post-Ubuntu-1ubuntu122.04)
OpenJDK 64-Bit Server VM (build 11.0.26+4-post-Ubuntu-1ubuntu122.04, mixed mode, sharing)


# Numerai Crypto Competition Prediction Model with H2O Sparkling Water

This notebook implements a prediction model for the Numerai/Numerai Crypto competition using H2O Sparkling Water, which integrates H2O with Apache Spark for distributed processing.

## Installing the required packages

First we need to install Java, Spark, and H2O Sparkling Water. This can take some time.

In [3]:
# Import the required libraries
# Numerapi imports
from numerapi import NumerAPI, CryptoAPI

# Data download and preparation imports
import pandas as pd
import json
import os
from typing import List
import gc

# Computation imports
import numpy as np
import time
import random

# Spark imports
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.sql.functions import col

# H2O Sparkling Water imports
from pysparkling import H2OContext
from h2o.estimators.xgboost import H2OXGBoostEstimator
import h2o
import cloudpickle
from datetime import datetime

# Visualization imports
import matplotlib.pyplot as plt
import seaborn as sns

# Model imports
import lightgbm as lgb

## Initializing Spark and H2O Sparkling Water

In [4]:
# Create a directory for storing data and models
!mkdir -p /kaggle/working/numerai

# Initialize the Spark session with more generous resources (adjust to your Kaggle environment)
spark = SparkSession.builder \
    .appName("NumeraiSparklingWater") \
    .config("spark.executor.memory", "5g") \
    .config("spark.driver.memory", "5g") \
    .config("spark.executor.cores", "2") \
    .config("spark.driver.extraJavaOptions", "-XX:+UseG1GC") \
    .config("spark.executor.extraJavaOptions", "-XX:+UseG1GC") \
    .config("spark.locality.wait", "0s") \
    .getOrCreate()

# Initialize the H2O Sparkling Water context
h2o_context = H2OContext.getOrCreate()

# Print Spark and H2O version information
print(f"Spark version: {spark.version}")
print(f"H2O cluster version: {h2o.__version__}")  # Corrected version attribute
# The getSparklingWaterVersion method does not exist, so we skip it
# Instead we print the H2O cluster info. show_status() prints the table itself
# and returns None, so the line printed below ends with "None"
print(f"H2O cluster info: {h2o.cluster().show_status()}")

Connecting to H2O server at http://8652cf76d465:54323 ... successful.
Please download and install the latest version from: https://h2o-release.s3.amazonaws.com/h2o/latest_stable.html


0,1
H2O_cluster_uptime:,18 secs
H2O_cluster_timezone:,Etc/UTC
H2O_data_parsing_timezone:,UTC
H2O_cluster_version:,3.46.0.6
H2O_cluster_version_age:,5 months and 8 days
H2O_cluster_name:,sparkling-water-root_local-1744302825258
H2O_cluster_total_nodes:,1
H2O_cluster_free_memory:,5 Gb
H2O_cluster_total_cores:,4
H2O_cluster_allowed_cores:,2



Sparkling Water Context:
 * Sparkling Water Version: 3.46.0.6-1-3.1
 * H2O name: sparkling-water-root_local-1744302825258
 * cluster size: 1
 * list of used nodes:
  (executorId, host, port)
  ------------------------
  (0,172.19.2.2,54321)
  ------------------------

  Open H2O Flow in browser: http://8652cf76d465:54323 (CMD + click in Mac OSX)

    
Spark version: 3.1.2
H2O cluster version: 3.46.0.6


0,1
H2O_cluster_uptime:,18 secs
H2O_cluster_timezone:,Etc/UTC
H2O_data_parsing_timezone:,UTC
H2O_cluster_version:,3.46.0.6
H2O_cluster_version_age:,5 months and 8 days
H2O_cluster_name:,sparkling-water-root_local-1744302825258
H2O_cluster_total_nodes:,1
H2O_cluster_free_memory:,5 Gb
H2O_cluster_total_cores:,4
H2O_cluster_allowed_cores:,2


H2O cluster info: None


## Initializing the Numerai API

In [5]:
# Initialize the Numerai API client
# API keys are required for submitting predictions
# napi = NumerAPI(public_id="YOUR_PUBLIC_ID", secret_key="YOUR_SECRET_KEY")
napi = NumerAPI()

## Downloading and loading data

In [6]:
import numpy as np
import pandas as pd
# rmd = reduce memory usage of a DataFrame
def rmd(df, use_float16=1, verbose=True):
    """
    Reduces the memory usage of a pandas DataFrame by
    downcasting each numeric column to the smallest dtype that fits its value range.
    
    Parameters:
    -----------
    df : pandas.DataFrame
        The DataFrame whose memory usage should be optimized
    use_float16 : int
        1 to allow float16, 0 to use only float32/float64
        Set this to 0 when you later work with libraries such as PySpark that do not support float16
    verbose : bool
        If True, print progress information
        
    Returns:
    --------
    pandas.DataFrame
        A copy of the original DataFrame with optimized dtypes
    """
    # Make a copy so the original DataFrame is not modified
    df_copy = df.copy()
    
    # Numeric dtypes that we can optimize
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    
    # Compute the initial memory usage
    start_mem = df_copy.memory_usage().sum() / 1024**2
    if verbose:
        print(f'Memory usage of DataFrame is {start_mem:.2f} MB')
    
    # Loop over all columns
    for col in df_copy.columns:
        col_type = df_copy[col].dtypes
        
        # Only optimize numeric columns
        if col_type in numerics:
            # Compute the min and max values
            c_min = df_copy[col].min()
            c_max = df_copy[col].max()
            
            # Integer types
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df_copy[col] = df_copy[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df_copy[col] = df_copy[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df_copy[col] = df_copy[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df_copy[col] = df_copy[col].astype(np.int64)
            # Float types
            else:
                # Use float16 only if use_float16 is enabled.
                # The check covers the value range only; float16 also loses precision.
                if use_float16 == 1 and c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df_copy[col] = df_copy[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df_copy[col] = df_copy[col].astype(np.float32)
                else:
                    df_copy[col] = df_copy[col].astype(np.float64)
    
    # Compute the final memory usage
    end_mem = df_copy.memory_usage().sum() / 1024**2
    
    # Print statistics if verbose is enabled
    if verbose:
        print(f'Memory usage after optimization is: {end_mem:.2f} MB')
        reduction = 100 * (start_mem - end_mem) / start_mem
        print(f'Memory usage reduced by {reduction:.1f}%')
    
    return df_copy
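
The downcasting idea behind `rmd` can be tried on a toy frame. The sketch below is an illustration only: `smallest_int_dtype` is a hypothetical helper (not part of the notebook) that uses inclusive bounds, a slight variation on the strict comparisons in `rmd`, and the column values are made up. It also shows why float16 deserves care: the range check passes for these prices, yet the rounding error is large.

```python
import numpy as np
import pandas as pd

def smallest_int_dtype(series):
    """Hypothetical helper: narrowest signed integer dtype that holds every value."""
    lo, hi = series.min(), series.max()
    for dtype in (np.int8, np.int16, np.int32, np.int64):
        info = np.iinfo(dtype)
        if info.min <= lo and hi <= info.max:
            return dtype
    return np.int64

# Made-up example data, all stored in 64-bit types
df = pd.DataFrame({
    "small": np.array([0, 5, 127], dtype=np.int64),
    "large": np.array([0, 5, 40_000], dtype=np.int64),
    "price": np.array([0.1234567, 1234.5678, 60000.25], dtype=np.float64),
})

# Downcast the integer columns to the narrowest dtype that fits
for name in ("small", "large"):
    df[name] = df[name].astype(smallest_int_dtype(df[name]))
print(df.dtypes)

# All prices are inside the float16 range, but float16 rounds them coarsely
err16 = (df["price"].astype(np.float16).astype(np.float64) - df["price"]).abs().max()
err32 = (df["price"].astype(np.float32).astype(np.float64) - df["price"]).abs().max()
print(f"max abs error float16: {err16:.4f}, float32: {err32:.6f}")
```

For price-like columns, passing `use_float16=0` to `rmd` avoids this loss at the cost of a smaller memory saving.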

In [16]:
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq
import dask.dataframe as dd

# 1. PyArrow chunked reading
def read_parquet_with_pyarrow_chunked(file_path, chunksize=1000000, columns=None, filters=None, verbose=True):
    """
    Reads a parquet file in chunks with PyArrow, which keeps memory usage far lower
    than loading the whole file with Pandas.
    `filters` is an optional pyarrow.dataset expression.
    """
    # pyarrow.dataset.dataset() returns a Dataset whose scanner() supports batch_size
    dataset = ds.dataset(file_path, format="parquet")
    schema = dataset.schema
    if verbose:
        print(f"Schema: {schema}")
        print(f"Total number of columns: {len(schema.names)}")
    scanner = dataset.scanner(columns=columns, filter=filters, batch_size=chunksize)
    for i, batch in enumerate(scanner.to_batches()):
        chunk_df = batch.to_pandas()
        if verbose and i % 10 == 0:
            print(f"Processing chunk {i}, size: {len(chunk_df)} rows")
        yield chunk_df
        del batch
        gc.collect()

# 2. Dask for out-of-core processing
def process_with_dask(file_path, sample_freq='W', output_path=None, memory_limit='16GB', date_column='timestamp', verbose=True):
    """
    Uses Dask for out-of-core processing of large parquet files.
    """
    import dask
    # Note: this limit applies to dask.distributed workers, not to the default local scheduler
    dask.config.set({"distributed.worker.memory.limit": memory_limit})
    if verbose:
        print(f"Starting processing with Dask, memory limit: {memory_limit}")
    ddf = dd.read_parquet(file_path)
    if verbose:
        print(f"Dask DataFrame info: {len(ddf.columns)} columns")
    if date_column in ddf.columns:
        ddf[date_column] = dd.to_datetime(ddf[date_column])
        if verbose:
            print(f"Resampling data to {sample_freq} frequency...")
        ddf = ddf.set_index(date_column)
        resampled = ddf.resample(sample_freq).mean()
        if output_path:
            if verbose:
                print(f"Saving resampled data to {output_path}")
            resampled.to_parquet(output_path)
        return resampled
    return ddf

# 3. SQLite for query-based processing
def process_with_sqlite(file_path, query, output_path=None, chunk_size=1000000, verbose=True):
    """
    Converts parquet to SQLite for efficient processing without loading everything into memory.
    """
    import sqlite3
    import tempfile
    with tempfile.NamedTemporaryFile(suffix='.db') as temp:
        db_path = temp.name
        if verbose:
            print(f"SQLite database created at: {db_path}")
        conn = sqlite3.connect(db_path)
        chunks = read_parquet_with_pyarrow_chunked(file_path, chunksize=chunk_size, verbose=verbose)
        for i, chunk in enumerate(chunks):
            if i == 0:
                chunk.to_sql('data', conn, if_exists='replace', index=False)
                if verbose:
                    print(f"Table created. Schema: {list(chunk.columns)}")
            else:
                chunk.to_sql('data', conn, if_exists='append', index=False)
            if verbose:
                print(f"Chunk {i} added to database, {len(chunk)} rows")
        if verbose:
            print(f"Running query: {query}")
        result = pd.read_sql_query(query, conn)
        if output_path:
            result.to_parquet(output_path)
            if verbose:
                print(f"Result saved to {output_path}")
        conn.close()
        return result

# 4. Direct weekly sampling
def sample_weekly_direct(file_path, date_column='timestamp', value_columns=None, agg_func='mean', verbose=True):
    """
    Reads and samples data without holding everything in memory.
    """
    # Unknown aggregation names fall back to 'mean'
    if agg_func not in ('mean', 'sum', 'first', 'last'):
        agg_func = 'mean'
    try:
        if verbose:
            print(f"Starting direct weekly sampling of {file_path}")
        parquet_file = pq.ParquetFile(file_path)
        all_columns = parquet_file.schema.names
        if verbose:
            print(f"Available columns: {all_columns}")
        if date_column not in all_columns:
            timestamp_cols = [col for col in all_columns if 'time' in col.lower() or 'date' in col.lower()]
            if verbose and timestamp_cols:
                print(f"Warning: {date_column} not found, using {timestamp_cols[0]}")
            if timestamp_cols:
                date_column = timestamp_cols[0]
        cols_to_read = [date_column]
        if value_columns:
            cols_to_read.extend([c for c in value_columns if c in all_columns])
        else:
            cols_to_read.extend([col for col in all_columns if col != date_column])
        if verbose:
            print(f"Columns to read: {len(cols_to_read)}")
        chunks_generator = read_parquet_with_pyarrow_chunked(
            file_path, 
            columns=cols_to_read,
            chunksize=5000000,
            verbose=verbose
        )
        weekly_dfs = []
        for i, chunk_df in enumerate(chunks_generator):
            try:
                chunk_df[date_column] = pd.to_datetime(chunk_df[date_column])
                chunk_df.set_index(date_column, inplace=True)
                weekly_chunk = chunk_df.resample('W').agg(agg_func)
                weekly_dfs.append(weekly_chunk)
            except Exception as e:
                if verbose:
                    print(f"Error while processing chunk {i}: {e}")
            del chunk_df
            gc.collect()
        if weekly_dfs:
            # Weeks that straddle a chunk boundary appear more than once, so aggregate again.
            # For 'mean' this is a mean of per-chunk means, which is exact only when
            # every chunk contributes the same number of rows to a week.
            combined_weekly = pd.concat(weekly_dfs)
            final_weekly = combined_weekly.groupby(level=0).agg(agg_func)
            if verbose:
                print(f"Weekly data generated: {len(final_weekly)} rows")
            return final_weekly
        else:
            if verbose:
                print("Could not generate any weekly data!")
            return None
    except Exception as e:
        if verbose:
            print(f"Error during direct weekly sampling: {e}")
            import traceback
            traceback.print_exc()
        return None

# 5. Combined function for YIEDL data processing
def process_yiedl_data(file_path, method='direct', date_column='timestamp', output_path=None, verbose=True):
    """
    Processes YIEDL data with low memory usage.
    """
    if method == 'direct':
        if verbose:
            print("\n=== Direct weekly sampling ===")
        result = sample_weekly_direct(file_path, date_column=date_column, verbose=verbose)
        if output_path and result is not None:
            result.to_parquet(output_path)
            if verbose:
                print(f"Weekly data saved to {output_path}")
        return result
    elif method == 'dask':
        if verbose:
            print("\n=== Dask processing ===")
        result = process_with_dask(
            file_path, 
            sample_freq='W', 
            output_path=output_path,
            date_column=date_column,
            verbose=verbose
        )
        if output_path is None and result is not None:
            if verbose:
                print("Computing the Dask result...")
            return result.compute()
        return None
    elif method == 'chunks':
        if verbose:
            print("\n=== PyArrow chunked processing ===")
        chunks = read_parquet_with_pyarrow_chunked(file_path, verbose=verbose)
        sample_chunks = []
        for i, chunk in enumerate(chunks):
            if i < 3:
                if verbose:
                    print(f"Chunk {i}: {len(chunk)} rows, {len(chunk.columns)} columns")
                sample_chunks.append(chunk.head(5))
            else:
                break
        if sample_chunks:
            return pd.concat(sample_chunks)
        return None
    elif method == 'sqlite':
        if verbose:
            print("\n=== SQLite processing ===")
        # Example query for weekly aggregation; it assumes the data has a `price` column
        date_extract = f"strftime('%Y-%W', {date_column})"
        query = f"""
        SELECT {date_extract} as week, 
               AVG(price) as avg_price,
               COUNT(*) as count
        FROM data
        GROUP BY week
        ORDER BY week
        """
        return process_with_sqlite(file_path, query, output_path=output_path, verbose=verbose)
    else:
        if verbose:
            print(f"Unknown method: {method}")
        return None


In [8]:
%%time
# Download the Numerai training data to the current directory
napi.download_dataset(filename = "crypto/v1.0/train_targets.parquet", 
                      dest_path = os.getcwd() + "/numerai_train_targets.parquet")
#napi.download_dataset(filename = "crypto/v2.0/train_targets.parquet", 
#                      dest_path = os.getcwd() + "/numerai_train_targets.parquet")
# Download the Numerai live crypto universe to the current directory
napi.download_dataset(filename = "crypto/v1.0/live_universe.parquet", 
                      dest_path = os.getcwd() + "/numerai_live_universe.parquet")
#napi.download_dataset(filename = "crypto/v2.0/live_universe.parquet", 
#                      dest_path = os.getcwd() + "/numerai_live_universe.parquet")

# Load the Numerai training targets
train_df = rmd(pd.read_parquet("numerai_train_targets.parquet"))
# Load the Numerai live universe
live = rmd(pd.read_parquet("numerai_live_universe.parquet"))


Memory usage of DataFrame is 14.12 MB
Memory usage after optimization is: 13.11 MB
Memory usage reduced by 7.1%
Memory usage of DataFrame is 0.01 MB
Memory usage after optimization is: 0.01 MB
Memory usage reduced by 0.0%
CPU times: user 162 ms, sys: 144 ms, total: 306 ms
Wall time: 1.9 s


In [9]:
%%time
api = CryptoAPI()
# Parquet files
api.download_dataset(
	"crypto/v1.0/live_universe.parquet",
	"numerai_crypto_live_universe.parquet"
)
gc.collect()
api.download_dataset(
	"crypto/v1.0/train_targets.parquet",
	"numerai_crypto_train_targets.parquet"
)
gc.collect()
api.download_dataset(
	"crypto/v1.0/meta_model.parquet",
	"numerai_crypto_meta_model.parquet"
)
gc.collect()
api.download_dataset(
	"crypto/v1.0/historical_meta_models.parquet",
	"numerai_crypto_historical_meta_models.parquet"
)
gc.collect()
# CSV Files
api.download_dataset(
	"crypto/v1.0/meta_model.csv",
	"numerai_crypto_meta_model.csv"
)
gc.collect()
api.download_dataset(
	"crypto/v1.0/historical_meta_models.csv",
	"numerai_crypto_historical_meta_models.csv"
)
gc.collect()
working_dir = '/kaggle/working/'
files = os.listdir(working_dir)
print("Files in /kaggle/working/:")
for f in files:
    print(f)
# # Load the data
numerai_crypto_train_targets = rmd(pd.read_parquet('numerai_crypto_train_targets.parquet'))
numerai_crypto_live_universe = rmd(pd.read_parquet('numerai_crypto_live_universe.parquet'))
gc.collect()
historical_meta_model_preds = rmd(pd.read_parquet('numerai_crypto_historical_meta_models.parquet'))
live_meta_model_preds = rmd(pd.read_parquet('numerai_crypto_meta_model.parquet'))
gc.collect()



Files in /kaggle/working/:
.virtual_documents
numerai_live_universe.parquet
yiedl_historical.parquet
h2ologs
numerai_crypto_historical_meta_models.parquet
numerai_train_targets.parquet
numerai_crypto_meta_model.csv
numerai
numerai_crypto_meta_model.parquet
numerai_crypto_train_targets.parquet
numerai_crypto_historical_meta_models.csv
numerai_crypto_live_universe.parquet
yiedl_latest.parquet
Memory usage of DataFrame is 14.12 MB
Memory usage after optimization is: 13.11 MB
Memory usage reduced by 7.1%
Memory usage of DataFrame is 0.01 MB
Memory usage after optimization is: 0.01 MB
Memory usage reduced by 0.0%
Memory usage of DataFrame is 3.69 MB
Memory usage after optimization is: 3.00 MB
Memory usage reduced by 18.8%
Memory usage of DataFrame is 0.01 MB
Memory usage after optimization is: 0.00 MB
Memory usage reduced by 37.5%
CPU times: user 1.71 s, sys: 97.2 ms, total: 1.81 s
Wall time: 6.29 s


0

In [10]:
# Helper Function from example: https://github.com/councilofelders/notebooks/blob/main/yiedl_crypto_data/yiedl_crypto_data_for_numerai_example.ipynb
import requests

def download_file(url, output_filename):
    response = requests.get(url)
    if response.status_code == 200:
        with open(output_filename, 'wb') as file:
            file.write(response.content)
        print(f"File downloaded successfully as {output_filename}")
    else:
        print("Failed to download file")
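
`download_file` holds the entire response body in memory before writing it, which is costly for the large historical archive. A streaming variant copies the body to disk in fixed-size pieces. The sketch below is one possible approach, not the helper from the referenced example: `download_file_streaming` is a hypothetical name, and it uses only the standard library (`urllib.request` with `shutil.copyfileobj`) so that it runs without extra dependencies. With `requests`, the equivalent pattern is `requests.get(url, stream=True)` combined with `response.iter_content(chunk_size=...)`.

```python
import shutil
import urllib.request

def download_file_streaming(url, output_filename, chunk_size=1024 * 1024):
    """Hypothetical helper: copy the response body to disk chunk by chunk.

    urlopen raises urllib.error.HTTPError for HTTP error statuses,
    so a normal return means the download finished.
    """
    with urllib.request.urlopen(url) as response, open(output_filename, "wb") as out:
        # copyfileobj reads at most `chunk_size` bytes at a time
        shutil.copyfileobj(response, out, length=chunk_size)
    print(f"File downloaded successfully as {output_filename}")
    return output_filename
```

Called as `download_file_streaming(url, 'yiedl_historical.zip')`, peak memory stays near `chunk_size` regardless of the file size.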


In [11]:
%%time
# Download YIEDL crypto latest dataset to current directory
url = 'https://api.yiedl.ai/yiedl/v1/downloadDataset?type=latest'
output_filename = 'yiedl_latest.parquet'
download_file(url, output_filename)


# Download YIEDL crypto historical dataset to current directory
# NOTE: it is a huge file in zip format. We need to unzip it afterwards
url = 'https://api.yiedl.ai/yiedl/v1/downloadDataset?type=historical'
output_filename = 'yiedl_historical.zip'
download_file(url, output_filename)
#10m9s-13m48s

File downloaded successfully as yiedl_latest.parquet
File downloaded successfully as yiedl_historical.zip
CPU times: user 51.2 s, sys: 1min 12s, total: 2min 3s
Wall time: 13min 48s


In [12]:
%%time
# Unzip and rename the file
!unzip -p yiedl_historical.zip > yiedl_historical.parquet
!rm yiedl_historical.zip
## 1m12s

CPU times: user 1.57 s, sys: 598 ms, total: 2.17 s
Wall time: 1min 12s


In [14]:
file_path = "yiedl_historical.parquet"

# Option 1: Chunk-based processing with PyArrow
print("\n=== PyArrow chunked processing ===")
yiedl_historical = read_parquet_with_pyarrow_chunked(file_path, chunksize=5000000)

# Example of per-chunk processing
for i, chunk in enumerate(yiedl_historical):
    if i < 3:  # First 3 chunks as an example
        print(f"Chunk {i}: {len(chunk)} rows")
        # Per-chunk processing goes here
        # After processing, the chunk's memory is released
    else:
        break

# Option 2: Processing with Dask
print("\n=== Dask processing ===")
yield_weekly_dask = process_with_dask(file_path, sample_freq='W', 
                                     output_path="yiedl_weekly_data_dask.parquet")

# Option 3: Direct weekly sampling
print("\n=== Direct weekly sampling ===")
yiedl_weekly_df = sample_weekly_direct(file_path)
# sample_weekly_direct returns None on failure, so guard before using the result
if yiedl_weekly_df is not None:
    print(f"Weekly data: {len(yiedl_weekly_df)} rows")
    yiedl_weekly_df.to_parquet("yiedl_weekly_data_direct.parquet")


=== PyArrow chunked processing ===


NameError: name 'verbose' is not defined
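
One subtlety of chunk-wise weekly sampling: when a week is split across two chunks, averaging the per-chunk weekly means is not the same as the mean over all rows of that week. A way around this is to accumulate per-chunk sums and counts and divide at the end. The sketch below illustrates the difference on made-up daily data; `weekly_mean_from_chunks` and the `value` column are hypothetical names, not part of the notebook.

```python
import numpy as np
import pandas as pd

def weekly_mean_from_chunks(chunks, date_column="timestamp"):
    """Hypothetical helper: exact weekly mean from per-chunk sums and counts."""
    sums, counts = [], []
    for chunk in chunks:
        indexed = chunk.copy()
        indexed[date_column] = pd.to_datetime(indexed[date_column])
        resampler = indexed.set_index(date_column).resample("W")
        sums.append(resampler.sum())
        counts.append(resampler.count())
    total_sum = pd.concat(sums).groupby(level=0).sum()
    total_count = pd.concat(counts).groupby(level=0).sum()
    return total_sum / total_count

# Ten daily rows starting on Monday 2024-01-01, split into two chunks of five.
# The week ending 2024-01-07 is spread over both chunks (5 rows and 2 rows).
data = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01", periods=10, freq="D"),
    "value": np.arange(10, dtype=float),
})
chunks = [data.iloc[:5], data.iloc[5:]]

# Reference: weekly mean computed on the full data at once
expected = data.set_index("timestamp").resample("W").mean()

# Naive recombination: mean of the per-chunk weekly means
naive = pd.concat(
    [c.set_index("timestamp").resample("W").mean() for c in chunks]
).groupby(level=0).mean()

# Exact recombination via sums and counts
exact = weekly_mean_from_chunks(chunks)

print(pd.DataFrame({
    "expected": expected["value"],
    "naive": naive["value"],
    "exact": exact["value"],
}))
```

For the first week the naive value is (2.0 + 5.5) / 2 = 3.75, while the true mean of the seven values 0 to 6 is 3.0. Aggregations such as `sum`, `first`, and `last` do not have this problem.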

In [None]:
%%time
# Load and display the YIEDL historical crypto dataset
## Fails: loading the full file runs out of memory at 30 GB of RAM usage
##df_yield_historical = rmd(pd.read_parquet("yiedl_historical.parquet",
##                                      engine = "pyarrow",
##                                      dtype_backend = "numpy_nullable"))
# Check dtypes
#df_yield_historical.dtypes
# Display
#display(df_yield_historical)

In [None]:
%%time
# Load and display the YIEDL latest crypto dataset
df_yield_latest = rmd(pd.read_parquet("yiedl_latest.parquet", 
                                  engine = "pyarrow",
                                  dtype_backend = "numpy_nullable"))

In [None]:
''' Numerai competition dataset
# Use one of the newest data versions
DATA_VERSION = "v5.0"

# Create a data directory
!mkdir -p {DATA_VERSION}

# Download data
print("Downloading training data...")
napi.download_dataset(f"{DATA_VERSION}/train.parquet")
napi.download_dataset(f"{DATA_VERSION}/features.json")

# Load feature metadata
feature_metadata = json.load(open(f"{DATA_VERSION}/features.json"))
print("Available feature sets:", list(feature_metadata["feature_sets"].keys()))
features = feature_metadata["feature_sets"]["small"]  # use "small" for faster testing, "medium" or "all" for better performance
'''

In [2]:
'''
# PyDrive implementation for Google Drive integration
from pydrive2.auth import GoogleAuth
from pydrive2.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

def setup_pydrive():
    # Authenticate and create the PyDrive client
    auth.authenticate_user()
    gauth = GoogleAuth()
    gauth.credentials = GoogleCredentials.get_application_default()
    drive = GoogleDrive(gauth)
    return drive

def create_folder_if_not_exists(drive, folder_name):
    # Check if folder exists
    file_list = drive.ListFile({"q": f"title='{folder_name}' and mimeType='application/vnd.google-apps.folder' and trashed=false"}).GetList()
    
    if len(file_list) > 0:
        # Folder exists, return the folder ID
        return file_list[0]["id"]
    else:
        # Create folder
        folder = drive.CreateFile({"title": folder_name, "mimeType": "application/vnd.google-apps.folder"})
        folder.Upload()
        return folder["id"]

def save_notebook_to_drive(drive, folder_id, notebook_name):
    # Create a file in the folder
    file = drive.CreateFile({"title": notebook_name, "parents": [{"id": folder_id}]})
    
    # Get the content of the current notebook
    notebook_content = open(notebook_name, "r").read()
    
    # Set the content of the file
    file.SetContentString(notebook_content)
    file.Upload()
    
    return file["id"]

try:
    # Setup PyDrive
    drive = setup_pydrive()
    print("Successfully authenticated with Google Drive")
    
    # Create Numer_crypto folder if it doesn't exist
    folder_id = create_folder_if_not_exists(drive, "Numer_crypto")
    print(f"Numer_crypto folder ID: {folder_id}")
    
    ## Save the current notebook to the folder
    #notebook_name = "numerai_sparkling_water_kaggle.ipynb"
    #file_id = save_notebook_to_drive(drive, folder_id, notebook_name)
    #print(f"Notebook saved to Google Drive with file ID: {file_id}")
    
    ## List files in the folder
    #file_list = drive.ListFile({"q": f"'{folder_id}' in parents and trashed=false"}).GetList()
    #print("Files in Numer_crypto folder:")
    #for file in file_list:
    #    print(f"- {file['title']} (ID: {file['id']})")
except Exception as e:
    print(f"Error with PyDrive: {e}")
'''

Successfully authenticated with Google Drive
Numer_crypto folder ID: 1nLS8F4unm5wKYIgzTqk8tPyURpRx17w3


## Loading data with PySpark

In [None]:
'''
# Load the training data with Spark
print("Loading training data with Spark...")
train_spark = spark.read.parquet(f"{DATA_VERSION}/train.parquet")

# Select only the required columns
columns_to_select = ["era"] + features + ["target"]
train_spark = train_spark.select(*columns_to_select)

# Downsampling for speed (optional)
print("Preparing data for training...")
# Get the unique eras and sample 25% (every 4th era);
# sort first, because distinct() does not guarantee an order
unique_eras = sorted(row.era for row in train_spark.select("era").distinct().collect())
sampled_eras = unique_eras[::4]
train_spark = train_spark.filter(col("era").isin(sampled_eras))

# Inspect the data
print(f"Training data count: {train_spark.count()}")
print(f"Number of features: {len(features)}")
print(f"Number of eras: {len(sampled_eras)}")

# Show the schema
train_spark.printSchema()
'''

## Preparing data with PySpark

In [None]:
'''
# Prepare the data with a Spark ML pipeline
print("Preparing feature vector with Spark...")

# Assemble all features into a single feature vector
assembler = VectorAssembler(inputCols=features, outputCol="features")
train_spark = assembler.transform(train_spark)

# Show a sample of the transformed data
train_spark.select("era", "features", "target").show(5, truncate=True)
'''

## Converting a Spark DataFrame to an H2O Frame

In [None]:
'''
# Convert the Spark DataFrame to an H2O Frame
print("Converting Spark DataFrame to H2O Frame...")
train_h2o = h2o_context.asH2OFrame(train_spark)

# Inspect the H2O Frame info
train_h2o.describe()
'''

## Feature engineering

In [None]:
# using standard Feature Engineering from here: https://www.kaggle.com/code/lucasmorin/crypto-forecasting-lgbm-feval-feature-importance
# https://stackoverflow.com/questions/38641691/weighted-correlation-coefficient-with-pandas
def wmean(x, w):
    return np.sum(x * w) / np.sum(w)

def wcov(x, y, w):
    return np.sum(w * (x - wmean(x, w)) * (y - wmean(y, w))) / np.sum(w)

def wcorr(x, y, w):
    return wcov(x, y, w) / np.sqrt(wcov(x, x, w) * wcov(y, y, w))

def eval_wcorr(preds, train_data):
    w = train_data.add_w.values.flatten()
    y_true = train_data.get_label()
    return 'eval_wcorr', wcorr(preds, y_true, w), True
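
The weighted correlation helpers can be sanity-checked on synthetic data. The sketch below repeats the three functions so it runs on its own; the data and the weights are random values invented for the check. Two properties are worth confirming: with uniform weights the result matches the ordinary Pearson correlation, and multiplying all weights by a constant leaves the result unchanged.

```python
import numpy as np

def wmean(x, w):
    return np.sum(x * w) / np.sum(w)

def wcov(x, y, w):
    return np.sum(w * (x - wmean(x, w)) * (y - wmean(y, w))) / np.sum(w)

def wcorr(x, y, w):
    return wcov(x, y, w) / np.sqrt(wcov(x, x, w) * wcov(y, y, w))

# Synthetic, positively correlated data
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(size=200)

# Uniform weights reduce wcorr to the Pearson correlation
r_uniform = wcorr(x, y, np.ones_like(x))
r_pearson = np.corrcoef(x, y)[0, 1]

# Only the relative size of the weights matters
w = rng.uniform(0.1, 1.0, size=200)
r_weighted = wcorr(x, y, w)
r_rescaled = wcorr(x, y, 10.0 * w)

print(f"uniform: {r_uniform:.6f}, pearson: {r_pearson:.6f}")
print(f"weighted: {r_weighted:.6f}, rescaled weights: {r_rescaled:.6f}")
```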



## Numerai crypto example model

In [None]:
%%time
def generate_training_features(df: pd.DataFrame) -> List[str]:
    # TODO: Get your data and create features
    df['fake_feature_1'] = df.groupby(["symbol", "date"])['symbol'].transform(lambda x: random.uniform(0, 1))
    return ['fake_feature_1']

# Historical targets file contains ["symbol", "date", "target"] columns
#train_df

# Add training features for each (symbol, date)
feature_cols = generate_training_features(train_df)

model = lgb.LGBMRegressor(
    n_estimators=2000,
    learning_rate=0.01,
    max_depth=5,
    num_leaves=2 ** 5,
    colsample_bytree=0.1
)

model.fit(
    train_df[feature_cols],
    train_df["target"]
)

In [None]:
def generate_features(df: pd.DataFrame):
    # TODO: Get your data and create features for live universe
    df['fake_feature_1'] = df['symbol'].transform(lambda x: random.uniform(0, 1))

# Use API keys to authenticate
napi = NumerAPI("[your api public id]", "[your api secret key]")

# Generate features for the live universe
generate_features(live)

# Get live predictions
live["signal"] = model.predict(live[feature_cols])

# Predictions must be between 0 and 1
live["signal"] = live["signal"].rank(pct=True)

# Format and save submission
live[['symbol', 'signal']].to_parquet("submission.parquet")

# Get model ids and submit models
models = napi.get_models(tournament=12)
for model_name, model_id in models.items():
    print(f'submitting {model_name}...')
    napi.upload_predictions("submission.parquet", model_id=model_id, tournament=12)

print('done!')
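
The `rank(pct=True)` step is what maps raw model outputs of any scale into the required 0 to 1 range. A small illustration with made-up predictions (the symbols and values below are invented):

```python
import pandas as pd

# Raw model outputs on an arbitrary scale, with one tie
raw = pd.Series([0.3, -1.2, 5.0, 0.3], index=["BTC", "ETH", "SOL", "ADA"])

# Percentile rank: rank / number of rows; tied values share their average rank
signal = raw.rank(pct=True)
print(signal)
```

The smallest prediction maps to 1/n, the largest to exactly 1.0, and only the ordering of the raw predictions survives; their magnitudes do not.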

## Claude.ai version of the model using the GPU

In [None]:
import pandas as pd
import numpy as np
import lightgbm as lgb
import random
from typing import List
import time
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt

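# Check whether a GPU is available
def check_gpu_availability():
    import subprocess
    try:
        # For Kaggle P100. A "!nvidia-smi" shell escape never raises,
        # so run the command through subprocess to detect a failure
        subprocess.run(["nvidia-smi"], check=True)
        print("GPU is available")
        return True
    except (FileNotFoundError, subprocess.CalledProcessError):
        print("GPU not found, falling back to CPU")
        return False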

# Verbeterde versie van feature generatie functie
def generate_training_features(df: pd.DataFrame) -> List[str]:
    """
    Genereert features voor het trainingsmodel met betere prestaties.
    
    Args:
        df: DataFrame met minimaal 'symbol' en 'date' kolommen
    
    Returns:
        List met namen van gegenereerde feature kolommen
    """
    # Start timer voor benchmarking
    start_time = time.time()
    
    # Lijst om feature namen bij te houden
    feature_cols = []
    
    # Sorteer de data op symbol en date - belangrijk voor tijdreekseigenschappen
    df = df.sort_values(['symbol', 'date'])
    
    # Basis statistieken per symbol
    print("Berekenen groepsstatistieken...")
    
    # Meer betekenisvolle features genereren (voorbeeld)
    # Voor een crypto competitie zouden we features toe kunnen voegen zoals:
    
    # 1. Mean encoding van symbol om rekening te houden met crypto-specifieke eigenschappen
    df['symbol_mean_target'] = df.groupby('symbol')['target'].transform('mean')
    feature_cols.append('symbol_mean_target')
    
    # 2. Tijdsdimensie features
    if 'date' in df.columns:
        df['date'] = pd.to_datetime(df['date'])
        df['day_of_week'] = df['date'].dt.dayofweek
        df['month'] = df['date'].dt.month
        df['quarter'] = df['date'].dt.quarter
        feature_cols.extend(['day_of_week', 'month', 'quarter'])
    
    # 3. Rolling statistics (if we had additional price/volume data)
    # As an example, we simulate some price data here
    if 'fake_price' not in df.columns:
        df['fake_price'] = np.random.normal(100, 10, size=len(df))
    
    # Compute rolling statistics for window sizes 7, 14 and 30
    for window in [7, 14, 30]:
        # Rolling mean
        df[f'price_rolling_mean_{window}'] = df.groupby('symbol')['fake_price'].transform(
            lambda x: x.rolling(window=window, min_periods=1).mean())
        
        # Rolling volatility (std); NaN for the first row of each symbol
        df[f'price_rolling_std_{window}'] = df.groupby('symbol')['fake_price'].transform(
            lambda x: x.rolling(window=window, min_periods=1).std())
        
        # Momentum (% change)
        df[f'price_momentum_{window}'] = df.groupby('symbol')['fake_price'].transform(
            lambda x: x.pct_change(periods=window).fillna(0))
        
        feature_cols.extend([
            f'price_rolling_mean_{window}',
            f'price_rolling_std_{window}',
            f'price_momentum_{window}'
        ])
    
    # 4. Moving-average crossover (technical indicator)
    df['sma_short'] = df.groupby('symbol')['fake_price'].transform(
        lambda x: x.rolling(window=7, min_periods=1).mean())
    df['sma_long'] = df.groupby('symbol')['fake_price'].transform(
        lambda x: x.rolling(window=21, min_periods=1).mean())
    df['sma_cross'] = (df['sma_short'] > df['sma_long']).astype(int)
    feature_cols.append('sma_cross')
    
    # Random noise feature (as a placeholder)
    df['random_feature'] = np.random.normal(0, 1, size=len(df))
    feature_cols.append('random_feature')
    
    # Log execution time
    end_time = time.time()
    print(f"Feature generation completed in {end_time - start_time:.2f} seconds")
    print(f"Generated features: {len(feature_cols)}")
    
    return feature_cols

# Train a GPU-accelerated LightGBM model
def train_lgbm_model(train_df, feature_cols, target_col="target", use_gpu=False):
    """
    Trains a LightGBM model with GPU acceleration if available
    
    Args:
        train_df: DataFrame with training data
        feature_cols: List of feature columns
        target_col: Name of the target column
        use_gpu: Whether to train on the GPU
    
    Returns:
        Trained LightGBM model
    """
    print(f"Training model on {'GPU' if use_gpu else 'CPU'}...")
    start_time = time.time()
    
    # Dataset dimensions (not yet used to tune the parameters below)
    num_samples = len(train_df)
    num_features = len(feature_cols)
    
    # Hyperparameters shared by CPU and GPU training
    params = {
        'objective': 'regression',
        'metric': 'rmse',
        'n_estimators': 2000,
        'learning_rate': 0.01,
        'max_depth': 5,
        'num_leaves': 2**5,
        'colsample_bytree': 0.1,
        'verbosity': -1
    }
    
    # Add GPU-specific parameters if needed
    # ('use_gpu_hist' was dropped: it is not a LightGBM parameter)
    if use_gpu:
        params.update({
            'device': 'gpu',
            'gpu_platform_id': 0,
            'gpu_device_id': 0,
            'gpu_use_dp': True  # Use double precision for better accuracy
        })
    
    # Split data for early stopping
    from sklearn.model_selection import train_test_split
    X_train, X_val, y_train, y_val = train_test_split(
        train_df[feature_cols], 
        train_df[target_col],
        test_size=0.2,
        random_state=42
    )
    
    # Create the LGBMRegressor with the parameters above
    model = lgb.LGBMRegressor(**params)
    
    # Fit the model with an evaluation set. Early stopping and logging are passed
    # as callbacks; the `verbose` fit argument was removed in LightGBM 4.0.
    model.fit(
        X_train, y_train,
        eval_set=[(X_val, y_val)],
        callbacks=[
            lgb.early_stopping(stopping_rounds=50),
            lgb.log_evaluation(period=100)  # Show training progress every 100 iterations
        ]
    )
    
    # Evaluate the model
    val_preds = model.predict(X_val)
    rmse = np.sqrt(mean_squared_error(y_val, val_preds))
    print(f"Validation RMSE: {rmse:.6f}")
    
    # Report the early-stopping result
    print(f"Best iteration: {model.best_iteration_}")
    
    # Print feature importance
    feature_importance = pd.DataFrame({
        'Feature': feature_cols,
        'Importance': model.feature_importances_
    }).sort_values('Importance', ascending=False)
    
    print("\nTop 10 most important features:")
    print(feature_importance.head(10))
    
    # Plot feature importance
    plt.figure(figsize=(10, 6))
    plt.barh(feature_importance['Feature'][:15], feature_importance['Importance'][:15])
    plt.xlabel('Importance')
    plt.title('Feature Importance (Top 15)')
    plt.gca().invert_yaxis()
    plt.show()
    
    end_time = time.time()
    print(f"Model training completed in {end_time - start_time:.2f} seconds")
    
    return model

# Main program
if __name__ == "__main__":
    # Check GPU availability
    use_gpu = check_gpu_availability()
    
    # Load training data (replace this with your real data-loading logic)
    print("Loading training data...")
    
    # Example: if you load the training data from a local file
    # train_df = pd.read_csv('/path/to/train.csv')
    
    # For demonstration, we create synthetic data
    symbols = ['BTC', 'ETH', 'XRP', 'ADA', 'SOL', 'DOT', 'AVAX', 'MATIC']
    dates = pd.date_range(start='2020-01-01', end='2023-01-01', freq='D')
    
    data = []
    for symbol in symbols:
        for date in dates:
            data.append({
                'symbol': symbol,
                'date': date,
                'target': np.random.normal(0, 1)  # random target value
            })
    
    train_df = pd.DataFrame(data)
    print(f"Training data loaded: {train_df.shape}")
    
    # Generate features
    feature_cols = generate_training_features(train_df)
    
    # Train the model on the GPU if available
    model = train_lgbm_model(train_df, feature_cols, use_gpu=use_gpu)
    
    # Save the model
    import pickle
    with open('numerai_crypto_model.pkl', 'wb') as f:
        pickle.dump(model, f)
    
    print("Model saved as 'numerai_crypto_model.pkl'")
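
The training cell above holds out validation rows with a random `train_test_split`. For time-ordered data such as daily crypto rows, a chronological hold-out is a common alternative, because it keeps future rows out of the training set. The following is a minimal sketch under that assumption, not the notebook's own method; `time_based_split` and its parameters are hypothetical names introduced here for illustration.

```python
import pandas as pd

def time_based_split(df, date_col="date", val_fraction=0.2):
    """Hold out the most recent dates as the validation set (hypothetical helper)."""
    # Work on distinct dates so that all symbols on one day land in the same split
    unique_dates = pd.Series(df[date_col].unique()).sort_values().to_numpy()
    n_val = max(1, int(len(unique_dates) * val_fraction))
    cutoff = unique_dates[-n_val]
    train_part = df[df[date_col] < cutoff]
    val_part = df[df[date_col] >= cutoff]
    return train_part, val_part

# Tiny synthetic frame: 2 symbols x 5 days
demo = pd.DataFrame({
    "symbol": ["BTC", "ETH"] * 5,
    "date": pd.date_range("2023-01-01", periods=5, freq="D").repeat(2),
    "target": range(10),
})
train_part, val_part = time_based_split(demo, val_fraction=0.2)
print(len(train_part), len(val_part))
```

The two parts could then replace the outputs of `train_test_split`, for example `X_train = train_part[feature_cols]` and `X_val = val_part[feature_cols]`.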

## Evaluation and model flow: n-fold LightGBM

In [None]:
n_fold = 5

importances = []

for fold in range(n_fold):
    print('Fold: '+str(fold))

    train = pd.read_parquet('../input/crypto-forecasting-static-feature-engineering/train_fold_'+str(fold)+'.parquet')
    test = pd.read_parquet('../input/crypto-forecasting-static-feature-engineering/test_fold_'+str(fold)+'.parquet')
    
    if DEBUG:
        # Keep the first 5% of timestamps (`np.int` was removed in NumPy 1.24; use `int`)
        timestamp_sample_train = train.timestamp.unique()[:int(len(train.timestamp.unique())*0.05)]
        timestamp_sample_test = test.timestamp.unique()[:int(len(test.timestamp.unique())*0.05)]
        train = train[train.timestamp.isin(timestamp_sample_train)]
        test = test[test.timestamp.isin(timestamp_sample_test)]

    y_train = train['Target']
    y_test = test['Target']

    features = [col for col in train.columns if col not in {'timestamp', 'Target', 'Target_M','weights'}]

    weights_train = train[['weights']]
    weights_test = test[['weights']]

    train = train[features]
    test = test[features]
    
    train_dataset = lgb.Dataset(train, y_train, feature_name = features, categorical_feature= ['Asset_ID'])
    val_dataset = lgb.Dataset(test, y_test, feature_name = features, categorical_feature= ['Asset_ID'])

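    # Attach the weights as an ad hoc attribute (presumably read by the custom
    # `eval_wcorr` metric; LightGBM itself ignores it)
    train_dataset.add_w = weights_train
    val_dataset.add_w = weights_test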

    val_data = test
    val_y = y_test

    del train
    
    evals_result = {}
    
    # parameters
    params = {'n_estimators': 2000,
            'objective': 'regression',
            'metric': 'None',
            'boosting_type': 'gbdt',
            'max_depth': -1,
            'learning_rate': 0.05,
            'subsample': 0.72,
            'subsample_freq': 4,
            'feature_fraction': 0.4,
            'lambda_l1': 1,
            'lambda_l2': 1,
            'seed': 46,
            'verbose': -1,
            }

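    # LightGBM 4.0 removed `verbose_eval` and `evals_result`; use callbacks instead
    model = lgb.train(params = params,
                      train_set = train_dataset, 
                      valid_sets = [val_dataset],
                      feval=eval_wcorr,
                      callbacks=[#lgb.early_stopping(1000),
                                 lgb.log_evaluation(100),
                                 lgb.record_evaluation(evals_result)]
                     )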
    
    importances.append(model.feature_importance(importance_type='gain'))
    
    plt.plot(np.array(evals_result['valid_0']['eval_wcorr']), label='fold '+str(fold))
    
plt.legend(loc="upper left")
plt.show()
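
The fold loop above passes a custom metric `eval_wcorr` to `lgb.train`, but its definition is not part of this section. The sketch below shows one plausible shape for it, assuming it computes a weighted Pearson correlation and reads the weights from the `add_w` attribute that the loop attaches to each dataset. The names `weighted_correlation` and `eval_wcorr_sketch` are hypothetical; only the `(preds, eval_data) -> (name, value, is_higher_better)` contract is LightGBM's.

```python
import numpy as np

def weighted_correlation(a, b, weights):
    """Weighted Pearson correlation between a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    w = np.asarray(weights, dtype=float)
    sum_w = w.sum()
    mean_a = (a * w).sum() / sum_w
    mean_b = (b * w).sum() / sum_w
    cov = (w * (a - mean_a) * (b - mean_b)).sum() / sum_w
    var_a = (w * (a - mean_a) ** 2).sum() / sum_w
    var_b = (w * (b - mean_b) ** 2).sum() / sum_w
    return cov / np.sqrt(var_a * var_b)

def eval_wcorr_sketch(preds, dataset):
    """Custom metric in LightGBM's feval form (assumed equivalent of eval_wcorr)."""
    labels = dataset.get_label()
    # Assumption: weights were attached as `dataset.add_w`, as in the fold loop
    weights = np.asarray(dataset.add_w).ravel()
    # The metric name must match the key later read from evals_result
    return "eval_wcorr", weighted_correlation(preds, labels, weights), True
```

Returning `True` as the third element tells LightGBM that higher values are better, which matters once early stopping is enabled.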

## Training the model with H2O XGBoost via Sparkling Water

In [None]:
# Train the model with H2O XGBoost via Sparkling Water
print("Training H2O XGBoost model via Sparkling Water...")
start_time = time.time()

# Configure the XGBoost model
from h2o.estimators.xgboost import H2OXGBoostEstimator

xgb_model = H2OXGBoostEstimator(
    ntrees=2000,
    max_depth=5,
    learn_rate=0.01,
    sample_rate=0.8,
    col_sample_rate=0.8,
    tree_method="auto",  # tree-building algorithm; GPU use is set by the separate `backend` parameter
    booster="gbtree",
    seed=42
)

# Train the model
xgb_model.train(x=features, y="target", training_frame=train_h2o)

training_time = time.time() - start_time
print(f"Training completed in {training_time:.2f} seconds")

# Show model information
print(xgb_model)

## Visualizing feature importance

In [None]:
# Visualize feature importance
feature_importance = xgb_model.varimp(use_pandas=True)
if feature_importance is not None:
    plt.figure(figsize=(10, 8))
    plt.barh(range(len(feature_importance[:20])), feature_importance[:20]['relative_importance'])
    plt.yticks(range(len(feature_importance[:20])), feature_importance[:20]['variable'])
    plt.title('H2O XGBoost Feature Importance (top 20)')
    plt.xlabel('Relative Importance')
    plt.tight_layout()
    plt.show()

## Saving the model as a MOJO

In [None]:
# Save the model as a MOJO (Model Object, Optimized)
mojo_path = xgb_model.download_mojo(path="./", get_genmodel_jar=True)
print(f"Model saved as MOJO: {mojo_path}")

## Loading and preparing validation data with PySpark

In [None]:
# Download validation data for testing
print("Downloading validation data for testing...")
napi.download_dataset(f"{DATA_VERSION}/validation.parquet")

# Load the validation data with Spark
print("Loading validation data with Spark...")
validation_spark = spark.read.parquet(f"{DATA_VERSION}/validation.parquet")

# Select only the required columns
columns_to_select = ["era", "data_type"] + features
validation_spark = validation_spark.select(*columns_to_select)

# Keep only the validation rows
validation_spark = validation_spark.filter(col("data_type") == "validation")

# Take a small subset for memory efficiency
validation_spark = validation_spark.limit(1000)

# Build a feature vector from all features
validation_spark = assembler.transform(validation_spark)

# Convert the Spark DataFrame to an H2O Frame
validation_h2o = h2o_context.asH2OFrame(validation_spark)

## Making predictions with the model

In [None]:
# Make predictions with the model
print("Making predictions...")
predictions_h2o = xgb_model.predict(validation_h2o)

# Convert the H2O Frame back to a Spark DataFrame
predictions_spark = h2o_context.asSparkFrame(predictions_h2o)

# Show predictions
print("Sample predictions:")
predictions_spark.show(5)

## Defining the prediction function

In [None]:
# Define a prediction function that works with the H2O model
def predict(
    live_features: pd.DataFrame,
    live_benchmark_models: pd.DataFrame
) -> pd.DataFrame:
    # Convert the pandas DataFrame to a Spark DataFrame
    live_features_spark = spark.createDataFrame(live_features[features])
    
    # Build a feature vector from all features
    live_features_spark = assembler.transform(live_features_spark)
    
    # Convert the Spark DataFrame to an H2O Frame
    live_features_h2o = h2o_context.asH2OFrame(live_features_spark)
    
    # Make predictions with the H2O model
    preds = xgb_model.predict(live_features_h2o)
    
    # Convert the H2O predictions back to pandas
    predictions = h2o.as_list(preds)["predict"].values
    
    # Build the submission DataFrame
    submission = pd.Series(predictions, index=live_features.index)
    return submission.to_frame("prediction")

## Testing the prediction function

In [None]:
# Convert the Spark DataFrame back to pandas for testing
validation_pd = validation_spark.toPandas()

# Test the prediction function
print("Testing prediction function...")
# Create an empty DataFrame for benchmark_models (not used in our prediction function)
empty_benchmark = pd.DataFrame(index=validation_pd.index)
predictions = predict(validation_pd, empty_benchmark)

print(f"Predictions shape: {predictions.shape}")
print("\nSample predictions:")
print(predictions.head())

## Saving the prediction function with cloudpickle

In [None]:
# Pickle the prediction function
print("Saving prediction function with cloudpickle...")
p = cloudpickle.dumps(predict)
with open("numerai_sparkling_water_model.pkl", "wb") as f:
    f.write(p)

print("Prediction function saved as 'numerai_sparkling_water_model.pkl'")

## Kaggle-specific functions for saving results

In [None]:
# Save the results to the Kaggle output directory
# This makes it possible to download the results or use them as a dataset
try:
    # Create an output directory
    !mkdir -p /kaggle/working/output
    
    # Copy the important files
    !cp numerai_sparkling_water_model.pkl /kaggle/working/output/
    !cp {mojo_path} /kaggle/working/output/
    
    print("Model files saved to the Kaggle output directory")
except Exception as e:
    print(f"Error saving to Kaggle output: {e}")

## Advantages of Sparkling Water

In [None]:
# Here you could compare standard H2O with Sparkling Water
print("Sparkling Water advantages:")
print("1. Distributed processing with Spark for large datasets")
print("2. Combines Spark's data processing with H2O's machine learning algorithms")
print("3. Better scalability for complex models and large datasets")
print("4. Ability to integrate Spark ML Pipelines with H2O models")
print(f"5. Our training took {training_time:.2f} seconds with Sparkling Water")

In [None]:
# Shut down the H2O cluster
h2o.cluster().shutdown()

# Stop the Spark session
spark.stop()

In [None]:
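# Financial Modeling Prep API Integration
import os
import requests
import pandas as pd

# Read the API keys from the environment instead of hard-coding secrets in the
# notebook (the variable names are assumptions; set them before running)
FMP_API_KEY = os.environ.get("FMP_API_KEY", "")
DEEPSEEK_API_KEY = os.environ.get("DEEPSEEK_API_KEY", "")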

# Function to get economic indicators from Financial Modeling Prep
def get_economic_indicators():
    url = f"https://financialmodelingprep.com/api/v3/economic/economic_indicators?apikey={FMP_API_KEY}"
    response = requests.get(url)
    data = response.json()
    return pd.DataFrame(data)

# Get country and currency data
def get_country_currency_data():
    url = f"https://financialmodelingprep.com/api/v3/fx?apikey={FMP_API_KEY}"
    response = requests.get(url)
    fx_data = response.json()
    
    # Get country profiles for ISO codes
    url = f"https://financialmodelingprep.com/api/v4/country_list?apikey={FMP_API_KEY}"
    response = requests.get(url)
    country_data = response.json()
    
    # Create comprehensive country-currency mapping
    country_df = pd.DataFrame(country_data)
    fx_df = pd.DataFrame(fx_data)
    
    # Extract currency codes from FX pairs
    currency_codes = set()
    for pair in fx_df["ticker"].values:
        if "/" in pair:
            base, quote = pair.split("/")
            currency_codes.add(base)
            currency_codes.add(quote)
    
    # Create final mapping dataframe
    mapping_data = []
    for country in country_df.to_dict("records"):
        country_name = country.get("name", "")
        country_code = country.get("code", "")
        currency_name = country.get("currency", "")
        currency_code = ""
        
        # Try to find currency code
        for code in currency_codes:
            if len(code) == 3 and code.upper() in currency_name.upper():
                currency_code = code
                break
        
        mapping_data.append({
            "country_name": country_name,
            "country_code": country_code,
            "currency_name": currency_name,
            "currency_code": currency_code
        })
    
    return pd.DataFrame(mapping_data)

# Get economic indicators
try:
    economic_indicators = get_economic_indicators()
    print("Economic Indicators:")
    print(economic_indicators.head())
except Exception as e:
    print(f"Error fetching economic indicators: {e}")

# Get country-currency mapping
try:
    country_currency_mapping = get_country_currency_data()
    print("\nCountry-Currency Mapping:")
    print(country_currency_mapping.head(20))
    
    # Save the mapping to CSV
    country_currency_mapping.to_csv("country_currency_mapping.csv", index=False)
    print("\nSaved country-currency mapping to CSV file")
except Exception as e:
    print(f"Error creating country-currency mapping: {e}")
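
The mapping function above collects currency codes by splitting each FX ticker on `/`. As a small stand-alone sketch of that step, the helper below also accepts six-letter tickers without a separator. That second format is an assumption about what the endpoint might return, and `extract_currency_codes` is a hypothetical name, not part of the notebook or the API.

```python
def extract_currency_codes(tickers):
    """Collect 3-letter currency codes from FX pair tickers (hypothetical helper)."""
    codes = set()
    for ticker in tickers:
        if "/" in ticker:
            parts = ticker.split("/")
        elif len(ticker) == 6:
            # Assumed format without a separator, e.g. "GBPJPY"
            parts = [ticker[:3], ticker[3:]]
        else:
            continue  # skip anything that does not look like a currency pair
        codes.update(part.upper() for part in parts if len(part) == 3)
    return codes

codes = extract_currency_codes(["EUR/USD", "gbpjpy", "BTC"])
print(sorted(codes))
```

Keeping this step in its own function makes it easy to check against a few sample tickers before running the slower API calls.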


In [None]:
# DeepSeek API Integration for Crypto-Country Association
import requests
import json

def get_crypto_country_associations(cryptocurrencies):
    url = "https://api.deepseek.com/v1/chat/completions"
    
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {DEEPSEEK_API_KEY}"
    }
    
    crypto_list = ", ".join(cryptocurrencies)
    
    data = {
        "model": "deepseek-chat",
        "messages": [
            {
                "role": "system",
                "content": "You are a helpful assistant that provides accurate information about cryptocurrencies."
            },
            {
                "role": "user",
                "content": f"For each of these cryptocurrencies: {crypto_list}, provide the country where they have their entity registered or where they primarily report taxes. Return the data in JSON format with cryptocurrency name, country name, and ISO country code."
            }
        ],
        "temperature": 0.1,
        "max_tokens": 2000
    }
    
    try:
        response = requests.post(url, headers=headers, json=data)
        response_data = response.json()
        
        if "choices" in response_data and len(response_data["choices"]) > 0:
            content = response_data["choices"][0]["message"]["content"]
            
            # Extract JSON from the response
            try:
                # Try to find JSON in the response
                start_idx = content.find("{")
                end_idx = content.rfind("}")
                
                if start_idx != -1 and end_idx != -1:
                    json_str = content[start_idx:end_idx+1]
                    return json.loads(json_str)
                else:
                    return {"error": "No JSON found in response", "raw_response": content}
            except json.JSONDecodeError:
                return {"error": "Failed to parse JSON", "raw_response": content}
        else:
            return {"error": "No response from DeepSeek API"}
    except Exception as e:
        return {"error": str(e)}

# Example usage
cryptocurrencies = ["Bitcoin", "Ethereum", "Ripple", "Cardano", "Solana"]
try:
    crypto_country_data = get_crypto_country_associations(cryptocurrencies)
    print("Cryptocurrency Country Associations:")
    print(json.dumps(crypto_country_data, indent=2))
    
    # Convert to DataFrame and save
    if not isinstance(crypto_country_data, dict) or not crypto_country_data.get("error"):
        crypto_df = pd.DataFrame(crypto_country_data)
        crypto_df.to_csv("crypto_country_associations.csv", index=False)
        print("\nSaved cryptocurrency country associations to CSV file")
except Exception as e:
    print(f"Error getting cryptocurrency country associations: {e}")
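
The response handling above looks for the first `{` and the last `}` in the model's reply, which misses a reply whose top-level JSON value is an array and can fail when extra braces appear in the surrounding text. One possible alternative, sketched here with only the standard library, is to try decoding at each candidate start position. `extract_json` is a hypothetical helper, and the sample reply is made up for illustration.

```python
import json

def extract_json(content):
    """Return the first JSON object or array embedded in `content`, or None."""
    decoder = json.JSONDecoder()
    for start, char in enumerate(content):
        if char in "{[":
            try:
                # raw_decode parses one JSON value starting at `start` and
                # ignores whatever text follows it
                value, _ = decoder.raw_decode(content, start)
                return value
            except json.JSONDecodeError:
                continue  # not valid JSON from here; try the next candidate
    return None

# Made-up reply with placeholder values, for illustration only
reply = 'Sure, here it is: [{"name": "Ripple", "iso_code": "XX"}] Hope that helps {not json}'
result = extract_json(reply)
print(result)
```

A `None` result can then be mapped to the same `{"error": ...}` dictionary the function above already returns.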
