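# 2_Data Preparation
This notebook shows how to fetch market data in preparation for the indicator calculations and strategy implementations that follow.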

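### 1. Data Sources
Common ways to obtain market data include:

1. Tushare: an open-source Python library focused on financial data. It provides basic data for stocks, funds, futures, options, and more, with free downloads of daily and minute bars.
2. Alpha Vantage: global equities (US and Hong Kong stocks, with partial A-share support, though its A-share coverage is less complete than Tushare's), ETFs, forex, and cryptocurrencies. Daily and minute bars can be downloaded for free.
3. Yahoo Finance: global equities (US and Hong Kong stocks, with partial A-share support, though its A-share coverage is less complete than Tushare's), ETFs, forex, and cryptocurrencies. Daily bars can be downloaded for free; minute-bar requests tend to trigger rate limits.
4. ByBit: cryptocurrency data; everything except Level 2 data can be downloaded for free. (Binance provides some Level 2 data for free.)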

### 2. Install and Configure the Required Libraries: Tushare and yfinance

In [3]:
# Install Tushare (if not already installed)
# !pip install tushare

In [1]:
# Import the required libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import warnings
import os
import sys
from datetime import datetime
from dotenv import load_dotenv, find_dotenv

# Find the .env file in the parent directory
dotenv_path = find_dotenv("../.env")

if not dotenv_path:
    print(".env file not found; make sure it exists")
else:
    load_dotenv(dotenv_path)
    print("Loaded .env file successfully")

# Add the parent directory to sys.path so the data_processing package can be imported
notebook_dir = os.getcwd()
parent_dir = os.path.abspath(os.path.join(notebook_dir, '..'))
sys.path.append(parent_dir)

# Display options: show floats with four decimal places
pd.set_option('display.float_format', lambda x: '%.4f' % x)
# Plot style (optional)
plt.style.use('seaborn-v0_8-bright')
# Use a font that can render Chinese characters in plots
plt.rcParams['font.sans-serif'] = ['Microsoft YaHei']
plt.rcParams['axes.unicode_minus'] = False

Loaded .env file successfully
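The cell above loads `.env` but does not show which variables it defines. As a minimal sketch of reading a credential once the file is loaded, the helper below fails early when a variable is missing. Both `get_required_env` and the variable name `TUSHARE_TOKEN` are hypothetical names chosen for illustration; the project's real variable names may differ.

```python
import os


def get_required_env(name: str) -> str:
    """Return an environment variable's value, or raise if it is unset or empty."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set; add it to .env")
    return value


# Demo only: set a placeholder so the example runs without a real .env file.
# TUSHARE_TOKEN is a hypothetical variable name.
os.environ.setdefault("TUSHARE_TOKEN", "demo-token")
token = get_required_env("TUSHARE_TOKEN")
print(f"token loaded ({len(token)} characters)")
```

Raising on a missing value surfaces a configuration problem at startup instead of as an authentication error in a later cell.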


In [2]:
# Fetch five years of daily Tesla (TSLA) data from Yahoo Finance
from data_processing.yahoo_finance import load_data_yf, flatten_yf_columns, standardize_columns
start_date = datetime.strptime('2020-03-13', '%Y-%m-%d')
end_date = datetime.strptime('2025-03-12', '%Y-%m-%d')
print(f"Date range: {start_date.strftime('%Y-%m-%d')} to {end_date.strftime('%Y-%m-%d')}")

# Download daily Tesla bars
ticker = 'TSLA'
interval = "1d"
data = load_data_yf(ticker=ticker, start_date=start_date, end_date=end_date, interval=interval)
data = flatten_yf_columns(data)
data = standardize_columns(data)
print(data.head(10))

Date range: 2020-03-13 to 2025-03-12
Loading data from local cache
             close    high     low    open     volume
Date                                                 
2020-03-13 36.4413 40.5047 33.4667 39.6667  339604500
2020-03-16 29.6713 32.9913 29.4780 31.3000  307342500
2020-03-17 28.6800 31.4567 26.4000 29.3340  359919000
2020-03-18 24.0813 26.9907 23.3673 25.9333  356793000
2020-03-19 28.5093 30.1333 23.8973 24.9800  452932500
2020-03-20 28.5020 31.8000 28.3860 29.2133  424282500
2020-03-23 28.9527 29.4667 27.3667 28.9067  246817500
2020-03-24 33.6667 34.2460 31.6000 31.8200  343428000
2020-03-25 35.9500 37.1333 34.0740 36.3500  318340500
2020-03-26 35.2107 37.3333 34.1500 36.4927  260710500
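The project helpers `flatten_yf_columns` and `standardize_columns` are not shown in this notebook. The sketch below illustrates what such a cleanup step could look like, assuming the raw download has two-level columns such as `("Close", "TSLA")`. The functions `flatten_columns` and `lowercase_columns` are hypothetical stand-ins, not the project's implementation.

```python
import pandas as pd


def flatten_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only the first level of MultiIndex columns, e.g. ('Close', 'TSLA') -> 'Close'."""
    out = df.copy()
    if isinstance(out.columns, pd.MultiIndex):
        out.columns = out.columns.get_level_values(0)
    return out


def lowercase_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Normalize column names to lowercase with underscores."""
    out = df.copy()
    out.columns = [str(c).strip().lower().replace(" ", "_") for c in out.columns]
    return out


# Synthetic one-row frame that mimics a two-level (field, ticker) column layout
raw = pd.DataFrame(
    [[36.4413, 40.5047, 33.4667, 39.6667, 339604500]],
    index=pd.DatetimeIndex(["2020-03-13"], name="Date"),
    columns=pd.MultiIndex.from_product(
        [["Close", "High", "Low", "Open", "Volume"], ["TSLA"]],
        names=["Price", "Ticker"],
    ),
)

clean = lowercase_columns(flatten_columns(raw))
print(list(clean.columns))
```

Flat lowercase names let the same downstream code consume frames from every data source in this notebook.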


In [3]:
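# Fetch 30-minute Tesla (TSLA) data for 2024 from Alpha Vantage
from data_processing.alpha_vantage import load_data_av, load_data_year
start_date = datetime.strptime('2024-03-13', '%Y-%m-%d')
end_date = datetime.strptime('2025-03-12', '%Y-%m-%d')
print(f"Date range: {start_date.strftime('%Y-%m-%d')} to {end_date.strftime('%Y-%m-%d')}")

# Download 30-minute Tesla bars
ticker = 'TSLA'
year = 2024
interval = "30m"  # used only by the commented-out load_data_av call below
# avdata = load_data_av(ticker=ticker, start_date=start_date, end_date=end_date, interval=interval)
# load_data_year fetches the whole calendar year, independent of start_date/end_date above
avdata = load_data_year(ticker, year, interval="30min")
display(avdata.head())

Date range: 2024-03-13 to 2025-03-12
Fetching data for 2024-01...
Data for 2024-01 saved to local cache
Fetching data for 2024-02...
Data for 2024-02 saved to local cache
Fetching data for 2024-03...
Data for 2024-03 saved to local cache
Fetching data for 2024-04...
Data for 2024-04 saved to local cache
Fetching data for 2024-05...
Data for 2024-05 saved to local cache
Fetching data for 2024-06...
Data for 2024-06 saved to local cache
Fetching data for 2024-07...
Data for 2024-07 saved to local cache
Fetching data for 2024-08...
Data for 2024-08 saved to local cache
Fetching data for 2024-09...
Data for 2024-09 saved to local cache
Fetching data for 2024-10...
Data for 2024-10 saved to local cache
Fetching data for 2024-11...
Data for 2024-11 saved to local cache
Fetching data for 2024-12...
Data for 2024-12 saved to local cache
Data for 2024 saved to local cache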


                       open    high     low   close  volume
2024-01-02 04:00:00  248.05  250.97  248.05  250.41  155860
2024-01-02 04:30:00  250.41  250.46  249.78  250.38   44094
2024-01-02 05:00:00  250.37  250.51  249.54  249.64   69108
2024-01-02 05:30:00  249.64  249.81  248.77  248.85   59484
2024-01-02 06:00:00  248.83  248.83   245.1   245.1  171227
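The log above shows the loader fetching one month at a time and saving each month to a local cache. The implementation of `load_data_year` is not shown, so the following is only a sketch of that pattern. `load_month_cached`, `load_year_cached`, the CSV file naming, and `fake_fetch` (a stand-in for the real API call) are all assumptions made for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_month_cached(ticker, month, fetch, cache_dir):
    """Return one month of bars, reading the CSV cache when it exists."""
    path = Path(cache_dir) / f"{ticker}_{month}.csv"
    if path.exists():
        return pd.read_csv(path, index_col=0, parse_dates=True)
    df = fetch(ticker, month)  # only hit the data source on a cache miss
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(path)
    return df


def load_year_cached(ticker, year, fetch, cache_dir):
    """Load twelve monthly frames and concatenate them in time order."""
    months = [f"{year}-{m:02d}" for m in range(1, 13)]
    frames = [load_month_cached(ticker, m, fetch, cache_dir) for m in months]
    return pd.concat(frames).sort_index()


calls = []


def fake_fetch(ticker, month):
    """Hypothetical fetcher that returns one synthetic bar per month."""
    calls.append(month)
    idx = pd.to_datetime([f"{month}-02 04:00:00"])
    return pd.DataFrame({"open": [1.0], "close": [2.0]}, index=idx)


cache_dir = tempfile.mkdtemp()
first = load_year_cached("TSLA", 2024, fake_fetch, cache_dir)
second = load_year_cached("TSLA", 2024, fake_fetch, cache_dir)  # served from cache
print(len(first), len(second), len(calls))
```

The second call does not invoke the fetcher again, which is the property that matters when a provider limits the number of API requests.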


In [4]:
# Download daily CATL (300750.SZ) data via Tushare
from data_processing.tu_share import load_data_ts, standardize_ts_columns
from datetime import datetime  # import the class so the name matches the earlier cells

# Set the date range
start_date = datetime.strptime('2020-03-13', '%Y-%m-%d')
end_date = datetime.strptime('2025-03-12', '%Y-%m-%d')

# Fetch the history of a single stock
ts_code = '300750.SZ'
df = load_data_ts(ts_code, start_date, end_date, freq="daily")
df = standardize_ts_columns(df)
print(df.head())

Data saved to local cache
              ts_code     open     high      low    close  pre_close   change  \
datetime                                                                        
2020-03-13  300750.SZ 126.1000 133.8000 126.0100 132.1000   134.2000  -2.1000   
2020-03-16  300750.SZ 128.0000 128.8800 118.8900 119.0600   132.1000 -13.0400   
2020-03-17  300750.SZ 121.0000 122.9300 111.2500 117.9500   119.0600  -1.1100   
2020-03-18  300750.SZ 118.9000 123.5000 116.0000 117.3500   117.9500  -0.6000   
2020-03-19  300750.SZ 119.0000 119.5000 111.6900 115.0400   117.3500  -2.3100   

            pct_chg      volume       amount  
datetime                                      
2020-03-13  -1.5648 274552.8400 3557983.7860  
2020-03-16  -9.8713 490484.8500 5992811.6550  
2020-03-17  -0.9323 392203.2800 4571660.1030  
2020-03-18  -0.5087 313383.7600 3757216.3500  
2020-03-19  -1.9685 331563.0700 3803916.5360  
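The derived columns in this output can be cross-checked against the prices. Judging from the rows shown, `change` equals `close - pre_close` and `pct_chg` equals `change / pre_close * 100`. The sketch below re-enters the five rows by hand and recomputes both columns. The tolerances are an assumption that allows for the four-decimal rounding in the printed output.

```python
import pandas as pd

# Values copied from the five rows printed above
rows = pd.DataFrame({
    "close":     [132.10, 119.06, 117.95, 117.35, 115.04],
    "pre_close": [134.20, 132.10, 119.06, 117.95, 117.35],
    "change":    [-2.10, -13.04, -1.11, -0.60, -2.31],
    "pct_chg":   [-1.5648, -9.8713, -0.9323, -0.5087, -1.9685],
})

# Recompute the derived columns from the prices
change = rows["close"] - rows["pre_close"]
pct_chg = change / rows["pre_close"] * 100

change_ok = ((change - rows["change"]).abs() < 1e-6).all()
pct_ok = ((pct_chg - rows["pct_chg"]).abs() < 1e-3).all()
print(change_ok, pct_ok)
```

For the first row, 132.10 - 134.20 = -2.10 and -2.10 / 134.20 * 100 ≈ -1.5648, which matches the printed values.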


In [None]:
# Download 30-minute Kweichow Moutai (600519.SH) data via Tushare
from data_processing.tu_share import get_ts_data, standardize_ts_columns
from datetime import datetime


# Set the date range
# start_date = datetime.strptime('2020-03-13', '%Y-%m-%d')
# end_date = datetime.strptime('2025-03-12', '%Y-%m-%d')
start_date = '2022-03-03'
end_date = '2025-02-28'

# Fetch the history of a single stock
ts_code = '600519.SH'
df = get_ts_data(ts_code, start_date, end_date, freq="30min")
# df = standardize_ts_columns(df)
print(df.head())

Data saved to: ./data/600519.SH-2022-03-03-2025-02-28-30min.csv
     ts_code           trade_time     close      open      high       low  \
0  600519.SH  2025-02-27 15:00:00 1485.5600 1483.9900 1488.0000 1483.6800   
1  600519.SH  2025-02-27 14:30:00 1484.0000 1487.4700 1489.0900 1483.1100   
2  600519.SH  2025-02-27 14:00:00 1487.5100 1487.8800 1489.0900 1484.8000   
3  600519.SH  2025-02-27 13:30:00 1487.9200 1485.9000 1488.7700 1477.1000   
4  600519.SH  2025-02-27 11:30:00 1485.5000 1486.8800 1489.9000 1480.0800   

          vol         amount  
0 511762.0000 760222460.0000  
1 378310.0000 562264260.0000  
2 476032.0000 708129900.0000  
3 421410.0000 625599740.0000  
4 467955.0000 695581500.0000  
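The rows above come back newest first, with `trade_time` as a plain column and volume named `vol`. Indicator code generally expects ascending time order, so a small normalization step is useful. The sketch below uses three of the printed rows. `to_time_indexed` is a hypothetical helper, not the project's `standardize_ts_columns`.

```python
import pandas as pd

# Three rows copied from the output above (newest first)
sample = pd.DataFrame({
    "ts_code": ["600519.SH"] * 3,
    "trade_time": ["2025-02-27 15:00:00", "2025-02-27 14:30:00", "2025-02-27 14:00:00"],
    "close": [1485.56, 1484.00, 1487.51],
    "vol": [511762.0, 378310.0, 476032.0],
})


def to_time_indexed(df: pd.DataFrame) -> pd.DataFrame:
    """Parse trade_time, rename columns, and sort bars oldest first."""
    out = df.copy()
    out["trade_time"] = pd.to_datetime(out["trade_time"])
    out = out.rename(columns={"trade_time": "datetime", "vol": "volume"})
    return out.set_index("datetime").sort_index()


bars = to_time_indexed(sample)
print(bars.index.is_monotonic_increasing)
print(bars[["close", "volume"]])
```

After this step the frame has the same `datetime` index and `volume` column name as the daily Tushare frame shown earlier.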




In [7]:
import time
import pandas as pd
import requests
from datetime import datetime

def load_data_bybit(
    symbol: str,
    start_date: datetime,
    end_date: datetime,
    interval: str = "1d",
    category: str = "linear",
) -> pd.DataFrame:
    """Download klines from ByBit's public v5 API, paging backwards from end_date."""
    # Accept plain date objects by expanding them to the start/end of the day
    if not isinstance(start_date, datetime):
        start_date = datetime.combine(start_date, datetime.min.time())
    if not isinstance(end_date, datetime):
        end_date = datetime.combine(end_date, datetime.max.time())

    # Map this notebook's interval names to ByBit's interval codes
    interval_map = {"1d": "D", "240": "240", "60": "60", "30": "30", "15": "15", "5": "5", "1": "1"}
    if interval not in interval_map:
        raise ValueError(f"Unsupported interval: {interval}")
    bybit_interval = interval_map[interval]

    url = "https://api.bybit.com/v5/market/kline"
    limit = 1000  # rows requested per call
    all_data = []
    # Note: timestamp() interprets naive datetimes in the local timezone
    start_ms = int(start_date.timestamp() * 1000)
    end_ms = int(end_date.timestamp() * 1000)
    cur_end = end_ms

    while cur_end > start_ms:
        params = {
            "category": category,
            "symbol": symbol,
            "interval": bybit_interval,
            "start": start_ms,
            "end": cur_end,
            "limit": limit
        }
        resp = requests.get(url, params=params, timeout=10)
        resp.raise_for_status()  # fail loudly on HTTP errors
        data = resp.json()
        klines = data.get("result", {}).get("list", [])
        if not klines:
            break
        # Sort this batch in ascending time order
        klines = sorted(klines, key=lambda x: int(x[0]))
        all_data.extend(klines)
        # Move the window back using the earliest bar in this batch
        earliest_ts = int(klines[0][0])
        if earliest_ts <= start_ms:
            break
        cur_end = earliest_ts - 1
        if len(klines) < limit:
            break
        time.sleep(0.2)  # brief pause between requests

    if not all_data:
        raise ValueError("No kline data was returned")

    df = pd.DataFrame(all_data, columns=[
        "timestamp", "open", "high", "low", "close", "volume", "turnover"
    ])
    # Timestamps are epoch milliseconds; the parsed datetimes are naive UTC
    df["datetime"] = pd.to_datetime(df["timestamp"].astype(int), unit="ms")
    df = df.sort_values("datetime").drop_duplicates("datetime").reset_index(drop=True)
    for col in ["open", "high", "low", "close", "volume"]:
        df[col] = pd.to_numeric(df[col], errors="coerce")
    df = df[["datetime", "open", "high", "low", "close", "volume"]]
    df = df[(df["datetime"] >= start_date) & (df["datetime"] <= end_date)]
    return df
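The loop in `load_data_bybit` walks backwards in time: each request asks for the window from `start` to the current end, and the earliest bar received becomes the next window's upper bound. The sketch below isolates that cursor logic so it can run offline. `fetch_page` is a hypothetical stand-in for the HTTP call, and it assumes the endpoint returns the most recent bars in the window first, capped at `limit`.

```python
def fetch_page(bars, start_ms, end_ms, limit):
    """Stand-in for the API: newest-first timestamps within [start_ms, end_ms]."""
    in_range = [t for t in bars if start_ms <= t <= end_ms]
    return sorted(in_range, reverse=True)[:limit]


def collect_backwards(bars, start_ms, end_ms, limit):
    """Page from end_ms back towards start_ms, as the loader's while-loop does."""
    collected = []
    cur_end = end_ms
    while cur_end > start_ms:
        page = fetch_page(bars, start_ms, cur_end, limit)
        if not page:
            break
        collected.extend(page)
        earliest = min(page)
        # Stop once the start is reached or the source has nothing older
        if earliest <= start_ms or len(page) < limit:
            break
        cur_end = earliest - 1  # next window ends just before the earliest bar
    return sorted(set(collected))


bars = list(range(0, 2500))  # 2500 synthetic bar timestamps
result = collect_backwards(bars, start_ms=0, end_ms=2499, limit=1000)
print(len(result), result[0], result[-1])  # prints: 2500 0 2499
```

With 2500 bars and a page size of 1000, the loop issues three requests (1000, 1000, then 500 bars) and collects every bar exactly once.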

In [25]:
# Download five years of daily BTC perpetual-contract data from ByBit
from datetime import datetime, timedelta

# Set the date range
end_date = datetime.today()
start_date = end_date - timedelta(days=5*365)


bydata = load_data_bybit(
    symbol="BTCUSDT",
    start_date=start_date,
    end_date=end_date,
    interval="1d",      # daily bars
    category="linear"   # USDT perpetual
)

print(bydata.head())

    datetime      open      high       low     close      volume
0 2020-05-10 9546.0000 9569.0000 8153.0000 8725.0000 101828.0060
1 2020-05-11 8725.0000 9158.5000 8182.0000 8559.0000  52692.8370
2 2020-05-12 8559.0000 8973.0000 8531.5000 8813.0000  12481.2420
3 2020-05-13 8813.0000 9400.5000 8794.5000 9301.5000  11279.5480
4 2020-05-14 9301.5000 9940.0000 9258.5000 9788.0000  27688.8850


In [26]:
print(bydata.info())       # Show the row and column counts and each column's dtype

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1825 entries, 0 to 1824
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   datetime  1825 non-null   datetime64[ns]
 1   open      1825 non-null   float64       
 2   high      1825 non-null   float64       
 3   low       1825 non-null   float64       
 4   close     1825 non-null   float64       
 5   volume    1825 non-null   float64       
dtypes: datetime64[ns](1), float64(5)
memory usage: 85.7 KB
None
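`info()` confirms that no column contains nulls, but it does not reveal whether any bars are missing from the time axis. A simple check is to compare the spacing of consecutive timestamps with the expected bar length. In the sketch below, `find_gaps` is a hypothetical helper and the four sample dates are synthetic, with one day deliberately left out.

```python
import pandas as pd


def find_gaps(times: pd.Series, step: pd.Timedelta) -> pd.DataFrame:
    """List places where consecutive timestamps are further apart than `step`."""
    times = times.sort_values().reset_index(drop=True)
    delta = times.diff()
    mask = delta > step
    return pd.DataFrame({
        "gap_start": times.shift(1)[mask],   # last bar before the gap
        "gap_end": times[mask],              # first bar after the gap
        "missing_bars": (delta[mask] / step).astype(int) - 1,
    })


# Synthetic daily timestamps with 2020-05-12 missing
times = pd.Series(pd.to_datetime(["2020-05-10", "2020-05-11", "2020-05-13", "2020-05-14"]))
gaps = find_gaps(times, pd.Timedelta(days=1))
print(gaps)
```

For real data the call would look like `find_gaps(bydata["datetime"], pd.Timedelta(days=1))`; an empty result means the series is contiguous at that bar length.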


In [None]:
# Download one year of 30-minute BTC perpetual-contract data from ByBit
from datetime import datetime, timedelta

# Set the date range
end_date = datetime.today()
start_date = end_date - timedelta(days=365)


bydata_min = load_data_bybit(
    symbol="BTCUSDT",
    start_date=start_date,
    end_date=end_date,
    interval="30",      # 30-minute bars
    category="linear"   # USDT perpetual
)

print(bydata_min.head())
print(bydata_min.info())       # Show the row and column counts and each column's dtype

              datetime       open       high        low      close    volume
16 2024-05-09 08:30:00 61280.0000 61422.0000 61234.0000 61293.2000  979.5730
17 2024-05-09 09:00:00 61293.2000 61353.7000 60950.7000 61053.8000 2654.6530
18 2024-05-09 09:30:00 61053.8000 61259.1000 60890.1000 61259.0000 3245.6220
19 2024-05-09 10:00:00 61259.0000 61297.3000 60946.0000 60998.2000 1754.3400
20 2024-05-09 10:30:00 60998.2000 61166.6000 60736.0000 60770.8000 3629.0550
<class 'pandas.core.frame.DataFrame'>
Index: 17504 entries, 16 to 17519
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   datetime  17504 non-null  datetime64[ns]
 1   open      17504 non-null  float64       
 2   high      17504 non-null  float64       
 3   low       17504 non-null  float64       
 4   close     17504 non-null  float64       
 5   volume    17504 non-null  float64       
dtypes: datetime64[ns](1), float64(5)
memory usage: 957.2 KB
None
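The index in this output starts at 16 rather than 0, so the final date filter in `load_data_bybit` appears to have dropped the first 16 half-hour bars (8 hours). One possible explanation, which the notebook does not confirm, is a timezone mismatch: `datetime.timestamp()` reads naive datetimes as local time, while `pd.to_datetime(..., unit="ms")` yields naive UTC values. The sketch below keeps both sides in UTC. `to_utc_ms` is a hypothetical helper, and treating naive inputs as UTC is an assumption.

```python
from datetime import datetime, timezone

import pandas as pd


def to_utc_ms(dt: datetime) -> int:
    """Convert a datetime to epoch milliseconds, treating naive values as UTC."""
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
    return int(dt.timestamp() * 1000)


start = datetime(2024, 5, 9, 0, 30)
start_ms = to_utc_ms(start)

# Two synthetic 30-minute bar timestamps, parsed as timezone-aware UTC
bar_times = pd.to_datetime([start_ms, start_ms + 30 * 60 * 1000], unit="ms", utc=True)
start_utc = pd.Timestamp(start).tz_localize("UTC")

# Comparing aware values avoids an implicit local-vs-UTC offset
kept = bar_times[bar_times >= start_utc]
print(len(kept), bar_times[0] == start_utc)
```

With both the request boundaries and the parsed bar times expressed in UTC, the range filter keeps the first requested bar regardless of the machine's local timezone.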
