# Climate Risk Premium Analysis: Data Collection

**Project**: Analyzing Climate Risk Premiums in US Equity Markets  
**Notebook**: 1. Data Collection  
**Author**: Anush Nepal  

## Objective
This notebook collects stock price data for companies in three climate-sensitive sectors:
- **Energy**: Direct exposure to transition risks and carbon regulations
- **Insurance**: Companies that price climate risks into their business models
- **Real Estate**: Physical assets exposed to climate events

The notebook gathers 8 years of daily data (2017-2024) to analyze how climate events affect stock prices and whether investors price climate risk into valuations.

## 1. Importing Required Libraries
The following Python libraries are used for data collection and analysis:

In [5]:
import pandas as pd
import numpy as np
import yfinance as yf # For stock data collection

from datetime import datetime, timedelta
import time
import os 
import warnings
warnings.filterwarnings('ignore')

pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

print("Libraries imported successfully.")
print(f"Pandas version: {pd.__version__}")
print(f"Current date: {datetime.now().strftime('%Y-%m-%d')}")


Libraries imported successfully.
Pandas version: 2.2.3
Current date: 2025-08-13


## 2. Defining Company Lists by Sector
We'll focus on large-cap companies in each sector to ensure data quality and liquidity. These firms represent the major players that institutional investors would focus on. 

In [6]:
# Energy Sector (major oil and gas producers, refiners, pipelines, and oilfield services)
energy_tickers = [
    'XOM', # Exxon Mobil
    'CVX', # Chevron
    'COP', # ConocoPhillips
    'EOG', # EOG Resources
    'SLB', # Schlumberger
    'PXD', # Pioneer Natural Resources
    'KMI', # Kinder Morgan
    'OXY', # Occidental Petroleum
    'VLO', # Valero Energy
    'PSX', # Phillips 66
    'MPC', # Marathon Petroleum
    'HAL', # Halliburton
    'BKR', # Baker Hughes
    'DVN', # Devon Energy
    'FANG' # Diamondback Energy
]

# Insurance Sector (property & casualty insurers most exposed to climate risk)
insurance_tickers = [
    'BRK-B', # Berkshire Hathaway
    'PGR',   # Progressive
    'TRV',   # Travelers
    'ALL',   # Allstate
    'CB',    # Chubb
    'AIG',   # American International Group
    'HIG',   # Hartford Financial
    'CNA',   # CNA Financial
    'RLI',   # RLI Corp
    'AFG',   # American Financial Group
    'CINF',  # Cincinnati Financial
    'WRB',   # W.R. Berkley
    'Y',     # Alleghany
    'EG',    # Everest Group
    'KMPR'   # Kemper
]

# Real Estate Sector (REITs and real estate companies)
real_estate_tickers = [
    'PLD',   # Prologis (logistics real estate)
    'AMT',   # American Tower (cell towers)
    'CCI',   # Crown Castle (cell towers)
    'EQIX',  # Equinix (data centers)
    'WELL',  # Welltower (healthcare real estate)
    'DLR',   # Digital Realty Trust
    'SPG',   # Simon Property Group (malls)
    'O',     # Realty Income
    'AVTR',  # Avantor (life-sciences supplier, not a REIT; dropped in In [14])
    'EXR',   # Extra Space Storage (self-storage)
    'AVB',   # AvalonBay Communities
    'EQR',   # Equity Residential
    'MAA',   # Mid-America Apartment Communities
    'ESS',   # Essex Property Trust
    'UDR'    # UDR Inc
]

# Combining all sectors into a master dictionary
sector_tickers = {
    'Energy': energy_tickers,
    'Insurance': insurance_tickers,
    'Real Estate': real_estate_tickers
}

# Summary
total_companies = sum(len(tickers) for tickers in sector_tickers.values())
print(f"Total companies selected: {total_companies}")
for sector, tickers in sector_tickers.items():
    print(f"{sector}: {len(tickers)} companies")
    print(f"- Sample tickers: {', '.join(tickers[:5])}...")

Total companies selected: 45
Energy: 15 companies
- Sample tickers: XOM, CVX, COP, EOG, SLB...
Insurance: 15 companies
- Sample tickers: BRK-B, PGR, TRV, ALL, CB...
Real Estate: 15 companies
- Sample tickers: PLD, AMT, CCI, EQIX, WELL...
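
Later steps often need to go from a ticker back to its sector, and a ticker listed twice would double-count that company. The sketch below inverts a sector dictionary into a flat lookup and rejects duplicates. The helper name `build_ticker_lookup` is hypothetical, and the lists are abbreviated copies of the ones above, used only for illustration.

```python
from collections import Counter

# Abbreviated copies of the notebook's lists, for illustration only.
sector_tickers = {
    "Energy": ["XOM", "CVX", "COP"],
    "Insurance": ["BRK-B", "PGR", "TRV"],
    "Real Estate": ["PLD", "AMT", "CCI"],
}


def build_ticker_lookup(sector_tickers):
    """Return {ticker: sector}, raising if any ticker appears more than once."""
    counts = Counter(t for tickers in sector_tickers.values() for t in tickers)
    duplicates = sorted(t for t, n in counts.items() if n > 1)
    if duplicates:
        raise ValueError(f"Duplicate tickers: {duplicates}")
    return {t: sector for sector, tickers in sector_tickers.items() for t in tickers}


ticker_to_sector = build_ticker_lookup(sector_tickers)
print(ticker_to_sector["PGR"])  # Insurance
```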


## 3. Defining Climate Events for Analysis
We'll focus on major climate events from 2017-2024 that had significant economic impact and media coverage. These events provide natural experiments to study market reactions.

In [7]:
# Major climate events
climate_events = {
    'Hurricane Harvey': {
        'date': '2017-08-25',
        'type': 'Hurricane',
        'description': 'Category 4 hurricane, $125B+ damages, major oil refinery impacts'},
    'Hurricane Irma': {
        'date': '2017-09-10', 
        'type': 'Hurricane',
        'description': 'Category 4 hurricane, Florida impact, insurance claims spike'},
    'Camp Fire': {
        'date': '2018-11-08',
        'type': 'Wildfire', 
        'description': 'Deadliest CA wildfire, PG&E bankruptcy, massive insurance losses'},
    'Australia Bushfires': {
        'date': '2020-01-03',
        'type': 'Wildfire',
        'description': 'Record-breaking bushfires, global climate concerns'},
    'Texas Winter Storm': {
        'date': '2021-02-15',
        'type': 'Extreme Cold',
        'description': 'Power grid failure, energy infrastructure collapse'},
    'Hurricane Ian': {
        'date': '2022-09-28',
        'type': 'Hurricane', 
        'description': 'Category 4, $112B+ damages, major insurance event'},
    'European Heat Wave': {
        'date': '2023-07-18',
        'type': 'Heat Wave',
        'description': 'Record temperatures, infrastructure stress, energy demand'}
}

# Converting to DataFrame
events_df = pd.DataFrame.from_dict(climate_events, orient='index').reset_index()
events_df.columns = ['Event_Name', 'Date', 'Type', 'Description']
events_df['Date'] = pd.to_datetime(events_df['Date'])
events_df = events_df.sort_values('Date')

print("Climate Events for Analysis:")
print()
for _, event in events_df.iterrows():
    print(f"{event['Date'].strftime('%Y-%m-%d')}: {event['Event_Name']} ({event['Type']})")
    print(f" {event['Description']}")
    print()
print(f"\nTotal events: {len(events_df)}")
print(f"Date range: {events_df['Date'].min().strftime('%Y-%m-%d')} to {events_df['Date'].max().strftime('%Y-%m-%d')}")

Climate Events for Analysis:

2017-08-25: Hurricane Harvey (Hurricane)
 Category 4 hurricane, $125B+ damages, major oil refinery impacts

2017-09-10: Hurricane Irma (Hurricane)
 Category 4 hurricane, Florida impact, insurance claims spike

2018-11-08: Camp Fire (Wildfire)
 Deadliest CA wildfire, PG&E bankruptcy, massive insurance losses

2020-01-03: Australia Bushfires (Wildfire)
 Record-breaking bushfires, global climate concerns

2021-02-15: Texas Winter Storm (Extreme Cold)
 Power grid failure, energy infrastructure collapse

2022-09-28: Hurricane Ian (Hurricane)
 Category 4, $112B+ damages, major insurance event

2023-07-18: European Heat Wave (Heat Wave)
 Record temperatures, infrastructure stress, energy demand


Total events: 7
Date range: 2017-08-25 to 2023-07-18
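
Some event dates do not fall on trading days (2017-09-10, the Hurricane Irma date, was a Sunday), so an event study needs a rule for choosing "day 0". The sketch below is one possible convention, not the notebook's method: day 0 is the first trading day on or after the event date. The function name `event_window` and the `pre`/`post` defaults are assumptions, and the weekday-only calendar is a stand-in that ignores market holidays and uses timezone-naive dates.

```python
import pandas as pd

# Stand-in trading calendar: weekdays only. A real price index from the
# download would already exclude market holidays.
trading_days = pd.bdate_range("2017-08-01", "2017-10-31")


def event_window(event_date, trading_days, pre=5, post=10):
    """Return trading days from `pre` days before day 0 to `post` days after.

    Day 0 is the first trading day on or after `event_date`.
    """
    pos = trading_days.searchsorted(pd.Timestamp(event_date))  # first day >= event
    start = max(pos - pre, 0)
    stop = min(pos + post + 1, len(trading_days))
    return trading_days[start:stop]


irma = event_window("2017-09-10", trading_days)
print(irma[5].date())  # day 0: 2017-09-11
print(len(irma))       # 16
```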


## 4. Data Collection Functions
These helper functions download stock data, handle potential errors, and keep the code reusable and easy to debug.

In [8]:
def download_stock_data(ticker, start_date, end_date, max_retries=3):
    """
    Downloading stock data for a single ticker with error handling.

    Parameters:
    ticker (str): Stock ticker symbol
    start_date (str): Start date in 'YYYY-MM-DD' format
    end_date (str): End date in 'YYYY-MM-DD' format
    max_retries (int): Maximum number of retry attempts

    Returns:
    pandas.DataFrame: Stock data with OHLCV columns
    """

    for attempt in range(max_retries):
        try: # Downloading data using yfinance
            stock = yf.Ticker(ticker)
            data = stock.history(start=start_date, end=end_date)

            if data.empty:
                print(f"Warning: No data found for {ticker}")
                return None

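                # NOTE: everything from here to `return data` is indented inside the
                # `if data.empty:` block, after `return None`, so it never runs and
                # the function returns None for every ticker (corrected in In [12]).
                # Adding ticker column for identification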
                data['Ticker'] = ticker
                data.reset_index(inplace=True)

                # Cleaning column names (removing any extra spaces)
                data.columns = data.columns.str.strip()
                print(f"Successfully downloaded {ticker}: {len(data)} records")
                return data
            
        except Exception as e:
            print(f"Attempt {attempt + 1} failed for {ticker}: {str(e)}")
            if attempt < max_retries - 1:
                time.sleep(2) # To wait 2 seconds before retry
            else:
                print(f"Failed to download {ticker} after {max_retries} attempts")
                return None

def download_sector_data(tickers, sector_name, start_date, end_date):
    """
    Downloading stock data for all tickers in a sector.

    Parameters:
    tickers (list): List of ticker symbols
    sector_name (str): Name of the sector for logging
    start_date (str): Start date in 'YYYY-MM-DD' format
    end_date (str): End date in 'YYYY-MM-DD' format
    
    Returns:
    pandas.DataFrame: Combined stock data for all tickers
    """
    
    print(f"\nDownloading {sector_name} sector data...")
    print(f"Tickers: {', '.join(tickers)}")
    print()
    
    sector_data = []
    successful_downloads = 0
    for i, ticker in enumerate(tickers, 1):
        print(f"[{i}/{len(tickers)}] Downloading {ticker}...")
        data = download_stock_data(ticker, start_date, end_date)
        if data is not None:
            data['Sector'] = sector_name
            sector_data.append(data)
            successful_downloads += 1
            
        time.sleep(0.5) # Adding small delay (API)
        
    if sector_data:
        combined_data = pd.concat(sector_data, ignore_index=True)
        print(f"\n{sector_name} sector complete: {successful_downloads}/{len(tickers)} successful downloads")
        print(f"  Total records: {len(combined_data):,}")
        return combined_data
    else:
        print(f"\nNo data collected for {sector_name} sector")
        return pd.DataFrame()
        
def save_data(data, filename, folder='../data/raw/'):
    """
    Saving data to CSV file with error handling.
    
    Parameters:
    data (pandas.DataFrame): Data to save
    filename (str): Name of the file (without extension)
    folder (str): Folder path to save the file
    """
    try:
        os.makedirs(folder, exist_ok=True) # Creating folder if it doesn't exist
        filepath = os.path.join(folder, f"{filename}.csv") # File path
        data.to_csv(filepath, index=False) # Saving to CSV
        print(f"Data saved to: {filepath}")
        print(f"  File size: {len(data):,} rows * {len(data.columns)} columns")

    except Exception as e:
        print(f"Error saving {filename}: {str(e)}")

print("Data collection functions defined successfully.")

Data collection functions defined successfully.
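
The retry logic can be exercised without a network connection by passing the download step in as a function. This is a minimal sketch under that assumption: `fetch_with_retry` and `flaky_fetch` are hypothetical names, and `fetch` stands in for a call such as `yf.Ticker(ticker).history(...)`.

```python
import time


def fetch_with_retry(fetch, ticker, max_retries=3, delay=0.0):
    """Call `fetch(ticker)` up to `max_retries` times; return None if all fail.

    `fetch` is a hypothetical stand-in for the real download call, injected
    so the retry behaviour can be tested offline.
    """
    for attempt in range(1, max_retries + 1):
        try:
            return fetch(ticker)
        except Exception as exc:
            print(f"Attempt {attempt} failed for {ticker}: {exc}")
            if attempt < max_retries:
                time.sleep(delay)  # pause before the next attempt
    return None


# Fake fetcher that fails twice, then succeeds on the third call.
calls = {"n": 0}


def flaky_fetch(ticker):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary network error")
    return f"data for {ticker}"


result = fetch_with_retry(flaky_fetch, "XOM")
print(result)  # data for XOM
```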


## 5. Executing Data Collection
Downloading all the stock data (8 years of daily data for 45 companies).

In [9]:
total_companies = sum(len(tickers) for tickers in sector_tickers.values())
start_date = '2017-01-01'
end_date = '2024-12-31'

print(f"Starting data collection for period: {start_date} to {end_date}")
print(f"Total companies to download: {total_companies}")
print(f"Estimated time: {total_companies * 2 / 60:.1f} minutes")
print()

all_sector_data = []
collection_start_time = time.time()

# Downloading data for each sector
for sector, tickers in sector_tickers.items():
    sector_data = download_sector_data(tickers, sector, start_date, end_date)
    if not sector_data.empty:
        all_sector_data.append(sector_data)
        save_data(sector_data, f"{sector.lower().replace(' ', '_')}_stock_data")
    print()

# Combining all sectors
if all_sector_data:
    complete_dataset = pd.concat(all_sector_data, ignore_index=True)
    save_data(complete_dataset, "complete_stock_data") # Saving combined dataset
    collection_end_time = time.time()
    total_time = (collection_end_time - collection_start_time) / 60
    print(f"\nData collection complete.")
    print(f"Total time: {total_time:.1f} minutes")
    print(f"Total records collected: {len(complete_dataset):,}")
    print(f"Date range: {complete_dataset['Date'].min()} to {complete_dataset['Date'].max()}")
    print(f"Unique companies: {complete_dataset['Ticker'].nunique()}")

else:
    print("\nNo data was successfully collected.")

Starting data collection for period: 2017-01-01 to 2024-12-31
Total companies to download: 45
Estimated time: 1.5 minutes


Downloading Energy sector data...
Tickers: XOM, CVX, COP, EOG, SLB, PXD, KMI, OXY, VLO, PSX, MPC, HAL, BKR, DVN, FANG

[1/15] Downloading XOM...
[2/15] Downloading CVX...
[3/15] Downloading COP...
[4/15] Downloading EOG...
[5/15] Downloading SLB...
[6/15] Downloading PXD...


$PXD: possibly delisted; no timezone found


[7/15] Downloading KMI...
[8/15] Downloading OXY...
[9/15] Downloading VLO...
[10/15] Downloading PSX...
[11/15] Downloading MPC...
[12/15] Downloading HAL...
[13/15] Downloading BKR...
[14/15] Downloading DVN...
[15/15] Downloading FANG...

No data collected for Energy sector


Downloading Insurance sector data...
Tickers: BRK-B, PGR, TRV, ALL, CB, AIG, HIG, CNA, RLI, AFG, CINF, WRB, Y, EG, KMPR

[1/15] Downloading BRK-B...
[2/15] Downloading PGR...
[3/15] Downloading TRV...
[4/15] Downloading ALL...
[5/15] Downloading CB...
[6/15] Downloading AIG...
[7/15] Downloading HIG...
[8/15] Downloading CNA...
[9/15] Downloading RLI...
[10/15] Downloading AFG...
[11/15] Downloading CINF...
[12/15] Downloading WRB...
[13/15] Downloading Y...


$Y: possibly delisted; no timezone found


[14/15] Downloading EG...
[15/15] Downloading KMPR...

No data collected for Insurance sector


Downloading Real Estate sector data...
Tickers: PLD, AMT, CCI, EQIX, WELL, DLR, SPG, O, AVTR, EXR, AVB, EQR, MAA, ESS, UDR

[1/15] Downloading PLD...
[2/15] Downloading AMT...
[3/15] Downloading CCI...
[4/15] Downloading EQIX...
[5/15] Downloading WELL...
[6/15] Downloading DLR...
[7/15] Downloading SPG...
[8/15] Downloading O...
[9/15] Downloading AVTR...
[10/15] Downloading EXR...
[11/15] Downloading AVB...
[12/15] Downloading EQR...
[13/15] Downloading MAA...
[14/15] Downloading ESS...
[15/15] Downloading UDR...

No data collected for Real Estate sector


No data was successfully collected.
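
Every ticker failed here, including those with no delisting warning, which points at the function rather than the data source. In `download_stock_data` as defined in In [8], the post-processing lines sit inside the `if data.empty:` block after `return None`, so they cannot run. The toy functions below reproduce that control flow; their names are hypothetical and a plain list stands in for the DataFrame.

```python
def process_buggy(rows, max_retries=3):
    """Mirrors the In [8] layout: processing is nested under the empty check."""
    for _ in range(max_retries):
        if not rows:
            return None
            rows = rows + ["Ticker"]  # unreachable: follows `return` in the same block
            return rows
    # The loop ends without returning, so Python returns None implicitly.


def process_fixed(rows, max_retries=3):
    """Processing dedented to the loop body, as in the In [12] redefinition."""
    for _ in range(max_retries):
        if not rows:
            return None
        return rows + ["Ticker"]


print(process_buggy(["Open", "Close"]))  # None
print(process_fixed(["Open", "Close"]))  # ['Open', 'Close', 'Ticker']
```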


In [10]:
# Testing with just one ticker to see what's happening
print("Testing single ticker download...")

# Trying Exxon Mobil (XOM)
test_ticker = "XOM"
test_data = yf.Ticker(test_ticker)
test_result = test_data.history(start='2017-01-01', end='2024-12-31')

print(f"Data shape: {test_result.shape}")
print(f"Columns: {list(test_result.columns)}")
print(f"First few rows:")
print(test_result.head())
print(f"Is data empty? {test_result.empty}")

Testing single ticker download...
Data shape: (2011, 7)
Columns: ['Open', 'High', 'Low', 'Close', 'Volume', 'Dividends', 'Stock Splits']
First few rows:
                                Open       High        Low      Close  \
Date                                                                    
2017-01-03 00:00:00-05:00  61.858571  62.130652  61.321202  61.824558   
2017-01-04 00:00:00-05:00  61.981001  62.001406  61.049107  61.144337   
2017-01-05 00:00:00-05:00  61.348386  61.423210  60.158014  60.232838   
2017-01-06 00:00:00-05:00  60.396100  60.525342  59.736292  60.198837   
2017-01-09 00:00:00-05:00  60.008393  60.008393  58.872439  59.205742   

                             Volume  Dividends  Stock Splits  
Date                                                          
2017-01-03 00:00:00-05:00  10360600        0.0           0.0  
2017-01-04 00:00:00-05:00   9434200        0.0           0.0  
2017-01-05 00:00:00-05:00  14443200        0.0           0.0  
2017-01-06 00:00:00-

In [11]:
# Debugging our function with the working data
print("Debugging our download function...")

def debug_download_stock_data(ticker, start_date, end_date):
    """Debug version to see what's happening"""
    
    try:
        print(f"Step 1: Creating ticker object for {ticker}")
        stock = yf.Ticker(ticker)
        print(f"Step 2: Downloading data from {start_date} to {end_date}")
        data = stock.history(start=start_date, end=end_date)
        print(f"Step 3: Data shape: {data.shape}")
        print(f"Step 4: Data empty? {data.empty}")
        print(f"Step 5: Data columns: {list(data.columns)}")
        print(f"Step 6: Data index type: {type(data.index)}")
        
        if data.empty:
            print(f"Data is empty for {ticker}")
            return None
        
        print(f"Step 7: Adding ticker column")
        data['Ticker'] = ticker
        print(f"Step 8: Resetting index")
        data.reset_index(inplace=True)
        print(f"Step 9: Final data shape: {data.shape}")
        print(f"Step 10: Final columns: {list(data.columns)}")
        print(f"Successfully processed {ticker}: {len(data)} records")
        return data
        
    except Exception as e:
        print(f"Error processing {ticker}: {str(e)}")
        import traceback
        traceback.print_exc()
        return None

# Testing with XOM
test_result = debug_download_stock_data("XOM", "2017-01-01", "2024-12-31")
if test_result is not None:
    print(f"\nFinal result preview:")
    print(test_result.head())

Debugging our download function...
Step 1: Creating ticker object for XOM
Step 2: Downloading data from 2017-01-01 to 2024-12-31
Step 3: Data shape: (2011, 7)
Step 4: Data empty? False
Step 5: Data columns: ['Open', 'High', 'Low', 'Close', 'Volume', 'Dividends', 'Stock Splits']
Step 6: Data index type: <class 'pandas.core.indexes.datetimes.DatetimeIndex'>
Step 7: Adding ticker column
Step 8: Resetting index
Step 9: Final data shape: (2011, 9)
Step 10: Final columns: ['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'Dividends', 'Stock Splits', 'Ticker']
Successfully processed XOM: 2011 records

Final result preview:
                       Date       Open       High        Low      Close  \
0 2017-01-03 00:00:00-05:00  61.858571  62.130652  61.321202  61.824558   
1 2017-01-04 00:00:00-05:00  61.981001  62.001406  61.049107  61.144337   
2 2017-01-05 00:00:00-05:00  61.348386  61.423210  60.158014  60.232838   
3 2017-01-06 00:00:00-05:00  60.396100  60.525342  59.736292  60.198837   


In [12]:
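# Re-testing the original function, redefined with the post-download steps
# moved out of the `if data.empty:` block so they actually run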
print("Testing original function...")

def download_stock_data(ticker, start_date, end_date, max_retries=3):
    """
    Downloading stock data for a single ticker with error handling.
    """
    
    for attempt in range(max_retries):
        try:
            stock = yf.Ticker(ticker) # Downloading data using yfinance
            data = stock.history(start=start_date, end=end_date)
            if data.empty:
                print(f"No data found for {ticker}")
                return None
            data['Ticker'] = ticker # Adding ticker column for identification
            data.reset_index(inplace=True)
            data.columns = data.columns.str.strip() # Cleaning column names (removing any extra spaces)
            print(f"Successfully downloaded {ticker}: {len(data)} records")
            return data
            
        except Exception as e:
            print(f"Attempt {attempt + 1} failed for {ticker}: {str(e)}")
            if attempt < max_retries - 1:
                time.sleep(2)  # Wait 2 seconds before retry
            else:
                print(f"Failed to download {ticker} after {max_retries} attempts")
                return None

# Testing with XOM using original function
original_result = download_stock_data("XOM", "2017-01-01", "2024-12-31")
if original_result is not None:
    print(f"Original function success.")
    print(f"Shape: {original_result.shape}")
    print(f"Columns: {list(original_result.columns)}")
else:
    print("Original function failed.")

Testing original function...
Successfully downloaded XOM: 2011 records
Original function success.
Shape: (2011, 9)
Columns: ['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'Dividends', 'Stock Splits', 'Ticker']


In [13]:
# Testing sector function with just a few reliable tickers
print("Testing sector function with reliable tickers...")

def download_sector_data(tickers, sector_name, start_date, end_date):
    """
    Downloading stock data for all tickers in a sector.
    """
    
    print(f"\nDownloading {sector_name} sector data...")
    print(f"Tickers: {', '.join(tickers)}")
    print()
    sector_data = []
    successful_downloads = 0
    
    for i, ticker in enumerate(tickers, 1):
        print(f"[{i}/{len(tickers)}] Downloading {ticker}...")
        data = download_stock_data(ticker, start_date, end_date)
        if data is not None:
            data['Sector'] = sector_name
            sector_data.append(data)
            successful_downloads += 1
            print(f" Added {ticker} to sector data")
        else:
            print(f" {ticker} returned None")
        time.sleep(0.5)
    
    if sector_data:
        combined_data = pd.concat(sector_data, ignore_index=True)
        print(f"\n{sector_name} sector complete: {successful_downloads}/{len(tickers)} successful downloads")
        print(f" Total records: {len(combined_data):,}")
        return combined_data
    else:
        print(f"\nNo data collected for {sector_name} sector")
        return pd.DataFrame()

# Testing with just 3 very reliable energy stocks
test_energy_tickers = ['XOM', 'CVX', 'COP']  # Exxon, Chevron, ConocoPhillips
test_result = download_sector_data(test_energy_tickers, "Energy_Test", "2017-01-01", "2024-12-31")
if not test_result.empty:
    print(f"\nSuccess. Downloaded {len(test_result):,} total records")
    print(f"Unique tickers: {test_result['Ticker'].unique()}")
else:
    print("\nTest failed.")

Testing sector function with reliable tickers...

Downloading Energy_Test sector data...
Tickers: XOM, CVX, COP

[1/3] Downloading XOM...
Successfully downloaded XOM: 2011 records
 Added XOM to sector data
[2/3] Downloading CVX...
Successfully downloaded CVX: 2011 records
 Added CVX to sector data
[3/3] Downloading COP...
Successfully downloaded COP: 2011 records
 Added COP to sector data

Energy_Test sector complete: 3/3 successful downloads
 Total records: 6,033

Success. Downloaded 6,033 total records
Unique tickers: ['XOM' 'CVX' 'COP']


In [14]:
# Replacing problematic tickers with reliable ones
print("Fixing ticker lists...")

# Fixed Energy Sector (replacing PXD with WMB - Williams Companies, keeping the others)
energy_tickers = [
    'XOM',   # Exxon Mobil
    'CVX',   # Chevron
    'COP',   # ConocoPhillips
    'EOG',   # EOG Resources
    'SLB',   # Schlumberger
    'WMB',   # Williams Companies (replaced PXD)
    'KMI',   # Kinder Morgan
    'OXY',   # Occidental Petroleum
    'VLO',   # Valero Energy
    'PSX',   # Phillips 66
    'MPC',   # Marathon Petroleum
    'HAL',   # Halliburton
    'BKR',   # Baker Hughes
    'DVN',   # Devon Energy
    'FANG'   # Diamondback Energy
]

# Fixed Insurance Sector (replacing Y with MET - MetLife)
insurance_tickers = [
    'BRK-B', # Berkshire Hathaway
    'PGR',   # Progressive
    'TRV',   # Travelers
    'ALL',   # Allstate
    'CB',    # Chubb
    'AIG',   # American International Group
    'HIG',   # Hartford Financial
    'CNA',   # CNA Financial
    'RLI',   # RLI Corp
    'AFG',   # American Financial Group
    'CINF',  # Cincinnati Financial
    'WRB',   # W.R. Berkley
    'MET',   # MetLife (replaced Y)
    'EG',    # Everest Group
    'KMPR'   # Kemper
]

# Real Estate (replacing AVTR with ARE - Alexandria Real Estate)
real_estate_tickers = [
    'PLD',   # Prologis
    'AMT',   # American Tower
    'CCI',   # Crown Castle
    'EQIX',  # Equinix
    'WELL',  # Welltower
    'DLR',   # Digital Realty Trust
    'SPG',   # Simon Property Group
    'O',     # Realty Income
    'AVB',   # AvalonBay Communities
    'EXR',   # Extra Space Storage
    'EQR',   # Equity Residential
    'MAA',   # Mid-America Apartment Communities
    'ESS',   # Essex Property Trust
    'UDR',   # UDR Inc
    'ARE'    # Alexandria Real Estate
]

# Updating the master dictionary
sector_tickers = {
    'Energy': energy_tickers,
    'Insurance': insurance_tickers,
    'Real Estate': real_estate_tickers
}
print("Updated ticker lists with reliable companies")
print(f"Total companies: {sum(len(tickers) for tickers in sector_tickers.values())}")

Fixing ticker lists...
Updated ticker lists with reliable companies
Total companies: 45
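
When ticker lists are edited by hand, a set difference gives a quick record of what actually changed. The sketch below compares the Real Estate lists from In [6] and In [14]; `diff_tickers` is a hypothetical helper name.

```python
def diff_tickers(old, new):
    """Return (removed, added) tickers as sorted lists."""
    old_set, new_set = set(old), set(new)
    return sorted(old_set - new_set), sorted(new_set - old_set)


# Real Estate lists copied from In [6] (original) and In [14] (revised).
real_estate_old = ["PLD", "AMT", "CCI", "EQIX", "WELL", "DLR", "SPG", "O",
                   "AVTR", "EXR", "AVB", "EQR", "MAA", "ESS", "UDR"]
real_estate_new = ["PLD", "AMT", "CCI", "EQIX", "WELL", "DLR", "SPG", "O",
                   "AVB", "EXR", "EQR", "MAA", "ESS", "UDR", "ARE"]

removed, added = diff_tickers(real_estate_old, real_estate_new)
print(removed, added)  # ['AVTR'] ['ARE']
```

The same call on the Energy and Insurance lists would report the PXD and Y replacements.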


In [15]:
# Final data collection (all sectors)
print("FINAL DATA COLLECTION")
print()

start_date = '2017-01-01'
end_date = '2024-12-31'
all_sector_data = []
collection_start_time = time.time()

# Downloading all sectors
for sector, tickers in sector_tickers.items():
    print(f"\nStarting {sector} sector ({len(tickers)} companies)...")
    sector_data = download_sector_data(tickers, sector, start_date, end_date)
    if not sector_data.empty:
        all_sector_data.append(sector_data)
        save_data(sector_data, f"{sector.lower().replace(' ', '_')}_stock_data")
        print(f" {sector} sector saved successfully!")

# Combining and saving everything
if all_sector_data:
    complete_dataset = pd.concat(all_sector_data, ignore_index=True)
    save_data(complete_dataset, "complete_stock_data")
    save_data(events_df, "climate_events") # Saving climate events too
    total_time = (time.time() - collection_start_time) / 60
    print()
    print("Complete success.")
    print(f" Total time: {total_time:.1f} minutes")
    print(f" Total records: {len(complete_dataset):,}")
    print(f" Companies: {complete_dataset['Ticker'].nunique()}")
    print(f" Date range: {complete_dataset['Date'].min()} to {complete_dataset['Date'].max()}")
    print(f" Sectors: {', '.join(complete_dataset['Sector'].unique())}")
    print()
else:
    print("Collection failed!")

FINAL DATA COLLECTION


Starting Energy sector (15 companies)...

Downloading Energy sector data...
Tickers: XOM, CVX, COP, EOG, SLB, WMB, KMI, OXY, VLO, PSX, MPC, HAL, BKR, DVN, FANG

[1/15] Downloading XOM...
Successfully downloaded XOM: 2011 records
 Added XOM to sector data
[2/15] Downloading CVX...
Successfully downloaded CVX: 2011 records
 Added CVX to sector data
[3/15] Downloading COP...
Successfully downloaded COP: 2011 records
 Added COP to sector data
[4/15] Downloading EOG...
Successfully downloaded EOG: 2011 records
 Added EOG to sector data
[5/15] Downloading SLB...
Successfully downloaded SLB: 2011 records
 Added SLB to sector data
[6/15] Downloading WMB...
Successfully downloaded WMB: 2011 records
 Added WMB to sector data
[7/15] Downloading KMI...
Successfully downloaded KMI: 2011 records
 Added KMI to sector data
[8/15] Downloading OXY...
Successfully downloaded OXY: 2011 records
 Added OXY to sector data
[9/15] Downloading VLO...
Successfully downloaded VLO: 2011 rec

## 6. Data Quality Check
Examining collected data to ensure quality and identify any potential issues before analysis.

In [19]:
print("Dataset Overview")
print()
print(f"- Shape: {complete_dataset.shape}")
print(f"- Memory usage: {complete_dataset.memory_usage(deep=True).sum() / 1024 / 1024:.1f} MB")
print(f"- Columns: {list(complete_dataset.columns)}")
print()

print("\nMissing Data Check")
print()
missing_data = complete_dataset.isnull().sum()
if missing_data.sum() > 0:
    print("Missing data found:")
    print(missing_data[missing_data > 0])
else:
    print("No missing data found.")
print()

print("\nDate Range Verification")
print()
complete_dataset['Date'] = pd.to_datetime(complete_dataset['Date'])
print(f"- First date: {complete_dataset['Date'].min()}")
print(f"- Last date: {complete_dataset['Date'].max()}")
print(f"- Total trading days: {complete_dataset['Date'].nunique():,}")
print(f"- Time span: {(complete_dataset['Date'].max() - complete_dataset['Date'].min()).days} days")
print()

print("\nSector Distribution")
print()
sector_stats = complete_dataset.groupby('Sector').agg({
    'Ticker': 'nunique',
    'Close': ['count', 'mean', 'std']
}).round(2)
sector_stats.columns = ['Companies', 'Records', 'Avg_Price', 'Price_StdDev']
print(sector_stats)
print()

print("\nCompany Verification")
print()
companies_per_sector = complete_dataset.groupby('Sector')['Ticker'].unique()
for sector, tickers in companies_per_sector.items():
    print(f"{sector}: {len(tickers)} companies")
    print(f" Tickers: {', '.join(sorted(tickers))}")
    print()

print("\nPrice Range Analysis")
print()
price_summary = complete_dataset.groupby('Sector')['Close'].agg(['min', 'max', 'mean']).round(2)
print("Price Summary by Sector:")
print(price_summary)
print()

# Checking extreme values or potential data errors
extreme_prices = complete_dataset[(complete_dataset['Close'] < 1) | (complete_dataset['Close'] > 1000)]
if len(extreme_prices) > 0:
    print(f"\nFound {len(extreme_prices)} records with extreme prices (< $1 or > $1000):")
    print(extreme_prices[['Date', 'Ticker', 'Sector', 'Close']].head())
else:
    print("\nNo extreme price values detected")
print()

print("\nVolume Analysis")
print()
volume_stats = complete_dataset.groupby('Sector')['Volume'].agg(['mean', 'median']).round(0)
print("Average Daily Volume by Sector:")
print(volume_stats)
print()
print()
print("Data Quality Summary")
print()
print(f"- Total records: {len(complete_dataset):,}")
print(f"- Companies: {complete_dataset['Ticker'].nunique()}")
print(f"- Sectors: {complete_dataset['Sector'].nunique()}")
print(f"- Date range: {complete_dataset['Date'].nunique():,} trading days")
print(f"- Data completeness: {(1 - complete_dataset.isnull().sum().sum() / complete_dataset.size) * 100:.1f}%")
print("Dataset ready for analysis.")

Dataset Overview

- Shape: (90495, 10)
- Memory usage: 15.0 MB
- Columns: ['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'Dividends', 'Stock Splits', 'Ticker', 'Sector']


Missing Data Check

No missing data found.


Date Range Verification

- First date: 2017-01-03 00:00:00-05:00
- Last date: 2024-12-30 00:00:00-05:00
- Total trading days: 2,011
- Time span: 2918 days


Sector Distribution

             Companies  Records  Avg_Price  Price_StdDev
Sector                                                  
Energy              15    30165      58.28         38.54
Insurance           15    30165     101.28         82.09
Real Estate         15    30165     141.83        138.19


Company Verification

Energy: 15 companies
 Tickers: BKR, COP, CVX, DVN, EOG, FANG, HAL, KMI, MPC, OXY, PSX, SLB, VLO, WMB, XOM

Insurance: 15 companies
 Tickers: AFG, AIG, ALL, BRK-B, CB, CINF, CNA, EG, HIG, KMPR, MET, PGR, RLI, TRV, WRB

Real Estate: 15 companies
 Tickers: AMT, ARE, AVB, CCI, DLR, EQIX, EQR, ES
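
The sector table shows equal record counts, but a per-ticker count is a more direct check that no company has a shortened history. The sketch below is illustrative only: `coverage_report` is a hypothetical helper, and the small frame is synthetic data, not output from this notebook.

```python
import pandas as pd


def coverage_report(df, expected_days=None):
    """Return tickers whose number of distinct dates is below `expected_days`.

    If `expected_days` is None, the largest per-ticker count is used.
    """
    counts = df.groupby("Ticker")["Date"].nunique()
    if expected_days is None:
        expected_days = counts.max()
    return counts[counts < expected_days]


# Synthetic example: ticker BBB is missing one of the five days.
days = pd.bdate_range("2024-01-01", periods=5)
toy = pd.concat(
    [
        pd.DataFrame({"Date": days, "Ticker": "AAA", "Close": 1.0}),
        pd.DataFrame({"Date": days[:4], "Ticker": "BBB", "Close": 2.0}),
    ],
    ignore_index=True,
)

short = coverage_report(toy)
print(list(short.index))  # ['BBB']

# Arithmetic behind the reported shape: 45 tickers x 2,011 trading days each.
print(45 * 2011)  # 90495
```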

## 7. Collection Summary
### Tasks Completed:
- Successfully collected **90,495 stock price records** from **45 companies** across 3 climate-sensitive sectors
- **100% data completeness** with no missing values
- **8 years of daily data** (2017-2024) covering 2,011 trading days
- **7 major climate events** identified for analysis
- All data saved in organized CSV format for subsequent analysis

### Data Files Created:
- `complete_stock_data.csv`: Master dataset (90,495 records)
- `energy_stock_data.csv`: Energy sector (30,165 records)
- `insurance_stock_data.csv`: Insurance sector (30,165 records) 
- `real_estate_stock_data.csv`: Real Estate sector (30,165 records)
- `climate_events.csv`: Climate events for analysis (7 events)
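
When these files are read back, the `Date` column may need care. Assuming the dates were written with their UTC offsets, as the `-05:00` suffix in the output above suggests, winter and summer rows will carry different offsets. The sketch below shows one way to parse such a column into plain calendar dates; the two rows and their `Close` values are placeholders, not data from this project.

```python
import io

import pandas as pd

# Two placeholder rows shaped like the saved CSV: winter (-05:00) and
# summer (-04:00) offsets for the same exchange timezone.
csv_text = """Date,Close,Ticker
2017-01-03 00:00:00-05:00,1.0,XOM
2017-06-01 00:00:00-04:00,2.0,XOM
"""

df = pd.read_csv(io.StringIO(csv_text))

# Parse via UTC so mixed offsets end up in a single datetime column, convert
# back to exchange-local time, then drop the timezone to get calendar dates.
df["Date"] = (
    pd.to_datetime(df["Date"], utc=True)
    .dt.tz_convert("America/New_York")
    .dt.tz_localize(None)
    .dt.normalize()
)
print(df["Date"].dt.strftime("%Y-%m-%d").tolist())  # ['2017-01-03', '2017-06-01']
```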

**Dataset ready for climate risk premium analysis.**