## Phase I Project Proposal
### Predicting Stock Sectors from Volatility Patterns

#### Name: Becky Zheng, DS 3000

### Introduction

Stock market volatility varies across sectors of the economy. Technology stocks are typically considered more volatile than utility stocks, but can this be shown through data? Understanding these volatility patterns can help investors make more informed decisions about their portfolios and how they manage risk.

For this project, I'm interested in two questions:
1. "Which sectors exhibit the highest volatility?" This could help investors weigh the risk-to-reward trade-off in each sector.
2. "Can we predict a stock's sector based on its volume and price volatility?" This could help investors understand whether volatility and volume patterns are specific to certain sectors, which could be useful for identifying sector-level trends.



### Data Collection

I plan to use the Alpha Vantage API to collect stock market data. I will gather historical price and volume data for 30 stocks across 5 sectors (Technology, Healthcare, Financial, Energy, and Consumer). For each stock, I will calculate volatility from daily price movements, along with average trading volume, average closing price, and price range. The code below saves the resulting data to a CSV file.

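import os
import requests
import pandas as pd
import numpy as np
import time

# Alpha Vantage API key, read from an environment variable so it is not stored in the notebook
# (the variable name ALPHAVANTAGE_API_KEY is an arbitrary choice)
API_KEY = os.environ['ALPHAVANTAGE_API_KEY']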

# define stocks by sector
stocks_by_sector = {
    'Technology': ['AAPL', 'MSFT', 'GOOGL', 'META', 'NVDA', 'TSLA'],
    'Healthcare': ['JNJ', 'UNH', 'PFE', 'ABBV', 'TMO', 'DHR'],
    'Financial': ['JPM', 'BAC', 'WFC', 'GS', 'MS', 'C'],
    'Energy': ['XOM', 'CVX', 'COP', 'SLB', 'EOG', 'MPC'],
    'Consumer': ['AMZN', 'WMT', 'DIS', 'TGT', 'LOW', 'COST']
}

# flatten the dictionary to get all tickers
all_stocks = []
for sector, tickers in stocks_by_sector.items():
    for ticker in tickers:
        all_stocks.append({'Ticker': ticker, 'Sector': sector})

# collect data for each stock
stock_data = []

for stock_info in all_stocks:
    ticker = stock_info['Ticker']
    sector = stock_info['Sector']

    try:
        # make API call to Alpha Vantage
        url = f'https://www.alphavantage.co/query?function=TIME_SERIES_DAILY&symbol={ticker}&outputsize=compact&apikey={API_KEY}'
        response = requests.get(url)
        data = response.json()

        if 'Time Series (Daily)' in data:
            time_series = data['Time Series (Daily)']

            # pull each series into a list, sorted oldest to newest so returns run forward in time
            dates = sorted(time_series.keys())
            closes = [float(time_series[date]['4. close']) for date in dates]
            volumes = [float(time_series[date]['5. volume']) for date in dates]
            highs = [float(time_series[date]['2. high']) for date in dates]
            lows = [float(time_series[date]['3. low']) for date in dates]

            # volatility = standard deviation of simple daily returns
            closes_array = np.array(closes)
            daily_returns = np.diff(closes_array) / closes_array[:-1]
            volatility = np.std(daily_returns)

            # avg metrics
            avg_volume = np.mean(volumes)
            avg_close = np.mean(closes)
            price_range = max(highs) - min(lows)

            stock_data.append({
                'Ticker': ticker,
                'Sector': sector,
                'Volatility': volatility,
                'Avg_Volume': avg_volume,
                'Avg_Close_Price': avg_close,
                'Price_Range': price_range
            })
            print(f"Data collected for {ticker} ({sector})")
        else:
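            # rate-limit notices may arrive under 'Note' or 'Information'
            # (the 'Information' key is an assumption about current API responses)
            message = data.get('Note') or data.get('Information') or data.get('Error Message', 'Unknown error')
            print(f"No data for {ticker}: {message}")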

        time.sleep(15)  # wait 15 seconds between calls to stay under limit (so silly)

    except Exception as e:
        print(f"Failed to collect {ticker}: {e}")

# create DataFrame
df = pd.DataFrame(stock_data)

print(f"\nSuccessfully collected data for {len(df)} stocks")
print(f"\nFirst few rows:")
print(df.head(10))

# show basic statistics
print("\n*** Data Summary ***")
print(df.describe())

# save to CSV in case code doesn't run
df.to_csv('stock_sector_data.csv', index=False)
print("\nData saved to 'stock_sector_data.csv'")

Data collected for AAPL (Technology)
Data collected for MSFT (Technology)
Data collected for GOOGL (Technology)
Data collected for META (Technology)
Data collected for NVDA (Technology)
Data collected for TSLA (Technology)
Data collected for JNJ (Healthcare)
Data collected for UNH (Healthcare)
Data collected for PFE (Healthcare)
Data collected for ABBV (Healthcare)
Data collected for TMO (Healthcare)
Data collected for DHR (Healthcare)
Data collected for JPM (Financial)
Data collected for BAC (Financial)
Data collected for WFC (Financial)
Data collected for GS (Financial)
Data collected for MS (Financial)
Data collected for C (Financial)
Data collected for XOM (Energy)
Data collected for CVX (Energy)
Data collected for COP (Energy)
Data collected for SLB (Energy)
Data collected for EOG (Energy)
Data collected for MPC (Energy)
Data collected for AMZN (Consumer)
Data collected for WMT (Consumer)
No data for DIS: Unknown error
No data for TGT: Unknown error
No data for LOW: Unknown error


### Data Usage and Remaining Issues

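I believe I hit the API request limit on the last four tickers, and I did not have time to rerun the collection. As a result, the Consumer sector is underrepresented in the saved CSV compared with the other four sectors.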
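
One possible way to fill the gap later without repeating every request is to work out which tickers are absent from the saved CSV and request only those. Below is a minimal sketch, assuming the `stocks_by_sector` dictionary from the collection code and a list of tickers that already have rows. The helper name `find_missing_tickers` is hypothetical, and the example uses a shortened sector map.

```python
def find_missing_tickers(stocks_by_sector, collected_tickers):
    """Return {sector: [tickers]} for tickers that have no collected row yet."""
    collected = set(collected_tickers)
    missing = {}
    for sector, tickers in stocks_by_sector.items():
        absent = [t for t in tickers if t not in collected]
        if absent:
            missing[sector] = absent
    return missing


# Toy example: a shortened sector map and the tickers already saved
example_sectors = {
    'Energy': ['XOM', 'CVX'],
    'Consumer': ['AMZN', 'WMT', 'DIS', 'TGT'],
}
already_collected = ['XOM', 'CVX', 'AMZN', 'WMT']

print(find_missing_tickers(example_sectors, already_collected))
```

In the notebook, `collected_tickers` could come from the `Ticker` column of `stock_sector_data.csv`, and the returned dictionary could be passed through the same collection loop.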

The numeric features that I collected are volatility, average trading volume, average closing price, and price range.
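
For reference, the volatility feature is meant to be the standard deviation of simple daily returns. A sketch of that definition follows, with $P_t$ as the closing price on trading day $t$ and $N$ as the number of daily returns (these symbols are introduced here for illustration). `np.std` with its default `ddof=0` corresponds to the $1/N$ form shown.

$$
r_t = \frac{P_t - P_{t-1}}{P_{t-1}}, \qquad
\bar{r} = \frac{1}{N}\sum_{t=1}^{N} r_t, \qquad
\sigma = \sqrt{\frac{1}{N}\sum_{t=1}^{N}\left(r_t - \bar{r}\right)^2}
$$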

The categorical feature is the sector that each stock belongs to (Technology, Healthcare, Financial, Energy, or Consumer).

For my first question about which sectors are most volatile, I will use descriptive statistics to compare average volatility across the five sectors and visualize the differences. This will help identify which sectors carry more risk. For my second question about predicting a stock's sector from its volatility and volume, I will use classification methods. Classification is a type of machine learning that predicts categories based on numeric patterns in the data. If certain sectors consistently show higher volatility or trading volumes, a classification algorithm can learn these patterns and predict which sector a stock belongs to based on those features.
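
For the first question, the sector comparison could be sketched with a pandas `groupby`. The example below uses made-up rows with the same `Sector` and `Volatility` columns as the saved CSV; the tickers and values are placeholders, not collected data.

```python
import pandas as pd

# Toy rows with the same columns as stock_sector_data.csv (values are made up)
toy = pd.DataFrame({
    'Ticker': ['AAA', 'BBB', 'CCC', 'DDD'],
    'Sector': ['Technology', 'Technology', 'Energy', 'Energy'],
    'Volatility': [0.030, 0.020, 0.010, 0.014],
})

# Mean, median, and count of volatility per sector, most volatile first
sector_summary = (
    toy.groupby('Sector')['Volatility']
       .agg(['mean', 'median', 'count'])
       .sort_values('mean', ascending=False)
)
print(sector_summary)
```

Reporting the count next to the mean makes it visible when a sector's average rests on only a few stocks.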

The data is relatively clean since it comes from a structured API, but I may need to handle any missing values and normalize the numeric features since they are on very different scales (volume is in millions while volatility is a small decimal). Once cleaned, this data can be used for both statistical analysis and machine learning classification!
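
The normalization and classification steps could be combined in a scikit-learn pipeline. The sketch below is one possible setup, not a final method: the k-nearest-neighbors classifier is an assumption, since the proposal does not yet name an algorithm, and the data is synthetic with only two sectors.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for stock_sector_data.csv (values are made up for illustration)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'Sector': ['Technology'] * 6 + ['Energy'] * 6,
    'Volatility': np.concatenate([rng.normal(0.030, 0.002, 6),
                                  rng.normal(0.012, 0.002, 6)]),
    'Avg_Volume': np.concatenate([rng.normal(5e7, 5e6, 6),
                                  rng.normal(1e7, 2e6, 6)]),
})

X = toy[['Volatility', 'Avg_Volume']]
y = toy['Sector']

# Scaling sits inside the pipeline so it is re-fit on each training fold only
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
folds = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=folds)
print(scores.mean())
```

Keeping `StandardScaler` inside the pipeline means the volume and volatility columns are put on a comparable scale without information from the held-out fold leaking into training.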