## Module 1 Homework (2025 cohort)

### Question 1: [Index] S&P 500 Stocks Added to the Index

**Which year had the highest number of additions?**

Using the list of S&P 500 companies from Wikipedia's [S&P 500 companies page](https://en.wikipedia.org/wiki/List_of_S%26P_500_companies), download the data including the year each company was added to the index.

Hint: you can use [pandas.read_html](https://pandas.pydata.org/docs/reference/api/pandas.read_html.html) to scrape the data into a DataFrame.

Steps:
1. Create a DataFrame with company tickers, names, and the year they were added.
2. Extract the year from the addition date and calculate the number of stocks added each year.
3. Which year had the highest number of additions (1957 doesn't count, as it was the year when the S&P 500 index was founded)? Write down this year as your answer (the most recent one, if you have several records).
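
A minimal sketch of steps 2 and 3 on a small hand-made table. The tickers and dates are illustrative, not the real Wikipedia data, and `additions_per_year` and `top_year_excluding` are hypothetical helper names introduced only for this example.

```python
import pandas as pd

# Illustrative rows only; the real table comes from pd.read_html on the Wikipedia page.
sample = pd.DataFrame({
    "Symbol": ["AAA", "BBB", "CCC", "DDD", "EEE", "FFF"],
    "Date added": ["1957-03-04", "1957-03-04", "1957-03-04",
                   "2016-07-01", "2017-06-19", "2017-03-01"],
})


def additions_per_year(df: pd.DataFrame) -> pd.Series:
    """Count rows per year of the 'Date added' column."""
    years = pd.to_datetime(df["Date added"]).dt.year
    return years.value_counts().sort_index()


def top_year_excluding(counts: pd.Series, excluded_year: int = 1957) -> int:
    """Return the year with the most additions, skipping excluded_year.

    Ties are resolved in favour of the most recent year.
    """
    counts = counts.drop(excluded_year, errors="ignore")
    best = counts[counts == counts.max()]
    return int(best.index.max())


counts = additions_per_year(sample)
print(counts)
print(top_year_excluding(counts))
```

Dropping the excluded year before taking the maximum keeps the tie-breaking rule (most recent year wins) in one place.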

*Context*: 
> "Following the announcement, all four new entrants saw their stock prices rise in extended trading on Friday" - recent examples of S&P 500 additions include DASH, WSM, EXE, TKO in 2025 ([Nasdaq article](https://www.nasdaq.com/articles/sp-500-reshuffle-dash-tko-expe-wsm-join-worth-buying)).

In [82]:
# IMPORTS
import numpy as np
import pandas as pd

#Fin Data Sources
import yfinance as yf
import pandas_datareader as pdr

#Data viz
import plotly.graph_objs as go
import plotly.express as px

import time
from datetime import date


In [83]:
tables = pd.read_html("https://en.wikipedia.org/wiki/List_of_S%26P_500_companies")
# Keep only the ticker and the date each company joined the index.
# .copy() avoids pandas' SettingWithCopyWarning on the assignment below.
sp500 = tables[0][['Symbol', 'Date added']].copy()
sp500['Date added'] = pd.to_datetime(sp500['Date added']).dt.year
sp500_years = sp500.groupby('Date added').count().sort_values(by='Symbol', ascending=False)

sp500_years[0:2]
# 1957 is excluded because it is the year the S&P 500 was created, so the answer is 2017

Date added,Symbol
1957,53
2017,23


*Additional*: How many current S&P 500 stocks have been in the index for more than 20 years? When stocks are added to the S&P 500, they usually experience a price bump as investors and index funds buy shares following the announcement.

In [84]:
# Years each current constituent has spent in the index, based on today's year
years_in_index = date.today().year - sp500['Date added']
# Number of stocks that have been in the index for more than 20 years
(years_in_index > 20).sum()

219

---
### Question 2. [Macro] Indexes YTD (as of 1 May 2025)

**How many indexes (out of 10) have better year-to-date returns than the US (S&P 500) as of May 1, 2025?**

Using Yahoo Finance World Indices data, compare the year-to-date (YTD) performance (1 January-1 May 2025) of major stock market indexes for the following countries:
* United States - S&P 500 (^GSPC)
* China - Shanghai Composite (000001.SS)
* Hong Kong - Hang Seng Index (^HSI)
* Australia - S&P/ASX 200 (^AXJO)
* India - Nifty 50 (^NSEI)
* Canada - S&P/TSX Composite (^GSPTSE)
* Germany - DAX (^GDAXI)
* United Kingdom - FTSE 100 (^FTSE)
* Japan - Nikkei 225 (^N225)
* Mexico - IPC Mexico (^MXX)
* Brazil - Ibovespa (^BVSP)

*Hint*: use start_date='2025-01-01' and end_date='2025-05-01' when downloading daily data in yfinance
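
The comparison can be sketched on made-up prices before touching live data. The index names, the prices, and the helper `period_return_pct` below are all hypothetical; only the first-to-last close formula mirrors the task. Note also that yfinance documents `end` as exclusive, so with `end='2025-05-01'` the last row returned should be the final trading day of April.

```python
import pandas as pd


def period_return_pct(close: pd.Series) -> float:
    """Percentage change from the first to the last close in the series."""
    first, last = close.iloc[0], close.iloc[-1]
    return float((last - first) / first * 100)


# Made-up closing prices for hypothetical indexes; not market data.
dates = pd.to_datetime(["2025-01-02", "2025-02-03", "2025-04-30"])
closes = {
    "INDEX_A": pd.Series([100.0, 104.0, 110.0], index=dates),
    "INDEX_B": pd.Series([200.0, 190.0, 195.0], index=dates),
    "BENCHMARK": pd.Series([50.0, 51.0, 49.0], index=dates),
}

returns = pd.Series({name: period_return_pct(s) for name, s in closes.items()})

# Count how many indexes did better than the benchmark over the same window.
beat_benchmark = returns.drop("BENCHMARK") > returns["BENCHMARK"]
print(returns.round(2))
print(int(beat_benchmark.sum()))
```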

Context: 
> [Global Valuations: Who's Cheap, Who's Not?](https://simplywall.st/article/beyond-the-us-global-markets-after-yet-another-tariff-update) article suggests "Other regions may be growing faster than the US and you need to diversify."

Reference: Yahoo Finance World Indices - https://finance.yahoo.com/world-indices/


In [85]:
ticker_map = {
    'S&P 500 (US)': '^GSPC',
    'Shanghai Composite (China)': '000001.SS', # Corrected ticker
    'Hang Seng (Hong Kong)': '^HSI',
    'S&P/ASX 200 (Australia)': '^AXJO',
    'Nifty 50 (India)': '^NSEI',
    'S&P/TSX Composite (Canada)': '^GSPTSE',
    'DAX (Germany)': '^GDAXI',
    'FTSE 100 (UK)': '^FTSE',
    'Nikkei 225 (Japan)': '^N225',
    'IPC Mexico (Mexico)': '^MXX',
    'Ibovespa (Brazil)': '^BVSP'
}

# Define the start and end dates for the YTD performance
start_date = '2025-01-01'
end_date = '2025-05-01'

# 1. Initialize an empty list to store results before the loop.
performance_data = []

# Fetch the data for each ticker
for name, ticker in ticker_map.items():
    try:
        print(f"Fetching data for {name} ({ticker})...")
        data = yf.Ticker(ticker).history(start=start_date, end=end_date)

        # 2. Check if the DataFrame is empty before trying to access data.
        if not data.empty:
            # 3. Use .iloc for robust, position-based indexing.
            start_price = data['Close'].iloc[0]
            end_price = data['Close'].iloc[-1]
            ytd_performance = (end_price - start_price) / start_price * 100

            # 4. Append the result dictionary to the list.
            performance_data.append({
                'Index Name': name,
                'Ticker': ticker,
                'YTD Performance (%)': ytd_performance
            })
        else:
            print(f"-> No data found for {ticker} in the specified date range.")

    except Exception as e:
        print(f"-> Error fetching data for {ticker}: {e}")

# 5. Create the DataFrame once from the list of results after the loop.
if performance_data:
    ytd_df = pd.DataFrame(performance_data)

    # Sort the DataFrame by performance in descending order
    ytd_df_sorted = ytd_df.sort_values(by='YTD Performance (%)', ascending=False)

    print("\n--- Year-to-Date Performance (1 Jan 2025 - 1 May 2025) ---")
    # Use to_string() to display the full DataFrame without truncation
    print(ytd_df_sorted.to_string(index=False))
else:
    print("\nCould not retrieve performance data for any of the tickers.")

Fetching data for S&P 500 (US) (^GSPC)...
Fetching data for Shanghai Composite (China) (000001.SS)...
Fetching data for Hang Seng (Hong Kong) (^HSI)...
Fetching data for S&P/ASX 200 (Australia) (^AXJO)...
Fetching data for Nifty 50 (India) (^NSEI)...
Fetching data for S&P/TSX Composite (Canada) (^GSPTSE)...
Fetching data for DAX (Germany) (^GDAXI)...
Fetching data for FTSE 100 (UK) (^FTSE)...
Fetching data for Nikkei 225 (Japan) (^N225)...
Fetching data for IPC Mexico (Mexico) (^MXX)...
Fetching data for Ibovespa (Brazil) (^BVSP)...

--- Year-to-Date Performance (1 Jan 2025 - 1 May 2025) ---
                Index Name    Ticker  YTD Performance (%)
       IPC Mexico (Mexico)      ^MXX            13.049444
     Hang Seng (Hong Kong)      ^HSI            12.720018
         Ibovespa (Brazil)     ^BVSP            12.438710
             DAX (Germany)    ^GDAXI            12.346378
             FTSE 100 (UK)     ^FTSE             2.842590
          Nifty 50 (India)     ^NSEI             2.49

*Additional*: How many of these indexes have better returns than the S&P 500 over 3-, 5-, and 10-year periods? Do you see the same trend?
(Note: for simplicity, ignore currency conversion effects.)
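
One way to derive the 3-, 5-, and 10-year windows from a single end date, rather than typing each start date by hand, is `pd.DateOffset`. The sketch below uses a synthetic price series (1% growth per month) purely to show the slicing; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Synthetic month-start closes that grow 1% per month; not real index data.
dates = pd.date_range("2015-05-01", "2025-05-01", freq="MS")
close = pd.Series(100.0 * 1.01 ** np.arange(len(dates)), index=dates)

end = pd.Timestamp("2025-05-01")
results = {}
for years in (3, 5, 10):
    start = end - pd.DateOffset(years=years)
    window = close.loc[start:end]  # label slicing includes both endpoints
    results[years] = float((window.iloc[-1] / window.iloc[0] - 1) * 100)

for years, pct in results.items():
    print(f"{years}-year return: {pct:.2f}%")
```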

In [86]:
ticker_map = {
    'S&P 500 (US)': '^GSPC',
    'Shanghai Composite (China)': '000001.SS', # Corrected ticker
    'Hang Seng (Hong Kong)': '^HSI',
    'S&P/ASX 200 (Australia)': '^AXJO',
    'Nifty 50 (India)': '^NSEI',
    'S&P/TSX Composite (Canada)': '^GSPTSE',
    'DAX (Germany)': '^GDAXI',
    'FTSE 100 (UK)': '^FTSE',
    'Nikkei 225 (Japan)': '^N225',
    'IPC Mexico (Mexico)': '^MXX',
    'Ibovespa (Brazil)': '^BVSP'
}

# Start dates for the 3-, 5-, and 10-year windows ending on end_date
start_list = ['2022-05-01', '2020-05-01', '2015-05-01']
end_date = '2025-05-01'

# 1. Initialize an empty list to store results before the loop.
performance_data = []

for start_date in start_list:
    # Fetch the data for each ticker
    for name, ticker in ticker_map.items():
        try:
            print(f"Fetching data for {name} ({ticker})...")
            data = yf.Ticker(ticker).history(start=start_date, end=end_date)

            # 2. Check if the DataFrame is empty before trying to access data.
            if not data.empty:
                # 3. Use .iloc for robust, position-based indexing.
                start_price = data['Close'].iloc[0]
                end_price = data['Close'].iloc[-1]
                ytd_performance = (end_price - start_price) / start_price * 100

                # 4. Append the result dictionary to the list.
                performance_data.append({
                    'Index Name': name,
                    'Ticker': ticker,
                    'YTD Performance (%)': ytd_performance
                })
            else:
                print(f"-> No data found for {ticker} in the specified date range.")

        except Exception as e:
            print(f"-> Error fetching data for {ticker}: {e}")

    # 5. Build the DataFrame for this start date once its inner loop has finished.
    if performance_data:
        ytd_df = pd.DataFrame(performance_data)
        # Sort the DataFrame by performance in descending order
        ytd_df_sorted = ytd_df.sort_values(by='YTD Performance (%)', ascending=False)
        # The 'YTD' label is kept from the previous cell; here it is the return over the whole window
        print(f"\n--- Year-to-Date Performance {int(end_date[:4]) - int(start_date[:4])} years ---")

        # Use to_string() to display the full DataFrame without truncation
        print(ytd_df_sorted.to_string(index=False))
    else:
        print("\nCould not retrieve performance data for any of the tickers.")
    # Clear the performance data for the next iteration
    performance_data.clear()


Fetching data for S&P 500 (US) (^GSPC)...
Fetching data for Shanghai Composite (China) (000001.SS)...
Fetching data for Hang Seng (Hong Kong) (^HSI)...
Fetching data for S&P/ASX 200 (Australia) (^AXJO)...
Fetching data for Nifty 50 (India) (^NSEI)...
Fetching data for S&P/TSX Composite (Canada) (^GSPTSE)...
Fetching data for DAX (Germany) (^GDAXI)...
Fetching data for FTSE 100 (UK) (^FTSE)...
Fetching data for Nikkei 225 (Japan) (^N225)...
Fetching data for IPC Mexico (Mexico) (^MXX)...
Fetching data for Ibovespa (Brazil) (^BVSP)...

--- Year-to-Date Performance 3 years ---
                Index Name    Ticker  YTD Performance (%)
             DAX (Germany)    ^GDAXI            61.395129
          Nifty 50 (India)     ^NSEI            42.562875
        Nikkei 225 (Japan)     ^N225            34.404756
              S&P 500 (US)     ^GSPC            34.020480
         Ibovespa (Brazil)     ^BVSP            26.658164
S&P/TSX Composite (Canada)   ^GSPTSE            20.053451
             

### Question 3. [Index] S&P 500 Market Corrections Analysis


**Calculate the median duration (in days) of significant market corrections in the S&P 500 index.**

For this task, define a correction as a decline in the index of **more than 5%** from its most recent all-time high.

Steps:
1. Download S&P 500 historical data (1950-present) using yfinance
2. Identify all-time high points (where price exceeds all previous prices)
3. For each pair of consecutive all-time highs, find the minimum price in between
4. Calculate drawdown percentages: (high - low) / high × 100
5. Filter for corrections with at least 5% drawdown
6. Calculate the duration in days for each correction period
7. Determine the 25th, 50th (median), and 75th percentiles for correction durations
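
The steps above can be sketched on a tiny synthetic series. This version rests on two assumptions inferred from the hint list rather than stated in the task: the low point is taken from closing prices, and the duration runs from the all-time high to the low point (the hint's 2007-10-09 to 2009-03-09 entry spans 517 days, which is peak to trough). `find_corrections` is a hypothetical helper name.

```python
import pandas as pd


def find_corrections(close: pd.Series, threshold_pct: float = 5.0) -> pd.DataFrame:
    """Drawdowns between consecutive all-time highs, measured peak to trough."""
    is_ath = close == close.cummax()
    ath_dates = close[is_ath].index
    rows = []
    for start, end in zip(ath_dates[:-1], ath_dates[1:]):
        segment = close.loc[start:end]  # both all-time highs are included
        trough_date = segment.idxmin()
        drawdown = (close.loc[start] - segment.min()) / close.loc[start] * 100
        if drawdown >= threshold_pct:
            rows.append({
                "start": start,
                "trough": trough_date,
                "drawdown_pct": drawdown,
                "duration_days": (trough_date - start).days,
            })
    return pd.DataFrame(rows)


# Synthetic daily closes with one drop of about 11.8% and one of about 2.9%.
dates = pd.date_range("2024-01-01", periods=9, freq="D")
close = pd.Series([100, 102, 95, 90, 97, 103, 104, 101, 105], index=dates, dtype=float)

corrections = find_corrections(close)
print(corrections)
print(corrections["duration_days"].quantile([0.25, 0.5, 0.75]))
```

Only the first drop passes the 5% filter here, so all three percentiles equal its duration.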

*Context:* 
> * Investors often wonder about the typical length of market corrections when deciding "when to buy the dip" ([Reddit discussion](https://www.reddit.com/r/investing/comments/1jrqnte/when_are_you_buying_the_dip/?rdt=64135)).

> * [A Wealth of Common Sense - How Often Should You Expect a Stock Market Correction?](https://awealthofcommonsense.com/2022/01/how-often-should-you-expect-a-stock-market-correction/)

*Hint (use this data to compare with your results)*: Here is the list of top 10 largest corrections by drawdown:
* 2007-10-09 to 2009-03-09: 56.8% drawdown over 517 days
* 2000-03-24 to 2002-10-09: 49.1% drawdown over 929 days
* 1973-01-11 to 1974-10-03: 48.2% drawdown over 630 days
* 1968-11-29 to 1970-05-26: 36.1% drawdown over 543 days
* 2020-02-19 to 2020-03-23: 33.9% drawdown over 33 days
* 1987-08-25 to 1987-12-04: 33.5% drawdown over 101 days
* 1961-12-12 to 1962-06-26: 28.0% drawdown over 196 days
* 1980-11-28 to 1982-08-12: 27.1% drawdown over 622 days
* 2022-01-03 to 2022-10-12: 25.4% drawdown over 282 days
* 1966-02-09 to 1966-10-07: 22.2% drawdown over 240 days

---


In [87]:
# 1. Download S&P 500 historical data (1950-present) using yfinance
start_date = '1950-01-01'
end_date = date.today()

snp500 = yf.Ticker('^GSPC').history(start=start_date, end=end_date)

# 2. Identify all-time high points (where price exceeds all previous prices)
snp500['All Time High'] = snp500['Close'].cummax()
snp500['Is_ATH'] = snp500['Close'] >= snp500['All Time High']
all_time_highs = snp500[snp500['Is_ATH']].copy()
all_time_highs

# 3. For each pair of consecutive all-time highs, find the minimum price in between
corrections = []
for i in range(len(all_time_highs) - 1):
    # Consecutive all-time highs that bound the period
    start_ath_date = all_time_highs.index[i]
    end_ath_date = all_time_highs.index[i + 1]
    start_high_price = all_time_highs['All Time High'].iloc[i]
    period_between_highs = snp500[start_ath_date:end_ath_date]
    # Note: the low point is taken from the daily 'High' column here, not from 'Close'
    min_price_in_period = period_between_highs['High'].min()
    min_price_date = period_between_highs['High'].idxmin()

    # 4. Calculate drawdown percentages: (high - low) / high × 100
    drawdown = (start_high_price - min_price_in_period) / start_high_price * 100

    corrections.append({
        'Start_Date': start_ath_date,
        'Start_High': start_high_price,
        'Min_Price_Date': min_price_date,
        'Min_Price': min_price_in_period,
        'End_Date': end_ath_date,
        'Drawdown_Percent': drawdown
    })

corrections_df = pd.DataFrame(corrections)

corrections_df
# 5. Filter for corrections with at least 5% drawdown
significant_corrections = corrections_df[corrections_df['Drawdown_Percent'] >= 5].copy()
# 6. Calculate the duration in days for each correction period
significant_corrections['Duration'] = (significant_corrections['End_Date'] - significant_corrections['Start_Date']).dt.days

# Check against the top 10 corrections listed in the hint.
# The values differ from the hint for two likely reasons: the minimum above is taken
# from 'High' rather than 'Close', and Duration runs from the starting all-time high
# to the next all-time high (End_Date) instead of to the low point (Min_Price_Date).
significant_corrections_check = significant_corrections.sort_values(by='Drawdown_Percent', ascending=False).head(10)
for index, row in significant_corrections_check.iterrows():
    start_date = row['Start_Date'].strftime('%Y-%m-%d')
    end_date = row['End_Date'].strftime('%Y-%m-%d')
    drawdown_percent = row['Drawdown_Percent']
    duration_days = (row['End_Date'] - row['Start_Date']).days
    print(f"{start_date} to {end_date}: {drawdown_percent:.1f}% drawdown over {duration_days} days")

# 7. Determine the 25th, 50th (median), and 75th percentiles for correction durations

percentiles = significant_corrections['Duration'].quantile([0.25, 0.5,0.75]).to_dict()
print("\n--- Correction Durations Percentiles ---")
print(f"25th Percentile: {percentiles[0.25]} days")
print(f"50th Percentile (Median): {percentiles[0.5]} days")
print(f"75th Percentile: {percentiles[0.75]} days")


2007-10-09 to 2013-03-28: 55.6% drawdown over 1997 days
2000-03-24 to 2007-05-30: 47.7% drawdown over 2622 days
1973-01-11 to 1980-07-17: 47.4% drawdown over 2743 days
1968-11-29 to 1972-03-06: 34.3% drawdown over 1193 days
1987-08-25 to 1989-07-26: 33.0% drawdown over 701 days
2020-02-19 to 2020-08-18: 32.1% drawdown over 180 days
1961-12-12 to 1963-09-03: 27.3% drawdown over 629 days
1980-11-28 to 1982-11-03: 26.7% drawdown over 705 days
2022-01-03 to 2024-01-19: 24.8% drawdown over 746 days
1956-08-03 to 1958-09-24: 21.5% drawdown over 782 days

--- Correction Durations Percentiles ---
25th Percentile: 70.5 days
50th Percentile (Median): 116.5 days
75th Percentile: 340.75 days


### Question 4.  [Stocks] Earnings Surprise Analysis for Amazon (AMZN)


**Calculate the median 2-day percentage change in the stock price following positive earnings surprises.**

Steps:
1. Load earnings data from CSV ([ha1_Amazon.csv](ha1_Amazon.csv)) containing earnings dates, EPS estimates, and actual EPS. Make sure you read the file with the correct delimiter, for example `pandas.read_csv("ha1_Amazon.csv", delimiter=';')`.
2. Download complete historical price data using yfinance
3. Calculate 2-day percentage changes for all historical dates: for each sequence of 3 consecutive trading days (Day 1, Day 2, Day 3), compute the *return* as Close_Day3 / Close_Day1 - 1. (Assume Day 2 may correspond to the earnings announcement.)
4. Identify positive earnings surprises (where "actual EPS > estimated EPS"). Both fields should be present in the file. You should obtain 36 data points for use in the descriptive analysis (median) later. 
5. Calculate 2-day percentage changes following positive earnings surprises. Express the answer as a percentage rounded to two decimal places: *return* * 100.0
6. (Optional) Compare the median 2-day percentage change for positive surprises vs. all historical dates. Do you see the difference?
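
A small sketch of one reading of steps 3 to 5, in which the earnings date is Day 2 and the return Close_Day3 / Close_Day1 - 1 is stored on the Day 2 row. The prices and EPS figures are made up, and `ret_2d_pct` is a hypothetical column name; the earnings column names follow the ones used in the homework CSV.

```python
import pandas as pd

# Made-up closes for six consecutive trading days; not AMZN data.
prices = pd.DataFrame({
    "Date": pd.to_datetime(["2024-01-02", "2024-01-03", "2024-01-04",
                            "2024-01-05", "2024-01-08", "2024-01-09"]),
    "Close": [100.0, 102.0, 105.0, 103.0, 108.0, 110.0],
})

# Return over Day 1 -> Day 3, stored on the Day 2 row:
# Close(next trading day) / Close(previous trading day) - 1, in percent.
prices["ret_2d_pct"] = (prices["Close"].shift(-1) / prices["Close"].shift(1) - 1) * 100

# Hypothetical earnings rows; a positive surprise is Reported EPS > EPS Estimate.
earnings = pd.DataFrame({
    "Earnings Date": pd.to_datetime(["2024-01-03", "2024-01-05", "2024-01-08"]),
    "EPS Estimate": [1.00, 0.50, 0.80],
    "Reported EPS": [1.10, 0.40, 0.95],
})
positive = earnings[earnings["Reported EPS"] > earnings["EPS Estimate"]]

# Attach the return of the matching trading day to each positive surprise.
merged = positive.merge(prices, left_on="Earnings Date", right_on="Date", how="left")
print(merged[["Earnings Date", "ret_2d_pct"]].round(2))
print(round(merged["ret_2d_pct"].median(), 2))
```

Using `shift(-1)` and `shift(1)` centres the window on the earnings date; `pct_change(periods=2)` alone would place the same return on the Day 3 row instead.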

Context: Earnings announcements, especially when they exceed analyst expectations, can significantly impact stock prices in the short term.

Reference: Yahoo Finance earnings calendar - https://finance.yahoo.com/calendar/earnings?symbol=AMZN

In [88]:

# 1. Load earnings data from CSV 

# Path is relative to the notebook's folder; adjust it if the CSV lives elsewhere
earnings_data = pd.read_csv("ha1_Amazon.csv", delimiter=';')
earnings_data.head(50)

# 2. Download complete historical price data using yfinance
ticker = 'AMZN'
start_date = '1997-05-15'
end_date = date.today()
historical_prices = yf.Ticker(ticker).history(start=start_date, end=end_date)

# 3. Calculate 2-day percentage changes for all historical dates: for each sequence of 3 consecutive trading days (Day 1, Day 2, Day 3), 
# compute the *return* as Close_Day3 / Close_Day1 - 1. (Assume Day 2 may correspond to the earnings announcement.)
historical_prices['2-Day Return'] = historical_prices['Close'].pct_change(periods=2) * 100.0
historical_prices.index = historical_prices.index.strftime('%Y-%m-%d')
historical_prices.reset_index(inplace=True)
historical_prices['Date'] = pd.to_datetime(historical_prices['Date'])

# 4. Identify positive earnings surprises (where "actual EPS > estimated EPS"). 
# Both fields should be present in the file. You should obtain 36 data points for use in the descriptive analysis (median) later. 
columns_to_clean = ['EPS Estimate', 'Reported EPS']

for col in columns_to_clean:
    # Keep only digits, decimal points, and minus signs, then drop rows without a usable number
    earnings_data[col] = earnings_data[col].astype(str).str.replace(r'[^0-9.-]', '', regex=True)
    earnings_data[col] = earnings_data[col].replace(['', '-', '.'], np.nan)
    earnings_data = earnings_data[earnings_data[col].notna()].copy()
    earnings_data[col] = pd.to_numeric(earnings_data[col])
earnings_data['Positive Surprise'] = earnings_data['Reported EPS'] > earnings_data['EPS Estimate']
earnings_data_positive = earnings_data[earnings_data['Positive Surprise']].copy()
earnings_data_positive.reset_index(drop=True, inplace=True)
historical_prices

Unnamed: 0,Date,Open,High,Low,Close,Volume,Dividends,Stock Splits,2-Day Return
0,1997-05-15,0.121875,0.125000,0.096354,0.097917,1443120000,0.0,0.0,
1,1997-05-16,0.098438,0.098958,0.085417,0.086458,294000000,0.0,0.0,
2,1997-05-19,0.088021,0.088542,0.081250,0.085417,122136000,0.0,0.0,-12.765910
3,1997-05-20,0.086458,0.087500,0.081771,0.081771,109344000,0.0,0.0,-5.421125
4,1997-05-21,0.081771,0.082292,0.068750,0.071354,377064000,0.0,0.0,-16.463936
...,...,...,...,...,...,...,...,...,...
7065,2025-06-16,212.309998,217.059998,211.600006,216.100006,33284200,0.0,0.0,1.341212
7066,2025-06-17,215.199997,217.410004,214.559998,214.820007,32086300,0.0,0.0,1.282414
7067,2025-06-18,215.089996,217.960007,212.339996,212.520004,44360500,0.0,0.0,-1.656641
7068,2025-06-20,214.679993,214.889999,208.270004,209.690002,75297800,0.0,0.0,-2.388048


In [96]:
# 5. Calculate 2-day percentage changes following positive earnings surprises. 
# Show your answer in % (closest number to the 2nd digit): *return* * 100.0

earnings_data_positive['Earnings Date'] = pd.to_datetime(earnings_data_positive['Earnings Date'])
earnings_data_positive['Earnings Date'] = earnings_data_positive['Earnings Date'].dt.date
earnings_data_positive['Earnings Date'] = earnings_data_positive['Earnings Date'].astype(str)
historical_prices['Date'] = historical_prices['Date'].astype(str)
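# Attach the 2-day return to each positive-surprise row by matching on the date.
# Note: pct_change(periods=2) on date t equals Close_t / Close_(t-2) - 1, so this join
# treats the earnings date as Day 3; to treat it as Day 2, shift the return column by -1 first.

earnings_compiled = earnings_data_positive.merge(historical_prices[['Date', '2-Day Return']],
                                                 left_on='Earnings Date', right_on='Date', how='left')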

earnings_compiled
earnings_compiled['2-Day Return'] = earnings_compiled['2-Day Return'].astype(float)
earnings_compiled['2-Day Return'] = earnings_compiled['2-Day Return'].round(2)
# Median 2-day percentage change in AMZN stock price following positive surprise earnings days
earnings_compiled['2-Day Return'].median()
# earnings_compiled

0.85

In [90]:

# 6. (Optional) Compare the median 2-day percentage change for positive surprises vs. all historical dates. Do you see the difference?
