<a href="https://colab.research.google.com/github/YannPhamVan/stock-markets-analytics-zoomcamp/blob/main/cohorts/2025/01-intro-and-data-sources-homework.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Module 1 Homework (2025 cohort)

In this homework, we will download financial data from several sources and run simple calculations and analyses on it.

---
### Question 1: [Index] S&P 500 Stocks Added to the Index

**Which year had the highest number of additions?**

Using the list of S&P 500 companies from Wikipedia's [S&P 500 companies page](https://en.wikipedia.org/wiki/List_of_S%26P_500_companies), download the data including the year each company was added to the index.

Hint: you can use [pandas.read_html](https://pandas.pydata.org/docs/reference/api/pandas.read_html.html) to scrape the data into a DataFrame.

Steps:
1. Create a DataFrame with company tickers, names, and the year they were added.
2. Extract the year from the addition date and calculate the number of stocks added each year.
3. Which year had the highest number of additions (1957 doesn't count, as it was the year when the S&P 500 index was founded)? Write down this year as your answer (the most recent one, if you have several records).

*Context*:
> "Following the announcement, all four new entrants saw their stock prices rise in extended trading on Friday" - recent examples of S&P 500 additions include DASH, WSM, EXE, TKO in 2025 ([Nasdaq article](https://www.nasdaq.com/articles/sp-500-reshuffle-dash-tko-expe-wsm-join-worth-buying)).

*Additional*: How many current S&P 500 stocks have been in the index for more than 20 years? When stocks are added to the S&P 500, they usually experience a price bump as investors and index funds buy shares following the announcement.


In [1]:
# IMPORTS
import numpy as np
import pandas as pd
from datetime import date, timedelta
import yfinance as yf

In [2]:
df = pd.read_html("https://en.wikipedia.org/wiki/List_of_S%26P_500_companies")[0][["Symbol", "Security", "Date added"]]
df

Unnamed: 0,Symbol,Security,Date added
0,MMM,3M,1957-03-04
1,AOS,A. O. Smith,2017-07-26
2,ABT,Abbott Laboratories,1957-03-04
3,ABBV,AbbVie,2012-12-31
4,ACN,Accenture,2011-07-06
...,...,...,...
498,XYL,Xylem Inc.,2011-11-01
499,YUM,Yum! Brands,1997-10-06
500,ZBRA,Zebra Technologies,2019-12-23
501,ZBH,Zimmer Biomet,2001-08-07


In [3]:
df["Date added"] = pd.to_datetime(df["Date added"])
df["Year"] = df["Date added"].dt.year
df

Unnamed: 0,Symbol,Security,Date added,Year
0,MMM,3M,1957-03-04,1957
1,AOS,A. O. Smith,2017-07-26,2017
2,ABT,Abbott Laboratories,1957-03-04,1957
3,ABBV,AbbVie,2012-12-31,2012
4,ACN,Accenture,2011-07-06,2011
...,...,...,...,...
498,XYL,Xylem Inc.,2011-11-01,2011
499,YUM,Yum! Brands,1997-10-06,1997
500,ZBRA,Zebra Technologies,2019-12-23,2019
501,ZBH,Zimmer Biomet,2001-08-07,2001


In [4]:
df.groupby("Year").count().sort_values(by="Symbol", ascending=False)

Year,Symbol,Security,Date added
1957,53,53,53
2016,23,23,23
2017,23,23,23
2019,22,22,22
2008,17,17,17
2024,16,16,16
2022,16,16,16
2023,15,15,15
2021,15,15,15
2015,14,14,14
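
The answer can also be picked programmatically instead of being read off the sorted table. The sketch below is a minimal illustration on a hypothetical toy frame (the `toy` data is made up; only the `Year` column mirrors the notebook). It drops 1957, counts additions per year, and breaks ties by taking the most recent year, as the question asks.

```python
import pandas as pd

# Hypothetical toy data standing in for the scraped Wikipedia table
toy = pd.DataFrame({
    "Symbol": ["A", "B", "C", "D", "E", "F", "G"],
    "Year": [1957, 1957, 1957, 2016, 2016, 2017, 2017],
})

# Count additions per year, ignoring 1957 (the year the index was founded)
counts = toy.loc[toy["Year"] != 1957, "Year"].value_counts()

# Several years can share the maximum; keep the most recent one
top_year = counts[counts == counts.max()].index.max()
print(top_year)
```

Applied to the counts shown above, 2016 and 2017 tie at 23 additions, so the same rule selects 2017.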


In [5]:
df.loc[(date.today().year - df['Year']) > 20].count()

Unnamed: 0,0
Symbol,219
Security,219
Date added,219
Year,219
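
The year-difference test above is a coarse approximation, since it ignores the month and day of each addition. A possible refinement, sketched here on made-up dates, compares the full `Date added` value against a cutoff date. The helper name `in_index_longer_than` and its `as_of` parameter are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

def in_index_longer_than(dates_added: pd.Series, years: int, as_of: str) -> int:
    """Count entries added strictly before `as_of` minus `years` calendar years."""
    cutoff = pd.Timestamp(as_of) - pd.DateOffset(years=years)
    return int((pd.to_datetime(dates_added) < cutoff).sum())

# Hypothetical sample of "Date added" values
sample = pd.Series(["1957-03-04", "2005-06-01", "2005-05-01", "2017-07-26"])

# Cutoff is 2005-05-28, so only the first and third dates qualify
n_long_standing = in_index_longer_than(sample, years=20, as_of="2025-05-28")
print(n_long_standing)
```

The two approaches can disagree for companies added during the cutoff year, so the count of 219 above should be read as approximate.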



---
### Question 2. [Macro] Indexes YTD (as of 1 May 2025)

**How many indexes (out of 10) have better year-to-date returns than the US (S&P 500) as of May 1, 2025?**

Using Yahoo Finance World Indices data, compare the year-to-date (YTD) performance (1 January-1 May 2025) of major stock market indexes for the following countries:
* United States - S&P 500 (^GSPC)
* China - Shanghai Composite (000001.SS)
* Hong Kong - HANG SENG INDEX (^HSI)
* Australia - S&P/ASX 200 (^AXJO)
* India - Nifty 50 (^NSEI)
* Canada - S&P/TSX Composite (^GSPTSE)
* Germany - DAX (^GDAXI)
* United Kingdom - FTSE 100 (^FTSE)
* Japan - Nikkei 225 (^N225)
* Mexico - IPC Mexico (^MXX)
* Brazil - Ibovespa (^BVSP)

*Hint*: use start_date='2025-01-01' and end_date='2025-05-01' when downloading daily data in yfinance

Context:
> [Global Valuations: Who's Cheap, Who's Not?](https://simplywall.st/article/beyond-the-us-global-markets-after-yet-another-tariff-update) article suggests "Other regions may be growing faster than the US and you need to diversify."

Reference: Yahoo Finance World Indices - https://finance.yahoo.com/world-indices/




In [6]:
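# Start a few days before 1 January so the last 2024 close can be forward-filled into 2025-01-01.
# yfinance treats `end` as exclusive, so '2025-05-02' keeps the 1 May 2025 close.
start_date = '2024-12-30'
end_date = '2025-05-02'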
tickers = ["^GSPC", "000001.SS", "^HSI", "^AXJO", "^NSEI", "^GSPTSE", "^GDAXI", "^FTSE", "^N225", "^MXX", "^BVSP"]

In [7]:
data = yf.download(tickers, start=start_date, end=end_date)["Close"]
index_values = data.T
# Forward-fill holiday gaps (at most 2 days); fillna(method="ffill") is deprecated in recent pandas
index_values = index_values.ffill(axis=1, limit=2)
index_values

YF.download() has changed argument auto_adjust default to True


[*********************100%***********************]  11 of 11 completed


Ticker,2024-12-30,2024-12-31,2025-01-01,2025-01-02,2025-01-03,2025-01-06,2025-01-07,2025-01-08,2025-01-09,2025-01-10,...,2025-04-18,2025-04-21,2025-04-22,2025-04-23,2025-04-24,2025-04-25,2025-04-28,2025-04-29,2025-04-30,2025-05-01
000001.SS,3407.325928,3351.762939,3351.762939,3262.561035,3211.429932,3206.923096,3229.644043,3230.167969,3211.393066,3168.523926,...,3276.72998,3291.434082,3299.75708,3296.35498,3297.288086,3295.060059,3288.415039,3286.655029,3279.031006,3279.031006
^AXJO,8235.0,8159.100098,8159.100098,8201.200195,8250.5,8288.5,8285.099609,8349.099609,8329.200195,8294.099609,...,7819.100098,7819.100098,7816.700195,7920.5,7968.200195,7968.200195,7997.100098,8070.600098,8126.200195,8145.600098
^BVSP,120283.0,120283.0,120283.0,120125.0,118533.0,120022.0,121163.0,119625.0,119781.0,118856.0,...,129650.0,129650.0,130464.0,132216.0,134580.0,134739.0,135016.0,135093.0,135067.0,135067.0
^FTSE,8121.0,8173.0,8173.0,8260.099609,8224.0,8249.700195,8245.299805,8251.0,8319.700195,8248.5,...,8275.700195,8275.700195,8328.599609,8403.200195,8407.400391,8415.299805,8417.299805,8463.5,8494.900391,8496.799805
^GDAXI,19909.140625,19909.140625,19909.140625,20024.660156,19906.080078,20216.189453,20340.570312,20329.939453,20317.099609,20214.789062,...,21205.859375,21205.859375,21293.529297,21961.970703,22064.509766,22242.449219,22271.669922,22425.830078,22496.980469,22496.980469
^GSPC,5906.939941,5881.629883,5881.629883,5868.549805,5942.470215,5975.379883,5909.029785,5918.25,5918.25,5827.040039,...,5282.700195,5158.200195,5287.759766,5375.859863,5484.77002,5525.209961,5528.75,5560.830078,5569.060059,5604.140137
^GSPTSE,24620.599609,24727.900391,24727.900391,24898.0,25073.5,24999.800781,24929.900391,25051.699219,25073.400391,24767.699219,...,24192.800781,24008.900391,24306.0,24472.699219,24727.5,24710.5,24798.599609,24874.5,24841.699219,24795.599609
^HSI,20041.419922,20059.949219,20059.949219,19623.320312,19760.269531,19688.289062,19447.580078,19279.839844,19240.890625,19064.289062,...,21395.140625,21395.140625,21562.320312,22072.619141,21909.759766,21980.740234,21971.960938,22008.109375,22119.410156,22119.410156
^MXX,48837.71875,49513.269531,49513.269531,49765.199219,48957.238281,49493.558594,50085.5,49634.261719,49807.960938,49596.699219,...,53018.570312,53758.75,54777.839844,55766.578125,56382.0,56720.121094,56980.128906,55613.429688,56259.28125,56259.28125
^N225,39894.539062,39894.539062,39894.539062,,,39307.050781,40083.300781,39981.058594,39605.089844,39190.398438,...,34730.28125,34279.921875,34220.601562,34868.628906,35039.148438,35705.738281,35839.988281,35839.988281,36045.378906,36452.300781


In [8]:
index_values["ytd_2025-05-01"] = index_values["2025-05-01"] / index_values["2025-01-01"] - 1

In [9]:
# Single .loc lookup (row, column) instead of chained indexing
snp500_ytd_2025_05_01 = index_values.loc["^GSPC", "ytd_2025-05-01"]

In [10]:
index_values["ytd_2025-05-01"].sort_values(ascending=False)

Ticker,ytd_2025-05-01
^MXX,0.136247
^GDAXI,0.129982
^BVSP,0.12291
^HSI,0.102665
^FTSE,0.039618
^NSEI,0.024904
^GSPTSE,0.002738
^AXJO,-0.001655
000001.SS,-0.0217
^GSPC,-0.047179


In [11]:
index_values.loc[index_values["ytd_2025-05-01"] > snp500_ytd_2025_05_01]["ytd_2025-05-01"].count()

np.int64(9)
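
The YTD calculation above can be condensed into a small helper. The sketch below runs on hypothetical toy prices (the `INDEX_A` and `INDEX_B` columns and their values are made up) and assumes the base price is the last close before the year starts, which is what forward-filling into 2025-01-01 achieves in the notebook. The function name `ytd_returns` is an illustrative choice.

```python
import numpy as np
import pandas as pd

def ytd_returns(close: pd.DataFrame, year_start: str, as_of: str) -> pd.Series:
    """Return per column: last close on or before `as_of` vs. last close before `year_start`."""
    filled = close.sort_index().ffill()  # carry the last known close over holiday gaps
    base = filled.loc[: pd.Timestamp(year_start) - pd.Timedelta(days=1)].iloc[-1]
    last = filled.loc[: pd.Timestamp(as_of)].iloc[-1]
    return (last / base - 1).sort_values(ascending=False)

# Hypothetical closes for two made-up indexes, with holiday gaps as NaN
toy = pd.DataFrame(
    {
        "INDEX_A": [100.0, 100.0, 101.0, 109.0, 110.0],
        "INDEX_B": [200.0, np.nan, 202.0, 190.0, np.nan],
    },
    index=pd.to_datetime(
        ["2024-12-30", "2024-12-31", "2025-01-02", "2025-04-30", "2025-05-01"]
    ),
)

ytd = ytd_returns(toy, year_start="2025-01-01", as_of="2025-05-01")
# Number of indexes that beat the benchmark column
n_better = int((ytd > ytd["INDEX_B"]).sum())
print(ytd, n_better)
```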

*Additional*: How many of these indexes have better returns than the S&P 500 over 3, 5, and 10 year periods? Do you see the same trend?
(Note: for simplicity, ignore currency conversion effects.)

In [12]:
def sort_returns(tickers: list, years_before: int) -> pd.Series:
    """Return each ticker's total return over the last `years_before` years, sorted descending."""
    end_date = date.today()
    # Add a few extra days so the window starts on or before the exact start date
    start_date = end_date - timedelta(days=years_before * 365 + 5)
    data = yf.download(tickers, start=start_date, end=end_date)["Close"]
    # Forward-fill holiday gaps (up to 4 days); fillna(method="ffill") is deprecated in recent pandas
    index_values = data.T.ffill(axis=1, limit=4)

    # First and last dates available in the download
    start_price_date = index_values.columns[0]
    end_price_date = index_values.columns[-1]

    # Caveat: forward-filling cannot fill the first column, so a ticker with no
    # quote on the first date ends up with a NaN return
    returns = index_values[end_price_date] / index_values[start_price_date] - 1
    return returns.rename(f"return_{years_before}_years").sort_values(ascending=False)

In [13]:
sort_returns(tickers, 3)

[*********************100%***********************]  11 of 11 completed


Ticker,return_3_years
^GDAXI,0.74044
^NSEI,0.539595
^GSPC,0.502365
^N225,0.410345
^GSPTSE,0.29492
^BVSP,0.261889
^AXJO,0.179385
^FTSE,0.172853
^HSI,0.162583
^MXX,0.14209


In [14]:
sort_returns(tickers, 5)

[*********************100%***********************]  11 of 11 completed


Ticker,return_5_years
^GDAXI,1.126757
^N225,0.818761
^GSPTSE,0.742508
^MXX,0.635202
^BVSP,0.628953
^AXJO,0.497186
000001.SS,0.185494
^HSI,0.018724
^FTSE,
^GSPC,
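
The empty values for `^FTSE` and `^GSPC` are consistent with those tickers having no quote on the first date of the 5-year window: forward-filling cannot fill a leading gap, so the start price stays NaN. One possible workaround, sketched below on hypothetical toy data, uses each ticker's first and last available close instead. The function name `period_returns` and the `X`/`Y` columns are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def period_returns(close: pd.DataFrame) -> pd.Series:
    """Return per column from its first available close to its last available close."""
    first = close.bfill().iloc[0]   # back-fill so a leading NaN takes the next real close
    last = close.ffill().iloc[-1]   # forward-fill so a trailing NaN takes the previous real close
    return (last / first - 1).sort_values(ascending=False)

# Hypothetical closes: X has no quote on the first date, Y none on the last
toy = pd.DataFrame(
    {"X": [np.nan, 50.0, 75.0], "Y": [10.0, 11.0, np.nan]},
    index=pd.to_datetime(["2020-05-25", "2020-05-26", "2025-05-27"]),
)

returns = period_returns(toy)
print(returns)
```

The trade-off is that tickers are then compared over slightly different start dates, which is usually a matter of a day or two for holiday gaps.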


In [15]:
sort_returns(tickers, 10)

[*********************100%***********************]  11 of 11 completed


Ticker,return_10_years
^NSEI,1.976995
^GSPC,1.814153
^BVSP,1.601921
^GDAXI,1.083976
^N225,0.84583
^GSPTSE,0.745356
^AXJO,0.456265
^MXX,0.319639
^FTSE,0.263218
^HSI,-0.172315



---
### Question 3. [Index] S&P 500 Market Corrections Analysis


**Calculate the median duration (in days) of significant market corrections in the S&P 500 index.**

For this task, define a correction as an event in which the index falls by **more than 5%** from its most recent all-time high.

Steps:
1. Download S&P 500 historical data (1950-present) using yfinance
2. Identify all-time high points (where price exceeds all previous prices)
3. For each pair of consecutive all-time highs, find the minimum price in between
4. Calculate drawdown percentages: (high - low) / high × 100
5. Filter for corrections with at least 5% drawdown
6. Calculate the duration in days for each correction period
7. Determine the 25th, 50th (median), and 75th percentiles for correction durations
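
The steps above can be sketched on a small synthetic series before running them on real data. This is a minimal illustration, not the reference solution: the function name `find_corrections`, the toy closes, and the choice to treat any close equal to the running maximum as an all-time high are assumptions.

```python
import pandas as pd

def find_corrections(close: pd.Series, threshold_pct: float = 5.0) -> pd.DataFrame:
    """List drawdowns of at least `threshold_pct` between consecutive all-time highs."""
    # Step 2: a close is an all-time high when it equals the running maximum
    high_dates = close.index[close == close.cummax()]
    rows = []
    # Step 3: walk over each pair of consecutive highs
    for start, end in zip(high_dates[:-1], high_dates[1:]):
        window = close.loc[start:end]
        low_date = window.idxmin()
        # Step 4: drawdown in percent, measured from the earlier high
        drawdown_pct = (close.loc[start] - window.min()) / close.loc[start] * 100
        # Step 5: keep only drawdowns at or above the threshold
        if drawdown_pct >= threshold_pct:
            rows.append({
                "high_date": start,
                "low_date": low_date,
                "drawdown_pct": drawdown_pct,
                # Step 6: calendar days from the high to the low
                "duration_days": (low_date - start).days,
            })
    return pd.DataFrame(rows)

# Hypothetical closes: one 10% dip (105 -> 94.5) and one small dip (106 -> 104)
toy_close = pd.Series(
    [100.0, 105.0, 98.0, 94.5, 106.0, 104.0, 107.0],
    index=pd.date_range("2020-01-01", periods=7, freq="D"),
)

corrections = find_corrections(toy_close)
# Step 7: 25th, 50th and 75th percentiles of the durations
duration_quantiles = corrections["duration_days"].quantile([0.25, 0.5, 0.75])
print(corrections)
print(duration_quantiles)
```

Only the 10% dip passes the filter here; the dip from 106 to 104 is below the threshold and is dropped.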

*Context:*
> * Investors often wonder about the typical length of market corrections when deciding "when to buy the dip" ([Reddit discussion](https://www.reddit.com/r/investing/comments/1jrqnte/when_are_you_buying_the_dip/?rdt=64135)).

> * [A Wealth of Common Sense - How Often Should You Expect a Stock Market Correction?](https://awealthofcommonsense.com/2022/01/how-often-should-you-expect-a-stock-market-correction/)

*Hint (use this data to compare with your results)*: Here is the list of top 10 largest corrections by drawdown:
* 2007-10-09 to 2009-03-09: 56.8% drawdown over 517 days
* 2000-03-24 to 2002-10-09: 49.1% drawdown over 929 days
* 1973-01-11 to 1974-10-03: 48.2% drawdown over 630 days
* 1968-11-29 to 1970-05-26: 36.1% drawdown over 543 days
* 2020-02-19 to 2020-03-23: 33.9% drawdown over 33 days
* 1987-08-25 to 1987-12-04: 33.5% drawdown over 101 days
* 1961-12-12 to 1962-06-26: 28.0% drawdown over 196 days
* 1980-11-28 to 1982-08-12: 27.1% drawdown over 622 days
* 2022-01-03 to 2022-10-12: 25.4% drawdown over 282 days
* 1966-02-09 to 1966-10-07: 22.2% drawdown over 240 days


In [16]:
ticker_obj = yf.Ticker("^GSPC")
snp500_daily = ticker_obj.history(start = '1950-01-01')
snp500_daily = snp500_daily[["High", "Low", "Close"]]
snp500_daily.index = snp500_daily.index.date
snp500_daily

Unnamed: 0,High,Low,Close
1950-01-03,16.660000,16.660000,16.660000
1950-01-04,16.850000,16.850000,16.850000
1950-01-05,16.930000,16.930000,16.930000
1950-01-06,16.980000,16.980000,16.980000
1950-01-09,17.080000,17.080000,17.080000
...,...,...,...
2025-05-21,5938.370117,5830.910156,5844.609863
2025-05-22,5878.080078,5825.819824,5842.009766
2025-05-23,5829.509766,5767.410156,5802.819824
2025-05-27,5924.330078,5854.069824,5921.540039


In [17]:
# Running all-time high, based on closing prices
# (alternative based on intraday highs: snp500_daily['High'].cummax())
snp500_daily['all_time_high'] = snp500_daily['Close'].cummax()
snp500_daily

Unnamed: 0,High,Low,Close,all_time_high
1950-01-03,16.660000,16.660000,16.660000,16.660000
1950-01-04,16.850000,16.850000,16.850000,16.850000
1950-01-05,16.930000,16.930000,16.930000,16.930000
1950-01-06,16.980000,16.980000,16.980000,16.980000
1950-01-09,17.080000,17.080000,17.080000,17.080000
...,...,...,...,...
2025-05-21,5938.370117,5830.910156,5844.609863,6144.149902
2025-05-22,5878.080078,5825.819824,5842.009766,6144.149902
2025-05-23,5829.509766,5767.410156,5802.819824,6144.149902
2025-05-27,5924.330078,5854.069824,5921.540039,6144.149902


In [18]:
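# Flag a peak: a close that is an all-time high and is not exceeded on the next trading day
# (i.e. the close equals the next day's running maximum)
snp500_daily["new_high"] = snp500_daily["Close"] == snp500_daily["all_time_high"].shift(-1)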
snp500_daily.head(10)

Unnamed: 0,High,Low,Close,all_time_high,new_high
1950-01-03,16.66,16.66,16.66,16.66,False
1950-01-04,16.85,16.85,16.85,16.85,False
1950-01-05,16.93,16.93,16.93,16.93,False
1950-01-06,16.98,16.98,16.98,16.98,False
1950-01-09,17.08,17.08,17.08,17.08,True
1950-01-10,17.030001,17.030001,17.030001,17.08,False
1950-01-11,17.09,17.09,17.09,17.09,True
1950-01-12,16.76,16.76,16.76,17.09,False
1950-01-13,16.67,16.67,16.67,17.09,False
1950-01-16,16.719999,16.719999,16.719999,17.09,False


In [19]:
snp500_daily["cycle_id"] = snp500_daily["new_high"].cumsum()
snp500_daily.head(10)

Unnamed: 0,High,Low,Close,all_time_high,new_high,cycle_id
1950-01-03,16.66,16.66,16.66,16.66,False,0
1950-01-04,16.85,16.85,16.85,16.85,False,0
1950-01-05,16.93,16.93,16.93,16.93,False,0
1950-01-06,16.98,16.98,16.98,16.98,False,0
1950-01-09,17.08,17.08,17.08,17.08,True,1
1950-01-10,17.030001,17.030001,17.030001,17.08,False,1
1950-01-11,17.09,17.09,17.09,17.09,True,2
1950-01-12,16.76,16.76,16.76,17.09,False,2
1950-01-13,16.67,16.67,16.67,17.09,False,2
1950-01-16,16.719999,16.719999,16.719999,17.09,False,2
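
The peak flag and cycle numbering used above can be checked in isolation. The sketch below replays the logic on the first eight closes from the table, rounded to two decimals; the variable names are illustrative.

```python
import pandas as pd

# First eight closes from the table above, rounded to two decimals
close = pd.Series([16.66, 16.85, 16.93, 16.98, 17.08, 17.03, 17.09, 16.76])

# Running maximum of the closes
all_time_high = close.cummax()

# A row is a peak when its close equals the NEXT row's running maximum,
# i.e. it is an all-time high that the following close does not exceed
is_peak = close == all_time_high.shift(-1)

# Each peak starts a new cycle; rows before the first peak belong to cycle 0
cycle_id = is_peak.cumsum()

print(pd.DataFrame({"close": close, "is_peak": is_peak, "cycle_id": cycle_id}))
```

Because `shift(-1)` leaves a NaN in the last row, the most recent trading day can never be flagged as a peak with this approach.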


In [20]:
lows_between_highs = (
    snp500_daily[~snp500_daily["new_high"]]  # keep only the correction phase (rows that are not peaks)
    .groupby("cycle_id")["Low"]
    .min()  # lowest low within each cycle
    .shift(-1)  # align with the preceding peak
)

In [21]:
lows_between_highs

cycle_id,Low
0,17.030001
1,16.670000
2,16.990000
3,17.070000
4,17.440001
...,...
667,6072.899902
668,5773.310059
669,5923.930176
670,4835.040039


In [22]:
# Create a new column holding the low of the following cycle
snp500_daily["correction_low"] = snp500_daily["cycle_id"].map(lows_between_highs)

In [23]:
# 1. Identify the dates of the peaks
high_dates = snp500_daily[snp500_daily["new_high"]].index.to_list()

drawdowns = []

# 2. Loop over each pair of consecutive peaks
for i in range(len(high_dates) - 1):
    high_date = high_dates[i]
    next_high_date = high_dates[i + 1]
    #high_value = snp500_daily.loc[high_date, "High"]
    high_value = snp500_daily.loc[high_date, "Close"]

    # Slice between the two peaks
    #lows_between = snp500_daily.loc[high_date:next_high_date]["Low"]
    lows_between = snp500_daily.loc[high_date:next_high_date]["Close"]
    low_point = lows_between.min()
    low_date = lows_between.idxmin()

    drawdown_pct = round(((high_value - low_point) / high_value * 100), 1)

    drawdowns.append({
        "high_date": high_date,
        "low_date": low_date,
        "high": high_value,
        "low": low_point,
        "drawdown_pct": drawdown_pct
    })

# Collect the results in a DataFrame
drawdown_df = pd.DataFrame(drawdowns)
drawdown_df = drawdown_df.loc[drawdown_df["drawdown_pct"] >= 5].copy()
drawdown_df["correction_duration_days"] = (drawdown_df["low_date"] - drawdown_df["high_date"]).apply(lambda x: x.days)

In [24]:
drawdown_df.sort_values(by="drawdown_pct", ascending=False).head(10)

Unnamed: 0,high_date,low_date,high,low,drawdown_pct,correction_duration_days
467,2007-10-09,2009-03-09,1565.150024,676.530029,56.8,517
462,2000-03-24,2002-10-09,1527.459961,776.76001,49.1,929
224,1973-01-11,1974-10-03,120.239998,62.279999,48.2,630
211,1968-11-29,1970-05-26,108.370003,69.290001,36.1,543
593,2020-02-19,2020-03-23,3386.149902,2237.399902,33.9,33
311,1987-08-25,1987-12-04,336.769989,223.919998,33.5,101
148,1961-12-12,1962-06-26,72.639999,52.32,28.0,196
237,1980-11-28,1982-08-12,140.520004,102.419998,27.1,622
639,2022-01-03,2022-10-12,4796.560059,3577.030029,25.4,282
194,1966-02-09,1966-10-07,94.059998,73.199997,22.2,240


In [25]:
drawdown_df.describe()

Unnamed: 0,high,low,drawdown_pct,correction_duration_days
count,73.0,73.0,73.0,73.0
mean,976.162205,848.210959,12.369863,110.794521
std,1352.862401,1196.94692,10.977296,177.134362
min,19.4,16.68,5.0,7.0
25%,94.059998,69.610001,6.1,21.0
50%,359.799988,295.459991,7.7,39.0
75%,1418.780029,1247.410034,14.0,87.0
max,5667.200195,5186.330078,56.8,929.0



---
### Question 4. [Stocks] Earnings Surprise Analysis for Amazon (AMZN)


**Calculate the median 2-day percentage change in stock prices following positive earnings surprise days.**

Steps:
1. Load earnings data from CSV ([ha1_Amazon.csv](ha1_Amazon.csv)) containing earnings dates, EPS estimates, and actual EPS
2. Download complete historical price data using yfinance
3. Calculate 2-day percentage changes for all historical dates: for each sequence of 3 consecutive trading days (Day 1, Day 2, Day 3), compute the return as Close_Day3 / Close_Day1 - 1. (Assume Day 2 may correspond to the earnings announcement.)
4. Identify positive earnings surprises (where "actual EPS > estimated EPS" OR "Surprise (%)>0")
5. Calculate 2-day percentage changes following positive earnings surprises
6. Compare the median 2-day percentage change for positive surprises vs. all historical dates
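
The steps above can be sketched end to end on a few made-up rows. All prices, dates, and EPS values below are hypothetical, and the column name `ret_2d` is an illustrative choice. The sketch also intersects the earnings dates with the trading calendar, on the assumption that an announcement may fall on a day without a quote.

```python
import pandas as pd

# Hypothetical closes on five consecutive trading days
prices = pd.DataFrame(
    {"Close": [100.0, 102.0, 105.0, 103.0, 110.0]},
    index=pd.to_datetime(
        ["2025-01-06", "2025-01-07", "2025-01-08", "2025-01-09", "2025-01-10"]
    ),
)

# Step 3: for each Day 2 row, return = Close_Day3 / Close_Day1 - 1
prices["ret_2d"] = prices["Close"].shift(-1) / prices["Close"].shift(1) - 1

# Hypothetical earnings records indexed by announcement date
earnings = pd.DataFrame(
    {"EPS Estimate": [1.0, 2.0], "Reported EPS": [1.2, 1.5]},
    index=pd.to_datetime(["2025-01-07", "2025-01-09"]),
)

# Step 4: positive surprise when the reported EPS beats the estimate
positive = earnings[earnings["Reported EPS"] > earnings["EPS Estimate"]]

# Step 5: keep only announcement dates that are also trading days
matched_dates = prices.index.intersection(positive.index)
median_positive = prices.loc[matched_dates, "ret_2d"].median()

# Step 6: baseline over all dates with a defined 2-day return
median_all = prices["ret_2d"].dropna().median()
print(median_positive, median_all)
```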

Context: Earnings announcements, especially when they exceed analyst expectations, can significantly impact stock prices in the short term.

Reference: Yahoo Finance earnings calendar - https://finance.yahoo.com/calendar/earnings?symbol=AMZN

*Additional*: Is there a correlation between the magnitude of the earnings surprise and the stock price reaction? Does the market react differently to earnings surprises during bull vs. bear markets?
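
A first look at the correlation question could use a plain Pearson correlation between the surprise size and the 2-day return. The sketch below uses made-up numbers; the column names `surprise_pct` and `ret_2d` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical events: surprise in percent and the 2-day return that followed
events = pd.DataFrame({
    "surprise_pct": [2.0, 5.0, 10.0, 20.0],
    "ret_2d": [0.00, 0.01, 0.03, 0.04],
})

# Pearson correlation (the pandas default) between surprise size and price reaction
corr = events["surprise_pct"].corr(events["ret_2d"])
print(round(corr, 3))
```

Pearson correlation only captures linear association, and with a few dozen earnings events per stock the estimate is noisy, so it is better treated as a descriptive statistic than as evidence of a stable relationship.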


In [26]:
earnings_df = pd.read_csv("/content/ha1_Amazon.csv", sep=";")
earnings_df.head()

Unnamed: 0,Symbol,Company,Earnings Date,EPS Estimate,Reported EPS,Surprise (%)
0,AMZN,Amazon.com Inc,"April 29, 2026 at 6 AM EDT",-,-,-
1,AMZN,Amazon.com Inc,"February 4, 2026 at 4 PM EST",-,-,-
2,AMZN,Amazon.com Inc,"October 29, 2025 at 6 AM EDT",-,-,-
3,AMZN,Amazon.com Inc,"July 30, 2025 at 4 PM EDT",-,-,-
4,AMZN,"Amazon.com, Inc.","May 1, 2025 at 4 PM EDT",1.36,1.59,+16.73


In [27]:
# Drop the time-of-day suffix (e.g. " at 4 PM EDT"), then convert 'Earnings Date' to datetime
earnings_df['Earnings Date'] = earnings_df['Earnings Date'].str.split(' at ', expand=True)[0]
earnings_df['Earnings Date'] = pd.to_datetime(earnings_df['Earnings Date'])

In [28]:
earnings_df.index = earnings_df["Earnings Date"]
earnings_df.drop(columns=["Earnings Date", "Symbol", "Company"], inplace=True)
earnings_df

Earnings Date,EPS Estimate,Reported EPS,Surprise (%)
2026-04-29,-,-,-
2026-02-04,-,-,-
2025-10-29,-,-,-
2025-07-30,-,-,-
2025-05-01,1.36,1.59,+16.73
...,...,...,...
1998-07-22,-,-,+1.34
1998-04-27,-,-,+13.92
1998-01-22,-,-,+11.41
1997-10-27,-,-,+13.29


In [29]:
cols = ["EPS Estimate", "Reported EPS", "Surprise (%)"]

for col in cols:
    earnings_df[col] = pd.to_numeric(earnings_df[col], errors="coerce")
earnings_df

Earnings Date,EPS Estimate,Reported EPS,Surprise (%)
2026-04-29,,,
2026-02-04,,,
2025-10-29,,,
2025-07-30,,,
2025-05-01,1.36,1.59,16.73
...,...,...,...
1998-07-22,,,1.34
1998-04-27,,,13.92
1998-01-22,,,11.41
1997-10-27,,,13.29
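
The cleaning steps above can be reproduced without the CSV file. The sketch below rebuilds two rows from the table by hand and applies the same parsing; the explicit `format` string is an assumption about how the dates are written.

```python
import pandas as pd

# Two rows copied from the table above: a future date with "-" placeholders and a reported quarter
raw = pd.DataFrame({
    "Earnings Date": ["July 30, 2025 at 4 PM EDT", "May 1, 2025 at 4 PM EDT"],
    "EPS Estimate": ["-", "1.36"],
    "Surprise (%)": ["-", "+16.73"],
})

# Keep only the date part before " at " and parse it with an explicit format
date_part = raw["Earnings Date"].str.split(" at ").str[0]
raw["Earnings Date"] = pd.to_datetime(date_part, format="%B %d, %Y")

# "-" placeholders become NaN; signed strings such as "+16.73" parse as floats
for col in ["EPS Estimate", "Surprise (%)"]:
    raw[col] = pd.to_numeric(raw[col], errors="coerce")

print(raw)
```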


In [30]:
earnings_df = earnings_df.loc[(earnings_df["Surprise (%)"] > 0) |
    (earnings_df["Reported EPS"] > earnings_df["EPS Estimate"])]

In [31]:
ticker_obj = yf.Ticker("AMZN")
amazon_daily = ticker_obj.history(period = "max", interval = "1d")
amazon_daily.index = amazon_daily.index.date
amazon_daily

Unnamed: 0,Open,High,Low,Close,Volume,Dividends,Stock Splits
1997-05-15,0.121875,0.125000,0.096354,0.097917,1443120000,0.0,0.0
1997-05-16,0.098438,0.098958,0.085417,0.086458,294000000,0.0,0.0
1997-05-19,0.088021,0.088542,0.081250,0.085417,122136000,0.0,0.0
1997-05-20,0.086458,0.087500,0.081771,0.081771,109344000,0.0,0.0
1997-05-21,0.081771,0.082292,0.068750,0.071354,377064000,0.0,0.0
...,...,...,...,...,...,...,...
2025-05-21,201.610001,203.460007,200.059998,201.119995,42460900,0.0,0.0
2025-05-22,201.380005,205.759995,200.160004,203.100006,38938900,0.0,0.0
2025-05-23,198.899994,202.369995,197.850006,200.990005,33393500,0.0,0.0
2025-05-27,203.089996,206.690002,202.190002,206.020004,34843100,0.0,0.0


In [32]:
# 2-day return centred on each row (Day 2): Close_Day3 / Close_Day1 - 1, kept as a fraction
amazon_daily["2days_return"] = amazon_daily["Close"].shift(-1) / amazon_daily["Close"].shift(1) - 1
amazon_daily = amazon_daily[["Close", "2days_return"]].copy()
amazon_daily

Unnamed: 0,Close,2days_return
1997-05-15,0.097917,
1997-05-16,0.086458,-0.127659
1997-05-19,0.085417,-0.054211
1997-05-20,0.081771,-0.164639
1997-05-21,0.071354,-0.146494
...,...,...
2025-05-21,201.119995,-0.004753
2025-05-22,203.100006,-0.000646
2025-05-23,200.990005,0.014377
2025-05-27,206.020004,0.022339


In [33]:
merged_df = pd.merge(amazon_daily, earnings_df, how='outer', left_index=True, right_index=True)
merged_df

Unnamed: 0,Close,2days_return,EPS Estimate,Reported EPS,Surprise (%)
1997-05-15,0.097917,,,,
1997-05-16,0.086458,-0.127659,,,
1997-05-19,0.085417,-0.054211,,,
1997-05-20,0.081771,-0.164639,,,
1997-05-21,0.071354,-0.146494,,,
...,...,...,...,...,...
2025-05-21,201.119995,-0.004753,,,
2025-05-22,203.100006,-0.000646,,,
2025-05-23,200.990005,0.014377,,,
2025-05-27,206.020004,0.022339,,,


In [34]:
positive_dates = earnings_df.index
returns_after_surprise = amazon_daily.loc[positive_dates, "2days_return"].dropna()

In [35]:
all_returns = amazon_daily["2days_return"].dropna()
len(returns_after_surprise), len(all_returns)

(85, 7051)

In [36]:
median_surprise = returns_after_surprise.median()
median_all = all_returns.median()

In [37]:
print("Median 2-day return after positive surprises:", round(median_surprise * 100, 3), "%")
print("Median 2-day return overall:", round(median_all * 100, 3), "%")

Median 2-day return after positive surprises: 1.247 %
Median 2-day return overall: 0.165 %


In [38]:
median_surprise / median_all

7.538463656006185


---
### Question 5. [Exploratory, optional] Brainstorm a potential idea for your capstone project

**Free text answer**

Describe the capstone project you would like to pursue, considering your aspirations, ML model predictions, and prior knowledge. Even if you are unsure at this stage, try to generate an idea you would like to explore, such as a specific asset class, country, industry vertical, or investment strategy. Be as specific as possible.

*Example: I want to build a short-term prediction model for the US/India/Brazil stock markets, focusing on the largest stocks over a 30-day investment horizon. I plan to use RSI and MACD technical indicators and news coverage data to generate predictions.*


I aim to modernize my personal investment tools by automating and industrializing the algorithmic analysis I currently perform with Excel and macros.
My approach is not focused on classic predictive models (ML or deep learning) but rather on efficient data processing and advanced analysis, particularly studying correlations between portfolio lines.
The goal is to build a robust technical component that can be integrated into a future SaaS, improving productivity and decision-making without revealing my proprietary methods.
However, I wonder if such a project, centered on algorithmic optimization and data analysis without predictive modeling, would be accepted within the Zoomcamp framework, which seems more prediction-oriented.


---
### Question 6. [Exploratory, optional] Investigate new metrics

**Free text answer**

Using the data sources we have covered (or any others you find relevant), download and explore a few additional metrics or time series that could be valuable for your project. Briefly explain why you think each metric is useful. This does not need to be a comprehensive list; focus on demonstrating your ability to generate data requests based on your project description, identify and locate the necessary data, and explain how you would retrieve it using Python.



Given my focus on portfolio construction and diversification, I plan to explore additional metrics that help assess overlap and correlation between assets. Useful time series could include:

* Rolling correlation between assets: to track how asset relationships evolve over time and avoid concentration risk.
* Average True Range (ATR) or volatility indices: to better understand asset risk profiles and adjust weights accordingly.
* Sharpe ratio over rolling windows: to evaluate risk-adjusted performance dynamically.

These metrics can be calculated using data from Yahoo Finance or other APIs like Alpha Vantage. I would use `yfinance` to fetch historical prices, then compute the metrics using `pandas` and `numpy`. For example, rolling correlations could be generated via `df.rolling(window=30).corr()` on log returns.

These metrics support my project’s goal of algorithmic portfolio management without relying on explicit ML predictions.
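
As a rough sketch of how two of these metrics could be computed, the example below uses simulated prices instead of a `yfinance` download so that it runs offline. The asset names, the 30-day window, the zero risk-free rate, and the 252-day annualization factor are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Simulated daily closes for two hypothetical assets (stand-in for downloaded prices)
rng = np.random.default_rng(0)
dates = pd.bdate_range("2024-01-01", periods=120)
shocks = rng.normal(loc=0.0, scale=0.01, size=(120, 2))
prices = pd.DataFrame(
    100 * np.exp(np.cumsum(shocks, axis=0)),
    index=dates,
    columns=["ASSET_A", "ASSET_B"],
)

# Daily log returns; the first row has no previous price and is dropped
log_returns = np.log(prices).diff().dropna()

window = 30
# Rolling correlation between the two assets over the last `window` trading days
rolling_corr = log_returns["ASSET_A"].rolling(window).corr(log_returns["ASSET_B"])

# Rolling Sharpe ratio, annualized with 252 trading days and a zero risk-free rate
rolling_mean = log_returns.rolling(window).mean()
rolling_std = log_returns.rolling(window).std()
rolling_sharpe = rolling_mean / rolling_std * np.sqrt(252)

print(rolling_corr.dropna().tail())
print(rolling_sharpe.dropna().tail())
```

The first `window - 1` values of each rolling series are NaN because a full window is required; with real data the window length would need to be tuned to the investment horizon.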

---

## Submitting the solutions

Form for submitting: https://courses.datatalks.club/sma-zoomcamp-2025/homework/hw01

---
## Leaderboard

Leaderboard link: https://courses.datatalks.club/sma-zoomcamp-2025/leaderboard

---