# NW Summer 25 

### Asset Recommender System

The code below builds out a content-based recommender system. In order to promote quick decision making with a small number of users, a KNN-style approach where top stocks are featurized and used in an instance-based way.

Intake user preferences (possibly from a form), such as asset industry [sector], risk tolerance (growth vs. value), desired PE ratio, market-cap level (low, medium, or high), EPS, dividend level (low or none, medium, or high). Then, perform a KNN-style query of the pre-engineered database of [asset] instances to find the top n best fitting assets.  

### 1: Dependency installation & imports

In [13]:
!pip install yfinance
!pip install pandas
!pip install scikit-learn

Collecting yfinance
  Downloading yfinance-0.2.62-py2.py3-none-any.whl (118 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m118.4/118.4 kB[0m [31m13.6 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting frozendict>=2.3.4
  Downloading frozendict-2.4.6-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (116 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.9/116.9 kB[0m [31m19.6 MB/s[0m eta [36m0:00:00[0m
Collecting multitasking>=0.0.7
  Downloading multitasking-0.0.11-py3-none-any.whl (8.5 kB)
Collecting peewee>=3.16.2
  Downloading peewee-3.18.1.tar.gz (3.0 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m3.0/3.0 MB[0m [31m79.7 MB/s[0m eta [36m0:00:00[0m
[?25h  Installing build dependencies ... [?25ldone
[?25h  Getting requirements to build wheel ... [?25ldone
[?25h  Preparing metadata (pyproject.toml) ... [?25ldone
Collecting beautifulsoup4>=4.11.1
  Downloading beautifulsoup4-4.13.4-py3-none-any.whl (187 kB)

In [14]:
import yfinance as yf 
import pandas as pd
import numpy as np
from sklearn.neighbors import NearestNeighbors

### 2: Retrieving instance data for the model

Retrieve raw data for industry, PE, EPS, market cap, dividend, and risk (growth vs value) for a large number of popular assets. For a well-rounded and diverse set of data, pull the assets from the Russell 3000. 

In [3]:
# first, just testing out the timing to see if it's unreasonable to simply use
# yf.ticker for all the tickers we want to take into consideration

tickers = [
    # Dow Jones Industrial Average (30 companies)
    "MMM", "AXP", "AMGN", "AAPL", "BA", "CAT", "CVX", "CSCO", "KO", "GS",
    "HD", "HON", "IBM", "JNJ", "JPM", "MCD", "MRK", "MSFT", "NKE", "NVDA",
    "PG", "CRM", "TRV", "UNH", "VZ", "V", "WBA", "WMT", "DIS", "DOW",

    # S&P 500 (examples beyond Dow, aiming for diversity)
    "GOOGL", "AMZN", "TSLA", "FB", "BRK.B", "JPM", "V", "JNJ", "WMT", "PG",
    "XOM", "UNH", "HD", "MA", "BAC", "VZ", "PFE", "DIS", "ADBE", "CMCSA",
    "NFLX", "PYPL", "INTC", "CSCO", "PEP", "KO", "T", "COST", "ACN", "PM",
    "NOC", "TXN", "AVGO", "CHTR", "QCOM", "CVS", "MO", "NEE", "UPS", "MDLZ",
    "SBUX", "ISRG", "GILD", "UNP", "AMAT", "LOW", "BKNG", "CME", "GSK", "RTX",

    # Nasdaq 100 (tech-heavy, but with other sectors represented)
    "ASML", "ADBE", "AMAT", "AMD", "ANSS", "ATVI", "BIIB", "CDNS", "CHTR", "CMCSA",
    "COST", "CPRT", "CRWD", "CSX", "DASH", "DXCM", "EA", "EXC", "FANG", "FISV",
    "FTNT", "GILD", "GOOG", "HON", "IDXX", "ILMN", "INTU", "JD", "KDP", "KLAC",
    "LCID", "LRCX", "MAR", "MELI", "MNST", "MRVL", "MTCH", "MU", "NTES", "NXPI",
    "ODFL", "OKTA", "ON", "ORLY", "PAYX", "PCAR", "PDD", "PEP", "PYPL", "QCOM"
]

import time
start_time = time.time()
for t in tickers:
    print(yf.Ticker(t).info)
end_time = time.time()
elapsed_time = end_time - start_time
print(f"Elapsed time: {elapsed_time:.2f} seconds")

# output was just under 23 seconds for 100 tickers. for the russell 3000 this would mean about 
# 15 minutes. i want to see if i could just dump all the ticker info into a csv and then 
# use it quickly via pandas

ea, Australia, Taiwan, China, Spain, France, Iceland, New Zealand, and Sweden. The company offers branded and private-label products in a range of merchandise categories. It offers merchandise, such as sundries, dry groceries, candies, coolers, freezers, deli, liquor, and tobacco; appliances, electronics, health and beauty aids, hardware, lawn and garden, sporting goods, tires, toys and seasonal products, office supplies, automotive care products, postages, tickets, apparel, small appliances, furniture, domestics, housewares, special order kiosks, and jewelry; and meat, produce, service deli, and bakery products. The company also operates gasoline, pharmacies, optical, food courts, hearing-aid centers, and tire installation centers; and offers business delivery, travel, grocery, and various other services online. It also operates e-commerce websites. The company was formerly known as Costco Companies, Inc. and changed its name to Costco Wholesale Corporation in August 1999. Costco Whol

Since the time requirement for a few thousand API calls isn't too bad, just make a list of all the stocks in the Russell 3000 and then call yf.Ticker with each one. To obtain the list of Russell tickers (a fairly recent list), I'm using a dataset from GitHub. 

In [4]:
kaggle_df = pd.read_csv("https://raw.githubusercontent.com/bfaure/StockSpider/refs/heads/master/iShares-Russell-3000-ETF_fund.csv", skiprows=8)
kaggle_df.head()

Unnamed: 0,Ticker,Name,Asset Class,Weight (%),Price,Shares,Market Value,Notional Value,Sector,SEDOL,ISIN,Exchange
0,AAPL,APPLE INC,Equity,3.72,229.28,1462169.0,335246108.32,335246108.32,Information Technology,2046251,US0378331005,NASDAQ
1,MSFT,MICROSOFT CORP,Equity,2.93,115.15,2292554.0,263987593.1,263987593.1,Information Technology,2588173,US5949181045,NASDAQ
2,AMZN,AMAZON COM INC,Equity,2.69,1971.31,122971.0,242413962.01,242413962.01,Consumer Discretionary,2000019,US0231351067,NASDAQ
3,BRKB,BERKSHIRE HATHAWAY INC CLASS B,Equity,1.4,215.29,585969.0,126153266.01,126153266.01,Financials,2073390,US0846707026,New York Stock Exchange Inc.
4,JPM,JPMORGAN CHASE & CO,Equity,1.28,113.97,1011631.0,115295585.07,115295585.07,Financials,2190385,US46625H1005,New York Stock Exchange Inc.


In [5]:
russell_tickers = kaggle_df["Ticker"].tolist()

In [6]:
russell_tickers

['AAPL',
 'MSFT',
 'AMZN',
 'BRKB',
 'JPM',
 'FB',
 'JNJ',
 'XOM',
 'GOOG',
 'GOOGL',
 'BAC',
 'V',
 'UNH',
 'PFE',
 'T',
 'CVX',
 'HD',
 'CSCO',
 'WFC',
 'VZ',
 'INTC',
 'PG',
 'BA',
 'MA',
 'MRK',
 'C',
 'KO',
 'DIS',
 'NVDA',
 'CMCSA',
 'NFLX',
 'PEP',
 'DWDP',
 'ABBV',
 'IBM',
 'ORCL',
 'WMT',
 'AMGN',
 'MDT',
 'ADBE',
 'PM',
 'MCD',
 'ABT',
 'HON',
 'MMM',
 'UNP',
 'MO',
 'ACN',
 'CRM',
 'QCOM',
 'AVGO',
 'TXN',
 'GE',
 'UTX',
 'NKE',
 'COST',
 'PYPL',
 'LLY',
 'BMY',
 'GILD',
 'TMO',
 'BKNG',
 'LOW',
 'COP',
 'CAT',
 'LMT',
 'SLB',
 'USB',
 'UPS',
 'CVS',
 'GS',
 'NEE',
 'AXP',
 'EOG',
 'SBUX',
 'ANTM',
 'BDX',
 'BIIB',
 'TJX',
 'DHR',
 'ADP',
 'AGN',
 'AET',
 'CELG',
 'ISRG',
 'OXY',
 'PNC',
 'AMT',
 'ATVI',
 'MDLZ',
 'WBA',
 'CB',
 'CSX',
 'SYK',
 'SCHW',
 'RTN',
 'CME',
 'FDX',
 'BLK',
 'CL',
 'MS',
 'DUK',
 'CHTR',
 'INTU',
 'SPG',
 'BSX',
 'ESRX',
 'ILMN',
 'GD',
 'MU',
 'NOC',
 'NSC',
 'DE',
 'VLO',
 'CI',
 'SPGI',
 'VRTX',
 'EMR',
 'PX',
 'FOXA',
 'ITW',
 'PSX',
 'BK',
 'A

In [7]:
ticker = tickers[0]
yticker = yf.Ticker(ticker)
yticker.info.keys()

dict_keys(['address1', 'city', 'state', 'zip', 'country', 'phone', 'website', 'industry', 'industryKey', 'industryDisp', 'sector', 'sectorKey', 'sectorDisp', 'longBusinessSummary', 'fullTimeEmployees', 'companyOfficers', 'auditRisk', 'boardRisk', 'compensationRisk', 'shareHolderRightsRisk', 'overallRisk', 'governanceEpochDate', 'compensationAsOfEpochDate', 'irWebsite', 'executiveTeam', 'maxAge', 'priceHint', 'previousClose', 'open', 'dayLow', 'dayHigh', 'regularMarketPreviousClose', 'regularMarketOpen', 'regularMarketDayLow', 'regularMarketDayHigh', 'dividendRate', 'dividendYield', 'exDividendDate', 'payoutRatio', 'fiveYearAvgDividendYield', 'beta', 'trailingPE', 'forwardPE', 'volume', 'regularMarketVolume', 'averageVolume', 'averageVolume10days', 'averageDailyVolume10Day', 'bid', 'ask', 'bidSize', 'askSize', 'marketCap', 'fiftyTwoWeekLow', 'fiftyTwoWeekHigh', 'priceToSalesTrailing12Months', 'fiftyDayAverage', 'twoHundredDayAverage', 'trailingAnnualDividendRate', 'trailingAnnualDividen

In [8]:
# let's just time the first 10 to see how long it takes

keep_fields = recommended_info_fields = [
    'longName',
    'sector',
    'industry',
    'country',
    'exchange',
    'marketCap',
    'trailingPE',
    'forwardPE',
    'dividendYield',
    'beta',
    'earningsQuarterlyGrowth',
    'revenueGrowth',
    'profitMargins',
    'returnOnEquity',
    'priceToBook',
    'currentPrice',
    'volume',
    'fiftyTwoWeekChangePercent',
    'averageAnalystRating',
    'recommendationKey',
    'longBusinessSummary'
]

def get_ticker_dict(ticker):
    yticker = yf.Ticker(ticker)
    ticker_dict = {"Ticker": ticker}
    for field in keep_fields:
        ticker_dict[field] = yticker.info.get(field, np.nan)
    return ticker_dict

res = get_ticker_dict("MSFT")

In [9]:
pd.DataFrame(res, index=[0])

Unnamed: 0,Ticker,longName,sector,industry,country,exchange,marketCap,trailingPE,forwardPE,dividendYield,...,revenueGrowth,profitMargins,returnOnEquity,priceToBook,currentPrice,volume,fiftyTwoWeekChangePercent,averageAnalystRating,recommendationKey,longBusinessSummary
0,MSFT,Microsoft Corporation,Technology,Software - Infrastructure,United States,NMS,3500132007936,36.50543,31.499666,0.7,...,0.133,0.35789,0.3361,10.875751,470.92,15307425,6.77005,1.4 - Strong Buy,strong_buy,Microsoft Corporation develops and supports so...


In [10]:
!pip install tqdm
import time
from tqdm.auto import tqdm


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.0.1[0m[39;49m -> [0m[32;49m25.1.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
  from .autonotebook import tqdm as notebook_tqdm


In [14]:
lst = []
for t in tqdm(russell_tickers[2500:3000], desc="Fetching ticker data"):
    lst.append(get_ticker_dict(t))

res = pd.DataFrame(lst)
pd.DataFrame(res).to_csv("batch2500_3000.csv")

Fetching ticker data:   8%|▊         | 39/500 [00:06<01:17,  5.94it/s]HTTP Error 404: 
Fetching ticker data:  10%|█         | 52/500 [00:08<01:16,  5.87it/s]HTTP Error 404: 
Fetching ticker data:  41%|████      | 205/500 [00:33<00:38,  7.58it/s]HTTP Error 404: 
Fetching ticker data:  75%|███████▌  | 377/500 [01:01<00:19,  6.41it/s]HTTP Error 404: 
Fetching ticker data:  97%|█████████▋| 485/500 [01:19<00:02,  6.73it/s]HTTP Error 404: 
Fetching ticker data:  99%|█████████▉| 496/500 [01:21<00:00,  6.86it/s]HTTP Error 404: 
Fetching ticker data: 100%|██████████| 500/500 [01:22<00:00,  6.08it/s]


### 3: Creating & cleaning the final dataset

In [15]:
batch_0 = pd.read_csv("batch0_500.csv")
batch_1 = pd.read_csv("batch500_1000.csv")
batch_2 = pd.read_csv("batch1000_1500.csv")
batch_3 = pd.read_csv("batch1500_2000.csv")
batch_4 = pd.read_csv("batch2000_2500.csv")
batch_5 = pd.read_csv("batch2500_3000.csv")

In [16]:
dataset = pd.concat([batch_0, batch_1, batch_2, batch_3, batch_4, batch_5], ignore_index=True)

In [17]:
dataset.shape[0]

3000

In [18]:
dataset.dropna().shape[0]

631

Looks like there is a lot of missing data, so let's see where those na's are coming from. 

In [20]:

dataset.isna().sum()

Unnamed: 0                      0
Ticker                          0
longName                      843
sector                        928
industry                      928
country                       928
exchange                      837
marketCap                     925
trailingPE                   1448
forwardPE                     988
dividendYield                1745
beta                          950
earningsQuarterlyGrowth      1613
revenueGrowth                 986
profitMargins                 928
returnOnEquity               1063
priceToBook                   933
currentPrice                  926
volume                        907
fiftyTwoWeekChangePercent     928
averageAnalystRating         1703
recommendationKey             926
longBusinessSummary           912
dtype: int64

Our instance-based model doesn't need all these fields, so first let's filter out the columns we don't need.

In [21]:
# we want to keep some fields for a growth/value labeling purpose
fields = [
    # value stock features
    'priceToBook',
    'trailingPE',
    'dividendYield',

    # growth stock features
    'revenueGrowth',
    'earningsQuarterlyGrowth'
]
model = dataset[["Ticker", "longName", "sector", "industry", "country", "marketCap", "forwardPE", *fields]]

And now for some data cleaning...

In [22]:
model['dividendYield'] = model['dividendYield'].fillna(0)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  model['dividendYield'] = model['dividendYield'].fillna(0)


In [23]:
model.dropna().shape[0]

1265

Ok, so a large number of our NAs were just companies without a dividend yield - this isn't an issue. However, we see below that a lot of these tickers are not returning any data, for example "FB". This is because some tickers have been changed, ex: "FB" --> "META". 

In [24]:
model.isna().sum()

Ticker                        0
longName                    843
sector                      928
industry                    928
country                     928
marketCap                   925
forwardPE                   988
priceToBook                 933
trailingPE                 1448
dividendYield                 0
revenueGrowth               986
earningsQuarterlyGrowth    1613
dtype: int64

To address this issue, we'll use AI to create a mapping .csv file for us, mapping the old tickers to the new ones. Then, we'll call the API once again and fill in the rows. 

In [25]:
model[model['longName'].isna()]['Ticker'].to_csv("outdated_tickers.csv", index=False)

In [26]:
model.columns

Index(['Ticker', 'longName', 'sector', 'industry', 'country', 'marketCap',
       'forwardPE', 'priceToBook', 'trailingPE', 'dividendYield',
       'revenueGrowth', 'earningsQuarterlyGrowth'],
      dtype='object')

In [27]:
# this is the mapping we'll use
mapping = {
    'BRKB': 'BRK-B',
    'FB': 'META',
    'DWDP': 'PHM',
    'UTX': 'RTX',
    'ANTM': 'CVH',
    'AGN': 'ABBV',
    'CELG': 'BMY',
    'RTN': 'RTX',
    'BBT': 'PNC',
    'APC': 'APA',
    'WP': 'WRK',
    'CXO': 'COP',
    'ALXN': 'AZN',
    'RHT': 'IBM',
    'CTL': 'LUMN',
    'XLNX': 'AVGO',
    'CBS': 'VIAC',
    'WCG': 'CNC',
    'BLL': 'APD',
    'SYMC': 'CRWD',
    'VIAB': 'PARA',
    'DISCK': 'DISCA',
    'BHGE': 'BKR',
    'HCP': 'PEAK',
    'SIVB': None,   # Delisted (Silicon Valley Bank)
    'LOXO': None,   # Acquired, no ticker
    'CHK': 'OVV',
    'MYL': None,    # Mylan merged into Pfizer Upjohn (ticker changed, no direct mapping)
    'TWTR': None,   # Twitter rebranded to X, ticker still TWTR but company changed
    'SIX': None,    # No current ticker
    'WRE': None,
    'WBT': None,
    'UBSH': None,
    'LSXMA': None,
    'MBFI': None,
    'FCEA': None,
    'GLIBA': None,
    'LSXMK': None,
    'BKI': None,
    'BFB': None,
    'QRTEA': None,
    'STAY': None,
    'JCOM': None,
    'STMP': None,
    'PRSP': None,
    'SRC': None,
    'WMGI': None,
    'UNVR': None,
    'CMD': None,
    'TRCO': None,
    'APY': None,
    'CFX': None,
    'MDR': None,
    'GDI': None,
    'ELLI': None,
    'IMMU': None,
    'CXP': None,
    'JWA': None,
    'TECD': None,
    'AGR': None,
    'MOGA': None,
    'BFA': None,
    'AHL': None,
    'WBT': None,
    'UBSH': None,
    'LEXEA': None,
    'CSFL': None,
    'LGFB': None,
    'GWB': None,
    'ESL': None,
    'MLHR': None,
    'DSW': None,
    'MDP': None,
    'CBM': None,
    'SRCI': None,
    'HSC': None,
    'XLRN': None,
    'ROIC': None,
    'MSFUT': None,
    'PLT': None,
    'AXE': None,
    'CLNC': None,
    'CHSP': None,
    'ADSW': None,
    'GCP': None,
    'BIG': None,
    'ATU': None,
    'BEL': None,
    'SNHY': None,
    'MANT': None,
    'CLDR': None,
    'WDR': None,
    'NGHC': None,
    'RAVN': None,
    'EVBG': None,
    'CWEN US': None,
    'CTB': None,
    'MNR': None,
    'LGFA': None,
    'DPLO': None,
    'WPG': None,
    'BMCH': None,
    'TWNK': None,
    'PJC': None,
    'SWM': None,
    'KNL': None,
    'CISN': None,
}

Update the old tickers and refresh the information...

In [28]:
model.head()

Unnamed: 0,Ticker,longName,sector,industry,country,marketCap,forwardPE,priceToBook,trailingPE,dividendYield,revenueGrowth,earningsQuarterlyGrowth
0,AAPL,Apple Inc.,Technology,Consumer Electronics,United States,3014046000000.0,24.284006,45.135334,31.384151,0.52,0.051,0.048
1,MSFT,Microsoft Corporation,Technology,Software - Infrastructure,United States,3479990000000.0,31.318394,10.813164,36.155212,0.7,0.133,0.177
2,AMZN,"Amazon.com, Inc.",Consumer Cyclical,Internet Retail,United States,2292240000000.0,35.108128,7.491846,35.16531,0.0,0.086,0.642
3,BRKB,,,,,,,,,0.0,,
4,JPM,JPMorgan Chase & Co.,Financial Services,Banks - Diversified,United States,744226400000.0,15.997313,2.245887,13.14009,2.1,0.048,0.091


In [29]:
cols_to_update = ['longName', 'sector', 'industry', 'country', 'marketCap',
                      'forwardPE', 'priceToBook', 'trailingPE', 'dividendYield',
                      'revenueGrowth', 'earningsQuarterlyGrowth']

# remove the rows with the outdated tickers:
model = model[~model['Ticker'].isin(mapping.keys())]

# now, retrieve the new data with the updated tickers and add that new data to the dataset
mapping = {v: k for k, v in mapping.items() if v is not None}
for ticker in mapping.keys():
    try:
        yf_ticker = yf.Ticker(ticker)
        info = yf_ticker.info

        new_row = {'Ticker': ticker}
        for col in cols_to_update:
            new_row[col] = info.get(col, None)

        model = pd.concat([model, pd.DataFrame([new_row])], ignore_index=True)
    except:
        print(f'error getting ticker {ticker}')

  model = pd.concat([model, pd.DataFrame([new_row])], ignore_index=True)


In [30]:
model[model["Ticker"] == "BRK-B"]

Unnamed: 0,Ticker,longName,sector,industry,country,marketCap,forwardPE,priceToBook,trailingPE,dividendYield,revenueGrowth,earningsQuarterlyGrowth
2895,BRK-B,Berkshire Hathaway Inc.,Financial Services,Insurance - Diversified,United States,1058941000000.0,24.453686,0.001079,13.153763,,-0.002,-0.638


In [31]:
model.shape[0]

2919

In [32]:
model.isna().sum()

Ticker                        0
longName                    743
sector                      830
industry                    830
country                     830
marketCap                   827
forwardPE                   890
priceToBook                 835
trailingPE                 1353
dividendYield                 9
revenueGrowth               888
earningsQuarterlyGrowth    1521
dtype: int64

Ok, so we've updated any outdated tickers. We're still seeing a lot of missing data for sector and industry. For our purposes, we can drop these and still retain a good amount of diverse assets. 

In [33]:
model.dropna().shape[0]  # 1268 is enough to work with for a solid asset recommender 

1276

In [34]:
model = model.dropna()

### 4: Data exploration

How many assets do we have in each sector?

In [35]:
model["sector"].value_counts()

sector
Financial Services        312
Industrials               207
Consumer Cyclical         171
Technology                139
Real Estate                99
Healthcare                 94
Consumer Defensive         72
Energy                     56
Utilities                  53
Basic Materials            43
Communication Services     30
Name: count, dtype: int64

In [36]:
model.to_csv("model.csv", index=False)

How many companies offer no dividend? How many high dividend companies are there?

In [37]:
model[model["dividendYield"] == 0].shape[0]

293

In [38]:
model[model["dividendYield"] > 3].shape[0]

369

### 5: Feature engineering

The user is going to provide the following specifications: sector, market cap category (low, mid, or high cap) 

In [48]:
model.columns

Index(['Ticker', 'longName', 'sector', 'industry', 'country', 'marketCap',
       'forwardPE', 'priceToBook', 'trailingPE', 'dividendYield',
       'revenueGrowth', 'earningsQuarterlyGrowth', 'marketCapLevel'],
      dtype='object')

Feature 1: market cap level via binning

In [44]:
model['marketCapLevel'] = model['marketCap'].apply(lambda x : 'high' if x > 10e9 else 'mid' if x > 2e9 else 'low')
model['marketCapLevel'].value_counts()

marketCapLevel
high    483
mid     449
low     344
Name: count, dtype: int64

In [47]:
model.head()

Unnamed: 0,Ticker,longName,sector,industry,country,marketCap,forwardPE,priceToBook,trailingPE,dividendYield,revenueGrowth,earningsQuarterlyGrowth,marketCapLevel
0,AAPL,Apple Inc.,Technology,Consumer Electronics,United States,3014046000000.0,24.284006,45.135334,31.384151,0.52,0.051,0.048,high
1,MSFT,Microsoft Corporation,Technology,Software - Infrastructure,United States,3479990000000.0,31.318394,10.813164,36.155212,0.7,0.133,0.177,high
2,AMZN,"Amazon.com, Inc.",Consumer Cyclical,Internet Retail,United States,2292240000000.0,35.108128,7.491846,35.16531,0.0,0.086,0.642,high
3,JPM,JPMorgan Chase & Co.,Financial Services,Banks - Diversified,United States,744226400000.0,15.997313,2.245887,13.14009,2.1,0.048,0.091,high
4,JNJ,Johnson & Johnson,Healthcare,Drug Manufacturers - General,United States,376910800000.0,14.778301,4.824602,17.424917,3.35,0.024,2.379,high


Feature 2: growth vs value classification, taking into account dividend and revenue growth

In [91]:
def classify_gv(stock):
    if stock['dividendYield'] > 3:
        return 'value'
    elif stock['revenueGrowth'] > 0.15 or stock['earningsQuarterlyGrowth'] > 10:
        return 'growth'
    else:
        return 'neutral'

model['growthValueType'] = model.apply(classify_gv, axis=1)

In [92]:
model['growthValueType'].value_counts()

growthValueType
neutral    740
value      369
growth     167
Name: count, dtype: int64

In [98]:
model[model['growthValueType'] == 'growth'].head(15)

Unnamed: 0,Ticker,longName,sector,industry,country,marketCap,forwardPE,priceToBook,trailingPE,dividendYield,revenueGrowth,earningsQuarterlyGrowth,marketCapLevel,growthValueType
26,NVDA,NVIDIA Corporation,Technology,Semiconductors,United States,3485841000000.0,34.692963,41.575043,46.108067,0.03,0.692,0.262,high,growth
46,QCOM,QUALCOMM Incorporated,Technology,Semiconductors,United States,173762000000.0,12.939755,6.278145,16.131824,2.29,0.169,0.209,high,growth
47,AVGO,Broadcom Inc.,Technology,Semiconductors,United States,1148569000000.0,39.59076,4.124803,89.15146,0.97,0.164,1.727,high,growth
53,LLY,Eli Lilly and Company,Healthcare,Drug Manufacturers - General,United States,719176300000.0,35.353043,48.157497,65.13008,0.78,0.452,0.23,high,growth
77,ISRG,"Intuitive Surgical, Inc.",Healthcare,Medical Instruments & Supplies,United States,189251900000.0,67.178116,11.062645,77.19591,0.0,0.192,0.282,high,growth
87,SCHW,The Charles Schwab Corporation,Financial Services,Capital Markets,United States,159557200000.0,23.109211,3.955096,26.610607,1.22,0.181,0.402,high,growth
92,MS,Morgan Stanley,Financial Services,Capital Markets,United States,210855800000.0,16.57377,2.175741,15.426055,2.8,0.163,0.265,high,growth
95,INTU,Intuit Inc.,Technology,Software - Application,United States,212139200000.0,34.210526,10.545949,62.18316,0.54,0.41,-0.146,high,growth
97,BSX,Boston Scientific Corporation,Healthcare,Medical Devices,United States,147072100000.0,35.630825,6.620272,72.56205,0.0,0.209,0.362,high,growth
101,MU,"Micron Technology, Inc.",Technology,Semiconductors,United States,126687700000.0,8.808081,2.605977,27.11962,0.41,0.383,0.996,high,growth


In [101]:
model.to_csv("model.csv", index=False)

In [107]:
model['marketCapLevel'].value_counts()

marketCapLevel
high    483
mid     449
low     344
Name: count, dtype: int64

In [110]:
model.columns

Index(['Ticker', 'longName', 'sector', 'industry', 'country', 'marketCap',
       'forwardPE', 'priceToBook', 'trailingPE', 'dividendYield',
       'revenueGrowth', 'earningsQuarterlyGrowth', 'marketCapLevel',
       'growthValueType'],
      dtype='object')

In [137]:
# retain: Ticker, sector, marketCapLevel, growthValueType
knn_model = model[['Ticker', 'sector', 'marketCapLevel', 'growthValueType', 'forwardPE']]
knn_model = knn_model.set_index('Ticker')

# one hot encode categoricals
categorical_cols = ['sector', 'marketCapLevel', 'growthValueType']
knn_encoded = pd.get_dummies(knn_model, columns=categorical_cols)

In [140]:
knn_encoded.head()

Unnamed: 0_level_0,forwardPE,sector_Basic Materials,sector_Communication Services,sector_Consumer Cyclical,sector_Consumer Defensive,sector_Energy,sector_Financial Services,sector_Healthcare,sector_Industrials,sector_Real Estate,sector_Technology,sector_Utilities,marketCapLevel_high,marketCapLevel_low,marketCapLevel_mid,growthValueType_growth,growthValueType_neutral,growthValueType_value
Ticker,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1
AAPL,24.284006,False,False,False,False,False,False,False,False,False,True,False,True,False,False,False,True,False
MSFT,31.318394,False,False,False,False,False,False,False,False,False,True,False,True,False,False,False,True,False
AMZN,35.108128,False,False,True,False,False,False,False,False,False,False,False,True,False,False,False,True,False
JPM,15.997313,False,False,False,False,False,True,False,False,False,False,False,True,False,False,False,True,False
JNJ,14.778301,False,False,False,False,False,False,True,False,False,False,False,True,False,False,False,False,True


In [167]:
# fit and retrieve top k nn
user_input = {
    'sector': 'Healthcare',
    'marketCapLevel': 'high',
    'growthValueType': 'growth',
    'forwardPE': 15
}

feature_cols = knn_encoded.columns
X = knn_encoded.values

knn = NearestNeighbors(n_neighbors=5, metric='euclidean')
knn.fit(X)

user_df = pd.DataFrame([user_input])
user_encoded = pd.get_dummies(user_df)
user_encoded = user_encoded.reindex(columns=feature_cols, fill_value=0)

distances, indices = knn.kneighbors(user_encoded.values)

nearest_tickers = knn_encoded.iloc[indices[0]].index.tolist()

print("Top 5 similar tickers:", nearest_tickers)


Top 5 similar tickers: ['HCA', 'MDT', 'JNJ', 'FOXA', 'NTRS']


<a style='text-decoration:none;line-height:16px;display:flex;color:#5B5B62;padding:10px;justify-content:end;' href='https://deepnote.com?utm_source=created-in-deepnote-cell&projectId=b273eb39-632e-4d88-a3fd-8f52123be496' target="_blank">
 </img>
Created in <span style='font-weight:600;margin-left:4px;'>Deepnote</span></a>