### Calculation of PER using machine learning

**What is PER?**  
The Price-to-Earnings ratio (P/E ratio) in finance is a key measure used to assess the valuation of a company in the stock market.  
The PER provides an indication of the number of years it would take to recoup the initial investment if the company were to distribute all its profits to shareholders as dividends.  
A high PER may indicate that the market values the company at a higher level relative to its current earnings, which may imply that investors anticipate future growth. Conversely, a low PER may indicate that the company is undervalued relative to its current earnings, which may represent an investment opportunity.

**How is PER calculated?**  
$$\text{PER} = \frac{\text{Stock Price}}{\text{Earnings Per Share}}$$
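
As a quick worked example of this formula, here is a minimal sketch. The function name `price_earnings_ratio` and the figures are hypothetical, chosen only for illustration.

```python
def price_earnings_ratio(stock_price: float, earnings_per_share: float) -> float:
    """Return the PER; treated here as undefined for zero or negative earnings."""
    if earnings_per_share <= 0:
        raise ValueError("PER is not meaningful for zero or negative earnings")
    return stock_price / earnings_per_share


# Hypothetical company: share price of 50.0 and earnings per share of 4.0
per = price_earnings_ratio(50.0, 4.0)
print(per)  # 12.5
```

A PER of 12.5 means the share price equals 12.5 years of current earnings per share, which is the "years to recoup the investment" reading given above.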

**Classic PER analysis:**  
To judge whether a company's PER is high or low, it should be compared with the average PER of companies in the same sector.  
If the company's PER is lower (or higher) than that average, the company appears undervalued (or overvalued) relative to its peers.
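
The classic comparison can be sketched as follows. The peer names, the PER values and the helper `classic_verdict` are hypothetical, and the simple mean is only one possible choice of sector average.

```python
from statistics import mean

# Hypothetical PERs of peers in the same sector (illustrative values only)
sector_pers = {"Peer A": 10.0, "Peer B": 14.0, "Peer C": 18.0}
company_per = 11.0

# Sector average: (10 + 14 + 18) / 3
sector_average = mean(list(sector_pers.values()))


def classic_verdict(company_per: float, sector_average: float) -> str:
    """Compare a company's PER with its sector average."""
    if company_per < sector_average:
        return "undervalued"
    if company_per > sector_average:
        return "overvalued"
    return "in line with the sector"


print(sector_average, classic_verdict(company_per, sector_average))
# 14.0 undervalued
```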

**PER analysis with machine learning:**  
In our study, to find comparable companies, we will consider:
- *The company's sector*
- *Market capitalization*
- *Total assets*
- *Revenue*

We will use a machine learning model to estimate the PER our company should have, given the criteria above.  
As input, we will use the logarithm of market capitalization, total assets, and revenue (the log transform reduces skew and brings the distributions closer to normal), as well as the company's sector (one dummy variable per sector, equal to 1 for the company's own sector and 0 otherwise).  
As output, the model predicts the logarithm of the PER, which we exponentiate to obtain the predicted PER.
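
To make the input and output encoding concrete, here is a minimal sketch under stated assumptions: the sector list is a hypothetical three-sector subset, and `build_features` and `per_from_log_prediction` are illustrative helpers, not part of the notebook.

```python
import math

# Hypothetical subset of sectors; the real list comes from the dataset
SECTORS = ["Industries", "Technologies", "Santé"]


def build_features(market_cap, total_assets, revenue, sector):
    """Sector dummies followed by the three log-transformed size measures."""
    one_hot = [1 if sector == s else 0 for s in SECTORS]
    logs = [math.log(market_cap), math.log(total_assets), math.log(revenue)]
    return one_hot + logs


def per_from_log_prediction(predicted_log_per):
    """The model predicts log(PER), so exponentiate to get back a PER."""
    return math.exp(predicted_log_per)


# Illustrative company: 1e9 market cap, 2e9 total assets, 5e8 revenue
features = build_features(1e9, 2e9, 5e8, "Technologies")
print(features[:3])  # [0, 1, 0]
```

Note that the log transform requires strictly positive inputs, so companies with zero or negative revenue, assets, market cap or PER have to be excluded beforehand.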

Once our model is trained on a sufficiently large sample, we can compute the PER the model expects for the company and compare it with the company's actual PER.  
- If Actual PER < Predicted PER: stock undervalued  
- If Actual PER > Predicted PER: stock overvalued

Then we will look at the median spread per sector to determine which sectors tend to be undervalued or overvalued:  
$\text{spread} = \text{actual PER} - \text{predicted PER}$  
- If spread < 0: sector undervalued  
- If spread > 0: sector overvalued
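
The sector-level rule can be sketched on toy data as follows. The sketch takes the spread as actual PER minus predicted PER, the convention under which a negative spread means undervalued (it is also the one used in the code further down). Sector names and PER values are hypothetical.

```python
from collections import defaultdict
from statistics import median

# (sector, actual PER, predicted PER): hypothetical values
rows = [
    ("Sector A", 8.0, 12.0),
    ("Sector A", 10.0, 11.0),
    ("Sector A", 15.0, 13.0),
    ("Sector B", 20.0, 14.0),
    ("Sector B", 18.0, 16.0),
]

# Collect the spread (actual - predicted) of every company, grouped by sector
spreads = defaultdict(list)
for sector, actual, predicted in rows:
    spreads[sector].append(actual - predicted)

# The median is less sensitive to a few extreme PERs than the mean
median_spread = {sector: median(values) for sector, values in spreads.items()}

for sector, value in median_spread.items():
    verdict = "undervalued" if value < 0 else "overvalued"
    print(sector, value, verdict)
# Sector A -1.0 undervalued
# Sector B 4.0 overvalued
```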

**Results:** the mean absolute error (MAE) of our machine learning model on the log PER is still too high (0.89 as reported here, about 0.65 on the test set in the run shown below), so the model needs to be improved.
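
Because the model is trained on the log PER, an absolute error in log space corresponds to a multiplicative error on the PER itself. The short sketch below converts the 0.89 figure quoted above into such a factor; the variable names are illustrative.

```python
import math

mae_log_per = 0.89  # MAE on log PER quoted above

# |log(predicted) - log(actual)| = m  means  predicted / actual = exp(+m) or exp(-m)
factor = math.exp(mae_log_per)

print(f"typical ratio predicted/actual: x{factor:.2f} or x{1 / factor:.2f}")
```

In other words, an MAE of 0.89 on the log PER means the predicted PER is typically off by a factor of roughly 2.4 in either direction, which explains why the model is described as needing improvement.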

In [1]:
import tensorflow as tf
from sklearn.model_selection import train_test_split
from tensorflow.keras import layers, regularizers
import pandas as pd
import requests
from bs4 import BeautifulSoup
import numpy as np
from yahooquery import Ticker
from tqdm import tqdm




In [2]:
# We load our dataset with information about companies
df_dataset = pd.read_csv("Dataset French Companies.csv")

list_tickers = df_dataset["Ticker"].tolist()
dic_data = {}
for ticker in tqdm(list_tickers):
    action = Ticker(ticker)
    try:
        # We retrieve the latest trailing-twelve-month (TTM) market cap and PER
        df_valuation_measures = action.valuation_measures
        df_ttm = df_valuation_measures.loc[df_valuation_measures["periodType"] == "TTM"]
        mrkt_cap = df_ttm["MarketCap"].dropna().iloc[-1]
        per = df_ttm["PeRatio"].dropna().iloc[-1]

        # We retrieve the Total Assets (latest annual figure)
        df_balance_sheet = action.balance_sheet()
        assets = df_balance_sheet.loc[df_balance_sheet["periodType"] == "12M"]["TotalAssets"].dropna().iloc[-1]

        # We retrieve the Total Revenues (latest annual figure)
        df_income_statement = action.income_statement()
        revenue = df_income_statement.loc[df_income_statement["periodType"] == "12M"]["TotalRevenue"].dropna().iloc[-1]

        # The log transform applied later requires strictly positive values
        if revenue <= 0 or assets <= 0 or mrkt_cap <= 0 or per <= 0:
            continue

        row = df_dataset.loc[df_dataset["Ticker"] == ticker].iloc[0]
        dic_data[row["Name"]] = {
            "mkrt_cap": float(mrkt_cap),
            "assets": float(assets),
            "revenue": float(revenue),
            "sector": row["Sector"],
            "per": float(per),
        }
    except Exception:
        # We skip tickers whose data is missing or could not be retrieved
        continue


100%|██████████| 623/623 [09:18<00:00,  1.12it/s]


In [3]:
# We create a DataFrame to format our data
df = pd.DataFrame(dic_data).T
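# The transposed DataFrame has object dtype, so we cast to float before taking logs
df["log_mkrt_cap"] = np.log(df["mkrt_cap"].astype(float))
df["log_assets"] = np.log(df["assets"].astype(float))
df["log_revenue"] = np.log(df["revenue"].astype(float))
df["log_per"] = np.log(df["per"].astype(float))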

# We remove extreme values (60 is arbitrary)
df = df.query("per < 60").copy()

# We create a list with all the sectors
sector_list = list(set(df["sector"]))

# We create a dummy variable for the sector
for sector in sector_list:
    df[sector] = np.where(df["sector"] == sector, 1, 0)
df

| | mkrt_cap | assets | revenue | sector | per | log_mkrt_cap | log_assets | log_revenue | log_per | Santé | Matériaux de base | Services aux consommateurs | Sociétés financières | Services aux Collectivités | Technologies | Biens de consommation | Télécommunications | Industries | Pétrole et Gaz |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DOLFINES | 1479542.0 | 6333292.0 | 2731064.0 | Pétrole et Gaz | 0.004945 | 14.207243 | 15.661331 | 14.820202 | -5.309378 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| ECOSLOPS | 3768745.0 | 34068000.0 | 10265000.0 | Pétrole et Gaz | 4.34339 | 15.142253 | 17.343869 | 16.144251 | 1.468655 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| ENOGIA | 10058135.0 | 16699000.0 | 7312000.0 | Pétrole et Gaz | 0.034815 | 16.123892 | 16.630859 | 15.805027 | -3.357707 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| EO2 | 8334105.0 | 44375000.0 | 34071000.0 | Pétrole et Gaz | 11.666667 | 15.935867 | 17.608187 | 17.343957 | 2.456736 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| ESSO | 1393436255.0 | 4744300000.0 | 19240000000.0 | Pétrole et Gaz | 2.644593 | 21.055039 | 22.280210 | 23.680257 | 0.972517 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| EKINOPS | 107659660.0 | 192612000.0 | 129097000.0 | Télécommunications | 6.25 | 18.494486 | 19.076188 | 18.676075 | 1.832581 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| EUTELSAT COMMUNIC. | 1750624619.0 | 8518400000.0 | 1213000000.0 | Télécommunications | 7.307839 | 21.283238 | 22.865494 | 20.916362 | 1.988948 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| HF COMPANY | 12522748.0 | 24465000.0 | 5472000.0 | Télécommunications | 6.571429 | 16.343057 | 17.012754 | 15.515155 | 1.882731 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| ORANGE | 26196091521.0 | 110052000000.0 | 44122000000.0 | Télécommunications | 12.797403 | 23.988876 | 25.424219 | 24.510224 | 2.549242 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |


In [4]:
# We create our machine learning model
X_list = sector_list + ["log_mkrt_cap", "log_assets", "log_revenue"]
X = df[X_list]
y = df["log_per"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

stock_model = tf.keras.Sequential([
    layers.Dense(64, activation='relu', kernel_regularizer=regularizers.l2(0.01)),
    layers.Dense(32, activation='relu', kernel_regularizer=regularizers.l2(0.01)),
    layers.Dense(1)
])

stock_model.compile(loss=tf.keras.losses.mae,
                   optimizer=tf.keras.optimizers.Adamax(),
                   metrics=["mae"])

stock_model.fit(X_train, y_train, epochs=50)

stock_model.evaluate(X_test, y_test)[1]


Epoch 1/50
...
Epoch 50/50


0.6529309749603271
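
One way to put a test MAE like the one above in context is to compare it with a naive baseline that always predicts the training-set median of the log PER. The sketch below uses hypothetical log PER values and plain lists rather than the notebook's data; `mean_absolute_error` is a hand-written helper, not a library call.

```python
from statistics import median


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical log PER values (illustrative only)
log_per_train = [1.5, 2.0, 2.5, 3.0, 3.5]
log_per_test = [2.0, 3.0, 4.0]

# Naive baseline: always predict the median log PER seen during training
baseline_prediction = median(log_per_train)
baseline_mae = mean_absolute_error(
    log_per_test, [baseline_prediction] * len(log_per_test)
)

print(baseline_prediction, round(baseline_mae, 4))
```

If the model's MAE is not clearly below such a baseline, the features are adding little information beyond the typical PER level.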

In [5]:
# We apply our machine learning model to each company in the dataset to calculate their predicted PER
# (one batched predict call; the model outputs log PER, so we exponentiate)
predicted_pers = np.exp(stock_model.predict(df[X_list].astype("float32")).ravel())

dic_df_verdict = {}
for name, predicted_per in zip(df.index, predicted_pers):
    actual_per = df.loc[name, "per"]
    status = "undervalued" if actual_per < predicted_per else "overvalued"
    dic_df_verdict[name] = {"predicted_per": predicted_per, "actual_per": actual_per, "status": status}



In [6]:
# We calculate the spread between the predicted PER and the actual PER of the company
df_per = pd.DataFrame(dic_df_verdict).T
df_per["spread"] = df_per["actual_per"] - df_per["predicted_per"]
df_per.sort_values(by="spread")

| | predicted_per | actual_per | status | spread |
|---|---|---|---|---|
| OSE IMMUNO | 19.662292 | 5.080925 | undervalued | -14.581367 |
| STMICROELECTRONICS | 24.788849 | 10.260118 | undervalued | -14.528731 |
| ENERGISME | 12.069276 | 0.045107 | undervalued | -12.024169 |
| VINCI | 24.476555 | 12.634543 | undervalued | -11.842012 |
| EURAZEO | 13.654919 | 2.99687 | undervalued | -10.658049 |
| ... | ... | ... | ... | ... |
| PROACTIS | 9.584207 | 53.713218 | overvalued | 44.129011 |
| BIO-UV GRP | 9.273505 | 57.767624 | overvalued | 48.494119 |
| JACQUET METALS | 8.8206 | 57.703704 | overvalued | 48.883104 |
| SODITECH | 7.976148 | 57.951175 | overvalued | 49.975027 |


In [7]:
# Here we will look at the median of spreads (between actual PER and predicted PER) for each sector
df_per_sector = pd.concat([df_per, pd.DataFrame(dic_data).T["sector"].to_frame()], axis=1).dropna()
dic_median = {}
for sector in sector_list:
    dic_median[sector] = df_per_sector.loc[df_per_sector["sector"] == sector]["spread"].median()
dic_median_final = {"Median spread per sector": dic_median}
pd.DataFrame(dic_median_final).sort_values(by="Median spread per sector")

| | Median spread per sector |
|---|---|
| Télécommunications | -2.335055 |
| Pétrole et Gaz | -1.536731 |
| Sociétés financières | -0.110328 |
| Services aux Collectivités | 0.764165 |
| Services aux consommateurs | 1.074264 |
| Technologies | 1.444012 |
| Biens de consommation | 1.539215 |
| Santé | 1.901026 |
| Industries | 1.911401 |
| Matériaux de base | 7.242999 |
