## 💸 Company Market Value Prediction

Given *data about the world's largest companies*, let's try to predict the **market value** of a given company.

We will use a TensorFlow/Keras neural network to make our predictions.

Data source: https://www.kaggle.com/datasets/jasmeet0516/largest-companies-in-world

### Importing Libraries

In [1]:
import numpy as np
import pandas as pd

import plotly.express as px
import plotly.io as pio
pio.renderers.default = 'iframe'

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

import tensorflow as tf



In [2]:
data = pd.read_csv("Largest companies in world.csv")
data

Unnamed: 0,rank,organizationName,country,revenue,profits,assets,marketValue
0,1,JPMorgan Chase,United States,179.93 B,41.8 B,"3,744.3 B",399.59 B
1,2,Saudi Arabian Oil Company (Saudi Aramco),Saudi Arabia,589.47 B,156.36 B,660.99 B,"2,055.22 B"
2,3,ICBC,China,216.77 B,52.47 B,"6,116.82 B",203.01 B
3,4,China Construction Bank,China,203.08 B,48.25 B,"4,977.48 B",172.99 B
4,5,Agricultural Bank of China,China,186.14 B,37.92 B,"5,356.86 B",141.82 B
...,...,...,...,...,...,...,...
2046,1996,Alfa Laval,Sweden,5.35 B,489.5 M,7.82 B,15.6 B
2047,1996,Gap,United States,15.62 B,-202 M,11.39 B,3.17 B
2048,1996,Yes Bank,India,3.34 B,91.6 M,43.22 B,5.6 B
2049,1999,BEKB-BCBE,Switzerland,556 M,167.1 M,42.97 B,2.49 B


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2051 entries, 0 to 2050
Data columns (total 7 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   rank              2050 non-null   object
 1   organizationName  2050 non-null   object
 2   country           2050 non-null   object
 3   revenue           2049 non-null   object
 4   profits           2049 non-null   object
 5   assets            2049 non-null   object
 6   marketValue       2049 non-null   object
dtypes: object(7)
memory usage: 112.3+ KB


### Preprocessing

In [4]:
df = data.copy()

In [5]:
df.isna().sum()

rank                1
organizationName    1
country             1
revenue             2
profits             2
assets              2
marketValue         2
dtype: int64

In [6]:
# Rows that contain at least one missing value
missing_value_rows = df.loc[df.isna().any(axis=1), :]
missing_value_rows

Unnamed: 0,rank,organizationName,country,revenue,profits,assets,marketValue
39,,,,,,,
521,471.0,GE HealthCare Consulting,United States,,,,


In [7]:
df = df.drop(missing_value_rows.index, axis=0)

In [8]:
df.isna().sum()

rank                0
organizationName    0
country             0
revenue             0
profits             0
assets              0
marketValue         0
dtype: int64

In [9]:
df

Unnamed: 0,rank,organizationName,country,revenue,profits,assets,marketValue
0,1,JPMorgan Chase,United States,179.93 B,41.8 B,"3,744.3 B",399.59 B
1,2,Saudi Arabian Oil Company (Saudi Aramco),Saudi Arabia,589.47 B,156.36 B,660.99 B,"2,055.22 B"
2,3,ICBC,China,216.77 B,52.47 B,"6,116.82 B",203.01 B
3,4,China Construction Bank,China,203.08 B,48.25 B,"4,977.48 B",172.99 B
4,5,Agricultural Bank of China,China,186.14 B,37.92 B,"5,356.86 B",141.82 B
...,...,...,...,...,...,...,...
2046,1996,Alfa Laval,Sweden,5.35 B,489.5 M,7.82 B,15.6 B
2047,1996,Gap,United States,15.62 B,-202 M,11.39 B,3.17 B
2048,1996,Yes Bank,India,3.34 B,91.6 M,43.22 B,5.6 B
2049,1999,BEKB-BCBE,Switzerland,556 M,167.1 M,42.97 B,2.49 B


In [10]:
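# Convert the numeric columns to the correct data types.
# Caveat: the 'B' (billions) and 'M' (millions) suffixes are stripped without rescaling,
# so a value such as '489.5 M' becomes 489.5 and ends up on the same scale as '5.35 B'.
df['rank'] = df['rank'].str.replace(',', '').astype(int)
for col in ['revenue', 'profits', 'assets', 'marketValue']:
    df[col] = df[col].apply(lambda x: x.replace('B', '').replace('M', '').replace(',', '').strip()).astype('float')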
df

Unnamed: 0,rank,organizationName,country,revenue,profits,assets,marketValue
0,1,JPMorgan Chase,United States,179.93,41.80,3744.30,399.59
1,2,Saudi Arabian Oil Company (Saudi Aramco),Saudi Arabia,589.47,156.36,660.99,2055.22
2,3,ICBC,China,216.77,52.47,6116.82,203.01
3,4,China Construction Bank,China,203.08,48.25,4977.48,172.99
4,5,Agricultural Bank of China,China,186.14,37.92,5356.86,141.82
...,...,...,...,...,...,...,...
2046,1996,Alfa Laval,Sweden,5.35,489.50,7.82,15.60
2047,1996,Gap,United States,15.62,-202.00,11.39,3.17
2048,1996,Yes Bank,India,3.34,91.60,43.22,5.60
2049,1999,BEKB-BCBE,Switzerland,556.00,167.10,42.97,2.49
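The conversion above removes the `B` and `M` suffixes but does not rescale the numbers, so in the output `489.5 M` appears as `489.50` next to `5.35` for `5.35 B`. A minimal sketch of a unit-aware parser follows. It assumes that `B` and `M` are the only suffixes in the data and that each value has the form `<number> <unit>`; the helper name `parse_amount` and the `UNIT_TO_BILLIONS` table are hypothetical, not part of the original notebook.

```python
import pandas as pd

# Multipliers that express every amount in billions.
# Assumption: 'B' (billions) and 'M' (millions) are the only units present.
UNIT_TO_BILLIONS = {"B": 1.0, "M": 1e-3}


def parse_amount(text: str) -> float:
    """Parse strings such as '3,744.3 B' or '-202 M' into a float in billions."""
    number, unit = text.replace(",", "").split()
    return float(number) * UNIT_TO_BILLIONS[unit]


# A few raw values copied from the table above.
sample = pd.Series(["3,744.3 B", "489.5 M", "-202 M", "556 M"])
parsed = sample.apply(parse_amount)
print(parsed.tolist())
```

Applied with `df[col].apply(parse_amount)`, this would put all four monetary columns on one scale. The downstream outputs in this notebook were produced without it.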


In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2049 entries, 0 to 2050
Data columns (total 7 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   rank              2049 non-null   int64  
 1   organizationName  2049 non-null   object 
 2   country           2049 non-null   object 
 3   revenue           2049 non-null   float64
 4   profits           2049 non-null   float64
 5   assets            2049 non-null   float64
 6   marketValue       2049 non-null   float64
dtypes: float64(4), int64(1), object(2)
memory usage: 128.1+ KB


In [12]:
# Dropping unused columns
df = df.drop(['rank', 'organizationName'], axis=1)
df

Unnamed: 0,country,revenue,profits,assets,marketValue
0,United States,179.93,41.80,3744.30,399.59
1,Saudi Arabia,589.47,156.36,660.99,2055.22
2,China,216.77,52.47,6116.82,203.01
3,China,203.08,48.25,4977.48,172.99
4,China,186.14,37.92,5356.86,141.82
...,...,...,...,...,...
2046,Sweden,5.35,489.50,7.82,15.60
2047,United States,15.62,-202.00,11.39,3.17
2048,India,3.34,91.60,43.22,5.60
2049,Switzerland,556.00,167.10,42.97,2.49


In [13]:
# One-hot encode the nominal feature column
dummies = pd.get_dummies(df['country'], prefix='country')
df = pd.concat([df, dummies], axis=1)
df = df.drop(['country'], axis=1)
df

Unnamed: 0,revenue,profits,assets,marketValue,country_Argentina,country_Australia,country_Austria,country_Belgium,country_Bermuda,country_Brazil,...,country_Sweden,country_Switzerland,country_Taiwan,country_Thailand,country_Turkey,country_United Arab Emirates,country_United Kingdom,country_United States,country_Uruguay,country_Vietnam
0,179.93,41.80,3744.30,399.59,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False
1,589.47,156.36,660.99,2055.22,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,216.77,52.47,6116.82,203.01,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,203.08,48.25,4977.48,172.99,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,186.14,37.92,5356.86,141.82,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2046,5.35,489.50,7.82,15.60,False,False,False,False,False,False,...,True,False,False,False,False,False,False,False,False,False
2047,15.62,-202.00,11.39,3.17,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False
2048,3.34,91.60,43.22,5.60,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2049,556.00,167.10,42.97,2.49,False,False,False,False,False,False,...,False,True,False,False,False,False,False,False,False,False


In [14]:
# Split df into X and y
X = df.drop('marketValue', axis=1)
y = df['marketValue']

In [15]:
# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, shuffle=True, random_state=1)

In [16]:
X_train

Unnamed: 0,revenue,profits,assets,country_Argentina,country_Australia,country_Austria,country_Belgium,country_Bermuda,country_Brazil,country_Canada,...,country_Sweden,country_Switzerland,country_Taiwan,country_Thailand,country_Turkey,country_United Arab Emirates,country_United Kingdom,country_United States,country_Uruguay,country_Vietnam
1718,7.62,987.40,13.47,False,False,False,False,False,True,False,...,False,False,False,False,False,False,False,False,False,False
738,18.93,1.84,24.84,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
16,108.93,14.50,1886.40,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False
1258,3.76,888.60,94.63,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1579,2.27,377.20,186.99,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1793,2.09,482.70,35.36,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1098,4.00,1.10,31.10,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1934,2.10,611.70,3.52,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
236,35.94,5.23,59.88,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False


In [17]:
y_train

1718      7.22
738      12.49
16      142.36
1258      5.95
1579      4.39
         ...  
1793      8.45
1098     14.72
1934     12.77
236     132.08
1063     38.51
Name: marketValue, Length: 1434, dtype: float64

In [18]:
# Scale X
scaler = StandardScaler()
scaler.fit(X_train)
X_train = pd.DataFrame(scaler.transform(X_train), index=X_train.index, columns=X_train.columns)
X_test = pd.DataFrame(scaler.transform(X_test), index=X_test.index, columns=X_test.columns)
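The scaler is fitted on the training split only and then applied to both splits, so no test-set statistics leak into training. As a rough illustration of what that computes, here is a NumPy sketch of the same idea. The function names are hypothetical, and the zero-variance handling is a simplification of what `StandardScaler` does internally.

```python
import numpy as np


def fit_standardizer(train: np.ndarray):
    """Return per-column mean and population standard deviation (ddof=0)."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    # Leave constant columns unscaled instead of dividing by zero.
    std = np.where(std == 0.0, 1.0, std)
    return mean, std


def apply_standardizer(x: np.ndarray, mean: np.ndarray, std: np.ndarray) -> np.ndarray:
    """Standardize x using statistics computed elsewhere (here: on the training split)."""
    return (x - mean) / std


# Toy data: one continuous column and one 0/1 indicator column.
train = np.array([[1.0, 0.0], [3.0, 0.0], [5.0, 1.0]])
test = np.array([[3.0, 1.0]])

mean, std = fit_standardizer(train)
train_scaled = apply_standardizer(train, mean, std)
test_scaled = apply_standardizer(test, mean, std)  # reuses the training statistics
print(train_scaled)
print(test_scaled)
```

Because the one-hot country columns are standardized as well, a rare country yields a large positive value wherever it is present, which is why `country_Brazil` shows `9.414085` in the scaled training frame.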

In [19]:
X_train

Unnamed: 0,revenue,profits,assets,country_Argentina,country_Australia,country_Austria,country_Belgium,country_Bermuda,country_Brazil,country_Canada,...,country_Sweden,country_Switzerland,country_Taiwan,country_Thailand,country_Turkey,country_United Arab Emirates,country_United Kingdom,country_United States,country_Uruguay,country_Vietnam
1718,-0.322508,2.210694,-0.274852,-0.026417,-0.133203,-0.059152,-0.052889,-0.059152,9.414085,-0.167203,...,-0.109532,-0.130466,-0.155839,-0.095648,-0.074901,-0.0838,-0.192032,-0.652263,-0.026417,-0.052889
738,-0.215185,-0.635591,-0.250334,-0.026417,-0.133203,-0.059152,-0.052889,-0.059152,-0.106224,-0.167203,...,-0.109532,-0.130466,-0.155839,-0.095648,-0.074901,-0.0838,-0.192032,-0.652263,-0.026417,-0.052889
16,0.638848,-0.599029,3.764038,-0.026417,-0.133203,-0.059152,-0.052889,-0.059152,-0.106224,-0.167203,...,-0.109532,-0.130466,-0.155839,-0.095648,-0.074901,-0.0838,-0.192032,1.533123,-0.026417,-0.052889
1258,-0.359137,1.925360,-0.099835,-0.026417,-0.133203,-0.059152,-0.052889,-0.059152,-0.106224,-0.167203,...,-0.109532,-0.130466,-0.155839,-0.095648,-0.074901,-0.0838,-0.192032,-0.652263,-0.026417,-0.052889
1579,-0.373276,0.448444,0.099336,-0.026417,-0.133203,-0.059152,-0.052889,-0.059152,-0.106224,-0.167203,...,-0.109532,-0.130466,-0.155839,-0.095648,-0.074901,-0.0838,-0.192032,-0.652263,-0.026417,-0.052889
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1793,-0.374984,0.753126,-0.227648,-0.026417,-0.133203,-0.059152,-0.052889,-0.059152,-0.106224,-0.167203,...,-0.109532,-0.130466,-0.155839,-0.095648,-0.074901,-0.0838,-0.192032,-0.652263,-0.026417,-0.052889
1098,-0.356860,-0.637728,-0.236834,-0.026417,-0.133203,-0.059152,-0.052889,-0.059152,-0.106224,-0.167203,...,-0.109532,-0.130466,-0.155839,-0.095648,-0.074901,-0.0838,-0.192032,-0.652263,-0.026417,-0.052889
1934,-0.374889,1.125677,-0.296309,-0.026417,-0.133203,-0.059152,-0.052889,-0.059152,-0.106224,-0.167203,...,-0.109532,-0.130466,-0.155839,-0.095648,-0.074901,-0.0838,-0.192032,-0.652263,-0.026417,-0.052889
236,-0.053773,-0.625801,-0.174771,-0.026417,-0.133203,-0.059152,-0.052889,-0.059152,-0.106224,-0.167203,...,-0.109532,-0.130466,-0.155839,-0.095648,-0.074901,-0.0838,-0.192032,1.533123,-0.026417,-0.052889


### Training

In [20]:
# Size the input layer from the data (61 feature columns after one-hot encoding)
inputs = tf.keras.Input(shape=(X_train.shape[1],))
x = tf.keras.layers.Dense(128, activation='relu')(inputs)
x = tf.keras.layers.Dense(128, activation='relu')(x)
outputs = tf.keras.layers.Dense(1, activation='linear')(x)

model = tf.keras.Model(inputs=inputs, outputs=outputs)

model.compile(
    optimizer='adam',
    loss='mse'
)

history = model.fit(
    X_train, 
    y_train,
    validation_split=0.2,
    batch_size=32,
    epochs=100,
    callbacks=[
        tf.keras.callbacks.EarlyStopping(
            monitor='val_loss',
            patience=3,
            restore_best_weights=True
        )
    ]
)

Epoch 1/100


Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
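
Training stopped after 13 of the 100 allowed epochs because of the `EarlyStopping` callback. The sketch below is a simplified pure-Python model of that rule, assuming a "lower is better" metric and no minimum improvement threshold. It is not the Keras implementation, and the validation losses in the example are made up, since the log above does not print them.

```python
def simulate_early_stopping(val_losses, patience):
    """Return (last_epoch_run, best_epoch), both 1-indexed.

    Simplified rule: stop once `patience` consecutive epochs
    fail to improve on the best validation loss seen so far.
    """
    best = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch


# Hypothetical validation losses, for illustration only.
losses = [9.0, 7.0, 6.0, 6.5, 6.2, 6.1, 5.0]
stopped_at, best_at = simulate_early_stopping(losses, patience=3)
print(stopped_at, best_at)
```

Under this model, a run with `patience=3` that ends at epoch 13 would have seen its best validation loss at epoch 10, and `restore_best_weights=True` would roll the model back to that epoch's weights.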


### Results

In [21]:
y_pred = np.squeeze(model.predict(X_test))
y_pred



array([ 4.30540428e+01,  8.47149963e+01,  4.35037766e+01,  1.23080490e+02,
        1.10861473e+02,  2.46952763e+01,  3.24130898e+01,  1.26840391e+01,
        7.69616623e+01,  1.67255554e+01,  4.07580414e+01,  6.66974030e+01,
        5.10361633e+01,  8.53139801e+01,  5.48616943e+01,  9.54988174e+01,
        3.57635078e+01,  1.06282797e+01,  5.95956497e+01,  1.04352293e+01,
        2.26722260e+02,  5.56272621e+01,  2.80481529e+01,  4.93739853e+01,
        1.04293503e+02,  2.67971497e+01,  6.72802811e+01,  7.74808273e+01,
        4.81857300e+01,  5.41150856e+01,  2.44541969e+01,  1.87779312e+01,
        3.78569756e+01,  2.27705269e+01,  1.70983715e+01,  7.97764397e+00,
        1.72247696e+02,  1.99783363e+01,  2.69243107e+01,  7.06138992e+01,
        9.48235512e+00,  1.12860382e+02,  7.89486542e+01,  4.23209534e+01,
        5.16139565e+01,  7.01233215e+01,  5.29865608e+01,  2.04170475e+01,
        5.01685600e+01,  3.83889351e+01,  2.13459263e+01,  7.45340805e+01,
        3.02484665e+01,  

In [22]:
y_test

404      38.60
352     122.91
677      25.43
103      51.58
182      47.07
         ...  
1002     38.32
1512      6.58
1827      5.03
1398      8.47
287      67.85
Name: marketValue, Length: 615, dtype: float64

In [23]:
rmse = np.sqrt(np.mean((y_test - y_pred)**2))
rmse

184.098185357547

In [24]:
y_test.describe()

count     615.000000
mean       61.911951
std       196.702525
min         1.020000
25%         8.345000
50%        17.070000
75%        38.295000
max      2309.840000
Name: marketValue, dtype: float64

In [25]:
# R-squared calculation, numerator: residual sum of squares
np.sum((y_test - y_pred)**2)

20843667.238944165

In [26]:
np.sum((y_test - y_test.mean())**2)

23756816.26685854

In [27]:
r2_score = 1 - (np.sum((y_test - y_pred)**2) / np.sum((y_test - y_test.mean())**2))
r2_score

0.12262371334572741
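
As a quick consistency check, the reported metrics can be recomputed from the two sums of squares printed above and the test-set size of 615. The sketch below only redoes that arithmetic; the name `baseline_rmse` is introduced here for the error of a constant predictor that always outputs the test-set mean.

```python
import math

# Values copied from the cell outputs above.
ss_res = 20843667.238944165  # sum of squared residuals of the model
ss_tot = 23756816.26685854   # sum of squared deviations from the test mean
n_test = 615                 # number of test rows

r2 = 1.0 - ss_res / ss_tot
rmse = math.sqrt(ss_res / n_test)
baseline_rmse = math.sqrt(ss_tot / n_test)  # constant predictor at the test mean

print(f"R-squared: {r2:.4f}")
print(f"RMSE: {rmse:.2f}")
print(f"Mean-baseline RMSE: {baseline_rmse:.2f}")
```

Read this way, an R-squared of about 0.12 means the network removes roughly 12% of the squared error that a constant mean prediction would leave on this test split.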

In [29]:
print("RMSE: {:.2f}".format(rmse))
print("R-Squared: {:.4f}".format(r2_score))

fig = px.scatter(
    x=y_pred,
    y=y_test,
    labels={'x': 'Predicted', 'y': 'Actual'},
    title='Actual vs Predicted Values',
    width=700,
    height=700
)
fig.show()

RMSE: 184.10
R-Squared: 0.1226
