
In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Step 1

- Data Preprocessing

In [2]:
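# 1) Load the data
df = pd.read_csv('data/train.csv')
# The source column name has a trailing space; rename it
df = df.rename(columns={'opened_position_qty ': 'opened_position_qty'})

# 2) Remove the last 50,000 rows
df = df.iloc[:-50000, :]

# 3) Count missing values per column
missing_data = df.isnull().sum()

# 4) Remove rows with missing data; .copy() makes df an independent frame,
#    so later column assignments do not raise SettingWithCopyWarning
df = df.dropna().copy()
missing_after = df.isnull().sum()  # should be zero for every column
#print(missing_after)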
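
Slicing a DataFrame and then assigning new columns to the result is the usual trigger for pandas' "value is trying to be set on a copy of a slice" warning. A minimal sketch on a toy frame (the values below are made up for illustration, not taken from `train.csv`) shows that an explicit `.copy()` after slicing and dropping rows yields an independent DataFrame that can be extended safely:

```python
import numpy as np
import pandas as pd

# Toy stand-in for train.csv; values are illustrative only
raw = pd.DataFrame({
    "mid": [10.0, 10.5, np.nan, 11.0, 11.5],
    "last_price": [9.9, 10.6, 10.8, np.nan, 11.4],
})

# Drop the trailing row, drop rows with NaNs, then take an explicit copy
clean = raw.iloc[:-1, :].dropna().copy()

# Adding a column to the copy does not touch the original frame
clean["mid_last_diff"] = clean["mid"] - clean["last_price"]
```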



# Step 2

- Feature Engineering
  - Feature 1: Open/Close Position Ratio and Midprice vs. Last Price
  - Feature 2: Ratio of Total Bid Volume to Total Ask Volume

- Data Split


I chose to train the model on the information available at each timestamp alone, not on a window of preceding timestamps or on any individual earlier timestamp.

In [3]:
# Feature Engineering with midprice and Open/Close positions

# Feature: Ratio of opened_position_qty to closed_position_qty
df['open_close_ratio'] = df['opened_position_qty'] / (df['closed_position_qty'] + 1e-6)  # Add small value to avoid division by zero

# Feature: Difference between midprice and last price
df['mid_last_diff'] = df['mid'] - df['last_price']




In [4]:
# Feature engineering with bid and ask volumes

# Feature (disabled): difference between midprice and best bid/ask prices
#df['mid_bid1_diff'] = df['mid'] - df['bid1']
#df['mid_ask1_diff'] = df['mid'] - df['ask1']

# Total volume across the five bid levels and the five ask levels
total_bid_vol = df['bid1vol'] + df['bid2vol'] + df['bid3vol'] + df['bid4vol'] + df['bid5vol']
total_ask_vol = df['ask1vol'] + df['ask2vol'] + df['ask3vol'] + df['ask4vol'] + df['ask5vol']

# Feature: ratio of total bid volume to total ask volume
df['bid_ask_vol_ratio'] = total_bid_vol / (total_ask_vol + 1e-6)  # Add small value to avoid division by zero
df.head()



|   | id | last_price | mid | opened_position_qty | closed_position_qty | transacted_qty | d_open_interest | bid1 | bid2 | bid3 | ... | bid5vol | ask1vol | ask2vol | ask3vol | ask4vol | ask5vol | y | open_close_ratio | mid_last_diff | bid_ask_vol_ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 3842.8 | 3843.4 | 6.0 | 49.0 | 55.0 | -43 | 3843.0 | 3842.8 | 3842.4 | ... | 6.0 | 1.0 | 4.0 | 4.0 | 1.0 | 13.0 | 0.0 | 0.122449 | 0.6 | 1.347826 |
| 2 | 2 | 3844.0 | 3844.3 | 7.0 | 77.0 | 84.0 | -69 | 3843.8 | 3843.6 | 3843.2 | ... | 12.0 | 1.0 | 16.0 | 10.0 | 4.0 | 9.0 | 0.0 | 0.090909 | 0.3 | 1.025 |
| 3 | 3 | 3843.8 | 3843.4 | 3.0 | 34.0 | 37.0 | -30 | 3843.0 | 3842.8 | 3842.4 | ... | 4.0 | 2.0 | 7.0 | 1.0 | 2.0 | 11.0 | 1.0 | 0.088235 | -0.4 | 1.782609 |
| 4 | 4 | 3843.2 | 3843.1 | 3.0 | 38.0 | 41.0 | -35 | 3842.8 | 3842.4 | 3842.0 | ... | 4.0 | 1.0 | 3.0 | 1.0 | 11.0 | 15.0 | 1.0 | 0.078947 | -0.1 | 1.096774 |
| 5 | 5 | 3843.6 | 3844.2 | 12.0 | 17.0 | 29.0 | -5 | 3843.8 | 3843.4 | 3843.2 | ... | 17.0 | 1.0 | 12.0 | 15.0 | 10.0 | 3.0 | 0.0 | 0.705882 | 0.6 | 0.658537 |
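
As a sanity check, the first two engineered columns can be recomputed by hand from the rows shown in the `df.head()` output. The sketch below copies only the four input columns it needs from the first two rows; `bid_ask_vol_ratio` is left out because several of its input columns are elided in that output.

```python
import pandas as pd

# First two rows of the df.head() output (only the columns needed here)
rows = pd.DataFrame({
    "opened_position_qty": [6.0, 7.0],
    "closed_position_qty": [49.0, 77.0],
    "mid": [3843.4, 3844.3],
    "last_price": [3842.8, 3844.0],
})

eps = 1e-6  # same guard against division by zero as in the notebook
rows["open_close_ratio"] = rows["opened_position_qty"] / (rows["closed_position_qty"] + eps)
rows["mid_last_diff"] = rows["mid"] - rows["last_price"]

# Round to six decimals to compare against the displayed values
print(rows[["open_close_ratio", "mid_last_diff"]].round(6))
```

Rounded to six decimals, the results should agree with the `open_close_ratio` and `mid_last_diff` values displayed for those two rows.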


In [5]:
# 5) Split the data into inputs (X) and the output (y)
# 5.1) The target is the 'y' column
y = df['y']

# 5.2) Use only the three engineered features as inputs
#      (alternative, disabled: every column except 'y')
#X = df.drop(columns=['y'])
X = df[['open_close_ratio', 'mid_last_diff', 'bid_ask_vol_ratio']]

# 6) Scale features (important for SVM)
scaler = StandardScaler()
x_scaled = scaler.fit_transform(X)

# 7) Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(x_scaled, y, test_size=0.2, stratify=y)

X.head()

|   | open_close_ratio | mid_last_diff | bid_ask_vol_ratio |
|---|---|---|---|
| 1 | 0.122449 | 0.6 | 1.347826 |
| 2 | 0.090909 | 0.3 | 1.025 |
| 3 | 0.088235 | -0.4 | 1.782609 |
| 4 | 0.078947 | -0.1 | 1.096774 |
| 5 | 0.705882 | 0.6 | 0.658537 |
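
In the cell above, the scaler is fit on every row before the split, so the validation rows contribute to the scaling statistics, and the split has no fixed seed. One possible alternative, sketched here on synthetic data, is to split first and wrap the scaler and the classifier in a scikit-learn pipeline. The array names, the synthetic data, and `random_state=0` are assumptions for illustration, not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the three engineered features and the binary label
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 1] + 0.5 * rng.normal(size=200) > 0).astype(float)

# Split first; random_state is an illustrative choice for reproducibility
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# The pipeline fits the scaler on the training rows only, then reuses
# that fitted transform when scoring the validation rows
pipe = make_pipeline(StandardScaler(), SVC(kernel="linear", class_weight="balanced"))
pipe.fit(X_tr, y_tr)
val_accuracy = pipe.score(X_va, y_va)
```

With this layout the validation set stays unseen until scoring, and rerunning the cell reproduces the same split.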


# Step 3

- Model Training

In [6]:
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score

# Step 1: Initialize the SVM model
# We will start with a linear kernel and class weighting to handle any imbalance
svm_model = SVC(kernel='linear', class_weight='balanced')

# Step 2: Train the SVM model on the training data
svm_model.fit(X_train, y_train)

# Step 3: Make predictions on the validation set
y_val_pred = svm_model.predict(X_val)

# Step 4: Evaluate the model
accuracy = accuracy_score(y_val, y_val_pred)
print(f"Validation Accuracy: {accuracy}")

# Step 5: Classification report for more detailed evaluation
print("Classification Report:")
print(classification_report(y_val, y_val_pred))

Validation Accuracy: 0.5721144967682363
Classification Report:
              precision    recall  f1-score   support

         0.0       0.70      0.57      0.63      3405
         1.0       0.44      0.58      0.50      2010

    accuracy                           0.57      5415
   macro avg       0.57      0.57      0.56      5415
weighted avg       0.60      0.57      0.58      5415
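
Because the two classes are imbalanced, the accuracy is easier to interpret next to a trivial baseline. The sketch below uses only the support counts and the accuracy printed in the report to compute what always predicting the most frequent class would score.

```python
# Support counts and accuracy taken from the classification report
support = {0.0: 3405, 1.0: 2010}
reported_accuracy = 0.5721144967682363

total = sum(support.values())
# Accuracy of always predicting the most frequent class
majority_baseline = max(support.values()) / total

print(f"Majority-class baseline: {majority_baseline:.4f}")
print(f"Reported accuracy:       {reported_accuracy:.4f}")
```

The baseline comes out at roughly 0.63, above the reported 0.57. That is a plausible side effect of `class_weight='balanced'`, which tends to give up some overall accuracy in exchange for better recall on the minority class (0.58 for class 1.0 here).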



# Step 4

- Inspect Feature Importance (Coefficients)

In [7]:
# Coefficients of the linear SVM, one per input feature
feature_importance = svm_model.coef_[0]

# Print each coefficient next to its feature name.
# The model was trained on the columns of X, so the names must come from X, not df.
features = X.columns
for feature, importance in zip(features, feature_importance):
    print(f"Feature: {feature}, Importance: {importance}")


Feature: open_close_ratio, Importance: -0.08175383828671556
Feature: mid_last_diff, Importance: -0.6168184299226596
Feature: bid_ask_vol_ratio, Importance: 0.310679832525409
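
The model was fit on the three scaled columns of `X` (`open_close_ratio`, `mid_last_diff`, `bid_ask_vol_ratio`), so `coef_[0]` holds exactly three values in that column order. Since the inputs were standardized, the absolute size of each coefficient gives a rough ranking of influence. A small sketch, reusing the three printed values:

```python
import pandas as pd

# Column order used to build X, paired with the three printed coefficients
feature_names = ["open_close_ratio", "mid_last_diff", "bid_ask_vol_ratio"]
coefficients = [-0.08175383828671556, -0.6168184299226596, 0.310679832525409]

coef = pd.Series(coefficients, index=feature_names)
# Rank by absolute size; the sign only gives the direction of the effect
ranked = coef.reindex(coef.abs().sort_values(ascending=False).index)
print(ranked)
```

On these values `mid_last_diff` carries the largest weight, followed by `bid_ask_vol_ratio`, with `open_close_ratio` contributing little.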
