## Model Training

In [23]:
import pandas as pd
import numpy as np

In [6]:
apld_data = pd.read_csv('../data/processed data/APLD_merged.csv')
cney_data = pd.read_csv('../data/processed data/CNEY_merged.csv')
ktta_data = pd.read_csv('../data/processed data/KTTA_merged.csv')
onco_data = pd.read_csv('../data/processed data/ONCO_merged.csv')
tnxp_data = pd.read_csv('../data/processed data/TNXP_merged.csv')

#### Random Forest Classifier

In [60]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

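# Creating the target variable (1: price increase, 0: price decrease or no change)
def create_target_variable(df, close_col):
    # diff() is NaN on the first row and NaN > 0 evaluates to False, so that
    # row is labeled 0 here; prepare_features() later drops it via Prev_Close.
    df['Price_Change'] = (df[close_col].diff() > 0).astype(int)
    df = df.dropna()  # Drop rows with null values in any column
    return df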

apld_data = create_target_variable(apld_data, 'APLD_Close')
cney_data = create_target_variable(cney_data, 'CNEY_Close')
ktta_data = create_target_variable(ktta_data, 'KTTA_Close')
onco_data = create_target_variable(onco_data, 'ONCO_Close')
tnxp_data = create_target_variable(tnxp_data, 'TNXP_Close')

# Sentiment, Volume, and Previous Day's Price
def prepare_features(df, close_col, volume_col, sentiment_col):
    df['Prev_Close'] = df[close_col].shift(1)
    df = df.dropna()  # Drop rows with NaN values 
    X = df[[sentiment_col, volume_col, 'Prev_Close']]
    y = df['Price_Change']
    return X, y

X_apld, y_apld = prepare_features(apld_data, 'APLD_Close', 'APLD_Volume', 'Sentiment')
X_cney, y_cney = prepare_features(cney_data, 'CNEY_Close', 'CNEY_Volume', 'Sentiment')
X_ktta, y_ktta = prepare_features(ktta_data, 'KTTA_Close', 'KTTA_Volume', 'Sentiment')
X_onco, y_onco = prepare_features(onco_data, 'ONCO_Close', 'ONCO_Volume', 'Sentiment')
X_tnxp, y_tnxp = prepare_features(tnxp_data, 'TNXP_Close', 'TNXP_Volume', 'Sentiment')

X_combined = pd.concat([X_apld, X_cney, X_ktta, X_onco, X_tnxp], axis=0)
y_combined = pd.concat([y_apld, y_cney, y_ktta, y_onco, y_tnxp], axis=0)

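# prepare_features() already dropped rows with a NaN Prev_Close, so this
# mean-fill only acts as a safeguard and leaves the data unchanged.
X_combined = X_combined.assign(Prev_Close=X_combined['Prev_Close'].fillna(X_combined['Prev_Close'].mean()))

# train_test_split shuffles rows by default and no random_state is set,
# so the accuracy reported below varies from run to run.
X_train_imputed, X_test_imputed, y_train_imputed, y_test_imputed = train_test_split(X_combined, y_combined, test_size=0.3)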

# Training: Random Forest Classifier
model_imputed = RandomForestClassifier()
model_imputed.fit(X_train_imputed, y_train_imputed)

# Predictions
y_pred_imputed = model_imputed.predict(X_test_imputed)

# Model Evaluation
accuracy_imputed = accuracy_score(y_test_imputed, y_pred_imputed)
classification_rep_imputed = classification_report(y_test_imputed, y_pred_imputed)

print(f"Accuracy: {accuracy_imputed}")
print("Classification Report:\n", classification_rep_imputed)

Accuracy: 0.5374149659863946
Classification Report:
               precision    recall  f1-score   support

           0       0.48      0.48      0.48        66
           1       0.58      0.58      0.58        81

    accuracy                           0.54       147
   macro avg       0.53      0.53      0.53       147
weighted avg       0.54      0.54      0.54       147



***Accuracy:*** 53.74%

This means the model correctly predicted whether the stock price would increase or decrease 53.74% of the time (79 of 147 test samples), which is only slightly better than random guessing (50%).
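
Alongside the 50% reference point, it can help to compare against a majority-class baseline, meaning a classifier that always predicts the more frequent class. The sketch below uses only the support counts from the classification report above (66 and 81); the variable names are illustrative and not part of the original notebook.

```python
# Support counts taken from the classification report above
support = {0: 66, 1: 81}
total = sum(support.values())  # 147 test samples

# Accuracy of a classifier that always predicts the most frequent class
majority_class = max(support, key=support.get)
baseline_accuracy = support[majority_class] / total

model_accuracy = 0.5374149659863946  # value printed by accuracy_score above

print(f"Majority class:    {majority_class}")
print(f"Baseline accuracy: {baseline_accuracy:.4f}")
print(f"Model accuracy:    {model_accuracy:.4f}")
print(f"Difference:        {model_accuracy - baseline_accuracy:+.4f}")
```

With these counts, always predicting an increase would score about 55.1%, slightly above the model's 53.74%, so the 50% comparison is the more generous of the two.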

***Precision:***

- **Class 0 (Price Decrease):** Of all the days the model predicted a price decrease, 48% actually were decreases.

- **Class 1 (Price Increase):** Of all the days the model predicted a price increase, 58% actually were increases. The model's predictions of an increase are therefore more reliable than its predictions of a decrease.

***Recall:***

- **Class 0 (Price Decrease):** 48% recall means that of all the actual price decreases, the model correctly identified 48% of them.

- **Class 1 (Price Increase):** 58% recall means that of all the actual price increases, the model correctly identified 58%.

***F1-Score:***

- **Class 0 (Price Decrease):** The F1-score is the harmonic mean of precision and recall. Both are 48% for this class, so the F1-score is also 48%, the weaker of the two classes.

- **Class 1 (Price Increase):** Precision and recall are both 58%, giving an F1-score of 58% for price increases.
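
The per-class figures above can be reproduced from a confusion matrix. The counts below are not printed anywhere in the notebook; they are inferred as the only integer counts consistent with the reported supports (66 and 81) and the rounded recalls (0.48 and 0.58), so they should be treated as an assumption. The helper `prf` is likewise hypothetical.

```python
# Inferred confusion-matrix counts (assumption: reconstructed from the
# rounded classification report, not taken from the notebook output)
tn, fp = 32, 34   # actual class 0: predicted 0, predicted 1
fn, tp = 34, 47   # actual class 1: predicted 0, predicted 1

def prf(true_pos, false_pos, false_neg):
    """Return precision, recall and F1 for one class."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# For class 0, a "positive" is a predicted decrease
p0, r0, f0 = prf(true_pos=tn, false_pos=fn, false_neg=fp)
# For class 1, a "positive" is a predicted increase
p1, r1, f1 = prf(true_pos=tp, false_pos=fp, false_neg=fn)

accuracy = (tn + tp) / (tn + fp + fn + tp)

print(f"Class 0: precision={p0:.2f} recall={r0:.2f} f1={f0:.2f}")
print(f"Class 1: precision={p1:.2f} recall={r1:.2f} f1={f1:.2f}")
print(f"Accuracy: {accuracy:.4f}")
```

Because the number of false positives equals the number of false negatives in this reconstruction, precision and recall coincide within each class, which is why the report shows the same value in all three columns.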

***Overall Interpretation:***

- The model is better at predicting price increases (Class 1) than decreases (Class 0). One possible reason is a mild class imbalance (the test set contains 81 increases and 66 decreases); another is that the patterns preceding an increase are easier to detect with these features.

- Overall performance is close to chance. The weakest point is precision for Class 0: at 48%, more than half of the model's predicted decreases were wrong.

#### Logistic Regression & Gradient Boosting

In [79]:
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

def prepare_stock_data(df, stock_name):
    df['Date'] = pd.to_datetime(df['Date'])  
    df = df.sort_values('Date')  
    
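    # Create binary target: 1 if next day's price increased, 0 otherwise.
    # The last row has no next-day close; its comparison against NaN evaluates
    # to False, so it is labeled 0 and is not removed by dropna() below.
    df['Target'] = (df[f'{stock_name}_Close'].shift(-1) > df[f'{stock_name}_Close']).astype(int)
    df = df.dropna()  # Drop rows with NaN values in any column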
    
    features = [f'{stock_name}_Close', f'{stock_name}_Volume', 'Sentiment']
    X = df[features]
    y = df['Target']
    
    return X, y, df

X_apld, y_apld, apld_prepared = prepare_stock_data(apld_data, 'APLD')
X_cney, y_cney, cney_prepared = prepare_stock_data(cney_data, 'CNEY')
X_ktta, y_ktta, ktta_prepared = prepare_stock_data(ktta_data, 'KTTA')
X_onco, y_onco, onco_prepared = prepare_stock_data(onco_data, 'ONCO')
X_tnxp, y_tnxp, tnxp_prepared = prepare_stock_data(tnxp_data, 'TNXP')

# Train and test sets for each stock
X_train_apld, X_test_apld, y_train_apld, y_test_apld = train_test_split(X_apld, y_apld, test_size=0.2, shuffle=False)
X_train_cney, X_test_cney, y_train_cney, y_test_cney = train_test_split(X_cney, y_cney, test_size=0.2, shuffle=False)
X_train_ktta, X_test_ktta, y_train_ktta, y_test_ktta = train_test_split(X_ktta, y_ktta, test_size=0.2, shuffle=False)
X_train_onco, X_test_onco, y_train_onco, y_test_onco = train_test_split(X_onco, y_onco, test_size=0.2, shuffle=False)
X_train_tnxp, X_test_tnxp, y_train_tnxp, y_test_tnxp = train_test_split(X_tnxp, y_tnxp, test_size=0.2, shuffle=False)

# Standardize the features for logistic regression
scaler = StandardScaler()
X_train_scaled_apld = scaler.fit_transform(X_train_apld)
X_test_scaled_apld = scaler.transform(X_test_apld)

X_train_scaled_cney = scaler.fit_transform(X_train_cney)
X_test_scaled_cney = scaler.transform(X_test_cney)

X_train_scaled_ktta = scaler.fit_transform(X_train_ktta)
X_test_scaled_ktta = scaler.transform(X_test_ktta)

X_train_scaled_onco = scaler.fit_transform(X_train_onco)
X_test_scaled_onco = scaler.transform(X_test_onco)

X_train_scaled_tnxp = scaler.fit_transform(X_train_tnxp)
X_test_scaled_tnxp = scaler.transform(X_test_tnxp)

# Logistic Regression 
log_reg_apld = LogisticRegression()
log_reg_apld.fit(X_train_scaled_apld, y_train_apld)
y_pred_log_reg_apld = log_reg_apld.predict(X_test_scaled_apld)

log_reg_cney = LogisticRegression()
log_reg_cney.fit(X_train_scaled_cney, y_train_cney)
y_pred_log_reg_cney = log_reg_cney.predict(X_test_scaled_cney)

log_reg_ktta = LogisticRegression()
log_reg_ktta.fit(X_train_scaled_ktta, y_train_ktta)
y_pred_log_reg_ktta = log_reg_ktta.predict(X_test_scaled_ktta)

log_reg_onco = LogisticRegression()
log_reg_onco.fit(X_train_scaled_onco, y_train_onco)
y_pred_log_reg_onco = log_reg_onco.predict(X_test_scaled_onco)

log_reg_tnxp = LogisticRegression()
log_reg_tnxp.fit(X_train_scaled_tnxp, y_train_tnxp)
y_pred_log_reg_tnxp = log_reg_tnxp.predict(X_test_scaled_tnxp)

# Gradient Boosting 
gbc_apld = GradientBoostingClassifier()
gbc_apld.fit(X_train_apld, y_train_apld)
y_pred_gbc_apld = gbc_apld.predict(X_test_apld)

gbc_cney = GradientBoostingClassifier()
gbc_cney.fit(X_train_cney, y_train_cney)
y_pred_gbc_cney = gbc_cney.predict(X_test_cney)

gbc_ktta = GradientBoostingClassifier()
gbc_ktta.fit(X_train_ktta, y_train_ktta)
y_pred_gbc_ktta = gbc_ktta.predict(X_test_ktta)

gbc_onco = GradientBoostingClassifier()
gbc_onco.fit(X_train_onco, y_train_onco)
y_pred_gbc_onco = gbc_onco.predict(X_test_onco)

gbc_tnxp = GradientBoostingClassifier()
gbc_tnxp.fit(X_train_tnxp, y_train_tnxp)
y_pred_gbc_tnxp = gbc_tnxp.predict(X_test_tnxp)

# Accuracy of each model for each stock
log_reg_accuracy_apld = accuracy_score(y_test_apld, y_pred_log_reg_apld)
gbc_accuracy_apld = accuracy_score(y_test_apld, y_pred_gbc_apld)

log_reg_accuracy_cney = accuracy_score(y_test_cney, y_pred_log_reg_cney)
gbc_accuracy_cney = accuracy_score(y_test_cney, y_pred_gbc_cney)

log_reg_accuracy_ktta = accuracy_score(y_test_ktta, y_pred_log_reg_ktta)
gbc_accuracy_ktta = accuracy_score(y_test_ktta, y_pred_gbc_ktta)

log_reg_accuracy_onco = accuracy_score(y_test_onco, y_pred_log_reg_onco)
gbc_accuracy_onco = accuracy_score(y_test_onco, y_pred_gbc_onco)

log_reg_accuracy_tnxp = accuracy_score(y_test_tnxp, y_pred_log_reg_tnxp)
gbc_accuracy_tnxp = accuracy_score(y_test_tnxp, y_pred_gbc_tnxp)

print(f"APLD - Logistic Regression: {log_reg_accuracy_apld}, Gradient Boosting: {gbc_accuracy_apld}")
print(f"CNEY - Logistic Regression: {log_reg_accuracy_cney}, Gradient Boosting: {gbc_accuracy_cney}")
print(f"KTTA - Logistic Regression: {log_reg_accuracy_ktta}, Gradient Boosting: {gbc_accuracy_ktta}")
print(f"ONCO - Logistic Regression: {log_reg_accuracy_onco}, Gradient Boosting: {gbc_accuracy_onco}")
print(f"TNXP - Logistic Regression: {log_reg_accuracy_tnxp}, Gradient Boosting: {gbc_accuracy_tnxp}")

APLD - Logistic Regression: 0.5, Gradient Boosting: 0.5
CNEY - Logistic Regression: 0.59375, Gradient Boosting: 0.46875
KTTA - Logistic Regression: 0.4117647058823529, Gradient Boosting: 0.47058823529411764
ONCO - Logistic Regression: 0.5833333333333334, Gradient Boosting: 0.5416666666666666
TNXP - Logistic Regression: 0.8, Gradient Boosting: 0.4




### APLD
- **Logistic Regression Accuracy:** 50%
- **Gradient Boosting Accuracy:** 50%
  
Both Logistic Regression and Gradient Boosting performed equally, with an accuracy of 50%. This suggests that neither model was able to reliably predict price movements for APLD, likely due to data limitations or a lack of clear signal in the features used.

### CNEY
- **Logistic Regression Accuracy:** 59.4%
- **Gradient Boosting Accuracy:** 46.9%

Logistic Regression performed better (59.4% vs. 46.9%), which may indicate that a linear decision boundary over these three features generalized better for CNEY. Gradient Boosting, a more complex model, fell below 50%, possibly because it overfit the training data or because the features carry too little signal.

### KTTA
- **Logistic Regression Accuracy:** 41.2%
- **Gradient Boosting Accuracy:** 47.1%

The two models performed similarly, and both fell below 50%, so neither did better than a coin flip. This could indicate that the data for KTTA is noisier or harder to predict, making it difficult for either model to capture meaningful patterns.

### ONCO
- **Logistic Regression Accuracy:** 58.3%
- **Gradient Boosting Accuracy:** 54.2%

Both models performed similarly, with Logistic Regression slightly outperforming Gradient Boosting. This suggests that a simpler model like Logistic Regression may be sufficient for this stock, as it captured the key patterns more effectively.

### TNXP
- **Logistic Regression Accuracy:** 80%
- **Gradient Boosting Accuracy:** 40%

Logistic Regression outperformed Gradient Boosting by a wide margin (80% vs. 40%). This gap suggests that the price movements for TNXP may follow a simpler, more linear pattern, making Logistic Regression a better fit. Gradient Boosting, being more flexible, may have overfit to noise in the data.


### Overall Insights:
- **Logistic Regression** achieved the higher accuracy for three of the five stocks (CNEY, ONCO, and TNXP), tied on APLD, and trailed only on KTTA.
- **Gradient Boosting** underperformed in several cases, potentially due to overfitting or the limitations of the feature set.
- The performance for some stocks (like KTTA and APLD) indicates that more data or additional features, such as better sentiment data, could have improved the models.

These results could be attributed to the nature of the sentiment data and its limited scope, which impacts more complex models like Gradient Boosting. This reinforces the observation that richer data would have improved overall performance.
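
One way to gauge how much weight an individual accuracy figure can bear is a confidence interval for a proportion. The sketch below implements the Wilson score interval in plain Python. `wilson_interval` is a hypothetical helper, and the counts in the usage lines (19 of 32 and 15 of 32) are an assumption inferred from the CNEY accuracies 0.59375 and 0.46875, since the notebook does not print the test-set sizes.

```python
import math

def wilson_interval(correct, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = correct / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width

# Assumed counts: 19/32 = 0.59375 and 15/32 = 0.46875
for label, correct in [("Logistic Regression", 19), ("Gradient Boosting", 15)]:
    low, high = wilson_interval(correct, 32)
    print(f"{label}: {correct / 32:.3f} (95% CI {low:.3f} to {high:.3f})")
```

For these assumed counts, both intervals contain 0.5 and overlap each other, so the gap between the two models on that stock is within what sampling noise alone could produce.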
