CSE 404 Project

### Project Overview

### Data Cleaning and Preprocessing

##### Exploratory Data Analysis (EDA)
- Check the structure of the dataset:
    - Verify column names and data types
- Identify missing values:
    - Drop weekends, as they are non-trading days.
    - If missing values exist in **Open/Close** and **High/Low**: fill with the average of the previous and next day's values.
    - If missing values exist in **Volume**: fill with the median.
- Identify duplicate rows:
    - If duplicate trading days exist for the same stock, drop them.
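
The weekend and duplicate checks above can be sketched on a toy frame. This is only an illustration: `Date_` is the date column used later in this notebook, while `Close_AAA` and the values are made up. Dropping duplicates by date (rather than by whole row) is an assumption about what "duplicate trading days" means.

```python
import pandas as pd

# Toy data: Fri, Sat, Sun, Mon, and a repeated Mon (hypothetical values).
toy = pd.DataFrame({
    "Date_": pd.to_datetime(
        ["2024-01-05", "2024-01-06", "2024-01-07", "2024-01-08", "2024-01-08"]
    ),
    "Close_AAA": [10.0, 10.0, 10.0, 11.0, 11.5],
})

# dayofweek is 0 for Monday through 6 for Sunday, so < 5 keeps weekdays only.
weekdays = toy[toy["Date_"].dt.dayofweek < 5]

# Keep the first row seen for each trading day.
deduped = weekdays.drop_duplicates(subset="Date_", keep="first")
print(deduped)
```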

##### Split dataset into training, validation, and test sets
- Use a **70/15/15 split** for training, validation, and testing
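
A minimal sketch of how a 70/15/15 split can be obtained with two calls to `train_test_split`: first hold out 30%, then cut that 30% in half. The toy arrays and names below are placeholders, not the project data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy features and three roughly balanced classes (placeholder data).
X_toy = np.arange(400).reshape(200, 2)
y_toy = np.arange(200) % 3

# Step 1: 70% train, 30% held out.
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)
# Step 2: split the held-out 30% evenly into validation and test (15% each).
X_va, X_te, y_va, y_te = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=42, stratify=y_tmp
)

print(len(X_tr), len(X_va), len(X_te))
```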


##### Feature Scaling 
- Scale features using **StandardScaler**
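
For reference, `StandardScaler` subtracts each column's mean and divides by its population standard deviation. The sketch below reproduces that arithmetic with NumPy on made-up numbers, using statistics from the training rows only and reusing them for the held-out rows.

```python
import numpy as np

# Placeholder data: three training rows and one held-out row.
toy_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
toy_test = np.array([[4.0, 40.0]])

# Statistics come from the training rows only.
mean = toy_train.mean(axis=0)
std = toy_train.std(axis=0)  # population std (ddof=0)

toy_train_scaled = (toy_train - mean) / std
toy_test_scaled = (toy_test - mean) / std  # reuse the training mean and std

print(toy_train_scaled.mean(axis=0), toy_train_scaled.std(axis=0))
print(toy_test_scaled)
```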

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt  # used by the coefficient and confusion-matrix plots
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, ConfusionMatrixDisplay

In [None]:
# Load the CSV file
stocks = pd.read_csv("stocks.csv")

# Display basic info and first few rows
print(stocks.columns)
stocks.info() 
stocks.head()

In [None]:
print("Data Types:", stocks.dtypes)
print("Missing Values:", stocks.isnull().sum())

Each stock has separate columns for Close, High, Low, Open, and Volume. The dataset contains many NaN values for stocks that did not exist or did not trade on a specific date.
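
One way to see where each stock starts trading is to tabulate, per column, the number of missing values and the first row with a valid value. The sketch below uses a toy frame with hypothetical column names (`Close_AAA`, `Close_BBB`).

```python
import numpy as np
import pandas as pd

# Toy frame: the second stock has no data for its first two rows.
toy = pd.DataFrame({
    "Close_AAA": [1.0, 2.0, 3.0, 4.0],
    "Close_BBB": [np.nan, np.nan, 5.0, 6.0],
})

summary = pd.DataFrame({
    "missing": toy.isnull().sum(),
    "first_valid_row": pd.Series(
        {col: toy[col].first_valid_index() for col in toy.columns}
    ),
})
print(summary)
```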

In [None]:
#Deal with missing values

#Stock market closed on weekends so drop weekends
stocks['Date_'] = pd.to_datetime(stocks['Date_'])
stocks = stocks.loc[stocks['Date_'].dt.dayofweek < 5]
#https://gpttutorpro.com/pandas-dataframe-filtering-using-datetime-methods/

#Replace missing values in Open,Close, High, and Low columns
for col in stocks.columns:
    if "Open" in col or "Close" in col or "High" in col or "Low" in col:
        # Find the first valid index where trading starts
        first_valid_index = stocks[col].first_valid_index()

        if first_valid_index is not None:
            # Fill missing values only after the stock starts trading with average of previous and next day's values.
            stocks.loc[first_valid_index:,col] = stocks.loc[first_valid_index:,col].fillna((stocks[col].shift(1)+stocks[col].shift(-1))/2)

            # Forward-fill and backward-fill only after the first valid trading day (fills with closest)
            stocks.loc[first_valid_index:,col] = stocks.loc[first_valid_index:,col].ffill().bfill()

#https://medium.com/@farisyid/penggunaan-ffill-dan-bfill-pada-proses-data-cleaning-b4f3bfec9767#:~:text='ffill'%20which%20means%20forward%20fill%20and%20'bfill',such%20as%20DataFrame%20or%20Series%20in%20Pandas.&text=Instead%2C%20the%20'bfill'%20method%20fills%20the%20missing,the%20missing%20value%20in%20the%20data%20sequence.

#replace missing values in volume columns with the median
volume_cols = []
for col in stocks.columns:
    if "Volume" in col:
        volume_cols.append(col)
for col in volume_cols:
    stocks.loc[:, col] = stocks[col].fillna(stocks[col].median())

stocks

Missing values were handled in three steps. First, weekends were removed, since they are non-trading days. Second, for the Open, Close, High, and Low prices, each missing value after a stock's first trading day was filled with the average of the previous and next trading days' values; any gaps that remained (for example, runs of consecutive missing days) were forward-filled and then backward-filled with the nearest valid price. Third, missing Volume values were replaced with the column median, which is less sensitive to extreme trading days than the mean. Price values before a stock's first trading day are left as NaN at this stage.
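
A small worked example of the price-filling logic on a toy series (made-up values): a single-day gap gets the average of its neighbors, a two-day gap has no valid neighbor average and falls through to the forward-fill, and the leading NaNs before the first trade are untouched.

```python
import numpy as np
import pandas as pd

# Toy prices: not yet trading, one single-day gap, one two-day gap.
prices = pd.Series([np.nan, np.nan, 10.0, np.nan, 14.0, np.nan, np.nan, 20.0])

start = prices.first_valid_index()
neighbor_avg = (prices.shift(1) + prices.shift(-1)) / 2

filled = prices.copy()
# Average of previous and next day, only from the first trading day onward.
filled.loc[start:] = filled.loc[start:].fillna(neighbor_avg)
# Remaining gaps take the nearest valid value.
filled.loc[start:] = filled.loc[start:].ffill().bfill()

print(filled.tolist())
# → [nan, nan, 10.0, 12.0, 14.0, 14.0, 14.0, 20.0]
```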

In [None]:
#check if duplicate rows exist
print("Number of duplicate rows:", stocks.duplicated().sum())


No duplicate rows were found.

In [None]:
stocks = stocks.drop(['Unnamed: 0', 'Date_'], axis = 1)

In [None]:
stocks["Percent_Change"] = (stocks["Close_AAPL"].shift(-1) - stocks["Close_AAPL"]) / stocks["Close_AAPL"]

def action(per_change):
    if per_change>0.01:
        return 2 # buy
    elif per_change<-0.01:
        return 0 #sell
    else:
        return 1 #hold
        
stocks["Target"] = stocks["Percent_Change"].apply(action)
stocks = stocks.drop(columns=["Percent_Change"])
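
The labeling rule above can be checked on a few hand-picked percent changes. The sketch below is a vectorized equivalent using `np.select`; the helper name `label_moves` and its `threshold` parameter are illustrative, not part of the notebook. Note that a NaN change (the last row, which has no next day) fails both comparisons and is labeled Hold, which is why that row is dropped before modeling.

```python
import numpy as np

def label_moves(pct_change, threshold=0.01):
    """Return 2 (buy), 0 (sell), or 1 (hold) for each next-day percent change."""
    pct_change = np.asarray(pct_change, dtype=float)
    return np.select(
        [pct_change > threshold, pct_change < -threshold],
        [2, 0],
        default=1,
    )

labels = label_moves([0.025, -0.03, 0.004, -0.01, np.nan])
print(labels.tolist())
# → [2, 0, 1, 1, 1]
```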

### Logistic Regression

In [None]:
# ---------------------------
# Logistic Regression Setup
# ---------------------------

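# The target variable was created in the previous cell from the next-day
# percent change in Close_AAPL: 0 = sell, 1 = hold, 2 = buy.
# Earlier binary version (up = 1, down = 0), kept for reference:
#stocks["Target"] = (stocks["Close_AAPL"].shift(-1) > stocks["Close_AAPL"]).astype(int)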

# Drop the last row (no "next day" available)
stocks = stocks[:-1]

# Remove the date column if it is still present (non-numeric)
if "Date_" in stocks.columns:
    stocks = stocks.drop(columns=["Date_"])

# Define features (X) and target (y)
X = stocks.drop(columns=["Target"])
y = stocks["Target"]

# ----- Additional Safeguard: Impute any remaining missing values in features -----
# This ensures that even if some NaNs were missed during cleaning, they are filled.
X = X.fillna(X.mean())

# Confirm no NaNs remain
assert X.isnull().sum().sum() == 0, "There are still missing values in the features!"

# ---------------------------
# Split the Data
# ---------------------------
# Use a 70/15/15 split for training, validation, and testing.
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp)

# ---------------------------
# Feature Scaling
# ---------------------------
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled   = scaler.transform(X_val)
X_test_scaled  = scaler.transform(X_test)

# ---------------------------
# Train Logistic Regression Model
# ---------------------------
log_reg = LogisticRegression(max_iter=500, random_state=42, multi_class='multinomial')
log_reg.fit(X_train_scaled, y_train)

# ---------------------------
# Classification Report
# ---------------------------
y_pred = log_reg.predict(X_test_scaled)
print(classification_report(y_test, y_pred, target_names=["Sell", "Hold", "Buy"]))

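# Only applies to binary classification; kept for reference.
# To use it, also import roc_curve and auc from sklearn.metrics.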
# # ---------------------------
# # 1. Plot the ROC Curve
# # ---------------------------
# # Get predicted probabilities for the positive class on the test set.
# y_test_prob = log_reg.predict_proba(X_test_scaled)[:, 1]

# # Compute the ROC curve and the AUC.
# fpr, tpr, thresholds = roc_curve(y_test, y_test_prob)
# roc_auc = auc(fpr, tpr)

# plt.figure(figsize=(8, 6))
# plt.plot(fpr, tpr, color='blue', label=f'ROC curve (area = {roc_auc:.2f})')
# plt.plot([0, 1], [0, 1], color='red', linestyle='--')
# plt.xlabel('False Positive Rate')
# plt.ylabel('True Positive Rate')
# plt.title('ROC Curve for Logistic Regression Model')
# plt.legend(loc='lower right')
# plt.show()

# ---------------------------
# 2. Plot the Logistic Regression Coefficients
# ---------------------------
# Retrieve the coefficients and corresponding feature names.

class_labels = {0: "Sell", 1: "Hold", 2: "Buy"}
for i in range(3):
    coef = log_reg.coef_[i]
    feature_names = X.columns
    
    # Create a DataFrame and sort by absolute coefficient values.
    coef_df = pd.DataFrame({'Feature': feature_names, 'Coefficient': coef})
    coef_df['AbsCoefficient'] = coef_df['Coefficient'].abs()
    coef_df = coef_df.sort_values(by='AbsCoefficient', ascending=False)
    
    # Plot the top 20 features.
    plt.figure(figsize=(10, 8))
    plt.barh(coef_df['Feature'].head(20)[::-1], coef_df['Coefficient'].head(20)[::-1])
    plt.xlabel('Coefficient Value')
    plt.title(f'Top 20 Features Influencing "{class_labels[i]}" Prediction')
    plt.show()

# ---------------------------
# 3. Plot the Confusion Matrix
# ---------------------------
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=["Sell", "Hold", "Buy"])
disp.plot(cmap=plt.cm.Blues)
plt.title('Confusion Matrix for Multi-Class Logistic Regression')
plt.show()


In [None]:
# ---------------------------
# Evaluate the Model
# ---------------------------

# Evaluate on Validation Set
y_val_pred = log_reg.predict(X_val_scaled)
val_accuracy = accuracy_score(y_val, y_val_pred)
print(f"Validation Accuracy: {val_accuracy:.2f}")
print("Validation Classification Report:\n", classification_report(y_val, y_val_pred, target_names=["Sell", "Hold", "Buy"]))

# Evaluate on Test Set
y_test_pred = log_reg.predict(X_test_scaled)
test_accuracy = accuracy_score(y_test, y_test_pred)
print(f"\nTest Accuracy: {test_accuracy:.2f}")
print("Test Classification Report:\n", classification_report(y_test, y_test_pred, target_names=["Sell", "Hold", "Buy"]))

### Feedforward Neural Network Model

In [None]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

In [None]:
def build_model(input_shape):
    model = keras.Sequential([
        layers.Dense(64, activation='relu', input_shape=(input_shape,)),
        layers.Dense(32, activation='relu'),
        layers.Dense(3, activation='softmax')
    ])
    
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    return model

model = build_model(X_train.shape[1])

model.fit(X_train_scaled, y_train, epochs=20, batch_size=32, validation_data=(X_val_scaled, y_val))
y_test_probs = model.predict(X_test_scaled)
y_val_probs = model.predict(X_val_scaled)
y_test_pred_nn = np.argmax(y_test_probs, axis=1)
y_val_pred_nn  = np.argmax(y_val_probs, axis=1)

In [None]:
loss, accuracy = model.evaluate(X_test_scaled, y_test)
print(f'\nTest Accuracy: {accuracy:.2f}')
print("\nTest Set Classification Report:\n", classification_report(y_test, y_test_pred_nn, target_names=["Sell", "Hold", "Buy"]))
print("\nValidation Set Classification Report:\n", classification_report(y_val, y_val_pred_nn, target_names=["Sell", "Hold", "Buy"]))

### Evaluation