## ROOT CAUSES OF LIMITED PERFORMANCE (~72% Accuracy)

---

### 1. CLASS IMBALANCE PROBLEM ⚠️
- **Positive sentiment dominates the dataset**
- **Very few neutral samples** (class imbalance noted in observations)
- Models struggle to learn minority classes effectively
- This explains why precision/recall may be uneven across classes
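
To make the imbalance concrete, here is a minimal sketch that counts labels and derives the weights a "balanced" scheme assigns, using `n_samples / (n_classes * class_count)`. The label list and the helper name `balanced_class_weights` are illustrative assumptions, not the notebook's data or code.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Return {label: weight} using n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

# Toy labels (hypothetical distribution, for illustration only)
toy_labels = ["positive"] * 6 + ["negative"] * 3 + ["neutral"] * 1
weights = balanced_class_weights(toy_labels)
for label, weight in sorted(weights.items()):
    print(f"{label}: count={toy_labels.count(label)}, weight={weight:.2f}")
```

The rarest class receives the largest weight, so its errors count more during training.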

---

### 2. INSUFFICIENT TEXT PREPROCESSING
**Current preprocessing only does:**
- Lowercase conversion
- Punctuation removal
- Whitespace normalization

**Missing critical steps:**
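- No stopword removal (common words add noise to averaged Word2Vec vectors, though negations such as "not" carry sentiment and should be kept)
- No lemmatization/stemming (which would reduce vocabulary sparsity)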
- No handling of special characters, URLs, numbers
- No removal of extremely short/long outlier texts
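
The outlier-length step could look roughly like the sketch below. The helper name `keep_by_word_count` and the thresholds (3 and 200 words) are assumptions chosen for illustration; sensible cutoffs depend on the actual length distribution in the dataset.

```python
def keep_by_word_count(texts, min_words=3, max_words=200):
    """Return indices of texts whose word count lies in [min_words, max_words]."""
    return [
        i for i, text in enumerate(texts)
        if min_words <= len(text.split()) <= max_words
    ]

# Toy examples: one too short, one typical, one too long
toy_texts = [
    "up",
    "shares rallied after strong quarterly earnings",
    "word " * 500,
]
kept = keep_by_word_count(toy_texts)
filtered_texts = [toy_texts[i] for i in kept]
print(kept)
```

Returning indices rather than texts makes it easy to filter the labels with the same mask.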

---

### 3. WEAK CORRELATIONS WITH NUMERIC FEATURES
- Sentiment shows weak linear correlation with stock prices, volume, news length
- The notebook correctly identifies: "text features are more informative"
- The models use only text embeddings; no engineered or hybrid features were tried, so any nonlinear signal in the numeric columns remains untested

---

### 4. SUBOPTIMAL EMBEDDINGS
**Word2Vec Issues:**
- Small training corpus → poor quality embeddings
- `min_count=2` may exclude rare but important words
- Simple averaging of word vectors loses word order/context
- `vector_size=100` may be too small for complex sentiment nuances
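
The word-order point can be checked directly: averaging is order-invariant, so two sentences with the same words in a different order get the same vector. This is a small sketch with made-up 3-dimensional vectors; `toy_vectors` and `mean_embedding` are hypothetical names, not the notebook's.

```python
import numpy as np

# Toy 3-dimensional word vectors (made-up values, not from a trained model)
toy_vectors = {
    "profits": np.array([0.9, 0.1, 0.0]),
    "beat": np.array([0.2, 0.8, 0.1]),
    "losses": np.array([-0.7, 0.3, 0.2]),
}

def mean_embedding(sentence, vectors):
    """Average the vectors of known words; return zeros if none are known."""
    known = [vectors[w] for w in sentence.split() if w in vectors]
    if not known:
        dim = len(next(iter(vectors.values())))
        return np.zeros(dim)
    return np.mean(known, axis=0)

a = mean_embedding("profits beat losses", toy_vectors)
b = mean_embedding("losses beat profits", toy_vectors)
print(np.allclose(a, b))  # True: word order does not affect the average
```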

**Sentence Transformer:**
- all-MiniLM-L6-v2 is a general-purpose model
- Not fine-tuned for financial sentiment (domain-specific)
- May not capture market-specific terminology

---

### 5. MODEL ARCHITECTURE LIMITATIONS
**Random Forest:**
- `max_depth=5` is very shallow for complex text patterns
- May be underfitting
- No hyperparameter tuning performed

**Neural Network:**
- Simple architecture (3 layers)
- No regularization besides dropout
- Fixed `epochs=30` without early stopping
- No learning rate scheduling
- Validation data is used only for monitoring, not to stop training or to select the best weights

---

## PRACTICAL RECOMMENDATIONS TO IMPROVE BEYOND 72%

### Priority 1: Address Class Imbalance 🎯
- Use class weights in models:
```python
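import numpy as np
from sklearn.utils.class_weight import compute_class_weight

class_weights = compute_class_weight('balanced', classes=np.unique(y_train), y=y_train)
# Keys are 0..n-1 in sorted label order; Keras needs integer keys, so this only
# matches if the labels are encoded as integers in that same order
class_weight_dict = dict(enumerate(class_weights))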
```

- Apply SMOTE for oversampling minority classes:
```python
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train_embeddings, y_train)
```
- Keep the stratified train/test split (already in place)

---

### Priority 2: Enhanced Text Preprocessing 📝
```python
import re
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Download required resources
nltk.download('stopwords')
nltk.download('wordnet')

# Build these once instead of on every call; keep negations, which carry sentiment
STOP_WORDS = set(stopwords.words('english')) - {'no', 'not', 'nor'}
LEMMATIZER = WordNetLemmatizer()

def enhanced_preprocess(text):
    # Existing steps...
    text = text.lower()
    text = re.sub(r'http\S+|www\S+', '', text)  # Remove URLs
    text = re.sub(r'\d+', '', text)  # Remove numbers
    text = text.translate(str.maketrans('', '', string.punctuation))
    # NEW: Remove stopwords
    tokens = [word for word in text.split() if word not in STOP_WORDS]
    # NEW: Lemmatization
    tokens = [LEMMATIZER.lemmatize(word) for word in tokens]
    return ' '.join(tokens)
```
---

### Priority 3: Feature Engineering 🔧
- Create hybrid features combining text + numeric signals:
```python
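import numpy as np
from sklearn.preprocessing import StandardScaler

numeric_cols = ['Price_Change', 'Price_Range', 'Volume', 'News_Length', 'Word_Count']
# Assumes y_train / y_test are pandas Series that kept df's index through the split
train_numeric = df.loc[y_train.index, numeric_cols].fillna(0)
test_numeric = df.loc[y_test.index, numeric_cols].fillna(0)
# Fit the scaler on training rows only so test statistics do not leak in
scaler = StandardScaler()
train_numeric_scaled = scaler.fit_transform(train_numeric)
test_numeric_scaled = scaler.transform(test_numeric)
X_train_hybrid = np.concatenate([X_train_st, train_numeric_scaled], axis=1)
X_test_hybrid = np.concatenate([X_test_st, test_numeric_scaled], axis=1)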
```
---

### Priority 4: Better Embeddings 🚀
- **Option A: Domain-specific model**
```python
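from sentence_transformers import SentenceTransformer
# Both checkpoints are published as classifiers, not sentence-embedding models.
# SentenceTransformer should fall back to mean pooling over token embeddings,
# so sanity-check the resulting vectors before relying on them.
model_sentf = SentenceTransformer('ProsusAI/finbert')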
# or
model_sentf = SentenceTransformer('yiyanghkust/finbert-tone')
```
- **Option B: Improved Word2Vec**
```python
from gensim.models import Word2Vec
word2vec_model = Word2Vec(
    sentences=tokenized_train,
    vector_size=300,  # Increase from 100
    window=10,        # Increase context window
    min_count=1,      # Don't exclude rare words
    sg=1,             # Skip-gram (sg=0 would be CBOW)
    epochs=20,        # More training
    seed=42
)
```
---

### Priority 5: Model Architecture Improvements ⚙️
- **Random Forest:**
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 20, None],  # Increase depth!
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'class_weight': ['balanced']  # Handle imbalance
}
# 81 parameter combinations x 5 folds; n_jobs=-1 runs the fits in parallel
rf_tuned = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='f1_weighted',
    n_jobs=-1,
)
rf_tuned.fit(X_train_st, y_train)
```

- **Neural Network:**
```python
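import tensorflow as tf
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.models import Sequential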
nn_improved = Sequential([
    Dense(256, activation='relu', input_shape=(X_train_st.shape[1],)),
    Dropout(0.4),
    Dense(128, activation='relu'),
    Dropout(0.3),
    Dense(64, activation='relu'),
    Dropout(0.2),
    Dense(32, activation='relu'),
    Dense(3, activation='softmax')
])
nn_improved.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)
early_stop = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=3)
history = nn_improved.fit(
    X_train_st_np, y_train_mapped_st,
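    # Hold out part of the training data for validation; validating on the
    # test set would leak it into early stopping and inflate the final score
    validation_split=0.2,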
    epochs=50,
    batch_size=32,
    class_weight=class_weight_dict,  # Add class weights!
    callbacks=[early_stop, reduce_lr]
)
```
---

### Priority 6: Try Advanced Models 🏆
- **Ensemble approach**
```python
from sklearn.ensemble import VotingClassifier
from xgboost import XGBClassifier
# scale_pos_weight only applies to binary problems; with three classes, the
# imbalance is handled by the class-weighted Random Forest instead
xgb = XGBClassifier(
    n_estimators=200,
    max_depth=7,
    learning_rate=0.1,
    random_state=42
)
ensemble = VotingClassifier(
    estimators=[('rf', rf_tuned.best_estimator_), ('xgb', xgb)],
    voting='soft'
)
ensemble.fit(X_train_st, y_train)  # Refits clones of both estimators
```
---

## EXPECTED IMPROVEMENTS
- If these recommendations work as intended, rough targets (estimates, not measured results) would be:
    - **Accuracy:** 78-85% (from 72%)
    - **F1-Score:** 0.78-0.84 (balanced across classes)
    - **Recall for minority class:** +15-20%
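
Because accuracy can hide failures on rare classes, it helps to track per-class recall alongside it. The sketch below uses toy labels and a hypothetical helper `per_class_recall`; in the notebook, `sklearn.metrics.classification_report` reports the same per-class figures.

```python
def per_class_recall(y_true, y_pred):
    """Return {label: recall} for every label present in y_true."""
    recalls = {}
    for label in sorted(set(y_true)):
        total = sum(1 for t in y_true if t == label)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        recalls[label] = hits / total
    return recalls

# Toy case: a model that always predicts the majority class
y_true = ["positive"] * 7 + ["negative"] * 2 + ["neutral"]
y_pred = ["positive"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recalls = per_class_recall(y_true, y_pred)
print(f"accuracy={accuracy:.2f}", recalls)
```

Here accuracy looks respectable while recall on both minority classes is zero, which is why the minority-class recall target above should be checked separately.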

**Implementation Priority Order:**
- ✅ Class imbalance handling (biggest impact)
- ✅ Enhanced preprocessing
- ✅ Hyperparameter tuning for RF
- ✅ Domain-specific embeddings (FinBERT)
- ✅ Feature engineering (hybrid approach)
- ✅ Improved NN architecture with callbacks

---

**Would you like me to implement any of these improvements in your notebook?**
