# **Stock Market Sentiment Analysis**  

Predicting Market Returns Using Online Investor Sentiment  
Presented by: Chuxuan Ma, Zijian Wang  
Date: March 3, 2025

## Background & Motivation

- **Investor Sentiment Matters:**  
  Investor emotions and sentiment have been shown to affect market momentum and price fluctuations.
  
- **Stock Market as a Barometer:**  
  The Shanghai Composite Index is widely regarded as a reflection of the overall market sentiment and economic conditions in China.

- **Problem Context:**  
  - How do online opinions, posts, and comments on finance forums aggregate into measurable market sentiment?  
  - Can we quantify this sentiment and use it to improve predictions of next-day returns?

## Project Goals

- **Data Integration:**  
  Integrate labeled Chinese financial sentiment data from GitHub, Kaggle, and Hugging Face.

- **Modeling Approach:**  
  - Fine-tune a Chinese pre-trained BERT model with an added fully connected layer to output a sentiment score.
  - Train the model for 10 epochs on our curated dataset.

- **Sentiment Quantification:**  
  - Use Beautiful Soup to scrape posts/comments from the **EastMoney Shanghai Composite Index (SSEC) forum** (**2018/12/1–2025/1/1**).
  - Predict sentiment scores with the BERT model.
  
- **Index Construction:**  
  Calculate a monthly sentiment index with a rolling window to capture the impact on the Shanghai Composite Index.

- **Investment Strategy:**  
  Incorporate the sentiment index as an additional factor in a regression model to predict next-day returns and design a trading strategy.

## Approach
### 1. Data Collection & Model Training

- **Step 1: Labeled Data Acquisition**  
  - Sources: GitHub, Kaggle, Hugging Face  
  - Labels: “Positive” (1), “Neutral” (0), “Negative” (-1)

- **Step 2: BERT Sentiment Score Model**
  - **Model:** Chinese pre-trained BERT + Fully Connected layer
  - **Training:** Fine-tune on the collected dataset for 10 epochs  
  - **Objective:** Predict sentiment probability for each financial post/comment

- **Mathematical Formulation for a Comment:**

  $$
  \text{score}_i = f_{\text{BERT}, \text{prob}}(\text{comment}_i)
  $$
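As a rough illustration of how a three-class head could yield a single score in $[0, 1]$, the sketch below applies a softmax to the class logits and collapses the probabilities into one number. The mapping used here (positive probability plus half of the neutral probability) and the function names are assumptions made for this example; the slides do not specify how the score is derived from the model output.

```python
import math

def softmax(logits):
    """Convert raw class logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sentiment_score(logits):
    """Collapse (negative, neutral, positive) logits into a score in [0, 1].

    Assumption for illustration: score = P(positive) + 0.5 * P(neutral),
    so a fully neutral comment scores 0.5.
    """
    p_neg, p_neu, p_pos = softmax(logits)
    return p_pos + 0.5 * p_neu

# Hypothetical logits in (negative, neutral, positive) order.
print(round(sentiment_score([0.0, 0.0, 0.0]), 2))  # equal logits -> 0.5
```

Under this assumed mapping, a comment the model considers clearly positive scores close to 1 and a clearly negative one scores close to 0.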

### 2. Web Scraping with Beautiful Soup

- **Data Source:** EastMoney (东方财富网), Shanghai Composite Index forum (上证指数吧)  
- **Time Period:** 2018/12/1 – 2025/1/1  
- **Target Data:** All available posts and comments

- **Process:**  
  - Use Beautiful Soup for HTML parsing.  
  - Extract the text and timestamp of each post/comment, along with other metadata.  
  - Apply the BERT model to each post/comment to obtain the sentiment score.

- **Example Data Table:**

<center>

| Post ID | Date       | Excerpt                          | Predicted Score |
|---------|------------|----------------------------------|-----------------|
| 1       | 2024-05-01 | "Investor optimism is rising."   | 0.85            |
| 2       | 2024-05-01 | "Concerns over market volatility." | 0.30            |
| 3       | 2024-05-01 | "Mixed signals with uncertainty." | 0.50            |

</center>
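The parsing step described above might look roughly like the following sketch. The HTML snippet, tag names, and class names are hypothetical placeholders, since the real page structure of the forum is not described in these slides; only the Beautiful Soup calls themselves (`find_all`, `find`, `get_text`) are standard.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the real forum page structure is not described in
# these slides, so the tag and class names below are placeholders.
SAMPLE_HTML = """
<div class="post" data-id="1">
  <span class="time">2024-05-01</span>
  <p class="text">Investor optimism is rising.</p>
</div>
<div class="post" data-id="2">
  <span class="time">2024-05-01</span>
  <p class="text">Concerns over market volatility.</p>
</div>
"""

def parse_posts(html):
    """Extract id, date, and text for every post block in the page."""
    soup = BeautifulSoup(html, "html.parser")
    posts = []
    for block in soup.find_all("div", class_="post"):
        posts.append({
            "id": block["data-id"],
            "date": block.find("span", class_="time").get_text(strip=True),
            "text": block.find("p", class_="text").get_text(strip=True),
        })
    return posts

posts = parse_posts(SAMPLE_HTML)
for post in posts:
    print(post["id"], post["date"], post["text"])
```

Each parsed record would then be passed to the sentiment model to fill in the predicted score column.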

### 3. Monthly Sentiment Index Calculation

- **Rolling Window Method:**  
  For each month $t \in \{ \text{2019-01-01}, \dots, \text{2025-01-01} \}$, use the posts/comments dated within the preceding one-month window $[t-1, t]$.

- **Aggregation Process:**  
  For $n_t$ posts/comments during $[t-1, t]$:

  - Positive sentiment sum:

    $$\text{pos}_t = \sum_{i=1}^{n_t} \text{score}_i$$

  - Negative sentiment sum:

    $$
    \text{neg}_t = \sum_{i=1}^{n_t} (1-\text{score}_i)
    $$

  - **Sentiment Index:**

    $$
    \text{index}_t = \ln\left(\frac{1+\text{pos}_t}{1+\text{neg}_t}\right)
    $$

- **Example Calculation Table:**

<center>

| Month      | $n_t$ | $\text{pos}_t$ | $\text{neg}_t$ | $\text{index}_t$                                  |
|------------|-------|----------------|----------------|---------------------------------------------------|
| 2025-01-24 | 150   | 110            | 40             | $\ln\left(\frac{1+110}{1+40}\right) \approx 1.00$ |
| 2025-02-24 | 200   | 150            | 50             | $\ln\left(\frac{1+150}{1+50}\right) \approx 1.09$ |

</center>
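The aggregation above can be sketched in a few lines of Python. The `(date, score)` pairs are illustrative (the first three reuse the scores 0.85, 0.30, and 0.50 from the example data table), and treating the window as half-open, so that a post on a boundary date is counted only once, is an assumption of this sketch rather than something the slides specify.

```python
import math
from datetime import date

def sentiment_index(scores):
    """Return (pos_t, neg_t, index_t) for the scores in one window."""
    pos = sum(scores)
    neg = sum(1.0 - s for s in scores)
    index = math.log((1.0 + pos) / (1.0 + neg))
    return pos, neg, index

def scores_in_window(posts, start, end):
    """Select scores of posts dated in the half-open window (start, end]."""
    return [score for day, score in posts if start < day <= end]

# Illustrative (date, score) pairs; the first three reuse the example scores.
posts = [
    (date(2024, 5, 1), 0.85),
    (date(2024, 5, 1), 0.30),
    (date(2024, 5, 1), 0.50),
    (date(2024, 6, 15), 0.90),  # falls outside the window below
]

window = scores_in_window(posts, date(2024, 4, 30), date(2024, 5, 31))
pos, neg, index = sentiment_index(window)
print(f"n={len(window)} pos={pos:.2f} neg={neg:.2f} index={index:.4f}")
```

Because each comment contributes $\text{score}_i$ to $\text{pos}_t$ and $1-\text{score}_i$ to $\text{neg}_t$, the two sums always add up to $n_t$, which is why the rows in the example table satisfy $110 + 40 = 150$ and $150 + 50 = 200$. The index is zero when positive and negative mass balance, and the added 1 in numerator and denominator keeps it finite for an empty window.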

### 4. Predictive Regression Model

- **Regression Model:**

  Original model without sentiment:

  $$
  r_{t+1} = \hat{\alpha} + \sum_{i=1}^{n} \hat{\beta}_i \, \text{factor}_{i, t} + \hat{\epsilon}_{t}
  $$

- **Incorporating Sentiment Index:**

  $$
  r_{t+1} = \hat{\alpha} + \sum_{i=1}^{n} \hat{\beta}_i \, \text{factor}_{i, t} + \hat{\beta}_{n+1} \, \text{index}_t + \hat{\epsilon}_{t}
  $$

  where $r_{t+1} = \frac{P_{t+1} - P_t}{P_t}$.

- **Trading Signal:**
  - If $\hat{r}_{t+1} > 0$: **Buy**
  - Else: **No Buy**

- **Example Investment Strategy Table:**

<center>

| Trade Date $t$ | $\text{factor}_{1, t}$ | ...   | $\text{index}_t$ | $\hat{r}_{t+1}$ | Trading Signal | Actual $r_{t+1}$ |
|-----------------|-----------------------|------|------------------|-----------------  |---------------- |------------------|
| 2025-01-02      | 0.1%                  | ...  | 0.2%             | 0.5%              | Buy             | 0.6%           |
| 2025-01-09      | 0.3%                  | ...  | 0.4%             | -0.2%             | No Buy          | -0.1%          |
| 2025-01-16      | 0.2%                  | ...  | 0.5%             | 0.3%              | Buy             | 0.4%           |

</center>
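A minimal sketch of the augmented regression and the trading rule, using NumPy's least-squares solver, is shown below. The data are synthetic and noise-free, the coefficient values are hypothetical, and a single conventional factor stands in for the full factor set; row $t$ of the feature matrix is assumed to be paired with the return $r_{t+1}$.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares fit of y = alpha + X @ beta; returns [alpha, beta...]."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

def predict(coef, X):
    """Predicted next-period returns for the rows of X."""
    return coef[0] + X @ coef[1:]

def trading_signal(r_hat):
    """Buy when the predicted return is positive, otherwise stay out."""
    return "Buy" if r_hat > 0 else "No Buy"

# Synthetic, illustrative data: one conventional factor plus the sentiment index.
rng = np.random.default_rng(0)
factor = rng.normal(size=50)
index = rng.normal(size=50)
X = np.column_stack([factor, index])
# Hypothetical true coefficients, used only to generate the toy returns.
y = 0.001 + 0.002 * factor + 0.003 * index

coef = fit_ols(X, y)
r_hat = predict(coef, X[-3:])
print(np.round(coef, 4))
print([trading_signal(r) for r in r_hat])
```

Dropping the `index` column from `X` gives the baseline model without sentiment, so the same helper can be used to fit both specifications.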

### 5. Backtesting & Strategy Performance

- **Backtesting Period:**  
  January 2, 2025 – March 1, 2025

- **Approach:**
  - Use the regression model with the sentiment index to predict returns.
  - Execute the trading strategy in a simulated environment.
  
- **Evaluation Metrics:**  
  - Total return over the period.
  - Comparison of $R^2$ changes with/without the sentiment factor.
  
- **Observation:**  
  - If inclusion of $\text{index}_t$ significantly improves $R^2$ and the estimated $\hat{\beta}_{n+1}$ is statistically significant, investor sentiment can be considered a meaningful predictor.
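The two evaluation metrics can be computed with a short sketch like the one below, which uses the three rows of the example strategy table as input. It assumes returns compound across periods, that a "No Buy" period earns zero, and that there are no transaction costs; none of these choices is stated in the slides.

```python
def simple_return(p_t, p_next):
    """r_{t+1} = (P_{t+1} - P_t) / P_t."""
    return (p_next - p_t) / p_t

def total_return(signals, actual_returns):
    """Compound the actual returns of the periods with a Buy signal."""
    wealth = 1.0
    for signal, r in zip(signals, actual_returns):
        if signal == "Buy":
            wealth *= 1.0 + r
    return wealth - 1.0

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# Values taken from the example strategy table (as decimals).
signals = ["Buy", "No Buy", "Buy"]
predicted = [0.005, -0.002, 0.003]
actual = [0.006, -0.001, 0.004]

print(f"total return: {total_return(signals, actual):.4%}")
print(f"R^2: {r_squared(actual, predicted):.3f}")
```

Running `r_squared` once on the predictions of the baseline model and once on those of the sentiment-augmented model gives the comparison described above.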

## Discussion of Existing Solutions

- **Existing Approaches:**
  - Traditional sentiment analysis using lexicon-based methods.
  - Machine learning models that combine news sentiment with price data.
  - Alternative data such as social media posts often suffers from look-ahead ("future") bias when metrics such as likes/upvotes, which keep accumulating after a post is published, are used as features.

- **Our Improvements and Perspectives:**
  - **Novel Data Source:**  
    Using forum posts from EastMoney avoids “future bias” because likes/upvotes accumulated after posting are not included.
  - **Rolling Window Sentiment Index:**  
    Only posts from the most recent window enter the index, so older sentiment is forgotten, and the window aligns with the monthly reporting period.
  - **Integration with Traditional Factors:**  
    The framework tests if sentiment adds predictive power over standard financial indicators.

  - **Future Work (Milestone 2):**  
    - Explore alternative aggregation methods (e.g., weighted sentiment based on user credibility).
    - Compare with other deep learning approaches (e.g., transformer models using the full context of posts).

## Conclusion

- **Summary:**  
  - We built a pipeline that collects labeled sentiment data, fine-tunes BERT for sentiment classification, and crawls financial forum data.
  - Using a rolling window, we compute a monthly sentiment index and integrate it into a regression model predicting next-day returns.
  - Preliminary backtesting shows potential for the sentiment index to improve investment decisions.

- **Final Thoughts:**  
  - This framework provides a novel quantitative approach that captures the human element of market behavior.
  - Continued improvements and comparative studies with existing methods will further refine the model and strategy.