# Combine Stock Price and Sentiment Data

## Overview:
This notebook merges stock price data with sentiment analysis data to create a combined dataset that includes both daily prices and sentiment indicators. The resulting dataset can be used for further analysis, such as evaluating how news sentiment correlates with stock price movements.

## Purpose:
The main objective of this notebook is to:
- Load historical stock price data and sentiment data (news sentiment scores).
- Preprocess and clean the sentiment data, extracting meaningful indicators like polarity and sentiment scores (negative, neutral, and positive).
- Merge the two datasets based on date to create a unified view of stock performance alongside sentiment.
- Handle any missing sentiment data by forward-filling it and save the combined dataset for further use.

## Steps:
1. **Data Loading**:
   - Load stock price data from a CSV file containing OHLC (Open, High, Low, Close) values.
   - Load news sentiment data for the same stock from a separate CSV file.

2. **Sentiment Data Preprocessing**:
   - Parse the sentiment data, which is stored in a JSON-like string format, to extract values for polarity and sentiment scores (`neg`, `neu`, `pos`).
   - Clean the sentiment data by removing unnecessary columns and converting dates to a common format.

3. **Aggregating Sentiment Data**:
   - Aggregate the sentiment scores on a daily basis, calculating the average polarity, negative, neutral, and positive sentiment for each day.

4. **Merging Stock and Sentiment Data**:
   - Merge the stock price data and aggregated sentiment data based on the `Date` field.
   - Handle missing values in the sentiment columns by forward-filling them with the most recent earlier values.

5. **Saving the Combined Data**:
   - Save the merged dataset to a CSV file so that it can be reused for further analysis, such as time series analysis, trading strategies, or sentiment-based forecasting.

## Key Functions:
- `parse_sentiment`: Parses each JSON-like sentiment string into a dictionary, returning `None` when a string cannot be parsed. The individual scores are then extracted from the parsed dictionaries.
- `pd.merge`: Left-joins the stock data with the aggregated sentiment data on matching dates, so every row of stock data is kept.
- `ffill`: Forward-fills missing sentiment values with the most recent earlier values.

## Output:
- A CSV file named `combined_spy_stock_sentiment.csv`, which contains both stock price data and sentiment scores.

## Use Cases:
- **Sentiment-Based Trading**: Use the combined data to analyze how news sentiment affects stock prices and incorporate this into a trading strategy.
- **Correlation Analysis**: Study the relationship between sentiment indicators and stock price movements.
- **Predictive Modeling**: Use the combined dataset for machine learning models that predict stock prices based on sentiment data.

## Conclusion:
This notebook combines stock price data and news sentiment data into a single dataset for analyzing how sentiment relates to stock performance. The saved file provides a starting point for further financial analysis or model building.


#### Importing necessary libraries
This cell imports the libraries used throughout the notebook:
- `pandas`: Used for handling data in DataFrame format, especially for CSV file operations.
- `ast`: Part of the Python standard library; its `literal_eval` function is used to parse the sentiment strings into dictionaries.

In [None]:
import pandas as pd
import ast

#### Loading stock price and sentiment data
This cell reads the historical stock price data and sentiment data for SPY (S&P 500 ETF) from two CSV files:
- `SPY.csv`: Contains the stock price data.
- `SPY.US_news.csv`: Contains the news sentiment data.

**Output:**
- It displays the first few rows of each dataset for initial inspection.

In [None]:
spy_stock_data = pd.read_csv('../data/processed/SPY.csv')
spy_news_data = pd.read_csv('../data/processed/SPY.US_news.csv')

print("SPY Stock Data:")
print(spy_stock_data.head())

print("\nSPY News Data:")
print(spy_news_data.head())

#### Preprocessing and cleaning sentiment data
In this cell, the sentiment data (stored as JSON-like strings) is parsed and cleaned:
- The `parse_sentiment` function converts each string into a dictionary, returning `None` if the string cannot be parsed. The individual scores are then read from that dictionary.
- Extracted sentiment scores include:
  - `polarity`: The overall sentiment score.
  - `neg`: Negative sentiment score.
  - `neu`: Neutral sentiment score.
  - `pos`: Positive sentiment score.

The cell also strips the time component from each article's timestamp and stores the plain date in a `date_only` column, which makes merging with the stock price data easier later.
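
As a minimal sketch of the parsing step, the snippet below runs `ast.literal_eval` on a made-up sentiment string. The sample values and the helper name `parse_sentiment_safe` are hypothetical, and the real file's format may differ. The `json.loads` fallback is an assumption added here to cover strictly JSON-formatted strings (for example, ones containing `null`), which `ast.literal_eval` rejects.

```python
import ast
import json


def parse_sentiment_safe(sentiment_str):
    """Return a dict of sentiment scores, or None if the value cannot be parsed."""
    if not isinstance(sentiment_str, str):
        return None  # e.g. NaN read from an empty CSV cell
    for parser in (ast.literal_eval, json.loads):
        try:
            parsed = parser(sentiment_str)
        except (ValueError, SyntaxError, TypeError):
            continue  # try the next parser
        if isinstance(parsed, dict):
            return parsed
    return None


# Made-up sample values, for illustration only
sample = "{'polarity': 0.98, 'neg': 0.03, 'neu': 0.85, 'pos': 0.12}"
print(parse_sentiment_safe(sample))
print(parse_sentiment_safe('{"polarity": null, "neg": 0.1}'))
print(parse_sentiment_safe("not a dict"))
```

Returning `None` for anything that is not a dictionary keeps the later extraction step simple, because it only has to handle two cases: a dictionary or a missing value.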


In [None]:
# Preprocess and Clean the News Data
# ------------------------------------------
# The sentiment column is stored as JSON-like strings, we need to extract them.
# We'll parse the 'sentiment' column and clean it.

# Function to parse sentiment column (stored as JSON-like strings)
def parse_sentiment(sentiment_str):
    """Parse one sentiment string into a dict; return None if it cannot be parsed."""
    try:
        parsed = ast.literal_eval(sentiment_str)  # Convert string to dictionary
    except (ValueError, SyntaxError, TypeError):
        return None
    # Guard against strings that parse to something other than a dictionary
    return parsed if isinstance(parsed, dict) else None


# Apply the parsing function to the sentiment column
spy_news_data['parsed_sentiment'] = spy_news_data['sentiment'].apply(parse_sentiment)

# Now, extract 'polarity', 'neg', 'neu', and 'pos' from the parsed sentiment.
# .get() yields None when a key is missing instead of raising a KeyError.
for key in ['polarity', 'neg', 'neu', 'pos']:
    spy_news_data[key] = spy_news_data['parsed_sentiment'].apply(lambda x, k=key: x.get(k) if x else None)

# Drop the unnecessary 'parsed_sentiment' column
spy_news_data = spy_news_data.drop(columns=['parsed_sentiment'])

# Convert the 'date' column to just the date (without time)
spy_news_data['date_only'] = pd.to_datetime(spy_news_data['date']).dt.date

# Inspect the cleaned news data
print("\nCleaned SPY News Data:")
print(spy_news_data[['date_only', 'polarity', 'neg', 'neu', 'pos']].head())

In [None]:
# Aggregate the News Sentiment Scores by Date
# ---------------------------------------------------
# Since there may be multiple articles for the same day, we'll aggregate the sentiment scores.

aggregated_sentiment = spy_news_data.groupby('date_only').agg({
    'polarity': 'mean',
    'neg': 'mean',
    'neu': 'mean',
    'pos': 'mean'
}).reset_index()

# Inspect the aggregated sentiment data
print("\nAggregated Sentiment Data:")
print(aggregated_sentiment.head())
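
A daily mean hides how many articles it is based on. As a small sketch on made-up data (the `article_count` column is a hypothetical addition, not part of the notebook's output), named aggregation can record the number of articles behind each average:

```python
import datetime

import pandas as pd

# Made-up example: two articles on the first day, one on the second
news = pd.DataFrame({
    "date_only": [
        datetime.date(2024, 1, 2),
        datetime.date(2024, 1, 2),
        datetime.date(2024, 1, 3),
    ],
    "polarity": [0.2, 0.6, -0.1],
})

daily = (
    news.groupby("date_only")
    .agg(
        polarity=("polarity", "mean"),        # average score per day
        article_count=("polarity", "count"),  # non-missing scores per day
    )
    .reset_index()
)
print(daily)
```

A count column like this makes it possible to treat days with a single article differently from days with many, if that distinction matters for the analysis.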

In [None]:
# Prepare the Stock Data for Merging
# ------------------------------------------
# Convert the 'Date' column to plain date objects so it matches 'date_only' in the sentiment data.
spy_stock_data['Date'] = pd.to_datetime(spy_stock_data['Date']).dt.date

# Inspect the stock data
print("\nSPY Stock Data Ready for Merging:")
print(spy_stock_data.head())

#### Merging stock data with aggregated sentiment data
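In this cell, the stock price data is left-joined with the aggregated sentiment data, matching the stock data's `Date` column against the sentiment data's `date_only` column. Every row of stock data is kept, and days without news receive missing values in the sentiment columns. This aligns the stock price movements with the news sentiment on a given day. The merged dataset includes stock prices alongside the aggregated sentiment indicators (polarity, negative, neutral, and positive sentiment scores).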

**Output:**
- The merged data is displayed for inspection.
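
To check how many trading days actually matched a sentiment row, `pd.merge` accepts `indicator=True`, which adds a `_merge` column. The sketch below uses made-up prices and scores, not values from the SPY files:

```python
import datetime

import pandas as pd

# Made-up data: three trading days, sentiment available on two of them
prices = pd.DataFrame({
    "Date": [datetime.date(2024, 1, 2), datetime.date(2024, 1, 3), datetime.date(2024, 1, 4)],
    "Close": [100.0, 101.0, 102.0],
})
sentiment = pd.DataFrame({
    "date_only": [datetime.date(2024, 1, 2), datetime.date(2024, 1, 4)],
    "polarity": [0.4, -0.1],
})

merged = pd.merge(
    prices,
    sentiment,
    left_on="Date",
    right_on="date_only",
    how="left",      # keep every trading day
    indicator=True,  # adds a '_merge' column: 'both' or 'left_only'
)
matched = int((merged["_merge"] == "both").sum())
print(f"{matched} of {len(merged)} trading days have sentiment")
```

If the match count is unexpectedly zero, a common cause is that the two date columns hold different types (for example, strings on one side and date objects on the other).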

In [None]:
# Merge Stock Data with Aggregated Sentiment Data
# -------------------------------------------------------
# Merge the stock data with the aggregated sentiment data on the 'Date' column.
combined_data = pd.merge(spy_stock_data, aggregated_sentiment, left_on='Date', right_on='date_only', how='left')

# Drop the redundant 'date_only' column
combined_data = combined_data.drop(columns=['date_only'])

# Inspect the combined data
print("\nCombined Stock and Sentiment Data:")
print(combined_data.head())


#### Handling missing sentiment data
Since not every trading day has associated news, this cell uses forward-filling (`ffill`) to carry the most recent available sentiment forward into days with missing values. Forward-filling assumes the rows are in chronological order, and any rows before the first day with sentiment remain empty because there is no earlier value to carry forward.

**Output:**
- The dataset after applying forward-filling is displayed for inspection.
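
The behavior of forward-filling can be illustrated on a small made-up frame. The column name `polarity_filled` is hypothetical and used only to show the values before and after side by side:

```python
import numpy as np
import pandas as pd

# Made-up data, deliberately out of chronological order
combined = pd.DataFrame({
    "Date": pd.to_datetime(["2024-01-04", "2024-01-02", "2024-01-03", "2024-01-05"]),
    "polarity": [np.nan, np.nan, 0.4, np.nan],
})

# Sort first so values are only carried forward in time
combined = combined.sort_values("Date").reset_index(drop=True)
combined["polarity_filled"] = combined["polarity"].ffill()
print(combined)
```

The first row stays missing because nothing precedes it. `ffill` also accepts a `limit` argument that caps how many consecutive rows are filled, which may be worth considering if stale sentiment should not persist indefinitely.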


In [None]:
# Handle Missing Data by Carrying Over the Most Recent Sentiment
# ------------------------------------------------------------------------
# Sort chronologically first so that forward-fill (ffill) only carries values forward in time.
# Rows before the first day with sentiment remain NaN.
combined_data = combined_data.sort_values('Date').reset_index(drop=True)

combined_data[['polarity', 'neg', 'neu', 'pos']] = combined_data[['polarity', 'neg', 'neu', 'pos']].ffill()

# Inspect the data after applying forward-fill
print("\nCombined Data After Forward-Filling Missing Values:")
print(combined_data.head())

#### Saving the combined data
Once the stock and sentiment data have been merged and cleaned, the final combined dataset is saved to a CSV file. This file contains both the stock price data and the associated sentiment scores, allowing for further analysis.

**Output:**
- The merged data is saved to `combined_spy_stock_sentiment.csv`.
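
When the saved file is read back later, dates come in as plain strings unless they are parsed explicitly. A minimal round-trip sketch, using a temporary directory and a made-up file name rather than the notebook's real output path:

```python
import datetime
import os
import tempfile

import pandas as pd

# Made-up stand-in for the combined dataset
combined = pd.DataFrame({
    "Date": [datetime.date(2024, 1, 2), datetime.date(2024, 1, 3)],
    "Close": [100.0, 101.0],
    "polarity": [0.4, 0.4],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "combined_example.csv")  # hypothetical file name
    combined.to_csv(path, index=False)
    # parse_dates converts the 'Date' column back to datetime on load
    reloaded = pd.read_csv(path, parse_dates=["Date"])

print(reloaded.dtypes)
```

Passing `index=False` on save, as the notebook does, avoids an extra unnamed index column appearing when the file is reloaded.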


In [None]:
# Save the Combined Data
# ------------------------------
# Save the merged DataFrame to a new CSV file for further use.
combined_data.to_csv('../data/processed/combined_spy_stock_sentiment.csv', index=False)

print("\nData saved to 'combined_spy_stock_sentiment.csv'!")