# Task 3: Correlation Between News Sentiment and Stock Movements

This notebook analyzes the correlation between news sentiment and stock price movements as part of the financial analysis project. It covers:
- Date alignment between news and stock data
- Sentiment analysis on news headlines using TextBlob
- Calculation of daily stock returns
- Correlation analysis between daily sentiment scores and stock returns
- Visualization of the results
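
Date alignment, the first item above, only works when the two datasets share calendar days. A quick overlap check before merging can be sketched as follows. This is a minimal illustration: the `overlap` helper and all of the date ranges are hypothetical and are not taken from the project data.

```python
from datetime import date


def overlap(range_a, range_b):
    """Return the (start, end) dates shared by two inclusive ranges, or None."""
    start = max(range_a[0], range_b[0])
    end = min(range_a[1], range_b[1])
    return (start, end) if start <= end else None


# Hypothetical ranges, chosen only to illustrate both outcomes.
news_range = (date(2020, 1, 1), date(2020, 6, 30))
stock_range = (date(2020, 3, 1), date(2020, 12, 31))
disjoint_range = (date(2021, 1, 1), date(2021, 12, 31))

shared = overlap(news_range, stock_range)
no_shared = overlap(news_range, disjoint_range)

print(shared)     # the common window, if any
print(no_shared)  # None when the ranges never meet
```

If the check returns `None`, an inner merge on date will produce no rows and no correlation can be computed.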


In [4]:
# Cell 1: Install and Import Libraries
import pandas as pd
import numpy as np
import os
from datetime import datetime, timedelta
import yfinance as yf
from textblob import TextBlob
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.io as pio
import random
from scipy.stats import pearsonr
import warnings
warnings.filterwarnings('ignore')

pio.renderers.default = 'jupyterlab'

ModuleNotFoundError: No module named 'yfinance'

In [4]:
!pip install textblob


Collecting textblob
  Using cached textblob-0.19.0-py3-none-any.whl (624 kB)
Collecting nltk>=3.9
  Using cached nltk-3.9.1-py3-none-any.whl (1.5 MB)
Installing collected packages: nltk, textblob
  Attempting uninstall: nltk
    Found existing installation: nltk 3.7
    Uninstalling nltk-3.7:
      Successfully uninstalled nltk-3.7
Successfully installed nltk-3.9.1 textblob-0.19.0




In [2]:
# Import required libraries
import pandas as pd
import numpy as np
from textblob import TextBlob
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Ensure plotly renders in the notebook
import plotly.io as pio
pio.renderers.default = 'jupyterlab'

# Assuming stock_data and news_data are already loaded from Task 1
# stock_data: Dictionary of DataFrames for each stock (AAPL, AMZN, etc.)
# news_data: DataFrame with news headlines, dates, and stock symbols

In [3]:
# Import required libraries
import pandas as pd
import numpy as np
import os
from textblob import TextBlob
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.io as pio
pio.renderers.default = 'jupyterlab'

# Load the data
def load_stock_data():
    stock_symbols = ['AAPL', 'AMZN', 'GOOGl', 'META', 'MSFT', 'NVDA', 'TSLA']
    stock_data = {}
    
    # Base directory where your data is stored
    base_dir = r'D:\10-Academy\Week1\financial-news-sentiment-analysis\data\yfinance_data'
    
    for symbol in stock_symbols:
        # Construct the full file path
        file_path = os.path.join(base_dir, f'{symbol}_historical_data.csv')
        print(f"Loading data for {symbol} from: {file_path}")
        
        try:
            if os.path.exists(file_path):
                stock_data[symbol] = pd.read_csv(file_path, parse_dates=['Date'])
                print(f"Successfully loaded data for {symbol}")
            else:
                print(f"Warning: File not found for {symbol} at {file_path}")
                continue
        except Exception as e:
            print(f"Error loading data for {symbol}: {str(e)}")
            continue
            
    return stock_data

def load_news_data():
    # Path to your news data
    news_path = r'D:\10-Academy\Week1\financial-news-sentiment-analysis\data\raw\raw_analyst_ratings.csv'
    print(f"Loading news data from: {news_path}")
    
    try:
        news_data = pd.read_csv(news_path, parse_dates=['date'])
        print("Successfully loaded news data")
        return news_data
    except Exception as e:
        print(f"Error loading news data: {str(e)}")
        return pd.DataFrame()  # Return empty DataFrame if loading fails

# Load the datasets
print("Attempting to load data...")
stock_data = load_stock_data()
news_data = load_news_data()

print("\nData loading summary:")
print(f"Stock symbols loaded: {list(stock_data.keys())}")
print(f"News data shape: {news_data.shape if not news_data.empty else 'Failed to load'}")

Attempting to load data...
Loading data for AAPL from: D:\10-Academy\Week1\financial-news-sentiment-analysis\data\yfinance_data\AAPL_historical_data.csv
Successfully loaded data for AAPL
Loading data for AMZN from: D:\10-Academy\Week1\financial-news-sentiment-analysis\data\yfinance_data\AMZN_historical_data.csv
Successfully loaded data for AMZN
Loading data for GOOGl from: D:\10-Academy\Week1\financial-news-sentiment-analysis\data\yfinance_data\GOOGl_historical_data.csv
Successfully loaded data for GOOGl
Loading data for META from: D:\10-Academy\Week1\financial-news-sentiment-analysis\data\yfinance_data\META_historical_data.csv
Successfully loaded data for META
Loading data for MSFT from: D:\10-Academy\Week1\financial-news-sentiment-analysis\data\yfinance_data\MSFT_historical_data.csv
Successfully loaded data for MSFT
Loading data for NVDA from: D:\10-Academy\Week1\financial-news-sentiment-analysis\data\yfinance_data\NVDA_historical_data.csv
Successfully loaded data for NVDA
Loading da

In [4]:
print("=== Date Range Summary ===")

# News Data
if not news_data.empty:
    print(f"\nNews Data:")
    print(f"Start Date: {news_data['date'].min()}")
    print(f"End Date:   {news_data['date'].max()}")
    print(f"Total Days: {(news_data['date'].max() - news_data['date'].min()).days} days")
else:
    print("\nNo news data available.")

# Stock Data (for each symbol)
print("\nStock Data:")
for symbol, data in stock_data.items():
    print(f"\n{symbol}:")
    print(f"Start Date: {data['Date'].min()}")
    print(f"End Date:   {data['Date'].max()}")
    print(f"Total Days: {(data['Date'].max() - data['Date'].min()).days} days")

=== Date Range Summary ===

News Data:
Start Date: 2024-01-01 00:00:00
End Date:   2024-01-03 00:00:00
Total Days: 2 days

Stock Data:

AAPL:
Start Date: 2024-05-30 00:00:00-04:00
End Date:   2025-05-29 00:00:00-04:00
Total Days: 364 days

AMZN:
Start Date: 2024-05-30 00:00:00-04:00
End Date:   2025-05-29 00:00:00-04:00
Total Days: 364 days

GOOGl:
Start Date: 2024-05-30 00:00:00-04:00
End Date:   2025-05-29 00:00:00-04:00
Total Days: 364 days

META:
Start Date: 2024-05-30 00:00:00-04:00
End Date:   2025-05-29 00:00:00-04:00
Total Days: 364 days

MSFT:
Start Date: 2024-05-30 00:00:00-04:00
End Date:   2025-05-29 00:00:00-04:00
Total Days: 364 days

NVDA:
Start Date: 2024-05-30 00:00:00-04:00
End Date:   2025-05-29 00:00:00-04:00
Total Days: 364 days

TSLA:
Start Date: 2024-05-30 00:00:00-04:00
End Date:   2025-05-29 00:00:00-04:00
Total Days: 364 days


In [18]:
# Display summary of loaded stock data
print("\nStock Data Overview:")
for symbol, data in stock_data.items():
    print(f"\n{symbol}:")
    print(f"Time period: {data['Date'].min().date()} to {data['Date'].max().date()}")
    print(f"Rows: {len(data)}")
    print(data[['Date', 'Open', 'Close']].head(2).to_string(index=False))

# Display news data overview
print("\nNews Data Overview:")
print(f"Columns: {news_data.columns.tolist()}")
print(news_data.head())


Stock Data Overview:

AAPL:
Time period: 2024-05-30 to 2025-05-29
Rows: 250
                     Date       Open      Close
2024-05-30 00:00:00-04:00 189.872035 190.399567
2024-05-31 00:00:00-04:00 190.548875 191.355103

AMZN:
Time period: 2024-05-30 to 2025-05-29
Rows: 250
                     Date       Open      Close
2024-05-30 00:00:00-04:00 181.309998 179.320007
2024-05-31 00:00:00-04:00 178.300003 176.440002

GOOGl:
Time period: 2024-05-30 to 2025-05-29
Rows: 250
                     Date       Open      Close
2024-05-30 00:00:00-04:00 174.366355 171.291061
2024-05-31 00:00:00-04:00 171.042245 171.679199

META:
Time period: 2024-05-30 to 2025-05-29
Rows: 250
                     Date       Open      Close
2024-05-30 00:00:00-04:00 469.955558 465.352325
2024-05-31 00:00:00-04:00 464.106863 465.133118

MSFT:
Time period: 2024-05-30 to 2025-05-29
Rows: 250
                     Date       Open      Close
2024-05-30 00:00:00-04:00 421.071657 411.514954
2024-05-31 00:00:00-04:00 413.

In [5]:
# 1. Basic Stock Performance Analysis
def analyze_stock_performance(stock_data):
    performance = {}
    for symbol, data in stock_data.items():
        data['Daily_Return'] = data['Close'].pct_change()
        performance[symbol] = {
            'Total_Days': len(data),
            'Avg_Daily_Return': data['Daily_Return'].mean(),
            'Volatility': data['Daily_Return'].std(),
            'Total_Return': (data['Close'].iloc[-1] - data['Close'].iloc[0])/data['Close'].iloc[0]
        }
    return pd.DataFrame(performance).T

performance_df = analyze_stock_performance(stock_data)
print("\nStock Performance Analysis:")
print(performance_df)

# 2. News Sentiment Analysis (if your news data contains text)
def analyze_sentiment(news_data):
    if 'headline' in news_data.columns:
        news_data['sentiment'] = news_data['headline'].apply(
            lambda x: TextBlob(str(x)).sentiment.polarity)
        return news_data
    return news_data

news_data = analyze_sentiment(news_data)
if 'sentiment' in news_data.columns:
    print("\nNews Sentiment Analysis:")
    print(news_data[['date', 'headline', 'sentiment']].head())


Stock Performance Analysis:
       Total_Days  Avg_Daily_Return  Volatility  Total_Return
AAPL        250.0          0.000412    0.020935      0.050160
AMZN        250.0          0.000790    0.021907      0.147111
GOOGl       250.0          0.000210    0.019843      0.003321
META        250.0          0.001582    0.023365      0.386154
MSFT        250.0          0.000565    0.016167      0.114613
NVDA        250.0          0.001618    0.037159      0.260052
TSLA        250.0          0.003817    0.045736      1.004754

News Sentiment Analysis:
        date       headline  sentiment
0 2024-01-01  Sample news 1        0.0
1 2024-01-02  Sample news 2        0.0
2 2024-01-03  Sample news 3        0.0


In [6]:
# Plot closing prices
fig = go.Figure()
for symbol, data in stock_data.items():
    fig.add_trace(go.Scatter(
        x=data['Date'],
        y=data['Close'],
        name=symbol,
        mode='lines'
    ))
fig.update_layout(
    title='Stock Closing Prices',
    xaxis_title='Date',
    yaxis_title='Price (USD)',
    hovermode='x unified'
)
fig.show()

# Plot daily returns distribution
fig = px.histogram(
    pd.concat([df.assign(Symbol=symbol) for symbol, df in stock_data.items()]),
    x='Daily_Return',
    color='Symbol',
    facet_col='Symbol',
    facet_col_wrap=3,
    title='Distribution of Daily Returns'
)
fig.update_xaxes(matches=None)
fig.show()

In [7]:
print("\nActual Date Ranges in Data:")
# News data date range
if not news_data.empty:
    print(f"News data: {len(news_data)} records from {news_data['date'].min()} to {news_data['date'].max()}")
else:
    print("No news data available")
# Stock data date ranges
print("\nStock Data Date Ranges:")
for symbol, data in stock_data.items():
    print(f"{symbol}: {len(data)} records from {data['Date'].min().date()} to {data['Date'].max().date()}")


Actual Date Ranges in Data:
News data: 3 records from 2024-01-01 00:00:00 to 2024-01-03 00:00:00

Stock Data Date Ranges:
AAPL: 250 records from 2024-05-30 to 2025-05-29
AMZN: 250 records from 2024-05-30 to 2025-05-29
GOOGl: 250 records from 2024-05-30 to 2025-05-29
META: 250 records from 2024-05-30 to 2025-05-29
MSFT: 250 records from 2024-05-30 to 2025-05-29
NVDA: 250 records from 2024-05-30 to 2025-05-29
TSLA: 250 records from 2024-05-30 to 2025-05-29


## Step 4: Calculate Daily Stock Returns

Compute daily returns as the percentage change in closing prices: `(Close[t] - Close[t-1]) / Close[t-1] * 100`.
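
As a quick check of the formula, here is a small dependency-free sketch on made-up closing prices. The helper name `daily_returns_pct` is hypothetical; the notebook itself relies on pandas `pct_change` for the same calculation.

```python
def daily_returns_pct(closes):
    """Percentage change between consecutive closes; one fewer value than the input."""
    return [
        (curr - prev) / prev * 100
        for prev, curr in zip(closes, closes[1:])
    ]


closes = [100.0, 110.0, 99.0]  # made-up closing prices
returns = daily_returns_pct(closes)

# A rise from 100 to 110 is +10%; a fall from 110 to 99 is -10%.
print([round(r, 2) for r in returns])
```

The first day has no previous close, so a series of N prices yields N - 1 returns. That is why 250 price rows become 249 return records once the leading NaN is dropped.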

In [8]:
# List of stock symbols
stock_symbols = ['AAPL', 'AMZN', 'GOOGl', 'META', 'MSFT', 'NVDA', 'TSLA']
stock_returns = {}

# Calculate daily returns for each stock
for symbol in stock_symbols:
    df = stock_data[symbol].copy()
    df['daily_return'] = df['Close'].pct_change() * 100  # Percentage change
    df = df.dropna(subset=['daily_return'])  # Remove NaN from the first row
    stock_returns[symbol] = df[['Date', 'daily_return']]
    print(f"\nCalculated daily returns for {symbol}: {len(df)} records")


Calculated daily returns for AAPL: 249 records

Calculated daily returns for AMZN: 249 records

Calculated daily returns for GOOGl: 249 records

Calculated daily returns for META: 249 records

Calculated daily returns for MSFT: 249 records

Calculated daily returns for NVDA: 249 records

Calculated daily returns for TSLA: 249 records


## Step 5: Correlation Analysis

Merge sentiment and stock returns data by date, then compute the Pearson correlation coefficient between daily sentiment scores and stock returns for each stock.
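
The merge-then-correlate step can be sketched on a toy dataset as follows. All prices, dates, and polarity values below are made up for illustration. The sketch assumes the stock dates carry a UTC offset (as in the CSVs loaded above) while the news dates are timezone-naive, so both keys are first reduced to plain calendar dates.

```python
import pandas as pd

# Made-up prices with timezone-aware date strings.
stock = pd.DataFrame({
    "Date": [
        "2024-06-03 00:00:00-04:00",
        "2024-06-04 00:00:00-04:00",
        "2024-06-05 00:00:00-04:00",
        "2024-06-06 00:00:00-04:00",
        "2024-06-07 00:00:00-04:00",
    ],
    "Close": [100.0, 102.0, 101.0, 104.0, 103.0],
})

# Made-up headline polarities; two headlines share a day to show averaging.
news = pd.DataFrame({
    "date": pd.to_datetime(
        ["2024-06-04", "2024-06-04", "2024-06-05", "2024-06-06", "2024-06-07"]
    ),
    "polarity": [0.6, 0.2, -0.3, 0.5, -0.1],
})

# Bring the stock key to timezone-naive calendar dates so it can match the news key.
stock["Date"] = (
    pd.to_datetime(stock["Date"], utc=True).dt.tz_convert(None).dt.normalize()
)
stock["daily_return"] = stock["Close"].pct_change() * 100

# One sentiment value per day: the mean polarity of that day's headlines.
daily_sentiment = news.groupby("date")["polarity"].mean()

# Keep only days that have both a return and a sentiment score.
merged = stock.merge(
    daily_sentiment,
    left_on="Date",
    right_index=True,
    how="inner",
).dropna(subset=["daily_return", "polarity"])

# Series.corr uses the Pearson coefficient by default.
corr = merged["daily_return"].corr(merged["polarity"])
print(len(merged), round(corr, 3))
```

With only a handful of matched days the coefficient is very noisy, so it is worth reporting the number of matched days next to each correlation.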

In [16]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from textblob import TextBlob

# 1. Data Preparation with Timezone Handling
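def safe_datetime_conversion(series):
    """Convert a datetime series to timezone-naive calendar dates (midnight)."""
    try:
        # Convert to UTC, drop the timezone, then strip the time of day.
        # Without normalize(), stock dates stored as midnight at a UTC offset
        # become e.g. 04:00 and never match news dates stored at 00:00.
        return pd.to_datetime(series, utc=True).dt.tz_convert(None).dt.normalize()
    except Exception as e:
        print(f"Datetime conversion warning: {str(e)}")
        return pd.to_datetime(series, errors='coerce')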

# 2. Process News Data
print("Processing news data...")
news_data['date'] = safe_datetime_conversion(news_data['date'])

# Use existing sentiment column if available, otherwise calculate
if 'sentiment' in news_data.columns:
    news_data['polarity'] = news_data['sentiment'].astype(float)
elif 'headline' in news_data.columns:
    print("Calculating sentiment from headlines...")
    news_data['polarity'] = news_data['headline'].apply(
        lambda x: TextBlob(str(x)).sentiment.polarity
    )

daily_sentiment = news_data.groupby('date').agg({
    'polarity': 'mean',
    'headline': 'count'
}).rename(columns={'headline': 'news_count'})

# 3. Process Stock Data and Calculate Correlations
correlation_results = {}

for symbol, stock_df in stock_data.items():
    try:
        print(f"\nProcessing {symbol}...")
        
        # Handle datetime conversion
        stock_df = stock_df.copy()
        stock_df['Date'] = safe_datetime_conversion(stock_df['Date'])
        
        # Calculate daily returns
        stock_df['Daily_Return'] = stock_df['Close'].pct_change()
        
        # Merge with sentiment data
        merged_df = pd.merge(
            stock_df,
            daily_sentiment,
            left_on='Date',
            right_index=True,
            how='inner'
        )
        
        if len(merged_df) > 3:
            corr = merged_df[['Daily_Return', 'polarity']].corr().iloc[0,1]
            correlation_results[symbol] = {
                'correlation': corr,
                'days_analyzed': len(merged_df),
                'start_date': merged_df['Date'].min().date(),
                'end_date': merged_df['Date'].max().date()
            }
    except Exception as e:
        print(f"Error processing {symbol}: {str(e)}")
        continue

# 4. Display Results
if correlation_results:
    results_df = pd.DataFrame(correlation_results).T.sort_values('correlation', ascending=False)
    
    print("\n=== Correlation Analysis Results ===")
    print(results_df)
    
    # Visualization
    plt.figure(figsize=(12, 6))
    colors = ['green' if x > 0 else 'red' for x in results_df['correlation']]
    plt.bar(results_df.index, results_df['correlation'], color=colors)
    plt.axhline(0, color='black', linestyle='--')
    plt.title("News Sentiment vs. Stock Returns Correlation")
    plt.ylabel("Correlation Coefficient")
    plt.xlabel("Stock Symbol")
    plt.grid(axis='y', alpha=0.3)
    
    # Add correlation values on bars
    for i, v in enumerate(results_df['correlation']):
        plt.text(i, v/2, f"{v:.2f}", ha='center', color='white' if abs(v) > 0.2 else 'black')
    
    plt.show()
else:
    print("\nNo valid correlation results - check your data overlap")

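# 5. Sample of Merged Data for Inspection
if correlation_results:
    sample_symbol = list(correlation_results.keys())[0]
    # The loop above worked on copies, so stock_data still holds the raw
    # timezone-aware dates; convert them again before merging
    sample_stock = stock_data[sample_symbol].copy()
    sample_stock['Date'] = safe_datetime_conversion(sample_stock['Date'])
    sample_stock['Daily_Return'] = sample_stock['Close'].pct_change()
    sample_df = pd.merge(
        sample_stock,
        daily_sentiment,
        left_on='Date',
        right_index=True,
        how='inner'
    )
    print(f"\nSample merged data for {sample_symbol}:")
    print(sample_df[['Date', 'Close', 'Daily_Return', 'polarity']].head())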

Processing news data...

Processing AAPL...

Processing AMZN...

Processing GOOGl...

Processing META...

Processing MSFT...

Processing NVDA...

Processing TSLA...

No valid correlation results - check your data overlap


In [14]:
# Check datetime formats
print("\nDate format examples:")
print("News data:", news_data['date'].head(1).values)
print("Stock data (AAPL):", stock_data['AAPL']['Date'].head(1).values)

# Check for null values
print("\nMissing values:")
print("News data:", news_data['date'].isna().sum())
print("Stock data (AAPL):", stock_data['AAPL']['Date'].isna().sum())


Date format examples:
News data: ['2024-01-01T00:00:00.000000000']
Stock data (AAPL): [datetime.datetime(2024, 5, 30, 0, 0, tzinfo=tzoffset(None, -14400))]

Missing values:
News data: 0
Stock data (AAPL): 0


In [17]:
import matplotlib.pyplot as plt
import seaborn as sns

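# Plot correlation heatmap
# Note: correlation_df is built by the cell under Step 6; run that cell first,
# otherwise this cell raises NameError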
plt.figure(figsize=(10, 6))
sns.heatmap(correlation_df[['return_sentiment_corr']].sort_values('return_sentiment_corr', ascending=False), 
            annot=True, cmap='coolwarm', center=0, vmin=-1, vmax=1)
plt.title("Correlation Between News Sentiment and Stock Returns")
plt.show()

# Scatter plots for individual stocks
for symbol in correlation_df.index:
    if symbol in stock_data and not news_data.empty:
        stock_df = stock_data[symbol].copy()
        # utc=True is required because the stock dates are timezone-aware
        stock_df['Date'] = pd.to_datetime(stock_df['Date'], utc=True).dt.tz_convert(None).dt.normalize()
        stock_df['Daily_Return'] = stock_df['Close'].pct_change()
        merged_df = pd.merge(stock_df, daily_sentiment, left_on='Date', right_index=True, how='inner')
        
        if len(merged_df) > 5:  # Only plot if we have enough data points
            plt.figure(figsize=(10, 5))
            sns.regplot(x='polarity', y='Daily_Return', data=merged_df)
            plt.title(f"{symbol}: Daily Returns vs. News Sentiment")
            plt.xlabel("Average Daily Sentiment Polarity")
            plt.ylabel("Daily Return (fraction)")  # pct_change() is not multiplied by 100 here
            plt.grid(True)
            plt.show()

NameError: name 'correlation_df' is not defined

<Figure size 1000x600 with 0 Axes>

## Step 6: Visualize Sentiment vs. Stock Returns

Visualize the daily returns and sentiment scores over time for a sample stock (AAPL) to inspect their relationship.

In [18]:
correlation_results = {}

for symbol, stock_df in stock_data.items():
    # Prepare stock data
    stock_df = stock_df.copy()
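    # utc=True is required because the stock dates are timezone-aware;
    # tz_convert(None) and normalize() then leave plain calendar dates
    stock_df['Date'] = pd.to_datetime(stock_df['Date'], utc=True).dt.tz_convert(None).dt.normalize()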
    stock_df['Daily_Return'] = stock_df['Close'].pct_change()  # Daily percentage change
    
    # Merge with sentiment data
    merged_df = pd.merge(stock_df, daily_sentiment, left_on='Date', right_index=True, how='left')
    
    # Calculate correlations
    if not merged_df.empty:
        corr_matrix = merged_df[['Daily_Return', 'polarity']].corr()
        correlation_results[symbol] = {
            'return_sentiment_corr': corr_matrix.loc['Daily_Return', 'polarity'],
            'merged_records': len(merged_df.dropna(subset=['Daily_Return', 'polarity']))  # Days with both a return and a sentiment score
        }

# Display correlation results
correlation_df = pd.DataFrame(correlation_results).T
print("\nCorrelation Between News Sentiment and Daily Returns:")
print(correlation_df.sort_values('return_sentiment_corr', ascending=False))

ValueError: Tz-aware datetime.datetime cannot be converted to datetime64 unless utc=True

## Summary of Findings

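1. **Data Alignment**: The news data loaded in this run holds 3 records dated 2024-01-01 to 2024-01-03, while the stock data covers 2024-05-30 to 2025-05-29 (250 rows per symbol). The two ranges do not overlap, so no dates could be aligned.
2. **Sentiment Analysis**: Computed TextBlob polarity scores (range -1 to 1) for the news headlines. All three placeholder headlines ("Sample news 1" to "Sample news 3") scored 0.0.
3. **Daily Returns**: Calculated daily percentage changes in closing prices for all seven stocks, giving 249 return records each.
4. **Correlation**: No Pearson correlation coefficient could be computed in this run. The inner merge on date returned no rows ("No valid correlation results"), and the later correlation cell stopped with `ValueError: Tz-aware datetime.datetime cannot be converted to datetime64 unless utc=True`.
   - Once overlapping news data is available, a positive coefficient would indicate that returns tend to be higher on days with more positive headlines. It would not by itself show that news causes price moves, and a weak or negative coefficient would indicate limited predictive power.
5. **Visualization**: Cells for closing prices and the distribution of daily returns are included. The correlation heatmap and per-stock scatter plots were not produced because `correlation_df` was undefined when that cell ran.

The correlation analysis needs to be rerun with news data that overlaps the stock price period before any findings on the relationship between news sentiment and stock movements can go into the final report.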

## GitHub Workflow

To meet the project requirements, follow these steps to commit and submit this notebook:

1. **Create a New Branch**:
   - Run: `git checkout -b task-3`

2. **Add and Commit Changes**:
   - Save this notebook as `task_3_correlation_analysis.ipynb` in the `notebooks/` directory.
   - Stage the file: `git add notebooks/task_3_correlation_analysis.ipynb`
   - Commit at least three times with descriptive messages:
     - `git commit -m "Initial commit: Added date alignment and sentiment analysis"`
     - `git commit -m "Added daily returns and correlation analysis"`
     - `git commit -m "Completed Task 3 with visualization and summary"`

3. **Push to GitHub**:
   - Push the branch: `git push origin task-3`

4. **Create a Pull Request (PR)**:
   - Go to your GitHub repository, create a PR from the `task-3` branch to the `main` branch, and merge it.

5. **Submit the Main Branch Link**:
   - Submit the URL of your repository's main branch by 8:00 PM UTC on June 03, 2025.