# Stock Market Sentiment Analysis

## Table of Contents
1. [Introduction](#Introduction)
2. [Data Collection](#Data-Collection)
3. [Data Preprocessing](#Data-Preprocessing)
4. [Exploratory Data Analysis](#Exploratory-Data-Analysis)
5. [Modeling](#Modeling)
    - [Logistic Regression](#Logistic-Regression)
    - [Support Vector Machine](#Support-Vector-Machine)
6. [Conclusion](#Conclusion)

## Introduction
In this project, we will perform sentiment analysis on financial news articles to predict the sentiment of the stock market. We will use logistic regression and support vector machine (SVM) for classification and evaluate the models using accuracy, precision, and recall metrics.

## Data Collection
We planned to collect posts from X (formerly known as Twitter) using the Tweepy library. The cells below instead load a pre-labeled dataset of 1.6 million tweets from a local CSV file.

In [1]:
# Install necessary libraries
# !pip install tweepy pandas scikit-learn nltk

In [15]:
# Install necessary libraries
# %pip install tweepy

# Import necessary libraries
import tweepy
import re

import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, classification_report
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize


from library.sb_utils import save_file

%matplotlib inline


In [3]:
path = r"C:\Users\adame\OneDrive\Desktop\python_scripts\data_projects\x_sentiment_analysis\x_sentiment_analysis\data\raw\training.1600000.processed.noemoticon.csv"
tweets_df = pd.read_csv(path, encoding='latin', header=None)

In [4]:
tweets_df.columns = ['sentiment', 'id', 'date', 'query', 'user_id', 'text']
tweets_df.head()

| | sentiment | id | date | query | user_id | text |
|---|---|---|---|---|---|---|
| 0 | 0 | 1467810369 | Mon Apr 06 22:19:45 PDT 2009 | NO_QUERY | \_TheSpecialOne\_ | @switchfoot http://twitpic.com/2y1zl - Awww, t... |
| 1 | 0 | 1467810672 | Mon Apr 06 22:19:49 PDT 2009 | NO_QUERY | scotthamilton | is upset that he can't update his Facebook by ... |
| 2 | 0 | 1467810917 | Mon Apr 06 22:19:53 PDT 2009 | NO_QUERY | mattycus | @Kenichan I dived many times for the ball. Man... |
| 3 | 0 | 1467811184 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | ElleCTF | my whole body feels itchy and like its on fire |
| 4 | 0 | 1467811193 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | Karoli | @nationwideclass no, it's not behaving at all.... |


In [5]:
tweets_df['sentiment'].value_counts()

sentiment
0    800000
4    800000
Name: count, dtype: int64

Sentiment labels: 0 is negative and 4 is positive. The labeling scheme also reserves 2 for neutral, but no neutral rows appear in this file.

## Data Preprocessing
We will clean and tokenize the text data.

In [6]:
# Check if 'punkt' is already downloaded
try:
    nltk.data.find('tokenizers/punkt')
except LookupError:
    nltk.download('punkt')

# Download NLTK data
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\adame\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [7]:
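# Build the stopword set once. Calling stopwords.words('english') for every
# token rebuilds a list and scans it linearly, which is very slow on 1.6M tweets.
STOP_WORDS = set(stopwords.words('english'))

# Function to clean and tokenize text data
def preprocess_text(text):
    # Remove URLs
    text = re.sub(r'http\S+', '', text)
    # Replace every non-letter character (digits, punctuation) with a space
    text = re.sub(r'[^a-zA-Z]', ' ', text)
    # Tokenize text
    tokens = word_tokenize(text)
    # Remove stopwords (case-insensitive comparison)
    tokens = [word for word in tokens if word.lower() not in STOP_WORDS]
    # Join tokens back into a single string
    return ' '.join(tokens)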
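
To see what these cleaning steps do to a single tweet, here is a minimal, dependency-free sketch. It mirrors the URL removal, non-letter stripping, and stopword filtering of `preprocess_text`, but swaps in `str.split` for `word_tokenize` and a tiny hand-picked stopword set (`TOY_STOPWORDS`, an assumption for illustration only) for the NLTK list, so its output can differ from the notebook's on real data.

```python
import re

# Hypothetical, deliberately tiny stopword set; the notebook uses NLTK's English list
TOY_STOPWORDS = {"is", "that", "he", "his", "by", "it", "and", "the", "a", "to"}

def toy_preprocess(text):
    """Approximate the notebook's cleaning steps without NLTK."""
    text = re.sub(r"http\S+", "", text)      # drop URLs
    text = re.sub(r"[^a-zA-Z]", " ", text)   # non-letters become spaces
    tokens = text.split()                    # whitespace tokenizer (stand-in for word_tokenize)
    kept = [t for t in tokens if t.lower() not in TOY_STOPWORDS]
    return " ".join(kept)

sample = "is upset that he can't update his Facebook http://example.com by 2pm!!"
cleaned = toy_preprocess(sample)
print(cleaned)  # upset can t update Facebook pm
```

Replacing the apostrophe with a space splits "can't" into the fragments "can" and "t". The toy stopword set keeps them, whereas the notebook's own output for the same tweet drops them.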

## Exploratory Data Analysis
We will explore the data to understand its structure and distribution.

In [8]:
# Apply preprocessing to the text data
tweets_df['cleaned_text'] = tweets_df['text'].apply(preprocess_text)

# Display the first few rows of the dataset
tweets_df.head()

| | sentiment | id | date | query | user_id | text | cleaned_text |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 1467810369 | Mon Apr 06 22:19:45 PDT 2009 | NO_QUERY | \_TheSpecialOne\_ | @switchfoot http://twitpic.com/2y1zl - Awww, t... | switchfoot Awww bummer shoulda got David Carr ... |
| 1 | 0 | 1467810672 | Mon Apr 06 22:19:49 PDT 2009 | NO_QUERY | scotthamilton | is upset that he can't update his Facebook by ... | upset update Facebook texting might cry result... |
| 2 | 0 | 1467810917 | Mon Apr 06 22:19:53 PDT 2009 | NO_QUERY | mattycus | @Kenichan I dived many times for the ball. Man... | Kenichan dived many times ball Managed save re... |
| 3 | 0 | 1467811184 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | ElleCTF | my whole body feels itchy and like its on fire | whole body feels itchy like fire |
| 4 | 0 | 1467811193 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | Karoli | @nationwideclass no, it's not behaving at all.... | nationwideclass behaving mad see |


In [9]:
# Save the preprocessed DataFrame to CSV
datapath = r"C:\Users\adame\OneDrive\Desktop\python_scripts\data_projects\x_sentiment_analysis\x_sentiment_analysis\data\raw"
save_file(tweets_df, 'preprocessed_tweets_df.csv', datapath)

A file already exists with this name.

Writing file.  "C:\Users\adame\OneDrive\Desktop\python_scripts\data_projects\x_sentiment_analysis\x_sentiment_analysis\data\raw\preprocessed_tweets_df.csv"


In [10]:
# load new data
path = r"C:\Users\adame\OneDrive\Desktop\python_scripts\data_projects\x_sentiment_analysis\x_sentiment_analysis\data\raw\preprocessed_tweets_df.csv"
tweets_df = pd.read_csv(path)

## Modeling
We will use logistic regression and support vector machine (SVM) for classification.

In [1]:
# Fill NaN values with an empty string
tweets_df['cleaned_text'] = tweets_df['cleaned_text'].fillna('')

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(tweets_df['cleaned_text'], tweets_df['sentiment'], test_size=0.25, random_state=42)

# Vectorize the text data using TF-IDF
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

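
TF-IDF weights a term by how often it occurs in a document, discounted by how many documents contain it. The sketch below computes it by hand for a toy corpus, using the smoothed formula that scikit-learn documents for `TfidfVectorizer`'s defaults, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row. The corpus and the helper name `toy_tfidf` are illustrative assumptions, and the whitespace tokenizer is simpler than the vectorizer's own.

```python
import math
from collections import Counter

def toy_tfidf(docs):
    """Return (vocabulary, rows) with L2-normalized, smoothed TF-IDF weights."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        rows.append([w / norm for w in raw])
    return vocab, rows

docs = ["good day", "bad day", "good good mood"]
vocab, rows = toy_tfidf(docs)
for doc, row in zip(docs, rows):
    weights = {t: round(w, 3) for t, w in zip(vocab, row) if w > 0}
    print(doc, "->", weights)
```

In the document "bad day", the term "bad" appears in only one document and so receives a larger weight than "day", which appears in two.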

### Logistic Regression

In [None]:
# Train a logistic regression model
logreg = LogisticRegressionCV(solver='lbfgs', random_state=42, max_iter=1000, n_jobs=-1)
logreg.fit(X_train_tfidf, y_train)

# Make predictions
y_pred_logreg = logreg.predict(X_test_tfidf)

# Evaluate the model
print('Logistic Regression Metrics:')
print('Accuracy:', accuracy_score(y_test, y_pred_logreg))
print('Precision:', precision_score(y_test, y_pred_logreg, average='weighted'))
print('Recall:', recall_score(y_test, y_pred_logreg, average='weighted'))
print('Classification Report:\n', classification_report(y_test, y_pred_logreg))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Logistic Regression Metrics:
Accuracy: 0.781875
Precision: 0.7821031799396063
Recall: 0.781875
Classification Report:
               precision    recall  f1-score   support

           0       0.79      0.77      0.78    199581
           4       0.77      0.80      0.79    200419

    accuracy                           0.78    400000
   macro avg       0.78      0.78      0.78    400000
weighted avg       0.78      0.78      0.78    400000
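
The weighted precision and recall printed above are per-class scores averaged with each class's support as the weight. A small sketch on made-up labels (the `y_true` and `y_pred` lists and the `weighted_scores` helper are illustrative, not the notebook's data) shows the computation, and also why weighted recall always equals accuracy, as it does in the output above (both 0.781875).

```python
from collections import Counter

def weighted_scores(y_true, y_pred):
    """Return (accuracy, weighted precision, weighted recall) for two label lists."""
    n = len(y_true)
    support = Counter(y_true)      # number of true samples per class
    predicted = Counter(y_pred)    # number of predictions per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    accuracy = sum(correct.values()) / n
    w_precision = 0.0
    w_recall = 0.0
    for label, count in support.items():
        precision = correct[label] / predicted[label] if predicted[label] else 0.0
        recall = correct[label] / count
        # Each class contributes in proportion to its support
        w_precision += (count / n) * precision
        w_recall += (count / n) * recall
    return accuracy, w_precision, w_recall

y_true = [0, 0, 0, 0, 4, 4, 4, 4, 4, 4]
y_pred = [0, 0, 0, 4, 4, 4, 4, 4, 0, 0]
acc, prec, rec = weighted_scores(y_true, y_pred)
print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f}")
```

Weighting each class's recall by its support cancels the per-class denominators, leaving the total number of correct predictions divided by the sample count, which is the definition of accuracy.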



### Support Vector Machine

In [2]:
# Import StandardScaler
from sklearn.preprocessing import StandardScaler

In [3]:
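# with_mean=False is required for sparse input: centering a sparse TF-IDF
# matrix would make it dense, so StandardScaler raises an error otherwise
scaler = StandardScaler(with_mean=False)
X_train_tfidf_scaled = scaler.fit_transform(X_train_tfidf)
X_test_tfidf_scaled = scaler.transform(X_test_tfidf)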

In [None]:
# Train a support vector machine model
svm = SVC(kernel='linear', random_state=42, C=1.0)
svm.fit(X_train_tfidf, y_train)

# Make predictions
y_pred_svm = svm.predict(X_test_tfidf)

# Evaluate the model
print('Support Vector Machine Metrics:')
print('Accuracy:', accuracy_score(y_test, y_pred_svm))
print('Precision:', precision_score(y_test, y_pred_svm, average='weighted'))
print('Recall:', recall_score(y_test, y_pred_svm, average='weighted'))
print('Classification Report:\n', classification_report(y_test, y_pred_svm))
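
The scikit-learn documentation notes that `SVC` fit time grows at least quadratically with the number of samples, which makes it hard to apply to the roughly 1.2 million training tweets here and may be why the cell above has no recorded output. One commonly suggested alternative for a linear kernel is `LinearSVC`. The sketch below shows the swap on a toy corpus (the six sentences and their labels are made up for illustration). It optimizes a slightly different objective than `SVC(kernel='linear')`, so results would not be identical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Made-up toy data using the notebook's label scheme (0 = negative, 4 = positive)
texts = [
    "awful terrible day", "sad upset crying", "hate this awful traffic",
    "great happy day", "love this wonderful song", "happy wonderful morning",
]
labels = [0, 0, 0, 4, 4, 4]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)   # sparse matrix, same type as X_train_tfidf

# LinearSVC is built on liblinear and accepts sparse input directly
linear_svm = LinearSVC(C=1.0, random_state=42)
linear_svm.fit(X, labels)

new_texts = ["what an awful morning", "wonderful happy song"]
predictions = linear_svm.predict(vectorizer.transform(new_texts))
print(list(predictions))
```

In the notebook, the same change would mean fitting `LinearSVC` on `X_train_tfidf` and `y_train` and leaving the evaluation code unchanged.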

## Conclusion
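Logistic regression reached an accuracy of about 0.78 on the 400,000-tweet test set, with weighted precision and recall at the same level and similar scores for the negative (0) and positive (4) classes. The solver stopped at its iteration limit, so these figures may change with further tuning. The SVM cells have no recorded output, so the two models cannot yet be compared.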