### Sentiment Analysis
`Sentiment Analysis` is the process of computationally identifying and categorizing opinions expressed in a piece of text, typically to determine whether the writer's attitude towards a particular topic, product, or subject is positive, negative, or neutral.

In a **financial context**, sentiment analysis is used to extract insights from news, social media, financial reports, and alternative data. These insights support investment, trading, risk management, and operations in financial institutions.

### Importing Libraries

In [21]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt  # pyplot is a module; plt is only its conventional alias
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')  # fetch the NLTK stop word lists (needed once per environment)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn.metrics import roc_auc_score
from sklearn import metrics

### Data Loading and Preprocessing

In [22]:
import pandas as pd
df1 = pd.read_csv("~/Desktop/NUS Fintech Society/5. NLP/nus-fintech-news-headline-sentiment-analysis/news_headlines_train.csv")
df1.columns = ['text', 'sentiment']
df1.head()

| | text | sentiment |
|---|---|---|
| 0 | In addition , a further 29 employees can be la... | -1 |
| 1 | The authorisation is in force until the end of... | 0 |
| 2 | The value of the deal was not disclosed . | 0 |
| 3 | You need to be ready when the window opens up ... | 0 |
| 4 | Major Order in India Comptel Corporation has r... | 1 |


### Overview

To train and test a model, we need to represent the text numerically. One common technique is the Bag of Words (BOW) model, which describes each document by the counts of the words it contains. sklearn's `CountVectorizer` class builds a BOW matrix: it tokenizes the text, optionally filters stop words, and returns a numerical feature vector for each document.

`CountVectorizer` tokenizes with its default tokenizer unless a custom one is supplied. The code below uses `TfidfVectorizer` instead, which builds the same count matrix and then reweights it by inverse document frequency, so words that appear in most headlines carry less weight.
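
As a minimal sketch of what a BOW matrix looks like, the snippet below runs `CountVectorizer` on a three-document toy corpus. The corpus is made up for illustration and is not taken from the dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical, for illustration only)
docs = [
    "profit rose sharply",
    "profit fell",
    "sales rose",
]

bow = CountVectorizer()           # default tokenizer, lowercasing enabled
X_bow = bow.fit_transform(docs)   # sparse document-term count matrix

# vocabulary_ maps each term to its column index; order the terms by column
vocab = sorted(bow.vocabulary_, key=bow.vocabulary_.get)
print(vocab)            # ['fell', 'profit', 'rose', 'sales', 'sharply']
print(X_bow.toarray())  # one row per document, one column per term
```

Each row is a document and each column is a vocabulary term, so the first row has a 1 under `profit`, `rose`, and `sharply` and 0 elsewhere.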

In [23]:
stopset = set(stopwords.words('english'))
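# Newer scikit-learn releases accept stop_words only as 'english', a list, or None, so pass a list
vectorizer = TfidfVectorizer(use_idf=True, lowercase=True, strip_accents='ascii', stop_words=list(stopset))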
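
To see the effect of these settings without downloading anything, here is a small sketch on two toy headlines. The short `toy_stop_words` list is a hypothetical stand-in for the NLTK English stop word list, and the headlines are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for the NLTK stop word list
toy_stop_words = ["the", "of", "was", "is"]

docs = [
    "The value of the deal was not disclosed",
    "The deal is the largest of the year",
]

tfidf = TfidfVectorizer(use_idf=True, lowercase=True,
                        strip_accents="ascii", stop_words=toy_stop_words)
X_tfidf = tfidf.fit_transform(docs)

# Terms ordered by column index, paired with the weights of the first headline
terms = sorted(tfidf.vocabulary_, key=tfidf.vocabulary_.get)
weights = dict(zip(terms, X_tfidf.toarray()[0].round(3)))
print(terms)
print(weights)
```

Stop words never enter the vocabulary, and `deal`, which occurs in both headlines, receives a lower weight in the first headline than `value`, which occurs only there.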

### Splitting the Data Set into a Training Set and a Test Set

In [24]:
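y = df1['sentiment']  # dependent variable: -1 (negative), 0 (neutral), 1 (positive)
# Note: the vectorizer is fitted on all rows, so its vocabulary and IDF statistics also cover the rows held out for testing
X = vectorizer.fit_transform(df1['text'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=100)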
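
An alternative, sketched here under assumptions, is to split the raw text first and fit the vectorizer inside a `Pipeline`, so the held-out rows never influence the vocabulary or IDF weights. This is not the notebook's method; the headlines and labels below are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy headlines with labels -1 / 0 / 1
texts = [
    "company reports record profit",
    "sales rose strongly this quarter",
    "new order boosts revenue",
    "operating loss widened",
    "company to lay off employees",
    "shares fell after profit warning",
    "the value of the deal was not disclosed",
    "annual meeting will be held in march",
    "company appoints new board member",
    "the authorisation is in force until year end",
]
labels = [1, 1, 1, -1, -1, -1, 0, 0, 0, 0]

# Split the raw text, not the vectorized matrix
text_train, text_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=100)

# The vectorizer is fitted only on the training texts when the pipeline is fitted
model = make_pipeline(
    TfidfVectorizer(lowercase=True, strip_accents="ascii"),
    LinearSVC(),
)
model.fit(text_train, y_train)
pred = model.predict(text_test)
print(pred)
```

The fitted pipeline can later be applied to new text with a single `model.predict(...)` call, since it bundles the vectorizer and the classifier.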

### Training a Support Vector Machine Classifier

In [25]:
classifier = svm.LinearSVC()
classifier.fit(X_train, y_train)

LinearSVC()

### Evaluating the Model

In [26]:
pred = classifier.predict(X_test)
acc_score = metrics.accuracy_score(y_test, pred)
print(f'Accuracy Score: {acc_score}')

Accuracy Score: 0.7621283255086072
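
Accuracy alone does not show which of the three classes the model confuses. A per-class view can be sketched with `confusion_matrix` and `classification_report`; the label arrays below are hypothetical and are not the notebook's actual predictions.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical true and predicted labels, for illustration only
y_true = [-1, -1, 0, 0, 0, 0, 1, 1]
y_hat = [0, -1, 0, 0, 0, 0, 1, 0]

acc = accuracy_score(y_true, y_hat)  # 6 of 8 correct

# Rows are true classes, columns are predicted classes, in the order -1, 0, 1
cm = confusion_matrix(y_true, y_hat, labels=[-1, 0, 1])
print(acc)
print(cm)

# Precision, recall, and F1 for each class
print(classification_report(y_true, y_hat, labels=[-1, 0, 1], zero_division=0))
```

In the notebook itself, the same calls would take `y_test` and `pred` in place of the toy arrays.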


### Prediction of New Data

In [39]:
df2 = pd.read_csv("~/Desktop/NUS Fintech Society/5. NLP/nus-fintech-news-headline-sentiment-analysis/news_headlines_test_sample_submission.csv")
new_headlines = df2['text']
X_new = vectorizer.transform(new_headlines)  # transform only: reuse the vocabulary learned during training
result = classifier.predict(X_new)
df2['sentiment'] = result  # store the predicted labels
df2.to_csv("AG_NLP_01.csv", index=False)
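
The new data is passed through `transform` rather than `fit_transform` so that its columns line up with the features the classifier was trained on. A small sketch with hypothetical toy texts shows that words unseen during fitting are simply ignored:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical training and new headlines, for illustration only
train_texts = ["profit rose", "profit fell"]
new_texts = ["profit soared"]

vec = TfidfVectorizer()
vec.fit(train_texts)                  # vocabulary is learned here only

X_new_toy = vec.transform(new_texts)  # reuses the learned vocabulary
known_terms = sorted(vec.vocabulary_)

print(known_terms)       # ['fell', 'profit', 'rose']
print(X_new_toy.shape)   # (1, 3): same columns as the training matrix
print(X_new_toy.nnz)     # 1: only "profit" is known, "soared" is dropped
```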