# Comparative Analysis of Sentiment Analysis Techniques

In this project, we compare traditional statistical analysis tools with modern AI/ML models. This notebook examines the capabilities of SAS against models such as the BERT transformer and LSTM (Long Short-Term Memory) networks.

## Objective

The primary aim is to evaluate and contrast how effectively each method classifies the sentiment expressed in a dataset of tweets. The comparison uses accuracy, precision, and recall to measure each method's performance.
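As a minimal sketch of how accuracy, precision, and recall can be computed for a multi-class sentiment task, the snippet below uses only the standard library. The function name `classification_scores`, the macro-averaging choice, and the `positive`/`negative`/`neutral` labels are assumptions made for illustration; they are not taken from the notebook's own evaluation code.

```python
from collections import Counter


def classification_scores(y_true, y_pred):
    """Return accuracy plus macro-averaged precision and recall."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")

    labels = sorted(set(y_true) | set(y_pred))
    tp = Counter()  # prediction matches the gold label
    fp = Counter()  # label was predicted but the gold label differs
    fn = Counter()  # gold label was missed by the prediction

    for truth, guess in zip(y_true, y_pred):
        if truth == guess:
            tp[truth] += 1
        else:
            fp[guess] += 1
            fn[truth] += 1

    def ratio(numerator, denominator):
        # Score a label as 0.0 when it was never predicted (or never present)
        return numerator / denominator if denominator else 0.0

    precisions = [ratio(tp[l], tp[l] + fp[l]) for l in labels]
    recalls = [ratio(tp[l], tp[l] + fn[l]) for l in labels]

    return {
        "accuracy": sum(tp.values()) / len(y_true),
        "precision": sum(precisions) / len(labels),
        "recall": sum(recalls) / len(labels),
    }


# Hypothetical gold labels and model predictions
gold = ["positive", "negative", "neutral", "positive", "negative", "neutral"]
pred = ["positive", "negative", "positive", "positive", "neutral", "neutral"]
scores = classification_scores(gold, pred)
print(scores)
```

Macro averaging weights every class equally, which matters if one sentiment class is much rarer than the others; a weighted average would be a reasonable alternative.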

## Approach

1. Create a "Gold Standard" testing dataset.
2. Import the Tweet Sentiment Extraction train.csv data.
3. Perform the necessary pre-processing.
4. Train a BERT model and an LSTM neural network model.
5. Measure accuracy, precision, and recall on the "Gold Standard" dataset.
6. Analyze the results: explain why one model may work better than the others, and discuss how the models handle sarcasm and other figurative speech.
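The pre-processing in step 3 is not spelled out at this point, so the following is only one plausible sketch of tweet cleaning. The function `clean_tweet`, the regular expressions, and the choice to replace mentions with a generic `@user` token are all assumptions for illustration, not the notebook's actual pipeline.

```python
import html
import re

URL_RE = re.compile(r"https?://\S+")       # links
RT_RE = re.compile(r"^RT\s+@\w+:\s*")      # leading retweet marker
MENTION_RE = re.compile(r"@\w+")           # user mentions
WS_RE = re.compile(r"\s+")                 # runs of whitespace


def clean_tweet(text):
    """Normalize a raw tweet before tokenization (hypothetical helper)."""
    text = html.unescape(text)             # turn entities such as &amp; into characters
    text = RT_RE.sub("", text)             # drop the "RT @user: " prefix
    text = URL_RE.sub("", text)            # remove links
    text = MENTION_RE.sub("@user", text)   # anonymize remaining mentions
    return WS_RE.sub(" ", text).strip()    # collapse whitespace


example = "RT @MissJWaring: It's #WorldPoetryDay!   See https://example.com &amp; enjoy @someone"
cleaned = clean_tweet(example)
print(cleaned)
```

Hashtags and punctuation are left in place here because they can carry sentiment; whether to strip them is a design choice that depends on the tokenizer each model uses.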

If you are curious about how the Gold Standard dataset was created, see the file "TestData.ipynb". I will start by importing the necessary libraries and connecting to the Twitter comp_dbms PostgreSQL database.

In [20]:
# Import packages
import pandas as pd
import psycopg2

Now I will connect to the database and display some basic information about the table twitter.statuses, which holds the tweets we will run sentiment analysis on.

In [21]:
# Connect to the database
connection = psycopg2.connect(host='3.230.203.12',
                             user='compdb',
                             port=5438,
                             database='twitter',
                             password='compdbs_postgres')
connection.set_session(readonly=True, autocommit=True)

# From our connection we need a cursor, which acts as our interface to the database
cur = connection.cursor()

In [22]:
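# Fetch English-language tweets whose text contains no https:// link.
# cursor.execute() returns None in psycopg2, so its result is not stored.
cur.execute("""select distinct status_id, text
               from twitter.statuses s
               where s.lang = 'en' and status_id not in (
                   select status_id from twitter.statuses s2
                   where text like '%https://%')""")

# fetchall() returns a list of (status_id, text) tuples
users = cur.fetchall()

# No column names are supplied, so the columns are labeled 0 and 1
df = pd.DataFrame(users)
df.head()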

                     0                                                  1
0  1284819382355365893  RT @Craig_Foster: Final words on 7th anniversa...
1  1280228458749145098  RT @blue_bnd: In Von Braun's book 'MARS PROJEC...
2  1241785982174629888  RT @MissJWaring: It’s #WorldPoetryDay!  #Shelt...
3  1344732936730419200  RT @aubreyjanescott: Some Thoughts Here: I thi...
4  1126911564286767106  if dogs could talk - and we cannot stress this...
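To make the filtering logic of the query easier to see, here is a small sketch that runs the same `not in` subquery against an in-memory SQLite table standing in for twitter.statuses. The toy rows and the simplified three-column schema are assumptions for illustration; the real table lives in PostgreSQL and likely has more columns. The sketch also runs a direct `not like` form of the filter for comparison.

```python
import sqlite3

# In-memory stand-in for the PostgreSQL table (hypothetical toy data)
conn = sqlite3.connect(":memory:")
conn.execute("create table statuses (status_id integer, text text, lang text)")
conn.executemany(
    "insert into statuses values (?, ?, ?)",
    [
        (1, "hello world", "en"),
        (2, "see https://example.com", "en"),
        (3, "bonjour", "fr"),
        (4, "plain again", "en"),
    ],
)

# Same shape as the notebook query: exclude ids that appear in the link subquery
subquery_form = """
    select distinct status_id, text
    from statuses s
    where s.lang = 'en' and status_id not in (
        select status_id from statuses s2
        where text like '%https://%')
    order by status_id
"""

# Direct form: filter on the text column itself
direct_form = """
    select distinct status_id, text
    from statuses
    where lang = 'en' and text not like '%https://%'
    order by status_id
"""

subquery_rows = conn.execute(subquery_form).fetchall()
direct_rows = conn.execute(direct_form).fetchall()
print(subquery_rows)
print(direct_rows)
conn.close()
```

On this toy table both forms keep only the English tweets without a link. They are not guaranteed to agree in general: if a status_id appeared in several rows with different text, the subquery form would drop every row for that id, while the direct form would drop only the rows containing a link.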


In [23]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10439 entries, 0 to 10438
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   0       10439 non-null  object
 1   1       10439 non-null  object
dtypes: object(2)
memory usage: 163.2+ KB
