# Sentiment Analysis Deployed on AWS - Using SVM, Logistic Regression, and BERT for On-Demand Predictions

The goal of this project is to compare the performance of **SVM**, **Logistic Regression**, and **BERT** on a simple text classification task: sentiment analysis of tweets.


### What we know about the data

The data comes from the [Sentiment140 dataset with 1.6 million tweets](https://www.kaggle.com/datasets/kazanova/sentiment140/code). It contains the following features:

- target: the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
- ids: the id of the tweet (2087)
- date: the date of the tweet (Sat May 16 23:58:44 UTC 2009)
- flag: The query (lyx). If there is no query, then this value is NO_QUERY.
- user: the user that tweeted (robotickilldozr)
- text: the text of the tweet (Lyx is cool)

To keep things simple, we will focus only on the `target` and `text` features for text classification.

### Data Pre-processing

In this notebook we will process and transform Tweet text data by using the following data transformation techniques:

- Tokenization
- Lemmatization
- Stop Words Removal
- Contraction expansion

### Text Representation

For this project we will initially use TF-IDF vectorization for feature engineering. In the future we might try other representations, such as Bag of Words or Word2Vec word embeddings, and compare scores.
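
To make the TF-IDF weighting concrete, here is a minimal pure-Python sketch on a toy corpus. It assumes the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, and the L2 row normalization that scikit-learn's `TfidfVectorizer` applies by default; the toy documents and the helper name `tfidf` are made up for illustration.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocabulary, rows) of L2-normalised TF-IDF weights for tokenised docs."""
    n_docs = len(docs)
    vocab = sorted({tok for doc in docs for tok in doc})
    # Document frequency: number of documents that contain each term
    df = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed inverse document frequency
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in docs:
        tf = Counter(doc)
        raw = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        rows.append([w / norm for w in raw])
    return vocab, rows

toy_docs = [["good", "day"], ["bad", "day"], ["good", "good", "movie"]]
vocab, rows = tfidf(toy_docs)
for doc, row in zip(toy_docs, rows):
    print(doc, [round(w, 3) for w in row])
```

In the second toy document, "bad" receives a larger weight than "day" because it appears in fewer documents.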


### Validation and Evaluation

I will use k-fold cross-validation to train each model several times and use AUC as the evaluation metric for hyperparameter tuning.
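
As a rough illustration of both ideas, the sketch below splits sample indices into k contiguous folds (no shuffling) and computes AUC as the fraction of (positive, negative) pairs that the scores rank correctly. The helper names and the toy labels and scores are assumptions for the example, not part of the project code.

```python
def kfold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    # The first n_samples % k folds get one extra sample
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size

def auc_score(y_true, scores):
    """AUC as the share of positive/negative pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

folds = list(kfold_indices(10, 5))
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(len(folds), auc_score(y_true, scores))
```

During tuning, each hyperparameter setting would be trained on the `train` indices of every fold and scored on the matching `val` indices, and the mean validation AUC would decide the winner.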

### Deployment into production with AWS SageMaker Hosting and Visualization with AWS QuickSight

The following AWS services will be used to deploy an API endpoint with data visualization and on-demand predictions:

**AWS SageMaker Hosting** will deploy my ML models. I will use Docker to containerize the inference models and deploy them in SageMaker, with an A/B release strategy to split traffic between the hosted models.
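
The A/B split can be pictured with a small offline simulation. In SageMaker, each production variant of an endpoint carries a weight (`InitialVariantWeight` in the endpoint configuration), and a variant's share of traffic is its weight divided by the sum of all weights. The variant names and weights below are hypothetical, and the sketch only simulates routing; it does not call AWS.

```python
import random

# Hypothetical variant names and weights (not taken from the project)
variant_weights = {"svm-variant": 1.0, "logreg-variant": 1.0, "bert-variant": 2.0}

# Expected share of traffic per variant: weight / sum of weights
total = sum(variant_weights.values())
traffic_share = {name: w / total for name, w in variant_weights.items()}

def pick_variant(rng):
    """Simulate routing one request to a variant according to its weight."""
    names = list(variant_weights)
    return rng.choices(names, weights=[variant_weights[n] for n in names], k=1)[0]

rng = random.Random(0)
counts = {name: 0 for name in variant_weights}
for _ in range(10_000):
    counts[pick_variant(rng)] += 1

print(traffic_share)
print(counts)
```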

**AWS API Gateway** will be used to create an API endpoint whose requests return the response of the inference model.

**AWS Lambda** will take care of data transformation and preprocessing as well as invoking the SageMaker inference endpoint.

Requests will be stored in an **AWS Aurora Serverless SQL** database.

**AWS QuickSight** will be used for data visualization.
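
A minimal sketch of the Lambda step might look like the following. It assumes an API Gateway proxy integration (the request body arrives as a JSON string in `event["body"]`) and uses the `invoke_endpoint` call of the SageMaker runtime client. The endpoint name, the JSON payload shape, and the `FakeRuntimeClient` stand-in are all hypothetical; the stand-in exists only so the example runs without AWS credentials, and in a real function it would be replaced by `boto3.client("sagemaker-runtime")`.

```python
import io
import json

def make_handler(runtime_client, endpoint_name):
    """Build a Lambda-style handler that forwards tweet text to an inference endpoint."""
    def handler(event, context):
        # API Gateway proxy integration delivers the request body as a JSON string
        payload = json.loads(event["body"])
        response = runtime_client.invoke_endpoint(
            EndpointName=endpoint_name,
            ContentType="application/json",
            Body=json.dumps({"text": payload["text"]}),
        )
        # The response body is a stream that must be read and decoded
        prediction = json.loads(response["Body"].read())
        return {"statusCode": 200, "body": json.dumps(prediction)}
    return handler

class FakeRuntimeClient:
    """Offline stand-in for the SageMaker runtime client; always answers 'positive'."""
    def invoke_endpoint(self, EndpointName, ContentType, Body):
        text = json.loads(Body)["text"]
        reply = json.dumps({"text": text, "sentiment": "positive"}).encode("utf-8")
        return {"Body": io.BytesIO(reply)}

# "sentiment-endpoint" is a hypothetical endpoint name
handler = make_handler(FakeRuntimeClient(), "sentiment-endpoint")
result = handler({"body": json.dumps({"text": "Lyx is cool"})}, None)
print(result)
```

Passing the client in as an argument keeps the handler testable offline; text preprocessing would run inside `handler` before the endpoint is invoked.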

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
%xmode Minimal

Exception reporting mode: Minimal


In [3]:
DATASET_COLUMNS = ['target','ids','date','flag','user','text']
DATASET_ENCODING = "ISO-8859-1"
df = pd.read_csv('tweets.csv', encoding=DATASET_ENCODING, names=DATASET_COLUMNS)
df

Unnamed: 0,target,ids,date,flag,user,text
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."
...,...,...,...,...,...,...
1599995,4,2193601966,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,AmandaMarie1028,Just woke up. Having no school is the best fee...
1599996,4,2193601969,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,TheWDBoards,TheWDB.com - Very cool to hear old Walt interv...
1599997,4,2193601991,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,bpbabe,Are you ready for your MoJo Makeover? Ask me f...
1599998,4,2193602064,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,tinydiamondz,Happy 38th Birthday to my boo of alll time!!! ...


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1600000 entries, 0 to 1599999
Data columns (total 6 columns):
 #   Column  Non-Null Count    Dtype 
---  ------  --------------    ----- 
 0   target  1600000 non-null  int64 
 1   ids     1600000 non-null  int64 
 2   date    1600000 non-null  object
 3   flag    1600000 non-null  object
 4   user    1600000 non-null  object
 5   text    1600000 non-null  object
dtypes: int64(2), object(4)
memory usage: 73.2+ MB


### Class balance - Too good to be true...

In [5]:
df['target'].value_counts()

target
0    800000
4    800000
Name: count, dtype: int64

In [6]:
data = df[['text', 'target']].sample(n=100000).reset_index(drop=True)
print(data['text'][0])

@deviousrex Only if they try to take the moonshine bottles I will have strapped around my waste for race hydration!  #running


In [7]:
data['target'].value_counts()

target
0    50023
4    49977
Name: count, dtype: int64

In [8]:
import re
import string

import contractions
import nltk
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tag import pos_tag
from nltk.tokenize import TweetTokenizer, word_tokenize


# nltk.download('averaged_perceptron_tagger')
# nltk.download('wordnet')
# nltk.download('stopwords')
# nltk.download('punkt')

In [9]:
import pandas as pd
from nltk.tokenize import TweetTokenizer
import contractions
import re

tokenizer = TweetTokenizer(strip_handles=True, reduce_len=True)

def preprocess_text(text):
    # Step 1: Expand contractions
    text = contractions.fix(text)
    
    # Step 2: Remove URLs starting with http or https
    text = re.sub(r'http[s]?://[^\s]+', '', text)
    
    # Step 3: Remove URLs with common top-level domains (TLDs)
    text = re.sub(r'(?:www\.)?[a-zA-Z0-9-]+\.(com|org|net|edu|gov|io)[^\s]*', '', text)
    
    # Step 4: Remove www. prefixes not covered by previous steps
    text = re.sub(r'www\.[a-zA-Z0-9-]+\.[^\s]+', '', text)
    
    # Step 5: Remove mentions
    text = re.sub(r'@[A-Za-z0-9_]+', '', text)
    
    # Step 6: Remove all punctuation
    text = re.sub(r'[^\w\s]', '', text)
    
    # Step 7: Remove non-ASCII characters
    text = ''.join(i for i in text if ord(i) < 128)
    
    # Step 8: Tokenize
    tokens = tokenizer.tokenize(text)
    
    return tokens

# Apply the enhanced preprocessing function to each text entry in the DataFrame
data['text'] = data['text'].apply(preprocess_text)
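
To see what the regular-expression steps do on their own, the sketch below applies the same patterns to a made-up tweet. It is a simplified stand-in for `preprocess_text`: it skips contraction expansion and the non-ASCII filter, and it splits on whitespace instead of using `TweetTokenizer`, so that it runs with the standard library only. The name `clean_tweet` and the sample text are assumptions for illustration.

```python
import re

def clean_tweet(text):
    """Apply the URL, mention, and punctuation patterns, then split on whitespace."""
    text = re.sub(r'http[s]?://[^\s]+', '', text)  # http/https URLs
    text = re.sub(r'(?:www\.)?[a-zA-Z0-9-]+\.(com|org|net|edu|gov|io)[^\s]*', '', text)  # bare domains
    text = re.sub(r'@[A-Za-z0-9_]+', '', text)     # mentions
    text = re.sub(r'[^\w\s]', '', text)            # remaining punctuation
    return text.split()

sample = "@deviousrex check www.example.com or http://t.co/abc, it's cool!! #running"
tokens = clean_tweet(sample)
print(tokens)
```

Because contraction expansion is skipped here, "it's" collapses to "its"; in the full pipeline `contractions.fix` runs first and turns it into "it is" before punctuation is stripped.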

In [10]:
print(data['text'][0])
data

['Only', 'if', 'they', 'try', 'to', 'take', 'the', 'moonshine', 'bottles', 'I', 'will', 'have', 'strapped', 'around', 'my', 'waste', 'for', 'race', 'hydration', 'running']


Unnamed: 0,text,target
0,"[Only, if, they, try, to, take, the, moonshine...",4
1,"[I, think, I, am, fully, rested, from, last, w...",4
2,"[I, am, a, victim, A, victim, of, electrobitch...",0
3,"[i, watched, your, TV, reality, show, thing, l...",4
4,"[sorry, jen, we, can, maybe, watch, a, movie, ...",0
...,...,...
99995,"[Props, for, sure, Thanks, RE]",4
99996,"[sytycd, totally, did, not, think, they, would...",0
99997,"[Stuck, in, ALDO, for, the, next, 6, hourswith...",0
99998,"[raining, sunday]",0


In [11]:
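lemmatizer = WordNetLemmatizer()
# Build the stop-word set once; calling stopwords.words() for every token is very slow
stop_words = set(stopwords.words('english'))

def lemmatize_and_lower(tokens):
    """Lemmatize tokens using their POS tag, lowercase them, and drop stop words."""
    cleaned_tokens = []
    for token, tag in pos_tag(tokens):
        # Map Penn Treebank tags to WordNet POS; everything else is treated as an adjective
        if tag.startswith('NN'):
            pos = 'n'
        elif tag.startswith('VB'):
            pos = 'v'
        else:
            pos = 'a'
        token = lemmatizer.lemmatize(token, pos)
        if len(token) > 0 and token not in string.punctuation and token.lower() not in stop_words:
            cleaned_tokens.append(token.lower())
    return cleaned_tokens

data['text'] = data['text'].apply(lemmatize_and_lower)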

In [12]:
print(data['text'][0])

['try', 'take', 'moonshine', 'bottle', 'strap', 'around', 'waste', 'race', 'hydration', 'run']


In [13]:
data.to_csv('tokenized.csv', index=False)

In [14]:
print(data['text'])

0        [try, take, moonshine, bottle, strap, around, ...
1        [think, fully, rest, last, week, write, thanky...
2        [victim, victim, electrobitching, mock, taste,...
3        [watch, tv, reality, show, thing, last, night,...
4        [sorry, jen, maybe, watch, movie, something, t...
                               ...                        
99995                                [props, sure, thanks]
99996    [sytycd, totally, think, would, send, max, hom...
99997    [stuck, aldo, next, 6, hourswith, die, phone, ...
99998                                       [rain, sunday]
99999                                [sad, night, goodbye]
Name: text, Length: 100000, dtype: object


### TF-IDF Vectorization

In [None]:
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk import pos_tag
import ast
import nltk
# nltk.download('averaged_perceptron_tagger') 
# nltk.download('wordnet')
# nltk.download('stopwords')

In [None]:
tokenized = pd.read_csv('tokenized.csv')

In [None]:
tokenized

In [None]:
from collections import Counter

# Flatten the list of lists into a single list of tokens
all_tokens = []

for token_string in tokenized['text']:
    tokens = ast.literal_eval(token_string)
    for token in tokens:
        all_tokens.append(token)

all_tokens

In [None]:
# Calculate the frequencies of each token
token_frequencies = Counter(all_tokens)

# Create a DataFrame from the token frequencies
word_freq_df = pd.DataFrame(token_frequencies.items(), columns=['Word', 'Frequency']).sort_values(by='Frequency', ascending=False).reset_index(drop=True)

word_freq_df

In [None]:
import matplotlib.pyplot as plt

# Visualize the top 20 most common words
top_n = 20
plt.figure(figsize=(10, 8))
plt.bar(word_freq_df['Word'][:top_n], word_freq_df['Frequency'][:top_n], color='skyblue')
plt.xlabel('Words')
plt.ylabel('Frequency')
plt.xticks(rotation=45, ha='right')
plt.title(f'Top {top_n} Most Common Words')
plt.show()


In [None]:
import matplotlib.pyplot as plt

# Assuming word_freq_df is your DataFrame with the word frequencies

# Histogram of word frequencies
plt.figure(figsize=(10, 6))
plt.hist(word_freq_df['Frequency'], bins=50, color='skyblue', edgecolor='black')
plt.title('Word Frequency Distribution')
plt.xlabel('Word Frequency')
plt.ylabel('Number of Words')
plt.yscale('log')  # Use log scale for better visibility of lower frequencies
plt.show()

# Log-log plot
plt.figure(figsize=(10, 6))
plt.loglog(range(1, len(word_freq_df) + 1), word_freq_df['Frequency'].sort_values(ascending=False), marker='o', linestyle='-', color='skyblue')
plt.title('Word Frequency Distribution (Log-Log Scale)')
plt.xlabel('Rank of word (log scale)')
plt.ylabel('Frequency of word (log scale)')
plt.show()


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
data['text'] = data['text'].apply(lambda x: ' '.join(x))
data

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=1000,
                             max_df=0.01, 
                             min_df=20, 
                             ngram_range=(1,2))

tfidf_matrix = vectorizer.fit_transform(data['text'])
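
The `min_df` and `max_df` arguments prune the vocabulary by document frequency: an integer is read as an absolute document count and a float as a proportion of documents. With 100,000 tweets, `min_df=20` drops terms found in fewer than 20 tweets and `max_df=0.01` drops terms found in more than 1,000 tweets. The sketch below mimics that filter for unigrams on a toy corpus; the helper name and documents are made up, and it leaves out `max_features`, which afterwards keeps only the most frequent remaining terms.

```python
from collections import Counter

def filter_vocabulary(docs, min_df, max_df):
    """Keep terms whose document frequency lies in [min_df, max_df * n_docs]."""
    n_docs = len(docs)
    # Count each term once per document
    df = Counter(tok for doc in docs for tok in set(doc.split()))
    max_count = max_df * n_docs
    return sorted(t for t, c in df.items() if min_df <= c <= max_count)

toy_docs = ["good day", "good movie", "good song", "bad day", "rare word"]
kept = filter_vocabulary(toy_docs, min_df=2, max_df=0.5)
print(kept)
```

Here "good" is dropped for being too common (3 of 5 documents exceeds the 2.5 cap) and the single-occurrence terms are dropped for being too rare, leaving only "day".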

In [None]:
feature_names = vectorizer.get_feature_names_out()
tfidf_pd = pd.DataFrame(tfidf_matrix.toarray(), columns=feature_names)

In [None]:
tfidf_pd

In [None]:
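# index=False keeps the row index out of the file, so it is not read back as an extra feature column
tfidf_pd.to_csv('tfidf_pd.csv', index=False)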

## Truncated SVD for dimensionality reduction

In [None]:
tfidf_pd = pd.read_csv('tfidf_pd.csv')

In [None]:


from sklearn import decomposition

svd = decomposition.TruncatedSVD(n_components=200)
svd_df = svd.fit_transform(tfidf_pd)
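
The idea behind the reduction can be sketched with NumPy on a small random matrix. Truncated SVD factors the matrix without centering it first (the main difference from PCA) and keeps only the top-k singular directions. The matrix, seed, and `k` below are arbitrary illustrative choices; scikit-learn's `TruncatedSVD` uses a randomized solver by default, so its output can differ slightly in value and sign from an exact SVD.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((6, 4))   # toy stand-in for a TF-IDF matrix (6 documents, 4 terms)
k = 2                    # number of components to keep

# Thin SVD: X = U @ diag(s) @ Vt, with singular values in descending order
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Reduced representation: project each row onto the top-k right singular vectors
X_reduced = X @ Vt[:k].T

# Best rank-k approximation of X and its relative reconstruction error
X_approx = (U[:, :k] * s[:k]) @ Vt[:k]
rel_error = np.linalg.norm(X - X_approx) / np.linalg.norm(X)
print(X_reduced.shape, round(float(rel_error), 3))
```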

In [None]:
svd_df_pd = pd.DataFrame(svd_df, columns=[f'PC{i+1}' for i in range(svd_df.shape[1])])
svd_df_pd

In [None]:
svd_df_pd.columns

In [None]:
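# index=False keeps the row index out of the file, so it is not read back as an extra feature column
svd_df_pd.to_csv('svd_df.csv', index=False)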

## Model Training

In [None]:
import pandas as pd

X = pd.read_csv('svd_df.csv')
y_ = pd.read_csv('tokenized.csv')['target']

In [None]:
# Convert classes: 0 remains 0, and 4 becomes 1
y = (y_ == 4).astype(int)
y

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

In [None]:
from sklearn.linear_model import LinearRegression
reg = LinearRegression().fit(X_train, y_train)

In [None]:
# Store predictions under a new name so the label vector `y` is not overwritten
y_pred_lin = reg.predict(X_test)

In [None]:
from sklearn.metrics import mean_squared_error
from math import sqrt

rmse = sqrt(mean_squared_error(y_test, y_pred_lin))

print("RMSE:", rmse)

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

# Define the model
model = LogisticRegression(solver='saga', max_iter=10000)

# Define a grid of parameters to search over
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}

# Setup the grid search
grid_search = GridSearchCV(model, param_grid, cv=5, scoring='accuracy', verbose=1)

# Fit the grid search to the data
grid_search.fit(X_train, y_train)

print("Best parameters:", grid_search.best_params_)
print("Best score:", grid_search.best_score_)

In [None]:
y_ = grid_search.predict(X_test)

In [None]:
from sklearn.metrics import precision_score, recall_score

precision = precision_score(y_test, y_)
recall = recall_score(y_test, y_)

precision, recall
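
For reference, precision and recall reduce to simple ratios of confusion-matrix counts. The sketch below computes them by hand on made-up labels; the helper name and data are illustrative only.

```python
def precision_recall(y_true, y_pred):
    """Compute precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Precision: how many predicted positives are correct
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: how many actual positives were found
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

toy_true = [1, 1, 1, 0, 0, 0]
toy_pred = [1, 1, 0, 1, 0, 0]
toy_precision, toy_recall = precision_recall(toy_true, toy_pred)
print(toy_precision, toy_recall)
```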

In [None]:
import plotly.graph_objects as go
from sklearn.metrics import roc_curve, auc

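# Compute ROC curve and AUC from predicted probabilities
# (hard 0/1 predictions would give a curve with a single interior point)
y_score = grid_search.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, y_score)
roc_auc = auc(fpr, tpr)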

# Create an interactive plot
fig = go.Figure()

# Add Traces
fig.add_trace(go.Scatter(x=fpr, y=tpr, mode='lines', name='ROC Curve',
                         line=dict(color='darkorange'),
                         showlegend=True))

fig.add_trace(go.Scatter(x=[0, 1], y=[0, 1], mode='lines', name='Chance',
                         line=dict(color='navy', dash='dash'),
                         showlegend=False))

# Add AUC in the legend
fig.update_layout(title=f'ROC Curve (AUC = {roc_auc:.2f})',
                  xaxis_title='False Positive Rate',
                  yaxis_title='True Positive Rate',
                  xaxis=dict(showgrid=False),
                  yaxis=dict(showgrid=False),
                  template="plotly_white")

# Show figure
fig.show()
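
A ROC curve is traced by sweeping a decision threshold over continuous scores, so it needs probabilities or decision values rather than hard 0/1 predictions. The sketch below builds the curve by hand for toy data and compares both inputs; the helper names and numbers are assumptions for illustration.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) lists obtained by sweeping a threshold over the scores."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    fpr, tpr = [0.0], [0.0]
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 0)
        fpr.append(fp / n_neg)
        tpr.append(tp / n_pos)
    return fpr, tpr

def trapezoid_auc(fpr, tpr):
    """Area under the piecewise-linear ROC curve."""
    return sum((fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2 for i in range(1, len(fpr)))

toy_true = [0, 0, 1, 1]
toy_proba = [0.1, 0.6, 0.35, 0.8]
toy_hard = [1 if p >= 0.5 else 0 for p in toy_proba]

fpr_proba, tpr_proba = roc_points(toy_true, toy_proba)
fpr_hard, tpr_hard = roc_points(toy_true, toy_hard)
auc_proba = trapezoid_auc(fpr_proba, tpr_proba)
auc_hard = trapezoid_auc(fpr_hard, tpr_hard)
print(len(fpr_proba), auc_proba)
print(len(fpr_hard), auc_hard)
```

With hard labels the curve has only one point between (0, 0) and (1, 1), and the area under it understates how well the scores rank the examples.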


In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, roc_auc_score
from sklearn.linear_model import LogisticRegression

In [None]:
# Define the logistic regression model using the 'saga' solver
model = LogisticRegression(solver='saga', random_state=42, max_iter=10000, tol=1e-4)

# Define a grid of hyperparameter values to search over
param_grid = {
    'C': [0.01, 0.1, 1, 10],
    'penalty': ['l1', 'l2']  # saga supports both L1 and L2 regularization
}

# Use scikit-learn's built-in ROC AUC scorer, which scores continuous model outputs
# (the `needs_proba` argument of make_scorer is deprecated in recent scikit-learn releases)
auc_scorer = 'roc_auc'

# Set up GridSearchCV
grid_search = GridSearchCV(model, param_grid, scoring=auc_scorer, cv=5, verbose=1)

# Fit GridSearchCV to the training data
grid_search.fit(X_train, y_train)

# After fitting, you can check the best parameters and the best AUC score
print(f"Best parameters found: {grid_search.best_params_}")
print(f"Best AUC score: {grid_search.best_score_}")

# Optionally, evaluate the best model on the test set
best_model = grid_search.best_estimator_
y_ = best_model.predict_proba(X_test)[:, 1]  # Get probability estimates of the positive class
test_auc = roc_auc_score(y_test, y_)

print(f"Test AUC score: {test_auc}")


In [None]:
import plotly.graph_objects as go
from sklearn.metrics import roc_curve, auc

# Compute ROC curve and AUC
fpr, tpr, _ = roc_curve(y_test, y_)
roc_auc = auc(fpr, tpr)

# Create an interactive plot
fig = go.Figure()

# Add Traces
fig.add_trace(go.Scatter(x=fpr, y=tpr, mode='lines', name='ROC Curve',
                         line=dict(color='darkorange'),
                         showlegend=True))

fig.add_trace(go.Scatter(x=[0, 1], y=[0, 1], mode='lines', name='Chance',
                         line=dict(color='navy', dash='dash'),
                         showlegend=False))

# Add AUC in the legend
fig.update_layout(title=f'ROC Curve (AUC = {roc_auc:.2f})',
                  xaxis_title='False Positive Rate',
                  yaxis_title='True Positive Rate',
                  xaxis=dict(showgrid=False),
                  yaxis=dict(showgrid=False),
                  template="plotly_white")

# Show figure
fig.show()