# Unveiling Twitter Sentiments: A Big Data Dive into Twitter Sentiments during the 2020 US Elections

This notebook contains the code for the **regular ML approach** on both datasets (Biden and Trump), so that we can compare it with our local approach.

<!--  -->

### Initiating SparkSession

In [0]:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .master("local[*]") \
    .appName("regular_ML_solution") \
    .getOrCreate()

sc = spark.sparkContext

### Importing Libraries

In [0]:
%matplotlib inline 
import matplotlib.pyplot as plt
import seaborn as sns
import pyspark.sql.functions as F
from pyspark.sql.functions import col, lit, when
from pyspark.sql import Row
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml import Pipeline
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import LogisticRegressionWithLBFGS
from pyspark.mllib.evaluation import MulticlassMetrics
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from wordcloud import WordCloud
import pandas as pd
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

# Show each column's full content without truncation
pd.set_option('display.max_colwidth', None)

<!--  -->

### Trump DataSet
### Setting up Spark DataFrame 

In [0]:
#Importing the DF and casting column types
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, BooleanType, DoubleType, FloatType

custom_schema = StructType([
    StructField("user_location", StringType(), True),
    StructField("continent", StringType(), True),
    StructField("is_english", BooleanType(), True),
    StructField("tweet_len", IntegerType(), True),
    StructField("word_count", IntegerType(), True),
    StructField("avg_word_len", DoubleType(), True),
    StructField("filtered_tweet", StringType(), True),
    StructField("subjectivity", FloatType(), True),
    StructField("polarity", FloatType(), True),
    StructField("sentiment", StringType(), True),
])

df_t = spark.read.csv("/FileStore/tables/hashtag_donaldtrump-ML.csv", schema=custom_schema)
df_t.persist()

DataFrame[user_location: string, continent: string, is_english: boolean, tweet_len: int, word_count: int, avg_word_len: double, filtered_tweet: string, subjectivity: float, polarity: float, sentiment: string]

### Dropping Columns that are not needed

In [0]:
# Dropping some unnecessary columns
columns_to_drop = ['user_location','continent','is_english']
raw_df_t = df_t  # keep a handle on the persisted DataFrame so it can be released
df_t = raw_df_t.drop(*columns_to_drop)
raw_df_t.unpersist()
df_t.cache()

DataFrame[tweet_len: int, word_count: int, avg_word_len: double, filtered_tweet: string, subjectivity: float, polarity: float, sentiment: string]

In [0]:
df_t.show(5)

+---------+----------+-----------------+--------------------+------------+--------+---------+
|tweet_len|word_count|     avg_word_len|      filtered_tweet|subjectivity|polarity|sentiment|
+---------+----------+-----------------+--------------------+------------+--------+---------+
|      270|        46|5.869565217391305|report administra...|       0.525|     0.3| positive|
|      242|        54|4.481481481481482|white house put t...|  0.44722223|  -0.225| negative|
|      125|        27| 4.62962962962963|curtis james jack...|         0.0|     0.0|  neutral|
|       32|         6|5.333333333333333|ticker covers lat...|         0.9|     0.5| positive|
|      157|        37|4.243243243243243|dj pot us pot gop...|         1.0|    -0.5| negative|
+---------+----------+-----------------+--------------------+------------+--------+---------+
only showing top 5 rows



In [0]:
# Ensuring filtered_tweet is neither null nor empty
df_t = df_t.filter(df_t.filtered_tweet.isNotNull() & (df_t.filtered_tweet != ''))

<!--  -->

### Building ML model

In [0]:
from pyspark.mllib.regression import LabeledPoint
# Feature Preparation
def parse_point(row):
    features = [row['tweet_len'], row['word_count'], row['avg_word_len'], row['subjectivity']]
    label = 1.0 if row['sentiment'] == 'positive' else 0.0
    return LabeledPoint(label, features)

# Preparing the training and test data
positive_negative_df_t = df_t.filter((df_t["sentiment"] == "positive") | (df_t["sentiment"] == "negative"))
train_df_t, test_df_t = positive_negative_df_t.randomSplit([0.7, 0.3], seed=654321)

train_rdd_t = train_df_t.rdd.map(parse_point)
test_rdd_t = test_df_t.rdd.map(parse_point)

# Model Training
model_t = LogisticRegressionWithLBFGS.train(train_rdd_t)

# Predictions
predictions_t = test_rdd_t.map(lambda p: (float(model_t.predict(p.features)), p.label))

<!--  -->

### Performing Model Evaluation 

We evaluate the model with the mean absolute error (MAE), computed as shown in the lecture material. MAE is more commonly used for regression models, but because both the predictions and the labels here are either 0 or 1, the MAE equals the misclassification rate (1 - accuracy).

In [0]:
# Calculate absolute errors
absolute_errors_t = predictions_t.map(lambda pred: abs(pred[0] - pred[1]))

# Calculate mean absolute error
mean_absolute_error_t = absolute_errors_t.mean()

print("Mean Absolute Error:", mean_absolute_error_t)


Mean Absolute Error: 0.36129391646802783


<!--  -->

### Classifying Neutral Tweets

So far we have trained and tested only on tweets labeled positive or negative. Now we use the trained model to classify the neutral tweets as positive or negative.

In [0]:
neutral_df_t = df_t.filter(df_t["sentiment"] == "neutral")
neutral_df_t = neutral_df_t.drop("filtered_tweet", "polarity")
neutral_rdd_t = neutral_df_t.rdd.map(parse_point)
neutral_predictions_t = neutral_rdd_t.map(lambda p: (float(model_t.predict(p.features)), p.label))
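One detail worth spelling out: `parse_point` maps every sentiment other than `positive` to the label 0.0, so for neutral tweets the label is only a placeholder and just the prediction carries information. The sketch below shows this in plain Python. `encode_label` is a hypothetical helper that copies the labeling rule from `parse_point`, and the pairs are made-up illustration data.

```python
def encode_label(sentiment):
    # Same rule as in parse_point: positive -> 1.0, everything else -> 0.0
    return 1.0 if sentiment == 'positive' else 0.0

labels = {s: encode_label(s) for s in ['positive', 'negative', 'neutral']}
print(labels)

# Hypothetical (prediction, label) pairs, shaped like the elements of neutral_predictions_t
neutral_pairs = [
    (1.0, encode_label('neutral')),
    (0.0, encode_label('neutral')),
    (1.0, encode_label('neutral')),
]

# Only the first element (the prediction) should be read for neutral tweets
predicted_positive = sum(1 for pred, _ in neutral_pairs if pred == 1.0)
predicted_negative = sum(1 for pred, _ in neutral_pairs if pred == 0.0)
print("Predicted positive:", predicted_positive)
print("Predicted negative:", predicted_negative)
```

This is why an error metric such as the MAE would not be meaningful on the neutral predictions: there is no true positive or negative label to compare against.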

<!--  -->

We count four groups of tweets: labeled positive, labeled negative, neutral predicted positive, and neutral predicted negative. From these we compute the ratio of positive tweets to all tweets for this candidate, which we later compare with the other candidate's ratio. Our assumption is that the candidate with the higher ratio should win. Checking this against the actual result shows whether Twitter sentiment reflected the real election outcome.


In [0]:
# Count labeled and predicted-from-neutral tweets, then compute positive tweets / all tweets

from pyspark.sql.functions import col

positive_neutral_count = neutral_predictions_t.filter(lambda pred: pred[0] == 1.0).count()

negative_neutral_count = neutral_predictions_t.filter(lambda pred: pred[0] == 0.0).count()

positive_count = positive_negative_df_t.filter(col("sentiment") == "positive").count()

negative_count = positive_negative_df_t.filter(col("sentiment") == "negative").count()

# Named total rather than sum to avoid shadowing Python's built-in sum()
total = positive_neutral_count + negative_neutral_count + positive_count + negative_count
positives = positive_neutral_count + positive_count
ratio = positives / total


# Display the counts
print("Positive Predictions Coming From Neutrals:", positive_neutral_count)
print("Negative Predictions Coming From Neutrals:", negative_neutral_count)
print("Positive Predictions:", positive_count)
print("Negative Predictions:", negative_count)
print("Positive/All Tweets Ratio For Trump:", ratio)

Positive Predictions Coming From Neutrals: 524360
Negative Predictions Coming From Neutrals: 1144
Positive Predictions: 327997
Negative Predictions: 186253
Positive/All Tweets Ratio For Trump: 0.8197679451100933
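As a sanity check, the ratio can be recomputed in plain Python from the counts printed above. The sketch also derives two extra quantities that are not part of the original analysis and are only meant as context: the share of neutral tweets the model assigned to the positive class, and the error rate of a hypothetical baseline that always predicts "positive" on the labeled tweets. Note that this baseline uses all labeled tweets, while the MAE above was measured on the 30% test split, so the comparison is approximate.

```python
# Counts copied from the printed output above (Trump dataset)
positive_neutral_count = 524360
negative_neutral_count = 1144
positive_count = 327997
negative_count = 186253

total = positive_neutral_count + negative_neutral_count + positive_count + negative_count
ratio = (positive_neutral_count + positive_count) / total

# Share of neutral tweets that the model assigned to the positive class
neutral_total = positive_neutral_count + negative_neutral_count
neutral_positive_share = positive_neutral_count / neutral_total

# Error rate of a hypothetical baseline that always predicts "positive"
baseline_error = negative_count / (positive_count + negative_count)

print("Total tweets:", total)
print("Positive/all ratio:", ratio)
print("Share of neutrals predicted positive:", neutral_positive_share)
print("Always-positive baseline error:", baseline_error)
```

If these extra numbers are read as intended, the model sends almost all neutral tweets to the positive class, and its test error is close to that of the always-positive baseline, so the ratio should be interpreted with some caution.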


<!--  -->

### Biden Dataset
Now we repeat the same steps for the Joe Biden dataset.

### Setting up Spark DataFrame

In [0]:
df_b = spark.read.csv("/FileStore/tables/hashtag_joebiden-ML.csv", schema=custom_schema)
df_b.persist()

DataFrame[user_location: string, continent: string, is_english: boolean, tweet_len: int, word_count: int, avg_word_len: double, filtered_tweet: string, subjectivity: float, polarity: float, sentiment: string]

### Dropping Columns that are no longer needed

In [0]:
# Dropping some unnecessary columns
columns_to_drop = ['user_location','continent','is_english']
raw_df_b = df_b  # keep a handle on the persisted DataFrame so it can be released
df_b = raw_df_b.drop(*columns_to_drop)
raw_df_b.unpersist()
df_b.cache()

DataFrame[tweet_len: int, word_count: int, avg_word_len: double, filtered_tweet: string, subjectivity: float, polarity: float, sentiment: string]

In [0]:
# Ensuring filtered_tweet is neither null nor empty
df_b = df_b.filter(df_b.filtered_tweet.isNotNull() & (df_b.filtered_tweet != ''))

<!--  -->

### Building ML model

In [0]:
# Preparing the training and test data
positive_negative_df_b = df_b.filter((df_b["sentiment"] == "positive") | (df_b["sentiment"] == "negative"))
train_df_b, test_df_b = positive_negative_df_b.randomSplit([0.7, 0.3], seed=654321)

train_rdd_b = train_df_b.rdd.map(parse_point)
test_rdd_b = test_df_b.rdd.map(parse_point)

# Model Training
model_b = LogisticRegressionWithLBFGS.train(train_rdd_b)

# Predictions
predictions_b = test_rdd_b.map(lambda p: (float(model_b.predict(p.features)), p.label))

### Performing Model Evaluation

In [0]:
# Calculate absolute errors
absolute_errors_b = predictions_b.map(lambda pred: abs(pred[0] - pred[1]))

# Calculate mean absolute error
mean_absolute_error_b = absolute_errors_b.mean()

print("Mean Absolute Error:", mean_absolute_error_b)

Mean Absolute Error: 0.2951331740553935


<!--  -->

### Classifying Neutral Tweets 

In [0]:
neutral_df_b = df_b.filter(df_b["sentiment"] == "neutral")
neutral_df_b = neutral_df_b.drop("filtered_tweet", "polarity")
neutral_rdd_b = neutral_df_b.rdd.map(parse_point)
neutral_predictions_b = neutral_rdd_b.map(lambda p: (float(model_b.predict(p.features)), p.label))

<!--  -->

### Calculating Winning Ratio based on tweets 

In [0]:
# Count labeled and predicted-from-neutral tweets, then compute positive tweets / all tweets

from pyspark.sql.functions import col

positive_neutral_count = neutral_predictions_b.filter(lambda pred: pred[0] == 1.0).count()

negative_neutral_count = neutral_predictions_b.filter(lambda pred: pred[0] == 0.0).count()

positive_count = positive_negative_df_b.filter(col("sentiment") == "positive").count()

negative_count = positive_negative_df_b.filter(col("sentiment") == "negative").count()

# Named total rather than sum to avoid shadowing Python's built-in sum()
total = positive_neutral_count + negative_neutral_count + positive_count + negative_count
positives = positive_neutral_count + positive_count
ratio = positives / total


# Display the counts
print("Positive Predictions Coming From Neutrals:", positive_neutral_count)
print("Negative Predictions Coming From Neutrals:", negative_neutral_count)
print("Positive Predictions:", positive_count)
print("Negative Predictions:", negative_count)
print("Positive/All Tweets Ratio For Biden:", ratio)

Positive Predictions Coming From Neutrals: 382414
Negative Predictions Coming From Neutrals: 4
Positive Predictions: 265638
Negative Predictions: 110776
Positive/All Tweets Ratio For Biden: 0.8540124823412824


### The regular MLlib approach also predicts that Biden should win, since his positive-tweet ratio is higher (0.854 vs. 0.820 for Trump), and he did.
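The comparison can be reproduced in plain Python from the counts printed in the two ratio cells. The helper names `positive_ratio` and `labeled_only_ratio` are hypothetical and introduced only for this sketch. The second ratio, which ignores the neutral tweets entirely, is an additional check that is not part of the original analysis.

```python
def positive_ratio(positive_neutral, negative_neutral, positive, negative):
    # Positive tweets (labeled + predicted from neutrals) divided by all tweets
    total = positive_neutral + negative_neutral + positive + negative
    return (positive_neutral + positive) / total

def labeled_only_ratio(positive, negative):
    # Same idea, but using only tweets that were labeled positive or negative
    return positive / (positive + negative)

# Counts copied from the printed outputs of the two ratio cells
counts = {
    "Trump": dict(positive_neutral=524360, negative_neutral=1144,
                  positive=327997, negative=186253),
    "Biden": dict(positive_neutral=382414, negative_neutral=4,
                  positive=265638, negative=110776),
}

ratios = {name: positive_ratio(**c) for name, c in counts.items()}
labeled_ratios = {name: labeled_only_ratio(c["positive"], c["negative"])
                  for name, c in counts.items()}

leader = max(ratios, key=ratios.get)
labeled_leader = max(labeled_ratios, key=labeled_ratios.get)

for name in counts:
    print(name, "ratio:", ratios[name], "labeled-only ratio:", labeled_ratios[name])
print("Higher ratio:", leader)
print("Higher labeled-only ratio:", labeled_leader)
```

Both variants point to the same candidate, which suggests the conclusion does not depend on how the neutral tweets were classified.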