<a href="https://colab.research.google.com/github/danielhuynh0/social-media-bias-ml/blob/main/MachineLearningFinalProject.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Detecting Bias in Social Media Posts
by Alby Alex, Daniel Huynh, James LaCava, and Eric Li

###Section 1: Data Preprocessing

In [None]:
from google.colab import drive
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

drive.mount('/content/drive')

path = "drive/MyDrive/Yangfeng Ji Final Project/political_social_media.csv"
df_stuff = pd.read_csv(path, encoding = "ISO-8859-1")
vector = CountVectorizer()
X = vector.fit_transform(df_stuff['text'])
Y = df_stuff['bias']


Mounted at /content/drive


###Section 2: Data Splits

In [None]:
from sklearn.model_selection import train_test_split
Xtrn, Xrst, Ytrn, Yrst = train_test_split(X, Y, test_size=0.8, random_state = 201)
Xtst, Xval, Ytst, Yval = train_test_split(Xrst, Yrst, test_size=0.5, random_state = 201)
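
Because `test_size=0.8` is passed to the first call, only 20% of the rows end up in the training set, and the remaining 80% is halved into test and validation sets. The sketch below checks those proportions on a synthetic array; the 5,000-row size and the `_d` variable names are assumptions for illustration and are not read from the CSV.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real feature matrix and labels (hypothetical size).
X_demo = np.arange(5000).reshape(-1, 1)
Y_demo = np.zeros(5000, dtype=int)

# Same two calls as in the cell above, applied to the stand-in data.
Xtrn_d, Xrst_d, Ytrn_d, Yrst_d = train_test_split(X_demo, Y_demo, test_size=0.8, random_state=201)
Xtst_d, Xval_d, Ytst_d, Yval_d = train_test_split(Xrst_d, Yrst_d, test_size=0.5, random_state=201)

# Number of rows in the (train, test, validation) splits.
split_sizes = (Xtrn_d.shape[0], Xtst_d.shape[0], Xval_d.shape[0])
print(split_sizes)
```

If a larger training set is wanted, lowering `test_size` in the first call is the only change needed.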

###Section 3: Build Classifiers
####Support Vector Machine

In [None]:
from sklearn.svm import SVC
default_SVM = SVC()
default_SVM.fit(Xtrn, Ytrn)
accuracy = default_SVM.score(Xtst, Ytst)
print("The default SVM classifier prediction accuracy is {}".format(accuracy))

The default SVM classifier prediction accuracy is 0.7355


####Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
default_logistic = LogisticRegression()
default_logistic.fit(Xtrn, Ytrn)
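# NOTE: this scores on the training split, so the number printed below is training
# accuracy and is not directly comparable to the SVM test accuracy above.
# default_logistic.score(Xtst, Ytst) would give the held-out test accuracy.
accuracy = default_logistic.score(Xtrn, Ytrn)
print("The default Logistic Regression classifier prediction accuracy is {}".format(accuracy))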

The default Logistic Regression classifier prediction accuracy is 0.996


###Section 4: Hyper-Parameter Tuning
####Support Vector Machine

In [None]:
kernels = ['linear', 'rbf', 'poly']
C_vals = [0.01, 0.1, 1, 10, 100]
gamma_vals = [0.01, 0.025, 0.05, 0.075, 0.1, 1]
degree = [2, 3, 4]

max_parameters = []
max_acc = -1
for k in kernels:
  for c in C_vals:
    if(k == 'linear'):
      val_classifier = SVC(kernel=k, C = c)
      val_classifier.fit(Xtrn,Ytrn)
      accuracy = val_classifier.score(Xval, Yval)
      print("The SVM classifier validation accuracy with hyperparameters [kernel = {}, C = {}]: {}".format(k, c, accuracy))

      if(accuracy > max_acc):
        max_acc = accuracy
        max_parameters = [k,c]
    else:
      for g in gamma_vals:
        if(k == 'poly'):
          for d in degree:
            val_classifier = SVC(kernel=k, C = c, gamma = g, degree = d)
            val_classifier.fit(Xtrn,Ytrn)
            accuracy = val_classifier.score(Xval, Yval)
            print("The SVM classifier validation accuracy with hyperparameters [kernel = {}, C = {}, gamma = {}, degree = {}]: {}".format(k, c, g, d, accuracy))

            if(accuracy > max_acc):
              max_acc = accuracy
              max_parameters = [k,c,g,d]
          continue
        val_classifier = SVC(kernel=k, C = c, gamma = g)
        val_classifier.fit(Xtrn,Ytrn)
        accuracy = val_classifier.score(Xval, Yval)
        print("The SVM classifier validation accuracy with hyperparameters [kernel = {}, C = {}, gamma = {}]: {}".format(k, c, g, accuracy))

        if(accuracy > max_acc):
          max_acc = accuracy
          max_parameters = [k,c,g]


# Use the best hyperparameters to make the classifier
k = max_parameters[0]
c = max_parameters[1]
if(k == 'linear'):
  print("The SVM classifier with the highest validation accuracy has hyperparameters [kernel = {}, C = {}] with validation accuracy {}".format(k, c, max_acc))
  # gamma is ignored by the linear kernel, so it is not passed here
  classifier = SVC(kernel = k, C = c)
  classifier.fit(Xtrn, Ytrn)
  accuracy = classifier.score(Xtst, Ytst)
else:
  g = max_parameters[2]
  if(k == 'poly'):
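    # Read the best degree before printing it; otherwise d still holds the last value from the search loop
    d = max_parameters[3]
    print("The SVM classifier with the highest validation accuracy has hyperparameters [kernel = {}, C = {}, gamma = {}, degree = {}] with validation accuracy {}".format(k, c, g, d, max_acc))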
    classifier = SVC(kernel = k, C = c, gamma = g, degree = d)
    classifier.fit(Xtrn, Ytrn)
    accuracy = classifier.score(Xtst, Ytst)
  else:
    print("The SVM classifier with the highest validation accuracy has hyperparameters [kernel = {}, C = {}, gamma = {}] with validation accuracy {}".format(k, c, g, max_acc))
    classifier = SVC(kernel = k, C = c, gamma = g)
    classifier.fit(Xtrn, Ytrn)
    accuracy = classifier.score(Xtst, Ytst)
print("The test accuracy of the SVM classifier with the best hyperparameters is {}".format(accuracy))

The SVM classifier validation accuracy with hyperparameters [kernel = linear, C = 0.01]: 0.7465
The SVM classifier validation accuracy with hyperparameters [kernel = linear, C = 0.1]: 0.74
The SVM classifier validation accuracy with hyperparameters [kernel = linear, C = 1]: 0.727
The SVM classifier validation accuracy with hyperparameters [kernel = linear, C = 10]: 0.714
The SVM classifier validation accuracy with hyperparameters [kernel = linear, C = 100]: 0.6375
The SVM classifier validation accuracy with hyperparameters [kernel = rbf, C = 0.01, gamma = 0.01]: 0.7425
The SVM classifier validation accuracy with hyperparameters [kernel = rbf, C = 0.01, gamma = 0.025]: 0.7425
The SVM classifier validation accuracy with hyperparameters [kernel = rbf, C = 0.01, gamma = 0.05]: 0.7425
The SVM classifier validation accuracy with hyperparameters [kernel = rbf, C = 0.01, gamma = 0.075]: 0.7425
The SVM classifier validation accuracy with hyperparameters [kernel = rbf, C = 0.01, gamma = 0.1]: 0.
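
The nested loops in the cell above can also be written with scikit-learn's `ParameterGrid`, which enumerates kernel-specific combinations from a list of dictionaries. This is a minimal sketch, not the code that produced the output above: it runs on synthetic data from `make_classification` (an assumption standing in for the tweet features) and uses a smaller grid so that it finishes quickly.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import ParameterGrid, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data; the real notebook would use Xtrn/Ytrn and Xval/Yval.
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=0)
X_fit, X_hold, y_fit, y_hold = train_test_split(X_demo, y_demo, test_size=0.5, random_state=0)

# One dictionary per kernel, so each kernel only sees the parameters it uses.
param_grid = ParameterGrid([
    {"kernel": ["linear"], "C": [0.1, 1, 10]},
    {"kernel": ["rbf"], "C": [0.1, 1, 10], "gamma": [0.01, 0.1]},
    {"kernel": ["poly"], "C": [0.1, 1, 10], "gamma": [0.01, 0.1], "degree": [2, 3]},
])

# Keep the combination with the highest accuracy on the held-out split.
best_params, best_acc = None, -1.0
for params in param_grid:
    acc = SVC(**params).fit(X_fit, y_fit).score(X_hold, y_hold)
    if acc > best_acc:
        best_params, best_acc = params, acc

print(best_params, best_acc)
```

Since `best_params` is a dictionary, the final model can be rebuilt with `SVC(**best_params)` without a separate branch per kernel.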

####Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

C_vals = [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
tol_vals = [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
solvers = ['lbfgs','liblinear', 'newton-cg', 'sag', 'saga']
max_acc = -1
max_c = 1
max_tol = 1
best_solver= 'lbfgs'

for s in solvers:
  for t in tol_vals:
    for c in C_vals:
      classifier = LogisticRegression(C= c, max_iter= 100000, solver = s,  tol= t)
      classifier.fit(Xtrn,Ytrn)
      accuracy = classifier.score(Xval, Yval)
      print("The LR classifier validation accuracy with hyperparameters [C = {}, tolerance = {}, solver = {}]: {}".format(c, t, s, accuracy))
      if(accuracy > max_acc):
        max_acc = accuracy
        max_c = c
        max_tol = t
        best_solver = s



print("The LR classifier with the highest validation accuracy has hyperparameters [C = {}, tolerance = {}, solver = {}] with validation accuracy {}".format(max_c, max_tol, best_solver, max_acc))

classifier = LogisticRegression(C = max_c, max_iter= 100000, solver = best_solver,  tol= max_tol)
classifier.fit(Xtrn,Ytrn)
accuracy = classifier.score(Xtst, Ytst)

print("The test accuracy of the LR classifier with the best hyperparameters is {}".format(accuracy))


The LR classifier validation accuracy with hyperparameters [C = 0.0001, tolerance = 0.0001, solver = lbfgs]: 0.743
The LR classifier validation accuracy with hyperparameters [C = 0.001, tolerance = 0.0001, solver = lbfgs]: 0.743
The LR classifier validation accuracy with hyperparameters [C = 0.01, tolerance = 0.0001, solver = lbfgs]: 0.748
The LR classifier validation accuracy with hyperparameters [C = 0.1, tolerance = 0.0001, solver = lbfgs]: 0.7525
The LR classifier validation accuracy with hyperparameters [C = 1.0, tolerance = 0.0001, solver = lbfgs]: 0.742
The LR classifier validation accuracy with hyperparameters [C = 10.0, tolerance = 0.0001, solver = lbfgs]: 0.7405
The LR classifier validation accuracy with hyperparameters [C = 100.0, tolerance = 0.0001, solver = lbfgs]: 0.72
The LR classifier validation accuracy with hyperparameters [C = 0.0001, tolerance = 0.001, solver = lbfgs]: 0.743
The LR classifier validation accuracy with hyperparameters [C = 0.001, tolerance = 0.001, so

###Section 5: Analysis

**Based on the results above, which classifier is better, and why?**

Based on our testing, the Logistic Regression classifier with its best hyperparameters generally performed better than the SVM with its best hyperparameters. This can be attributed to multiple reasons. First, the SVM classifier may be affected by the curse of dimensionality. We chose to use the CountVectorizer, which takes a list of strings and converts each string to a vector holding the number of occurrences of each vocabulary word in that string. This results in a very large vector with one feature per distinct word. An SVM classifier may be more sensitive to these additional features than a Logistic Regression classifier, which may lead to poorer generalization performance from the SVM classifier.

Another potential reason for the better performance of the Logistic Regression classifier is the lack of balance in the dataset. There are about three times as many neutral tweets as partisan tweets in the dataset. This presents a problem for the SVM because it may be more prone than Logistic Regression to overfitting to the majority class.
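
One way to gauge how much the imbalance matters is to compare each classifier against a baseline that always predicts the most frequent class. The following is a minimal sketch using scikit-learn's `DummyClassifier`; the 3:1 label vector and the `majority`/`minority` label names are made up for illustration and are not the dataset's actual counts.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels with a 3:1 class ratio.
y_demo = np.array(["majority"] * 75 + ["minority"] * 25)
X_demo = np.zeros((len(y_demo), 1))  # features are ignored by this baseline

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
baseline_acc = baseline.score(X_demo, y_demo)
print(baseline_acc)
```

With a 3:1 ratio this baseline reaches an accuracy of 0.75 without learning anything, so accuracies close to that value are best read relative to it.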

Overall, the difference in performance between the two classifiers is small. The strategies discussed below may improve the performance of both classifiers and could change which one comes out ahead.

**To further improve classification accuracy, what strategies can you use, and why do you think they will be helpful?**

One strategy is to add more feature engineering and preprocessing. The current preprocessing with CountVectorizer simply counts word frequencies, which doesn’t take into account that words such as “is” and “and” are very frequent but carry little meaning. These words appear in both neutral and partisan tweets, so they provide little information for telling the two types apart. To fix this, we can configure the vectorizer to remove such common words (stop words) so that the remaining words are more indicative. Additional steps could also help, such as lemmatization, which maps different inflected forms of a word to a single feature. Another way to accomplish this is to use TfidfVectorizer, which down-weights words that appear in many of the tweets rather than counting every word equally.
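
As a rough illustration of the vectorizer options mentioned above, the sketch below compares a plain `CountVectorizer`, one with scikit-learn's built-in English stop-word list, and a `TfidfVectorizer`. The three example sentences are invented for this sketch and do not come from the dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Invented example posts (not taken from political_social_media.csv).
docs = [
    "the senator is voting on the new budget",
    "the weather is nice and the game is on",
    "vote for the budget and support the senator",
]

# Plain counts keep every token, including very common words.
plain = CountVectorizer().fit(docs)

# stop_words="english" drops words from scikit-learn's built-in stop-word list.
no_stop = CountVectorizer(stop_words="english").fit(docs)

# TF-IDF additionally down-weights terms that occur in many documents.
tfidf = TfidfVectorizer(stop_words="english")
X_tfidf = tfidf.fit_transform(docs)

print(sorted(plain.vocabulary_))
print(sorted(no_stop.vocabulary_))
print(X_tfidf.shape)
```

In the notebook, swapping `CountVectorizer()` for one of these in Section 1 would be the only change needed before re-running the splits and classifiers.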

Another potential strategy is to improve the models themselves. Beyond testing more hyperparameters, model accuracy could potentially be improved with ensemble methods such as bagging. Bagging trains several models on random subsets of the training data and uses the average or majority vote of their predictions as the final result. With SVMs, for example, multiple SVM models could be trained and combined via BaggingClassifier from sklearn.ensemble. This could also reduce variance.
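
A bagged SVM along these lines might look like the sketch below. It uses synthetic data from `make_classification` in place of the tweet features, and the choices of 10 estimators, 50% samples per estimator, and a linear kernel with `C=0.1` are illustrative assumptions rather than tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data; the notebook would use Xtrn/Ytrn and Xval/Yval instead.
X_demo, y_demo = make_classification(n_samples=400, n_features=20, random_state=0)
X_fit, X_hold, y_fit, y_hold = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

# Each of the 10 SVMs is trained on a random 50% sample of the training rows;
# predictions are combined by majority vote.
bagged_svm = BaggingClassifier(
    SVC(kernel="linear", C=0.1),
    n_estimators=10,
    max_samples=0.5,
    random_state=0,
)
bagged_svm.fit(X_fit, y_fit)
bagged_acc = bagged_svm.score(X_hold, y_hold)
print(bagged_acc)
```

Whether this helps on the tweet data would need to be checked on the validation split, in the same way as the other hyperparameters.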

A third potential strategy is to find more examples of partisan tweets to balance the dataset. Currently, the dataset contains about three times as many neutral tweets as partisan tweets. Ideally, the dataset would contain a similar number of both types. Fixing this imbalance may improve the performance of the SVM classifier on this dataset by giving it more data points from the minority class to base its decision boundary on.
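
When collecting more tweets is not practical, a related option that the notebook above does not use is to reweight the classes during training. Both `SVC` and `LogisticRegression` accept `class_weight="balanced"`. The sketch below tries this on synthetic data with a roughly 3:1 class ratio (an assumption for illustration) and reports recall on the minority class alongside accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic imbalanced data: about 75% class 0 and 25% class 1.
X_demo, y_demo = make_classification(
    n_samples=800, n_features=20, weights=[0.75, 0.25], random_state=0
)
X_fit, X_hold, y_fit, y_hold = train_test_split(
    X_demo, y_demo, test_size=0.5, stratify=y_demo, random_state=0
)

# class_weight="balanced" weights each class inversely to its frequency.
models = {
    "svm": SVC(kernel="linear", C=0.1, class_weight="balanced"),
    "logreg": LogisticRegression(C=0.1, class_weight="balanced", max_iter=1000),
}

results = {}
for name, model in models.items():
    model.fit(X_fit, y_fit)
    accuracy = model.score(X_hold, y_hold)
    minority_recall = recall_score(y_hold, model.predict(X_hold), pos_label=1)
    results[name] = (accuracy, minority_recall)
    print(name, accuracy, minority_recall)
```

Reweighting can trade some overall accuracy for better minority-class recall, so both numbers are worth reporting.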