Project for my graduate-level ML class (COMP 551). The paper is in "writeup.pdf" (the last file above).
We analyzed AdaBoost, linear SVC, radial basis function SVM, random forests, decision trees, and BERT on two NLP text-classification datasets: IMDb (binary sentiment) and 20 Newsgroups (multi-class topic).
Many machine learning algorithms have been developed in recent decades. In this paper we explore the performance of some of the most common models on binary and multi-class text-classification problems. These models include AdaBoost, linear SVC, logistic regression, radial basis function SVM, random forest, decision tree, and BERT. Our results show the effects on model accuracy of regularization, resampling methods such as bagging (bootstrap aggregating) and 5-fold cross-validation, and boosting. We also examine how these strategies shift the bias-variance tradeoff in order to determine the best configuration for each algorithm and dataset. Our highest test accuracies were achieved with BERT: 72.40% on 20 Newsgroups and 94.15% on IMDb.
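As a rough illustration of the classical (non-BERT) pipeline described above, here is a minimal sketch using scikit-learn: TF-IDF features feeding a linear SVC, scored with 5-fold cross-validation. The toy corpus, the `C=1.0` regularization setting, and the pipeline details are illustrative assumptions, not the paper's actual experimental setup.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-in corpus (the real experiments used IMDb and 20 Newsgroups).
docs = (["great movie, loved the acting"] * 5
        + ["wonderful film, a joy to watch"] * 5
        + ["terrible plot, waste of time"] * 5
        + ["awful movie, very boring"] * 5)
labels = [1] * 10 + [0] * 10

# TF-IDF features + linear SVC; C is the inverse regularization strength,
# the knob whose effect on accuracy the paper studies.
pipe = make_pipeline(TfidfVectorizer(), LinearSVC(C=1.0))

# 5-fold (stratified) cross-validation accuracy.
scores = cross_val_score(pipe, docs, labels, cv=5)
print(scores)
```

Swapping `LinearSVC` for `AdaBoostClassifier`, `RandomForestClassifier`, or `DecisionTreeClassifier` reuses the same pipeline, which is how the models in the paper can be compared under identical preprocessing.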