
bllin001/hate-speech-detection


Group: Anna Garcia & Brian Llinas

This is a term project for the Machine Learning class (CS722). The goal of the project is to re-create the methods proposed in a NeurIPS research paper.

Our group re-created the methods proposed in the research paper in a different environment, primarily using Python 3, pandas DataFrames, scikit-learn, the transformers library, and others. Please follow the steps below to execute the code.


File Structure:

├── al_trained_models <- Trained Logistic Regression For Active Learning method
├── dataset
│   ├── bert_CAL_experiment_results_v3.csv <- Resulting dataset from algorithm 2 with BERT & CAL protocol
│   ├── bert_SAL_experiment_results_v3.csv <- Resulting dataset from algorithm 2 with BERT & SAL protocol
│   ├── tfidf_SAL_experiment_results_v2.csv <- Resulting dataset from algorithm 2 with TFIDF & SAL protocol
│   ├── tfidf_CAL_experiment_results_v2.csv <- Resulting dataset from algorithm 2 with TFIDF & CAL protocol
│   ├── train_test_datasetV2.csv <- Training dataset used to train our models
│   ├── experiment_datasetV2.csv <- Our "Experiment-set" to test the models to build the "Dataset"
│   ├── pooling_dataset.csv <- Resulting Dataset from algorithm 1 - Pooling
│   ├── experiment_df_bert.csv <- Resulting dataset from Extra credit proposal algorithm
├── notebooks
│   ├── 722_ActiveLearning_bert.ipynb <- Code used to run the Extra credit proposal
│   ├── 722_PoolingAlgorithmNotebook.ipynb <- Same as pooling.py and pooling_models.py in notebook 
│   ├── 722_Project_CreateDataset.ipynb <- Codes used to build training-set and experiment-set
│   ├── AL-bert-cal.ipynb <- Code for algorithm 2 with BERT & CAL
│   ├── AL-bert-sal.ipynb <- Code for algorithm 2 with BERT & SAL
│   ├── AL-tfidf-cal.ipynb <- Code for algorithm 2 with TFIDF & CAL
│   ├── AL-tfidf-sal.ipynb <- Code for algorithm 2 with TFIDF & SAL
├── pooling_trained_models <- Various trained models for Pooling method
├── pooling.py
├── pooling_models.py
├── ReadMe.md
├── requirements.txt

Local Setup:

  1. Make sure your machine has Python 3 installed (you can check with the command below), or download it from the Python downloads page:
python --version
  2. Clone the repository:
git clone https://github.com/AnnaGarcia1207/722_project.git
  3. cd into the directory and install the required libraries.

We suggest creating a virtual environment for this. On your terminal, run line by line:

python -m venv myenv

myenv/Scripts/activate

pip install -r requirements.txt
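
Note: myenv/Scripts/activate is the Windows activation path; on Linux or macOS, activate the environment with source myenv/bin/activate instead.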

Running 1st Algorithm: Pooling

Note: the models are already trained and the resulting datasets are already generated; they can be found under /dataset and /pooling_trained_models. But you can also re-run the models and create the pooling dataset yourself.

  1. Run the Pooling Models. The program will train and save the models:
python pooling_models.py
  2. Run the Pooling algorithm:
python pooling.py

Sample output:

=========================================
Running LogisticRegression model :
AUC: 0.385757532998961
Optimal threshold: 0.9126040168712445
LogisticRegression extracted 62 / 1000
=========================================
Running NaiveBayes model :
AUC: 0.22649452048351443
Optimal threshold: 0.9888543820117094
NaiveBayes extracted 7 / 1000
=========================================
Running GradientBoostingClassifier model :
AUC: 0.4905845530228202
Optimal threshold: 0.9035055984947704
GradientBoostingClassifier extracted 85 / 1000
=========================================
Running SVC model :
AUC: 0.6956505937033663
Optimal threshold: 0.9999812549974917
SVC extracted 228 / 1000
=========================================
Running LinearSVC model :
AUC: 0.5658525426410286
Optimal threshold: 0.9986208248027768
LinearSVC extracted 164 / 1000
=========================================
Preliminary pooling dataset with duplicates length: 546
Number of duplicates: 269
Removing duplicates....
Pooling dataset length: 277
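
For reference, the sketch below illustrates what the pooling step does at a high level: each trained model scores the experiment set, an optimal threshold is chosen from the ROC curve, the examples above the threshold are extracted, and the per-model selections are pooled with duplicates removed. This is a minimal sketch, assuming pickled scikit-learn pipelines under /pooling_trained_models and "text"/"label" columns in the experiment CSV; the threshold rule (Youden's J) and file layout are illustrative assumptions and may differ from the exact logic in pooling.py.

```python
# Minimal sketch of the pooling step (illustrative; not the exact pooling.py logic).
# Assumptions: each .pkl under pooling_trained_models/ is a fitted sklearn pipeline
# (vectorizer + classifier), and the experiment CSV has "text" and "label" columns.
import glob
import pickle

import pandas as pd
from sklearn.metrics import roc_auc_score, roc_curve

experiment_df = pd.read_csv("dataset/experiment_datasetV2.csv")

pooled_frames = []
for path in glob.glob("pooling_trained_models/*.pkl"):
    with open(path, "rb") as f:
        model = pickle.load(f)

    # Score every experiment example; fall back to decision_function if the
    # classifier does not expose probabilities.
    if hasattr(model, "predict_proba"):
        scores = model.predict_proba(experiment_df["text"])[:, 1]
    else:
        scores = model.decision_function(experiment_df["text"])

    auc = roc_auc_score(experiment_df["label"], scores)

    # Choose an "optimal" threshold, e.g. the ROC point maximizing Youden's J.
    fpr, tpr, thresholds = roc_curve(experiment_df["label"], scores)
    threshold = thresholds[(tpr - fpr).argmax()]

    extracted = experiment_df[scores >= threshold]
    print(f"{path}: AUC={auc:.4f}, threshold={threshold:.4f}, "
          f"extracted {len(extracted)} / {len(experiment_df)}")
    pooled_frames.append(extracted)

# Pool the per-model selections and remove duplicate examples.
pooled = pd.concat(pooled_frames).drop_duplicates(subset="text")
pooled.to_csv("dataset/pooling_dataset.csv", index=False)
print(f"Pooling dataset length: {len(pooled)}")
```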

Running 2nd Algorithm: Active Learning

Algorithm 2 uses BERT from Hugging Face's transformers library and is very resource intensive, so our group opted to run it on ODU's HPC Wahab Cluster.

  1. You must have an account with ODU's HPC Wahab Cluster.

  2. Create Jupyter Server with the following parameters:

    Python Suite: tensorflow 2.12 + pytorch 1.13 GPU

    Number of Cores: 8

  3. Once Jupyter instance has started, click on Connect to Jupyter.

  4. Create a Jupyter notebook

  5. Upload the following notebooks from the repository's /notebooks folder:

    a. AL-bert-sal.ipynb

    b. AL-bert-cal.ipynb

    c. AL-tfidf-sal.ipynb

    d. AL-tfidf-cal.ipynb

  6. There are four different model/protocol combinations being run for Algorithm 2. To save time, we separated the code into four notebooks so they can run concurrently; a generic sketch of the active learning loop appears after this list.

  7. To run the code, select each cell and click the Run button. Repeat this step for every notebook.

  8. Each notebook may produce the following:

    1. Experiment results in CSV format
    2. A logistic regression model saved via the pickle library
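
At a high level, each of these notebooks runs a pool-based active learning loop: train a classifier on the labelled set, score the unlabelled pool with an acquisition protocol (CAL or SAL), move the selected examples into the labelled set, and repeat. The sketch below shows only the shape of that loop, with TF-IDF features and a logistic regression learner, using simple least-confidence uncertainty sampling as a stand-in for the actual CAL/SAL scoring; the column names, number of rounds, and batch size are assumptions for illustration and may differ from the notebooks.

```python
# Generic pool-based active learning loop (illustrative only; the notebooks'
# CAL/SAL acquisition functions are replaced here by least-confidence sampling).
import pickle

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

labelled = pd.read_csv("dataset/train_test_datasetV2.csv")   # assumed "text"/"label" columns
pool = pd.read_csv("dataset/experiment_datasetV2.csv")

# Fit the vectorizer once on all available text.
vectorizer = TfidfVectorizer(max_features=20000)
vectorizer.fit(pd.concat([labelled["text"], pool["text"]]))

results = []
for iteration in range(10):                                  # number of AL rounds is an assumption
    model = LogisticRegression(max_iter=1000)
    model.fit(vectorizer.transform(labelled["text"]), labelled["label"])

    # Acquisition step: pick the examples the model is least confident about.
    proba = model.predict_proba(vectorizer.transform(pool["text"]))
    uncertainty = 1.0 - proba.max(axis=1)
    selected = pool.iloc[np.argsort(uncertainty)[-100:]]     # query batch size is an assumption

    labelled = pd.concat([labelled, selected])
    pool = pool.drop(selected.index)
    results.append({"iteration": iteration, "labelled_size": len(labelled)})

# Save the experiment results as CSV and the final model via pickle,
# mirroring the two outputs listed above.
pd.DataFrame(results).to_csv("al_experiment_results_sketch.csv", index=False)
with open("al_trained_models/logistic_regression_sketch.pkl", "wb") as f:
    pickle.dump(model, f)
```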

Running Extra credit proposal algorithm

  1. To develop this framework proposal, we used the ODU HPC cluster with a Jupyter notebook, and the environment was set up with the following parameters:
    • Python Suite: tensorflow 2.12 + pytorch 1.13 GPU
    • Number of Cores: 8
  2. Follow the steps in the Jupyter notebook 722_ActiveLearning_bert.ipynb (the Extra credit proposal); a minimal BERT feature-extraction sketch follows below.
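
The extra credit proposal builds on the same BERT-based setup as Algorithm 2. For orientation, the sketch below shows one minimal way to obtain BERT sentence embeddings with the Hugging Face transformers library; the model name (bert-base-uncased), mean pooling, batch size, and column name are illustrative assumptions and not necessarily what 722_ActiveLearning_bert.ipynb does.

```python
# Minimal BERT feature extraction with Hugging Face transformers (illustrative;
# model name, mean pooling, and batch size are assumptions, not the notebook's exact setup).
import torch
import pandas as pd
from transformers import AutoModel, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased").to(device).eval()

texts = pd.read_csv("dataset/experiment_datasetV2.csv")["text"].tolist()  # assumed column name

embeddings = []
with torch.no_grad():
    for start in range(0, len(texts), 32):                   # batch size 32 is an assumption
        batch = tokenizer(texts[start:start + 32], padding=True, truncation=True,
                          max_length=128, return_tensors="pt").to(device)
        hidden = batch_output = model(**batch).last_hidden_state   # [batch, seq_len, 768]
        mask = batch["attention_mask"].unsqueeze(-1)
        # Mean-pool token embeddings, ignoring padding positions.
        embeddings.append((hidden * mask).sum(1) / mask.sum(1))

features = torch.cat(embeddings).cpu()                       # one 768-dim vector per text
```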
