## Project: Not Safe for Work (NSFW) media Classifier for Wikimedia Commons

#### Task [T264045](https://phabricator.wikimedia.org/T264045)

### Contents
1. What is NSFW?
2. Existing Datasets
3. Pre-trained models
4. NSFW detection Models
5. Comparison of existing models
6. Proposal
7. Additional Resources

## What is NSFW?

According to [Wikipedia](https://en.wikipedia.org/wiki/Not_safe_for_work):
> Not safe for work (NSFW) is Internet slang used to mark links to content, videos or websites pages the viewer may not wish to be seen looking at in a public, formal or controlled environment. The marked content may contain nudity, intense sexuality, political incorrectness, profanity, slurs, violence or other potentially disturbing subject matter.

> Conversely, safe for work (SFW) is used for links that don't contain such material, but where the title might otherwise lead people to think that content is NSFW.

This work deals with images and videos. Although NSFW categories are quite subjective, most work in this field places pornography, nudity and, in rare instances, violence in the NSFW category. Details can be found in the `Existing Datasets` section.

## Existing Datasets
#### Microtask [T264938](https://phabricator.wikimedia.org/T264938)

There is no single perfect benchmark dataset for NSFW detection, mostly because of the subjectivity of its definition. Still, many researchers have curated their own datasets, and some have combined their own collections with existing datasets. Some of the most widely used data sources are summarized below:

1. [NPDI pornography datasets](https://sites.google.com/site/pornographydatabase/) 
The Pornography database contains nearly 80 hours of video: 400 pornographic and 400 non-pornographic videos. Of the non-pornographic videos, 200 are 'easy' and 200 are 'difficult'. The difficult class shows people with exposed skin in SFW contexts such as swimming and wrestling.

![images/porn_db.png](images/porn_db.png)

An extension of this dataset, called the [pornography-2k](https://danielmoreira.github.io/publication/2016_fsi/paper.pdf) dataset, is also available. It comprises nearly 140 hours of video: 1000 pornographic and 1000 non-pornographic videos, ranging from six seconds to 33 minutes in length.

>The Pornography-2k dataset is available free of charge to the scientific community but, due to the potential legal liabilities of distributing large quantities of pornographic/copyrighted material, the request must be formal and a responsibility term must be signed.

2. [NSFW Data Scraper](https://github.com/alex000kim/nsfw_data_scraper) 
A set of scripts for automatically collecting NSFW/SFW images. The categories are `porn, hentai, sexy, neutral and drawings`. More URLs containing images can be added to the list.

3. [Bigger NSFW Data Scraper](https://github.com/EBazarov/nsfw_data_source_urls)
This repository similarly contains text files of URLs for `159 categories` of NSFW images. Scraping them yields around 1.3 million images. However, there are no URLs for SFW images. 
>After downloading and cleaning it's possible to have ~ 500GB or in other words ~ 1 300 000 of NSFW images

4. [NudeNet classifier Dataset](https://archive.org/details/NudeNet_classifier_dataset_v1)
The author of NudeNet combined multiple sources of nsfw/sfw data to make a comprehensive nsfw classification dataset. 

- [Ripme](https://github.com/RipMeApp/ripme) was used to collect images from NSFW subreddits (the list of subreddits was collected from [here](https://scrolller.com/nsfw)).
- To gather low-quality NSFW images, Pornhub thumbnails were also crawled.
- Data was collected from [here](https://github.com/alex000kim/nsfw_data_scraper) and [here](https://github.com/GantMan/nsfw_model).
- Facebook profile pictures, collected through the Graph API, were used as SFW images.
- SFW subreddit images were collected using Ripme.
- An `Exposed part detection` dataset was also used in this project for object detection.

5. [Annotated Dataset](https://www.researchgate.net/publication/312241188_Pornographic_image_recognition_by_strongly-supervised_deep_multiple_instance_learning)
>We have collected a large scale dataset consisting of 155,000 pornographic images and 222,000 normal images from Internet.

This dataset contains 33,000 annotated pornographic images and 100,000 randomly selected normal images as the training set. Randomly selected 5,000 pornographic images and 5,000 normal images from the remaining images are included in the validation set, while the remaining 117,000 pornographic images and 117,000 normal images are used as the test set.

6. [Annotated dataset](https://arxiv.org/abs/1902.03771)
They use a newly-collected large scale dataset of 100K pornographic images and 100K normal images. They perform bounding box _annotation_ for these key pornographic contents in the training set, and for the sexual behaviour, just annotate the most compact regions that are sufficient to characterize these behaviours.
>Overall, we categorize them into three groups, namely, regular `nudity, sexual behaviour, and unprofessional porn`. The normal images of our dataset are also downloaded from the Internet, which can be further categorized into three groups, namely, `scantily-clad people, normal people and no people`.

7. [Video collection](https://www.researchgate.net/publication/336430665_A_baseline_for_NSFW_video_detection_in_e-learning_environments)
This work mainly aims to identify inappropriate content in educational videos. The dataset was self-curated: SFW videos came from Youtube8M and video@RNP, and the NPDI dataset was integrated into it. Videos of surgeries from Cholec80 were added as SFW, since educational content may contain such views. NSFW was taken to mean porn and violence; for this, videos were collected from XVideos and from other websites such as Gore, BestGore and GoreBrasil.

>49,920 videos from Youtube8M, 5,000 videos from video@RNP, 400 SFW videos and 400 NSFW videos from NPDI, 80 videos of cholecystectomy surgeries, 47,758 videos from XVideos and 2,242 gore videos.

8. [Kaggle dataset](https://www.kaggle.com/drakedtrex/my-nsfw-dataset)
This dataset does not seem to be in use. Its images can nevertheless be included in nsfw categories to strengthen our dataset.

9. Other small collections ([1](https://yacvid.hayko.at/task.php?did=368) and [2](https://yacvid.hayko.at/task.php?did=369))
These datasets contain a small collection of images and a collection of video segments from various movies, covering both NSFW and SFW content.

**Dataset tools:**
These datasets are noisy and require cleaning and duplicate removal. Creating or combining datasets also requires some additional cleaning steps. A few tools that can help with this are listed below.
- [Find duplicates](https://github.com/Qarj/duplicate-file-finder): This tool can help identify and remove duplicate files.
- [Gemini2](https://macpaw.com/gemini): This tool can help remove images that are not exactly the same but are "essentially" the same.
- [Snorkel](https://www.snorkel.org/): Can be used for auto-labeling.
- Before model training, proper data augmentation such as cropping, rotation and flipping has to be applied. Although [external augmenters](https://github.com/mdbloice/Augmentor) can be used, most platforms (PyTorch, TensorFlow) already include built-in transformation modules.
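
As a rough illustration of the duplicate-removal step, the sketch below groups byte-identical files by their SHA-256 digest using only the Python standard library. The function names are hypothetical and this is not how the tools listed above work internally; it only catches exact copies, so near-duplicates (resized or re-encoded images) would still need a dedicated tool.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_exact_duplicates(root):
    """Group files under `root` that have byte-identical content."""
    groups = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            groups[file_digest(path)].append(path)
    # Keep only digests shared by more than one file.
    return [paths for paths in groups.values() if len(paths) > 1]
```

Each returned group lists paths with identical content; one file per group can be kept and the rest discarded.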

## Pre-trained Models

Transfer learning is a long-established practice in computer vision. It is the process of training a model on a very large dataset to learn a generic task and then fine-tuning it for downstream tasks. Almost all work in NSFW detection starts from models pre-trained on the [ImageNet](http://www.image-net.org/) dataset. ImageNet is a huge image database of 1000 classes of images. When a model learns to distinguish among these classes, it also learns general visual features, such as patterns and colors, that carry over to other tasks.

A list of available pre-trained models is given below, along with their performance on the ImageNet validation set. Note: the accuracy reported here is not always representative of how well the network transfers to other tasks and datasets. This list is not comprehensive; there are many other types of model, and multiple variations and sizes of similar models.
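
For readers unfamiliar with the metric, top-k accuracy counts a prediction as correct when the true class is among the k highest-scoring classes. The sketch below is a minimal plain-Python illustration; the function name and the toy scores are made up for the example and do not come from any of the listed models.

```python
def top_k_accuracy(scores, labels, k):
    """Percentage of samples whose true label is among the k highest scores."""
    hits = 0
    for sample_scores, label in zip(scores, labels):
        # Class indices ordered from highest to lowest score.
        ranked = sorted(range(len(sample_scores)),
                        key=lambda i: sample_scores[i], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(labels)


# Toy example: 4 samples, 3 classes.
scores = [
    [0.1, 0.7, 0.2],
    [0.5, 0.3, 0.2],
    [0.2, 0.3, 0.5],
    [0.4, 0.35, 0.25],
]
labels = [1, 1, 0, 2]

print(top_k_accuracy(scores, labels, 1))  # 25.0
print(top_k_accuracy(scores, labels, 2))  # 50.0
```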

Model | Implementation | Top-1 Acc (%) | Top-5 Acc (%)
--- | --- | --- | ---
PNASNet-5-Large | [Tensorflow](https://github.com/tensorflow/models/tree/master/research/slim) | 82.858 | 96.182
NASNet-A-Large | [Tensorflow](https://github.com/tensorflow/models/tree/master/research/slim) | 82.693 | 96.163
NASNet-A-Mobile | [Tensorflow](https://github.com/tensorflow/models/tree/master/research/slim) | 74.0 | 91.6
SENet | [Caffe](https://github.com/hujie-frank/SENet) | 81.32 | 95.53
PolyNet | [Caffe](https://github.com/CUHK-MMLAB/polynet) | 81.29 | 95.75
InceptionResNetV2 | [Tensorflow](https://github.com/tensorflow/models/tree/master/slim) | 80.4 | 95.3
DualPathNet | [Cadene](https://github.com/Cadene/pretrained-models.pytorch#dualpathnetworks) | 79.224 | 94.488
Xception | [Keras](https://github.com/keras-team/keras/blob/master/keras/applications/xception.py) | 79.000 | 94.500
SE-ResNet | [Caffe](https://github.com/hujie-frank/SENet) | 77.63 | 93.64
FBResNet152 | [Torch7](https://github.com/facebook/fb.resnet.torch) | 77.84 | 93.84
InceptionV4 | [Tensorflow](https://github.com/tensorflow/models/tree/master/slim) | 80.2 | 95.3
InceptionV3 | [Pytorch](https://github.com/pytorch/vision#models) | 77.294 | 93.454
DenseNet| [Pytorch](https://github.com/pytorch/vision#models) | 74.646 | 92.136
ResNet | [Pytorch](https://github.com/pytorch/vision#models) | 70.142 | 89.274
VGG| [Pytorch](https://github.com/pytorch/vision#models) | 68.970 | 88.746
SqueezeNet| [Pytorch](https://github.com/pytorch/vision#models) | 58.250 | 80.800
Alexnet| [Pytorch](https://github.com/pytorch/vision#models) | 56.432 | 79.194
GoogleNet |[Pytorch](https://github.com/pytorch/vision#models)| 69.88 | 89.53
ShuffleNet V2 |[Pytorch](https://github.com/pytorch/vision#models)|69.36 | 88.32
MobileNet V2 |[Pytorch](https://github.com/pytorch/vision#models)| 71.88 | 90.29
ResNeXt-50-32x4d |[Pytorch](https://github.com/pytorch/vision#models)| 77.62 | 93.7

These models vary in architecture and are usually very large. Training on a dataset as large as ImageNet requires substantial computational power. Most models are available in pre-trained form in frameworks such as PyTorch and TensorFlow, and other libraries also provide collections of pre-trained models ready for use. 
A comparison of the size, accuracy and computational requirements of these models is shown below ([source](https://www.learnopencv.com/pytorch-for-beginners-image-classification-using-pre-trained-models/)).

size | accuracy
---|---
![images/pretrained_model_comp_size.png](images/pretrained_model_comp_size.png)|![images/pretrained_model_comp_error.png](images/pretrained_model_comp_error.png)

cpu | gpu
---|---
![images/pretrained_model_comp_cpu.png](images/pretrained_model_comp_cpu.png)|![images/pretrained_model_comp_gpu.png](images/pretrained_model_comp_gpu.png)

![images/pretrained_model_comp.png](images/pretrained_model_comp.png)

Smaller bubbles are better in terms of model size. Bubbles near the origin are better in terms of both accuracy and speed.

## NSFW detection Models
#### Microtask [T264056](https://phabricator.wikimedia.org/T264056)

Before diving into the current deep-learning-based state-of-the-art models, it is worth looking at how NSFW detection has evolved. Early NSFW detection mainly meant nudity detection, accomplished by detecting large areas of exposed skin. Later, shape detection allowed human limbs to be identified and was combined with statistical models to identify nudity. For video, repetitiveness of audio was treated as a sign of NSFW content, and motion vectors along with BoVW were used with PCA and simple classifiers such as SVM.

1. [Yahoo open_nsfw](https://github.com/yahoo/open_nsfw)

>Defining NSFW material is subjective and the task of identifying these images is non-trivial. Moreover, what may be objectionable in one context can be suitable in another. For this reason, the model focuses only on one type of NSFW content: pornographic images. The identification of NSFW sketches, cartoons, text, images of graphic violence, or other types of unsuitable content is not addressed with this model.

>We are releasing the thin ResNet 50 model, since it provides good tradeoff in terms of accuracy, and the model is lightweight in terms of runtime (takes < 0.5 sec on CPU) and memory (~23 MB). 


>Depending on the dataset, usecase and types of images, we advise developers to choose suitable thresholds. Ideally developers should create an evaluation set according to the definition of what is safe for their application, then fit a ROC curve to choose a suitable threshold if they are using the model as it is.
>Results can be improved by fine-tuning the model for your dataset/ use case / definition of NSFW.

Although they release only the ResNet-50 model, they also tested other pre-trained models. A comparison of the performance of these models on Yahoo's open_nsfw dataset is shown below:

![images/yahoo_comp.jpg](images/yahoo_comp.jpg)
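
The threshold-selection advice quoted above can be sketched as follows. This is a minimal plain-Python illustration, not part of open_nsfw: the function names, the toy scores, and the rule "maximize true positive rate subject to a cap on false positive rate" are all assumptions made for the example. It also assumes the evaluation set contains both classes.

```python
def roc_points(scores, labels):
    """Return (threshold, false_positive_rate, true_positive_rate) triples.

    An image is flagged NSFW when its score is >= threshold.
    `labels` use 1 for NSFW and 0 for SFW.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for threshold in sorted(set(scores), reverse=True):
        flagged = [score >= threshold for score in scores]
        tp = sum(1 for f, y in zip(flagged, labels) if f and y == 1)
        fp = sum(1 for f, y in zip(flagged, labels) if f and y == 0)
        points.append((threshold, fp / negatives, tp / positives))
    return points


def pick_operating_point(scores, labels, max_fpr):
    """ROC point with the highest TPR among those with FPR <= max_fpr."""
    allowed = [p for p in roc_points(scores, labels) if p[1] <= max_fpr]
    if not allowed:
        return None
    # Ties on TPR are broken in favour of the higher threshold.
    return max(allowed, key=lambda p: (p[2], p[0]))


# Toy evaluation set: model scores and ground-truth labels (1 = NSFW).
eval_scores = [0.95, 0.90, 0.80, 0.60, 0.40, 0.30, 0.20, 0.10]
eval_labels = [1, 1, 0, 1, 1, 0, 0, 0]

print(pick_operating_point(eval_scores, eval_labels, max_fpr=0.25))
# (0.4, 0.25, 1.0)
```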

2. [Classification of pornographic images and videos](https://arxiv.org/pdf/1511.08899.pdf)

This research fuses AlexNet and GoogleNet. The fusion can be done with a few fully connected layers or with a simple linear weighted average of the scores. The paper lists multiple variations of this procedure and shows that the fusion performs better than either model individually. They used the NPDI dataset and classified each video by majority voting over all of its frames.
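
A minimal sketch of the simpler fusion variant (linear weighted average of scores) followed by majority voting over frames is given below. The function names, the equal weights and the 0.5 decision threshold are assumptions for illustration; the paper's actual weights and thresholds are not reproduced here.

```python
from collections import Counter


def fuse_scores(score_a, score_b, weight_a=0.5):
    """Linear weighted average of two models' porn scores for one frame."""
    return weight_a * score_a + (1.0 - weight_a) * score_b


def classify_video(frame_scores_a, frame_scores_b, weight_a=0.5, threshold=0.5):
    """Label each frame from its fused score, then take a majority vote."""
    votes = Counter()
    for score_a, score_b in zip(frame_scores_a, frame_scores_b):
        fused = fuse_scores(score_a, score_b, weight_a)
        votes["porn" if fused >= threshold else "non-porn"] += 1
    return votes.most_common(1)[0][0]


# Hypothetical per-frame scores from the two networks.
alexnet_scores = [0.9, 0.2, 0.6, 0.4, 0.8]
googlenet_scores = [0.8, 0.4, 0.2, 0.7, 0.9]

print(classify_video(alexnet_scores, googlenet_scores))  # porn
```

Here three of the five fused frame scores reach the threshold, so the video is labeled porn.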

3. [Pornography classification of videos](https://www.researchgate.net/publication/308398120_Pornography_Classification_The_Hidden_Clues_in_Video_Space-Time)

Traditional methods of video classification often employ still-image techniques — labeling frames individually prior to a global decision. Frame-based approaches, however, ignore information brought by motion. This work aggregates local information extracted by TRoF into a mid-level representation using Fisher Vectors, the state-of-the-art model of Bags of Visual Words (BoVW).

4. [Pornographic Image recognition](https://www.researchgate.net/publication/312241188_Pornographic_image_recognition_by_strongly-supervised_deep_multiple_instance_learning)

This work uses GoogleNet and a self-curated dataset for fine-tuning and validation. The highlight of this work is that it performs `strongly-supervised` classification. The authors annotate 33k pornographic images for private parts and obscene scenes, then apply a sliding-window approach to match the annotated portions of an image, much like object detection. If any match is found, the image is classified as porn.
>We have collected a large scale dataset consisting of 155,000 pornographic images and 222,000 normal images from Internet. For these pornographic images, we randomly select 33,000 images to annotate exposed private parts with keypoints, which serves as strong supervision in the training phase.
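
The sliding-window decision rule can be sketched in a few lines. In this illustration `region_score` is a hypothetical callable standing in for the CNN that scores a region; the window size, stride and threshold are assumptions, not values taken from the paper.

```python
def sliding_windows(width, height, window, stride):
    """Yield (left, top, right, bottom) boxes covering an image of the given size."""
    for top in range(0, height - window + 1, stride):
        for left in range(0, width - window + 1, stride):
            yield (left, top, left + window, top + window)


def image_is_porn(width, height, window, stride, region_score, threshold=0.5):
    """Flag the image if any window scores at or above the threshold."""
    return any(
        region_score(box) >= threshold
        for box in sliding_windows(width, height, window, stride)
    )


print(list(sliding_windows(4, 4, 2, 2)))
# [(0, 0, 2, 2), (2, 0, 4, 2), (0, 2, 2, 4), (2, 2, 4, 4)]
```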

5. [Video pornography detection](https://www.researchgate.net/publication/311551085_Video_pornography_detection_through_deep_learning_techniques_and_motion_information)

This work takes into account motion of a video to classify it, besides the image frames. Motion vectors give us, for each pixel, its relative movement from the source image to the reference one. Each position in this field has a displacement vector indicating the estimation of which direction the respective pixel has moved to and the intensity (gradient) of this movement. 

> We represent the motion information by two motion maps, one for the horizontal (dx) component of the motion and another for the vertical (dy), containing in each (x,y) position, a measure of motion in that respective direction. When transforming these motion maps to images, we linearly rescale them to the [0, 255] interval and store them as grayscale images, one image for each component of the motion.

Motion information is explicitly provided to the convolutional neural network, and each type of information (static and motion) is processed independently by the network. They experiment with several architectural variations for incorporating the motion maps into GoogleNet, VGG and AlexNet. The best-performing variation passes a video frame and its motion maps through a CNN model separately and then applies concat pooling to classify the extracted features together. The architecture is shown below.

![images/motion.png](images/motion.png)
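
The conversion of a motion component into a grayscale image, as described in the quote above, can be sketched as follows. The function name is hypothetical, and rescaling by the per-map minimum and maximum is an assumption; the paper may use fixed bounds instead.

```python
def motion_to_grayscale(component):
    """Linearly rescale a 2-D motion component (dx or dy) to integers in [0, 255]."""
    flat = [value for row in component for value in row]
    low, high = min(flat), max(flat)
    if high == low:
        # Assumption: a constant motion field maps to mid-gray.
        return [[128 for _ in row] for row in component]
    scale = 255.0 / (high - low)
    return [[int(round((value - low) * scale)) for value in row]
            for row in component]


# Toy horizontal motion map (dx) for a 2x2 frame.
dx = [[-4.0, 0.0],
      [2.0, 6.0]]

print(motion_to_grayscale(dx))  # [[0, 102], [153, 255]]
```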

They also compare against third-party applications. This comparison, along with a comparison of the various architectures, is shown below.

![images/model_comp.png](images/model_comp.png)
![images/thirdparty_comp.png](images/thirdparty_comp.png)


6. [Video detection with CNN-RNN architecture](https://www.researchgate.net/publication/318408211_Adult_Content_Detection_in_Videos_with_Convolutional_and_Recurrent_Neural_Networks)

This work also uses the NPDI dataset and tests with GoogleNet and ResNet. It extracts 1024 features from the last convolutional layer of GoogleNet and 2048 features from ResNet-50, ResNet-101 and ResNet-152. Images are normalized by subtracting the RGB mean. To generate more robust features, it takes 10 crops from each original frame: top-left, top-right, bottom-left, bottom-right and center, plus the same crops horizontally mirrored. The final features are the average of the vectors extracted from the 10 crops. The authors do not fine-tune; they simply run a forward pass through the pre-trained models. The frame features are then passed to an **LSTM**, which performs sequence modeling over the video frames and produces the classification result. In their experiments the largest ResNet worked best. The architecture and accuracy comparison are shown below.


![images/acorde.png](images/acorde.png) ![images/acorde_res.png](images/acorde_res.png)
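
The 10-crop feature averaging described above can be illustrated with a small sketch. The function names are hypothetical, and the 256-pixel frame and 224-pixel crop sizes in the example are assumptions; feature extraction itself (the CNN forward pass) is left out.

```python
def ten_crop_boxes(width, height, crop):
    """Return ten (left, top, right, bottom, mirrored) crop specs for one frame."""
    right, bottom = width - crop, height - crop
    corners = [
        (0, 0),                     # top-left
        (right, 0),                 # top-right
        (0, bottom),                # bottom-left
        (right, bottom),            # bottom-right
        (right // 2, bottom // 2),  # center
    ]
    boxes = [(x, y, x + crop, y + crop) for x, y in corners]
    # The same five crops are reused with a horizontal mirror flag.
    return [box + (False,) for box in boxes] + [box + (True,) for box in boxes]


def average_features(vectors):
    """Element-wise mean of equally sized feature vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


crops = ten_crop_boxes(256, 256, 224)
print(len(crops))   # 10
print(crops[4])     # (16, 16, 240, 240, False)
print(average_features([[1.0, 2.0], [3.0, 6.0]]))  # [2.0, 4.0]
```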

7. [Pornographic Image Recognition via Weighted Multiple Instance Learning](https://arxiv.org/abs/1902.03771)

This work takes each image as a bag of regions, and follows a multiple instance learning (MIL) approach to train a generic region-based recognition model. They use a newly-collected large scale dataset of 100K pornographic images and 100K normal images. 
>Overall, we categorize them into three groups, namely, regular `nudity, sexual behaviour, and unprofessional porn`. The normal images of our dataset are also downloaded from the Internet, which can be further categorized into three groups, namely, `scantily-clad people, normal people and no people`.

They perform bounding box _annotation_ of the key pornographic content in the training set; for sexual behaviour, they annotate only the most compact regions that are sufficient to characterize it. They shift and resize the boxes to generate positive and negative class images. The model is GoogleNet. For testing and validation, an image is scaled to 3 sizes and many crops are taken; if any one of them is identified as porn, the image is classified as porn. They also evaluate on the NPDI benchmark.

8. [A baseline for NSFW video detection](https://www.researchgate.net/publication/336430665_A_baseline_for_NSFW_video_detection_in_e-learning_environments)

This work uses multimodal features, consisting of image sequence features and audio features. They use an Inception-V3 CNN to extract image sequence features and an Audio VGG CNN (pre-trained on AudioSet) to extract audio embeddings. They apply PCA (with whitening) to reduce the image embeddings to 1024 dimensions and the audio embeddings to 128. They then concatenate the image and audio embeddings to form video embeddings with 1152 dimensions. The features from all frames of the video are combined into a single feature vector by averaging, and classification is performed on the result.
Classification was done with SVM, KNN and MLP, and the three are compared. SVM gave the best performance in this case. The architecture of this system is shown below.
![images/audio_video.png](images/audio_video.png)
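
The construction of the 1152-dimensional video embedding can be sketched as below. The sketch assumes the PCA-reduced per-frame embeddings (1024 image dimensions, 128 audio dimensions) are already available; the function names are hypothetical. Averaging each modality and then concatenating gives the same vector as concatenating per frame and then averaging.

```python
def mean_pool(frame_vectors):
    """Average per-frame vectors into a single vector."""
    count = len(frame_vectors)
    return [sum(column) / count for column in zip(*frame_vectors)]


def video_embedding(image_frames, audio_frames):
    """Concatenate the pooled image and audio embeddings of one video."""
    return mean_pool(image_frames) + mean_pool(audio_frames)


# Toy input: 3 frames with constant-valued embeddings.
image_frames = [[float(i)] * 1024 for i in range(3)]
audio_frames = [[2.0] * 128 for _ in range(3)]

embedding = video_embedding(image_frames, audio_frames)
print(len(embedding))  # 1152
```

The resulting vector is what a classifier such as an SVM would take as input.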

The performance of their classifiers is shown below. The SVM model appears to give the most balanced performance, although the MLP also performed well.

![images/audio_svm.png](images/audio_svm.png)
![images/audio_knn.png](images/audio_knn.png)
![images/audio_mlp.png](images/audio_mlp.png)

This work mainly aims to identify inappropriate content in educational videos. The dataset was self-curated: SFW videos came from Youtube8M and video@RNP, and the NPDI dataset was integrated into it. Videos of surgeries from Cholec80 were added as SFW, since educational content may contain such views. NSFW was taken to mean porn and violence; for this, videos were collected from XVideos and from other websites such as Gore, BestGore and GoreBrasil.

9. Others

Some other examples of NSFW models are given below. These models are not described extensively, but their authors have released a dataset, a pre-trained model, or both. They can be used to compare our model's performance if and when required.
- [ResNet50 model in Caffe](https://github.com/ryanjay0/miles-deep): 95% accuracy on 6 fine-grained categories of porn.
- [Frame by Frame classifier](https://github.com/lakshaychhabra/NSFW-Detection-DL): Data from [here](https://github.com/alex000kim/nsfw_data_scraper). Uses the MobileNet architecture, which is fast and has fewer parameters. A sliding-window approach is used to scan an image and look for inappropriate content within it. 
- [NSFW Classifier](https://github.com/deepanshu-yadav/NSFW-Classifier): Classifies images as nude, semi-nude, animated, porn and SFW. The dataset consists of the [NudeNet dataset](https://archive.org/details/NudeNet_classifier_dataset_v1), data scraped from [here](https://github.com/EBazarov/nsfw_data_source_urls) and [here](https://github.com/alex000kim/nsfw_data_scraper/tree/master/raw_data), and images collected from Instagram using an Instagram [Scraper](https://github.com/rarcega/instagram-scraper).
- [CNN NSFW predictor](https://github.com/Parasgupta44/NSFW_Detector): This work used the [NudeNet dataset](https://archive.org/details/NudeNet_classifier_dataset_v1) with VGG16 and InceptionV3 to classify images as nude, sexy or safe.
- [NSFW detection](https://github.com/GantMan/nsfw_model): This repo classifies 5 classes: drawings, hentai, neutral, porn and sexy. Its best accuracy, 93%, was achieved with InceptionV3. It also provides a MobileNet model and a [link](https://github.com/yangbisheng2009/nsfw-resnet) to a ResNet PyTorch model. The dataset is not made available.
- [NudeNet](https://github.com/notAI-tech/NudeNet): This repo curated and combined many sources to create the [NudeNet dataset](https://archive.org/details/NudeNet_classifier_dataset_v1). An Xception model was used to classify images as safe or nude. The nude class also includes porn, simulated porn and explicit nudity.


**Bonus:** A comparison of existing APIs can be found [here](https://towardsdatascience.com/comparison-of-the-best-nsfw-image-moderation-apis-2018-84be8da65303). An overall comparison is shown below. If required, our model can be tested against a few of these APIs for robustness.

![images/api_comp.png](images/api_comp.png)

## Comparison of existing models

To compare the models described above, the table below summarizes the important information about each work: what it classifies, which datasets it uses and what accuracy it reaches. Not all papers report model size and computational requirements, but since most use pre-trained models, a rough idea can be obtained from the `Pre-trained Models` section above.

Work | Model | Classes | Dataset | Accuracy (%) | Model size 
---|---|---|---|---|---
[Yahoo open nsfw](https://github.com/yahoo/open_nsfw) | Resnet50| Porn | Not available | 99.5 | 23 MB 
[Image and Video](https://arxiv.org/pdf/1511.08899.pdf) | AlexNet, GoogleNet| Porn | NPDI | 94.1 | 62 MB + 40 MB 
[Video](https://www.researchgate.net/publication/308398120_Pornography_Classification_The_Hidden_Clues_in_Video_Space-Time) | TRoF, BoVW | Porn | NPDI | 95.0 | 
[Image](https://www.researchgate.net/publication/312241188_Pornographic_image_recognition_by_strongly-supervised_deep_multiple_instance_learning) | GoogleNet | Porn | Self curated and annotated | 98.4 | 40 MB 
[Video](https://www.researchgate.net/publication/311551085_Video_pornography_detection_through_deep_learning_techniques_and_motion_information) | GoogleNet | Porn | NPDI | 96.4 | 40 MB 
[Video](https://www.researchgate.net/publication/318408211_Adult_Content_Detection_in_Videos_with_Convolutional_and_Recurrent_Neural_Networks) | ResNet101 | Porn | NPDI | 95.6 | 45 MB 
[Images](https://arxiv.org/abs/1902.03771) | GoogleNet | Porn, Nudity | Self curated annotated and NPDI | 97.3 | 40 MB 
[Video with Audio](https://www.researchgate.net/publication/336430665_A_baseline_for_NSFW_video_detection_in_e-learning_environments) | InceptionV3, AudioVGG, SVM | Porn, Violence | Self Curated dataset and NPDI | 97.0 | 24 MB, 138 MB 

## Additional Resources
- [How to make models smaller and faster](https://heartbeat.fritz.ai/deep-learning-has-a-size-problem-ea601304cd8)
- [Curated list of Computer Vision Models](https://github.com/nerox8664/awesome-computer-vision-models)
- Vision Transformers: [Paper](https://openreview.net/pdf?id=YicbFdNTTy) and [Video](https://www.youtube.com/watch?v=TrdevFK_am4&ab_channel=YannicKilcher)