## NSFW Classifiers Survey for T264045: Not Safe for Work (NSFW) media Classifier for Wikimedia Commons

NSFW (Not Safe for Work) is internet slang for visual content that may be sexual (explicit or suggestive), violent, or otherwise harmful to view in professional, non-casual environments. In the previous report I briefly explained what NSFW entails and the datasets that can be used to train our own NSFW classifiers. In this report I compare various methods and approaches for building NSFW media classifiers.

Before listing the results and details of the existing research, it is important to understand transfer learning and the use of pre-trained models in NSFW video classification, as most existing research bases its methods on them.


### Using Pretrained Models

Pre-trained models are neural network models trained on large benchmark datasets. Using them as the starting point for computer vision and natural language processing tasks is a popular approach in deep learning, both because developing neural network models for these problems from scratch requires vast compute and time resources and because pre-trained models provide large jumps in skill on related problems.

Using a pre-trained model developed for one task as the starting point for a model on a second task is called `transfer learning`. Consequently, pre-training CNN models on ImageNet and fine-tuning them further is a common method for building NSFW classifiers.

#### Implementation

The pre-trained models compared below were trained on [ImageNet](http://www.image-net.org/), a popular image dataset with a repository of more than 14 million images; the subset commonly used for benchmarking has 1,000 classes. Images of each class are quality-controlled and human-annotated. Inference is timed on both CPU and GPU for each model.

#### Author's Conclusions
  
The illustration below plots each model's inference time (on GPU) against its top-1 error. All the illustrations and findings have been taken from this [source](https://www.learnopencv.com/pytorch-for-beginners-image-classification-using-pre-trained-models/).
<img src="img/Pre-Trained-Model-Comparison.jpg" width=700>
  
For comparison, I have tabulated the accuracy of the above models on the ImageNet validation set, using data from the [Papers With Code ImageNet Classification table](https://paperswithcode.com/sota/image-classification-on-imagenet).

Model | Implementation | Top-1 Accuracy (%) | Top-5 Accuracy (%) | Number of Params
--- | --- | --- | --- | ---
ResNeXt-101-32x8d| [PyTorch](https://github.com/facebookresearch/semi-supervised-ImageNet1K-models) |82.2 |	96.4	|88M
ResNeXt-50-32x4d |[PyTorch](https://github.com/facebookresearch/semi-supervised-ImageNet1K-models)| 80.22	|94.9	|34.7M 
InceptionV3 |[TensorFlow](https://github.com/tensorflow/models/tree/master/research/slim)| 78.95| 94.49|24M
ResNet-101 |[PyTorch](https://github.com/facebookresearch/semi-supervised-ImageNet1K-models)  | 78.25 | 93.95 | 40M
DenseNet-201| [PyTorch](https://github.com/liuzhuang13/DenseNet)| 77.42 | 93.66 | 20M
ResNet-50 | [PyTorch](https://github.com/facebookresearch/semi-supervised-ImageNet1K-models) | 77.15 | 93.29 | 26M
DenseNet-169| [PyTorch](https://github.com/liuzhuang13/DenseNet) | 76.2 | 93.15 | 13M
DenseNet-121|[PyTorch](https://github.com/liuzhuang13/DenseNet) | 74.98 | 92.29 | 7.2M
ResNet-34 |[PyTorch](https://github.com/facebookresearch/semi-supervised-ImageNet1K-models) | 62.6 | 84.1 | 22M
VGG-19|[TensorFlow](https://github.com/tensorflow/models/tree/master/research/slim) | 74.5 | 92.0 | 144M
MobileNet V2 |[TensorFlow](https://github.com/tensorflow/models/tree/master/research/slim)| 74.9|92.5| 6.9M
GoogleNet |[PyTorch](https://github.com/pytorch/vision/blob/6db1569c89094cf23f3bc41f79275c45e9fcb3f3/torchvision/models/googlenet.py#L62)|71|90.8 | 7M
ResNet-18 |[PyTorch](https://github.com/facebookresearch/semi-supervised-ImageNet1K-models) | 72.7 |91.9  | 11M
ShuffleNet V2 |[PyTorch](https://github.com/megvii-model/ShuffleNet-Series/tree/master/ShuffleNetV2)|75.4 | 92.4| 5.4M
SqueezeNet | [TensorFlow](https://github.com/forresti/SqueezeNet)| 58.2 | 87.4 | 1.25M
AlexNet|[PyTorch](https://github.com/dansuh17/alexnet-pytorch/blob/d0c1b1c52296ffcbecfbf5b17e1d1685b4ca6744/model.py#L40)| 63.3 | 84.6 | 60M
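
To make the top-1 and top-5 columns above concrete, here is a minimal sketch of how top-k accuracy can be computed from per-class scores. The scores and labels are toy values chosen for illustration (six classes, so top-3 stands in for top-5), not ImageNet results.

```python
def top_k_accuracy(scores, labels, k):
    """Percentage of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for class_scores, label in zip(scores, labels):
        # Class indices ordered from highest to lowest score for this sample.
        ranked = sorted(range(len(class_scores)), key=lambda c: class_scores[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(labels)


# Toy scores for 4 samples over 6 classes (illustrative values only).
scores = [
    [0.10, 0.50, 0.20, 0.10, 0.05, 0.05],
    [0.30, 0.12, 0.40, 0.08, 0.05, 0.05],
    [0.05, 0.05, 0.08, 0.20, 0.50, 0.12],
    [0.40, 0.30, 0.15, 0.05, 0.05, 0.05],
]
labels = [1, 0, 5, 3]

top1 = top_k_accuracy(scores, labels, k=1)  # only the best guess counts
top3 = top_k_accuracy(scores, labels, k=3)  # any of the three best guesses counts
print(top1, top3)
```

Top-k accuracy can never be lower than top-1 accuracy, which is why the top-5 column is consistently higher than the top-1 column.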

The additional results of the author's findings are comparisons of the models made on accuracy, inference time (on CPU and GPU), and model size. 

<table><tr><td><img src='img/Accuracy-Comparison-of-Models.jpg'></td><td><img src='img/Model-Size-Comparison.jpg'></td></tr></table>

<table><tr><td><img src='img/Model-Inference-Time-Comparison-on-CPU-ms-Lower-is-better-.jpg'></td><td><img src='img/Model-Inference-Time-Comparison-on-GPU-ms-Lower-is-better-.jpg'></td></tr></table>

From the above data, the author concludes:
1.   ResNet50 is the best model in terms of all three parameters (small in size and closer to origin).
2.   DenseNets and ResNeXt101 are expensive on inference time.
3.   AlexNet and SqueezeNet have fairly high error rates.


While the above list is extensive, it is far from comprehensive, so I have added a few more models trained on ImageNet along with their accuracy and size. The data below comes from the same source as the previous table, the [Papers With Code website](https://paperswithcode.com/sota/image-classification-on-imagenet).

Model | Reference | Top-1 Accuracy (%) | Top-5 Accuracy (%) | Number of Params
--- | --- | --- | --- | ---
MobileNetV3_large_x1_0_ssld| [PaddlePaddle](https://github.com/PaddlePaddle/PaddleClas)|79.0|	94.5|	5.47M
FixEfficientNet-B4|[Facebook](https://github.com/facebookresearch/FixRes/blob/master/README_FixEfficientNet.md)|85.9|	97.7|	19M
Xception|[Keras](https://github.com/keras-team/keras/blob/master/keras/applications/xception.py)| 79|	94.5|	22.8M
InceptionResNet V2|[TensorFlow](https://github.com/tensorflow/models/tree/master/research/slim)|80.1|	95.1|	55.8M
FixEfficientNet-B7|[PyTorch](https://github.com/facebookresearch/FixRes/blob/master/README_FixEfficientNet.md)| 87.1|	98.2|	66M
VGG-16|[TensorFlow](https://github.com/tensorflow/models/tree/master/research/slim)| 74.4|91.9|	138M
FixEfficientNet-L2|[PyTorch](https://github.com/facebookresearch/FixRes/blob/master/README_FixEfficientNet.md)|88.4|98.7	|480M

From the above tables we see that iterations of **EfficientNet** and **MobileNet** provide good trade-offs between size and accuracy to be used as a reliable pre-trained model.
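
One way to check this size/accuracy trade-off is to keep only the models that no other model beats on both parameter count and top-1 accuracy (a Pareto front). The sketch below does this for the rows of the second table; the helper name `pareto_front` is introduced here for illustration only.

```python
# (name, parameters in millions, top-1 accuracy in %) copied from the second table.
models = [
    ("MobileNetV3_large_x1_0_ssld", 5.47, 79.0),
    ("FixEfficientNet-B4", 19.0, 85.9),
    ("Xception", 22.8, 79.0),
    ("InceptionResNet V2", 55.8, 80.1),
    ("FixEfficientNet-B7", 66.0, 87.1),
    ("VGG-16", 138.0, 74.4),
    ("FixEfficientNet-L2", 480.0, 88.4),
]


def pareto_front(entries):
    """Keep models that no other model matches or beats on both size and accuracy."""
    front = []
    for name, params, acc in entries:
        dominated = any(
            other_params <= params
            and other_acc >= acc
            and (other_params < params or other_acc > acc)
            for _, other_params, other_acc in entries
        )
        if not dominated:
            front.append(name)
    return front


front = pareto_front(models)
print(front)
```

On these rows, only the MobileNet and FixEfficientNet entries remain on the front.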

#### Pros: 

The findings here provide useful information for deciding which pre-trained model is best suited as a starting point for building NSFW classifiers. These results can be used to determine which models offer the best `trade-off` between computational complexity and accuracy.

#### Cons: 

While transfer learning is a great starting point, it only works if the initial and target problems are similar enough. If the training data for the new task is too far from the data of the old task, the trained models might perform worse than expected. This is a point of concern and requires checking, since all the models have been trained on ImageNet, which contains a vast repository of not just human SFW media but also objects, animals, flora and various other classes of data. Before building and training NSFW classifiers on top of these pre-trained models using NSFW datasets, thorough testing and inspection on smaller models is required in the initial stage of development.



## Summary (TL;DR)

Before detailing the existing research on NSFW media classification, I have created an index below that summarizes the best accuracy observed for the method proposed in each research paper compared in this report, along with the dataset and pre-trained model used.

Go to location in this notebook| Research Work | Pre-trained Model Used | Datasets Used | Accuracy %
---|---|---|---|---
   [Paper 1](#1)  |  [Yahoo Open NSFW](https://github.com/yahoo/open_nsfw) | Resnet50|  Own dataset | 99.5 
   [Paper 2](#2)  | [Multimodal Features with SVM classifier](http://docsdrive.com/pdfs/medwelljournals/jeasci/2018/1174-1182.pdf) | VGG-16, SVM| Modified Pornography-2k dataset| 63.4
   [Paper 3](#3)  | [Multimodal Features with MLP classifier](https://sci-hub.se/https://doi.org/10.1145/3323503.3360625) | Inception V3, Audio VGG CNN, MLP  | Own dataset + NPDI | 98.0 
   [Paper 4](#4)  |  [Spatio-Temporal Descriptor (TRoF)](https://sci-hub.se/10.1016/j.forsciint.2016.09.010) | BoVW + TRoF | Pornography-2k dataset | 95.58 
   [Paper 5](#5)  | [Weighted Multiple Instance Learning](https://arxiv.org/pdf/1902.03771.pdf) | GoogLeNet| Own dataset + NPDI dataset |  98.35
   [Paper 6](#6)  |  [Strongly Supervised Deep Multiple Instance Learning](https://sci-hub.se/10.1109/ICIP.2016.7533195) | GoogLeNet |  Own dataset | 98.4 
   [Paper 7](#7)  |   [(ACORDE) CNN + LSTM end-to-end architecture](https://sci-hub.se/10.1016/j.neucom.2017.07.012) |ResNet101 | Self curated dataset| 95.5
   [Paper 8](#8)  |   [Using Motion Information](https://sci-hub.se/10.1016/j.neucom.2016.12.017) | GoogLeNet | NPDI + Pornography-2k dataset| 97.9
   [Paper 9](#9)  |    [Deep One-Class Classification (DOCAPorn) With Visual Attention Mechanism](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9141172) | VGG-19|Own dataset + NPDI | 98.419
   [Paper 10](#10)  |   [AGNet for images and videos](https://arxiv.org/pdf/1511.08899.pdf) | AlexNet, GoogleNet| NPDI dataset| 94.1 
        

Besides the above, I have also perused and summarized some particularly insightful [blogposts](#blog) that have sought to implement and compare NSFW classifiers.



### <a id='1'></a> 1. [Yahoo open_nsfw](https://yahooeng.tumblr.com/post/151148689421/open-sourcing-a-deep-learning-solution-for)

Yahoo's Open NSFW is a popular model and one of the earliest to be published and released in the field of NSFW content classification.

The authors classify only one type of NSFW media, pornographic images; other categories like NSFW sketches, cartoons, text and gore have not been addressed. They did this to simplify their problem statement and reduce ambiguity, since content considered inappropriate in one context may be SFW in another. They have not published their dataset.

#### Implementation 

During training, the images were resized to 256x256 pixels, horizontally flipped for data augmentation, randomly cropped to 224x224 pixels, and then fed to the network. For training residual networks, the authors used scale augmentation as described in the [ResNet paper](https://arxiv.org/abs/1512.03385) to avoid overfitting.

They evaluated the following architectures to experiment with tradeoffs of runtime vs accuracy:

1.    [MS_CTC](https://arxiv.org/pdf/1412.1710.pdf) – This architecture was proposed in Microsoft’s constrained time cost paper. It improves on top of AlexNet in terms of speed and accuracy maintaining a combination of convolutional and fully-connected layers.
2.    [Squeezenet](https://arxiv.org/pdf/1602.07360.pdf) – This architecture introduces the fire module, which contains layers to squeeze and then expand the input data blob. This reduces the number of parameters while keeping the ImageNet accuracy as good as AlexNet, and the memory requirement is only 6MB.
3.    [VGG](https://arxiv.org/pdf/1409.1556.pdf) – This architecture has 13 conv layers and 3 FC layers.
4.    [GoogLeNet](https://arxiv.org/pdf/1409.4842.pdf) – GoogLeNet introduces inception modules and has 20 convolutional layer stages. It also uses hanging loss functions in intermediate layers to tackle the problem of diminishing gradients for deep networks.
5.    [ResNet-50](https://arxiv.org/pdf/1512.03385.pdf) – ResNets use shortcut connections to solve the problem of diminishing gradients. They used the 50-layer residual network released by the authors.
6.    ResNet-50-thin – The model was generated using their [pynetbuilder](https://github.com/jay-mahadeokar/pynetbuilder) tool and replicates the Residual Network paper’s 50-layer network (with half the number of filters in each layer). The generation and training details of this model are given [here](https://github.com/jay-mahadeokar/pynetbuilder/tree/master/models/imagenet).
    
The deep models were first pre-trained on the ImageNet 1000 class dataset. For each network, they replaced the last layer (FC1000) with a 2-node fully-connected layer. Then the weights were fine-tuned on the NSFW dataset. They kept the learning rate multiplier for the last FC layer 5 times the multiplier of other layers, which are being fine-tuned. They also tuned the hyper parameters (step size, base learning rate) to optimize the performance.
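
The learning-rate scheme described above can be sketched as a per-layer configuration in which the new final layer learns five times faster than the layers being fine-tuned. This is a framework-independent illustration, not the authors' Caffe configuration; the layer names and the base learning rate are hypothetical.

```python
def fine_tune_param_groups(layer_names, base_lr, last_layer_multiplier=5.0):
    """Return one optimizer group per layer; the final layer gets a larger learning rate."""
    groups = []
    for i, name in enumerate(layer_names):
        is_last = i == len(layer_names) - 1
        lr = base_lr * last_layer_multiplier if is_last else base_lr
        groups.append({"layer": name, "lr": lr})
    return groups


# Hypothetical layer names; "fc2" stands for the new 2-node layer that replaces FC1000.
layers = ["conv1", "res2", "res3", "res4", "res5", "fc2"]
groups = fine_tune_param_groups(layers, base_lr=0.001)
for group in groups:
    print(group)
```

The intuition is that the replaced layer starts from random weights and needs larger updates, while the pre-trained layers only need small adjustments.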

The authors have open-sourced this NSFW classification implementation of the ResNet-50-thin model [here](https://github.com/yahoo/open_nsfw); it is built using the Caffe deep learning framework and the CaffeOnSpark library for distributed learning.

#### Authors' Conclusions

The illustration below compares the performance of the models on ImageNet with that of their counterparts fine-tuned on the NSFW dataset.
<img src='img/yahoo_open_nsfw.png' width=550>
The illustration below shows the trade-offs between the different architectures: accuracy vs number of flops vs number of parameters in the network.
<img src='img/yahoo_open_nsfw2.png' width=500>

1. The performance of the models on NSFW classification tasks is related to the performance of the pre-trained model on ImageNet classification tasks, so a better pretrained model helps in fine-tuned classification tasks. 
2. ResNet-50 gives the best overall performance in terms of the trade-off between model size, computational complexity of the architecture, and accuracy as measured by top-1 ImageNet center-crop error.
 
#### Pros 

Their open-sourced implementation and the methodology described above can serve as a reference and starting point when exploring ways to build our own classifier. The comparisons of different models shed light on which models perform better when accuracy is traded off against computational cost in NSFW classification tasks.

#### Cons

The authors only take into consideration one category of NSFW content, i.e., pornographic images. This fails to shed light on how the models they used would fare on other categories of NSFW media. We do not know whether the findings also apply when a more comprehensive dataset is used to train their open-sourced model, and would have to verify this.


### <a id='2'></a>2. [Pornographic Video Detection Scheme Using Multimodal Features](http://docsdrive.com/pdfs/medwelljournals/jeasci/2018/1174-1182.pdf)

In this study, the authors propose a pornographic video detection scheme using multimodal features extracted with CNN architectures and audio/motion extractors and classification of extracted features using Support Vector Machine.

#### Implementation

The authors create a scheme for detecting pornography videos using multimodal features, those features being:
1. image features of each frame using deep learning architecture and image descriptor features of the frame sequence, extracted using the VGG-16 CNN, 
2. motion features extracted using [optical flow](https://www.sciencedirect.com/science/article/abs/pii/0004370281900242) and VGG-16, 
3. audio features extracted using a Mel-scaled spectrogram.

The final features for each model are obtained by average pooling each of the features over the samples in the video.
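
As a minimal sketch of this pooling step, the function below averages per-sample feature vectors into a single video-level vector. The 4-dimensional vectors are toy values; real CNN features would have far more dimensions.

```python
def average_pool(frame_features):
    """Average a list of equal-length per-frame feature vectors into one video-level vector."""
    n_frames = len(frame_features)
    dim = len(frame_features[0])
    return [sum(frame[d] for frame in frame_features) / n_frames for d in range(dim)]


# Three sampled frames, each with a toy 4-dimensional feature vector.
frames = [
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 0.0, 2.0, 0.0],
    [2.0, 3.0, 2.0, 2.0],
]
video_feature = average_pool(frames)
print(video_feature)
```

Pooling gives every video a fixed-length descriptor regardless of its duration, which is what a classifier such as an SVM expects.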

The illustration below gives the overview of their architecture.
<img src='img/1met.png' width=1000>

Each kind of feature is fed to its own SVM classifier, resulting in an image-sequence-based detector, a motion-based detector, and an audio-based detector. The final decision is made by stacking all the detectors.

The image below shows how the detectors are stacked with the feature extractors and SVM classifier to produce the final output.
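
Stacking generally means feeding the outputs of the individual detectors into a second-level classifier. The sketch below assumes a logistic meta-classifier with hand-picked weights; the paper's actual stacking model and parameters are not reproduced here, so `WEIGHTS` and `BIAS` are hypothetical.

```python
import math


def stack_detectors(image_score, motion_score, audio_score, weights, bias):
    """Combine three detector scores with a logistic meta-classifier."""
    detector_scores = (image_score, motion_score, audio_score)
    z = bias + sum(w * s for w, s in zip(weights, detector_scores))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical meta-classifier parameters; in practice they would be learned
# from held-out detector outputs rather than set by hand.
WEIGHTS = (2.0, 1.0, 1.0)
BIAS = -2.0

p_porn = stack_detectors(0.9, 0.8, 0.7, WEIGHTS, BIAS)  # all detectors fire
p_safe = stack_detectors(0.1, 0.2, 0.1, WEIGHTS, BIAS)  # no detector fires
print(round(p_porn, 3), round(p_safe, 3))
```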
<img src='img/2met.png' width=900>

The authors used a modified dataset based on the [pornography-2k dataset](https://recodbr.wordpress.com/code-n-data/#porno) which comprises 2,000 web videos and 140 hours of video footage for training and testing.

#### Authors' Conclusions

1. Their method achieves an average accuracy of 63.4%, with an average true positive rate of 100% for porn and an average false positive rate of 23.5%.
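
For reference, the three metrics quoted above are derived from confusion-matrix counts as follows. The counts in this sketch are made up for illustration and are not taken from the paper.

```python
def binary_rates(tp, fp, tn, fn):
    """Return (accuracy, true positive rate, false positive rate) as fractions."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    tpr = tp / (tp + fn)  # share of porn videos that are flagged
    fpr = fp / (fp + tn)  # share of non-porn videos that are wrongly flagged
    return accuracy, tpr, fpr


# Illustrative counts only, not taken from the paper.
accuracy, tpr, fpr = binary_rates(tp=50, fp=20, tn=80, fn=0)
print(accuracy, tpr, fpr)
```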

<table><tr><td><img src='img/4met.png'></td><td><img src='img/3met.png'></td></tr></table>

#### Pros

By using the various features of a video at once, the authors detect almost all pornographic events without being confused by a specific element of the input video. This gives us new ideas on feature extraction methods that use multimodal features like motion and sound along with images to aid in NSFW video classification.

#### Cons

The overall accuracy is somewhat low due to the high false positive rate, averaging 23.5%, and hence the method needs substantial architectural improvements. Also, the categories of NSFW content used are limited to pornography.


### <a id='3'></a>3. [NSFW Video Detection in e-learning Environments](https://sci-hub.se/https://doi.org/10.1145/3323503.3360625)

The main aim of the authors in this paper is to build a classifier that detects NSFW content in videos in educational and professional environments and websites.

The authors in this paper use multimodal features, taking inspiration from the paper [Pornographic Video Detection Scheme Using Multimodal Features](http://docsdrive.com/pdfs/medwelljournals/jeasci/2018/1174-1182.pdf), but try to reduce the high false positive rate obtained in that paper's results. The authors here use image sequence features and audio features. They use an Inception-V3 CNN for extracting image sequence features and an Audio VGG CNN for extracting audio features. They compare support vector machine, K-nearest neighbours and multi-layer perceptron algorithms to determine the best performing classifier on the extracted video features.

#### Implementation

Their CNN-based SFW/NSFW classifier is composed of two modules. The first module, which the researchers call the backbone, acts as the feature extractor from which the model draws its discriminating power. The second module, the classifier, operates over the features extracted by the backbone to aggregate and classify them.

The architecture of their NSFW classifier is illustrated below. 
<img src='img/baseline.png' width=600>

They opt for a bimodal approach that uses two backbones to extract the audio and image features from videos. Once they extract the features from the video, they use a shallow model to perform the video classification.

The dataset the authors use has not been published for public use but has been described by them as being structured into videos of appropriate (SFW) and inappropriate content (NSFW). It is divided into 55,400 SFW videos, 50,400 NSFW videos. 

> For appropriate content, they selected educational videos from public repositories. They include 49,920 videos from [Youtube8M](https://research.google.com/youtube8m/) from "Jobs & Education" and "News" categories, 5,000 videos from [video@RNP](https://www.rnp.br/en/node/720), 400 SFW videos divided into 'easy' and 'hard' from the [NPDI](https://sites.google.com/site/pornographydatabase/) dataset, and the [Cholec80](http://camma.u-strasbg.fr/datasets) dataset, which contains 80 videos of cholecystectomy surgeries performed by 13 surgeons.
    
> For inappropriate content, they selected porn and violence from specialized websites. For the porn type, they extracted 47,758 videos from XVideos. For the gore content, they used a web crawler to extract 2,242 gore videos from various websites dedicated to gore media such as [Gore](https://www.gore.com.b), [BestGore](https://www.bestgore.com/) and [GoreBrasil](https://www.gorebrasil.com).

The authors use the following algorithms for NSFW video classification: 

1. Support Vector Machine (SVM), in which the data is mapped into a higher-dimensional space where an optimal separating hyperplane is constructed. These decision surfaces are found by solving a linearly constrained quadratic programming problem.

2. K-Nearest Neighbors (KNN) uses a distance measure between training samples so that the k-nearest neighbors always belong to the same class, while samples from different classes are separated by a large margin.

3. Multilayer perceptron (MLP) contains layers of nodes: an input layer, an output layer and various hidden layers in between. The number of layers used is problem dependent, as is the number of nodes in each hidden layer. The weights are adjusted by local optimization using a set of feature vectors so that the network produces the optimal expected output.
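
As an illustration of the simplest of these three classifiers operating on extracted features, here is a minimal k-nearest-neighbours sketch with plain Euclidean distance and majority voting. The 2-dimensional feature vectors are toy values, and the paper's exact KNN variant and settings may differ.

```python
import math
from collections import Counter


def knn_predict(train_features, train_labels, query, k=3):
    """Label a query vector by majority vote among its k nearest training vectors."""
    distances = sorted(
        (math.dist(vec, query), label)
        for vec, label in zip(train_features, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Toy 2-D "video features"; real backbone features are much higher dimensional.
train_features = [[0.1, 0.2], [0.2, 0.1], [0.0, 0.3], [0.9, 0.8], [0.8, 0.9], [1.0, 0.7]]
train_labels = ["SFW", "SFW", "SFW", "NSFW", "NSFW", "NSFW"]

print(knn_predict(train_features, train_labels, [0.85, 0.85]))
print(knn_predict(train_features, train_labels, [0.15, 0.15]))
```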

#### Authors' Conclusions

1. Feature extraction using deep learning yields high accuracy even with linear or shallow models. This highlights that the feature extraction method is a solid base from which to build deeper classification models.

2. The best performing baseline model going the feature extraction route is the MLP, with 98% accuracy, a 98.02% F1-score for the SFW class and a 97.97% F1-score for the NSFW class.

The illustrations below show the results on the precision, recall, and accuracy evaluation metrics:

<table><tr><td><img src='img/1.png'></td><td><img src='img/2.png'></td><td><img src='img/3.png'></td></tr></table>

The confusion matrix below shows the aggregated results:
<img src='img/conf.png' width=500>

(The F1-score represents an overall performance metric, and the precision and recall metrics can give insights on where the classification model is doing better.)
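
The relationship between these metrics can be sketched as below: the F1-score is the harmonic mean of precision and recall. The counts are illustrative and not taken from the paper's confusion matrix.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) for one class from its confusion-matrix counts."""
    precision = tp / (tp + fp)  # how many flagged items were truly of this class
    recall = tp / (tp + fn)     # how many items of this class were flagged
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative counts only.
precision, recall, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(precision, recall, f1)
```

Here precision is higher than recall, so the F1-score lands between the two and closer to the lower value.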

#### Pros

The authors' research sheds light on feature extraction methods that use multimodal features like sound along with images to aid in NSFW video classification. This can be used as a point of reference to design our own feature extractor architectures for NSFW videos. They also compare three classification algorithms on the extracted features, which can be useful for future reference.

#### Cons

The authors have not used a broad category of content for their experiments beyond porn and gore and therefore it remains to be seen how well their method would perform with more comprehensive datasets.


### <a id='4'></a>4.  [Pornography Classification Using Video Space-Time](https://sci-hub.se/10.1016/j.forsciint.2016.09.010)

#### Implementation

In this paper, the authors focus on video-pornography classification, using a spatio-temporal interest point detector and descriptor called Temporal Robust Features (TRoF). TRoF was custom-tailored for efficient (low processing time and memory footprint) and effective (high classification accuracy and low false negative rate) motion description, particularly suited to the task at hand. The authors aggregate local information extracted by TRoF into a mid-level representation using Fisher Vectors, the state-of-the-art model of Bags of Visual Words (BoVW).

The performance is assessed using the [Pornography-2k dataset](https://recodbr.wordpress.com/code-n-data/#porno), which has been curated by the authors themselves. 

>    The dataset comprises 2,000 web videos and 140 hours of video footage including both professional and amateur content, and it depicts several genres of pornography, from cartoon to live action, with diverse behaviour and ethnicity. It contains 1,000 pornographic and 1,000 non-pornographic videos, each of which varies from six seconds to 33 minutes. It is an improvement upon the [NPDI dataset](https://sites.google.com/site/pornographydatabase/).

The TRoF detector is directly inspired by the still-image [Speeded-Up Robust Features (SURF) detector](https://www.sciencedirect.com/science/article/abs/pii/S1077314207001555), which is very fast. It relies on three major extensions of the original method to use the video space-time: 
1. the employment of five-variable Hessian matrices, 
2. three-dimensional box filters,
3. the concept of integral video.
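
The integral video extends the integral image to three dimensions: each entry stores the sum of all voxels before it in time, height and width, so the sum inside any three-dimensional box can be read with eight lookups. The sketch below is a plain-Python illustration of that idea, not the authors' implementation.

```python
def integral_video(video):
    """Cumulative sums over (t, y, x), padded with a zero border for easy lookups."""
    T, H, W = len(video), len(video[0]), len(video[0][0])
    iv = [[[0] * (W + 1) for _ in range(H + 1)] for _ in range(T + 1)]
    for t in range(T):
        for y in range(H):
            for x in range(W):
                # Three-dimensional inclusion-exclusion over already computed entries.
                iv[t + 1][y + 1][x + 1] = (
                    video[t][y][x]
                    + iv[t][y + 1][x + 1] + iv[t + 1][y][x + 1] + iv[t + 1][y + 1][x]
                    - iv[t][y][x + 1] - iv[t][y + 1][x] - iv[t + 1][y][x]
                    + iv[t][y][x]
                )
    return iv


def box_sum(iv, t0, y0, x0, t1, y1, x1):
    """Sum of video[t0:t1][y0:y1][x0:x1] using eight lookups."""
    return (
        iv[t1][y1][x1]
        - iv[t0][y1][x1] - iv[t1][y0][x1] - iv[t1][y1][x0]
        + iv[t0][y0][x1] + iv[t0][y1][x0] + iv[t1][y0][x0]
        - iv[t0][y0][x0]
    )


# Toy video: 2 frames of 3x4 pixels with distinct integer values.
video = [[[t * 100 + y * 10 + x for x in range(4)] for y in range(3)] for t in range(2)]
iv = integral_video(video)
print(box_sum(iv, 0, 1, 1, 2, 3, 3))
```

Once the integral video is built, the cost of a box filter no longer depends on the box size, which is what makes three-dimensional box filters cheap to evaluate.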

#### Authors' Conclusions

1. The best approach, based on a dense application of TRoF, yields a classification error reduction of almost 79% when compared to the best commercial classifier. 
2. A sparse description relying on the TRoF detector is also noteworthy, yielding a classification error reduction of over 69% with a 19× smaller memory footprint than the dense solution, and it can also be implemented to meet real-time requirements.

An illustration of the results obtained by the authors is shown below:
<img src='img/trof.png' width=500>

#### Pros

The authors tackle the problem of automatically detecting pornography by introducing a novel method that works on videos without employing still-image techniques (labeling frames individually prior to a global decision), as frame-based methods often ignore significant information carried by motion. Their proposed Temporal Robust Features (TRoF) factor in the motion in videos as well.

#### Cons

The categories of NSFW content used are limited to pornography. The performance of TRoF on a more comprehensive dataset is yet to be tested.


### <a id='5'></a>5. [Pornographic Image Recognition via Weighted Multiple Instance Learning (MIL)](https://arxiv.org/pdf/1902.03771.pdf)

In this paper, the authors model each image as a bag of regions and follow a multiple instance learning (MIL) approach to train a generic region-based recognition model.

#### Implementation

The authors take into account the regions’ degree of pornography. They present a simple quantitative measure of a region’s degree of pornography, which can be used to weigh the importance of different regions in a positive image, and formulate the recognition task as a weighted MIL problem under the CNN framework, with a bag probability function introduced to combine the importance of different regions.

By randomly moving the windows from each annotated region with different displacements, they generate about 100 regions for a positive image, among which some are typical positive regions, some are non-typical positive regions, and some are non-pornographic regions.

Each image is modeled as a bag of regions. The deep CNN extracts layer-wise representations from the first convolutional layer to the last fully connected layer. Their CNN architecture is inspired by the GoogLeNet model. The output of the last fully connected layer is a 1,000-dimensional vector, followed by a softmax layer to transform it into a probability distribution over 1,000 object categories. Since this task requires binary classification, they redesign the output of the last fully connected layer to be a 1-dimensional vector, and transform it into a Bernoulli distribution via the sigmoid function.
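
To illustrate how per-region probabilities and region weights can be combined into a single image-level probability, the sketch below uses a weighted noisy-OR. This particular form is an assumption made here for illustration; the paper's exact bag probability function is not reproduced.

```python
def weighted_bag_probability(region_probs, region_weights):
    """Weighted noisy-OR over region probabilities; a weight of 0 ignores a region."""
    prob_all_negative = 1.0
    for p, w in zip(region_probs, region_weights):
        prob_all_negative *= 1.0 - w * p
    return 1.0 - prob_all_negative


# Illustrative region scores (sigmoid outputs) and weights in [0, 1].
probs = [0.9, 0.6, 0.1]
weights = [1.0, 0.5, 0.0]
bag_prob = weighted_bag_probability(probs, weights)
print(round(bag_prob, 2))
```

With all weights equal to 1 this reduces to the usual noisy-OR used in multiple instance learning, where one confident region is enough to make the whole bag positive.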

The authors use a combination of a dataset that they curate themselves and the [NPDI dataset](https://sites.google.com/site/pornographydatabase/) to train their model with. 
Their own dataset is described below:

> 138,000 pornographic images and 205,000 normal images collected from the Internet. Overall, they categorize the pornographic images into three groups, namely, regular nudity, sexual behaviour, and unprofessional porn. The normal images of their dataset are also downloaded from the Internet and can be further categorized into three groups, namely, scantily-clad people, normal people and no people.

> From the pornographic images, they randomly select 33,000 images to annotate their key pornographic contents with bounding boxes. They use the annotated 33,000 pornographic images and 100,000 randomly selected normal images as the training set, randomly select 5,000 pornographic images and 5,000 normal images from the remaining images as the validation set, while the remaining 100,000 pornographic images and 100,000 normal images are used as the test set.

They compare their method with five traditional deep learning baseline models. 
The illustrations below compare the accuracy of their model with that of other traditional methods on the dataset they curated, and cross-reference the results on the NPDI dataset:

<table><tr><td><img src='img/mil0-1.png'></td><td><img src='img/mil0.png'></td></tr></table>


#### Authors' Conclusions

1. Experiments on their large-scale dataset demonstrate the effectiveness of the proposed method, achieving a high accuracy with a 97.52% true positive rate at a 1% false positive rate.
2. The proposed method is highly efficient and real-time capable when implemented on a GPU.

#### Pros

They present a novel approach based on the region of interest method to detect porn images. They show that, based on very few annotations of the key pornographic contents in a training image, a bag of properly sized regions can be generated, among which the potential positive regions usually contain useful contexts that can aid in NSFW media recognition.

#### Cons

The dataset is limited to pornography as the only category of NSFW content, and as such the authors' findings and results can be relied upon only for the porn genre. This method is yet to be tested on a comprehensive dataset that includes various NSFW media genres.


### <a id='6'></a>6. [Pornographic Image Recognition by Strongly-Supervised Deep Multiple Instance Learning (SD-MIL)](https://sci-hub.se/10.1109/ICIP.2016.7533195)

The authors model each image as a bag of local image patches (instances), and assume that for each pornographic image at least one instance accounts for the pornographic content within it. This allows them to cast the model training as a Multiple Instance Learning (MIL) problem. Furthermore, they propose a strongly-supervised setting for MIL by identifying the most likely pornographic instances in positive bags, which effectively prevents the algorithm from getting trapped in a bad local optimum. They also formulate the strongly-supervised MIL under the deep CNN framework to learn deep representations, referring to it as Strongly-supervised Deep MIL (SD-MIL).

#### Implementation

There are three key components in constructing the Strongly-supervised Deep MIL(SD-MIL) based pornographic image recognition system. They are:

1. Instance Generation - In this work they use the sliding window method to generate multiple instances from an image. They resize each image to 434×434, and then extract 16 image patches of 224×224 with a step of 70.

2. Instance Selection - With the multiple instances obtained by sliding windows, they select the most likely positive instances in positive bags to prevent training from prematurely locking onto erroneous instances by developing an efficient semi-automatic strategy for instance selection.

3. DCNN-based Learning - Given one instance of an image, a deep CNN extracts layer-wise representations of it. Then, given a bag of instances, a multiple deep CNN extracts representations of it in which each column is the representation of an instance. Here they used the open-source package Caffe to extract deep features, redefine the objective and fine-tune the CNNs based on the GoogLeNet model.
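
The instance generation step (step 1 above) can be checked with a short sketch: sliding a 224×224 window over a 434×434 image with a step of 70 gives four offsets per axis and therefore 16 patches. The function name and the box format are choices made here for illustration.

```python
def sliding_windows(image_size=434, patch_size=224, step=70):
    """Return (left, top, right, bottom) boxes for every patch position."""
    offsets = range(0, image_size - patch_size + 1, step)
    return [(x, y, x + patch_size, y + patch_size) for y in offsets for x in offsets]


windows = sliding_windows()
print(len(windows), windows[0], windows[-1])
```

The last window ends exactly at pixel 434, so the patches tile the resized image without padding.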

The authors have self-curated a dataset.

> They have collected a large-scale dataset consisting of 155,000 pornographic images and 222,000 normal images from the Internet. Of these pornographic images, they randomly select 33,000 images to annotate exposed private parts with keypoints, which serves as strong supervision in the training phase.
    
> To conduct their experiments, they use the annotated 33,000 pornographic images and 100,000 randomly selected normal images as the training set, randomly select 5,000 pornographic images and 5,000 normal images from the remaining images as the validation set, while the remaining 117,000 pornographic images and 117,000 normal images are used as the test set.

They use the [NPDI dataset](https://sites.google.com/site/pornographydatabase/) only for cross-database testing against their own dataset.

They compare their method with two traditional methods that use shallow low-level features, i.e., the [retrieval-based method](https://dl.acm.org/doi/abs/10.1016/j.patrec.2007.08.002) and the [Bag-of-Feature based method](https://ieeexplore.ieee.org/document/7077625). They then compare it with the following in-house deep learning baselines to verify the effectiveness of the proposed algorithm's components:
1. Deep holistic (D-Holistic) image method, which trains CNNs on holistic images rather than on the multiple-instance representation
2. Deep part detector (D-Part Detector) method, which trains independent part detectors for the female breast, female sexual organ and male sexual organ with 70×70 patches centered at keypoints. The trained part detectors are then used to scan the image at test time
3. Deep MIL (D-MIL) method, which uses no additional supervision on instances

#### Authors' Conclusions

1. The authors demonstrate that their SD-MIL based system produces remarkable accuracy with 97.01% TPR at 1% FPR, and achieves 55 FPS with GPU.
2. On the NPDI Pornography database, the SD-MIL model performs well, scoring 97.5% as measured by Mean Average Precision.
3. Their method narrows down the range of positive instances in a positive bag, which effectively prevents training from prematurely locking onto erroneous instances.
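
A figure such as "TPR at 1% FPR" is read off by fixing the decision threshold on the negative examples first and then measuring recall on the positives. The sketch below shows one way to compute it; the function name and the synthetic scores are assumptions, and the paper's own evaluation code is not available here.

```python
import numpy as np

def tpr_at_fpr(scores, labels, target_fpr=0.01):
    """Return (TPR, FPR, threshold) with the FPR held at or below target_fpr."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    negatives = np.sort(scores[~labels])[::-1]  # negative scores, highest first
    # Number of negatives that may be flagged without exceeding the target FPR.
    allowed = int(target_fpr * negatives.size + 1e-9)
    # Flag only scores strictly above the first negative that must stay unflagged.
    threshold = negatives[allowed] if allowed < negatives.size else -np.inf
    flagged = scores > threshold
    return flagged[labels].mean(), flagged[~labels].mean(), threshold

# Synthetic scores: 100 normal images and 4 pornographic images.
negative_scores = np.arange(100) / 100.0
positive_scores = np.array([0.999, 0.995, 0.985, 0.50])
scores = np.concatenate([negative_scores, positive_scores])
labels = np.concatenate([np.zeros(100), np.ones(4)])

tpr, fpr, threshold = tpr_at_fpr(scores, labels)
print(tpr, fpr, threshold)
```

In this toy data one normal image out of 100 is flagged (1% FPR), and three of the four positive images score above the resulting threshold.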

<img src='img/mil.png' width=500>

#### Pros

The authors propose a novel method that treats each image as a bag of instances, assuming that at least one instance accounts for the pornographic content in a pornographic image. This lets them cast the task as a Multiple Instance Learning problem.

#### Cons

The dataset treats pornography as the only category of NSFW content, so the authors' findings and results can only be relied upon for that one genre of NSFW content. The method has yet to be tested on a comprehensive dataset that includes various NSFW media genres.


### <a id='7'></a>7. [Adult Content Detection in Videos with Convolutional and Recurrent Neural Networks](https://sci-hub.se/10.1016/j.neucom.2017.07.012)

In this paper, the authors propose a novel approach for adult content detection in videos, namely ACORDE (Adult Content Recognition with Deep Neural Networks).

#### Implementation

Its architecture uses a convolutional neural network (CNN) as a feature extractor and a long short-term memory (LSTM) network to perform the final video classification. ACORDE extracts feature vectors from the video key frames, building a sorted set of semantic descriptors. This set is fed to the LSTM, which is responsible for analyzing the video in an end-to-end fashion. The proposed approach requires neither fine-tuning nor retraining of the CNN.
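
The data flow described above (frozen CNN descriptors per key frame, fed in order to an LSTM, followed by a single video-level decision) can be sketched in plain NumPy. Everything concrete here is an assumption for illustration: the feature size, the hidden size, the number of key frames and the random, untrained weights. The output is therefore meaningless as a prediction and only shows the shapes involved.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_last_hidden(sequence, W, U, b):
    """Run a single-layer LSTM over `sequence` and return the final hidden state.

    W: (4H, D), U: (4H, H), b: (4H,), with the gates stacked in the order
    input, forget, cell candidate, output.
    """
    hidden_size = U.shape[1]
    h = np.zeros(hidden_size)
    c = np.zeros(hidden_size)
    for x in sequence:
        i, f, g, o = np.split(W @ x + U @ h + b, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
    return h

rng = np.random.default_rng(0)
feature_dim, hidden_size, num_keyframes = 1024, 32, 12  # hypothetical sizes

# Stand-in for descriptors produced by a frozen CNN, one row per key frame,
# kept in temporal order.
keyframe_features = rng.normal(size=(num_keyframes, feature_dim))

# Untrained placeholder weights; in practice only the LSTM and the output
# layer would be trained, while the CNN stays fixed.
W = rng.normal(scale=0.01, size=(4 * hidden_size, feature_dim))
U = rng.normal(scale=0.01, size=(4 * hidden_size, hidden_size))
b = np.zeros(4 * hidden_size)
w_out = rng.normal(scale=0.1, size=hidden_size)

h_last = lstm_last_hidden(keyframe_features, W, U, b)
p_adult = sigmoid(w_out @ h_last)  # one probability for the whole video
print(h_last.shape, float(p_adult))
```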

<img src='img/acorde.png' width=400>

They use the [NPDI dataset](https://sites.google.com/site/pornographydatabase/) to conduct their experiments.

#### Authors' Conclusions

1. The authors indicate that the high-level features from the CNN and the sequence learning of the LSTM are robust in identifying adult content in scenarios with large skin-exposure.
2. They are able to achieve a TP rate of 96% with 4% FP rate and overall accuracy of 95.5%.

The illustrations below show a comparison of ACORDE with various other methods:

<table><tr><td><img src='img/acc1.png'></td><td><img src='img/acc2.png'></td></tr></table>

#### Pros

The authors present an end-to-end solution using a CNN-LSTM architecture, which is novel in this field. This opens up possibilities to build encoder-decoder type models for NSFW content classification.

#### Cons

The dataset is limited to the NPDI pornography database, so the findings do not cover other NSFW genres.


### <a id='8'></a>8. [Video pornography detection through deep learning techniques and motion information](https://sci-hub.se/10.1016/j.neucom.2016.12.017)

The authors propose novel ways of combining static (picture) and dynamic (motion) information using [optical flow](https://www.sciencedirect.com/science/article/abs/pii/0004370281900242) and [MPEG motion vectors](https://cyber.sci-hub.se/MTAuMTAxNi9qLm5ldWNvbS4yMDE2LjEyLjAxNw==/perez2016.pdf#cite.Richardson2004). They show that both methods provide equivalent accuracies, but that MPEG motion vectors allow a more efficient implementation.

#### Implementation

Although in this work the authors focus on the pornography modality, they claim that their method is versatile and its extension to other types of sensitive content is straightforward.  
The contributions of this paper are three-fold:
1. A novel method for classifying pornographic videos, using convolutional neural networks along with static and motion information
2. A new technique for exploring the motion information contained in the MPEG motion vectors
3. A study of different forms of combining the static and motion information extracted from questioned videos

They describe separate pipelines for static and motion-based information flow. Illustrations of the pipelines are shown here:
<table><tr><td><img src='img/static.png'></td><td><img src='img/motion.png'></td></tr></table>

They propose 3 fusion methodologies to combine the extracted motion-based and static data features:

1. Early Fusion - In the early fusion method, the static and the motion information are combined at the very beginning of the pipeline and processed together by a special CNN.
2. Mid-level Fusion - In the mid-level fusion, they concatenate the features extracted from each type of information (static or motion-based), and from each independent CNN, into a single feature vector before feeding a classifier.
3. Late Fusion - In this fusion scheme, each type of information is processed by a separate decision-making approach (e.g., an SVM classifier), generating independent classification scores that can then be combined into a single score for the final classification.
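
The difference between mid-level and late fusion can be shown with a short sketch. The feature sizes, the weighted average used to combine scores and the function names are assumptions for illustration; the paper may combine scores differently. Early fusion is left out because it depends on the design of the special CNN.

```python
import numpy as np

def mid_level_fusion(static_features, motion_features):
    """Concatenate per-video features from the two streams into one vector."""
    return np.concatenate([static_features, motion_features], axis=-1)

def late_fusion(static_score, motion_score, static_weight=0.5):
    """Combine two independent classifier scores into a single score."""
    return static_weight * static_score + (1.0 - static_weight) * motion_score

# Hypothetical 1024-dimensional features for two videos from each stream.
static_features = np.ones((2, 1024))
motion_features = np.zeros((2, 1024))

fused = mid_level_fusion(static_features, motion_features)
print(fused.shape)               # (2, 2048)
print(late_fusion(0.75, 0.25))   # 0.5
```

In mid-level fusion a single classifier sees the joint feature vector, whereas in late fusion each stream keeps its own classifier and only the scores meet.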

Illustrations of the architectures are shown below:
<table><tr><td><img src='img/early.png'></td><td><img src='img/middle.png'></td><td><img src='img/late.png'></td></tr></table>

They used the GoogLeNet pre-trained model trained on ImageNet using the Caffe framework. They used the [NPDI](https://sites.google.com/site/pornographydatabase/) and the [pornography-2k dataset](https://recodbr.wordpress.com/code-n-data/#porno) in their experiments.

#### Authors' Conclusions

1. The best proposed method yields a classification accuracy of 97.9% on the NPDI dataset.
2. In the static stream, the model relying on the GoogLeNet architecture trained with ImageNet data yields an impressive performance of 94.6% accuracy and 95.1% F2 score. These results are further improved upon by fine-tuning the network weights with the pornographic data, reaching 96.0% accuracy and 96.1% F2.
3. When considering the motion information, optical flow (OF) by itself yielded a performance close to the static model. Meanwhile, the MPEG motion vectors (MV) led to a lower performance of 91.0% accuracy and 92.0% F2. The difference in performance between the two sources of motion information may be explained by the fact that the MV represents the motion of a macroblock of pixels, which is a much less fine-grained description than OF, which takes into account the motion information for each pixel.
4. Despite the lower performance of the motion information alone, when it is combined with the static information from the fine-tuned network (pornography-specialized network) by mid-level fusion and late fusion, the accuracy and F2 scores improve.
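
The F2 score reported above is the F-beta score with beta = 2, which weights recall more heavily than precision (a missed pornographic video costs more than a false alarm). A minimal sketch of the standard formula follows; the precision and recall values in the example are made up.

```python
def f_beta(precision, recall, beta=2.0):
    """F-beta score: recall counts beta times as much as precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    beta_sq = beta * beta
    return (1.0 + beta_sq) * precision * recall / (beta_sq * precision + recall)

# Made-up example: perfect recall but only half of the flagged videos are correct.
print(f_beta(0.5, 1.0))            # F2, about 0.833
print(f_beta(0.5, 1.0, beta=1.0))  # F1, about 0.667
```

The same classifier scores higher under F2 than under F1 here because its weakness is in precision, which F2 discounts.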

The authors conclude that mid-level and late fusion methodologies yield the best result. They have compared their findings with existing state of the art methods on the NPDI and Pornography-2k datasets as shown below:
<table><tr><td><img src='img/res-fus.png'></td><td><img src='img/resfus0.png'></td></tr></table>

#### Pros

The evaluation of their techniques shows that the association of deep learning with the combined use of static and motion information considerably improves pornography detection. Their solution also proves to be superior to general-purpose action recognition features when applied to pornography detection.

#### Cons

The dataset is limited to the pornography category, even though the authors claim the results can be extrapolated to a different, more comprehensive NSFW dataset.


### <a id='9'></a>9. [A Pornographic Images Recognition Model based on Deep One-Class Classification With Visual Attention Mechanism](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9141172)

The overall architecture of the proposed approach in this paper consists of a Deep One-Class with Attention for Pornography (DOCAPorn), a Preprocessing for Compressing and Reconstructing (PreCR), and the Scale Constraint Pooling (SCP).

#### Implementation

The proposals in this paper are:
1. Deep One-Class with Attention for Pornography (DOCAPorn) - This method recognizes pornographic images through a one-class classification model based on neural networks and introduces a visual attention mechanism to enhance recognition performance.

2. Scale Constraint Pooling (SCP) - Existing approaches based on deep CNNs require a fixed-size (e.g., 224×224) input image and do not account for the geometric distortion caused by image scaling, which reduces pornographic image recognition accuracy. To solve this issue, the paper proposes SCP, which converts inputs of different dimensions into outputs of the same dimension. It is a variant of max pooling that constrains features of different scales to a feature map of the same scale.

3. Preprocessing for Compressing and Reconstructing (PreCR) - According to the authors, existing approaches ignore adversarial attacks in the field of pornographic image recognition. Because adversarial perturbations are subtle and difficult for humans to detect, the paper proposes PreCR, a pre-processing approach that reduces the perturbation by compressing the image and then reconstructs a purified image for recognition.
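
The idea behind SCP (mapping feature maps of different sizes to one fixed-size output with max pooling) can be illustrated with a generic adaptive max pooling sketch. This is not the paper's SCP layer: the 7×7 output grid, the 512 channels and the way cells are split are assumptions chosen for illustration.

```python
import numpy as np

def adaptive_max_pool(feature_map, out_size=7):
    """Max-pool an (H, W, C) feature map down to (out_size, out_size, C)."""
    height, width, channels = feature_map.shape
    if height < out_size or width < out_size:
        raise ValueError("feature map is smaller than the requested output grid")
    # Integer cell boundaries that split each axis into out_size non-empty cells.
    row_edges = (np.arange(out_size + 1) * height) // out_size
    col_edges = (np.arange(out_size + 1) * width) // out_size
    pooled = np.empty((out_size, out_size, channels), dtype=feature_map.dtype)
    for r in range(out_size):
        for c in range(out_size):
            cell = feature_map[row_edges[r]:row_edges[r + 1],
                               col_edges[c]:col_edges[c + 1]]
            pooled[r, c] = cell.max(axis=(0, 1))
    return pooled

rng = np.random.default_rng(0)
wide = rng.random((14, 20, 512))  # feature map from a wide input image
tall = rng.random((9, 7, 512))    # feature map from a tall input image
print(adaptive_max_pool(wide).shape, adaptive_max_pool(tall).shape)
```

Both inputs end up with the same output shape, so the image itself does not need to be stretched to a fixed size before the convolutional layers.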

The design of the proposed DOCAPorn architecture draws on the VGG19 neural network model.

The proposed architecture is shown below:
<img src='img/doca.png' width=450>

The authors curated a dataset of their own as follows: 

> 500,000 pornographic images were obtained for the dataset. The collected images included white, black, and Asian subjects.  500,000 normal images from the Internet and the ImageNet dataset were also collected. The training, validation and test sets of non-target classes were 80%, 10%, and 10% of the dataset, respectively. 

> A total of 10,000 key frames were extracted each from pornographic and normal videos. These keyframes were utilized to test the accuracy of the proposed framework.

The [NPDI](https://sites.google.com/site/pornographydatabase/) dataset was used for cross-database testing.

#### Authors' Conclusions

1. The proposed approach is verified by conducting comparative experiments using custom datasets. The experimental results showed that they achieved an accuracy of 98.419% on their own dataset. 
2. The proposed approach yielded a classification accuracy of 95.632% on the NPDI dataset. 

Illustrations of the performance of DOCAPorn against other methods, and of the benefit of SCP compared with methods that do not use it, are shown below:
<table><tr><td><img src='img/doca-1.png'></td><td><img src='img/doca-2.png'></td></tr></table>

To support their proposal, the authors also report the performance of the model with and without the visual attention mechanism, as shown below:
<img src='img/doca-3.png' width=450>

#### Pros

Introducing the visual attention mechanism into deep one-class classification makes the neural network focus on the target object while ignoring some of the irrelevant information, which improves detection and classification. SCP avoids unnecessary geometric distortion and makes the training data conform to real-world conditions. With these, the authors were able to consistently obtain high accuracies with their model.

#### Cons

The authors use porn media as the sole NSFW category. Moreover, the authors do not shed light on the computational complexity of using the attention model with CNNs.


### <a id='10'></a>10. [Applying deep learning to classify pornographic images and videos](https://arxiv.org/pdf/1511.08899.pdf)

#### Implementation

The authors propose applying a combination of CNNs to distinguish porn from regular images and video frames. They make slight modifications to the existing AlexNet and GoogLeNet architectures to suit their problem, and then propose a simple fusion of both networks.

For videos, they take a majority vote over all frames that belong to the same video sequence. They record the average correct classification rate over all 5 folds, along with the standard deviation. The reported correct classification rate is the average of ‘correctly classifying benign as benign’ and ‘correctly classifying porn as porn’.
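
The three steps described here (fusing the two networks, voting over frames, and averaging the per-class rates) can be sketched as follows. The equal-weight average follows the description of AGNet in the conclusions below; the `0.5` threshold, the tie-breaking rule and all example values are assumptions.

```python
import numpy as np

def fuse_equal_weight(anet_probs, gnet_probs):
    """Average the per-frame porn probabilities of the two networks."""
    return 0.5 * (np.asarray(anet_probs) + np.asarray(gnet_probs))

def video_label(frame_porn_probs, threshold=0.5):
    """Majority vote over frames; ties fall back to 'benign' (an assumption)."""
    votes = np.asarray(frame_porn_probs) >= threshold
    return "porn" if votes.sum() * 2 > votes.size else "benign"

def mean_class_rate(true_labels, predicted_labels):
    """Average of 'benign as benign' and 'porn as porn' classification rates."""
    rates = []
    for cls in ("benign", "porn"):
        indices = [i for i, label in enumerate(true_labels) if label == cls]
        correct = sum(predicted_labels[i] == cls for i in indices)
        rates.append(correct / len(indices))
    return sum(rates) / len(rates)

# Hypothetical per-frame outputs of the two networks for one 3-frame video.
fused = fuse_equal_weight([0.9, 0.4, 0.7], [0.7, 0.8, 0.1])
print(video_label(fused))  # porn (2 of 3 frames vote porn)

# Hypothetical video-level results for four videos.
truth = ["porn", "porn", "benign", "benign"]
predicted = ["porn", "benign", "benign", "benign"]
print(mean_class_rate(truth, predicted))  # 0.75
```

Averaging the two per-class rates keeps the metric from being dominated by whichever class has more videos.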

The illustration below shows their architecture with ANet being AlexNet and GNet being GoogLeNet models.
<img src='img/ag.png' width=500>

They use the NPDI dataset for their experiments.

#### Authors' Conclusions

1. AGNet outperforms AlexNet in accuracy, and AGNet (with a simple equal-weighted average) is slightly more accurate than GNet.
2. AGNet produced smaller variance than either ANet or GNet.
3. The automatic method proposed by the authors outperforms solutions based on hand-crafted feature descriptors in accuracy.

<img src='img/agnet.png' width=500>

#### Pros

This method served as a reference benchmark for later work on automatic porn classification. The proposed method opens up new possibilities of using existing CNN architectures in combination for porn detection and recognition.

#### Cons

The authors only use a standard pornography genre from the NPDI dataset to draw their conclusions for the proposed network. The results have also not taken into consideration model sizes or computational complexity as such, only working towards the simple aim of improving upon AlexNet and GoogLeNet.


## <a id='blog'></a>Notable work published as blog-posts for NSFW classification

Some authors have published their findings as blogposts. I decided to summarize a few of those works here.


### 1. [Comparison of the best NSFW Image Moderation APIs (2018)](https://towardsdatascience.com/comparison-of-the-best-nsfw-image-moderation-apis-2018-84be8da65303)

The author compares various NSFW classification APIs in a survey. The APIs' performance is compared on the following categories of NSFW media:

> Explicit Nudity, suggestive Nudity, porn/sexual act, simulated/animated porn, gore/violence

The author evaluated the following API moderators:
1. Amazon Rekognition
2. Google
3. Microsoft
4. Yahoo
5. Algorithmia
6. Clarifai
7. DeepAI
8. Imagga
9. Nanonets
10. Sightengine
11. X-Moderator

The following illustrations compare the performance of all the APIs on explicit nudity, suggestive nudity, porn acts, simulated porn and gore, as well as on the categorisation of SFW content:
<table><tr><td><img src='img/api1.png'></td><td><img src='img/api2.png'></td><td><img src='img/api3.png'></td></tr></table>
<table><tr><td><img src='img/api4.png'></td><td><img src='img/api5.png'></td><td><img src='img/api6.png'></td></tr></table>

A [Google spreadsheet](https://docs.google.com/spreadsheets/d/1fEOJfTLmQdtRvllw1e8LXJ6vXAjQQj68iF4gWCJy1JM/edit#gid=0) containing the raw accuracy predictions has also been published by the author.

The author concluded that a general social media application that is geared towards content distribution and wants a balanced classifier would prefer the Nanonets API, as evidenced by its classifier having the highest F1 score.


### 2. [NudeNet: An ensemble of Neural Nets for Nudity Detection and Censoring](https://medium.com/@praneethbedapudi/nudenet-an-ensemble-of-neural-nets-for-nudity-detection-and-censoring-d9f3da721e3)

The author collects data to implement nudity detection using Image Classification. The author uses the following data collected from Reddit, Facebook, the [website scrolller](https://scrolller.com/nsfw), and PornHub:
> 1,78,601 from PornHub, 1,21,644 from Reddit and 1,30,266 from [GantMan’s dataset](https://github.com/GantMan/nsfw_model)

> 68,948 from Facebook, 55,137 from Reddit and 98,359 from GantMan’s dataset

NudeNet’s Detector performs better than Yahoo’s Open NSFW, GantMan’s nsfw_model and NudeNet’s classifier in identifying porn. A source for the code can be found [here](https://github.com/notAI-tech/NudeNet).

The author uses RetinaNet by FAIR for object detection. RetinaNet uses a variation of cross entropy loss called Focal Loss, which is designed to increase the performance of one-stage object detection. The author uses ResNet-50 as the backbone.
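
Focal loss scales the cross entropy of each example by `(1 - p_t) ** gamma`, so that easy, well-classified examples contribute little and training concentrates on hard ones. The sketch below is a plain NumPy version of the binary form with the commonly cited defaults `gamma=2` and `alpha=0.25`; it is an illustration of the formula, not NudeNet's or RetinaNet's actual training code.

```python
import numpy as np

def binary_focal_loss(probs, targets, gamma=2.0, alpha=0.25, eps=1e-7):
    """Per-example focal loss: -alpha_t * (1 - p_t)**gamma * log(p_t)."""
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1.0 - eps)
    targets = np.asarray(targets, dtype=float)
    # p_t is the probability the model assigns to the true class.
    p_t = np.where(targets == 1, probs, 1.0 - probs)
    alpha_t = np.where(targets == 1, alpha, 1.0 - alpha)
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)

# Two positive examples: one easy (p = 0.9) and one hard (p = 0.1).
losses = binary_focal_loss([0.9, 0.1], [1, 1])
print(losses)
```

With `gamma=0` and `alpha=0.5` the expression reduces to half of the ordinary cross entropy, which makes the down-weighting effect of `gamma` easy to compare.
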
<img src='img/nude.png' width=500>

NudeNet has been published as a Python library and can be installed using `pip`. 


Some other useful open-source implementations that I found on GitHub are these repositories:
1. [NSFWJS](https://github.com/infinitered/nsfwjs)
2. [NSFWDetector](https://github.com/lovoo/NSFWDetector)
3. [nsfw_model](https://github.com/GantMan/nsfw_model) 
4. [open_nsfw--](https://github.com/rahiel/open_nsfw--)
5. [nsfw-classification-tensorflow](https://github.com/MaybeShewill-CV/nsfw-classification-tensorflow)



### Conclusion

We see that novel methods are continuously being proposed to make NSFW content classifiers even though most researchers do not take into account a broad and comprehensive dataset in their experiments. 

However, building a classifier end-to-end from scratch is impractical given the time, reproducibility and computational constraints of this project, which is why I find it wise to build our classifiers on pre-trained models.

Inspiration can be taken from the novel, high-performing architectures proposed in the research, such as using attention models with CNNs, a region-of-interest based approach to build a multiple-instance detector, and considering spatio-temporal, audio and motion information when extracting input features from video. These ideas can help us devise a model built on top of pre-trained CNN models to make an NSFW content classifier.