Added details for Data Augmentation with example (#83)
* Added CNN example and details

* Added details for L1 regularization

* Added L1 regularization details and an example

* Added L1 regularization example

* Added L1 regularization example use case

* Added L2 regularization details

* Minor fix in L1 and L2 regularization details

* Added L2 regularization (Ridge regression) details and examples

* Added early stopping regularization details with example code

* Started adding details for Data Augmentation

* Completed details with example for Data Augmentation
backtrack-5 authored and bfortuner committed Dec 4, 2019
1 parent 39f1ff4 commit 115de5c
It is a technique for combating overfitting and improving training.

Data Augmentation
=================

Having more data (more samples) is the best way to obtain better, more consistent estimators (ML models). In the real world, getting a large volume of useful data for training a model is cumbersome, and labelling it is an extremely tedious task.

Labelling usually requires manual annotation. For example, to build a better image classifier we can use Amazon Mechanical Turk to recruit more people to label images, or run surveys on social media asking people to participate and contribute data.
These approaches can yield good datasets, but they are hard to scale and expensive. Training on a small dataset leads to the well-known overfitting problem.

Data Augmentation is an interesting regularization technique that addresses this problem. The concept is simple: generate new training data from the original dataset. Dataset augmentation provides a cheap and easy way to
increase the amount of your training data.

This technique can be applied in both NLP and computer vision (CV).

In CV we can use techniques such as jittering, PCA-based colour augmentation, and flipping. Similarly, in NLP we can use techniques such as synonym replacement, random insertion, random deletion, and word embeddings.
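
To make the CV side concrete, here is a minimal, framework-free sketch of flipping plus pixel jitter on a toy 2-D image. The function name and parameters (``p_flip``, ``jitter``) are illustrative, not from any library:

```python
import random

random.seed(0)

def augment_image(image, p_flip=0.5, jitter=0.05):
    """Return a randomly flipped and jittered copy of a 2-D image.

    `image` is a list of rows of pixel intensities in [0, 1].
    """
    rows = [list(row) for row in image]
    # flip the image horizontally with probability p_flip
    if random.random() < p_flip:
        rows = [row[::-1] for row in rows]
    # add small random noise (jitter) to each pixel, clipped to [0, 1]
    return [
        [min(1.0, max(0.0, px + random.uniform(-jitter, jitter))) for px in row]
        for row in rows
    ]

img = [[0.0, 0.5], [1.0, 0.25]]
augmented = augment_image(img)
```

Each call produces a slightly different image, so one original sample can yield many distinct training samples.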


It is worth knowing that Keras provides an ``ImageDataGenerator`` class for performing data augmentation on images.

Sample code for random deletion:

.. code-block:: python

    import random

    def random_deletion(words, p):
        """
        Randomly delete words from the sentence with probability p
        """
        # if there's only one word, don't delete it
        if len(words) == 1:
            return words

        # keep each word with probability 1 - p
        new_words = []
        for word in words:
            r = random.uniform(0, 1)
            if r > p:
                new_words.append(word)

        # if you end up deleting all words, just return a random word
        if len(new_words) == 0:
            rand_int = random.randint(0, len(words) - 1)
            return [words[rand_int]]

        return new_words

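
A companion operation from the same EDA family is random swap, which shuffles word positions rather than removing words. This is a sketch with an illustrative helper name, not code from any particular library:

```python
import random

def random_swap(words, n):
    """Randomly swap the positions of two words in the sentence, n times."""
    new_words = list(words)
    for _ in range(n):
        # pick two distinct positions and exchange their words
        i, j = random.sample(range(len(new_words)), 2)
        new_words[i], new_words[j] = new_words[j], new_words[i]
    return new_words

random.seed(1)
sentence = ["the", "cat", "sat", "on", "the", "mat"]
shuffled = random_swap(sentence, 2)
```

Unlike random deletion, random swap preserves every word, so the augmented sentence keeps the full vocabulary of the original.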
Furthermore, when comparing two machine learning algorithms, train both on either the augmented or the non-augmented dataset. Otherwise, no objective decision can be made about which algorithm performed better.

.. rubric:: Further reading

- `NLP Data Augmentation <https://arxiv.org/abs/1901.11196>`_
- `CV Data Augmentation <https://arxiv.org/abs/1904.12848>`_
- `Regularization <http://wavelab.uwaterloo.ca/wp-content/uploads/2017/04/Lecture_3.pdf>`_

Dropout
=======
