In this part of the tutorial, I will discuss some common mistakes I saw in previous years' assignments. To do so, I will cover a few topics:
  * Big-O notation
  * Tips about PyTorch
  * More complex topics like GPU optimization

I will also add some links for you to follow if you want to explore these topics in more depth.

# Big O notation
=============================

Wikipedia explains Big-O notation as "a mathematical notation that describes the limiting behavior of a function when the argument tends towards a particular value or infinity." In practice, we use it to describe how the running time of a piece of code grows as its input grows, usually as an upper bound for the worst-case scenario.

In [None]:
import time

def test(x):
    # Run the loop body x * x times and report the elapsed wall-clock time
    start = time.time()
    for i in range(x):
        for j in range(x):
            i * j
    stop = time.time()
    print("time" + str(x) + "is:" + str(stop - start))

Looking at the simple example above, we can see that our `test` function calculates `i*j` x^2 times. When x grows linearly, the computation time grows in proportion to x^2, so we say that this function has a complexity of O(n^2). In the example below, we pass 10000, 20000, and 30000 as x. Doubling x makes the computation take nearly four times as long, and tripling x makes it take nearly nine times as long.

In [None]:
test(10000)
test(20000)
test(30000)

time10000is:7.011929750442505
time20000is:27.872304677963257
time30000is:62.832411766052246
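
Timings depend on the machine, so another way to see the quadratic growth is to count how often the inner loop body runs. The sketch below uses a hypothetical helper, `count_steps`, which is not part of the cell above, and much smaller inputs so that it finishes quickly.

```python
def count_steps(x):
    """Count how many times the body of a double loop over range(x) runs."""
    steps = 0
    for i in range(x):
        for j in range(x):
            steps += 1
    return steps

base = count_steps(100)
for x in (100, 200, 300):
    steps = count_steps(x)
    # Ratio relative to x=100: doubling x gives 4x, tripling x gives 9x
    print(x, steps, steps / base)
```

The ratios are exact here because we count operations instead of measuring time.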


Then how is this going to help us in NLP? Preprocessing steps generally require custom functions. One of the problems I encountered in previous semesters is that some students nested their preprocessing loops (tokenization, etc.) inside each other. Although code like this may not cause problems while the dataset is small, each extra level of nesting multiplies the amount of work, so your preprocessing time grows very quickly with the size of the dataset. Let's look at the IMDB dataset, for example.
* The IMDB dataset is a relatively small dataset for sentiment analysis, with `100,000` lines in the train and test sets combined. Okay, `100,000` is no problem; you have probably dealt with image sets this big before.
* Each line contains `250` tokens on average, with a maximum of around 1800. Taking the average, we have `25m` tokens in total. It is big, but we can still handle it.
* Each word has `8` characters on average, which makes `200m` characters.
* The tricky part is that we want to remove punctuation from these characters. With `14` different punctuation characters, it takes `2.8bn` comparisons to check whether each character is punctuation or not.

Or we could remove punctuation after tokenization, which reduces the work to `350m` comparisons (`25m` tokens checked against `14` punctuation characters).

Although removing punctuation is a simple step, doing it at the wrong stage increases the total number of comparisons from 350m to 2.8bn (i.e., an eight-fold increase).
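
The counting argument above can be tried on a toy sentence. This is only a sketch under a few assumptions: the 14 punctuation characters in `PUNCT` are my own pick (the list is not given here), the regular expression is a stand-in for a real tokenizer that splits punctuation into separate tokens, and the check counts are worst-case figures for comparing against every punctuation character one by one.

```python
import re

# Hypothetical choice of 14 punctuation characters
PUNCT = ".,;:!?\"'()[]-/"
PUNCT_SET = set(PUNCT)

text = "This movie was great, really great! Would I watch it again? Yes."

# Approach 1: look at every character of the raw text
char_checks = len(text) * len(PUNCT)
cleaned_chars = "".join(ch for ch in text if ch not in PUNCT_SET)

# Approach 2: tokenize first (words and single punctuation marks become
# tokens), then look at every token
tokens = re.findall(r"\w+|[^\w\s]", text)
token_checks = len(tokens) * len(PUNCT)
cleaned_tokens = [tok for tok in tokens if tok not in PUNCT_SET]

print("character-level checks:", char_checks)
print("token-level checks:", token_checks)
print(cleaned_tokens)
```

Using a `set` for the membership test also helps: a lookup takes roughly constant time, so the factor of 14 disappears in practice.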

You can check the following tutorial or other sources online for more information.
* https://www.geeksforgeeks.org/analysis-algorithms-big-o-analysis/

# GPU and PyTorch
======================================

If you recall from previous tutorials, PyTorch and other deep learning libraries use Nvidia's CUDA platform, which lets them run their computations efficiently on the GPU.

Although you can optimize your code with CPU optimization methods, you need to learn more about GPU architecture to optimize it for the GPU. This is out of scope for this course, but you can check the classes given at the MMI department if you want to learn more about GPU optimization.

These are some of the things you need to consider when writing your code:

* Copying data from/to the GPU takes a lot of time. Try to avoid it as much as possible (for example, calling `print(tensor)` while the tensor is on the CUDA device copies it back to the CPU).

* Try to minimize operations. For example, use `torch.no_grad()` when you do not need gradients, or merge pointwise operations. In addition, bigger models do not always give better results, but they will increase your computation time.

* Try to utilize the GPU efficiently. The easiest way to do so is to increase your batch size. If your batch size is small, your model will use only a fraction of the GPU's resources. On the other hand, if your batch size is too large, you will run out of GPU memory, since the memory needed grows with both the model size and the batch size.
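
The first two points can be sketched as follows. The tiny `nn.Linear` model and the random batches are hypothetical stand-ins for a real model and data loader, and the code falls back to the CPU when no GPU is available.

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical tiny model and random batches, for illustration only
model = nn.Linear(10, 2).to(device)
batches = [torch.randn(32, 10, device=device) for _ in range(5)]

model.eval()
running_sum = torch.zeros((), device=device)

# no_grad() skips building the autograd graph, saving memory and compute
with torch.no_grad():
    for batch in batches:
        out = model(batch)
        # Accumulate on the device; calling print(out) or out.item() here
        # would force a device-to-host copy on every step
        running_sum += out.sum()

# Single transfer to the CPU at the end
total = running_sum.item()
print(total)
```

Keeping the running value on the device and reading it once at the end is usually enough for logging purposes.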

In [1]:
import torch

In [None]:
torch.zeros(size=(1000, 1000, 1000));

In [None]:
torch.zeros(size=(1000, 1000, 1000, 1000));  # needs about 4 TB as float32, your colab will crash
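
Before allocating a large tensor, it can help to estimate its size. The sketch below uses a hypothetical helper, `tensor_bytes`, and assumes 4 bytes per element, which matches PyTorch's default `float32` dtype. It needs neither a GPU nor PyTorch itself.

```python
import math

def tensor_bytes(shape, bytes_per_element=4):
    """Memory needed for a dense tensor; 4 bytes per element matches float32."""
    return math.prod(shape) * bytes_per_element

for shape in [(1000, 1000, 1000), (1000, 1000, 1000, 1000)]:
    gib = tensor_bytes(shape) / 1024**3
    print(shape, f"{gib:,.1f} GiB")
```

The first tensor needs about 4 GB, which fits in the memory of a typical machine, while the second needs about 4 TB, which does not.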

You can also reduce the size of your model with methods like pruning or quantization. For more information, you can check the Machine Learning Systems Design and Deployment course given at the MMI department.

Recommended Tutorials
* [PyTorch optimization guide](https://pytorch.org/tutorials/recipes/recipes/tuning_guide.html)
* [Multi-GPU parallel computing guide](https://pytorch.org/tutorials/intermediate/model_parallel_tutorial.html)
* [Tutorial on Medium](https://medium.com/sicara/deep-learning-memory-usage-and-pytorch-optimization-tricks-e9cab0ead93)


# Weight Initialization
========================================

The last topic I want to talk about is weight initialization. By default, PyTorch initializes the weights of layers such as `nn.Linear` from a uniform distribution whose limits depend on the number of input features. If you want to change this, you can use the [torch.nn.init](https://pytorch.org/docs/stable/nn.init.html) module.

One of the initialization methods is Xavier initialization. This method initializes a layer so that the variance of the signal stays roughly constant from layer to layer.

In [4]:
x = torch.empty(3,6)
torch.nn.init.xavier_normal_(x)

tensor([[-0.3283, -0.1985,  0.6296, -0.3714,  0.0924, -0.7093],
        [-0.8120, -1.0073,  0.9944, -0.5294,  0.7641,  0.0185],
        [-0.5995,  0.0640, -0.6631,  0.1494, -0.4949,  0.2794]])
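
As a rough check of what `xavier_normal_` does: with the default gain of 1, it draws values from a normal distribution with standard deviation `sqrt(2 / (fan_in + fan_out))`. The sketch below compares that formula with the empirical standard deviation. The larger shape and the fixed seed are my own choices, made so that the sample statistic is stable.

```python
import math
import torch

torch.manual_seed(0)

# A larger tensor than in the cell above, so the sample statistic is stable
fan_out, fan_in = 300, 600
w = torch.empty(fan_out, fan_in)
torch.nn.init.xavier_normal_(w)

expected_std = math.sqrt(2.0 / (fan_in + fan_out))
print("expected std:", expected_std)
print("empirical std:", w.std().item())
```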

Kaiming initialization, on the other hand, initializes the layer while taking the non-linearity of ReLU layers into account.

In [5]:
torch.nn.init.kaiming_uniform_(x)

tensor([[ 0.7769,  0.1460, -0.7329,  0.9230,  0.2007, -0.9080],
        [ 0.6273,  0.8227,  0.7985, -0.5005, -0.3230, -0.3344],
        [-0.7048, -0.7225,  0.1837, -0.5248,  0.7227, -0.8803]])
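
With its default arguments, `kaiming_uniform_` samples from a uniform range `[-bound, bound]`, where `bound = gain * sqrt(3 / fan_in)` and the gain for ReLU is `sqrt(2)`. The sketch below recomputes that bound by hand for a tensor of the same shape as above. The fixed seed is my own addition.

```python
import math
import torch

torch.manual_seed(0)

x = torch.empty(3, 6)
torch.nn.init.kaiming_uniform_(x)  # defaults: a=0, mode="fan_in"

fan_in = x.size(1)                       # 6 input features
gain = math.sqrt(2.0)                    # gain used for ReLU
bound = gain * math.sqrt(3.0 / fan_in)   # uniform range is [-bound, bound]

print("bound:", bound)
print("largest absolute value:", x.abs().max().item())
```

For a fan-in of 6 the bound works out to 1, which is why all values in the output above lie between -1 and 1.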

Okay, so we know how to change the weights. The main question is why we need to change them. A good initialization can prevent vanishing or exploding gradient problems. In addition, with a different initialization method, the model can converge to a better solution.
You can also check [Weight Initialization Techniques in Neural Networks](https://towardsdatascience.com/weight-initialization-techniques-in-neural-networks-26c649eb3b78) for further details.
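
In practice, you usually want to initialize the layers of a whole model rather than a single tensor. One common pattern is sketched below. The helper `init_weights` and the two-layer model are hypothetical, and choosing Xavier for the weights and zeros for the biases is only one possible option.

```python
import torch
import torch.nn as nn

def init_weights(module):
    """Hypothetical helper: Xavier for weights, zeros for biases of linear layers."""
    if isinstance(module, nn.Linear):
        nn.init.xavier_uniform_(module.weight)
        nn.init.zeros_(module.bias)

# Hypothetical two-layer model, for illustration only
model = nn.Sequential(nn.Linear(20, 10), nn.ReLU(), nn.Linear(10, 2))
model.apply(init_weights)  # apply() visits every submodule recursively

print(model[0].bias)
```

Because `apply()` walks through all submodules, the same helper also works for nested models.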