# Note 01 for NLP Study
**Author: Yuxi Zhou** <br>
Start from: 2021-10-04 21:01:50

<h2>Note about the time stamp: </h2>

*Settings - Live Template* <br>
[Tutorial of Quick Insert Time](https://www.hhtjim.com/the-idea-of-quick-insert-the-current-time.html)

<h2>Machine Learning: Overfitting and Underfitting</h2>

[Overfitting and Underfitting With Machine Learning Algorithms](https://machinelearningmastery.com/overfitting-and-underfitting-with-machine-learning-algorithms/)

<h3>Overfitting in Machine Learning </h3>

> Overfitting refers to a model that **models the training data too well**.
>
> Overfitting happens when a model learns the detail and noise in the training data to the extent that it negatively impacts the performance of the model on new data. This means that the noise or random fluctuations in the training data are picked up and learned as concepts by the model. The problem is that these concepts do not apply to new data and negatively impact the model's ability to generalize.
>
> Overfitting is more likely with nonparametric and nonlinear models that have more flexibility when learning a target function. As such, many nonparametric machine learning algorithms also include parameters or techniques to limit and constrain how much detail the model learns.
>
> For example, decision trees are a nonparametric machine learning algorithm that is very flexible and is subject to overfitting training data. This problem can be addressed by pruning a tree after it has learned in order to remove some of the detail it has picked up.


<h3>Underfitting in Machine Learning</h3>

> Underfitting refers to a model that **can neither model the training data nor generalize to new data**.
>
> An underfitting machine learning model is not a suitable model and will be obvious as it will have poor performance on the training data.
>
> Underfitting is often not discussed as it is easy to detect given a good performance metric. The remedy is to move on and try alternate machine learning algorithms. Nevertheless, it does provide a good contrast to the problem of overfitting.
>
> Underfitting occurs when a **model is too simple** — informed by too few features or regularized too much — which makes it inflexible in learning from the dataset. Simple learners tend to have less variance in their predictions but more bias towards wrong outcomes.

<h3>How do you know if you are Overfitting or Underfitting?</h3>

> - **Overfitting** is when the model's error on the training set (i.e. during training) is very low but its error on the test set (i.e. on unseen samples) is high <br> (**Low train error, High test error**)
>
> - **Underfitting** is when the model's error on both the training and test sets (i.e. during training and testing) is very high <br> (**High train error, High test error**)
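
The rule of thumb above can be sketched as a small helper. This is only an illustration: the function name `diagnose_fit` and both thresholds (`high_error`, `max_gap`) are assumptions chosen for the example, not standard values, and sensible cut-offs depend on the problem.

```python
def diagnose_fit(train_error, test_error, high_error=0.30, max_gap=0.10):
    """Classify a model's fit from its train and test error rates.

    Both thresholds are illustrative assumptions, not standard values.
    """
    if train_error >= high_error:
        # Poor even on data the model has already seen
        return "underfitting"
    if test_error - train_error > max_gap:
        # Good on seen data, much worse on unseen data
        return "overfitting"
    return "reasonable fit"


print(diagnose_fit(train_error=0.02, test_error=0.35))  # overfitting
print(diagnose_fit(train_error=0.40, test_error=0.42))  # underfitting
print(diagnose_fit(train_error=0.08, test_error=0.11))  # reasonable fit
```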

<h3>How do I fix Underfitting problems</h3>

> Below are a few techniques that can be used to reduce Underfitting:
>
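> 1. Decrease regularization. Regularization is typically used to reduce a model's variance by penalizing input parameters with large coefficients; when the model underfits, relaxing that penalty lets it fit the data more closely.
> 2. Increase the duration of training.
> 3. Feature selection: add more features, or more informative ones, so the model has enough signal to learn from.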

<h3>How do I fix Overfitting problems</h3>

> Below are a few techniques that can be used to reduce Overfitting:
>
> 1. **Reduce the network’s capacity** by removing layers or reducing the number of elements in the hidden layers
> 2. Apply **regularization**, which comes down to adding a cost to the loss function for large weights
> 3. Use **Dropout layers**, which will randomly remove certain features by setting them to zero

<h3>Regularization</h3>

Regularization adds a penalty term to the error function so that the fitted function does not fluctuate excessively; the penalty keeps the coefficients from taking extreme values.

L1 regularization penalizes the sum of the absolute values of the weights. It tends to drive some weights to exactly zero, so it is often used to reduce the number of features in a high-dimensional dataset.

L2 regularization penalizes the sum of the squared weights. It shrinks all weights toward zero without usually eliminating any of them, spreading the penalty across all the weights.

<h2>Regularization in Logistic Regression</h2>

- [Does Regularization in Logistic Regression Always Results in Better Fit and Better Generalization](https://sebastianraschka.com/faq/docs/regularized-logistic-regression-performance.html)
> Now, if we regularize the cost function (e.g., via L2 regularization), we add an additional term to our cost function (J) that increases as the values of the parameter weights (w) increase; keep in mind that with regularization we add a new hyperparameter, lambda, to control the regularization strength.
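
As a rough sketch of the quote above, the cost J can be written as the mean cross-entropy plus an L2 term scaled by lambda. The function name `l2_regularized_cost`, the `lam / 2` scaling (one common convention among several), and the toy data are all assumptions made for this example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def l2_regularized_cost(weights, bias, X, y, lam):
    """Mean cross-entropy plus (lam / 2) * sum(w^2); the bias is not penalized."""
    total = 0.0
    for features, label in zip(X, y):
        z = bias + sum(w * x for w, x in zip(weights, features))
        p = sigmoid(z)
        total += -(label * math.log(p) + (1 - label) * math.log(1 - p))
    data_cost = total / len(y)
    penalty = (lam / 2.0) * sum(w * w for w in weights)
    return data_cost + penalty


# Made-up toy data: three samples with two features each
X = [[0.5, 1.0], [1.5, -0.5], [-1.0, 0.3]]
y = [1, 1, 0]

for w in ([0.5, 0.5], [5.0, 5.0]):
    unregularized = l2_regularized_cost(w, 0.0, X, y, lam=0.0)
    regularized = l2_regularized_cost(w, 0.0, X, y, lam=1.0)
    # The gap is the penalty term, which grows with the size of the weights
    print(w, regularized - unregularized)
```

With `lam=1.0` the penalty is 0.25 for the small weights and 25.0 for the large ones, so large weights are only worth it if they reduce the data term by more than that.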



- [Overfitting vs. Underfitting](https://towardsdatascience.com/overfitting-vs-underfitting-a-complete-example-d05dd7e19765)
- [Underfitting and Overfitting in machine learning and how to deal with it](https://towardsdatascience.com/underfitting-and-overfitting-in-machine-learning-and-how-to-deal-with-it-6fe4a8a49dbf#:~:text=Underfitting%20occurs%20when%20a%20model,more%20bias%20towards%20wrong%20outcomes.)
- [Is your model overfitting? Or maybe underfitting?](https://towardsdatascience.com/is-your-model-overfitting-or-maybe-underfitting-an-example-using-a-neural-network-in-python-4faf155398d2)
- [Techniques that can Reduce Underfitting](https://www.ibm.com/cloud/learn/underfitting)

<h2>Bias and Variance</h2>

- [Bias/Variance and Model Selection](http://www.cs.cornell.edu/courses/cs4780/2015fa/web/lecturenotes/lecturenote13.html)

[StackOverflow: Can't instantiate abstract class with abstract methods](https://stackoverflow.com/questions/31457855/cant-instantiate-abstract-class-with-abstract-methods)


<h2>How to deal with Structured Data</h2>

<h3>Categorical Data</h3>

>There are many ways to encode categorical data:
>
> - **Integer Encoding**: Each unique label is mapped to an integer (but this raises the problem that the model may read "1" as larger or more important than "0", an ordering the categories do not actually have)
> - **OneHot Encoding**: Each label is mapped to a binary vector
> - **Learned Embedding**: A distributed representation of the categories is learned.
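
The first two encodings can be sketched in plain Python as follows. The helper names `integer_encode` and `one_hot_encode` and the colour labels are made up for this example; learned embeddings are left out because they need a trained model.

```python
def integer_encode(labels):
    """Map each unique label to an integer, in sorted label order."""
    mapping = {label: index for index, label in enumerate(sorted(set(labels)))}
    return [mapping[label] for label in labels], mapping


def one_hot_encode(labels):
    """Map each label to a binary vector containing a single 1."""
    codes, mapping = integer_encode(labels)
    size = len(mapping)
    return [[1 if position == code else 0 for position in range(size)] for code in codes]


colors = ["red", "green", "blue", "green"]
codes, mapping = integer_encode(colors)
print(mapping)                 # {'blue': 0, 'green': 1, 'red': 2}
print(codes)                   # [2, 1, 0, 1]
print(one_hot_encode(colors))  # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```

The one-hot vectors are all the same distance apart, which avoids the false ordering that the integer codes suggest.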

<h2>Structured Data vs Unstructured Data</h2>

<h3>Structured Data</h3> | <h3>Unstructured Data</h3>
--- | ---
Structured data is quantitative and is often displayed as **numbers**, **dates**, **values**, and **strings** | Unstructured data is qualitative data and includes **text**, **video**, **audio**, **images**, and more
Structured data is stored in **rows and columns** | Unstructured data is stored as **text**, **audio** and **video files**, or **NoSQL databases**.
Stored in **data warehouses** | Stored in **applications**, **NoSQL (non-relational) databases**, **data lakes**, and **data warehouses**.
Easy to analyze with tools like Excel | Hard to analyze without AI tools

 - [Structured vs. Unstructured Data](https://monkeylearn.com/blog/structured-data-vs-unstructured-data/)
 

<h2>Python Set Doc String</h2>

- [StackOverflow: Set Docstring](https://stackoverflow.com/questions/4056983/how-do-i-programmatically-set-the-docstring)
- [Docstrings in Python](https://www.datacamp.com/community/tutorials/docstrings-python)


<h2>Python pip list freeze</h2>

- [How to install Python packages with pip and requirements.txt](https://note.nkmk.me/en/python-pip-list-freeze/)
> `$ pip freeze > requirements.txt`
>
> `$ pip install -r requirements.txt`

<h2>Data Structure: well suited to efficiently implement a Priority Queue</h2>

- [Priority Queue](https://algs4.cs.princeton.edu/24pq/#:~:text=The%20binary%20heap%20is%20a,at%20two%20other%20specific%20positions.)
- [Google Search Results](https://www.google.com/search?q=which+data+structure+is+well+suited+to+efficiently+implement+a+priority+queue&oq=which+data+structure+is+well+suited+to+efficiently+implement+a+priority+queue&aqs=chrome..69i57j0i512.14033j0j7&sourceid=chrome&ie=UTF-8)

**The binary heap** is a data structure that can efficiently support the basic priority-queue operations. In a binary heap, the items are stored in an array such that each key is guaranteed to be larger than (or equal to) the keys at two other specific positions.
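
A minimal sketch using Python's standard `heapq` module, which keeps a binary heap inside a plain list. Note one difference from the description above: `heapq` is a min-heap, so each key is smaller than (or equal to) the keys at its two child positions, and the smallest item comes out first. The task data is made up.

```python
import heapq

# (priority, task) pairs; a lower number means more urgent (illustrative data)
tasks = []
for item in [(2, "write report"), (1, "fix outage"), (3, "refill coffee")]:
    heapq.heappush(tasks, item)

# Heap property for a min-heap stored in a list:
# the item at index i is <= the items at indices 2*i + 1 and 2*i + 2
heap_ok = all(
    tasks[i] <= tasks[child]
    for i in range(len(tasks))
    for child in (2 * i + 1, 2 * i + 2)
    if child < len(tasks)
)
print(heap_ok)  # True

# Popping repeatedly returns the items in priority order
order = [heapq.heappop(tasks)[1] for _ in range(len(tasks))]
print(order)  # ['fix outage', 'write report', 'refill coffee']
```

Both `heappush` and `heappop` take O(log n) time, which is what makes the heap a good fit for a priority queue.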

<h2>Solving Maze as Represented in ASCII</h2>

- [Google Search Results](https://www.google.com/search?q=suppose+you+need+to+wirte+a+program+to+solve+a+maze+as+represented+in+ASCII&oq=suppose+you+need+to+wirte+a+program+to+solve+a+maze+as+represented+in+ASCII&aqs=chrome..69i57j0i13.26368j0j7&sourceid=chrome&ie=UTF-8)
- [Graphs - Solving a Maze](https://inginious.org/course/competitive-programming/graphs-maze)
- [Google Search Results: Priority Queue Solving Maze](https://www.google.com/search?q=priority+queue+soving+maze&oq=priority+queue+soving+maze&aqs=chrome..69i57j33i10i160l3.11043j0j7&sourceid=chrome&ie=UTF-8)


<h2>Greedy Algorithm</h2>

- [Google Search Results](https://www.google.com/search?q=example+of+greedy+algorithm&oq=example+of+greedy+algorithm&aqs=chrome..69i57j0i512j0i20i263i512j0i22i30l5j0i390.14051j0j7&sourceid=chrome&ie=UTF-8)

>Is Dijkstra a Greedy Algorithm? <br>
> Yes. It is a greedy algorithm that solves the single-source shortest path problem for a weighted graph (directed or undirected) whose edge weights are non-negative.

**Top 7 Greedy Algorithm Problems**
- Activity Selection Problem.
- Graph Coloring Problem.
- Job Sequencing Problem with Deadlines.
- Find minimum platforms needed to avoid delay in the train arrival.
- Huffman Coding Compression Algorithm.
- Single-Source Shortest Paths — Dijkstra's Algorithm.
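
As one example from the list, here is a small sketch of Dijkstra's algorithm. The greedy step is that it always settles the unsettled node with the smallest known distance, which a `heapq` priority queue makes cheap. The adjacency-dict graph format and the sample graph are assumptions for this example, and edge weights must be non-negative.

```python
import heapq


def dijkstra(graph, source):
    """Return shortest distances from source; graph maps node -> {neighbor: weight}."""
    distances = {source: 0}
    frontier = [(0, source)]
    settled = set()
    while frontier:
        # Greedy choice: take the closest node that is not settled yet
        dist, node = heapq.heappop(frontier)
        if node in settled:
            continue  # stale queue entry, a shorter path was already found
        settled.add(node)
        for neighbor, weight in graph.get(node, {}).items():
            candidate = dist + weight
            if candidate < distances.get(neighbor, float("inf")):
                distances[neighbor] = candidate
                heapq.heappush(frontier, (candidate, neighbor))
    return distances


# Made-up directed graph
graph = {
    "A": {"B": 4, "C": 1},
    "B": {"D": 1},
    "C": {"B": 2, "D": 5},
    "D": {},
}
print(dijkstra(graph, "A"))  # {'A': 0, 'B': 3, 'C': 1, 'D': 4}
```

The direct edge A to B costs 4, but going through C costs 1 + 2 = 3, so the detour wins.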


<h2>Priority Queue</h2>

- [Priority Queue using Linked List](https://www.geeksforgeeks.org/priority-queue-using-linked-list/)

<h2>Array Sorting</h2>

- [Sort an almost sorted array where only two elements are swapped](https://www.geeksforgeeks.org/sort-an-almost-sorted-array-where-only-two-elements-are-swapped/)
- [What is the best sorting algorithm for an almost sorted array?](https://www.educative.io/edpresso/what-is-the-best-sorting-algorithm-for-an-almost-sorted-array)
- [Sorting Algorithms: Slowest to Fastest with Time Complexity](https://medium.com/javarevisited/sorting-algorithms-slowest-to-fastest-a9f0e30937b9)

> What does it mean for a sorting algorithm to be stable? <br>
> A sorting algorithm is stable if it **preserves the relative order of elements with equal keys** <br>
> Reference: [Stable Sorting Algorithms](https://cs.smu.ca/~porter/csc/common_341_342/notes/sorts_stable.html)
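
A quick way to see stability is to sort records by one field and check that ties keep their original order. Python's built-in `sorted` is guaranteed to be stable; the records below are made up.

```python
# (name, count) records; several share the same count
records = [("apple", 3), ("pear", 1), ("fig", 3), ("kiwi", 1)]

# Sort by count only. Records with equal counts keep their input order:
# "pear" stays before "kiwi", and "apple" stays before "fig".
by_count = sorted(records, key=lambda record: record[1])
print(by_count)  # [('pear', 1), ('kiwi', 1), ('apple', 3), ('fig', 3)]
```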

<h2>Hash Table: Bucket</h2>

- [StackOverflow: Hashtable stores multiple items with same hash in one bucket](https://stackoverflow.com/questions/26287852/java-hashtable-stores-multiple-items-with-same-hash-in-one-bucket)

**Problem Description**
>While reading the Oracle documentation about Hashtable, I found that "in the case of a "**hash collision**", a single bucket stores multiple entries, which must be searched sequentially". I tried to find a method that returns those entries sequentially when two items have the same hash, but could not find one in the documentation. To reproduce the situation I wrote some simple code (see the linked question).
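
The idea of one bucket holding several entries can be sketched with a toy hash table that uses separate chaining. This is a Python illustration, not Java's `Hashtable`; the class name `ChainedHashTable` and the tiny bucket count are assumptions chosen to force collisions.

```python
class ChainedHashTable:
    """Toy hash table: each bucket is a list of (key, value) pairs."""

    def __init__(self, bucket_count=4):
        self.buckets = [[] for _ in range(bucket_count)]

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for index, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[index] = (key, value)  # overwrite an existing key
                return
        bucket.append((key, value))  # a collision simply extends the chain

    def get(self, key):
        # Sequential search through the entries that share this bucket
        for existing_key, value in self._bucket(key):
            if existing_key == key:
                return value
        raise KeyError(key)


table = ChainedHashTable(bucket_count=4)
table.put(1, "one")
table.put(5, "five")  # 5 % 4 == 1, so this lands in the same bucket as key 1
table.put(9, "nine")  # 9 % 4 == 1 as well
print(table.buckets[1])  # [(1, 'one'), (5, 'five'), (9, 'nine')]
print(table.get(5))      # five
```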

<h2>Asymptotically Tight Bound</h2>

[Big-θ (Big-Theta) Notation](https://www.khanacademy.org/computing/computer-science/algorithms/asymptotic-notation/a/big-big-theta-notation)
>When we use **big-Θ notation**, we're saying that we have an asymptotically tight bound on the running time. "Asymptotically" because it matters for only large values of n. "Tight bound" because we've nailed the running time to within a constant factor above and below.

<h2>Find Unique Array Element</h2>

- [Problem to find unique array element](https://codepumpkin.com/find-unique-array-element/)

    l = [0, 0, 1, 0, 0, 1, 1, 1, 2, 2, 3, 2, 3]


    def counter(items):
        # walk the list and count how often each item occurs
        count = dict()
        for item in items:
            if item in count:
                count[item] += 1
            else:
                count[item] = 1
        return count


    print(counter(l))

    # an alternative way just for fun: list.count rescans the list for every value
    result = {value: l.count(value) for value in sorted(set(l))}
    print(result)

<h2>Logarithm Time Multiplication Table</h2>


- [Google Search Results](https://www.google.com/search?q=python+logarithm+time+multiplication+table&sxsrf=AOaemvKZ6f1EKTxbZB3YzLupljEqtEx75A%3A1633018724391&ei=ZONVYZ6zF4zAgweH0oeIBw&oq=python+logarithm+time+multiplication+table&gs_lcp=Cgdnd3Mtd2l6EAM6BwgAEEcQsANKBAhBGABQ-IcBWM-0AWC8twFoAXACeACAAe8CiAGDGZIBAzMtOZgBAKABAcgBCMABAQ&sclient=gws-wiz&ved=0ahUKEwie9ennjKfzAhUM4OAKHQfpAXEQ4dUDCA4&uact=5)

<h2>Data Structures: Questions</h2>

>[Which data structure is used in redo-undo feature?](https://www.geeksforgeeks.org/data-structures-misc-question-1/) <br>
>Answer: **Stack** <br>
>Explanation: A stack is the most suitable data structure for an undo-redo feature because it works in LIFO (last in, first out) order: the most recent action is the first one to be undone.
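
One common way to build this uses two stacks, one for undo and one for redo. The sketch below is only an illustration; the `TextEditor` class and its method names are hypothetical.

```python
class TextEditor:
    """Toy editor that stores earlier states on an undo stack and a redo stack."""

    def __init__(self):
        self.text = ""
        self.undo_stack = []
        self.redo_stack = []

    def write(self, new_text):
        self.undo_stack.append(self.text)  # remember the state before the edit
        self.text += new_text
        self.redo_stack.clear()            # a fresh edit invalidates redo history

    def undo(self):
        if self.undo_stack:
            self.redo_stack.append(self.text)
            self.text = self.undo_stack.pop()  # last in, first out

    def redo(self):
        if self.redo_stack:
            self.undo_stack.append(self.text)
            self.text = self.redo_stack.pop()


editor = TextEditor()
editor.write("Hello")
editor.write(" world")
editor.undo()
print(editor.text)  # Hello
editor.redo()
print(editor.text)  # Hello world
```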

<h2>Machine Learning: The Goal of Cross Validation</h2>

> The purpose of cross–validation is to **test the ability of a machine learning model to predict new data**. It is also used to flag problems like overfitting or selection bias and gives insights on how the model will generalize to an independent dataset. <br>
> Reference: [What is Cross Validation in Machine Learning](https://www.mygreatlearning.com/blog/cross-validation/#:~:text=The%20purpose%20of%20cross%E2%80%93validation,generalize%20to%20an%20independent%20dataset.)
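
A minimal sketch of how k-fold cross-validation splits the data: every sample is used for testing exactly once and for training in all other folds. The function name `k_fold_indices` is made up for this example, and real use would normally shuffle the data first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


for train, test in k_fold_indices(n_samples=6, k=3):
    print(train, test)
# [2, 3, 4, 5] [0, 1]
# [0, 1, 4, 5] [2, 3]
# [0, 1, 2, 3] [4, 5]
```

Training on each `train` split and scoring on the matching `test` split gives k estimates of how the model performs on data it has not seen.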

<h2>Logarithm Transformation in Linear Regression Models</h2>

> Problem: <br>
> Taking the logarithm of a feature before performing regression is most beneficial when that feature...
>
> [Logarithmic Transformation in Linear Regression Models: Why & When](https://dev.to/rokaandy/logarithmic-transformation-in-linear-regression-models-why-when-3a7c)
>
> Using the logarithm of one or more variables improves the fit of the model by **transforming the distribution of the features to a more normally-shaped bell curve**.
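
A small sketch of the effect, using made-up values that span several orders of magnitude. The `skewness` helper (population skewness) is written out for this example; a value near zero means the distribution is roughly symmetric.

```python
import math


def skewness(values):
    """Population skewness: third central moment divided by variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    third_moment = sum((v - mean) ** 3 for v in values) / n
    return third_moment / variance ** 1.5


# Made-up, heavily right-skewed feature (each value is 10x the previous one)
incomes = [1, 10, 100, 1000, 10000]
log_incomes = [math.log(v) for v in incomes]

print(skewness(incomes))      # clearly positive: a long right tail
print(skewness(log_incomes))  # approximately zero: evenly spaced after the log
```

The transform only applies to strictly positive values; a feature containing zeros or negatives needs a shift or a different transform.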


<h2>Linear Algebra: Projection onto a Subspace</h2>

> [Real Euclidean Vector Spaces](https://www.cliffsnotes.com/study-guides/algebra/linear-algebra/real-euclidean-vector-spaces/projection-onto-a-subspace)
>
> Problem: Orthogonal

<h2>Recommender System</h2>

> [Collaborative Filtering](https://en.wikipedia.org/wiki/Collaborative_filtering)
>
> A collaborative filtering system generates recommendations using only information about the rating profiles of different users or items. By locating peer users or items whose rating history is similar to that of the current user or item, it generates recommendations from this neighborhood. Collaborative filtering methods are classified as memory-based or model-based. A well-known example of a memory-based approach is the user-based algorithm.

<h2>Resample Time Series Data</h2>

> Question: Why resample time series data?
>
> [How To Resample and Interpolate Your Time Series Data With Python](https://machinelearningmastery.com/resample-interpolate-time-series-data-python/#:~:text=There%20are%20perhaps%20two%20main,you%20want%20to%20make%20predictions.) <br>
> There are perhaps two main reasons why you may be interested in resampling your time series data: <br>
> 1. **Problem Framing**:  Resampling may be required if your data is not available at the same frequency that you want to make predictions.
> 2. **Feature Engineering**: Resampling can also be used to provide additional structure or insight into the learning problem for supervised learning models.
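
The first reason can be sketched with a simple downsampling step: readings taken every 12 hours are averaged into one value per day. The linked article uses pandas; this version sticks to the standard library, and the function name `downsample_daily_mean` and the readings are made up.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def downsample_daily_mean(observations):
    """Group (timestamp, value) pairs by calendar day and average each group."""
    buckets = defaultdict(list)
    for timestamp, value in observations:
        buckets[timestamp.date()].append(value)
    return {day: sum(values) / len(values) for day, values in sorted(buckets.items())}


start = datetime(2021, 10, 4)
# Made-up readings every 12 hours: 10.0, 12.0, 14.0, 16.0
readings = [(start + timedelta(hours=12 * i), 10.0 + 2.0 * i) for i in range(4)]

daily = downsample_daily_mean(readings)
for day, mean in daily.items():
    print(day, mean)
# 2021-10-04 11.0
# 2021-10-05 15.0
```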

<h2>Note about Markdown Syntax</h2>

- [Basic Syntax](https://www.markdownguide.org/basic-syntax/)

<h2>Machine Learning Models: Labelled Data</h2>

> Does AutoEncoder need labelled data? <br>
> AutoEncoders are considered an unsupervised learning technique since they don't need explicit labels to train on. But to be more precise they are self-supervised because they generate their own labels from the training data.
>
> KNN is a **Supervised Learning Algorithm** <br>
> A supervised machine learning algorithm relies on labelled input data to learn a function that produces an appropriate output when given new, unlabelled data. KNN fits this description: it predicts the label of a query point from the labels of its k nearest training points.
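
A minimal KNN classifier makes the need for labels concrete: without `train_labels` there is nothing to vote on. The function name `knn_predict` and the toy points are assumptions for this sketch, and `math.dist` requires Python 3.8 or later.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the majority label among the k training points closest to query."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Made-up labelled training data: two well-separated clusters
points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(points, labels, query=(1.5, 1.5)))  # a
print(knn_predict(points, labels, query=(6.5, 6.5)))  # b
```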

<h2>N-grams Query for Auto-complete</h2>

- [Implementing Auto-complete in Elasticsearch: N-grams](https://www.learningstuffwithankit.dev/implementing-auto-complete-functionality-in-elasticsearch-part-ii-n-grams)

<h2>Mean, Variance and Standard Deviation</h2>

> How to compute the mean or variance of a streaming dataset that is too large to fit in memory:
> - [Streaming Mean and Variance Computation](http://www.nowozin.net/sebastian/blog/streaming-mean-and-variance-computation.html)
> - [StackOverflow: Calculating mean and standard deviation](https://stackoverflow.com/questions/1174984/how-to-efficiently-calculate-a-running-standard-deviation)
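
One well-known approach is Welford's online algorithm, which updates the mean and variance one value at a time and keeps only three numbers in memory. The class name `RunningStats` is made up for this sketch, and the sample data is illustrative.

```python
class RunningStats:
    """Welford's online algorithm for a running mean and variance."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, value):
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def variance(self):
        """Sample variance (divides by n - 1)."""
        return self._m2 / (self.count - 1) if self.count > 1 else 0.0

    @property
    def std(self):
        return self.variance ** 0.5


stats = RunningStats()
for value in [2, 4, 4, 4, 5, 5, 7, 9]:  # could just as well be a generator or a file
    stats.update(value)

print(stats.mean)      # 5.0
print(stats.variance)  # about 4.571 (32 / 7)
```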

<h2>Gradient Boosting Tree vs. Random Forest</h2>

- [Cross Validated: Gradient boosting tree vs random forest](https://stats.stackexchange.com/questions/173390/gradient-boosting-tree-vs-random-forest)
> error = bias² + variance (plus irreducible noise)
>
> - Boosting is based on **weak** learners (high bias, low variance). In terms of decision trees, weak learners are shallow trees, sometimes even as small as decision stumps (trees with two leaves). Boosting reduces error mainly by reducing bias (and also to some extent variance, by aggregating the output from many models).
> - On the other hand, Random Forest uses **fully grown decision trees** (low bias, high variance). It tackles the error reduction task in the opposite way: by reducing variance. The trees are made uncorrelated to maximize the decrease in variance, but the algorithm cannot reduce bias (which is slightly higher than the bias of an individual tree in the forest). Hence the need for large, unpruned trees, so that the bias is initially as low as possible.
>
> Please note that unlike Boosting (which is sequential), RF grows trees in **parallel**.

<h2>CNN Provides Invariance to Mirroring</h2>

> Is a CNN invariant to such transformations?
>
> [Are CNNs invariant to translation, rotation, and scaling?](https://www.pyimagesearch.com/2021/05/14/are-cnns-invariant-to-translation-rotation-and-scaling/)
>
> Unless your training data includes digits that are rotated across the full 360-degree spectrum, your **CNN is not truly rotation invariant**. ... Therefore, CNNs can be seen as “not caring” exactly where an activation fires, simply that it does fire — and, in this way, we naturally handle translation inside a CNN.

> [About CNN, kernels and scale/rotation invariance](https://stats.stackexchange.com/questions/239076/about-cnn-kernels-and-scale-rotation-invariance)

<h2>Gradient Descent</h2>

<h3>Possible Reasons Gradient Descent Does Not Reach the Global Minimum</h3>

> [Does gradient descent always converge to an optimum?](https://datascience.stackexchange.com/questions/24534/does-gradient-descent-always-converge-to-an-optimum)
> 
> Gradient descent does not always converge to the global minimum; it depends on the shape of the function. A function is convex if the line segment between any two points on its graph lies on or above the graph. For a convex function every local minimum is also a global minimum, so gradient descent with a suitable step size heads toward the global minimum.
>
> Gradient descent is an iterative optimisation algorithm that searches for the parameters or coefficients at which a function takes its minimum value. On a non-convex function, however, it is not guaranteed to find the global minimum and can get stuck at a local minimum.
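
The effect is easy to reproduce on a small non-convex function. In the sketch below, `f(x) = x^4 - 3x^2 + x` is an arbitrary example with two minima; the learning rate and step count are also just illustrative choices.

```python
def f(x):
    return x ** 4 - 3 * x ** 2 + x


def gradient(x):
    # Derivative of f
    return 4 * x ** 3 - 6 * x + 1


def gradient_descent(start, learning_rate=0.01, steps=2000):
    x = start
    for _ in range(steps):
        x -= learning_rate * gradient(x)  # step against the slope
    return x


left = gradient_descent(start=-2.0)
right = gradient_descent(start=2.0)

# Same algorithm, different starting points, different answers
print(left, f(left))    # near x = -1.30, the global minimum
print(right, f(right))  # near x = 1.13, only a local minimum
```

Starting on the right side of the hump, the iterates never cross it, so the run settles in the shallower valley even though a lower one exists.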

