## Convexity

Convexity is a property of a function that guarantees that every local minimum is also a global minimum. This is a sufficient (though not necessary) condition for gradient descent to reliably find the best possible parameters. [youtube](https://www.youtube.com/watch?v=L2YiNu22saU)

#### Formal definition

- A function $f$ is **convex** if for all points $a$ and $b$ in its domain and all $t$ with $0 \le t \le 1$,

  $$
  t f(a) + (1 - t) f(b) \ge f(t a + (1 - t)b).
  $$

  [courses.cs.washington](https://courses.cs.washington.edu/courses/cse446/22sp/schedule/lecture_12.pdf)

- In words, for any two points on the graph of $f$, every point on the straight line segment between them lies on or **above** the graph, so the graph itself lies on or below every chord connecting two of its points. [youtube](https://www.youtube.com/watch?v=L2YiNu22saU)
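The inequality in the definition can be spot-checked numerically. The sketch below is only an illustration: the helper names (`satisfies_convexity`, `looks_convex`), the sampling interval, and the tolerance are assumptions made here, not part of the cited sources. A single failed sample disproves convexity on the interval, while passing every sample is evidence rather than proof.

```python
import random

def satisfies_convexity(f, a, b, t, tol=1e-9):
    """Check t*f(a) + (1-t)*f(b) >= f(t*a + (1-t)*b) for one triple (a, b, t)."""
    chord = t * f(a) + (1 - t) * f(b)   # height of the chord at parameter t
    curve = f(t * a + (1 - t) * b)      # height of the graph at the same point
    return chord >= curve - tol         # small tolerance for rounding error

def looks_convex(f, lo=-5.0, hi=5.0, trials=10_000, seed=0):
    """Sample random (a, b, t) in [lo, hi] x [lo, hi] x [0, 1].

    Returns False as soon as one sample violates the inequality.
    """
    rng = random.Random(seed)
    for _ in range(trials):
        a, b, t = rng.uniform(lo, hi), rng.uniform(lo, hi), rng.random()
        if not satisfies_convexity(f, a, b, t):
            return False
    return True

def square(x):
    return x * x      # convex everywhere

def cubic(x):
    return x ** 3     # concave for x < 0, so not convex on [-5, 5]

print(looks_convex(square), looks_convex(cubic))
```

For `cubic`, the triple $a = -4$, $b = -1$, $t = 0.5$ is a concrete counterexample: the chord height is $-32.5$, which lies below the graph height $(-2.5)^3 = -15.625$.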

#### Geometric intuition

- If you pick any two points on the curve of a convex function and draw the straight line between them, the entire curve between those points stays on or below that line segment. [youtube](https://www.youtube.com/watch?v=L2YiNu22saU)

- This gives convex functions a “bowl-shaped” geometry with no extra bumps: they may be flat in places, but they cannot rise above a chord between two of their points. [youtube](https://www.youtube.com/watch?v=L2YiNu22saU)

#### Convexity and gradient descent

- For a differentiable convex function, any point where the gradient is zero is guaranteed to be a global minimum, not just a local one. [courses.cs.washington](https://courses.cs.washington.edu/courses/cse446/22sp/schedule/lecture_12.pdf)

- Consequently, when the loss function in a learning problem is convex and a minimizer exists, gradient descent with a suitable (sufficiently small) step size converges to globally optimal parameter values, regardless of where it starts. [courses.cs.washington](https://courses.cs.washington.edu/courses/cse446/22sp/schedule/lecture_12.pdf)
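To make the "regardless of where it starts" claim concrete, here is a minimal sketch of gradient descent on the convex function $f(x) = (x - 3)^2$, whose only minimum is at $x = 3$. The function, the step size `lr=0.1`, the step count, and the three starting points are all illustrative assumptions chosen for this example.

```python
def gradient_descent(grad, x0, lr=0.1, steps=200):
    """Plain gradient descent in one dimension: x <- x - lr * grad(x)."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

def grad_f(x):
    # Derivative of the convex function f(x) = (x - 3)**2.
    return 2.0 * (x - 3.0)

# Three very different initial guesses.
starts = (-10.0, 0.0, 25.0)
finals = [gradient_descent(grad_f, x0) for x0 in starts]

for x0, x in zip(starts, finals):
    print(f"start {x0:6.1f} -> final {x:.6f}")
```

With this step size each update shrinks the distance to $x = 3$ by a factor of $0.8$, so all three runs end up at the same point to within floating-point precision.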

#### Mean squared error as an example

- The mean squared error (MSE) loss in linear regression is a convex function of the model parameters for any dataset. [courses.cs.washington](https://courses.cs.washington.edu/courses/cse446/22sp/schedule/lecture_12.pdf)

- This convexity explains why both gradient descent and closed-form ordinary least squares give the same optimal parameters: every local minimum of the MSE surface is a global minimum, and when the features are linearly independent that minimum is attained at a single parameter vector. [youtube](https://www.youtube.com/watch?v=L2YiNu22saU)
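The agreement between the two methods can be checked on a toy problem. The sketch below fits a line $y \approx w x + b$ to five made-up points, once with the closed-form least-squares formulas for simple linear regression and once with gradient descent on the MSE. The data, learning rate, and step count are assumptions picked for illustration only.

```python
# Hypothetical toy dataset, roughly following y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 7.1, 8.8]
n = len(xs)

# Closed-form ordinary least squares for one feature plus an intercept.
mean_x = sum(xs) / n
mean_y = sum(ys) / n
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
s_xx = sum((x - mean_x) ** 2 for x in xs)
slope_cf = s_xy / s_xx
intercept_cf = mean_y - slope_cf * mean_x

# Gradient descent on MSE(w, b) = (1/n) * sum((w*x + b - y)**2).
w, b = 0.0, 0.0
lr = 0.05
for _ in range(5000):
    residuals = [w * x + b - y for x, y in zip(xs, ys)]
    grad_w = (2.0 / n) * sum(r * x for r, x in zip(residuals, xs))
    grad_b = (2.0 / n) * sum(residuals)
    w -= lr * grad_w
    b -= lr * grad_b

print(f"closed form:      slope={slope_cf:.4f}, intercept={intercept_cf:.4f}")
print(f"gradient descent: slope={w:.4f}, intercept={b:.4f}")
```

The two printed lines should match to the displayed precision. The learning rate matters: a value that is too large for this dataset would make the iterates diverge even though the loss is convex.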

#### Why non‑convexity is harder

- Non-convex functions can have multiple local minima, maxima, and flat regions; in such landscapes, gradient descent can get stuck in a poor local minimum or wander on plateaus instead of reaching the global minimum. [youtube](https://www.youtube.com/watch?v=L2YiNu22saU)

- Visually, non-convex “bumpy” surfaces may have wide, shallow regions where gradients are tiny far from the optimum, making optimization slow and sensitive to initialization, unlike the consistently “downhill” shape of convex bowls. [d2l](https://d2l.ai/chapter_optimization/convexity.html)
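For contrast, the following sketch runs the same kind of gradient descent on a non-convex function, $f(x) = x^4 - 3x^2 + x$, which has two separate valleys. The function, step size, and starting points are illustrative assumptions; the point is only that the result now depends on initialization.

```python
def f(x):
    # Non-convex: two local minima separated by a local maximum.
    return x ** 4 - 3 * x ** 2 + x

def grad_f(x):
    return 4 * x ** 3 - 6 * x + 1

def gradient_descent(grad, x0, lr=0.01, steps=2000):
    """Plain gradient descent in one dimension: x <- x - lr * grad(x)."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

left = gradient_descent(grad_f, x0=-2.0)   # settles in the left valley, roughly x = -1.30
right = gradient_descent(grad_f, x0=2.0)   # settles in the right valley, roughly x = 1.13

print(f"from -2.0: x={left:.4f}, f(x)={f(left):.4f}")
print(f"from  2.0: x={right:.4f}, f(x)={f(right):.4f}")
```

Both runs stop at points where the gradient is essentially zero, but only the left one is the global minimum. The run started at $x = 2$ is stuck in the shallower valley with a strictly higher loss.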