Dear Learner,

Welcome. This is a set of notebooks designed to teach you how to use (and hopefully understand) a software package called Geomstats. Geomstats is an open-source package that uses concepts from Riemannian geometry to analyze data that lie on manifolds (what it means for "data to lie on a manifold" will be explained in further detail in the next section). Geomstats is one of the first software packages of its kind. Before Geomstats and software packages like it, only people with prior knowledge of Riemannian geometry could analyze data on manifolds. The Geomstats package aims to make this type of data analysis more accessible to people who do not have prior knowledge of Riemannian geometry. These instructional notebooks work towards Geomstats' goal of fostering accessibility.

For those interested, the following <a href="https://www.jmlr.org/papers/volume21/19-027/19-027.pdf">paper</a> compares Geomstats to other software packages like it (on page 5).

# What is the motivation for analyzing data on manifolds?

Many data sets lie on a manifold. Analyzing such data without Riemannian geometry is often possible, but analyzing it on its manifold is advantageous for three reasons:

    1) Analyzing data on the manifold it lies on allows you to reduce the degrees of freedom of the system, making computations less complicated.
    2) Knowing the manifold that a data set belongs to may give you more predictive power and a better understanding of the data's evolution.
    3) Knowing the manifold a data set lives in will help you extract the "signal" from a noisy data set or a data set with very few datapoints.

### 1) Analyzing data on a manifold reduces the degrees of freedom of the system, making computations less complicated.

The number of $\textbf{Degrees of Freedom}$ a system has is the number of independent variables needed to describe the system completely. For example, a particle moving freely in three dimensions requires three variables to describe its position completely, such as $(x,y,z)$ or $(r,\theta,\phi)$. If you can describe a particle's motion with three variables, you would not want to use four, because keeping track of an extra variable is mentally taxing (if you are solving a problem on paper) and more computationally expensive (if you are solving the problem with a computer). Similarly, if you now know that this particle is constrained to move on the surface of a sphere, you would want to describe it using two variables $(\theta,\phi)$ instead of three $(r,\theta,\phi)$.

<img src="figures/intro_degrees_of_freedom.png" />

This is one of the major motivations behind using manifolds to analyze data. Of course, it is sometimes possible to analyze data without manifolds and Riemannian geometry, but doing so is typically more complicated and computationally intensive.
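To make the reduction from three variables to two concrete, here is a minimal sketch using only Python's standard library. It assumes the physics convention in which $\theta$ is the polar angle measured from the $z$-axis and $\phi$ is the azimuthal angle; the function names are illustrative and are not part of Geomstats.

```python
import math

def sphere_to_cartesian(theta, phi, r=1.0):
    """Map (theta, phi) on a sphere of radius r to Cartesian (x, y, z).

    Assumption: theta is the polar angle from the z-axis and
    phi is the azimuthal angle in the x-y plane.
    """
    x = r * math.sin(theta) * math.cos(phi)
    y = r * math.sin(theta) * math.sin(phi)
    z = r * math.cos(theta)
    return (x, y, z)

def cartesian_to_sphere(x, y, z):
    """Recover (theta, phi) from a nonzero Cartesian point."""
    r = math.sqrt(x * x + y * y + z * z)
    theta = math.acos(z / r)
    phi = math.atan2(y, x)
    return (theta, phi)

# Two numbers are enough to locate a point on the unit sphere.
theta, phi = math.pi / 3, math.pi / 4
point = sphere_to_cartesian(theta, phi)

# The three Cartesian coordinates carry no extra information:
# they always satisfy x^2 + y^2 + z^2 = 1, and the two angles
# can be recovered from them.
recovered = cartesian_to_sphere(*point)
print(point, recovered)
```

Because the constraint $x^2 + y^2 + z^2 = r^2$ removes one degree of freedom, storing $(\theta,\phi)$ is sufficient once the radius is known.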

### 2) Knowing the manifold that a data set belongs to may give you more predictive power and a better understanding of the data's evolution.

Objects travelling along a manifold often follow geodesics on that manifold. A geodesic is the shortest path between two points in the space that the object moves in. For example, geodesics in flat 2D and 3D space are straight lines because a straight line is the shortest way to get from one point to another. The figure below shows two paths between two points in Cartesian space. One path ($\gamma$) follows the geodesic, and the other path ($\gamma^{'}$) does not.

<img src="figures/intro_cartesian_geodesic.png" />

However, when an object lies in a curved space, its geodesic will generally not be a straight line. For example, if an object is constrained to move along the surface of a sphere, the shortest path between two points is not a straight line but a curved one (an arc of a great circle). A straight line in 3D space would not lie on the surface of the sphere, so it cannot be the shortest path along the sphere.

<img src="figures/intro_sphere_geodesic.png" />

If you did not know that the object was moving along the surface of the sphere, you would wonder why it is taking such an "erratic" path instead of just going straight. The motion of the particles in your system might seem random because you do not understand the space they are moving in. However, if you learn more about the space they are moving in (the surface of a sphere), you would realize that the particles are following very reasonable and predictable paths along geodesics, and this would give you $\textit{not only}$ a better understanding of how particles have moved in the past but $\textit{also}$ predictive power to determine how particles will move in the future.
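The difference between the straight line through 3D space and the geodesic along the sphere can be checked numerically. The sketch below uses only the standard library and assumes both points are unit vectors that are not antipodal; the helper names are illustrative and are not Geomstats functions.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def great_circle_distance(p, q):
    """Geodesic distance between unit vectors p and q on the unit sphere."""
    cos_angle = max(-1.0, min(1.0, dot(p, q)))  # clamp against round-off
    return math.acos(cos_angle)

def geodesic_point(p, q, t):
    """Point a fraction t of the way along the great-circle arc from p to q.

    Assumption: p and q are distinct and not antipodal, so sin(omega) != 0.
    """
    omega = great_circle_distance(p, q)
    weight_p = math.sin((1.0 - t) * omega) / math.sin(omega)
    weight_q = math.sin(t * omega) / math.sin(omega)
    return tuple(weight_p * x + weight_q * y for x, y in zip(p, q))

p = (1.0, 0.0, 0.0)
q = (0.0, 1.0, 0.0)

arc_length = great_circle_distance(p, q)               # path along the sphere
chord_length = norm(tuple(x - y for x, y in zip(p, q)))  # straight line in 3D

midpoint_line = tuple((x + y) / 2.0 for x, y in zip(p, q))
midpoint_geodesic = geodesic_point(p, q, 0.5)

print("arc:", arc_length, "chord:", chord_length)
print("|line midpoint|:", norm(midpoint_line))
print("|geodesic midpoint|:", norm(midpoint_geodesic))
```

The chord is shorter than the arc, but its midpoint has norm less than 1, so it leaves the sphere. The geodesic midpoint stays on the surface, which is why the arc is the shortest path available to an object constrained to the sphere.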

### 3) Knowing the manifold a data set lives in will help you extract the "signal" from a noisy data set or a data set with very few datapoints.

Let's dissect the "noisy data set" case. Suppose you are measuring the position of a car moving at constant velocity, but your measuring tools are very imprecise, and your data looks like this.

<img src="figures/intro_random_points.png" />

How can you get any information from this? It would be very difficult if you don't have a model for what the data $\textit{should}$ look like. But if you know that a car moving at constant velocity should follow the line $x_f = x_i + v\Delta t$, then you can get more information from your data by fitting it to a line with slope $v$ (shown below in (a)). However, if you did not know that your data should follow a straight line, then you might try to fit the data using (b) or (c), and your (incorrect) model would not provide as much predictive power, or you might not be able to extract any information at all. Similarly, knowing the manifold your data lies on can help you extract information from noisy data.

<img src="figures/intro_point_fits.png" />
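The idea of fitting noisy measurements to the model $x_f = x_i + v\Delta t$ can be sketched with an ordinary least-squares line fit. The numbers below (true velocity, initial position, noise level, number of samples) are made up for illustration and are not taken from the figures.

```python
import random

def fit_line(ts, xs):
    """Ordinary least-squares fit of x = intercept + slope * t."""
    n = len(ts)
    t_mean = sum(ts) / n
    x_mean = sum(xs) / n
    s_tt = sum((t - t_mean) ** 2 for t in ts)
    s_tx = sum((t - t_mean) * (x - x_mean) for t, x in zip(ts, xs))
    slope = s_tx / s_tt
    intercept = x_mean - slope * t_mean
    return slope, intercept

# Hypothetical car: starts at x_i = 5 and moves with velocity v = 3.
random.seed(0)
true_v, true_x0, noise_std = 3.0, 5.0, 2.0
times = [10.0 * i / 49 for i in range(50)]
positions = [true_x0 + true_v * t + random.gauss(0.0, noise_std) for t in times]

# Knowing the data should lie on a line lets us recover v and x_i
# even though every individual measurement is noisy.
v_hat, x0_hat = fit_line(times, positions)
print("estimated velocity:", v_hat)
print("estimated initial position:", x0_hat)
```

The estimates land close to the true values because the model constrains the fit to two parameters, so the noise in the individual measurements largely averages out.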

Let's now dissect the "data set with very few data points" case, and let's again use the example of a car moving at constant velocity. Let's say you measured the initial position and initial time (first point) and the final position and final time (second point), and saw these two data points.

<img src="figures/intro_two_points.png" />

If you didn't know that the position of a car moving at constant velocity can be modeled by a line, you might not be able to accurately extrapolate the data beyond these two points. However, because you know that these two points should fall on a line, you can accurately predict where the car will be at a later time.

<img src="figures/intro_two_point_fits.png" />

Similarly, if you know the manifold that a data set lies on, you can predict the trajectory of a data point along the manifold.
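As a rough illustration of both statements, the sketch below extrapolates from two measurements, first along a line and then along a great circle on the unit sphere. It assumes constant-speed motion in both cases and that the two points on the sphere are unit vectors that are not antipodal; the function names and sample values are hypothetical.

```python
import math

def extrapolate_line(t0, x0, t1, x1, t):
    """Predict position at time t from two measurements, assuming constant velocity."""
    v = (x1 - x0) / (t1 - t0)
    return x0 + v * (t - t0)

def extrapolate_great_circle(p, q, s):
    """Point at parameter s on the great circle through unit vectors p (s=0) and q (s=1).

    Values of s greater than 1 continue the geodesic beyond q.
    Assumption: p and q are distinct and not antipodal.
    """
    cos_omega = max(-1.0, min(1.0, sum(a * b for a, b in zip(p, q))))
    omega = math.acos(cos_omega)
    weight_p = math.sin((1.0 - s) * omega) / math.sin(omega)
    weight_q = math.sin(s * omega) / math.sin(omega)
    return tuple(weight_p * a + weight_q * b for a, b in zip(p, q))

# Car measured at (t=0, x=2) and (t=4, x=10): where is it at t=6?
x_later = extrapolate_line(0.0, 2.0, 4.0, 10.0, 6.0)
print(x_later)  # 14.0

# Particle on the unit sphere seen at p, then at q a quarter turn later:
# after another quarter turn it should be opposite p, near (-1, 0, 0).
p = (1.0, 0.0, 0.0)
q = (0.0, 1.0, 0.0)
next_point = extrapolate_great_circle(p, q, 2.0)
print(next_point)
```

In both cases two data points suffice only because the shape of the path (a line, or a geodesic on a known manifold) is assumed in advance.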

# What will you learn in these tutorials?

Geomstats is designed to be intuitive and user-friendly, but having some knowledge of Riemannian geometry will put you in a good position to understand how to use Geomstats most effectively. Therefore, in the next three notebooks, we will give you an overview of three of the most important parent classes in Geomstats, along with a description of the mathematical concepts implemented in each class.

We will cover the parent classes:

    1) Manifold
    2) Connection
    3) RiemannianMetric
    
One instructional notebook will be dedicated to each of these parent classes, starting with Manifold. In each of these notebooks, you should expect to gain an understanding of

    1) the structure/hierarchy of the Geomstats code
    2) how to perform calculations on manifolds
    3) how and where this mathematics is implemented in the code

# Beginning to build a hierarchical map

Now that we know about these three parent classes, we will begin to draw a hierarchical map of Geomstats, which we will build out as we learn more about each parent class.

<img src="figures/intro_hierarchal_map.png" />

In the next notebook, we will discuss the Manifold class.