# PyTorch Internals

In this notebook, we will explore the internals of PyTorch, and see how we can write custom PyTorch operators in C++.

## PyTorch Internals Architecture

### Core Architecture
- **Python Frontend + C++ Backend**: PyTorch combines a flexible Python interface with an optimized C++ core engine
- **Deep Integration**: Unlike simple C++ libraries with thin Python wrappers, PyTorch has robust bidirectional integration
- **Memory Sharing**: Python and C++ components share tensor memory without copying, enabling efficient data handling

### Technical Implementation
- **Bridging Mechanism**: Uses Python C extensions (PyBind11 and CPython C-API) to connect Python code with C++ functions
- **ATen Library**: C++ tensor library providing the computational foundation for PyTorch operations
- **CUDA Integration**: Native support for GPU acceleration through dedicated CUDA implementations

### Development Philosophy
- **Pythonic Design**: Prioritizes natural Python coding patterns while leveraging C++ performance
- **Clear Separation of Concerns**:
    - Python: High-level APIs, neural network layers, optimizers, training loops
    - C++: Low-level tensor operations, autograd engine, memory management

### Benefits
- **Performance**: Computational efficiency of compiled C++ with the flexibility of Python
- **Extensibility**: Easy to extend with custom operators in either Python or C++
- **Research-Friendly**: Enables rapid prototyping while maintaining production-grade performance

[PyTorch Design Philosophy Documentation](https://pytorch.org/docs/stable/community/design.html)


Internally, PyTorch's codebase reflects the Python-C++ split architecture:

- **torch/csrc directory**: C++ source that implements Python-C++ bindings and core components:
    - Autograd engine
    - JIT compiler
    - CUDA extensions
    - Distributed communication primitives

- **ATen library**: C++ tensor library (under aten/) that provides fundamental tensor operations:
    - Matrix multiplication
    - Convolution operations
    - Activation functions
    - Each with separate CPU and GPU kernels

- **Python-C++ interaction flow**:
    1. Python API function is called
    2. Arguments are converted from Python to C++ types
    3. Appropriate C++ routine in ATen is invoked
    4. C++ return values are wrapped back into Python objects
    5. Results are returned to the Python environment

- **Memory management**:
    - Tensor storage is allocated in C++
    - Python objects hold references without copying data
    - Reference counting system spans both languages

In the following sections, we'll explore how key PyTorch components work: from basic tensor creation to autograd, and examine how Python API calls flow into C++ execution.

## Tensor Creation and Manipulation
PyTorch tensor creation (via `torch.tensor()`) exemplifies the Python-C++ architecture:


In [1]:
import torch
x = torch.tensor([[1, 2, 3], [4, 5, 6]], dtype=torch.float32)

When this code runs, the following steps occur internally:
 
1. **Python Function Call**: The Python `torch.tensor()` function receives the input list and dtype argument

2. **Type and Shape Analysis**:
    - PyTorch analyzes the nested Python list structure to determine dimensions
    - Infers tensor shape [2,3] from the nested list structure
    - Uses the explicitly provided dtype (float32)

3. **Memory Allocation** (in C++):
    - ATen library allocates contiguous memory block of 6 × 4 bytes (6 float32 values)
    - Creates metadata structures to track shape, stride, and dtype information
    - see torch/csrc/utils/tensor_new.cpp in the PyTorch source

4. **Data Transfer**:
    - Iterates through the Python nested list
    - Copies each value into the allocated C++ memory
    - Performs any necessary type conversion (integer to float in this example)

5. **Tensor Object Creation**:
    - Constructs a C++ Tensor object that points to the allocated memory
    - Wraps this C++ tensor in a Python tensor object
    - Returns this Python tensor to the caller

This process ensures that `x` contains a complete copy of the input data, unlike operations like `torch.as_tensor()` or `torch.from_numpy()` which can share memory with existing arrays.
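The copy-versus-share distinction can be checked directly. This is a minimal sketch, assuming NumPy is installed alongside PyTorch; the variable names are illustrative only.

```python
import numpy as np
import torch

arr = np.array([1.0, 2.0, 3.0], dtype=np.float32)

copied = torch.tensor(arr)            # allocates new memory and copies the values
shared = torch.from_numpy(arr)        # reuses the NumPy buffer, no copy
maybe_shared = torch.as_tensor(arr)   # shares here because dtype and device already match

arr[0] = 100.0                        # mutate the original NumPy array in place

print(copied[0].item())               # 1.0   (owns its own copy)
print(shared[0].item())               # 100.0 (sees the change through shared memory)
print(maybe_shared[0].item())         # 100.0
```

Only `torch.tensor()` is insulated from later changes to the source array.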

## PyTorch Tensor Architecture

Once created, a torch.Tensor Python object is actually a thin wrapper around a C++ at::Tensor. In PyTorch’s Python code, the Tensor class is defined to inherit from a special base class provided by the C++ backend: torch._C._TensorBase (exposed as torch._C.TensorBase in recent releases).

https://github.com/pytorch/pytorch/blob/46cf6d332f075ed90d3baf21c32de51e4f304549/torch/tensor.py#L40%23

This means that most tensor operations are not implemented in pure Python; instead, they call into C++ methods. For example, the Python definition looks like: class Tensor(torch._C._TensorBase): .... When you call methods on a tensor (like x.resize_() or x + y), those methods are either forwarded to C++ implementations or bound directly via PyBind11/CPython C-API.

In fact, many Tensor methods aren’t visible in Python source code because they’re defined in C++. For instance, tensor.item() (to get a Python scalar from a single-valued tensor) does not have a Python definition; it is registered on the C++-backed type in the binding code. Similarly, operations like tensor.add(other) or the + operator are defined in C++ and exposed to Python. PyTorch maps Python magic methods to C++ functions: e.g., calling x + y in Python triggers x.__add__(y), which is bound to the same C++ routine as x.add(y), and that in turn invokes the C++ at::add operation. This design ensures that tensor computations execute in fast C++ code while the Python layer mainly orchestrates the calls.


https://github.com/pytorch/pytorch/blob/46cf6d332f075ed90d3baf21c32de51e4f304549/torch/tensor.py

https://stackoverflow.com/questions/65445621/where-is-torch-tensor-item-defined-at-the-github#:~:text=Most%20of%20Torch%27s%20underlying%2C%20compute,cpp%3A812
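One way to see which Tensor methods live in C++ is to inspect their Python types. This is a hedged sketch: the exact base-class name printed at the end varies across PyTorch versions, so it is printed rather than assumed.

```python
import inspect
import torch

# Methods bound from C++ appear as built-in method descriptors ...
print(type(torch.Tensor.add))
print(type(torch.Tensor.item))
# ... while methods written in Python are ordinary functions.
print(type(torch.Tensor.backward))

cxx_bound = not inspect.isfunction(torch.Tensor.add)
py_defined = inspect.isfunction(torch.Tensor.backward)

# The first base class of torch.Tensor is the tensor base type provided by torch._C.
base = torch.Tensor.__bases__[0]
print(base.__module__, base.__name__)
```

A method that is not a plain Python function has no Python source to read; its implementation is in the C++ bindings.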

Tensor data storage is another important internal concept. Every tensor has an associated Storage object that holds the raw memory (contiguous array of elements), while the Tensor itself holds metadata (like sizes, strides, and data type) describing how to interpret that memory.

https://pytorch.org/docs/stable/tensors.html

This means you can have multiple Tensor objects that share the same storage (useful for views, slicing, etc.). For example, if a = torch.ones(3,3) and b = a.view(9), a and b are different Tensor objects (different shapes) but they point to the same underlying storage buffer.
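The shared-storage claim can be verified with `data_ptr()`, which returns the address of a tensor's first element. A minimal sketch using the same `a` and `b` as above:

```python
import torch

a = torch.ones(3, 3)
b = a.view(9)                  # different shape, same underlying buffer

print(a.shape, b.shape)        # torch.Size([3, 3]) torch.Size([9])
same_buffer = a.data_ptr() == b.data_ptr()
print(same_buffer)             # True

b[0] = 42.0                    # write through the view ...
print(a[0, 0].item())          # 42.0 (... and the original sees it)
```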

PyTorch’s internals handle this by keeping a reference to a Storage inside each Tensor’s C++ implementation (TensorImpl). Storage knows about the memory (pointer, size, and element data type) but nothing about tensor dimensionality, whereas Tensor (TensorImpl) knows how to map multi-dimensional indices to that storage via strides and offset. The figure below illustrates this relationship:

https://blog.christianperone.com/2018/03/pytorch-internal-architecture-tour/#:~:text=%28code%20from%20THStorage


As the diagram shows, PyTorch uses a strided tensor representation. The Storage contains the “actual physical data” (a contiguous memory block of a certain size and dtype), and each Tensor that uses that storage has its own metadata (tensor dimensions, stride for each dimension, and an offset into the storage). This design makes view operations efficient: slicing a tensor, or reshaping one whose layout allows it, doesn’t copy data; it just creates another Tensor with adjusted metadata pointing to the same Storage.
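The strided lookup can be reproduced by hand. The sketch below slices a contiguous tensor and then recomputes one element's position as offset + i * stride0 + j * stride1; the names `base`, `sub`, and `flat` are illustrative.

```python
import torch

base = torch.arange(12, dtype=torch.float32).reshape(3, 4)
sub = base[1:, ::2]            # rows 1..2, every second column: a view, no copy

print(sub.shape)               # torch.Size([2, 2])
print(sub.stride())            # (4, 2)
print(sub.storage_offset())    # 4

# base is contiguous, so its flattened form follows the storage order.
flat = base.flatten()
i, j = 1, 1
flat_index = sub.storage_offset() + i * sub.stride(0) + j * sub.stride(1)
print(flat_index)              # 10
print(flat[flat_index].item(), sub[i, j].item())   # 10.0 10.0
```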

Under the hood, TensorImpl (the C++ struct c10::TensorImpl) holds a pointer to a Storage and fields for size, stride, etc. When you do an in-place operation on a view, every Tensor that shares the storage sees the change. Conversely, when the last Tensor referencing a Storage is gone, the memory can be freed (or returned to a memory pool, as we’ll discuss in the memory management section).

In summary, creating a tensor via the Python API involves: (1) a Python function that handles API niceties (like default dtype or device), (2) C++ functions that allocate memory (Storage) and create a TensorImpl with shape metadata, and (3) returning a Python Tensor object that wraps the C++ TensorImpl/Storage. From then on, most Tensor manipulation ops (arithmetic, indexing, etc.) are carried out by calling C++ kernels in ATen.

## Exploring PyTorch’s Python API Layers

PyTorch’s Python API is organized into sub-packages like torch.nn, torch.autograd, torch.optim, etc., built on top of the core torch (tensor and functional) API.

These higher-level APIs often still interact with the C++ core when performing computations. Let’s consider a few examples:


### Neural Network (torch.nn):

The torch.nn.Module class is a base class for all neural network modules (layers, models). Modules are Python objects, which can contain other modules or parameters (which are tensors). When you define a model (as a subclass of nn.Module), you typically initialize layers in __init__ and implement a forward method using tensor operations. For instance, nn.Linear is a module that contains two Parameter tensors (weight and bias) and defines a forward that does X * W^T + b (matrix multiply and add).

Internally, nn.Linear will call F.linear(input, weight, bias) from torch.nn.functional, which eventually calls low-level tensor operations (torch.addmm for the matrix multiplication and addition). Those low-level ops (addmm in this case) are part of the torch library and are implemented in C++ (ATen). So when you execute a model’s forward pass, you’re invoking a combination of Python (module forwards) and C++ (tensor ops). The nn.Module class itself mostly manages bookkeeping (such as registering submodules and parameters in Python dictionaries) and doesn’t do heavy computation in Python. The heavy lifting is deferred to the tensor operations which, as described, run in C++.

 https://discuss.pytorch.org/t/distributed-data-parallel-module-attribute/142584#:~:text=Forums%20discuss,parameters
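The three levels described above (module, functional, low-level op) can be compared numerically. This is a small sketch with arbitrary layer sizes; it shows only that the results agree, not which kernel PyTorch selects internally.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
layer = nn.Linear(4, 3)                    # holds weight (3x4) and bias (3)
x = torch.randn(2, 4)

out_module = layer(x)                                       # Module.__call__ -> forward
out_functional = F.linear(x, layer.weight, layer.bias)      # functional form
out_addmm = torch.addmm(layer.bias, x, layer.weight.t())    # bias + x @ W^T

print(torch.allclose(out_module, out_functional))           # True
print(torch.allclose(out_module, out_addmm, atol=1e-6))     # True
```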


### Autograd (torch.autograd):

This package provides the automatic differentiation functionality. From the Python side, torch.autograd exposes functions like backward() and classes like Function for custom gradients. When you call torch.autograd.backward(tensor), it will kick off the backward pass using PyTorch’s autograd engine (more on this in the next section). Notably, torch.autograd.Variable used to be a separate class for tensors that track gradients, but since PyTorch 0.4, torch.Tensor itself supports autograd (the requires_grad attribute). The torch.autograd Python code primarily wraps engine calls; e.g., torch.autograd.backward() checks inputs and then calls Variable._execution_engine.run_backward which is a C++ routine.

https://pytorch.org/blog/how-computational-graphs-are-executed-in-pytorch/#:~:text=Variable,allow_unreachable%20flag

 So the Python API here is a thin layer that delegates to the C++ autograd engine. Similarly, torch.autograd.grad() calls into the engine. The torch.autograd.Function class allows you to define custom operations with a forward and backward in Python, but even then PyTorch will integrate them with the C++ autograd graph: when you use a custom Function, behind the scenes a C++ “Node” object is created to represent it in the graph, which will call back into your Python code for backward.
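Both entry points into the engine give the same gradient. A minimal sketch: `Tensor.backward()` accumulates into `.grad`, while `torch.autograd.grad()` returns the gradients as a tuple and leaves `.grad` untouched.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# Route 1: backward() writes the result into x.grad.
loss_a = (x ** 2).sum()
loss_a.backward()
grad_via_backward = x.grad.clone()

# Route 2: autograd.grad() returns the gradient instead of storing it.
loss_b = (x ** 2).sum()
(grad_via_grad,) = torch.autograd.grad(loss_b, x)

print(grad_via_backward)   # tensor([2., 4., 6.])
print(grad_via_grad)       # tensor([2., 4., 6.])
```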


### Optimizers (torch.optim):

Optimizer classes like optim.SGD are implemented in Python, as they are mostly looping over parameters and updating values. For example, the SGD step() method does something like param.data = param.data - lr * param.grad. Here param.data and param.grad are Tensor objects; the subtraction operation is a tensor operation (which calls into C++). So while the optimizer logic (iterating over parameter list, applying formulas) is Python, the math operations use the C++ core. Some optimizers use fused CUDA kernels for efficiency (for example, Adam can use a fused kernel), but these are still invoked via tensor operations. Generally, torch.optim is an example of a Python-level loop controlling a bunch of fast tensor operations.
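The update rule can be written by hand and compared with `torch.optim.SGD`. This is a sketch for plain SGD only (no momentum or weight decay); the toy loss and the names `w` and `w_ref` are illustrative.

```python
import torch

lr = 0.1
w = torch.tensor([1.0, -2.0], requires_grad=True)
w_ref = w.detach().clone().requires_grad_(True)

# Manual step: a Python-level statement driving C++ tensor ops.
(w ** 2).sum().backward()
with torch.no_grad():          # the update itself must not be recorded by autograd
    w -= lr * w.grad

# The same step through the optimizer.
optimizer = torch.optim.SGD([w_ref], lr=lr)
(w_ref ** 2).sum().backward()
optimizer.step()

print(w)       # tensor([ 0.8000, -1.6000], requires_grad=True)
print(w_ref)   # same values
```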


Utilities and others: Many other parts of the API like torch.utils.data (data loading) or torch.jit (just-in-time compiler) have Python components that coordinate tasks (like spawning data loader workers or tracing code) but eventually interact with C++ (e.g., torch.jit uses C++ autograd and interpreter under the hood). Even something like torch.cuda is mostly Python code that calls into C++ CUDA APIs (e.g., to synchronize or get device properties). In summary, Python code sets things up; C++ does the compute.

These layers interact primarily through the Python/C++ bridge (torch._C). The torch._C module (which you typically don’t import explicitly, as it’s loaded within torch) is a Python module implemented in C++ that contains the core classes and functions. For example, torch._C._TensorBase (the base class for tensors), the autograd Engine (torch._C._EngineBase), and many C++ implementations of functions appear as attributes of torch._C. PyTorch uses PyBind11 and custom CPython bindings to expose C++ classes in this torch._C namespace. When you call, say, torch.rand(2,2), the call resolves to a function bound in the compiled torch._C extension, which creates the tensor in C++. In essence, torch._C is the gateway where Python calls land in the C++ world. This bridging is why PyTorch feels smooth in Python: you call a Python function, but it executes as efficiently as a native library call.

## Tracing a Python Operation to C++ Execution
Let’s walk step-by-step through a simple tensor operation to see how a Python call invokes C++ code. Consider the expression:


In [2]:
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x * 2 + 1

When this code runs:

1. **torch.tensor creation**: As discussed earlier, torch.tensor([...], requires_grad=True) triggers a C++ tensor creation. The end result is that x is a torch.Tensor Python object that wraps an at::Tensor allocated in C++. The requires_grad=True flag sets the tensor’s autograd metadata to start tracking operations.

2. **x * 2 (Tensor multiplication)**: Python translates x * 2 into a call to x.__mul__(2). On the Tensor class, __mul__ ends up in the same operation as self.mul(other). This operation is not implemented in Python; it’s bound to the C++ kernel for multiplication. What happens internally is:
    - PyTorch sees that other (the number 2) is a Python number, so it will wrap it as a tensor (a 0-dim scalar tensor) or handle it via a scalar pathway.
    - The actual C++ function for tensor-scalar multiplication is invoked (through the binding in torch._C). This is an ATen operation: effectively it loops over x’s data and multiplies each element by 2 (in C++).
    - A new C++ Tensor is created for the result (let’s call it tmp), with requires_grad=True because x required grad. The autograd graph also records this operation (the grad_fn for tmp will be a MulBackward0 node).
    - The C++ result is returned to Python as a new torch.Tensor object (let’s call it tmp_py on the Python side). So tmp_py wraps the C++ tensor resulting from x*2.

3. **tmp + 1 (Tensor addition)**: Next, 1 is added to the result of the multiplication. The Python __add__ of tmp_py is called, which again delegates to the C++ add operation. Similar steps occur: the number 1 is handled as a scalar, the C++ at::add kernel is invoked on tmp and 1, producing a new Tensor for y. This new tensor’s grad_fn will be an AddBackward0 node (because it’s the result of an addition involving a tensor that requires grad). Now y is returned as a Python Tensor object wrapping the final C++ Tensor.
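The operator form and the explicit method form can be put side by side. A minimal sketch that names the intermediate `tmp` so its `grad_fn` can be inspected:

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

tmp = x.mul(2)        # what x * 2 dispatches to
y = tmp.add(1)        # what tmp + 1 dispatches to
y_ops = x * 2 + 1     # operator form of the same computation

print(torch.equal(y, y_ops))          # True
print(type(tmp.grad_fn).__name__)     # MulBackward0
print(type(y.grad_fn).__name__)       # AddBackward0
print(x.grad_fn)                      # None (x is a leaf)
```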

After these steps, we have y = x*2 + 1 computed. If we print y, it will show a tensor with the values doubled and incremented, and importantly it will indicate a grad_fn:

In [None]:
print(y)  
# tensor([3., 5., 7.], grad_fn=<AddBackward0>)

tensor([3., 5., 7.], grad_fn=<AddBackward0>)


This grad_fn <AddBackward0> is PyTorch showing that y was created by an addition operation and has a backward function ready (more on this soon).

So, how did the call actually make it into C++? Under the hood, PyTorch has a layer of C++ functions with names like THPVariable_mul or similar (historically called Variable functions, since Tensor was once Variable). When Python calls x.mul(y), it goes through a binding which calls a C++ function that roughly does:

1. Parse the Python arguments (using PyTorch’s PythonArgParser) and obtain the C++ at::Tensor objects for x and y.
   
2. Release the Global Interpreter Lock (GIL) because the actual computation will run in C++, potentially in parallel threads (this allows other Python code to run in the meantime if on other threads).
   
3. Call the C++ tensor operation (e.g., at::Tensor::mul or a free function at::mul(x, y)). This performs the calculation.

4. Wrap the resulting at::Tensor back into a Python Tensor (THPVariable PyObject) and return it.

This process is implemented in C++ source files like torch/csrc/autograd/python_variable.cpp and python_variable_methods.cpp where many methods of Tensor are bound. 

The Stack Overflow answer linked in references points out that the Python Tensor type is backed by a C struct THPVariable which contains an at::Tensor (called cdata).

https://blog.christianperone.com/2018/03/pytorch-internal-architecture-tour/#:~:text=%2F%2F%20Python%20object%20that%20backs,backward_hooks%3B

https://stackoverflow.com/questions/65445621/where-is-torch-tensor-item-defined-at-the-github#:~:text=which%20is%20provided%20to%20Python,cpp%3A812


The mapping of Python methods to C++ functions is done through a PyTypeObject or pybind registration.
 
For example, indexing (__getitem__) is handled by the C++ mapping protocol of THPVariable, while item() is registered among its C++-bound methods.
  
Essentially, when you do anything with a torch.Tensor in Python, you’re likely invoking C++ code almost immediately after. This is how PyTorch achieves its speed – by doing all computation in C++/CUDA – while still allowing you to use a comfortable Python interface.

## Understanding the Autograd Mechanism

One of PyTorch’s signature features is its autograd system for automatic differentiation. Autograd builds a computational graph behind the scenes as you carry out operations, then uses that graph to compute gradients when you call .backward(). Let’s break down how this works internally.

Computational Graph Construction: Every torch.Tensor has a .grad_fn attribute. It is None for tensors created directly by the user, whether or not they require grad (those that do require grad are the leaf nodes of the graph), and it points to a backward node for tensors produced by an operation on inputs that require grad. When you perform an operation on tensors that have requires_grad=True, PyTorch creates a Node (also called Function in some contexts) representing that operation in the graph.

For example, in our earlier example y = x*2 + 1 with x.requires_grad=True:

After y = x*2 + 1, y.grad_fn prints as <AddBackward0>, indicating the addition operation node. If you check tmp = x*2 (the intermediate), tmp.grad_fn would be <MulBackward0>. And x.grad_fn is None because x is a leaf tensor created by the user.

https://pytorch.org/blog/computational-graphs-constructed-in-pytorch/


In [7]:
print(x.grad)
print(x)
print(x.grad_fn)

None
tensor([1., 2., 3.], requires_grad=True)
None


What do these objects represent? They are instances of classes that know how to compute the gradient of the forward op.

In C++, PyTorch defines autograd Function classes for each operation (many of these are autogenerated from definitions in tools/autograd/derivatives.yaml). For example, MulBackward0 is a class that knows how to take the upstream gradient and multiply by the other input (per the derivative of multiplication), and AddBackward0 knows how to distribute the gradient to its two inputs (since d(a+b)/da = 1, d(a+b)/db = 1). These are the backward functions.


When the forward operation ran, PyTorch did the following internally:

- Created the output tensor and set its grad_fn pointer to the appropriate backward Function object. It also linked that Function to the input tensor’s grad_fn or accumulator. In C++, if you look at the source of an operation (for example, MulBackward0 in the autograd code), you’ll see it capturing references to the inputs’ grad_fn edges.
-  Essentially, PyTorch is building a graph data structure: the nodes are these Function objects, and the edges connect outputs to inputs (forming a DAG – directed acyclic graph).
-  If an input was a leaf tensor with requires_grad=True (like x), instead of a grad_fn it has a grad accumulator. PyTorch uses special accumulator nodes for leaf variables to collect gradients directly into the tensor’s .grad field.
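The graph described above can be walked from Python through `grad_fn.next_functions`. The helper `backward_chain` below is a hypothetical name introduced for illustration; it follows only the first non-empty edge of each node, which is enough for this linear example.

```python
import torch

def backward_chain(tensor):
    """Collect node type names from tensor.grad_fn down to the leaf accumulator."""
    names = []
    node = tensor.grad_fn
    while node is not None:
        names.append(type(node).__name__)
        # next_functions holds (node, input_index) pairs; constant inputs have no node.
        parents = [fn for fn, _ in node.next_functions if fn is not None]
        node = parents[0] if parents else None
    return names

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x * 2 + 1

chain = backward_chain(y)
print(chain)   # ['AddBackward0', 'MulBackward0', 'AccumulateGrad']
```

The final `AccumulateGrad` entry is the accumulator node for the leaf `x`.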

This graph construction is dynamic – it happens as your code runs. PyTorch does eager execution, meaning it doesn’t pre-build a static graph; it records operations on the fly. That’s why you can use Python control flow freely – each iteration can produce a different graph shape if needed. After you’ve done the forward passes and have an output (say a scalar loss), you can call loss.backward().

https://pytorch.org/docs/stable/autograd.html
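Because the graph is recorded while the code runs, ordinary Python control flow changes its shape. In this sketch the hypothetical function `repeated_double` applies a doubling `n` times, so the gradient is 2**n and the recorded graph has `n` multiplication nodes.

```python
import torch

def repeated_double(x, n):
    out = x
    for _ in range(n):      # each iteration records one more MulBackward0 node
        out = out * 2
    return out.sum()

x = torch.tensor([1.0, 2.0], requires_grad=True)

repeated_double(x, 3).backward()
grad_n3 = x.grad.clone()
print(grad_n3)              # tensor([8., 8.])

x.grad = None               # reset, otherwise the next backward would accumulate
repeated_double(x, 1).backward()
grad_n1 = x.grad.clone()
print(grad_n1)              # tensor([2., 2.])
```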

## Backward Pass (Autograd Engine):

When backward() is invoked on a tensor (or torch.autograd.backward is called), PyTorch’s autograd engine kicks in. In Python, Tensor.backward() just calls the global torch.autograd.backward function, which does some argument processing and then calls Variable._execution_engine.run_backward(...). _execution_engine is a Python wrapper for the C++ Autograd Engine (an instance of torch::autograd::Engine class, specifically an ImperativeEngine implementation). This Engine is implemented in C++ (see torch/csrc/autograd/engine.cpp) and is responsible for orchestrating the backward pass.


The backward engine performs a traversal of the graph starting from the node(s) corresponding to the target tensor (the one you called backward on, e.g., the loss). It uses the information in each Function node’s structure to propagate gradients:

- It first initializes the “seed” gradient for the output node as provided (if loss.backward() is called with no arguments, it uses a tensor of ones as the gradient for the loss).
- Then it visits the remaining nodes in topological order, so that a node is processed only after all gradients flowing into it are available. To track this, the engine keeps a dependency count for each node: the number of gradients the node must still receive from the nodes that consume its output.
- The Engine runs each node’s apply() method (or equivalent) to compute its gradients. For example, the AddBackward0 node will output gradients for each of its inputs (which were the outputs of the earlier operations or leafs) by copying the incoming grad (since d(a+b)/da = 1, the grad is passed through to a and similarly to b). The MulBackward0 node will take the incoming grad and multiply by the saved tensor (when y = x*2, the grad for x will be incoming_grad * 2, because ∂(x*2)/∂x = 2). PyTorch’s autograd Functions can retrieve any values they saved during forward – for example, some backward functions save input tensors or other needed values.
- As gradients are computed for a tensor, if that tensor is a leaf with requires_grad=True, the Engine will accumulate the grad into tensor.grad. If it’s an intermediate result, the Engine will pass the grad further down to its own predecessors. The engine ensures that if multiple paths produce a gradient for the same leaf (e.g., if a tensor was used in two branches that merged), those gradients are summed.
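Summation over multiple paths, and accumulation into `.grad` across calls, can both be observed in a few lines. A minimal sketch:

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# x feeds two branches that merge again; their gradients (2 and 3) are summed.
y = x * 2 + x * 3
y.sum().backward()
grad_two_paths = x.grad.clone()
print(grad_two_paths)        # tensor([5., 5., 5.])

# A second backward pass adds to the existing .grad rather than replacing it.
(x * 2).sum().backward()
grad_accumulated = x.grad.clone()
print(grad_accumulated)      # tensor([7., 7., 7.])
```

This accumulation is why training loops clear gradients between steps.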

PyTorch’s engine is optimized in many ways: it can work in parallel threads, it frees intermediate buffers as soon as they’re not needed (to save memory), and it has the ability to optionally keep the graph for another backward (if retain_graph=True). But conceptually, it’s performing the chain rule by following the graph. After loss.backward(), all leaf tensors (often your model parameters) that require grad will have their .grad fields populated with the calculated gradients. You can then use these for optimization (e.g., SGD step).
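The buffer-freeing behaviour and `retain_graph=True` can be seen with an operation that saves its inputs for backward. The sketch uses `x * x` because its backward needs the saved value of `x`; an operation that saves nothing may not raise on a repeated backward.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
loss = (x * x).sum()

loss.backward(retain_graph=True)   # keep the saved tensors for another pass
loss.backward()                    # allowed; this pass frees the saved tensors
print(x.grad)                      # tensor([ 4.,  8., 12.])  (2x accumulated twice)

try:
    loss.backward()                # the graph's buffers are gone now
    third_pass_failed = False
except RuntimeError as err:
    third_pass_failed = True
    print("RuntimeError:", str(err)[:60])
```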

In [8]:
x = torch.tensor([1., 2., 3.], requires_grad=True)
y = x * 2 + 1             # forward operations
print(y)                  # tensor([3., 5., 7.], grad_fn=<AddBackward0>)
print(y.grad_fn)          # AddBackward0 (the grad_fn of y)
loss = y.sum()            # sum up elements -> scalar
print(loss.grad_fn)       # SumBackward0 (grad_fn of the sum operation)
loss.backward()           # trigger backward pass
print(x.grad)             # tensor([2., 2., 2.])

tensor([3., 5., 7.], grad_fn=<AddBackward0>)
<AddBackward0 object at 0x1275fad40>
<SumBackward0 object at 0x1274da6b0>
tensor([2., 2., 2.])


In this snippet, y = x*2 + 1 has grad_fn=<AddBackward0>, whose input edge leads to the MulBackward0 node created by x*2. The loss is 3+5+7 = 15 with grad_fn=<SumBackward0>.

After loss.backward(), we see x.grad = tensor([2., 2., 2.]), which makes sense because ∂(sum(x*2+1))/∂x = 2 for each element. This matches the manual calculation: each element of x was multiplied by 2 and then added, so derivative is 2. Autograd computed this automatically by traversing the graph of SumBackward0 -> AddBackward0 -> MulBackward0.


It’s worth noting how torch.autograd.Function fits into this. This class allows users to define custom forward and backward logic. When you subclass torch.autograd.Function and call it via its apply method, PyTorch executes your Python forward code and also creates a C++ autograd Node that, when backward is executed, calls the Python backward you defined. Under the hood, a bit of glue ensures that a Python-defined backward can be invoked by the C++ engine: a C++ Node wrapper (named PyNode in recent sources and PyFunction in older ones; see torch/csrc/autograd/python_function.cpp) holds a reference to the PyObject that implements the backward. When the engine encounters such a node, it acquires the GIL and switches to Python execution to run the backward. This is a bit advanced, but the takeaway is that custom Functions are integrated seamlessly: from the user perspective, they behave like any other operation with autograd.
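A small custom Function makes this concrete. The class `Square` below is a hypothetical example, not something shipped with PyTorch; it saves its input in forward and applies the derivative 2 * input in backward.

```python
import torch

class Square(torch.autograd.Function):
    """Illustrative custom op: y = x^2 with a hand-written gradient."""

    @staticmethod
    def forward(ctx, inp):
        ctx.save_for_backward(inp)       # stash what backward will need
        return inp * inp

    @staticmethod
    def backward(ctx, grad_output):
        (inp,) = ctx.saved_tensors
        return grad_output * 2 * inp     # chain rule: upstream grad times dy/dx

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = Square.apply(x)                      # call through apply, not forward directly
print(y.grad_fn)                         # the node the engine will call back into

y.sum().backward()
print(x.grad)                            # tensor([2., 4., 6.])
```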

In summary, PyTorch’s autograd: (1) records operations in a graph structure of Function nodes, (2) uses a dedicated C++ Engine to traverse the graph backward computing gradients, and (3) populates .grad fields of leaf tensors. This design is what makes constructs like dynamic graphs and higher-order gradients (you can compute gradients of gradients) possible and efficient.
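Higher-order gradients work by asking autograd to record the backward pass itself with `create_graph=True`. A minimal sketch for y = x^3, where dy/dx = 3x^2 and the second derivative is 6x:

```python
import torch

x = torch.tensor(3.0, requires_grad=True)
y = x ** 3

# create_graph=True records the backward computation so it can be differentiated again.
(dy_dx,) = torch.autograd.grad(y, x, create_graph=True)
(d2y_dx2,) = torch.autograd.grad(dy_dx, x)

print(dy_dx.item())     # 27.0  (3 * 3^2)
print(d2y_dx2.item())   # 18.0  (6 * 3)
```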


## Python Memory Model & PyTorch

Given the interaction between Python and C++, it’s important to understand how memory and objects are managed. PyTorch needs to juggle two systems: the Python memory model (with its garbage collector and reference counting) and the memory management of tensors (especially GPU memory).

Tensor objects and reference counting: In CPython, every object (like a torch.Tensor) has a reference count. When you do a = b, you increase the refcount of the object that b refers to (a now references it too). When the refcount drops to zero, the object’s deallocator runs and its memory is freed. A PyTorch Tensor is an ordinary Python object whose type derives from torch._C._TensorBase. The actual C++ struct for a Tensor Python object, THPVariable in python_variable.h, includes PyObject_HEAD (the Python object header with refcount and type info) and an at::Tensor cdata field. That means each Python tensor holds a C++ Tensor inside it (the cdata).

The C++ Tensor in PyTorch (of type at::Tensor) is essentially an intrusive_ptr to a TensorImpl. This is a form of reference counting in C++: each TensorImpl knows how many references (Tensors) point to it. So there are two levels of refcount: one at Python level (for the PyObject) and one at C++ level (for the TensorImpl/Storage). PyTorch aligns these so that typically there’s a 1-to-1 relationship: if a Tensor PyObject is alive, it holds one reference to the underlying TensorImpl. If you delete the Python Tensor (refcount to 0), its deallocation decrements the C++ intrusive_ptr, and if that was the last reference, the actual data memory can be freed. Conversely, if something else still references the same TensorImpl (for example, the autograd graph holding a saved tensor), the TensorImpl’s refcount is greater than 1. Only when all references are gone will the storage be freed.

One nuance: views and in-place ops. If you have b = a.view(...), a and b share the same Storage. In this case, both a and b have their own TensorImpl, but those TensorImpls have pointers to the same Storage (with its own refcount). PyTorch must ensure that if you modify one, the other sees the change (since same data). The Storage’s refcount handles the memory life cycle. The TensorImpl’s refcount handles their life as separate Python objects. PyTorch’s design here (decoupling Tensor and Storage) means that freeing a Tensor (Python object) will decrease Storage’s refcount, but not necessarily free it if another tensor is still using it.
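The lifetime rule can be illustrated by dropping the original name while a view is still alive. This sketch shows only the observable behaviour (the data stays valid); it does not inspect the C++ reference counts themselves.

```python
import torch

a = torch.arange(6, dtype=torch.float32)   # storage holds 6 float32 values
b = a[2:4]                                  # view starting two elements into the storage

base_ptr = a.data_ptr()
del a                                       # drop the name; the storage stays alive via b

print(b)                                    # tensor([2., 3.])
print(b.storage_offset())                   # 2
offset_bytes = b.data_ptr() - base_ptr
print(offset_bytes)                         # 8  (2 elements * 4 bytes)
```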

Memory allocation and freeing: For CPU tensors, PyTorch uses its own allocator, which ultimately obtains aligned memory from the system allocator (malloc/free or an equivalent). For CUDA (GPU) tensors, PyTorch uses a caching allocator: when you free a GPU tensor, PyTorch doesn’t immediately return the memory to CUDA; instead it keeps it in a pool for quick reuse to avoid the overhead of cudaMalloc/cudaFree every time. A PyTorch developer explains this: “When a Tensor (or all Tensors referring to a memory block (a Storage)) goes out of scope, the memory goes back to the cache PyTorch keeps. You can free the memory from the cache using torch.cuda.empty_cache().” In practice, this means if you delete large GPU tensors, you might not see GPU memory drop in nvidia-smi until you clear the cache, but PyTorch will reuse that memory for subsequent tensors. You can inspect GPU usage with torch.cuda.memory_allocated() (bytes held by live tensors) and torch.cuda.memory_reserved() (bytes held by the caching allocator). On the CPU, any caching of small allocations is left to the system allocator.
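The caching behaviour can be observed with the CUDA memory statistics functions. The helper `cuda_cache_demo` is a hypothetical name; the sketch returns `None` on machines without a GPU, and the exact byte counts depend on the device and allocator state, so none are shown.

```python
import torch

def cuda_cache_demo():
    """Report allocator statistics around freeing a GPU tensor (None without CUDA)."""
    if not torch.cuda.is_available():
        return None
    t = torch.empty(1024, 1024, device="cuda")          # about 4 MiB of float32
    stats = {"allocated_with_tensor": torch.cuda.memory_allocated()}
    del t                                                # memory returns to the cache
    stats["allocated_after_del"] = torch.cuda.memory_allocated()
    stats["reserved_after_del"] = torch.cuda.memory_reserved()
    torch.cuda.empty_cache()                             # hand cached blocks back to CUDA
    stats["reserved_after_empty_cache"] = torch.cuda.memory_reserved()
    return stats

stats = cuda_cache_demo()
print(stats)
```

On a GPU machine, the allocated count typically drops right after `del`, while the reserved count drops only after `empty_cache()`.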

PyTorch integrates with Python’s garbage collector for GPU tensors as well – if a GPU tensor goes out of scope on the Python side, its memory is freed to the CUDA cache. One must be cautious with holding references to large tensors in Python; if you accidentally keep references (like in a list), the garbage collector won’t free them and you’ll have memory “leaks” (really just lingering objects).

Another aspect is PyTorch’s Storage class (in Python you can access it via tensor.storage(); in PyTorch 2.x, tensor.untyped_storage() is the preferred accessor). Storage in current PyTorch is a somewhat internal concept (the need for users to directly use it has reduced), but it’s exposed. Each Storage is a separate Python object as well (torch.Storage), which also participates in ref counting. If you call x.storage(), you get a handle to the Storage, which increases its refcount. Typically, you don’t need to manage this explicitly, but it’s good to know that tensor.storage() lets you see the base memory (and methods like storage().data_ptr() give you the memory address as an integer).

Memory sharing with other libraries: PyTorch can share memory with NumPy via torch.from_numpy and tensor.numpy(). When you do torch.from_numpy(ndarray), PyTorch creates a tensor that uses the NumPy array’s buffer as its Storage (without copying). It increments the reference count of the NumPy array (to prevent it from freeing while the tensor is alive). Similarly, tensor.numpy() returns an ndarray that shares the tensor’s storage (and it increments the tensor’s refcount). This interoperability is achieved by both sides coordinating reference counts (PyTorch uses Python’s C-API to increase the NumPy array’s refcount when needed).
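The tensor-to-NumPy direction shares memory in the same way. A minimal sketch for a CPU tensor, assuming NumPy is installed:

```python
import numpy as np
import torch

t = torch.zeros(3)
n = t.numpy()            # ndarray backed by the tensor's storage, no copy

t.add_(1)                # in-place update on the tensor ...
print(n)                 # [1. 1. 1.]  (... is visible through the array)

n[0] = 5.0               # and writes through the array reach the tensor
print(t)                 # tensor([5., 1., 1.])

same_address = n.ctypes.data == t.data_ptr()
print(same_address)      # True
```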

In short, PyTorch’s memory model is a blend of Python’s refcounting (for object lifetime) and its own C++ refcounting (for actual tensor memory). For users, the main points are: if no Python variable points to a tensor, it will be freed (or cached if GPU) automatically. You usually don’t manually free tensors (there’s no del tensor.data needed — just del tensor or let it go out of scope). And if you want to release GPU memory back to the system, you can use torch.cuda.empty_cache(), though this is rarely necessary unless memory fragmentation is an issue.

## Deep Dive into PyTorch Neural Networks (nn.Module & Optimization)


Building on these internals, let’s consider how nn.Module and the training loop work under the hood. The nn.Module class (defined in Python in torch/nn/modules/module.py) provides a framework for assembling layers and parameters. Key internal mechanisms include:


- Module initialization and assignment: When you assign an attribute to an nn.Module that is a Module or a Parameter (a subclass of Tensor), the __setattr__ override in Module will automatically register it. For example, in self.conv1 = nn.Conv2d(...), because conv1 is an nn.Module, __setattr__ adds it to self._modules (an OrderedDict of sub-modules). If you assign a torch.Tensor wrapped in nn.Parameter, it will be added to self._parameters. This means that by the end of your model’s __init__, PyTorch has catalogued all sub-layers and parameters. This is how model.parameters() knows what to return: it iterates over self._parameters and the parameters of sub-modules in self._modules. This registration is purely Python bookkeeping (see https://discuss.pytorch.org/t/distributed-data-parallel-module-attribute/142584/3).

- Forward pass: When you call output = model(input), the Module’s __call__ method (in Module base class) does some setup (like handling hooks) and then calls self.forward(input). The forward method is the one you define in your subclass (or for built-in layers, it’s defined in their class). Inside forward, you typically use operations like torch.matmul, F.relu, etc., or call other Modules. All those operations execute as discussed (mostly in C++ for the math). By the time forward completes, you have an output tensor (or multiple tensors). If any input or parameter had requires_grad=True (which by default, Parameters do), the output will have grad_fn attached, meaning the autograd graph is ready for backprop. Nothing special in the Module itself is needed for autograd beyond marking parameters with requires_grad; the autograd engine takes care of gradient tracking automatically.

- Loss computation and backward: You compute a loss (perhaps by a PyTorch loss function, e.g., nn.CrossEntropyLoss, which is also an nn.Module that uses low-level ops like log-softmax and NLL). The loss is a tensor, and you call loss.backward(). As detailed earlier, this initiates the autograd engine to compute gradients for all tensors that contributed to loss and have requires_grad=True. In a typical neural network, those are the model’s parameters. After this call, each param.grad now holds the gradient.

- Optimization step: Now torch.optim comes into play. Optimizers are implemented in Python. For example, ignoring momentum and weight decay, the SGD optimizer’s step() method behaves roughly like:

```
# Simplified sketch of SGD.step(); parameters live in self.param_groups
with torch.no_grad():                                   # do not record the update
    for group in self.param_groups:
        for param in group["params"]:
            if param.grad is None:
                continue
            param.add_(param.grad, alpha=-group["lr"])  # param -= lr * grad
```

Here param.add_(param.grad, alpha=-lr) is an in-place tensor operation that calls into the C++ kernel for addition, once per parameter tensor. The torch.no_grad() context keeps autograd from recording the update, so no graph is built for it. (Older code achieved the same effect by writing through param.data, which yields a tensor that shares the parameter’s storage but does not track gradients.) Some optimizers like Adam have more complex updates (with running-average buffers, etc.), but they similarly use tensor operations for the arithmetic (and sometimes a fused kernel for speed). After step(), your parameters have new values and are still Tensors with requires_grad=True.
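The forward, backward, and update steps above can be tied together in a small experiment. This sketch computes the plain gradient-descent update by hand and compares it with what `torch.optim.SGD` produces with its default settings (no momentum, no weight decay); the model and data are arbitrary placeholders.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(3, 1)
inputs, target = torch.randn(8, 3), torch.randn(8, 1)
lr = 0.1

# Forward and backward: fills p.grad for every parameter
loss = torch.nn.functional.mse_loss(model(inputs), target)
loss.backward()

# What plain gradient descent should produce, computed by hand
expected = [p.detach() - lr * p.grad for p in model.parameters()]

optimizer = torch.optim.SGD(model.parameters(), lr=lr)
optimizer.step()

matches = all(
    torch.allclose(p.detach(), e) for p, e in zip(model.parameters(), expected)
)
still_trainable = all(p.requires_grad for p in model.parameters())
print(matches, still_trainable)
```

The parameters keep `requires_grad=True` after the step, so the next forward pass builds a fresh graph over the updated values.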

One might wonder how the whole training loop avoids memory leaks since each iteration builds a new graph. The answer: after you call .backward(), by default PyTorch frees the graph to save memory (because it knows you won’t use that graph again unless you specified retain_graph). So on the next iteration, it starts from scratch building a new graph. The parameters themselves remain (with their .grad from last step, which you often zero out with optimizer.zero_grad() for the next iteration). This iterative process continues for each batch.
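The graph-freeing behaviour is observable directly: a second backward pass over the same graph fails unless the first one asked to retain it. A minimal sketch:

```python
import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)

loss = (x ** 2).sum()
loss.backward()                      # saved tensors are released after this call
try:
    loss.backward()                  # second pass over the same graph
    second_backward_failed = False
except RuntimeError:
    second_backward_failed = True

x.grad = None                        # reset before the next experiment
loss = (x ** 2).sum()
loss.backward(retain_graph=True)     # keep the graph's saved tensors alive
loss.backward()                      # now allowed; gradients accumulate
print(second_backward_failed, x.grad)
```

With `retain_graph=True` both passes succeed, and `x.grad` holds the sum of the two gradients (twice `2x`).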

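Zeroing grads: It’s important to reset gradients (param.grad) between training iterations, because backward() accumulates into existing .grad values by default. In recent PyTorch releases, optimizer.zero_grad() defaults to set_to_none=True and simply sets each param.grad to None. With set_to_none=False, it does roughly:

```
# Simplified sketch of Optimizer.zero_grad(set_to_none=False)
for group in self.param_groups:
    for param in group["params"]:
        if param.grad is not None:
            param.grad.detach_()   # cut any link to an autograd graph
            param.grad.zero_()     # fill with zeros in place
```

This uses in-place operations on the grad tensor to fill it with zeros. The detach_() ensures that if grad was somehow part of a graph, it is isolated (usually not needed for leaf grads). Like all in-place ops, these bump the tensor’s version counter, which autograd uses to detect saved tensors that have been modified.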
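The difference between accumulating, zeroing, and dropping gradients can be seen in a few lines. This sketch passes `set_to_none` explicitly so that it does not depend on the default of the installed PyTorch version.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)

(w * 3).sum().backward()
(w * 3).sum().backward()                  # no reset in between: grads add up
accumulated = w.grad.clone()              # 3 + 3 per element

optimizer.zero_grad(set_to_none=False)    # keep the tensor, fill it with zeros
zeroed = w.grad.clone()

(w * 3).sum().backward()
optimizer.zero_grad(set_to_none=True)     # drop the grad tensor entirely
grad_is_none = w.grad is None
print(accumulated, zeroed, grad_is_none)
```

Setting gradients to None avoids a memset per parameter, at the cost that code reading `param.grad` must handle the None case.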

nn.Module internal details: Modules also handle things like buffers (non-parameter Tensors like running mean in BatchNorm, stored in self._buffers), and they provide methods for moving to GPU (module.to(device) which calls .to on all parameters and buffers via C++ ops) and for saving/loading (via state_dict). These are mostly straightforward: e.g., .to(device) loops through parameters and buffers (in Python) and calls the C++ Tensor.to(...) for each, which moves data and returns a new tensor on device. The Python module then replaces the old tensor with the new one.
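To make the bookkeeping concrete, here is a small sketch showing which attributes end up in `_modules`, `_parameters`, and `_buffers`, and therefore in `state_dict()`. The `TinyNet` class and its attribute names are hypothetical.

```python
import torch
from torch import nn

class TinyNet(nn.Module):  # hypothetical example module
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 2)                       # registered in self._modules
        self.scale = nn.Parameter(torch.ones(2))        # registered in self._parameters
        self.register_buffer("calls", torch.zeros(1))   # registered in self._buffers
        self.note = torch.zeros(2)                      # plain tensor: not registered

    def forward(self, x):
        return self.fc(x) * self.scale

net = TinyNet()
param_names = [name for name, _ in net.named_parameters()]
buffer_names = [name for name, _ in net.named_buffers()]
state_keys = list(net.state_dict().keys())
print(param_names, buffer_names, state_keys)
```

The plain tensor `note` is invisible to `parameters()`, `state_dict()`, and `.to(device)`, which is a common source of device-mismatch bugs.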

In summary, nn.Module is a Python container that organizes your model’s parameters and layers, but when it comes to the actual math (forward and backward), it’s all done by the tensor operations and autograd engine as described earlier. The integration is such that you typically don’t notice where Python ends and C++ begins – you just write your forward in Python and call .backward(), and everything just works.

## Hands-On Examples and Function Tracing

## Example 1: Tensor operations and grad functions

```
x = torch.tensor([2.0, 3.0], requires_grad=True)
y = x * 3.0 + 4.0           # multiply then add
print(y)                    # tensor([10., 13.], grad_fn=<AddBackward0>)
print("y.grad_fn:", y.grad_fn)    # prints the grad function object, e.g., AddBackward0
z = y**2                    # square each element
print(z)                    # tensor([100., 169.], grad_fn=<PowBackward0>)
out = z.mean()              # average (scalar)
print(out)                  # tensor(134.5000, grad_fn=<MeanBackward0>)
out.backward()              # compute gradients
print(x.grad)               # tensor([30., 39.])
```

```
tensor([10., 13.], grad_fn=<AddBackward0>)
y.grad_fn: <AddBackward0 object at 0x1066d56f0>
tensor([100., 169.], grad_fn=<PowBackward0>)
tensor(134.5000, grad_fn=<MeanBackward0>)
tensor([30., 39.])
```


Here, y = x*3.0 + 4.0 results in y = [10., 13.] with grad_fn=<AddBackward0>, as expected (AddBackward0 for the addition).

The grad_fn of y internally connects to a MulBackward0 (for the multiplication part) as well, but PyTorch shows only the top-level node. The final output out is the mean of y**2, which is $\frac{10^2 + 13^2}{2} = \frac{269}{2} = 134.5$.

Its grad_fn is <MeanBackward0>. After out.backward(), the gradient x.grad is [30., 39.]. We can verify this manually:

$y = 3x + 4$, so $\frac{\partial y_i}{\partial x_i} = 3$.

$z = y^2$, so $\frac{\partial z_i}{\partial y_i} = 2y_i$.

$out = \text{mean}(z) = \frac{z_1+z_2}{2}$, so $\frac{\partial out}{\partial z_i} = \frac{1}{2}$.

By the chain rule: $\frac{\partial out}{\partial x_i} = \frac{\partial out}{\partial z_i}\frac{\partial z_i}{\partial y_i}\frac{\partial y_i}{\partial x_i}$.

For each element: $\frac{\partial out}{\partial x_i} = \frac{1}{2} \cdot 2y_i \cdot 3 = 3y_i$. The factor $\frac{1}{2}$ from the mean cancels the factor 2 from the square, so $\frac{\partial out}{\partial y} = [10, 13]$ and $\frac{\partial out}{\partial x} = 3 \cdot [10, 13] = [30, 39]$, which matches the printed x.grad.
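The chain-rule result $\partial out/\partial x_i = 3y_i$ can also be checked in code rather than by hand. A minimal sketch that recomputes the example and compares autograd's answer with the analytic one:

```python
import torch

x = torch.tensor([2.0, 3.0], requires_grad=True)
y = x * 3.0 + 4.0
out = (y ** 2).mean()
out.backward()

analytic = 3.0 * y.detach()          # d(out)/dx_i = 3 * y_i from the chain rule
print(out.item(), x.grad, analytic)
```

Comparing against a hand-derived formula like this is a cheap sanity check whenever a printed gradient looks surprising.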


## Example 2: Autograd with a simple linear model

```
# Simple linear regression example
w = torch.tensor([2.0], requires_grad=True)
b = torch.tensor([1.0], requires_grad=True)

# Forward pass for some x
x = torch.tensor([3.0])
y = w * x + b             # linear model output
y_true = 8.0              # target value
loss = (y - torch.tensor([y_true]))**2  # squared-error loss for target y_true
print(loss)               # tensor([1.], grad_fn=<PowBackward0>)
loss.backward()           # compute gradients

print(w.grad)  # gradient of loss w.r.t. w
print(b.grad)  # gradient of loss w.r.t. b
```

```
tensor([1.], grad_fn=<PowBackward0>)
tensor([-6.])
tensor([-2.])
```


In this example, w and b are parameters. The forward pass computes y = 2*3 + 1 = 7, and with y_true = 8 the loss is (7-8)^2 = 1.

The backward pass yields w.grad = 2 * (y - y_true) * x = 2 * (-1) * 3 = -6 and b.grad = 2 * (y - y_true) * 1 = -2 by the chain rule, matching the printed values (these are the correct gradients for a squared-error loss on a linear model). The gradients are stored in the .grad fields of the leaf tensors w and b. If we ran another forward and backward pass without zeroing them, the new gradients would be added to these values (which is why we zero grads each iteration in training).
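The accumulation into `.grad` can be demonstrated by repeating the forward and backward pass on the same leaf tensors. In this sketch the helper `squared_error` is hypothetical; it rebuilds the graph on every call because the previous graph is freed by `backward()`.

```python
import torch

w = torch.tensor([2.0], requires_grad=True)
b = torch.tensor([1.0], requires_grad=True)
x = torch.tensor([3.0])
y_true = torch.tensor([8.0])

def squared_error():
    """Hypothetical helper: fresh forward pass, returns a scalar loss."""
    return ((w * x + b - y_true) ** 2).sum()

squared_error().backward()
first = (w.grad.item(), b.grad.item())     # gradients of one pass

squared_error().backward()                 # no zeroing in between
second = (w.grad.item(), b.grad.item())    # previous values plus the new ones

w.grad, b.grad = None, None                # manual reset
squared_error().backward()
fresh = (w.grad.item(), b.grad.item())
print(first, second, fresh)
```

The second pair is exactly twice the first, and resetting restores the single-pass values.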


## Example 3: Inspecting the computation graph structure
 
We can’t easily print the whole graph, but if you want to peek at the backward graph, you can look at tensor.grad_fn.next_functions. Each grad_fn has a .next_functions attribute, a tuple of (function, input index) pairs that point to the preceding grad_fns (the function is None for inputs that do not require grad). For instance:

```
# Create a tensor with requires_grad=True for this example
x_example = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x_example * 2 + 1
print(y.grad_fn)              # AddBackward0
print(y.grad_fn.next_functions)
# ((<MulBackward0 object at 0x...>, 0), (None, 0))
```

```
<AddBackward0 object at 0x13694ca60>
((<MulBackward0 object at 0x13694e320>, 0), (None, 0))
```


This tells us AddBackward0 has two next functions: one is MulBackward0 (for the x*2 part), and the other is None (for the constant 1, which doesn’t require grad, so there is no backward node for it).

If we further inspect MulBackward0.next_functions, we find an AccumulateGrad entry for x and None for the scalar 2 (which doesn’t require grad). This low-level detail shows how PyTorch represents the graph: AccumulateGrad is the leaf node that accumulates gradient into x.grad. While you typically don’t need to delve into this, it’s reassuring that you could navigate the graph if needed (for debugging or understanding).
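Navigating the graph by hand gets tedious, so a short recursive walk over `next_functions` is handy for debugging. The `walk` helper below is a hypothetical sketch; it relies only on `grad_fn.next_functions` and skips the None entries that stand for inputs without gradients.

```python
import torch

def walk(fn, depth=0, lines=None):
    """Hypothetical helper: collect an indented list of backward node names."""
    if lines is None:
        lines = []
    if fn is None:                       # input that does not require grad
        return lines
    lines.append("  " * depth + type(fn).__name__)
    for next_fn, _input_index in fn.next_functions:
        walk(next_fn, depth + 1, lines)
    return lines

x_example = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x_example * 2 + 1
graph_lines = walk(y.grad_fn)
print("\n".join(graph_lines))
```

For this expression the walk visits the addition node, then the multiplication node, and finally the AccumulateGrad leaf for `x_example`.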

## Example 4: Custom autograd Function (advanced)

As an illustration, one could define:

```
class ExpFunction(torch.autograd.Function):
    @staticmethod
    def forward(ctx, input):
        result = input.exp()           # e^x
        ctx.save_for_backward(result)  # save output for backward
        return result

    @staticmethod
    def backward(ctx, grad_output):
        (saved_result,) = ctx.saved_tensors
        grad_input = grad_output * saved_result  # derivative of e^x is e^x
        return grad_input

# Use the custom Function
inp = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
out = ExpFunction.apply(inp)
out_sum = out.sum()
out_sum.backward()
print(inp.grad)  # should be [e^1, e^2, e^3]
```

```
tensor([ 2.7183,  7.3891, 20.0855])
```


Here we created a new operation ExpFunction that computes $e^x$ with a custom backward. PyTorch will execute the Python forward, and when backward is called, it invokes our backward. The gradient printed should match the exponential of the inputs. Internally, PyTorch treated ExpFunction as a node in the autograd graph like any other – the engine called its backward when needed. This extensibility is powerful for implementing custom gradients or operations that PyTorch doesn’t provide out-of-the-box.
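A hand-written backward is easy to get subtly wrong, so it is worth comparing it against numerical differentiation. The sketch below repeats the ExpFunction definition to stay self-contained and runs `torch.autograd.gradcheck` on it; the double-precision input and the tolerance values are choices made for this example, not requirements stated in the text above.

```python
import torch

class ExpFunction(torch.autograd.Function):
    @staticmethod
    def forward(ctx, input):
        result = input.exp()
        ctx.save_for_backward(result)          # reuse the output in backward
        return result

    @staticmethod
    def backward(ctx, grad_output):
        (saved_result,) = ctx.saved_tensors
        return grad_output * saved_result      # d/dx e^x = e^x

# gradcheck compares the analytic backward with finite differences;
# float64 inputs keep the numerical error small.
inp = torch.tensor([1.0, 2.0, 3.0], dtype=torch.float64, requires_grad=True)
passed = torch.autograd.gradcheck(ExpFunction.apply, (inp,), eps=1e-6, atol=1e-4)
print(passed)
```

If the backward formula were wrong (for example, returning `grad_output` alone), `gradcheck` would raise an error describing the mismatch instead of returning True.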

## References and Further Reading

- PyTorch Official Docs & Tutorials: The PyTorch documentation has sections on “Autograd Mechanics” and the “C++ API (ATen) design” which provide more insights. The “Autograd: automatic differentiation” tutorial is a gentle introduction.
- PyTorch Internals Blog Posts:
    - “PyTorch internals” by Edward Z. Yang: an excellent blog post (with slides) giving an overview of PyTorch’s architecture. It discusses Tensor/Storage (with diagrams) and the dispatch mechanism for ops.
    - PyTorch Developer Blog: “How Computational Graphs are Constructed in PyTorch” and “How Computational Graphs are Executed in PyTorch” on the official PyTorch blog dive into autograd details. These posts by Alban D, et al. walk through creating the graph and then executing it, referencing the actual code (Engine, etc.).
    - Christian Perone’s “PyTorch – Internal Architecture Tour”: a 2018 blog post covering tensor storage, the Python extension mechanism, etc. While a bit outdated (it refers to Variable, which is now merged with Tensor), it’s still informative.
- Important Source Files on GitHub (if you are curious to read the source):
    - torch/tensor.py (Python; torch/_tensor.py in recent releases): defines the Tensor class that wraps _C._TensorBase. Many Python-side tensor methods simply call torch._C functions.
    - torch/csrc/autograd/python_variable.cpp (C++): implements the bridge for Tensors (defines the Python type and methods in C++). This is where you’ll find THPVariable_* functions and the mapping of Python operations to C++ calls.
    - torch/csrc/autograd/engine.cpp: the C++ autograd Engine implementation (if you want to see how the backward graph is executed in code).
    - tools/autograd/derivatives.yaml: the definitions of gradients for many operations, which are used to generate the backward Functions in C++.
    - aten/src/ATen/native/: C++ implementations of various ops (CPU and CUDA). For example, LinearAlgebra.cpp has matmul, etc. This is deep in the weeds, but shows the low-level code that ultimately runs for torch operations.
- PyTorch Forums and Q&A: Many details have been discussed on the PyTorch forums, e.g., threads explaining the difference between Tensor.data and Tensor.detach(), how the caching allocator works, etc. The Stack Overflow Q&A we cited about tensor.item() is a good example that uncovers how Python methods map to C++ methods through the PyTorch C API.

By exploring these resources and experimenting with code, you can deepen your understanding of PyTorch’s internals. The design of PyTorch allows high-level flexibility (dynamic graphs, Pythonic interface) without sacrificing low-level performance – a balance achieved through the internals we’ve discussed: the Tensor/Storage system, the dynamic autograd engine, and the seamless Python/C++ integration.