# Introduction

A distributed file system (DFS) is an abstraction layer. The primary purpose of a distributed system (DS) is to connect users and resources. Resources can be inherently distributed, and they are often data (files, databases). Their availability is therefore crucial to the performance of a DS.

You put your files on some servers, and all you see is a cloud that holds them. You don't care how they are distributed. Users just see their files and access them through a service that provides transparency.

Simply putting your files on multiple servers does not make a distributed file system, because you still need to know where you put each file.

In the first generation of distributed systems, file systems were the only networked storage systems. With the advent of distributed object systems and the web, the picture has become more complex.

<img src="img/img48.png" width="500">

The web does not guarantee consistency across multiple copies. Stock data fetched from a web server could be many hours old, while the original data at the New York Stock Exchange may have a different value.

Distributed shared memory: The main memory of a machine is shared with lots of users.

# File System

A file system provides persistently stored data sets.

A hierarchical name space is visible to all processes. In a distributed setting you want a similar user interface: you want to access each file regardless of which machine it is on. For example, in your PC's file interface all you see are directories and files. In a distributed system the files may be spread over different machines, yet you don't want to access a file as $this\_machine.directory.file$; you just want $directory.file$.

API with the following characteristics:
* access and update operations on persistently stored data sets
* sequential access model (with additional random access facilities)
* the API lets you operate on files as if they were on a single machine

Sharing of data between users, with access control

Concurrent access:
* Straightforward for read-only access
* One copy: low fault tolerance
* Multiple copies: consistency becomes an issue
* If two people work on two copies of the same file, which version is eventually kept?
* Naive solution: a lock, which allows one copy to be maintained consistently over time
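The naive locking solution can be sketched as follows. This is a minimal single-process illustration, assuming threads stand in for concurrent clients; `LockedFile` and `run_writers` are hypothetical names, not part of any real file system API.

```python
import threading


class LockedFile:
    """One authoritative copy of a file, guarded by a single file-level lock."""

    def __init__(self, content=""):
        self._lock = threading.Lock()
        self.content = content
        self.version = 0

    def append(self, text):
        # Only one writer at a time may enter; the others block until release.
        with self._lock:
            self.content += text
            self.version += 1
            return self.version


def run_writers(shared, writers, writes_each):
    """Start several threads (stand-ins for clients) that append to one file."""
    def work(name):
        for _ in range(writes_each):
            shared.append(name)

    threads = [threading.Thread(target=work, args=(str(i),)) for i in range(writers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


shared_file = LockedFile()
run_writers(shared_file, writers=4, writes_each=100)
print(shared_file.version, len(shared_file.content))  # 400 400
```

Every update is applied to the single copy in some serial order, so no update is lost. The cost is contention: all writers queue on the same lock.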

Unix file system operations can be used to construct more sophisticated functions.

Typical module structure for non-DFS:

<img src="img/img49.png" width="500">

<img src="img/img50.png" width="500">

Reference count - number of links to the file

# Distributed File System

Transparency
* Access: Same operations as in a non-distributed file system (client programs are unaware of the distribution of files)
* Location: Same name space after relocation of files or processes; client programs should see a uniform file name space
* Mobility: Automatic relocation of files is possible (neither client programs nor system admin tables in client nodes need to be changed when files are moved). If a server goes down and files are moved, a file should still be accessible via the same file ID, without the client knowing that its location has changed.
* Performance: You should not be able to tell whether you are reading from a local or a remote hard disk, even though their raw speeds differ.
* Scaling: Service can be expanded to meet additional loads or growth.

Concurrency
* Changes to a file by one client should not interfere with the operation of other clients simultaneously accessing or changing the same file.
* Properties:
    * Isolation
    * File-level or record-level lock
    * Other forms of concurrency control to minimize contention
    
Replication
* Load-sharing between servers makes service more scalable
* Local access has better response(low latency)
* Fault tolerance
* Caching (of all or part of a file) gives most of the benefits. When a client writes the file, all cached copies must be updated.

Heterogeneity
* Service can be accessed by clients running on (almost) any OS or hardware platform
* Design must be compatible with the file systems of different OSes.
* Service interface must be open - precise specifications of APIs are published

Fault tolerance
* Service must continue to operate even when clients make errors or crash
* Service must resume after a server machine crashes
* If the service is replicated, it can continue to operate even during a server crash

Consistency
* Unix offers one-copy update semantics for operations on local files - caching is completely transparent
* Difficult to achieve the same for distributed file systems while maintaining good performance and scalability

Security
* Must maintain access control and privacy as for local files
    * based on identity of user making request
    * identities of remote users must be authenticated
    * privacy requires secure communication
* Service interfaces are open to all processes not excluded by a firewall, so they are vulnerable to impersonation and other attacks

Efficiency
* The goal for a distributed file system is usually performance comparable to that of a local file system

## File Service Architecture

An architecture that offers a clear separation of the main concerns in providing access to files is obtained by structuring the file service as three components:
* A flat file service
* A directory service
* A client module

The client module implements the interfaces exported by the flat file and directory services on the server side.

Model file service architecture:

<img src="img/img51.png" width="500">

* Flat file service: Concerned with implementing operations on the contents of files. Unique File Identifiers (UFIDs) are used to refer to files in all requests for flat file service operations. UFIDs are long sequences of bits chosen so that each file has a UFID that is unique among all of the files in a distributed system.
* Directory service: Provides a mapping between text names for files and their UFIDs. Clients may obtain the UFID of a file by quoting its text name to the directory service. The directory service supports the functions needed to generate directories and to add new files to directories.
* Client module: Runs on each computer and provides an integrated service (flat file and directory) as a single API to application programs. For example, on Unix hosts a client module emulates the full set of Unix file operations. It holds information about the network locations of the flat file and directory server processes, and it achieves better performance by caching recently used file blocks at the client.
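The division of labour between the three components can be sketched in a few lines. This is an in-memory illustration only: the class and method names are hypothetical, there is no network or caching, and a random 128-bit value is assumed as a stand-in for a UFID.

```python
import uuid


class FlatFileService:
    """Operates on file contents only; files are identified by UFID."""

    def __init__(self):
        self._contents = {}  # UFID -> bytearray

    def create(self):
        ufid = uuid.uuid4().hex  # assumption: random 128-bit value as the UFID
        self._contents[ufid] = bytearray()
        return ufid

    def write(self, ufid, offset, data):
        buf = self._contents[ufid]
        if offset > len(buf):
            buf.extend(b"\x00" * (offset - len(buf)))  # pad any gap with zeros
        buf[offset:offset + len(data)] = data

    def read(self, ufid, offset, count):
        return bytes(self._contents[ufid][offset:offset + count])

    def length(self, ufid):
        return len(self._contents[ufid])


class DirectoryService:
    """Maps text names to UFIDs; knows nothing about file contents."""

    def __init__(self):
        self._names = {}  # text name -> UFID

    def add_name(self, name, ufid):
        self._names[name] = ufid

    def lookup(self, name):
        return self._names[name]  # raises KeyError for an unknown name


class ClientModule:
    """Combines both services behind one name-based API for applications."""

    def __init__(self, files, directory):
        self._files = files
        self._directory = directory

    def write_file(self, name, data):
        try:
            ufid = self._directory.lookup(name)
        except KeyError:
            ufid = self._files.create()
            self._directory.add_name(name, ufid)
        self._files.write(ufid, 0, data)  # overwrites from offset 0, no truncation

    def read_file(self, name):
        ufid = self._directory.lookup(name)
        return self._files.read(ufid, 0, self._files.length(ufid))


client = ClientModule(FlatFileService(), DirectoryService())
client.write_file("notes.txt", b"hello")
print(client.read_file("notes.txt"))  # b'hello'
```

The application only ever uses text names; UFIDs stay internal to the client module and the two services.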

The operating process: an application program opens a file by calling the client module, which communicates with the directory and flat file services to locate the file. Multiple clients should be able to do the same thing.

Pathname lookup: Pathnames such as '/usr/bin/tar' are resolved by iterative calls to *lookup()*, one call for each component of the path, starting with the ID of the root directory '/', which is known in every client.
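The iterative resolution can be sketched as below. The directory table, the numeric IDs, and the `resolve` helper are made up for illustration; in a real system each `lookup` call would be a request to the directory service.

```python
ROOT_ID = 1  # assumed well-known ID of the root directory '/'

# Hypothetical directory contents: directory ID -> {component name -> ID}
DIRECTORIES = {
    1: {"usr": 2},
    2: {"bin": 3},
    3: {"tar": 42},
}


def lookup(dir_id, name):
    """Resolve one path component inside one directory."""
    try:
        return DIRECTORIES[dir_id][name]
    except KeyError:
        raise FileNotFoundError(name) from None


def resolve(path):
    """Resolve an absolute pathname with one lookup() call per component."""
    current = ROOT_ID
    for component in path.strip("/").split("/"):
        if component:  # skip the empty component produced by '/'
            current = lookup(current, component)
    return current


print(resolve("/usr/bin/tar"))  # 42
```

Resolving '/usr/bin/tar' costs three lookups here, which is one reason a client cache of recent lookups pays off.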

## File Group

A collection of files that can be located on any server or moved between servers while maintaining the same names.
* Similar to a Unix filesystem
* Helps with distributing the load of file serving between several servers
* File groups have identifiers which are unique throughout the system (and hence for an open system, they must be globally unique)

To construct a globally unique ID we use some unique attribute of the machine on which it is created, e.g. its IP address, even though the file group may move subsequently.

<img src="img/img52.png" width="200">
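One way to build such an identifier is sketched below. The layout is an assumption made for illustration: a 32-bit IPv4 address of the creating machine followed by a 16-bit creation-date field. The function names are hypothetical.

```python
import ipaddress


def make_file_group_id(creator_ip, day_number):
    """Pack the creator's IPv4 address and a 16-bit date field into one integer."""
    ip_bits = int(ipaddress.IPv4Address(creator_ip))  # 32 bits
    if not 0 <= day_number < 2 ** 16:
        raise ValueError("day_number must fit in 16 bits")
    return (ip_bits << 16) | day_number


def creator_of(group_id):
    """Recover the IP address of the machine that created the file group."""
    return str(ipaddress.IPv4Address(group_id >> 16))


gid = make_file_group_id("192.168.1.10", 123)
print(hex(gid), creator_of(gid))
```

The address only records where the group was created. After the group moves, the ID stays the same, so it cannot be used on its own to locate the group.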

# Sun NFS

* An industry standard for file sharing on local networks since the 1980s
* An open standard with clear and simple interfaces
* Closely follows the abstract file service model defined above

Architecture:

<img src="img/img53.png" width="500">

The virtual file system hides the distinction: you cannot tell whether a file is remote or local.

A client can also act as a server; in that sense the architecture resembles P2P.

The implementation doesn't have to be in the system kernel. There are examples of NFS clients and servers that run at application level as libraries or processes (e.g. early Windows and MacOS implementations).

But, for a Unix implementation(in system kernel) there are advantages:
* Binary code compatible - no need to recompile applications on different machines
* Shared cache of recently-used blocks at client
* Kernel-level server can access i-nodes and file blocks directly
* Efficient, with security embedded in the kernel

## Access Control and Authentication

The server is stateless: it keeps no memory of a user's earlier actions, so the user's identity and access rights must be checked by the server on each request. In a local file system they are checked only on *open()*.

Every client request is accompanied by the userID and groupID, which are inserted by the RPC system.

Server is exposed to imposter attacks unless the userID and groupID are protected by encryption.
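The per-request check can be sketched as follows. This is a simplified model, not the NFS protocol: the `Credentials`, `FileMeta`, and `StatelessServer` names are hypothetical, and only Unix-style read permission bits are considered.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Credentials:
    user_id: int
    group_id: int


@dataclass(frozen=True)
class FileMeta:
    owner_id: int
    group_id: int
    mode: int  # Unix-style permission bits, e.g. 0o640


def may_read(cred, meta):
    """Apply owner, then group, then other read bits."""
    if cred.user_id == meta.owner_id:
        return bool(meta.mode & 0o400)
    if cred.group_id == meta.group_id:
        return bool(meta.mode & 0o040)
    return bool(meta.mode & 0o004)


class StatelessServer:
    """Keeps no per-client state, so every request carries credentials."""

    def __init__(self, files):
        self._files = files  # name -> (FileMeta, bytes)

    def read(self, cred, name):
        meta, data = self._files[name]
        if not may_read(cred, meta):  # checked on every call, not only at open
            raise PermissionError(name)
        return data


server = StatelessServer({"report": (FileMeta(1000, 50, 0o640), b"secret")})
print(server.read(Credentials(1000, 50), "report"))
```

Note that the server simply trusts the IDs in the request. Any client that sends `Credentials(1000, 50)` is treated as the owner, which is exactly the impersonation risk described above.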

Kerberos has been integrated with NFS to provide a stronger and more comprehensive security solution.

## Architecture Components(Unix)

Server
* nfsd: NFS server daemon that services requests from clients
* mountd: NFS mount daemon that carries out the mount request passed on by nfsd
* rpcbind: RPC port mapper used to locate the nfsd daemon
* /etc/exports: configuration file that defines which portions of the file system are exported through NFS and how

Client
* mount: standard file system mount command
* /etc/fstab: file system table file
* nfsiod: (optional) local asynchronous NFS I/O server

## Mount service

Mount: finds where a file ID lives on the network. The client sends a request for a file, which is converted into a mount call; the remote server then mounts that part of its file system and makes it visible to the client.

Mount operation: $mount(remotehost, remotedirectory, localdirectory)$

Server maintains a table of clients who have mounted file systems at that server.

Each client maintains a table of mounted file systems holding: $<IP\space address, port\space number, file\space handle>$
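A client-side mount table of this shape can be sketched as below. The class names, the example address, and the file handle value are hypothetical; the sketch only shows how a local path is mapped to the server entry that covers it.

```python
from typing import NamedTuple


class MountEntry(NamedTuple):
    ip_address: str
    port: int
    file_handle: bytes  # opaque handle of the remote directory (made up here)


class MountTable:
    """Maps local mount points to <IP address, port number, file handle>."""

    def __init__(self):
        self._entries = {}  # local directory -> MountEntry

    def mount(self, local_directory, entry):
        self._entries[local_directory.rstrip("/")] = entry

    def find(self, path):
        """Return (entry, path relative to the mount point), or None if local."""
        # Try the longest mount point first so nested mounts resolve correctly.
        for mount_point in sorted(self._entries, key=len, reverse=True):
            if path == mount_point or path.startswith(mount_point + "/"):
                return self._entries[mount_point], path[len(mount_point) + 1:]
        return None


table = MountTable()
students = MountEntry("10.0.0.5", 2049, b"fh-students")
table.mount("/usr/students", students)
print(table.find("/usr/students/jon/notes.txt"))
print(table.find("/etc/passwd"))  # None: not under any remote mount point
```

A path that matches no entry is served by the local file system; a matching path is forwarded to the server named in the entry.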

Hard mount: Once a set of files is mounted successfully, you can access those files permanently. A call does not complete until the server responds, so in case of a failure the application just waits and never sees that the system is distributed.

Soft mount: You access files on demand and wait for a while. If there is a failure (timeout), the application interface is made aware of it (for example, through an exception) and you can take action.
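The difference between the two modes can be sketched with a retry loop. This is a toy model under stated assumptions: `ServerUnavailable`, `call_with_mount_semantics`, and `make_flaky_request` are hypothetical names, and a failed request is simulated rather than sent over a network.

```python
import time


class ServerUnavailable(Exception):
    """Raised by a simulated request when the server does not answer."""


def call_with_mount_semantics(request, hard, retries=3, delay=0.0):
    """Hard mount: retry until success. Soft mount: give up after `retries`."""
    attempts = 0
    while True:
        try:
            return request()
        except ServerUnavailable:
            attempts += 1
            if not hard and attempts >= retries:
                # Soft mount: surface the failure to the application.
                raise TimeoutError("server did not respond") from None
            time.sleep(delay)  # hard mount (or retries left): wait and retry


def make_flaky_request(failures):
    """Build a request that fails `failures` times, then returns data."""
    remaining = [failures]

    def request():
        if remaining[0] > 0:
            remaining[0] -= 1
            raise ServerUnavailable()
        return b"data"

    return request


print(call_with_mount_semantics(make_flaky_request(5), hard=True))
try:
    call_with_mount_semantics(make_flaky_request(5), hard=False, retries=3)
except TimeoutError as exc:
    print("soft mount failed:", exc)
```

With a hard mount the caller cannot distinguish a slow server from a dead one; with a soft mount it must be written to handle the error.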

Local and remote file systems accessible on an NFS client:

<img src="img/img54.png" width="500">

## Automounter

It deals with the problem that you don't know where to mount or what to mount.

The NFS client catches attempts to access 'empty' mount points and routes them to the Automounter. The Automounter has a table of mount points and multiple candidate servers for each. It sends a probe message to each candidate server and then uses the mount service to mount the file system at the first server to respond.

This keeps the mount table small.

Provides a simple form of replication for read-only file systems. If there are several servers with identical copies of /usr/lib then each server will have a chance of being mounted at some clients.
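The probe-and-mount behaviour can be sketched as follows. This is a simulation, not real automounter code: the probe is a plain function returning a response time (or `None` for no response), and the server names and latencies are invented.

```python
def first_responder(probe_results):
    """Pick the server with the smallest response time; None means no reply."""
    responders = {s: t for s, t in probe_results.items() if t is not None}
    if not responders:
        raise ConnectionError("no candidate server responded")
    return min(responders, key=responders.get)


class Automounter:
    """Mounts a file system on first access, choosing among candidate servers."""

    def __init__(self, candidates, probe):
        self._candidates = candidates  # mount point -> list of server names
        self._probe = probe            # server name -> response time or None
        self.mounted = {}              # mount point -> chosen server

    def access(self, mount_point):
        if mount_point not in self.mounted:  # 'empty' mount point: mount now
            results = {s: self._probe(s) for s in self._candidates[mount_point]}
            self.mounted[mount_point] = first_responder(results)
        return self.mounted[mount_point]


# Hypothetical probe results: serverB is down, serverC answers fastest.
latencies = {"serverA": 0.030, "serverB": None, "serverC": 0.012}
auto = Automounter({"/usr/lib": ["serverA", "serverB", "serverC"]}, latencies.get)
print(auto.access("/usr/lib"))  # serverC
print(auto.mounted)
```

Only mount points that are actually accessed end up in `mounted`, and different clients may pick different replicas depending on which server answers them first.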

# New design approaches

Distribute a single file across several servers.

Serverless architecture - like P2P
* Exploits processing and disk resources in all available network nodes
* Service is distributed at the level of individual files
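One simple way to spread a single file over several servers is round-robin block striping, sketched below. Striping is used here only as an illustrative assumption; the notes do not say which placement scheme such systems use, and `stripe` and `reassemble` are hypothetical helpers.

```python
def stripe(data, servers, block_size):
    """Assign consecutive blocks of `data` to servers in round-robin order."""
    placement = [[] for _ in range(servers)]
    for start in range(0, len(data), block_size):
        block_index = start // block_size
        placement[block_index % servers].append(data[start:start + block_size])
    return placement


def reassemble(placement):
    """Read the blocks back in their original order."""
    out = []
    longest = max(len(blocks) for blocks in placement)
    for row in range(longest):
        for blocks in placement:
            if row < len(blocks):
                out.append(blocks[row])
    return b"".join(out)


placement = stripe(b"abcdefghij", servers=3, block_size=2)
print(placement)
print(reassemble(placement))  # b'abcdefghij'
```

Because consecutive blocks live on different servers, a large sequential read can be served by several nodes in parallel.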