<a href="https://colab.research.google.com/github/annesjyu/tifp2024/blob/main/Data_Engr_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 2. Working in POSIX Environments



## 2.1 POSIX

POSIX (Portable Operating System Interface) is a family of standards, specified as IEEE 1003.1, that defines the interface between application programs (along with command-line shells and utility interfaces) and Unix-like operating systems. It was created to standardize the functionality and behaviour of Unix-like operating systems. It specifies APIs (Application Programming Interfaces) for various components, including file I/O, process management, inter-process communication, and more. Software written to the POSIX standards can be run across different Unix-like systems without major modifications or compatibility issues.

### UNIX

UNIX is an operating system which was first developed in the 1960s. An operating system can be thought of as a collection of programs that make the computer work. It is a *stable*, *multi-user*, *multi-tasking* system for servers, desktops and laptops. UNIX systems also have a Graphical User Interface (GUI) similar to Microsoft Windows which provides an easy to use environment. However, knowledge of UNIX is required for operations which aren't covered by a graphical program, or for when there is no GUI available, for example, a telnet session.

#### Types

There are many different versions of UNIX, although they share common features. The most popular varieties are Sun Solaris, GNU/Linux, and macOS (formerly Mac OS X).

### Linux

Linux, on the other hand, is a free and open-source operating system kernel that was created by Linus Torvalds in 1991. Linux was developed as a Unix-like system, inspired by Minix and older Unix branches. Unlike Unix, Linux is not a single operating system but rather a kernel that can be combined with various software packages to create complete operating systems called distributions. These distributions, such as Ubuntu, include the Linux kernel and additional software to provide a complete computing environment. The kernel is released under the GNU General Public License (GPL).

### Architecture

The UNIX / Linux operating system is made up of three parts: the **kernel**, the **shell** and the **application** layer.

* **Kernel** is the hub of the operating system which allocates resources to programs and handles the file storage and communications in response to system calls.
* **Shell** acts as an interface between the user and the kernel. When a user logs in, the login program verifies the username and password, and starts another program called the shell. The shell is a command line interpreter (CLI). It interprets the commands the user types in and arranges for them to be carried out. The commands are themselves programs: when they terminate, the shell gives the user another prompt.
* **Application** layer runs many user and background processes. A process is an executing program identified by a unique PID (process identifier).
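
The idea that every running program is a process with its own PID can be observed directly from the shell. The following is a minimal sketch, assuming a bash shell; the variable names `shell_pid` and `child_pid` are illustrative only.

```bash
# $$ expands to the PID of the shell that is running this code.
shell_pid=$$
echo "PID of this shell: $shell_pid"

# A child shell is a separate process, so it reports a different PID.
child_pid=$(bash -c 'echo $$')
echo "PID of a child shell: $child_pid"
```

Each new program the shell starts gets its own PID, which is why the two numbers differ.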

### Ubuntu

Ubuntu is a Linux-based operating system distribution. It was created by Canonical Ltd. and released in 2004 with the goal of making Linux more accessible to a wider range of users. Ubuntu builds upon user-friendly interfaces, and software ecosystem. It aims to provide a complete and polished user experience, focusing on ease of use, regular releases, and a large community of users and developers. Ubuntu also uses the APT package management system and offers a wide variety of software applications through its repositories. It is known for its strong focus on desktop usability but also has versions tailored for servers, cloud environments, and other specialized use cases.

### Files and Processes

Everything in UNIX is either a **file** or a **process**. A file is a collection of data; files are created by users with text editors, by running compilers, and so on. A process is simply an executing program, as mentioned previously.

Examples of files:

- A document (report, essay etc.).
- The text of a program written in some high-level programming language.
- Instructions comprehensible to the machine and incomprehensible to users, for example, a collection of binary digits (an executable or binary file).
- A directory, containing information about its contents, which may be a mixture of other sub-directories and ordinary files.

### The Directory Structure

All files are grouped together in the directory structure. The file-system is arranged in a *hierarchical* structure, like a tree data structure. The top of the hierarchy is known as **root** (written as a forward slash `/` ).

In [None]:
!ls -la /

In [None]:
!sudo apt-get install tree

In [None]:
!tree -L 2 /

Note:
1. The full path to `config.gz` is /proc/config.gz, within which `proc` is the intermediate directory (or dir).

2. Linux adheres to the **Filesystem Hierarchy Standard** (FHS) for directory and file naming. This standard allows users and software programs to predict the location of files and directories. The root level directory is represented by the forward slash `/`. At the root level, systems could include these directories:

| Directories | Description                                                  |
| :---------: | :----------------------------------------------------------- |
|      /      | Root Directory: the highest level of the file system tree    |
|    /bin     | Contains essential executable programs (/bin/cat, /bin/rm, /bin/cp) |
|    /boot    | Boot Directory: Holds essential files to boot the system such as  the Linux kernel and associated configuration files. It contains static files of boot loader GRUB (GRand Unified Bootloader) |
|    /dev     | Populated with files that represent hardware devices and other special files such as the /dev/null and /dev/zero files. |
|    /etc     | Holds configuration files of the Linux system  (/etc/inittab, /etc/group, /etc/hosts) |
|    /home    | User Directory: User’s workspace to work with files          |
|    /lib     | Libraries Directory: Contains libraries that are used by programs in the /bin and /sbin directories |
|   /media    | Mountpoints for Removable Media                              |
|    /mnt     | Mountpoint for Temporarily Mounted File System               |
|    /opt     | Application Directory: Stores installed programs (/opt/GNOME). Optional third party software installation location |
|    /proc    | A  virtual filesystem for the kernel to report processes and other information. |
|    /root    | Home Directory of the Administrator (root user)              |
|    /sbin    | System Binaries: Contains important system binaries (programs) for administration (/sbin/fdisk) |
|    /srv     | Data Directories for Services (/srv/ftp)                     |
|    /sys     | System Information Directory: A virtual  filesystem holding information about hardware devices connected to the system. |
|    /tmp     | Temporary Area: Directory where programs create temporary files that are supposed to be cleared at boot time. |
|    /usr     | Second hierarchy of non-essential files for multi-user use and takes up the most space. Contains all user programs (`/usr/bin`), libraries (`/usr/lib`), documentation (`/usr/share/doc`) and so on. |
|    /var     | Variable Files: It contains files that can be modified while the system is running |

`/dev/null` is a special file in Unix-like operating systems that serves as a "null device" or a bit bucket. It is commonly used to discard or redirect unwanted output, or to provide an empty file for input.
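
Both uses of `/dev/null` can be tried in isolation. This is a small sketch, assuming a standard Linux shell; the path `/nonexistent_example_path` and the variable `byte_count` are made up for illustration.

```bash
# Discard standard output: nothing is printed and nothing is stored.
echo "this text is thrown away" > /dev/null

# Discard only error messages (stream 2); the fallback message still appears.
ls /nonexistent_example_path 2> /dev/null || echo "ls failed, but its error message was discarded"

# Reading from /dev/null yields an empty input immediately.
byte_count=$(wc -c < /dev/null)
echo "bytes read from /dev/null: $byte_count"
```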

In [None]:
!find / -name "config.gz"

In [None]:
!find / -name "config.gz" 2> /dev/null

## 2.2 UNIX Commands

We will launch a terminal and interact with the shell with the following commands to familiarize ourselves with the terminal environment. The language that the terminal uses is known as <u>bash</u>. Bash is a Unix shell and command language written by Brian Fox for the GNU Project and has been used as the default login shell for most Linux distributions. You can check the current shell with `echo $SHELL`.

In [None]:
!echo $SHELL

| Command     | Description                                                  |
| :---------- | :----------------------------------------------------------- |
| ```ls```    | list files in a directory.                                   |
| ```mkdir``` | make a directory.                                            |
| ```cd```    | change to a new directory.                                   |
| ```pwd```   | show working directory.                                      |
| ```touch``` | create a file, change a timestamp.                           |
| ```cp```    | copy a new file from an old file.                            |
| ```mv```    | move file to a new directory.                                |
| ```rm```    | delete a file.                                               |
| ```rmdir``` | delete a directory.                                          |
| ```clear``` | clear all text from current screen.                          |
| ```cat```   | display the content of a file.                               |
| ```less```  | display the content of a file onto the screen a page at a time. |
| ```head```  | show the first ten lines of a file.                          |
| ```tail```  | show the last ten lines of a file.                           |
| ```grep```  | search a term in a file.                                     |
| ```wc```    | count the lines, words and bytes in a file.                  |
| ```diff```  | diff two files.                                              |
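
Several of the commands in the table can be chained into a short session. The sketch below works inside a throwaway directory created with `mktemp -d`, so it should not touch existing files; the directory and file names (`notes`, `todo.txt`, and so on) are hypothetical.

```bash
# Work inside a temporary directory so nothing else is affected.
workdir=$(mktemp -d)
cd "$workdir"

mkdir notes                          # make a directory
touch notes/todo.txt                 # create an empty file
echo "buy milk" > notes/todo.txt     # write one line into it
cp notes/todo.txt notes/backup.txt   # copy the file
mv notes/backup.txt notes/old.txt    # rename (move) the copy
ls notes                             # list the directory
cat notes/todo.txt                   # display the file content
wc -l notes/todo.txt                 # count the lines
rm notes/old.txt                     # delete a file
pwd                                  # show the working directory
```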

### A bash example

In [None]:
!whoami

In [None]:
%%bash

### Shebang for Bash Scripts. The shebang line allows the operating system to determine the appropriate interpreter for executing the script.
#!/bin/bash
# Or #!/usr/bin/env bash

# This is a simple script that greets the user

echo "Hello, $(whoami). Welcome to the shell scripting world."

## 2.3 File System Security

These are the file system security components to control access to files in a Linux file system:

| Element     | Description                                                  |
| ----------- | :----------------------------------------------------------- |
| Users       | Users are **individual accounts** on the Linux system.       |
| Groups      | Groups are **collections of users**. Users are assigned to a group when they are created. Every user must belong to at least one group. Only **root** / the **owner** of a file can change the group to which the file or directory is assigned. |
| Ownership   | The user who creates a file or directory. Ownership can only be changed by **root** user. |
| Permissions | Permissions determine user **access** to a file or directory. |

The `id` command displays information about a user’s UID and which group the user is assigned to. Similarly the `groups` command displays the name of the group.

In [None]:
!id $USER

Show group information,

In [None]:
!groups $USER

**Note:** Root user (`root`) always has a UID of 0. UID numbering for normal users starts (by default) at 1000 for Linux. A root user can use the following commands to perform certain user management tasks:

| Command | Description                                  |
| ------- | :------------------------------------------- |
| useradd | Create a new user account                    |
| usermod | Modify settings for an existing user account |
| userdel | Delete an existing user account              |

The following options are commonly used with `useradd`.

| Option | Description                                                  |
| ------ | :----------------------------------------------------------- |
| -m     | Automatically generates the new user’s home directory. By default, the user directory is  created under `/home`. |
| -c     | For  comment. It is generally a short description of the login. |
| -e     | For  expiration date of the user account (YYYY-MM-DD)        |
| -u     | Specifies a custom UID of the new account. If the option is not given, the next available UID is  used. |
| -g     | Specifies either the GID or the name of the group            |
| -G     | This  option defines any supplementary groups (separated by a comma) the user should be a member of. |

For example, run the following command to create a new user, `johndoe`.

In [None]:
!sudo useradd -m johndoe -u 9999

**Note:** Local user information is stored in the [`/etc/passwd`](https://linuxize.com/post/etc-passwd-file/) file. To view user information, run `cat /etc/passwd`.
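
Each line of `/etc/passwd` holds colon-separated fields, so it can be summarised with `awk`. The following is an illustrative sketch, assuming a typical Linux system where the file is readable and contains a `root` entry; the variable `root_uid` is a name chosen for this example.

```bash
# Fields per line: name:password:UID:GID:comment:home:shell
# Print the login name, UID and home directory of every account.
awk -F: '{ printf "%-20s UID=%-6s home=%s\n", $1, $3, $6 }' /etc/passwd

# Look up only the root account and keep its UID.
root_uid=$(awk -F: '$1 == "root" { print $3 }' /etc/passwd)
echo "root has UID $root_uid"
```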

Next, the `passwd` command is used to establish or change the password of a user account. Then, you will be prompted for a new password and will be asked to confirm it.

Now, we will look at file security. To list the `/content` directory in the long listing format, including hidden files, run `ls -la`:

In [None]:
!ls -la /content

`drwxr-xr-x 1 root root    4096 May  9 13:24 sample_data` is a directory, and `-rw-r--r-- 1 root root 1048576 May 11 13:40 foobar` is a file.

We will examine the output for better understanding, `-rw-r--r-- 1 root root 1048576 May 11 13:40 foobar`:

| Field 1 | Field 2 | Field 3 | Field 4 | Field 5 | Field 6 | Field 7 | Field 8 | Field 9      | Field 10      |
| ------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- | ------------ | ------------- |
| -       | rw-     | r--     | r--     | 1       | root   | root    | 1048576      | May 11 13:40 | foobar |

**Field 1**: **-** for File, **d** for Directory, **l** for link (e.g., create a link for *text_file.txt*: `ln -s text_file.txt link1`)

**Field 2, Field 3 and Field 4**

Those are the **r**ead, **w**rite and e**x**ecute permissions for *owner*, *group* and *others*:

- **Field 2**: The permissions that the **owner/user** has over the file.
- **Field 3**: The permissions that the **group** has over the file.
- **Field 4**: The permissions that **everybody** (**other**) else has over the file.

**Field 5**: The number of hard links to the file. For a directory, this count typically grows with the number of sub-directories it contains.

**Field 6**: The user that owns the file / directory.

**Field 7**: The group which the file belongs to, and any user in that group will have the same permissions as Field 3 over that file.

**Field 8**: The file size in bytes.

This can be formatted to human readable format with the `-h` option alongside  the `-l` option. This will output the file size in `k`, `M`, `G`.

**Field 9**: The last modified date

**Field 10**: The name of the file
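
The same fields can be printed one by one with `stat`, which makes it easier to see which value is which. This sketch assumes the GNU coreutils version of `stat` (as shipped with Ubuntu), where `-c` takes a format string; the temporary file and the variable `perms` exist only for the example.

```bash
# Create a sample file with known permissions.
sample=$(mktemp)
chmod u=rw,go=r "$sample"

# %A = type and permissions, %h = hard-link count, %U = owner,
# %G = group, %s = size in bytes, %y = last modification time, %n = name
stat -c 'perms=%A links=%h owner=%U group=%G size=%s modified=%y name=%n' "$sample"

# Keep just the permission string (Fields 1 to 4 of ls -l).
perms=$(stat -c '%A' "$sample")
echo "permission string: $perms"
rm -f "$sample"
```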


### Access Rights on Files

- r (or -), indicates read permission (or otherwise) to read and copy the file.
- w (or -), indicates write permission (or otherwise) to change a file.
- x (or -), indicates execution permission (or otherwise) to execute a file.

### Access Rights on Directories

- r allows users to list files in the directory.
- w allows users to delete files from the directory or move files into it.
- x allows users to access files in the directory.

Hence, to **read** a file, you must have the **execute** permission on the directory containing that file, and on any directory containing that directory as a subdirectory, and so on, up the tree.

### Changing Access Rights

#### chmod (changing a file mode)

Only the owner of a file (or the root user) can use `chmod` to change the permissions of a file. The symbols used with `chmod` are as follows:

| Symbol | Meaning                        |
| ------ | ------------------------------ |
| u      | user                           |
| g      | group                          |
| o      | other                          |
| a      | all                            |
| r      | read                           |
| w      | write (and delete)             |
| x      | execute (and access directory) |
| +      | add permission                 |
| -      | take away permission           |

For example, to **add** *read* and *write* permissions on the file **foobar** for the *group* and *others*, type:

In [None]:
!chmod go+rw foobar

In [None]:
!ls -l /content
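
The effect of adding and removing permissions can be followed step by step on a scratch file. This is a minimal sketch, again assuming GNU `stat -c '%A'` to print the permission string; the variable names are illustrative.

```bash
demo=$(mktemp)

chmod u=rw,go=r "$demo"          # start from -rw-r--r--
before=$(stat -c '%A' "$demo")

chmod go+rw "$demo"              # add read and write for group and others
after_add=$(stat -c '%A' "$demo")

chmod o-w "$demo"                # take write away from others again
after_remove=$(stat -c '%A' "$demo")

echo "$before -> $after_add -> $after_remove"
rm -f "$demo"
```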

## 2.4 Processes and Jobs

A process is an executing program identified by a unique PID (process identifier). To view information on your processes, with their PID and status, type:

In [None]:
!ps

A process may be in the *foreground*, in the *background*, or *suspended*. In general the shell does not return the prompt until the current process has finished executing. Some processes take a long time to run and may hold up the terminal. Backgrounding a long process has the effect that the prompt is returned immediately, and other tasks can be carried out while the original process continues executing.

### Running Background Processes

To background a process, type **&** at the end of the command line. For example, the command **sleep** waits for a given number of seconds before continuing. Type:

In [None]:
!sleep 2

The terminal will wait for 2 seconds before returning control to the user. Until the command prompt is returned, you can do nothing except wait.

To run sleep in the background, type:

In [None]:
!sleep 10 &

`jobs` lists the jobs that were started in the current shell session and are running in the background or stopped (suspended). `ps` without options shows only the processes attached to the current terminal; `ps -ef` displays information about all processes running on the system, not just those started in the current shell session.

In [None]:
!ps -ef
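
In a script, a background process is usually tracked through its PID. The sketch below uses only bash built-ins (`$!`, `kill -0`, `wait`) plus `sleep`; the variable names `bg_pid` and `bg_status` are chosen for illustration.

```bash
# Start a short task in the background; the shell continues immediately.
sleep 1 &
bg_pid=$!                        # $! holds the PID of the most recent background job
echo "started background sleep with PID $bg_pid"

# kill -0 sends no signal; it only checks that the process still exists.
if kill -0 "$bg_pid" 2> /dev/null; then
  echo "process $bg_pid is still running"
fi

# Block until the background job finishes and capture its exit status.
wait "$bg_pid"
bg_status=$?
echo "background job finished with status $bg_status"
```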

## 2.5 Other UNIX Commands

The `df` command reports on the space left on the file system. For example, to find out how much space is left on the fileserver, type:

In [None]:
!df .

Similarly, you can add a `-h` flag to format the value to human readable format.

In [None]:
!df -h .

The `du` command outputs the number of kilobytes used by each subdirectory. Navigate to your `home` directory and type:

In [None]:
!du -s -h *

The `gzip` command reduces the size of a file. For example:

In [None]:
!ls -l -h foobar

In [None]:
!cp foobar new_foobar
!gzip new_foobar
!ls -l

This will compress the file and place it in a file called **new_foobar.gz**. To see the change in size, type **ls -l** again.

In [None]:
!ls -lh

To expand the file, use the `gunzip` command.

In [None]:
!gunzip new_foobar.gz
!ls -lh
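
To compare sizes without losing the original file, `gzip -c` and `gunzip -c` can write to standard output instead of replacing the input. This is a small sketch under the assumption that `gzip` and `cmp` are installed; the file names are hypothetical.

```bash
workdir=$(mktemp -d)
cd "$workdir"

# Build a highly repetitive text file so that it compresses well.
for ((i = 0; i < 1000; i++)); do
  echo "the same line again and again"
done > sample.txt
original_size=$(wc -c < sample.txt)

# -c writes the compressed data to standard output, leaving sample.txt in place.
gzip -c sample.txt > sample.txt.gz
compressed_size=$(wc -c < sample.txt.gz)
echo "original: $original_size bytes, compressed: $compressed_size bytes"

# Decompress into a new file and confirm the round trip is lossless.
gunzip -c sample.txt.gz > restored.txt
cmp -s sample.txt restored.txt && echo "restored file is identical"
```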

The `file` command classifies the named files according to the type of data they contain, for example ascii (text), pictures, compressed data, etc. To report on all files in your home directory, type the following:

In [None]:
!file *

The `find` command searches through the directories for files and directories with a given name, date, size, or any other attribute. It has many options and it is a good idea to read the manual with `man find`.

To search for a file named **foobar**, starting at **/content** and working through all sub-directories, and then print the matching path to the screen, type:

In [None]:
!find /content -name "foobar" -print

To search for all files with the extension **.txt**, starting at the **current directory (.)** and working through all sub-directories, and then printing the name of the file to the screen, type:

```bash
find . -name "*.txt" -print
```

To find files over 1 KiB in size and display the result as a long listing, type:

```bash
find . -size +1k -ls
```
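
Tests such as `-name` and `-type` can be combined in one `find` call. The following sketch builds a tiny directory tree in a temporary location so that the result is predictable; all file names are invented for the example.

```bash
workdir=$(mktemp -d)
mkdir -p "$workdir/docs/archive"
echo "alpha" > "$workdir/docs/a.txt"
echo "beta"  > "$workdir/docs/archive/b.txt"
echo "gamma" > "$workdir/docs/c.log"

# -type f restricts matches to regular files; -name filters by pattern.
find "$workdir" -type f -name "*.txt" | sort

# Count the matches by counting output lines.
txt_count=$(find "$workdir" -type f -name "*.txt" | wc -l)
echo "found $txt_count .txt files"
```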

The `history` command keeps an ordered list of all the commands that you have entered. Each command is given a number according to the order it was entered.

```bash
history
```

To clear the history commands:

```bash
history -c
```

## 2.6 The Lifecycle of Manipulating a Remote File

### An Example

We will download a piece of free software that converts between different units of measurements.

In the home directory, create a `download` directory. Note that `rm -rf` removes a file or directory recursively and without prompting, and does not complain if the target does not exist.



In [None]:
%%bash

DIR_NAME="download"

# Check if the directory exists
if [ -d "$DIR_NAME" ]; then
  # If it exists, remove it
  rm -rf "$DIR_NAME"
  echo "Removed existing directory: $DIR_NAME"
fi

# Create the directory
mkdir "$DIR_NAME"
echo "Created new directory: $DIR_NAME"

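The original example archive is `units-1.74.tar.gz`, available [here](http://www.ee.surrey.ac.uk/Teaching/Unix/units-1.74.tar.gz). The cell below instead fetches the tutorial archive `unixtut.tar.gz` with `wget` directly into the `download` directory, so there is no need to move it from a browser's default Downloads folder.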

In [None]:
%%bash

DIR_NAME="download"

# Go to the download dir
cd "$DIR_NAME"

# Delete any .gz files left over from an earlier run
rm -f -- *.gz

# List all files under the dir
echo "list files in current dir:"
ls -l ./

# Show the current working dir
echo "current working dir:"
pwd

# Download the file
wget http://www.ee.surrey.ac.uk/Teaching/Unix/arc/unixtut.tar.gz -O unixtut.tar.gz

#### Extracting the Source Code

Notice that the filename ends with `tar.gz`. The tar command turns several files and directories into one single tar (Tape ARchive) file. This is then compressed using the gzip command (to create a tar.gz file).

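This can be done in two steps: first unzip the file using the `gunzip` command to create a .tar file, then extract the contents of the tar (tape archive) file. The `tar` command can also do both steps at once, which is what the following cell does.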


In [None]:
%%bash

DIR_NAME="download"

cd "$DIR_NAME"

# List the downloaded archive
echo "list files in current dir:"
ls -l ./unixtut.tar.gz

# Check if the file is indeed a gzip file by using the file command:
file unixtut.tar.gz

# Unzip to a tar file and extract files from it
tar -xzvf unixtut.tar.gz

In `tar -xvf unixtut.tar` (the file produced by `gunzip`):

> - The `x` flag is used to extract files from an archive.
> - The `v` flag (optional) stands for "verbose" and provides more detailed output while the extraction process is happening. It shows the names of the files being extracted.
> - The `f` flag is used to specify the archive file name.
> - So `tar -xvf <archive_name.tar>` extracts the files from the `<archive_name.tar>` archive in the current directory.

In `tar -xzvf unixtut.tar.gz`:

The `z` flag is used in conjunction with the `x` flag to decompress the archive if it is gzip-compressed. It is used to handle `.tar.gz` or `.tgz` archives.
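
A full round trip makes the flags easier to remember. This sketch creates a small archive and extracts it again; it additionally uses `c` (create), `t` (list) and `-C` (choose the destination directory), which are standard `tar` options not covered above. The directory and file names are hypothetical.

```bash
workdir=$(mktemp -d)
cd "$workdir"
mkdir project
echo "hello" > project/readme.txt

# c = create, z = gzip-compress, f = archive file name
tar -czf project.tar.gz project

# t = list the contents without extracting
tar -tzf project.tar.gz

# x = extract; -C selects the directory to extract into
mkdir extracted
tar -xzf project.tar.gz -C extracted
restored=$(cat extracted/project/readme.txt)
echo "restored content: $restored"
```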

### Managing Packages

Debian and Debian-based Linux distributions use the Debian Package Manager (`dpkg`) to manage packages. The *advanced package tool* command `apt` is advantageous over `dpkg` because it resolves dependencies and downloads updated software automatically. To download software, `apt` consults the software repositories listed in the file `/etc/apt/sources.list`, and then uses the `dpkg` program to install and manage the packages.

Software installation can be done through the terminal via the command:

```bash
sudo apt-get install <package>
# sudo apt install <package>
```

> * `apt-get` is the older and more established command-line package management tool.
>
> * `apt` is a newer command-line package management tool that provides a more user-friendly and simplified interface.

Software uninstallation can be done through the terminal via the command:

```bash
sudo apt-get remove <package>
# sudo apt remove <package>
```

The following table contrasts the traditional command against the apt equivalent commands:

| Traditional Command        | apt Equivalent         |
| -------------------------- | ---------------------- |
| apt-get update             | apt update             |
| apt-get dist-upgrade       | apt full-upgrade       |
| apt-cache search string    | apt search string      |
| apt-get install \<package> | apt install \<package> |
| apt-get remove \<package>  | apt remove \<package>  |
| apt-get purge \<package>   | apt purge \<package>   |

Note: The '*remove*' command only uninstalls a package but its configuration file stays right there. However, with the '*purge*' command, the package along with its configuration file is deleted which means that no traces of that package are left behind in this situation.

In [None]:
# Update the package list
!apt-get update

In [None]:
!apt-get remove tree

## 2.7 UNIX Variables

Standard UNIX variables are split into two categories, environment variables and shell variables. In broad terms, shell variables apply only to the current instance of the shell and are used to set short-term working conditions; environment variables have a farther reaching significance, and those set at login are valid for the duration of the session.

### Environment Variables

An example of an environment variable is the OSTYPE variable. The value of this is the current operating system you are using. Type:

In [None]:
!echo $OSTYPE

More examples of environment variables are

- `SHELL`: This describes the shell that will be interpreting any commands you type in. In most cases, this will be bash by default.
- `TERM`: This specifies the type of terminal to emulate when running the shell. Different hardware terminals can be emulated for different operating requirements. You usually won’t need to worry about this though.
- `USER`: The current logged in user.
- `PWD`: The current working directory.
- `OLDPWD`: The previous working directory. This is kept by the shell in order to switch back to your previous directory by running `cd -`.
- `LS_COLORS`: This defines color codes that are used to optionally add colored output to the `ls` command. This is used to distinguish different file types and provide more info to the user at a glance.
- `MAIL`: The path to the current user’s mailbox.
- `PATH`: A list of directories that the system will check when looking for commands. When a user types in a command, the system will check directories in this order for the executable.
- `LANG`: The current language and localization settings, including character encoding.
- `HOME`: The current user’s home directory.
- `_`: The most recent previously executed command.
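
The role of `PATH` can be checked from the shell itself. This is a short sketch assuming a bash shell with `ls` installed somewhere on the `PATH`; `ls_location` is a name chosen for the example.

```bash
# Print each PATH directory on its own line, in search order.
echo "$PATH" | tr ':' '\n'

# command -v reports which executable the shell would run for a name.
ls_location=$(command -v ls)
echo "ls resolves to: $ls_location"
```

The first directory in the list that contains a matching executable wins, which is why the order of entries in `PATH` matters.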

In bash, environment variables are set using the **export** command, displayed using the **printenv** or **env** commands, and removed using the **unset** command. (The **setenv** and **unsetenv** commands belong to the C shell family, csh and tcsh.)


### Shell Variables

An example of a shell variable is `BASH_VERSION`, which holds the version of bash that is running. Another is `HISTSIZE`, which controls how many commands are kept in the history so that the user can scroll back through the commands they have previously entered. Type:

In [None]:
!echo $BASH_VERSION

More examples of shell variables are

- `BASHOPTS`: The list of options that were used when bash was executed. This can be useful for finding out if the shell environment will operate in the way you want it to.

- `BASH_VERSION`: The version of bash being executed, in human-readable form.

- `BASH_VERSINFO`: The version of bash, in machine-readable output.

- `COLUMNS`: The number of columns wide that are being used to draw output on the screen.

- `DIRSTACK`: The stack of directories that are available with the `pushd` and `popd` commands.

- `HISTFILESIZE`: Number of lines of command history stored to a file.

- `HISTSIZE`: Number of lines of command history allowed in memory.

- `HOSTNAME`: The hostname of the computer at this time.

- `IFS`: The internal field separator used to split input on the command line. By default, this is a space, a tab and a newline.

- `PS1`: The primary command prompt definition. This is used to define what your prompt looks like when you start a shell session. The `PS2` is used to declare secondary prompts for when a command spans multiple lines.

- `SHELLOPTS`: Shell options that can be set with the `set` command.

- `UID`: The UID of the current user.

### Creating Shell Variables

We will begin by defining a shell variable within our current session. Simply specify a name and a value. We’ll adhere to the convention of keeping all caps for the variable name, and set it to a simple string.

In [None]:
%%bash

# Creating a variable
GREETING="Hello, Bash!"

# Printing the variable value
echo $GREETING

# Exporting a variable
export GREETING

# Predefined user input
COLOR="blue"

# Printing the predefined input
echo "Your favorite color is $COLOR"
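
The difference between a shell variable and an environment variable shows up when a child process is started. The sketch below uses a hypothetical variable `DEMO_GREETING` and asks a child `bash` whether it can see the value before export, after export, and after unset.

```bash
unset DEMO_GREETING   # make sure the name is not already defined

# A plain shell variable is visible only in the current shell.
DEMO_GREETING="shell variable"
seen_before=$(bash -c 'echo "${DEMO_GREETING:-<unset>}"')

# After export it becomes an environment variable inherited by child processes.
export DEMO_GREETING
seen_after=$(bash -c 'echo "${DEMO_GREETING:-<unset>}"')

# unset removes it from the shell and from the environment.
unset DEMO_GREETING
seen_unset=$(bash -c 'echo "${DEMO_GREETING:-<unset>}"')

echo "before export: $seen_before"
echo "after export:  $seen_after"
echo "after unset:   $seen_unset"
```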

### Text Editors

**vim (Vi IMproved)**

Vim, short for "Vi IMproved," is a highly configurable and powerful text editor that is available on various operating systems, including Linux, macOS, and Windows. It is an enhanced version of the original Vi text editor, which was developed in the 1970s.

Vim is designed to be lightweight and efficient while offering a wide range of features for editing and manipulating text files. It operates in different modes, allowing users to navigate, edit, and execute commands with keyboard shortcuts.

1. Normal mode: The default mode for navigating and executing commands.
2. Insert mode: Used for inserting and editing text.
3. Visual mode: Enables selecting and manipulating blocks of text.
4. Command-line mode: Allows entering editor commands or executing external commands.

Vim offers a vast array of features, including syntax highlighting, auto-indentation, search and replace, regular expressions, macros, split-screen editing, and support for numerous programming languages. It also has a plugin system that allows users to extend its functionality further.

**nano**

Nano is designed as a simpler alternative to more advanced text editors like Vim or Emacs, making it accessible to both beginner and experienced users.

Nano provides a straightforward and intuitive interface with basic editing functionalities. It allows users to create, view, and modify text files directly from the command line or terminal window. Nano supports common operations such as inserting, deleting, copying, and pasting text. It also offers features like search and replace, spell checking, line numbering, and syntax highlighting for various programming languages.


#### Ex. Vim or Nano

Create a new bash file `current_time.sh`. We want to run it from Shell and print out current time from the file. For example,

```shell
vim current_time.sh
```

The code looks like,

```bash
#!/bin/bash
echo "Current time: $(date)"
```

Within `vim`, press `Esc`, then type `:wq` and press `Enter` to save and quit. Change its permission to be executable,

```shell
chmod +x current_time.sh
```

Run `current_time.sh` from Shell.
```shell
./current_time.sh
```

For example, output looks like the below,

```shell
anneyu@DESKTOP-KMDHETK:~/tmp$ ./current_time.sh
Current time: Wed May 15 14:30:53 +08 2024
```

In [None]:
%%bash

# You can type and run the same code in Jupyter Notebook.
# !/bin/bash
echo "Current time: $(date)"

## 2.8 A few more common commands

To check the user manual of a command `ls`, type `man ls`.

To count the lines, words and bytes of a file, use `wc filename`. For example,

In [None]:
!echo "This is a test file." > test.txt
!wc test.txt

To move (rename) an existing file to a new file name,

In [None]:
!mv test.txt new_file.txt

To create a new file from existing files. For example,

In [None]:
!echo "This is a test file." > test1.txt
!echo "This is another test file." > test2.txt

# Concatenate two files into a new file.
!cat test1.txt test2.txt > test3.txt

# Show the new file content.
!cat test3.txt

To compare two files line by line, use the `diff` command,

In [None]:
!diff test1.txt test2.txt

### Grep for searching within content

In [None]:
!grep "test" test3.txt

Some common usages,

* The -i option makes the search case-insensitive.
* The -r option enables recursive search within a directory and its subdirectories.
* The -n option displays the line numbers along with the matching lines.
* The -c option counts the number of lines that contain the matching pattern.
* The -v option inverts the match, displaying lines that do not match the pattern.

For example,

In [None]:
# Count the number of lines that contain "unix" (case-sensitive)
!grep -c "unix" ./download/unixtut/unix8.html

# Count and case-insensitive
!grep -c -i "unix" ./download/unixtut/unix8.html
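
The options listed above can be compared on a small file with known content. This is an illustrative sketch; the three sample lines are made up so that the case-sensitive and case-insensitive counts differ.

```bash
sample=$(mktemp)
printf 'Unix is old\nLinux is unix-like\nWindows is different\n' > "$sample"

grep -n "unix" "$sample"        # case-sensitive, with line numbers
grep -i -n "unix" "$sample"     # case-insensitive, with line numbers
grep -v -i "unix" "$sample"     # lines that do NOT match

# -c counts matching lines, not individual occurrences.
matches_cs=$(grep -c "unix" "$sample")
matches_ci=$(grep -c -i "unix" "$sample")
echo "case-sensitive lines: $matches_cs, case-insensitive lines: $matches_ci"
rm -f "$sample"
```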

### Peek into a data file

When given a data file formatted as `csv`, run some commands to look at the top and bottom rows.

In [None]:
# Define the content of the CSV file
csv_content = """Name,Age,City
Alice,30,New York
Bob,25,Los Angeles
Charlie,35,Chicago
Diana,28,Houston
Eve,22,Phoenix
"""

# Write the content to a file named 'example.csv'
with open('example.csv', 'w') as f:
    f.write(csv_content)

In [None]:
# Display the first few lines of the CSV file
!echo "the top 2 lines: " && head -n 2 example.csv

!echo ""

# Display the last few lines of the CSV file
!echo "the bottom 2 lines: " && tail -n 2 example.csv

In [None]:
%%bash

# Calculate the total number of lines
total_lines=$(wc -l < example.csv)

# Calculate the middle line number
middle_line=$(( 1 + (total_lines + 1) / 2 ))

# Print the middle line using head and tail
head -n $middle_line example.csv | tail -n 1
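
Two related idioms are often useful when inspecting CSV files: skipping the header row with `tail -n +2`, and printing a single line by number with `sed -n`. The sketch below uses its own small temporary file rather than `example.csv`, so the values are assumptions made for illustration.

```bash
csv=$(mktemp)
printf 'Name,Age,City\nAlice,30,New York\nBob,25,Los Angeles\nCharlie,35,Chicago\n' > "$csv"

# tail -n +2 starts output at line 2, which skips the header row.
data_rows=$(tail -n +2 "$csv" | wc -l)

# sed -n '3p' prints only line 3, an alternative to chaining head and tail.
third_line=$(sed -n '3p' "$csv")

echo "data rows: $data_rows"
echo "line 3: $third_line"
rm -f "$csv"
```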