In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Two prefix characters can be placed before commands in the Colab IPython environment: `!` and `%`. The character `!` executes any standard shell command, while `%` (used further down) invokes an IPython magic command. The examples below use `!`: `!pwd` prints the working directory, and `!ls` lists the directories **drive** and **sample_data**.

In [None]:
!pwd

/content


In [None]:
!ls

drive  sample_data


Let us try a simple Bash command.

In [None]:
!echo "Hello World!"

Hello World!


However, you cannot use **!cd** to navigate the filesystem. In the example below, **!pwd** prints the current directory and shows that **!cd** did not take you into the **sample_data** directory.

In [None]:
!cd sample_data/

In [None]:
!pwd

/content


The reason is that shell commands in the notebook are executed in a temporary subshell that does not maintain state from command to command. If you'd like to change the working directory in a more enduring way, you can use the **%cd** magic command.

In [None]:
%cd sample_data

/content/sample_data


In [None]:
!pwd

/content/sample_data
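
The subshell behaviour described above can be reproduced in plain Bash. This is a minimal sketch, not part of the original notebook: the variable names are illustrative, and in Colab the block can be run in a cell that starts with `%%bash`.

```shell
# Remember where the parent shell starts.
start_dir=$(pwd)

# Command substitution $(...) runs in a subshell: the cd only applies there.
inside=$(cd / && pwd)

# Back in the parent shell the working directory has not changed.
after=$(pwd)

echo "inside the subshell: $inside"
echo "parent shell still in: $after"
```

This is the same reason a `!cd` has no lasting effect, whereas `%cd` changes the directory of the notebook process itself.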


The UNIX user command **ls** lists the contents of the current directory, **sample_data**.

In [None]:
!ls

anscombe.json		      mnist_test.csv
california_housing_test.csv   mnist_train_small.csv
california_housing_train.csv  README.md


## 1. Using `man` to get help
Nearly every user command in Linux will have a `man` (manual) page, so finding them is as simple as typing `man command` to bring up the manual entry for that specific command.

For example, `man mv` will bring up the manual page for the `mv` (move) command. Note: when running this command in the terminal, you will need to use the arrow keys to move up and down the page. To get back to the command prompt, type `q`.

`man intro` is a useful place to start. It displays the "Introduction to User Commands" page, which is a well-written, fairly brief introduction to the Linux command line.

**Note**: If you are getting a message telling you to run the `unminimize` command:
```
This system has been minimized by removing packages and content that are not required on a system that users do not log into.
To restore this content, including manpages, you can run the 'unminimize'
command. You will still need to ensure the 'man-db' package is installed.
```

then run the command `yes y | unminimize`. This command will automatically type `y` when the command asks you to agree.


In [None]:
!yes y | unminimize
!man intro

Some software developers prefer `info` to `man` (for instance, GNU developers). So if you find a very widely used command or app that doesn't have a `man` page, it might be worth your while checking for an `info` page.

Most commands understand the `--help` option (and many also accept `-h`), which produces a short usage description of the command and its options. Be aware that for some commands `-h` means something else; for example, `ls -h` prints human-readable file sizes.

`man` pages can be lengthy, so if you are looking for a specific option it can be useful (when using the terminal) to search for a word with the syntax `/word` and then press the `n` key to move to the next occurrence.

If you aren’t sure which command or application you need to use, you can try searching the manual pages. Each manual page has a name and a short description.


*   If you know part of the command name, use `whatis -r <string>`. For example, try `whatis -r cpy`.
*   To search the names or descriptions for `<string>`, enter `apropos -r <string>`. For example, `apropos -r "copy files"` will list manual pages whose names or descriptions contain "copy files".

In [None]:
!apropos -r "copy files"

cp (1posix)          - copy files
gh-codespace-cp (1)  - Copy files between local and remote file systems
git-checkout-index (1) - Copy files from the index to the working tree


## 2. File and Directory Commands

The Linux hierarchy is typical of Unix systems (with some variations depending on the specific distribution). For the moment you just need to know that the file system is a tree that starts at the root (represented with the symbol `/`). If you are familiar with DOS/Windows, note that the path delimiter is the forward slash (`/`) and not the backslash (`\`). A path in Linux then looks like this: `/var/log/auth.log`. This leads to the file `auth.log` in the folder `log`, which is in the folder `var`, which sits directly under the root of the file system.

*   The tilde `(~)` symbol stands for your home directory. If you are anthony, then the tilde `(~)` stands for `/home/anthony`. So `/home/anthony/myFile` and `~/myFile` point to the same file.
*   `pwd`: The `pwd` command tells you which directory you are currently in (`pwd` stands for "print working directory"). Example: if you are anthony, `pwd` in your Desktop directory will show `/home/anthony/Desktop`.
*   `ls`: The `ls` ("list") command will show you the files in your current directory. Used with certain options, it also shows the size of files, when files were last modified, and permissions for files. Example: `ls ~` will show you the files that are in your home directory.

*    `cd`: The `cd` command will allow you to change directories. When you open a terminal you will be in your home directory. To move around the file system you will use `cd`. Examples:
      * To navigate into the root directory, use `cd /`
      * To navigate to your home directory, use `cd` or `cd ~`
      * To navigate up one directory level, use `cd ..`
      * To navigate to the previous directory (or back), use `cd -`
      * To navigate through multiple levels of directory at once, specify the full path of the directory that you want to go to. For example, use `cd /var/log` to go directly to the `log` subdirectory of `/var`.
*    `cp`: The `cp` command will make a copy of a file for you. Example: `cp file foo` will make an exact copy of ”file” and name it ”foo”, but the file ”file” will still be there. If you are copying a directory, you must use `cp -r directory foo` (copy recursively). (To understand what ”recursively” means, think of it this way: to copy the directory and all its files and subdirectories and all their files and subdirectories of the subdirectories and all their files, and on and on, ”recursively”).
*    `mv`: The `mv` command will move a file to a different location or will rename a file. Examples are as follows: `mv file foo` will rename the file ”file” to ”foo”. `mv foo ~/Desktop` will move the file ”foo” to your Desktop directory, but it will not rename it. You must specify a new file name to rename a file.
*    `rm`: Use this command to remove or delete a file in your directory.
*    `rmdir`: The `rmdir` command will delete an empty directory. To delete a directory and all of its contents recursively, use `rm -r` instead.
*    `mkdir`: The `mkdir` command will allow you to create directories. Example: `mkdir music` will create a directory called ”music”.
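
The commands in the list above can be tried safely in a throwaway directory. The following is a minimal sketch in plain Bash; the file and directory names (`file`, `foo`, `music`, `backup`, `empty_dir`) are illustrative, and `mktemp -d` is used only so that nothing in your own folders is touched.

```shell
# Work in a temporary directory so nothing important is affected.
workdir=$(mktemp -d)
cd "$workdir"

mkdir music                 # create a directory
echo "track list" > file    # create a small file to play with
cp file foo                 # copy: both "file" and "foo" now exist
mv foo music/               # move "foo" into music/ without renaming it
mv file notes.txt           # rename "file" to "notes.txt"
cp -r music backup          # recursive copy of a directory and its contents
rm notes.txt                # delete a single file
rm -r backup                # delete a directory and everything inside it
mkdir empty_dir
rmdir empty_dir             # rmdir only works on empty directories

ls -R                       # only music/foo should remain
```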

Let us create a directory for our tutorial work today.

In [None]:
%cd /content

/content


In [None]:
!mkdir -p COMP47470/Week1

In [None]:
!ls

COMP47470  drive  sample_data


In [None]:
%cd COMP47470/Week1

/content/COMP47470/Week1


Download the `unirank.csv` file using `!wget csserver.ucd.ie/~rholmes/unirank.csv`

In [None]:
!wget csserver.ucd.ie/~rholmes/unirank.csv

--2025-09-15 20:00:19--  http://csserver.ucd.ie/~rholmes/unirank.csv
Resolving csserver.ucd.ie (csserver.ucd.ie)... 193.1.133.60
Connecting to csserver.ucd.ie (csserver.ucd.ie)|193.1.133.60|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://csserver.ucd.ie/~rholmes/unirank.csv [following]
--2025-09-15 20:00:20--  https://csserver.ucd.ie/~rholmes/unirank.csv
Connecting to csserver.ucd.ie (csserver.ucd.ie)|193.1.133.60|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13062 (13K) [text/csv]
Saving to: ‘unirank.csv’


2025-09-15 20:00:20 (341 KB/s) - ‘unirank.csv’ saved [13062/13062]



In [None]:
!ls

unirank.csv


## 3. Examining Files:

`cat` stands for conCATenate. You can use this command to dump an entire text file to the screen.

In [None]:
!cat unirank.csv

If the text file is long, its beginning scrolls off the screen before you can read it. In that case, you can use either the `more` or the `less` command. For example:

`more unirank.csv`

`less unirank.csv`

Both commands do a similar job: they let you view a text file one page at a time. Press the spacebar to continue paging, the `enter` key to move down one line, and `q` to quit.

`less` has more features than `more`. The most useful one is that it scrolls backwards (up) as easily as forwards, which `more` traditionally could not do. Press `h` (while in the program) to see more options. Another useful feature is search, which works like the one presented for `man`: to look for a specific string in a text file, type `/string` and then press the `n` key to move to the next occurrence.

## 4. Showing part of a file:
`head` and `tail` are two opposite commands, showing the beginning or the end of a file respectively. Try the following commands - what do they give you?

`head unirank.csv`

`tail unirank.csv`

Both of them have various options that can be powerful - and a little complicated!

`head -n 3 unirank.csv` # Shows the first 3 lines

`tail -n 5 unirank.csv` # Shows the last 5 lines

`tail -n +20 unirank.csv` # Shows from the 20th line to the end

In [None]:
!head -n 15 unirank.csv | tail -n 5

Johns Hopkins University,Baltimore, MD,50410,6524,10
Dartmouth College,Hanover, NH,51438,4307,11
California Institute of Technology,Pasadena, CA,47577,1001,12
Northwestern University,Evanston, IL,50855,8314,12
Brown University,Providence, RI,51367,6652,14


2. Combine `head` and `tail` using a pipe (`|`) to print the 7th line in `unirank.csv`.

In [None]:
!head -n 7 unirank.csv | tail -n 1

Stanford University,Stanford, CA,47940,6999,5
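
The `head` and `tail` options are easier to check on a file whose lines are numbered. The sketch below builds a hypothetical 30-line file (`line 1` to `line 30`) in a temporary location; the file and variable names are assumptions made for illustration only.

```shell
# Build a 30-line sample file: "line 1" ... "line 30".
sample=$(mktemp)
i=1
while [ "$i" -le 30 ]; do
  echo "line $i"
  i=$((i + 1))
done > "$sample"

first3=$(head -n 3 "$sample")      # lines 1 to 3
last5=$(tail -n 5 "$sample")       # lines 26 to 30
from20=$(tail -n +20 "$sample")    # line 20 to the end of the file

# head keeps lines 1-7, then tail keeps only the last of those: line 7.
seventh=$(head -n 7 "$sample" | tail -n 1)
echo "$seventh"
```

The same pattern, `head -n N file | tail -n 1`, prints line N of any file.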


## 5. Echo and Special Characters:

In this section we will demonstrate how special characters can impact the execution of commands - and how they can be disabled conveniently!

In [None]:
# We can print or "echo" a simple word:
!echo Hello

# Or several words:
!echo Hello everybody

# We can also print nonalphabetical characters on screen
# (As long as they don't belong to the previously mentioned
# list of special characters)
!echo Hello to the 2 of you!

# If we want to disable a special character like # we would
# place a backslash before it like so:
!echo \# is a very useful character in bash

# We can even disable the \ character itself with another \
!echo \# is more useful than \\ in bash

# If we want to disable a lot of special characters we can use
# single quotes to disable all of them:
!echo '# is less useful than * in bash'
!echo 'The $PATH variable is very important.'

# Or double quotes which will disable all special characters
# except $, ` and \
!echo "# is less useful than * in bash"
!echo "The \$PATH variable is very important."

Hello
Hello everybody
Hello to the 2 of you!
# is a very useful character in bash
# is more useful than \ in bash
# is less useful than * in bash
The $PATH variable is very important.
# is less useful than * in bash
The $PATH variable is very important.


Escaping is important. Here's the same sentence without the \ before the $ character:

In [None]:
!echo "The $PATH variable is very important."

The /opt/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/tools/node/bin:/tools/google-cloud-sdk/bin variable is very important.


## 6. Selecting Columns:

Linux command `cut` is used for text processing. You can use this command to extract portions of text from a file by selecting columns.
Option `-cN` extracts only character N of each line from a file. For example:

In [None]:
!cut -c2 unirank.csv

The command above extracts only the second character of each line. A range of characters can also be extracted by specifying the start and end positions separated by `-`.

In [None]:
!cut -c2-5 unirank.csv

Either the start position or the end position can be left out when using the `-c` option.

The following command specifies only the start position, before the `-`. This example extracts from the 10th character to the end of each line.


In [None]:
!cut -c10- unirank.csv

The following command specifies only the end position, after the `-`. This example extracts the first 10 characters of each line.

In [None]:
!cut -c-10 unirank.csv

Instead of selecting a number of characters, if you would like to extract a whole field, you can combine options **-f** and **-d**. Option **-f** specifies which field you want to extract, and option **-d** specifies the field delimiter that is used in the input file.

The following example displays only the first field of each line from the **unirank.csv** file using the field delimiter `,` (comma). In this case, the 1st field is the name of the university.

In [None]:
!cut -d"," -f1 unirank.csv

You can also extract more than one field from a file. The example below displays the University name and state.

In [None]:
!cut -d"," -f1,3 unirank.csv

To display a range of fields specify start field and end field as shown below. In this example, we are selecting fields 1 through 3 as well as 5 and 6.

In [None]:
!cut -d"," -f1-3,5,6 unirank.csv
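
To see why field 3 holds the state, it helps to run `cut` on a single row. The sketch below uses one row copied from the `unirank.csv` output shown earlier; the variable names are illustrative. Because the location is written as "City, ST", its comma splits it across fields 2 and 3, and field 3 keeps the leading space.

```shell
# One row taken from the unirank.csv output shown earlier.
row='Johns Hopkins University,Baltimore, MD,50410,6524,10'

name=$(echo "$row" | cut -d"," -f1)         # field 1: university name
city=$(echo "$row" | cut -d"," -f2)         # field 2: city
state=$(echo "$row" | cut -d"," -f3)        # field 3: state, with a leading space
first_chars=$(echo "$row" | cut -c1-5)      # characters 1 to 5

echo "name=[$name] city=[$city] state=[$state] chars=[$first_chars]"
```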

## 7. Sorting:

The sort command rearranges the lines in a text file so that they are sorted lexicographically.

In [None]:
!sort unirank.csv

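These are the default sorting rules in a typical UTF-8 locale such as `en_US.UTF-8`:
* a number is before a letter.
* letters follow their order in the alphabet.
* a lowercase letter is before the same uppercase letter.

The order depends on the locale: in the `C` (POSIX) locale, lines are compared byte by byte, so every uppercase letter sorts before every lowercase letter.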

You can change the default sorting rules by providing a suitable parameter. For example:

* Reverse sort: `sort -r unirank.csv`
* Ignore case: `sort -f unirank.csv`

You can also concatenate and sort several files at once:
`sort /file/one /file/two`

You can also save the result of the sort in another file:

`sort unirank.csv -o result.txt`

or

`sort unirank.csv > result.txt`

or

`sort unirank.csv >> result.txt`

The **-o** option is not available for all commands. Using **>** and **>>** works for any command. **>** will overwrite *result.txt* if it already exists while **>>** will append the output to *result.txt* if it already exists. Both create the file if it does not exist.
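
The difference between `-o`, `>` and `>>` can be checked by counting lines in the result file. This is a minimal sketch on a hypothetical three-line file (`fruit.txt`), created in a temporary directory; all names are assumptions for illustration.

```shell
work=$(mktemp -d)
printf 'banana\napple\ncherry\n' > "$work/fruit.txt"

# > creates result.txt (or overwrites it): 3 sorted lines.
sort "$work/fruit.txt" > "$work/result.txt"

# >> appends: the 3 reverse-sorted lines are added after the first 3.
sort -r "$work/fruit.txt" >> "$work/result.txt"
lines_after_append=$(wc -l < "$work/result.txt" | tr -d ' ')

# > again: the previous content is discarded, back to 3 lines.
sort "$work/fruit.txt" > "$work/result.txt"
lines_after_overwrite=$(wc -l < "$work/result.txt" | tr -d ' ')

# -o is sort's own way of naming an output file.
sort -o "$work/by_option.txt" "$work/fruit.txt"

echo "after append: $lines_after_append, after overwrite: $lines_after_overwrite"
```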

## 8. Duplicates:

The `uniq` command filters out adjacent duplicate lines, keeping one copy of each run:

`uniq unirank.csv`

It can filter them by ignoring the case:

`uniq -i unirank.csv`

You can report the duplicate lines by:

`uniq -d unirank.csv`

You can also print the number of occurrences of each line:

`uniq -c unirank.csv`
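
Because `uniq` only compares neighbouring lines, it is usually combined with `sort`. The sketch below uses a small hypothetical list of state codes to show the difference; the variable names are illustrative.

```shell
# Hypothetical input: CA appears three times, but not all copies are adjacent.
data=$(printf 'CA\nNY\nCA\nCA\nNY\n')

# Without sorting, only the adjacent "CA CA" pair is collapsed.
adjacent_only=$(echo "$data" | uniq)

# Sorting first brings all duplicates together, so each value appears once.
all_unique=$(echo "$data" | sort | uniq)

# -c prefixes each distinct line with its number of occurrences.
counts=$(echo "$data" | sort | uniq -c)

echo "$adjacent_only"
echo "$all_unique"
echo "$counts"
```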

## 9. tee: copy input to two places

The `tee` command sends its standard input both to standard output and to a file that you specify on the command line. You can think of this command as the equivalent of a tee fitting in plumbing: it splits a linear command pipeline in two.

The device `/dev/tty` is a synonym for the current terminal. For example, the following command prints both the pathnames of files with the `.csv` extension and a count of the csv files that were found. Here, `tee` sends the output to `/dev/tty` and to the command `wc`. A notebook cell has no controlling terminal, so in Colab `tee` reports that `/dev/tty` cannot be opened and only the count is printed.




In [None]:
!find . -name "*.csv" | tee /dev/tty | wc -l

tee: /dev/tty: No such device or address
1
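
In an environment without a terminal, the same idea can be shown by pointing `tee` at an ordinary file instead of `/dev/tty`. This is a minimal sketch with hypothetical file names (`a.csv`, `b.csv`, `notes.txt`, `found.txt`) created in a temporary directory.

```shell
work=$(mktemp -d)
touch "$work/a.csv" "$work/b.csv" "$work/notes.txt"

# tee writes the list of matching paths to found.txt
# and also passes it on to wc, which counts the lines.
count=$(find "$work" -name "*.csv" | tee "$work/found.txt" | wc -l | tr -d ' ')

# The saved copy contains the same number of paths.
saved=$(wc -l < "$work/found.txt" | tr -d ' ')

echo "counted: $count, saved: $saved"
```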


## 10. Grep:

Grep prints out the lines containing a certain string. For example, the following will return all lines that contain the Kentucky state code (KY):

In [None]:
!grep KY unirank.csv

University of Kentucky,Lexington, KY,26334,22705,133
University of Louisville,Louisville, KY,24626,15769,171


Options:
* `-c`: Only gives the number of matching lines.
* `-v`: Shows only the lines that do not match the pattern (inverted search).
* `-i`: Ignore case.
* `-n`: Gives the line number as well as the matching lines.
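
The options above can be compared on a three-line sample. The rows below are copied from the `unirank.csv` output shown in this notebook; the temporary file and the variable names are illustrative.

```shell
sample=$(mktemp)
cat > "$sample" <<'EOF'
Brown University,Providence, RI,51367,6652,14
University of Kentucky,Lexington, KY,26334,22705,133
University of Louisville,Louisville, KY,24626,15769,171
EOF

n_ky=$(grep -c KY "$sample")                 # number of lines containing KY
not_ky=$(grep -v KY "$sample")               # lines that do NOT contain KY
n_univ=$(grep -i -c "university" "$sample")  # case-insensitive count
numbered=$(grep -n ville "$sample")          # matching lines with line numbers

echo "$n_ky"
echo "$not_ky"
echo "$n_univ"
echo "$numbered"
```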


### Exercise 2:

1. What is the output of the commands:
    * `echo {0..9}`
    * `echo 1.{0..9}`
    * `echo {A..C}{0..2}`

In [None]:
!echo {0..9}
!echo 1.{0..9}
!echo {A..C}{0..2}

2. Use `sort` to sort the content of unirank.csv.

In [None]:
!sort unirank.csv

3. Use `grep` to find all the lines with 'ville' in unirank.csv. Then redirect this output to a file.

In [None]:
!grep ville unirank.csv > result.txt

4. List all of the files in a directory in reverse date order, so that the most recent is at the bottom (`ls`).

In [None]:
!ls -r -t

unirank.csv  result.txt


5. Find all the ”colleges” in the list (e.g. University *College* Dublin, think `grep` and its options!)

In [None]:
!cut -d"," -f1 unirank.csv | grep -i "college"

6. Find the number of HEIs per state (`cut`, `sort`, `uniq`)?

In [None]:
!cut -d"," -f3 unirank.csv | tail -n +2 | sort | uniq -c > HEI_Per_State.txt

7. Which state has the most HEIs in the dataset?

In [None]:
!sort -n -r HEI_Per_State.txt | head -n 1

     22  CA


## 11. Bash Scripts:

Now we'll play with some basic Bash scripts. To write these scripts in a terminal we would usually need a text editor such as `nano` or `vi`. However, as we're using a notebook, we can simply create a new file and run it using the command: `!bash path/to/file`.

You can create new files by right-clicking on the topmost icon in the files view of the left-hand menu. Remember to move any new files you create to your drive in order to save them! Bash scripts (by convention) have a `.sh` extension, e.g. the example `hello.sh` provided to you.

Alternatively, click the file directory button on the left-hand menu and go to the directory `COMP47470/Week1`. Then upload the files `hello.sh` and `hello2.sh` that are provided to you. Move them to the proper directory.

You will need to make sure the first line of any scripts you write is `#!/bin/bash`. `#!` is called the shebang and is followed directly by the path to the executable used to interpret the script (bash in this case).


In [None]:
!ls

sample_data


In [None]:
!cat hello.sh

cat: hello.sh: No such file or directory



If you take a look at the `hello.sh` script you'll notice that it contains just one other line: `echo "Hello $1"`. This `$1` is what is known as a positional parameter. This is something we will pass as an argument to the script. For example:

In [None]:
!bash hello.sh "COMP47470!"

Hello COMP47470!
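
If the provided file is not at hand, a script with the two lines described above can be recreated from a here-document. This is a sketch based only on that description (the shebang plus `echo "Hello $1"`); the temporary directory is an assumption made so the example does not overwrite anything.

```shell
work=$(mktemp -d)

# Write the two-line script; the quoted 'EOF' stops $1 from being
# expanded while the file is being written.
cat > "$work/hello.sh" <<'EOF'
#!/bin/bash
echo "Hello $1"
EOF

# The first argument after the script name becomes $1.
greeting=$(bash "$work/hello.sh" "COMP47470!")
echo "$greeting"
```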


`$@` is a special parameter that expands to all positional parameters starting from `$1` (i.e. a list of arguments provided to the script).

Take a look at the `hello2.sh` script. Using a for loop and `$@`, our script can now say hello to each of the demonstrators whose names we pass as arguments.

In [None]:
!bash hello2.sh And Welcome to COMP47470 Big Data Programming Module!
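
The `for` loop over `$@` described above might look like the following. This is a hypothetical sketch, not the contents of the provided `hello2.sh`: the file name `hello2_sketch.sh`, the loop variable `name` and the sample arguments are all assumptions.

```shell
work=$(mktemp -d)

cat > "$work/hello2_sketch.sh" <<'EOF'
#!/bin/bash
# Greet every argument in turn; quoting "$@" keeps an argument
# that contains spaces together as a single word.
for name in "$@"; do
  echo "Hello $name"
done
EOF

greetings=$(bash "$work/hello2_sketch.sh" Alice "Bob Smith")
echo "$greetings"
```

Here two arguments are passed, so the loop body runs twice, once per argument.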

## 12. Snapshot of Social Network (facebookdata.csv):

One of the first tasks that data scientists do when they receive a new dataset is to check the data: reading its content, understanding its format, etc.

For this next task we'll use the file facebookdata.csv. This can be downloaded using the command `!wget csserver.ucd.ie/~thomas/facebookdata.csv`.

The csv file is provided to you if you have issues downloading it.

In [1]:
!wget csserver.ucd.ie/~thomas/facebookdata.csv

--2025-09-30 13:43:37--  http://csserver.ucd.ie/~thomas/facebookdata.csv
Resolving csserver.ucd.ie (csserver.ucd.ie)... 193.1.133.60
Connecting to csserver.ucd.ie (csserver.ucd.ie)|193.1.133.60|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://csserver.ucd.ie/~thomas/facebookdata.csv [following]
--2025-09-30 13:43:37--  https://csserver.ucd.ie/~thomas/facebookdata.csv
Connecting to csserver.ucd.ie (csserver.ucd.ie)|193.1.133.60|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 808806 (790K) [text/csv]
Saving to: ‘facebookdata.csv’


2025-09-30 13:43:37 (8.06 MB/s) - ‘facebookdata.csv’ saved [808806/808806]




1. Read the first 10 lines of the file. What are the different fields?

In [3]:
!head facebookdata.csv

status_id,status_message,link_name,status_type,"status,_link",status_published,num_reactions,num_comments,num_shares,num_likes,num_loves,num_wows,num_hahas,num_sads,num_angrys
7331091005_10154123560186006,"Ben Simmons will likely be the No. 1 pick of the NBA Draft, but who should it be?","Ben Simmons will likely be the No. 1 pick of the NBA Draft, bu...",video,https://www.facebook.com/bleacherreport/videos/10154123560186006/,2016-06-22 17:33:39,5565,178,461,5488,43,13,19,0,2
7331091005_10154123362896006,"How to coach the ""Triangle Offense,"" as explained by Metta World Peace. (via QBronald/Twitter)","How to coach the ""Triangle Offense,"" as explained by Metta Wor...",video,https://www.facebook.com/bleacherreport/videos/10154123362896006/,2016-06-22 16:20:56,11997,1932,3158,10385,96,15,1499,0,2
7331091005_10154123319126006,"The new team, reportedly called the Black Knights, will be starting in 2017-18.",NHL Announces Las Vegas Will Officially Get Expansion Team,link,http://ble.ac/28Nr

2. Read the last 10 lines of the file.

In [4]:
!tail -n 10 facebookdata.csv

7331091005_10153833038526006,"""That behavior will not be tolerated as we move forward.""",Browns Head Coach Hue Jackson Comments on Johnny Manziel,link,http://ble.ac/20VnzGV,2016-02-24 13:17:04,2006,77,39,1737,17,23,132,69,28
7331091005_10153832973191006,This is actually happening.,San Jose Sharks to Honor Brent Burns' Hair and Beard with 'Chia Pet' Giveaway,link,http://ble.ac/20VjPFl,2016-02-24 12:29:21,1090,86,97,1004,19,8,52,1,6
7331091005_10153832924036006,"""I've always wanted to be Samus."" - Ronda Rousey",Ronda Rousey Reveals She'd Love to Star in a 'Metroid' Movie,link,http://ble.ac/1S1kFjj,2016-02-24 11:54:59,5269,427,277,4818,165,46,55,11,174
7331091005_10153832880656006,Baalke expects Kaepernick to be on team's roster on April 1.,49ers General Manager Trent Baalke Comments on Colin Kaepernick,link,http://ble.ac/1QaLrAx,2016-02-24 11:26:08,395,14,19,375,4,3,7,2,4
7331091005_10153832871926006,Jeffrey could be staying in Chicago long term.,Alshon Jeffery Reportedly to Be Franc

3. Print line 1515 using tail, head, and a pipe (`|`).

In [5]:
!head -n 1515 facebookdata.csv | tail -n 1

7331091005_10153983215131006,"The Golden State Warriors may have won a lot of games this season, but the Internet handed them an 'L' after this photoshoot  STORY: http://ble.ac/26g5icS",Timeline Photos,photo,https://www.facebook.com/bleacherreport/photos/a.10150274478951006.330140.7331091005/10153983215131006/?type=3,2016-04-20 23:45:42,12742,693,505,11876,144,18,695,4,5


4. Count the number of characters in the first 50 lines.

In [8]:
!head -n 50 facebookdata.csv | wc -m

12722


5. Try to print only columns 4 and 6. You should realise there is something wrong with the file. What is the problem?

In [9]:
!cut -d"," -f4,6 facebookdata.csv


We will clean the file *facebookdata.csv* using the utility `sed`. **sed** ("stream editor") is a Unix utility that parses and transforms text.
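
To give a feel for `sed`, here is a minimal sketch of its substitution command `s/OLD/NEW/`. It is not the expression used in `fix.sh`; it only shows, on a shortened version of the header line, how a comma inside a quoted field confuses `cut` and how a substitution can change that one comma. The variable names are illustrative.

```shell
# Shortened header: the quoted field "status,_link" contains a comma.
header='status_type,"status,_link",status_published'

# cut splits on every comma, so field 3 is the tail of the quoted field.
naive_field3=$(echo "$header" | cut -d"," -f3)

# s/,_/;_/ replaces the first occurrence of ",_" on the line with ";_".
fixed=$(echo "$header" | sed 's/,_/;_/')

# After the substitution, field 3 is the column we expected.
fixed_field3=$(echo "$fixed" | cut -d"," -f3)

echo "$naive_field3"
echo "$fixed"
echo "$fixed_field3"
```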

Please see the Bash script provided to you, `fix.sh`. More information on the data preprocessing activity is given in the file "Additional Context - Grep, Sed, CSVs and Commas.pdf".

The bash script "fix.sh" contains a complex regular expression to clean the file. There is no need to understand this expression since mastering regular expressions is not the goal of this module.

6. Upload and execute the file fix.sh.
Now print only columns 4 and 6 in your amended file.

In [10]:
!bash fix.sh

7. Now write commands and scripts to answer the following questions about the dataset:

    a. How many statuses of each type are there?

In [11]:
!head -n 1 facebookdata-clean.csv | grep "status"

status_id,status_message,link_name,status_type,"status;_link",status_published,num_reactions,num_comments,num_shares,num_likes,num_loves,num_wows,num_hahas,num_sads,num_angrys
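
The header shows that `status_type` is the 4th field, so the `cut`, `sort`, `uniq -c` pipeline from the earlier exercise can be reused. The sketch below runs it on a small hypothetical sample with only four columns; it assumes the cleaned file has no commas left inside fields, and on the real data the file name would be `facebookdata-clean.csv`.

```shell
# Hypothetical miniature of the cleaned file (first four columns only).
sample=$(mktemp)
cat > "$sample" <<'EOF'
status_id,status_message,link_name,status_type
1,msg a,name a,video
2,msg b,name b,link
3,msg c,name c,video
4,msg d,name d,photo
EOF

# Keep column 4, drop the header line, group equal values, count each group.
type_counts=$(cut -d"," -f4 "$sample" | tail -n +2 | sort | uniq -c)

# Sort the counts numerically in reverse order to find the most frequent type.
most_common=$(echo "$type_counts" | sort -n -r | head -n 1)

echo "$type_counts"
echo "$most_common"
```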


b. Find the 10 most popular status entries. To do so, add up all the values you find in columns 8-15 of each entry. Your script should look something like the following:

    #!/bin/bash

    # declare 10 variables (initialise them with 0)

    # read the output of a command line by line
    # (a plain `for line in $(...)` would split on spaces, not on lines)
    while IFS= read -r line; do

      # get the values (cut) into several variables:
      num_comments=???
      num_shares=???
      num_likes=???
      # etc.

      # add the values
      # keep this sum only if it is among the top 10:
      # think insertion sort?
    done < <(command-similar-to-previous-question)

    # print the 10 status entries

Arithmetic expansion and evaluation in Bash is done by placing an integer expression using the following format `$((expression))`, e.g., `$(( n1+n2 ))`.
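
As a small worked example of arithmetic expansion, the sketch below sums columns 8-15 of a single row. The row is a shortened, hypothetical entry (the text fields are placeholders, the numbers follow one of the rows printed by `tail` above), and the variable names are illustrative; ranking the entries is left to the exercise.

```shell
# Shortened row: fields 1-6 are placeholders, fields 7-15 are the counters.
row='id,message,name,link,url,2016-02-24 11:26:08,395,14,19,375,4,3,7,2,4'

# Extract two individual counters with cut and add them.
num_comments=$(echo "$row" | cut -d"," -f8)
num_shares=$(echo "$row" | cut -d"," -f9)
comments_plus_shares=$(( num_comments + num_shares ))

# Sum fields 8-15: turn the commas into spaces and add the values one by one.
total=0
for value in $(echo "$row" | cut -d"," -f8-15 | tr ',' ' '); do
  total=$(( total + value ))
done

echo "comments + shares = $comments_plus_shares"
echo "sum of columns 8-15 = $total"
```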

In [None]:
# Solve