# Data preparation in the shell

UNIX has a great set of tools to clean and analyze data.

This is important because [these tools are](https://jeroenjanssens.com/dsatcl/chapter-1-introduction#why-data-science-at-the-command-line):

- **Agile**: You can quickly explore data and see the results.
- **Fast**: They're written in C. They're easily parallelizable.
- **Popular**: Most systems and languages support shell commands.

In this notebook, we'll explore log files with these shell-based commands.

## Download logs

[This file](https://drive.google.com/file/d/1J1ed4iHFAiS1Xq55aP858OEyEMQ-uMnE/view) has Apache web server logs for the site [s-anand.net](https://s-anand.net/) in the month of April 2024.

You can download files using `wget` or `curl`. One of these is usually available by default on most systems.

We'll use `curl` to download the file from the URL `https://drive.usercontent.google.com/uc?id=1J1ed4iHFAiS1Xq55aP858OEyEMQ-uMnE&export=download`

In [1]:
# curl has LOTS of options. You won't remember most, but it's fun to geek out.
!curl --help all

Usage: curl [options...] <url>
     --abstract-unix-socket <path> Connect via abstract Unix domain socket
     --alt-svc <file name> Enable alt-svc with this cache file
     --anyauth            Pick any authentication method
 -a, --append             Append to target file when uploading
     --aws-sigv4 <provider1[:provider2[:region[:service]]]> Use AWS V4 signature authentication
     --basic              Use HTTP Basic Authentication
     --cacert <file>      CA certificate to verify peer against
     --capath <dir>       CA directory to verify peer against
 -E, --cert <certificate[:password]> Client certificate file and password
     --cert-status        Verify the status of the server cert via OCSP-staple
     --cert-type <type>   Certificate type (DER/PEM/ENG)
     --ciphers <list of ciphers> SSL ciphers to use
     --compressed         Request compressed response
     --compressed-ssh     Enable SSH compression
 -K, --config <file>      Read config from a file
     --connect-tim

In [2]:
# We're using 3 curl options here:
#   --continue-at - resumes the download from where it left off. It won't download again if the file is already complete
#   --location follows redirects, so the file downloads even if the link sends us somewhere else
#   --output FILE saves the downloaded output as FILE
# The URL is quoted because it contains `&`, which the shell would otherwise read as "run in the background"
!curl --continue-at - \
  --location \
  --output s-anand.net-Apr-2024.gz \
  "https://drive.usercontent.google.com/uc?id=1J1ed4iHFAiS1Xq55aP858OEyEMQ-uMnE&export=download"

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 5665k  100 5665k    0     0  3139k      0  0:00:01  0:00:01 --:--:-- 9602k


## List files

`ls` lists files. It too has lots of options.

In [3]:
!ls --help

Usage: ls [OPTION]... [FILE]...
List information about the FILEs (the current directory by default).
Sort entries alphabetically if none of -cftuvSUX nor --sort is specified.

Mandatory arguments to long options are mandatory for short options too.
  -a, --all                  do not ignore entries starting with .
  -A, --almost-all           do not list implied . and ..
      --author               with -l, print the author of each file
  -b, --escape               print C-style escapes for nongraphic characters
      --block-size=SIZE      with -l, scale sizes by SIZE when printing them;
                               e.g., '--block-size=M'; see SIZE format below
  -B, --ignore-backups       do not list implied entries ending with ~
  -c                         with -lt: sort by, and show, ctime (time of last
                               modification of file status information);
                               with -l: show ctime and sort by name;
                               othe

In [4]:
# By default, it just lists all file names
!ls

sample_data  s-anand.net-Apr-2024.gz


In [5]:
# If we want to see the size of the file, use `-l` for the long-listing format
!ls -l

total 5672
drwxr-xr-x 1 root root    4096 Jun  6 14:21 sample_data
-rw-r--r-- 1 root root 5801198 Jun  9 05:18 s-anand.net-Apr-2024.gz


## Uncompress the log file

`gzip` is the most popular compression format on the web. It's fast and pretty good. (`xz` compresses better but is slower.)

Since the file has a `.gz` extension, we know it's compressed using `gzip`. We can use `gzip -d FILE.gz` to decompress the file. It'll replace `FILE.gz` with `FILE`.

(Compression works the opposite way. `gzip FILE` replaces `FILE` with `FILE.gz`)

In [6]:
# gzip -d is the same as gunzip. They both decompress a GZIP-ed file
!gzip -d s-anand.net-Apr-2024.gz

In [7]:
# Let's list the files and see the size
!ls -l

total 50832
drwxr-xr-x 1 root root     4096 Jun  6 14:21 sample_data
-rw-r--r-- 1 root root 52044491 Jun  9 05:18 s-anand.net-Apr-2024


In this case, a file that was ~5.8 MB became ~52 MB, roughly 9 times larger. Clearly, it's more efficient to store and transport compressed files, especially if they're plain text.
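
You don't have to decompress a file on disk just to peek inside it. Here's a minimal sketch on a tiny made-up file (`sample.log` is a stand-in, not the real log): `gzip -dc` writes the decompressed data to standard output and leaves the `.gz` file untouched.

```shell
# Work in a scratch directory with a small made-up file (not the real log)
tmp=$(mktemp -d)
printf 'line 1\nline 2\nline 3\n' > "$tmp/sample.log"

# Compress it: this replaces sample.log with sample.log.gz
gzip "$tmp/sample.log"

# -d decompresses, -c writes to standard output instead of replacing the file
preview=$(gzip -dc "$tmp/sample.log.gz" | head -n 2)
echo "$preview"
```

This is handy when the uncompressed file would be too large to keep around.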

## Preview the logs

To see the first few lines or the last few lines of a text file, use `head` or `tail`.

In [8]:
# Show the first 5 lines
!head -n 5 s-anand.net-Apr-2024

17.241.219.11 - - [31/Mar/2024:07:16:50 -0500] "GET /hindi/Hari_Puttar_-_A_Comedy_of_Terrors~Meri_Yaadon_Mein_Hai_Tu HTTP/1.1" 200 2839 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Safari/605.1.15 (Applebot/0.1; +http://www.apple.com/go/applebot)" www.s-anand.net 192.254.190.216
17.241.75.154 - - [31/Mar/2024:07:17:40 -0500] "GET /hindimp3/~AAN_MILO_SAJNA%3DRANG_RANG_KE_PHOOL_KHILE HTTP/1.1" 200 2786 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Safari/605.1.15 (Applebot/0.1; +http://www.apple.com/go/applebot)" www.s-anand.net 192.254.190.216
101.44.248.120 - - [31/Mar/2024:07:19:03 -0500] "GET /hindi/BRAHMCHARI HTTP/1.1" 200 2757 "http://www.s-anand.net/hindi/BRAHMCHARI" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36" www.s-anand.net 192.254.190.216
17.241.227.200 - - [31/Mar/2024:07:19:31

In [9]:
# Show the last 5 lines
!tail -n 5 s-anand.net-Apr-2024

47.128.125.180 - - [30/Apr/2024:07:07:47 -0500] "GET /tamil/Subramaniyapuram HTTP/1.1" 406 226 "-" "Mozilla/5.0 (compatible; Bytespider; spider-feedback@bytedance.com) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.0.0 Safari/537.36" www.s-anand.net 192.254.190.216
37.59.21.100 - - [30/Apr/2024:07:10:27 -0500] "GET /blog/bollywood-actress-jigsaw-quiz/feed/ HTTP/1.1" 200 1072 "-" "Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.66 Safari/537.36" www.s-anand.net 192.254.190.216
40.77.167.48 - - [30/Apr/2024:07:11:10 -0500] "GET /tamilmp3 HTTP/1.1" 200 4157 "-" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/116.0.1938.76 Safari/537.36" www.s-anand.net 192.254.190.216
52.167.144.19 - - [30/Apr/2024:07:11:15 -0500] "GET /malayalam/Ayirathil%20Oruvan HTTP/1.1" 403 450 "-" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)

The logs run from a bit after 7 am on 31 Mar 2024 until a bit after 7 am on 30 Apr 2024. The `-0500` offset puts these timestamps in GMT-5 (EST).

Each line is an Apache log record with several fields. Some are self-explanatory. For example, taking the last row:

- `37.59.21.100` is the IP address that made a request. That's from [OVH](https://www.whois.com/whois/37.59.21.100) - a French cloud provider. Maybe a bot.
- `[30/Apr/2024:07:11:31 -0500]` is the time of the request
- `"GET /blog/2003-mumbai-bloggers-meet-photos/feed/ HTTP/1.1"` is the request made to [this page](https://s-anand.net/blog/2003-mumbai-bloggers-meet-photos/feed/)
- `200` is the HTTP response status code, indicating that all's well
- `686` bytes was the size of the response
- `"Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.66 Safari/537.36"` is the user agent. That's Chrome 30 -- a really old version of Chrome on Linux. Very likely a bot.
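
The quoted fields (request, referer, user agent) contain spaces, so splitting on spaces breaks them up. One rough way to pull them out is to split on the double quote instead. This is only a sketch on a single line copied from the `tail` output above, and it assumes no field contains an embedded quote.

```shell
# One log line copied from the tail output above
line='37.59.21.100 - - [30/Apr/2024:07:10:27 -0500] "GET /blog/bollywood-actress-jigsaw-quiz/feed/ HTTP/1.1" 200 1072 "-" "Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.66 Safari/537.36" www.s-anand.net 192.254.190.216'

# Splitting on " makes every even-numbered field the inside of a quoted string
# (-d and -f are the short forms of --delimiter and --fields)
request=$(printf '%s\n' "$line" | cut -d '"' -f 2)
referer=$(printf '%s\n' "$line" | cut -d '"' -f 4)
agent=$(printf '%s\n' "$line" | cut -d '"' -f 6)

echo "request: $request"
echo "referer: $referer"
echo "agent:   $agent"
```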

## Count requests

`wc` counts the number of lines, words, and bytes in a file. With data files, the line count is usually the most useful.

In [10]:
!wc s-anand.net-Apr-2024

  208539  4194545 52044491 s-anand.net-Apr-2024


So, in Apr 2024, there were ~208K requests to the site. Useful to know.
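
If you only want the line count, `wc -l` prints just that. A small sketch on a made-up 3-line file (not the real log):

```shell
# A made-up 3-line file standing in for the log
tmp=$(mktemp)
printf 'a b\nc d\ne f\n' > "$tmp"

# -l prints only the line count; reading from standard input omits the file name
requests=$(wc -l < "$tmp")
echo "$requests"
```

Reading via `<` keeps the file name out of the output, which makes the number easier to reuse in a script.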

I wonder: **Who is sending most of these requests?**

Let's extract the IP addresses and count them.

## Extract the `IP` column

We'll use `cut` to cut the first column. It has 2 options that we'll use.

- `--delimiter` is the character that splits fields. In the log file, it's a space. (We'll confirm this shortly.)
- `--fields` picks the field to cut. We want field 1 (IP address).
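
`--fields` also accepts a comma-separated list, so you can cut several columns at once. Here's a sketch on two shortened log lines (the user agents are trimmed for brevity), picking the IP address (field 1) and the status code (field 9 when splitting on spaces):

```shell
# Two shortened log lines (user agents trimmed for brevity)
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
47.128.125.180 - - [30/Apr/2024:07:07:47 -0500] "GET /tamil/Subramaniyapuram HTTP/1.1" 406 226 "-" "Bytespider"
37.59.21.100 - - [30/Apr/2024:07:10:27 -0500] "GET /blog/bollywood-actress-jigsaw-quiz/feed/ HTTP/1.1" 200 1072 "-" "Chrome/30.0.1599.66"
EOF

# -d and -f are the short forms of --delimiter and --fields
ip_status=$(cut -d ' ' -f 1,9 "$tmp")
echo "$ip_status"
```

The field numbers hold only as long as the request path has no spaces in it.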

Let's preview this:

In [11]:
# Preview just the IP addresses from the logs
!cut --delimiter " " --fields 1 s-anand.net-Apr-2024 | head -n 5

17.241.219.11
17.241.75.154
101.44.248.120
17.241.227.200
37.59.21.100


We used the `|` operator. That passes the output of `cut` to the next command, `head -n 5`, which gives us the first 5 lines. This is called **piping** and is the equivalent of calling a function inside another in programming languages.

We'll use `sort` to sort these IP addresses. That puts the same IP addresses next to each other.

In [12]:
# Preview the SORTED IP addresses from the logs
!cut --delimiter " " --fields 1 s-anand.net-Apr-2024 | sort | head -n 5

100.20.65.50
100.43.111.139
101.100.145.51
101.115.156.11
101.115.205.68


There are no duplicates there... maybe we need to go a bit further? Let's check the top 25 lines.

In [13]:
# Preview the first 25 SORTED IP addresses from the logs
!cut --delimiter " " --fields 1 s-anand.net-Apr-2024 | sort | head -n 25

100.20.65.50
100.43.111.139
101.100.145.51
101.115.156.11
101.115.205.68
101.126.25.225
101.132.248.41
101.166.40.221
101.166.6.221
101.183.40.167
101.185.221.147
101.188.225.246
101.200.218.166
101.201.66.35
101.2.187.83
101.2.187.83
101.2.187.83
101.2.187.83
101.2.187.83
101.2.187.83
101.2.187.83
101.44.160.158
101.44.160.158
101.44.160.177
101.44.160.177


OK, there are some duplicates. Good to know.

We'll use `uniq` to collapse repeated IP addresses into one line each. Its `--count` option prefixes each line with the number of times it occurred.

**NOTE**: `uniq` only merges duplicates that are next to each other, so it works ONLY on sorted input. You NEED to `sort` first.
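
To see why the sort matters, here's a tiny sketch with made-up IP addresses where the duplicate is NOT adjacent:

```shell
# Three made-up IPs; the duplicate 1.1.1.1 is separated by another line
ips='1.1.1.1
2.2.2.2
1.1.1.1'

# Without sort, uniq sees no adjacent duplicates and keeps all 3 lines
unsorted=$(printf '%s\n' "$ips" | uniq | wc -l)

# With sort, the two 1.1.1.1 lines sit together and collapse into one
sorted=$(printf '%s\n' "$ips" | sort | uniq | wc -l)

echo "without sort: $unsorted lines, with sort: $sorted lines"
```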

In [14]:
!cut --delimiter " " --fields 1 s-anand.net-Apr-2024 | sort | uniq --count | head -n 25

      1 100.20.65.50
      1 100.43.111.139
      1 101.100.145.51
      1 101.115.156.11
      1 101.115.205.68
      1 101.126.25.225
      1 101.132.248.41
      1 101.166.40.221
      1 101.166.6.221
      1 101.183.40.167
      1 101.185.221.147
      1 101.188.225.246
      1 101.200.218.166
      1 101.201.66.35
      7 101.2.187.83
      2 101.44.160.158
      2 101.44.160.177
      2 101.44.160.189
      3 101.44.160.20
      2 101.44.160.41
      1 101.44.161.208
      1 101.44.161.71
      3 101.44.161.77
      2 101.44.161.93
      2 101.44.162.166


That's useful. [101.2.187.83](https://www.whois.com/whois/101.2.187.83) from Colombo visited 7 times.

But I'd like to know who visited the MOST. So let's `sort` it further.

`sort` has an option `--key 1n` that sorts by field `1` -- the count of IP addresses in this case. The `n` indicates that it's a numeric sort (so 11 appears AFTER 2).

Also, we'll use `tail` instead of `head` to get the highest entries.

In [15]:
# Show the top 5 IP addresses by visits
!cut --delimiter " " --fields 1 s-anand.net-Apr-2024 | sort | uniq --count | sort --key 1n | tail -n 5

   2560 66.249.70.6
   3010 148.251.241.12
   4245 35.86.164.73
   7800 37.59.21.100
 101255 136.243.228.193


WOW! [136.243.228.193](https://www.whois.com/whois/136.243.228.193) from Dataforseo, Ukraine, sent roughly HALF of ALL the requests!
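
A variation you may find handy: the long options used above are GNU-style, and some systems (BSD or macOS, for example) may accept only the short forms. The sketch below uses short flags on made-up data, and sorts in reverse (`-r`) so that `head` shows the biggest counts first.

```shell
# Made-up log lines: only the first column (IP) matters here
tmp=$(mktemp)
printf '%s\n' \
  '10.0.0.1 - - x' \
  '10.0.0.2 - - x' \
  '10.0.0.1 - - x' \
  '10.0.0.3 - - x' \
  '10.0.0.1 - - x' \
  '10.0.0.2 - - x' > "$tmp"

# -c is --count; -rn sorts numerically, largest first
top=$(cut -d ' ' -f 1 "$tmp" | sort | uniq -c | sort -rn | head -n 2)
echo "$top"
```

This lists `10.0.0.1` (3 requests) first, then `10.0.0.2` (2 requests).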

I wonder if we can figure out what User Agent they send. Is it something that identifies itself as a bot of some kind?

## Find lines matching an IP

`grep` searches for text in files. It uses [Regular Expressions](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_expressions) which are a powerful set of wildcards.

💡 TIP: You **MUST** learn regular expressions. They're very helpful.

Here, we'll search for all lines BEGINNING with 136.243.228.193 and having a space after that. That's `"^136.243.228.193 "`. The `^` at the beginning matches the start of a line.
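
One subtlety: in a regular expression, `.` matches ANY character, not just a dot. For IP addresses at the start of a log line that rarely matters, but escaping the dots as `\.` is the precise form. A small sketch with a made-up lookalike line shows the difference:

```shell
# One real-looking line and one made-up lookalike with other characters in place of dots
tmp=$(mktemp)
printf '%s\n' \
  '136.243.228.193 - - real' \
  '136x243y228z193 - - lookalike' > "$tmp"

# grep -c counts matching lines
loose=$(grep -c '^136.243.228.193 ' "$tmp")       # unescaped dots match both lines
strict=$(grep -c '^136\.243\.228\.193 ' "$tmp")   # escaped dots match only the real IP

echo "unescaped: $loose, escaped: $strict"
```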

In [16]:
# Preview lines that begin with 136.243.228.193
!grep "^136.243.228.193 " s-anand.net-Apr-2024 | head -n 5

136.243.228.193 - - [31/Mar/2024:11:27:43 -0500] "GET /kannadamp3 HTTP/1.1" 200 4162 "-" "Mozilla/5.0 (compatible; DataForSeoBot/1.0; +https://dataforseo.com/dataforseo-bot)" www.s-anand.net 192.254.190.216
136.243.228.193 - - [31/Mar/2024:11:31:07 -0500] "GET /kannadamp3 HTTP/1.1" 200 4162 "-" "Mozilla/5.0 (compatible; DataForSeoBot/1.0; +https://dataforseo.com/dataforseo-bot)" www.s-anand.net 192.254.190.216
136.243.228.193 - - [03/Apr/2024:17:46:42 -0500] "GET /robots.txt HTTP/1.1" 200 195 "-" "Mozilla/5.0 (compatible; DataForSeoBot/1.0; +https://dataforseo.com/dataforseo-bot)" www.s-anand.net 192.254.190.216
136.243.228.193 - - [06/Apr/2024:02:58:43 -0500] "GET /Statistically_improbable_phrases.html HTTP/1.1" 301 - "-" "Mozilla/5.0 (compatible; DataForSeoBot/1.0; +https://dataforseo.com/dataforseo-bot)" www.s-anand.net 192.254.190.216
136.243.228.193 - - [08/Apr/2024:22:38:25 -0500] "GET /robots.txt HTTP/1.1" 200 195 "-" "Mozilla/5.0 (compatible; DataForSeoBot/1.0; +https://datafor

These requests have clearly identified themselves as `DataForSeoBot/1.0`, which is helpful. It also seems to be crawling `robots.txt` to check if it's allowed to crawl the site, which is polite.

Let's look at the second IP address: [37.59.21.100](https://www.whois.com/whois/37.59.21.100). That seems to be from OVH, a French cloud hosting provider. Is that a bot, too?

In [17]:
# Preview lines that begin with 37.59.21.100
!grep "^37.59.21.100 " s-anand.net-Apr-2024 | head -n 5

37.59.21.100 - - [31/Mar/2024:07:19:41 -0500] "GET /blog/matching-misspelt-tamil-movie-names/feed/ HTTP/1.1" 200 1105 "-" "Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.66 Safari/537.36" www.s-anand.net 192.254.190.216
37.59.21.100 - - [31/Mar/2024:07:19:53 -0500] "GET /blog/hindi-songs-online/feed/ HTTP/1.1" 200 1382 "-" "Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.66 Safari/537.36" www.s-anand.net 192.254.190.216
37.59.21.100 - - [31/Mar/2024:07:24:26 -0500] "GET /blog/check-your-mobile-phones-serial-number/feed/ HTTP/1.1" 200 1572 "-" "Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.66 Safari/537.36" www.s-anand.net 192.254.190.216
37.59.21.100 - - [31/Mar/2024:07:33:10 -0500] "GET /blog/classical-ilayaraja-2/feed/ HTTP/1.1" 200 1286 "-" "Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.66 Safari/537.36" www.s-anand.net 192.254.

Looking at the user agent, `Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.66 Safari/537.36`, it looks like Chrome 30 -- a very old version.

Personally, I believe it's more likely to be a bot than a French human so interested in my website that they made over 250 requests *every day*.

## Find bots

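But, I'm curious. What are the user agents that DO identify themselves as bots? Let's use `grep` to find all words that contain `bot`.

`grep --only-matching` will show only the matches, not the entire line.

The regular expression `'\S*bot\S*'` (which ChatGPT generated) finds every space-separated token that contains `bot`, punctuation included. The first cell below uses a stricter variant, `'\b\w*bot\w*\b'`, which keeps only the word itself.

- `\S` matches non-space characters
- `\S*` matches 0 or more non-space characters
- `\w` matches a word character (letter, digit or underscore)
- `\b` matches a word boundary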
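
The two cells below use slightly different patterns: `'\b\w*bot\w*\b'` and `'\S*bot\S*'`. Here's a sketch of how they differ on one sample user agent. It assumes GNU `grep`, since `\w`, `\b` and `\S` are GNU extensions.

```shell
# A sample user agent string, including its surrounding quotes
agent='"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'

# \w* stops at punctuation, so we get just the words containing "bot"
words=$(printf '%s\n' "$agent" | grep -o '\b\w*bot\w*\b')

# \S* stops only at spaces, so punctuation and URLs come along
tokens=$(printf '%s\n' "$agent" | grep -o '\S*bot\S*')

echo "$words"
echo "$tokens"
```

The first pattern yields `Googlebot` and `bot`. The second yields `Googlebot/2.1;` and `+http://www.google.com/bot.html)"`, trailing quote included.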

In [18]:
# Find all words with `bot` in them
!grep --only-matching '\b\w*bot\w*\b' s-anand.net-Apr-2024 | head

Applebot
applebot
Applebot
applebot
Applebot
applebot
Applebot
applebot
Applebot
applebot


In [19]:
# Count the frequency of every token containing `bot` and show the top 10
!grep --only-matching '\S*bot\S*' s-anand.net-Apr-2024 | sort | uniq --count | sort --key 1n | tail

   4134 PetalBot;+https://webmaster.petalsearch.com/site/petalbot)"
   4307 /robots.txt
   5664 bingbot/2.0;
   5664 +http://www.bing.com/bingbot.htm)
   8771 +claudebot@anthropic.com)"
   8827 +http://www.google.com/bot.html)"
   8830 Googlebot/2.1;
  13798 (Applebot/0.1;
  13798 +http://www.apple.com/go/applebot)"
 101262 +https://dataforseo.com/dataforseo-bot)"


That gives me a rough sense of who's crawling my site.

1. [DataForSEO](https://dataforseo.com/)
2. [Apple](https://www.apple.com/)
3. [Google](https://www.google.com/)
4. [Anthropic](https://www.anthropic.com/)
5. [Bing](https://www.bing.com/)
6. [PetalBot](https://aspiegel.com/petalbot)

## Convert logs to CSV

This file is *almost* a CSV file separated by spaces instead of commas.

The main problem is the date. Instead of `[31/Mar/2024:11:27:43 -0500]` it should have been `"31/Mar/2024:11:27:43 -0500"`

We'll use `sed` (stream editor) to replace the characters. `sed` is like `grep` but lets you replace, not just search.

(Actually, `sed` can do a lot more. It's a full-fledged editor. You can insert, delete, edit, etc. programmatically. In fact, `sed` has truly remarkable features that this paragraph is too small to contain.)

The regular expression we will use is `\[\([^]]*\)\]`. The way this works is:

- `\[`: Match the opening square bracket.
- `\([^]]*\)`: Capture everything inside the square brackets. `[^]]*` matches any run of characters except `]`, so the match stops at the first closing bracket.
- `\]`: Match the closing square bracket.

BTW, I didn't create this. [ChatGPT did](https://chatgpt.com/share/7f14e9d2-15ec-4562-b263-61547d2230f3).

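`sed "s/abc/xyz/" FILE` prints FILE with the first `abc` on each line replaced by `xyz`. The file itself is left unchanged, which is why we redirect the output to a new file. We can use the regular expression above for the search and `"\1"` for the replacement: it inserts the captured group enclosed in double quotes.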
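
Before running this on the whole file, it helps to try it on a single line. A sketch using the start of one log line from the `head` output earlier (shortened after the response size):

```shell
# Start of a log line from the head output above (shortened)
line='101.44.248.120 - - [31/Mar/2024:07:19:03 -0500] "GET /hindi/BRAHMCHARI HTTP/1.1" 200 2757'

# Basic regular expressions: capture groups are written \( ... \)
quoted=$(printf '%s\n' "$line" | sed 's/\[\([^]]*\)\]/"\1"/')

# Same thing with -E (extended regular expressions): groups are plain ( ... )
quoted_ere=$(printf '%s\n' "$line" | sed -E 's/\[([^]]*)\]/"\1"/')

echo "$quoted"
```

Both forms turn `[31/Mar/2024:07:19:03 -0500]` into `"31/Mar/2024:07:19:03 -0500"` and leave the rest of the line alone.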

In [20]:
# Replace [datetime] etc. with "datetime" and save as log.csv
!sed 's/\[\([^]]*\)\]/"\1"/' s-anand.net-Apr-2024 > log.csv

In [21]:
# We should now have a log.csv that's roughly the same size as the original file.
!ls -l

total 101660
-rw-r--r-- 1 root root 52044491 Jun  9 05:19 log.csv
drwxr-xr-x 1 root root     4096 Jun  6 14:21 sample_data
-rw-r--r-- 1 root root 52044491 Jun  9 05:18 s-anand.net-Apr-2024


You can download this `log.csv` and open it in Excel as a CSV file with space as the delimiter.

But when I did that, I faced another problem. Some of the lines had extra columns.

That's because the "User Agent" values sometimes contain a quote. CSV files are supposed to escape quotes with `""` -- two double quotes. But Apache uses `\"` instead.

I'll leave it as an exercise for you to fix that.
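
If you'd like to compare notes afterwards, here's one possible approach (a sketch, not the only answer): replace every `\"` with `""` using `sed`. The sample line is made up, and this assumes a backslash never appears right before a field's closing quote.

```shell
# A made-up line whose user agent contains Apache-style escaped quotes
line='1.2.3.4 - - "31/Mar/2024:07:16:50 -0500" "GET / HTTP/1.1" 200 10 "-" "Agent \"X\" 1.0"'

# \\ matches a literal backslash; the g flag replaces every occurrence on the line
fixed=$(printf '%s\n' "$line" | sed 's/\\"/""/g')
echo "$fixed"
```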

## More commands

We've covered the commands most often used to process data before analysis.

Here are a few more that you'll find useful.

- `cat` concatenates multiple files. You can join multiple log files with this, for example
- `awk` is almost a full-fledged programming language. It's often used for summing up values
- `less` lets you open and read files, scrolling through them
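
As a taste of `awk`, here's a sketch that sums the response sizes. When a log line is split on spaces, the size is field 10. The three lines are shortened, with made-up paths `/a`, `/b` and `/c`.

```shell
# Three shortened log lines with made-up paths; the last has "-" as its size
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
17.241.219.11 - - [31/Mar/2024:07:16:50 -0500] "GET /a HTTP/1.1" 200 2839
17.241.75.154 - - [31/Mar/2024:07:17:40 -0500] "GET /b HTTP/1.1" 200 2786
101.44.248.120 - - [31/Mar/2024:07:19:03 -0500] "GET /c HTTP/1.1" 301 -
EOF

# awk splits each line on whitespace; $10 is the 10th field.
# A non-numeric value like "-" counts as 0 in arithmetic.
total=$(awk '{ sum += $10 } END { print sum }' "$tmp")
echo "$total"
```

Here the total is 2839 + 2786 = 5625 bytes.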

You can read the book [Data Science at the Command Line](https://jeroenjanssens.com/dsatcl/) for more tools and examples.