# Special Commands: sed, awk, grep, sort

---

This tutorial covers four command-line tools that are fundamental to text processing in shell scripts:

- **grep** - Pattern Searching
- **sed** - Stream Editor
- **awk** - Pattern Processing Language
- **sort** - Sorting Text

## Setup: Create Sample Data

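Let's create a sample employee dataset to use throughout this tutorial. Each line holds four comma-separated fields (name, department, salary, location), and there is no header row: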

In [1]:
cat << 'EOF' > /tmp/employees.txt
John Doe,Sales,50000,New York
Jane Smith,Engineering,75000,San Francisco
Bob Johnson,Marketing,45000,Chicago
Alice Brown,Engineering,80000,San Francisco
Charlie Wilson,Sales,52000,New York
Diana Davis,HR,48000,Boston
EOF

echo "Sample data created:"
cat /tmp/employees.txt

Sample data created:
John Doe,Sales,50000,New York
Jane Smith,Engineering,75000,San Francisco
Bob Johnson,Marketing,45000,Chicago
Alice Brown,Engineering,80000,San Francisco
Charlie Wilson,Sales,52000,New York
Diana Davis,HR,48000,Boston


## grep - Pattern Searching

### Basic grep Usage

In [2]:
echo "Employees in Engineering:"
grep "Engineering" /tmp/employees.txt

echo
echo "Case-insensitive search for 'john':"
grep -i "john" /tmp/employees.txt

echo
echo "Number of Sales employees:"
grep -c "Sales" /tmp/employees.txt

Employees in Engineering:
Jane Smith,Engineering,75000,San Francisco
Alice Brown,Engineering,80000,San Francisco

Case-insensitive search for 'john':
John Doe,Sales,50000,New York
Bob Johnson,Marketing,45000,Chicago

Number of Sales employees:
2
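The case-insensitive search for `john` also returns "Bob Johnson", because grep matches substrings by default. The sketch below shows two ways to narrow a match: `-w` (whole-word matching, available in GNU and BSD grep) and a pattern anchored to a specific comma-separated field. The temporary file and the variable names are illustrative, not part of the tutorial's own cells.

```shell
# Scratch copy of the tutorial's sample data (the temp path is illustrative).
data=$(mktemp)
cat << 'EOF' > "$data"
John Doe,Sales,50000,New York
Jane Smith,Engineering,75000,San Francisco
Bob Johnson,Marketing,45000,Chicago
Alice Brown,Engineering,80000,San Francisco
Charlie Wilson,Sales,52000,New York
Diana Davis,HR,48000,Boston
EOF

# -w matches "john" only as a whole word, so "Johnson" no longer matches.
whole_word=$(grep -iw "john" "$data")
echo "$whole_word"

# Anchoring the pattern to the second comma-separated field matches the
# department column only, not a name or city that happens to contain "Sales".
sales_rows=$(grep -E '^[^,]*,Sales,' "$data")
echo "$sales_rows"

rm -f "$data"
```

Anchoring to a field position is usually the safer choice for delimited data, since it does not depend on what the other columns happen to contain.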


### Advanced grep Options

In [3]:
echo "Employees in Sales or HR:"
grep -E "Sales|HR" /tmp/employees.txt

echo
echo "Non-Engineering employees:"
grep -v "Engineering" /tmp/employees.txt

echo
echo "With line numbers:"
grep -n "San Francisco" /tmp/employees.txt

Employees in Sales or HR:
John Doe,Sales,50000,New York
Charlie Wilson,Sales,52000,New York
Diana Davis,HR,48000,Boston

Non-Engineering employees:
John Doe,Sales,50000,New York
Bob Johnson,Marketing,45000,Chicago
Charlie Wilson,Sales,52000,New York
Diana Davis,HR,48000,Boston

With line numbers:
2:Jane Smith,Engineering,75000,San Francisco
4:Alice Brown,Engineering,80000,San Francisco

In [4]:
echo "Extract only salary information:"
grep -o "[0-9]\{5\}" /tmp/employees.txt

Extract only salary information:
50000
75000
45000
80000
52000
48000


## sed - Stream Editor

### Basic sed Operations

In [5]:
echo "Replace 'Engineering' with 'Development':"
sed 's/Engineering/Development/' /tmp/employees.txt

echo
echo "Replace all commas with pipes:"
sed 's/,/|/g' /tmp/employees.txt | head -3

Replace 'Engineering' with 'Development':
John Doe,Sales,50000,New York
Jane Smith,Development,75000,San Francisco
Bob Johnson,Marketing,45000,Chicago
Alice Brown,Development,80000,San Francisco
Charlie Wilson,Sales,52000,New York
Diana Davis,HR,48000,Boston


Replace all commas with pipes:
John Doe|Sales|50000|New York
Jane Smith|Engineering|75000|San Francisco
Bob Johnson|Marketing|45000|Chicago


In [6]:
echo "Delete HR employees:"
sed '/HR/d' /tmp/employees.txt

Delete HR employees:
John Doe,Sales,50000,New York
Jane Smith,Engineering,75000,San Francisco
Bob Johnson,Marketing,45000,Chicago
Alice Brown,Engineering,80000,San Francisco
Charlie Wilson,Sales,52000,New York


### Advanced sed Usage

In [7]:
echo "Multiple operations:"
sed -e 's/Engineering/Tech/' -e 's/Sales/Business/' /tmp/employees.txt

echo
echo "Add header:"
sed '1i\Name,Department,Salary,Location' /tmp/employees.txt | head -4

Multiple operations:
John Doe,Business,50000,New York
Jane Smith,Tech,75000,San Francisco
Bob Johnson,Marketing,45000,Chicago
Alice Brown,Tech,80000,San Francisco
Charlie Wilson,Business,52000,New York
Diana Davis,HR,48000,Boston


Add header:
Name,Department,Salary,Location
John Doe,Sales,50000,New York
Jane Smith,Engineering,75000,San Francisco
Bob Johnson,Marketing,45000,Chicago


In [8]:
echo "Extract names only:"
sed 's/^\([^,]*\).*/\1/' /tmp/employees.txt

echo
echo "Add dollar signs to salaries:"
sed 's/\([0-9]\{5\}\)/$\1/' /tmp/employees.txt | head -3

Extract names only:
John Doe
Jane Smith
Bob Johnson
Alice Brown
Charlie Wilson
Diana Davis


Add dollar signs to salaries:
John Doe,Sales,$50000,New York
Jane Smith,Engineering,$75000,San Francisco
Bob Johnson,Marketing,$45000,Chicago
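Two further sed features often come up alongside the ones above: printing only selected lines with `-n` and the `p` command, and editing a file in place. The sketch below is a minimal illustration on a scratch copy of the sample data. The `-i.bak` form (in-place edit with a backup suffix) is not part of POSIX but is accepted by both GNU and BSD sed; the variable names are illustrative.

```shell
# Scratch copy of the tutorial's sample data (the temp path is illustrative).
data=$(mktemp)
cat << 'EOF' > "$data"
John Doe,Sales,50000,New York
Jane Smith,Engineering,75000,San Francisco
Bob Johnson,Marketing,45000,Chicago
Alice Brown,Engineering,80000,San Francisco
Charlie Wilson,Sales,52000,New York
Diana Davis,HR,48000,Boston
EOF

# -n suppresses automatic printing; the p command prints only matching lines.
sf_rows=$(sed -n '/San Francisco/p' "$data")
echo "$sf_rows"

# A line-number range restricts the command to lines 2 through 3.
middle=$(sed -n '2,3p' "$data")
echo "$middle"

# -i.bak rewrites the file in place and keeps the original as "$data.bak".
sed -i.bak 's/Engineering/Development/' "$data"
changed=$(grep -c "Development" "$data")
kept=$(grep -c "Engineering" "$data.bak")
echo "Rewritten lines: $changed, originals preserved in backup: $kept"

rm -f "$data" "$data.bak"
```

Keeping a backup suffix is a cautious default, since an in-place substitution with a wrong pattern cannot otherwise be undone.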


## awk - Pattern Processing Language

### Basic awk Concepts

In [9]:
echo "Print names and salaries:"
awk -F',' '{print $1, $3}' /tmp/employees.txt

echo
echo "Formatted output:"
awk -F',' '{printf "Name: %-15s Salary: $%s\n", $1, $3}' /tmp/employees.txt

Print names and salaries:
John Doe 50000
Jane Smith 75000
Bob Johnson 45000
Alice Brown 80000
Charlie Wilson 52000
Diana Davis 48000


Formatted output:
Name: John Doe        Salary: $50000
Name: Jane Smith      Salary: $75000
Name: Bob Johnson     Salary: $45000
Name: Alice Brown     Salary: $80000
Name: Charlie Wilson  Salary: $52000
Name: Diana Davis     Salary: $48000


In [10]:
echo "High earners (>60000):"
awk -F',' '$3 > 60000' /tmp/employees.txt

echo
echo "Count and average salary:"
awk -F',' '{count++; sum+=$3} END {print "Employees:", count, "Average:", sum/count}' /tmp/employees.txt

High earners (>60000):
Jane Smith,Engineering,75000,San Francisco
Alice Brown,Engineering,80000,San Francisco


Count and average salary:
Employees: 6 Average: 58333.3
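When `print` outputs a non-integer number, awk formats it with its `OFMT` setting (default `%.6g`), so the average is cut to six significant digits, and depending on the locale and the awk implementation the decimal separator may appear as a comma. A minimal sketch of a more predictable version uses `printf` with an explicit format and pins the locale with `LC_ALL=C`. The temporary file and the `avg` variable are illustrative.

```shell
# Scratch copy of the tutorial's sample data (the temp path is illustrative).
data=$(mktemp)
cat << 'EOF' > "$data"
John Doe,Sales,50000,New York
Jane Smith,Engineering,75000,San Francisco
Bob Johnson,Marketing,45000,Chicago
Alice Brown,Engineering,80000,San Francisco
Charlie Wilson,Sales,52000,New York
Diana Davis,HR,48000,Boston
EOF

# printf with %.2f fixes the number of decimals; LC_ALL=C pins the decimal
# separator to a period regardless of the user's locale.
# The NR > 0 guard avoids a division by zero on an empty file.
avg=$(LC_ALL=C awk -F',' '{ sum += $3 } END { if (NR > 0) printf "%.2f\n", sum / NR }' "$data")
echo "Average: $avg"

rm -f "$data"
```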


### Advanced awk Programming

In [11]:
echo "Department summary:"
awk -F',' '
{
    dept[$2]++
    total[$2] += $3
}
END {
    for (d in dept) {
        printf "%s: %d employees, avg salary: $%.0f\n", d, dept[d], total[d]/dept[d]
    }
}' /tmp/employees.txt

Department summary:
Marketing: 1 employees, avg salary: $45000
Engineering: 2 employees, avg salary: $77500
Sales: 2 employees, avg salary: $51000
HR: 1 employees, avg salary: $48000


In [12]:
echo "Engineering department details:"
awk -F',' '
BEGIN { print "Engineering Employees:" }
$2 == "Engineering" { print "  " $1 " - $" $3 " (" $4 ")" }
END { print "--- End ---" }
' /tmp/employees.txt

Engineering department details:
Engineering Employees:
  Jane Smith - $75000 (San Francisco)
  Alice Brown - $80000 (San Francisco)
--- End ---


### awk Built-in Variables

In [13]:
echo "Line numbers and field counts:"
awk -F',' '{print "Line " NR ": " NF " fields"}' /tmp/employees.txt

echo
echo "Name lengths:"
awk -F',' '{print $1 " (length: " length($1) ")"}' /tmp/employees.txt

Line numbers and field counts:
Line 1: 4 fields
Line 2: 4 fields
Line 3: 4 fields
Line 4: 4 fields
Line 5: 4 fields
Line 6: 4 fields


Name lengths:
John Doe (length: 8)
Jane Smith (length: 10)
Bob Johnson (length: 11)
Alice Brown (length: 11)
Charlie Wilson (length: 14)
Diana Davis (length: 11)
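Besides `NR` and `NF`, two other built-ins are worth knowing: `OFS`, the separator awk places between fields on output, and `$NF`, which refers to the last field of the current record. The following sketch converts the sample data to tab-separated form and prints the last field for a range of record numbers. The temporary file and variable names are illustrative.

```shell
# Scratch copy of the tutorial's sample data (the temp path is illustrative).
data=$(mktemp)
cat << 'EOF' > "$data"
John Doe,Sales,50000,New York
Jane Smith,Engineering,75000,San Francisco
Bob Johnson,Marketing,45000,Chicago
Alice Brown,Engineering,80000,San Francisco
Charlie Wilson,Sales,52000,New York
Diana Davis,HR,48000,Boston
EOF

# FS is the input separator, OFS the output separator. Assigning to a field
# ($1 = $1) makes awk rebuild the record using OFS.
tsv=$(awk 'BEGIN { FS = ","; OFS = "\t" } { $1 = $1; print }' "$data")
echo "$tsv"

# NR selects records 2 and 3; $NF is the last field (the location here).
last_fields=$(awk -F',' 'NR >= 2 && NR <= 3 { print NR ": " $NF }' "$data")
echo "$last_fields"

rm -f "$data"
```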


## sort - Sorting Text

### Basic Sorting

In [14]:
echo "Alphabetical sort:"
sort /tmp/employees.txt | head -3

echo
echo "Reverse alphabetical:"
sort -r /tmp/employees.txt | head -3

Alphabetical sort:
Alice Brown,Engineering,80000,San Francisco
Bob Johnson,Marketing,45000,Chicago
Charlie Wilson,Sales,52000,New York


Reverse alphabetical:
John Doe,Sales,50000,New York
Jane Smith,Engineering,75000,San Francisco
Diana Davis,HR,48000,Boston


In [15]:
echo "Sort by salary (numeric):"
sort -t',' -k3,3n /tmp/employees.txt

echo
echo "Sort by department, then by salary:"
sort -t',' -k2,2 -k3,3n /tmp/employees.txt

Sort by salary (numeric):
Bob Johnson,Marketing,45000,Chicago
Diana Davis,HR,48000,Boston
John Doe,Sales,50000,New York
Charlie Wilson,Sales,52000,New York
Jane Smith,Engineering,75000,San Francisco
Alice Brown,Engineering,80000,San Francisco


Sort by department, then by salary:
Jane Smith,Engineering,75000,San Francisco
Alice Brown,Engineering,80000,San Francisco
Diana Davis,HR,48000,Boston
Bob Johnson,Marketing,45000,Chicago
John Doe,Sales,50000,New York
Charlie Wilson,Sales,52000,New York


### Advanced Sorting

In [16]:
echo "Sort by salary (highest first):"
sort -t',' -k3,3nr /tmp/employees.txt

echo
echo "Unique departments:"
cut -d',' -f2 /tmp/employees.txt | sort -u

Sort by salary (highest first):
Alice Brown,Engineering,80000,San Francisco
Jane Smith,Engineering,75000,San Francisco
Charlie Wilson,Sales,52000,New York
John Doe,Sales,50000,New York
Diana Davis,HR,48000,Boston
Bob Johnson,Marketing,45000,Chicago


Unique departments:
Engineering
HR
Marketing
Sales
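A detail that is easy to miss with `-k`: a key written as `-k2` runs from field 2 to the end of the line, while `-k2,2` stops at the end of field 2. Per-key modifiers such as `n` and `r` can be attached to each key separately. The sketch below sorts by department ascending and, within each department, by salary descending. `LC_ALL=C` is set only to make the ordering reproducible across systems; the temporary file and variable names are illustrative.

```shell
# Scratch copy of the tutorial's sample data (the temp path is illustrative).
data=$(mktemp)
cat << 'EOF' > "$data"
John Doe,Sales,50000,New York
Jane Smith,Engineering,75000,San Francisco
Bob Johnson,Marketing,45000,Chicago
Alice Brown,Engineering,80000,San Francisco
Charlie Wilson,Sales,52000,New York
Diana Davis,HR,48000,Boston
EOF

# -k2,2   : first key is the department field only (plain text comparison)
# -k3,3nr : second key is the salary field only, numeric and reversed
by_dept=$(LC_ALL=C sort -t',' -k2,2 -k3,3nr "$data")
echo "$by_dept"

rm -f "$data"
```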


## Combining Commands

### Complex Data Processing Pipeline

In [17]:
echo "Top earning employees by department:"
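awk -F',' '
# The file has no header row, so every line is processed.
# Track the highest salary seen so far in each department ($2).
!($2 in max) || $3 + 0 > max[$2] { max[$2] = $3 + 0; top[$2] = $1 }
END { for (d in max) print d "," top[d] "," max[d] }
' /tmp/employees.txt | sort -t',' -k3,3nr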

Top earning employees by department:
Engineering,Alice Brown,80000
Sales,Charlie Wilson,52000
HR,Diana Davis,48000
Marketing,Bob Johnson,45000


In [18]:
echo "Employee count by location (sorted):"
grep -v "^$" /tmp/employees.txt | cut -d',' -f4 | sort | uniq -c | sort -nr

Employee count by location (sorted):
      2 San Francisco
      2 New York
      1 Chicago
      1 Boston


### Log Analysis Example

In [19]:
# Create sample log
cat << 'EOF' > /tmp/access.log
192.168.1.100 - - [26/Oct/2023:14:32:10] "GET /index.html" 200 1024
192.168.1.101 - - [26/Oct/2023:14:32:15] "GET /about.html" 200 2048
192.168.1.100 - - [26/Oct/2023:14:32:20] "POST /login" 401 512
192.168.1.102 - - [26/Oct/2023:14:32:25] "GET /index.html" 200 1024
192.168.1.101 - - [26/Oct/2023:14:32:30] "GET /products" 404 256
EOF

echo "Top IP addresses by request count:"
awk '{print $1}' /tmp/access.log | sort | uniq -c | sort -nr

echo
echo "HTTP status code distribution:"
# In this log format the status code is field 7 (fields are split on whitespace)
awk '{print $7}' /tmp/access.log | sort | uniq -c | sort -nr

echo
echo "Most requested pages:"
# The path is field 6; strip the closing quote left over from the request string
awk '{gsub(/"/, "", $6); print $6}' /tmp/access.log | sort | uniq -c | sort -nr

Top IP addresses by request count:
      2 192.168.1.101
      2 192.168.1.100
      1 192.168.1.102

HTTP status code distribution:
      3 200
      1 404
      1 401

Most requested pages:
      2 /index.html
      1 /products
      1 /login
      1 /about.html
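Field positions depend entirely on the log layout, so it helps to check which column holds which value before writing a pipeline. A quick way to do that, sketched here on one line copied from the sample log, is to print every field together with its index. The `line` and `fields` variable names are illustrative.

```shell
# One line from the sample access log used in this tutorial.
line='192.168.1.100 - - [26/Oct/2023:14:32:10] "GET /index.html" 200 1024'

# Print each whitespace-separated field with its position.
fields=$(printf '%s\n' "$line" | awk '{ for (i = 1; i <= NF; i++) printf "%d: %s\n", i, $i }')
echo "$fields"
```

For this layout the listing shows the timestamp in field 4, the request path in field 6 (with a trailing quote), and the status code in field 7.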


### Data Cleaning and Formatting

In [20]:
# Create messy data
cat << 'EOF' > /tmp/messy_data.txt
John  Doe  ,Sales, 50000,  New York
Jane Smith,Engineering,75000,San Francisco
   Bob Johnson,Marketing,45000,Chicago

Alice Brown,Engineering,80000,San Francisco
EOF

echo "Cleaned and formatted data:"
# Drop blank lines, trim each line, tidy spacing around commas,
# squeeze repeated spaces inside fields, then print aligned columns.
sed '/^[[:space:]]*$/d' /tmp/messy_data.txt | \
sed 's/^[[:space:]]*//;s/[[:space:]]*$//' | \
sed 's/[[:space:]]*,[[:space:]]*/,/g;s/[[:space:]]\{2,\}/ /g' | \
awk -F',' '{printf "%-20s %-15s %8s %-15s\n", $1, $2, $3, $4}'

Cleaned and formatted data:
John Doe             Sales              50000 New York       
Jane Smith           Engineering        75000 San Francisco  
Bob Johnson          Marketing          45000 Chicago        
Alice Brown          Engineering        80000 San Francisco  


## Practical Example: Generate Summary Report

In [21]:
echo "=== Employee Summary Report ==="
echo

awk -F',' '
BEGIN {
    print "Department Summary"
    print "=================="
}
{
    dept[$2]++
    total_salary[$2] += $3
    if ($3 > max_salary[$2]) {
        max_salary[$2] = $3
        highest_paid[$2] = $1
    }
}
END {
    for (d in dept) {
        avg = total_salary[d] / dept[d]
        printf "\nDepartment: %s\n", d
        printf "  Employees: %d\n", dept[d]
        printf "  Total Salary: $%d\n", total_salary[d]
        printf "  Average Salary: $%.0f\n", avg
        printf "  Highest Paid: %s ($%d)\n", highest_paid[d], max_salary[d]
    }
}' /tmp/employees.txt

=== Employee Summary Report ===

Department Summary
==================

Department: Marketing
  Employees: 1
  Total Salary: $45000
  Average Salary: $45000
  Highest Paid: Bob Johnson ($45000)

Department: Engineering
  Employees: 2
  Total Salary: $155000
  Average Salary: $77500
  Highest Paid: Alice Brown ($80000)

Department: Sales
  Employees: 2
  Total Salary: $102000
  Average Salary: $51000
  Highest Paid: Charlie Wilson ($52000)

Department: HR
  Employees: 1
  Total Salary: $48000
  Average Salary: $48000
  Highest Paid: Diana Davis ($48000)

## Exercise

Create a script called `log_analyzer.sh` that processes web server log files and generates a comprehensive report. The script should:

1. Extract the top 10 IP addresses by request count
2. Identify the most popular pages (excluding .css, .js, .png)
3. Generate an hourly request distribution
4. Find error rates by status code
5. Format the output as a readable report

Use combinations of `grep`, `sed`, `awk`, and `sort` to achieve this analysis.
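One possible starting point for steps 2 and 3 is sketched below. It assumes log lines follow the same layout as the sample `access.log` used earlier (timestamp in field 4, request path in field 6); real server logs may differ, so the field numbers would need adjusting. The sample lines, the temporary file, and the variable names are made up for illustration.

```shell
# Hypothetical log lines in the same layout as the tutorial's sample log.
log=$(mktemp)
cat << 'EOF' > "$log"
192.168.1.100 - - [26/Oct/2023:14:32:10] "GET /index.html" 200 1024
192.168.1.101 - - [26/Oct/2023:14:32:15] "GET /style.css" 200 2048
192.168.1.100 - - [26/Oct/2023:15:02:20] "POST /login" 401 512
192.168.1.102 - - [26/Oct/2023:15:10:25] "GET /index.html" 200 1024
EOF

# Step 2: drop requests for static assets, then count the remaining paths.
pages=$(grep -Ev '\.(css|js|png)"' "$log" \
    | awk '{ gsub(/"/, "", $6); print $6 }' \
    | sort | uniq -c | sort -k1,1nr)
echo "$pages"

# Step 3: the hour is the second colon-separated piece of the timestamp field.
hourly=$(awk '{ split($4, t, ":"); count[t[2]]++ }
              END { for (h in count) print h, count[h] }' "$log" | sort)
echo "$hourly"

rm -f "$log"
```

The remaining steps (top IP addresses, error rates, report formatting) can follow the same count-then-sort pattern.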

In [22]:
# Your solution here
# Hint: Start by examining the log structure
echo "Sample log analysis:"
echo "Top 3 IPs:"
awk '{print $1}' /tmp/access.log | sort | uniq -c | sort -nr | head -3

echo
echo "Status codes:"
awk '{print $7}' /tmp/access.log | sort | uniq -c

Sample log analysis:
Top 3 IPs:
      2 192.168.1.101
      2 192.168.1.100
      1 192.168.1.102

Status codes:
      3 200
      1 401
      1 404


## Cleanup

In [23]:
# Remove temporary files
rm -f /tmp/employees.txt /tmp/access.log /tmp/messy_data.txt
echo "Temporary files cleaned up"

Temporary files cleaned up
