# Linux

### Word Count

Write a Bash script to calculate the frequency of each word in a text file.
For example, if a file has the following content:

```
The quick brown fox jumped over the lazy dog.
Then the dish ran away with the spoon.

```

Your script should output:

```
The 1
quick 1
brown 1
fox 1
jumped 1
over 1
the 3
lazy 1
dog 1
Then 1
dish 1
ran 1
away 1
with 1
spoon 1
```

#### Environment
* Developed with Bash in a Jupyter notebook, because at this point I am hoping to keep my solutions consistent

### Thought process
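* `uniq`, maybe combined with `sort`?
* Will google that
* Getting the exact output is very tricky! I made a few mistakes with `cat $file | .. | .. | .. > $file`, which left me with empty files: the shell truncates `$file` for the redirect before `cat` gets to read it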
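A minimal sketch of one way around the empty-file mistake above: write the pipeline output to a temporary file and only then replace the original. The helper name `split_words_in_place` and the demo file are made up for illustration, and the sketch assumes `mktemp` is available.

```bash
#!/usr/bin/env bash

# Hypothetical helper: split a file into one word per line, in place.
split_words_in_place() {
    local file=$1 tmp
    tmp=$(mktemp) || return 1
    # Read from the original, write to the temp file, and only replace
    # the original once the pipeline has succeeded.
    tr ' ' '\n' < "$file" > "$tmp" && mv "$tmp" "$file"
}

# Demo on a throwaway file (assumed name and content)
demo_file=$(mktemp)
printf 'The quick\n' > "$demo_file"
split_words_in_place "$demo_file"
cat "$demo_file"
rm -f "$demo_file"
```

Note that the replaced file takes on the permissions of the temporary file, which may matter outside of a quick experiment like this one.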

#### Attempt 1
`cat $file_name | tr ' ' '\n' | sort | uniq -c`
* `cat` - output content of file
* `tr ' ' '\n'` - translate/convert space to newline (so sort picks it up)
* `sort` - get repeated words together for uniq
* `uniq -c` - reduce list to unique values with their count
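To see why `sort` has to come before `uniq -c`, here is a small sketch on a made-up three-word input. `uniq` only merges duplicate lines that are adjacent, so without `sort` a repeated word can be counted more than once.

```bash
#!/usr/bin/env bash

# Made-up input: "the" appears twice, but not on adjacent lines.
words=$'the\nfox\nthe'

# Without sort: the two "the" lines are not adjacent,
# so uniq reports them as two separate entries.
printf '%s\n' "$words" | uniq -c

# With sort: duplicates end up next to each other
# and collapse into a single line with a count of 2.
printf '%s\n' "$words" | sort | uniq -c
```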

#### Attempt 2
`cat $file_name | tr -d '.' | tr ' ' '\n' | sort | uniq -c`
* `cat` - output content of file
* `tr -d '.'` - Remove '.' from file content
* `tr ' ' '\n'` - translate/convert space to newline (so sort picks it up)
* `sort` - get repeated words together for uniq
* `uniq -c` - reduce list to unique values with their count
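Attempt 2 only strips `.`, which is enough for the example file. A possible generalization, sketched below, deletes every punctuation character with the `[:punct:]` class and treats any run of whitespace as a word break. The function name `word_counts` is an assumption for illustration. One trade-off: `[:punct:]` also removes apostrophes and hyphens, so `don't` becomes `dont`.

```bash
#!/usr/bin/env bash

# Hypothetical helper: count words after removing all punctuation.
word_counts() {
    tr -d '[:punct:]' < "$1" |      # delete punctuation characters
        tr -s '[:space:]' '\n' |    # whitespace runs -> single newline
        sed '/^$/d' |               # drop an empty line from leading whitespace
        sort | uniq -c
}

# Demo on a throwaway file (assumed content)
demo_file=$(mktemp)
printf 'Hi, hi! Hi?\n' > "$demo_file"
word_counts "$demo_file"
rm -f "$demo_file"
```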

In [1]:
%%bash
####################
# Attempt 1
####################

file_name=words_test.txt
truncate -s 0 $file_name

echo -e "The quick brown fox jumped over the lazy dog.\
\nThen the dish ran away with the spoon." >> $file_name

echo -e "\nFile Content:"
cat $file_name

echo -e "\nWord Frequency:"
cat $file_name | tr ' ' '\n' | sort | uniq -c 

# Cleanup
echo -e "\nPlease remove '$file_name' from '$(pwd)'"


File Content:
The quick brown fox jumped over the lazy dog.
Then the dish ran away with the spoon.

Word Frequency:
      1 away
      1 brown
      1 dish
      1 dog.
      1 fox
      1 jumped
      1 lazy
      1 over
      1 quick
      1 ran
      1 spoon.
      3 the
      1 The
      1 Then
      1 with

Please remove 'words_test.txt' from '/home/dvagg/Desktop/FlashPointCode'


In [2]:
%%bash
####################
# Attempt 2
####################

file_name=words_test.txt
truncate -s 0 $file_name

echo -e "The quick brown fox jumped over the lazy dog.\
\nThen the dish ran away with the spoon." >> $file_name

echo -e "\nFile Content:"
cat $file_name

echo -e "\nWord frequency.."
cat $file_name | tr -d '.' | tr ' ' '\n' | sort | uniq -c

# Cleanup
echo -e "\nPlease remove '$file_name' from '$(pwd)'"


File Content:
The quick brown fox jumped over the lazy dog.
Then the dish ran away with the spoon.

Word frequency..
      1 away
      1 brown
      1 dish
      1 dog
      1 fox
      1 jumped
      1 lazy
      1 over
      1 quick
      1 ran
      1 spoon
      3 the
      1 The
      1 Then
      1 with

Please remove 'words_test.txt' from '/home/dvagg/Desktop/FlashPointCode'


#### Solution (messy, but matches output)
```
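# Split the text into one word per line (with '.' removed)
cat "$file_name" | tr -d '.' | tr ' ' '\n' > "$file_name_words"
# Loop over the unique words in order of first appearance
for word in $(cat -n "$file_name_words" | sort -k2 -k1n | uniq -f1 | sort -nk1,1 | cut -f2-); do
    # -Fx matches whole lines literally, so "the" does not also count "bathe"
    grep -Fx -- "$word" "$file_name_words" | uniq -c | awk '{print $2,$1}'
done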
```
* Make a new file from the original with every word on its own line (with `.` removed)
* For each word (duplicate lines removed, but order of first appearance kept):
    * `grep` the word from the words file and pipe the matches through `uniq -c` to get the count
    * For this input the matches are only ever N copies of the same word, so `uniq -c` gives a single `count word` line in one go; `awk` then swaps the order to `word count`
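For comparison, the same "count but keep first-appearance order" idea can be sketched as a single pass with `awk`, without the intermediate words file. This is a different technique from the loop above, not a cleanup of it, and the function name `word_freq_in_order` is made up for illustration.

```bash
#!/usr/bin/env bash

# Hypothetical helper: print "word count" in order of first appearance.
word_freq_in_order() {
    tr -d '.' < "$1" | tr -s '[:space:]' '\n' | awk '
        NF == 0 { next }                    # skip empty lines
        !($0 in count) { order[++n] = $0 }  # remember the first time a word is seen
        { count[$0]++ }                     # count every occurrence
        END { for (i = 1; i <= n; i++) print order[i], count[order[i]] }
    '
}

# Demo using the example text from the problem statement
demo_file=$(mktemp)
printf 'The quick brown fox jumped over the lazy dog.\nThen the dish ran away with the spoon.\n' > "$demo_file"
word_freq_in_order "$demo_file"
rm -f "$demo_file"
```

The `order` array records each new word once, so the `END` loop can print in input order; a plain `for (w in count)` loop in `awk` would not guarantee any particular order.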

In [6]:
%%bash
####################
# Attempt 3
# Rough solution
####################

file_name=words_test.txt
file_name_words=wordsep_test.txt
truncate -s 0 $file_name
truncate -s 0 $file_name_words

echo -e "The quick brown fox jumped over the lazy dog.\
\nThen the dish ran away with the spoon." >> $file_name

echo -e "\nFile Content:"
cat $file_name

echo -e "\nFrequency:"
# First separate the lines into words (removing '.')
cat $file_name | tr -d '.' | tr ' ' '\n' > $file_name_words
# For every word in that file (remove duplicate lines, but maintain order)
for word in $(cat -n $file_name_words | sort -k2 -k1n  | uniq -f1 | sort -nk1,1 | cut -f2-); do
    grep -Fx -- "$word" $file_name_words | uniq -c | awk '{print $2,$1}' # -Fx matches whole lines only; 'uniq -c' then gives us the desired output in one go
done

# Cleanup
echo -e "\nPlease remove '$file_name' and '$file_name_words' from '$(pwd)'"


File Content:
The quick brown fox jumped over the lazy dog.
Then the dish ran away with the spoon.

Frequency:
The 1
quick 1
brown 1
fox 1
jumped 1
over 1
the 3
lazy 1
dog 1
Then 1
dish 1
ran 1
away 1
with 1
spoon 1

Please remove 'words_test.txt' and 'wordsep_test.txt' from '/home/dvagg/Desktop/FlashPointCode'
