Installation `pip install bash_kernel ; python -m bash_kernel.install`

# Getting help

To get help on any shell command, use `man`, which displays the command's manual page.

In [None]:
unzip data.zip

In [None]:
man man

TAB completion: pressing the TAB key completes partially typed command and file names.

# Getting around

The content of a directory is shown by the "list" `ls` command. 

In [None]:
ls

In [None]:
ls data

You can get more information using command-line options (switches). Check the `man` page to see what `-a` and `-l` mean. Note that both can be combined into one as `-la`.

In [None]:
man ls

In [None]:
ls -la

In [None]:
ls -la data

What do we see here? Each line shows the
* File permissions (more on that later)
* Number of hard links
* File owner (user)
* File group
* File size
* Modification date
* Filename

`.` and `..` are two special directories pointing to the directory itself (`.`) or parent directory (`..`).

You can change the directory using `cd` ("change directory"). You can see where you are with `pwd` ("print working directory").

In [None]:
pwd

In [None]:
cd data
pwd

In [None]:
cd ..
pwd
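As a small self-contained sketch of `cd` and `pwd`, the following creates a throwaway directory with `mktemp -d` and moves in and out of it. The scratch directory and the name `sub` are assumptions for illustration and are not part of the Jeopardy data.

```shell
# Create a scratch directory with one subdirectory (names are hypothetical).
scratch=$(mktemp -d)
mkdir "$scratch/sub"

cd "$scratch/sub"
inside=$(basename "$(pwd)")    # last path component of the working directory
cd ..                          # .. is the parent directory
outside=$(pwd)

echo "$inside"                 # prints: sub
echo "$outside"                # the scratch directory itself

cd / && rm -rf "$scratch"      # clean up
```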

# Wildcards

The shell also understands wildcards, which it expands to all matching file names.

In [None]:
ls data/1990*.json

`*` matches any sequence of characters (including the empty one). `?` matches exactly one character.

In [None]:
ls data/1990-1?-*.json

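`[abc]` matches any single character from the set `a`, `b`, `c`. The characters are listed without separators; a comma inside the brackets would itself be treated as a member of the set.

In [None]:
ls data/199[124]-1?-*.json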

`[a-b]` specifies a range of characters, again matching a single character.

In [None]:
ls data/199[1-5]-12-*.json

Curly braces `{a,b}` expand to every comma-separated term inside the braces (terms may contain wildcards), which acts like a logical OR, i.e. `a OR b`.

In [None]:
ls data/{1990,1991}-{01,1?}-*.json

`[!a-b]` matches any single character that is **not** in the range/set specified in `[  ]`.

In [None]:
ls data/199[!1-7]-01-*.json
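To try the patterns without the real data, here is a minimal sketch that creates a few empty files in a scratch directory. The file names are made up to resemble the data set. It uses `echo` instead of `ls` to illustrate that the shell itself expands the pattern before the command runs.

```shell
# Scratch directory with empty files named like the data set (made-up names).
scratch=$(mktemp -d)
cd "$scratch"
touch 1990-01-05.json 1990-12-24.json 1991-11-02.json 1998-01-20.json

star=$(echo 1990*.json)              # * matches any sequence of characters
qmark=$(echo 199?-01-*.json)         # ? matches exactly one character
set_match=$(echo 199[01]-1?-*.json)  # [01] matches one character: 0 or 1
negated=$(echo 199[!1-7]-*.json)     # [!1-7] matches one character outside 1..7

echo "$star"       # prints: 1990-01-05.json 1990-12-24.json
echo "$qmark"      # prints: 1990-01-05.json 1998-01-20.json
echo "$set_match"  # prints: 1990-12-24.json 1991-11-02.json
echo "$negated"    # prints: 1990-01-05.json 1990-12-24.json 1998-01-20.json

cd / && rm -rf "$scratch"            # clean up
```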

# First look at data

The data contains Jeopardy questions from different years in `json` format. 

In [None]:
ls -la data

Since typing `ls -la` all the time is tedious, `bash` allows us to define a shortcut with `alias`.

In [None]:
alias ll="ls -la"

In [None]:
ll
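Aliases are mainly meant for interactive use; inside scripts, `bash` does not expand them by default. A shell function is one possible alternative that works in both settings. The sketch below defines a function named `ll` (mirroring the alias above) and runs it on a scratch directory; the file name `example.json` is hypothetical.

```shell
# A function with the same effect as the alias; "$@" forwards all arguments.
ll() {
    ls -la "$@"
}

scratch=$(mktemp -d)
touch "$scratch/example.json"      # hypothetical file name

listing=$(ll "$scratch")
echo "$listing"

rm -rf "$scratch"                  # clean up
```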

# Pipes

There is a lot of information, so maybe we only want to see the first few results. We can do that by "rerouting" the output of one command into the next command using a "pipe" `|`, here into the `head` command.

In [None]:
ll data/*.json | head -n 20

We can also use the command `wc` to count how many files we have; with `-l` it counts lines.

In [None]:
ls -1 data

In [None]:
ll data/*.json | wc -l
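The same pipe mechanics can be tried on a few lines of generated text. This is only a sketch: the five words stand in for a long directory listing, and `tr -d ' '` is added because some `wc` implementations pad their output with spaces.

```shell
# Five lines of sample input (placeholder for a long listing).
lines=$(printf '%s\n' one two three four five)

first_two=$(echo "$lines" | head -n 2)       # first two lines
last_two=$(echo "$lines" | tail -n 2)        # last two lines
count=$(echo "$lines" | wc -l | tr -d ' ')   # number of lines

echo "$first_two"
echo "$last_two"
echo "$count"      # prints: 5
```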

We can use the command `du` to check how much space the entire directory takes. 

In [5]:
du -sh data

 58M	data


# Looking at single files

We can also look at single files using `cat` or `less`. 

In [7]:
cat data/1984-09-10.json

[{"category": "LAKES & RIVERS", "air_date": "1984-09-10", "question": "'River mentioned most often in the Bible'", "value": "$100", "answer": "the Jordan", "round": "Jeopardy!", "show_number": "1"}, {"category": "INVENTIONS", "air_date": "1984-09-10", "question": "'Marconi's wonderful wireless'", "value": "$100", "answer": "the radio", "round": "Jeopardy!", "show_number": "1"}, {"category": "ANIMALS", "air_date": "1984-09-10", "question": "'These rodents first got to America by stowing away on ships'", "value": "$100", "answer": "rats", "round": "Jeopardy!", "show_number": "1"}, {"category": "FOREIGN CUISINE", "air_date": "1984-09-10", "question": "'The \"coq\" in coq au vin'", "value": "$100", "answer": "chicken", "round": "Jeopardy!", "show_number": "1"}, {"category": "ACTORS & ROLES", "air_date": "1984-09-10", "question": "'Video in which Michael Jackson plays a werewolf & a zombie'", "value": "$100", "answer": "\"Thriller\"", "round": "Jeopardy!", "show_number": "1"}, {"category": 

That is not very readable. Inserting a line break after each question should help readability. We can do this with the stream editor `sed`, which replaces strings using the syntax `s/PATTERN/REPLACEMENT/g`. `s` stands for substitution; `g` means that all non-overlapping matches of the regular expression are replaced, not just the first one on each line. Separators other than `/` are possible, too.

In [10]:
sed "s/}, {/},\n{/g" data/1984-09-10.json | tail -n 5

{"category": "'50'S TV", "air_date": "1984-09-10", "question": "'Name under which experimenter Don Herbert taught viewers all about science'", "value": "$1000", "answer": "Mr. Wizard", "round": "Double Jeopardy!", "show_number": "1"},
{"category": "NATIONAL LANDMARKS", "air_date": "1984-09-10", "question": "'D.C. building shaken by November '83 bomb blast'", "value": "$1000", "answer": "the Capitol", "round": "Double Jeopardy!", "show_number": "1"},
{"category": "NOTORIOUS", "air_date": "1984-09-10", "question": "'After the deed, he leaped to the stage shouting \"Sic semper tyrannis\"'", "value": "$1000", "answer": "John Wilkes Booth", "round": "Double Jeopardy!", "show_number": "1"},
{"category": "4-LETTER WORDS", "air_date": "1984-09-10", "question": "'The president takes one before stepping into office'", "value": "$1000", "answer": "oath", "round": "Double Jeopardy!", "show_number": "1"},
{"category": "HOLIDAYS", "air_date": "1984-09-10", "question": "'The third Monday of January s
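The effect of the `g` flag and of an alternative separator can be seen on short strings. The inputs below are made up for illustration and are unrelated to the Jeopardy data.

```shell
# Without g only the first match on each line is replaced.
first_only=$(echo "a-b-c" | sed 's/-/+/')
# With g every non-overlapping match is replaced.
all=$(echo "a-b-c" | sed 's/-/+/g')
# Using | as the separator avoids escaping the slashes in a path.
path=$(echo "/usr/local/bin" | sed 's|/usr/local|/opt|')

echo "$first_only"   # prints: a+b-c
echo "$all"          # prints: a+b+c
echo "$path"         # prints: /opt/bin
```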

# Little task

Let's assume we want to collect all questions worth $100. First we need the lines that match that amount. We can use the command `grep` for this; it prints only the lines that match a pattern.

In [13]:
sed "s/}, {/},\n{/g" data/1984-09-10.json | grep '"value": "$100"'

[{"category": "LAKES & RIVERS", "air_date": "1984-09-10", "question": "'River mentioned most often in the Bible'", "value": "$100", "answer": "the Jordan", "round": "Jeopardy!", "show_number": "1"},
{"category": "INVENTIONS", "air_date": "1984-09-10", "question": "'Marconi's wonderful wireless'", "value": "$100", "answer": "the radio", "round": "Jeopardy!", "show_number": "1"},
{"category": "ANIMALS", "air_date": "1984-09-10", "question": "'These rodents first got to America by stowing away on ships'", "value": "$100", "answer": "rats", "round": "Jeopardy!", "show_number": "1"},
{"category": "FOREIGN CUISINE", "air_date": "1984-09-10", "question": "'The \"coq\" in coq au vin'", "value": "$100", "answer": "chicken", "round": "Jeopardy!", "show_number": "1"},
{"category": "ACTORS & ROLES", "air_date": "1984-09-10", "question": "'Video in which Michael Jackson plays a werewolf & a zombie'", "value": "$100", "answer": "\"Thriller\"", "round": "Jeopardy!", "show_number": "1"},
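In a regular expression `$` normally anchors the end of a line, so patterns containing a literal dollar sign deserve some care. One option, sketched here on two made-up records, is `grep -F`, which treats the pattern as a fixed string. The closing quote in the pattern keeps `$100` from also matching `$1000`.

```shell
# Two made-up records, one per line.
records=$(printf '%s\n' \
    '{"question": "q1", "value": "$100"},' \
    '{"question": "q2", "value": "$1000"},')

# -F: match the pattern literally instead of as a regular expression.
matches=$(printf '%s\n' "$records" | grep -F '"value": "$100"')

echo "$matches"    # only the q1 record is printed
```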


This still leaves the opening `[` at the start of the file and the closing `]` at its end attached to the first and last record, so we need to remove those, too.

In [14]:
sed "s/}, {/},\n{/g; s/\[{/{/g; s/}\]/}/g;" data/1984-09-10.json | grep '"value": "$100"'

{"category": "LAKES & RIVERS", "air_date": "1984-09-10", "question": "'River mentioned most often in the Bible'", "value": "$100", "answer": "the Jordan", "round": "Jeopardy!", "show_number": "1"},
{"category": "INVENTIONS", "air_date": "1984-09-10", "question": "'Marconi's wonderful wireless'", "value": "$100", "answer": "the radio", "round": "Jeopardy!", "show_number": "1"},
{"category": "ANIMALS", "air_date": "1984-09-10", "question": "'These rodents first got to America by stowing away on ships'", "value": "$100", "answer": "rats", "round": "Jeopardy!", "show_number": "1"},
{"category": "FOREIGN CUISINE", "air_date": "1984-09-10", "question": "'The \"coq\" in coq au vin'", "value": "$100", "answer": "chicken", "round": "Jeopardy!", "show_number": "1"},
{"category": "ACTORS & ROLES", "air_date": "1984-09-10", "question": "'Video in which Michael Jackson plays a werewolf & a zombie'", "value": "$100", "answer": "\"Thriller\"", "round": "Jeopardy!", "show_number": "1"},


Now let's say we want to filter the content of all files. We can use `xargs` for that; it passes the file names it reads from standard input as arguments to `sed`.

In [15]:
ls data/*.json | xargs  sed "s/}, {/},\n{/g; s/\[{/{/g; s/}\]/}/g;" |  grep '"value": "$100"' | wc -l

    9029
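What `xargs` does can be seen in a small sketch with two text files in a scratch directory. File names and contents are assumptions for illustration.

```shell
# Scratch directory with two small files (made-up names and contents).
scratch=$(mktemp -d)
printf 'a\nb\n' > "$scratch/one.txt"
printf 'c\n'    > "$scratch/two.txt"

# xargs turns the file names arriving on stdin into arguments:
# the pipeline below effectively runs `cat one.txt two.txt`.
total=$(ls "$scratch"/*.txt | xargs cat | wc -l | tr -d ' ')

echo "$total"      # prints: 3

rm -rf "$scratch"  # clean up
```

Note that this simple form splits its input on whitespace, so it is only safe for file names without spaces, which holds for the date-named files used here.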


# Redirect into files

Now, ideally we would like to store that again in a file. We can use the redirection operator `>` for that. 

In [16]:
ls data/*.json | xargs  sed "s/}, {/},\n{/g; s/\[{/{/g; s/}\]/}/g;" |  grep '"value": "$100"' > data/100.json

In [17]:
head data/100.json

{"category": "LAKES & RIVERS", "air_date": "1984-09-10", "question": "'River mentioned most often in the Bible'", "value": "$100", "answer": "the Jordan", "round": "Jeopardy!", "show_number": "1"},
{"category": "INVENTIONS", "air_date": "1984-09-10", "question": "'Marconi's wonderful wireless'", "value": "$100", "answer": "the radio", "round": "Jeopardy!", "show_number": "1"},
{"category": "ANIMALS", "air_date": "1984-09-10", "question": "'These rodents first got to America by stowing away on ships'", "value": "$100", "answer": "rats", "round": "Jeopardy!", "show_number": "1"},
{"category": "FOREIGN CUISINE", "air_date": "1984-09-10", "question": "'The \"coq\" in coq au vin'", "value": "$100", "answer": "chicken", "round": "Jeopardy!", "show_number": "1"},
{"category": "ACTORS & ROLES", "air_date": "1984-09-10", "question": "'Video in which Michael Jackson plays a werewolf & a zombie'", "value": "$100", "answer": "\"Thriller\"", "round": "Jeopardy!", "show_number": "1"},
{"category": "

What did we just do wrong? We added another file to `data` that will match `data/*.json` the next time we run the pipeline, so we remove it again.

In [18]:
rm data/100.json

We could also have moved it with `mv data/100.json ./`

What else?

We are missing the `[` and `]`. They can be added by appending to a file with `>>`. 

In [19]:
echo "Hello World"

Hello World


In [20]:
echo "[" > 100.json
ls data/*.json | xargs  sed "s/}, {/},\n{/g; s/\[{/{/g; s/}\]/}/g;" |  grep '"value": "$100"' >> 100.json
echo "]" >> 100.json
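The difference between `>` and `>>` is easy to check on a scratch file. This is a minimal sketch and the file name `out.txt` is hypothetical.

```shell
scratch=$(mktemp -d)
out="$scratch/out.txt"          # hypothetical output file

echo "first"  > "$out"          # > creates the file
echo "second" > "$out"          # > overwrites: "first" is gone
echo "third" >> "$out"          # >> appends a new line

content=$(cat "$out")
echo "$content"                 # prints "second" and "third" on two lines

rm -rf "$scratch"               # clean up
```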


In [23]:
head 100.json

[
{"category": "LAKES & RIVERS", "air_date": "1984-09-10", "question": "'River mentioned most often in the Bible'", "value": "$100", "answer": "the Jordan", "round": "Jeopardy!", "show_number": "1"},
{"category": "INVENTIONS", "air_date": "1984-09-10", "question": "'Marconi's wonderful wireless'", "value": "$100", "answer": "the radio", "round": "Jeopardy!", "show_number": "1"},
{"category": "ANIMALS", "air_date": "1984-09-10", "question": "'These rodents first got to America by stowing away on ships'", "value": "$100", "answer": "rats", "round": "Jeopardy!", "show_number": "1"},
{"category": "FOREIGN CUISINE", "air_date": "1984-09-10", "question": "'The \"coq\" in coq au vin'", "value": "$100", "answer": "chicken", "round": "Jeopardy!", "show_number": "1"},
{"category": "ACTORS & ROLES", "air_date": "1984-09-10", "question": "'Video in which Michael Jackson plays a werewolf & a zombie'", "value": "$100", "answer": "\"Thriller\"", "round": "Jeopardy!", "show_number": "1"},
{"category":

In [None]:
tail 100.json

# Scripting

Now let's say we would like to do that for all the different dollar amounts. We can use a `for` loop for that and put it into a script.

In [24]:
rm sort_data.sh

In [25]:
touch sort_data.sh

In [26]:
for AMOUNT in 100 200 400 600 800 1000
do
    FILENAME="$AMOUNT.json"
    echo "Processing $FILENAME"
    echo "[" > "$FILENAME"
    ls data/*.json | xargs  sed "s/}, {/},\n{/g; s/\[{/{/g; s/}\]/}/g;" |  grep '"value": "$'$AMOUNT'"' >> "$FILENAME"
    echo "]" >> "$FILENAME"
done

Processing 100.json
Processing 200.json
Processing 400.json
Processing 600.json
Processing 800.json
Processing 1000.json
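The loop structure itself can be sketched independently of the data. The version below writes one small file per amount into a scratch directory; the `.txt` names and their contents are assumptions for illustration.

```shell
scratch=$(mktemp -d)

# Loop over a list of words; AMOUNT takes each value in turn.
for AMOUNT in 100 200 400
do
    FILENAME="$scratch/$AMOUNT.txt"     # hypothetical output name
    echo "value $AMOUNT" > "$FILENAME"
done

count=$(ls "$scratch" | wc -l | tr -d ' ')   # number of files created
second=$(cat "$scratch/200.txt")

echo "$count"      # prints: 3
echo "$second"     # prints: value 200

rm -rf "$scratch"  # clean up
```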


We can also put the loop into a file, here `sort_data.sh`, and execute it.

In [29]:
./sort_data.sh

bash: ./sort_data.sh: Permission denied


: 126

Let's check why we cannot execute it. 

In [30]:
ls -l sort_data.sh

-rw-r--r--  1 fabee  staff  274 Apr 24 13:18 sort_data.sh


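We are missing the execute permission (there is no `x` in `-rw-r--r--`). Permissions are changed with `chmod`: `a-x` removes the execute permission for all users, which changes nothing here since it was not set, while `u+x` adds it for the owning user.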

In [35]:
chmod a-x sort_data.sh

In [36]:
ls -l sort_data.sh

-rw-r--r--  1 fabee  staff  274 Apr 24 13:18 sort_data.sh


In [37]:
chmod u+x sort_data.sh

In [40]:
ls -la sort_data.sh

-rwxr--r--  1 fabee  staff  274 Apr 24 13:18 sort_data.sh


In [41]:
./sort_data.sh

Processing 100.json
Processing 200.json
Processing 400.json
Processing 600.json
Processing 800.json
Processing 1000.json
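The permission steps can be replayed on a throwaway script. This sketch assumes the temporary directory lives on a filesystem that allows executing files; the script name `hello.sh` is hypothetical.

```shell
scratch=$(mktemp -d)
script="$scratch/hello.sh"       # hypothetical script name

# Write a two-line script; the first line names the interpreter.
printf '#!/bin/sh\necho "hello from script"\n' > "$script"

chmod a-x "$script"              # make sure no execute bit is set
if [ -x "$script" ]; then before="executable"; else before="not executable"; fi

chmod u+x "$script"              # add execute permission for the owner
if [ -x "$script" ]; then after="executable"; else after="not executable"; fi
result=$("$script")              # now the script can be run directly

echo "$before"     # prints: not executable
echo "$after"      # prints: executable
echo "$result"     # prints: hello from script

rm -rf "$scratch"  # clean up
```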
