# Bash
## Networking, Debugging, Misc

## wget
- `wget` is a non-interactive downloader
    - It supports only HTTP, HTTPS, and FTP
    - If the server supports it, an interrupted download can be resumed with `-c` (`--continue`)
    - It also has options to create a complete local copy of a website
```bash
wget [OPTIONS] URL
```
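
As a minimal sketch of resuming an interrupted download, the options can be collected in a bash array so that each one can carry a comment. The URL is a hypothetical placeholder, and the real `wget` call is commented out so that nothing touches the network until you edit it.

```bash
# Hypothetical URL, used only for illustration.
url="https://example.com/files/dataset.tar.gz"

wget_args=(
    --continue    # resume a partially downloaded file if the server allows it
    --tries=3     # give up after three attempts
    "$url"
)

# Show the command that would be run.
echo "wget ${wget_args[*]}"

# Uncomment to actually download:
# wget "${wget_args[@]}"
```

Keeping the options in an array avoids quoting problems when a URL or file name contains spaces or shell metacharacters.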


In [1]:
wget www.umbc.edu

--2017-09-28 11:32:27--  http://www.umbc.edu/
Resolving www.umbc.edu (www.umbc.edu)... 130.85.12.160
Connecting to www.umbc.edu (www.umbc.edu)|130.85.12.160|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘index.html.1’

index.html.1            [ <=>                  ]  40.82K  --.-KB/s   in 0.005s 

2017-09-28 11:32:27 (7.87 MB/s) - ‘index.html.1’ saved [41795]



In [2]:
head index.html.1

<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="UTF-8" />
    <title>UMBC: An Honors University In Maryland</title>

    <!-- Always force latest IE rendering engine (even in intranet) & Chrome Frame -->
    <!-- <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"> -->

    <!-- Sets the viewport width to the width of the device, so media queries work -->


## Common Courtesy with `wget`
- `wget` and many other command-line tools can, in principle, send hundreds of requests per second
    - This is inconsiderate, potentially illegal, and a good way to get your IP blocked
- `wget` has several options to prevent this when you are downloading multiple files at once
    - `--limit-rate` sets the maximum bandwidth to use
    - `--wait` sets the number of seconds to wait between requests
    - `--random-wait` randomly varies the actual wait time around the `--wait` value
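
The three options can be combined in one command. The sketch below shows one possible polite configuration; the rate limit, the wait time, the recursion depth, and the URL are all illustrative assumptions rather than recommended values.

```bash
# Hypothetical starting URL; replace it with a site you are permitted to crawl.
url="https://example.com/docs/"

polite_args=(
    --limit-rate=200k   # cap the download bandwidth
    --wait=2            # pause 2 seconds between requests
    --random-wait       # vary each pause around the --wait value
    --recursive         # follow links found on the page
    --level=1           # but only one level deep
    "$url"
)

echo "wget ${polite_args[*]}"

# Uncomment to actually run the crawl:
# wget "${polite_args[@]}"
```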

In [3]:
wget --mirror --page-requisites \
--convert-links --adjust-extension \
-P ./local_443-2 --wait 1 --random-wait \
https://www.csee.umbc.edu/~bwilk1/433/

--2017-09-28 11:36:45--  https://www.csee.umbc.edu/~bwilk1/433/
Resolving www.csee.umbc.edu (www.csee.umbc.edu)... 130.85.36.80
Connecting to www.csee.umbc.edu (www.csee.umbc.edu)|130.85.36.80|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 29262 (29K) [text/html]
Saving to: ‘./local_443-2/www.csee.umbc.edu/~bwilk1/433/ind

HTTP request sent, awaiting response... 403 Forbidden
2017-09-28 11:37:01 ERROR 403: Forbidden.

--2017-09-28 11:37:03--  https://www.csee.umbc.edu/~bwilk1/components/bootstrap/fonts/glyphicons-halflings-regular.woff2
Connecting to www.csee.umbc.edu (www.csee.umbc.edu)|130.85.36.80|:443... connected.
HTTP request sent, awaiting response... 403 Forbidden
2017-09-28 11:37:03 ERROR 403: Forbidden.

--2017-09-28 11:37:03--  https://www.csee.umbc.edu/~bwilk1/components/bootstrap/fonts/glyphicons-halflings-regular.woff
Connecting to www.csee.umbc.edu (www.csee.umbc.edu)|130.85.36.80|:443... connected.
HTTP request sent, awaiting response... 403 Forbidden
2017-09-28 11:37:03 ERROR 403: Forbidden.

--2017-09-28 11:37:05--  https://www.csee.umbc.edu/~bwilk1/components/bootstrap/fonts/glyphicons-halflings-regular.ttf
Connecting to www.csee.umbc.edu (www.csee.umbc.edu)|130.85.36.80|:443... connected.
HTTP request sent, awaiting response... 403 Forbidden
2017-09-28 11:37:05 ERROR 403: Forbidden.




2017-09-28 11:37:23 (4.33 MB/s) - ‘./local_443-2/www.csee.umbc.edu/~bwilk1/331/Lecture02.html’ saved [201889/201889]

--2017-09-28 11:37:24--  https://www.csee.umbc.edu/~bwilk1/331/Lecture03.html
Connecting to www.csee.umbc.edu (www.csee.umbc.edu)|130.85.36.80|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 270921 (265K) [text/html]
Saving to: ‘./local_443-2/www.csee.umbc.edu/~bwilk1/331/Lecture03.html’


2017-09-28 11:37:24 (11.3 MB/s) - ‘./local_443-2/www.csee.umbc.edu/~bwilk1/331/Lecture03.html’ saved [270921/270921]

--2017-09-28 11:37:25--  https://www.csee.umbc.edu/~bwilk1/331/Lecture04.html
Connecting to www.csee.umbc.edu (www.csee.umbc.edu)|130.85.36.80|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 267820 (262K) [text/html]
Saving to: ‘./local_443-2/www.csee.umbc.edu/~bwilk1/331/Lecture04.html’


2017-09-28 11:37:25 (11.3 MB/s) - ‘./local_443-2/www.csee.umbc.edu/~bwilk1/331/Lecture04.html’ saved [267820/267820]

--2017-09-2



2017-09-28 11:37:42 (11.6 MB/s) - ‘./local_443-2/www.csee.umbc.edu/~bwilk1/331/NFA1.jpg’ saved [36699/36699]

--2017-09-28 11:37:44--  https://www.csee.umbc.edu/~bwilk1/331/NFA2.jpg
Connecting to www.csee.umbc.edu (www.csee.umbc.edu)|130.85.36.80|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 36512 (36K) [image/jpeg]
Saving to: ‘./local_443-2/www.csee.umbc.edu/~bwilk1/331/NFA2.jpg’


2017-09-28 11:37:44 (11.3 MB/s) - ‘./local_443-2/www.csee.umbc.edu/~bwilk1/331/NFA2.jpg’ saved [36512/36512]

--2017-09-28 11:37:45--  https://www.csee.umbc.edu/~bwilk1/331/aa.png
Connecting to www.csee.umbc.edu (www.csee.umbc.edu)|130.85.36.80|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 29991 (29K) [image/png]
Saving to: ‘./local_443-2/www.csee.umbc.edu/~bwilk1/331/aa.png’


2017-09-28 11:37:45 (12.9 MB/s) - ‘./local_443-2/www.csee.umbc.edu/~bwilk1/331/aa.png’ saved [29991/29991]

--2017-09-28 11:37:46--  https://www.csee.umbc.edu/~bwilk1/331/pars


## More Useful `wget` Features
- `wget` lets you specify a file containing a list of URLs to download with the `-i` flag
- The types of files downloaded can be controlled with the following flags
    - `--accept` takes a comma-separated list of file name suffixes to accept
    - `--reject` takes a comma-separated list of file name suffixes to reject
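
A short sketch of how these flags might fit together: the URL list is written to a temporary file and passed to `-i`, and `--accept` restricts the download to two suffixes. The URLs, the suffix list, and the variable names are assumptions made for illustration.

```bash
# Write a hypothetical list of URLs, one per line.
url_list=$(mktemp)
cat > "$url_list" <<'EOF'
https://example.com/papers/
https://example.org/reports/
EOF

# -i reads the URLs from the file; --accept keeps only the listed suffixes.
cmd=(wget -i "$url_list" --recursive --level=1 --accept pdf,txt --wait=1)
echo "${cmd[*]}"

# Uncomment to actually run the download:
# "${cmd[@]}"
```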

## Real-World Example
- For a computational linguist, one of the most important steps in research is gathering data
- In this example, pretend we want to build a dataset of text found on academic websites
- The steps we will take are:
    1. Download a web page that lists university websites
    2. Extract the URLs
    3. Use `wget` to download the websites
    4. Use `sed` and other tools to extract the text from each website

In [6]:
wget "https://univ.cc/search.php?dom=edu&key=&start=1" -Ouni_webs.html

--2017-09-28 11:41:28--  https://univ.cc/search.php?dom=edu&key=&start=1
Resolving univ.cc (univ.cc)... 148.251.1.24
Connecting to univ.cc (univ.cc)|148.251.1.24|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5601 (5.5K) [text/html]
Saving to: ‘uni_webs.html’


2017-09-28 11:41:29 (17.4 MB/s) - ‘uni_webs.html’ saved [5601/5601]



In [9]:
# Print each link on its own line, then rewrite it as "URL NAME"
grep -Po "<a href='.*?'>.*?</a>" uni_webs.html | \
sed -E "s|<a href='([^']*)'>(.*)</a>|\1 \2|g" > targets
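
To see what this pipeline produces without downloading anything, the same idea can be tried on a made-up HTML fragment. The sample below is an assumption and not the real contents of `uni_webs.html`; it also assumes a GNU `grep` built with `-P` (PCRE) support.

```bash
# Two made-up list items in the same single-quoted href style.
sample="<li><a href='http://www.example.edu/'>Example University</a></li>
<li><a href='http://www.sample.edu/'>Sample State College</a></li>"

# grep -o prints each matching anchor on its own line;
# sed then rewrites it as "URL NAME".
links=$(printf '%s\n' "$sample" \
    | grep -Po "<a href='.*?'>.*?</a>" \
    | sed -E "s|<a href='([^']*)'>(.*)</a>|\1 \2|")

printf '%s\n' "$links"
```

Each output line starts with the URL, followed by a space and the link text, which is the layout the later steps rely on.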

In [10]:
mapfile -t sites_to_get < targets
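
As a small illustration of what `mapfile` does, the sketch below reads a two-line temporary file into an array. The file contents and variable names are made up, and `mapfile` requires bash 4 or newer.

```bash
# Two made-up "URL NAME" lines standing in for the targets file.
targets_file=$(mktemp)
printf '%s\n' \
    "http://www.example.edu/ Example University" \
    "http://www.sample.edu/ Sample State College" > "$targets_file"

# Each line of the file becomes one array element;
# -t strips the trailing newline from each element.
mapfile -t sites < "$targets_file"

echo "Read ${#sites[@]} lines"
echo "First entry: ${sites[0]}"
rm -f "$targets_file"
```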

In [12]:
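for target in "${sites_to_get[@]}"; do
    # Each entry is "URL NAME"; word splitting separates the fields
    array=( $target )
    url=${array[0]}
    # Join the remaining words with underscores to build a file name
    name="${array[*]:1}"
    name=${name// /_}
    wget "$url" -O "${name}.html"
    # Put the page on one line so a paragraph can be matched across line breaks
    tr '\n' ' ' < "${name}.html" > "${name}.flat.html"
    grep -Po "<p>(.*?)</p>" "${name}.flat.html" \
    | sed -E "s|<p>(.*)</p>|\1|g" > "${name}.txt"
    # Stop after the first site for this demonstration
    break
done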

--2017-09-28 11:46:59--  http://www.acu.edu/
Resolving www.acu.edu (www.acu.edu)... 35.164.249.222, 54.218.51.199
Connecting to www.acu.edu (www.acu.edu)|35.164.249.222|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘Abilene_Christian_University.html’

Abilene_Christian_U     [ <=>                  ]  40.12K  --.-KB/s   in 0.08s  

2017-09-28 11:46:59 (522 KB/s) - ‘Abilene_Christian_University.html’ saved [41084]



In [13]:
cat Abilene_Christian_University.txt

Prepare to successfully enter the workforce through one of 70-plus majors and 135 areas of study. <br /> 
Increase your personal and professional expertise by earning one of our many graduate degrees. <br /> 
Combine convenience and quality by earning your degree in our Christ-centered online community. <br /> 
Welcome the Wildcats home for their first season at Anthony Field at Wildcat Stadium. Opening weekend is September 16 and will include a Needtobreathe concert, tailgating, inaugural game against Houston Baptist and fireworks! Limited seats are available, don't miss out on yours!<br /> 
<a class="button button--orange" href="http://www.acusports.com/fbtix" target="_blank">Learn More</a>
ACU invites you to join a learning community founded on Christian tradition and academic rigor, but open to any student seeking to build their future – regardless of their individual background or beliefs. 
Whether you’re starting college for the first time or returning for a graduate degree, ACU 
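
The name handling and text extraction from the loop can be tried offline on a made-up page. In this sketch the target line and the HTML are assumptions, and `read` is used in place of the array as a small variation; GNU `grep` with `-P` support is assumed.

```bash
target="http://www.example.edu/ Example University"

# Split "URL NAME..." into the URL and an underscore-joined name.
read -r url name <<< "$target"
name=${name// /_}

# A made-up page whose second paragraph spans two lines.
page="<html><body><p>First paragraph.</p>
<p>Second paragraph,
continued.</p></body></html>"

# Flatten to one line, then pull out the text inside each <p>...</p>.
text=$(printf '%s\n' "$page" | tr '\n' ' ' \
    | grep -Po "<p>(.*?)</p>" \
    | sed -E "s|<p>(.*)</p>|\1|")

echo "$url -> $name"
printf '%s\n' "$text"
```

Flattening first matters because `grep` works line by line, so a paragraph split over two lines would otherwise not match. As the real output above shows, tags nested inside a paragraph (such as `<br />` or `<a>`) are left in place by this approach.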

## `curl`
- `curl` is a more powerful tool that supports both uploading and downloading over many protocols, including
    - (S)FTP
    - HTTP(S)
    - SCP
    - LDAP
- By default, `curl` prints the response to STDOUT; use `-o FILE` to save it to a file instead
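
The difference between writing to STDOUT and writing to a file can be sketched without network access by pointing `curl` at a local `file://` URL. This assumes a `curl` build with the `file` protocol enabled, and the file contents and variable names are made up.

```bash
# Create a small local file and address it with a file:// URL.
src=$(mktemp)
echo "hello from curl" > "$src"
url="file://$src"

# Default: the body goes to STDOUT, so it can be captured or piped.
# -s hides the progress meter.
body=$(curl -s "$url")
echo "captured: $body"

# -o writes the body to a file instead.
dest=$(mktemp)
curl -s -o "$dest" "$url"
saved=$(cat "$dest")
echo "saved: $saved"

rm -f "$src" "$dest"
```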

In [14]:
curl http://www.umbc.edu

<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="UTF-8" />
    <title>UMBC: An Honors University In Maryland</title>

    <!-- Always force latest IE rendering engine (even in intranet) & Chrome Frame -->
    <!-- <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"> -->

    <!-- Sets the viewport width to the width of the device, so media queries work -->
    <!-- NOTE: We're locking the max scale (which prevents zooming) to fix bugs
         during orientation changes on devices.  Our styles should accomodate this though. -->
<link rel="image_src" href="http://www.umbc.edu/images/UMBC_fb_tmb.png" />
<meta name="description" content="" />
    <meta name="viewport" content="width=device-width, initial-scale=1, minimum-scale=1, maximum-scale=1">
    <link rel="icon" type="image/png" href="images_homepage/icon.png" />

    <!-- Template Styles -->
    <link rel="stylesheet" type="text/css" href="stylesheets/homepage.css" />

    <!-- Site-Specific Customizations -->

<input type="hidden" name="ie" value="UTF-8" />
<input id="umbc-nav-search-query" type="text" placeholder="Search" autocomplete="off" name="q" size="250" maxlength="250">
<input id="umbc-nav-search-submit" type="submit" value="">
</div>
</form>

</nav>
</header>

<header role="banner" id="site-header" >
<a href="#" id="mobile-site-menu-expander">
<span class="mobile-site-menu-expander-bar"></span>
<span class="mobile-site-menu-expander-bar"></span>
<span class="mobile-site-menu-expander-bar"></span>
</a>
</header>

</div><!-- END umbc header -->

<div id="site-menu-background"> 

<nav role="navigation" id="site-menu"><!-- gold toolbar -->
<ul id="menu-main" class="menu">

<li class="utilities"><a href="">Utilities</a>
<ul class="sub-menu">
<li><a href="http://umbc.edu/siteindex">A-Z Index</a></li>
<li><a href="http://my.umbc.edu">myUMBC</a></li>
<li><a href="http://my.umbc.edu/events">Events</a></li>
<li><a href="http://umbc.edu/search/directory">Directory</a></li>
<li><a href="http://

<div class="content-details right">
<a class="mobile-seemore" href="http://bit.ly/2h2wpnY">&raquo;</a>
<div class="title">National Excellence in Innovation and Teaching</div>
<div class="tagline">Growing Retriever community celebrates U.S. News spotlight on UMBC.</div>
<a class="web-seemore" href="http://bit.ly/2h2wpnY">Read more &raquo;</a>
</div>
</div>

<div class="panel panel-3">
<div class="image mobile-center"><img src="images_homepage/win_Strong-Connections-Meaningful-Work2.jpg" alt="Strong Connections, Meaningful Work"></div>
<div class="content-details left">
<a class="mobile-seemore" href="http://bit.ly/2wPg3sd">&raquo;</a>
<div class="title">Strong Connections, Meaningful Work</div>
<div class="tagline">UMBC alumni mentors give today’s students a glimpse of possible career paths.</div>
<a class="web-seemore" href="http://bit.ly/2wPg3sd">Read more &raquo;</a>
</div>
</div>

<div class="panel panel-4">
<div class="image mobile-right"><img src="images_homepage/win_Earth-Space-U

<p><a href="http://my.umbc.edu/topics/involvement-and-leadership">Involvement &amp; Leadership </a></p>
<p><a href="http://my.umbc.edu/topics/arts-culture-and-entertainment">Arts, Culture &amp; Entertainment</a> </p>
<p><a href="http://my.umbc.edu/topics/diversity">Diversity</a></p>
<p><a href="http://undergraduate.umbc.edu/community/professors-not-to-miss.php">Professors Not to Miss</a></p>
<p><a href="http://umbcbreakingground.wordpress.com/">UMBC BreakingGround</a></p>
<p><a href="http://www.umbcretrievers.com/">Athletics &amp; Recreation</a></p>
<p><a href="http://my.umbc.edu/events">Events</a></p>
</div>
    
<div class="column2">
<h3>Get Help</h3> 
<a href="http://www.umbc.edu/saf"><img src="images_homepage/cs_support.jpg" alt="Support Services thumbnail"></a>
<p><a href="http://www.umbc.edu/saf/">Student Affairs</a></p>
<p><a href="http://my.umbc.edu/topics/advising-and-student-support">Advising &amp; Student Support</a></p>
<p><a href="http://my.umbc.edu/topics/jobs-and-interns

<p><a href="http://www.bwtechumbc.com/">Programs</a></p>
<p><a href="http://www.bwtechumbc.com//facilities/space.html">Acquire Space</a></p>  
<p><a href="http://www.bwtechumbc.com//about/directions.html">Maps &amp; Directions</a></p>
</div>
    
<div class="column2">
<h3>Work With Us</h3> 
<a href="http://www.umbc.edu/business"><img src="images_homepage/cp_business.jpg" alt="Corporate Relations thumbnail"></a>
<p><a href="http://www.umbc.edu/business/">Learn How</a></p> 
<p><a href="http://www.umbc.edu/business/students.php">Access Students</a></p>
<p><a href="http://www.umbc.edu/business/training.php">Access Training</a></p>
<p><a href="http://www.umbc.edu/business/research.php">Access Research</a></p>
<p><a href="http://www.umbc.edu/business/space.php">Access Space</a></p>
<p><a href="http://www.umbc.edu/business/support.php">Support UMBC</a></p>  
</div>
    
<div class="column3">
<h3>Give to UMBC</h3> 
<a href="http://www.umbc.edu/giving"><img src="images_homepage/bs_giving.jpg" a

</ul>
</nav>

<div id="umbc-footer-info">&copy; University of Maryland, Baltimore County <span class="bullet">&bull;</span> 1000 Hilltop Circle <span class="bullet">&bull;</span> Baltimore, MD 21250
</div>

</footer>

</body>
</html>


In [15]:
curl -I http://www.umbc.edu

HTTP/1.1 200 OK
Date: Thu, 28 Sep 2017 15:48:42 GMT
Server: Apache/2.4.6 (CentOS) OpenSSL/1.0.1e-fips mod_fcgid/2.3.9 PHP/5.4.16 mod_perl/2.0.10 Perl/v5.16.3
X-Powered-By: PHP/5.4.16
Content-Type: text/html; charset=UTF-8
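
Because the headers are plain text, they are easy to process with standard tools. The sketch below parses a copy of part of the response shown above from a string rather than contacting the server; the variable names are the only assumptions.

```bash
headers='HTTP/1.1 200 OK
Date: Thu, 28 Sep 2017 15:48:42 GMT
X-Powered-By: PHP/5.4.16
Content-Type: text/html; charset=UTF-8'

# The status code is the second field of the first line.
status=$(printf '%s\n' "$headers" | awk 'NR == 1 { print $2 }')

# Header names are case-insensitive, so match without regard to case.
ctype=$(printf '%s\n' "$headers" | grep -i '^content-type:' | cut -d' ' -f2-)

echo "status: $status"
echo "content type: $ctype"
```

Real `curl -I` output ends each line with a carriage return, so a live pipeline may need `tr -d '\r'` before the comparison or the `cut`.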



## POST requests
- We will look at HTTP requests in more detail in a few weeks
- If you submit a form and the resulting web address does not contain a long string of parameters, the form was probably submitted using POST
- `curl` sends a POST request when the method is set with the `-X` flag and the data is given with `-d`
```bash
curl -X POST -d "DATA" URL
```
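
A hedged sketch of submitting form fields: the endpoint, the field names, and the `post_form` helper are hypothetical. With `DRY_RUN` set, the function only prints the command it would run, so the example can be tried without a server.

```bash
# Hypothetical endpoint that accepts a search form.
url="https://example.com/search"

post_form() {
    # --data-urlencode URL-encodes each value before it is sent as POST data.
    local cmd=(curl -s -X POST
        --data-urlencode "q=$1"
        --data-urlencode "lang=$2"
        "$url")
    if [[ -n "$DRY_RUN" ]]; then
        # Print the command, shell-quoted, instead of running it.
        printf '%q ' "${cmd[@]}"
        echo
    else
        "${cmd[@]}"
    fi
}

DRY_RUN=1 post_form "bash debugging" "en"
```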

## Debugging in `bash`
- The `bash` command itself has several flags that are useful for debugging
- The flags can be included as part of the shebang line
```bash
#!/bin/bash FLAGS
```
- The main flags for debugging are
    - `-n` reads through the script without running any commands, which is good for finding syntax errors
    - `-x` prints a trace of each command and its arguments as it is executed
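
The same flags can also be passed on the command line, for example `bash -n script.sh`, which avoids editing the shebang line. The sketch below writes a throwaway script with a deliberate syntax error (the script contents are made up) and checks it with `-n`.

```bash
# Write a small script with a deliberate syntax error (missing "done").
script=$(mktemp)
cat > "$script" <<'EOF'
for i in 1 2 3; do
    echo "HELLO"
EOF

# -n parses the file without running anything; a nonzero exit status
# means the parser found a problem.
if bash -n "$script" 2>/dev/null; then
    result="syntax OK"
else
    result="syntax error"
fi
echo "$result"   # prints: syntax error
rm -f "$script"
```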

In [16]:
cat syntax_example.sh

#!/bin/bash -n
echo "HELLO"


In [17]:
./syntax_example.sh

In [18]:
cat syntax_error_example.sh

#!/bin/bash -n
for i in (1 2 3 4); do
echo "HELLO"




In [19]:
./syntax_error_example.sh

./syntax_error_example.sh: line 2: syntax error near unexpected token `('
./syntax_error_example.sh: line 2: `for i in (1 2 3 4); do'



In [20]:
cat ./cla_debug.sh

#!/bin/bash -x
echo "The name of the file is $0"
echo "You passed $# arguments"

echo "The first argument is $1"
echo "The second argument is $2"

echo "All the arguments are $@"


In [21]:
./cla_debug.sh Arg1 SOmething goes here

+ echo 'The name of the file is ./cla_debug.sh'
The name of the file is ./cla_debug.sh
+ echo 'You passed 4 arguments'
You passed 4 arguments
+ echo 'The first argument is Arg1'
The first argument is Arg1
+ echo 'The second argument is SOmething'
The second argument is SOmething
+ echo 'All the arguments are Arg1' SOmething goes here
All the arguments are Arg1 SOmething goes here
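
Tracing can also be limited to part of a script with the `set` builtin: `set -x` turns tracing on and `set +x` turns it off. In this minimal sketch the `greet` function is a made-up example.

```bash
greet() {
    echo "Hello, $1"
}

echo "before tracing"
set -x            # start printing each command before it runs
greet "world"
set +x            # stop tracing
echo "after tracing"
```

The trace lines are written to STDERR, so they do not mix with output that is redirected or piped from STDOUT.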
