## Unit 1B: How to Handle White Spaces

<b>What are spaces, tabs, and newlines?</b>

This seems like an odd question. Spaces are the gaps between these words, tabs are slightly longer spaces, and new lines are, well... new lines. While these are obvious things to us, it is important to understand that anytime there is a space between the words we type, there is an invisible character that tells the computer to leave that space there. The same is true for tabs, new lines, and other kinds of spacing. To understand how to deal with these when you are using Python to scrape the web, we need to establish what each of these spacing characters is.

<b>Spaces</b>

There really isn't such a thing as an empty "space" in a string on a computer. There must be some invisible character that tells Python to leave a blank space there. There are a few different characters that can be used. As seen in Unit 1A, we inserted a space between the words "Hello" and "World" by simply adding " " (quotation marks with a space between them) between the words.

<b>Tabs</b>

Another important spacing character is the tab, denoted as <b>\t</b>. These characters are often used for indentation in paragraphs, as well as in Python code (which we will see later). In many cases, code in languages that do not require indentation will still be indented to help with human readability.

<b>New Lines</b>

Characters are read by a computer in a straight line. Anytime we see a new line, there is actually an invisible character telling the computer to show a new line on our screens. This is often denoted as <b>\n</b>. When we are processing most files, we will want to remove these characters from the ends of lines so that we can better manipulate the data into the output that we want.
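One way to see these invisible characters for yourself is Python's built-in `repr()` function, which shows a string with its escape sequences written out instead of acted on. The short sketch below is only an illustration; the variable names and the sample text are made up for this example.

```python
# A sample string containing spaces, a tab (\t), and a new line (\n).
text = "Hello World\tTabbed\nNext line"

# print() acts on the characters: the tab becomes a gap, \n starts a new line.
print(text)

# repr() spells the invisible characters out so we can see them.
visible = repr(text)
print(visible)

# Each spacing character is a real character that can be counted.
space_count = text.count(" ")
tab_count = text.count("\t")
newline_count = text.count("\n")
print(space_count, tab_count, newline_count)
```

Printing `repr()` of a string is a handy habit when a scraped line does not look the way you expect, because it reveals exactly which spacing characters are present.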

### White Spaces and .strip()

Below are two examples of f-strings that will help visualize what tab characters and new line characters do:

In [1]:
# Tabs are commonly used when producing tab-delimited files (a file where tabs are separating each column) that can be read by Excel

first_name = "John"
last_name = "Doe"
organization = "Fake Organization"

# notice that \t characters are separating each variable
output = f"{first_name}\t{last_name}\t{organization}"

print(output)

John	Doe	Fake Organization


In [2]:
# New lines are present in almost all documents that you will encounter 
# It would be really difficult to read a file that is just in one straight line

line1 = "This is line 1"
line2 = "This is line 2"
line3 = "This is line 3"

#notice the \n characters separating the variables
output = f"{line1}\n{line2}\n{line3}"

print(output)

This is line 1
This is line 2
This is line 3


As you can see in the first example, we define three variables and print them using an f-string. We also insert a tab character (<b>\t</b>) between each of the fields. You can see in the output that there is a wide gap between each variable. In the second example, we also define three variables and print them using an f-string. However, this time we inserted new line characters (<b>\n</b>) to make each one print on a new line.

Now that we've looked at the different kinds of white space that you might encounter, let's look at how to remove them.

Consider this line of HTML code:

In [3]:
html_line = "        <p><u>Client</u>: The GreenLight Fund - Atlanta</p>"
print(html_line)

        <p><u>Client</u>: The GreenLight Fund - Atlanta</p>


We want to clean up this line by removing everything except "The GreenLight Fund - Atlanta".

First, this line of code has an unknown number of white spaces at the beginning that will need to be removed. Python has a functionality for this called <b>.strip()</b>. The default behavior of this function is to remove all leading and trailing white space (spaces, tabs, and new lines) from a string, which is what we need to do.

In [4]:
# since we won't be needing the original html_line string, we can just reassign our new stripped line to the same name
html_line = html_line.strip()
print(html_line)

<p><u>Client</u>: The GreenLight Fund - Atlanta</p>
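To illustrate that the default behavior covers more than plain spaces, here is a minimal sketch with a made-up string that mixes spaces, tabs, and new lines at both ends. The variable names are only for this example.

```python
# A made-up string with a mix of tabs, spaces, and new lines on both ends.
messy = "\t  \n  Client name goes here \n\t "

# With no argument, .strip() removes every kind of white space from both ends.
cleaned = messy.strip()

# repr() is used so the invisible characters show up in the printed output.
print(repr(messy))
print(repr(cleaned))
```

Note that the spaces between the words are untouched: .strip() only works on the ends of the string.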


Notice that our string is printed all the way to the left (all spaces on the left side have been removed). While the default behavior of .strip() is to remove white space, you can use it to remove other characters by putting a string between the parentheses:

In [5]:
some_text = "__some_text____________"
some_text = some_text.strip('_')
print(some_text)

some_text


In the example above, we told the strip function to remove all leading and trailing underscores by putting an underscore between the parentheses in the strip function. This is known as an <b>argument</b>, and most functions take arguments to specify further what you want the function to do.

<i>*Notice that only the leading and trailing underscores are removed, and the underscore between "some" and "text" is left.</i>

Going back to our HTML line example, we now want to remove everything else surrounding the words that we want. However, the characters are different on each side. On the left side of the string, the characters we want to remove are "\<p>\<u>Client\</u>: " and on the right side of the string we want to remove "\</p>". Python offers variations of .strip() called <b>.rstrip()</b> and <b>.lstrip()</b> (for right strip and left strip) that remove characters from only one side of the string. These functions are also stackable, meaning we can chain them one after another on the same string. One caveat: like .strip(), they treat the argument as a set of individual characters rather than as an exact phrase. They keep removing characters until they reach one that is not in the set, which here is the "T" in "The" on the left and the final "a" in "Atlanta" on the right.

In [6]:
html_line = "        <p><u>Client</u>: The GreenLight Fund - Atlanta</p>"
client = html_line.strip().lstrip("<p><u>Client</u>: ").rstrip("</p>")
print(client)

The GreenLight Fund - Atlanta
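A word of caution, sketched below: .lstrip() and .rstrip() remove any characters that appear in their argument, in any order, so they can remove too much when the text we want to keep starts or ends with one of those characters. The client name here is hypothetical and was made up to trigger the problem. Assuming Python 3.9 or newer is available, the string methods `.removeprefix()` and `.removesuffix()` remove an exact piece of text instead.

```python
# Hypothetical line, made up for illustration: the name we want to keep starts
# with letters that also appear in the argument passed to .lstrip().
html_line = "<p><u>Client</u>: Client Care Network</p>"

# .lstrip() keeps removing characters found in its argument, so it continues
# past the label and eats the start of the name.
over_stripped = html_line.lstrip("<p><u>Client</u>: ").rstrip("</p>")
print(over_stripped)

# .removeprefix() and .removesuffix() (Python 3.9+) remove the exact text only.
client = html_line.removeprefix("<p><u>Client</u>: ").removesuffix("</p>")
print(client)
```

The first print shows a damaged name, while the second keeps it whole. The .lstrip() and .rstrip() approach in this unit works because none of the values we keep begin or end with a character from the argument.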


### A Worked Example

Now that we have covered the basics of print statements, variable assignment, and string manipulations, we will go over a worked example, and then into some practice problems. This example will be an early introduction to one of the most important aspects of programming: developing an approach before writing code. This might seem a bit silly because this is not a very complicated example. It would be easy enough to just write the code very quickly. However, it is good to start practicing this because it will make more complex tasks more approachable. That being said, let's get into the example.

We have the following strings:

In [7]:
client_html_line = "    <p><u>Client</u>: The Carter Center</p>"
location_html_line = "        <p>Location: 453 John Lewis Freedom Pkwy NE, Atlanta, GA 30307</p>"
employees_html_line = "        <p>Employees: 175; field office staff in more than a dozen countries</p>"

print(client_html_line)
print(location_html_line)
print(employees_html_line)

    <p><u>Client</u>: The Carter Center</p>
        <p>Location: 453 John Lewis Freedom Pkwy NE, Atlanta, GA 30307</p>
        <p>Employees: 175; field office staff in more than a dozen countries</p>


Say we are trying to populate a tab-separated file that can be opened in Excel. Our column headers are <i>Client</i>, <i>City</i>, and <i>Employee Number</i>. Our objective is to print a tab-separated string that has "The Carter Center" for the <i>Client</i> column, "Atlanta" for the <i>City</i> column, and "175" for the <i>Employee Number</i> column.

<u>Approach</u>

Step 1: We want to remove the white space from the front of each line. This can be done with .strip().

Step 2: We want to remove the unwanted characters from the beginning and end of each line. This can be done with .rstrip() and .lstrip().

Step 3: We want to print the strings with a tab between each variable. This can be done with an f-string.

<i>Note that we are not writing out the exact details of what we are going to type. We are simply writing down what the objective is and how we can accomplish it, and dividing the process into steps to make it easier to think about. Again, this might seem silly because this task is relatively straightforward, but this approach will be useful later.</i>

In [8]:
# step 1 and step 2 are done at the same time
client = client_html_line.strip().lstrip("<p><u>Client</u>: ").rstrip("</p>")
location = location_html_line.strip().lstrip("<p>Location: 453 John Lewis Freedom Pkwy NE, ").rstrip(", GA 30307</p>")
employees = employees_html_line.strip().lstrip("<p>Employees: ").rstrip("; field office staff in more than a dozen countries</p>")

# step 3
output = f"{client}\t{location}\t{employees}"

print(output)

The Carter Center	Atlanta	175
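As a small extension that is not part of the exercise itself, the sketch below shows how the column headers from the objective could be printed above the data row, combining tab and new line characters in one string. The values are retyped here so the cell runs on its own, and the names `city`, `employee_number`, `header`, `row`, and `table` are chosen only for this illustration.

```python
# Values as extracted in the worked example, retyped so this cell stands alone.
client = "The Carter Center"
city = "Atlanta"
employee_number = "175"

# Column headers named in the objective, separated by tabs.
header = "Client\tCity\tEmployee Number"

# The data row, built with an f-string and tabs between the fields.
row = f"{client}\t{city}\t{employee_number}"

# A new line character places the data row underneath the header row.
table = f"{header}\n{row}"
print(table)
```

Each \t moves to the next column and each \n starts the next row, which is all a tab-separated file is.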


### Practice Problems

<b>Problem 1</b>: Write a print statement that will print each of the following strings on a new line. Expected output:


<i>We are writing code</i>

<i>to scrape HTML code.</i>

<i>Gahh I love Python.</i>

In [9]:
line1 = "We are writing code  | 5"
line2 = "to scrape HTML code. | 7"
line3 = "Gahh I love Python   | 5"

In [10]:
# write your solution below




<b>Problem 2</b>: Correct the error in this code:

In [None]:
print('There's an error in this code somewhere.')

In [11]:
# write your solution below



<b>Problem 3:</b> Fix the order of the sentence. Expected output:


<i>Now this sentence is correct.</i>



In [12]:
sentence = 'sentence correct is now this'

In [13]:
# write your solution below




<b>Problem 4:</b> Remove the white space and underscores from the following string. Expected output: 

<i>main</i>

In [14]:
name = '   __main__   '

In [15]:
# write your solution below




<b>Problem 5:</b> Comma-separated files (where each column is separated by a comma) are also among the most widely used file formats. Take the following data and print it as a line that could be inserted into a comma-separated table. Expected output:

<i>John, Doe, 1984-06-09, Fake Business, Contractor</i>

In [16]:
name_html_line = "    <p><u>Name</u>: John Doe</p>"
birthdate_html_line = "        <p>Birthdate: 1984-06-09</p>"
organization = "        <p>Organization: <u>Fake Business</u></p>"
role = "        <p>Role: Contractor</p>"

In [17]:
# write your solution below


