## Unit 1B: How to Handle White Spaces

<b>What are spaces, tabs, and newlines?</b>

This might seem like an odd question. Spaces are the spaces between these words, tabs are slightly longer spaces, and new lines are, well... new lines. Although these seem obvious to us, it is important to understand that anytime there is a space between words, there is an invisible character that tells the computer to leave that space there. The same is true for tabs, new lines, and other kinds of spacing. This might seem like a very specific topic, but in this lesson we will look at how to leverage different spacing patterns to parse data.

<b>Spaces</b>

There really isn't such a thing as an empty "space" in a string on a computer. An invisible character must be there to tell Python to leave a blank. A few different characters can do this. As seen in Unit 1A, we inserted a space between the words "Hello" and "World" by simply adding " " (quotation marks with a space between them) between the words. <br>
<br>

<b>Tabs</b>

Another important spacing character is the tab, denoted as <b>\t</b>. Tabs are often used to indent paragraphs, as well as code in Python (which we will see later). Even in coding languages that do not require indentation, code is often indented anyway to make it easier for humans to read.
<br>
<br>

Another reason to explore tabs is that they are often used to separate columns in a table. This file format, referred to as <b>tab-separated values</b>, is one of the most widely used formats for storing data in a table structure. Say we have a table that contains information about several restaurant ratings in the area. If you were to open that table in a tool like Excel, it might look something like this:

| Restaurant | Stars | Comment |
| --- | --- | --- |
| Flying Biscuit | 4.7 | Great breakfast |
| Tin Drum | 4.9 | Amazing variety |
| Subway | 3.1 | Nice and quick lunch |

<br>
Although this table looks nice and clean to us, if we were to read this table into Python strings, it would look like:<br>
<br>
<i>
Restaurant\tStars\tComment\n<br>
Flying Biscuit\t4.7\tGreat breakfast\n<br>
Tin Drum\t4.9\tAmazing variety\n<br>
Subway\t3.1\tNice and quick lunch<br>
</i>
<br>
This isn't nearly as nice looking. However, learning to handle data like this will prepare you for more complicated tasks.<br>
<br>

<b>New Lines</b>

Characters are read by a computer in a straight line. Anytime we see a new line, there is actually an invisible character telling the computer to show a new line on our screens. This is often denoted as <b>\n</b>. Notice how in the example above, each line (except the last one) has a "<i>\n</i>" at the end. When we are processing most files, we will want to remove these characters from the ends of lines so that we can better manipulate the data into the output that we want.
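To see these invisible characters for yourself, one option is Python's built-in repr() function, which shows a string with its escape sequences written out instead of rendered. The short sketch below reuses one row from the restaurant table; the variable names are only for illustration.

```python
# One row of the table, with tabs between columns and a newline at the end.
row = "Tin Drum\t4.9\tAmazing variety\n"

# print() renders the tabs as gaps and the newline as a line break.
print(row)

# repr() shows the escape sequences themselves instead of rendering them.
visible = repr(row)
print(visible)

# Each escape sequence is a single character, so we can count them.
tab_count = row.count("\t")
newline_count = row.count("\n")
print(tab_count, newline_count)
```

Comparing the two printed versions side by side is a quick way to check whether a string still has a tab or newline hiding in it.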

### Some practice with tabs

One easy way to output tab-separated values (TSV) in Python is to use an f-string. For example, say we have the following variables:

    first_name = 'John'
    last_name = 'Doe'
    organization = 'Generic Business'

We can output this data in TSV format like:

    data = f"{first_name}\t{last_name}\t{organization}"

and if we were to print this data:

    print(data)
    
    >>John    Doe    Generic Business

<br>
Now imagine we have 100 names and organizations. This data is much easier to deal with if we know that all of the entries in the first column are first names, all of the entries in the second column are last names, and all of the entries in the third column are organizations. We'll explore the reasoning for that later. For now, let's get some practice.
<br>
<br>
In the code cell below, assign a few strings to some variables and print them out in TSV format.

In [None]:
#Designate string variables here


#Put them all together in an f-string here


#print your f-string here




### Some practice with new lines

In the previous example, you used tabs to separate your data into columns. However, we also want to be able to separate our data into rows as well. For this, we use a new line (\n) character. Say we have the following f-strings:

    data1 = f"{'John'}\t{'Doe'}\t{'Generic Business'}"
    data2 = f"{'Jimmy'}\t{'Carter'}\t{'Former President'}"

One option for printing this data out on two lines is to write two print statements. However, this won't be very practical when we start dealing with data sets that have dozens or even hundreds of rows. A better option would be:

    print(f"{data1}\n{data2}")

    >>John    Doe      Generic Business
      Jimmy   Carter   Former President
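As a preview of how this scales to more rows, here is a minimal sketch that collects the rows in a list and joins them with new line characters. It uses a list and the str.join() method, neither of which this unit has introduced yet, and the header row is a made-up example.

```python
# Hypothetical header row plus the two data rows from above.
header = "First\tLast\tOrganization"
data1 = f"{'John'}\t{'Doe'}\t{'Generic Business'}"
data2 = f"{'Jimmy'}\t{'Carter'}\t{'Former President'}"

# Put the rows in a list, then join them with a newline between each pair.
rows = [header, data1, data2]
table = "\n".join(rows)

print(table)

# The number of lines is the number of newlines plus one.
line_count = table.count("\n") + 1
print(line_count)
```

Note that join() only places the new line character between rows, so the final row does not end with one.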

Now that we have looked at how to use white spaces, let's get some practice. In the code cell below, make four f-strings in TSV format and print them out on new lines.

In [None]:
#Designate your f-string here


#Print your data here




### White Spaces and .strip()

Recall how in the introduction to tabs at the beginning of this section, each of the rows in the example table (except the last) had a \n character at the end. When we are parsing this type of data in Python, we will need to get rid of these new line characters to clean up the data. Python strings have a method for this called <b>.strip()</b>. By default, it removes all leading and trailing whitespace (spaces, tabs, and new lines) from a string, which is exactly what we need.

We will go over how to do this later, but for now, pretend we have read the table from a previous example into several strings in Python:

    header = 'Restaurant\tStars\tComment\n'
    flying_biscuit = 'Flying Biscuit\t4.7\tGreat breakfast\n'
    tin_drum = 'Tin Drum\t4.9\tAmazing variety\n'
    subway = 'Subway\t3.1\tNice and quick lunch'

To get rid of the new line characters at the end of these strings, we can apply the strip() function as:

    header = header.strip()
    flying_biscuit = flying_biscuit.strip()
    tin_drum = tin_drum.strip()
    subway = subway.strip()
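Because the new line character is invisible when printed, it can be hard to tell whether strip() did anything. The sketch below uses repr() to compare a row before and after stripping. It also shows one case worth knowing about: since strip() removes tabs too, a row whose last column is empty loses that column's tab, whereas rstrip('\n') removes only the new line. The row with the empty comment is a made-up example.

```python
tin_drum = 'Tin Drum\t4.9\tAmazing variety\n'

# repr() makes the trailing \n visible before and after stripping.
print(repr(tin_drum))
tin_drum = tin_drum.strip()
print(repr(tin_drum))

# Hypothetical row whose last column (the comment) is empty.
empty_comment = 'Subway\t3.1\t\n'

# strip() removes the trailing tab as well as the newline.
stripped_all = empty_comment.strip()

# rstrip('\n') removes only the newline and keeps the trailing tab.
newline_only = empty_comment.rstrip('\n')

print(repr(stripped_all))
print(repr(newline_only))
```

Tabs in the middle of a string are never touched by strip(), so the columns stay separated.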



Now let's get some practice. In the code cell below, write a few strings in TSV format and with new line characters at the end. Print these strings out using an f-string. Then use the strip() function to remove the new line characters and print the strings again.

In [None]:
#Define your strings here


#Print your string here


#Remove the new line character here


#Re-print your strings here




Strip's default behavior is to remove all leading and trailing whitespace. However, strip can also remove other characters that you specify. For example, say there are underscores at the front and back of a string:

    constructor = "__init__"

We can remove the underscores from the front and back of this string using strip:

    init_string = constructor.strip('_')

And if we print the string:

    print(init_string)
    >>init

Notice how we put a string consisting of a single underscore inside of the .strip() function. This string is called an <b>argument</b>, which is a value that provides input to the function. Not all functions need arguments, but arguments can be used to customize the behavior of a function, like we did in this example.
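The argument to strip() can also contain more than one character. In that case, Python removes any of those characters from both ends, in any order, until it reaches a character that is not in the argument. A minimal sketch with a made-up string:

```python
# Hypothetical string padded with a mix of hyphens and underscores.
messy = "-_-report-_-"

# Every leading and trailing '-' or '_' is removed.
cleaned = messy.strip("_-")

# The order of the characters in the argument does not matter.
same_result = messy.strip("-_")

# Characters in the middle of the string are left alone.
inner = "__my_report__".strip("_")

print(cleaned)
print(same_result)
print(inner)
```

In other words, the argument works like a list of characters to remove, not like a word to search for.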

### .lstrip() and .rstrip()

.strip() alone performs the same action on both ends of a string. However, there will be some cases where we only want to strip characters from the start or the end of a string. For these, we can use .lstrip(), which stands for "left strip," to remove characters from the front of a string, and .rstrip(), which stands for "right strip," to remove characters from the end of a string.

Say you are cleaning up a data set of Atlanta phone numbers, and you want to remove the area code from all of the phone numbers. The table might look something like this:

| Restaurant | Phone Number |
| --- | --- |
| Flying Biscuit | 404-835-2072 |
| Tin Drum | 404-205-5650 |
| Subway | 404-815-2977 |

We'll go over how to do this later, but say you have picked out all of the phone numbers and saved them as strings:

    flying_biscuit_phone = '404-835-2072'
    tin_drum_phone = '404-205-5650'
    subway_phone = '404-815-2977'

You can delete the area code from each of these strings using lstrip(). The hyphen is included in the argument so that it is removed along with the digits of the area code:

    flying_biscuit_phone = flying_biscuit_phone.lstrip('404-')
    tin_drum_phone = tin_drum_phone.lstrip('404-')
    subway_phone = subway_phone.lstrip('404-')

And if you print each of these (let's use an f-string to make it 1 line):

    print(f"{flying_biscuit_phone}\n{tin_drum_phone}\n{subway_phone}")
    >>835-2072
      205-5650
      815-2977

Of course, we could also just slice off the first 4 characters of each string, as discussed in the previous unit. The point here is to illustrate the concept.
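One caution about this approach: lstrip() treats its argument as a set of characters to remove, not as an exact prefix. The sketch below shows what that means for phone numbers, using a made-up number whose next block also consists of 4s and 0s. It also shows str.removeprefix(), which removes an exact prefix and is available in Python 3.9 and later.

```python
phone = "404-835-2072"

# Without the hyphen in the argument, stripping stops at the first '-'.
digits_only = phone.lstrip("404")    # leaves the leading hyphen

# With the hyphen, the whole area code and separator are removed.
with_hyphen = phone.lstrip("404-")

# Hypothetical number where the next block is also made of 4s and 0s.
tricky = "404-404-1234"
over_stripped = tricky.lstrip("404-")    # removes too much

# Slicing and removeprefix() only remove the first four characters.
sliced = tricky[4:]
exact = tricky.removeprefix("404-")      # requires Python 3.9+

print(digits_only, with_hyphen, over_stripped, sliced, exact)
```

For the three restaurants in the table, lstrip() gives the expected result because none of the remaining numbers begin with a 4, a 0, or a hyphen.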

Now let's get some practice. In the code cell below write several strings that end with the same characters. Then use rstrip() to remove the characters at the end that the strings have in common.

In [None]:
#Define your string here


#Print your strings here


#Use .rstrip() here


#Print your strings again here



### .replace()

Although .strip(), .rstrip(), and .lstrip() will usually get the job done, their behavior can be more complex than described in this lesson, which can occasionally make your code behave in ways you did not expect. To avoid this issue, we can also use the <b>.replace()</b> function. This is exactly what it sounds like. The replace function can be applied to a string and takes two string arguments: 1) what to replace and 2) what to replace it with. For example, say we have a string that is in tab-separated values format, and we want to convert it to comma-separated values format:

    flying_biscuit = "Flying Biscuit\t4.7\tGreat breakfast"

We can replace all tabs with commas like:

    flying_biscuit = flying_biscuit.replace('\t', ',')

And if we print the string:

    print(flying_biscuit)
    >>Flying Biscuit,4.7,Great breakfast
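A few more things about .replace() are worth seeing in a short sketch: it replaces every occurrence in the string (not just those at the ends), replacing with an empty string deletes the text, and an optional third argument limits how many replacements are made. The variable names below are only for illustration.

```python
line = "Flying Biscuit\t4.7\tGreat breakfast\n"

# Every tab in the string is replaced, wherever it appears.
csv_line = line.replace("\t", ",")

# Replacing with an empty string deletes the newline.
no_newline = csv_line.replace("\n", "")

# A third argument limits the number of replacements (here, only the first tab).
first_only = line.replace("\t", ",", 1)

# replace() returns a new string; the original is unchanged.
print(repr(line))
print(repr(no_newline))
print(repr(first_only))
```

Keep in mind that this simple tab-to-comma conversion assumes none of the values contain a comma themselves.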

### A Worked Example

One task commonly done with Python is web scraping, which is using programming to extract information from webpages. Webpages are usually structured using HTML, which contains many different tags that tell the browser how to arrange the webpage. However, we usually want to get rid of these tags and keep only the information between them. In this example, we will look at how we can use the tools we have learned so far to extract information from lines of HTML.

Now that we have covered the basics of print statements, variable assignment, and string manipulations, we will go over a worked example, and then into some practice problems. This example will be an early introduction to one of the most important aspects of programming: developing an approach before writing code. This might seem a bit silly because this is not a very complicated example. It would be easy enough to just write the code very quickly. However, it is good to start practicing this because it will make more complex tasks more approachable. That being said, let's get into the example.

We have the following strings:

In [None]:
client_html_line = "    <p><u>Client</u>: The Carter Center</p>"
location_html_line = "        <p>Location: 453 John Lewis Freedom Pkwy NE, Atlanta, GA 30307</p>"
employees_html_line = "        <p>Employees: 175; field office staff in more than a dozen countries</p>"

print(client_html_line)
print(location_html_line)
print(employees_html_line)

    <p><u>Client</u>: The Carter Center</p>
        <p>Location: 453 John Lewis Freedom Pkwy NE, Atlanta, GA 30307</p>
        <p>Employees: 175; field office staff in more than a dozen countries</p>


Say we are trying to populate a tab-separated file that can be opened in Excel. Our column headers are <i>Client</i>, <i>City</i>, and <i>Employee Number</i>. Our objective is to print a tab-separated string that has "The Carter Center" for the <i>Client</i> column, "Atlanta" for the <i>City</i> column, and "175" for the <i>Employee Number</i> column.

<u>Approach</u>

Step 1: We want to remove the white space from the front of each line. This can be done with .strip().

Step 2: We want to remove the unwanted characters from the beginning and end of each line. This can be done with .lstrip() and .rstrip().

Step 3: We want to print the strings with a tab between each variable. This can be done with an f-string.

<i>Note that we are not writing down the details of exactly what we are going to type. We are simply writing down what the objective is and how we can accomplish it, and we divide this process up into steps to make it easier for us to think about. Again, this might seem silly because this task is relatively straightforward, but this approach will be useful later.</i>

In [None]:
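# Steps 1 and 2 are done at the same time.
# Note: .lstrip() and .rstrip() remove any of the characters in their argument,
# not an exact prefix or suffix. This works here only because the text we keep
# starts and ends with characters that do not appear in those arguments.
client = client_html_line.strip().lstrip("<p><u>Client</u>: ").rstrip("</p>")
location = location_html_line.strip().lstrip("<p>Location: 453 John Lewis Freedom Pkwy NE, ").rstrip(", GA 30307</p>")
employees = employees_html_line.strip().lstrip("<p>Employees: ").rstrip("; field office staff in more than a dozen countries</p>")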

#step 3
output = f"{client}\t{location}\t{employees}"

print(output)

The Carter Center	Atlanta	175
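As an alternative sketch of the same three steps, the version below uses str.removeprefix() and str.removesuffix(), which remove an exact piece of text from the start or end of a string. These methods are not covered in this lesson and require Python 3.9 or later, so treat this as one possible variation rather than the approach the lesson teaches.

```python
client_html_line = "    <p><u>Client</u>: The Carter Center</p>"
location_html_line = "        <p>Location: 453 John Lewis Freedom Pkwy NE, Atlanta, GA 30307</p>"
employees_html_line = "        <p>Employees: 175; field office staff in more than a dozen countries</p>"

# Step 1: strip() removes the leading whitespace.
# Step 2: removeprefix()/removesuffix() remove the exact unwanted text.
client = (
    client_html_line.strip()
    .removeprefix("<p><u>Client</u>: ")
    .removesuffix("</p>")
)
location = (
    location_html_line.strip()
    .removeprefix("<p>Location: 453 John Lewis Freedom Pkwy NE, ")
    .removesuffix(", GA 30307</p>")
)
employees = (
    employees_html_line.strip()
    .removeprefix("<p>Employees: ")
    .removesuffix("; field office staff in more than a dozen countries</p>")
)

# Step 3: join the three values with tabs using an f-string.
output = f"{client}\t{location}\t{employees}"
print(output)
```

Because these methods match the whole prefix or suffix exactly, they leave the string unchanged if the expected text is not there, instead of stripping individual characters.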
