## Module 2B: How to Read Files into Lists

The big picture of this course is to introduce you to Python through the lens of web scraping. To reiterate what was mentioned in the <i>Introduction</i> module, when we scrape a webpage, we are really just going through each line in the HTML file and searching for key tags and terms that identify different elements. Before we get started, let's recall a few key topics.

First, most webpages are just HTML files that have additional style added to them (which can be done in a variety of ways). Each line of this file can be treated as plain text and saved as a string, which we can then manipulate like we would any string (covered in Module 1).

Second, since we can make each line in an HTML file into a string, we can save these lines into a list. This will become very important when we talk about how to loop through each item in a list to scrape it. For now, we are just going to cover how to construct said list.

### Unit 1: File Paths and Working Directories

Since we are now dealing with files saved on our computer, we need to tell Python where to find them. To do this, we first need to go over what exactly files and directories are.

<b>Directories</b>, simply put, are locations for storing files on a computer. This is easy to illustrate by looking at the folders on your desktop (directories and folders are the same thing). As an exercise, say we have a folder called "Projects" on our Desktop. Within this folder, there are 5 folders: Baby_Kicks, Concrete_Jungle, GreenLight_Fund, The_Bakery, and The_Carter_Center. Directories are set up in a hierarchical system. Therefore, the <b>path</b> to the "Baby_Kicks" folder is <b><i>/Desktop/Projects/Baby_Kicks</i></b>. A path is just the directions to your directory or file, where each location is separated by a "/". In the path to the Baby_Kicks directory, the first location is the "Desktop." After "Desktop" comes "Projects." This means that the Projects directory is inside of the Desktop directory. Likewise, the Baby_Kicks folder is inside of the Projects directory. You can think of this as a sort of tree-like structure:

                       |
                    Desktop
                       |
               |---------------|
          Other_folders     Projects
                               |
               |---------------|------------------|-----------|----------------| 
           Baby_Kicks  Concrete_Jungle  GreenLight_Fund    The_Bakery     The_Carter_Center
                         
                         
You may have noticed that in our example file path, there was a "/" in front of the "Desktop" directory as well. This implies that there is some directory that the "Desktop" directory is inside of. Your entire computer is set up in this tree-like way, and your Desktop is just your main interface. There are many other directories further up the tree (towards the root), but we don't have to worry about these right now.


<b>Files</b> are objects that simply hold various kinds of data. The path to a file is just the path to the directory it is inside of, with its name tagged onto the end. For example, say there is a file inside of the Baby_Kicks directory called "Results.txt." The path to this file would be <b><i>/Desktop/Projects/Baby_Kicks/Results.txt</i></b>

There is a lot more depth that we can go into on this topic. However, this is a Python course, not a general computer course. The simplest approach for your purposes as a beginner will be to work from your Desktop. That way you can physically see all of the directories and files that you will be dealing with.
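As a minimal sketch of the ideas in this unit, Python's standard pathlib module can build a path one location at a time and report the working directory, which is the directory Python looks in when it is given a file name with no directories in front of it. The folder and file names are the ones from the example above; whether they exist on your computer is an assumption, so the first part of the sketch only manipulates the text of the path.

```python
from pathlib import Path, PurePosixPath

# Build the example path one location at a time (the "/" operator joins pieces).
# PurePosixPath only works with the text of the path; it never touches the disk.
results_path = PurePosixPath('/Desktop') / 'Projects' / 'Baby_Kicks' / 'Results.txt'

print(results_path)          # /Desktop/Projects/Baby_Kicks/Results.txt
print(results_path.parent)   # /Desktop/Projects/Baby_Kicks
print(results_path.name)     # Results.txt

# The working directory is where Python looks for a file such as
# 'GreenLight_Fund.html' when only the file name is given.
working_directory = Path.cwd()
print(working_directory)
```

The last line prints a different path on every computer, which is why no expected output is shown for it.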

### Unit 2: Opening and Reading Files in Python

Now that we have covered how to tell Python where your files are, let's look at how to open these files with Python and save their contents into data structures that we can work with.

In the directory that this notebook was distributed with, there is a simplified version of the "Recent_Projects_Simple.html" that we looked at in the <i>Introduction</i> module. This file is called "GreenLight_Fund.html" and it contains just the information about the GreenLight Fund project. This file will be used in the following examples.

To open a file with Python, we use the <b>open()</b> function. This function takes two arguments: the name of the file, and what you want to do with the file. For this lesson, we will be reading the file, so we will put 'r' (for read) in this argument. However, in later lessons we will use this same function to write and add to files as well.

In [1]:
file = open('GreenLight_Fund.html', 'r')

Although we have opened the file, we can't really work with it in this form. Notice what happens if you try to print the file.

In [2]:
print(file)

<_io.TextIOWrapper name='GreenLight_Fund.html' mode='r' encoding='UTF-8'>


When we save the open file to a variable, its type is _io.TextIOWrapper. We don't have to go into the details of what this means. For now, just remember that you can use the <b>.readlines()</b> function to save the file's contents to a list, where each line is a separate element in that list.

In [3]:
file = open('GreenLight_Fund.html', 'r')
lines = file.readlines()
file.close()  # close the file once its lines are saved in the list
print(lines)

['<!DOCTYPE html>\n', '\n', '<html>\n', '\n', '    <body style="background-color: white;">\n', '\n', '        <h1>Dataworks Recent Projects</h1>\n', '        <hr>\n', '\n', '        <h2><b>GreenLight Fund</b></h2>\n', '        <p><u>Client</u>: The GreenLight Fund - Atlanta</p>\n', '        <p><u>Website</u>: <a href = "https://greenlightfund.org/sites/atlanta/">https://greenlightfund.org/sites/atlanta/</a></p>\n', '        <p><u>Main Location</u>: 33.7672720, -84.4001850</p>\n', '        <p><u>Tools Used:</u> Excel, Power BI</p>\n', '        <p><u>Description</u></p>The GreenLight Fund supports non-profits that directly address issues important to local community members. To help identify those issues, DataWorks conducted interviews with 100 people from Fulton, Dekalb, Gwinnett, Clayton, and Cobb counties. The team <b>recruited participants</b> by canvassing locations like MARTA stations, shopping centers, and neighborhoods. The interview responses were <b>documented in a spreadsheet<

Now we can start to see the contents of our file. However, it is a bit messy. Since each line is a separate element in the list, there are some cases where the element is just a newline character (\n). This is also where the relevance of <i>Module 1</i> starts to come in. We will need a way to clean up the elements in this list.
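As a small illustration of that cleanup, the sketch below applies .strip() to two elements of the list. The list is the first few elements of the output above, copied by hand so the example runs without the file. The repr() function is used only because it makes hidden characters such as spaces and \n visible when printing.

```python
# The first few elements of the list printed above, copied by hand so that
# this example runs without GreenLight_Fund.html.
lines = ['<!DOCTYPE html>\n', '\n', '<html>\n', '\n',
         '    <body style="background-color: white;">\n', '\n',
         '        <h1>Dataworks Recent Projects</h1>\n']

blank_line = lines[1]
header_line = lines[6]

print(repr(blank_line.strip()))    # '' (nothing is left once the newline is removed)
print(repr(header_line))           # the leading spaces and trailing \n are visible
print(repr(header_line.strip()))   # '<h1>Dataworks Recent Projects</h1>'
```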

As a side note, there is another popular way of performing this same task. As with other cases like this, it does not necessarily matter which way you accomplish these tasks. However, we are covering alternative ways because different people will prefer one way over the other, and some methods are more human readable than others. See the example below:

In [4]:
with open('GreenLight_Fund.html', 'r') as file:
    lines = file.readlines()

print(lines)

['<!DOCTYPE html>\n', '\n', '<html>\n', '\n', '    <body style="background-color: white;">\n', '\n', '        <h1>Dataworks Recent Projects</h1>\n', '        <hr>\n', '\n', '        <h2><b>GreenLight Fund</b></h2>\n', '        <p><u>Client</u>: The GreenLight Fund - Atlanta</p>\n', '        <p><u>Website</u>: <a href = "https://greenlightfund.org/sites/atlanta/">https://greenlightfund.org/sites/atlanta/</a></p>\n', '        <p><u>Main Location</u>: 33.7672720, -84.4001850</p>\n', '        <p><u>Tools Used:</u> Excel, Power BI</p>\n', '        <p><u>Description</u></p>The GreenLight Fund supports non-profits that directly address issues important to local community members. To help identify those issues, DataWorks conducted interviews with 100 people from Fulton, Dekalb, Gwinnett, Clayton, and Cobb counties. The team <b>recruited participants</b> by canvassing locations like MARTA stations, shopping centers, and neighborhoods. The interview responses were <b>documented in a spreadsheet<

Let's break this down. First, we used a <b>with statement</b>, which opens the file with the open function and automatically closes it again once the indented block is finished. Instead of saving the file to a variable with an equals sign, we used the <b>as</b> keyword to assign our opened file to the name 'file.' Finally, we used the <b>readlines()</b> function to save the lines in the file to a list, which we named 'lines.'
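The sketch below compares the two styles by checking the .closed attribute of the file object, which is True once a file has been closed. The one-line sample file is an assumption: it is written to a temporary folder only so the example can run without GreenLight_Fund.html, and the variable names are illustrative.

```python
import os
import tempfile

# Assumption: a one-line stand-in file, created only so the example is runnable.
sample_path = os.path.join(tempfile.mkdtemp(), 'sample.html')
with open(sample_path, 'w') as sample_file:
    sample_file.write('<!DOCTYPE html>\n')

# Without "with": the file stays open until we close it ourselves.
manual_file = open(sample_path, 'r')
manual_lines = manual_file.readlines()
print(manual_file.closed)   # False
manual_file.close()
print(manual_file.closed)   # True

# With "with": Python closes the file as soon as the indented block ends.
with open(sample_path, 'r') as auto_file:
    auto_lines = auto_file.readlines()
print(auto_file.closed)     # True
```

Both styles produce the same list of lines; the with statement simply removes the need to remember the .close() call.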

There might be some cases where you only want to read one line in the file. Python offers a similar function called <b>.readline()</b> that reads the next line of the file and returns it as a string. On a file that has just been opened, that is the first line.

In [5]:
file = open('GreenLight_Fund.html', 'r')
first_line = file.readline()
file.close()  # close the file once we have the line we need
print(first_line)

<!DOCTYPE html>



When is reading only the first line useful? For example, say we have 100 files and we want to check whether they are HTML files or not. Instead of reading all of the lines for each file, we can just look at the first one to decide. This check will not always hold (for example, if a file is missing its DOCTYPE line), but instances like this illustrate when .readline() is good to use.
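A minimal sketch of that check is shown below. The three-line sample file is an assumption, written to a temporary folder only so the example runs on its own, and the test used (the first line starts with a DOCTYPE declaration) is one possible rule rather than a complete way of recognizing HTML.

```python
import os
import tempfile

# Assumption: a short stand-in file, created only so the example is runnable.
sample_path = os.path.join(tempfile.mkdtemp(), 'sample.html')
with open(sample_path, 'w') as sample_file:
    sample_file.write('<!DOCTYPE html>\n\n<html>\n')

with open(sample_path, 'r') as file:
    first_line = file.readline()    # the first call returns the first line
    second_line = file.readline()   # each later call returns the next line

# Ignore surrounding whitespace and capitalization before checking the start.
looks_like_html = first_line.strip().lower().startswith('<!doctype html')

print(looks_like_html)      # True
print(repr(second_line))    # '\n'
```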

### Unit 3: A Worked Example

For this example, we want to get the GreenLight Fund website address from the "Website" line of the file and store it as a variable called "email." There are many ways to go about this.

One approach could be to:

1) Read the file and save the lines into a list

2) Get the line that contains the email address and remove the white space using .strip()

3) Use the positions of the first letter of the email and the last letter to slice out the email address and save this to a variable

This sounds good. Let's give it a try.

In [6]:
with open('GreenLight_Fund.html', 'r') as file:
    lines = file.readlines()

# to get this line number, I opened up the HTML file and looked for the line I needed
email_line = lines[11]

print(email_line)

        <p><u>Website</u>: <a href = "https://greenlightfund.org/sites/atlanta/">https://greenlightfund.org/sites/atlanta/</a></p>



So far so good. I read the file and saved it into a list. Then I looked at the HTML file and saw that I needed line 12. Then I used this line number - 1 (because Python starts counting at 0) as the index to get this line out of the list.

In [7]:
#now to remove the white space
email_line = email_line.strip()
print(email_line)

<p><u>Website</u>: <a href = "https://greenlightfund.org/sites/atlanta/">https://greenlightfund.org/sites/atlanta/</a></p>


In [8]:
# and finally, to extract the email I want
email = email_line[30:71]
print(email)

https://greenlightfund.org/sites/atlanta/


You may have noticed one issue with this approach. How do you know the positions you need? Counting every time seems tedious. While I did not count the positions, I did take a guess and check approach to get the specific range I needed. It's easy to imagine a scenario where this approach might not be sufficient. Say you want to do this for 100 different email lines, each with different lengths. We won't actually do these one by one. We will be writing a loop to iterate through them (which we will discuss later). The "counting" and "guess and check" approaches are not suitable for putting inside of loops because they require your input each time. Our objective is to automate this process. Let's see if we can come up with an approach that might be more applicable to more complex situations:

1) Our first several steps will be the same as before. Read in the HTML file, get the line we want, and remove the whitespace.

2) If we look at the HTML line we are interested in, we see that there are various symbols that we might be able to use the .split() function on. Since our target is between a ">" and a "<", we can split the string by ">".

3) In the split, our target text will be the 5th element (index 4 in Python) of the resulting list. We can get this element and use .rstrip() to remove the trailing HTML tag.




In [9]:
# Step 1
with open('GreenLight_Fund.html', 'r') as file:
    lines = file.readlines()

# we can add the .strip() function directly to this line of code for simplicity
email_line = lines[11].strip()

print(email_line)

<p><u>Website</u>: <a href = "https://greenlightfund.org/sites/atlanta/">https://greenlightfund.org/sites/atlanta/</a></p>


In [10]:
# Step 2
split_line = email_line.split('>')
print(split_line)

['<p', '<u', 'Website</u', ': <a href = "https://greenlightfund.org/sites/atlanta/"', 'https://greenlightfund.org/sites/atlanta/</a', '</p', '']


In [11]:
# Step 3

email = split_line[4].rstrip('/a')
print(email)

https://greenlightfund.org/sites/atlanta/<


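The '<' is still there because .rstrip() does not remove the exact text '/a'. It removes any trailing characters that appear in the set we give it ('/' and 'a') and stops at the first character that is not in that set, which here is '<'. We cannot simply add '<' to the set, because the address itself ends in characters from that set and they would be stripped as well. Therefore, we will just remove the '<' by selecting all of the characters except for the last one. We can do this using the len() function.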

In [12]:
# Final Output
email = email[:len(email)-1]
print(email)

https://greenlightfund.org/sites/atlanta/
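To see how .rstrip() treats its argument, the sketch below runs it on the same string with two different sets of characters. It also shows one possible alternative, splitting on '<' and keeping the first piece, which is not the approach used in the cells above; the variable names are illustrative.

```python
link_piece = 'https://greenlightfund.org/sites/atlanta/</a'

# rstrip removes trailing characters found in the set {'/', 'a'} and stops at '<'.
stripped_two = link_piece.rstrip('/a')
print(stripped_two)     # https://greenlightfund.org/sites/atlanta/<

# Adding '<' to the set removes too much: after '</a' is gone, the trailing
# '/' and 'a' of the address itself are also in the set.
stripped_three = link_piece.rstrip('</a')
print(stripped_three)   # https://greenlightfund.org/sites/atlant

# Alternative sketch: split on '<' and keep everything before it.
before_tag = link_piece.split('<')[0]
print(before_tag)       # https://greenlightfund.org/sites/atlanta/
```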


This approach was somewhat less straightforward. However, this same approach will hold up for other lines formatted like this one as well. Let's have a quick look at another email line from the original "Recent_Projects_Simple.html" page:

\<p>\<u>Website\</u>: \<a href = "https://www.cartercenter.org/">https://www.cartercenter.org/\</a>\</p>

Pretend I have already read in the file, extracted this line from the list, and removed the white space.

In [13]:
email_line = '<p><u>Website</u>: <a href = "https://www.cartercenter.org/">https://www.cartercenter.org/</a></p>'

# the following code is exactly the same as above
split_line = email_line.split('>')
email = split_line[4].rstrip('/a')
email = email[:len(email)-1]
print(email)

https://www.cartercenter.org/


Above, we used the exact same method for processing the email line. I just copied and pasted the code. Overall, this example was meant to reinforce what we covered in this module, but also to demonstrate a wider application of concepts that we covered in previous modules. Most importantly, we are learning to think like a programmer by constructing a solution that is reusable and more applicable to different situations (in our case, different HTML lines).
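One way to avoid copying and pasting is to give the three steps a name by wrapping them in a function. The sketch below does this with a hypothetical helper called extract_link_text, which is not part of the course files. It assumes every line has the same layout as the two examples above, so that the target text is always element 4 after splitting on '>'. If you have not met functions yet, read the def block as a named, reusable copy of the same code.

```python
def extract_link_text(line):
    """Return the text shown between <a ...> and </a> on a 'Website' line.

    Hypothetical helper: assumes the line is laid out like the two examples
    in this unit, so the target is element 4 after splitting on '>'.
    """
    split_line = line.strip().split('>')
    link_text = split_line[4].rstrip('/a')   # leaves a trailing '<'
    return link_text[:len(link_text) - 1]    # drop that last '<'


# The two lines used in this unit, typed in by hand so the example runs alone.
greenlight_line = '        <p><u>Website</u>: <a href = "https://greenlightfund.org/sites/atlanta/">https://greenlightfund.org/sites/atlanta/</a></p>\n'
carter_line = '<p><u>Website</u>: <a href = "https://www.cartercenter.org/">https://www.cartercenter.org/</a></p>'

greenlight_link = extract_link_text(greenlight_line)
carter_link = extract_link_text(carter_line)

print(greenlight_link)   # https://greenlightfund.org/sites/atlanta/
print(carter_link)       # https://www.cartercenter.org/
```

A line with a different layout, such as one with extra tags before the link, would need a different index, so the helper is only as general as that assumption.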

### Unit 4: Practice Problems

<b>Problem 1</b>: In the GreenLight_Fund.html file, the header of the page is "Dataworks Recent Projects." This header is enclosed by the \<h1> and \</h1> HTML tags. Save this header to the "page_header" variable.

Expected output:

page_header = 'Dataworks Recent Projects'

In [14]:
page_header = ''

# type your code here



<b>Problem 2</b>: In the GreenLight_Fund.html file, there are several terms that are bolded in the description that indicate what tasks DataWorks performed for this project. The words are made bold by enclosing them in the \<b> and \</b> HTML tags. The objective of this exercise is to save each of the bolded terms into the "tasks" list.

Expected output:

tasks = ['recruited participants', 'documented in a spreadsheet', 'identified patterns', 'composed data visualizations']

<i>Note: It might be useful to write out an approach before beginning the task</i>

#### Type your approach here
___________________________

1)

2)

3)

...

In [15]:
tasks = []

# type your code below


