# Web Scraping with Beautiful Soup — A Use Case

https://towardsdatascience.com/web-scraping-with-beautiful-soup-a-use-case-fc1c60c8005d

In this notebook, I will give a brief introduction to obtaining data from a webpage,
i.e., web scraping, using Python and libraries such as Requests to get the data and
Beautiful Soup to parse it. Web scraping becomes necessary when a website does not 
have an API, or one that suits your needs.

As an example, I use a webpage that has a consistent HTML structure, but this approach
can be generalized. While there are frameworks, such as Scrapy, that can handle
this kind of task, I decided to do this by hand as a learning experience.

*The Use Case*

A not-for-profit organization wants to reach out to the Community Foundations of Canada
(CFC) sites across the nation. They asked me to find each contact person and their mailing 
address, and put all the information in a special format in a spreadsheet.

Doing this task manually, by copy-pasting each required field into the spreadsheet, would
mean doing this 195 (foundations) * 11 (fields) = 2145 times! So my next thought was to
automate the procedure by scraping the CFC website.

Below is the code used to scrape their website, get the requested information, and write it
in the format requested into a CSV file.

First let's import the libraries we will be using:

*Requests* to fetch the contents of the webpage at a given URL, and to change the headers sent with the request.

*Beautiful Soup (bs4)* to parse and navigate the HTML obtained.

*RegEx* to find text within strings.

*Pandas* to create and manipulate dataframes.

*Time* to space out the requests to each URL (195 of them). We want our requests to behave as humanly as possible while opening these pages; it would not be polite to overload the website.

*Genderize* which connects to a web service to find whether a first name is more likely female or male, which helps us choose the title used to address the contact person. Unfortunately, it limits the number of requests in a given period, so one should avoid calling it while debugging code.

*spaCy* and *NLTK*, both natural language processing libraries, in case they are needed.

In [1]:
# import libraries
import requests
from requests import get
from bs4 import BeautifulSoup
import regex as re
import csv
import pandas as pd
import time
from genderize import Genderize
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
from nltk.corpus import stopwords
import spacy
from spacy import displacy
from collections import Counter
import en_core_web_sm
nlp = en_core_web_sm.load()

Starting a request session

In [2]:
session = requests.Session()

In [6]:
session.headers['User-Agent']

'python-requests/2.19.1'

Setting headers to look more human - check https://www.whatismybrowser.com/developers/what-http-headers-is-my-browser-sending

In [3]:
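# Adjacent string literals are joined by Python, so the header values
# do not pick up the indentation that a backslash continuation would add.
my_headers = {
    "User-Agent": ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/71.0.3578.98 Safari/537.36"),
    "Accept": ("text/html,application/xhtml+xml,application/xml;"
               "q=0.9,image/webp,image/apng,*/*;q=0.8"),
}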

Let's get the information from their website that contains a list of all the CFCs:

In [4]:
url = 'https://communityfoundations.ca/find-a-community-foundation/'
response = session.get(url, headers=my_headers)

The server's response headers can be viewed as a dictionary-like object:

In [9]:
response.headers

{'Connection': 'Keep-Alive', 'Date': 'Wed, 06 Mar 2019 21:41:05 GMT', 'Content-Encoding': 'gzip', 'Keep-Alive': 'timeout=5', 'Content-Type': 'text/html; charset=UTF-8', 'Vary': 'Accept-Encoding,User-Agent', 'Link': '<https://www.communityfoundations.ca/wp-json/>; rel="https://api.w.org/", <https://www.communityfoundations.ca/?p=1070>; rel=shortlink', 'X-Pingback': 'https://www.communityfoundations.ca/xmlrpc.php/', 'Content-Length': '32385', 'Server': 'Apache', 'X-Powered-By': 'PHP/5.6.38'}

As for the first 500 characters obtained:

In [16]:
response.text[:500]

'<!doctype html>\n<!--[if lt IE 7]> <html class="no-js ie6 oldie" lang="en-US" prefix="og: http://ogp.me/ns#"> <![endif]-->\n<!--[if IE 7]>    <html class="no-js ie7 oldie" lang="en-US" prefix="og: http://ogp.me/ns#"> <![endif]-->\n<!--[if IE 8]>    <html class="no-js ie8 oldie" lang="en-US" prefix="og: http://ogp.me/ns#"> <![endif]-->\n<!--[if IE 9]>    <html class="no-js ie9 oldie" lang="en-US" prefix="og: http://ogp.me/ns#"> <![endif]-->\n<!--[if gt IE 9]><!--> <html class="no-js" lang="en-US" pref'

Now let's get BeautifulSoup to parse it, creating a *bs4.BeautifulSoup* object.

In [20]:
html_soup = BeautifulSoup(response.text, 'html.parser')
type(html_soup)

bs4.BeautifulSoup

By inspecting the source of the website using Chrome, one notices the foundations' names and urls are within *h3* html headers. Therefore, let's find those using bs4.

In [21]:
info_containers = html_soup.find_all('h3')
print(type(info_containers))
print(len(info_containers))

<class 'bs4.element.ResultSet'>
195


This gives us 195 containers, and inspecting the first one of them:

In [22]:
first_cfc = info_containers[0]
first_cfc

<h3><a href="https://www.communityfoundations.ca/cfc_locations/wood-buffalo-community-foundation/">Wood Buffalo Community Foundation</a></h3>

This shows that we are on the right track: it gives us both the foundation name, which we can extract by appending *.text*, and a URL with further information.

In [23]:
location_title = first_cfc.text
location_title

'Wood Buffalo Community Foundation'

Let's take a peek at an entry near the end of the list.

In [24]:
cfc2 = info_containers[193]
cfc2

<h3><a href="https://www.communityfoundations.ca/cfc_locations/south-saskatchewan-community-foundation-inc/">South Saskatchewan Community Foundation Inc.</a></h3>

The urls that need further inspection are in *<a href = ...* html containers, so if we find them all via bs4:

In [25]:
a_containers = html_soup.find_all('a')

In [26]:
for tag in a_containers:
    print(tag.get('href'))

None
None
/feed
https://twitter.com/CommFdnsCanada
https://www.facebook.com/CommunityFdnsCanadaHome
https://flickr.com/communityfoundationsofcanada
https://youtube.com/user/cfcteam
https://www.communityfoundations.ca/fr/find-a-community-foundation/
https://www.communityfoundations.ca/
https://communityfoundations.ca/news/
https://www.communityfoundations.ca/contact-us/
https://www.communityfoundations.ca
#
https://www.communityfoundations.ca/about/
https://communityfoundations.ca/wp-content/uploads/2018/08/CFC046_AR2017_Digital_Aug28.pdf
https://www.communityfoundations.ca/2016-annual-report/
https://www.communityfoundations.ca/contact-us/
https://www.communityfoundations.ca/board-of-directors/
https://www.communityfoundations.ca/champions/
https://www.communityfoundations.ca/find-a-community-foundation/
https://www.communityfoundations.ca/careers/
#
https://www.communityfoundations.ca/our-work/
https://www.communityfoundations.ca/conference-2019/
https://www.communityfoundations.ca/vi

This returns much more than the URLs we need. So let's constrain the search: the URL must contain *cfc_locations*.

In [27]:
aCF_containers = html_soup.find_all("a", href=re.compile("cfc_locations"))


In [28]:
len(aCF_containers)

195

And now, we get our 195 objects, but we need to see what they look like:

In [29]:
aCF_containers[98]

<a href="https://www.communityfoundations.ca/cfc_locations/souris-glenwood-foundation-inc/">Souris Glenwood Foundation Inc.</a>

In [30]:
for tag in aCF_containers:
    print(tag.get('href'))

https://www.communityfoundations.ca/cfc_locations/wood-buffalo-community-foundation/
https://www.communityfoundations.ca/cfc_locations/airdrie-and-district-community-foundation/
https://www.communityfoundations.ca/cfc_locations/the-banff-community-foundation/
https://www.communityfoundations.ca/cfc_locations/battle-river-community-foundation/
https://www.communityfoundations.ca/cfc_locations/community-foundation-of-lethbridge-and-southwestern-alberta/
https://www.communityfoundations.ca/cfc_locations/community-foundation-of-northwestern-alberta/
https://www.communityfoundations.ca/cfc_locations/community-foundation-of-medicine-hat-and-southeastern-alberta/
https://www.communityfoundations.ca/cfc_locations/drayton-valley-community-foundation/
https://www.communityfoundations.ca/cfc_locations/edmonton-community-foundation/
https://www.communityfoundations.ca/cfc_locations/red-deer-district-community-foundation/
https://www.communityfoundations.ca/cfc_locations/st-albert-community-foundat

And yes, it seems we now have only the URLs we need to inspect in order to get the mailing information.
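A regular expression on the substring *cfc_locations* is a loose filter. As a minimal sketch of a stricter check, the snippet below parses each href with the standard library and keeps only those whose path looks like `/cfc_locations/<slug>/`, dropping duplicates along the way. The helper name `is_location_url` is hypothetical, and the two-segment path shape is an assumption based on the URLs printed above.

```python
from urllib.parse import urlparse


def is_location_url(href):
    """Return True if href points at /cfc_locations/<slug>/ (assumed shape)."""
    if not href:
        return False
    segments = [part for part in urlparse(href).path.split("/") if part]
    return len(segments) == 2 and segments[0] == "cfc_locations"


# A few hrefs of the kinds printed earlier, including None and a duplicate.
hrefs = [
    None,
    "/feed",
    "#",
    "https://www.communityfoundations.ca/about/",
    "https://www.communityfoundations.ca/cfc_locations/wood-buffalo-community-foundation/",
    "https://www.communityfoundations.ca/cfc_locations/wood-buffalo-community-foundation/",
]

# dict.fromkeys keeps the first occurrence of each URL, in order.
location_urls = list(dict.fromkeys(h for h in hrefs if is_location_url(h)))
print(location_urls)
```

With the real page, the same filter could be applied to `tag.get('href')` for every tag in `a_containers`.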

Unfortunately, on further inspection, these pages have the contact's name and title in the organization, but the mailing address is incomplete: it lacks the name of the province or territory. So we need to find a way to keep the province or territory of each foundation from the main page.

By inspecting the source of the page, we know they are in *h2* html headers. However, finding *h2* gives the lines that have *h2* but not the *h3*s that are in between:


In [31]:
info_containers_byProv = html_soup.find_all('h2')
print(type(info_containers_byProv))
print(len(info_containers_byProv))

<class 'bs4.element.ResultSet'>
16


In [32]:
info_containers_byProv

[<h2 class="hidden">Social Profiles</h2>,
 <h2>AB</h2>,
 <h2>Alberta</h2>,
 <h2>British Columbia</h2>,
 <h2>Manitoba</h2>,
 <h2>New Brunswick</h2>,
 <h2>Newfoundland and Labrador</h2>,
 <h2>Northwest Territories</h2>,
 <h2>Nova Scotia</h2>,
 <h2>Ontario</h2>,
 <h2>Prince Edward Island</h2>,
 <h2>Québec</h2>,
 <h2>Saskatchewan</h2>,
 <h2>Yukon</h2>,
 <h2 class="hidden">Mailing List</h2>,
 <h2 class="hidden">Social Profiles</h2>]

And even looking for *h2* alone gives other lines. Fortunately, those extra lines have something that can identify them, a *class = "hidden"*

Therefore, let's look for all the *h2* and *h3* instances that do not have the word 'hidden' (one can use lambda functions!).

In [33]:
info_containers_all = html_soup.find_all(["h2", "h3"], 
                                         class_=lambda x: x != 'hidden')
print(type(info_containers_all))
print(len(info_containers_all))

<class 'bs4.element.ResultSet'>
208


In [34]:
info_containers_all

[<h2>AB</h2>,
 <h3><a href="https://www.communityfoundations.ca/cfc_locations/wood-buffalo-community-foundation/">Wood Buffalo Community Foundation</a></h3>,
 <h2>Alberta</h2>,
 <h3><a href="https://www.communityfoundations.ca/cfc_locations/airdrie-and-district-community-foundation/">Airdrie and District Community Foundation</a></h3>,
 <h3><a href="https://www.communityfoundations.ca/cfc_locations/the-banff-community-foundation/">Banff Canmore Community Foundation</a></h3>,
 <h3><a href="https://www.communityfoundations.ca/cfc_locations/battle-river-community-foundation/">Battle River Community Foundation</a></h3>,
 <h3><a href="https://www.communityfoundations.ca/cfc_locations/community-foundation-of-lethbridge-and-southwestern-alberta/">Community Foundation of Lethbridge And Southwestern Alberta</a></h3>,
 <h3><a href="https://www.communityfoundations.ca/cfc_locations/community-foundation-of-northwestern-alberta/">Community Foundation Of Northwestern Alberta</a></h3>,
 <h3><a href="h

Now we have the lines that contain the province/territory, foundation name, and the urls to be inspected. 

Let's find out how to extract those pieces of information. The *.text* of the *h2* lines gives us the province/territory name -- from now, it would be identified only by *province*.

The *h3* lines' text will be the foundation name, but not the url. Therefore, within each of those lines, get the *href* in the *<a*  block if it has the phrase *cfc_locations*.

In [48]:
for lines in info_containers_all:
    if lines.name == 'h2': 
        province = lines.text
        print('In', province, "\n")
    if lines.name == 'h3':
        foundation = lines.text
        print('Foundation name:', foundation)   
        print('Foundation url:', lines.find_all("a", 
                        href = re.compile("cfc_locations"))[0].get('href'),"\n")
        


In AB 

Foundation name: Wood Buffalo Community Foundation
Foundation url: https://www.communityfoundations.ca/cfc_locations/wood-buffalo-community-foundation/ 

In Alberta 

Foundation name: Airdrie and District Community Foundation
Foundation url: https://www.communityfoundations.ca/cfc_locations/airdrie-and-district-community-foundation/ 

Foundation name: Banff Canmore Community Foundation
Foundation url: https://www.communityfoundations.ca/cfc_locations/the-banff-community-foundation/ 

Foundation name: Battle River Community Foundation
Foundation url: https://www.communityfoundations.ca/cfc_locations/battle-river-community-foundation/ 

Foundation name: Community Foundation of Lethbridge And Southwestern Alberta
Foundation url: https://www.communityfoundations.ca/cfc_locations/community-foundation-of-lethbridge-and-southwestern-alberta/ 

Foundation name: Community Foundation Of Northwestern Alberta
Foundation url: https://www.communityfoundations.ca/cfc_locations/community-found
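The loop above can also collect its results instead of printing them. The following is a minimal sketch, run on a small inline HTML sample rather than the live page, that carries the most recent *h2* text forward and returns one `(province, name, url)` tuple per *h3*. The function name `foundations_by_province` is hypothetical. Instead of the lambda filter, it skips hidden headings by checking the tag's class list directly, which is a design choice of this sketch and not the notebook's method.

```python
from bs4 import BeautifulSoup

# Trimmed-down sample in the same shape as the page inspected above.
SAMPLE_HTML = """
<h2 class="hidden">Social Profiles</h2>
<h2>AB</h2>
<h3><a href="https://www.communityfoundations.ca/cfc_locations/wood-buffalo-community-foundation/">Wood Buffalo Community Foundation</a></h3>
<h2>Alberta</h2>
<h3><a href="https://www.communityfoundations.ca/cfc_locations/airdrie-and-district-community-foundation/">Airdrie and District Community Foundation</a></h3>
<h2 class="hidden">Mailing List</h2>
"""


def foundations_by_province(html):
    """Return (province, foundation name, url) tuples in page order."""
    soup = BeautifulSoup(html, "html.parser")
    province = None
    rows = []
    for tag in soup.find_all(["h2", "h3"]):
        # Skip headings marked as hidden, whatever other classes they carry.
        if "hidden" in (tag.get("class") or []):
            continue
        if tag.name == "h2":
            province = tag.get_text(strip=True)
        else:
            link = tag.find("a", href=True)
            if link is not None:
                rows.append((province, link.get_text(strip=True), link["href"]))
    return rows


rows = foundations_by_province(SAMPLE_HTML)
for row in rows:
    print(row)
```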

The next step is to inspect a single page out of those 195: request it and parse it via *bs4*.

In [49]:
url = 'https://communityfoundations.ca/cfc_locations/the-banff-community-foundation/'
subresponse = session.get(url, headers=my_headers)
html_subsoup = BeautifulSoup(subresponse.text, 'html.parser')

Again, by inspecting the source of the page with Chrome (or any other browser), one finds that the address is in a ('div', class_ = 'single-meta single-event') container:

In [50]:
addr_containers = html_subsoup.find_all('div', class_='single-meta single-event')
print(type(addr_containers))
print(len(addr_containers))

<class 'bs4.element.ResultSet'>
1


So, by inspecting one, we can also tell that:

The street number, P.O. Box or Box, City and Postal code are in a class called *meta-line location*, all separated by vertical bars. 

The phone is in a class called *meta-line phone*.

The particular foundation's website is in a class called *meta-line link*.

The contact's name and title are in a class called *meta-line contact*. The title is separated, in almost all cases, by a comma.

In [51]:
first_subcfc = addr_containers[0]
first_subcfc

<div class="single-meta single-event">
<p class="meta-line location">214 Banff Avenue/Box 3100  | Banff | T1L 1C7</p>
<p class="meta-line phone"><a href="tel:403-762-8549">403-762-8549</a></p>
<p class="meta-line link"><a href="http://www.banffcanmorecf.org">www.banffcanmorecf.org</a></p>
<p class="meta-line contact">Rob Buffler, Executive Director</p>
</div>

Therefore, if we find the html paragraph that includes the phrase 'meta-line contact' and split it at the comma, we would have an array with two elements, first the name, then the title:

In [52]:
c_contact = html_subsoup.find_all('p', class_='meta-line contact')
print(type(c_contact))
print(len(c_contact))
# Split "name, title" at the comma into its two parts
nameArray = re.split(r', ', c_contact[0].text)
nameArray

<class 'bs4.element.ResultSet'>
1


['Rob Buffler', 'Executive Director']
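Not every contact line follows the "name, title" pattern; some pages separate the two with " - " or give no title at all. As a minimal sketch, the hypothetical helper `split_contact` below handles those cases in one place by splitting only at the first separator. Apart from the Banff contact, the sample strings are made up for illustration.

```python
import re

# Either " - " (with surrounding spaces) or a comma separates name and title.
CONTACT_SEPARATOR = re.compile(r"\s+-\s+|,\s*")


def split_contact(text):
    """Return (name, title); title is '' when no separator is present."""
    parts = CONTACT_SEPARATOR.split(text.strip(), maxsplit=1)
    name = parts[0].strip()
    title = parts[1].strip() if len(parts) > 1 else ""
    return name, title


print(split_contact("Rob Buffler, Executive Director"))
print(split_contact("Jane Doe - Chair, Board of Directors"))  # made-up example
print(split_contact("Jane Doe"))                              # made-up example
```

Splitting only once keeps a title such as "Chair, Board of Directors" in one piece, and requiring spaces around the hyphen leaves hyphenated names intact.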

In [53]:
c_location = html_subsoup.find_all('p', class_='meta-line location')
print(type(c_location))
print(len(c_location))
ctext_location = c_location[0]
ctext_location.text

<class 'bs4.element.ResultSet'>
1


'214 Banff Avenue/Box 3100  | Banff | T1L 1C7'

In [57]:
address_split = re.split(r' \| ', c_location[0].text)
print(address_split)

['214 Banff Avenue/Box 3100 ', 'Banff', 'T1L 1C7']


While one can extract the address by splitting it at the vertical bars '|', this seemed like a good test for both NLTK and spaCy.

In [65]:
address_test = c_location[0].text
address_test

'214 Banff Avenue/Box 3100  | Banff | T1L 1C7'

In [66]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/bertaerodriguez-milla/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [70]:
nltk.download('stopwords')
en_stop = set(nltk.corpus.stopwords.words('english'))

[nltk_data] Downloading package stopwords to /Users/bertaerodriguez-
[nltk_data]     milla/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [73]:
def preprocess_without_stopwords(sent):
    sent = nltk.word_tokenize(sent)
    sent = [word for word in sent if word not in en_stop]
    sent = nltk.pos_tag(sent)
    return sent

In [74]:
sent = preprocess_without_stopwords(address_test)
sent

[('214', 'CD'),
 ('Banff', 'NNP'),
 ('Avenue/Box', 'NNP'),
 ('3100', 'CD'),
 ('|', 'NNP'),
 ('Banff', 'NNP'),
 ('|', 'NNP'),
 ('T1L', 'NNP'),
 ('1C7', 'CD')]

Tokenizing and tagging with NLTK did not give useful tags for an address. Banff was tagged as a proper noun (NNP) rather than as a place, the vertical bars were tagged as proper nouns too, and the postal code was split in two, a proper noun and a number. What about spaCy?

In [76]:
article = nlp(address_test)
len(article.ents)

5

In [77]:
labels = [x.label_ for x in article.ents]
Counter(labels)

Counter({'CARDINAL': 2, 'FAC': 1, 'PERCENT': 1, 'PERSON': 1})

In [78]:
sentences = [x for x in article.sents]
print(sentences[0])

214 Banff Avenue/Box 3100  | Banff | T1L 1C7


In [79]:
displacy.render(nlp(str(sentences[0])), jupyter=True, style='ent')

In this case, Banff was identified as a person, with the Box number as a percent. While one may be able to polish both approaches to yield better results, in this particular example, it is easier to split the fields by the vertical bar.
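The vertical-bar approach can be sketched as one small function. The snippet below is an illustration and not the code used later in the full script: `parse_location` is a hypothetical helper, the box pattern is a guess at how "Box" and "P.O. Box" entries are written, and the postal code pattern is a simplified letter-digit-letter digit-letter-digit check. The second sample address is made up.

```python
import re

# Simplified Canadian postal code shape, e.g. "T1L 1C7".
POSTAL_CODE = re.compile(r"[A-Za-z]\d[A-Za-z] ?\d[A-Za-z]\d")
# "Box 3100" or "P.O. Box 12" (assumed spellings).
BOX = re.compile(r"\b(?:P\.?\s?O\.?\s*)?Box\s+\d+", re.IGNORECASE)


def parse_location(raw):
    """Split 'address | city | postal code' and separate out any box number."""
    parts = [part.strip() for part in raw.split("|")]
    if len(parts) != 3:
        raise ValueError(f"expected 3 fields, got {len(parts)}: {raw!r}")
    address, city, postal_code = parts
    match = BOX.search(address)
    po_box = match.group(0) if match else ""
    # Remove the box text and any leftover separators from the street part.
    street = BOX.sub("", address, count=1).strip(" ,/") if match else address
    return {
        "street": street,
        "po_box": po_box,
        "city": city,
        "postal_code": postal_code,
        "postal_code_ok": POSTAL_CODE.fullmatch(postal_code) is not None,
    }


banff = parse_location("214 Banff Avenue/Box 3100  | Banff | T1L 1C7")
made_up = parse_location("P.O. Box 12, 5 Main Street | Somewhere | A1A 1A1")
print(banff)
print(made_up)
```

Raising an error for an unexpected number of fields makes a malformed address stop the run instead of silently leaving a gap in the output.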

We are now ready to obtain the information from all the pages. We only need to request it once, since the responses are stored in a list, so the cell below is left commented out afterwards. Notice the 10-second delay between requests. Since the process takes a while (195 requests at 10 seconds each is over half an hour), a print statement was added to confirm the script was still running and not having problems.

In [80]:
# Run only once, do not run again
'''
# Get urls container
subresponse=[]

html_soup = BeautifulSoup(response.text, 'html.parser')
info_containers_all = html_soup.find_all(["h2", "h3"], 
                                         class_=lambda x: x != 'hidden')
#print(len(info_containers_all))

for lines in info_containers_all:
    if lines.name == 'h3': 
        url_fou = lines.find_all("a", href=re.compile("cfc_locations"))[0].get('href')
        print(url_fou)
        subresponse.append(session.get(url_fou, headers=my_headers))
        time.sleep(10)
'''

https://www.communityfoundations.ca/cfc_locations/wood-buffalo-community-foundation/
https://www.communityfoundations.ca/cfc_locations/airdrie-and-district-community-foundation/
https://www.communityfoundations.ca/cfc_locations/the-banff-community-foundation/
https://www.communityfoundations.ca/cfc_locations/battle-river-community-foundation/
https://www.communityfoundations.ca/cfc_locations/community-foundation-of-lethbridge-and-southwestern-alberta/
https://www.communityfoundations.ca/cfc_locations/community-foundation-of-northwestern-alberta/
https://www.communityfoundations.ca/cfc_locations/community-foundation-of-medicine-hat-and-southeastern-alberta/
https://www.communityfoundations.ca/cfc_locations/drayton-valley-community-foundation/
https://www.communityfoundations.ca/cfc_locations/edmonton-community-foundation/
https://www.communityfoundations.ca/cfc_locations/red-deer-district-community-foundation/
https://www.communityfoundations.ca/cfc_locations/st-albert-community-foundat

https://www.communityfoundations.ca/cfc_locations/selkirk-district-community-foundation/
https://www.communityfoundations.ca/cfc_locations/shoal-lake-community-foundation/
https://www.communityfoundations.ca/cfc_locations/souris-glenwood-foundation-inc/
https://www.communityfoundations.ca/cfc_locations/sturgeon-community-foundation-inc/
https://www.communityfoundations.ca/cfc_locations/the-altona-community-foundation-inc/
https://www.communityfoundations.ca/cfc_locations/the-boissevain-and-morton-foundation-incorporated/
https://www.communityfoundations.ca/cfc_locations/the-carman-area-foundation-inc/
https://www.communityfoundations.ca/cfc_locations/the-cartwright-and-area-foundation-inc/
https://www.communityfoundations.ca/cfc_locations/the-glenboro-area-foundation-inc/
https://www.communityfoundations.ca/cfc_locations/the-interlake-community-foundation-inc/
https://www.communityfoundations.ca/cfc_locations/the-killarney-foundation-inc/
https://www.communityfoundations.ca/cfc_locatio

https://www.communityfoundations.ca/cfc_locations/battlefords-and-district-community-foundation-inc/
https://www.communityfoundations.ca/cfc_locations/family-friends-community-foundation-inc/
https://www.communityfoundations.ca/cfc_locations/prince-albert-and-area-community-foundation-inc/
https://www.communityfoundations.ca/cfc_locations/saskatoon-community-foundation/
https://www.communityfoundations.ca/cfc_locations/south-saskatchewan-community-foundation-inc/
https://www.communityfoundations.ca/cfc_locations/the-yukon-foundation/
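The request loop can be written as a small function so that the delay is easy to change and the logic can be tried without touching the website. This is a minimal sketch under assumptions: `fetch_all` is a hypothetical helper, and both `fetch` and `sleep` are passed in so the demo below runs offline with stand-ins.

```python
import time


def fetch_all(urls, fetch, delay_seconds=10, sleep=time.sleep):
    """Fetch each distinct URL once, pausing between consecutive requests."""
    pages = {}
    for url in urls:
        if url in pages:
            continue  # already fetched, do not request it twice
        if pages:
            sleep(delay_seconds)  # pause before every request after the first
        print(url)  # progress indicator, as in the cell above
        pages[url] = fetch(url)
    return pages


# Offline demo: a fake fetch function and a sleep that only records the pauses.
pauses = []
pages = fetch_all(
    ["https://example.org/a", "https://example.org/b", "https://example.org/a"],
    fetch=lambda url: f"<html>{url}</html>",
    sleep=pauses.append,
)
```

With the session created earlier, `fetch` could be `lambda url: session.get(url, headers=my_headers).text`.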


Below is an example of one of the containers' information that was scraped. 

In [81]:
#subresponse[10].text

And just to check we scraped 195 pages:

In [90]:
len(subresponse)

195
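Since the pages only need to be requested once, it may help to keep them on disk so that a restarted notebook does not have to scrape again. The following is a minimal sketch, assuming the page texts are held in a `{url: html_text}` dictionary; the helper names and the file name `cfc_pages.json` are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def save_pages(pages, path):
    """Write a {url: html_text} mapping to disk as JSON."""
    Path(path).write_text(json.dumps(pages), encoding="utf-8")


def load_pages(path):
    """Return the cached mapping, or an empty dict if nothing was saved yet."""
    cache = Path(path)
    if not cache.exists():
        return {}
    return json.loads(cache.read_text(encoding="utf-8"))


# Round trip in a temporary folder so the demo leaves no files behind.
with tempfile.TemporaryDirectory() as folder:
    cache_file = Path(folder) / "cfc_pages.json"
    before = load_pages(cache_file)
    save_pages({"https://example.org/a": "<html>a</html>"}, cache_file)
    after = load_pages(cache_file)

print(before, after)
```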

And before getting into the full script, here is a snippet showing how the genderize library is used. The library returns either "male" or "female" depending on the name, but we want to write "Mr." or "Ms." to the file, so a dictionary maps one to the other. This test case used the name "John", giving "Mr."

In [83]:
genderDict = {"male": 'Mr.',
              "female": 'Ms.'}
gen = Genderize().get(['John'])[0]['gender']
print(genderDict.get(gen, "None"))

Mr.
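Because the web service limits how many requests it accepts, it can help to look up each first name only once. Below is a minimal sketch of that idea: `make_title_lookup` is a hypothetical helper that wraps any lookup function with a dictionary cache, and `fake_lookup` stands in for the web service so the example runs offline.

```python
GENDER_TITLES = {"male": "Mr.", "female": "Ms."}


def make_title_lookup(lookup_gender):
    """Wrap lookup_gender so each first name is only looked up once."""
    cache = {}

    def title_for(first_name):
        key = first_name.strip().lower()
        if key not in cache:
            cache[key] = lookup_gender(first_name)
        # Unknown or missing genders map to an empty title.
        return GENDER_TITLES.get(cache[key], "")

    return title_for


# Stand-in for the web service; it records every call it receives.
calls = []


def fake_lookup(first_name):
    calls.append(first_name)
    return {"john": "male"}.get(first_name.lower())


title_for = make_title_lookup(fake_lookup)
titles = [title_for("John"), title_for("john"), title_for("Sam")]
print(titles, calls)
```

With the real library, the lookup passed in could be `lambda name: Genderize().get([name])[0]['gender']`, as in the cell above.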


The full script is below. (Pretty much just stitching together the parts discussed above.)

In [86]:
# Creating containers for the information that will be written to file.
organization = []
person = []
person_title = []
street = []
pobox = []
municipality = []
provinces = []
postalCode = []
phone = []
org_url = []
gender_title = []

# A dictionary so that we use the two letter abbreviation for the mailing addresses.
provincesDict = {"Alberta": 'AB', 
                 "British Columbia": 'BC',
                 "Manitoba": 'MB',
                 "New Brunswick": 'NB',
                 "Newfoundland and Labrador": "NL",
                 "Northwest Territories": 'NT',
                 "Nova Scotia": 'NS',
                 "Ontario": 'ON',
                 "Prince Edward Island": 'PE',
                 "Québec": "QC",
                 "Saskatchewan": 'SK',
                 "Yukon": 'YT',
                 "Nunavut": 'NU',
                 "AB": 'AB'
                }

genderDict = {"male": 'Mr.',
             "female": 'Ms.'}

html_soup = BeautifulSoup(response.text, 'html.parser')
info_containers_all = html_soup.find_all(["h2", "h3"], 
                                         class_=lambda x: x != 'hidden')
#print(type(info_containers_all))
#print(len(info_containers_all))

counter = 0

html_subsoup=[]

for lines in info_containers_all:
    if lines.name == 'h2': 
        province = lines.text
        #print ('In Province', lines.text)
        
    if lines.name == 'h3': 
        #print('Foundation: ', lines.text)
        foundation = lines.text
        organization.append(foundation)
                
        html_subsoup.append(BeautifulSoup(subresponse[counter].text, 
                                          'html.parser'))
        
        # Get Address
        c_location = html_subsoup[counter].find_all(
            'p', class_ = 'meta-line location')
 
        address_array = re.split(r' \| ', c_location[0].text)
        # If three pieces, it does not have P.O. Box
        #print("address_array length: ", len(address_array))
        for i in range(0,len(address_array)):
            address_array[i] = address_array[i].strip()
            #print(address_array[i])
        if len(address_array) == 3:
            municipality.append(address_array[1])
            provinces.append(provincesDict.get(province, "None"))
            postalCode.append(address_array[2])
            
            if "box" in address_array[0].lower():
                #print(address_array[0], " has Box")
                if "," in address_array[0]:
                    #print(address_array[0], " is not only a po box ")
                    # Split by comma
                    # Find which one has the box, assign accordingly
                    sub_address = address_array[0].split(',', 1)
                    for i in range(0,len(sub_address)):
                        sub_address[i] = sub_address[i].strip()
                        #print(sub_address[i])
                        if "box" in sub_address[i].lower():
                            pobox.append(sub_address[i])
                        else:
                            street.append(sub_address[i])
                else:
                    street.append('')
                    pobox.append(address_array[0])
                    
            else: 
                street.append(address_array[0])
                pobox.append('')
        else:
            print("Something went wrong with address for foundation: ",
                  foundation)
         
        # Get person
        c_contact = html_subsoup[counter].find_all(
                                      'p', class_='meta-line contact')
        if len(c_contact) > 0: 
            #Means name consists of 'name, position'
            nameArray = re.split(r', ', c_contact[0].text) 
            #Means name consists of 'name - position'
            if " - " in c_contact[0].text: 
                nameArray = re.split(r' - ', c_contact[0].text)

            for i in range(0,len(nameArray)):
                #print(len(nameArray))
                nameArray[i] = nameArray[i].strip()
                nameArray[i] = nameArray[i].strip(',')

            if len(nameArray) == 1:
                name = nameArray[0]
                person.append(name)
                person_title.append('')
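            elif len(nameArray) == 2:
                # Strip stray apostrophes before storing the name
                # (str.strip returns a new string, so keep the result).
                name = nameArray[0].strip('\'')
                person.append(name)
                person_title.append(nameArray[1])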
            else:
                print("Something went wrong with person's name for foundation: ", foundation)
                
            if len(nameArray) == 1 or len(nameArray) == 2:
                first_name = name.split(' ')
                if len(first_name) <= 3: 
                    gen = Genderize().get([first_name[0]])[0]['gender']
                    #print(gen)
                    gender_title.append(genderDict.get(gen, ""))
                    # In case it is easier to look for the word "None"
                    #print(genderDict.get(gen, "None")) 
                elif "." in c_contact[0].text:
                    gender_title.append(first_name[0])
                else: 
                    print("Something went wrong with person's gender for foundation: ", foundation)
            else:
                gender_title.append('')
                
        else:
            person.append('')      
            person_title.append('')
            gender_title.append('')
            
        # Get phone
        c_phone = html_subsoup[counter].find_all('p', 
                                            class_='meta-line phone')
        if len(c_phone) > 0: 
            phone.append(c_phone[0].text)
        else:
            phone.append('')
            
        # Get website
        c_org_url = html_subsoup[counter].find_all('p', 
                                            class_='meta-line link')
        if len(c_org_url) > 0: 
            org_url.append(c_org_url[0].text)
        else:
            org_url.append('')
                  
        #print("Counter: ", counter)
        #print("Currently at: ", foundation)
        counter += 1

# Making sure all the containers have equal length
print(len(gender_title))
print(len(organization))
print(len(person))
print(len(person_title))
print(len(street))
print(len(pobox))
print(len(municipality))
print(len(provinces))
print(len(postalCode))
print(len(phone))
print(len(org_url))


195
195
195
195
195
195
195
195
195
195
195
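Eleven parallel lists stay aligned only if every branch appends to each of them exactly once, which is why the lengths are checked above. One alternative, sketched here under assumptions, is to build a single dictionary per foundation so that a missing value becomes an empty string instead of a shifted column. The field names are a hypothetical subset, and the second record is made up.

```python
FIELDS = ("organization", "person", "person_title", "province")


def make_record(**values):
    """Return a dict with every field present; missing ones default to ''."""
    unknown = set(values) - set(FIELDS)
    if unknown:
        raise KeyError(f"unknown fields: {sorted(unknown)}")
    return {field: values.get(field, "") for field in FIELDS}


records = [
    make_record(
        organization="Banff Canmore Community Foundation",
        person="Rob Buffler",
        person_title="Executive Director",
        province="AB",
    ),
    make_record(organization="Example Foundation"),  # no contact listed
]

# Columns derived from the records always have the same length.
columns = {field: [record[field] for record in records] for field in FIELDS}
print({field: len(values) for field, values in columns.items()})
```

A list of dictionaries like `records` can also be passed directly to `pd.DataFrame`.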


Now let's write everything to a CSV file in the format requested by the organization.

In [88]:
# Put the info into frame
test_df = pd.DataFrame({'Organization': organization,
        'Title': gender_title,
        'Addressee (First Name, Last Name)': person,
        'Additional Info (Addressee Job Title, Dept, Etc.)': person_title,
        'Civic Address 1 (Apt/Suite #, Building #, Street Name)': street,
        'Civic Address 2 (PO Box #/RR #, or GD (General Delivery) and STN ID)': pobox,
        'Municipality': municipality,
        'Province or Territory': provinces, 
        'Postal Code': postalCode,
        'Phone': phone,
        'Website': org_url
        })
print(test_df.info())
test_df

cols = ['Organization',
        'Title',
        'Addressee (First Name, Last Name)',
        'Additional Info (Addressee Job Title, Dept, Etc.)',
        'Civic Address 1 (Apt/Suite #, Building #, Street Name)',
        'Civic Address 2 (PO Box #/RR #, or GD (General Delivery) and STN ID)',
        'Municipality',
        'Province or Territory', 
        'Postal Code',
        'Phone',
        'Website'
        ]

# Use pandas to write to csv
test_df.to_csv('data/cfcMailingAddresses.csv', 
               encoding='utf-8', index=False, columns=cols)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 195 entries, 0 to 194
Data columns (total 11 columns):
Additional Info (Addressee Job Title, Dept, Etc.)                       195 non-null object
Addressee (First Name, Last Name)                                       195 non-null object
Civic Address 1 (Apt/Suite #, Building #, Street Name)                  195 non-null object
Civic Address 2 (PO Box #/RR #, or GD (General Delivery) and STN ID)    195 non-null object
Municipality                                                            195 non-null object
Organization                                                            195 non-null object
Phone                                                                   195 non-null object
Postal Code                                                             195 non-null object
Province or Territory                                                   195 non-null object
Title                                                              

There is room for improvement. The code did not account for the French titles M or Mme, one address had the municipality in it twice, one name already had a "Ms." and we did not check for duplicated titles, and a few names were not identified as male or female. Overall, though, the code accomplished what was asked.
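As one possible direction for the title issues, the sketch below checks whether a name already starts with a title before any gender lookup is made. It is an illustration only: `split_honorific` and the list of titles are assumptions, the French sample name is made up, and a bare first initial "M." would be misread as the French title.

```python
# Titles recognised by this sketch, compared without the trailing period.
HONORIFICS = {"mr", "mrs", "ms", "dr", "m", "mme", "mlle"}


def split_honorific(name):
    """Return (title, rest_of_name); title is '' when none is present."""
    cleaned = name.strip()
    first, _, rest = cleaned.partition(" ")
    if rest and first.rstrip(".").lower() in HONORIFICS:
        return first, rest.strip()
    return "", cleaned


print(split_honorific("Ms. Jane Doe"))        # made-up name
print(split_honorific("Mme Marie Tremblay"))  # made-up name
print(split_honorific("Rob Buffler"))
```

When a title is found it can be written to the file as is, and the web service is only needed for the remaining names.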