# Web scraping practice, adapted from an intro to web scraping exercise

I was looking to map the list of countries represented in the Gainesville Language Exchange (part of my digital storytelling <a target="_blank" rel="noopener noreferrer" href="https://languages.fdelaguerra.com">languages project</a>). Since the list was very long, I thought it best to return a list of countries not represented instead. 

By comparing to a list of all countries from the US State Dept, I was able to figure out the non-represented countries. The logic behind this was to take the large list of represented countries and subtract it from a list of total countries, thus separating the non-represented countries (a much smaller number, and therefore less work for me when visualizing).

*In other words:*<br>
**data I could find - data I had = data I needed**, or<br>
`total countries` - `countries represented` = `countries not represented`
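
The subtraction above maps naturally onto a set difference in Python. The sketch below is only an illustration: the variable names and the handful of sample values are assumptions made for this example, not the real lists.

```python
# Hypothetical sample data; the real lists are much longer.
total_countries = {"Albania", "Brazil", "Chad", "Denmark", "Egypt"}
countries_represented = {"Brazil", "Denmark"}

# Set difference: everything in the full list that is not in the represented list.
countries_not_represented = total_countries - countries_represented

print(sorted(countries_not_represented))
```

A set difference only works when the names match exactly, which is why the spelling and formatting cleanup described in the results matters.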


## Importing libraries
Instructions from original exercise by Mindy McAdams:

> First we import two libraries: [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) is for scraping, and [Requests](http://docs.python-requests.org/en/master/user/quickstart/) is for making the HTTP request to the server where we want to scrape.

In [4]:
from bs4 import BeautifulSoup
import requests

After importing, I requested the target URL and parsed the response:

In [5]:
url = 'https://history.state.gov/countries/all'
html = requests.get(url)
page = BeautifulSoup(html.text, 'html.parser')

My goal was to scrape all of the countries from that one page and put them in a CSV file.

Using the "Inspect" tool in Chrome DevTools, I saw that each letter heading had a `<ul>` element whose `<li>`s were the countries starting with that letter.

To target the relevant `<ul>`s, I noted that the main content was split into two `<div>`s using Bootstrap's `col-md-6` class.

In [6]:
div = page.find_all('div', class_="col-md-6")

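# after getting the main content, I went into the first matching <div> and got all of its <ul>s
# (the break exits the loop after the first iteration, so only the first <div> is searched)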
for i in div:
    ul_list = i.find_all('ul')
    break
    
## --disregard - from a previous attempt to filter:
    ## if p.get_text() == 'An asterisk indicates former countries, previously recognized by the United States, that have been dissolved or superseded by other states.':
      ##  div = head.find_next('ul')
## --

    
# printing to check the results
print(ul_list)
    

[<ul>
<li><a href="/countries/afghanistan">Afghanistan</a></li>
<li><a href="/countries/albania">Albania</a></li>
<li><a href="/countries/algeria">Algeria</a></li>
<li><a href="/countries/andorra">Andorra</a></li>
<li><a href="/countries/angola">Angola</a></li>
<li><a href="/countries/antigua-barbuda">Antigua and Barbuda</a></li>
<li><a href="/countries/argentina">Argentina</a></li>
<li><a href="/countries/armenia">Armenia</a></li>
<li><a href="/countries/australia">Australia</a></li>
<li><a href="/countries/austria">Austria</a></li>
<li><a href="/countries/austrian-empire">Austrian Empire</a></li>
<li><a href="/countries/azerbaijan">Azerbaijan</a></li>
</ul>, <ul>
<li><a href="/countries/baden">Baden*</a></li>
<li><a href="/countries/bahamas">Bahamas, The</a></li>
<li><a href="/countries/bahrain">Bahrain</a></li>
<li><a href="/countries/bangladesh">Bangladesh</a></li>
<li><a href="/countries/barbados">Barbados</a></li>
<li><a href="/countries/bavaria">Bavaria*</a></li>
<li><a href="/c

In [8]:
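# after getting the <ul>s, I wanted to target each <a> nested within the <li> elements
# note: find_all_next('a') returns every <a> that comes after the first <ul> in the document,
# not only the ones inside it, and the break stops the loop after that first <ul>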
for i in ul_list:
    a_list = i.find_all_next('a')
    break
    
# printing to check results
print(a_list)

[<a href="/countries/afghanistan">Afghanistan</a>, <a href="/countries/albania">Albania</a>, <a href="/countries/algeria">Algeria</a>, <a href="/countries/andorra">Andorra</a>, <a href="/countries/angola">Angola</a>, <a href="/countries/antigua-barbuda">Antigua and Barbuda</a>, <a href="/countries/argentina">Argentina</a>, <a href="/countries/armenia">Armenia</a>, <a href="/countries/australia">Australia</a>, <a href="/countries/austria">Austria</a>, <a href="/countries/austrian-empire">Austrian Empire</a>, <a href="/countries/azerbaijan">Azerbaijan</a>, <a href="/countries/baden">Baden*</a>, <a href="/countries/bahamas">Bahamas, The</a>, <a href="/countries/bahrain">Bahrain</a>, <a href="/countries/bangladesh">Bangladesh</a>, <a href="/countries/barbados">Barbados</a>, <a href="/countries/bavaria">Bavaria*</a>, <a href="/countries/belarus">Belarus</a>, <a href="/countries/belgium">Belgium</a>, <a href="/countries/belize">Belize</a>, <a href="/countries/benin">Benin (Dahomey)</a>, <a h
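
Because `find_all_next('a')` keeps going to the end of the document, it can also pick up links that come after the country lists. The sketch below contrasts it with a search scoped to each `<ul>`. The HTML is a hypothetical, heavily simplified stand-in for the real page (the `<footer>` link in particular is invented for illustration), and the variable names are assumptions.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the page structure described above.
sample_html = """
<div class="col-md-6">
  <ul><li><a href="/countries/afghanistan">Afghanistan</a></li></ul>
  <ul><li><a href="/countries/baden">Baden*</a></li></ul>
</div>
<div class="col-md-6">
  <ul><li><a href="/countries/mexico">Mexico</a></li></ul>
</div>
<footer><a href="/about">About</a></footer>
"""
sample_page = BeautifulSoup(sample_html, "html.parser")

first_ul = sample_page.find("ul")
# find_all_next() continues past the lists, so the footer link is included.
following_links = [a.get_text() for a in first_ul.find_all_next("a")]

# Scoping the search to each column's <ul> elements keeps only the list links.
scoped_links = [
    a.get_text()
    for column in sample_page.find_all("div", class_="col-md-6")
    for ul in column.find_all("ul")
    for a in ul.find_all("a")
]

print(following_links)
print(scoped_links)
```

Whether the scoped version is enough on the real page depends on its actual markup, which this sample does not reproduce.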

In [9]:
# first attempt: get each country's inner text
# (clean_list is re-created inside the inner loop, so this prints a separate one-item list per link)
for i in a_list:
    countries = i.get_text()
    country_list = countries.splitlines()
    
    for i in country_list:
        clean_list = []
        clean_list.append(i)
    print(clean_list)
    
##for i in countries:
  ##  country_list.append(i)
    
##print(country_list)

['Afghanistan']
['Albania']
['Algeria']
['Andorra']
['Angola']
['Antigua and Barbuda']
['Argentina']
['Armenia']
['Australia']
['Austria']
['Austrian Empire']
['Azerbaijan']
['Baden*']
['Bahamas, The']
['Bahrain']
['Bangladesh']
['Barbados']
['Bavaria*']
['Belarus']
['Belgium']
['Belize']
['Benin (Dahomey)']
['Bolivia']
['Bosnia and Herzegovina']
['Botswana']
['Brazil']
['Brunei']
['Brunswick and Lüneburg']
['Bulgaria']
['Burkina Faso (Upper Volta)']
['Burma']
['Burundi']
['Cabo Verde']
['Cambodia']
['Cameroon']
['Canada']
['Cayman Islands, The']
['Central African Republic']
['Central American Federation*']
['Chad']
['Chile']
['China']
['Colombia']
['Comoros']
['Congo Free State, The']
['Costa Rica']
['Cote d’Ivoire (Ivory Coast)']
['Croatia']
['Cuba']
['Cyprus']
['Czechia']
['Czechoslovakia']
['Democratic Republic of the Congo']
['Denmark']
['Djibouti']
['Dominica']
['Dominican Republic']
['Duchy of Parma, The*']
['East Germany (German Democratic Republic)*']
['Ecuador']
['Egypt']
['E

In [10]:
# this returned lists of each country, but what I needed was a list of all countries together, so I added each to a new list
country_list = []

for i in a_list:
    country = i.get_text().splitlines()
    country_list.append(country)
print(country_list)



In [11]:
# after experimenting with a few approaches, I found that this one gets rid of the nested list of lists, making a clean, flat list
clean_list = []
for sublist in country_list:
    for i in sublist:
        clean_list.append(i)
        
print(clean_list)
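
The two steps above (collect one-item lists, then flatten them) can also be expressed more compactly. This is a minimal sketch using hypothetical sample strings in place of the text returned by `get_text()`; the variable names are assumptions.

```python
from itertools import chain

# Hypothetical stand-ins for the strings returned by get_text() on each <a>.
link_texts = ["Afghanistan", "Albania", "Bahamas, The"]

# splitlines() turns each string into a list, which is what produced the nesting.
nested = [text.splitlines() for text in link_texts]

# chain.from_iterable() flattens one level of nesting, like the double loop above.
flattened = list(chain.from_iterable(nested))

# When each link holds a single line of text, skipping splitlines() avoids the nesting entirely.
direct = [text.strip() for text in link_texts]

print(flattened)
print(direct)
```

Both results are the same flat list here; the second form only applies if no link text spans multiple lines.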



In [12]:
# there were a few extraneous results at the bottom, but they were easily edited in a csv after transferring the data
# here I imported a new library to make a csv file to hold the scraped data
import csv

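# the with statement closes the file automatically, even if writing fails
with open("new-country-list.csv", 'w', newline='', encoding='utf-8') as csvfile:
    c = csv.writer(csvfile)
    # writerow() writes the whole list as a single row
    c.writerow(clean_list)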
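
Since `writerow` places the whole list on one row, the data ends up laid out horizontally in the spreadsheet. One possible alternative, sketched here on the assumption that one country per row is the preferred layout, is to write each name as its own row. The sample list and the in-memory buffer are for illustration only.

```python
import csv
import io

# Hypothetical sample; in the notebook this would be clean_list.
sample_countries = ["Afghanistan", "Albania", "Bahamas, The"]

# An in-memory buffer stands in for the file opened with open(..., 'w', newline='').
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)

# writerows() expects an iterable of rows, so each name is wrapped in a one-item list.
writer.writerows([name] for name in sample_countries)

print(buffer.getvalue())
```

Names that contain a comma, such as `Bahamas, The`, are quoted by the writer so they stay in a single cell.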

In [13]:
# checked the location of the csv file.
pwd

'C:\\Users\\fdela\\Documents\\Grad School\\UF\\Classes\\Coding and Web Apps\\python\\python_work\\conda-python-intro-master'

## Results:
I now had a .csv file with the list of countries from the State Dept.

After transposing the data in Excel, I imported the represented country list (given to me by a Gainesville Language Exchange representative in Excel format) and used Excel to find cell matches.

I had to clean the data somewhat (the State Dept data included former states and territories, marked with an asterisk). I removed cells containing asterisks and corrected formatting for countries written as `some_country , The`. This helped identify missed matches.
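
The same cleanup could be scripted rather than done by hand in Excel. The sketch below is one possible approach, not what was actually used: the helper names `is_former_state` and `normalize_name` are hypothetical, the sample values are illustrative, and it assumes that the asterisk always ends the name and that the desired format is `The Bahamas`.

```python
# Hypothetical sample of scraped names.
scraped = ["Baden*", "Bahamas, The", "Bahrain", "Duchy of Parma, The*"]


def is_former_state(name: str) -> bool:
    """Assumption: former states are marked with a trailing asterisk."""
    return name.strip().endswith("*")


def normalize_name(name: str) -> str:
    """Rewrite 'Bahamas, The' as 'The Bahamas'; leave other names unchanged."""
    name = name.strip()
    suffix = ", The"
    if name.endswith(suffix):
        return "The " + name[: -len(suffix)]
    return name


# Drop the asterisked entries first, then normalize what remains.
cleaned = [normalize_name(name) for name in scraped if not is_former_state(name)]
print(cleaned)
```

The target format should match whatever spelling the represented-country list uses, otherwise the comparison will still miss matches.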

Finally, after filtering for the remaining countries (the non-matches), I colored every country on my SVG map in the accent color to mark it as represented, then used the original base color for the non-matches.

**Note:** There were also non-matches from the Gainesville Language Exchange list, but these were ignored because they included non-recognized territories, languages, or even fictional locations that would not be visualized on the map.
