# Bootcamp B5: Scraping small activity

Make sure that you **all** self-enrol in the corresponding group (Bootcamp5 *X*) via Canvas (Canvas group number *X*). <br>
Add all names below. <br>
Only one team member should submit the file (by the end of the day) to claim points for the participation bonus. <br>
All team members get the same number of points unless otherwise communicated to Jerry.

**Canvas group number**: 27

**Collaborators**: Imanol Mugarza, Youssef Ben Massour, Claudia Sanchez 

## FIFA World Cup 
We would like to investigate the history of football and find out how many times countries met each other in FIFA World Cup finals. Let's start by fetching the table from Wikipedia! We are interested in the "Winners" and "Runners-up" of each championship.

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
#this time we use the "requests" library from PyPI, which has the typical GET and POST functions
url = "https://en.wikipedia.org/wiki/List_of_FIFA_World_Cup_finals"
req = requests.get(url)
print(f"Request terminated with status code {req.status_code}")
print(f"Response encoded with {req.encoding}")
# as before, we can add the HTML to our soup
fifa_soup = BeautifulSoup(req.text, 'html.parser')

Request terminated with status code 200
Response encoded with UTF-8


In [2]:
fifa_soup

<!DOCTYPE html>

<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-language-alert-in-sidebar-enabled vector-feature-sticky-header-disabled vector-feature-page-tools-disabled vector-feature-page-tools-pinned-disabled vector-feature-main-menu-pinned-disabled vector-feature-limited-width-enabled vector-feature-limited-width-content-enabled" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>List of FIFA World Cup finals - Wikipedia</title>
<script>document.documentElement.className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-language-alert-in-sidebar-enabled vector-feature-sticky-header-disabled vector-feature-page-tools-disabled vector-feature-page-tools-pinned-disabled vector-feature-main-menu-pinned-disabled vector-feature-limited-width-enabled vector-feature-limited-width-content-enabled";(function(){var cookie=docume

### Task 1 Turning HTML into a DataFrame
Here is where you step in for real. The goal is to create a DataFrame from the table at index 3, which contains the Winners and Runners-up.

1st step: get all the tables and select the correct one <br>
2nd step: convert that table to a DataFrame (*Hint*: use indexing and pandas' HTML reading capability)

In [23]:
# we can retrieve all tables, you have already seen how to do that!
tables = fifa_soup("table") # your code goes here

# and here we need to look at the table captions to find the id of the table that interests us
[{'id':i, 'caption':table("caption")} for i, table in enumerate(tables)]

[{'id': 0,
  'caption': [<caption class="infobox-title">FIFA World Cup final</caption>]},
 {'id': 1, 'caption': []},
 {'id': 2,
  'caption': [<caption>Key to the list
   </caption>]},
 {'id': 3,
  'caption': [<caption>List of finals of the FIFA World Cup
   </caption>]},
 {'id': 4,
  'caption': [<caption>Results by nation
   </caption>]},
 {'id': 5,
  'caption': [<caption>Results by confederation
   </caption>]},
 {'id': 6, 'caption': []},
 {'id': 7, 'caption': []}]
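
Relying on a fixed position can break if the page gains or loses a table. As a hedged alternative, the sketch below picks a table by its caption text instead. The HTML snippet and the helper name `find_table_by_caption` are made up for illustration; only the caption strings are copied from the listing above.

```python
from bs4 import BeautifulSoup

# Tiny stand-in for the real page (hypothetical markup, captions copied from the listing above)
sample_html = """
<table><caption>Key to the list</caption><tr><td>a.e.t.</td></tr></table>
<table><caption>List of finals of the FIFA World Cup</caption>
  <tr><th>Year</th><th>Winners</th></tr>
</table>
<table><tr><td>no caption here</td></tr></table>
"""

def find_table_by_caption(soup, text):
    """Return (index, table) of the first table whose caption contains `text`, else (None, None)."""
    for i, table in enumerate(soup("table")):
        caption = table.find("caption")
        if caption is not None and text in caption.get_text():
            return i, table
    return None, None

sample_soup = BeautifulSoup(sample_html, "html.parser")
index, finals_table = find_table_by_caption(sample_soup, "List of finals")
print(index)
```

On the real page the same helper could be called with `fifa_soup`, assuming the caption wording stays as shown in the output above.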

In [24]:
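# turning HTML into a DataFrame
# pd.read_html needs an HTML parser such as lxml (e.g. `pip install lxml`)
from io import StringIO

# index 3 is the "List of finals of the FIFA World Cup" table (see the captions above)
df_fifa = pd.read_html(StringIO(str(tables[3])), header=0)[0]
df_fifa.head(15)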

ImportError: lxml not found, please install it

### Task 2 Cleaning the table

What do you think about how the table looks?

It's very nice how soup+pandas handled "Editions not played..." and preserved the table structure. But there is some extra mess in the form of Unnamed columns, so we will need to do some cleaning.

Would you resolve all those issues? Whenever you approach a new dataset, always start with your research question. Here we are just interested in pairs of countries and we want to count how many times each pair occurs. So we actually do not need to clean that much! Let's start by listing all the pairs; we will worry about counting later. Report your result as a DataFrame.

In [20]:
# your code goes here
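
One possible way to approach this, sketched on a small hand-made DataFrame rather than the scraped one: keep only the two columns the research question needs and drop incomplete rows. The column names `Winners` and `Runners-up` follow the `df2pairs` call further down; the `Unnamed: 2` column, the empty last row, and the `toy` variable are assumptions for illustration, and the real table may need additional cleaning.

```python
import pandas as pd

# Hand-made stand-in for the scraped table (not the real scrape)
toy = pd.DataFrame({
    "Year": [1986, 1990, 2014, 2018, None],
    "Winners": ["Argentina", "West Germany", "Germany", "France", None],
    "Unnamed: 2": [None, None, None, None, None],  # hypothetical junk column
    "Runners-up": ["West Germany", "Argentina", "Argentina", "Croatia", None],
})

# Keep only what the research question needs, then drop incomplete rows
df_pairs = toy[["Winners", "Runners-up"]].dropna().reset_index(drop=True)
print(df_pairs)
```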

### Task 3 Getting pairs of countries

This can be tricky but bear with me, I do most of the work for you so please hang on :) 

Our DataFrame's records now show which two countries met in each championship. But we would like to know how many times they met over the years! Sadly we cannot simply count identical records, because the same two countries can appear in either order (winner first or runner-up first), and a plain comparison treats those as different pairs. There are many ways to deal with this. For example, you could one-hot-encode the names of the participating countries. Or you could write a pandas function. I tried the latter, failed, and resorted to fundamental Python functionality: Python's `set` type https://realpython.com/python-sets/.

So why {} sets instead of [] lists?
Because of how Python compares instances of those data types:

    ['Argentina','West Germany']==['West Germany','Argentina']
    
is False because order in lists matters! But:

    {'Argentina','West Germany'}=={'West Germany','Argentina'}
    
is True because order in sets does not matter :) We can use this to count pairs!

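(note that sets and dictionaries look alike because both use curly braces, but an empty {} is a dictionary; use set() for an empty set. Dictionary keys also behave much like a set: they are unique and must be hashable!)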
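
To make the comparison concrete, here is a small sketch you can run in a scratch cell. It also shows `frozenset`, the immutable and hashable variant of a set, which is not used elsewhere in this notebook and is introduced here only as an option for counting; the variable names are made up for illustration.

```python
pair_a = {"Argentina", "West Germany"}
pair_b = {"West Germany", "Argentina"}

print(pair_a == pair_b)       # sets ignore order
print(type({}).__name__)      # empty braces create a dictionary
print(type(set()).__name__)   # this is how to get an empty set

# A plain set cannot be a dictionary key, but a frozenset can
meetings = {frozenset(pair_a): 1}
meetings[frozenset(pair_b)] += 1  # same key, because the order does not matter
print(meetings[frozenset(pair_a)])
```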

In [21]:
# you might need to tweak names of your dataframe and its columns in the df2pairs function call
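def df2pairs(df, col1, col2):
    """Return one unordered pair (a set) of the values in col1 and col2 per row."""
    pairs = []
    for index in df.index:
        element1, element2 = (df.loc[index, col1], df.loc[index, col2])
        # a set ignores order, so {A, B} == {B, A}
        pairs.append({element1, element2})
    return pairs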

pairs = df2pairs(df_fifa, 'Winners', 'Runners-up')

for pair in pairs:
    print(f"Countries {pair} met {pairs.count(pair)} times.")

NameError: name 'df_fifa' is not defined

Now you step in again, converting the number of meetings into a DataFrame. We want to know how many times each pair of countries met, and we do not want any duplicate records!

In [22]:
# your code goes here
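
A minimal sketch of one way to do this, assuming the pairs come as a list of sets like the one `df2pairs` returns. The example pairs are hand-made, and the names `example_pairs`, `df_meetings`, `Pair` and `Meetings` are illustrative choices, not part of the exercise. Because plain sets are not hashable, each pair is converted to a `frozenset` so that `collections.Counter` can count it.

```python
from collections import Counter
import pandas as pd

# Hand-made example pairs, in the same shape that df2pairs returns (a list of sets)
example_pairs = [
    {"Argentina", "West Germany"},
    {"West Germany", "Argentina"},
    {"Germany", "Argentina"},
    {"France", "Croatia"},
]

# Sets are not hashable, so convert each to a frozenset before counting
counts = Counter(frozenset(pair) for pair in example_pairs)

# One row per distinct pair, most frequent pair first
df_meetings = pd.DataFrame(
    [
        {"Pair": " vs ".join(sorted(pair)), "Meetings": n}
        for pair, n in counts.items()
    ]
).sort_values("Meetings", ascending=False, ignore_index=True)
print(df_meetings)
```

Sorting the names inside each pair before joining gives every pair a single, stable label, so no duplicate records appear.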

### Task 4 Thinking about the data...
Would you combine "Germany" and "West Germany" into one entity? Why? Why not?

If you want to (and have time), re-run the code above after merging those two into one. This last step is not mandatory.

**Your answer goes here**

In [None]:
#your code goes here
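
If you decide to merge the two, one possible sketch is to map "West Germany" to "Germany" before building the pairs. Whether that merge is appropriate is exactly the question above; the code only shows the mechanics. The toy data and the `name_map` dictionary are assumptions for illustration.

```python
import pandas as pd

# Hand-made stand-in for the cleaned table
toy = pd.DataFrame({
    "Winners": ["Argentina", "West Germany", "Germany"],
    "Runners-up": ["West Germany", "Argentina", "Argentina"],
})

name_map = {"West Germany": "Germany"}  # hypothetical mapping, extend as needed
merged = toy.replace(name_map)

# Each row becomes an order-independent, hashable pair
merged_pairs = [frozenset(row) for row in merged.itertuples(index=False)]
print(len(set(merged_pairs)))  # number of distinct pairs after merging
```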