# Explore here

It's recommended to use this notebook for exploration purposes.

Some documentation to read:

* [beautifulsoup - navigating the tree](https://beautiful-soup-4.readthedocs.io/en/latest/#navigating-the-tree)
* [beautifulsoup - output](https://beautiful-soup-4.readthedocs.io/en/latest/#get-text)

In [None]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import time
# these are the imports we need for the rest of the notebook

In [None]:
url = "https://es.wikipedia.org/wiki/Leucocito"
time.sleep(2)  # short pause before the request, to be polite to the server
html_data = requests.get(url, timeout=10).text
html_data
# we have retrieved the Wikipedia page HTML and stored it in a variable


In [None]:
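soup = BeautifulSoup(html_data, "html.parser")  # name the parser explicitly to avoid the bs4 warning
html_table = soup.find("table", {"class": "wikitable"})
html_table
# we have found the first table with the "wikitable" class, which is the table we want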


In [None]:
for row_index, row in enumerate(html_table.tbody.find_all('tr')):
   print(row_index, row)

# note that the first row (index 0) contains the TH tags, which are the headers of our table
# in HTML:
# - the TH tag (not always used) usually marks header cells
# - the TD tag usually marks regular data cells

# according to the Beautiful Soup docs, there are three ways to get a tag's children (content):
# .children
# .contents
# .find_all(<tag-name>)

# we can now iterate over all the rows in the HTML table; next, let's extract each cell's text and put it in a list, so we can create a pandas DataFrame
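
A minimal sketch of how the three approaches differ, using a small hand-written table (the `sample_html` string is made up for illustration and is not the Wikipedia page). When the markup has line breaks between tags, `.contents` and `.children` also return the whitespace between them, while `find_all` returns only the matching tags.

```python
from bs4 import BeautifulSoup

# hypothetical sample markup, not taken from the Wikipedia page
sample_html = """<table><tr>
<td>a</td>
<td>b</td>
</tr></table>"""

sample_row = BeautifulSoup(sample_html, "html.parser").find("tr")

# .contents is a list of every direct child, including whitespace strings
contents_types = [type(node).__name__ for node in sample_row.contents]

# .children yields the same nodes lazily, as an iterator
children_count = len(list(sample_row.children))

# find_all returns only the tags that match the given name
td_texts = [cell.get_text(strip=True) for cell in sample_row.find_all("td")]

print(contents_types)
print(children_count)
print(td_texts)
```

Note that `.contents` and `.children` are attributes, so they are read without parentheses.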

In [None]:
arr_data = []

# iterating every tr tag of the table tbody
for row_index, row in enumerate(html_table.tbody.find_all('tr')):
   arr_cells_data = []
   if row_index == 0:

      # here we are inside a TR row with some TH cells, we need to iterate it and add its text to the result array
      for cell in row.find_all('th'):
         arr_cells_data.append(cell.get_text(strip=True))
   else:

      # here we are inside a TR row with some TD cells, we need to iterate it and add its text to the result array
      for cell in row.find_all('td'):
         arr_cells_data.append(cell.get_text(strip=True))
   
   arr_data.append(arr_cells_data)


arr_data

# we have now extracted all the text of the table and stored it in a two-dimensional list, shaped like a DataFrame
# we know the first row (row 0) holds the headers
# (this code could easily be made shorter with list comprehensions; the two branches above repeat almost the same logic)

# P.S.: while analysing this code I noticed that the '.children' and '.contents' attributes returned some whitespace-only nodes alongside the 'td' and 'th' tags
# the documentation calls these nodes NavigableString objects. Long story short, it is better to use find_all() and ask only for the tags we want
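
The shorter version mentioned above could look something like this sketch. It relies on `find_all` accepting a list of tag names, so header and data cells are handled by one expression. `sample_html` and `extract_rows` are hypothetical names introduced here, and the small sample table stands in for the real page.

```python
from bs4 import BeautifulSoup

# hypothetical sample markup standing in for the Wikipedia table
sample_html = """
<table class="wikitable"><tbody>
<tr><th>Name</th><th>Value</th></tr>
<tr><td>alpha</td><td> 1 </td></tr>
<tr><td>beta</td><td> 2 </td></tr>
</tbody></table>
"""

def extract_rows(table):
    """Return the stripped text of every th/td cell, one list per table row."""
    return [
        [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
        for row in table.find_all("tr")
    ]

sample_table = BeautifulSoup(sample_html, "html.parser").find("table", {"class": "wikitable"})
rows = extract_rows(sample_table)
print(rows)
```

Unlike the loop above, this also keeps any `th` cells that appear inside data rows, which may or may not be what you want for a given table.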

In [None]:
# now we create a pandas DataFrame from our list of data
# the first row (arr_data[0]) supplies the column names and the remaining rows (arr_data[1:]) supply the data

df = pd.DataFrame(arr_data[1:], columns=arr_data[0])
df
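
When the column names come from a list of rows, it matters whether they are passed as a single row or as a one-row slice. A minimal sketch with made-up data (`sample_data` is hypothetical) showing how pandas treats each form:

```python
import pandas as pd

# hypothetical data in the same shape as arr_data: header row first, then data rows
sample_data = [["Name", "Value"], ["alpha", "1"], ["beta", "2"]]

# sample_data[0] is a flat list of names, giving ordinary column labels
flat_df = pd.DataFrame(sample_data[1:], columns=sample_data[0])

# sample_data[:1] is a list containing one list, which pandas turns into a MultiIndex
nested_df = pd.DataFrame(sample_data[1:], columns=sample_data[:1])

print(type(flat_df.columns).__name__)
print(type(nested_df.columns).__name__)
```

With the flat form, a column can be selected as `flat_df["Name"]`, which is usually what is wanted for a simple table.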