## Extracting data from the web

With the growth of the internet, a huge amount of data now lives on the web in the form of websites. 
There are many ways to extract that data. APIs are probably the best way to extract data from a website. 
Most big websites, such as Twitter, Facebook, Amazon and the New York Times, provide APIs to access their data. But not all websites have an API. 
Some websites don't provide one because of privacy concerns, or because they lack the technical resources to build one. 

Web scraping is a technique of extracting information from websites. 
It focuses on the transformation of unstructured data (HTML format) on the web into structured data (database or spreadsheet).

Python has a rich ecosystem of libraries for scraping data from the web and is easy to use. 
The `BeautifulSoup` library assists with this task.

#### Libraries used

**`requests`**: 
This library is used for fetching data from web pages. 
[Click here for documentation](http://docs.python-requests.org/en/master/)

**`BeautifulSoup`**: 
Use this library to extract tables, lists, and paragraphs from HTML web pages. 
It also supports filters for selecting specific information from a page. 
[Click here for documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

In [None]:
#import the library to query a website
import requests

In [None]:
# specify the url
url = "https://en.wikipedia.org/wiki/List_of_World_Series_champions"

In [None]:
# Request the URL and store the server's response (a Response object) in the variable 'response'
response = requests.get(url)

In [None]:
# import Beautiful soup library to access functions to parse the data returned from the website
from bs4 import BeautifulSoup

The response we get from the web is typically HTML content. 
We can read the content of the server's response. 
Below, when a BeautifulSoup object is created from an HTML response, we explicitly pass the decoded text (`response.text`). 
Requests decodes the body using the encoding stored in `response.encoding`, which is 'UTF-8' for this page, as shown below. 
[Click here for documentation](http://docs.python-requests.org/en/master/user/quickstart/#response-content)

In [None]:
response.encoding
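
To see why the encoding matters, here is a minimal sketch that needs no network access. In Requests, `response.content` holds the raw bytes and `response.text` is those bytes decoded with `response.encoding`. The byte string below is a made-up stand-in for a response body, not data from the page we are scraping.

```python
# The same bytes read differently depending on the encoding used to decode them.
raw = "Peña".encode("utf-8")       # stand-in for the raw bytes in response.content

as_utf8 = raw.decode("utf-8")      # decoding with the correct encoding
as_latin1 = raw.decode("latin-1")  # decoding with the wrong encoding garbles the text

print(as_utf8)    # Peña
print(as_latin1)  # PeÃ±a
```

If the text of a page ever looks garbled like the second line, checking `response.encoding` is a reasonable first step.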

In [None]:
response
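
Displaying `response` shows the HTTP status code of the request. Before parsing, it can be worth failing early when the request did not succeed. The helper below is a sketch only: `fetch_html` is a hypothetical name introduced for illustration, and the 10-second timeout is an arbitrary assumption.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the decoded HTML at `url`, raising an error if the request fails."""
    response = requests.get(url, timeout=timeout)  # give up after `timeout` seconds
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    return response.text


# Example usage (needs network access):
# html = fetch_html("https://en.wikipedia.org/wiki/List_of_World_Series_champions")
```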

In [None]:
# Parse the html in the 'response' variable, and store it in Beautiful Soup format
soup = BeautifulSoup(response.text, "lxml")

Use the `prettify()` method to print the data in a nested, indented HTML structure.

In [None]:
print(soup.prettify())
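
To see what `prettify()` changes, here is a small sketch on a made-up snippet rather than the full page. It uses Python's built-in `html.parser` so that it runs without `lxml`; the variable names are assumptions for illustration.

```python
from bs4 import BeautifulSoup

demo = BeautifulSoup("<ul><li>One</li><li>Two</li></ul>", "html.parser")

compact = str(demo)       # the markup on a single line
pretty = demo.prettify()  # the same markup, one tag per line and indented by depth

print(compact)
print(pretty)
```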

We need to extract the table that lists all baseball World Series champions. This table sits inside one of the HTML tags, so we work with the tags to extract the data they contain. **`soup.tag`** returns the first element with that tag name, including its opening and closing tags and the content between them. 

In [None]:
soup.title

In [None]:
# Return string within given tag 
soup.title.string
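
The difference between a tag and its string is easier to see on a tiny document. This is a sketch on made-up HTML, parsed with the built-in `html.parser`; `demo` is just an illustrative name.

```python
from bs4 import BeautifulSoup

html = "<html><head><title>Demo page</title></head><body><p>First</p><p>Second</p></body></html>"
demo = BeautifulSoup(html, "html.parser")

print(demo.title)         # <title>Demo page</title>
print(demo.title.string)  # Demo page
print(demo.p)             # <p>First</p>  (attribute access returns only the first match)
```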

**Identify the HTML tag**: The data is in a table. To identify the tag that holds the data, right-click on the page and choose the Inspect Element option. 

 * [Additional guide on webpage inspection](../../../datasets/AnalyzingHTMLwithTheWebInspector.pdf)


<img src="../images/table.png">

**Find the right table:** Since we want the table that holds information about baseball champions, we first need to identify the right one. Let’s write the command to extract all `table` tags. 

In [None]:
all_tables=soup.find_all('table')
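
One way to go through the result of `find_all` is to print the `class` attribute of each table. The sketch below does this on made-up markup, so the class names shown are placeholders, not the ones on the Wikipedia page.

```python
from bs4 import BeautifulSoup

html = """
<table class="infobox"><tr><td>side box</td></tr></table>
<table class="wikitable sortable"><tr><td>data</td></tr></table>
<table><tr><td>no class</td></tr></table>
"""
demo = BeautifulSoup(html, "html.parser")

for index, table in enumerate(demo.find_all("table")):
    # .get returns None when the attribute is missing; class comes back as a list of names
    print(index, table.get("class"))

class_lists = [table.get("class") for table in demo.find_all("table")]
```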

Now, to identify the right table, we filter on the table's `class` attribute. In Chrome, you can find the class name by right-clicking the required table on the web page → Inspect Element → copying the class name, or by going through the output of the command above to find the class name of the right table.

In [None]:
right_table=soup.find('table', class_='wikitable sortable plainrowheaders')
right_table
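
Passing several class names as one string to `class_` matches the exact value of the `class` attribute, so the lookup returns `None` if the page adds or reorders classes. A CSS selector is a possible alternative that only requires each listed class to be present. The sketch below illustrates the difference on made-up markup with a hypothetical extra class named `extra`.

```python
from bs4 import BeautifulSoup

# Made-up markup: the table carries one class beyond the three we search for.
html = '<table class="wikitable sortable plainrowheaders extra"><tr><td>x</td></tr></table>'
demo = BeautifulSoup(html, "html.parser")

exact = demo.find("table", class_="wikitable sortable plainrowheaders")
by_selector = demo.select_one("table.wikitable.sortable.plainrowheaders")

print(exact is None)            # True: the full class string no longer matches exactly
print(by_selector is not None)  # True: every listed class is present on the table
```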

In [None]:
#Generate lists
Year=[]
Winning_team=[]
Winning_Manager=[]
Games=[]
Losing_team=[]
Losing_Manager=[]
Ref=[]

# skip the first row, as we don't need the headers
for row in right_table.find_all("tr")[1:]:
    game_year = row.find_all('th')  # the game year is stored in a <th> tag
    cells = row.find_all('td')  # all other details are stored in <td> tags
    if len(cells) >= 6:  # only extract rows that contain all six data cells
        Year.append(game_year[0].find(string=True))
        Winning_team.append(cells[0].find(string=True))
        Winning_Manager.append(cells[1].find(string=True))
        Games.append(cells[2].find(string=True))
        Losing_team.append(cells[3].find(string=True))
        Losing_Manager.append(cells[4].find(string=True))
        Ref.append(cells[5].find(string=True))
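
The loop above keeps only the first text node of each cell. When a cell mixes text with nested tags, `get_text(strip=True)` is an alternative that collects all of the cell's text and trims whitespace. The sketch below compares the two on a made-up table; the teams and scores are placeholders, not real results.

```python
from bs4 import BeautifulSoup

# Placeholder rows, not real World Series results.
html = """
<table>
  <tr><th>Year</th><th>Winner</th><th>Games</th></tr>
  <tr><th><a href="#">2001</a></th><td>Team A<sup>[1]</sup></td><td>4–3</td></tr>
  <tr><th><a href="#">2002</a></th><td> Team B </td><td>4–0</td></tr>
</table>
"""
demo = BeautifulSoup(html, "html.parser")

rows = []
for row in demo.find_all("tr")[1:]:     # skip the header row
    cells = row.find_all(["th", "td"])  # header and data cells in document order
    rows.append([cell.get_text(strip=True) for cell in cells])

# For comparison: the first text node of the first data cell only.
first_winner_cell = demo.find_all("tr")[1].find("td")
first_text_node = first_winner_cell.find(string=True)

print(rows)             # all text in each cell, footnote marker included
print(first_text_node)  # Team A
```

Which behaviour is preferable depends on the page: the first text node drops footnote markers, while `get_text` keeps everything inside the cell.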

**Extract the information to a DataFrame:**
In the loop above, we iterated through each row (`tr`), took each cell of the row (`td`), and appended its text to a list. We now combine those lists into a pandas DataFrame.

In [None]:
#import pandas to convert list to data frame
import pandas as pd
df=pd.DataFrame(Year,columns=['Year'])
df['Winning_team']=Winning_team
df['Winning_Manager']=Winning_Manager
df['Games']=Games
df['Losing_team']=Losing_team
df['Losing_Manager']=Losing_Manager
df['Ref']=Ref
df
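
A DataFrame can also be built in a single step from a dictionary that maps column names to lists. The sketch below uses short placeholder lists in place of the scraped ones, and converts the year column to numbers, since scraped values arrive as strings. The names `years`, `winners`, `games` and `demo_df` are assumptions for illustration.

```python
import pandas as pd

# Placeholder lists standing in for the ones built by the scraping loop.
years = ["2001", "2002"]
winners = ["Team A", "Team B"]
games = ["4–3", "4–0"]

demo_df = pd.DataFrame({"Year": years, "Winning_team": winners, "Games": games})
demo_df["Year"] = pd.to_numeric(demo_df["Year"])  # turn the year strings into integers

print(demo_df.dtypes)
```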