### What is Web Scraping?
Web scraping is an automated method for extracting large amounts of data from websites. The data on websites is usually unstructured, and web scraping collects this unstructured data and stores it in a structured form. There are different ways to scrape websites, such as online services, APIs, or writing your own code. In this article, we’ll see how to implement web scraping with Python.

### Why is Web Scraping Used?
Web scraping is used to collect large amounts of information from websites. But why would someone need to collect so much data? To answer that, let’s look at some applications of web scraping:

- <b>Price Comparison:</b> Services such as ParseHub use web scraping to collect data from online shopping websites and use it to compare the prices of products.
- <b>Email address gathering:</b> Many companies that use email as a marketing medium use web scraping to collect email addresses and then send bulk emails.
- <b>Social Media Scraping:</b> Web scraping is used to collect data from social media websites such as Twitter to find out what’s trending.
- <b>Research and Development:</b> Web scraping is used to collect large sets of data (statistics, general information, temperature, etc.) from websites, which are then analyzed and used to carry out surveys or for R&D.
- <b>Job listings:</b> Details about job openings and interviews are collected from different websites and listed in one place so that they are easily accessible to the user.

### Web Crawling vs. Web Scraping
The terms web crawling and web scraping are often used interchangeably because both are about getting data from websites. However, they are different from each other, and the basic difference is clear from their definitions.

Web crawling is used to index the information on pages using bots, also known as crawlers. It is also called indexing. Web scraping, on the other hand, is an automated way of extracting specific information using bots, also known as scrapers. It is also called data extraction.
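The difference can be illustrated with a small sketch that uses only the standard library. The page content, the `LinkCollector` and `CodenameExtractor` class names, and the `codename` CSS class are all made up for this illustration. The first parser behaves like a crawler and only records where the page links to; the second behaves like a scraper and pulls out one specific field.

```python
from html.parser import HTMLParser

# Hypothetical page used only for illustration.
PAGE = """
<html><body>
  <h1>Android 1.5</h1>
  <p class="codename">Cupcake</p>
  <a href="/wiki/Android_Donut">Donut</a>
  <a href="/wiki/Android_Eclair">Eclair</a>
</body></html>
"""


class LinkCollector(HTMLParser):
    """Crawler-style pass: only record which pages this page links to."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


class CodenameExtractor(HTMLParser):
    """Scraper-style pass: extract one specific field from the page."""

    def __init__(self):
        super().__init__()
        self._in_codename = False
        self.codename = None

    def handle_starttag(self, tag, attrs):
        if tag == "p" and dict(attrs).get("class") == "codename":
            self._in_codename = True

    def handle_data(self, data):
        if self._in_codename:
            self.codename = data.strip()
            self._in_codename = False


crawler = LinkCollector()
crawler.feed(PAGE)
scraper = CodenameExtractor()
scraper.feed(PAGE)

print(crawler.links)     # links a crawler would visit next
print(scraper.codename)  # the single value a scraper was after
```

A real crawler would go on to fetch each collected link, while a real scraper would store the extracted value in a structured form such as a CSV row.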

### APIs: An Alternative to Web Scraping
Some website providers offer Application Programming Interfaces (APIs) that allow you to access their data in a predefined manner. With APIs, you can avoid parsing HTML and instead access the data directly using formats like JSON and XML. HTML is primarily a way to visually present content to users.

When you use an API, the process is generally more stable than gathering the data through web scraping. That’s because APIs are made to be consumed by programs, rather than by human eyes. If the design of a website changes, then it doesn’t mean that the structure of the API has changed.
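As a rough sketch of why API data is easier to consume, the snippet below parses a JSON response body with the standard `json` module. The payload and its field names (`versions`, `name`, `api_level`) are invented for this example; a real API defines its own structure.

```python
import json

# Hypothetical API response body; a real API defines its own fields.
payload = (
    '{"versions": ['
    '{"name": "Cupcake", "api_level": 3}, '
    '{"name": "Donut", "api_level": 4}]}'
)

# One call turns the text into Python dicts and lists; no HTML parsing needed.
data = json.loads(payload)
api_levels = {v["name"]: v["api_level"] for v in data["versions"]}
print(api_levels)
```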

However, APIs can change as well. The challenges of variety (every source is structured differently) and durability (structures change over time) apply to APIs just as they do to websites. Additionally, it’s much harder to inspect the structure of an API by yourself if the provided documentation is lacking in quality.

The approach and tools you need to gather information using APIs are outside the scope of this tutorial. To learn more about it, check out API Integration in Python.

### Data Scraping using Beautiful Soup
- Install Beautiful Soup with <b>pip install bs4</b>, then import it
- Make a GET request to fetch the page data
- Parse the HTML
- Filter the relevant parts

In [9]:
!pip install bs4



### Task 1: Scrape the Android Version History Wikipedia Page

In [16]:
from urllib.request import urlopen


In [17]:
android_url = "https://en.wikipedia.org/wiki/Android_version_history"

In [18]:
# Make a GET request to fetch the page data
android_data = urlopen(android_url)

In [19]:
print(type(android_data))

<class 'http.client.HTTPResponse'>


In [21]:
# Read the response body (the page's HTML, as bytes), then close the connection
android_html = android_data.read()
#print(android_html)
android_data.close()
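The open, read, and close steps above can also be written with a `with` block, which closes the response even if reading fails. The following is a minimal sketch under one loud assumption: to keep it runnable offline, `sample_url` is a `data:` URL carrying a tiny inline document instead of the Wikipedia address. The names `sample_url` and `sample_html` are illustrative only.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Stand-in for android_url so the sketch runs without network access:
# a data: URL that carries a small HTML document inline.
sample_url = "data:text/html;charset=utf-8," + quote(
    "<h1>Android version history</h1>"
)

# The with block closes the response automatically when the block ends.
with urlopen(sample_url) as response:
    # Use the charset announced by the response, falling back to UTF-8.
    charset = response.headers.get_content_charset() or "utf-8"
    sample_html = response.read().decode(charset)

print(sample_html)
```

Decoding to `str` is optional here, since Beautiful Soup also accepts bytes, but it makes the HTML easier to inspect by hand.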

### Parse the Page with Beautiful Soup

In [22]:
# Now parse the HTML page. Parsing is done with Beautiful Soup,
# which was installed above with pip and is imported here:
from bs4 import BeautifulSoup as soup

In [23]:
android_soup = soup(android_html, 'html.parser')

In [24]:
# print(android_soup)

In [25]:
print(type(android_soup))

<class 'bs4.BeautifulSoup'>


In [26]:
# To find the heading tag, use the .h1 attribute directly or the findAll('tag') method
print(android_soup.findAll('h1'))

[<h1 class="firstHeading" id="firstHeading" lang="en">Android version history</h1>]


In [27]:
# The same findAll('tag') call returns every <h2> heading
# print(android_soup.findAll('h2'))

In [28]:
# findAll() also accepts a list of tag names and returns
# every tag whose name appears in that list
# print(android_soup.findAll(['h1','h2']))
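The three lookups above can be tried on a small inline snippet, which makes the results easy to check by eye. This sketch assumes `bs4` is installed. The `snippet` and `demo_soup` names are made up for the example, and it uses `find_all`, the current spelling of the older `findAll` alias used elsewhere in this notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet standing in for the downloaded page.
snippet = """
<h1 id="firstHeading">Android version history</h1>
<h2>Overview</h2>
<h2>Version history</h2>
"""
demo_soup = BeautifulSoup(snippet, "html.parser")

first_h1 = demo_soup.h1                      # first <h1>, or None if absent
all_h2 = demo_soup.find_all("h2")            # list of every <h2>
headings = demo_soup.find_all(["h1", "h2"])  # any tag named in the list

print(first_h1.get_text())
print([h.get_text() for h in all_h2])
print([h.name for h in headings])
```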

In [29]:
# Find every table that has the 'wikitable' class
tables = android_soup.findAll('table', {'class':'wikitable'})
# print(len(tables))

In [31]:
android_table = tables[0]
#print(android_table)
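Indexing `tables[0]` raises an `IndexError` if the page layout changes and no matching table is found. A hedged sketch of a more defensive lookup follows, again on made-up inline HTML and assuming `bs4` is installed. `class_` is Beautiful Soup's keyword form of the `{'class': ...}` filter used above, and it also matches tags that carry several classes.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one unrelated table and one 'wikitable'.
snippet = """
<table class="infobox"><tr><td>sidebar</td></tr></table>
<table class="wikitable sortable"><tr><th>Name</th></tr></table>
"""
demo_soup = BeautifulSoup(snippet, "html.parser")

wikitables = demo_soup.find_all("table", class_="wikitable")
if not wikitables:
    # Fail with a clear message instead of an IndexError further down.
    raise ValueError("no table with class 'wikitable' found")

demo_table = wikitables[0]
print(demo_table["class"])  # the tag's classes, as a list
```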

### Extract Useful Information
- Remove undesired tags
- Extract table header & data

In [32]:
# Extract the table header cells
header = android_table.findAll('th')
print(len(header))

6


In [33]:
print(header[0].text)

Name



In [34]:
# Collect every header cell in a loop; [:-1] drops the trailing newline
column_title = [ct.text[:-1] for ct in header]
print(column_title)

['Name', 'Version number(s)', 'Initial stablerelease date', 'Supported (security fixes)', 'API level', 'References']
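Note that the third title comes out as 'Initial stablerelease date': `.text` joins the text on either side of a line break tag without a space, and `[:-1]` removes the last character whether or not it is a newline. One possible alternative, sketched here on a made-up header cell and assuming `bs4` is installed, is `get_text()` with a separator and `strip=True`.

```python
from bs4 import BeautifulSoup

# Hypothetical header cell with a <br/> inside, similar to the one above.
cell_html = "<table><tr><th>Initial stable<br/>release date\n</th></tr></table>"
cell = BeautifulSoup(cell_html, "html.parser").find("th")

# Approach used in this notebook: drop the last character of .text.
sliced = cell.text[:-1]

# Alternative: join the text pieces with a space and strip whitespace.
cleaned = cell.get_text(" ", strip=True)

print(repr(sliced))
print(repr(cleaned))
```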


In [35]:
# Now Extract Row Data 
row_data = android_table.findAll('tr')
# print(row_data)

In [36]:
# Skip the header row, then print the cells of the first data row
row_data = android_table.findAll('tr')[1:]
print(len(row_data))
first_row = row_data[0].findAll('td')
for d in first_row:
    print(d.text[:-1])

18
No official codename
1.0
September 23, 2008
No
1
[9]


In [37]:
# Now extract the second data row. It has only five cells because
# the Name cell of the first row also covers this row.
row_data = android_table.findAll('tr')[1:]
second_row = row_data[1].findAll('td')
for d in second_row:
    print(d.text[:-1])

1.1
February 9, 2009
No
2
[9][14]


In [38]:
# Now extract every column of every row
table_rows = []
for count, row in enumerate(row_data):
    current_row = []
    if count == 1:
        # The second row has no Name cell of its own (the first row's
        # cell covers it), so the value is filled in by hand.
        current_row.append("No official codename")
    cells = row.findAll('td')  # separate name: row_data is still being looped over
    for cell in cells:
        current_row.append(cell.text[:-1])
    table_rows.append(current_row)
print(table_rows)

[['No official codename', '1.0', 'September 23, 2008', 'No', '1', '[9]'], ['No official codename', '1.1', 'February 9, 2009', 'No', '2', '[9][14]'], ['Cupcake', '1.5', 'April 27, 2009', 'No', '3', '[15]'], ['Donut', '1.6', 'September 15, 2009', 'No', '4', '[16]'], ['Eclair', '2.0 – 2.1', 'October 26, 2009', 'No', '5 – 7', '[17]'], ['Froyo', '2.2 – 2.2.3', 'May 20, 2010', 'No', '8', '[18]'], ['Gingerbread', '2.3 – 2.3.7', 'December 6, 2010', 'No', '9 – 10', '[19]'], ['Honeycomb', '3.0 – 3.2.6', 'February 22, 2011', 'No', '11 – 13', '[20]'], ['Ice Cream Sandwich', '4.0 – 4.0.4', 'October 18, 2011', 'No', '14 – 15', '[21]'], ['Jelly Bean', '4.1 – 4.3.1', 'July 9, 2012', 'No', '16 – 18', '[22]'], ['KitKat', '4.4 – 4.4.4', 'October 31, 2013', 'No', '19 – 20', '[23]'], ['Lollipop', '5.0 – 5.1.1', 'November 12, 2014', 'No', '21 – 22', '[24]'], ['Marshmallow', '6.0 – 6.0.1', 'October 5, 2015', 'No', '23', '[25]'], ['Nougat', '7.0 – 7.1.2', 'August 22, 2016', 'No', '24 – 25', '[26][27][28][29]'
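The loop above fills in the missing Name cell by hand for one known row. A more general approach, sketched below under the assumption that the gap comes from a `rowspan` attribute, carries a spanning cell's text down into the rows it covers. The table markup and the `carried` bookkeeping dictionary are made up for this illustration, `bs4` is assumed to be installed, and `colspan` is not handled.

```python
from bs4 import BeautifulSoup

# Hypothetical table in which the first Name cell spans two rows.
snippet = """
<table>
<tr><th>Name</th><th>Version</th><th>API level</th></tr>
<tr><td rowspan="2">No official codename</td><td>1.0</td><td>1</td></tr>
<tr><td>1.1</td><td>2</td></tr>
<tr><td>Cupcake</td><td>1.5</td><td>3</td></tr>
</table>
"""
demo_table = BeautifulSoup(snippet, "html.parser").find("table")

rows = []
carried = {}  # column index -> [rows still covered, cell text]
for tr in demo_table.find_all("tr")[1:]:  # [1:] skips the header row
    cells = iter(tr.find_all("td"))
    row = []
    col = 0
    while True:
        if col in carried:
            # A cell from an earlier row spans into this one: reuse its text.
            carried[col][0] -= 1
            row.append(carried[col][1])
            if carried[col][0] == 0:
                del carried[col]
        else:
            td = next(cells, None)
            if td is None:
                break  # no more cells in this row
            text = td.get_text(" ", strip=True)
            span = int(td.get("rowspan", 1))
            if span > 1:
                # Remember the text for the following span - 1 rows.
                carried[col] = [span - 1, text]
            row.append(text)
        col += 1
    rows.append(row)

print(rows)
```

With this bookkeeping no row index needs to be hard-coded, so the same loop keeps working if another spanning cell appears further down the table.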

### Convert the Extracted Data to CSV Format

In [39]:
filename = 'Android_version_history.csv'
with open(filename, 'w', encoding='utf-8') as f:
    # Write the header
    f.write(','.join(column_title) + '\n')

    # Write each row, removing commas inside cells so they
    # are not mistaken for column separators
    for row in table_rows:
        f.write(','.join(w.replace(',', '') for w in row) + '\n')
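Removing commas by hand changes the data (for example, 'September 23, 2008' becomes 'September 23 2008'). An alternative is Python's built-in `csv` module, which quotes any field containing a comma. The sketch below uses made-up sample rows and an in-memory `io.StringIO` buffer in place of a real file; the names `sample_header`, `sample_rows`, and `buffer` are illustrative only.

```python
import csv
import io

# Hypothetical sample data shaped like column_title and table_rows.
sample_header = ["Name", "Initial stable release date"]
sample_rows = [
    ["Cupcake", "April 27, 2009"],
    ["Donut", "September 15, 2009"],
]

# In-memory stand-in for:
#   open(filename, "w", newline="", encoding="utf-8")
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(sample_header)
writer.writerows(sample_rows)
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back shows that the commas inside fields survived.
round_trip = list(csv.reader(io.StringIO(csv_text)))
print(round_trip[1])
```

When writing to a real file with `csv.writer`, the `csv` documentation recommends opening it with `newline=""` so that line endings are not translated twice.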

### Cleaning the Dataset
- Removing unwanted commas & symbols
- Removing undesired information

In [40]:
import pandas as pd

In [41]:
df = pd.read_csv('Android_version_history.csv')

In [42]:
df.head(n=10)

| | Name | Version number(s) | Initial stablerelease date | Supported (security fixes) | API level | References |
|---|---|---|---|---|---|---|
| 0 | No official codename | 1.0 | September 23 2008 | No | 1 | [9] |
| 1 | No official codename | 1.1 | February 9 2009 | No | 2 | [9][14] |
| 2 | Cupcake | 1.5 | April 27 2009 | No | 3 | [15] |
| 3 | Donut | 1.6 | September 15 2009 | No | 4 | [16] |
| 4 | Eclair | 2.0 – 2.1 | October 26 2009 | No | 5 – 7 | [17] |
| 5 | Froyo | 2.2 – 2.2.3 | May 20 2010 | No | 8 | [18] |
| 6 | Gingerbread | 2.3 – 2.3.7 | December 6 2010 | No | 9 – 10 | [19] |
| 7 | Honeycomb | 3.0 – 3.2.6 | February 22 2011 | No | 11 – 13 | [20] |
| 8 | Ice Cream Sandwich | 4.0 – 4.0.4 | October 18 2011 | No | 14 – 15 | [21] |
| 9 | Jelly Bean | 4.1 – 4.3.1 | July 9 2012 | No | 16 – 18 | [22] |
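The cleaning steps listed for this section could look roughly like the sketch below. It assumes `pandas` is installed and works on a small made-up CSV held in memory rather than on `Android_version_history.csv`. The choice to drop the References column and to parse the dates is an illustration of "undesired information" and "unwanted symbols", not something the notebook itself prescribes.

```python
import io

import pandas as pd

# Hypothetical two-row sample shaped like the scraped CSV.
sample_csv = io.StringIO(
    "Name,Initial stable release date,API level,References\n"
    "Cupcake,April 27 2009,3,[15]\n"
    "Donut,September 15 2009,4,[16]\n"
)
demo_df = pd.read_csv(sample_csv)

# Drop a column that carries only footnote markers.
cleaned = demo_df.drop(columns=["References"])

# Turn the date strings into real timestamps (month name, day, year).
cleaned["Initial stable release date"] = pd.to_datetime(
    cleaned["Initial stable release date"], format="%B %d %Y"
)

print(cleaned)
```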


### Accessing Data in the DataFrame

In [43]:
print(df.iloc[0, 1])  # row 0, column 1, selected by position

1.0


In [44]:
print(df.iloc[1, 2])  # row 1, column 2, selected by position

February 9 2009
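Besides positional access with `iloc`, values can be looked up by column label with `loc`, which tends to be easier to read and does not break if the column order changes. A small sketch on made-up data follows, assuming `pandas` is installed; `dtype=str` keeps version numbers such as 1.0 as text.

```python
import io

import pandas as pd

# Hypothetical sample shaped like the first two columns of the scraped CSV.
demo_df = pd.read_csv(
    io.StringIO(
        "Name,Version number(s)\n"
        "No official codename,1.0\n"
        "Cupcake,1.5\n"
    ),
    dtype=str,  # keep "1.0" as text instead of converting it to a float
)

by_position = demo_df.iloc[0, 1]                    # row 0, column 1
by_label = demo_df.loc[0, "Version number(s)"]      # row label 0, named column
cupcake_version = demo_df.loc[
    demo_df["Name"] == "Cupcake", "Version number(s)"
].iloc[0]                                           # filter rows, then take the first match

print(by_position, by_label, cupcake_version)
```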
