# Web Scraping
### Web scraping, also known as web data extraction, is the process of retrieving or “scraping” data from a website. Unlike manual data extraction, web scraping uses automation to retrieve hundreds, millions, or even billions of data points from the web.

# Importing Libraries

In [16]:
import pandas as pd
import requests

#urllib is part of the Python standard library, so no install is needed
from urllib.request import urlopen

#pip install beautifulsoup4 lxml  (lxml provides the 'lxml' parser used in this notebook)
from bs4 import BeautifulSoup

# Reading URL into our notebook

###### urllib is Python's URL handling package. Its urllib.request module provides the urlopen function, which fetches URLs (Uniform Resource Locators) over several protocols, such as HTTP and HTTPS.

In [17]:
url = "https://en.wikipedia.org/wiki/Data_science"

#Opening the URL with urllib; urlopen returns a file-like response object
html = urlopen(url)
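
###### The call above assumes that the request succeeds. A minimal sketch of a more defensive version is shown here. The helper name fetch_html and the 10-second timeout are illustrative assumptions, not part of the original notebook.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch_html(url, timeout=10):
    """Return the body of `url` as text, or None if the request fails."""
    try:
        # The with-block closes the connection once the body has been read.
        with urlopen(url, timeout=timeout) as response:
            # Use the charset declared in the headers, falling back to UTF-8.
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset, errors="replace")
    except HTTPError as err:
        # The server answered, but with an error status such as 404 or 503.
        print(f"HTTP error {err.code} for {url}")
    except URLError as err:
        # The request could not be completed (DNS failure, unknown scheme, ...).
        print(f"Could not reach {url}: {err.reason}")
    return None


# Example call (requires network access):
# html_text = fetch_html("https://en.wikipedia.org/wiki/Data_science")
```

###### HTTPError is caught before URLError because it is a subclass of URLError. The returned string can be passed to BeautifulSoup in the same way as the response object.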

###### Beautiful Soup is a Python package for parsing HTML and XML documents, including documents with malformed markup such as unclosed tags (it is named after "tag soup"). It creates a parse tree for each parsed page that can be used to extract data from the HTML, which makes it useful for web scraping.

###### We create a BeautifulSoup object by passing two arguments:
###### html: the raw HTML content (here, the response object returned by urlopen).
###### 'lxml': the name of the HTML parser we want to use.

In [18]:
#Creating parse tree for the html page
soup = BeautifulSoup(html, 'lxml')
type(soup)

bs4.BeautifulSoup
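
###### To see how Beautiful Soup copes with malformed markup without touching the network, the sketch below parses a small inline snippet. The snippet, the variable names, and the choice of the built-in 'html.parser' are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# A deliberately sloppy snippet: the <p>, <body>, and <html> tags are never closed.
html_doc = """
<html><head><title>Demo page</title></head>
<body>
<p>First paragraph with <a href="/wiki/A">link A</a>
<p>Second paragraph with <a href="/wiki/B">link B</a>
"""

# 'html.parser' ships with Python, so no extra install is required.
demo_soup = BeautifulSoup(html_doc, "html.parser")

# The parse tree is still navigable despite the missing closing tags.
page_title = demo_soup.title.string
hrefs = [a.get("href") for a in demo_soup.find_all("a")]

print(page_title)  # Demo page
print(hrefs)       # ['/wiki/A', '/wiki/B']
```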

# Implementing soup functionalities

In [19]:
# Get the title of the page
title = soup.title
print(title)

<title>Data science - Wikipedia</title>


###### prettify() returns a neatly indented string representation of the parse tree created from the raw HTML content.

In [20]:
#Skipped here to avoid a very long printout
#print(soup.prettify())
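
###### On a small document the output of prettify() stays readable. The sketch below uses a made-up one-line snippet (an assumption for illustration) rather than the full Wikipedia page.

```python
from bs4 import BeautifulSoup

snippet = "<html><head><title>Hi</title></head><body><p>Text</p></body></html>"
small_soup = BeautifulSoup(snippet, "html.parser")

# prettify() returns a string in which tags and text sit on separate lines,
# indented according to their nesting depth.
pretty = small_soup.prettify()
print(pretty)
```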

In [21]:
#Finding all anchor tags and printing first 20
soup.find_all('a')[:20]

[<a id="top"></a>,
 <a class="mw-jump-link" href="#mw-head">Jump to navigation</a>,
 <a class="mw-jump-link" href="#p-search">Jump to search</a>,
 <a href="/wiki/Information_science" title="Information science">information science</a>,
 <a href="/wiki/Machine_learning" title="Machine learning">Machine learning</a>,
 <a href="/wiki/Data_mining" title="Data mining">data mining</a>,
 <a class="image" href="/wiki/File:Kernel_Machine.svg"><img alt="Kernel Machine.svg" data-file-height="233" data-file-width="512" decoding="async" height="100" src="//upload.wikimedia.org/wikipedia/commons/thumb/f/fe/Kernel_Machine.svg/220px-Kernel_Machine.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/f/fe/Kernel_Machine.svg/330px-Kernel_Machine.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/f/fe/Kernel_Machine.svg/440px-Kernel_Machine.svg.png 2x" width="220"/></a>,
 <a href="/wiki/Statistical_classification" title="Statistical classification">Classification</a>,
 <a href="/wiki

In [22]:
#Printing the href attribute of the first 20 anchor tags
all_links = soup.find_all("a")
for link in all_links[:20]:
    print(link.get("href"))

None
#mw-head
#p-search
/wiki/Information_science
/wiki/Machine_learning
/wiki/Data_mining
/wiki/File:Kernel_Machine.svg
/wiki/Statistical_classification
/wiki/Cluster_analysis
/wiki/Regression_analysis
/wiki/Anomaly_detection
/wiki/Automated_machine_learning
/wiki/Association_rule_learning
/wiki/Reinforcement_learning
/wiki/Structured_prediction
/wiki/Feature_engineering
/wiki/Feature_learning
/wiki/Online_machine_learning
/wiki/Semi-supervised_learning
/wiki/Unsupervised_learning
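
###### Most of the values printed above are relative paths, one is None, and some are in-page fragments. A minimal sketch of turning them into absolute URLs with the standard library's urljoin follows. The list raw_hrefs is a hand-copied sample of the output above, and the filtering rule is an illustrative assumption.

```python
from urllib.parse import urljoin

base_url = "https://en.wikipedia.org/wiki/Data_science"

# A few values copied from the printed output, including the None entry.
raw_hrefs = [None, "#mw-head", "/wiki/Information_science", "/wiki/Machine_learning"]

absolute_links = [
    urljoin(base_url, href)
    for href in raw_hrefs
    # Skip anchors without an href and links that only jump within the page.
    if href and not href.startswith("#")
]

print(absolute_links)
# ['https://en.wikipedia.org/wiki/Information_science',
#  'https://en.wikipedia.org/wiki/Machine_learning']
```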


# Another way to read URLs into our notebook

###### When a client makes a request to a URI, the server returns a response. The requests library provides built-in functionality for managing both the request and the response.

In [23]:
url = 'http://web.mta.info/developers/data/nyct/turnstile/turnstile_200516.txt'

#Sending a GET request to the URL using requests
resp = requests.get(url)
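
###### requests.get does not raise an exception for error statuses such as 404, and it waits indefinitely unless a timeout is given. A hedged sketch of a small wrapper follows. The name download_text and the 30-second timeout are assumptions chosen for illustration.

```python
import requests


def download_text(url, timeout=30):
    """Return the response body as text, raising on HTTP error statuses."""
    resp = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses.
    resp.raise_for_status()
    return resp.text


# Example call (requires network access):
# text = download_text("http://web.mta.info/developers/data/nyct/turnstile/turnstile_200516.txt")
```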

###### We create a BeautifulSoup object by passing two arguments:
###### resp.text: the response body decoded to a string (for this URL it is comma-separated text rather than HTML).
###### 'html.parser': Python's built-in HTML parser.

In [24]:
#Creating the parse tree for new link
soup_data = BeautifulSoup(resp.text, 'html.parser')

In [25]:
#Writing the raw bytes of the response to a local file;
#the with-block closes the file automatically
with open('data.csv', 'wb') as output:
    output.write(resp.content)

In [26]:
#Reading csv into the dataframe
df = pd.read_csv("data.csv")

In [27]:
df.head()

Unnamed: 0,C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DATE,TIME,DESC,ENTRIES,EXITS
0,A002,R051,02-00-00,59 ST,NQR456W,BMT,05/09/2020,00:00:00,REGULAR,7417122,2518946
1,A002,R051,02-00-00,59 ST,NQR456W,BMT,05/09/2020,04:00:00,REGULAR,7417123,2518948
2,A002,R051,02-00-00,59 ST,NQR456W,BMT,05/09/2020,08:00:00,REGULAR,7417133,2518959
3,A002,R051,02-00-00,59 ST,NQR456W,BMT,05/09/2020,12:00:00,REGULAR,7417146,2518977
4,A002,R051,02-00-00,59 ST,NQR456W,BMT,05/09/2020,16:00:00,REGULAR,7417178,2518996
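
###### Writing data.csv to disk first is not strictly necessary, since pandas can read from any file-like object. The sketch below uses two rows from the df.head() output above as a stand-in for resp.text. The padded EXITS header is an assumption, included only to show why stripping column names can be useful.

```python
import io

import pandas as pd

# Stand-in for resp.text: a header line plus two rows from the output above.
# The trailing spaces after EXITS are an assumption made for this example.
sample_text = (
    "C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DATE,TIME,DESC,ENTRIES,EXITS   \n"
    "A002,R051,02-00-00,59 ST,NQR456W,BMT,05/09/2020,00:00:00,REGULAR,7417122,2518946\n"
    "A002,R051,02-00-00,59 ST,NQR456W,BMT,05/09/2020,04:00:00,REGULAR,7417123,2518948\n"
)

# StringIO wraps the text in a file-like object that read_csv accepts.
df_direct = pd.read_csv(io.StringIO(sample_text))

# Remove any stray whitespace around the column names.
df_direct.columns = df_direct.columns.str.strip()

print(df_direct.shape)               # (2, 11)
print(df_direct["EXITS"].tolist())   # [2518946, 2518948]
```

###### With the real download, io.StringIO(resp.text) would take the place of io.StringIO(sample_text).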


In [28]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 206905 entries, 0 to 206904
Data columns (total 11 columns):
C/A                                                                     206905 non-null object
UNIT                                                                    206905 non-null object
SCP                                                                     206905 non-null object
STATION                                                                 206905 non-null object
LINENAME                                                                206905 non-null object
DIVISION                                                                206905 non-null object
DATE                                                                    206905 non-null object
TIME                                                                    206905 non-null object
DESC                                                                    206905 non-null object
ENTRIES                           

# Selenium

###### Selenium is a powerful tool for controlling a web browser programmatically. It works with all major browsers and on all major operating systems.
###### Mastering Selenium will help you automate day-to-day browser tasks, such as posting tweets, sending WhatsApp messages, or running Google searches, without operating the browser by hand and often in just 15-30 lines of Python code. The possibilities for automation with Selenium are extensive.
###### Selenium allows Python to interact with web pages by opening a web browser (e.g. Firefox, Google Chrome, Safari), either with the browser window visible on screen or without it (a mode called headless).
###### Selenium requires a web driver to interface with the chosen browser. A web driver is a package that interacts with a local web browser or a remote web server through a common wire protocol.
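
###### A minimal sketch of the headless mode described above, assuming Selenium 4, a local Chrome installation, and a matching driver that Selenium can locate. The function name fetch_title_headless is hypothetical and the choice of Chrome is only one option.

```python
def fetch_title_headless(url):
    """Open `url` in headless Chrome and return the page title."""
    # Imported here so that defining the function does not require Selenium.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    # Run Chrome without a visible browser window.
    options.add_argument("--headless")

    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.title
    finally:
        # Always shut the browser down, even if loading the page fails.
        driver.quit()


# Example call (requires Selenium, Chrome, and network access):
# print(fetch_title_headless("https://en.wikipedia.org/wiki/Data_science"))
```

###### The try/finally block matters because a browser process that is never quit keeps running in the background.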