### Web Scraping:

#### Web scraping is the process of extracting data from websites by using automated software or bots. It involves fetching and parsing the HTML code of web pages to extract specific information, such as text, images, links, or other structured data.

#### Here's a simplified overview of how web scraping typically works:

#### 1. Fetching: The web scraper sends an HTTP request to a target website and retrieves the HTML content of the web page.

#### 2. Parsing: The HTML content is then parsed to identify and extract the desired data. This is usually done using libraries or tools like BeautifulSoup in Python.

#### 3. Data extraction: The scraper locates specific elements within the HTML, for example by tag name or CSS selector, and extracts the relevant data. For example, it could extract product names, prices, or reviews from an e-commerce website.

#### 4. Data manipulation: Once the data is extracted, it can be further processed or manipulated as per the requirements. This might involve cleaning the data, removing duplicates, or converting it into a structured format like JSON or CSV.

#### 5. Storage or analysis: The scraped data can be stored in a database, a spreadsheet, or any other storage medium for future use. Alternatively, it can be used for immediate analysis or integration with other applications.
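
#### The five steps above can be sketched end to end on a small inline page. This is only an illustration: the HTML sample, the `ProductParser` class and the `scrape` and `to_csv` helpers are hypothetical names, and the standard library's `html.parser` stands in for BeautifulSoup so that the sketch runs without extra packages or network access.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical sample page; a real scraper would fetch this over HTTP (step 1).
SAMPLE_HTML = """
<ul>
  <li class="product"> Widget </li>
  <li class="product">Gadget</li>
  <li class="product">Widget</li>
  <li class="other">Ignore me</li>
</ul>
"""


class ProductParser(HTMLParser):
    """Collects the text of every <li class="product"> element (steps 2 and 3)."""

    def __init__(self):
        super().__init__()
        self.in_product = False
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "li" and dict(attrs).get("class") == "product":
            self.in_product = True
            self.items.append("")

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_product = False

    def handle_data(self, data):
        if self.in_product:
            self.items[-1] += data


def scrape(html):
    parser = ProductParser()
    parser.feed(html)
    # Step 4: trim whitespace and drop duplicates while keeping the order.
    cleaned = [item.strip() for item in parser.items]
    return list(dict.fromkeys(cleaned))


def to_csv(names):
    # Step 5: serialise to CSV text that could be written to a file.
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["name"])
    writer.writerows([name] for name in names)
    return buffer.getvalue()


products = scrape(SAMPLE_HTML)
print(products)
print(to_csv(products))
```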

#### When scraping, respect the website's terms of service, comply with applicable laws, and keep the load on the website's server low. Some websites restrict scraping or do not permit it at all, so review a website's terms of service or contact the website owner before scraping.
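
#### One machine-readable signal of a site's scraping rules is its robots.txt file. The sketch below parses a hypothetical robots.txt (the rules and the example.com URLs are made up for illustration) with the standard library's `urllib.robotparser`. It does not replace reading the terms of service.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real check would point the parser at the
# site's own file with RobotFileParser.set_url(...) followed by .read().
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_rules(robots_text):
    """Parse robots.txt text into a RobotFileParser object."""
    rules = RobotFileParser()
    rules.parse(robots_text.splitlines())
    return rules


rules = build_rules(ROBOTS_TXT)
allowed_home = rules.can_fetch("*", "https://example.com/")
allowed_private = rules.can_fetch("*", "https://example.com/private/data")
delay = rules.crawl_delay("*")  # seconds to wait between requests, if declared

print(allowed_home, allowed_private, delay)
```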

### How Web Scrapers Work:

#### Web scrapers can extract all the data on a particular site or only the specific data that a user wants. Ideally, specify the data you want so that the scraper extracts only that data and finishes quickly.

### What is web scraping used for:

#### 1. Price monitoring
#### 2. Market Research
#### 3. News Monitoring
#### 4. Sentiment Analysis
#### 5. Email Marketing

### Website: Quotes to Scrape
#### url: https://quotes.toscrape.com/

In [1]:
pip install beautifulsoup4

Note: you may need to restart the kernel to use updated packages.


In [2]:
import requests

In [3]:
# here is the url

url = "https://quotes.toscrape.com/"

In [4]:
# raise an error instead of waiting indefinitely on an unresponsive server
response = requests.get(url, timeout=10)

In [5]:
response.status_code

200

#### The status code provides information about whether the request was successful, encountered an error, or requires further action.
#### 2xx (Successful): The request was successful, and the server has fulfilled it. For example, 200 OK indicates a successful request.

#### 3xx (Redirection): The server is redirecting the client to a different location. For example, 301 Moved Permanently indicates a permanent redirect.

#### 4xx (Client Error): The server could not process the client's request due to an error on the client's side. For example, 404 Not Found indicates that the requested resource was not found on the server.

#### 5xx (Server Error): The server encountered an error while processing the request. For example, 500 Internal Server Error indicates an unexpected server error.
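
#### The status code families above can be looked up in code. The sketch below uses the standard library's `http.HTTPStatus` for the reason phrases; `STATUS_CLASSES` and `describe_status` are hypothetical helper names, and the 1xx (Informational) family is included for completeness.

```python
from http import HTTPStatus

# First digit of the status code mapped to its family name.
STATUS_CLASSES = {
    1: "Informational",
    2: "Successful",
    3: "Redirection",
    4: "Client Error",
    5: "Server Error",
}


def describe_status(code):
    """Return e.g. '404 Not Found (Client Error)' for a numeric status code."""
    family = STATUS_CLASSES.get(code // 100, "Unknown")
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        phrase = "Unrecognised"  # code is not in the standard library's table
    return f"{code} {phrase} ({family})"


for code in (200, 301, 404, 500):
    print(describe_status(code))
```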

In [6]:
# import BeautifulSoup

from bs4 import BeautifulSoup

In [7]:
pip install lxml




In [9]:
# create a BeautifulSoup object from the response body

soup = BeautifulSoup(response.content, 'html.parser')

#### lxml is a popular Python library for processing XML and HTML documents. It provides a fast and efficient way to parse and manipulate XML and HTML data, and BeautifulSoup can use it as a parser.
#### lxml did not work in the local Python environment used here, so this notebook uses the built-in html.parser instead. lxml did work in Google Colab.
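
#### One way to handle an environment where lxml may or may not be available is to choose the parser name at run time. This is a minimal sketch; `pick_parser` is a hypothetical helper, and it only checks whether the package can be found, not whether it loads correctly.

```python
from importlib.util import find_spec


def pick_parser():
    """Return 'lxml' when the package is importable, else 'html.parser'."""
    return "lxml" if find_spec("lxml") is not None else "html.parser"


parser_name = pick_parser()
print(parser_name)

# Usage with the notebook's objects (assumes `response` from the cells above):
# soup = BeautifulSoup(response.content, parser_name)
```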

In [10]:
quote = soup.find('span', class_='text')

In [11]:
quote.text

'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'

In [12]:
# Now collect all the quotes present on the page

quotes = soup.find_all('span', class_='text')

In [13]:
# all quotes collected, still as HTML elements

quotes

[<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>,
 <span class="text" itemprop="text">“It is our choices, Harry, that show what we truly are, far more than our abilities.”</span>,
 <span class="text" itemprop="text">“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”</span>,
 <span class="text" itemprop="text">“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”</span>,
 <span class="text" itemprop="text">“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”</span>,
 <span class="text" itemprop="text">“Try not to become a man of success. Rather become a man of value.”</span>,
 <span class="text" itemprop="text">“It is better to be hated for what you are than to be loved for what you are not.

In [14]:
quotes = [quote.text for quote in quotes]

In [15]:
quotes

['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”',
 '“It is our choices, Harry, that show what we truly are, far more than our abilities.”',
 '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”',
 '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”',
 "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”",
 '“Try not to become a man of success. Rather become a man of value.”',
 '“It is better to be hated for what you are than to be loved for what you are not.”',
 "“I have not failed. I've just found 10,000 ways that won't work.”",
 "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”",
 '“A day without sunshine is like, you know, night.”']

In [16]:
quotes[0]

'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'

In [17]:
# remove the curly quotation marks by slicing off the first and last characters

quotes[0][1:-1]

'The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.'
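
#### Slicing with `[1:-1]` assumes every string starts and ends with a quotation mark. A slightly more defensive sketch uses `str.strip`, which removes the marks only when they are present; `strip_curly_quotes` is a hypothetical helper name.

```python
def strip_curly_quotes(text):
    """Remove leading/trailing curly or straight double quotes, if present."""
    return text.strip("“”\"")


print(strip_curly_quotes("“A day without sunshine is like, you know, night.”"))
print(strip_curly_quotes("No quotation marks here"))
```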

In [18]:
# run find_all again, because `quotes` was overwritten with plain strings above

quotes = soup.find_all('span', class_='text')

In [19]:
quotes = [quote.text[1:-1] for quote in quotes]

In [20]:
quotes

['The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.',
 'It is our choices, Harry, that show what we truly are, far more than our abilities.',
 'There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.',
 'The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.',
 "Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.",
 'Try not to become a man of success. Rather become a man of value.',
 'It is better to be hated for what you are than to be loved for what you are not.',
 "I have not failed. I've just found 10,000 ways that won't work.",
 "A woman is like a tea bag; you never know how strong it is until it's in hot water.",
 'A day without sunshine is like, you know, night.']

In [22]:
# Extract authors

authors = soup.find_all('small', class_ = "author")

In [23]:
authors

[<small class="author" itemprop="author">Albert Einstein</small>,
 <small class="author" itemprop="author">J.K. Rowling</small>,
 <small class="author" itemprop="author">Albert Einstein</small>,
 <small class="author" itemprop="author">Jane Austen</small>,
 <small class="author" itemprop="author">Marilyn Monroe</small>,
 <small class="author" itemprop="author">Albert Einstein</small>,
 <small class="author" itemprop="author">André Gide</small>,
 <small class="author" itemprop="author">Thomas A. Edison</small>,
 <small class="author" itemprop="author">Eleanor Roosevelt</small>,
 <small class="author" itemprop="author">Steve Martin</small>]

In [24]:
# use list comprehension to get list of authors

authors = [author.text for author in authors]

In [25]:
authors

['Albert Einstein',
 'J.K. Rowling',
 'Albert Einstein',
 'Jane Austen',
 'Marilyn Monroe',
 'Albert Einstein',
 'André Gide',
 'Thomas A. Edison',
 'Eleanor Roosevelt',
 'Steve Martin']

In [26]:
tags = soup.find_all('div', class_='tags')

In [27]:
tags

[<div class="tags">
             Tags:
             <meta class="keywords" content="change,deep-thoughts,thinking,world" itemprop="keywords"/>
 <a class="tag" href="/tag/change/page/1/">change</a>
 <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
 <a class="tag" href="/tag/thinking/page/1/">thinking</a>
 <a class="tag" href="/tag/world/page/1/">world</a>
 </div>,
 <div class="tags">
             Tags:
             <meta class="keywords" content="abilities,choices" itemprop="keywords"/>
 <a class="tag" href="/tag/abilities/page/1/">abilities</a>
 <a class="tag" href="/tag/choices/page/1/">choices</a>
 </div>,
 <div class="tags">
             Tags:
             <meta class="keywords" content="inspirational,life,live,miracle,miracles" itemprop="keywords"/>
 <a class="tag" href="/tag/inspirational/page/1/">inspirational</a>
 <a class="tag" href="/tag/life/page/1/">life</a>
 <a class="tag" href="/tag/live/page/1/">live</a>
 <a class="tag" href="/tag/miracle/page/1/">miracl

In [28]:
# Let's see the first quote's tags

tags[0]

<div class="tags">
            Tags:
            <meta class="keywords" content="change,deep-thoughts,thinking,world" itemprop="keywords"/>
<a class="tag" href="/tag/change/page/1/">change</a>
<a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
<a class="tag" href="/tag/thinking/page/1/">thinking</a>
<a class="tag" href="/tag/world/page/1/">world</a>
</div>

In [29]:
# Let's extract the first quote's tags

for i in tags[0].find_all('a', class_='tag'):
    print(i.text)

change
deep-thoughts
thinking
world


In [30]:
# Now extract the tags for every quote, collecting them in a list

total_tags = []
for tag_div in tags:
    # text of each <a class="tag"> link inside this quote's tags block
    names = [a.text for a in tag_div.find_all('a', class_='tag')]
    total_tags.append(','.join(names))

In [31]:
total_tags

['change,deep-thoughts,thinking,world',
 'abilities,choices',
 'inspirational,life,live,miracle,miracles',
 'aliteracy,books,classic,humor',
 'be-yourself,inspirational',
 'adulthood,success,value',
 'life,love',
 'edison,failure,inspirational,paraphrased',
 'misattributed-eleanor-roosevelt',
 'humor,obvious,simile']
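
#### As a small example of analysing the scraped data, the comma-joined strings can be split again to count how often each tag occurs. This sketch copies the `total_tags` output shown above; `tag_counts` is a hypothetical name.

```python
from collections import Counter

# The comma-joined tag strings from the output above.
total_tags = [
    'change,deep-thoughts,thinking,world',
    'abilities,choices',
    'inspirational,life,live,miracle,miracles',
    'aliteracy,books,classic,humor',
    'be-yourself,inspirational',
    'adulthood,success,value',
    'life,love',
    'edison,failure,inspirational,paraphrased',
    'misattributed-eleanor-roosevelt',
    'humor,obvious,simile',
]

# Split each string back into individual tags and count them.
tag_counts = Counter(tag for joined in total_tags for tag in joined.split(','))

print(tag_counts['inspirational'])
print(tag_counts.most_common(3))
```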

In [32]:
# Let's create a DataFrame

import pandas as pd

In [33]:
dataset = pd.DataFrame()

In [34]:
dataset['Quote'] = quotes
dataset

| | Quote |
|---|---|
| 0 | The world as we have created it is a process o... |
| 1 | It is our choices, Harry, that show what we tr... |
| 2 | There are only two ways to live your life. One... |
| 3 | The person, be it gentleman or lady, who has n... |
| 4 | Imperfection is beauty, madness is genius and ... |
| 5 | Try not to become a man of success. Rather bec... |
| 6 | It is better to be hated for what you are than... |
| 7 | I have not failed. I've just found 10,000 ways... |
| 8 | A woman is like a tea bag; you never know how ... |
| 9 | A day without sunshine is like, you know, night. |


In [35]:
dataset['Tags'] = total_tags

In [36]:
dataset

| | Quote | Tags |
|---|---|---|
| 0 | The world as we have created it is a process o... | change,deep-thoughts,thinking,world |
| 1 | It is our choices, Harry, that show what we tr... | abilities,choices |
| 2 | There are only two ways to live your life. One... | inspirational,life,live,miracle,miracles |
| 3 | The person, be it gentleman or lady, who has n... | aliteracy,books,classic,humor |
| 4 | Imperfection is beauty, madness is genius and ... | be-yourself,inspirational |
| 5 | Try not to become a man of success. Rather bec... | adulthood,success,value |
| 6 | It is better to be hated for what you are than... | life,love |
| 7 | I have not failed. I've just found 10,000 ways... | edison,failure,inspirational,paraphrased |
| 8 | A woman is like a tea bag; you never know how ... | misattributed-eleanor-roosevelt |
| 9 | A day without sunshine is like, you know, night. | humor,obvious,simile |


In [38]:
dataset['Authors'] = authors

In [39]:
dataset

| | Quote | Tags | Authors |
|---|---|---|---|
| 0 | The world as we have created it is a process o... | change,deep-thoughts,thinking,world | Albert Einstein |
| 1 | It is our choices, Harry, that show what we tr... | abilities,choices | J.K. Rowling |
| 2 | There are only two ways to live your life. One... | inspirational,life,live,miracle,miracles | Albert Einstein |
| 3 | The person, be it gentleman or lady, who has n... | aliteracy,books,classic,humor | Jane Austen |
| 4 | Imperfection is beauty, madness is genius and ... | be-yourself,inspirational | Marilyn Monroe |
| 5 | Try not to become a man of success. Rather bec... | adulthood,success,value | Albert Einstein |
| 6 | It is better to be hated for what you are than... | life,love | André Gide |
| 7 | I have not failed. I've just found 10,000 ways... | edison,failure,inspirational,paraphrased | Thomas A. Edison |
| 8 | A woman is like a tea bag; you never know how ... | misattributed-eleanor-roosevelt | Eleanor Roosevelt |
| 9 | A day without sunshine is like, you know, night. | humor,obvious,simile | Steve Martin |
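
#### Assigning separately scraped lists column by column assumes that the lists have the same length and the same order. A small guard can make that assumption explicit before the table is built. This is a sketch only; `build_records` is a hypothetical helper, and the resulting list of dictionaries could be passed to `pd.DataFrame(...)`.

```python
def build_records(quotes, authors, tags):
    """Pair up the three lists row by row, refusing mismatched lengths."""
    if not (len(quotes) == len(authors) == len(tags)):
        raise ValueError(
            f"length mismatch: {len(quotes)} quotes, "
            f"{len(authors)} authors, {len(tags)} tag strings"
        )
    return [
        {"Quote": quote, "Tags": tag, "Authors": author}
        for quote, author, tag in zip(quotes, authors, tags)
    ]


records = build_records(
    ["A day without sunshine is like, you know, night."],
    ["Steve Martin"],
    ["humor,obvious,simile"],
)
print(records)
```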


In [40]:
# write the dataset to a CSV file; index=False leaves out the row index column

dataset.to_csv('quotes.csv', index=False)
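
#### Several quotes and tag strings contain commas, so they have to be quoted in the CSV file to survive a round trip. The sketch below shows this with the standard library's `csv` module and an in-memory buffer instead of pandas and a real file; the two rows are copied from the table above.

```python
import csv
import io

rows = [
    {
        "Quote": "It is our choices, Harry, that show what we truly are, "
                 "far more than our abilities.",
        "Tags": "abilities,choices",
        "Authors": "J.K. Rowling",
    },
    {
        "Quote": "A day without sunshine is like, you know, night.",
        "Tags": "humor,obvious,simile",
        "Authors": "Steve Martin",
    },
]

# Write the rows as CSV text; fields containing commas are quoted automatically.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["Quote", "Tags", "Authors"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()

# Read the text back and check that nothing was split at the embedded commas.
restored = list(csv.DictReader(io.StringIO(csv_text)))
print(csv_text)
print(restored == rows)
```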

## END