## Web Scraping

With a few lines of Python we can send a request to a target website, receive the page as HTML, and extract the data we need from it. We will do this with Beautiful Soup and the request functions in Python's standard `urllib` package. A few examples follow.

Before you start, make sure `beautifulsoup4` is installed in your virtualenv. If it is not, install it with `pip install beautifulsoup4`. The examples also use the `lxml` parser, which is a separate package (`pip install lxml`).

In [2]:
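import urllib.request  # import the submodule explicitly; a bare `import urllib` does not guarantee urllib.request is loaded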
from bs4 import BeautifulSoup

In [5]:
# The following is a Times of India news article. We will extract the news text from its HTML.
article_url = 'https://timesofindia.indiatimes.com/city/kolhapur/homeopath-wife-held-in-kolhapur-for-selling-babies/articleshow/62814634.cms'

In [8]:
page_data = urllib.request.urlopen(article_url).read().decode('utf8', 'ignore')
# print(page_data)
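
The one-line fetch above works for this article, but some sites reject requests that carry the default `urllib` User-Agent, and a slow server can leave `urlopen` waiting for a long time. The following is a minimal sketch of a more defensive fetch, not a definitive implementation. The helper names `build_request` and `fetch_html`, the User-Agent string, and the 10-second timeout are all assumptions made for illustration.

```python
import urllib.request


# Hypothetical helper names; the User-Agent string is an arbitrary example value.
def build_request(url, user_agent="Mozilla/5.0 (tutorial-scraper)"):
    """Create a Request object that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, timeout=10):
    """Download a page and return its decoded HTML, or None if the request fails."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
            # Use the charset announced by the server, falling back to UTF-8.
            charset = resp.headers.get_content_charset() or "utf8"
            return resp.read().decode(charset, "ignore")
    except OSError as err:  # URLError, HTTPError and socket timeouts are OSError subclasses
        print("Request failed:", err)
        return None
```

Calling `fetch_html(article_url)` would then stand in for the `urlopen(...).read().decode(...)` chain, with the difference that a failed request returns `None` instead of raising.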

In [11]:
# Let's "soup" it. Parsing the HTML into a BeautifulSoup object lets us query and navigate it from Python.
# Every HTML tag becomes a Python object that we can search.
soup = BeautifulSoup(page_data, 'lxml')
# print(soup)

In [14]:
# The main content of the page is inside the div with class "Normal", so let's filter that out.
target_div = soup.find_all("div", {"class": "Normal"})
# print(target_div)
# The .text attribute of the target div gives the article content without the HTML tags.
text = target_div[0].text
print(text)

Maharashtra police on Tuesday arrested a homeopathic doctor and his wife for allegedly selling newborns in Kolhapur district. Police said Arun Bhupal Patil sold two babies over the last three months for Rs 2 lakh each. According to Priyadarshini Chorage, district head of the child welfare committee, Patil would help unmarried women deliver children and give all the cash from the sale to the mothers.

Chorage said Patil has confessed to selling a new-born delivered by a minor mother to a family in Chhattisgarh on December 23, 2017 for Rs 2 lakh. “He has also confessed to selling a new-born baby to a couple in Mumbai but has not revealed the details,” she added.

Police said Patil, his wife Ujwalla and a couple who took the baby had been booked under relevant sections of IPC and the Juvenile Justice Act, 2015.
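
To see what `find_all` does without depending on a live page, here is a small sketch that runs on a made-up HTML string. The markup is invented for illustration and only borrows the `Normal` class name from the article page. It uses the `html.parser` parser that ships with Python, so it runs even without `lxml`.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a downloaded article page.
sample_html = """
<html><body>
  <div class="header">Site navigation</div>
  <div class="Normal">First paragraph of the story.</div>
  <div class="Normal">Second paragraph of the story.</div>
</body></html>
"""

# 'html.parser' is part of the standard library, so no extra install is needed.
sample_soup = BeautifulSoup(sample_html, "html.parser")

# find_all returns every matching tag; find returns only the first match (or None).
matches = sample_soup.find_all("div", class_="Normal")
first = sample_soup.find("div", class_="Normal")

print(len(matches))
print(first.get_text())
print([div.get_text() for div in matches])
```

`class_="Normal"` is shorthand for the `{"class": "Normal"}` dictionary form used in the cell above; both select the same tags.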




In [16]:
# Some formatting, such as removing newlines and double spaces, turns the text into a single paragraph.
text = text.replace('\n', '').replace('  ', '')
text

'Maharashtra police on Tuesday arrested a homeopathic doctor and his wife for allegedly selling newborns in Kolhapur district. Police said Arun Bhupal Patil sold two babies over the last three months for Rs 2 lakh each. According to Priyadarshini Chorage, district head of the child welfare committee, Patil would help unmarried women deliver children and give all the cash from the sale to the mothers.Chorage said Patil has confessed to selling a new-born delivered by a minor mother to a family in Chhattisgarh on December 23, 2017 for Rs 2 lakh. “He has also confessed to selling a new-born baby to a couple in Mumbai but has not revealed the details,” she added.Police said Patil, his wife Ujwalla and a couple who took the baby had been booked under relevant sections of IPC and the Juvenile Justice Act, 2015.'
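
Note that deleting the newlines outright glues paragraphs together, which is why the output above reads "mothers.Chorage" with no space. One possible alternative, sketched here with a hypothetical helper name, is to collapse every run of whitespace into a single space instead of removing it.

```python
def normalize_whitespace(text):
    """Collapse any run of whitespace (newlines, tabs, spaces) into one space."""
    # str.split() with no arguments splits on any whitespace and drops empty pieces.
    return " ".join(text.split())


sample = "First sentence.\n\n  Second   sentence.\n"
print(normalize_whitespace(sample))
```

This keeps one space between sentences that were on separate lines and also trims leading and trailing whitespace.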

In [17]:
# Let's put all of this into a function to make it reusable.
def getTextWaPo(url):
    page = urllib.request.urlopen(url).read().decode('utf8', 'ignore')
    soup = BeautifulSoup(page, 'lxml')
    text=''.join(map(lambda x: x.text, soup.find_all("div", {"class": "Normal"})))
    text = text.replace('\n', '').replace('  ', '')
    return text
text = getTextWaPo(article_url)
text

'Maharashtra police on Tuesday arrested a homeopathic doctor and his wife for allegedly selling newborns in Kolhapur district. Police said Arun Bhupal Patil sold two babies over the last three months for Rs 2 lakh each. According to Priyadarshini Chorage, district head of the child welfare committee, Patil would help unmarried women deliver children and give all the cash from the sale to the mothers.Chorage said Patil has confessed to selling a new-born delivered by a minor mother to a family in Chhattisgarh on December 23, 2017 for Rs 2 lakh. “He has also confessed to selling a new-born baby to a couple in Mumbai but has not revealed the details,” she added.Police said Patil, his wife Ujwalla and a couple who took the baby had been booked under relevant sections of IPC and the Juvenile Justice Act, 2015.'

### More Scraping

In [60]:
# Let's do some more web scraping, this time on an e-commerce site.
# We will query Walmart's search URL to get product data.

def scrap_walmart_in(item):
    _url = 'https://www.walmart.com/search/?query={0}'.format(item)
    page = urllib.request.urlopen(_url).read().decode('utf8', 'ignore')
    soup = BeautifulSoup(page, 'lxml')
    products = soup.find_all('li', {'class': 'search-gridview-first-col-item'})
    
    result = []
    for p in products:
        link = p.find('a', {'class': 'product-title-link'})
        currency = p.find('span', {'class': 'price-currency'}).text
        char = p.find('span', {'class': 'price-characteristic'}).text
        mark = p.find('span', {'class': 'price-mark'}).text
        mantis = p.find('span', {'class': 'price-mantissa'}).text
        price = '{0}{1}{2}{3}'.format(currency, char, mark, mantis)
        data = {
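            # findChild() returns the first child tag (not its text), so the name prints as raw HTML
            'name': link.findChild(),
            # href already starts with '/', which is why the printed links contain '//'
            'link': 'https://www.walmart.com/{}'.format(link.attrs['href']),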
            'price': price,
            'img': p.find('img', {'class': 'Tile-img'}).attrs['src']
        }
        result.append(data)
    return result

def display_products(data):
    print('----------------------------------')
    for i in data:
        print('Name: ', i['name'])
        print('Price: ', i['price'])
        print('link: ', i['link'])
        print('-------------------------')
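
The scraper above calls `.text` directly on the result of `p.find(...)`. Because `find` returns `None` when nothing matches, a product tile without a price would raise an `AttributeError`. The sketch below shows one way to guard against that, and it also extracts the name as plain text and builds the link with `urljoin`. It runs on made-up markup that reuses the class names from the scraper; the helper names `text_or_empty` and `parse_products` are hypothetical, and the real page structure may differ.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://www.walmart.com/"

# Made-up markup that reuses the class names from the scraper above.
sample_html = """
<ul>
  <li class="search-gridview-first-col-item">
    <a class="product-title-link" href="/ip/Sample-Towel/123"><span>Sample <mark>Towel</mark></span></a>
    <span class="price-currency">$</span><span class="price-characteristic">5</span>
    <span class="price-mark">.</span><span class="price-mantissa">77</span>
  </li>
  <li class="search-gridview-first-col-item">
    <a class="product-title-link" href="/ip/No-Price-Item/456"><span>No price item</span></a>
  </li>
</ul>
"""


def text_or_empty(parent, tag, css_class):
    """Return the stripped text of the first matching child, or '' if it is absent."""
    node = parent.find(tag, class_=css_class)
    return node.get_text(strip=True) if node else ""


def parse_products(html):
    """Extract name, link and price from each product tile in the given HTML."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for item in soup.find_all("li", class_="search-gridview-first-col-item"):
        link = item.find("a", class_="product-title-link")
        if link is None:
            continue  # skip tiles that have no title link
        price = "".join(
            text_or_empty(item, "span", "price-" + part)
            for part in ("currency", "characteristic", "mark", "mantissa")
        )
        results.append({
            # Plain text with a space between fragments, without <span>/<mark> tags.
            "name": link.get_text(" ", strip=True),
            # urljoin avoids the doubled slash when href already starts with '/'.
            "link": urljoin(BASE_URL, link.get("href", "")),
            "price": price or None,
        })
    return results


for product in parse_products(sample_html):
    print(product)
```

Returning `None` for a missing price is a design choice for this sketch; skipping such tiles entirely would be an equally reasonable option.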


In [61]:
display_products(scrap_walmart_in('flower'))

----------------------------------
Name:  <span><mark>1-800-</mark>Flowers<mark></mark>: Fresh <mark>Flowers</mark> - Assorted Roses &amp; Peru  ...<br/></span>
Price:  $54.99
link:  https://www.walmart.com//ip/1-800-Flowers-Fresh-Flowers-Assorted-Roses-Peruvian-Lilies-with-Clear-Vase/893142472
-------------------------
Name:  <span><mark>Flower</mark> Shimmer &amp; Shade Eyeshadow Palette, Warm Natur  ...<br/></span>
Price:  $14.98
link:  https://www.walmart.com//ip/Flower-Shimmer-Shade-Eyeshadow-Palette-Warm-Natural/162449435
-------------------------
Name:  <span>Rose and Gypso with Fluted Vase Silk <mark>Flower</mark> Arrangem  ...<br/></span>
Price:  $36.09
link:  https://www.walmart.com//ip/Rose-and-Gypso-with-Fluted-Vase-Silk-Flower-Arrangement/17688055
-------------------------
Name:  <span><mark>Flower</mark> Shimmer &amp; Shade Eyeshadow Palette, ES2 Cool N  ...<br/></span>
Price:  $14.98
link:  https://www.walmart.com//ip/Flower-Shimmer-Shade-Eyeshadow-Palette-ES2-Cool-Natur

In [62]:
display_products(scrap_walmart_in('towel'))

----------------------------------
Name:  <span>Mainstays Shark Beach <mark>Towel</mark></span>
Price:  $5.77
link:  https://www.walmart.com//ip/Mainstays-Shark-Beach-Towel/231410425
-------------------------
Name:  <span>MS 28X60 PRINTED SHEARED BEACH <mark>TOWEL</mark> VARIEGATED STRI  ...<br/></span>
Price:  $5.77
link:  https://www.walmart.com//ip/MS-28X60-PRINTED-SHEARED-BEACH-TOWEL-VARIEGATED-STRIPE-MULT/894963326
-------------------------
Name:  <span>MS 28X60 PRINTED SHEARED BEACH <mark>TOWEL</mark> PINEAPPLE</span>
Price:  $5.77
link:  https://www.walmart.com//ip/MS-28X60-PRINTED-SHEARED-BEACH-TOWEL-PINEAPPLE/792561776
-------------------------
Name:  <span>Mainstays 34" x 64" Fiber Reactive Print Beach <mark>Towel</mark></span>
Price:  $8.94
link:  https://www.walmart.com//ip/Mainstays-34-x-64-Fiber-Reactive-Print-Beach-Towel/614740376
-------------------------
Name:  <span>Mainstays Round <mark>Towel</mark> Pizza</span>
Price:  $9.97
link:  https://www.walmart.com//ip/Mainst

In [63]:
display_products(scrap_walmart_in('flower'))

----------------------------------
Name:  <span><mark>Flower</mark> Shimmer &amp; Strobe Highlighting Palette, SP1</span>
Price:  $12.98
link:  https://www.walmart.com//ip/Flower-Shimmer-Strobe-Highlighting-Palette-SP1/169385725
-------------------------
Name:  <span><mark>Flower</mark> Draw the Line EP1 Blonde Eyebrow Pencil, 0.00  ...<br/></span>
Price:  $6.98
link:  https://www.walmart.com//ip/Flower-Draw-the-Line-EP1-Blonde-Eyebrow-Pencil-0-007-oz/49019680
-------------------------
Name:  <span><mark>Flower</mark> Shimmer &amp; Shade Eyeshadow Palette, ES4 Intens  ...<br/></span>
Price:  $14.98
link:  https://www.walmart.com//ip/Flower-Shimmer-Shade-Eyeshadow-Palette-ES4-Intense-Natural/172285066
-------------------------
Name:  <span><mark>FLOWER</mark> Kiss Me Twice Lip &amp; Cheek Chubby, Apricot-A-L  ...<br/></span>
Price:  $9.98
link:  https://www.walmart.com//ip/FLOWER-Kiss-Me-Twice-Lip-Cheek-Chubby-Apricot-A-Lot/31343530
-------------------------
Name:  <span><mark>Flower</m