Write a Python application that monitors the status of all the auctions on eBay.com for a given type of item.

Let X be the name of the item of interest (e.g., a 'Nokia 3310' mobile phone):

At its first run, the program should use BeautifulSoup4 to scrape from eBay the information about all the auctions related to X, including:

   - title
   - number of bids
   - highest bid
   - description (extracted from the extended description on the page of that particular auction; the first 200 characters are enough)
   - id
   - seller
   - seller's score
   - scraping date
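
The fields listed above can be pictured as one record per auction. The following is only a minimal sketch: the `AuctionRecord` name is hypothetical and not part of the notebook, the field names mirror the DataFrame columns used in the cells below, and the sample values are invented for illustration.

```python
import datetime
from typing import NamedTuple, Optional


class AuctionRecord(NamedTuple):
    """One scraped auction (hypothetical helper type, for illustration only)."""
    title: str
    number_bids: int
    highest_bid: float            # in EUR
    description: Optional[str]    # first 200 characters, or None if unavailable
    id: str
    seller: str
    seller_feed: Optional[float]  # seller's score, stored as a percentage
    when: datetime.date           # scraping date


# Invented values, not real eBay data.
example = AuctionRecord(
    title="Nokia 3310",
    number_bids=3,
    highest_bid=12.5,
    description="Used phone, working.",
    id="000000000000",
    seller="some_seller",
    seller_feed=99.1,
    when=datetime.date(2021, 4, 25),
)
print(example._asdict())
```

A list of such tuples can be passed directly to `pd.DataFrame`, which is how the scraped rows are collected later on.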

In [1]:
#import all the packages used in this project
import requests
import bs4
import re
import datetime
import time
import pandas as pd

###First of all, I wrote a function for scraping the description of each auction, because it is contained in a separate HTML page.

In [2]:
def scrape_info(url):
  #download the page that holds the extended description
  p = requests.get(url)
  page = bs4.BeautifulSoup(p.content,'lxml')
  description = page.find(id="ds_div")
  if description:
    #collapse blank-line gaps into '.' and keep the first 200 characters
    description = re.sub(r'\n\s*\n', '.', description.text.strip())[:200]
  return description
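
To see what the clean-up step does without contacting eBay, the same regular expression can be tried on a plain string. This is a sketch only: `clean_description` and the sample text are hypothetical and stand in for the text of the description element.

```python
import re


def clean_description(raw_text, max_chars=200):
    """Collapse blank-line gaps into '.' and keep the first max_chars characters."""
    collapsed = re.sub(r'\n\s*\n', '.', raw_text.strip())
    # A slice [:200] keeps exactly 200 characters.
    return collapsed[:max_chars]


# Invented text imitating a description with blank lines between paragraphs.
sample = "  Nokia 3310\n\n   \nGood condition\n\nShips from Italy  "
print(clean_description(sample))
# prints: Nokia 3310.Good condition.Ships from Italy
```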

###The function for scraping all the information we need for a single auction:

In [3]:
def scrape_advertisement(url):
  p = requests.get(url)
  page = bs4.BeautifulSoup(p.content,'lxml')

  title = page.h1.contents[1]
  number_bids = int(page.find(class_="u-dspblk").find('a').find(id="qty-test").text)

  #all prices in euro: use the converted price when the page provides one
  convPrice = page.find(id="convbidPrice")
  if convPrice is not None:
    highest_bid = float(convPrice.text.split()[1].split('(')[0].replace('.','').replace(',','.'))
  else:
    highest_bid = float(page.find(id="prcIsum_bidPrice").text.split()[1].replace('.','').replace(',','.'))

  #the description lives in an iframe; default to None when the iframe is missing
  description = None
  if page.find(id="desc_ifr"):
    description = scrape_info(page.find(id="desc_ifr")['src'])

  item_id = page.find(id="descItemNumber").text
  seller = page.find(class_="mbg-nw").text

  #seller's feedback score, as a percentage
  seller_feed = page.find(id='si-fb')
  if seller_feed:
    seller_feed = float(seller_feed.text.split('%')[0].replace(',','.'))

  when = datetime.date.today()

  return (title, number_bids, highest_bid, description, item_id, seller, seller_feed, when)
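
The price parsing above relies on the European number format, where '.' separates thousands and ',' marks decimals. A small sketch of that conversion in isolation, assuming the amount is the second whitespace-separated token; `parse_euro_amount` and the sample strings are hypothetical and are not taken from real eBay markup.

```python
def parse_euro_amount(text):
    """Convert a string such as 'EUR 1.234,56' to the float 1234.56.

    Assumes the amount is the second whitespace-separated token and that
    '.' is the thousands separator and ',' the decimal separator.
    """
    amount = text.split()[1].split('(')[0]
    return float(amount.replace('.', '').replace(',', '.'))


print(parse_euro_amount('EUR 1.234,56'))  # prints: 1234.56
print(parse_euro_amount('EUR 15,00'))     # prints: 15.0
```

The order of the two `replace` calls matters: removing the thousands separator first avoids turning the decimal comma into a dot that would then be stripped.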

###This function is used only when run > 1; it reports the changes with respect to the previous run.



In [4]:
def report(output1,output2):
  #output1 is the previous run, output2 is the current run

  #looking for Auctions deleted
  output1['update'] = output1['id'].isin(output2['id'])
  deleted = output1[~output1['update']][['id','title']]
  print('Auctions deleted:')
  print(deleted)
  print('\n')

  #looking for New auctions
  output2['new'] = ~output2['id'].isin(output1['id'])
  new = output2[output2['new']][['id','title']]
  print('New auctions:')
  print(new)
  print('\n')

  #looking for Auctions with new bid
  old_id = output2[~output2['new']]['id']

  print("Auctions with new bid: ")
  for auction_id in old_id:
    #take the first matching row so that scalars, not arrays, are compared
    row1 = output1[output1['id']==auction_id].iloc[0]
    row2 = output2[output2['id']==auction_id].iloc[0]
    if row1['number_bids'] < row2['number_bids']:
      print(auction_id, row1['title'], row1['highest_bid'], row2['highest_bid'])
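
The three comparisons in `report` are essentially set operations on the auction ids. The following pandas-free sketch shows the same logic on plain dictionaries; `compare_runs` is a hypothetical helper, and the ids and bid counts are invented.

```python
def compare_runs(previous, current):
    """Compare two runs given as dicts mapping auction id -> number of bids.

    Returns three sorted lists of ids: deleted, added, and with new bids.
    """
    deleted = sorted(set(previous) - set(current))
    added = sorted(set(current) - set(previous))
    common = set(previous) & set(current)
    new_bids = sorted(i for i in common if current[i] > previous[i])
    return deleted, added, new_bids


previous = {'A1': 2, 'A2': 0, 'A3': 5}
current = {'A2': 1, 'A3': 5, 'A4': 0}
print(compare_runs(previous, current))
# prints: (['A1'], ['A4'], ['A2'])
```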


scrape_from_ebay is a function that takes as input a *keyword*, i.e. the product that you want to scrape from eBay, the *maximum number of results* that you would like to have, and *run*, which indicates whether it is the first run or not (by default run=1).
If it is the first run (run = 1), scrape_from_ebay saves to X.scrape.xlsx, for each auction:
 - title
 - number of bids
 - highest bid (in EUR)
 - description (extracted from the extended description on the page of that particular auction; the first 200 characters are enough)
 - id
 - seller
 - seller's score
 - scraping date

Starting from the second run, the program repeats the same scraping activity as in the first run and then reports on screen:
- the id and the title of all the auctions that have been deleted, with respect to the previous run
- the id and the title of all the auctions that have been added, with respect to the previous run
- the id, the title, the old maximum bid and the new maximum bid of all the auctions where at least one new bid has been placed, with respect to the previous run

In [8]:
def scrape_from_ebay(keyword, maximum_number_results = 100, run=1):
  columns = ['title', 'number_bids', 'highest_bid', 'description', 'id', 'seller', 'seller_feed', 'when']
  n_page = 1
  actual_number_results = 0
  output = []
  done = False
  while not done:
    #When the page number is greater than the number of the last page, eBay serves the
    #last page again. For this reason we read the total number of items (max_results)
    #available for the product we are scraping, so that we know when to stop.
    url = 'https://www.ebay.it/sch/i.html?_nkw='+keyword+'&_sacat=0&rt=nc&LH_Auction=1&_pgn='+str(n_page)
    n_page += 1
    p = requests.get(url)
    page = bs4.BeautifulSoup(p.content,'lxml')
    max_results = int(page.find(class_="srp-controls__control srp-controls__count").text.split(' ')[0])

    ads = page.find_all(class_="s-item__link")
    if not ads:
      break  #nothing to scrape on this page
    for ad in ads:
      print('Scraping data from: {}'.format(ad['href']))
      output.append(scrape_advertisement(ad['href']))
      actual_number_results += 1

      #stop when all items, or the requested maximum, have been scraped
      if actual_number_results >= min(max_results, maximum_number_results):
        done = True
        break

      #every 20 items the application takes a break of 20 seconds
      if actual_number_results % 20 == 0:
        print('Break!')
        time.sleep(20)

  output1 = pd.DataFrame(output, columns=columns)
  print('{} items are scraped!'.format(actual_number_results))
  if run==1:
    output1.to_excel('X.scrape.xlsx')
    return output1
  #if it's not the first run, load the previous dataframe
  prev_output = pd.read_excel('X.scrape.xlsx')
  #save and reload the new dataframe, so that both runs are read back from Excel in the same way
  output1.to_excel('X.scrape.xlsx')
  output = pd.read_excel('X.scrape.xlsx')
  return report(prev_output,output)
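
The search URL above is assembled by string concatenation. For keywords that contain spaces, such as 'Nokia 3310', the query string can instead be built with the standard library so that the keyword is encoded explicitly. A sketch, assuming the same query parameters as in the function above; `build_search_url` is a hypothetical helper.

```python
from urllib.parse import urlencode


def build_search_url(keyword, n_page):
    """Build the eBay.it auction search URL for a keyword and a page number."""
    params = {
        '_nkw': keyword,     # search keyword
        '_sacat': 0,
        'rt': 'nc',
        'LH_Auction': 1,     # auctions only
        '_pgn': n_page,      # page number
    }
    # urlencode escapes special characters (a space becomes '+').
    return 'https://www.ebay.it/sch/i.html?' + urlencode(params)


print(build_search_url('Nokia 3310', 2))
# prints: https://www.ebay.it/sch/i.html?_nkw=Nokia+3310&_sacat=0&rt=nc&LH_Auction=1&_pgn=2
```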
  



In [None]:
scrape_from_ebay('iphone10', 100,1)

In [9]:
scrape_from_ebay('iphone10', 100,2)

Scraping data from: https://www.ebay.it/itm/284270249537?hash=item422fd3c641%3Ag%3A434AAOSwRt9gcHkD&LH_Auction=1
Scraping data from: https://www.ebay.it/itm/284273136598?hash=item422fffd3d6%3Ag%3AT%7E4AAOSwx2dggsZH&LH_Auction=1
Scraping data from: https://www.ebay.it/itm/164825636868?hash=item26605fa004%3Ag%3AMKgAAOSwPkZgbMbM&LH_Auction=1
Scraping data from: https://www.ebay.it/itm/324593822153?hash=item4b934ca9c9%3Ag%3AHmQAAOSwF3VghGVt&LH_Auction=1
Scraping data from: https://www.ebay.it/itm/203386402339?hash=item2f5ac63623%3Ag%3AOZcAAOSwkctghSom&LH_Auction=1
Scraping data from: https://www.ebay.it/itm/144022165123?hash=item218863d683%3Ag%3AqcYAAOSwHh5ggVLm&LH_Auction=1
Scraping data from: https://www.ebay.it/itm/324583531754?hash=item4b92afa4ea%3Ag%3AQOYAAOSwB1VggBtG&LH_Auction=1
Scraping data from: https://www.ebay.it/itm/184793878370?hash=item2b0692cf62%3Ag%3AoGIAAOSws5hgZi3y&LH_Auction=1
Scraping data from: https://www.ebay.it/itm/164832668333?hash=item2660caeaad%3Ag%3AMG8AAOSwH5J