# Selenium Web Scraper

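*To run: make sure chromedriver is installed, open the file in Jupyter Notebook, then choose Kernel -> Restart and Run All. The code uses the older Selenium locator calls such as `find_elements_by_xpath`, so it needs a Selenium release that still provides them.*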

Some websites are built with dynamic ASPX or JavaScript frameworks, and their data cannot be reached with Beautiful Soup alone. Selenium handles these cases. This web scraper drives a Google Chrome browser from Python code. I used this bot to collect data from various agricultural markets across India. The scraper follows these steps:

1. Open the website using Selenium
2. Collect all the states listed on the website
3. Iterate through the states and select the required dates for which data is present
4. Once the required date is selected, press the submit button
5. Collect data from the table
6. Go back to the website and collect the remaining dates
7. Close the browser once the data is collected

`time.sleep` is called between browser actions so that we do not hit the site too fast and so that the pacing resembles a human user.
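
The fixed pauses can also be given a little random jitter so the pacing is less mechanical. The helper below is a minimal sketch and not part of the original notebook: the name `polite_pause` and its parameters are hypothetical, and the `sleep` argument exists only so the helper can be tried without really waiting.

```python
import random
import time


def polite_pause(base_seconds, jitter_seconds=0.5, sleep=time.sleep):
    """Sleep for base_seconds plus a random extra of up to jitter_seconds.

    `sleep` is injectable (hypothetical design choice) so the helper can be
    exercised without actually waiting. Returns the requested delay.
    """
    delay = base_seconds + random.uniform(0, jitter_seconds)
    sleep(delay)
    return delay


# Example: pause for roughly 1 to 1.5 seconds between two browser actions
# polite_pause(1)
```

A call such as `polite_pause(2)` could stand in for a `time.sleep(2)` in the functions further down.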


The script uses `try` and `except` blocks to catch any error raised during data collection and retry, rather than carrying on with incomplete data.
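
The retry idea can be isolated into a small helper. This is a sketch under assumptions, not the notebook's own code: `retry`, `attempts` and `wait_seconds` are hypothetical names, and the notebook itself uses inline `while True` loops with counters.

```python
import time


def retry(action, attempts=5, wait_seconds=2, sleep=time.sleep):
    """Call action() until it succeeds or `attempts` tries have failed.

    Hypothetical helper: returns the result of the first successful call,
    or raises RuntimeError chained to the last error seen.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception as error:  # KeyboardInterrupt is deliberately not caught
            last_error = error
            if attempt < attempts:
                sleep(wait_seconds)  # back off before the next try
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error
```

Catching `Exception` rather than using a bare `except` keeps Ctrl+C working while a long scrape is running.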


*For this sample, I have modified the script to visit a single state for a particular date and print a simple dataframe.*

In [1]:
'''

Import required packages

'''
from selenium import webdriver
import sys
import re
import os
import sqlalchemy
from selenium.common.exceptions import NoSuchElementException
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import datetime
from pytz import timezone
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import pickle

## Global Variables

In [2]:
india_time = timezone('Asia/Kolkata')
today      = datetime.datetime.now(india_time)
days       = datetime.timedelta(1)
yesterday = today - days
job_start_time = datetime.datetime.now(india_time)

## Global Functions

`input_fun` - Given the browser, state and date, selects the state, month and year in the page's dropdowns

`clean_up` - Once the submit button has been pressed, scrapes the table from the underlying HTML source and returns a dataframe

In [3]:
def input_fun(driver, st, date):
    mon = date.strftime("%B")
    year = date.year

    st_xpath = '//select[@name="ctl00$cphBody$cboState"]'
    time.sleep(1)

    driver.find_elements_by_xpath(st_xpath + "//option[contains(text(), '" + st + "')]")[0].click()
    time.sleep(2)

    mon_xpath = '//select[@name="ctl00$cphBody$cboMonth"]'
    time.sleep(1)

    driver.find_elements_by_xpath(mon_xpath + '//option[contains(text(), "' + mon + '")]')[0].click()
    time.sleep(2)

    year_xpath = '//select[@name="ctl00$cphBody$cboYear"]'
    time.sleep(1)

    driver.find_elements_by_xpath(year_xpath + '//option[contains(text(), "' + str(year) + '")]')[0].click()
    time.sleep(2)

    return driver

def clean_up(chrome, st):
    # Parse every <table> on the page; the price table is the fourth one
    html = chrome.page_source
    df1 = pd.read_html(html)
    df = df1[3]

    # The date the data refers to is shown in a maroon <font> tag (dd/mm/YYYY)
    dsoup = BeautifulSoup(html, 'html.parser')
    select = dsoup.find('font', {'color': 'Maroon'})
    date = select.contents[0].strip()
    datetime_obj = datetime.datetime.strptime(date, '%d/%m/%Y')

    # Drop rows that occur more than once, then rows whose Market and
    # Arrivals cells hold the same text
    df = df.drop_duplicates(keep=False)
    df = df.query("Market != Arrivals").copy()

    # Make the column names identifier-friendly
    df.columns = [x.replace(" ", '_').replace("-", "_") for x in df.columns.to_list()]

    # Keep the raw arrivals text and add a copy where 'NR' becomes 0
    df = df.rename(columns={'Arrivals': 'Arrivals_String'})
    df['Arrivals'] = np.where(df['Arrivals_String'] == 'NR', 0, df['Arrivals_String'])

    # Metadata columns
    df['Relevant_Date'] = pd.to_datetime(datetime_obj.strftime("%Y-%m-%d"), format='%Y-%m-%d')
    df['Runtime'] = pd.to_datetime(today.strftime("%Y-%m-%d %H:%M:%S"))
    df['Last_Updated'] = ''
    df['State'] = st

    # Convert the numeric columns; values that cannot be parsed become NaN
    num_cols = ['Arrivals', 'Minimum_Prices', 'Maximum_Prices', 'Modal_Prices']
    df[num_cols] = df[num_cols].apply(pd.to_numeric, errors='coerce')

    df = df[['State', 'Market', 'Arrivals_String', 'Arrivals', 'Unit_of_Arrivals', 'Variety',
             'Minimum_Prices', 'Maximum_Prices', 'Modal_Prices', 'Unit_of_Price',
             'Relevant_Date', 'Runtime', 'Last_Updated']]
    return df
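
`input_fun` builds the same kind of XPath three times, once each for the state, month and year dropdowns. As a sketch only, that string building could live in one helper. `option_xpath` and `STATE_SELECT` are hypothetical names that do not appear in the notebook, and the helper assumes the label contains no double quotes.

```python
def option_xpath(select_name, label):
    """Build the XPath for an <option> whose text contains `label`,
    inside the <select> with the given name attribute.

    Hypothetical helper; assumes `label` has no double quotes in it.
    """
    return f'//select[@name="{select_name}"]//option[contains(text(), "{label}")]'


# Name of the state dropdown, as used in input_fun
STATE_SELECT = "ctl00$cphBody$cboState"

print(option_xpath(STATE_SELECT, "Kerala"))
# prints: //select[@name="ctl00$cphBody$cboState"]//option[contains(text(), "Kerala")]
```

The resulting string can be passed to the same XPath lookup that `input_fun` already performs.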

## Main Script

*The script below is commented step by step to explain what each part does.*

In [4]:
'''
The script runs inside a while loop so that,
if an error occurs, the whole run is retried.
The loop stops once a run completes or after
five failed attempts.
'''
main_limit=0
while True:
    try:
        # Record the CPU time at the start of the run (time.process_time does not count time spent sleeping)
        start = time.process_time()
        #Website to scrape
        site_url = "https://agmarknet.gov.in/PriceAndArrivals/CommodityDailyStateWise.aspx"
        
        '''
        
        Download ChromeDriver from https://developer.chrome.com/docs/chromedriver/downloads
        and place it in the working directory below
        '''
        
        wd = r'C:\Personal_Project\Web_Scraping_Stuff\Selenium'
        driver = webdriver.Chrome(executable_path = os.path.join(wd, "chromedriver.exe"))
        driver.get(site_url)
        driver.maximize_window()
        # Acquiring all states from the HTML source
        
        eles = driver.find_elements_by_xpath('//select[@name="ctl00$cphBody$cboState"]')
        r = requests.Session()
        sesh1=r.get(site_url)
        soup1 = BeautifulSoup(sesh1.content , "lxml")
        states=[]
        select = soup1.find('select', id="cphBody_cboState")
        for value in select.stripped_strings:
            print (value)
            states.append(value)
        states = states[1:]  
        states1 = ['Tamil Nadu','Kerala','Andhra Pradesh','Karnataka','Telangana','Pondicherry']
        states = [elem for elem in states if elem in states1 ]
        states = [st.strip() for st in states if ("select" not in st.lower()) and (st.strip() != '')]  # list of states
        states = [states[0]]
        main=pd.DataFrame()
        all_dates = []
        
        # Giving a start date - Example is April of 2022
        
        st_mon = 4
        st_year = 2022
        
        # Build the list of month-start dates
        while True:
            # Roll over to January of the next year before comparing with today
            if st_mon == 13:
                st_mon = 1
                st_year += 1
            if st_mon == today.month and st_year == today.year:
                break
            all_dates.append(datetime.date(st_year, st_mon, 1))
            st_mon += 1
        # all_dates holds the first day of every month from the start date up to last month
        
        #For this example I am iterating through one State and one date
        
        for st in states:
            print(st)
            for date in [all_dates[0]]:
                print(st, date)
                
                # Given a State and date we can pass it onto the function
                driver = input_fun(driver, st, date)
                # Implicit wait: lets element lookups wait up to 10 seconds for the JS element to load
                driver.implicitly_wait(10)
                xpath = '//*[@id="cphBody_Calendar1"]/tbody//a'
                eles = driver.find_elements_by_xpath(xpath)
                all_text = []
                for ele in eles:
                    all_text.append(ele.text)
                
                # Loop over the available days of the month; only all_text[0] is used in this example
                for text in [all_text[0]]:
                    code_rel_date = datetime.date(date.year, date.month, int(text))
                    print(code_rel_date)
                    driver = input_fun(driver, st, date)
                    print(text)
                    rel_date = date.strftime("%B %Y")
                    time.sleep(1)
                    limit = 0
                    # This while loop checks that the calendar shows the requested month, retrying up to 5 times before raising an error
                    while True:
                        try:
                            cal = driver.find_elements_by_xpath('//*[@id="cphBody_Calendar1"]/tbody/tr[1]/td/table/tbody//td')[0].text
                            if(cal.strip()==rel_date):
                                break
                            else:
                                raise Exception("Date Mismatch")
                        except:
                            limit += 1
                            if(limit>5):
                                raise Exception("Internet slow")
                            time.sleep(2)
                    ele = driver.find_elements_by_xpath(xpath + '[text() = "' + text + '"]')[0]
                    ele.click()
                    time.sleep(2)
                    
                    #Clicking the Submit button to access data                    
                    try:                        
                        xpath_submit='//*[@id="cphBody_btnSubmit"]'
                        time.sleep(3)
                        driver.implicitly_wait(10) # seconds
                        submit = driver.find_element_by_xpath(xpath_submit)
                        submit.click()
                        driver.implicitly_wait(5)
                    except NoSuchElementException:  
                        runtime=datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
                        continue                  
                    time.sleep(2)
                    
                    '''
                    Once the raw data is accessed we pass the browser to a clean up function
                    to gather a clean dataframe
                
                    '''
                    output=clean_up(driver,st)
                    print(output.head())
                    runtime=datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
                    
                    '''
                    Going back to the website to go to the next date and/or state
                    
                    '''
                    driver.get(site_url)
        print(time.process_time() - start)
        try:
            driver.quit()
        except:
            pass

        break

    except Exception:
        try:
            driver.quit()
        except:
            pass
        main_limit += 1
        if(main_limit>4):
            error_msg = str(sys.exc_info()[1])
            raise Exception(error_msg)
        time.sleep(5)

-----Select State--------
Andaman and Nicobar
Andhra Pradesh
Arunachal Pradesh
Assam
Bihar
Chandigarh
Chattisgarh
Dadra and Nagar Haveli
Daman and Diu
Goa
Gujarat
Haryana
Himachal Pradesh
Jammu and Kashmir
Jharkhand
Karnataka
Kerala
Lakshadweep
Madhya Pradesh
Maharashtra
Manipur
Meghalaya
Mizoram
Nagaland
NCT of Delhi
Odisha
Pondicherry
Punjab
Rajasthan
Sikkim
Tamil Nadu
Telangana
Tripura
Uttar Pradesh
Uttrakhand
West Bengal
Andhra Pradesh
Andhra Pradesh 2022-04-01
2022-04-01
1
            State         Market Arrivals_String  Arrivals Unit_of_Arrivals  \
3  Andhra Pradesh           Alur            0.01      0.01           Tonnes   
4  Andhra Pradesh  Banaganapalli               1      1.00           Tonnes   
6  Andhra Pradesh        Atmakur            0.01      0.01           Tonnes   
8  Andhra Pradesh        Atmakur            0.01      0.01           Tonnes   
9  Andhra Pradesh  Banaganapalli               1      1.00           Tonnes   

          Variety  Minimum_Prices  Maximum

In [5]:
# VIEWING OUR DATASET
print(output.copy())

             State           Market Arrivals_String  Arrivals  \
3   Andhra Pradesh             Alur            0.01      0.01   
4   Andhra Pradesh    Banaganapalli               1      1.00   
6   Andhra Pradesh          Atmakur            0.01      0.01   
8   Andhra Pradesh          Atmakur            0.01      0.01   
9   Andhra Pradesh    Banaganapalli               1      1.00   
10  Andhra Pradesh      Rajahmundry               1      1.00   
11  Andhra Pradesh              NaN             NaN       NaN   
12  Andhra Pradesh           Tanuku               1      1.00   
16  Andhra Pradesh       Ambajipeta              44     44.00   
20  Andhra Pradesh         Cuddapah             1.2      1.20   
24  Andhra Pradesh         Chittoor              45     45.00   
25  Andhra Pradesh              NaN             NaN       NaN   
26  Andhra Pradesh              NaN             NaN       NaN   
30  Andhra Pradesh    Banaganapalli               1      1.00   
31  Andhra Pradesh       
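
The list of month-start dates that the main script builds can also be written as a standalone function, which makes the year rollover easier to check in isolation. This is a minimal sketch: `month_starts` and its parameters are hypothetical names, and the end month is excluded, on the assumption that the current month should not be part of the list.

```python
import datetime


def month_starts(start_year, start_month, end_year, end_month):
    """Return the first day of each month from the start month up to,
    but not including, the end month (hypothetical helper)."""
    dates = []
    year, month = start_year, start_month
    # Tuples compare element by element, so this orders by year, then month
    while (year, month) < (end_year, end_month):
        dates.append(datetime.date(year, month, 1))
        month += 1
        if month == 13:  # roll over into January of the next year
            month = 1
            year += 1
    return dates


print(month_starts(2022, 11, 2023, 2))
# prints: [datetime.date(2022, 11, 1), datetime.date(2022, 12, 1), datetime.date(2023, 1, 1)]
```

With `today` from the global variables, `month_starts(2022, 4, today.year, today.month)` would give every month from April 2022 up to last month.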

## Key Takeaways

### Advantages 
- Selenium gives us access to hard-to-reach data that Python's BeautifulSoup or Scrapy may not be able to access
- It can be used to navigate through Captchas
- Helps emulate a human user, which keeps the load on the site closer to that of a normal visitor
- Easy to deploy

### Disadvantages
- Very slow and time consuming when compared to Scrapy or BeautifulSoup
- Difficult to debug when faced with errors
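
Part of the slowness comes from fixed `time.sleep` pauses. One common mitigation is to poll for a condition and continue as soon as it holds. The sketch below shows that idea in plain Python. `wait_until` is a hypothetical helper and not part of the notebook; Selenium's `WebDriverWait`, which the notebook imports but does not use, is built around the same polling pattern.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Poll condition() until it returns a truthy value or `timeout` seconds pass.

    Hypothetical helper: returns the truthy value, or raises TimeoutError.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)  # short pause before checking again
```

A condition such as "the calendar header shows the requested month" could be passed in as a small function, so the script waits only as long as the page actually needs.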
 