
# Exercise 2
# Web Scraping using BeautifulSoup


## <b>Problem Description</b>
Scrape psgtech.edu to list all faculty members who hold a PhD.

### Step 1

Import the required packages: requests, webbrowser and BeautifulSoup.<br>
Using the requests module, fetch the index page of psgtech.edu.

In [1]:
import sys
print(sys.executable)
print(sys.version)
print(sys.version_info)

c:\users\balaj\appdata\local\programs\python\python36\python.exe
3.6.2 (v3.6.2:5fd33b5, Jul  8 2017, 04:57:36) [MSC v.1900 64 bit (AMD64)]
sys.version_info(major=3, minor=6, micro=2, releaselevel='final', serial=0)


In [2]:
import requests
import webbrowser
from bs4 import BeautifulSoup

r = requests.get("http://psgtech.edu/index.php")
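
The request above does not check whether the server answered successfully. A minimal sketch of a more defensive fetch is shown below. The helper name `fetch_html`, its `session` parameter, and the 10 second timeout are assumptions made for illustration and are not part of the original notebook.

```python
def fetch_html(url, session=None, timeout=10):
    """Return the body of `url` as bytes, raising on HTTP errors.

    `session` is any object with a requests-style `get` method
    (hypothetical parameter, added so the helper can be tried offline).
    """
    if session is None:
        # Imported lazily so the helper can also be exercised with a stub.
        import requests
        session = requests
    response = session.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx and 5xx status codes.
    response.raise_for_status()
    return response.content
```

Calling `fetch_html("http://psgtech.edu/index.php")` would return the same bytes as `r.content`, provided the request succeeds.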

### Step 2

Parse the HTML content with BeautifulSoup.<br>
Find the a tags with the class "list-group-item", since these links list the departments explicitly.


In [3]:
soup = BeautifulSoup(r.content, 'html.parser')
# find_all is the current name of the legacy findAll method
department = soup.find_all('a', {"class": "list-group-item"})
for i in department:
    print(i['href'])

hme.php?var=AFD
hme.php?var=AMC
hme.php?var=AS
hme.php?var=AUT
hme.php?var=BIO
hme.php?var=BME
hme.php?var=CHE
hme.php?var=CIV
hme.php?var=MCA
hme.php?var=CSE
hme.php?var=EEE
hme.php?var=ECE
hme.php?var=ENG
hme.php?var=FAT
hme.php?var=HUM
hme.php?var=IT
hme.php?var=ICE
hme.php?var=MAT
hme.php?var=MEC
hme.php?var=MTL
hme.php?var=PHY
hme.php?var=PRO
hme.php?var=RAE
hme.php?var=TXT
http://www.psgim.ac.in/new/
http://www.psgtech.edu/sports/
#
#
Mainoffice.php
Accsec.php
Academ.php
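
Most of the hrefs printed above are relative to the index page. As a small sketch of how they could be turned into full URLs, the standard library's `urljoin` can resolve each one against the page address from Step 1. The variable names `BASE_URL`, `hrefs` and `absolute_links` are chosen here for illustration.

```python
from urllib.parse import urljoin

# URL of the page the links were scraped from (taken from Step 1).
BASE_URL = "http://psgtech.edu/index.php"

# A few hrefs copied from the output above.
hrefs = ["hme.php?var=CSE", "http://www.psgim.ac.in/new/", "#", "Mainoffice.php"]

# Skip "#" placeholders, then resolve each remaining href against the page URL.
absolute_links = [urljoin(BASE_URL, href) for href in hrefs if href != "#"]

for link in absolute_links:
    print(link)
```

Links that are already absolute pass through unchanged, so the same call can be applied to every href.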


### Step 3

Keep only the departments we need. Here we analyze the departments up to and including CSE, which are the first 10 of the 31 links, so the slice drops the last 21.

In [4]:
# Drop the last 21 links, leaving the first 10 departments (AFD to CSE)
department = department[:-21]
departmentList = []
for i in department:
    code = i['href'].split("=")[1]  # "hme.php?var=AFD" -> "AFD"
    print(code)
    departmentList.append(code)

AFD
AMC
AS
AUT
BIO
BME
CHE
CIV
MCA
CSE


In [5]:
departmentList

['AFD', 'AMC', 'AS', 'AUT', 'BIO', 'BME', 'CHE', 'CIV', 'MCA', 'CSE']
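
Slicing by a fixed count depends on the exact number of links on the page. As an alternative sketch, the department code can be read from the `var` query parameter, and any link that does not point to `hme.php` can be skipped. The function name `department_code` is hypothetical.

```python
from urllib.parse import urlparse, parse_qs

def department_code(href):
    """Return the `var` value of an hme.php link, or None for any other link."""
    parsed = urlparse(href)
    if not parsed.path.endswith("hme.php"):
        return None
    values = parse_qs(parsed.query).get("var")
    return values[0] if values else None

# Sample hrefs copied from the Step 2 output.
hrefs = ["hme.php?var=AFD", "hme.php?var=CSE",
         "http://www.psgtech.edu/sports/", "#", "Mainoffice.php"]

codes = [c for c in map(department_code, hrefs) if c is not None]
print(codes)  # ['AFD', 'CSE']
```

Applied to the full list of hrefs, this would return every department code, so a slice such as `codes[:10]` would still be needed to stop at CSE.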

### Step 4

Download the faculty page of each department.<br>
The following script does this using the pyautogui and webbrowser modules.<br>
Since the PHP pages are generated on the server side, the script automates mouse clicks and keystrokes in the browser to open and save each page.<br><br>

Note: Run this script separately, since it automates mouse and keyboard events.

In [6]:
# Downloaded files will be found in PSGWebsiteData
# import time
# import pyautogui
#
# webbrowser.open("http://www.psgtech.edu", new=2)
# for i in departmentList:
#     # Click at (176, 58), where the address bar is expected to be,
#     # and open the department page
#     pyautogui.click(176,58)
#     pyautogui.typewrite("http://www.psgtech.edu/hme.php?var="+i)
#     pyautogui.typewrite(["enter"])
#     # Open the faculty page of the selected department
#     pyautogui.click(176,58)
#     pyautogui.typewrite("http://www.psgtech.edu/prografac.php")
#     pyautogui.typewrite(["enter"])
#     # Save the page under the department code
#     pyautogui.hotkey('ctrl','s')
#     pyautogui.typewrite(str(i))
#     pyautogui.typewrite(["enter"])
#     time.sleep(2)
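
The same two-step navigation might also be possible without mouse automation. The sketch below assumes that the site remembers the selected department in a session cookie, which the source does not confirm, and that `prografac.php` can be fetched with a plain GET request. `download_faculty_page` and its parameters are hypothetical names.

```python
from pathlib import Path

BASE = "http://www.psgtech.edu/"

def download_faculty_page(session, code, out_dir="PSGWebsiteData", timeout=10):
    """Visit the department page, then save prografac.php as <code>.html.

    `session` is expected to behave like requests.Session (assumption).
    """
    # Select the department (assumed to set server-side session state).
    session.get(BASE + "hme.php?var=" + code, timeout=timeout).raise_for_status()
    # Fetch the faculty listing within the same session.
    response = session.get(BASE + "prografac.php", timeout=timeout)
    response.raise_for_status()
    # Write the page to <out_dir>/<code>.html, creating the folder if needed.
    target = Path(out_dir) / (code + ".html")
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(response.content)
    return target
```

A possible call would be `download_faculty_page(requests.Session(), "CSE")`, repeated for each entry in `departmentList`. If the saved pages turn out to be identical for every department, the session assumption does not hold and the browser-based script remains the fallback.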

### Step 5

Parse all the downloaded files.<br>
First, find all div tags with the class "col-sm-6 col-md-8", which is the class used for the faculty names.<br>
Then check whether each name starts with "Dr" to determine whether that person holds a PhD.

In [7]:
mainList = []
for j in departmentList:
    try:
        # Each department page was saved as <code>.html in PSGWebsiteData
        with open("PSGWebsiteData\\" + str(j) + ".html") as f:
            soup_main = BeautifulSoup(f, 'html.parser')
        for i in soup_main.find_all('div', {'class': 'col-sm-6 col-md-8'}):
            name = i.text.strip().replace("\n", " ")
            nameList = name.split(' ')
            # Keep a title fused to a name ("Dr.Uma"); skip a bare "Dr." token
            if nameList[0].startswith("Dr") and len(nameList[0]) > 3:
                mainList.append(nameList[0])
    except Exception:
        # Skip departments whose file is missing or cannot be read
        continue
print(mainList)

['Dr.Vijayalakshmi', 'Dr.Nirmala', 'Dr.Mariyam', 'Dr.Prathiba', 'Dr.Nandhinee', 'Dr.Sudha', 'Dr.Venkatesan', 'Dr.Karpagam', 'Dr.Santhi', 'Dr.Suriya', 'Dr.Indumathi', 'Dr.Kavitha', 'Dr.Gopika', 'Dr.Uma', 'Dr.Sathiyapriya', 'Dr.Jayashree', 'Dr.Lovelyn', 'Dr.Arul', 'Dr.Sudha', 'Dr.Venkatesan', 'Dr.Karpagam', 'Dr.Santhi', 'Dr.Suriya', 'Dr.Indumathi', 'Dr.Kavitha', 'Dr.Gopika', 'Dr.Uma', 'Dr.Sathiyapriya', 'Dr.Jayashree', 'Dr.Lovelyn', 'Dr.Arul']
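
To make the name check easier to follow and to try on its own, it can be pulled out into a small function. This is a sketch that mirrors the condition used in the cell above. The name `doctorate_name` is hypothetical, and `split()` is used here in place of the strip, replace and split chain.

```python
def doctorate_name(raw_text):
    """Return the leading 'Dr.Name' token of a faculty entry, or None."""
    tokens = raw_text.split()
    if not tokens:
        return None
    first = tokens[0]
    # A bare "Dr." is only 3 characters long, so requiring more than 3
    # keeps tokens where the title is fused to a name, such as "Dr.Uma".
    if first.startswith("Dr") and len(first) > 3:
        return first
    return None

print(doctorate_name("\n Dr.Uma R\nProfessor "))  # Dr.Uma
print(doctorate_name("Mr.Kumar"))                 # None
print(doctorate_name("Dr. Uma"))                  # None
```

The check has two limitations: an entry written as "Dr. Uma" (with a space) is skipped, and any first word that merely begins with "Dr" would be accepted.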


### Step 6

Print the results, one name per line.

In [8]:
for i in mainList:
    print(i)

Dr.Vijayalakshmi
Dr.Nirmala
Dr.Mariyam
Dr.Prathiba
Dr.Nandhinee
Dr.Sudha
Dr.Venkatesan
Dr.Karpagam
Dr.Santhi
Dr.Suriya
Dr.Indumathi
Dr.Kavitha
Dr.Gopika
Dr.Uma
Dr.Sathiyapriya
Dr.Jayashree
Dr.Lovelyn
Dr.Arul
Dr.Sudha
Dr.Venkatesan
Dr.Karpagam
Dr.Santhi
Dr.Suriya
Dr.Indumathi
Dr.Kavitha
Dr.Gopika
Dr.Uma
Dr.Sathiyapriya
Dr.Jayashree
Dr.Lovelyn
Dr.Arul
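
The output above lists the names from Dr.Sudha to Dr.Arul twice. If repeated names are not wanted, one way to drop them while keeping the original order is `dict.fromkeys`, as in the sketch below. The sample list is a shortened stand-in for `mainList`, and `unique_names` is a name chosen for illustration.

```python
# Shortened stand-in for mainList, with the same kind of repetition.
main_list = ["Dr.Sudha", "Dr.Venkatesan", "Dr.Arul",
             "Dr.Sudha", "Dr.Venkatesan", "Dr.Arul"]

# dict keys are unique and keep insertion order (guaranteed from Python 3.7;
# CPython 3.6 behaves the same way as an implementation detail).
unique_names = list(dict.fromkeys(main_list))

print(unique_names)  # ['Dr.Sudha', 'Dr.Venkatesan', 'Dr.Arul']
```

Because only the first word of each name is stored, two different people who share that word would be merged into one entry by this step.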
