# ADVANCED PANDAS: DATA IMPORTING & WEB SCRAPING

Course Outline:
- Basic Data Importing
    - Flat Files (.csv, .tsv, .txt)
    - Excel Files (.xlsx)
    - Other Files (.dta, .mat, etc.)
- Importing Data from Relational Databases
    - SQL Crash Course
    - Database Files (.db, .sqlite, etc.)
- ***Importing Data from the Internet***
    - Case-study: Wuzzuf.com [Web Scraping]
    - HTML & CSS Crash Course
    - Working with JSON Data & APIs
    - ***Web Scraping Basics***
        - ***XPath & CSS Selectors***
        - ***Python Libraries for Web Scraping***

### Importing Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set()

==========

## Importing Data from the Internet

- HTML & CSS Crash Course
- Working with JSON Data & APIs
- ***Web Scraping Basics***

==========

## Web Scraping Basics

- Web Scraping Fundamentals
- XPath & CSS Selectors
- Python Libraries for Web Scraping
    - Requests
    - BeautifulSoup
    - Scrapy Basics

### Web Scraping Fundamentals

In [None]:
from IPython.display import Image
Image("data/ws.jpg")

***Why? (Applications)***
- Price Comparison
- Email address gathering
- Social Media Scraping
- Research and Development
- Job listings

***Web Scraping Process:***
- Identify the target website.
- Get the URLs of the pages that hold the data to extract.
- Obtain the HTML/CSS/JS of those pages.
- Find locators (XPath expressions, CSS selectors, or regular expressions) for the data to extract.
- Extract the data and save it in a structured format such as a JSON or CSV file.
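
The steps above can be sketched offline with the standard library only. This is a minimal illustration, not a production scraper: `PAGE_HTML`, the `job` class name, and `JobTitleParser` are hypothetical stand-ins for a page that would normally be downloaded from the target URL.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical page content standing in for HTML fetched from a target URL.
PAGE_HTML = """
<html><body>
  <h2 class="job">Data Analyst</h2>
  <h2 class="job">Graphic Designer</h2>
  <h2 class="other">Not a job</h2>
</body></html>
"""


class JobTitleParser(HTMLParser):
    """Collect the text of every <h2 class="job"> element."""

    def __init__(self):
        super().__init__()
        self.in_job = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # The "locator" here is: an h2 tag whose class attribute is "job".
        if tag == "h2" and dict(attrs).get("class") == "job":
            self.in_job = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_job = False

    def handle_data(self, data):
        if self.in_job and data.strip():
            self.titles.append(data.strip())


# Extract the data from the page.
parser = JobTitleParser()
parser.feed(PAGE_HTML)

# Save the data in a structured format (CSV, written to memory for the demo).
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["title"])
writer.writerows([title] for title in parser.titles)
csv_text = buffer.getvalue()
print(csv_text)
```

In practice the libraries covered later in this notebook (Requests, BeautifulSoup, Scrapy) replace the hand-written parser, and the CSV would be written to a file instead of a string buffer.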

***Web Crawler vs Web Scraper***

In [None]:
from IPython.display import Image
Image("data/wc-ws.png")

***Some files to read before scraping any site***
- Sitemap files
- robots.txt file
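
As a rough sketch of how a robots.txt file can be checked before scraping, the standard library's `urllib.robotparser` can evaluate its rules. The rules, the `my-scraper` user agent, and the `example.com` URLs below are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real one lives at <site>/robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())   # parse rules from text, no network needed

# Ask whether a given user agent may fetch a given URL.
allowed_public = rp.can_fetch("my-scraper", "https://example.com/jobs/")
allowed_private = rp.can_fetch("my-scraper", "https://example.com/private/data")

print(allowed_public, allowed_private)
```

Against a live site, `rp.set_url("https://example.com/robots.txt")` followed by `rp.read()` downloads the file instead of parsing a string.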

***Python Modules/Libraries for Web Scraping***
- Scrapy
- BeautifulSoup
- LXML
- Selenium
- Requests

==========

### XPath & CSS Selectors
- XPath
- CSS Locators

***XPath & CSS Selectors:***
- XPath Cheat Sheet: https://devhints.io/xpath

#### XPath

| Expression | Description |
|------------|-------------|
| nodename   | Selects all nodes with the name "nodename" |
| /          | Selects from the root node |
| //         | Selects nodes in the document from the current node that match the selection, no matter where they are |
| .          | Selects the current node |
| ..         | Selects the parent of the current node |
| @          | Selects attributes |

| Example         | Result |
|-----------------|--------|
| bookstore       | Selects all nodes with the name "bookstore" |
| /bookstore      | Selects the root element bookstore. Note: a path that starts with a slash ( / ) always represents an absolute path to an element |
| bookstore/book  | Selects all book elements that are children of bookstore |
| //book          | Selects all book elements, no matter where they are in the document |
| bookstore//book | Selects all book elements that are descendants of the bookstore element, no matter where they are under it |
| //@lang         | Selects all attributes that are named lang |
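
A few of these expressions can be tried with the standard library's `xml.etree.ElementTree`. Note the assumptions: ElementTree implements only a subset of XPath (paths are relative to an element, and attribute nodes such as `//@lang` cannot be selected directly), and the bookstore data below is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical bookstore document used only for illustration.
xml_doc = """
<bookstore>
  <book category="children">
    <title lang="en">Book One</title>
    <price>29.99</price>
  </book>
  <book category="web">
    <title lang="en">Book Two</title>
    <price>39.95</price>
  </book>
</bookstore>
"""

root = ET.fromstring(xml_doc)   # root is the <bookstore> element

# bookstore/book : book elements that are children of bookstore
child_books = root.findall("book")

# //title : title elements at any depth ('.//' in ElementTree's dialect)
all_titles = [t.text for t in root.findall(".//title")]

# @lang : read the attribute from elements that have it
langs = [t.get("lang") for t in root.findall(".//title[@lang]")]

# [2] : positional predicate, counted from 1
second_title = root.find("book[2]/title").text

# .. : parent of each price element (the book elements)
price_parents = [el.tag for el in root.findall(".//price/..")]

print(len(child_books), all_titles, langs, second_title, price_parents)
```

For full XPath 1.0 support, including absolute paths and attribute selection, a library such as lxml or Scrapy's `Selector` is the usual choice.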

In [None]:
# Can you understand this simple code?
xpath = '/html/body/div[1]//table'

In [None]:
xpath = '/html/body/div'

In [None]:
xpath = '//p'   # Select all P elements

In [None]:
xpath = '//table[2]'   # Select every table that is the second table child of its parent; use '(//table)[2]' for the second table in the whole document

In [None]:
xpath = '//table/img'   # Select img elements that are direct children of a table; use '//table//img' for img elements at any depth

In [None]:
xpath = '//p/text()'   # Select all text nodes that are direct children of p elements

In [None]:
xpath = '/html/body/div[@class = "my_class"]'

#### CSS Locators

In [None]:
# Can you see any similarity here?
xpath = '/html/body/div'
css_locator = 'html > body > div'

In [None]:
xpath = '/html/body//div/p[2]'
# '//' (any depth) maps to a space in CSS, and p[2] maps to p:nth-of-type(2)
css_locator = 'html > body div > p:nth-of-type(2)'

In [None]:
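css_selector = 'html > body > div > p:nth-child(2)'   # A p that is the second child of its parent div
css_selector = 'title::text'   # ::text is a Scrapy (parsel) extension, not standard CSS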


In [None]:
xpath = '//p/text()'   # Select all text nodes that are direct children of p elements
css_selector = 'p::text'

==========

## Python Libraries for Web Scraping

- Scrapy Basics
- Requests
- BeautifulSoup

### Scrapy

In [None]:
# pip install Scrapy

In [None]:
from scrapy import Selector

#### Selectors

In [None]:
html = '''
    <html>
        <head>
            <title> My First Web Page </title>
        </head>
        <body>
            <div class = 'my_class'>
                <p>Hello, World! </p>
                <p> HTML is Awesome </p>
            </div>
            <p> Goodbye! :) </p>
        </body>
    </html>
'''

In [None]:
sel = Selector(text = html)

##### Using XPath

In [None]:
sel.xpath('//p')

In [None]:
sel.xpath('/html/body/div/p[2]').get()              # '<p> HTML is Awesome </p>'

In [None]:
sel.xpath('/html/body/div/p[2]/text()').extract()   # [' HTML is Awesome ']

In [None]:
sel.xpath('/html/head/title').extract()     # ['<title> My First Web Page </title>']

In [None]:
sel.xpath('/html').xpath('./body/div/p[1]')
# sel.xpath('/html/body/div/p[1]')

##### Using CSS Locators

In [None]:
sel.css("div > p")

In [None]:
sel.css("div > p").extract()

In [None]:
sel.css("title")[0].get()

In [None]:
sel.css("title::text")[0].get()

##### Using Scrapy Shell

- Getting Web Data


    - Open Anaconda Prompt

    - Let's open the scrapy debugger
            - scrapy shell
    
    - Now, let's fetch a specific URL
            - fetch("https://wuzzuf.net/search/jobs/?q=illustrator&a=hpb")
            
    - How about viewing this fetched data?!
            - view(response)


- Parsing HTML


    - Using css selectors to select the title of the page
            - response.css("title")
            - response.css("title::text")
            - response.css("title::text")[0].get()
            - response.css("title::text").get()
            - response.css("title::text").getall()
            

- Scrapy Project


    - Start a new project in scrapy, after entering the preferred folder
            - cd C:\Users\musta\OneDrive\Desktop
            - scrapy startproject wuzzuf
    
    - Go to the spiders directory (startproject creates wuzzuf/wuzzuf/spiders)
            - cd wuzzuf
            - cd wuzzuf\spiders
            
    - Create a spider for crawling (genspider takes a spider name and a domain; put the full search URL in start_urls inside the generated file)
            - scrapy genspider jobs wuzzuf.net
              
    - Let's define a parse() function in the spider python file that parses our web page
            - def parse(self, response):
                print("\n")
                print("HTTP STATUS: "+str(response.status))
                print(response.css("title::text").get())
                print("\n")
                         
    - Test the crawler
            - scrapy crawl jobs
            
            
    - Inspecting a specific element in the web page
            - job_titles = response.css("h2.css-m604qf")
            - job_titles[0]
            
    - Complete the process as you have already learnt in 'Wuzzuf' BeautifulSoup Case-study

### Requests

In [None]:
import requests as rq

In [None]:
url = 'https://en.wikipedia.org/wiki/Aamir_Khan_filmography'

In [None]:
html = rq.get(url).text   # .text is a decoded str, which Selector(text=...) expects; .content is raw bytes

In [None]:
html
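
A slightly more defensive way to download a page is sketched below. The `fetch_html` helper and the User-Agent string are hypothetical names chosen for this example; `timeout`, `headers`, and `raise_for_status()` are standard parts of the Requests API.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the decoded HTML of `url`, raising an exception on HTTP errors."""
    # Hypothetical User-Agent; identify your scraper honestly.
    headers = {"User-Agent": "pandas-course-scraper/0.1"}
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()   # raises requests.HTTPError for 4xx/5xx
    return response.text          # decoded str (response.content is bytes)
```

For example, `html = fetch_html('https://en.wikipedia.org/wiki/Aamir_Khan_filmography')` returns a string that can be passed straight to a parser.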

##### Using Scrapy Selectors for Parsing

In [None]:
sel = Selector(text=html)

In [None]:
sel.xpath('//table').extract()

### BeautifulSoup

In [None]:
from bs4 import BeautifulSoup
import requests

url = 'https://simple.wikipedia.org/wiki/Computer_science'
response = requests.get(url)
html_doc = response.text
soup = BeautifulSoup(html_doc, 'html.parser')   # Name the parser explicitly to avoid a warning and parser-dependent results

In [None]:
# Prettified Soup (print it so the indentation is rendered)
print(soup.prettify())

In [None]:
# Exploring Beautiful Soup
soup.get_text()

In [None]:
for link in soup.find_all('a'):
    print(link.get('href'))
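
The loop above prints `None` for anchors without an `href` and leaves relative links unresolved. A minimal sketch of one way to handle both, assuming BeautifulSoup is installed; `SAMPLE_HTML` is a hypothetical fragment standing in for the downloaded page.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://simple.wikipedia.org/wiki/Computer_science"

# Hypothetical fragment standing in for the downloaded page.
SAMPLE_HTML = """
<p>
  <a href="/wiki/Algorithm">Algorithm</a>
  <a href="https://example.com/page">External</a>
  <a name="top">Anchor without href</a>
</p>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# href=True keeps only <a> tags that actually have an href attribute;
# urljoin turns relative links into absolute ones based on the page URL.
links = [urljoin(BASE_URL, a["href"]) for a in soup.find_all("a", href=True)]

for link in links:
    print(link)
```

The resulting list can be loaded into pandas with `pd.DataFrame({"url": links})` for further filtering.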

==========

# THANK YOU!