# Machine Learning for Data Science: Scraping Framework

---

## Web Scraping with Beautiful Soup

### What is Beautiful Soup?


Beautiful Soup is a Python library for parsing HTML. Its central idea is that you reach the elements of a page by following the page's tag and CSS class structure. Once the elements are selected, Python makes it easy to write them, or the relevant parts of them, to other files such as a CSV, which can then be loaded into a database or opened in other software.
<br>
__First__, we turn the website's code into a Python object: `response.text` holds the raw HTML, and Beautiful Soup turns that text into a Python object named `soup`. <br>
__Second__, we choose the parser that Beautiful Soup uses to read the text. Here it is Python's built-in parser, selected by passing `"html.parser"`.

#### Step 1. Install Beautiful Soup and Requests<br>
Beautiful Soup 4 is published on PyPI, so you can install it with pip. Requests is installed the same way. <br>
> _pip install beautifulsoup4 requests_

In [None]:
import requests
import bs4
import re

#### Step 2 and Step 3 
- Find the URL you want to scrape 
- Define what information you want 

Accident Data Center (ศูนย์ข้อมูลอุบัติเหตุ, Thai RSC): www.thairsc.com/th/BigAccidentAll.aspx?l=th

In [None]:
response = requests.get('http://www.thairsc.com/th/BigAccidentAll.aspx?l=th')
#print(response.text)

#### Step 4. Identify the structure of the site's HTML

In [None]:
soup = bs4.BeautifulSoup(response.text, "html.parser")
print(soup.prettify()) 

#### Step 5. Write the scraping code 
Extract the data from the structure of the site's HTML

In [None]:
soup.title

#### Step 6. Extract the information from the “soup”

In [None]:
# find_all is the Beautiful Soup 4 name; findAll is the older alias
data = soup.find_all(attrs={'class': 'text-detail'})
data[1].string
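
A small offline sketch of this step may help. It parses a made-up HTML fragment (the markup and text are illustrative, not taken from the site) and shows one behavior worth knowing: `.string` returns `None` when a tag contains more than one child node, whereas `.get_text()` joins all the text inside the tag.

```python
from bs4 import BeautifulSoup

# Made-up fragment; only the class name "text-detail" mirrors the cell above.
html = """
<div class="text-detail">Accident on 12/11/2017 at 08.30</div>
<div class="text-detail">Accident <b>bus</b> on 13/11/2017</div>
"""

soup_demo = BeautifulSoup(html, "html.parser")
items = soup_demo.find_all(attrs={"class": "text-detail"})

# .string works only when the tag has exactly one child node.
strings = [tag.string for tag in items]
# .get_text() concatenates every piece of text inside the tag.
texts = [tag.get_text() for tag in items]

print(strings)
print(texts)
```

If some entries on the real page contain nested tags, `.get_text()` is the safer choice for feeding text to the regular expressions.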

In [None]:
import pandas as pd

acc_text = []
acc_day = []
acc_month = []
acc_year = []
acc_time = []
acc_addr1 = []
acc_addr2 = []

#### 1. Pattern: accident report text

In [None]:
re_text = re.compile(r"อุบัติเหตุ\s*[\"“]*([\s\w\-\.ก-๙]+)[\"”]*")

#### 2. Pattern: accident date

In [None]:
re_date = re.compile(r"วันที่\s*(\d+)[/-](\d+)[/-](\d+)")

#### 3. Pattern: time of the accident

In [None]:
re_time = re.compile(r"(เวลา|เวลาประมาณ)\s*(\d+\.\d+)\s*")

#### 4. Pattern: accident location (district and province)

In [None]:
re_addr = re.compile(r"(อ\.|อำเภอ|เขต)\s*([ก-๙]+)\s*(จ\.|จังหวัด)\s*([ก-๙]+)")
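
To see what each pattern captures, the sketch below applies all four to a single made-up report line. The sample sentence is an assumption written to contain every field; real entries on the site may be laid out differently. The Thai literals in the patterns mean: อุบัติเหตุ "accident", วันที่ "date", เวลา "time", เวลาประมาณ "at about", อ. / อำเภอ / เขต "district", จ. / จังหวัด "province". The range `[ก-๙]` spans the Thai consonants, vowel signs, tone marks, and Thai digits.

```python
import re

# Same four patterns as in the cells above.
re_text = re.compile(r"อุบัติเหตุ\s*[\"“]*([\s\w\-\.ก-๙]+)[\"”]*")
re_date = re.compile(r"วันที่\s*(\d+)[/-](\d+)[/-](\d+)")
re_time = re.compile(r"(เวลา|เวลาประมาณ)\s*(\d+\.\d+)\s*")
re_addr = re.compile(r"(อ\.|อำเภอ|เขต)\s*([ก-๙]+)\s*(จ\.|จังหวัด)\s*([ก-๙]+)")

# Made-up report line, written to contain every field the patterns look for.
sample = 'อุบัติเหตุ "รถบัสชนรถกระบะ" วันที่ 12/11/2560 เวลา 08.30 น. อ.เมือง จ.ลำปาง'

day, month, year = re_date.search(sample).groups()
time_of_day = re_time.search(sample).group(2)   # group 1 is the "time" keyword
addr = re_addr.search(sample)
district, province = addr.group(2), addr.group(4)
headline = re_text.search(sample).group(1)      # text between the quotes

print(headline, day, month, year, time_of_day, district, province)
```

Note that the headline pattern stops at the closing quote here. On a line without quotes, its character class also accepts whitespace and digits, so it may capture more than the headline.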

#### Extract the information with the patterns

In [None]:
for i in range(1, len(data)):
    mdate = re_date.search(data[i].string)
    acc_day.append(mdate.group(1))
    acc_month.append(mdate.group(2))
    acc_year.append(mdate.group(3))
    
    mtime = re_time.search(data[i].string)
    acc_time.append(mtime.group(2))
    
    maddr = re_addr.search(data[i].string)
    acc_addr1.append(maddr.group(2))
    acc_addr2.append(maddr.group(4))
    
    mtext = re_text.search(data[i].string)
    acc_text.append(mtext.group(1))
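
The loop above assumes every entry matches every pattern. `pattern.search` returns `None` when nothing matches, and `.string` can itself be `None`, so a single irregular entry would stop the loop with an error. One possible guard is sketched below; the helper name `group_or_none` and the sample lines are hypothetical.

```python
import re

re_date = re.compile(r"วันที่\s*(\d+)[/-](\d+)[/-](\d+)")
re_time = re.compile(r"(เวลา|เวลาประมาณ)\s*(\d+\.\d+)\s*")


def group_or_none(pattern, text, index):
    """Return one capture group, or None if text is missing or has no match."""
    if text is None:
        return None
    match = pattern.search(text)
    return match.group(index) if match else None


# Hypothetical inputs: a complete line, a line with no time, and a missing string.
lines = ["วันที่ 12/11/2560 เวลา 08.30 น.", "วันที่ 13-11-2560", None]

days = [group_or_none(re_date, line, 1) for line in lines]
times = [group_or_none(re_time, line, 2) for line in lines]

print(days)
print(times)
```

Appending `None` for a missing field keeps all the lists the same length, which `pd.DataFrame` requires when it is built from a dictionary of lists.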

#### Write the information into a DataFrame

In [None]:
acc_data = {'acc_text': acc_text, 
        'acc_day': acc_day, 
        'acc_month': acc_month, 
        'acc_year': acc_year, 
        'acc_time': acc_time, 
        'acc_add1':acc_addr1, 
        'acc_add2':acc_addr2}

df = pd.DataFrame(acc_data)
df

---

### Challenge: Thai RSC Accident Data Center in detail

In [None]:
response = requests.get('http://www.thairsc.com/th/BigAccDetail.aspx?qid=47053&l=th')
soup_2 = bs4.BeautifulSoup(response.text, "html.parser")
data_detail = soup_2.find('span', class_ = 'detail')
print(data_detail.text)

In [None]:
## Your practice homework
## Code here

<hr style="color: black">
<center>_Please return to the slides_</center>
<hr style="color: black">

## Write the DataFrame to SQLite

In [None]:
import sqlite3
conn = sqlite3.connect("accident.db")
df.to_sql("accident", conn, if_exists="replace")
# https://www.dataquest.io/blog/python-pandas-databases/

#### Select all data from the database table (named accident)

In [None]:
data_all = pd.read_sql_query("select * from accident;", conn)

In [None]:
data_all.head()

#### Count the number of accidents that occurred in month 11

In [None]:
accident_11 = pd.read_sql_query("select count(*) from accident where acc_month=11;", conn)
accident_11
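
When the month comes from a variable rather than being typed into the SQL, a `?` placeholder is the usual way to pass it. The sketch below uses only the standard `sqlite3` module and a throwaway in-memory table with made-up rows; the column names mirror the DataFrame above, and the values are stored as text because the scraped regex groups are strings. The helper name `count_for_month` is hypothetical.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")  # throwaway in-memory database
demo_conn.execute(
    "CREATE TABLE accident (acc_day TEXT, acc_month TEXT, acc_year TEXT)"
)
# Made-up rows for illustration only.
demo_conn.executemany(
    "INSERT INTO accident VALUES (?, ?, ?)",
    [("12", "11", "2560"), ("13", "11", "2560"), ("02", "12", "2560")],
)


def count_for_month(connection, month):
    """Count rows whose acc_month equals the given text value."""
    row = connection.execute(
        "SELECT COUNT(*) FROM accident WHERE acc_month = ?", (month,)
    ).fetchone()
    return row[0]


november = count_for_month(demo_conn, "11")
december = count_for_month(demo_conn, "12")
print(november, december)
demo_conn.close()
```

`pd.read_sql_query` accepts the same placeholders through its `params` argument, so the pandas query above can be parameterized in the same way.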

<hr style="color: black">
<center>_Please return to the slides_</center>
<hr style="color: black">

## Twitter Search API

In [None]:
# pip install python-twitter
# https://python-twitter.readthedocs.io/en/latest/installation.html
# https://www.alexkras.com/how-to-get-user-feed-with-twitter-api-and-python/
# Go to https://dev.twitter.com/apps/new and log in to get the keys and secrets

In [None]:
# Import the module itself so that twitter.Api below resolves
import twitter

#### a) Twitter Authentication API

In [None]:
api = twitter.Api(
 consumer_key='xxxxxxxxxxxxxxxxxxxx',
 consumer_secret='xxxxxxxxxxxxxxxxxxxx',
 access_token_key='xxxxxxxxxxxxxxxxxxxx',
 access_token_secret='xxxxxxxxxxxxxxxxxxxx'
 )
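
Rather than pasting real keys into the notebook, one option is to read them from environment variables. The following is a minimal sketch of that idea; the environment variable names and the helper `load_credentials` are hypothetical, while the dictionary keys match the keyword arguments used in the cell above.

```python
import os

# Hypothetical variable names; use whatever names you export in your shell.
ENV_NAMES = {
    "consumer_key": "TWITTER_CONSUMER_KEY",
    "consumer_secret": "TWITTER_CONSUMER_SECRET",
    "access_token_key": "TWITTER_ACCESS_TOKEN_KEY",
    "access_token_secret": "TWITTER_ACCESS_TOKEN_SECRET",
}


def load_credentials(environ=os.environ):
    """Collect the four credentials, failing early if any is missing."""
    missing = [name for name in ENV_NAMES.values() if not environ.get(name)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return {arg: environ[name] for arg, name in ENV_NAMES.items()}


# Demonstration with a fake environment instead of real secrets.
fake_env = {name: "xxxx" for name in ENV_NAMES.values()}
credentials = load_credentials(fake_env)
print(sorted(credentials))
```

The resulting dictionary can be passed as keyword arguments to the `Api` constructor above, which keeps the secrets out of the saved notebook.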

#### b) Search by term

In [None]:
# 'รถชน' means "car crash"; the standard search API returns at most 100 tweets per request
search = api.GetSearch(term='รถชน', lang='th', result_type='recent', count=100)
for t in search:
    print(t.user.screen_name + ' (' + t.created_at + ')')
    print(t.text)
    #Add the .encode to force encoding
    #print(t.text.encode('utf-8'))
    print('')

In [None]:
screen_name=[]
created_at=[]
twitter_text=[]
# 'อุบัติเหตุ' means "accident"; the standard search API returns at most 100 tweets per request
search = api.GetSearch(term='อุบัติเหตุ', lang='th', result_type='recent', count=100)
for t in search:
    screen_name.append(t.user.screen_name)
    created_at.append(t.created_at)
    twitter_text.append(t.text)
    #Add the .encode to force encoding
    #print(t.text.encode('utf-8'))

tw_data = {'screen_name': screen_name, 
        'created_at': created_at, 
        'twitter_text': twitter_text}    
tw_df = pd.DataFrame(tw_data)
tw_df.head()
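
The `created_at` values collected above are plain strings. If they are needed as timestamps, for example to group tweets by day, they can be parsed with the standard library. This sketch assumes the strings follow the layout used by the v1.1 API, such as `Wed Aug 27 13:08:45 +0000 2008`, and that Python runs with its default English day and month names; the helper name `parse_created_at` is hypothetical.

```python
from datetime import datetime, timezone

# Assumed layout of created_at, e.g. "Wed Aug 27 13:08:45 +0000 2008".
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def parse_created_at(value):
    """Convert a created_at string into a timezone-aware datetime."""
    return datetime.strptime(value, TWITTER_TIME_FORMAT)


parsed = parse_created_at("Wed Aug 27 13:08:45 +0000 2008")
print(parsed.isoformat())
```

Applied to the list above, `[parse_created_at(s) for s in created_at]` would give a column of datetimes that can be stored in the DataFrame in place of the strings.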