by Graham Lim

# 1. Webscraping

Run the following `pip install` commands in a terminal or command prompt:

* `pip install bs4` (for BeautifulSoup)
* `pip install lxml` (for the `lxml` parser that BeautifulSoup is asked to use below)
* `pip install selenium` (for Selenium)
* `pip install webdriver-manager` (so that Selenium can download and manage the Chrome driver automatically)

In [1]:
#Standard Python DS imports:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
#widen the column display so that long text is not truncated
pd.set_option("display.max_colwidth", 1000)

We use `Selenium` because the full list of articles is not available in a single page load: the results are split across several pages that are reached through the pagination links.

Hence, we import `Selenium` and the related `WebDriver Manager` tool to run a Chrome instance that clicks through the pages for us, so that we don't have to do this manually.

In [17]:
#Selenium and WebDriver Manager imports:

from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup

import time
from selenium.webdriver.common.keys import Keys

In [5]:
#assign the URL we want to scrape:

#this is our target page - the Food/Rice keyword listing on FoodNavigator-Asia:
url = "https://www.foodnavigator-asia.com/tag/keyword/Food/Rice"

## FoodNavigator-Asia Scraper

We then write a scraping loop that collects the date, title, intro text and link of every article teaser on the listing page at `url`, clicks through to the next results page, and repeats.

After each click the code pauses for a couple of seconds before reading the page, so that the new page has time to load and the site doesn't get overwhelmed with too many requests.

In [58]:
from selenium.common.exceptions import StaleElementReferenceException
from selenium.webdriver.common.by import By

# Launch Chrome with the driver binary that WebDriver Manager downloads or finds in its cache.
# Note: newer Selenium 4 releases expect the driver path to be wrapped in a Service object.
driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get(url)
time.sleep(5)  # give the first page time to render

# The pagination links tell us how many result pages there are.
driver_pages = driver.find_elements(By.CLASS_NAME, 'Pagination-item')

intro_list = []
title_list = []
date_list = []
post_url_list = []

for i in range(1, len(driver_pages)):
    # Parse the currently loaded page so the article links can be read from the HTML.
    soup = BeautifulSoup(driver.page_source, 'lxml')

    for intro in driver.find_elements(By.CLASS_NAME, "Teaser-intro"):
        intro_list.append(intro.text)

    for title in driver.find_elements(By.CLASS_NAME, "Teaser-title"):
        title_list.append(title.text)

    for date in driver.find_elements(By.CLASS_NAME, "Teaser-date"):
        date_list.append(date.text)

    # The hrefs are relative, so prefix them with the site host.
    for linkdiv in soup.find_all('h3', {'class': 'Teaser-title'}):
        post_url_list.append('www.foodnavigator-asia.com' + linkdiv.find('a')['href'])

    try:
        # Click the next pagination link via JavaScript, then wait for the page to load.
        driver.execute_script("arguments[0].click();", driver_pages[i])
        time.sleep(2)

        # Element references go stale after navigation, so look the links up again.
        driver_pages = driver.find_elements(By.CLASS_NAME, 'Pagination-item')

    except StaleElementReferenceException:
        pass

driver.quit()

[WDM] - Current google-chrome version is 84.0.4147
[WDM] - Get LATEST driver version for 84.0.4147
[WDM] - Driver [/Users/grahamlim/.wdm/drivers/chromedriver/mac64/84.0.4147.30/chromedriver] found in cache


 


In [36]:
len(post_url_list)

75

In [38]:
#checking scraped length
print (len(intro_list))
print (len(title_list))
print (len(date_list))
print (len(post_url_list))

75
75
75
75
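The four lists are scraped independently, so equal lengths are what allow them to be lined up row by row. As a minimal sketch, the same check can be made to fail loudly instead of being read off by eye. The helper name `check_equal_lengths` and the toy data are hypothetical and not part of the original notebook.

```python
def check_equal_lengths(**columns):
    """Return each list's length, raising ValueError if they are not all equal."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Scraped lists are misaligned: {lengths}")
    return lengths


# Toy data standing in for the scraped lists.
toy_lengths = check_equal_lengths(
    date=["20-Jul-2020", "25-Jun-2020"],
    title=["First title", "Second title"],
)
print(toy_lengths)
```

In the notebook this would be called with `date=date_list, title=title_list, intro=intro_list, url=post_url_list` before building the DataFrame.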


In [39]:
#Convert the scraped lists into a labelled pandas DataFrame and check its `shape`:

df = pd.DataFrame({'date': date_list,
                   'title': title_list,
                   'intro': intro_list,
                   'url': post_url_list
                  })

df.shape

(75, 4)

In [40]:
df.head()

,date,title,intro,url
0,20-Jul-2020,Legacy vs price: Rice exports from Vietnam and India vie for ASEAN trade post-COVID-19,"Vietnam and India are competing for ASEAN rice trade post-COVID-19, with the former having gained advantage due to support from traditional partner Philippines, and the latter having come out ahead with Malaysia in terms of price.",www.foodnavigator-asia.com/Article/2020/07/20/Legacy-vs-price-Rice-exports-from-Vietnam-and-India-vie-for-ASEAN-trade-post-COVID-19
1,25-Jun-2020,Trade-off: Rice and seafood the big winners for Vietnam under EU free trade deal,"The recently-ratified EU-Vietnam Free Trade Agreement (EVFTA) has seen major gains in food trade for both sides, particularly in rice and seafood for Vietnam as well as alcohol and meat for the EU.",www.foodnavigator-asia.com/Article/2020/06/25/Trade-off-Rice-and-seafood-the-big-winners-for-Vietnam-under-EU-free-trade-deal
2,22-Apr-2020,Rice-ing concern: COVID-19 creates supply and price volatility for Asia’s most ‘cost-sensitive’ crop,Lockdowns and trade barriers across Asia due to the COVID-19 pandemic have thrust rice - one of the region’s largest agricultural commodities – firmly into the spotlight potential volatility in both supply and cost.,www.foodnavigator-asia.com/Article/2020/04/22/Rice-ing-concern-COVID-19-creates-supply-and-price-volatility-for-Asia-s-most-cost-sensitive-crop
3,17-Mar-2020,The rice and fall: Vietnam eyes more global opportunities as Thai supply totters,"Thailand’s position as the largest exporter of rice from the South East Asian region is at risk as the country struggles to handle weather, economical and quality changes, whereas Vietnam looks to be going from strength to strength.",www.foodnavigator-asia.com/Article/2020/03/17/The-rice-and-fall-Vietnam-eyes-more-global-opportunities-as-Thai-supply-totters
4,25-Feb-2020,Beyond rice: Indian government urged to boost cereal production in security drive,"Researchers are pushing for India to focus on enhancing its production of crops other than rice, the country’s traditional staple, if it wishes to effectively address its triple threats of food security, climate change effects and malnutrition.",www.foodnavigator-asia.com/Article/2020/02/25/Beyond-rice-Indian-government-urged-to-boost-cereal-production-in-security-drive
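The values in the `url` column have no scheme, so they cannot be opened or requested directly. One possible way to build full links is to resolve each relative `href` against the site root with the standard library's `urljoin`. This is a sketch only: `BASE_URL` and `to_absolute` are hypothetical names, and the `https` scheme is assumed from the `url` assigned earlier.

```python
from urllib.parse import urljoin

# Assumption: the site is served over HTTPS from this host.
BASE_URL = "https://www.foodnavigator-asia.com"


def to_absolute(href, base=BASE_URL):
    """Resolve a relative href such as '/Article/...' against the site root."""
    return urljoin(base, href)


example = to_absolute(
    "/Article/2020/02/25/Beyond-rice-Indian-government-urged-to-boost-"
    "cereal-production-in-security-drive"
)
print(example)
```

`urljoin` leaves an already absolute link unchanged, so the helper is safe to apply even if some teasers link to full URLs.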


In [43]:
cd /Users/grahamlim/Documents/FSX_and_Trainer_Stuff/scrape_risk_app/data/

/Users/grahamlim/Documents/FSX_and_Trainer_Stuff/scrape_risk_app/data


In [44]:
pwd

'/Users/grahamlim/Documents/FSX_and_Trainer_Stuff/scrape_risk_app/data'

In [45]:
df.to_csv("../data/df_raw.csv")
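By default `to_csv` also writes the DataFrame index, which comes back as an extra `Unnamed: 0` column when the file is read again. The sketch below shows the difference on a toy DataFrame, using in-memory buffers instead of files. The toy data and variable names are illustrative only.

```python
import io

import pandas as pd

toy = pd.DataFrame({"date": ["20-Jul-2020"], "title": ["Example title"]})

# Default behaviour: the index is written as an extra, unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False: only the real columns are written.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_without_index = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))
print(list(reloaded_without_index.columns))
```

Whether to pass `index=False` when saving `df_raw.csv` depends on whether the row numbers are needed later. Both options are valid.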

# 2. Cleaning and Labelling

In [47]:
#checking for null values

df.isnull().sum()

date     0
title    0
intro    0
url      0
dtype: int64
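The `date` column is still plain text in the form `20-Jul-2020`. If the articles need to be sorted or filtered by date, one option is to parse the strings first. The sketch below uses only the standard library. `DATE_FORMAT` and `parse_teaser_date` are hypothetical names, and English month abbreviations are assumed.

```python
from datetime import datetime

# Day, abbreviated month name, four-digit year, e.g. "20-Jul-2020".
# Assumption: month abbreviations are in English (the default C locale).
DATE_FORMAT = "%d-%b-%Y"


def parse_teaser_date(text):
    """Convert a scraped date string into a datetime.date object."""
    return datetime.strptime(text.strip(), DATE_FORMAT).date()


sample_dates = ["20-Jul-2020", "25-Jun-2020", "22-Apr-2020"]
parsed = [parse_teaser_date(d) for d in sample_dates]
print(parsed[0].isoformat())
```

For the whole column at once, `pd.to_datetime(df["date"], format="%d-%b-%Y")` should give the same result.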

In [53]:
#combine the title and intro text into a single column

df["combined"] = df["title"] + ". " + df["intro"]

df["combined"]

0              Legacy vs price: Rice exports from Vietnam and India vie for ASEAN trade post-COVID-19. Vietnam and India are competing for ASEAN rice trade post-COVID-19, with the former having gained advantage due to support from traditional partner Philippines, and the latter having come out ahead with Malaysia in terms of price.
1                                                     Trade-off: Rice and seafood the big winners for Vietnam under EU free trade deal. The recently-ratified EU-Vietnam Free Trade Agreement (EVFTA) has seen major gains in food trade for both sides, particularly in rice and seafood for Vietnam as well as alcohol and meat for the EU.
2               Rice-ing concern: COVID-19 creates supply and price volatility for Asia’s most ‘cost-sensitive’ crop. Lockdowns and trade barriers across Asia due to the COVID-19 pandemic have thrust rice - one of the region’s largest agricultural commodities – firmly into the spotlight potential volatility in both supply and cost

In [56]:
#add a placeholder risk_rating column (all 0), to be filled in by manual annotation
df["risk_rating"]=0

In [57]:
df.head()

,date,title,intro,url,combined,risk_rating
0,20-Jul-2020,Legacy vs price: Rice exports from Vietnam and India vie for ASEAN trade post-COVID-19,"Vietnam and India are competing for ASEAN rice trade post-COVID-19, with the former having gained advantage due to support from traditional partner Philippines, and the latter having come out ahead with Malaysia in terms of price.",www.foodnavigator-asia.com/Article/2020/07/20/Legacy-vs-price-Rice-exports-from-Vietnam-and-India-vie-for-ASEAN-trade-post-COVID-19,"Legacy vs price: Rice exports from Vietnam and India vie for ASEAN trade post-COVID-19. Vietnam and India are competing for ASEAN rice trade post-COVID-19, with the former having gained advantage due to support from traditional partner Philippines, and the latter having come out ahead with Malaysia in terms of price.",0
1,25-Jun-2020,Trade-off: Rice and seafood the big winners for Vietnam under EU free trade deal,"The recently-ratified EU-Vietnam Free Trade Agreement (EVFTA) has seen major gains in food trade for both sides, particularly in rice and seafood for Vietnam as well as alcohol and meat for the EU.",www.foodnavigator-asia.com/Article/2020/06/25/Trade-off-Rice-and-seafood-the-big-winners-for-Vietnam-under-EU-free-trade-deal,"Trade-off: Rice and seafood the big winners for Vietnam under EU free trade deal. The recently-ratified EU-Vietnam Free Trade Agreement (EVFTA) has seen major gains in food trade for both sides, particularly in rice and seafood for Vietnam as well as alcohol and meat for the EU.",0
2,22-Apr-2020,Rice-ing concern: COVID-19 creates supply and price volatility for Asia’s most ‘cost-sensitive’ crop,Lockdowns and trade barriers across Asia due to the COVID-19 pandemic have thrust rice - one of the region’s largest agricultural commodities – firmly into the spotlight potential volatility in both supply and cost.,www.foodnavigator-asia.com/Article/2020/04/22/Rice-ing-concern-COVID-19-creates-supply-and-price-volatility-for-Asia-s-most-cost-sensitive-crop,Rice-ing concern: COVID-19 creates supply and price volatility for Asia’s most ‘cost-sensitive’ crop. Lockdowns and trade barriers across Asia due to the COVID-19 pandemic have thrust rice - one of the region’s largest agricultural commodities – firmly into the spotlight potential volatility in both supply and cost.,0
3,17-Mar-2020,The rice and fall: Vietnam eyes more global opportunities as Thai supply totters,"Thailand’s position as the largest exporter of rice from the South East Asian region is at risk as the country struggles to handle weather, economical and quality changes, whereas Vietnam looks to be going from strength to strength.",www.foodnavigator-asia.com/Article/2020/03/17/The-rice-and-fall-Vietnam-eyes-more-global-opportunities-as-Thai-supply-totters,"The rice and fall: Vietnam eyes more global opportunities as Thai supply totters. Thailand’s position as the largest exporter of rice from the South East Asian region is at risk as the country struggles to handle weather, economical and quality changes, whereas Vietnam looks to be going from strength to strength.",0
4,25-Feb-2020,Beyond rice: Indian government urged to boost cereal production in security drive,"Researchers are pushing for India to focus on enhancing its production of crops other than rice, the country’s traditional staple, if it wishes to effectively address its triple threats of food security, climate change effects and malnutrition.",www.foodnavigator-asia.com/Article/2020/02/25/Beyond-rice-Indian-government-urged-to-boost-cereal-production-in-security-drive,"Beyond rice: Indian government urged to boost cereal production in security drive. Researchers are pushing for India to focus on enhancing its production of crops other than rice, the country’s traditional staple, if it wishes to effectively address its triple threats of food security, climate change effects and malnutrition.",0


In [None]:
#manually annotate