# Web Scraping & NLP - Central London Data Science Project Nights
Instead of relying on Excel sheets and database admins to give us the data we need for our data science, we can take the task into our own hands and collect the data ourselves, straight from web pages.

![see the data](images/neo.gif)
<p style="text-align:center">Stop seeing web pages and start seeing data.</p>

### First let's see what version of Python we are on

In [1]:
import sys
if sys.version_info[0] == 3:
    print("Great! Python 3! Let's get on with scraping!")
else:
    print('Yikes! Python 2! This may not work for you!')

Great! Python 3! Let's get on with scraping!


## Import the libraries for doing our scraping

In [2]:
# 'requests' is what we use to send web requests (to fetch the html files from websites)
import requests

# beautiful-soup will help us in navigating through the html in an easy way to find just the text we care about
from bs4 import BeautifulSoup

Now let's decide which page we want to scrape. We'll do https://techcrunch.com/ first. Open the page in your browser (by clicking on the link) to see the visual structure of the page.

In [47]:
WEB_PAGE_TO_SCRAPE_URL = "https://techcrunch.com/"

In [48]:
# send request for the web page
response = requests.get(WEB_PAGE_TO_SCRAPE_URL)
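
A bare `requests.get` call works fine for a demo, but it will wait indefinitely on a slow server and will happily hand back an error page. Below is a minimal sketch of a slightly more defensive fetch. The helper name `fetch_html` and the User-Agent string are made up for illustration; they are not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Fetch a page and return its HTML text, raising on HTTP errors."""
    # Identify ourselves to the server. This value is a hypothetical placeholder.
    headers = {"User-Agent": "project-night-scraper/0.1 (learning exercise)"}

    # `timeout` (in seconds) stops the call from hanging if the server is slow.
    response = requests.get(url, headers=headers, timeout=timeout)

    # Turn 4xx/5xx responses into an exception instead of silently
    # treating an error page as if it were the real content.
    response.raise_for_status()
    return response.text
```

With this in place, `html = fetch_html(WEB_PAGE_TO_SCRAPE_URL)` could stand in for `response.text` in the cells that follow.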

In [9]:
# let's look at some of the raw text (the html), more specifically the first 500 characters
response.text[:500]

'<!DOCTYPE html>\n<html xmlns="http://www.w3.org/1999/xhtml" xmlns:og="http://opengraphprotocol.org/schema/" xmlns:fb="http://www.facebook.com/2008/fbml" lang="en">\n<head>\n\t<title>TechCrunch - The latest technology news and information on startups</title>\n\t<meta http-equiv="X-UA-Compatible" content="IE=Edge" />\n\t<meta charset="UTF-8">\n\t\t<meta name="p:domain_verify" content="6189ff68ce30e30f12b40b3b40873027"/>\n\t<meta name="HandheldFriendly" content="True">\n\t<meta name="MobileOptimized" content="320'

In [10]:
souped_page = BeautifulSoup(response.text, 'html.parser')

In [14]:
souped_page.find('title')

<title>TechCrunch - The latest technology news and information on startups</title>

In [16]:
souped_page.find('title').getText()

'TechCrunch - The latest technology news and information on startups'
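
To see what `find` and `getText` are doing without depending on a live site, here is a small sketch that runs the same calls on a hand-written page. The HTML string and variable names are invented for this example.

```python
from bs4 import BeautifulSoup

# A tiny hand-written page standing in for a real response.text
SAMPLE_HTML = """
<html>
  <head><title>Example Page</title></head>
  <body>
    <h1>Hello</h1>
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </body>
</html>
"""

sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find() returns the first matching Tag object
title_tag = sample_soup.find("title")

# getText() (also spelled get_text()) drops the tags and returns a plain string
title_text = title_tag.getText()

# find() returns None when nothing matches, so check before calling getText()
missing_tag = sample_soup.find("footer")
footer_text = missing_tag.getText() if missing_tag is not None else ""

print(title_text)
print(repr(footer_text))
```

The `None` check matters on real pages: sites change their markup often, and calling `.getText()` on a missing element raises an `AttributeError`.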

## Use your browser's 'inspect' to help find the element you want to scrape

Most modern browsers allow you to find the exact code for the part of the webpage you are looking at.

In Chrome: right-click on the part of the page and select *'Inspect'*

![inspect element](images/inspect.png)



---



### Chrome will highlight the related part of the webpage as you move your mouse over the code

![find](images/element_find.png)

## BeautifulSoup query syntax

In [36]:
list_of_articles = souped_page.find('ul',{"id": "river1"})

In [44]:
for a in souped_page.find_all('li', {'class':['river-block ']}):
    print(a['data-sharetitle'])

Apple moves iCloud encryption keys for Chinese users to China
GoBee Bike throws in the towel in France
Samsung MWC 2018 Liveblog
LG turns to EyeEm to add AI to its cameras
The curious case of the LG V30S ThinQ
This is the Samsung Galaxy S9 launch video
Liquid democracy uses blockchain to fix politics, and now you can vote for it
The Rite Press takes low-tech coffee making to high-tech highs
Equity shot: Dropbox is going public, and Aaron Levie has some advice
The FCC’s revamped internet speed map lets you covet nearby exotic broadband
Veil is private browsing for the ultra-paranoid
The Dropbox IPO filing is here
Twitch’s first live game show ‘Stream On’ debuts March 8
Can Ghostbusters copy Pokémon GO’s success with its own AR mobile game?
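
The two cells above use one of several equivalent ways to query a soup. As a rough sketch, here are three common styles side by side on a made-up snippet shaped like the list we just scraped. The HTML and the headline values are invented, and the real TechCrunch markup has certainly changed since this notebook was run.

```python
from bs4 import BeautifulSoup

SAMPLE_HTML = """
<ul id="river1">
  <li class="river-block" data-sharetitle="First headline">...</li>
  <li class="river-block featured" data-sharetitle="Second headline">...</li>
  <li class="ad-block">Sponsored</li>
</ul>
"""
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# 1. Tag name plus a dictionary of attributes (the style used in this notebook)
article_list = soup.find("ul", {"id": "river1"})

# 2. The class_ keyword (with an underscore, because `class` is reserved in Python).
#    It matches any tag that has "river-block" among its classes.
blocks = article_list.find_all("li", class_="river-block")

# 3. A CSS selector, the same syntax you would use in a stylesheet
selected = soup.select("ul#river1 li.river-block")

# Attributes are read like dictionary keys; .get() returns None if one is absent
titles = [li["data-sharetitle"] for li in blocks]
ad_title = soup.find("li", class_="ad-block").get("data-sharetitle")

print(titles)
print(ad_title)
```

Which style to use is mostly taste. CSS selectors are handy because you can often copy them straight from the browser's inspect panel.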


## Let's grab an article

In [45]:
ARTICLE_URL = 'https://techcrunch.com/2018/02/25/gobee-bike-throws-in-the-towel-on-france/'

In [49]:
article_response = requests.get(ARTICLE_URL)

In [50]:
article_response.text[:500]

'<!DOCTYPE html>\n<html xmlns="http://www.w3.org/1999/xhtml" xmlns:og="http://opengraphprotocol.org/schema/" xmlns:fb="http://www.facebook.com/2008/fbml" lang="en">\n<head>\n\t<title>GoBee Bike throws in the towel in France  |  TechCrunch</title>\n\t<meta http-equiv="X-UA-Compatible" content="IE=Edge" />\n\t<meta charset="UTF-8">\n\t\t\t<script type="text/javascript">var _sf_startpt = (new Date()).getTime()</script>\n\t\t<meta name="p:domain_verify" content="6189ff68ce30e30f12b40b3b40873027"/>\n\t<meta name="Hand'

In [53]:
article_soup = BeautifulSoup(article_response.text, 'html.parser')

In [55]:
article_soup.find('title').getText()

'GoBee Bike throws in the towel in France  |  TechCrunch'

In [58]:
article_body = article_soup.find('div', {'class':['article-entry']})
article_body

<div class="article-entry text">
<!-- Begin: Wordpress Article Content -->
<img class="" src="https://tctechcrunch2011.files.wordpress.com/2017/08/gobee-2.jpg?w=738"/>
<p id="speakable-summary">Bike-sharing startup <a href="http://gobeebike.fr/en/" target="_blank">GoBee Bike</a> is giving up and <a href="http://gobeebike.fr/fr/goodbye-fr/" target="_blank">shutting down</a> in all French cities where it operates. GoBee Bike operates just like Chinese giants Ofo and Mobike. You open  the app, you find a bike on the map and you unlock it by scanning a QR code. Once you’re done, you lock it again and leave it there — there’s no dock.</p>
<p>And yet, the startup is blaming vandalism and says that the service would stop immediately. It’s worth noting that users will get a refund on their remaining balances and €15 deposit. This is a nice gesture.</p>
<p>According to the announcement, GoBee Bike managed to attract 150,000 users in Europe who used the service hundreds of thousands of times. Bu

In [97]:
article_text = article_body.getText().replace('\n', ' ')
article_text

'   Bike-sharing startup GoBee Bike is giving up and shutting down in all French cities where it operates. GoBee Bike operates just like Chinese giants Ofo and Mobike. You open  the app, you find a bike on the map and you unlock it by scanning a QR code. Once you’re done, you lock it again and leave it there — there’s no dock. And yet, the startup is blaming vandalism and says that the service would stop immediately. It’s worth noting that users will get a refund on their remaining balances and €15 deposit. This is a nice gesture. According to the announcement, GoBee Bike managed to attract 150,000 users in Europe who used the service hundreds of thousands of times. But the company’s bikes slowly became unusable. 3,200 bikes became dysfunctional, 1,000 bikes were illegally parked in someone’s home. Overall, GoBee Bike had to send someone in 6,500 cases. The startup couldn’t keep up and it became clear that the business model wasn’t scalable if you needed to fix the bikes all the time. 
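
Notice the leading spaces and the double space in the text above: replacing newlines with spaces keeps whatever whitespace the page happened to contain. One possible clean-up, sketched here on an invented snippet, is to take only the `<p>` tags and collapse runs of whitespace. The sample HTML and variable names are assumptions made for this example.

```python
import re

from bs4 import BeautifulSoup

SAMPLE_ARTICLE_HTML = """
<div class="article-entry text">
<!-- Begin: Article Content -->
<img class="" src="https://example.com/photo.jpg"/>
<p>First  paragraph with a <a href="https://example.com/">link</a>.</p>
<p>Second paragraph.</p>
</div>
"""

sample_soup = BeautifulSoup(SAMPLE_ARTICLE_HTML, "html.parser")
body = sample_soup.find("div", {"class": "article-entry"})

# Take only the <p> tags, so anything else inside the container is left out
paragraphs = [p.get_text() for p in body.find_all("p")]

# Join the paragraphs, then collapse every run of whitespace into one space
clean_text = re.sub(r"\s+", " ", " ".join(paragraphs)).strip()

print(clean_text)
```

Cleaner input tends to help the NLP step later, since stray whitespace otherwise ends up as empty "words" when the text is split.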

<h1 style="text-align:center">Now you should see all of the internet as scrapable data</h1>

![new view](images/matrix.gif)

# Let's do some NLP now!

In [98]:
from textblob import TextBlob

In [99]:
def get_sentiment(text):
    return TextBlob(text).sentiment

In [100]:
processed_text = TextBlob(article_text)

From https://textblob.readthedocs.io/en/dev/api_reference.html#textblob.blob.TextBlob.sentiment

> TextBlob.sentiment

> Return a tuple of form (polarity, subjectivity ) where polarity is a float within the range [-1.0, 1.0] and subjectivity is a float within the range [0.0, 1.0] where 0.0 is very objective and 1.0 is very subjective.

In [101]:
processed_text.sentiment

Sentiment(polarity=0.09011111111111111, subjectivity=0.41522222222222227)

## What are the more subjective (emotional) sentences?

In [103]:
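# naive sentence split: break the text on every full stop
sentences = article_text.split('.')

for sentence in sentences:
    sentence_sentiment = get_sentiment(sentence)

    # keep only the sentences that TextBlob scores as more subjective than objective
    if sentence_sentiment.subjectivity > 0.5:
        print(sentence, sentence_sentiment.subjectivity)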

 This is a nice gesture 1.0
 Mobike has been around for a month and rides are free as well 0.8
 Even Obike gave you 50 free rides when you signed up 0.8
 It’s hard to compete with free 0.6708333333333334


## Can we plot the words by their sentiment? 

In [93]:
import matplotlib.pyplot as plt

In [104]:
words = article_text.split(' ')
