# Lab1.4: TechCrunch as a source for text

Copyright: Vrije Universiteit Amsterdam, Faculty of Humanities, CLTL

TechCrunch (https://techcrunch.com/) is a website with news on tech companies. It is linked to Crunchbase (https://www.crunchbase.com), which contains structured information on tech companies. The TechCrunch site lets you search its database. If you type a keyword in the search box, e.g. "apple", the search results appear and the URL in the address bar changes as well, e.g.:

https://techcrunch.com/search/apple


This is an example of an online database that can be accessed through a so-called REST API. REST stands for REpresentational State Transfer; it allows people or software to make calls to a server. Many websites use it to handle requests such as database searches.
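To make the structure of such a request more concrete, the sketch below splits the example search URL into its parts with the standard library's `urllib.parse.urlsplit`. It does not contact the server; the variable names `parts` and `search_term` are only illustrative.

```python
from urllib.parse import urlsplit

# The example search URL shown above, split into its components.
parts = urlsplit("https://techcrunch.com/search/apple")

print(parts.scheme)   # protocol used for the request
print(parts.netloc)   # server that receives the request
print(parts.path)     # resource that is requested on that server

# In this URL the search term is the last segment of the path.
search_term = parts.path.rsplit("/", 1)[-1]
print(search_term)
```

Seen this way, a search request is just a URL whose path ends in the query, which is why we can build such requests ourselves.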

Below, we show how to create a URL that forms a search request to a database such as TechCrunch, and how to obtain the search results.

We first define our function to process a URL with BeautifulSoup.

In [4]:
from bs4 import BeautifulSoup
import requests 
import re

#Utility function to get the raw text from a web page. 
#It takes a URL string as input and returns the text.
def url_to_string(url):
    res = requests.get(url)
    html = res.text
    soup = BeautifulSoup(html, 'html5lib')
    for script in soup(["script", "style", 'aside']):
        script.extract()
    return " ".join(re.split(r'[\n\t]+', soup.get_text()))
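The last line of `url_to_string` collapses every run of newlines and tabs into a single space. The small offline illustration below applies the same expression to a made-up string (`sample` is not taken from a real page), so you can see the effect without fetching anything.

```python
import re

# A made-up piece of page text with the kind of line breaks and tabs
# that soup.get_text() typically returns.
sample = "Apple releases update\n\n\tRead more\n"

# Same expression as in url_to_string: split on runs of newlines/tabs,
# then join the pieces with single spaces.
flattened = " ".join(re.split(r'[\n\t]+', sample))
print(repr(flattened))

# A trailing newline leaves a trailing space; strip() removes it if needed.
print(repr(flattened.strip()))
```

Note that ordinary spaces are left untouched, so repeated spaces in the page text stay as they are.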

Next, we create a URL by appending the search term to the search URL of TechCrunch. Note that other websites may use a different syntax, so first try out a search manually to see how searches are specified. Below we search for "apple os x". Note that spaces in the query need to be represented as '%20'.

In [6]:
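# The search term, with spaces written as %20.
keyword = "apple%20os%20x"
url = "techcrunch.com/search/" + keyword
print('The search request URL:', url)
# Send the request and parse the returned HTML.
r = requests.get("https://" + url)
data = r.text
soup = BeautifulSoup(data, 'html5lib')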

The search request URL: techcrunch.com/search/apple%20os%20x
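Writing '%20' by hand becomes error-prone for longer queries. As an alternative, the sketch below uses `quote` from the standard library's `urllib.parse` to encode the spaces; the variable `query` is introduced here for illustration, and the search path is the same TechCrunch pattern as above.

```python
from urllib.parse import quote

# The query as a person would type it in the search box.
query = "apple os x"

# quote() percent-encodes characters that are not safe in a URL path;
# a space becomes %20.
keyword = quote(query)
url = "techcrunch.com/search/" + keyword
print('The search request URL:', url)
```

This yields the same `keyword` and `url` values as the cell above, so the rest of the notebook works unchanged.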


The result of the search is now stored as a BeautifulSoup data structure. The BeautifulSoup documentation explains how we can access the result.

In [16]:
type(soup)

bs4.BeautifulSoup

We are going to iterate over all HTML anchors 'a' and obtain the hyperlink 'href' of each as a URL. We use our function 'url_to_string' to obtain the text from each URL and save it in a file in the folder 'techcrunch_search_results'. Make sure that this folder exists before you run the next cell.

In [None]:
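for i, link in enumerate(soup.find_all('a'), 1):
    embeddedurl = link.get('href')
    # Skip anchors without an absolute http(s) link; requests cannot fetch them.
    if not embeddedurl or not embeddedurl.startswith('http'):
        continue
    print(embeddedurl)
    # Store the URL on the first line, followed by the page text.
    text = embeddedurl + '\n' + url_to_string(embeddedurl)
    filename = "techcrunch_search_results/" + keyword + str(i) + ".txt"
    # The 'with' block closes the file once the text is written.
    with open(filename, "w", encoding="utf-8") as f:
        f.write(text)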

https://techcrunch.com/2016/09/01/apple-patches-zero-day-vulnerabilities-in-safari-and-os-x/
https://techcrunch.com/author/devin-coldewey/
https://techcrunch.com/2016/09/01/apple-patches-zero-day-vulnerabilities-in-safari-and-os-x/
https://techcrunch.com/2016/01/27/apple-has-fixed-bug-that-was-crashing-safari-at-least-on-os-x/
https://techcrunch.com/author/romain-dillet/
https://techcrunch.com/2016/01/27/apple-has-fixed-bug-that-was-crashing-safari-at-least-on-os-x/
https://techcrunch.com/2016/01/20/apple-releases-ios-and-os-x-updates-with-bug-fixes-and-performance-improvements/
https://techcrunch.com/author/romain-dillet/
https://techcrunch.com/2016/01/20/apple-releases-ios-and-os-x-updates-with-bug-fixes-and-performance-improvements/
https://techcrunch.com/2015/09/30/os-x-el-capitan-review/
https://techcrunch.com/author/romain-dillet/
https://techcrunch.com/2015/09/30/os-x-el-capitan-review/
https://techcrunch.com/2018/11/05/review-ipad-pro-pencil-12-9-inch/
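The output above contains author pages and lists several articles twice, so the same text would be downloaded more than once. A possible refinement, sketched below, keeps only unique article links. It rests on an assumption drawn from the URLs printed above: that article links contain a /YYYY/MM/DD/ date segment. The helper `unique_article_urls`, the pattern `ARTICLE_PATTERN` and the last two list entries (`None` and a relative path) are hypothetical and only illustrate values that `link.get('href')` can return.

```python
import re

# A few of the hrefs printed above, plus two hypothetical non-article values.
hrefs = [
    "https://techcrunch.com/2016/09/01/apple-patches-zero-day-vulnerabilities-in-safari-and-os-x/",
    "https://techcrunch.com/author/devin-coldewey/",
    "https://techcrunch.com/2016/09/01/apple-patches-zero-day-vulnerabilities-in-safari-and-os-x/",
    "https://techcrunch.com/2015/09/30/os-x-el-capitan-review/",
    None,        # hypothetical: an anchor without an href attribute
    "/events/",  # hypothetical: a relative link
]

# Assumption: article URLs start with the site address followed by a date.
ARTICLE_PATTERN = re.compile(r"^https://techcrunch\.com/\d{4}/\d{2}/\d{2}/")

def unique_article_urls(hrefs):
    """Return the article URLs in their original order, without duplicates."""
    seen = set()
    result = []
    for href in hrefs:
        if href and ARTICLE_PATTERN.match(href) and href not in seen:
            seen.add(href)
            result.append(href)
    return result

articles = unique_article_urls(hrefs)
for article in articles:
    print(article)
```

In the notebook, the list of hrefs would come from `soup.find_all('a')` instead of being written out by hand.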


## End of this notebook