---

### 🎓 **Professor**: Apostolos Filippas

### 📘 **Class**: Web Analytics

### 📋 **Topic**: Getting and parsing web content 


🚫 **Note**: You are not allowed to share the contents of this notebook with anyone outside this class without written permission by the professor.

---

# 🌐 1. Getting Content from the Web

Our goal today is to extract all available research papers from the professor's website!

## The `HTTP` Protocol

When we want to grab data from a website, we need to ask the website's server for that data. HTTP (HyperText Transfer Protocol) is the set of rules that determines how we talk to these servers and how they talk back to us. Here's a breakdown of the main ways (methods) we can communicate using HTTP:

- **`GET`**: This is like saying, "Hey server, can you show me this?" It's used to ask the server to show us specific data. For instance, when you type a website address into your browser, you're actually sending a GET request to view that website.
  
- **`POST`**: Think of this as telling the server, "Hey, I have some info for you." This is used when you fill out a form on a website, like when you sign up for a new social media account. You're sending your details (username, password, etc.) to the server.
  
- **`PUT`**: This method is like updating or editing something that's already on the server. For example, imagine you're changing your profile picture on a website. You'd use a PUT request to replace the old picture with the new one.
  
- **`DELETE`**: Simply put, this is asking the server to get rid of something. If you were to delete a post or a photo from a website, you'd be sending a DELETE request.
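
To make the four methods concrete, the sketch below builds the text of a simplified HTTP/1.1 request by hand. The helper `build_request`, the host `example.com`, and the paths are hypothetical and chosen only for illustration; real requests usually carry more headers.

```python
def build_request(method, path, host, body=""):
    """Return the text of a simplified HTTP/1.1 request (illustrative helper)."""
    lines = [f"{method} {path} HTTP/1.1", f"Host: {host}"]
    if body:
        # Requests that carry data announce the size of the payload
        lines.append(f"Content-Length: {len(body.encode('utf-8'))}")
    # A blank line separates the headers from the optional body
    return "\r\n".join(lines) + "\r\n\r\n" + body


# Hypothetical host and paths, one request per method
get_request = build_request("GET", "/papers", "example.com")
post_request = build_request("POST", "/signup", "example.com", "username=ann")
put_request = build_request("PUT", "/profile/picture", "example.com", "new-picture-data")
delete_request = build_request("DELETE", "/posts/42", "example.com")

print(get_request)
print(post_request)
```

Only the first word of the request line changes between methods; `POST` and `PUT` additionally carry a body with the data being sent.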


## The `requests` Module
While HTTP is great, writing raw HTTP can be tedious. The `requests` module streamlines this process: 
- it offers a much simpler interface that abstracts away from the complexities of creating HTTP requests
- it offers a much simpler way to send data, or interpret data returned by a server

### Note: APIs
Interfaces such as the one `requests` provides are commonly called `APIs` (Application Programming Interfaces).
- An API is an agreed-upon interface that lets distinct pieces of software talk to each other.
- In the `requests` example, the module offers an API that hides the details of building and sending HTTP requests, so you can simply ask for what you want.
- Analogy: Consider a restaurant menu. It lists and describes dishes. When you place an order, the kitchen (akin to a web server) prepares and serves the dish. Here, the menu symbolizes the API; your order is the request, and the dish is the response. You could describe in detail how to prepare the dish, but the menu abstracts from this complexity, enabling you to order with ease.

### **Key Features & Functionalities**:

- **HTTP Methods**: `requests` allows you to perform many actions using HTTP: GET, POST, PUT, DELETE, and more.

- **Parameters**: You can add query parameters to your URL via the `params` keyword.

- **Response Handling**: After the request, you can extract the server's response in whatever format works for you (text, JSON, or bytes).

- **Cookies**:
  - You can seamlessly send/receive cookies during requests.
  - Note: A cookie is a small data file that the website writes into your system. On subsequent website visits, this data is retrieved, assisting the site in recollecting prior user activity.

- **Sessions**:
  - When communicating with a consistent host, you can create a Session.
  - Note: A session is a temporary, two-way exchange of data. It starts when you access a website and ends when you leave or close the browser. Sessions retain user data across pages, which is why you stay logged in as you move around a site.

- **Headers**:
  - Adding headers to your request is crucial for emulating browsers, managing authentication, or working with metadata.
  - Note: data sent over the internet is split into packets. Each packet carries data along with accompanying metadata. In HTTP, headers carry information about the request/response and about the payload.

- **Redirections, Timeouts, and Exceptions**:
  - `requests` has tools to handle redirects and raises exceptions for scenarios like timeouts or excessive redirects.
  - Note: internet communication can be fickle – coding should account for this!
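
A few of the features above can be tried without touching the network. The sketch below assumes `requests` is installed; the URL `https://example.com/search`, the parameter names, and the `User-Agent` value are placeholders. It builds a request inside a `Session` and inspects it before anything is sent.

```python
import requests

# A Session keeps settings (headers, cookies) across requests to the same host
session = requests.Session()
session.headers.update({"User-Agent": "web-analytics-class/0.1"})  # placeholder value

# Parameters are passed as a dictionary and encoded into the URL for us
request = requests.Request(
    "GET",
    "https://example.com/search",  # placeholder URL
    params={"topic": "regex", "page": 2},
)

# prepare_request merges the session settings into the request without sending it
prepared = session.prepare_request(request)

print(prepared.method)
print(prepared.url)
print(prepared.headers["User-Agent"])

# To actually send it, set a timeout and catch network errors, e.g.:
# try:
#     response = session.send(prepared, timeout=10)
# except requests.exceptions.RequestException as error:
#     print(f"request failed: {error}")
```

Inspecting the prepared request is a handy way to check that parameters and headers look as intended before contacting a server.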

## Handling Errors with `try-except`

`try-except` allows you to handle errors without crashing your program.

**Workflow**:
- Start with a `try` block that contains the code that may raise an error.
- If the code throws no error, the `except` block is skipped and the program continues normally.
- If the code throws an error, subsequent commands within the same block are not executed. Instead, Python moves to the `except` block, and runs the code it contains.
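
The workflow above can be seen in a small example that needs no network access. The helper `to_int` and the input strings are made up for illustration.

```python
def to_int(text):
    """Convert text to an integer, or return None if that is not possible."""
    try:
        value = int(text)               # may raise ValueError
        print(f"converted {text!r}")    # skipped if the line above fails
        return value
    except ValueError:
        # Runs only when int() raised a ValueError
        print(f"could not convert {text!r}")
        return None


results = [to_int(item) for item in ["42", "seven", "7"]]
print(results)
```

Naming the exception type (`ValueError` here) keeps the `except` block from silently hiding unrelated problems.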

In [None]:
import requests
import time

url        = "https://www.apostolos-filippas.com"
my_headers = { 'User-Agent':'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}

success = False

# try 5 times
for i in range(1, 6):
    try:
        # send a GET request; give up on this attempt after 10 seconds
        response = requests.get(url, headers=my_headers, timeout=10)
        # if an error does not occur, set success
        success = True
        # we got the page, break the loop
        break
    except requests.exceptions.RequestException:
        # a network-level error (timeout, connection failure, ...) occurred
        print(f'failed attempt {i}')
        # wait for 2 seconds before you go to the next iteration of the loop
        time.sleep(2)

if success:
    print('Successfully retrieved the webpage!')
else:
    print('Did not manage to retrieve the webpage')

Try fetching a URL that's likely non-existent, such as "http://www.apostolos-filippas-sucks.com/". 

## More on `requests`:

### Status Validation:
- To ensure that you grabbed the website content, validate the response. 
- A `200 OK` indicates a successful fetch.

In [None]:
print(response.status_code)
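
Note that `requests.get` returns normally even when the server answers with an error page such as `404`, so a `try-except` alone does not tell you the fetch worked. As a rough sketch, the standard-library `http.HTTPStatus` enum can translate a status code into its meaning; the helper `describe_status` is hypothetical and only accepts codes that `HTTPStatus` knows.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short description of an HTTP status code (illustrative helper)."""
    phrase = HTTPStatus(code).phrase  # standard reason phrase, e.g. "OK"
    if 200 <= code < 300:
        kind = "success"
    elif 300 <= code < 400:
        kind = "redirect"
    elif 400 <= code < 500:
        kind = "client error"
    elif 500 <= code < 600:
        kind = "server error"
    else:
        kind = "informational"
    return f"{code} {phrase} ({kind})"


for code in (200, 301, 404, 500):
    print(describe_status(code))
```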

### Look into the HTTP Headers:
- They contain interesting details about who you're communicating with

In [None]:
# First, find the length of the longest header name
max_key_len = max(len(key) for key in response.headers)

# Then, format with alignment
for key, value in response.headers.items():
    print(f"{key:<{max_key_len}} : {value}")

To see what the server returned, simply use the `text` attribute of the response object.

In [None]:
html_content = response.text
print(html_content)

The data we grabbed is the same content that your browser receives when you visit the website. To verify this:
1. Navigate to a website.
2. Right-click on the page.
3. Opt for the "View page source" option from the context menu.


## Question
The returned content is a string. How can we parse it to extract the information we want? Specifically, how can we extract the paper titles?

---
# 🔍 2. Parsing Content

After you retrieve the URL, the next step is to extract the content you want.

## Regular Expressions

**Regular Expressions** (REs, regexes, or regex patterns) are a compact, specialized programming language embedded in Python, accessible via the `re` module.
-  REs are used to identify and extract specific patterns from strings. 
-  You need to define the matching criteria — for example, all strings that have the email address format
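
As a minimal sketch of the email example, the pattern below is deliberately simplified and will not cover every valid address; the sample text and addresses are made up.

```python
import re

# Simplified email pattern: name characters, an @, then a domain with at least one dot
email_pattern = r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"

text = "Write to ann@example.com or bob.smith@mail.example.org, not to ann@home."
emails = re.findall(email_pattern, text)
print(emails)
```

The last candidate, `ann@home.`, is not returned because nothing follows the dot, so it does not fit the format the pattern describes.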

**Example**:
To find all patterns like:
- `papers/bigcounter.pdf">Strength in numbers: Using big data to simplify sentiment classification<`
- `papers/sharing.pdf">Owning, using and renting: Some simple economics of the "sharing economy"<`
...and similar, you can use regular expressions.


**Key Regex Symbols**:
- `.`: Matches any single character (except a newline, by default).
- `.*`: Matches any string (of zero or greater length).
- `.+`: Matches any string (of one or greater length).
- `()`: Marks a group, which specifies the segment of the pattern to return.
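
The symbols above can be tried on one of the example strings. This is a small sketch; the patterns are chosen only to show each symbol in isolation.

```python
import re

# One of the example strings from above
sample = 'papers/bigcounter.pdf">Strength in numbers: Using big data to simplify sentiment classification<'

# `.` stands for exactly one character
one_char = re.findall(r"p.f", sample)

# `.+` needs at least one character between the fixed parts
file_name = re.findall(r"papers/(.+)\.pdf", sample)

# `.*` also matches when there is nothing between the fixed parts
nothing_between = re.findall(r'pdf"(.*)>', sample)

# With `()`, findall returns only the captured segment, not the whole match
title = re.findall(r">(.+)<", sample)

print(one_char)
print(file_name)
print(nothing_between)
print(title)
```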

## Grabbing the titles of the papers

Try to find what the HTML around each paper title has in common.
Then, use regular expressions to extract the titles.

In [None]:
import re
# raw string (r'...') so that backslashes in patterns are never misread by Python
papers = re.findall(r'papers/.+>(.+)<', html_content)
print(papers)
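
One caveat: `.+` is greedy, meaning it grabs as many characters as it can. The sketch below uses a hypothetical one-line HTML snippet (the real page may be structured differently) to show what happens when two links share a line, and one possible alternative using negated character classes.

```python
import re

# Hypothetical snippet with two paper links on the same line
snippet = (
    '<a href="papers/bigcounter.pdf">Strength in numbers</a> '
    '<a href="papers/sharing.pdf">Owning, using and renting</a>'
)

# Greedy: `.+` runs as far as it can, so the first title on the line is skipped
greedy = re.findall(r"papers/.+>(.+)<", snippet)

# `[^"]+` stops at the closing quote and `[^<]+` stops at the next tag
precise = re.findall(r'papers/[^"]+">([^<]+)<', snippet)

print(greedy)
print(precise)
```

If the pattern from the cell above misses titles on the real page, checking whether several links sit on one line is a good first step.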


# <font color='red'>3. Challenge </font>

Parse the courses that I have taught, and return a set containing their titles.


<div align="center">
    <img src="https://github.com/apostolosfilippas/wa/blob/main/assets/normal_regex.jpg?raw=true" width="600">
    <br><br>
</div>