Import necessary libraries

In [2]:
from bs4 import BeautifulSoup
import time
import requests
import re
import os

#### Interaction with the Page-Sorting
- Interacting with the page shows that the sort order can be changed directly from the browser's address bar by editing the URL, for example by setting sort=dateoldest.
- Inspecting the network traffic shows that sorting by oldest is a GET request.
- The variable associated with sorting is the "sort" variable.
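
As a minimal sketch of the observation above, the sort order can be changed by rewriting the `sort` query parameter. The helper name `with_sort` is hypothetical and not part of the original notebook; `dateoldest` is the value seen in the address bar.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def with_sort(url, sort_value):
    """Return url with its 'sort' query parameter set to sort_value."""
    parts = urlsplit(url)
    # Keep every existing parameter except 'sort', then append the new value
    query = [(k, v) for k, v in parse_qsl(parts.query) if k != "sort"]
    query.append(("sort", sort_value))
    return urlunsplit(parts._replace(query=urlencode(query)))

print(with_sort("https://sfbay.craigslist.org/search/zip?sort=date", "dateoldest"))
```

Any fragment on the original URL is carried over unchanged, so the helper can be applied to the paginated URLs as well.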

#### Interaction with the Page-Pagination
- It can be observed that 120 items are displayed on each page.
- url = "https://sfbay.craigslist.org/search/zip?sort=date#search=1~gallery~0~0"
- The part of the URL that changes when navigating between pages is the variable search = 1\~gallery\~0\~0. When the next page is clicked, the value changes in the following format:
    gallery\~"page number"\~0
- As mentioned above, the variable associated with page changes is the "search" variable in the URL. Altering this variable results in:
    1. Navigation between pages when "number" in 1\~gallery\~"number"\~0 is changed.
    2. Scrolling downward on the same page when "position" in 1\~gallery\~0\~"position" is changed.
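
The fragment format described above can be sketched as a small helper. The names `BASE_URL` and `page_url` are hypothetical and not part of the original notebook. One caveat: the fragment (everything after `#`) is processed by the browser and is not transmitted in the HTTP request, so it is worth checking that successive pages really return different listings.

```python
from urllib.parse import urldefrag

BASE_URL = "https://sfbay.craigslist.org/search/zip"

def page_url(page, position=0):
    """Build the gallery URL for a page index and scroll position."""
    # Fragment layout observed above: 1~gallery~<page>~<position>
    return f"{BASE_URL}#search=1~gallery~{page}~{position}"

url = page_url(2)
print(url)
# urldefrag separates the part a server receives from the fragment
print(urldefrag(url))
```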

In [2]:
url = "https://sfbay.craigslist.org/search/zip#search=1~gallery~0~0"
header = {'User-Agent': 'Mozilla/5.0'}
page = requests.get(url, headers=header)  ## Pass headers by keyword; the second positional argument is params
soup = BeautifulSoup(page.content, "html.parser")

- The links to the individual listings are housed in the "a" tags within the "li" (list item) tags, i.e. the list item houses the target item and the "a" tag houses the link to it.
- So, the "li" tag is selected first, followed by the immediate "a" tag under it.

In [4]:
a_tags = soup.select("li > a")
weblinks1 = [i['href'] for i in a_tags]
#weblinks1

- An alternative way would be to directly find all the "a" tags with an href and remove the first two links as they are not related to the items of interest

In [5]:
tags = soup.select("a[href]")
weblinks1 = [tag['href'] for tag in tags[2:]]  ## Skip the first two links, which are unrelated to the listings
#weblinks1

#### Fetch Listing URLs

The strategy to get the listing URLs is as follows:
- Create an empty list to store the URLs, create a flag variable, and declare the initial page index.
- We need 250 unique listings. In this case we know that each page shows 120 items, but that may not always be the case. Hence, a flag variable remains False until 250 unique URLs are scraped.
- A while loop runs until 250 unique URLs are collected.
- After each page is scraped, the page index is increased by one to navigate to the next page.
- On each page, get the HTML content and create a BeautifulSoup instance.
- Access the "a" tag within the "li" tag and get the link.
- Check whether the list already has 250 items. If yes, set the flag to True and stop scraping. Otherwise, check whether the link already exists in the list: if not, add the URL to the list, else continue.
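
The steps above can be sketched independently of the network code. This is a minimal illustration, not the notebook's implementation: `collect_unique`, `fetch_links` and `max_pages` are hypothetical names, and a set is swapped in for the list membership check so that each lookup is constant time. The `max_pages` cap is an added assumption that prevents an endless loop if the target is never reached.

```python
def collect_unique(fetch_links, target=250, max_pages=50):
    """Collect up to `target` unique links, one page at a time."""
    seen = set()      # fast membership checks
    ordered = []      # preserves discovery order
    for page_index in range(max_pages):
        links = fetch_links(page_index)
        if not links:             # stop on an empty page
            break
        for link in links:
            if link not in seen:
                seen.add(link)
                ordered.append(link)
                if len(ordered) == target:
                    return ordered
    return ordered

# Stand-in for a real page fetch: three overlapping fake pages
fake_pages = [["a", "b", "c"], ["c", "d"], ["d", "e", "f"]]
fetch = lambda i: fake_pages[i] if i < len(fake_pages) else []
print(collect_unique(fetch, target=5))
```

In the notebook, `fetch_links` would wrap the request, the BeautifulSoup parse and the `li > a` selection for one page.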

In [6]:
## Declare variables
weblinks = []  ## Empty list to collect web url of 250 unique listings
flag_items = False ## Flag that stays False until 250 unique listing links are captured
page_index = 0  ## Page index for pagination

## Collect links for 250 unique listings
while not flag_items:

    ## pagination by manipulating url
    url = f"https://sfbay.craigslist.org/search/zip#search=1~gallery~{page_index}~0"
    header = {'User-Agent': 'Mozilla/5.0'}
    page = requests.get(url, headers=header)  ## Pass headers by keyword; the second positional argument is params
    
    ## create BeautifulSoup instance
    soup = BeautifulSoup(page.content, "html.parser")
    
    ## Extract links
    a_tags = soup.select("li > a")
    for i in a_tags:
        link = i['href']
        if len(weblinks) < 250:
            if i['href'] not in weblinks:
                weblinks.append(link)
            else:
                continue
        else:
            flag_items = True
            break
    
    page_index+= 1
    
    time.sleep(5)

## Print links       
weblinks

['https://sfbay.craigslist.org/eby/zip/d/antioch-free-antique-floor-standing/7714536489.html',
 'https://sfbay.craigslist.org/sfc/zip/d/san-francisco-twin-size-bed-frame/7714536307.html',
 'https://sfbay.craigslist.org/scz/zip/d/freedom-ikea-laptop-stand-bed-tray/7714535545.html',
 'https://sfbay.craigslist.org/eby/zip/d/san-ramon-built-in-counter-depth/7714535179.html',
 'https://sfbay.craigslist.org/sby/zip/d/milpitas-non-working-chest-freezer/7712074474.html',
 'https://sfbay.craigslist.org/eby/zip/d/oakland-box-of-breastmilk-storage-bags/7714534032.html',
 'https://sfbay.craigslist.org/scz/zip/d/freedom-free-computer-monitor/7714533705.html',
 'https://sfbay.craigslist.org/eby/zip/d/oakland-kid-science-kit-free/7714532990.html',
 'https://sfbay.craigslist.org/eby/zip/d/berkeley-free-pregnancy-pillow/7707497306.html',
 'https://sfbay.craigslist.org/eby/zip/d/oakland-free-istanbul-metro-card/7706359982.html',
 'https://sfbay.craigslist.org/eby/zip/d/oakland-free-stuff/7711572302.html

In [7]:
len(weblinks)

250

#### Save HTML Pages

- Using library os to create a new directory to save the listings' page information
- Split the url and get the last string as name of the document as it identifies the listing id
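
A minimal sketch of the naming rule above, assuming the listing ID is always the last path segment of the URL. The helper name `listing_filename` is hypothetical. Parsing the URL first means a trailing query string or fragment would not end up in the file name.

```python
import os
from urllib.parse import urlsplit

def listing_filename(url, output_dir="listings"):
    """Map a listing URL to a local path named after its last path segment."""
    name = os.path.basename(urlsplit(url).path)
    return os.path.join(output_dir, name)

print(listing_filename(
    "https://sfbay.craigslist.org/eby/zip/d/oakland-kid-science-kit-free/7714532990.html"
))
```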

In [8]:
## Create new directory named 'listings'
output_dir = 'listings'
os.makedirs(output_dir, exist_ok=True)

for url in weblinks:

    page = requests.get(url)

    filename = os.path.join(output_dir, url.split("/")[-1])  ## Name the file after the last URL segment (the listing ID)

    with open(filename, 'w', encoding='utf-8') as file:
        file.write(page.text)

    time.sleep(5)  ## Pause between requests, after the file has been closed

- Find the .html files in the new directory.
- Loop through the directory and create a BeautifulSoup instance of each listing's html page
- Use regex to find the required information and print to screen
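
The regex step can be illustrated on a small string. The markup in `sample` is a hypothetical fragment shaped like the text on a listing page, not copied from a real one, and the helper names are assumptions. Returning `None` when nothing matches avoids the `IndexError` that indexing an empty `findall` result would raise.

```python
import re

# Hypothetical fragment shaped like the text found on a listing page
sample = """
<p class="postinginfo">post id: 7704690348</p>
<p class="postinginfo">posted: <time>2024-01-05 19:19</time></p>
"""

def extract_post_id(html):
    """Return the digits following 'post id: ', or None if absent."""
    match = re.search(r"post id: (\d+)", html)
    return match.group(1) if match else None

def extract_posted(html):
    """Return the timestamp inside the <time> tag after 'posted:', or None."""
    match = re.search(r"posted:\s*<time[^>]*>([^<]+)</time>", html)
    return match.group(1).strip() if match else None

print(extract_post_id(sample), extract_posted(sample))
```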

In [10]:
## Access new directory
directory_path = 'listings'

for filename in os.listdir(directory_path):
    if filename.endswith('.html'): ## Check for .html files
        file_path = os.path.join(directory_path, filename)
        
        with open(file_path, 'r', encoding='utf-8') as file:
            html_content = file.read()
            soup = BeautifulSoup(html_content, 'html.parser') ## Create BeautifulSoup instance
            
            ## Use regex to find required information
            if soup.find('span', {'id': 'titletextonly'}) is not None:
                title = soup.find('span', {'id': 'titletextonly'}).next
                
                if soup.find('img') is not None:

                    img_url = soup.find('img')['src']
                
                else:

                    img_url = "Image unavailable"

                desc_string = str(soup.find('section', {"id" : "postingbody"}))

                description = re.findall(r"\n([^<]+)<", desc_string)[0]

                pattern_id = re.compile("post id: ")
                post_id = re.findall(r": (\d+)", soup.find_all(string=pattern_id)[0])[0]  ## Raw string avoids the invalid "\d" escape warning

                pattern_post = re.compile('posted:')
                posted_date = re.findall(r">([^<]+)</" ,str(soup.find(string = pattern_post).parent))[0]

                pattern = re.compile('updated')

                if soup.find(string = pattern) is not None:

                    update_date = re.findall(r">([^<]+)</" ,str(soup.find(string = pattern).parent))[0]

                else:
                    
                    update_date = "Last updated date not available"  
                
                #print(f"Post ID : {post_id}", "\n")
                print(f"Title : {title}\n")
                print(f"URL of first image : {img_url}\n")
                print(f"Description : {description}\n")
                print(f"Post ID : {post_id}", "\n")
                print(f"Posted Date : {posted_date}\n")
                print(f"Last Updated Date : {update_date}\n")
                print("*************************************************************************************************")
                
            else:
                print("Posting deleted")
                continue

Title : Rawlings baseball helmet 6 1/4-6 7/8

URL of first image : https://images.craigslist.org/00p0p_hZSMbBjPoJT_0tE0CI_600x450.jpg

Description : Used two seasons in little league

Post ID : 7704690348 

Posted Date : 2024-01-05 19:19

Last Updated Date : 2024-02-04 17:35

*************************************************************************************************
Title : Oak chair

URL of first image : https://images.craigslist.org/00n0n_baDAS4segUc_0t20CI_600x450.jpg

Description : Swivels, height adjusts, great condition

Post ID : 7705079606 

Posted Date : 2024-01-07 06:58

Last Updated Date : 2024-02-04 17:51

*************************************************************************************************
Title : Micro USB Car Charger

URL of first image : https://images.craigslist.org/00v0v_coblI9aG3YU_0t20CI_600x450.jpg

Description : Don't need this

Post ID : 7705881682 

Posted Date : 2024-01-09 12:45

Last Updated Date : 2024-02-04 18:09

**************************

### Automating Login on The Old Reader

- Account created with username: adiallmight@gmail.com; password: a7pute

In [89]:
## Get sign-in page and form

headers = {'User-Agent': 'Mozilla/5.0'}
url = 'https://theoldreader.com/users/sign_in'

page = requests.get(url, headers=headers)  ## Pass headers by keyword; the second positional argument is params

soup = BeautifulSoup(page.content, 'html.parser')

#### Exploring the Login Mechanism
- From the form, it can be observed that the username field has the name user[login] and the password field has the name user[password].
- The code below captures both names, along with the authenticity token.

In [112]:
## Use form ID to access form body

form = soup.select_one('#new_user')

token = form.select_one('input[name = authenticity_token]') ## Get authenticity token

## Authenticity token is required in the POST request payload 

authenticity_token = token.get('value')

user_name = soup.select_one('#user_login').get('name')  ## Get username key

password = soup.select_one('#user_password').get('name')  ## Get password key

print(f'The following are the keys for the information needed in the payload for POST i.e log in: \nAuthenticity token : {authenticity_token} \nUser name: {user_name} \nPassword: {password}')

The following are the keys for the information needed in the payload for POST i.e log in: 
Authenticity token : WL79O6JXKexrL+SwPVFCTiVTdH5a5nyjrRLBfpZ5Dxs= 
User name: user[login] 
Password: user[password]


#### Analyzing Network Traffic for Login Request
- A POST request is made when signing in.
- This is because the username and password are submitted, along with some other information, to the website to start the login session and create a cookie for it.
- The payload also shows that the username and password were submitted under their field names, together with the authenticity token, in the POST request.
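
A minimal sketch of the payload described above, using the field names found in the form. The helper name `build_login_payload` and all values passed to it are placeholders, not real credentials. `requests` form-encodes a dict passed as `data=`; `urlencode` is used here only to show the equivalent request body.

```python
from urllib.parse import urlencode

def build_login_payload(token, login, password):
    """Assemble the form fields submitted with the sign-in POST."""
    return {
        "authenticity_token": token,
        "user[login]": login,
        "user[password]": password,
        "commit": "Sign In",
    }

# Placeholder values only; substitute real ones at run time
payload = build_login_payload("TOKEN", "user@example.com", "secret")
print(urlencode(payload))
```

Reading the credentials from environment variables or a prompt at run time, rather than writing them into the notebook, keeps them out of saved output.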

#### Automating the Login Process

In [102]:
time.sleep(10)  ## Pause before next request

session = requests.Session()

## Use POST method to get cookie for the session

session.post("https://theoldreader.com/users/sign_in",
            data = {
                'authenticity_token': authenticity_token,
                user_name : 'adiallmight@gmail.com',
                password : 'a7pute',
                'commit': 'Sign In'
            },
            timeout = 20)
cookies = session.cookies.get_dict()

#### Verifying Successful Login

In [103]:
print(cookies)

{'remember_user_token': 'BAhbB1sGVToaTW9wZWQ6OkJTT046Ok9iamVjdElkIhHivxufea%2FsmLJAvRxJIiIkMmEkMDUkbmUwUmlKMThVWEszdm5wRVhkUnNaLgY6BkVU--7545f382ff3d3b199f080ef4f4cb5a2fc94835c7', 'i_know_you': 'Aditya+Satpute', '_new_reader_session': 'BAh7CkkiD3Nlc3Npb25faWQGOgZFVEkiJTBlMGExNmY1NTc4MGZkNDYyOTgyZWNlZGZiOGQxNGJiBjsAVEkiGXdhcmRlbi51c2VyLnVzZXIua2V5BjsAVFsHWwZVOhpNb3BlZDo6QlNPTjo6T2JqZWN0SWQiEeK%2FG595r%2ByYskC9HEkiIiQyYSQwNSRuZTBSaUoxOFVYSzN2bnBFWGRSc1ouBjsAVEkiDWxhbmd1YWdlBjsARjoHZW5JIhByZWRpcmVjdF90bwY7AEZJIgYvBjsARkkiEF9jc3JmX3Rva2VuBjsARkkiMUNwc2RLWXNkcjNoWjhnb2hhRGxNSWJOVUtJOTZCYkIzQ3Y5Y3VBQVZ2TTQ9BjsARg%3D%3D--98c04f4a8832d184be74054c524b07f93694c22e', 'signed_at': '1707117683'}


In [104]:
## Check if successfully logged in by accessing the home page

time.sleep(10)

url2 = 'https://theoldreader.com/'
home_page = session.get(url2, cookies = cookies)  ## Request the home page (url2), not the sign-in page

soup2 = BeautifulSoup(home_page.content, 'html.parser')

In [105]:
pattern = re.compile("Aditya Satpute")
ele = soup2.find(string = pattern)
ele.parent

<script type="text/javascript">
  if (typeof Bugsnag !== "undefined" && Bugsnag !== null) {
    Bugsnag.releaseStage = "production";
    Bugsnag.notifyReleaseStages = ['production'];
    Bugsnag.user = {
      id: "e2bf1b9f79afec98b240bd1c",
      name: "Aditya Satpute",
      email: "adiallmight@gmail.com"
    }
  }
</script>

- Successfully logged in, verified by finding the HTML container that holds the ID, name and email ID.
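
The same check can be written as a reusable helper. This is a sketch that assumes the page embeds a `Bugsnag.user` block shaped like the one shown above; the function name `find_user_name` and the sample values are hypothetical.

```python
import re

def find_user_name(html):
    """Return the name value from a Bugsnag.user block, or None if absent."""
    match = re.search(
        r'Bugsnag\.user\s*=\s*\{.*?name:\s*"([^"]*)"', html, re.DOTALL
    )
    return match.group(1) if match else None

# Placeholder page content with the assumed structure
sample_page = """
<script type="text/javascript">
  Bugsnag.user = {
    id: "abc123",
    name: "Jane Doe",
    email: "jane@example.com"
  }
</script>
"""
print(find_user_name(sample_page))
```

A return value of `None` would suggest that the session is not logged in, or that the page layout has changed.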