# __Individual Project 1__
#### 001 - Yumi Jin 

### __Part 1: Scraping and Saving HTML Content__

#### __1.1 Identify the Target__  
_Start with navigating to the “free” section on the Craigslist San Francisco Bay Area site (https://sfbay.craigslist.org/search/zip).
This page lists items that people are giving away for free._

#### __1.2 Interact with the Page-Sorting__
_Initially, the listings might be sorted by “newest” first.  Try changing the sorting order to “oldest” first by interacting with the page’s UI.  
Observe any changes in the URL after you change the sorting order back and forth._
> __Finding:__  
'newest' first: https://sfbay.craigslist.org/search/zip?sort=date#search=1~gallery~0~0  
'oldest' first: https://sfbay.craigslist.org/search/zip?sort=dateoldest#search=1~gallery~0~0  
The URL changes from __'sort=date'__ to __'sort=dateoldest'__ when I switch the sorting order from newest to oldest, and back again when I switch it back.  



_Can you trigger the sorting change directly by modifying only the URL in your browser’s address bar?  If so, how?_    
> __Answer:__ Yes. Appending 'oldest' to the value 'date' in the URL (so that sort=date becomes sort=dateoldest) and reloading the page triggers the sorting change.  


_Explain what type of request is made when you change the sort order (GET or POST)._  
> __Answer:__ A __GET__ request is made when changing the sort order, because the sort option is passed in the URL query string.


_What is the variable in the URL associated with sorting?_
> __Answer:__ The variable in the URL associated with sorting on the Craigslist pages is __sort__. The value date sorts the listings to show the newest first, while dateoldest sorts to show the oldest listings first.
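
The role of `sort` can be checked offline by splitting the two URLs from the findings above with the standard library. This is a minimal sketch; `sort_value` is a hypothetical helper name, not part of the assignment.

```python
from urllib.parse import urlsplit, parse_qs

NEWEST = 'https://sfbay.craigslist.org/search/zip?sort=date#search=1~gallery~0~0'
OLDEST = 'https://sfbay.craigslist.org/search/zip?sort=dateoldest#search=1~gallery~0~0'

def sort_value(url):
    """Return the value of the 'sort' query parameter, or None if it is absent."""
    query = parse_qs(urlsplit(url).query)
    return query.get('sort', [None])[0]

print(sort_value(NEWEST))
print(sort_value(OLDEST))
```

The part after `#` is the URL fragment, which `urlsplit` reports separately from the query string.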

#### __1.3 Interact with the Page-Pagination__
_Craigslist paginates listings, typically displaying a limited number of items per page.
Navigate to the second and third pages of results and observe the changes in the URL._
> __Findings:__  
Page1: https://sfbay.craigslist.org/search/zip?sort=dateoldest#search=1~gallery~0~0  
Page2: https://sfbay.craigslist.org/search/zip?sort=dateoldest#search=1~gallery~1~0  
Page3: https://sfbay.craigslist.org/search/zip?sort=dateoldest#search=1~gallery~2~0  
The variable after #search=1\~gallery\~ increases by 1 with each subsequent page, starting from 0 for the first page, 1 for the second, and 2 for the third. 


_Determine how to move between pages by only changing the URL.  What part of the URL changes as you navigate through different pages?  
This task will help you understand how pagination works on Craigslist and how you can programmatically access different pages of listings._  
> __Answer:__ Changing the number after #search=1\~gallery\~ in the URL moves between pages.


_Identify the variable associated with page changes.  How does altering this variable in the URL affect the page you’re viewing?_
> __Answer:__ The variable associated with page changes on Craigslist is the number that follows #search=1\~gallery\~ in the URL. Altering this number changes the page we're viewing. For instance, changing this number to 0 brings us to the first page, 1 to the second page, and so on. 
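
The page URLs listed in the findings can be generated rather than typed by hand. The sketch below assumes the fragment pattern observed above (`search=1~gallery~<page>~0`) holds for every page; `page_url` and `BASE` are hypothetical names.

```python
from urllib.parse import urlsplit, urlunsplit

BASE = 'https://sfbay.craigslist.org/search/zip?sort=dateoldest'

def page_url(page_index, base=BASE):
    """Build the address-bar URL for a zero-based results page."""
    parts = urlsplit(base)
    fragment = f'search=1~gallery~{page_index}~0'
    return urlunsplit(parts._replace(fragment=fragment))

for i in range(3):
    print(page_url(i))
```

One caveat: the fragment is handled by the browser and is not sent to the server in an HTTP request, so fetching these URLs with `requests` may return the same HTML for every page index.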

#### __1.4 Fetch Listing URLs__

In [294]:
from bs4 import BeautifulSoup, NavigableString
import requests
import time

In [14]:
headers = {'User-agent': 'Mozilla/5.0'}

# Use requests to access the first page of the “free” section, ordered “newest” first
time.sleep(5)
url = 'https://sfbay.craigslist.org/search/zip?sort=date#search=1~gallery~0~0'
page = requests.get(url, headers = headers)

# Deploy BeautifulSoup to parse the HTML content
soup1 = BeautifulSoup(page.content, 'html.parser')

# print(soup1.prettify())

In [124]:
# Identify the structure that holds the links to individual listing pages
# Choose selector 'li.class' to grab the link
listings = soup1.select('.cl-static-search-result')

links = []
for listing in listings:
    link = listing.find('a')['href']
    links.append(link) 
#     print(link)
# len(links)

In [126]:
# One more possible selection method to retrieve the link to the individual listing
# Using '.parent' to retrieve
listings_parent = soup1.select('.title')
links = []
for listing in listings_parent:
    listing = listing.parent
    listing = listing.parent
    link = listing.find('a')['href']
    links.append(link) 
#     print(link)
# len(links)

In [136]:
# Drop duplicates (keeping first-seen order), then keep the first 250 listing URLs
links_250 = list(dict.fromkeys(links))[:250]

_Explain your strategy:_
> __(Method 1)__ I directly select elements with the '.cl-static-search-result' class and retrieve the href attribute of nested \<a> tags.  
__(Method 2)__ I select '.title' class elements and traverse up two levels in the DOM to reach the ancestor element containing the \<a> tag, from which I extract the href.  

> Both methods populate a list with the extracted URLs. To manage the number of links and work with a subset, I then slice the list to get the first 250 unique links as required.
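
The two methods can be compared on a small sample. The markup below is a hypothetical, simplified structure modeled on the selectors used above, not a copy of the real Craigslist page.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup: each result is an <li> wrapping an <a>,
# which in turn wraps a .title element.
SAMPLE_HTML = """
<ol>
  <li class="cl-static-search-result" title="Free chair">
    <a href="https://example.org/1.html"><div class="title">Free chair</div></a>
  </li>
  <li class="cl-static-search-result" title="Free desk">
    <a href="https://example.org/2.html"><div class="title">Free desk</div></a>
  </li>
</ol>
"""

sample_soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# Method 1: select each result item, then read the href of its first <a>.
links_by_class = [li.find('a')['href']
                  for li in sample_soup.select('.cl-static-search-result')]

# Method 2: start at each .title element and climb two levels to the <li>.
links_by_parent = [t.parent.parent.find('a')['href']
                   for t in sample_soup.select('.title')]

print(links_by_class == links_by_parent)
```

Method 1 depends only on the class of the result item, while Method 2 also depends on how deeply the title is nested, so Method 1 is likely the less fragile choice if the page layout changes.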



In [135]:
# Print the list to screen
links_250

['https://sfbay.craigslist.org/eby/zip/d/hercules-free-rowing-machine-working/7715249756.html',
 'https://sfbay.craigslist.org/nby/zip/d/santa-rosa-free-stuffbox-of-allen/7706014251.html',
 'https://sfbay.craigslist.org/sfc/zip/d/san-francisco-scrap-wood-redwood/7715249012.html',
 'https://sfbay.craigslist.org/sby/zip/d/san-jose-free-kenmore-freezer/7715248687.html',
 'https://sfbay.craigslist.org/sfc/zip/d/san-francisco-curb-alert-california/7715247484.html',
 'https://sfbay.craigslist.org/pen/zip/d/san-mateo-36-semi-round-mirror/7711654041.html',
 'https://sfbay.craigslist.org/eby/zip/d/berkeley-beautiful-quilted-queen-sized/7715247045.html',
 'https://sfbay.craigslist.org/sfc/zip/d/san-francisco-free-packing-materials/7715246741.html',
 'https://sfbay.craigslist.org/eby/zip/d/alameda-blue-task-chair-with-wheels/7715246629.html',
 'https://sfbay.craigslist.org/eby/zip/d/berkeley-outdoor-dining-chairs/7715246620.html',
 'https://sfbay.craigslist.org/nby/zip/d/el-verano-plastic-storage

#### __1.5 Save HTML Pages__

In [146]:
for link in links_250:
    time.sleep(6)
    
    # Extract the listing ID from the URL
    link_id = link.split('/')[-1].replace('.html', '')
    
    # Define the filename using the listing ID
    filename = f"{link_id}.html"
    
    # For each of the 250 listing URLs, use `requests` to fetch the listing page
    page = requests.get(link, headers = headers)
    soup = BeautifulSoup(page.content, 'html.parser')

    # Save each HTML content to a separate file on disk
    with open(filename, 'w', encoding='utf-8') as file:
        file.write(soup.prettify())
    print(f"Successfully saved {filename}")

Successfully saved 7715249756.html
Successfully saved 7706014251.html
Successfully saved 7715249012.html
Successfully saved 7715248687.html
Successfully saved 7715247484.html
Successfully saved 7711654041.html
Successfully saved 7715247045.html
Successfully saved 7715246741.html
Successfully saved 7715246629.html
Successfully saved 7715246620.html
Successfully saved 7715246248.html
Successfully saved 7715245797.html
Successfully saved 7715245134.html
Successfully saved 7715244755.html
Successfully saved 7715244725.html
Successfully saved 7715241481.html
Successfully saved 7715244317.html
Successfully saved 7714415886.html
Successfully saved 7715243772.html
Successfully saved 7715243540.html
Successfully saved 7715243342.html
Successfully saved 7713101646.html
Successfully saved 7712180575.html
Successfully saved 7715242057.html
Successfully saved 7715241905.html
Successfully saved 7715241780.html
Successfully saved 7715241610.html
Successfully saved 7711916393.html
Successfully saved 7
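
The listing ID is the last path component of the URL without its extension. Below is a sketch of a suffix-safe way to derive it using only the standard library; `listing_id` is a hypothetical helper name. It also shows why character stripping with `str.rstrip` is not a reliable way to drop an extension.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit

def listing_id(url):
    """Return the final path component of a URL without its extension."""
    return PurePosixPath(urlsplit(url).path).stem

url = 'https://sfbay.craigslist.org/eby/zip/d/hercules-free-rowing-machine-working/7715249756.html'
print(listing_id(url))

# str.rstrip removes any trailing characters that belong to the given set,
# so it can eat part of a name that ends in '.', 'h', 't', 'm' or 'l'.
print('math.html'.rstrip('.html'))
```

Numeric IDs are unaffected by the `rstrip` behavior because digits are not in the stripped set, but a stem-based approach does not depend on that coincidence.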

In [148]:
# !tar czf myfiles.tar.gz *.html (download all the html files from Jupyter Notebook Server)

---

### __Part 2: Parsing and Displaying Information from Saved HTML__

#### __2.1 Read Saved HTML Files__

In [374]:
import os
# Get the current working directory
current_working_dir = os.getcwd()
current_working_dir

'/Users/yumi'

In [375]:
# Set the relative path
directory = 'ucdavis/Winter Quarter/BAX-422/Project'

# Loop through each saved HTML file from the path 
for filename in os.listdir(directory):
    # Check if the file ends with .html
    if filename.endswith(".html"):
        # Construct the full file path using the relative directory
        filepath = os.path.join(directory, filename)
        # Read the file content
        with open(filepath, 'r', encoding='utf-8') as file:
            html_content = file.read()
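
In the loop above, `html_content` is overwritten on each iteration, so only the last file's content remains once the loop finishes. If all pages are needed at once, one option is to collect them in a dictionary. This is a sketch using `pathlib`; `read_html_files` is a hypothetical helper name.

```python
from pathlib import Path

def read_html_files(directory):
    """Map each .html filename in `directory` to its text content."""
    pages = {}
    for path in sorted(Path(directory).glob('*.html')):
        pages[path.name] = path.read_text(encoding='utf-8')
    return pages

# Example call with the relative path used in this notebook:
#   pages = read_html_files('ucdavis/Winter Quarter/BAX-422/Project')
```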

#### __2.2 Extract Information__

In [388]:
n = 0
# Loop through each saved HTML file from the path 
for filename in os.listdir(directory):
    # Check if the file ends with .html
    if filename.endswith(".html"):
        n = n + 1
        # Construct the full file path using the relative directory
        filepath = os.path.join(directory, filename)
        # Read the file content
        with open(filepath, 'r', encoding='utf-8') as file:
            html_content = file.read()
            # For each HTML file, use `BeautifulSoup` to parse the file content
            soup = BeautifulSoup(html_content, 'html.parser')
            
            # -------------------------------------------------------------
            # Extract and print the following details for each URL
            
            # Print the index of the websites
            print('\033[1mWebsite\033[0m ' + str(n))
            # -------------------------------------
            # Extract and Print Titles
            titles = soup.select('#titletextonly')
            # Check if a title was found and print it; otherwise, print 'Not Found'
            if titles:
                for title in titles:
                    print('Title: ' + title.text.strip())
            else:
                print('Title: Not Found')
            # -------------------------------------  
            # Extract and Print Image_urls
            image_urls = soup.select('img[title = "1"]')
            if image_urls:
                for image_url in image_urls:
                    print('Image_urls: ' + image_url['src'].strip())
            else:
                print('Image_urls: Not Found')
            # ------------------------------------- 
            # Extract and Print Descriptions
            posting_body = soup.find('section', id = 'postingbody')
            if posting_body:
                # Extract all text nodes directly under 'postingbody', not within child tags
                text_nodes = [text for text in posting_body if isinstance(text, NavigableString)]
                # Join the text nodes to form a single string, and strip it 
                description = ''.join(text_node.strip() for text_node in text_nodes).strip()
                # If description is empty, print 'Not Found', otherwise print the content
                if description:
                    print('Description: ' + description)
                else:
                    print('Description: Not Found')
            else:
                print('Description: Not Found')
            # ------------------------------------- 
            # Extract and Print Post ID
            # select_one returns a single element (or None), so no loop is needed
            post_id = soup.select_one('.postinginfos > p')
            if post_id:
                print(post_id.get_text(strip=True))
            else:
                print('post id: Not Found')
            # ------------------------------------- 
            # Extract and Print Posted Date
            post_time = soup.select_one('time')
            if post_time:
                post_date = post_time.get_text(strip=True)
                print('Posted Date: ' + post_date)
            else:
                print('Posted Date: Not Found')
            # ------------------------------------- 
            # Extract and Print Last Updated Date
            update_dates = soup.select('time')
            # Extract the content of the third time element, or return 'Not Found' if < 3                 
            if update_dates:
                if len(update_dates) >= 3:
                    update_date = update_dates[2].text.strip() 
                    print('Last Updated Date: ' + update_date)
                # For posts not updated after posted, return the last updated date to posted date
                else:
                    print('Last Updated Date: ' + post_date.strip())  
            else:
                print('Last Updated Date: Not Found')
        
            print('--------------------------------------------------------------------------')
            

Website 1
Title: Not Found
Image_urls: Not Found
Description: Not Found
post id: Not Found
Posted Date: Not Found
Last Updated Date: Not Found
--------------------------------------------------------------------------
Website 2
Title: portable crib (graco?)
Image_urls: https://images.craigslist.org/00C0C_jdpU8Utkg0Y_0t20CI_600x450.jpg
Description: portable crib
post id: 7712180575
Posted Date: 2024-01-28 15:30
Last Updated Date: 2024-02-06 22:19
--------------------------------------------------------------------------
Website 3
Title: Retro TV - Good for Retro Gaming
Image_urls: https://images.craigslist.org/00F0F_h3d0GBuVExe_0CI0t2_600x450.jpg
Description: FREEOlder  TV  -  Sony Trinitron XBRFrom the 1990'sA perfect monitor for retro gaming - LARGE screenWorks well with old gaming systems like  Super Mario Nintendo, Atari,In working conditionIncludes a Sony remote controllerTV overall measures 33" wide, 23"deep, 26" high - with a 32" screenNot a flat screen, h
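
In the output above, the line breaks of the original description are lost ("FREEOlder", "XBRFrom") because the text fragments are joined with an empty string. A possible alternative is to join the non-empty fragments with a space. The sketch below uses plain strings as stand-ins for the text nodes (`NavigableString` is a `str` subclass, so the same function should apply); `join_text_nodes` and the sample fragments are hypothetical.

```python
def join_text_nodes(text_nodes, sep=' '):
    """Strip each fragment, drop the empty ones, and join the rest with `sep`."""
    cleaned = (node.strip() for node in text_nodes)
    return sep.join(piece for piece in cleaned if piece)

# Hypothetical fragments, shaped like the text nodes of a posting body.
fragments = ['\n', 'FREE\n', '\n', 'Older  TV  -  Sony Trinitron XBR\n', "From the 1990's\n"]

print(join_text_nodes(fragments))
```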

---

### __Part 3: Automating Login on The Old Reader__

#### __3.1 Creating and Verifying a The Old Reader Account__  
_Account Creation:  Create an account on https://theoldreader.com. Use an email address and password that you are comfortable sharing with us._  

_Manual Login Verification: Before automating the login process, ensure you can manually log in to theoldreader.com with your new credentials.  This confirms that your account is active and your credentials are correct._

#### __3.2 Exploring the Login Mechanism__  
_Navigate to the login page of https://theoldreader.com._   
_Use your browser’s developer tools to inspect the page, focusing on the `<form>` tag involved in the login process._  

In [394]:
url = 'https://theoldreader.com/users/sign_in'
page = requests.get(url, headers = headers)
soup = BeautifulSoup(page.content, 'html.parser')

In [413]:
# Extract and Print <form> tag
form = soup.select('form')
print(form)

[<form accept-charset="UTF-8" action="/users/sign_in" class="new_user" id="new_user" method="post" role="form"><div style="margin:0;padding:0;display:inline"><input name="utf8" type="hidden" value="✓"/><input name="authenticity_token" type="hidden" value="kbl8UpKY09LW2A+wbN00VLvDd1V8zi+E0dUQqakIAPM="/></div>
<div class="form-group">
<input autocapitalize="off" autocorrect="off" autofocus="autofocus" class="form-control" id="user_login" name="user[login]" placeholder="Username/Email" size="30" spellcheck="false" type="text"/>
</div>
<div class="form-group">
<input class="form-control" id="user_password" name="user[password]" placeholder="Password" size="30" type="password"/>
</div>
<input class="btn btn-primary btn-block" name="commit" type="submit" value="Sign In"/>
</form>]


_Document all `<input>` fields within the login form, paying special attention to their name attributes. These fields are crucial for submitting the login request programmatically._

In [406]:
# Extract and Print all <input> fields within the login form
# (scope the selector to the form so inputs elsewhere on the page are excluded)
input_fields = soup.select('form#new_user input')
for input_field in input_fields:
    print(input_field.prettify())

<input name="utf8" type="hidden" value="✓"/>

<input name="authenticity_token" type="hidden" value="kbl8UpKY09LW2A+wbN00VLvDd1V8zi+E0dUQqakIAPM="/>

<input autocapitalize="off" autocorrect="off" autofocus="autofocus" class="form-control" id="user_login" name="user[login]" placeholder="Username/Email" size="30" spellcheck="false" type="text"/>

<input class="form-control" id="user_password" name="user[password]" placeholder="Password" size="30" type="password"/>

<input class="btn btn-primary btn-block" name="commit" type="submit" value="Sign In"/>



#### __3.3 Analyzing Network Traffic for Login Request__  
_With the network tab of your browser’s developer tools open, log in to the site again._  

_Identify the network request made when you submit the login form (GET or POST).  Explain why this method was chosen._
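> __Answer:__ The network request made when submitting the login form is __POST__. A POST request carries the form data in the request body instead of the URL, so the username and password do not end up in the browser history, bookmarks, or server access logs the way GET query strings can. POST by itself does not encrypt anything (HTTPS does that), but it is the conventional method for submitting sensitive or state-changing data such as login credentials.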
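
The difference in where the data travels can be seen without sending anything over the network. The sketch below only constructs request objects with the standard library; the URL and field values are placeholders, not the real login endpoint or credentials.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Placeholder fields, not real credentials.
fields = {'user[login]': 'someone@example.com',
          'user[password]': 'not-a-real-password'}
encoded = urlencode(fields)

# GET: the fields become part of the URL itself.
get_request = Request('https://example.org/login?' + encoded)

# POST: the URL stays clean and the fields travel in the request body.
post_request = Request('https://example.org/login', data=encoded.encode('utf-8'))

print(get_request.get_method(), get_request.full_url)
print(post_request.get_method(), post_request.full_url, post_request.data)
```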



_Carefully examine the payload that was submitted to the server during login.  Compare this payload to the `<form>` / `<input>` fields you previously analyzed. Explain your observation._
> __Payload:__   
utf8: ✓  
authenticity_token: +SkYTctswL5+ApXR74w8jnSs4wzYV8iS5itPuwF/6/Q=  
user[login]: loviajin@outlook.com  
user[password]: q1w2e3r4  
commit: Sign In 

>__Findings:__ The payload keys match the name attributes of the form's `<input>` fields one-to-one (utf8, authenticity_token, user[login], user[password], commit). The only value that differs from the HTML fetched earlier is the authenticity token, which is expected given the dynamic and session-specific nature of such tokens. The remaining values are either fixed by the form (utf8, commit) or what I typed (login and password), so the payload accurately represents my input, provided the token is valid for the session in which it is submitted.





 

#### __3.4 Automating the Login Process__

In [415]:
# Using Python and appropriate libraries like requests, simulate the login process

# Create a session object to maintain login state across multiple requests
session = requests.Session()

# Prepare a payload with login credentials
payload = {
    'utf8': '✓',
    'authenticity_token': '+SkYTctswL5+ApXR74w8jnSs4wzYV8iS5itPuwF/6/Q=',
    'user[login]': 'loviajin@outlook.com',
    'user[password]': 'q1w2e3r4',
    'commit': 'Sign In'
}

# Send a POST request to the login form’s action URL to log in, using the session object
response = session.post(url, data = payload, timeout = 20)
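
Because the authenticity token differs between the page fetched in 3.2 and the browser payload in 3.3, a token copied from the browser may not be valid for a fresh `requests.Session`. One option is to read the token from the login page fetched with the same session and build the payload from the form itself. The sketch below shows only the parsing step on a trimmed copy of the form from 3.2; `build_login_payload`, `my_login`, `my_password` and the placeholder token are hypothetical.

```python
from bs4 import BeautifulSoup

def build_login_payload(login_html, login, password):
    """Collect the form's input defaults, then fill in the credentials."""
    form = BeautifulSoup(login_html, 'html.parser').select_one('form#new_user')
    payload = {
        field['name']: field.get('value', '')
        for field in form.select('input[name]')
    }
    payload['user[login]'] = login
    payload['user[password]'] = password
    return payload

# Trimmed copy of the form printed in section 3.2; the token is a placeholder.
SAMPLE_FORM = '''
<form action="/users/sign_in" id="new_user" method="post">
  <input name="utf8" type="hidden" value="✓"/>
  <input name="authenticity_token" type="hidden" value="PLACEHOLDER_TOKEN"/>
  <input id="user_login" name="user[login]" type="text"/>
  <input id="user_password" name="user[password]" type="password"/>
  <input name="commit" type="submit" value="Sign In"/>
</form>
'''

print(build_login_payload(SAMPLE_FORM, 'someone@example.com', 'not-a-real-password'))

# With a live session, the helper would be used roughly like this:
#   login_page = session.get(url, headers=headers)
#   payload = build_login_payload(login_page.text, my_login, my_password)
#   response = session.post(url, data=payload, timeout=20)
```

Reading the hidden fields from the page also avoids hard-coding values such as `utf8` and `commit`.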

#### __3.5 Verifying Successful Login__

In [416]:
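# Check if login was successful
# (response.ok only reports a non-error HTTP status; a rejected login may still
# return a normal page, so the page content is verified in a later cell)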
if response.ok:
    print("Login successful!")
else:
    print("Login failed. Status code:", response.status_code)

Login successful!


_After attempting to log in, inspect the cookies saved in the session object to understand the information The Old Reader stores on your computer._

In [417]:
# Check cookies saved in the session object
cookies = session.cookies.get_dict()
print("Cookies saved after login:", cookies)
# Cookies usually include session identifiers and tokens that websites use 
# to recognize logged-in users

Cookies saved after login: {'remember_user_token': 'BAhbB1sGVToaTW9wZWQ6OkJTT046Ok9iamVjdElkIhFUNMKzubDJMOnbAs1JIiIkMmEkMDUkUEkxdnUyREZ3cjRtNG9FeThxQTB6ZQY6BkVU--bae700b4f6c02da099c2f721269ddc294014bc2a', 'i_know_you': 'yumi', '_new_reader_session': 'BAh7CkkiD3Nlc3Npb25faWQGOgZFVEkiJTgwOWQ2NWIwNjZmYjMzMThmYzM1ZGM3ZWY3MGVkMTZiBjsAVEkiGXdhcmRlbi51c2VyLnVzZXIua2V5BjsAVFsHWwZVOhpNb3BlZDo6QlNPTjo6T2JqZWN0SWQiEVQ0wrO5sMkw6dsCzUkiIiQyYSQwNSRQSTF2dTJERndyNG00b0V5OHFBMHplBjsAVEkiDWxhbmd1YWdlBjsARjoHZW5JIhByZWRpcmVjdF90bwY7AEZJIgYvBjsARkkiEF9jc3JmX3Rva2VuBjsARkkiMW1IZnZsQ2RqMmtZa2RMQjNObkZ0ZWU4czA5cVFibFR2RWFMV3REbEhTNm89BjsARg%3D%3D--4e42e2149ea37ab48e819430a29b46cfe0890eb6', 'signed_at': '1707372003'}
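
Among the cookies, `signed_at` looks like a timestamp. Assuming it counts seconds since the Unix epoch (an assumption, not something the site documents here), it can be converted to a readable date:

```python
from datetime import datetime, timezone

signed_at = '1707372003'  # value of the 'signed_at' cookie shown above

# Assumption: the value is seconds since the Unix epoch.
signed_time = datetime.fromtimestamp(int(signed_at), tz=timezone.utc)
print(signed_time.isoformat())
```

Under that assumption the value falls on the day the login was performed, which suggests the cookie records when the session was signed in.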


In [425]:
time.sleep(5)

# Use the session object to access the new url
# (the session already carries the login cookies, so they need not be passed again)
url_new = 'https://theoldreader.com'
page = session.get(url_new)
soup = BeautifulSoup(page.content, 'html.parser')
# print(soup)

In [443]:
# Verify successful login by finding elements that contain user information
user_info = soup.select_one('a.dropdown-toggle')
print(user_info)

if user_info:
    print("\nSuccessfully logged in. \nUser information:", user_info.text)
else:
    print("Login failed or user information not found.")

<a class="dropdown-toggle" data-hover="dropdown" data-toggle="dropdown" href="#" title="yumi">yumi  <i class="fa fa-caret-down"></i></a>

Successfully logged in. 
User information: yumi  
