# 0_Scrape_Omega

## 1. Import Libraries

In [27]:
# Import libraries
import pandas as pd
import requests
from bs4 import BeautifulSoup
import json

## 2. Scrape Omega Comments

We decided to scrape the watchuseek.com forum because the subreddit doesn't have enough comments.

### 2.1. Scrape All Threads From 50 Pages

In [28]:
# Set base and root urls
base_url = 'https://www.watchuseek.com/forums/omega.20/'
root_url = 'https://www.watchuseek.com'

In [29]:
# Check response
response = requests.get(base_url)
response

<Response [200]>

In [30]:
# Initiate soup
soup = BeautifulSoup(response.text, 'lxml')

In [32]:
# CHANGE THIS - Set number of forum listing pages to scrape
no_main_pages = 50

# Dictionary for thread links
thread_urls_dict = {}

# Get all thread links
for i in range(1, no_main_pages + 1):

    # Page 1 is base_url; other pages use https://www.watchuseek.com/forums/omega.20/page-n format
    page_url = base_url if i == 1 else base_url + 'page-' + str(i)
    response = requests.get(page_url)
    soup = BeautifulSoup(response.text, 'lxml')
    threads = soup.find_all('div', class_='california-thread-item')

    # Save urls in thread_urls
    thread_urls = []
    for thread_item in threads:
        thread_urls.append(thread_item.find('a', class_="thread-title--gtm")['href'])

    # Add thread urls to dict
    thread_urls_dict[i] = thread_urls
        
# Print thread_urls_dict        
thread_urls_dict

{1: ['/threads/omega-forum-members-face-photo-gallery.190533/',
  '/threads/the-seamaster-story-from-1957-to-2014-a-pictorial-identification-guide.1394618/',
  '/threads/another-phishing-scam.930424/',
  '/threads/we-do-not-provide-valuations-on-this-forum.1039326/',
  '/threads/authorized-dealer-information-and-prices.323618/',
  '/threads/which-one-would-you-choose-and-why.5554126/',
  '/threads/official-speedmaster-club-thread.399373/',
  '/threads/how-to-engage-more-in-watches-one-watch-guy-problems.5554665/',
  '/threads/the-wruw-mega-thread.4537437/',
  '/threads/the-aqua-terra-and-railmaster-photo-thread.535524/',
  '/threads/question-about-a-seamaster-300-value.5554626/',
  '/threads/when-you-cheat-on-omega-who-is-it-with.5357215/',
  '/threads/which-omega-are-you-wearing-today.5525721/',
  '/threads/2022-seamaster-300m-losing-time.5554435/',
  '/threads/watch-polishing.5553321/',
  '/threads/seamaster-300-master-co-axial-owners-thread.1982290/',
  '/threads/official-seamaster-

We now have the thread links from all 50 forum pages. Next, we scrape the comments from each thread.
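The page-URL pattern used above (page 1 is the base URL, later pages append `page-n`) can be sketched as a small helper. The second function is an assumption on our part: if pinned threads are listed on every forum page, the same thread URL would appear under several keys of `thread_urls_dict`, so flattening with de-duplication avoids scraping a thread more than once. The names `build_page_url` and `unique_thread_urls` are hypothetical and not part of the original notebook.

```python
def build_page_url(base_url, page):
    """Return the forum listing URL for a 1-based page number."""
    # Page 1 is the bare forum URL; later pages append 'page-n'.
    return base_url if page == 1 else f"{base_url}page-{page}"


def unique_thread_urls(thread_urls_dict):
    """Flatten {page: [urls]} into one list, keeping only first occurrences."""
    seen = set()
    unique = []
    for page in sorted(thread_urls_dict):
        for url in thread_urls_dict[page]:
            if url not in seen:
                seen.add(url)
                unique.append(url)
    return unique
```

Whether duplicates actually occur depends on how the forum lists pinned threads, so it is worth comparing `len(unique_thread_urls(thread_urls_dict))` against the raw total before relying on this.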

### 2.2. Scrape Comments From Each Thread

In [35]:
# List for comments data
comments_data = []

# Go through each thread url
for i, thread_urls in thread_urls_dict.items():
    for thread_url in thread_urls:
        thread_page_url = root_url + thread_url
        response = requests.get(thread_page_url)
        soup = BeautifulSoup(response.text, 'lxml')
        
        # Scrape all comments
        comments = soup.find_all('div', class_="message-cell message-info-block")
        
        # Go through each comment
        for comment in comments:
            # Splitting the string to extract name and date, and ignoring the time
            parts = comment.find('div', class_="message-userContent lbContainer js-lbContainer")['data-lb-caption-desc'].split('·')
            name = parts[0].strip()
            date_time_part = parts[1].strip()
            date = ' '.join(date_time_part.split()[:3])

            # Save comment in comments_data
            comments_data.append({
                'body': comment.find('div' ,class_="bbWrapper", itemprop="text").text.replace('\n',''),
                'author': name,
                'date': date,
                'like': False
            })    

            # Flag the comment just appended if it has likes (more than one <bdi> tag)
            if len(comment.find_all('bdi')) > 1:
                comments_data[-1]['like'] = True
        
comments_data

[{'body': "Don't by shy! You can show friends and family too. Can you make this a stickey Joe?                                                              My favorite nephew and his daughter my great niece                                                           ",
  'author': 'john wilson',
  'date': 'Oct 5, 2008',
  'like': True},
 {'body': 'Me and the Niece (and the PO)',
  'author': 'davieg10c',
  'date': 'Oct 5, 2008',
  'like': False},
 {'body': "Good call John .....it's nice to put faces with the names.Me ....The Skirt ....The Outlaws ...- David",
  'author': 'DMB',
  'date': 'Oct 5, 2008',
  'like': False},
 {'body': "Great thread I'm sure Joe or Al will do the honours and stick this?Have to say JW you are one hell of a poser!!Me and the wifeMy pride and joyAnd one of me and him",
  'author': 'spogehead',
  'date': 'Oct 5, 2008',
  'like': False},
 {'body': "You look like Ron Jeremy in that pic':-d",
  'author': 'Fatpants',
  'date': 'Oct 5, 2008',
  'like': False},
 {'body':

We now have the comments from every thread on the 50 forum pages. Next, we save them to a JSON file for cleaning.
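The author and date in the scraping loop come from splitting the `data-lb-caption-desc` attribute on `·` and keeping the first three tokens of the date part. A minimal sketch of that step as a standalone function, so it can be checked without hitting the site; `parse_caption` is a hypothetical name, and the sample caption in the usage line is an assumed format, not one copied from the forum.

```python
def parse_caption(caption):
    """Split a 'name · date time' caption into (name, date)."""
    parts = caption.split('·')
    name = parts[0].strip()
    # Keep only the first three tokens of the date part, e.g. 'Oct 5, 2008'.
    # Return an empty date instead of raising if the separator is missing.
    date = ' '.join(parts[1].split()[:3]) if len(parts) > 1 else ''
    return name, date
```

For example, `parse_caption('john wilson · Oct 5, 2008 at 10:15 AM')` gives `('john wilson', 'Oct 5, 2008')`. Returning an empty date for a malformed caption is a design choice made here so that one unexpected post does not abort a long scrape.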

### 2.3. Save Comments To JSON File

In [36]:
# Save the comments to a JSON file
with open('data/omega.json', 'w') as f:
    json.dump(comments_data, f, indent=4)

print("Comments have been saved to omega.json")

Comments have been saved to omega.json
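The save cell assumes that the `data/` directory already exists; `open(..., 'w')` raises `FileNotFoundError` otherwise. A minimal sketch that creates the parent directory first is shown below. `save_comments` is a hypothetical helper, not part of the original notebook.

```python
import json
from pathlib import Path


def save_comments(comments, path):
    """Write comments to a JSON file, creating parent directories if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open('w', encoding='utf-8') as f:
        json.dump(comments, f, indent=4)
    return path
```

Usage would be `save_comments(comments_data, 'data/omega.json')`.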


In [37]:
# Check comments length
with open('data/omega.json','r') as f:
    totalcomments = json.load(f)
    
len(totalcomments)

27801

We got 27,801 comments from the Omega forum.  
We will sample the Omega comments to match the Cartier comments.
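One way the sampling step could look, assuming "match" means drawing the same number of Omega comments as there are Cartier comments. The Cartier count is not given in this notebook, so `n` is left as a parameter; `sample_comments` and the default seed are hypothetical choices for illustration.

```python
import random


def sample_comments(comments, n, seed=42):
    """Return n comments drawn without replacement, reproducibly."""
    if n >= len(comments):
        # Nothing to downsample; return a copy of everything.
        return list(comments)
    rng = random.Random(seed)  # local generator keeps the draw repeatable
    return rng.sample(comments, n)
```

Fixing the seed means rerunning the notebook yields the same subset, which keeps later cleaning and modelling steps comparable across runs.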