# How to Scrape TikTok Sounds + Hashtags

This tutorial covers the following:

1. [Connecting to an existing (open) Chrome instance](#sec1)
2. [Getting videos from a TikTok **sound** page](#sec2)
3. [Getting videos from a TikTok **hashtag** page](#sec3)

<a id="sec1"></a>
## Create Chrome Instance

**Important:** For this to work, you should already have a Chrome instance running on your computer with remote debugging enabled. To start one, open a **new terminal** and run the command for your operating system (see below).


**On Mac:**
```
/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome --remote-debugging-port=9222 --user-data-dir="/tmp/chrome_dev_test"
```

**On Windows:**

```
"C:\Program Files (x86)\Google\Chrome\Application\chrome.exe" --remote-debugging-port=9222 --user-data-dir="C:\selenium\ChromeTestProfile"
```
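
Before moving on, it can help to confirm that Chrome is actually listening on the debugging port. The sketch below queries the `/json/version` endpoint that Chrome exposes when started with `--remote-debugging-port`. The helper names `debugger_url` and `check_debugger` are made up for this illustration, and the port is assumed to be 9222 as in the commands above.

```python
import json
import urllib.request


def debugger_url(port=9222, host="127.0.0.1"):
    """Build the URL of Chrome's DevTools version endpoint."""
    return f"http://{host}:{port}/json/version"


def check_debugger(port=9222, host="127.0.0.1", timeout=2):
    """Return Chrome's version info as a dict, or None if nothing answers."""
    try:
        with urllib.request.urlopen(debugger_url(port, host), timeout=timeout) as resp:
            return json.loads(resp.read().decode("utf-8"))
    except (OSError, ValueError):
        # OSError covers refused connections and timeouts;
        # ValueError covers a response that is not valid JSON.
        return None


info = check_debugger()
if info is None:
    print("No Chrome instance found on port 9222; run the command above first.")
else:
    print("Connected to:", info.get("Browser"))
```

If the check reports no instance, the Selenium code later in this tutorial will most likely fail to attach as well.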

**New installation**

If you don't have the following package, install it once.

In [24]:
pip install webdriver_manager

Note: you may need to restart the kernel to use updated packages.


Now we are ready to scrape!

<a id="sec2"></a>
## Getting Videos From a TikTok *Sound* Page

*Change the `sound` variable in the following code block to match your desired sound.*

In [57]:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys
import time

# Desired sound link
sound = 'https://www.tiktok.com/music/Wait-a-Minute-6523887078063739904?_d=secCgYIASAHKAESPgo8sqbRZTZ2zX9x%2B9rJuB8Cnjw65govDdGujaewNbXUHG7ntU2cVcCoNEgG1nbHd6AVrsvLVbVHbngFdGvEGgA%3D&_r=1&checksum=6831d41e6ef8e4fbde2a3b7eb60a92b702c0ad66ab0afda65128d52476991915&sec_user_id=MS4wLjABAAAAJ1nZQIh2C_NE4ByhWilkuou1mEX3kNOEPOO1MGZWLRMUyr5LJS4Tff9axmcQhR-M&share_app_id=1233&share_link_id=45645C49-D41F-4540-9FAA-4FE5518E2741&share_music_id=6523887078063739904&sharer_language=en&social_share_type=7&source=h5_m&timestamp=1713914993&tt_from=sms&u_code=daflb3i3ebfm06&ug_btm=b2878%2Cb5171&user_id=6786801162035053574&utm_campaign=client_share&utm_medium=ios&utm_source=sms'

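# Set up Chrome options
options = Options()
# Attach to the Chrome instance started with --remote-debugging-port=9222.
# Launch arguments such as --headless are not used here, because Chrome is already running.
options.add_experimental_option("debuggerAddress", "127.0.0.1:9222")

# Download (if needed) and locate a matching ChromeDriver
service = Service(ChromeDriverManager().install())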

# Connect to the existing Chrome browser session
driver = webdriver.Chrome(service=service, options=options)

# Interact with the existing browser session
driver.get(sound)

In [58]:
# Making sure we got the correct sound link
driver.title

'WILLOW - Wait a Minute! | TikTok'
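
The share link above carries a long query string of sharing and tracking parameters. A minimal sketch for trimming a link down to its path is shown here, under the assumption that TikTok serves the same sound page without those parameters (checking `driver.title` again after loading the shortened link is a quick way to verify). The helper name `strip_query` and the shortened example link are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query(url):
    """Return the URL without its query string and fragment."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


# Shortened, hypothetical share link used only for illustration
share_link = "https://www.tiktok.com/music/Wait-a-Minute-6523887078063739904?_r=1&utm_source=sms"
print(strip_query(share_link))
```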

We will use the following class names to find useful HTML elements for scraping. We are creating variables for them so that if these classes change, we can insert the new class names here:

CONTAINER_CLASS = "eegew6e2"\
VIDEO_CLASS = "e19c29qe8"\
DESC_CLASS = "eih2qak4"\
VIDEO_COUNT = "ekmpd5l8"

In [44]:
CONTAINER_CLASS = "eegew6e2"
VIDEO_CLASS = "e19c29qe8"
DESC_CLASS = "eih2qak4"
VIDEO_COUNT = "ekmpd5l8"

### Extracting Videos

Here is a function that will get the posts (both URLs and descriptions of each video):

In [59]:
def getVideosAndDescriptions(driver):
    """
    Given an open driver instance on a TikTok sound page,
    get a list of (URL, description) pairs for the accessible videos.
    """
    # Get the container of the videos
    try:
        container = driver.find_element(By.CLASS_NAME, CONTAINER_CLASS)
    except Exception as e:
        print(f"Container: An unexpected error occurred: {e}")
        return []

    # Get the video elements
    try:
        posts = container.find_elements(By.CLASS_NAME, VIDEO_CLASS)
    except Exception as e:
        print(f"Post: An unexpected error occurred: {e}")
        return []

    # Get the URLs of the videos
    try:
        urls = [post.find_element(By.TAG_NAME, "a").get_attribute('href') for post in posts]
    except Exception as e:
        print(f"URL: An unexpected error occurred: {e}")
        return []

    # Get the description of each post. Since some of them don't have one, we'll add an empty string
    descriptions = []
    for post in posts:
        try:
            desc = post.find_element(By.CLASS_NAME, DESC_CLASS).text
            descriptions.append(desc)
        except Exception:
            descriptions.append('')

    # Pair each URL with its description
    print("Done extracting video data!")
    return list(zip(urls, descriptions))
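
For later analysis, it may be convenient to turn each scraped post into a small dictionary. The sketch below does this and also tries to pull the username and video ID out of the URL. It assumes video links look like `https://www.tiktok.com/@<username>/video/<id>`; that pattern, the helper name `to_records`, and the sample links are assumptions for illustration, not something taken from the scraped data above.

```python
import re

# Assumed URL shape: https://www.tiktok.com/@<username>/video/<numeric id>
VIDEO_URL_PATTERN = re.compile(
    r"tiktok\.com/@(?P<username>[^/?#]+)/video/(?P<video_id>\d+)"
)


def to_records(posts):
    """Convert scraped posts into dictionaries with named fields."""
    records = []
    for post in posts:
        url = post[0]
        # Tolerate entries that carry only a URL and no description
        description = post[1] if len(post) > 1 else ""
        match = VIDEO_URL_PATTERN.search(url or "")
        records.append({
            "url": url,
            "description": description,
            "username": match.group("username") if match else None,
            "video_id": match.group("video_id") if match else None,
        })
    return records


# Hypothetical sample data, not real scraped output
sample = [
    ("https://www.tiktok.com/@example_user/video/7000000000000000001", "demo caption"),
    ("https://www.tiktok.com/@another_user/video/42", ""),
]
for record in to_records(sample):
    print(record["username"], record["video_id"])
```

URLs that do not match the assumed pattern are kept, with `username` and `video_id` set to `None`, so no post is silently dropped.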

Now we can extract the videos for our desired sound!

By default, when visiting the page of a TikTok sound, we only get a subset of the posts. If we want more, we need to scroll down.

**Note:** The following block of code may take a couple of minutes to run!

In [60]:
def scroll_to_bottom(driver):
    # Scroll down to the bottom of the page
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    # Wait for a short interval to load more content
    time.sleep(0.1)

MAX_SCROLLS = 1000  # Upper bound on the number of scroll attempts
MAX_STALLS = 5      # Stop after this many scrolls in a row with no height change
stalls = 0

for x in range(MAX_SCROLLS):
    # Record the page height before scrolling
    prev_height = driver.execute_script("return document.body.scrollHeight;")

    # Scroll to the bottom of the page and give new content time to load
    scroll_to_bottom(driver)
    time.sleep(1)

    # Check if we have reached the bottom of the page
    # by comparing the current and previous page heights
    new_height = driver.execute_script("return document.body.scrollHeight;")
    stalls = stalls + 1 if new_height == prev_height else 0
    if stalls >= MAX_STALLS:
        break

print("Done scrolling!")

Done scrolling!


Scrolling down loads more videos into the page, hopefully the majority of the posts under the sound.

When scrolling, the posts don't disappear from the DOM; once they have been seen, they remain there. Thus, we can scroll and then stop and save all the posts.

We can now call the function to get the posts and make sure we have all the videos. 

**Note:** The following block of code may take a couple of minutes to run!

In [61]:
posts = getVideosAndDescriptions(driver)
print(len(posts))

Done extracting video data!
1087



If you were *not* able to get all the videos under your desired sound, run the following block of code; otherwise, skip this step!

In [63]:
stalls = 0
for x in range(1000):
    # Record the page height before scrolling
    prev_height = driver.execute_script("return document.body.scrollHeight;")

    # Scroll to the bottom of the page and give new content time to load
    scroll_to_bottom(driver)
    time.sleep(1)

    # Stop once the page height has not changed for 5 scrolls in a row
    new_height = driver.execute_script("return document.body.scrollHeight;")
    stalls = stalls + 1 if new_height == prev_height else 0
    if stalls >= 5:
        break

print("Done scrolling!")

Done scrolling!


In [28]:
posts = getVideosAndDescriptions(driver)
print(len(posts))

Done extracting video data!
1185


Now that we have all the available videos, we can save them to a JSON file. The target folder must already exist, since `open` does not create missing directories.

In [56]:
import json
with open("pyktok/raw/2022/gobadb.json", 'w') as fout:
    json.dump(posts, fout)
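
As a small variation on the cell above, the following sketch creates any missing parent folders before writing and reads the file back to check the result. The helper names `save_posts` and `load_posts` and the demo path are made up for illustration; the demo writes to a temporary folder so it does not touch your project files.

```python
import json
import tempfile
from pathlib import Path


def save_posts(posts, path):
    """Write posts to a JSON file, creating parent folders if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as fout:
        # ensure_ascii=False keeps emoji and non-Latin captions readable
        json.dump(posts, fout, ensure_ascii=False, indent=2)
    return path


def load_posts(path):
    """Read posts back from disk; JSON stores tuples as lists."""
    with Path(path).open(encoding="utf-8") as fin:
        return json.load(fin)


# Hypothetical sample data written to a temporary folder
demo_posts = [("https://www.tiktok.com/@example_user/video/1", "demo caption")]
with tempfile.TemporaryDirectory() as tmp:
    target = save_posts(demo_posts, Path(tmp) / "raw" / "demo.json")
    reloaded = load_posts(target)
    print(len(reloaded), "post(s) saved and reloaded")
```

Note that each `(url, description)` tuple comes back as a two-element list after reloading, because JSON has no tuple type.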