# Analyzing Trends on TikTok with Topic Modeling & Account Scraping

This tutorial does the following:

1. Connects to an existing (open) Chrome instance. **[Part 1](#sec1)**
2. Scrapes account information from a TikTok account page. **[Part 2](#sec2)**
3. Uses topic modeling to uncover underlying themes in video descriptions, hashtags, and account bios. **[Part 3](#sec3)**

<a id="sec1"></a>
## Part 1: Create Chrome Instance

**Important:** For this to work, you should already have a Chrome instance running on your computer with remote debugging enabled. To do that, open a console and run the command for your operating system (see below).

**On Mac:**

In [None]:
/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome --remote-debugging-port=9222 --user-data-dir="/tmp/chrome_dev_test"

**On Windows:**

In [None]:
"C:\Program Files (x86)\Google\Chrome\Application\chrome.exe" --remote-debugging-port=9222 --user-data-dir="C:\selenium\ChromeTestProfile"

**New installation**

If you don't have the following packages, install them once.

In [None]:
pip install selenium webdriver_manager

Now you are ready to run the code below:

In [None]:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys
import time

# Set up Chrome options
options = Options()
options.add_experimental_option("debuggerAddress", "127.0.0.1:9222")

# Download a matching ChromeDriver if needed and point Selenium at it
service = Service(ChromeDriverManager().install())

# Connect to the existing Chrome browser session
driver = webdriver.Chrome(service=service, options=options)
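
If Chrome was not started with `--remote-debugging-port=9222`, the connection attempt may fail with an error that is hard to interpret. As a minimal sketch (the helper name `debugger_port_is_open` is hypothetical and not part of Selenium), one could check that something is listening on the port before creating the driver. Note that this only confirms a listener exists, not that it is Chrome:

```python
import socket

def debugger_port_is_open(host="127.0.0.1", port=9222, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable
        return False

if not debugger_port_is_open():
    print("Nothing is listening on 127.0.0.1:9222; start Chrome with the command above first.")
```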

<a id="sec2"></a>
## Part 2: Getting Information from a TikTok Page

The following function scrapes a given account for its username, name, bio, follower count, and total number of likes.

In [None]:
def getAccountInformation(driver):
    """
    Given an open driver instance on a TikTok account page, 
    get the account metrics that are accessible.
    """
    time.sleep(2) # in case the page hasn't loaded yet

    account_info = {}

    # Get the username 
    try:
        account_info['author_username'] = driver.find_element(By.XPATH, '//*[@id="main-content-others_homepage"]/div/div[1]/div[1]/div[2]/h1').text
    except Exception as e:
        print(f"Username: An unexpected error occurred: {e}")

    # Get the name 
    try:
        account_info['author_name'] = driver.find_element(By.XPATH, '//*[@id="main-content-others_homepage"]/div/div[1]/div[1]/div[2]/h2').text
    except Exception as e:
        print(f"Name: An unexpected error occurred: {e}")

    # Get the bio
    try:
        account_info['author_bio'] = driver.find_element(By.XPATH, '//*[@id="main-content-others_homepage"]/div/div[1]/h2').text
    except Exception as e:
        print(f"Bio: An unexpected error occurred: {e}")

    # Get the number of followers  
    try:
        account_info['author_followers'] = driver.find_element(By.XPATH, '//*[@id="main-content-others_homepage"]/div/div[1]/h3/div[2]/strong').text
    except Exception as e:
        print(f"Followers: An unexpected error occurred: {e}")

    # Get the number of likes 
    try:
        account_info['author_likes'] = driver.find_element(By.XPATH, '//*[@id="main-content-others_homepage"]/div/div[1]/h3/div[3]/strong').text
    except Exception as e:
        print(f"Likes: An unexpected error occurred: {e}")


    return account_info
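
The follower and like counts come back as display text rather than numbers. Assuming TikTok abbreviates large counts with a `K`, `M`, or `B` suffix (for example `12.3K` or `1.2M`; this format is an assumption about the page, not something the function above guarantees), a hypothetical helper such as `parse_count` could convert them to integers for later analysis:

```python
def parse_count(text):
    """Convert a display count like '12.3K' or '1.2M' to an int, or None if unparseable."""
    multipliers = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}
    cleaned = text.strip().upper().replace(",", "")
    if not cleaned:
        return None
    has_suffix = cleaned[-1] in multipliers
    factor = multipliers[cleaned[-1]] if has_suffix else 1
    number = cleaned[:-1] if has_suffix else cleaned
    try:
        return int(round(float(number) * factor))
    except (ValueError, OverflowError):
        # Not a number (e.g. empty after stripping the suffix, or 'n/a')
        return None

print(parse_count("12.3K"))  # 12300
print(parse_count("987"))    # 987
```

Returning `None` for unparseable text keeps bad values visible as missing data instead of silently turning them into zeros.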

We can now pass each value in the "author_username" column of a PykTok dataset to this function to get information about each account!

This may take a while, depending on the size of your dataset. For this tutorial we only scrape a subset of the accounts (every 50th unique username)!

**Note**: Make sure to add your own CSV path!

In [None]:
import pandas as pd
df = pd.read_csv("")

df = df[['video_id', 'video_timestamp', 'video_duration',
       'video_locationcreated', 'suggested_words', 'video_diggcount',
       'video_sharecount', 'video_commentcount', 'video_playcount',
       'video_description', 'hashtags', 'author_username']]

# Get unique values from the "author_username" column as a list, keeping every 50th account
accounts = df['author_username'].unique().tolist()[::50]

# Initialize an empty list to store account information dictionaries
all_account_info = []

for acc in accounts:
    url = f"https://tiktok.com/@{acc}"
    driver.get(url)

    # Get account information
    account_info = getAccountInformation(driver)

    # Append the dictionary to the list
    all_account_info.append(account_info)

# Convert list of dictionaries to DataFrame
data = pd.DataFrame(all_account_info)

# Drop accounts where any field could not be scraped (missing keys become NaN)
data = data.dropna()
data
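
Because `getAccountInformation` only adds a key when the lookup succeeds, an account whose page is missing an element produces a shorter dictionary, which pandas fills with `NaN`. The toy records below (made-up values, not scraped data) illustrate what `dropna()` then removes:

```python
import pandas as pd

# Made-up records: the second one is missing 'author_followers',
# as it would be if that element was not found on the page.
records = [
    {"author_username": "user_a", "author_followers": "1.2M"},
    {"author_username": "user_b"},
]

toy = pd.DataFrame(records)
print(toy.isna().sum())   # number of missing values per column
print(toy.dropna())       # only the complete row for user_a remains
```

If too many accounts disappear at this step, checking the per-column missing counts first shows which XPath is failing.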

We now merge this new dataframe of account information with our original PykTok dataset (**df**) on the "author_username" column!

In [None]:
accountInformation = data.merge(df, on="author_username", how="inner").drop_duplicates()
accountInformation

<a id="sec3"></a>
## Part 3: Topic Modeling for Descriptions of Videos

We will now use the description and hashtags of the provided videos and the accounts' bios to see if there are any underlying themes! 

The following function, **lda_topic_modeling**, performs Latent Dirichlet Allocation (LDA) topic modeling on a dataset with textual data. It takes the following parameters:

- **data**: The dataset containing the textual data.
- **column_names**: A list of column names from the dataset that contain text data. 
- **num_topics**: The number of topics to identify. The default is 5, but it can be adjusted to the desired level of granularity.

**Combining Text Data:** The function first combines text from the specified columns into one by concatenating the text values row-wise. It handles missing values (NaNs) by filling them with empty strings.

**Creating Document-Term Matrix (DTM):** It uses CountVectorizer from scikit-learn to convert the combined text data into a document-term matrix (DTM). The CountVectorizer converts a collection of text documents into a matrix of token counts.

**Fitting LDA Model:** The function initializes and fits an LDA model using LatentDirichletAllocation from scikit-learn. LDA is a generative probabilistic model that discovers latent topics within a collection of documents.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def lda_topic_modeling(data, column_names, num_topics=5):
    # Combine text from all specified columns into one string per row,
    # treating missing values as empty strings and separating columns with a space
    data['combined_text'] = data[column_names].fillna('').astype(str).agg(' '.join, axis=1)
    
    # Create document-term matrix
    count_vectorizer = CountVectorizer(stop_words='english')
    dtm = count_vectorizer.fit_transform(data['combined_text'])
    
    # Fit LDA model
    lda_model = LatentDirichletAllocation(n_components=num_topics, random_state=42)
    lda_model.fit(dtm)
    
    # Collect the 10 highest-weighted words for each topic
    feature_names = count_vectorizer.get_feature_names_out()
    topics = {}
    for topic_idx, topic in enumerate(lda_model.components_):
        topics[f'Topic {topic_idx+1}'] = [feature_names[i] for i in topic.argsort()[:-11:-1]]
    
    return topics

# Specify the columns containing text data
columns = ['suggested_words', 'video_description', 'hashtags', 'author_bio']

# Number of topics to identify
num_topics = 10

# Run LDA topic modeling
topics = lda_topic_modeling(accountInformation, columns, num_topics)

# Print the topics
for topic, words in topics.items():
    print(topic + ": " + ", ".join(words))
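
The function returns only the top words per topic. If you also want to know which topic each row leans toward, scikit-learn's `LatentDirichletAllocation` can return a per-document topic distribution. The sketch below uses a made-up corpus and hypothetical variable names; applying it to the real data would require keeping the fitted vectorizer and model, which `lda_topic_modeling` as written does not return.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Invented corpus for illustration only
docs = [
    "dance challenge music dance",
    "music dance trend",
    "recipe cooking dinner recipe",
    "cooking kitchen dinner",
]

vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=42)
doc_topics = lda.fit_transform(dtm)   # shape (4 documents, 2 topics); each row sums to 1

# Index of the highest-weighted topic for each document
dominant_topic = doc_topics.argmax(axis=1)
print(doc_topics.round(2))
print(dominant_topic)
```

With so few documents the assignments themselves are not meaningful; the point is only the shape of the output and how a dominant topic can be read from it.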