# README - *Finding TikTok Affiliates In Beauty*

This notebook walks through the code and scripts used to extract TikTok user accounts in order to find potential affiliates for marketing agencies. The accounts must meet the following criteria:

- Are in the makeup, hair care, and beauty space
- Have fewer than 50K followers

After scraping accounts and emails that match the above criteria, we use the Python SDK for Google's Gemini API to generate a brief summary of each eligible account from its TikToks and bio.

The individual Python scripts are meant to be run in the order of the sections below.

## *Section 0 -- Data Collection*

We start by defining the functions used to scrape TikTok users from trending videos for specific hashtags. We use the unofficial TikTok API package (`TikTokApi`) to collect the results.

In [70]:
from TikTokApi import TikTokApi
from datetime import datetime
import pandas as pd
import asyncio
import os

# set env variables
OUTPUT_PATH = os.path.join(os.environ['DATAFILES_PATH'], 'TikTok-Users')
MS_TOKEN = os.environ['MS_TOKEN'] # make env variable

def get_user_url(video):
  
    # Extract user ID from the video object
    user_id = video.as_dict['author']['uniqueId']

    # Construct the user URL using the user ID
    user_url = f"https://www.tiktok.com/@{user_id}"
    return user_url

def get_video_url(video):

    user_id = video.as_dict['author']['uniqueId']
    vid_id = video.as_dict['video']['id']

    # Construct the video URL using the user ID and the video ID
    vid_url = f"https://www.tiktok.com/@{user_id}/video/{vid_id}"
    return vid_url

def is_over_50k_follers(video):

    follower_count = video.as_dict['authorStats']['followerCount']

    print(f"Follower Count: {follower_count}")

    # True when the account exceeds the 50K follower limit
    return follower_count > 50000
    
# function to fetch follower count of user
def get_follow_cnt(video):

    return video.as_dict['authorStats']['followerCount']
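
The helpers above only read from `video.as_dict`, so they can be sanity-checked without a TikTok session. The sketch below is a minimal illustration, assuming the dictionary carries the `author`, `video`, and `authorStats` keys used above. The `fake_video` object and its values are made up for the example, and the helper bodies are repeated so the snippet runs on its own.

```python
from types import SimpleNamespace

def get_user_url(video):
    user_id = video.as_dict['author']['uniqueId']
    return f"https://www.tiktok.com/@{user_id}"

def get_video_url(video):
    user_id = video.as_dict['author']['uniqueId']
    vid_id = video.as_dict['video']['id']
    return f"https://www.tiktok.com/@{user_id}/video/{vid_id}"

def get_follow_cnt(video):
    return video.as_dict['authorStats']['followerCount']

# Hypothetical stand-in for a TikTokApi video object (all values are invented)
fake_video = SimpleNamespace(as_dict={
    'author': {'uniqueId': 'example_user'},
    'video': {'id': '1234567890'},
    'authorStats': {'followerCount': 12000},
})

user_url = get_user_url(fake_video)
video_url = get_video_url(fake_video)
eligible = get_follow_cnt(fake_video) <= 50000  # same limit as the filter above
print(user_url, video_url, eligible)
```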


Next, we define our hashtags and collect the data for today ...

In [71]:
post_sample_size = 25
iterations = 40
hashtags = ["makeup", "beauty", "skincare", "haircare", "skincareroutine", "haircareroutine", "makeuproutine"]
videos = []

async with TikTokApi() as api:
    await api.create_sessions(ms_tokens=[MS_TOKEN], num_sessions=1, sleep_after=3, headless=False)

    for ihashtag in hashtags:

        print(f"Finding trending videos with #{ihashtag} ...\n\n")

        tag = api.hashtag(name=ihashtag) # add other keywords and search criteria

        for i in range(iterations):
            
            print(20*"-", f"SCRAPING SAMPLE #{i+1}", 20*"-")
            async for video in tag.videos(count=post_sample_size, cursor=post_sample_size*i):
                
                # save video
                print(f"Saved {video} ...")
                videos.append(video)

            print()

Finding trending videos with #makeup ...


-------------------- SCRAPING SAMPLE #1 --------------------
Saved TikTokApi.video(id='7208244986666585386') ...
Saved TikTokApi.video(id='6943673684880116997') ...
Saved TikTokApi.video(id='7097959048515013894') ...
Saved TikTokApi.video(id='7368888677692394757') ...
Saved TikTokApi.video(id='7146353978878463278') ...
Saved TikTokApi.video(id='7027564715756801286') ...
Saved TikTokApi.video(id='7175984394950167813') ...
Saved TikTokApi.video(id='7075004746326641962') ...
Saved TikTokApi.video(id='7125684912811691290') ...
Saved TikTokApi.video(id='7120667483551321350') ...
Saved TikTokApi.video(id='6959951225651498246') ...
Saved TikTokApi.video(id='6895969291384966402') ...
Saved TikTokApi.video(id='7355703403797957895') ...
Saved TikTokApi.video(id='6836724424373275909') ...
Saved TikTokApi.video(id='7048721568062524678') ...
Saved TikTokApi.video(id='6931494638096305413') ...
Saved TikTokApi.video(id='6939553105851829510') ...
Saved TikTok
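
The inner loop above pages through each hashtag by passing `cursor=post_sample_size*i`. As a small illustration of that offset arithmetic only (not of the TikTok API itself), the sketch below lists the `(count, cursor)` pairs the loop would request. `page_requests` is a hypothetical helper name, and how many videos each request actually returns is up to the API.

```python
def page_requests(page_size, pages):
    """Return the (count, cursor) pair for each page, mirroring cursor = page_size * i."""
    return [(page_size, page_size * i) for i in range(pages)]

offsets = page_requests(page_size=25, pages=40)
print(offsets[:3])   # first three pages
print(offsets[-1])   # last page
```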

## *Section 1 -- Data Filtering & Cleaning*

We now filter the collected users by follower count and store the results locally as an Excel file.

In [72]:
user_urls = []
video_urls = []
user_bios = []
follower_count = []

async with TikTokApi() as api:
    await api.create_sessions(ms_tokens=[MS_TOKEN], num_sessions=1, sleep_after=3, headless=False)

    for i, video in enumerate(videos):

        print(20*"-", f" FILTERING TIKTOK #{i+1} ({100*((i+1)/len(videos)):.2f}%)", 20*"-")

        if is_over_50k_follers(video):
            print("... Skipping\n")
            continue

        
        user_url = get_user_url(video)
        vid_url = get_video_url(video)
        num_followers = get_follow_cnt(video)

        print(f"Video URL: {vid_url}")
        print(f"User URL: {user_url}\n")
        
        user_urls.append(user_url)
        video_urls.append(vid_url)
        follower_count.append(num_followers)

        # get user bio
        user_data = await api.user(username=video.as_dict['author']['uniqueId']).info()
        user_bios.append(user_data['userInfo']['user']['signature'])
    

user_df = pd.DataFrame({"User URL":user_urls, "Follower Count":follower_count, "Account Bio":user_bios, 'Video URL':video_urls})
user_df = user_df.drop_duplicates(subset=['User URL'])
user_df["Hashtags Searched"] = "/".join(hashtags)

now = datetime.now()
date_string = now.strftime("%Y-%m-%d") 
user_df.to_excel(os.path.join(OUTPUT_PATH,  f'beauty_usrs_{date_string}.xlsx'), index=False)

--------------------  FILTERING TIKTOK #1 (0.01%) --------------------
Follower Count: 2900000
... Skipping

--------------------  FILTERING TIKTOK #2 (0.02%) --------------------
Follower Count: 18300000
... Skipping

--------------------  FILTERING TIKTOK #3 (0.03%) --------------------
Follower Count: 2500000
... Skipping

--------------------  FILTERING TIKTOK #4 (0.03%) --------------------
Follower Count: 2000000
... Skipping

--------------------  FILTERING TIKTOK #5 (0.04%) --------------------
Follower Count: 3900000
... Skipping

--------------------  FILTERING TIKTOK #6 (0.05%) --------------------
Follower Count: 6300000
... Skipping

--------------------  FILTERING TIKTOK #7 (0.06%) --------------------
Follower Count: 3400000
... Skipping

--------------------  FILTERING TIKTOK #8 (0.07%) --------------------
Follower Count: 15700000
... Skipping

--------------------  FILTERING TIKTOK #9 (0.08%) --------------------
Follower Count: 321500
... Skipping

------------------
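
Because the same creator can appear under several hashtags, the loop above may look up one bio more than once before `drop_duplicates` removes the repeats. One possible refinement, sketched here on plain dictionaries rather than real `TikTokApi` objects, is to apply the follower threshold and de-duplicate by username before fetching any bios. The `select_candidates` helper and the sample records are assumptions made for illustration.

```python
def select_candidates(records, max_followers=50000):
    """Keep the first record per username whose follower count is within the limit."""
    seen = set()
    kept = []
    for rec in records:
        username = rec['author']['uniqueId']
        followers = rec['authorStats']['followerCount']
        if followers > max_followers or username in seen:
            continue
        seen.add(username)
        kept.append(rec)
    return kept

# Invented sample data shaped like the fields read from video.as_dict above
sample = [
    {'author': {'uniqueId': 'big_brand'}, 'authorStats': {'followerCount': 2900000}},
    {'author': {'uniqueId': 'small_creator'}, 'authorStats': {'followerCount': 8000}},
    {'author': {'uniqueId': 'small_creator'}, 'authorStats': {'followerCount': 8000}},
    {'author': {'uniqueId': 'another_one'}, 'authorStats': {'followerCount': 50000}},
]

candidates = select_candidates(sample)
candidate_names = [rec['author']['uniqueId'] for rec in candidates]
print(candidate_names)
```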

Now we scrape the eligible users' bios for emails. Once complete, we attach the emails to their associated rows in our DataFrame and save the updated result locally. We start by loading the regular expressions library and defining a function, applied to each account bio, that extracts any email addresses it contains:

In [73]:
# load the regular expressions library
import re

def extract_emails(text):
    # Regular expression to find email addresses
    pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
    emails = re.findall(pattern, text)
    return emails
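
To see how the pattern behaves, here is a small self-contained check on made-up bios (the addresses and bio text are invented for illustration). Note that inside a character class `|` is a literal pipe, not alternation, so the top-level domain is written `[A-Za-z]{2,}` here.

```python
import re

EMAIL_PATTERN = re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b')

def extract_emails(text):
    """Return every substring of text that looks like an email address."""
    return EMAIL_PATTERN.findall(text)

# Invented bios: one email, no email, two emails
bios = [
    "MUA | collabs: jane.doe@example.com",
    "skincare tips daily",
    "PR: team@example.org or hello@example.co.uk",
]
results = [extract_emails(bio) for bio in bios]
for bio, found in zip(bios, results):
    print(f"{bio!r} -> {found}")
```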


Now we use the function above to extract the associated emails...

In [74]:
# apply the function to extract emails and create a new column 'Email'
user_df['Email'] = user_df['Account Bio'].apply(extract_emails)

# convert list of emails to a single string, if any emails are found
user_df['Email'] = user_df['Email'].apply(lambda x: ', '.join(x) if x else '')

# add username column from user URL 
user_df['User'] = user_df['User URL'].str.extract(r'@([^/]+)')

now = datetime.now()
date_string = now.strftime("%Y-%m-%d") 
user_df.to_excel(os.path.join(OUTPUT_PATH,  f'beauty_usrs_{date_string}.xlsx'), index=False)
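
For readers less familiar with pandas string methods, the per-row effect of the two transformations above can be sketched in plain Python. The sample values are invented, and `join_emails` and `username_from_url` are hypothetical helper names, not part of the notebook.

```python
import re

def join_emails(emails):
    """Collapse a list of emails into one comma-separated string ('' if the list is empty)."""
    return ', '.join(emails) if emails else ''

def username_from_url(url):
    """Return the text after '@' up to the next '/', or None if there is no '@'."""
    match = re.search(r'@([^/]+)', url)
    return match.group(1) if match else None

print(join_emails(['a@example.com', 'b@example.com']))
print(repr(join_emails([])))
print(username_from_url('https://www.tiktok.com/@example_user'))
```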

## *Section 2 -- Account Summary Generations*

We now use the [Python SDK for Gemini API](https://ai.google.dev/tutorials/python_quickstart) to write a brief summary of the user's content based on both the user's bio and videos. 
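
As a rough sketch of what this step could look like (not necessarily the code the scripts use), the snippet below builds a prompt from a bio and a few video descriptions and sends it through the `google-generativeai` package. The `build_prompt` and `summarize_account` helpers, the prompt wording, the use of video descriptions as input, the `GOOGLE_API_KEY` environment variable, and the model name are all assumptions; check the quickstart for the model names currently available.

```python
import os

def build_prompt(username, bio, video_descriptions):
    """Assemble a plain-text prompt from the account's bio and video descriptions."""
    videos = "\n".join(f"- {desc}" for desc in video_descriptions)
    return (
        f"Summarize the content of TikTok account @{username} in 2-3 sentences.\n"
        f"Bio: {bio}\n"
        f"Recent videos:\n{videos}"
    )

def summarize_account(username, bio, video_descriptions, model_name="gemini-pro"):
    """Send the prompt to Gemini and return the generated text (needs network and an API key)."""
    import google.generativeai as genai  # imported lazily so build_prompt works without the SDK

    genai.configure(api_key=os.environ["GOOGLE_API_KEY"])  # assumed env variable name
    model = genai.GenerativeModel(model_name)              # model name is an assumption
    response = model.generate_content(build_prompt(username, bio, video_descriptions))
    return response.text

# Invented account data; only the prompt is built here, no API call is made
prompt = build_prompt(
    "example_user",
    "MUA | everyday glam",
    ["soft glam tutorial", "drugstore foundation review"],
)
print(prompt)
```

Keeping the prompt construction separate from the API call makes it possible to inspect or test the prompt without spending API quota.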

## *Section 3 -- Automate Process*

In this section, we briefly discuss the following processes:

- Keeping a log of users already scraped so as not to include them again.
- Automating data collection on a daily basis using AWS.
- Storing results in a cloud database.
- Providing a UI that employees can easily navigate and use to contact potential affiliates via email (using Dash).
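
For the first item in the list above, a minimal sketch of a scraped-user log is shown below, assuming a plain text file with one username per line. The file name `scraped_users.txt` and the helper names are hypothetical, and a cloud database table could play the same role.

```python
import os
import tempfile

def load_seen(log_path):
    """Return the set of usernames already recorded in the log file."""
    if not os.path.exists(log_path):
        return set()
    with open(log_path, encoding="utf-8") as fh:
        return {line.strip() for line in fh if line.strip()}

def record_new(usernames, log_path):
    """Append usernames not yet in the log and return them in their original order."""
    seen = load_seen(log_path)
    new_users = []
    for name in usernames:
        if name not in seen:
            seen.add(name)
            new_users.append(name)
    with open(log_path, "a", encoding="utf-8") as fh:
        for name in new_users:
            fh.write(name + "\n")
    return new_users

# Demo in a temporary directory so nothing is written to the project folder
with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "scraped_users.txt")
    first_run = record_new(["user_a", "user_b"], log_path)
    second_run = record_new(["user_b", "user_c", "user_c"], log_path)

print(first_run, second_run)
```

On each daily run, only the usernames returned by `record_new` would need to go through the bio lookup and summary steps.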