### Analyzing Boss Zhipin Data

We have obtained a data set of jobs posted by a number of Chinese technology companies in the hopes of better understanding the kinds of labor that go into the production and fine-tuning of AI large language models.

Here, I analyze:
 - Which companies the jobs come from
 - What job titles have been posted
 - What the jobs entail and their requirements
 - The salaries

#### Set-up

In [52]:
#imports
import pandas as pd
import re

In [13]:
#load in the data set
data = pd.read_csv("/Users/benwarren/Documents/GitHub/bosszp-selenium/output_data/master_file_updated_2024-08-22_en.csv")
data = data.drop(["Unnamed: 0.2", "Unnamed: 0.1", "Unnamed: 0", "index"], axis=1)
data.head()

| | company | title | description | salary | tags | location | labels | link |
|---|---|---|---|---|---|---|---|---|
| 0 | SpeechOcean | Twins palm collection | Working hours: unlimited Working hours: unlimi... | 100-150 yuan/hour | ['Beijing', 'Experience required', 'Education ... | Building D, Yousheng Building, Haidian Distric... | ['twin'] | https://www.zhipin.com/job_detail/dffcfd448427... |
| 1 | SpeechOcean | Part-time in minor languages | 【Job Description】1.Responsible for the labelin... | 15-20K | ['Beijing', 'Experience not required', 'Bachel... | Room 801, Block D, Yousheng Building, Haidian ... | ['Minority languages'] | https://www.zhipin.com/job_detail/c76ac9fdf1ec... |
| 2 | SpeechOcean | Online novel writer | Working hours: unlimited Working period: unlim... | 120-150 yuan/hour | ['Beijing', 'Experience required', 'Education ... | Building D, Yousheng Building, Haidian Distric... | ['Part-time job', 'Piece-rate', 'Online novel'... | https://www.zhipin.com/job_detail/308b46a0261e... |
| 3 | SpeechOcean | Portuguese Project Assistant | 1. Job responsibilities: 1) Confirm project re... | 8-10K · 16 salary | ['Beijing', 'Experience not required', 'Bachel... | Room 801, Block D, Yousheng Building, Haidian ... | ['Portuguese', 'Brazilian'] | https://www.zhipin.com/job_detail/9a2728e2bfb5... |
| 4 | SpeechOcean | Tele-Galician | [Job Description] Proofreading of 100,000 entr... | 200-300 yuan/day | ['Beijing', '3 days/week 2 months', 'Education... | D801, Block D, Yousheng Building, Haidian Dist... | ['CATTI Level 1 Translation', 'CATTI Level 2 T... | https://www.zhipin.com/job_detail/df056f03e254... |


#### Jobs by company

There are 118 jobs in the current data set. 

Magic Data, Konvery Data, and Jing Lianwen are the most represented companies in the data set, with 16 jobs each. Each of the other five companies has 14 jobs in the data set.

In [17]:
#Number of total jobs
len(data)

118

In [14]:
#Number of jobs by company
data['company'].value_counts()

Magic Data           16
Konvery Data         16
Jing Lianwen         16
SpeechOcean          14
Data Hall            14
MindFlow             14
Shanghai Aishu       14
Zhixin Technology    14
Name: company, dtype: int64

#### What types of jobs are available?

Out of the 118 total jobs, there were 64 distinct job titles.

The most common were:
- Product Manager (5)
- Operations Engineer (5)
- Operations Manager (4)
- Business Manager (GPT Direction) (4)
- Part-time in minor languages (3)

Out of all words used in the job titles, the most common were:
- Manager (45)
- Project (29)
- Intern (17)
- Engineer (14)
- Business (13)

These most common words describe the level of the position and its general orientation, but we also see that AI-related terms like 'data', 'ai', 'gpt', and 'model' are present. 

In [20]:
# Top 5 job titles
jobs = data['title']
counts = data['title'].value_counts()
counts[:5]

Product Manager                     5
Operations Engineer                 5
Operations Manager                  4
Business Manager (GPT Direction)    4
Part-time in minor languages        3
Name: title, dtype: int64

In [22]:
#Number of total distinct jobs
len(data['title'].unique())

64

In [57]:
#top words contained in job titles

#declare empty list
word_list = []

#loop through job titles and add each word to a list
for title in jobs:
    temp = title.split()
    word_list = word_list + temp

#remove non-letter characters and cast all to lower
clean_list = []

for word in word_list:
    new = word.lower()
    new = re.sub(r'[\W_]+', '', new)
    if len(new) > 1:
        clean_list.append(new)
    
#use a list comprehension to count the words and store them
word_freq = [clean_list.count(w) for w in clean_list]

#create a df with words and counts
freq_df = pd.DataFrame({"word": clean_list, "freq": word_freq})

#drop dupes
freq_df.drop_duplicates("word", inplace=True)

#top 5 values
freq_df.sort_values(by="freq", ascending=False).head()


| | word | freq |
|---|---|---|
| 19 | manager | 45 |
| 11 | project | 29 |
| 26 | intern | 17 |
| 34 | engineer | 14 |
| 108 | business | 13 |
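
The word count above can also be written with `collections.Counter`, which avoids rescanning the list for every word. This is a minimal sketch, not the method used in the cell above; the function name `count_title_words` and the toy titles are hypothetical and stand in for `data['title']`.

```python
import re
from collections import Counter


def count_title_words(titles):
    """Count lowercase words longer than one character across the titles."""
    counts = Counter()
    for title in titles:
        for word in title.split():
            # strip punctuation and underscores, mirroring the cleaning step above
            cleaned = re.sub(r"[\W_]+", "", word.lower())
            if len(cleaned) > 1:
                counts[cleaned] += 1
    return counts


# toy titles for illustration only (not the full data set)
sample_titles = ["Product Manager", "Operations Manager", "AI Solutions Engineer"]
word_counts = count_title_words(sample_titles)
top_words = word_counts.most_common(2)
print(top_words)
```

On the real data, `count_title_words(data['title']).most_common(5)` should give the same ranking as the DataFrame approach, since both apply the same cleaning rules.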


In [60]:
#do the titles use AI terms?

#ai - 4
print("ai")
print(freq_df[freq_df['word'] == "ai"])

#model - 6
print("model")
print(freq_df[freq_df['word'] == "model"])

#gpt - 6
print("GPT")
print(freq_df[freq_df['word'] == "gpt"])

#data
print("data")
print(freq_df[freq_df['word'] == "data"])

ai
   word  freq
32   ai     4
model
     word  freq
98  model     6
GPT
    word  freq
128  gpt     6
data
    word  freq
63  data     7


In [62]:
#what are some examples?

#AI
for job in jobs:
    if "AI" in job:
        print(job) 

#GPT
for job in jobs:
    if "GPT" in job:
        print(job) 

AI Solutions Engineer
AI Data Labeling Intern (Zhijiang)
AI Solutions Engineer
AI Data Labeling Intern (Zhijiang)
ChatGPT Trainer Intern
Business Manager (GPT Direction)
Business Manager (GPT Business)
Business Manager (GPT Direction)
ChatGPT Trainer Intern
Business Manager (GPT Business)
Business Manager (GPT Direction)
Business Manager (GPT Direction)


#### What are the requirements and responsibilities of the jobs?

Of the 118 job descriptions, 51 contain the word "requirement", which suggests that they spell out what the job requires. Fewer mention hours (20) or salary (10), though that information may appear elsewhere in the posting.*

*These counts are case-sensitive matches on the lowercase terms in the code below, so capitalized headings such as "Job Description" are not counted and the numbers should not be considered representative.

The most common terms in the job descriptions are:
- project (264)
- experience (248)
- data (230)
- job (200)
- good (168)

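Common AI terms also appear in the descriptions: data (230), ai (32), training (27), testing (21), and model (17). The other terms searched for (llm, gpt, claude, and openai) did not appear as standalone words. These terms show up in job postings for AI solutions engineers, AI sales, and other positions.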

**Note:** I'd like to try to break up the descriptions into more structured information like hours, requirements, address, etc.
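
One possible starting point for that restructuring is to split each description on its section headings. The sketch below assumes the headings "Job Responsibilities:" and "Job Requirements:", which appear in the AI Solutions Engineer description printed in this notebook; other postings may use different wording, and `split_sections` is a hypothetical helper, not existing code.

```python
import re

# Heading labels are an assumption based on one sample description;
# extend the alternation as other headings turn up in the data.
SECTION_PATTERN = re.compile(
    r"(Job Responsibilities|Job Requirements)\s*:", re.IGNORECASE
)


def split_sections(text):
    """Split a description into a dict of {heading: body}."""
    parts = SECTION_PATTERN.split(text)
    sections = {"intro": parts[0].strip()}
    # re.split with one capture group yields
    # [intro, heading, body, heading, body, ...]
    for heading, body in zip(parts[1::2], parts[2::2]):
        sections[heading.lower()] = body.strip()
    return sections


# shortened example text for illustration
example = (
    "We are looking for an AI Solutions Engineer. "
    "Job Responsibilities: 1. Sell data set products; "
    "Job Requirements: 1. Bachelor's degree or above"
)
sections = split_sections(example)
print(sections)
```

Descriptions without either heading come back with only an `intro` entry, so they are easy to find and handle separately.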

In [70]:
#what do the descriptions contain?
desc = data['description']

has_description = 0
has_requirements = 0
has_hours = 0
has_salary = 0

for text in desc:
    if "description" in text:
        has_description += 1
    if "requirement" in text:
        has_requirements += 1
    if "hour" in text:
        has_hours += 1
    if "salary" in text:
        has_salary += 1

print("Has Description: ", has_description)
print("Has Requirements: ", has_requirements)
print("Has Hours: ", has_hours)
print("Has Salary: ", has_salary)

Has Description:  3
Has Requirements:  51
Has Hours:  20
Has Salary:  10
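
Because the check above is case-sensitive, a heading like "Job Description" is not counted. A case-insensitive variant could look like the sketch below, which uses `Series.str.contains` with `case=False`. The helper name `count_mentions` and the toy descriptions are hypothetical; on the real data the counts would differ from the ones printed above.

```python
import pandas as pd


def count_mentions(descriptions, terms):
    """Return how many descriptions contain each term, ignoring case."""
    series = pd.Series(descriptions, dtype="object").fillna("")
    return {
        # regex=False treats the term as a literal substring
        term: int(series.str.contains(term, case=False, regex=False).sum())
        for term in terms
    }


# toy descriptions for illustration only
toy = [
    "【Job Description】Label data",
    "Job Requirements: none",
    "working hours: flexible",
]
mentions = count_mentions(toy, ["description", "requirement", "hour", "salary"])
print(mentions)
```

Substring matching still counts partial hits (for example "hour" inside "hourly"), so the results remain rough indicators.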


In [76]:
#most common words in descriptions

#declare empty list
word_list = []

#loop through job descriptions and add each word to a list
for text in desc:
    temp = text.split()
    word_list = word_list + temp

#remove non-letter characters and cast all to lower
clean_list = []

for word in word_list:
    new = word.lower()
    new = re.sub(r'[\W_]+', '', new)
    if len(new) > 1:
        clean_list.append(new)

#drop stopwords
stopwords = ["i", "me", "my", "myself", "we", "our", "ours", "ourselves", 
             "you", "your", "yours", "yourself", "yourselves", "he", "him",
               "his", "himself", "she", "her", "hers", "herself", "it", "its", 
               "itself", "they", "them", "their", "theirs", "themselves", "what",
                 "which", "who", "whom", "this", "that", "these", "those", "am", 
                 "is", "are", "was", "were", "be", "been", "being", "have", "has", 
                 "had", "having", "do", "does", "did", "doing", "a", "an", "the", 
                 "and", "but", "if", "or", "because", "as", "until", "while", "of",
                   "at", "by", "for", "with", "about", "against", "between", "into", 
                   "through", "during", "before", "after", "above", "below", "to", 
                   "from", "up", "down", "in", "out", "on", "off", "over", "under", 
                   "again", "further", "then", "once", "here", "there", "when", "where", 
                   "why", "how", "all", "any", "both", "each", "few", "more", "most",
                     "other", "some", "such", "no", "nor", "not", "only", "own", "same", 
                     "so", "than", "too", "very", "s", "t", "can", "will", "just", "don", 
                     "should", "now"]

no_swords = [w for w in clean_list if w not in stopwords]

#use a list comprehension to count the words and store them
word_freq = [no_swords.count(w) for w in no_swords]

#create a df with words and counts
freq_df = pd.DataFrame({"word": no_swords, "freq": word_freq})

#drop dupes
freq_df.drop_duplicates("word", inplace=True)

#top 5 values
freq_df.sort_values(by="freq", ascending=False).head()

| | word | freq |
|---|---|---|
| 41 | project | 264 |
| 85 | experience | 248 |
| 216 | data | 230 |
| 14 | job | 200 |
| 226 | good | 168 |


In [82]:
#How often do AI terms show up?
ai_terms = [
    "ai", "model", "llm", "gpt", "claude", "openai", "training", "testing", "data"
]

freq_df[freq_df['word'].isin(ai_terms)]


| | word | freq |
|---|---|---|
| 216 | data | 230 |
| 648 | training | 27 |
| 902 | ai | 32 |
| 1695 | model | 17 |
| 3142 | testing | 21 |


In [86]:
#What descriptions do 'ai' jobs have?
i = 1
for text in desc:
    if ' AI ' in text:
        print("Description ", i)
        print(text)
        print()
        i += 1

Description  1
We are looking for an AI Solutions Engineer to provide professional AI solutions to overseas customers. As an AI Solutions Engineer, you play a vital role in the growth and development of new and existing customer relationships. One of your main responsibilities is to act as a liaison between customers and our AI data team, focusing on facilitating a smooth purchasing process. Job Responsibilities: 1. Sell data set products and data service solutions; 2. Build, develop and manage customer relationships; 3. Facilitate the purchase process with each customer; 4. Develop and implement effective sales strategies for new customers, including cold calls and on-site customer visits; 5. Provide technical support to customers; 6. Communicate with customers regularly to discuss their specific needs, make suggestions, and solve service problems; 7. Work with data project managers and engineers to help define project requirements; Job Requirements: 1. Bachelor's degree or above in a

#### How much do people get paid?

Pay was listed in three main ways: as an hourly wage, as a daily wage, or as a salary figure such as 15-20K. The postings typically did not state what time period the salary figures covered.

Salaried roles made up most of the positions (82 of 118), but the upper end of the listed ranges runs from 4,000 to 60,000, which suggests that the figures cover different pay periods.
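
The salary strings seen in this data set (for example `100-150 yuan/hour`, `15-20K`, and `8-10K · 16 salary`) can also be parsed with a single regular expression. This is a sketch under assumptions, not the parsing used in the cell below: `parse_salary` is a hypothetical helper, and the number after `·` is stored as `months` on the assumption that it is a count of monthly payments per year, which the source does not confirm.

```python
import re

SALARY_RE = re.compile(
    r"^\s*(?P<low>\d+)(?:-(?P<high>\d+))?(?P<k>K)?"   # number or range, optional K
    r"(?:\s*·\s*(?P<months>\d+)\s*salar(?:y|ies))?"   # optional "· 16 salary"
    r"(?:\s*(?P<unit>yuan/\w+))?\s*$",                # optional "yuan/hour"
    re.IGNORECASE,
)


def parse_salary(text):
    """Parse a salary string into low, high, unit, and months (or None)."""
    m = SALARY_RE.match(text)
    if m is None:
        return None  # unrecognized format; inspect these by hand
    scale = 1000 if m.group("k") else 1
    low = int(m.group("low")) * scale
    high = int(m.group("high")) * scale if m.group("high") else None
    # "months" is an assumed interpretation of the number after "·"
    months = int(m.group("months")) if m.group("months") else None
    unit = m.group("unit").lower() if m.group("unit") else None
    return {"low": low, "high": high, "unit": unit, "months": months}


hourly = parse_salary("100-150 yuan/hour")
monthly = parse_salary("15-20K")
with_months = parse_salary("8-10K · 16 salary")
print(hourly, monthly, with_months)
```

Unlike the whitespace split, this keeps `·` out of the unit column and returns `None` for strings it does not recognize instead of raising an error.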

In [146]:
#split into number and unit
sal = data['salary']

nums = [s.split()[0] for s in sal]
units = [s.split()[1].lower() if len(s.split()) > 1 else None for s in sal]

# units = list(map(lambda x: x.replace('salaries', 'salary'), units))

lowers = []
highers = []
#split ranges and convert to ints
for val in nums:
    #if value in thousands
    if "K" in val:
        if "-" in val:
            lower, higher = val.split("-")
            higher = int(higher.split("K")[0]) * 1000
            lower = int(lower) * 1000
        else:
            lower = int(val.split("K")[0]) * 1000
            higher = None
    else:
        if "-" in val:
            lower, higher = val.split("-")
            lower = int(lower)
            higher = int(higher)
        else:
            lower = int(val)
            higher = None
    lowers.append(lower)
    highers.append(higher)

sal_df = pd.DataFrame({"low": lowers, "high": highers, "unit": units})

#assume that above 1000 = salary
def wage_type(val):
    if val > 1000:
        return "salary"
    else:
        return None
    
sal_df['new'] = sal_df['low'].apply(wage_type)
sal_df['unit'] = sal_df['unit'].fillna(sal_df['new'])

sal_df.head()


| | low | high | unit | new |
|---|---|---|---|---|
| 0 | 100 | 150 | yuan/hour | |
| 1 | 15000 | 20000 | salary | salary |
| 2 | 120 | 150 | yuan/hour | |
| 3 | 8000 | 10000 | · | salary |
| 4 | 200 | 300 | yuan/day | |


In [147]:
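#filter to rows with a known unit (missing units are None/NaN, not "")
known_df = sal_df[sal_df['unit'].notna()]

#median salary (low range) by type of pay
known_df.groupby("unit")['low'].median()

#median salary (high range) by type of pay
#(select the column before aggregating so the text column 'new' is not included)
known_df.groupby("unit")['high'].median()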

unit
salaries     60000.0
salary       20000.0
yuan/day       200.0
yuan/hour      150.0
·            11000.0
Name: high, dtype: float64

In [148]:
#daily versus hourly versus salary

#daily
print("Number of daily jobs listed")
print(len(known_df[known_df['unit'] == "yuan/day"]))

print("Max offered daily:")
print(known_df[known_df['unit'] == "yuan/day"]['high'].max())

print("Min offered daily:")
print(known_df[known_df['unit'] == "yuan/day"]['high'].min())

print("Median offered daily:")
print(known_df[known_df['unit'] == "yuan/day"]['high'].median())

print()
#hourly
print("Number of hourly jobs listed")
print(len(known_df[known_df['unit'] == "yuan/hour"]))

print("Max offered hourly:")
print(known_df[known_df['unit'] == "yuan/hour"]['high'].max())

print("Min offered hourly:")
print(known_df[known_df['unit'] == "yuan/hour"]['high'].min())

print("Median offered hourly:")
print(known_df[known_df['unit'] == "yuan/hour"]['high'].median())

print()
#salary
print("Number of salary jobs listed")
print(len(known_df[known_df['unit'] == "salary"]))

print("Max offered salary:")
print(known_df[known_df['unit'] == "salary"]['high'].max())

print("Min offered salary:")
print(known_df[known_df['unit'] == "salary"]['high'].min())

print("Median offered salary:")
print(known_df[known_df['unit'] == "salary"]['high'].median())

Number of daily jobs listed
19
Max offered daily:
300
Min offered daily:
150
Median offered daily:
200.0

Number of hourly jobs listed
6
Max offered hourly:
150
Min offered hourly:
150
Median offered hourly:
150.0

Number of salary jobs listed
82
Max offered salary:
60000
Min offered salary:
4000
Median offered salary:
20000.0
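
The per-unit figures above can be produced in one step with `groupby` and `agg`. The sketch below uses a small toy frame with the same column names as `sal_df`; the values are illustrative only and are not taken from the data set.

```python
import pandas as pd

# toy frame with the same columns as sal_df; values are illustrative only
toy_sal = pd.DataFrame({
    "low": [100, 120, 200, 15000],
    "high": [150, 150, 300, 20000],
    "unit": ["yuan/hour", "yuan/hour", "yuan/day", "salary"],
})

# one row per unit, summarizing the upper end of each listed range
summary = toy_sal.groupby("unit")["high"].agg(["count", "max", "min", "median"])
print(summary)
```

Note that `count` here counts non-missing `high` values, so listings with a single figure rather than a range would not be included in it.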
