## EDA and Cleaning - SourceStack datasets

This notebook explores and cleans two datasets that I obtained by calling the SourceStack API.\
The first dataset comes from **June 9, 2023**\
and the more recent one from **April 2, 2024**.

### Initial Exploration
1. shape
2. dtypes
3. missing values


### Cleaning
1. parsing strings containing datetimes to dates
2. converting strings containing a list to a list of strings
3. converting numerical data from strings to Int/Float
4. identifying dirty categories

#### Let's read in the data and have a look at its shape, columns and values

In [1]:
!pip install "polars_ds[plot]"

DEPRECATION: geopolars 0.1.0a4 has a non-standard dependency specifier pyarrow>=4.0.*. pip 24.1 will enforce this behaviour change. A possible replacement is to upgrade to a newer version of geopolars or contact the author to suggest that they release a version with a conforming dependency specifiers. Discussion can be found at https://github.com/pypa/pip/issues/12063

In [2]:
!pip install --upgrade polars

In [3]:
import sys

In [4]:
print(sys.executable)

/home/anopsy/Portfolio/sourcestack/sstack/bin/python


In [5]:
import polars as pl

In [6]:
old_data_path = "/home/anopsy/Portfolio/sourcestack/data/9june2023.csv"
new_data_path = "/home/anopsy/Portfolio/sourcestack/data/2april2024.csv"

In [7]:
old_df = pl.read_csv(old_data_path, try_parse_dates=False)
new_df = pl.read_csv(new_data_path, try_parse_dates=False)

In [8]:
print(f"Shape of the old data is: {old_df.shape}")
print(f"Shape of the new data is: {new_df.shape}")

Shape of the old data is: (50000, 16)
Shape of the new data is: (50000, 16)


In [9]:
old_df.head()

job_name,job_location,hours,remote,company_name,education,tags_matched,tag_categories,categories,seniority,comp_est,language,city,country,job_published_at,last_indexed
str,str,str,bool,str,str,str,str,str,str,str,str,str,str,str,str
"""Backend Developer""","""Praha, Czech Republic""",,,"""IBM""",,"""[Docker, GraphQL, NoSQL, IBM, …","""[Container Orchestration, Quer…","""[Software]""",,"""""","""pl""","""Praha""","""Czech Republic""","""2023-03-13 05:12:29""","""2023-06-05 13:43:49"""
"""Manufacturing Engineer""",,"""Full-Time""",False,,,"""[Sigma]""","""[Tools, Serverless]""","""[Manufacturing]""",,"""""","""en""","""Sterling Heights""","""United States""","""2021-10-09 00:00:00""","""2023-05-24 05:35:57"""
"""Design Engineer, Motorized Pro…","""520 S Byrkit St Mishawaka, Ind…","""Full-Time""",,"""ABI Attachments""","""Bachelors""","""[]""","""[]""","""[Design]""","""Senior IC""","""""","""en""","""Mishawaka""","""United States""","""2023-04-28 03:04:28""","""2023-05-19 14:48:10"""
"""Cybersecurity Engineer""",,"""Full-Time""",False,,"""Bachelors""","""[AWS, Qualys, Splunk]""","""[Compute, Logging & Monitoring…","""[Cybersecurity, Security]""",,"""""","""en""","""Herndon""","""United States""","""2023-04-03 00:00:00""","""2023-05-28 11:47:09"""
"""Your Career so choose wisely w…","""Kolkata, India""","""Full-Time""",False,"""Adeeba e Services""",,"""[Objective-C, Subversion, Swif…","""[Cloud Native Storage, Program…","""[Software]""",,"""""","""en""","""Kolkata""","""India""","""2017-01-17 11:35:48""","""2023-05-30 11:51:08"""


In [10]:
new_df.head()

job_name,job_location,hours,remote,company_name,education,tags_matched,tag_categories,categories,seniority,comp_est,language,city,country,job_published_at,last_indexed
str,str,str,bool,str,str,str,str,str,str,str,str,str,str,str,str
"""Dir, Engineering NPD, Critical…","""Dominican Republic-Nave 25-Mer…","""Full-Time""",,"""DR""","""Bachelors""","""[Microsoft]""","""[]""","""[]""",,,"""en""",,"""Dominican Republic""","""2024-03-04 00:00:00""","""2024-03-26 08:03:11"""
"""Software Engineer - Embedded""","""Dresden or Hartmannsdorf, Sach…","""Full-Time""",,"""Manning Global""",,"""[Linux]""","""[OS]""","""[Software, IT]""",,,"""en""","""Dresden or Hartmannsdorf""","""Germany""","""2024-02-15 00:00:00""","""2024-04-01 09:40:27"""
"""Embedded Software Test Enginee…","""Brisbane, CA""","""Full-Time""",,"""Avive""",,"""[Linux, C++]""","""[OS, Programming Languages, OS…","""[Software]""",,"""150000.00""","""en""","""Brisbane""","""Australia""","""2023-10-23 00:00:00""","""2024-04-01 15:25:43"""
"""Manufacturing Engineering Mana…","""Monroe, WI""","""Full-Time""",,"""United Future""",,"""[]""","""[]""","""[Manufacturing]""","""Manager""","""1.00""","""en""","""Monroe""","""United States""","""2024-03-27 20:18:23""","""2024-03-28 20:24:27"""
"""Vom Lager zum Wächter | Direkt…","""Ennepetal, Nordrhein-Westfalen…","""Full-Time""",,"""RUHR VERMITTLUNG""",,"""[WhatsApp, Vercel]""","""[Communications, VoIP, Serverl…","""[Security]""",,,"""de""","""Dortmund""","""Germany""","""2024-03-27 12:31:19""","""2024-03-31 11:16:19"""


### Initial explorations of unprocessed dataframes

#### Shape
Both datasets contain **50000 records** \
each record is represented by **16 features**

#### Dtypes
15 of the features are currently stored as String\
1 feature is Bool

#### Missing values
The datasets contain **null values** and **empty strings**

In [11]:
pl.Config.set_tbl_width_chars(
    200
)  # setting wide format but it doesn't work that well for jupyter notebook

polars.config.Config

Polars also offers `.sample`, which returns random rows. I use it here to look at a few more records (and to remember that it exists).

In [12]:
old_df.sample(3)

job_name,job_location,hours,remote,company_name,education,tags_matched,tag_categories,categories,seniority,comp_est,language,city,country,job_published_at,last_indexed
str,str,str,bool,str,str,str,str,str,str,str,str,str,str,str,str
"""Security Engineer""","""New York, New York""",,,"""G2HCM""",,"""[Google Cloud, Blockchain]""","""[OSS, Payments, Crypto]""","""[Security]""",,"""""","""en-us""","""New York""","""United States""","""2022-04-27 12:26:53""","""2023-05-26 20:51:48"""
"""Unarmed Event Security Officer""","""Clarksville, IN - Clarksville,…","""Temp""",,"""Battle Tested Security""","""High School""","""[]""","""[]""","""[Security]""","""Unclear Seniority""","""33000.00""","""en""","""Clarksville""","""United States""","""2022-03-10 00:00:00""","""2023-05-30 17:18:49"""
"""2023 Plumbing Design and Engin…","""Any PAE Location - Portland, O…","""Intern""",,"""Pae Consulting Engineers""",,"""[]""","""[]""","""[Consulting, Local Services, D…","""Intern""","""""","""en""","""Portland""","""United States""","""2022-09-28 00:00:00""","""2023-05-29 14:01:43"""


In [13]:
new_df.sample(3)

job_name,job_location,hours,remote,company_name,education,tags_matched,tag_categories,categories,seniority,comp_est,language,city,country,job_published_at,last_indexed
str,str,str,bool,str,str,str,str,str,str,str,str,str,str,str,str
"""DevOps Engineer""",,,,"""Vaillant Group""",,,,"""[DevOps]""",,"""""","""en-gb""",,,,"""2024-03-21 07:57:54"""
"""Product Security Manager | S4""","""Unity Place - Milton Keynes""","""Full-Time""",,"""A01 Santander UK""",,"""[AWS]""","""[PaaS, IaaS, Compute]""","""[Security]""","""Manager""","""6000.00""","""en""","""Milton Keynes""","""United Kingdom""","""2024-03-20 00:00:00""","""2024-03-25 18:25:03"""
"""P&C Transversal Expert & Progr…","""FRANCE - 92 - HAUTS - DE - SEI…","""Full-Time""",,"""GIE AXA""",,"""[ECR]""","""[Provisioning, OSS, Container …","""[Healthcare Providers, Insuran…","""Manager""",,"""en""","""Puteaux""","""France""",,"""2024-03-27 07:42:26"""


#### Add column that will help us identify if the record comes from 2023 or 2024 and concatenate both dataframes into one

In [14]:
# adding static columns with a string helping identify the df
old_df = old_df.with_columns(pl.lit("June 2023").alias("new"))
new_df = new_df.with_columns(pl.lit("April 2024").alias("new"))

In [15]:
old_df.head()

job_name,job_location,hours,remote,company_name,education,tags_matched,tag_categories,categories,seniority,comp_est,language,city,country,job_published_at,last_indexed,new
str,str,str,bool,str,str,str,str,str,str,str,str,str,str,str,str,str
"""Backend Developer""","""Praha, Czech Republic""",,,"""IBM""",,"""[Docker, GraphQL, NoSQL, IBM, …","""[Container Orchestration, Quer…","""[Software]""",,"""""","""pl""","""Praha""","""Czech Republic""","""2023-03-13 05:12:29""","""2023-06-05 13:43:49""","""June 2023"""
"""Manufacturing Engineer""",,"""Full-Time""",False,,,"""[Sigma]""","""[Tools, Serverless]""","""[Manufacturing]""",,"""""","""en""","""Sterling Heights""","""United States""","""2021-10-09 00:00:00""","""2023-05-24 05:35:57""","""June 2023"""
"""Design Engineer, Motorized Pro…","""520 S Byrkit St Mishawaka, Ind…","""Full-Time""",,"""ABI Attachments""","""Bachelors""","""[]""","""[]""","""[Design]""","""Senior IC""","""""","""en""","""Mishawaka""","""United States""","""2023-04-28 03:04:28""","""2023-05-19 14:48:10""","""June 2023"""
"""Cybersecurity Engineer""",,"""Full-Time""",False,,"""Bachelors""","""[AWS, Qualys, Splunk]""","""[Compute, Logging & Monitoring…","""[Cybersecurity, Security]""",,"""""","""en""","""Herndon""","""United States""","""2023-04-03 00:00:00""","""2023-05-28 11:47:09""","""June 2023"""
"""Your Career so choose wisely w…","""Kolkata, India""","""Full-Time""",False,"""Adeeba e Services""",,"""[Objective-C, Subversion, Swif…","""[Cloud Native Storage, Program…","""[Software]""",,"""""","""en""","""Kolkata""","""India""","""2017-01-17 11:35:48""","""2023-05-30 11:51:08""","""June 2023"""


In [16]:
# concatenating old and new data
whole_df = old_df.vstack(new_df)

print(whole_df.shape)

(100000, 17)


Removing duplicates

In [17]:
whole_df = whole_df.unique()
# there were 333 duplicates
whole_df

job_name,job_location,hours,remote,company_name,education,tags_matched,tag_categories,categories,seniority,comp_est,language,city,country,job_published_at,last_indexed,new
str,str,str,bool,str,str,str,str,str,str,str,str,str,str,str,str,str
"""Retail Front End Supervisor""",,,,"""External Ocean State Job Lot""",,"""[]""","""[]""","""[Retail, Job Board]""","""Manager""","""""","""en""","""Wethersfield""","""United States""","""2023-06-01 17:04:46""","""2023-06-05 19:57:38""","""June 2023"""
"""Développeur mobile iOS (Swift)…","""Toulouse""","""Full-Time""",false,"""MY SAM CAB""",,"""[Notion, Kotlin, iOS, Slack, G…","""[Databases, SaaS, Database, So…","""[iOS, Apple-Related]""",,,"""fr""","""Toulouse""","""France""","""2022-07-13 08:28:10""","""2023-06-07 22:18:30""","""June 2023"""
"""Fluid Systems Chief Engineer""","""Cedar park, TX""",,,"""Firefly Aerospace""","""Bachelors""","""[]""","""[]""","""[Airlines & Aerospace]""","""Chief""",,"""en-us""",,"""United States""",,"""2024-03-25 03:44:55""","""April 2024"""
"""Instrumentation Engineer""",,,true,,,"""[]""","""[]""","""[]""",,,"""pl""",,,"""2024-01-28 18:59:42""","""2024-04-01 00:22:29""","""April 2024"""
"""Civil Roadway Engineering Inte…","""Denver, Colorado, United State…","""Unclear""",,"""RS&H Talent Acquisition""","""Bachelors""","""[]""","""[]""","""[]""","""Intern""",,"""en-us""","""Denver""","""United States""","""2022-03-14 22:05:18""","""2024-03-14 22:05:19""","""April 2024"""
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
"""AWS Software Engineer III - ET…","""Jersey City, NJ, United States""","""Full-Time""",,"""281971-Ipm Mission Control Ap_…",,"""[JPMorgan Chase, Databricks, P…","""[OSS, OS, Data Science Platfor…","""[Software, Financial Services]""",,"""154125.00""","""en""","""Jersey City""","""United States""","""2023-11-13 20:40:58""","""2024-04-01 05:34:24""","""April 2024"""
"""Únete a nuestra Comunidad de T…","""Argentina, Buenos Aires, Pelle…","""Full-Time""",,,,"""[]""","""[]""","""[]""",,,"""es""","""Buenos Aires""","""Argentina""","""2024-03-08 00:00:00""","""2024-03-26 18:08:42""","""April 2024"""
"""Staff Thermal Systems Engineer…","""MDLI18""","""Full-Time""",,"""0078 MS""","""Bachelors""","""[]""","""[]""","""[]""","""Staff IC""","""197500.00""","""en""",,"""United States""","""2023-05-16 00:00:00""","""2023-05-28 06:33:58""","""June 2023"""
"""Fullstack Developer""","""Guadalajara, Mexico""",,,"""IBM""",,"""[Blockchain, IBM, Angular.js, …","""[Data Science Tools, Full Stac…","""[Software]""",,"""""","""pl""","""Guadalajara""","""Mexico""","""2023-04-11 18:43:36""","""2023-06-05 06:58:58""","""June 2023"""


### Cleaning

#### 1. Converting 'job_published_at', 'last_indexed' to Date

In [18]:
whole_df = whole_df.with_columns(
    pl.col("job_published_at", "last_indexed").str.to_datetime().cast(pl.Date)
)

In [19]:
whole_df.head()

job_name,job_location,hours,remote,company_name,education,tags_matched,tag_categories,categories,seniority,comp_est,language,city,country,job_published_at,last_indexed,new
str,str,str,bool,str,str,str,str,str,str,str,str,str,str,date,date,str
"""Retail Front End Supervisor""",,,,"""External Ocean State Job Lot""",,"""[]""","""[]""","""[Retail, Job Board]""","""Manager""","""""","""en""","""Wethersfield""","""United States""",2023-06-01,2023-06-05,"""June 2023"""
"""Développeur mobile iOS (Swift)…","""Toulouse""","""Full-Time""",False,"""MY SAM CAB""",,"""[Notion, Kotlin, iOS, Slack, G…","""[Databases, SaaS, Database, So…","""[iOS, Apple-Related]""",,,"""fr""","""Toulouse""","""France""",2022-07-13,2023-06-07,"""June 2023"""
"""Fluid Systems Chief Engineer""","""Cedar park, TX""",,,"""Firefly Aerospace""","""Bachelors""","""[]""","""[]""","""[Airlines & Aerospace]""","""Chief""",,"""en-us""",,"""United States""",,2024-03-25,"""April 2024"""
"""Instrumentation Engineer""",,,True,,,"""[]""","""[]""","""[]""",,,"""pl""",,,2024-01-28,2024-04-01,"""April 2024"""
"""Civil Roadway Engineering Inte…","""Denver, Colorado, United State…","""Unclear""",,"""RS&H Talent Acquisition""","""Bachelors""","""[]""","""[]""","""[]""","""Intern""",,"""en-us""","""Denver""","""United States""",2022-03-14,2024-03-14,"""April 2024"""


#### 2. Converting 'tags_matched', 'tag_categories', 'categories' from str to list[str]

In [20]:
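def string_to_nested(df, cols):
    """
    Take a DataFrame and a list of columns whose strings look like lists
    (for example "[AWS, Qualys, Splunk]") and turn them into the nested
    List(String) datatype.
    """
    for col in cols:
        df = df.with_columns(
            # the regex keeps runs of word characters only, so multi-word
            # entries are split: "Job Board" becomes "Job" and "Board"
            pl.col(col).str.extract_all(r"\w+").cast(pl.List(pl.String))
        )
    return df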

In [21]:
cols_to_change = ["tags_matched", "tag_categories", "categories"]
whole_df = string_to_nested(whole_df, cols_to_change)

In [22]:
whole_df.head()

job_name,job_location,hours,remote,company_name,education,tags_matched,tag_categories,categories,seniority,comp_est,language,city,country,job_published_at,last_indexed,new
str,str,str,bool,str,str,list[str],list[str],list[str],str,str,str,str,str,date,date,str
"""Retail Front End Supervisor""",,,,"""External Ocean State Job Lot""",,[],[],"[""Retail"", ""Job"", ""Board""]","""Manager""","""""","""en""","""Wethersfield""","""United States""",2023-06-01,2023-06-05,"""June 2023"""
"""Développeur mobile iOS (Swift)…","""Toulouse""","""Full-Time""",False,"""MY SAM CAB""",,"[""Notion"", ""Kotlin"", … ""PostgreSQL""]","[""Databases"", ""SaaS"", … ""Languages""]","[""iOS"", ""Apple"", ""Related""]",,,"""fr""","""Toulouse""","""France""",2022-07-13,2023-06-07,"""June 2023"""
"""Fluid Systems Chief Engineer""","""Cedar park, TX""",,,"""Firefly Aerospace""","""Bachelors""",[],[],"[""Airlines"", ""Aerospace""]","""Chief""",,"""en-us""",,"""United States""",,2024-03-25,"""April 2024"""
"""Instrumentation Engineer""",,,True,,,[],[],[],,,"""pl""",,,2024-01-28,2024-04-01,"""April 2024"""
"""Civil Roadway Engineering Inte…","""Denver, Colorado, United State…","""Unclear""",,"""RS&H Talent Acquisition""","""Bachelors""",[],[],[],"""Intern""",,"""en-us""","""Denver""","""United States""",2022-03-14,2024-03-14,"""April 2024"""


#### 3. Converting 'comp_est' from str to float

In [23]:
whole_df = whole_df.with_columns(
    pl.col("comp_est").cast(pl.Float64, strict=False).alias("compensation")
)
# strict=False turns values that cannot be parsed (here: empty strings) into nulls
# instead of raising an error
# an earlier attempt to cast to an integer type failed; I think the cause was Int32,
# which is too small for some of the huge numbers in this column
# I later got Int64 to work as well, but the column is kept as Float64 here

In [24]:
whole_df.head()

job_name,job_location,hours,remote,company_name,education,tags_matched,tag_categories,categories,seniority,comp_est,language,city,country,job_published_at,last_indexed,new,compensation
str,str,str,bool,str,str,list[str],list[str],list[str],str,str,str,str,str,date,date,str,f64
"""Retail Front End Supervisor""",,,,"""External Ocean State Job Lot""",,[],[],"[""Retail"", ""Job"", ""Board""]","""Manager""","""""","""en""","""Wethersfield""","""United States""",2023-06-01,2023-06-05,"""June 2023""",
"""Développeur mobile iOS (Swift)…","""Toulouse""","""Full-Time""",False,"""MY SAM CAB""",,"[""Notion"", ""Kotlin"", … ""PostgreSQL""]","[""Databases"", ""SaaS"", … ""Languages""]","[""iOS"", ""Apple"", ""Related""]",,,"""fr""","""Toulouse""","""France""",2022-07-13,2023-06-07,"""June 2023""",
"""Fluid Systems Chief Engineer""","""Cedar park, TX""",,,"""Firefly Aerospace""","""Bachelors""",[],[],"[""Airlines"", ""Aerospace""]","""Chief""",,"""en-us""",,"""United States""",,2024-03-25,"""April 2024""",
"""Instrumentation Engineer""",,,True,,,[],[],[],,,"""pl""",,,2024-01-28,2024-04-01,"""April 2024""",
"""Civil Roadway Engineering Inte…","""Denver, Colorado, United State…","""Unclear""",,"""RS&H Talent Acquisition""","""Bachelors""",[],[],[],"""Intern""",,"""en-us""","""Denver""","""United States""",2022-03-14,2024-03-14,"""April 2024""",


In [25]:
whole_df.filter(
    pl.col("compensation") > 0
).shape  # only 14962 records have a compensation value greater than 0

(14962, 18)

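#### 4. Converting 'language', 'education', 'hours', 'seniority' to pl.Categorical

Language codes are first cut to their two-letter prefix (for example `en-us` becomes `en`), then nulls in each column are replaced with an "unknown" category. After that, seniority is cross-checked against keywords in job_name.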

In [26]:
whole_df = whole_df.with_columns(pl.col("language").str.head(2))

In [27]:
whole_df = whole_df.with_columns(
    pl.col("language").fill_null("unknown").cast(pl.Categorical)
)  # introducing "unknown" category

In [28]:
whole_df = whole_df.with_columns(
    pl.col("education").fill_null("unknown").cast(pl.Categorical)
)  # introducing "unknown" category

In [29]:
whole_df = whole_df.with_columns(
    pl.col("hours").fill_null("unknown").cast(pl.Categorical)
)  # introducing "unknown" category

In [30]:
whole_df = whole_df.with_columns(
    pl.col("seniority").fill_null("unknown").cast(pl.Categorical)
)  # introducing "unknown" category

In [31]:
whole_df

job_name,job_location,hours,remote,company_name,education,tags_matched,tag_categories,categories,seniority,comp_est,language,city,country,job_published_at,last_indexed,new,compensation
str,str,cat,bool,str,cat,list[str],list[str],list[str],cat,str,cat,str,str,date,date,str,f64
"""Retail Front End Supervisor""",,"""unknown""",,"""External Ocean State Job Lot""","""unknown""",[],[],"[""Retail"", ""Job"", ""Board""]","""Manager""","""""","""en""","""Wethersfield""","""United States""",2023-06-01,2023-06-05,"""June 2023""",
"""Développeur mobile iOS (Swift)…","""Toulouse""","""Full-Time""",false,"""MY SAM CAB""","""unknown""","[""Notion"", ""Kotlin"", … ""PostgreSQL""]","[""Databases"", ""SaaS"", … ""Languages""]","[""iOS"", ""Apple"", ""Related""]","""unknown""",,"""fr""","""Toulouse""","""France""",2022-07-13,2023-06-07,"""June 2023""",
"""Fluid Systems Chief Engineer""","""Cedar park, TX""","""unknown""",,"""Firefly Aerospace""","""Bachelors""",[],[],"[""Airlines"", ""Aerospace""]","""Chief""",,"""en""",,"""United States""",,2024-03-25,"""April 2024""",
"""Instrumentation Engineer""",,"""unknown""",true,,"""unknown""",[],[],[],"""unknown""",,"""pl""",,,2024-01-28,2024-04-01,"""April 2024""",
"""Civil Roadway Engineering Inte…","""Denver, Colorado, United State…","""Unclear""",,"""RS&H Talent Acquisition""","""Bachelors""",[],[],[],"""Intern""",,"""en""","""Denver""","""United States""",2022-03-14,2024-03-14,"""April 2024""",
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
"""AWS Software Engineer III - ET…","""Jersey City, NJ, United States""","""Full-Time""",,"""281971-Ipm Mission Control Ap_…","""unknown""","[""JPMorgan"", ""Chase"", … ""SQL""]","[""OSS"", ""OS"", … ""PaaS""]","[""Software"", ""Financial"", ""Services""]","""unknown""","""154125.00""","""en""","""Jersey City""","""United States""",2023-11-13,2024-04-01,"""April 2024""",154125.0
"""Únete a nuestra Comunidad de T…","""Argentina, Buenos Aires, Pelle…","""Full-Time""",,,"""unknown""",[],[],[],"""unknown""",,"""es""","""Buenos Aires""","""Argentina""",2024-03-08,2024-03-26,"""April 2024""",
"""Staff Thermal Systems Engineer…","""MDLI18""","""Full-Time""",,"""0078 MS""","""Bachelors""",[],[],[],"""Staff IC""","""197500.00""","""en""",,"""United States""",2023-05-16,2023-05-28,"""June 2023""",197500.0
"""Fullstack Developer""","""Guadalajara, Mexico""","""unknown""",,"""IBM""","""unknown""","[""Blockchain"", ""IBM"", … ""TypeScript""]","[""Data"", ""Science"", … ""Libraries""]","[""Software""]","""unknown""","""""","""pl""","""Guadalajara""","""Mexico""",2023-04-11,2023-06-05,"""June 2023""",


In [32]:
whole_df.filter(
    (pl.col("job_name").str.contains("(?i)intern")) & (pl.col("seniority") != "Intern")
)

job_name,job_location,hours,remote,company_name,education,tags_matched,tag_categories,categories,seniority,comp_est,language,city,country,job_published_at,last_indexed,new,compensation
str,str,cat,bool,str,cat,list[str],list[str],list[str],cat,str,cat,str,str,date,date,str,f64
"""SAP China iXp Intern - Center …","""Information Technology""","""Intern""",,,"""Vocational""","[""SAP""]","[""Financial"", ""Services"", … ""ERP""]","[""China"", ""Related""]","""IC""",,"""en""","""Shanghai""","""China""",2023-05-01,2024-03-27,"""April 2024""",
"""Senior Product Manager, Intern…","""CH - Switzerland""","""Full-Time""",true,"""Insulet Corporation""","""Bachelors""",[],[],[],"""Senior IC""",,"""en""",,"""Switzerland""",2024-03-25,2024-04-01,"""April 2024""",
"""Intern - GMP/GLP Biosafety Tes…","""Singapore""","""Full-Time""",,,"""unknown""","[""SAP""]","[""Travel"", ""and"", … ""Customers""]",[],"""Director""",,"""en""","""Singapore""","""Singapore""",2023-05-09,2024-03-21,"""April 2024""",
"""Junior Front End Development A…","""Long Island City, NY, United S…","""Full-Time""",true,"""Inbulks Corp""","""unknown""","[""SQL"", ""HTML5"", … ""Java""]","[""Infrastructure"", ""Full"", … ""SQL""]","[""eCom"", ""Software""]","""IC""","""""","""en""","""Long Island City""","""United States""",2021-05-25,2023-05-30,"""June 2023""",
"""Data Analyst (Internal Data - …","""Tanjong Pagar Plaza, singapore…","""Contract""",,"""Quess Corp""","""unknown""",[],[],"[""IT""]","""IC""",,"""en""","""Tanjong Pagar Plaza""","""Singapore""",2023-10-11,2024-03-20,"""April 2024""",
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
"""Civil/Site Engineer Internship""","""Sydney, NSW, Australia""","""Intern""",,"""Ace Talent recruitment""","""Bachelors""",[],[],"[""Recruiting"", ""Staffing""]","""Junior IC""",,"""en""","""Sydney""","""Australia""",2024-02-23,2024-03-27,"""April 2024""",
"""Electronics Technician/Enginee…",,"""Intern""",,"""Lockheed Martin""","""Bachelors""","[""Microsoft"", ""Excel""]","[""Misc"", ""Biz"", … ""Customers""]",[],"""IC""","""""","""en""",,,,2023-05-26,"""June 2023""",
"""Civil Engineer Intern""","""Melbourne, Victoria, Australia""","""Intern""",,"""Nexus Silicon Technologies PTY""","""unknown""","[""Microsoft""]",[],"[""Civil"", ""Engineering""]","""Junior IC""","""""","""en""","""Melbourne""","""Australia""",2023-05-23,2023-05-28,"""June 2023""",
"""Engineering Technician Intern""","""Kansas City, MO, USA""","""Intern""",false,"""BlueScope""","""unknown""","[""Microsoft""]",[],[],"""IC""",,"""en""","""Kansas City""","""United States""",2024-03-05,2024-03-28,"""April 2024""",


In [33]:
senior_job = whole_df.filter(
    (pl.col("job_name").str.contains("(?i)senior"))
    & (pl.col("seniority") != "Senior IC")
)

senior_job

job_name,job_location,hours,remote,company_name,education,tags_matched,tag_categories,categories,seniority,comp_est,language,city,country,job_published_at,last_indexed,new,compensation
str,str,cat,bool,str,cat,list[str],list[str],list[str],cat,str,cat,str,str,date,date,str,f64
"""Javascript full-stack develope…",,"""Full-Time""",true,"""ООО «Прогрессив Майнд»""","""unknown""",,,"[""Software""]","""unknown""",,"""ru""","""Москва""",,2020-08-03,2024-03-28,"""April 2024""",
"""Senior Systems Engineering Man…","""GB - ENG - LAN - Warton""","""unknown""",,,"""unknown""","[""Microsoft"", ""MATLAB"", ""Excel""]","[""SMB"", ""Customers"", … ""Tools""]",[],"""Manager""",,"""en""",,"""United Kingdom""",,2024-03-25,"""April 2024""",
"""(Senior) Software Engineer Fro…","""Hamburg""","""Full-Time""",,"""collectAI""","""unknown""","[""Node"", ""js"", … ""React""]","[""Libraries"", ""JavaScript"", … ""Management""]","[""SaaS"", ""AI"", ""Software""]","""unknown""","""""","""en""","""Hamburg""","""Germany""",2023-05-23,2023-05-29,"""June 2023""",
"""Senior Staff Engineer - Full S…","""San Francisco, California, Uni…","""Full-Time""",,"""Valo Health""","""unknown""","[""JavaScript"", ""AWS"", … ""js""]","[""JavaScript"", ""UI"", … ""Libraries""]","[""Machine"", ""Learning""]","""Staff IC""","""231500.00""","""en""","""San Francisco""","""United States""",2023-02-08,2023-05-24,"""June 2023""",231500.0
"""(Senior) Software Developer (m…","""Frankfurt, DE""","""Full-Time""",,"""Talentrecruiters Personalberat…","""unknown""","[""C"", ""NET""]","[""OSS"", ""Programming"", … ""Framework""]","[""Software"", ""Recruiting"", ""Staffing""]","""unknown""",,"""de""","""Frankfurt""","""Germany""",2020-06-12,2024-03-26,"""April 2024""",
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
"""Senior Program Manager, Sustai…","""Seattle, WA""","""Full-Time""",,,"""unknown""",[],[],[],"""Manager""","""175100.00""","""en""","""Seattle""","""United States""",2023-06-02,2023-06-09,"""June 2023""",175100.0
"""Senior/Principal Electrical En…","""London, LONDON, United Kingdom""","""Full-Time""",false,"""AECOM""","""unknown""",[],[],"[""Transportation""]","""Staff IC""","""""","""en""","""Manchester""","""United Kingdom""",2023-02-08,2023-05-29,"""June 2023""",
"""Senior Program Manager""","""Tampa, FL 33621 US (Primary)""","""Full-Time""",,"""Prescient Edge""","""Bachelors""",[],[],"[""Government""]","""Manager""",,"""en""","""Tampa""","""United States""",,2024-03-31,"""April 2024""",
"""Senior/Principal DevOps Engine…","""Quito, Pichincha, Ecuador""","""Full-Time""",true,"""FullStack Labs""","""unknown""","[""Uber"", ""Linux"", … ""Helm""]","[""Service"", ""Proxy"", … ""Management""]","[""Software"", ""DevOps"", … ""Development""]","""Staff IC""","""""","""en""","""Quito""","""Ecuador""",2023-02-14,2023-05-23,"""June 2023""",


In [34]:
whole_df.select(pl.col("seniority").value_counts())

seniority
struct[2]
"{""Manager"",6688}"
"{""unknown"",53578}"
"{""Chief"",557}"
"{""Intern"",1982}"
"{""Unclear Seniority"",4593}"
…
"{""Junior IC"",1896}"
"{""Senior Manager"",273}"
"{""Exec"",465}"
"{""Senior Exec"",8}"


##### Set union on the list[str] columns to create a set of unique tags

In [35]:
!pip install hvplot

In [36]:
import hvplot.polars

In [37]:
# whole_df.group_by('seniority').agg(pl.col('country').top_k_by('compensation', k=2))

## Analysis

### 1. How many Junior/Intern offers are there in each dataset?

In [38]:
whole_df.group_by("new", "seniority").agg(pl.col("seniority").count().alias("count"))

new,seniority,count
str,cat,u32
"""April 2024""","""Contract""",482
"""April 2024""","""unknown""",26797
"""April 2024""","""Senior Manager""",120
"""June 2023""","""Chief""",225
"""June 2023""","""Director""",425
…,…,…
"""June 2023""","""Intern""",752
"""June 2023""","""Exec""",220
"""June 2023""","""unknown""",26781
"""June 2023""","""IC""",3377


In [39]:
seniority_groups = whole_df.group_by("seniority", "new").agg(
    pl.col("seniority").count().alias("count")
)
seniority_groups = seniority_groups.select(pl.all().sort_by("count"))

In [40]:
seniority_groups = seniority_groups.with_columns(
    (pl.col("count") / 500).alias("percent of jobs")
)
# 500 = 50_000 / 100, so count / 500 is the percentage of a 50_000-row snapshot

In [41]:
seniority_groups

seniority,new,count,percent of jobs
cat,str,u32,f64
"""Founder""","""April 2024""",1,0.002
"""Founder""","""June 2023""",2,0.004
"""Senior Exec""","""June 2023""",2,0.004
"""Senior Exec""","""April 2024""",6,0.012
"""Senior Manager""","""April 2024""",120,0.24
…,…,…,…
"""Manager""","""June 2023""",3413,6.826
"""Senior IC""","""April 2024""",8608,17.216
"""Senior IC""","""June 2023""",9232,18.464
"""unknown""","""June 2023""",26781,53.562


In [130]:
seniority_group_plot = seniority_groups.hvplot.barh(
    x="seniority",
    y="count",
    color="new",
    rot=90,
    title="Number of Job Offers per Seniority",
    alpha=0.3,
    colorbar=True,
    clabel="count",
    cmap="prism",
)
hvplot.save(seniority_group_plot, "seniority.png")

In [43]:
entry_level = seniority_groups.filter(
    (pl.col("seniority") == "Junior IC") | (pl.col("seniority") == "Intern")
)
entry_level

seniority,new,count,percent of jobs
cat,str,u32,f64
"""Intern""","""June 2023""",752,1.504
"""Junior IC""","""June 2023""",837,1.674
"""Junior IC""","""April 2024""",1059,2.118
"""Intern""","""April 2024""",1230,2.46


In [131]:
entry_plot = entry_level.hvplot.barh(
    x="seniority",
    y="count",
    color="new",
    rot=90,
    title="Number of Entry-Level Job Offers per Seniority",
    alpha=0.3,
    colorbar=True,
    clabel="count",
    cmap="prism",
)
hvplot.save(entry_plot, "entry.png")

**June 2023**\
Junior job offers were 1.674% of the total 50000\
Internship offers were 1.504% of the total 50000

**April 2024**\
Junior job offers were 2.118% of the total 50000\
Internship offers were 2.46% of the total 50000

Entry-level jobs in June 2023 were 3.178% of all offers\
Entry-level jobs in April 2024 were 4.578% of all offers\
The number of entry-level jobs has risen by 44% (from 1589 to 2289 offers)
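
The figures in this summary can be re-derived from the counts in the `entry_level` table. The short check below copies those four counts by hand and assumes 50_000 offers per snapshot, as the rest of this section does:

```python
# Counts taken from the entry_level table above.
june = {"Junior IC": 837, "Intern": 752}
april = {"Junior IC": 1059, "Intern": 1230}
total_per_snapshot = 50_000  # assumed size of each snapshot

june_entry = sum(june.values())
april_entry = sum(april.values())

# Share of entry-level offers in each snapshot, in percent.
june_share = june_entry / total_per_snapshot * 100
april_share = april_entry / total_per_snapshot * 100

# Relative change in the number of entry-level offers, in percent.
relative_change = (april_entry - june_entry) / june_entry * 100

print(f"{june_share:.3f}% -> {april_share:.3f}% ({relative_change:.0f}% more offers)")
```

Note that the 44% is a relative increase in the number of offers; in percentage points the share grew by 1.4.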

In [45]:
known_seniority = (
    seniority_groups.filter(
        (pl.col("seniority") != "unknown")
        & (pl.col("seniority") != "Unclear Seniority")
    )
    .group_by("new")
    .sum()
)

In [46]:
known_seniority

new,seniority,count,percent of jobs
str,cat,u32,f64
"""June 2023""",,21055,42.11
"""April 2024""",,20773,41.546


In [47]:
perc_of_known_seniority = entry_level.join(known_seniority, on="new", how="left")

In [48]:
perc_of_known_seniority = perc_of_known_seniority.with_columns(
    (pl.col("count") / pl.col("count_right") * 100).alias("percent of seniority")
)
perc_of_known_seniority

seniority,new,count,percent of jobs,seniority_right,count_right,percent of jobs_right,percent of seniority
cat,str,u32,f64,cat,u32,f64,f64
"""Intern""","""June 2023""",752,1.504,,21055,42.11,3.571598
"""Junior IC""","""June 2023""",837,1.674,,21055,42.11,3.975303
"""Junior IC""","""April 2024""",1059,2.118,,20773,41.546,5.097964
"""Intern""","""April 2024""",1230,2.46,,20773,41.546,5.921148


In [49]:
mean_comp_seniority = whole_df.group_by("new", "seniority").agg(
    pl.col("compensation").mean().alias("mean_comp_seniority")
)

In [50]:
mean_comp_seniority

new,seniority,mean_comp_seniority
str,cat,f64
"""June 2023""","""unknown""",1.3595e9
"""April 2024""","""unknown""",4.5190e8
"""April 2024""","""Founder""",
"""June 2023""","""Exec""",3.4617e9
"""April 2024""","""Senior Manager""",143160.8375
…,…,…
"""June 2023""","""Junior IC""",3.0982e7
"""April 2024""","""Chief""",268487.67
"""April 2024""","""Senior Exec""",
"""April 2024""","""Senior IC""",1.2428e9


In [51]:
# drop_nulls() removes null means but not NaN ones; rows that still show an
# empty mean below are most likely NaN, and NaN sorts first when descending.
mean_comp_seniority = mean_comp_seniority.drop_nulls().sort(
    "mean_comp_seniority", descending=True
)

In [52]:
mean_comp_seniority

new,seniority,mean_comp_seniority
str,cat,f64
"""April 2024""","""Founder""",
"""June 2023""","""Senior Exec""",
"""June 2023""","""Founder""",
"""April 2024""","""Senior Exec""",
"""June 2023""","""Chief""",4.0817e9
…,…,…
"""April 2024""","""Exec""",154049.461562
"""April 2024""","""Senior Manager""",143160.8375
"""June 2023""","""Senior Manager""",140226.263158
"""April 2024""","""Contract""",131988.499375


In [53]:
mean_comp_seniority.hvplot.barh(
    x="seniority",
    y="mean_comp_seniority",
    color="new",
    rot=90,
    title="Mean compensation per Seniority",
    alpha=0.3,
    colorbar=True,
    clabel="count",
    cmap="prism",
)

In [54]:
junior_comp = whole_df.filter(
    (pl.col("seniority") == "Junior IC") & (pl.col("compensation") > 0)
)
junior_comp.group_by("new").agg(pl.col("compensation").median())

new,compensation
str,f64
"""April 2024""",240000.0
"""June 2023""",60000.0
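
The Junior IC mean for June 2023 (3.0982e7) sits far above the median shown here (60000), which suggests that a handful of extreme `compensation` values dominate the mean. A minimal illustration of that effect with made-up numbers (not taken from the dataset):

```python
from statistics import mean, median

# Hypothetical salaries: four plausible values and one extreme outlier.
salaries = [55_000, 60_000, 62_000, 75_000, 3_000_000_000]

print(f"mean:   {mean(salaries):,.0f}")
print(f"median: {median(salaries):,.0f}")
```

One outlier is enough to push the mean several orders of magnitude above a typical value, while the median barely moves, so the median is probably the safer summary for the per-seniority comparison.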


In [55]:
junior_comp

job_name,job_location,hours,remote,company_name,education,tags_matched,tag_categories,categories,seniority,comp_est,language,city,country,job_published_at,last_indexed,new,compensation
str,str,cat,bool,str,cat,list[str],list[str],list[str],cat,str,cat,str,str,date,date,str,f64
"""Design Engineer""","""Bangalore Urban""","""Full-Time""",,"""Automation Technologies""","""Bachelors""",[],[],"[""Design""]","""Junior IC""","""240000.00""","""en""","""Bangalore Urban""","""India""",2023-05-30,2024-03-22,"""April 2024""",240000.0
"""Network Construction Engineer …","""3120 139th Ave SE, Bellevue, W…","""Full-Time""",,"""Verizon""","""Bachelors""","[""Google"", ""Microsoft""]","[""IaaS""]","[""Construction""]","""Junior IC""","""75500.00""","""en""","""Bellevue""","""United States""",2024-01-22,2024-03-28,"""April 2024""",75500.0
"""Plastic Product Development En…","""Rajkot""","""Full-Time""",,"""Essen Speciality Films""","""unknown""","[""SAP""]","[""SaaS"", ""Payments"", … ""A""]","[""Movies"", ""Film""]","""Junior IC""","""300000.00""","""en""","""Rajkot""","""India""",2022-10-06,2024-03-22,"""April 2024""",300000.0
"""Junior Electrical Engineer""","""Manassas, VA""","""Full-Time""",,"""Latitude""","""Bachelors""",[],[],"[""Government"", ""Recruiting"", ""Staffing""]","""Junior IC""","""80000.00""","""en""","""Manassas""","""United States""",,2023-06-05,"""June 2023""",80000.0
"""Security Specialist (Polygraph…","""Chantilly, VA""","""Full-Time""",,"""NRO""","""unknown""",[],[],"[""Security""]","""Junior IC""","""89156.0""","""en""","""Chantilly""","""United States""",2023-05-17,2023-05-29,"""June 2023""",89156.0
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
"""Mechanical Engineer, Remote Se…","""Washington, D.C., DC""","""Full-Time""",true,"""DeVine Consulting""","""unknown""",[],[],"[""Consulting"", ""Mechanical"", … ""Engineering""]","""Junior IC""","""106000.0""","""en""","""Washington""","""United States""",2023-05-15,2023-05-28,"""June 2023""",106000.0
"""ERP Operation Support Engineer…","""New York, NY""","""Full-Time""",,"""Cinter Career Services""","""unknown""","[""Microsoft""]",[],"[""England"", ""Related"", ""ERP""]","""Junior IC""","""60000.0""","""en""","""New York""","""United States""",2023-04-26,2023-06-09,"""June 2023""",60000.0
"""JUNIOR DATA SCIENTIST - Dubai,…","""Dubai, UAE""","""Full-Time""",,"""Dubai, UAE.""","""Bachelors""","[""Python"", ""NumPy"", … ""seaborn""]","[""Visualization"", ""Libraries"", … ""Visualization""]",[],"""Junior IC""","""60000.00""","""en""","""Los Angeles""","""United States""",2023-04-24,2023-06-08,"""June 2023""",60000.0
"""Software Developer, Junior""","""USA, SC, Charleston (2387 Clem…","""Full-Time""",,"""631 Booz Allen Hamilton_United…","""Some High School""","[""Java"", ""JavaScript"", … ""SQL""]","[""Stat"", ""Tools"", … ""Software""]","[""Software""]","""Junior IC""","""75950.00""","""en""","""Charleston""","""United States""",2023-05-17,2023-05-19,"""June 2023""",75950.0


In [56]:
intern_comp = whole_df.filter(
    (pl.col("seniority") == "Intern") & (pl.col("compensation") > 0)
)
intern_comp.group_by("new").agg(pl.col("compensation").median())

new,compensation
str,f64
"""June 2023""",39.0
"""April 2024""",32500.0


In [57]:
intern_comp

job_name,job_location,hours,remote,company_name,education,tags_matched,tag_categories,categories,seniority,comp_est,language,city,country,job_published_at,last_indexed,new,compensation
str,str,cat,bool,str,cat,list[str],list[str],list[str],cat,str,cat,str,str,date,date,str,f64
"""Back End Developer Intern""","""Bangalore""","""Intern""",,"""RayIoT Solutions""","""unknown""","[""MySQL"", ""JavaScript"", … ""Flask""]","[""OSS"", ""Stat"", … ""Datastores""]","[""IoT"", ""Software""]","""Intern""","""210000.00""","""en""","""Bangalore""","""India""",2024-02-27,2024-03-30,"""April 2024""",210000.0
"""Quality Engineer Intern""","""Texarkana, Arkansas, United St…","""Contract""",,"""Xylem Carrières""","""unknown""",[],[],[],"""Intern""","""48000.00""","""fr""","""Texarkana""","""United States""",2024-01-31,2024-03-24,"""April 2024""",48000.0
"""Software Developer Summer 2024…","""Washington, Washington, United…","""Temp""",,"""US News & World Report ,L.P.""","""Masters""","[""React"", ""Google"", … ""Django""]","[""IaaS"", ""Libraries"", … ""OSS""]","[""Software""]","""Intern""","""16.1""","""en""","""Washington""","""United States""",2024-02-21,2024-03-30,"""April 2024""",16.1
"""Software Development Intern""","""Washington, DC""","""Intern""",,"""Fund II Foundation""","""unknown""",[],[],"[""Software"", ""Software"", ""Development""]","""Intern""","""25.00""","""en""",,"""United States""",,2024-03-27,"""April 2024""",25.0
"""Jr. Cybersecurity Specialist""","""Remote""","""Intern""",true,"""Cutsforth""","""High School""","[""Excel""]","[""Misc"", ""Biz"", … ""Customers""]","[""Security"", ""Cybersecurity""]","""Intern""","""42200.00""","""en""",,,,2024-03-14,"""April 2024""",42200.0
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
"""Intern: Supplier Quality Engin…","""Salem, Virginia, 24153""","""Intern""",,"""Volvo Group""","""unknown""","[""Microsoft""]",[],"[""Mechanical"", ""Industrial"", … ""Services""]","""Intern""","""31.00""","""fr""","""Salem""","""United States""",2023-04-17,2023-05-23,"""June 2023""",31.0
"""Software Engineer, Intern or C…","""Remote, Other 00000, US""","""Unclear""",true,"""Dakota Software Corporation""","""Bachelors""","[""Angular"", ""js"", … ""Core""]","[""Full"", ""Stack"", … ""Libraries""]","[""Software""]","""Intern""","""21.5""","""en""",,"""United States""",2024-03-13,2024-04-01,"""April 2024""",21.5
"""Engineer Intern""","""Building Department / Fire Mar…","""Intern""",,"""Town of Vernon""","""unknown""","[""Microsoft""]",[],[],"""Intern""","""17.00""","""en""",,,2024-01-30,2024-03-14,"""April 2024""",17.0
"""Summer Intern - Civil/Geotechn…","""United States""","""Intern""",,"""Haley & Aldrich""","""unknown""","[""Express""]","[""Backend"", ""Frameworks""]","[""Civil"", ""Engineering""]","""Intern""","""51000.00""","""en""",,"""United States""",2022-12-20,2023-06-08,"""June 2023""",51000.0
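
The intern rows above mix values that look like hourly rates (16.1, 25.0, 31.0) with values that look like annual salaries (48000, 210000), which would explain a median of 39.0 in one dataset and 32500 in the other. One possible way to put them on a common scale is sketched below. The cutoff of 500 and the 2080 working hours per year are assumptions chosen for illustration, not values from SourceStack, and currency differences are ignored.

```python
HOURLY_CUTOFF = 500      # assumption: anything below this is an hourly rate
HOURS_PER_YEAR = 2080    # assumption: 40 hours x 52 weeks


def annualize(compensation: float) -> float:
    """Return an approximate annual figure for a raw compensation value."""
    if compensation < HOURLY_CUTOFF:
        return compensation * HOURS_PER_YEAR
    return compensation


# Values copied from the intern rows shown above.
raw = [210000.0, 48000.0, 16.1, 25.0, 42200.0, 31.0, 21.5, 17.0, 51000.0]
annual = [annualize(value) for value in raw]
print(annual)
```

A fixed cutoff is crude: monthly salaries or other currencies would be misclassified, so any real cleaning step should be checked against the `country` and `hours` columns first.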


In [58]:
from datetime import datetime

In [59]:
clean_timeline = whole_df.filter(
    pl.col("job_published_at").is_between(datetime(2020, 12, 31), datetime(2024, 4, 2)),
)

In [60]:
timeline = clean_timeline.group_by("job_published_at", "new").agg(
    pl.col("job_published_at").count().alias("job_count")
)
timeline

job_published_at,new,job_count
date,str,u32
2023-03-19,"""June 2023""",19
2021-07-05,"""April 2024""",1
2023-10-06,"""April 2024""",32
2022-03-27,"""April 2024""",16
2022-11-15,"""April 2024""",13
…,…,…
2022-09-12,"""June 2023""",27
2022-08-14,"""June 2023""",2
2021-06-23,"""April 2024""",1
2022-07-19,"""April 2024""",7


In [61]:
pivot_timeline = timeline.pivot(
    index="job_published_at", columns="new", values="job_count"
)

In [62]:
%pip install selenium

Note: you may need to restart the kernel to use updated packages.


In [63]:
%pip install phantomjs

Note: you may need to restart the kernel to use updated packages.


In [65]:
plot_tl = pivot_timeline.hvplot.line(
    x="job_published_at",
    y=["June 2023", "April 2024"],
    title="Number of New Job Offers Posted per Day",
)

In [66]:
hvplot.save(plot_tl, "timeline.png")



In [67]:
timeline.hvplot.line(x="job_published_at", y="job_count", color="new")

#### 4. Identify dirty categories


In [68]:
whole_df = whole_df.with_columns(pl.col("job_name").str.to_lowercase())

In [69]:
whole_df.select(pl.col("job_name").value_counts(sort=True)).transpose()

column_0,column_1,column_2,column_3,column_4,column_5,column_6,column_7,column_8,column_9,column_10,column_11,column_12,column_13,column_14,column_15,column_16,column_17,column_18,column_19,column_20,column_21,column_22,column_23,column_24,column_25,column_26,column_27,column_28,column_29,column_30,column_31,column_32,column_33,column_34,column_35,column_36,…,column_65696,column_65697,column_65698,column_65699,column_65700,column_65701,column_65702,column_65703,column_65704,column_65705,column_65706,column_65707,column_65708,column_65709,column_65710,column_65711,column_65712,column_65713,column_65714,column_65715,column_65716,column_65717,column_65718,column_65719,column_65720,column_65721,column_65722,column_65723,column_65724,column_65725,column_65726,column_65727,column_65728,column_65729,column_65730,column_65731,column_65732
struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],…,struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2]
"{""software engineer"",814}","{""senior software engineer"",615}","{""product manager"",462}","{""devops engineer"",398}","{""data engineer"",378}","{""project engineer"",373}","{""security officer"",361}","{""electrical engineer"",354}","{""program manager"",342}","{""data analyst"",327}","{""mechanical engineer"",277}","{""full stack developer"",276}","{""software developer"",274}","{""data scientist"",247}","{""systems engineer"",236}","{""quality engineer"",235}","{""network engineer"",235}","{""security guard"",232}","{""retail front end supervisor"",202}","{""process engineer"",196}","{""senior data engineer"",194}","{""manufacturing engineer"",194}","{""engineering manager"",192}","{""application developer: cloud fullstack"",184}","{""senior devops engineer"",182}","{""sales engineer"",164}","{""senior product manager"",161}","{""senior software developer"",158}","{""technical writer"",155}","{""field service engineer"",153}","{""site reliability engineer"",147}","{""product owner"",146}","{""android developer"",144}","{""backend developer"",144}","{""ios developer"",140}","{""civil engineer"",131}","{""qa engineer"",128}",…,"{""system engineer für militärische support systeme (all genders)"",1}","{""software engineering-director"",1}","{""fibre engineer bramford"",1}","{""principal weapons engineer"",1}","{""stationary engineer - 2nd class (6852)"",1}","{""senior data analyst, cash app compliance"",1}","{""senior qa automation engineer | remote friendly"",1}","{""production software engineer (big data support)"",1}","{""staff software engineer - scripting and developer experience"",1}","{""front-end developer (4th member of techstars team)"",1}","{""senior engineer, research (402112)"",1}","{""adjunct – machine learning for computer science – online – college of engineering & technology"",1}","{""software engineer (scala)"",1}","{""senior java engineer - p2p (apac)"",1}","{""quality assurance associate engineer"",1}","{""consulting engineer 
(associate)"",1}","{""package consultant: oracle cloud hcm time & labor"",1}","{""operarios/as de confección 1626127435.51"",1}","{""specialist, security (part-time, weekend, nights)"",1}","{""adeeba ios developer / sr. ios developer"",1}","{""java fullstack_meenakshi_hexaware"",1}","{""java developer/engineer position.......004"",1}","{""program manager- case management- texas office for refugees- austin, tx"",1}","{""data scientist (obp)"",1}","{""acquisition and production engineering manager"",1}","{""associate solution engineer (orbit)"",1}","{""manager project engineering"",1}","{""software engineer, ios mobile apps - full or part time"",1}","{""lead java developer - (java , aws) remote"",1}","{""specialist software engineer(big data)"",1}","{""frontend software development manager"",1}","{""cloud data analyst"",1}","{""manager of software monitoring- millennial specialty insurance"",1}","{""senior manager application engineering"",1}","{""aws software engineer iii - etl & python"",1}","{""staff thermal systems engineer, transformational computing"",1}","{""olive cultivation engineer ""baharya oasis"" ""مهندس زراعة زيتون"""",1}"


In [70]:
job_names = (
    whole_df.group_by("job_name")
    .agg(pl.col("job_name").count().alias("count"))
    .sort("count", descending=True)
)

In [71]:
job_pop = job_names.filter(pl.col("count") > 50)

In [72]:
job_pop.transpose()

column_0,column_1,column_2,column_3,column_4,column_5,column_6,column_7,column_8,column_9,column_10,column_11,column_12,column_13,column_14,column_15,column_16,column_17,column_18,column_19,column_20,column_21,column_22,column_23,column_24,column_25,column_26,column_27,column_28,column_29,column_30,column_31,column_32,column_33,column_34,column_35,column_36,…,column_64,column_65,column_66,column_67,column_68,column_69,column_70,column_71,column_72,column_73,column_74,column_75,column_76,column_77,column_78,column_79,column_80,column_81,column_82,column_83,column_84,column_85,column_86,column_87,column_88,column_89,column_90,column_91,column_92,column_93,column_94,column_95,column_96,column_97,column_98,column_99,column_100
str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,…,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str
"""software engineer""","""senior software engineer""","""product manager""","""devops engineer""","""data engineer""","""project engineer""","""security officer""","""electrical engineer""","""program manager""","""data analyst""","""mechanical engineer""","""full stack developer""","""software developer""","""data scientist""","""systems engineer""","""quality engineer""","""network engineer""","""security guard""","""retail front end supervisor""","""process engineer""","""senior data engineer""","""manufacturing engineer""","""engineering manager""","""application developer: cloud f…","""senior devops engineer""","""sales engineer""","""senior product manager""","""senior software developer""","""technical writer""","""field service engineer""","""site reliability engineer""","""product owner""","""backend developer""","""android developer""","""ios developer""","""civil engineer""","""engineer""",…,"""engineering technician""","""controls engineer""","""package consultant: sap cloud …","""quality assurance engineer""","""software engineer ii""","""senior engineer""","""machine learning engineer""","""chief engineer""","""cloud engineer""","""application engineer""","""staff software engineer""","""senior full stack developer""","""system engineer""","""embedded software engineer""","""software architect""","""software de recrutamento e sel…","""senior program manager""","""senior structural engineer""","""product engineer""","""security engineer""","""qa automation engineer""","""industrial engineer""","""service engineer""","""engineer ii""","""application developer: azure c…","""production engineer""","""engineering intern""","""senior full stack engineer""","""unarmed security officer""","""senior backend developer""","""software development engineer""","""lead engineer""","""big data engineer""","""solutions engineer""","""software engineer iii""","""technical product manager""","""electrical design engineer"""
"""814""","""615""","""462""","""398""","""378""","""373""","""361""","""354""","""342""","""327""","""277""","""276""","""274""","""247""","""236""","""235""","""235""","""232""","""202""","""196""","""194""","""194""","""192""","""184""","""182""","""164""","""161""","""158""","""155""","""153""","""147""","""146""","""144""","""144""","""140""","""131""","""128""",…,"""76""","""76""","""76""","""76""","""75""","""74""","""73""","""70""","""68""","""68""","""67""","""67""","""66""","""65""","""64""","""64""","""62""","""62""","""62""","""61""","""60""","""59""","""58""","""58""","""57""","""56""","""56""","""56""","""55""","""55""","""53""","""52""","""52""","""52""","""51""","""51""","""51"""


In [73]:
choices = job_pop.select(pl.col("job_name"))
choices.dtypes

[String]

In [74]:
choices

job_name
str
"""software engineer"""
"""senior software engineer"""
"""product manager"""
"""devops engineer"""
"""data engineer"""
…
"""big data engineer"""
"""solutions engineer"""
"""software engineer iii"""
"""technical product manager"""


In [75]:
whole_df.select(pl.col("company_name").value_counts(sort=True))

company_name
struct[2]
"{null,6266}"
"{""IBM"",2683}"
"{""Allied Universal"",1057}"
"{""CLBPTS"",668}"
"{""Bosch Group"",533}"
…
"{""Workiva"",1}"
"{""RHI"",1}"
"{""43107 GEA Food Solutions Weert"",1}"
"{""Auto Trader"",1}"


In [76]:
(
    whole_df.group_by("company_name")
    .agg(pl.col("company_name").count().alias("count"))
    .filter(pl.col("count") > 1)
    .sort("count", descending=True)
)

company_name,count
str,u32
"""IBM""",2683
"""Allied Universal""",1057
"""CLBPTS""",668
"""Bosch Group""",533
"""Schneider Electric""",397
…,…
"""Calvary Robotics""",2
"""1000 Pacific Biosciences of Ca…",2
"""NEOM""",2
"""Antino Labs""",2


Let's build a list of the most common `job_name` values, then fuzzy match the remaining, rarer titles against that list.

In [77]:
job_names = (
    whole_df.group_by("job_name")
    .agg(pl.col("job_name").count().alias("count"))
    .sort("count", descending=True)
)

In [78]:
job_pop = job_names.filter(pl.col("count") > 10)

In [79]:
job_pop.transpose()

column_0,column_1,column_2,column_3,column_4,column_5,column_6,column_7,column_8,column_9,column_10,column_11,column_12,column_13,column_14,column_15,column_16,column_17,column_18,column_19,column_20,column_21,column_22,column_23,column_24,column_25,column_26,column_27,column_28,column_29,column_30,column_31,column_32,column_33,column_34,column_35,column_36,…,column_465,column_466,column_467,column_468,column_469,column_470,column_471,column_472,column_473,column_474,column_475,column_476,column_477,column_478,column_479,column_480,column_481,column_482,column_483,column_484,column_485,column_486,column_487,column_488,column_489,column_490,column_491,column_492,column_493,column_494,column_495,column_496,column_497,column_498,column_499,column_500,column_501
str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,…,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str
"""software engineer""","""senior software engineer""","""product manager""","""devops engineer""","""data engineer""","""project engineer""","""security officer""","""electrical engineer""","""program manager""","""data analyst""","""mechanical engineer""","""full stack developer""","""software developer""","""data scientist""","""systems engineer""","""quality engineer""","""network engineer""","""security guard""","""retail front end supervisor""","""process engineer""","""manufacturing engineer""","""senior data engineer""","""engineering manager""","""application developer: cloud f…","""senior devops engineer""","""sales engineer""","""senior product manager""","""senior software developer""","""technical writer""","""field service engineer""","""site reliability engineer""","""product owner""","""backend developer""","""android developer""","""ios developer""","""civil engineer""","""engineer""",…,"""site reliability engineer iii""","""cloud solutions architect""","""project engineering manager""","""senior application security en…","""sr. data scientist""","""lead software developer""","""representante de envios""","""cloud data engineer""","""transportation project enginee…","""devops engineer - remote, full…","""application developer: ibm clo…","""senior software engineer (back…","""data analyst (remote)""","""senior security analyst""","""robotics engineer""","""senior software engineer - jav…","""functional safety engineer""","""cloud infrastructure engineer""",""".net full stack developer""","""software quality assurance eng…","""engineering internship""","""sr. 
software developer""","""fire engineer""","""systems integration engineer""","""cad engineer""","""data scientist ii""","""package consultant: oracle clo…","""manufacturing engineering mana…","""field sales engineer""","""principal mechanical engineer""","""senior cybersecurity engineer""","""security guard - full time""","""frontend software engineer""","""lead site reliability engineer""","""design engineer ii""","""engineering specialist""","""senior cloud security engineer"""
"""814""","""615""","""462""","""398""","""378""","""373""","""361""","""354""","""342""","""327""","""277""","""276""","""274""","""247""","""236""","""235""","""235""","""232""","""202""","""196""","""194""","""194""","""192""","""184""","""182""","""164""","""161""","""158""","""155""","""153""","""147""","""146""","""144""","""144""","""140""","""131""","""128""",…,"""11""","""11""","""11""","""11""","""11""","""11""","""11""","""11""","""11""","""11""","""11""","""11""","""11""","""11""","""11""","""11""","""11""","""11""","""11""","""11""","""11""","""11""","""11""","""11""","""11""","""11""","""11""","""11""","""11""","""11""","""11""","""11""","""11""","""11""","""11""","""11""","""11"""
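
One way the fuzzy-matching idea could be sketched is with `difflib` from the standard library. The canonical titles below are taken from the most common job names above; the messy variants also appear in the output above. The 0.8 cutoff is an assumed threshold, not a tuned value.

```python
from difflib import get_close_matches

# A few canonical titles taken from the most common job names above.
canonical = [
    "software engineer",
    "data scientist",
    "software developer",
    "data engineer",
]

# Rarer variants that should, or should not, map to a canonical title.
messy = ["sr. data scientist", "software engineer ii", "security guard - full time"]

for title in messy:
    # cutoff=0.8 is an assumed similarity threshold; tune it on real data.
    match = get_close_matches(title, canonical, n=1, cutoff=0.8)
    print(title, "->", match[0] if match else None)
```

Titles with no sufficiently close match come back as `None`, which keeps unrelated roles from being forced into the nearest popular category.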


In [80]:
data_job = (
    whole_df.filter(pl.col("job_name").str.contains("(?i)data"))
    .group_by("job_name")
    .agg(pl.col("job_name").count().alias("count"))
    .sort("count", descending=True)
)

In [81]:
junior_job = (
    whole_df.filter(pl.col("job_name").str.contains("(?i)junior"))
    .group_by("job_name")
    .agg(pl.col("job_name").count().alias("count"))
    .sort("count", descending=True)
)

In [82]:
whole_df.filter(
    (pl.col("job_name").str.contains("(?i)junior"))
    & (pl.col("seniority") != "Junior IC")
)
# 146 more postings contain "junior" in job_name but are not labelled
# "Junior IC" in seniority

job_name,job_location,hours,remote,company_name,education,tags_matched,tag_categories,categories,seniority,comp_est,language,city,country,job_published_at,last_indexed,new,compensation
str,str,cat,bool,str,cat,list[str],list[str],list[str],cat,str,cat,str,str,date,date,str,f64
"""(junior) data analyst/visualiz…","""Meckenbeuren, 88074 Germany""","""Full-Time""",,"""Winterhalter Gruppe""","""unknown""","[""ETL"", ""Python"", … ""SQL""]","[""Analytics"", ""Big"", … ""Visualization""]",[],"""IC""",,"""en""","""Meckenbeuren""","""Germany""",2024-03-19,2024-03-26,"""April 2024""",
"""civil engineer project manager…","""Elk Grove, CA, 95758""","""Contract""",,"""Interwest Consulting Group""","""Bachelors""",[],[],"[""Civil"", ""Engineering"", ""Consulting""]","""IC""","""""","""en""","""Elk Grove""","""United States""",,2023-06-02,"""June 2023""",
"""junior data analyst - dailymot…","""Paris, France""","""Full-Time""",false,"""Dailymotion""","""unknown""","[""Google"", ""Data"", … ""Dailymotion""]","[""Data"", ""Science"", … ""Media""]","[""Videos"", ""UX"", … ""Apps""]","""IC""","""""","""en""","""Paris""","""France""",2023-03-24,2023-05-28,"""June 2023""",
"""junior brand & product manager…","""Experienced Professional""","""unknown""",,"""British American Tobacco""","""unknown""",[],[],"[""England"", ""Related"", … ""Goods""]","""IC""",,"""en""","""Hamburg""","""Germany""",2024-03-04,2024-03-14,"""April 2024""",
"""junior front end development a…","""Long Island City, NY, United S…","""Full-Time""",true,"""Inbulks Corp""","""unknown""","[""SQL"", ""HTML5"", … ""Java""]","[""Infrastructure"", ""Full"", … ""SQL""]","[""eCom"", ""Software""]","""IC""","""""","""en""","""Long Island City""","""United States""",2021-05-25,2023-05-30,"""June 2023""",
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
"""security control assessor/data…","""Lackland Air Force Base - JBSA…","""Full-Time""",,"""Feditc""","""unknown""",[],[],"[""Security""]","""Senior IC""",,"""en""","""JBSA-Lackland AFB""","""United States""",2024-01-26,2024-03-29,"""April 2024""",
"""junior data analyst""","""Rio de Janeiro, RJ, 22640100 B…","""Full-Time""",,,"""unknown""","[""Power"", ""BI"", … ""Tableau""]","[""Business"", ""Intelligence"", … ""Tools""]",[],"""IC""",,"""en""","""Rio de Janeiro""","""Brazil""",2024-02-15,2024-03-31,"""April 2024""",
"""(junior/senior) security analy…","""Deutsche Telekom Cyber Securit…","""Part-Time""",,"""Deutsche Telekom""","""unknown""",[],[],"[""Security""]","""IC""","""43078.00""","""en""","""Vienna""","""Austria""",2023-12-05,2024-03-23,"""April 2024""",43078.0
"""full stack engineer (junior)""","""Huntsville, AL""","""Full-Time""",,"""Sangoma""","""unknown""","[""TurboGears"", ""AngularJS"", … ""AWS""]","[""IaaS"", ""Programming"", … ""Libraries""]",[],"""Senior IC""",,"""en""","""Huntsville""","""United States""",,2023-06-09,"""June 2023""",


In [83]:
junior_job

job_name,count
str,u32
"""junior software engineer""",34
"""junior data scientist - dubai,…",21
"""junior data analyst""",20
"""junior software developer""",14
"""junior electrical engineer""",8
…,…
"""junior engineering officer""",1
"""junior security engineer""",1
"""junior android developer (java…",1
"""junior ai/ml software develope…",1


In [84]:
intern = (
    whole_df.filter(pl.col("job_name").str.contains("(?i)intern"))
    .group_by("job_name")
    .agg(pl.col("job_name").count().alias("count"))
    .sort("count", descending=True)
)

In [85]:
intern

job_name,count
str,u32
"""engineering intern""",56
"""mechanical engineering intern""",22
"""civil engineering intern""",19
"""software engineer intern""",19
"""software engineering intern""",18
…,…
"""engineering-intern (summer 202…",1
"""distribution industrial engine…",1
"""【cc/ech4-jp】internship at chas…",1
"""intern, transportation enginee…",1


In [86]:
internship = (
    whole_df.filter(pl.col("job_name").str.contains("(?i)internship"))
    .group_by("job_name")
    .agg(pl.col("job_name").count().alias("count"))
    .sort("count", descending=True)
)

In [87]:
internship

job_name,count
str,u32
"""engineering internship""",11
"""internship for android develop…",5
"""electrical engineering interns…",5
"""mechanical engineering interns…",5
"""internship for ios from an it …",4
…,…
"""summer 2024 machine learning i…",1
"""data internships - data analys…",1
"""mandatory internship - product…",1
"""quality engineer - summer inte…",1
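
A plain substring search for `intern` also matches words such as "internal" or "international", so some rows returned by the `(?i)intern` filter may not be internships at all. A sketch of a stricter pattern, shown with Python's `re` module; apart from "engineering intern", the example titles are made up.

```python
import re

# Match "intern", "interns", "internship", "internships" as whole words only.
INTERN_RE = re.compile(r"\bintern(ship)?s?\b", re.IGNORECASE)

titles = [
    "engineering intern",
    "data science internship",
    "international sales engineer",
    "internal tools developer",
]
for title in titles:
    print(f"{title!r}: {bool(INTERN_RE.search(title))}")
```

Polars' `str.contains` uses Rust regex syntax, which also supports `\b`, so the same pattern should carry over to the filters above.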


In [88]:
whole_df.select(pl.col("company_name").value_counts(sort=True)).transpose()

column_0,column_1,column_2,column_3,column_4,column_5,column_6,column_7,column_8,column_9,column_10,column_11,column_12,column_13,column_14,column_15,column_16,column_17,column_18,column_19,column_20,column_21,column_22,column_23,column_24,column_25,column_26,column_27,column_28,column_29,column_30,column_31,column_32,column_33,column_34,column_35,column_36,…,column_28745,column_28746,column_28747,column_28748,column_28749,column_28750,column_28751,column_28752,column_28753,column_28754,column_28755,column_28756,column_28757,column_28758,column_28759,column_28760,column_28761,column_28762,column_28763,column_28764,column_28765,column_28766,column_28767,column_28768,column_28769,column_28770,column_28771,column_28772,column_28773,column_28774,column_28775,column_28776,column_28777,column_28778,column_28779,column_28780,column_28781
struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],…,struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2]
"{null,6266}","{""IBM"",2683}","{""Allied Universal"",1057}","{""CLBPTS"",668}","{""Bosch Group"",533}","{""Schneider Electric"",397}","{""260312-SOUTH FLORIDA REGION ADMIN"",380}","{""Novartis"",367}","{""Volvo Group"",342}","{""Lockheed Martin"",339}","{""Endeavor IT Solution"",305}","{""Open Systems Technologies"",280}","{""Explore Jobs Search"",266}","{""Weblee Technologies"",264}","{""The Boeing Company"",260}","{""Continental"",259}","{""Coders Brain Technology"",242}","{""Capgemini"",224}","{""IBM Careers"",220}","{""FullStack Labs"",215}","{""AECOM"",201}","{""241387-COMP & BEN ADMIN PROF FEES"",199}","{""Burlington Stores"",191}","{""Securitas US Business Unit"",176}","{""CACI-FEDERAL"",172}","{""Worley"",160}","{""Nagarro"",153}","{""Jobsbridge"",152}","{""Segula Technologies"",146}","{""Publicis Groupe"",143}","{""Oowlish Technology"",143}","{""Latitude"",136}","{""Sargent & Lundy"",135}","{""GardaWorld"",135}","{""Sonsoft"",134}","{""About Alstom"",129}","{""SAP"",127}",…,"{""Cogito"",1}","{""MCS of Tampa"",1}","{""pmX Group"",1}","{""SYSNAV"",1}","{""USCCB"",1}","{""Magic Eden"",1}","{""Global Fashion Group Sgp Services ."",1}","{""Perfection Custom Closets"",1}","{""Gro Intelligence"",1}","{""Security Services Northwest"",1}","{""Spatial"",1}","{""The Citadel"",1}","{""BitByte Robotronix India"",1}","{""JBHired"",1}","{""Outwood Academy Hasland Hall"",1}","{""POLYWOOD"",1}","{""Falcon Group"",1}","{""SecureWorks US"",1}","{""Nexthub CYBER & STRATEGIC RISK"",1}","{""Team Tumbleweed"",1}","{""Superior Resource Group"",1}","{""Allura Partners"",1}","{""Language Bear"",1}","{""Northwood Club"",1}","{""EI India"",1}","{""Cummins"",1}","{""STG DI HUB CONTENT SERVICES"",1}","{""Graphika"",1}","{""Imagia Canexia Health"",1}","{""VSee"",1}","{""MTU Aero Engines AG"",1}","{""MassMutual Global Business Services Romania"",1}","{""Workiva"",1}","{""RHI"",1}","{""43107 GEA Food Solutions Weert"",1}","{""Auto Trader"",1}","{""Transformco"",1}"
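
The counts above show several kinds of dirty `company_name` values: numeric prefixes ("43107 GEA Food Solutions Weert"), a leading "About" ("About Alstom"), and a trailing "Careers" ("IBM Careers" next to "IBM"). A minimal normalization sketch follows. The three rules are assumptions based only on the names visible above and would need checking against the full column.

```python
import re


def normalize_company(name: str) -> str:
    """Apply a few illustrative cleanup rules to a raw company name."""
    cleaned = name.strip()
    cleaned = re.sub(r"^\d+[\s-]+", "", cleaned)   # leading numeric code
    cleaned = re.sub(r"^About\s+", "", cleaned)    # e.g. "About Alstom"
    cleaned = re.sub(r"\s+Careers$", "", cleaned)  # e.g. "IBM Careers"
    return cleaned


for raw in ["IBM Careers", "About Alstom", "43107 GEA Food Solutions Weert", "IBM"]:
    print(f"{raw!r} -> {normalize_company(raw)!r}")
```

With rules like these, "IBM Careers" and "IBM" would collapse into a single category before counting.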


In [89]:
whole_df.select(pl.col("seniority").value_counts(sort=True)).transpose()

column_0,column_1,column_2,column_3,column_4,column_5,column_6,column_7,column_8,column_9,column_10,column_11,column_12,column_13,column_14
struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2]
"{""unknown"",53578}","{""Senior IC"",17840}","{""Manager"",6688}","{""IC"",6623}","{""Unclear Seniority"",4593}","{""Staff IC"",3513}","{""Intern"",1982}","{""Junior IC"",1896}","{""Contract"",1106}","{""Director"",874}","{""Chief"",557}","{""Exec"",465}","{""Senior Manager"",273}","{""Senior Exec"",8}","{""Founder"",3}"


In [90]:
from polars_ds.diagnosis import DIA
import polars.selectors as cs

In [91]:
dia = DIA(whole_df)
dia.plot_null_distribution(cs.all())

Null distribution (per-column sparkline plots omitted)
column,null share
job_name,0.00%
job_location,11.45%
hours,0.00%
remote,76.62%
company_name,6.27%
education,0.00%
tags_matched,1.56%
tag_categories,1.56%
categories,0.00%
seniority,0.00%


In [92]:
whole_df.select(pl.col("hours").value_counts(sort=True)).transpose()

column_0,column_1,column_2,column_3,column_4,column_5,column_6,column_7,column_8,column_9,column_10,column_11,column_12,column_13,column_14,column_15
struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2]
"{""Full-Time"",61588}","{""unknown"",27739}","{""Contract"",3727}","{""Part-Time"",2009}","{""Unclear"",1905}","{""Intern"",1048}","{""Temp"",701}","{""Hourly"",513}","{""Student"",280}","{""Trainee"",187}","{""Advisor"",94}","{""Gig"",83}","{""Commission"",82}","{""Grant"",27}","{""Conditional"",13}","{""Volunteer"",3}"


In [93]:
whole_df.select(pl.col("language").value_counts(sort=True)).transpose()

column_0,column_1,column_2,column_3,column_4,column_5,column_6,column_7,column_8,column_9,column_10,column_11,column_12,column_13,column_14,column_15,column_16,column_17,column_18,column_19,column_20,column_21,column_22,column_23,column_24,column_25,column_26,column_27,column_28,column_29,column_30,column_31,column_32,column_33,column_34,column_35,column_36,column_37,column_38,column_39,column_40,column_41,column_42,column_43
struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2]
"{""en"",91143}","{""de"",1595}","{""fr"",1440}","{""pt"",1051}","{""es"",1002}","{""zh"",639}","{""unknown"",583}","{""nl"",567}","{""ja"",375}","{""ko"",347}","{""pl"",319}","{""sk"",214}","{""it"",168}","{""sv"",117}","{""ru"",96}","{""tr"",41}","{""id"",40}","{""hu"",39}","{""no"",38}","{""sl"",29}","{""cs"",29}","{""ro"",16}","{""uk"",16}","{""hr"",11}","{""fi"",11}","{""et"",11}","{""da"",11}","{""tl"",9}","{""ca"",6}","{""vi"",5}","{""el"",5}","{""lt"",5}","{""cy"",4}","{""af"",4}","{""ka"",2}","{""sw"",2}","{""sq"",2}","{""ar"",1}","{""lv"",1}","{""he"",1}","{""th"",1}","{""hy"",1}","{""sr"",1}","{""gb"",1}"


In [94]:
whole_df.select(pl.col("country").value_counts(sort=True)).transpose()

column_0,column_1,column_2,column_3,column_4,column_5,column_6,column_7,column_8,column_9,column_10,column_11,column_12,column_13,column_14,column_15,column_16,column_17,column_18,column_19,column_20,column_21,column_22,column_23,column_24,column_25,column_26,column_27,column_28,column_29,column_30,column_31,column_32,column_33,column_34,column_35,column_36,…,column_146,column_147,column_148,column_149,column_150,column_151,column_152,column_153,column_154,column_155,column_156,column_157,column_158,column_159,column_160,column_161,column_162,column_163,column_164,column_165,column_166,column_167,column_168,column_169,column_170,column_171,column_172,column_173,column_174,column_175,column_176,column_177,column_178,column_179,column_180,column_181,column_182
struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],…,struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2]
"{""United States"",44581}","{""India"",10275}","{null,9708}","{""United Kingdom"",4191}","{""Germany"",3560}","{""Canada"",2362}","{""Brazil"",1783}","{""France"",1530}","{""Australia"",1340}","{""Mexico"",1015}","{""China"",948}","{""Poland"",927}","{""Singapore"",902}","{""Spain"",857}","{""Netherlands"",854}","{""South Africa"",774}","{""Israel"",769}","{""Philippines"",666}","{""Italy"",587}","{""Romania"",574}","{""Malaysia"",564}","{""Japan"",509}","{""Ireland"",496}","{""Belgium"",480}","{""Sweden"",470}","{""Switzerland"",411}","{""Portugal"",401}","{""Colombia"",383}","{""Argentina"",371}","{""Austria"",334}","{""Thailand"",325}","{""Saudi Arabia"",309}","{""United Arab Emirates"",303}","{""Czech Republic"",294}","{""Taiwan"",281}","{""Egypt"",275}","{""Hungary"",263}",…,"{""South Sudan"",2}","{""Liechtenstein"",2}","{""Trinidad And Tobago"",2}","{""Fiji"",2}","{""Mali"",2}","{""Zambia"",2}","{""Equatorial Guinea"",2}","{""Somalia"",2}","{""Zimbabwe"",2}","{""Cayman Islands"",2}","{""Faroe Islands"",1}","{""Tajikistan"",1}","{""Marshall Islands"",1}","{""Sudan"",1}","{""Togo"",1}","{""Libya"",1}","{""Brunei"",1}","{""Liberia"",1}","{""Saint Lucia"",1}","{""Benin"",1}","{""Guyana"",1}","{""Vanuatu"",1}","{""Djibouti"",1}","{""Yemen"",1}","{""Afghanistan"",1}","{""Saint Kitts And Nevis"",1}","{""Guinea"",1}","{""San Marino"",1}","{""Bermuda"",1}","{""Greenland"",1}","{""Wallis And Futuna"",1}","{""Central African Republic"",1}","{""Laos"",1}","{""Turkmenistan"",1}","{""Gabon"",1}","{""Mozambique"",1}","{""Aruba"",1}"


In [95]:
date_data_new = whole_df.select(cs.date())
bool_data_new = whole_df.select(cs.by_dtype(pl.Boolean))
string_data_new = whole_df.select(cs.string(include_categorical=True))
nested_data_new = whole_df.select(
    cs.by_name("tags_matched", "tag_categories", "categories")
)
num_data_new = whole_df.select(cs.float())

In [96]:
whole_df.select(pl.col("job_location").value_counts(sort=True))

job_location
struct[2]
"{null,11452}"
"{""Remote"",1316}"
"{""United States"",1284}"
"{""Bangalore, India"",942}"
"{""New York, NY"",453}"
…
"{""Weert: De Fuus 8"",1}"
"{""GILBERT, Arizona, United States"",1}"
"{""Centro Corporativo El Cafetal, Heredia, Heredia, Costa Rica"",1}"
"{""Beijing, BJ, CN"",1}"


In [97]:
print(f"date type columns:{date_data_new.columns}")
print(f"bool type columns:{bool_data_new.columns}")
print(f"string type columns:{string_data_new.columns}")
print(f"nested type columns:{nested_data_new.columns}")

date type columns:['job_published_at', 'last_indexed']
bool type columns:['remote']
string type columns:['job_name', 'job_location', 'hours', 'company_name', 'education', 'seniority', 'comp_est', 'language', 'city', 'country', 'new']
nested type columns:['tags_matched', 'tag_categories', 'categories']


In [98]:
missing = (
    whole_df.select(pl.all().is_null().sum())
    .melt(value_name="missing")
    .filter(pl.col("missing") > 0)
)

In [99]:
compensation = whole_df.filter(pl.col("compensation") > 0)

In [100]:
compensation

job_name,job_location,hours,remote,company_name,education,tags_matched,tag_categories,categories,seniority,comp_est,language,city,country,job_published_at,last_indexed,new,compensation
str,str,cat,bool,str,cat,list[str],list[str],list[str],cat,str,cat,str,str,date,date,str,f64
"""field service engineer (german…","""Weiterstadt, HE, DE, 64331""","""unknown""",,"""Ametek""","""unknown""",[],[],[],"""unknown""","""5.00""","""en""","""Weiterstadt""","""Germany""",2024-03-17,2024-04-01,"""April 2024""",5.0
"""quality engineer""","""Burlington, Washington""","""Full-Time""",,"""Legend Brands""","""Bachelors""","[""Excel"", ""SAP""]","[""Treasury"", ""FP"", … ""Accounting""]",[],"""unknown""","""74000.0""","""en""","""Burlington""","""United States""",2023-05-12,2023-05-23,"""June 2023""",74000.0
"""quality assurance engineer ii""","""Savannah, GA""","""Contract""",,"""Aviation Technology Associates""","""Bachelors""","[""Sigma""]","[""Serverless"", ""Tools""]","[""Airlines"", ""Aerospace"", ""QA""]","""unknown""","""110000.00""","""en""","""Savannah""","""United States""",2023-05-25,2023-05-30,"""June 2023""",110000.0
"""alfresco solution architect""","""Boulder, Colorado, 80303 Unite…","""Full-Time""",true,"""Zia Consulting""","""Bachelors""","[""Git"", ""JavaScript"", … ""Java""]","[""Java"", ""Build"", … ""Databases""]","[""Architecture"", ""Planning"", ""Consulting""]","""unknown""","""75055.00""","""en""","""Boulder""","""United States""",2022-10-06,2023-05-23,"""June 2023""",75055.0
"""manufacturing quality engineer""","""Marlborough, CT""","""Full-Time""",,"""Novanta , USA""","""Bachelors""",[],[],"[""Manufacturing""]","""unknown""","""91798.00""","""en""","""Marlborough""","""United States""",2024-03-26,2024-03-27,"""April 2024""",91798.0
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
"""acquisition and production eng…","""Moorestown, NJ""","""Full-Time""",,"""JSL Technologies""","""Bachelors""","[""AWS""]","[""IaaS"", ""PaaS"", ""Compute""]",[],"""Manager""","""97500.0""","""en""","""Moorestown""","""United States""",2023-12-12,2024-03-14,"""April 2024""",97500.0
"""software engineer, ios mobile …","""Manchester""","""Part-Time""",,"""Auto Trader""","""unknown""","[""Xcode"", ""iOS"", … ""Swift""]","[""Mobile"", ""Java"", … ""Runtime""]","[""Software"", ""iOS"", … ""Apps""]","""unknown""","""47500.00""","""en""","""Manchester""","""United Kingdom""",2023-01-06,2024-04-01,"""April 2024""",47500.0
"""frontend software development …","""Vancouver, British Columbia, C…","""unknown""",,"""Global Relay""","""unknown""","[""Docker"", ""React"", ""Kubernetes""]","[""Container"", ""Orchestration"", … ""Software""]","[""Software"", ""Software"", ""Development""]","""Manager""","""145000.00""","""en""","""Vancouver""","""Canada""",2024-02-12,2024-03-30,"""April 2024""",145000.0
"""aws software engineer iii - et…","""Jersey City, NJ, United States""","""Full-Time""",,"""281971-Ipm Mission Control Ap_…","""unknown""","[""JPMorgan"", ""Chase"", … ""SQL""]","[""OSS"", ""OS"", … ""PaaS""]","[""Software"", ""Financial"", ""Services""]","""unknown""","""154125.00""","""en""","""Jersey City""","""United States""",2023-11-13,2024-04-01,"""April 2024""",154125.0


#### Country names and alpha-3 country codes

In [101]:
whole_df = whole_df.with_columns(pl.col("country").str.replace("Turkey", "Turkiye"))

In [102]:
# "Turkey" was already renamed in the previous cell, so only "Ivory Coast" is left to handle.
whole_df = whole_df.with_columns(
    pl.col("country").str.replace("Ivory Coast", "Côte d'Ivoire", literal=True)
)

In [103]:
whole_df

job_name,job_location,hours,remote,company_name,education,tags_matched,tag_categories,categories,seniority,comp_est,language,city,country,job_published_at,last_indexed,new,compensation
str,str,cat,bool,str,cat,list[str],list[str],list[str],cat,str,cat,str,str,date,date,str,f64
"""retail front end supervisor""",,"""unknown""",,"""External Ocean State Job Lot""","""unknown""",[],[],"[""Retail"", ""Job"", ""Board""]","""Manager""","""""","""en""","""Wethersfield""","""United States""",2023-06-01,2023-06-05,"""June 2023""",
"""développeur mobile ios (swift)…","""Toulouse""","""Full-Time""",false,"""MY SAM CAB""","""unknown""","[""Notion"", ""Kotlin"", … ""PostgreSQL""]","[""Databases"", ""SaaS"", … ""Languages""]","[""iOS"", ""Apple"", ""Related""]","""unknown""",,"""fr""","""Toulouse""","""France""",2022-07-13,2023-06-07,"""June 2023""",
"""fluid systems chief engineer""","""Cedar park, TX""","""unknown""",,"""Firefly Aerospace""","""Bachelors""",[],[],"[""Airlines"", ""Aerospace""]","""Chief""",,"""en""",,"""United States""",,2024-03-25,"""April 2024""",
"""instrumentation engineer""",,"""unknown""",true,,"""unknown""",[],[],[],"""unknown""",,"""pl""",,,2024-01-28,2024-04-01,"""April 2024""",
"""civil roadway engineering inte…","""Denver, Colorado, United State…","""Unclear""",,"""RS&H Talent Acquisition""","""Bachelors""",[],[],[],"""Intern""",,"""en""","""Denver""","""United States""",2022-03-14,2024-03-14,"""April 2024""",
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
"""aws software engineer iii - et…","""Jersey City, NJ, United States""","""Full-Time""",,"""281971-Ipm Mission Control Ap_…","""unknown""","[""JPMorgan"", ""Chase"", … ""SQL""]","[""OSS"", ""OS"", … ""PaaS""]","[""Software"", ""Financial"", ""Services""]","""unknown""","""154125.00""","""en""","""Jersey City""","""United States""",2023-11-13,2024-04-01,"""April 2024""",154125.0
"""únete a nuestra comunidad de t…","""Argentina, Buenos Aires, Pelle…","""Full-Time""",,,"""unknown""",[],[],[],"""unknown""",,"""es""","""Buenos Aires""","""Argentina""",2024-03-08,2024-03-26,"""April 2024""",
"""staff thermal systems engineer…","""MDLI18""","""Full-Time""",,"""0078 MS""","""Bachelors""",[],[],[],"""Staff IC""","""197500.00""","""en""",,"""United States""",2023-05-16,2023-05-28,"""June 2023""",197500.0
"""fullstack developer""","""Guadalajara, Mexico""","""unknown""",,"""IBM""","""unknown""","[""Blockchain"", ""IBM"", … ""TypeScript""]","[""Data"", ""Science"", … ""Libraries""]","[""Software""]","""unknown""","""""","""pl""","""Guadalajara""","""Mexico""",2023-04-11,2023-06-05,"""June 2023""",


In [104]:
alpha_path = "/home/anopsy/Portfolio/sourcestack/data/alpha3_codes.csv"
alpha_codes = pl.read_csv(alpha_path)

In [105]:
alpha_df = whole_df.join(alpha_codes, on="country", how="left")
alpha_df

job_name,job_location,hours,remote,company_name,education,tags_matched,tag_categories,categories,seniority,comp_est,language,city,country,job_published_at,last_indexed,new,compensation,code
str,str,cat,bool,str,cat,list[str],list[str],list[str],cat,str,cat,str,str,date,date,str,f64,str
"""retail front end supervisor""",,"""unknown""",,"""External Ocean State Job Lot""","""unknown""",[],[],"[""Retail"", ""Job"", ""Board""]","""Manager""","""""","""en""","""Wethersfield""","""United States""",2023-06-01,2023-06-05,"""June 2023""",,"""USA"""
"""développeur mobile ios (swift)…","""Toulouse""","""Full-Time""",false,"""MY SAM CAB""","""unknown""","[""Notion"", ""Kotlin"", … ""PostgreSQL""]","[""Databases"", ""SaaS"", … ""Languages""]","[""iOS"", ""Apple"", ""Related""]","""unknown""",,"""fr""","""Toulouse""","""France""",2022-07-13,2023-06-07,"""June 2023""",,"""FRA"""
"""fluid systems chief engineer""","""Cedar park, TX""","""unknown""",,"""Firefly Aerospace""","""Bachelors""",[],[],"[""Airlines"", ""Aerospace""]","""Chief""",,"""en""",,"""United States""",,2024-03-25,"""April 2024""",,"""USA"""
"""instrumentation engineer""",,"""unknown""",true,,"""unknown""",[],[],[],"""unknown""",,"""pl""",,,2024-01-28,2024-04-01,"""April 2024""",,
"""civil roadway engineering inte…","""Denver, Colorado, United State…","""Unclear""",,"""RS&H Talent Acquisition""","""Bachelors""",[],[],[],"""Intern""",,"""en""","""Denver""","""United States""",2022-03-14,2024-03-14,"""April 2024""",,"""USA"""
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
"""aws software engineer iii - et…","""Jersey City, NJ, United States""","""Full-Time""",,"""281971-Ipm Mission Control Ap_…","""unknown""","[""JPMorgan"", ""Chase"", … ""SQL""]","[""OSS"", ""OS"", … ""PaaS""]","[""Software"", ""Financial"", ""Services""]","""unknown""","""154125.00""","""en""","""Jersey City""","""United States""",2023-11-13,2024-04-01,"""April 2024""",154125.0,"""USA"""
"""únete a nuestra comunidad de t…","""Argentina, Buenos Aires, Pelle…","""Full-Time""",,,"""unknown""",[],[],[],"""unknown""",,"""es""","""Buenos Aires""","""Argentina""",2024-03-08,2024-03-26,"""April 2024""",,"""ARG"""
"""staff thermal systems engineer…","""MDLI18""","""Full-Time""",,"""0078 MS""","""Bachelors""",[],[],[],"""Staff IC""","""197500.00""","""en""",,"""United States""",2023-05-16,2023-05-28,"""June 2023""",197500.0,"""USA"""
"""fullstack developer""","""Guadalajara, Mexico""","""unknown""",,"""IBM""","""unknown""","[""Blockchain"", ""IBM"", … ""TypeScript""]","[""Data"", ""Science"", … ""Libraries""]","[""Software""]","""unknown""","""""","""pl""","""Guadalajara""","""Mexico""",2023-04-11,2023-06-05,"""June 2023""",,"""MEX"""


In [106]:
alpha_df = alpha_df.drop("categories", "tags_matched", "tag_categories")

In [107]:
alpha_df.write_csv(
    "/home/anopsy/Portfolio/sourcestack/data/alpha_df.csv", separator=","
)

In [108]:
alpha_df

job_name,job_location,hours,remote,company_name,education,seniority,comp_est,language,city,country,job_published_at,last_indexed,new,compensation,code
str,str,cat,bool,str,cat,cat,str,cat,str,str,date,date,str,f64,str
"""retail front end supervisor""",,"""unknown""",,"""External Ocean State Job Lot""","""unknown""","""Manager""","""""","""en""","""Wethersfield""","""United States""",2023-06-01,2023-06-05,"""June 2023""",,"""USA"""
"""développeur mobile ios (swift)…","""Toulouse""","""Full-Time""",false,"""MY SAM CAB""","""unknown""","""unknown""",,"""fr""","""Toulouse""","""France""",2022-07-13,2023-06-07,"""June 2023""",,"""FRA"""
"""fluid systems chief engineer""","""Cedar park, TX""","""unknown""",,"""Firefly Aerospace""","""Bachelors""","""Chief""",,"""en""",,"""United States""",,2024-03-25,"""April 2024""",,"""USA"""
"""instrumentation engineer""",,"""unknown""",true,,"""unknown""","""unknown""",,"""pl""",,,2024-01-28,2024-04-01,"""April 2024""",,
"""civil roadway engineering inte…","""Denver, Colorado, United State…","""Unclear""",,"""RS&H Talent Acquisition""","""Bachelors""","""Intern""",,"""en""","""Denver""","""United States""",2022-03-14,2024-03-14,"""April 2024""",,"""USA"""
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
"""aws software engineer iii - et…","""Jersey City, NJ, United States""","""Full-Time""",,"""281971-Ipm Mission Control Ap_…","""unknown""","""unknown""","""154125.00""","""en""","""Jersey City""","""United States""",2023-11-13,2024-04-01,"""April 2024""",154125.0,"""USA"""
"""únete a nuestra comunidad de t…","""Argentina, Buenos Aires, Pelle…","""Full-Time""",,,"""unknown""","""unknown""",,"""es""","""Buenos Aires""","""Argentina""",2024-03-08,2024-03-26,"""April 2024""",,"""ARG"""
"""staff thermal systems engineer…","""MDLI18""","""Full-Time""",,"""0078 MS""","""Bachelors""","""Staff IC""","""197500.00""","""en""",,"""United States""",2023-05-16,2023-05-28,"""June 2023""",197500.0,"""USA"""
"""fullstack developer""","""Guadalajara, Mexico""","""unknown""",,"""IBM""","""unknown""","""unknown""","""""","""pl""","""Guadalajara""","""Mexico""",2023-04-11,2023-06-05,"""June 2023""",,"""MEX"""


In [109]:
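# Note: count() skips nulls, so the group of rows without a code reports 0 below;
# pl.len() would count those rows instead.
country_bar = alpha_df.group_by("code").agg(pl.col("code").count().alias("count"))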

In [110]:
country_bar.transpose()

column_0,column_1,column_2,column_3,column_4,column_5,column_6,column_7,column_8,column_9,column_10,column_11,column_12,column_13,column_14,column_15,column_16,column_17,column_18,column_19,column_20,column_21,column_22,column_23,column_24,column_25,column_26,column_27,column_28,column_29,column_30,column_31,column_32,column_33,column_34,column_35,column_36,…,column_143,column_144,column_145,column_146,column_147,column_148,column_149,column_150,column_151,column_152,column_153,column_154,column_155,column_156,column_157,column_158,column_159,column_160,column_161,column_162,column_163,column_164,column_165,column_166,column_167,column_168,column_169,column_170,column_171,column_172,column_173,column_174,column_175,column_176,column_177,column_178,column_179
str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,…,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str
"""MAR""","""SMR""","""TWN""","""JAM""","""MMR""","""SWE""","""IND""","""MYS""","""LBY""","""SLE""","""PSE""","""AFG""","""GIB""","""ARE""","""SLV""","""TZA""","""URY""","""None""","""MEX""","""SRB""","""ECU""","""PER""","""MNE""","""ZAF""","""GAB""","""LTU""","""CAF""","""NAM""","""SEN""","""GUY""",,"""MWI""","""USA""","""CHL""","""BHS""","""RUS""","""GBR""",…,"""TUR""","""BEN""","""TJK""","""IRN""","""ABW""","""SHN""","""NIC""","""SDN""","""VEN""","""BGR""","""ISR""","""JPN""","""AZE""","""NOR""","""KHM""","""ZWE""","""PHL""","""CRI""","""BLZ""","""BMU""","""LBN""","""NGA""","""MHL""","""LIE""","""MLT""","""GRC""","""ALB""","""HKG""","""BWA""","""MOZ""","""CMR""","""MDA""","""MKD""","""CAN""","""BLR""","""TKM""","""LSO"""
"""79""","""1""","""281""","""16""","""4""","""470""","""10275""","""564""","""1""","""2""","""7""","""1""","""3""","""303""","""17""","""5""","""50""","""2""","""1015""","""122""","""47""","""145""","""5""","""774""","""1""","""100""","""1""","""25""","""8""","""1""","""0""","""2""","""44581""","""159""","""3""","""28""","""4203""",…,"""225""","""1""","""1""","""11""","""1""","""3""","""9""","""1""","""25""","""210""","""769""","""509""","""5""","""126""","""13""","""2""","""666""","""203""","""2""","""1""","""48""","""152""","""1""","""2""","""56""","""241""","""8""","""232""","""3""","""1""","""6""","""11""","""17""","""2362""","""4""","""1""","""4"""


In [111]:
alpha_df = alpha_df.with_columns(pl.col("country").fill_null("unknown"))

In [112]:
alpha_df.select(pl.col("country").value_counts(sort=True)).transpose()

column_0,column_1,column_2,column_3,column_4,column_5,column_6,column_7,column_8,column_9,column_10,column_11,column_12,column_13,column_14,column_15,column_16,column_17,column_18,column_19,column_20,column_21,column_22,column_23,column_24,column_25,column_26,column_27,column_28,column_29,column_30,column_31,column_32,column_33,column_34,column_35,column_36,…,column_146,column_147,column_148,column_149,column_150,column_151,column_152,column_153,column_154,column_155,column_156,column_157,column_158,column_159,column_160,column_161,column_162,column_163,column_164,column_165,column_166,column_167,column_168,column_169,column_170,column_171,column_172,column_173,column_174,column_175,column_176,column_177,column_178,column_179,column_180,column_181,column_182
struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],…,struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2]
"{""United States"",44581}","{""India"",10275}","{""unknown"",9708}","{""United Kingdom"",4191}","{""Germany"",3560}","{""Canada"",2362}","{""Brazil"",1783}","{""France"",1530}","{""Australia"",1340}","{""Mexico"",1015}","{""China"",948}","{""Poland"",927}","{""Singapore"",902}","{""Spain"",857}","{""Netherlands"",854}","{""South Africa"",774}","{""Israel"",769}","{""Philippines"",666}","{""Italy"",587}","{""Romania"",574}","{""Malaysia"",564}","{""Japan"",509}","{""Ireland"",496}","{""Belgium"",480}","{""Sweden"",470}","{""Switzerland"",411}","{""Portugal"",401}","{""Colombia"",383}","{""Argentina"",371}","{""Austria"",334}","{""Thailand"",325}","{""Saudi Arabia"",309}","{""United Arab Emirates"",303}","{""Czech Republic"",294}","{""Taiwan"",281}","{""Egypt"",275}","{""Hungary"",263}",…,"{""South Sudan"",2}","{""Liechtenstein"",2}","{""Trinidad And Tobago"",2}","{""Fiji"",2}","{""Mali"",2}","{""Zambia"",2}","{""Equatorial Guinea"",2}","{""Somalia"",2}","{""Zimbabwe"",2}","{""Cayman Islands"",2}","{""Faroe Islands"",1}","{""Tajikistan"",1}","{""Marshall Islands"",1}","{""Sudan"",1}","{""Togo"",1}","{""Libya"",1}","{""Brunei"",1}","{""Liberia"",1}","{""Saint Lucia"",1}","{""Benin"",1}","{""Guyana"",1}","{""Vanuatu"",1}","{""Djibouti"",1}","{""Yemen"",1}","{""Afghanistan"",1}","{""Saint Kitts And Nevis"",1}","{""Guinea"",1}","{""San Marino"",1}","{""Bermuda"",1}","{""Greenland"",1}","{""Wallis And Futuna"",1}","{""Central African Republic"",1}","{""Laos"",1}","{""Turkmenistan"",1}","{""Gabon"",1}","{""Mozambique"",1}","{""Aruba"",1}"


In [113]:
alpha_df.select(pl.col("language").value_counts(sort=True)).transpose()

column_0,column_1,column_2,column_3,column_4,column_5,column_6,column_7,column_8,column_9,column_10,column_11,column_12,column_13,column_14,column_15,column_16,column_17,column_18,column_19,column_20,column_21,column_22,column_23,column_24,column_25,column_26,column_27,column_28,column_29,column_30,column_31,column_32,column_33,column_34,column_35,column_36,column_37,column_38,column_39,column_40,column_41,column_42,column_43
struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2]
"{""en"",91143}","{""de"",1595}","{""fr"",1440}","{""pt"",1051}","{""es"",1002}","{""zh"",639}","{""unknown"",583}","{""nl"",567}","{""ja"",375}","{""ko"",347}","{""pl"",319}","{""sk"",214}","{""it"",168}","{""sv"",117}","{""ru"",96}","{""tr"",41}","{""id"",40}","{""hu"",39}","{""no"",38}","{""sl"",29}","{""cs"",29}","{""ro"",16}","{""uk"",16}","{""hr"",11}","{""fi"",11}","{""et"",11}","{""da"",11}","{""tl"",9}","{""ca"",6}","{""vi"",5}","{""el"",5}","{""lt"",5}","{""cy"",4}","{""af"",4}","{""ka"",2}","{""sw"",2}","{""sq"",2}","{""ar"",1}","{""lv"",1}","{""he"",1}","{""th"",1}","{""hy"",1}","{""sr"",1}","{""gb"",1}"


In [114]:
company_counts = alpha_df["company_name"].value_counts(sort=True)
company_counts = company_counts.drop_nulls()
company_counts

company_name,count
str,u32
"""IBM""",2683
"""Allied Universal""",1057
"""CLBPTS""",668
"""Bosch Group""",533
"""Schneider Electric""",397
…,…
"""Workiva""",1
"""RHI""",1
"""43107 GEA Food Solutions Weert""",1
"""Auto Trader""",1


In [115]:
# A threshold of 200 postings keeps the 20 largest employers in this dataset.
top20_companies = company_counts.filter(pl.col("count") >= 200)
top20_companies

company_name,count
str,u32
"""IBM""",2683
"""Allied Universal""",1057
"""CLBPTS""",668
"""Bosch Group""",533
"""Schneider Electric""",397
…,…
"""Coders Brain Technology""",242
"""Capgemini""",224
"""IBM Careers""",220
"""FullStack Labs""",215


In [116]:
top20_companies["count"].sum()

9502

In [117]:
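# Despite the name, this keeps every company with at least 20 postings,
# which need not be exactly 50 companies.
top50_companies = company_counts.filter(pl.col("count") >= 20)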
top50_companies

company_name,count
str,u32
"""IBM""",2683
"""Allied Universal""",1057
"""CLBPTS""",668
"""Bosch Group""",533
"""Schneider Electric""",397
…,…
"""Lever Demo 2""",20
"""The Osborn Engineering Co""",20
"""Bowman""",20
"""IN10 VMware Software India""",20


In [118]:
top50_companies["count"].sum()

31576

In [119]:
import hvplot.polars  # registers the .hvplot accessor; harmless if already imported earlier

plot_top_companies = top20_companies.hvplot.barh(
    x="company_name",
    y="count",
    color="count",
    rot=90,
    title="Top Companies",
    colorbar=True,
    cmap="plasma",
    clabel="Number of Jobs",
)

In [120]:
hvplot.save(plot_top_companies, "top_companies.png")



city->cat

In [121]:
lat_long = pl.read_csv("/home/anopsy/Portfolio/sourcestack/data/city_coordinates.csv")
lat_long

city,lat,long
str,f64,f64
"""Bilzen""",50.870779,5.5181089
"""Sumidaku""",35.700379,139.805867
"""Kabupaten Bogor""",-6.545325,107.001742
"""Reykjavík""",64.145981,-21.942237
"""Dun Laoghaire""",53.292279,-6.136008
…,…,…
"""Bensenville""",41.953838,-87.943178
"""Osasco""",-23.532486,-46.79168
"""Chehalis""",46.659965,-122.963432
"""Aracajú""",-10.916206,-37.077466


In [122]:
# GroupBy.count() is deprecated in recent polars; pl.len() counts the rows per city.
city_count = whole_df.group_by("city").agg(pl.len().alias("count"))


In [123]:
city_df = city_count.join(lat_long, on="city", how="left")
city_df = city_df.drop_nulls()

In [124]:
city_df.sort(by="count", descending=True).head(10)

city,count,lat,long
str,u32,f64,f64
"""Bengaluru""",1942,12.976794,77.590082
"""Bangalore""",1512,12.988157,77.6226
"""San Francisco""",961,37.779259,-122.419329
"""London""",956,51.489334,-0.144055
"""Singapore""",863,1.357107,103.819499
"""New York""",847,40.712728,-74.006015
"""Hyderabad""",815,17.360589,78.474061
"""Pune""",779,18.521428,73.854454
"""Annapolis Junction""",690,39.118996,-76.796342
"""Austin""",656,30.271129,-97.7437


In [125]:
city_df = city_df.with_columns(log_num=pl.col("count").log(base=2))
city_df

city,count,lat,long,log_num
str,u32,f64,f64,f64
"""Tobyhanna""",2,41.177032,-75.417962,1.0
"""Berrien Springs""",1,41.946434,-86.338897,0.0
"""Cheadle""",3,52.988439,-1.99376,1.584963
"""Santa Cruz""",12,28.467178,-16.250784,3.584963
"""Bradford""",6,53.794423,-1.751919,2.584963
…,…,…,…,…
"""Al Abageyah""",1,30.02287,31.264988,0.0
"""Fenton""",4,38.513199,-90.440058,2.0
"""Cumbria-Barrow in Furness""",1,54.097908,-3.257052,0.0
"""Toronto (Remote)""",1,0.0,0.0,0.0


In [126]:
city_df.hvplot.points(
    x="long",
    y="lat",
    coastline=True,
    tiles=True,
    s="count",
    color="count",
    cmap="plasma_r",
    alpha=0.8,
)

In [127]:
plot_city = city_df.hvplot.points(
    x="long",
    y="lat",
    coastline=True,
    tiles=True,
    s="count",
    color="log_num",
    cmap="plasma_r",
    alpha=0.7,
)

In [128]:
hvplot.save(plot_city, "cities.png")

