## EDA and Cleaning - SourceStack datasets

This notebook explores and cleans two datasets that I obtained by calling the SourceStack API.\
The first dataset comes from **June 9, 2023**\
and the more recent one from **April 2, 2024**.

### Initial Exploration
1. shape
2. dtypes
3. missing values


### Cleaning
1. parsing strings containing datetimes to dates
2. converting strings containing a list to a list of strings
3. converting numerical data from strings to Int/Float
4. identifying dirty categories

#### Let's read in the data and have a look at its shape, columns and values

In [1]:
!pip install "polars_ds[plot]"

In [2]:
!pip install --upgrade polars

In [3]:
import sys

In [4]:
print(sys.executable)

/home/anopsy/Portfolio/sourcestack/sstack/bin/python


In [5]:
import polars as pl

In [6]:
old_data_path = "/home/anopsy/Portfolio/sourcestack/data/9june2023.csv"
new_data_path = "/home/anopsy/Portfolio/sourcestack/data/2april2024.csv"

In [7]:
old_df = pl.read_csv(old_data_path, try_parse_dates=False)
new_df = pl.read_csv(new_data_path, try_parse_dates=False)

In [8]:
print(f"Shape of the old data is: {old_df.shape}")
print(f"Shape of the new data is: {new_df.shape}")

Shape of the old data is: (50000, 16)
Shape of the new data is: (50000, 16)


In [9]:
old_df.head()

job_name,job_location,hours,remote,company_name,education,tags_matched,tag_categories,categories,seniority,comp_est,language,city,country,job_published_at,last_indexed
str,str,str,bool,str,str,str,str,str,str,str,str,str,str,str,str
"""Backend Developer""","""Praha, Czech Republic""",,,"""IBM""",,"""[Docker, GraphQL, NoSQL, IBM, …","""[Container Orchestration, Quer…","""[Software]""",,"""""","""pl""","""Praha""","""Czech Republic""","""2023-03-13 05:12:29""","""2023-06-05 13:43:49"""
"""Manufacturing Engineer""",,"""Full-Time""",False,,,"""[Sigma]""","""[Tools, Serverless]""","""[Manufacturing]""",,"""""","""en""","""Sterling Heights""","""United States""","""2021-10-09 00:00:00""","""2023-05-24 05:35:57"""
"""Design Engineer, Motorized Pro…","""520 S Byrkit St Mishawaka, Ind…","""Full-Time""",,"""ABI Attachments""","""Bachelors""","""[]""","""[]""","""[Design]""","""Senior IC""","""""","""en""","""Mishawaka""","""United States""","""2023-04-28 03:04:28""","""2023-05-19 14:48:10"""
"""Cybersecurity Engineer""",,"""Full-Time""",False,,"""Bachelors""","""[AWS, Qualys, Splunk]""","""[Compute, Logging & Monitoring…","""[Cybersecurity, Security]""",,"""""","""en""","""Herndon""","""United States""","""2023-04-03 00:00:00""","""2023-05-28 11:47:09"""
"""Your Career so choose wisely w…","""Kolkata, India""","""Full-Time""",False,"""Adeeba e Services""",,"""[Objective-C, Subversion, Swif…","""[Cloud Native Storage, Program…","""[Software]""",,"""""","""en""","""Kolkata""","""India""","""2017-01-17 11:35:48""","""2023-05-30 11:51:08"""


In [10]:
new_df.head()

job_name,job_location,hours,remote,company_name,education,tags_matched,tag_categories,categories,seniority,comp_est,language,city,country,job_published_at,last_indexed
str,str,str,bool,str,str,str,str,str,str,str,str,str,str,str,str
"""Dir, Engineering NPD, Critical…","""Dominican Republic-Nave 25-Mer…","""Full-Time""",,"""DR""","""Bachelors""","""[Microsoft]""","""[]""","""[]""",,,"""en""",,"""Dominican Republic""","""2024-03-04 00:00:00""","""2024-03-26 08:03:11"""
"""Software Engineer - Embedded""","""Dresden or Hartmannsdorf, Sach…","""Full-Time""",,"""Manning Global""",,"""[Linux]""","""[OS]""","""[Software, IT]""",,,"""en""","""Dresden or Hartmannsdorf""","""Germany""","""2024-02-15 00:00:00""","""2024-04-01 09:40:27"""
"""Embedded Software Test Enginee…","""Brisbane, CA""","""Full-Time""",,"""Avive""",,"""[Linux, C++]""","""[OS, Programming Languages, OS…","""[Software]""",,"""150000.00""","""en""","""Brisbane""","""Australia""","""2023-10-23 00:00:00""","""2024-04-01 15:25:43"""
"""Manufacturing Engineering Mana…","""Monroe, WI""","""Full-Time""",,"""United Future""",,"""[]""","""[]""","""[Manufacturing]""","""Manager""","""1.00""","""en""","""Monroe""","""United States""","""2024-03-27 20:18:23""","""2024-03-28 20:24:27"""
"""Vom Lager zum Wächter | Direkt…","""Ennepetal, Nordrhein-Westfalen…","""Full-Time""",,"""RUHR VERMITTLUNG""",,"""[WhatsApp, Vercel]""","""[Communications, VoIP, Serverl…","""[Security]""",,,"""de""","""Dortmund""","""Germany""","""2024-03-27 12:31:19""","""2024-03-31 11:16:19"""


### Initial explorations of unprocessed dataframes

#### Shape
Both datasets contain **50000 records** \
each record is represented by **16 features**

#### Dtypes
15 of the features currently have the String datatype\
1 feature is Bool

#### Missing values
The datasets contain **null values** and **empty strings**

In [11]:
pl.Config.set_tbl_width_chars(
    200
)  # setting wide format but it doesn't work that well for jupyter notebook

polars.config.Config

Let's also try the `sample` method: it returns random rows, so I can look at more records than `head` shows (and remember that `.sample` exists for later).

In [12]:
old_df.sample(3)

job_name,job_location,hours,remote,company_name,education,tags_matched,tag_categories,categories,seniority,comp_est,language,city,country,job_published_at,last_indexed
str,str,str,bool,str,str,str,str,str,str,str,str,str,str,str,str
"""Senior Backendutvecklare till …","""Malmö, Sweden""",,,"""IBM""",,"""[Vi, Git, Java, Gradle, IBM, J…","""[Software, SaaS, Data Governan…","""[IT]""","""Senior IC""","""""","""zh-cn""","""Malmö""","""Sweden""","""2023-05-03 07:25:03""","""2023-06-06 09:28:17"""
"""Tenure Track Faculty - Geotech…","""Sacramento - Northern Californ…",,,"""California State University, S…","""Bachelors""","""[]""","""[]""","""[Civil Engineering]""",,"""93000.00""","""en""","""Sacramento""","""United States""","""2022-09-19 09:00:00""","""2023-06-06 16:32:20"""
"""Network Forensics Cybersecurit…","""Arlington, VA""","""Full-Time""",,"""Maania Consultancy Services""",,"""[Windows, Linux]""","""[OS]""","""[Security, Cybersecurity]""","""Senior IC""","""""","""en""","""Arlington""","""United States""","""2023-05-16 00:00:00""","""2023-05-18 13:13:56"""


In [13]:
new_df.sample(3)

job_name,job_location,hours,remote,company_name,education,tags_matched,tag_categories,categories,seniority,comp_est,language,city,country,job_published_at,last_indexed
str,str,str,bool,str,str,str,str,str,str,str,str,str,str,str,str
"""Software Engineering Internshi…","""Colombo, Sri Lanka""","""Intern""",,"""BeGOOD Solutions""","""Bachelors""","""[TypeScript, MongoDB, Spring B…","""[SaaS, App Definition and Deve…","""[Software, IT]""","""Intern""",,"""en""","""Colombo""","""Sri Lanka""",,"""2024-03-29 17:02:30"""
"""Manufacturing Engineer – Space…","""Kirkland, Washington, USA""","""Full-Time""",,"""Amazon Kuiper Manufacturing En…",,,,"""[Manufacturing, Hardware]""",,"""""",,"""Kirkland""","""United States""","""2024-02-22 00:00:00""","""2024-04-01 14:54:00"""
"""Project Engineer for Developme…","""Berlin, Deutschland""","""Part-Time""",,"""Voyage SE""",,,,"""[]""",,"""""",,"""Berlin""","""Germany""","""2019-08-01 00:00:00""","""2024-03-27 05:24:52"""


#### Add a column that identifies whether a record comes from 2023 or 2024, then concatenate both dataframes into one

In [14]:
# add a constant string column that identifies the source dataframe
old_df = old_df.with_columns(pl.lit("June 2023").alias("new"))
new_df = new_df.with_columns(pl.lit("April 2024").alias("new"))

In [15]:
old_df.head()

job_name,job_location,hours,remote,company_name,education,tags_matched,tag_categories,categories,seniority,comp_est,language,city,country,job_published_at,last_indexed,new
str,str,str,bool,str,str,str,str,str,str,str,str,str,str,str,str,str
"""Backend Developer""","""Praha, Czech Republic""",,,"""IBM""",,"""[Docker, GraphQL, NoSQL, IBM, …","""[Container Orchestration, Quer…","""[Software]""",,"""""","""pl""","""Praha""","""Czech Republic""","""2023-03-13 05:12:29""","""2023-06-05 13:43:49""","""June 2023"""
"""Manufacturing Engineer""",,"""Full-Time""",False,,,"""[Sigma]""","""[Tools, Serverless]""","""[Manufacturing]""",,"""""","""en""","""Sterling Heights""","""United States""","""2021-10-09 00:00:00""","""2023-05-24 05:35:57""","""June 2023"""
"""Design Engineer, Motorized Pro…","""520 S Byrkit St Mishawaka, Ind…","""Full-Time""",,"""ABI Attachments""","""Bachelors""","""[]""","""[]""","""[Design]""","""Senior IC""","""""","""en""","""Mishawaka""","""United States""","""2023-04-28 03:04:28""","""2023-05-19 14:48:10""","""June 2023"""
"""Cybersecurity Engineer""",,"""Full-Time""",False,,"""Bachelors""","""[AWS, Qualys, Splunk]""","""[Compute, Logging & Monitoring…","""[Cybersecurity, Security]""",,"""""","""en""","""Herndon""","""United States""","""2023-04-03 00:00:00""","""2023-05-28 11:47:09""","""June 2023"""
"""Your Career so choose wisely w…","""Kolkata, India""","""Full-Time""",False,"""Adeeba e Services""",,"""[Objective-C, Subversion, Swif…","""[Cloud Native Storage, Program…","""[Software]""",,"""""","""en""","""Kolkata""","""India""","""2017-01-17 11:35:48""","""2023-05-30 11:51:08""","""June 2023"""


In [16]:
# concatenating old and new data
whole_df = old_df.vstack(new_df)

print(whole_df.shape)

(100000, 17)


Removing duplicates

In [17]:
whole_df = whole_df.unique()
# there were 333 duplicates
whole_df

job_name,job_location,hours,remote,company_name,education,tags_matched,tag_categories,categories,seniority,comp_est,language,city,country,job_published_at,last_indexed,new
str,str,str,bool,str,str,str,str,str,str,str,str,str,str,str,str,str
"""Senior Data Science Engineer""","""6710a Rockledge Dr suite 400, …","""Full-Time""",true,"""RightEye""",,"""[SQL, NoSQL, AWS, Python]""","""[NoSQL, Data Science Tools, Ia…","""[]""","""Senior IC""",,"""en""","""Bethesda""","""United States""","""2023-09-08 18:35:30""","""2024-03-15 07:27:53""","""April 2024"""
"""Domestic Intruder/Fire alarm e…","""Shaftesbury, United Kingdom""",,false,"""Swann Recruitment""",,"""[]""","""[]""","""[Recruiting & Staffing]""",,,"""en""","""Shaftesbury""","""United Kingdom""","""2018-10-03 09:27:33""","""2024-03-20 23:47:42""","""April 2024"""
"""Software Developer - Product S…","""Remote, Spain""","""Unclear""",true,"""Red Hat""",,"""[Vue.js, GitHub, Bootstrap, Py…","""[Scheduling & Orchestration, T…","""[Security, Software]""",,,"""en-us""",,"""Spain""","""2024-03-08 05:00:00""","""2024-03-30 16:07:00""","""April 2024"""
"""SAP Automation Engineer""","""Hyderabad, India""",,false,"""SQUIRCLE IT CONSULTING SERVICE…",,"""[SAP]""","""[IaaS, Travel and Tourism, FP&…","""[IT, ERP, Consulting, Business…",,"""""","""en""","""Hyderabad""","""India""","""2016-10-14 18:29:23""","""2023-05-29 22:34:26""","""June 2023"""
"""Cleared Armed Security Officer…","""Columbia, MD, US""",,,,"""Some High School""","""[]""","""[]""","""[Security]""","""Unclear Seniority""",,"""en""","""Columbia""","""United States""","""2024-02-14 18:20:00""","""2024-03-17 00:11:04""","""April 2024"""
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
"""Security Tech Lead""",,"""Full-Time""",,"""AArete Technosoft""",,"""[SonarQube, Git, JSON, React, …","""[Programming Languages, Projec…","""[Security]""","""Staff IC""","""""","""en""","""Pune City""","""India""","""2022-04-06 00:00:00""","""2023-06-04 02:20:23""","""June 2023"""
"""Senior Quality Program Manager…","""Oregon - Remote""","""Full-Time""",true,"""100-SFDC""",,"""[Salesforce, Ranger, GitHub]""","""[CRM, Continuous Integration (…","""[Philanthropy, CRM, Education,…","""Manager""","""""","""en""",,"""United States""","""2023-05-17 00:00:00""","""2023-05-19 05:28:35""","""June 2023"""
"""Product Manager - Layered | Je…","""Greensboro, NC""",,,"""Market America""","""Bachelors""","""[Excel, Jira]""","""[No Code, Back Office Tools, P…","""[Jewelry]""","""IC""",,"""en-us""","""Greensboro""","""United States""",,"""2023-06-08 15:38:36""","""June 2023"""
"""Electrical & Instrumentation E…","""Chennai""","""Full-Time""",,"""544 FLSmidth""","""Bachelors""","""[Atlas]""","""[Build Tools, Infra Build Tool…","""[]""",,"""""","""en""","""Chennai""","""India""","""2023-05-04 00:00:00""","""2023-05-24 16:49:29""","""June 2023"""


### Cleaning

#### 1. Converting 'job_published_at', 'last_indexed' to Date

In [18]:
whole_df = whole_df.with_columns(
    pl.col("job_published_at", "last_indexed").str.to_datetime().cast(pl.Date)
)

In [19]:
whole_df.head()

job_name,job_location,hours,remote,company_name,education,tags_matched,tag_categories,categories,seniority,comp_est,language,city,country,job_published_at,last_indexed,new
str,str,str,bool,str,str,str,str,str,str,str,str,str,str,date,date,str
"""Senior Data Science Engineer""","""6710a Rockledge Dr suite 400, …","""Full-Time""",True,"""RightEye""",,"""[SQL, NoSQL, AWS, Python]""","""[NoSQL, Data Science Tools, Ia…","""[]""","""Senior IC""",,"""en""","""Bethesda""","""United States""",2023-09-08,2024-03-15,"""April 2024"""
"""Domestic Intruder/Fire alarm e…","""Shaftesbury, United Kingdom""",,False,"""Swann Recruitment""",,"""[]""","""[]""","""[Recruiting & Staffing]""",,,"""en""","""Shaftesbury""","""United Kingdom""",2018-10-03,2024-03-20,"""April 2024"""
"""Software Developer - Product S…","""Remote, Spain""","""Unclear""",True,"""Red Hat""",,"""[Vue.js, GitHub, Bootstrap, Py…","""[Scheduling & Orchestration, T…","""[Security, Software]""",,,"""en-us""",,"""Spain""",2024-03-08,2024-03-30,"""April 2024"""
"""SAP Automation Engineer""","""Hyderabad, India""",,False,"""SQUIRCLE IT CONSULTING SERVICE…",,"""[SAP]""","""[IaaS, Travel and Tourism, FP&…","""[IT, ERP, Consulting, Business…",,"""""","""en""","""Hyderabad""","""India""",2016-10-14,2023-05-29,"""June 2023"""
"""Cleared Armed Security Officer…","""Columbia, MD, US""",,,,"""Some High School""","""[]""","""[]""","""[Security]""","""Unclear Seniority""",,"""en""","""Columbia""","""United States""",2024-02-14,2024-03-17,"""April 2024"""


#### 2. Converting 'tags_matched', 'tag_categories', 'categories' from str to list[str]

In [20]:
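def string_to_nested(df, cols):
    """
    Take a DataFrame and a list of columns holding stringified lists
    and convert those columns to the nested List datatype.

    Note: the word-character pattern splits on every non-word character,
    so "Vue.js" becomes ["Vue", "js"].
    """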
    for col in cols:
        df = df.with_columns(
            pl.col(col).str.extract_all(r"\w+").cast(pl.List(pl.String))
        )
    return df

In [21]:
cols_to_change = ["tags_matched", "tag_categories", "categories"]
whole_df = string_to_nested(whole_df, cols_to_change)

In [22]:
whole_df.head()

job_name,job_location,hours,remote,company_name,education,tags_matched,tag_categories,categories,seniority,comp_est,language,city,country,job_published_at,last_indexed,new
str,str,str,bool,str,str,list[str],list[str],list[str],str,str,str,str,str,date,date,str
"""Senior Data Science Engineer""","""6710a Rockledge Dr suite 400, …","""Full-Time""",True,"""RightEye""",,"[""SQL"", ""NoSQL"", … ""Python""]","[""NoSQL"", ""Data"", … ""Tools""]",[],"""Senior IC""",,"""en""","""Bethesda""","""United States""",2023-09-08,2024-03-15,"""April 2024"""
"""Domestic Intruder/Fire alarm e…","""Shaftesbury, United Kingdom""",,False,"""Swann Recruitment""",,[],[],"[""Recruiting"", ""Staffing""]",,,"""en""","""Shaftesbury""","""United Kingdom""",2018-10-03,2024-03-20,"""April 2024"""
"""Software Developer - Product S…","""Remote, Spain""","""Unclear""",True,"""Red Hat""",,"[""Vue"", ""js"", … ""Git""]","[""Scheduling"", ""Orchestration"", … ""Framework""]","[""Security"", ""Software""]",,,"""en-us""",,"""Spain""",2024-03-08,2024-03-30,"""April 2024"""
"""SAP Automation Engineer""","""Hyderabad, India""",,False,"""SQUIRCLE IT CONSULTING SERVICE…",,"[""SAP""]","[""IaaS"", ""Travel"", … ""SaaS""]","[""IT"", ""ERP"", … ""Intelligence""]",,"""""","""en""","""Hyderabad""","""India""",2016-10-14,2023-05-29,"""June 2023"""
"""Cleared Armed Security Officer…","""Columbia, MD, US""",,,,"""Some High School""",[],[],"[""Security""]","""Unclear Seniority""",,"""en""","""Columbia""","""United States""",2024-02-14,2024-03-17,"""April 2024"""


#### 3. Converting 'comp_est' from str to float

In [23]:
whole_df = whole_df.with_columns(
    pl.col("comp_est").cast(pl.Float64, strict=False).alias("compensation")
)
# strict=False turns values that cannot be parsed, such as empty strings, into nulls
# instead of raising an error.
# An earlier attempt to cast to Int32 failed, most likely because some values are too
# large for 32 bits; Float64 is used here, and the result can be cast to Int64 afterwards
# (polars can handle str -> float -> int).

In [24]:
whole_df.head()

job_name,job_location,hours,remote,company_name,education,tags_matched,tag_categories,categories,seniority,comp_est,language,city,country,job_published_at,last_indexed,new,compensation
str,str,str,bool,str,str,list[str],list[str],list[str],str,str,str,str,str,date,date,str,f64
"""Senior Data Science Engineer""","""6710a Rockledge Dr suite 400, …","""Full-Time""",True,"""RightEye""",,"[""SQL"", ""NoSQL"", … ""Python""]","[""NoSQL"", ""Data"", … ""Tools""]",[],"""Senior IC""",,"""en""","""Bethesda""","""United States""",2023-09-08,2024-03-15,"""April 2024""",
"""Domestic Intruder/Fire alarm e…","""Shaftesbury, United Kingdom""",,False,"""Swann Recruitment""",,[],[],"[""Recruiting"", ""Staffing""]",,,"""en""","""Shaftesbury""","""United Kingdom""",2018-10-03,2024-03-20,"""April 2024""",
"""Software Developer - Product S…","""Remote, Spain""","""Unclear""",True,"""Red Hat""",,"[""Vue"", ""js"", … ""Git""]","[""Scheduling"", ""Orchestration"", … ""Framework""]","[""Security"", ""Software""]",,,"""en-us""",,"""Spain""",2024-03-08,2024-03-30,"""April 2024""",
"""SAP Automation Engineer""","""Hyderabad, India""",,False,"""SQUIRCLE IT CONSULTING SERVICE…",,"[""SAP""]","[""IaaS"", ""Travel"", … ""SaaS""]","[""IT"", ""ERP"", … ""Intelligence""]",,"""""","""en""","""Hyderabad""","""India""",2016-10-14,2023-05-29,"""June 2023""",
"""Cleared Armed Security Officer…","""Columbia, MD, US""",,,,"""Some High School""",[],[],"[""Security""]","""Unclear Seniority""",,"""en""","""Columbia""","""United States""",2024-02-14,2024-03-17,"""April 2024""",


In [25]:
whole_df.filter(
    pl.col("compensation") > 0
).shape  # only 14962 records have compensation data available

(14962, 18)

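#### 4. Language/education/hours/seniority -> pl.Categorical

Language codes are truncated to their two-letter prefix (for example, "en-us" becomes "en"), and nulls in each column are replaced with an "unknown" category before casting. Seniority is then cross-checked against job_name.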

In [26]:
whole_df = whole_df.with_columns(pl.col("language").str.head(2))

In [27]:
whole_df = whole_df.with_columns(
    pl.col("language").fill_null("unknown").cast(pl.Categorical)
)  # introducing the "unknown" category

In [28]:
whole_df = whole_df.with_columns(
    pl.col("education").fill_null("unknown").cast(pl.Categorical)
)  # introducing the "unknown" category

In [29]:
whole_df = whole_df.with_columns(
    pl.col("hours").fill_null("unknown").cast(pl.Categorical)
)  # introducing the "unknown" category

In [30]:
whole_df = whole_df.with_columns(
    pl.col("seniority").fill_null("unknown").cast(pl.Categorical)
)  # introducing the "unknown" category

Checking whether job_name reveals Junior/Intern/Senior roles that the seniority column misses

In [31]:
whole_df

job_name,job_location,hours,remote,company_name,education,tags_matched,tag_categories,categories,seniority,comp_est,language,city,country,job_published_at,last_indexed,new,compensation
str,str,cat,bool,str,cat,list[str],list[str],list[str],cat,str,cat,str,str,date,date,str,f64
"""Senior Data Science Engineer""","""6710a Rockledge Dr suite 400, …","""Full-Time""",true,"""RightEye""","""unknown""","[""SQL"", ""NoSQL"", … ""Python""]","[""NoSQL"", ""Data"", … ""Tools""]",[],"""Senior IC""",,"""en""","""Bethesda""","""United States""",2023-09-08,2024-03-15,"""April 2024""",
"""Domestic Intruder/Fire alarm e…","""Shaftesbury, United Kingdom""","""unknown""",false,"""Swann Recruitment""","""unknown""",[],[],"[""Recruiting"", ""Staffing""]","""unknown""",,"""en""","""Shaftesbury""","""United Kingdom""",2018-10-03,2024-03-20,"""April 2024""",
"""Software Developer - Product S…","""Remote, Spain""","""Unclear""",true,"""Red Hat""","""unknown""","[""Vue"", ""js"", … ""Git""]","[""Scheduling"", ""Orchestration"", … ""Framework""]","[""Security"", ""Software""]","""unknown""",,"""en""",,"""Spain""",2024-03-08,2024-03-30,"""April 2024""",
"""SAP Automation Engineer""","""Hyderabad, India""","""unknown""",false,"""SQUIRCLE IT CONSULTING SERVICE…","""unknown""","[""SAP""]","[""IaaS"", ""Travel"", … ""SaaS""]","[""IT"", ""ERP"", … ""Intelligence""]","""unknown""","""""","""en""","""Hyderabad""","""India""",2016-10-14,2023-05-29,"""June 2023""",
"""Cleared Armed Security Officer…","""Columbia, MD, US""","""unknown""",,,"""Some High School""",[],[],"[""Security""]","""Unclear Seniority""",,"""en""","""Columbia""","""United States""",2024-02-14,2024-03-17,"""April 2024""",
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
"""Security Tech Lead""",,"""Full-Time""",,"""AArete Technosoft""","""unknown""","[""SonarQube"", ""Git"", … ""Jenkins""]","[""Programming"", ""Languages"", … ""DevOps""]","[""Security""]","""Staff IC""","""""","""en""","""Pune City""","""India""",2022-04-06,2023-06-04,"""June 2023""",
"""Senior Quality Program Manager…","""Oregon - Remote""","""Full-Time""",true,"""100-SFDC""","""unknown""","[""Salesforce"", ""Ranger"", ""GitHub""]","[""CRM"", ""Continuous"", … ""Customers""]","[""Philanthropy"", ""CRM"", … ""Nonprofits""]","""Manager""","""""","""en""",,"""United States""",2023-05-17,2023-05-19,"""June 2023""",
"""Product Manager - Layered | Je…","""Greensboro, NC""","""unknown""",,"""Market America""","""Bachelors""","[""Excel"", ""Jira""]","[""No"", ""Code"", … ""Treasury""]","[""Jewelry""]","""IC""",,"""en""","""Greensboro""","""United States""",,2023-06-08,"""June 2023""",
"""Electrical & Instrumentation E…","""Chennai""","""Full-Time""",,"""544 FLSmidth""","""Bachelors""","[""Atlas""]","[""Build"", ""Tools"", … ""DevOps""]",[],"""unknown""","""""","""en""","""Chennai""","""India""",2023-05-04,2023-05-24,"""June 2023""",


In [32]:
whole_df.filter(
    (pl.col("job_name").str.contains("(?i)intern")) & (pl.col("seniority") != "Intern")
)

job_name,job_location,hours,remote,company_name,education,tags_matched,tag_categories,categories,seniority,comp_est,language,city,country,job_published_at,last_indexed,new,compensation
str,str,cat,bool,str,cat,list[str],list[str],list[str],cat,str,cat,str,str,date,date,str,f64
"""Senior Project Engineer-Cruise…","""USA - Saint Charles, MO""","""Full-Time""",,"""The Boeing Company""","""Bachelors""",[],[],[],"""Senior IC""",,"""en""","""Saint Charles""","""United States""",2023-05-23,2023-06-08,"""June 2023""",
"""Internal Audit Technology- Se…","""Dallas, Texas""","""Intern""",,"""HF Sinclair Corporation""","""Bachelors""","[""SAP"", ""Power"", ""BI""]","[""Travel"", ""Enterprise"", … ""IT""]","[""Oil"", ""Gas""]","""Senior IC""",,"""en""","""Dallas""","""United States""",2024-03-11,2024-03-15,"""April 2024""",
"""Cybersecurity Analyst (Intern)""","""900 Innovators Way, Simi Valle…","""Full-Time""",,"""AeroVironment""","""unknown""","[""Microsoft"", ""AWS""]","[""Compute"", ""IaaS"", ""PaaS""]","[""Cybersecurity"", ""Security""]","""IC""","""27.5""","""en""","""Simi Valley""","""United States""",2024-03-04,2024-04-01,"""April 2024""",27.5
"""Program Manager, International…","""New York, New York, 10006""","""Temp""",,"""Temporary""","""unknown""",[],[],"[""Education""]","""Manager""","""67500.0""","""en""","""New York""","""United States""",2023-04-10,2023-05-31,"""June 2023""",67500.0
"""Aviation Security Technician -…","""DEN CONA East Lvl 04""","""Full-Time""",,"""City and County of Denver""","""High School""","[""Microsoft""]",[],"[""Security"", ""Airlines"", ""Aerospace""]","""IC""","""27.700000000000003""","""en""",,"""United States""",2023-05-16,2023-05-18,"""June 2023""",27.7
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
"""Data Analyst (Digital Analytic…","""London, UK""","""Full-Time""",false,"""Utility Warehouse""","""unknown""","[""Google"", ""BigQuery"", … ""Python""]","[""Analytics"", ""Software"", … ""Warehouses""]","[""Support"", ""Machine"", … ""Education""]","""IC""","""""","""en""","""London""","""United Kingdom""",2023-05-24,2023-05-29,"""June 2023""",
"""Product Manager Intern (Techni…",,"""Intern""",,"""ETC""","""High School""",[],[],"[""Movies"", ""Film"", … ""Planning""]","""IC""",,"""en""",,,,2024-03-29,"""April 2024""",
"""Psychiatrist-ACT Adult SPRG 47…",,"""Part-Time""",,"""MaineHealth Physician Recruitm…","""unknown""",[],[],"[""Mental"", ""Healthcare"", … ""Hospital""]","""Staff IC""",,"""en""",,,,2024-03-25,"""April 2024""",
"""Expression of Interest: Softwa…","""Remote""","""Full-Time""",true,"""Fingerprint For Success""","""unknown""",[],[],"[""Software""]","""Junior IC""",,"""en""","""San Antonio""","""United States""",2024-03-27,2024-04-01,"""April 2024""",


In [33]:
senior_job = whole_df.filter(
    (pl.col("job_name").str.contains("(?i)senior"))
    & (pl.col("seniority") != "Senior IC")
)

senior_job

job_name,job_location,hours,remote,company_name,education,tags_matched,tag_categories,categories,seniority,comp_est,language,city,country,job_published_at,last_indexed,new,compensation
str,str,cat,bool,str,cat,list[str],list[str],list[str],cat,str,cat,str,str,date,date,str,f64
"""Senior Principal Engineer RF S…","""United States-California-Point…","""Full-Time""",,"""CORP-Corporate Office""","""Bachelors""","[""Linux"", ""Python"", … ""XML""]","[""OSS"", ""OS"", … ""Languages""]","[""Business"", ""Software""]","""Staff IC""","""109000.00""","""en""",,"""United States""",2024-03-22,2024-03-28,"""April 2024""",109000.0
"""(Senior) Backend Engineer Orde…","""MediaMarktSaturn Technology""","""unknown""",,"""MediaMarktSaturn""","""unknown""","[""Kafka"", ""NoSQL"", … ""Persona""]","[""Big"", ""Data"", … ""Languages""]",[],"""unknown""","""""","""en""","""Ingolstadt""","""Germany""",2023-03-03,2023-06-03,"""June 2023""",
"""(Senior) Software Architect (a…","""- Stuttgart, Baden-Württemberg…","""Full-Time""",true,"""Almato AG""","""unknown""","[""Go"", ""Java"", … ""Python""]","[""Programming"", ""Languages"", … ""IaaS""]","[""Architecture"", ""Planning"", ""Software""]","""unknown""",,"""de""","""Stuttgart""","""Germany""",2024-03-13,2024-04-01,"""April 2024""",
"""(Senior) Product Manager""","""Hong Kong""","""unknown""",,"""Novartis""","""unknown""",[],[],"[""Healthcare""]","""IC""","""""","""ja""","""Hong Kong""","""Hong Kong""",,2023-05-23,"""June 2023""",
"""Senior Data Engineer""","""Remote""","""Full-Time""",true,"""Sambasafety""","""unknown""","[""Snowflake"", ""Fivetran"", … ""AWS""]","[""Business"", ""Intelligence"", … ""Computing""]","[""Writing"", ""Editing"", … ""Software""]","""Manager""","""125000.00""","""en""","""Denver""","""United States""",2023-05-08,2023-06-09,"""June 2023""",125000.0
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
"""ICT Application Engineer Senio…","""Lenggstrasse 31 Zürich, ZH, 80…","""Full-Time""",,"""Psychiatrische Universitätskli…","""Bachelors""",[],[],"[""Mental"", ""Healthcare""]","""unknown""","""""","""de""","""Zürich""","""Switzerland""",2023-03-23,2023-06-03,"""June 2023""",
"""Senior Engineering Manager - D…","""Germany. Our headquarters are …","""Full-Time""",true,"""ResearchGate""","""unknown""","[""SQL"", ""Python""]","[""OSS"", ""Stat"", … ""Languages""]","[""Machine"", ""Learning""]","""Manager""","""""","""en""","""Germany. Our headquarters are …","""Germany""",2023-04-27,2023-06-06,"""June 2023""",
"""Senior Technical Program Manag…","""New York, NY · Information Tec…","""unknown""",,"""Talent Hunt Group""","""Bachelors""","[""Excel""]","[""Back"", ""Office"", … ""Code""]",[],"""Manager""",,"""en""","""New York""","""United States""",2023-09-28,2024-03-26,"""April 2024""",
"""Senior Customer Project/ Progr…","""Las Vegas, NV - 1700 Vegas Dr""","""Full-Time""",,"""CCI CCI Corporate Services""","""Bachelors""","[""Windows"", ""Excel""]","[""Enterprise"", ""Customers"", … ""Treasury""]","[""Business""]","""Manager""",,"""en""","""Las Vegas""","""United States""",2024-03-26,2024-03-26,"""April 2024""",


In [34]:
whole_df.select(pl.col("seniority").value_counts())

seniority
struct[2]
"{""Senior IC"",17840}"
"{""unknown"",53578}"
"{""Unclear Seniority"",4593}"
"{""Exec"",465}"
"{""Contract"",1106}"
…
"{""Director"",874}"
"{""Senior Manager"",273}"
"{""Chief"",557}"
"{""Founder"",3}"


##### Next step: build a set of tags by taking the set union over the list[str] columns

In [35]:
!pip install hvplot

In [36]:
import hvplot.polars

In [37]:
# whole_df.group_by('seniority').agg(pl.col('country').top_k_by('compensation', k=2))

## Analysis

### 1. How many Junior/Intern offers are there in each dataset?

In [38]:
whole_df.group_by("new", "seniority").agg(pl.col("seniority").count().alias("count"))

new,seniority,count
str,cat,u32
"""June 2023""","""Staff IC""",1793
"""April 2024""","""IC""",3246
"""June 2023""","""Founder""",2
"""June 2023""","""Contract""",624
"""June 2023""","""Intern""",752
…,…,…
"""April 2024""","""Intern""",1230
"""June 2023""","""Senior Exec""",2
"""April 2024""","""Senior IC""",8608
"""June 2023""","""unknown""",26781


In [39]:
seniority_groups = whole_df.group_by("seniority", "new").agg(
    pl.col("seniority").count().alias("count")
)
seniority_groups = seniority_groups.select(pl.all().sort_by("count"))

In [40]:
seniority_groups = seniority_groups.with_columns(
    (pl.col("count") / 500).alias("percent of jobs")
)
# 500 = 50_000 / 100

In [41]:
seniority_groups

seniority,new,count,percent of jobs
cat,str,u32,f64
"""Founder""","""April 2024""",1,0.002
"""Founder""","""June 2023""",2,0.004
"""Senior Exec""","""June 2023""",2,0.004
"""Senior Exec""","""April 2024""",6,0.012
"""Senior Manager""","""April 2024""",120,0.24
…,…,…,…
"""Manager""","""June 2023""",3413,6.826
"""Senior IC""","""April 2024""",8608,17.216
"""Senior IC""","""June 2023""",9232,18.464
"""unknown""","""June 2023""",26781,53.562


In [42]:
seniority_groups.hvplot.barh(
    x="seniority",
    y="count",
    color="new",
    rot=90,
    title="Number of Job Offers per Seniority",
    alpha=0.3,
    colorbar=True,
    clabel="count",
    cmap="prism",
)
# , color="new", subplots=True

In [43]:
entry_level = seniority_groups.filter(
    (pl.col("seniority") == "Junior IC") | (pl.col("seniority") == "Intern")
)
entry_level

seniority,new,count,percent of jobs
cat,str,u32,f64
"""Intern""","""June 2023""",752,1.504
"""Junior IC""","""June 2023""",837,1.674
"""Junior IC""","""April 2024""",1059,2.118
"""Intern""","""April 2024""",1230,2.46


In [44]:
entry_level.hvplot.bar(
    x="seniority",
    y="count",
    color="new",
    rot=90,
    title="Number of Job Offers per Seniority",
    alpha=0.3,
    colorbar=True,
    clabel="count",
    cmap="gnuplot",
)

In June 2023:
Junior job offers were 1.674% of the 50,000 total
Internship offers were 1.504% of the 50,000 total

In April 2024:
Junior job offers were 2.118% of the 50,000 total
Internship offers were 2.46% of the 50,000 total

Entry-level jobs (junior + intern) were 3.178% of all offers in June 2023
Entry-level jobs (junior + intern) were 4.578% of all offers in April 2024
The share of entry-level jobs has risen by 1.4 percentage points, which is about 44% in relative terms
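
The 44% figure can be checked by hand from the counts in the `entry_level` table. The snippet below is a minimal sketch that copies those counts into a plain dictionary (the names `entry_counts` and `TOTAL_PER_SNAPSHOT` are only illustrative) and recomputes both the shares and the relative change.

```python
# Counts copied from the entry_level table above; each snapshot has 50,000 rows.
TOTAL_PER_SNAPSHOT = 50_000
entry_counts = {
    "June 2023": {"Junior IC": 837, "Intern": 752},
    "April 2024": {"Junior IC": 1059, "Intern": 1230},
}

# Share of entry-level offers in each snapshot, in percent.
shares = {
    snapshot: sum(counts.values()) / TOTAL_PER_SNAPSHOT * 100
    for snapshot, counts in entry_counts.items()
}

# Relative change between the two snapshots, in percent.
relative_change = (shares["April 2024"] / shares["June 2023"] - 1) * 100

print(shares)
print(f"Relative change: {relative_change:.0f}%")
```

Because both snapshots have the same number of rows, the relative change in share is the same as the relative change in raw counts (2289 vs. 1589 offers).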

In [45]:
known_seniority = (
    seniority_groups.filter(
        (pl.col("seniority") != "unknown")
        & (pl.col("seniority") != "Unclear Seniority")
    )
    .group_by("new")
    .sum()
)

In [46]:
known_seniority

new,seniority,count,percent of jobs
str,cat,u32,f64
"""April 2024""",,20773,41.546
"""June 2023""",,21055,42.11


In [47]:
perc_of_known_seniority = entry_level.join(known_seniority, on="new", how="left")

In [48]:
perc_of_known_seniority = perc_of_known_seniority.with_columns(
    (pl.col("count") / pl.col("count_right") * 100).alias("percent of seniority")
)
perc_of_known_seniority

seniority,new,count,percent of jobs,seniority_right,count_right,percent of jobs_right,percent of seniority
cat,str,u32,f64,cat,u32,f64,f64
"""Intern""","""June 2023""",752,1.504,,21055,42.11,3.571598
"""Junior IC""","""June 2023""",837,1.674,,21055,42.11,3.975303
"""Junior IC""","""April 2024""",1059,2.118,,20773,41.546,5.097964
"""Intern""","""April 2024""",1230,2.46,,20773,41.546,5.921148


In [49]:
mean_comp_seniority = whole_df.group_by("new", "seniority").agg(
    pl.col("compensation").mean().alias("mean_comp_seniority")
)

In [128]:
mean_comp_seniority

new,seniority,mean_comp_seniority
str,cat,f64
"""June 2023""","""Chief""",4.0817e9
"""April 2024""","""Intern""",40958.337174
"""June 2023""","""Exec""",3.4617e9
"""April 2024""","""Manager""",5.1606e8
"""June 2023""","""Senior Exec""",
…,…,…
"""April 2024""","""Senior IC""",1.2428e9
"""April 2024""","""unknown""",4.5190e8
"""April 2024""","""Contract""",131988.499375
"""June 2023""","""Contract""",8.0126e6


In [133]:
mean_comp_seniority = mean_comp_seniority.drop_nulls()
mean_comp_seniority = mean_comp_seniority.select(
    pl.all().sort_by("mean_comp_seniority", descending=True)
)

In [134]:
mean_comp_seniority

new,seniority,mean_comp_seniority
str,cat,f64
"""June 2023""","""Chief""",4.0817e9
"""June 2023""","""Exec""",3.4617e9
"""June 2023""","""Director""",1.8520e9
"""April 2024""","""Staff IC""",1.7029e9
"""June 2023""","""unknown""",1.3595e9
…,…,…
"""April 2024""","""Exec""",154049.461562
"""April 2024""","""Senior Manager""",143160.8375
"""June 2023""","""Senior Manager""",140226.263158
"""April 2024""","""Contract""",131988.499375


In [135]:
mean_comp_seniority.hvplot.barh(
    x="seniority",
    y="mean_comp_seniority",
    color="new",
    rot=90,
    title="Mean compensation per Seniority",
    alpha=0.3,
    colorbar=True,
    clabel="mean compensation",
    cmap="prism",
)

In [51]:
junior_comp = whole_df.filter(
    (pl.col("seniority") == "Junior IC") & (pl.col("compensation") > 0)
)
junior_comp.group_by("new").agg(pl.col("compensation").median())

new,compensation
str,f64
"""June 2023""",60000.0
"""April 2024""",240000.0


In [52]:
junior_comp

job_name,job_location,hours,remote,company_name,education,tags_matched,tag_categories,categories,seniority,comp_est,language,city,country,job_published_at,last_indexed,new,compensation
str,str,cat,bool,str,cat,list[str],list[str],list[str],cat,str,cat,str,str,date,date,str,f64
"""Technical Product Manager-cctv""","""Bengaluru""","""Full-Time""",,"""Nexilis Electronics India""","""unknown""",[],[],"[""India"", ""Related""]","""Junior IC""","""300000.00""","""en""","""Bengaluru""","""India""",2023-11-18,2024-03-30,"""April 2024""",300000.0
"""React""","""Mohali""","""Full-Time""",,"""RichestSoft""","""unknown""","[""HTML5"", ""Socket"", … ""SQL""]","[""Full"", ""Stack"", … ""Hosting""]","[""Design"", ""Mobile"", … ""Design""]","""Junior IC""","""285000.00""","""en""","""Mohali""","""India""",2022-09-26,2024-03-29,"""April 2024""",285000.0
"""Graduate / Junior Electrical D…","""Walton Road""","""Student""",,"""Premier Group Recruitment""","""unknown""",,,"[""Recruiting"", ""Staffing"", ""Design""]","""Junior IC""","""35000.00""","""en""",,,2023-09-06,2024-04-01,"""April 2024""",35000.0
"""Application Engineer""","""Chennai""","""Full-Time""",,"""DECELER""","""unknown""",[],[],[],"""Junior IC""","""150000.00""","""en""","""Chennai""","""India""",2023-11-27,2024-03-22,"""April 2024""",150000.0
"""Android Developer""","""ahmedabad""","""Full-Time""",,"""Career Fair Services & Technol…","""unknown""","[""Java"", ""Flutter"", … ""Android""]","[""OSS"", ""Languages"", … ""Tools""]","[""Android"", ""Software""]","""Junior IC""","""625000.00""","""en""","""Ahmedabad""","""India""",2022-08-07,2024-03-27,"""April 2024""",625000.0
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
"""ERP Operation Support Engineer…","""New York, NY""","""Full-Time""",,"""Cinter Career Services""","""unknown""","[""Microsoft""]",[],"[""England"", ""Related"", ""ERP""]","""Junior IC""","""60000.0""","""en""","""New York""","""United States""",2023-04-26,2023-06-09,"""June 2023""",60000.0
"""Commissioning Engineer - PLC J…",,"""Commission""",,"""KNAPP""","""unknown""","[""Siemens""]","[""Industrial"", ""Applications"", ""Industry""]",[],"""Junior IC""","""1000.00""","""en""",,,2023-09-20,2024-03-16,"""April 2024""",1000.0
"""Entry Level Software Developer""","""Manchester Rd, Ballwin, MO""","""Full-Time""",false,"""LaunchCode""","""unknown""","[""Angular"", ""js"", … ""Java""]","[""Cloud"", ""Native"", … ""Mobile""]","[""Software""]","""Junior IC""","""40000.00""","""en""","""Ballwin""","""United States""",2022-08-25,2023-05-31,"""June 2023""",40000.0
"""Junior Windows Infrastructure …","""Herndon, Virginia, United Stat…","""Full-Time""",,"""Peraton""","""unknown""","[""Citrix"", ""Active"", … ""Centrify""]","[""Password"", ""Managers"", … ""IT""]",[],"""Junior IC""","""86000.00""","""en""","""Herndon""","""United States""",2023-04-05,2023-06-03,"""June 2023""",86000.0


In [53]:
intern_comp = whole_df.filter(
    (pl.col("seniority") == "Intern") & (pl.col("compensation") > 0)
)
intern_comp.group_by("new").agg(pl.col("compensation").median())

new,compensation
str,f64
"""April 2024""",32500.0
"""June 2023""",39.0


In [54]:
intern_comp

job_name,job_location,hours,remote,company_name,education,tags_matched,tag_categories,categories,seniority,comp_est,language,city,country,job_published_at,last_indexed,new,compensation
str,str,cat,bool,str,cat,list[str],list[str],list[str],cat,str,cat,str,str,date,date,str,f64
"""Cybersecurity Engineering Inte…","""KUS51558 Austin (KUS51558) Con…","""Full-Time""",,"""Kyndryl""","""unknown""","[""AWS""]","[""IaaS"", ""Compute"", ""PaaS""]","[""Security"", ""Cybersecurity""]","""Intern""","""60000.00""","""en""","""Austin""","""United States""",2024-03-25,2024-03-28,"""April 2024""",60000.0
"""TN Extension Internship Progra…","""US - Tennessee - knoxville""","""Full-Time""",,"""Ext""","""Bachelors""","[""Accessibility""]",[],"[""Higher"", ""Education"", … ""Education""]","""Intern""","""6840.00""","""en""",,"""United States""",2024-02-05,2024-03-24,"""April 2024""",6840.0
"""Android Developer""","""Delhi (NCR)""","""Contract""",false,"""FullThrottle Labs testcdsc""","""unknown""","[""Android"", ""Android"", … ""JSON""]","[""Mobile"", ""Languages"", … ""OSS""]","[""Android"", ""Software""]","""Intern""","""100000.0""","""en""","""Delhi""","""India""",2023-02-06,2024-03-27,"""April 2024""",100000.0
"""Backend Engineering Intern - F…","""San Mateo, California""","""Intern""",,"""Verkada""","""Bachelors""","[""Docker"", ""Terraform"", … ""Go""]","[""Container"", ""Management"", … ""Messaging""]","[""Software""]","""Intern""","""55.00""","""en""","""San Mateo""","""United States""",2023-05-23,2023-05-23,"""June 2023""",55.0
"""Trainee Water Hygiene Engineer""",,"""Trainee""",false,,"""unknown""",[],[],[],"""Intern""","""19422.00""","""en""","""Milton Keynes""","""United Kingdom""",2023-05-18,2023-05-26,"""June 2023""",19422.0
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
"""Software Engineering Intern""","""USA-MI-Ann Arbor-KLA""","""Full-Time""",,"""KLA Corporation""","""Doctorate""",[],[],"[""Software""]","""Intern""","""77000.00""","""en""","""Ann Arbor""","""United States""",2023-12-13,2024-04-01,"""April 2024""",77000.0
"""Engineering Intern""","""Beaverton, OR 97006 US (Primar…","""Full-Time""",,"""Corbin believes that professio…","""Bachelors""",[],[],[],"""Intern""","""3000.00""","""en""","""Beaverton""","""United States""",,2024-03-24,"""April 2024""",3000.0
"""Intern Software Engineer | Nov…","""Christchurch, New Zealand""","""Full-Time""",,"""Partly""","""unknown""",[],[],"[""Software""]","""Intern""","""1.00""","""en""","""Christchurch""","""New Zealand""",2023-05-07,2023-05-30,"""June 2023""",1.0
"""Summer 2024 Intern - Engineeri…","""St James MN Plant Corporate""","""Full-Time""",,"""Smithfield Support Services Co…","""unknown""","[""Microsoft""]",[],"[""Support""]","""Intern""","""40000.00""","""en""",,"""United States""",2024-02-05,2024-03-25,"""April 2024""",40000.0
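
The June 2023 intern median of 39.0 and sample values such as 55.0 and 1.0 suggest that `compensation` may mix hourly rates with annual salaries. This is only a guess from the rows shown above, not something the data confirms. A minimal sketch of how such values could be separated before comparing medians, assuming a hypothetical cutoff of 1,000 (the cutoff and the variable names are illustrative, not derived from the dataset):

```python
from statistics import median

# June 2023 intern compensation values visible in the sample above.
june_2023_sample = [55.0, 19422.0, 1.0]

# Hypothetical threshold: values below it are treated as suspected hourly rates.
HOURLY_CUTOFF = 1_000

suspected_hourly = [v for v in june_2023_sample if v < HOURLY_CUTOFF]
likely_annual = [v for v in june_2023_sample if v >= HOURLY_CUTOFF]

print("median of all values:", median(june_2023_sample))
print("suspected hourly:", suspected_hourly)
print("likely annual:", likely_annual)
```

With a split like this, the medians of the two groups could be reported separately instead of being blended into one number.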


In [55]:
from datetime import datetime

In [56]:
clean_timeline = whole_df.filter(
    pl.col("job_published_at").is_between(datetime(2020, 12, 31), datetime(2024, 4, 2)),
)

In [57]:
timeline = clean_timeline.group_by("job_published_at", "new").agg(
    pl.col("job_published_at").count().alias("job_count")
)
timeline

job_published_at,new,job_count
date,str,u32
2023-05-11,"""April 2024""",16
2023-03-02,"""June 2023""",155
2022-06-17,"""June 2023""",14
2021-06-08,"""June 2023""",30
2021-01-18,"""June 2023""",2
…,…,…
2021-12-12,"""April 2024""",1
2023-05-26,"""April 2024""",21
2022-12-13,"""April 2024""",9
2023-11-11,"""April 2024""",6


In [58]:
pivot_timeline = timeline.pivot(
    index="job_published_at", columns="new", values="job_count"
)

In [59]:
%pip install selenium

DEPRECATION: geopolars 0.1.0a4 has a non-standard dependency specifier pyarrow>=4.0.*. pip 24.1 will enforce this behaviour change. A possible replacement is to upgrade to a newer version of geopolars or contact the author to suggest that they release a version with a conforming dependency specifiers. Discussion can be found at https://github.com/pypa/pip/issues/12063
Note: you may need to restart the kernel to use updated packages.


In [60]:
%pip install phantomjs

DEPRECATION: geopolars 0.1.0a4 has a non-standard dependency specifier pyarrow>=4.0.*. pip 24.1 will enforce this behaviour change. A possible replacement is to upgrade to a newer version of geopolars or contact the author to suggest that they release a version with a conforming dependency specifiers. Discussion can be found at https://github.com/pypa/pip/issues/12063
Note: you may need to restart the kernel to use updated packages.


In [62]:
plot_tl = pivot_timeline.hvplot.line(
    x="job_published_at",
    y=["June 2023", "April 2024"],
    title="Number of New Job Offers Posted per Day",
)

In [63]:
hvplot.save(plot_tl, "timeline.png")



In [64]:
timeline.hvplot.line(x="job_published_at", y="job_count", color="new")

#### 4. Identify dirty categories


In [65]:
whole_df = whole_df.with_columns(pl.col("job_name").str.to_lowercase())

In [66]:
whole_df.select(pl.col("job_name").value_counts(sort=True)).transpose()

column_0,column_1,column_2,column_3,column_4,column_5,column_6,column_7,column_8,column_9,column_10,column_11,column_12,column_13,column_14,column_15,column_16,column_17,column_18,column_19,column_20,column_21,column_22,column_23,column_24,column_25,column_26,column_27,column_28,column_29,column_30,column_31,column_32,column_33,column_34,column_35,column_36,…,column_65696,column_65697,column_65698,column_65699,column_65700,column_65701,column_65702,column_65703,column_65704,column_65705,column_65706,column_65707,column_65708,column_65709,column_65710,column_65711,column_65712,column_65713,column_65714,column_65715,column_65716,column_65717,column_65718,column_65719,column_65720,column_65721,column_65722,column_65723,column_65724,column_65725,column_65726,column_65727,column_65728,column_65729,column_65730,column_65731,column_65732
struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],…,struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2]
"{""software engineer"",814}","{""senior software engineer"",615}","{""product manager"",462}","{""devops engineer"",398}","{""data engineer"",378}","{""project engineer"",373}","{""security officer"",361}","{""electrical engineer"",354}","{""program manager"",342}","{""data analyst"",327}","{""mechanical engineer"",277}","{""full stack developer"",276}","{""software developer"",274}","{""data scientist"",247}","{""systems engineer"",236}","{""network engineer"",235}","{""quality engineer"",235}","{""security guard"",232}","{""retail front end supervisor"",202}","{""process engineer"",196}","{""manufacturing engineer"",194}","{""senior data engineer"",194}","{""engineering manager"",192}","{""application developer: cloud fullstack"",184}","{""senior devops engineer"",182}","{""sales engineer"",164}","{""senior product manager"",161}","{""senior software developer"",158}","{""technical writer"",155}","{""field service engineer"",153}","{""site reliability engineer"",147}","{""product owner"",146}","{""backend developer"",144}","{""android developer"",144}","{""ios developer"",140}","{""civil engineer"",131}","{""engineer"",128}",…,"{""sales engineer - wuxi (38783)"",1}","{""solution engineer (技術営業) ー kong japan"",1}","{""senior civil engineer - roads (273303)"",1}","{""lead software engineer - cloud infrastructure"",1}","{""loads & dynamics analyst mechanical engineer - mid career"",1}","{""senior software engineer mlops - london or remote uk"",1}","{""systems engineer - seta"",1}","{""senior netsuite engineer, based in da nang"",1}","{""data analyst - level 2"",1}","{""solutions architect, rbs section (18783)"",1}","{""required ios developer"",1}","{""assistant resident engineer land use"",1}","{""structural engineer - tsi"",1}","{""nap - maintenance engineer"",1}","{""itar network engineer"",1}","{""vlocity lead engineer"",1}","{""fullstack tech lead c#.net and angular"",1}","{""led optical engineer intern- summer 2024 (63276)"",1}","{""senior data engineer 
(relocate to shanghai, beijing or singapore)"",1}","{""senior customer project/ program manager"",1}","{""systems engineer, all levels (future)"",1}","{""product development engineer undergraduate intern - manufacturing and product engineering"",1}","{""aerospace test engineer"",1}","{""export control and related border security (exbs) program manager"",1}","{""desktop support engineer - uk/london, white city"",1}","{""civil / la engineering technician i"",1}","{""owner software product"",1}","{""software dev engineer, team_pasquale_demaio"",1}","{""cluster security manager, hyd, dc security apjc"",1}","{""commercial engineering program management (epm) - project management office (pmo)"",1}","{""senior .net software engineer - biotech instrumentation (jp10677ssf)"",1}","{""officer, physical security"",1}","{""support engineer (ms dynamics)"",1}","{""security tech lead"",1}","{""senior quality program manager, trailhead"",1}","{""product manager - layered | jewelry"",1}","{""engineer, backend - mox"",1}"


In [67]:
job_names = (
    whole_df.group_by("job_name")
    .agg(pl.col("job_name").count().alias("count"))
    .sort("count", descending=True)
)

In [68]:
job_pop = job_names.filter(pl.col("count") > 50)

In [69]:
job_pop.transpose()

column_0,column_1,column_2,column_3,column_4,column_5,column_6,column_7,column_8,column_9,column_10,column_11,column_12,column_13,column_14,column_15,column_16,column_17,column_18,column_19,column_20,column_21,column_22,column_23,column_24,column_25,column_26,column_27,column_28,column_29,column_30,column_31,column_32,column_33,column_34,column_35,column_36,…,column_64,column_65,column_66,column_67,column_68,column_69,column_70,column_71,column_72,column_73,column_74,column_75,column_76,column_77,column_78,column_79,column_80,column_81,column_82,column_83,column_84,column_85,column_86,column_87,column_88,column_89,column_90,column_91,column_92,column_93,column_94,column_95,column_96,column_97,column_98,column_99,column_100
str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,…,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str
"""software engineer""","""senior software engineer""","""product manager""","""devops engineer""","""data engineer""","""project engineer""","""security officer""","""electrical engineer""","""program manager""","""data analyst""","""mechanical engineer""","""full stack developer""","""software developer""","""data scientist""","""systems engineer""","""quality engineer""","""network engineer""","""security guard""","""retail front end supervisor""","""process engineer""","""manufacturing engineer""","""senior data engineer""","""engineering manager""","""application developer: cloud f…","""senior devops engineer""","""sales engineer""","""senior product manager""","""senior software developer""","""technical writer""","""field service engineer""","""site reliability engineer""","""product owner""","""backend developer""","""android developer""","""ios developer""","""civil engineer""","""qa engineer""",…,"""package consultant: sap cloud …","""engineering technician""","""controls engineer""","""quality assurance engineer""","""software engineer ii""","""senior engineer""","""machine learning engineer""","""chief engineer""","""application engineer""","""cloud engineer""","""senior full stack developer""","""staff software engineer""","""system engineer""","""embedded software engineer""","""software architect""","""software de recrutamento e sel…","""senior program manager""","""product engineer""","""senior structural engineer""","""security engineer""","""qa automation engineer""","""industrial engineer""","""service engineer""","""engineer ii""","""application developer: azure c…","""senior full stack engineer""","""production engineer""","""engineering intern""","""unarmed security officer""","""senior backend developer""","""software development engineer""","""big data engineer""","""lead engineer""","""solutions engineer""","""software engineer iii""","""technical product manager""","""electrical design engineer"""
"""814""","""615""","""462""","""398""","""378""","""373""","""361""","""354""","""342""","""327""","""277""","""276""","""274""","""247""","""236""","""235""","""235""","""232""","""202""","""196""","""194""","""194""","""192""","""184""","""182""","""164""","""161""","""158""","""155""","""153""","""147""","""146""","""144""","""144""","""140""","""131""","""128""",…,"""76""","""76""","""76""","""76""","""75""","""74""","""73""","""70""","""68""","""68""","""67""","""67""","""66""","""65""","""64""","""64""","""62""","""62""","""62""","""61""","""60""","""59""","""58""","""58""","""57""","""56""","""56""","""56""","""55""","""55""","""53""","""52""","""52""","""52""","""51""","""51""","""51"""


In [70]:
choices = job_pop.select(pl.col("job_name"))
choices.dtypes

[String]

In [71]:
choices

job_name
str
"""software engineer"""
"""senior software engineer"""
"""product manager"""
"""devops engineer"""
"""data engineer"""
…
"""lead engineer"""
"""solutions engineer"""
"""software engineer iii"""
"""technical product manager"""


In [72]:
whole_df.select(pl.col("company_name").value_counts(sort=True))

company_name
struct[2]
"{null,6266}"
"{""IBM"",2683}"
"{""Allied Universal"",1057}"
"{""CLBPTS"",668}"
"{""Bosch Group"",533}"
…
"{""STARK Deutschland"",1}"
"{""Smart-One Solutions"",1}"
"{""ADSIPL - Telangana - F02"",1}"
"{""Ollang"",1}"


In [73]:
(
    whole_df.group_by("company_name")
    .agg(pl.col("company_name").count().alias("count"))
    .filter(pl.col("count") > 1)
    .sort("count", descending=True)
)

company_name,count
str,u32
"""IBM""",2683
"""Allied Universal""",1057
"""CLBPTS""",668
"""Bosch Group""",533
"""Schneider Electric""",397
…,…
"""TSYS Card Tech Services India …",2
"""Alesig Consulting""",2
"""WithersRavenel""",2
"""Executive Director""",2


Let's build a list of the most common `job_name` values and then fuzzy-match the remaining, rarer titles against it.
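
To make the idea concrete, here is a minimal sketch of fuzzy matching with Python's standard-library `difflib`. It is not necessarily the approach used later in this notebook. The helper `match_title`, the short `canonical_titles` list, and the 0.8 cutoff are all assumptions chosen for illustration.

```python
from difflib import get_close_matches

# A few frequent titles taken from the value counts above.
canonical_titles = [
    "software engineer",
    "senior software engineer",
    "data engineer",
    "devops engineer",
]


def match_title(raw_title, choices, cutoff=0.8):
    """Return the closest canonical title, or None if nothing is similar enough."""
    matches = get_close_matches(raw_title.lower(), choices, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# A misspelled title is mapped to its canonical form.
print(match_title("Sofware Engineer", canonical_titles))
# An unrelated title has no match above the cutoff.
print(match_title("psychiatrist", canonical_titles))
```

The cutoff controls how aggressive the matching is: a lower value merges more variants but risks collapsing genuinely different roles into one category.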

In [74]:
job_names = (
    whole_df.group_by("job_name")
    .agg(pl.col("job_name").count().alias("count"))
    .sort("count", descending=True)
)

In [75]:
job_pop = job_names.filter(pl.col("count") > 10)

In [76]:
job_pop.transpose()

column_0,column_1,column_2,column_3,column_4,column_5,column_6,column_7,column_8,column_9,column_10,column_11,column_12,column_13,column_14,column_15,column_16,column_17,column_18,column_19,column_20,column_21,column_22,column_23,column_24,column_25,column_26,column_27,column_28,column_29,column_30,column_31,column_32,column_33,column_34,column_35,column_36,…,column_465,column_466,column_467,column_468,column_469,column_470,column_471,column_472,column_473,column_474,column_475,column_476,column_477,column_478,column_479,column_480,column_481,column_482,column_483,column_484,column_485,column_486,column_487,column_488,column_489,column_490,column_491,column_492,column_493,column_494,column_495,column_496,column_497,column_498,column_499,column_500,column_501
str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,…,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str
"""software engineer""","""senior software engineer""","""product manager""","""devops engineer""","""data engineer""","""project engineer""","""security officer""","""electrical engineer""","""program manager""","""data analyst""","""mechanical engineer""","""full stack developer""","""software developer""","""data scientist""","""systems engineer""","""quality engineer""","""network engineer""","""security guard""","""retail front end supervisor""","""process engineer""","""senior data engineer""","""manufacturing engineer""","""engineering manager""","""application developer: cloud f…","""senior devops engineer""","""sales engineer""","""senior product manager""","""senior software developer""","""technical writer""","""field service engineer""","""site reliability engineer""","""product owner""","""android developer""","""backend developer""","""ios developer""","""civil engineer""","""qa engineer""",…,"""project engineering manager""","""technical engineer""","""cloud data engineer""","""cloud solutions architect""","""sr. data scientist""","""frontend software engineer""","""principal mechanical engineer""","""electrical/controls/automation…","""senior security analyst""","""manufacturing engineering mana…","""fire engineer""","""sr. 
software developer""","""senior cybersecurity engineer""","""software quality assurance eng…","""lead qa engineer""","""functional safety engineer""","""robotics engineer""","""site reliability engineer (sre…","""engineering internship""","""data analyst (remote)""","""principal product manager""","""senior software engineer (back…","""transportation project enginee…","""site reliability engineer iii""","""cloud infrastructure engineer""","""security guard - full time""","""design engineer ii""","""senior application security en…","""data scientist ii""","""marketing data analyst""","""senior software engineer - jav…","""package consultant: oracle clo…","""devops engineer - remote, full…","""lead software developer""","""field sales engineer""","""engineering specialist""","""application developer: ibm clo…"
"""814""","""615""","""462""","""398""","""378""","""373""","""361""","""354""","""342""","""327""","""277""","""276""","""274""","""247""","""236""","""235""","""235""","""232""","""202""","""196""","""194""","""194""","""192""","""184""","""182""","""164""","""161""","""158""","""155""","""153""","""147""","""146""","""144""","""144""","""140""","""131""","""128""",…,"""11""","""11""","""11""","""11""","""11""","""11""","""11""","""11""","""11""","""11""","""11""","""11""","""11""","""11""","""11""","""11""","""11""","""11""","""11""","""11""","""11""","""11""","""11""","""11""","""11""","""11""","""11""","""11""","""11""","""11""","""11""","""11""","""11""","""11""","""11""","""11""","""11"""


In [77]:
data_job = (
    whole_df.filter(pl.col("job_name").str.contains("(?i)data"))
    .group_by("job_name")
    .agg(pl.col("job_name").count().alias("count"))
    .sort("count", descending=True)
)

In [78]:
junior_job = (
    whole_df.filter(pl.col("job_name").str.contains("(?i)junior"))
    .group_by("job_name")
    .agg(pl.col("job_name").count().alias("count"))
    .sort("count", descending=True)
)

In [79]:
whole_df.filter(
    (pl.col("job_name").str.contains("(?i)junior"))
    & (pl.col("seniority") != "Junior IC")
)
# 146 further postings contain "junior" in job_name but are not labelled "Junior IC" in seniority

job_name,job_location,hours,remote,company_name,education,tags_matched,tag_categories,categories,seniority,comp_est,language,city,country,job_published_at,last_indexed,new,compensation
str,str,cat,bool,str,cat,list[str],list[str],list[str],cat,str,cat,str,str,date,date,str,f64
"""junior product manager""","""Sydney, NSW, AU, 2007""","""unknown""",,"""perfettivaT1""","""unknown""",[],[],"[""Manufacturing""]","""IC""",,"""en""","""Sydney""","""Australia""",2024-03-07,2024-03-22,"""April 2024""",
"""cyber security project manager…","""United Kingdom""","""Full-Time""",,"""Funding Circle UK""","""unknown""",[],[],"[""Security""]","""IC""","""""","""en""",,"""United Kingdom""",2023-05-15,2023-06-06,"""June 2023""",
"""junior accountmanager duitslan…","""Utrecht""","""Full-Time""",,"""Recruitment Masters""","""unknown""","[""Go""]","[""OSS"", ""Programming"", ""Languages""]","[""Software"", ""Recruiting"", ""Staffing""]","""Manager""",,"""en""","""Utrecht""","""Netherlands""",2024-02-23,2024-03-23,"""April 2024""",
"""junior software engineer""","""Baltimore, MD""","""Full-Time""",,"""Latitude""","""unknown""","[""Git"", ""PHP"", … ""JavaScript""]","[""Stat"", ""Tools"", … ""Tools""]","[""Software""]","""Manager""",,"""en""","""Baltimore""","""United States""",,2023-06-08,"""June 2023""",
"""(junior/senior) security analy…","""Deutsche Telekom Cyber Securit…","""Part-Time""",,"""Deutsche Telekom""","""unknown""",[],[],"[""Security""]","""IC""","""43078.00""","""en""","""Vienna""","""Austria""",2023-12-05,2024-03-23,"""April 2024""",43078.0
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
"""junior software engineer analy…","""Pittsburgh, PA""","""Full-Time""",,"""Carnegie Mellon University""","""unknown""",[],[],"[""Education"", ""Higher"", … ""Software""]","""IC""",,"""en""","""Pittsburgh""","""United States""",2024-01-25,2024-04-01,"""April 2024""",
"""(junior) projektmanager crm so…","""Gräfelfing""","""Full-Time""",,"""FUTRUE""","""unknown""","[""Excel""]","[""Treasury"", ""Spreadsheets"", … ""Customers""]","[""Software"", ""CRM""]","""Manager""",,"""en""",,,2024-03-05,2024-03-29,"""April 2024""",
"""junior security analyst (shift…","""Remote""","""Full-Time""",true,"""Fusion Technology""","""unknown""","[""AWS"", ""Intel"", … ""Tanium""]","[""Data"", ""Science"", … ""Analytics""]","[""Security""]","""Manager""",,"""en""","""Herndon""","""United States""",2024-02-23,2024-04-01,"""April 2024""",
"""junior program manager (m/w/d)…","""Weßling, 82331 Germany""","""Full-Time""",,"""Bertrandt AG""","""unknown""",[],[],[],"""Manager""","""65000.0""","""en""","""Weßling""","""Germany""",2024-03-06,2024-03-20,"""April 2024""",65000.0


In [80]:
junior_job

job_name,count
str,u32
"""junior software engineer""",34
"""junior data scientist - dubai,…",21
"""junior data analyst""",20
"""junior software developer""",14
"""junior electrical engineer""",8
…,…
"""junior information security en…",1
"""cloud engineer junior (ts/sci)…",1
"""junior engineering officer""",1
"""junior program manager""",1


In [81]:
intern = (
    whole_df.filter(pl.col("job_name").str.contains("(?i)intern"))
    .group_by("job_name")
    .agg(pl.col("job_name").count().alias("count"))
    .sort("count", descending=True)
)

In [82]:
intern

job_name,count
str,u32
"""engineering intern""",56
"""mechanical engineering intern""",22
"""software engineer intern""",19
"""civil engineering intern""",19
"""software engineering intern""",18
…,…
"""electronics engineering intern""",1
"""civil engineering internship -…",1
"""manager, internal audit inform…",1
"""summer 2024 engineering intern…",1
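
Note that the substring pattern `(?i)intern` also matches words such as "internal", which is why a title like "manager, internal audit inform…" shows up in the result above. A minimal sketch of a stricter pattern using word boundaries, shown with the standard-library `re` module on a few titles from this notebook (the pattern itself is a suggestion, not part of the original analysis):

```python
import re

# Titles taken from the outputs in this notebook (truncation kept as displayed).
titles = [
    "engineering intern",
    "software engineering intern",
    "manager, internal audit inform…",
    "engineering internship",
]

# Substring match, equivalent to the pattern used in the cell above.
loose = re.compile(r"intern")
# Whole words only: "intern", "interns", "internship", "internships".
strict = re.compile(r"\bintern(ship)?s?\b")

loose_hits = [t for t in titles if loose.search(t)]
strict_hits = [t for t in titles if strict.search(t)]

print(len(loose_hits), "loose matches")
print(strict_hits)
```

The same `\b` word-boundary syntax should also work inside `pl.col("job_name").str.contains(...)`, since the regex engine polars uses supports it.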


In [83]:
internship = (
    whole_df.filter(pl.col("job_name").str.contains("(?i)internship"))
    .group_by("job_name")
    .agg(pl.col("job_name").count().alias("count"))
    .sort("count", descending=True)
)

In [84]:
internship

job_name,count
str,u32
"""engineering internship""",11
"""internship for android develop…",5
"""electrical engineering interns…",5
"""mechanical engineering interns…",5
"""internship for ios from an it …",4
…,…
"""psychiatrist-act adult sprg 47…",1
"""internship 2023, r&d engineeri…",1
"""lab innovation engineer (inter…",1
"""hurry up for internship progra…",1


In [85]:
whole_df.select(pl.col("company_name").value_counts(sort=True)).transpose()

column_0,column_1,column_2,column_3,column_4,column_5,column_6,column_7,column_8,column_9,column_10,column_11,column_12,column_13,column_14,column_15,column_16,column_17,column_18,column_19,column_20,column_21,column_22,column_23,column_24,column_25,column_26,column_27,column_28,column_29,column_30,column_31,column_32,column_33,column_34,column_35,column_36,…,column_28745,column_28746,column_28747,column_28748,column_28749,column_28750,column_28751,column_28752,column_28753,column_28754,column_28755,column_28756,column_28757,column_28758,column_28759,column_28760,column_28761,column_28762,column_28763,column_28764,column_28765,column_28766,column_28767,column_28768,column_28769,column_28770,column_28771,column_28772,column_28773,column_28774,column_28775,column_28776,column_28777,column_28778,column_28779,column_28780,column_28781
struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],…,struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2]
"{null,6266}","{""IBM"",2683}","{""Allied Universal"",1057}","{""CLBPTS"",668}","{""Bosch Group"",533}","{""Schneider Electric"",397}","{""260312-SOUTH FLORIDA REGION ADMIN"",380}","{""Novartis"",367}","{""Volvo Group"",342}","{""Lockheed Martin"",339}","{""Endeavor IT Solution"",305}","{""Open Systems Technologies"",280}","{""Explore Jobs Search"",266}","{""Weblee Technologies"",264}","{""The Boeing Company"",260}","{""Continental"",259}","{""Coders Brain Technology"",242}","{""Capgemini"",224}","{""IBM Careers"",220}","{""FullStack Labs"",215}","{""AECOM"",201}","{""241387-COMP & BEN ADMIN PROF FEES"",199}","{""Burlington Stores"",191}","{""Securitas US Business Unit"",176}","{""CACI-FEDERAL"",172}","{""Worley"",160}","{""Nagarro"",153}","{""Jobsbridge"",152}","{""Segula Technologies"",146}","{""Oowlish Technology"",143}","{""Publicis Groupe"",143}","{""Latitude"",136}","{""Sargent & Lundy"",135}","{""GardaWorld"",135}","{""Sonsoft"",134}","{""About Alstom"",129}","{""SAP"",127}",…,"{""Davido Consulting Group"",1}","{""unimed"",1}","{""Enterprise Bank"",1}","{""99minutos.com"",1}","{""Redner's Jobs"",1}","{""The Green Technology Group"",1}","{""Zipline Logistics"",1}","{""OmniData"",1}","{""Tremendous"",1}","{""PCM Services"",1}","{""Cuculus"",1}","{""Maersk Branch Canada"",1}","{""pentavalue"",1}","{""K2 Group"",1}","{""I-care USA"",1}","{""SPIE SAG Geschäftseinheit CeGIT"",1}","{""Datalytics"",1}","{""Psychiatrische Universitätsklinik Zürich"",1}","{""Southeastern Community College"",1}","{""ResearchGate"",1}","{""ReflexAI"",1}","{""Runnings"",1}","{""Donnell Consulting"",1}","{""Faith Assembly"",1}","{""MS0017 GE Healthcare Austria & Co OG"",1}","{""BATEMAN SPRAYERS"",1}","{""FIRST RF CORPORATION"",1}","{""CCBA 2023"",1}","{""NORTHWARE SA DE CV"",1}","{""Japan Cloud"",1}","{""Varstaff"",1}","{""Sincerus Global Solutions"",1}","{""STARK Deutschland"",1}","{""Smart-One Solutions"",1}","{""ADSIPL - Telangana - F02"",1}","{""Ollang"",1}","{""newgenapps"",1}"


In [86]:
whole_df.select(pl.col("seniority").value_counts(sort=True)).transpose()

column_0,column_1,column_2,column_3,column_4,column_5,column_6,column_7,column_8,column_9,column_10,column_11,column_12,column_13,column_14
struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2]
"{""unknown"",53578}","{""Senior IC"",17840}","{""Manager"",6688}","{""IC"",6623}","{""Unclear Seniority"",4593}","{""Staff IC"",3513}","{""Intern"",1982}","{""Junior IC"",1896}","{""Contract"",1106}","{""Director"",874}","{""Chief"",557}","{""Exec"",465}","{""Senior Manager"",273}","{""Senior Exec"",8}","{""Founder"",3}"


In [87]:
from polars_ds.diagnosis import DIA
import polars.selectors as cs

In [88]:
dia = DIA(whole_df)

In [89]:
dia.plot_null_distribution(cs.all())

Null Distribution (per-column sparkline omitted; share of nulls per column)
column,null_pct
job_name,0.00%
job_location,11.45%
hours,0.00%
remote,76.62%
company_name,6.27%
education,0.00%
tags_matched,1.56%
tag_categories,1.56%
categories,0.00%
seniority,0.00%


In [90]:
whole_df.select(pl.col("hours").value_counts(sort=True)).transpose()

column_0,column_1,column_2,column_3,column_4,column_5,column_6,column_7,column_8,column_9,column_10,column_11,column_12,column_13,column_14,column_15
struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2]
"{""Full-Time"",61588}","{""unknown"",27739}","{""Contract"",3727}","{""Part-Time"",2009}","{""Unclear"",1905}","{""Intern"",1048}","{""Temp"",701}","{""Hourly"",513}","{""Student"",280}","{""Trainee"",187}","{""Advisor"",94}","{""Gig"",83}","{""Commission"",82}","{""Grant"",27}","{""Conditional"",13}","{""Volunteer"",3}"


In [91]:
whole_df.select(pl.col("language").value_counts(sort=True)).transpose()

column_0,column_1,column_2,column_3,column_4,column_5,column_6,column_7,column_8,column_9,column_10,column_11,column_12,column_13,column_14,column_15,column_16,column_17,column_18,column_19,column_20,column_21,column_22,column_23,column_24,column_25,column_26,column_27,column_28,column_29,column_30,column_31,column_32,column_33,column_34,column_35,column_36,column_37,column_38,column_39,column_40,column_41,column_42,column_43
struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2]
"{""en"",91143}","{""de"",1595}","{""fr"",1440}","{""pt"",1051}","{""es"",1002}","{""zh"",639}","{""unknown"",583}","{""nl"",567}","{""ja"",375}","{""ko"",347}","{""pl"",319}","{""sk"",214}","{""it"",168}","{""sv"",117}","{""ru"",96}","{""tr"",41}","{""id"",40}","{""hu"",39}","{""no"",38}","{""cs"",29}","{""sl"",29}","{""ro"",16}","{""uk"",16}","{""da"",11}","{""et"",11}","{""hr"",11}","{""fi"",11}","{""tl"",9}","{""ca"",6}","{""el"",5}","{""vi"",5}","{""lt"",5}","{""af"",4}","{""cy"",4}","{""sw"",2}","{""ka"",2}","{""sq"",2}","{""he"",1}","{""th"",1}","{""lv"",1}","{""gb"",1}","{""sr"",1}","{""hy"",1}","{""ar"",1}"


In [92]:
whole_df.select(pl.col("country").value_counts(sort=True)).transpose()

column_0,column_1,column_2,column_3,column_4,column_5,column_6,column_7,column_8,column_9,column_10,column_11,column_12,column_13,column_14,column_15,column_16,column_17,column_18,column_19,column_20,column_21,column_22,column_23,column_24,column_25,column_26,column_27,column_28,column_29,column_30,column_31,column_32,column_33,column_34,column_35,column_36,…,column_146,column_147,column_148,column_149,column_150,column_151,column_152,column_153,column_154,column_155,column_156,column_157,column_158,column_159,column_160,column_161,column_162,column_163,column_164,column_165,column_166,column_167,column_168,column_169,column_170,column_171,column_172,column_173,column_174,column_175,column_176,column_177,column_178,column_179,column_180,column_181,column_182
struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],…,struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2]
"{""United States"",44581}","{""India"",10275}","{null,9708}","{""United Kingdom"",4191}","{""Germany"",3560}","{""Canada"",2362}","{""Brazil"",1783}","{""France"",1530}","{""Australia"",1340}","{""Mexico"",1015}","{""China"",948}","{""Poland"",927}","{""Singapore"",902}","{""Spain"",857}","{""Netherlands"",854}","{""South Africa"",774}","{""Israel"",769}","{""Philippines"",666}","{""Italy"",587}","{""Romania"",574}","{""Malaysia"",564}","{""Japan"",509}","{""Ireland"",496}","{""Belgium"",480}","{""Sweden"",470}","{""Switzerland"",411}","{""Portugal"",401}","{""Colombia"",383}","{""Argentina"",371}","{""Austria"",334}","{""Thailand"",325}","{""Saudi Arabia"",309}","{""United Arab Emirates"",303}","{""Czech Republic"",294}","{""Taiwan"",281}","{""Egypt"",275}","{""Hungary"",263}",…,"{""Somalia"",2}","{""Papua New Guinea"",2}","{""Barbados"",2}","{""Ethiopia"",2}","{""Sierra Leone"",2}","{""Equatorial Guinea"",2}","{""Fiji"",2}","{""Antarctica"",2}","{""Cayman Islands"",2}","{""Mali"",2}","{""Guyana"",1}","{""Laos"",1}","{""Greenland"",1}","{""Gabon"",1}","{""Saint Lucia"",1}","{""Benin"",1}","{""Central African Republic"",1}","{""Saint Kitts And Nevis"",1}","{""Tajikistan"",1}","{""Wallis And Futuna"",1}","{""Libya"",1}","{""Vanuatu"",1}","{""Faroe Islands"",1}","{""Sudan"",1}","{""Afghanistan"",1}","{""Bermuda"",1}","{""Aruba"",1}","{""Turkmenistan"",1}","{""Brunei"",1}","{""Liberia"",1}","{""Marshall Islands"",1}","{""San Marino"",1}","{""Togo"",1}","{""Guinea"",1}","{""Djibouti"",1}","{""Yemen"",1}","{""Mozambique"",1}"


In [93]:
date_data_new = whole_df.select(cs.date())
bool_data_new = whole_df.select(cs.by_dtype(pl.Boolean))
string_data_new = whole_df.select(cs.string(include_categorical=True))
nested_data_new = whole_df.select(
    cs.by_name("tags_matched", "tag_categories", "categories")
)
num_data_new = whole_df.select(cs.float())

In [94]:
whole_df.select(pl.col("job_location").value_counts(sort=True))

job_location
struct[2]
"{null,11452}"
"{""Remote"",1316}"
"{""United States"",1284}"
"{""Bangalore, India"",942}"
"{""New York, NY"",453}"
…
"{""Maslak, Ahi Evran Cd., 34485 Sarıyer/İstanbul, Turkey"",1}"
"{""South San Francisco, CA · Information Technology"",1}"
"{""USA MD Bethesda - 12A South Dr (MDC036)"",1}"
"{""Chasie Street, Windhoek, Namibia"",1}"


In [95]:
print(f"date type columns:{date_data_new.columns}")
print(f"bool type columns:{bool_data_new.columns}")
print(f"string type columns:{string_data_new.columns}")
print(f"nested type columns:{nested_data_new.columns}")

date type columns:['job_published_at', 'last_indexed']
bool type columns:['remote']
string type columns:['job_name', 'job_location', 'hours', 'company_name', 'education', 'seniority', 'comp_est', 'language', 'city', 'country', 'new']
nested type columns:['tags_matched', 'tag_categories', 'categories']


In [96]:
missing = (
    whole_df.select(pl.all().is_null().sum())
    .melt(value_name="missing")
    .filter(pl.col("missing") > 0)
)

In [97]:
compensation = whole_df.filter(pl.col("compensation") > 0)

In [98]:
compensation

job_name,job_location,hours,remote,company_name,education,tags_matched,tag_categories,categories,seniority,comp_est,language,city,country,job_published_at,last_indexed,new,compensation
str,str,cat,bool,str,cat,list[str],list[str],list[str],cat,str,cat,str,str,date,date,str,f64
"""qt/qml software developer""","""Palo Alto, CA, 94303""","""unknown""",,"""Sciton""","""unknown""","[""Linux"", ""Git"", ""C""]","[""OSS"", ""Version"", … ""Languages""]","[""Software""]","""unknown""","""75060.00""","""en""","""Palo Alto""","""United States""",,2024-03-23,"""April 2024""",75060.0
"""contractor safety engineer (m,…","""Gent, Oost-Vlaanderen, Belgium""","""Full-Time""",,"""ArcelorMittal""","""unknown""",[],[],[],"""Contract""","""2080.00""","""en""","""Gent""","""Belgium""",2023-12-18,2024-03-14,"""April 2024""",2080.0
"""entry-level electrical enginee…","""Romeoville, IL USA""","""Full-Time""",,,"""unknown""","[""Microsoft""]",[],[],"""unknown""","""75000.00""","""en""","""Romeoville""","""United States""",2024-03-06,2024-03-31,"""April 2024""",75000.0
"""armed security guard""","""Mayfield, KY""","""Full-Time""",,"""REDCON Solutions Group""","""unknown""",[],[],"[""Security"", ""Physical"", ""Security""]","""unknown""","""20.61""","""en""","""Mayfield""","""United States""",2023-05-01,2023-06-01,"""June 2023""",20.61
"""senior principal engineer rf s…","""United States-California-Point…","""Full-Time""",,"""CORP-Corporate Office""","""Bachelors""","[""Linux"", ""Python"", … ""XML""]","[""OSS"", ""OS"", … ""Languages""]","[""Business"", ""Software""]","""Staff IC""","""109000.00""","""en""",,"""United States""",2024-03-22,2024-03-28,"""April 2024""",109000.0
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
"""application engineer, thermal …","""Westerville, OH, United States""","""unknown""",,"""Vertiv Group Corporation""","""Bachelors""","[""Oracle""]","[""Enterprise"", ""Customers"", … ""Datastores""]",[],"""unknown""","""5.00""","""en""","""Westerville""","""United States""",2024-01-04,2024-03-14,"""April 2024""",5.0
"""lead software engineer - cloud…","""Jersey City, NJ, United States""","""Full-Time""",,"""241387-COMP & BEN ADMIN PROF F…","""Bachelors""","[""Java"", ""Atlas"", … ""Terraform""]","[""Build"", ""Tools"", … ""Config""]","[""Software""]","""unknown""","""181125.00""","""en""","""Jersey City""","""United States""",2023-09-25,2024-03-15,"""April 2024""",181125.0
"""systems engineer - seta""","""Chantilly, VA""","""Full-Time""",,"""McIntire Solutions""","""Bachelors""",[],[],[],"""Senior IC""","""170000.0""","""en""","""Chantilly""","""United States""",2023-04-11,2023-05-23,"""June 2023""",170000.0
"""electrical engineer""","""Kolkata""","""Full-Time""",,"""Benchmark Global Management Se…","""unknown""","[""Schneider""]","[""Applications"", ""Industry"", ""Industrial""]",[],"""unknown""","""114000.00""","""en""","""Kolkata""","""India""",2023-07-08,2024-03-29,"""April 2024""",114000.0


#### country+code

In [99]:
whole_df = whole_df.with_columns(pl.col("country").str.replace("Turkey", "Turkiye"))

In [100]:
# "Turkey" was already renamed in the previous cell, so only Ivory Coast is left to fix.
whole_df = whole_df.with_columns(
    pl.col("country").str.replace("Ivory Coast", "Côte d'Ivoire", literal=True)
)

In [101]:
whole_df

job_name,job_location,hours,remote,company_name,education,tags_matched,tag_categories,categories,seniority,comp_est,language,city,country,job_published_at,last_indexed,new,compensation
str,str,cat,bool,str,cat,list[str],list[str],list[str],cat,str,cat,str,str,date,date,str,f64
"""senior data science engineer""","""6710a Rockledge Dr suite 400, …","""Full-Time""",true,"""RightEye""","""unknown""","[""SQL"", ""NoSQL"", … ""Python""]","[""NoSQL"", ""Data"", … ""Tools""]",[],"""Senior IC""",,"""en""","""Bethesda""","""United States""",2023-09-08,2024-03-15,"""April 2024""",
"""domestic intruder/fire alarm e…","""Shaftesbury, United Kingdom""","""unknown""",false,"""Swann Recruitment""","""unknown""",[],[],"[""Recruiting"", ""Staffing""]","""unknown""",,"""en""","""Shaftesbury""","""United Kingdom""",2018-10-03,2024-03-20,"""April 2024""",
"""software developer - product s…","""Remote, Spain""","""Unclear""",true,"""Red Hat""","""unknown""","[""Vue"", ""js"", … ""Git""]","[""Scheduling"", ""Orchestration"", … ""Framework""]","[""Security"", ""Software""]","""unknown""",,"""en""",,"""Spain""",2024-03-08,2024-03-30,"""April 2024""",
"""sap automation engineer""","""Hyderabad, India""","""unknown""",false,"""SQUIRCLE IT CONSULTING SERVICE…","""unknown""","[""SAP""]","[""IaaS"", ""Travel"", … ""SaaS""]","[""IT"", ""ERP"", … ""Intelligence""]","""unknown""","""""","""en""","""Hyderabad""","""India""",2016-10-14,2023-05-29,"""June 2023""",
"""cleared armed security officer…","""Columbia, MD, US""","""unknown""",,,"""Some High School""",[],[],"[""Security""]","""Unclear Seniority""",,"""en""","""Columbia""","""United States""",2024-02-14,2024-03-17,"""April 2024""",
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
"""security tech lead""",,"""Full-Time""",,"""AArete Technosoft""","""unknown""","[""SonarQube"", ""Git"", … ""Jenkins""]","[""Programming"", ""Languages"", … ""DevOps""]","[""Security""]","""Staff IC""","""""","""en""","""Pune City""","""India""",2022-04-06,2023-06-04,"""June 2023""",
"""senior quality program manager…","""Oregon - Remote""","""Full-Time""",true,"""100-SFDC""","""unknown""","[""Salesforce"", ""Ranger"", ""GitHub""]","[""CRM"", ""Continuous"", … ""Customers""]","[""Philanthropy"", ""CRM"", … ""Nonprofits""]","""Manager""","""""","""en""",,"""United States""",2023-05-17,2023-05-19,"""June 2023""",
"""product manager - layered | je…","""Greensboro, NC""","""unknown""",,"""Market America""","""Bachelors""","[""Excel"", ""Jira""]","[""No"", ""Code"", … ""Treasury""]","[""Jewelry""]","""IC""",,"""en""","""Greensboro""","""United States""",,2023-06-08,"""June 2023""",
"""electrical & instrumentation e…","""Chennai""","""Full-Time""",,"""544 FLSmidth""","""Bachelors""","[""Atlas""]","[""Build"", ""Tools"", … ""DevOps""]",[],"""unknown""","""""","""en""","""Chennai""","""India""",2023-05-04,2023-05-24,"""June 2023""",


In [102]:
alpha_path = "/home/anopsy/Portfolio/sourcestack/data/alpha3_codes.csv"
alpha_codes = pl.read_csv(alpha_path)

In [103]:
alpha_df = whole_df.join(alpha_codes, on="country", how="left")
alpha_df

job_name,job_location,hours,remote,company_name,education,tags_matched,tag_categories,categories,seniority,comp_est,language,city,country,job_published_at,last_indexed,new,compensation,code
str,str,cat,bool,str,cat,list[str],list[str],list[str],cat,str,cat,str,str,date,date,str,f64,str
"""senior data science engineer""","""6710a Rockledge Dr suite 400, …","""Full-Time""",true,"""RightEye""","""unknown""","[""SQL"", ""NoSQL"", … ""Python""]","[""NoSQL"", ""Data"", … ""Tools""]",[],"""Senior IC""",,"""en""","""Bethesda""","""United States""",2023-09-08,2024-03-15,"""April 2024""",,"""USA"""
"""domestic intruder/fire alarm e…","""Shaftesbury, United Kingdom""","""unknown""",false,"""Swann Recruitment""","""unknown""",[],[],"[""Recruiting"", ""Staffing""]","""unknown""",,"""en""","""Shaftesbury""","""United Kingdom""",2018-10-03,2024-03-20,"""April 2024""",,"""GBR"""
"""software developer - product s…","""Remote, Spain""","""Unclear""",true,"""Red Hat""","""unknown""","[""Vue"", ""js"", … ""Git""]","[""Scheduling"", ""Orchestration"", … ""Framework""]","[""Security"", ""Software""]","""unknown""",,"""en""",,"""Spain""",2024-03-08,2024-03-30,"""April 2024""",,"""ESP"""
"""sap automation engineer""","""Hyderabad, India""","""unknown""",false,"""SQUIRCLE IT CONSULTING SERVICE…","""unknown""","[""SAP""]","[""IaaS"", ""Travel"", … ""SaaS""]","[""IT"", ""ERP"", … ""Intelligence""]","""unknown""","""""","""en""","""Hyderabad""","""India""",2016-10-14,2023-05-29,"""June 2023""",,"""IND"""
"""cleared armed security officer…","""Columbia, MD, US""","""unknown""",,,"""Some High School""",[],[],"[""Security""]","""Unclear Seniority""",,"""en""","""Columbia""","""United States""",2024-02-14,2024-03-17,"""April 2024""",,"""USA"""
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
"""security tech lead""",,"""Full-Time""",,"""AArete Technosoft""","""unknown""","[""SonarQube"", ""Git"", … ""Jenkins""]","[""Programming"", ""Languages"", … ""DevOps""]","[""Security""]","""Staff IC""","""""","""en""","""Pune City""","""India""",2022-04-06,2023-06-04,"""June 2023""",,"""IND"""
"""senior quality program manager…","""Oregon - Remote""","""Full-Time""",true,"""100-SFDC""","""unknown""","[""Salesforce"", ""Ranger"", ""GitHub""]","[""CRM"", ""Continuous"", … ""Customers""]","[""Philanthropy"", ""CRM"", … ""Nonprofits""]","""Manager""","""""","""en""",,"""United States""",2023-05-17,2023-05-19,"""June 2023""",,"""USA"""
"""product manager - layered | je…","""Greensboro, NC""","""unknown""",,"""Market America""","""Bachelors""","[""Excel"", ""Jira""]","[""No"", ""Code"", … ""Treasury""]","[""Jewelry""]","""IC""",,"""en""","""Greensboro""","""United States""",,2023-06-08,"""June 2023""",,"""USA"""
"""electrical & instrumentation e…","""Chennai""","""Full-Time""",,"""544 FLSmidth""","""Bachelors""","[""Atlas""]","[""Build"", ""Tools"", … ""DevOps""]",[],"""unknown""","""""","""en""","""Chennai""","""India""",2023-05-04,2023-05-24,"""June 2023""",,"""IND"""


In [104]:
alpha_df = alpha_df.drop("categories", "tags_matched", "tag_categories")

In [105]:
alpha_df.write_csv(
    "/home/anopsy/Portfolio/sourcestack/data/alpha_df.csv", separator=","
)

In [106]:
alpha_df

job_name,job_location,hours,remote,company_name,education,seniority,comp_est,language,city,country,job_published_at,last_indexed,new,compensation,code
str,str,cat,bool,str,cat,cat,str,cat,str,str,date,date,str,f64,str
"""senior data science engineer""","""6710a Rockledge Dr suite 400, …","""Full-Time""",true,"""RightEye""","""unknown""","""Senior IC""",,"""en""","""Bethesda""","""United States""",2023-09-08,2024-03-15,"""April 2024""",,"""USA"""
"""domestic intruder/fire alarm e…","""Shaftesbury, United Kingdom""","""unknown""",false,"""Swann Recruitment""","""unknown""","""unknown""",,"""en""","""Shaftesbury""","""United Kingdom""",2018-10-03,2024-03-20,"""April 2024""",,"""GBR"""
"""software developer - product s…","""Remote, Spain""","""Unclear""",true,"""Red Hat""","""unknown""","""unknown""",,"""en""",,"""Spain""",2024-03-08,2024-03-30,"""April 2024""",,"""ESP"""
"""sap automation engineer""","""Hyderabad, India""","""unknown""",false,"""SQUIRCLE IT CONSULTING SERVICE…","""unknown""","""unknown""","""""","""en""","""Hyderabad""","""India""",2016-10-14,2023-05-29,"""June 2023""",,"""IND"""
"""cleared armed security officer…","""Columbia, MD, US""","""unknown""",,,"""Some High School""","""Unclear Seniority""",,"""en""","""Columbia""","""United States""",2024-02-14,2024-03-17,"""April 2024""",,"""USA"""
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
"""security tech lead""",,"""Full-Time""",,"""AArete Technosoft""","""unknown""","""Staff IC""","""""","""en""","""Pune City""","""India""",2022-04-06,2023-06-04,"""June 2023""",,"""IND"""
"""senior quality program manager…","""Oregon - Remote""","""Full-Time""",true,"""100-SFDC""","""unknown""","""Manager""","""""","""en""",,"""United States""",2023-05-17,2023-05-19,"""June 2023""",,"""USA"""
"""product manager - layered | je…","""Greensboro, NC""","""unknown""",,"""Market America""","""Bachelors""","""IC""",,"""en""","""Greensboro""","""United States""",,2023-06-08,"""June 2023""",,"""USA"""
"""electrical & instrumentation e…","""Chennai""","""Full-Time""",,"""544 FLSmidth""","""Bachelors""","""unknown""","""""","""en""","""Chennai""","""India""",2023-05-04,2023-05-24,"""June 2023""",,"""IND"""


In [107]:
# Note: count() skips nulls, so the group with a null code is reported as 0 rows.
country_bar = alpha_df.group_by("code").agg(pl.col("code").count().alias("count"))

In [108]:
country_bar.transpose()

column_0,column_1,column_2,column_3,column_4,column_5,column_6,column_7,column_8,column_9,column_10,column_11,column_12,column_13,column_14,column_15,column_16,column_17,column_18,column_19,column_20,column_21,column_22,column_23,column_24,column_25,column_26,column_27,column_28,column_29,column_30,column_31,column_32,column_33,column_34,column_35,column_36,…,column_143,column_144,column_145,column_146,column_147,column_148,column_149,column_150,column_151,column_152,column_153,column_154,column_155,column_156,column_157,column_158,column_159,column_160,column_161,column_162,column_163,column_164,column_165,column_166,column_167,column_168,column_169,column_170,column_171,column_172,column_173,column_174,column_175,column_176,column_177,column_178,column_179
str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,…,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str
"""COL""",,"""KWT""","""RUS""","""LSO""","""CHE""","""PAK""","""ECU""","""GAB""","""NGA""","""TJK""","""BGR""","""SMR""","""IRQ""","""GNQ""","""CMR""","""MOZ""","""JOR""","""CUB""","""KOR""","""DZA""","""MAC""","""NLD""","""FIN""","""IRL""","""WLF""","""MDG""","""KNA""","""AUS""","""ARM""","""None""","""TKM""","""GUY""","""LCA""","""JAM""","""PRY""","""PHL""",…,"""THA""","""GIN""","""UKR""","""BMU""","""CRI""","""BIH""","""DOM""","""SSD""","""SGP""","""BLR""","""EST""","""URY""","""MWI""","""MCO""","""JPN""","""KAZ""","""IND""","""UGA""","""USA""","""AGO""","""GRL""","""SLE""","""TWN""","""IDN""","""BLZ""","""HUN""","""NOR""","""HND""","""BRN""","""AUT""","""BOL""","""TUR""","""ISL""","""ARG""","""TZA""","""ZMB""","""KGZ"""
"""383""","""0""","""18""","""28""","""4""","""411""","""220""","""47""","""1""","""152""","""1""","""210""","""1""","""19""","""2""","""6""","""1""","""32""","""50""","""207""","""7""","""7""","""854""","""111""","""496""","""1""","""6""","""1""","""1340""","""25""","""2""","""1""","""1""","""1""","""16""","""4""","""666""",…,"""325""","""1""","""140""","""1""","""203""","""5""","""30""","""2""","""902""","""4""","""37""","""50""","""2""","""4""","""509""","""26""","""10275""","""11""","""44581""","""7""","""1""","""2""","""281""","""261""","""2""","""263""","""126""","""17""","""1""","""334""","""13""","""225""","""4""","""371""","""5""","""2""","""2"""


In [109]:
alpha_df = alpha_df.with_columns(pl.col("country").fill_null("unknown"))

In [110]:
alpha_df.select(pl.col("country").value_counts(sort=True)).transpose()

column_0,column_1,column_2,column_3,column_4,column_5,column_6,column_7,column_8,column_9,column_10,column_11,column_12,column_13,column_14,column_15,column_16,column_17,column_18,column_19,column_20,column_21,column_22,column_23,column_24,column_25,column_26,column_27,column_28,column_29,column_30,column_31,column_32,column_33,column_34,column_35,column_36,…,column_146,column_147,column_148,column_149,column_150,column_151,column_152,column_153,column_154,column_155,column_156,column_157,column_158,column_159,column_160,column_161,column_162,column_163,column_164,column_165,column_166,column_167,column_168,column_169,column_170,column_171,column_172,column_173,column_174,column_175,column_176,column_177,column_178,column_179,column_180,column_181,column_182
struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],…,struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2]
"{""United States"",44581}","{""India"",10275}","{""unknown"",9708}","{""United Kingdom"",4191}","{""Germany"",3560}","{""Canada"",2362}","{""Brazil"",1783}","{""France"",1530}","{""Australia"",1340}","{""Mexico"",1015}","{""China"",948}","{""Poland"",927}","{""Singapore"",902}","{""Spain"",857}","{""Netherlands"",854}","{""South Africa"",774}","{""Israel"",769}","{""Philippines"",666}","{""Italy"",587}","{""Romania"",574}","{""Malaysia"",564}","{""Japan"",509}","{""Ireland"",496}","{""Belgium"",480}","{""Sweden"",470}","{""Switzerland"",411}","{""Portugal"",401}","{""Colombia"",383}","{""Argentina"",371}","{""Austria"",334}","{""Thailand"",325}","{""Saudi Arabia"",309}","{""United Arab Emirates"",303}","{""Czech Republic"",294}","{""Taiwan"",281}","{""Egypt"",275}","{""Hungary"",263}",…,"{""Somalia"",2}","{""Papua New Guinea"",2}","{""Barbados"",2}","{""Ethiopia"",2}","{""Sierra Leone"",2}","{""Equatorial Guinea"",2}","{""Fiji"",2}","{""Antarctica"",2}","{""Cayman Islands"",2}","{""Mali"",2}","{""Guyana"",1}","{""Laos"",1}","{""Greenland"",1}","{""Gabon"",1}","{""Saint Lucia"",1}","{""Benin"",1}","{""Central African Republic"",1}","{""Saint Kitts And Nevis"",1}","{""Tajikistan"",1}","{""Wallis And Futuna"",1}","{""Libya"",1}","{""Vanuatu"",1}","{""Faroe Islands"",1}","{""Sudan"",1}","{""Afghanistan"",1}","{""Bermuda"",1}","{""Aruba"",1}","{""Turkmenistan"",1}","{""Brunei"",1}","{""Liberia"",1}","{""Marshall Islands"",1}","{""San Marino"",1}","{""Togo"",1}","{""Guinea"",1}","{""Djibouti"",1}","{""Yemen"",1}","{""Mozambique"",1}"


In [111]:
alpha_df.select(pl.col("language").value_counts(sort=True)).transpose()

column_0,column_1,column_2,column_3,column_4,column_5,column_6,column_7,column_8,column_9,column_10,column_11,column_12,column_13,column_14,column_15,column_16,column_17,column_18,column_19,column_20,column_21,column_22,column_23,column_24,column_25,column_26,column_27,column_28,column_29,column_30,column_31,column_32,column_33,column_34,column_35,column_36,column_37,column_38,column_39,column_40,column_41,column_42,column_43
struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2]
"{""en"",91143}","{""de"",1595}","{""fr"",1440}","{""pt"",1051}","{""es"",1002}","{""zh"",639}","{""unknown"",583}","{""nl"",567}","{""ja"",375}","{""ko"",347}","{""pl"",319}","{""sk"",214}","{""it"",168}","{""sv"",117}","{""ru"",96}","{""tr"",41}","{""id"",40}","{""hu"",39}","{""no"",38}","{""cs"",29}","{""sl"",29}","{""ro"",16}","{""uk"",16}","{""da"",11}","{""et"",11}","{""hr"",11}","{""fi"",11}","{""tl"",9}","{""ca"",6}","{""el"",5}","{""vi"",5}","{""lt"",5}","{""af"",4}","{""cy"",4}","{""sw"",2}","{""ka"",2}","{""sq"",2}","{""he"",1}","{""th"",1}","{""lv"",1}","{""gb"",1}","{""sr"",1}","{""hy"",1}","{""ar"",1}"


In [112]:
company_counts = alpha_df["company_name"].value_counts(sort=True)
company_counts = company_counts.drop_nulls()
company_counts

company_name,count
str,u32
"""IBM""",2683
"""Allied Universal""",1057
"""CLBPTS""",668
"""Bosch Group""",533
"""Schneider Electric""",397
…,…
"""STARK Deutschland""",1
"""Smart-One Solutions""",1
"""ADSIPL - Telangana - F02""",1
"""Ollang""",1


In [113]:
# Threshold filter: keeps every company with at least 200 postings, not a fixed top 20.
top20_companies = company_counts.filter(pl.col("count") >= 200)
top20_companies

company_name,count
str,u32
"""IBM""",2683
"""Allied Universal""",1057
"""CLBPTS""",668
"""Bosch Group""",533
"""Schneider Electric""",397
…,…
"""Coders Brain Technology""",242
"""Capgemini""",224
"""IBM Careers""",220
"""FullStack Labs""",215


In [114]:
top20_companies["count"].sum()

9502

In [115]:
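# Threshold filter: keeps every company with at least 20 postings, which may be more than 50 companies.
top50_companies = company_counts.filter(pl.col("count") >= 20)
top50_companies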

company_name,count
str,u32
"""IBM""",2683
"""Allied Universal""",1057
"""CLBPTS""",668
"""Bosch Group""",533
"""Schneider Electric""",397
…,…
"""Standard Chartered""",20
"""CrowdStrike""",20
"""EverWatch""",20
"""Hire Military Talent""",20


In [116]:
top50_companies["count"].sum()

31576

In [117]:
plot_top_companies = top20_companies.hvplot.barh(
    x="company_name",
    y="count",
    color="count",
    rot=90,
    title="Top Companies",
    colorbar=True,
    cmap="plasma",
    clabel="Number of Jobs",
)

In [118]:
hvplot.save(plot_top_companies, "top_companies.png")



city->cat

In [119]:
lat_long = pl.read_csv("/home/anopsy/Portfolio/sourcestack/data/city_coordinates.csv")
lat_long

city,lat,long
str,f64,f64
"""Bilzen""",50.870779,5.5181089
"""Sumidaku""",35.700379,139.805867
"""Kabupaten Bogor""",-6.545325,107.001742
"""Reykjavík""",64.145981,-21.942237
"""Dun Laoghaire""",53.292279,-6.136008
…,…,…
"""Bensenville""",41.953838,-87.943178
"""Osasco""",-23.532486,-46.79168
"""Chehalis""",46.659965,-122.963432
"""Aracajú""",-10.916206,-37.077466


In [120]:
# group_by(...).count() is deprecated; pl.len() gives the same row count per city.
city_count = whole_df.group_by("city").agg(pl.len().alias("count"))


In [121]:
city_df = city_count.join(lat_long, on="city", how="left")
city_df = city_df.drop_nulls()

In [122]:
city_df.sort(by="count", descending=True).head(10)

city,count,lat,long
str,u32,f64,f64
"""Bengaluru""",1942,12.976794,77.590082
"""Bangalore""",1512,12.988157,77.6226
"""San Francisco""",961,37.779259,-122.419329
"""London""",956,51.489334,-0.144055
"""Singapore""",863,1.357107,103.819499
"""New York""",847,40.712728,-74.006015
"""Hyderabad""",815,17.360589,78.474061
"""Pune""",779,18.521428,73.854454
"""Annapolis Junction""",690,39.118996,-76.796342
"""Austin""",656,30.271129,-97.7437


In [123]:
city_df = city_df.with_columns(log_num=pl.col("count").log(base=2))
city_df

city,count,lat,long,log_num
str,u32,f64,f64,f64
"""Santa Cruz de Tenerife""",1,28.467178,-16.250784,0.0
"""Laramie""",1,41.311367,-105.591101,0.0
"""Bangalore""",1512,12.988157,77.6226,10.562242
"""Uppsala""",9,59.858613,17.638744,3.169925
"""Hibbing""",4,47.427155,-92.937689,2.0
…,…,…,…,…
"""Fort Myers""",16,26.640628,-81.872308,4.0
"""Charlottesville""",28,38.029306,-78.476678,4.807355
"""Altona""",1,53.586468,9.77767,0.0
"""Taunton""",5,51.014789,-3.102909,2.321928


In [124]:
city_df.hvplot.points(
    x="long",
    y="lat",
    coastline=True,
    tiles=True,
    s="count",
    color="count",
    cmap="plasma_r",
    alpha=0.8,
)

In [125]:
plot_city = city_df.hvplot.points(
    x="long",
    y="lat",
    coastline=True,
    tiles=True,
    s="count",
    color="log_num",
    cmap="plasma_r",
    alpha=0.7,
)

In [126]:
hvplot.save(plot_city, "cities.png")

