## EDA and Cleaning - SourceStack datasets

This notebook explores and cleans two datasets I obtained by calling the SourceStack API.\
The first dataset was retrieved on **June 9, 2023**\
and the more recent one on **April 2, 2024**.

### Initial Exploration
1. shape
2. dtypes
3. missing values


### Cleaning
1. parsing strings containing datetimes into dates
2. converting strings containing a list into lists of strings
3. converting numerical data from strings to Int/Float
4. identifying dirty categories

#### Let's read in the data and have a look at its shape, columns and values

In [1]:
import polars as pl

In [2]:
old_data_path = '/home/anopsy/Portfolio/sourcestack/data/9june2023.csv'
new_data_path = '/home/anopsy/Portfolio/sourcestack/data/2april2024.csv'

In [3]:
old_df = pl.read_csv(old_data_path, try_parse_dates=False)
new_df = pl.read_csv(new_data_path, try_parse_dates=False)

In [4]:
print(f'Shape of the old data is: {old_df.shape}')
print(f'Shape of the new data is: {new_df.shape}')

Shape of the old data is: (50000, 16)
Shape of the new data is: (50000, 16)


In [5]:
old_df.head()

job_name,job_location,hours,remote,company_name,education,tags_matched,tag_categories,categories,seniority,comp_est,language,city,country,job_published_at,last_indexed
str,str,str,bool,str,str,str,str,str,str,str,str,str,str,str,str
"""Backend Develo…","""Praha, Czech R…",,,"""IBM""",,"""[Docker, Graph…","""[Container Orc…","""[Software]""",,"""""","""pl""","""Praha""","""Czech Republic…","""2023-03-13 05:…","""2023-06-05 13:…"
"""Manufacturing …",,"""Full-Time""",False,,,"""[Sigma]""","""[Tools, Server…","""[Manufacturing…",,"""""","""en""","""Sterling Heigh…","""United States""","""2021-10-09 00:…","""2023-05-24 05:…"
"""Design Enginee…","""520 S Byrkit S…","""Full-Time""",,"""ABI Attachment…","""Bachelors""","""[]""","""[]""","""[Design]""","""Senior IC""","""""","""en""","""Mishawaka""","""United States""","""2023-04-28 03:…","""2023-05-19 14:…"
"""Cybersecurity …",,"""Full-Time""",False,,"""Bachelors""","""[AWS, Qualys, …","""[Compute, Logg…","""[Cybersecurity…",,"""""","""en""","""Herndon""","""United States""","""2023-04-03 00:…","""2023-05-28 11:…"
"""Your Career so…","""Kolkata, India…","""Full-Time""",False,"""Adeeba e Servi…",,"""[Objective-C, …","""[Cloud Native …","""[Software]""",,"""""","""en""","""Kolkata""","""India""","""2017-01-17 11:…","""2023-05-30 11:…"


In [6]:
new_df.head()

job_name,job_location,hours,remote,company_name,education,tags_matched,tag_categories,categories,seniority,comp_est,language,city,country,job_published_at,last_indexed
str,str,str,bool,str,str,str,str,str,str,str,str,str,str,str,str
"""Dir, Engineeri…","""Dominican Repu…","""Full-Time""",,"""DR""","""Bachelors""","""[Microsoft]""","""[]""","""[]""",,,"""en""",,"""Dominican Repu…","""2024-03-04 00:…","""2024-03-26 08:…"
"""Software Engin…","""Dresden or Har…","""Full-Time""",,"""Manning Global…",,"""[Linux]""","""[OS]""","""[Software, IT]…",,,"""en""","""Dresden or Har…","""Germany""","""2024-02-15 00:…","""2024-04-01 09:…"
"""Embedded Softw…","""Brisbane, CA""","""Full-Time""",,"""Avive""",,"""[Linux, C++]""","""[OS, Programmi…","""[Software]""",,"""150000.00""","""en""","""Brisbane""","""Australia""","""2023-10-23 00:…","""2024-04-01 15:…"
"""Manufacturing …","""Monroe, WI""","""Full-Time""",,"""United Future""",,"""[]""","""[]""","""[Manufacturing…","""Manager""","""1.00""","""en""","""Monroe""","""United States""","""2024-03-27 20:…","""2024-03-28 20:…"
"""Vom Lager zum …","""Ennepetal, Nor…","""Full-Time""",,"""RUHR VERMITTLU…",,"""[WhatsApp, Ver…","""[Communication…","""[Security]""",,,"""de""","""Dortmund""","""Germany""","""2024-03-27 12:…","""2024-03-31 11:…"


### Initial explorations of unprocessed dataframes

#### Shape
Both datasets contain **50000 records** \
each record is represented by **16 features**

#### Dtypes
15 of the features currently have the String datatype\
1 feature is Bool

#### Missing values
The datasets contain **null values** and **empty strings**

In [7]:
pl.Config.set_tbl_width_chars(200)  # widen the table display; this has limited effect in a Jupyter notebook

polars.config.Config

Let's try the `sample` method to look at a few more, randomly chosen records. It is a handy alternative to `head` that is worth remembering for the future.

In [8]:
old_df.sample(3)

job_name,job_location,hours,remote,company_name,education,tags_matched,tag_categories,categories,seniority,comp_est,language,city,country,job_published_at,last_indexed
str,str,str,bool,str,str,str,str,str,str,str,str,str,str,str,str
"""Principal Engi…",,"""Full-Time""",True,"""Visenze""","""Bachelors""","""[AWS, Azure, R…","""[ML Tools, Com…","""[eCom, Retail]…","""Staff IC""","""""","""en""",,,"""2023-05-03 00:…","""2023-05-25 04:…"
"""Infrastructure…","""MULTIPLE CITIE…",,,"""IBM""",,"""[RDS, S3, AWS,…","""[Datastores, C…","""[]""","""Unclear Senior…","""""","""sk""","""Multiple Citie…","""Philippines""","""2023-01-06 05:…","""2023-06-05 22:…"
"""Cloud Engineer…","""Cluj, Cluj, Ro…",,,"""sistemasglT1""",,,,"""[Design, AI]""",,"""""","""en-us""",,"""Romania""","""2023-06-05 02:…","""2023-06-06 22:…"


In [9]:
new_df.sample(3)

job_name,job_location,hours,remote,company_name,education,tags_matched,tag_categories,categories,seniority,comp_est,language,city,country,job_published_at,last_indexed
str,str,str,bool,str,str,str,str,str,str,str,str,str,str,str,str
"""Sr. Software E…","""San Francisco,…","""Full-Time""",,"""DocuSign""",,"""[Git, Cassandr…","""[App Definitio…","""[eSignature, S…","""Senior IC""",,"""en-us""","""San Francisco""","""United States""","""2024-03-05 05:…","""2024-03-31 20:…"
"""Mid/Senior Fro…","""Sofia, Bulgari…",,,"""MiNDS""",,"""[DynamoDB, Red…","""[JavaScript UI…","""[Software]""","""Senior IC""",,"""en-us""","""Sofia""","""Bulgaria""","""2023-09-14 15:…","""2024-03-30 10:…"
"""Clinical Engin…","""Missouri, 6301…",,,"""TRIMEDX""","""Bachelors""","""[Microsoft]""","""[]""","""[Security, Med…","""Manager""",,"""en""","""Missouri""","""United States""","""2024-02-22 22:…","""2024-03-23 10:…"


#### Add a column that identifies whether a record comes from 2023 or 2024, and concatenate both dataframes into one

In [10]:
# add a constant string column that identifies the source dataframe
old_df = old_df.with_columns(pl.lit('no').alias('new'))
new_df = new_df.with_columns(pl.lit('yes').alias('new'))

In [11]:
old_df.head()

job_name,job_location,hours,remote,company_name,education,tags_matched,tag_categories,categories,seniority,comp_est,language,city,country,job_published_at,last_indexed,new
str,str,str,bool,str,str,str,str,str,str,str,str,str,str,str,str,str
"""Backend Develo…","""Praha, Czech R…",,,"""IBM""",,"""[Docker, Graph…","""[Container Orc…","""[Software]""",,"""""","""pl""","""Praha""","""Czech Republic…","""2023-03-13 05:…","""2023-06-05 13:…","""no"""
"""Manufacturing …",,"""Full-Time""",False,,,"""[Sigma]""","""[Tools, Server…","""[Manufacturing…",,"""""","""en""","""Sterling Heigh…","""United States""","""2021-10-09 00:…","""2023-05-24 05:…","""no"""
"""Design Enginee…","""520 S Byrkit S…","""Full-Time""",,"""ABI Attachment…","""Bachelors""","""[]""","""[]""","""[Design]""","""Senior IC""","""""","""en""","""Mishawaka""","""United States""","""2023-04-28 03:…","""2023-05-19 14:…","""no"""
"""Cybersecurity …",,"""Full-Time""",False,,"""Bachelors""","""[AWS, Qualys, …","""[Compute, Logg…","""[Cybersecurity…",,"""""","""en""","""Herndon""","""United States""","""2023-04-03 00:…","""2023-05-28 11:…","""no"""
"""Your Career so…","""Kolkata, India…","""Full-Time""",False,"""Adeeba e Servi…",,"""[Objective-C, …","""[Cloud Native …","""[Software]""",,"""""","""en""","""Kolkata""","""India""","""2017-01-17 11:…","""2023-05-30 11:…","""no"""


In [12]:
#concatenating old and new data
whole_df = old_df.vstack(new_df)

print(whole_df.shape)

(100000, 17)


In [13]:
whole_df.sample(5)

job_name,job_location,hours,remote,company_name,education,tags_matched,tag_categories,categories,seniority,comp_est,language,city,country,job_published_at,last_indexed,new
str,str,str,bool,str,str,str,str,str,str,str,str,str,str,str,str,str
"""Oracle EPM Clo…","""MULTIPLE CITIE…","""Full-Time""",,"""IBM""",,"""[Anaplan, IBM,…","""[IaaS, Back Of…","""[Management Co…","""Contract""",,"""zh-tw""","""Multiple Citie…","""United States""","""2024-03-12 15:…","""2024-03-16 23:…","""yes"""
"""Customer Engin…","""Singapore,SGP""","""Full-Time""",,"""SGP""","""Associates""","""[]""","""[]""","""[]""",,,"""en""","""Singapore""","""Singapore""","""2024-03-18 00:…","""2024-03-25 11:…","""yes"""
"""Cybersecurity …","""Annapolis Junc…","""Full-Time""",,"""Dobbs Defense …",,"""[]""","""[]""","""[Cybersecurity…","""Senior IC""","""""","""en""","""Annapolis Junc…","""United States""","""2023-05-19 00:…","""2023-05-28 06:…","""no"""
"""Intern (Techni…",,"""Intern""",,"""Synopsys""",,"""[]""","""[]""","""[Recruiting & …","""Intern""","""""","""en-us""",,,,"""2023-05-23 14:…","""no"""
"""Mechanical Eng…","""CLEVELAND, OH,…","""Full-Time""",,"""Carmeuse Lime …","""Bachelors""","""[SAP, Excel, M…","""[Midsize Custo…","""[Mechanical & …",,"""94550.00""","""en""","""Cleveland""","""United States""","""2023-11-28 17:…","""2024-03-17 11:…","""yes"""


### Cleaning

#### 1. Converting 'job_published_at', 'last_indexed' to Date

In [14]:
whole_df = whole_df.with_columns(pl.col('job_published_at', 'last_indexed').str.to_datetime().cast(pl.Date))

In [15]:
whole_df.head()

job_name,job_location,hours,remote,company_name,education,tags_matched,tag_categories,categories,seniority,comp_est,language,city,country,job_published_at,last_indexed,new
str,str,str,bool,str,str,str,str,str,str,str,str,str,str,date,date,str
"""Backend Develo…","""Praha, Czech R…",,,"""IBM""",,"""[Docker, Graph…","""[Container Orc…","""[Software]""",,"""""","""pl""","""Praha""","""Czech Republic…",2023-03-13,2023-06-05,"""no"""
"""Manufacturing …",,"""Full-Time""",False,,,"""[Sigma]""","""[Tools, Server…","""[Manufacturing…",,"""""","""en""","""Sterling Heigh…","""United States""",2021-10-09,2023-05-24,"""no"""
"""Design Enginee…","""520 S Byrkit S…","""Full-Time""",,"""ABI Attachment…","""Bachelors""","""[]""","""[]""","""[Design]""","""Senior IC""","""""","""en""","""Mishawaka""","""United States""",2023-04-28,2023-05-19,"""no"""
"""Cybersecurity …",,"""Full-Time""",False,,"""Bachelors""","""[AWS, Qualys, …","""[Compute, Logg…","""[Cybersecurity…",,"""""","""en""","""Herndon""","""United States""",2023-04-03,2023-05-28,"""no"""
"""Your Career so…","""Kolkata, India…","""Full-Time""",False,"""Adeeba e Servi…",,"""[Objective-C, …","""[Cloud Native …","""[Software]""",,"""""","""en""","""Kolkata""","""India""",2017-01-17,2023-05-30,"""no"""


#### 2. Converting 'tags_matched', 'tag_categories', 'categories' from str to list[str]

In [16]:
import polars.selectors as cs

In [17]:
def string_to_nested(df, cols):
    '''
    takes a df and list of columns that contain strings with lists
    and turns them into nested datatype List
    '''
    for col in cols:
        df = df.with_columns(pl.col(col).str.extract_all(r'\w+').cast(pl.List(pl.String)))
    return df
    

In [18]:
cols_to_change = ['tags_matched', 'tag_categories', 'categories']
whole_df = string_to_nested(whole_df, cols_to_change)

In [19]:
whole_df.head()

job_name,job_location,hours,remote,company_name,education,tags_matched,tag_categories,categories,seniority,comp_est,language,city,country,job_published_at,last_indexed,new
str,str,str,bool,str,str,list[str],list[str],list[str],str,str,str,str,str,date,date,str
"""Backend Develo…","""Praha, Czech R…",,,"""IBM""",,"[""Docker"", ""GraphQL"", … ""Cloud""]","[""Container"", ""Orchestration"", … ""Databases""]","[""Software""]",,"""""","""pl""","""Praha""","""Czech Republic…",2023-03-13,2023-06-05,"""no"""
"""Manufacturing …",,"""Full-Time""",False,,,"[""Sigma""]","[""Tools"", ""Serverless""]","[""Manufacturing""]",,"""""","""en""","""Sterling Heigh…","""United States""",2021-10-09,2023-05-24,"""no"""
"""Design Enginee…","""520 S Byrkit S…","""Full-Time""",,"""ABI Attachment…","""Bachelors""",[],[],"[""Design""]","""Senior IC""","""""","""en""","""Mishawaka""","""United States""",2023-04-28,2023-05-19,"""no"""
"""Cybersecurity …",,"""Full-Time""",False,,"""Bachelors""","[""AWS"", ""Qualys"", ""Splunk""]","[""Compute"", ""Logging"", … ""Security""]","[""Cybersecurity"", ""Security""]",,"""""","""en""","""Herndon""","""United States""",2023-04-03,2023-05-28,"""no"""
"""Your Career so…","""Kolkata, India…","""Full-Time""",False,"""Adeeba e Servi…",,"[""Objective"", ""C"", … ""Git""]","[""Cloud"", ""Native"", … ""Control""]","[""Software""]",,"""""","""en""","""Kolkata""","""India""",2017-01-17,2023-05-30,"""no"""


#### 3. Converting 'comp_est' from str to float

In [20]:
whole_df = whole_df.with_columns(pl.col('comp_est').cast(pl.Float64, strict=False).alias('compensation'))
# strict=False turns values that cannot be parsed, such as empty strings, into nulls instead of raising
# an earlier attempt to cast to Int32 failed, probably because some values are too large for that type
# polars can chain str -> float -> int, so a further cast to Int64 also works; this cell keeps Float64

In [21]:
whole_df.head()

job_name,job_location,hours,remote,company_name,education,tags_matched,tag_categories,categories,seniority,comp_est,language,city,country,job_published_at,last_indexed,new,compensation
str,str,str,bool,str,str,list[str],list[str],list[str],str,str,str,str,str,date,date,str,f64
"""Backend Develo…","""Praha, Czech R…",,,"""IBM""",,"[""Docker"", ""GraphQL"", … ""Cloud""]","[""Container"", ""Orchestration"", … ""Databases""]","[""Software""]",,"""""","""pl""","""Praha""","""Czech Republic…",2023-03-13,2023-06-05,"""no""",
"""Manufacturing …",,"""Full-Time""",False,,,"[""Sigma""]","[""Tools"", ""Serverless""]","[""Manufacturing""]",,"""""","""en""","""Sterling Heigh…","""United States""",2021-10-09,2023-05-24,"""no""",
"""Design Enginee…","""520 S Byrkit S…","""Full-Time""",,"""ABI Attachment…","""Bachelors""",[],[],"[""Design""]","""Senior IC""","""""","""en""","""Mishawaka""","""United States""",2023-04-28,2023-05-19,"""no""",
"""Cybersecurity …",,"""Full-Time""",False,,"""Bachelors""","[""AWS"", ""Qualys"", ""Splunk""]","[""Compute"", ""Logging"", … ""Security""]","[""Cybersecurity"", ""Security""]",,"""""","""en""","""Herndon""","""United States""",2023-04-03,2023-05-28,"""no""",
"""Your Career so…","""Kolkata, India…","""Full-Time""",False,"""Adeeba e Servi…",,"[""Objective"", ""C"", … ""Git""]","[""Cloud"", ""Native"", … ""Control""]","[""Software""]",,"""""","""en""","""Kolkata""","""India""",2017-01-17,2023-05-30,"""no""",


In [22]:
whole_df.filter(pl.col('compensation')>0).shape

(14962, 18)

#### 4. Identify dirty categories


In [23]:
whole_df.select(pl.col('job_name').value_counts(sort=True)).transpose()

column_0,column_1,column_2,column_3,column_4,column_5,column_6,column_7,column_8,column_9,column_10,column_11,column_12,column_13,column_14,column_15,column_16,column_17,column_18,column_19,column_20,column_21,column_22,column_23,column_24,column_25,column_26,column_27,column_28,column_29,column_30,column_31,column_32,column_33,column_34,column_35,column_36,…,column_66296,column_66297,column_66298,column_66299,column_66300,column_66301,column_66302,column_66303,column_66304,column_66305,column_66306,column_66307,column_66308,column_66309,column_66310,column_66311,column_66312,column_66313,column_66314,column_66315,column_66316,column_66317,column_66318,column_66319,column_66320,column_66321,column_66322,column_66323,column_66324,column_66325,column_66326,column_66327,column_66328,column_66329,column_66330,column_66331,column_66332
struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],…,struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2],struct[2]
"{""Software Engineer"",804}","{""Senior Software Engineer"",609}","{""Product Manager"",452}","{""Data Engineer"",372}","{""Project Engineer"",370}","{""Security Officer"",360}","{""DevOps Engineer"",350}","{""Electrical Engineer"",350}","{""Program Manager"",341}","{""Data Analyst"",317}","{""Mechanical Engineer"",276}","{""Software Developer"",263}","{""Full Stack Developer"",256}","{""Data Scientist"",245}","{""Systems Engineer"",236}","{""Network Engineer"",233}","{""Security Guard"",232}","{""Quality Engineer"",230}","{""Retail Front End Supervisor"",202}","{""Process Engineer"",195}","{""Manufacturing Engineer"",193}","{""Engineering Manager"",190}","{""Senior Data Engineer"",188}","{""Application Developer: Cloud FullStack"",184}","{""Senior DevOps Engineer"",163}","{""Sales Engineer"",161}","{""Senior Product Manager"",161}","{""Senior Software Developer"",157}","{""Technical Writer"",151}","{""Field Service Engineer"",150}","{""Site Reliability Engineer"",147}","{""Product Owner"",144}","{""Android Developer"",133}","{""Backend Developer"",132}","{""Civil Engineer"",130}","{""Engineer"",127}","{""QA Engineer"",124}",…,"{""Wind Engineer (Offshore) - Expression of Interest"",1}","{""Building Electrical Engineering Intern - Summer 2024"",1}","{""Principal Research Engineer, Dynamically Reconfigurable Real-time Systems"",1}","{""Software Engineer (Backend NodeJS) - Flutter Studios"",1}","{""Environmental Engineer, Scientist, or Geologist"",1}","{""Security Field Supervisor Armed - County"",1}","{""Engineer Apprentice"",1}","{""Cloud Analyst"",1}","{""Security Attendant (Seasonal)"",1}","{""Junior développeur Front/back-end / NodeJS – Fintech"",1}","{""2024 Summer Undergraduate Intern/Co-op - Manufacturing Engineer"",1}","{""Automation QA Engineer (Backup)"",1}","{""Embedded C Software Engineer with Classic AUTOSAR for ADAS Integration Platform, Engineering Center, Sibiu"",1}","{""Switchgear Quotations Engineer"",1}","{""SWQA Automation and Tools Development 
Engineer"",1}","{""Data Engineer (Questionnaire)"",1}","{""Werkstudent (m/w/d) für Software-Tests"",1}","{""Chemical Process Engineer Lead"",1}","{""Data Scientist, AVP - People Analytics"",1}","{""Requirements & Systems Engineer"",1}","{""Test Automation Engineer (234406)"",1}","{""Designer - Design Studios"",1}","{""Mobile Developer iOS/Android"",1}","{""Oracle Integration Cloud Developer�// Remote Mexico"",1}","{""Softwaretester - Luftfahrt (all gender)"",1}","{""Reverse Engineering Analyst (8624)"",1}","{""Senior Full Stack Software Engineer (Remote)"",1}","{""Cloud Information Systems Security Specialist (Active Secret)"",1}","{""(Sr.) Product Manager"",1}","{""Cost Engineer (Life Sciences/Pharma/Data Centres)"",1}","{""Blockchain Security Engineer (Contractor)"",1}","{""Senior Cybersecurity Penetration Test Specialist"",1}","{""Transitioning Military Talent - Field Service Engineer Opportunities"",1}","{""Security Technology Sales Engineer"",1}","{""Fire Protection Engineer (3+ years)"",1}","{""Computer Vision Engineer (Chennai)"",1}","{""Associate Engineering Specialist - FITS 083"",1}"


In [24]:
whole_df.select(pl.col('company_name').value_counts(sort=True))

company_name
struct[2]
"{null,6266}"
"{""IBM"",2683}"
"{""Allied Universal"",1057}"
"{""CLBPTS"",668}"
"{""Bosch Group"",533}"
…
"{""Ground Recruitment"",1}"
"{""ramblr.ai"",1}"
"{""UPL-"",1}"
"{""Mogo Finance Technology"",1}"


In [25]:
(whole_df
 .group_by('company_name')
 .agg(pl.col('company_name').count().alias('count'))
 .filter(pl.col('count')>1)
 .sort('count', descending=True)
)

company_name,count
str,u32
"""IBM""",2683
"""Allied Univers…",1057
"""CLBPTS""",668
"""Bosch Group""",533
"""Schneider Elec…",397
…,…
"""Amazon (China)…",2
"""Gap""",2
"""Cherry Venture…",2
"""collectAI""",2


In [26]:
(whole_df
 .group_by('job_name')
 .agg(pl.col('job_name').count().alias('count'))
 .filter(pl.col('count')>2)
 .sort('count', descending=True)
)

job_name,count
str,u32
"""Software Engin…",804
"""Senior Softwar…",609
"""Product Manage…",452
"""Data Engineer""",372
"""Project Engine…",370
…,…
"""Data Engineer …",3
"""Clinical Engin…",3
"""Information Se…",3
"""Junior Civil E…",3


In [27]:
data_job = (whole_df
 .filter(pl.col('job_name').str.contains('(?i)data'))
 .group_by('job_name')
 .agg(pl.col('job_name').count().alias('count'))
 .sort('count', descending=True)
)

In [28]:
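# `!pip` can install into a different environment than the one the kernel runs in,
# which would explain the ModuleNotFoundError raised on import; `%pip` targets the kernel's environment
%pip install thefuzz

In [29]:
from thefuzz import fuzz, process

In [None]:
# process.extract expects a collection of choices, so pass all job titles at once
process.extract('data', data_job['job_name'].to_list(), scorer=fuzz.ratio)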
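
For comparison, here is a minimal sketch of the same idea using only the standard library. It swaps `thefuzz` for `difflib`, so the scores are not the same as `fuzz.ratio`, and the job titles and the `0.8` cutoff are made up for illustration.

```python
from difflib import get_close_matches

# Made-up job titles, including a misspelling and a prefixed variant
titles = [
    "Data Engineer",
    "Data Enginer",
    "Sr. Data Engineer",
    "Data Analyst",
    "Product Manager",
]

# Titles whose similarity to the canonical name is at least the cutoff,
# ordered from most to least similar
canonical = "Data Engineer"
matches = get_close_matches(canonical, titles, n=5, cutoff=0.8)

print(matches)
```

Near-duplicates found this way are candidates for being merged into one clean category, while clearly different titles such as "Data Analyst" stay below the cutoff.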

In [None]:
whole_df.select(pl.col('company_name').value_counts(sort=True)).transpose()

In [None]:
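# assumption: newss_df is the 2024 part of the cleaned data; it is not defined earlier in this notebook
newss_df = whole_df.filter(pl.col('new') == 'yes')
newss_df.select(pl.col('seniority').value_counts(sort=True)).transpose()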

In [None]:
newss_df.select(pl.col('hours').value_counts(sort=True)).transpose()

In [None]:
newss_df.select(pl.col('language').value_counts(sort=True)).transpose()

In [None]:
newss_df.select(pl.col('country').value_counts(sort=True)).transpose()

In [None]:
date_data_new = newss_df.select(cs.date())
bool_data_new = newss_df.select(cs.by_dtype(pl.Boolean))
string_data_new = newss_df.select(cs.string(include_categorical=True))
nested_data_new = newss_df.select(cs.by_name('tags_matched', 'tag_categories','categories'))
num_data_new = newss_df.select(cs.float())

In [None]:
newss_df.select(pl.col('job_location').value_counts(sort=True))

In [None]:
print(f'date type columns:{date_data_new.columns}')
print(f'bool type columns:{bool_data_new.columns}')
print(f'string type columns:{string_data_new.columns}')
print(f'nested type columns:{nested_data_new.columns}')

In [None]:
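# assumption: oldss_df is the 2023 part of the cleaned data; it is not defined earlier in this notebook
oldss_df = whole_df.filter(pl.col('new') == 'no')
date_data_old = oldss_df.select(cs.date())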
bool_data_old = oldss_df.select(cs.by_dtype(pl.Boolean))
string_data_old = oldss_df.select(cs.string(include_categorical=True))

In [None]:
print(f'date type columns:{date_data_old.columns}')
print(f'bool type columns:{bool_data_old.columns}')
print(f'string type columns:{string_data_old.columns}')

In [None]:
missing_old = (
    oldss_df.select(pl.all().is_null().sum())
    .melt(value_name="missing")
    .filter(pl.col("missing") > 0)
)

In [None]:
missing_new = (
    newss_df.select(pl.all().is_null().sum())
    .melt(value_name="missing")
    .filter(pl.col("missing") > 0)
)

In [None]:
missing_old.transpose()

In [None]:
missing_new.transpose()

In [None]:
print(string_data_new)

In [None]:
import missingno as msno