# Data Job Postings Analysis

In this case study, we will analyze a dataset of job postings for data science and analytics roles. The dataset contains various columns such as job title, company, location, salary, job description, and required skills. However, some columns may have mostly missing values, which we will need to handle during preprocessing. For example, most companies may prefer not to disclose salary information, resulting in a column with many null values.


In [1]:
import polars as pl

## Load Dataset


The dataset is uploaded to a GitHub repository for convenience. You can also access the original dataset on Hugging Face Datasets at [`lukebarousse/data_jobs`](https://huggingface.co/datasets/lukebarousse/data_jobs). The original CSV file is quite large (~230 MB) due to the large number of rows and text fields, but it has been converted to Parquet format for this case study, which significantly reduces the file size to ~30 MB while preserving all data.


In [2]:
df = pl.read_parquet(
    "https://github.com/bdi593/datasets/raw/refs/heads/main/data-jobs/data_jobs.parquet"
)
df

job_title_short,job_title,job_location,job_via,job_schedule_type,job_work_from_home,search_location,job_posted_date,job_no_degree_mention,job_health_insurance,job_country,salary_rate,salary_year_avg,salary_hour_avg,company_name,job_skills,job_type_skills
str,str,str,str,str,bool,str,str,bool,bool,str,str,f64,f64,str,str,str
"""Senior Data Engineer""","""Senior Clinical Data Engineer …","""Watertown, CT""","""via Work Nearby""","""Full-time""",false,"""Texas, United States""","""2023-06-16 13:44:15""",false,false,"""United States""",,,,"""Boehringer Ingelheim""",,
"""Data Analyst""","""Data Analyst""","""Guadalajara, Jalisco, Mexico""","""via BeBee México""","""Full-time""",false,"""Mexico""","""2023-01-14 13:18:07""",false,false,"""Mexico""",,,,"""Hewlett Packard Enterprise""","""['r', 'python', 'sql', 'nosql'…","""{'analyst_tools': ['power bi',…"
"""Data Engineer""","""Data Engineer/Scientist/Analys…","""Berlin, Germany""","""via LinkedIn""","""Full-time""",false,"""Germany""","""2023-10-10 13:14:55""",false,false,"""Germany""",,,,"""ALPHA Augmented Services""","""['python', 'sql', 'c#', 'azure…","""{'analyst_tools': ['dax'], 'cl…"
"""Data Engineer""","""LEAD ENGINEER - PRINCIPAL ANAL…","""San Antonio, TX""","""via Diversity.com""","""Full-time""",false,"""Texas, United States""","""2023-07-04 13:01:41""",true,false,"""United States""",,,,"""Southwest Research Institute""","""['python', 'c++', 'java', 'mat…","""{'cloud': ['aws'], 'libraries'…"
"""Data Engineer""","""Data Engineer- Sr Jobs""","""Washington, DC""","""via Clearance Jobs""","""Full-time""",false,"""Sudan""","""2023-08-07 14:29:36""",false,false,"""Sudan""",,,,"""Kristina Daniel""","""['bash', 'python', 'oracle', '…","""{'cloud': ['oracle', 'aws'], '…"
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
"""Software Engineer""","""DevOps Engineer""","""Singapura""","""melalui Trabajo.org""","""Pekerjaan tetap""",false,"""Singapore""","""2023-03-13 06:16:16""",false,false,"""Singapore""",,,,"""CAREERSTAR INTERNATIONAL PTE. …","""['bash', 'python', 'perl', 'li…","""{'os': ['linux', 'unix'], 'oth…"
"""Data Analyst""","""CRM Data Analyst""","""Bad Rodach, Jerman""","""melalui BeBee Deutschland""","""Pekerjaan tetap""",false,"""Germany""","""2023-03-12 06:18:18""",false,false,"""Germany""",,,,"""HABA FAMILYGROUP""","""['sas', 'sas', 'sql', 'excel']""","""{'analyst_tools': ['sas', 'exc…"
"""Business Analyst""","""Commercial Analyst - Start Now""","""Malaysia""","""melalui Ricebowl""","""Pekerjaan tetap""",false,"""Malaysia""","""2023-03-12 06:32:36""",false,false,"""Malaysia""",,,,"""Lendlease Corporation""","""['powerpoint', 'excel']""","""{'analyst_tools': ['powerpoint…"
"""Data Engineer""","""Principal Associate, Data Engi…","""Newark, New Jersey, Amerika Se…","""melalui Recruit.net""","""Pekerjaan tetap""",false,"""Sudan""","""2023-03-12 06:32:15""",false,false,"""Sudan""",,,,"""Capital One""","""['python', 'go', 'nosql', 'sql…","""{'cloud': ['aws', 'snowflake',…"


### Check the Number of Rows

The number of rows and columns in the dataset can be checked using the `shape` attribute of the Polars DataFrame. The `shape` attribute returns a tuple containing the number of rows and columns in the DataFrame.


In [3]:
df.shape

(785741, 17)

### Check Schema


Check the schema to understand the data types of each column.


In [4]:
df.schema

Schema([('job_title_short', String),
        ('job_title', String),
        ('job_location', String),
        ('job_via', String),
        ('job_schedule_type', String),
        ('job_work_from_home', Boolean),
        ('search_location', String),
        ('job_posted_date', String),
        ('job_no_degree_mention', Boolean),
        ('job_health_insurance', Boolean),
        ('job_country', String),
        ('salary_rate', String),
        ('salary_year_avg', Float64),
        ('salary_hour_avg', Float64),
        ('company_name', String),
        ('job_skills', String),
        ('job_type_skills', String)])

## Preprocess Dataset


### Check Missing Values


Check the number of missing values.


In [5]:
df.null_count()

job_title_short,job_title,job_location,job_via,job_schedule_type,job_work_from_home,search_location,job_posted_date,job_no_degree_mention,job_health_insurance,job_country,salary_rate,salary_year_avg,salary_hour_avg,company_name,job_skills,job_type_skills
u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32
0,1,1045,8,12667,0,0,0,0,0,49,752674,763738,775079,1,117037,117037


The output is in a "wide" format, which makes it difficult to compare the number of missing values across columns. We can transpose the output to make it easier to read and analyze.


In [6]:
df.null_count().transpose(include_header=True)

column,column_0
str,u32
"""job_title_short""",0
"""job_title""",1
"""job_location""",1045
"""job_via""",8
"""job_schedule_type""",12667
…,…
"""salary_year_avg""",763738
"""salary_hour_avg""",775079
"""company_name""",1
"""job_skills""",117037


We can go one step further and only display columns with one or more missing values to focus our attention on the columns that require preprocessing.


In [7]:
(
    df.null_count()
    .transpose(include_header=True, header_name="column", column_names=["null_count"])
    .filter(pl.col("null_count") > 0)
)

column,null_count
str,u32
"""job_title""",1
"""job_location""",1045
"""job_via""",8
"""job_schedule_type""",12667
"""job_country""",49
…,…
"""salary_year_avg""",763738
"""salary_hour_avg""",775079
"""company_name""",1
"""job_skills""",117037


While this output is helpful, Polars displays at most 10 rows by default, so some columns with missing values are hidden behind the `…` row. To see all of them, we can temporarily raise the maximum number of displayed rows to a higher value, such as 50.


In [8]:
# Display up to 50 rows of the null count table
with pl.Config(tbl_rows=50):
    display(
        (
            df.null_count()
            .transpose(
                include_header=True, header_name="column", column_names=["null_count"]
            )
            .filter(pl.col("null_count") > 0)
        )
    )

column,null_count
str,u32
"""job_title""",1
"""job_location""",1045
"""job_via""",8
"""job_schedule_type""",12667
"""job_country""",49
"""salary_rate""",752674
"""salary_year_avg""",763738
"""salary_hour_avg""",775079
"""company_name""",1
"""job_skills""",117037
"""job_type_skills""",117037


:::{tip} What does the `with` statement do in Python?

The `with` statement in Python is used to wrap the execution of a block of code with methods defined by a context manager. It is commonly used for resource management, such as opening files or managing database connections, ensuring that resources are properly released after their use, even if an error occurs.

In the code example above, the `with` statement temporarily raises the maximum number of rows displayed to 50. While the block inside the `with` statement runs, Polars shows up to 50 rows of a DataFrame before truncating the output. After the block finishes, the limit returns to its previous setting.

The default maximum number of rows displayed in Polars is 10, so using the `with` statement allows you to temporarily change this setting for a specific block of code without affecting the global configuration.

:::
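To make the "set, run, restore" behavior concrete, here is a minimal sketch of a context manager written with the standard library. The `settings` dictionary and the `temporary_setting` helper are hypothetical names invented for this illustration; they only mimic the idea behind `pl.Config` and are not how Polars implements it.

```python
from contextlib import contextmanager

# Hypothetical settings store standing in for a library's global configuration
settings = {"tbl_rows": 10}


@contextmanager
def temporary_setting(name, value):
    """Set settings[name] to value inside the block, then restore the old value."""
    previous = settings[name]
    settings[name] = value
    try:
        yield
    finally:
        # Runs on normal exit and also when the block raises an error
        settings[name] = previous


with temporary_setting("tbl_rows", 50):
    inside = settings["tbl_rows"]  # value seen inside the block

after = settings["tbl_rows"]  # value once the block has finished
print(inside, after)
```

The `finally` clause is what guarantees that the previous value comes back even if the code inside the block fails.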


### Drop Rows with Missing Job Title


There is one row with a null value in the `job_title` column. Filter the DataFrame to show only this row.


In [9]:
df.filter(df["job_title"].is_null())

job_title_short,job_title,job_location,job_via,job_schedule_type,job_work_from_home,search_location,job_posted_date,job_no_degree_mention,job_health_insurance,job_country,salary_rate,salary_year_avg,salary_hour_avg,company_name,job_skills,job_type_skills
str,str,str,str,str,bool,str,str,bool,bool,str,str,f64,f64,str,str,str
"""Data Engineer""",,,"""via Jobs In France - Mustakbil…",,False,"""Saint Lucia""","""2023-12-14 07:32:05""",False,False,,,,,,,


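The dataset has over 780,000 rows, but only one of them has a null value in the `job_title` column. The same row is also missing `company_name`, `job_location`, and most other fields, so it carries almost no usable information and is a candidate for removal.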


In [10]:
df.filter(pl.col("job_skills").is_not_null()).select(pl.col("job_skills")).row(0)

("['r', 'python', 'sql', 'nosql', 'power bi', 'tableau']",)

### Parse `"job_posted_date"` Column

The `"job_posted_date"` column contains date information in string format.


In [37]:
df["job_posted_date"].head(5)

job_posted_date
str
"""2023-06-16 13:44:15"""
"""2023-01-14 13:18:07"""
"""2023-10-10 13:14:55"""
"""2023-07-04 13:01:41"""
"""2023-08-07 14:29:36"""


Check the data type of the `"job_posted_date"` column to confirm that it is currently stored as a string.


In [35]:
df.schema["job_posted_date"]

String

We can parse this column into a proper datetime type using the `pl.col().str.to_datetime()` method in Polars.

If you don't specify the date format, Polars will attempt to infer it. Since the values in the `"job_posted_date"` column follow a consistent pattern (e.g., "2023-06-16 13:44:15"), Polars should be able to parse them correctly without an explicit format.

However, if you want to make the expected layout explicit, you can pass it through the `format` parameter of `str.to_datetime()` (for example, `"%Y-%m-%d %H:%M:%S"`). The `strict` parameter controls whether values that fail to parse raise an error or become null.


In [None]:
df = df.with_columns(pl.col("job_posted_date").str.to_datetime())

df.select(pl.col("job_posted_date")).head(5)

job_posted_date
datetime[μs]
2023-06-16 13:44:15
2023-01-14 13:18:07
2023-10-10 13:14:55
2023-07-04 13:01:41
2023-08-07 14:29:36


In [40]:
df.schema["job_posted_date"]

Datetime(time_unit='us', time_zone=None)

In [None]:
df.select(pl.col("job_posted_date").dt.month().value_counts())

job_posted_date
struct[2]
"{3,64084}"
"{9,62359}"
"{2,64578}"
"{10,66611}"
"{8,75162}"
…
"{4,62919}"
"{7,63777}"
"{6,61572}"
"{1,91822}"


### Parse `"job_skills"` Column


The `"job_skills"` column is stored as a string that looks like a list, but it is not actually a list data type.

We can check the data type of the `"job_skills"` column to confirm that it is currently stored as a string. The output should indicate that the data type of the `"job_skills"` column is `String`.


In [11]:
df.filter(pl.col("job_skills").is_not_null()).select(pl.col("job_skills")).head(5)

job_skills
str
"""['r', 'python', 'sql', 'nosql'…"
"""['python', 'sql', 'c#', 'azure…"
"""['python', 'c++', 'java', 'mat…"
"""['bash', 'python', 'oracle', '…"
"""['python', 'sql', 'gcp']"""


In [12]:
df.schema["job_skills"]

String

To convert it to a list, we can use the `str.replace_all()` method to replace single quotes with double quotes, and then use the `str.json_decode()` method to parse the string as JSON. This will give us a proper list of skills for each job.

:::{tip} Why do we need to replace single quotes with double quotes before parsing the string as JSON?

The JSON format requires that string values be enclosed in double quotes. If the string uses single quotes, it will not be valid JSON and the `str.json_decode()` method will fail to parse it correctly. By replacing single quotes with double quotes, we ensure that the string conforms to the JSON format, allowing us to successfully decode it into a list data type.

:::
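The effect of the quote replacement can be checked with Python's built-in `json` module. The sketch below also shows a caveat: a value containing an apostrophe would break after a blanket replacement. The `"driver's license"` entry is a hypothetical value made up for this illustration; the approach works for this dataset because the skill names shown contain no apostrophes.

```python
import ast
import json

raw = "['r', 'python', 'sql']"

# Single-quoted strings are not valid JSON
try:
    json.loads(raw)
    parsed_as_is = True
except json.JSONDecodeError:
    parsed_as_is = False

# After swapping the quotes, the same text parses as a JSON array
skills = json.loads(raw.replace("'", '"'))

# Caveat: an apostrophe inside a value turns into a stray double quote
tricky = "['python', \"driver's license\"]"
try:
    json.loads(tricky.replace("'", '"'))
    tricky_ok = True
except json.JSONDecodeError:
    tricky_ok = False

# ast.literal_eval understands Python quoting rules and handles this case
tricky_skills = ast.literal_eval(tricky)
print(parsed_as_is, skills, tricky_ok, tricky_skills)
```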


In [13]:
df = df.with_columns(
    pl.when(pl.col("job_skills").is_not_null())
    .then(
        pl.col("job_skills")
        .str.replace_all("'", '"')
        .str.json_decode(dtype=pl.List(pl.Utf8))
    )
    .otherwise(None)
    .alias("job_skills")
)

df.filter(pl.col("job_skills").is_not_null()).select(pl.col("job_skills")).head(5)

job_skills
list[str]
"[""r"", ""python"", … ""tableau""]"
"[""python"", ""sql"", … ""jenkins""]"
"[""python"", ""c++"", … ""pytorch""]"
"[""bash"", ""python"", … ""git""]"
"[""python"", ""sql"", ""gcp""]"


Print the first non-null value in the `"job_skills"` column to see a non-truncated output.


In [14]:
df.select(pl.col("job_skills")).filter(pl.col("job_skills").is_not_null()).row(0)[0]

['r', 'python', 'sql', 'nosql', 'power bi', 'tableau']

Verify that the `"job_skills"` column has been successfully parsed as a list by checking the data type of the column. The output should indicate that the data type of the `"job_skills"` column is now a list.


In [15]:
df.schema["job_skills"]

List(String)

#### How many jobs require "Python" as a skill?

We can use the `list.contains()` method to filter the DataFrame for rows where the `"job_skills"` list contains a given skill. Skill names are stored in lowercase, so the value to search for is `"python"`, not `"Python"`.

The code below keeps only the rows whose `"job_skills"` list contains `"python"`.


In [16]:
df.filter(pl.col("job_skills").list.contains("python")).select(
    pl.col("job_title"), pl.col("company_name"), pl.col("job_skills")
)

job_title,company_name,job_skills
str,str,list[str]
"""Data Analyst""","""Hewlett Packard Enterprise""","[""r"", ""python"", … ""tableau""]"
"""Data Engineer/Scientist/Analys…","""ALPHA Augmented Services""","[""python"", ""sql"", … ""jenkins""]"
"""LEAD ENGINEER - PRINCIPAL ANAL…","""Southwest Research Institute""","[""python"", ""c++"", … ""pytorch""]"
"""Data Engineer- Sr Jobs""","""Kristina Daniel""","[""bash"", ""python"", … ""git""]"
"""GCP Data Engineer""","""smart folks inc""","[""python"", ""sql"", ""gcp""]"
…,…,…
"""Data Engineer""","""Shamrock Trading Corporation""","[""nosql"", ""mongodb"", … ""git""]"
"""Data Engineer (f/m/d)""","""Heidelberg Materials""","[""python"", ""c#"", … ""terraform""]"
"""Senior Data Engineer""","""Pure App""","[""sql"", ""python"", … ""docker""]"
"""DevOps Engineer""","""CAREERSTAR INTERNATIONAL PTE. …","[""bash"", ""python"", … ""ansible""]"


The `height` attribute can be used to count the number of rows that match this condition.

It is equivalent to using the `len()` function, or `.shape[0]` to get the number of rows in the filtered DataFrame.


In [17]:
df.filter(pl.col("job_skills").list.contains("python")).height

380909

### Parse `"job_type_skills"` column


Similar to the `"job_skills"` column, the `"job_type_skills"` column is also stored as a string. However, the column contains dictionary-like string values as opposed to the list-like string values in the `"job_skills"` column.

Print the first five non-null values in the `"job_type_skills"` column to understand its structure and confirm that it is stored as a string.


In [18]:
df.select("job_type_skills").filter(pl.col("job_type_skills").is_not_null()).head(5)

job_type_skills
str
"""{'analyst_tools': ['power bi',…"
"""{'analyst_tools': ['dax'], 'cl…"
"""{'cloud': ['aws'], 'libraries'…"
"""{'cloud': ['oracle', 'aws'], '…"
"""{'cloud': ['gcp'], 'programmin…"


To see a non-truncated view of the first row, retrieve the first non-null value in the `"job_type_skills"` column and print it.


In [19]:
df.select("job_type_skills").filter(pl.col("job_type_skills").is_not_null()).row(0)[0]

"{'analyst_tools': ['power bi', 'tableau'], 'programming': ['r', 'python', 'sql', 'nosql']}"

Check the data type of the `"job_type_skills"` column to confirm that it is currently stored as a string.


In [20]:
df.schema["job_type_skills"]

String

The output shows that the `"job_type_skills"` column contains string representations of dictionaries, where each dictionary has a skill category as the key and a list of skills as the value. For example, one of the values is (added line breaks for readability):

```
"{
    'analyst_tools': ['power bi', 'tableau'],
    'programming': ['r', 'python', 'sql', 'nosql']
}"
```

There are two keys in the dictionary: `"analyst_tools"` and `"programming"`. The value for each key is a list of skills relevant to that job type. For instance, the "programming" key has a list of programming languages and technologies such as "r", "python", "sql", and "nosql".


:::{tip} Structs in Polars

A `Struct` in Polars is:

> A single column that contains multiple named, typed subcolumns.

Example:

`Struct({
    analyst_tools: List(Utf8),
    programming: List(Utf8)
})`

Visually:

| job_type_skills                            |
| ------------------------------------------ |
| {analyst_tools: [...], programming: [...]} |
| {analyst_tools: [...], programming: [...]} |

But internally it is not stored row-by-row like Python dictionaries. While you don't need to understand the internal storage format of `Struct` to work with it, it is helpful to know that it is not stored as a string, but rather as a structured data type that allows for efficient querying and manipulation of the nested data.

Polars is built on Apache Arrow, which is:

- Columnar
- Typed
- Memory-contiguous
- Zero-copy friendly

Each field of a Polars `Struct` column is stored as its own full column, which allows for efficient access and manipulation of the nested data without needing to parse strings or perform complex operations on row-by-row data.

So instead of:

```
Row 1 → {a:1, b:2}
Row 2 → {a:3, b:4}
```

Memory looks like:

```
a column → [1, 3]
b column → [2, 4]
```

The struct is just a logical grouping.

:::


:::{attention} What if the dictionary-like string values have varying keys across rows?

If the dictionary-like string values in the `"job_type_skills"` column have varying keys across rows, it can pose challenges for parsing and analyzing the data. In such cases, you may need to:

1. **Approach 1**: Use a parser that converts a string containing a Python literal into an actual Python object, such as `ast.literal_eval()` from the `ast` module in Python.
   - This is the most flexible option when the keys vary widely across rows, because it can parse any valid Python literal. Unlike `eval()`, it accepts only literals and does not execute arbitrary code. However, it runs row by row in Python, so it is slow on large datasets, and very large or deeply nested input can still exhaust memory or hit the recursion limit.
2. **Approach 2**: If you know all possible keys in advance, you can create a schema for the `Struct` and use conditional logic to handle missing keys when parsing the string values.
   - This is more efficient when the set of possible keys is limited and known, because the string values are parsed directly into a structured format without a row-by-row Python step.
3. **Approach 3**: Find all unique keys across the dataset and create a schema that includes all possible keys, then parse the string values accordingly.
   - While this is the slowest approach, because it needs an extra pass over the data to collect the keys, it gives you a predefined schema for the `Struct` that covers every key.

:::


#### Approach 1: Use `ast.literal_eval()` to parse the string values into dictionaries

The `ast.literal_eval()` function from the `ast` module in Python can be used to evaluate a string containing a Python literal (like a dictionary) and convert it into the corresponding Python data structure. This approach is flexible and can handle varying keys across rows. Unlike `eval()`, it accepts only literal syntax and does not execute arbitrary code, but it is applied row by row in Python and is therefore slow on a dataset of this size.
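The difference between parsing a literal and evaluating code can be seen in a short standard-library sketch. Both input strings below are made up for illustration.

```python
import ast

# A dictionary literal is parsed into a real Python dict
parsed = ast.literal_eval("{'cloud': ['aws'], 'programming': ['python', 'sql']}")

# Anything that is not a plain literal, such as a function call, is rejected
try:
    ast.literal_eval("__import__('os').getcwd()")
    rejected = False
except ValueError:
    rejected = True

print(parsed, rejected)
```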

Although the sample code is provided below, we will not run it in the notebook due to its slow execution time. You can try running it on your own machine if you have sufficient resources, but be aware that it may take a long time to execute.

:::{danger} Slow execution warning!

The code below uses `ast.literal_eval()` to parse the string values in the `"job_type_skills"` column into dictionaries, and uses `%%time` to measure the execution time of this code block.

```python
%%time

import ast

df.with_columns(
    pl.when(pl.col("job_type_skills").is_not_null())
    .then(
        pl.col("job_type_skills").map_elements(
            ast.literal_eval,
            return_dtype=pl.Object,  # important: avoid inference surprises
        )
    )
    .otherwise(None)
    .alias("job_type_skills")
)

```

:::


#### Approach 2: Create a schema for the `Struct` in advance

This approach can only be used if you know all possible keys in advance. Because we don't know all possible keys in the `"job_type_skills"` column, we will skip this approach for now. However, if you have a limited and known set of keys, you can create a schema for the `Struct` and use conditional logic to handle missing keys when parsing the string values.


#### Approach 3: Find all unique keys across the dataset and create a schema

This will be the slowest approach, but it produces a `Struct` column instead of a generic `Object` column, which lets you work with the nested data more efficiently in Polars.


First, find all unique keys across the dataset in the `"job_type_skills"` column.


In [None]:
%%time

import ast

keys = (
    df.filter(pl.col("job_type_skills").is_not_null())
    .select(
        pl.col("job_type_skills")
        .map_elements(ast.literal_eval, return_dtype=pl.Object)
        .map_elements(lambda d: list(d.keys()), return_dtype=pl.List(pl.Utf8))
        .alias("keys")
    )
    .explode("keys")
    .unique()
)

keys

CPU times: user 18.5 s, sys: 1.11 s, total: 19.6 s
Wall time: 19.8 s


keys
str
"""databases"""
"""os"""
"""webframeworks"""
"""analyst_tools"""
"""programming"""
"""cloud"""
"""libraries"""
"""sync"""
"""async"""
"""other"""


Then, use the list of unique keys to create a schema for the `Struct` and parse the string values in the `"job_type_skills"` column accordingly.


In [23]:
key_list = keys.get_column("keys").to_list()

skills_struct_dtype = pl.Struct([pl.Field(k, pl.List(pl.Utf8)) for k in key_list])

df = df.with_columns(
    pl.when(pl.col("job_type_skills").is_not_null())
    .then(
        pl.col("job_type_skills")
        .str.replace_all("'", '"')
        .str.json_decode(dtype=skills_struct_dtype)
    )
    .otherwise(None)
    .alias("job_type_skills")
)

Confirm that the `"job_type_skills"` column has been successfully parsed as a `Struct` by checking the data type of the column. The output should indicate that the data type of the `"job_type_skills"` column is now a `Struct` with fields corresponding to the unique keys found in the previous step.


In [24]:
df.schema["job_type_skills"]

Struct({'databases': List(String), 'os': List(String), 'webframeworks': List(String), 'analyst_tools': List(String), 'programming': List(String), 'cloud': List(String), 'libraries': List(String), 'sync': List(String), 'async': List(String), 'other': List(String)})

Print the first non-null value in the `"job_type_skills"` column to see the structured data format after parsing it as a `Struct`.


In [25]:
df.select(pl.col("job_type_skills")).filter(
    pl.col("job_type_skills").is_not_null()
).row(0)[0]

{'databases': None,
 'os': None,
 'webframeworks': None,
 'analyst_tools': ['power bi', 'tableau'],
 'programming': ['r', 'python', 'sql', 'nosql'],
 'cloud': None,
 'libraries': None,
 'sync': None,
 'async': None,
 'other': None}

The parsed `Struct` keeps all 10 keys, even if some rows have missing values for certain keys. This allows you to work with the nested data in a consistent way, regardless of whether all keys are present in every row. You can access individual fields with `.struct.field()` or expand all of them into separate columns with `unnest()`.

Below is an example of how to access the "programming" field of the `Struct` in the `"job_type_skills"` column:


In [26]:
df.select(pl.col("job_type_skills").struct.field("programming"))

programming
list[str]
""
"[""r"", ""python"", … ""nosql""]"
"[""python"", ""sql"", ""c#""]"
"[""python"", ""c++"", … ""matlab""]"
"[""bash"", ""python""]"
…
"[""bash"", ""python"", ""perl""]"
"[""sas"", ""sql""]"
""
"[""python"", ""go"", … ""shell""]"


#### How many jobs require "sql" as a programming skill in the `"job_type_skills"` column?


In [None]:
sql_skill_required = df.filter(
    pl.col("job_type_skills").struct.field("programming").list.contains("sql")
).select(pl.col("job_type_skills").struct.field("programming").alias("programming"))

sql_skill_required

programming
list[str]
"[""r"", ""python"", … ""nosql""]"
"[""python"", ""sql"", ""c#""]"
"[""python"", ""sql""]"
"[""sql"", ""python"", ""java""]"
"[""sql"", ""nosql""]"
…
"[""python"", ""sql""]"
"[""python"", ""c#"", … ""sql""]"
"[""sql"", ""python""]"
"[""sas"", ""sql""]"


In [34]:
num_sql_required = sql_skill_required.height
print(
    f"Number of jobs that require 'sql' as a programming skill: {num_sql_required} out of {df.height} total jobs ({num_sql_required / df.height:.1%})."
)

Number of jobs that require 'sql' as a programming skill: 384849 out of 785741 total jobs (49.0%).
