## apply()

#### The apply() function in pandas calls a function on each row or column of a DataFrame, or on each value of a Series. It works like a shortcut for writing a for-loop over rows or columns.

### Note: apply() works on columns (axis=0) or on rows (axis=1).

| Parameter         | Description                                                                                                                                                                                             |
| ----------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **`func`**        | The function to apply to each row or column. Can be a built-in function, lambda, or custom function.                                                                                                    |
| **`axis`**        | Direction to apply the function: <br> `0` or `'index'` → apply to each **column** (default) <br> `1` or `'columns'` → apply to each **row**                                                             |
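| **`raw`**         | If `True`, passes each row or column to the function as a raw NumPy array instead of a Series. This can improve performance when `func` is a NumPy reduction. |
| **`result_type`** | Only used when `axis=1`. Can be: <br> `'expand'`: list-like results become columns <br> `'reduce'`: returns a Series where possible instead of expanding list-like results <br> `'broadcast'`: results are broadcast to the original shape of the DataFrame, keeping the original index and columns <br> `None` (default): the behavior depends on what `func` returns |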
| **`args`**        | A tuple of positional arguments to pass to `func`.                                                                                                                                                      |
| **`**kwds`**      | Additional keyword arguments passed to `func`.                                                                                                                                                          |
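To make the less common parameters concrete, here is a minimal sketch on a small made-up DataFrame. The helper names `scale` and `min_max` are hypothetical and exist only for this illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [10, 20]})

# args: extra positional arguments are forwarded to func after the Series
def scale(col, factor):
    return col * factor

scaled = df.apply(scale, args=(2,))          # each column is multiplied by 2
scaled_kw = df.apply(scale, factor=2)        # same call, passed through **kwds
print(scaled)

# result_type='expand': a list returned for each row becomes separate columns
def min_max(row):
    return [row.min(), row.max()]

expanded = df.apply(min_max, axis=1, result_type='expand')
print(expanded)
```

Without `result_type='expand'`, the second call would return a Series whose values are lists rather than a two-column DataFrame.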


##### df.apply(func, axis=0)  # apply to each column (default)
##### df.apply(func, axis=1)  # apply to each row

### Example: 🔁 axis=0 → Apply function to each column

Think: "Down each column"

Each column (as a Series) is passed one-by-one to the function.

In [37]:
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2],
    'B': [10, 20]
})
df


|     | A   | B   |
| --- | --- | --- |
| 0   | 1   | 10  |
| 1   | 2   | 20  |


In [38]:
df.apply(sum, axis=0)

A     3
B    30
dtype: int64

### Example: ↔️ axis=1 → Apply function to each row

Think: "Across each row"

Each row (as a Series) is passed to the function.

In [39]:
df.apply(lambda row: row['A'] + row['B'], axis=1)


0    11
1    22
dtype: int64

| Method              | Applies to          | Use for                                               |
| ------------------- | ------------------- | ----------------------------------------------------- |
| `df['col'].apply()` | A single **Series** | Element-wise transformation (e.g. clean, map, format) |
| `df.apply(axis=0)`  | Each **column**     | Column-wise ops (e.g. col stats)                      |
| `df.apply(axis=1)`  | Each **row**        | Row-level logic (e.g. combine fields, custom rules)   |
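The three rows of this table can be sketched side by side. This is only an illustration on a made-up two-row DataFrame; the variable names are assumptions, not part of the dataset used later.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [10, 20]})

# Series.apply: the function receives one value at a time
doubled = df['A'].apply(lambda value: value * 2)

# DataFrame.apply(axis=0): the function receives each column as a Series
col_ranges = df.apply(lambda col: col.max() - col.min(), axis=0)

# DataFrame.apply(axis=1): the function receives each row as a Series
row_labels = df.apply(lambda row: f"{row['A']}-{row['B']}", axis=1)

print(doubled.tolist())      # values of column A, doubled
print(col_ranges.to_dict())  # one number per column
print(row_labels.tolist())   # one string per row
```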


In [1]:
# Import libraries and data, then clean up data
import pandas as pd
df=pd.read_csv('/Users/abby/Python_data_project/2_Advanced/data_jobs.csv')
df['job_posted_date']=pd.to_datetime(df['job_posted_date'])


In [4]:
pd.notna(df['salary_year_avg'])  # Boolean mask: True where the salary is present, False where it is NaN
df[pd.notna(df['salary_year_avg'])]  # keep only the rows whose salary_year_avg is not NaN
df[pd.notna(df['salary_year_avg'])]['salary_year_avg']  # then select just the salary column

28        109500.0
77        140000.0
92        120000.0
100       228222.0
109        89000.0
            ...   
785624    139216.0
785641    150000.0
785648    221875.0
785682    157500.0
785692    157500.0
Name: salary_year_avg, Length: 22003, dtype: float64

### Calculate projected salary

A simple first use of apply(): pass each salary to a function that returns the projected value.

In [None]:
def projected_salary(salary):
    return salary * 1.03

# The filter returns a subset of df; call .copy() before adding columns to it so the original DataFrame is left untouched
df_salary = df[pd.notna(df['salary_year_avg'])].copy()

df_salary['salary_year_inflated'] = df_salary['salary_year_avg'].apply(projected_salary)

df_salary[['salary_year_inflated','salary_year_avg']].head(10)



|     | salary_year_inflated | salary_year_avg |
| --- | -------------------- | --------------- |
| 28  | 112785.0             | 109500.0        |
| 77  | 144200.0             | 140000.0        |
| 92  | 123600.0             | 120000.0        |
| 100 | 235068.66            | 228222.0        |
| 109 | 91670.0              | 89000.0         |
| 116 | 117420.0             | 114000.0        |
| 146 | 133385.0             | 129500.0        |
| 180 | 92957.5              | 90250.0         |
| 212 | 162225.0             | 157500.0        |
| 257 | 106221.84            | 103128.0        |


The same calculation with a lambda function instead of a named function:

In [30]:
# The filter returns a subset of df; call .copy() before adding columns to it so the original DataFrame is left untouched
df_salary_1 = df[pd.notna(df['salary_year_avg'])].copy()

df_salary_1['salary_year_inflated']=df_salary_1['salary_year_avg'].apply(lambda salary: salary* 1.03)
df_salary_1[['salary_year_inflated','salary_year_avg']].head(10)

|     | salary_year_inflated | salary_year_avg |
| --- | -------------------- | --------------- |
| 28  | 112785.0             | 109500.0        |
| 77  | 144200.0             | 140000.0        |
| 92  | 123600.0             | 120000.0        |
| 100 | 235068.66            | 228222.0        |
| 109 | 91670.0              | 89000.0         |
| 116 | 117420.0             | 114000.0        |
| 146 | 133385.0             | 129500.0        |
| 180 | 92957.5              | 90250.0         |
| 212 | 162225.0             | 157500.0        |
| 257 | 106221.84            | 103128.0        |


In [31]:
df_salary_2 = df[pd.notna(df['salary_year_avg'])].copy()

df_salary_2['salary_year_inflated']=df_salary_2['salary_year_avg']*1.03

df_salary_2[['salary_year_inflated','salary_year_avg']].head(10)

|     | salary_year_inflated | salary_year_avg |
| --- | -------------------- | --------------- |
| 28  | 112785.0             | 109500.0        |
| 77  | 144200.0             | 140000.0        |
| 92  | 123600.0             | 120000.0        |
| 100 | 235068.66            | 228222.0        |
| 109 | 91670.0              | 89000.0         |
| 116 | 117420.0             | 114000.0        |
| 146 | 133385.0             | 129500.0        |
| 180 | 92957.5              | 90250.0         |
| 212 | 162225.0             | 157500.0        |
| 257 | 106221.84            | 103128.0        |
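The last three cells (named function, lambda, and plain multiplication) all produce the same column. The sketch below checks this on a few salaries copied from the output above. The Series `salaries` is a hypothetical stand-in for `df_salary['salary_year_avg']`.

```python
import pandas as pd

# Hypothetical stand-in for the salary column (values taken from the output above)
salaries = pd.Series([109500.0, 140000.0, 120000.0], name='salary_year_avg')

def projected_salary(salary):
    return salary * 1.03

via_function = salaries.apply(projected_salary)             # named function
via_lambda = salaries.apply(lambda salary: salary * 1.03)   # lambda
vectorized = salaries * 1.03                                # arithmetic on the whole Series

# Largest absolute difference between the approaches
print((via_function - vectorized).abs().max())
print((via_lambda - vectorized).abs().max())
```

For simple arithmetic like this, the vectorized form is usually the better choice because it avoids calling a Python function once per value. apply() becomes useful when the logic cannot be written as column arithmetic.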


## Job Skills Column

In [14]:
# Import libraries and data, then clean up data
import pandas as pd
df=pd.read_csv('/Users/abby/Python_data_project/2_Advanced/data_jobs.csv')
df['job_posted_date']=pd.to_datetime(df['job_posted_date'])

In [20]:
type(df['job_skills'][1])

str

In [19]:
df['job_skills'][1]

"['r', 'python', 'sql', 'nosql', 'power bi', 'tableau']"

## ast.literal_eval()
##### ast.literal_eval() is a function in Python's standard-library **ast module** that safely evaluates a string containing a Python literal, such as a list, dict, tuple, int, float, or boolean.
#### 👉 In other words, it parses the string and converts it into a real Python object, without running arbitrary code.
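A short sketch of what "safely" means here, using only the standard library. The example strings are made up for illustration.

```python
import ast

# Strings that contain plain literals are converted to real objects
skills = ast.literal_eval("['r', 'python', 'sql']")            # str -> list
settings = ast.literal_eval("{'remote': True, 'level': 3}")    # str -> dict

# Anything that is not a plain literal (for example a function call) is rejected
try:
    ast.literal_eval("__import__('os').getcwd()")
    rejected = False
except ValueError:
    rejected = True

print(type(skills).__name__, type(settings).__name__, rejected)
```

This is the main difference from the built-in `eval()`, which would execute the function call in the last string.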

In [26]:
import ast

s= ast.literal_eval(df['job_skills'][1])
s

['r', 'python', 'sql', 'nosql', 'power bi', 'tableau']

In [25]:
type(s)

list

What happens if we call this function on the whole column?

In [None]:
df['job_skills']=ast.literal_eval(df['job_skills'])
# This raises a ValueError: literal_eval expects a single string, not a whole Series
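The failure can be reproduced on a tiny example. The two-row Series below is a hypothetical stand-in for `df['job_skills']`, with one missing value to mimic the NaN entries in the real column.

```python
import ast
import pandas as pd

# Hypothetical stand-in for df['job_skills']; the second value is missing
job_skills = pd.Series(["['r', 'python']", float('nan')])

errors = {}
for label, value in [('whole Series', job_skills), ('NaN value', job_skills[1])]:
    try:
        ast.literal_eval(value)   # only a single string (or AST node) is accepted
    except ValueError as exc:
        errors[label] = type(exc).__name__

print(errors)
```

Both calls fail for the same reason: the argument is not a string. That is why the column has to be processed one value at a time, and why missing values need a guard such as `pd.notna()`.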

#### Use apply() to call this function on each value of the column

In [29]:
'''
def clean_list(job_skill):
    return ast.literal_eval(job_skill)

df['job_skills']=df['job_skills'].apply(clean_list)
df['job_skills']
# ValueError: malformed node or string: nan
'''
# Import libraries and data, then clean up data
import pandas as pd
import ast
df=pd.read_csv('/Users/abby/Python_data_project/2_Advanced/data_jobs.csv')
df['job_posted_date']=pd.to_datetime(df['job_posted_date'])

def clean_list(job_skill):
    if pd.notna(job_skill):
        return ast.literal_eval(job_skill)

df['job_skills']=df['job_skills'].apply(clean_list)
df['job_skills']


0                                                      None
1                [r, python, sql, nosql, power bi, tableau]
2         [python, sql, c#, azure, airflow, dax, docker,...
3         [python, c++, java, matlab, aws, tensorflow, k...
4         [bash, python, oracle, aws, ansible, puppet, j...
                                ...                        
785736    [bash, python, perl, linux, unix, kubernetes, ...
785737                               [sas, sas, sql, excel]
785738                                  [powerpoint, excel]
785739    [python, go, nosql, sql, mongo, shell, mysql, ...
785740                                          [aws, flow]
Name: job_skills, Length: 785741, dtype: object

#### Another way: use a lambda instead of defining a function

This version returns the original NaN for missing values, whereas clean_list returns None.

In [34]:
import pandas as pd
import ast
df=pd.read_csv('/Users/abby/Python_data_project/2_Advanced/data_jobs.csv')
df['job_posted_date']=pd.to_datetime(df['job_posted_date'])

df['job_skills']=df['job_skills'].apply(lambda job_skill: ast.literal_eval(job_skill) if pd.notna(job_skill) else job_skill)
df['job_skills'].head()

0                                                  NaN
1           [r, python, sql, nosql, power bi, tableau]
2    [python, sql, c#, azure, airflow, dax, docker,...
3    [python, c++, java, matlab, aws, tensorflow, k...
4    [bash, python, oracle, aws, ansible, puppet, j...
Name: job_skills, dtype: object

In [35]:
type(df['job_skills'][1])

list

### Calculate projected salary for next year

- senior roles: assume a 5% increase
- other roles: assume a 3% increase

In [42]:
import pandas as pd
df=pd.read_csv('/Users/abby/Python_data_project/2_Advanced/data_jobs.csv')
df['job_posted_date']=pd.to_datetime(df['job_posted_date'])


In [45]:
df_salary = df[pd.notna(df['salary_year_avg'])].copy()

def projected_salary(row):
    if 'Senior' in row['job_title_short']:
        return row['salary_year_avg']*1.05
    else:
        return row['salary_year_avg']*1.03

df_salary['salary_year_inflated']=df_salary.apply(projected_salary,axis=1) # apply to each row
df_salary[['job_title_short','salary_year_avg','salary_year_inflated']]

|        | job_title_short | salary_year_avg | salary_year_inflated |
| ------ | --------------- | --------------- | -------------------- |
| 28     | Data Scientist  | 109500.0        | 112785.00            |
| 77     | Data Engineer   | 140000.0        | 144200.00            |
| 92     | Data Engineer   | 120000.0        | 123600.00            |
| 100    | Data Scientist  | 228222.0        | 235068.66            |
| 109    | Data Analyst    | 89000.0         | 91670.00             |
| ...    | ...             | ...             | ...                  |
| 785624 | Data Engineer   | 139216.0        | 143392.48            |
| 785641 | Data Engineer   | 150000.0        | 154500.00            |
| 785648 | Data Scientist  | 221875.0        | 228531.25            |
| 785682 | Data Scientist  | 157500.0        | 162225.00            |
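Row-wise apply() is easy to read but calls the function once per row. As a possible alternative, the same rule can be written with whole-column operations. This is a sketch on made-up rows: the title 'Senior Data Engineer' is an assumed example value, and using `numpy.where` is a swapped-in technique, not the approach from the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; 'Senior Data Engineer' is an assumed example title
df_salary = pd.DataFrame({
    'job_title_short': ['Data Scientist', 'Senior Data Engineer', 'Data Analyst'],
    'salary_year_avg': [109500.0, 140000.0, 89000.0],
})

def projected_salary(row):
    if 'Senior' in row['job_title_short']:
        return row['salary_year_avg'] * 1.05
    return row['salary_year_avg'] * 1.03

# Row-wise version, as in the cell above
row_wise = df_salary.apply(projected_salary, axis=1)

# Vectorized version: choose the rate per row, then multiply whole columns
is_senior = df_salary['job_title_short'].str.contains('Senior')
rate = np.where(is_senior, 1.05, 1.03)
vectorized = df_salary['salary_year_avg'] * rate

print((row_wise - vectorized).abs().max())
```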


### Practice 1. Convert Date to String

Convert the job_posted_date column to a string format 'YYYY-MM-DD' and create a new column job_posted_date_str.

| Code | Meaning                     | Example  |
| ---- | --------------------------- | -------- |
| `%Y` | 4-digit year                | `2024`   |
| `%y` | 2-digit year                | `24`     |
| `%m` | 2-digit month (01–12)       | `06`     |
| `%b` | Abbreviated month name      | `Jun`    |
| `%B` | Full month name             | `June`   |
| `%d` | 2-digit day of month        | `03`     |
| `%A` | Full weekday name           | `Monday` |
| `%a` | Abbreviated weekday name    | `Mon`    |
| `%H` | Hour (00–23)                | `14`     |
| `%I` | Hour (01–12, 12-hour clock) | `02`     |
| `%p` | AM/PM                       | `PM`     |
| `%M` | Minute                      | `30`     |
| `%S` | Second                      | `45`     |


In [46]:
import pandas as pd
df=pd.read_csv('/Users/abby/Python_data_project/2_Advanced/data_jobs.csv')
df['job_posted_date']=pd.to_datetime(df['job_posted_date'])

df['job_posted_date_str']=df['job_posted_date'].dt.strftime('%Y-%m-%d')
df[['job_posted_date_str','job_posted_date']].head()

|     | job_posted_date_str | job_posted_date     |
| --- | ------------------- | ------------------- |
| 0   | 2023-06-16          | 2023-06-16 13:44:15 |
| 1   | 2023-01-14          | 2023-01-14 13:18:07 |
| 2   | 2023-10-10          | 2023-10-10 13:14:55 |
| 3   | 2023-07-04          | 2023-07-04 13:01:41 |
| 4   | 2023-08-07          | 2023-08-07 14:29:36 |


In [47]:
import pandas as pd
df=pd.read_csv('/Users/abby/Python_data_project/2_Advanced/data_jobs.csv')
df['job_posted_date']=pd.to_datetime(df['job_posted_date'])

df['job_posted_date_str']=df['job_posted_date'].apply(lambda x: x.strftime('%Y-%m-%d'))
df[['job_posted_date_str','job_posted_date']].head()

|     | job_posted_date_str | job_posted_date     |
| --- | ------------------- | ------------------- |
| 0   | 2023-06-16          | 2023-06-16 13:44:15 |
| 1   | 2023-01-14          | 2023-01-14 13:18:07 |
| 2   | 2023-10-10          | 2023-10-10 13:14:55 |
| 3   | 2023-07-04          | 2023-07-04 13:01:41 |
| 4   | 2023-08-07          | 2023-08-07 14:29:36 |


##### .dt is a pandas accessor for a datetime Series; it is not available on an individual datetime object.

| Expression                                | What's happening              | Use when...                          |
| ----------------------------------------- | ----------------------------- | ------------------------------------ |
| `x.strftime('%Y-%m-%d')` inside `apply()` | `x` is a single datetime      | You're using `apply()` value by value |
| `df['col'].dt.strftime('%Y-%m-%d')`       | `dt` acts on the whole Series | You want fast, vectorized formatting |
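Both forms can be compared on a small Series. The two timestamps below are copied from the output above; the variable names are assumptions made for this sketch.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(['2023-06-16 13:44:15', '2023-01-14 13:18:07']))

# .dt works on the whole Series at once
via_dt = dates.dt.strftime('%Y-%m-%d')

# apply() hands the lambda one Timestamp at a time, so strftime is called directly
via_apply = dates.apply(lambda x: x.strftime('%Y-%m-%d'))

# A single Timestamp has strftime, but no .dt accessor
single = dates[0]
has_dt = hasattr(single, 'dt')

print(via_dt.tolist(), via_apply.tolist(), has_dt)
```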


### Practice 2. Days since posted

Calculate the number of days since each job was posted. Create a new column days_since_posted that contains this value. Use the job_posted_date column.

Note: Import the datetime class from the datetime module and call datetime.now() to get the current date and time.

In [None]:
import pandas as pd
from datetime import datetime
df=pd.read_csv('/Users/abby/Python_data_project/2_Advanced/data_jobs.csv')
df['job_posted_date']=pd.to_datetime(df['job_posted_date'])

# Get current timestamp
current_date = datetime.now()
df['days_since_posted'] = df['job_posted_date'].apply(lambda date: (current_date - date).days)
df[['job_posted_date', 'days_since_posted']].head()



|     | job_posted_date     | days_since_posted |
| --- | ------------------- | ----------------- |
| 0   | 2023-06-16 13:44:15 | 718               |
| 1   | 2023-01-14 13:18:07 | 871               |
| 2   | 2023-10-10 13:14:55 | 602               |
| 3   | 2023-07-04 13:01:41 | 700               |
| 4   | 2023-08-07 14:29:36 | 666               |
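The same column can also be computed without apply(), by subtracting the whole Series and reading `.dt.days`. The sketch below uses a fixed reference time instead of `datetime.now()` so that the result is reproducible; that fixed value is an assumption made for this example only.

```python
import pandas as pd

posted = pd.Series(pd.to_datetime(['2023-06-16 13:44:15', '2023-01-14 13:18:07']))

# Fixed reference time (assumed for reproducibility; the cell above uses datetime.now())
reference = pd.Timestamp('2025-06-03 20:00:00')

# Element by element, as in the cell above
via_apply = posted.apply(lambda date: (reference - date).days)

# Vectorized: the subtraction yields a timedelta Series, .dt.days extracts whole days
vectorized = (reference - posted).dt.days

print(via_apply.tolist(), vectorized.tolist())
```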


### .days is valid, but only on a timedelta object.

**Timedelta**: the result of subtracting one datetime (or date) object from another.

.days gives just the number of whole days in the time difference.  
.seconds gives only the leftover seconds within the last partial day (0 to 86399); use .total_seconds() for the full duration in seconds.

In [54]:
from datetime import datetime

current_date = datetime.now()
past_date = datetime(2025, 5, 30)

delta = current_date - past_date
print(current_date)
print(past_date)
print(delta)      
print(delta.days)  

2025-06-03 20:09:39.273190
2025-05-30 00:00:00
4 days, 20:09:39.273190
4
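The difference between `.days`, `.seconds`, and `.total_seconds()` is easy to miss, so here is a small sketch with a hand-built timedelta. Its value is chosen to resemble the output above, ignoring the microseconds.

```python
from datetime import timedelta

delta = timedelta(days=4, hours=20, minutes=9, seconds=39)

whole_days = delta.days          # 4
leftover = delta.seconds         # 72579: only the part beyond the whole days
total = delta.total_seconds()    # 418179.0: the full duration in seconds
total_hours = total / 3600       # full duration expressed in hours

print(whole_days, leftover, total, total_hours)
```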


### .now() from datetime module
-  one option  

import datetime  

datetime.datetime.now()  # ✅ Works!  
datetime.now()  # ❌ AttributeError  

This fails because `import datetime` gives you the module, not the datetime class inside it.

- recommended option

from datetime import datetime  

datetime.now()  # ✅ Works!

In [56]:
from datetime import datetime

current_timestamp= datetime.now()
current_date=datetime.now().date()
current_year=datetime.now().year # year is an attribute, not a method, so no parentheses
current_day=datetime.now().day

print(current_timestamp)
print(current_date)
print(current_year)
print(current_day)

2025-06-03 20:19:26.160479
2025-06-03
2025
3


In [None]:
now = datetime.now()
print(now.month)    
print(now.day)      
print(now.hour)    
print(now.minute)  
print(now.second)  

6
3
20
21
11


### Practice 3. Salary Category
Create a copy of the DataFrame called df_filtered and drop the NaN values for salary_year_avg.  

Then, create a new column salary_category that categorizes the salary_year_avg into three categories: 'Low' for salaries less than 60,000, 'Medium' for salaries between 60,000 and 100,000, and 'High' for salaries greater than 100,000.  

Then show the df_filtered DataFrame and the salary_year_avg and salary_category columns.

In [62]:
import pandas as pd
df=pd.read_csv('/Users/abby/Python_data_project/2_Advanced/data_jobs.csv')
df['job_posted_date']=pd.to_datetime(df['job_posted_date'])

#df1=df.copy()
#df_filtered=df1.dropna(subset='salary_year_avg')

# drop NaN values for salary_year_avg
df_filtered=df[pd.notna(df['salary_year_avg'])].copy()

def salary_categories(row):
    if row['salary_year_avg'] < 60000:
        return 'Low'
    elif row['salary_year_avg'] > 100000:
        return 'High'
    else:
        return 'Medium'
df_filtered['salary_category']=df_filtered.apply(salary_categories,axis=1)
df_filtered[['salary_year_avg','salary_category']]


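|        | salary_year_avg | salary_category |
| ------ | --------------- | --------------- |
| 28     | 109500.0        | High            |
| 77     | 140000.0        | High            |
| 92     | 120000.0        | High            |
| 100    | 228222.0        | High            |
| 109    | 89000.0         | Medium          |
| ...    | ...             | ...             |
| 785624 | 139216.0        | High            |
| 785641 | 150000.0        | High            |
| 785648 | 221875.0        | High            |
| 785682 | 157500.0        | High            |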


In [63]:
import pandas as pd
df=pd.read_csv('/Users/abby/Python_data_project/2_Advanced/data_jobs.csv')
df['job_posted_date']=pd.to_datetime(df['job_posted_date'])

# drop NaN values for salary_year_avg
df_filtered=df.dropna(subset='salary_year_avg').copy()

# Apply to the filtered column, so only rows with a salary are categorized
df_filtered['salary_category']=df_filtered['salary_year_avg'].apply(lambda salary: 'High' if salary > 100000 else 'Low' if salary < 60000 else 'Medium')
df_filtered[['salary_year_avg','salary_category']]

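|        | salary_year_avg | salary_category |
| ------ | --------------- | --------------- |
| 28     | 109500.0        | High            |
| 77     | 140000.0        | High            |
| 92     | 120000.0        | High            |
| 100    | 228222.0        | High            |
| 109    | 89000.0         | Medium          |
| ...    | ...             | ...             |
| 785624 | 139216.0        | High            |
| 785641 | 150000.0        | High            |
| 785648 | 221875.0        | High            |
| 785682 | 157500.0        | High            |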


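The two versions differ in what the function receives. Series.apply() passes each value of a single column, so the lambda gets one salary at a time. DataFrame.apply() with axis=1 passes each row as a Series, so the function has to look up row['salary_year_avg'].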

In [None]:
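# First way: Series.apply on one column (the lambda receives each salary value)
df_filtered['salary_category']=df_filtered['salary_year_avg'].apply(lambda salary: 'High' if salary > 100000 else 'Low' if salary < 60000 else 'Medium')

# Second way: DataFrame.apply with axis=1 (the function receives each row)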
def salary_categories(row):
    if row['salary_year_avg'] < 60000:
        return 'Low'
    elif row['salary_year_avg'] > 100000:
        return 'High'
    else:
        return 'Medium'
df_filtered['salary_category']=df_filtered.apply(salary_categories,axis=1)
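A third option, not used in the cells above, is to build the categories without apply() by using `numpy.select`. The sketch below runs on made-up salaries that include both boundary values, to show that 60,000 and 100,000 land in 'Medium' just as in the two versions above.

```python
import numpy as np
import pandas as pd

# Hypothetical salaries, including both boundary values
salaries = pd.Series([45000.0, 60000.0, 89000.0, 100000.0, 109500.0])

# Conditions are checked in order; rows matching none of them get the default
conditions = [salaries < 60000, salaries > 100000]
labels = ['Low', 'High']
categories = pd.Series(
    np.select(conditions, labels, default='Medium'),
    index=salaries.index,
)

print(categories.tolist())
```

`numpy.select` returns a plain NumPy array, so it is wrapped in a Series with the original index to keep the row labels aligned.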
