# US Monster Jobs Dataset Cleansing

Personal checklist:
-> Standardization of data: fix inconsistent column names and convert them into a standard format

In [2]:
import pandas as pd

import numpy as np

In [3]:
df = pd.read_csv('C:\\Users\\jorge\\Desktop\\monster_com-job_sample.csv')

# Overview of dataset

In [4]:
df.shape

(22000, 14)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22000 entries, 0 to 21999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   country          22000 non-null  object
 1   country_code     22000 non-null  object
 2   date_added       122 non-null    object
 3   has_expired      22000 non-null  object
 4   job_board        22000 non-null  object
 5   job_description  22000 non-null  object
 6   job_title        22000 non-null  object
 7   job_type         20372 non-null  object
 8   location         22000 non-null  object
 9   organization     15133 non-null  object
 10  page_url         22000 non-null  object
 11  salary           3446 non-null   object
 12  sector           16806 non-null  object
 13  uniq_id          22000 non-null  object
dtypes: object(14)
memory usage: 2.3+ MB


# 1. Standardization of data
-> Column names/format: I don't see any relevant errors in the column name format. Am I missing something?
 <br />No capital letters, and no format differences between one name and another.

In [6]:
df.columns

Index(['country', 'country_code', 'date_added', 'has_expired', 'job_board',
       'job_description', 'job_title', 'job_type', 'location', 'organization',
       'page_url', 'salary', 'sector', 'uniq_id'],
      dtype='object')
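One way to double-check that impression is to normalize the names and compare them with the originals. This is only a sketch: it assumes "standard format" means lowercase snake_case, and `normalize_columns` is a hypothetical helper, not something from the dataset or from pandas.

```python
import pandas as pd

def normalize_columns(columns):
    """Return lowercase snake_case versions of the column names (assumed convention)."""
    idx = pd.Index(columns)
    return (idx.str.strip()
               .str.lower()
               .str.replace(r'[^0-9a-z]+', '_', regex=True)  # runs of other characters -> '_'
               .str.strip('_'))

cols = pd.Index(['country', 'country_code', 'date_added', 'has_expired', 'job_board',
                 'job_description', 'job_title', 'job_type', 'location', 'organization',
                 'page_url', 'salary', 'sector', 'uniq_id'])

# True means normalizing changes nothing, i.e. the names already follow the convention
already_standard = normalize_columns(cols).equals(cols)
print(already_standard)
```

With the real data, `normalize_columns(df.columns)` could be compared against `df.columns` in the same way.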

# 2. Data Type Conversion
Every column in df has the 'object' dtype. We'll have to convert each column to a dtype that matches the data it holds.

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22000 entries, 0 to 21999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   country          22000 non-null  object
 1   country_code     22000 non-null  object
 2   date_added       122 non-null    object
 3   has_expired      22000 non-null  object
 4   job_board        22000 non-null  object
 5   job_description  22000 non-null  object
 6   job_title        22000 non-null  object
 7   job_type         20372 non-null  object
 8   location         22000 non-null  object
 9   organization     15133 non-null  object
 10  page_url         22000 non-null  object
 11  salary           3446 non-null   object
 12  sector           16806 non-null  object
 13  uniq_id          22000 non-null  object
dtypes: object(14)
memory usage: 2.3+ MB


# 2.1 Empty cells per column?

We have a lot of missing data: date_added (21,878 of 22,000 rows) and salary (18,554) are mostly empty, and organization (6,867), sector (5,194) and job_type (1,628) also have gaps.

In [8]:
df.isnull().sum()

country                0
country_code           0
date_added         21878
has_expired            0
job_board              0
job_description        0
job_title              0
job_type            1628
location               0
organization        6867
page_url               0
salary             18554
sector              5194
uniq_id                0
dtype: int64
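Raw counts are easier to compare as a share of the rows. A minimal sketch on a tiny stand-in frame (`sample` and its values are made up for illustration; with the real data the same expression would run on `df`):

```python
import pandas as pd

# Tiny stand-in frame; not the real dataset
sample = pd.DataFrame({
    'job_type': ['Full Time Employee', None, 'Full Time Employee', 'Full Time Employee'],
    'salary': [None, None, None, '9.00 - 13.00 $ /hour'],
})

# isnull().mean() is the fraction of missing values per column
missing_pct = sample.isnull().mean().mul(100).round(1).sort_values(ascending=False)
print(missing_pct)
```

Sorting in descending order puts the emptiest columns first, which helps when deciding which ones are worth keeping.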

The only value in the 'country' column is the United States (and 'US' in 'country_code'). I guess it makes sense when using a 'US Monster Jobs' dataset.

In [9]:
print(df['country'].value_counts())

print('-----------------------------------')

print(df['country_code'].value_counts())

United States of America    22000
Name: country, dtype: int64
-----------------------------------
US    22000
Name: country_code, dtype: int64


# 2.2 Convert date_added to date dtype

In [10]:
df.loc[df['date_added'].notnull()]

Unnamed: 0,country,country_code,date_added,has_expired,job_board,job_description,job_title,job_type,location,organization,page_url,salary,sector,uniq_id
133,United States of America,US,5/10/2016,No,jobs.monster.com,"#TrackingJobBody table, #TrackingJobBody a {<b...",Multibed Technician Job in Deer Park,Full Time Employee,"Deer Park, TX",Other/Not Classified,http://jobview.monster.com/Multibed-Technician...,,Other,6f6e952b8b0a2bb55e9feada54db2347
140,United States of America,US,5/13/2016,No,jobs.monster.com,Equal Opportunity Employer: Minority/Female/Di...,Principal Cyber Security Engineer Job in Houston,Full Time Employee,"Houston, TX",Computer SoftwareComputer/IT Services,http://jobview.monster.com/Principal-Cyber-Sec...,,IT/Software Development,1127457851cf28d79a39fd4b35867982
251,United States of America,US,5/9/2016,No,jobs.monster.com,"#TrackingJobBody table, #TrackingJobBody a {<b...",Field Supervisor IS Job in Deer Park,Full Time Employee,"Deer Park, TX",Other/Not Classified,http://jobview.monster.com/Field-Supervisor-IS...,,Other,94b49d291a16d01b27378ca97e653910
279,United States of America,US,6/10/2016,No,jobs.monster.com,"At American Family Insurance, we're firmly com...",Insurance Sales - Customer Service Job in Eden...,Full Time Employee,"Eden Prairie, MN 55344",Insurance,http://jobview.monster.com/insurance-sales-cus...,15.00 - 21.00 $ /hour,Accounting/Finance/Insurance,64a597e5dd17740aadf4b0e8047b51a5
366,United States of America,US,1/2/2017,No,jobs.monster.com,Description The Opportunity The Vehicle Mainte...,Vehicle Maintenance Mechanic - Las Vegas,Full Time Employee,"Las Vegas, NV",Energy and Utilities,http://jobview.monster.com/vehicle-maintenance...,,Installation/Maintenance/Repair,886903d4dda03046c2a826c44bfff3dc
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20760,United States of America,US,9/27/2016,No,jobs.monster.com,"#TrackingJobBody table, #TrackingJobBody a {<b...",Central Maintenance Planner Job in Norwell,Full Time Employee,"Norwell, MA",Other/Not Classified,http://jobview.monster.com/central-maintenance...,,Administrative/Clerical,7b115d764f741821bae4ac95bfcf3f04
21342,United States of America,US,3/30/2016,No,jobs.monster.com,"#TrackingJobBody table, #TrackingJobBody a {<b...",Branch Manager Job in Cincinnati,Full Time Employee,"Cincinnati, OH",Other/Not Classified,http://jobview.monster.com/Branch-Manager-Job-...,,Other,1db7d013265871214d3f4e7ed80d8a23
21391,United States of America,US,3/24/2016,No,jobs.monster.com,"#TrackingJobBody table, #TrackingJobBody a {<b...",Field Service Driver Job in Cincinnati,Full Time Employee,"Cincinnati, OH",Other/Not Classified,http://jobview.monster.com/Field-Service-Drive...,,Logistics/Transportation,4f304e6285b240f8442a028bc3716273
21631,United States of America,US,4/4/2016,No,jobs.monster.com,"#TrackingJobBody table, #TrackingJobBody a {<b...",Field Project Manager Job in Cincinnati,Full Time Employee,"Cincinnati, OH",Other/Not Classified,http://jobview.monster.com/Field-Project-Manag...,,Other,6f86e1f35ad082be591bcec15c75f947
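The rows above only show the raw strings, so here is a possible sketch of the conversion itself. It assumes the dates are month/day/year (values such as 5/13/2016 and 9/27/2016 only make sense that way) and uses a small stand-in frame called `sample` rather than the real `df`.

```python
import pandas as pd

# Stand-in values copied from the rows above, plus one missing entry
sample = pd.DataFrame({'date_added': ['5/10/2016', '5/13/2016', None, '1/2/2017']})

# errors='coerce' turns anything that does not match the format into NaT
sample['date_added'] = pd.to_datetime(sample['date_added'],
                                      format='%m/%d/%Y',
                                      errors='coerce')

print(sample['date_added'].dtype)
print(sample['date_added'].isna().sum())
```

Comparing the number of NaT values after the conversion with the number of nulls before it is a quick way to spot strings that failed to parse.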


In [11]:
lista = [f'{col}: {dtype}' for col, dtype in df.dtypes.items()]

print(lista)

['country: object', 'country_code: object', 'date_added: object', 'has_expired: object', 'job_board: object', 'job_description: object', 'job_title: object', 'job_type: object', 'location: object', 'organization: object', 'page_url: object', 'salary: object', 'sector: object', 'uniq_id: object']


# Fix salary column

# Visualize full df rows

The code in the next cell lifts pandas' display limits so that rows and columns print in full instead of being truncated.

In [12]:
with pd.option_context('display.max_rows', None,
                       'display.max_columns', None,
                       'display.precision', 3,
                       ):
    print(df[['salary']].head(15))

                           salary
0                             NaN
1                             NaN
2                             NaN
3                             NaN
4                             NaN
5                             NaN
6                             NaN
7                             NaN
8                             NaN
9                             NaN
10                            NaN
11                            NaN
12                            NaN
13           9.00 - 13.00 $ /hour
14  80,000.00 - 95,000.00 $ /year


I wanted to have a better idea of the data in the column.

The "salary" column is a mess. We have:\
80,000.00 - 95,000.00 /year\
45,000.00 - 100,000.00 /yearBonus, Benefits,\
40,000.00 - 50,000.00 /yearsalary\
56,000.00 - 64,000.00 /yearHighly Competitiv\
13.00 - 16.00 /year\
0.00 - 90,000.00 /year\
45,000.00+ /year\
0.00 - 1.00 /year

The next code block shows the rows whose "salary" value contains the string "year".

In [13]:
# na=False treats missing salaries as "no match", so no separate isna() check is needed
filt = df['salary'].str.contains('year', na=False)

with pd.option_context('display.max_rows', None,
                       'display.max_columns', None,
                       'display.precision', 3,
                       ):
    print(df.loc[filt, 'salary'].head(15))

14                         80,000.00 - 95,000.00 $ /year
19                         60,000.00 - 72,000.00 $ /year
29                        70,000.00 - 100,000.00 $ /year
32                        75,000.00 - 100,000.00 $ /year
36                         68,000.00 - 72,000.00 $ /year
41                         58,000.00 - 65,000.00 $ /year
61     45,000.00 - 100,000.00 $ /yearBonus, Benefits,...
64                   40,000.00 - 50,000.00 $ /yearsalary
82                         80,000.00 - 90,000.00 $ /year
83                         35,000.00 - 45,000.00 $ /year
88                        80,000.00 - 100,000.00 $ /year
100                        31,000.00 - 33,000.00 $ /year
117                      100,000.00 - 120,000.00 $ /year
127                       75,000.00 - 100,000.00 $ /year
132                                    $50,000.00+ /year
Name: salary, dtype: object


# Add extra columns for analysis

In [14]:
df.columns.get_loc('salary')

11

In [15]:
df.insert(12, 'from_salary', np.nan)

df.columns

Index(['country', 'country_code', 'date_added', 'has_expired', 'job_board',
       'job_description', 'job_title', 'job_type', 'location', 'organization',
       'page_url', 'salary', 'from_salary', 'sector', 'uniq_id'],
      dtype='object')

In [16]:
df.insert(13, 'to_salary', np.nan)

df.columns

Index(['country', 'country_code', 'date_added', 'has_expired', 'job_board',
       'job_description', 'job_title', 'job_type', 'location', 'organization',
       'page_url', 'salary', 'from_salary', 'to_salary', 'sector', 'uniq_id'],
      dtype='object')

In [17]:
df.insert(14, 'yearly_hourly', np.nan)

df.columns

Index(['country', 'country_code', 'date_added', 'has_expired', 'job_board',
       'job_description', 'job_title', 'job_type', 'location', 'organization',
       'page_url', 'salary', 'from_salary', 'to_salary', 'yearly_hourly',
       'sector', 'uniq_id'],
      dtype='object')

In [18]:
df.index.names = ['id']

# Salary column copy
I copied the data in "salary" to "salary_copy"

In [19]:
df['salary_copy'] = df['salary']

In [20]:
df[['salary', 'salary_copy']]

Unnamed: 0_level_0,salary,salary_copy
id,Unnamed: 1_level_1,Unnamed: 2_level_1
0,,
1,,
2,,
3,,
4,,
...,...,...
21995,"120,000.00 - 160,000.00 $ /yearbonus","120,000.00 - 160,000.00 $ /yearbonus"
21996,"45,000.00 - 60,000.00 $ /year","45,000.00 - 60,000.00 $ /year"
21997,,
21998,25.00 - 28.00 $ /hour,25.00 - 28.00 $ /hour


# Split salary string
Now that I have the columns, I'm splitting the salary strings into each of them\
to try to make sense of the data.

In [21]:
df['yearly_hourly']

id
0       NaN
1       NaN
2       NaN
3       NaN
4       NaN
         ..
21995   NaN
21996   NaN
21997   NaN
21998   NaN
21999   NaN
Name: yearly_hourly, Length: 22000, dtype: float64

In [22]:
df['from_salary'] = df['from_salary'].astype(str)

df['to_salary'] = df['to_salary'].astype(str)


df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22000 entries, 0 to 21999
Data columns (total 18 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   country          22000 non-null  object 
 1   country_code     22000 non-null  object 
 2   date_added       122 non-null    object 
 3   has_expired      22000 non-null  object 
 4   job_board        22000 non-null  object 
 5   job_description  22000 non-null  object 
 6   job_title        22000 non-null  object 
 7   job_type         20372 non-null  object 
 8   location         22000 non-null  object 
 9   organization     15133 non-null  object 
 10  page_url         22000 non-null  object 
 11  salary           3446 non-null   object 
 12  from_salary      22000 non-null  object 
 13  to_salary        22000 non-null  object 
 14  yearly_hourly    0 non-null      float64
 15  sector           16806 non-null  object 
 16  uniq_id          22000 non-null  object 
 17  salary_copy 

# Salary/Year Loop

Set the "yearly_hourly" column to "year" wherever the "salary" column contains "year", then split the range into "from_salary" and "to_salary".

In [23]:
filt = df['salary'].str.contains('year', na=False)

for i in df.index[filt]:
    df.iloc[i, df.columns.get_loc('yearly_hourly')] = 'year'

    # Split the "salary" string in two at the first '-' only (maxsplit=1)
    split_salary = df.iloc[i, df.columns.get_loc('salary')].split('-', 1)

    if len(split_salary) == 1:
        df.at[i, 'from_salary'] = np.nan   # No explicit range (e.g. "$50,000.00+ /year"): NaN until I decide how to handle these
        df.at[i, 'to_salary'] = split_salary[0]   # Keep the single value in "to_salary"
    elif len(split_salary) == 2:
        df.at[i, 'from_salary'] = split_salary[0]   # FIRST part of the split. Most strings look like "80,000.00 - 95,000.00 $ /year"
        df.at[i, 'to_salary'] = split_salary[1]   # SECOND part of the split

# Salary/Hour Loop

In [24]:
filt = df['salary'].str.contains('hour', na=False)

for i in df.index[filt]:
    df.iloc[i, df.columns.get_loc('yearly_hourly')] = 'hour'

    # Same split as in the yearly loop: two parts at the first '-'
    split_salary = df.iloc[i, df.columns.get_loc('salary')].split('-', 1)

    if len(split_salary) == 1:
        df.at[i, 'from_salary'] = np.nan
        df.at[i, 'to_salary'] = split_salary[0]
    elif len(split_salary) == 2:
        df.at[i, 'from_salary'] = split_salary[0]
        df.at[i, 'to_salary'] = split_salary[1]

# Salary per Month

In [25]:
filt = df['salary'].str.contains('/month', na=False)

for i in df.index[filt]:
    df.iloc[i, df.columns.get_loc('yearly_hourly')] = 'month'

    # Same split as in the yearly loop: two parts at the first '-'
    split_salary = df.iloc[i, df.columns.get_loc('salary')].split('-', 1)

    if len(split_salary) == 1:
        df.at[i, 'from_salary'] = np.nan
        df.at[i, 'to_salary'] = split_salary[0]
    elif len(split_salary) == 2:
        df.at[i, 'from_salary'] = split_salary[0]
        df.at[i, 'to_salary'] = split_salary[1]
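The year, hour and month loops differ only in the keyword they look for, so they could probably be collapsed into a few vectorized steps. The sketch below is an alternative, not the method used in this notebook: it runs on a small stand-in frame (`sample`), and it matches on "/year", "/hour" or "/month" with the slash, which is slightly stricter than the plain 'year' and 'hour' checks in the loops.

```python
import pandas as pd

# Stand-in rows in the formats seen above
sample = pd.DataFrame({'salary': [
    '80,000.00 - 95,000.00 $ /year',
    '13.75 - 16.75 $ /hourYear End Bonus',
    '3,137.89 - 3,865.66 $ /month',
    '$50,000.00+ /year',
    'Competitive Wages',
    None,
]})

# Period: first "/year", "/hour" or "/month" tag in the string (NaN if none)
sample['yearly_hourly'] = sample['salary'].str.extract(r'/(year|hour|month)', expand=False)

# Range: split on the first '-' only, as the loops do
parts = sample['salary'].str.split('-', n=1, expand=True)
has_period = sample['yearly_hourly'].notna()
is_range = has_period & parts[1].notna()

# Ranges fill both columns; single values with a period go to "to_salary" only
sample['from_salary'] = parts[0].where(is_range)
sample['to_salary'] = parts[1].where(is_range, parts[0].where(has_period))

print(sample[['from_salary', 'to_salary', 'yearly_hourly']])
```

As in the loops, the pieces still carry spaces, "$" and the period text, so they need the same cleaning afterwards.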

# Up to... Loop

In [26]:
filt_upto = df['salary'].str.contains('Up to', na=False)

for i in df.index[filt_upto]:
    # Keep what follows the first '$', e.g. "Up to $32000.00" -> "32000.00"
    split_salary = df.iloc[i, df.columns.get_loc('salary')].split('$', 1)
    df.at[i, 'from_salary'] = np.nan
    df.at[i, 'to_salary'] = split_salary[1]
    # More than 5 characters ("32000.00") reads as a yearly amount; 5 or fewer ("18.00") as an hourly rate
    if len(split_salary[1]) > 5:
        df.at[i, 'yearly_hourly'] = 'upto'
    else:
        df.at[i, 'yearly_hourly'] = 'uptohour'
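The length check in the loop above is a proxy for the size of the amount. A sketch of the same idea on numeric values follows; `HOURLY_CUTOFF` is an assumed threshold (with two decimals, "more than 5 characters" corresponds to 100.00 or more), and `sample` is a stand-in Series rather than the real column.

```python
import pandas as pd

sample = pd.Series(['Up to $32000.00', 'Up to $18.00', 'Up to $31500.00'])

# Take the number that follows the dollar sign and convert it to float
amount = sample.str.extract(r'\$\s*([\d,]+(?:\.\d+)?)', expand=False)
amount = pd.to_numeric(amount.str.replace(',', '', regex=False))

# Assumed cut-off between hourly rates and yearly amounts
HOURLY_CUTOFF = 100
label = amount.ge(HOURLY_CUTOFF).map({True: 'upto', False: 'uptohour'})

print(pd.DataFrame({'salary': sample, 'amount': amount, 'yearly_hourly': label}))
```

Comparing numbers instead of string lengths keeps working if an amount ever comes without decimals or with a thousands separator.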

# View year results

In [27]:
filt = df['salary'].str.contains('year', na=False)

df.loc[filt, ['salary', 'from_salary', 'to_salary', 'yearly_hourly']]

Unnamed: 0_level_0,salary,from_salary,to_salary,yearly_hourly
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
14,"80,000.00 - 95,000.00 $ /year",80000.00,"95,000.00 $ /year",year
19,"60,000.00 - 72,000.00 $ /year",60000.00,"72,000.00 $ /year",year
29,"70,000.00 - 100,000.00 $ /year",70000.00,"100,000.00 $ /year",year
32,"75,000.00 - 100,000.00 $ /year",75000.00,"100,000.00 $ /year",year
36,"68,000.00 - 72,000.00 $ /year",68000.00,"72,000.00 $ /year",year
...,...,...,...,...
21982,"70,000.00 - 80,000.00 $ /year",70000.00,"80,000.00 $ /year",year
21987,"$80,000.00+ /year",,"$80,000.00+ /year",year
21995,"120,000.00 - 160,000.00 $ /yearbonus",120000.00,"160,000.00 $ /yearbonus",year
21996,"45,000.00 - 60,000.00 $ /year",45000.00,"60,000.00 $ /year",year


# View hour results

In [28]:
filt = df['salary'].str.contains('hour', na=False)

df.loc[filt, ['salary', 'from_salary', 'to_salary', 'yearly_hourly']]

Unnamed: 0_level_0,salary,from_salary,to_salary,yearly_hourly
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
13,9.00 - 13.00 $ /hour,9.00,13.00 $ /hour,hour
30,62.00 - 81.00 $ /hour,62.00,81.00 $ /hour,hour
43,15.00 - 16.00 $ /hour,15.00,16.00 $ /hour,hour
68,13.75 - 16.75 $ /hourYear End Bonus,13.75,16.75 $ /hourYear End Bonus,hour
80,40.00 - 50.00 $ /hour,40.00,50.00 $ /hour,hour
...,...,...,...,...
21909,17.00 - 21.00 $ /hour,17.00,21.00 $ /hour,hour
21912,20.00 - 22.00 $ /hour,20.00,22.00 $ /hour,hour
21913,50.00 - 55.00 $ /hour,50.00,55.00 $ /hour,hour
21927,30.00 - 35.00 $ /hour,30.00,35.00 $ /hour,hour


# View Monthly

In [29]:
filt = df['salary'].str.contains('/month', na=False)

df.loc[filt, ['salary', 'from_salary', 'to_salary', 'yearly_hourly']]

Unnamed: 0_level_0,salary,from_salary,to_salary,yearly_hourly
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2662,"17,688.94 - 20,971.00 $ /month",17688.94,"20,971.00 $ /month",month
2900,"3,137.89 - 3,865.66 $ /month",3137.89,"3,865.66 $ /month",month
3300,"17,688.94 - 20,971.00 $ /month",17688.94,"20,971.00 $ /month",month
3317,"6,833.33 - 8,060.00 $ /month",6833.33,"8,060.00 $ /month",month
3345,"4,224.31 - 6,579.41 $ /month",4224.31,"6,579.41 $ /month",month
3622,"4,905.22 - 6,579.41 $ /month",4905.22,"6,579.41 $ /month",month
5949,"4,362.39 - 5,445.57 $ /month",4362.39,"5,445.57 $ /month",month
12333,"1,800.00 - 3,500.00 $ /month",1800.0,"3,500.00 $ /month",month
13554,"5,882.93 - 7,883.20 $ /month",5882.93,"7,883.20 $ /month",month
14150,"3,674.00 - 4,690.00 $ /month",3674.0,"4,690.00 $ /month",month


# View Up to...

In [30]:
filt = df['salary'].notna() & df['yearly_hourly'].str.contains('upto', na=False)

df.loc[filt, ['salary', 'from_salary', 'to_salary', 'yearly_hourly']]

Unnamed: 0_level_0,salary,from_salary,to_salary,yearly_hourly
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
42,Up to $32000.00,,32000.00,upto
154,Up to $45000.00,,45000.00,upto
2630,Up to $18.00,,18.00,uptohour
2639,Up to $13.00,,13.00,uptohour
2651,Up to $18.00,,18.00,uptohour
...,...,...,...,...
14069,Up to $15.00,,15.00,uptohour
20014,Up to $14.00,,14.00,uptohour
21168,Up to $20.00,,20.00,uptohour
21270,Up to $31500.00,,31500.00,upto


# View others
There are still some hourly/yearly salaries in a different format.\
I don't think including them is relevant or would make a difference to the analysis,
so I am leaving the rest of the rows as NaN.

In [64]:
# Rows that have a salary string but were not classified by any of the loops
filt = df['salary'].notna() & df['yearly_hourly'].isna()

with pd.option_context('display.max_rows', None,
                       'display.max_columns', None,
                       'display.precision', 3,
                       ):
    print(df.loc[filt, ['salary', 'from_salary', 'to_salary', 'yearly_hourly']].head(20))

                                                 salary from_salary to_salary  \
id                                                                              
23                         Excellent Pay and Incentives         NaN       nan   
58                              Salary, plus commission         NaN       nan   
70                                     To be discussed.         NaN       nan   
92              bonus, 401K matching, medical, vacation         NaN       nan   
125                                                 DOE         NaN       nan   
179                      Negotiable based on experience         NaN       nan   
183                                   Competitive Wages         NaN       nan   
209   Burg Simpson offers excellent benefits and com...         NaN       nan   
225                 Excellent compensation and benefits         NaN       nan   
451                                       Yearly Salary         NaN       nan   
484                         

# Clean "from_salary" & "to_salary" columns
We have the salaries mixed with different strings and symbols.

In [65]:
with pd.option_context('display.max_rows', None,
                       'display.max_columns', None,
                       'display.precision', 3,
                       ):
    print(df.loc[:, ['from_salary', 'to_salary', 'yearly_hourly']].head(20))

   from_salary           to_salary yearly_hourly
id                                              
0          NaN                 nan           NaN
1          NaN                 nan           NaN
2          NaN                 nan           NaN
3          NaN                 nan           NaN
4          NaN                 nan           NaN
5          NaN                 nan           NaN
6          NaN                 nan           NaN
7          NaN                 nan           NaN
8          NaN                 nan           NaN
9          NaN                 nan           NaN
10         NaN                 nan           NaN
11         NaN                 nan           NaN
12         NaN                 nan           NaN
13       9.00        13.00 $ /hour          hour
14  80,000.00    95,000.00 $ /year          year
15         NaN                 nan           NaN
16         NaN                 nan           NaN
17         NaN                 nan           NaN
18         NaN      

Keep a copy of the "from_salary" and "to_salary" columns before cleaning them.

In [33]:
df['copy_from_salary'] = df['from_salary']

df['copy_to_salary'] = df['to_salary']

Make corrections for the nan values in the "from_salary" column

In [34]:
df.loc[42, ['from_salary']].apply(type)  # a real NaN is a float

from_salary    <class 'float'>
Name: 42, dtype: object

In [35]:
df.loc[0, ['from_salary']].apply(type)  # this 'nan' is a string

from_salary    <class 'str'>
Name: 0, dtype: object

Some of the nan values were actually the string 'nan' (left over from the earlier astype(str)), so a np.nan filter didn't catch them. I had to look for them as 'nan'.
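A small illustration of the difference, on a made-up Series (`s` is a stand-in, not the real column): the text 'nan' is invisible to `isna()`, while the float NaN is not.

```python
import numpy as np
import pandas as pd

# One text 'nan', one real NaN, one real value
s = pd.Series(['nan', np.nan, '9.00 '], dtype=object)

real_missing = s.isna()          # only catches the float NaN
text_missing = s.eq('nan')       # only catches the 'nan' string

# mask() swaps the text version for a real missing value
cleaned = s.mask(text_missing)

print(real_missing.tolist())
print(text_missing.tolist())
print(cleaned.isna().tolist())
```

After the swap, both kinds of missing value answer to `isna()` and `notnull()` in the same way.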

In [36]:
df['from_salary'] = df['from_salary'].astype(str)

In [37]:
df.loc[42, ['from_salary']].apply(type) #value is now changed to string type

from_salary    <class 'str'>
Name: 42, dtype: object

The next code block replaces the 'nan' strings with np.nan, i.e. real null values.

In [38]:
# Assign the result back: inplace=True on a selected column may not update df in newer pandas versions
df['from_salary'] = df['from_salary'].replace('nan', np.nan)

Now we have the 'nan' values as Null and the rest of the values (70,000.00, $10.00) as strings.

In [41]:
filt_from_salary = df.loc[df['from_salary'].notnull(), ['from_salary']]

with pd.option_context('display.max_rows', None,
                       'display.max_columns', None,
                       'display.precision', 3,
                       ):
    print(filt_from_salary)

                                             from_salary
id                                                      
13                                                 9.00 
14                                            80,000.00 
19                                            60,000.00 
29                                            70,000.00 
30                                                62.00 
32                                            75,000.00 
36                                            68,000.00 
41                                            58,000.00 
43                                                15.00 
61                                            45,000.00 
64                                            40,000.00 
68                                                13.75 
80                                                40.00 
82                                            80,000.00 
83                                            35,000.00 
88                             

In [63]:
filt_from_salary = df['from_salary'].notnull()

for i in df.index[filt_from_salary]:
    # GetVals = list([char for char in ejemplo
    #            if char.isnumeric() or char == '.'])
    print(i)

13
14
19
29
30
32
36
41
43
61
64
68
80
82
83
88
100
111
117
121
127
134
136
137
146
166
182
190
201
204
218
227
240
249
253
274
279
284
312
327
343
350
369
379
386
393
401
404
411
416
422
427
445
509
512
518
581
605
627
640
674
681
706
710
737
748
754
755
764
765
767
784
806
812
816
833
850
855
856
857
860
869
870
872
899
903
904
934
964
972
974
976
980
1030
1043
1054
1066
1073
1075
1096
1113
1160
1173
1174
1177
1199
1214
1223
1235
1250
1254
1275
1288
1294
1319
1322
1336
1340
1352
1370
1384
1387
1394
1408
1412
1425
1426
1436
1437
1442
1450
1487
1488
1490
1495
1497
1503
1509
1524
1527
1528
1529
1534
1535
1554
1555
1572
1573
1582
1587
1590
1595
1598
1599
1611
1616
1630
1634
1640
1642
1645
1655
1675
1680
1684
1701
1716
1739
1745
1749
1754
1760
1762
1775
1835
1849
1886
1887
1898
1912
1930
1937
1952
1965
1968
1982
1990
1993
2005
2012
2015
2024
2025
2031
2035
2045
2062
2067
2073
2090
2091
2093
2125
2132
2135
2141
2142
2161
2210
2211
2216
2217
2233
2285
2298
2299
2300
2307
2309
2313
2322
2324

In [45]:
ejemplo = '40,000.00'

# num_list = [i for i in range(11)]

# allowed_symbols = ['.']

# allowed_chars = num_list + allowed_symbols

# Keep only digits and the decimal point; this drops thousands separators and symbols
getVals = [char for char in ejemplo
           if char.isnumeric() or char == '.']



print(''.join(getVals))

40000.00
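The character filter above works on a clean string such as '40,000.00'. Applied to whole columns, trailing text with digits of its own could leak into the result, so the sketch below extracts only the first number in each string. It is an assumption about how the cleaning could continue, not the notebook's final method; `to_number` and the `_num` column names are hypothetical, and `sample` is a stand-in frame.

```python
import pandas as pd

# Stand-in values in the shapes produced by the split loops
sample = pd.DataFrame({
    'from_salary': ['80,000.00 ', '9.00 ', None],
    'to_salary': [' 95,000.00 $ /year', ' 13.00 $ /hour', '$80,000.00+ /year'],
})

def to_number(col):
    """Extract the first number in each string and return it as float (NaN if none)."""
    digits = col.str.extract(r'(\d[\d,]*(?:\.\d+)?)', expand=False)
    return pd.to_numeric(digits.str.replace(',', '', regex=False), errors='coerce')

for name in ['from_salary', 'to_salary']:
    sample[name + '_num'] = to_number(sample[name])

print(sample[['from_salary_num', 'to_salary_num']])
```

Writing to new columns keeps the original strings available for checking the rows that end up as NaN.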


Empty cells by column
What can we do about these columns?

-> date_added:
 <br />-> job_type: 
 <br />-> organization: 
 <br />-> salary:
 <br />-> sector: 