Data Dictionary for initial dataframe (data2.csv)

* **Job Title**: contains the title of the job position that is being advertised
* **Job Info**: contains information about the job, such as whether it is full-time or part-time, and whether it is a job or an internship
* **Company Name**: contains the name of the company that is offering the job
* **Company Location**: contains the location of the company, including the city and state
* **Employees**: contains information about the size of the company, such as the number of employees
* **Industry**: contains the industry in which the company operates
* **Headquarters**: contains the location of the company's headquarters
* **Application deadline**: contains the date and time by which job applications must be submitted
* **Posted date**: contains the date on which the job was posted
* **Location type**: contains information about the type of location where the job is located, such as whether it is on-site or remote
* **US work authorization**: contains information about whether the company requires candidates to have US work authorization
* **Estimated pay**: contains information about the estimated pay for the job position
* **Seasonal role**: contains information about the working period for seasonal roles
* **Company division**: contains information about the division of the company that is offering the job
* **Work study**: contains information about whether the job is a work study program



Important libraries

In [503]:
import pandas as pd
import numpy as np
import math

Read data from the CSV file and create a DataFrame for further analysis

In [504]:
df = pd.read_csv("data2.csv")
df_modify = df.copy()

Explore the initial DataFrame

In [505]:
df_modify.head()

Unnamed: 0,Job Title,Job Info,Company Name,Company Location,Employees,Industry,Headquarters,Application deadline,Posted date,Location type,US work authorization,Estimated pay,Seasonal role,Company division,Work study
0,Entry Level Insurance Underwriter,Full-Time ∙ Job,Auto-Owners Insurance Company,"Lakeland, FL\nIrmo, SC\nForest, VA\nLexington,...","5,000 - 10,000",Insurance,"Lansing, MI",4/4/2023 15:30,13-Jul-16,On-site,Required,,,,
1,Internships,Full-Time ∙ Internship,Auto-Owners Insurance Company,"Lakeland, FL\nIrmo, SC\nCharlotte, NC\nForest,...","5,000 - 10,000",Insurance,"Lansing, MI",4/4/2023 15:30,13-Jul-16,On-site,Required,,,,
2,Blood Bank Associate Technical Support Specialist,Full-Time ∙ Job,SCC Soft Computer,"Clearwater, FL","250 - 1,000",Internet & Software,"Clearwater, FL",3/13/2023 0:00,22-Aug-16,On-site,Required,$19.00 per hour,,,
3,Lab/Mic Associate Technical Support Specialist,Full-Time ∙ Job,SCC Soft Computer,"Clearwater, FL","250 - 1,000",Internet & Software,"Clearwater, FL",3/13/2023 0:00,22-Aug-16,On-site,Required,$18.00 per hour,,,
4,Computer Science Intern - Online Computer Scie...,Part-Time ∙ Internship,Coding4Youth,"Atlanta, GA\nHouston, TX\nPhiladelphia, PA\nCh...","10,000 - 25,000",Other Education,"San Jose, CA",9/29/2050 21:31,19-Sep-16,Remote,Required,$20.00 per hour,(6/5/17 - 9/12/90),,


In [506]:
df_modify.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19447 entries, 0 to 19446
Data columns (total 15 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   Job Title              19447 non-null  object
 1   Job Info               19447 non-null  object
 2   Company Name           19447 non-null  object
 3   Company Location       19447 non-null  object
 4   Employees              19447 non-null  object
 5   Industry               19447 non-null  object
 6   Headquarters           19442 non-null  object
 7   Application deadline   19447 non-null  object
 8   Posted date            19447 non-null  object
 9   Location type          19447 non-null  object
 10  US work authorization  19244 non-null  object
 11  Estimated pay          11334 non-null  object
 12  Seasonal role          4530 non-null   object
 13  Company division       3220 non-null   object
 14  Work study             1 non-null      object
dtypes: object(15)
memor

In [507]:
df_modify.shape

(19447, 15)

In [508]:
df_modify.columns

Index(['Job Title', 'Job Info', 'Company Name', 'Company Location',
       'Employees', 'Industry', 'Headquarters', 'Application deadline',
       'Posted date', 'Location type', 'US work authorization',
       'Estimated pay', 'Seasonal role', 'Company division', 'Work study'],
      dtype='object')

***STEP 1: DATA MODIFICATION***

Application deadline / Posted date

In [509]:
# Convert 'Application deadline' column to datetime format
df_modify['Application deadline'] = pd.to_datetime(df_modify['Application deadline'])

# Convert 'Posted date' column to datetime format
df_modify['Posted date'] = pd.to_datetime(df_modify['Posted date'])

# Extract the date and time components from 'Application deadline' column and create new columns
df_modify['Application deadline (date)'] = df_modify['Application deadline'].dt.date
df_modify['Application deadline (time)'] = df_modify['Application deadline'].dt.time

# Calculate the time difference between 'Application deadline (date)' and 'Posted date' columns and create a new column 'Application Window'
df_modify['Application Window (weeks)'] = pd.to_datetime(df_modify['Application deadline (date)']) - df_modify['Posted date']
# Convert 'Application Window' from timedelta64 to int and show in weeks
df_modify['Application Window (weeks)'] = df_modify['Application Window (weeks)'].dt.days // 7

# Convert 'Application deadline (date)' column to datetime format
df_modify['Application deadline (date)'] = pd.to_datetime(df_modify['Application deadline (date)'])

# Drop 'Application deadline' column since we don't need it anymore
df_modify.drop('Application deadline', axis=1, inplace=True)
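
The cell above lets pandas infer the date formats. As a minimal sketch, the same window calculation can be written with explicit formats. The two rows are copied from the preview output above; the format strings and the `sample` frame are assumptions based on those rows only and have not been checked against the full file.

```python
import pandas as pd

# Two rows taken from the head() preview (illustrative sample, not the full data)
sample = pd.DataFrame({
    "Application deadline": ["4/4/2023 15:30", "3/13/2023 0:00"],
    "Posted date": ["13-Jul-16", "22-Aug-16"],
})

# Assumed formats: month/day/year with time, and day-month abbreviation-two-digit year
deadline = pd.to_datetime(sample["Application deadline"], format="%m/%d/%Y %H:%M")
posted = pd.to_datetime(sample["Posted date"], format="%d-%b-%y")

# normalize() resets the time of day to midnight, so only calendar days are compared;
# floor division by 7 then gives the number of complete weeks
window_weeks = (deadline.dt.normalize() - posted).dt.days // 7
print(window_weeks.tolist())  # [350, 342]
```

Stating the format explicitly makes parsing fail loudly on unexpected values instead of guessing, which may be preferable for a column that mixes conventions.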

In [510]:
df_modify

Unnamed: 0,Job Title,Job Info,Company Name,Company Location,Employees,Industry,Headquarters,Posted date,Location type,US work authorization,Estimated pay,Seasonal role,Company division,Work study,Application deadline (date),Application deadline (time),Application Window (weeks)
0,Entry Level Insurance Underwriter,Full-Time ∙ Job,Auto-Owners Insurance Company,"Lakeland, FL\nIrmo, SC\nForest, VA\nLexington,...","5,000 - 10,000",Insurance,"Lansing, MI",2016-07-13,On-site,Required,,,,,2023-04-04,15:30:00,350
1,Internships,Full-Time ∙ Internship,Auto-Owners Insurance Company,"Lakeland, FL\nIrmo, SC\nCharlotte, NC\nForest,...","5,000 - 10,000",Insurance,"Lansing, MI",2016-07-13,On-site,Required,,,,,2023-04-04,15:30:00,350
2,Blood Bank Associate Technical Support Specialist,Full-Time ∙ Job,SCC Soft Computer,"Clearwater, FL","250 - 1,000",Internet & Software,"Clearwater, FL",2016-08-22,On-site,Required,$19.00 per hour,,,,2023-03-13,00:00:00,342
3,Lab/Mic Associate Technical Support Specialist,Full-Time ∙ Job,SCC Soft Computer,"Clearwater, FL","250 - 1,000",Internet & Software,"Clearwater, FL",2016-08-22,On-site,Required,$18.00 per hour,,,,2023-03-13,00:00:00,342
4,Computer Science Intern - Online Computer Scie...,Part-Time ∙ Internship,Coding4Youth,"Atlanta, GA\nHouston, TX\nPhiladelphia, PA\nCh...","10,000 - 25,000",Other Education,"San Jose, CA",2016-09-19,Remote,Required,$20.00 per hour,(6/5/17 - 9/12/90),,,2050-09-29,21:31:00,1775
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19442,Coaching Assistant - Summer camp,Full-Time ∙ Internship,Camp Skylemar,"Naples, ME",100 - 250,Sports & Leisure,"Naples, ME",2023-03-12,On-site,Accepts OPT/CPT,"$3,000 per year",(6/11/23 - 8/6/23),,,2023-04-22,00:00:00,5
19443,Videographer/Content Creator,Full-Time ∙ Internship,Camp Skylemar,"Naples, ME",100 - 250,Sports & Leisure,"Naples, ME",2023-03-12,On-site,Accepts OPT/CPT,"$3,500 per year",(6/13/23 - 8/7/23),,,2023-04-29,00:00:00,6
19444,Talent Acquisition Intern - Remote,Full-Time ∙ Internship,CONMED,"Tampa, FL","1,000 - 5,000",Medical Devices,"Largo, FL",2023-03-12,Remote,Required,$15.00-20.00 per hour,,,,2023-03-31,00:00:00,2
19445,Camp Counselor - Summer 2023,Full-Time ∙ Internship,Camp Danbee,"Peru, MA","250 - 1,000",Summer Camps/Outdoor Recreation,"101 W Main Rd, Peru, Massachusetts 01235, Unit...",2023-03-12,On-site,Will sponsor a work visa and accepts OPT/CPT,"$1,000-2,000 per month",(6/15/23 - 8/11/23),,,2023-04-30,12:00:00,7


Job Type


In [511]:
# Split the 'Job Info' column into separate columns based on the " ∙ " separator
df_modify[['Employment Type', 'Job Type', "Payment Status"]] = df_modify['Job Info'].str.split(' ∙ ', expand=True)

# Overwrite 'Payment Status' with 'Paid' when an estimated pay is given, otherwise 'Unpaid'
df_modify['Payment Status'] = df_modify['Estimated pay'].apply(lambda x: 'Paid' if pd.notnull(x) else 'Unpaid')

# Drop the original 'Job Info' column
df_modify.drop('Job Info', axis=1, inplace=True)
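
A minimal sketch of the split on two values from the preview output. With two-part values such as these, `str.split(..., expand=True)` yields two columns; the three-column assignment in the cell above suggests that some rows in the full file carry a third segment, which this sample does not reproduce. The series `job_info` and `estimated_pay` are illustrative.

```python
import pandas as pd

# Illustrative values copied from the preview output
job_info = pd.Series(["Full-Time ∙ Job", "Part-Time ∙ Internship"])
estimated_pay = pd.Series(["$19.00 per hour", None])

# One column per segment around the " ∙ " separator
parts = job_info.str.split(" ∙ ", expand=True)
parts.columns = ["Employment Type", "Job Type"]

# 'Paid' when an estimated pay is present, 'Unpaid' otherwise
payment_status = estimated_pay.notna().map({True: "Paid", False: "Unpaid"})

print(parts)
print(payment_status.tolist())  # ['Paid', 'Unpaid']
```

Note that a posting without an estimated pay is labelled 'Unpaid' by this rule even if the employer simply did not publish a figure.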


Location

In [512]:
# Calculate the number of company locations: one more than the number of "\n" separators
df_modify['Number of Location'] = df_modify['Company Location'].apply(lambda x: x.count('\n') + 1 if '\n' in x else 1)
# Replace each "\n" in 'Company Location' with a space; rows without "\n" are unchanged
df_modify['Company Location'] = df_modify['Company Location'].apply(lambda x: x.replace('\n', ' ') if '\n' in x else x)
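
The same two steps can be sketched with vectorized string methods, which make the conditional unnecessary because a string without `"\n"` has a count of zero. The `locations` series is an illustrative sample built from values in the preview output.

```python
import pandas as pd

# Illustrative sample: one single-location row and one multi-location row
locations = pd.Series(["Clearwater, FL", "Lakeland, FL\nIrmo, SC\nForest, VA"])

# Number of locations = number of newline separators + 1
n_locations = locations.str.count("\n") + 1

# Replace newlines with spaces (literal replacement, no regex)
flat = locations.str.replace("\n", " ", regex=False)

print(n_locations.tolist())  # [1, 3]
print(flat.tolist())
```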


Seasonal Role

In [513]:
# Create 'Role start date' and 'Role end date' columns from 'Seasonal role'
df_modify[['Role start date', 'Role end date']] = df_modify['Seasonal role'].str.split(' - ', expand=True)

# Remove the parentheses from 'Role start date' and 'Role end date' (literal match, not a regex)
df_modify['Role start date'] = df_modify['Role start date'].str.replace('(', '', regex=False)
df_modify['Role end date'] = df_modify['Role end date'].str.replace(')', '', regex=False)

# Convert 'Role start date' and 'Role end date' to datetime format
df_modify['Role start date'] = pd.to_datetime(df_modify['Role start date'])
df_modify['Role end date'] = pd.to_datetime(df_modify['Role end date'])

df_modify['Role Duration'] = (df_modify['Role end date'] - df_modify['Role start date']).dt.days
# Convert Role Duration from days to weeks
df_modify['Role Duration (weeks)'] = df_modify['Role Duration'].apply(lambda x: math.ceil(x/7) if pd.notna(x) else None)

# Convert 'Seasonal role' to a boolean flag (True when a seasonal period is given)
df_modify['Seasonal role'] = df_modify['Seasonal role'].notna()

df_modify.drop('Role Duration', axis=1, inplace=True)
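
As a sketch of an alternative to splitting and stripping parentheses, a single regular expression can pull both dates out of values such as `(6/11/23 - 8/6/23)`. The pattern, the `%m/%d/%y` format, and the `seasonal` sample are assumptions based on the values visible in the output. Two-digit years follow the usual `%y` convention (69 to 99 map to 1969 to 1999), which may be one source of the negative durations removed later in the notebook.

```python
import numpy as np
import pandas as pd

# Illustrative sample: two seasonal roles from the output and one non-seasonal row
seasonal = pd.Series(["(6/11/23 - 8/6/23)", "(6/15/23 - 8/11/23)", None])

# Named groups become the columns 'start' and 'end'; missing input stays NaN
dates = seasonal.str.extract(r"\((?P<start>[^ ]+) - (?P<end>[^)]+)\)")
start = pd.to_datetime(dates["start"], format="%m/%d/%y")
end = pd.to_datetime(dates["end"], format="%m/%d/%y")

# Duration in days, rounded up to whole weeks; NaN is preserved for non-seasonal rows
weeks = np.ceil((end - start).dt.days / 7)
is_seasonal = seasonal.notna()

print(weeks.tolist())        # [8.0, 9.0, nan]
print(is_seasonal.tolist())  # [True, True, False]
```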




In [514]:
df_modify.tail(15)

Unnamed: 0,Job Title,Company Name,Company Location,Employees,Industry,Headquarters,Posted date,Location type,US work authorization,Estimated pay,...,Application deadline (date),Application deadline (time),Application Window (weeks),Employment Type,Job Type,Payment Status,Number of Location,Role start date,Role end date,Role Duration (weeks)
19432,Resource Team Intern Summer 2023 (Spanish),InReach (formerly AsylumConnect),"Miami, FL Atlanta, GA Asheville, NC Nashville,...",1 - 10,Non-Profit - Other,"228 Park Ave S Suite # 90945 New York, NY 1000...",2023-03-11,Remote,Required,,...,2023-05-08,17:00:00,8,Part-Time,Internship,Unpaid,20,2023-05-22,2023-08-11,12.0
19433,Respite Counselor - Milford - Full Time,"Riverside Community Care, Inc.","Milford, MA","1,000 - 5,000",Healthcare,"270 Bridge Street, Suite 301, Dedham, Massachu...",2023-03-11,On-site,Required,,...,2023-08-31,00:00:00,24,Full-Time,Job,Unpaid,1,NaT,NaT,
19434,Respite Counselor - Milford - Relief,"Riverside Community Care, Inc.","Milford, MA","1,000 - 5,000",Healthcare,"270 Bridge Street, Suite 301, Dedham, Massachu...",2023-03-11,On-site,Required,,...,2023-08-31,00:00:00,24,Part-Time,Job,Unpaid,1,NaT,NaT,
19435,Respite Counselor - Norwood,"Riverside Community Care, Inc.","Norwood, MA","1,000 - 5,000",Healthcare,"270 Bridge Street, Suite 301, Dedham, Massachu...",2023-03-11,On-site,Required,,...,2023-08-31,00:00:00,24,Full-Time,Job,Unpaid,1,NaT,NaT,
19436,Respite Counselor - Norwood - Relief,"Riverside Community Care, Inc.","Norwood, MA","1,000 - 5,000",Healthcare,"270 Bridge Street, Suite 301, Dedham, Massachu...",2023-03-11,On-site,Required,,...,2023-08-31,00:00:00,24,Part-Time,Job,Unpaid,1,NaT,NaT,
19437,Social Media Intern Summer 2023,InReach (formerly AsylumConnect),"Miami, FL New Orleans, LA Washington, DC India...",1 - 10,Non-Profit - Other,"228 Park Ave S Suite # 90945 New York, NY 1000...",2023-03-11,Remote,Required,,...,2023-05-08,17:00:00,8,Part-Time,Internship,Unpaid,12,2023-05-22,2023-08-11,12.0
19438,Taxpayer Advocate Service (TAS) - Case Advocat...,Taxpayer Advocate Service - Internal Revenue S...,United States,"25,000+","Government - Local, State & Federal","Laguna Niguel, CA",2023-03-11,Remote,Required,"$30,000-40,000 per year",...,2023-10-11,02:00:00,30,Full-Time,Job,Paid,1,NaT,NaT,
19439,Taxpayer Advocate Service (TAS) - Case Advocat...,Taxpayer Advocate Service - Internal Revenue S...,United States,"25,000+","Government - Local, State & Federal","Laguna Niguel, CA",2023-03-11,On-site,Required,"$30,000-40,000 per year",...,2023-09-27,02:00:00,28,Full-Time,Job,Paid,1,NaT,NaT,
19440,Taxpayer Advocate Service (TAS) - Intake Advoc...,Taxpayer Advocate Service - Internal Revenue S...,United States,"25,000+","Government - Local, State & Federal","Laguna Niguel, CA",2023-03-11,Remote,Required,"$30,000-40,000 per year",...,2023-09-26,02:00:00,28,Full-Time,Job,Paid,1,NaT,NaT,
19441,Taxpayer Advocate Service (TAS) - Intake Advoc...,Taxpayer Advocate Service - Internal Revenue S...,United States,"25,000+","Government - Local, State & Federal","Laguna Niguel, CA",2023-03-11,Remote,Required,"$30,000-40,000 per year",...,2023-09-20,02:00:00,27,Full-Time,Job,Paid,1,NaT,NaT,


Estimated Salary

In [515]:
# Split the 'Estimated pay' column into two separate columns 'Pay rate' and 'Payment Period' based on the separator 'per'
df_modify[['Pay rate','Payment Period']] = df_modify['Estimated pay'].str.split('per', expand=True)

# Strip any leading or trailing white spaces in the 'Payment Period' column
df_modify['Payment Period'] = df_modify['Payment Period'].str.strip()


# Remove the dollar sign '$' from the 'Pay rate' column (literal match, not a regex)
df_modify['Pay rate'] = df_modify['Pay rate'].str.replace('$', '', regex=False)

# Extract the minimum and maximum values from 'Pay rate' and remove thousands separators.
# A single value (no '-') gives a minimum only; the maximum is filled in below.
df_modify['Min salary'] = df_modify['Pay rate'].apply(lambda x: x.split('-')[0].strip().replace(',', '') if isinstance(x, str) else None)
df_modify['Max salary'] = df_modify['Pay rate'].apply(lambda x: x.split('-')[1].strip().replace(',', '') if isinstance(x, str) and '-' in x else None)

# Convert the 'Min salary' and 'Max salary' columns to numeric data types
df_modify['Min salary'] = pd.to_numeric(df_modify['Min salary'])
df_modify['Max salary'] = pd.to_numeric(df_modify['Max salary'])

# Fill any missing values in the 'Max salary' column with the corresponding value from the 'Min salary' column
df_modify['Max salary'] = df_modify['Max salary'].fillna(df_modify['Min salary'])

# Drop the original 'Estimated pay' column from the DataFrame using the drop() method with axis=1, inplace=True
df_modify.drop('Estimated pay', axis=1, inplace=True)
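
The parsing above can also be sketched with one regular expression that captures the lower bound, the optional upper bound, and the payment period in a single pass. The pattern and the `pay` sample are assumptions derived from the pay strings visible in the output; other layouts in the full file would need their own handling.

```python
import pandas as pd

# Illustrative pay strings copied from the output, plus a missing value
pay = pd.Series(["$19.00 per hour", "$15.00-20.00 per hour",
                 "$30,000-40,000 per year", None])

# low: first amount, high: optional second amount after '-', period: text after 'per'
pattern = r"\$(?P<low>[\d,.]+)(?:-(?P<high>[\d,.]+))? per (?P<period>.+)"
parsed = pay.str.extract(pattern)

# Remove thousands separators, then convert to numbers
min_salary = pd.to_numeric(parsed["low"].str.replace(",", "", regex=False))
max_salary = pd.to_numeric(parsed["high"].str.replace(",", "", regex=False))

# A single amount means minimum and maximum are the same
max_salary = max_salary.fillna(min_salary)

print(min_salary.tolist())        # [19.0, 15.0, 30000.0, nan]
print(max_salary.tolist())        # [19.0, 20.0, 40000.0, nan]
print(parsed["period"].tolist())
```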



Work study

In [516]:
#Drop the 'Work study' column from the dataframe.
df_modify.drop('Work study', axis=1, inplace=True)

In [517]:
df_modify.tail(10)

Unnamed: 0,Job Title,Company Name,Company Location,Employees,Industry,Headquarters,Posted date,Location type,US work authorization,Seasonal role,...,Job Type,Payment Status,Number of Location,Role start date,Role end date,Role Duration (weeks),Pay rate,Payment Period,Min salary,Max salary
19437,Social Media Intern Summer 2023,InReach (formerly AsylumConnect),"Miami, FL New Orleans, LA Washington, DC India...",1 - 10,Non-Profit - Other,"228 Park Ave S Suite # 90945 New York, NY 1000...",2023-03-11,Remote,Required,True,...,Internship,Unpaid,12,2023-05-22,2023-08-11,12.0,,,,
19438,Taxpayer Advocate Service (TAS) - Case Advocat...,Taxpayer Advocate Service - Internal Revenue S...,United States,"25,000+","Government - Local, State & Federal","Laguna Niguel, CA",2023-03-11,Remote,Required,False,...,Job,Paid,1,NaT,NaT,,"30,000-40,000",year,30000.0,40000.0
19439,Taxpayer Advocate Service (TAS) - Case Advocat...,Taxpayer Advocate Service - Internal Revenue S...,United States,"25,000+","Government - Local, State & Federal","Laguna Niguel, CA",2023-03-11,On-site,Required,False,...,Job,Paid,1,NaT,NaT,,"30,000-40,000",year,30000.0,40000.0
19440,Taxpayer Advocate Service (TAS) - Intake Advoc...,Taxpayer Advocate Service - Internal Revenue S...,United States,"25,000+","Government - Local, State & Federal","Laguna Niguel, CA",2023-03-11,Remote,Required,False,...,Job,Paid,1,NaT,NaT,,"30,000-40,000",year,30000.0,40000.0
19441,Taxpayer Advocate Service (TAS) - Intake Advoc...,Taxpayer Advocate Service - Internal Revenue S...,United States,"25,000+","Government - Local, State & Federal","Laguna Niguel, CA",2023-03-11,Remote,Required,False,...,Job,Paid,1,NaT,NaT,,"30,000-40,000",year,30000.0,40000.0
19442,Coaching Assistant - Summer camp,Camp Skylemar,"Naples, ME",100 - 250,Sports & Leisure,"Naples, ME",2023-03-12,On-site,Accepts OPT/CPT,True,...,Internship,Paid,1,2023-06-11,2023-08-06,8.0,3000,year,3000.0,3000.0
19443,Videographer/Content Creator,Camp Skylemar,"Naples, ME",100 - 250,Sports & Leisure,"Naples, ME",2023-03-12,On-site,Accepts OPT/CPT,True,...,Internship,Paid,1,2023-06-13,2023-08-07,8.0,3500,year,3500.0,3500.0
19444,Talent Acquisition Intern - Remote,CONMED,"Tampa, FL","1,000 - 5,000",Medical Devices,"Largo, FL",2023-03-12,Remote,Required,False,...,Internship,Paid,1,NaT,NaT,,15.00-20.00,hour,15.0,20.0
19445,Camp Counselor - Summer 2023,Camp Danbee,"Peru, MA","250 - 1,000",Summer Camps/Outdoor Recreation,"101 W Main Rd, Peru, Massachusetts 01235, Unit...",2023-03-12,On-site,Will sponsor a work visa and accepts OPT/CPT,True,...,Internship,Paid,1,2023-06-15,2023-08-11,9.0,"1,000-2,000",month,1000.0,2000.0
19446,Outdoor Adventure & Ropes Instructor - Summer ...,Camp Danbee,"Peru, MA","250 - 1,000",Summer Camps/Outdoor Recreation,"101 W Main Rd, Peru, Massachusetts 01235, Unit...",2023-03-12,On-site,Will sponsor a work visa and accepts OPT/CPT,True,...,Internship,Paid,1,2023-06-09,2023-08-11,9.0,"2,000-3,000",month,2000.0,3000.0


***EXTRA FEATURES***

In [518]:
df_modify.columns

Index(['Job Title', 'Company Name', 'Company Location', 'Employees',
       'Industry', 'Headquarters', 'Posted date', 'Location type',
       'US work authorization', 'Seasonal role', 'Company division',
       'Application deadline (date)', 'Application deadline (time)',
       'Application Window (weeks)', 'Employment Type', 'Job Type',
       'Payment Status', 'Number of Location', 'Role start date',
       'Role end date', 'Role Duration (weeks)', 'Pay rate', 'Payment Period',
       'Min salary', 'Max salary'],
      dtype='object')

In [543]:
# Function to convert hourly/monthly pay to yearly pay (2080 = 40 hours x 52 weeks; 12 months per year)
def convert_salary(df):
    for index, row in df.iterrows():
        if row['Payment Period'] == 'hour':
            df.at[index, 'Min salary'] *= 2080
            df.at[index, 'Max salary'] *= 2080
            df.at[index, 'Payment Period'] = 'year'
        elif row['Payment Period'] == 'month':
            df.at[index, 'Min salary'] *= 12
            df.at[index, 'Max salary'] *= 12
            df.at[index, 'Payment Period'] = 'year'
    return df
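
Row-by-row iteration works but can be slow on roughly 19,000 rows. A minimal vectorized sketch of the same conversion follows, assuming the same factors as `convert_salary` (2080 for hourly pay, 12 for monthly pay). The `sample` frame and the `multipliers` mapping are illustrative names, not part of the original notebook.

```python
import pandas as pd

# Same factors as in convert_salary: 2080 hours (40 x 52) and 12 months per year
multipliers = {"hour": 2080, "month": 12}

# Illustrative sample using pay values that appear in the output
sample = pd.DataFrame({
    "Payment Period": ["hour", "month", "year", None],
    "Min salary": [19.0, 1000.0, 30000.0, None],
    "Max salary": [19.0, 2000.0, 40000.0, None],
})

# Periods without an entry in the mapping (yearly or missing) keep a factor of 1
factor = sample["Payment Period"].map(multipliers).fillna(1)
cols = ["Min salary", "Max salary"]
sample[cols] = sample[cols].mul(factor, axis=0)

# Mark the converted rows as yearly
converted = sample["Payment Period"].isin(list(multipliers))
sample.loc[converted, "Payment Period"] = "year"

print(sample)
```

Like the loop version, this is safe to run twice, because converted rows are relabelled as yearly and then receive a factor of 1.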

In [544]:
# Apply convert_salary()
df_modify = convert_salary(df_modify)

***STEP 2: DATA CLEANING***

Explore modified dataframe

In [522]:
df_modify.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19447 entries, 0 to 19446
Data columns (total 25 columns):
 #   Column                       Non-Null Count  Dtype         
---  ------                       --------------  -----         
 0   Job Title                    19447 non-null  object        
 1   Company Name                 19447 non-null  object        
 2   Company Location             19447 non-null  object        
 3   Employees                    19447 non-null  object        
 4   Industry                     19447 non-null  object        
 5   Headquarters                 19442 non-null  object        
 6   Posted date                  19447 non-null  datetime64[ns]
 7   Location type                19447 non-null  object        
 8   US work authorization        19244 non-null  object        
 9   Seasonal role                19447 non-null  bool          
 10  Company division             3220 non-null   object        
 11  Application deadline (date)  19447 non-nu

In [523]:
df_modify.describe()

Unnamed: 0,Application Window (weeks),Number of Location,Role Duration (weeks),Min salary,Max salary
count,19447.0,19447.0,4530.0,11334.0,11334.0
mean,40.375071,2.01121,21.090949,613517.5,745943.1
std,93.379761,6.797362,48.847758,10951040.0,12569660.0
min,-671.0,1.0,-1394.0,1.0,1.0
25%,7.0,1.0,10.0,31200.0,36000.0
50%,24.0,1.0,12.0,41600.0,50000.0
75%,47.0,1.0,19.0,60000.0,70000.0
max,4173.0,322.0,1601.0,832000000.0,832000000.0


In [524]:
df_modify.shape

(19447, 25)

In [525]:
df_modify.nunique()

Job Title                      16776
Company Name                    4953
Company Location                4395
Employees                         10
Industry                          74
Headquarters                    2672
Posted date                      958
Location type                      2
US work authorization              5
Seasonal role                      2
Company division                1067
Application deadline (date)      697
Application deadline (time)      305
Application Window (weeks)       380
Employment Type                    3
Job Type                           8
Payment Status                     2
Number of Location                74
Role start date                  431
Role end date                    453
Role Duration (weeks)            139
Pay rate                        2659
Payment Period                     2
Min salary                      1774
Max salary                      1872
dtype: int64

Find duplicates

In [526]:
#Drop rows with duplicated values
df_modify.drop_duplicates(inplace=True)

In [527]:
#Check number of duplicates
df_modify.duplicated().value_counts()

False    19447
dtype: int64
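
Counting duplicates before dropping them records how many rows were actually removed; in the run above the row count is 19447 both before and after, so no fully duplicated rows appear to have been present. A small sketch on hypothetical data (the `sample` frame and its values are invented for illustration):

```python
import pandas as pd

# Hypothetical rows; the first two are exact duplicates
sample = pd.DataFrame({
    "Job Title": ["Internships", "Internships", "Camp Counselor"],
    "Company Name": ["A", "A", "B"],
})

# duplicated() marks every repeat of an earlier row as True
n_dupes = int(sample.duplicated().sum())

# Drop the repeats and renumber the index
deduped = sample.drop_duplicates().reset_index(drop=True)

print(n_dupes)       # 1
print(len(deduped))  # 2
```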

Employment type

In [528]:
df_modify['Employment Type'].unique()

array(['Full-Time', 'Part-Time', 'Seasonal'], dtype=object)

In [529]:
filt1 = (df_modify['Employment Type']=='Seasonal')
df_modify.loc[filt1,['Employment Type','Seasonal role']].value_counts()

Employment Type  Seasonal role
Seasonal         True             1
dtype: int64

In [530]:
df_modify.drop(df_modify[filt1].index, inplace=True)

Payment Period

In [533]:
df_modify['Payment Period'].unique()

array([nan, 'year', 'year or more'], dtype=object)

In [542]:
# Assign NaN to the 'Max salary' column where 'Payment Period' is 'year or more'
df_modify.loc[df_modify['Payment Period'] == 'year or more', 'Max salary'] = np.nan


Role Duration (weeks)

In [536]:
df_modify.drop(df_modify[df_modify['Role Duration (weeks)'] < 0].index, inplace=True)

Application Window (weeks)

In [537]:
df_modify.drop(df_modify[df_modify['Application Window (weeks)'] < 0].index, inplace=True)
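
The two drops above can be sketched as a single boolean mask. Because a comparison with NaN evaluates to False, rows without a role duration (non-seasonal roles) are kept. The `sample` frame is illustrative; the negative values are the minimums reported by `describe()`, placed in hypothetical rows.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: valid, non-seasonal (NaN), negative duration, negative window
sample = pd.DataFrame({
    "Role Duration (weeks)": [12.0, np.nan, -1394.0, 9.0],
    "Application Window (weeks)": [8, 24, 30, -671],
})

# A row is invalid when either value is negative; NaN < 0 is False
bad = (sample["Role Duration (weeks)"] < 0) | (sample["Application Window (weeks)"] < 0)
cleaned = sample[~bad]

print(cleaned.index.tolist())  # [0, 1]
```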

***STEP 3: SAVE NEW (COMPLETE) DATAFRAME TO .csv FILE***

In [538]:
print(df.columns)
print("_________________________")
print(df_modify.columns)
print("_________________________")
print(df_modify.shape)

Index(['Job Title', 'Job Info', 'Company Name', 'Company Location',
       'Employees', 'Industry', 'Headquarters', 'Application deadline',
       'Posted date', 'Location type', 'US work authorization',
       'Estimated pay', 'Seasonal role', 'Company division', 'Work study'],
      dtype='object')
_________________________
Index(['Job Title', 'Company Name', 'Company Location', 'Employees',
       'Industry', 'Headquarters', 'Posted date', 'Location type',
       'US work authorization', 'Seasonal role', 'Company division',
       'Application deadline (date)', 'Application deadline (time)',
       'Application Window (weeks)', 'Employment Type', 'Job Type',
       'Payment Status', 'Number of Location', 'Role start date',
       'Role end date', 'Role Duration (weeks)', 'Pay rate', 'Payment Period',
       'Min salary', 'Max salary'],
      dtype='object')
_________________________
(19438, 25)


In [540]:
df_modify = df_modify[['Job Title',
                       'Company Name',
                       'Industry',
                       'Company division',
                       
                       'Posted date',
                       'Application deadline (date)', 
                       'Application deadline (time)',
                       'Application Window (weeks)',

                       'Employment Type',
                       'Job Type',
                       'Location type', 

                       'US work authorization',

                       'Seasonal role',
                       'Role start date', 
                       'Role end date',
                       'Role Duration (weeks)',

                       'Payment Status',
                       'Payment Period',
                       'Pay rate', 
                       'Min salary', 
                       'Max salary',

                       'Headquarters',
                       'Company Location',
                       'Number of Location',

                       'Employees'
                       ]]

Data Dictionary for modified table (modified_data.csv)

* **Job Title**: The title of the job position being advertised.
* **Company Name**: The name of the company offering the job.
* **Industry**: The industry in which the company operates.
* **Company division**: The division of the company that is offering the job.
* **Posted date**: The date on which the job was posted.
* **Application deadline (date)**: The date by which job applications must be submitted.
* **Application deadline (time)**: The time of day by which job applications must be submitted.
* **Application Window (weeks)**: The number of complete weeks between the posting date and the application deadline.
* **Employment Type**: Whether the job is full-time, part-time, or some other type of employment.
* **Job Type**: Whether the job is a job or an internship.
* **Location type**: Whether the job is on-site or remote.
* **US work authorization**: Whether the company requires candidates to have US work authorization.
* **Seasonal role**: Whether the job is a seasonal role.
* **Role start date**: The date on which the seasonal role starts.
* **Role end date**: The date on which the seasonal role ends.
* **Role Duration (weeks)**: The number of weeks of work for seasonal roles, rounded up.
* **Payment Status**: Whether the job is paid or unpaid.
* **Payment Period**: Describes the period for payment in the job posting (year/year or more).
* **Pay rate**: The rate at which the job pays.
* **Min salary**: The minimum salary for the job, converted to a yearly amount.
* **Max salary**: The maximum salary for the job, converted to a yearly amount.
* **Headquarters**: The location of the company's headquarters.
* **Company Location**: The location of the company offering the job.
* **Number of Location**: The number of locations listed in Company Location.
* **Employees**: Information about the size of the company, such as the number of employees.

In [541]:
df_modify.head()

Unnamed: 0,Job Title,Company Name,Industry,Company division,Posted date,Application deadline (date),Application deadline (time),Application Window (weeks),Employment Type,Job Type,...,Role Duration (weeks),Payment Status,Payment Period,Pay rate,Min salary,Max salary,Headquarters,Company Location,Number of Location,Employees
0,Entry Level Insurance Underwriter,Auto-Owners Insurance Company,Insurance,,2016-07-13,2023-04-04,15:30:00,350,Full-Time,Job,...,,Unpaid,,,,,"Lansing, MI","Lakeland, FL Irmo, SC Forest, VA Lexington, KY...",18,"5,000 - 10,000"
1,Internships,Auto-Owners Insurance Company,Insurance,,2016-07-13,2023-04-04,15:30:00,350,Full-Time,Internship,...,,Unpaid,,,,,"Lansing, MI","Lakeland, FL Irmo, SC Charlotte, NC Forest, VA...",20,"5,000 - 10,000"
2,Blood Bank Associate Technical Support Specialist,SCC Soft Computer,Internet & Software,,2016-08-22,2023-03-13,00:00:00,342,Full-Time,Job,...,,Paid,year,19.0,39520.0,39520.0,"Clearwater, FL","Clearwater, FL",1,"250 - 1,000"
3,Lab/Mic Associate Technical Support Specialist,SCC Soft Computer,Internet & Software,,2016-08-22,2023-03-13,00:00:00,342,Full-Time,Job,...,,Paid,year,18.0,37440.0,37440.0,"Clearwater, FL","Clearwater, FL",1,"250 - 1,000"
5,Instructional Design and Delivery Internship,24/7 Education,K-12 Education,,2016-09-26,2023-12-11,00:00:00,376,Part-Time,Internship,...,185.0,Unpaid,,,,,"New York City, NY","New York, NY",1,10 - 50


In [545]:
df_modify.to_csv('modified_data.csv', index=False)
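
CSV files do not store column types, so datetime columns come back as plain text unless they are parsed again on load. A minimal round-trip sketch using an in-memory buffer in place of `modified_data.csv`; the `sample` frame is illustrative and only the column names are taken from the notebook.

```python
import io
import pandas as pd

# Illustrative one-row frame with a datetime and a boolean column
sample = pd.DataFrame({
    "Job Title": ["Camp Counselor - Summer 2023"],
    "Posted date": pd.to_datetime(["2023-03-12"]),
    "Seasonal role": [True],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime dtype on load
restored = pd.read_csv(buffer, parse_dates=["Posted date"])
print(restored.dtypes)
```

When reading the real file, the same `parse_dates` list would presumably also need 'Application deadline (date)', 'Role start date', and 'Role end date'.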