# Data Science Job Postings in Glassdoor

In this project, I used the dataset of <a href="https://www.kaggle.com/datasets/rashikrahmanpritom/data-science-job-posting-on-glassdoor/data" target="_blank">Data Science Job Postings in Glassdoor</a>.

I wanted to practice data cleaning and transformation techniques using Python. I checked for missing values and duplicates, dropped unnecessary columns, created new columns from the original data, filled null values, and standardized column values using replace.

### Data Dictionary of the Clean Dataset
- **job_title_cleaned**: Title of the job posting.
- **job_level**: The seniority level of the position, derived from the job title (Senior, Junior, or Other).
- **Rating**: Glassdoor rating of the company (range from 1 to 5; -1 indicates missing data).
- **Company Name**: Name of the company offering the job.
- **Location**: Location where the job is offered.
- **Headquarters**: Location of the company's headquarters.
- **job_state**: State where the job is located, extracted from Location.
- **same_location**: Boolean column indicating whether the job location and the company's headquarters are the same place (True, False).
- **Size**: Company size based on the number of employees (e.g., 1 to 50, 51 to 200, 1001 to 5000).
- **years_of_operation**: The number of years the company has been in operation.
- **Type of ownership**: Type of ownership structure of the company (e.g., non-profit, public, private).
- **Industry**: The specific field in which the company operates (e.g., Biotech, IT Services).
- **Sector**: The broader field of work (e.g., Technology, Healthcare).
- **Revenue**: The total revenue of the company.
- **min_salary**: The minimum salary offered for the position (in thousands of USD).
- **max_salary**: The maximum salary offered for the position (in thousands of USD).
- **avg_salary**: The average of min_salary and max_salary for the job post (in thousands of USD).
- **python**: Boolean column indicating whether Python is mentioned in the job description (True, False).
- **r**: Boolean column indicating whether R is mentioned in the job description (True, False).
- **excel**: Boolean column indicating whether Excel is mentioned in the job description (True, False).
- **sql**: Boolean column indicating whether SQL is mentioned in the job description (True, False).
- **tableau**: Boolean column indicating whether Tableau is mentioned in the job description (True, False).
- **power_bi**: Boolean column indicating whether Power BI is mentioned in the job description (True, False).
- **hadoop**: Boolean column indicating whether Hadoop is mentioned in the job description (True, False).
- **spark**: Boolean column indicating whether Apache Spark is mentioned in the job description (True, False).
- **kafka**: Boolean column indicating whether Apache Kafka is mentioned in the job description (True, False).
- **aws**: Boolean column indicating whether AWS (Amazon Web Services) is mentioned in the job description (True, False).
- **big_data**: Boolean column indicating whether Big Data technologies are mentioned in the job description (True, False).
- **machine_learning**: Boolean column indicating whether machine learning is mentioned in the job description (True, False).

In [2]:
# Importing packages
import pandas as pd
import numpy as np

# Reading the data
df = pd.read_csv(r"C:\Users\User\Portfolio Projects\Data Science Job Postings Project\Uncleaned_DS_jobs.csv")
df.head()

Unnamed: 0,index,Job Title,Salary Estimate,Job Description,Rating,Company Name,Location,Headquarters,Size,Founded,Type of ownership,Industry,Sector,Revenue,Competitors
0,0,Sr Data Scientist,$137K-$171K (Glassdoor est.),Description\n\nThe Senior Data Scientist is re...,3.1,Healthfirst\n3.1,"New York, NY","New York, NY",1001 to 5000 employees,1993,Nonprofit Organization,Insurance Carriers,Insurance,Unknown / Non-Applicable,"EmblemHealth, UnitedHealth Group, Aetna"
1,1,Data Scientist,$137K-$171K (Glassdoor est.),"Secure our Nation, Ignite your Future\n\nJoin ...",4.2,ManTech\n4.2,"Chantilly, VA","Herndon, VA",5001 to 10000 employees,1968,Company - Public,Research & Development,Business Services,$1 to $2 billion (USD),-1
2,2,Data Scientist,$137K-$171K (Glassdoor est.),Overview\n\n\nAnalysis Group is one of the lar...,3.8,Analysis Group\n3.8,"Boston, MA","Boston, MA",1001 to 5000 employees,1981,Private Practice / Firm,Consulting,Business Services,$100 to $500 million (USD),-1
3,3,Data Scientist,$137K-$171K (Glassdoor est.),JOB DESCRIPTION:\n\nDo you have a passion for ...,3.5,INFICON\n3.5,"Newton, MA","Bad Ragaz, Switzerland",501 to 1000 employees,2000,Company - Public,Electrical & Electronic Manufacturing,Manufacturing,$100 to $500 million (USD),"MKS Instruments, Pfeiffer Vacuum, Agilent Tech..."
4,4,Data Scientist,$137K-$171K (Glassdoor est.),Data Scientist\nAffinity Solutions / Marketing...,2.9,Affinity Solutions\n2.9,"New York, NY","New York, NY",51 to 200 employees,1998,Company - Private,Advertising & Marketing,Business Services,Unknown / Non-Applicable,"Commerce Signals, Cardlytics, Yodlee"


In [3]:
# Checking for missing data
for col in df.columns:
    pct_missing = np.mean(df[col].isnull())
    print('{} - {}%'.format(col, round(pct_missing*100)))

index - 0%
Job Title - 0%
Salary Estimate - 0%
Job Description - 0%
Rating - 0%
Company Name - 0%
Location - 0%
Headquarters - 0%
Size - 0%
Founded - 0%
Type of ownership - 0%
Industry - 0%
Sector - 0%
Revenue - 0%
Competitors - 0%
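
The 0% figures only count true nulls. The value counts further down show that this dataset encodes missing information with placeholders such as `-1` and `Unknown`, which `isnull()` does not catch. A minimal sketch of a placeholder check, assuming a small made-up frame (`sample`) and a hypothetical helper name (`placeholder_share`):

```python
import pandas as pd

# Hypothetical miniature of the raw data; in the notebook this would be `df`.
sample = pd.DataFrame({
    "Rating": [3.1, -1.0, 4.2],
    "Size": ["51 to 200 employees", "-1", "Unknown"],
    "Revenue": ["Unknown / Non-Applicable", "-1", "$1 to $2 billion (USD)"],
})

# Placeholder values that appear in this notebook's value_counts() outputs.
PLACEHOLDERS = ["-1", "Unknown", "Unknown / Non-Applicable"]

def placeholder_share(frame: pd.DataFrame) -> pd.Series:
    """Return, per column, the percentage of cells holding a placeholder."""
    as_text = frame.astype(str)
    # A numeric -1.0 becomes the string "-1.0" after the cast, so normalise it.
    as_text = as_text.replace("-1.0", "-1")
    return as_text.isin(PLACEHOLDERS).mean().mul(100).round(1)

print(placeholder_share(sample))
```

Running the helper on the real `df` before cleaning would show which columns need the replacements applied later in the notebook.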


In [4]:
# Checking for duplicates
duplicates_check = df.duplicated().any()
duplicates_check

# Removing the duplicates
# df = df.drop_duplicates()
# df

False
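
One caveat: the raw data carries an `index` column. If it is a running row number, as the `head()` output suggests, every row is unique by construction and `df.duplicated()` can only return `False`. A minimal sketch of a check that ignores that column, using made-up rows:

```python
import pandas as pd

# Hypothetical rows: two identical postings that differ only in 'index'.
sample = pd.DataFrame({
    "index": [0, 1, 2],
    "Job Title": ["Data Scientist", "Data Scientist", "Data Engineer"],
    "Company Name": ["Acme", "Acme", "Acme"],
})

# With the unique 'index' column included, no row is a full duplicate.
with_index = int(sample.duplicated().sum())

# Comparing every column except 'index' exposes the repeated posting.
content_cols = sample.columns.drop("index").tolist()
without_index = int(sample.duplicated(subset=content_cols).sum())

print(with_index, without_index)
```

On the real data, `df.drop_duplicates(subset=content_cols)` would then remove such rows; whether any exist in this dataset is not shown in the notebook.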

In [5]:
# Checking the data types of the columns
print(df.dtypes)

index                  int64
Job Title             object
Salary Estimate       object
Job Description       object
Rating               float64
Company Name          object
Location              object
Headquarters          object
Size                  object
Founded                int64
Type of ownership     object
Industry              object
Sector                object
Revenue               object
Competitors           object
dtype: object


In [6]:
# Categorizing Job Titles and reducing those categories
def job_title_cleaning(title):
    title = title.lower()
    if 'data scientist' in title:
        return 'Data Scientist'
    # Note the leading space: titles that begin with 'Data Engineer' do not match here
    elif ' data engineer' in title:
        return 'Data Engineer'
    elif 'analyst' in title:
        return 'Analyst'
    elif 'manager' in title:
        return 'Manager'
    elif 'director' in title:
        return 'Director'
    else:
        return 'Other'

In [7]:
df['job_title_cleaned'] = df['Job Title'].apply(job_title_cleaning)
df['job_title_cleaned'].value_counts()

job_title_cleaned
Data Scientist    455
Other             138
Analyst            55
Data Engineer      14
Manager             7
Director            3
Name: count, dtype: int64
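
The same first-match-wins logic can be expressed as an ordered rule table, which makes adding or reordering categories easier. This is a sketch, not the code that produced the counts above: the names `TITLE_RULES` and `categorize_title` are hypothetical, and the keyword `data engineer` is written without a leading space, so titles that start with "Data Engineer" would be counted differently.

```python
# Ordered (keyword, label) rules; the first match wins, like an if/elif chain.
TITLE_RULES = [
    ("data scientist", "Data Scientist"),
    ("data engineer", "Data Engineer"),
    ("analyst", "Analyst"),
    ("manager", "Manager"),
    ("director", "Director"),
]

def categorize_title(title: str, default: str = "Other") -> str:
    """Map a raw job title to a coarse category."""
    lowered = title.lower()
    for keyword, label in TITLE_RULES:
        if keyword in lowered:
            return label
    return default

print(categorize_title("Sr Data Scientist"))
print(categorize_title("Data Engineer"))
```

It plugs into the notebook the same way as the original function: `df['Job Title'].apply(categorize_title)`.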

In [8]:
# Cleaning Job Titles by their levels
def level_cleaning(title):
    title = title.lower()
    if 'sr' in title or 'senior' in title or 'lead' in title or 'principal' in title:
        return 'Senior'
    elif 'jr' in title:
        return 'Junior'
    else:
        return 'Other'

In [9]:
df['job_level'] = df['Job Title'].apply(level_cleaning)
df['job_level'].value_counts()

job_level
Other     576
Senior     94
Junior      2
Name: count, dtype: int64
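
Substring checks such as `'sr' in title` also fire inside longer words (for example, "Israel" contains "sr"). A possible refinement, sketched here with hypothetical names and not used for the counts above, matches whole words only. Including `junior` as a keyword is my assumption; the notebook only checks for `jr`.

```python
import re

# \b marks a word boundary, so 'sr' will not match inside 'Israel'.
SENIOR_RE = re.compile(r"\b(sr|senior|lead|principal)\b", re.IGNORECASE)
JUNIOR_RE = re.compile(r"\b(jr|junior)\b", re.IGNORECASE)

def job_level(title: str) -> str:
    """Classify a job title as Senior, Junior, or Other using whole-word matches."""
    if SENIOR_RE.search(title):
        return "Senior"
    if JUNIOR_RE.search(title):
        return "Junior"
    return "Other"

print(job_level("Sr. Data Scientist"))
print(job_level("Data Scientist - Israel"))
```

Applying it to the real titles would likely shift a few rows between categories, so the counts would need to be recomputed.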

In [10]:
# Cleaning the 'Salary Estimate' column
char_to_replace = {'K': '', '(Glassdoor est.)': '', '(Employer est.)': '', '$':''}

# Iterating over all the key-value pairs in the char_to_replace dictionary
for key, value in char_to_replace.items():
    # Replacing each key with its value as a literal string (not a regex)
    df['Salary Estimate'] = df['Salary Estimate'].str.replace(key, value, regex=False)

df.head()

Unnamed: 0,index,Job Title,Salary Estimate,Job Description,Rating,Company Name,Location,Headquarters,Size,Founded,Type of ownership,Industry,Sector,Revenue,Competitors,job_title_cleaned,job_level
0,0,Sr Data Scientist,137-171,Description\n\nThe Senior Data Scientist is re...,3.1,Healthfirst\n3.1,"New York, NY","New York, NY",1001 to 5000 employees,1993,Nonprofit Organization,Insurance Carriers,Insurance,Unknown / Non-Applicable,"EmblemHealth, UnitedHealth Group, Aetna",Data Scientist,Senior
1,1,Data Scientist,137-171,"Secure our Nation, Ignite your Future\n\nJoin ...",4.2,ManTech\n4.2,"Chantilly, VA","Herndon, VA",5001 to 10000 employees,1968,Company - Public,Research & Development,Business Services,$1 to $2 billion (USD),-1,Data Scientist,Other
2,2,Data Scientist,137-171,Overview\n\n\nAnalysis Group is one of the lar...,3.8,Analysis Group\n3.8,"Boston, MA","Boston, MA",1001 to 5000 employees,1981,Private Practice / Firm,Consulting,Business Services,$100 to $500 million (USD),-1,Data Scientist,Other
3,3,Data Scientist,137-171,JOB DESCRIPTION:\n\nDo you have a passion for ...,3.5,INFICON\n3.5,"Newton, MA","Bad Ragaz, Switzerland",501 to 1000 employees,2000,Company - Public,Electrical & Electronic Manufacturing,Manufacturing,$100 to $500 million (USD),"MKS Instruments, Pfeiffer Vacuum, Agilent Tech...",Data Scientist,Other
4,4,Data Scientist,137-171,Data Scientist\nAffinity Solutions / Marketing...,2.9,Affinity Solutions\n2.9,"New York, NY","New York, NY",51 to 200 employees,1998,Company - Private,Advertising & Marketing,Business Services,Unknown / Non-Applicable,"Commerce Signals, Cardlytics, Yodlee",Data Scientist,Other


In [11]:
# Creating two columns from the 'Salary Estimate' column to differentiate minimum and maximum salaries
df[['min_salary', 'max_salary']] = df['Salary Estimate'].str.split('-', n=1, expand=True)
df[['min_salary', 'max_salary']] = df[['min_salary', 'max_salary']].astype(int)

# Making a new column as the average salary using the mean of 'min_salary' and 'max_salary' columns
df['avg_salary'] = df[['min_salary', 'max_salary']].mean(axis=1).astype(int)

df.head()

Unnamed: 0,index,Job Title,Salary Estimate,Job Description,Rating,Company Name,Location,Headquarters,Size,Founded,Type of ownership,Industry,Sector,Revenue,Competitors,job_title_cleaned,job_level,min_salary,max_salary,avg_salary
0,0,Sr Data Scientist,137-171,Description\n\nThe Senior Data Scientist is re...,3.1,Healthfirst\n3.1,"New York, NY","New York, NY",1001 to 5000 employees,1993,Nonprofit Organization,Insurance Carriers,Insurance,Unknown / Non-Applicable,"EmblemHealth, UnitedHealth Group, Aetna",Data Scientist,Senior,137,171,154
1,1,Data Scientist,137-171,"Secure our Nation, Ignite your Future\n\nJoin ...",4.2,ManTech\n4.2,"Chantilly, VA","Herndon, VA",5001 to 10000 employees,1968,Company - Public,Research & Development,Business Services,$1 to $2 billion (USD),-1,Data Scientist,Other,137,171,154
2,2,Data Scientist,137-171,Overview\n\n\nAnalysis Group is one of the lar...,3.8,Analysis Group\n3.8,"Boston, MA","Boston, MA",1001 to 5000 employees,1981,Private Practice / Firm,Consulting,Business Services,$100 to $500 million (USD),-1,Data Scientist,Other,137,171,154
3,3,Data Scientist,137-171,JOB DESCRIPTION:\n\nDo you have a passion for ...,3.5,INFICON\n3.5,"Newton, MA","Bad Ragaz, Switzerland",501 to 1000 employees,2000,Company - Public,Electrical & Electronic Manufacturing,Manufacturing,$100 to $500 million (USD),"MKS Instruments, Pfeiffer Vacuum, Agilent Tech...",Data Scientist,Other,137,171,154
4,4,Data Scientist,137-171,Data Scientist\nAffinity Solutions / Marketing...,2.9,Affinity Solutions\n2.9,"New York, NY","New York, NY",51 to 200 employees,1998,Company - Private,Advertising & Marketing,Business Services,Unknown / Non-Applicable,"Commerce Signals, Cardlytics, Yodlee",Data Scientist,Other,137,171,154
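
The strip-then-split steps can also be collapsed into a single regular expression that reads both numbers straight from the raw string. This is a minimal sketch under the assumption that every estimate looks like `$137K-$171K (...)`; the names `SALARY_RE` and `parse_salary` are hypothetical.

```python
import re
from typing import Tuple

# Captures the two numbers in strings such as '$137K-$171K (Glassdoor est.)'.
SALARY_RE = re.compile(r"\$(\d+)K\s*-\s*\$(\d+)K")

def parse_salary(estimate: str) -> Tuple[int, int, int]:
    """Return (min, max, avg) in thousands of USD from a raw salary estimate."""
    match = SALARY_RE.search(estimate)
    if match is None:
        raise ValueError(f"unrecognised salary format: {estimate!r}")
    low, high = int(match.group(1)), int(match.group(2))
    # Integer division truncates, like .mean().astype(int) for positive values.
    return low, high, (low + high) // 2

print(parse_salary("$137K-$171K (Glassdoor est.)"))  # (137, 171, 154)
```

Raising on an unexpected format makes malformed rows visible instead of letting them fail later in `astype(int)`.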


In [12]:
# Cleaning the 'Company Name' column
df['Company Name'] = df['Company Name'].str.replace(r'\n[0-9]\.[0-9]', '', regex=True)
df.head()

Unnamed: 0,index,Job Title,Salary Estimate,Job Description,Rating,Company Name,Location,Headquarters,Size,Founded,Type of ownership,Industry,Sector,Revenue,Competitors,job_title_cleaned,job_level,min_salary,max_salary,avg_salary
0,0,Sr Data Scientist,137-171,Description\n\nThe Senior Data Scientist is re...,3.1,Healthfirst,"New York, NY","New York, NY",1001 to 5000 employees,1993,Nonprofit Organization,Insurance Carriers,Insurance,Unknown / Non-Applicable,"EmblemHealth, UnitedHealth Group, Aetna",Data Scientist,Senior,137,171,154
1,1,Data Scientist,137-171,"Secure our Nation, Ignite your Future\n\nJoin ...",4.2,ManTech,"Chantilly, VA","Herndon, VA",5001 to 10000 employees,1968,Company - Public,Research & Development,Business Services,$1 to $2 billion (USD),-1,Data Scientist,Other,137,171,154
2,2,Data Scientist,137-171,Overview\n\n\nAnalysis Group is one of the lar...,3.8,Analysis Group,"Boston, MA","Boston, MA",1001 to 5000 employees,1981,Private Practice / Firm,Consulting,Business Services,$100 to $500 million (USD),-1,Data Scientist,Other,137,171,154
3,3,Data Scientist,137-171,JOB DESCRIPTION:\n\nDo you have a passion for ...,3.5,INFICON,"Newton, MA","Bad Ragaz, Switzerland",501 to 1000 employees,2000,Company - Public,Electrical & Electronic Manufacturing,Manufacturing,$100 to $500 million (USD),"MKS Instruments, Pfeiffer Vacuum, Agilent Tech...",Data Scientist,Other,137,171,154
4,4,Data Scientist,137-171,Data Scientist\nAffinity Solutions / Marketing...,2.9,Affinity Solutions,"New York, NY","New York, NY",51 to 200 employees,1998,Company - Private,Advertising & Marketing,Business Services,Unknown / Non-Applicable,"Commerce Signals, Cardlytics, Yodlee",Data Scientist,Other,137,171,154
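
In the rows shown, the rating always sits at the very end of the company name, after a newline. A slightly stricter variant of the pattern, sketched here with a hypothetical helper name, anchors the match to the end of the string so that nothing in the middle of a name can be removed by accident.

```python
import re

# Matches a newline followed by a one-decimal rating at the end of the string.
RATING_SUFFIX_RE = re.compile(r"\n\d\.\d$")

def clean_company_name(name: str) -> str:
    """Strip a trailing '\\n<rating>' suffix such as '\\n3.1' from a company name."""
    return RATING_SUFFIX_RE.sub("", name)

print(clean_company_name("Healthfirst\n3.1"))
print(clean_company_name("ManTech"))
```

Names without a rating suffix pass through unchanged.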


In [13]:
# Creating a state column from the 'Location' column
states = {'Patuxent, Anne Arundel, MD':'MD', 'California':'CA', 'New Jersey':'NJ', 'Texas':'TX', 'Utah':'UT'}

df['job_state'] = df['Location'].str.split(',', expand=True)[1].str.strip()
# Applying the mapping first so that full state names and the three-part 'Patuxent' location are handled
df['job_state'] = df['Location'].map(states).fillna(df['job_state'])
df.head()

Unnamed: 0,index,Job Title,Salary Estimate,Job Description,Rating,Company Name,Location,Headquarters,Size,Founded,...,Industry,Sector,Revenue,Competitors,job_title_cleaned,job_level,min_salary,max_salary,avg_salary,job_state
0,0,Sr Data Scientist,137-171,Description\n\nThe Senior Data Scientist is re...,3.1,Healthfirst,"New York, NY","New York, NY",1001 to 5000 employees,1993,...,Insurance Carriers,Insurance,Unknown / Non-Applicable,"EmblemHealth, UnitedHealth Group, Aetna",Data Scientist,Senior,137,171,154,NY
1,1,Data Scientist,137-171,"Secure our Nation, Ignite your Future\n\nJoin ...",4.2,ManTech,"Chantilly, VA","Herndon, VA",5001 to 10000 employees,1968,...,Research & Development,Business Services,$1 to $2 billion (USD),-1,Data Scientist,Other,137,171,154,VA
2,2,Data Scientist,137-171,Overview\n\n\nAnalysis Group is one of the lar...,3.8,Analysis Group,"Boston, MA","Boston, MA",1001 to 5000 employees,1981,...,Consulting,Business Services,$100 to $500 million (USD),-1,Data Scientist,Other,137,171,154,MA
3,3,Data Scientist,137-171,JOB DESCRIPTION:\n\nDo you have a passion for ...,3.5,INFICON,"Newton, MA","Bad Ragaz, Switzerland",501 to 1000 employees,2000,...,Electrical & Electronic Manufacturing,Manufacturing,$100 to $500 million (USD),"MKS Instruments, Pfeiffer Vacuum, Agilent Tech...",Data Scientist,Other,137,171,154,MA
4,4,Data Scientist,137-171,Data Scientist\nAffinity Solutions / Marketing...,2.9,Affinity Solutions,"New York, NY","New York, NY",51 to 200 employees,1998,...,Advertising & Marketing,Business Services,Unknown / Non-Applicable,"Commerce Signals, Cardlytics, Yodlee",Data Scientist,Other,137,171,154,NY
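
An alternative way to derive the state is to take the last comma-separated token and accept it only if it looks like a two-letter abbreviation. The sketch below assumes that rule is adequate for this dataset; `STATE_NAMES` and `extract_state` are hypothetical names, and "United States" and "Remote" are made-up examples of locations with no state.

```python
from typing import Optional

# Full-name locations listed in the notebook, mapped to postal abbreviations.
STATE_NAMES = {"California": "CA", "New Jersey": "NJ", "Texas": "TX", "Utah": "UT"}

def extract_state(location: str) -> Optional[str]:
    """Return the two-letter state for a location, or None if there is none."""
    if location in STATE_NAMES:
        return STATE_NAMES[location]
    # The last token also handles three-part values like 'Patuxent, Anne Arundel, MD'.
    last = location.split(",")[-1].strip()
    if len(last) == 2 and last.isalpha() and last.isupper():
        return last
    return None

print(extract_state("New York, NY"))
print(extract_state("Patuxent, Anne Arundel, MD"))
```

It can be applied with `df['Location'].apply(extract_state)`; rows that return `None` stay null, matching the nulls reported for `job_state` in `dfCleaned.info()`.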


In [14]:
# Creating a column that states if location and headquarters are in the same place
df['same_location'] = df['Location'] == df['Headquarters']
df.head()

Unnamed: 0,index,Job Title,Salary Estimate,Job Description,Rating,Company Name,Location,Headquarters,Size,Founded,...,Sector,Revenue,Competitors,job_title_cleaned,job_level,min_salary,max_salary,avg_salary,job_state,same_location
0,0,Sr Data Scientist,137-171,Description\n\nThe Senior Data Scientist is re...,3.1,Healthfirst,"New York, NY","New York, NY",1001 to 5000 employees,1993,...,Insurance,Unknown / Non-Applicable,"EmblemHealth, UnitedHealth Group, Aetna",Data Scientist,Senior,137,171,154,NY,True
1,1,Data Scientist,137-171,"Secure our Nation, Ignite your Future\n\nJoin ...",4.2,ManTech,"Chantilly, VA","Herndon, VA",5001 to 10000 employees,1968,...,Business Services,$1 to $2 billion (USD),-1,Data Scientist,Other,137,171,154,VA,False
2,2,Data Scientist,137-171,Overview\n\n\nAnalysis Group is one of the lar...,3.8,Analysis Group,"Boston, MA","Boston, MA",1001 to 5000 employees,1981,...,Business Services,$100 to $500 million (USD),-1,Data Scientist,Other,137,171,154,MA,True
3,3,Data Scientist,137-171,JOB DESCRIPTION:\n\nDo you have a passion for ...,3.5,INFICON,"Newton, MA","Bad Ragaz, Switzerland",501 to 1000 employees,2000,...,Manufacturing,$100 to $500 million (USD),"MKS Instruments, Pfeiffer Vacuum, Agilent Tech...",Data Scientist,Other,137,171,154,MA,False
4,4,Data Scientist,137-171,Data Scientist\nAffinity Solutions / Marketing...,2.9,Affinity Solutions,"New York, NY","New York, NY",51 to 200 employees,1998,...,Business Services,Unknown / Non-Applicable,"Commerce Signals, Cardlytics, Yodlee",Data Scientist,Other,137,171,154,NY,True


In [15]:
df['Size'].value_counts()

Size
51 to 200 employees        135
1001 to 5000 employees     104
1 to 50 employees           86
201 to 500 employees        85
10000+ employees            80
501 to 1000 employees       77
5001 to 10000 employees     61
-1                          27
Unknown                     17
Name: count, dtype: int64

In [16]:
# Cleaning the 'Size' column
df['Size'] = df['Size'].apply(lambda x: x.split(' employees')[0])
df['Size'] = df['Size'].replace({'-1' : 'NA', 'Unknown' : 'NA'})
df['Size']

0       1001 to 5000
1      5001 to 10000
2       1001 to 5000
3        501 to 1000
4          51 to 200
           ...      
667     1001 to 5000
668               NA
669               NA
670          1 to 50
671     1001 to 5000
Name: Size, Length: 672, dtype: object

In [17]:
# Creating a column that states the years of operation using the 'Founded' column
df.loc[:, 'years_of_operation'] = np.where(df['Founded'] != -1, 2024 - df['Founded'], -1)
df.head()

Unnamed: 0,index,Job Title,Salary Estimate,Job Description,Rating,Company Name,Location,Headquarters,Size,Founded,...,Revenue,Competitors,job_title_cleaned,job_level,min_salary,max_salary,avg_salary,job_state,same_location,years_of_operation
0,0,Sr Data Scientist,137-171,Description\n\nThe Senior Data Scientist is re...,3.1,Healthfirst,"New York, NY","New York, NY",1001 to 5000,1993,...,Unknown / Non-Applicable,"EmblemHealth, UnitedHealth Group, Aetna",Data Scientist,Senior,137,171,154,NY,True,31
1,1,Data Scientist,137-171,"Secure our Nation, Ignite your Future\n\nJoin ...",4.2,ManTech,"Chantilly, VA","Herndon, VA",5001 to 10000,1968,...,$1 to $2 billion (USD),-1,Data Scientist,Other,137,171,154,VA,False,56
2,2,Data Scientist,137-171,Overview\n\n\nAnalysis Group is one of the lar...,3.8,Analysis Group,"Boston, MA","Boston, MA",1001 to 5000,1981,...,$100 to $500 million (USD),-1,Data Scientist,Other,137,171,154,MA,True,43
3,3,Data Scientist,137-171,JOB DESCRIPTION:\n\nDo you have a passion for ...,3.5,INFICON,"Newton, MA","Bad Ragaz, Switzerland",501 to 1000,2000,...,$100 to $500 million (USD),"MKS Instruments, Pfeiffer Vacuum, Agilent Tech...",Data Scientist,Other,137,171,154,MA,False,24
4,4,Data Scientist,137-171,Data Scientist\nAffinity Solutions / Marketing...,2.9,Affinity Solutions,"New York, NY","New York, NY",51 to 200,1998,...,Unknown / Non-Applicable,"Commerce Signals, Cardlytics, Yodlee",Data Scientist,Other,137,171,154,NY,True,26
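
The year 2024 is hard-coded in the cell above. A small sketch that turns it into a parameter keeps the calculation reproducible while making the reference year explicit; the function name `years_of_operation` and its `reference_year` argument are hypothetical.

```python
import pandas as pd

def years_of_operation(founded: pd.Series, reference_year: int = 2024) -> pd.Series:
    """Years since founding; -1 (unknown founding year) is passed through."""
    years = reference_year - founded
    # Keep the computed value where the founding year is known, else -1.
    return years.where(founded != -1, -1)

print(years_of_operation(pd.Series([1993, -1, 2000])).tolist())  # [31, -1, 24]
```

With the default `reference_year=2024` this reproduces the values shown above (for example, 1993 gives 31).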


In [18]:
df['Type of ownership'].value_counts()

Type of ownership
Company - Private                 397
Company - Public                  153
Nonprofit Organization             36
Subsidiary or Business Segment     28
-1                                 27
Government                         10
Other Organization                  5
Private Practice / Firm             4
Unknown                             4
College / University                3
Self-employed                       2
Contract                            2
Hospital                            1
Name: count, dtype: int64

In [19]:
# Cleaning the 'Type of ownership' column
df['Type of ownership'] = df['Type of ownership'].replace({'-1' : 'Other Organization', 'Unknown' : 'Other Organization'})
df['Type of ownership'].value_counts()

Type of ownership
Company - Private                 397
Company - Public                  153
Nonprofit Organization             36
Other Organization                 36
Subsidiary or Business Segment     28
Government                         10
Private Practice / Firm             4
College / University                3
Self-employed                       2
Contract                            2
Hospital                            1
Name: count, dtype: int64

In [20]:
df['Industry'].value_counts()

Industry
-1                                          71
Biotech & Pharmaceuticals                   66
IT Services                                 61
Computer Hardware & Software                57
Aerospace & Defense                         46
Enterprise Software & Network Solutions     43
Consulting                                  38
Staffing & Outsourcing                      36
Insurance Carriers                          28
Internet                                    27
Advertising & Marketing                     23
Health Care Services & Hospitals            21
Research & Development                      17
Federal Agencies                            16
Investment Banking & Asset Management       13
Banks & Credit Unions                        8
Lending                                      8
Energy                                       5
Consumer Products Manufacturing              5
Telecommunications Services                  5
Insurance Agencies & Brokerages              4
Food

In [21]:
# Cleaning the 'Industry' column
df['Industry'] = df['Industry'].replace({'-1' : 'Other'})
df['Industry'].value_counts()

Industry
Other                                       71
Biotech & Pharmaceuticals                   66
IT Services                                 61
Computer Hardware & Software                57
Aerospace & Defense                         46
Enterprise Software & Network Solutions     43
Consulting                                  38
Staffing & Outsourcing                      36
Insurance Carriers                          28
Internet                                    27
Advertising & Marketing                     23
Health Care Services & Hospitals            21
Research & Development                      17
Federal Agencies                            16
Investment Banking & Asset Management       13
Banks & Credit Unions                        8
Lending                                      8
Energy                                       5
Consumer Products Manufacturing              5
Telecommunications Services                  5
Insurance Agencies & Brokerages              4
Food

In [22]:
df['Revenue'].value_counts()

Revenue
Unknown / Non-Applicable            213
$100 to $500 million (USD)           94
$10+ billion (USD)                   63
$2 to $5 billion (USD)               45
$10 to $25 million (USD)             41
$1 to $2 billion (USD)               36
$25 to $50 million (USD)             36
$50 to $100 million (USD)            31
$1 to $5 million (USD)               31
-1                                   27
$500 million to $1 billion (USD)     19
$5 to $10 million (USD)              14
Less than $1 million (USD)           14
$5 to $10 billion (USD)               8
Name: count, dtype: int64

In [23]:
# Cleaning the 'Revenue' column
df['Revenue'] = df['Revenue'].replace({'-1' : 'NA', 'Unknown / Non-Applicable' : 'NA'})
df['Revenue'].value_counts()

Revenue
NA                                  240
$100 to $500 million (USD)           94
$10+ billion (USD)                   63
$2 to $5 billion (USD)               45
$10 to $25 million (USD)             41
$1 to $2 billion (USD)               36
$25 to $50 million (USD)             36
$50 to $100 million (USD)            31
$1 to $5 million (USD)               31
$500 million to $1 billion (USD)     19
$5 to $10 million (USD)              14
Less than $1 million (USD)           14
$5 to $10 billion (USD)               8
Name: count, dtype: int64

In [58]:
# Creating columns that flag whether each skill is mentioned in the job description
skill_patterns = {
    'python': r'\bpython\b',
    'r': r'\br\b',
    'excel': r'\bexcel\b',
    'sql': r'\bsql\b',
    'tableau': r'\btableau\b',
    'power_bi': r'\bpower bi\b',
    'hadoop': r'\bhadoop\b',
    'spark': r'\bspark\b',
    'kafka': r'\bkafka\b',
    'aws': r'\baws\b',
    'big_data': r'\bbig data\b',
    'machine_learning': r'\bmachine learning\b',
}
for skill, pattern in skill_patterns.items():
    # Case-insensitive whole-word match; the result is already a boolean Series
    df[skill] = df['Job Description'].str.contains(pattern, case=False, regex=True)

# Skills count
skills_count = pd.DataFrame({skill: df[skill].value_counts() for skill in skill_patterns})
skills_count

KeyError: 'Job Description'
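
The `KeyError` above is most likely the result of re-running this cell after `Job Description` had already been dropped, which happens in the next cell; the later outputs show that the skill columns were created successfully. A minimal sketch of a guarded helper, with the hypothetical names `SKILL_PATTERNS` and `add_skill_flags` and only three of the skills for brevity:

```python
import pandas as pd

# Subset of the patterns used above; extend the dict for the other skills.
SKILL_PATTERNS = {"python": r"\bpython\b", "sql": r"\bsql\b", "r": r"\br\b"}

def add_skill_flags(frame: pd.DataFrame, text_col: str = "Job Description") -> pd.DataFrame:
    """Return a copy of `frame` with one boolean column per skill."""
    if text_col not in frame.columns:
        raise KeyError(f"{text_col!r} is missing; create the flags before dropping it")
    out = frame.copy()
    for skill, pattern in SKILL_PATTERNS.items():
        out[skill] = out[text_col].str.contains(pattern, case=False, regex=True)
    return out

# Made-up descriptions for illustration.
sample = pd.DataFrame({"Job Description": ["Python and SQL required", "Experience with R"]})
flags = add_skill_flags(sample)
print(flags[["python", "sql", "r"]])
```

Because the helper returns a copy and checks for the text column first, re-running it never depends on whether a later cell has already modified `df`.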

In [25]:
# Dropping unnecessary columns
df = df.drop(columns = ['index', 'Job Title', 'Salary Estimate', 'Job Description', 'Founded', 'Competitors'])

# Reset index
df = df.reset_index(drop=True)
df.head()

Unnamed: 0,Rating,Company Name,Location,Headquarters,Size,Type of ownership,Industry,Sector,Revenue,job_title_cleaned,...,excel,sql,tableau,power_bi,hadoop,spark,kafka,aws,big_data,machine_learning
0,3.1,Healthfirst,"New York, NY","New York, NY",1001 to 5000,Nonprofit Organization,Insurance Carriers,Insurance,,Data Scientist,...,False,False,False,False,False,False,False,True,False,True
1,4.2,ManTech,"Chantilly, VA","Herndon, VA",5001 to 10000,Company - Public,Research & Development,Business Services,$1 to $2 billion (USD),Data Scientist,...,False,True,False,False,True,False,False,False,True,True
2,3.8,Analysis Group,"Boston, MA","Boston, MA",1001 to 5000,Private Practice / Firm,Consulting,Business Services,$100 to $500 million (USD),Data Scientist,...,False,False,False,False,False,False,False,True,False,True
3,3.5,INFICON,"Newton, MA","Bad Ragaz, Switzerland",501 to 1000,Company - Public,Electrical & Electronic Manufacturing,Manufacturing,$100 to $500 million (USD),Data Scientist,...,False,True,False,False,False,False,False,False,False,True
4,2.9,Affinity Solutions,"New York, NY","New York, NY",51 to 200,Company - Private,Advertising & Marketing,Business Services,,Data Scientist,...,False,True,False,False,False,False,False,False,False,True


In [26]:
cols = df.columns.tolist()
cols

['Rating',
 'Company Name',
 'Location',
 'Headquarters',
 'Size',
 'Type of ownership',
 'Industry',
 'Sector',
 'Revenue',
 'job_title_cleaned',
 'job_level',
 'min_salary',
 'max_salary',
 'avg_salary',
 'job_state',
 'same_location',
 'years_of_operation',
 'python',
 'r',
 'excel',
 'sql',
 'tableau',
 'power_bi',
 'hadoop',
 'spark',
 'kafka',
 'aws',
 'big_data',
 'machine_learning']

In [27]:
# Rearrange the columns of the dataframe
dfCleaned = df.copy()

cols = ['job_title_cleaned', 'job_level', 'Rating', 'Company Name', 'Location', 'Headquarters', 'job_state', 'same_location', 
        'Size', 'years_of_operation', 'Type of ownership', 'Industry', 'Sector', 'Revenue', 'min_salary', 'max_salary', 
        'avg_salary', 'python', 'r', 'excel', 'sql', 'tableau', 'power_bi', 'hadoop', 'spark', 'kafka', 'aws', 'big_data', 'machine_learning']
dfCleaned = dfCleaned[cols]
dfCleaned.head()

Unnamed: 0,job_title_cleaned,job_level,Rating,Company Name,Location,Headquarters,job_state,same_location,Size,years_of_operation,...,excel,sql,tableau,power_bi,hadoop,spark,kafka,aws,big_data,machine_learning
0,Data Scientist,Senior,3.1,Healthfirst,"New York, NY","New York, NY",NY,True,1001 to 5000,31,...,False,False,False,False,False,False,False,True,False,True
1,Data Scientist,Other,4.2,ManTech,"Chantilly, VA","Herndon, VA",VA,False,5001 to 10000,56,...,False,True,False,False,True,False,False,False,True,True
2,Data Scientist,Other,3.8,Analysis Group,"Boston, MA","Boston, MA",MA,True,1001 to 5000,43,...,False,False,False,False,False,False,False,True,False,True
3,Data Scientist,Other,3.5,INFICON,"Newton, MA","Bad Ragaz, Switzerland",MA,False,501 to 1000,24,...,False,True,False,False,False,False,False,False,False,True
4,Data Scientist,Other,2.9,Affinity Solutions,"New York, NY","New York, NY",NY,True,51 to 200,26,...,False,True,False,False,False,False,False,False,False,True


In [28]:
dfCleaned.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 672 entries, 0 to 671
Data columns (total 29 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   job_title_cleaned   672 non-null    object 
 1   job_level           672 non-null    object 
 2   Rating              672 non-null    float64
 3   Company Name        672 non-null    object 
 4   Location            672 non-null    object 
 5   Headquarters        672 non-null    object 
 6   job_state           655 non-null    object 
 7   same_location       672 non-null    bool   
 8   Size                672 non-null    object 
 9   years_of_operation  672 non-null    int64  
 10  Type of ownership   672 non-null    object 
 11  Industry            672 non-null    object 
 12  Sector              672 non-null    object 
 13  Revenue             672 non-null    object 
 14  min_salary          672 non-null    int32  
 15  max_salary          672 non-null    int32  
 16  avg_sala

In [29]:
# Saving the dataframe into a CSV
dfCleaned.to_csv(r"C:\Users\User\Portfolio Projects\Data Science Job Postings Project\Cleaned_DS_jobs.csv", index=False)
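
One thing to keep in mind when reading the saved file back: by default, `pd.read_csv` interprets the literal string `NA` as a missing value, so the `'NA'` labels written into `Size` and `Revenue` come back as `NaN`. The sketch below demonstrates this with an in-memory buffer and made-up rows; `keep_default_na=False` is one way to preserve the strings.

```python
import io
import pandas as pd

# Made-up rows that use the same 'NA' label as the cleaned dataset.
cleaned = pd.DataFrame({
    "Size": ["51 to 200", "NA"],
    "Revenue": ["NA", "$1 to $2 billion (USD)"],
})

buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)

# Default parsing treats the literal string 'NA' as a missing value.
buffer.seek(0)
default_read = pd.read_csv(buffer)

# keep_default_na=False preserves 'NA' as an ordinary string.
buffer.seek(0)
literal_read = pd.read_csv(buffer, keep_default_na=False)

print(default_read["Size"].isna().sum(), literal_read["Size"].isna().sum())
```

Either behavior can be what you want; the point is to choose it deliberately when loading `Cleaned_DS_jobs.csv` for analysis.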