## Import Data

In [1]:
import pandas as pd

In [2]:
pd.set_option('display.max_rows', None)
dataframe = pd.read_csv('data-science.csv')
dataframe

Unnamed: 0.1,Unnamed: 0,Company Name,Rating,Job Title,Location,Job Description,Salary Estimate,Size,Type of Ownership,Sector,Industry,Founded,Revenue
0,0,Jun Group\n4.2,4.2,Data Scientist (Remote),"New York, NY",Do you hate advertising? We do too. Our view i...,$80K - $126K (Glassdoor est.),51 to 200 Employees,Company - Private,Information Technology,Internet & Web Services,2005.0,Unknown / Non-Applicable
1,1,Liberty Mutual Insurance\n3.9,3.9,Junior Data Scientist REMOTE,Remote,The Advanced Analytics and Modeling department...,Employer Provided Salary:$99K - $115K,10000+ Employees,Company - Private,Insurance,Insurance Carriers,1912.0,$10+ billion (USD)
2,2,the NBA\n4.2,4.2,Data Scientist,"New York, NY","At the NBA, we’re passionate about growing and...",$97K - $141K (Glassdoor est.),1001 to 5000 Employees,Company - Private,"Arts, Entertainment & Recreation",Sports & Recreation,1946.0,$5 to $10 billion (USD)
3,3,EXL Services\n3.7,3.7,Fully Remote Data Scientist,Remote,Company Overview and Culture\nEXL (NASDAQ: EXL...,,10000+ Employees,Company - Public,Management & Consulting,Business Consulting,1999.0,$500 million to $1 billion (USD)
4,4,KAYAK\n4.4,4.4,Data Scientist,"Boston, MA","KAYAK, part of Booking Holdings (NASDAQ: BKNG)...",$91K - $124K (Glassdoor est.),1001 to 5000 Employees,Subsidiary or Business Segment,Information Technology,Internet & Web Services,2004.0,$100 to $500 million (USD)
5,5,Bose\n3.8,3.8,Data Engineer/Scientist – Development Program,Remote,"Job Description\nBose is about better sound, b...",,5001 to 10000 Employees,Company - Private,Manufacturing,Consumer Product Manufacturing,1964.0,$2 to $5 billion (USD)
6,6,Kaizen Dynamics,,Data Scientist,Remote,"Assist with complex data cleaning, integration...",Employer Provided Salary:$40 - $50 Per Hour,Unknown,Company - Public,,,,Unknown / Non-Applicable
7,7,LexisNexis\n3.9,3.9,Junior Data Scientist (Remote),"Washington, DC",Who We Are:\nKnowable is the market leader in ...,$67K - $104K (Glassdoor est.),5001 to 10000 Employees,Company - Public,Management & Consulting,Business Consulting,1818.0,$2 to $5 billion (USD)
8,8,Visa\n4.0,4.0,Data Scientist,"San Francisco, CA",Company Description\nVisa is a world leader in...,$106K - $146K (Glassdoor est.),10000+ Employees,Company - Public,Information Technology,Information Technology Support Services,1958.0,$10+ billion (USD)
9,9,Longevity InTime\n3.9,3.9,Computer Vision ML Engineer Data Scientist,Remote,"3 years ago, we founded a biotech company in t...",Employer Provided Salary:$100K - $150K,1 to 50 Employees,Company - Public,Healthcare,Health Care Services & Hospitals,,Unknown / Non-Applicable


In [3]:
dataframe.shape

(900, 13)

## Clean the Duplicate Values

I noticed that Glassdoor lists the same job title from the same company more than once in the job search results, so the scraped data contains repeated rows. I cleaned this up by dropping the rows that are duplicated on the subset Company Name, Job Title, and Job Description.

In [4]:
df = dataframe.drop_duplicates(subset =['Company Name','Job Title','Job Description'], keep = 'first', inplace = False)

In [5]:
df.shape

(242, 13)

In [6]:
df.duplicated().sum()

0

Now the dataset has 0 duplicate rows and contains 242 rows and 13 columns.
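
To make the effect of subset and keep='first' concrete, here is a minimal sketch on a made-up three-row frame. The rows are invented for illustration and are not taken from the scraped data.

```python
import pandas as pd

# Toy rows, invented for illustration only.
toy = pd.DataFrame({
    'Company Name': ['Acme', 'Acme', 'Acme'],
    'Job Title': ['Data Scientist', 'Data Scientist', 'Data Analyst'],
    'Job Description': ['Build models', 'Build models', 'Build reports'],
    'Location': ['Remote', 'New York, NY', 'Remote'],
})

# Rows 0 and 1 agree on all three subset columns, so only row 0 is kept,
# even though their Location values differ.
deduped = toy.drop_duplicates(
    subset=['Company Name', 'Job Title', 'Job Description'], keep='first'
)
print(deduped.shape)
```

Note that duplicated() without a subset compares whole rows. Passing the same subset to duplicated() checks the criterion that was actually used for dropping.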

## Salary Parsing

Remove the rows that have NaN in the Salary Estimate column.

In [7]:
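# .copy() makes df an independent frame, so adding columns later does not trigger SettingWithCopyWarning
df = df[df['Salary Estimate'].notna()].copy()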
df

Unnamed: 0.1,Unnamed: 0,Company Name,Rating,Job Title,Location,Job Description,Salary Estimate,Size,Type of Ownership,Sector,Industry,Founded,Revenue
0,0,Jun Group\n4.2,4.2,Data Scientist (Remote),"New York, NY",Do you hate advertising? We do too. Our view i...,$80K - $126K (Glassdoor est.),51 to 200 Employees,Company - Private,Information Technology,Internet & Web Services,2005.0,Unknown / Non-Applicable
1,1,Liberty Mutual Insurance\n3.9,3.9,Junior Data Scientist REMOTE,Remote,The Advanced Analytics and Modeling department...,Employer Provided Salary:$99K - $115K,10000+ Employees,Company - Private,Insurance,Insurance Carriers,1912.0,$10+ billion (USD)
2,2,the NBA\n4.2,4.2,Data Scientist,"New York, NY","At the NBA, we’re passionate about growing and...",$97K - $141K (Glassdoor est.),1001 to 5000 Employees,Company - Private,"Arts, Entertainment & Recreation",Sports & Recreation,1946.0,$5 to $10 billion (USD)
4,4,KAYAK\n4.4,4.4,Data Scientist,"Boston, MA","KAYAK, part of Booking Holdings (NASDAQ: BKNG)...",$91K - $124K (Glassdoor est.),1001 to 5000 Employees,Subsidiary or Business Segment,Information Technology,Internet & Web Services,2004.0,$100 to $500 million (USD)
6,6,Kaizen Dynamics,,Data Scientist,Remote,"Assist with complex data cleaning, integration...",Employer Provided Salary:$40 - $50 Per Hour,Unknown,Company - Public,,,,Unknown / Non-Applicable
7,7,LexisNexis\n3.9,3.9,Junior Data Scientist (Remote),"Washington, DC",Who We Are:\nKnowable is the market leader in ...,$67K - $104K (Glassdoor est.),5001 to 10000 Employees,Company - Public,Management & Consulting,Business Consulting,1818.0,$2 to $5 billion (USD)
8,8,Visa\n4.0,4.0,Data Scientist,"San Francisco, CA",Company Description\nVisa is a world leader in...,$106K - $146K (Glassdoor est.),10000+ Employees,Company - Public,Information Technology,Information Technology Support Services,1958.0,$10+ billion (USD)
9,9,Longevity InTime\n3.9,3.9,Computer Vision ML Engineer Data Scientist,Remote,"3 years ago, we founded a biotech company in t...",Employer Provided Salary:$100K - $150K,1 to 50 Employees,Company - Public,Healthcare,Health Care Services & Hospitals,,Unknown / Non-Applicable
10,10,DSMH LLC,,Data Scientist,Remote,"Required:\n*BS in information technology, comp...",Employer Provided Salary:$60 - $65 Per Hour,1 to 50 Employees,Company - Private,,,,Unknown / Non-Applicable
11,11,Fathom Management LLC\n2.0,2.0,Data Scientist 100% REMOTE Opportunity,Remote,"Data Scientist\nSalary range of $110,000 to $1...",Employer Provided Salary:$110K - $130K,1 to 50 Employees,Self-employed,,,,Unknown / Non-Applicable


Add hourly and employer_provided flag columns, so that hourly rates can later be converted to an annual rate.

In [8]:
df['hourly'] = df['Salary Estimate'].apply(lambda x: 1 if 'per hour' in x.lower() else 0)
df['employer_provided'] = df['Salary Estimate'].apply(lambda x: 1 if 'employer provided salary' in x.lower() else 0)
df

Unnamed: 0.1,Unnamed: 0,Company Name,Rating,Job Title,Location,Job Description,Salary Estimate,Size,Type of Ownership,Sector,Industry,Founded,Revenue,hourly,employer_provided
0,0,Jun Group\n4.2,4.2,Data Scientist (Remote),"New York, NY",Do you hate advertising? We do too. Our view i...,$80K - $126K (Glassdoor est.),51 to 200 Employees,Company - Private,Information Technology,Internet & Web Services,2005.0,Unknown / Non-Applicable,0,0
1,1,Liberty Mutual Insurance\n3.9,3.9,Junior Data Scientist REMOTE,Remote,The Advanced Analytics and Modeling department...,Employer Provided Salary:$99K - $115K,10000+ Employees,Company - Private,Insurance,Insurance Carriers,1912.0,$10+ billion (USD),0,1
2,2,the NBA\n4.2,4.2,Data Scientist,"New York, NY","At the NBA, we’re passionate about growing and...",$97K - $141K (Glassdoor est.),1001 to 5000 Employees,Company - Private,"Arts, Entertainment & Recreation",Sports & Recreation,1946.0,$5 to $10 billion (USD),0,0
4,4,KAYAK\n4.4,4.4,Data Scientist,"Boston, MA","KAYAK, part of Booking Holdings (NASDAQ: BKNG)...",$91K - $124K (Glassdoor est.),1001 to 5000 Employees,Subsidiary or Business Segment,Information Technology,Internet & Web Services,2004.0,$100 to $500 million (USD),0,0
6,6,Kaizen Dynamics,,Data Scientist,Remote,"Assist with complex data cleaning, integration...",Employer Provided Salary:$40 - $50 Per Hour,Unknown,Company - Public,,,,Unknown / Non-Applicable,1,1
7,7,LexisNexis\n3.9,3.9,Junior Data Scientist (Remote),"Washington, DC",Who We Are:\nKnowable is the market leader in ...,$67K - $104K (Glassdoor est.),5001 to 10000 Employees,Company - Public,Management & Consulting,Business Consulting,1818.0,$2 to $5 billion (USD),0,0
8,8,Visa\n4.0,4.0,Data Scientist,"San Francisco, CA",Company Description\nVisa is a world leader in...,$106K - $146K (Glassdoor est.),10000+ Employees,Company - Public,Information Technology,Information Technology Support Services,1958.0,$10+ billion (USD),0,0
9,9,Longevity InTime\n3.9,3.9,Computer Vision ML Engineer Data Scientist,Remote,"3 years ago, we founded a biotech company in t...",Employer Provided Salary:$100K - $150K,1 to 50 Employees,Company - Public,Healthcare,Health Care Services & Hospitals,,Unknown / Non-Applicable,0,1
10,10,DSMH LLC,,Data Scientist,Remote,"Required:\n*BS in information technology, comp...",Employer Provided Salary:$60 - $65 Per Hour,1 to 50 Employees,Company - Private,,,,Unknown / Non-Applicable,1,1
11,11,Fathom Management LLC\n2.0,2.0,Data Scientist 100% REMOTE Opportunity,Remote,"Data Scientist\nSalary range of $110,000 to $1...",Employer Provided Salary:$110K - $130K,1 to 50 Employees,Self-employed,,,,Unknown / Non-Applicable,0,1


Remove the text 'per hour' and 'employer provided salary' from the 'Salary Estimate' field, along with the parenthesized part, the K, and the dollar sign.

In [9]:
salary = df['Salary Estimate'].apply(lambda x: x.split('(')[0])
minus_Kd = salary.apply(lambda x: x.replace('K','').replace('$',''))
min_hrPS = minus_Kd.apply(lambda x: x.lower().replace('per hour','').replace('employer provided salary:',''))
min_hrPS

0       80 - 126 
1        99 - 115
2       97 - 141 
4       91 - 124 
6        40 - 50 
7       67 - 104 
8      106 - 146 
9       100 - 150
10       60 - 65 
11      110 - 130
12     127 - 180 
16      70 - 104 
17            25 
18        65 - 80
19      81 - 114 
20        76 - 88
21       64 - 94 
22      115 - 130
23      78 - 119 
24      97 - 137 
27       50 - 60 
28      94 - 138 
29       88 - 220
31       99 - 115
32      80 - 126 
35        50 - 65
36     106 - 146 
38      91 - 124 
41      85 - 119 
44     127 - 180 
46       67 - 120
47       38 - 58 
51      77 - 115 
53       46 - 73 
55       90 - 130
56     100 - 142 
57      70 - 104 
66     106 - 146 
70      77 - 115 
73       90 - 135
79      93 - 134 
80      67 - 100 
88       70 - 110
91       99 - 115
101      67 - 120
103     81 - 114 
111     77 - 115 
118      75 - 150
139      46 - 73 
144      61 - 96 
146     86 - 119 
147     66 - 113 
158      61 - 96 
174     66 - 113 
193     70 - 104 
205     66

Add min_salary, max_salary, and avg_salary columns.

In [10]:
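# A single value such as '25' has no '-', so it serves as both bounds.
# Values are in thousands of USD, or USD per hour for the hourly rows.
df['min_salary'] = min_hrPS.apply(lambda x: int(x.split('-')[0]) if '-' in x else int(x))
df['max_salary'] = min_hrPS.apply(lambda x: int(x.split('-')[1]) if '-' in x else int(x))
df['avg_salary'] = (df.min_salary + df.max_salary)/2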
df

Unnamed: 0.1,Unnamed: 0,Company Name,Rating,Job Title,Location,Job Description,Salary Estimate,Size,Type of Ownership,Sector,Industry,Founded,Revenue,hourly,employer_provided,min_salary,max_salary,avg_salary
0,0,Jun Group\n4.2,4.2,Data Scientist (Remote),"New York, NY",Do you hate advertising? We do too. Our view i...,$80K - $126K (Glassdoor est.),51 to 200 Employees,Company - Private,Information Technology,Internet & Web Services,2005.0,Unknown / Non-Applicable,0,0,80,126,103.0
1,1,Liberty Mutual Insurance\n3.9,3.9,Junior Data Scientist REMOTE,Remote,The Advanced Analytics and Modeling department...,Employer Provided Salary:$99K - $115K,10000+ Employees,Company - Private,Insurance,Insurance Carriers,1912.0,$10+ billion (USD),0,1,99,115,107.0
2,2,the NBA\n4.2,4.2,Data Scientist,"New York, NY","At the NBA, we’re passionate about growing and...",$97K - $141K (Glassdoor est.),1001 to 5000 Employees,Company - Private,"Arts, Entertainment & Recreation",Sports & Recreation,1946.0,$5 to $10 billion (USD),0,0,97,141,119.0
4,4,KAYAK\n4.4,4.4,Data Scientist,"Boston, MA","KAYAK, part of Booking Holdings (NASDAQ: BKNG)...",$91K - $124K (Glassdoor est.),1001 to 5000 Employees,Subsidiary or Business Segment,Information Technology,Internet & Web Services,2004.0,$100 to $500 million (USD),0,0,91,124,107.5
6,6,Kaizen Dynamics,,Data Scientist,Remote,"Assist with complex data cleaning, integration...",Employer Provided Salary:$40 - $50 Per Hour,Unknown,Company - Public,,,,Unknown / Non-Applicable,1,1,40,50,45.0
7,7,LexisNexis\n3.9,3.9,Junior Data Scientist (Remote),"Washington, DC",Who We Are:\nKnowable is the market leader in ...,$67K - $104K (Glassdoor est.),5001 to 10000 Employees,Company - Public,Management & Consulting,Business Consulting,1818.0,$2 to $5 billion (USD),0,0,67,104,85.5
8,8,Visa\n4.0,4.0,Data Scientist,"San Francisco, CA",Company Description\nVisa is a world leader in...,$106K - $146K (Glassdoor est.),10000+ Employees,Company - Public,Information Technology,Information Technology Support Services,1958.0,$10+ billion (USD),0,0,106,146,126.0
9,9,Longevity InTime\n3.9,3.9,Computer Vision ML Engineer Data Scientist,Remote,"3 years ago, we founded a biotech company in t...",Employer Provided Salary:$100K - $150K,1 to 50 Employees,Company - Public,Healthcare,Health Care Services & Hospitals,,Unknown / Non-Applicable,0,1,100,150,125.0
10,10,DSMH LLC,,Data Scientist,Remote,"Required:\n*BS in information technology, comp...",Employer Provided Salary:$60 - $65 Per Hour,1 to 50 Employees,Company - Private,,,,Unknown / Non-Applicable,1,1,60,65,62.5
11,11,Fathom Management LLC\n2.0,2.0,Data Scientist 100% REMOTE Opportunity,Remote,"Data Scientist\nSalary range of $110,000 to $1...",Employer Provided Salary:$110K - $130K,1 to 50 Employees,Self-employed,,,,Unknown / Non-Applicable,0,1,110,130,120.0
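
In the table above the hourly rows (for example 40, 50, 45.0) are still in dollars per hour, while the other rows are in thousands of dollars per year. A minimal sketch of putting both on the same scale is shown here. The function name parse_salary and the constant HOURS_PER_YEAR are hypothetical, the 2,080 hours per year figure is an assumption (40 hours a week for 52 weeks), and the sketch assumes the salary numbers are whole numbers, as they are in the output shown.

```python
import re

HOURS_PER_YEAR = 2080  # assumption: 40 hours/week * 52 weeks

def parse_salary(text):
    """Return (min, max, avg) in thousands of USD per year."""
    is_hourly = 'per hour' in text.lower()
    # Drop the "(Glassdoor est.)" suffix, then collect every whole number.
    numbers = [int(n) for n in re.findall(r'\d+', text.split('(')[0])]
    low, high = numbers[0], numbers[-1]  # a single value serves as both bounds
    if is_hourly:
        # Dollars per hour -> thousands of dollars per year.
        low = low * HOURS_PER_YEAR / 1000
        high = high * HOURS_PER_YEAR / 1000
    return low, high, (low + high) / 2

print(parse_salary('$80K - $126K (Glassdoor est.)'))  # (80, 126, 103.0)
print(parse_salary('Employer Provided Salary:$40 - $50 Per Hour'))
```

Applied to the frame, this could look like df['Salary Estimate'].apply(parse_salary), with the resulting tuples unpacked into the three salary columns.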


## Company Name Text Only

In [11]:
df['company_txt'] = df.apply(lambda x: x['Company Name'] if x['Rating']<0 else x['Company Name'][:-4], axis = 1)
df

Unnamed: 0.1,Unnamed: 0,Company Name,Rating,Job Title,Location,Job Description,Salary Estimate,Size,Type of Ownership,Sector,Industry,Founded,Revenue,hourly,employer_provided,min_salary,max_salary,avg_salary,company_txt
0,0,Jun Group\n4.2,4.2,Data Scientist (Remote),"New York, NY",Do you hate advertising? We do too. Our view i...,$80K - $126K (Glassdoor est.),51 to 200 Employees,Company - Private,Information Technology,Internet & Web Services,2005.0,Unknown / Non-Applicable,0,0,80,126,103.0,Jun Group
1,1,Liberty Mutual Insurance\n3.9,3.9,Junior Data Scientist REMOTE,Remote,The Advanced Analytics and Modeling department...,Employer Provided Salary:$99K - $115K,10000+ Employees,Company - Private,Insurance,Insurance Carriers,1912.0,$10+ billion (USD),0,1,99,115,107.0,Liberty Mutual Insurance
2,2,the NBA\n4.2,4.2,Data Scientist,"New York, NY","At the NBA, we’re passionate about growing and...",$97K - $141K (Glassdoor est.),1001 to 5000 Employees,Company - Private,"Arts, Entertainment & Recreation",Sports & Recreation,1946.0,$5 to $10 billion (USD),0,0,97,141,119.0,the NBA
4,4,KAYAK\n4.4,4.4,Data Scientist,"Boston, MA","KAYAK, part of Booking Holdings (NASDAQ: BKNG)...",$91K - $124K (Glassdoor est.),1001 to 5000 Employees,Subsidiary or Business Segment,Information Technology,Internet & Web Services,2004.0,$100 to $500 million (USD),0,0,91,124,107.5,KAYAK
6,6,Kaizen Dynamics,,Data Scientist,Remote,"Assist with complex data cleaning, integration...",Employer Provided Salary:$40 - $50 Per Hour,Unknown,Company - Public,,,,Unknown / Non-Applicable,1,1,40,50,45.0,Kaizen Dyna
7,7,LexisNexis\n3.9,3.9,Junior Data Scientist (Remote),"Washington, DC",Who We Are:\nKnowable is the market leader in ...,$67K - $104K (Glassdoor est.),5001 to 10000 Employees,Company - Public,Management & Consulting,Business Consulting,1818.0,$2 to $5 billion (USD),0,0,67,104,85.5,LexisNexis
8,8,Visa\n4.0,4.0,Data Scientist,"San Francisco, CA",Company Description\nVisa is a world leader in...,$106K - $146K (Glassdoor est.),10000+ Employees,Company - Public,Information Technology,Information Technology Support Services,1958.0,$10+ billion (USD),0,0,106,146,126.0,Visa
9,9,Longevity InTime\n3.9,3.9,Computer Vision ML Engineer Data Scientist,Remote,"3 years ago, we founded a biotech company in t...",Employer Provided Salary:$100K - $150K,1 to 50 Employees,Company - Public,Healthcare,Health Care Services & Hospitals,,Unknown / Non-Applicable,0,1,100,150,125.0,Longevity InTime
10,10,DSMH LLC,,Data Scientist,Remote,"Required:\n*BS in information technology, comp...",Employer Provided Salary:$60 - $65 Per Hour,1 to 50 Employees,Company - Private,,,,Unknown / Non-Applicable,1,1,60,65,62.5,DSMH
11,11,Fathom Management LLC\n2.0,2.0,Data Scientist 100% REMOTE Opportunity,Remote,"Data Scientist\nSalary range of $110,000 to $1...",Employer Provided Salary:$110K - $130K,1 to 50 Employees,Self-employed,,,,Unknown / Non-Applicable,0,1,110,130,120.0,Fathom Management LLC
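
In the output above, the two rows with no rating come out as "Kaizen Dyna" and "DSMH". A missing rating is NaN, the comparison NaN < 0 is False, and so the last four characters are cut off anyway. A sketch of an alternative is to split on the newline that separates the name from the rating. The helper name company_text is hypothetical, and the sketch assumes that a company name never contains a newline itself.

```python
def company_text(name):
    """Keep the part of 'Company Name' before the newline that precedes the rating."""
    return name.split('\n')[0].strip()

print(company_text('Jun Group\n4.2'))   # Jun Group
print(company_text('Kaizen Dynamics'))  # Kaizen Dynamics
```

It could be applied with df['Company Name'].apply(company_text), which does not depend on the Rating column at all.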


## State Field

In [12]:
df['job_state'] = df['Location'].apply(lambda x: x.split(',')[1] if ',' in x else x)
df.job_state.value_counts()

Remote       34
 CA          33
 NY          24
 TX          16
 WA          12
 PA          11
 VA          10
 CO          10
 MO           7
 IL           6
 MD           5
 UT           5
 DC           4
 IN           4
 FL           4
 NC           3
 MA           3
 GA           2
 IA           2
 NJ           2
 TN           2
 NE           2
Maryland      1
Manhattan     1
 RI           1
 KS           1
 WI           1
 MN           1
Tennessee     1
 CT           1
Name: job_state, dtype: int64
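
The counts above keep the space that follows the comma (' CA'), and the locations with no comma ('Maryland', 'Manhattan', 'Tennessee') are counted apart from their state codes. A minimal sketch of a normalizing version follows. The helper name job_state and the mapping are hypothetical, and filing 'Manhattan' under NY is an assumption about what the posting meant.

```python
# Assumed mapping for the comma-less locations seen in the counts above.
FULL_NAME_TO_CODE = {'Maryland': 'MD', 'Tennessee': 'TN', 'Manhattan': 'NY'}

def job_state(location):
    """Return a trimmed state code, or the location itself (e.g. 'Remote')."""
    if ',' in location:
        # Same element as the original lambda, with surrounding spaces removed.
        return location.split(',')[1].strip()
    location = location.strip()
    return FULL_NAME_TO_CODE.get(location, location)

print(job_state('New York, NY'))  # NY
print(job_state('Maryland'))      # MD
print(job_state('Remote'))        # Remote
```

With df['Location'].apply(job_state), the MD, TN, and NY counts would each absorb one of the stray rows.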

## Age of Company

In [13]:
df['age'] = df['Founded'].apply(lambda x: x if x <1 else 2022-x)
df

Unnamed: 0.1,Unnamed: 0,Company Name,Rating,Job Title,Location,Job Description,Salary Estimate,Size,Type of Ownership,Sector,...,Founded,Revenue,hourly,employer_provided,min_salary,max_salary,avg_salary,company_txt,job_state,age
0,0,Jun Group\n4.2,4.2,Data Scientist (Remote),"New York, NY",Do you hate advertising? We do too. Our view i...,$80K - $126K (Glassdoor est.),51 to 200 Employees,Company - Private,Information Technology,...,2005.0,Unknown / Non-Applicable,0,0,80,126,103.0,Jun Group,NY,17.0
1,1,Liberty Mutual Insurance\n3.9,3.9,Junior Data Scientist REMOTE,Remote,The Advanced Analytics and Modeling department...,Employer Provided Salary:$99K - $115K,10000+ Employees,Company - Private,Insurance,...,1912.0,$10+ billion (USD),0,1,99,115,107.0,Liberty Mutual Insurance,Remote,110.0
2,2,the NBA\n4.2,4.2,Data Scientist,"New York, NY","At the NBA, we’re passionate about growing and...",$97K - $141K (Glassdoor est.),1001 to 5000 Employees,Company - Private,"Arts, Entertainment & Recreation",...,1946.0,$5 to $10 billion (USD),0,0,97,141,119.0,the NBA,NY,76.0
4,4,KAYAK\n4.4,4.4,Data Scientist,"Boston, MA","KAYAK, part of Booking Holdings (NASDAQ: BKNG)...",$91K - $124K (Glassdoor est.),1001 to 5000 Employees,Subsidiary or Business Segment,Information Technology,...,2004.0,$100 to $500 million (USD),0,0,91,124,107.5,KAYAK,MA,18.0
6,6,Kaizen Dynamics,,Data Scientist,Remote,"Assist with complex data cleaning, integration...",Employer Provided Salary:$40 - $50 Per Hour,Unknown,Company - Public,,...,,Unknown / Non-Applicable,1,1,40,50,45.0,Kaizen Dyna,Remote,
7,7,LexisNexis\n3.9,3.9,Junior Data Scientist (Remote),"Washington, DC",Who We Are:\nKnowable is the market leader in ...,$67K - $104K (Glassdoor est.),5001 to 10000 Employees,Company - Public,Management & Consulting,...,1818.0,$2 to $5 billion (USD),0,0,67,104,85.5,LexisNexis,DC,204.0
8,8,Visa\n4.0,4.0,Data Scientist,"San Francisco, CA",Company Description\nVisa is a world leader in...,$106K - $146K (Glassdoor est.),10000+ Employees,Company - Public,Information Technology,...,1958.0,$10+ billion (USD),0,0,106,146,126.0,Visa,CA,64.0
9,9,Longevity InTime\n3.9,3.9,Computer Vision ML Engineer Data Scientist,Remote,"3 years ago, we founded a biotech company in t...",Employer Provided Salary:$100K - $150K,1 to 50 Employees,Company - Public,Healthcare,...,,Unknown / Non-Applicable,0,1,100,150,125.0,Longevity InTime,Remote,
10,10,DSMH LLC,,Data Scientist,Remote,"Required:\n*BS in information technology, comp...",Employer Provided Salary:$60 - $65 Per Hour,1 to 50 Employees,Company - Private,,...,,Unknown / Non-Applicable,1,1,60,65,62.5,DSMH,Remote,
11,11,Fathom Management LLC\n2.0,2.0,Data Scientist 100% REMOTE Opportunity,Remote,"Data Scientist\nSalary range of $110,000 to $1...",Employer Provided Salary:$110K - $130K,1 to 50 Employees,Self-employed,,...,,Unknown / Non-Applicable,0,1,110,130,120.0,Fathom Management LLC,Remote,


## Parsing of Job Description

Popular tools that Data Scientist job postings ask for:

1. Python

In [14]:
df['python'] = df['Job Description'].apply(lambda x: 1 if 'python' in x.lower() else 0)
df.python.value_counts()

1    153
0     56
Name: python, dtype: int64

2. R Studio

In [15]:
df['r_studio'] = df['Job Description'].apply(lambda x: 1 if 'r studio' in x.lower() or ' r ' in x.lower() else 0)
df.r_studio.value_counts()

0    167
1     42
Name: r_studio, dtype: int64

3. Spark

In [16]:
df['spark'] = df['Job Description'].apply(lambda x: 1 if 'spark' in x.lower() else 0)
df.spark.value_counts()

0    182
1     27
Name: spark, dtype: int64

4. AWS

In [17]:
df['aws'] = df['Job Description'].apply(lambda x: 1 if 'aws' in x.lower() else 0)
df.aws.value_counts()

0    151
1     58
Name: aws, dtype: int64

5. Excel

In [18]:
df['excel'] = df['Job Description'].apply(lambda x: 1 if 'excel' in x.lower() else 0)
df.excel.value_counts()

0    115
1     94
Name: excel, dtype: int64
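
Plain substring checks like the ones above can over-count or under-count: 'excel' is also a substring of 'excellent', and ' r ' misses mentions such as 'R,' or 'R/'. A sketch of a stricter check using word boundaries from the standard re module is given here. The helper name mentions is hypothetical, and whether the counts above are actually affected has not been verified against the data.

```python
import re

def mentions(keyword, text):
    """True if `keyword` appears as a whole word (case-insensitive)."""
    pattern = r'\b' + re.escape(keyword) + r'\b'
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

print(mentions('excel', 'Excellent communication skills'))  # False
print(mentions('excel', 'Proficiency in Excel and SQL'))    # True
print(mentions('r', 'Experience with Python, R, or SAS'))   # True
```

A flag column could then be built with df['Job Description'].apply(lambda x: int(mentions('excel', x))). A single-letter keyword like 'r' can still match unrelated text such as 'R&D', so those results deserve a manual spot check.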

In [19]:
df_out = df.drop(['Unnamed: 0'], axis = 1)
df_out

Unnamed: 0,Company Name,Rating,Job Title,Location,Job Description,Salary Estimate,Size,Type of Ownership,Sector,Industry,...,max_salary,avg_salary,company_txt,job_state,age,python,r_studio,spark,aws,excel
0,Jun Group\n4.2,4.2,Data Scientist (Remote),"New York, NY",Do you hate advertising? We do too. Our view i...,$80K - $126K (Glassdoor est.),51 to 200 Employees,Company - Private,Information Technology,Internet & Web Services,...,126,103.0,Jun Group,NY,17.0,1,0,0,0,1
1,Liberty Mutual Insurance\n3.9,3.9,Junior Data Scientist REMOTE,Remote,The Advanced Analytics and Modeling department...,Employer Provided Salary:$99K - $115K,10000+ Employees,Company - Private,Insurance,Insurance Carriers,...,115,107.0,Liberty Mutual Insurance,Remote,110.0,0,0,0,0,0
2,the NBA\n4.2,4.2,Data Scientist,"New York, NY","At the NBA, we’re passionate about growing and...",$97K - $141K (Glassdoor est.),1001 to 5000 Employees,Company - Private,"Arts, Entertainment & Recreation",Sports & Recreation,...,141,119.0,the NBA,NY,76.0,1,1,0,0,1
4,KAYAK\n4.4,4.4,Data Scientist,"Boston, MA","KAYAK, part of Booking Holdings (NASDAQ: BKNG)...",$91K - $124K (Glassdoor est.),1001 to 5000 Employees,Subsidiary or Business Segment,Information Technology,Internet & Web Services,...,124,107.5,KAYAK,MA,18.0,1,0,0,0,0
6,Kaizen Dynamics,,Data Scientist,Remote,"Assist with complex data cleaning, integration...",Employer Provided Salary:$40 - $50 Per Hour,Unknown,Company - Public,,,...,50,45.0,Kaizen Dyna,Remote,,1,0,0,0,1
7,LexisNexis\n3.9,3.9,Junior Data Scientist (Remote),"Washington, DC",Who We Are:\nKnowable is the market leader in ...,$67K - $104K (Glassdoor est.),5001 to 10000 Employees,Company - Public,Management & Consulting,Business Consulting,...,104,85.5,LexisNexis,DC,204.0,1,0,1,1,0
8,Visa\n4.0,4.0,Data Scientist,"San Francisco, CA",Company Description\nVisa is a world leader in...,$106K - $146K (Glassdoor est.),10000+ Employees,Company - Public,Information Technology,Information Technology Support Services,...,146,126.0,Visa,CA,64.0,1,0,0,0,1
9,Longevity InTime\n3.9,3.9,Computer Vision ML Engineer Data Scientist,Remote,"3 years ago, we founded a biotech company in t...",Employer Provided Salary:$100K - $150K,1 to 50 Employees,Company - Public,Healthcare,Health Care Services & Hospitals,...,150,125.0,Longevity InTime,Remote,,1,0,0,0,0
10,DSMH LLC,,Data Scientist,Remote,"Required:\n*BS in information technology, comp...",Employer Provided Salary:$60 - $65 Per Hour,1 to 50 Employees,Company - Private,,,...,65,62.5,DSMH,Remote,,1,0,0,1,0
11,Fathom Management LLC\n2.0,2.0,Data Scientist 100% REMOTE Opportunity,Remote,"Data Scientist\nSalary range of $110,000 to $1...",Employer Provided Salary:$110K - $130K,1 to 50 Employees,Self-employed,,,...,130,120.0,Fathom Management LLC,Remote,,0,0,0,0,0


In [20]:
df_out.to_csv('Salary Data Cleaned.csv', index = False)

In [21]:
pd.read_csv('Salary Data Cleaned.csv')

Unnamed: 0,Company Name,Rating,Job Title,Location,Job Description,Salary Estimate,Size,Type of Ownership,Sector,Industry,...,max_salary,avg_salary,company_txt,job_state,age,python,r_studio,spark,aws,excel
0,Jun Group\n4.2,4.2,Data Scientist (Remote),"New York, NY",Do you hate advertising? We do too. Our view i...,$80K - $126K (Glassdoor est.),51 to 200 Employees,Company - Private,Information Technology,Internet & Web Services,...,126,103.0,Jun Group,NY,17.0,1,0,0,0,1
1,Liberty Mutual Insurance\n3.9,3.9,Junior Data Scientist REMOTE,Remote,The Advanced Analytics and Modeling department...,Employer Provided Salary:$99K - $115K,10000+ Employees,Company - Private,Insurance,Insurance Carriers,...,115,107.0,Liberty Mutual Insurance,Remote,110.0,0,0,0,0,0
2,the NBA\n4.2,4.2,Data Scientist,"New York, NY","At the NBA, we’re passionate about growing and...",$97K - $141K (Glassdoor est.),1001 to 5000 Employees,Company - Private,"Arts, Entertainment & Recreation",Sports & Recreation,...,141,119.0,the NBA,NY,76.0,1,1,0,0,1
3,KAYAK\n4.4,4.4,Data Scientist,"Boston, MA","KAYAK, part of Booking Holdings (NASDAQ: BKNG)...",$91K - $124K (Glassdoor est.),1001 to 5000 Employees,Subsidiary or Business Segment,Information Technology,Internet & Web Services,...,124,107.5,KAYAK,MA,18.0,1,0,0,0,0
4,Kaizen Dynamics,,Data Scientist,Remote,"Assist with complex data cleaning, integration...",Employer Provided Salary:$40 - $50 Per Hour,Unknown,Company - Public,,,...,50,45.0,Kaizen Dyna,Remote,,1,0,0,0,1
5,LexisNexis\n3.9,3.9,Junior Data Scientist (Remote),"Washington, DC",Who We Are:\nKnowable is the market leader in ...,$67K - $104K (Glassdoor est.),5001 to 10000 Employees,Company - Public,Management & Consulting,Business Consulting,...,104,85.5,LexisNexis,DC,204.0,1,0,1,1,0
6,Visa\n4.0,4.0,Data Scientist,"San Francisco, CA",Company Description\nVisa is a world leader in...,$106K - $146K (Glassdoor est.),10000+ Employees,Company - Public,Information Technology,Information Technology Support Services,...,146,126.0,Visa,CA,64.0,1,0,0,0,1
7,Longevity InTime\n3.9,3.9,Computer Vision ML Engineer Data Scientist,Remote,"3 years ago, we founded a biotech company in t...",Employer Provided Salary:$100K - $150K,1 to 50 Employees,Company - Public,Healthcare,Health Care Services & Hospitals,...,150,125.0,Longevity InTime,Remote,,1,0,0,0,0
8,DSMH LLC,,Data Scientist,Remote,"Required:\n*BS in information technology, comp...",Employer Provided Salary:$60 - $65 Per Hour,1 to 50 Employees,Company - Private,,,...,65,62.5,DSMH,Remote,,1,0,0,1,0
9,Fathom Management LLC\n2.0,2.0,Data Scientist 100% REMOTE Opportunity,Remote,"Data Scientist\nSalary range of $110,000 to $1...",Employer Provided Salary:$110K - $130K,1 to 50 Employees,Self-employed,,,...,130,120.0,Fathom Management LLC,Remote,,0,0,0,0,0
