# Data Cleaning Notebook

This notebook cleans the data in the DataAnalyst.csv file. Many of the techniques are similar to those used by Ken Jee in his tutorial; I found them extremely useful, clear, and effective, which is why I use them here. I created and used different features than he did, and I analyzed more nuanced aspects of the data analyst position, such as the degrees required and how they relate to the average salary.

In [2]:
import pandas as pd 
import numpy as np 

In [3]:
df=pd.read_csv('DataAnalyst.csv')

In [4]:
df.sample(20)

Unnamed: 0.1,Unnamed: 0,Job Title,Salary Estimate,Job Description,Rating,Company Name,Location,Headquarters,Size,Founded,Type of ownership,Industry,Sector,Revenue,Competitors,Easy Apply
2005,2005,Data Analyst,$65K-$120K (Glassdoor est.),Job Description\nEssential Functions:\n\n• Des...,2.5,Enclipse\n2.5,"Oakland, CA","Minneapolis, MN",51 to 200 employees,-1,Company - Private,IT Services,Information Technology,$5 to $10 million (USD),"Accenture, Deloitte",-1
769,769,Data Analyst,$73K-$82K (Glassdoor est.),Job Description\nJob description\n\n• Interpre...,5.0,"Staffigo Technical Services, LLC\n5.0","Chicago, IL","Woodridge, IL",51 to 200 employees,2008,Company - Private,IT Services,Information Technology,$50 to $100 million (USD),-1,-1
69,69,Data/Reporting Analyst,$51K-$88K (Glassdoor est.),Position Summary\nA data management/analyst wh...,3.9,Diverse Lynx\n3.9,"New York, NY","Princeton, NJ",501 to 1000 employees,2002,Company - Private,IT Services,Information Technology,$100 to $500 million (USD),-1,-1
1433,1433,Data Management Analyst,$48K-$88K (Glassdoor est.),Principal AccountabilityReporting to the Manag...,3.2,Community Hospital Corporation\n3.2,"Plano, TX","Plano, TX",1001 to 5000 employees,1996,Company - Private,Health Care Services & Hospitals,Health Care,Unknown / Non-Applicable,-1,-1
317,317,"Data Management, Firmwide Data Quality, Analyst",$42K-$74K (Glassdoor est.),About the Chief Data Office\n\nThe FDM Data Qu...,3.9,J.P. Morgan\n3.9,"Newark, NJ","New York, NY",10000+ employees,1799,Company - Public,Investment Banking & Asset Management,Finance,$10+ billion (USD),-1,-1
696,696,Music Copyright Data Analyst,$113K-$132K (Glassdoor est.),Music Copyright Data Analyst position availabl...,3.3,AppleOne\n3.3,"Woodland Hills, CA","Glendale, CA",1001 to 5000 employees,1964,Company - Private,Staffing & Outsourcing,Business Services,$1 to $2 billion (USD),"Kelly, Manpower",-1
1260,1260,Junior Data Analyst,$76K-$122K (Glassdoor est.),Job Description\nJob description\nInterpret da...,5.0,"Staffigo Technical Services, LLC\n5.0","San Diego, CA","Woodridge, IL",51 to 200 employees,2008,Company - Private,IT Services,Information Technology,$50 to $100 million (USD),-1,-1
2058,2058,Data and Policy Analyst - Writer/Coordinator,$47K-$74K (Glassdoor est.),"Job Description\n\nAt Acumen, LLC / The SPHERE...",3.2,Acumen\n3.2,"Burlingame, CA","Burlingame, CA",201 to 500 employees,1996,Company - Private,Federal Agencies,Government,$10 to $25 million (USD),Acumen,-1
695,695,Data Analyst-HEDIS & Star,$113K-$132K (Glassdoor est.),"Data Analyst– HEDIS& Star\nLong Beach, CA\nFul...",3.2,SCAN Health Plan\n3.2,"Long Beach, CA","Long Beach, CA",501 to 1000 employees,1977,Nonprofit Organization,Insurance Carriers,Insurance,$2 to $5 billion (USD),-1,-1
1828,1828,Senior Data Analyst,$54K-$75K (Glassdoor est.),Req ID: 94674\n\nNTT DATA Services strives to ...,3.4,NTT DATA Corporation\n3.4,"Charlotte, NC","Tokyo, Japan",10000+ employees,1967,Company - Public,IT Services,Information Technology,$10+ billion (USD),"Capgemini, Accenture, Deloitte",-1


In [5]:
#get more succinct job titles. Idea for such a function taken from Ken Jee's data science 
#tutorial 
def job_simplifier(title):
    #lowercase once so every check below is case-insensitive
    title = title.lower()
    if 'data analyst' in title or 'analytics' in title:
        return 'data analyst'
    #'data science' also covers 'data science analyst'
    elif 'data scientist' in title or 'data science' in title:
        return 'data scientist'
    elif 'business analyst' in title:
        return 'business analyst'
    elif 'data management' in title:
        return 'data management'
    elif 'data warehouse' in title:
        return 'data warehouse enginner'
    elif 'data engineer' in title:
        return 'data engineer'
    elif 'security analyst' in title:
        return 'data security analyst'
    elif 'risk' in title:
        return 'risk analyst'
    else: 
        return 'other'
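
The same rules can also be written as an ordered table, which makes the priority between labels easier to see. This is only a sketch: `TITLE_RULES` and `simplify_title` are hypothetical names that do not appear in the notebook, and the labels are spelled exactly as `job_simplifier` returns them so that the value counts would line up.

```python
# Ordered (label, keywords) rules; the first rule with a matching keyword wins.
# TITLE_RULES and simplify_title are hypothetical names used for illustration.
TITLE_RULES = [
    ('data analyst', ('data analyst', 'analytics')),
    ('data scientist', ('data scientist', 'data science')),
    ('business analyst', ('business analyst',)),
    ('data management', ('data management',)),
    ('data warehouse enginner', ('data warehouse',)),  # label spelled as in the notebook
    ('data engineer', ('data engineer',)),
    ('data security analyst', ('security analyst',)),
    ('risk analyst', ('risk',)),
]

def simplify_title(title):
    """Return the label of the first rule whose keyword occurs in the title."""
    lowered = title.lower()
    for label, keywords in TITLE_RULES:
        if any(keyword in lowered for keyword in keywords):
            return label
    return 'other'
```

Because the list is ordered, a title such as 'Senior Data Science Analyst' falls through the first rule (it does not contain 'data analyst') and is picked up by the second one.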

In [6]:
#Idea for such a function taken from Ken Jee's data science 
#tutorial
def seniority(title):
    #lowercase once; these are substring checks, so 'jr' also covers 'jr.'
    title = title.lower()
    if 'sr' in title or 'senior' in title or 'lead' in title or 'principal' in title:
        return 'senior'
    elif 'jr' in title or 'junior' in title:
        return 'jr'
    else:
        return 'na'
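
Since `seniority` relies on substring checks, a keyword such as 'lead' would also match inside a longer word like 'leadership'. A possible variant, sketched here with the standard `re` module, matches whole words only. The names `SENIOR_PATTERN`, `JUNIOR_PATTERN` and `seniority_by_word` are hypothetical, and this variant may produce different counts from the ones recorded in this notebook.

```python
import re

# \b marks a word boundary, so 'lead' matches 'Lead-' but not 'Leadership'.
SENIOR_PATTERN = re.compile(r'\b(sr|senior|lead|principal)\b', re.IGNORECASE)
JUNIOR_PATTERN = re.compile(r'\b(jr|junior)\b', re.IGNORECASE)

def seniority_by_word(title):
    """Classify a job title as 'senior', 'jr' or 'na' using whole-word matches."""
    if SENIOR_PATTERN.search(title):
        return 'senior'
    if JUNIOR_PATTERN.search(title):
        return 'jr'
    return 'na'
```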

In [7]:
df['simplified_title']=df['Job Title'].apply(job_simplifier)
df.simplified_title.value_counts()

data analyst               1708
other                       363
business analyst             50
data management              39
data scientist               38
data warehouse enginner      25
data security analyst        12
data engineer                10
risk analyst                  8
Name: simplified_title, dtype: int64

In [8]:
#get the junior and senior titles 
df['seniority'] = df['Job Title'].apply(seniority)
df.seniority.value_counts()

na        1699
senior     481
jr          73
Name: seniority, dtype: int64

In [9]:
pd.DataFrame(df.loc[df.simplified_title=='other']['Job Title'].value_counts())

Unnamed: 0,Job Title
Data Quality Analyst,17
Data Governance Analyst,16
Data Reporting Analyst,13
NY Healthcare Data/Reporting Analyst,5
Healthcare Data/Reporting Analyst,4
...,...
"Lead Analyst, Data Mgmt / Quant Analysis",1
Data Systems Analyst III,1
Data/Business Solutions Analyst - Jr,1
Data Quality Control Analyst,1


In [10]:
string='Senior Data Science Analyst'
'data analyst' in string.lower()

False

In [11]:
#get the average salary 
#looking at the salary ranges for the jobs 
df['Salary Estimate'].value_counts()

$42K-$76K (Glassdoor est.)    57
$41K-$78K (Glassdoor est.)    57
$50K-$86K (Glassdoor est.)    41
$35K-$67K (Glassdoor est.)    33
$43K-$76K (Glassdoor est.)    31
                              ..
$47K-$81K (Glassdoor est.)     3
$43K-$77K (Glassdoor est.)     3
$36K-$67K (Glassdoor est.)     3
$57K-$70K (Glassdoor est.)     2
-1                             1
Name: Salary Estimate, Length: 90, dtype: int64

In [12]:
#check for hourly jobs: no salary estimate mentions 'per hour' 
df.loc[df['Salary Estimate'].str.contains('per hour', case=False)]

Unnamed: 0.1,Unnamed: 0,Job Title,Salary Estimate,Job Description,Rating,Company Name,Location,Headquarters,Size,Founded,Type of ownership,Industry,Sector,Revenue,Competitors,Easy Apply,simplified_title,seniority


In [13]:
#check for employer provided salaries: no salary estimate mentions 'employer provided salary' 
df.loc[df['Salary Estimate'].str.contains('employer provided salary', case=False)]

Unnamed: 0.1,Unnamed: 0,Job Title,Salary Estimate,Job Description,Rating,Company Name,Location,Headquarters,Size,Founded,Type of ownership,Industry,Sector,Revenue,Competitors,Easy Apply,simplified_title,seniority


In [14]:
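#keep only the rows that have a numeric salary estimate ('-1' marks a missing one) 
#.copy() makes the filtered frame independent, so the column assignments below do not act on a view 
df=df[df['Salary Estimate']!='-1'].copy()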
df['Salary Estimate'].value_counts()

$42K-$76K (Glassdoor est.)    57
$41K-$78K (Glassdoor est.)    57
$50K-$86K (Glassdoor est.)    41
$35K-$67K (Glassdoor est.)    33
$43K-$76K (Glassdoor est.)    31
                              ..
$42K-$63K (Glassdoor est.)     4
$47K-$81K (Glassdoor est.)     3
$43K-$77K (Glassdoor est.)     3
$36K-$67K (Glassdoor est.)     3
$57K-$70K (Glassdoor est.)     2
Name: Salary Estimate, Length: 89, dtype: int64

In [15]:
#parsing out the salary to get the average 

#just getting the salary range from the text inside the 'Salary Estimate' column
df['salary_range']=df['Salary Estimate'].apply(lambda x: x.split('(')[0])

#get the 'K' values out to find the average 
df['salary_range']=df['salary_range'].apply(lambda x: x.replace('K', ""))

#get the minimum salary 
df['min_salary']=df['salary_range'].apply(lambda x: x.split('-')[0]).apply(lambda x: x.replace('$', ''))
df['min_salary']=df['min_salary'].apply(lambda x: int(x))

#get the maximum salary 
df['max_salary']=df['salary_range'].apply(lambda x: x.split('-')[1]).apply(lambda x: x.replace('$', ''))
df['max_salary']=df['max_salary'].apply(lambda x: int(x))

#get the average salary column (this will be the target variable)
df['avg_salary']=(df['max_salary']+df['min_salary'])/2
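
The chain of `split` and `replace` calls above can also be expressed as a single regular expression. The sketch below assumes every estimate looks like '$57K-$67K (Glassdoor est.)'; `SALARY_PATTERN` and `parse_salary_range` are hypothetical names, not part of the notebook.

```python
import re

# Captures the two numbers in strings such as '$57K-$67K (Glassdoor est.)'.
SALARY_PATTERN = re.compile(r'\$(\d+)K-\$(\d+)K')

def parse_salary_range(estimate):
    """Return (min, max, average) in thousands of dollars, or None if no range is found."""
    match = SALARY_PATTERN.search(estimate)
    if match is None:
        return None
    low, high = int(match.group(1)), int(match.group(2))
    return low, high, (low + high) / 2
```

Returning `None` for unparseable strings (for example the '-1' placeholder) avoids the `ValueError` that `int()` would raise on unexpected input.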

In [16]:
#get age of company as of 2021; a Founded value of -1 means unknown and is kept as -1
df['company_age']=df.Founded.apply(lambda x: x if x<1 else 2021-x)
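
The year 2021 is hard-coded in the lambda above. A small sketch of the same rule with the year as a parameter follows; `company_age` as a function and `reference_year` are hypothetical names introduced only for illustration.

```python
def company_age(founded, reference_year=2021):
    """Return the company age in years; unknown years (values below 1) pass through unchanged."""
    return founded if founded < 1 else reference_year - founded
```

It could be applied with `df.Founded.apply(company_age)`, keeping -1 as the marker for an unknown founding year.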

In [17]:
#get name of company 
df['company_name']=df.apply(lambda x: x['Company Name'] if x.Rating<0 else x['Company Name'][:-4], axis=1)
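
Slicing off the last four characters works because the rating suffix looks like '\n3.8'. An alternative sketch that does not depend on the suffix length splits on the newline instead; `clean_company_name` is a hypothetical helper name.

```python
def clean_company_name(raw_name):
    """Keep only the text before the first newline, e.g. 'CoreSite\n3.8' -> 'CoreSite'."""
    return raw_name.split('\n')[0].strip()
```

This also leaves unrated companies (which have no suffix) untouched, so no check on `Rating` is needed.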

In [18]:
#more information on the location of each position

#get state of company
df['state']=df.Location.apply(lambda x: x.split(',')[1])
#city of company
df['city']=df.Location.apply(lambda x: x.split(',')[0])
#state or country of headquarters
df['headquarters_state']=df.Headquarters.apply(lambda x: -1 if x=='-1' else x.split(",")[1])
#1 if the position is in the same state as the headquarters, else 0
df['same_location']=df.apply(lambda x: 1 if x.state==x.headquarters_state else 0, axis=1)

In [19]:
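#change ' Arapahoe' to ' CO' since Arapahoe County is located in Colorado
df['state']=df.state.apply(lambda x: ' CO' if x==' Arapahoe' else x )
#recompute same_location so the corrected rows are compared using ' CO'
df['same_location']=(df.state==df.headquarters_state).astype(int)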
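
The ' Arapahoe' value appears because the state is taken from the second comma-separated field, which is the county when a location has three parts. A sketch that takes the last field instead is shown below. `split_location` is a hypothetical helper, the three-part location in the usage note is a made-up example, and unlike the notebook code this version strips the leading space from the state.

```python
def split_location(location):
    """Return (city, state) using the first and last comma-separated fields."""
    parts = [part.strip() for part in location.split(',')]
    city = parts[0]
    # A value without a comma (such as the '-1' placeholder) has no state.
    state = parts[-1] if len(parts) > 1 else None
    return city, state
```

For example, `split_location('Some City, Arapahoe, CO')` returns `('Some City', 'CO')`, so no special case for the county is required.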

In [20]:
df.state.value_counts()

 CA    626
 TX    394
 NY    345
 IL    164
 PA    114
 AZ     97
 CO     96
 NC     90
 NJ     86
 WA     53
 VA     48
 OH     35
 UT     33
 FL     27
 IN     23
 DE     11
 GA      4
 KS      3
 SC      3
Name: state, dtype: int64

In [21]:
df.loc[df.state==' CO']

Unnamed: 0.1,Unnamed: 0,Job Title,Salary Estimate,Job Description,Rating,Company Name,Location,Headquarters,Size,Founded,...,salary_range,min_salary,max_salary,avg_salary,company_age,company_name,state,city,headquarters_state,same_location
2157,2157,"Data Analyst - Denver, CO",$57K-$67K (Glassdoor est.),­­As a member of the company’s information tec...,3.8,CoreSite\n3.8,"Denver, CO","Denver, CO",201 to 500 employees,2001,...,$57-$67,57,67,62.0,20,CoreSite,CO,Denver,CO,1
2158,2158,Inside Sales Data Analyst,$57K-$67K (Glassdoor est.),Job Requisition ID #\n20WD40934\n\nPosition Ov...,4.0,Autodesk\n4.0,"Denver, CO","San Rafael, CA",5001 to 10000 employees,1982,...,$57-$67,57,67,62.0,39,Autodesk,CO,Denver,CA,0
2159,2159,Data Analyst,$57K-$67K (Glassdoor est.),Position Type: Permanent\n\nCompensation: Comm...,3.6,Dire Wolf Digital\n3.6,"Denver, CO","Denver, CO",51 to 200 employees,2010,...,$57-$67,57,67,62.0,11,Dire Wolf Digital,CO,Denver,CO,1
2160,2160,"Senior Data Analyst, Sales Compensation",$57K-$67K (Glassdoor est.),"Because you belong at Twilio\n\nThe Who, What,...",4.0,Twilio\n4.0,"Denver, CO","San Francisco, CA",1001 to 5000 employees,2008,...,$57-$67,57,67,62.0,13,Twilio,CO,Denver,CA,0
2161,2161,Data Analyst,$57K-$67K (Glassdoor est.),The Data Analyst is responsible for the all da...,3.8,BridgeView IT\n3.8,"Denver, CO","Denver, CO",1 to 50 employees,2005,...,$57-$67,57,67,62.0,16,BridgeView IT,CO,Denver,CO,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2248,2248,RQS - IHHA - 201900004460 -1q Data Security An...,$78K-$104K (Glassdoor est.),Maintains systems to protect data from unautho...,2.5,"Avacend, Inc.\n2.5","Denver, CO","Alpharetta, GA",51 to 200 employees,-1,...,$78-$104,78,104,91.0,-1,"Avacend, Inc.",CO,Denver,GA,0
2249,2249,Senior Data Analyst (Corporate Audit),$78K-$104K (Glassdoor est.),Position:\nSenior Data Analyst (Corporate Audi...,2.9,Arrow Electronics\n2.9,"Centennial, CO","Centennial, CO",10000+ employees,1935,...,$78-$104,78,104,91.0,86,Arrow Electronics,CO,Centennial,CO,1
2250,2250,"Technical Business Analyst (SQL, Data analytic...",$78K-$104K (Glassdoor est.),"Title: Technical Business Analyst (SQL, Data a...",-1.0,Spiceorb,"Denver, CO",-1,-1,-1,...,$78-$104,78,104,91.0,-1,Spiceorb,CO,Denver,-1,0
2251,2251,"Data Analyst 3, Customer Experience",$78K-$104K (Glassdoor est.),Summary\n\nResponsible for working cross-funct...,3.1,Contingent Network Services\n3.1,"Centennial, CO","West Chester, OH",201 to 500 employees,1984,...,$78-$104,78,104,91.0,37,Contingent Network Services,CO,Centennial,OH,0


In [22]:
df.sample(10)

Unnamed: 0.1,Unnamed: 0,Job Title,Salary Estimate,Job Description,Rating,Company Name,Location,Headquarters,Size,Founded,...,salary_range,min_salary,max_salary,avg_salary,company_age,company_name,state,city,headquarters_state,same_location
789,789,Data Operations Analyst,$67K-$92K (Glassdoor est.),Overview\n\nPDI is seeking a Data Operations A...,3.2,PDI Software\n3.2,"Chicago, IL","Atlanta, GA",501 to 1000 employees,1983,...,$67-$92,67,92,79.5,38,PDI Software,IL,Chicago,GA,0
1550,1550,Master Data Analyst,$69K-$127K (Glassdoor est.),Title Master Data AnalystStart Date Immediatel...,4.7,Method360\n4.7,"San Jose, CA","San Francisco, CA",51 to 200 employees,2000,...,$69-$127,69,127,98.0,21,Method360,CA,San Jose,CA,1
1640,1640,Junior Data Analyst,$42K-$76K (Glassdoor est.),Job Description\nThe Details:\n\nData drives s...,2.6,Assembly\n2.6,"Austin, TX","New York, NY",201 to 500 employees,2014,...,$42-$76,42,76,59.0,7,Assembly,TX,Austin,NY,0
548,548,Senior Data Intelligence Analyst,$37K-$70K (Glassdoor est.),Grow your career at Cedars-Sinai!\n\n\nCome jo...,3.9,Cedars-Sinai\n3.9,"Los Angeles, CA","Los Angeles, CA",10000+ employees,1902,...,$37-$70,37,70,53.5,119,Cedars-Sinai,CA,Los Angeles,CA,1
2091,2091,Data Analyst Junior,$34K-$61K (Glassdoor est.),Job Description\nJob description\nInterpret da...,5.0,"Staffigo Technical Services, LLC\n5.0","Indianapolis, IN","Woodridge, IL",51 to 200 employees,2008,...,$34-$61,34,61,47.5,13,"Staffigo Technical Services, LLC",IN,Indianapolis,IL,0
907,907,Data Analyst II,$29K-$38K (Glassdoor est.),Home » New Job from Competentia\nData Analyst ...,4.9,Competentia Holding\n4.9,"Houston, TX","Stavanger, Norway",501 to 1000 employees,-1,...,$29-$38,29,38,33.5,-1,Competentia Holding,TX,Houston,Norway,0
1781,1781,"JPSC-7975 - Data Analyst Lead- Columbus, OH (L...",$28K-$52K (Glassdoor est.),Overview\n\nRole: Data Analyst Lead – Informat...,4.5,Avani Technology Solutions\n4.5,"Columbus, OH","Rochester, NY",501 to 1000 employees,2008,...,$28-$52,28,52,40.0,13,Avani Technology Solutions,OH,Columbus,NY,0
1731,1731,"Assistant Vice President, Anti Financial Crime...",$40K-$72K (Glassdoor est.),Job Description\n\nJob Description:\n\nDB USA ...,3.5,Deutsche Bank\n3.5,"Jacksonville, FL","Frankfurt am Main, Germany",10000+ employees,1870,...,$40-$72,40,72,56.0,151,Deutsche Bank,FL,Jacksonville,Germany,0
457,457,Tactical Data Link (TDL) Analyst,$43K-$69K (Glassdoor est.),Working at USfalcon is about providing excepti...,4.3,USfalcon\n4.3,"Hampton, VA","Cary, NC",201 to 500 employees,1984,...,$43-$69,43,69,56.0,37,USfalcon,VA,Hampton,NC,0
22,22,Data Analyst - Intex Developer,$37K-$66K (Glassdoor est.),Data Analyst - Intex Developer\n\n\nNew York\n...,3.3,Macquarie Group\n3.3,"New York, NY","Sydney, Australia",10000+ employees,1969,...,$37-$66,37,66,51.5,52,Macquarie Group,NY,New York,Australia,0


In [23]:
#get average size of company 
df.Size.unique()

array(['201 to 500 employees', '10000+ employees',
       '1001 to 5000 employees', '501 to 1000 employees',
       '5001 to 10000 employees', '1 to 50 employees',
       '51 to 200 employees', 'Unknown', '-1'], dtype=object)

In [24]:
df.Size.value_counts()

51 to 200 employees        420
10000+ employees           375
1001 to 5000 employees     348
1 to 50 employees          347
201 to 500 employees       249
501 to 1000 employees      211
-1                         163
5001 to 10000 employees     97
Unknown                     42
Name: Size, dtype: int64

In [25]:
#get size range of company 
df['size_range']=df.Size.apply(lambda x: x if x=='Unknown' or x=='-1' else x.split('employees')[0])

In [26]:
#get minimum size of company
df['min_size']=df.size_range.apply(lambda x: x if x=='Unknown' or x=='-1' or x=='10000+ ' else x.split('to')[0])

df['min_size']=df.min_size.apply(lambda x: 10000 if '+ ' in x else x)

df['min_size']=df.min_size.apply(lambda x: 0 if x=='-1' or x=='Unknown' else x).apply(lambda x: int(x))

In [27]:
#get maximum size of company
df['max_size']=df.size_range.apply(lambda x: x if x=='Unknown' or x=='-1' or x=='10000+ ' else x.split('to')[1])

#setting the maximum size of a '10000+' company to 30000
df['max_size']=df.max_size.apply(lambda x: 30000 if '10000+' in x else x)

df['max_size']=df.max_size.apply(lambda x: 0 if x=='-1' or x=='Unknown' else x).apply(lambda x: int(x))

In [28]:
#get the average size of the company 
df['avg_size']=(df.max_size+df.min_size)/2

In [29]:
#get the median size of the companies with a known size and replace the 0's with it. 
#the median is used instead of the mean because it is less sensitive to outliers 
median_size=df.loc[df.avg_size!=0].avg_size.median()

df['avg_size']=df.avg_size.replace(0, median_size)
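
The size steps above (split the range, cap '10000+' at 30000, average, then fill unknowns with the median) can be condensed into two small functions. This is a sketch using only the standard library; `parse_size`, `average_sizes` and `open_ended_max` are hypothetical names, and the 30000 cap is the same assumption the notebook makes.

```python
import re
from statistics import median

def parse_size(label, open_ended_max=30000):
    """Return (min, max) employees for a Size label, or None when the size is unknown."""
    if label in ('-1', 'Unknown'):
        return None
    numbers = [int(n) for n in re.findall(r'\d+', label)]
    if '+' in label:
        # '10000+ employees' has no upper bound, so an assumed cap is used.
        return numbers[0], open_ended_max
    return numbers[0], numbers[1]

def average_sizes(labels):
    """Average each size range and fill unknown sizes with the median of the known ones."""
    parsed = [parse_size(label) for label in labels]
    known = [(low + high) / 2 for low, high in (p for p in parsed if p is not None)]
    fill = median(known)  # assumes at least one label has a known size
    return [(p[0] + p[1]) / 2 if p is not None else fill for p in parsed]
```

Using `None` rather than 0 for unknown sizes means a real value can never be mistaken for the placeholder.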

In [30]:
#python in description
df['python'] = df['Job Description'].apply(lambda x: 1 if 'python' in x.lower() else 0)
df.python.value_counts()

0    1615
1     637
Name: python, dtype: int64

In [31]:
#sql in description
df['sql'] = df['Job Description'].apply(lambda x: 1 if 'SQL' in x else 0)
df.sql.value_counts()

1    1378
0     874
Name: sql, dtype: int64

In [32]:
#Excel in description
df['excel'] = df['Job Description'].apply(lambda x: 1 if 'excel' in x.lower() else 0)
df.excel.value_counts()

1    1353
0     899
Name: excel, dtype: int64

In [33]:
#R in description (note: 'R' in x matches any capital letter R, so almost every description is flagged)
df['R'] = df['Job Description'].apply(lambda x: 1 if 'R' in x else 0)
df.R.value_counts()

1    2106
0     146
Name: R, dtype: int64

In [47]:
#deep learning in description
df['deep_learning'] = df['Job Description'].apply(lambda x: 1 if 'Deep Learning' in x or 'deep learning' in x 
                                                  or 'Tensorflow' in x or 'Keras' in x 
                                                  or 'tensorflow' in x or 'keras' in x
                                                  or 'neural nets' in x else 0)
df.deep_learning.value_counts()

0    2241
1      11
Name: deep_learning, dtype: int64

In [35]:
#PhD in description
df['PhD'] = df['Job Description'].apply(lambda x: 1 if 'PhD' in x or 'phd' in x or 'Ph.D' in x 
                                        or 'Doctor of Philosophy' in x else 0)
df.PhD.value_counts()

0    2203
1      49
Name: PhD, dtype: int64

In [36]:
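#bachelor degree in description
#note: the last operand, "Bachelor's", is a bare non-empty string rather than "Bachelor's" in x,
#so the condition is always true and every row is flagged as 1
df['bachelor'] = df['Job Description'].apply(lambda x: 1 if 'bachelor' in x or 'bachelors' in x or "bachelor's" in x 
                                             or 'Bachelor' in x or 'Bachelors' in x or "Bachelor's" else 0)
df.bachelor.value_counts()
#every row is 1 because of the always-true condition, not because every posting asks for a bachelor's degree 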

1    2252
Name: bachelor, dtype: int64
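
In an `or` chain each operand needs its own `in x` test. A bare non-empty string such as "Bachelor's" is always truthy, so a chain that ends in one is true for every row, which is consistent with all 2252 rows being flagged. The sketch below shows the effect and one possible fix; `always_flagged` and `mentions_bachelor` are hypothetical names.

```python
# A bare non-empty string is truthy, so this expression is 1 for any text.
always_flagged = 1 if 'bachelor' in 'no degree listed' or "Bachelor's" else 0

def mentions_bachelor(description):
    """Return 1 if any spelling variant of 'bachelor' occurs in the description, else 0."""
    # Lowercasing lets one substring test cover 'bachelor', 'Bachelors', "Bachelor's", etc.
    return 1 if 'bachelor' in description.lower() else 0
```

With this helper, `df['Job Description'].apply(mentions_bachelor)` would flag only the postings that actually mention the degree.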

In [37]:
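#masters degree in description (note: 'MS' and 'MA' are matched as substrings, so words such as 'SYSTEMS' or 'MANAGEMENT' also count)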
df['masters'] = df['Job Description'].apply(lambda x: 1 if 'masters' in x or "master's" in x 
                                            or "Masters" in x or "Master's" in x or 'MS' in x 
                                            or 'M.S' in x or 'MA' in x or 'M.A' in x else 0)
df.masters.value_counts()

0    1430
1     822
Name: masters, dtype: int64
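
The abbreviations 'MS' and 'MA' are matched as plain substrings, so capitalised words such as 'MANAGEMENT' also count as a master's degree. A sketch of a stricter test is given below: the spelled-out forms are matched case-insensitively and the abbreviations only as standalone, case-sensitive words. `MASTERS_PATTERN` and `mentions_masters` are hypothetical names, and the resulting counts on the dataset are not shown here.

```python
import re

# (?i:...) applies case-insensitivity only to the spelled-out forms;
# the abbreviations M.S / MS / M.A / MA must be uppercase, standalone words.
MASTERS_PATTERN = re.compile(r"(?i:\bmaster['’]?s\b)|\bM\.?[SA]\b")

def mentions_masters(description):
    """Return 1 if the description appears to mention a master's degree, else 0."""
    return 1 if MASTERS_PATTERN.search(description) else 0
```

This remains a heuristic: 'MS Excel' (Microsoft) or 'MA' as the Massachusetts state code would still be flagged.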

In [38]:
#Power BI in description
df['power_bi'] = df['Job Description'].apply(lambda x: 1 if 'Power BI' in x or 'power BI' in x or 'Power bi' in x
                                             else 0)
df.power_bi.value_counts()

0    2072
1     180
Name: power_bi, dtype: int64

In [39]:
#Tableau in description
df['tableau'] = df['Job Description'].apply(lambda x: 1 if 'tableau' in x or 'Tableau' in x else 0)
df.tableau.value_counts()

0    1632
1     620
Name: tableau, dtype: int64

In [40]:
#Problem Solver in description
df['prob_solver'] = df['Job Description'].apply(lambda x: 1 if 'problem solver' in x.lower() else 0)
df.prob_solver.value_counts()

0    2214
1      38
Name: prob_solver, dtype: int64

In [41]:
#Critical thinker in description
df['critical_thinker'] = df['Job Description'].apply(lambda x: 1 if 'critical thinker' in x.lower() else 0)
df.critical_thinker.value_counts()

0    2234
1      18
Name: critical_thinker, dtype: int64

In [42]:
df.sample(10)

Unnamed: 0.1,Unnamed: 0,Job Title,Salary Estimate,Job Description,Rating,Company Name,Location,Headquarters,Size,Founded,...,excel,R,deep_learning,PhD,bachelor,masters,power_bi,tableau,prob_solver,critical_thinker
1938,1938,Data Analyst,$99K-$178K (Glassdoor est.),Voleon is a technology company that applies st...,4.6,The Voleon Group\n4.6,"Berkeley, CA","Berkeley, CA",51 to 200 employees,2007,...,0,1,0,0,1,0,0,0,0,1
576,576,Senior Data Analyst,$57K-$103K (Glassdoor est.),"**This position is located in Irvine, CA. Relo...",4.5,Mobilityware\n4.5,"Los Angeles, CA","Irvine, CA",51 to 200 employees,1990,...,0,1,0,0,1,1,0,0,0,0
716,716,Radar Telemetry Data Analyst,$42K-$63K (Glassdoor est.),Job Description\nOur prestigious client is loo...,-1.0,TopTalentFetch,"Anaheim, CA",-1,-1,-1,...,0,1,0,0,1,0,0,0,0,0
738,738,Business Data Analyst,$60K-$66K (Glassdoor est.),Work. Serve. Thrive.Imagine a place where your...,3.3,Feeding America\n3.3,"Chicago, IL","Chicago, IL",51 to 200 employees,1979,...,1,1,0,0,1,1,0,1,0,0
876,876,Analyst IV Systems - Data Engineering,$68K-$87K (Glassdoor est.),About Retail Business Services\nRetail Busines...,3.9,"Retail Business Services, LLC\n3.9","Chicago, IL","Salisbury, NC",1001 to 5000 employees,-1,...,1,1,0,0,1,0,0,0,0,0
1836,1836,Data Analyst/ Python with Banking Experience,$54K-$75K (Glassdoor est.),Title Sr Data Analyst with Python Location Cha...,2.9,"Shimento, Inc.\n2.9","Charlotte, NC","Benicia, CA",1 to 50 employees,-1,...,1,1,0,0,1,1,0,0,0,0
1678,1678,Data Analyst II,$53K-$104K (Glassdoor est.),Professional\n\nPosition Purpose:\nAnalyze hea...,3.3,Centene Corporation\n3.3,"Austin, TX","Saint Louis, MO",10000+ employees,1984,...,1,1,0,0,1,1,0,0,0,0
975,975,Data Warehouse Analyst III,$53K-$94K (Glassdoor est.),Rethink what’s possible with SwitchThink! Swit...,3.2,Desert Financial Credit Union\n3.2,"Phoenix, AZ","Phoenix, AZ",1001 to 5000 employees,1939,...,0,1,0,0,1,0,1,0,0,0
1817,1817,Data Analyst,$50K-$86K (Glassdoor est.),"Hello Associates,\n\n*****Greetings from Conch...",4.6,"Conch Technologies, Inc\n4.6","Charlotte, NC","Memphis, TN",51 to 200 employees,-1,...,0,1,0,0,1,0,0,0,0,0
750,750,Data Analyst,$73K-$82K (Glassdoor est.),Senior AWS Developer\n\nUST Global is looking ...,4.2,UST Global\n4.2,"Chicago, IL","Aliso Viejo, CA",10000+ employees,1999,...,1,1,0,0,1,1,0,0,0,0


In [43]:
df.columns

Index(['Unnamed: 0', 'Job Title', 'Salary Estimate', 'Job Description',
       'Rating', 'Company Name', 'Location', 'Headquarters', 'Size', 'Founded',
       'Type of ownership', 'Industry', 'Sector', 'Revenue', 'Competitors',
       'Easy Apply', 'simplified_title', 'seniority', 'salary_range',
       'min_salary', 'max_salary', 'avg_salary', 'company_age', 'company_name',
       'state', 'city', 'headquarters_state', 'same_location', 'size_range',
       'min_size', 'max_size', 'avg_size', 'python', 'sql', 'excel', 'R',
       'deep_learning', 'PhD', 'bachelor', 'masters', 'power_bi', 'tableau',
       'prob_solver', 'critical_thinker'],
      dtype='object')

In [44]:
df_final=df.drop('Unnamed: 0', axis=1)
df_final.sample(10)



Unnamed: 0,Job Title,Salary Estimate,Job Description,Rating,Company Name,Location,Headquarters,Size,Founded,Type of ownership,...,excel,R,deep_learning,PhD,bachelor,masters,power_bi,tableau,prob_solver,critical_thinker
1180,Data Engineer/Reporting Analyst,$64K-$113K (Glassdoor est.),An emerging IoT technology company in Philadel...,1.6,"Management Decisions, Inc.\n1.6","Philadelphia, PA","Milwaukee, WI",1 to 50 employees,-1,Company - Public,...,0,1,0,0,1,0,0,1,0,0
340,"Sr. Data Analyst, Retail Media",$77K-$132K (Glassdoor est.),Who we are\nCriteo (NASDAQ: CRTO) is the globa...,3.9,Criteo\n3.9,"New York, NY","Paris, France",1001 to 5000 employees,2005,Company - Public,...,0,1,0,0,1,0,0,0,0,0
747,Data Analyst - IntelliScript,$60K-$66K (Glassdoor est.),Role Description\n\nBy leveraging our relation...,3.8,Milliman\n3.8,"Chicago, IL","Seattle, WA",1001 to 5000 employees,1947,Company - Private,...,1,1,0,0,1,0,1,0,0,0
493,Data Analyst - Senior Consultant,$49K-$112K (Glassdoor est.),OverviewGuidehouse is a leading management con...,3.4,Guidehouse\n3.4,"Norfolk, VA","Washington, DC",5001 to 10000 employees,2018,Company - Private,...,0,1,0,0,1,0,0,0,0,0
1725,Junior Data Analyst,$40K-$72K (Glassdoor est.),Job Description\nJob description\nInterpret da...,5.0,"Staffigo Technical Services, LLC\n5.0","Jacksonville, FL","Woodridge, IL",51 to 200 employees,2008,Company - Private,...,1,1,0,0,1,1,0,0,0,0
815,Data Analyst,$42K-$76K (Glassdoor est.),KellyMitchell matches the best IT and business...,3.6,Kelly Mitchell Group\n3.6,"Downers Grove, IL","Saint Louis, MO",1001 to 5000 employees,1998,Company - Private,...,1,0,0,0,1,0,0,0,0,0
2000,Business Data Analyst,$65K-$120K (Glassdoor est.),Job Title: Business Data Analyst\n\nLocation: ...,3.8,VLink Inc.\n3.8,"San Francisco, CA","South Windsor, CT",201 to 500 employees,2006,Company - Private,...,0,1,0,0,1,0,0,0,0,0
1628,Manager : Marketing Ops and Data Analyst,$63K-$116K (Glassdoor est.),In this role you will be a part of a project t...,-1.0,Show Me Leads,"Mountain View, CA",-1,-1,-1,-1,...,1,1,0,0,1,1,0,0,0,0
1499,Data Analyst,$89K-$151K (Glassdoor est.),We are hiring Data Analyst for our client in R...,5.0,TechNet Inc.\n5.0,"San Jose, CA","Alpharetta, GA",1 to 50 employees,-1,Company - Private,...,0,1,0,0,1,0,0,0,0,0
928,MDM Data Analyst,$47K-$76K (Glassdoor est.),Arthur Lawrence is urgently looking for MDM Da...,4.1,Arthur Lawrence\n4.1,"Houston, TX","Upland, CA",201 to 500 employees,2003,Company - Private,...,0,1,0,0,1,0,0,0,0,0


In [45]:
#get a csv for eda 
df_final.to_csv('clean_salary_data.csv',index = False)