# Data Analyst Jobs

## Intro
The aim is to take this [Kaggle Dataset](https://www.kaggle.com/andrewmvd/data-analyst-jobs) and try to find answers to these questions:

1. Where do Analysts earn the most/least?
2. How big are the salary differences between Junior, Regular and Senior positions?
3. Are company size and salary correlated?
4. What are the best jobs by salary and company rating?
5. What are the top skills needed for the job?

The dataset is rather small, so it makes sense to keep as much of it as possible during cleanup and preprocessing.

## Data Cleanup
But the data needs a good scrub first.

In [1]:
# Importing packages

import os
import re
import string

import pandas as pd
import numpy as np

In [2]:
# Read the dataset and take a peek inside

data = pd.read_csv(os.path.join('..', 'raw_data', 'DataAnalyst.csv'))
data.head(25)

Unnamed: 0.1,Unnamed: 0,Job Title,Salary Estimate,Job Description,Rating,Company Name,Location,Headquarters,Size,Founded,Type of ownership,Industry,Sector,Revenue,Competitors,Easy Apply
0,0,"Data Analyst, Center on Immigration and Justic...",$37K-$66K (Glassdoor est.),Are you eager to roll up your sleeves and harn...,3.2,Vera Institute of Justice\n3.2,"New York, NY","New York, NY",201 to 500 employees,1961,Nonprofit Organization,Social Assistance,Non-Profit,$100 to $500 million (USD),-1,True
1,1,Quality Data Analyst,$37K-$66K (Glassdoor est.),Overview\n\nProvides analytical and technical ...,3.8,Visiting Nurse Service of New York\n3.8,"New York, NY","New York, NY",10000+ employees,1893,Nonprofit Organization,Health Care Services & Hospitals,Health Care,$2 to $5 billion (USD),-1,-1
2,2,"Senior Data Analyst, Insights & Analytics Team...",$37K-$66K (Glassdoor est.),We’re looking for a Senior Data Analyst who ha...,3.4,Squarespace\n3.4,"New York, NY","New York, NY",1001 to 5000 employees,2003,Company - Private,Internet,Information Technology,Unknown / Non-Applicable,GoDaddy,-1
3,3,Data Analyst,$37K-$66K (Glassdoor est.),Requisition NumberRR-0001939\nRemote:Yes\nWe c...,4.1,Celerity\n4.1,"New York, NY","McLean, VA",201 to 500 employees,2002,Subsidiary or Business Segment,IT Services,Information Technology,$50 to $100 million (USD),-1,-1
4,4,Reporting Data Analyst,$37K-$66K (Glassdoor est.),ABOUT FANDUEL GROUP\n\nFanDuel Group is a worl...,3.9,FanDuel\n3.9,"New York, NY","New York, NY",501 to 1000 employees,2009,Company - Private,Sports & Recreation,"Arts, Entertainment & Recreation",$100 to $500 million (USD),DraftKings,True
5,5,Data Analyst,$37K-$66K (Glassdoor est.),About Cubist\nCubist Systematic Strategies is ...,3.9,Point72\n3.9,"New York, NY","Stamford, CT",1001 to 5000 employees,2014,Company - Private,Investment Banking & Asset Management,Finance,Unknown / Non-Applicable,-1,-1
6,6,Business/Data Analyst (FP&A),$37K-$66K (Glassdoor est.),Two Sigma is a different kind of investment ma...,4.4,Two Sigma\n4.4,"New York, NY","New York, NY",1001 to 5000 employees,2001,Company - Private,Investment Banking & Asset Management,Finance,Unknown / Non-Applicable,-1,-1
7,7,Data Science Analyst,$37K-$66K (Glassdoor est.),Data Science Analyst\n\nJob Details\nLevel\nEx...,3.7,GNY Insurance Companies\n3.7,"New York, NY","New York, NY",201 to 500 employees,1914,Company - Private,Insurance Carriers,Insurance,$100 to $500 million (USD),"Travelers, Chubb, Crum & Forster",True
8,8,Data Analyst,$37K-$66K (Glassdoor est.),The Data Analyst is an integral member of the ...,4.0,DMGT\n4.0,"New York, NY","London, United Kingdom",5001 to 10000 employees,1896,Company - Public,Venture Capital & Private Equity,Finance,$1 to $2 billion (USD),"Thomson Reuters, Hearst, Pearson",-1
9,9,"Data Analyst, Merchant Health",$37K-$66K (Glassdoor est.),About Us\n\nRiskified is the AI platform power...,4.4,Riskified\n4.4,"New York, NY","New York, NY",501 to 1000 employees,2013,Company - Private,Research & Development,Business Services,Unknown / Non-Applicable,"Signifyd, Forter",-1


### First impression

1. The `Job Title` column is noisy. Let's add an `Experience` column so we can sort and group by seniority.

---

2. `Salary Estimate` isn't helpful in this form either, because the column should be numeric.
I will delete it and add two new columns for the lower and upper end of the salary range.

---

3. The rating needs to be removed from `Company Name`.

---

4. Some columns that don't seem helpful for answering the questions can be deleted:
* `Unnamed: 0`
* `Easy Apply`
* `Competitors`
* `Headquarters`
* `Founded`
* `Type of ownership`
* `Industry`
* `Sector`
* `Revenue`

---

5. There are a few `-1` values scattered through the columns.
Deleting the columns above deals with some of the `-1`/rubbish values.
There are probably more rubbish values elsewhere.

#### Job Titles and Experience

In [3]:
# Let's investigate the random Job titles
# Filter out the clean ones and keep the noise

clean_titles = ['Data Analyst', 'Junior Data Analyst', 'Senior Data Analyst']
noisy = data[~data['Job Title'].isin(clean_titles)]

# Show the noisy job titles
noisy['Job Title'].head(25)

0     Data Analyst, Center on Immigration and Justic...
1                                  Quality Data Analyst
2     Senior Data Analyst, Insights & Analytics Team...
4                                Reporting Data Analyst
6                          Business/Data Analyst (FP&A)
7                                  Data Science Analyst
9                         Data Analyst, Merchant Health
12                                         DATA ANALYST
14                     Investment Advisory Data Analyst
15                          Sustainability Data Analyst
17                                Clinical Data Analyst
18                              DATA PROGRAMMER/ANALYST
20                        Product Analyst, Data Science
21                                 Data Science Analyst
22                       Data Analyst - Intex Developer
24                       Entry Level / Jr. Data Analyst
26                 Data + Business Intelligence Analyst
27                                Data Analyst, 

The titles do distinguish Junior, Regular and Senior positions, but not consistently.
We can assign a few keywords to each level to help us.

| Junior   | Senior | Regular         |
|----------|--------|-----------------|
| junior   | senior | everything else |
| beginner | lead   | ...             |
| entry    | master | ...             |
| jr       | sr     | ...             |



In [4]:
def seniority(x):
    """Get seniority of job title"""
    experience = {'junior': ['beginner', 'entry', 'junior', 'jr'],
                  'senior': ['senior', 'lead', 'sr', 'master']}
    title = x.lower()

    # Return Junior or Senior if any keyword occurs in the title
    for exp, words in experience.items():
        if any(w in title for w in words):
            return exp.title()

    # No keyword matched, so the position counts as Regular
    return 'Regular'

# Adding an `Experience` column to the dataframe using above function
data['Experience'] = data['Job Title'].map(seniority)
data

Unnamed: 0.1,Unnamed: 0,Job Title,Salary Estimate,Job Description,Rating,Company Name,Location,Headquarters,Size,Founded,Type of ownership,Industry,Sector,Revenue,Competitors,Easy Apply,Experience
0,0,"Data Analyst, Center on Immigration and Justic...",$37K-$66K (Glassdoor est.),Are you eager to roll up your sleeves and harn...,3.2,Vera Institute of Justice\n3.2,"New York, NY","New York, NY",201 to 500 employees,1961,Nonprofit Organization,Social Assistance,Non-Profit,$100 to $500 million (USD),-1,True,Regular
1,1,Quality Data Analyst,$37K-$66K (Glassdoor est.),Overview\n\nProvides analytical and technical ...,3.8,Visiting Nurse Service of New York\n3.8,"New York, NY","New York, NY",10000+ employees,1893,Nonprofit Organization,Health Care Services & Hospitals,Health Care,$2 to $5 billion (USD),-1,-1,Regular
2,2,"Senior Data Analyst, Insights & Analytics Team...",$37K-$66K (Glassdoor est.),We’re looking for a Senior Data Analyst who ha...,3.4,Squarespace\n3.4,"New York, NY","New York, NY",1001 to 5000 employees,2003,Company - Private,Internet,Information Technology,Unknown / Non-Applicable,GoDaddy,-1,Senior
3,3,Data Analyst,$37K-$66K (Glassdoor est.),Requisition NumberRR-0001939\nRemote:Yes\nWe c...,4.1,Celerity\n4.1,"New York, NY","McLean, VA",201 to 500 employees,2002,Subsidiary or Business Segment,IT Services,Information Technology,$50 to $100 million (USD),-1,-1,Regular
4,4,Reporting Data Analyst,$37K-$66K (Glassdoor est.),ABOUT FANDUEL GROUP\n\nFanDuel Group is a worl...,3.9,FanDuel\n3.9,"New York, NY","New York, NY",501 to 1000 employees,2009,Company - Private,Sports & Recreation,"Arts, Entertainment & Recreation",$100 to $500 million (USD),DraftKings,True,Regular
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2248,2248,RQS - IHHA - 201900004460 -1q Data Security An...,$78K-$104K (Glassdoor est.),Maintains systems to protect data from unautho...,2.5,"Avacend, Inc.\n2.5","Denver, CO","Alpharetta, GA",51 to 200 employees,-1,Company - Private,Staffing & Outsourcing,Business Services,Unknown / Non-Applicable,-1,-1,Regular
2249,2249,Senior Data Analyst (Corporate Audit),$78K-$104K (Glassdoor est.),Position:\nSenior Data Analyst (Corporate Audi...,2.9,Arrow Electronics\n2.9,"Centennial, CO","Centennial, CO",10000+ employees,1935,Company - Public,Wholesale,Business Services,$10+ billion (USD),"Avnet, Ingram Micro, Tech Data",-1,Senior
2250,2250,"Technical Business Analyst (SQL, Data analytic...",$78K-$104K (Glassdoor est.),"Title: Technical Business Analyst (SQL, Data a...",-1.0,Spiceorb,"Denver, CO",-1,-1,-1,-1,-1,-1,-1,-1,-1,Regular
2251,2251,"Data Analyst 3, Customer Experience",$78K-$104K (Glassdoor est.),Summary\n\nResponsible for working cross-funct...,3.1,Contingent Network Services\n3.1,"Centennial, CO","West Chester, OH",201 to 500 employees,1984,Company - Private,Enterprise Software & Network Solutions,Information Technology,$25 to $50 million (USD),-1,-1,Regular


In [5]:
data['Experience'].value_counts()

Regular    1670
Senior      497
Junior       86
Name: Experience, dtype: int64
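
One caveat about the keyword approach: plain substring checks also fire inside longer words, so `'lead'` matches a title containing "Leading". The sketch below is a stricter variant that only accepts whole words. The function name `seniority_strict` and the example titles are hypothetical and not part of the notebook, and the counts above were produced with the substring version.

```python
import re

# Whole-word patterns; the keyword lists mirror the table above
JUNIOR = re.compile(r'\b(junior|jr|entry|beginner)\b', re.IGNORECASE)
SENIOR = re.compile(r'\b(senior|sr|lead|master)\b', re.IGNORECASE)


def seniority_strict(title: str) -> str:
    """Classify a job title, matching keywords only as whole words."""
    if JUNIOR.search(title):
        return 'Junior'
    if SENIOR.search(title):
        return 'Senior'
    return 'Regular'


# Hypothetical titles to illustrate the difference
for t in ['Sr. Data Analyst', 'Leading Indicators Data Analyst']:
    print(t, '->', seniority_strict(t))
```

Whether the stricter rule is an improvement depends on the titles actually present, so it is worth comparing the `value_counts()` of both versions before switching.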

#### Salary Estimate

In [6]:
data

Unnamed: 0.1,Unnamed: 0,Job Title,Salary Estimate,Job Description,Rating,Company Name,Location,Headquarters,Size,Founded,Type of ownership,Industry,Sector,Revenue,Competitors,Easy Apply,Experience
0,0,"Data Analyst, Center on Immigration and Justic...",$37K-$66K (Glassdoor est.),Are you eager to roll up your sleeves and harn...,3.2,Vera Institute of Justice\n3.2,"New York, NY","New York, NY",201 to 500 employees,1961,Nonprofit Organization,Social Assistance,Non-Profit,$100 to $500 million (USD),-1,True,Regular
1,1,Quality Data Analyst,$37K-$66K (Glassdoor est.),Overview\n\nProvides analytical and technical ...,3.8,Visiting Nurse Service of New York\n3.8,"New York, NY","New York, NY",10000+ employees,1893,Nonprofit Organization,Health Care Services & Hospitals,Health Care,$2 to $5 billion (USD),-1,-1,Regular
2,2,"Senior Data Analyst, Insights & Analytics Team...",$37K-$66K (Glassdoor est.),We’re looking for a Senior Data Analyst who ha...,3.4,Squarespace\n3.4,"New York, NY","New York, NY",1001 to 5000 employees,2003,Company - Private,Internet,Information Technology,Unknown / Non-Applicable,GoDaddy,-1,Senior
3,3,Data Analyst,$37K-$66K (Glassdoor est.),Requisition NumberRR-0001939\nRemote:Yes\nWe c...,4.1,Celerity\n4.1,"New York, NY","McLean, VA",201 to 500 employees,2002,Subsidiary or Business Segment,IT Services,Information Technology,$50 to $100 million (USD),-1,-1,Regular
4,4,Reporting Data Analyst,$37K-$66K (Glassdoor est.),ABOUT FANDUEL GROUP\n\nFanDuel Group is a worl...,3.9,FanDuel\n3.9,"New York, NY","New York, NY",501 to 1000 employees,2009,Company - Private,Sports & Recreation,"Arts, Entertainment & Recreation",$100 to $500 million (USD),DraftKings,True,Regular
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2248,2248,RQS - IHHA - 201900004460 -1q Data Security An...,$78K-$104K (Glassdoor est.),Maintains systems to protect data from unautho...,2.5,"Avacend, Inc.\n2.5","Denver, CO","Alpharetta, GA",51 to 200 employees,-1,Company - Private,Staffing & Outsourcing,Business Services,Unknown / Non-Applicable,-1,-1,Regular
2249,2249,Senior Data Analyst (Corporate Audit),$78K-$104K (Glassdoor est.),Position:\nSenior Data Analyst (Corporate Audi...,2.9,Arrow Electronics\n2.9,"Centennial, CO","Centennial, CO",10000+ employees,1935,Company - Public,Wholesale,Business Services,$10+ billion (USD),"Avnet, Ingram Micro, Tech Data",-1,Senior
2250,2250,"Technical Business Analyst (SQL, Data analytic...",$78K-$104K (Glassdoor est.),"Title: Technical Business Analyst (SQL, Data a...",-1.0,Spiceorb,"Denver, CO",-1,-1,-1,-1,-1,-1,-1,-1,-1,Regular
2251,2251,"Data Analyst 3, Customer Experience",$78K-$104K (Glassdoor est.),Summary\n\nResponsible for working cross-funct...,3.1,Contingent Network Services\n3.1,"Centennial, CO","West Chester, OH",201 to 500 employees,1984,Company - Private,Enterprise Software & Network Solutions,Information Technology,$25 to $50 million (USD),-1,-1,Regular


In [7]:
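def salary_estimate(x, low=True):
    """Return the lower (or upper) bound of a '$37K-$66K (...)' string as digits"""
    # Characters to strip: punctuation and lowercase letters, escaped for the regex
    garbage = re.escape(string.punctuation + string.ascii_lowercase)
    sal_split = x.lower().split('-')

    lower = re.sub(f'[{garbage}]', '', sal_split[0])
    upper = re.sub(f'[{garbage}]', '', sal_split[1])

    return lower if low else upper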

# Create Columns
data['Salary Lower'] = data['Salary Estimate'].apply(salary_estimate)
data['Salary Upper'] = \
    data['Salary Estimate'].apply(salary_estimate, args=(False,))

# Convert to numerical and multiply by 1000
data['Salary Lower'] = \
    pd.to_numeric(data['Salary Lower'], errors='coerce') * 1000
data['Salary Upper'] = \
    pd.to_numeric(data['Salary Upper'], errors='coerce') * 1000

data = data.convert_dtypes()
data

Unnamed: 0.1,Unnamed: 0,Job Title,Salary Estimate,Job Description,Rating,Company Name,Location,Headquarters,Size,Founded,Type of ownership,Industry,Sector,Revenue,Competitors,Easy Apply,Experience,Salary Lower,Salary Upper
0,0,"Data Analyst, Center on Immigration and Justic...",$37K-$66K (Glassdoor est.),Are you eager to roll up your sleeves and harn...,3.2,Vera Institute of Justice 3.2,"New York, NY","New York, NY",201 to 500 employees,1961,Nonprofit Organization,Social Assistance,Non-Profit,$100 to $500 million (USD),-1,True,Regular,37000,66000
1,1,Quality Data Analyst,$37K-$66K (Glassdoor est.),Overview Provides analytical and technical su...,3.8,Visiting Nurse Service of New York 3.8,"New York, NY","New York, NY",10000+ employees,1893,Nonprofit Organization,Health Care Services & Hospitals,Health Care,$2 to $5 billion (USD),-1,-1,Regular,37000,66000
2,2,"Senior Data Analyst, Insights & Analytics Team...",$37K-$66K (Glassdoor est.),We’re looking for a Senior Data Analyst who ha...,3.4,Squarespace 3.4,"New York, NY","New York, NY",1001 to 5000 employees,2003,Company - Private,Internet,Information Technology,Unknown / Non-Applicable,GoDaddy,-1,Senior,37000,66000
3,3,Data Analyst,$37K-$66K (Glassdoor est.),Requisition NumberRR-0001939 Remote:Yes We col...,4.1,Celerity 4.1,"New York, NY","McLean, VA",201 to 500 employees,2002,Subsidiary or Business Segment,IT Services,Information Technology,$50 to $100 million (USD),-1,-1,Regular,37000,66000
4,4,Reporting Data Analyst,$37K-$66K (Glassdoor est.),ABOUT FANDUEL GROUP FanDuel Group is a world-...,3.9,FanDuel 3.9,"New York, NY","New York, NY",501 to 1000 employees,2009,Company - Private,Sports & Recreation,"Arts, Entertainment & Recreation",$100 to $500 million (USD),DraftKings,True,Regular,37000,66000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2248,2248,RQS - IHHA - 201900004460 -1q Data Security An...,$78K-$104K (Glassdoor est.),Maintains systems to protect data from unautho...,2.5,"Avacend, Inc. 2.5","Denver, CO","Alpharetta, GA",51 to 200 employees,-1,Company - Private,Staffing & Outsourcing,Business Services,Unknown / Non-Applicable,-1,-1,Regular,78000,104000
2249,2249,Senior Data Analyst (Corporate Audit),$78K-$104K (Glassdoor est.),Position: Senior Data Analyst (Corporate Audit...,2.9,Arrow Electronics 2.9,"Centennial, CO","Centennial, CO",10000+ employees,1935,Company - Public,Wholesale,Business Services,$10+ billion (USD),"Avnet, Ingram Micro, Tech Data",-1,Senior,78000,104000
2250,2250,"Technical Business Analyst (SQL, Data analytic...",$78K-$104K (Glassdoor est.),"Title: Technical Business Analyst (SQL, Data a...",-1.0,Spiceorb,"Denver, CO",-1,-1,-1,-1,-1,-1,-1,-1,-1,Regular,78000,104000
2251,2251,"Data Analyst 3, Customer Experience",$78K-$104K (Glassdoor est.),Summary Responsible for working cross-functio...,3.1,Contingent Network Services 3.1,"Centennial, CO","West Chester, OH",201 to 500 employees,1984,Company - Private,Enterprise Software & Network Solutions,Information Technology,$25 to $50 million (USD),-1,-1,Regular,78000,104000


In [8]:
data.dtypes

Unnamed: 0             Int64
Job Title             string
Salary Estimate       string
Job Description       string
Rating               Float64
Company Name          string
Location              string
Headquarters          string
Size                  string
Founded                Int64
Type of ownership     string
Industry              string
Sector                string
Revenue               string
Competitors           string
Easy Apply            string
Experience            string
Salary Lower           Int64
Salary Upper           Int64
dtype: object
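
As an alternative to splitting on `-` and stripping characters, both bounds could be pulled out in one step with `Series.str.extract`. This is only a sketch: the helper name `parse_salary_range` and the sample values are made up for illustration, and it assumes every valid estimate looks like `$37K-$66K`. Anything that does not match the pattern ends up as NaN in both columns.

```python
import pandas as pd


def parse_salary_range(estimates: pd.Series) -> pd.DataFrame:
    """Extract lower/upper salary bounds (in USD) from '$37K-$66K ...' strings."""
    parts = estimates.astype(str).str.extract(r'\$(\d+)K\s*-\s*\$(\d+)K')
    parts.columns = ['Salary Lower', 'Salary Upper']
    # Non-matching rows are NaN already; convert the rest and scale to dollars
    return parts.apply(pd.to_numeric, errors='coerce') * 1000


# Hypothetical sample, including a '-1' placeholder value
sample = pd.Series(['$37K-$66K (Glassdoor est.)',
                    '$78K-$104K (Glassdoor est.)',
                    '-1'])
salaries = parse_salary_range(sample)
print(salaries)
```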

In [9]:
# data['Salary Lower'].value_counts().sort_index()

#### Company Names

The company rating appended to the company name has to go.

In [10]:
data['Company Name'].head(10)

0             Vera Institute of Justice
3.2
1    Visiting Nurse Service of New York
3.8
2                           Squarespace
3.4
3                              Celerity
4.1
4                               FanDuel
3.9
5                               Point72
3.9
6                             Two Sigma
4.4
7               GNY Insurance Companies
3.7
8                                  DMGT
4.0
9                             Riskified
4.4
Name: Company Name, dtype: string

In [11]:
# Let's try to regex the rating away

data['Company Name'].str.replace('[0-9.]{3}$', '', regex=True)

0                Vera Institute of Justice

1       Visiting Nurse Service of New York

2                              Squarespace

3                                 Celerity

4                                  FanDuel

                       ...                 
2248                         Avacend, Inc.

2249                     Arrow Electronics

2250                               Spiceorb
2251           Contingent Network Services

2252                            SCL Health

Name: Company Name, Length: 2253, dtype: string

In [12]:
# That worked. Make it permanent

data['Company Name'] = \
    data['Company Name'].str.replace('[0-9.]{3}$', '', regex=True)
data.head()

Unnamed: 0.1,Unnamed: 0,Job Title,Salary Estimate,Job Description,Rating,Company Name,Location,Headquarters,Size,Founded,Type of ownership,Industry,Sector,Revenue,Competitors,Easy Apply,Experience,Salary Lower,Salary Upper
0,0,"Data Analyst, Center on Immigration and Justic...",$37K-$66K (Glassdoor est.),Are you eager to roll up your sleeves and harn...,3.2,Vera Institute of Justice,"New York, NY","New York, NY",201 to 500 employees,1961,Nonprofit Organization,Social Assistance,Non-Profit,$100 to $500 million (USD),-1,True,Regular,37000,66000
1,1,Quality Data Analyst,$37K-$66K (Glassdoor est.),Overview Provides analytical and technical su...,3.8,Visiting Nurse Service of New York,"New York, NY","New York, NY",10000+ employees,1893,Nonprofit Organization,Health Care Services & Hospitals,Health Care,$2 to $5 billion (USD),-1,-1,Regular,37000,66000
2,2,"Senior Data Analyst, Insights & Analytics Team...",$37K-$66K (Glassdoor est.),We’re looking for a Senior Data Analyst who ha...,3.4,Squarespace,"New York, NY","New York, NY",1001 to 5000 employees,2003,Company - Private,Internet,Information Technology,Unknown / Non-Applicable,GoDaddy,-1,Senior,37000,66000
3,3,Data Analyst,$37K-$66K (Glassdoor est.),Requisition NumberRR-0001939 Remote:Yes We col...,4.1,Celerity,"New York, NY","McLean, VA",201 to 500 employees,2002,Subsidiary or Business Segment,IT Services,Information Technology,$50 to $100 million (USD),-1,-1,Regular,37000,66000
4,4,Reporting Data Analyst,$37K-$66K (Glassdoor est.),ABOUT FANDUEL GROUP FanDuel Group is a world-...,3.9,FanDuel,"New York, NY","New York, NY",501 to 1000 employees,2009,Company - Private,Sports & Recreation,"Arts, Entertainment & Recreation",$100 to $500 million (USD),DraftKings,True,Regular,37000,66000
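
The printed output of cell 11 shows an empty line after most names, which suggests that the newline between name and rating survives the replacement. A minimal sketch of a pattern that removes the newline together with the rating is shown below. The sample values are copied from the table above, and the variable names are placeholders.

```python
import pandas as pd

# Names as they appear in the raw data: "<name>\n<rating>" or just "<name>"
names = pd.Series(['Vera Institute of Justice\n3.2',
                   'Spiceorb',
                   'Avacend, Inc.\n2.5'])

# Drop the newline plus a trailing one-decimal rating, then trim whitespace
cleaned = names.str.replace(r'\n\d\.\d$', '', regex=True).str.strip()
print(cleaned.tolist())
```

Calling `.str.strip()` on the already cleaned column would have a similar effect if only the trailing newline is left.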


#### Deleting Columns

Some columns have to go, both to drop irrelevant information and to get rid of some garbage data.

In [13]:
# Deleting columns
data = data.drop(columns=['Unnamed: 0', 'Easy Apply', 'Competitors',
                          'Headquarters', 'Founded', 'Type of ownership',
                          'Sector', 'Revenue', 'Industry'])

# Also delete `Salary Estimate` column
data = data.drop(columns=['Salary Estimate'])

#### Reorder Dataframe

Let's also reorder the table to see job-related information first and company information second.

In [14]:
data = data[['Job Title', 'Experience', 'Salary Lower', 'Salary Upper',
      'Job Description', 'Company Name', 'Rating', 'Location', 'Size']]
data.head(25)

Unnamed: 0,Job Title,Experience,Salary Lower,Salary Upper,Job Description,Company Name,Rating,Location,Size
0,"Data Analyst, Center on Immigration and Justic...",Regular,37000,66000,Are you eager to roll up your sleeves and harn...,Vera Institute of Justice,3.2,"New York, NY",201 to 500 employees
1,Quality Data Analyst,Regular,37000,66000,Overview Provides analytical and technical su...,Visiting Nurse Service of New York,3.8,"New York, NY",10000+ employees
2,"Senior Data Analyst, Insights & Analytics Team...",Senior,37000,66000,We’re looking for a Senior Data Analyst who ha...,Squarespace,3.4,"New York, NY",1001 to 5000 employees
3,Data Analyst,Regular,37000,66000,Requisition NumberRR-0001939 Remote:Yes We col...,Celerity,4.1,"New York, NY",201 to 500 employees
4,Reporting Data Analyst,Regular,37000,66000,ABOUT FANDUEL GROUP FanDuel Group is a world-...,FanDuel,3.9,"New York, NY",501 to 1000 employees
5,Data Analyst,Regular,37000,66000,About Cubist Cubist Systematic Strategies is o...,Point72,3.9,"New York, NY",1001 to 5000 employees
6,Business/Data Analyst (FP&A),Regular,37000,66000,Two Sigma is a different kind of investment ma...,Two Sigma,4.4,"New York, NY",1001 to 5000 employees
7,Data Science Analyst,Regular,37000,66000,Data Science Analyst Job Details Level Experi...,GNY Insurance Companies,3.7,"New York, NY",201 to 500 employees
8,Data Analyst,Regular,37000,66000,The Data Analyst is an integral member of the ...,DMGT,4.0,"New York, NY",5001 to 10000 employees
9,"Data Analyst, Merchant Health",Regular,37000,66000,About Us Riskified is the AI platform powerin...,Riskified,4.4,"New York, NY",501 to 1000 employees


#### Garbage Values

Now the garbage has to go or be replaced in a meaningful way.

In [15]:
# Let's see if something is already considered garbage
data.isnull().sum()

Job Title          0
Experience         0
Salary Lower       1
Salary Upper       0
Job Description    0
Company Name       1
Rating             0
Location           0
Size               0
dtype: int64

Almost everything looks fine at first glance, but there are still occurrences of `-1` and `Unknown` within the data.

In [16]:
data[data.isin([-1]).any(axis=1)]

Unnamed: 0,Job Title,Experience,Salary Lower,Salary Upper,Job Description,Company Name,Rating,Location,Size
11,Data Analyst,Regular,37000,66000,BulbHead is currently seeking a Data Analyst t...,BulbHead,-1.0,"Fairfield, NJ",1 to 50 employees
21,Data Science Analyst,Regular,37000,66000,"Job Description Our client, a music streaming ...",MUSIC & Entertainment,-1.0,"New York, NY",Unknown
34,Data Analyst (Games),Regular,46000,87000,Carry1st is the leading mobile game publisher ...,Carry1st,-1.0,"New York, NY",-1
36,Data Business Analyst,Regular,46000,87000,"At Clear Street, we are disrupting the institu...",Clear Street,-1.0,"New York, NY",51 to 200 employees
40,"Business Analyst, Data Platforms",Regular,46000,87000,Company Description Pinto is building the wor...,Pinto,-1.0,"New York, NY",1 to 50 employees
...,...,...,...,...,...,...,...,...,...
2200,Data Analyst,Regular,49000,91000,Role Data Analyst Duration12+ months Location ...,"TechAspect Solutions, Inc. dba TA Digital",-1.0,"Centennial, CO",-1
2202,Financial Data Analyst,Regular,49000,91000,Position:Financial Data AnalystJob Description...,Black Knight Financial Technology Solutions,-1.0,"Denver, CO",-1
2239,Senior Contract Data Analyst,Senior,78000,104000,OverviewAmyx is seeking to hire a Senior Contr...,"Amyx, Iinc.",-1.0,"Aurora, CO",-1
2246,"Technical Business Analyst (SQL, Data analytic...",Regular,78000,104000,Spiceorb is looking for Technical Business Ana...,Spiceorb,-1.0,"Denver, CO",-1


In [17]:
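# Same check for -1 stored as a string. The search term must be exactly '-1';
# the empty result below came from a run that searched for the mistyped '-1]'.
data[data.isin(['-1']).any(axis=1)]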

Unnamed: 0,Job Title,Experience,Salary Lower,Salary Upper,Job Description,Company Name,Rating,Location,Size


In [18]:
data[data.isin(['Unknown']).any(axis=1)]

Unnamed: 0,Job Title,Experience,Salary Lower,Salary Upper,Job Description,Company Name,Rating,Location,Size
21,Data Science Analyst,Regular,37000,66000,"Job Description Our client, a music streaming ...",MUSIC & Entertainment,-1.0,"New York, NY",Unknown
85,"Data Analyst, Bitcoin Trading Firm",Regular,51000,88000,Our client is an innovative bitcoin marketplac...,Fintech Recruiters,-1.0,"New York, NY",Unknown
148,Data Analyst,Regular,59000,85000,"Get To Know Voice At Voice, we are on a missi...",Voice,3.4,"Brooklyn, NY",Unknown
167,Lead Data Insights Analyst,Senior,43000,76000,Lead Data Insights Analyst Overview TapRm's g...,TapRm,-1.0,"Brooklyn, NY",Unknown
179,Data Analyst,Regular,43000,76000,"We are looking for an organized, detail-orient...",CompuForce,-1.0,"New York, NY",Unknown
275,Data Analyst/Project Manager,Regular,84000,90000,*Overview** The Data Analyst is a part of the ...,SMBC,3.0,"Jersey City, NJ",Unknown
391,TX Healthcare Data/Reporting Analyst,Regular,98000,114000,Position Summary A data management/analyst who...,RN Staff,3.4,"New York, NY",Unknown
402,NY Healthcare Data/Reporting Analyst,Regular,48000,96000,Position Summary A data management/analyst who...,RN Staff,3.4,"New York, NY",Unknown
403,Healthcare Senior Data Analyst - HEDIS,Senior,48000,96000,Job Description Healthcare Data Analyst - HEDI...,Village Care,-1.0,"New York, NY",Unknown
568,Data Analyst,Regular,37000,70000,"As a Data Analyst, you will engage with busine...",SAG-AFTRA Health Plan and SAG-Producers Pensio...,2.8,"Burbank, CA",Unknown


In [19]:
# Replace the occurrences of -1 and Unknown globally with numpy NaN

data = data.replace(-1, np.nan).replace('-1', np.nan).replace('Unknown', np.nan)
data

Unnamed: 0,Job Title,Experience,Salary Lower,Salary Upper,Job Description,Company Name,Rating,Location,Size
0,"Data Analyst, Center on Immigration and Justic...",Regular,37000,66000,Are you eager to roll up your sleeves and harn...,Vera Institute of Justice,3.2,"New York, NY",201 to 500 employees
1,Quality Data Analyst,Regular,37000,66000,Overview Provides analytical and technical su...,Visiting Nurse Service of New York,3.8,"New York, NY",10000+ employees
2,"Senior Data Analyst, Insights & Analytics Team...",Senior,37000,66000,We’re looking for a Senior Data Analyst who ha...,Squarespace,3.4,"New York, NY",1001 to 5000 employees
3,Data Analyst,Regular,37000,66000,Requisition NumberRR-0001939 Remote:Yes We col...,Celerity,4.1,"New York, NY",201 to 500 employees
4,Reporting Data Analyst,Regular,37000,66000,ABOUT FANDUEL GROUP FanDuel Group is a world-...,FanDuel,3.9,"New York, NY",501 to 1000 employees
...,...,...,...,...,...,...,...,...,...
2248,RQS - IHHA - 201900004460 -1q Data Security An...,Regular,78000,104000,Maintains systems to protect data from unautho...,"Avacend, Inc.",2.5,"Denver, CO",51 to 200 employees
2249,Senior Data Analyst (Corporate Audit),Senior,78000,104000,Position: Senior Data Analyst (Corporate Audit...,Arrow Electronics,2.9,"Centennial, CO",10000+ employees
2250,"Technical Business Analyst (SQL, Data analytic...",Regular,78000,104000,"Title: Technical Business Analyst (SQL, Data a...",Spiceorb,,"Denver, CO",
2251,"Data Analyst 3, Customer Experience",Regular,78000,104000,Summary Responsible for working cross-functio...,Contingent Network Services,3.1,"Centennial, CO",201 to 500 employees
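
The three chained `replace` calls can also be written as a single call with a list of placeholder values. The sketch below uses a small made-up frame (`toy`) rather than the real data, so the values are purely illustrative.

```python
import numpy as np
import pandas as pd

# Toy frame with the three placeholder variants seen in the data
toy = pd.DataFrame({
    'Rating': [3.2, -1.0, 4.1],
    'Size': ['-1', 'Unknown', '51 to 200 employees'],
})

# One call replaces every listed placeholder with NaN
placeholders = [-1, '-1', 'Unknown']
cleaned_toy = toy.replace(placeholders, np.nan)
print(cleaned_toy.isna().sum())
```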


In [20]:
# Let's check the company name for very short entries

data[data['Company Name'].str.len() < 3]

Unnamed: 0,Job Title,Experience,Salary Lower,Salary Upper,Job Description,Company Name,Rating,Location,Size
1352,Data Analyst,Regular,30000,53000,"Job Description ETL, SQL Queries, Data Modelin...",1,,"Dallas, TX",


In [21]:
# Also check the job description

data[data['Job Description'].str.len() < 10]

Unnamed: 0,Job Title,Experience,Salary Lower,Salary Upper,Job Description,Company Name,Rating,Location,Size
912,Data Expert Analyst/Modeler,Regular,29000,38000,Â Â Â,InvenTech Info,4.8,"Houston, TX",201 to 500 employees


In [22]:
# Fix both

data.loc[data['Job Description'].str.len() < 10, 'Job Description'] = np.nan
data.loc[data['Company Name'].str.len() < 3, 'Company Name'] = np.nan

# And check again...

In [23]:
data[data['Job Description'].str.len() < 10]

Unnamed: 0,Job Title,Experience,Salary Lower,Salary Upper,Job Description,Company Name,Rating,Location,Size


In [24]:
data[data['Company Name'].str.len() < 3]

Unnamed: 0,Job Title,Experience,Salary Lower,Salary Upper,Job Description,Company Name,Rating,Location,Size


In [25]:
# See if we have null values now

data.isnull().sum()

Job Title            0
Experience           0
Salary Lower         1
Salary Upper         0
Job Description      1
Company Name         2
Rating             272
Location             0
Size               205
dtype: int64

#### Deleting Rows with Garbage

In [40]:
data[data.isna().any(axis=1)]

Unnamed: 0,Job Title,Experience,Salary Lower,Salary Upper,Job Description,Company Name,Rating,Location,Size
11,Data Analyst,Regular,37000,66000,BulbHead is currently seeking a Data Analyst t...,BulbHead,,"Fairfield, NJ",1 to 50 employees
21,Data Science Analyst,Regular,37000,66000,"Job Description Our client, a music streaming ...",MUSIC & Entertainment,,"New York, NY",
34,Data Analyst (Games),Regular,46000,87000,Carry1st is the leading mobile game publisher ...,Carry1st,,"New York, NY",
36,Data Business Analyst,Regular,46000,87000,"At Clear Street, we are disrupting the institu...",Clear Street,,"New York, NY",51 to 200 employees
40,"Business Analyst, Data Platforms",Regular,46000,87000,Company Description Pinto is building the wor...,Pinto,,"New York, NY",1 to 50 employees
...,...,...,...,...,...,...,...,...,...
2200,Data Analyst,Regular,49000,91000,Role Data Analyst Duration12+ months Location ...,"TechAspect Solutions, Inc. dba TA Digital",,"Centennial, CO",
2202,Financial Data Analyst,Regular,49000,91000,Position:Financial Data AnalystJob Description...,Black Knight Financial Technology Solutions,,"Denver, CO",
2239,Senior Contract Data Analyst,Senior,78000,104000,OverviewAmyx is seeking to hire a Senior Contr...,"Amyx, Iinc.",,"Aurora, CO",
2246,"Technical Business Analyst (SQL, Data analytic...",Regular,78000,104000,Spiceorb is looking for Technical Business Ana...,Spiceorb,,"Denver, CO",


We're now left with about 300 rows containing garbage data. For simplicity's sake, let's nuke these rows, knowingly destroying about 13% of the data.

In [41]:
data = data.dropna()
data

Unnamed: 0,Job Title,Experience,Salary Lower,Salary Upper,Job Description,Company Name,Rating,Location,Size
0,"Data Analyst, Center on Immigration and Justic...",Regular,37000,66000,Are you eager to roll up your sleeves and harn...,Vera Institute of Justice,3.2,"New York, NY",201 to 500 employees
1,Quality Data Analyst,Regular,37000,66000,Overview Provides analytical and technical su...,Visiting Nurse Service of New York,3.8,"New York, NY",10000+ employees
2,"Senior Data Analyst, Insights & Analytics Team...",Senior,37000,66000,We’re looking for a Senior Data Analyst who ha...,Squarespace,3.4,"New York, NY",1001 to 5000 employees
3,Data Analyst,Regular,37000,66000,Requisition NumberRR-0001939 Remote:Yes We col...,Celerity,4.1,"New York, NY",201 to 500 employees
4,Reporting Data Analyst,Regular,37000,66000,ABOUT FANDUEL GROUP FanDuel Group is a world-...,FanDuel,3.9,"New York, NY",501 to 1000 employees
...,...,...,...,...,...,...,...,...,...
2247,Marketing/Communications - Data Analyst-Marketing,Regular,78000,104000,Job Description Job Title: Marketing/Communica...,APN Software Services Inc.,4.1,"Broomfield, CO",51 to 200 employees
2248,RQS - IHHA - 201900004460 -1q Data Security An...,Regular,78000,104000,Maintains systems to protect data from unautho...,"Avacend, Inc.",2.5,"Denver, CO",51 to 200 employees
2249,Senior Data Analyst (Corporate Audit),Senior,78000,104000,Position: Senior Data Analyst (Corporate Audit...,Arrow Electronics,2.9,"Centennial, CO",10000+ employees
2251,"Data Analyst 3, Customer Experience",Regular,78000,104000,Summary Responsible for working cross-functio...,Contingent Network Services,3.1,"Centennial, CO",201 to 500 employees
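
Since the dataset is small, it can be worth measuring the loss before dropping anything, and dropping only on the columns a given question needs. The following sketch shows both ideas on a made-up frame. The name `toy` and its values are assumptions for illustration and do not come from the dataset.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Salary Lower': [37000, 46000, 78000, np.nan],
    'Rating': [3.2, np.nan, 2.9, 4.1],
    'Size': ['201 to 500 employees', None,
             '10000+ employees', '51 to 200 employees'],
})

# Share of rows with at least one missing value (what a plain dropna() removes)
share_incomplete = toy.isna().any(axis=1).mean()

# For a salary-only question, only rows without a salary need to go
salary_rows = toy.dropna(subset=['Salary Lower'])

print(f'{share_incomplete:.0%} of rows are incomplete')
print(len(salary_rows), 'rows remain for salary questions')
```

With `subset`, rows that merely lack a `Rating` or `Size` stay available for the salary and location questions.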


In [28]:
# raw_data.isnull().sum()