# Data Cleanup

The dataset is rather small, so the cleanup and preprocessing should keep as much of it as possible.

In [108]:
import os

import pandas as pd
import numpy as np

Read the dataset and take a peek inside...

In [109]:
data = pd.read_csv(os.path.join('..', 'raw_data', 'DataAnalyst.csv'))
data.head(25)

Unnamed: 0.1,Unnamed: 0,Job Title,Salary Estimate,Job Description,Rating,Company Name,Location,Headquarters,Size,Founded,Type of ownership,Industry,Sector,Revenue,Competitors,Easy Apply
0,0,"Data Analyst, Center on Immigration and Justic...",$37K-$66K (Glassdoor est.),Are you eager to roll up your sleeves and harn...,3.2,Vera Institute of Justice\n3.2,"New York, NY","New York, NY",201 to 500 employees,1961,Nonprofit Organization,Social Assistance,Non-Profit,$100 to $500 million (USD),-1,True
1,1,Quality Data Analyst,$37K-$66K (Glassdoor est.),Overview\n\nProvides analytical and technical ...,3.8,Visiting Nurse Service of New York\n3.8,"New York, NY","New York, NY",10000+ employees,1893,Nonprofit Organization,Health Care Services & Hospitals,Health Care,$2 to $5 billion (USD),-1,-1
2,2,"Senior Data Analyst, Insights & Analytics Team...",$37K-$66K (Glassdoor est.),We’re looking for a Senior Data Analyst who ha...,3.4,Squarespace\n3.4,"New York, NY","New York, NY",1001 to 5000 employees,2003,Company - Private,Internet,Information Technology,Unknown / Non-Applicable,GoDaddy,-1
3,3,Data Analyst,$37K-$66K (Glassdoor est.),Requisition NumberRR-0001939\nRemote:Yes\nWe c...,4.1,Celerity\n4.1,"New York, NY","McLean, VA",201 to 500 employees,2002,Subsidiary or Business Segment,IT Services,Information Technology,$50 to $100 million (USD),-1,-1
4,4,Reporting Data Analyst,$37K-$66K (Glassdoor est.),ABOUT FANDUEL GROUP\n\nFanDuel Group is a worl...,3.9,FanDuel\n3.9,"New York, NY","New York, NY",501 to 1000 employees,2009,Company - Private,Sports & Recreation,"Arts, Entertainment & Recreation",$100 to $500 million (USD),DraftKings,True
5,5,Data Analyst,$37K-$66K (Glassdoor est.),About Cubist\nCubist Systematic Strategies is ...,3.9,Point72\n3.9,"New York, NY","Stamford, CT",1001 to 5000 employees,2014,Company - Private,Investment Banking & Asset Management,Finance,Unknown / Non-Applicable,-1,-1
6,6,Business/Data Analyst (FP&A),$37K-$66K (Glassdoor est.),Two Sigma is a different kind of investment ma...,4.4,Two Sigma\n4.4,"New York, NY","New York, NY",1001 to 5000 employees,2001,Company - Private,Investment Banking & Asset Management,Finance,Unknown / Non-Applicable,-1,-1
7,7,Data Science Analyst,$37K-$66K (Glassdoor est.),Data Science Analyst\n\nJob Details\nLevel\nEx...,3.7,GNY Insurance Companies\n3.7,"New York, NY","New York, NY",201 to 500 employees,1914,Company - Private,Insurance Carriers,Insurance,$100 to $500 million (USD),"Travelers, Chubb, Crum & Forster",True
8,8,Data Analyst,$37K-$66K (Glassdoor est.),The Data Analyst is an integral member of the ...,4.0,DMGT\n4.0,"New York, NY","London, United Kingdom",5001 to 10000 employees,1896,Company - Public,Venture Capital & Private Equity,Finance,$1 to $2 billion (USD),"Thomson Reuters, Hearst, Pearson",-1
9,9,"Data Analyst, Merchant Health",$37K-$66K (Glassdoor est.),About Us\n\nRiskified is the AI platform power...,4.4,Riskified\n4.4,"New York, NY","New York, NY",501 to 1000 employees,2013,Company - Private,Research & Development,Business Services,Unknown / Non-Applicable,"Signifyd, Forter",-1


### First impression

1. The `Job Title` column is noisy. Let's add an `Experience` column so we can sort and group by seniority.
2. `Salary Estimate` isn't helpful in this form either. This column should be numeric.
I will delete it and add two new columns for the lower and upper end of the salary range instead.
3. `Size` can be split into two columns as well, just like `Salary Estimate`.
4. The rating needs to be removed from `Company Name`.
5. Some columns that don't seem helpful for answering the questions can be deleted:
* `Unnamed: 0`
* `Easy Apply`
* `Competitors`
* `Headquarters`
* `Founded`
* `Type of ownership`
* `Industry`
* `Sector`
* `Revenue`
6. There are a few `-1` values scattered through the columns.
Deleting the columns above deals with part of the `-1`/rubbish values,
but there are probably more rubbish values elsewhere.
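
To put a number on the last point, the placeholders can be counted per column. This is a minimal sketch on a made-up three-row frame; the list of placeholder strings is an assumption and would need extending for the real data.

```python
import pandas as pd

# Toy frame standing in for the real data; the values are made up.
sample = pd.DataFrame({
    'Rating': [3.2, -1.0, 4.1],
    'Competitors': ['-1', 'GoDaddy', '-1'],
    'Revenue': ['Unknown / Non-Applicable', '-1', '$1 to $2 billion (USD)'],
})

# Numeric and text placeholders are checked separately and then combined.
numeric_hits = sample.isin([-1])
text_hits = sample.isin(['-1', 'Unknown / Non-Applicable'])

# Summing the boolean frame counts the placeholder cells per column.
placeholder_counts = (numeric_hits | text_hits).sum()
print(placeholder_counts)
```

Run against the full dataframe, the same two `isin` calls would show which columns are worth cleaning and which are better dropped.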

### Job Titles and Experience

Let's investigate the noisy job titles. First, filter out the clean ones so that only the noise remains.

In [110]:
noisy = data[(data['Job Title'] != 'Data Analyst') \
            & (data['Job Title'] != 'Junior Data Analyst') \
            & (data['Job Title'] != 'Senior Data Analyst')]

noisy['Job Title'].head(25)

0     Data Analyst, Center on Immigration and Justic...
1                                  Quality Data Analyst
2     Senior Data Analyst, Insights & Analytics Team...
4                                Reporting Data Analyst
6                          Business/Data Analyst (FP&A)
7                                  Data Science Analyst
9                         Data Analyst, Merchant Health
12                                         DATA ANALYST
14                     Investment Advisory Data Analyst
15                          Sustainability Data Analyst
17                                Clinical Data Analyst
18                              DATA PROGRAMMER/ANALYST
20                        Product Analyst, Data Science
21                                 Data Science Analyst
22                       Data Analyst - Intex Developer
24                       Entry Level / Jr. Data Analyst
26                 Data + Business Intelligence Analyst
27                                Data Analyst, 

The titles do distinguish between junior, regular and senior positions, but not consistently.
Assigning a few synonyms to each level helps us here.

| Junior   | Senior | Regular         |
|----------|--------|-----------------|
| junior   | senior | everything else |
| beginner | lead   | ...             |
| entry    | master | ...             |
| jr       | sr     | ...             |



Based on the above findings, we can write a custom function that adds a new `Experience` column to differentiate juniors, regulars and seniors.

In [111]:
def seniority(job_title):
    """Return the seniority level implied by a job title.

    :param job_title: Input job title
    :return: 'Junior', 'Senior' or 'Regular'
    """

    experience = {'junior': ['beginner', 'entry', 'junior', 'jr'],
                  'senior': ['senior', 'lead', 'sr', 'master']}

    title = job_title.lower()

    # Return Junior or Senior on the first synonym found in the title
    for exp, words in experience.items():
        for w in words:
            if w in title:
                return exp.title()

    # No synonym matched, so the position counts as Regular
    return 'Regular'

data['Experience'] = data['Job Title'].map(seniority)
data[['Job Title', 'Experience']].sample(10)

Unnamed: 0,Job Title,Experience
1165,Payment Integrity Data Analyst,Regular
2104,Business/Data Analyst,Regular
1261,Data Analyst,Regular
1376,Data Analyst,Regular
468,Data Analyst,Regular
1875,Reporting /Data Analyst,Regular
640,Senior Data Engineer & Analyst,Senior
1684,Associate SalesForce.com Data Steward & Analyst,Regular
1386,Information Technology Specialist / Data Analyst,Regular
1275,Data Privacy Analyst,Regular
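
The function matches synonyms as plain substrings, so `lead` would also fire inside a word like "leadership". A possible refinement is to match whole words only. The sketch below assumes a hypothetical helper `seniority_by_word` and reuses the synonym lists from above; the example titles other than the ones taken from the dataset are made up.

```python
import re

# Same synonym lists as in the seniority() function.
JUNIOR = ['beginner', 'entry', 'junior', 'jr']
SENIOR = ['senior', 'lead', 'sr', 'master']


def build_pattern(words):
    # \b anchors each synonym at word boundaries, so 'lead' no longer
    # matches inside 'leadership'.
    return re.compile(r'\b(?:' + '|'.join(words) + r')\b', re.IGNORECASE)


JUNIOR_RE = build_pattern(JUNIOR)
SENIOR_RE = build_pattern(SENIOR)


def seniority_by_word(job_title):
    """Hypothetical whole-word variant of seniority()."""
    if JUNIOR_RE.search(job_title):
        return 'Junior'
    if SENIOR_RE.search(job_title):
        return 'Senior'
    return 'Regular'


for title in ['Entry Level / Jr. Data Analyst',
              'Data Leadership Analyst',        # made-up title
              'Master Data Operation Analyst']:
    print(title, '->', seniority_by_word(title))
```

Whole-word matching doesn't fix everything: "Master Data Operation Analyst" is still labelled Senior, because "master" is a complete word there even though it describes the data rather than the role.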


### Salary Estimate

Let's clean up the `Salary Estimate` column by splitting it into two columns, `Salary Lower` and `Salary Upper`, and multiplying by 1000 to remove the "K".

In [112]:
data[['Salary Estimate']].sample(15)

Unnamed: 0,Salary Estimate
2004,$65K-$120K (Glassdoor est.)
613,$35K-$45K (Glassdoor est.)
1455,$48K-$88K (Glassdoor est.)
1849,$54K-$75K (Glassdoor est.)
1578,$51K-$93K (Glassdoor est.)
1689,$35K-$42K (Glassdoor est.)
1453,$48K-$88K (Glassdoor est.)
538,$55K-$103K (Glassdoor est.)
536,$55K-$103K (Glassdoor est.)
1487,$110K-$190K (Glassdoor est.)


In [113]:
data[['Salary Lower', 'Salary Upper']] = data['Salary Estimate'].str\
    .split('-', expand=True)\
    .replace(r'[a-zA-Z$.()]', '', regex=True)

data['Salary Lower'], data['Salary Upper'] = \
    pd.to_numeric(data['Salary Lower'], errors='coerce') * 1000,\
    pd.to_numeric(data['Salary Upper'], errors='coerce') * 1000

data[['Salary Estimate', 'Salary Lower', 'Salary Upper']].sample(15)

Unnamed: 0,Salary Estimate,Salary Lower,Salary Upper
1756,$35K-$67K (Glassdoor est.),35000.0,67000
922,$47K-$76K (Glassdoor est.),47000.0,76000
20,$37K-$66K (Glassdoor est.),37000.0,66000
1257,$76K-$122K (Glassdoor est.),76000.0,122000
472,$43K-$69K (Glassdoor est.),43000.0,69000
1797,$50K-$86K (Glassdoor est.),50000.0,86000
1852,$54K-$75K (Glassdoor est.),54000.0,75000
1294,$60K-$124K (Glassdoor est.),60000.0,124000
1924,$99K-$178K (Glassdoor est.),99000.0,178000
992,$53K-$94K (Glassdoor est.),53000.0,94000
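
One quirk of splitting on `-`: if the column contains a `-1` placeholder like the other columns do, it splits into an empty string and `1`, so the lower bound becomes NaN while the upper bound becomes 1000. As an alternative, here is a minimal sketch that pulls both numbers out with named regex groups, so anything that doesn't look like a salary range ends up as NaN in both columns. The sample values and the column names `lower`/`upper` are only for illustration.

```python
import pandas as pd

estimates = pd.Series(['$37K-$66K (Glassdoor est.)',
                       '$110K-$190K (Glassdoor est.)',
                       '-1'])

# Named groups become column names; rows that do not match yield NaN.
pattern = r'\$(?P<lower>\d+)K-\$(?P<upper>\d+)K'

# Convert the extracted digit strings to numbers and scale from K to units.
salary = estimates.str.extract(pattern).apply(pd.to_numeric) * 1000
print(salary)
```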


### Company Size

Let's reuse the approach that worked for `Salary Estimate`. Here we also add two new columns: `Company Size Min` and `Company Size Max`.

In [114]:
data[['Company Size Min', 'Company Size Max']] = data['Size'].str\
    .replace('[a-zA-Z+]', '', regex=True)\
    .str.split(expand=True)

data['Company Size Min'], data['Company Size Max'] = \
    pd.to_numeric(data['Company Size Min'], errors='coerce'),\
    pd.to_numeric(data['Company Size Max'], errors='coerce')

data[['Size', 'Company Size Min', 'Company Size Max']].sample(15)

Unnamed: 0,Size,Company Size Min,Company Size Max
1187,1001 to 5000 employees,1001.0,5000.0
635,1 to 50 employees,1.0,50.0
1948,51 to 200 employees,51.0,200.0
411,1001 to 5000 employees,1001.0,5000.0
1641,10000+ employees,10000.0,
1868,201 to 500 employees,201.0,500.0
321,501 to 1000 employees,501.0,1000.0
878,1 to 50 employees,1.0,50.0
774,51 to 200 employees,51.0,200.0
848,5001 to 10000 employees,5001.0,10000.0
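
To see what the strip-and-split approach does with the different shapes of `Size`, here is a small sketch on three sample strings (a closed range, an open-ended size and the `-1` placeholder). It applies the same two string operations as the cell above, just chained on a toy Series.

```python
import pandas as pd

sizes = pd.Series(['201 to 500 employees', '10000+ employees', '-1'])

# Drop letters and '+', split on whitespace, then convert each column.
parts = (sizes.str.replace('[a-zA-Z+]', '', regex=True)
              .str.split(expand=True)
              .apply(pd.to_numeric, errors='coerce'))
parts.columns = ['Company Size Min', 'Company Size Max']
print(parts)
```

Open-ended sizes keep their minimum and get NaN as the maximum, while the `-1` placeholder survives as a minimum of -1 and still has to be cleaned up later.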


### Company Names

The company rating appended to the company name has to go.

In [115]:
data['Company Name'].head(10)

0             Vera Institute of Justice\n3.2
1    Visiting Nurse Service of New York\n3.8
2                           Squarespace\n3.4
3                              Celerity\n4.1
4                               FanDuel\n3.9
5                               Point72\n3.9
6                             Two Sigma\n4.4
7               GNY Insurance Companies\n3.7
8                                  DMGT\n4.0
9                             Riskified\n4.4
Name: Company Name, dtype: object

Testing the removal of the rating and newline character with the magic of regex.

In [116]:
regex_pattern = r'(\n)[0-9.]{3}$'
data['Company Name'].str.replace(regex_pattern, '', regex=True)

0                Vera Institute of Justice
1       Visiting Nurse Service of New York
2                              Squarespace
3                                 Celerity
4                                  FanDuel
                       ...                
2248                         Avacend, Inc.
2249                     Arrow Electronics
2250                              Spiceorb
2251           Contingent Network Services
2252                            SCL Health
Name: Company Name, Length: 2253, dtype: object
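
The character class `[0-9.]{3}` accepts any three digits or dots after the newline. A slightly stricter pattern that insists on the digit-dot-digit shape of a rating could look like the sketch below. The name `strict_pattern` and the company "Area 51 Analytics" are made up for the example.

```python
import pandas as pd

names = pd.Series(['Squarespace\n3.4', 'Area 51 Analytics', '2U\n3.5'])

# Newline, one digit, a literal dot, one digit, end of string.
strict_pattern = r'\n\d\.\d$'

cleaned = names.str.replace(strict_pattern, '', regex=True)
print(cleaned.tolist())
```

Names without an appended rating pass through unchanged, including ones that contain digits themselves.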

That worked. Make it permanent.

In [117]:
data['Company Name'] = \
    data['Company Name'].str.replace(regex_pattern, '', regex=True)

data.head()

Unnamed: 0.1,Unnamed: 0,Job Title,Salary Estimate,Job Description,Rating,Company Name,Location,Headquarters,Size,Founded,...,Industry,Sector,Revenue,Competitors,Easy Apply,Experience,Salary Lower,Salary Upper,Company Size Min,Company Size Max
0,0,"Data Analyst, Center on Immigration and Justic...",$37K-$66K (Glassdoor est.),Are you eager to roll up your sleeves and harn...,3.2,Vera Institute of Justice,"New York, NY","New York, NY",201 to 500 employees,1961,...,Social Assistance,Non-Profit,$100 to $500 million (USD),-1,True,Regular,37000.0,66000,201.0,500.0
1,1,Quality Data Analyst,$37K-$66K (Glassdoor est.),Overview\n\nProvides analytical and technical ...,3.8,Visiting Nurse Service of New York,"New York, NY","New York, NY",10000+ employees,1893,...,Health Care Services & Hospitals,Health Care,$2 to $5 billion (USD),-1,-1,Regular,37000.0,66000,10000.0,
2,2,"Senior Data Analyst, Insights & Analytics Team...",$37K-$66K (Glassdoor est.),We’re looking for a Senior Data Analyst who ha...,3.4,Squarespace,"New York, NY","New York, NY",1001 to 5000 employees,2003,...,Internet,Information Technology,Unknown / Non-Applicable,GoDaddy,-1,Senior,37000.0,66000,1001.0,5000.0
3,3,Data Analyst,$37K-$66K (Glassdoor est.),Requisition NumberRR-0001939\nRemote:Yes\nWe c...,4.1,Celerity,"New York, NY","McLean, VA",201 to 500 employees,2002,...,IT Services,Information Technology,$50 to $100 million (USD),-1,-1,Regular,37000.0,66000,201.0,500.0
4,4,Reporting Data Analyst,$37K-$66K (Glassdoor est.),ABOUT FANDUEL GROUP\n\nFanDuel Group is a worl...,3.9,FanDuel,"New York, NY","New York, NY",501 to 1000 employees,2009,...,Sports & Recreation,"Arts, Entertainment & Recreation",$100 to $500 million (USD),DraftKings,True,Regular,37000.0,66000,501.0,1000.0


### Location latitude and longitude

Let's also add geocoding to translate the locations into latitude and longitude. This gives us more options for visualization down the road.

In [118]:
import requests

def get_geocode(address):
    geo_api = "https://nominatim.openstreetmap.org/search"
    params = {'q': address, 'format': 'jsonv2'}
    response = requests.get(geo_api, params=params).json()

    return {'lat': response[0]['lat'], 'lng': response[0]['lon']}

Let's test the OpenStreetMap Nominatim API on a small sample of the dataset and add two new columns, `Latitude` and `Longitude`.

In [119]:
test = data[['Location']].sample(10)

test[['Latitude', 'Longitude']] = \
    test[['Location']].apply(get_geocode, axis=1, result_type='expand')

test

Unnamed: 0,Location,Latitude,Longitude
413,"Whippany, NJ",40.8245442,-74.4170972
584,"Los Angeles, CA",34.0536909,-118.242766
856,"Deerfield, IL",42.1711365,-87.8445119
842,"Chicago, IL",41.8755616,-87.6244212
1520,"Palo Alto, CA",37.4443293,-122.1598465
1765,"Columbus, OH",39.9622601,-83.0007065
554,"Los Angeles, CA",34.0536909,-118.242766
1946,"San Francisco, CA",37.7790262,-122.419906
531,"Los Angeles, CA",34.0536909,-118.242766
923,"Houston, TX",29.7589382,-95.3676974


That worked just fine. Let's run it on the whole dataset. My apologies for bombarding the API :-( The resulting dataframe will be stored as a CSV file to prevent straining the API even further.
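
Many rows share the same location (the test sample lists Los Angeles three times), so one way to go easier on the API is to geocode each distinct location only once and pause between requests. As far as I know, the public Nominatim service asks for no more than one request per second. The following is a minimal sketch under those assumptions: `geocode_unique` and its `pause` parameter are hypothetical names, and `fake_geocode` is an offline stand-in with made-up coordinates that would be swapped for `get_geocode`.

```python
import time

import pandas as pd


def geocode_unique(locations, geocode, pause=1.0):
    """Geocode every distinct location once and map the result back to all rows.

    geocode_unique and pause are illustrative names, not part of the notebook.
    """
    cache = {}
    for place in locations.dropna().unique():
        cache[place] = geocode(place)  # one lookup per distinct place
        time.sleep(pause)              # wait between lookups

    # Turn {place: {'lat': ..., 'lng': ...}} into a lookup table indexed by place
    lookup = pd.DataFrame.from_dict(cache, orient='index')
    lookup = lookup.rename(columns={'lat': 'Latitude', 'lng': 'Longitude'})

    # Join the coordinates back onto the original rows (index is preserved)
    return locations.to_frame('Location').join(lookup, on='Location')


requested = []


def fake_geocode(place):
    """Offline stand-in for get_geocode; returns made-up coordinates."""
    requested.append(place)
    return {'lat': '0.0', 'lng': '0.0'}


locations = pd.Series(['Los Angeles, CA', 'Chicago, IL', 'Los Angeles, CA'])
coords = geocode_unique(locations, fake_geocode, pause=0)
print(len(requested))  # number of lookups actually made
```

With the real function, the call would be along the lines of `geocode_unique(data['Location'], get_geocode)`, and the number of requests drops from one per row to one per distinct city.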

In [120]:
data[['Latitude', 'Longitude']] = \
    data[['Location']].apply(get_geocode, axis=1, result_type='expand')

data.sample(10)

Unnamed: 0.1,Unnamed: 0,Job Title,Salary Estimate,Job Description,Rating,Company Name,Location,Headquarters,Size,Founded,...,Revenue,Competitors,Easy Apply,Experience,Salary Lower,Salary Upper,Company Size Min,Company Size Max,Latitude,Longitude
271,271,ETL Developer / Data Analyst,$84K-$90K (Glassdoor est.),Leading financial firm has a fulltime opening ...,5.0,Blue Rock Consulting,"New York, NY","Cranford, NJ",1 to 50 employees,-1,...,Unknown / Non-Applicable,-1,-1,Regular,84000.0,90000,1.0,50.0,40.7127281,-74.0060152
33,33,"Data Science Analyst, Capital Markets",$46K-$87K (Glassdoor est.),Who we are\n\nSoFi is a digital personal finan...,3.2,SoFi,"New York, NY","San Francisco, CA",1001 to 5000 employees,2011,...,Unknown / Non-Applicable,-1,-1,Regular,46000.0,87000,1001.0,5000.0,40.7127281,-74.0060152
780,780,Data Analyst,$67K-$92K (Glassdoor est.),Analyze and write complex sql queries in oracl...,4.0,Lorven Technologies Inc,"Chicago, IL","Plainsboro, NJ",1 to 50 employees,-1,...,Less than $1 million (USD),-1,-1,Regular,67000.0,92000,1.0,50.0,41.8755616,-87.6244212
758,758,Data Analyst,$73K-$82K (Glassdoor est.),Job Description\nWe are seeking a Data Analyst...,4.1,SkySource Solutions,"Downers Grove, IL","Berea, OH",1 to 50 employees,2017,...,$5 to $10 million (USD),-1,-1,Regular,73000.0,82000,1.0,50.0,41.7936822,-88.0102281
1342,1342,Business Data Analyst,$30K-$53K (Glassdoor est.),"Deep functional and technical understanding, a...",3.8,Diversant LLC,"Dallas, TX","Red Bank, NJ",1001 to 5000 employees,2005,...,$100 to $500 million (USD),"Kforce, Mitchell Martin, Insight Global",-1,Regular,30000.0,53000,1001.0,5000.0,32.7762719,-96.7968559
1494,1494,Data Analyst,$110K-$190K (Glassdoor est.),KAYGEN is an emerging leader in providing top ...,3.9,Kaygen Inc.,"San Jose, CA","Irvine, CA",1 to 50 employees,-1,...,$1 to $5 million (USD),-1,-1,Regular,110000.0,190000,1.0,50.0,37.3361905,-121.890583
1231,1231,Senior Data Analyst,$73K-$89K (Glassdoor est.),We have the below role open with our direct en...,3.3,Convene Technologies,"San Antonio, TX","Tampa, FL",51 to 200 employees,2005,...,Unknown / Non-Applicable,-1,-1,Senior,73000.0,89000,51.0,200.0,29.4246002,-98.4951405
3,3,Data Analyst,$37K-$66K (Glassdoor est.),Requisition NumberRR-0001939\nRemote:Yes\nWe c...,4.1,Celerity,"New York, NY","McLean, VA",201 to 500 employees,2002,...,$50 to $100 million (USD),-1,-1,Regular,37000.0,66000,201.0,500.0,40.7127281,-74.0060152
1762,1762,Data Analyst-Quality Improvement Services,$28K-$52K (Glassdoor est.),OverviewJOB POSTING: Data Analyst - Quality Im...,3.7,Nationwide Children's Hospital,"Columbus, OH","Columbus, OH",10000+ employees,1892,...,$1 to $2 billion (USD),-1,-1,Regular,28000.0,52000,10000.0,,39.9622601,-83.0007065
874,874,SAS Analyst / Data Analytics /Sr. information ...,$68K-$87K (Glassdoor est.),Job description\n\n***************************...,-1.0,Kbyte Systems LLC.,"Chicago, IL",-1,-1,-1,...,-1,-1,-1,Senior,68000.0,87000,-1.0,,41.8755616,-87.6244212


### Deleting Columns

Some columns have to go, both to drop irrelevant information and to get rid of some garbage data.

In [121]:
data = data.drop(columns=['Unnamed: 0', 'Easy Apply', 'Competitors',
                          'Headquarters', 'Founded', 'Type of ownership',
                          'Sector', 'Revenue', 'Industry', 'Salary Estimate', 'Size'])

In [122]:
data = data[['Job Title', 'Experience', 'Salary Lower', 'Salary Upper',
             'Job Description', 'Company Name', 'Rating', 'Location',
             'Latitude', 'Longitude', 'Company Size Min', 'Company Size Max']]

data.head(25)

Unnamed: 0,Job Title,Experience,Salary Lower,Salary Upper,Job Description,Company Name,Rating,Location,Latitude,Longitude,Company Size Min,Company Size Max
0,"Data Analyst, Center on Immigration and Justic...",Regular,37000.0,66000,Are you eager to roll up your sleeves and harn...,Vera Institute of Justice,3.2,"New York, NY",40.7127281,-74.0060152,201.0,500.0
1,Quality Data Analyst,Regular,37000.0,66000,Overview\n\nProvides analytical and technical ...,Visiting Nurse Service of New York,3.8,"New York, NY",40.7127281,-74.0060152,10000.0,
2,"Senior Data Analyst, Insights & Analytics Team...",Senior,37000.0,66000,We’re looking for a Senior Data Analyst who ha...,Squarespace,3.4,"New York, NY",40.7127281,-74.0060152,1001.0,5000.0
3,Data Analyst,Regular,37000.0,66000,Requisition NumberRR-0001939\nRemote:Yes\nWe c...,Celerity,4.1,"New York, NY",40.7127281,-74.0060152,201.0,500.0
4,Reporting Data Analyst,Regular,37000.0,66000,ABOUT FANDUEL GROUP\n\nFanDuel Group is a worl...,FanDuel,3.9,"New York, NY",40.7127281,-74.0060152,501.0,1000.0
5,Data Analyst,Regular,37000.0,66000,About Cubist\nCubist Systematic Strategies is ...,Point72,3.9,"New York, NY",40.7127281,-74.0060152,1001.0,5000.0
6,Business/Data Analyst (FP&A),Regular,37000.0,66000,Two Sigma is a different kind of investment ma...,Two Sigma,4.4,"New York, NY",40.7127281,-74.0060152,1001.0,5000.0
7,Data Science Analyst,Regular,37000.0,66000,Data Science Analyst\n\nJob Details\nLevel\nEx...,GNY Insurance Companies,3.7,"New York, NY",40.7127281,-74.0060152,201.0,500.0
8,Data Analyst,Regular,37000.0,66000,The Data Analyst is an integral member of the ...,DMGT,4.0,"New York, NY",40.7127281,-74.0060152,5001.0,10000.0
9,"Data Analyst, Merchant Health",Regular,37000.0,66000,About Us\n\nRiskified is the AI platform power...,Riskified,4.4,"New York, NY",40.7127281,-74.0060152,501.0,1000.0


### Garbage Values

Now the garbage has to go or be replaced in a meaningful way. Checking for null data first.

In [123]:
data.isnull().sum()

Job Title             0
Experience            0
Salary Lower          1
Salary Upper          0
Job Description       0
Company Name          1
Rating                0
Location              0
Latitude              0
Longitude             0
Company Size Min     42
Company Size Max    580
dtype: int64

In [124]:
data[data.isin([-1]).any(axis=1)]

Unnamed: 0,Job Title,Experience,Salary Lower,Salary Upper,Job Description,Company Name,Rating,Location,Latitude,Longitude,Company Size Min,Company Size Max
11,Data Analyst,Regular,37000.0,66000,BulbHead is currently seeking a Data Analyst t...,BulbHead,-1.0,"Fairfield, NJ",40.2050878,-74.2135004,1.0,50.0
21,Data Science Analyst,Regular,37000.0,66000,"Job Description\nOur client, a music streaming...",MUSIC & Entertainment,-1.0,"New York, NY",40.7127281,-74.0060152,,
34,Data Analyst (Games),Regular,46000.0,87000,Carry1st is the leading mobile game publisher ...,Carry1st,-1.0,"New York, NY",40.7127281,-74.0060152,-1.0,
36,Data Business Analyst,Regular,46000.0,87000,"At Clear Street, we are disrupting the institu...",Clear Street,-1.0,"New York, NY",40.7127281,-74.0060152,51.0,200.0
40,"Business Analyst, Data Platforms",Regular,46000.0,87000,Company Description\n\nPinto is building the w...,Pinto,-1.0,"New York, NY",40.7127281,-74.0060152,1.0,50.0
...,...,...,...,...,...,...,...,...,...,...,...,...
2200,Data Analyst,Regular,49000.0,91000,Role Data Analyst Duration12+ months Location ...,"TechAspect Solutions, Inc. dba TA Digital",-1.0,"Centennial, CO",39.579155,-104.8769227,-1.0,
2202,Financial Data Analyst,Regular,49000.0,91000,Position:Financial Data AnalystJob Description...,Black Knight Financial Technology Solutions,-1.0,"Denver, CO",39.7392364,-104.9848623,-1.0,
2239,Senior Contract Data Analyst,Senior,78000.0,104000,OverviewAmyx is seeking to hire a Senior Contr...,"Amyx, Iinc.",-1.0,"Aurora, CO",41.7571701,-88.3147539,-1.0,
2246,"Technical Business Analyst (SQL, Data analytic...",Regular,78000.0,104000,Spiceorb is looking for Technical Business Ana...,Spiceorb,-1.0,"Denver, CO",39.7392364,-104.9848623,-1.0,


In [125]:
data[data.isin(['-1']).any(axis=1)]

Unnamed: 0,Job Title,Experience,Salary Lower,Salary Upper,Job Description,Company Name,Rating,Location,Latitude,Longitude,Company Size Min,Company Size Max


In [126]:
data[data.isin(['Unknown']).any(axis=1)]

Unnamed: 0,Job Title,Experience,Salary Lower,Salary Upper,Job Description,Company Name,Rating,Location,Latitude,Longitude,Company Size Min,Company Size Max


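The numeric -1 occurs in `Rating` and `Company Size Min`. The string checks come back empty, so replacing `'-1'` and `'Unknown'` as well is only a safeguard. In the next step they are all replaced with numpy NaN.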

In [127]:
data = data\
    .replace(-1, np.nan)\
    .replace('-1', np.nan)\
    .replace('Unknown', np.nan)
data

Unnamed: 0,Job Title,Experience,Salary Lower,Salary Upper,Job Description,Company Name,Rating,Location,Latitude,Longitude,Company Size Min,Company Size Max
0,"Data Analyst, Center on Immigration and Justic...",Regular,37000.0,66000,Are you eager to roll up your sleeves and harn...,Vera Institute of Justice,3.2,"New York, NY",40.7127281,-74.0060152,201.0,500.0
1,Quality Data Analyst,Regular,37000.0,66000,Overview\n\nProvides analytical and technical ...,Visiting Nurse Service of New York,3.8,"New York, NY",40.7127281,-74.0060152,10000.0,
2,"Senior Data Analyst, Insights & Analytics Team...",Senior,37000.0,66000,We’re looking for a Senior Data Analyst who ha...,Squarespace,3.4,"New York, NY",40.7127281,-74.0060152,1001.0,5000.0
3,Data Analyst,Regular,37000.0,66000,Requisition NumberRR-0001939\nRemote:Yes\nWe c...,Celerity,4.1,"New York, NY",40.7127281,-74.0060152,201.0,500.0
4,Reporting Data Analyst,Regular,37000.0,66000,ABOUT FANDUEL GROUP\n\nFanDuel Group is a worl...,FanDuel,3.9,"New York, NY",40.7127281,-74.0060152,501.0,1000.0
...,...,...,...,...,...,...,...,...,...,...,...,...
2248,RQS - IHHA - 201900004460 -1q Data Security An...,Regular,78000.0,104000,Maintains systems to protect data from unautho...,"Avacend, Inc.",2.5,"Denver, CO",39.7392364,-104.9848623,51.0,200.0
2249,Senior Data Analyst (Corporate Audit),Senior,78000.0,104000,Position:\nSenior Data Analyst (Corporate Audi...,Arrow Electronics,2.9,"Centennial, CO",39.579155,-104.8769227,10000.0,
2250,"Technical Business Analyst (SQL, Data analytic...",Regular,78000.0,104000,"Title: Technical Business Analyst (SQL, Data a...",Spiceorb,,"Denver, CO",39.7392364,-104.9848623,,
2251,"Data Analyst 3, Customer Experience",Regular,78000.0,104000,Summary\n\nResponsible for working cross-funct...,Contingent Network Services,3.1,"Centennial, CO",39.579155,-104.8769227,201.0,500.0


Let's check the `Company Name` and `Job Description` for very short entries just in case.

In [128]:
data[data['Company Name'].str.len() < 3]

Unnamed: 0,Job Title,Experience,Salary Lower,Salary Upper,Job Description,Company Name,Rating,Location,Latitude,Longitude,Company Size Min,Company Size Max
27,"Data Analyst, Product",Regular,37000.0,66000,"About Ro\nFounded in 2017, Ro is a patient-dri...",Ro,4.8,"New York, NY",40.7127281,-74.0060152,51.0,200.0
60,"Data Analyst, Revenue Analytics",Regular,51000.0,88000,"About Ro\nFounded in 2017, Ro is a patient-dri...",Ro,4.8,"New York, NY",40.7127281,-74.0060152,51.0,200.0
235,"Senior Analyst, AB Testing and Data Operations",Senior,41000.0,78000,"What We're Looking For:\n\nThe Senior Analyst,...",2U,3.5,"New York, NY",40.7127281,-74.0060152,1001.0,5000.0
792,"Data Analyst, Tax (Affordable Care Act) (ACA) ...",Regular,67000.0,92000,"Data Analyst, Tax (Affordable\nCare Act) (ACA)...",EY,3.8,"Chicago, IL",41.8755616,-87.6244212,10000.0,
1001,Master Data Operation Analyst,Senior,46000.0,102000,PK currently has exciting opportunity for a Ma...,PK,3.6,"Phoenix, AZ",33.4484367,-112.0741417,1001.0,5000.0
1091,"Data Analyst, Data & Analytics (Advanced Analy...",Regular,41000.0,78000,"Data\nAnalyst, Data & Analytics (Advanced Anal...",EY,3.8,"Philadelphia, PA",39.9527237,-75.1635262,10000.0,
1276,"Data Analyst 3 (San Diego or Atlanta, GA)",Regular,76000.0,122000,Job Description Summary\nJob Description\n\n\n...,BD,3.6,"San Diego, CA",32.7174202,-117.1627728,10000.0,
1352,Data Analyst,Regular,30000.0,53000,"Job Description\nETL, SQL Queries, Data Modeli...",1,,"Dallas, TX",32.7762719,-96.7968559,,
1946,"Data Analyst, Data & Analytics (Advanced Analy...",Regular,93000.0,159000,"Data Analyst, Data & Analytics (Advanced\nAnal...",EY,3.8,"San Francisco, CA",37.7790262,-122.419906,10000.0,


In [129]:
data[data['Job Description'].str.len() < 10]

Unnamed: 0,Job Title,Experience,Salary Lower,Salary Upper,Job Description,Company Name,Rating,Location,Latitude,Longitude,Company Size Min,Company Size Max
912,Data Expert Analyst/Modeler,Regular,29000.0,38000,Â\nÂ\nÂ,InvenTech Info,4.8,"Houston, TX",29.7589382,-95.3676974,201.0,500.0


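Fortunately there aren't many. Only the single-character company name `1` and the one unreadable job description are garbage; two-letter names such as `EY` or `2U` are real companies. Replace the garbage with numpy NaN and check again.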

In [130]:
data.loc[data['Job Description'].str.len() < 10, 'Job Description'] = np.nan
data.loc[data['Company Name'].str.len() < 2, 'Company Name'] = np.nan

In [131]:
data[data['Job Description'].str.len() < 10]

Unnamed: 0,Job Title,Experience,Salary Lower,Salary Upper,Job Description,Company Name,Rating,Location,Latitude,Longitude,Company Size Min,Company Size Max


In [132]:
data[data['Company Name'].str.len() < 3]

Unnamed: 0,Job Title,Experience,Salary Lower,Salary Upper,Job Description,Company Name,Rating,Location,Latitude,Longitude,Company Size Min,Company Size Max
27,"Data Analyst, Product",Regular,37000.0,66000,"About Ro\nFounded in 2017, Ro is a patient-dri...",Ro,4.8,"New York, NY",40.7127281,-74.0060152,51.0,200.0
60,"Data Analyst, Revenue Analytics",Regular,51000.0,88000,"About Ro\nFounded in 2017, Ro is a patient-dri...",Ro,4.8,"New York, NY",40.7127281,-74.0060152,51.0,200.0
235,"Senior Analyst, AB Testing and Data Operations",Senior,41000.0,78000,"What We're Looking For:\n\nThe Senior Analyst,...",2U,3.5,"New York, NY",40.7127281,-74.0060152,1001.0,5000.0
792,"Data Analyst, Tax (Affordable Care Act) (ACA) ...",Regular,67000.0,92000,"Data Analyst, Tax (Affordable\nCare Act) (ACA)...",EY,3.8,"Chicago, IL",41.8755616,-87.6244212,10000.0,
1001,Master Data Operation Analyst,Senior,46000.0,102000,PK currently has exciting opportunity for a Ma...,PK,3.6,"Phoenix, AZ",33.4484367,-112.0741417,1001.0,5000.0
1091,"Data Analyst, Data & Analytics (Advanced Analy...",Regular,41000.0,78000,"Data\nAnalyst, Data & Analytics (Advanced Anal...",EY,3.8,"Philadelphia, PA",39.9527237,-75.1635262,10000.0,
1276,"Data Analyst 3 (San Diego or Atlanta, GA)",Regular,76000.0,122000,Job Description Summary\nJob Description\n\n\n...,BD,3.6,"San Diego, CA",32.7174202,-117.1627728,10000.0,
1946,"Data Analyst, Data & Analytics (Advanced Analy...",Regular,93000.0,159000,"Data Analyst, Data & Analytics (Advanced\nAnal...",EY,3.8,"San Francisco, CA",37.7790262,-122.419906,10000.0,


This is much better.

In [133]:
data.isnull().sum()

Job Title             0
Experience            0
Salary Lower          1
Salary Upper          0
Job Description       1
Company Name          2
Rating              272
Location              0
Latitude              0
Longitude             0
Company Size Min    205
Company Size Max    580
dtype: int64

### Dealing with remaining garbage and missing values

In [134]:
data.isna().sum()

Job Title             0
Experience            0
Salary Lower          1
Salary Upper          0
Job Description       1
Company Name          2
Rating              272
Location              0
Latitude              0
Longitude             0
Company Size Min    205
Company Size Max    580
dtype: int64

In [135]:
data[data.isna().any(axis=1)]

Unnamed: 0,Job Title,Experience,Salary Lower,Salary Upper,Job Description,Company Name,Rating,Location,Latitude,Longitude,Company Size Min,Company Size Max
1,Quality Data Analyst,Regular,37000.0,66000,Overview\n\nProvides analytical and technical ...,Visiting Nurse Service of New York,3.8,"New York, NY",40.7127281,-74.0060152,10000.0,
10,Data Analyst,Regular,37000.0,66000,NYU Grossman School of Medicine is one of the ...,NYU Langone Health,4.0,"New York, NY",40.7127281,-74.0060152,10000.0,
11,Data Analyst,Regular,37000.0,66000,BulbHead is currently seeking a Data Analyst t...,BulbHead,,"Fairfield, NJ",40.2050878,-74.2135004,1.0,50.0
12,DATA ANALYST,Regular,37000.0,66000,Job Summary:\n\nThe Clinical Research Data Ana...,Montefiore Medical,3.7,"New York, NY",40.7127281,-74.0060152,10000.0,
20,"Product Analyst, Data Science",Regular,37000.0,66000,Note: By applying to this position your applic...,Google,4.4,"New York, NY",40.7127281,-74.0060152,10000.0,
...,...,...,...,...,...,...,...,...,...,...,...,...
2243,Data Analyst-(Remote- All across,Regular,78000.0,104000,About CenturyLink\n\nCenturyLink (NYSE: CTL) i...,CenturyLink,3.0,"Broomfield, CO",39.9203827,-105.0691464,10000.0,
2246,"Technical Business Analyst (SQL, Data analytic...",Regular,78000.0,104000,Spiceorb is looking for Technical Business Ana...,Spiceorb,,"Denver, CO",39.7392364,-104.9848623,,
2249,Senior Data Analyst (Corporate Audit),Senior,78000.0,104000,Position:\nSenior Data Analyst (Corporate Audi...,Arrow Electronics,2.9,"Centennial, CO",39.579155,-104.8769227,10000.0,
2250,"Technical Business Analyst (SQL, Data analytic...",Regular,78000.0,104000,"Title: Technical Business Analyst (SQL, Data a...",Spiceorb,,"Denver, CO",39.7392364,-104.9848623,,


We're now left with ~700 rows containing missing values. We could just delete those rows, but the dataset is small, so let's impute some data instead.

Let's use the mean values for `Rating`, `Company Size Min` and `Company Size Max`.
Afterwards, delete the remaining rows with null values in `Salary Lower`, `Job Description` and `Company Name`, because that's just 4 entries.
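
As a rough sketch of that plan on a made-up three-row frame (the values are invented, only the column names match the dataset): fill the three numeric columns with their column means, then drop whatever rows still miss one of the other three fields.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Rating': [4.0, np.nan, 3.0],
    'Company Size Min': [1.0, 51.0, np.nan],
    'Company Size Max': [50.0, np.nan, 200.0],
    'Salary Lower': [37000.0, np.nan, 46000.0],
    'Job Description': ['text', 'text', 'text'],
    'Company Name': ['A', 'B', 'C'],
})

# fillna() with a Series of means fills each column with its own mean.
impute_cols = ['Rating', 'Company Size Min', 'Company Size Max']
toy[impute_cols] = toy[impute_cols].fillna(toy[impute_cols].mean())

# Rows that still lack one of these fields are removed.
toy = toy.dropna(subset=['Salary Lower', 'Job Description', 'Company Name'])
print(toy)
```

The mean of a size bound is a crude stand-in, so results that lean on the imputed company sizes deserve some caution.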

#### Rating

Show the rows with a missing rating and calculate the mean. Afterwards, replace the missing ratings with the mean rating.

In [136]:
null_rating = data[data['Rating'].isnull()]
null_rating

Unnamed: 0,Job Title,Experience,Salary Lower,Salary Upper,Job Description,Company Name,Rating,Location,Latitude,Longitude,Company Size Min,Company Size Max
11,Data Analyst,Regular,37000.0,66000,BulbHead is currently seeking a Data Analyst t...,BulbHead,,"Fairfield, NJ",40.2050878,-74.2135004,1.0,50.0
21,Data Science Analyst,Regular,37000.0,66000,"Job Description\nOur client, a music streaming...",MUSIC & Entertainment,,"New York, NY",40.7127281,-74.0060152,,
34,Data Analyst (Games),Regular,46000.0,87000,Carry1st is the leading mobile game publisher ...,Carry1st,,"New York, NY",40.7127281,-74.0060152,,
36,Data Business Analyst,Regular,46000.0,87000,"At Clear Street, we are disrupting the institu...",Clear Street,,"New York, NY",40.7127281,-74.0060152,51.0,200.0
40,"Business Analyst, Data Platforms",Regular,46000.0,87000,Company Description\n\nPinto is building the w...,Pinto,,"New York, NY",40.7127281,-74.0060152,1.0,50.0
...,...,...,...,...,...,...,...,...,...,...,...,...
2200,Data Analyst,Regular,49000.0,91000,Role Data Analyst Duration12+ months Location ...,"TechAspect Solutions, Inc. dba TA Digital",,"Centennial, CO",39.579155,-104.8769227,,
2202,Financial Data Analyst,Regular,49000.0,91000,Position:Financial Data AnalystJob Description...,Black Knight Financial Technology Solutions,,"Denver, CO",39.7392364,-104.9848623,,
2239,Senior Contract Data Analyst,Senior,78000.0,104000,OverviewAmyx is seeking to hire a Senior Contr...,"Amyx, Iinc.",,"Aurora, CO",41.7571701,-88.3147539,,
2246,"Technical Business Analyst (SQL, Data analytic...",Regular,78000.0,104000,Spiceorb is looking for Technical Business Ana...,Spiceorb,,"Denver, CO",39.7392364,-104.9848623,,


In [137]:
mean_rating = data['Rating'].mean().round(1)
mean_rating

3.7

In [138]:
data['Rating'] = data['Rating'].replace(np.nan, mean_rating)

In [139]:
fixed_rating = data[(data['Rating'] == mean_rating)]
fixed_rating

Unnamed: 0,Job Title,Experience,Salary Lower,Salary Upper,Job Description,Company Name,Rating,Location,Latitude,Longitude,Company Size Min,Company Size Max
7,Data Science Analyst,Regular,37000.0,66000,Data Science Analyst\n\nJob Details\nLevel\nEx...,GNY Insurance Companies,3.7,"New York, NY",40.7127281,-74.0060152,201.0,500.0
11,Data Analyst,Regular,37000.0,66000,BulbHead is currently seeking a Data Analyst t...,BulbHead,3.7,"Fairfield, NJ",40.2050878,-74.2135004,1.0,50.0
12,DATA ANALYST,Regular,37000.0,66000,Job Summary:\n\nThe Clinical Research Data Ana...,Montefiore Medical,3.7,"New York, NY",40.7127281,-74.0060152,10000.0,
21,Data Science Analyst,Regular,37000.0,66000,"Job Description\nOur client, a music streaming...",MUSIC & Entertainment,3.7,"New York, NY",40.7127281,-74.0060152,,
28,Data Analyst Entry Level,Junior,37000.0,66000,Type: Paid Intern (in a farm team)\n\nFunction...,Endai,3.7,"New York, NY",40.7127281,-74.0060152,1.0,50.0
...,...,...,...,...,...,...,...,...,...,...,...,...
2212,Data Base Programmer/Analyst,Regular,57000.0,100000,Job Title: Data Base Programmer/Analyst\nLocat...,22nd Century Technologies,3.7,"Denver, CO",39.7392364,-104.9848623,1001.0,5000.0
2225,Sr Data Analyst,Senior,57000.0,100000,We work to solve deep technical problems that ...,Global Healthcare Exchange,3.7,"Louisville, CO",38.2542376,-85.759407,501.0,1000.0
2239,Senior Contract Data Analyst,Senior,78000.0,104000,OverviewAmyx is seeking to hire a Senior Contr...,"Amyx, Iinc.",3.7,"Aurora, CO",41.7571701,-88.3147539,,
2246,"Technical Business Analyst (SQL, Data analytic...",Regular,78000.0,104000,Spiceorb is looking for Technical Business Ana...,Spiceorb,3.7,"Denver, CO",39.7392364,-104.9848623,,
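
Note that filtering on `Rating == mean_rating` also returns rows that already had a rating of 3.7 before the fill (row 7, for example). If the imputed rows need to be told apart later, one option is to record a boolean mask before filling. This is a minimal sketch on hypothetical toy values; `was_missing` is a name introduced here, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings: one genuine 3.7 and two gaps.
toy = pd.DataFrame({'Rating': [3.7, np.nan, 3.5, np.nan, 3.9]})

# Remember which rows were missing before anything is filled.
was_missing = toy['Rating'].isna()

# mean() skips NaN by default, so only the three known ratings are averaged.
mean_rating = round(toy['Rating'].mean(), 1)
toy['Rating'] = toy['Rating'].fillna(mean_rating)

# Comparing against the mean picks up the genuine 3.7 as well ...
matches_mean = toy['Rating'] == mean_rating
# ... while the recorded mask selects only the imputed rows.
imputed_rows = toy[was_missing]

print(int(matches_mean.sum()), len(imputed_rows))
```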


#### Company Size

1. Show the rows with a missing `Company Size Min` or `Company Size Max` value
2. Show only the rows that have a `Company Size Min` but no `Company Size Max` entry

In [140]:
data[(data['Company Size Min'].isna()) | (data['Company Size Max'].isna())]

Unnamed: 0,Job Title,Experience,Salary Lower,Salary Upper,Job Description,Company Name,Rating,Location,Latitude,Longitude,Company Size Min,Company Size Max
1,Quality Data Analyst,Regular,37000.0,66000,Overview\n\nProvides analytical and technical ...,Visiting Nurse Service of New York,3.8,"New York, NY",40.7127281,-74.0060152,10000.0,
10,Data Analyst,Regular,37000.0,66000,NYU Grossman School of Medicine is one of the ...,NYU Langone Health,4.0,"New York, NY",40.7127281,-74.0060152,10000.0,
12,DATA ANALYST,Regular,37000.0,66000,Job Summary:\n\nThe Clinical Research Data Ana...,Montefiore Medical,3.7,"New York, NY",40.7127281,-74.0060152,10000.0,
20,"Product Analyst, Data Science",Regular,37000.0,66000,Note: By applying to this position your applic...,Google,4.4,"New York, NY",40.7127281,-74.0060152,10000.0,
21,Data Science Analyst,Regular,37000.0,66000,"Job Description\nOur client, a music streaming...",MUSIC & Entertainment,3.7,"New York, NY",40.7127281,-74.0060152,,
...,...,...,...,...,...,...,...,...,...,...,...,...
2243,Data Analyst-(Remote- All across,Regular,78000.0,104000,About CenturyLink\n\nCenturyLink (NYSE: CTL) i...,CenturyLink,3.0,"Broomfield, CO",39.9203827,-105.0691464,10000.0,
2246,"Technical Business Analyst (SQL, Data analytic...",Regular,78000.0,104000,Spiceorb is looking for Technical Business Ana...,Spiceorb,3.7,"Denver, CO",39.7392364,-104.9848623,,
2249,Senior Data Analyst (Corporate Audit),Senior,78000.0,104000,Position:\nSenior Data Analyst (Corporate Audi...,Arrow Electronics,2.9,"Centennial, CO",39.579155,-104.8769227,10000.0,
2250,"Technical Business Analyst (SQL, Data analytic...",Regular,78000.0,104000,"Title: Technical Business Analyst (SQL, Data a...",Spiceorb,3.7,"Denver, CO",39.7392364,-104.9848623,,


In [141]:
data[(data['Company Size Max'].isna()) & (~data['Company Size Min'].isna())]

Unnamed: 0,Job Title,Experience,Salary Lower,Salary Upper,Job Description,Company Name,Rating,Location,Latitude,Longitude,Company Size Min,Company Size Max
1,Quality Data Analyst,Regular,37000.0,66000,Overview\n\nProvides analytical and technical ...,Visiting Nurse Service of New York,3.8,"New York, NY",40.7127281,-74.0060152,10000.0,
10,Data Analyst,Regular,37000.0,66000,NYU Grossman School of Medicine is one of the ...,NYU Langone Health,4.0,"New York, NY",40.7127281,-74.0060152,10000.0,
12,DATA ANALYST,Regular,37000.0,66000,Job Summary:\n\nThe Clinical Research Data Ana...,Montefiore Medical,3.7,"New York, NY",40.7127281,-74.0060152,10000.0,
20,"Product Analyst, Data Science",Regular,37000.0,66000,Note: By applying to this position your applic...,Google,4.4,"New York, NY",40.7127281,-74.0060152,10000.0,
22,Data Analyst - Intex Developer,Regular,37000.0,66000,Data Analyst - Intex Developer\n\n\nNew York\n...,Macquarie Group,3.3,"New York, NY",40.7127281,-74.0060152,10000.0,
...,...,...,...,...,...,...,...,...,...,...,...,...
2233,Configuration & Data Management Analyst,Regular,57000.0,100000,Description:The coolest jobs on this planet or...,Lockheed Martin,3.8,"Littleton, CO",39.613321,-105.016649,10000.0,
2234,"Data Analyst 3, Customer Experience - Centennial",Regular,57000.0,100000,Business Unit: Summary Responsible for working...,Comcast,3.6,"Englewood, CO",39.6482059,-104.9879641,10000.0,
2243,Data Analyst-(Remote- All across,Regular,78000.0,104000,About CenturyLink\n\nCenturyLink (NYSE: CTL) i...,CenturyLink,3.0,"Broomfield, CO",39.9203827,-105.0691464,10000.0,
2249,Senior Data Analyst (Corporate Audit),Senior,78000.0,104000,Position:\nSenior Data Analyst (Corporate Audi...,Arrow Electronics,2.9,"Centennial, CO",39.579155,-104.8769227,10000.0,


In [142]:
mean_min_company_size = data['Company Size Min'].mean().round(0)
mean_max_company_size = data['Company Size Max'].mean().round(0)

print(f'Mean Min Company Size: {mean_min_company_size}\nMean Max Company Size: {mean_max_company_size}')

Mean Min Company Size: 2325.0
Mean Max Company Size: 1881.0


The mean minimum company size is larger than the mean maximum company size, which makes no sense for a range. Why is that the case?

In [143]:
data[['Company Size Min']].value_counts().sort_values()

Company Size Min
5001.0               97
501.0               211
201.0               249
1.0                 347
1001.0              348
10000.0             375
51.0                421
dtype: int64

In [144]:
data[['Company Size Max']].value_counts().sort_values()

Company Size Max
10000.0              97
1000.0              211
500.0               249
50.0                347
5000.0              348
200.0               421
dtype: int64

The min company size of 10000 has no counterpart in the max column, so it raises only the min mean significantly. Let's take it out and see what happens.

In [145]:
data[data['Company Size Min'] < 10000]['Company Size Min'].mean().round(0) # Should be around 600

604.0
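
The effect can be reproduced on a small scale. In the value counts above, `Company Size Min` contains 10000.0 but `Company Size Max` has no matching entry, and `mean()` skips missing values by default. The sketch below uses a hypothetical four-row frame to show how that pushes the min mean above the max mean.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: two "10000+" rows have a min but no max.
toy = pd.DataFrame({
    'Company Size Min': [1.0, 51.0, 10000.0, 10000.0],
    'Company Size Max': [50.0, 200.0, np.nan, np.nan],
})

mean_min = toy['Company Size Min'].mean()  # averages all four rows
mean_max = toy['Company Size Max'].mean()  # NaN rows are skipped, two rows remain

# Excluding the open-ended rows restores the expected ordering.
mean_min_bounded = toy.loc[toy['Company Size Min'] < 10000, 'Company Size Min'].mean()

print(mean_min, mean_max, mean_min_bounded)
```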

Let's also impute the missing max company size values for those 10000+ employee companies.
Just use the same value for min and max, because there is no further information on how big these companies are.

In [146]:
data['Company Size Max'] = data['Company Size Max'].mask(pd.isnull, data['Company Size Min'])
data

Unnamed: 0,Job Title,Experience,Salary Lower,Salary Upper,Job Description,Company Name,Rating,Location,Latitude,Longitude,Company Size Min,Company Size Max
0,"Data Analyst, Center on Immigration and Justic...",Regular,37000.0,66000,Are you eager to roll up your sleeves and harn...,Vera Institute of Justice,3.2,"New York, NY",40.7127281,-74.0060152,201.0,500.0
1,Quality Data Analyst,Regular,37000.0,66000,Overview\n\nProvides analytical and technical ...,Visiting Nurse Service of New York,3.8,"New York, NY",40.7127281,-74.0060152,10000.0,10000.0
2,"Senior Data Analyst, Insights & Analytics Team...",Senior,37000.0,66000,We’re looking for a Senior Data Analyst who ha...,Squarespace,3.4,"New York, NY",40.7127281,-74.0060152,1001.0,5000.0
3,Data Analyst,Regular,37000.0,66000,Requisition NumberRR-0001939\nRemote:Yes\nWe c...,Celerity,4.1,"New York, NY",40.7127281,-74.0060152,201.0,500.0
4,Reporting Data Analyst,Regular,37000.0,66000,ABOUT FANDUEL GROUP\n\nFanDuel Group is a worl...,FanDuel,3.9,"New York, NY",40.7127281,-74.0060152,501.0,1000.0
...,...,...,...,...,...,...,...,...,...,...,...,...
2248,RQS - IHHA - 201900004460 -1q Data Security An...,Regular,78000.0,104000,Maintains systems to protect data from unautho...,"Avacend, Inc.",2.5,"Denver, CO",39.7392364,-104.9848623,51.0,200.0
2249,Senior Data Analyst (Corporate Audit),Senior,78000.0,104000,Position:\nSenior Data Analyst (Corporate Audi...,Arrow Electronics,2.9,"Centennial, CO",39.579155,-104.8769227,10000.0,10000.0
2250,"Technical Business Analyst (SQL, Data analytic...",Regular,78000.0,104000,"Title: Technical Business Analyst (SQL, Data a...",Spiceorb,3.7,"Denver, CO",39.7392364,-104.9848623,,
2251,"Data Analyst 3, Customer Experience",Regular,78000.0,104000,Summary\n\nResponsible for working cross-funct...,Contingent Network Services,3.1,"Centennial, CO",39.579155,-104.8769227,201.0,500.0
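
The `mask(pd.isnull, ...)` call replaces each missing max with the min from the same row. Rows where both values are missing (row 2250, for example) stay empty. As a minimal sketch on hypothetical values, the same result can be obtained with `fillna` and a Series argument, which aligns on the index.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one complete row, one "10000+" row, one fully empty row.
toy = pd.DataFrame({
    'Company Size Min': [201.0, 10000.0, np.nan],
    'Company Size Max': [500.0, np.nan, np.nan],
})

# Variant used in the notebook: mask with a callable condition.
via_mask = toy['Company Size Max'].mask(pd.isnull, toy['Company Size Min'])

# Equivalent variant: fill gaps from another Series, row by row.
via_fillna = toy['Company Size Max'].fillna(toy['Company Size Min'])

print(via_fillna.tolist())
print(via_mask.equals(via_fillna))
```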


What are the means after this operation?

In [147]:
mean_min_company_size = data['Company Size Min'].mean().round(0)
mean_max_company_size = data['Company Size Max'].mean().round(0)

mean_min_company_size, mean_max_company_size

(2325.0, 3368.0)

All right, min is now less than max, but the mean values seem quite high for a company size. Let's continue under the assumption that this is what we want for now.

In [148]:
data['Company Size Min'] = data['Company Size Min'].replace(np.nan, mean_min_company_size)
data['Company Size Max'] = data['Company Size Max'].replace(np.nan, mean_max_company_size)
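
The cell above fills the gaps with the mean. As a point of comparison only, the value counts shown earlier can be used to rebuild the `Company Size Min` distribution and compare its mean with its median. This is a sketch of an alternative statistic, not what this notebook does; the name `sizes` is introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Value counts for Company Size Min, copied from the output of In [143].
counts = {1.0: 347, 51.0: 421, 201.0: 249, 501.0: 211,
          1001.0: 348, 5001.0: 97, 10000.0: 375}

# Rebuild the column by repeating each size as often as it occurs.
sizes = pd.Series(np.repeat(list(counts.keys()), list(counts.values())))

mean_size = round(sizes.mean(), 0)  # pulled upwards by the 10000+ companies
median_size = sizes.median()        # less sensitive to the large values

print(len(sizes), mean_size, median_size)
```

The mean reproduces the 2325.0 computed above, while the median lands on a much smaller bucket, which illustrates why the mean looks high for a typical company.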

In [149]:
data.isna().sum()

Job Title           0
Experience          0
Salary Lower        1
Salary Upper        0
Job Description     1
Company Name        2
Rating              0
Location            0
Latitude            0
Longitude           0
Company Size Min    0
Company Size Max    0
dtype: int64

### Finishing Touches

Because there are just a few rows with null values left, we can simply delete them to get clean data.

In [150]:
data = data.dropna()
data = data.convert_dtypes()
data.isna().sum()

Job Title           0
Experience          0
Salary Lower        0
Salary Upper        0
Job Description     0
Company Name        0
Rating              0
Location            0
Latitude            0
Longitude           0
Company Size Min    0
Company Size Max    0
dtype: int64
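
Besides dropping the last incomplete rows, `convert_dtypes()` changes the column types, which is why values such as `37000.0` appear as `37000` afterwards. A minimal sketch on hypothetical toy values:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the float column holds whole numbers plus one gap.
toy = pd.DataFrame({
    'Company Name': ['A', 'B', None, 'D'],
    'Salary Lower': [37000.0, 46000.0, 57000.0, np.nan],
    'Rating': [3.2, 3.8, 4.1, 3.7],
})

# Drop rows with any remaining null, then let pandas pick nullable dtypes.
cleaned = toy.dropna().convert_dtypes()

# Whole-number floats become the nullable Int64 dtype.
print(cleaned.dtypes)
```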

Let's also reorder the table to see job-related information first and company information second.

In [151]:
data = data[['Job Title', 'Experience', 'Salary Lower', 'Salary Upper',
             'Job Description', 'Company Name', 'Rating', 'Location',
             'Latitude', 'Longitude', 'Company Size Min', 'Company Size Max']]

data.head(25)

Unnamed: 0,Job Title,Experience,Salary Lower,Salary Upper,Job Description,Company Name,Rating,Location,Latitude,Longitude,Company Size Min,Company Size Max
0,"Data Analyst, Center on Immigration and Justic...",Regular,37000,66000,Are you eager to roll up your sleeves and harn...,Vera Institute of Justice,3.2,"New York, NY",40.7127281,-74.0060152,201,500
1,Quality Data Analyst,Regular,37000,66000,Overview Provides analytical and technical su...,Visiting Nurse Service of New York,3.8,"New York, NY",40.7127281,-74.0060152,10000,10000
2,"Senior Data Analyst, Insights & Analytics Team...",Senior,37000,66000,We’re looking for a Senior Data Analyst who ha...,Squarespace,3.4,"New York, NY",40.7127281,-74.0060152,1001,5000
3,Data Analyst,Regular,37000,66000,Requisition NumberRR-0001939 Remote:Yes We col...,Celerity,4.1,"New York, NY",40.7127281,-74.0060152,201,500
4,Reporting Data Analyst,Regular,37000,66000,ABOUT FANDUEL GROUP FanDuel Group is a world-...,FanDuel,3.9,"New York, NY",40.7127281,-74.0060152,501,1000
5,Data Analyst,Regular,37000,66000,About Cubist Cubist Systematic Strategies is o...,Point72,3.9,"New York, NY",40.7127281,-74.0060152,1001,5000
6,Business/Data Analyst (FP&A),Regular,37000,66000,Two Sigma is a different kind of investment ma...,Two Sigma,4.4,"New York, NY",40.7127281,-74.0060152,1001,5000
7,Data Science Analyst,Regular,37000,66000,Data Science Analyst Job Details Level Experi...,GNY Insurance Companies,3.7,"New York, NY",40.7127281,-74.0060152,201,500
8,Data Analyst,Regular,37000,66000,The Data Analyst is an integral member of the ...,DMGT,4.0,"New York, NY",40.7127281,-74.0060152,5001,10000
9,"Data Analyst, Merchant Health",Regular,37000,66000,About Us Riskified is the AI platform powerin...,Riskified,4.4,"New York, NY",40.7127281,-74.0060152,501,1000


Lastly, save the cleaned-up data to disk.

In [152]:
# index=False already omits the index column, so index_label is not needed
data.to_csv(os.path.join('..', 'raw_data', 'DataAnalyst_Cleanup.csv'), index=False)
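
A quick round trip can confirm that the file reloads without an extra index column. The sketch below writes a hypothetical one-row frame to a temporary directory rather than touching the real output file; the file name `toy_cleanup.csv` is made up for this example.

```python
import os
import tempfile

import pandas as pd

# Hypothetical frame with a non-default index to make the effect visible.
toy = pd.DataFrame({'Job Title': ['Data Analyst'], 'Rating': [3.7]}, index=[42])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_cleanup.csv')
    # index=False keeps the index out of the file.
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# Only the two data columns come back, with a fresh default index.
print(list(reloaded.columns), reloaded.index.tolist())
```

Writing the index would instead produce an additional unnamed column on reload, like the `Unnamed: 0` column in the raw file.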