| What problems should we worry about?                   | What can we do about these problems?                                     |
| ------------------------------------------------------ | ------------------------------------------------------------------------ |
| Extra index column in the source dataset               | Drop it                                                                  |
| `Salary Estimate`, `Size`, and `Revenue` are strings   | Convert to tuples of numbers                                             |
| `-1` as a default value nearly everywhere              | Replace all with `NaN`                                                   |
| `Rating` is included in the `Company Name`             | Parse with string operators                                              |
| `Competitors` are stored as a string                   | Split and replace with an array                                          |
| All column names are in _Title Case_                   | Rename all to use _snake_case_ so they're easier to use programmatically |
| `Unknown / Non-Applicable` default value for `Revenue` | Replace with `NaN`                                                       |


In [29]:
import pandas

df = pandas.read_csv("./2.0.csv")
df.head()

Unnamed: 0,index,Job Title,Salary Estimate,Job Description,Rating,Company Name,Location,Headquarters,Size,Founded,Type of ownership,Industry,Sector,Revenue,Competitors
0,0,Sr Data Scientist,$137K-$171K (Glassdoor est.),Description\n\nThe Senior Data Scientist is re...,3.1,Healthfirst\n3.1,"New York, NY","New York, NY",1001 to 5000 employees,1993,Nonprofit Organization,Insurance Carriers,Insurance,Unknown / Non-Applicable,"EmblemHealth, UnitedHealth Group, Aetna"
1,1,Data Scientist,$137K-$171K (Glassdoor est.),"Secure our Nation, Ignite your Future\n\nJoin ...",4.2,ManTech\n4.2,"Chantilly, VA","Herndon, VA",5001 to 10000 employees,1968,Company - Public,Research & Development,Business Services,$1 to $2 billion (USD),-1
2,2,Data Scientist,$137K-$171K (Glassdoor est.),Overview\n\n\nAnalysis Group is one of the lar...,3.8,Analysis Group\n3.8,"Boston, MA","Boston, MA",1001 to 5000 employees,1981,Private Practice / Firm,Consulting,Business Services,$100 to $500 million (USD),-1
3,3,Data Scientist,$137K-$171K (Glassdoor est.),JOB DESCRIPTION:\n\nDo you have a passion for ...,3.5,INFICON\n3.5,"Newton, MA","Bad Ragaz, Switzerland",501 to 1000 employees,2000,Company - Public,Electrical & Electronic Manufacturing,Manufacturing,$100 to $500 million (USD),"MKS Instruments, Pfeiffer Vacuum, Agilent Tech..."
4,4,Data Scientist,$137K-$171K (Glassdoor est.),Data Scientist\nAffinity Solutions / Marketing...,2.9,Affinity Solutions\n2.9,"New York, NY","New York, NY",51 to 200 employees,1998,Company - Private,Advertising & Marketing,Business Services,Unknown / Non-Applicable,"Commerce Signals, Cardlytics, Yodlee"


In [30]:
df.drop(columns="index", inplace=True)
# Drop extra index column

In [31]:
df.rename(
    columns={
        "Job Title": "job_title",
        "Salary Estimate": "salary",
        "Job Description": "description",
        "Rating": "rating",
        "Company Name": "company",
        "Headquarters": "headquarters",
        "Size": "size",
        "Founded": "founded",
        "Type of ownership": "ownership",
        "Industry": "industry",
        "Sector": "sector",
        "Revenue": "revenue",
        "Competitors": "competitors",
    },
    inplace=True,
)
# Rename columns to snake_case (`Location` is not in the mapping, so it keeps its original name)

In [32]:
df

Unnamed: 0,job_title,salary,description,rating,company,Location,headquarters,size,founded,ownership,industry,sector,revenue,competitors
0,Sr Data Scientist,$137K-$171K (Glassdoor est.),Description\n\nThe Senior Data Scientist is re...,3.1,Healthfirst\n3.1,"New York, NY","New York, NY",1001 to 5000 employees,1993,Nonprofit Organization,Insurance Carriers,Insurance,Unknown / Non-Applicable,"EmblemHealth, UnitedHealth Group, Aetna"
1,Data Scientist,$137K-$171K (Glassdoor est.),"Secure our Nation, Ignite your Future\n\nJoin ...",4.2,ManTech\n4.2,"Chantilly, VA","Herndon, VA",5001 to 10000 employees,1968,Company - Public,Research & Development,Business Services,$1 to $2 billion (USD),-1
2,Data Scientist,$137K-$171K (Glassdoor est.),Overview\n\n\nAnalysis Group is one of the lar...,3.8,Analysis Group\n3.8,"Boston, MA","Boston, MA",1001 to 5000 employees,1981,Private Practice / Firm,Consulting,Business Services,$100 to $500 million (USD),-1
3,Data Scientist,$137K-$171K (Glassdoor est.),JOB DESCRIPTION:\n\nDo you have a passion for ...,3.5,INFICON\n3.5,"Newton, MA","Bad Ragaz, Switzerland",501 to 1000 employees,2000,Company - Public,Electrical & Electronic Manufacturing,Manufacturing,$100 to $500 million (USD),"MKS Instruments, Pfeiffer Vacuum, Agilent Tech..."
4,Data Scientist,$137K-$171K (Glassdoor est.),Data Scientist\nAffinity Solutions / Marketing...,2.9,Affinity Solutions\n2.9,"New York, NY","New York, NY",51 to 200 employees,1998,Company - Private,Advertising & Marketing,Business Services,Unknown / Non-Applicable,"Commerce Signals, Cardlytics, Yodlee"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
667,Data Scientist,$105K-$167K (Glassdoor est.),Summary\n\nWe’re looking for a data scientist ...,3.6,TRANZACT\n3.6,"Fort Lee, NJ","Fort Lee, NJ",1001 to 5000 employees,1989,Company - Private,Advertising & Marketing,Business Services,Unknown / Non-Applicable,-1
668,Data Scientist,$105K-$167K (Glassdoor est.),Job Description\nBecome a thought leader withi...,-1.0,JKGT,"San Francisco, CA",-1,-1,-1,-1,-1,-1,-1,-1
669,Data Scientist,$105K-$167K (Glassdoor est.),Join a thriving company that is changing the w...,-1.0,AccessHope,"Irwindale, CA",-1,-1,-1,-1,-1,-1,-1,-1
670,Data Scientist,$105K-$167K (Glassdoor est.),100 Remote Opportunity As an AINLP Data Scient...,5.0,ChaTeck Incorporated\n5.0,"San Francisco, CA","Santa Clara, CA",1 to 50 employees,-1,Company - Private,Advertising & Marketing,Business Services,$1 to $5 million (USD),-1


In [33]:
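from numpy import nan  # the `NaN` alias was removed in NumPy 2.0

df.replace(-1, nan, inplace=True)  # Replace values in rating, founded, etc.
df.replace("-1", nan, inplace=True)  # Replace values in sector, competitors, etc.
# Assign the result back: calling replace(inplace=True) on a column selection
# may not update df in recent pandas versions
df["revenue"] = df["revenue"].replace("Unknown / Non-Applicable", nan)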
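
As an aside, the same placeholders could be turned into `NaN` while loading the file, using the `na_values` parameter of `pandas.read_csv`. The sketch below is only an illustration: it reads a tiny in-memory stand-in (`csv_text`, made up from three of the columns shown above) rather than the real file. Note that `na_values` given as a list applies to every column, whereas the cell above only replaces `Unknown / Non-Applicable` in `revenue`.

```python
import io

import pandas

# Tiny stand-in for the source file; only three of the real columns are shown.
csv_text = (
    "Rating,Founded,Revenue\n"
    "3.1,1993,Unknown / Non-Applicable\n"
    "-1.0,-1,-1\n"
    "5.0,-1,$1 to $5 million (USD)\n"
)

# Treat the placeholder markers as missing values in every column at read time.
loaded = pandas.read_csv(
    io.StringIO(csv_text),
    na_values=["-1", "Unknown / Non-Applicable"],
)

# Count the missing values per column to check the placeholders were caught.
print(loaded.isna().sum())
```

Whether this is preferable depends on whether the raw placeholders are needed later; replacing after loading, as done above, keeps that step explicit.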

In [34]:
print("Fraction of non-null values")
df.count() / df.shape[0]
# Re-measure now that placeholder values are NaN

Fraction of non-null values


job_title       1.000000
salary          1.000000
description     1.000000
rating          0.925595
company         1.000000
Location        1.000000
headquarters    0.953869
size            0.959821
founded         0.824405
ownership       0.959821
industry        0.894345
sector          0.894345
revenue         0.642857
competitors     0.254464
dtype: float64

In [35]:
# Drop the competitors column: roughly 75% of its values are NaN, and it isn't directly related to the job posting
df.drop(columns="competitors", inplace=True)

| What problems should we worry about?                 | What can we do about these problems?                                     |
| ---------------------------------------------------- | ------------------------------------------------------------------------ |
| `Salary Estimate`, `Size`, and `Revenue` are strings | Convert to tuples of numbers                                             |
| `Rating` is included in the `Company Name`           | Parse with string operators                                              |
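
A minimal sketch of how the string-parsing items in the table might be approached, assuming the values always follow the shapes seen in the output above (`$137K-$171K (Glassdoor est.)`, `1001 to 5000 employees`, and a company name followed by a newline and the rating). The helper names `parse_salary` and `parse_size` and the `sample` frame are hypothetical, and other value shapes in the full dataset would need extra handling.

```python
import re

import pandas

# Sample rows copied from the output above; the real frame has many more.
sample = pandas.DataFrame(
    {
        "salary": ["$137K-$171K (Glassdoor est.)", "$105K-$167K (Glassdoor est.)"],
        "company": ["Healthfirst\n3.1", "JKGT"],
        "size": ["1001 to 5000 employees", None],
    }
)


def parse_salary(text):
    """Return (low, high) in thousands of dollars, or None if no range is found."""
    match = re.search(r"\$(\d+)K-\$(\d+)K", text)
    return (int(match.group(1)), int(match.group(2))) if match else None


def parse_size(text):
    """Return (min, max) employees for values shaped like '1001 to 5000 employees'."""
    if not isinstance(text, str):
        return None  # missing values are passed through as None
    match = re.search(r"(\d+) to (\d+) employees", text)
    return (int(match.group(1)), int(match.group(2))) if match else None


sample["salary"] = sample["salary"].map(parse_salary)
sample["size"] = sample["size"].map(parse_size)
# The rating follows a newline, so keep only the text before it.
sample["company"] = sample["company"].str.split("\n").str[0]

print(sample)
```

The rating does not need to be recovered from `company`, since it is already available in its own `rating` column.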


In [36]:
df.to_csv("./2.2.csv", index=False)