Data sourced from https://www.kaggle.com/datasets/rashikrahmanpritom/data-science-job-posting-on-glassdoor/

### How can we detect problems with the data?

-   Sample the data using `.head()` and look for problems
-   Use `.dtypes` to verify the datatype of each column
-   Use `.value_counts()` to find any non-NaN default values or common leading whitespace in values
-   Use `.duplicated()` to check that there are no duplicate rows


In [2]:
import pandas

df = pandas.read_csv("./2.0.csv")
df.head()

Unnamed: 0,index,Job Title,Salary Estimate,Job Description,Rating,Company Name,Location,Headquarters,Size,Founded,Type of ownership,Industry,Sector,Revenue,Competitors
0,0,Sr Data Scientist,$137K-$171K (Glassdoor est.),Description\n\nThe Senior Data Scientist is re...,3.1,Healthfirst\n3.1,"New York, NY","New York, NY",1001 to 5000 employees,1993,Nonprofit Organization,Insurance Carriers,Insurance,Unknown / Non-Applicable,"EmblemHealth, UnitedHealth Group, Aetna"
1,1,Data Scientist,$137K-$171K (Glassdoor est.),"Secure our Nation, Ignite your Future\n\nJoin ...",4.2,ManTech\n4.2,"Chantilly, VA","Herndon, VA",5001 to 10000 employees,1968,Company - Public,Research & Development,Business Services,$1 to $2 billion (USD),-1
2,2,Data Scientist,$137K-$171K (Glassdoor est.),Overview\n\n\nAnalysis Group is one of the lar...,3.8,Analysis Group\n3.8,"Boston, MA","Boston, MA",1001 to 5000 employees,1981,Private Practice / Firm,Consulting,Business Services,$100 to $500 million (USD),-1
3,3,Data Scientist,$137K-$171K (Glassdoor est.),JOB DESCRIPTION:\n\nDo you have a passion for ...,3.5,INFICON\n3.5,"Newton, MA","Bad Ragaz, Switzerland",501 to 1000 employees,2000,Company - Public,Electrical & Electronic Manufacturing,Manufacturing,$100 to $500 million (USD),"MKS Instruments, Pfeiffer Vacuum, Agilent Tech..."
4,4,Data Scientist,$137K-$171K (Glassdoor est.),Data Scientist\nAffinity Solutions / Marketing...,2.9,Affinity Solutions\n2.9,"New York, NY","New York, NY",51 to 200 employees,1998,Company - Private,Advertising & Marketing,Business Services,Unknown / Non-Applicable,"Commerce Signals, Cardlytics, Yodlee"


In [3]:
df.dtypes

index                  int64
Job Title             object
Salary Estimate       object
Job Description       object
Rating               float64
Company Name          object
Location              object
Headquarters          object
Size                  object
Founded                int64
Type of ownership     object
Industry              object
Sector                object
Revenue               object
Competitors           object
dtype: object

In [4]:
df.size

10080

In [20]:
for column in df.columns:
    print(f"---- {column} ----")
    print(df[column].value_counts().head(5))
    if df[column].dtype == object:
        print("-- First characters --")
        print(df[column].str[0].value_counts().head(5))

---- index ----
0      1
451    1
443    1
444    1
445    1
Name: index, dtype: int64
---- Job Title ----
Data Scientist               337
Data Engineer                 26
Senior Data Scientist         19
Machine Learning Engineer     16
Data Analyst                  12
Name: Job Title, dtype: int64
-- First characters --
D    457
S     88
M     25
A     18
P     16
Name: Job Title, dtype: int64
---- Salary Estimate ----
$99K-$132K (Glassdoor est.)     32
$75K-$131K (Glassdoor est.)     32
$79K-$131K (Glassdoor est.)     32
$90K-$109K (Glassdoor est.)     30
$137K-$171K (Glassdoor est.)    30
Name: Salary Estimate, dtype: int64
-- First characters --
$    672
Name: Salary Estimate, dtype: int64
---- Job Description ----
Job Overview: The Data Scientist is a key member of our cross-functional Product team responsible for discovering new and innovative solutions to the challenges within the built environment. Now, more than ever, building owners and operators rely on Hatch Data to get a

In [21]:
df[df.duplicated()]

Unnamed: 0,index,Job Title,Salary Estimate,Job Description,Rating,Company Name,Location,Headquarters,Size,Founded,Type of ownership,Industry,Sector,Revenue,Competitors
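
One caveat on the empty result above: the `value_counts()` output shows every `index` value occurring exactly once, so a full-row `.duplicated()` check cannot flag anything while that column is included. A minimal sketch of the effect, using a toy frame with made-up values (the `Acme` rows are hypothetical, not from the dataset):

```python
import pandas as pd

# Toy frame with hypothetical values: the first two postings are
# identical apart from the unique `index` column.
toy = pd.DataFrame({
    "index": [0, 1, 2],
    "Job Title": ["Data Scientist", "Data Scientist", "Data Engineer"],
    "Company Name": ["Acme", "Acme", "Acme"],
})

# Full-row comparison: the unique `index` makes every row distinct.
full_row_dupes = toy.duplicated().sum()

# Comparison on the content columns only: the repeated posting is flagged.
content_dupes = toy.drop(columns=["index"]).duplicated().sum()

print(full_row_dupes, content_dupes)
```

Applying the same idea to `df` would mean running `df.drop(columns=["index"]).duplicated()`; whether that finds duplicates in this dataset is not shown here.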


| What problems should we worry about?                 | What can we do about these problems?                                     |
| ---------------------------------------------------- | ------------------------------------------------------------------------ |
| Extra index column in the source dataset             | Drop it                                                                  |
| `Salary Estimate`, `Size`, and `Revenue` are strings | Convert to tuples of numbers                                             |
| `-1` as a default value nearly everywhere            | Replace all with `NaN`                                                   |
| `Rating` is included in the `Company Name`           | Parse with string operators                                              |
| `Competitors` is stored as a comma-separated string  | Split into a list                                                        |
| All column names are in _Title Case_                 | Rename all to use _snake_case_ so they're easier to use programmatically |
