# Data Cleaning Process

## Categorical vs. Numerical

| **Categorical**                                   | **Numerical**                                  |
|---------------------------------------------------|------------------------------------------------|
| Finite number of groups or categories. Also known as **Qualitative** data | Expressed using numerical values. Also known as **Quantitative** data. Represents measurements (e.g., height)               |

## Categorical Variables: Ordinal vs. Nominal

| **Ordinal**                                   | **Nominal**                                   |
|---------------------------------------------------|-------------------------------------------|
| Categorical variables that have a **natural order**. <br> E.g.: Strongly Disagree ... Strongly Agree | Categorical variables **without order**. <br> E.g.: Blue, Red, ... |
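
The distinction above can be sketched in pandas. This is a minimal illustration on made-up survey answers and colours (the series names and values are assumptions, not part of the dataset): an ordered categorical supports comparisons and `min`/`max`, while a nominal one only records which categories exist.

```python
import pandas as pd

# Hypothetical survey answers and colours, used only for illustration.
answers = pd.Series(["Agree", "Strongly Disagree", "Strongly Agree", "Disagree"])
colours = pd.Series(["Blue", "Red", "Blue", "Green"])

# Ordinal: the category list defines the order, from lowest to highest.
likert_order = ["Strongly Disagree", "Disagree", "Agree", "Strongly Agree"]
answers_ordinal = pd.Series(
    pd.Categorical(answers, categories=likert_order, ordered=True)
)

# Nominal: categories exist, but no order is attached to them.
colours_nominal = colours.astype("category")

# Ordered categoricals can be compared against one of their categories.
at_least_agree = answers_ordinal >= "Agree"
print(at_least_agree.tolist())      # [True, False, True, False]
print(answers_ordinal.min())        # Strongly Disagree
print(colours_nominal.cat.ordered)  # False
```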

## Our dataset

| Column                   | Description                                                                      |
|------------------------- |--------------------------------------------------------------------------------- |
| `student_id`             | A unique ID for each student.                                                    |
| `city`                   | A code for the city the student lives in.                                        |
| `city_development_index` | A scaled development index for the city.                                         |
| `gender`                 | The student's gender.                                                            |
| `relevant_experience`    | An indicator of the student's relevant work experience.                          |
| `enrolled_university`    | The type of university course enrolled in (if any).                              |
| `education_level`        | The student's education level.                                                   |
| `major_discipline`       | The educational discipline of the student.                                       |
| `experience`             | The student's total work experience (in years).                                  |
| `company_size`           | The number of employees at the student's current employer.                       |
| `company_type`           | The type of company employing the student.                                       |
| `last_new_job`           | The number of years between the student's current and previous jobs.             |
| `training_hours`         | The number of hours of training completed.                                       |
| `job_change`             | An indicator of whether the student is looking for a new job (`1`) or not (`0`). |

## Importing Libraries and Loading Data

In [1]:
# Import necessary libraries
import pandas as pd

# Load the dataset
ds_jobs = pd.read_csv("C:/Users/caiov/OneDrive - UCLA IT Services/Documentos/DataScience/Repositories/cleaning-categorical-data-best-practices/data/customer_not_efficient.csv")

# View the dataset
ds_jobs.head()

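|   | student_id | city | city_development_index | gender | relevant_experience | enrolled_university | education_level | major_discipline | experience | company_size | company_type | last_new_job | training_hours | job_change |
|---|------------|------|------------------------|--------|---------------------|---------------------|-----------------|------------------|------------|--------------|--------------|--------------|----------------|------------|
| 0 | 8949 | city_103 | 0.92 | Male | Has relevant experience | no_enrollment | Graduate | STEM | >20 | NaN | NaN | 1 | 36 | 1.0 |
| 1 | 29725 | city_40 | 0.776 | Male | No relevant experience | no_enrollment | Graduate | STEM | 15 | 50-99 | Pvt Ltd | >4 | 47 | 0.0 |
| 2 | 11561 | city_21 | 0.624 | NaN | No relevant experience | Full time course | Graduate | STEM | 5 | NaN | NaN | never | 83 | 0.0 |
| 3 | 33241 | city_115 | 0.789 | NaN | No relevant experience | NaN | Graduate | Business Degree | <1 | NaN | Pvt Ltd | never | 52 | 1.0 |
| 4 | 666 | city_162 | 0.767 | Male | Has relevant experience | no_enrollment | Masters | STEM | >20 | 50-99 | Funded Startup | 4 | 8 | 0.0 |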


In [2]:
# Create a copy of ds_jobs for transforming
ds_jobs_transformed = ds_jobs.copy()

## Cleaning Procedures

### Exploring Data Types

In [3]:
ds_jobs_transformed.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19158 entries, 0 to 19157
Data columns (total 14 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   student_id              19158 non-null  int64  
 1   city                    19158 non-null  object 
 2   city_development_index  19158 non-null  float64
 3   gender                  14650 non-null  object 
 4   relevant_experience     19158 non-null  object 
 5   enrolled_university     18772 non-null  object 
 6   education_level         18698 non-null  object 
 7   major_discipline        16345 non-null  object 
 8   experience              19093 non-null  object 
 9   company_size            13220 non-null  object 
 10  company_type            13018 non-null  object 
 11  last_new_job            18735 non-null  object 
 12  training_hours          19158 non-null  int64  
 13  job_change              19158 non-null  float64
dtypes: float64(2), int64(2), object(10)

### Numeric Columns

In [4]:
# List of Numeric Columns
# Note: job_change should be a categorical column and not a numeric column

numeric_columns = ['student_id', 'city_development_index', 'training_hours']

### Converting Procedures

| **Integer Columns**               | **Float Columns**             |
|-----------------------------------|-------------------------------|
| Store as 32-bit integers (`int32`) | Store as 16-bit floats (`float16`) |
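
As a rough illustration of what this downcasting trades off, the sketch below uses a small made-up frame (the values are assumptions; only the column names mirror the dataset). It compares bytes per column before and after, and measures the rounding introduced by `float16`, which keeps only about three significant decimal digits.

```python
import numpy as np
import pandas as pd

# Hypothetical values; only the column names mirror the dataset above.
df = pd.DataFrame({
    "training_hours": np.array([36, 47, 83, 52, 8], dtype="int64"),
    "city_development_index": np.array(
        [0.92, 0.776, 0.624, 0.789, 0.767], dtype="float64"
    ),
})

small = df.astype({"training_hours": "int32", "city_development_index": "float16"})

# Bytes per column, excluding the index.
print(df.memory_usage(index=False))     # 40 and 40 bytes
print(small.memory_usage(index=False))  # 20 and 10 bytes

# float16 rounds the stored values; measure the largest absolute error.
max_error = (
    df["city_development_index"]
    - small["city_development_index"].astype("float64")
).abs().max()
print(max_error)

# int32 is only safe when every value fits in its range.
print(np.iinfo("int32").max)  # 2147483647
```

Whether the `float16` rounding is acceptable depends on how the index is used downstream; `float32` is a more conservative choice when precision matters.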


In [5]:
# Convert the numeric columns according to the table above

for col in numeric_columns:
    if pd.api.types.is_integer_dtype(ds_jobs_transformed[col]):
        ds_jobs_transformed[col] = ds_jobs_transformed[col].astype('int32')
    elif pd.api.types.is_float_dtype(ds_jobs_transformed[col]):
        ds_jobs_transformed[col] = ds_jobs_transformed[col].astype('float16')

print(ds_jobs_transformed[numeric_columns].dtypes)

student_id                  int32
city_development_index    float16
training_hours              int32
dtype: object


In [6]:
# List of Categorical Columns
categorical_columns = list(ds_jobs_transformed.select_dtypes(include=['object', 'category']).columns)

# Including `job_change` in the list of categorical columns
categorical_columns = categorical_columns + ['job_change']

print(categorical_columns)

['city', 'gender', 'relevant_experience', 'enrolled_university', 'education_level', 'major_discipline', 'experience', 'company_size', 'company_type', 'last_new_job', 'job_change']


In [7]:
ds_jobs_transformed[categorical_columns].nunique()

city                   123
gender                   3
relevant_experience      2
enrolled_university      3
education_level          5
major_discipline         6
experience              22
company_size             8
company_type             6
last_new_job             6
job_change               2
dtype: int64

In [8]:
# Separating categorical columns by their nature
ls_categorical_bool = ['relevant_experience', 'job_change']
ls_categorical_with_order = ['enrolled_university', 'education_level', 'experience', 'company_size', 'last_new_job']
ls_categorical_no_order = ['city', 'gender', 'major_discipline', 'company_type']

### Converting Procedures

| **Converting Categorical Data**                                                                                  |
|------------------------------------------------------------------------------------------------------------------|
| (Two-factor categories) Data with **2 categories** (e.g., yes/no) → Convert to `bool`                            |
| (Ordinal data) Data with **more than 2 categories** and a **natural ordering** → Convert to an ordered `category` |
| (Nominal data) Data with **few unique values** and **no natural ordering** → Convert to `category`               |
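
The reason `category` pays off for columns with few unique values is that each distinct label is stored once and every row holds only a small integer code. A minimal sketch, assuming a made-up column of repeated labels (not the real data):

```python
import pandas as pd

# Hypothetical column: four distinct labels repeated many times.
company_type = pd.Series(
    ["Pvt Ltd", "Funded Startup", "Public Sector", "NGO"] * 5000,
    dtype="object",
)

# deep=True counts the Python string objects, not just the pointers to them.
as_object = company_type.memory_usage(index=False, deep=True)
as_category = company_type.astype("category").memory_usage(index=False, deep=True)

print(f"object:   {as_object} bytes")
print(f"category: {as_category} bytes")
```

The saving shrinks as the number of unique values approaches the number of rows, which is why the table reserves `category` for low-cardinality columns.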

In [9]:
# Two-factor Categories (mapping to boolean)
print("relevant_experience: ", ds_jobs_transformed['relevant_experience'].unique())
print("job_change: ", ds_jobs_transformed['job_change'].unique())

ds_jobs_transformed['relevant_experience'] = ds_jobs_transformed['relevant_experience'].map({'Has relevant experience': True, 'No relevant experience': False})
ds_jobs_transformed['job_change'] = ds_jobs_transformed['job_change'].map({1: True, 0: False})

relevant_experience:  ['Has relevant experience' 'No relevant experience']
job_change:  [1. 0.]


In [10]:
print("relevant_experience: ", ds_jobs_transformed['relevant_experience'].unique())
print("job_change: ", ds_jobs_transformed['job_change'].unique())

relevant_experience:  [ True False]
job_change:  [ True False]
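
One caveat with `Series.map` and a dictionary: any label missing from the dictionary becomes `NaN`, and the column then stops being a plain `bool` column. The sketch below uses a made-up series with one deliberately mismatched label to show a quick check that could be run before trusting the result.

```python
import pandas as pd

# Hypothetical column containing one label that the mapping does not cover.
relevant_experience = pd.Series([
    "Has relevant experience",
    "No relevant experience",
    "has relevant experience",  # different capitalisation, not in the mapping
])

mapping = {"Has relevant experience": True, "No relevant experience": False}
mapped = relevant_experience.map(mapping)

# Labels absent from the mapping come back as NaN.
unmapped = relevant_experience[mapped.isna()].unique().tolist()
print(unmapped)      # ['has relevant experience']
print(mapped.dtype)  # object, because of the NaN
```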


In [11]:
# Ordinal Data (converting to "ordered category")

# enrolled_university
print("enrolled_university: ", ds_jobs_transformed['enrolled_university'].unique())

ls_enrolled_university_order = ['no_enrollment', 'Part time course', 'Full time course']

ds_jobs_transformed['enrolled_university'] = pd.Categorical(ds_jobs_transformed['enrolled_university'], categories=ls_enrolled_university_order, ordered=True)

print("enrolled_university: ", ds_jobs_transformed['enrolled_university'])

enrolled_university:  ['no_enrollment' 'Full time course' nan 'Part time course']
enrolled_university:  0           no_enrollment
1           no_enrollment
2        Full time course
3                     NaN
4           no_enrollment
               ...       
19153       no_enrollment
19154       no_enrollment
19155       no_enrollment
19156       no_enrollment
19157       no_enrollment
Name: enrolled_university, Length: 19158, dtype: category
Categories (3, object): ['no_enrollment' < 'Part time course' < 'Full time course']


In [12]:
# education_level

print("education_level: ", ds_jobs_transformed['education_level'].unique())

ls_education_level_order = ['Primary School', 'High School', 'Graduate', 'Masters', 'Phd']

ds_jobs_transformed['education_level'] = pd.Categorical(ds_jobs_transformed['education_level'], categories=ls_education_level_order, ordered=True)

print("education_level: ", ds_jobs_transformed['education_level'])

education_level:  ['Graduate' 'Masters' 'High School' nan 'Phd' 'Primary School']
education_level:  0              Graduate
1              Graduate
2              Graduate
3              Graduate
4               Masters
              ...      
19153          Graduate
19154          Graduate
19155          Graduate
19156       High School
19157    Primary School
Name: education_level, Length: 19158, dtype: category
Categories (5, object): ['Primary School' < 'High School' < 'Graduate' < 'Masters' < 'Phd']


In [13]:
# experience
print("experience: ", ds_jobs_transformed['experience'].unique())

ls_experience_order = ['<1', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '>20']

ds_jobs_transformed['experience'] = pd.Categorical(ds_jobs_transformed['experience'], categories=ls_experience_order, ordered=True)

print("experience: ", ds_jobs_transformed['experience'])

experience:  ['>20' '15' '5' '<1' '11' '13' '7' '17' '2' '16' '1' '4' '10' '14' '18'
 '19' '12' '3' '6' '9' '8' '20' nan]
experience:  0        >20
1         15
2          5
3         <1
4        >20
        ... 
19153     14
19154     14
19155    >20
19156     <1
19157      2
Name: experience, Length: 19158, dtype: category
Categories (22, object): ['<1' < '1' < '2' < '3' ... '18' < '19' < '20' < '>20']


In [14]:
# company_size
print("company_size: ", ds_jobs_transformed['company_size'].unique())

# Labels must match the values in the data exactly; unmatched values become NaN
ls_company_size_order = ['<10', '10-49', '50-99', '100-499', '500-999', '1000-4999', '5000-9999', '10000+']

ds_jobs_transformed['company_size'] = pd.Categorical(ds_jobs_transformed['company_size'], categories=ls_company_size_order, ordered=True)

print("company_size: ", ds_jobs_transformed['company_size'])

company_size:  [nan '50-99' '<10' '10000+' '5000-9999' '1000-4999' '10-49' '100-499'
 '500-999']
company_size:  0            NaN
1          50-99
2            NaN
3            NaN
4          50-99
          ...   
19153        NaN
19154        NaN
19155      50-99
19156    500-999
19157        NaN
Name: company_size, Length: 19158, dtype: category
Categories (8, object): ['<10' < '10-49' < '50-99' < '100-499' < '500-999' < '1000-4999' < '5000-9999' < '10000+']
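
`pd.Categorical` silently turns any value that is not listed in `categories` into `NaN`, so it can be worth comparing the category list against the observed labels before converting. A minimal sketch on a made-up sample, using a deliberately mistyped list to show the failure mode (both the sample and the list are assumptions for illustration):

```python
import pandas as pd

# Hypothetical sample; the labels follow the unique() output shown above.
company_size = pd.Series(["50-99", "10-49", None, "100-499", "10000+"])

# A deliberately mistyped category list, to show what goes wrong.
order_with_typos = ["<10", "10/49", "50-99", "100-500", "500-999",
                    "1000-4999", "5000-9999", "10000+"]

# Observed labels that the category list does not contain.
missing = sorted(set(company_size.dropna()) - set(order_with_typos))
print(missing)  # ['10-49', '100-499']

# Converting anyway loses those rows: 1 missing value becomes 3.
converted = pd.Categorical(company_size, categories=order_with_typos, ordered=True)
nan_before = int(company_size.isna().sum())
nan_after = int(pd.isna(converted).sum())
print(nan_before, nan_after)  # 1 3
```

An empty `missing` list, or an unchanged count of missing values after conversion, suggests the category list covers the data.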


In [15]:
# last_new_job
print("last_new_job: ", ds_jobs_transformed['last_new_job'].unique())

ls_last_new_job_order = ['never', '1', '2', '3', '4', '>4']

ds_jobs_transformed['last_new_job'] = pd.Categorical(ds_jobs_transformed['last_new_job'], categories=ls_last_new_job_order, ordered=True)

print("last_new_job: ", ds_jobs_transformed['last_new_job'])

last_new_job:  ['1' '>4' 'never' '4' '3' '2' nan]
last_new_job:  0            1
1           >4
2        never
3        never
4            4
         ...  
19153        1
19154        4
19155        4
19156        2
19157        1
Name: last_new_job, Length: 19158, dtype: category
Categories (6, object): ['never' < '1' < '2' < '3' < '4' < '>4']


In [16]:
# Nominal Data (converting to "category")

ls_categorical_no_order = ['city', 'gender', 'major_discipline', 'company_type']

for col in ls_categorical_no_order:
    ds_jobs_transformed[col] = ds_jobs_transformed[col].astype('category')

# Check the data types of the transformed dataset
ds_jobs_transformed[ls_categorical_no_order].info()



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19158 entries, 0 to 19157
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype   
---  ------            --------------  -----   
 0   city              19158 non-null  category
 1   gender            14650 non-null  category
 2   major_discipline  16345 non-null  category
 3   company_type      13018 non-null  category
dtypes: category(4)
memory usage: 80.6 KB


### Business Goal

This recruitment company wants to focus on:
* more experienced professionals
* enterprise companies

Therefore, the DataFrame should be filtered to contain only students with:
* `experience` of 10 years or more
* `company_size` of 1,000 employees or more

In [17]:
# Filtering dataset for business goals
ds_jobs_transformed = ds_jobs_transformed[
    (ds_jobs_transformed['experience'] >= '10') & 
    (ds_jobs_transformed['company_size'] >= '1000-4999')
]
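
The filter relies on the columns being ordered categoricals. On plain strings the same comparison would be lexicographic and give misleading results, as this small sketch on made-up experience labels suggests:

```python
import pandas as pd

# Hypothetical experience labels.
experience = pd.Series(["9", "10", ">20", "<1", "15"])

# Plain string comparison is lexicographic, so "9" and "<1" pass the test.
as_strings = (experience >= "10").tolist()
print(as_strings)  # [True, True, True, True, True]

# With an ordered categorical, the comparison follows the declared order.
order = ["<1"] + [str(n) for n in range(1, 21)] + [">20"]
ordered = pd.Series(pd.Categorical(experience, categories=order, ordered=True))
as_ordered = (ordered >= "10").tolist()
print(as_ordered)  # [False, True, True, False, True]
```

Rows with a missing value compare as `False`, so they are dropped by the filter as well.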


### Checking Efficiency: Memory Usage (Original DataFrame vs. Transformed DataFrame)

In [18]:
print(f"Original DataFrame memory usage: {ds_jobs.memory_usage(deep=True).sum() / 1024 ** 2:.2f} MB")
print(f"Transformed DataFrame memory usage: {ds_jobs_transformed.memory_usage(deep=True).sum() / 1024 ** 2:.2f} MB")

Original DataFrame memory usage: 10.51 MB
Transformed DataFrame memory usage: 0.08 MB
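
Part of this reduction comes from the dtype conversions and part from the row filter. One way to separate the two effects is to measure the converted frame before filtering as well. The sketch below does this on a small made-up frame (the data, the filter condition, and the helper `mb` are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical frame standing in for the original data.
raw = pd.DataFrame({
    "company_type": pd.Series(
        ["Pvt Ltd", "NGO", "Pvt Ltd", "Public Sector"] * 2500, dtype="object"
    ),
    "training_hours": pd.Series(range(10000), dtype="int64"),
})

# Step 1: dtype conversions only, same number of rows.
converted = raw.astype({"company_type": "category", "training_hours": "int32"})

# Step 2: an arbitrary row filter applied on top of the conversions.
filtered = converted[converted["training_hours"] < 1000]


def mb(frame):
    """Total memory of a DataFrame in megabytes, including string contents."""
    return frame.memory_usage(deep=True).sum() / 1024 ** 2


print(f"raw:       {mb(raw):.3f} MB")
print(f"converted: {mb(converted):.3f} MB")
print(f"filtered:  {mb(filtered):.3f} MB")
```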
