# **Cleaning and transforming the CSV**

In [None]:
import pandas as pd
import pycountry

file_path = "C:/Users/gabri/OneDrive/Documentos/Universidad/ETL/workshop/csv/candidates.csv"
df = pd.read_csv(file_path, sep=";")

def get_country_alpha3(country_name):  # Map a country name to its ISO 3166-1 alpha-3 code; return None if it is not found
    try:
        return pycountry.countries.lookup(country_name).alpha_3
    except LookupError:
        return None

df['Application Date'] = pd.to_datetime(df['Application Date'], errors='coerce')  # Convert Application Date to datetime; unparseable values become NaT
df['Score'] = (df['Code Challenge Score'] + df['Technical Interview Score']) / 2  # Score is the average of the two assessment scores
df['Hired'] = df['Score'] >= 7  # Hired is True when Score is at least 7
df['Country'] = df['Country'].apply(get_country_alpha3)  # Replace each country name with its 3-letter code

cleaned_file_path = "C:/Users/gabri/OneDrive/Documentos/Universidad/ETL/workshop/csv/candidates_cleaned.csv"
df.to_csv(cleaned_file_path, index=False)
print(f"Cleaned file saved to: {cleaned_file_path}")


Cleaned file saved to: C:/Users/gabri/OneDrive/Documentos/Universidad/ETL/workshop/csv/candidates_cleaned.csv
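Because `get_country_alpha3` returns `None` when a name cannot be resolved, the cleaned file can contain missing `Country` values even though the raw data had none. The sketch below shows one way to list the names that failed. It assumes the codes are written to a separate, hypothetical `Country Code` column instead of overwriting `Country`, and it uses a made-up three-row sample rather than the real file.

```python
import pandas as pd

# Hypothetical sample: original names next to the codes a lookup produced.
# None marks a name the lookup could not resolve ("Atlantis" is invented).
sample = pd.DataFrame({
    "Country": ["Norway", "Panama", "Atlantis"],
    "Country Code": ["NOR", "PAN", None],
})

def unmapped_countries(frame, name_col="Country", code_col="Country Code"):
    """Return the sorted distinct names whose code is missing."""
    missing = frame[code_col].isna()
    return sorted(frame.loc[missing, name_col].unique())

unmapped = unmapped_countries(sample)
print(unmapped)
```

An empty list would mean every country name was mapped to a code.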


## Findings from the notebook readData-001.ipynb:

+ There are no null values.  
+ There are no duplicate values.  
+ There are no duplicate rows.  
+ We know that Application Date is of type `object`, so we need to fix that and handle it as a date.  
+ We do not have a column that identifies whether a candidate meets the conditions to be hired.  
+ We do not have a column to indicate whether a candidate is hired or not.  
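The first findings in this list can be checked with a few pandas calls. This is a minimal sketch on a hypothetical two-row sample shaped like the raw file, not the code from readData-001.ipynb; the dates are stored as plain strings on the assumption that this is how they arrive from the CSV.

```python
import pandas as pd

# Hypothetical two-row sample shaped like the raw candidates.csv.
sample = pd.DataFrame({
    "First Name": ["Bernadette", "Camryn"],
    "Application Date": ["2021-02-26", "2021-09-09"],  # still text at this stage
    "Code Challenge Score": [3, 2],
    "Technical Interview Score": [3, 10],
})

null_count = int(sample.isna().sum().sum())      # total missing cells
duplicate_rows = int(sample.duplicated().sum())  # fully repeated rows
date_is_datetime = pd.api.types.is_datetime64_any_dtype(sample["Application Date"])

print(null_count, duplicate_rows, date_is_datetime)
```

A result of zero nulls, zero duplicate rows, and `False` for the date check matches the findings above: the data is complete, but the date column still needs converting.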

### **Therefore,**

## In the notebook cleanData-002.ipynb, we do the following:  

+ Application Date is of type `object`, so in this cleaning process we convert it to `datetime64[ns]`.  
+ We create a **Score** column, the average of the two assessment scores, to determine whether a candidate meets the requirements to be hired.  
+ We create the **Hired** column to identify which candidates meet the recommended requirements (a **Score** of at least 7).  
+ We abbreviate country names to **3-letter codes** using the **pycountry** library.  
+ We save the cleaned data to a new file, **candidates_cleaned.csv**, which we will later migrate to the database in **dataMigration-003.ipynb**.  
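One thing to keep in mind for the migration step is that a CSV file stores no type information, so the date conversion is not preserved when the file is read back. The sketch below illustrates this with an in-memory buffer and a hypothetical two-row frame. Passing `parse_dates` to `pd.read_csv` is one possible way to restore the type, and not necessarily what dataMigration-003.ipynb does.

```python
import io
import pandas as pd

# Hypothetical cleaned frame with a real datetime column.
cleaned = pd.DataFrame({
    "Application Date": pd.to_datetime(["2021-02-26", "2021-09-09"]),
    "Score": [3.0, 6.0],
    "Hired": [False, False],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)

buffer.seek(0)
plain = pd.read_csv(buffer)  # dates come back as text

buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=["Application Date"])  # dates restored

print(plain["Application Date"].dtype)
print(parsed["Application Date"].dtype)
```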


# **Cleaned and transformed DataFrame**

In [10]:
df.head()

|   | First Name | Last Name | Email | Application Date | Country | YOE | Seniority | Technology | Code Challenge Score | Technical Interview Score | Score | Hired |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Bernadette | Langworth | leonard91@yahoo.com | 2021-02-26 | NOR | 2 | Intern | Data Engineer | 3 | 3 | 3.0 | False |
| 1 | Camryn | Reynolds | zelda56@hotmail.com | 2021-09-09 | PAN | 10 | Intern | Data Engineer | 2 | 10 | 6.0 | False |
| 2 | Larue | Spinka | okey_schultz41@gmail.com | 2020-04-14 | BLR | 4 | Mid-Level | Client Success | 10 | 9 | 9.5 | True |
| 3 | Arch | Spinka | elvera_kulas@yahoo.com | 2020-10-01 | ERI | 25 | Trainee | QA Manual | 7 | 1 | 4.0 | False |
| 4 | Larue | Altenwerth | minnie.gislason@gmail.com | 2020-05-20 | MMR | 13 | Mid-Level | Social Media Community Management | 9 | 7 | 8.0 | True |


In [9]:
print(df.dtypes)


First Name                           object
Last Name                            object
Email                                object
Application Date             datetime64[ns]
Country                              object
YOE                                   int64
Seniority                            object
Technology                           object
Code Challenge Score                  int64
Technical Interview Score             int64
Score                               float64
Hired                                  bool
dtype: object


This results in 12 columns, including the 2 new ones we have added:

**Score and Hired**

+ **Score** is a `float` column computed as the average of the two existing scores (`Code Challenge Score` and `Technical Interview Score`).  
+ **Hired** is a `boolean` column that indicates whether a candidate is **not hired (`False`)** or **hired (`True`)**, where hired means a **Score** of at least 7.  
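As a quick check, the two derived columns can be recomputed from the five score pairs shown in the `df.head()` output. The frame name `preview` is only illustrative.

```python
import pandas as pd

# The two score columns from the five rows shown by df.head().
preview = pd.DataFrame({
    "Code Challenge Score": [3, 2, 10, 7, 9],
    "Technical Interview Score": [3, 10, 9, 1, 7],
})

# Same rules as the cleaning cell: average the scores, then apply the threshold of 7.
preview["Score"] = (preview["Code Challenge Score"] + preview["Technical Interview Score"]) / 2
preview["Hired"] = preview["Score"] >= 7

print(preview["Score"].tolist())  # [3.0, 6.0, 9.5, 4.0, 8.0]
print(preview["Hired"].tolist())  # [False, False, True, False, True]
```

Because the rule uses the average, a perfect score in one test does not guarantee hiring: the candidate with 2 and 10 averages 6.0 and is not marked as hired.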


## We will continue with the migration in the third notebook:  

**dataMigration-003**
