# EDA Report

In this Jupyter Notebook (Python kernel in a venv), I build an EDA report. It reads the previously loaded PostgreSQL table **raw_applicant** from the **etl_workshop_first** database, cleans and transforms the data into something that adds value, and loads the result into a new PostgreSQL table called **applicant**.

Later, I will connect Power BI to the **applicant** table in PostgreSQL to communicate the specified objectives:

- Hires by technology.

- Hires by year.

- Hires by seniority.

- Hires by country over years (USA, Brazil, Colombia, and Ecuador only).

In this file I create:

1. The connection between PostgreSQL and Python.

2. The database called **etl_workshop_first**.

3. Two tables with their properties: **raw_applicant** and **applicant**.

Import libraries:

In [None]:
import pandas as pd
import numpy as np
from sqlalchemy import create_engine

from connect_database import Connection_Postgres


## Let's get to know the table
This table has 50,000 rows and 10 columns.

In [None]:
# Create connection with PostgreSQL
connection = Connection_Postgres()
cursor = connection.connection.cursor()
# Query the data
query_to_do = "SELECT first_name, last_name, email, applicant_date, country, experience_year, seniority, technology, code_challenge_score, technical_interview_score FROM raw_applicant"
cursor.execute(query_to_do)
record_table = cursor.fetchall()
# Get column names
column_names = [desc[0] for desc in cursor.description]
# Create dataframe
dataframe = pd.DataFrame(record_table, columns=column_names)
# Create connection with engine
connection_string = f"postgresql://{connection.connection_config['user']}:{connection.connection_config['password']}@{connection.connection_config['host']}:{connection.connection_config['port']}/{connection.connection_config['database']}"
postgres_engine = create_engine(connection_string)
# Close connection
connection.close_connection_database()
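As a possible shortcut for the cell above, pandas can run the query and build the dataframe in one call with `pd.read_sql_query`. The sketch below is only an illustration: it uses an in-memory SQLite database as a stand-in for PostgreSQL so it runs anywhere, and the reduced column list and sample rows are invented for the demo.

```python
import sqlite3

import pandas as pd

# Assumption: an in-memory SQLite database stands in for PostgreSQL (demo only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE raw_applicant ("
    "first_name TEXT, country TEXT, code_challenge_score INTEGER)"
)
# Invented sample rows, not taken from the real table.
conn.executemany(
    "INSERT INTO raw_applicant VALUES (?, ?, ?)",
    [("Ana", "Colombia", 8), ("Luis", "Ecuador", 4)],
)
conn.commit()

# read_sql_query executes the query and returns a dataframe with named columns.
sample_dataframe = pd.read_sql_query(
    "SELECT first_name, country, code_challenge_score FROM raw_applicant", conn
)
conn.close()

print(sample_dataframe.shape)
```

In this notebook, the equivalent call would presumably be `pd.read_sql_query(query_to_do, postgres_engine)`, which would avoid handling the cursor and the column names by hand.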

In [None]:
# Check info
dataframe.info()

Describing the table:

In [None]:
# Describe data
dataframe.describe()

Checking for missing (NaN) values that are not obvious at a glance:

In [None]:
dataframe.isna().sum()

Checking the values of each column of the table:

In [None]:
for column in dataframe.columns:
    print(dataframe[column].value_counts())
    print("-"*10)

I see that the two score columns (code_challenge_score and technical_interview_score) have values between 0 and 10, where 10 is the maximum score.
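The 0 to 10 range can also be checked programmatically instead of by reading the `value_counts()` output. This is a minimal sketch on a small invented sample; with the real data, `sample` would be replaced by `dataframe`.

```python
import pandas as pd

# Invented sample scores, used only to illustrate the check.
sample = pd.DataFrame(
    {
        "code_challenge_score": [0, 3, 10],
        "technical_interview_score": [7, 10, 1],
    }
)
score_columns = ["code_challenge_score", "technical_interview_score"]

# For each score column, test whether every value lies in [0, 10] (inclusive).
in_range = sample[score_columns].apply(lambda col: col.between(0, 10)).all()
print(in_range)

# The observed minimum and maximum per column give the same information.
print(sample[score_columns].agg(["min", "max"]))
```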

### Time analysis
Let's dig deeper into the applicant_date column.

In [None]:
dataframe['applicant_date'] = pd.to_datetime(dataframe['applicant_date'], format='mixed')
dataframe['applicant_year'] = dataframe['applicant_date'].dt.year
dataframe['applicant_month_name'] = dataframe['applicant_date'].dt.month_name()
dataframe['applicant_month'] = dataframe['applicant_date'].dt.month

In [None]:
dataframe['applicant_year'].value_counts()

In [None]:
dataframe['applicant_month_name'].value_counts()
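Note that `value_counts()` orders months by frequency, not by calendar position. If calendar order is preferred, one option is to group by the month number together with the month name. The sketch below assumes ISO-formatted dates and uses a small invented sample.

```python
import pandas as pd

# Invented sample dates, used only to illustrate the grouping.
sample = pd.DataFrame(
    {"applicant_date": ["2021-02-26", "2021-09-09", "2020-04-01", "2021-02-03"]}
)
sample["applicant_date"] = pd.to_datetime(sample["applicant_date"])
sample["applicant_month"] = sample["applicant_date"].dt.month
sample["applicant_month_name"] = sample["applicant_date"].dt.month_name()

# Grouping by the month number first keeps the result in calendar order.
monthly_counts = (
    sample.groupby(["applicant_month", "applicant_month_name"])
    .size()
    .reset_index(name="applicants")
)
print(monthly_counts)
```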

### Let's create the new column indicating whether the applicant is hired
I use code_challenge_score and technical_interview_score for this: an applicant counts as hired when both scores are at least 7.

In [None]:
dataframe['is_hire'] = np.where((dataframe['code_challenge_score'] >= 7) & (dataframe['technical_interview_score'] >= 7), 1, 0)
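To see how the rule behaves at the boundary, and how `is_hire` feeds an objective such as hires by technology, here is a small sketch. The rows and technology names are invented for illustration. The grouping step is just one possible way to aggregate, since the actual reporting is done in Power BI.

```python
import numpy as np
import pandas as pd

# Invented sample rows; the technology names are placeholders.
sample = pd.DataFrame(
    {
        "technology": ["DevOps", "DevOps", "Data Engineer", "Data Engineer"],
        "code_challenge_score": [7, 9, 6, 10],
        "technical_interview_score": [7, 5, 8, 10],
    }
)

# Same rule as above: hired only when BOTH scores are at least 7.
sample["is_hire"] = np.where(
    (sample["code_challenge_score"] >= 7)
    & (sample["technical_interview_score"] >= 7),
    1,
    0,
)

# Because is_hire is 0/1, summing it per group counts the hires.
hires_by_technology = sample.groupby("technology")["is_hire"].sum()
print(sample[["technology", "is_hire"]])
print(hires_by_technology)
```

A score of exactly 7 in both columns counts as a hire, while a high score in only one column does not.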

### Let's load this dataframe into the applicant table in PostgreSQL

In [None]:
# Load data obtained to PostgreSQL
dataframe.to_sql('applicant', postgres_engine, if_exists='replace', index=False)
connection.log('Data loaded to {}: {} rows - {} columns.'.format('applicant', dataframe.shape[0], dataframe.shape[1]))

Data loaded:

![Data load to applicant][def]

[def]: figures/applicant_data.png
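
Besides the screenshot, the load can be verified by reading the row count back from the target table. The sketch below is an assumption-laden illustration: it uses an in-memory SQLite database instead of PostgreSQL, and the sample rows are invented.

```python
import sqlite3

import pandas as pd

# Invented sample standing in for the transformed dataframe.
sample = pd.DataFrame(
    {"country": ["Colombia", "Brazil", "Ecuador"], "is_hire": [1, 0, 1]}
)

# Assumption: in-memory SQLite replaces the PostgreSQL engine (demo only).
conn = sqlite3.connect(":memory:")
sample.to_sql("applicant", conn, if_exists="replace", index=False)

# Read the row count back and compare it with the dataframe length.
loaded_rows = pd.read_sql_query(
    "SELECT COUNT(*) AS total FROM applicant", conn
)["total"].iloc[0]
conn.close()

print(loaded_rows == len(sample))
```

With the notebook's objects, the same query could presumably be run against `postgres_engine` and compared with `dataframe.shape[0]`.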