# **Pandas Exercise**

1. Connect to the database using the `sqlite3` module.
2. Import all the data from the SQLite database into a DataFrame.

In [52]:
import sqlite3
import pandas as pd

conn = sqlite3.connect('pandas_exercise.db')
pandas_exercise_df = pd.read_sql_query("SELECT * FROM employee", conn)
pandas_exercise_df


Unnamed: 0,employee_id,first_name,last_name,department,salary
0,1,John,Doe,Sales,50000
1,2,Jane,,Marketing,60000
2,3,Michael,Johnson,Operations,55000
3,4,,Brown,Human Resources,58000
4,5,William,Jones,Sales,52000
...,...,...,...,...,...
104,105,Carson,Flores,Sales,75000
105,106,Naomi,Reed,Marketing,85000
106,107,Jaxon,Gonzalez,Operations,80000
107,108,Paisley,Hernandez,Human Resources,82000
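
The two steps above can be tried without the `pandas_exercise.db` file. The following is a minimal sketch, assuming a made-up two-row `employee` table in an in-memory database; the name `demo_conn`, the sample rows, and the `TEXT` type of the salary column are illustrative assumptions, not taken from the real database. It also closes the connection once the data is loaded, which the cell above leaves open.

```python
import sqlite3
from contextlib import closing

import pandas as pd

# Hypothetical stand-in for pandas_exercise.db: an in-memory database
# holding a tiny employee table with the same column names.
with closing(sqlite3.connect(":memory:")) as demo_conn:
    demo_conn.execute(
        "CREATE TABLE employee (employee_id INTEGER, first_name TEXT, "
        "last_name TEXT, department TEXT, salary TEXT)"
    )
    demo_conn.executemany(
        "INSERT INTO employee VALUES (?, ?, ?, ?, ?)",
        [
            (1, "John", "Doe", "Sales", "50000"),
            (2, "Jane", "", "Marketing", "60000"),  # empty string, not NULL
        ],
    )
    # read_sql_query runs the SELECT and returns the rows as a DataFrame.
    demo_df = pd.read_sql_query("SELECT * FROM employee", demo_conn)

# The connection is closed here; the DataFrame remains usable.
print(demo_df)
```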


### **Data Cleaning**

3. Count how many cells are empty.

In [53]:
pandas_exercise_df.isnull().sum()

employee_id    0
first_name     0
last_name      0
department     0
salary         0
dtype: int64

4. Count how many cells contain empty values represented as ''.

In [54]:
(pandas_exercise_df == "").sum()

employee_id    0
first_name     3
last_name      5
department     4
salary         5
dtype: int64

5. Replace the empty values that are represented as '' with the proper representation of null, which is NA.

In [90]:
pandas_exercise_df.replace("", pd.NA, inplace=True)

6. Confirm if the empty cells are now replaced with the correct representation of null.

In [56]:
pandas_exercise_df.isnull().sum()

employee_id    0
first_name     3
last_name      5
department     4
salary         5
dtype: int64
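
Steps 3 to 6 rest on the fact that `isnull()` does not treat an empty string as missing. A small sketch of that behaviour, using a made-up three-row frame (`demo_df` and its values are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical sample: one empty string in each column.
demo_df = pd.DataFrame(
    {
        "first_name": ["John", "", "Ann"],
        "salary": ["50000", "60000", ""],
    }
)

# An empty string is a real value, so isnull() reports nothing missing.
nulls_before = demo_df.isnull().sum()

# Comparing against "" finds the cells that are actually blank.
empties = (demo_df == "").sum()

# After swapping "" for pd.NA, isnull() sees the same cells as missing.
cleaned = demo_df.replace("", pd.NA)
nulls_after = cleaned.isnull().sum()

print(nulls_before, empties, nulls_after, sep="\n\n")
```

This sketch returns a new frame from `replace` instead of using `inplace=True`, which keeps the original data available for comparison.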

7. Delete the rows/records that contain null values.

In [91]:
pandas_exercise_df.dropna(inplace=True)

8. Confirm if the rows/records that contain null values have been deleted.

In [58]:
pandas_exercise_df.isnull().sum()

employee_id    0
first_name     0
last_name      0
department     0
salary         0
dtype: int64

9. Count how many rows/records are duplicated.

In [59]:
pandas_exercise_df.duplicated().sum()

np.int64(0)

Hint: Keep in mind that we have an `employee_id` column, which contains a unique value for each row/record. Because `df.duplicated()` compares every column by default, no two rows can count as duplicates while that column is present.

10. Delete the column that contains unique values and therefore affects the `df.duplicated().sum()` approach.

In [60]:
pandas_exercise_df = pandas_exercise_df.drop("employee_id", axis=1)
pandas_exercise_df.columns

Index(['first_name', 'last_name', 'department', 'salary'], dtype='object')

11. Count how many rows/records are duplicated.

In [61]:
pandas_exercise_df.duplicated().sum()

np.int64(7)
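
Dropping the column is one way to expose the duplicates. A possible alternative, sketched here on made-up data, is to keep `employee_id` and pass the remaining columns through the `subset` parameter of `duplicated` and `drop_duplicates`. The frame `demo_df` and the helper list `value_columns` are hypothetical names used only for this illustration.

```python
import pandas as pd

# Hypothetical sample: rows 0 and 1 describe the same person under two ids.
demo_df = pd.DataFrame(
    {
        "employee_id": [1, 2, 3],
        "first_name": ["John", "John", "Jane"],
        "last_name": ["Doe", "Doe", "Roe"],
    }
)

# With every column compared, the unique ids hide the repeated person.
all_column_dups = demo_df.duplicated().sum()

# Compare only the non-id columns instead of dropping employee_id.
value_columns = [col for col in demo_df.columns if col != "employee_id"]
subset_dups = demo_df.duplicated(subset=value_columns).sum()

# drop_duplicates accepts the same subset and keeps the first occurrence.
deduped = demo_df.drop_duplicates(subset=value_columns)

print(all_column_dups, subset_dups, len(deduped))
```

Whether to keep the id column depends on whether it is needed later; the exercise does not use it again, so dropping it is equally reasonable.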

12. Remove the duplicated rows/records.

In [92]:
pandas_exercise_df.drop_duplicates(inplace=True)

13. Check if the duplicated rows/records have been deleted.

In [63]:
pandas_exercise_df.duplicated().sum()

np.int64(0)

14. Check the data type of each column.

In [64]:
pandas_exercise_df.dtypes

first_name    object
last_name     object
department    object
salary        object
dtype: object

15. Convert each column to the right data type.

In [65]:
pandas_exercise_df["department"] = pandas_exercise_df["department"].astype("category")
pandas_exercise_df["salary"] = pandas_exercise_df["salary"].astype("int64")

16. Check if the data type is correct.

In [66]:
pandas_exercise_df.dtypes

first_name      object
last_name       object
department    category
salary           int64
dtype: object
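
The `astype("int64")` call works here because every remaining salary is a clean digit string. If that were not guaranteed, a more defensive option would be `pd.to_numeric` with `errors="coerce"`, which turns unparseable entries into NaN so they can be inspected. This is a sketch on an invented series (`raw_salary` and the `"n/a"` entry are assumptions), not a step the exercise requires.

```python
import pandas as pd

# Hypothetical salary column stored as text, with one unparseable entry.
raw_salary = pd.Series(["50000", "60000", "n/a"])

# errors="coerce" converts what it can and marks the rest as NaN
# instead of raising, which astype("int64") would do on "n/a".
parsed = pd.to_numeric(raw_salary, errors="coerce")

# Rows that failed to parse can be listed before deciding what to do.
bad_rows = raw_salary[parsed.isna()]

# Once the bad rows are handled, the column can be made integer.
clean_salary = parsed.dropna().astype("int64")

print(bad_rows.tolist(), clean_salary.dtype)
```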

### `Task 1:` Filter all employees from the Operations department who have a salary greater than 70,000

In [69]:
pandas_exercise_df[
    (pandas_exercise_df["department"] == "Operations")
    & (pandas_exercise_df["salary"] > 70000)
]

Unnamed: 0,first_name,last_name,department,salary
70,Wyatt,Allen,Operations,71000
74,Henry,King,Operations,72000
82,Tyler,Adams,Operations,74000
86,Eli,Martinez,Operations,75000
90,Luke,Johnson,Operations,76000
94,Eliana,Gomez,Operations,77000
98,Xavier,Taylor,Operations,78000
102,Leonardo,Hill,Operations,79000
106,Jaxon,Gonzalez,Operations,80000


### `Task 2:` Filter all employees from the Human Resources department who have a salary less than 65,000

In [70]:
pandas_exercise_df[
    (pandas_exercise_df["department"] == "Human Resources")
    & (pandas_exercise_df["salary"] < 65000)
]

Unnamed: 0,first_name,last_name,department,salary
7,Emma,Anderson,Human Resources,59000
12,Ava,Lopez,Human Resources,60000
21,Evelyn,Nguyen,Human Resources,62000
25,Liam,Gomez,Human Resources,63000
29,Lucas,Adams,Human Resources,64000
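
Tasks 1 and 2 combine two boolean masks with `&`. The same kind of filter can also be written with `DataFrame.query`, which some readers find easier to scan. The sketch below uses a made-up three-row frame (`demo_df` and its values are assumptions) and is an alternative spelling, not a replacement for the approach above.

```python
import pandas as pd

# Hypothetical sample with the same column names as the exercise data.
demo_df = pd.DataFrame(
    {
        "first_name": ["Ava", "Ben", "Cal"],
        "department": ["Operations", "Operations", "Sales"],
        "salary": [71000, 65000, 90000],
    }
)

# Boolean-mask form, as used in the tasks above.
mask_result = demo_df[
    (demo_df["department"] == "Operations") & (demo_df["salary"] > 70000)
]

# Equivalent query-string form.
query_result = demo_df.query("department == 'Operations' and salary > 70000")

print(query_result)
```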


### `Task 3:` Calculate the total number of employees for each department

In [75]:
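# size() already returns a Series indexed by department, so no empty DataFrame is needed first.
# observed=True limits the result to departments that actually occur in the data.
total_employee_per_department_df = pandas_exercise_df.groupby("department", observed=True).size()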
total_employee_per_department_df

department
Human Resources    22
Marketing          23
Operations         23
Sales              22
dtype: int64

### `Task 4:` Calculate the average salary per department

In [87]:
avg_salary_df = pd.DataFrame()
avg_salary_df["salary"] = pandas_exercise_df.groupby("department", observed=True)["salary"].mean().round(2).sort_values()
avg_salary_df

department,salary
Sales,63454.55
Operations,67869.57
Human Resources,70636.36
Marketing,72869.57


### `Task 5:` Calculate the maximum salary per department

In [88]:
max_salary_df = pd.DataFrame()
max_salary_df["salary"] = pandas_exercise_df.groupby("department", observed=True)["salary"].max().sort_values()
max_salary_df

department,salary
Sales,75000
Operations,80000
Human Resources,82000
Marketing,85000


### `Task 6:` Calculate the minimum salary per department

In [89]:
min_salary_df = pd.DataFrame()
min_salary_df["salary"] = pandas_exercise_df.groupby("department", observed=True)["salary"].min().sort_values()
min_salary_df

department,salary
Sales,50000
Operations,55000
Human Resources,59000
Marketing,60000
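
Tasks 3 to 6 each run their own `groupby`. As a possible shortcut, the four statistics can be collected in one pass with named aggregation. The sketch below runs on a made-up three-row frame; `demo_df`, `summary`, and the output column names are illustrative assumptions. Passing `observed=True` explicitly is also, as far as I know, what recent pandas versions ask for when grouping on a categorical column.

```python
import pandas as pd

# Hypothetical sample with a categorical department column,
# mirroring the dtypes set in step 15.
demo_df = pd.DataFrame(
    {
        "department": pd.Categorical(["Sales", "Sales", "Marketing"]),
        "salary": [50000, 52000, 60000],
    }
)

# One groupby, four named aggregations on the salary column.
# "count" equals the row count here because no salary is missing.
summary = (
    demo_df.groupby("department", observed=True)["salary"]
    .agg(employees="count", average="mean", maximum="max", minimum="min")
    .round(2)
)

print(summary)
```

A single summary table like this can also serve as the starting point for the report in Task 7.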


### `Task 7:` Create a report that lists all of the information that you gathered from the data provided.

Nine employees from Operations have a salary greater than 70,000.
Five employees from Human Resources have a salary less than 65,000.
Employees are distributed almost evenly across departments, with 22 or 23 in each.
Marketing has the highest average salary (72,869.57) and Sales has the lowest (63,454.55).
The maximum salary per department ranges from 75,000 to 85,000, while the minimum ranges from 50,000 to 60,000.