<div class="alert alert-block alert-info">
Singapore Management University<br>
CS105 Statistical Thinking for Data Science, 2024/25 Term 2
</div>

# CS105 Group Project Submission (Part I)

-----
Provide your team details, including section, team number, team members, and the name of the dataset. 
Complete all of the following sections. For any part requiring code to derive your answers, please create a code cell immediately below your response and run the code.
To edit any markdown cell, double click the cell; after editing, execute the markdown cell to collapse it.
<br>
-----

## Declaration

<span style="color:red">By submitting this notebook, we declare that **no part of this submission is generated by any AI tool**. We understand that AI-generated submissions will be considered as plagiarism, and just like other plagiarism cases, disciplinary actions will be imposed.</span>

#### Section:   G5
#### Team:      T1
#### Members:
1. Zachary Tay
2. Bryan Lee
3. Ang Qi Long
4. Jonathan Wong
5. Swayam Jain

#### Dataset: Employee

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import math

df = pd.read_csv('employee.csv')

## Part I: Exploratory Data Analysis (EDA) [8% of final grade]

### 1. Overview of dataset [15% of Part I]

**a.** Summarise the background of the dataset [limited to 50 words]

<div style="text-align: justify;">
This dataset contains <b>HR data of all employees under a sales team</b>. The data includes <b>personal and employment details</b>, <b>total career sales acquired</b> and <b>latest quarterly rating</b>. An employee’s data is <b>captured at the beginning of each month</b>, either <b>up to the latest month</b> (Dec 2017) or <b>when they quit</b>.
</div>


**b.** State the size of the dataset

**Size**
- **Rows**: 2381
- **Columns**: 13


In [2]:
n_rows, n_cols = df.shape
print(f"{n_rows} Rows")
print(f"{n_cols} Columns")

2381 Rows
13 Columns


**c.** For each variable, describe what it represents and its data type (numerical or categorical)

**Date**
- **Type**: Categorical (?Nominal?)<br>
- **Info**: The date when the specific row’s data is recorded 

**Emp_ID**
- **Type**: Categorical (Nominal)<br>
- **Info**: The unique ID of the employee

**Age**
- **Type**: Numerical (Discrete)<br>
- **Info**: The age of the employee
  
**Gender**
- **Type**: Categorical (Nominal)<br>
- **Info**: The employee’s gender (Male or Female)

**City**
- **Type**: Categorical (Nominal)<br>
- **Info**: The city associated with the employee

**Education**
- **Type**: Categorical (Ordinal)<br>
- **Info**: Highest education of the employee (College, Bachelor, Master)

**Salary**
- **Type**: Numerical (Discrete)<br>
- **Info**: Current salary of the employee excluding bonus 

**Join_Date**
- **Type**: Categorical (?Nominal?)<br>
- **Info**: The date when the employee joins the company

**Last_Work_Date**
- **Type**: Categorical (?Nominal?)<br>
- **Info**: The date when the employee leaves the company; empty if the employee has not quit

**Join_Designation**
- **Type**: Categorical (Ordinal)<br>
- **Info**: Designation level when the employee joined the company (1, 2, 3, 4, 5)

**Designation**
- **Type**: Categorical (Ordinal)<br>
- **Info**: Current designation level of the employee (1, 2, 3, 4, 5)

**Total_Sales_Acquired**
- **Type**: Numerical (Discrete)<br>
- **Info**: Total sales generated by the employee since joining the team

**Quarterly_Rating**
- **Type**: Categorical (Ordinal)<br>
- **Info**: Last quarterly performance rating (1, 2, 3, 4)
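
As a rough illustration of how the nominal/ordinal distinction above could be carried into pandas, the sketch below uses a small made-up frame (the rows are hypothetical, not taken from `employee.csv`) and assumes the ordering College < Bachelor < Master for `Education`.

```python
import pandas as pd

# Hypothetical toy rows; the real values in employee.csv may differ.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Education": ["Master", "College", "Bachelor"],
})

# Nominal: categories without any order.
toy["Gender"] = pd.Categorical(toy["Gender"])

# Ordinal: categories with an explicit order (assumed College < Bachelor < Master).
education_levels = ["College", "Bachelor", "Master"]
toy["Education"] = pd.Categorical(
    toy["Education"], categories=education_levels, ordered=True
)

# Ordered categoricals support comparisons and min/max.
highest = toy["Education"].max()
above_college = (toy["Education"] > "College").sum()
print(highest, above_college)
```

Storing the order in the dtype keeps later sorting and comparisons consistent with the ordinal classification, whereas a nominal column such as `Gender` deliberately has no order.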


****

### 2. Data pre-processing [35% of Part I]

**a.** For each variable, determine the percentage of missing data. For any column with missing data, describe how you resolve the issue. Clearly state any assumption you made.

| Variable w/ Missing Data | Count | Percentage |
| :---------------- | :------: | ----: |
| Join_Date | 118 | 4.96% |
| Last_Work_Date | 765 | 32.13% |
| Join_Designation | 105 | 4.41% |  



In [4]:
missing_count = n_rows - df.count()              # total rows - rows with non-null values
missing_percent = (missing_count / n_rows * 100) # missing rows / total rows

missing_data = pd.DataFrame({'Count': missing_count, 'Percentage': round(missing_percent,2)})
missing_data = missing_data[missing_data['Count'] > 0]  # filter out variable w/o missing data

missing_data

                  Count  Percentage
Join_Date           118        4.96
Last_Work_Date      765       32.13
Join_Designation    105        4.41


#### Join Date
- **Resolution**: Drop all rows with missing `Join_Date`
- **Reason**: As an employee's data is updated every month, there is no past record from which to check their join date. We therefore cannot reasonably ascertain when they joined the sales team. Additionally, as the duration of employment will impact other variables and the percentage of missing data is not too high (4.96%), we opted to drop the rows with missing `Join_Date`.
- **Assumption(s)**:
    - Each employee will only have one Emp_ID unique to them
    - An employee who had quit will not join the sales team again nor gain a new Emp_ID
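
The first assumption could be checked directly before relying on it. The following is a minimal sketch on a hypothetical toy frame (Emp_ID 2 is deliberately repeated to show what a violation would look like); on the real data the same two calls would be applied to `df`.

```python
import pandas as pd

# Hypothetical toy rows; Emp_ID 2 appears twice on purpose.
toy = pd.DataFrame({
    "Emp_ID": [1, 2, 2, 3],
    "Join_Date": ["1/1/2016", "1/3/2016", None, "1/6/2017"],
})

# True only if every Emp_ID occurs exactly once.
ids_unique = toy["Emp_ID"].is_unique

# Every row that shares its Emp_ID with another row (keep=False marks all copies).
repeated = toy[toy["Emp_ID"].duplicated(keep=False)]

print(ids_unique)
print(repeated)
```

If repeated IDs did exist, a missing `Join_Date` might be recoverable from another row of the same employee, so the check is worth running before dropping rows.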

In [11]:
df.dropna(subset=['Join_Date'], inplace=True)           # drop all rows with null values under Join_Date
df.count()                                              # count should be 2263 (2381-118)   *if code is run individually

Date                    2263
Emp_ID                  2263
Age                     2263
Gender                  2263
City                    2263
Education               2263
Salary                  2263
Join_Date               2263
Last_Work_Date          1535
Join_Designation        2161
Designation             2263
Total_Sales_Acquired    2263
Quarterly_Rating        2263
dtype: int64

#### Last Work Date
- **Resolution**: Drop rows with missing `Last_Work_Date` **and** `Date` that is earlier than 1/12/2017
- **Reason**: Only the rows before the latest month (Dec 2017) are affected, as an older `Date` suggests that an employee is no longer with the sales team. Rows with missing data captured in the latest month inform us that the employee is still working for the sales team. We again cannot reasonably ascertain when the employee quit, as an employee can quit at any time in a given month and can make any number of sales in that period, thus impacting other variables. Therefore, as the number and percentage of affected rows is not too high (24, 1.01%), we opted to drop the rows that have a missing `Last_Work_Date` and a `Date` before the latest month.
- **Assumption(s)**:
    - There is no delay in capturing when an employee's last date of work was
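
One practical detail when comparing against 1/12/2017: the string is ambiguous between 1 Dec 2017 and 12 Jan 2017. The sketch below assumes the dates are stored day-first as `D/M/YYYY` strings (an assumption suggested by the "1/12/2017" notation above, not verified against the file) and shows how the two readings differ on hypothetical values.

```python
import pandas as pd

# Hypothetical day-first date strings (1 Nov 2017 and 1 Dec 2017).
raw = pd.Series(["1/11/2017", "1/12/2017"])

# Month-first reading: interpreted as 11 Jan and 12 Jan 2017.
month_first = pd.to_datetime(raw, format="%m/%d/%Y")

# Day-first reading: interpreted as 1 Nov and 1 Dec 2017.
day_first = pd.to_datetime(raw, format="%d/%m/%Y")

# The latest snapshot date under the day-first reading.
latest = day_first.max()

print(month_first.dt.month.tolist())
print(day_first.dt.month.tolist())
print(latest)
```

Passing an explicit `format` (or `dayfirst=True`) to `pd.to_datetime` makes the intended reading visible in the code instead of leaving it to inference.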

In [None]:
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)   # assumes dates are stored day-first, e.g. 1/12/2017 = 1 Dec 2017
n_rows = df.shape[0]
latest_date = df['Date'].max()                            # latest snapshot in the data; expected to be 1 Dec 2017

unknownLastDate = df[(df['Last_Work_Date'].isnull()) & (df['Date'] < latest_date)].shape[0]
print("Unknown Last Date Worked:", unknownLastDate, round(unknownLastDate / n_rows * 100, 2), "%")

stillWorking = df[(df['Last_Work_Date'].isnull()) & (df['Date'] == latest_date)].shape[0]
print("Still Working:", stillWorking, round(stillWorking / n_rows * 100, 1), "%")

Unknown Last Date Worked: 24 1.01 %
Still Working: 741 31.1 %


In [None]:
unknownLastDate = df[(df['Last_Work_Date'].isnull()) & (df['Date'] < latest_date)]      # select rows with no Last_Work_Date recorded before the latest month
df.drop(unknownLastDate.index, inplace=True)                                            # drop those rows by index
df.count()                                                                              # count should be 2357 (2381-24)   *if code is run individually

#### Join_Designation
- **Resolution**: For rows with missing `Join_Designation`, impute those where `Designation == 1` with 1 and drop rows where `Designation > 1`
- **Reason**: `Designation` captures the employee's current designation level at the time their data was recorded. If the current level is 1, the lowest level, we can definitively deduce that `Join_Designation` is also 1. For any current level higher than 1, we again cannot reasonably ascertain the initial designation level, as it likely varies with other variables. As the number and percentage of rows with missing data where `Designation > 1` is not too high (83, 3.49%), we opted to drop these rows and impute the rows where `Designation == 1` (22, 0.924%) with 1.
- **Assumption(s)**:
    - An employee's designation level is never lowered after joining

In [10]:
cdIs1 = df[(df['Join_Designation'].isnull()) & (df['Designation'] == 1)].shape[0]
cdNot1 = df[(df['Join_Designation'].isnull()) & (df['Designation'] != 1)].shape[0]
n_rows = df.shape[0]

print("(Current) Designation = 1:", cdIs1, round(cdIs1 / n_rows * 100, 3), "%")
print("(Current) Designation > 1:", cdNot1, round(cdNot1 / n_rows * 100, 2), "%")

(Current) Designation = 1: 22 0.924 %
(Current) Designation > 1: 83 3.49 %


In [None]:
cdNot1 = df[(df['Join_Designation'].isnull()) & (df['Designation'] != 1)]   # select rows with missing Join_Designation where Designation > 1
df.drop(cdNot1.index, inplace=True)                                         # drop those rows by index

cdIs1 = df[(df['Join_Designation'].isnull()) & (df['Designation'] == 1)]    # select rows with missing Join_Designation where Designation is 1
df.loc[cdIs1.index, "Join_Designation"] = 1                                 # impute those rows with 1

df['Join_Designation'] = df['Join_Designation'].astype(int)                 # convert float (1.0) to int (1); requires no remaining nulls
df.count()                                                                  # count should be 2298 (2381-83)   *if code is run individually

**b.** For each variable, identify outliers (if any) and describe how you resolve the issue. Clearly state any assumption you made.

**Response.** 

**c.** For categorical variables, perform the necessary encoding.

**Response.** 

### 3.	Exploratory analysis and visualization [50% of Part I]

**a.** For each variable, provide relevant summary statistics

**Response.** 

**b.** For each variable, provide an appropriate visualisation depicting the distribution of its values, and summarize any key observation(s) you made.

**Response.** 

**c.** Perform bi-variate analysis on the variables. You do not need to present the analysis of every pair of variables; only focus on the pairs you believe are worth investigating and explain. For each pair, describe the relationship between the two variables. Use appropriate statistical methods and/or visualizations.

**Response.** 