<div class="alert alert-block alert-info">
Singapore Management University<br>
CS105 Statistical Thinking for Data Science, 2024/25 Term 2
</div>

# CS105 Group Project Submission (Part I)

-----
Provide your team details, including section, team number, team members, and the name of the dataset. 
Complete all of the following sections. For any part requiring code to derive your answers, please create a code cell immediately below your response and run the code.
To edit any markdown cell, double click the cell; after editing, execute the markdown cell to collapse it.
<br>
-----

## Declaration

<span style="color:red">By submitting this notebook, we declare that **no part of this submission is generated by any AI tool**. We understand that AI-generated submissions will be considered as plagiarism, and just like other plagiarism cases, disciplinary actions will be imposed.</span>

#### Section:   G5
#### Team:      T1
#### Members:
1. Zachary Tay
2. Bryan Lee
3. Ang Qi Long
4. Jonathan Wong
5. Swayam Jain

#### Dataset: Employee

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import math

df = pd.read_csv('employee.csv')

<a id="menu"></a>
#### Table of Contents


1. [Overview of Dataset](#part1)
2. [Data Pre-processing](#part2)
3. [Exploratory Analysis and Visualization](#part3)

<a id="part1"></a>

## Part I: Exploratory Data Analysis (EDA) [8% of final grade]

<a id="part1"></a>
### 1. Overview of dataset [15% of Part I]
a. [Background](#part1a) <br>
b. [Size](#part1b) <br>
c. [Variables](#part1c)

_[(Back Top)](#menu)_

### **a.** Summarise the background of the dataset [limited to 50 words]

<div style="text-align: justify;">
This dataset contains <b>HR data of all employees under a sales team</b>. The data includes <b>personal and employment details</b>, <b>total career sales acquired</b> and <b>latest quarterly rating</b>. An employee’s data is <b>captured at the beginning of each month</b>, either <b>up to the latest month</b> (Dec 2017) or <b>when they quit</b>.
</div>


<a id="part1b"></a>

### **b.** State the size of the dataset 

**Size**
- **Rows**: 2381
- **Columns**: 13


In [None]:
n_rows, n_cols = df.shape
print(f"{n_rows} Rows")
print(f"{n_cols} Columns")

<a id="part1c"></a> [(Back)](#part1)

### **c.** For each variable, describe what it represents and its data type (numerical or categorical)

**Date**
- **Type**: Categorical (Ordinal)<br>
- **Info**: The date when the specific row’s data is recorded 

**Emp_ID**
- **Type**: Categorical (Nominal)<br>
- **Info**: The unique ID of the employee

**Age**
- **Type**: Numerical (Discrete)<br>
- **Info**: The age of the employee
  
**Gender**
- **Type**: Categorical (Nominal)<br>
- **Info**: The employee’s gender (Male or Female)

**City**
- **Type**: Categorical (Nominal)<br>
- **Info**: The city where the employee works (C1, C2, ..., C29)

**Education**
- **Type**: Categorical (Ordinal)<br>
- **Info**: Highest education of the employee (College, Bachelor, Master)

**Salary**
- **Type**: Numerical (Discrete)<br>
- **Info**: Current salary of the employee excluding bonus 

**Join_Date**
- **Type**: Categorical (Ordinal)<br>
- **Info**: The date when the employee joins the company

**Last_Work_Date**
- **Type**: Categorical (Ordinal)<br>
- **Info**: The date when the employee leaves the company; empty if the employee has not quit

**Join_Designation**
- **Type**: Categorical (Ordinal)<br>
- **Info**: Designation level when the employee joined the company (1, 2, 3, 4, 5)

**Designation**
- **Type**: Categorical (Ordinal)<br>
- **Info**: Current designation level of the employee (1, 2, 3, 4, 5)

**Total_Sales_Acquired**
- **Type**: Numerical (Discrete)<br>
- **Info**: Total sales generated by the employee since joining the team

**Quarterly_Rating**
- **Type**: Categorical (Ordinal)<br>
- **Info**: Last quarterly performance rating (1, 2, 3, 4)


[(Back)](#part1) <a id="part2"></a>

*****
## 2. Data pre-processing [35% of Part I]
a. [Missing Data](#part2a) <br>
b. [Outlier](#part2b) <br>
c. [Encoding](#part2c)                                   

_[(Back Top)](#menu)_

<a id="part2a"></a>

### **a.** For each variable, determine the percentage of missing data. For any column with missing data, describe how you resolve the issue. Clearly state any assumption you made.

| Variable w/ Missing Data | Count | Percentage |
| :---------------- | :------: | ----: |
| Join_Date | 118 | 4.96% |
| Last_Work_Date | 765 | 32.13% |
| Join_Designation | 105 | 4.41% |  
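
As a minimal sketch of how such a table can be derived, the block below uses a small toy frame (the values are hypothetical and only stand in for the employee data). It relies on `isna().sum()` for the counts and `isna().mean()` for the fraction of missing rows per column.

```python
import pandas as pd

# Toy frame with hypothetical values standing in for the employee data
toy = pd.DataFrame({
    "Join_Date": ["01/01/2016", None, "15/03/2017", "20/07/2016"],
    "Last_Work_Date": [None, None, "31/10/2017", None],
    "Age": [30, 41, 28, 35],
})

missing_count = toy.isna().sum()            # number of missing cells per column
missing_percent = toy.isna().mean() * 100   # share of missing rows per column, in percent

summary = pd.DataFrame({"Count": missing_count, "Percentage": missing_percent.round(2)})
summary = summary[summary["Count"] > 0]     # keep only the columns with missing data
print(summary)
```

Columns without missing values (here `Age`) are filtered out, so only the affected variables are listed.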



In [None]:
def displayMissing() :
    missing_count = df.shape[0] - df.count()              # total rows - rows with non-null values
    missing_percent = (missing_count / df.shape[0] * 100) # missing rows / total rows

    missing_data = pd.DataFrame({'Count': missing_count, 'Percentage': round(missing_percent,2)})
    missing_data = missing_data[missing_data['Count'] > 0]  # filter out variable w/o missing data

    return missing_data

displayMissing()

#### **Join Date**
- **Resolution**: Drop all rows with missing `Join_Date`
- **Reason**: As an employee's data is updated every month, there is no past record from which to recover their join date, so we cannot reasonably ascertain when they joined the sales team. Additionally, as the duration of employment affects other variables and the percentage of missing data is low (4.96%), we opted to drop the rows with missing `Join_Date`.
- **Assumption(s)**:
    - Each employee will only have one Emp_ID unique to them
    - An employee who had quit will not join the sales team again nor gain a new Emp_ID

In [None]:
df.dropna(subset=['Join_Date'], inplace = True)         # drop all rows with null values under Join_Date
displayMissing()                                        # Join_Date count is 0 (Last_Work_Date & Join_Designation counts are affected)

#### **Last Work Date**
- **Resolution**: For rows with missing `Last_Work_Date`,
    - If `Date` is before 1/12/2017, drop rows
    - If `Date` is 1/12/2017, impute rows with 31/12/2017
- **Reason**: 
    - For rows before Dec 2017, an older `Date` suggests that the employee is no longer with the sales team. The employee may have quit on any day within that month and made any number of sales in that period, which affects the other variables. As we again cannot reasonably ascertain when the employee quit and the number of affected rows is low (24, 1.06%), we opted to drop these rows with missing `Last_Work_Date`.
    - For rows in Dec 2017, a blank `Last_Work_Date` indicates that the employee had not quit by the end of that month. We therefore treat 31 Dec 2017 as the last date they worked (or were employed) and impute with this date.
- **Assumption(s)**: 
    - There are no other employees who quit within Dec 2017 beyond those given in the dataset 
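
The two-part rule can be sketched on a toy frame as follows. The rows are hypothetical, and the names `cutoff` and `cleaned` are introduced only for illustration.

```python
import pandas as pd

# Hypothetical rows: Emp 1 has no Last_Work_Date before Dec 2017,
# Emp 2 has no Last_Work_Date in Dec 2017, Emp 3 and 4 have a recorded date
toy = pd.DataFrame({
    "Emp_ID": [1, 2, 3, 4],
    "Date": pd.to_datetime(
        ["01/11/2017", "01/12/2017", "01/12/2017", "01/06/2017"], dayfirst=True
    ),
    "Last_Work_Date": [None, None, "15/12/2017", "20/06/2017"],
})

cutoff = pd.Timestamp("2017-12-01")
missing_lwd = toy["Last_Work_Date"].isna()

# Rule 1: missing Last_Work_Date and recorded before Dec 2017 -> drop the row
cleaned = toy[~(missing_lwd & (toy["Date"] < cutoff))].copy()

# Rule 2: any remaining missing value belongs to a Dec 2017 row -> impute 31/12/2017
cleaned["Last_Work_Date"] = cleaned["Last_Work_Date"].fillna("31/12/2017")
print(cleaned)
```

Applying the drop before the imputation matters: filling first would also assign 31/12/2017 to the rows that are meant to be removed.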

In [None]:
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)

unknownLastDate = df[(df['Last_Work_Date'].isnull()) & (df['Date'] < '2017-12-01')].shape[0]
print("Unknown Last Date:", unknownLastDate, round(unknownLastDate / df.shape[0] * 100, 2), "%")

stillWorking = df[(df['Last_Work_Date'].isnull()) & (df['Date'] == '2017-12-01')].shape[0]
print("Still Working:", stillWorking, round(stillWorking / df.shape[0] * 100, 1), "%")

In [None]:
unknownLastDate = df[(df['Last_Work_Date'].isnull()) & (df['Date'] < '2017-12-01')]     # select rows with no Last_Work_Date recorded before Dec 2017
df.drop(unknownLastDate.index, inplace=True)                                            # drop those rows by index

df.fillna({"Last_Work_Date": "31/12/2017"}, inplace=True)                               # impute rows of employees still working with the sales team
                                                                                        # with the last day of the month (31 Dec 2017)

displayMissing()                                                                        # Last_Work_Date count is 0 (Join_Designation count is also affected)

#### **Join Designation**
- **Resolution**: For rows with missing `Join_Designation`, 
    - If `Designation == 1`, impute rows with 1
    - If `Designation > 1`, drop these rows
- **Reason**: 
    - As `Designation` captures the employee's current designation level when their data was recorded, a current level of 1 lets us deduce that `Join_Designation` is 1 too, provided employees are never demoted (level 1 is the lowest level).
    - For any current designation level higher than 1, we again cannot reasonably ascertain the initial designation level, as it likely varies with other variables. As the number of rows with missing data where `Designation > 1` is low (78, 3.48%), we opted to drop these rows and to impute the rows where `Designation == 1` (22, 0.983%) with 1.
- **Assumption(s)**: -  
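
A minimal sketch of this rule on hypothetical rows is shown below. It assumes that designation levels never decrease, so a current level of 1 implies a joining level of 1.

```python
import pandas as pd

# Hypothetical rows: Emp 2 and Emp 3 have a missing Join_Designation
toy = pd.DataFrame({
    "Emp_ID": [1, 2, 3, 4],
    "Join_Designation": [1.0, None, None, 2.0],
    "Designation": [2, 1, 3, 2],
})

missing_jd = toy["Join_Designation"].isna()

# Current level 1 -> joined at level 1 (assumption: no demotions)
toy.loc[missing_jd & (toy["Designation"] == 1), "Join_Designation"] = 1

# Anything still missing has Designation > 1 and cannot be recovered -> drop
cleaned = toy.dropna(subset=["Join_Designation"]).copy()
cleaned["Join_Designation"] = cleaned["Join_Designation"].astype(int)
print(cleaned)
```

The cast to `int` is only safe after the missing values are gone, because a column containing `NaN` is stored as float.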

In [None]:
cdIs1 = df[(df['Join_Designation'].isnull()) & (df['Designation'] == 1)].shape[0]
cdNot1 = df[(df['Join_Designation'].isnull()) & (df['Designation'] != 1)].shape[0]

print("(Current) Designation = 1:", cdIs1, round(cdIs1 / df.shape[0] * 100, 3), "%")
print("(Current) Designation > 1:", cdNot1, round(cdNot1 / df.shape[0] * 100, 2), "%")

In [None]:
cdNot1 = df[(df['Join_Designation'].isnull()) & (df['Designation'] != 1)]   # select rows where Join_Designation is missing and Designation > 1
df.drop(cdNot1.index, inplace=True)                                         # drop those rows by index

cdIs1 = df[(df['Join_Designation'].isnull()) & (df['Designation'] == 1)]    # select rows where Join_Designation is missing and Designation is 1
df.loc[cdIs1.index, "Join_Designation"] = 1                                 # impute those rows with 1

df['Join_Designation'] = df['Join_Designation'].astype(int)                 # convert float (e.g. 1.0) to int (1)
displayMissing()                                                            # Join_Designation count is 0

**Size after Cleaning**
- **Rows**: 2161
- **Columns**: 13

In [None]:
n_rows, n_cols = df.shape
print(f"{n_rows} Rows")
print(f"{n_cols} Columns")

<a id="part2b"></a> [(Back)](#part2)


### **b.** For each variable, identify outliers (if any) and describe how you resolve the issue. Clearly state any assumption you made.


#### **Age**
There are 33 outlier rows with `Age` above the upper bound.
- **Resolution**: Remove only the row of the oldest employee.
- **Reason**: This outlier lies significantly further from the rest of the cluster than the other outliers.
- **Assumption(s)**: 
    - An employee is not forced to quit or retire once they reach a certain age
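
For reference, the 1.5 × IQR rule used to flag outliers can be sketched on a small series of hypothetical ages. Values strictly outside the bounds are treated as outliers here, which matches how a boxplot draws its fliers.

```python
import pandas as pd

ages = pd.Series([22, 25, 27, 29, 30, 31, 33, 35, 38, 58], name="Age")  # hypothetical ages

q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1                       # interquartile range
lower = q1 - 1.5 * iqr              # lower bound
upper = q3 + 1.5 * iqr              # upper bound

outliers = ages[(ages < lower) | (ages > upper)]
print(f"Bounds: [{lower}, {upper}]")
print(f"Outliers: {outliers.tolist()}")
```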

In [None]:
# Identify outliers
Q1 = df["Age"].quantile(0.25)
Q3 = df["Age"].quantile(0.75)
lower = Q1 - 1.5 * (Q3-Q1)                          # lower bound is 17 years old                
upper = Q3 + 1.5 * (Q3-Q1)                          # upper bound is 49 years old

below = df[df['Age'] <= lower].shape[0]
above = df[df['Age'] >= upper].shape[0]

print(f"Rows below lower bound ({int(lower)}): {below}")
print(f"Rows above upper bound ({int(upper)}): {above}")

plt.figure(figsize=(20,5))
df[["Age"]].boxplot()
plt.title("Age")
plt.ylabel("Years")
plt.show()

In [None]:
# drop the oldest person
max_age = df.Age.max()
age_outlier = df[df["Age"] == max_age]
age_outlier
df.drop(age_outlier.index, inplace=True)

In [None]:
Q1 = df["Age"].quantile(0.25)
Q3 = df["Age"].quantile(0.75)
lower = Q1 - 1.5 * (Q3-Q1)                          # lower bound is 17 years old                
upper = Q3 + 1.5 * (Q3-Q1)                          # upper bound is 49 years old

below = df[df['Age'] <= lower].shape[0]
above = df[df['Age'] >= upper].shape[0]

print(f"Rows below lower bound ({int(lower)}): {below}")
print(f"Rows above upper bound ({int(upper)}): {above}")

plt.figure(figsize=(20,5))
df[["Age"]].boxplot()
plt.title("Age")
plt.ylabel("Years")
plt.show()

In [None]:
n_rows, n_cols = df.shape
print(f"{n_rows} Rows")
print(f"{n_cols} Columns")

#### **Salary**
There are 50 outlier rows with `Salary` above the upper bound.
- **Resolution**:
    - Drop the 3 outliers separated from the cluster
    - Keep the outliers within the cluster
- **Reason**:
    - As there are only 3 outliers that lie much further away from the rest of the points, we opted to drop them.
    - The remaining outliers likely correspond to employees with higher designation levels, so we keep them for our data analysis.
- **Assumption(s)**: -

In [None]:
# Identify outliers for salary

Q1 = df["Salary"].quantile(0.25)
Q3 = df["Salary"].quantile(0.75)
lower = Q1 - 1.5 * (Q3-Q1)                          # lower bound is $-16409.25                
upper = Q3 + 1.5 * (Q3-Q1)                          # upper bound is $129844.75

below = df[df['Salary'] <= lower].shape[0]
above = df[df['Salary'] >= upper].shape[0]

print(f"Rows below lower bound (${lower}): {below}")
print(f"Rows above upper bound (${upper}): {above}")

plt.figure(figsize=(20,5))
df[["Salary"]].boxplot()
plt.ylabel("Dollars ($)")
plt.show()

df.sort_values('Salary', ascending=False)[["Emp_ID", "Salary"]].head(5)

In [None]:
# Drop the 3 outliers
top_3_outliers = df.sort_values("Salary", ascending=False).head(3)
df.drop(top_3_outliers.index, inplace=True)

#### **Total Sales Acquired**

There are 10 rows with negative `Total_Sales_Acquired`.
- **Resolution**: Drop the rows with negative `Total_Sales_Acquired`
- **Reason**: Total sales acquired should be at least 0. We should neither take the absolute value of these negative figures nor impute them with 0, as we cannot reasonably ascertain the true total sales.
- **Assumption(s)**:
    - Dataset does not keep track whether an employee caused a loss of sales

There are 307 outlier rows with `Total_Sales_Acquired` above the upper bound.
- **Resolution**: Drop the 3 rows with the highest `Total_Sales_Acquired`.
- **Reason**: These 3 outliers lie significantly further away from the rest of the data points, so we decided to drop them.
- **Assumption(s)**: -

In [None]:
Q1 = df["Total_Sales_Acquired"].quantile(0.25)
Q3 = df["Total_Sales_Acquired"].quantile(0.75)
lower = Q1 - 1.5 * (Q3-Q1)                          # lower bound
upper = Q3 + 1.5 * (Q3-Q1)                          # upper bound

negative = df[df['Total_Sales_Acquired'] < 0].shape[0]
below = df[df['Total_Sales_Acquired'] <= lower].shape[0]
above = df[df['Total_Sales_Acquired'] >= upper].shape[0]

print(f"Rows with negative sales: {negative}")
print(f"Rows below lower bound ({lower}): {below}")
print(f"Rows above upper bound ({upper}): {above}")

plt.figure(figsize=(20,5))
df[["Total_Sales_Acquired"]].boxplot()
plt.ylabel("Sales (x10^8)")
plt.show()

df.sort_values('Total_Sales_Acquired', ascending=False)[["Emp_ID", "Total_Sales_Acquired"]].head(5)

In [None]:
# Drop rows with negative sales
negativeSales = df[df['Total_Sales_Acquired'] < 0]  
df.drop(negativeSales.index, inplace=True)

In [None]:
# Drop the 3 outliers
top_3_outliers = df.sort_values("Total_Sales_Acquired", ascending=False).head(3)
df.drop(top_3_outliers.index, inplace=True)

**Size after Handling Outliers**
- **Rows**: 2144
- **Columns**: 13

In [None]:
n_rows, n_cols = df.shape
print(f"{n_rows} Rows")
print(f"{n_cols} Columns")

<a id="part2c"></a> [(Back)](#part2)

### **c.** For categorical variables, perform the necessary encoding.

#### **Emp ID, Join Designation, Designation, Quarterly Rating**

These categorical variables are stored as `int` and therefore need not be encoded.

#### **Gender**
Binary (nominal) variable; To apply binary encoding 
|Value|Encoded|
|:-:|:-:|
|Male|0|
|Female|1|

In [None]:
gender_encoding = {"Male":0, "Female":1} 
df["Gender_Encoded"] = df["Gender"].map(gender_encoding)  # map Gender column using encoding

df[["Date", "Emp_ID", "Gender", "Gender_Encoded"]].sample(5)

#### **City**
Nominal variable; to apply label encoding by extracting the city number.
The numeric codes are identifiers only and do not imply any order between cities.
|Value|Encoded|
|:-:|:-:|
|C1|1|
|C2|2|
|...|...|
|C28|28|
|C29|29|
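
As an alternative to writing the mapping out by hand, the city number can be taken directly from the code string. This is a minimal sketch on hypothetical values, assuming every code has the form `C<number>`.

```python
import pandas as pd

cities = pd.Series(["C1", "C20", "C3", "C29"], name="City")  # hypothetical sample

# Strip the leading "C" and convert the remainder to an integer
city_code = cities.str[1:].astype(int)
print(city_code.tolist())

# Sorting the strings puts "C20" before "C3"; sorting the numbers does not
print(sorted(cities))
print(sorted(city_code))
```

The numeric form is mainly useful for sorting and plotting in a readable order.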

In [None]:
city_encoding = {f"C{i}": i for i in range(1, 30)}   # {"C1": 1, "C2": 2, ..., "C29": 29}

df["City_Encoded"] = df["City"].map(city_encoding) #map City column using encoding

df[["Date", "Emp_ID", "City", "City_Encoded"]].sample(5)


#### **Education**
Ordinal variable; To apply ordinal encoding
|Value|Encoded|
|:-:|:-:|
|College|0|
|Bachelor|1|
|Master|2|
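
The same ordinal encoding can also be expressed with an ordered `pd.Categorical`, which keeps the level order explicit. This is a sketch on hypothetical values; the list `levels` is introduced here for illustration.

```python
import pandas as pd

education = pd.Series(["Bachelor", "College", "Master", "College"], name="Education")

levels = ["College", "Bachelor", "Master"]   # lowest to highest
ordered = pd.Categorical(education, categories=levels, ordered=True)

# .codes gives the position of each value in `levels`
codes = pd.Series(ordered.codes, name="Education_Encoded")
print(codes.tolist())
```

Any value not listed in `levels` would receive the code -1, which makes unexpected categories easy to spot.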

In [None]:
education_encoding = {"College":0, "Bachelor":1, "Master":2} 
df["Education_Encoded"] = df["Education"].map(education_encoding)  # map Education column using encoding

df[["Date", "Emp_ID", "Education", "Education_Encoded"]].sample(5)

#### **Date**
Convert date string to pandas Timestamp <br>
Extract month and year from `Date`<br>
Day is not needed as data is always captured at beginning of each month (i.e. 1st)

In [None]:
# df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)  # already converted above

df["Recorded_Month"] = df['Date'].dt.month
df["Recorded_Year"] = df['Date'].dt.year

df[["Date", "Emp_ID", "Recorded_Month", "Recorded_Year"]].sample(5)

#### **Join Date**
Convert date string to pandas Timestamp <br>
Extract day, month and year from `Join_Date`<br>

In [None]:
df['Join_Date'] = pd.to_datetime(df['Join_Date'], dayfirst=True)  

df["Join_Day"] = df['Join_Date'].dt.day
df["Join_Month"] = df['Join_Date'].dt.month
df["Join_Year"] = df['Join_Date'].dt.year

df[["Emp_ID", "Join_Date", "Join_Day", "Join_Month", "Join_Year"]].sample(5)

#### **Last Work Date**
Convert date string to pandas Timestamp <br>
Extract day, month and year from `Last_Work_Date`<br>

In [None]:
df['Last_Work_Date'] = pd.to_datetime(df['Last_Work_Date'], dayfirst=True)  

df["LWD_Day"] = df['Last_Work_Date'].dt.day
df["LWD_Month"] = df['Last_Work_Date'].dt.month
df["LWD_Year"] = df['Last_Work_Date'].dt.year

df[["Emp_ID", "Last_Work_Date", "LWD_Day", "LWD_Month", "LWD_Year"]].sample(5)

[(Back)](#part2)
<a id="part3"></a>

----
### 3.	Exploratory analysis and visualization [50% of Part I]
a. [Summary Statistics](#part3a) <br>
b. [Visualisation](#part3b) <br>
c. [Bi-Variate Analysis](#part3c)

_[(Back Top)](#menu)_

<a id="part3a"></a>

### **a.** For each variable, provide relevant summary statistics

In [None]:
def displayCategorical(column):
    value_counts = df[column].value_counts()
    percentage = (value_counts / df.shape[0]) * 100

    col_data = pd.DataFrame({'Count': value_counts.values, 'Percentage': round(percentage, 2)})    
    return col_data.sort_index()


#### **Date**

In [None]:
displayCategorical("Date")

#### **Emp_ID**

In [None]:
unique_count = df.Emp_ID.nunique()
n_rows, n_cols = df.shape
print(f"# unique employee IDs : {unique_count}")
print(f"# rows : {n_rows}")

#### **Age**

In [None]:
df[["Age"]].describe()

#### **Gender**

In [None]:
displayCategorical("Gender")

#### **City**

In [None]:
value_counts = df["City"].value_counts()                # Cannot sort normally by City(str) as "C10" < "C2" 
percentage = (value_counts / df.shape[0]) * 100
col_data = pd.DataFrame({'Code':df["City_Encoded"].value_counts().index,'Count': value_counts.values, 'Percentage': round(percentage, 2)})    
col_data.sort_values(by="Code").drop(columns=["Code"]) 

#### **Education**

In [None]:
unique_count = df.Education.nunique()
print(f"# unique types of education : {unique_count}")

displayCategorical("Education")

#### **Salary**

In [None]:
df[["Salary"]].describe()

#### **Join Date**

In [None]:
unique_count = df.Join_Date.nunique()
n_rows, n_cols = df.shape
print(f"# unique join dates : {unique_count}")
print(f"# rows : {n_rows}")
print()

classes = df.Join_Date.unique()
print(f"Values of join dates : {classes}")

mode_join_date = df.Join_Date.mode()[0]  # .mode() returns a Series, so take the first element with [0]
print(f"Most common join date : {mode_join_date}")
df.Join_Date.value_counts()  # do a count to verify the mode

#### **Last Work Date**

In [None]:
unique_count = df.Last_Work_Date.nunique()
n_rows, n_cols = df.shape
print(f"# unique last work dates : {unique_count}")
print(f"# rows : {n_rows}")
print()

classes = df.Last_Work_Date.unique()
print(f"Values of last work dates : {classes}")

mode_last_work_date = df.Last_Work_Date.mode()[0]  # .mode() returns a Series, so take the first element with [0]
print(f"Most common last work date : {mode_last_work_date}")
df.Last_Work_Date.value_counts()  # do a count to verify the mode

#### **Join Designation**

In [None]:
unique_count = df.Join_Designation.nunique()
print(f"# unique types of join designations : {unique_count}")

displayCategorical("Join_Designation")

#### **Designation**

In [None]:
unique_count = df.Designation.nunique()
print(f"# unique types of designations : {unique_count}")

displayCategorical("Designation")

#### **Total Sales Acquired**

In [None]:
df[["Total_Sales_Acquired"]].describe()

#### **Quarterly Rating**

In [None]:
unique_count = df.Quarterly_Rating.nunique()
print(f"# unique types of ratings : {unique_count}")

displayCategorical("Quarterly_Rating")

<a id="part3b"></a> [(Back)](#part3)

### **b.** For each variable, provide an appropriate visualisation depicting the distribution of its values, and summarize any key observation(s) you made.

#### **Date** <a id="p3b1"></a>
- **Key Observation(s)**:
    - Dec 2017 has the most records, as it includes employees who are still working (663) as well as those who quit that month (72)
    - The count for every other month represents the employees who quit within that month
        - The distribution of employees who quit in a given month appears relatively uniform
        - In 2016, May had the most employees who quit
        - In 2017, July had the most employees who quit
        - In both years, April had the fewest employees who quit

In [None]:
stillWorking = df[df['Last_Work_Date'] == "2017-12-31"].shape[0]   # ISO format avoids day/month ambiguity
dec2017 = df[(df['Date']=="2017-12-01")].shape[0]
print(f"Still Working in Dec 2017: {stillWorking}")
print(f"Quit in Dec 2017: {dec2017-stillWorking}")

date_data = df["Date"].value_counts(normalize=False)
date_level = date_data.index

plt.figure(figsize=(20,5))
bars = plt.bar(date_level, date_data, width=15)

for bar in bars:                               
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height(), f'{bar.get_height()}', ha='center', va='bottom', fontsize=10)

plt.title("Date", fontsize=15)
plt.xlabel("Date", fontsize=12)
plt.xticks(date_level, date_level.strftime('%b %Y'), rotation=90, fontsize=10)
plt.ylabel("Num. of Employees", fontsize=12)
plt.show()

#### **Emp ID** <a id="p3b2"></a>
- **Key Observation(s)**:
    - Increment between Emp_ID is not always 1, possibly suggesting a loss of data for these employees

In [None]:
unique_count_emp = df.Emp_ID.nunique()
n_rows = df.shape[0]

print(f"# Total unique Emp_ID : {unique_count_emp}")
print(f"# Total Rows : {n_rows}")

df[df["Emp_ID"] < 500].Emp_ID.plot.line()

#### **Age** <a id="p3b3"></a>
- **Key Observation(s)**:
    - The distribution of Age is slightly right-skewed (longer tail towards older ages)
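
Skewness can also be checked numerically rather than read off the histogram. The sketch below uses hypothetical ages and compares the mean with the median alongside `Series.skew()`, where a positive value suggests a longer right tail.

```python
import pandas as pd

ages = pd.Series([24, 26, 27, 28, 29, 30, 31, 33, 40, 52], name="Age")  # hypothetical ages

mean_age = ages.mean()
median_age = ages.median()
skewness = ages.skew()        # sample skewness; > 0 suggests a longer right tail

print(f"Mean: {mean_age}, Median: {median_age}, Skewness: {skewness:.2f}")
```

In this toy sample the few older ages pull the mean above the median, which is the typical signature of a right-skewed distribution.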

In [None]:
# Boxplot
plt.figure(figsize=(20,5))
df[["Age"]].boxplot()
plt.title("Age", fontsize=15)
plt.ylabel("Years", fontsize=12)
plt.show()

# Histogram
plt.figure(figsize=(20,5))
plt.title("Distribution by Age", fontsize=15)
plt.xlabel("Age", fontsize=12)
plt.ylabel("Num. of Employees", fontsize=12)
df["Age"].hist(bins=20)   
plt.show()

#### **Gender** <a id="p3b4"></a>
- **Key Observation(s)**:
    - There have been more male employees (59%) than female employees (41%)

In [None]:
gender_data = df["Gender"].value_counts(normalize=False)
gender_level = gender_data.index

plt.figure(figsize=(20,5))
bars = plt.bar(gender_level, gender_data)

for bar in bars:                               
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height(), f'{bar.get_height()}', ha='center', va='bottom', fontsize=10)

plt.title("Gender", fontsize=15)
plt.xlabel("Gender", fontsize=12)
# plt.xticks(gender_level, ['Male', 'Female'])
plt.ylabel("Num. of Employees", fontsize=12)
plt.show()

#### **City** <a id="p3b5"></a>
- **Key Observation(s)**: 
    - City C20 has had the greatest number of employees, suggesting it is a significant location
    - The distribution across the other 28 cities appears relatively uniform

In [None]:
city_data = df["City_Encoded"].value_counts(normalize=False)
city_level = city_data.index

plt.figure(figsize=(20,5))
bars = plt.bar(city_level, city_data)

for bar in bars:                               
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height(), f'{bar.get_height()}', ha='center', va='bottom', fontsize=10)

plt.title("Cities", fontsize=15)
plt.xlabel("City No. ", fontsize=12)
plt.xticks(range(1, 30))
plt.ylabel("Num. of Employees", fontsize=12)
plt.show()

#### **Education** <a id="p3b6"></a>
- **Key Observation(s)**:
    - The distribution across the 3 education levels is balanced, with College having a slightly lower count

In [None]:
education_data = df["Education"].value_counts(normalize=False)
education_level = education_data.index

plt.figure(figsize=(20, 5))
bars = plt.bar(education_level, education_data)

for bar in bars:                               
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height(), f'{bar.get_height()}', ha='center', va='bottom', fontsize=10)

plt.title("Education", fontsize=15)
plt.ylabel("Num. of Employees", fontsize=12)
plt.show()

#### **Salary** <a id="p3b7"></a>
- **Key Observation(s)**:
    - The distribution of Salary is right-skewed (longer tail towards higher salaries)

In [None]:
# Boxplot
plt.figure(figsize=(20,5))
df[["Salary"]].boxplot()
plt.title("Salary", fontsize=15)
plt.ylabel("$", fontsize=12)
plt.show()

# Histogram
plt.figure(figsize=(20,5))
plt.title("Distribution by Salary", fontsize=15)
plt.xlabel("Salary ($)", fontsize=12)
plt.ylabel("Num. of Employees", fontsize=12)
df["Salary"].hist(bins=50)    
plt.show()

#### **Join Date** <a id="p3b8"></a>
- **Key Observation(s)**: -

In [None]:
jd_count = df.Join_Date.nunique()
print(f"# unique join dates  : {jd_count}")

cols = ["Join_Day", "Join_Month", "Join_Year"]
xaxes = [np.arange(1,32,1), np.arange(1,13,1), np.arange(2010,2018,1)]
labels = ["Day", "Month", "Year"]

for i in range(len(cols)):
    date_data = df[cols[i]].value_counts(normalize=False)
    date_level = date_data.index

    plt.figure(figsize=(10,5))
    bars = plt.bar(date_level, date_data)

    for bar in bars:                               
        plt.text(bar.get_x() + bar.get_width()/2, bar.get_height(), f'{bar.get_height()}', ha='center', va='bottom', fontsize=10)

    plt.title("Join Date ("+labels[i]+")", fontsize=15)
    plt.xlabel(labels[i], fontsize=12)
    plt.xticks(xaxes[i]) 
    plt.ylabel("Count", fontsize=12)
    plt.show()

#### **Last Work Date** <a id="p3b9"></a>
- **Key Observation(s)**: 

In [None]:
date_data = df.groupby(df["Last_Work_Date"].dt.to_period('M'))["Last_Work_Date"].count()
date_level = date_data.index.to_timestamp()

plt.figure(figsize=(10,5))
bars = plt.bar(date_level, date_data, width=15)

for bar in bars:                               
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height(), f'{bar.get_height()}', ha='center', va='bottom', fontsize=10)

plt.title("Last Work Date", fontsize=15)
plt.xlabel("Last Work Date", fontsize=12)
plt.xticks(date_level, date_level.strftime('%b %Y'), rotation=90, fontsize=10)
plt.ylabel("Count", fontsize=12)
plt.show()


In [None]:
lwd_count = df.Last_Work_Date.nunique()
print(f"# unique last work date : {lwd_count}")

cols = ["LWD_Day", "LWD_Month", "LWD_Year"]
xaxes = [np.arange(1,32,1), np.arange(1,13,1), np.arange(2015,2018,1)]
labels = ["Day", "Month", "Year"]

for i in range(len(cols)):
    date_data = df[cols[i]].value_counts(normalize=False)
    date_level = date_data.index

    plt.figure(figsize=(10,5))
    bars = plt.bar(date_level, date_data)

    for bar in bars:                               
        plt.text(bar.get_x() + bar.get_width()/2, bar.get_height(), f'{bar.get_height()}', ha='center', va='bottom', fontsize=10)

    plt.title("Last Work Date ("+labels[i]+")", fontsize=15)
    plt.xlabel(labels[i], fontsize=12)
    plt.xticks(xaxes[i]) 
    plt.ylabel("Count", fontsize=12)
    plt.show()

#### **Join Designation** <a id="p3b10"></a>
- **Key Observation(s)**: 
    - Employees rarely join with designation level 4 or 5 (1.95%)
    - An employee most likely joins with designation level 1 (44.21%)
    - The employee count decreases with each subsequent designation level, with a significant drop between levels 3 and 4

In [None]:
jd_data = df["Join_Designation"].value_counts(normalize=True)
jd_level = jd_data.index

plt.figure(figsize=(20,5))                
bars = plt.bar(jd_level, jd_data)              

for bar in bars:                               
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height(), f'{bar.get_height():.2%}', ha='center', va='bottom', fontsize=10)

plt.xlabel("Designation Level", fontsize=12)   
plt.ylabel("Percentage", fontsize=12)          
plt.yticks([0, 0.1, 0.2, 0.3, 0.4, 0.5])
plt.title("Employees' Designation Upon Joining", fontsize=15) 
plt.show()

#### **Designation** <a id="p3b11"></a>
- **Key Observation(s)**: 
    - Compared with `Join_Designation`, the share of employees at level 1 drops significantly (by 11.81 percentage points)
    - The shares of all other designation levels (2-5) have increased, while following a similar trend to `Join_Designation`
    - Designation level 3 has the greatest increase (5.02 percentage points)
    - Designation level 5 still has the smallest share (0.98%), suggesting it is difficult to be promoted to level 5
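
The change between the joining and latest distributions can be computed directly in percentage points. This sketch uses a small hypothetical frame; `change_pp` is a name introduced here for illustration.

```python
import pandas as pd

# Hypothetical joining and latest designation levels for 8 employees
toy = pd.DataFrame({
    "Join_Designation": [1, 1, 1, 2, 2, 3, 1, 2],
    "Designation":      [1, 2, 1, 2, 3, 3, 2, 3],
})

join_share = toy["Join_Designation"].value_counts(normalize=True).sort_index()
latest_share = toy["Designation"].value_counts(normalize=True).sort_index()

# Align on designation level; a level absent from one column counts as 0
change_pp = (latest_share.sub(join_share, fill_value=0) * 100).round(2)
print(change_pp)
```

Using `sub` with `fill_value=0` avoids `NaN` for a level that appears in only one of the two columns.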

In [None]:
cd_data = df["Designation"].value_counts(normalize=True)
cd_level = cd_data.index

plt.figure(figsize=(20,5))                  
bars = plt.bar(cd_level, cd_data)              

for bar in bars:                               
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height(), f'{bar.get_height():.2%}', ha='center', va='bottom', fontsize=10)

plt.xlabel("Designation Level", fontsize=12)   
plt.ylabel("Percentage", fontsize=12)          
plt.yticks([0, 0.1, 0.2, 0.3, 0.4, 0.5])
plt.title("Employees' Latest Designation", fontsize=15)  
plt.show()

In [None]:
plt.figure(figsize=(20,5)) 

jd_data = df["Join_Designation"].value_counts(normalize=True)
bars1 = plt.bar(jd_data.index - 0.2, jd_data, 0.4, label = 'First Joined') 

for bar in bars1:                               
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height(), f'{bar.get_height():.1%}', ha='center', va='bottom', fontsize=10)

cd_data = df["Designation"].value_counts(normalize=True)
bars2 = plt.bar(cd_data.index + 0.2, cd_data, 0.4, label = 'Latest') 
  
for bar in bars2:                               
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height(), f'{bar.get_height():.1%}', ha='center', va='bottom', fontsize=10)

plt.xlabel("Designation Level", fontsize=12) 
plt.ylabel("Percentage", fontsize=12)  
plt.yticks([0, 0.1, 0.2, 0.3, 0.4, 0.5])
plt.legend() 
plt.title("Employee's Designation", fontsize=15)
plt.show() 

#### **Total Sales Acquired** <a id="p3b12"></a>
- **Key Observation(s)**:
    - A significant number of employees (653) acquired 0 total sales, such that the lower quartile and lower whisker are both 0
    - Among the outlier data
        - The majority are concentrated between the upper bound (0.1x10^8) and 0.4x10^8 total sales
        - There is another grouping between 0.5x10^8 and 0.6x10^8 total sales
        - There are 3 distinct points above 0.6x10^8 total sales
    - The distribution of Total Sales Acquired is right-skewed (longer tail towards higher sales)

In [None]:
# Boxplot
plt.figure(figsize=(20,5))
df[["Total_Sales_Acquired"]].boxplot()
plt.title("Total Sales Acquired", fontsize=15)
plt.ylabel("Sales (x10^8)", fontsize=12)
plt.show()

# Histogram
plt.figure(figsize=(20,5))
plt.title("Distribution by Total Sales Acquired", fontsize=15)
plt.xlabel("Sales", fontsize=12)
plt.ylabel("Num. of Employees", fontsize=12)
df["Total_Sales_Acquired"].hist(bins=50)    
plt.show()


In [None]:
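# Log-transform the sales to reduce the right skew (log1p keeps the zero-sales employees at 0)
log_totalSalesAcquired = np.log1p(df.Total_Sales_Acquired)
log_totalSalesAcquired.hist(bins=20)
plt.title("Distribution by log(1 + Total Sales Acquired)", fontsize=15)
plt.xlabel("log(1 + Sales)", fontsize=12)
plt.show()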

#### **Quarterly Rating** <a id="p3b13"></a>
- **Key Observation(s)**: 
    - The share of employees falls steeply as the rating increases, with the largest drop between ratings 1 and 2
    - The majority of employees are given a quarterly rating of 1, which suggests that a higher rating is difficult to attain

In [None]:
qr_data = df["Quarterly_Rating"].value_counts(normalize=True)
qr_level = qr_data.index

plt.figure(figsize=(20,5))
bars = plt.bar(qr_level, qr_data)

for bar in bars:                               
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height(), f'{bar.get_height():.1%}', ha='center', va='bottom', fontsize=10)
    
plt.title("Employees' Quarterly Rating", fontsize=15)
plt.xlabel("Quarterly Rating", fontsize=10)
plt.xticks([1,2,3,4])
plt.ylabel("Percentage", fontsize=10)
plt.yticks(np.arange(0,0.9,0.1))
plt.show()

<a id="part3c"></a> [(Back)](#part3)

### **c.** Perform bi-variate analysis on the variables. You do not need to present the analysis of every pair of variables; only focus on the pairs you believe are worth investigating and explain. For each pair, describe the relationship between the two variables. Use appropriate statistical methods and/or visualizations.

**Response.** 

In [None]:
Q1 = df["Salary"].quantile(0.25)
Q3 = df["Salary"].quantile(0.75)
#lower = Q1 - 1.5 * (Q3-Q1)                          # lower bound is $-16409.25                
upper = Q3 + 1.5 * (Q3-Q1) 

def salary_class(n):   
    if n > upper:
        return "Very high salary"
    elif n > Q3:
        return "High salary"
    elif n < Q1:
        return "Low salary"
    else:
        return "Middle salary"
    
df["salaryClass"] = df.Salary.apply(salary_class)

df[["Emp_ID", "Salary", "salaryClass"]].sample(25)

In [None]:
Q1 = df["Total_Sales_Acquired"].quantile(0.25)
Q2 = df["Total_Sales_Acquired"].quantile(0.50)
Q3 = df["Total_Sales_Acquired"].quantile(0.75)
upper = Q3 + 1.5 * (Q3-Q1)                          # upper bound for outliers

# The median (Q2) is the "Low sales" cut-off because the lower quartile is 0
def sales_class(n):   
    if n > upper:
        return "Very high sales"
    elif n > Q3:
        return "High sales"
    elif n < Q2:
        return "Low sales"
    else:
        return "Middle sales"
    
df["totalSalesAcquiredClass"] = df.Total_Sales_Acquired.apply(sales_class)

df[["Emp_ID", "Total_Sales_Acquired", "totalSalesAcquiredClass"]].sample(25)

#### **Age vs Salary** <a id="p3b11"></a>

Explanation for choosing this relationship: 
We wanted to examine whether older workers earn higher salaries than younger workers, since they have more experience in the workforce.

Relationship: 
There is no clear relationship between the two variables. According to the scatter plot, most of the points are concentrated in the salary range of \$40,000 to \$80,000, regardless of the workers' ages.

In [None]:

xs = df.Salary
ys = df.Age

plt.figure(figsize=(20,5))
plt.scatter(xs, ys)
plt.title("Plot of Age vs Salary", fontsize=15)
plt.xlabel("Salary", fontsize=12)
plt.ylabel("Age", fontsize=12)
#plt.yticks([1,2,3,4,5])
plt.show()

In [None]:
corr = np.corrcoef(xs, ys)
corr

In [None]:
cov = np.cov(xs, ys)
cov

In [None]:
xs.var(), ys.var() 
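
The correlation, covariance and variance cells above are linked by r = cov(x, y) / (s_x * s_y). The following is a minimal sketch on made-up numbers (the `salary` and `age` values are hypothetical) showing that the three outputs agree, since `np.cov` and the pandas `var()` method both default to the sample (n - 1) form.

```python
import numpy as np
import pandas as pd

# Hypothetical values standing in for df.Salary and df.Age
salary = pd.Series([42_000, 55_000, 61_000, 48_000, 75_000, 58_000])
age = pd.Series([25, 41, 33, 52, 29, 38])

# Off-diagonal entry of the 2x2 covariance matrix is cov(salary, age)
cov_xy = np.cov(salary, age)[0, 1]

# Pearson correlation rebuilt from the covariance and the two variances
r_manual = cov_xy / np.sqrt(salary.var() * age.var())

# The same value read directly from the correlation matrix
r_numpy = np.corrcoef(salary, age)[0, 1]

print(f"manual r = {r_manual:.4f}, np.corrcoef r = {r_numpy:.4f}")
```

Because the covariance depends on the units of both variables, the correlation coefficient is the easier of the two to interpret when judging the strength of a relationship.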

#### **City vs Salary** <a id="p3b11"></a>

Explanation for choosing this relationship: 
We wanted to examine whether the city an employee lives in affects how much they earn.

Relationship: 
There is no clear relationship between the two variables.

In [None]:
pd.crosstab(df.City, df.salaryClass, normalize="index")

#### **Designation vs Salary** <a id="p3b11"></a>

Explanation for choosing the relationship:
We wanted to examine if a higher designation equates to earning higher salaries.

Relationship:
Our hypothesis is generally true. Approximately 58.9% of employees at designation 5 earn very high salaries. Conversely, those at designations 1 and 2 mainly earn low to middle salaries.

In [None]:
pd.crosstab(df.Designation, df.salaryClass, normalize="index")

#### **Education vs Join Designation** <a id="p3b11"></a>

In [None]:
pd.crosstab(df.Education, df.Join_Designation)

#### **Age vs Designation** <a id="p3b11"></a>

In [None]:
xs = df.Age
ys = df.Designation

plt.figure(figsize=(20,5))
plt.scatter(xs, ys)
plt.title("Plot of Age vs Designation", fontsize=15)
plt.xlabel("Age", fontsize=12)
plt.ylabel("Designation", fontsize=12)
plt.yticks([1,2,3,4,5])
plt.show()

#### **City vs (Designation-JD)** <a id="p3b11"></a>

In [None]:
xs = df.City
ys = df.Designation - df.Join_Designation

plt.figure(figsize=(20,5))
plt.scatter(xs, ys)
plt.title("Plot of City vs Change In Designation", fontsize=15)
plt.xlabel("City name", fontsize=12)
plt.ylabel("Change In Designation", fontsize=12)
plt.yticks([0,1,2,3,4])
plt.show()

#### **(LWD-JD) vs (Designation-JD)** <a id="p3b11"></a>

In [None]:
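# Time worked in years, matching the x-axis label (to_datetime is a no-op if the columns are already dates)
xs = (pd.to_datetime(df.Last_Work_Date, dayfirst=True) - pd.to_datetime(df.Join_Date, dayfirst=True)).dt.days / 365.25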
ys = df.Designation - df.Join_Designation

plt.figure(figsize=(20,5))
plt.scatter(xs, ys)
plt.title("Plot of Time Worked vs Change In Designation", fontsize=15)
plt.xlabel("Time Worked (years)", fontsize=12)
plt.ylabel("Change In Designation", fontsize=12)
plt.yticks([0,1,2,3,4])
plt.show()

[(Back)](#part3)

#### **Gender vs Sales**

We reuse the sales classes defined above for this comparison.


In [None]:
# totalSalesAcquiredClass was already added to df above, so it is reused here
table = pd.crosstab(df.Gender, df.totalSalesAcquiredClass)
table

In [None]:
table_norm = pd.crosstab(df.Gender, df.totalSalesAcquiredClass, normalize="index")
table_norm

In [None]:
# Row totals (employees per gender) and counts in the two top sales classes
gender_totals = table.sum(axis=1)
gender_at_least_high = table[["High sales", "Very high sales"]].sum(axis=1)

workers = gender_totals["Female"] + gender_totals["Male"]
Females, Males = gender_totals["Female"], gender_totals["Male"]
Female_AtLeastHigh = gender_at_least_high["Female"]
Male_AtLeastHigh = gender_at_least_high["Male"]
AtLeastHigh = Female_AtLeastHigh + Male_AtLeastHigh

print(f"Probability(AtLeastHigh)={AtLeastHigh/workers}")
print(f"Probability(AtLeastHigh | Female)={Female_AtLeastHigh/Females}")
print(f"Probability(AtLeastHigh | Male)={Male_AtLeastHigh/Males}")

Probability(AtLeastHigh), the probability that an employee's sales fall in the "High sales" or "Very high sales" class, is 0.25.
The conditional probabilities are almost identical: Probability(AtLeastHigh | Female) is about 0.2526 and Probability(AtLeastHigh | Male) is about 0.2482.
Since conditioning on gender barely changes the probability, Total_Sales_Acquired appears to be independent of Gender.
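
One way to check independence across all four sales classes at once, rather than only the combined "at least high" group, is to compare the observed counts with the counts expected under independence. The sketch below uses a chi-square statistic, which this notebook does not otherwise use, on a made-up table (the counts are hypothetical, not the values from `employee.csv`).

```python
import numpy as np
import pandas as pd

# Hypothetical Gender x sales-class counts, shaped like pd.crosstab(df.Gender, df.totalSalesAcquiredClass)
table = pd.DataFrame(
    {
        "High sales": [30, 88],
        "Low sales": [100, 302],
        "Middle sales": [50, 150],
        "Very high sales": [20, 60],
    },
    index=["Female", "Male"],
)

n = table.to_numpy().sum()
row_totals = table.sum(axis=1)
col_totals = table.sum(axis=0)

# Under independence, expected count = (row total * column total) / grand total
expected = pd.DataFrame(
    np.outer(row_totals, col_totals) / n,
    index=table.index,
    columns=table.columns,
)

# Chi-square statistic: sum over all cells of (observed - expected)^2 / expected
chi2 = ((table - expected) ** 2 / expected).to_numpy().sum()

print(expected.round(1))
print(f"chi-square statistic = {chi2:.4f}")
```

A statistic near 0 means the observed counts are close to what independence predicts. Judging whether a larger value is significant requires comparing it against a chi-square distribution with (rows - 1) x (columns - 1) = 3 degrees of freedom.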

#### **City vs Sales**

We reuse the same sales classification as above.

In [None]:
# totalSalesAcquiredClass was already added to df above, so it is reused here
tableCity = pd.crosstab(df.City, df.totalSalesAcquiredClass)
tableCity

In [None]:
# There are too many cities to compare every sales class, so we compare Probability(Very high sales) with Probability(Very high sales | city) for a specific city

In [None]:
df.City.count()

In [None]:
# Total number of employees across all cities
AllCitiesTotal = df.City.count()
# Employees in the "Very high sales" class, summed over every city
VeryHighSalesNo = tableCity["Very high sales"].sum()
print(f"Probability(Very high sales)={VeryHighSalesNo/AllCitiesTotal} ")

In [None]:
tableCity_Norm = pd.crosstab(df.City, df.totalSalesAcquiredClass, normalize="index")
tableCity_Norm

In [None]:
#let's check C20 for probability of veryHighSales
C20VeryHighSales = tableCity_Norm["Very high sales"]["C20"]
C20VeryHighSales
print(f"Probability(Very high sales | C20 )={C20VeryHighSales} ")

Since Probability(Very high sales) differs from Probability(Very high sales | C20), Total_Sales_Acquired appears to depend on City, at least for C20. A single city is limited evidence, so this points to a possible dependency rather than confirming one.
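
The same comparison can be made for every city at once instead of only C20. This is a minimal sketch on a small made-up frame (the `demo` records are hypothetical); it subtracts the overall probability from each city's conditional probability so that the cities furthest from the overall rate stand out.

```python
import pandas as pd

# Hypothetical records standing in for df[["City", "totalSalesAcquiredClass"]]
demo = pd.DataFrame({
    "City": ["C1"] * 4 + ["C2"] * 4 + ["C3"] * 2,
    "totalSalesAcquiredClass": [
        "Very high sales", "Low sales", "Low sales", "Middle sales",        # C1
        "Very high sales", "Very high sales", "High sales", "Low sales",    # C2
        "Low sales", "Middle sales",                                        # C3
    ],
})

# Overall Probability(Very high sales)
p_marginal = (demo.totalSalesAcquiredClass == "Very high sales").mean()

# Probability(Very high sales | city) for every city
p_by_city = pd.crosstab(
    demo.City, demo.totalSalesAcquiredClass, normalize="index"
)["Very high sales"]

# Positive values: the city has more very high sellers than the overall rate
diff = (p_by_city - p_marginal).sort_values()

print(f"Probability(Very high sales) = {p_marginal}")
print(diff)
```

Cities with only a few employees will show large differences by chance, so the gap for each city is best read alongside that city's employee count.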

#### **Time Worked (LWD-JD) vs Sales**

In [None]:
df['Join_Date'] = pd.to_datetime(df['Join_Date'], dayfirst=True)  

df["Join_Day"] = df['Join_Date'].dt.day
df["Join_Month"] = df['Join_Date'].dt.month
df["Join_Year"] = df['Join_Date'].dt.year

df["Join_Day"].describe()
df["Join_Month"].describe()
df["Join_Year"].describe()

In [None]:
# The extracted day, month and year columns are floats, so cast them to int before doing arithmetic

In [None]:
df["Join_Day"] = df["Join_Day"].astype(int)
df["Join_Month"] = df["Join_Month"].astype(int)
df["Join_Year"] = df["Join_Year"].astype(int)

In [None]:
# Approximation: every year is treated as 365 days and every month as 30 days
df["JoinDateInDays"] = df.Join_Year * 365 + df.Join_Month * 30 + df.Join_Day

In [None]:
df['Last_Work_Date'] = pd.to_datetime(df['Last_Work_Date'], dayfirst=True)  

df["LWD_Day"] = df['Last_Work_Date'].dt.day
df["LWD_Month"] = df['Last_Work_Date'].dt.month
df["LWD_Year"] = df['Last_Work_Date'].dt.year

df["LWD_Day"].describe()
df["LWD_Month"].describe()
df["LWD_Year"].describe()

In [None]:
df["LWD_Day"] = df["LWD_Day"].astype(int)
df["LWD_Month"] = df["LWD_Month"].astype(int)
df["LWD_Year"] = df["LWD_Year"].astype(int)

In [None]:
# Approximation: every year is treated as 365 days and every month as 30 days
df["LastWorkDateInDays"] = df.LWD_Year * 365 + df.LWD_Month * 30 + df.LWD_Day

In [None]:
# Approximate number of days between joining and the last work date
df["DaysWorked"] = (df.LastWorkDateInDays - df.JoinDateInDays).abs()

In [None]:
xs = df.DaysWorked
ys = df.Total_Sales_Acquired

plt.figure(figsize=(8,5))

plt.scatter(xs,ys)
plt.title("DaysWorked vs TotalSalesAcquired", fontsize=15)
plt.xlabel("DaysWorked", fontsize=12)
plt.ylabel("TotalSalesAcquired", fontsize=12)
plt.show()

In [None]:
corr = np.corrcoef(xs,ys)
corr

In [None]:
df["Time_Worked"] = df["Last_Work_Date"] - df["Join_Date"]
# Convert Time_Worked from timedelta to number of days
df["Time_Worked"] = df["Time_Worked"].dt.days

df[["Emp_ID","Last_Work_Date","Join_Date","Time_Worked"]]

xs = df.Time_Worked
ys = df.Total_Sales_Acquired

plt.figure(figsize=(8,5))

plt.scatter(xs,ys)
plt.title("Time_Worked vs TotalSalesAcquired", fontsize=15)
plt.xlabel("Time_Worked", fontsize=12)
plt.ylabel("TotalSalesAcquired", fontsize=12)
plt.show()

corr1 = np.corrcoef(xs,ys)
corr1

The correlation coefficient is above 0.50 and close to 0.70, so Time_Worked (Last_Work_Date - Join_Date) has a moderately strong positive correlation with TotalSalesAcquired.
Employees who have worked longer tend to have acquired more total sales, although correlation alone does not show that longer tenure causes higher sales.