## Analyzing the personality test results of Umuzi recruits by the department they applied to


### 1. Reading and cleaning the dataset


We first import the libraries the project needs. Using pandas, we then read the personality_scores.csv and departments.csv files, both of which use a semicolon as the separator.


In [1]:
import pandas as pd
import numpy as np

In [2]:
personality_scores = pd.read_csv('data/personality_scores.csv', sep=';')
departments_data = pd.read_csv('data/departments.csv', sep = ';')
personality_scores.head()

Unnamed: 0,ID,Section 5 of 6 [I am always prepared.],Section 5 of 6 [I am easily disturbed.],Section 5 of 6 [I am exacting (demanding) in my work.],Section 5 of 6 [I am full of ideas.],Section 5 of 6 [I am interested in people.],Section 5 of 6 [I am not interested in abstract ideas.],Section 5 of 6 [I am not interested in other people's problems.],Section 5 of 6 [I am not really interested in others.],Section 5 of 6 [I am quick to understand things.],...,Unnamed: 60,Unnamed: 61,Unnamed: 62,Unnamed: 63,Unnamed: 64,Unnamed: 65,Unnamed: 66,Unnamed: 67,Unnamed: 68,IPIP_HIGH_RISK
0,0,"(3, 5)","(4, 5)","(3, 5)","(5, 5)","(2, 3)","(5, 3)","(2, 3)","(2, 5)","(5, 5)",...,,,,,,,,,,
1,1,"(3, 5)","(4, 5)","(3, 5)","(5, 5)","(2, 5)","(5, 3)","(2, 5)","(2, 5)","(5, 5)",...,,,,,,,,,,
2,2,"(3, 5)","(4, 3)","(3, 3)","(5, 5)","(2, 5)","(5, 5)","(2, 5)","(2, 5)","(5, 5)",...,,,,,,,,,,
3,3,"(3, 5)","(4, 5)","(3, 3)","(5, 5)","(2, 5)","(5, 3)","(2, 3)","(2, 3)","(5, 3)",...,,,,,,,,,,
4,4,"(3, 3)","(4, 5)","(3, 3)","(5, 3)","(2, 3)","(5, 3)","(2, 3)","(2, 3)","(5, 5)",...,,,,,,,,,,


The preview shows columns that contain only NaN values. Below we drop every column in which all rows are NaN and strip the "Section 5 of 6" prefix from the column names, keeping only the question text inside the square brackets. We then assert that no missing values remain.

In [3]:
personality_scores = personality_scores.dropna(axis='columns', how = 'all')

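# Keep only the question text between the square brackets, e.g.
# 'Section 5 of 6 [I am always prepared.]' -> 'I am always prepared.'
# The first column (ID) has no brackets and is left untouched.
personality_scores = personality_scores.rename(
    columns={name: name[name.find("[") + 1:name.find("]")]
             for name in personality_scores.columns[1:]}
)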

assert personality_scores.isnull().sum().sum() == 0  

personality_scores.head()

Unnamed: 0,ID,I am always prepared.,I am easily disturbed.,I am exacting (demanding) in my work.,I am full of ideas.,I am interested in people.,I am not interested in abstract ideas.,I am not interested in other people's problems.,I am not really interested in others.,I am quick to understand things.,...,I often forget to put things back in their proper place,I pay attention to details.,I seldom feel blue (down).,I spend time reflecting on things.,I start conversations.,I sympathize with others' feelings.,I take time out for others.,I talk to a lot of different people at parties.,I use difficult words.,I worry about things.
0,0,"(3, 5)","(4, 5)","(3, 5)","(5, 5)","(2, 3)","(5, 3)","(2, 3)","(2, 5)","(5, 5)",...,"(3, 5)","(3, 5)","(4, 3)","(5, 5)","(1, 3)","(2, 5)","(2, 5)","(1, 3)","(5, 1)","(4, 3)"
1,1,"(3, 5)","(4, 5)","(3, 5)","(5, 5)","(2, 5)","(5, 3)","(2, 5)","(2, 5)","(5, 5)",...,"(3, 5)","(3, 1)","(4, 1)","(5, 5)","(1, 5)","(2, 5)","(2, 5)","(1, 5)","(5, 3)","(4, 3)"
2,2,"(3, 5)","(4, 3)","(3, 3)","(5, 5)","(2, 5)","(5, 5)","(2, 5)","(2, 5)","(5, 5)",...,"(3, 5)","(3, 5)","(4, 1)","(5, 3)","(1, 3)","(2, 5)","(2, 5)","(1, 3)","(5, 1)","(4, 3)"
3,3,"(3, 5)","(4, 5)","(3, 3)","(5, 5)","(2, 5)","(5, 3)","(2, 3)","(2, 3)","(5, 3)",...,"(3, 1)","(3, 5)","(4, 1)","(5, 5)","(1, 5)","(2, 5)","(2, 5)","(1, 5)","(5, 1)","(4, 1)"
4,4,"(3, 3)","(4, 5)","(3, 3)","(5, 3)","(2, 3)","(5, 3)","(2, 3)","(2, 3)","(5, 5)",...,"(3, 5)","(3, 5)","(4, 5)","(5, 5)","(1, 3)","(2, 3)","(2, 5)","(1, 3)","(5, 1)","(4, 3)"
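The difference between dropping columns that are entirely empty and columns that have any gap matters for the assertion above. The following is a minimal sketch on a toy frame; the `partly_missing` column and all values are hypothetical and do not come from the real dataset.

```python
import pandas as pd

# Toy frame mimicking the raw export (all values are made up).
raw = pd.DataFrame({
    "ID": [0, 1],
    "Section 5 of 6 [I am always prepared.]": ["(3, 5)", "(3, 3)"],
    "Unnamed: 60": [None, None],          # empty for every row
    "partly_missing": ["(1, 3)", None],   # hypothetical: empty for one row only
})

# how="all" drops a column only when every value in it is missing.
all_dropped = raw.dropna(axis="columns", how="all")

# how="any" drops a column as soon as a single value is missing.
any_dropped = raw.dropna(axis="columns", how="any")

print(list(all_dropped.columns))
print(list(any_dropped.columns))

# Count of missing cells left after how="all"; the notebook asserts this is 0.
remaining_missing = all_dropped.isnull().sum().sum()
print(remaining_missing)
```

In this toy case `how="all"` leaves one missing cell behind, so an assertion like the one in the notebook would fail and flag the partly empty column for a closer look.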


We drop duplicate rows from the personality scores data based on the ID column, keeping the first occurrence of each ID. We then assert that the length of the new dataframe equals the number of unique IDs in the original personality scores dataframe.

In [4]:
personality_scores_df = personality_scores.drop_duplicates(subset='ID', keep="first")

In [5]:
assert personality_scores.ID.nunique() == len(personality_scores_df), 'The number of unique IDs does not equal the length of the deduplicated personality scores'
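As a small illustration of the deduplication check, the sketch below uses a hypothetical four-row frame in which ID 1 appears twice. It only shows how `keep="first"` behaves and is not taken from the real data.

```python
import pandas as pd

# Hypothetical responses: ID 1 was submitted twice.
responses = pd.DataFrame({
    "ID": [0, 1, 1, 2],
    "I am always prepared.": ["(3, 5)", "(3, 3)", "(3, 1)", "(3, 5)"],
})

# Keep the first row seen for each ID and discard later repeats.
deduplicated = responses.drop_duplicates(subset="ID", keep="first")

# One row per unique ID should remain.
assert responses["ID"].nunique() == len(deduplicated)
print(deduplicated)
```

The original row labels are preserved, so the surviving row for ID 1 keeps index 1 and the dropped repeat leaves a gap at index 2.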

### 2. Determining the total score of each of the personality test subscales. 


The personality scores dataset stores each answer as a tuple of two values, which pandas reads from the CSV as a string such as "(3, 5)". The first value identifies the Big Five personality trait that the question measures: 1 = Extraversion, 2 = Agreeableness, 3 = Conscientiousness, 4 = Emotional Stability/Neuroticism, and 5 = Intellect/Imagination (openness to experience). The second value is the individual's scored response to the question: 1 = Disagree, 3 = Neutral and 5 = Agree.


We determine the total score of each of the personality test subscales below, by filtering the data using the personality traits and summing the individual's scored responses.

In [6]:
import ast

def personality_test_totals(scores):
    # Parse each '(trait, response)' string into a tuple, skipping the ID in position 0.
    # ast.literal_eval only accepts Python literals, so it is safer than eval on file contents.
    responses = [ast.literal_eval(cell) for cell in scores.iloc[1:]]

    def subscale_total(trait_code):
        # Sum the responses to the questions that measure the given trait.
        return sum(response for trait, response in responses if trait == trait_code)

    # Trait codes: 1 = extraversion, 2 = agreeableness, 3 = conscientiousness,
    # 4 = emotional stability, 5 = intellect/openness.
    return (subscale_total(2), subscale_total(3), subscale_total(4),
            subscale_total(5), subscale_total(1))
   

df = (personality_scores.apply(personality_test_totals, axis = 1, result_type = 'expand'))
df.columns = ['agreeableness','conscientiousness','emotional stability','openness to new experiences','extraversion']
personality_scores_df = pd.concat([personality_scores_df, df], axis =1, join = 'inner')

personality_scores_df.head()

Unnamed: 0,ID,I am always prepared.,I am easily disturbed.,I am exacting (demanding) in my work.,I am full of ideas.,I am interested in people.,I am not interested in abstract ideas.,I am not interested in other people's problems.,I am not really interested in others.,I am quick to understand things.,...,I sympathize with others' feelings.,I take time out for others.,I talk to a lot of different people at parties.,I use difficult words.,I worry about things.,agreeableness,conscientiousness,emotional stability,openness to new experiences,extraversion
0,0,"(3, 5)","(4, 5)","(3, 5)","(5, 5)","(2, 3)","(5, 3)","(2, 3)","(2, 5)","(5, 5)",...,"(2, 5)","(2, 5)","(1, 3)","(5, 1)","(4, 3)",40,48,36,42,30
1,1,"(3, 5)","(4, 5)","(3, 5)","(5, 5)","(2, 5)","(5, 3)","(2, 5)","(2, 5)","(5, 5)",...,"(2, 5)","(2, 5)","(1, 5)","(5, 3)","(4, 3)",46,46,40,42,42
2,2,"(3, 5)","(4, 3)","(3, 3)","(5, 5)","(2, 5)","(5, 5)","(2, 5)","(2, 5)","(5, 5)",...,"(2, 5)","(2, 5)","(1, 3)","(5, 1)","(4, 3)",40,40,38,42,28
3,3,"(3, 5)","(4, 5)","(3, 3)","(5, 5)","(2, 5)","(5, 3)","(2, 3)","(2, 3)","(5, 3)",...,"(2, 5)","(2, 5)","(1, 5)","(5, 1)","(4, 1)",38,38,40,38,30
4,4,"(3, 3)","(4, 5)","(3, 3)","(5, 3)","(2, 3)","(5, 3)","(2, 3)","(2, 3)","(5, 5)",...,"(2, 3)","(2, 5)","(1, 3)","(5, 1)","(4, 3)",34,46,38,36,28
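To make the tuple encoding concrete, here is a minimal pure-Python sketch that totals the first five cells shown for applicant 0 in the table above. The `TRAITS` mapping and the `subscale_totals` helper are illustrative names, not part of the notebook.

```python
import ast
from collections import defaultdict

# Trait codes as described in the text (names chosen for illustration).
TRAITS = {
    1: "extraversion",
    2: "agreeableness",
    3: "conscientiousness",
    4: "emotional stability",
    5: "openness to new experiences",
}

def subscale_totals(cells):
    """Sum the responses per trait for a list of '(trait, response)' strings."""
    totals = defaultdict(int)
    for cell in cells:
        # literal_eval turns the string "(3, 5)" into the tuple (3, 5).
        trait_code, response = ast.literal_eval(cell)
        totals[TRAITS[trait_code]] += response
    return dict(totals)

# First five answers of applicant 0 from the preview table.
row = ["(3, 5)", "(4, 5)", "(3, 5)", "(5, 5)", "(2, 3)"]
print(subscale_totals(row))
```

Two of the five answers belong to conscientiousness (code 3), so that subscale totals 10 here, while the other traits that appear have a single answer each.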


### 3. Merging personality scores data with the departments data 

Before merging, we check the unique values in the Department column. We then merge the department data with the personality scores on the ID column and assert that the merged data frame has the same number of rows as the department data frame and the expected number of columns.

In [7]:
departments_data.Department.unique()

array(['Data', 'Web Dev', 'Copywriting', 'Design', 'Strategy', 'Web dev'],
      dtype=object)

The Department column contains two spellings of the web development department, 'Web Dev' and 'Web dev'. We replace the second spelling so that it matches the first.

In [8]:
departments_data['Department'] = departments_data['Department'].replace({'Web dev':'Web Dev'})
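When more than one spelling variant is present, listing each replacement by hand becomes tedious. One possible alternative, sketched here as an assumption rather than the method used in this notebook, is to normalise whitespace and capitalisation in one pass. Title-casing happens to produce the desired labels for these five departments but would not suit labels with acronyms.

```python
import pandas as pd

# The department labels as they appear in the unique() output above.
departments = pd.Series(
    ["Data", "Web Dev", "Copywriting", "Design", "Strategy", "Web dev"]
)

# Strip stray whitespace and capitalise the first letter of every word.
normalised = departments.str.strip().str.title()

unique_labels = sorted(normalised.unique())
print(unique_labels)
```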


In [9]:
merged_data = departments_data.merge(personality_scores_df, on=["ID"])
merged_data.head()

Unnamed: 0,ID,Department,I am always prepared.,I am easily disturbed.,I am exacting (demanding) in my work.,I am full of ideas.,I am interested in people.,I am not interested in abstract ideas.,I am not interested in other people's problems.,I am not really interested in others.,...,I sympathize with others' feelings.,I take time out for others.,I talk to a lot of different people at parties.,I use difficult words.,I worry about things.,agreeableness,conscientiousness,emotional stability,openness to new experiences,extraversion
0,0,Data,"(3, 5)","(4, 5)","(3, 5)","(5, 5)","(2, 3)","(5, 3)","(2, 3)","(2, 5)",...,"(2, 5)","(2, 5)","(1, 3)","(5, 1)","(4, 3)",40,48,36,42,30
1,1,Data,"(3, 5)","(4, 5)","(3, 5)","(5, 5)","(2, 5)","(5, 3)","(2, 5)","(2, 5)",...,"(2, 5)","(2, 5)","(1, 5)","(5, 3)","(4, 3)",46,46,40,42,42
2,2,Data,"(3, 5)","(4, 3)","(3, 3)","(5, 5)","(2, 5)","(5, 5)","(2, 5)","(2, 5)",...,"(2, 5)","(2, 5)","(1, 3)","(5, 1)","(4, 3)",40,40,38,42,28
3,3,Data,"(3, 5)","(4, 5)","(3, 3)","(5, 5)","(2, 5)","(5, 3)","(2, 3)","(2, 3)",...,"(2, 5)","(2, 5)","(1, 5)","(5, 1)","(4, 1)",38,38,40,38,30
4,4,Data,"(3, 3)","(4, 5)","(3, 3)","(5, 3)","(2, 3)","(5, 3)","(2, 3)","(2, 3)",...,"(2, 3)","(2, 5)","(1, 3)","(5, 1)","(4, 3)",34,46,38,36,28


In [10]:
assert len(merged_data.index) == len(departments_data.index), 'The number of rows in the merged data does not equal the number of rows in the departments data'
assert len(merged_data.columns) == (len(personality_scores_df.columns) + len(departments_data.columns) - 1), 'The number of columns in the merged data does not equal the expected number of columns'
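The row-count check can also be pushed into the merge itself. The sketch below, on hypothetical three-row frames, uses the `validate` and `indicator` parameters of `DataFrame.merge`: `validate="one_to_one"` raises an error if an ID repeats on either side, and `indicator=True` adds a `_merge` column that records where each row came from. The `how="left"` choice is an assumption that every department row should be kept.

```python
import pandas as pd

# Hypothetical stand-ins for the two datasets.
departments = pd.DataFrame(
    {"ID": [0, 1, 2], "Department": ["Data", "Design", "Strategy"]}
)
scores = pd.DataFrame({"ID": [0, 1, 2], "extraversion": [30, 42, 28]})

# Raises pandas.errors.MergeError if either side has duplicate IDs.
merged = departments.merge(
    scores, on="ID", how="left", validate="one_to_one", indicator=True
)

# Rows marked 'left_only' would be applicants without personality scores.
unmatched = merged[merged["_merge"] != "both"]

assert unmatched.empty
assert len(merged) == len(departments)
```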


### 4. Determining high and low risk applicants

We add a new column 'Risk' to the merged data, labelling applicants who scored less than 30 on all three of emotional stability, conscientiousness and agreeableness as high_risk and all other applicants as low_risk. We then display the ID numbers and departments of the high risk applicants.


In [11]:
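# An applicant is high risk only when all three subscale totals are below 30.
below_threshold = (
    (merged_data['agreeableness'] < 30)
    & (merged_data['conscientiousness'] < 30)
    & (merged_data['emotional stability'] < 30)
)
merged_data['Risk'] = np.where(below_threshold, 'high_risk', 'low_risk')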

high_risk_results = merged_data[merged_data['Risk'] == 'high_risk']
low_risk_results = merged_data[merged_data['Risk'] == 'low_risk']
high_risk = high_risk_results[['ID','Department']]
high_risk


Unnamed: 0,ID,Department
881,881,Data
1197,1197,Copywriting
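Because the three conditions are combined with a logical AND, an applicant who is below 30 on only one or two subscales is still labelled low_risk. The sketch below illustrates this on three hypothetical applicants; the IDs and totals are invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical subscale totals for three applicants.
applicants = pd.DataFrame({
    "ID": [10, 11, 12],
    "agreeableness": [28, 28, 40],
    "conscientiousness": [26, 34, 44],
    "emotional stability": [24, 22, 38],
})

subscales = ["agreeableness", "conscientiousness", "emotional stability"]

# True only for rows where every one of the three totals is below 30.
below_threshold = (applicants[subscales] < 30).all(axis=1)
applicants["Risk"] = np.where(below_threshold, "high_risk", "low_risk")

print(applicants[["ID", "Risk"]])
```

Applicant 11 is below 30 on two of the three subscales but scores 34 on conscientiousness, so only applicant 10 is tagged high_risk.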


### 5. Creating a new department dataframe

In this section we create a new data frame called department_df with a count of the number of low and high risk applicants within each department.

We first collect the department names, which become the columns of the new data frame alongside a 'Risk' label column. We then count the high risk and low risk applicants in each department and build the data frame from the two sets of counts, filling in 0 for any department that has no applicants in a risk group.

In [12]:
department_names = list(merged_data['Department'].unique())
column_names = ['Risk'] + department_names

# Count applicants per department in each risk group and label the row.
high_risk_dict = dict(high_risk_results['Department'].value_counts())
high_risk_dict['Risk'] = 'High risk'
low_risk_dict = dict(low_risk_results['Department'].value_counts())
low_risk_dict['Risk'] = 'Low risk'

# DataFrame.append was removed in pandas 2.0, so build the frame directly from the two rows.
department_df = pd.DataFrame([high_risk_dict, low_risk_dict], columns=column_names).fillna(0)

department_df

Unnamed: 0,Risk,Data,Web Dev,Copywriting,Design,Strategy
0,High risk,1,0.0,1,0.0,0.0
1,Low risk,328,331.0,325,120.0,449.0
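The mixed integer and float values in the table arise because departments without high risk applicants start out as NaN before being filled with 0. A possible alternative, offered as a sketch and not as the approach taken in this notebook, is `pd.crosstab`, which counts every department and risk combination directly and returns integer counts. The five applicants below are hypothetical.

```python
import pandas as pd

# Hypothetical applicants with a department and a risk label each.
applicants = pd.DataFrame({
    "Department": ["Data", "Data", "Design", "Strategy", "Copywriting"],
    "Risk": ["high_risk", "low_risk", "low_risk", "low_risk", "high_risk"],
})

# Rows are risk labels, columns are departments, cells are applicant counts.
risk_counts = pd.crosstab(applicants["Risk"], applicants["Department"])
print(risk_counts)
```

Combinations that never occur, such as high risk applicants in Design, appear as integer 0 without a separate fill step.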
