### Recruitment Data 

#### This notebook explores data wrangling using recruitment personality score data. Applicants are required to complete a personality survey, and their responses are the input for this notebook. The five factors measured by the survey are Openness, Conscientiousness, Extraversion, Agreeableness and Neuroticism (the inverse of Emotional Stability), abbreviated OCEAN. These traits form the subscales for the data applications and manipulations below. The structure of the notebook is as follows:

### 1. Data Validation 
#### The data from the personality scores file is read in and checked for unique entries.

### 2. Subscale calculation 
#### A function is defined to take in and convert the answers from the personality survey to numbers. The total scores for each subscale are then calculated. 

### 3. Subscale interpretation 
#### A function is defined that takes the subscale totals as a parameter and returns a new dataframe with a new column labelling each subscale score as 'high', 'medium' or 'low'. This label is henceforth referred to as the score category.

### 4. Merge dataframes 
#### The department dataframe is merged with the personality score dataframe, retaining all applicants within the various departments. 

### 5. Data Visualization 
#### A histogram of the personality scores across the various departments is plotted.

### 6. Dataframe Filtration
#### The merged dataframe is filtered for candidates that have a 'low' label on the Neuroticism, Conscientiousness and Agreeableness traits. These candidates are tagged 'high risk' in a new column, by department.

### 7. Count Dataframe
#### A new dataframe is made with the count of the number of applicants in each score category within each subscale and department. 
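
#### Steps 6 and 7 of this outline can be sketched on toy data as follows. This is only an illustration under stated assumptions: the `*_category` column names, the `risk_tag` column, the 'Sales' department and all values are hypothetical and do not come from the input files.

```python
import pandas as pd

# Toy merged dataframe; every name and value here is made up for illustration.
toy = pd.DataFrame(
    {
        "Department": ["Data", "Data", "Sales"],
        "Neuroticism_category": ["low", "high", "low"],
        "Conscientiousness_category": ["low", "medium", "low"],
        "Agreeableness_category": ["low", "low", "high"],
    },
    index=pd.Index([0, 1, 2], name="ID"),
)

risk_columns = [
    "Neuroticism_category",
    "Conscientiousness_category",
    "Agreeableness_category",
]

# Step 6: a candidate is tagged only if all three traits are labelled 'low'.
is_high_risk = toy[risk_columns].eq("low").all(axis=1)
toy["risk_tag"] = is_high_risk.map({True: "high risk", False: ""})

# Step 7: count applicants per department, subscale and score category.
counts = (
    toy.melt(
        id_vars="Department",
        value_vars=risk_columns,
        var_name="subscale",
        value_name="category",
    )
    .groupby(["Department", "subscale", "category"])
    .size()
    .reset_index(name="applicants")
)
print(toy["risk_tag"])
print(counts)
```

#### Reshaping to long format with `melt` before grouping keeps the count logic identical for every subscale, so no per-trait code is needed.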



### 1. Data Validation
#### The code below removes duplicate applicant IDs and drops every column that contains only null values. There are 1555 unique entries, meaning 1555 applicants, and each remaining column represents one of the 50 questions on the survey.

In [73]:
#Packages necessary to execute data manipulation and analysis are imported 
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import re

# The input data is read in, checked and duplicates are removed
scores = pd.read_csv('personality_scores.csv', sep = ';', header = 0)
scores.info()
pers_scores = scores.drop_duplicates(subset = 'ID', keep = 'first')
prsnl_scores = pers_scores.set_index('ID')
prsnl_scores.dropna(axis = 1, how = 'all', inplace = True)
print(prsnl_scores.shape)
prsnl_scores.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1555 entries, 0 to 1554
Data columns (total 70 columns):
ID                                                                          1555 non-null int64
Section 5 of 6 [I am always prepared.]                                      1555 non-null object
Section 5 of 6 [I am easily disturbed.]                                     1555 non-null object
Section 5 of 6 [I am exacting (demanding) in my work.]                      1555 non-null object
Section 5 of 6 [I am full of ideas.]                                        1555 non-null object
Section 5 of 6 [I am interested in people.]                                 1555 non-null object
Section 5 of 6 [I am not interested in abstract ideas.]                     1555 non-null object
Section 5 of 6 [I am not interested in other people's problems.]            1555 non-null object
Section 5 of 6 [I am not really interested in others.]                      1555 non-null object
Section 5 of 6 [I am 

ID,Section 5 of 6 [I am always prepared.],Section 5 of 6 [I am easily disturbed.],Section 5 of 6 [I am exacting (demanding) in my work.],Section 5 of 6 [I am full of ideas.],Section 5 of 6 [I am interested in people.],Section 5 of 6 [I am not interested in abstract ideas.],Section 5 of 6 [I am not interested in other people's problems.],Section 5 of 6 [I am not really interested in others.],Section 5 of 6 [I am quick to understand things.],Section 5 of 6 [I am quiet around strangers.],...,Section 5 of 6 [I often forget to put things back in their proper place],Section 5 of 6 [I pay attention to details.],Section 5 of 6 [I seldom feel blue (down).],Section 5 of 6 [I spend time reflecting on things.],Section 5 of 6 [I start conversations.],Section 5 of 6 [I sympathize with others' feelings.],Section 5 of 6 [I take time out for others.],Section 5 of 6 [I talk to a lot of different people at parties.],Section 5 of 6 [I use difficult words.],Section 5 of 6 [I worry about things.]
0,"(3, 5)","(4, 5)","(3, 5)","(5, 5)","(2, 3)","(5, 3)","(2, 3)","(2, 5)","(5, 5)","(1, 3)",...,"(3, 5)","(3, 5)","(4, 3)","(5, 5)","(1, 3)","(2, 5)","(2, 5)","(1, 3)","(5, 1)","(4, 3)"
1,"(3, 5)","(4, 5)","(3, 5)","(5, 5)","(2, 5)","(5, 3)","(2, 5)","(2, 5)","(5, 5)","(1, 3)",...,"(3, 5)","(3, 1)","(4, 1)","(5, 5)","(1, 5)","(2, 5)","(2, 5)","(1, 5)","(5, 3)","(4, 3)"
2,"(3, 5)","(4, 3)","(3, 3)","(5, 5)","(2, 5)","(5, 5)","(2, 5)","(2, 5)","(5, 5)","(1, 1)",...,"(3, 5)","(3, 5)","(4, 1)","(5, 3)","(1, 3)","(2, 5)","(2, 5)","(1, 3)","(5, 1)","(4, 3)"
3,"(3, 5)","(4, 5)","(3, 3)","(5, 5)","(2, 5)","(5, 3)","(2, 3)","(2, 3)","(5, 3)","(1, 3)",...,"(3, 1)","(3, 5)","(4, 1)","(5, 5)","(1, 5)","(2, 5)","(2, 5)","(1, 5)","(5, 1)","(4, 1)"
4,"(3, 3)","(4, 5)","(3, 3)","(5, 3)","(2, 3)","(5, 3)","(2, 3)","(2, 3)","(5, 5)","(1, 1)",...,"(3, 5)","(3, 5)","(4, 5)","(5, 5)","(1, 3)","(2, 3)","(2, 5)","(1, 3)","(5, 1)","(4, 3)"
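
#### The validation step above can be checked on a small scale before it is trusted on the full file. The sketch below is a minimal example on toy data (the `toy` frame, its column names and its values are assumptions made for illustration, not part of `personality_scores.csv`). It counts duplicate IDs first and then applies the same clean-up chain as the cell above.

```python
import pandas as pd

# Toy stand-in for personality_scores.csv; all values are made up.
toy = pd.DataFrame(
    {
        "ID": [0, 1, 1, 2],
        "Q1": ["(3, 5)", "(3, 3)", "(3, 3)", "(3, 1)"],
        "Empty": [None, None, None, None],
    }
)

# Number of rows whose ID has already been seen further up.
n_duplicates = int(toy["ID"].duplicated().sum())

# Same clean-up as the notebook cell: keep the first row per ID,
# index by ID, and drop columns in which every value is null.
cleaned = (
    toy.drop_duplicates(subset="ID", keep="first")
    .set_index("ID")
    .dropna(axis=1, how="all")
)

print(n_duplicates, cleaned.shape)
```

#### Counting duplicates before dropping them makes it visible how many rows the clean-up removes, which `drop_duplicates` alone does not report.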


### 2. Subscale calculation 
#### A function is defined to take in and convert the answers from the personality survey to numbers. The total scores for each subscale are then calculated.
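
#### As a minimal sketch of this step, the example below parses cells such as `'(3, 5)'` into a subscale key and an answer, then sums the answers per subscale for each applicant. The helper names `parse_pair` and `subscale_totals` and the toy frame are hypothetical, and the key-to-trait mapping is an assumption based on the keys used in this section's code cell.

```python
import re
import pandas as pd

# Assumed mapping from subscale key to trait name.
SUBSCALES = {
    1: "Extraversion",
    2: "Agreeableness",
    3: "Conscientiousness",
    4: "Neuroticism",
    5: "Openness",
}

def parse_pair(cell):
    """Return (key, answer) from a string such as '(3, 5)', or None if it does not parse."""
    numbers = re.findall(r"\d+", str(cell))
    if len(numbers) != 2:
        return None
    return int(numbers[0]), int(numbers[1])

def subscale_totals(row):
    """Sum the answers in one applicant's row, grouped by subscale key."""
    totals = dict.fromkeys(SUBSCALES.values(), 0)
    for cell in row:
        pair = parse_pair(cell)
        if pair is not None and pair[0] in SUBSCALES:
            totals[SUBSCALES[pair[0]]] += pair[1]
    return pd.Series(totals)

# Toy answers for two applicants and three questions; values are made up.
toy = pd.DataFrame(
    {
        "Q1": ["(3, 5)", "(3, 3)"],
        "Q2": ["(4, 5)", "(4, 1)"],
        "Q3": ["(3, 1)", "(3, 5)"],
    },
    index=pd.Index([0, 1], name="ID"),
)

toy_totals = toy.apply(subscale_totals, axis=1)
print(toy_totals)
```

#### Cells that do not contain exactly two numbers, such as missing values, are skipped rather than raising an error.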

In [62]:
# A function is defined that finds the numbers in a string and returns them as a list, with the first number being the subscale key and the second number being the answer value

def numerate(string_pair):
    num_list = [int(s) for s in re.findall(r'\d+', str(string_pair))]
    return num_list 

# Every cell is converted to a [subscale key, answer] list. Answers with the same key (subscale) in the same row are then summed into a new dataframe.
# Assumption: the key-to-trait mapping follows the draft of this cell (1 = Extraversion, 2 = Agreeableness, 3 = Conscientiousness, 4 = Neuroticism, 5 = Openness).
subscale_names = {5: 'Openness', 3: 'Conscientiousness', 1: 'Extraversion', 2: 'Agreeableness', 4: 'Neuroticism'}
values_df = prsnl_scores.apply(lambda col: col.map(numerate))

def subscale_totals(row):
    # Start every subscale at zero and add each answer to the subscale its key points to
    totals = dict.fromkeys(subscale_names.values(), 0)
    for pair in row:
        # Cells that did not yield exactly two numbers (e.g. missing answers) are skipped
        if len(pair) == 2 and pair[0] in subscale_names:
            totals[subscale_names[pair[0]]] += pair[1]
    return pd.Series(totals)

subscale_df = values_df.apply(subscale_totals, axis = 1)
subscale_df.head()

### 3. Subscale interpretation 
#### A function is defined that takes the subscale totals as a parameter and returns a new dataframe with a new column labelling each subscale score as 'high', 'medium' or 'low'. This label is henceforth referred to as the score category.
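
#### One possible shape for such a function is sketched below. The cut-offs are not given in this notebook, so `low_cutoff` and `high_cutoff` are hypothetical parameters, and the function name `label_scores`, the `<trait>_category` column names and the toy totals are likewise assumptions. The sketch adds one category column per subscale.

```python
import pandas as pd

def label_scores(subscale_df, low_cutoff, high_cutoff):
    """Return a copy of subscale_df with a '<trait>_category' column per subscale.

    Scores below low_cutoff are 'low', scores above high_cutoff are 'high',
    and everything in between (cut-offs included) is 'medium'.
    """
    labelled = subscale_df.copy()
    for trait in subscale_df.columns:
        category = f"{trait}_category"
        labelled[category] = "medium"
        labelled.loc[subscale_df[trait] < low_cutoff, category] = "low"
        labelled.loc[subscale_df[trait] > high_cutoff, category] = "high"
    return labelled

# Toy subscale totals; the values and the cut-offs below are made up.
toy_totals = pd.DataFrame({"Openness": [12, 30, 45], "Neuroticism": [40, 18, 25]})
labelled = label_scores(toy_totals, low_cutoff=20, high_cutoff=35)
print(labelled)
```

#### Passing the cut-offs as arguments keeps the scoring rule explicit and easy to change once the real thresholds are known.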

In [None]:
# The interpretation function is defined with subscale totals as a parameter
def generate_score_interpretation(subscale_df): 
    

### 4. Merge dataframes 
#### The department dataframe is merged with the personality score dataframe, retaining all applicants within the various departments. 

In [74]:
# Department file is read in and merged with cleaned personality score data
dept_df = pd.read_csv('departments.csv', sep = ';', header = 0, index_col = 'ID')
dept_df.head()
# merge_df = pd.merge(pers_scores, dept_df)

ID,Department,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12,Unnamed: 13,Unnamed: 14,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,IPIP_HIGH_RISK
0,Data,,,,,,,,,,,,,,,,,,,
1,Data,,,,,,,,,,,,,,,,,,,
2,Data,,,,,,,,,,,,,,,,,,,
3,Data,,,,,,,,,,,,,,,,,,,
4,Data,,,,,,,,,,,,,,,,,,,
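
#### A minimal sketch of a merge that retains every applicant in the department data is shown below, assuming both dataframes are indexed by `ID`. The toy frames, the 'Sales' department and the scores are made up for illustration. A left join on the index keeps all department rows and leaves the score columns empty for applicants without a score.

```python
import pandas as pd

# Toy department data, indexed by ID like dept_df.
toy_dept = pd.DataFrame(
    {"Department": ["Data", "Data", "Sales"]},
    index=pd.Index([0, 1, 2], name="ID"),
)

# Toy subscale totals; applicant 1 has no score on purpose.
toy_scores = pd.DataFrame(
    {"Openness": [40, 25]},
    index=pd.Index([0, 2], name="ID"),
)

# how="left" keeps every row of toy_dept, matched or not.
merged = toy_dept.merge(toy_scores, how="left", left_index=True, right_index=True)
print(merged)
```

#### If one frame still carries `ID` as a regular column while the other uses it as the index, `left_on` / `right_index` (or a prior `set_index('ID')`) is needed so that the join keys line up.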
