
#  Performing cluster sampling

##  Assignment 

Now that you know when to use cluster sampling, it's time to put it into action. In this exercise, you'll explore the `JobRole` column of the attrition dataset. You can think of each job role as a subgroup of the whole population of employees.

`attrition_pop` is available; `pandas` is loaded with its usual alias, and the `random` package is available. A seed of `19790801` has also been set with `random.seed()`.

##  Pre-exercise code 

```
import pandas as pd
import random
random.seed(19790801)
attrition_pop = pd.read_feather(
  path = "/usr/local/share/datasets/attrition.feather"
)
```


In [50]:
import pandas as pd
import random
random.seed(19790801)
attrition_pop = pd.read_feather(
  path = "attrition.feather"
)

In [51]:
attrition_pop.head(1)

|  | Age | Attrition | BusinessTravel | DailyRate | Department | DistanceFromHome | Education | EducationField | EnvironmentSatisfaction | Gender | ... | PerformanceRating | RelationshipSatisfaction | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 21 | 0.0 | Travel_Rarely | 391 | Research_Development | 15 | College | Life_Sciences | High | Male | ... | Excellent | Very_High | 0 | 0 | 6 | Better | 0 | 0 | 0 | 0 |


##  Instructions 

- Create a list of unique `JobRole` values from `attrition_pop`, and assign to `job_roles_pop`.
- Randomly sample four `JobRole` values from `job_roles_pop`.



In [60]:
job_roles_pop = list(attrition_pop['JobRole'].unique())
job_roles_pop

['Research_Scientist',
 'Sales_Representative',
 'Laboratory_Technician',
 'Human_Resources',
 'Sales_Executive',
 'Manufacturing_Director',
 'Healthcare_Representative',
 'Research_Director',
 'Manager']

In [61]:
job_role_sample = random.sample(job_roles_pop, k=4)
job_role_sample

['Research_Director', 'Research_Scientist', 'Human_Resources', 'Manager']
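
The draw above depends on the seed set in the pre-exercise code. As a minimal sketch of why that matters, the snippet below repeats a draw after resetting the seed and checks that the result is identical. The cluster labels and the seed `123` are hypothetical values chosen for illustration, not taken from the attrition dataset.

```python
import random

# Hypothetical cluster labels standing in for the job roles
clusters = ["A", "B", "C", "D", "E", "F"]

random.seed(123)  # arbitrary seed, assumed for this illustration only
first_draw = random.sample(clusters, k=3)

random.seed(123)  # resetting the seed replays the same draw
second_draw = random.sample(clusters, k=3)

print(first_draw == second_draw)   # True: same seed, same sample
print(len(set(first_draw)) == 3)   # True: random.sample draws without replacement
```

Because `random.sample()` draws without replacement, a cluster can be selected at most once, which is what the first stage of cluster sampling needs.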


- Subset `attrition_pop` for the sampled job roles by filtering for rows where `JobRole` is in `job_role_sample`.




In [64]:
attrition_subset = attrition_pop[attrition_pop['JobRole'].isin(job_role_sample)]
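
To make the filtering step easier to inspect, here is a small sketch of the same `isin()` pattern on a hypothetical five-row frame. The names `toy_pop`, `toy_roles`, and `toy_subset` and all values are invented for illustration. The `.copy()` call is an assumption about style: it makes the subset an independent DataFrame so that later column assignments do not act on a view of the original.

```python
import pandas as pd

# Hypothetical miniature population; values are made up
toy_pop = pd.DataFrame({
    "JobRole": ["Manager", "Sales_Executive", "Manager",
                "Human_Resources", "Research_Director"],
    "Age": [45, 31, 52, 38, 49],
})

# Hypothetical list of sampled clusters
toy_roles = ["Manager", "Human_Resources"]

# Boolean mask: True where the row's JobRole is one of the sampled roles
mask = toy_pop["JobRole"].isin(toy_roles)

# Keep only the matching rows as an independent copy
toy_subset = toy_pop[mask].copy()
print(toy_subset)
```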

- Remove any unused categories from `JobRole`.
- For each job role in the filtered dataset, take a random sample of ten rows, setting the seed to `2022`.



In [66]:
attrition_subset['JobRole']

0       Research_Scientist
5       Research_Scientist
6       Research_Scientist
10      Research_Scientist
17      Research_Scientist
               ...        
1462               Manager
1464               Manager
1465               Manager
1466               Manager
1469     Research_Director
Name: JobRole, Length: 526, dtype: category
Categories (9, object): ['Healthcare_Representative', 'Human_Resources', 'Laboratory_Technician', 'Manager', ..., 'Research_Director', 'Research_Scientist', 'Sales_Executive', 'Sales_Representative']

In [67]:
attrition_subset['JobRole'].cat.remove_unused_categories()

0       Research_Scientist
5       Research_Scientist
6       Research_Scientist
10      Research_Scientist
17      Research_Scientist
               ...        
1462               Manager
1464               Manager
1465               Manager
1466               Manager
1469     Research_Director
Name: JobRole, Length: 526, dtype: category
Categories (4, object): ['Human_Resources', 'Manager', 'Research_Director', 'Research_Scientist']

In [69]:
attrition_subset.loc[:, 'JobRole'] = attrition_subset['JobRole'].cat.remove_unused_categories()

In [70]:
attrition_subset['JobRole']

0       Research_Scientist
5       Research_Scientist
6       Research_Scientist
10      Research_Scientist
17      Research_Scientist
               ...        
1462               Manager
1464               Manager
1465               Manager
1466               Manager
1469     Research_Director
Name: JobRole, Length: 526, dtype: category
Categories (4, object): ['Human_Resources', 'Manager', 'Research_Director', 'Research_Scientist']
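
One way to see why unused categories matter before grouping is to count rows per category before and after dropping them. The sketch below uses a hypothetical three-category column (`toy`, `role`, and `x` are invented names). It works on an explicit copy and assigns the column directly, which is a stylistic assumption rather than what the notebook cell above does. `observed=False` is passed explicitly so that `groupby()` reports every category in the dtype, including empty ones.

```python
import pandas as pd

# Hypothetical frame with a categorical column of three categories
toy = pd.DataFrame({
    "role": pd.Categorical(["a", "a", "b", "c"], categories=["a", "b", "c"]),
    "x": [1, 2, 3, 4],
})

# Filtering out "c" removes its rows but not the category itself
kept = toy[toy["role"].isin(["a", "b"])].copy()

# The empty category "c" still appears as a group with zero rows
sizes_before = kept.groupby("role", observed=False).size()

# Drop categories that no longer occur in the data
kept["role"] = kept["role"].cat.remove_unused_categories()
sizes_after = kept.groupby("role", observed=False).size()

print(sizes_before)
print(sizes_after)
```

After the cleanup, only the categories that actually occur remain as groups, so a per-group sample is not asked to draw from an empty group.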

In [72]:
clu_sample = attrition_subset.groupby('JobRole').sample(n=10, random_state=2022)


In [74]:
clu_sample

|  | Age | Attrition | BusinessTravel | DailyRate | Department | DistanceFromHome | Education | EducationField | EnvironmentSatisfaction | Gender | ... | PerformanceRating | RelationshipSatisfaction | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1348 | 44 | 1.0 | Travel_Rarely | 1376 | Human_Resources | 1 | College | Medical | Medium | Male | ... | Excellent | Very_High | 1 | 24 | 1 | Better | 20 | 6 | 3 | 6 |
| 886 | 41 | 0.0 | Non-Travel | 552 | Human_Resources | 4 | Bachelor | Human_Resources | High | Male | ... | Excellent | Medium | 1 | 10 | 4 | Better | 3 | 2 | 1 | 2 |
| 983 | 39 | 0.0 | Travel_Rarely | 141 | Human_Resources | 3 | Bachelor | Human_Resources | High | Female | ... | Excellent | High | 1 | 12 | 3 | Bad | 8 | 3 | 3 | 6 |
| 88 | 27 | 1.0 | Travel_Frequently | 1337 | Human_Resources | 22 | Bachelor | Human_Resources | Low | Female | ... | Excellent | Low | 0 | 1 | 2 | Better | 1 | 0 | 0 | 0 |
| 189 | 34 | 0.0 | Travel_Rarely | 829 | Human_Resources | 3 | College | Human_Resources | High | Male | ... | Excellent | High | 1 | 4 | 1 | Bad | 3 | 2 | 0 | 2 |
| 160 | 24 | 0.0 | Travel_Frequently | 897 | Human_Resources | 10 | Bachelor | Medical | Low | Male | ... | Excellent | Very_High | 1 | 3 | 2 | Better | 2 | 2 | 2 | 1 |
| 839 | 46 | 0.0 | Travel_Rarely | 991 | Human_Resources | 1 | College | Life_Sciences | Very_High | Female | ... | Excellent | High | 0 | 10 | 3 | Best | 7 | 6 | 5 | 7 |
| 966 | 30 | 0.0 | Travel_Rarely | 1240 | Human_Resources | 9 | Bachelor | Human_Resources | High | Male | ... | Excellent | Very_High | 0 | 12 | 2 | Bad | 11 | 9 | 4 | 7 |
| 162 | 28 | 0.0 | Non-Travel | 280 | Human_Resources | 1 | College | Life_Sciences | High | Male | ... | Excellent | Medium | 1 | 3 | 2 | Better | 3 | 2 | 2 | 2 |
| 1231 | 37 | 0.0 | Travel_Rarely | 1239 | Human_Resources | 8 | College | Other | High | Male | ... | Excellent | High | 0 | 19 | 4 | Good | 10 | 0 | 4 | 7 |
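
The preview above shows ten rows and hides `JobRole` among the elided columns, so on its own it cannot confirm that every sampled role contributed the requested number of rows. A quick check is to count rows per group in the result. The sketch below does this on a hypothetical frame (`toy`, `role`, and `x` are invented names) with `n=2` per group; the same `value_counts()` call could be applied to `clu_sample['JobRole']`.

```python
import pandas as pd

# Hypothetical data: three clusters of unequal size
toy = pd.DataFrame({
    "role": ["a"] * 5 + ["b"] * 4 + ["c"] * 6,
    "x": range(15),
})

# Second stage of cluster sampling: a fixed-size random sample within each group
toy_sample = toy.groupby("role").sample(n=2, random_state=2022)

# Each cluster should appear exactly n times in the result
counts = toy_sample["role"].value_counts()
print(counts)
```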
