# Occupation

### Introduction:

Special thanks to: https://github.com/justmarkham for sharing the dataset and materials.

### 1. Import the dataset from this [address](https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user) directly and assign it to a variable called `users`

In [1]:
import pandas as pd
file = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user'
users = pd.read_table(file, sep = "|")
users.head()

,user_id,age,gender,occupation,zip_code
0,1,24,M,technician,85711
1,2,53,F,other,94043
2,3,23,M,writer,32067
3,4,24,M,technician,43537
4,5,33,F,other,15213
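
As a minimal offline sketch of the same loading step, the snippet below parses a few pipe-delimited rows with `pd.read_csv` and `sep="|"`. The inline `sample` text (modelled on the rows shown above), the name `sample_users`, and the choice to force `zip_code` to `str` are assumptions for illustration, not part of the original notebook.

```python
import io

import pandas as pd

# Assumed sample: a header plus three rows in the file's pipe-delimited layout
sample = """user_id|age|gender|occupation|zip_code
1|24|M|technician|85711
2|53|F|other|94043
3|23|M|writer|32067
"""

# Read zip_code as text so leading zeros or letters are never coerced to numbers
sample_users = pd.read_csv(io.StringIO(sample), sep="|", dtype={"zip_code": str})
print(sample_users.dtypes)
```

Reading zip codes as text up front avoids a later `astype(str)` step and keeps any non-numeric codes intact.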


### 2. Convert occupation to TitleCase

In [2]:
users['occupation'] = users.occupation.str.title()
users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 943 entries, 0 to 942
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   user_id     943 non-null    int64 
 1   age         943 non-null    int64 
 2   gender      943 non-null    object
 3   occupation  943 non-null    object
 4   zip_code    943 non-null    object
dtypes: int64(2), object(3)
memory usage: 37.0+ KB


### 3. For each combination of occupation and gender, calculate the mean age (in tall table format)

In [3]:
combi = users.groupby(['occupation', 'gender']).age.mean()
display(combi)

occupation     gender
Administrator  F         40.638889
               M         37.162791
Artist         F         30.307692
               M         32.333333
Doctor         M         43.571429
Educator       F         39.115385
               M         43.101449
Engineer       F         29.500000
               M         36.600000
Entertainment  F         31.000000
               M         29.000000
Executive      F         44.000000
               M         38.172414
Healthcare     F         39.818182
               M         45.400000
Homemaker      F         34.166667
               M         23.000000
Lawyer         F         39.500000
               M         36.200000
Librarian      F         40.000000
               M         40.000000
Marketing      F         37.200000
               M         37.875000
None           F         36.500000
               M         18.600000
Other          F         35.472222
               M         34.028986
Programmer     F         32.16666

### 4. Construct a pivot table showing average `age` per `occupation` and `gender` group

In [4]:
pivot = users.pivot_table(values = 'age', index = 'occupation', columns = 'gender')
pivot.head()

occupation,F,M
Administrator,40.638889,37.162791
Artist,30.307692,32.333333
Doctor,,43.571429
Educator,39.115385,43.101449
Engineer,29.5,36.6
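
The tall result from exercise 3 and the pivot table here hold the same numbers in two layouts. The sketch below, on an assumed toy frame (`toy`, `tall`, `wide`, and `pivot_toy` are illustrative names), shows one way to move between them with `unstack`.

```python
import pandas as pd

# Toy data (assumed values, not taken from the dataset)
toy = pd.DataFrame({
    "occupation": ["Artist", "Artist", "Artist", "Doctor", "Doctor"],
    "gender": ["F", "M", "M", "M", "M"],
    "age": [30, 20, 40, 50, 60],
})

# Tall format: one row per (occupation, gender) pair that occurs in the data
tall = toy.groupby(["occupation", "gender"])["age"].mean()

# Wide format: unstack moves the inner index level (gender) into columns;
# combinations that never occur become NaN
wide = tall.unstack()

# pivot_table aggregates with the mean by default
pivot_toy = toy.pivot_table(values="age", index="occupation", columns="gender")

print(wide)
print(pivot_toy)
```

A missing combination, such as female doctors in this toy frame, is simply absent from the tall layout and appears as `NaN` in the wide one.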


### 5. Determine the mean age per occupation

In [5]:
age_per_occ = users.groupby('occupation').age.mean()
age_per_occ

occupation
Administrator    38.746835
Artist           31.392857
Doctor           43.571429
Educator         42.010526
Engineer         36.388060
Entertainment    29.222222
Executive        38.718750
Healthcare       41.562500
Homemaker        32.571429
Lawyer           36.750000
Librarian        40.000000
Marketing        37.615385
None             26.555556
Other            34.523810
Programmer       33.121212
Retired          63.071429
Salesman         35.666667
Scientist        35.548387
Student          22.081633
Technician       33.148148
Writer           36.311111
Name: age, dtype: float64

### 6. Determine the Male ratio per occupation and sort it from the most to the least

In [6]:
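# Keep only the male users
male = users[users['gender'] == 'M']
# Share of all male users found in each occupation (value_counts sorts descending)
male_rat = male['occupation'].value_counts(normalize = True)
print(male_rat)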

Student          0.202985
Other            0.102985
Educator         0.102985
Engineer         0.097015
Programmer       0.089552
Administrator    0.064179
Executive        0.043284
Scientist        0.041791
Technician       0.038806
Writer           0.038806
Librarian        0.032836
Entertainment    0.023881
Marketing        0.023881
Artist           0.022388
Retired          0.019403
Lawyer           0.014925
Salesman         0.013433
Doctor           0.010448
None             0.007463
Healthcare       0.007463
Homemaker        0.001493
Name: occupation, dtype: float64
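
The `value_counts(normalize = True)` call on the male subset reports each occupation's share of all male users, so the values sum to 1 across occupations. If the intended quantity is instead the fraction of men within each occupation, one possible sketch is shown below; the toy data and the name `male_ratio` are assumptions for illustration.

```python
import pandas as pd

# Toy data (assumed values, not taken from the dataset)
toy = pd.DataFrame({
    "occupation": ["Doctor", "Doctor", "Artist", "Artist", "Artist", "Artist"],
    "gender": ["M", "M", "F", "M", "F", "F"],
})

# The mean of a boolean mask is the fraction of True values,
# here the share of men inside each occupation
male_ratio = (
    toy["gender"].eq("M")
    .groupby(toy["occupation"])
    .mean()
    .sort_values(ascending=False)
)
print(male_ratio)
```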


### 7. Construct a table showing the minimum and maximum ages (columns) per occupation (rows)

In [7]:
import numpy as np
# Select the age column first so that only it is aggregated
stats = users.groupby('occupation')['age'].agg(['min', 'max'])
stats

occupation,min,max
Administrator,21,70
Artist,19,48
Doctor,28,64
Educator,23,63
Engineer,22,70
Entertainment,15,50
Executive,22,69
Healthcare,22,62
Homemaker,20,50
Lawyer,21,53
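
When more descriptive column names are wanted, named aggregation is one option. The sketch below uses an assumed toy frame; `min_age` and `max_age` are illustrative names chosen here, not taken from the notebook.

```python
import pandas as pd

# Toy data (assumed values, not taken from the dataset)
toy = pd.DataFrame({
    "occupation": ["Artist", "Artist", "Doctor", "Doctor"],
    "age": [19, 48, 28, 64],
})

# Named aggregation: keyword = output column name, value = aggregation to apply
age_range = toy.groupby("occupation")["age"].agg(min_age="min", max_age="max")
print(age_range)
```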


### 8. For each occupation, present the percentage of women and men

In [8]:
gender_ratio = users.groupby(['occupation'])['gender'].value_counts(normalize = True) * 100
display(gender_ratio)

occupation     gender
Administrator  M          54.430380
               F          45.569620
Artist         M          53.571429
               F          46.428571
Doctor         M         100.000000
Educator       M          72.631579
               F          27.368421
Engineer       M          97.014925
               F           2.985075
Entertainment  M          88.888889
               F          11.111111
Executive      M          90.625000
               F           9.375000
Healthcare     F          68.750000
               M          31.250000
Homemaker      F          85.714286
               M          14.285714
Lawyer         M          83.333333
               F          16.666667
Librarian      F          56.862745
               M          43.137255
Marketing      M          61.538462
               F          38.461538
None           M          55.555556
               F          44.444444
Other          M          65.714286
               F          34.285714
Progra
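
An alternative way to get the same percentages as a wide table is `pd.crosstab` with `normalize="index"`. This is a sketch on an assumed toy frame, with `gender_pct` as an illustrative name.

```python
import pandas as pd

# Toy data (assumed values, not taken from the dataset)
toy = pd.DataFrame({
    "occupation": ["Doctor", "Doctor", "Artist", "Artist", "Artist", "Artist"],
    "gender": ["M", "M", "F", "M", "F", "F"],
})

# normalize="index" divides each row by its row total;
# multiplying by 100 turns the shares into percentages
gender_pct = pd.crosstab(toy["occupation"], toy["gender"], normalize="index") * 100
print(gender_pct)
```

Unlike the tall `value_counts` result, the crosstab shows a gender with no members in an occupation as an explicit 0 rather than omitting the row.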

### 9. Zero-pad (zero-fill) the zip codes to get a 6-character numeric string

In [9]:
users['zip_code'] = users.zip_code.astype(str).str.zfill(6)
users.zip_code

0      085711
1      094043
2      032067
3      043537
4      015213
        ...  
938    033319
939    002215
940    097229
941    078209
942    077841
Name: zip_code, Length: 943, dtype: object
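
To make the padding behaviour concrete, here is a small sketch on an assumed Series of mixed values (`zips` and `padded` are illustrative names). `str.zfill` pads on the left with `0` until the requested width is reached and leaves longer strings unchanged.

```python
import pandas as pd

# Assumed sample values: an integer, a short numeric string, and a code with letters
zips = pd.Series([85711, "2215", "T8H1N"])

# Convert to str first, since the .str accessor only works on string data
padded = zips.astype(str).str.zfill(6)
print(padded.tolist())  # ['085711', '002215', '0T8H1N']
```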

### 10. Get the oldest user for each occupation and show their zip code

In [10]:
# idxmax returns index labels, so use label-based .loc rather than positional .iloc
index_max = users.groupby('occupation')['age'].idxmax()
zipcode = users.loc[index_max, 'zip_code']
zipcode

802    078212
122    020008
844    097405
857    009645
766    000000
914    060614
558    010022
519    012603
721    017331
9      090703
584    098501
90     001913
417    021206
422    091606
776    001810
480    037771
210    032605
615    050613
187    029440
196    075094
463    094583
Name: zip_code, dtype: object
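
Because `idxmax` returns index labels rather than positions, label-based selection with `.loc` stays correct even when the index is not a default `RangeIndex`. A minimal sketch on an assumed toy frame (`toy`, `oldest_idx`, and `oldest` are illustrative names):

```python
import pandas as pd

# Toy frame with a non-default index to show why labels matter
toy = pd.DataFrame(
    {
        "occupation": ["Artist", "Artist", "Doctor"],
        "age": [48, 19, 64],
        "zip_code": ["020008", "078212", "097405"],
    },
    index=[10, 20, 30],
)

# Index label of the oldest user in each occupation
oldest_idx = toy.groupby("occupation")["age"].idxmax()

# Label-based lookup keeps occupation, age, and zip code together
oldest = toy.loc[oldest_idx, ["occupation", "age", "zip_code"]]
print(oldest)
```

Keeping the `occupation` column in the result makes it easier to see which zip code belongs to which group.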

### 11. Construct a feature showing the number of users per occupation (do not aggregate!)

In [11]:
feature = users['occupation'].value_counts()
feature

Student          196
Other            105
Educator          95
Administrator     79
Engineer          67
Programmer        66
Librarian         51
Writer            45
Executive         32
Scientist         31
Artist            28
Technician        27
Marketing         26
Entertainment     18
Healthcare        16
Retired           14
Lawyer            12
Salesman          12
None               9
Homemaker          7
Doctor             7
Name: occupation, dtype: int64
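
`value_counts` returns one row per occupation, which is an aggregated view. If "do not aggregate" is read as keeping one row per user, a possible sketch uses `groupby(...).transform('count')` to attach the group size to every row. The toy data and the column name `n_occupation` are assumptions for illustration.

```python
import pandas as pd

# Toy data (assumed values, not taken from the dataset)
toy = pd.DataFrame({"occupation": ["Artist", "Doctor", "Artist", "Artist"]})

# transform returns one value per original row, so the frame keeps its length
toy["n_occupation"] = toy.groupby("occupation")["occupation"].transform("count")
print(toy)
```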