# Aggregations

### Step 1. Import the necessary libraries

In [15]:
import pandas as pd
import numpy as np

### Step 2. Import the dataset occupation.csv from the folder or [here](https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user). 

In [99]:
# Raw string so the backslashes in the Windows path are not read as escape sequences.
users = pd.read_csv(r"C:\Analytix\Assignments\Dataset\occupation.csv", sep="|")
users.head()

Unnamed: 0,user_id,age,gender,occupation,zip_code
0,1,24,M,technician,85711
1,2,53,F,other,94043
2,3,23,M,writer,32067
3,4,24,M,technician,43537
4,5,33,F,other,15213
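
To make the role of `sep="|"` concrete, here is a minimal sketch that parses a few inline rows instead of a file. It assumes the pipe-delimited layout implied by the call above; the sample values are copied from the preview, and `index_col="user_id"` and the string dtype for `zip_code` are optional choices made here for illustration, not part of the original solution.

```python
import io

import pandas as pd

# A few rows in the assumed pipe-delimited layout (values taken from the preview above).
sample = """user_id|age|gender|occupation|zip_code
1|24|M|technician|85711
2|53|F|other|94043
3|23|M|writer|32067
"""

# sep="|" splits each line on the pipe character.
# index_col uses user_id as the row label instead of a generated 0..n-1 index.
# Reading zip_code as str keeps any leading zeros intact.
sample_users = pd.read_csv(
    io.StringIO(sample),
    sep="|",
    index_col="user_id",
    dtype={"zip_code": str},
)
print(sample_users.head())
```

The same keyword arguments can be passed when reading from a local path or a URL.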


### Step 3. Assign it to a variable called users.

### Step 4. Find the mean age per occupation

In [92]:
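# Select the age column before averaging; gender and zip_code are not numeric,
# and recent pandas versions raise an error when asked for their mean.
userArr = users.groupby('occupation')['age'].mean().reset_index()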
print("The mean age per occupation of:")
for index,row in userArr.iterrows():
    print("{}: {}".format(row['occupation'], round(row['age'],2)))

The mean age per occupation of:
administrator: 38.75
artist: 31.39
doctor: 43.57
educator: 42.01
engineer: 36.39
entertainment: 29.22
executive: 38.72
healthcare: 41.56
homemaker: 32.57
lawyer: 36.75
librarian: 40.0
marketing: 37.62
none: 26.56
other: 34.52
programmer: 33.12
retired: 63.07
salesman: 35.67
scientist: 35.55
student: 22.08
technician: 33.15
writer: 36.31
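
The per-occupation mean can also be read straight from the grouped result without an explicit loop. The sketch below uses a small invented DataFrame (`toy` and its values are hypothetical, not taken from the dataset) to show that the result is a Series indexed by occupation.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "occupation": ["writer", "writer", "student", "student", "student"],
    "age": [30, 41, 20, 22, 27],
})

# Group by occupation, average the age column, and round to two decimals.
mean_age = toy.groupby("occupation")["age"].mean().round(2)
print(mean_age)
```

Individual values can then be looked up by label, for example `mean_age["writer"]`.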


### Step 5. Discover the Male ratio per occupation and sort it from the most to the least.

Use numpy.where() to encode the gender column.

In [180]:
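occ_arr = users
# Encode gender as 1 for male and 0 for female. occ_arr refers to the same object
# as users, so the gender_bin column is added to users as well.
occ_arr['gender_bin'] = np.where(occ_arr.gender == 'M', 1, 0)
occ_arr = occ_arr.groupby('occupation')
sortedArr = {}
for name, group in occ_arr:
    # Male ratio = number of 1s divided by the number of rows in the group.
    sortedArr[name] = group['gender_bin'].sum() / group['gender_bin'].count()

# Sort ascending by ratio, then walk the list backwards to print from most to least.
sortedArr = sorted(sortedArr.items(), key=lambda x: x[1])
print("Male ratio per occupation sorted from most to the least:")
for i in reversed(sortedArr):
    print("{}: {}".format(i[0], round(i[1], 2)))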

Male ratio per occupation sorted from most to the least:
doctor: 1.0
engineer: 0.97
technician: 0.96
retired: 0.93
programmer: 0.91
executive: 0.91
scientist: 0.9
entertainment: 0.89
lawyer: 0.83
salesman: 0.75
educator: 0.73
student: 0.69
other: 0.66
marketing: 0.62
writer: 0.58
none: 0.56
administrator: 0.54
artist: 0.54
librarian: 0.43
healthcare: 0.31
homemaker: 0.14
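
Because the mean of a 0/1 (or True/False) column is the fraction of ones, the same ratio can be computed without a loop. This is a sketch of that alternative on invented data (`toy` is hypothetical), not a replacement for the solution above.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "occupation": ["doctor", "doctor", "artist", "artist", "artist", "artist"],
    "gender": ["M", "M", "F", "M", "F", "F"],
})

# eq("M") yields True/False; the group mean of that column is the male ratio.
male_ratio = (
    toy["gender"].eq("M")
    .groupby(toy["occupation"])
    .mean()
    .sort_values(ascending=False)
)
print(male_ratio)
```

Sorting with `ascending=False` gives the "most to least" order directly.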


### Step 6. For each occupation, calculate the minimum and maximum ages

In [227]:
# Aggregate only the age column. The string names 'max' and 'min' give stable
# column labels; labels derived from np.max/np.min vary between library versions.
new_arr = users.groupby('occupation')['age'].agg(['max', 'min']).reset_index()
print("Minimum and Maximum age of different occupation are:")
for index, row in new_arr.iterrows():
    print("{}:".format(row['occupation']))
    print("   max: {}".format(row['max']))
    print("   min: {}".format(row['min']))

Minimum and Maximum age of different occupation are:
administrator:
   max: 70
   min: 21
artist:
   max: 48
   min: 19
doctor:
   max: 64
   min: 28
educator:
   max: 63
   min: 23
engineer:
   max: 70
   min: 22
entertainment:
   max: 50
   min: 15
executive:
   max: 69
   min: 22
healthcare:
   max: 62
   min: 22
homemaker:
   max: 50
   min: 20
lawyer:
   max: 53
   min: 21
librarian:
   max: 69
   min: 23
marketing:
   max: 55
   min: 24
none:
   max: 55
   min: 11
other:
   max: 64
   min: 13
programmer:
   max: 63
   min: 20
retired:
   max: 73
   min: 51
salesman:
   max: 66
   min: 18
scientist:
   max: 55
   min: 23
student:
   max: 42
   min: 7
technician:
   max: 55
   min: 21
writer:
   max: 60
   min: 18
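
When several statistics are needed from one column, named aggregation lets the output columns carry explicit names. The following is a minimal sketch on invented data; `toy`, `min_age`, and `max_age` are names chosen here for illustration.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "occupation": ["lawyer", "lawyer", "student", "student"],
    "age": [53, 21, 7, 42],
})

# Each keyword is an output column; the tuple is (source column, aggregation).
age_range = toy.groupby("occupation").agg(
    min_age=("age", "min"),
    max_age=("age", "max"),
)
print(age_range)
```

The result has one row per occupation and flat column names, so no multi-level column handling is needed.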


### Step 7. For each combination of occupation and gender, calculate the mean age

In [249]:
# Mean age for each (occupation, gender) pair; reset_index() turns the group keys
# back into ordinary columns.
occ_arr1 = users.groupby(['occupation', 'gender'])['age'].mean().reset_index()
print("Mean age of different combinations of occupation and gender:")
for index, row in occ_arr1.iterrows():
    print("{} and {}".format(row['occupation'], row['gender']))
    print("   mean age: {}".format(round(row['age'],2)))

Mean age of different combinations of occupation and gender:
administrator and F
   mean age: 40.64
administrator and M
   mean age: 37.16
artist and F
   mean age: 30.31
artist and M
   mean age: 32.33
doctor and M
   mean age: 43.57
educator and F
   mean age: 39.12
educator and M
   mean age: 43.1
engineer and F
   mean age: 29.5
engineer and M
   mean age: 36.6
entertainment and F
   mean age: 31.0
entertainment and M
   mean age: 29.0
executive and F
   mean age: 44.0
executive and M
   mean age: 38.17
healthcare and F
   mean age: 39.82
healthcare and M
   mean age: 45.4
homemaker and F
   mean age: 34.17
homemaker and M
   mean age: 23.0
lawyer and F
   mean age: 39.5
lawyer and M
   mean age: 36.2
librarian and F
   mean age: 40.0
librarian and M
   mean age: 40.0
marketing and F
   mean age: 37.2
marketing and M
   mean age: 37.88
none and F
   mean age: 36.5
none and M
   mean age: 18.6
other and F
   mean age: 35.47
other and M
   mean age: 34.03
programmer and F
   mean age
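
The listing above has no "doctor and F" row because the grouped result only contains combinations that occur in the data. A sketch of one way to make such gaps visible is to pivot gender into columns with `unstack`; the data below is invented (`toy` is hypothetical).

```python
import pandas as pd

# Toy data, invented for illustration only; there is no female doctor.
toy = pd.DataFrame({
    "occupation": ["doctor", "doctor", "artist", "artist", "artist"],
    "gender": ["M", "M", "F", "F", "M"],
    "age": [40, 50, 30, 32, 28],
})

# Group by both keys, then move the gender level into the columns.
mean_age = toy.groupby(["occupation", "gender"])["age"].mean()
table = mean_age.unstack("gender").round(2)
print(table)
```

Missing combinations show up as NaN in the resulting table rather than being silently absent.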

### Step 8. For each occupation, present the percentage of women and men

In [276]:
# Relies on the gender_bin column (1 = male, 0 = female) that Step 5 added to users.
occ_arr = users.groupby(['occupation'])
total = occ_arr['gender_bin'].count()   # all users in the occupation
men = occ_arr['gender_bin'].sum()       # the 1s are the men
women = total - men
data = pd.concat([men,women,total], axis=1).reset_index()
data.columns = ['occupation', 'men', 'women', 'total']
print("For each Occupation, percentage of women and men:")
for index, row in data.iterrows():
    print("{}: ".format(row['occupation']))
    print("   men: {}%".format(round((row['men']/row['total']) * 100,2)))
    print("   women: {}%".format(round((row['women']/row['total']) * 100,2)))

For each Occupation, percentage of women and men:
administrator: 
   men: 54.43%
   women: 45.57%
artist: 
   men: 53.57%
   women: 46.43%
doctor: 
   men: 100.0%
   women: 0.0%
educator: 
   men: 72.63%
   women: 27.37%
engineer: 
   men: 97.01%
   women: 2.99%
entertainment: 
   men: 88.89%
   women: 11.11%
executive: 
   men: 90.62%
   women: 9.38%
healthcare: 
   men: 31.25%
   women: 68.75%
homemaker: 
   men: 14.29%
   women: 85.71%
lawyer: 
   men: 83.33%
   women: 16.67%
librarian: 
   men: 43.14%
   women: 56.86%
marketing: 
   men: 61.54%
   women: 38.46%
none: 
   men: 55.56%
   women: 44.44%
other: 
   men: 65.71%
   women: 34.29%
programmer: 
   men: 90.91%
   women: 9.09%
retired: 
   men: 92.86%
   women: 7.14%
salesman: 
   men: 75.0%
   women: 25.0%
scientist: 
   men: 90.32%
   women: 9.68%
student: 
   men: 69.39%
   women: 30.61%
technician: 
   men: 96.3%
   women: 3.7%
writer: 
   men: 57.78%
   women: 42.22%
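
The same percentages can be obtained in one call with `pd.crosstab`, which counts each occupation and gender pair and can normalize every row to sum to 1. This is a sketch on invented data (`toy` is hypothetical), offered as an alternative rather than as the original approach.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "occupation": ["salesman"] * 4 + ["doctor"] * 2,
    "gender": ["M", "M", "M", "F", "M", "M"],
})

# normalize="index" divides each row by its total; multiply by 100 for percentages.
pct = (pd.crosstab(toy["occupation"], toy["gender"], normalize="index") * 100).round(2)
print(pct)
```

Occupations with no women still get an `F` column entry of 0.0, because the columns cover every gender value seen in the data.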
