# Report 1

This report provides insights into employees' emotional state and how it evolves in each industry they work in.

In [60]:
import numpy as np
import modin.pandas as pd
import gender_guesser.detector as gender

from PsqlConn import create_psql_engine

In [2]:
engine = create_psql_engine()

____________________
## Population description
First, let's look at our population and see how users are distributed across companies, industries and ages.

In [8]:
# Query to fetch a simple user table with company and industry names
pop_q = '''
-- Get users and their company and industry
select distinct u.id as userid,
                u.name as username,
                u.age as userage,
                cm.name as companyname,
                i.name as industryname
from users u 
left join companies cm
on(cast(u.companyid as text) = cm.id )
left join industries i 
on(i.id = cast(u.industryid as text) )
;
'''

pop_df = pd.read_sql_query(pop_q, engine)
pop_df.head()



| | userid | username | userage | companyname | industryname |
|---|---|---|---|---|---|
| 0 | 517a8deb-14d2-4662-8ca5-a2515bf06326 | Aurelia | 21 | Apple | Marketing |
| 1 | b3152a96-6055-44f5-87fe-4d1763129bb9 | Alisa | 25 | Pied Piper | Marketing |
| 2 | 5c185183-ee83-426b-b535-8d9ba776c731 | Chester | 36 | Twitter | Marketing |
| 3 | 4a45d028-906e-4e42-b768-7071b48bc6e4 | Zechariah | 36 | Pied Piper | Finance |
| 4 | d94ca93a-66f3-4972-a07d-c774109ba7d3 | Ramona | 29 | Twitter | Sales |
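Because the query uses left joins, a user whose `companyid` or `industryid` has no match would come back with a null `companyname` or `industryname`. A minimal sketch of how that could be checked, using plain pandas and a synthetic stand-in frame (`demo_df` and its rows are made up for illustration, not taken from the database):

```python
import pandas as pd

# Synthetic stand-in for pop_df; the last user has no matching company row.
demo_df = pd.DataFrame({
    "userid": ["u1", "u2", "u3"],
    "companyname": ["Apple", "Twitter", None],
    "industryname": ["Marketing", "Sales", "Tech"],
})

# Count nulls per joined column; a non-zero count means some users had no match.
missing = demo_df[["companyname", "industryname"]].isna().sum()
print(missing)
```

Running the same `isna().sum()` on `pop_df` would show whether any users need to be excluded or labelled before grouping by company or industry.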


Straight away, one way to augment our data is to guess each user's gender from their name. We can infer it using a package such as https://pypi.org/project/gender-guesser/.

In [24]:
# Create a detector object
genderDetector = gender.Detector()

# Guess the gender for each name, then check what we got
pop_df['genderguess'] = pop_df.username.apply(genderDetector.get_gender)

print(f"Out of {pop_df['userid'].nunique()} users, this was the proportion of genders guessed:")
display(pop_df['genderguess'].value_counts())

Out of 2000 users, this was the proportion of genders guessed:


female           883
male             831
unknown          147
mostly_male       62
mostly_female     51
andy              26
dtype: int64

With this we were able to guess around 86% of users as definitively either male or female. Some remarks:

* The `mostly_male` and `mostly_female` labels represent 3.1% and 2.6% of the population, so we are going to merge them into the `male` and `female` labels for simplicity.
* The `andy` label represents names that can be either male or female. This label represents 1.3% of our population. Since this is such a low value, we are going to keep it as its own category instead of drilling down further.
* The `unknown` label represents names that the detector was unable to assign to either gender. They are 7.35% of our population, so it's also a small subset. The total number of distinct names is 116. Since the gender is just a guess, there is no point in trying to fit these into either the male or the female label.
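The percentages quoted above follow directly from the label counts printed earlier. A small sketch that reproduces the arithmetic in plain Python (the `counts` dictionary is copied by hand from that output):

```python
# Label counts copied from the value_counts() output above.
counts = {
    "female": 883,
    "male": 831,
    "unknown": 147,
    "mostly_male": 62,
    "mostly_female": 51,
    "andy": 26,
}

total = sum(counts.values())

# Share of each label, in percent of all users.
shares = {label: 100 * n / total for label, n in counts.items()}

# Users guessed as definitively male or female.
definitive = shares["female"] + shares["male"]

for label, pct in shares.items():
    print(f"{label:>14}: {pct:.2f}%")
print(f"definitive male/female: {definitive:.1f}%")
```

In pandas, `value_counts(normalize=True)` on the `genderguess` column gives the same shares as fractions without the manual division.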

In [43]:
# Replace the `mostly_xxx` labels with the definitive ones
gender_simpl_dict = {
    "mostly_male": "male",
    "mostly_female": "female",
}
pop_df["genderguess"] = pop_df["genderguess"].replace(gender_simpl_dict)

# Store the labels as a categorical dtype to save memory
pop_df["genderguess"] = pop_df["genderguess"].astype('category')
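The memory saving from the categorical conversion can be checked with `memory_usage(deep=True)`. The sketch below uses plain pandas rather than modin and a synthetic series of labels (`labels` is made up for illustration), so the exact byte counts will differ from those of `pop_df`:

```python
import pandas as pd

# Synthetic column of 2000 repeated labels, standing in for genderguess.
labels = pd.Series(["male", "female", "unknown", "andy"] * 500)

# deep=True also counts the memory used by the string values themselves.
as_strings = labels.memory_usage(deep=True)
as_category = labels.astype("category").memory_usage(deep=True)

print(f"strings:  {as_strings} bytes")
print(f"category: {as_category} bytes")
```

A categorical column stores each distinct label once plus a small integer code per row, which is why it tends to be smaller when a few labels repeat many times.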

In [11]:
# Let's check how many users we have per company and per industry
print("Number of unique users per company")
display(pop_df.groupby("companyname")["userid"].nunique())

print("\nNumber of unique users per industry")
display(pop_df.groupby("industryname")["userid"].nunique())

Number of unique users per company


companyname
Apple         682
Facebook      333
Pied Piper    320
Twitter       665
Name: userid, dtype: int64


Number of unique users per industry


industryname
Finance      313
Marketing    676
Sales        371
Tech         640
Name: userid, dtype: int64

In [84]:
# Aggregation helpers: each receives the genderguess values of one group
def male_count(series: pd.Series):
    return (series == 'male').sum()

def female_count(series: pd.Series):
    return (series == 'female').sum()

def male_female_ratio(series: pd.Series):
    return male_count(series) / female_count(series)

def unknown_count(series: pd.Series):
    return (series == 'unknown').sum()

In [85]:
# Let's get a distribution of the user age and gender per company + industry pair
print("Overview of users' features per company and industry:")
display(
    pop_df.groupby(["companyname", "industryname"]).agg({
        'userid': 'nunique',
        'userage': ['min', 'median', 'max', 'mean'],
        'genderguess': [male_count, female_count, male_female_ratio, unknown_count],
    })
)

Overview of users' features per company and industry:


| companyname | industryname | userid nunique | userage min | userage median | userage max | userage mean | genderguess male_count | genderguess female_count | genderguess male_female_ratio | genderguess unknown_count |
|---|---|---|---|---|---|---|---|---|---|---|
| Apple | Finance | 96 | 18 | 27.0 | 40 | 27.208333 | 43 | 47 | 0.914894 | 5 |
| Apple | Marketing | 242 | 18 | 25.5 | 41 | 26.475207 | 110 | 112 | 0.982143 | 15 |
| Apple | Sales | 131 | 18 | 25.0 | 39 | 25.900763 | 57 | 65 | 0.876923 | 6 |
| Apple | Tech | 213 | 18 | 25.0 | 41 | 26.309859 | 94 | 102 | 0.921569 | 16 |
| Facebook | Finance | 53 | 18 | 27.0 | 41 | 26.981132 | 24 | 23 | 1.043478 | 4 |
| Facebook | Marketing | 116 | 18 | 27.0 | 41 | 27.12931 | 60 | 44 | 1.363636 | 11 |
| Facebook | Sales | 47 | 18 | 29.0 | 41 | 28.851064 | 15 | 27 | 0.555556 | 5 |
| Facebook | Tech | 117 | 18 | 27.0 | 41 | 27.350427 | 48 | 57 | 0.842105 | 9 |
| Pied Piper | Finance | 47 | 18 | 24.0 | 40 | 26.191489 | 24 | 21 | 1.142857 | 1 |
| Pied Piper | Marketing | 110 | 18 | 25.0 | 40 | 26.054545 | 40 | 61 | 0.655738 | 7 |


In the table above, the groups obtained by splitting users per company and industry look very similar in age and gender.

* The minimum age is 18 years in every group.
* The maximum age is either 40 or 41, with one exception (39).
* The median age is always between 24 and 29 years old, while the mean falls roughly between 26 and 29 years old (25.9 to 28.9 in the rows shown).
* For `genderguess`, the male-to-female ratio is also similar among the different groups and usually close to 1, with a few groups further away (for example 0.56 for Facebook/Sales and 1.36 for Facebook/Marketing).
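The "close to 1" observation can be made more concrete by recomputing the ratio from the male and female counts in the table and flagging the groups that deviate most. This is only a sketch: the counts are copied by hand from the rows shown above, and the 0.25 cut-off (`THRESHOLD`) is an arbitrary value chosen for illustration.

```python
# (company, industry) -> (male_count, female_count), copied from the table above.
group_counts = {
    ("Apple", "Finance"): (43, 47),
    ("Apple", "Marketing"): (110, 112),
    ("Apple", "Sales"): (57, 65),
    ("Apple", "Tech"): (94, 102),
    ("Facebook", "Finance"): (24, 23),
    ("Facebook", "Marketing"): (60, 44),
    ("Facebook", "Sales"): (15, 27),
    ("Facebook", "Tech"): (48, 57),
    ("Pied Piper", "Finance"): (24, 21),
    ("Pied Piper", "Marketing"): (40, 61),
}

THRESHOLD = 0.25  # hypothetical cut-off for "far from 1"

# Male-to-female ratio per group.
ratios = {group: males / females for group, (males, females) in group_counts.items()}

# Groups whose ratio is more than THRESHOLD away from parity.
outliers = {group: r for group, r in ratios.items() if abs(r - 1) > THRESHOLD}

for group, r in sorted(outliers.items()):
    print(group, round(r, 3))
```

With this cut-off, only Facebook/Marketing, Facebook/Sales and Pied Piper/Marketing are flagged among the rows shown; the remaining groups stay within 0.25 of parity.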