In [1]:
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# draw plots inline in the Jupyter notebook
%matplotlib inline

pd.set_option("display.max_columns", 100)

# suppress warnings; comment out the next line if you'd like to see them
warnings.filterwarnings("ignore")

In [2]:
# For Jupyter Book, the data is read from GitHub. To save Internet traffic,
# you can instead point DATA_URL to the local data/ folder at the root of your
# cloned https://github.com/Yorko/mlcourse.ai repo.
DATA_URL = "https://raw.githubusercontent.com/Yorko/mlcourse.ai/main/data/"

In [3]:
df = pd.read_csv(DATA_URL + "adult.data.csv")
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,salary
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


### 1. How many men and women (_sex_ feature) are represented in this dataset?

In [4]:
men_df = df[df["sex"] == "Male"]
women_df = df[df["sex"] == "Female"]
print(f"There are {len(men_df)} men represented in this dataset.")
print(f"There are {len(women_df)} women represented in this dataset.")

There are 21790 men represented in this dataset.
There are 10771 women represented in this dataset.


### 2. What is the average age (_age_ feature) of women?

In [5]:
mean_woman_age = women_df['age'].mean().round(decimals = 1)
print(f'The average age of a woman in this dataset is {mean_woman_age} years.')

The average age of a woman in this dataset is 36.9 years.


### 3. What is the percentage of German citizens (_native-country_ feature)?

In [6]:
df_native_country = (df['native-country'].value_counts(normalize=True)
                .mul(100)
                .rename_axis('native-country')
                .reset_index(name='percentage'))

germany_df = df_native_country[df_native_country['native-country'] == 'Germany']

display(germany_df)

Unnamed: 0,native-country,percentage
4,Germany,0.420749
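
The same percentage can also be read off a boolean mask, because the mean of a boolean Series is the fraction of `True` values. The following is a minimal sketch on made-up toy data; the `country_share` helper and the `toy` frame are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd


def country_share(frame: pd.DataFrame, country: str) -> float:
    """Return the percentage of rows whose native-country equals `country`."""
    return (frame["native-country"] == country).mean() * 100


# Hypothetical toy data, used only so the sketch runs on its own.
toy = pd.DataFrame({"native-country": ["Germany", "Cuba", "Germany", "Japan"]})
print(country_share(toy, "Germany"))
```

On the real data, `country_share(df, "Germany")` should agree with the `percentage` column shown above.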


### 4-5. What are the mean and standard deviation of age for those who earn more than 50K per year (_salary_ feature) and those who earn less than 50K per year?

In [7]:
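# Select only the numeric "age" column; "salary" is a string and cannot be averaged.
# In the result, index True means salary <=50K and False means salary >50K.
df.groupby(df["salary"] == "<=50K")[["age"]].agg(["mean", "std"])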

salary,age mean,age std
False,44.249841,10.519028
True,36.783738,14.020088


### 6. Is it true that people who earn more than 50K have at least high school education? (education – `Bachelors`, `Prof-school`, `Assoc-acdm`, `Assoc-voc`, `Masters` or `Doctorate` feature)

In [8]:
filter_by = ["Bachelors","Prof-school","Assoc-acdm","Assoc-voc","Masters","Doctorate"]
df_high_salary_no_education = df.loc[(~df['education'].isin(filter_by)) & (df['salary'] == '>50K')]

display(df_high_salary_no_education)

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,salary
7,52,Self-emp-not-inc,209642,HS-grad,9,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,45,United-States,>50K
10,37,Private,280464,Some-college,10,Married-civ-spouse,Exec-managerial,Husband,Black,Male,0,0,80,United-States,>50K
27,54,?,180211,Some-college,10,Married-civ-spouse,?,Husband,Asian-Pac-Islander,Male,0,0,60,South,>50K
38,31,Private,84154,Some-college,10,Married-civ-spouse,Sales,Husband,White,Male,0,0,38,?,>50K
55,43,Private,237993,Some-college,10,Married-civ-spouse,Tech-support,Husband,White,Male,0,0,40,United-States,>50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32510,39,Private,107302,HS-grad,9,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,0,45,?,>50K
32518,57,Local-gov,110417,HS-grad,9,Married-civ-spouse,Craft-repair,Husband,White,Male,99999,0,40,United-States,>50K
32519,46,Private,364548,Some-college,10,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,48,United-States,>50K
32557,40,Private,154374,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K


It is not true. As the table above shows, 3306 people earn more than 50K even though their education level (for example, `HS-grad` or `Some-college`) is not one of the listed categories.
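
Because the displayed table is truncated, a per-category count can make this answer easier to check. The sketch below assumes the same column names as the dataset but runs on a made-up toy frame; `LISTED_LEVELS` and `unlisted_education_counts` are hypothetical names introduced for illustration.

```python
import pandas as pd

# Education levels named in the question.
LISTED_LEVELS = [
    "Bachelors", "Prof-school", "Assoc-acdm",
    "Assoc-voc", "Masters", "Doctorate",
]


def unlisted_education_counts(frame: pd.DataFrame) -> pd.Series:
    """Count high earners per education level outside LISTED_LEVELS."""
    high = frame[frame["salary"] == ">50K"]
    outside = high[~high["education"].isin(LISTED_LEVELS)]
    return outside["education"].value_counts()


# Hypothetical toy data, used only so the sketch runs on its own.
toy = pd.DataFrame({
    "education": ["HS-grad", "Bachelors", "Some-college", "HS-grad", "HS-grad"],
    "salary": [">50K", ">50K", ">50K", ">50K", "<=50K"],
})
counts = unlisted_education_counts(toy)
print(counts)
print("Total:", counts.sum())
```

Applied to `df`, the total of these counts should match the number of rows in the filtered frame.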

### 7. Display age statistics for each race (_race_ feature) and each gender (_sex_ feature). Use groupby() and describe(). Find the maximum age of men of `Amer-Indian-Eskimo` race.

In [9]:
df.groupby(['race'])["age"].describe(percentiles=[])

race,count,mean,std,min,50%,max
Amer-Indian-Eskimo,311.0,37.173633,12.44713,17.0,35.0,82.0
Asian-Pac-Islander,1039.0,37.746872,12.825133,17.0,36.0,90.0
Black,3124.0,37.767926,12.75929,17.0,36.0,90.0
Other,271.0,33.457565,11.538865,17.0,31.0,77.0
White,27816.0,38.769881,13.782306,17.0,37.0,90.0


In [10]:
df.groupby(['sex'])["age"].describe(percentiles=[])

sex,count,mean,std,min,50%,max
Female,10771.0,36.85823,14.013697,17.0,35.0,90.0
Male,21790.0,39.433547,13.37063,17.0,38.0,90.0
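
The two tables above group by race and by sex separately, so neither isolates men of `Amer-Indian-Eskimo` race. Grouping by both columns at once gives one row per (race, sex) pair, from which the maximum can be selected. This is a minimal sketch on made-up toy data (the `toy` frame is an assumption for illustration); on the real data, replace `toy` with `df`.

```python
import pandas as pd

# Hypothetical toy data, used only so the sketch runs on its own.
toy = pd.DataFrame({
    "race": ["Amer-Indian-Eskimo", "Amer-Indian-Eskimo", "White", "Amer-Indian-Eskimo"],
    "sex": ["Male", "Female", "Male", "Male"],
    "age": [30, 45, 60, 52],
})

# One row of summary statistics per (race, sex) combination.
stats = toy.groupby(["race", "sex"])["age"].describe()

# Pick the "max" statistic for the group of interest.
max_age = stats.loc[("Amer-Indian-Eskimo", "Male"), "max"]
print(stats)
print(max_age)
```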


### 8. Among whom is the proportion of those who earn a lot (`>50K`) greater: married or single men (_marital-status_ feature)? Consider as married those who have a _marital-status_ starting with Married (`Married-civ-spouse`, `Married-spouse-absent` or `Married-AF-spouse`), the rest are considered bachelors.

In [11]:
married_statuses = ["Married-civ-spouse", "Married-spouse-absent", "Married-AF-spouse"]
is_married = df["marital-status"].isin(married_statuses)
is_high_salary = df["salary"] == ">50K"
# Note: these are raw counts over all people; no filter on sex is applied here.
num_married_high_salary = (is_married & is_high_salary).sum()
num_single_high_salary = (~is_married & is_high_salary).sum()

print(f"Number of married people who earn >50K is {num_married_high_salary}. Number of single people who earn >50K is {num_single_high_salary}.")

Number of married people who earn >50K is 6736. Number of single people who earn >50K is 1105.
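
The question asks for a proportion among men, so the counts need to be restricted to `sex == "Male"` and divided by the size of each group. The sketch below shows one way to do that, assuming the dataset's column names; the `high_earner_share_by_marriage` helper and the `toy` frame are hypothetical and only serve to make the example runnable.

```python
import pandas as pd


def high_earner_share_by_marriage(frame: pd.DataFrame) -> pd.Series:
    """Share of >50K earners among married and single men."""
    men = frame[frame["sex"] == "Male"]
    # Statuses starting with "Married" count as married; the rest as single.
    group = (
        men["marital-status"]
        .str.startswith("Married")
        .map({True: "married", False: "single"})
    )
    is_high = men["salary"] == ">50K"
    # Mean of a boolean Series per group is the proportion of True values.
    return is_high.groupby(group).mean()


# Hypothetical toy data, used only so the sketch runs on its own.
toy = pd.DataFrame({
    "sex": ["Male", "Male", "Male", "Male", "Female"],
    "marital-status": [
        "Married-civ-spouse", "Married-AF-spouse",
        "Never-married", "Divorced", "Married-civ-spouse",
    ],
    "salary": [">50K", "<=50K", "<=50K", "<=50K", ">50K"],
})
shares = high_earner_share_by_marriage(toy)
print(shares)
```

Comparing `shares["married"]` with `shares["single"]` on the real data answers the question directly.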


### 9. What is the maximum number of hours a person works per week (_hours-per-week_ feature)? How many people work such a number of hours, and what is the percentage of those who earn a lot (`>50K`) among them?

In [12]:
print(f'Maximum number of hours a person works per week is {df["hours-per-week"].max()}.')

Maximum number of hours a person works per week is 99.


In [13]:
max_hours = df["hours-per-week"].max()
num_of_people = (df["hours-per-week"] == max_hours).sum()
print(f"There are {num_of_people} people who work {max_hours} hours per week.")

There are 85 people who work 99 hours per week.
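
The last part of the question, the percentage of high earners among those who work the maximum number of hours, can be computed from the same filter. This is a minimal sketch assuming the dataset's column names; `max_hours_summary` and the `toy` frame are hypothetical names introduced for illustration.

```python
import pandas as pd


def max_hours_summary(frame: pd.DataFrame):
    """Return (max hours, number of such workers, % of them earning >50K)."""
    max_hours = frame["hours-per-week"].max()
    hardest_workers = frame[frame["hours-per-week"] == max_hours]
    pct_high = (hardest_workers["salary"] == ">50K").mean() * 100
    return max_hours, len(hardest_workers), pct_high


# Hypothetical toy data, used only so the sketch runs on its own.
toy = pd.DataFrame({
    "hours-per-week": [40, 99, 99, 99, 99, 20],
    "salary": ["<=50K", ">50K", "<=50K", "<=50K", "<=50K", ">50K"],
})
hours, count, pct = max_hours_summary(toy)
print(f"{count} people work {hours} hours per week; {pct:.0f}% of them earn >50K.")
```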


### 10. Count the average time of work (_hours-per-week_) for those who earn a little and a lot (_salary_) for each country (_native-country_). What will these be for Japan?

In [14]:
df.query("salary == '>50K'").groupby("native-country")["hours-per-week"].mean()

native-country
?                     45.547945
Cambodia              40.000000
Canada                45.641026
China                 38.900000
Columbia              50.000000
Cuba                  42.440000
Dominican-Republic    47.000000
Ecuador               48.750000
El-Salvador           45.000000
England               44.533333
France                50.750000
Germany               44.977273
Greece                50.625000
Guatemala             36.666667
Haiti                 42.750000
Honduras              60.000000
Hong                  45.000000
Hungary               50.000000
India                 46.475000
Iran                  47.500000
Ireland               48.000000
Italy                 45.400000
Jamaica               41.100000
Japan                 47.958333
Laos                  40.000000
Mexico                46.575758
Nicaragua             37.500000
Peru                  40.000000
Philippines           43.032787
Poland                39.000000
Portugal              41.

In [15]:
df.query("salary == '<=50K'").groupby("native-country")["hours-per-week"].mean()

native-country
?                             40.164760
Cambodia                      41.416667
Canada                        37.914634
China                         37.381818
Columbia                      38.684211
Cuba                          37.985714
Dominican-Republic            42.338235
Ecuador                       38.041667
El-Salvador                   36.030928
England                       40.483333
France                        41.058824
Germany                       39.139785
Greece                        41.809524
Guatemala                     39.360656
Haiti                         36.325000
Holand-Netherlands            40.000000
Honduras                      34.333333
Hong                          39.142857
Hungary                       31.300000
India                         38.233333
Iran                          41.440000
Ireland                       40.947368
Italy                         39.625000
Jamaica                       38.239437
Japan                    

As the tables above show, people born in Japan who earn a lot (>50K) work about 48 hours per week on average (47.96), while those who earn a little (<=50K) work 41 hours per week on average.
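
Both salary groups can also be placed side by side in a single table with `pd.crosstab`, which avoids running two separate queries. The sketch below assumes the dataset's column names and uses a made-up toy frame; `mean_hours_table` is a hypothetical helper name.

```python
import pandas as pd


def mean_hours_table(frame: pd.DataFrame) -> pd.DataFrame:
    """Mean hours-per-week for each native-country (rows) and salary (columns)."""
    return pd.crosstab(
        frame["native-country"],
        frame["salary"],
        values=frame["hours-per-week"],
        aggfunc="mean",
    )


# Hypothetical toy data, used only so the sketch runs on its own.
toy = pd.DataFrame({
    "native-country": ["Japan", "Japan", "Japan", "Cuba"],
    "salary": [">50K", ">50K", "<=50K", "<=50K"],
    "hours-per-week": [50, 40, 38, 40],
})
table = mean_hours_table(toy)
japan = table.loc["Japan"]
print(japan)
```

On the real data, `mean_hours_table(df).loc["Japan"]` returns both averages for Japan in one row. Country and salary combinations with no rows appear as `NaN`.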