
This script:  

✅ Mounts Google Drive and navigates to the specified directory to load the dataset.  

✅ Reads the dataset (`adult.csv`) and removes rows containing `"?"` in any column to clean the data.  

✅ Converts salary categories (`<=50K` and `>50K`) into numeric placeholder values (50000 and 60000) for easier analysis.  

✅ Adjusts the salary column by multiplying the values in `"salary K$"` by 1000.  

✅ Analyzes gender distribution in the dataset.  

✅ Computes the average age of men, rounding the result.  

✅ Determines the percentage of people from Poland.  

✅ Finds the number of individuals without higher education who earn more than 50K.  

✅ Generates age statistics for each education level using `groupby` and `describe()`.  

✅ Compares average salaries of married and unmarried men, identifying the group with higher earnings.  

✅ Identifies the maximum number of hours worked per week and counts how many people work that many hours.  

This exercise builds practice with **NumPy** and **Pandas**, covering data cleaning, filtering, grouping, and descriptive statistics.

In [1]:
import pandas as pd
import numpy as np
from google.colab import drive

drive.mount("/content/drive")

%cd /content/drive/MyDrive/Mate_Homework


# Load the dataset
adult_data = pd.read_csv("adult.csv")

Mounted at /content/drive
/content/drive/MyDrive/Mate_Homework


In [2]:
# Opt in to the future behavior for downcasting (no silent downcasting)
pd.set_option("future.no_silent_downcasting", True)

In [3]:
# Filter rows where no column contains the "?" symbol and create a copy
adult_data_cleaned = adult_data[~adult_data.isin(["?"]).any(axis=1)].copy()
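
To make the row filter above easier to follow, here is a minimal sketch on a toy frame. The column values are invented for illustration and are not taken from `adult.csv`. `isin(["?"])` marks matching cells, `any(axis=1)` flags rows with at least one match, and `~` inverts the mask.

```python
import pandas as pd

# Toy data, made up for illustration only
toy = pd.DataFrame({
    "workclass": ["Private", "?", "State-gov"],
    "occupation": ["Sales", "Tech-support", "?"],
})

# True for every row that has "?" in at least one column
has_unknown = toy.isin(["?"]).any(axis=1)

# Keep only the rows without "?" and take an independent copy
toy_cleaned = toy[~has_unknown].copy()
print(toy_cleaned)
```

Only the first toy row survives, because each of the other two contains a `"?"` in one column.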

In [4]:
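# Replace the salary categories with numeric stand-ins for easier handling.
# Note: 50000 and 60000 are placeholder values for the two brackets, not actual salaries.
adult_data_cleaned["salary_numeric"] = adult_data_cleaned["salary"].replace({"<=50K": 50000, ">50K": 60000})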
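
The groupby output further down reports `dtype: object` for `salary_numeric`, which suggests the column is not stored as a numeric dtype after `replace`. If a numeric column is wanted, one possible alternative (a sketch on toy values, not the notebook's own method) is `Series.map` with the same lookup table:

```python
import pandas as pd

# Toy values standing in for the "salary" column
salary = pd.Series(["<=50K", ">50K", "<=50K"], name="salary")

# map() looks each value up in the dict; with integer values and
# no missing keys the resulting column has an integer dtype
salary_numeric = salary.map({"<=50K": 50000, ">50K": 60000})
print(salary_numeric.dtype)
```

Unlike `replace`, `map` turns any value missing from the dict into `NaN`, so it is worth checking that every category is covered.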

In [5]:
# Multiply the 'salary K$' column by 1000
adult_data_cleaned["salary K$"] = adult_data_cleaned["salary K$"] * 1000

In [6]:
# Gender distribution
gender_counts = adult_data_cleaned["sex"].value_counts()
print("Gender distribution:", gender_counts)

Gender distribution: sex
Male      20380
Female     9782
Name: count, dtype: int64
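
If shares are more useful than raw counts, `value_counts` also accepts `normalize=True`. A small sketch on toy values (not the real dataset):

```python
import pandas as pd

# Toy values standing in for the "sex" column
sex = pd.Series(["Male", "Male", "Female", "Male"], name="sex")

# normalize=True returns fractions that sum to 1 instead of raw counts
gender_share = sex.value_counts(normalize=True) * 100
print(gender_share.round(1))
```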


In [7]:
# Calculate the average age of men (rounded)
average_age_men = adult_data_cleaned[adult_data_cleaned["sex"] == "Male"]["age"].mean()
print(f"The average age of men is: {round(average_age_men)}")

The average age of men is: 39
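
The same selection can be written with a single `.loc` call, which picks the rows and the column in one step. A minimal sketch with invented ages:

```python
import pandas as pd

# Toy data, made up for illustration only
toy = pd.DataFrame({
    "sex": ["Male", "Female", "Male"],
    "age": [30, 25, 44],
})

# Row mask and column label in one .loc call
mean_age_men = toy.loc[toy["sex"] == "Male", "age"].mean()
print(round(mean_age_men))
```

Note that Python's built-in `round` rounds exact halves to the nearest even integer, which can matter when the mean lands on `.5`.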


In [8]:
# Percentage of people from Poland
people_from_poland = adult_data_cleaned[adult_data_cleaned["native-country"] == "Poland"].shape[0]
percentage_poland = (people_from_poland / adult_data_cleaned.shape[0]) * 100
print(f"The percentage of people from Poland is: {percentage_poland:.2f}%")

The percentage of people from Poland is: 0.19%
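
Because a boolean Series averages to the fraction of `True` values, the percentage can also be computed without counting rows explicitly. A sketch on toy values:

```python
import pandas as pd

# Toy values standing in for the "native-country" column
country = pd.Series(["Poland", "United-States", "Mexico", "United-States"])

# The mean of a boolean mask is the share of rows where it is True
share_poland = (country == "Poland").mean() * 100
print(f"{share_poland:.2f}%")
```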


In [9]:
# Number of people without higher education but with salary > 50K
higher_education_levels = ["Bachelors", "Prof-school", "Assoc-acdm", "Assoc-voc", "Masters", "Doctorate"]
no_higher_education = ~adult_data_cleaned["education"].isin(higher_education_levels)
earns_over_50k = adult_data_cleaned["salary"] == ">50K"
people_no_higher_education = adult_data_cleaned[no_higher_education & earns_over_50k]
print(f"Number of people without higher education but with salary > 50K: {people_no_higher_education.shape[0]}")

Number of people without higher education but with salary > 50K: 3178
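
The filter combines two conditions with `&`. The comparison has to sit in parentheses (or in its own variable) because `&` binds more tightly than `==`, while `~` binds more tightly than `&`. A minimal sketch with a shortened, hypothetical education list:

```python
import pandas as pd

# Toy data, made up for illustration only
toy = pd.DataFrame({
    "education": ["HS-grad", "Bachelors", "11th", "Masters"],
    "salary": [">50K", ">50K", "<=50K", "<=50K"],
})
higher = ["Bachelors", "Masters"]  # shortened list for the toy example

# Name each condition, then combine them element-wise
no_degree = ~toy["education"].isin(higher)
high_pay = toy["salary"] == ">50K"

# Summing a boolean mask counts the rows where it is True
count = int((no_degree & high_pay).sum())
print(count)
```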


In [10]:
# Age statistics by education
age_stats_by_education = adult_data_cleaned.groupby("education")["age"].describe()
print("Age statistics by education:", age_stats_by_education)

Age statistics by education:                count       mean        std   min   25%   50%   75%   max
education                                                               
10th           820.0  37.897561  16.225795  17.0  23.0  36.0  52.0  90.0
11th          1048.0  32.363550  15.089307  17.0  18.0  28.5  43.0  90.0
12th           377.0  32.013263  14.373710  17.0  19.0  28.0  41.0  79.0
1st-4th        151.0  44.622517  14.929051  19.0  33.0  44.0  56.0  81.0
5th-6th        288.0  41.649306  14.754622  17.0  28.0  41.0  53.0  82.0
7th-8th        557.0  47.631957  15.737479  17.0  34.0  49.0  60.0  90.0
9th            455.0  40.303297  15.335754  17.0  28.0  38.0  53.0  90.0
Assoc-acdm    1008.0  37.286706  10.509755  19.0  29.0  36.0  44.0  90.0
Assoc-voc     1307.0  38.246366  11.181253  19.0  30.0  37.0  45.0  84.0
Bachelors     5044.0  38.641554  11.577566  19.0  29.0  37.0  46.0  90.0
Doctorate      375.0  47.130667  11.471727  24.0  39.0  47.0  54.0  80.0
HS-grad       9840.0  
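
When only a few of the `describe()` columns are needed, `agg` with a list of function names gives a narrower table. A sketch on toy data (the ages are invented):

```python
import pandas as pd

# Toy data, made up for illustration only
toy = pd.DataFrame({
    "education": ["HS-grad", "HS-grad", "Masters", "Masters"],
    "age": [20, 40, 30, 50],
})

# One row per education level, one column per requested statistic
age_summary = toy.groupby("education")["age"].agg(["count", "mean", "min", "max"])
print(age_summary)
```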

In [11]:
# Calculate the average salary for married and unmarried men
men_data = adult_data_cleaned[adult_data_cleaned["sex"] == "Male"].copy()
men_data["is_married"] = men_data["marital-status"].str.startswith("Married")
average_salary_by_marital_status = men_data.groupby("is_married")["salary_numeric"].mean()
print("Average salary by marital status:", average_salary_by_marital_status)

Average salary by marital status: is_married
False    50884.944116
True     54479.843444
Name: salary_numeric, dtype: object


In [12]:
# Determine the group with the higher average salary.
# idxmax() returns the index label of the largest value, here True (married) or False (unmarried).
salary_comparison = average_salary_by_marital_status.idxmax()
print(f"The group with the higher average salary is: {'Married' if salary_comparison else 'Unmarried'} men.")

The group with the higher average salary is: Married men.
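
An alternative is to label the groups before aggregating, so that `idxmax()` returns a readable name directly. The sketch below uses `np.where` for the labels. The toy values and the `status` column name are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration only
toy = pd.DataFrame({
    "is_married": [True, True, False, False],
    "salary_numeric": [60000, 50000, 50000, 50000],
})

# Turn the boolean flag into a text label (hypothetical "status" column)
toy["status"] = np.where(toy["is_married"], "Married", "Unmarried")

# Mean per labelled group, then the label of the largest mean
means = toy.groupby("status")["salary_numeric"].mean()
top_group = means.idxmax()
print(f"The group with the higher average salary is: {top_group} men.")
```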


In [13]:
# Maximum number of hours worked per week
max_hours = adult_data_cleaned["hours-per-week"].max()
count_max_hours = adult_data_cleaned[adult_data_cleaned["hours-per-week"] == max_hours].shape[0]
print(f"The maximum number of hours worked per week is: {max_hours}")
print(f"Number of people working {max_hours} hours per week: {count_max_hours}")

The maximum number of hours worked per week is: 99
Number of people working 99 hours per week: 78
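
The count of rows at the maximum can also be taken straight from a boolean comparison, without building the filtered frame. A sketch on invented hours:

```python
import pandas as pd

# Toy values standing in for the "hours-per-week" column
hours = pd.Series([40, 99, 60, 99, 20])

max_hours = hours.max()

# eq() compares element-wise; summing the mask counts the matches
count_max = int(hours.eq(max_hours).sum())
print(max_hours, count_max)
```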
