<a href="https://colab.research.google.com/github/chrnthnkmutt/CPE393_TBA_MLOps/blob/main/Adult_Census_Income_Analyze_and_Visualization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Adult Census Income Analysis and Visualization

As the dataset description notes, this data was extracted from the 1994 Census Bureau database by Ronny Kohavi and Barry Becker (Data Mining and Visualization, Silicon Graphics). The dataset was created to predict whether a person's income is above or below 50K, using features such as age, education, and job.
In this notebook, before building a model, I analyze the data, look at some of its properties, and make some visualizations. I hope you like it.

# Importing Libraries

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# EDA

In [None]:
data = pd.read_csv("/kaggle/input/adult-census-income/adult.csv")

In [None]:
data.head()

In [None]:
data.info()

In [None]:
data.describe()

In [None]:
data.columns

In [None]:
data.shape

In [None]:
data.isna().sum()

> After checking for NaN values, I was happy to see that there were none at all. Until I saw the question marks ("?") in the data :(

In [None]:
for column in data.columns:
    print(f"{column} = {data[data[column] == '?'].shape[0]}")
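The same per-column count can also be obtained without a loop by comparing the whole DataFrame with "?" at once. This is a minimal sketch on a made-up toy frame (the `toy` data and its values are assumptions for illustration, not rows from the census file):

```python
import pandas as pd

# Toy frame standing in for the census data (values are made up for illustration).
toy = pd.DataFrame({
    "workclass": ["Private", "?", "State-gov", "?"],
    "occupation": ["Sales", "?", "Tech-support", "Sales"],
    "age": [39, 50, 38, 53],
})

# Compare every cell with "?" and sum the True values per column.
question_mark_counts = (toy == "?").sum()
print(question_mark_counts)
```

Numeric columns such as `age` simply report zero, because no number is equal to the string "?".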

> But I did not get discouraged. I counted the question marks in each feature and saw that they appear in 3 object columns. I then replaced them with each column's mode, as seen below. I could also have used the replace() function here, or converted the question marks to NaN and continued with the fillna() function.

In [None]:
# Replace the "?" placeholders with each column's mode (computed without the placeholders).
# .loc avoids chained assignment, which may silently modify a copy instead of data.
for column in ["workclass", "occupation", "native.country"]:
    mode_value = data.loc[data[column] != "?", column].mode()[0]
    data.loc[data[column] == "?", column] = mode_value
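The NaN-based alternative mentioned above could look roughly like the following. It is only a sketch on a made-up toy frame (the `toy` data is an assumption for illustration): the placeholders are first converted to real missing values, then filled with each column's mode.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the census data (values are made up for illustration).
toy = pd.DataFrame({
    "workclass": ["Private", "?", "Private", "State-gov"],
    "occupation": ["Sales", "Sales", "?", "Tech-support"],
})

# Step 1: turn the "?" placeholders into real missing values.
toy = toy.replace("?", np.nan)

# Step 2: fill each column's missing values with that column's mode.
# mode() ignores NaN by default, so a missing value cannot become the fill value.
for column in ["workclass", "occupation"]:
    toy[column] = toy[column].fillna(toy[column].mode()[0])

print(toy)
```

One advantage of this route is that, after the first step, isna() reports the missing entries, so the usual pandas missing-data tools apply.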


> When we check again, we can see that the question marks are gone :)

In [None]:
for column in data.columns:
    print(f"{column} = {data[data[column] == '?'].shape[0]}")

# Outliers


Outliers affect the quality of your inferences. For this reason, before starting any analysis, you should check whether your dataset contains any and take the necessary precautions. Here, I looked for outliers with a visualization.

In [None]:
int_columns = ['age','fnlwgt','education.num','capital.gain','capital.loss','hours.per.week']

In [None]:
for i in int_columns:
    sns.boxplot(x = data[i])
    plt.show()
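To complement the box plots with a number, the points drawn beyond the whiskers can be counted with the 1.5 × IQR rule (seaborn's default `whis=1.5`). This is a sketch, not part of the original analysis: the helper name `count_iqr_outliers` and the sample values are hypothetical.

```python
import pandas as pd

def count_iqr_outliers(series: pd.Series, k: float = 1.5) -> int:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    outside = (series < q1 - k * iqr) | (series > q3 + k * iqr)
    return int(outside.sum())

# Made-up weekly hours: most values sit near 40, two are extreme.
hours = pd.Series([40, 40, 38, 45, 42, 40, 99, 1])
n_outliers = count_iqr_outliers(hours)
print(n_outliers)
```

Applied to a real column, for example `count_iqr_outliers(data["age"])`, this gives a quick sense of how many rows a filter would touch.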


> In this part, I filtered on the 1st and 99th percentiles of fnlwgt to get rid of some of the outlier values.

In [None]:
q_low = data["fnlwgt"].quantile(0.01)
q_hi  = data["fnlwgt"].quantile(0.99)
data = data[(data["fnlwgt"] < q_hi) & (data["fnlwgt"] > q_low)]
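If the same trimming were needed for more than one column, it could be wrapped in a small helper. The function name `trim_by_quantiles` and the toy data below are assumptions for illustration. The sketch keeps the same strict inequalities as the cell above.

```python
import pandas as pd

def trim_by_quantiles(df: pd.DataFrame, column: str,
                      low: float = 0.01, high: float = 0.99) -> pd.DataFrame:
    """Keep rows whose value lies strictly between the two quantiles."""
    q_low = df[column].quantile(low)
    q_hi = df[column].quantile(high)
    return df[(df[column] > q_low) & (df[column] < q_hi)]

# Toy column with the values 1..100 (made up for illustration).
toy = pd.DataFrame({"fnlwgt": range(1, 101)})
trimmed = trim_by_quantiles(toy, "fnlwgt")
print(len(toy), len(trimmed))
```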


> For the other columns, I removed outliers not with quantiles but with thresholds I chose from the plots.

In [None]:
# A range filter needs both bounds to hold, so the two conditions are joined with "&".
data = data[(data['education.num'] <= 16) & (data['education.num'] >= 4)]
data = data[data['capital.gain'] <= 60000]
data = data[data['capital.loss'] <= 3000]
data = data[(data['hours.per.week'] <= 80) & (data['hours.per.week'] >= 20)]
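A range filter keeps a row only when it satisfies the lower and the upper bound at the same time. One compact way to express that in pandas is `Series.between`, which is inclusive on both ends by default. The sketch below uses made-up values and also shows what happens when the two bounds are joined with "|" instead.

```python
import pandas as pd

# Toy weekly hours (values are made up for illustration).
toy = pd.DataFrame({"hours.per.week": [1, 20, 40, 80, 99]})

# between() is inclusive on both ends by default: keeps 20 <= hours <= 80.
in_range = toy[toy["hours.per.week"].between(20, 80)]

# Joining the same two bounds with "|" matches every row, so nothing is filtered.
either_bound = toy[(toy["hours.per.week"] <= 80) | (toy["hours.per.week"] >= 20)]

print(len(in_range), len(either_bound))
```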

# Visualizations

First, I looked at the average age by country of those who earn more than 50K and plotted it. For this, I first created a temporary DataFrame and used the groupby() function on it.

In [None]:
temp = data[data["income"] == '>50K']

# Mean age per country, computed only on the >50K rows held in temp
country_vs_age = temp.groupby("native.country")["age"].mean()

In [None]:
country_vs_age = country_vs_age.reset_index()

In [None]:
plt.figure(figsize = (20,20))
sns.barplot(x = "age", y = "native.country", data = country_vs_age, palette = "viridis")
plt.xlabel("Mean Age")
plt.ylabel("Country")
plt.title("Mean Age with >50K Income by Country")
plt.show()

> This time, let's look at the education level of those with an income over 50K. I could have used countplot if I didn't want to use groupby() with count().

In [None]:
education_data = temp.groupby("education")["income"].count()
education_data = education_data.reset_index()
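For reference, the groupby() with count() step yields the same numbers as `value_counts()` on the filtered column, which is a pandas-only shortcut for the counting that a count plot would otherwise do. This is a sketch on a made-up toy frame (the `toy` data is an assumption for illustration):

```python
import pandas as pd

# Toy frame standing in for the census data (values are made up for illustration).
toy = pd.DataFrame({
    "education": ["Bachelors", "Masters", "Bachelors", "HS-grad", "Bachelors"],
    "income": [">50K", ">50K", ">50K", "<=50K", ">50K"],
})
high_income = toy[toy["income"] == ">50K"]

# Route 1: group by education and count the rows in each group.
by_groupby = high_income.groupby("education")["income"].count()

# Route 2: count occurrences of each education level directly.
by_value_counts = high_income["education"].value_counts().sort_index()

print(by_groupby)
print(by_value_counts)
```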

In [None]:
plt.figure(figsize = (25,15))
sns.barplot(x = "education", y ="income", data = education_data, palette = "viridis")
plt.xlabel("Education Level")
plt.ylabel(">50K Income Count")
plt.title(">50K Count vs Education Level")
plt.show()

> Now we can look at age by workclass and income, which I visualized using hue instead of the groupby() function.

In [None]:
plt.figure(figsize= (13,13))
sns.barplot(x="workclass",y="age", hue="income", data=data, palette = "viridis")
plt.xlabel("Workclass")
plt.ylabel("Age")
plt.title("Workclass vs Age by Income")
plt.show()

> Finally, we can see the male and female ratios among those with an income above and below 50K in pie charts.

In [None]:
over_50_data = temp[["sex","income"]].groupby(["sex"]).count()
over_50_data = over_50_data.reset_index()

In [None]:
temp2 = data[data["income"] == '<=50K']
less_50_data = temp2[["sex","income"]].groupby(["sex"]).count()
less_50_data = less_50_data.reset_index()

In [None]:
plt.figure(figsize = (10,10))
plt.subplot(1,2,1)
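# Take the labels from the grouped data so they always match the slice order; autopct prints each share
plt.pie(x = over_50_data["income"], labels = over_50_data["sex"], colors = ["palevioletred","paleturquoise"], autopct = "%1.1f%%")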
plt.title(">50K")

plt.subplot(1,2,2)
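# Take the labels from the grouped data so they always match the slice order; autopct prints each share
plt.pie(x = less_50_data["income"], labels = less_50_data["sex"], colors = ["palevioletred","paleturquoise"], autopct = "%1.1f%%")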
plt.title("<=50K")
plt.show()
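The shares behind the two pies can also be computed as numbers in a single step with `pd.crosstab`. This is a sketch on a made-up toy frame (the `toy` data is an assumption for illustration). With `normalize="columns"`, each income column sums to 1, so the cells are the female and male shares within each income group.

```python
import pandas as pd

# Toy frame standing in for the census data (values are made up for illustration).
toy = pd.DataFrame({
    "sex": ["Male", "Female", "Male", "Male", "Female", "Male", "Female", "Male"],
    "income": [">50K", "<=50K", ">50K", "<=50K", "<=50K", ">50K", ">50K", "<=50K"],
})

# Rows are sexes, columns are income groups; each column is scaled to sum to 1.
shares = pd.crosstab(toy["sex"], toy["income"], normalize="columns")
print(shares)
```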

> I want to create a model with this data as soon as possible. You can find it on my profile in the future. Thanks :)