# Data Visualization for Data Engineering

In this Jupyter Notebook, we will explore data visualization techniques using Pandas and Matplotlib on the "Adult Income" dataset from the UCI Machine Learning Repository.


First, make sure to install the required libraries if you haven't already:

In [None]:
%pip install pandas matplotlib


## 1. Importing Libraries

In [None]:
import pandas as pd
import matplotlib.pyplot as plt


## 2. Loading the Dataset
The data file has no header row, so we supply the column names ourselves. Missing values in the file are marked with a question mark, which we ask Pandas to read as NaN.

In [None]:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
column_names = ["age", "workclass", "fnlwgt", "education", "education-num", "marital-status", "occupation", "relationship", "race", "sex", "capital-gain", "capital-loss", "hours-per-week", "native-country", "income"]

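# skipinitialspace=True strips the space after each comma, so the missing-value marker to match is '?' rather than ' ?'
df = pd.read_csv(url, header=None, names=column_names, na_values='?', skipinitialspace=True)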
df.head()
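
As a quick sanity check on missing-value handling, the sketch below parses two made-up rows written in the same comma-plus-space layout as the data file. The `sample` text, the `demo` frame, and the three-column subset are illustrative assumptions, not part of the dataset.

```python
import io

import pandas as pd

# Two made-up rows in the adult.data layout (each comma is followed by a space).
sample = "39, State-gov, 13\n54, ?, 9\n"

demo = pd.read_csv(
    io.StringIO(sample),
    header=None,
    names=["age", "workclass", "education-num"],
    na_values="?",          # marker as it looks after leading spaces are stripped
    skipinitialspace=True,  # drop the space that follows each comma
)

# Count missing values per column; the "?" entry should be counted under workclass.
missing_per_column = demo.isna().sum()
print(missing_per_column)
```

Running `df.isna().sum()` on the full DataFrame gives the same kind of per-column summary for the real data.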


## 3. Visualizing Numeric Columns
### Histograms
A histogram groups a numeric column into bins and shows how many rows fall into each bin, which reveals the shape of the column's distribution.

In [None]:
df['age'].hist(bins=20)
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.title('Age Distribution')
plt.show()
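
To make the effect of `bins=20` concrete, the sketch below computes the bin edges and counts directly with NumPy, which Matplotlib uses for histogram binning. The `ages` values are made up for illustration and stand in for `df['age']`.

```python
import numpy as np
import pandas as pd

# Made-up ages standing in for df['age'].
ages = pd.Series([17, 22, 25, 31, 38, 38, 45, 52, 67, 90])

# bins=20 splits the range [min, max] into 20 equal-width intervals.
counts, edges = np.histogram(ages, bins=20)

# Every bin has the same width: (max - min) / 20.
bin_width = edges[1] - edges[0]
print(f"bin width: {bin_width:.2f} years")
print(counts)
```

Fewer bins smooth the picture and can hide detail, while more bins show detail at the cost of noise, so it is worth trying a few values.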


### Box Plots
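A box plot summarizes a numeric column by its median and quartiles and draws values far from the box as individual points, which makes outliers easy to spot. Here we compare the age distribution across the income groups.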

In [None]:
df.boxplot(column='age', by='income')
plt.xlabel('Income')
plt.ylabel('Age')
plt.title('Age Distribution by Income')
plt.suptitle('')  # Suppress the automatically generated title
plt.show()
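
The individual points in a box plot follow a simple rule: by default, Matplotlib draws any value more than 1.5 times the interquartile range (IQR) beyond the quartiles as an outlier. A minimal sketch of that rule follows; the helper name `iqr_outlier_mask` and the sample ages are hypothetical.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, whis: float = 1.5) -> pd.Series:
    """Return True for values outside [Q1 - whis*IQR, Q3 + whis*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - whis * iqr
    upper = q3 + whis * iqr
    return (values < lower) | (values > upper)


# Made-up ages standing in for df['age'].
ages = pd.Series([23, 25, 28, 30, 31, 33, 35, 38, 40, 90])
outliers = ages[iqr_outlier_mask(ages)]
print(outliers.tolist())
```

On the real data, `df[iqr_outlier_mask(df['age'])]` would list the rows flagged by the same rule.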


## 4. Visualizing Categorical Columns
### Bar Plots
Bar plots are ideal for visualizing the distribution of categorical columns.

In [None]:
workclass_counts = df['workclass'].value_counts()
workclass_counts.plot(kind='bar')
plt.xlabel('Workclass')
plt.ylabel('Frequency')
plt.title('Workclass Distribution')
plt.show()
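
Note that `value_counts()` leaves out missing values by default, so they do not appear in the bar plot. The sketch below, using a made-up `workclass` column, shows how to include them and how to get shares instead of raw counts.

```python
import numpy as np
import pandas as pd

# Made-up values standing in for df['workclass'], including two missing entries.
workclass = pd.Series(["Private", "Private", "State-gov", np.nan, "Private", np.nan])

# dropna=False keeps NaN as its own category in the result.
counts_with_missing = workclass.value_counts(dropna=False)

# normalize=True returns each category's share of the non-missing values.
shares = workclass.value_counts(normalize=True)

print(counts_with_missing)
print(shares)
```

Either result can be passed to `.plot(kind='bar')` in the same way as the counts above.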


### Pie Charts
Pie charts show each category's share of the whole. They are easiest to read when the column has only a few categories.

In [None]:
education_counts = df['education'].value_counts()
education_counts.plot(kind='pie', autopct='%.1f%%', figsize=(10, 10))
plt.ylabel('')  # Remove the automatically generated ylabel
plt.title('Education Distribution')
plt.show()
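
With many categories, the small slices and their labels tend to overlap. One common remedy is to merge rare categories into a single slice before plotting. This is a sketch under assumptions: the helper `lump_small_categories`, the 5% threshold, and the sample counts are all illustrative.

```python
import pandas as pd


def lump_small_categories(
    counts: pd.Series, min_share: float = 0.05, other_label: str = "Other"
) -> pd.Series:
    """Merge categories whose share is below min_share into one entry."""
    shares = counts / counts.sum()
    kept = counts[shares >= min_share]
    rest = counts[shares < min_share].sum()
    if rest > 0:
        kept = pd.concat([kept, pd.Series({other_label: rest})])
    return kept


# Made-up counts standing in for df['education'].value_counts().
education_counts = pd.Series(
    {"HS-grad": 50, "Some-college": 30, "Bachelors": 15, "Doctorate": 3, "Preschool": 2}
)
lumped = lump_small_categories(education_counts)
print(lumped)
```

The result can be plotted with `lumped.plot(kind='pie', autopct='%.1f%%')` just like the original counts.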


## 5. Visualizing Relationships Between Columns
### Scatter Plots
Scatter plots are useful for visualizing the relationship between two numeric columns.

In [None]:
df.plot(kind='scatter', x='age', y='hours-per-week', alpha=0.1)
plt.xlabel('Age')
plt.ylabel('Hours per Week')
plt.title('Scatter Plot of Age vs. Hours per Week')
plt.show()
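
A scatter plot can be paired with a single number that summarizes the linear relationship. The sketch below computes the Pearson correlation on a made-up sample; the values are illustrative and chosen so that hours rise and then fall with age.

```python
import pandas as pd

# Made-up rows standing in for df[['age', 'hours-per-week']].
sample = pd.DataFrame(
    {
        "age": [20, 30, 40, 50, 60],
        "hours-per-week": [20, 40, 45, 40, 25],
    }
)

# Series.corr computes the Pearson correlation coefficient by default.
r = sample["age"].corr(sample["hours-per-week"])
print(f"Pearson r = {r:.3f}")
```

Pearson correlation only measures linear association, so a coefficient near zero does not rule out a curved pattern like the one in this sample. That is one reason to look at the plot as well as the number.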


### Grouped Bar Plots
Grouped bar plots compare the counts of one categorical column across the levels of another. Here each income group gets one bar per sex.

In [None]:
grouped_counts = df.groupby(['income', 'sex']).size().unstack()
grouped_counts.plot(kind='bar')
plt.xlabel('Income')
plt.ylabel('Frequency')
plt.title('Income Distribution by Sex')
plt.show()
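
Raw counts can be hard to compare when the groups differ in size. As a sketch, `pd.crosstab` with `normalize='index'` converts the counts to within-group shares. The sample rows below are made up for illustration.

```python
import pandas as pd

# Made-up rows standing in for df[['sex', 'income']].
sample = pd.DataFrame(
    {
        "sex": ["Male", "Male", "Male", "Female", "Female", "Male", "Female", "Female"],
        "income": ["<=50K", ">50K", ">50K", "<=50K", "<=50K", "<=50K", ">50K", "<=50K"],
    }
)

# normalize='index' divides each row by its total, so every row sums to 1.
shares = pd.crosstab(sample["sex"], sample["income"], normalize="index")
print(shares)
```

Calling `shares.plot(kind='bar', stacked=True)` then draws one bar per sex, split by income share.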


These visualizations are just a starting point for exploring your dataset. You can customize the appearance and create more complex visualizations by further exploring the Pandas and Matplotlib libraries. Additionally, you may want to explore other visualization libraries such as Seaborn or Plotly for more advanced or interactive visualizations.