# Advanced data visualization

In this Jupyter Notebook, we will explore more advanced data visualization techniques using Seaborn on the "Adult Income" dataset from the UCI Machine Learning Repository. 

First, make sure to install the required libraries if you haven't already:

In [None]:
%pip install pandas seaborn


## 1. Importing Libraries

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt


## 2. Loading the Dataset
We will use the "Adult Income" dataset from the UCI Machine Learning Repository.

In [None]:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
column_names = ["age", "workclass", "fnlwgt", "education", "education-num", "marital-status", "occupation", "relationship", "race", "sex", "capital-gain", "capital-loss", "hours-per-week", "native-country", "income"]

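# Fields are separated by ", " and missing values are marked "?". With
# skipinitialspace=True the leading space is stripped before matching, so use '?' (not ' ?').
df = pd.read_csv(url, header=None, names=column_names, na_values='?', skipinitialspace=True)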
df.head()
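
In the raw file, missing values appear as `?` and each field after the first is preceded by a space. One way to confirm that the markers were parsed as missing is to count `NaN` values per column. The sketch below does this on two made-up rows in the same layout (they are illustrative, not records from the dataset, and the names `sample`, `demo`, and `missing_counts` are assumptions of this example). On the full data, the equivalent check would be `df.isna().sum()`.

```python
import io
import pandas as pd

# Two made-up rows in the same "comma + space" layout as adult.data (illustrative only)
sample = io.StringIO("39, State-gov, <=50K\n54, ?, >50K\n")

demo = pd.read_csv(
    sample,
    header=None,
    names=["age", "workclass", "income"],
    na_values="?",           # matched after the leading space has been stripped
    skipinitialspace=True,   # drop the space that follows each comma
)

# Count missing values per column
missing_counts = demo.isna().sum()
print(missing_counts)
```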


## 3. Visualizing Numeric Columns
### Histograms with Kernel Density Estimation

To create a histogram, use the `histplot` function from `seaborn`. Passing `kde=True` overlays a kernel density estimate on the bars.

In [None]:
# create histplot

# set x label
plt.xlabel('XXX')
# set y label
plt.ylabel('XXX')
# set title
plt.title('XXX')
plt.show()


### Box Plots with Swarmplot

Overlay a `swarmplot` on a `boxplot`, both from `seaborn`. A swarm plot draws one marker per row, so plot a random sample rather than the full dataset.

In [None]:
#boxplot
#swarmplot

# set x label
plt.xlabel('XXX')
# set y label
plt.ylabel('XXX')
# set title, example:
plt.title('Age Distribution by Income')
plt.show()
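
A minimal sketch of the box-plus-swarm combination follows, assuming `age` is compared across `income` groups as the example title suggests. The synthetic `demo` frame is a stand-in for `df` so the snippet runs on its own. In the notebook, something like `df.sample(500, random_state=0)` could be passed instead to keep the swarm readable.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Synthetic stand-in for a sample of df (assumption)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "income": rng.choice(["<=50K", ">50K"], size=200),
    "age": rng.integers(17, 91, size=200),
})
order = ["<=50K", ">50K"]  # fix the category order so both layers line up

# Box plot first; hide its outlier markers because the swarm shows every point
ax = sns.boxplot(data=demo, x="income", y="age", order=order,
                 color="lightgray", showfliers=False)
# Swarm plot drawn on the same axes
sns.swarmplot(data=demo, x="income", y="age", order=order,
              color="black", size=2, ax=ax)

plt.xlabel("Income")
plt.ylabel("Age")
plt.title("Age Distribution by Income")
plt.show()
```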


## 4. Visualizing Categorical Columns
### Count Plots

Use `countplot` from `seaborn` to show the number of rows in each category.

In [None]:
#countplot
# set x label
plt.xlabel('XXX')
# set y label
plt.ylabel('XXX')
# set title label, example:
plt.title('Workclass Distribution')
plt.show()
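
As a sketch of how the cell above might be completed for `workclass`, the example below sorts the bars by frequency and draws them horizontally so that long category names stay readable. Both choices are assumptions of this example rather than requirements. The synthetic `demo` frame stands in for `df`.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Synthetic stand-in for df (assumption); category names follow the dataset's style
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "workclass": rng.choice(
        ["Private", "Self-emp-not-inc", "Local-gov"], size=300, p=[0.7, 0.2, 0.1]
    )
})

# Order categories from most to least frequent
order = demo["workclass"].value_counts().index

# Passing the column as y gives horizontal bars
ax = sns.countplot(data=demo, y="workclass", order=order)
plt.xlabel("Count")
plt.ylabel("Workclass")
plt.title("Workclass Distribution")
plt.show()
```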


## 5. Visualizing Relationships Between Columns
### Scatter Plots with Regression Line

Use `regplot` from `seaborn` to draw a scatter plot with a fitted linear regression line.

In [None]:
#regplot
# set x label
plt.xlabel('XXX')
# set y label
plt.ylabel('XXX')
# set title label, example:
plt.title('Scatter Plot of Age vs. Hours per Week')
plt.show()
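
The following sketch shows one way to complete the cell above for `age` against `hours-per-week`, matching the example title. The synthetic `demo` frame is a stand-in for `df`, and the marker transparency and line color are arbitrary choices made so that overlapping points and the fitted line remain visible.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Synthetic stand-in for df (assumption)
rng = np.random.default_rng(0)
demo = pd.DataFrame({"age": rng.integers(17, 91, size=300)})
demo["hours-per-week"] = rng.normal(40, 10, size=300).clip(1, 99)

# scatter_kws styles the points, line_kws styles the fitted regression line
ax = sns.regplot(
    data=demo,
    x="age",
    y="hours-per-week",
    scatter_kws={"alpha": 0.3, "s": 10},
    line_kws={"color": "red"},
)
plt.xlabel("Age")
plt.ylabel("Hours per Week")
plt.title("Scatter Plot of Age vs. Hours per Week")
plt.show()
```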


### Grouped Bar Plots with Point Plot

In [None]:
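# Use axes-level functions so both layers share one Axes (requires seaborn >= 0.13:
# errorbar=None replaces ci=None, and linestyle='none' replaces join=False)
ax = sns.barplot(data=df, x='income', y='hours-per-week', hue='sex', errorbar=None)
# dodge=0.4 places each marker over the center of its bar; legend=False avoids duplicate entries
sns.pointplot(data=df, x='income', y='hours-per-week', hue='sex', dodge=0.4, linestyle='none', errorbar=None, markers=['x', 'o'], palette="dark", legend=False, ax=ax)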
plt.xlabel('Income')
plt.ylabel('Hours per Week')
plt.title('Hours per Week by Income and Sex')
plt.show()


### Heatmaps

In [None]:
# Create a contingency table of education and income
contingency_table = pd.crosstab(df['education'], df['income'])

# Calculate the percentage of each income group within each education level (rows sum to 100)
percentage_table = contingency_table.div(contingency_table.sum(axis=1), axis=0) * 100

sns.heatmap(percentage_table, cmap="YlGnBu", annot=True, fmt='.1f', cbar_kws={'label': 'Percentage (%)'})
plt.xlabel('Income')
plt.ylabel('Education')
plt.title('Income Distribution by Education Level')
plt.show()




### Pair Plot


In [None]:

# Select a subset of numeric columns to analyze
numeric_columns = ['age', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']

# Include 'income' in the same selection to use as a color code
# (assigning a new column to a slice of df would raise SettingWithCopyWarning)
data_subset = df[numeric_columns + ['income']]

# Sample 1000 data points for performance; random_state makes the sample reproducible
sns.pairplot(data_subset.sample(1000, random_state=42), hue='income', diag_kind='hist', plot_kws={'alpha': 0.6, 's': 20, 'edgecolor': 'k'}, palette="husl")
plt.suptitle('Pair Plot of Selected Numeric Columns', y=1.02) # Add a title and adjust the vertical space
plt.show()



Together, these plots show how individual columns are distributed and how columns relate to one another and to income. You can customize their appearance further and build more complex figures by exploring the Seaborn library. For interactive visualizations, consider other libraries such as Plotly.




