<p style="text-align:center">
    <a href="https://skills.network" target="_blank">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="Skills Network Logo"  />
    </a>
</p>


# **Histogram**


Estimated time needed: **45** minutes


In this lab, you will focus on the visualization of data. The dataset will be provided through an RDBMS, and you will need to use SQL queries to extract the required data.


## Objectives


In this lab, you will perform the following:


- Visualize the distribution of data using histograms.

- Visualize relationships between features.

- Explore data composition and comparisons.


## Demo: Working with the database


#### Download the database file.


In [None]:
!wget -O survey-data.sqlite https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/QR9YeprUYhOoLafzlLspAw/survey-results-public.sqlite

#### Install the required libraries and import them


In [None]:
!pip install pandas

In [None]:
!pip install matplotlib

In [None]:
import sqlite3
import pandas as pd
import matplotlib.pyplot as plt

#### Connect to the SQLite database


In [None]:
conn = sqlite3.connect('survey-data.sqlite')

## Demo: Basic SQL queries

**Demo 1: Count the number of rows in the table**


In [None]:
QUERY = "SELECT COUNT(*) FROM main"
df = pd.read_sql_query(QUERY, conn)
print(df)


**Demo 2: List all tables**


In [None]:
QUERY = """
SELECT name as Table_Name 
FROM sqlite_master 
WHERE type = 'table'
"""
pd.read_sql_query(QUERY, conn)
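
Before plotting a column, it can help to check how SQLite declares its type, since a column stored as text will not behave like a number in a histogram. The sketch below is illustrative only: it builds a toy in-memory table named `main` with made-up columns rather than reading `survey-data.sqlite`, and `demo_conn` is a hypothetical name chosen to avoid clobbering the lab's `conn`.

```python
import sqlite3

# Toy in-memory database standing in for survey-data.sqlite (hypothetical schema).
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE main (Age TEXT, CompTotal REAL, YearsCodePro TEXT)")

# PRAGMA table_info returns one row per column:
# (cid, name, type, notnull, dflt_value, pk)
column_types = {row[1]: row[2] for row in demo_conn.execute("PRAGMA table_info(main)")}
demo_conn.close()

print(column_types)
```

Running the same `PRAGMA table_info(main)` query against the lab's `conn` should list the real columns and their declared types.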


**Demo 3: Group data by age**


In [None]:
QUERY = """
SELECT Age, COUNT(*) as count 
FROM main 
GROUP BY Age 
ORDER BY Age
"""
df_age = pd.read_sql_query(QUERY, conn)
print(df_age)


## Hands-on Lab: Visualizing Data with Histograms


### 1. Visualizing the distribution of data (Histograms)


**1.1 Histogram of `CompTotal` (Total Compensation)**


Objective: Plot a histogram of `CompTotal` to visualize the distribution of respondents' total compensation.


In [None]:
## Write your code here

# 1.1 Histogram of CompTotal

# Query to get CompTotal data
QUERY = "SELECT CompTotal FROM main WHERE CompTotal IS NOT NULL"
df_comp = pd.read_sql_query(QUERY, conn)

# Create histogram
plt.figure(figsize=(10, 6))
plt.hist(df_comp['CompTotal'], bins=50, edgecolor='black', alpha=0.7, color='skyblue')
plt.title('Distribution of Total Compensation')
plt.xlabel('Total Compensation ($)')
plt.ylabel('Frequency')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print(f"Total respondents with compensation data: {len(df_comp)}")
print(f"Mean compensation: ${df_comp['CompTotal'].mean():,.2f}")
print(f"Median compensation: ${df_comp['CompTotal'].median():,.2f}")
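
Compensation data often has a long right tail, and a few extreme entries can squeeze almost every respondent into the first bin of a histogram. One possible remedy, sketched here on made-up numbers and not on the survey data, is to trim values above a high percentile before plotting. The 99th percentile is an arbitrary choice for illustration.

```python
import statistics

# Made-up compensation values: 200 ordinary salaries plus one extreme entry.
comp = [40_000 + 1_000 * i for i in range(200)] + [10**9]

# statistics.quantiles(n=100) returns 99 cut points; the last is the 99th percentile.
cutoff = statistics.quantiles(comp, n=100)[-1]
trimmed = [value for value in comp if value <= cutoff]

print(f"Mean before trimming:   {statistics.mean(comp):,.0f}")
print(f"Mean after trimming:    {statistics.mean(trimmed):,.0f}")
print(f"Median before trimming: {statistics.median(comp):,.0f}")
```

The mean shifts sharply once the extreme entry is removed, which is why the median is usually the safer summary for skewed data.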

**1.2 Histogram of `YearsCodePro` (Years of Professional Coding Experience)**


Objective: Plot a histogram of `YearsCodePro` to analyze the distribution of coding experience among respondents.


In [None]:
## Write your code here

# 1.2 Histogram of YearsCodePro

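# Query to get YearsCodePro data
QUERY = "SELECT YearsCodePro FROM main WHERE YearsCodePro IS NOT NULL"
df_years = pd.read_sql_query(QUERY, conn)

# The column may hold text labels (e.g. "Less than 1 year") alongside numbers.
# Coerce to numeric so hist(), mean() and median() work; non-numeric labels
# become NaN and are dropped.
df_years['YearsCodePro'] = pd.to_numeric(df_years['YearsCodePro'], errors='coerce')
df_years = df_years.dropna(subset=['YearsCodePro'])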

# Create histogram
plt.figure(figsize=(10, 6))
plt.hist(df_years['YearsCodePro'], bins=30, edgecolor='black', alpha=0.7, color='green')
plt.title('Distribution of Professional Coding Experience')
plt.xlabel('Years of Professional Coding')
plt.ylabel('Frequency')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print(f"Total respondents: {len(df_years)}")
print(f"Mean experience: {df_years['YearsCodePro'].mean():.2f} years")
print(f"Median experience: {df_years['YearsCodePro'].median():.2f} years")
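
If `YearsCodePro` is stored as text, some responses may be labels rather than digits. The helper below is a minimal sketch of one way to keep such responses instead of discarding them. The two label strings and the numbers assigned to them are assumptions for illustration, so check the real values with `SELECT DISTINCT YearsCodePro FROM main` before relying on them.

```python
def years_to_number(value):
    """Convert a YearsCodePro response to a float, or None if it cannot be parsed."""
    # Hypothetical text labels and the numbers chosen to stand in for them.
    special = {"Less than 1 year": 0.5, "More than 50 years": 51.0}
    if value is None:
        return None
    text = str(value).strip()
    if text in special:
        return special[text]
    try:
        return float(text)
    except ValueError:
        return None


samples = ["Less than 1 year", "12", "More than 50 years", None, "n/a"]
print([years_to_number(v) for v in samples])
```

With pandas, a function like this could be applied through `df_years['YearsCodePro'].map(years_to_number)` before plotting.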

### 2. Visualizing Relationships in Data


**2.1 Histogram Comparison of `CompTotal` by `Age` Group**


Objective: Use histograms to compare the distribution of `CompTotal` across different `Age` groups.


In [None]:
## Write your code here

# 2.1 Histogram Comparison of CompTotal by Age Group

# Query to get data
QUERY = """
SELECT Age, CompTotal 
FROM main 
WHERE CompTotal IS NOT NULL AND Age IS NOT NULL
"""
df_age_comp = pd.read_sql_query(QUERY, conn)

# Get top 4 age groups
top_ages = df_age_comp['Age'].value_counts().head(4).index

# Create subplots
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.flatten()

for idx, age_group in enumerate(top_ages):
    data = df_age_comp[df_age_comp['Age'] == age_group]['CompTotal']
    axes[idx].hist(data, bins=30, edgecolor='black', alpha=0.7, color='coral')
    axes[idx].set_title(f'CompTotal Distribution: {age_group}')
    axes[idx].set_xlabel('Total Compensation ($)')
    axes[idx].set_ylabel('Frequency')
    axes[idx].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Compensation distribution by age group analyzed")
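
When each panel computes its own bins, the same bar position can cover a different value range from one age group to the next, which makes the panels hard to compare. A possible alternative is to compute one set of bin edges from the pooled data and reuse it for every group. The sketch below shows the idea in plain Python on made-up values. `shared_edges` and `bin_counts` are hypothetical helpers, and the case where all values are identical is not handled.

```python
from bisect import bisect_right


def shared_edges(groups, n_bins):
    """Equal-width bin edges spanning the pooled values of all groups."""
    pooled = [v for values in groups.values() for v in values]
    lo, hi = min(pooled), max(pooled)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(n_bins + 1)]


def bin_counts(values, edges):
    """Count values per bin: [left, right), with the last bin closed on the right."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        idx = min(bisect_right(edges, v) - 1, len(counts) - 1)
        counts[idx] += 1
    return counts


# Made-up compensation values (in thousands) for two age groups.
groups = {
    "25-34 years old": [0, 10, 30, 60],
    "35-44 years old": [20, 55, 80, 100],
}
edges = shared_edges(groups, n_bins=4)
per_group = {name: bin_counts(values, edges) for name, values in groups.items()}
print(edges)
print(per_group)
```

In matplotlib, the same effect comes from passing one explicit list of edges as `bins=` to every `hist` call.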

**2.2 Histogram of `TimeSearching` for Different Age Groups**


Objective: Use histograms to explore the distribution of `TimeSearching` (time spent searching for information) for respondents across different age groups.


In [None]:
## Write your code here

# 2.2 Histogram of TimeSearching for Different Age Groups

# Query to get data
QUERY = """
SELECT Age, TimeSearching 
FROM main 
WHERE TimeSearching IS NOT NULL AND Age IS NOT NULL
"""
df_time_search = pd.read_sql_query(QUERY, conn)

# Get top 3 age groups
top_ages = df_time_search['Age'].value_counts().head(3).index

# Create comparison histogram
plt.figure(figsize=(12, 6))

for age_group in top_ages:
    data = df_time_search[df_time_search['Age'] == age_group]['TimeSearching']
    plt.hist(data, bins=20, alpha=0.5, label=age_group, edgecolor='black')

plt.title('Time Spent Searching by Age Group')
plt.xlabel('Time Searching (hours)')
plt.ylabel('Frequency')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("TimeSearching distribution by age group visualized")
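
If `TimeSearching` turns out to hold text categories rather than numbers, the bars will appear in whatever order the labels are first encountered. A minimal sketch of imposing a natural order is shown below. The label strings are assumptions made for illustration and should be replaced with the result of `SELECT DISTINCT TimeSearching FROM main`.

```python
from collections import Counter

# Hypothetical category labels, listed in their natural order (an assumption).
ORDER = [
    "Less than 15 minutes a day",
    "15-30 minutes a day",
    "30-60 minutes a day",
    "60-120 minutes a day",
    "Over 120 minutes a day",
]

# Made-up responses for one age group.
responses = [
    "30-60 minutes a day",
    "15-30 minutes a day",
    "30-60 minutes a day",
    "Over 120 minutes a day",
]

counts = Counter(responses)
# Counter returns 0 for labels that never occur, so every category gets a bar.
ordered_counts = [(label, counts[label]) for label in ORDER]
print(ordered_counts)
```

The ordered labels and counts can then be passed to `plt.bar` so that every age group shares the same x-axis order.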

### 3. Visualizing the Composition of Data


**3.1 Histogram of Most Desired Databases (`DatabaseWantToWorkWith`)**


Objective: Visualize the most desired databases for future learning using a histogram of the top 5 databases.


In [None]:
## Write your code here

# 3.1 Histogram of Most Desired Databases

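# Query the raw responses. A respondent may list several databases in one
# answer, separated by ';' (an assumption about how multi-select answers are stored).
QUERY = """
SELECT DatabaseWantToWorkWith
FROM main
WHERE DatabaseWantToWorkWith IS NOT NULL
"""
df_db_raw = pd.read_sql_query(QUERY, conn)

# Split each answer into one row per database, then count the top 5
df_db = (
    df_db_raw['DatabaseWantToWorkWith']
    .str.split(';')
    .explode()
    .str.strip()
    .value_counts()
    .head(5)
    .rename_axis('DatabaseWantToWorkWith')
    .reset_index(name='count')
)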

# Plot the category counts as a bar chart (a histogram over categories)
plt.figure(figsize=(12, 6))
plt.bar(df_db['DatabaseWantToWorkWith'], df_db['count'], edgecolor='black', alpha=0.7, color='purple')
plt.title('Top 5 Most Desired Databases')
plt.xlabel('Database')
plt.ylabel('Count')
plt.xticks(rotation=45, ha='right')
plt.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()

print("Top 5 desired databases:")
print(df_db)

**3.2 Histogram of Preferred Work Locations (`RemoteWork`)**


Objective: Use a histogram to explore the distribution of preferred work arrangements (`RemoteWork`).


In [None]:
## Write your code here

# 3.2 Histogram of Preferred Work Locations (RemoteWork)

# Query to get RemoteWork data
QUERY = """
SELECT RemoteWork, COUNT(*) as count
FROM main
WHERE RemoteWork IS NOT NULL
GROUP BY RemoteWork
ORDER BY count DESC
"""
df_remote = pd.read_sql_query(QUERY, conn)

# Plot the category counts as a bar chart (a histogram over categories)
plt.figure(figsize=(10, 6))
plt.bar(df_remote['RemoteWork'], df_remote['count'], edgecolor='black', alpha=0.7, color='teal')
plt.title('Distribution of Preferred Work Arrangements')
plt.xlabel('Remote Work Preference')
plt.ylabel('Count')
plt.xticks(rotation=45, ha='right')
plt.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()

print("Remote work preferences:")
print(df_remote)
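
Raw counts show which arrangement is most common, but shares are often easier to compare. The sketch below computes percentages directly in SQL with a scalar subquery. It runs against a toy in-memory table with made-up rows, and `demo_conn` is a hypothetical name used so the lab's `conn` is left untouched.

```python
import sqlite3

# Toy table standing in for the survey data (made-up rows, one NULL).
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE main (RemoteWork TEXT)")
rows = [("Remote",)] * 5 + [("Hybrid",)] * 3 + [("In-person",)] * 2 + [(None,)]
demo_conn.executemany("INSERT INTO main VALUES (?)", rows)

# The subquery supplies the denominator: all non-NULL responses.
QUERY = """
SELECT RemoteWork,
       COUNT(*) AS count,
       ROUND(100.0 * COUNT(*) /
             (SELECT COUNT(*) FROM main WHERE RemoteWork IS NOT NULL), 1) AS percent
FROM main
WHERE RemoteWork IS NOT NULL
GROUP BY RemoteWork
ORDER BY count DESC
"""
shares = demo_conn.execute(QUERY).fetchall()
demo_conn.close()

print(shares)
```

The same query text can be passed to `pd.read_sql_query(QUERY, conn)` to get a `percent` column alongside `count`.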

### 4. Visualizing Comparison of Data


**4.1 Histogram of `CompTotal` for Ages 45 to 60**


Objective: Plot the histogram for `CompTotal` within the age group 45 to 60 to analyze compensation distribution among mid-career respondents.


In [None]:
## Write your code here

# 4.1 Histogram of Median CompTotal for Ages 45 to 60

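# Query to get data for the age range. Age is assumed to be stored as text
# bracket labels, so LIKE matches any label that contains one of these numbers;
# the matched brackets may extend slightly beyond 45-60.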
QUERY = """
SELECT CompTotal
FROM main
WHERE CompTotal IS NOT NULL 
AND (Age LIKE '%45%' OR Age LIKE '%50%' OR Age LIKE '%55%' OR Age LIKE '%60%')
"""
df_age_45_60 = pd.read_sql_query(QUERY, conn)

# Create histogram
plt.figure(figsize=(10, 6))
plt.hist(df_age_45_60['CompTotal'], bins=40, edgecolor='black', alpha=0.7, color='orange')
plt.title('Compensation Distribution: Ages 45-60')
plt.xlabel('Total Compensation ($)')
plt.ylabel('Frequency')
plt.axvline(df_age_45_60['CompTotal'].median(), color='red', linestyle='--', linewidth=2, label='Median')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print(f"Respondents in age range 45-60: {len(df_age_45_60)}")
print(f"Median compensation: ${df_age_45_60['CompTotal'].median():,.2f}")

**4.2 Histogram of Job Satisfaction (`JobSat`) by `YearsCodePro`**


Objective: Plot the histogram for `JobSat` scores based on respondents' years of professional coding experience.


In [None]:
## Write your code here

# 4.2 Histogram of Job Satisfaction by YearsCodePro

# Query to get data
QUERY = """
SELECT JobSat, YearsCodePro
FROM main
WHERE JobSat IS NOT NULL AND YearsCodePro IS NOT NULL
"""
df_jobsat_exp = pd.read_sql_query(QUERY, conn)

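# YearsCodePro may be stored as text; coerce to numbers (non-numeric labels become NaN)
df_jobsat_exp['YearsCodePro'] = pd.to_numeric(df_jobsat_exp['YearsCodePro'], errors='coerce')
df_jobsat_exp = df_jobsat_exp.dropna(subset=['YearsCodePro'])

# Create grouped histogram by experience ranges.
# right=False makes the bins left-inclusive: [0, 5), [5, 10), [10, 15), [15, inf),
# so respondents with 0 years or more than 50 years are not silently dropped.
experience_bins = [0, 5, 10, 15, float('inf')]
experience_labels = ['0-4 years', '5-9 years', '10-14 years', '15+ years']
df_jobsat_exp['ExpGroup'] = pd.cut(df_jobsat_exp['YearsCodePro'], bins=experience_bins, labels=experience_labels, right=False)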

# Plot
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.flatten()

for idx, exp_group in enumerate(experience_labels):
    data = df_jobsat_exp[df_jobsat_exp['ExpGroup'] == exp_group]['JobSat']
    axes[idx].hist(data, bins=10, edgecolor='black', alpha=0.7, color='lightblue')
    axes[idx].set_title(f'Job Satisfaction: {exp_group}')
    axes[idx].set_xlabel('Job Satisfaction')
    axes[idx].set_ylabel('Frequency')
    axes[idx].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Job satisfaction by experience visualized")
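
Bin edges need care at the boundaries. With pandas defaults, `pd.cut` builds intervals that are closed on the right, so a value equal to the lowest edge falls outside every bin unless `include_lowest=True` or `right=False` is set. The sketch below illustrates the left-inclusive convention in plain Python. `experience_group`, the breakpoints, and the labels are hypothetical choices for illustration.

```python
from bisect import bisect_right

# Left-inclusive groups: [0, 5), [5, 10), [10, 15), [15, inf)
BREAKPOINTS = [5, 10, 15]
LABELS = ["0-4 years", "5-9 years", "10-14 years", "15+ years"]


def experience_group(years):
    """Return the experience label for a non-negative number of years."""
    return LABELS[bisect_right(BREAKPOINTS, years)]


for years in [0, 4, 5, 14, 15, 60]:
    print(years, "->", experience_group(years))
```

Every non-negative value lands in exactly one group, including 0 and values far above the last breakpoint.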

### Final step: Close the database connection


Once you've completed the lab, make sure to close the connection to the SQLite database:



In [None]:
conn.close()
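
An explicit `conn.close()` is skipped if an earlier cell raises an error. One option, sketched here on a throwaway in-memory database, is to wrap the connection in `contextlib.closing` so it is closed even when an exception occurs. Note that using a `sqlite3` connection directly in a `with` statement only manages transactions and does not close the connection.

```python
import sqlite3
from contextlib import closing

# closing() calls demo_conn.close() when the block exits, even on error.
with closing(sqlite3.connect(":memory:")) as demo_conn:
    demo_conn.execute("CREATE TABLE main (Age TEXT)")
    demo_conn.execute("INSERT INTO main VALUES ('25-34 years old')")
    row_count = demo_conn.execute("SELECT COUNT(*) FROM main").fetchone()[0]

print(row_count)
```

In a notebook that spans many cells, a single `conn.close()` in the final cell remains the simpler choice; the pattern above fits better when all queries live in one cell or script.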

### Summary


In this lab, you used histograms to visualize various aspects of the dataset, focusing on:

- Distribution of compensation and professional coding experience.

- Relationships between age group and both compensation and time spent searching.

- Composition of data by desired databases and work arrangements.

- Comparisons of compensation within the 45 to 60 age range and of job satisfaction across years of experience.

Histograms helped reveal patterns and distributions in the data, enhancing your understanding of developer demographics and preferences.


## Authors:
Ayushi Jain


### Other Contributors:
- Rav Ahuja
- Lakshmi Holla
- Malika


Copyright © IBM Corporation. All rights reserved.
