<p style="text-align:center">
    <a href="https://skills.network" target="_blank">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="Skills Network Logo"  />
    </a>
</p>


# **Lab: Exploratory Data Analysis**


Estimated time needed: **30** minutes


In this lab, you will perform exploratory data analysis (EDA) on a cleaned dataset.


## Objectives


In this lab, you will perform the following:


- Examine the structure of a dataset.

- Handle missing values effectively.

- Compute summary statistics on key columns.

- Analyze employment status, job satisfaction, programming language usage, and trends in remote work.


## Hands-on Lab


#### Step 1: Install and Import Libraries


Install and import the libraries needed for data manipulation and visualization.


In [None]:
!pip install pandas
!pip install matplotlib
!pip install seaborn

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

#### Step 2: Load and Preview the Dataset
Load the dataset from the provided URL. Use `df.head()` to display the first few rows and get an overview of the structure.
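
Beyond the first few rows, a handful of other pandas calls help examine structure. The sketch below uses a small hypothetical frame (the values are invented for illustration, not taken from the survey) so that it runs on its own; the same calls apply to `df` once it is loaded.

```python
import pandas as pd

# Hypothetical stand-in for the survey data; all values are invented
sample = pd.DataFrame({
    "Employment": ["Employed, full-time", "Student, full-time", None, "Employed, full-time"],
    "JobSat": [8.0, None, 6.0, 7.0],
    "Country": ["Germany", "India", "India", "Brazil"],
})

n_rows, n_cols = sample.shape          # (rows, columns)
dtypes = sample.dtypes                 # data type of each column
non_null = sample.notna().sum()        # non-missing entries per column
numeric_summary = sample.describe()    # count, mean, std, quartiles of numeric columns

print(f"{n_rows} rows x {n_cols} columns")
print(dtypes)
print(non_null)
print(numeric_summary)
```

By default `describe()` summarizes only the numeric columns when a frame mixes numeric and text data; `describe(include='all')` adds counts and most frequent values for text columns.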


In [None]:
# Load the Stack Overflow survey dataset
data_url = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/n01PQ9pSmiRX6520flujwQ/survey-data.csv'
df = pd.read_csv(data_url)

# Set pandas option to display all columns
pd.set_option('display.max_columns', None)

# Display the first few rows of the dataset
df.head()

#### Step 3: Handling Missing Data


Identify and manage missing values in critical columns such as `Employment`, `JobSat`, and `RemoteWork`. Implement a strategy to fill or drop these values, depending on the significance of the missing data.
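
Whether to fill or drop usually depends on how much is missing and on what the column is used for. As a minimal sketch on an invented frame (the category labels and values are assumptions, not survey data), the share of missing values can be measured first and the two strategies compared side by side:

```python
import pandas as pd

# Hypothetical data; all values are invented
sample = pd.DataFrame({
    "Employment": ["Employed, full-time", None, "Employed, full-time", "Student, full-time"],
    "JobSat": [8.0, None, 6.0, None],
    "RemoteWork": ["Remote", "Hybrid", None, "Remote"],
})

# Share of missing values per column, as a percentage
missing_pct = sample.isna().mean() * 100
print(missing_pct)

# Strategy 1: fill a categorical gap with the most frequent value
filled = sample.copy()
filled["RemoteWork"] = filled["RemoteWork"].fillna(filled["RemoteWork"].mode()[0])

# Strategy 2: drop rows that lack a value a specific analysis depends on
dropped = sample.dropna(subset=["JobSat"])

print(filled)
print(dropped)
```

Filling with the mode keeps every row but inflates the most common category, while dropping keeps the distribution intact at the cost of fewer rows. When a large share of a column is missing, the second option is often the safer choice.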


In [None]:
## Write your code here

# Step 3: Handling Missing Data

# Identify missing values in critical columns
print("Missing values in critical columns:")
critical_columns = ['Employment', 'JobSat', 'RemoteWork']
for col in critical_columns:
    if col in df.columns:
        missing = df[col].isnull().sum()
        print(f"  {col}: {missing} ({(missing/len(df)*100):.2f}%)")

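# Handle missing values by assigning the filled column back to df.
# Calling fillna(..., inplace=True) on df[col] emits a chained-assignment
# warning in pandas 2.x and does not modify df under copy-on-write.
# For Employment: fill with mode
if 'Employment' in df.columns and df['Employment'].isnull().sum() > 0:
    df['Employment'] = df['Employment'].fillna(df['Employment'].mode()[0])

# For JobSat: fill with median if numeric, mode if categorical
if 'JobSat' in df.columns and df['JobSat'].isnull().sum() > 0:
    if pd.api.types.is_numeric_dtype(df['JobSat']):
        df['JobSat'] = df['JobSat'].fillna(df['JobSat'].median())
    else:
        df['JobSat'] = df['JobSat'].fillna(df['JobSat'].mode()[0])

# For RemoteWork: fill with mode
if 'RemoteWork' in df.columns and df['RemoteWork'].isnull().sum() > 0:
    df['RemoteWork'] = df['RemoteWork'].fillna(df['RemoteWork'].mode()[0])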

print("\nMissing values after handling:")
for col in critical_columns:
    if col in df.columns:
        print(f"  {col}: {df[col].isnull().sum()}")

#### Step 4: Analysis of Experience and Job Satisfaction


Analyze the relationship between years of professional coding experience (`YearsCodePro`) and job satisfaction (`JobSat`). Summarize `YearsCodePro` and calculate median satisfaction scores based on experience ranges.

- Create experience ranges for `YearsCodePro` (e.g., `0-5`, `5-10`, `10-20`, `>20` years).

- Calculate the median `JobSat` for each range.

- Visualize the relationship using a bar plot or similar visualization.
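
The binning and per-range median described above can also be sketched with `pd.cut`. The values below are invented, and the text answer is an assumed example of a non-numeric response; the bin edges follow the ranges suggested in the list, with each upper edge included in its bin.

```python
import pandas as pd

# Hypothetical YearsCodePro values, including a text answer and a missing one
years_raw = pd.Series(["3", "5", "8", "20", "35", "Less than 1 year", None])

# Text that cannot be parsed becomes NaN instead of raising an error
years = pd.to_numeric(years_raw, errors="coerce")

# right=True (the default) makes each bin include its upper edge: (0, 5], (5, 10], ...
# include_lowest=True also places an exact 0 in the first bin
bins = [0, 5, 10, 20, float("inf")]
labels = ["0-5", "5-10", "10-20", ">20"]
experience_range = pd.cut(years, bins=bins, labels=labels, include_lowest=True)

job_sat = pd.Series([6.0, 7.0, 7.0, 8.0, 9.0, 5.0, 6.0])  # invented scores

# Rows whose experience could not be parsed fall outside every bin and are skipped
median_by_range = job_sat.groupby(experience_range, observed=False).median()
print(median_by_range)
```

Because the labels are an ordered categorical, the result is already sorted from least to most experience and can be plotted directly with `median_by_range.plot(kind='bar')`.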


In [None]:
## Write your code here

# Step 4: Analysis of Experience and Job Satisfaction

# Create experience ranges for YearsCodePro
if 'YearsCodePro' in df.columns and 'JobSat' in df.columns:
    # Define experience ranges
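    def categorize_experience(years):
        # Answers that cannot be parsed as numbers are labelled 'Unknown'
        try:
            years = float(years)
        except (TypeError, ValueError):
            return 'Unknown'
        # NaN fails every comparison below, so without this check
        # missing values would be counted as '>20 years'
        if pd.isna(years):
            return 'Unknown'
        if years <= 5:
            return '0-5 years'
        elif years <= 10:
            return '5-10 years'
        elif years <= 20:
            return '10-20 years'
        return '>20 years'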
    
    df['ExperienceRange'] = df['YearsCodePro'].apply(categorize_experience)
    
    # Calculate median JobSat for each range (if JobSat is numeric)
    print("Job Satisfaction by Experience Range:")
    satisfaction_by_exp = df.groupby('ExperienceRange')['JobSat'].agg(['count', 'median', 'mean'])
    print(satisfaction_by_exp)
    
    # Visualize with bar plot
    plt.figure(figsize=(10, 6))
    exp_order = ['0-5 years', '5-10 years', '10-20 years', '>20 years', 'Unknown']
    satisfaction_median = df.groupby('ExperienceRange')['JobSat'].median().reindex(exp_order)
    satisfaction_median.plot(kind='bar', color='skyblue', edgecolor='black')
    plt.title('Median Job Satisfaction by Experience Range')
    plt.xlabel('Experience Range')
    plt.ylabel('Median Job Satisfaction')
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()

#### Step 5: Visualize Job Satisfaction


Use a count plot to show the distribution of `JobSat` values. This provides insights into the overall satisfaction levels of respondents.
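
A count plot draws one bar per distinct value, so the numbers behind it can be checked with `value_counts`. A small sketch on invented scores:

```python
import pandas as pd

job_sat = pd.Series([8, 7, 8, 10, 5, 8, 7, None], name="JobSat")  # invented values

# Counts per satisfaction level, ordered by level rather than by frequency
counts = job_sat.value_counts().sort_index()

# The same distribution as percentages of the non-missing answers
shares = (job_sat.value_counts(normalize=True).sort_index() * 100).round(1)

print(pd.DataFrame({"count": counts, "percent": shares}))
```

Sorting by index keeps the levels in their natural order, which matches how the bars of a count plot over a numeric column are arranged.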


In [None]:
## Write your code here

# Step 5: Visualize Job Satisfaction

if 'JobSat' in df.columns:
    plt.figure(figsize=(12, 5))
    
    # Count plot
    plt.subplot(1, 2, 1)
    sns.countplot(data=df, x='JobSat', palette='viridis')
    plt.title('Distribution of Job Satisfaction')
    plt.xlabel('Job Satisfaction Level')
    plt.ylabel('Count')
    plt.xticks(rotation=45)
    
    # Pie chart (if categorical)
    plt.subplot(1, 2, 2)
    jobsat_counts = df['JobSat'].value_counts()
    plt.pie(jobsat_counts.values, labels=jobsat_counts.index, autopct='%1.1f%%')
    plt.title('Job Satisfaction Distribution')
    
    plt.tight_layout()
    plt.show()
    
    print("\nJob Satisfaction Summary:")
    print(df['JobSat'].value_counts())

#### Step 6: Analyzing Remote Work Preferences by Job Role


Analyze trends in remote work based on job roles. Use the `RemoteWork` and `Employment` columns to explore preferences and examine if specific job roles prefer remote work more than others.

- Use a count plot to show remote work distribution.

- Cross-tabulate remote work preferences by employment type (e.g., full-time, part-time) and job roles.
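
Raw counts in a cross-tabulation tend to be dominated by the largest group, which makes preferences hard to compare. One option, sketched here on invented responses (the category labels are assumptions), is to convert each row to percentages:

```python
import pandas as pd

# Invented responses for illustration
sample = pd.DataFrame({
    "Employment": ["Employed, full-time"] * 4 + ["Employed, part-time"] * 2,
    "RemoteWork": ["Remote", "Remote", "Hybrid", "In-person", "Remote", "Hybrid"],
})

# Raw counts: rows are employment types, columns are remote-work categories
counts = pd.crosstab(sample["Employment"], sample["RemoteWork"])

# normalize="index" divides each row by its total, so groups of very
# different sizes can be compared directly
row_pct = (
    pd.crosstab(sample["Employment"], sample["RemoteWork"], normalize="index") * 100
).round(1)

print(counts)
print(row_pct)
```

Each row of `row_pct` sums to 100, so the table reads as "of the respondents with this employment type, what share chose each remote-work option".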


In [None]:
## Write your code here

# Step 6: Analyzing Remote Work Preferences by Job Role

if 'RemoteWork' in df.columns and 'Employment' in df.columns:
    # Count plot for remote work distribution
    plt.figure(figsize=(12, 6))
    
    plt.subplot(1, 2, 1)
    sns.countplot(data=df, x='RemoteWork', palette='Set2')
    plt.title('Remote Work Distribution')
    plt.xlabel('Remote Work Status')
    plt.ylabel('Count')
    plt.xticks(rotation=45, ha='right')
    
    # Cross-tabulation
    plt.subplot(1, 2, 2)
    crosstab = pd.crosstab(df['Employment'], df['RemoteWork'])
    crosstab.plot(kind='bar', stacked=True, ax=plt.gca(), colormap='viridis')
    plt.title('Remote Work by Employment Type')
    plt.xlabel('Employment Type')
    plt.ylabel('Count')
    plt.legend(title='Remote Work', bbox_to_anchor=(1.05, 1))
    plt.xticks(rotation=45, ha='right')
    
    plt.tight_layout()
    plt.show()
    
    print("\nRemote Work Preferences:")
    print(df['RemoteWork'].value_counts())
    print("\nCross-tabulation (Employment vs RemoteWork):")
    print(crosstab)

#### Step 7: Analyzing Programming Language Trends by Region


Analyze the popularity of programming languages by region. Use the `LanguageHaveWorkedWith` column to investigate which languages are most used in different regions.

- Filter data by country or region.

- Visualize the top programming languages by region with a bar plot or heatmap.
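
To compare languages across regions rather than in aggregate, the multi-valued column can be expanded to one row per respondent and language. A minimal sketch, assuming the answers are semicolon-separated and using invented rows:

```python
import pandas as pd

# Invented rows; the column is assumed to hold semicolon-separated language names
sample = pd.DataFrame({
    "Country": ["India", "India", "Germany", "Germany", "Germany"],
    "LanguageHaveWorkedWith": ["Python;SQL", "Python;Java", "Python;Rust", "SQL", None],
})

# One row per (respondent, language): split on ';' and explode the resulting lists
langs = (
    sample.dropna(subset=["LanguageHaveWorkedWith"])
    .assign(Language=lambda d: d["LanguageHaveWorkedWith"].str.split(";"))
    .explode("Language")
)

# Countries as rows, languages as columns, respondent counts as cells
lang_by_country = pd.crosstab(langs["Country"], langs["Language"])
print(lang_by_country)
```

A table in this shape can be passed to `sns.heatmap(lang_by_country, annot=True)` for the heatmap suggested above. Dividing each row by the number of respondents in that country would make countries of different sizes comparable.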


In [None]:
## Write your code here

# Step 7: Analyzing Programming Language Trends by Region

if 'LanguageHaveWorkedWith' in df.columns and 'Country' in df.columns:
    # Get top 5 countries
    top_countries = df['Country'].value_counts().head(5).index
    
    # Filter for top countries
    df_top_countries = df[df['Country'].isin(top_countries)]
    
    # Parse programming languages (they are semicolon-separated)
    languages_list = []
    for langs in df_top_countries['LanguageHaveWorkedWith'].dropna():
        if isinstance(langs, str):
            languages_list.extend([lang.strip() for lang in langs.split(';')])
    
    # Count languages across the top 5 countries combined; keep the 10 most common
    from collections import Counter
    lang_counts = Counter(languages_list)
    top_languages = lang_counts.most_common(10)
    
    print("Top 10 Programming Languages (top 5 countries combined):")
    for i, (lang, count) in enumerate(top_languages, 1):
        print(f"{i}. {lang}: {count}")
    
    # Visualize
    plt.figure(figsize=(12, 6))
    langs, counts = zip(*top_languages)
    plt.barh(range(len(langs)), counts, color='coral')
    plt.yticks(range(len(langs)), langs)
    plt.xlabel('Frequency')
    plt.title('Top 10 Programming Languages (Top 5 Countries Combined)')
    plt.gca().invert_yaxis()
    plt.tight_layout()
    plt.show()

#### Step 8: Correlation Between Experience and Satisfaction


Examine how years of experience (`YearsCodePro`) correlate with job satisfaction (`JobSatPoints_1`). Use a scatter plot to visualize this relationship.
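
A scatter plot and a correlation coefficient both need numeric input, so any text answers have to be converted or dropped first. The sketch below uses invented values, with one assumed text answer, and reports two coefficients: Pearson for linear association, and Spearman (computed here as Pearson on ranks) for any monotonic trend.

```python
import pandas as pd

# Invented values; the text answer stands in for a non-numeric survey response
sample = pd.DataFrame({
    "YearsCodePro": ["1", "3", "5", "10", "20", "Less than 1 year"],
    "JobSatPoints_1": [10.0, 20.0, 25.0, 40.0, 45.0, 15.0],
})

# Unparseable entries become NaN and the affected rows are dropped
numeric = sample.apply(pd.to_numeric, errors="coerce").dropna()

# Pearson measures linear association
pearson = numeric["YearsCodePro"].corr(numeric["JobSatPoints_1"])

# Spearman is Pearson computed on ranks, so it captures any monotonic trend
spearman = numeric["YearsCodePro"].rank().corr(numeric["JobSatPoints_1"].rank())

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

A large gap between the two coefficients suggests that the relationship is monotonic but not linear, which is worth checking against the scatter plot.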


In [None]:
## Write your code here

# Step 8: Correlation Between Experience and Satisfaction

if 'YearsCodePro' in df.columns and 'JobSatPoints_1' in df.columns:
    # Create scatter plot
    plt.figure(figsize=(10, 6))
    
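    # Keep only rows where both columns are numeric: values that cannot be
    # parsed as numbers become NaN and are dropped
    df_numeric = df[['YearsCodePro', 'JobSatPoints_1']].apply(pd.to_numeric, errors='coerce').dropna()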
    
    plt.scatter(df_numeric['YearsCodePro'], df_numeric['JobSatPoints_1'], 
                alpha=0.5, s=20, c='blue')
    plt.xlabel('Years of Professional Coding Experience')
    plt.ylabel('Job Satisfaction Points')
    plt.title('Correlation: Experience vs Job Satisfaction')
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()
    
    # Calculate correlation
    correlation = df_numeric['YearsCodePro'].corr(df_numeric['JobSatPoints_1'])
    print(f"\nCorrelation coefficient: {correlation:.4f}")
    
    if abs(correlation) < 0.3:
        print("Weak correlation")
    elif abs(correlation) < 0.7:
        print("Moderate correlation")
    else:
        print("Strong correlation")

#### Step 9: Educational Background and Employment Type


Explore how educational background (`EdLevel`) relates to employment type (`Employment`). Use cross-tabulation and visualizations to understand if higher education correlates with specific employment types.
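
Alongside the full cross-tabulation, a single employment type can be summarized as a share per education level. A minimal sketch on invented responses (the shortened education labels are assumptions made for readability):

```python
import pandas as pd

# Invented responses for illustration
sample = pd.DataFrame({
    "EdLevel": ["Bachelor's", "Bachelor's", "Bachelor's", "Master's", "Master's", "Secondary"],
    "Employment": [
        "Employed, full-time", "Employed, full-time", "Student, full-time",
        "Employed, full-time", "Employed, full-time", "Student, full-time",
    ],
})

# Boolean flag per respondent; its mean within a group is that group's share
is_full_time = sample["Employment"].eq("Employed, full-time")
full_time_share = (
    is_full_time.groupby(sample["EdLevel"]).mean().mul(100).round(1)
    .sort_values(ascending=False)
)
print(full_time_share)
```

Shares avoid the problem that the most common education level tops every count-based ranking simply because it has the most respondents.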


In [None]:
## Write your code here

# Step 9: Educational Background and Employment Type

if 'EdLevel' in df.columns and 'Employment' in df.columns:
    # Cross-tabulation
    crosstab_edu = pd.crosstab(df['EdLevel'], df['Employment'])
    
    print("Cross-tabulation: Education Level vs Employment Type")
    print(crosstab_edu)
    
    # Visualize with stacked bar plot (DataFrame.plot creates its own figure,
    # so a separate plt.figure() call would leave an empty figure behind)
    crosstab_edu.plot(kind='bar', stacked=True, colormap='tab20', figsize=(14, 8))
    plt.title('Education Level vs Employment Type')
    plt.xlabel('Education Level')
    plt.ylabel('Count')
    plt.legend(title='Employment Type', bbox_to_anchor=(1.05, 1), loc='upper left')
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    plt.show()
    
    # Show top education levels for full-time employment, if that category exists
    if 'Employed, full-time' in crosstab_edu.columns:
        print("\nTop 5 Education Levels for Full-Time Employment:")
        print(crosstab_edu['Employed, full-time'].sort_values(ascending=False).head())

#### Step 10: Save the Cleaned and Analyzed Dataset


After your analysis, save the modified dataset for further use or sharing.
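
A quick way to confirm that a save worked is to read the file back and compare it with the frame in memory. The sketch below does this for an invented frame in a temporary directory; the file name is hypothetical.

```python
import os
import tempfile

import pandas as pd

# Invented frame standing in for the analyzed dataset
sample = pd.DataFrame({"Country": ["India", "Germany"], "JobSat": [7.0, None]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.csv")  # hypothetical file name

    # index=False keeps the row index out of the file
    sample.to_csv(path, index=False)

    # Read the file back to confirm that shape and column names survived
    reloaded = pd.read_csv(path)

print(reloaded.shape == sample.shape)
print(list(reloaded.columns) == list(sample.columns))
```

CSV stores no dtype information, so categorical or datetime columns come back as plain text unless they are converted again after loading.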


In [None]:
## Write your code here

# Step 10: Save the Cleaned and Analyzed Dataset

# Save to CSV
output_file = 'cleaned_analyzed_survey_data.csv'
df.to_csv(output_file, index=False)

print(f"Dataset saved to: {output_file}")
print(f"Total rows: {len(df)}")
print(f"Total columns: {len(df.columns)}")

<h2>Summary</h2>


In this lab, you:

- Loaded and explored the structure of the dataset.

- Handled missing data effectively.

- Analyzed key variables, including employment status, job satisfaction, and remote work trends.

- Investigated programming language usage by region and examined the relationship between experience and satisfaction.

- Used cross-tabulation to understand educational background and employment type.


Copyright © IBM Corporation. All rights reserved.
