<p style="text-align:center">
    <a href="https://skills.network" target="_blank">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="Skills Network Logo"  />
    </a>
</p>


# **Finding How The Data Is Distributed**


Estimated time needed: **30** minutes


In this lab, you will work with a cleaned dataset to perform Exploratory Data Analysis (EDA). You will examine the structure of the data, visualize key variables, and analyze trends related to developer experience, tools, job satisfaction, and other important aspects.


## Objectives


In this lab you will perform the following:


- Understand the structure of the dataset.

- Perform summary statistics and data visualization.

- Identify trends in developer experience, tools, job satisfaction, and other key variables.


### Install the required libraries


In [None]:
!pip install pandas
!pip install matplotlib
!pip install seaborn


### Step 1: Import Libraries and Load Data


- Import the `pandas`, `matplotlib.pyplot`, and `seaborn` libraries.


- Begin by loading the dataset. If you are working in JupyterLite, you can use the `pyfetch` method. Otherwise, call pandas' `read_csv()` function directly on your local machine or in a cloud environment.
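
For the JupyterLite route, a minimal sketch of a download helper might look like the following. It assumes the `pyodide.http.pyfetch` helper that is available in Pyodide-based kernels; the function name `download` and the local filename `survey-data.csv` are illustrative choices, not part of the lab.

```python
import pandas as pd


async def download(url, filename):
    """Fetch `url` inside a Pyodide/JupyterLite kernel and save it to `filename`."""
    # pyodide can only be imported inside a Pyodide-based kernel,
    # so the import is kept inside the function body.
    from pyodide.http import pyfetch

    response = await pyfetch(url)
    if response.status != 200:
        raise RuntimeError(f"Download failed with HTTP status {response.status}")
    with open(filename, "wb") as f:
        f.write(await response.bytes())


# In a JupyterLite notebook cell (top-level await is supported there),
# with data_url set to the dataset URL used in this lab:
# await download(data_url, "survey-data.csv")
# df = pd.read_csv("survey-data.csv")
```

On a local machine or a cloud notebook this helper is not needed, because `pd.read_csv()` can read the URL directly.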


In [None]:
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the Stack Overflow survey dataset
data_url = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/n01PQ9pSmiRX6520flujwQ/survey-data.csv'
df = pd.read_csv(data_url)

# Display the first few rows of the dataset
df.head()


### Step 2: Examine the Structure of the Data


- Display the column names, data types, and summary information to understand the data structure.

- Objective: Gain insights into the dataset's shape and available variables.
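
As one possible way to condense the structure into a single table, the sketch below builds a per-column overview. The frame `toy` and its values are made up for illustration; only the column names are borrowed from the survey.

```python
import pandas as pd

# Toy frame standing in for the survey data (values are made up)
toy = pd.DataFrame({
    "Employment": ["Employed, full-time", "Student", None],
    "JobSat": [8.0, None, 6.0],
})

# One row per column: data type, non-null count, and number of distinct values
overview = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "non_null": toy.notna().sum(),
    "n_unique": toy.nunique(),
})
print(overview)
```

The same three expressions can be applied to `df` once it is loaded; `nunique()` ignores missing values by default.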


In [None]:
## Write your code here

# Step 2: Examine the Structure of the Data

print("Dataset Structure:")
print("=" * 60)
print(f"Shape: {df.shape}")
print(f"\nColumn Names and Data Types:")
print(df.dtypes)

print("\n" + "=" * 60)
print("Dataset Info:")
df.info()

print("\n" + "=" * 60)
print("First few rows:")
df.head()

### Step 3: Handle Missing Data


- Identify missing values in the dataset.

- Impute or remove missing values as necessary to ensure data completeness.
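
The two options, imputing and removing, can be sketched on a small made-up frame. The values below are invented; which strategy suits which survey column is a judgment call that this sketch does not settle.

```python
import pandas as pd

# Toy frame with gaps in a numeric and a categorical column (values are made up)
toy = pd.DataFrame({
    "ConvertedCompYearly": [50000.0, None, 70000.0, None],
    "Country": ["India", "Germany", None, "Brazil"],
})

# Share of missing values per column, as a percentage
missing_pct = toy.isna().mean().mul(100).round(1)
print(missing_pct)

# Numeric column: impute with the median and assign the result back
median_comp = toy["ConvertedCompYearly"].median()
toy["ConvertedCompYearly"] = toy["ConvertedCompYearly"].fillna(median_comp)

# Categorical column: drop the rows where the value is missing
toy = toy.dropna(subset=["Country"])
print(toy)
```

Looking at the share of missing values, rather than the raw count, makes it easier to judge whether imputing a column is reasonable.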



In [None]:
## Write your code here

# Step 3: Handle Missing Data

# Identify missing values
print("Missing values in dataset:")
missing_data = df.isnull().sum()
print(missing_data[missing_data > 0].sort_values(ascending=False).head(10))

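# Handle missing values in key columns
if 'ConvertedCompYearly' in df.columns:
    median_comp = df['ConvertedCompYearly'].median()
    # Assign the result back: a chained fillna(inplace=True) raises a FutureWarning
    # in pandas 2.x and does not update df under Copy-on-Write
    df['ConvertedCompYearly'] = df['ConvertedCompYearly'].fillna(median_comp)
    print(f"\nFilled ConvertedCompYearly with median: ${median_comp:,.2f}")

if 'YearsCodePro' in df.columns:
    # Coerce to numeric first so the median is defined even if the column holds text;
    # non-numeric entries become NaN and are imputed together with the missing values
    df['YearsCodePro'] = pd.to_numeric(df['YearsCodePro'], errors='coerce')
    median_years = df['YearsCodePro'].median()
    df['YearsCodePro'] = df['YearsCodePro'].fillna(median_years)
    print(f"Filled YearsCodePro with median: {median_years}")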

### Step 4: Analyze Key Columns


- Examine key columns such as `Employment`, `JobSat` (Job Satisfaction), and `YearsCodePro` (Professional Coding Experience).

- **Instruction**: Calculate the value counts for each column to understand the distribution of responses.
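
A short sketch of two useful `value_counts()` options, shown on a made-up `Employment` series: `dropna=False` keeps missing responses visible, and `normalize=True` returns shares instead of counts.

```python
import pandas as pd

# Made-up responses, including one missing answer
employment = pd.Series(
    ["Employed, full-time", "Employed, full-time", "Student", None],
    name="Employment",
)

# Absolute counts, keeping missing responses as their own category
counts = employment.value_counts(dropna=False)

# Relative frequencies among the non-missing responses
shares = employment.value_counts(normalize=True)

print(counts)
print(shares.round(3))
```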



In [None]:
## Write your code here

# Step 4: Analyze Key Columns

# Employment distribution
if 'Employment' in df.columns:
    print("Employment Status Distribution:")
    print(df['Employment'].value_counts())
    print()

# JobSat distribution
if 'JobSat' in df.columns:
    print("Job Satisfaction Distribution:")
    print(df['JobSat'].value_counts())
    print()

# YearsCodePro distribution
if 'YearsCodePro' in df.columns:
    print("Years of Professional Coding Experience - Summary:")
    print(df['YearsCodePro'].describe())

### Step 5: Visualize Job Satisfaction (Focus on JobSat)


- Create a pie chart or KDE plot to visualize the distribution of `JobSat`.

- Provide an interpretation of the plot, highlighting key trends in job satisfaction.
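
To support the written interpretation with numbers, a few summary statistics can be computed alongside the plot. This is a sketch that assumes `JobSat` holds numeric scores; the values below are made up.

```python
import pandas as pd

# Made-up satisfaction scores
jobsat = pd.Series([8, 7, 9, 3, 8, 10, 6, 8], name="JobSat")

summary = {
    "mean": jobsat.mean(),
    "median": jobsat.median(),
    "mode": jobsat.mode().iloc[0],
    # Negative skew: a tail of low scores pulls the mean below the median
    "skew": jobsat.skew(),
}

for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

Comparing the mean with the median, and checking the sign of the skew, gives a concrete statement to pair with the shape seen in a KDE plot.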


In [None]:
## Write your code here

# Step 5: Visualize Job Satisfaction

if 'JobSat' in df.columns:
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    # Pie chart
    jobsat_counts = df['JobSat'].value_counts()
    axes[0].pie(jobsat_counts.values, labels=jobsat_counts.index, autopct='%1.1f%%', startangle=90)
    axes[0].set_title('Job Satisfaction Distribution (Pie Chart)')
    
    # KDE plot (if numeric)
    try:
        df['JobSat'].astype(float).plot(kind='kde', ax=axes[1], color='blue', linewidth=2)
        axes[1].set_title('Job Satisfaction Distribution (KDE)')
        axes[1].set_xlabel('Job Satisfaction')
        axes[1].set_ylabel('Density')
        axes[1].grid(True, alpha=0.3)
    except Exception:
        # If JobSat is not numeric (or the KDE cannot be computed), fall back to a bar chart of the counts
        jobsat_counts.plot(kind='bar', ax=axes[1], color='skyblue', edgecolor='black')
        axes[1].set_title('Job Satisfaction Distribution (Bar Chart)')
        axes[1].set_xlabel('Job Satisfaction')
        axes[1].set_ylabel('Count')
        axes[1].tick_params(axis='x', rotation=45)
    
    plt.tight_layout()
    plt.show()
    
    print(f"\nMost common JobSat response: {jobsat_counts.idxmax()} ({jobsat_counts.max()} respondents)")

### Step 6: Programming Languages Analysis


- Compare the frequency of programming languages in `LanguageHaveWorkedWith` and `LanguageWantToWorkWith`.
  
- Visualize the overlap or differences using a Venn diagram or a grouped bar chart.
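
One way to count the semicolon-separated languages without explicit loops is to split and explode each column. The sketch below uses a made-up frame; the helper name `language_counts` is hypothetical. The set operations at the end give the regions that a Venn diagram would show.

```python
import pandas as pd

# Toy responses using the survey's ';'-separated format (values are made up)
toy = pd.DataFrame({
    "LanguageHaveWorkedWith": ["Python;SQL", "JavaScript;Python", None],
    "LanguageWantToWorkWith": ["Python;Rust", "Rust", "Go;Python"],
})


def language_counts(column):
    """Count how many respondents listed each language in a ';'-separated column."""
    return column.dropna().str.split(";").explode().str.strip().value_counts()


have = language_counts(toy["LanguageHaveWorkedWith"])
want = language_counts(toy["LanguageWantToWorkWith"])

# Side-by-side table; a language absent from one column gets a count of 0
comparison = (
    pd.concat([have.rename("have"), want.rename("want")], axis=1)
    .fillna(0)
    .astype(int)
)
print(comparison)

# Set arithmetic gives the overlap and the differences
both = set(have.index) & set(want.index)
only_want = set(want.index) - set(have.index)
print("In both:", both)
print("Wanted but not yet used:", only_want)
```

The `comparison` table can be passed to `comparison.plot(kind='bar')` to obtain a grouped bar chart.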


In [None]:
## Write your code here

# Step 6: Programming Languages Analysis

if 'LanguageHaveWorkedWith' in df.columns and 'LanguageWantToWorkWith' in df.columns:
    # Parse languages from semicolon-separated strings
    from collections import Counter
    
    have_worked = []
    want_to_work = []
    
    for langs in df['LanguageHaveWorkedWith'].dropna():
        if isinstance(langs, str):
            have_worked.extend([lang.strip() for lang in langs.split(';')])
    
    for langs in df['LanguageWantToWorkWith'].dropna():
        if isinstance(langs, str):
            want_to_work.extend([lang.strip() for lang in langs.split(';')])
    
    have_counts = Counter(have_worked)
    want_counts = Counter(want_to_work)
    
    # Get top 10 languages
    top_have = dict(have_counts.most_common(10))
    top_want = dict(want_counts.most_common(10))
    
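    # Create grouped bar chart: take the union of both top-10 lists and keep the
    # 10 most frequent overall (sorted so the selection and order are deterministic)
    languages = sorted(
        set(top_have) | set(top_want),
        key=lambda lang: have_counts[lang] + want_counts[lang],
        reverse=True,
    )[:10]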
    have_values = [have_counts.get(lang, 0) for lang in languages]
    want_values = [want_counts.get(lang, 0) for lang in languages]
    
    x = range(len(languages))
    width = 0.35
    
    fig, ax = plt.subplots(figsize=(14, 6))
    ax.bar([i - width/2 for i in x], have_values, width, label='Have Worked With', color='steelblue')
    ax.bar([i + width/2 for i in x], want_values, width, label='Want To Work With', color='orange')
    
    ax.set_xlabel('Programming Languages')
    ax.set_ylabel('Frequency')
    ax.set_title('Programming Languages: Have Worked With vs Want To Work With')
    ax.set_xticks(x)
    ax.set_xticklabels(languages, rotation=45, ha='right')
    ax.legend()
    ax.grid(True, alpha=0.3, axis='y')
    
    plt.tight_layout()
    plt.show()
    
    print("\nTop 5 Languages Have Worked With:")
    for i, (lang, count) in enumerate(have_counts.most_common(5), 1):
        print(f"{i}. {lang}: {count}")
    
    print("\nTop 5 Languages Want To Work With:")
    for i, (lang, count) in enumerate(want_counts.most_common(5), 1):
        print(f"{i}. {lang}: {count}")

### Step 7: Analyze Remote Work Trends


- Visualize the distribution of `RemoteWork` by region (for example, by `Country`) using a grouped bar chart or heatmap.
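
Raw counts are dominated by the regions with the most respondents, so row-wise shares are often easier to compare. The following sketch uses `pd.crosstab` with `normalize="index"` on made-up data; the category labels are illustrative and may not match the survey's wording.

```python
import pandas as pd

# Made-up responses
toy = pd.DataFrame({
    "Country": ["India", "India", "Germany", "Germany", "Germany", "Brazil"],
    "RemoteWork": ["Remote", "In-person", "Remote", "Hybrid", "Remote", "Hybrid"],
})

# normalize="index" turns each row into shares that sum to 1,
# so countries with many respondents do not dominate the comparison
remote_share = pd.crosstab(toy["Country"], toy["RemoteWork"], normalize="index")
print(remote_share.mul(100).round(1))
```

A table like `remote_share` can be drawn as a heatmap with `sns.heatmap(remote_share, annot=True)`.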


In [None]:
## Write your code here

# Step 7: Analyze Remote Work Trends

if 'RemoteWork' in df.columns and 'Country' in df.columns:
    # Get top 10 countries
    top_countries = df['Country'].value_counts().head(10).index
    
    # Filter for top countries
    df_top = df[df['Country'].isin(top_countries)]
    
    # Create crosstab
    remote_by_country = pd.crosstab(df_top['Country'], df_top['RemoteWork'])
    
    # Create grouped bar chart
    remote_by_country.plot(kind='bar', figsize=(14, 6), colormap='Set3')
    plt.title('Remote Work Distribution by Top 10 Countries')
    plt.xlabel('Country')
    plt.ylabel('Count')
    plt.legend(title='Remote Work Status', bbox_to_anchor=(1.05, 1), loc='upper left')
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    plt.show()
    
    print("\nRemote Work Distribution by Country:")
    print(remote_by_country)

### Step 8: Correlation between Job Satisfaction and Experience


- Analyze the correlation between overall job satisfaction (`JobSat`) and `YearsCodePro`.
  
- Calculate the Pearson or Spearman correlation coefficient.
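
The difference between the two coefficients is easiest to see on a relationship that is monotonic but not linear. In the sketch below the data are made up, and Spearman's coefficient is computed as the Pearson correlation of the ranks, which is how it is defined.

```python
import pandas as pd

toy = pd.DataFrame({"YearsCodePro": [1, 2, 3, 4, 5]})
# Made-up score that rises monotonically, but not linearly, with experience
toy["JobSat"] = toy["YearsCodePro"] ** 3

# Pearson measures linear association
pearson = toy["YearsCodePro"].corr(toy["JobSat"], method="pearson")

# Spearman is the Pearson correlation of the ranks, so it measures monotonic association
spearman = toy["YearsCodePro"].rank().corr(toy["JobSat"].rank(), method="pearson")

print(f"Pearson:  {pearson:.4f}")
print(f"Spearman: {spearman:.4f}")
```

Here Spearman reaches 1 because the ordering is perfectly preserved, while Pearson stays below 1 because the points do not lie on a straight line. For an ordinal variable such as a satisfaction rating, Spearman is often the more natural choice.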


In [None]:
## Write your code here

# Step 8: Correlation between Job Satisfaction and Experience

if 'JobSat' in df.columns and 'YearsCodePro' in df.columns:
    # Try to convert to numeric
    try:
        df['JobSat_numeric'] = pd.to_numeric(df['JobSat'], errors='coerce')
        df['YearsCodePro_numeric'] = pd.to_numeric(df['YearsCodePro'], errors='coerce')
        
        # Drop NaN values
        df_corr = df[['JobSat_numeric', 'YearsCodePro_numeric']].dropna()
        
        # Calculate correlation
        pearson_corr = df_corr['JobSat_numeric'].corr(df_corr['YearsCodePro_numeric'], method='pearson')
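        # Spearman = Pearson on the ranks; this avoids the SciPy dependency of Series.corr(method='spearman')
        spearman_corr = df_corr['JobSat_numeric'].rank().corr(df_corr['YearsCodePro_numeric'].rank(), method='pearson')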
        
        print("Correlation Analysis: Job Satisfaction vs Years of Experience")
        print(f"Pearson correlation: {pearson_corr:.4f}")
        print(f"Spearman correlation: {spearman_corr:.4f}")
        
        if abs(pearson_corr) < 0.3:
            print("Interpretation: Weak correlation")
        elif abs(pearson_corr) < 0.7:
            print("Interpretation: Moderate correlation")
        else:
            print("Interpretation: Strong correlation")
    except Exception as e:
        print(f"Could not calculate numeric correlation: {e}")

### Step 9: Cross-tabulation Analysis (Employment vs. Education Level)


- Analyze the relationship between employment status (`Employment`) and education level (`EdLevel`).

- **Instruction**: Create a cross-tabulation using `pd.crosstab()` and visualize it with a stacked bar plot if possible.
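
As a small illustration of `pd.crosstab()`, the sketch below adds row and column totals with `margins=True` and then removes them again before plotting. The frame and its category labels are made up.

```python
import pandas as pd

# Made-up responses
toy = pd.DataFrame({
    "Employment": ["Full-time", "Full-time", "Student", "Full-time", "Student"],
    "EdLevel": ["Bachelor's", "Master's", "Secondary", "Bachelor's", "Bachelor's"],
})

# margins=True appends an "All" row and column holding the totals
table = pd.crosstab(toy["Employment"], toy["EdLevel"], margins=True)
print(table)

# Drop the totals before plotting so they are not drawn as an extra bar or segment
plot_data = table.drop(index="All", columns="All")
print(plot_data)
```

`plot_data.plot(kind='bar', stacked=True)` then produces one stacked bar per employment status.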


In [None]:
## Write your code here

# Step 9: Cross-tabulation Analysis (Employment vs Education Level)

if 'Employment' in df.columns and 'EdLevel' in df.columns:
    # Create cross-tabulation
    crosstab_emp_ed = pd.crosstab(df['Employment'], df['EdLevel'])
    
    print("Cross-tabulation: Employment vs Education Level")
    print(crosstab_emp_ed)
    
    # Visualize with stacked bar plot
    crosstab_emp_ed.plot(kind='bar', stacked=True, figsize=(14, 7), colormap='tab20')
    plt.title('Employment Status by Education Level')
    plt.xlabel('Employment Status')
    plt.ylabel('Count')
    plt.legend(title='Education Level', bbox_to_anchor=(1.05, 1), loc='upper left')
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    plt.show()

### Step 10: Export Cleaned Data


- Save the cleaned dataset to a new CSV file for further use or sharing.
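
A quick way to check an export is to read the file back and compare it with the frame in memory. This sketch writes a made-up frame to a temporary directory; the filename `toy_export.csv` is illustrative.

```python
import os
import tempfile

import pandas as pd

# Made-up frame standing in for the cleaned dataset
toy = pd.DataFrame({"Country": ["India", "Germany"], "JobSat": [8.0, 6.0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_export.csv")
    # index=False keeps the row index out of the file
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

same_shape = reloaded.shape == toy.shape
same_columns = list(reloaded.columns) == list(toy.columns)
print(f"Shape preserved: {same_shape}, columns preserved: {same_columns}")
```

Without `index=False`, the reloaded file would contain an extra unnamed column holding the old row index.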


In [None]:
## Write your code here

# Step 10: Export Cleaned Data

output_file = 'cleaned_survey_data_distributed.csv'
df.to_csv(output_file, index=False)

print(f"Cleaned dataset saved to: {output_file}")
print(f"Total rows: {len(df)}")
print(f"Total columns: {len(df.columns)}")

### Summary:


In this lab, you practiced key skills in exploratory data analysis, including:


- Examining the structure and content of the Stack Overflow survey dataset to understand its variables and data types.

- Identifying and addressing missing data to ensure the dataset's quality and completeness.

- Summarizing and visualizing key variables such as job satisfaction, programming languages, and remote work trends.

- Analyzing relationships in the data using techniques like:
    - Comparing programming languages respondents have worked with versus those they want to work with.
      
    - Exploring remote work preferences by region.

- Investigating correlations between professional coding experience and job satisfaction.

- Performing cross-tabulations to analyze relationships between employment status and education levels.


## Authors:
Ayushi Jain


### Other Contributors:
Rav Ahuja
Lakshmi Holla
Malika


Copyright © IBM Corporation. All rights reserved.
