<p style="text-align:center">
    <a href="https://skills.network" target="_blank">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="Skills Network Logo"  />
    </a>
</p>


# **Finding Correlation**


Estimated time needed: **30** minutes


In this lab, you will work with a cleaned dataset to perform exploratory data analysis (EDA). You will examine the distribution of the data, identify outliers, and determine the correlation between different columns in the dataset.


## Objectives


In this lab, you will perform the following:


- Identify the distribution of compensation data in the dataset.

- Remove outliers to refine the dataset.

- Identify correlations between various features in the dataset.


## Hands-on Lab


### Step 1: Install and Import Required Libraries


In [None]:
# Install the necessary libraries
!pip install pandas
!pip install matplotlib
!pip install seaborn
!pip install scipy  # required by pandas' kind='kde' plot used in Step 3

# Import libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


### Step 2: Load the Dataset


In [None]:
# Load the dataset from the given URL
file_url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/n01PQ9pSmiRX6520flujwQ/survey-data.csv"
df = pd.read_csv(file_url)

# Display the first few rows to understand the structure of the dataset
df.head()

<h3>Step 3: Analyze and Visualize Compensation Distribution</h3>


**Task**: Plot the distribution and histogram for `ConvertedCompYearly` to examine the spread of yearly compensation among respondents.
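
Before plotting, it can help to see why compensation data usually benefits from both a histogram and a box plot. The sketch below uses a small, made-up series (the values are hypothetical, not taken from the survey) in which a single very large salary pulls the mean far above the median, which is the signature of a right-skewed distribution.

```python
import pandas as pd

# Hypothetical toy values for illustration only (not survey data)
toy_comp = pd.Series(
    [30_000, 45_000, 52_000, 60_000, 75_000, 90_000, 1_200_000],
    name="ConvertedCompYearly",
)

mean_comp = toy_comp.mean()      # sensitive to the single extreme value
median_comp = toy_comp.median()  # robust to the extreme value
skewness = toy_comp.skew()       # positive values indicate a long right tail

print(f"Mean:   {mean_comp:,.0f}")
print(f"Median: {median_comp:,.0f}")
print(f"Skew:   {skewness:.2f}")
```

When the mean sits well above the median like this, a histogram of the raw values tends to pile most respondents into the first few bins, so a box plot alongside it makes the extreme values easier to see.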


In [None]:
## Write your code here

# Step 3: Analyze and Visualize Compensation Distribution

if 'ConvertedCompYearly' in df.columns:
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    # Histogram with distribution curve
    axes[0].hist(df['ConvertedCompYearly'].dropna(), bins=50, edgecolor='black', alpha=0.7, density=True)
    df['ConvertedCompYearly'].dropna().plot(kind='kde', ax=axes[0], color='red', linewidth=2)
    axes[0].set_title('ConvertedCompYearly Distribution (Histogram + KDE)')
    axes[0].set_xlabel('Annual Compensation ($)')
    axes[0].set_ylabel('Density')
    axes[0].grid(True, alpha=0.3)
    
    # Box plot
    axes[1].boxplot(df['ConvertedCompYearly'].dropna())
    axes[1].set_title('ConvertedCompYearly Distribution (Box Plot)')
    axes[1].set_ylabel('Annual Compensation ($)')
    axes[1].grid(True, alpha=0.3, axis='y')
    
    plt.tight_layout()
    plt.show()
    
    print("ConvertedCompYearly Statistics:")
    print(df['ConvertedCompYearly'].describe())
else:
    print("ConvertedCompYearly column not found")

<h3>Step 4: Calculate Median Compensation for Full-Time Employees</h3>


**Task**: Filter the data to calculate the median compensation for respondents whose employment status is "Employed, full-time."
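
As a minimal sketch of the filtering step, the example below computes the median on a made-up DataFrame (all rows are hypothetical). It also contrasts an exact match with a substring match, on the assumption that some rows might list several employment statuses joined by a semicolon; whether the survey file actually does this should be checked with `df['Employment'].unique()`.

```python
import pandas as pd

# Hypothetical toy rows for illustration only (not survey data)
toy = pd.DataFrame({
    "Employment": [
        "Employed, full-time",
        "Employed, part-time",
        "Employed, full-time",
        "Employed, full-time;Student, part-time",  # assumed multi-status format
        None,
        "Employed, full-time",
    ],
    "ConvertedCompYearly": [50_000, 20_000, 70_000, 60_000, 40_000, 90_000],
})

# Exact match: only rows whose status is exactly "Employed, full-time"
exact_mask = toy["Employment"] == "Employed, full-time"
median_exact = toy.loc[exact_mask, "ConvertedCompYearly"].median()

# Substring match: also keeps rows that list full-time among other statuses
contains_mask = toy["Employment"].str.contains(
    "Employed, full-time", regex=False, na=False
)
median_contains = toy.loc[contains_mask, "ConvertedCompYearly"].median()

print(f"Exact match median:     {median_exact:,.0f}")
print(f"Substring match median: {median_contains:,.0f}")
```

The two medians differ here because the substring match includes the multi-status row. The task wording asks for the status "Employed, full-time", so the exact match is the literal reading.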


In [None]:
## Write your code here

# Step 4: Calculate Median Compensation for Full-Time Employees

if 'ConvertedCompYearly' in df.columns and 'Employment' in df.columns:
    # Filter for full-time employees
    fulltime_df = df[df['Employment'] == 'Employed, full-time']
    
    median_fulltime_comp = fulltime_df['ConvertedCompYearly'].median()
    
    print("Full-Time Employee Compensation Analysis:")
    print(f"Number of full-time employees: {len(fulltime_df)}")
    print(f"Median compensation: ${median_fulltime_comp:,.2f}")
    print("\nFull statistics:")
    print(fulltime_df['ConvertedCompYearly'].describe())
else:
    print("Required columns not found")

<h3>Step 5: Analyzing Compensation Range and Distribution by Country</h3>


**Task**: Explore the range of compensation in the `ConvertedCompYearly` column across countries. Use box plots to compare the compensation distribution for each country and to identify variations and anomalies within each one.
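
Box plots show the spread visually; a grouped summary table gives the same comparison in numbers. The sketch below is one possible way to build such a table with `groupby` on a made-up DataFrame (country names and values are hypothetical).

```python
import pandas as pd

# Hypothetical toy rows for illustration only (not survey data)
toy = pd.DataFrame({
    "Country": ["A", "A", "A", "B", "B", "C"],
    "ConvertedCompYearly": [10_000, 20_000, 30_000, 100_000, 200_000, None],
})

# Per-country count of non-missing values, median, and range
summary = (
    toy.groupby("Country")["ConvertedCompYearly"]
       .agg(["count", "median", "min", "max"])
       .sort_values("median", ascending=False)
)

print(summary)
```

Country `C` has no reported compensation, so its count is 0 and its median is `NaN`. Checking the count column first helps avoid reading too much into countries with very few respondents.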



In [None]:
## Write your code here

# Step 5: Analyzing Compensation Range and Distribution by Country

if 'ConvertedCompYearly' in df.columns and 'Country' in df.columns:
    # Get top 10 countries by respondent count
    top_countries = df['Country'].value_counts().head(10).index
    
    # Filter for top countries
    df_top_countries = df[df['Country'].isin(top_countries)]
    
    # Create box plots
    plt.figure(figsize=(14, 6))
    
    # Prepare data for box plot
    data_by_country = [df_top_countries[df_top_countries['Country'] == country]['ConvertedCompYearly'].dropna() 
                       for country in top_countries]
    
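    # Set tick labels via xticks; the `labels` keyword of boxplot is deprecated in newer matplotlib
    plt.boxplot(data_by_country)
    plt.title('Compensation Distribution by Country (Top 10)')
    plt.xlabel('Country')
    plt.ylabel('Annual Compensation ($)')
    plt.xticks(range(1, len(top_countries) + 1), top_countries, rotation=45, ha='right')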
    plt.grid(True, alpha=0.3, axis='y')
    plt.tight_layout()
    plt.show()
    
    print("Median Compensation by Country:")
    for country in top_countries:
        median = df_top_countries[df_top_countries['Country'] == country]['ConvertedCompYearly'].median()
        print(f"{country}: ${median:,.2f}")
else:
    print("Required columns not found")

<h3>Step 6: Removing Outliers from the Dataset</h3>


**Task**: Create a new DataFrame by removing outliers from the `ConvertedCompYearly` column to get a refined dataset for correlation analysis.
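
One common way to define outliers is the interquartile range (IQR) rule: values more than 1.5 × IQR below the first quartile or above the third quartile are flagged. The sketch below applies that rule to a made-up series; the helper name `iqr_bounds` and the values are hypothetical, and 1.5 is the conventional multiplier rather than something the task prescribes.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) fences placed k * IQR beyond the quartiles."""
    q1 = series.quantile(0.25)  # quantile() ignores missing values
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical toy values for illustration only (not survey data)
toy_comp = pd.Series([10, 20, 30, 40, 50, 1000, None], dtype="float64")

lower, upper = iqr_bounds(toy_comp)
is_missing = toy_comp.isna()
in_range = toy_comp.between(lower, upper)  # missing values evaluate to False

kept = toy_comp[in_range]
n_outliers = int((~in_range & ~is_missing).sum())

print(f"Bounds: [{lower}, {upper}]")
print(f"Kept: {len(kept)}, outliers: {n_outliers}, missing: {int(is_missing.sum())}")
```

Counting missing values separately matters because a range filter silently drops them too, so the difference in row counts before and after filtering is not the number of outliers on its own.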


In [None]:
## Write your code here

# Step 6: Removing Outliers from the Dataset

if 'ConvertedCompYearly' in df.columns:
    # Calculate IQR
    Q1 = df['ConvertedCompYearly'].quantile(0.25)
    Q3 = df['ConvertedCompYearly'].quantile(0.75)
    IQR = Q3 - Q1
    
    # Calculate bounds
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    
    # Remove outliers (rows with missing compensation also fail this filter and are dropped)
    df_cleaned = df[(df['ConvertedCompYearly'] >= lower_bound) & 
                    (df['ConvertedCompYearly'] <= upper_bound)].copy()
    n_missing = df['ConvertedCompYearly'].isna().sum()
    
    print("Outlier Removal using IQR Method:")
    print(f"Q1: ${Q1:,.2f}")
    print(f"Q3: ${Q3:,.2f}")
    print(f"IQR: ${IQR:,.2f}")
    print(f"Lower Bound: ${lower_bound:,.2f}")
    print(f"Upper Bound: ${upper_bound:,.2f}")
    print(f"\nOriginal dataset size: {len(df)}")
    print(f"Rows with missing compensation (also dropped): {n_missing}")
    print(f"Cleaned dataset size: {len(df_cleaned)}")
    print(f"Outliers removed: {len(df) - len(df_cleaned) - n_missing}")
else:
    df_cleaned = df.copy()
    print("ConvertedCompYearly column not found")

<h3>Step 7: Finding Correlations Between Key Variables</h3>


**Task**: Calculate correlations between `ConvertedCompYearly`, `WorkExp`, and `JobSatPoints_1`. Visualize these correlations with a heatmap.
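
By default, `DataFrame.corr()` computes the Pearson coefficient, which measures linear association. The sketch below uses made-up numbers (hypothetical, not survey data) to show how a perfectly monotonic but curved relationship scores below 1 on Pearson, while the rank-based Spearman coefficient, computed here as Pearson on ranks, equals 1.

```python
import pandas as pd

# Hypothetical toy data: compensation grows with the square of experience
toy = pd.DataFrame({
    "WorkExp": [1, 2, 3, 4, 5],
    "ConvertedCompYearly": [1_000, 4_000, 9_000, 16_000, 25_000],
})

# Pearson: strength of the linear relationship
pearson = toy.corr().loc["WorkExp", "ConvertedCompYearly"]

# Spearman: Pearson correlation applied to the ranks of each column
spearman = toy.rank().corr().loc["WorkExp", "ConvertedCompYearly"]

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

If the heatmap shows a weak Pearson value, comparing it with a rank-based value is one way to check whether the relationship is weak or simply non-linear.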


In [None]:
## Write your code here

# Step 7: Finding Correlations Between Key Variables

# Select the key columns that are present and numeric.
# YearsCodePro is an optional extra beyond the task; it is skipped if stored as text,
# because corr() raises an error on non-numeric columns in recent pandas versions.
candidate_columns = ['ConvertedCompYearly', 'WorkExp', 'JobSatPoints_1', 'YearsCodePro']
correlation_columns = [col for col in candidate_columns
                       if col in df_cleaned.columns
                       and pd.api.types.is_numeric_dtype(df_cleaned[col])]

if len(correlation_columns) >= 2:
    # Calculate correlation matrix
    corr_matrix = df_cleaned[correlation_columns].corr()
    
    print("Correlation Matrix:")
    print(corr_matrix)
    
    # Visualize with heatmap
    plt.figure(figsize=(10, 8))
    sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', center=0, 
                square=True, linewidths=2, fmt='.3f', 
                cbar_kws={"shrink": 0.8})
    plt.title('Correlation Heatmap: Key Variables')
    plt.tight_layout()
    plt.show()
    
    print("\nKey Correlations:")
    if 'ConvertedCompYearly' in correlation_columns and 'WorkExp' in correlation_columns:
        corr_comp_exp = corr_matrix.loc['ConvertedCompYearly', 'WorkExp']
        print(f"Compensation vs WorkExp: {corr_comp_exp:.3f}")
    if 'ConvertedCompYearly' in correlation_columns and 'JobSatPoints_1' in correlation_columns:
        corr_comp_sat = corr_matrix.loc['ConvertedCompYearly', 'JobSatPoints_1']
        print(f"Compensation vs JobSatisfaction: {corr_comp_sat:.3f}")
else:
    print("Not enough numerical columns for correlation analysis")

<h3>Step 8: Scatter Plot for Correlations</h3>


**Task**: Create scatter plots to examine specific correlations between `ConvertedCompYearly` and `WorkExp`, as well as between `ConvertedCompYearly` and `JobSatPoints_1`.
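
A fitted straight line can make the direction of a relationship easier to read on a dense scatter plot. The sketch below fits a least-squares line with `numpy.polyfit` on made-up values (hypothetical, not survey data); it is one possible addition to the plots, not a required part of the task.

```python
import numpy as np

# Hypothetical toy values for illustration only (not survey data)
work_exp = np.array([0, 2, 4, 6, 8, 10], dtype=float)
comp = np.array([40_000, 46_000, 50_000, 57_000, 61_000, 70_000], dtype=float)

# Degree-1 polynomial fit returns the slope first, then the intercept
slope, intercept = np.polyfit(work_exp, comp, deg=1)
trend = slope * work_exp + intercept

print(f"Estimated change in compensation per extra year: {slope:,.0f}")
```

To overlay the line on an existing scatter plot, the fitted values can be passed to the same axes, for example `axes[0].plot(work_exp, trend, color='red')`.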


In [None]:
## Write your code here

# Step 8: Scatter Plot for Correlations

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Scatter plot 1: ConvertedCompYearly vs WorkExp
if 'ConvertedCompYearly' in df_cleaned.columns and 'WorkExp' in df_cleaned.columns:
    df_scatter1 = df_cleaned[['ConvertedCompYearly', 'WorkExp']].dropna()
    axes[0].scatter(df_scatter1['WorkExp'], df_scatter1['ConvertedCompYearly'], 
                    alpha=0.5, s=20, c='blue')
    axes[0].set_xlabel('Work Experience (years)')
    axes[0].set_ylabel('Annual Compensation ($)')
    axes[0].set_title('Compensation vs Work Experience')
    axes[0].grid(True, alpha=0.3)
else:
    axes[0].text(0.5, 0.5, 'Data not available', ha='center', va='center')
    axes[0].set_title('Compensation vs Work Experience')

# Scatter plot 2: ConvertedCompYearly vs JobSatPoints_1
if 'ConvertedCompYearly' in df_cleaned.columns and 'JobSatPoints_1' in df_cleaned.columns:
    df_scatter2 = df_cleaned[['ConvertedCompYearly', 'JobSatPoints_1']].dropna()
    axes[1].scatter(df_scatter2['JobSatPoints_1'], df_scatter2['ConvertedCompYearly'], 
                    alpha=0.5, s=20, c='green')
    axes[1].set_xlabel('Job Satisfaction Points')
    axes[1].set_ylabel('Annual Compensation ($)')
    axes[1].set_title('Compensation vs Job Satisfaction')
    axes[1].grid(True, alpha=0.3)
else:
    axes[1].text(0.5, 0.5, 'Data not available', ha='center', va='center')
    axes[1].set_title('Compensation vs Job Satisfaction')

plt.tight_layout()
plt.show()

print("Scatter plots created to visualize relationships between variables")

<h3>Summary</h3>


In this lab, you practiced essential skills in correlation analysis by:

- Examining the distribution of yearly compensation with histograms and box plots.
- Detecting and removing outliers from compensation data.
- Calculating correlations between key variables such as compensation, work experience, and job satisfaction.
- Visualizing relationships with scatter plots and heatmaps to gain insights into the associations between these features.

By following these steps, you have developed a solid foundation for analyzing relationships within the dataset.


## Authors:
Ayushi Jain


### Other Contributors:
- Rav Ahuja
- Lakshmi Holla
- Malika


Copyright © IBM Corporation. All rights reserved.
