<p style="text-align:center">
    <a href="https://skills.network" target="_blank">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="Skills Network Logo"  />
    </a>
</p>


# **Finding Outliers**


Estimated time needed: **30** minutes


In this lab, you will work with a cleaned dataset to perform exploratory data analysis (EDA). 
You will explore the distribution of key variables and focus on identifying outliers.


## Objectives


In this lab, you will perform the following:


-  Analyze the distribution of key variables in the dataset.

-  Identify and remove outliers using statistical methods.

-  Perform relevant statistical and correlation analysis.


#### Install and import the required libraries


In [None]:
!pip install pandas
!pip install matplotlib
!pip install seaborn

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

<h3>Step 1: Load and Explore the Dataset</h3>


Load the dataset into a DataFrame and examine the structure of the data.
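
One way to examine the structure is to tabulate each column's type and how much of it is missing. The sketch below is illustrative only: `summarize_structure` is a hypothetical helper, and `toy` is made-up data standing in for the survey file, not the real dataset.

```python
import pandas as pd

def summarize_structure(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, non-null count, and percent missing."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "non_null": frame.notna().sum(),
        "missing_pct": (frame.isna().mean() * 100).round(1),
    })

# Hypothetical toy data (assumption), not the real survey responses
toy = pd.DataFrame({
    "Industry": ["Fintech", "Healthcare", None, "Fintech"],
    "ConvertedCompYearly": [65000.0, None, None, 120000.0],
})

summary = summarize_structure(toy)
print(toy.shape)
print(summary)
```

Calling `summarize_structure(df)` on the loaded DataFrame should give a quick view of which columns have enough data to analyze.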


In [None]:
file_url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/n01PQ9pSmiRX6520flujwQ/survey-data.csv"

# Create the DataFrame
df = pd.read_csv(file_url)

# Display the first 10 records
df.head(10)


<h3>Step 2: Plot the Distribution of Industry</h3>


Explore how respondents are distributed across different industries.

- Plot a bar chart to visualize the distribution of respondents by industry.

- Highlight any notable trends.
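
When a categorical column has many rare values, a bar chart can be easier to read if the long tail is grouped into a single bucket. The sketch below shows one possible approach on made-up data. `top_n_with_other` is a hypothetical helper, and the label `"Other"` is an assumption that would collide with a real category of the same name.

```python
import pandas as pd

def top_n_with_other(values: pd.Series, n: int = 3) -> pd.Series:
    """Count categories, keep the n most frequent, and pool the rest as 'Other'."""
    counts = values.value_counts()  # sorted descending, NaN excluded
    top = counts.head(n).copy()
    remainder = int(counts.iloc[n:].sum())
    if remainder > 0:
        top["Other"] = remainder
    return top

# Hypothetical toy data (assumption), not the real survey responses
toy_industry = pd.Series(
    ["Fintech"] * 5 + ["Healthcare"] * 3 + ["Retail"] * 2 + ["Energy", "Media"]
)

grouped = top_n_with_other(toy_industry, n=2)
shares = (grouped / grouped.sum() * 100).round(1)
print(grouped)
print(shares)
```

The resulting Series can be passed to `.plot(kind='bar')` in the same way as the output of `value_counts()`.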


In [None]:
##Write your code here

# Step 2: Plot the Distribution of Industry

if 'Industry' in df.columns:
    plt.figure(figsize=(14, 6))
    
    industry_counts = df['Industry'].value_counts().head(15)
    industry_counts.plot(kind='bar', color='steelblue', edgecolor='black')
    plt.title('Distribution of Respondents by Industry (Top 15)')
    plt.xlabel('Industry')
    plt.ylabel('Count')
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    plt.show()
    
    print("Top 10 Industries:")
    print(df['Industry'].value_counts().head(10))
else:
    print("Industry column not found in dataset")

<h3>Step 3: Identify High Compensation Outliers</h3>


Identify respondents with extremely high yearly compensation.

- Calculate basic statistics (mean, median, and standard deviation) for `ConvertedCompYearly`.

- Identify compensation values exceeding a defined threshold (e.g., 3 standard deviations above the mean).
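
The 3-standard-deviation rule has a known weakness: extreme values inflate both the mean and the standard deviation, which raises the threshold that is meant to catch them. The sketch below illustrates this on made-up numbers. `flag_high_outliers` is a hypothetical helper, and the two samples are invented for the example.

```python
import pandas as pd

def flag_high_outliers(values: pd.Series, k: float = 3.0) -> pd.Series:
    """Return the values more than k sample standard deviations above the mean."""
    threshold = values.mean() + k * values.std()  # pandas std() uses ddof=1
    return values[values > threshold]

# Hypothetical toy samples (assumption): identical salaries plus one extreme value
large_sample = pd.Series([50_000.0] * 19 + [2_000_000.0])
small_sample = pd.Series([50_000.0] * 9 + [2_000_000.0])

print(len(flag_high_outliers(large_sample)))  # 1: the extreme value is flagged
print(len(flag_high_outliers(small_sample)))  # 0: the same value escapes detection
```

In the 10-value sample, the single extreme value pulls the threshold above itself, so nothing is flagged. This is one reason to compare the result with a quantile-based method such as the IQR rule.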


In [None]:
##Write your code here

# Step 3: Identify High Compensation Outliers

if 'ConvertedCompYearly' in df.columns:
    # Calculate statistics
    mean_comp = df['ConvertedCompYearly'].mean()
    median_comp = df['ConvertedCompYearly'].median()
    std_comp = df['ConvertedCompYearly'].std()
    
    print("ConvertedCompYearly Statistics:")
    print(f"Mean: ${mean_comp:,.2f}")
    print(f"Median: ${median_comp:,.2f}")
    print(f"Standard Deviation: ${std_comp:,.2f}")
    
    # Identify outliers (3 standard deviations above mean)
    threshold = mean_comp + (3 * std_comp)
    outliers = df[df['ConvertedCompYearly'] > threshold]
    
    print(f"\nThreshold (mean + 3*std): ${threshold:,.2f}")
    print(f"Number of outliers: {len(outliers)}")
    print(f"Percentage of outliers: {(len(outliers)/len(df)*100):.2f}%")
    
    if len(outliers) > 0:
        print(f"\nTop 5 highest compensations:")
        print(df['ConvertedCompYearly'].nlargest(5))
else:
    print("ConvertedCompYearly column not found")

<h3>Step 4: Detect Outliers in Compensation</h3>


Identify outliers in the `ConvertedCompYearly` column using the IQR method.

- Calculate the Interquartile Range (IQR).

- Determine the upper and lower bounds for outliers.

- Count and visualize outliers using a box plot.
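
The bounds are conventionally placed 1.5 × IQR below Q1 and above Q3. The small worked example below uses made-up values so that the arithmetic can be checked by hand. `iqr_bounds` is a hypothetical helper, and the quartiles rely on the default linear interpolation of `Series.quantile`.

```python
import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return (lower, upper) outlier bounds: Q1 - k*IQR and Q3 + k*IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical toy data (assumption), not the real survey responses
toy_comp = pd.Series([40_000, 50_000, 60_000, 70_000, 80_000, 500_000], dtype=float)

lower, upper = iqr_bounds(toy_comp)
flagged = toy_comp[(toy_comp < lower) | (toy_comp > upper)]

# Q1 = 52,500 and Q3 = 77,500, so IQR = 25,000
print(lower, upper)      # 15000.0 115000.0
print(flagged.tolist())  # [500000.0]
```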


In [None]:
##Write your code here

# Step 4: Detect Outliers in Compensation using IQR method

if 'ConvertedCompYearly' in df.columns:
    # Calculate IQR
    Q1 = df['ConvertedCompYearly'].quantile(0.25)
    Q3 = df['ConvertedCompYearly'].quantile(0.75)
    IQR = Q3 - Q1
    
    # Calculate bounds
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    
    print("IQR Method for Outlier Detection:")
    print(f"Q1 (25th percentile): ${Q1:,.2f}")
    print(f"Q3 (75th percentile): ${Q3:,.2f}")
    print(f"IQR: ${IQR:,.2f}")
    print(f"Lower Bound: ${lower_bound:,.2f}")
    print(f"Upper Bound: ${upper_bound:,.2f}")
    
    # Count outliers
    outliers_low = df[df['ConvertedCompYearly'] < lower_bound]
    outliers_high = df[df['ConvertedCompYearly'] > upper_bound]
    
    print(f"\nOutliers below lower bound: {len(outliers_low)}")
    print(f"Outliers above upper bound: {len(outliers_high)}")
    print(f"Total outliers: {len(outliers_low) + len(outliers_high)}")
    
    # Box plot visualization
    plt.figure(figsize=(10, 6))
    plt.boxplot(df['ConvertedCompYearly'].dropna(), vert=True)
    plt.title('Box Plot: ConvertedCompYearly (Outlier Detection)')
    plt.ylabel('Annual Compensation ($)')
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()
else:
    print("ConvertedCompYearly column not found")

<h3>Step 5: Remove Outliers and Create a New DataFrame</h3>


Remove outliers from the dataset.

- Create a new DataFrame excluding rows with outliers in `ConvertedCompYearly`.
- Validate the size of the new DataFrame.
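
When validating the size, keep in mind that a range filter also drops rows where the value is missing, because comparisons against `NaN` evaluate to `False`. The sketch below makes that choice explicit on made-up data. `drop_iqr_outliers` and its `keep_missing` flag are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd

def drop_iqr_outliers(frame: pd.DataFrame, column: str, k: float = 1.5,
                      keep_missing: bool = False) -> pd.DataFrame:
    """Return a copy of frame without rows outside the IQR bounds of column."""
    q1 = frame[column].quantile(0.25)  # quantile() ignores NaN
    q3 = frame[column].quantile(0.75)
    iqr = q3 - q1
    within = frame[column].between(q1 - k * iqr, q3 + k * iqr)  # NaN -> False
    if keep_missing:
        within = within | frame[column].isna()
    return frame[within].copy()

# Hypothetical toy data (assumption): one extreme value and two missing values
toy = pd.DataFrame({
    "ConvertedCompYearly": [40_000, 50_000, 60_000, 70_000, 80_000,
                            500_000, np.nan, np.nan]
})

strict = drop_iqr_outliers(toy, "ConvertedCompYearly")
lenient = drop_iqr_outliers(toy, "ConvertedCompYearly", keep_missing=True)
print(len(toy), len(strict), len(lenient))  # 8 5 7
```

Only one of the three rows removed in the strict version is an actual outlier. The other two were simply missing, which matters when reporting the percentage of rows removed.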


In [None]:
##Write your code here

# Step 5: Remove Outliers and Create a New DataFrame

if 'ConvertedCompYearly' in df.columns:
    # Calculate IQR bounds
    Q1 = df['ConvertedCompYearly'].quantile(0.25)
    Q3 = df['ConvertedCompYearly'].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    
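    # Remove outliers. Rows with missing compensation are also dropped,
    # because comparisons against NaN evaluate to False.
    # .copy() makes the result an independent DataFrame, so adding columns
    # later does not trigger a SettingWithCopyWarning.
    df_no_outliers = df[(df['ConvertedCompYearly'] >= lower_bound) &
                        (df['ConvertedCompYearly'] <= upper_bound)].copy()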
    
    print("Outlier Removal Results:")
    print(f"Original dataset size: {len(df)}")
    print(f"New dataset size: {len(df_no_outliers)}")
    print(f"Rows removed: {len(df) - len(df_no_outliers)}")
    print(f"Percentage removed: {((len(df) - len(df_no_outliers))/len(df)*100):.2f}%")
    
    print(f"\nNew compensation statistics:")
    print(f"Mean: ${df_no_outliers['ConvertedCompYearly'].mean():,.2f}")
    print(f"Median: ${df_no_outliers['ConvertedCompYearly'].median():,.2f}")
    print(f"Std Dev: ${df_no_outliers['ConvertedCompYearly'].std():,.2f}")
else:
    df_no_outliers = df.copy()
    print("ConvertedCompYearly column not found, no outliers removed")

<h3>Step 6: Correlation Analysis</h3>


Analyze the correlation between `Age` (transformed) and other numerical columns.

- Map the `Age` column to approximate numeric values.

- Compute correlations between `Age` and other numeric variables.

- Visualize the correlation matrix.
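
Because `Age` is recorded as a bracket label, one common approximation is to replace each bracket with a representative value such as its midpoint. The sketch below shows the idea on made-up data. The midpoints in `age_midpoints` are assumptions, and any label absent from the mapping becomes `NaN` and is ignored by `corr()`.

```python
import pandas as pd

# Assumed representative values for each bracket (approximate midpoints)
age_midpoints = {
    "18-24 years old": 21,
    "25-34 years old": 29.5,
    "35-44 years old": 39.5,
    "45-54 years old": 49.5,
}

# Hypothetical toy data (assumption), not the real survey responses
toy = pd.DataFrame({
    "Age": ["18-24 years old", "25-34 years old", "35-44 years old",
            "45-54 years old", "Prefer not to say"],
    "WorkExp": [2, 8, 15, 25, 10],
})

# Labels missing from the mapping (here "Prefer not to say") map to NaN
toy["Age_numeric"] = toy["Age"].map(age_midpoints)

# Pearson correlation, computed only on rows where both values are present
age_vs_exp = toy["Age_numeric"].corr(toy["WorkExp"])
print(toy["Age_numeric"].isna().sum())
print(round(age_vs_exp, 3))
```

Since the midpoints are only approximations of each respondent's true age, the resulting coefficients are best read as rough indicators of direction and strength.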


In [None]:
##Write your code here

# Step 6: Correlation Analysis

if 'Age' in df_no_outliers.columns:
    # Map Age column to numeric values
    age_mapping = {
        'Under 18 years old': 16,
        '18-24 years old': 21,
        '25-34 years old': 29.5,
        '35-44 years old': 39.5,
        '45-54 years old': 49.5,
        '55-64 years old': 59.5,
        '65 years or older': 70,
        'Prefer not to say': None
    }
    
    df_no_outliers['Age_numeric'] = df_no_outliers['Age'].map(age_mapping)
    
    # Select numerical columns for correlation
    numeric_cols = df_no_outliers.select_dtypes(include=['number']).columns
    
    # Calculate correlations with Age
    print("Correlation Analysis: Age vs Other Numerical Columns")
    print("=" * 60)
    
    age_correlations = df_no_outliers[numeric_cols].corr()['Age_numeric'].sort_values(ascending=False)
    print(age_correlations.dropna())
    
    # Visualize correlation matrix
    plt.figure(figsize=(12, 10))
    
    # Select key columns for correlation matrix
    key_cols = ['Age_numeric']
    if 'ConvertedCompYearly' in numeric_cols:
        key_cols.append('ConvertedCompYearly')
    if 'YearsCodePro' in numeric_cols:
        key_cols.append('YearsCodePro')
    if 'WorkExp' in numeric_cols:
        key_cols.append('WorkExp')
    
    corr_matrix = df_no_outliers[key_cols].corr()
    
    sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', center=0, 
                square=True, linewidths=1, cbar_kws={"shrink": 0.8})
    plt.title('Correlation Matrix: Age and Key Variables')
    plt.tight_layout()
    plt.show()
else:
    print("Age column not found in dataset")

<h3> Summary </h3>


In this lab, you developed essential skills in **Exploratory Data Analysis (EDA)** with a focus on outlier detection and removal. Specifically, you:


- Loaded and explored the dataset to understand its structure.

- Analyzed the distribution of respondents across industries.

- Identified and removed high compensation outliers using statistical thresholds and the Interquartile Range (IQR) method.

- Performed correlation analysis, including transforming the `Age` column into numeric values for better analysis.


<!--
## Change Log
|Date (YYYY-MM-DD)|Version|Changed By|Change Description|
|-|-|-|-|
|2024-10-01|1.1|Madhusudan Moole|Reviewed and updated lab|
|2024-09-29|1.0|Raghul Ramesh|Created lab|
-->


Copyright © IBM Corporation. All rights reserved.
