<p style="text-align:center">
    <a href="https://skills.network" target="_blank">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="Skills Network Logo"  />
    </a>
</p>


# **Data Wrangling Lab**


Estimated time needed: **45** minutes


In this lab, you will perform data wrangling tasks to prepare raw data for analysis. Data wrangling involves cleaning, transforming, and organizing data into a structured format suitable for analysis. This lab focuses on tasks like identifying inconsistencies, encoding categorical variables, and feature transformation.


## Objectives


After completing this lab, you will be able to:


- Identify and remove inconsistent data entries.

- Encode categorical variables for analysis.

- Handle missing values using different imputation strategies (median for numerical columns, most frequent value for categorical columns).

- Apply feature scaling and transformation techniques.


#### Install the required libraries


In [None]:
!pip install pandas
!pip install matplotlib

## Tasks


### 1. Load the Dataset


<h5>1.1 Import necessary libraries and load the dataset.</h5>


Ensure the dataset is loaded correctly by displaying the first few rows.


In [None]:
# Import necessary libraries
import pandas as pd

# Load the Stack Overflow survey data
dataset_url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/n01PQ9pSmiRX6520flujwQ/survey-data.csv"
df = pd.read_csv(dataset_url)

# Display the first few rows
print(df.head())


### 2. Explore the Dataset


<h5>2.1 Summarize the dataset by displaying the column data types, counts, and missing values.</h5>


In [None]:
# Write your code here

# Task 2.1: Summarize dataset

print("Dataset Information:")
print("=" * 60)
df.info()

print("\n" + "=" * 60)
print("Missing values per column:")
missing = df.isnull().sum()
print(missing[missing > 0].sort_values(ascending=False).head(10))

print(f"\nDataset shape: {df.shape}")

<h5>2.2 Generate basic statistics for numerical columns.</h5>


In [None]:
# Write your code here

# Task 2.2: Generate basic statistics

print("Basic Statistics for Numerical Columns:")
print(df.describe())

### 3. Identifying and Removing Inconsistencies


<h5>3.1 Identify inconsistent or irrelevant entries in specific columns (e.g., Country).</h5>


In [None]:
# Write your code here

# Task 3.1: Identify inconsistent entries

if 'Country' in df.columns:
    print("Unique countries:")
    print(f"Total unique countries: {df['Country'].nunique()}")
    print("\nTop 10 countries:")
    print(df['Country'].value_counts().head(10))
    
    # Check for potential inconsistencies
    print("\nChecking for potential inconsistencies...")
    countries = df['Country'].value_counts()
    print(f"Countries with less than 5 responses: {(countries < 5).sum()}")
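
Counting rare values hints at problems but does not show which entries are variants of each other. The sketch below is one possible way to surface them, using made-up sample values rather than the survey data: it normalizes whitespace and letter case, then counts how many raw spellings share each normalized key.

```python
import pandas as pd

# Hypothetical toy data; the real Country column may look different.
countries = pd.Series(["Germany", "germany ", "India", " India", "Brazil"])

# Normalize whitespace and case so that variant spellings collapse to one key.
normalized = countries.str.strip().str.casefold()

# Count the distinct raw spellings behind each normalized key.
spelling_counts = countries.groupby(normalized).nunique()

# Keys with more than one raw spelling are candidates for standardization.
suspicious = spelling_counts[spelling_counts > 1]
print(suspicious)
```

Any key listed in `suspicious` has at least two raw spellings, which makes it a candidate for the mapping step in Task 3.2.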

<h5>3.2 Standardize entries in columns like Country or EdLevel by mapping inconsistent values to a consistent format.</h5>


In [None]:
## Write your code here

# Task 3.2: Standardize entries

# Example: Standardize EdLevel
if 'EdLevel' in df.columns:
    print("Original EdLevel values:")
    print(df['EdLevel'].value_counts().head())
    
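    # Mapping for standardization (example).
    # Keys are double-quoted because they contain apostrophes; they must match
    # the values in the data exactly, including the apostrophe character.
    edlevel_mapping = {
        "Bachelor's degree (B.A., B.S., B.Eng., etc.)": 'Bachelor',
        "Master's degree (M.A., M.S., M.Eng., MBA, etc.)": 'Master',
        'Some college/university study without earning a degree': 'Some College'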
    }
    
    # Apply mapping where applicable
    # df['EdLevel_Standardized'] = df['EdLevel'].replace(edlevel_mapping)
    
    print("\nEdLevel standardization mapping created (example shown)")
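
The cell above builds the mapping but leaves the application step commented out. As a minimal sketch on hypothetical sample values (not read from the survey file), the following shows how `replace` and `map` differ when a value has no entry in the mapping.

```python
import pandas as pd

# Hypothetical sample values, not taken from the survey file.
ed = pd.Series([
    "Bachelor's degree (B.A., B.S., B.Eng., etc.)",
    "Master's degree (M.A., M.S., M.Eng., MBA, etc.)",
    "Primary/elementary school",
    None,
])

mapping = {
    "Bachelor's degree (B.A., B.S., B.Eng., etc.)": "Bachelor",
    "Master's degree (M.A., M.S., M.Eng., MBA, etc.)": "Master",
}

# replace() rewrites only the values found in the mapping and keeps the rest.
ed_replaced = ed.replace(mapping)

# map() turns every value without a mapping entry into a missing value.
ed_mapped = ed.map(mapping)

print(ed_replaced.tolist())
print(ed_mapped.tolist())
```

For a partial mapping like the one in this task, `replace` is usually the safer choice because unmapped categories survive unchanged.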

### 4. Encoding Categorical Variables


<h5>4.1 Encode the Employment column using one-hot encoding.</h5>


In [None]:
## Write your code here

# Task 4.1: Encode Employment column using one-hot encoding

if 'Employment' in df.columns:
    print("Original Employment values:")
    print(df['Employment'].value_counts())
    
    # One-hot encoding
    employment_encoded = pd.get_dummies(df['Employment'], prefix='Employment')
    
    print(f"\nOne-hot encoded columns created: {employment_encoded.shape[1]}")
    print("Sample encoded columns:")
    print(employment_encoded.columns.tolist()[:5])
    
    # Optionally concatenate with original dataframe
    # df = pd.concat([df, employment_encoded], axis=1)
    
    print("\nOne-hot encoding complete!")
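
`pd.get_dummies` treats each distinct cell value as one category. If a cell can hold several selections joined by a delimiter such as `;` (an assumption about the data that you should verify with `value_counts()`), every combination becomes its own column. A possible alternative, sketched here on hypothetical responses, is `Series.str.get_dummies`, which splits on the delimiter first.

```python
import pandas as pd

# Hypothetical responses; assumes multiple selections are joined with ';'.
employment = pd.Series([
    "Employed, full-time",
    "Student, full-time;Employed, part-time",
    "Employed, full-time;Student, full-time",
])

# One indicator column per individual option rather than per combination.
multi_hot = employment.str.get_dummies(sep=";")
print(multi_hot)
```

Each row can now have more than one indicator set to 1, so the result is a multi-hot encoding rather than a strict one-hot encoding.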

### 5. Handling Missing Values


<h5>5.1 Identify columns with the highest number of missing values.</h5>


In [None]:
## Write your code here

# Task 5.1: Identify columns with highest missing values

missing_values = df.isnull().sum()
missing_percent = (missing_values / len(df) * 100)

missing_summary = pd.DataFrame({
    'Column': missing_values.index,
    'Missing_Count': missing_values.values,
    'Missing_Percent': missing_percent.values
})

missing_summary = missing_summary[missing_summary['Missing_Count'] > 0]
missing_summary = missing_summary.sort_values('Missing_Count', ascending=False)

print("Top 10 columns with most missing values:")
print(missing_summary.head(10).to_string(index=False))

<h5>5.2 Impute missing values in numerical columns (e.g., `ConvertedCompYearly`) with the mean or median.</h5>


In [None]:
## Write your code here

# Task 5.2: Impute missing values in ConvertedCompYearly

if 'ConvertedCompYearly' in df.columns:
    missing_before = df['ConvertedCompYearly'].isnull().sum()
    
    # Use median for imputation (better for skewed salary data)
    median_comp = df['ConvertedCompYearly'].median()
    
    df['ConvertedCompYearly'] = df['ConvertedCompYearly'].fillna(median_comp)
    
    missing_after = df['ConvertedCompYearly'].isnull().sum()
    
    print(f"Missing before imputation: {missing_before}")
    print(f"Median value used: ${median_comp:,.2f}")
    print(f"Missing after imputation: {missing_after}")
    print(f"Values imputed: {missing_before - missing_after}")
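
To see why the median is often preferred for skewed data, consider this small sketch with invented compensation figures (not taken from the survey). A single very large value pulls the mean far above the typical value, while the median stays close to it.

```python
import pandas as pd

# Invented values with one extreme outlier and one missing entry.
comp = pd.Series([40_000, 50_000, 60_000, None, 1_000_000])

mean_filled = comp.fillna(comp.mean())      # mean is pulled up by the outlier
median_filled = comp.fillna(comp.median())  # median ignores how extreme it is

print(f"Mean used for imputation:   {comp.mean():,.0f}")
print(f"Median used for imputation: {comp.median():,.0f}")
```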

<h5>5.3 Impute missing values in categorical columns (e.g., `RemoteWork`) with the most frequent value.</h5>


In [None]:
## Write your code here

# Task 5.3: Impute missing values in RemoteWork with most frequent value

if 'RemoteWork' in df.columns:
    missing_before = df['RemoteWork'].isnull().sum()
    
    # Use mode (most frequent value)
    most_frequent = df['RemoteWork'].mode()[0]
    
    df['RemoteWork'] = df['RemoteWork'].fillna(most_frequent)
    
    missing_after = df['RemoteWork'].isnull().sum()
    
    print(f"Missing before imputation: {missing_before}")
    print(f"Most frequent value: '{most_frequent}'")
    print(f"Missing after imputation: {missing_after}")
    print(f"Values imputed: {missing_before - missing_after}")

### 6. Feature Scaling and Transformation


<h5>6.1 Apply Min-Max Scaling to normalize the `ConvertedCompYearly` column.</h5>


In [None]:
## Write your code here

# Task 6.1: Apply Min-Max Scaling to ConvertedCompYearly

if 'ConvertedCompYearly' in df.columns:
    # Min-Max Scaling
    min_val = df['ConvertedCompYearly'].min()
    max_val = df['ConvertedCompYearly'].max()
    
    df['ConvertedCompYearly_MinMax'] = (df['ConvertedCompYearly'] - min_val) / (max_val - min_val)
    
    print("Min-Max Scaling applied:")
    print(f"Original range: [${min_val:,.2f}, ${max_val:,.2f}]")
    print("Normalized range: [0, 1]")
    print("\nSample values:")
    print(df[['ConvertedCompYearly', 'ConvertedCompYearly_MinMax']].head())
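
The formula above divides by `max_val - min_val`, which is zero when every value in the column is identical. The helper below is a hypothetical sketch (the name `min_max_scale` is not part of the lab) that wraps the same formula and handles that edge case by returning zeros.

```python
import pandas as pd

def min_max_scale(s: pd.Series) -> pd.Series:
    """Scale a numeric Series to [0, 1]; a constant Series maps to all zeros."""
    span = s.max() - s.min()
    if span == 0:
        # Avoid dividing by zero when all values are the same.
        return pd.Series(0.0, index=s.index)
    return (s - s.min()) / span

scaled = min_max_scale(pd.Series([10.0, 20.0, 40.0]))
constant = min_max_scale(pd.Series([5.0, 5.0]))

print(scaled.tolist())
print(constant.tolist())
```

Mapping a constant column to zeros is a design choice; depending on the analysis, dropping such a column may be more appropriate.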

<h5>6.2 Log-transform the ConvertedCompYearly column to reduce skewness.</h5>


In [None]:
## Write your code here

# Task 6.2: Log-transform ConvertedCompYearly

import numpy as np

if 'ConvertedCompYearly' in df.columns:
    # Apply log transformation (add 1 to avoid log(0))
    df['ConvertedCompYearly_Log'] = np.log1p(df['ConvertedCompYearly'])
    
    print("Log transformation applied:")
    print("\nOriginal vs Log-transformed:")
    print(df[['ConvertedCompYearly', 'ConvertedCompYearly_Log']].describe())
    
    # Check skewness reduction
    original_skew = df['ConvertedCompYearly'].skew()
    log_skew = df['ConvertedCompYearly_Log'].skew()
    
    print("\nSkewness comparison:")
    print(f"Original skewness: {original_skew:.4f}")
    print(f"Log-transformed skewness: {log_skew:.4f}")
    # Compare magnitudes so that a sign change is not reported as a reduction
    print(f"Reduction in absolute skewness: {abs(original_skew) - abs(log_skew):.4f}")
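
`np.log1p(x)` computes `log(1 + x)`, and `np.expm1` is its inverse. The following sketch uses invented values to illustrate that the transformation can be undone when results need to be reported in the original units.

```python
import numpy as np
import pandas as pd

# Invented values, including a zero to show that log1p handles it.
comp = pd.Series([0.0, 1_000.0, 50_000.0, 2_000_000.0])

logged = np.log1p(comp)      # log(1 + x); maps 0 to 0
restored = np.expm1(logged)  # exp(y) - 1; inverse of log1p

print(logged.round(3).tolist())
print(restored.round(2).tolist())
```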

### 7. Feature Engineering


<h5>7.1 Create a new column `ExperienceLevel` based on the `YearsCodePro` column:</h5>


In [None]:
## Write your code here

# Task 7.1: Create ExperienceLevel based on YearsCodePro

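if 'YearsCodePro' in df.columns:
    # YearsCodePro may be stored as text (e.g. 'Less than 1 year'), which cannot
    # be compared with numbers. Convert it first; the two text labels below are
    # assumed encodings, and any other non-numeric value becomes NaN.
    years_numeric = pd.to_numeric(
        df['YearsCodePro'].replace({'Less than 1 year': '0', 'More than 50 years': '51'}),
        errors='coerce'
    )

    def categorize_experience(years):
        if pd.isna(years):
            return 'Unknown'
        elif years < 2:
            return 'Junior'
        elif years < 5:
            return 'Mid-Level'
        elif years < 10:
            return 'Senior'
        else:
            return 'Expert'
    
    df['ExperienceLevel'] = years_numeric.apply(categorize_experience)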
    
    print("ExperienceLevel created!")
    print("\nDistribution:")
    print(df['ExperienceLevel'].value_counts())
    
    print("\nSample data:")
    print(df[['YearsCodePro', 'ExperienceLevel']].head(10))
else:
    print("YearsCodePro column not found")
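
As a possible alternative to a row-by-row function, `pd.cut` can bin numeric values in one vectorized call. This sketch assumes the years are already numeric and reuses the same thresholds (2, 5, and 10 years) on invented sample values.

```python
import pandas as pd

# Invented sample values; None stands for a missing answer.
years = pd.Series([0, 1.5, 2, 4, 5, 9, 10, 30, None])

# right=False makes each bin include its left edge:
# [0, 2), [2, 5), [5, 10), [10, inf)
levels = pd.cut(
    years,
    bins=[0, 2, 5, 10, float("inf")],
    labels=["Junior", "Mid-Level", "Senior", "Expert"],
    right=False,
)

# pd.cut leaves missing values as NaN; label them explicitly.
levels = levels.cat.add_categories("Unknown").fillna("Unknown")
print(levels.tolist())
```

The result is an ordered categorical column, which keeps the levels in a meaningful order when sorting or plotting.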

### Summary


In this lab, you:

- Explored the dataset to identify inconsistencies and missing values.

- Encoded categorical variables for analysis.

- Handled missing values using imputation techniques.

- Normalized and transformed numerical data to prepare it for analysis.

- Engineered a new feature to enhance data interpretation.


Copyright © IBM Corporation. All rights reserved.
