# Ungraded Lab: Data Import Lab

## Overview 
Your team at EngageMetrics, a leading employee engagement analytics company, has just received datasets from multiple sources that need to be consolidated for an urgent executive presentation. The data includes employee feedback from regional offices in CSV format, educational achievement records in Excel spreadsheets, and compensation benchmarks from an external API. The leadership team needs insights on employee engagement trends by tomorrow morning.
<br>
In this hands-on lab, you'll learn how to efficiently import and process this scattered data using Python. You'll work with real EngageMetrics datasets containing employee insights, educational records, and external market data, mastering the essential data import techniques needed to handle various file formats and common data challenges.

<b>Pro tip:</b> Keep the lesson screencast handy while working through this lab. Even experienced developers regularly refer back to documentation and examples.

## Learning Outcomes 

By the end of this lab, you will be able to:
- Import data from CSV files using pandas
- Load data from Excel files with proper formatting
- Connect to and retrieve data from a REST API
- Handle common data import challenges (missing values, inconsistent formats)

## Dataset Information 
You'll work with two different EngageMetrics data sources and an external API from Wikipedia:

1. <b>employee_insights.csv:</b> Employee performance and demographic data
2. <b>education_data.xlsx:</b> Educational background information
3. <b>Wikipedia REST API:</b> Income statistics in the United States

## Activities

### Activity 1: Regional Feedback Data Import

Start by importing the quarterly employee engagement feedback data, collected from all regions, from a CSV file.

<b>Step 1:</b> Import the required libraries first:

In [1]:
# Import required libraries
import pandas as pd
import requests
import json
from datetime import datetime

<b>Step 2:</b> Import and examine the CSV data:

In [None]:
# Import the employee insights data
employee_df = pd.read_csv('employee_insights.csv')

# Display the first few rows
# YOUR CODE HERE  

<b>Tip:</b> Check the data types of your columns after import to confirm that each one matches the kind of data it holds (for example, numeric columns should not be read as text).
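
The tip above can be sketched on a few made-up rows. The sample values imitate the shapes found in the lab's data (such as `$64761`), but the rows themselves and the `salary_num` column name are assumptions for illustration only.

```python
import io

import pandas as pd

# Hypothetical sample rows; the real file is employee_insights.csv.
csv_text = """employee_id,age,salary,hours_worked_weekly
E0001,54,,
E0002,,$64761,53.3
E0005,29,$61486,53.3
"""

sample_df = pd.read_csv(io.StringIO(csv_text))

# Inspect the inferred type of every column.
print(sample_df.dtypes)

# 'salary' is read as text because of the '$' prefix, so strip it and convert.
sample_df["salary_num"] = pd.to_numeric(
    sample_df["salary"].str.replace("$", "", regex=False),
    errors="coerce",  # Values that still cannot be parsed become NaN.
)
print(sample_df[["salary", "salary_num"]])
```

Keeping the cleaned values in a separate column leaves the raw import untouched, which makes it easier to document what was changed.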

<b>Step 3:</b> Handle data quality issues:
- Examine missing values
- Check for inconsistent date formats
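
As a rough sketch of what these two checks can look like, the example below uses a small made-up DataFrame. The target format `%Y-%m-%d` is an assumption (it is the format the reference solution tests against), and the sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical sample; the real columns come from employee_insights.csv.
sample_df = pd.DataFrame({
    "employee_id": ["E0001", "E0002", "E0003", "E0004"],
    "last_training_date": ["15/08/2023", "2023-08-15", None, "15/08/2023"],
})

# Count missing values per column.
missing_counts = sample_df.isnull().sum()
print(missing_counts)

# Parse against the assumed target format; values that do not match become NaT.
parsed = pd.to_datetime(
    sample_df["last_training_date"], format="%Y-%m-%d", errors="coerce"
)

# A value is inconsistent if it is present but failed to parse.
inconsistent = sample_df["last_training_date"].notna() & parsed.isna()
print(sample_df.loc[inconsistent, "last_training_date"])
```

Separating "missing" from "present but badly formatted" keeps the two problems from being counted as one.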

In [None]:
# Check for missing values
# YOUR CODE HERE
    
# Identify inconsistent date formats 
# YOUR CODE HERE

### Activity 2: Educational Records Import
Next, process the talent development team's educational background data, which is stored in an Excel file.

<b>Step 1:</b> Load the Excel file:

In [None]:
# Import the education data
# YOUR CODE HERE 

# Display the first few rows
# YOUR CODE HERE 

<b>Step 2:</b> Data validation:
- Verify graduation years are in the correct format
- Check for any missing educational backgrounds

In [None]:
# Verify whether graduation years are in the correct format
# YOUR CODE HERE 
  
# Check if there are any missing educational backgrounds
# YOUR CODE HERE

<b>Tip:</b> Even when the data looks clean and correctly formatted, verify it before proceeding.
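
One possible way to run these validations without writing a row-by-row function is sketched below. The sample rows are made up, and the plausible range (1900 up to the current year) is an assumption borrowed from the reference solution.

```python
from datetime import datetime

import pandas as pd

# Hypothetical sample; column names follow education_data.xlsx.
sample_df = pd.DataFrame({
    "ID": ["E0001", "E0002", "E0003", "E0004"],
    "graduation_year": [2011, "1995", "unknown", 3020],
    "educational_background": ["Psychology", None, "Medicine", "Architecture"],
})

# Convert to numbers; anything non-numeric becomes NaN.
years = pd.to_numeric(sample_df["graduation_year"], errors="coerce")

# Assumed plausible range: 1900 up to the current year (NaN counts as invalid).
valid_year = years.between(1900, datetime.now().year)
print(sample_df.loc[~valid_year, ["ID", "graduation_year"]])

# Rows with no educational background recorded.
missing_background = sample_df["educational_background"].isna()
print(sample_df.loc[missing_background, "ID"])
```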

### Activity 3: Compensation Benchmark Data Import
Finally, retrieve industry compensation data for comparison from the Wikipedia REST API.
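
Before sending a real request, it can help to see how the endpoint URL is put together and how a JSON body is read. The sketch below makes no network call: `summary_url` is a hypothetical helper, and `raw` is a trimmed, made-up stand-in for the response body.

```python
import json
from urllib.parse import quote

BASE_URL = "https://en.wikipedia.org/api/rest_v1/page/summary/"


def summary_url(title):
    """Build the page-summary URL for an article title (hypothetical helper)."""
    # Wikipedia titles use underscores instead of spaces; quote() escapes the rest.
    return BASE_URL + quote(title.replace(" ", "_"), safe="")


print(summary_url("Income in the United States"))

# Trimmed, made-up stand-in for the JSON text the API returns.
raw = (
    '{"title": "Income in the United States", '
    '"extract": "Income in the United States is measured by ..."}'
)

# Parse the JSON text into a dictionary and read one field with a fallback.
data = json.loads(raw)
summary = data.get("extract", "Summary not available")
print(summary)
```

Using `.get()` with a default avoids a `KeyError` when a field is absent from the response.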

<b>Step 1:</b> Make an API request: 

In [None]:
# Make the API request
url = "https://en.wikipedia.org/api/rest_v1/page/summary/Income_in_the_United_States"
# YOUR CODE HERE 

# Parse the JSON response
# YOUR CODE HERE 

<b>Step 2:</b> Extract the relevant information by getting the article summary:

In [None]:
# Extract and display the article summary
# YOUR CODE HERE

#### Test Your Work:
1. Verify that all three datasets are loaded successfully
2. Confirm the data types are correct in each dataset
3. Check that you can access specific columns and rows
4. Ensure the API response contains the expected fields
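
These checks can be turned into small reusable helpers. The sketch below is only one way to do it: `check_dataframe` and `check_api_payload` are hypothetical names, the sample objects are stand-ins for your loaded data, and the expected fields `title` and `extract` are assumed to be the ones you care about.

```python
import pandas as pd


def check_dataframe(df, required_columns):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    if df.empty:
        problems.append("dataset has no rows")
    missing = [col for col in required_columns if col not in df.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
    return problems


def check_api_payload(payload, expected_fields=("title", "extract")):
    """Return the expected fields that are absent from the API response."""
    return [field for field in expected_fields if field not in payload]


# Hypothetical stand-ins for the loaded objects.
employee_sample = pd.DataFrame({"employee_id": ["E0001"], "age": [54.0]})
payload_sample = {"title": "Income in the United States", "extract": "..."}

print(check_dataframe(employee_sample, ["employee_id", "salary"]))
print(check_api_payload(payload_sample))

# Accessing a specific row and column.
print(employee_sample.loc[0, "employee_id"])
```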

## Success Checklist
- Imported regional feedback data successfully
- Processed educational achievement records
- Retrieved current compensation benchmarks
- Handled missing survey responses appropriately
- Verified data quality across all sources
- Documented data anomalies for the analytics team

## Common Issues & Solutions 
- Problem: Regional feedback data shows inconsistent scoring
    - Solution: Apply standardization across different regional formats
- Problem: Training records show duplicate entries
    - Solution: Implement deduplication logic while preserving the most recent records
- Problem: Benchmark API returns incomplete data
    - Solution: Implement proper error handling and data validation checks
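
The first two solutions above might look roughly like the sketch below. The rows are made up, the label mapping is an assumption about which variants mean the same thing, and `last_training_date` is used here only as an example of a "most recent" ordering key.

```python
import pandas as pd

# Hypothetical sample with the kinds of inconsistencies seen in 'work_mode'.
sample_df = pd.DataFrame({
    "employee_id": ["E0001", "E0002", "E0002", "E0003"],
    "work_mode": ["remote work", "HYBRID", "Hybrid", "Remote"],
    "last_training_date": ["2023-08-15", "2023-01-10", "2023-08-15", "2023-08-15"],
})

# Standardize: trim, lowercase, then map known variants to one label.
sample_df["work_mode"] = (
    sample_df["work_mode"]
    .str.strip()
    .str.lower()
    .replace({"remote work": "remote"})
)

# Deduplicate: sort by date so the most recent record per employee is kept.
sample_df["last_training_date"] = pd.to_datetime(sample_df["last_training_date"])
deduped = (
    sample_df.sort_values("last_training_date")
    .drop_duplicates(subset="employee_id", keep="last")
    .sort_values("employee_id")
    .reset_index(drop=True)
)
print(deduped)
```

Sorting before `drop_duplicates(keep="last")` is what makes "most recent" well defined; without the sort, the row kept would depend on file order.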

## Summary 
Congratulations! You've successfully learned to import data from multiple sources (CSV, Excel, and API) using Python, and tackled real-world data quality challenges like handling missing values and inconsistent formats. Through this lab, you've gained practical experience with essential data import techniques that data scientists use daily, setting a strong foundation for more advanced data analysis tasks.

### Key Points
- Different data sources require different import approaches
- Always validate data after import
- Handle missing or inconsistent data appropriately
- Document any data quality issues discovered

## Solution Code
Stuck on your code or want to check your solution? Here's a complete reference implementation to guide you. It is just one effective approach, so try solving each activity independently first, then use this code to get past obstacles or compare techniques. Happy coding!

### Activity 1: Regional Feedback Data Import - Solution Code

In [2]:
# Step 2: Import and examine the CSV data
employee_df = pd.read_csv('employee_insights.csv')

# Display the first few rows
print("Employee Data : ")
print(employee_df.head())

# Step 3: Handle data quality issues
# Check for missing values
print(employee_df.isnull().sum())

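# Check for inconsistent date formats (expected format: YYYY-MM-DD)
def is_date(value):
    # Missing values are already counted by isnull() above, so skip them here
    if pd.isnull(value):
        return True
    try:
        datetime.strptime(str(value), '%Y-%m-%d')
        return True
    except ValueError:
        return False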

date_columns = ['last_training_date', 'last_promotion_date']
for col in date_columns:
    inconsistent_dates = employee_df[~employee_df[col].apply(is_date)]
    print(f"Inconsistent dates in {col}:")
    print(inconsistent_dates[col])

Employee Data : 
  employee_id   age  salary promotion_eligible last_training_date department  \
0       E0001  54.0     NaN                NaN         15/08/2023         HR   
1       E0002   NaN  $64761                  N         15/08/2023        NaN   
2       E0003  54.0     NaN                  N         15/08/2023  Marketing   
3       E0004   NaN     NaN                 No                NaN        NaN   
4       E0005  29.0  $61486                  Y         15/08/2023        NaN   

  work_experience  projects_completed  hours_worked_weekly    work_mode  \
0             NaN                14.0                  NaN  remote work   
1         1 years                 NaN                 53.3       HYBRID   
2               8                 6.0                 32.6       Hybrid   
3              16                 1.0                 37.8       Remote   
4             NaN                 1.0                 53.3       Hybrid   

  last_promotion_date  satisfaction_score  overtime

### Activity 2: Educational Records Import - Solution Code

In [3]:
# Step 1: Load the Excel file
education_data = pd.read_excel('education_data.xlsx')

# Display the first few rows
print("Education Data : ")
print(education_data.head())

# Step 2: Data validation
# Verify graduation years are in the correct format
def is_valid_year(year):
    try:
        year = int(year)
        return 1900 <= year <= datetime.now().year
    except (TypeError, ValueError):
        # int() raises TypeError for None and ValueError for NaN or non-numeric text
        return False

invalid_years = education_data[~education_data['graduation_year'].apply(is_valid_year)]
if invalid_years.empty:
    print("All graduation years are well formatted.")
else:
    print("Invalid graduation years:")
    print(invalid_years['graduation_year'])

# Check if there are any missing educational backgrounds
missing_education = education_data[education_data['educational_background'].isnull()]
if missing_education.empty:
    print("No missing educational backgrounds.")
else:
    print("Missing educational backgrounds:")
    print(missing_education)

Education Data : 
      ID  graduation_year   educational_background
0  E0001             2011               Psychology
1  E0002             1995             Architecture
2  E0003             2007  Business Administration
3  E0004             2000  Business Administration
4  E0005             1991                 Medicine
All graduation years are well formatted.
No missing educational backgrounds.


### Activity 3: Compensation Benchmark Data Import - Solution Code

In [4]:
# Step 1: Make an API request
url = "https://en.wikipedia.org/api/rest_v1/page/summary/Income_in_the_United_States"
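response = requests.get(url, timeout=10)  # Avoid hanging indefinitely on a slow connection
response.raise_for_status()  # Raise an error for HTTP 4xx/5xx responses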

# Parse the JSON response
data = response.json()

# Step 2: Extract and display the article summary
summary = data.get('extract', 'Summary not available')
print("Article summary:")
print(summary)

Article summary:
Income in the United States is measured by the various federal agencies including the Internal Revenue Service, Bureau of Labor Statistics, US Department of Commerce, and the US Census Bureau. Additionally, various agencies, including the Congressional Budget Office compile reports on income statistics. The primary classifications are by household or individual. The top quintile in personal income in 2022 was $117,162. The differences between household and personal income are considerable, since 61% of households now have two or more income earners. Median personal income in 2020 was $56,287 for full-time workers.

