# In-Class Assignment: Labor Market Data Cleaning

**Scenario:** You are a Research Assistant for a labor economist studying the demand for data science skills in the economics job market. 

Your Principal Investigator (PI) has scraped a mess of raw text from a job board. The data contains salary information, required skills, and contact emails, but it is currently trapped in unstructured strings.

**Your Goal:** Transform this raw text into a structured Pandas DataFrame with clean variables (a numeric salary, skill indicators such as `has_python` and `has_stata`, and `email`) that could be used in a regression analysis.

**Skills Tested:**
* Basic String Manipulation (`strip`, `lower`)
* Regular Expressions (`re.findall`, capture groups)
* Pandas String Methods (`.str` accessor)

In [None]:
import pandas as pd
import re
import numpy as np

### 1. The Raw Data

Run the cell below to load the raw data into a Pandas DataFrame with a single column, `raw_text`. Note how inconsistent the formatting is.

In [None]:
raw_job_postings = [
    "  Senior Economist at TechCorp. Salary: $150,000. Skills: Python, R, Stata. Contact: jobs@techcorp.com ",
    "Research Assistant - PolicyInst.  pay is 55000 usd.  Needs Stata and Excel.  email:   hr@policyinst.org  ",
    "Data Scientist (Remote). $120k. Proficient in Python and SQL. inquiries to recruiting@startup.io",
    " Visiting Professor. 90,000.  Teaching Stata. contact:  dean@university.edu ",
    "JUNIOR ANALYST | NYC | 85,000 | Contact: careers@bank.net | Must know Python"
]

df = pd.DataFrame(raw_job_postings, columns=['raw_text'])
df

### Task 1: Basic Hygiene

The data has inconsistent capitalization and stray leading and trailing whitespace.

**Instruction:** Create a new column called `clean_text`. Use Pandas string methods to:
1.  Convert the text to lowercase.
2.  Strip leading and trailing whitespace.

*Hint: Remember the `.str` accessor.*
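
As a rough illustration of how `.str` methods chain, here is a sketch on a made-up Series. The name `toy` and its contents are hypothetical and are not part of the assignment data, and the methods shown (`upper`, `rstrip`) are deliberately analogous to, not the same as, the ones the task asks for.

```python
import pandas as pd

# Hypothetical toy data, not the assignment's postings
toy = pd.Series(["  Hello World ", "PANDAS  "])

# Each .str method returns a new Series, so calls can be chained
tidy = toy.str.upper().str.rstrip()
print(tidy.tolist())
```

Because each call returns a new Series, the original `toy` is left unchanged; the result has to be assigned somewhere to be kept.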

In [None]:
# Your code here

df['clean_text'] = 
df.head()

### Task 2: Generating Dummy Variables (Skills Extraction)

We want to know which jobs require Python versus Stata. 

**Instruction:** Create two new boolean (True/False) or binary (1/0) columns: `has_python` and `has_stata`.

* Set the value to `True` if the `clean_text` contains the word "python" (or "stata").
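
One possible way to build such an indicator is `Series.str.contains`. The sketch below uses a hypothetical Series `toy` and a made-up skill ("sql") rather than the assignment columns.

```python
import pandas as pd

# Hypothetical toy data, not the assignment's postings
toy = pd.Series(["needs excel and sql", "sql preferred", "teaching only"])

# str.contains returns a boolean Series; regex=False treats the pattern
# as a literal substring rather than a regular expression
has_sql = toy.str.contains("sql", regex=False)

# astype(int) converts True/False to 1/0 if a binary dummy is preferred
has_sql_binary = has_sql.astype(int)

print(has_sql.tolist())
print(has_sql_binary.tolist())
```

Note that a plain substring test also matches inside longer words, which matters for short skill names such as "r" but is unlikely to be a problem for "python" or "stata".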



In [None]:
# Your code here

df['has_python'] = 
df['has_stata'] = 

df[['clean_text', 'has_python', 'has_stata']]

### Task 3: Extracting Contact Info (Regex Groups)

We need to email these companies. The email addresses are embedded in the text. 

**Instruction:** Use `df['clean_text'].str.extract()` with a Regular Expression to create a new column called `email`.

*Hint: Use the regex pattern we discussed in class that captures non-whitespace characters surrounding an `@` symbol: `r'(\S+@\S+)'`*
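
To see how `str.extract` behaves with this pattern, here is a small sketch on invented strings (the Series `toy` and the addresses in it are hypothetical). It assumes you want a Series back, which is what `expand=False` gives when the pattern has a single capture group.

```python
import pandas as pd

# Hypothetical toy data, not the assignment's postings
toy = pd.Series([
    "write to alice@example.com today",
    "mail bob@example.org.",
    "no address here",
])

# One capture group + expand=False returns a Series;
# rows with no match become NaN
found = toy.str.extract(r'(\S+@\S+)', expand=False)
print(found.tolist())
```

The second row shows a limitation of `\S+`: it also swallows punctuation that touches the address, so `bob@example.org.` comes back with its trailing period. Check your extracted emails for this kind of residue.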

In [None]:
# Your code here

df['email'] = 
df[['clean_text', 'email']]

### Task 4: The Tricky Part - Salary Extraction

Extracting numbers from text is a core task in constructing economic datasets. The salaries appear in different formats: `$150,000`, `55000`, `$120k`, `90,000`.

**Instruction:** Write a regex pattern to extract the numeric salary information into a column called `salary_str`.

Your regex should be robust enough to capture:
1.  Digits (e.g., `55000`)
2.  Digits with commas (e.g., `150,000`)
3.  Digits ending in 'k' (e.g., `120k`)

*Hint: A pattern like `r'([\d,]+k?)'` might be a good starting point, but you may need to experiment. Test your regex against the specific variations in the dataframe.*
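
One way to experiment is to run candidate patterns through `re.findall` on a few test strings before applying them to the DataFrame. The sketch below uses invented strings (`samples`) and compares the hinted pattern with a slightly stricter variant (`strict`), which is only one possible refinement and not necessarily the intended answer.

```python
import re

# Hypothetical test strings, not the assignment's postings
samples = [
    "pay: $75,000 per year",
    "offers 60k, plus bonus",
    "python, r, 40000",
]

loose = r'([\d,]+k?)'     # the pattern from the hint
strict = r'(\d[\d,]*k?)'  # assumption: require the match to start with a digit

for s in samples:
    print(repr(s))
    print("  loose :", re.findall(loose, s))
    print("  strict:", re.findall(strict, s))
```

Because `[\d,]+` accepts a comma on its own, the loose pattern also matches the commas in a skill list. Since `str.extract` keeps only the first match, a comma that appears before the salary would be returned instead of the number. The stricter variant avoids that case, though it can still pick up a comma that directly follows the digits.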

In [None]:
# Your code here

# Construct your regex
salary_pattern = r"" 

# Apply extraction
df['salary_str'] = 

df[['clean_text', 'salary_str']]

### Task 5: Data Cleaning & Analysis

You now have a `salary_str` column, but its dtype is still `object` (text), and we can't take the mean of text.

**Instruction:**
1.  Replace 'k' with '000' in `salary_str`.
2.  Remove commas in `salary_str`.
3.  Convert the column to a numeric type (float or int).
4.  Calculate the average salary for jobs that require Python vs. those that don't.
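
The four steps above can be sketched on a toy DataFrame. All names here (`toy`, `amount_str`, `amount`, `has_sql`) and all values are hypothetical; the sketch assumes every string contains only digits, commas, and an optional trailing 'k'.

```python
import pandas as pd

# Hypothetical toy data, not the assignment's postings
toy = pd.DataFrame({
    "amount_str": ["70k", "50,000", "90,000", "40k"],
    "has_sql": [True, False, True, False],
})

# regex=False makes str.replace treat the pattern as a literal string
cleaned = (
    toy["amount_str"]
    .str.replace("k", "000", regex=False)
    .str.replace(",", "", regex=False)
)

# errors="coerce" turns anything unparseable into NaN instead of raising
toy["amount"] = pd.to_numeric(cleaned, errors="coerce")

# Mean of the numeric column within each True/False group
group_means = toy.groupby("has_sql")["amount"].mean()
print(group_means)
```

With `errors="coerce"`, a failed extraction shows up as `NaN` rather than an exception, so it is worth checking for missing values before interpreting the group means.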

In [None]:
# Your code here

# 1. Replace 'k' (Hint: str.replace)

# 2. Remove ',' 

# 3. To numeric (Hint: pd.to_numeric)

# 4. Groupby or filtering to find means
# python_salary = ...
# no_python_salary = ...
