# 🧵 Chapter 03: Exploring Categorical Data and Unstructured Text

## 📦 PostgreSQL Character Data Types

<div align="left">

<table style="text-align: left;">
  <tr>
    <th>Data Type</th>
    <th>Description</th>
  </tr>
  <tr>
    <td>CHAR(n) / CHARACTER(n)</td>
    <td>Fixed-length string of <code>n</code> characters; shorter values are padded with trailing spaces.</td>
  </tr>
  <tr>
    <td>VARCHAR(n) / CHARACTER VARYING(n)</td>
    <td>Variable-length string of up to <code>n</code> characters.</td>
  </tr>
  <tr>
    <td>TEXT</td>
    <td>Variable-length string with no declared length limit.</td>
  </tr>
</table>

</div>

🔹 **Trailing spaces are ignored in comparisons for `CHAR`**  
🔹 `TEXT` is practically identical to `VARCHAR` with no length restriction.
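
The padding behaviour is easiest to see with explicit casts. This is a small sketch, assuming a default (deterministic) collation; the column aliases are made up for illustration.

```sql
-- CHAR(n): both sides are padded to 5 characters, and trailing spaces
-- are ignored in the comparison, so the values are equal.
SELECT 'abc'::char(5) = 'abc  '::char(5) AS char_equal;          -- true

-- VARCHAR(n) and TEXT treat trailing spaces as significant characters.
SELECT 'abc'::varchar(5) = 'abc  '::varchar(5) AS varchar_equal; -- false
SELECT 'abc'::text = 'abc  '::text AS text_equal;                -- false
```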

## 🧮 Categorical vs Unstructured Text

**Categorical Text**: Values that belong to defined sets  
Examples:  

* Days: `Mon`, `Tues`, `Tuesday`, `TH`
* Products: `shirts`, `shoes`
* Colors: `red`, `blue`
* Satisfaction: `satisfied`, `unsatisfied`

**Unstructured Text**: Free-form, long, and full of flavor  
Examples:  

* *"I use it every day. It's my favorite color."*
* *"Four score and seven years ago..."*
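
One rough way to tell the two kinds of column apart is to compare the number of distinct values with the row count and look at value lengths. The sketch below uses an inline sample with a hypothetical column name `val`; it is a heuristic, not a rule.

```sql
-- Few distinct values and short strings suggest categorical data;
-- nearly-unique, long strings suggest unstructured text.
SELECT count(*)            AS total_rows,
       count(DISTINCT val) AS distinct_values,
       max(length(val))    AS longest_value
FROM (VALUES ('red'), ('blue'), ('red'), ('red')) AS t(val);
-- total_rows = 4, distinct_values = 2, longest_value = 4
```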

## 📊 Grouping and Counting Categories

```sql
SELECT category, count(*)
FROM product
GROUP BY category;
```

<div align="left">

<table style="text-align: left;">
  <tr>
    <th>category</th>
    <th>count</th>
  </tr>
  <tr>
    <td>Banana</td>
    <td>1</td>
  </tr>
  <tr>
    <td>Apple</td>
    <td>4</td>
  </tr>
  <tr>
    <td>apple</td>
    <td>2</td>
  </tr>
  <tr>
    <td><code>' apple'</code></td>
    <td>1</td>
  </tr>
  <tr>
    <td>banana</td>
    <td>3</td>
  </tr>
</table>

</div>

🧠 **Watch out**: `Apple` ≠ `apple`, and `' apple'` (with a leading space) ≠ `'apple'`, so each variant gets its own group and count.
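
Normalizing inside the query is one way to merge these variants. The sketch below rebuilds the sample counts in a CTE named `product` so it runs on its own; the aliases `clean_category` and `n` are illustrative.

```sql
-- Inline copy of the sample data: 1 Banana, 4 Apple, 2 apple,
-- 1 ' apple' (leading space), 3 banana.
WITH product(category) AS (
    VALUES ('Banana'),
           ('Apple'), ('Apple'), ('Apple'), ('Apple'),
           ('apple'), ('apple'),
           (' apple'),
           ('banana'), ('banana'), ('banana')
)
SELECT lower(trim(category)) AS clean_category,  -- strip spaces, fold case
       count(*)              AS n
FROM product
GROUP BY clean_category
ORDER BY n DESC;
-- apple 7, banana 4
```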

## 🔃 Ordering Results

### 1. By Count (Descending)

```sql
SELECT category, count(*)
FROM product
GROUP BY category
ORDER BY count DESC;
```

<div align="left">

<table style="text-align: left;">
  <tr>
    <th>category</th>
    <th>count</th>
  </tr>
  <tr>
    <td>Apple</td>
    <td>4</td>
  </tr>
  <tr>
    <td>banana</td>
    <td>3</td>
  </tr>
  <tr>
    <td>apple</td>
    <td>2</td>
  </tr>
  <tr>
    <td>Banana</td>
    <td>1</td>
  </tr>
  <tr>
    <td><code>' apple'</code></td>
    <td>1</td>
  </tr>
</table>

</div>

### 2. Alphabetically by Category

```sql
SELECT category, count(*)
FROM product
GROUP BY category
ORDER BY category;
```

<div align="left">

<table style="text-align: left;">
  <tr>
    <th>category</th>
    <th>count</th>
  </tr>
  <tr>
    <td><code>' apple'</code></td>
    <td>1</td>
  </tr>
  <tr>
    <td>Apple</td>
    <td>4</td>
  </tr>
  <tr>
    <td>Banana</td>
    <td>1</td>
  </tr>
  <tr>
    <td>apple</td>
    <td>2</td>
  </tr>
  <tr>
    <td>banana</td>
    <td>3</td>
  </tr>
</table>

</div>

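📚 **ASCII Sort Order** (byte order, as used by the `C` collation): `' '` < `'A'` < `'B'` < `'a'` < `'b'`. Locale-aware collations may order mixed-case values differently.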
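
Whether a database sorts this way depends on its collation. As a sketch, the byte order can be requested explicitly with `COLLATE "C"`; the inline values mirror the categories above.

```sql
-- Force byte-order sorting regardless of the database's default collation.
SELECT category
FROM (VALUES ('banana'), ('Apple'), (' apple'), ('apple'), ('Banana')) AS t(category)
ORDER BY category COLLATE "C";
-- ' apple', 'Apple', 'Banana', 'apple', 'banana'
```

Sorting on `lower(trim(category))` instead is one option when a case-insensitive order is wanted.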

## ❗ Common Text Issues

<div align="left">

<table style="text-align: left;">
  <tr>
    <th>Issue</th>
    <th>Example</th>
    <th>Interpretation</th>
  </tr>
  <tr>
    <td>Case Sensitivity</td>
    <td>'apple' ≠ 'Apple'</td>
    <td>Different values</td>
  </tr>
  <tr>
    <td>Leading/Trailing Spaces</td>
    <td>' apple' ≠ 'apple'</td>
    <td>Spaces count!</td>
  </tr>
  <tr>
    <td>Empty ≠ NULL</td>
    <td>'' ≠ NULL</td>
    <td>An empty string is a value; NULL is a missing value</td>
  </tr>
  <tr>
    <td>Punctuation</td>
    <td>'to-do' ≠ 'to–do'</td>
    <td>Hyphen and en dash are different characters</td>
  </tr>
</table>

</div>
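
A quick audit query can show how often these issues occur in a column. This is a sketch over an inline sample; the column name `val` and the counter aliases are hypothetical.

```sql
-- Count each kind of problem separately with FILTER.
SELECT count(*) FILTER (WHERE val IS NULL)       AS null_values,
       count(*) FILTER (WHERE val = '')          AS empty_strings,
       count(*) FILTER (WHERE val <> trim(val))  AS padded_values,
       count(*) FILTER (WHERE val <> lower(val)) AS values_with_uppercase
FROM (VALUES ('apple'), (' apple'), ('Apple'), (''), (NULL)) AS t(val);
-- every counter is 1 for this sample
```

Comparisons against `NULL` yield `NULL` rather than true, which is why the `NULL` row is only picked up by the `IS NULL` filter.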

## 🔠 Case Conversion Functions

```sql
SELECT lower('aBc DeFg 7-'); -- 'abc defg 7-'
SELECT upper('aBc DeFg 7-'); -- 'ABC DEFG 7-'
```

## 🔎 Case-Insensitive Comparisons

```sql
SELECT * FROM fruit;
```

<div align="left">

<table style="text-align: left;">
  <tr>
    <th>customer</th>
    <th>fav_fruit</th>
  </tr>
  <tr>
    <td>349</td>
    <td>apple</td>
  </tr>
  <tr>
    <td>874</td>
    <td>Apple</td>
  </tr>
  <tr>
    <td>703</td>
    <td>apple</td>
  </tr>
  <tr>
    <td>...</td>
    <td>...</td>
  </tr>
</table>

</div>

```sql
SELECT * FROM fruit
WHERE lower(fav_fruit) = 'apple';
```

<div align="left">

<table style="text-align: left;">
  <tr>
    <th>customer</th>
    <th>fav_fruit</th>
  </tr>
  <tr>
    <td>349</td>
    <td>apple</td>
  </tr>
  <tr>
    <td>874</td>
    <td>Apple</td>
  </tr>
  <tr>
    <td>313</td>
    <td>apple</td>
  </tr>
  <tr>
    <td>418</td>
    <td>apple</td>
  </tr>
  <tr>
    <td>300</td>
    <td>APPLE</td>
  </tr>
</table>

</div>

🧠 **Note**: customer 703 is missing from this result even though its value displays as `apple`. The `LIKE '%apple%'` search in the next section does return it, so the stored value likely contains extra characters, such as a leading or trailing space, that `lower()` alone does not remove.

## 🔍 Searching with LIKE vs ILIKE

```sql
-- Case-sensitive LIKE
SELECT * FROM fruit
WHERE fav_fruit LIKE '%apple%';
```

<div align="left">

<table style="text-align: left;">
  <tr>
    <th>customer</th>
    <th>fav_fruit</th>
  </tr>
  <tr>
    <td>349</td>
    <td>apple</td>
  </tr>
  <tr>
    <td>703</td>
    <td>apple</td>
  </tr>
  <tr>
    <td>313</td>
    <td>apple</td>
  </tr>
  <tr>
    <td>418</td>
    <td>apple</td>
  </tr>
  <tr>
    <td>754</td>
    <td>apple</td>
  </tr>
</table>

</div>

```sql
-- Case-insensitive ILIKE
SELECT * FROM fruit
WHERE fav_fruit ILIKE '%apple%';
```

<div align="left">

<table style="text-align: left;">
  <tr>
    <th>customer</th>
    <th>fav_fruit</th>
  </tr>
  <tr>
    <td>349</td>
    <td>apple</td>
  </tr>
  <tr>
    <td>874</td>
    <td>Apple</td>
  </tr>
  <tr>
    <td>703</td>
    <td>apple</td>
  </tr>
  <tr>
    <td>300</td>
    <td>APPLES</td>
  </tr>
  <tr>
    <td>313</td>
    <td>apple</td>
  </tr>
  <tr>
    <td>418</td>
    <td>apple</td>
  </tr>
  <tr>
    <td>300</td>
    <td>APPLE</td>
  </tr>
  <tr>
    <td>754</td>
    <td>apple</td>
  </tr>
</table>

</div>
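
The three approaches can be compared side by side. The sketch below uses made-up values (including `pineapple`, which is not in the sample table) to show that a `%` pattern also matches longer strings that merely contain the search term.

```sql
SELECT fav_fruit,
       fav_fruit LIKE  '%apple%'  AS like_match,
       fav_fruit ILIKE '%apple%'  AS ilike_match,
       lower(fav_fruit) = 'apple' AS exact_lower
FROM (VALUES ('apple'), ('Apple'), (' apple'), ('APPLES'), ('pineapple')) AS t(fav_fruit);
-- 'apple'     : true,  true, true
-- 'Apple'     : false, true, true
-- ' apple'    : true,  true, false
-- 'APPLES'    : false, true, false
-- 'pineapple' : true,  true, false
```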

## ✂️ Trimming Spaces and Characters

```sql
SELECT trim(' abc ');     -- 'abc'
SELECT ltrim(' abc ');    -- 'abc '
SELECT rtrim(' abc ');    -- ' abc'

SELECT trim('Wow!', '!');     -- 'Wow'
SELECT trim('Wow!', '!wW');   -- 'o'
```

💡 **Combine functions:**

```sql
SELECT trim(lower('Wow!'), '!w'); -- 'o'
```
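
Two details are easy to miss: trimming only affects the ends of a string, and the second argument is a set of characters rather than a substring. A short sketch:

```sql
SELECT trim('  a b  ');              -- 'a b' (the inner space is kept)
SELECT trim('xyxhixy', 'xy');        -- 'hi'  (any mix of 'x' and 'y' is removed from both ends)
SELECT trim(both '!' from '!Wow!');  -- 'Wow' (SQL-standard spelling of the same operation)
```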

## 🔡 Substrings & Splits

```sql
SELECT left('abcde', 2), right('abcde', 2); -- 'ab', 'de'
SELECT substring('abcdef' FROM 2 FOR 3);     -- 'bcd'
SELECT split_part('a,bc,d', ',', 2);         -- 'bc'
```
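
As an illustration of how these combine in practice, the sketch below splits a made-up email address and shows two edge cases of the same functions.

```sql
SELECT split_part('jane@example.com', '@', 1) AS user_name,  -- 'jane'
       split_part('jane@example.com', '@', 2) AS domain,     -- 'example.com'
       split_part('jane@example.com', '@', 3) AS missing;    -- '' (empty string, not NULL)

SELECT substring('abcdef' FROM 3);  -- 'cdef' (no FOR: runs to the end)
SELECT left('abcde', -2);           -- 'abc'  (negative n: everything except the last 2)
```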

## 🔗 Concatenation

```sql
SELECT concat('a', 2, 'cc'); -- 'a2cc'
SELECT 'a' || 2 || 'cc';     -- 'a2cc'

-- NULL handling
SELECT concat('a', NULL, 'cc'); -- 'acc'
SELECT 'a' || NULL || 'cc';     -- NULL
```
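
When `||` is preferred but some inputs may be `NULL`, wrapping them in `coalesce()` is a common workaround, and `concat_ws()` skips `NULL` arguments while adding a separator. The names in this sketch are invented sample data.

```sql
SELECT 'a' || coalesce(NULL, '') || 'cc';  -- 'acc'
SELECT concat_ws(', ', 'a', NULL, 'cc');   -- 'a, cc'

-- Optional middle name: NULL || ' ' is NULL, so coalesce drops the extra space.
SELECT first_name || ' ' || coalesce(middle_name || ' ', '') || last_name AS full_name
FROM (VALUES ('Ada', NULL, 'Lovelace'),
             ('Mary', 'Jane', 'Smith')) AS t(first_name, middle_name, last_name);
-- 'Ada Lovelace', 'Mary Jane Smith'
```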

## 🔁 Standardizing Categorical Values (Recode Strategy)

### Step 1: Create Temp Table

```sql
CREATE TEMP TABLE recode AS
SELECT DISTINCT fav_fruit AS original, fav_fruit AS standardized
FROM fruit;
```

### Step 2: Update Standardized Values

```sql
-- Lowercase + Trim
UPDATE recode SET standardized = trim(lower(original));

-- Fix misspellings that contain a doubled "n"
UPDATE recode SET standardized = 'banana' WHERE standardized LIKE '%nn%';

-- Remove trailing "s" characters (plural forms)
UPDATE recode SET standardized = rtrim(standardized, 's');
```
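
Blanket rules like these can over-correct, so it may be worth reviewing the mapping before joining. The first query is a standalone illustration with made-up words; the second assumes the `recode` table from Step 1 exists.

```sql
-- rtrim removes every trailing 's', not just one.
SELECT rtrim('apples', 's') AS plural_fixed,  -- 'apple'
       rtrim('grass', 's')  AS over_trimmed;  -- 'gra'

-- Review each original value next to its standardized form.
SELECT original, standardized
FROM recode
ORDER BY standardized, original;
```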

### Step 3: Join to Clean Data

```sql
SELECT standardized, count(*)
FROM fruit
LEFT JOIN recode ON fav_fruit = original
GROUP BY standardized;
```

<div align="left">

<table style="text-align: left;">
  <tr>
    <th>standardized</th>
    <th>count</th>
  </tr>
  <tr>
    <td>apple</td>
    <td>8</td>
  </tr>
  <tr>
    <td>banana</td>
    <td>5</td>
  </tr>
</table>

</div>
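
The three steps can be tried end to end on a small temporary table. This is a sketch with invented rows (the customer numbers and the `bannana` misspelling are assumptions, not the real data), so the counts differ from the table above.

```sql
-- Hypothetical sample data
CREATE TEMP TABLE fruit (customer int, fav_fruit text);
INSERT INTO fruit VALUES
    (1, 'apple'), (2, 'Apple'), (3, ' apple'), (4, 'APPLES'),
    (5, 'banana'), (6, 'bannana'), (7, 'Bananas');

-- Step 1: one row per distinct original value
CREATE TEMP TABLE recode AS
SELECT DISTINCT fav_fruit AS original, fav_fruit AS standardized
FROM fruit;

-- Step 2: standardize
UPDATE recode SET standardized = trim(lower(original));
UPDATE recode SET standardized = 'banana' WHERE standardized LIKE '%nn%';
UPDATE recode SET standardized = rtrim(standardized, 's');

-- Step 3: join back and count
SELECT standardized, count(*) AS n
FROM fruit
LEFT JOIN recode ON fav_fruit = original
GROUP BY standardized
ORDER BY standardized;
-- apple 4, banana 3
```

Keeping the mapping in a separate table leaves the original data untouched, so a bad rule can be fixed by updating `recode` and re-running the join.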

## 🔚 Recap

1. **Understand your text data** (structure, case, spaces).
2. **Group and count with caution**: case and whitespace variants form separate groups.
3. **Use functions like `lower()`, `trim()`, `split_part()`** to normalize and extract values.
4. **Standardize values using UPDATEs and JOINs** on a recode table.
5. **Always test with real messy data**, because it's never clean the first time.