Working with Categorical Variables
---

The dataset we will work with contains information about Science, Technology, Engineering, and Math (STEM) programs at different postsecondary institutions in Pennsylvania.

We can first open the dataset and view all of the columns:

In [None]:
import numpy as np
import pandas as pd
df = pd.read_csv("stem.csv")
df.head()

Let's explore the dataset a bit. The `Type` column is a categorical variable. Let's look at what categories are available:

In [None]:
df["Type"].unique()

Everything that isn't a `Community College` is considered a University. We can use `np.where` to fill in a new column of the dataframe, using the following syntax:

```python
np.where(<condition>, <true_value>, <false_value>)
```

For example,

In [None]:
df["Corrected Type"] = np.where(df["Type"] == "Community College", "College", "University")
df.head()
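To see the element-wise behavior in isolation, here is a minimal sketch on a tiny hand-made dataframe. The institution names and the `State University` value are made up for illustration; the real `stem.csv` has its own rows and `Type` values.

```python
import numpy as np
import pandas as pd

# Hypothetical rows, not taken from stem.csv.
toy = pd.DataFrame({
    "Institution": ["A", "B", "C"],
    "Type": ["Community College", "State University", "Community College"],
})

# The condition is checked row by row; each row receives one of the two values.
toy["Corrected Type"] = np.where(
    toy["Type"] == "Community College", "College", "University"
)
print(toy)
```

Rows `A` and `C` should come out as `College` and row `B` as `University`.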

### Tasks

1. Try using `np.where` to create a new column `IsCollege` where the value is `True` for a college and `False` for a university.
2. Try using the apply + lambda style to create a new column `IsUniversity` where the value is `True` for a university and `False` for a college.
    - Hint: An inline `if-then-else` looks a bit different from what we have seen previously. The syntax is:
    
    ```python
    <true_value> if <condition> else <false_value>
    ```
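As a rough illustration of the apply + lambda style with an inline `if-then-else`, here is a sketch on unrelated toy data. The `score` column and the threshold of 60 are assumptions made up for this example, not part of the STEM dataset.

```python
import pandas as pd

# Hypothetical data, used only to show the pattern.
toy = pd.DataFrame({"score": [45, 80, 62]})

# apply calls the lambda once per value; the inline if-else picks the result.
toy["result"] = toy["score"].apply(lambda s: "pass" if s >= 60 else "fail")
print(toy)
```

The same pattern carries over to the tasks above once the condition and the two values are swapped for the ones you need.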

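The `IsCollege` and `IsUniversity` columns are really useful because they act as boolean masks: indexing the dataframe with one of them keeps only the rows where the value is `True`. Try the following cells: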

In [None]:
df[df["IsCollege"]].head()

In [None]:
df[df["IsUniversity"]].head()
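If you want to check your understanding of boolean masks away from the real data, the sketch below uses a hand-made dataframe. The institution names are hypothetical and the `IsCollege` values are typed in directly rather than computed.

```python
import pandas as pd

# Hypothetical rows with a ready-made boolean column.
toy = pd.DataFrame({
    "Institution": ["A", "B", "C", "D"],
    "IsCollege": [True, False, True, False],
})

# Indexing with a boolean column keeps only the rows where it is True.
colleges = toy[toy["IsCollege"]]

# The ~ operator flips every value in the mask, selecting the other rows.
universities = toy[~toy["IsCollege"]]

print(colleges)
print(universities)
```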

Notes
---

A single `np.where` call or a single inline `if-then-else` can only choose between two values! If there are more than two categories, we need to create more complicated boolean masks. Let's look at an example using the Trees dataset from before:

In [None]:
df = pd.read_csv("trees.csv")
df.head()

Let's poke around the height data:

In [None]:
df["height"].min(), df["height"].max()

In [None]:
df.groupby("height").count()

Height is a continuous variable ranging from 0 to 65. Let's convert this continuous variable into a categorical one with four ranges:

- less than or equal to 15
- greater than 15 and less than or equal to 30
- greater than 30 and less than or equal to 45
- greater than 45

To do this we can create boolean masks and use them to fill in our categories, e.g.

In [None]:
mask0 = df["height"] <= 15
mask1 = (df["height"] > 15) & (df["height"] <= 30)
mask2 = (df["height"] > 30) & (df["height"] <= 45)
mask3 = (df["height"] > 45) & (df["height"] <= 65)

### Notes

- Take a look at some of the masks to make sure you understand what they look like.
- The `(<condition>) & (<condition>)` syntax behaves like an element-wise version of `<bool> and <bool>` (e.g. `True and True`). The parentheses are required because `&` binds more tightly than comparisons such as `>` and `<=`. Try the following code if you are still unsure. Make sure to think about what the answer should be before running it!
    ```python
    a = np.array([True, False, True])
    b = np.array([True, False, False])
    a & b
    ```
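To show why `&` is used instead of `and`, here is a small sketch on a made-up array of heights (the values are assumptions, not rows from `trees.csv`).

```python
import numpy as np

heights = np.array([10, 20, 40])  # hypothetical values

# Each comparison yields a boolean array; & combines them position by position.
in_range = (heights > 15) & (heights <= 30)
print(in_range.tolist())

# Plain `and` asks for one truth value for the whole array,
# which NumPy refuses to guess, so it raises a ValueError.
try:
    (heights > 15) and (heights <= 30)
    and_failed = False
except ValueError:
    and_failed = True
print(and_failed)
```

Only the middle value (20) falls in the range, so it is the only position where both conditions hold.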

In [None]:
df.loc[mask0, "height_range"] = "0-15"
df.loc[mask1, "height_range"] = "15-30"
df.loc[mask2, "height_range"] = "30-45"
df.loc[mask3, "height_range"] = "45-65"

In [None]:
df[["height", "height_range"]].head()
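One thing worth checking with this approach is that every row ends up in exactly one range. Rows that match none of the masks are left as missing values. The sketch below repeats the same steps on a small made-up set of heights that sit on the range boundaries (the values are assumptions, not taken from `trees.csv`) and counts the unlabeled rows.

```python
import pandas as pd

# Hypothetical heights chosen to sit on and around the boundaries.
toy = pd.DataFrame({"height": [0, 15, 16, 30, 45, 46, 65]})

mask0 = toy["height"] <= 15
mask1 = (toy["height"] > 15) & (toy["height"] <= 30)
mask2 = (toy["height"] > 30) & (toy["height"] <= 45)
mask3 = (toy["height"] > 45) & (toy["height"] <= 65)

toy.loc[mask0, "height_range"] = "0-15"
toy.loc[mask1, "height_range"] = "15-30"
toy.loc[mask2, "height_range"] = "30-45"
toy.loc[mask3, "height_range"] = "45-65"

# Rows not covered by any mask would still be missing here.
unlabeled = toy["height_range"].isna().sum()
print(toy)
print(unlabeled)
```

Because each upper bound uses `<=`, a boundary value such as 15 lands in the lower range (`0-15`), and a count of zero unlabeled rows confirms the masks cover the data.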

### Tasks

1. Try creating categorical ranges for width. Use three ranges.