# Pandas Data Types, Column Operations, and Missing Values


▶️ Import `pandas` and `numpy`.


In [1]:
import pandas as pd
import numpy as np

▶️ Create a `DataFrame` named `df_emp` with the following data:


In [2]:
df_emp = pd.DataFrame(
    {
        "emp_id": [30, 40, 10, 20],
        "name": ["Colby", "Adam", "Eli", "Dylan"],
        "dept": ["Sales", "Marketing", "Sales", "Marketing"],
        "office_phone": ["(217)123-4500", np.nan, np.nan, "(217)987-6543"],
        "start_date": ["2017-05-01", "2018-02-01", "2020-08-01", "2019-12-01"],
        "salary": [202000, 185000, 240000, 160500],
    }
)

# Used for intermediate checks
df_emp_backup = df_emp.copy()

df_emp

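|     | emp_id | name  | dept      | office_phone  | start_date | salary |
| --- | ------ | ----- | --------- | ------------- | ---------- | ------ |
| 0   | 30     | Colby | Sales     | (217)123-4500 | 2017-05-01 | 202000 |
| 1   | 40     | Adam  | Marketing | NaN           | 2018-02-01 | 185000 |
| 2   | 10     | Eli   | Sales     | NaN           | 2020-08-01 | 240000 |
| 3   | 20     | Dylan | Marketing | (217)987-6543 | 2019-12-01 | 160500 |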


---

## 🧮 NumPy and Pandas Data Types

Pandas is built on top of NumPy, a powerful library for numerical computing in Python. As a result, many of the data types in Pandas are derived from NumPy's `ndarray` and its data types (`dtypes`). This means that you can expect similar behavior and performance characteristics when working with data in Pandas.

### 🔮 NumPy Data Types

Here are some of the most commonly used data types in NumPy:

1. 🔢 Numeric Types
   - **Signed integers**: `int8`, `int16`, `int32`, `int64`
   - **Unsigned integers**: `uint8`, `uint16`, `uint32`, `uint64`
   - **Floating point**: `float16`, `float32`, `float64`, etc.
2. 🚦 Boolean type
   - **Boolean**: `bool_`
3. 📝 String (Text)
   - **Fixed-length Unicode (string)**: `str_` or `U` (e.g., `U20` for 20-character string)
   - **Fixed-length bytes (string)**: `bytes_` or `S` (e.g., `S10` for 10-byte string)
4. 📅 Datetime and Timedelta
   - **Dates and times**: `datetime64` with various precisions (`[Y, M, D, h, m, s, ms, us, ns]`)
   - **Time differences**: `timedelta64`

:::{seealso} Why are there different numbers after the data types (e.g., `int8`, `int16`, `float32`)?

The numbers indicate the number of bits used to represent the data type. For example, `int8` uses 8 bits (1 byte) for storage, while `int32` uses 32 bits (4 bytes). This allows for a trade-off between memory usage and the range of values that can be represented.

This was particularly important in the early days of computing when memory was limited. If you have experience with low-level programming (like C or C++), you may have encountered situations where choosing the right data type was crucial for performance. Today, it still matters for performance and memory efficiency, especially when working with large datasets.

Although you may not need to worry about these details in everyday programming, understanding them can help you write more efficient code when working with large arrays or datasets.

:::
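
To make the bit-width trade-off concrete, here is a minimal sketch that prints the value range of each signed integer type and compares the memory used by two arrays. The array names `small` and `large` are illustrative only.

```python
import numpy as np

# Range of values each signed integer type can hold
for dtype in (np.int8, np.int16, np.int32, np.int64):
    info = np.iinfo(dtype)
    print(f"{info.dtype}: {info.bits} bits, range {info.min} to {info.max}")

# Memory used by 1,000 values stored with different dtypes
small = np.zeros(1000, dtype=np.int8)
large = np.zeros(1000, dtype=np.int64)

print(small.nbytes)  # 1000 (1 byte per value)
print(large.nbytes)  # 8000 (8 bytes per value)
```

The same 1,000 values take eight times more memory as `int64` than as `int8`, but `int8` can only hold values from -128 to 127.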

You can check the `dtype` of a NumPy array using the `.dtype` attribute:

```python
import numpy as np

a = np.array([1, 2, 3], dtype=np.int16)
print(a.dtype)   # int16
```


For a Pandas `DataFrame`, the counterpart is the `dtypes` property, which lists the data type of each column.


---

### 🐼 Pandas Data Types

Pandas extends NumPy's type system. This means that you can use many of the same data types in Pandas that you would use in NumPy, and vice versa.

The main difference is that Pandas also provides some additional data types that are specific to Pandas.

Here are the most important `dtypes` you will encounter in Pandas beyond NumPy's numeric types:

1. 📦 `object` dtype (legacy, but important)
   - A "catch-all" type for mixed or non-numeric data
   - Commonly used for strings in older pandas versions
   - Backed by generic Python objects
2. 📝 `StringDtype` (`string`)
   - A dedicated string type introduced in pandas 1.0 to replace `object` for text
   - More efficient and consistent than `object` for text data
   - Supports missing values (NA) natively
   - `codes = pd.Series(["ACCY", "FIN", "BDI", None], dtype="string")`
3. 🗂️ `CategoricalDtype` (`category`)
   - For categorical data with a fixed number of possible values
   - More memory-efficient than using `object` or `string` for repeated values
   - Stores data as integer codes with a mapping to the actual categories
   - `standing = pd.Series(["freshman", "sophomore", "junior", "senior"], dtype="category")`
4. 🚦 Nullable Boolean (`boolean`)
   - A boolean type that supports missing values (NA)
   - Different from the standard `bool` type, which does not support NA
   - `is_enrolled = pd.Series([True, False, None, True], dtype="boolean")`
5. 🔢 Nullable Integer (`Int64`, `Int32`, etc.)
   - Integer types that support missing values (NA)
   - Different from standard NumPy integer types, which do not support NA
   - `credits = pd.Series([84, 99, None, 120], dtype="Int64")`
6. 📅 Datetime with Timezone (`datetime64[ns, tz]`)
   - Datetime type that includes timezone information
   - Useful for handling time series data across different time zones
   - `enrollment_date = pd.Series(["2025-09-01", "2025-08-24", None, "2025-08-03"], dtype="datetime64[ns, US/Central]")`
7. ⏰ `PeriodDtype` (`period`)
   - For representing time periods (e.g., months, quarters)
   - Useful for time series analysis
   - `period = pd.Series(pd.period_range("2024-01", periods=4, freq="Q"))`

Please refer to the [relevant Pandas documentation page](https://pandas.pydata.org/docs/user_guide/basics.html#dtypes) for the list of all extension types.
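
As a small illustration of two of the types listed above, the sketch below shows how a `category` column stores integer codes and how `Int64` keeps integers intact when a value is missing. The sample values are illustrative only.

```python
import pandas as pd

standing = pd.Series(
    ["freshman", "sophomore", "junior", "senior", "junior"], dtype="category"
)

# Unique labels (sorted) and the integer code stored for each row
print(standing.cat.categories.tolist())  # ['freshman', 'junior', 'senior', 'sophomore']
print(standing.cat.codes.tolist())       # [0, 3, 1, 2, 1]

# Without a nullable dtype, a missing value forces the column to float64
credits_default = pd.Series([84, 99, None, 120])
credits_nullable = pd.Series([84, 99, None, 120], dtype="Int64")

print(credits_default.dtype)   # float64
print(credits_nullable.dtype)  # Int64
```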


In [3]:
df_emp.dtypes

emp_id           int64
name            object
dept            object
office_phone    object
start_date      object
salary           int64
dtype: object

To retrieve the data type of a column (`Series`), access the `dtype` property.


In [4]:
str(df_emp["name"].dtype)

'object'

:::{seealso} String Data Type in Pandas

Note that older versions of Pandas did not have a dedicated `string` type and stored strings as `object`s.

To improve performance and consistency, Pandas 1.0 introduced `StringDtype` to provide a more efficient way to store and manipulate string data. This is especially useful for large datasets with many string entries, as it can help reduce memory usage and improve performance for string operations.

In the Pandas version used in this notebook, creating a `DataFrame` or `Series` from string data without specifying a dtype produces the `object` dtype, as the `df_emp.dtypes` output above shows. You can explicitly specify `StringDtype` (`dtype="string"`) when creating a `DataFrame` or `Series` if you want to take advantage of its benefits.

```python
s = pd.Series(["Business", "Data", "Innovation"], dtype="string")
print(s.dtype)  # string
```

:::
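
An existing column can also be converted after the fact with `astype("string")`. The following is a minimal sketch using a standalone `Series` named `names` (an illustrative name, not part of `df_emp`):

```python
import pandas as pd

names = pd.Series(["Colby", "Adam", None])

# Convert to the dedicated string dtype
names_str = names.astype("string")

print(names_str.dtype)            # string
print(names_str.isna().tolist())  # [False, False, True]

# String methods keep the missing value as missing
upper = names_str.str.upper()
print(upper.isna().tolist())      # [False, False, True]
```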


---

## 🧰 Pandas Column Operations

Pandas provides a wide range of operations for manipulating columns in a DataFrame. Here are some common column operations you can perform:

| Operation                   | Code Example                            | Description                                  |
| --------------------------- | --------------------------------------- | -------------------------------------------- |
| **Select single column**    | `df["col1"]`                            | Returns a `Series` for the column            |
| **Select multiple columns** | `df[["col1", "col2"]]`                  | Returns a `DataFrame` with chosen columns    |
| **Rename columns**          | `df.rename(columns={"old": "new"})`     | Rename columns using a mapping dictionary    |
| **Drop columns**            | `df.drop(columns=["col1", "col2"])`     | Remove one or more columns from DataFrame    |
| **Add new column**          | `df["total"] = df["price"] * df["qty"]` | Create a new column from existing ones       |
| **Reorder columns**         | `df = df[["col3", "col1", "col2"]]`     | Rearrange column order manually              |
| **Change dtype**            | `df["age"] = df["age"].astype(int)`     | Convert a column’s data type                 |
| **Rename all columns**      | `df.columns = df.columns.str.upper()`   | Apply transformation to all column names     |
| **Select by position**      | `df.iloc[:, [0, 2]]`                    | Select columns by their position (1st & 3rd) |
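
Several of the operations in the table can be combined on one small DataFrame. The sketch below uses hypothetical order data (the `item`, `price`, and `qty` columns are invented for illustration):

```python
import pandas as pd

# Hypothetical order data, used only for illustration
df = pd.DataFrame(
    {
        "item": ["pen", "notebook", "stapler"],
        "price": [1.5, 4.0, 12.0],
        "qty": ["10", "3", "1"],  # stored as text on purpose
    }
)

df["qty"] = df["qty"].astype(int)            # change dtype from text to integer
df["total"] = df["price"] * df["qty"]        # add a new column
df = df[["item", "total", "price", "qty"]]   # reorder columns
df = df.rename(columns={"item": "product"})  # rename a column

first_and_third = df.iloc[:, [0, 2]]         # select by position: product, price
df_no_qty = df.drop(columns=["qty"])         # drop a column (df itself is unchanged)

print(df["total"].tolist())  # [15.0, 12.0, 12.0]
```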


---

### 🧲 Select Subset of Columns

You can select a subset of columns from a DataFrame using either the bracket notation `df[]` or the `.loc[]` accessor.

Selecting a single column returns a `Series`.


In [5]:
df_emp["name"]

0    Colby
1     Adam
2      Eli
3    Dylan
Name: name, dtype: object

In [6]:
df_emp.loc[:, "name"]

0    Colby
1     Adam
2      Eli
3    Dylan
Name: name, dtype: object

In [7]:
type(df_emp["name"])

pandas.core.series.Series

Selecting multiple columns returns a `DataFrame`.


In [8]:
df_emp[["name", "dept"]]

|     | name  | dept      |
| --- | ----- | --------- |
| 0   | Colby | Sales     |
| 1   | Adam  | Marketing |
| 2   | Eli   | Sales     |
| 3   | Dylan | Marketing |


In [9]:
df_emp.loc[:, ["name", "dept"]]

|     | name  | dept      |
| --- | ----- | --------- |
| 0   | Colby | Sales     |
| 1   | Adam  | Marketing |
| 2   | Eli   | Sales     |
| 3   | Dylan | Marketing |


In [10]:
type(df_emp[["name", "dept"]])

pandas.core.frame.DataFrame

:::{seealso} Difference between `df[]` and `df.loc[]`

Both `df[]` and `df.loc[]` can be used to select columns in a DataFrame. Omitting the `.loc` is a shorthand for selecting columns directly, while `df.loc[]` provides more flexibility for complex selections, including row and column slicing. For simple column selections, using `df[]` is more concise.

```python
# Select a single column
df["col1"]          # Shorthand
df.loc[:, "col1"]   # More explicit

# Select multiple columns
df[["col1", "col2"]]        # Shorthand
df.loc[:, ["col1", "col2"]] # More explicit
```

They differ in how safely they support assignment.

- Chained indexing such as `df[mask]["col"] = value` assigns into an intermediate object that may be a _copy_, so the original DataFrame may not change and Pandas may raise a `SettingWithCopyWarning`.
- In contrast, `df.loc[rows, cols] = value` selects and assigns in a single operation on the original DataFrame, making it safer for assignments.

To update values, prefer using `df.loc[]`.

:::
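
A minimal sketch of updating values with `.loc[]`, using a small illustrative DataFrame rather than `df_emp`:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["Colby", "Adam", "Eli"],
        "salary": [202000, 185000, 240000],
    }
)

# Select the rows and the column, then assign, all in one indexing operation
df.loc[df["name"] == "Adam", "salary"] = 190000

# Avoid chained indexing such as:
#   df[df["name"] == "Adam"]["salary"] = 190000
# It assigns into a temporary object and may leave df unchanged.

print(df["salary"].tolist())  # [202000, 190000, 240000]
```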


---

### 🏷️ Rename Column(s)

You can rename columns using the `rename()` method. This method allows you to specify a mapping of old column names to new column names.

The `rename()` method does not modify the original DataFrame by default. Instead, it returns a new DataFrame with the updated column names. If you want to modify the original DataFrame in place, you can set the `inplace` parameter to `True`.

```python
# Rename a column and return a new DataFrame without modifying the original
# df's column names will remain unchanged
df_renamed = df.rename(columns={"name_before": "name_after"})

# Rename a column and update the original DataFrame
df.rename(columns={"name_before": "name_after"}, inplace=True)
```

:::{danger} Avoid assignment with `inplace=True`

When using `inplace=True`, the original DataFrame is modified directly. The `rename()` method will return `None`. This is because the operation is performed in place, and there is no new DataFrame to return.

**🚫 Incorrect: Assigning the result to a variable with `inplace=True`**

```python
# df becomes None
df = df.rename(columns={"name_before": "name_after"}, inplace=True)

display(df) # Output: None
```

**✅ Correct: Use inplace=True without assignment**

```python
df.rename(columns={"name_before": "name_after"}, inplace=True)
```

:::

:::{seealso} `.rename()` can rename both columns and index (row labels)

The `rename()` method can be used to rename columns and/or the index (row labels) of a DataFrame. You can specify the `columns` parameter to rename columns and the `index` parameter to rename index labels.

Alternatively, you can use the `axis` parameter to specify whether you want to rename columns (`axis=1`) or index (`axis=0`).

**Rename columns:**

```python
df.rename(columns={"old_col": "new_col"}, inplace=True)
# or
df.rename({"old_col": "new_col"}, axis=1, inplace=True)
```

**Rename index (row labels):**

```python
df.rename(index={"old_index": "new_index"}, inplace=True)
# or
df.rename({"old_index": "new_index"}, axis=0, inplace=True)
```

:::


---

**🎯 Example: Rename `"office_phone"` column to `"phone_num"` (out-of-place)**


In [11]:
df_renamed = df_emp.rename(columns={"office_phone": "phone_num"})

df_renamed

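|     | emp_id | name  | dept      | phone_num     | start_date | salary |
| --- | ------ | ----- | --------- | ------------- | ---------- | ------ |
| 0   | 30     | Colby | Sales     | (217)123-4500 | 2017-05-01 | 202000 |
| 1   | 40     | Adam  | Marketing | NaN           | 2018-02-01 | 185000 |
| 2   | 10     | Eli   | Sales     | NaN           | 2020-08-01 | 240000 |
| 3   | 20     | Dylan | Marketing | (217)987-6543 | 2019-12-01 | 160500 |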


---

**🎯 Example: Rename `"office_phone"` column to `"phone_num"` (in-place)**


In [12]:
# Create a copy of the original DataFrame for this demo
df_emp2 = df_emp.copy()

df_emp2.rename(columns={"office_phone": "phone_num"}, inplace=True)

df_emp2

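|     | emp_id | name  | dept      | phone_num     | start_date | salary |
| --- | ------ | ----- | --------- | ------------- | ---------- | ------ |
| 0   | 30     | Colby | Sales     | (217)123-4500 | 2017-05-01 | 202000 |
| 1   | 40     | Adam  | Marketing | NaN           | 2018-02-01 | 185000 |
| 2   | 10     | Eli   | Sales     | NaN           | 2020-08-01 | 240000 |
| 3   | 20     | Dylan | Marketing | (217)987-6543 | 2019-12-01 | 160500 |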


Compare the original `df_emp` and modified `df_emp2` DataFrames to see the change.


In [13]:
# Use head(1) to only show the first row for brevity
display(df_emp.head(1))
display(df_emp2.head(1))

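|     | emp_id | name  | dept  | office_phone  | start_date | salary |
| --- | ------ | ----- | ----- | ------------- | ---------- | ------ |
| 0   | 30     | Colby | Sales | (217)123-4500 | 2017-05-01 | 202000 |


|     | emp_id | name  | dept  | phone_num     | start_date | salary |
| --- | ------ | ----- | ----- | ------------- | ---------- | ------ |
| 0   | 30     | Colby | Sales | (217)123-4500 | 2017-05-01 | 202000 |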


You can see that the `"office_phone"` column has been renamed to `"phone_num"` in `df_emp2`, while `df_emp` remains unchanged.


---

**🎯 Example: Rename `"name"` to `"first_name"` and `"salary"` to `"base_salary"` (in-place)**

You can rename multiple columns at once by providing a dictionary with multiple key-value pairs to the `columns` parameter of the `rename()` method.


In [14]:
df_emp3 = df_emp.copy()

df_emp3.rename(columns={"name": "first_name", "salary": "base_salary"}, inplace=True)

df_emp3

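|     | emp_id | first_name | dept      | office_phone  | start_date | base_salary |
| --- | ------ | ---------- | --------- | ------------- | ---------- | ----------- |
| 0   | 30     | Colby      | Sales     | (217)123-4500 | 2017-05-01 | 202000      |
| 1   | 40     | Adam       | Marketing | NaN           | 2018-02-01 | 185000      |
| 2   | 10     | Eli        | Sales     | NaN           | 2020-08-01 | 240000      |
| 3   | 20     | Dylan      | Marketing | (217)987-6543 | 2019-12-01 | 160500      |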


---

### 🗑️ Drop Column(s)

You can drop one or more columns using `df.drop(columns=["col1", "col2"])`. Similar to `rename()`, the `drop()` method does not modify the original DataFrame by default. Instead, it returns a new DataFrame with the specified columns removed. If you want to modify the original DataFrame in place, you can set the `inplace` parameter to `True`.

```python
# Copy df, drop columns
# df will remain unchanged
df_dropped = df.drop(columns=["col1", "col2"])

# Drop columns in place, modifying the original df
df.drop(columns=["col1", "col2"], inplace=True)
```


---

**🎯 Example: Drop `"start_date"` column (out-of-place)**


In [15]:
df_dropped = df_emp.drop(columns=["start_date"])

df_dropped

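|     | emp_id | name  | dept      | office_phone  | salary |
| --- | ------ | ----- | --------- | ------------- | ------ |
| 0   | 30     | Colby | Sales     | (217)123-4500 | 202000 |
| 1   | 40     | Adam  | Marketing | NaN           | 185000 |
| 2   | 10     | Eli   | Sales     | NaN           | 240000 |
| 3   | 20     | Dylan | Marketing | (217)987-6543 | 160500 |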


---

**🎯 Example: Drop `"name"` and `"salary"` columns in-place**


In [16]:
# Work on a copy so that the original DataFrame is preserved
df_emp4 = df_emp.copy()

df_emp4.drop(columns=["name", "salary"], inplace=True)

df_emp4

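|     | emp_id | dept      | office_phone  | start_date |
| --- | ------ | --------- | ------------- | ---------- |
| 0   | 30     | Sales     | (217)123-4500 | 2017-05-01 |
| 1   | 40     | Marketing | NaN           | 2018-02-01 |
| 2   | 10     | Sales     | NaN           | 2020-08-01 |
| 3   | 20     | Marketing | (217)987-6543 | 2019-12-01 |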


---

## 🕳️ Working with Missing Values

In real-world datasets, it's common to encounter missing or null values. These missing values can arise for various reasons, such as unknown values, data entry errors, incomplete data collection, or intentional omissions. Handling missing values is a crucial step in data preprocessing, as they can significantly impact the results of your analysis or machine learning models.

### 🧩 Missing Value Markers in Pandas

While Python has its own built-in `None` object to represent a missing value, Pandas provides a more comprehensive approach to handling missing data. Pandas recognizes several markers for missing values, allowing for greater flexibility in data representation.

#### 🧪 `np.nan` (Not a Number) - Most Common Marker

The most common marker for missing values in Pandas is NumPy's `np.nan` (`NaN`), which stands for "Not a Number". It is a special floating-point value defined by the [IEEE 754 standard](https://en.wikipedia.org/wiki/IEEE_754). The `NaN` value represents missing or undefined numerical data.

Note that `NaN` is a float, so if an otherwise integer column contains any `NaN` values, Pandas automatically converts the entire column to a floating-point type (such as `float64`) to accommodate the `NaN`, unless you use one of the _nullable_ types.
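
The effect of `NaN` on an integer column can be seen in a minimal sketch like the following (the `Series` names are illustrative):

```python
import numpy as np
import pandas as pd

ints = pd.Series([1, 2, 3])
with_nan = pd.Series([1, np.nan, 3])
nullable = pd.Series([1, None, 3], dtype="Int64")

print(ints.dtype)      # int64
print(with_nan.dtype)  # float64 (upcast to hold NaN)
print(nullable.dtype)  # Int64 (nullable integer, no upcast)

# NaN is a float and never compares equal to itself
print(type(np.nan))      # <class 'float'>
print(np.nan == np.nan)  # False
```

Because `NaN` is not equal to itself, use `isna()` rather than `==` to test for missing values.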

#### 🐼 `pd.NA` (Pandas NA)

Pandas introduced a new scalar value `pd.NA` in version 1.0 to represent missing values in a more consistent way across different data types. `pd.NA` is part of Pandas' nullable data types, which allow for missing values in integer, boolean, and string columns without converting them to floating-point types.
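
As a rough illustration of how `pd.NA` behaves in a nullable column, the sketch below compares a nullable integer `Series` against a threshold. The variable names and the threshold of 90 are illustrative assumptions.

```python
import pandas as pd

credits = pd.Series([84, None, 120], dtype="Int64")

# A comparison involving a missing value stays missing instead of becoming False
passed = credits >= 90

print(passed.dtype)            # boolean
print(passed.isna().tolist())  # [False, True, False]

# Aggregations skip missing values by default
print(credits.sum())           # 204
```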

#### ⏳ `pd.NaT` (Not a Time)

For datetime-like data, Pandas uses `pd.NaT` to represent missing or null date and time values. `NaT` is similar to `NaN` but specifically designed for datetime objects.
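
A minimal sketch of how `NaT` appears when text dates are converted with `pd.to_datetime()`; the sample dates are illustrative:

```python
import pandas as pd

raw_dates = pd.Series(["2017-05-01", None, "2020-08-01"])

# Convert text to datetimes; the missing entry becomes NaT
dates = pd.to_datetime(raw_dates)

print(dates.isna().tolist())  # [False, True, False]
print(pd.isna(pd.NaT))        # True
print(dates.dt.year.tolist()) # the year for the missing entry is NaN
```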

---

### 🔍 Methods to Detect Missing Values

You can use the `isna()` or `isnull()` methods to detect missing values in a DataFrame. Both methods are equivalent and return a DataFrame of the same shape as the original, with `True` for missing values and `False` for non-missing values.

```python
# Detect missing values in a single column
df["my_column"].isna() # or df["my_column"].isnull()
```
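
Calling `isna()` on a whole DataFrame returns a boolean DataFrame of the same shape, which can then be summed to count missing values per column. The following is a minimal sketch on a small illustrative DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["Colby", "Adam", "Eli"],
        "office_phone": ["(217)123-4500", np.nan, np.nan],
    }
)

missing = df.isna()     # boolean DataFrame, same shape as df
print(missing.shape)    # (3, 2)

# True counts as 1, so summing gives the number of missing values per column
missing_counts = df.isna().sum()
print(missing_counts)
```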

Recall that a column in a DataFrame is a `Series`, so you can test the `isna()` method on a `Series`.


In [17]:
my_series = pd.Series(["Hello", np.nan, "Pandas", pd.NA])
my_series.isna()

0    False
1     True
2    False
3     True
dtype: bool

Here is a visual representation of the `isna()` method:

![isna](images/pandas/isna-series.png)


---

**🎯 Example: Find employees without a phone number**

First, create a boolean mask that identifies rows where the `"office_phone"` column is `NaN` using the `isna()` method. Then, use this mask to filter the DataFrame and display only the rows with missing phone numbers.


In [18]:
mask = df_emp["office_phone"].isna()
mask

0    False
1     True
2     True
3    False
Name: office_phone, dtype: bool

Use the boolean mask to filter the DataFrame and display rows with missing phone numbers.


In [19]:
df_emp[mask]
# This can be done in one line as well:
# df_emp[df_emp["office_phone"].isna()]

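|     | emp_id | name | dept      | office_phone | start_date | salary |
| --- | ------ | ---- | --------- | ------------ | ---------- | ------ |
| 1   | 40     | Adam | Marketing | NaN          | 2018-02-01 | 185000 |
| 2   | 10     | Eli  | Sales     | NaN          | 2020-08-01 | 240000 |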


If you want to detect non-missing values, you can use the `notna()` or `notnull()` methods.

```python
# Detect non-missing values in a single column
df["my_column"].notna() # or df["my_column"].notnull()
```


In [20]:
my_series = pd.Series(["Hello", np.nan, "Pandas", pd.NA])
my_series.notna()

0     True
1    False
2     True
3    False
dtype: bool

Here is a visual representation of the `notna()` method:

![notna](images/pandas/notna-series.png)


---

**🎯 Example: Find employees with a phone number**

First, create a boolean mask that identifies rows where the `"office_phone"` column is not `NaN` using the `notna()` method. Then, use this mask to filter the DataFrame and display only the rows with non-missing phone numbers.


In [21]:
mask = df_emp["office_phone"].notna()
mask

0     True
1    False
2    False
3     True
Name: office_phone, dtype: bool

Use the boolean mask to filter the DataFrame and display rows with phone numbers.


In [22]:
df_emp[mask]
# This can be done in one line as well:
# df_emp[df_emp["office_phone"].notna()]

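|     | emp_id | name  | dept      | office_phone  | start_date | salary |
| --- | ------ | ----- | --------- | ------------- | ---------- | ------ |
| 0   | 30     | Colby | Sales     | (217)123-4500 | 2017-05-01 | 202000 |
| 3   | 20     | Dylan | Marketing | (217)987-6543 | 2019-12-01 | 160500 |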


:::{important} How to Handle Missing Values

Missing values can occur in your dataset for various reasons. In data analysis, handling missing values is crucial to ensure the quality and accuracy of your results. There are several strategies to deal with missing data, depending on the context and the nature of your dataset.

Here are some common methods to handle missing values in a Pandas DataFrame:

1. **Drop Missing Values**: Remove rows or columns with missing values using `dropna()`.

   ```python
   df.dropna()  # Drop rows with any missing values
   df.dropna(axis=1)  # Drop columns with any missing values
   ```

2. **Fill Missing Values**: Replace missing values with a specific value or a computed value (e.g., mean, median) using `fillna()`.

   ```python
   df.fillna(0)  # Replace missing values with 0
   df["col1"] = df["col1"].fillna(df["col1"].mean())  # Replace with mean
   ```

3. **Interpolate Missing Values**: Use interpolation to estimate missing values based on surrounding data.

   ```python
   df.interpolate()
   ```

By using these methods, you can ensure that your DataFrame is clean and ready for analysis.

:::
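
The three strategies can be compared side by side. This is a minimal sketch on hypothetical readings (the `reading` column and its values are invented for illustration); which strategy is appropriate depends on your data.

```python
import numpy as np
import pandas as pd

# Hypothetical readings with two gaps
df = pd.DataFrame({"reading": [1.0, np.nan, 3.0, np.nan, 5.0]})

# 1. Drop rows that contain a missing value
dropped = df.dropna()

# 2. Fill missing values with the column mean (mean of 1, 3, 5 is 3.0)
filled = df.copy()
filled["reading"] = filled["reading"].fillna(filled["reading"].mean())

# 3. Estimate missing values from neighbors (linear interpolation by default)
interpolated = df.interpolate()

print(dropped["reading"].tolist())       # [1.0, 3.0, 5.0]
print(filled["reading"].tolist())        # [1.0, 3.0, 3.0, 3.0, 5.0]
print(interpolated["reading"].tolist())  # [1.0, 2.0, 3.0, 4.0, 5.0]
```

None of these calls modify `df` itself, so the original data with its gaps is still available for comparison.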
