# Pandas Data Types

> Once you have imported your data into a `DataFrame`, a common first step is to check the data type (`dtype`) of each column, and adjust any that aren't optimal for your analysis. Each Pandas data type supports different types of analysis, so selecting the correct option is an important step in the data preparation pipeline.


Below is a table that shows various Pandas data types, their corresponding Python data types, and a brief description of each.

| Pandas Dtype | Python Dtype        | Description                                           |
|--------------|---------------------|-------------------------------------------------------|
| object       | str                 | Used for text or mixed types of data                  |
| int64        | int                 | Integer numbers                                       |
| float64      | float               | Floating-point numbers                                |
| bool         | bool                | Boolean values (True/False)                           |
| datetime64   | datetime.datetime   | Date and time values                                  |
| timedelta64  | datetime.timedelta  | Differences between two datetimes                     |
| category     | (special type)      | Fixed, limited set of values (often text)             |
| period       | pd.Period           | Periods of time, useful for time-series data          |
| sparse       | (special type)      | Sparse array to contain mostly NaN values             |
| string       | str                 | Text                                                  |

Note that:

- The `int64` and `float64` data types indicate 64-bit storage for integer and floating-point numbers, respectively. Pandas also supports other sizes (like `int32` and `float32`) to save memory when the larger sizes are not necessary. A column of the default `int64` type cannot contain `NaN` values.
- The `category` data type is not a native Python data type but is provided by Pandas to optimise memory usage and performance for data with a small number of distinct values.
- The `sparse` data type is used for data that is mostly composed of `NaN` or missing values. To save memory, it only stores the non-missing values in the column.
- The `period` data type is specific to Pandas and represents spans of time (like a month or a year).
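
The memory savings described above can be checked with `.memory_usage()`. The snippet below is a minimal sketch using made-up data (the `count` and `colour` columns are hypothetical and not part of this lesson's dataset); the exact byte counts depend on your Pandas version, so none are quoted here.

```python
import pandas as pd

# Hypothetical example data: small integers and a text column with few distinct values
df = pd.DataFrame({
    'count': [1, 2, 3, 4] * 250,
    'colour': ['red', 'green', 'blue', 'red'] * 250,
})

# Integers default to int64; int8 is enough for values between -128 and 127
small_count = df['count'].astype('int8')

# Repeated text values are stored once each when the column is categorical
colour_cat = df['colour'].astype('category')

# Compare memory usage (in bytes) before and after each conversion
print(df['count'].memory_usage(deep=True), small_count.memory_usage(deep=True))
print(df['colour'].memory_usage(deep=True), colour_cat.memory_usage(deep=True))
```

Smaller types are only safe when every value fits in the chosen range, so it is worth checking the minimum and maximum of a column before downcasting it.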



### Checking the Data Type

The data types of your columns can be accessed via the `.dtypes` attribute, or by calling the `.info()` method. 

- The `.dtypes` attribute only returns the data type of each column
- The `.info()` method returns both the data type and some additional information: the number of rows and the memory usage of the `DataFrame`, as well as the number of non-null values in each column. We will deal with handling null (missing) values in another lesson.


In [1]:
import pandas as pd



In [2]:
data = {'name': ['Esther', 'Chivirter', 'Ahemba', 'Angela', 'Caleb', 'Daniel', 'Gabriel'],
        'age': [21, 22, 'n/a', 24, 25,'missing', 27]
       }
age_df = pd.DataFrame(data)
age_df

        name      age
0     Esther       21
1  Chivirter       22
2     Ahemba      n/a
3     Angela       24
4      Caleb       25
5     Daniel  missing
6    Gabriel       27


In [3]:
age_df.dtypes

name    object
age     object
dtype: object

In [4]:
age_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   name    7 non-null      object
 1   age     7 non-null      object
dtypes: object(2)
memory usage: 244.0+ bytes


Looking at the output of `info()`, both columns have defaulted to the `object` data type. In this case, this is not quite what we want. The `name` column should be of `string` type, which guarantees that the column holds only text (depending on your Pandas version and configuration, it may also use less memory than `object`), and the `age` column should be of a numeric type so that we can do numeric calculations on it.

### Assigning Data Types


The `.astype()` method can be used to manually assign a data type to a column. We can easily change the `name` column to the `string` type:

In [5]:
age_df.name = age_df.name.astype('string')

In [6]:
age_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   name    7 non-null      string
 1   age     7 non-null      object
dtypes: object(1), string(1)
memory usage: 244.0+ bytes


When we try to cast the `age` column as `int64` though, the method raises a `ValueError`, because it encountered some values (`missing` and `n/a`) that it could not cast as an integer:

In [7]:
age_df.age = age_df.age.astype('int64')

ValueError: invalid literal for int() with base 10: 'n/a'

To handle this error, we will need to use an alternative approach, the `pd.to_numeric()` function.

### The `pd.to_numeric()` Function

By default, the `astype()` method raises an error when it encounters a non-convertible value, as this prevents accidental data loss. We can override this behaviour by setting the `errors` parameter to `'ignore'` rather than `'raise'`, but this would still not convert the column to a numeric type: the original column is returned unchanged, and the non-numeric values keep it as an `object` data type.
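
As a minimal sketch of this behaviour, the snippet below uses a small stand-alone `Series` (the name `ages` is chosen only for illustration) and shows that `errors='ignore'` suppresses the exception but leaves the data untouched:

```python
import pandas as pd

# A small stand-alone Series mixing integers and a non-numeric string
ages = pd.Series([21, 22, 'n/a', 24])

# errors='ignore' suppresses the ValueError, but the original data comes back unchanged
result = ages.astype('int64', errors='ignore')

print(result.dtype)     # still object
print(result.tolist())  # the 'n/a' string is still present
```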

In this scenario, we are happy to lose the information in the non-numeric values, for the sake of being able to treat the column as numeric. To do this, we can use a separate function, `pd.to_numeric()`, with the `errors` parameter set to `'coerce'` to force the conversion:


In [8]:
age_df.age = pd.to_numeric(age_df.age, errors='coerce')
age_df.info()

In [9]:
age_df

        name   age
0     Esther  21.0
1  Chivirter  22.0
2     Ahemba   NaN
3     Angela  24.0
4      Caleb  25.0
5     Daniel   NaN
6    Gabriel  27.0


The values which could not be converted to numeric have now been converted to `NaN` (Not a Number) values.

> Note that it is not possible to convert a numeric column to the standard `int64` type until you have handled the `NaN` values. This is because `NaN` is technically a floating-point value. We will learn more about handling missing values like `NaN` in another lesson.
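
As a brief preview, the sketch below shows two possible ways of reaching an integer column after coercion. It rebuilds the ages as a stand-alone `Series` (the variable names are illustrative only). Which option suits your data depends on whether the rows with missing values should be kept, so treat this as an illustration rather than a recommendation.

```python
import pandas as pd

ages = pd.Series([21, 22, 'n/a', 24, 25, 'missing', 27])
numeric_ages = pd.to_numeric(ages, errors='coerce')  # float64, with NaN for bad values

# Option 1: drop the missing values, then cast to the standard int64 type
int_ages = numeric_ages.dropna().astype('int64')

# Option 2: keep the missing values by using the nullable 'Int64' dtype (note the capital I)
nullable_ages = numeric_ages.astype('Int64')

print(int_ages.dtype, len(int_ages))
print(nullable_ages.dtype, nullable_ages.isna().sum())
```

The nullable `Int64` type stores missing entries as `pd.NA` rather than `NaN`, which is what allows the remaining values to stay as integers.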

## Time Series Data Types
### The `datetime64` Data Type


>The `datetime64` data type provides a memory-efficient structure for working with date and time data, allowing for operations like time-based indexing, slicing, and resampling to be performed. This data type is necessary for effective time-series data analysis, as it allows complex temporal computations and aggregations to be performed with relative ease.

Date-time columns can be challenging to assign correctly, because there is a very large range of ways that date and time values can be formatted, and there is no guarantee that each column will only use one of these formats. It is therefore important to determine which formats are used in your data before attempting to convert a column to `datetime64`.

### Casting a Column to Datetime

The `pd.to_datetime` function can be used to cast a column to `datetime64`. In order to ensure that the conversion is accurate, it is necessary to consider the format that the date/time values are in.

Run the three code blocks below to perform a simple example conversion, and confirm that the `datetime64`-encoded dates are correct. In this case, the data are initially formatted as strings, and the date format is unambiguous, and so the conversion works without any additional work:

In [11]:
data = {
    'date_strings': ['2023-01-01', '2023-02-01', '2023-03-01', '2023-04-01']
}
date_df = pd.DataFrame(data)

date_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 1 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   date_strings  4 non-null      object
dtypes: object(1)
memory usage: 164.0+ bytes


In [12]:
date_df.date_strings = pd.to_datetime(date_df.date_strings)
date_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 1 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   date_strings  4 non-null      datetime64[ns]
dtypes: datetime64[ns](1)
memory usage: 164.0 bytes


In [13]:
date_df

   date_strings
0    2023-01-01
1    2023-02-01
2    2023-03-01
3    2023-04-01
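
When the format is not unambiguous, it can be stated explicitly with the `format` parameter of `pd.to_datetime`. The sketch below assumes a hypothetical column of day-first dates (the `uk_dates` name and values are made up for illustration), where `01/02/2023` is meant to be 1 February rather than 2 January:

```python
import pandas as pd

# Hypothetical day-first dates: '01/02/2023' is meant to be 1 February 2023
uk_dates = pd.Series(['01/02/2023', '15/02/2023', '28/02/2023'])

# Stating the format explicitly removes the day/month ambiguity
parsed = pd.to_datetime(uk_dates, format='%d/%m/%Y')

print(parsed)
```

The format codes (`%d`, `%m`, `%Y` and so on) are the same ones used by Python's `datetime.strptime`.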


#### Common Issues: Mixed Date Formats

Handling multiple date formats in a single column can be tricky. `pd.to_datetime` can infer many common formats automatically, but in recent versions of Pandas (2.0 and later) it infers one format from the first value and applies it to the whole column, so a column with mixed formats needs extra care. Below is a simple example of a column called `mixed_dates`, which has dates in multiple formats:

In [14]:
# Create a sample DataFrame with multiple date formats
data = {
    'mixed_dates': ['01/02/2023', '2023-03-01', '04-Apr-2023', '20230505']
}
mixed_date_df = pd.DataFrame(data)

# Displaying the original DataFrame
print("Original Dataframe:")
print(mixed_date_df)
print("\nData types:")
print(mixed_date_df.dtypes)

Original Dataframe:
   mixed_dates
0   01/02/2023
1   2023-03-01
2  04-Apr-2023
3     20230505

Data types:
mixed_dates    object
dtype: object


In [15]:
# Converting the 'mixed_dates' column to datetime
# Note: the infer_datetime_format argument is deprecated since Pandas 2.0, so it is omitted here;
# errors='coerce' turns any value that cannot be parsed into NaT
mixed_date_df['dates'] = pd.to_datetime(mixed_date_df['mixed_dates'], errors='coerce')

# Displaying the modified DataFrame
print("\nModified DataFrame:")
print(mixed_date_df)
print("\nData types:")
print(mixed_date_df.dtypes)


Modified DataFrame:
   mixed_dates      dates
0   01/02/2023 2023-01-02
1   2023-03-01        NaT
2  04-Apr-2023        NaT
3     20230505        NaT

Data types:
mixed_dates            object
dates          datetime64[ns]
dtype: object



Unfortunately in this case, automatic conversion has not been very effective: only the first value matched the inferred format, and the remaining values have been returned as `NaT` (Not a Time).

A more effective approach is to use the `parse` function from the `dateutil` library, in conjunction with the `.apply` method:

In [16]:
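from dateutil.parser import parse

# Parse each value individually, so every row can use a different format
mixed_date_df['dates'] = mixed_date_df['mixed_dates'].apply(parse)
# Ensure the column is stored as datetime64 (the deprecated infer_datetime_format argument is not needed)
mixed_date_df['dates'] = pd.to_datetime(mixed_date_df['dates'], errors='coerce')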
print("\nModified DataFrame:\n")
print(mixed_date_df)
print("\nData types:\n")
print(mixed_date_df.dtypes)


Modified DataFrame:

   mixed_dates      dates
0   01/02/2023 2023-01-02
1   2023-03-01 2023-03-01
2  04-Apr-2023 2023-04-04
3     20230505 2023-05-05

Data types:

mixed_dates            object
dates          datetime64[ns]
dtype: object
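
As an alternative to `dateutil`, Pandas 2.0 and later accept `format='mixed'`, which asks `pd.to_datetime` to infer the format of each value individually. The sketch below assumes Pandas 2.0 or newer (the argument is not available in earlier versions) and reuses the same four date strings in a stand-alone `Series`:

```python
import pandas as pd

mixed = pd.Series(['01/02/2023', '2023-03-01', '04-Apr-2023', '20230505'])

# format='mixed' (Pandas 2.0+) infers the format of each value separately
parsed = pd.to_datetime(mixed, format='mixed')

print(parsed)
```

Per-value inference is convenient but risky for ambiguous strings: `01/02/2023` is read month-first by default, and passing `dayfirst=True` changes that interpretation.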



### The `timedelta64` Data Type
The `timedelta64` data type is used to represent differences between `datetime64` values. While a `datetime64` value represents a specific point in time, with a defined year, month, day, hour, minute, and so on, a `timedelta64` value represents a duration that is not anchored to a specific start or end point. It tells you how much time is between two points, without specifying what those points are.

The distinction between `timedelta64` and `datetime64` in Pandas (and similarly, between `timedelta` and `datetime` in Python's `datetime` module) matters because points in time and durations are fundamentally different concepts, and they behave differently in arithmetic.

**Arithmetic Operations:**
   - When you subtract one `datetime64` value from another, the result is a `timedelta64` value, because the difference between two points in time is a duration
   - Conversely, when you add a `timedelta64` to a `datetime64` value, or subtract one from it, you get another `datetime64` value, because you're shifting a point in time by a certain duration

Keeping the two types separate means that arithmetic and comparisons on time data stay semantically meaningful: Pandas knows whether a result should be a point in time or a duration, which helps to prevent misinterpretation of the data.

For example, the code block below creates a new `timedelta64` column by subtracting a specific timestamp from the `dates` column of the `mixed_date_df` `DataFrame`:

In [17]:
# Subtracting a single date from the 'dates' column
single_date = pd.Timestamp('2023-01-01')  # Creating a Timestamp object
mixed_date_df['date_difference'] = mixed_date_df['dates'] - single_date  # Subtracting the single date

# Displaying the modified DataFrame
print("\nModified DataFrame:")
print(mixed_date_df)


Modified DataFrame:
   mixed_dates      dates date_difference
0   01/02/2023 2023-01-02          1 days
1   2023-03-01 2023-03-01         59 days
2  04-Apr-2023 2023-04-04         93 days
3     20230505 2023-05-05        124 days
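
The reverse operation, and the extraction of a plain number from a duration, can be sketched in the same way. The snippet below uses a small stand-alone `Series` (the names `dates`, `gaps`, `shifted` and `gap_days` are illustrative only) rather than the `DataFrame` above:

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(['2023-01-02', '2023-03-01']))
start = pd.Timestamp('2023-01-01')

gaps = dates - start                    # datetime64 minus a Timestamp gives timedelta64
shifted = dates + pd.Timedelta(days=7)  # datetime64 plus a Timedelta gives datetime64
gap_days = gaps.dt.days                 # whole days as plain integers

print(gaps.dtype, shifted.dtype)
print(gap_days.tolist())  # [1, 59]
```

The `.dt.days` accessor is useful when a duration needs to be used as an ordinary number, for example as an input to a numeric calculation.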


## Key Takeaways


- Checking and adjusting data types in a Pandas `DataFrame` is crucial for optimal analysis, as each data type supports different types of analysis
- Pandas attempts to automatically assign data types to columns when importing raw data, but may not always choose optimally
- Use `.dtypes` to see column data types and `.info()` for data types plus additional `DataFrame` details
- The `.astype()` method in Pandas allows for manual data type assignment to a column
- Attempting to cast non-numeric values to integers with `.astype()` will result in a `ValueError`
- Use `pd.to_numeric()` with `errors='coerce'` to force the conversion to a numeric type, turning non-numeric values into `NaN`
- `NaN` values must be handled before a column can be converted to the standard `int64` type
- The `datetime64` data type allows efficient handling of date and time data, enabling time-based operations and facilitating time-series analysis
- Use `pd.to_datetime` to convert a column to `datetime64`, considering the date/time format for accuracy
- The `to_datetime` function can convert various date formats to `datetime64`, but may need specific format or parsing help for complex cases
- `timedelta64` represents durations, while `datetime64` represents specific points in time
