# Introduction

In this tutorial, you'll learn how to inspect the data types in a DataFrame or Series, and how to find and replace entries.

# Dtypes

The data type for a column in a DataFrame or a Series is known as the **dtype**.

You can use the `dtype` property to get the type of a specific column. For instance, here is the dtype of the `price` column in the `reviews` DataFrame:

In [1]:
import pandas as pd
reviews = pd.read_csv("./input/winemag-data-130k-v2.csv", index_col=0)
pd.set_option('display.max_rows', 5)

In [2]:
reviews.price.dtype

dtype('float64')

Alternatively, the `dtypes` property returns the `dtype` of _every_ column in the DataFrame:

In [3]:
reviews.dtypes

country        object
description    object
                ...  
variety        object
winery         object
Length: 13, dtype: object

Data types tell us how pandas stores the data internally. `float64` means a 64-bit floating-point number, `int64` means a 64-bit integer, and so on.

One peculiarity to keep in mind (and on display very clearly here) is that columns consisting entirely of strings do not get their own type; they are instead given the `object` type.
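
The properties above can be tried on a small table without downloading the wine dataset. The sketch below uses a hypothetical three-row DataFrame named `toy`, which is not part of the original tutorial. Depending on the pandas version, the string column may be reported as `object` or as a dedicated string dtype, so its exact label is not shown here.

```python
import pandas as pd

# Hypothetical toy data standing in for the wine reviews table.
toy = pd.DataFrame({
    "points": [87, 90, 85],
    "price": [15.0, 30.5, 22.0],
    "country": ["Italy", "France", "US"],
})

print(toy.price.dtype)   # dtype of a single column
print(toy.dtypes)        # dtype of every column, as a Series indexed by column name
print(toy.index.dtype)   # the index carries a dtype as well
```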

You can convert a column from one type to another with the `astype()` method, wherever such a conversion makes sense. For example, we can convert the `points` column from its existing `int64` dtype to `float64`:

In [4]:
reviews.points.astype('float64')

0         87.0
1         87.0
          ... 
129969    90.0
129970    90.0
Name: points, Length: 129971, dtype: float64
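
To illustrate what "makes sense" means in practice, here is a minimal sketch on made-up Series (the values are assumptions, not rows from the dataset). An integer column converts to `float64` cleanly, while a column containing text that is not a number should raise a `ValueError` when asked to become `int64`.

```python
import pandas as pd

# Hypothetical integer scores.
points = pd.Series([87, 90, 85], name="points")
as_float = points.astype("float64")   # every integer has a float equivalent

# A conversion that does not make sense: "two" is not an integer literal.
try:
    pd.Series(["1", "two"]).astype("int64")
    conversion_failed = False
except ValueError:
    conversion_failed = True

print(as_float.dtype)
print(conversion_failed)
```

Note that `astype()` returns a new Series and leaves the original untouched, so assign the result back if you want to keep it.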

A DataFrame or Series index has its own `dtype`, too:

In [5]:
reviews.index.dtype

dtype('int64')

Pandas also supports more exotic data types, such as categorical data and time series data. Because these data types are used less often, we will leave them for a much later section of this tutorial.

# Missing data

Entries with missing values are given the value `NaN`, short for "Not a Number". For technical reasons, these `NaN` values are always of the `float64` dtype.
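
One visible consequence, sketched below on made-up values: with pandas' default inference, a Series of whole numbers is stored as `int64`, but the same Series with one missing entry is stored as `float64`, because the gap is represented by `NaN`.

```python
import pandas as pd

whole = pd.Series([1, 2, 3])
with_gap = pd.Series([1, 2, None])   # None is stored as NaN

print(whole.dtype)      # int64
print(with_gap.dtype)   # float64
```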

Pandas provides some methods specific to missing data. To select `NaN` entries, you can use `pd.isnull()` (or its companion `pd.notnull()`), like this:

In [6]:
reviews[pd.isnull(reviews.country)]

| | country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 913 | NaN | Amber in color, this wine has aromas of peach ... | Asureti Valley | 87 | 30.0 | NaN | NaN | NaN | Mike DeSimone | @worldwineguys | Gotsa Family Wines 2014 Asureti Valley Chinuri | Chinuri | Gotsa Family Wines |
| 3131 | NaN | Soft, fruity and juicy, this is a pleasant, si... | Partager | 83 | NaN | NaN | NaN | NaN | Roger Voss | @vossroger | Barton & Guestier NV Partager Red | Red Blend | Barton & Guestier |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 129590 | NaN | A blend of 60% Syrah, 30% Cabernet Sauvignon a... | Shah | 90 | 30.0 | NaN | NaN | NaN | Mike DeSimone | @worldwineguys | Büyülübağ 2012 Shah Red | Red Blend | Büyülübağ |
| 129900 | NaN | This wine offers a delightful bouquet of black... | NaN | 91 | 32.0 | NaN | NaN | NaN | Mike DeSimone | @worldwineguys | Psagot 2014 Merlot | Merlot | Psagot |
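
The same selection pattern can be checked on a small example. This is a sketch on a hypothetical four-row DataFrame named `toy`: `pd.isnull()` produces a boolean mask, and indexing with that mask keeps only the matching rows.

```python
import pandas as pd

# Hypothetical data with two missing countries.
toy = pd.DataFrame({
    "country": ["Italy", None, "US", None],
    "points": [87, 90, 85, 88],
})

missing_country = toy[pd.isnull(toy.country)]   # rows where country is missing
known_country = toy[pd.notnull(toy.country)]    # rows where country is present
n_missing = toy.country.isnull().sum()          # count of missing entries

print(missing_country)
print(n_missing)
```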


Replacing missing values is a common operation. Pandas provides a handy method for this: `fillna()`, which offers a few different strategies for filling in such data. For example, we can simply replace each `NaN` with the string `"Unknown"`:

In [7]:
reviews.region_2.fillna("Unknown")

0         Unknown
1         Unknown
           ...   
129969    Unknown
129970    Unknown
Name: region_2, Length: 129971, dtype: object

Or we could fill each missing value with the first non-null value that appears after the given record in the dataset. This is known as the backfill strategy.
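
A minimal sketch of the backfill strategy follows, using the `Series.bfill()` method on made-up region names. The choice of `bfill()` is an assumption on my part, since the tutorial does not name a specific call; recent pandas versions deprecate the older `fillna(method="bfill")` spelling in its favor.

```python
import pandas as pd

# Hypothetical region values with gaps.
region = pd.Series(["Napa", None, None, "Sonoma", None], name="region_2")

# Each gap takes the next non-null value that appears below it.
backfilled = region.bfill()

# A trailing gap has nothing after it, so it stays missing
# unless a second strategy is applied.
complete = backfilled.fillna("Unknown")

print(backfilled)
print(complete)
```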

Alternatively, we may have a non-null value that we would like to replace. For example, suppose that since this dataset was published, reviewer Kerin O'Keefe has changed her Twitter handle from `@kerinokeefe` to `@kerino`. One way to reflect this in the dataset is using the `replace()` method:

In [8]:
reviews.taster_twitter_handle.replace("@kerinokeefe", "@kerino")

0            @kerino
1         @vossroger
             ...    
129969    @vossroger
129970    @vossroger
Name: taster_twitter_handle, Length: 129971, dtype: object

The `replace()` method is worth mentioning here because it's handy for replacing missing data which is given some kind of sentinel value in the dataset: things like `"Unknown"`, `"Undisclosed"`, `"Invalid"`, and so on.
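
As a sketch of that sentinel use case, the example below converts placeholder strings into real missing values so that `isnull()` and `fillna()` can see them. The Series contents are made up for illustration, and passing a list of sentinels with `np.nan` as the replacement is one possible approach rather than the tutorial's own.

```python
import numpy as np
import pandas as pd

# Hypothetical column that mixes real handles with sentinel strings.
handles = pd.Series(["@kerinokeefe", "Unknown", "@vossroger", "Undisclosed"])

# Turn the sentinel strings into real missing values.
cleaned = handles.replace(["Unknown", "Undisclosed"], np.nan)

# Replace a single non-null value with another.
renamed = cleaned.replace("@kerinokeefe", "@kerino")

print(cleaned.isnull().sum())
print(renamed)
```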
