In [None]:
#@title Copyright 2020 Google LLC. Double-click here for license information.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Colabs

Machine Learning Crash Course uses Colaboratories (Colabs) for all programming exercises. Colab is Google's implementation of [Jupyter Notebook](https://jupyter.org/). For more information about Colabs and how to use them, go to [Welcome to Colaboratory](https://research.google.com/colaboratory).

# Pandas DataFrame UltraQuick Tutorial

This Colab introduces [**DataFrames**](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html), which are the central data structure in the pandas API. This Colab is not a comprehensive DataFrames tutorial. Rather, it provides a very quick introduction to the parts of DataFrames required to do the other Colab exercises in Machine Learning Crash Course.

A DataFrame is similar to an in-memory spreadsheet. Like a spreadsheet:

  * A DataFrame stores data in cells.
  * A DataFrame has named columns (usually) and numbered rows.



## Import NumPy and pandas modules

Run the following code cell to import the NumPy and pandas modules.

In [2]:
import numpy as np
import pandas as pd

## Creating a DataFrame

The following code cell creates a simple DataFrame containing 10 cells organized as follows:

  * 5 rows
  * 2 columns, one named `temperature` and the other named `activity`

The following code cell instantiates the `pd.DataFrame` class to generate a DataFrame. The call passes two arguments:

  * The first argument provides the data to populate the 10 cells. The code cell calls `np.array` to generate the 5x2 NumPy array.
  * The second argument identifies the names of the two columns.

**Note**: Do not redefine variables in the following code cell. Subsequent code cells use these variables.

In [3]:
# Create and populate a 5x2 NumPy array.
my_data = np.array([[0, 3], [10, 7], [20, 9], [30, 14], [40, 15]])

# Create a Python list that holds the names of the two columns.
my_column_names = ['temperature', 'activity']

# Create a DataFrame.
my_dataframe = pd.DataFrame(data=my_data, columns=my_column_names)

# Print the entire DataFrame.
print(my_dataframe)

   temperature  activity
0            0         3
1           10         7
2           20         9
3           30        14
4           40        15
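
As a quick check on the structure described above, the following sketch rebuilds the same 5x2 DataFrame and inspects its shape, column names, and row labels. The variable name `check_df` is an assumption made for illustration and is not part of the original Colab; it is used so that `my_dataframe` is not redefined.

```python
import numpy as np
import pandas as pd

# Rebuild the same data under a separate, hypothetical name so that
# my_dataframe from the cell above is left untouched.
check_df = pd.DataFrame(
    data=np.array([[0, 3], [10, 7], [20, 9], [30, 14], [40, 15]]),
    columns=['temperature', 'activity'])

# shape is a (rows, columns) tuple.
print(check_df.shape)

# columns holds the column names; index holds the row labels,
# which default to 0, 1, 2, ... when none are supplied.
print(list(check_df.columns))
print(list(check_df.index))
```

The default row labels are what the leftmost column of the printed DataFrame shows.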


## Adding a new column to a DataFrame

You may add a new column to an existing pandas DataFrame just by assigning values to a new column name. For example, the following code creates a third column named `adjusted` in `my_dataframe`:

In [4]:
# Create a new column named adjusted.
my_dataframe["adjusted"] = my_dataframe["activity"] + 2

# Print the entire DataFrame
print(my_dataframe)

   temperature  activity  adjusted
0            0         3         5
1           10         7         9
2           20         9        11
3           30        14        16
4           40        15        17
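
Assigning to a new column name modifies the DataFrame in place. As a possible alternative, pandas also offers `DataFrame.assign`, which returns a new DataFrame containing the extra column and leaves the original unchanged. The sketch below uses the hypothetical names `demo_df` and `with_adjusted`, which are not part of the original Colab.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the two-column DataFrame created earlier.
demo_df = pd.DataFrame(
    data=np.array([[0, 3], [10, 7], [20, 9], [30, 14], [40, 15]]),
    columns=['temperature', 'activity'])

# assign() returns a new DataFrame; demo_df itself keeps two columns.
with_adjusted = demo_df.assign(adjusted=demo_df['activity'] + 2)

print(list(demo_df.columns))
print(list(with_adjusted.columns))
print(with_adjusted)
```

Which form to use is a design choice: in-place assignment is shorter, while `assign` avoids changing a DataFrame that other code still relies on.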


## Specifying a subset of a DataFrame

Pandas provides multiple ways to isolate specific rows, columns, slices, or cells in a DataFrame.

In [5]:
print("Rows #0, #1, and #2:")
print(my_dataframe.head(3), '\n')

print("Row #2:")
print(my_dataframe.iloc[[2]], '\n')

print("Rows #1, #2, and #3:")
print(my_dataframe[1:4], '\n')

print("Column 'temperature':")
print(my_dataframe['temperature'])

Rows #0, #1, and #2:
   temperature  activity  adjusted
0            0         3         5
1           10         7         9
2           20         9        11 

Row #2:
   temperature  activity  adjusted
2           20         9        11 

Rows #1, #2, and #3:
   temperature  activity  adjusted
1           10         7         9
2           20         9        11
3           30        14        16 

Column 'temperature':
0     0
1    10
2    20
3    30
4    40
Name: temperature, dtype: int32
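
The cell above isolates rows and a column. A single cell can be isolated as well. The following is a minimal sketch, assuming a stand-in DataFrame named `demo_df` (a hypothetical name, not from the original Colab), that uses the pandas accessors `.at` (label-based), `.iat` (position-based), and `.loc` (label-based slicing).

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the DataFrame used in this section.
demo_df = pd.DataFrame(
    data=np.array([[0, 3], [10, 7], [20, 9], [30, 14], [40, 15]]),
    columns=['temperature', 'activity'])

# Single cell selected by row label and column name.
cell_by_label = demo_df.at[2, 'activity']

# Single cell selected by integer position (row 2, column 1).
cell_by_position = demo_df.iat[2, 1]

# Rows labeled 1 through 3 of one column. Unlike demo_df[1:4],
# a .loc slice includes its end label.
slice_by_label = demo_df.loc[1:3, 'temperature']

print(cell_by_label, cell_by_position)
print(slice_by_label)
```

Both cell lookups return the same value here because the default row labels coincide with the row positions.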


## Task 1: Create a DataFrame

Do the following:

  1. Create a 3x4 (3 rows x 4 columns) pandas DataFrame in which the columns are named `Eleanor`, `Chidi`, `Tahani`, and `Jason`. Populate each of the 12 cells in the DataFrame with a random integer between 0 and 100, inclusive.

  2. Output the following:

     * the entire DataFrame
     * the value in the cell of row #1 of the `Eleanor` column

  3. Create a fifth column named `Janet`, which is populated with the row-by-row sums of `Tahani` and `Jason`.

To complete this task, it helps to know the NumPy basics covered in the NumPy UltraQuick Tutorial.


In [8]:
# Write your code here.
my_data = np.random.randint(low=0, high=101, size=(3, 4))
my_column_names = ['Eleanor', 'Chidi', 'Tahani', 'Jason']
df = pd.DataFrame(data=my_data, columns=my_column_names)
print(df)
print("row #1 of the Eleanor: ")
print(df['Eleanor'][1])

# Add a column named Janet whose values are the row-by-row sums of Tahani and Jason.
df["Janet"] = df["Tahani"] + df["Jason"]
print(df)

   Eleanor  Chidi  Tahani  Jason
0        6     33      87     61
1       59     87      62     97
2       30     25      73     31
row #1 of the Eleanor: 
59
   Eleanor  Chidi  Tahani  Jason  Janet
0        6     33      87     61    148
1       59     87      62     97    159
2       30     25      73     31    104


In [None]:
#@title Double-click for a solution to Task 1.

# Create a Python list that holds the names of the four columns.
my_column_names = ['Eleanor', 'Chidi', 'Tahani', 'Jason']

# Create a 3x4 numpy array, each cell populated with a random integer.
my_data = np.random.randint(low=0, high=101, size=(3, 4))

# Create a DataFrame.
df = pd.DataFrame(data=my_data, columns=my_column_names)

# Print the entire DataFrame
print(df)

# Print the value in row #1 of the Eleanor column.
print("\nSecond row of the Eleanor column: %d\n" % df['Eleanor'][1])

# Create a column named Janet whose contents are the sum
# of two other columns.
df['Janet'] = df['Tahani'] + df['Jason']

# Print the enhanced DataFrame
print(df)
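
The solution passes `high=101` rather than `high=100` because `np.random.randint` excludes its upper bound, while the task asks for integers between 0 and 100 inclusive. The short sketch below draws a larger sample to illustrate the bounds; the variable name `samples` and the sample size are assumptions chosen for illustration.

```python
import numpy as np

# Draw many values so that the extremes of the range are likely to appear.
samples = np.random.randint(low=0, high=101, size=10_000)

# Every value lies in [0, 100]; 101 itself is never produced.
print(samples.min() >= 0)
print(samples.max() <= 100)
```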

## Copying a DataFrame (optional)

Pandas provides two different ways to duplicate a DataFrame:

* **Referencing.** If you assign a DataFrame to a new variable, both variables refer to the same underlying object, so any change made through one is reflected in the other.
* **Copying.** If you call the `pd.DataFrame.copy` method, you create a true independent copy. Changes to the original DataFrame or to the copy will not be reflected in the other.

The difference is subtle, but important.

In [None]:
# Create a reference by assigning df to a new variable.
print("Experiment with a reference:")
reference_to_df = df

# Print the starting value of a particular cell.
print("  Starting value of df: %d" % df['Jason'][1])
print("  Starting value of reference_to_df: %d\n" % reference_to_df['Jason'][1])

# Modify a cell in df.
df.at[1, 'Jason'] = df['Jason'][1] + 5
print("  Updated df: %d" % df['Jason'][1])
print("  Updated reference_to_df: %d\n\n" % reference_to_df['Jason'][1])

# Create a true copy of my_dataframe
print("Experiment with a true copy:")
copy_of_my_dataframe = my_dataframe.copy()

# Print the starting value of a particular cell.
print("  Starting value of my_dataframe: %d" % my_dataframe['activity'][1])
print("  Starting value of copy_of_my_dataframe: %d\n" % copy_of_my_dataframe['activity'][1])

# Modify a cell in my_dataframe.
my_dataframe.at[1, 'activity'] = my_dataframe['activity'][1] + 3
print("  Updated my_dataframe: %d" % my_dataframe['activity'][1])
print("  copy_of_my_dataframe does not get updated: %d" % copy_of_my_dataframe['activity'][1])
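
One way to see the difference directly is Python's `is` operator, which reports whether two names refer to the same object. The following is a small self-contained sketch; the names `original`, `reference`, and `independent` are hypothetical and do not appear in the original Colab.

```python
import pandas as pd

original = pd.DataFrame({'activity': [3, 7, 9]})

# Plain assignment creates a second name for the same object.
reference = original

# copy() creates a separate object with its own data.
independent = original.copy()

print(reference is original)    # True
print(independent is original)  # False

# A change made through `original` is visible through `reference`
# but not through `independent`.
original.at[1, 'activity'] = 100
print(reference.at[1, 'activity'])    # 100
print(independent.at[1, 'activity'])  # 7
```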
