# Machine Learning Zoomcamp

## 1.9 Introduction to Pandas

Plan:

* DataFrames
* Series
* Index
* Accessing elements
* Element-wise operations
* Filtering
* String operations
* Summarizing operations
* Missing values
* Grouping
* Getting the NumPy arrays

In [None]:
import numpy as np
import pandas as pd

## DataFrames

In [None]:
data = [
    ['Nissan', 'Stanza', 1991, 138, 4, 'MANUAL', 'sedan', 2000],
    ['Hyundai', 'Sonata', 2017, None, 4, 'AUTOMATIC', 'Sedan', 27150],
    ['Lotus', 'Elise', 2010, 218, 4, 'MANUAL', 'convertible', 54990],
    ['GMC', 'Acadia',  2017, 194, 4, 'AUTOMATIC', '4dr SUV', 34450],
    ['Nissan', 'Frontier', 2017, 261, 6, 'MANUAL', 'Pickup', 32340],
]

columns = [
    'Make', 'Model', 'Year', 'Engine HP', 'Engine Cylinders',
    'Transmission Type', 'Vehicle_Style', 'MSRP'
]

In [None]:
df = pd.DataFrame(data, columns=columns)

In [None]:
df

In [None]:
data = [
    {
        "Make": "Nissan",
        "Model": "Stanza",
        "Year": 1991,
        "Engine HP": 138.0,
        "Engine Cylinders": 4,
        "Transmission Type": "MANUAL",
        "Vehicle_Style": "sedan",
        "MSRP": 2000
    },
    {
        "Make": "Hyundai",
        "Model": "Sonata",
        "Year": 2017,
        "Engine HP": None,
        "Engine Cylinders": 4,
        "Transmission Type": "AUTOMATIC",
        "Vehicle_Style": "Sedan",
        "MSRP": 27150
    },
    {
        "Make": "Lotus",
        "Model": "Elise",
        "Year": 2010,
        "Engine HP": 218.0,
        "Engine Cylinders": 4,
        "Transmission Type": "MANUAL",
        "Vehicle_Style": "convertible",
        "MSRP": 54990
    },
    {
        "Make": "GMC",
        "Model": "Acadia",
        "Year": 2017,
        "Engine HP": 194.0,
        "Engine Cylinders": 4,
        "Transmission Type": "AUTOMATIC",
        "Vehicle_Style": "4dr SUV",
        "MSRP": 34450
    },
    {
        "Make": "Nissan",
        "Model": "Frontier",
        "Year": 2017,
        "Engine HP": 261.0,
        "Engine Cylinders": 6,
        "Transmission Type": "MANUAL",
        "Vehicle_Style": "Pickup",
        "MSRP": 32340
    }
]

In [None]:
df = pd.DataFrame(data)
df

In [None]:
df.head(n=2)

## Series

In [None]:
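# Dot notation works only for column names that are valid Python identifiers.
# `df.Engine HP` is a SyntaxError because of the space; use df['Engine HP'] instead.
df.Make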

In [None]:
df['Engine HP']

In [None]:
df[['Make', 'Model', 'MSRP']]

In [None]:
df['id'] = [1, 2, 3, 4, 5]

In [None]:
df['id'] = [10, 20, 30, 40, 50]

In [None]:
df

In [None]:
del df['id']

In [None]:
df

## Index


In [None]:
df.index

In [None]:
df.Make.index

In [None]:
df.index = ['a', 'b', 'c', 'd', 'e']

In [None]:
df

In [None]:
df.iloc[[1, 2, 4]]

In [None]:
df = df.reset_index(drop=True)

In [None]:
df = df.reset_index()

In [None]:
df
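
The two `reset_index` calls above behave differently, which is easy to miss when they run back to back. The sketch below uses a small stand-in frame named `cars` (a hypothetical name, not the notebook's `df`) to show what each call does to the columns.

```python
import pandas as pd

# Small illustrative frame (not the notebook's `df`) with a custom string index.
cars = pd.DataFrame(
    {"Make": ["Nissan", "Hyundai", "Lotus"], "MSRP": [2000, 27150, 54990]},
    index=["a", "b", "c"],
)

# drop=True discards the old labels and restores the default 0..n-1 index.
dropped = cars.reset_index(drop=True)
print(dropped.columns.tolist())  # ['Make', 'MSRP']

# Without drop=True, the old labels are kept as a new column named 'index'.
kept = cars.reset_index()
print(kept.columns.tolist())     # ['index', 'Make', 'MSRP']
```

Calling `reset_index()` without `drop=True` on a frame that already has the default index therefore adds an extra `index` column, which then shows up in later summaries such as `describe()`.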

## Accessing elements
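
This section has no cells in the notebook. As a minimal sketch of the usual approach, the example below contrasts label-based access with `.loc` and position-based access with `.iloc`. The frame `cars` and its labels are made up for illustration.

```python
import pandas as pd

# Illustrative frame (hypothetical data) with string labels as the index.
cars = pd.DataFrame(
    {"Make": ["Nissan", "Hyundai", "Lotus"], "MSRP": [2000, 27150, 54990]},
    index=["a", "b", "c"],
)

# .loc selects by index label: rows 'b' and 'c', column 'Make'.
by_label = cars.loc[["b", "c"], "Make"]

# .iloc selects by integer position, regardless of what the labels are.
by_position = cars.iloc[[1, 2]]

# A single cell, looked up by label and by position.
print(cars.loc["c", "MSRP"])  # 54990
print(cars.iloc[2, 1])        # 54990
```

With the default integer index the two accessors can look interchangeable for single rows, so the difference only becomes visible once the index is changed or rows are filtered.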

## Element-wise operations

In [None]:
df['Engine HP'] * 2

In [None]:
df['Year'] >= 2015

## Filtering

In [None]:
df[
    df['Make'] == 'Nissan'
]

In [None]:
df[
    (df['Make'] == 'Nissan') & (df['Year'] >= 2015)
]

## String operations

In [None]:
'machine learning zoomcamp'.replace(' ', '_')

In [None]:
df['Vehicle_Style'].str.lower()

In [None]:
df['Vehicle_Style'] = df['Vehicle_Style'].str.replace(' ', '_').str.lower()

In [None]:
df

## Summarizing operations

In [None]:
df.describe().round(2)

In [None]:
df.MSRP.describe()

In [None]:
df.describe()

In [None]:
df.Make.nunique()

In [None]:
df.nunique()

## Missing values


In [None]:
df.isnull()

In [None]:
df.isnull().sum()
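
Once missing values have been counted, a common next step is to fill them. The notebook itself does not do this, so the sketch below is only an illustration, assuming that replacing a missing value with 0 or with the column mean is acceptable for the task. The series `hp` is a made-up stand-in for the `Engine HP` column.

```python
import pandas as pd

# Stand-in for the 'Engine HP' column, with one missing value.
hp = pd.Series([138.0, None, 218.0], name="Engine HP")

# Count the missing entries, as in the cell above.
n_missing = hp.isnull().sum()

# Option 1: replace missing values with a constant.
filled_zero = hp.fillna(0)

# Option 2: replace them with the column mean (mean() skips NaN by default).
filled_mean = hp.fillna(hp.mean())

print(n_missing)
print(filled_zero.tolist())
print(filled_mean.tolist())
```

`fillna` returns a new object, so the result has to be assigned back if the change should persist.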

## Grouping


```
SELECT 
    transmission_type,
    AVG(MSRP)
FROM
    cars
GROUP BY
    transmission_type
```

In [None]:
df

In [None]:
# Group by transmission type and average MSRP, matching the SQL query above.
df.groupby('Transmission Type').MSRP.mean()

## Getting the NumPy arrays

In [None]:
df.MSRP.values

In [None]:
df["MSRP"].values

In [None]:
df.to_dict(orient='records')

In [None]:
# orient must be one of the supported values; an empty string raises ValueError.
df.to_dict(orient='list')

In `pandas`, the `orient` parameter of `DataFrame.to_dict()` specifies the layout of the returned dictionary. Common values are:

1. **'dict'**: the default; returns a dictionary keyed by column name, where each value is a dictionary mapping index labels to that column's values.
2. **'list'**: returns a dictionary keyed by column name, where each value is a list of that column's values.
3. **'series'**: returns a dictionary keyed by column name, where each value is a `Series`.
4. **'split'**: returns a dictionary with three keys: `index`, `columns`, and `data`.
5. **'records'**: returns a list in which each element is a dictionary representing one row.
6. **'index'**: returns a dictionary keyed by row index, where each value is a dictionary mapping column names to values.

For example, `df.to_dict(orient='records')` converts the `DataFrame` to record format: a list in which each element is a dictionary representing one row of the `DataFrame`. Which `orient` to choose depends on how the data will be consumed afterwards.
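
To make the differences between the `orient` values concrete, the sketch below applies several of them to a tiny two-row frame. The frame `cars` is a made-up example, not the notebook's `df`, and the comments describe the shape of each result rather than exact printed output.

```python
import pandas as pd

# Tiny illustrative frame (hypothetical, not the notebook's `df`).
cars = pd.DataFrame({"Make": ["Nissan", "Lotus"], "MSRP": [2000, 54990]})

as_dict = cars.to_dict()                     # column -> {index label -> value}
as_list = cars.to_dict(orient="list")        # column -> [values]
as_records = cars.to_dict(orient="records")  # [one dict per row]
as_index = cars.to_dict(orient="index")      # index label -> {column -> value}
as_split = cars.to_dict(orient="split")      # keys: 'index', 'columns', 'data'

# Print each layout next to its orient name for comparison.
for name, result in [
    ("dict", as_dict),
    ("list", as_list),
    ("records", as_records),
    ("index", as_index),
    ("split", as_split),
]:
    print(name, result)
```

The `records` layout is the one that mirrors the list of dictionaries used to build the `DataFrame` earlier in this notebook.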