# Notebook for Lesson 1. Introduction to the Pandas Library

A library is a folder containing code files and subfolders.

```
course_code.py: file
course_code: module
```

## Importing libraries

```python
import library_name  # import a library
```

In [1]:
import math
math.log10(10)

1.0

```python
import library_name as alias
alias.module_name.function_name()
alias.function_name()
```

In [2]:
import numpy as np
np.random.random()

0.012781132293665176

```python
from library_name import module_name1 as alias1, module_name2
alias1.function_name()
module_name2.function_name()
```

In [3]:
from scipy import stats as ss
ss.pearsonr(range(10), range(0, 20, 2))

(1.0, 0.0)

In [4]:
from numpy.random import randint
randint(10)

9

```python
from library import *  # very bad style
```

#### Import formatting rules

Correct:
```python
import library_name1
import library_name2
```

Incorrect:
```python
import library_name1, library_name2
```

## numpy

1. A package for scientific computing
2. A low-level library with a core written in C
3. The basic object is the multidimensional array (```np.ndarray```)
4. Arithmetic operations on arrays are faster and more convenient than the same operations written with standard Python functionality

![matrix](https://storage.yandexcloud.net/klms-public/production/learning-content/10/56/404/1098/4739/5_matrix.png)

In [5]:
matrix = [[56, 156], 
          [70, 180], 
          [45, 160]]

In [6]:
# explicit conversion
matrix = np.array(matrix)

In [7]:
matrix

array([[ 56, 156],
       [ 70, 180],
       [ 45, 160]])

In [8]:
type(matrix)

numpy.ndarray

In [9]:
# matrix.shape[0] - number of rows
# matrix.shape[1] - number of columns
matrix.shape

(3, 2)
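As an illustration of the claim that array arithmetic is more convenient than plain Python, here is a minimal sketch. It assumes, purely for the example, that the two columns of the matrix are weight in kilograms and height in centimetres; the source does not state this.

```python
import numpy as np

# Same numbers as the matrix above. Treating the columns as weight (kg)
# and height (cm) is an assumption made only for this example.
matrix = np.array([[56, 156],
                   [70, 180],
                   [45, 160]])

weight_kg = matrix[:, 0]        # first column
height_m = matrix[:, 1] / 100   # second column, converted to metres

# Element-wise arithmetic on whole arrays, no explicit loop.
bmi = weight_kg / height_m ** 2

# The same calculation on plain Python lists needs an explicit loop.
bmi_loop = [w / (h / 100) ** 2 for w, h in matrix.tolist()]

print(bmi.round(1))
```

Both versions give the same numbers; the array version is shorter and the loop runs inside numpy rather than in the Python interpreter.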

## pandas

1. The main library for working with data in Python
2. Built on top of numpy
3. Extensive functionality for reading data (csv, xlsx, etc.)
4. Convenient filtering syntax plus SQL-like capabilities

In [10]:
import pandas as pd

Main data structures:
1. `Series`
2. `DataFrame` (each column is a `Series`)

1. Different data types can be stored
2. Calculations in pandas are analogous to numpy: operations on columns ~ operations on arrays
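A small sketch of the "operations on columns ~ operations on arrays" point. The column names `price` and `quantity` are made up for the illustration.

```python
import pandas as pd

# Hypothetical data, used only to show column arithmetic.
df = pd.DataFrame({"price": [10.0, 20.0, 30.0],
                   "quantity": [1, 2, 3]})

# Multiplying two columns works element-wise, like two numpy arrays.
df["total"] = df["price"] * df["quantity"]

# The equivalent computation on the underlying numpy arrays.
values = df["price"].to_numpy() * df["quantity"].to_numpy()

print(df)
```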

![df](https://storage.yandexcloud.net/klms-public/production/learning-content/457/4167/37250/103264/490410/df.png)

DataFrame, input/output:
1. Manual creation (for example, from a dictionary)
2. Reading tabular data and other structures
3. Writing to files

In [11]:
# create a DataFrame from a dictionary
df = pd.DataFrame(
    {
    "Name": ["Braund, Mr. Owen Harris",
             "Allen, Mr. William Henry",
             "Bonnell, Miss. Elizabeth"],
    "Age": [22, 35, 58],
    "Sex": ["male", "male", "female"]
    }
)

df

Unnamed: 0,Name,Age,Sex
0,"Braund, Mr. Owen Harris",22,male
1,"Allen, Mr. William Henry",35,male
2,"Bonnell, Miss. Elizabeth",58,female


In [12]:
# column types
df.dtypes

Name    object
Age      int64
Sex     object
dtype: object

In [13]:
# class of the object
type(df.Age)

pandas.core.series.Series

In [14]:
# data type of the values inside
df.Age.dtype

dtype('int64')

`.describe()`: summary statistics for the numeric columns

In [15]:
df.describe()

Unnamed: 0,Age
count,3.0
mean,38.333333
std,18.230012
min,22.0
25%,28.5
50%,35.0
75%,46.5
max,58.0


In [16]:
# create a Series with a name
ages = pd.Series([22, 35, 58], name="Age")
ages

0    22
1    35
2    58
Name: Age, dtype: int64

In [17]:
df.shape, ages.shape

((3, 3), (3,))

In [18]:
# A Series can be turned into a DataFrame with the to_frame() method
df_ages = ages.to_frame()
df_ages

Unnamed: 0,Age
0,22
1,35
2,58


![input](https://storage.yandexcloud.net/klms-public/production/learning-content/457/4167/37250/103264/490410/input.png)

In [19]:
# reading data
titanic = pd.read_csv("titanic.csv")

- `sep`: the delimiter used in the file, a comma by default
- `header`: which row(s) hold the column headers, `header=0` by default. A list of integers can be passed (multi-level headers), or `header=None`
- `names`: list of column names; works with either `header=None` or `header=0`
- `index_col`: number (or name) of the column to use as the row index
- `usecols`: list of columns to read
- `dtype`: dictionary with explicit column types
- `skiprows`: numbers of the rows to skip (a function can be passed instead)
- `nrows`: number of rows to read
- `skip_blank_lines`: skip blank lines, `True` by default
- `parse_dates`: `bool` or a list of columns
- `thousands`, `decimal`: thousands separator and decimal separator
- `encoding`: encoding of the file
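The sketch below combines several of these parameters. The CSV content is invented for the example and is read from an in-memory buffer (`io.StringIO`) instead of a file, so the snippet runs without any data on disk.

```python
import io
import pandas as pd

# Hypothetical file content: ';' as the delimiter, ',' as the decimal mark.
raw = io.StringIO(
    "id;name;price;sold_at\n"
    "1;apple;1,50;2021-01-05\n"
    "2;pear;2,25;2021-01-06\n"
    "3;plum;0,75;2021-01-07\n"
)

df = pd.read_csv(
    raw,
    sep=";",                              # non-default delimiter
    decimal=",",                          # "1,50" is parsed as 1.5
    index_col="id",                       # use the id column as the row index
    usecols=["id", "price", "sold_at"],   # skip the name column
    parse_dates=["sold_at"],              # parse this column as datetimes
    nrows=2,                              # read only the first two data rows
)

print(df)
print(df.dtypes)
```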


In [20]:
titanic.to_csv('second_titanic.csv', index=False)

`.head()` / `.tail()`: show the first/last rows (5 by default)

In [21]:
# more detailed information
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [22]:
titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [23]:
titanic.head(3)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S


In [24]:
titanic.tail()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q


`.columns`: the column names

In [25]:
titanic.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

`.index`: the row index labels

In [26]:
titanic.index

RangeIndex(start=0, stop=891, step=1)

`.to_numpy()`: conversion to a numpy array

In [27]:
titanic.to_numpy()

array([[1, 0, 3, ..., 7.25, nan, 'S'],
       [2, 1, 1, ..., 71.2833, 'C85', 'C'],
       [3, 1, 3, ..., 7.925, nan, 'S'],
       ...,
       [889, 0, 3, ..., 23.45, nan, 'S'],
       [890, 1, 1, ..., 30.0, 'C148', 'C'],
       [891, 0, 3, ..., 7.75, nan, 'Q']], dtype=object)

#### DataFrame, filtering:

1. A DataFrame has an index
2. By default the index is a sequential numbering of the rows
3. Index labels and column names do not have to be unique
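To illustrate the last point, a minimal sketch with a deliberately repeated index label (the data is made up):

```python
import pandas as pd

# The label "a" appears twice in the index.
df = pd.DataFrame({"value": [10, 20, 30]}, index=["a", "b", "a"])

print(df.index.is_unique)

# Selecting a repeated label returns every matching row.
rows_a = df.loc["a"]
print(rows_a)
```

Because of this, a lookup by label is not guaranteed to return exactly one row; `df.index.is_unique` is a quick way to check.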

![filter](https://storage.yandexcloud.net/klms-public/production/learning-content/10/56/404/1098/5292/6_filter.png)

In [28]:
titanic[["Age", "Sex"]]

Unnamed: 0,Age,Sex
0,22.0,male
1,38.0,female
2,26.0,female
3,35.0,female
4,35.0,male
...,...,...
886,27.0,male
887,19.0,female
888,,female
889,26.0,male


In [29]:
# iloc is the accessor for selecting rows by position
titanic.iloc[[0, 1]]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C


In [30]:
# selection by position
# remember that numbering starts at 0 on both axes
titanic.iloc[9:25, 2:5]

Unnamed: 0,Pclass,Name,Sex
9,2,"Nasser, Mrs. Nicholas (Adele Achem)",female
10,3,"Sandstrom, Miss. Marguerite Rut",female
11,1,"Bonnell, Miss. Elizabeth",female
12,3,"Saundercock, Mr. William Henry",male
13,3,"Andersson, Mr. Anders Johan",male
14,3,"Vestrom, Miss. Hulda Amanda Adolfina",female
15,2,"Hewlett, Mrs. (Mary D Kingcome)",female
16,3,"Rice, Master. Eugene",male
17,2,"Williams, Mr. Charles Eugene",male
18,3,"Vander Planke, Mrs. Julius (Emelia Maria Vande...",female


In [31]:
# filtering by condition
# loc is the accessor for labels and conditions
above_35 = titanic.loc[titanic["Age"] > 35]

In [32]:
above_35.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
11,12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.55,C103,S
13,14,0,3,"Andersson, Mr. Anders Johan",male,39.0,1,5,347082,31.275,,S
15,16,1,2,"Hewlett, Mrs. (Mary D Kingcome)",female,55.0,0,0,248706,16.0,,S


In [33]:
# filtering by condition
class_23 = titanic.loc[titanic["Pclass"].isin([2, 3])]
class_23.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S


In [34]:
# an equivalent way to write it
class_23 = titanic.loc[(titanic["Pclass"] == 2) | (titanic["Pclass"] == 3)]
class_23.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S


- Filters can be combined when indexing
- Rows can be filtered by conditions
- `np.nan != np.nan`

In [35]:
# you cannot filter with == np.nan!
np.nan == np.nan

False

In [36]:
# but this works
age_no_na = titanic.loc[titanic["Age"].notna()]
age_no_na.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [37]:
age_no_na.shape

(714, 12)
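The points about combining filters and about `np.nan` can be sketched on a small made-up frame (the column names mirror the Titanic data, the values are invented):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 54.0],
    "Pclass": [3, 1, 3, 1],
})

# Each condition goes in parentheses; combine with & (and), | (or), ~ (not).
first_class_over_35 = df.loc[(df["Age"] > 35) & (df["Pclass"] == 1)]

# Comparing with == np.nan matches nothing, because NaN is not equal to itself.
eq_nan = df.loc[df["Age"] == np.nan]

# isna() / notna() are the way to find or exclude missing values.
missing = df.loc[df["Age"].isna()]

print(first_class_over_35)
print(len(eq_nan), len(missing))
```

The parentheses matter: `&` and `|` bind more tightly than comparison operators in Python, so omitting them changes the meaning of the expression or raises an error.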

![filter2](https://storage.yandexcloud.net/klms-public/production/learning-content/457/4167/37250/103264/490411/filter2.png)

In [38]:
# select a column after filtering
adult_names = titanic.loc[titanic["Age"] > 35, "Name"]
adult_names.head()

1     Cumings, Mrs. John Bradley (Florence Briggs Th...
6                               McCarthy, Mr. Timothy J
11                             Bonnell, Miss. Elizabeth
13                          Andersson, Mr. Anders Johan
15                     Hewlett, Mrs. (Mary D Kingcome) 
Name: Name, dtype: object

In [39]:
# values can be assigned to slices
titanic.iloc[0:3, 3] = "anonymous"
titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,anonymous,male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,anonymous,female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,anonymous,female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [40]:
# values can also be assigned to a selection made by condition
titanic.loc[titanic.Name == 'Allen, Mr. William Henry', 'Age'] = 25
titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,anonymous,male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,anonymous,female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,anonymous,female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,25.0,0,0,373450,8.05,,S


In [41]:
# more parameters when reading data
air_quality = pd.read_csv("air_quality_no2.csv", index_col=0, parse_dates=True)

In [42]:
air_quality.head()

Unnamed: 0_level_0,station_antwerp,station_paris,station_london
datetime,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2019-05-07 02:00:00,,,23.0
2019-05-07 03:00:00,50.5,25.0,19.0
2019-05-07 04:00:00,45.0,27.7,19.0
2019-05-07 05:00:00,,50.4,16.0
2019-05-07 06:00:00,,61.9,


In [43]:
# a new column from an existing one
air_quality["london_mg_per_cubic"] = air_quality["station_london"] * 1.882

In [44]:
# a new column from several existing ones
air_quality["ratio_paris_antwerp"] = air_quality["station_paris"] / air_quality["station_antwerp"]

In [45]:
air_quality.head()

Unnamed: 0_level_0,station_antwerp,station_paris,station_london,london_mg_per_cubic,ratio_paris_antwerp
datetime,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2019-05-07 02:00:00,,,23.0,43.286,
2019-05-07 03:00:00,50.5,25.0,19.0,35.758,0.49505
2019-05-07 04:00:00,45.0,27.7,19.0,35.758,0.615556
2019-05-07 05:00:00,,50.4,16.0,30.112,
2019-05-07 06:00:00,,61.9,,,


In [46]:
# renaming columns
air_quality_renamed = air_quality.rename(
    columns={"station_antwerp": "BETR801",
             "station_paris": "FR04014",
             "station_london": "London Westminster"})
air_quality_renamed.head()

Unnamed: 0_level_0,BETR801,FR04014,London Westminster,london_mg_per_cubic,ratio_paris_antwerp
datetime,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2019-05-07 02:00:00,,,23.0,43.286,
2019-05-07 03:00:00,50.5,25.0,19.0,35.758,0.49505
2019-05-07 04:00:00,45.0,27.7,19.0,35.758,0.615556
2019-05-07 05:00:00,,50.4,16.0,30.112,
2019-05-07 06:00:00,,61.9,,,


#### Aggregate statistics

- Number of elements, mean, max/min, etc.
- Computed:
  - By column
  - Within groups


![agg](https://storage.yandexcloud.net/klms-public/production/learning-content/10/56/404/1098/5294/6_agg_no_group.png)

In [47]:
titanic["Age"].mean()

29.685112044817924

In [48]:
titanic[["Age", "Fare"]].median()

Age     28.0000
Fare    14.4542
dtype: float64

In [49]:
# if we wanted the sum of the ages of all passengers
titanic["Age"].sum()

21195.17

In [50]:
# agg: a dedicated method for computing several statistics at once
titanic.agg({'Age': ['min', 'max', 'median', 'skew'],
             'Fare': ['min', 'max', 'median', 'mean']})

Unnamed: 0,Age,Fare
max,80.0,512.3292
mean,,32.204208
median,28.0,14.4542
min,0.42,0.0
skew,0.391916,


![agg](https://storage.yandexcloud.net/klms-public/production/learning-content/10/56/404/1098/5294/6_agg_group.png)

![groupby](https://storage.yandexcloud.net/klms-public/production/learning-content/457/4167/37250/103264/490412/groupby.png)

In [51]:
titanic[["Sex", "Age"]].groupby("Sex").mean()

Unnamed: 0_level_0,Age
Sex,Unnamed: 1_level_1
female,27.915709
male,30.70457


In [52]:
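# numeric_only=True is needed in pandas 2.0+, where text columns are no longer dropped silently
titanic.groupby("Sex").mean(numeric_only=True)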

Unnamed: 0_level_0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
Sex,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
female,431.028662,0.742038,2.159236,27.915709,0.694268,0.649682,44.479818
male,454.147314,0.188908,2.389948,30.70457,0.429809,0.235702,25.523893


In [53]:
titanic.groupby(["Sex", "Pclass"], as_index=False)["Fare"].mean()

Unnamed: 0,Sex,Pclass,Fare
0,female,1,106.125798
1,female,2,21.970121
2,female,3,16.11881
3,male,1,67.226127
4,male,2,19.741782
5,male,3,12.661633


In [54]:
titanic["Pclass"].value_counts()

3    491
1    216
2    184
Name: Pclass, dtype: int64

Iteration and loops with grouping  
Prefer `.groupby`; a plain `for` loop over rows is too slow.
If you do have to iterate:
- `itertuples` is the fastest
- `iterrows`
- `enumerate`
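A minimal sketch of the options listed above, on invented data. It only shows the mechanics of each approach; no timings are claimed here.

```python
import pandas as pd

# Hypothetical data with the same column names as the Titanic table.
df = pd.DataFrame({"Pclass": [1, 2, 3, 3],
                   "Fare": [71.3, 13.0, 7.25, 8.05]})

# itertuples yields lightweight namedtuples; columns are attributes.
total_tuples = 0.0
for row in df.itertuples(index=False):
    total_tuples += row.Fare

# iterrows yields (index, Series) pairs; a Series is built for every row.
total_rows = 0.0
for idx, row in df.iterrows():
    total_rows += row["Fare"]

# The vectorized equivalent avoids the Python loop entirely.
total_vectorized = df["Fare"].sum()

# Per-group results come from groupby rather than a manual loop.
fare_by_class = df.groupby("Pclass")["Fare"].sum()
print(fare_by_class)
```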


In [55]:
for cls, data in titanic.groupby("Pclass"):
    print(cls, len(data))

1 216
2 184
3 491


### Sorting

In [56]:
titanic.sort_values(by="Age").head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
803,804,1,3,"Thomas, Master. Assad Alexander",male,0.42,0,1,2625,8.5167,,C
755,756,1,2,"Hamalainen, Master. Viljo",male,0.67,1,1,250649,14.5,,S
644,645,1,3,"Baclini, Miss. Eugenie",female,0.75,2,1,2666,19.2583,,C
469,470,1,3,"Baclini, Miss. Helene Barbara",female,0.75,2,1,2666,19.2583,,C
831,832,1,2,"Richards, Master. George Sibley",male,0.83,1,1,29106,18.75,,S


In [57]:
titanic.sort_values(by=['Pclass', 'Age'], ascending=False).head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
851,852,0,3,"Svensson, Mr. Johan",male,74.0,0,0,347060,7.775,,S
116,117,0,3,"Connors, Mr. Patrick",male,70.5,0,0,370369,7.75,,Q
280,281,0,3,"Duane, Mr. Frank",male,65.0,0,0,336439,7.75,,Q
483,484,1,3,"Turkula, Mrs. (Hedwig)",female,63.0,0,0,4134,9.5875,,S
326,327,0,3,"Nysveen, Mr. Johan Hansen",male,61.0,0,0,345364,6.2375,,S


In [58]:
titanic.sort_values(by=['Pclass', 'Age'], ascending=[True, False]).head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
630,631,1,1,"Barkworth, Mr. Algernon Henry Wilson",male,80.0,0,0,27042,30.0,A23,S
96,97,0,1,"Goldschmidt, Mr. George B",male,71.0,0,0,PC 17754,34.6542,A5,C
493,494,0,1,"Artagaveytia, Mr. Ramon",male,71.0,0,0,PC 17609,49.5042,,C
745,746,0,1,"Crosby, Capt. Edward Gifford",male,70.0,1,1,WE/P 5735,71.0,B22,S
54,55,0,1,"Ostby, Mr. Engelhart Cornelius",male,65.0,0,1,113509,61.9792,B30,C


Dates
- When read from a file, dates are `object` unless `parse_dates` is passed
- `datetime.date != pd.Timestamp`
- `np.datetime64`:
  - `np.datetime64('2005-02-25')`
  - `np.timedelta64(4, 'h')`
- `pd.DatetimeIndex` is an array of `np.datetime64` values; individual elements are returned as `pd.Timestamp`
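The relationships between these types can be sketched as follows. The timestamps are arbitrary example values.

```python
import datetime

import numpy as np
import pandas as pd

# numpy date arithmetic: a date plus a 4-hour timedelta.
day = np.datetime64("2005-02-25")
later = day + np.timedelta64(4, "h")

# A DatetimeIndex stores datetime64 values under the hood...
idx = pd.DatetimeIndex(["2019-05-07 02:00", "2019-05-07 03:00"])
raw_values = idx.to_numpy()

# ...but a single element comes back as a pd.Timestamp.
first = idx[0]

# To compare with a datetime.date, convert explicitly.
same_day = first.date() == datetime.date(2019, 5, 7)

print(later, type(first).__name__, same_day)
```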


In [59]:
air_quality = pd.read_csv("air_quality_long.csv",
                          index_col="date.utc",
                          parse_dates=True)

In [60]:
air_quality.head()

Unnamed: 0_level_0,city,country,location,parameter,value,unit
date.utc,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2019-06-18 06:00:00+00:00,Antwerpen,BE,BETR801,pm25,18.0,µg/m³
2019-06-17 08:00:00+00:00,Antwerpen,BE,BETR801,pm25,6.5,µg/m³
2019-06-17 07:00:00+00:00,Antwerpen,BE,BETR801,pm25,18.5,µg/m³
2019-06-17 06:00:00+00:00,Antwerpen,BE,BETR801,pm25,16.0,µg/m³
2019-06-17 05:00:00+00:00,Antwerpen,BE,BETR801,pm25,7.5,µg/m³


In [61]:
no2 = air_quality.loc[air_quality["parameter"] == "no2"]

In [62]:
no2_subset = no2.sort_index().groupby(["location"]).head(2)
no2_subset

Unnamed: 0_level_0,city,country,location,parameter,value,unit
date.utc,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2019-04-09 01:00:00+00:00,Antwerpen,BE,BETR801,no2,22.5,µg/m³
2019-04-09 01:00:00+00:00,Paris,FR,FR04014,no2,24.4,µg/m³
2019-04-09 02:00:00+00:00,London,GB,London Westminster,no2,67.0,µg/m³
2019-04-09 02:00:00+00:00,Antwerpen,BE,BETR801,no2,53.5,µg/m³
2019-04-09 02:00:00+00:00,Paris,FR,FR04014,no2,27.4,µg/m³
2019-04-09 03:00:00+00:00,London,GB,London Westminster,no2,67.0,µg/m³


### datetime, pandas

In [63]:
pd.to_datetime(1490195805, unit='s')

Timestamp('2017-03-22 15:16:45')

In [64]:
pd.to_datetime(1490195805433502912, unit='ns')

Timestamp('2017-03-22 15:16:45.433502912')

In [65]:
df = pd.DataFrame({'year': [2015, 2016],
                   'month': [2, 3],
                   'day': [4, 5]})
df

Unnamed: 0,year,month,day
0,2015,2,4
1,2016,3,5


In [66]:
pd.to_datetime(df)

0   2015-02-04
1   2016-03-05
dtype: datetime64[ns]

In [67]:
s = pd.Series(['3/11/2000', '3/12/2000', '3/13/2000'] * 1000)

In [68]:
# compare the speed with infer_datetime_format (deprecated since pandas 2.0)
%timeit pd.to_datetime(s, infer_datetime_format=True)

1.84 ms ± 102 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [69]:
%timeit pd.to_datetime(s, format='%m/%d/%Y')

1.45 ms ± 42.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


#### Strings and objects

- When read from a file, they are `object` unless a type is given explicitly
- The accessor is `.str`
  - String functions are available through the accessor
  - It can also work with lists, dictionaries and so on, but it is better not to keep those in `pandas`
- The documentation is worth studying!

In [70]:
s = pd.DataFrame({'test_str':
                  ['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat']})

In [71]:
s

Unnamed: 0,test_str
0,A
1,B
2,C
3,Aaba
4,Baca
5,
6,CABA
7,dog
8,cat


In [72]:
s.test_str.str.lower()

0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
Name: test_str, dtype: object

In [73]:
s.test_str.str.len()

0    1.0
1    1.0
2    1.0
3    4.0
4    4.0
5    NaN
6    4.0
7    3.0
8    3.0
Name: test_str, dtype: float64
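A few more `.str` methods, as a minimal sketch on a shortened version of the same strings. Missing values propagate through the accessor, which is also why `.str.len()` above returns floats rather than integers.

```python
import numpy as np
import pandas as pd

s = pd.Series(["A", "Aaba", np.nan, "CABA", "dog"])

# Case-insensitive substring search; na=False treats missing values as "no match".
has_a = s.str.contains("a", case=False, na=False)

# A boolean mask from .str can be used for filtering like any other condition.
with_a = s.loc[has_a]

# .str also supports indexing: first character of the upper-cased string.
upper_first = s.str.upper().str[0]

print(with_a.tolist())
```

Without `na=False`, the mask would contain a missing value at position 2 and could not be used directly for filtering.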