## 2. Data Structures

In [None]:
# load pandas
import pandas as pd
import numpy as np

Series: a single column of data (one-dimensional)
<br>DataFrame: rows and columns (two-dimensional)

![image.png](attachment:image.png)
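As a rough sketch of the difference, the snippet below builds one of each and checks its dimensions. The name `df_example` and the `year` column are made up for illustration and are not part of the notebook's data.

```python
import pandas as pd

# A Series is one labelled column of values.
counts = pd.Series([145, 142, 38, 13], name='counts')

# A DataFrame is a set of columns that share one index.
# The 'year' values are invented for this example.
df_example = pd.DataFrame({
    'counts': [145, 142, 38, 13],
    'year': [2015, 2016, 2017, 2018],
})

print(counts.ndim)       # 1
print(df_example.ndim)   # 2
print(df_example.shape)  # (4, 2)

# Selecting a single column of a DataFrame gives back a Series.
print(type(df_example['counts']))
```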

### 2.1 Series

In [None]:
# create a series
songs2 = pd.Series([145, 142, 38, 13], name='counts')
songs2

0    145
1    142
2     38
3     13
Name: counts, dtype: int64

In [None]:
print(songs2.values)
print(songs2.index)
print(songs2.name)

[145 142  38  13]
RangeIndex(start=0, stop=4, step=1)
counts


- A Series can hold strings, floats, booleans, and other types
- Series allow vectorized operations
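A minimal sketch of both points, reusing the counts from the cell above. The `names` values are placeholders chosen for illustration.

```python
import pandas as pd

songs = pd.Series([145, 142, 38, 13], name='counts')

# Arithmetic applies to every element at once, without a Python loop.
doubled = songs * 2
print(doubled.tolist())   # [290, 284, 76, 26]

# Comparisons are vectorized too and produce a boolean Series.
is_large = songs > 100
print(is_large.tolist())  # [True, True, False, False]

# A Series can also hold strings; .str applies string methods element-wise.
names = pd.Series(['Paul', 'John', 'George', 'Ringo'])
print(names.str.upper().tolist())
```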

NaN = Not a Number
- NaN values are skipped by default in aggregations such as .sum(), .mean() and .count(); in element-wise arithmetic they carry through to the result
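A short sketch of how missing values behave: aggregations skip NaN by default, while element-wise arithmetic keeps it in place. The Series `nan_demo` is a made-up example.

```python
import numpy as np
import pandas as pd

nan_demo = pd.Series([2, np.nan, 4])

# Aggregations skip NaN by default (skipna=True).
print(nan_demo.sum())    # 6.0
print(nan_demo.mean())   # 3.0

# With skipna=False the missing value makes the result NaN.
print(nan_demo.sum(skipna=False))

# Element-wise arithmetic leaves NaN where it was.
shifted = nan_demo + 1
print(shifted.tolist())

# isna() marks the missing entries.
print(nan_demo.isna().tolist())  # [False, True, False]
```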

In [None]:
nan_series = pd.Series([2, np.nan], index=['Ono', 'Clapton'])
nan_series

Ono        2.0
Clapton    NaN
dtype: float64

In [None]:
nan_series.count() # NaN value is not counted

1

In [None]:
nan_series.shape

(2,)

In [None]:
# use .astype('Int64') to convert a Series to the nullable integer type, which can hold missing values
nan_series.astype('Int64')

Ono           2
Clapton    <NA>
dtype: Int64
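To see why the capital-I `'Int64'` matters, the sketch below tries the plain NumPy `'int64'` cast first. The flag `cast_failed` is a helper name introduced here for illustration.

```python
import numpy as np
import pandas as pd

nan_series = pd.Series([2, np.nan], index=['Ono', 'Clapton'])

# NumPy's int64 cannot represent a missing value, so this cast raises.
try:
    nan_series.astype('int64')
    cast_failed = False
except ValueError:
    cast_failed = True
print(cast_failed)

# The nullable 'Int64' dtype stores the gap as pd.NA instead.
nullable = nan_series.astype('Int64')
print(nullable.isna().tolist())   # [False, True]
print(nullable.sum())             # 2
```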

A pandas Series behaves much like a NumPy array

In [None]:
numpy_ser = np.array([1, 3, 5, 7])
pd_ser = pd.Series([1, 3, 5, 7])

print(numpy_ser.mean())
print(pd_ser.mean())

4.0
4.0


Creating a boolean mask for filtering a series

In [None]:
mask = pd_ser > pd_ser.median()
mask

0    False
1    False
2     True
3     True
dtype: bool

In [None]:
pd_ser[mask] # keeps only the rows where the mask is True

2    5
3    7
dtype: int64
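Masks can also be combined. The sketch below uses the same values as `pd_ser`; the result names are chosen for illustration.

```python
import pandas as pd

pd_ser = pd.Series([1, 3, 5, 7])

# Combine conditions with & (and) and | (or); wrap each one in parentheses.
middle = pd_ser[(pd_ser > 1) & (pd_ser < 7)]
print(middle.tolist())    # [3, 5]

ends = pd_ser[(pd_ser == 1) | (pd_ser == 7)]
print(ends.tolist())      # [1, 7]

# ~ inverts a mask.
not_above_median = pd_ser[~(pd_ser > pd_ser.median())]
print(not_above_median.tolist())  # [1, 3]
```

The parentheses are needed because `&` and `|` bind more tightly than the comparison operators in Python.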

Creating a categorical data type

In [None]:
s = pd.Series(['m', 'l', 'xs', 's', 'xl'], dtype='category') # pass category as datatype
s

0     m
1     l
2    xs
3     s
4    xl
dtype: category
Categories (5, object): ['l', 'm', 's', 'xl', 'xs']

In [None]:
# check if category is ordered
print(s.cat.ordered)

# convert to an ordered category
# note: values not listed in categories ('xs', 'xl') become NaN
s2 = pd.Series(['m', 'l', 'xs', 's', 'xl'])
size_type = pd.api.types.CategoricalDtype(
    categories=['s','m','l'], ordered=True
)

s3 = s2.astype(size_type)
print(s3)
print('---')

# create bool mask
print(s3 > 's')

False
0      m
1      l
2    NaN
3      s
4    NaN
dtype: category
Categories (3, object): ['s' < 'm' < 'l']
---
0     True
1     True
2    False
3    False
4    False
dtype: bool
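In the output above, 'xs' and 'xl' turn into NaN because they are not in the category list. A possible variant, assuming all five sizes should be kept, lists every label in order. The names `full_size_type` and `sizes_ordered` are introduced here for illustration.

```python
import pandas as pd

sizes = pd.Series(['m', 'l', 'xs', 's', 'xl'])

# List every label, from smallest to largest, so nothing becomes NaN.
full_size_type = pd.api.types.CategoricalDtype(
    categories=['xs', 's', 'm', 'l', 'xl'], ordered=True
)
sizes_ordered = sizes.astype(full_size_type)

print(sizes_ordered.isna().sum())            # 0

# Comparisons follow the category order, not alphabetical order.
print((sizes_ordered > 's').tolist())        # [True, True, False, False, True]

# Sorting also follows the category order.
print(sizes_ordered.sort_values().tolist())  # ['xs', 's', 'm', 'l', 'xl']
```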


Exercise: create a Series with temperature values for the last 7 days, and keep only the values above the mean

In [None]:
temp = pd.Series([20,21,22,21,20,25,23], name='temp')

mask = temp > temp.mean()

temp[mask]

2    22
5    25
6    23
Name: temp, dtype: int64

### 2.2 Series Deep Dive

In [None]:
url = 'https://github.com/mattharrison/datasets/raw/master/data/vehicles.csv.zip'
df = pd.read_csv(url)

# select the city and highway mpg columns as Series
city_mpg = df.city08
highway_mpg = df.highway08



In [None]:
print(city_mpg[:5], city_mpg.shape)
print(highway_mpg[:5], highway_mpg.shape)

0    19
1     9
2    23
3    10
4    17
Name: city08, dtype: int64 (41144,)
0    25
1    14
2    33
3    12
4    23
Name: highway08, dtype: int64 (41144,)


In [None]:
# count the attributes and methods available on a Series
len(dir(city_mpg))

418
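The exact count depends on the installed pandas version. A small sketch for splitting the names into public and dunder groups, using a short stand-in Series (`ser`) rather than the downloaded column:

```python
import pandas as pd

# Stand-in for city_mpg, built from the first five values printed above.
ser = pd.Series([19, 9, 23, 10, 17], name='city08')

all_names = dir(ser)

# Public names do not start with an underscore.
public = [name for name in all_names if not name.startswith('_')]

# Dunder names start and end with two underscores.
dunder = [name for name in all_names
          if name.startswith('__') and name.endswith('__')]

print(len(all_names), len(public), len(dunder))
print('mean' in public, '__add__' in dunder)
```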

Series attributes and methods fall into these groups:

- dunder: `__add__`, `__iter__`
- aggregate: .mean, .max
- conversion: .to_* (for example .to_numpy, .to_frame)
- manipulation: .sort_values, .drop_duplicates
- index & accessor: .loc, .iloc
- string: .str
- date: .dt
- plotting: .plot
- categorical: .cat
- transformation: .unstack, .transform, .reset_index
- attributes: .index, .dtype
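The sketch below calls one or two members from several of these groups. The first five values of `mpg` are copied from the `city08` output earlier; the sixth is added only to create a duplicate, and `dates` and `sizes` are made-up examples.

```python
import pandas as pd

mpg = pd.Series([19, 9, 23, 10, 17, 19], name='city08')

# dunder: mpg.__add__(1) is what mpg + 1 calls
print(mpg.__add__(1).tolist())

# aggregate
print(mpg.mean(), mpg.max())

# conversion
print(mpg.to_list())

# manipulation
print(mpg.sort_values().tolist())
print(mpg.drop_duplicates().tolist())

# index and accessor: label-based and position-based lookup
print(mpg.loc[0], mpg.iloc[-1])

# string accessor
print(mpg.astype(str).str.zfill(3).tolist())

# date accessor
dates = pd.Series(pd.to_datetime(['2020-01-15', '2021-06-30']))
print(dates.dt.year.tolist())

# categorical accessor
sizes = pd.Series(['s', 'm', 's'], dtype='category')
print(sizes.cat.categories.tolist())

# attributes
print(mpg.index, mpg.dtype)
```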